Null Hypothesis Significance Testing (NHST) Must Be Discarded or Downgraded
When a scientist describes the results of his or her study as “statistically significant, p < .05,” that scientist is almost certainly using Null Hypothesis Significance Testing (NHST, often referred to simply as “significance testing” or “the p-value”). Describing exactly what NHST is would extend this brief introduction into a lengthy paper, but it is sufficient to say that the vast majority of statisticians today regard NHST as deeply flawed. In fact, the flagship journal of the American Statistical Association recently published a special issue devoted to criticisms of and alternatives to NHST. Ronald Wasserstein’s summary of the papers in this special issue can be read online, and perhaps its most important recommendation is that researchers stop using the words “statistically significant” altogether. Similarly, a group of over 800 scientists recently endorsed a letter published in the journal Nature calling for the concept of “statistical significance” to be abandoned. These publications are just two examples among many in recent years in which scholars have exposed the flaws in significance testing and called for replacement methods.
Observation Oriented Modeling represents one such alternative to NHST. In the briefest possible terms, NHST is built around the goal of estimating population parameters from sample statistics (e.g., when a pollster attempts to predict the winner of an election from a sample of voters). OOM, by comparison, is built around the goal of creating explanatory models of natural systems, including human acts, powers, and habits (e.g., when early scientists built models to explain blood circulation). Moreover, NHST centers on the p-value, a probability whose computation often relies upon unrealistic and unwarranted assumptions about one’s data. The central question in significance testing is, “Given a host of assumptions such as random sampling, normal population distributions, homogeneous population variances, etc., what is the probability of obtaining a test statistic (e.g., a t-value) as extreme as or more extreme than the one observed?” The central question in OOM, by contrast, concerns the accuracy of one’s theoretical model: “How many observations in my study are consistent with my theoretical explanation?” A probability statistic called a “chance-value” (or c-value) can be derived in OOM, but it is secondary and free of the distributional assumptions listed above.
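In essence, the c-value asks how often randomly re-paired observations would fit the model as well as the actual observations do. The following is a minimal sketch of that idea in Python, assuming a simple two-group design; the function names, toy data, and overall setup are ours for illustration and should not be mistaken for the OOM software’s actual implementation.

```python
import random

def percent_correct(groups, responses, expected):
    """Percent of persons whose observed response matches the model's
    expectation for their group (a percent-correct-classification index)."""
    hits = sum(1 for g, r in zip(groups, responses) if r == expected[g])
    return 100.0 * hits / len(responses)

def c_value(groups, responses, expected, trials=10_000, seed=1):
    """Chance-value: the proportion of random re-pairings of responses to
    persons that classify as well as or better than the observed data."""
    rng = random.Random(seed)
    observed = percent_correct(groups, responses, expected)
    shuffled = list(responses)
    as_good = 0
    for _ in range(trials):
        rng.shuffle(shuffled)  # randomly re-pair responses with persons
        if percent_correct(groups, shuffled, expected) >= observed:
            as_good += 1
    return observed, as_good / trials

# Toy data: the model expects believers to pray and nonbelievers not to.
groups    = ["believer"] * 6 + ["nonbeliever"] * 6
responses = ["prays"] * 5 + ["does not"] + ["does not"] * 5 + ["prays"]
expected  = {"believer": "prays", "nonbeliever": "does not"}
pcc, c = c_value(groups, responses, expected)
print(f"Observations consistent with the model: {pcc:.1f}%  (c = {c:.3f})")
```

Note that nothing in this computation appeals to random sampling, normality, or homogeneous variances; the c-value is simply the proportion of chance re-pairings that fit the model as well as the actual observations did.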
Criticisms of NHST, incidentally, extend back to the mid-1900s. One of the earliest and most famous can be found in David Bakan’s 1968 book On Method: Toward a Reconstruction of Psychological Investigation. Two other classics are Paul Meehl’s paper Theoretical Risks and Tabular Asterisks: Sir Karl, Sir Ronald, and the Slow Progress of Soft Psychology, in which he states that significance testing is “one of the worst things that ever happened in the history of psychology” (p. 817), and David Lykken’s book chapter What’s Wrong with Psychology Anyway? The most erudite recent summary of the many flaws in NHST can be found in Raymond Hubbard’s book Corrupt Research: The Case for Reconceptualizing Empirical Management and Social Science. Finally, NHST is convoluted and confusing, so much so that studies have shown that large majorities of researchers, and even majorities of statistics professors, cannot properly define the p-value, nor can they accurately describe what it tells them about their data. Consequently, when a researcher declares his or her findings “statistically significant,” odds are the researcher does not know what the declaration means. Gerd Gigerenzer’s paper Mindless Statistics summarizes some of these sobering facts. Clearly, NHST (significance testing, the common p-value) must go, or at least be downgraded for use in highly specific (esoteric) circumstances.
Psychologists Do Not Possess Any Rulers (Units of Measure)
Suppose a psychologist wishes to measure belief in God with the following item:
Using the scale below, please indicate the extent to which you agree with the following statement: “I believe in the existence of God”
not at all 0 1 2 3 4 5 6 completely
Has the psychologist truly measured “belief in God” as a quality possessed in varying degrees by different persons? The key word to focus on when attempting to answer this question is of course “measured.” In the simplest terms, by using a 0 to 6 scale the psychologist presumes to be working with a crude unit of measure, such that a person scoring 6 possesses twice the amount of belief as a person scoring 3. Or, perhaps less stringently, the psychologist believes that the difference between, say, scores of 4 and 2 represents an amount of belief in God equivalent to the difference between scores of 6 and 4 (i.e., 4 − 2 = 6 − 4 = 2). The problem is that there is absolutely no scientific evidence that this 0 to 6 scale represents a unit of measure of belief in God. Writing more items (e.g., “The likelihood that God exists is quite high.”) and then summing or averaging the ratings does nothing to change this fact.
Going beyond this simple example, we find that psychologists do not have established units of measure for the qualities they find most interesting, such as intelligence, personality traits, depression, anxiety, attitudes, and self-esteem. Psychologists of course have numerous inventories, tests, and questionnaires that can be completed and scored, producing numbers that apparently represent levels of intelligence, introversion, depression, and so on. None of these scale scores, however, represents an actual unit of measure for the quality being investigated. Think of it this way: there are dozens of personality questionnaires for measuring your level of introversion/extraversion, and they all yield different ranges of scores. Moreover, you may score as an extravert on one questionnaire and as an introvert on another. If psychologists had units of measure for introversion/extraversion, this troubling outcome would be rare and clearly attributable to measurement error; yet this sort of inconsistency is common. Imagine, likewise, physicists attempting to study temperature, weight, height, gravitational pull, mass, and acceleration with such incongruous measurement methods. In such a world, physics would simply not have advanced to where it is today. What we are discussing here is the problem of measurement, and unfortunately most psychologists have chosen simply to ignore it.
If we regard “belief in God” as a quality that a person may or may not possess, exactly how is that quality present within a given person? Is it actually structured like a ruler; that is, like a continuum? If we could “open” a person’s mind or brain and find “belief in God,” would it look like a ruler? There is no good reason to hold that it is structured in this way. Rather, it is more reasonable to treat “belief in God” as a non-quantitative quality. We might ask a person whether or not he or she believes in God and code the response as “yes” or “no.” We might also ask the person to qualify the response with terms such as “certain,” “uncertain,” “strong,” or “weak.” These qualifiers themselves can be treated as non-quantitative and coded accordingly, or they can perhaps be treated as expressions of ordinal relations (e.g., a person judges his belief in God as stronger today than yesterday). However, none of this presupposes that “belief in God” is a continuous quality for which units of measure can be established. This is an extremely important point because if we are to study some quality of human nature successfully, we must have an accurate understanding of how it is structured. If, for instance, a quality is not continuously structured and we use methods of data analysis such as t-tests, ANOVAs, or Structural Equation Models that presuppose continuity, then the results from these analyses will necessarily be inaccurate to some degree. With such inaccuracies built into our research methods, how are we ever to arrive at a truly scientific understanding of a quality like “belief in God?”
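To make these coding options concrete, here is a minimal sketch in Python; the category labels, names, and the assumed ordering of the qualifiers are illustrative choices on our part, not prescriptions from OOM.

```python
# Non-quantitative coding: plain categories, with no units of measure implied.
responses = {"Ann": "yes", "Ben": "no", "Carl": "yes"}
believers = sum(1 for r in responses.values() if r == "yes")
print(f"{believers} of {len(responses)} persons report belief in God.")

# Ordinal coding: an assumed ordering of qualifiers permits
# "stronger/weaker than" comparisons, but differences between the
# codes are not amounts of belief.
ORDER = ["weak", "uncertain", "strong", "certain"]  # assumed ordering
today, yesterday = ORDER.index("strong"), ORDER.index("uncertain")
assert today > yesterday  # "stronger today than yesterday" is meaningful
# (today - yesterday), by contrast, is NOT a measured quantity of belief
```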
Paul Barrett (2003) summarizes this measurement crisis, arguing that psychology can only move forward if it (1) demonstrates the continuous quantitative structure of its cherished attributes, (2) develops non-quantitative techniques, or (3) behaves in a more brutally honest fashion regarding the serious limitations of current methods, treating them as, at best, crude approximations of attributes. The methods in the Observation Oriented Modeling software follow Barrett’s second option. In short, a researcher can use the OOM software without making the unrealistic assumption that the qualities being studied are continuous; in other words, the OOM software can be used without established units of measure (rulers). Without presumed continuous qualities, the act of measurement becomes a matter of counting instances of some behavior or phenomenon indicative of the quality. For example, the number of individuals who believe in God can be tallied, or the number of times each day a person performs some act relevant to a belief in God (e.g., praying) can be tallied. Unambiguous quantitative statements about these frequencies can then be made, such as “Robert prayed 5 times on Sunday and only 2 times on Saturday.” Such statements clearly do not rest on the presupposition of units of measure. Simple and sophisticated patterns of joint observation frequencies can moreover be examined with the OOM software (see the sketch below), thus allowing for the modeling of more complex aspects of human behavior. The end result is that an OOM analysis provides a more accurate reflection of reality than analyses of data artificially forced into an apparent continuous quantitative structure, as with the “belief in God” ratings above.
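As a rough illustration of measurement-as-counting, the sketch below tallies acts and cross-classifies joint observations in Python; the data and variable names are invented for the example and are not drawn from the OOM software.

```python
from collections import Counter

# Counting acts yields unambiguous quantities (occurrences); no ruler needed.
prayer_tallies = Counter({"Saturday": 2, "Sunday": 5})
print(f"Robert prayed {prayer_tallies['Sunday']} times on Sunday "
      f"and only {prayer_tallies['Saturday']} times on Saturday.")

# Joint observation frequencies: cross-classifying a stated belief with an
# observed act gives the raw material for the pattern analyses noted above.
records = [("yes", "prays"), ("yes", "prays"), ("yes", "does not pray"),
           ("no", "does not pray"), ("no", "does not pray")]
joint = Counter(records)
for (belief, act), n in sorted(joint.items()):
    print(f"belief={belief:<4} act={act:<14} count={n}")
```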
Statistical Methods Have Grown Unnecessarily Complex
Compared to traditional and modern methods of data analysis, the techniques employed in the OOM software are simple to use, easy to understand, and intuitively compelling. We regard this parsimony as a strength, particularly in light of David A. Freedman’s classic paper Statistical Models and Shoe Leather. William Mason, writing in the same 1991 issue of Sociological Methodology (vol. 21), summarized several of Freedman’s arguments as follows:
- Simple [statistical/data analytic] tools should be used extensively. More complex tools should be used rarely, if at all. Thus, we should be doing more graphical analyses and computing fewer regressions, correlations, survival models, structural equation models, and so on.
- Virtually all social science modeling efforts (and here I include social science experiments, though I’m not sure Freedman would) fail to satisfy reasonable criteria for justification of the stochastic assumptions. (p. 338)
Much to the dismay of social scientists, Freedman adopted, during the latter part of his career, a critical view toward the use of linear modeling techniques such as regression, path analysis, and structural equation modeling in sociology, psychology, education, and other disciplines. Consistent with Freedman’s arguments, Observation Oriented Modeling is a relatively simple analytic tool that permits researchers to engage their data using compelling graphical techniques and assumption-free methods of analysis. In some ways, the methods of OOM satisfy the