Null Hypothesis Significance Testing (NHST) must be Discarded or Downgraded
When a scientist describes the results of his or her study as “statistically significant, p < .05,” that scientist is likely using Null Hypothesis Significance Testing (NHST, or what is referred to as “significance testing” or the common “p-value”). Describing exactly what NHST is would extend this brief introduction into a lengthy paper, but it is sufficient to say that the vast majority of statisticians today regard NHST as deeply flawed. In fact, the flagship journal of the American Statistical Association recently published a special issue devoted to criticisms of and alternatives to NHST. Ronald Wasserstein’s summary of the papers in this special issue can be read online, and perhaps the most important recommendation is that researchers stop using the words “statistically significant” altogether. Similarly, a group of over 800 scientists recently endorsed a letter published in the scientific journal Nature, calling for a ban on “statistical significance.” These recent publications are just two examples among many in recent years in which scholars have exposed the flaws in significance testing and called for replacement methods.
Observation Oriented Modeling represents one such alternative to NHST. In the briefest possible terms, NHST is built around the goal of estimating population parameters from sample statistics (e.g., when a pollster attempts to predict the winner of an election from a sample of voters). OOM, by comparison, is built around the goal of creating explanatory models of natural systems, including human acts, powers, and habits (e.g., when early scientists built models to explain blood circulation). Moreover, NHST is centered around the common p-value, which is a probability whose determination often relies upon unrealistic and unwarranted assumptions regarding one’s data. The central question with significance testing is “Given a host of assumptions (many of which are unrealistic and unwarranted), such as random sampling, normal population distributions, homogeneous population distributions, etc., what is the probability of obtaining a test statistic (e.g., a t-value) as extreme or more extreme than the one observed?” OOM, by comparison, is built around determining the accuracy of one’s theoretical model. The central question in OOM is “how many observations in my study are consistent with my theoretical explanation?” A probability statistic, called a “chance-value” (or c-value), can be derived in OOM, but it is secondary and free of assumptions.
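The chance-value logic just described can be sketched as a simple randomization procedure. The example below is a hypothetical illustration, not the OOM software's actual implementation; the data, the group labels, and the function names are invented. The idea: compute the percentage of observations classified correctly by the predicted pattern (the PCC), then shuffle the observations many times and report the proportion of shuffles that match or exceed the observed PCC.

```python
import random

def pcc(groups, responses, expected):
    """Percent Correct Classifications: share of persons whose
    response matches the response predicted for their group."""
    hits = sum(1 for g, r in zip(groups, responses) if expected[g] == r)
    return 100.0 * hits / len(groups)

def c_value(groups, responses, expected, trials=1000, seed=1):
    """Chance-value: proportion of random shuffles of the group
    labels whose PCC is at least as high as the observed PCC.
    No distributional assumptions (normality, etc.) are involved."""
    rng = random.Random(seed)
    observed = pcc(groups, responses, expected)
    labels = list(groups)
    at_least_as_good = 0
    for _ in range(trials):
        rng.shuffle(labels)
        if pcc(labels, responses, expected) >= observed:
            at_least_as_good += 1
    return observed, at_least_as_good / trials

# Hypothetical study: theory predicts group A answers "yes", group B "no"
groups = ["A"] * 10 + ["B"] * 10
responses = ["yes"] * 9 + ["no"] + ["no"] * 8 + ["yes"] * 2
expected = {"A": "yes", "B": "no"}
observed_pcc, c = c_value(groups, responses, expected)
```

Here 17 of the 20 hypothetical persons fit the predicted pattern (a PCC of 85%), and the c-value answers a secondary question: how often would shuffled data classify at least this well by chance alone?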
Incidentally, criticisms of NHST extend back to the mid-1900s. One of the earliest and most famous can be found in David Bakan’s 1968 book On Method: Toward a Reconstruction of Psychological Investigation. Two other classics are Paul Meehl’s paper Theoretical Risks and Tabular Asterisks: Sir Karl, Sir Ronald, and the Slow Progress of Soft Psychology, in which he states that significance testing is “one of the worst things that ever happened in the history of psychology” (p. 817), and David Lykken’s book chapter, What’s Wrong with Psychology Anyway? The most erudite summary today of the many flaws in NHST can be found in Raymond Hubbard’s book, Corrupt Research: The Case for Reconceptualizing Empirical Management and Social Science. Finally, NHST is convoluted and confusing, so much so that studies have shown that large majorities of researchers and majorities of statistics professors (!) cannot properly define the p-value, nor can they accurately describe exactly what it is telling them about their data. Consequently, when a researcher declares his or her findings “statistically significant,” odds are the researcher doesn’t even know what this declaration means. Gerd Gigerenzer’s Mindless Statistics paper summarizes some of these sobering facts. Clearly, NHST (significance testing or the common p-value) must go or be downgraded for use in highly specific (esoteric) circumstances.
Psychologists do not Possess any Rulers (Units of Measure)
Suppose a psychologist wishes to measure belief in God with the following item:
Using the scale below, please indicate the extent to which you agree with the following statement: “I believe in the existence of God”
not at all 0 1 2 3 4 5 6 completely
Has the psychologist truly measured “belief in God” as a quality possessed in varying degrees by different persons? The key word to focus on when attempting to answer this question is of course “measured.” In the simplest terms, by using a 0 to 6 scale, the psychologist presumes to be working with a crude unit of measure, such that a person scoring 6 possesses twice the amount of belief of a person scoring 3. Or, perhaps less stringently, the psychologist believes that the difference between, say, scores of 4 and 2 represents an amount of belief in God equivalent to the difference between 6 and 4 (i.e., 4 – 2 = 6 – 4 = 2). The problem here is that there is absolutely no scientific evidence that this 0 to 6 scale represents a unit of measure of belief in God. Writing more items (e.g., “The likelihood that God exists is quite high.”) and adding or averaging the ratings does nothing to change this fact.
Going beyond this simple example, we find that psychologists do not have established units of measure for the qualities they find most interesting, such as intelligence, personality traits, depression, anxiety, attitudes, self-esteem, etc. Psychologists of course have numerous inventories, tests, and questionnaires which can be completed and scored, producing numbers that apparently represent levels of intelligence, introversion, depression, etc. None of these scale scores, however, represents an actual unit of measure for the quality being investigated. Think of it this way: there are dozens of personality questionnaires to measure your level of introversion/extraversion, and they all yield different ranges of scores. Moreover, you may end up scoring as an extravert on one questionnaire and as an introvert on another. If psychologists had units of measure for introversion/extraversion, this troubling outcome would be rare and clearly attributable to measurement error; yet this sort of inconsistency is common. Imagine, too, physicists attempting to study temperature, weight, height, gravitational pull, mass, acceleration, etc. with such incongruous measurement methods. In such an imagined world, physics would simply not have advanced to where it is today. What we are discussing here is the problem of measurement, and unfortunately most psychologists have chosen simply to ignore it.
If we regard “belief in God” as a quality that a person may or may not possess, exactly how is that quality present within a given person? Is it actually structured like a ruler; that is, like a continuum? If we could “open” a person’s mind or brain and find “belief in God,” would it look like a ruler? There is no good reason to hold that it is structured in this way. Rather, it is more reasonable to treat “belief in God” as a non-quantitative quality. We might ask a person whether or not he or she believes in God and code the response as “yes” or “no.” We might also ask the person to qualify the response with terms such as “certain,” “uncertain,” “strong,” or “weak.” These qualifiers themselves can be treated as non-quantitative and coded accordingly, or they can perhaps be treated as expressions of ordinal relations (e.g., a person judges his belief in God as stronger today than yesterday). However, none of this presupposes that “belief in God” is a continuous quality for which units of measure can be established. This is an extremely important point because if we are to study some quality of human nature successfully, we must have an accurate understanding of how it is structured. If, for instance, a quality is not continuously structured and we use methods of data analysis such as t-tests, ANOVAs, or Structural Equation Models that presuppose continuity, then the results from these analyses will necessarily be inaccurate to some degree. With such inaccuracies built into our research methods, how are we ever to arrive at a truly scientific understanding of a quality like “belief in God?”
Paul Barrett (2003) summarizes the measurement crisis, stating that psychology can only move forward if it (1) demonstrates the continuous quantitative structure of its cherished attributes, (2) develops non-quantitative techniques, or (3) behaves in a more brutally honest fashion regarding the serious limitations of current methods, treating them as — at best — crude approximations of attributes. The methods in the Observation Oriented Modeling software follow Barrett’s second option. In short, a researcher can use the OOM software without making the unrealistic assumption that the qualities being studied are continuities. In other words, the OOM software can be used without having established units of measure (rulers). Without presumed continuous qualities, the act of measurement is a matter of counting the number of instances of some behavior or phenomenon indicative of the quality. For example, the number of individuals who believe in God can be tallied, or the number of times each day a person invokes some act that is relevant to a belief in God (e.g., praying) can be tallied. Unambiguous quantitative statements about these frequencies can then be made, such as “Robert prayed 5 times on Sunday and only 2 times on Saturday.” Such statements clearly do not rest on the presupposition of units of measure. Simple and sophisticated patterns of joint observation frequencies can moreover be examined with the OOM software, thus allowing for the modeling of more complex aspects of human behavior. The end result is that the OOM analysis will provide a more accurate reflection of reality than analyses based on data artificially forced into an apparent continuous quantitative structure like “belief in God” above.
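The counting approach described above is simple enough to sketch in a few lines of Python; the observation records and names below are hypothetical, invented for illustration:

```python
from collections import Counter

# Hypothetical observation log: one record per observed act
observations = (
    [{"person": "Robert", "day": "Saturday", "act": "praying"}] * 2
    + [{"person": "Robert", "day": "Sunday", "act": "praying"}] * 5
)

# Tally acts per (person, day): counting instances of a behavior,
# with no unit of measure of "belief in God" presupposed
tallies = Counter((o["person"], o["day"]) for o in observations)
```

Statements such as "Robert prayed 5 times on Sunday and only 2 times on Saturday" fall directly out of these tallies, and joint frequencies of this sort are exactly what the OOM pattern analyses operate on.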
Statistical Methods Have Grown Unnecessarily Complex
Compared to traditional and modern methods of data analysis, the techniques employed in the OOM software are simple to use, easy to understand, and intuitively compelling. We regard this parsimony as a strength, particularly in light of David A. Freedman’s classic paper Statistical Models and Shoe Leather. William Mason, writing in the same 1991 issue of Sociological Methodology (vol. 21), summarized several of Freedman’s arguments thusly:
- Simple [statistical/data analytic] tools should be used extensively. More complex tools should be used rarely, if at all. Thus, we should be doing more graphical analyses and computing fewer regressions, correlations, survival models, structural equation models, and so on.
- Virtually all social science modeling efforts (and here I include social science experiments, though I’m not sure Freedman would) fail to satisfy reasonable criteria for justification of the stochastic assumptions. (p. 338)
Much to the dismay of social scientists, Freedman adopted (during the latter part of his career) a critical view toward the use of linear modeling techniques such as regression, path analysis, and structural equation modeling in sociology, psychology, and education, among other disciplines. Consistent with Freedman’s arguments, Observation Oriented Modeling is a relatively simple analytic tool that permits researchers to engage their data using compelling graphical techniques and assumption-free methods of analysis. In some ways, the methods of OOM satisfy the APA’s 1999 call to reform the reporting practices of psychologists regarding their statistical analyses, which asked for more transparency in the way data and results are presented.
Persons Tragically get Lost in Average Results
Imagine designing a cockpit for military aircraft that fits the average pilot. Specifically, you measure several thousand airmen and compute their average height, weight, arm length, etc. and design your cockpit based on the resulting mean values. In fact, the U.S. Air Force used this very strategy in the 1940s to design cockpits. The outcome? Cockpits that didn’t fit anyone! Todd Rose recounts this story in his book The End of Average and uses it as a cautionary tale against relying too heavily on aggregate statistics such as means, variances, and correlations when attempting to develop an accurate understanding of reality. Modern psychologists and other life scientists, however, have come to rely too heavily on such aggregate statistics, and this unhealthy dependency may be thwarting scientific progress by hindering the development of theories which can explain the behavior of individual persons (like having cockpits that don’t fit any given person). In a recent study, Fisher, Medaglia, and Jeronimus (2018) in fact found that effects from aggregate statistics don’t necessarily apply to the individuals in a given psychological study. Lamiell (2013) has gone to great lengths to remind personality psychologists, in particular, that between-person differences or effects discovered through aggregate statistical analysis do not necessarily reflect what is happening at the level of the individual (see his most recent book: Psychology’s Misuse of Statistics and Persistent Dismissal of its Critics). The Big Five personality factors, for example, can readily be found in aggregated data, but the factors do not regularly emerge from the analysis of individual responses (Grice et al., 2006; see also, Molenaar and Campbell, 2009).
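Rose's cockpit story can be illustrated with a small simulation. This is a sketch under invented assumptions: the ten standardized body dimensions, the tolerance band, and the sample size are hypothetical, not the Air Force's actual criteria.

```python
import random

random.seed(42)
N_PILOTS, N_DIMS = 4000, 10
TOLERANCE = 0.3  # "average" means within 0.3 standard deviations of the mean

# Each simulated pilot: ten standardized, independent body dimensions
pilots = [[random.gauss(0, 1) for _ in range(N_DIMS)] for _ in range(N_PILOTS)]

def is_average(pilot, tol=TOLERANCE):
    """True only if the pilot is near the mean on EVERY dimension."""
    return all(abs(x) <= tol for x in pilot)

n_average = sum(is_average(p) for p in pilots)
fraction = n_average / N_PILOTS
```

On any one dimension roughly a quarter of the simulated pilots fall inside the band, yet essentially none falls inside it on all ten dimensions at once; a cockpit built for the average pilot fits almost no one.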
The Power Law of Learning is another example of a phenomenon that can be seen in the aggregate (average) but not at the level of the individual, thus raising the question of whether or not it is truly a law (see Heathcote et al., 2000, as reported by Speelman and McGann, 2013). There is a genuine and potentially hazardous disconnect, then, between conclusions drawn from between-person, aggregate statistics and statements or theories meant to offer insight into the psychology of individual persons. Finally, for a thorough analysis of how psychologists have come to this unhealthy reliance on aggregate statistics, what Danziger calls the “triumph of the aggregate,” we recommend his book Constructing the Subject.
Losing sight of the persons in your study is never a concern in Observation Oriented Modeling because it is a way of conceptualizing and analyzing data that does not rely on traditional aggregate statistics such as means, medians, variances, correlations, etc. Instead, OOM—like Exploratory Data Analysis (EDA; Tukey, 1977; Behrens and Yu, 2003)—relies primarily upon techniques of visual examination to detect and explain dominant patterns within a set of observations. Going beyond EDA, however, OOM can incorporate patterns that are generated a priori on the basis of theory, thus promoting model building and development. It also synchronizes visual examination of the data with transparent analyses that (1) identify those individuals whose observations are consistent with the predicted or identified pattern, and (2) provide an index of a given pattern’s robustness within a sample.
Richer Causal Language is Needed to Understand People and Other Living Beings
Most modern psychologists represent their theories or hypotheses using variable-based models. The simplest such model is made up of two boxes connected by an arrow to represent causal direction. For instance, a psychologist might hypothesize that changes in serotonin levels cause changes in mood (serotonin –> mood). While such models might prove useful, they are likely too simplistic to represent the complexities of human psychology or, more generally, the complexity of living organisms. As concluded by a consortium of NIMH editors in 2000 (Editorial Statement in Applied Developmental Science), “We believe that traditional, variable-oriented, sample-based research strategies and data analytic techniques alone cannot reveal the complex causal processes that likely give rise to normal and abnormal behavior among different children and adolescents. To a large extent, the predominant methods of our social and psychological sciences have valued quantitative approaches over all others, to the exclusion of methods which might clarify the ecological context of behavioral and social phenomena.”
Observation Oriented Modeling employs Aristotle’s more sophisticated notion of causality as well as integrated models for visualizing complex phenomena. Integrated models go beyond using only boxes and arrows to represent the causes underlying the phenomena being studied. Various geometric figures are used to represent distinct structures and processes, and these figures can be connected using any of Aristotle’s four causes. An example of an integrated model can be found in this recent paper by Grice et al. (2017). Of particular note, scientists can employ in OOM Aristotle’s notion of final cause, which Thomas Aquinas considered the most important species of cause. In simplest terms, a final cause in human psychology is a purpose or goal, such as when we say that the cause of John enrolling at OSU was his goal of seeking a bachelor’s degree in psychology. Psychological theories that invoke final cause are possible, as can be seen in Joseph Rychlak’s Logical Learning Theory, which he tested over several decades of research. We also recommend Rychlak’s books: The Psychology of Rigorous Humanism, Introduction to Personality and Psychotherapy, and In Defense of Human Consciousness. In Rigorous Humanism Rychlak provides a tabled history of Aristotle’s four causes (material, formal, efficient, and final) in philosophy and in science. Our work with Vladimir Lefebvre’s algebraic model of cognition also supports the use of final cause models; it moreover provides a formal cause model of the cognition involved in binary decision tasks. Lastly, Bill Powers’ Perceptual Control Theory, which posits that “behavior is goal directed and purposeful, not mechanical and responsive,” invokes the notion of final cause and offers a promising framework for the development of the types of integrated models advocated in OOM.
Because integrated models are causal models, efficient cause, which is allied with the randomized controlled experiment, as well as formal, material, and final causes are all treated in OOM. Aristotle made it clear that all four causes are needed to understand nature, and in OOM their incorporation into integrated models provides the means for modeling complex individual-level behavior in psychology and in other life sciences.
Exact Replication is a Hallmark of Science
It has become well known in recent years that a large majority of studies in psychology apparently cannot be replicated by researchers in independent laboratories. This fact was made clear by the Reproducibility Project, which found that of 100 studies, fewer than half replicated successfully. Moreover, the magnitudes of effects for most replicated studies were found to be smaller in the replication samples. This latter finding was dubbed the “decline effect” in this New Yorker article (and postscript), in which the author discussed how social science effects often don’t replicate in independent samples, or at best the size of the effects shrinks. From the Observation Oriented Modeling point of view, psychological “effects” are built on a foundation of sand (i.e., NHST and positivistic methodologies), and the decline effect is a product of the human nature that comes into play when picking over statistical results. The decline effect is not Nature attempting to reveal a secret to us, as argued by the New Yorker author. A related story involves a recently published and highly publicized study on psi phenomena in the Journal of Personality and Social Psychology. When a subsequent failed replication was sent to JPSP, it was not sent out for peer review but was instead immediately rejected by the editor. Quite ironically, the authors who failed to replicate the psi research found that paranormal journals (which are supposedly pseudo-scientific) are more likely to publish attempted replications than ostensibly top-tier journals like JPSP. The lesson from all of this is simple: exact replication is an indisputable hallmark of science, and some of our “top” journals are simply not interested. With the work of the Reproducibility Project, however, we are hopeful the climate in psychology will change regarding the importance of exact replication research.
While not specifically devoted to replication, The Open Science Framework, in which researchers can post their study materials and share their data, is clearly a positive development in modern science.
There is no shortcut when it comes to establishing the generalizability of a scientist’s particular findings. Stated simply, there is no substitute for replication. If you really wish to demonstrate that your particular research finding is replicable, you must run the study again. The tools in Observation Oriented Modeling offer no exception to this rule. If, for instance, a psychologist uses the OOM software and finds that her theory accurately accounts for the observations made on 90% of the persons in her study, the only way she can know if a similar result will be obtained beyond her laboratory is to ask colleagues to replicate her study. Of course she can conduct the study again herself to ensure that she can replicate the finding, but further generalizing will require the efforts of other scientists. Finally, in OOM, the primary tool used to argue for the generalizability of one’s findings will be an integrated model. As noted above, an integrated model spells out the causal structures and processes that explain the patterns in one’s data. If the causes are well understood, they can be expected to operate outside of one’s personal laboratory, and one’s findings should thus replicate successfully. Imagine William Harvey’s original discovery of how blood circulates through the body, pumped by the heart. Harvey could sketch his model on a piece of paper and, given our common human nature, could well expect it to fit every person he or anyone else might examine in the future. Again, his sketched integrated model, which represents his understanding of the causes and effects of blood circulation, is what allowed him to generalize beyond any given sample of individuals.