
From the model, the authors reported: "The analysis suggests that a real drop in the rate of the CPI (Consumer Price Index) is associated with phase I, but the effect of phase II is less certain." I was very proud that statistical tests were used to address important issues such as Nixon's price control. However, several graduate students majoring in economics commented that few if any in business schools would agree that either phase I or phase II of Nixon's price control was significant.

The students were courteous enough not to embarrass their instructor. But I felt stupid, and the event triggered a soul-searching about the nature of statistical tests. After years of reflection, I believe that statistics users will be better off if they take note of a two-stage test-of-significance, as follows. Step 1: Is the difference practically significant? If the answer is NO, don't bother with the next step. Step 2: Is the difference statistically significant?
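The two-stage strategy can be sketched in a few lines of Python. This is a minimal illustration, not a prescription: it assumes a simple z-test with a known standard error, a conventional 0.05 cutoff, and a practical threshold that must come from subject-matter knowledge, not from statistics.

```python
import math

def norm_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def two_stage_test(diff, practical_threshold, se):
    """Two-stage test-of-significance.

    Step 1: is the difference practically significant (does it exceed a
    subject-matter threshold)?  If not, stop.
    Step 2: only then, is the difference statistically significant?
    """
    if abs(diff) < practical_threshold:
        return "not practically significant -- stop here"
    z = diff / se
    p_value = 2.0 * (1.0 - norm_cdf(abs(z)))   # two-tailed P-value
    if p_value < 0.05:
        return "practically and statistically significant"
    return "practically interesting, but may be noise (collect more data)"

# A difference too small to matter is screened out before any test is run:
print(two_stage_test(diff=0.4, practical_threshold=5.0, se=0.1))
```

Note that Step 1 screens out exactly the situation in which a huge sample makes a trivial difference "statistically significant."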

This simple strategy certainly conflicts with the orthodox teaching: "Don't look at the data before you test!" Nevertheless, it will serve statistics users well; for one thing, mistakes such as those made by the psychometricians and the Bureau of the Census may be avoided. A question about this two-stage test-of-significance has frequently been asked: how do we know whether a difference is "practically significant"? In general, there is no statistical formula that is readily applicable. Suppose we observe some fixed dollar difference: is it practically significant?

It depends. If we are talking about the auto insurance premiums charged by two different companies, then the difference is not significant. On the other hand, if we are talking about an increase in the postage of an ordinary letter, then the difference is highly significant.

## Randomization Does Not Help Much, Comparability Does

In certain cases, the appreciation of "practical significance" can be subtle. For the verbal SAT, the national average dropped by about 20 points over the period in question, while the SD remained nearly constant and the histograms for the scores follow the normal curve. On the surface, a 20-point drop may not seem like much, but it has a large effect on the tails of the distribution. Let's think about the percentage a little bit further. First, there are millions of students taking the SAT each year. Suppose that there are four million students.

Three percent of them would translate into 120,000 students. That's a lot. Second, students with scores in the upper tail are more likely to become scientists, engineers, managers, intellectuals, or future political leaders. Therefore, the drop in SAT scores, although only 20 points on the average, is practically highly significant.
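The tail effect can be checked with a short calculation. The numbers below are hypothetical round figures (the original averages are not reproduced here): a mean of 500 dropping to 480 with the SD held at 100, and scores following the normal curve.

```python
import math

def norm_tail(x, mean, sd):
    """P(X > x) for a normal distribution with the given mean and SD."""
    z = (x - mean) / sd
    return 0.5 * (1.0 - math.erf(z / math.sqrt(2.0)))

# Hypothetical round numbers: the average falls 500 -> 480, SD stays 100.
before = norm_tail(700, mean=500, sd=100)   # fraction scoring over 700
after = norm_tail(700, mean=480, sd=100)
print(f"over 700 before: {before:.4f}, after: {after:.4f}")
print(f"shrinkage of the top group: {1 - after / before:.0%}")

# And a small percentage of a huge population is a large number of people:
print(int(4_000_000 * 0.03))
```

A 20-point shift in the mean, invisible to most eyes, removes more than a third of the students from the extreme upper tail.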

We now move back to the case of Nixon's price control. In an econometric seminar at Princeton University, I met Professor Tiao and asked him specifically whether Nixon's price control was significant in the ordinary sense. Professor Tiao looked stunned. It appeared that it was the first time somebody had asked him this question. His response was quick and right on the mark, though: he is only a statistician, so I should ask the question of people in business schools. The time-series plot of the CPI data under the phase I and phase II controls is shown in the figure.

Reprinted by permission. Note that at the right end of the graph, the black dots (phase II control) are running at about the same level as those before the phase I control. Therefore, phase II control did not appear to have reduced the inflation rate. Phase I control, on the other hand, appears effective. But a question is: if phase I was indeed effective, why was phase II needed? Further, if either control was effective, why was neither used by later administrations?

A motivation behind the proposed two-stage test-of-significance was that statistical methods usually take the place of a scientist's instinct and subject-matter knowledge. The situation often gets worse when complicated statistical models are used in empirical studies where subject-matter knowledge is relatively weak. An example is the statistical model called the discriminant function. The function is popular among behavioral scientists (e.g., psychometricians). This model depends on complex multivariate-analysis formulas that general scientists are likely to have trouble understanding.

Yet some researchers display a fondness for the complicated model and for terms like "discriminant power," "two-stage discriminant analysis," etc. (See an in-depth discussion of Bem's study in Chapter 5, Section II.) But a simple fact is that the statistical discriminant function is a generalized version of the t-test: it concerns only "statistical significance," not "practical significance." Playing around with discriminant analysis appears to be a research fashion, but the classification based on such analysis may well identify a biological trait as a psychological characteristic again.

To have a closer look at the issue of "moderate values of P," we first consider the following example of the Kolmogorov-Smirnov test (Conover, p.). Based on the two-tail test statistic, Conover concluded: "Therefore H0 is accepted at the given level." Eyeball examination indicates that the two distributions have nearly the same mean but quite different variances (Gideon). Consider these P-values as evidence that the underlying distributions are normal.

The reasons are as follows: the conclusion means only that there are not enough data to reject the null hypothesis. This is because the F-test is a more powerful test if the normal-curve assumptions are correct. Such arguments are bewildering to general scientists. But unfortunately all these arguments are correct. The following explanations may help. First, Freedman's statement that "moderate values of P can be taken as evidence for the null" is not quite true. In addition to a moderate value of P, one needs at least two other conditions in order to accept a null hypothesis: (1) there is substantial knowledge about the subject matter, and (2) the sample size is relatively large.

To illustrate, consider the following rice-breeding example (FPP, p.). The technique involves crossing different lines to get a new line which has the most advantageous combination of genes; detailed genetic modeling is required. One project involved breeding new lines for resistance to an insect called the "brown plant hopper"; several lines were raised, with the results tabulated in the original.

Are the facts consistent with this model? In the case of the original conclusion, hidden behind the chi-square test are the feeling and the consensus of all the scientists involved in the project. The plus and minus signs (albeit for only 3 values) also indicate that the differences may well be due to chance variation inherent in the genetic model.

This chance variation can be formally quantified by the P-value of the chi-square test. We now go back to Conover's non-parametric test, which concluded that "H0 is accepted at the given level." Conover's book (p.) offers another test that is also non-parametric, in the sense that it does not need the normal-curve assumptions. This test, as one can expect from the graphic display of the data, yields a P-value small enough that the null hypothesis is rejected without any reasonable doubt.

To sum up, Conover's conclusion that "H0 is accepted" is not really wrong, if one accepts the current statistical terminology. This terminology is a typical case of "caveat emptor" (FPP, p.). The article depicts the chi-square test as a formula to measure the fit between theory and reality.

As we have seen in the rice example above, this is too good to be true. To test the goodness-of-fit of a theory against experimental data, a scientist has to rely in large measure on his subject-matter knowledge, not only on a chi-square test, which concerns only statistical, not practical, significance. Another misleading device of goodness-of-fit is the so-called "diagnostic checking" in regression or time-series modeling. Many statistical packages offer stepwise regression or automatic time-series modeling.

The procedures are based on a search through an array of diagnostic statistics for the best choice. In this kind of modeling, as hinted by promoters of the software packages, all you have to do is enter the data, and the computer will take care of the rest. To insiders, such practices are in fact a shot-gun approach in search of an essentially black-box model, one that describes a very loose relationship between the inputs and the outputs but sheds little light on the true mechanism that generated the data. Statistical models of this sort, according to Geisser (Statistical Science), "represent a lower order of scientific inquiry."

To illustrate, let's cite a remarkable experiment from Freedman. The experiment mimics a common practice in regression analysis where the underlying mechanism is relatively unknown (see references in Freedman's article). More precisely, a matrix was created with 100 rows (data points) and 51 columns (variables), so that all the entries in the matrix were pure noise (in this case, all entries are independent, identically distributed, and normal). The last column of the matrix was taken as the dependent variable Y in a regression equation, and the rest as independent variables.
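The first, screening stage of such an experiment can be imitated with pure noise using only the Python standard library. The sketch below is not Freedman's exact design: it screens 50 noise predictors against a noise response with the usual 5% test for a correlation coefficient, using a normal approximation to the t-test; all sizes and seeds are illustrative.

```python
import math
import random

random.seed(1)
n, k = 100, 50   # 100 data points; 50 pure-noise predictors plus a noise Y

def corr(x, y):
    """Sample correlation coefficient."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

y = [random.gauss(0, 1) for _ in range(n)]
xs = [[random.gauss(0, 1) for _ in range(n)] for _ in range(k)]

# First-stage screening: count the noise predictors that look
# "significant" at the 5% level (normal approximation to the t-test).
hits = 0
for x in xs:
    r = corr(x, y)
    z = r * math.sqrt(n - 2) / math.sqrt(1.0 - r * r)
    p = 2.0 * (1.0 - 0.5 * (1.0 + math.erf(abs(z) / math.sqrt(2.0))))
    hits += p < 0.05
print(f"{hits} of {k} pure-noise predictors pass the screen")
```

Refitting a regression on only the screened predictors then produces an impressive-looking, and entirely phantom, relationship.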

The fact that such diagnostic tests may lead to a phantom relationship in regression or time-series analysis is a profound issue that will be discussed in more depth in Chapter 4 and the related parts of Chapters 2, 3, and 6. At this moment, we intend to point out only one fact. When Fisher and Neyman developed their theories of statistical testing, the test procedures required clearly defined chance models derived from substantial knowledge of the subject matter.

This logical sequence (model first, tests next) is reversed in "modern" diagnostic tests: the mechanism that generated the data is poorly understood, so the investigators use statistical tests to search for, and justify, a model that might have described a relationship which never really existed. Lousy statistics can be found almost everywhere. Much of it arises, in our opinion, from mechanical use of statistical tests. The book Statistics by Freedman, Pisani, and Purves launched a great effort to counteract such use. In the preface, they wrote: "Why does this book include so many exercises that cannot be solved by plugging into a formula?

The reason is that few real-life statistics problems can be solved that way." With this in mind, we are compelled to include here a conversation between DeGroot and Lehmann in Statistical Science. DeGroot said: "Take a simple example of a chi-square test. Why is it universally used?"

Well, it is a simple procedure. You calculate it, and you don't have to think about it. Nevertheless, most applied statistics books that we have seen exhibit essentially the same attitude (the list of authors includes prominent statisticians at Chicago, Stanford, Harvard, Cornell, etc.).

The situation has been greatly improved by the philosophy and some of the techniques of Tukey's EDA. But much remains to be done. As an example of the chi-square test, a speaker at an econometrics seminar at Princeton said that he had used the chi-square test, with a sample size near half a million, in a cross-sectional study for a big bank. It was interesting to see that this scholar, who is so careful in every detail of a proof by Lebesgue integration and functional analysis, is the same scholar who used a "simple" chi-square test in such a fuzzy manner.

It was equally amazing that nobody except this author tried to point out that it was not right to do a chi-square test this way. For example, if we increase the sample size 100 times, the numerator in the chi-square test will increase about 10,000 times while the denominator increases only 100 times.
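The scaling argument is easy to verify numerically. In this sketch the population split and the tiny deviation (an observed 50.5% against an expected 50%) are invented for illustration:

```python
def chi_square(observed, expected):
    """Pearson chi-square statistic: sum of (O - E)^2 / E."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

def stat_at(n):
    """Chi-square statistic when the observed proportions are held fixed
    at 50.5% / 49.5% while the sample size n grows."""
    observed = [0.505 * n, 0.495 * n]
    expected = [0.500 * n, 0.500 * n]
    return chi_square(observed, expected)

# The same practically trivial deviation, at three sample sizes:
for n in (1_000, 100_000, 500_000):
    print(f"n = {n:>7}: chi-square = {stat_at(n):.2f}")
```

With one degree of freedom the 5% critical value is about 3.84, so the same half-percent deviation is "insignificant" at n = 1,000 and overwhelmingly "significant" at n = 500,000.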

This leads to the rejection of any reasonable null hypothesis. In contrast to using the chi-square test without thinking about it, the following statement by Lehmann is a wise suggestion for any statistics user (DeGroot): "[Many] statistical problems don't have unique answers. There are lots of different ways of formulating problems and analyzing them, and there are different aspects that one can emphasize."

This statement reveals an important aspect of scientific activity: the intuitive and unpredictable way scientists actually work. In a sense, the term "scientific method" is misleading. It may suggest that there is a precisely formulated set of procedures that, if followed, will lead automatically to scientific discoveries.

Statistics users who are more conscientious about the validity of their results thus often ask: which test should I use? The answer to this question is, surprisingly, not contained in most books. It is like a professional secret withheld from non-statisticians and entry-level statistics majors. The correct answer, I finally found out, is: it is up to the user.

This answer does not seem to provide much guidance. Further, it is even more confusing that, according to the orthodox teaching of the Neyman-Pearson school, one cannot look at the data before setting up a one-tail or two-tail test. The orthodox teaching of "Don't look at the data before you test" has something to do with the probabilities of the type I and type II errors in hypothesis testing. The probabilities are meaningful, according to the school of Neyman and Pearson, only before the data are examined.


This may sound strange, but consider this example. A box contains 30 red and 70 blue marbles. A marble is picked at random from the box. Before you look at it, the chance that the marble is red is 30%; after you look, the marble either is red or it is not, and the 30% no longer applies. Therefore it is worthwhile to examine how statisticians of other schools think about the issue. Dempster mentioned that "taken seriously, Neyman's theory suggests that all the statistician's judgment and intelligence is exercised before the data arrive, and afterwards it only remains to compute and report the result."
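The frequency sense of the pre-data probability can be checked by simulation (a toy sketch, with an arbitrary seed):

```python
import random

random.seed(0)
box = ["red"] * 30 + ["blue"] * 70

# Before looking, "the chance the draw is red" is 30% in the frequency
# sense: over many repeated draws, about 30% come up red.
draws = [random.choice(box) for _ in range(100_000)]
print(draws.count("red") / len(draws))   # close to 0.30

# After one particular marble has been examined, that 30% no longer
# applies: the marble in hand is either red or it is not.
one_draw = random.choice(box)
print(one_draw)
```

The 30% describes the drawing procedure, not any single drawn marble; the Neyman-Pearson error probabilities are of exactly this pre-data kind.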

Huber's explanation of this phenomenon concerns how small the P-value is. Freedman et al. suggest a simple remedy: for instance, if the test was one-tailed and you think it should have been two-tailed, just double the P-value. However, many Bayesians may raise a wall against such practices. The issue at stake is very nasty, and scientists are advised to stay away from this mess. Statistics users need not panic at the objections of radical Bayesians; as is apparent from the examples shown in this chapter, their cure often is worse than the disease.

Bewildered by statistical tests, scientists often underestimate the potential of "non-significant" results. This tendency is another disservice of our profession to the cause of science. Some examples follow.

In a survey of articles published in leading medical journals, Pocock et al. examined the reporting of statistical results. Pocock observed that abstracts of research publications are very important in transmitting new knowledge to the scientific community. But, intimidated by so-called "non-significant" statistical tests, scientists often do not report in the abstract the findings that are associated with large P-values.

A conclusion was drawn by Pocock: "Because of the obsession with significance testing in the medical literature, authors often give insufficient attention to estimating the magnitude of treatment differences." This phenomenon is not unique to the medical literature; the problem is also prevalent in other empirical studies. In another survey of tests of association, Archer and Waterman counted the non-significant and significant outcomes. In both studies, the actual magnitudes of the differences and correlations were not indicated.

We suspect that many so-called "non-significant" results may in fact contain important information, if only the investigators had looked at the magnitudes of the differences. Lest statistical jargon lead researchers astray, it should be made explicit that "statistical insignificance" often means simply that the noise overwhelms the signal and that one needs a larger sample or better measurements.
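The dependence of "significance" on sample size is mechanical. In the sketch below (hypothetical numbers), the observed difference between two means is held fixed at one-tenth of a standard deviation while the sample grows; the z-test P-value falls to any desired level:

```python
import math

def two_sided_p(z):
    """Two-tailed P-value for a standard normal test statistic."""
    return 2.0 * (1.0 - 0.5 * (1.0 + math.erf(abs(z) / math.sqrt(2.0))))

# A fixed observed difference of 0.1 SD; only the sample size changes.
for n in (25, 100, 400, 1600, 6400):
    z = 0.1 * math.sqrt(n)   # z = diff / (sd / sqrt(n)) with diff = 0.1 * sd
    print(f"n = {n:>4}: z = {z:4.1f}, two-tailed P = {two_sided_p(z):.2e}")
```

The difference itself never changes; only the label "significant" does.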

In many cases, increasing the sample size will eventually lead to a "statistically significant" result. We now summarize the discussion of significance tests in the tree diagram shown in the figure. The reason is that many things can go wrong at the various stages of handling the data. An example follows. Modern astrophysicists conclude that our universe is filled with ultra-smooth CBR (cosmic blackbody radiation).

This radiation is a relic of the Big Bang. Theoretical physicists predict that primordial fluctuations in the mass density, which later became the clusters of galaxies that we see today, should have left bumpiness in the blackbody radiation. The prediction is of great importance to the study of the Big Bang model and has thereby generated numerous experiments and articles on the subject (see references in Wilkinson). In order to test the theory, the CBR has been measured with sophisticated radiometers on the ground, in balloons, and on satellites.

But none of the estimated amplitudes has been shown to be statistically significant. The situation is rather disappointing, because the best available technology has been fully explored. A question then arises: what can scientists do now? Abandon the Big Bang theory? The only thing which has been established is that the blackbody radiation is extremely smooth. No more, no less. As a matter of fact, physicists dedicate themselves to a continual search for the fluctuation and a deeper understanding of the blackbody radiation.

Their activities include better calibration of the instruments, reduction of the background noise, applications of sophisticated statistical methods, etc. In one instance, some physicists attempted to capitalize on the findings in the existing studies. They proposed intricate arguments (a combination of Bayes' theorem, likelihood ratios, Legendre polynomials, type I and type II errors, etc.). Their arguments are very intelligent but were dismissed by this author on the ground that the prior distribution is sloppy and indeed, as they admitted, contradictory to existing knowledge.

After a series of failures, these physicists still have no intention of giving up; they vowed that they will try other priors. They are also thinking of going to the South Pole to get better measurements of the blackbody radiation. Physics has long been admired by many as the champion of "exact science." Nevertheless, the physicists' relentless pursuit of good theory and good measurement is highly respectable and stands in sharp contrast to many empirical studies that rely mainly on statistical tests of significance.

It is hardly an overstatement to say that the type I and type II errors are dominant components of hypothesis testing in statistical literature.

However, scientists usually perform statistical tests based only on the type I error and seldom bother to check the type II error. This lack of adherence to the Neyman-Pearson paradigm can also be found in the following books, where the lack of adherence turned into a lack of respect. Wonnacott and Wonnacott discussed some difficulties with the classical N-P (Neyman-Pearson) procedure and asked: "Why do we even bother to discuss it?" A question to Freedman is: how can scientists perform statistical tests without knowing the type I and type II errors? For this issue, consider the following example, which I have found successful in convincing some skeptics and would-be scientists of the importance of type I and type II errors.

The issue is the safe usage level of a cheap but potentially harmful farm chemical. Assume the safe level is 10 units or less. The null hypothesis is then either (A) H0: the mean level is at most 10 units, or (B) H0: the mean level is at least 10 units. In case (A), the null hypothesis says that the chemical is at a safe level, while in (B) the null hypothesis says that the chemical is at a dangerous level. Users are often confused as to which null is more appropriate.

Let's postpone that issue for a moment. To set up decision rules, we need the sample size n, the probability of the type I error, and the estimated standard deviation SD. For case (A), the rejection region lies in the right tail: we declare the chemical dangerous only when the sample mean is well above 10. For case (B), the rejection region lies in the left tail: we declare the chemical safe only when the sample mean is well below 10. For data falling in between, the two setups reach opposite conclusions! To clarify the confusion, let's compare the type I errors. For case (A), alpha is the error which farmers do not want to risk. For case (B), alpha is the error which consumers do not want to risk. Therefore, to protect consumers, we should choose the null hypothesis in (B), even if consumers have to pay more for the farm products.
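The opposite conclusions can be made concrete. The sketch below uses invented numbers (alpha = 0.05, n = 25, SD = 2) for the hypothetical farm chemical with a safety limit of 10 units:

```python
import math

Z_05 = 1.645   # one-tailed critical value for alpha = 0.05

def decide(xbar, n, sd, null):
    """Decision rule for the farm-chemical example (safety limit 10).

    null "A": H0 says the level is safe (mu <= 10); reject only when
              the sample mean is clearly above 10.
    null "B": H0 says the level is dangerous (mu >= 10); reject only
              when the sample mean is clearly below 10.
    """
    se = sd / math.sqrt(n)
    if null == "A":
        return "declare dangerous" if xbar > 10 + Z_05 * se else "treat as safe"
    return "declare safe" if xbar < 10 - Z_05 * se else "treat as dangerous"

# The same data, falling between the two cutoffs, yields opposite verdicts:
print(decide(xbar=10.5, n=25, sd=2.0, null="A"))   # treat as safe
print(decide(xbar=10.5, n=25, sd=2.0, null="B"))   # treat as dangerous
```

The verdict is decided not by the data but by which party's risk is labeled "alpha."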

Another reason is this: if we save money today, we may have to pay more to hospitals tomorrow. Hence (B) is the choice, after we analyze the nature of the type I errors. Some of my colleagues and students enjoyed the above discussion of type I errors and praised it as a "beautiful analysis." Two years after I introduced the previous example in the classroom, however, I totally changed my mind. To highlight the point, let me quote a statement on the calculus of variations from a classic book by Bellman and Dreyfus.

This is what I now feel about the type I and type II errors advocated by Lehmann, among others. For decision-makers, a larger sample size and a confidence interval will certainly solve this problem without any help from the analysis of type I or type II errors. But the practice is facing an increasing challenge from some QC experts who put more emphasis on "zero defects" and the concept of "doing things right at the beginning."

In some ways, enthusiastic statistics users are convinced that the analysis of type I and type II errors is a powerful tool for decision-making. For instance, if you look at the freshman books in business statistics (which are supposed to be relevant to real-life applications), you will often find several chapters on "decision theory" that deal with type I and type II errors. Students, on the other hand, regard these materials as part of "an unimportant, mechanical, uninteresting, difficult but required subject" (Minton). Only those who lose their sanity completely would try to deal with type I and type II errors in any business setting.

In an ASA (American Statistical Association) panel discussion on the role of the textbook in shaping the business statistics curriculum, Hoyer decried that "the Neyman-Pearson decision theory is a logically indefensible and intellectually bankrupt decision mechanism." We applaud this conclusion and hope that someday those who produce business statistics textbooks will take note and really ponder the true nature of the Neyman-Pearson theory of statistical hypothesis testing. But this task is primarily a job belonging to theoretical statisticians, not practicing statisticians or general scientists.

In fact, learned statisticians seldom compute the type II error on a set of experimental data. They simply calculate P-values (and confidence intervals), and the conclusions are no less rigorous. It is a common belief that the type II error helps to determine the sample size in the design of experiments.

However, the procedure is cumbersome and very artificial. In one case, the concern was the strength of two different cement-stabilized base materials. For this simple comparison, the investigators enlisted a special scheme, and, according to Anderson and McLean, "the sequential procedure recommended here or by Stein must be followed" [emphasis supplied]. This so-called sequential procedure, as one must be aware, has nothing to do with the actual magnitude of the difference.

Rather, it concerns only the number of additional specimens to take. The procedure is complicated and involves a contrived selection of beta values (i.e., probabilities of type II error). As a matter of fact, the width of a confidence interval can be used to determine the sample size, and it is definitely not true that the sequential procedure must be followed. In more complicated designs (such as multiple comparisons or two-way layouts), the choices of sample size involve certain ingenious procedures developed by mathematical statisticians.

Such procedures are of great esoteric interest to specialists but are seldom needed in data analysis. In our opinion, interval estimation (such as Tukey, Scheffé, or Bonferroni simultaneous confidence intervals) will be no less rigorous than stopping rules based on type I and type II errors. In conjunction with interval estimation, a "sequential procedure" should proceed as follows: 1.

Examine the actual magnitudes of the differences in a multiple comparison. 2. Throw out the cases that are neither practically nor statistically significant.


Finally, conduct a trimmed multiple comparison (which by this point may be only a simple two-population comparison). This approach would apply to most statistical designs: one-factor analysis, two-factor analysis, complete factorial experiments, incomplete factorial experiments, nested factorial designs, and Latin square designs. Note that all these designs are special cases of regression models. The treatment effects thus can be assessed by the estimation of linear combinations of parameters. In certain statistical tests (such as the chi-square test, normality tests, and runs tests) the use of the confidence interval is not appropriate; but do not expect that the type I and type II errors will help in the determination of the sample size.
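For reference, the confidence-interval route to a sample size is one line of algebra: to obtain a z-based interval for a mean with half-width at most E, take n at least (z·SD/E) squared. A sketch with invented numbers:

```python
import math

def sample_size_for_ci(sd, half_width, z=1.96):
    """Smallest n giving a z-based confidence interval for a mean with
    at most the requested half-width: n >= (z * sd / half_width) ** 2."""
    return math.ceil((z * sd / half_width) ** 2)

# Hypothetical numbers: SD of 8 strength units, desired half-width 2 units.
print(sample_size_for_ci(sd=8.0, half_width=2.0))
```

No beta values, stopping rules, or power charts are needed; the same formula extends to a difference of two means by adjusting the standard error.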

In most cases, a learned statistics user simply follows his feeling, and generally this works out better than many exotic designs that are based on the type II errors. In our opinion, the accept-reject method is useful only for certain engineers in quality control, but it is irrelevant to data analysis in general science.

However, the accept-reject method and the calculation of the power function are repeatedly promoted in leading medical journals (see, e.g., Pocock et al.). The reasons are given in Pocock et al.: first, data snooping may introduce bias into the study; hence the intended size of a trial should be determined in advance. Second, it is ethically desirable to stop the study early in case clear evidence of a treatment difference has been found.

Since repeated use of statistical testing increases the risk of a type I error, some stopping rules based on power calculations are recommended to researchers in biomedical science (see references in Pocock). Pocock's reasons for power calculation in clinical trials are questionable on both ethical and technical grounds. On the ethical side, most multiple comparisons involving human beings indeed can be, and should be, trimmed to two-population comparisons.

Therefore interval estimation can be used to determine the sample size, and using human subjects in a complicated design only results in a waste of valuable resources. On the technical side, Pocock's recommendation is a two-stage procedure that requires testing to take place before the estimation of the contrast between parameter values.

However, it is known (see, e.g., Olshen) that this conditioning distorts the coverage probability. As shown by Olshen, for a broad spectrum of experimental designs, the conditional probability of simultaneous coverage, given that the F-test has rejected the null hypothesis, is less than the unconditional probability of coverage. Further, it is impossible to assign the conditional probability of coverage precisely, because that conditional probability depends on unknown parameters. In conclusion, after much effort, we fail to see any convincing case in which the statistical scaffolding of the power calculation is needed.

At this moment, we believe that the power calculation is only a weak and often misleading procedure and that scientists may be better off abandoning the accept-reject method altogether. Bailar and Dan also disliked the reporting of small P-values. Note that in scientific reporting, a statistical result is often marked with one star if P is below 0.05, and with two stars if it is below 0.01 (Freedman et al.). Ordinary statistics users may now wonder about the following question: should a scientist report very small P-values, say 10^-6?

In some examples the computed P-value is indeed that small. In practice, the reporting of the P-value is superior to the accept-reject method in the sense that it gives a useful measure of the credibility of H0. This use of P-values has a strong flavor of Bayesianism, but an exact interpretation of this measure is difficult and controversial. The debate among those ten statisticians is intriguing, and it appears that hardly any of those scholars came out of that mess completely clean. For statistics users, the bad news is that a Bayesian calibration of the P-value is very difficult (or nearly impossible), but the good news is that P-values are "reliable for the primary hypothesis in well-designed (for good power) experiments, surveys, and observational studies" (Morris, p.).

For hypotheses of secondary interest, call a professional statistician. Unlike the probabilities of type I and type II errors, the P-value per se does not bear a frequentist interpretation of the calculated probability value. As a result, inference based on the P-value alone is not grounded in probability theory, but rather in "inductive reasoning," a very powerful yet often misleading scientific method (for more discussion of the merits and the problems of inductive reasoning, see Chapter 2, "Quasi-Inferential Statistics").

Berger and Delampady asserted that in certain testing procedures "formal use of P-values should be abandoned." This is true. But in practice, scientists usually compare P-values, and inferences of this type then appear to be deductive, not inductive. In brief, the P-value enjoys the power of inductive reasoning and at the same time establishes its legitimacy upon the strength of the theory of Neyman and Egon Pearson. The above discussion appreciates an advantage of the N-P theory over Fisher's fiducial probability. But this advantage of the N-P theory disappears when scientists approach real-life data.

Indeed, the tyranny of the N-P theory in many branches of empirical science is detrimental, not advantageous, to the cause of science. To certain theoretical statisticians, however, interval estimation is a rather simple procedure and has little intellectual appeal. Such statisticians therefore lavishly invest their academic freedom in the theory of hypothesis testing. Fortunately, this investment has generated many beautiful theories on the power properties of statistical tests.

In addition, these theories were extended to the Hajek-LeCam inequality and to a new field of adaptive estimators that attain the lower bound of the inequality and thus provide optimal estimation. On other fronts, test statistics are used to construct new estimators. For instance, Wang derived, from test statistics, a minimum distance estimator for first-order autoregressive processes. For such reasons, the Neyman-Pearson theory remains an important topic in mathematical statistics. But regretfully, the N-P theory is only a theory. It is not a method, or at least not a method that cannot be replaced by better methods.

In certain branches of academic study, according to Lehmann (in DeGroot), "hypothesis testing is what they do most." In regard to the wide use of statistical tests, DeGroot (p.) cited critics who concluded: "Taken together, the questions present a strong case for vital reform in test use, if not for their total abandonment in research." Henkel, then an associate professor of sociology at the University of Maryland, further concluded that "in the author's opinion, the tests are of little or no value in social science research."

Nevertheless, we deeply resent the way the Neyman-Pearson theory is taught and applied. Most applied statistics books we find are loaded with "involuted procedures of hypothesis testing" (DeGroot). On the surface, these applied books look rigorous in their mathematics, but in fact they are neither mathematical nor statistical, only mechanical! No wonder empirical scientists are often misled by statistical formulas and produce many results that earn little respect from serious scientists (see the examples of psychometric research in Chapter 1, Section I, and in Chapter 5). By comparison, literary criticism, history, and good journalism, which are relentlessly grounded in hard evidence and insightful analysis, may carry more scientific spirit than many so-called "data analyses" carried out by psychometricians, and perhaps by many other empirical scientists as well.

We believe that, despite such questioning, the test of significance is in all probability here to stay (especially the nonparametric and goodness-of-fit tests), but its reappraisal for general scientists is also necessary. As long as leading statisticians keep ignoring the true nature of statistical tests in the introductory books they teach from and write for non-statistics students, the reappraisal is almost impossible.

This is a solemn appeal to statisticians to look at the issue seriously; this is the only way that the problem can receive the treatment it merits.

NOTES

1. Kruskal and Majors sampled 52 usages of "relative importance" from a diversity of disciplines such as physics, chemical engineering, biology, medicine, the social sciences, and the humanities. The sample was taken by a probability method from a frame of papers having "relative importance(s)" or "relative influence(s)" in the title. See also Problem 2 in a popular textbook (Hicks, p. ). In this case, the difference between the observed value and the expected value is highly significant (z = 3. ).
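The note above turns on a z statistic for an observed value versus an expected value. A sketch of how such a statistic is formed, using hypothetical numbers (a coin-tossing count, not the figures from the Hicks problem):

```python
import math

def z_statistic(observed: float, expected: float, se: float) -> float:
    """Standardized difference between an observed and an expected value."""
    return (observed - expected) / se

# Hypothetical data: 540 heads in 1000 tosses of a supposedly fair coin.
n, p0 = 1000, 0.5
observed = 540
expected = n * p0                         # 500
se = math.sqrt(n * p0 * (1 - p0))         # binomial SE of the count, ~15.8
z = z_statistic(observed, expected, se)   # about 2.53
```

A z of roughly 3, as in the note, would place the observed value about three standard errors from expectation, which is conventionally called highly significant.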

I am somewhat relieved to know that Freedman, too, is a human being. The mistake, however, was corrected in the 2nd edition of Freedman, Pisani, Purves, and Adhikari (Norton). Mechanical applications of statistics are misleading and intellectually offensive. See also Goldstein and Goldstein (p. ). A computer program, BACON (Langley), was used to rediscover Kepler's third law with a combination of Kepler's methods and sophisticated heuristics.

BACON was also claimed to have rediscovered many other scientific laws. But the program, to my knowledge, has not yet discovered any useful new law. Minton's discussion did not deal just with business statistics, but was more general in nature. There are only two situations in which we believe the concept of type II errors is really useful: (1) comparison of the sensitivities of control charts, and (2) acceptance sampling in quality control.

However, the current trend in QC is a de-emphasis of acceptance sampling. An ingenious way to conduct a trimmed multiple comparison is the classical FFE (fractional factorial experiment). This is one of the greatest contributions of statistics to quality control. The procedure uses formal inferential statistics.

But FFE is different from traditional methods such as Tukey's or Scheffé's procedures: FFE is more a technique for exploratory data analysis, and less a method for confirmatory analysis.

Design of Experiments: A Realistic Approach. Marcel Dekker, New York.

Archer, S. Human Development, Vol. Bailar, J. Bellman, R. Applied Dynamic Programming. Bem, S. The Measurement of Psychological Androgyny. Bem, S. Berger, J. Testing Precise Hypotheses. Statistical Science, Vol. Breiman, L. Nail Finders, Edifices, and Oz. Brown, L. The Conditional Level of the t Test. Annals of Mathematical Statistics, Vol.

Bureau of the Census Casella, G. JASA, Vol.


Conover, W. Practical Nonparametric Statistics. Wiley, New York. DeGroot, M. A Conversation with Erich L. Lehmann. Dempster, A. Purposes and Limitations of Data Analysis. In Box, T. Leonard, and C. (eds.), Academic Press, New York.

DerSimonian, R. Reporting on Methods in Clinical Trials. Diaconis, P. Annals of Statistics, Vol. Ethier, S. Testing for Favorable Numbers on a Roulette Wheel. JASA. Freedman, D. A Note on Screening Regression Equations. American Statistician, Vol. Norton, New York. Geisser, S. Opera Selecta Boxi. Gideon, R. Glantz, S. Circulation, Vol. Goldstein, M. Plenum Press, New York. Hájek, J. Theory of Rank Tests. Hacking, I. Trial by Number. Henkel, R. Tests of Significance. Sage, Beverly Hills, California. Hicks, C.

Fundamental Concepts in the Design of Experiments, 3rd ed. Holt, Rinehart, and Winston, New York. Hoyer, R. A preprint. Huber, P. Data Analysis: In search of an Identity. Kuhn, T. Langley, P. Data-driven Discovery of Physical Laws. Cognitive Science, Vol. Lehmann, E. The Neyman-Pearson Theory after 50 Years.

Vol. I, Wadsworth, Belmont, California. Leonard, T. Some Philosophies of Inference and Modelling. Leonard, and C. Minton, P. The Visibility of Statistics as a Discipline. The American Statistician, Vol. Morris, C. Morrison, D. The Significance Test Controversy. Aldine, Chicago.

Olshen, R. The Conditional Level of the F-Test. Pocock, S. Statistical Problems in the Reporting of Clinical Trials. Trenton Times Survey: Kids Do an Hour of Homework Daily. Wang, C. Wilkinson, D. Anisotropy of the Cosmic Blackbody Radiation. Science, Vol. Wonnacott, T. Introductory Statistics, 4th edition, Wiley, New York.

Chapter 2. Quasi-Inferential Statistics

I. It is interesting to see that many famous statisticians have been involved in the discussion of randomness over the years. One lesson we learned from their discussions is that the word "random" is probably the most used and misused word in the statistical vocabulary.

In the March issue of Science, Kolata described some of Diaconis' interesting work on randomness under the title "What does it mean to be random?" For more discussion along this line, see Diaconis and Engel, and DeGroot. In statistical inference, if a phenomenon is not random, a man-made randomization can be induced to achieve the randomness needed for a legitimate calculation. But the procedure is not always applicable. Therefore opinions often conflict with each other on the reporting of "inferential statistics" based on encountered data.

For instance, Meier (JASA) spelled out that he does not quite share David Freedman's hard-line position against formal statistical inference for observational data. In a Neyman Lecture held in Chicago, Dempster spent some 10 minutes on the following issues: (A) chance mechanisms, (B) personal measures of uncertainty, and argued that A ⊂ B. In the February issue of Statistical Science, Madansky commented on Freedman's general quest, which includes a never-wavering adherence to randomization.

In a conference held in New Jersey in April, Freedman questioned a speaker: "Do you have randomization in your study?" The following is a typical example of Freedman's hard-line position (FPP, pp. ). Freedman's answer to the above question is: Theory says, watch out for this man. What population is he talking about? Why are his students like a simple random sample from the population?

Until he can answer these questions, don't pay much attention to the calculations. He may be using them just to pull the wool over your eyes. As Cochran pointed out, experiments in behavioral psychology are often conducted using graduate students and other volunteer students (paid or unpaid) in a university's psychology department.

The target populations may be all young persons in a certain age range. No wonder many statisticians have raised objections. But on what grounds can the above psychologist calculate the SE (standard error)? Some attempts to justify the calculation of the standard error of observational data can be found in Freedman and Lane (a, b). See also Diaconis (b) for a good survey of some theories of data description that do not depend on assumptions such as random samples or stochastic errors.

This chapter provides another view of the issue. For this purpose, we will first present a short discussion of epistemology (the philosophy of science). Deductive reasoning is epitomized by mathematics and logic; the latter deals mainly with the validity of arguments rather than the truth of our universe. Inductive reasoning is making a general statement, or a scientific law, based on certain experimental data.

Note that mathematical induction is neither inductive nor deductive. It is Peano's fifth axiom. Using this axiom to prove theorems is deductive, not inductive. However, the mental process of formulating a theorem prior to its formal proof is inductive, not deductive. The name "mathematical induction" is somewhat misleading. The inductive principle was first systematically described by Francis Bacon and has long been seen as the hallmark of science. However, David Hume pointed out that no number of singular observation statements, however large, could logically entail an unrestrictedly general statement.
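The claim that mathematical induction is deductive can be made explicit by writing out the induction axiom itself (a standard statement of Peano's fifth axiom, not quoted from the text):

```latex
% Peano's fifth axiom (the induction principle): if a property P holds at 0
% and is preserved by the successor step, then P holds for every natural number.
\[
  \bigl[\, P(0) \;\wedge\; \forall n \,\bigl(P(n) \Rightarrow P(n+1)\bigr) \,\bigr]
  \;\Longrightarrow\; \forall n \, P(n)
\]
```

A proof by "induction" simply instantiates this axiom and applies modus ponens, which is deduction through and through; the inductive moment, if any, lies in guessing the statement $P$ beforehand.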

He further argued that inductive reasoning is a matter of psychology, not of logic. The trouble with induction, which has been called "Hume's problem," has baffled philosophers from his time to our own. Carnap and many logical positivists proposed a degree of confirmation for scientific theories and thus tried to formulate an inductive logic.

This attempt appears sensible but proved to be exceptionally difficult. Indeed, no inductive logic seemed capable of capturing man's intuitions about what confirms a theory and what does not. When a man cannot explain a certain natural phenomenon, he usually creates new terminology to satisfy himself. Sir Karl Popper asserted that "the way in which knowledge progresses, and especially our scientific knowledge, is by unjustified (and unjustifiable) anticipations, by guesses, by tentative solutions to our problems, by conjectures."

In his classic book Lady Luck, Warren Weaver regarded induction as a branch of logic. However, if induction is "a branch of logic," it must be a special kind of "fuzzy logic." Apparently the statistical discipline has close ties to, and has made great contributions to, inductive reasoning. However, the practice of statistical reasoning is not always as straightforward as presented in many popular textbooks. As a matter of fact, the whole notion of "statistical inference" is often more of a plague and less of a blessing to research workers.

In contrast to non-scientific statements, which are based on authority, tradition, emotion, etc., scientific statements are expected to rest on empirical evidence. A big problem is that many so-called "empirical evidences" are not evidence at all. For example, see the visual-illusion figures reproduced in Luckiesh. For examples in this regard, see Chapters 3, 4, and 5. In fact, statistics is often classified as a subcategory of lies (Moore). To earn our credit back, let's first consider the following question: Is statistical inference deductive or inductive? Given a fixed population, the answer is simple: (1) if we don't have randomization, then the generalization is inductive; (2) if we do, the generalization is deductive.

In a sample survey, the very act of randomization (mixing the tickets in a box) completely changes the nature of "the unobserved." The generalization is then based on Kolmogorov's probability theory, which is of logic, not of psychology. The summary statistics drawn from a grab set are of great value for scientific endeavors, but they are descriptive, not inferential, statistics. Many people interpret the data of a grab set as a random sample from some hypothetical universe composed of data elements like those at hand.
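The distinction can be made concrete with a small sketch; the population, sample size, and proportion below are invented for illustration. When the tickets are mixed by a genuine random draw, the usual standard-error and interval arithmetic is licensed by probability theory; applied to a grab set, the identical arithmetic would be merely descriptive:

```python
import math
import random

random.seed(0)

# A fixed, finite population of 0/1 tickets (1 = wears a seat belt).
population = [1] * 300 + [0] * 700   # true proportion: 0.30

# Randomization: a simple random sample drawn without replacement.
sample = random.sample(population, 100)
p_hat = sum(sample) / len(sample)
se = math.sqrt(p_hat * (1 - p_hat) / len(sample))

# This 95% interval is justified by the random draw above. Computed on a
# grab set (say, a classroom poll), the same formula yields only a
# descriptive summary, not an inference about any population.
ci = (p_hat - 1.96 * se, p_hat + 1.96 * se)
```

Nothing in the formulas detects whether the data came from a random draw or a grab set; the justification lives entirely in the sampling procedure.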

Put another way, the sample is used to define the population. Such a hypothetical population is only conceptual; it may not exist (see Henkel, p. ). In a small experiment, this author took classroom polls to estimate the percentage of seat-belt usage in New Jersey. It is interesting to compare the results with those reported by two local newspapers, the Princeton Packet and the Trenton Times. The Trenton Times (December 8) reported one figure; its sampling method was not indicated. On the same issue, the Princeton Packet reported another. Which of the above samples were drawn by scientific sampling?

Scientific results in respected journals and statistics books are expected to be more reliable than ordinary newspaper results. However, this reliability is often questionable. It is very difficult to stop people from calculating SEs out of grab sets once they have learned (or think they have learned) the formulas. In a book review I wrote for a publisher, I was surprised to find two prominent statisticians calculating inferential statistics from a grab set; they didn't even bother to ASSUME the randomness of the data.

This psychological indulgence in calculating confidence intervals on grab sets is not uncommon.


Similar examples can be found almost everywhere. This brings up a serious question. The deductive part of statistics (mathematical statistics, probability) is exact science in the sense that it is internally consistent. Without this, applied statistics is only an act of personal belief. The inductive part of statistics can be good science, but it can also be pseudo-science hiding behind a facade of irrelevant computations. Examples in this regard appear in Sir Ronald A. Fisher (quoted by FPP, p. ), in Freedman and Navidi (p. ), and in one example from Thomas Reider (p. ).

The statistical enterprise is complex and often erratic. It includes many diversified fields and a large number of people with different interests, skills, and depths of understanding. Many respectable mathematical statisticians devote their lives to the foundations of mathematical statistics and strictly keep their hands off data. It is not a bad thing to see theoretical statisticians working on applied problems. But a caveat to those statisticians is this: there is often a big gap between statistical theory and reality. For instance, it is very difficult to construct a non-measurable set on the real line (Royden), but in the real world many events are simply not "measurable," or can only be poorly measured.

For example, how is one going to measure a student's progress in a philosophy course? Furthermore, assuming there is a good measure for this purpose, can one apply the Kolmogorov Extension Theorem (Chow and Teicher, pp. )? The above "application" of measurable functions and of the Kolmogorov Extension Theorem may appear silly; but looking at the current state of statistical practice, for instance in a field called survival analysis, many applications of the theory are similarly far-fetched. In fact, in a semi-public setting, a leading statistician asserted that "the whole of survival analysis is ridiculous!"

It is disheartening to see that many theoreticians totally lose their rigorous training in assumptions and solid proofs when they approach real-life data. The principle of randomization is just one of the things they seem to forget or ignore. This attitude toward applying statistics is ethically questionable and is in fact counterproductive to the goal of science. Freedman has long been railing against this inattention. It is good to see that he is getting echoes from different valleys. Many social scientists have been using statistics intensively because their discipline "has a problem legitimizing itself as a science" (Neil Smelser of U.C. Berkeley; quoted from the New York Times, April 28, p. E7). Another sociologist, at the University of Wisconsin, asserted that "the mainstream of sociology is empirical and quantitative" and advocated large-scale computers and large-sample surveys.


Quantitative psychologists, biologists, medical researchers, and economists all use statistics as often as sociologists do. However, a question remains: are current practices of statistical methods scientific or not? Freedman, as is apparent from many of his writings, seems determined to confront the gross misuse of statistical methods. Madansky and others worry that Freedman's quest may put most statisticians out of business. Madansky dubbed Freedman the "neturei karta" (a Hebrew phrase meaning the purist, or the guardian of the city) of statistics, and concluded that if Freedman is successful, then "the State of Israel" (statistics) will cease to exist.

Like Madansky, we were once troubled by Freedman's hard-line position, until the philosophy of science came to the rescue. In the following section, we will try to explain this philosophy and to further popularize Freedman's position. Data of this type are of great value for intellectual exploration. However, the whole practice of "statistical inference" based on this kind of data is problematic and in many ways deceiving. As suggested in FPP, p.