Lesson 6
Introduction to Statistical Inference

Reading Assignment

The following sections in the textbook: 10.1 - 10.16; 13.1 - 13.9; 13.12.

Statistical Inference

The last portion of this course—and the later stage of most introductory statistics courses—deals with statistical inference.
Statistical inference combines the methods of descriptive statistics with the theory of probability for the purpose of learning what samples of data tell about the characteristics of populations from which they were drawn.

Probability

It is unnecessary to understand the theory of probability to acquire a good working knowledge of inferential statistics. So little more will be said about probability theory per se other than that your intuitive notion will suffice. The probability of an event is the proportion of times it will occur in a long string of independent opportunities. Flip a fair coin 10,000 times. Toss a pair of dice 1,000 times. Guess at the answers to the 500 true-false questions on your final exam.
  • The probability of a coin coming up heads is .50
  • The probability of a thrown die coming up with a 6 is 1/6 = .17
  • The probability that a baby born is male is about .51
Enough said.
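The long-run idea can be tried out in a few lines of Python. This is only a sketch: the function name `proportion_of_heads` and the fixed seed are choices made here for illustration, and the computer's pseudo-random generator stands in for a real coin.

```python
import random

def proportion_of_heads(flips, seed=0):
    """Flip a simulated fair coin `flips` times and return the share of heads."""
    rng = random.Random(seed)  # fixed seed (an arbitrary choice) so the run can be repeated
    heads = sum(1 for _ in range(flips) if rng.random() < 0.5)
    return heads / flips

# The proportion tends to settle toward .50 as the string of flips gets longer.
for flips in (10, 100, 10_000):
    print(flips, proportion_of_heads(flips))
```

With only 10 flips the proportion can land well away from .50; with 10,000 it should sit close to it.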

Populations & Samples; Parameters & Statistics

The statistical characteristics of populations are called parameters. The mean, variance and correlation of x and y in a population are examples of parameters. It is conventional to represent parameters with Greek letters (such as mu, μ; pi, π; and rho, ρ, for mean, proportion and correlation, respectively). Note the parameters in the above figure.

Samples are taken from populations to learn something about the parameters. Naturally, one would want the sample to be representative of the population if it is to reflect on the population's characteristics. But if the population is so unknown as to have to be sampled to learn about it, then how could it be known well enough to determine whether a sample was representative of it? One could, perhaps, reason that the population of U.S. voters is roughly half male and half female so that a sample of voters should likewise be split 50-50 between the sexes. But if sex is unrelated to candidate preference, it counts not at all in favor of the sample's representativeness that it too is half male and half female.
So, the inferential problem confronts a dilemma immediately. If the population is unknown, how can one have confidence that a particular sample from it is representative of it? The solution to the dilemma uses the theory of probability. If the sample is drawn according to the laws of probability, then the degree to which the sample is representative of the population can be calculated in probabilistic terms. One will then be able to say, for example, that the probability is 95% that the sample is representative of the population to a certain degree. Hence, the concept of a "representative sample" is replaced by the concept of a "randomly representative sample."

Random Samples

A random sample is a sample drawn in such a way that each element of the population has an equal and independent chance of being included in the sample—or so the statistician says. So, how do you draw a random sample?
    To draw a simple random sample you must
  • Give each element of the population an ID number
  • Use a table of random digits to select those elements that enter the sample
For example, the 589 school districts in your state are given ID numbers from 001 through 589. You go to a table of random digits in the back of a statistics book and you close your eyes and pick a place to start other than the upper-left of the page. If the first three digits you come to are 412, then school district #412 on your list goes in the sample. Continue in this way until you have drawn all the elements in your sample. (How many? No easy answer. As many as you can afford, unless that gets to be too many. How many is too many? Again, no easy answer. There are lots of phony answers in statistics books about the "right" number of cases to pick for a sample but these answers are generally highly arbitrary and made to look less so. Don't trust them.)
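The same mechanical procedure can be sketched in Python, with the computer's pseudo-random generator standing in for the table of random digits. The function name `draw_simple_random_sample`, the sample size of 25, and the seed are all assumptions made for this illustration; nothing here says 25 is the "right" number.

```python
import random

def draw_simple_random_sample(population_size, sample_size, seed=None):
    """Return sorted ID numbers chosen without replacement from 1..population_size."""
    rng = random.Random(seed)
    ids = range(1, population_size + 1)  # ID numbers 1 through population_size
    return sorted(rng.sample(ids, sample_size))

# 589 school districts with IDs 001 through 589; 25 is an arbitrary sample size.
sample = draw_simple_random_sample(589, 25, seed=1)
print([f"{district_id:03d}" for district_id in sample])
```

Each district has the same chance of being picked, and no district can be picked twice.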

Randomness

Almost nothing other than this tedious process of assigning ID numbers and resorting to tables of random digits will do (though sometimes drawing slips of paper out of a hat isn't a bad approximation to a random sample). You must trust a mechanical process to pick the elements that will be sampled, because humans are incapable of random behavior. Even when you think you are achieving randomness, you aren't.
Do this exercise before reading further:
Drawing from the digits 0 through 9 with repeats, write a string of 20 digits in random order, i.e., 2,6,1,8 etc. Just write them on a scrap of paper right now. Then read on.

Humans are incapable of behaving randomly. Even when what they do feels haphazard, further analysis reveals unconscious patterns and regularities. For example, in the above exercise, you undoubtedly produced digits with certain nonrandom patterns. Here is a section of truly random digits from a table in a textbook:

                  6036 5946 4653 3507 5339 
                  4942 6142 9297 0191 8283
                  1683 7994 2402 5662 3344 
                  4234 9944 1374 7007 1147
                  3632 9600 7405 3640 9832 
                  3299 3854 1600 1113 3075

Now take a look at the string of 20 digits you tried to write down at random and see if there is even a single instance of a digit appearing beside itself, e.g., 2,5,5,1,7. You probably didn't repeat any digit beside itself because something in you that was producing these digits felt that that wouldn't be random enough. In fact, digits repeat beside themselves quite often in truly random sequences. Look at the above set of six strings of 20 random digits. In only one of the six is a digit not repeated beside itself (in the second string), and in most of the other strings repeated digits occur more than once. Look at the last string! 99, 00, 111? Does that look random? Not to me; and not to you, I'd guess, but it is. Moral: Don't try to select randomly by yourself; use a mechanical procedure like a table of random digits.
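The count of side-by-side repeats in the six strings above can be checked mechanically. The sketch below is an illustration added here (the helper name `adjacent_repeats` is hypothetical); it also computes the chance that 20 truly random digits contain no side-by-side repeat at all, assuming each digit is independent of the one before it.

```python
# The six strings of 20 random digits from the table above.
rows = [
    "6036 5946 4653 3507 5339",
    "4942 6142 9297 0191 8283",
    "1683 7994 2402 5662 3344",
    "4234 9944 1374 7007 1147",
    "3632 9600 7405 3640 9832",
    "3299 3854 1600 1113 3075",
]

def adjacent_repeats(digits):
    """Count positions where a digit equals the digit just before it."""
    return sum(1 for a, b in zip(digits, digits[1:]) if a == b)

counts = [adjacent_repeats(row.replace(" ", "")) for row in rows]
print(counts)  # [2, 0, 4, 4, 1, 5]

# For no repeat at all, each of the 19 neighboring pairs must differ,
# which happens with probability 9/10 per pair.
print(round(0.9 ** 19, 3))  # 0.135
```

So a repeat-free string of 20 digits, like the one you probably wrote, should turn up only about one time in seven.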

The Form of Inferential Statistical Results

Inferential statistics come in two varieties: interval estimation and hypothesis testing.

Interval Estimation

This form of statistical inference produces an interval of values (e.g., -.12 to +.35) by a process that has a known probability of including the true but unknown parameter value on the interval (e.g., the value of a correlation coefficient in a population). The interval is known as a confidence interval and any confidence interval has associated with it a confidence coefficient that gives the probability that the interval will capture the parameter. The confidence coefficient is under the control of the data analyst and typically assumes values near 1.0 like .90, .95 and .99. A typical result of interval estimation might take this form:
A random sample of 185 males is drawn from the U.S. population and their heights are measured. The average is 68.34 inches. The 95% confidence interval for the population mean extends from 66.04 inches to 70.64 inches.
In the above example, the details of how the confidence interval on the population mean is calculated are omitted.
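What "a process that has a known probability of including the parameter" means can be seen by simulation. The sketch below is an illustration under loudly stated assumptions: a made-up normal population with mean 68 and standard deviation 3 (not the figures from the example above), a standard deviation treated as known, and the interval mean ± 1.96 × st.dev./sqrt(n). The function name is hypothetical.

```python
import math
import random

def coverage_of_95_percent_interval(trials=2000, n=185, mu=68.0, sigma=3.0, seed=0):
    """Share of simulated samples whose interval around the sample mean contains mu."""
    rng = random.Random(seed)
    half_width = 1.96 * sigma / math.sqrt(n)  # assumes sigma is known
    hits = 0
    for _ in range(trials):
        # Draw one random sample of n cases from the hypothetical population.
        sample_mean = sum(rng.gauss(mu, sigma) for _ in range(n)) / n
        if sample_mean - half_width <= mu <= sample_mean + half_width:
            hits += 1
    return hits / trials

print(coverage_of_95_percent_interval())
```

The printed share should come out near .95: any single interval either captures the parameter or misses it, but the procedure captures it about 95 times in 100.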

Confidence Intervals for Proportions, Correlations & Means

Correlations

Suppose that you draw a random sample of n cases from a population and you wish to learn something about the correlation between two variables, X and Y, in that population. From the sample you can calculate the estimate of the population correlation coefficient; call that estimate, as usual, r. We wish to calculate the 95% confidence interval around r for estimating the population correlation, ρ. There are several ways to obtain this confidence interval.
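One widely used way, sketched here as an illustration and not necessarily one of the methods this lesson has in mind, is Fisher's z transformation: convert r to z, build the interval on the z scale with a standard error of about 1/sqrt(n - 3), and convert back. The function name and the values r = .12 and n = 75 are hypothetical.

```python
import math

def correlation_ci_95(r, n):
    """Approximate 95% CI for a population correlation via Fisher's z transformation."""
    z = math.atanh(r)              # Fisher's z = 0.5 * ln((1 + r) / (1 - r))
    se = 1.0 / math.sqrt(n - 3)    # approximate standard error of z
    lower_z, upper_z = z - 1.96 * se, z + 1.96 * se
    # Transform the limits back to the correlation scale.
    return math.tanh(lower_z), math.tanh(upper_z)

# Hypothetical sample: r = .12 from n = 75 cases.
print(correlation_ci_95(0.12, 75))
```

Because the limits are transformed back from the z scale, the interval is not symmetric around r unless r is zero, and it can never run past -1 or +1.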

Proportions

Suppose you draw a random sample of n cases from a population in which the proportion of elements possessing some characteristic ("Are left-handed" for example) is equal to some unknown number, call it π. In the sample, the proportion of elements (persons, say) with the characteristic is p. You wish to construct an interval by a procedure that has a known probability of including π. If that probability is 95%, for example, then you wish to find the "95% confidence interval on π around p."
There are at least three ways to get the job done. I present all three here, but you may find the third method below the easiest and most convenient.
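As a rough illustration of what such a calculation looks like, here is the common large-sample (normal approximation) interval, p ± 1.96 × sqrt(p(1 - p)/n). It is an assumption that this resembles any of the three methods referred to above; it is known to behave poorly when n is small or p is near 0 or 1. The function name and the sample values are hypothetical.

```python
import math

def proportion_ci_95(p, n):
    """Large-sample 95% CI for a population proportion (normal approximation)."""
    half_width = 1.96 * math.sqrt(p * (1 - p) / n)
    # Clip the limits to the 0-to-1 range a proportion can take.
    return max(0.0, p - half_width), min(1.0, p + half_width)

# Hypothetical sample: 12% left-handed among n = 200 persons.
lower, upper = proportion_ci_95(0.12, 200)
print(round(lower, 3), round(upper, 3))  # 0.075 0.165
```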

Confidence Intervals on Means

Suppose a random sample of n cases is drawn from a population that has a mean of μ. The 95% confidence interval around the sample mean is constructed by calculating a formula like the following:
Lower-limit of 95% CI on μ = Mean - t(st.dev.)/sqrt(n), and

Upper-limit of 95% CI on μ = Mean + t(st.dev.)/sqrt(n),

where "sqrt" stands for "square root" and t is a number roughly equal to 2 that depends on the size of the sample (for an n of 100, t equals about 1.98, and for an n of 15, t equals about 2.14).
Consider this example. A random sample of n = 50 cases is drawn from the population of all beginning 5th grade students in the Mesa School District. The research office is interested in checking on whether beginning 5th-graders in their district score at the national norm level (5.0) in spelling, a subtest of the Language Arts standardized test. In the sample of 50 pupils, the mean equals 4.72 and the standard deviation is 1.15. Using the above formulas with t taken as 2, the limits of the 95% confidence interval for the mean of all 5th-graders in the Mesa School District are as follows:
Lower-limit of 95% CI on μ = 4.72 - 2 (1.15/7.07) = 4.72 - .33 = 4.39.

Upper-limit of 95% CI on μ = 4.72 + 2 (1.15/7.07) = 4.72 + .33 = 5.05.

The 95% confidence interval on the population mean, μ, extends from 4.39 to 5.05, and since it includes the national norm value of 5.0, the researchers conclude that there is no reason to suspect that the Mesa 5th-graders are below norm in spelling.
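The Mesa calculation can be reproduced in a few lines. This is a minimal sketch, with a hypothetical function name, that uses the same rough value of t = 2 as the worked example in place of an exact table value.

```python
import math

def mean_ci(mean, st_dev, n, t=2.0):
    """Confidence limits: mean -/+ t * st.dev. / sqrt(n), with t supplied by the user."""
    half_width = t * st_dev / math.sqrt(n)
    return mean - half_width, mean + half_width

# Mesa School District sample: n = 50, mean = 4.72, standard deviation = 1.15.
lower, upper = mean_ci(4.72, 1.15, 50)
print(round(lower, 2), round(upper, 2))  # 4.39 5.05
```

Passing a more exact t for the sample size at hand (via the `t` argument) would change the limits only slightly.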

You can use the form supplied here to submit data and receive in return the 95% Confidence Interval on a Mean.

Chi-square Test of Association for Contingency Tables

Statistical inference with contingency tables presents a slightly different set of problems from those addressed with confidence interval estimation. In particular, the inferential question that we ask about a contingency table is a complex relational question concerning several population proportions. So, rather than calculate confidence intervals, we calculate what is called a "test statistic" (in this case a chi-square test statistic); this chi-square statistic is used to reference a table of probabilities that will tell us how probable are the different possible answers to our question.
But first, what is the question we ask of the population contingency table? It is called the "hypothesis of independence" or the "hypothesis of no association." The reference to independence or association is to the factors used to classify the cases in the table. For example, Sex and Political Affiliation, or Opinion on Abortion (Favor v. Disfavor) and Church Membership (Y v. N).
What does it mean for the two factors of classification in a contingency table to be independent or not associated? Suppose that 45% of all women are registered Republicans and 45% of all men are registered Republicans. Then, whether a person is male or female, we can say, they are 45% likely to be a registered Republican. This last statement is not conditional on whether a person is a male or female; regardless of Sex, a person is 45% likely to be a registered Republican. We would say, then, that Sex and Registered Political Affiliation (Republican or Democrat) are "independent" or "not associated." Suppose, however, that 40% of all women in the population are registered Republicans, but 50% of all men are registered Republicans. Then, in this case, Sex and Registered Political Affiliation are "not independent"; rather, they are "associated." We say this because the likelihood of being a registered Republican depends on one's sex; it's 40% if you are female and it's 50% if you are male.
In which of the following populations is the hypothesis of "No association" or "independence" between the row and column classifications true?
  • Example 1
                     Male    Female
    Golfer           30%     25%
    Swimmer          25%     30%
    Tennis Player    45%     45%

  • Example 2
                               Republican    Democrat
    Favors "School Choice"     50%           50%
    Opposes "School Choice"    50%           50%

  • Example 3
                    Boy    Girl
    Left-handed     11%    11%
    Right-handed    89%    89%

Consult the correct answers here and see how you did.

Note that the above three tables show percents in a population of persons. In reality, we have only samples of cases to tabulate in a contingency table. So, we need a way of deciding whether the sample data are consistent or inconsistent with a population in which the two factors of classification are independent. The chi-square calculation and statistic provides that way.
For a sample contingency table, the chi-square statistic is calculated and if it is very large in value (it must always be positive or at least zero), we reject the hypothesis of independence in the population from which the sample was randomly drawn. If the chi-square statistic is small, then we accept the hypothesis of independence of the factors of classification in the population. Whether the calculated value of the chi-square statistic is small or large is decided by the probability of obtaining at least the value of chi-square that was observed when the hypothesis of independence in the population is true. This probability routinely accompanies the reported value of the chi-square statistic.
An example may help:
Suppose that we draw a random sample of 200 school administrators in Arizona and classify them as follows:

                  Male    Female
Principal         80      39
Asst. Super.      46      8
Superintendent    24      3

As you can see, in the sample of 200 persons, there are 80 male school principals, 46 male assistant superintendents, etc.
Now, for the above data, the calculated value of chi-square (using either the method in Table 12.3 on page 243 of your textbook or the online statistical analysis programs linked to below) is

Chi-square = 9.60 Prob. = .0082

What this calculation shows is that the probability of obtaining a chi-square statistic this large when there is independence of Sex and Administrative Role in the population sampled is about 8 in 1000; hence, it is very improbable that Sex and Administrative Role are independent in the population of school administrators in Arizona. By studying the table you can see that a woman is relatively more likely to be a school principal and less likely to be an assistant superintendent or superintendent than a man is.
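For readers who want to see where the 9.60 comes from, here is a sketch of the usual calculation: for each cell, the expected count under independence is (row total × column total) / grand total, and chi-square sums (observed - expected)² / expected over the cells. The function name is hypothetical, and the probability line relies on a shortcut that holds only when there are exactly 2 degrees of freedom, as there are for this 3-by-2 table.

```python
import math

def chi_square_independence(table):
    """Return (chi-square, degrees of freedom) for a table of counts given as rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    chi_square = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Count expected in this cell if rows and columns were independent.
            expected = row_totals[i] * col_totals[j] / grand_total
            chi_square += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return chi_square, df

# Rows: Principal, Asst. Super., Superintendent; columns: Male, Female.
counts = [[80, 39], [46, 8], [24, 3]]
chi_square, df = chi_square_independence(counts)

# With exactly 2 degrees of freedom the upper-tail probability is exp(-chi_square / 2).
# This shortcut does NOT apply to other degrees of freedom.
prob = math.exp(-chi_square / 2)
print(round(chi_square, 2), df, round(prob, 4))  # 9.6 2 0.0082
```

For tables with other dimensions, the probability has to come from a chi-square table or a statistics program.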
When is the chi-square test significant? In this example, the probability of obtaining the sample chi-square value was so small (8 in 1000) when there is independence in the population that there is hardly any question that independence of Sex and Administrative Role does not exist in the population. But what if the probability of the chi-square value had been 1 in 100 (.01) or 5 in 100 (.05) or .10, .20 or even .30? At which point does one say, "Yes, this chi-square value is sufficiently probable to be seen when the population has independence that I'll conclude that the two factors are indeed independent"? Well, there is no single probability that separates "significant sample association" from "non-significant sample association." Different circumstances will call for different degrees of evidence. It is conventional to conclude that the two factors of classification are associated (non-independent) in the population when the probability of the chi-square value obtained is .05 or smaller. But it is a convention honored by history only and not by good reasons. Perhaps the best one can do is to report the probability of the chi-square statistic and let all who wish to make their arguments do so.

Online Calculation of the Chi-square Test of Association

Unfortunately, Excel does not have a chi-square contingency table significance test built into it—or at least, I can't find one. Fortunately, there are several places online where you can enter the counts from the cells of your contingency table and have the chi-square significance test calculations made for you.

Assignment 6

Use this form to complete Assignment #6 and submit your work.
