Reading Assignment for Correlation
Data Layout
Recall that when we first opened a datafile that we either acquired from someone else or built ourselves, that we needed to inspect it for "bad data" points, i.e., outliers or mistypes or other errors that distort the data set. The easiest way to clean data in Excel is to Sort a column and onspect the top and bottom for extreme scores that make no sense. When you are performing correlation analyses, there is a slight complication you must heed. Unlike with the first datafile we used (the teacher salary data for tow school districts), data arranged for correlation analysis have rows that are not arbitrary. In our example of teacher salaries, Row 5, for instance, had no particular meaning; it was simply the fifth person whose salary was recorded. And the person in Row 5 for Desertview bore no special relationship to the person in Row 5 for Mountainview. But in a correlation spreadsheet, the situation is different. The rows are tied together, in a sense. The data for each unit occupies its own row in the spreadsheet. For example, in the above table, Sue's IQ score is 115 and her Spelling test score is 34; we can't just arbitrarily put Anita's Spelling score up on Sue's row and move Sue's score down to Anita's row and make any sense of the correlation analysis. Graphing Correlations (Scatter Diagrams)
Later on we'll see how to use Excel to draw scatter plots; but if you want to
try out some examples with an online plotter,
try this one
from John Behrens.
Chapter 8. Sections 8.1 through 8.12. This is one big wad of statistical methods to try and swallow; don't worry if you gag on it the first time. If it happens that you will need regession analysis in life after this course, you'll have occasion to study it again (and again). It is not a simple topic. The statistical technique known as regression is a very widely used technique at the more advanced levels of statistical analysis. It is applied both a) to predict the future values of variables and, in more advanced applications, b) to analyze (decompose) the structure of functional relationships (for example, how students' aspirations, parental support and abilities combine and interact to mutually influence their success in school; or to measure how students' efforts influence their achievement when the influence of their intellectual ability is "removed," i.e., held constant).
The techniques of regression and correlation are closely connected. In fact, the very symbol for correlation, r, stems from this association. (When Karl Pearson devised the measure of correlation between two variables now known as the Pearson product-moment correlation coefficient in the 19th Century, he was working with Francis Galton on the phenomenon of cross-generational regression.)
Suppose that I can calculate the GPA of a high school senior and wish
to use it to predict the student's first-year college GPA. We'll call
the High School GPA by the name X and the Freshman GPA by the
name Y. Suppose further that both X and Y are
transformed to standard score form, i.e., z(Y) = (Y - My)/s(y) where Mx is the mean of X. Please note that z(X) denotes the z-score on X and not the product of z and X; likewise, s(X) is the standard deviation of X and not the product of s and X. Similarly for z(Y) and s(Y). Assume that once we have
information on
z(X) we will use the following simple equation to predict what
z(Y) will be:
In the above equation, the coefficient c is a multiplier that changes z(X) into the best estimate of what z(Y) will someday be. It remains to specify exactly how c is chosen, i.e., by what criterion.
The universally accepted criterion for determining c in
Equation 1 is known as the Least-squares criterion. The
Least-squares criterion states that the best c would be that
c which makes the sum of the squared differences between c
z(X) and a bunch of the actual associated z(Y)'s as small
as possible (minimal, as the mathematician says). So, you see that
finding a value for c will require actually gathering a sample
of X and Y scores (actually measuring High School GPA
and Freshman GPA for a sample of students in a derivation
sample). From this experience, we can find the value of c
that satisfies the Least-squares criterion in the derivation sample and
then hope that it will hold up in future applications to samples where
X is known but Y has yet to occur.
So if one wishes to make a good prediction of a person's standard score on some future variable Y, one should multiply the person's standard score on X by the correlation between X and Y from a derivation sample. Consider a simple example. Adults males in the U.S. have an average height of about 69" with a standard deviaiton of about 2.5". Furthermore, it is well known that the correlation between a father's and his son's height is about .60. How tall will we predict a son to be at adulthood if his father is two standard deviations above average in height?
So a father is who is two standard deviaitons above average in height is predicted to have a son who will grow to 1.2 standard deviaitons above average. Tall fathers are predicted to have shorter sons. Likewse, shorter than average fathers are predicted to have sons who will be taller than their fathers, though the prediction will not exceed the national avaerage. This phenomenon is known as "regression to the mean," and it has a fascinating and checkered history. The Prediction Equation in Raw-Score FormBy algebraically expanding the above standard score form of the prediction equation, we can obtain the Least-squares linear prediction equation in raw-score form, i.e., in a form that can be applied to the data as originally gathered:where sx and sy are the standard deviations of X and Y, respectively,The above prediction or "regression equation" is the algebraic equation of a straight line when graphed in a two-dimensional coordinate system. Every straight line equation has two numbers that determine where it will lie in the graph: the slope and the intercept. Slope is the number that multiplies X and it got that name because as X increases by one unit, the line will rise (or fall, if slope is negative) by the slope. The intercept is where the line cuts the Y-axis (the ordinate of the graph) above an X-value of zero. So, the equation Y = .5(X) + 3 has a slope of .5 and an intercept of +3. So it looks like this:
![]()
Intercept = My (r)(sy/sx)Mx Let's go back to our example of fathers' and sons' heights. Fathers have average height of 69" with a standard deviaiton of 2.5", and so does the generation of their sons. The correlation of heights across the generation is .60. A father who is 6'2" (74") tall will have a son whose height is predicted to be what?
.60(2.5/2.5)74 + 69 .60(2.5/2.5)69 = 72.0" The Standard Error of EstimateThe least-squares predicted value of Y is a best guess, in a sense, of what Y will be when we have X and the prediction equation to do the guessing. That guess, however, may not be very good. The measure of hwo good the predictions can be expected to be is given by an expression called the standard error of estimate, se.
In our example with heights, the standard error of estimate is
= 2.5[Sqroot(.64)] = 2.5(.80) = 2.0" Using Regression Analysis to Remove the Influence of a VariableConsider a situation with which many of us are faced these days. The state, or the U.S. Dept. of Education, or the school district you work for, says that they are going to hold teachers accountable for their students' learningby which they mean, for their students' scores on standardized tests. It doesn't take much reflection to realize that it is unfair to the teachers to compare them in terms of their class's avaerage test score at the end of the school year, because the classes started out at different points, e.g., Jones's Grade 3 tested 3.1 GE in September and 4.4 GE in June, whereas Smith's Grade 3 started at 3.9 GE and finished at 4.5 GE. No one truly thinks that Smith taught more than Jones. [Dear reader: it is, perhaps, more difficult for me to write about these matters in such gross oversimplifications than for you to read them. I have many misgivings about this entire approach to "accountability," but they lie far afield of this lesson in statistics. Nonetheless, this illustration has been chosen because it is distressingly realistic; people are actually advocating such things.] It might be fairer to take the September to June "gain" for the class as the measure of the teacher's contribution: Jones scores 4.4 3.1 = 1.3 GE; Smith scores 4.5 3.9 = 0.6 GE; Jones wins. The simple "gain" approach has disadvantages that are overcome to some extent by the approach that "predicts" June from September and then subtracts the prediction from the actual score and the result is the regression corrected residual gain score for the teachers. In this sense, then, the September class averages are used to "explain" some of the variation in the June class averages; by subtracting the prediction of June from the actual June average, we will have "removed the influence" of September differences from the June differences. The situation we are examining here is not really a "prediction" situation, since both September and June averages are available; there's no need to make any decision about a teacher in advance of obtaining the June average. Nonetheless, we use the regression method remove from the June scores differences in the "inputs" to the class in September." This logicthat of removing the influence of Variable X from Variable Yhas a wide range of applications. In fact, it is applied over and over in economics, the social sciences and in all manner of statistical analyses. (How well it achieves it purpose is hotly debated.) For example, one might wish to study the variation in per pupil expenditures across school ditricts after removing the influence of "percent of students classified as 'Special Education,' since everyone knows that per pupil special educaiton costs are much greater than others. So, if we "regressed per pupil expenditures" onto "% spec. ed."this is how we speak of "predicting" per pupil expend. from "% spec. ed."and then subtracted the predicted per pupil expenditure from the actual per pupil expenditure, the resulting residual would be a comparison of per pupil expenditures of the districts as if they served equal percentages of special education students. Let's go back to the teacher accountability example to see how all this works out with actual numbers. Just suppose that in a relatively small school district the elementary school teachers' classes are tested on the SAT9 on September 1st and again on the following June 1st. (Ignore the fact that students come and students go during the school year, and the fact that some are absent at one time and not the other, and that one of the rural schools only has 4 students in the fifth grade...well, I needn't go on in this vein.) And we obtained the following scores in Reading (Grade Equivalent units) for the teachers:
Now, the summary staistics needed to calculate the regression equation are as follows: n = 48 X: Sept. Mean-X = 4.471 st-dev X = 1.126 Y: June Mean-Y = 5.402 st dev Y = 1.235 r= .892 So the regression equation for estimating June average scores from September average scores is as follows: Let's use Teacher a to illustrate how the regression equation removes the influence of September variation from June variation. Teacher a ( a Grade 3 teacher) had a class average score on SAT9 Reading of 3.2 GE yrs. Consequently, we would predict that Teacher a's class will score the following score in June:
Teacher a's score in June with the influence of the September score removed is
The complete table of residual scores for all 48 teachers follows:
Teacher Sept June Pre-Y Residual
a 3.2 3.5 4.16 -0.66
b 2.8 4.1 3.77 0.33
c 2.5 4.2 3.48 0.72
d 3.4 4.3 4.36 -0.06
e 3.5 3.7 4.45 -0.75
f 3.1 4.3 4.06 0.24
g 2.8 3.1 3.77 -0.67
h 3.9 5.2 4.84 0.36
i 2.5 2.9 3.48 -0.58
j 2.3 3 3.28 -0.28
k 3.2 3.9 4.16 -0.26
l 3.4 3.8 4.36 -0.56
m 4.2 5.7 5.14 0.56
n 3.9 4.3 4.84 -0.54
o 3.8 5.2 4.75 0.45
p 3.6 5.4 4.55 0.85
q 4.6 6.8 5.53 1.27
r 4.1 5.5 5.04 0.46
s 3.9 3.9 4.84 -0.94
t 4.2 4.8 5.14 -0.34
u 4.1 4.6 5.04 -0.44
v 4.5 5.9 5.43 0.47
w 3.8 4.5 4.75 -0.25
x 3.7 5.6 4.65 0.95
y 5.1 6.7 6.02 0.68
z 4.8 5.5 5.72 -0.22
aa 4.7 5.3 5.63 -0.33
bb 5.2 5.7 6.11 -0.41
cc 5.5 6.3 6.41 -0.11
dd 5.1 6.2 6.02 0.18
ee 4.2 6.3 5.14 1.16
ff 4.3 5 5.24 -0.24
gg 5.3 5.9 6.21 -0.31
hh 5.5 6.9 6.41 0.49
ii 4.8 5.4 5.72 -0.32
jj 4.6 5.9 5.53 0.37
kk 5.9 5.9 6.80 -0.90
ll 5.6 6.4 6.51 -0.11
mm 6.1 7.2 6.99 0.21
nn 6.2 7.8 7.09 0.71
oo 5.6 6.3 6.51 -0.21
pp 5.9 6.2 6.80 -0.60
qq 6.1 6.5 6.99 -0.49
rr 5.8 7.1 6.70 0.40
ss 5.7 6.3 6.60 -0.30
tt 6.6 7.3 7.48 -0.18
uu 5.8 6.2 6.70 -0.50
vv 5.2 6.8 6.11 0.69
From this point you can download a Windows Excell Spreadsheet that contains the data above and the results of the regression analysis that Excell performs. Take this Macintosh version if that is your flavor. (You really want to have the data analysis toolpak available in Excell to perform these regression analyses, otherwise you have to make separate calculations of slope and intercept using the fx functions, and then form your own equation to calculate residuals. Putting It All TogetherAssignment 4 |