COE 502: Intro to Quant Methods

Lesson 4
Correlation & Regression

Downloading a New Datafile

Before we get started in learning somethings about the statistical method called "correlation," we'll need a new set of data to practice on and use as an illustration. I have prepared an Excel worksheet that contains a great deal of data on every Elementary School District in Arizona. By clicking on the link immediately below, you should--assuming you are working on a machine properly configured--start a chain of events that will bring this new Excel data file from my server to your computer. If you are asked for a logon id, use "anonymous."
You need to SAVE a copy of the data file (its name when it arrives is something like azelem.xls for Windows and azelem.csv for Macintosh) on your diskette. The datafile has about 100 school districts described by about 20 variables. The names of the variables appear in the spreadsheet. You may also see a description of the variables here.

A Note (Warning) to America Online Users

Reading Assignment for Correlation

Please study carefully, the following Sections in the textbook: 7.1 - 7.13, 7.27; 8.1 - 8.7; 8.9 - 8.10.

Data Layout

The table below is an example of how data are arranged for purposes of correlation. The question that correlation analysis answers is this: "When scores on one variable (X) increase, do the scores on a second variable (Y) also, increase, or do they decrease, or is there no systematic increase or decrease?" For example, are higher cholesterol measures associated with high body weight? Are increases in numbers of hours a student spends on homework (X) associated with increases in the student's GPA (Y)?
In a correlation analysis, there are two variables (X and Y) that are mesured on one group of "units." Units can be persons, families, schools, school districts or other thigns that can be measured in two or more ways. An example using units other than persons might be this: For schools (the units), is school size (X) measured by number of pupils associated with school violence (Y) as measured by number of reported assualts among students?
When data are collected for several units on at least two variables, they can be tabulated like this:
  IQSpelling
Sue11534
Alma87 18
Carlos10428
Anita12126
Hans9619
Linda9920
Louis13626


A Note About Cleaning Datafiles for Correlation Analysis

Recall that when we first opened a datafile that we either acquired from someone else or built ourselves, that we needed to inspect it for "bad data" points, i.e., outliers or mistypes or other errors that distort the data set. The easiest way to clean data in Excel is to Sort a column and onspect the top and bottom for extreme scores that make no sense. When you are performing correlation analyses, there is a slight complication you must heed. Unlike with the first datafile we used (the teacher salary data for tow school districts), data arranged for correlation analysis have rows that are not arbitrary. In our example of teacher salaries, Row 5, for instance, had no particular meaning; it was simply the fifth person whose salary was recorded. And the person in Row 5 for Desertview bore no special relationship to the person in Row 5 for Mountainview. But in a correlation spreadsheet, the situation is different. The rows are tied together, in a sense. The data for each unit occupies its own row in the spreadsheet. For example, in the above table, Sue's IQ score is 115 and her Spelling test score is 34; we can't just arbitrarily put Anita's Spelling score up on Sue's row and move Sue's score down to Anita's row and make any sense of the correlation analysis.

So what does this have to do with sorting the datafile? Well, the Sort function in Excel breaks up the rows. If you Sort the above table in Excel, Anita's IQ score will come out in Row 1 of the IQ variable, but Sue's Spelling score will occupy Row 1 of the Spelling variable. The solution to this problem is simple. Just Select and Copy the entire worksheet into a new Excel worksheet before sorting. Sort and find the bad data points in the temporary worksheet, and make any corrections on the origincal worksheet.

In a correlation analysis, we ask and answer three questions about the relationship between X and Y: a) What is the shape of the relationship between X and Y; b) what is the direction of the relationship between X and Y; and c) what is the strength of the relationship?
Each of these questions can be answered by inspecting a graph of the relationship called a scatter diagram or scatter plot; and questions b) and c) can be given an even more precise answer by calculation of the correlation coefficient.

Graphing Correlations (Scatter Diagrams)

The shape and direction of a correlation can be seen in the scatter diagram. Scatter diagrams can be constructed in Excel, or they can be drawn by hand. Two axes (one for X the other for Y) are drawn at right angles to each other and labeled with the scale of measurement of the two variables X and Y. A unit (person, for example) is then represented as a point in the diagram at the intersection of that unit's X and Y scores. The following picture shows the scatter diagram for the data in the above table on IQ scores and Spelling test scores.

Linear vs. Curvilinear

The first thing to notice about a scatter diagram depicting the correlation of X and Y is whether the relationship is linear (straight line) or curvilinear (curved). The correlation coefficient we will use—the Pearson correlation—only describes linear relationships. If the relationship between X and Y is curved, the Pearson will either misrepresent the relationship or simply make it look weaker than it really is. (Note: coefficients that describe curvilinear relationships are beyond the scope of this course; a reference to how one deals with them appears in footnote #2 on page 93 of your textbook.)
Here is the scatter plot of a linear relationship:

A Linear Scatter Plot

The scatter plot below shows a curvilinear relationship:

A Curvilinear Scatter Plot

The direction of a two-variable relationship can be either direct (positive) or inverse (negative).
  • When higher values of X tend to be paired with higher values of Y, then X and Y are said to be positively correlated.
  • When higher values of X tend to be paired with lower values of Y, then X and Y are said to be negatively correlated.
In the above two scatter plots, the correlation is direct, or positive, in the first example, of a linear relationship. And the relationship is inverse, or negative, in the second example of a curvilinear relationship.

Later on we'll see how to use Excel to draw scatter plots; but if you want to try out some examples with an online plotter, try this one from John Behrens.

r, the Measure of Correlation


       r = Sum[z(X)z(Y)]/(n-1),

          where n is the number of units or pairs of data points,
          z(X) is the z-score for each unit on the X variable, and
          z(Y) is the z-score for each unit on the Y variable.

You'll recall from Lesson 3 that a z-score is the difference between a raw score and its group mean divided by the group's standard deviaiton. But, don't spend too much time contemplating the above formula; computer programs like Excel can calculate the value of the Pearson product-moment correlation coefficient (which r is, for you.
    Pearson's r has the following properties:
  1. r can assume values from -1 to +1, inclusive;
  2. r is positive for direct relationships;
  3. r is negative for indirect relationships;
  4. r is zero (0) when there is no linear relationship between the two variables or when there exists particular kinds of curvilinear relationships between X and Y.
Try this online correlation calculator. Use the data in the table at the beginning of this lesson and verify that the correlation coefficient r equals .615942 . That's rather more decimal accuracy than anyone needs in such circumstance, so we would likely report the result as r = .62 , and say of it that "IQ and Spelling score were positively related with a correlation of .62."
People often ask, "How big does a correlation have to be to be significant, or important, or large, or strong?" There is no simple answer to such questions. Essentially the person is asking, "How do I translate a numerical value of r into words?" The answer to that is , "Don't." No words (of the type "large," "strong," "weak," etc.) will adequately substitute for the picture of the scatter diagram and the reported value of r. Correlation coefficients of .20 can be extremely important, e.g., the correlation between smoking and ling cancer, and other correlations of .95 can be useless, e.g., the correlation between one's height at 8 a.m. and one's height at 10 p.m.
The skill one develops, however, is the skill of inspecting a scatter diagram and calculating the associated value of r so that one can visualize a typical scatter plot when one hears of or reads a value of r--because many values of r are reported without their scatter plots.
Study the following figures in your textbook and try to develop some skill at associating a value of r with the strength of association pictured in the scatter diagram: Figure 6.1 on page 91; Figure 6.3 on page 97; Figure 6.4 on page 98; the five figures on page 101; the figure at the bottom of page 105; Figure 7.1 on page 109; Figure 8.2 on page 126; Figure 8.5 on page 139.
Now you are ready to test yourself. Here is an instructional device, courtesy of the Univ. of Illinois, that helps you learn to relate scatter diagrams and values of r. You may need to run Microsoft Internet Explorer for this one. Have fun with it.
Here's an example of a research report that makes extensive use of correlational analysis and has linked reported values of r to the corresponding scatter plots in an innovative way.

Scale Invariance: An Important Property of r

The Pearson correlation coefficient has an important property: no mater how you would change the scale of measuring either X or Y, so long as you change them only by multiplying or adding, X and Y will still have the same value of r. So, for example, if I tell you that weight in kilograms and height in centimeters have a correlation of .74, you know immediately that weight in pounds and height in inches also correlate .74—because kilograms can be converted to pounds merely by a "linear transformation" (multiplying and/or adding) and so can centimeters be converted to inches (one inch = 2.54 centimeters).
Test out this property of r by entering the numbers in Table A into an online calculator, noting the value of r, then entering the numbers in Table B into the calculator. Note that the numbers in B are just multiples of the numbers in A.

Table A

  IQSpelling
Sue11534
Alma87 18
Carlos10428
Anita12126
Hans9619
Linda9920
Louis13626

Table B

  IQSpelling
Sue11.568
Alma8.7 36
Carlos10.456
Anita12.152
Hans9.638
Linda9.940
Louis13.652

Drawing Scatter Plots and Calculating Correlations with Excel

Excel is a fairly convenient tool for doing both of these.
    To draw a scatter plot
  • With the data showing in the spreadsheet, think of the two variables you want to plot.
  • Select one of these variables by clicking at the head of the column in which it resides.
  • Now, while holding down the Ctrl key, click at the head of the column for the second variable.
  • Now click on the Chart Wizard button on the menu bar at the top of the spreadsheet. (It's next to the two Sort buttons.)
  • You'll now see a menu of chart types; pick "XY Scatter" (see the illustration below.) Click on Next.
  • If prompted to choose a scatter plot type, pick only the type that does not involve connecting dots with lines.
  • Follow your nose at this point, and everything should be OK.

    To calculate the value of r
  • Click on an empty cell in the spreadsheet where you would like the answer to appear.
  • Click on the Function button on the menu bar; it looks like this: f x.
  • Select "Statistical" in the left side of the dialog box and CORREL in the right side.
  • In the next dialog box enter as array1 the location of the X scores, e.g., e2:e150, and as array2 the location of the Y scores, e.g., g2:g150. (Note: the range of these entries will have to match; in other words, you can't have f1:f15 and h1:h25, for example).
  • When you click on Finish, you should see the value of r appear in the cell you chose in the first step above.

Regression

Reading Assignment for Regression

Chapter 8. Sections 8.1 through 8.12. This is one big wad of statistical methods to try and swallow; don't worry if you gag on it the first time. If it happens that you will need regession analysis in life after this course, you'll have occasion to study it again (and again). It is not a simple topic.

The statistical technique known as regression is a very widely used technique at the more advanced levels of statistical analysis. It is applied both a) to predict the future values of variables and, in more advanced applications, b) to analyze (decompose) the structure of functional relationships (for example, how students' aspirations, parental support and abilities combine and interact to mutually influence their success in school; or to measure how students' efforts influence their achievement when the influence of their intellectual ability is "removed," i.e., held constant).

The techniques of regression and correlation are closely connected. In fact, the very symbol for correlation, r, stems from this association. (When Karl Pearson devised the measure of correlation between two variables now known as the Pearson product-moment correlation coefficient in the 19th Century, he was working with Francis Galton on the phenomenon of cross-generational regression.)

Suppose that I can calculate the GPA of a high school senior and wish to use it to predict the student's first-year college GPA. We'll call the High School GPA by the name X and the Freshman GPA by the name Y. Suppose further that both X and Y are transformed to standard score form, i.e.,

z(X) = (X - Mx)/s(x)
z(Y) = (Y - My)/s(y)

                        where Mx is the mean of X.

Please note that z(X) denotes the z-score on X and not the product of z and X; likewise, s(X) is the standard deviation of X and not the product of s and X. Similarly for z(Y) and s(Y).

Assume that once we have information on z(X) we will use the following simple equation to predict what z(Y) will be:

Predicted z(Y) = c z(X)
Equation 1

In the above equation, the coefficient c is a multiplier that changes z(X) into the best estimate of what z(Y) will someday be. It remains to specify exactly how c is chosen, i.e., by what criterion.

The universally accepted criterion for determining c in Equation 1 is known as the Least-squares criterion. The Least-squares criterion states that the best c would be that c which makes the sum of the squared differences between c z(X) and a bunch of the actual associated z(Y)'s as small as possible (minimal, as the mathematician says). So, you see that finding a value for c will require actually gathering a sample of X and Y scores (actually measuring High School GPA and Freshman GPA for a sample of students in a derivation sample). From this experience, we can find the value of c that satisfies the Least-squares criterion in the derivation sample and then hope that it will hold up in future applications to samples where X is known but Y has yet to occur.

In a derivation sample, the value of c that satifies the Least-squares criterion is given by Equation 2 below:
c = Sum[z(X)z(Y)]/(n-1)
The interesting thing about c is that it is precisely the same formula as that for Pearson's product-moment correlation coefficient.
c = rxy

So if one wishes to make a good prediction of a person's standard score on some future variable Y, one should multiply the person's standard score on X by the correlation between X and Y from a derivation sample.

Predicted z(Y) = r[z(X)]

Consider a simple example. Adults males in the U.S. have an average height of about 69" with a standard deviaiton of about 2.5". Furthermore, it is well known that the correlation between a father's and his son's height is about .60. How tall will we predict a son to be at adulthood if his father is two standard deviations above average in height?

Son's Predicted z(Y) = r[z(X)] = .60(2.0) = 1.20

So a father is who is two standard deviaitons above average in height is predicted to have a son who will grow to 1.2 standard deviaitons above average. Tall fathers are predicted to have shorter sons. Likewse, shorter than average fathers are predicted to have sons who will be taller than their fathers, though the prediction will not exceed the national avaerage. This phenomenon is known as "regression to the mean," and it has a fascinating and checkered history.

The Prediction Equation in Raw-Score Form

By algebraically expanding the above standard score form of the prediction equation, we can obtain the Least-squares linear prediction equation in raw-score form, i.e., in a form that can be applied to the data as originally gathered:
Predicted-Y = (r)(sy/sx)X + My – (r)(sy/sx)Mx
where sx and sy are the standard deviations of X and Y, respectively,
and Mx and My are the means of X and Y.
The above prediction or "regression equation" is the algebraic equation of a straight line when graphed in a two-dimensional coordinate system. Every straight line equation has two numbers that determine where it will lie in the graph: the slope and the intercept. Slope is the number that multiplies X and it got that name because as X increases by one unit, the line will rise (or fall, if slope is negative) by the slope. The intercept is where the line cuts the Y-axis (the ordinate of the graph) above an X-value of zero. So, the equation

Y = .5(X) + 3 has a slope of .5 and an intercept of +3. So it looks like this:

Consequently, for the least-squares regression line, the

Slope = (r)(sy/sx), and the
Intercept = My – (r)(sy/sx)Mx

Let's go back to our example of fathers' and sons' heights. Fathers have average height of 69" with a standard deviaiton of 2.5", and so does the generation of their sons. The correlation of heights across the generation is .60. A father who is 6'2" (74") tall will have a son whose height is predicted to be what?

Predicted Son's height = .60(2.5/2.5)X + 69 – .60(2.5/2.5)69 =

.60(2.5/2.5)74 + 69 – .60(2.5/2.5)69 =

72.0"

So a father who is 74" tall is predicted to have a son who is 72.0" tall.

The Standard Error of Estimate

The least-squares predicted value of Y is a best guess, in a sense, of what Y will be when we have X and the prediction equation to do the guessing. That guess, however, may not be very good. The measure of hwo good the predictions can be expected to be is given by an expression called the standard error of estimate, se.

se = sy[Sqroot(1 – r2)].
Here is the use that is made of the standard error of estimate. For all those sons predicted to have a height of 70.8", 68% of them will have actual heights between 70.8" plus se and 70.8" minus se. Furthermore, 95% of the sons predicted to be 70.8" tall will have actual adult heights between 70.8" plus 2se and 70.8" minus 2se. (If "one standard ... 68%" and "two standard ... 95%" ring a bell, they should. This interpretation is based on the assumption that for fathers of any given height, say, 74", their sons heights will be normally distributed.)

In our example with heights, the standard error of estimate is

se = 2.5[Sqroot(1 – .602)]

= 2.5[Sqroot(.64)]

= 2.5(.80)

= 2.0"

Our prediction of sons' heights then takes this form: For fathers who are 74" tall, we predict that their sons will be 70.8" tall and two-thirds (68%) of them will have heights between 68.8" and 72.8" (plus and minus one standard error of estimate).

Using Regression Analysis to Remove the Influence of a Variable

Consider a situation with which many of us are faced these days. The state, or the U.S. Dept. of Education, or the school district you work for, says that they are going to hold teachers accountable for their students' learning—by which they mean, for their students' scores on standardized tests. It doesn't take much reflection to realize that it is unfair to the teachers to compare them in terms of their class's avaerage test score at the end of the school year, because the classes started out at different points, e.g., Jones's Grade 3 tested 3.1 GE in September and 4.4 GE in June, whereas Smith's Grade 3 started at 3.9 GE and finished at 4.5 GE. No one truly thinks that Smith taught more than Jones. [Dear reader: it is, perhaps, more difficult for me to write about these matters in such gross oversimplifications than for you to read them. I have many misgivings about this entire approach to "accountability," but they lie far afield of this lesson in statistics. Nonetheless, this illustration has been chosen because it is distressingly realistic; people are actually advocating such things.]

It might be fairer to take the September to June "gain" for the class as the measure of the teacher's contribution: Jones scores 4.4 – 3.1 = 1.3 GE; Smith scores 4.5 – 3.9 = 0.6 GE; Jones wins.

The simple "gain" approach has disadvantages that are overcome to some extent by the approach that "predicts" June from September and then subtracts the prediction from the actual score and the result is the regression corrected residual gain score for the teachers. In this sense, then, the September class averages are used to "explain" some of the variation in the June class averages; by subtracting the prediction of June from the actual June average, we will have "removed the influence" of September differences from the June differences. The situation we are examining here is not really a "prediction" situation, since both September and June averages are available; there's no need to make any decision about a teacher in advance of obtaining the June average. Nonetheless, we use the regression method remove from the June scores differences in the "inputs" to the class in September."

This logic—that of removing the influence of Variable X from Variable Y—has a wide range of applications. In fact, it is applied over and over in economics, the social sciences and in all manner of statistical analyses. (How well it achieves it purpose is hotly debated.) For example, one might wish to study the variation in per pupil expenditures across school ditricts after removing the influence of "percent of students classified as 'Special Education,' since everyone knows that per pupil special educaiton costs are much greater than others. So, if we "regressed per pupil expenditures" onto "% spec. ed."—this is how we speak of "predicting" per pupil expend. from "% spec. ed."—and then subtracted the predicted per pupil expenditure from the actual per pupil expenditure, the resulting residual would be a comparison of per pupil expenditures of the districts as if they served equal percentages of special education students.

Let's go back to the teacher accountability example to see how all this works out with actual numbers. Just suppose that in a relatively small school district the elementary school teachers' classes are tested on the SAT9 on September 1st and again on the following June 1st. (Ignore the fact that students come and students go during the school year, and the fact that some are absent at one time and not the other, and that one of the rural schools only has 4 students in the fifth grade...well, I needn't go on in this vein.) And we obtained the following scores in Reading (Grade Equivalent units) for the teachers:

   
September-to-June Class Average Test Scores (SAT9 Reading)
 for Elementary School Teachers (Grades 3 - 6) in the 
 Brookside Unified School District

     Teacher Sept(X) June(Y)

      a       3.2     3.5
      b       2.8     4.1
      c       2.5     4.2
      d       3.4     4.3
      e       3.5     3.7
      f       3.1     4.3
      g       2.8     3.1
      h       3.9     5.2
      i       2.5     2.9
      j       2.3     3.0
      k       3.2     3.9
      l       3.4     3.8
      m       4.2     5.7
      n       3.9     4.3
      o       3.8     5.2
      p       3.6     5.4
      q       4.6     6.8
      r       4.1     5.5
      s       3.9     3.9
      t       4.2     4.8
      u       4.1     4.6
      v       4.5     5.9
      w       3.8     4.5
      x       3.7     5.6
      y       5.1     6.7
      z       4.8     5.5
      aa      4.7     5.3
      bb      5.2     5.7
      cc      5.5     6.3
      dd      5.1     6.2
      ee      4.2     6.3
      ff      4.3     5.0
      gg      5.3     5.9
      hh      5.5     6.9
      ii      4.8     5.4
      jj      4.6     5.9
      kk      5.9     5.9
      ll      5.6     6.4
      mm      6.1     7.2
      nn      6.2     7.8
      oo      5.6     6.3
      pp      5.9     6.2
      qq      6.1     6.5
      rr      5.8     7.1
      ss      5.7     6.3
      tt      6.6     7.3
      uu      5.8     6.2
      vv      5.2     6.8

Now, the summary staistics needed to calculate the regression equation are as follows:

n = 48
X: Sept.  Mean-X = 4.471     st-dev X = 1.126
Y: June   Mean-Y = 5.402     st dev Y = 1.235
r= .892

So the regression equation for estimating June average scores from September average scores is as follows:

Predicted Y = .978(X)  +  1.03

Let's use Teacher a to illustrate how the regression equation removes the influence of September variation from June variation. Teacher a ( a Grade 3 teacher) had a class average score on SAT9 Reading of 3.2 GE yrs. Consequently, we would predict that Teacher a's class will score the following score in June:

Predicted Y = .978(3.2)   +   1.03 = 4.16 GE yrs.

Teacher a's score in June with the influence of the September score removed is

3.5 - 4.16 = -.67
This corrected score is called a residual around the regression line and what it means, in this instance, is that Teacher a's class scored .67 GE yrs (or almost 7 grade-equivalent months) below what would be expected. In other words, taking out the influence of September status, Teacher a is performing 6.7 GE months below expectation.

The complete table of residual scores for all 48 teachers follows:

     Teacher Sept    June    Pre-Y   Residual

     a       3.2     3.5     4.16    -0.66
     b       2.8     4.1     3.77     0.33
     c       2.5     4.2     3.48     0.72
     d       3.4     4.3     4.36    -0.06
     e       3.5     3.7     4.45    -0.75
     f       3.1     4.3     4.06     0.24
     g       2.8     3.1     3.77    -0.67
     h       3.9     5.2     4.84     0.36
     i       2.5     2.9     3.48    -0.58
     j       2.3     3       3.28    -0.28
     k       3.2     3.9     4.16    -0.26
     l       3.4     3.8     4.36    -0.56
     m       4.2     5.7     5.14     0.56
     n       3.9     4.3     4.84    -0.54
     o       3.8     5.2     4.75     0.45
     p       3.6     5.4     4.55     0.85
     q       4.6     6.8     5.53     1.27
     r       4.1     5.5     5.04     0.46
     s       3.9     3.9     4.84    -0.94
     t       4.2     4.8     5.14    -0.34
     u       4.1     4.6     5.04    -0.44
     v       4.5     5.9     5.43     0.47
     w       3.8     4.5     4.75    -0.25
     x       3.7     5.6     4.65     0.95
     y       5.1     6.7     6.02     0.68
     z       4.8     5.5     5.72    -0.22
     aa      4.7     5.3     5.63    -0.33
     bb      5.2     5.7     6.11    -0.41
     cc      5.5     6.3     6.41    -0.11
     dd      5.1     6.2     6.02     0.18
     ee      4.2     6.3     5.14     1.16
     ff      4.3     5       5.24    -0.24
     gg      5.3     5.9     6.21    -0.31
     hh      5.5     6.9     6.41     0.49
     ii      4.8     5.4     5.72    -0.32
     jj      4.6     5.9     5.53     0.37
     kk      5.9     5.9     6.80    -0.90
     ll      5.6     6.4     6.51    -0.11
     mm      6.1     7.2     6.99     0.21
     nn      6.2     7.8     7.09     0.71
     oo      5.6     6.3     6.51    -0.21
     pp      5.9     6.2     6.80    -0.60
     qq      6.1     6.5     6.99    -0.49
     rr      5.8     7.1     6.70     0.40
     ss      5.7     6.3     6.60    -0.30
     tt      6.6     7.3     7.48    -0.18
     uu      5.8     6.2     6.70    -0.50
     vv      5.2     6.8     6.11     0.69

From this point you can download a Windows Excell Spreadsheet that contains the data above and the results of the regression analysis that Excell performs. Take this Macintosh version if that is your flavor. (You really want to have the data analysis toolpak available in Excell to perform these regression analyses, otherwise you have to make separate calculations of slope and intercept using the fx functions, and then form your own equation to calculate residuals.

Putting It All Together

Here is a Java instructional program created by John Behrens and his students that allows you to enter data, see it plotted in a scatter diagram and observe the calculated value of the correlation coefficient. If your browser will handle Java, you can learn a lot about correlation by playing with this program for a few minutes. (The program works well in Microsoft's Internet Explorer 4.0 or better, and may not work so well in Netscape. In fact, it doesn't work at all in my Netscape Communicator 4.51 for Windows. And I notice that even Netscape 6.2 requires the installation of a special "plug-in," that is a bit of a hassle.)

Assignment 4

For this assignment, you will need to download an Excell datafile that contains test scores (GRE-V and Miller Analogies Test) for a sample of about three dozen graduate students: