Lesson 1
Tabulating and Graphing Data

Downloading & Cleaning the Datafile

Before we get started in learning somethings about statistical methods, we'll need a set of data to practice on and use as an illustration. I have prepared an Excel worksheet that contains the salaries of several hundred teachers in two school districts: Desertview and Mountainview (don't look for these districts in your directory, they are fictitious). By clicking on the link immediately below, you should--assuming you are working on a machine properly configured--start a chain of events that will bring the Excel data file from my server to your computer. If you are asked for a logon id, use "anonymous."
You need to SAVE a copy of the data file (its name when it arrives is something like sal.xls for Windows and sal.csv for Macintosh) on your diskette. The computer you are working on may either ask you before downloading if you want to "Save to Disk" or "Open in Excel". If you have the choice, select the Save option and choose your diskette in the floppy drive as the destination for saving. If the data file is opened immediately upon downlaoding in Excel, one of the first things you should do is Save it onto your floppy diskette (in Excel, go to the top menu bab, click on File, then click on Save As..., then direct the saving to your floppy.) It's not a bad idea to save a coy of sal.xls to two different floppies as a backup. If you do lose the data file, you can always come back to this point an dpwnload a new copy.

A Note (Warning) to America Online Users

Reading Assignment

Please study carefully, the following Sections in the textbook: 2.1-2.7, 3.1-3.13.
"Open" the datafile by starting the Excel program, then selecting File --> Open from the top menu bar. (This notation with an arrow is a way of indicating that the Open option will be found on a submenu of the File option.)
The first job of any data analysis--one that is so simple that it is frequently overlooked or give short shrift--is to inspect the data for "bad" values. Bad data points are entries in a datafile that are unreasonable and represent clerical errors or misunderstandings by persons collecting or reporting the raw data, e.g., 1340 lbs. in a datafile of persons' weights. In nearly every case, bad data points are simply deleted before data analysis begins. This is known as "cleaning the datafile." So your first job is to clean the teacher salaries datafile that you have been given.
Fortunately, Excel (and nearly any spreadsheet program) is a wonderful program for detecting bad data points. About all you need to do is
  • Click in the lettered box at the head of a particular column in the spreadsheet;
  • Select Data --> Sort from the top menu bar;
  • And sort the column of data in either ascending or descending order.
The bad data points will appear at either the top or bottom (or both) of the column as "outliers," i.e., numbers that fall so far from the bulk of the remaining numbers that they raise suspicions as to their accuracy. In the case of teacher salaries, you can probably judge for yourself whether the extreme numbers are reasonable or not. When you find a bad data point or egregious "outlier," you can eliminate it in a couple of ways: highlight the cell it is in by clicking on that cell, and either delete the entry by pressing the backspace key or selecting Edit --> Cut from the top menu. Please make a note of the data points you eliminate. Later I will ask you to send them to me.
Once you have "cleaned the data" (be sure to clean both Desertview and Mountainview data), be sure to File --> Save a cleaned copy of the datafile to your floppy diskette.

Percentiles, 5-Number Summaries, Box-and-Whisker Plots,
Frequency Distributions, Histograms

The above are all ways of describing sets of scores (or "observations of variables" to use the technical term). Box-and-Whisker plots and 5-number Summaries are constructed from percentiles; Histograms are constructed from Frequency Distributions. So let's take these two groups of things separately.

Percentiles and Other Things

A percentile is just a score that has a certain percent of the cases below it: the 75th Percentile is the score below which 75% of the cases fall; the 50th Percentile (sometimes written "50th %-ile") is the score below (and above) which half the scores in the datafile fall. Percentiles used to be difficult to calculate, but now thanks to computers and spreadsheets, they are real easy to find.
    Find Percentiles in a Spreadsheet
  1. Click in the lettered box at the head of a particular column in the spreadsheet;
  2. Select Data --> Sort from the top menu bar;
  3. And sort the column of data in either ascending or descending order.
  4. Look at the bottom of the column to see how many cases there are for this Variable (e.g., the Variable "Teacher Salaries in Desertview"). Call this number n.
  5. Suppose you want the 75th percentile: calculate .75(n) and count up the sorted column of scores until you rech that number (e.g., if there are 200 cases for the Variable you are working on, .75(200) = 150, so the 75th percentile will be the 150th score from the bottom (or the 50th from the top which is the same thing).
  6. The Pth percentile is found by multiplying (P/100)(n) and counting up that many scores from the bottom of the column.
(There's more explanation of percentiles on pages 18-19 of the textbook.)

A 5-Number Summary is a very concise way to describe the major features of a set of scores without getting bogged down in details. The 5 numbers in question are the 10th, 25th, 50th, 75th and 90th percentiles. In mathematical notation, these are denoted as follows: P10, P25, P50, P75, and P90. P50 is the 50th Percentile, the score that divides the set of scores into two halves; in this sense, it is a middle score and is commonly called the Median. The 25th and 75th Percentiles have an obvious meaning, and noting how far they lie from the Median tells us how spread out the distribution of scores is. P25 and P75 are known by their synonyms as well: Q1, the First Quartile, and Q3, the Third Quartile. By what other name do you think the second quartile, Q2, is known?
Between Q1 and Q3 lies half of all the scores. Knowing that much is to know quite a bit. Between P10 and P90 lies the middle 80% of all the scores, or all but the 10% highest and 10% lowest. These five numbers together, then, give a pretty informative description of the set of scores, without distracting us with too many details that may not be informative or stable. We call them the "5-Number Summary" of a distribution.

P10                           P25         P50         P75                           P90

When the 5-Number Summary is converted to a graph, the Box and Whisker Plot results. We establish a ruled line horizontally across the page and mark off the full range of scores that we see in the set of scores we are analyzing. Then we draw a rectangle above the ruled scale such that the right edge is above the point on the scale corresponding to P75 and the left edge of the rectangle is above the 25th Percentile. We draw a line inside the rectangle at the Median. Then we draw "whiskers that extend out from the sides of the rectangle in each direction unitl they reach the 10th and 90th Percentiles. Like so:
Suppose a set of 150 chidlren's heights had the following 5-Number Summary:

P10= 40"           P25= 42"         P50= 45"         P75= 48"           P90=52"
Ten percent of the children are shorter than 40 inches; half the children are taller than 45". Here's what the box-and-whisker plot looks like:

Here's another box-and-whiskers plot. It describes the distribution of Median Family Incomes for 97 Elementary School Districts in Arizona (in about 1990).
What family income is exceeded by half of the Median Family Incomes for Arizona's 97 Elementary School Districts?
John Behrens offers a detailed treatment of how to construct Box-and-Whisker plots.

Frequency Distributions and Histograms

Frequency distributions and histograms are ways of portraying the complete shape of a set of scores. There is quite a bit of discussion of both these techniques in the textbook on pages 9 through 17. Suffice it to say that frequency distributions are built by dividing the range of the scores into some number of equal size classes and then counting and reporting the number (or frequency) of scores in each class. For example, if my set of scores is as follows:

2, 3, 3, 4, 5, 6, 6, 6, 7, 10, 11, 13, 15, 16, 21

and I form the classes 1-5, 6-10, 11-15, 16-20, and 21-25, then the grouped frequency distribution looks like this:

Class Freq.
1-5 5
6-10 5
11-15 3
16-20 1
21-25 1

Question: What five scores constitute the three frequencies in the class 6-10?

A Histogram is simply a bar graph where the bar lengths are determined by the frequencies in each class of a grouped frequency distribution. Notice how the bar graph below (an example of a histogram) has five bars that represent the numbers of cases in each of the five classes in the above frequency distribution.

The histogram below describes the 97 elementary school districts in Arizona in terms of the proportion of poor people in the school district boundaries.

Notice that about 15 school districts are in the lowest poverty category and about 25 districts are in the third from lowest poverty category.

For another presentation on histograms, consult John Behrens's materials.

How to Construct a Frequency Distribution and Histogram in Excel

There are at least three methods of constructing frequency distributions (then drawing histograms) in the Excel program. One way is simple but could be tedious for very large data sets; one way is simple and powerful, if you have the right program elements in your version of Excel; and the last way is a nightmare that I hope you can avoid. Find out about all three methods here.

Collateral Materials

If you want to read more on this and related topics, then please visit John Behren's introductory statistics website. Also, there are many places on the web where this topic and other introductory topics in statistical methods are taught. Try these places if you need more opportunities to learn:

Assignment One

Use this form to complete Assignment #1 and submit your work.

home     |     online calc.     |     lesson:   one     |     two     |     three     |     four     |     five     |     six