What is statistics?

Statistics is a term that is used to mean many things. Before we state our definition of statistics, we need a couple of other definitions. When we talk about a population, we are referring to the entire collection of elements being studied. For example, all of the high school students in the United States form a population. Obviously, we cannot answer a question such as "What percentage of female high school students take calculus?" by polling every high school student. What we can do is take a sample, or subset, of the population. Using these two definitions, we can now define statistics. The term statistics refers to the many methods for gathering sample data from a population. The main idea behind these methods is to be able to draw conclusions about the population based on the samples gathered.


 

Measures of central tendency

It is often of interest to study what is happening around the "middle" of a sample. What do we mean by the middle? Well, that depends. We will concern ourselves with two types of investigation: averages, which locate the middle, and dispersion statistics, which describe how the data is spread out around it.

Averages

There are two averages we will be studying: the mean and the median. Let's pretend that you have four test scores: 90, 92, 93, 97. Most people would say that your average test score is a 93. What they are actually talking about is the mean! The mean is computed by adding up all of the data items and dividing by the number of data items. In order to compute your mean test score of 93, we calculated (90 + 92 + 93 + 97) / 4. Many times you will see the mean of a sample represented by $\bar{x}$. Mathematically, the mean of n data items $x_1, x_2, \ldots, x_n$ is defined to be

$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i.$$

We have discussed how to compute the mean, but what does the mean measure?

The simplest way to describe what the mean measures is to picture a see-saw (some of you may call it a teeter-totter). If two people who each weigh 100 lbs. were to sit at opposite ends of the see-saw, then in order to balance the see-saw, the pivot point (or fulcrum) must be placed directly in the middle of the see-saw. Now let's suppose a third person weighing 50 lbs. comes and sits in front of one of the 100 pounders. In order to balance, the fulcrum has to be moved closer to the end with the two people on it. For those of you with some physics background, the fulcrum placement which causes the see-saw to balance is called the center of mass of that system. If we were to think of our ordered test scores on a number line, making our see-saw out of the portion of the number line between the smallest and largest number and placing an equal weight at each score, the mean will occur at the center of mass, or balance point, for this see-saw.
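The balance-point picture can be checked numerically. The sketch below is only an illustration, assuming each test score carries the same weight on the see-saw; the names `left_pull` and `right_pull` are made up for this example. It adds up the distances of the scores on each side of the mean and shows that the two sides cancel.

```python
scores = [90, 92, 93, 97]
mean = sum(scores) / len(scores)

# Signed distance of each score from the mean (negative = left of the pivot).
deviations = [x - mean for x in scores]

# With equal weights, the see-saw balances when the total distance on the
# left of the pivot equals the total distance on the right.
left_pull = sum(-d for d in deviations if d < 0)
right_pull = sum(d for d in deviations if d > 0)

print(mean)                   # 93.0
print(left_pull, right_pull)  # 4.0 4.0
```

Moving the pivot anywhere other than 93 would make one side's total larger than the other's, which is the sense in which the mean is the balance point.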

 The median is the middle number in an ordered set of data. If there is an even number of data items, then the median is the mean of the middle two items. In our test score example, the median is the mean of 92 and 93, or 92.5.
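As a small sketch of both averages, the code below computes the mean and median of the four test scores. The helper `median_of` is a hypothetical name written out by hand to mirror the rule in the text; Python's standard `statistics` module is used alongside it as a cross-check.

```python
import statistics

def median_of(data):
    """Middle item of the ordered data, or the mean of the middle two."""
    ordered = sorted(data)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                      # odd count: single middle item
    return (ordered[mid - 1] + ordered[mid]) / 2  # even count: mean of middle two

scores = [90, 92, 93, 97]
mean_score = sum(scores) / len(scores)

print(mean_score)                 # 93.0
print(median_of(scores))          # 92.5
print(statistics.median(scores))  # 92.5
```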


Dispersion Statistics

Dispersion statistics describe how the data in a sample is spread out. The two dispersion statistics we will concern ourselves with are standard deviation and variance.

Standard Deviation

Standard deviation is a measure of how data from a sample is spread out relative to the mean. In order to get a good picture of what is happening, let's consider the ith data point, which we will call $x_i$. The distance between $x_i$ and the mean $\bar{x}$ is given by $|x_i - \bar{x}|$. We take the absolute value because some of the differences $x_i - \bar{x}$ will be negative, and a distance should never be negative. If we were to add up these distances for all n data items and divide by n, we would get the mean of these distances. This is called the mean deviation. Mathematically, it looks like

$$\text{mean deviation} = \frac{1}{n} \sum_{i=1}^{n} |x_i - \bar{x}|$$
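For the four test scores used earlier, the mean deviation can be worked out directly. This is a minimal sketch of the definition above; the function name `mean_deviation` is chosen here for illustration.

```python
def mean_deviation(data):
    """Average absolute distance of the data items from their mean."""
    n = len(data)
    mean = sum(data) / n
    return sum(abs(x - mean) for x in data) / n

scores = [90, 92, 93, 97]
# Distances from the mean of 93 are 3, 1, 0 and 4, which sum to 8.
print(mean_deviation(scores))  # 2.0
```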

This deviation is actually unsuitable for many statistical methods. Suffice it to say that when working with expressions that contain absolute value signs, things can get a "little" complicated. Therefore, statisticians use a twist on the mean deviation called the standard deviation of a sample. It squares the distances instead of using the absolute value. To "undo" the squaring, we take the square root to get our desired result. The standard deviation looks like

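$$\text{standard deviation} = s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$$

Notice that we have divided by n-1 rather than n. Data items lie closer, on average, to their own sample mean than to the population mean, and dividing by n-1 compensates for this: it makes the square of s an unbiased estimate of the corresponding population quantity.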

The standard deviation for our test score data is 2.944. Let's consider the four test scores 80, 95, 97, 100. The mean is 93, just like our previous test score data, but the standard deviation is 8.907. This is because the second set of test scores is more spread out relative to the mean.
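The two standard deviations quoted above can be reproduced with a few lines of code. This is a sketch of the sample standard deviation with the n-1 divisor; `sample_std` is a name made up for this example, and `statistics.stdev` from the Python standard library (which also divides by n-1) is used as a cross-check.

```python
import math
import statistics

def sample_std(data):
    """Sample standard deviation: square root of squared distances over n-1."""
    n = len(data)
    mean = sum(data) / n
    squared = sum((x - mean) ** 2 for x in data)
    return math.sqrt(squared / (n - 1))

first = [90, 92, 93, 97]
second = [80, 95, 97, 100]

print(round(sample_std(first), 3))   # 2.944
print(round(sample_std(second), 3))  # 8.907

# The standard library agrees with the hand-written version.
print(math.isclose(sample_std(first), statistics.stdev(first)))  # True
```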

Variance

The sample variance is simply the square of the sample standard deviation. It is useful when studying other statistical methods, but is a little harder to relate to since the unit of measure for the variance is not the same as the unit of measure for the data (the unit would be the square of the unit on the data). For our purposes, we will stick to the standard deviation.
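A quick sketch of the relationship between the two statistics, using `statistics.variance` and `statistics.stdev` from the Python standard library (both use the n-1 divisor). If the test scores are measured in points, the variance comes out in "points squared".

```python
import math
import statistics

scores = [90, 92, 93, 97]
s = statistics.stdev(scores)            # sample standard deviation
variance = statistics.variance(scores)  # sample variance

print(round(variance, 3))              # 8.667
print(math.isclose(variance, s ** 2))  # True
```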


Correlation and Regression

Bivariate data

In our test score data there is only one variable, the test score! This is called univariate data. We want to turn our investigation to data that occurs in pairs. An example of this type of data would be verbal scores and math scores on an SAT exam. This type of data is called bivariate data.

Correlation

Correlation is a statistical concept which asks the burning question: is there a statistically significant relationship between the two variables in a bivariate set of data? This relationship can be linear, quadratic, exponential, or pretty much any other class of function.

Regression

Regression is the statistical analysis used to determine the actual type of relationship (if one exists) present in sample data. We use this analysis of the sample data to draw statistical conclusions about the population.

Linear regression

Linear regression is regression analysis which investigates whether or not sample data comes from a population which has a linear relationship between the two variables. Let's consider the set of SAT verbal and math scores listed in the table below.

Verbal   421   423   429   424   413   437   429   461
Math     476   467   467   470   453   470   463   515

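If we were to plot these 8 pairs of data, using the verbal scores on the horizontal axis and the math scores on the vertical axis, we would get the scatter plot below.

[Figure: scatter plot of the 8 data pairs, with SAT verbal score on the horizontal axis and SAT math score on the vertical axis]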

Notice that the data points "sort of" look like they could almost lie on a line. Regression analysis gives us a way of stating statistically whether or not this "sort of" looking like a line is actually good enough to say with confidence (you know, like "I'm 95% sure ...") that the two variables are linearly related in the population.

Correlation coefficient

The correlation coefficient r, sometimes referred to as simply the r-value, gives us a way to say statistically whether or not a sample comes from a population with a linear relationship. The r-value always lies between -1 and 1, and an r-value of 1 or -1 means that there is a perfect linear correlation (i.e. all the sample data lie directly on a line). The r-value for our SAT data is 0.8969. The question is: is this r-value close enough to 1 to imply that the relationship in the population is linear?

In order to answer this question, we need to reference the critical r-value table. This table contains the critical values (which I will call the "magic" r-values) for the correlation coefficient. Notice the table has three columns. The first column represents the number of items in your sample. The second and third columns give the magic r-value for a specific α level (we will discuss what we mean by an α level at a later time). Suffice it to say that we will limit ourselves, for now, to an α level of 0.05. Notice that for our 8 pairs of SAT data, the table gives us a magic r-value of 0.707. The trick: if | r | > magic r-value, then we say that there is a significant linear correlation in the population. Since the sample r-value of 0.8969 > the magic r-value of 0.707, we conclude that there is a significant linear correlation between SAT verbal and math scores.
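The text does not spell out how r is computed. The sketch below assumes it is the usual Pearson product-moment correlation coefficient, which does reproduce the value quoted for the SAT data. The function name `pearson_r` and the constant `CRITICAL_R` are made up for this example; the 0.707 is simply the table value quoted above for 8 pairs at the 0.05 level, not something the code derives.

```python
import math

verbal = [421, 423, 429, 424, 413, 437, 429, 461]
math_scores = [476, 467, 467, 470, 453, 470, 463, 515]

def pearson_r(xs, ys):
    """Pearson correlation coefficient of paired data (assumed definition of r)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

r = pearson_r(verbal, math_scores)

# Critical ("magic") r-value quoted in the text for 8 pairs at the 0.05 level.
CRITICAL_R = 0.707
significant = abs(r) > CRITICAL_R

print(round(r, 4))  # 0.8969
print(significant)  # True
```

The comparison uses the absolute value of r so that a strong negative correlation would be flagged as significant too.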

Standard error

The standard error for bivariate data is analogous to the standard deviation for univariate data. The standard error is a quantitative measure of how the sample data is spread out relative to the regression line. The standard error for our SAT data is 8.774.
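The text does not give a formula for the standard error. The sketch below assumes the common convention: fit the least-squares regression line, then take the square root of the sum of squared vertical distances from the line divided by n-2. Under that assumption the result matches the 8.774 quoted for the SAT data. The function name `fit_line_and_standard_error` is hypothetical.

```python
import math

verbal = [421, 423, 429, 424, 413, 437, 429, 461]
math_scores = [476, 467, 467, 470, 453, 470, 463, 515]

def fit_line_and_standard_error(xs, ys):
    """Least-squares line y = intercept + slope * x and its standard error."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Squared vertical distances of the data points from the fitted line.
    sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    # Assumed convention: divide by n-2 because two values (slope and
    # intercept) were estimated from the data.
    std_err = math.sqrt(sse / (n - 2))
    return slope, intercept, std_err

slope, intercept, std_err = fit_line_and_standard_error(verbal, math_scores)

print(round(slope, 3))    # 1.138
print(round(std_err, 3))  # 8.774
```

Roughly speaking, a standard error of about 8.8 says that the observed math scores typically sit within several points of the score the line predicts from the verbal score.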