An Introduction to Summary Statistics: Vegetation Analysis of Two Prairies
Summarizing samples
Now that you have sampled the biotic and abiotic environment at several prairies, you need to calculate some statistics to describe the data. We call these summary statistics. You will calculate summary statistics separately for each prairie so that you can determine how the biotic and abiotic environment varies. Groups will use different statistics to summarize their data, but you should strive to understand all of the statistics described below. Biology is a highly quantitative science, and we will be using summary statistics repeatedly this semester.
a. The mean (Groups 1-8)
The mean is the average of your samples and a measure of central tendency; it tells you the value around which your measurements cluster. It can be calculated by summing ($\sum$) the individual samples you took ($x_i$) and dividing by the total number of samples ($n$):
$\bar{x} = \frac{\sum x_i}{n}$
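As a quick illustration of the formula above, here is a minimal Python sketch. The stem counts are hypothetical values made up for this example, not data from the prairies.

```python
# Hypothetical stem counts from five quadrats in one prairie (illustrative only).
counts = [12, 15, 9, 14, 10]

def mean(samples):
    """Sum the individual samples and divide by the number of samples (n)."""
    return sum(samples) / len(samples)

print(mean(counts))  # 12.0
```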
b. The variance (Groups 1-8)
The variance is a measure of dispersion; it tells you how much your data vary around the measure of central tendency. Estimating how dispersed your data are is very important because variability is an inherent part of sampling. To understand this point, consider two groups who independently randomly sample the same plant species in the same prairie on the same day. It is highly unlikely that these groups would get exactly the same results because their data are only a sample of the population. Due to chance alone, their samples will vary slightly, even if the true size of the population does not change. To get exactly the same results, our two groups would have to count all of the plants in their prairie (and who would want to do that!).
The variance can be calculated using the following formula:
$S^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$
where $x_1$ is the first sample, $x_2$ the second, etc., $\bar{x}$ is the mean, and $n$ is the total number of samples.
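The variance formula can be worked through step by step in a short Python sketch. The counts are hypothetical, and the comparison with Python's built-in `statistics.variance` is included only as a cross-check, since that function also divides by n - 1.

```python
import statistics

# Hypothetical stem counts from five quadrats (illustrative only).
counts = [12, 15, 9, 14, 10]

def variance(samples):
    """Sample variance: squared deviations from the mean, divided by n - 1."""
    n = len(samples)
    x_bar = sum(samples) / n
    squared_deviations = [(x - x_bar) ** 2 for x in samples]
    return sum(squared_deviations) / (n - 1)

print(variance(counts))             # 6.5
print(statistics.variance(counts))  # 6.5 (built-in cross-check)
```

Here the mean is 12, the squared deviations are 0, 9, 9, 4, and 4, and their sum (26) divided by n - 1 = 4 gives 6.5.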
c. The standard deviation (Groups 1-8)
The standard deviation is another measure of dispersion. If you can calculate the variance, you can easily calculate the standard deviation; the standard deviation is simply the square root of the variance:
$S = \sqrt{S^2}$
The standard deviation has a minimum of zero (if all of your samples are the same) and increases the more variation there is among measurements.
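To see both properties in action, the sketch below computes the standard deviation for two hypothetical sets of samples: one where every sample is identical and one with some spread. The data and function names are invented for illustration.

```python
import math

def variance(samples):
    """Sample variance with the n - 1 denominator."""
    n = len(samples)
    x_bar = sum(samples) / n
    return sum((x - x_bar) ** 2 for x in samples) / (n - 1)

def std_dev(samples):
    """Standard deviation: the square root of the variance."""
    return math.sqrt(variance(samples))

identical = [7, 7, 7, 7]        # hypothetical: every quadrat has the same count
variable = [12, 15, 9, 14, 10]  # hypothetical: counts differ among quadrats

print(std_dev(identical))           # 0.0
print(round(std_dev(variable), 2))  # 2.55
```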
d. The standard error (Groups 1-8)
The standard error of the mean is yet another measure of dispersion. It can be calculated by dividing the standard deviation of your sample (S) by the square root of the sample size (n):
$\mathrm{SE} = \frac{S}{\sqrt{n}}$
One useful property of the standard error is that, all things being equal, it decreases as the sample size (n) increases. This property makes sense if you think about it. If you are sampling a population of prairie plants, your estimate of the population mean will become more precise the more samples you take simply because you have more information (data) about that population. This precision is reflected in your standard error, which will become smaller as you take more samples. Put another way, your confidence in your estimate of the population mean will increase as your sample size increases. Because the standard error reflects the precision of, and your confidence in, the estimate of the population mean, the mean and its standard error are often presented together.
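The effect of sample size can be checked numerically. The sketch below holds the standard deviation fixed at an arbitrary, hypothetical value (2.5) and shows how the standard error shrinks as n grows; it then computes the standard error for one hypothetical set of counts.

```python
import math

def se_from_sd(s, n):
    """Standard error of the mean from a standard deviation s and sample size n."""
    return s / math.sqrt(n)

def standard_error(samples):
    """Standard error computed directly from raw samples."""
    n = len(samples)
    x_bar = sum(samples) / n
    s = math.sqrt(sum((x - x_bar) ** 2 for x in samples) / (n - 1))
    return se_from_sd(s, n)

# Same spread (S = 2.5, an assumed value), increasing sample size.
for n in (4, 16, 64):
    print(n, se_from_sd(2.5, n))
# 4 1.25
# 16 0.625
# 64 0.3125

counts = [12, 15, 9, 14, 10]  # hypothetical stem counts
print(round(standard_error(counts), 2))  # 1.14
```

Note that quadrupling the sample size only halves the standard error, because n enters the formula through its square root.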
Species richness is simply the number of species per prairie site.
We consider communities to be similar when they have many species in common. You can calculate an index of similarity using the following formula:
Community similarity index = no. species in common/total no. of species
This index varies between zero and 1. Two communities with an index of zero have no species in common. Those with an index of 1 share all of their species.
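Species richness and the similarity index are both easy to compute from species lists. The sketch below uses hypothetical species lists for two sites, and it assumes that "total no. of species" means the number of distinct species found in either community; if your instructor defines the total differently, adjust the denominator.

```python
# Hypothetical species lists for two prairie sites (illustrative only).
site_1 = {"big bluestem", "indian grass", "purple coneflower", "leadplant"}
site_2 = {"big bluestem", "indian grass", "switchgrass", "goldenrod", "leadplant"}

def species_richness(species):
    """Number of distinct species recorded at a site."""
    return len(set(species))

def similarity_index(a, b):
    """Species in common divided by total species.

    Assumption: 'total' is the number of distinct species across both sites.
    """
    a, b = set(a), set(b)
    total = a | b
    if not total:
        raise ValueError("both species lists are empty")
    return len(a & b) / len(total)

print(species_richness(site_1), species_richness(site_2))  # 4 5
print(similarity_index(site_1, site_2))                    # 0.5
```

With this definition, two identical lists give an index of 1 and two lists with no shared species give 0, matching the range described above.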
Simpson's index is a composite measure of diversity. It takes information on the number and relative abundance of the different species in your sample and boils it down to a single number (which is very convenient for making comparisons!). Simpson's index can be calculated using the following formula:
$D = \frac{1}{\sum p_i^2}$
where $D$ is Simpson's index and $p_i$ is the proportion of individuals in the community that belong to species $i$.
Simpson's index varies from 1 to n (the number of species in the sample), with more diverse communities having higher values of D. Because n determines the maximum value of D, the magnitude of Simpson's index is clearly influenced by the number of species in the community. For example, a community containing 8 species has a higher possible value of D (8) than a community with only 4 species (maximum D = 4). But the magnitude of Simpson's index is also influenced by the evenness of species abundances ($p_i$). A community with 5 species, all at roughly equal abundances, will have a higher Simpson's index than a community with 5 species at highly unequal abundances. For example, consider the following hypothetical data from two communities:
Proportion of sample represented by each species ($p_i$)

| Row | Species A | Species B | Species C | Species D | Species E |
|-----|-----------|-----------|-----------|-----------|-----------|
| 3   | 0.20      | 0.20      | 0.20      | 0.20      | 0.20      |
| 4   | 0.50      | 0.30      | 0.10      | 0.07      | 0.03      |
If you plug these numbers into the formula for Simpson's index, D = 5.00 for the community where all 5 species are at equal abundances (Row 3). In contrast, D = 2.81 for the community with the same 5 species at very unequal abundances (Row 4). As this example illustrates, Simpson's index is more sensitive to changes in the abundant (rather than rare) species in a community.
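The two values of D can be reproduced with a few lines of Python. This is a minimal sketch using the hypothetical proportions from the two communities above; the function name is our own.

```python
def simpsons_index(proportions):
    """Simpson's index: D = 1 / sum of squared species proportions."""
    return 1 / sum(p ** 2 for p in proportions)

even = [0.20, 0.20, 0.20, 0.20, 0.20]    # Row 3: equal abundances
uneven = [0.50, 0.30, 0.10, 0.07, 0.03]  # Row 4: very unequal abundances

print(round(simpsons_index(even), 2))    # 5.0
print(round(simpsons_index(uneven), 2))  # 2.81
```

For the unequal community, the squared proportions are 0.25, 0.09, 0.01, 0.0049, and 0.0009. The two most abundant species contribute 0.34 of the total 0.3558, which is why changes in common species move D far more than changes in rare ones.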

Before you can create a rank-abundance curve, you need to rank the species in your sample from most to least common. The rank will vary from 1 (most common) to i (least common), where i is the number of species in the sample. Then, plot the logarithm of the number of individuals of each species in your sample (y-axis) vs. the rank abundance of each species (x-axis). The number of individuals is log transformed to linearize the curve. By putting data from different communities on the same scale, log transformation also makes it easier to compare data between different communities.
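The ranking and log transformation can be sketched in Python before any plotting. The counts below are hypothetical, and base-10 logarithms are an assumption, since the text above does not say which base to use. The resulting (rank, log count) pairs are the x and y values you would plot.

```python
import math

# Hypothetical number of individuals per species in one sample (illustrative only).
abundances = {
    "Species A": 50,
    "Species B": 30,
    "Species C": 10,
    "Species D": 7,
    "Species E": 3,
}

def rank_abundance(counts):
    """Return (rank, species, log10 count) tuples, most common species first."""
    ordered = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    return [
        (rank, species, math.log10(n))  # base 10 is an assumption
        for rank, (species, n) in enumerate(ordered, start=1)
    ]

for rank, species, log_n in rank_abundance(abundances):
    print(rank, species, round(log_n, 2))
```

Species with tied counts keep the order in which they were listed; how to rank ties is a choice you should state when you present your curve.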
A final note: Statistics vs. parameters
One important thing to remember about all summary statistics is that they are only estimates of what is going on in the population. Because the populations are sampled (instead of counting every individual), all statistics calculated from those samples can only be estimates. But if the population is properly sampled, the summary statistics calculated from these samples should closely approximate the true population values (known as parameters).