Statistical and Graphing Help
Statistical Foundations for Analyzing POPS Project Data
Many of you have completed a statistics course prior to taking Biology 2425, so reviewing those materials will be invaluable. If you haven’t taken a statistics course or it’s been a while, this guide will help you brush up on the essential concepts needed to analyze your POPS project data effectively. Simply reporting raw data or creating unprocessed graphs is insufficient and will result in a score of zero. You are also welcome to use more advanced statistical techniques if you’re familiar with them. Effective data analysis can reveal powerful insights!
Raw data alone rarely reveals meaningful trends in human health studies. Descriptive statistics, such as mean, median, mode, range, and standard deviation, summarize data and make patterns more apparent. Organizing data into tables is a critical first step before graphing, as both methods enhance your ability to recognize trends. As learned in prior labs, proper placement of independent and dependent variables is key to producing graphs that effectively illustrate relationships.
Computer tools like Excel are widely used to organize, analyze, and graph data. Properly managed data prevents misleading conclusions and can safeguard against fallacious claims. Developing these skills is essential not just for scientists but for anyone who wants to critically evaluate information in today’s data-driven world.
Understanding Variables and Descriptive Statistics
To identify cause-and-effect relationships, scientists examine how changes in one variable influence another. A variable is any measurable factor, such as height, weight, or health status. Descriptive statistics simplify complex datasets to highlight trends and patterns. Below are key terms and examples:
- Mean: The average of a dataset, calculated by summing all values and dividing by the number of values. Sensitive to outliers (e.g., the mean of 4, 7, 3, 8, 10, 8 is 6.7, but if one 8 is mistakenly recorded as 80, it becomes 18.7).
- Median: The middle value of an ordered dataset. For datasets with an even number of values, it’s the average of the two middle values. Unlike the mean, it’s less influenced by outliers (e.g., the median of 3, 4, 7, 8, 8, 10 is 7.5, and it stays 7.5 if one 8 is mistakenly recorded as 80).
- Mode: The most frequently occurring value in a dataset. If all values are unique, there is no mode (e.g., the mode of 4, 8, 3, 8, 10, 7 is 8; 5, 10, 6, 9, 12 has no mode).
- Range: The difference between the highest and lowest values. Highly sensitive to outliers (e.g., the range of 3, 4, 7, 8, 8, 10 is 7, but with 80 instead of 8, it’s 77).
- Standard Deviation: A measure of variability around the mean. A smaller standard deviation indicates more consistent data, while a larger one indicates greater variability. For example:
  - Group 1 (16, 15, 14, 17, 18): Mean = 16 bpm; SD = 1.6 bpm.
  - Group 2 (10, 10, 20, 25, 15): Mean = 16 bpm; SD = 6.5 bpm.
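The definitions above can be checked with a short Python sketch. This is only an illustration using Python's built-in `statistics` module, not a required tool for the project; the helper name `describe` is made up for this example, and it reports "no mode" whenever every value occurs equally often.

```python
import statistics

def describe(values):
    """Return the descriptive statistics discussed above for a list of numbers."""
    modes = statistics.multimode(values)
    # If every distinct value occurs equally often (e.g. all unique), report no mode.
    has_mode = len(modes) < len(set(values))
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "mode": modes if has_mode else None,
        "range": max(values) - min(values),
        "sd": statistics.stdev(values),  # sample SD (divides by n - 1)
    }

original = describe([4, 7, 3, 8, 10, 8])
with_outlier = describe([4, 7, 3, 80, 10, 8])  # one 8 mistakenly recorded as 80

# The mean and range shift a lot with the outlier; the median does not.
for name in ("mean", "median", "range"):
    print(name, round(original[name], 1), "->", round(with_outlier[name], 1))

# Two groups with the same mean but very different variability.
print(round(describe([16, 15, 14, 17, 18])["sd"], 1))
print(round(describe([10, 10, 20, 25, 15])["sd"], 1))
```

Excel offers comparable built-in functions (for example AVERAGE, MEDIAN, and STDEV.S), so the same checks can be done in a spreadsheet.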
In many physiological datasets, values follow a normal distribution (bell-shaped curve). Key characteristics of normally distributed data:
- approximately 68% of values lie within +/- 1 standard deviation of the mean.
- approximately 95% of values lie within +/- 2 standard deviations.
- approximately 99.7% of values lie within +/- 3 standard deviations.
Figure 1. Normal or bell-shaped curve with standard deviations (from: https://commons.wikimedia.org/wiki/File:Empirical_Rule.PNG)
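The percentages for a normal distribution can be reproduced with a few lines of Python. This is a minimal sketch that assumes an ideal normal curve (real data will only approximate it); the function name `fraction_within` is hypothetical and chosen for this example.

```python
from statistics import NormalDist

# A standard normal curve: mean 0, standard deviation 1.
standard_normal = NormalDist(mu=0, sigma=1)

def fraction_within(k):
    """Fraction of a normal distribution lying within k SDs of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3):
    print(f"within +/- {k} SD: {fraction_within(k):.1%}")
# prints 68.3%, 95.4%, and 99.7% for k = 1, 2, 3
```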
Understanding these principles will enable you to analyze your data meaningfully, spot trends, and interpret your results with confidence.
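The standard deviation measures how far data values deviate from the mean, providing insight into the variability within a dataset. It is calculated by summing the squared deviations of each value from the mean, dividing that sum by one less than the sample size, and taking the square root of the result. This calculation is expressed mathematically as follows:

$$ s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}} $$

where x_i is each data value, x̄ is the mean, and n is the sample size.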
Researchers recorded the number of precancerous skin lesions found on six patients who had a skin cancer mass removed 10 years ago. The data set is as follows: 1, 3, 4, 6, 9, and 19 precancerous skin lesions.
- Calculate the mean for the data set.
Mean = (1+3+4+6+9+19) / 6 = 42 / 6 = 7 precancerous skin lesions
- Subtract the mean from every number to get the list of deviations. It is OK to get negative numbers here.
list of deviations: -6, -4, -3, -1, 2, 12
- Next, square the resulting deviations.
squares of deviations: 36, 16, 9, 1, 4, 144
- Add up all of the resulting squares to get their total sum.
sum of squared deviations: 36+16+9+1+4+144 = 210
Data Value (xi) (# of precancerous skin lesions) | Data Value – Mean (xi – x̄) (= deviation) | Deviation Squared |
1 | -6 | 36 |
3 | -4 | 16 |
4 | -3 | 9 |
6 | -1 | 1 |
9 | 2 | 4 |
19 | 12 | 144 |
Total | | 210 |
Table 1. Calculation of the standard deviation from the numbers of precancerous skin lesions.
- Divide the sum of squared deviations by one less than the number of values.
210 / 5 = 42
- Then take the square root of this number.
√42 = 6.48 precancerous skin lesions = standard deviation
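The steps of this worked example can be followed in code as well. The sketch below mirrors each step using plain Python; the variable names are chosen for illustration only, and the final line cross-checks the hand calculation against the built-in `statistics.stdev`, which also divides by one less than the sample size.

```python
import math
import statistics

lesions = [1, 3, 4, 6, 9, 19]

mean = sum(lesions) / len(lesions)          # step 1: 42 / 6
deviations = [x - mean for x in lesions]    # step 2: value minus mean
squared = [d ** 2 for d in deviations]      # step 3: square each deviation
sum_sq = sum(squared)                       # step 4: add up the squares
variance = sum_sq / (len(lesions) - 1)      # step 5: divide by n - 1
sd = math.sqrt(variance)                    # step 6: take the square root

print(mean, sum_sq, variance, round(sd, 2))

# The built-in sample standard deviation should agree with the manual steps.
assert math.isclose(sd, statistics.stdev(lesions))
```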
If these data formed a normal curve (which they do not), we could conclude that about 68% of the values fall within +/- 1 standard deviation of the mean. That is, about 68% of the values in the data set would be within the range of 7 – (1 x 6.48) to 7 + (1 x 6.48), or about 0.5 to 13.5 precancerous skin lesions.
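As a quick check on that range, the short sketch below computes the interval of one standard deviation around the mean and counts how many of the six observed values actually fall inside it. The variable names are illustrative, and the mean and standard deviation are taken from the worked example rather than recomputed.

```python
lesions = [1, 3, 4, 6, 9, 19]
mean, sd = 7, 6.48  # values from the worked example

low, high = mean - sd, mean + sd
inside = [x for x in lesions if low <= x <= high]
share = len(inside) / len(lesions)

print(f"range: {low:.2f} to {high:.2f}")
print(f"{len(inside)} of {len(lesions)} values fall inside ({share:.0%})")
```

With only six values, 5 of 6 land inside the interval, which is one reason the 68% figure should be treated as a property of a normal curve and not of this small data set.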
Adapted from Human Physiology Lab Manual by Jim Blevins, Melaney Farr, and Arleen Sawitzke, Salt Lake Community College.