Welcome back to Professor Baker's Math Class! In this lesson, we are diving into Sections 3-1 and 3-2, which act as the foundation for much of what we will do in statistics. We aren't just crunching numbers; we are trying to tell a story about data: specifically, where the center of the data lies and how spread out it is.
Section 3-1: Measures of Central Tendency
When we have a set of data, the first thing we usually want to know is: "What is the average?" or "What is the typical value?" In statistics, we use three main definitions to describe the center:
- Mean: This is the arithmetic average. You add up all the observations and divide by the number of observations.
  - Notation Alert: Pay close attention to the symbols used in the class notes!
    - The Sample Mean is denoted by $\bar{x}$ (read as "x-bar") and is calculated as $\bar{x} = \frac{\sum x_i}{n}$, where $n$ is the sample size.
    - The Population Mean is denoted by the Greek letter $\mu$ (mu).
- Median: This is the middle value of the data when it is arranged in increasing order. If you have an odd number of values, it is the exact middle number. If you have an even number, it is the average of the two middle numbers. The position of the median can be found using the formula $\frac{n+1}{2}$; when $n$ is even, this gives a half-position (for example, 3.5 when $n = 6$), which means averaging the 3rd and 4th values.
  Tip: As seen in Example 3.4 (Home Resale Prices), the median is often better than the mean when your data has extreme outliers (like one multi-million dollar mansion in a neighborhood of average homes).
- Mode: This is the value that occurs with the greatest frequency. If no value repeats, there is no mode. If two values tie for the highest frequency, the data is bimodal; if more than two tie, it is multimodal. As noted in the Boston Marathon example, the Mode is the only measure of center we can use for qualitative (categorical) data, like "Male" or "Female."
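To see the three measures of center side by side, here is a minimal sketch using Python's built-in `statistics` module. The prices and category labels are made-up illustration data, not the values from Example 3.4 or the Boston Marathon example.

```python
from statistics import mean, median, multimode

# Hypothetical resale prices in thousands of dollars (not from Example 3.4).
# The last value plays the role of the one mansion in the neighborhood.
prices = [210, 225, 230, 240, 255, 2500]

mean_price = mean(prices)      # sum of the values divided by n
median_price = median(prices)  # n is even, so this averages 230 and 240

print("mean:", mean_price)
print("median:", median_price)

# Hypothetical categorical data: the mode is the only measure that applies.
labels = ["Female", "Male", "Female", "Female", "Male"]
label_modes = multimode(labels)
print("mode of labels:", label_modes)

# Two values tie for the highest frequency, so this data set is bimodal.
tied_modes = multimode([1, 1, 2, 2, 3])
print("modes:", tied_modes)
```

The single outlier pulls the mean far above every ordinary price, while the median stays in the middle of the typical homes. One caveat: `multimode` returns every value when nothing repeats, whereas the convention in this lesson is to say such data has no mode.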
Section 3-2: Measures of Variation
Knowing the center isn't enough. We also need to know if the data is consistent or if it is all over the place. For this, we look at variation.
The Range
The simplest measure of variation is the Range. It is calculated simply as:
$$ \text{Range} = \text{Max} - \text{Min} $$

While easy to calculate, the range is very sensitive to outliers because it depends only on the two most extreme values. In our class notes (Team 1 vs. Team 2), we saw how Team 2 had a much larger range ($84 - 67 = 17$) compared to Team 1 ($78 - 72 = 6$), indicating Team 2's performance was less consistent.
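The range calculation can be sketched in a few lines of Python. Only the minimum and maximum for each team (72 and 78, 67 and 84) come from the class notes; the values in between are assumed for illustration, and `data_range` is a helper name invented for this sketch.

```python
def data_range(values):
    """Return Max - Min for a non-empty collection of numbers."""
    return max(values) - min(values)

# Only the endpoints match the class notes; the middle values are assumed.
team1 = [72, 73, 76, 76, 78]
team2 = [67, 72, 76, 76, 84]

range1 = data_range(team1)
range2 = data_range(team2)
print("Team 1 range:", range1)
print("Team 2 range:", range2)

# One hypothetical outlier changes the range dramatically.
range_with_outlier = data_range(team1 + [100])
print("Team 1 range with an outlier of 100:", range_with_outlier)
```

Adding a single extreme score moves the range from 6 to 28 even though the other five values are unchanged, which is why the range alone can be misleading.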
Standard Deviation
The most important measure of spread is the Standard Deviation. Roughly speaking, this measures the average distance of each data point from the mean.
- Population Standard Deviation: Denoted by $\sigma$ (sigma).
- Sample Standard Deviation: Denoted by $s$.
The formula for sample standard deviation involves taking the square root of the sum of squared deviations divided by $n-1$:
$$ s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}} $$

In the handwritten notes for Team 1, we calculated the deviations ($x - \bar{x}$), squared them ($9, 4, 1, 1, 9$), and summed them up. A higher standard deviation means the data is more spread out, while a lower one means the data is clustered closely around the mean.
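To finish the Team 1 calculation, the squared deviations sum to $9 + 4 + 1 + 1 + 9 = 24$, and with $n = 5$ values we divide by $n - 1 = 4$, so $s = \sqrt{6} \approx 2.45$. The sketch below follows the same steps in Python. The scores are an assumption: they are one data set that reproduces those squared deviations and the range of 6 quoted in this lesson, and the actual values in the notes may differ. `sample_std` is a helper name invented for this sketch.

```python
from math import sqrt
from statistics import stdev

def sample_std(values):
    """Sample standard deviation, following the formula for s step by step."""
    n = len(values)
    x_bar = sum(values) / n                            # sample mean
    squared_devs = [(x - x_bar) ** 2 for x in values]  # (x_i - x_bar)^2
    return sqrt(sum(squared_devs) / (n - 1))           # divide by n - 1, then root

# Assumed scores, chosen so the squared deviations are 9, 4, 1, 1, 9.
team1 = [72, 73, 76, 76, 78]

s_by_hand = sample_std(team1)
s_library = stdev(team1)  # the standard library also divides by n - 1

print("s (step by step):", round(s_by_hand, 3))
print("s (statistics.stdev):", round(s_library, 3))
```

Both approaches should agree, which makes the library call a handy check on a hand calculation. For a population standard deviation $\sigma$, the standard library offers `statistics.pstdev`, which divides by $n$ instead of $n - 1$.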
Be sure to use the Standard Deviation Calculator linked above to check your work, as these calculations can get tedious by hand! Understanding these symbols ($\bar{x}, \mu, s, \sigma$) is crucial for the upcoming chapters, so take some time to memorize them now.
Keep up the great work!