The Basics of Probability Density Functions
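A probability density function (pdf) is an equation whose graph is used to calculate probabilities for a continuous random variable: probabilities correspond to areas under the graph, not to its height at a single point. To be valid, a probability density function must satisfy two key properties: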
- The total area under the graph across all possible values of the random variable must equal 1.
- The height of the graph must be greater than or equal to 0 for all possible values.
A simple example is the uniform probability distribution. If a friend is equally likely to be anywhere from 0 to 30 minutes late, every interval of the same length has the same probability, so the graph is a horizontal line at height 1/30 over the interval from 0 to 30. The area under the graph over a specific interval is then the exact probability of observing a value within that interval; for instance, the probability of being between 10 and 20 minutes late is $10 \times \frac{1}{30} = \frac{1}{3}$.
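The lateness example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `uniform_pdf` and `uniform_prob` are made up for this sketch, and the 10-to-20-minute interval is chosen purely as an example.

```python
# Uniform density for the lateness example: X is equally likely anywhere in [0, 30].
LOW, HIGH = 0.0, 30.0  # interval endpoints, in minutes


def uniform_pdf(x: float) -> float:
    """Height of the density at x: constant inside [LOW, HIGH], zero outside."""
    return 1.0 / (HIGH - LOW) if LOW <= x <= HIGH else 0.0


def uniform_prob(a: float, b: float) -> float:
    """P(a <= X <= b): area of the rectangle over the part of [a, b] inside [LOW, HIGH]."""
    left, right = max(a, LOW), min(b, HIGH)
    return max(right - left, 0.0) / (HIGH - LOW)


total_area = uniform_prob(LOW, HIGH)  # property 1: total area under the graph
p_10_to_20 = uniform_prob(10, 20)     # illustrative interval: 10 to 20 minutes late

print(f"height of the graph: {uniform_pdf(15):.4f}")
print(f"total area:          {total_area:.4f}")
print(f"P(10 <= X <= 20):    {p_10_to_20:.4f}")
```

Because the graph is a rectangle, each probability is just width times height; intervals of equal length therefore always get equal probability.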
Properties of the Normal Curve
A continuous random variable is normally distributed if its relative frequency histogram has the shape of a normal curve. This mathematical model is a good approximation for many real-world continuous variables and has several distinct properties:
- It is symmetric about its mean, $\mu$.
- Because the mean, median, and mode are equal, the curve has a single peak occurring at $x=\mu$.
- The curve has inflection points at $\mu-\sigma$ and $\mu+\sigma$, where the graph switches between curving downward (near the peak) and curving upward (in the tails).
- The total area under the curve is 1.
- The area to the right of the mean equals the area to the left, with both sides equaling 1/2.
- The graph approaches, but never actually touches, the horizontal axis as it extends infinitely in either direction.
- The Empirical Rule: Approximately 68% of the area lies between $\mu-\sigma$ and $\mu+\sigma$, 95% lies between $\mu-2\sigma$ and $\mu+2\sigma$, and 99.7% lies between $\mu-3\sigma$ and $\mu+3\sigma$.
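The area properties in this list can be checked numerically. The sketch below uses `statistics.NormalDist` from the Python standard library; the mean and standard deviation are hypothetical values picked only for illustration, and `area_within` is a helper name invented here.

```python
from statistics import NormalDist

# Hypothetical parameters; any mean and positive standard deviation give the same areas.
mu, sigma = 100.0, 15.0
dist = NormalDist(mu, sigma)


def area_within(k: float) -> float:
    """Area under the normal curve between mu - k*sigma and mu + k*sigma."""
    return dist.cdf(mu + k * sigma) - dist.cdf(mu - k * sigma)


# Symmetry: half of the area lies to the left of the mean.
print(f"area left of the mean: {dist.cdf(mu):.4f}")

# Empirical Rule: areas within 1, 2, and 3 standard deviations of the mean.
for k in (1, 2, 3):
    print(f"within {k} sigma: {area_within(k):.4f}")
```

The three areas come out near 0.6827, 0.9545, and 0.9973, which is where the rounded figures 68%, 95%, and 99.7% come from.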
The Role of Area and Standardizing
The area under the normal curve for any given interval represents either the proportion of a population with that characteristic or the probability that a randomly selected individual will have that characteristic.
To easily compare different normal distributions, we can transform any normal random variable into a standard normal random variable, $Z$, using the following formula:
$$Z=\frac{X-\mu}{\sigma}$$
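This process is known as standardizing. The resulting standard normal distribution always has a mean of 0 ($\mu=0$) and a standard deviation of 1 ($\sigma=1$). Standardizing does not change areas: the area to the left of a value $x$ under the original normal curve equals the area to the left of its z-score under the standard normal curve. You can therefore use the z-score with a table or software to find the area under the curve to the left or right of that value.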
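As a rough illustration of standardizing, the sketch below converts one value to a z-score and looks up the areas on either side of it, with `statistics.NormalDist` standing in for a table. The mean, standard deviation, and observed value are hypothetical, and `z_score` is a helper name introduced for this sketch.

```python
from statistics import NormalDist


def z_score(x: float, mu: float, sigma: float) -> float:
    """Standardize x: how many standard deviations x lies from the mean."""
    return (x - mu) / sigma


# Hypothetical normal variable and observed value, chosen only for illustration.
mu, sigma = 70.0, 4.0
x = 76.0

z = z_score(x, mu, sigma)        # (76 - 70) / 4 = 1.5
standard_normal = NormalDist()   # mean 0, standard deviation 1

area_left = standard_normal.cdf(z)   # area to the left of z
area_right = 1.0 - area_left         # area to the right of z

# Same area computed directly on the original (unstandardized) curve.
area_left_original = NormalDist(mu, sigma).cdf(x)

print(f"z = {z:.2f}")
print(f"area left:  {area_left:.4f}")
print(f"area right: {area_right:.4f}")
```

The last comparison shows why standardizing works: the area to the left of `x` on the original curve matches the area to the left of `z` on the standard normal curve.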
Assessing Normality in Data
While histograms work well for large datasets, a histogram of a small sample might not accurately reflect the shape of the population. To assess normality in smaller samples, statisticians use normal probability plots.
- A normal probability plot graphs the observed data against expected z-scores.
- An expected z-score is what the data value's z-score should theoretically be if the population were truly normal.
- If the sample data is drawn from a normally distributed population, the resulting plot will be approximately linear.
- To confirm this, you can calculate the linear correlation coefficient between the observed values and expected z-scores; if it is greater than the critical value for that sample size, it is reasonable to conclude that the data come from a population that is approximately normally distributed.
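The steps above might be sketched as follows. Several things here are assumptions rather than part of the text: the sample values are made up, the helper names are invented, and the expected z-scores use the plotting positions $f_i = (i - 0.375)/(n + 0.25)$, which is one common convention among several. No critical value is hard-coded, since it depends on the sample size and has to come from a table.

```python
from math import sqrt
from statistics import NormalDist, mean


def expected_z_scores(n: int) -> list[float]:
    """Expected z-scores for n sorted observations.

    Assumes the plotting positions f_i = (i - 0.375) / (n + 0.25);
    other conventions exist and give slightly different values.
    """
    standard_normal = NormalDist()
    return [standard_normal.inv_cdf((i - 0.375) / (n + 0.25)) for i in range(1, n + 1)]


def linear_correlation(xs: list[float], ys: list[float]) -> float:
    """Linear (Pearson) correlation coefficient between two equal-length lists."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Made-up sample, used only to show the mechanics.
sample = [5.6, 4.1, 7.8, 6.2, 4.9, 6.7, 5.3, 8.6, 6.0, 7.1]

observed = sorted(sample)                    # observed values in ascending order
expected = expected_z_scores(len(observed))  # one expected z-score per observation
r = linear_correlation(observed, expected)

# Points (observed, expected) are what a normal probability plot would display.
for value, z in zip(observed, expected):
    print(f"{value:5.1f}  {z:7.3f}")
print(f"correlation between observed values and expected z-scores: {r:.3f}")
```

The printed correlation would then be compared with the tabulated critical value for a sample of this size; the closer the points are to a straight line, the closer the coefficient is to 1.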