Introduction to Sample Proportions
When researchers want to understand a population, they rarely survey everyone. Instead, they take a random sample of size n from a population where each individual either does or does not have a certain characteristic. To make sense of this data, we use the sample proportion.
The sample proportion is denoted by $\hat{p}$ (read as "p-hat") and is given by the formula $\hat{p}=\frac{x}{n}$. In this formula, x represents the number of individuals in your sample that possess the specified characteristic. Ultimately, this sample proportion serves as a point estimate, which is a statistic that estimates the true population proportion, p.
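As a quick illustration, the formula can be sketched in Python. The function name `sample_proportion` and the counts (84 of 200) are made up for the example and do not come from any real study.

```python
def sample_proportion(x: int, n: int) -> float:
    """Return p-hat = x / n, the share of the sample with the characteristic."""
    if n <= 0:
        raise ValueError("sample size n must be positive")
    if not 0 <= x <= n:
        raise ValueError("x must be between 0 and n")
    return x / n


# Hypothetical sample: 84 of 200 respondents have the characteristic.
p_hat = sample_proportion(84, 200)
print(p_hat)  # 0.42
```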
Understanding the Sampling Distribution
If you take a simple random sample, you can map out the sampling distribution of $\hat{p}$. Here are the core rules that govern this distribution:
- The shape of the sampling distribution of the sample proportion is approximately normal, provided that $np(1-p)\ge10$.
- The mean of the sampling distribution is equal to the population proportion: $\mu_{\hat{p}}=p$.
- The standard deviation of this distribution is calculated as $\sigma_{\hat{p}}=\sqrt{\frac{p(1-p)}{n}}$.
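The three rules above can be sketched as a small Python helper. The name `sampling_distribution` and the values p = 0.3 and n = 200 are assumptions chosen only for illustration.

```python
import math


def sampling_distribution(p: float, n: int):
    """Return (mean, standard deviation, normality check) for p-hat."""
    mean = p                                  # mu_p-hat = p
    sd = math.sqrt(p * (1 - p) / n)           # sigma_p-hat = sqrt(p(1-p)/n)
    approx_normal = n * p * (1 - p) >= 10     # shape condition np(1-p) >= 10
    return mean, sd, approx_normal


# Hypothetical population proportion and sample size.
mean, sd, approx_normal = sampling_distribution(0.3, 200)
print(mean, round(sd, 4), approx_normal)  # 0.3 0.0324 True
```

With p = 0.5 and n = 30 the check fails, since 30(0.5)(0.5) = 7.5 is below 10.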
Building Confidence Intervals
Because a point estimate varies from sample to sample, we construct a confidence interval for the unknown parameter: an interval of numbers built around the point estimate and paired with a chosen level of confidence. The level of confidence is denoted as $(1-\alpha)\cdot100\%$ and represents the expected proportion of intervals that will contain the parameter if a large number of different samples are obtained.
A Crucial Statistical Caution: A 95% confidence interval does not mean there is a 95% probability that the interval contains the parameter. Because the parameter is a fixed, albeit unknown, value, the probability that the interval includes the parameter is either strictly 0 or 1.
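The long-run reading of the confidence level can be checked with a small simulation. The sketch below assumes a hypothetical population with p = 0.3 and samples of size 200, and it uses the bounds $\hat{p}\pm z_{\frac{\alpha}{2}}\cdot\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$ given in the next paragraph. The function name `coverage`, the seed, and the number of trials are illustrative choices.

```python
import math
import random
from statistics import NormalDist


def coverage(p, n, conf_level=0.95, trials=2000, seed=0):
    """Fraction of simulated intervals that contain the true proportion p."""
    rng = random.Random(seed)
    z = NormalDist().inv_cdf(1 - (1 - conf_level) / 2)  # critical value
    hits = 0
    for _ in range(trials):
        # Draw one sample of size n and count individuals with the trait.
        x = sum(rng.random() < p for _ in range(n))
        p_hat = x / n
        margin = z * math.sqrt(p_hat * (1 - p_hat) / n)
        if p_hat - margin <= p <= p_hat + margin:
            hits += 1
    return hits / trials


# Each individual interval either contains p or it does not;
# the confidence level describes the share that do over many samples.
print(coverage(p=0.3, n=200))
```

Under these assumptions the printed share should land near 0.95, though the exact value depends on the seed and the number of trials.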
To calculate a $(1-\alpha)\cdot100\%$ confidence interval for a population proportion, you will use a critical value from the standard normal (Z) distribution, denoted as $z_{\frac{\alpha}{2}}$. For example, a 95% confidence level has $\alpha=0.05$ and corresponds to a critical value of $z_{0.025}\approx1.96$. The bounds are calculated as follows:
- Lower bound: $\hat{p}-z_{\frac{\alpha}{2}}\cdot\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$.
- Upper bound: $\hat{p}+z_{\frac{\alpha}{2}}\cdot\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$.
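A minimal sketch of these bounds in Python follows. It looks up the critical value with `statistics.NormalDist().inv_cdf` from the standard library instead of a Z-table. The function name `proportion_ci` and the sample counts are hypothetical.

```python
import math
from statistics import NormalDist


def proportion_ci(x: int, n: int, conf_level: float = 0.95):
    """Return (lower, upper) bounds of the confidence interval for p."""
    p_hat = x / n
    alpha = 1 - conf_level
    z = NormalDist().inv_cdf(1 - alpha / 2)        # about 1.96 for 95%
    margin = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin


# Hypothetical sample: 84 of 200 respondents have the characteristic.
lower, upper = proportion_ci(84, 200)
print(round(lower, 4), round(upper, 4))  # roughly 0.3516 and 0.4884
```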
Keep in mind that tweaking your study design changes your interval. Opting for a higher level of confidence will lead to a wider interval. On the flip side, increasing the sample size n decreases the standard error, which shrinks the margin of error and creates a narrower confidence interval.
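Both effects can be seen numerically. The sketch below assumes a fixed hypothetical $\hat{p}=0.42$ and varies first the confidence level and then the sample size. `margin_of_error` is an illustrative helper name.

```python
import math
from statistics import NormalDist


def margin_of_error(p_hat: float, n: int, conf_level: float = 0.95) -> float:
    """Half-width of the confidence interval for a proportion."""
    z = NormalDist().inv_cdf(1 - (1 - conf_level) / 2)
    return z * math.sqrt(p_hat * (1 - p_hat) / n)


# Higher confidence -> larger margin (wider interval), with n fixed at 200.
for level in (0.90, 0.95, 0.99):
    print(level, round(margin_of_error(0.42, 200, level), 4))

# Larger n -> smaller margin (narrower interval), at 95% confidence.
for n in (100, 400, 1600):
    print(n, round(margin_of_error(0.42, n), 4))
```

Because n sits under a square root, quadrupling the sample size halves the margin of error.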
Determining the Right Sample Size
Before you even begin polling, you need to know how many people to ask. The sample size required to obtain a specific margin of error, E, depends on whether you have prior data:
- If you have a prior estimate of the population proportion, use the formula $n=\hat{p}(1-\hat{p})(\frac{z_{\frac{\alpha}{2}}}{E})^{2}$.
- If a prior estimate is unavailable, you should use the formula $n=0.25(\frac{z_{\frac{\alpha}{2}}}{E})^{2}$.
For both formulas, you must express the margin of error as a decimal and always round your final result up to the next integer.
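The two sample-size formulas can be sketched as one Python function. The name `required_sample_size`, the `prior` parameter, and the example margin of 3 percentage points are assumptions for illustration.

```python
import math
from statistics import NormalDist


def required_sample_size(margin: float, conf_level: float = 0.95, prior=None) -> int:
    """Smallest n that achieves margin of error E (given as a decimal)."""
    z = NormalDist().inv_cdf(1 - (1 - conf_level) / 2)
    # Without a prior estimate, use 0.25, the largest value p(1-p) can take.
    pq = 0.25 if prior is None else prior * (1 - prior)
    return math.ceil(pq * (z / margin) ** 2)   # always round up


print(required_sample_size(0.03))             # no prior estimate: 1068
print(required_sample_size(0.03, prior=0.3))  # prior estimate of 0.3: 897
```

The version without a prior is the conservative one: it never asks for fewer respondents than the version with a prior.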