
Statistical Data and Hypothesis Testing

Master the art of summarising data and making informed decisions about populations using samples, from calculating the mean to testing a hypothesis.

Introduction

Assalamu alaikum, students. I'm Ustad Bilal Ahmed, and today we delve into the heart of Statistics, a branch of mathematics that is not just about numbers, but about understanding the world around us. From analysing the performance of our cricket team to predicting economic trends for Pakistan, statistics gives us the tools to find patterns in chaos, to summarise complex information, and to make decisions in the face of uncertainty.

In this topic, we will cover two main pillars. First, Descriptive Statistics, where we learn to describe and summarise a set of data using measures of central tendency (like the mean) and spread (like the standard deviation), and visual tools like histograms and box plots. This is the art of telling a clear story from raw numbers.

Second, we will step into Inferential Statistics with Hypothesis Testing. This is where the real power lies. We take a small sample of data – say, the exam scores of 50 students from NUST – and use it to make an educated guess, a formal inference, about the entire population of all NUST students. It's a structured way of answering questions like "Has the new teaching method improved scores?" or "Is this new medicine effective?". Mastering these concepts is crucial for your A Level exams and for any field you pursue that involves data.

Core Theory

Let's build our foundation, brick by brick.

1. Describing and Summarising Data

#### Measures of Central Tendency (Averages)

These tell us about the 'centre' of the data.

Mean (x̄): The most common average.
For ungrouped data: x̄ = Σx / n
For grouped data: x̄ = Σfx / Σf, where 'f' is frequency and 'x' is the class midpoint.
Median: The middle value when data is ordered. For grouped data, it's an estimate found using linear interpolation on the cumulative frequency curve.
Formula: Median = L + [ (n/2 - F) / f_m ] * c
L = lower class boundary of the median class
n = total frequency (Σf)
F = cumulative frequency *before* the median class
f_m = frequency of the median class
c = class width of the median class
Mode: The most frequently occurring value. For grouped data, it's the class with the highest frequency (modal class).
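To see the grouped-data formulae in action, here is a minimal Python sketch. The function names and the small three-class data set are hypothetical, chosen only for illustration; the arithmetic follows the mean and median formulae given above.

```python
def grouped_mean(midpoints, freqs):
    """Estimate the mean of grouped data as sum(f*x) / sum(f)."""
    return sum(f * x for x, f in zip(midpoints, freqs)) / sum(freqs)


def grouped_median(lower_bounds, widths, freqs):
    """Estimate the median by linear interpolation inside the median class."""
    n = sum(freqs)
    target = n / 2            # position of the median, n/2
    cumulative = 0            # F: cumulative frequency before the current class
    for L, c, f in zip(lower_bounds, widths, freqs):
        if f > 0 and cumulative + f >= target:
            # Median = L + [(n/2 - F) / f_m] * c
            return L + (target - cumulative) / f * c
        cumulative += f
    raise ValueError("frequencies must contain at least one positive value")


# Hypothetical classes 0-10, 10-20, 20-30 with frequencies 5, 10, 5
lower = [0, 10, 20]
width = [10, 10, 10]
freq = [5, 10, 5]
mid = [5, 15, 25]

print(grouped_mean(mid, freq))             # 15.0
print(grouped_median(lower, width, freq))  # 15.0
```

Because this made-up distribution is symmetric, the mean and the median estimates agree; with skewed data they would differ.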

#### Measures of Spread (Dispersion)

These tell us how spread out the data is.

Range: Highest value - Lowest value. Simple, but sensitive to outliers.
Interquartile Range (IQR): Q3 - Q1. It measures the spread of the middle 50% of the data, making it resistant to outliers. Q1 (lower quartile) and Q3 (upper quartile) are found similarly to the median using interpolation, but with n/4 and 3n/4.
Variance (σ² or s²) and Standard Deviation (σ or s): The most important measures of spread. The variance is the mean of the squared deviations from the mean, and the standard deviation is its square root, so it is measured in the same units as the data. A small standard deviation means the data are clustered tightly around the mean.
Ungrouped Variance: σ² = (Σx²) / n - (x̄)²
Grouped Variance: σ² = (Σfx²) / Σf - (x̄)²
Standard Deviation is simply the square root of the variance: σ = √Variance.
Important Note for 9709: These formulae divide by n and give the variance and standard deviation of the data set itself. Used as estimates of the population variance they are slightly biased; the 'unbiased' estimator uses an 'n-1' denominator instead. You are not typically required to use the 'n-1' version unless specifically asked.
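The difference between the 'n' and 'n-1' denominators is easy to check numerically. The sketch below uses a small hypothetical data set and Python's standard `statistics` module, where `pvariance` divides by n and `variance` divides by n-1.

```python
import math
import statistics

data = [4, 8, 6, 5, 7]  # hypothetical ungrouped data

n = len(data)
mean = sum(data) / n

# Formula from the notes: variance = (sum of x^2) / n - mean^2
var_n = sum(x * x for x in data) / n - mean ** 2
sd_n = math.sqrt(var_n)

# Library equivalents: pvariance divides by n, variance divides by n - 1
var_library_n = statistics.pvariance(data)
var_unbiased = statistics.variance(data)

print(var_n, var_library_n)  # both 2.0 (up to floating-point rounding)
print(round(sd_n, 3))        # 1.414
print(var_unbiased)          # 2.5
```

The 'n-1' value is always slightly larger, and the gap shrinks as the sample size grows.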

#### Data Representation

Histograms: Used for continuous data. The y-axis is **Frequency Density** (Frequency / Class Width), not frequency. The *area* of each bar is proportional to the frequency.
Box-and-Whisker Plots: A brilliant summary showing the minimum, Q1, median, Q3, and maximum. Excellent for comparing distributions side-by-side. Outliers, often defined as points more than 1.5 x IQR below Q1 or above Q3, can be plotted separately.
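As a quick illustration of both ideas, the sketch below computes frequency densities for classes of unequal width and the 1.5 x IQR outlier fences. The function names, class boundaries and quartile values are hypothetical.

```python
def frequency_densities(boundaries, freqs):
    """Frequency density = frequency / class width.

    `boundaries` lists the class boundaries in order, so it has one
    more entry than `freqs`.
    """
    return [f / (hi - lo)
            for lo, hi, f in zip(boundaries, boundaries[1:], freqs)]


def outlier_fences(q1, q3):
    """Return (lower fence, upper fence) using the 1.5 x IQR rule."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


# Hypothetical classes 0-10, 10-20, 20-40 with frequencies 5, 10, 10
boundaries = [0, 10, 20, 40]
freqs = [5, 10, 10]
densities = frequency_densities(boundaries, freqs)
print(densities)                # [0.5, 1.0, 0.5]

# Hypothetical quartiles Q1 = 20, Q3 = 30
low, high = outlier_fences(20, 30)
print(low, high)                # 5.0 45.0
```

Notice that the last two classes have the same frequency, but the wider class gets a shorter bar. Its area (0.5 x 20 = 10) still matches its frequency.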

2. Correlation and Regression

Correlation: Measures the strength and direction of a *linear* relationship between two variables.
Pearson's Product-Moment Correlation Coefficient (r): A value between -1 and +1. +1 is perfect positive linear correlation, -1 is perfect negative, and 0 is no linear correlation. You will use your calculator to find this.
Spearman's Rank Correlation Coefficient (r_s): Used when the relationship is consistently increasing or decreasing but not necessarily linear, or when the data is ordinal (can be ranked). You rank each dataset and apply the PMCC formula to the ranks.
Regression: If correlation is strong, we can find a line of best fit.
Line of Regression (y on x): y = a + bx. This line is used to predict values of y for given values of x. Your calculator will find 'a' and 'b'.
Interpolation vs. Extrapolation: Predicting a y-value for an x-value *within* the range of your data is called interpolation and is generally reliable. Predicting *outside* the range is extrapolation and is unreliable. You must comment on this in exams!
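In the exam your calculator does this work, but it helps to see what it is computing. The following is a rough sketch with hypothetical data and helper names. It finds r and the regression line from the sums S_xx, S_yy and S_xy, applies the PMCC formula to ranks for Spearman's coefficient (assuming no tied values), and flags whether a prediction is an interpolation.

```python
import math


def summary_sums(xs, ys):
    """Return the means and the sums S_xx, S_yy, S_xy."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return x_bar, y_bar, sxx, syy, sxy


def pmcc(xs, ys):
    """Pearson's r = S_xy / sqrt(S_xx * S_yy)."""
    _, _, sxx, syy, sxy = summary_sums(xs, ys)
    return sxy / math.sqrt(sxx * syy)


def regression_y_on_x(xs, ys):
    """Return (a, b) for the line y = a + bx."""
    x_bar, y_bar, sxx, _, sxy = summary_sums(xs, ys)
    b = sxy / sxx
    return y_bar - b * x_bar, b


def ranks(values):
    """Rank from 1 (smallest) upward; this sketch assumes no ties."""
    ordered = sorted(values)
    return [ordered.index(v) + 1 for v in values]


def spearman(xs, ys):
    """Spearman's r_s: the PMCC applied to the ranks."""
    return pmcc(ranks(xs), ranks(ys))


def predict(xs, a, b, x_new):
    """Return (predicted y, True if x_new lies within the observed x range)."""
    return a + b * x_new, min(xs) <= x_new <= max(xs)


# Hypothetical data
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
a, b = regression_y_on_x(xs, ys)
print(round(pmcc(xs, ys), 3))    # about 0.775
print(round(a, 3), round(b, 3))  # 2.2 0.6
print(predict(xs, a, b, 3.5))    # interpolation: second value is True
print(predict(xs, a, b, 10))     # extrapolation: second value is False

# A curved but always-increasing relationship (y = x squared)
ys_curved = [1, 4, 9, 16, 25]
print(round(pmcc(xs, ys_curved), 3))      # below 1, as the points are not on a line
print(round(spearman(xs, ys_curved), 3))  # 1.0, as the rank orders agree perfectly
```

The last two lines show why Spearman's coefficient is useful: the ranks agree perfectly even though the points do not lie on a straight line.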

3. Hypothesis Testing

This is a formal procedure to test a claim about a population parameter (like the population mean, μ).

The 5-Step Process:

State Hypotheses:

Null Hypothesis (H₀): The 'no change' or 'no effect' hypothesis. It always contains an equality (e.g., μ = 20).
Alternative Hypothesis (H₁): What we are trying to show evidence for. It determines the type of test:
Two-tailed: H₁: μ ≠ 20 (testing for a *change*)
One-tailed (upper): H₁: μ > 20 (testing for an *increase*)
One-tailed (lower): H₁: μ < 20 (testing for a *decrease*)

Determine the Significance Level (α): This is the probability of rejecting H₀ when it is actually true (a Type I error). It's usually given, e.g., 5% (α = 0.05).
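Calculate the Test Statistic: This is a value calculated from your sample data. For a test of the mean of a Normal population with known standard deviation σ (or of any population when the sample is large, using the Central Limit Theorem), the test statistic is the following, where x̄ is the sample mean, μ is the mean stated in H₀ and n is the sample size: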

Z = (x̄ - μ) / (σ / √n)

Find the Critical Region: This is the 'rejection zone'. We find the critical value(s) from statistical tables corresponding to our significance level.

For a 5% two-tailed test, we look for the Z-values that cut off 2.5% in each tail.
For a 5% one-tailed upper test, we look for the Z-value that cuts off 5% in the upper tail.

Make a Conclusion:

If the test statistic falls into the critical region, we reject H₀.
If it does not, we do not reject H₀.
Your final conclusion must be written in the context of the original problem. For example: "There is sufficient evidence at the 5% significance level to suggest that the mean traffic has increased." Never say "We accept H₀". The correct phrasing is that there is "insufficient evidence to reject H₀".
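The critical values you read from tables can also be found with an inverse Normal function. Here is a small sketch using `NormalDist` from Python's standard library. The function names and the "two"/"upper"/"lower" labels are my own choices for illustration.

```python
from statistics import NormalDist


def critical_values(alpha, tail):
    """Return (lower, upper) critical z-values.

    None means the critical region has no boundary on that side.
    """
    z = NormalDist()  # standard Normal: mean 0, standard deviation 1
    if tail == "two":
        c = z.inv_cdf(1 - alpha / 2)  # alpha/2 in each tail
        return -c, c
    if tail == "upper":
        return None, z.inv_cdf(1 - alpha)
    if tail == "lower":
        return z.inv_cdf(alpha), None
    raise ValueError("tail must be 'two', 'upper' or 'lower'")


def reject_h0(z_stat, alpha, tail):
    """True if the test statistic falls in the critical region."""
    lower, upper = critical_values(alpha, tail)
    in_lower = lower is not None and z_stat < lower
    in_upper = upper is not None and z_stat > upper
    return in_lower or in_upper


print(critical_values(0.05, "two"))    # roughly (-1.960, 1.960)
print(critical_values(0.05, "upper"))  # roughly (None, 1.645)
print(reject_h0(1.8, 0.05, "two"))     # False
print(reject_h0(1.8, 0.05, "upper"))   # True
```

The last two lines show why the choice of H₁ matters: the same test statistic of 1.8 is significant in a one-tailed upper test at 5% but not in a two-tailed test.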

Key Definitions

Mean (x̄ for a sample, μ for a population): The sum of all values divided by the number of values.
Median: The middle value of an ordered dataset.
Mode: The most frequent value in a dataset.
Variance (σ²): The average of the squared differences from the Mean. A measure of spread.
Standard Deviation (σ): The square root of the variance, representing the typical deviation of a value from the mean.
Interquartile Range (IQR): The range of the middle 50% of the data (Q3 - Q1).
Frequency Density: The value plotted on the y-axis of a histogram, calculated as Frequency / Class Width.
Outlier: An extreme value that lies outside the overall pattern of data. Often calculated as being below Q1 - 1.5(IQR) or above Q3 + 1.5(IQR).
Correlation: A measure of the extent to which two variables are linearly related.
Regression Line: A line of best fit (y = a + bx) used to make predictions based on one variable from another.
Null Hypothesis (H₀): A statement about a population parameter that is assumed to be true unless the sample provides sufficient evidence against it. It always contains an equality.
Alternative Hypothesis (H₁): A statement that contradicts the null hypothesis and is what the test aims to find evidence for.
Significance Level (α): The probability of making a Type I error (rejecting a true null hypothesis). This defines the threshold for "statistical significance".
Critical Region: The set of values for the test statistic for which the null hypothesis is rejected.
Type I Error: Rejecting H₀ when H₀ is true. The probability of this is α.
Type II Error: Failing to reject H₀ when H₀ is false. The probability of this is β.

Worked Examples (Pakistani Context)

Example 1: Grouped Frequency Data (NUST Entry Test Scores)

The scores, x, of 200 students in a mock NUST entry test are summarised in the table below.

| Score (x)     | Frequency (f) |
|---------------|---------------|
| 60 ≤ x < 80   | 24            |
| 80 ≤ x < 100  | 50            |
| 100 ≤ x < 120 | 76            |
| 120 ≤ x < 140 | 35            |
| 140 ≤ x < 160 | 15            |

(i) Calculate an estimate for the mean and standard deviation of the scores.

(ii) Estimate the median score.

Solution:

First, we need midpoints (x) and the values fx and fx².

| Score (x)     | Midpoint (x) | f   | fx    | x²    | fx²     |
|---------------|--------------|-----|-------|-------|---------|
| 60 ≤ x < 80   | 70           | 24  | 1680  | 4900  | 117600  |
| 80 ≤ x < 100  | 90           | 50  | 4500  | 8100  | 405000  |
| 100 ≤ x < 120 | 110          | 76  | 8360  | 12100 | 919600  |
| 120 ≤ x < 140 | 130          | 35  | 4550  | 16900 | 591500  |
| 140 ≤ x < 160 | 150          | 15  | 2250  | 22500 | 337500  |
| Totals        |              | 200 | 21340 |       | 2371200 |

(i) Mean and Standard Deviation

Mean (x̄):

x̄ = Σfx / Σf = 21340 / 200 = 106.7

Variance (σ²):

σ² = (Σfx²) / Σf - (x̄)²

σ² = 2371200 / 200 - (106.7)²

σ² = 11856 - 11384.89 = 471.11

Standard Deviation (σ):

σ = √471.11 = 21.7 (3 s.f.)

(ii) Median

We need the cumulative frequency (F).

| Score (x) | f  | F   |
|-----------|----|-----|
| < 80      | 24 | 24  |
| < 100     | 50 | 74  |
| < 120     | 76 | 150 |
| < 140     | 35 | 185 |
| < 160     | 15 | 200 |

Find the median position: n/2 = 200/2 = 100.
Identify the median class: The 100th value falls in the 100 ≤ x < 120 class (as F=74 before it and F=150 in it).
Apply the interpolation formula:

L = 100
n/2 = 100
F (cumulative frequency *before*) = 74
f_m (frequency *of* the class) = 76
c (class width) = 20

Median = 100 + [ (100 - 74) / 76 ] * 20

Median = 100 + [ 26 / 76 ] * 20

Median = 100 + 6.842... = 106.8 (1 d.p.)
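If you want to check the arithmetic in Example 1, a short Python sketch like the one below reproduces the totals, the mean, the standard deviation and the median. The variable names are my own; the class boundaries and frequencies come from the table in the question.

```python
import math

# (lower boundary, upper boundary, frequency) for each class in Example 1
classes = [
    (60, 80, 24),
    (80, 100, 50),
    (100, 120, 76),
    (120, 140, 35),
    (140, 160, 15),
]

n = sum(f for _, _, f in classes)
sum_fx = sum(f * (lo + hi) / 2 for lo, hi, f in classes)
sum_fx2 = sum(f * ((lo + hi) / 2) ** 2 for lo, hi, f in classes)

mean = sum_fx / n
variance = sum_fx2 / n - mean ** 2
sd = math.sqrt(variance)

# Median by linear interpolation: L + [(n/2 - F) / f_m] * c
target = n / 2
cumulative = 0  # F, the cumulative frequency before the current class
for lo, hi, f in classes:
    if cumulative + f >= target:
        median = lo + (target - cumulative) / f * (hi - lo)
        break
    cumulative += f

print(n, sum_fx, sum_fx2)  # 200 21340.0 2371200.0
print(round(mean, 1))      # 106.7
print(round(variance, 2))  # 471.11
print(round(sd, 1))        # 21.7
print(round(median, 1))    # 106.8
```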

Example 2: Hypothesis Test (Lahore Metro Bus)

The Lahore Metro Bus Authority claims the mean journey time from Gajju Matta to Shahdara is 75 minutes. A transport analyst believes that due to increased traffic, the time has increased. They record the journey time for a random sample of 50 trips, finding a sample mean of 77.2 minutes. Assuming the population standard deviation of journey times is 8 minutes, test the analyst's belief at the 5% significance level.

Solution:

We follow the 5-step process.

State Hypotheses:

Let μ be the population mean journey time.

H₀: μ = 75 (The mean time is still 75 minutes)
H₁: μ > 75 (The mean time has increased - this is a one-tailed test)

Significance Level:

α = 5% = 0.05

Calculate the Test Statistic:

The population is assumed to be Normal; in any case, n = 50 is large enough for the Central Limit Theorem to apply, so the sample mean is approximately Normally distributed.
x̄ = 77.2, μ = 75, σ = 8, n = 50
Z = (x̄ - μ) / (σ / √n)
Z = (77.2 - 75) / (8 / √50)
Z = 2.2 / 1.131... = 1.945 (3 d.p.)

Find the Critical Region:

This is a one-tailed test at the 5% level. We need the Z-value that cuts off the top 5% of the Normal distribution.
From tables (or calculator inverse normal function), the critical value is Z_crit = 1.645.
The critical region is Z > 1.645.

Make a Conclusion:

Our test statistic is Z = 1.945.
Since 1.945 > 1.645, our test statistic falls in the critical region.
Therefore, we reject H₀.
Conclusion in context: There is sufficient evidence at the 5% significance level to support the analyst's belief that the mean journey time has increased.
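The whole of Example 2 can be reproduced in a few lines of Python. This is only a sketch with my own variable names. It also prints a p-value, which is not part of the worked solution: it is an equivalent way of reaching the same decision, because the p-value is below α exactly when Z lies in the critical region.

```python
import math
from statistics import NormalDist

# Values from Example 2
x_bar = 77.2  # sample mean
mu_0 = 75     # mean claimed under H0
sigma = 8     # population standard deviation (assumed known)
n = 50        # sample size
alpha = 0.05  # significance level

standard_error = sigma / math.sqrt(n)
z = (x_bar - mu_0) / standard_error

std_normal = NormalDist()            # standard Normal distribution
z_crit = std_normal.inv_cdf(1 - alpha)  # upper-tail critical value
p_value = 1 - std_normal.cdf(z)         # P(Z > z) under H0

reject_h0 = z > z_crit

print(round(z, 3))       # 1.945
print(round(z_crit, 3))  # 1.645
print(reject_h0)         # True
print(p_value < alpha)   # True, the same decision
```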

Exam Technique

Read Carefully: Pay close attention to keywords. "Increased" or "decreased" implies a one-tailed test. "Changed" or "different" implies a two-tailed test.
Show Your Working: For grouped data questions, always show your Σfx and Σfx² values. These are method marks you cannot afford to lose.
Histograms: Remember Frequency Density = Frequency / Class Width. Label your axes clearly. A common mistake is to plot frequency on the y-axis.
Hypothesis Testing Structure: Always follow the 5 steps. Marks are explicitly awarded for (1) hypotheses, (2) test statistic, (3) comparison with critical value, and (4) contextual conclusion.
State Hypotheses Correctly: Use population parameters (μ, p), not sample statistics (x̄, p̂). H₀ must have the equality.
Context is King: Your final conclusion for a hypothesis test must not be "Reject H₀". It must be a sentence that refers back to the original problem (e.g., journey times, student scores, product weights).
Rounding: Do not round intermediate calculations. Use the memory function on your calculator. Give your final answer to 3 significant figures unless specified otherwise.
"Show that" questions: You are given the answer. You must provide a fully reasoned, step-by-step argument to prove how to get there. Every step must be logical and clear.

Key Points to Remember

1. Histograms use frequency density on the y-axis, and the area of each bar represents the frequency.
2. The median for grouped data is an estimate found using linear interpolation on the cumulative frequency data.
3. Standard deviation is the square root of the variance and measures the typical spread of data around the mean.
4. Correlation does not imply causation; it only measures the strength and direction of a linear association.
5. The regression line y = a + bx should only be used for interpolation within the given data range; extrapolation is unreliable.
6. A hypothesis test uses sample data to make a formal inference about a population parameter.
7. The null hypothesis (H₀) always contains the equality sign (=, ≤, or ≥) and represents the 'no change' or 'status quo' scenario.
8. A hypothesis test conclusion must be contextual, stating whether there is sufficient evidence to support the alternative hypothesis at the given significance level.

Pakistan Example

Analysing PSL Cricketers' Batting Performance

We can use descriptive statistics like mean and standard deviation to compare the consistency and scoring power of batsmen like Babar Azam and Mohammad Rizwan in the Pakistan Super League (PSL). Hypothesis testing could then be used to determine if a change in a player's batting position has had a statistically significant effect on their average score at a 5% level of significance.
