Study Notes
Overview
Scatter graphs, also known as scatter diagrams or scattergrams, are one of the most practical and visually engaging topics in GCSE statistics. They allow you to explore whether a relationship exists between two variables - such as hours of study and exam performance, or the age of a car and its selling price. Edexcel assesses this topic through precise plotting of bivariate data, qualitative analysis of correlation, and the use of lines of best fit to make predictions. Candidates are expected to distinguish between reliable interpolation (estimating within the data range) and unreliable extrapolation (estimating beyond the observed data). This topic connects directly to other areas of statistics, including averages, data handling, and probability, and it frequently appears in both Foundation and Higher tier papers. Typical exam questions ask you to plot points, draw a line of best fit, describe the type of correlation, and make estimates - each step earning marks when executed with precision and clear working.
Key Concepts
Concept 1: Bivariate Data and Scatter Graphs
Bivariate data involves two variables measured for each item or individual. For example, you might record both the height and shoe size of students in a class, or the temperature and ice cream sales at a shop. A scatter graph displays this data visually by plotting one variable on the x-axis (the independent variable) and the other on the y-axis (the dependent variable). Each pair of values becomes a single point on the graph. The beauty of scatter graphs is that they reveal patterns at a glance - you can immediately see whether the variables are related and, if so, how. When plotting points, accuracy is crucial. Edexcel mark schemes typically award one mark for plotting all points correctly, with a tolerance of half a small square on graph paper. This means you must take your time, use a sharp pencil, and double-check each coordinate before moving on.
Example: Imagine you collect data on 10 students: their hours of revision (x-axis) and their test scores out of 100 (y-axis). Student A revised for 2 hours and scored 35%, so you plot the point (2, 35). Student B revised for 5 hours and scored 60%, giving the point (5, 60). Continue for all 10 students, and you'll have a scatter graph showing the relationship between revision time and performance.
Concept 2: Correlation
Correlation describes the relationship between the two variables on your scatter graph. There are three types of correlation you must be able to identify and describe:
Positive correlation occurs when as one variable increases, the other also increases. On a scatter graph, the points trend upward from left to right. For instance, more hours of revision generally lead to higher test scores - this is positive correlation. The stronger the correlation, the closer the points cluster around an imaginary straight line.
Negative correlation occurs when as one variable increases, the other decreases. The points trend downward from left to right. A classic example is the age of a car and its value - as cars get older, their price typically falls. Again, the strength of the correlation depends on how tightly the points cluster around a line.
No correlation means there is no apparent linear relationship between the variables. The points are scattered randomly across the graph with no discernible pattern. For example, shoe size and exam score have no logical connection, so you would expect no correlation.
Edexcel examiners are very particular about language here. Simply writing "positive" or "it goes up" is too vague and will not earn full marks. Instead, you must either use precise mathematical terminology ("There is a positive correlation between revision time and test score") or, even better, give a contextual description that references the specific variables: "As the number of hours of revision increases, the test score increases." This shows you understand what the data actually represents, not just the mathematical pattern.
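The three types of correlation can also be illustrated numerically. The sketch below is not part of the exam method, which is purely visual; it uses Pearson's correlation coefficient (a swapped-in technique not covered in these notes) to label small, hypothetical data sets. The function names, the data, and the 0.3 cut-off are all assumptions chosen for illustration.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient for paired data (beyond what these notes require)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

def describe_correlation(xs, ys, threshold=0.3):
    """Label the correlation; the 0.3 cut-off is an arbitrary illustrative choice."""
    r = pearson_r(xs, ys)
    if r >= threshold:
        return "positive"
    if r <= -threshold:
        return "negative"
    return "none"

# Hypothetical data sets, invented for illustration only.
revision_hours, test_scores = [1, 2, 3, 4, 5], [30, 42, 48, 61, 70]
car_ages, car_prices = [1, 2, 3, 4, 5], [9000, 7800, 7000, 5500, 5000]
shoe_sizes, exam_scores = [4, 5, 6, 7, 8], [55, 40, 52, 40, 55]

print(describe_correlation(revision_hours, test_scores))  # positive
print(describe_correlation(car_ages, car_prices))         # negative
print(describe_correlation(shoe_sizes, exam_scores))      # none
```

In an exam answer you would still describe the correlation in words and in context, as explained above.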
Concept 3: Line of Best Fit
The line of best fit is a straight line drawn through a scatter graph to represent the general trend of the data. It is not a dot-to-dot line connecting the points, nor is it a freehand curve. It is a single, straight line drawn with a ruler that passes roughly through the middle of the points, with approximately equal numbers of points above and below the line. Ideally, the line should pass through the mean point - that is, the point where x equals the average of all x-values and y equals the average of all y-values. The line should extend across the full range of the data, from the smallest x-value to the largest.
Drawing the line of best fit is a skill that improves with practice. Edexcel mark schemes award one mark for a correctly drawn line of best fit, and examiners look for a line that is straight, ruled, passes through or close to the mean point, and covers the full data range. Common mistakes include drawing the line too short, using a freehand squiggle, or forcing the line through outliers (points that don't fit the general pattern). Your line should ignore outliers and represent the overall trend.
Why does this work? The line of best fit summarises the relationship between the variables in a simple, visual way. It allows you to make predictions and estimates, which is the practical purpose of scatter graphs. By reducing a cloud of points to a single line, you can quickly see the trend and use it to answer questions.
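One way to sanity-check a line you have drawn is to count how many points sit above and below it. The sketch below does this for a hypothetical set of 10 students and a hypothetical candidate line y = 6x + 25.5; the data, the line, and the function name are assumptions made for illustration, not values from a real paper.

```python
def points_above_and_below(points, m, c):
    """Count data points lying strictly above and strictly below the line y = m*x + c."""
    above = sum(1 for x, y in points if y > m * x + c)
    below = sum(1 for x, y in points if y < m * x + c)
    return above, below

# Hypothetical (revision hours, test score) pairs for 10 students.
students = [(1, 30), (2, 35), (3, 45), (4, 50), (5, 60),
            (6, 62), (7, 70), (8, 72), (9, 80), (10, 81)]

# Hypothetical candidate line of best fit: y = 6x + 25.5.
above, below = points_above_and_below(students, 6, 25.5)
print(above, below)  # 6 4
```

A split of 6 above and 4 below is roughly balanced, which is what "approximately equal numbers of points above and below the line" asks for.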
Concept 4: Interpolation and Extrapolation
Once you have drawn your line of best fit, you can use it to estimate values. This is where the distinction between interpolation and extrapolation becomes critical.
Interpolation is when you estimate a value within the range of your data. For example, if your data covers revision times from 1 hour to 10 hours, and you use your line of best fit to estimate the test score for 6 hours of revision, that is interpolation. These estimates are generally considered reliable, particularly when the correlation is strong, because you are working within the observed pattern. The line of best fit is based on actual data points in this range, so your estimate is grounded in evidence.
Extrapolation is when you estimate a value outside the range of your data. If your data only goes up to 10 hours and you try to predict the score for 15 hours of revision, that is extrapolation. These estimates are unreliable because you are assuming the trend continues beyond what you have actually measured - and it might not. For instance, there may be a limit to how much revision helps, or the relationship might change at higher values.
Edexcel loves to test this distinction. A Higher tier question might ask you to estimate a value outside the data range and then comment on the reliability of your estimate. The correct approach is to make the estimate using your line of best fit (because the question asks you to), but then add a critical comment such as: "This estimate may be unreliable because it is based on extrapolation beyond the observed data range" or "This prediction assumes the trend continues, which may not be the case."
Example: Suppose your scatter graph shows data for revision times from 2 to 10 hours. If asked to estimate the score for 7 hours, you draw a vertical dashed line from x = 7 up to your line of best fit, then a horizontal dashed line across to the y-axis, and read off the value (say, 68%). This is interpolation and is reliable. If asked to estimate the score for 15 hours, you extend your line of best fit and repeat the process, but you must comment that this extrapolation is unreliable.
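The same reading-off process can be sketched in code. The example below assumes a hypothetical line of best fit y = 5x + 33, chosen only because it gives 68 at x = 7 to match the example above; the function name and the line are illustrative assumptions.

```python
def estimate(x, m, c, x_min, x_max):
    """Read y off the line y = m*x + c and say whether x lies inside the data range."""
    y = m * x + c
    kind = "interpolation" if x_min <= x <= x_max else "extrapolation"
    return y, kind

# Hypothetical line y = 5x + 33, with data covering 2 to 10 hours of revision.
print(estimate(7, 5, 33, 2, 10))   # (68, 'interpolation')
print(estimate(15, 5, 33, 2, 10))  # (108, 'extrapolation')
```

With this assumed line, the 15-hour estimate comes out at 108%, which is impossible for a test score. That is a concrete illustration of why extrapolated values need a comment on reliability.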
Concept 5: Outliers
An outlier is a data point that does not fit the general pattern of the scatter graph. It lies far away from the other points and the line of best fit. Outliers can occur for various reasons: measurement errors, unusual circumstances, or genuine exceptions to the trend. When drawing your line of best fit, you should ignore outliers - do not force your line to pass through them. If an exam question asks you to identify or comment on an outlier, you should explain why it doesn't fit the trend. For example: "This point is an outlier because the student scored much lower than expected for their revision time, possibly due to illness on the exam day."
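As a rough illustration, an outlier can be thought of as a point whose vertical distance from the line of best fit is much larger than that of the other points. The sketch below assumes a hypothetical line y = 5x + 33, invented data, and an arbitrary tolerance of 15 marks; in an exam you would identify outliers by eye rather than by calculation.

```python
def find_outliers(points, m, c, tolerance):
    """Return points whose vertical distance from y = m*x + c exceeds tolerance."""
    return [(x, y) for x, y in points if abs(y - (m * x + c)) > tolerance]

# Hypothetical (revision hours, test score) pairs; (8, 40) is well below the trend.
results = [(2, 44), (4, 52), (6, 64), (8, 40), (10, 82)]

# Hypothetical line y = 5x + 33 and an arbitrary tolerance of 15 marks.
print(find_outliers(results, 5, 33, 15))  # [(8, 40)]
```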
Mathematical Relationships
Scatter graphs do not typically involve formulas in the way that algebra or geometry do, but there are key mathematical concepts you must understand:
Mean Point: The mean point of a scatter graph is calculated by finding the mean (average) of all x-values and the mean of all y-values. If you have data points (x₁, y₁), (x₂, y₂), ..., (xₙ, yₙ), then the mean point is:
Mean point = (x̄, ȳ) where x̄ = (x₁ + x₂ + ... + xₙ) / n and ȳ = (y₁ + y₂ + ... + yₙ) / n
Your line of best fit should pass through or very close to this mean point.
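A minimal sketch of the mean point calculation, using hypothetical data for 10 students (the values and the function name are invented for illustration):

```python
def mean_point(points):
    """Return (mean of x-values, mean of y-values) for a list of (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    return mean_x, mean_y

# Hypothetical (revision hours, test score) pairs for 10 students.
data = [(1, 30), (2, 35), (3, 45), (4, 50), (5, 60),
        (6, 62), (7, 70), (8, 72), (9, 80), (10, 81)]

print(mean_point(data))  # (5.5, 58.5)
```

Here the x-values sum to 55 and the y-values sum to 585, so dividing each by n = 10 gives the mean point (5.5, 58.5).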
Gradient and Intercept (Higher Tier): While not always required, understanding that a line of best fit has a gradient (slope) and y-intercept can help. If the line has equation y = mx + c, then m is the gradient (how steep the line is) and c is where the line crosses the y-axis. A positive gradient indicates positive correlation; a negative gradient indicates negative correlation.
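To find m and c for a line you have drawn, one common approach is to read off two points that lie on the line (not necessarily data points) and work from those. The sketch below assumes two hypothetical readings, (2, 43) and (10, 83); the function name and the values are illustrative.

```python
def line_through(p1, p2):
    """Return gradient m and y-intercept c of the straight line through two points."""
    (x1, y1), (x2, y2) = p1, p2
    m = (y2 - y1) / (x2 - x1)  # change in y divided by change in x
    c = y1 - m * x1            # rearranged from y1 = m*x1 + c
    return m, c

# Two hypothetical points read off a ruled line of best fit.
m, c = line_through((2, 43), (10, 83))
print(m, c)  # 5.0 33.0
```

The gradient of 5 is positive, consistent with positive correlation: under these assumed readings, each extra hour of revision corresponds to about 5 more marks.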
Correlation vs Causation: This is not a formula but a critical concept. Just because two variables are correlated does not mean one causes the other. For example, ice cream sales and drowning incidents are positively correlated (both increase in summer), but eating ice cream does not cause drowning. Both are linked to a third factor: hot weather. Examiners may ask you to comment on this, so always think critically about what the data actually shows.
Practical Applications
Scatter graphs are used extensively in real-world contexts, and Edexcel often sets questions in practical scenarios to test your understanding:
- Economics and Business: Analysing the relationship between advertising spend and sales revenue, or between price and demand.
- Health and Fitness: Exploring the link between exercise hours and weight loss, or between age and reaction time.
- Science: Investigating how temperature affects the rate of a chemical reaction, or how plant height relates to the amount of fertiliser used.
- Education: Examining the connection between attendance and exam results, or between hours of study and grades.
In each case, scatter graphs allow you to visualise the data, identify trends, and make informed predictions. The key is to always interpret the graph in context - don't just describe the mathematical pattern, but explain what it means in the real-world scenario.