Effect sizes: Bigger is better, right?

Teachers want what’s best for their students and are willing to try almost anything to improve their results. However, it can be hard to unpack research findings or navigate marketing spin to know which teaching practices, professional development or other programs are worth investing time and money into.

When it comes to evaluating educational research, the ‘effect size’ is the statistical tool that has driven decision-making over the past decade. A program or teaching technique that demonstrates a large ‘effect size’ on outcomes is thought to be good, and worthy of our attention.

Spoiler alert: bigger is not always better, and the reliance on effect sizes to summarise impact has led to some programs and practices being seen as more valuable than they really are.

I’m going to use the example of a hypothetical research project that examines the impact of giving feedback to students on their achievement on a common test. But the ways to scrutinise research that I delve into can be applied to any educational research.

In my example, we have two groups of students, one whose teachers are taught to use a specific method of feedback (the intervention group) and one whose teachers don’t get the intervention (the control group). In our hypothetical study, the intervention has a positive effect. The average achievement gain on the test is 12 points higher for the intervention students compared to the control students.

It’s really hard to know whether an extra 12 points is a lot. But if we divide that difference by the spread of scores in our groups (the standard deviation, which is 100 in this case), the 12 points can be expressed as a proportion: an effect size of 0.12. In other words, the difference between the groups was equal to 12 per cent of one standard deviation (I know … but hang in there).

Because the effect of the intervention is now expressed as a proportion of the sample of scores it was taken from, the results can be compared to other studies and/or to ‘expected’ achievement growth.
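To make the arithmetic concrete, here is a minimal sketch of that calculation in Python. The gain figures are invented purely to match the hypothetical 12-point difference above; they aren’t data from any real study.

```python
# A minimal sketch of the effect size calculation described above.
# The gain figures are invented to match the hypothetical study; they are not real data.
mean_gain_intervention = 62   # hypothetical average test gain for the intervention group
mean_gain_control = 50        # hypothetical average test gain for the control group
standard_deviation = 100      # spread of scores, as in the example above

effect_size = (mean_gain_intervention - mean_gain_control) / standard_deviation
print(effect_size)  # 0.12 – the 12-point difference as a proportion of one standard deviation
```

In practice, researchers usually divide by a pooled standard deviation calculated from both groups’ scores, but the idea is the same: a raw score difference re-expressed as a proportion of the spread.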

An effect size between 0.4 and 0.6 has been popularised as a measure that indicates a high-impact program or practice. But as I am going to show you, on its own an effect size gives no information about the research design, the quality of the research, or whether the outcomes are meaningful across schooling populations.

This is the information teachers, schools and school leaders need to be scrutinising when deciding whether the programs or practices they implement are likely to deliver outcomes. There can be a lot riding on an effect size, but there is also a lot hidden within one. Let’s dive in.

Types of research

There are three broad types of education research that report effect sizes – correlational, quasi-experimental and experimental research.

Correlational studies are considered lower quality evidence. A Harvard student mapped some fascinating spurious correlations, like the relationship between margarine consumption and divorce rates. To return to our feedback example: if you measured how much quality feedback you give each of your students and also measured their test scores, you may find a relationship.

While interesting, this doesn’t account for other factors because there is no control group to compare to, and it tells you nothing about whether changing your teaching would change the results – remember, correlation is not causation (it only tells you that a relationship exists).

Quasi-experimental studies may compare groups receiving different programs or practices – like higher quality feedback in one class and direct instruction in another – or compare a single group receiving feedback to what is considered ‘normal’ growth. While informative, these studies lack a key element of high-quality research: random allocation by an unbiased computer.

People are naturally, and often unconsciously, biased, so without random allocation we can’t be confident that a difference between our groups is due solely to feedback rather than to how participants were grouped. To take an extreme example, if one group was made up of schools in an advantaged area and the other of schools in a disadvantaged area, it would be hard to know whether it was the feedback that made the difference, or other factors.

Experimental studies have random allocation at their heart. They provide the highest quality evidence of impact because they allow us to conclude that any difference between the groups after implementing the program or practice is due solely to that program or practice. These studies still need to be run to a set of standards, and there are levels of quality within experimental research, so results may still be questionable.
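To make that concrete, here is a minimal Python sketch of random allocation for a simple two-arm trial. The school names and the 80-school setup are placeholders for illustration, not details from any real study.

```python
# A minimal sketch of random allocation in a hypothetical two-arm trial.
# The school names are placeholders, not schools from any real study.
import random

schools = [f"School {i}" for i in range(1, 81)]  # e.g. one class from each of 80 schools

random.shuffle(schools)                      # chance, not human judgement, does the grouping
midpoint = len(schools) // 2
intervention_group = schools[:midpoint]      # teachers here are taught the feedback method
control_group = schools[midpoint:]           # teachers here continue business as usual

print(len(intervention_group), len(control_group))  # 40 40
```

Because an unbiased shuffle decides who gets the intervention, any later difference between the groups is far harder to explain away by how participants were grouped.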

Outcomes measured

Broadly, studies measure two types of outcomes – proximal and distal. Proximal outcomes are more closely related to the intervention, such as how much you incorporate feedback in your teaching practice after doing professional development on feedback.

Distal outcomes are those beyond the specific target of an intervention, like student results in measures such as NAPLAN or Progressive Assessment Tests. Proximal outcomes are easier to change than distal outcomes, but distal outcomes – especially large-scale standardised measures such as NAPLAN – are likely to give a better indication of any impact on students.

Sample size

Small samples (fewer than 500 students) drawn from a small number of schools (4-7) are less likely to represent the broader population of students within the schools that make up a system. As such, small samples make the effect size estimate more volatile and less representative of the change a program or practice may have across an entire education system.

Conversely, larger samples (more than 2,000 students) spread across more schools (say, one class from each of 70-80 schools) are more representative of the broader system and produce results that are more likely to reflect the true effect.
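To see why this matters, here is a rough simulation sketch in Python. It assumes a ‘true’ effect of 0.12 and a standard deviation of 100 (the figures from the earlier hypothetical study), and the group sizes are illustrative; none of the numbers come from the real studies discussed below.

```python
# A rough simulation of why small samples produce volatile effect size estimates.
# The 'true' effect (0.12) and standard deviation (100) are the hypothetical figures
# from earlier in this article, not results from any real study.
import random
import statistics

def estimated_effect_size(n_per_group, true_effect=0.12, sd=100):
    control = [random.gauss(0, sd) for _ in range(n_per_group)]
    intervention = [random.gauss(true_effect * sd, sd) for _ in range(n_per_group)]
    return (statistics.mean(intervention) - statistics.mean(control)) / sd

for n in (100, 2000):
    estimates = [estimated_effect_size(n) for _ in range(1000)]
    print(f"{n} students per group: estimates range roughly "
          f"from {min(estimates):.2f} to {max(estimates):.2f}")
```

With 100 students per group the estimated effect swings widely (and can even come out negative); with 2,000 per group it clusters tightly around the true value of 0.12.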

So what is quality research?

When evaluating a program, practice, or key set of behaviours, high quality research uses experimental designs, standardised distal measures, and large samples drawn from many different schools.

Let’s look at some real-world examples of studies evaluating the impact of feedback and their reported effect sizes. Study 1 (Hattie, 2008), study 2 (Fuchs & Fuchs, 1986) and study 3 (Graham et al., 2015) are examples of research considered to be low in quality – quasi-experimental (and even some correlational) designs, mostly proximal outcomes and small samples. Together they demonstrate an average effect size of 0.69 (about eight months’ additional growth).

Study 4 (Coe et al., 2011), study 5 (Kozlow & Bellamy, 2004) and study 6 (Speckesser et al., 2018), however, are considered high quality – experimental designs, distal outcomes and large samples – and produce an average effect of 0.08 (about one month’s additional growth).

That’s a big difference!

Despite demonstrating far lower effect sizes, findings from high quality research are highly generalisable because the research is designed to remove factors that call into question whether it was actually the intervention that caused the results. The outcomes of these studies are also more closely connected to students’ overall academic achievement than those of low quality studies.

The good news for feedback is that, despite the drop in anticipated effect when evaluated using high-quality research, it still displays a positive effect (go you good thing!). One-third of interventions designed to improve student achievement displayed a negative effect (did worse than a control group) when they were evaluated using high-quality research methods (Lortie-Forgues & Inglis, 2019).

What do I look for?

The Institute of Education Sciences (USA), Education Endowment Foundation (UK), and Evidence for Learning (Australia) are all funding and brokering high quality evaluation and synthesis of education initiatives. It is hoped that these organisations and the newly formed Australian Education Research Organisation (AERO) will be a starting point for examination of high-quality evidence.

Choosing initiatives that have demonstrated effects under high quality evaluation means you are less likely to undertake a program that will have little, no, or even a negative impact on teaching and learning outcomes.

When scrutinising research, remember that evidence of scaling is not evidence of impact. A program undertaken in multiple countries by any number of schools is evidence of a good marketing team, not high-quality evidence of impact. Be aware, too, of the catchphrase ‘based on research’, which means it’s rooted in theory but may never have been rigorously tested itself.

In short, you can have the greatest impact on classroom practice and student outcomes if you:

  1. Invest in programs that have demonstrated effects under rigorous evaluation.
  2. Look beyond research ‘effect size’ to scrutinise study design, outcomes tested and sample size.

And once an investment is made, ensure the program is well implemented and evaluated to see if it works in your specific school context.

References

Coe, M., Hanita, M., Nishioka, V., & Smiley, R. (2011). An investigation of the impact of the 6+1 Trait Writing Model on grade 5 student writing achievement: Final report (NCEE 2012-4010). National Center for Education Evaluation and Regional Assistance.

Fuchs, L. S., & Fuchs, D. (1986). Effects of systematic formative evaluation: A meta-analysis. Exceptional Children, 53(3), 199-208.

Graham, S., Hebert, M., & Harris, K. R. (2015). Formative assessment and writing: A meta-analysis. The Elementary School Journal, 115(4), 523-547.

Hattie, J. (2008). Visible learning: A synthesis of over 800 meta-analyses relating to achievement. Routledge.

Kozlow, M., & Bellamy, P. (2004). Experimental study on the impact of the 6+1 Trait Writing Model on student achievement in writing. Portland, OR: Northwest Regional Educational Laboratory.

Lortie-Forgues, H., & Inglis, M. (2019). Rigorous large-scale educational RCTs are often uninformative: Should we be concerned? Educational Researcher, 48(3), 158-166.

Speckesser, S., Runge, J., Foliano, F., Bursnall, M., Hudson-Sharp, N., Rolfe, H., & Anders, J. (2018). Embedding formative assessment: Evaluation report and executive summary.