Issues in synthesizing evaluations on educational interventions

Synthesizing the results of interventions can provide valuable information, but needs care, says Alan Cheung

Evidence-based education reform has been gaining tremendous momentum in the past two decades. For evidence-based reform to work, scientifically valid and unbiased research reviews are critical. Though a great number of reviews of various educational interventions have been carried out, the methods employed in these syntheses vary substantially, leading to inconsistent conclusions. Many original evaluations of educational interventions suffer from serious methodological problems, such as the lack of a control group, brief duration, small sample sizes, limited evidence of initial equivalence between the treatment and control groups, or questionable outcome measures. Since studies with poor methodologies tend to report much higher effect sizes than those with more rigorous methods, failing to screen out these studies inflates the average effect sizes of these reviews. The purpose of this article is to highlight some of the key issues in synthesizing evaluations on educational interventions.

What we know
● For evidence-based reform to work, scientifically valid and unbiased research reviews are critical.
● Many evaluations of educational interventions suffer from methodological problems.
● Common problems include a lack of a control group, brief duration, small sample sizes, limited evidence of initial equivalence, and questionable outcome measures.
● Failing to exclude these studies inflates the average effect sizes of these research syntheses.

Problems with previous reviews

No control group

Many reviews of educational interventions or programs include studies that do not have a traditionally taught control group. For example, in his review, Liao (1998) included a total of 35 studies that examined the effects of hypermedia on student achievement. Five of these studies were one-group repeated measures without a traditional control group. What he found was that the average effect size of these five repeated measures studies (+1.83) was much larger than that of studies with a control group (+0.18). Lacking a control group, of course, a pre-post design attributes any growth in achievement to the program, rather than to normal, expected gain.
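The arithmetic behind this point can be sketched in a few lines of Python. The scores below are invented for illustration and are not taken from Liao (1998) or any other study; the sketch only assumes that an effect size is a mean difference divided by a standard deviation.

```python
def effect_size(mean_a, mean_b, sd):
    """Standardized mean difference: (mean_a - mean_b) / sd."""
    return (mean_a - mean_b) / sd

# Hypothetical test scores (not taken from any study cited here).
SD = 10.0
treat_pre, treat_post = 50.0, 62.0      # program group
control_pre, control_post = 50.0, 60.0  # traditionally taught group

# One-group pre-post design: all growth is credited to the program.
pre_post_es = effect_size(treat_post, treat_pre, SD)

# Controlled design: the control group's normal gain is subtracted out.
controlled_es = effect_size(treat_post, control_post, SD)

print(f"pre-post ES:   {pre_post_es:+.2f}")
print(f"controlled ES: {controlled_es:+.2f}")
```

In this made-up case most of the pre-post gain is growth that the control group also shows, so the controlled estimate is far smaller than the pre-post estimate.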

Brief duration

Including studies with brief durations could also bias the overall results of educational interventions, because short-duration studies tend to produce larger effects than long-duration studies. One main reason is that experimenters can often create a more tightly controlled environment in a brief study than could be maintained for a whole school year, and this contributes to unrealistic gains. In addition, brief studies may advantage experimental groups that focus on a particular set of objectives during a limited time period, while control groups spread that topic over a longer period.

Small-scale studies

Like studies with brief duration, studies with small sample sizes often produce larger effect sizes. For instance, when Pearson et al (2005) examined the use of digital tools and learning environments to enhance literacy acquisition in their meta-analysis, they found that studies with smaller sample sizes (N<30) were much more likely to achieve higher treatment effects than those with larger sample sizes. Slavin and Smith (2008) also found that “studies with sample sizes below the median of about 250 had a mean effect size of +0.27, whereas those with large sample sizes had a mean effect size of +0.13”. There are three possible explanations. First, small-scale studies are often more tightly controlled than large-scale studies and, therefore, are more likely to produce positive results. The positive results of small-scale studies could be due to what Lee Cronbach and his colleagues called “super-realization” effects. That is, in small-scale experiments, researchers or program developers are more likely to maintain high implementation fidelity or provide additional support that could never be replicated on a large scale. Second, researcher-developed measures are more likely to be used in small-scale studies, whereas standardized tests, which are usually less sensitive to treatment, are often used in large-scale studies. Finally, the file-drawer effect (when studies with negative outcomes are filed away) is more likely to affect small-scale studies, because studies with positive results are more likely to be published than those with negative results.
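The third explanation, the file-drawer effect, can be illustrated with a small simulation. This is a sketch under simplifying assumptions, not a model of any review cited here: the true effect, the publication rule, the function name `published_mean_effect`, and the approximation that the standard error of an effect size is sqrt(2 / n per group) are all chosen for illustration.

```python
import math
import random

def published_mean_effect(n_per_group, true_d=0.10, n_studies=5000, seed=0):
    """Mean effect size among simulated studies that reach 'publication'.

    Assumptions (illustrative only): each study's estimated effect size is
    drawn from a normal distribution around true_d with standard error
    sqrt(2 / n_per_group), and only significantly positive studies
    (estimate > 1.96 * SE) are published.
    """
    rng = random.Random(seed)
    se = math.sqrt(2.0 / n_per_group)
    published = []
    for _ in range(n_studies):
        d_hat = rng.gauss(true_d, se)
        if d_hat > 1.96 * se:  # everything else stays in the file drawer
            published.append(d_hat)
    return sum(published) / len(published)

small = published_mean_effect(n_per_group=15)
large = published_mean_effect(n_per_group=500)
print(f"true effect: +0.10, small studies: {small:+.2f}, large studies: {large:+.2f}")
```

Under these assumptions the published small studies overstate the true effect far more than the published large studies do, because a small study can only clear the significance threshold with a large estimate.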

No initial equivalence

When evaluating educational program effectiveness, establishing initial equivalence between the treatment and control groups is critical. It is not uncommon to see a post-test-only design used to measure program effectiveness in evaluation studies. However, such a design makes it difficult to know whether the treatment and control groups were comparable at the beginning of the experiment. Since pretest and post-test scores are often highly correlated in achievement tests, even modest (unreported) pretest differences could lead to important bias in the post-test. Meyer and Feinberg (1992) had this to say with regard to the importance of establishing initial equivalence in educational research: “It is like watching a baseball game beginning in the fifth inning. If you are not told the score from the previous innings nothing you see can tell you who is winning the game.” In addition, studies with large pretest differences pose threats to validity even when ANCOVAs or other statistical controls are used, because the underlying distributions may be fundamentally different and such differences cannot be adequately controlled for.
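A rough sketch of how a pretest gap carries into the post-test is given below. It assumes standardized scores and a simple linear relationship between pretest and post-test, so that a program with no effect at all still shows an expected post-test gap of about r times the pretest gap. The correlation of 0.8 and the function name `expected_posttest_gap` are illustrative assumptions, not values from the studies discussed.

```python
def expected_posttest_gap(pretest_gap_sd, pre_post_correlation):
    """Expected post-test gap (in SD units) when the program has NO effect.

    Assumes standardized scores and a simple linear relation between
    pretest and post-test, so the gap carried forward is r * pretest gap.
    """
    return pre_post_correlation * pretest_gap_sd

for gap in (0.10, 0.25, 0.50):
    spurious = expected_posttest_gap(gap, pre_post_correlation=0.8)
    print(f"pretest gap {gap:+.2f} SD -> apparent effect {spurious:+.2f} SD")
```

Even a pretest gap of a quarter of a standard deviation would, under these assumptions, produce an apparent effect comparable in size to many of the effects reported in this article.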

Questionable outcome measures

Using questionable outcome measures, such as researcher-developed measures, non-standardized tests, or measures that are closely aligned with the content taught to the experimental group but not the control group, could also seriously inflate effect sizes in research syntheses. To investigate this problem, Slavin and Madden (2011) identified studies accepted by the What Works Clearinghouse that reported outcomes on both treatment-inherent and treatment-independent measures. In seven mathematics studies, the effect sizes were +0.45 and -0.03, respectively. In ten reading studies, the effect sizes were +0.51 and +0.06, respectively. Similar findings were also reported by Li and Ma (2010). Of the 46 qualifying studies included in their review, half used outcome measures that were either teacher-made or researcher-developed. The effect sizes for experimenter-made measures and standardized tests were +0.86 and +0.57, respectively. The findings should come as no surprise. As Li and Ma stated, “teachers/researchers who build their own measures are also those who are heavily vested in implementing the interventions. The implementation fidelity of intervention programs, therefore, may be a factor contributing to such a difference.”

Cherry-picking evidence

Cherry-picking is a common strategy used by some developers or vendors to select favorable findings to support their cause. For example, when examining the effects of Integrated Learning Systems (ILS), Becker (1992) included 11 Computer Curriculum Corporation (CCC) evaluation studies in his review, four of which were conducted by the developer. Each of these studies was one year long and involved a sample of a few hundred students. Effect sizes provided by the developer were suspiciously large, ranging from +0.60 to +1.60. Upon closer examination, Becker (1992) found that the evaluators used an unusual procedure to exclude students in the experimental group who showed a sharp decline in scores at post-test, claiming that these scores were atypical portraits of their abilities. However, the evaluators did not exclude those who had a large gain, arguing that the large gain might have been caused by the program. One should be cautious when interpreting results that may have been cherry-picked by developers.
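The effect of such a one-sided exclusion rule can be sketched with invented gain scores. Nothing below comes from Becker (1992) or the CCC evaluations: the scores and the cut-off of -8 are hypothetical, and the two groups are constructed to have identical mean gains before any exclusion.

```python
from statistics import mean, stdev

def cohens_d(treat, control):
    """Cohen's d using the pooled standard deviation of two groups."""
    n1, n2 = len(treat), len(control)
    pooled_var = (
        (n1 - 1) * stdev(treat) ** 2 + (n2 - 1) * stdev(control) ** 2
    ) / (n1 + n2 - 2)
    return (mean(treat) - mean(control)) / pooled_var ** 0.5

# Hypothetical gain scores (post-test minus pretest); not real data.
control_gains = [-6, -3, 0, 2, 4, 5, 7, 9, 11, 13]
treat_gains = [-12, -9, -2, 1, 4, 6, 8, 10, 15, 21]

# Asymmetric rule: drop sharp declines in the treatment group only,
# but keep the unusually large gains.
trimmed_treat = [g for g in treat_gains if g > -8]

d_all = cohens_d(treat_gains, control_gains)
d_trimmed = cohens_d(trimmed_treat, control_gains)
print(f"all students: d = {d_all:+.2f}; after exclusions: d = {d_trimmed:+.2f}")
```

With all students included the two groups do not differ, yet removing only the two sharpest declines from the treatment group is enough to produce a sizeable positive effect size.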


If education is to achieve continued success, it must embrace evidence-based reform. Syntheses of research on a broad range of educational programs or interventions are critical. To produce trusted, scientifically valid, and unbiased research syntheses on program effectiveness, it is important that a set of stringent inclusion criteria is applied to the original studies so that studies with poor methodologies can be excluded. It is equally important that practitioners, educators, and policy makers understand the critical issues behind the various program effectiveness reviews so that they can make informed choices.
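As a rough illustration of what applying inclusion criteria might look like in practice, the sketch below screens a list of studies against the problems discussed in this article. The `Study` fields, the `meets_criteria` function, and every threshold (12 weeks, 30 students, a pretest gap of 0.5 standard deviations) are hypothetical choices made for the example, not criteria prescribed by any particular review.

```python
from dataclasses import dataclass

@dataclass
class Study:
    name: str
    has_control_group: bool
    duration_weeks: int
    sample_size: int
    pretest_gap_sd: float             # pretest difference in SD units
    treatment_inherent_measure: bool  # outcome aligned only with the treatment

def meets_criteria(s, min_weeks=12, min_n=30, max_pretest_gap=0.5):
    """Return True if a study passes every (hypothetical) inclusion rule."""
    return (
        s.has_control_group
        and s.duration_weeks >= min_weeks
        and s.sample_size >= min_n
        and abs(s.pretest_gap_sd) <= max_pretest_gap
        and not s.treatment_inherent_measure
    )

# Invented studies, one per typical outcome of screening.
studies = [
    Study("A", True, 36, 420, 0.10, False),
    Study("B", False, 36, 150, 0.00, False),  # no control group
    Study("C", True, 4, 24, 0.05, True),      # brief, small, inherent measure
]
included = [s.name for s in studies if meets_criteria(s)]
print(included)
```

A real review would set and justify its own thresholds in advance; the point of the sketch is only that the criteria are explicit and applied uniformly to every study.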

About the author

Alan Cheung is Professor in the Department of Educational Administration and Policy, and Director of the Centre for University and School Partnership at The Chinese University of Hong Kong. His research interests include school reform, research reviews and educational technology.

Further reading

Becker HJ (1992), Computer-based Integrated Learning Systems in the Elementary and Middle Grades: A Critical Review and Synthesis of Evaluation Reports. Journal of Educational Computing Research, 8(1), 1–41.

Cheung A and Slavin RE (2013), The Effectiveness of Educational Technology Applications for Enhancing Mathematics Achievement in K-12 Classrooms: A Meta-Analysis. Educational Research Review, 9, 88–113.

Cronbach LJ et al (1980), Toward Reform of Program Evaluation: Aims, Methods, and Institutional Arrangements. San Francisco, CA: Jossey-Bass.

Li Q and Ma X (2010), A Meta-analysis of the Effects of Computer Technology on School Students’ Mathematics Learning. Educational Psychology Review, 22, 215–243.

Liao YK (1998), Effects of Hypermedia Versus Traditional Instruction on Students’ Achievement: A Meta-analysis. Journal of Research on Computing in Education, 30(4), 341–359.

Meyer MM and Feinberg SE (1992), Assessing Evaluation Studies: The Case of Bilingual Education Strategies. Washington, DC: National Academy of Sciences.

Pearson PD et al (2005), The Effects of Technology on Reading Performance in the Middle-School Grades: A Meta-Analysis with Recommendations for Policy. Naperville, IL: Learning Point Associates.

Shadish WR, Cook TD, and Campbell DT (2002), Experimental and Quasi-experimental Designs for Generalized Causal Inference. Boston: Houghton-Mifflin.

Slavin RE and Smith D (2008), Effects of Sample Size on Effect Size in Systematic Reviews in Education. Educational Evaluation and Policy Analysis, 31(4), 500–506.

Slavin RE and Madden NA (2011), Measures Inherent to Treatments in Program Effectiveness Reviews. Journal of Research on Educational Effectiveness, 4(4), 370–380.


November 2015