Tuesday, November 4, 2008

How can you tell if your intervention is working?

To use assessment data to monitor intervention efforts and to inform decisions about them, it is important to distinguish between the normal, expected growth that would occur without the intervention and the kind of accelerated growth that leads to improved student performance on high-stakes statewide assessments.

Consider the following scenario. You are an elementary school principal and on the most recent statewide assessment, 62% of your 4th grade students passed the math portion of the assessment. The target Annual Measurable Objective (AMO) for 4th grade math in your state is that 63% must pass. If the school is to avoid NCLB sanctions, it is crucial that a greater percentage of 4th-graders pass the math portion of the statewide assessment in the coming year. Together with your 4th-grade teachers you’ve reviewed and revised the math curriculum and put new intervention procedures in place. How do you know whether it’s working? Can you tell before it is too late?

The good news is that it is reasonable to expect a certain degree of student learning across the year under almost any circumstances. The bad news is that the rate of student growth last year obviously wasn’t enough. If the rate of student growth remains the same, then you can expect that by the end of the year approximately 62% of the students will be prepared to pass the statewide assessment. All of the hard work of the students and teachers will have resulted in merely maintaining the status quo and, in this case, the status quo is not enough. You need to make sure students are making gains at a pace that is better than status quo.

The first step in monitoring whether an intervention is effectively preparing a greater number of students to pass a statewide assessment is to identify the status quo rate of growth. Once this is known you can compare the growth rate of your own students and determine whether they are out-pacing the status quo. One obvious way to monitor student growth is to administer periodic benchmark assessments.

When comparing student performance from one benchmark assessment to the next, it is important to use scaled scores rather than raw scores, and it is important that the scores on the assessments under consideration be on the same scale so that comparisons are meaningful. Scaled scores, such as those derived via Item Response Theory (IRT) analysis, take into account differences in the degree of difficulty of assessments, whereas raw scores do not. If a student earns a raw score of 20 on the first benchmark assessment and a raw score of 25 on the next, you cannot tell whether the student improved or the second assessment was simply easier. If the comparison is made in terms of scaled scores, however, and if both assessments have been placed on the same scale, then the relative difficulty of the assessments has been factored into the calculation of the scaled scores and an increased score can be attributed to student growth.
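To make the raw-score problem concrete, here is a minimal sketch. It assumes a one-parameter (Rasch) IRT model, which is only one of several IRT models and is not necessarily the one used for any particular benchmark. The item difficulties and the ability value are made up for illustration.

```python
import math

def expected_raw_score(ability, item_difficulties):
    """Expected number of items answered correctly under a Rasch model (an assumption)."""
    return sum(1.0 / (1.0 + math.exp(-(ability - b))) for b in item_difficulties)

# Two hypothetical 30-item tests: one slightly harder, one slightly easier.
harder_test = [0.5] * 30
easier_test = [-0.5] * 30

# The same student, with the same ability, takes both tests.
ability = 0.0
raw_harder = expected_raw_score(ability, harder_test)
raw_easier = expected_raw_score(ability, easier_test)

print(f"harder test: {raw_harder:.1f} items correct expected")
print(f"easier test: {raw_easier:.1f} items correct expected")
```

The student's ability has not changed, yet the expected raw score is several points higher on the easier test. A scaled score reports the ability estimate rather than the item count, which is why it can be compared across tests that sit on the same scale.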

ATI clients that administer Galileo K-12 Online benchmark assessments can use the Aggregate Multi-Test Report to monitor student growth and to identify whether the rate of growth is greater than the rate that is likely to result in maintaining the status quo. Galileo K-12 Online benchmark Development Level (DL) scores are scaled scores that take difficulty into account and all assessments within a given grade and subject are placed on the same scale, so that comparisons across assessments are meaningful. In addition, beginning with the 2007-08 school year, the cut scores on the benchmark assessments that are aligned to the relevant statewide assessment cut scores also provide an indication of the status quo growth rate.

In Galileo K-12 Online, the cut score for passing (e.g. “Meets benchmark goals” in Arizona, “Proficient” in California, and so on) that is applied to the first benchmark assessment of the year is tailored for each district using equipercentile equating (Kolen & Brennan, 2004). The cut score is aligned to the performance of that district’s students on the previous year’s statewide assessment. For subsequent benchmark assessments, the passing cut score represents an increase over the previous cut score at an expected growth rate that is likely to maintain the status quo. The expected growth rate for each grade and subject is based on a review of the data from approximately 250,000 students in grades 1 through 12 who took Galileo benchmark assessments in math and reading during the 2007-08 school year. Districts and schools that are seeking to improve the percent of students passing the statewide assessment should aim for an increase in average DL scores that is greater than the increase in the cut score for passing the assessments.
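The logic can be sketched in a few lines. This is a deliberately simplified illustration, not the procedure Galileo uses: true equipercentile equating involves interpolation and smoothing of score distributions, whereas the hypothetical `first_benchmark_cut` function below just finds the score reached by the same share of students who passed last year. All scores and cut values are invented.

```python
def first_benchmark_cut(scores, prior_pass_rate):
    """Simplified percentile match: the lowest score earned by the top
    `prior_pass_rate` share of students (a stand-in for equipercentile equating)."""
    ranked = sorted(scores, reverse=True)
    n_passing = max(1, round(prior_pass_rate * len(ranked)))
    return ranked[n_passing - 1]

def outpacing_status_quo(mean_dl_first, mean_dl_next, cut_first, cut_next):
    """True when the average DL gain is greater than the gain in the passing cut score."""
    return (mean_dl_next - mean_dl_first) > (cut_next - cut_first)

# Ten hypothetical 4th-grade DL scores on the first benchmark; 62% passed last year.
scores = [820, 860, 900, 905, 930, 950, 975, 990, 1010, 1040]
cut = first_benchmark_cut(scores, 0.62)
print(cut)

# Hypothetical means and cut scores on two consecutive benchmarks.
print(outpacing_status_quo(900, 930, cut_first=930, cut_next=950))  # gain 30 vs. 20
print(outpacing_status_quo(900, 920, cut_first=930, cut_next=950))  # gain 20 vs. 20
```

In the first comparison the average score rose faster than the cut score, so more students should be clearing the bar. In the second the two rose in step, which is the status quo case.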

The following graphs illustrate a district that is showing growth at the expected rate and maintaining the status quo, and another that is showing growth that is better than the expected rate and which can expect to show improvement over the previous year with regard to the percent of students passing the statewide assessment.

District A: Showing growth but maintaining the status quo

District B: Showing growth AND improvement

Different rates for different grade levels

There has been a great deal of research regarding the amount of growth in terms of scaled scores that can be expected within and across various grade levels, and the results have been mixed. When IRT methods are used in calculating scaled scores, it has generally been found that the relative amount of change in scaled scores from one grade level to the next tends to decrease at the higher grade levels (Kolen & Brennan, 2004). ATI applies IRT methodology and has also observed that the rate of increase in student performance in terms of scaled scores tends to decrease at the higher grade levels. The graph that follows presents the mean scaled score on the first and third benchmark assessments within the 2007-08 school year for grades 1, 3, 5, and 8. The sample consisted of approximately 25,000 students per grade.

It should be noted that the slower rate of growth at the higher grade levels does not necessarily imply that students’ ability to learn decreases at the upper grades. The decrease in the growth rate may, for example, be a side effect of the methodology that is used in the raw-to-scale score conversion (Kolen, 2006). Regardless, the pattern is a stable one that provides a reliable measure against which to compare the growth of individual students or groups of students.

I’d love to hear other thoughts on monitoring intervention efforts. What has been helpful? What has not? What might be helpful in the future?


Kolen, M.J. (2006). Scaling and norming. In R.L. Brennan (Ed.), Educational Measurement (pp. 155-186). Westport, CT: American Council on Education and Praeger Publishers.

Kolen, M.J. & Brennan, R.L. (2004). Test equating, scaling, and linking: methods and practices (2nd ed.). New York: Springer.


Anonymous said...

This is a helpful explanation of the equipercentile equating, especially the way that cut scores are set after the first benchmark by using "an expected growth rate that is likely to maintain the status quo." It seems to me that if this is the case, you can tell whether you are making progress by simply looking at the percentage of students who are passing. If the percentage goes up, you are making progress, right? I wasn't sure why you would need to actually look at the increase rate of the DL scores.

I would like to see more posts about the topic of measuring growth. Here at the MA Dept of Ed, our pilot of Galileo has revealed a number of questions along these lines:

- what is a simple way of setting growth targets in Galileo based on the growth targets on our state tests for an individual student, a classroom, school, or district?

- for what amount of growth can you definitively say there is a significant difference at the classroom, school, or district level? when is it valid and not valid to use growth on DL to compare performance (i.e., "added value") at these different levels?

- is there a more user-friendly way to describe the DL gains? e.g., in grade-level equivalents, weeks of instruction, etc.?

Looking forward to more discussion on these issues!

Christine Burnham, Ph.D. said...

Thank you, Life, for your insightful comments. To respond to your first point, yes, looking at the percentage of students who are passing would tell you whether you’re making progress. I phrased it in terms of increased rate of growth because that’s the concept behind the increased percentages, but you’re right. Just looking at the percentages is more straightforward.

Regarding the rest of your comment, the questions you ask are very intriguing and well worth discussing. Two of them require fairly extensive, technical answers, which will be posted soon. In the meantime, I will try to answer the third.

You ask, “Is there a more user-friendly way to describe the DL gains? e.g., in grade-level equivalents, weeks of instruction, etc.?” The DL scales in Galileo K-12 Online for each grade level are meant to reflect the typical age of students at that grade level. For example, third-graders tend to be about 8 years old so the DL scale for third grade is built around a mean of 800, fourth-graders tend to be about 9 years old so the DL scale for fourth grade is built around a mean of 900, and so on. We opted to use student age rather than grade level as a framework for DL scores because, if grade level were used, then kindergarten would be built around a mean of 0, and any students falling below the mean would have negative scores, which would most likely be distressing for parents and teachers alike.

With that in mind, a teacher can monitor a student’s progress with reference to his or her age group. If a third-grader earns a DL score of 750, then the teacher knows that he or she is somewhat below average for the grade. In fact, since the standard deviation at all grade levels is set to 100, the teacher knows that the student is one half of a standard deviation below the mean for the grade. As the student progresses, he or she should move closer to a score of 800 and perhaps beyond it.
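For anyone who wants to do this arithmetic in a spreadsheet or script, a minimal sketch follows. The function name is hypothetical, and only the two grade means given above (800 for third grade, 900 for fourth) are included rather than guessing at the others.

```python
# Grade means stated above; other grades are deliberately left out.
GRADE_MEAN_DL = {3: 800, 4: 900}
DL_STANDARD_DEVIATION = 100  # set to 100 at every grade level

def sd_from_grade_mean(dl_score, grade):
    """Distance from the grade's mean DL score, in standard deviation units."""
    return (dl_score - GRADE_MEAN_DL[grade]) / DL_STANDARD_DEVIATION

below = sd_from_grade_mean(750, 3)
above = sd_from_grade_mean(850, 3)
print(below, above)
```

A third-grader at 750 comes out at -0.5 and one at 850 at +0.5. The result only makes sense when the score is compared with the mean for the student's own grade.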

One point that’s important to clarify is that the scales for the different grade levels are not vertically scaled. What that means is that each grade-level scale is independent and DL scores cannot be compared across grade levels. A third-grader who earns a DL score of 850 is performing at one half of a standard deviation above the mean for third-graders, but it does not follow that he or she is also performing at one half of a standard deviation below the mean for fourth-graders.

In order to vertically scale the benchmark assessments, we would have to administer a series of assessments that have a certain number of items in common between adjacent grade levels. In other words, we would have to administer a third-grade-level assessment to a large sample of third-graders that contained a set of second-grade items in addition to the third-grade items, and those same items would have to appear on a second-grade assessment. The same pattern would have to be followed up through all of the grade levels. Thus far we have not pursued vertical scaling.

As you know, MCAS is currently not vertically scaled. Vertical scaling is useful for measuring progress from one year to the next. In the event that the State were to opt for vertical scaling of MCAS, it would be useful for ATI to provide vertical scaling for benchmark assessments. If the DL scale were vertically scaled across grade levels, the means would no longer land on nice, round, age-relevant DL scores like 800, 900, and so on.

I hope this addresses your question and that it helps teachers who are using Galileo K-12 Online to view student DL scores in a clearer light. Watch for responses to your other questions to be posted soon.

- Christine

Anonymous said...

Thank you, Christine, for your response. A clarification question plus another thought-provoker:

- Are the DL scores supposed to correspond to the expectations at the END of each grade (e.g., 800 for an average student at the end of 3rd grade)?

- I'm still interested in a simple way to think of growth in terms of DL - is there a way to translate the DL standard deviations into something more user-friendly? If a student gains 50 DL points between benchmark 1 and benchmark 2, is there an alternative to telling the parents that "Johnny gained an entire half of one standard deviation this quarter!" It would be much better to be able to say something like "Johnny has made the equivalent of four months of progress in the last two months. Since he started about six months behind, if he keeps working like this he could be caught up by the spring."

If there is a way to use the power of the psychometrics to inform the common sense understanding of parents and teachers (and students!), we need to do it.