Wednesday, May 27, 2009

Benchmark Results by Groups

ATI released a new report this month called the Benchmark Results by Groups report. This report breaks out the students who passed the benchmark goals by subgroup. Combined with customizable forms, it gives the user flexibility in reporting. It can be run on an individual class (offering comparisons to the school and district data), on all schools, or on all classes. In all cases you will receive an “Overall” column for comparison purposes (see screen shot below). At present this report is available to district-level users only; however, ATI plans to make it available to all user access levels.

If you are interested in learning more about this report or other components of the system, a WebEx can be set up. A WebEx is a guided tour of the system over the Internet. Please contact the Field Services department at 800-367-4762 Ext. 124 to obtain more information.

For those districts that are current clients, please contact your Field Service Coordinator if you have questions on this report or other components of Galileo.

Wednesday, May 6, 2009

Questions to complement Value Added Measurement

A short while back I wrote a post about the use of Value Added Measurement (VAM) within the context of educational reforms. I mentioned President Obama speaking of the need to reward effective teachers financially. Indeed, the impact of effective teachers is well established. There is certainly value in identifying which instructors have the most impact on the learning of their students. However, just as with any set of tools that might be applied to the ultimate task of raising student achievement, there are certain limitations to VAM and merit pay as a source of guidance for policy. One of the most notable is that they provide no insight into what effective teaching actually looks like. My task here will be to make good on my promise from the last post to talk a bit about some additional tools that can be added to the arsenal. As I said before, it makes sense to use any tool that we can to tackle this important task.

VAM asks who is the most effective in the classroom. What if we added some additional questions, such as: What is the most effective way of teaching a given skill? What are the specific needs of students who need additional help? How are they progressing as they receive instruction? These might be thought of as “bottom up” questions as opposed to the “top down” type of inquiries that characterize VAM. The notion is that specific identification of the components of effective instruction will support the construction of a larger program. This type of approach could provide a nice complement to the gains that can be achieved from VAM.

What is needed to effectively ask these types of questions? One necessary ingredient is the ability to work collaboratively on the implementation of common objectives, assessments, and instructional approaches across different classes and schools. It must be possible to distribute necessary materials to all the teachers who need them. It must be possible to monitor the delivery of that instruction so that differences across teachers can be identified and, where necessary, they can be addressed. Highly consistent implementation is needed in order to make strong conclusions about what works or doesn’t work.

It also must be possible to gather accurate and reliable assessment data on a frequent basis. Assessments must be shared so that information may be reliably aggregated. Assessments should also be well integrated into instruction so that the picture of learning is highly detailed.

This approach is a nice complement to VAM because it positions us to answer the question of what can be done when differences in outcomes are identified across classrooms. The implicit assumption is that teacher effectiveness can be taught once the components of effective instruction are identified. In a recent article, Stephen Raudenbush describes the successful implementation of a literacy program based on what he terms a shared systemic approach to instruction. Central to the approach are shared goals, instructional content, and assessments. Differences in teacher expertise are expected and the system encourages mentoring by those whose skills are more advanced. Raudenbush argues that this sort of collaborative approach is key to effectively identifying and then implementing the kinds of systemic changes that will ultimately advance instruction and improve schools.

The tools within Galileo have been designed to support the process of determining what strategies are effective in helping students to meet goals. As we described in our recent seminar, the intervention model positions districts to do that sort of collaborative work. We would be interested in hearing responses from those who have worked in a district where such an approach was implemented. How did it seem to work? What sort of approach was taken to implementation? What kinds of problems came up?

Friday, May 1, 2009

The Calculations behind Forecasting Risk and Making Predictions in Galileo K-12 Online

Thanks for the comment on my previous post, Gerardo! I’ll work through the calculations you requested in this response, but the real work is in making sure that the benchmark assessments are properly aligned to state standards and that student scores on the benchmark assessments correlate well with their scores on the statewide assessment. The validity of Galileo K-12 Online benchmark assessments, both in terms of the alignment of content and the correlations with the state test scores, has been well-established (see the Galileo K-12 Online Technical Manual), and so now we are free to engage in a very straightforward, easy, and accurate approach to forecasting student performance on statewide assessments like AIMS.

The first step in forecasting student performance on a statewide assessment is establishing cut scores on the benchmark assessment that correspond to the cut scores on the statewide assessment. To do this we use equipercentile equating (e.g. Kolen & Brennan, 2004). With equipercentile equating, you start with the distribution of student scores on the target assessment. In the example I’ll work through here the target assessment, the one we want to make predictions about, is the 3rd grade math AIMS assessment (the statewide assessment in Arizona). The distribution of scores that is used for the equating process is the set of scores from that particular district’s students on the previous year’s assessment. In this case, 25% of the district’s third-graders had fallen below the critical cut score for meeting the standard on the spring, 2007 AIMS assessment, and so for the 2007-08 3rd grade math benchmark assessments, the cut score for Meets the Standard was set at the 25th percentile. The same approach was used for the other two cut scores (Approaches and Exceeds) but for the purposes of this discussion, we are only concerned with the cut score for Meets, which is essentially the pass/fail cut score.
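As a rough illustration of the percentile step described above, here is a minimal Python sketch. It is a simplified nearest-rank version, not the full equipercentile procedure covered by Kolen and Brennan, and it ignores refinements such as tied scores. The function name and the student scores are hypothetical, and it assumes that a score equal to the cut counts as meeting it.

```python
def benchmark_cut_score(benchmark_scores, pct_below_state_cut):
    """Nearest-rank cut score: the lowest benchmark score counted as meeting.

    pct_below_state_cut is the percent of last year's students who fell
    below the Meets cut on the statewide test (25 in the post's example).
    """
    ordered = sorted(benchmark_scores)
    # How many students we expect to fall below the benchmark cut.
    n_below = round(len(ordered) * pct_below_state_cut / 100)
    # Clamp so a 100% rate still returns a valid score.
    index = min(n_below, len(ordered) - 1)
    return ordered[index]


# Hypothetical benchmark scores for 20 students (not real district data).
scores = [12, 15, 9, 22, 18, 25, 30, 11, 27, 19,
          21, 14, 28, 16, 24, 17, 23, 13, 26, 20]
cut = benchmark_cut_score(scores, 25)
print(cut)
```

With these made-up scores, a quarter of the students fall below the returned cut, mirroring the 25% who fell below the Meets cut on the statewide test.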

Once the cut scores for benchmark assessments are established, they can be used to estimate each student’s degree of risk of not meeting the standard on the next statewide assessment. As stated in my original blog on this topic, we have found that observing a student’s pattern of performance across multiple benchmark assessments yields more accurate forecasts of likely performance on the statewide assessment than does looking at the student’s score on one assessment in isolation. Classification depends on whether the student scored above or below the cut score for Meets the Standard on each benchmark assessment. If the student scored above the cut score for Meets on all three, then she is said to be On Course for demonstrating mastery on the statewide assessment. A student who scores below the cut score on all three assessments is classified as being at High Risk of not demonstrating mastery. Scoring above on two out of three assessments earns the classification of Low Risk and scoring above on only one out of three assessments earns the classification of Moderate Risk.
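The classification rule above can be sketched in a few lines of Python. The function name and the sample scores are hypothetical, and the sketch assumes that a score equal to the cut counts as meeting it, which the description above does not specify.

```python
# Risk level keyed by the number of benchmarks (out of three) on which
# the student reached the Meets cut score.
RISK_BY_PASS_COUNT = {
    3: "On Course",
    2: "Low Risk",
    1: "Moderate Risk",
    0: "High Risk",
}


def classify_risk(student_scores, cut_scores):
    """Classify a student from three benchmark scores and their cut scores."""
    if len(student_scores) != 3 or len(cut_scores) != 3:
        raise ValueError("expected exactly three benchmark assessments")
    # Assumption: scoring exactly at the cut counts as meeting it.
    passed = sum(score >= cut for score, cut in zip(student_scores, cut_scores))
    return RISK_BY_PASS_COUNT[passed]


# Hypothetical scores against hypothetical cut scores of 15, 18, and 20.
print(classify_risk([16, 19, 25], [15, 18, 20]))  # met all three
print(classify_risk([16, 10, 25], [15, 18, 20]))  # met two of three
print(classify_risk([10, 10, 25], [15, 18, 20]))  # met one of three
print(classify_risk([10, 10, 12], [15, 18, 20]))  # met none
```

Because only the count of benchmarks passed matters, it makes no difference which two assessments a Low Risk student passed.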

In Galileo K-12 Online, the reports that indicate student risk levels are linked directly to instructional materials and other tools to support intervention efforts with students who are at risk. This support for intervention efforts is the primary purpose of the risk classification scheme. But it is important to demonstrate that the classification scheme is accurate, which brings us to the data summary that Gerardo asked about.

The table below presents the data for the example I’ve been working through here.

The data are from a particular district’s 2007-08 benchmark assessments in 3rd grade math. The panel on the left shows the different possible patterns of performance on the series of three benchmark assessments: the first row represents students who scored above the cut score on all three benchmarks, and so on. The next column indicates the Risk Level classification for each pattern of performance. Note that scoring above the cut score on two out of three benchmark assessments leads to the same Risk Level classification, regardless of which two assessments were passed by the student. The number of students who showed each pattern of performance is indicated, as is the number of students in each pattern who did and did not demonstrate mastery on the AIMS assessment. For example, there were 238 students who scored above the cut score for Meets on all three benchmark assessments. Of these students, 234, or 98%, also met the standard when they took the AIMS assessment at the end of the year. The percent who met the standard in AIMS for each of the other risk groups was calculated in a similar manner. For the Low Risk group, it was simply a matter of adding up the number of students who passed the AIMS assessment (15+5+26), dividing by the total number of students in that risk group (19+9+34) and then multiplying by 100 to get 74%.
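For readers who want to check the arithmetic, here is a small Python sketch of the Percent Met Standard calculation. It uses only the counts quoted in the paragraph above; the function name is hypothetical.

```python
def percent_met_standard(met_counts, group_sizes):
    """Percent of students in a risk group who met the standard on AIMS.

    met_counts and group_sizes hold one entry per performance pattern
    that belongs to the risk group.
    """
    return 100 * sum(met_counts) / sum(group_sizes)


# On Course: a single pattern, 234 of 238 students met the standard.
on_course = percent_met_standard([234], [238])

# Low Risk: three patterns (each with two of three benchmarks passed).
low_risk = percent_met_standard([15, 5, 26], [19, 9, 34])

print(round(on_course), round(low_risk))
```

Rounded to whole percents, these come out to the 98% and 74% reported above.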

The Percent Met Standard column in the table presented here corresponds to the data in the table at the end of my previous post (“Forecasting Risk and Making Predictions about AMOs”). In that table, I presented averages for each Risk Group that were based on data from 7 school districts in a pilot investigation. The averages are collapsed across all of those districts and all grade levels and content areas. We have plans to investigate this further, and I will keep the readers of this blog posted on any developments.

The final column in the table indicates the accuracy of forecasting performance on AIMS for each of the Risk Groups. For the On Course and Low Risk groups, the prediction is that they will most likely meet the standard on AIMS. For these groups the calculation of the percent accuracy is the same as the calculation of the percent who Met the Standard in the previous column. For the other two groups, the prediction is that they will most likely NOT meet the standard on AIMS, and accuracy here refers to the percent of students who, as predicted, did not meet the standard on AIMS. For the High Risk group, accuracy was at 87% because 13% of these students met the standard on AIMS in spite of their failure to do so on any of the benchmark assessments. Even more interesting is the Moderate Risk group, for which accuracy was only 59% because 41% passed AIMS in spite of our prediction. At ATI, we actually like to be less accurate in these two categories. If our prediction is wrong with these students, it suggests that they and their teachers worked very hard, and they managed to pass the statewide assessment against the odds. We hope that the Galileo K-12 Reports and instructional materials were a part of that effort. That’s what it’s all about.
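The accuracy calculation can be sketched the same way. This Python snippet is only an illustration with hypothetical names: for groups predicted to meet the standard, accuracy equals the percent who met it, and for groups predicted not to meet it, accuracy is the remainder.

```python
# Whether each risk group is predicted to meet the standard on AIMS.
PREDICTED_TO_MEET = {
    "On Course": True,
    "Low Risk": True,
    "Moderate Risk": False,
    "High Risk": False,
}


def forecast_accuracy(risk_level, pct_met_standard):
    """Percent of students in a group whose AIMS outcome matched the forecast."""
    if PREDICTED_TO_MEET[risk_level]:
        return pct_met_standard
    # Predicted NOT to meet: the forecast was right for those who did not.
    return 100 - pct_met_standard


# Percent Met Standard values quoted in the post for this district.
for level, pct_met in [("On Course", 98), ("Low Risk", 74),
                       ("Moderate Risk", 41), ("High Risk", 13)]:
    print(level, forecast_accuracy(level, pct_met))
```

For the Moderate Risk and High Risk groups this gives the 59% and 87% figures discussed above.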


Kolen, M.J. & Brennan, R.L. (2004). Test equating, scaling, and linking: methods and practices. New York: Springer.