Tuesday, December 16, 2008

Interventions aren't just when something has gone wrong.

The last post that I put up discussed some of the purposes of assessment. It was written in response to a comment that we received relating to the ways that one might use assessment data and the wisdom of using an assessment for something that it was not designed to accomplish. As we discussed, this is something that must be approached with caution as an assessment that is valid for one purpose might not be valid for another.

With this post I am going to use that question as a springboard for a broader discussion that will, I hope, address an issue that our clients have raised with us. The sort of dynamic intervention system that we have been discussing in this blog and in other papers positions assessment as serving the purpose of informing instructional decision making. It is for the purpose of answering questions like the following: Who are the students whose test scores indicate that they still need more help on key skills? What are the specific areas on which the additional work should focus?

The topic that I would like to focus on here is the picture into which these questions fit. How might the instructional questions of which children need assistance, and what should the assistance be, fit into the type of intervention system that we are trying to introduce through this blog and other writings that we have posted?

It is rather interesting that in many discussions about curriculum implementation, intervention is treated as an additional thing applied on top of normally planned instruction. It is something that is intended to fix problems, not something that is an integrated part of how things are done for everyone. We would like to discuss the possibility that it could be something different. Rather than a way to fix problems, intervention can be a different way to handle the implementation of a curriculum such that districts are in a position to be extremely agile and responsive to what the data is saying about the success of what has been planned. Rather than being separate and apart from the plan, intervening is part of the plan. A plan is developed and then implemented. Right from the start, data is collected that speaks to whether things are working. Decisions are made, plans are revised, and then the cycle starts anew. The US Department of Education has defined educational intervention in this way in documents generated to assist schools in conducting evidence-based practice.

What does such an approach require? The first thing is a system in which assessment data is immediately available. Ideally the system should include an indication of whether an individual lesson is working. This data should be made immediately available and it should be tracked to determine if what has been planned is working. Similarly, the means of providing the content should be sufficiently flexible that plans may be readily changed. If one group of 4th grade students is partway through a set of lessons and they are still struggling with the skills that are the focus of those lessons, then the teacher should be in a position to modify the plan easily without having to wait for weeks or months until a planned course of instruction has been completed. In order to make this kind of decision making possible it is critical that the assessment data collected not only be timely, but also valid. Questions must be written so that they cover the intended skills and minimize measurement error.

It is our hope that the discussion in this blog and in the forum that surrounds it will include many innovative possibilities about how intervention might look. How have interventions been applied in your district? What obstacles have come up in making them work?

Tuesday, December 2, 2008

A response to a question about the possible uses of benchmark data

A recent comment to one of our prior posts asked some questions that I thought warranted at least a couple of posts. The questions concerned whether or not it was desirable to take formative and benchmark data that were originally implemented as a means to inform instruction and apply them to a different purpose such as the assignment of grades or retention decisions.

The reason that one might want to do this sort of thing is pretty clear. It’s an issue of efficiency. Why use all kinds of different tests for retention and grading when you already have all of this data available online? Even though it was intended to guide instruction, why not make use of benchmark data for a grade or a placement? The short answer to this question is that it is generally not a good idea to use a test for purposes other than the purposes for which it was designed. Benchmark assessments are designed to provide information that can be used to guide instruction. They are not designed to guide program placement decisions or to determine course grades.

Evaluating the possible use of benchmark data for program placement or for grading would require careful consideration of the validity of the data for these purposes. It is worth noting that the requirement to address the intended purpose of the assessment was not always a part of the process of establishing test validity. This may have contributed to the lasting tendency to ignore the question of whether or not a test is valid for purposes other than those that the test was intended to serve. There was a time when questions regarding test validity focused heavily on how well the test correlated with other measures to which the test should be related. Sam Messick argued that validity discussions must consider the use of the test score. He indicated that consideration must be given to both the validity of the score as a basis for the intended action and what sort of consequences resulted from the score's use for that purpose. Today the Standards for Educational and Psychological Testing, published jointly by the American Educational Research Association (AERA), the American Psychological Association (APA), and the National Council on Measurement in Education (NCME), include the notion that the use of a test score and the consequences of that use, both intended and otherwise, must be considered in evaluating validity.

Applying this notion to the use of benchmark and formative data for grades, the question that would need to be asked is whether the scores that are produced actually serve the intended purpose of the score on the report card. Report card grades are typically designed as a “bottom line measure” of the best information available on a student’s knowledge and skills related to a given body of instructional content following a defined period of instruction. This sort of a summative function will, in many cases, dictate a different selection of standards and items than one would choose if the goal were to provide a score that was intended to guide subsequent instruction. For example, a test designed to guide instruction might include a number of questions on a concept that had been introduced prior to test administration, but was intended to be a focus of later lessons. Students would be quite justified in considering this unfair on a test used for a grade. Tilting the balance in the other direction would limit the usefulness of the test as a guide for instruction.

The case against using benchmark assessments as program placement tests is particularly clear. As the author of the comment rightly points out, retention is a high stakes decision. The consequences to a student are extremely high. Moreover, the consequences may not be perceived by students and parents as being beneficial. For example, a student who has learned that he or she is being retained in grade based on benchmark test results may not be happy about that fact. Because grade retention is a high stakes issue, it would be reasonable for the student’s parents to question the validity of the placement decision. Benchmark tests are designed to provide valid and reliable data on those standards that are the focus of instruction during a particular time period. They are not intended to provide an all encompassing assessment of the student’s mastery of all those standards that they are expected to know throughout the school year. Moreover, validation of benchmark assessments does not generally include evidence that benchmark failure one year is very likely to indicate failure the following year. Thus, the parent would have good reason to challenge the placement decision.

Monday, November 10, 2008

The Dialog on Dynamic Interventions Begins

Last week we initiated the dialog on dynamic interventions through a presentation at the Arizona Educational Research Organization (AERO). I wish to thank the organization for providing the opportunity to introduce the intervention initiative to members of the organization. Also there were useful questions from the audience that I would like to discuss. One point made in the presentation was that technology can play an important role in promoting the development and implementation of dynamic intervention systems integrating intervention research and intervention management. Technology is beneficial in two major ways: First, it makes it possible to do things that would be difficult to accomplish without technology. For example, online instructional monitoring tools make it possible to observe the responses of multiple students to questions in real time. Second, technology can reduce the work involved in designing and implementing an intervention. For example, online intervention planning tools can automatically evaluate benchmark testing data to identify groups of students at-risk for not meeting standards and to recommend objectives to be targeted for instruction in an intervention aimed at minimizing the risk of not meeting standards.

A number of important audience questions were raised concerning the use of technology in dynamic intervention systems. One question dealt with the issue of whether or not all interventions should be implemented online. The answer that I gave is that not all interventions are implemented online, and they should not be required to be. There are two reasons for this: First, not all districts have the necessary technology to support online interventions. Second, instruction does not and should not always occur on a machine. That said, technology can still be helpful. For example, online technology can be useful in documenting the occurrence of interventions that take place offline. Questionnaires and records of lesson plans and assignments can document implementation of an intervention. Data of this kind has been used effectively in many studies of intervention implementation.

Another audience question raised the possibility of automating experimental designs that could be implemented to assess intervention effects. There is no substitute for a skilled researcher when the task at hand is experimental design. Nonetheless, automated packaged designs could be useful in supporting the kinds of short experiments that we have proposed for use in dynamic intervention systems. We are currently working on the design and development of technology to automate experimental design.

Wednesday, November 5, 2008

Different Types of Value Added Questions

Many of us who are involved in education today have heard about the value added approach to analyzing assessment data coming from a classroom. In a nutshell, this is a method of analyzing test results that attempts to answer the question of whether something that is in the classroom, typically the teacher, adds to the growth that the students make above and beyond what would otherwise be expected. The approach made its way onto the educational main stage in the state of Tennessee as the Tennessee Value-Added Assessment System (TVAAS).

This rather dry topic would likely not have been something that was well known outside the ivory towers were it not for the growing question of merit pay for educators. One of my statistics professors used to love to say that “Statistics isn’t a topic for polite conversation”. The introduction of pay into the conversation definitely casts it in a different light. In NYC, consideration is being given to utilizing value added analysis in tenure decisions for principals. Value added models have been used for determining teacher bonus pay in Tennessee and Texas. Michelle Rhee has argued for using a value added approach to determining teacher performance in the DC school system. One might say it is all the rage, both for the size of the spotlight shining its way and the emotion that its use for this purpose has brought forth.

I will not be using this post to venture into the turbulent waters of discussing who should be getting paid based on results and who shouldn’t. I’ll leave it to others to opine on that very difficult and complicated question. My purpose here is to introduce the idea that the type of questions one asks from a value added perspective, the mindset if you will, can greatly inform instructional decision making through creative application. The thoughts that I will write about here are not intended to say that current applications of the value added type approach are wrong or misguided. I intend only to offer a different twist for everyone’s consideration.

The fundamental question in the value added mindset is whether something that has been added to the classroom positively impacts student learning above and beyond the status quo. One could easily ask this question of new instructional strategies introduced to the classroom that are intended to teach a certain skill. For instance, one might evaluate a new instructional activity designed to teach finding the lowest common denominator of two fractions. Given the limited scope of the activity, this evaluation could be conducted with a great deal of efficiency in a very short time by the administration of a few test questions. This sort of evaluation will provide the sort of data that could be used immediately to guide instruction. If the activity is successful then teachers can move on to the next topic. If it is unsuccessful then a new approach may be utilized. The immediacy of the results puts one in a position of being able to make decisions informed by data without having to wait for the year or the semester to end.

Conducting short-term, small-scale evaluations is different from the typical approach in value added analysis, which is concerned with impact over a long period. The question of long-term impact could easily be asked of a collection of instructional activities or lessons. In an earlier post, Christine Burnham discusses some of the ways that impact over time could be tested.

As always, we look forward to hearing your thoughts about these issues.

Tuesday, November 4, 2008

How can you tell if your intervention is working?

In order to use assessment data to monitor intervention efforts and to inform decisions about them, it is important to be able to distinguish between the normal, expected growth that would occur without the intervention and the kind of accelerated growth that leads to improved student performance on high stakes statewide assessments.

Consider the following scenario. You are an elementary school principal and on the most recent statewide assessment, 62% of your 4th grade students passed the math portion of the assessment. The target Annual Measurable Objective (AMO) for 4th grade math in your state is that 63% must pass. If the school is to avoid NCLB sanctions, it is crucial that a greater percentage of 4th-graders pass the math portion of the statewide assessment in the coming year. Together with your 4th-grade teachers you’ve reviewed and revised the math curriculum and put new intervention procedures in place. How do you know whether it’s working? Can you tell before it is too late?

The good news is that it is reasonable to expect a certain degree of student learning across the year under almost any circumstances. The bad news is that the rate of student growth last year obviously wasn’t enough. If the rate of student growth remains the same, then you can expect that by the end of the year approximately 62% of the students will be prepared to pass the statewide assessment. All of the hard work of the students and teachers will have resulted in merely maintaining the status quo and, in this case, the status quo is not enough. You need to make sure students are making gains at a pace that is better than status quo.

The first step in monitoring whether an intervention is effectively preparing a greater number of students to pass a statewide assessment is to identify the status quo rate of growth. Once this is known you can compare the growth rate of your own students and determine whether they are out-pacing the status quo. One obvious way to monitor student growth is to administer periodic benchmark assessments.

When comparing student performance from one benchmark assessment to the next, it is important to use scaled scores rather than raw scores, and it is important that the scores on the assessments under consideration be on the same scale so that comparisons are meaningful. Scaled scores, such as those derived via Item Response Theory (IRT) analysis, take into account differences in the degree of difficulty of assessments, whereas raw scores do not. If a student earns a raw score of 20 on the first benchmark assessment and a score of 25 on the next, you do not know whether the student improved or the 2nd assessment was easier. If the comparison is made in terms of scaled scores, however, and if both assessments have been placed on the same scale, then the relative difficulty of the assessments has been factored into the calculation of the scaled scores and an increased score can be attributed to student growth.
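The difference between raw and scaled scores can be illustrated with a small sketch. It assumes the Rasch model, the simplest IRT model, and uses two hypothetical 30-item tests with made-up item difficulties; the function names are illustrative, and nothing here is meant to describe how any particular testing system computes its scores. A student whose ability does not change earns a noticeably higher raw score on the easier test, while the ability estimate, which accounts for item difficulty, stays the same.

```python
import math


def p_correct(ability, difficulty):
    """Rasch-model probability that a student answers an item correctly."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


def expected_raw_score(ability, item_difficulties):
    """Expected number of items answered correctly on a test."""
    return sum(p_correct(ability, b) for b in item_difficulties)


def estimate_ability(raw_score, item_difficulties, lo=-6.0, hi=6.0):
    """Find the ability whose expected raw score matches the observed one.

    Uses bisection, which works because the expected raw score rises
    steadily with ability. Valid for raw scores strictly between 0 and
    the number of items.
    """
    for _ in range(60):
        mid = (lo + hi) / 2.0
        if expected_raw_score(mid, item_difficulties) < raw_score:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


# Hypothetical tests: every item on the second test is one logit easier.
first_benchmark = [0.5] * 30
second_benchmark = [-0.5] * 30

ability = 0.0  # the student has NOT grown between the two tests
raw_first = expected_raw_score(ability, first_benchmark)    # about 11.3
raw_second = expected_raw_score(ability, second_benchmark)  # about 18.7

# The raw score rose by about 7 points, yet both ability estimates agree.
scaled_first = estimate_ability(raw_first, first_benchmark)
scaled_second = estimate_ability(raw_second, second_benchmark)
```

In this sketch the jump in raw score comes entirely from the easier test, which is exactly the ambiguity that placing both assessments on a common scale removes.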

ATI clients that administer Galileo K-12 Online benchmark assessments can use the Aggregate Multi-Test Report to monitor student growth and to identify whether the rate of growth is greater than the rate that is likely to result in maintaining the status quo. Galileo K-12 Online benchmark Development Level (DL) scores are scaled scores that take difficulty into account and all assessments within a given grade and subject are placed on the same scale, so that comparisons across assessments are meaningful. In addition, beginning with the 2007-08 school year, the cut scores on the benchmark assessments that are aligned to the relevant statewide assessment cut scores also provide an indication of the status quo growth rate.

In Galileo K-12 Online, the cut score for passing (e.g. “Meets benchmark goals” in Arizona, “Proficient” in California, and so on) that is applied to the first benchmark assessment of the year is tailored for each district using equipercentile equating (Kolen & Brennan, 2004). The cut score is aligned to the performance of that district’s students on the previous year’s statewide assessment. For subsequent benchmark assessments, the passing cut score represents an increase over the previous cut score at an expected growth rate that is likely to maintain the status quo. The expected growth rate for each grade and subject is based on a review of the data from approximately 250,000 students in grades 1 through 12 who took Galileo benchmark assessments in math and reading during the 2007-08 school year. Districts and schools that are seeking to improve the percent of students passing the statewide assessment should aim for an increase in average DL scores that is greater than the increase in the cut score for passing the assessments.
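The idea behind tailoring the first cut score to last year's statewide results can be sketched in a few lines. This is a deliberately simplified illustration of the equipercentile idea, not the procedure used in Galileo K-12 Online: full equipercentile equating as described by Kolen and Brennan works with percentile rank functions on both tests and typically involves smoothing. The function name and the scores below are hypothetical.

```python
def equipercentile_cut(benchmark_scores, statewide_pass_rate):
    """Pick the benchmark score that passes the same share of students
    as passed the statewide assessment last year (simplified sketch).

    With tied scores, slightly more students than intended may pass.
    """
    if not 0.0 < statewide_pass_rate <= 1.0:
        raise ValueError("pass rate must be in (0, 1]")
    ordered = sorted(benchmark_scores, reverse=True)
    n_passing = round(statewide_pass_rate * len(ordered))
    if n_passing < 1:
        raise ValueError("too few students for this pass rate")
    # The cut score is the score of the lowest-scoring passing student.
    return ordered[n_passing - 1]


# Hypothetical district: 100 students with distinct scaled scores 500..995.
scores = [500 + 5 * i for i in range(100)]

# 62% of the district's students passed the statewide test last year.
cut = equipercentile_cut(scores, 0.62)
share_passing = sum(s >= cut for s in scores) / len(scores)
```

Here the cut score lands at the point where 62% of the hypothetical students meet or exceed it, mirroring the district's prior statewide pass rate.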

The following graphs illustrate a district that is showing growth at the expected rate and maintaining the status quo, and another that is showing growth that is better than the expected rate and which can expect to show improvement over the previous year with regard to the percent of students passing the statewide assessment.

District A: Showing growth but maintaining the status quo

District B: Showing growth AND improvement
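The comparison that separates the two districts can be expressed as a short calculation: growth in the average scaled score across benchmark assessments minus growth in the passing cut score over the same period. The sketch below uses invented numbers for both districts and a hypothetical function name; it is not ATI data or ATI code.

```python
def growth_beyond_status_quo(mean_scores, cut_scores):
    """Gain in mean scaled score minus gain in the passing cut score.

    Both lists hold one value per benchmark assessment, in order.
    A result near zero suggests the status quo is being maintained;
    a positive result suggests growth faster than the expected rate.
    """
    mean_gain = mean_scores[-1] - mean_scores[0]
    cut_gain = cut_scores[-1] - cut_scores[0]
    return mean_gain - cut_gain


# Hypothetical cut scores for three benchmark assessments in one year.
cut_scores = [1010, 1040, 1070]

# Hypothetical mean scaled scores for two districts.
district_a = [1000, 1030, 1060]  # grows at the same pace as the cut score
district_b = [1000, 1040, 1085]  # grows faster than the cut score

margin_a = growth_beyond_status_quo(district_a, cut_scores)
margin_b = growth_beyond_status_quo(district_b, cut_scores)
```

In practice one would also want to consider the variability of the scores and the number of students before drawing conclusions from a small margin.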

Different rates for different grade levels

There has been a great deal of research regarding the amount of growth in terms of scaled scores that can be expected within and across various grade levels and the results have been mixed. When IRT methods are used in calculating scaled scores, it has generally been found that the relative amount of change in scaled scores from one grade level to the next tends to decrease at the higher grade levels (Kolen & Brennan, 2004). ATI applies IRT methodology and has also observed that the rate of increase in student performance in terms of scaled scores tends to decrease at the higher grade levels. The graph that follows presents the mean scaled score on the first and third benchmark assessments within the 2007-08 school year for grades 1, 3, 5, and 8. The sample consisted of approximately 25,000 students per grade.

It should be noted that the slower rate of growth at the higher grade levels does not necessarily imply that students’ ability to learn decreases at the upper grades. The decrease in the growth rate may, for example, be a side effect of the methodology that is used in the raw-to-scale score conversion (Kolen, 2006). Regardless, the pattern is a stable one that provides a reliable measure against which to compare the growth of individual students or groups of students.

I’d love to hear other thoughts on monitoring intervention efforts. What has been helpful? What has not? What might be helpful in the future?


Kolen, M.J. (2006). Scaling and norming. In R.L. Brennan (Ed.), Educational Measurement (pp. 155-186). Westport, CT: American Council on Education and Praeger Publishers, jointly.

Kolen, M.J. & Brennan, R.L. (2004). Test equating, scaling, and linking: methods and practices (2nd ed.). New York: Springer.

Saturday, November 1, 2008

Experimental Research and Educational Interventions

A few years ago, the Federal Government introduced a new entity in the Department of Education called the Institute of Education Sciences. A significant motivating force behind the founding of the Institute was the idea that educational practice should be informed by experimental research. The reference to science in the title of the Institute calls for educational practice based on scientific evidence rather than philosophical argument. The “gold standard” for experimental research calls for the development and testing of hypotheses through experiments in which research participants (e.g., students) are assigned at random to experimental and control conditions. The great advantage of the experimental approach is that it affords a sound basis for making causal inferences leading to the determination of factors affecting learning.
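Random assignment, the defining feature of the experimental approach described above, is simple to sketch. The example below is a minimal illustration with hypothetical function names and a hypothetical roster; it shuffles the participants, splits them into experimental and control groups, and provides a helper for the difference in mean outcome between the groups. A real study would also need a significance test and attention to sample size.

```python
import random


def assign_at_random(participants, seed=None):
    """Shuffle participants and split them into two groups.

    Returns (experimental, control). With an odd number of
    participants the control group receives the extra person.
    """
    rng = random.Random(seed)
    shuffled = list(participants)  # copy so the caller's list is untouched
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


def mean_difference(experimental_scores, control_scores):
    """Mean outcome in the experimental group minus the control mean."""
    exp_mean = sum(experimental_scores) / len(experimental_scores)
    ctl_mean = sum(control_scores) / len(control_scores)
    return exp_mean - ctl_mean


# Hypothetical roster of 20 students identified by number.
roster = list(range(1, 21))
experimental, control = assign_at_random(roster, seed=2008)
```

Because chance alone decides who lands in each group, systematic differences between the groups at the outset are unlikely, which is what supports a causal reading of any difference in outcomes.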

Unfortunately, the educational research community has not responded adequately to the call for experimental studies. Indeed, the number of experimental studies conducted in education in the United States has been declining for some time. Professor Joel Levin of the Department of Educational Psychology at the University of Arizona is one of a number of scholars who have played an important role in providing evidence that the decline is real. Moreover, he and others have been effective in pointing out the damaging effects of the decline on the potential impact of educational research on educational interventions designed to improve learning.

There are many possible reasons for the decline. One obvious reason is that the conduct of experimental research in schools can be very expensive. Expense is particularly problematic in experiments taking place over an extended time span involving large numbers of students and teachers. Many university professors, particularly young scholars, do not have access to the funding resources needed to conduct experimental studies of this kind.

Fortunately there is much to be gained from short experiments, which are inexpensive to conduct. Much of our knowledge regarding student learning, memory, cognition, and motivation has come from studies that generally require less than an hour of time from each of a small number of research subjects. Thus, while it may be beneficial to assess the effects of an entire curriculum on learning over the course of the school year, it may also be useful to assess the effects of experimental variables implemented in a single lesson or small number of lessons.

Focusing research on short experimental interventions provides much needed flexibility in implementing school-based interventions in the dynamic world of the 21st century. The educational landscape of the 21st century is in a constant state of flux. Standards change, curriculums change, and assessment practices change rapidly in the current educational environment. Schools have to be able to adjust intervention practices quickly to achieve their goals. Short experiments provide the flexibility needed to support rapid change in educational practice. I think we need more of them. In fact I think we need thousands of them coming from researchers across the nation. We now have the technology to manage the massive amounts of information that large numbers of short experiments can provide. The task ahead is to apply that technology in ways that support the continuing efforts of the educational community to promote student learning.

Thursday, October 23, 2008

Welcome to our Town Hall

Hello all,

This opening post is a good time to introduce a model for looking at the relationship between assessment and interventions that will hopefully serve as a good basis for upcoming discussion. The model is shown in the diagram below:

Put quite simply, the results of assessments are used as a basis to make plans for instructional activities. These activities are then implemented and the results are evaluated using instruments similar to what was initially employed to evaluate student mastery. The assessment results are then used again for planning and the cycle continues.

The view that we will be discussing is that such a dynamic approach to intervention is critical to the ultimate success of the efforts to increase student achievement. We will be including posts in this blog relating to many issues within this model, such as building reliable and valid assessments and evaluating the data that they produce. We will also be inviting districts that have implemented intervention programs to comment on what they did, their successes, and their challenges. We welcome participation from all. Please leave us your thoughts in the comment section below.