ATI Town Hall Blog: A Forum Follow-up…

…Shouldn’t we assess our assessments before we use them to evaluate our intervention investments?

In a nutshell – Yes!

An interesting and highly pertinent question came up during the recent multi-state Educational Interventions Forum. It was this – “If we are going to use benchmark assessments as part of our efforts to evaluate the educational return on our intervention investments, then, is it not the case that we must first evaluate the credibility of our assessments?”

What a great question! Much like 21st-century educational interventions, benchmark assessment tools are proliferating rapidly. This broad array of assessments can potentially serve a wide range of educational needs and goals. Consequently, when evaluating the credibility of a benchmark assessment tool for documenting intervention impact, a basic question to ask is: Does the tool have a credible “fit to function”?

Appropriately designed, credible benchmark assessments can not only help determine the impact of intervention investments on learning, but also provide valuable and timely data to guide instruction throughout the school year. Benchmark assessments are locally relevant, district-wide assessments designed to measure student achievement of standards for the primary purpose of informing instruction. In assessing “fit to function” as it relates to the use of benchmark assessments in evaluating intervention investments, a number of issues should be addressed:

1. Do your benchmark assessments provide reliable information on student learning as it relates to mastery of standards? Reliability has to do with the consistency of information provided by an assessment. A particularly important form of reliability for benchmark assessment is internal consistency. Measures of internal consistency provide information regarding the extent to which all of the items on a benchmark assessment are related to the underlying ability (e.g., math) that the assessment is designed to measure.

Reliability is directly affected by the length of the benchmark assessment. Longer assessments tend to be more reliable than shorter assessments. Based on our research in developing and analyzing customized benchmark assessments for school districts in several states, we have found that benchmark assessments consistently begin to reach an acceptable level of reliability at a length of about 35 to 40 items.
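
For readers who want to see what these two ideas look like in practice, here is a minimal sketch. It assumes internal consistency is summarized with Cronbach's alpha and that the effect of test length is approximated with the Spearman-Brown formula; both are standard psychometric tools, but neither is named in this post, and the Spearman-Brown projection additionally assumes that any added items behave like the existing ones. The function names and the response data are made up for illustration.

```python
from statistics import pvariance


def cronbach_alpha(item_scores):
    """Cronbach's alpha for a score matrix (one row per student, one column per item)."""
    n_items = len(item_scores[0])
    # Variance of each item across students.
    item_vars = [pvariance([row[i] for row in item_scores]) for i in range(n_items)]
    # Variance of the students' total scores.
    total_var = pvariance([sum(row) for row in item_scores])
    if total_var == 0:
        raise ValueError("Total scores do not vary; alpha is undefined.")
    return (n_items / (n_items - 1)) * (1 - sum(item_vars) / total_var)


def spearman_brown(reliability, length_factor):
    """Projected reliability when test length is multiplied by length_factor."""
    return (length_factor * reliability) / (1 + (length_factor - 1) * reliability)


# Hypothetical data: 5 students, 4 items, scored 1 (correct) or 0 (incorrect).
responses = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]

alpha = cronbach_alpha(responses)        # approximately 0.8 for this toy data
doubled = spearman_brown(0.6, 2)         # a 0.6 reliability projects to about 0.75
print(round(alpha, 3), round(doubled, 3))
```

A real analysis would use far more students and items than this toy example; the point is only that reliability can be computed from item-level data and that the projection rises as length increases.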

2. Do your benchmark assessments provide valid information on student learning as it relates to mastery of standards? Since an important function of benchmark assessment is to measure the achievement of state standards, it is reasonable to expect significant correlations between benchmark assessments in a particular state and the statewide test for that state. A finding revealing such correlations provides important evidence of the validity of the benchmark assessments.

Although significant correlations support the validity of benchmark assessments, it is important to recognize that the two forms of assessment serve different purposes. Statewide tests are typically administered toward the end of the school year to provide accountability information.

Benchmark assessments are administered periodically during the school year to guide instruction. The skills assessed on a benchmark test are typically selected to match skills targeted for intervention at a particular time during the school year. For these and other reasons, benchmark assessments should not be thought of as replicas of statewide tests.

Correlations among benchmark assessments provide another source of evidence of the validity of benchmark assessments. This is because multiple benchmark tests administered during the school year measure student achievement in the same or related knowledge areas. As a result, it is reasonable to expect benchmark tests to correlate well with each other.
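
The correlations described above can be sketched as follows. This is an illustrative example only: it assumes the Pearson correlation coefficient is the statistic of interest (the post does not say which coefficient is used), the scores are invented, and the helper names are hypothetical. The same function applies whether the second set of scores comes from the statewide test or from another benchmark assessment.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(ss_x * ss_y)


def t_statistic(r, n):
    """t statistic for testing whether a correlation differs from zero (n - 2 degrees of freedom)."""
    return r * sqrt((n - 2) / (1 - r ** 2))


# Hypothetical scores for the same eight students.
benchmark_scores = [18, 22, 25, 27, 30, 31, 35, 38]
statewide_scores = [410, 455, 440, 480, 500, 495, 540, 560]

r = pearson_r(benchmark_scores, statewide_scores)
t = t_statistic(r, len(benchmark_scores))
print(round(r, 3), round(t, 2))
```

The t statistic would then be compared against a t distribution with n - 2 degrees of freedom to judge significance; that lookup is left out here to keep the sketch free of external libraries.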

3. Do your benchmark assessments accurately forecast state classifications of standards mastery? The validity of local school district customized benchmark assessments is supported not only by the correlations among benchmark assessments and statewide tests, but also by their accuracy in forecasting state classifications of standards mastery. Since determining whether or not students have mastered standards is, for all intents and purposes, a categorical decision (i.e., they did or they didn’t), research on the accuracy of forecasted classifications can provide validity evidence for benchmark assessments.
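
One simple way to examine forecast accuracy is sketched below. It assumes a forecast is made by comparing each benchmark score to a cut score, which is an assumption for illustration and not a description of any particular forecasting procedure. The cut score, the data, and the function name are all hypothetical.

```python
def forecast_accuracy(benchmark_scores, cut_score, state_mastery):
    """Compare cut-score forecasts with the state's actual mastery classifications.

    benchmark_scores: benchmark score per student
    cut_score: hypothetical score at or above which mastery is forecast
    state_mastery: True if the state classified the student as having mastered the standards
    """
    forecasts = [score >= cut_score for score in benchmark_scores]
    pairs = list(zip(forecasts, state_mastery))
    counts = {
        "forecast_mastery_and_mastered": sum(f and a for f, a in pairs),
        "forecast_mastery_not_mastered": sum(f and not a for f, a in pairs),
        "forecast_nonmastery_but_mastered": sum((not f) and a for f, a in pairs),
        "forecast_nonmastery_not_mastered": sum((not f) and (not a) for f, a in pairs),
    }
    correct = (counts["forecast_mastery_and_mastered"]
               + counts["forecast_nonmastery_not_mastered"])
    counts["accuracy"] = correct / len(pairs)
    return counts


# Hypothetical data for six students.
scores = [30, 22, 18, 35, 25, 12]
mastered = [True, True, False, True, False, False]

result = forecast_accuracy(scores, cut_score=24, state_mastery=mastered)
print(result)  # 4 of the 6 forecasts agree with the state classification
```

Keeping the four counts separate, and not reporting only the overall accuracy, shows whether errors lean toward forecasting mastery that did not occur or toward missing mastery that did.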

In addition to the fundamental, research-oriented “fit to function” questions raised above, it is also essential to consider a number of other issues in assessing your benchmark assessments. These might include:

What kinds of procedures are in place to ensure that your benchmark assessments are aligned to state and district standards and tailored to reasonably accommodate your district pacing guides?

What kinds of procedures are in place to ensure that items utilized in your benchmark assessments have gone through a rigorous process of development including alignment with standards and/or performance objectives, review, and certification?

What kinds of procedures are in place to ensure that the psychometric properties of your benchmark assessments, including Item Response Theory (IRT) item parameter estimates such as difficulty, discrimination, and guessing, are continuously calibrated on your local student population?
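
To make the three item parameters mentioned above concrete, here is a small sketch of how difficulty, discrimination, and guessing combine in the three-parameter logistic (3PL) model. The post does not name a specific IRT model, so treating these parameters as 3PL parameters is an assumption. The parameter values are invented, and some formulations include an extra scaling constant that is omitted here.

```python
from math import exp


def three_pl_probability(theta, discrimination, difficulty, guessing):
    """Probability of a correct response under the 3PL model.

    theta: student ability
    discrimination: how sharply the item separates lower and higher ability
    difficulty: ability level at which the item becomes likely to be answered correctly
    guessing: lower bound on the probability of a correct response
    """
    logistic = 1 / (1 + exp(-discrimination * (theta - difficulty)))
    return guessing + (1 - guessing) * logistic


# Hypothetical item: moderate discrimination, average difficulty, four-option guessing floor.
for ability in (-2.0, 0.0, 2.0):
    p = three_pl_probability(ability, discrimination=1.2, difficulty=0.0, guessing=0.25)
    print(ability, round(p, 3))
```

Calibrating on a local student population means estimating these parameters from that population's responses; the estimation step itself is beyond the scope of this sketch.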

Clearly, whether you plan to use benchmark assessments to evaluate the impact of your intervention investments, to inform data-driven instructional decision making, or to monitor student progress and level of risk for meeting or not meeting state standards, the basic issues raised here likely deserve some discussion within your district. I will conclude by asking you a few questions:

First, from your perspective, are these valuable questions to consider and why?

Second, what kinds of activities are currently occurring within your district to help you ensure that your benchmark assessment system has a credible “fit to function”?

Third, what other kinds of questions do you think we should be asking about our assessment systems as we evaluate their strengths and limitations in helping us to meet our educational goals for students?

Monday, March 16, 2009

