ATI Town Hall Blog: Comparing Results from Benchmarks When the Questions and Standards Are Different

Administering benchmark assessments is a practice most educators know well. Benchmark assessments, sometimes referred to as interim assessments, are intended to measure a student’s mastery of skills in a content area against grade-level standards and learning goals. But what happens when the assessments vary? What do you do when the questions and standards are not the same across benchmark assessments? How do you evaluate student progress when the benchmark assessments are not identical to one another?

The team of researchers at ATI makes it possible to answer these questions by implementing state-of-the-art measurement techniques. The process begins by using Item Response Theory (IRT) scaling techniques to place student scores from a series of Galileo assessments on a common scale. The relationship between student ability and item difficulty plays a particularly important role in the IRT scaling process. In IRT, student ability estimates and item difficulty estimates inform one another within the same mathematical model. Student ability is estimated in light of the relative difficulty of the items on the test, and the difficulty of the items is estimated in light of the ability level of the students who responded to them. For example, in a common IRT model, a student of average ability will have a fifty-fifty chance of responding correctly to an item of average difficulty. A student who is one standard deviation above the mean ability level will have a fifty-fifty chance of responding correctly to an item that is one standard deviation above the mean in difficulty. Likewise, a student of below-average ability will have a fifty-fifty chance of responding correctly to an item that is below average in difficulty by the same amount.
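The fifty-fifty relationship described above can be sketched with the one-parameter logistic (Rasch) model, which is one common IRT model. This is an illustrative assumption only, not a statement of the model used in Galileo; the function name `p_correct` and the example values are hypothetical. Ability and difficulty are both expressed in standard deviation units from the mean.

```python
import math


def p_correct(ability, difficulty):
    """Probability of a correct response under the Rasch (1PL) model.

    Both arguments are on the same scale, so only their difference matters.
    """
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


# When ability matches difficulty, the chance of success is fifty-fifty,
# wherever on the scale the match occurs.
for level in (-1.0, 0.0, 1.0):
    print(f"ability={level:+.1f}, difficulty={level:+.1f}: "
          f"P(correct)={p_correct(level, level):.2f}")

# When ability exceeds difficulty, the chance rises above one half;
# when difficulty exceeds ability, it falls below one half.
print(f"{p_correct(1.0, 0.0):.2f}")  # about 0.73
print(f"{p_correct(0.0, 1.0):.2f}")  # about 0.27
```

Because the probability depends only on the gap between ability and difficulty, the two quantities can share one scale.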

Because ability and difficulty are measured on the same scale, the student’s scale score, which is an estimate of ability, can be adjusted for the difficulty of the items included in the assessment. This adjustment is a key part of the scaling process, and it is what makes it possible to compare scores from different tests. When scores such as percent correct are used, no such adjustment is possible, and scores from different tests cannot be compared. For example, if a student received a score of 70 percent correct on one test and 90 percent correct on a second test, the difference could have occurred because the second test was easier than the first, because the student’s performance improved, or both.
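To make the 70 percent versus 90 percent example concrete, here is a minimal sketch of a difficulty-adjusted ability estimate. It assumes the Rasch model and item difficulties that are already known; the function names, the difficulty values, and the simple bisection search are all hypothetical choices for illustration and do not describe ATI's actual scaling procedure.

```python
import math


def p_correct(ability, difficulty):
    """Rasch (1PL) probability of a correct response (illustrative model)."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


def estimate_ability(num_correct, difficulties, lo=-6.0, hi=6.0, tol=1e-6):
    """Maximum-likelihood ability estimate for known item difficulties.

    Under the Rasch model, the estimate is the ability at which the expected
    number of correct answers equals the observed number. The expected count
    rises with ability, so bisection finds that point.
    """
    if not 0 < num_correct < len(difficulties):
        raise ValueError("estimate is undefined for zero or perfect scores")
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        expected = sum(p_correct(mid, b) for b in difficulties)
        if expected < num_correct:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


# Hypothetical tests of ten items each: one harder, one easier.
hard_test = [0.5] * 10
easy_test = [-1.0] * 10

theta_hard = estimate_ability(7, hard_test)  # 70 percent correct
theta_easy = estimate_ability(9, easy_test)  # 90 percent correct

print(f"70% on the harder test: ability of about {theta_hard:.2f}")
print(f"90% on the easier test: ability of about {theta_easy:.2f}")
```

With these made-up difficulties, the two estimates come out close together (roughly 1.35 and 1.20), even though the percent-correct scores differ by 20 points. In this invented case, most of the apparent gain reflects the easier second test.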

In this regard, a powerful report to check out in the Galileo Help files is the Item Parameters Report, which provides information about item difficulty and other item parameters for each item on a benchmark test. Other helpful reports are the Aggregate Multi-Test and Student Growth and Achievement reports. These present student IRT scale scores, called Developmental Level (DL) scores in Galileo, across a series of tests so that student progress can be monitored.

Watch the brief video on Psychometrics

To learn more about ATI’s research initiatives, visit our website. For a one-on-one demonstration of the reporting features mentioned in this blog, request a personal demo with one of our knowledgeable field services coordinators.

Other topics of interest:
How to use item parameters to make decisions during test review
How does ATI calculate my district’s psychometrics benchmark test data?

Monday, October 30, 2017
