When Your Ruler Is Wrong: How Rescoring Improves Measurement in LEVANTE

Adaptive tests are remarkable tools: instead of giving every child the same 50 questions, the computer picks the next question based on how the child is doing, zeroing in on their ability with a much shorter test. In practice, that efficiency depends on one thing going right: the system has to know how hard each question really is.

At LEVANTE, we are already deploying computerized adaptive tests (CATs) across our pilot sites in Germany, Colombia, and Canada, and will soon be rolling them out to other funded sites. When we launched, our item banks were often built from small pilot samples at those sites, and sometimes relied on placeholder difficulty values as stand-ins until we had enough data to calibrate properly. In some cases, this meant the tests were poorly calibrated. For our mathematics task, 342 items shared only 23 unique difficulty values, meaning entire groups of questions were treated as equally hard when they clearly were not.

Tests still ran. Children still got scores. But some of those scores were more approximate than they needed to be. What we’ve discovered is that we can correct this issue after the fact using a technique called “rescoring,” allowing us to deploy CATs quickly even when we don’t have perfect knowledge.

Why imperfect calibration matters

Miscalibrated difficulty estimates hurt a CAT in two ways.

Scoring error is the obvious one: if the system thinks a question is easier than it really is, it will over-penalize a wrong answer and the child’s estimated ability will be too low.

Selection error is subtler. The system uses those same faulty estimates to pick the next question. So a child may end up answering questions that were poorly matched to them in the first place.

Both problems are amplified across populations. A question that is easy for children in one country may be harder in another because of curriculum, language, or cultural familiarity. When the item bank treats the same question as equally easy everywhere, children in some countries end up more mismeasured than others.
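To make scoring error concrete, here is a minimal sketch in Python. It assumes a Rasch (one-parameter logistic) model and grid-search maximum-likelihood scoring, illustrative choices rather than LEVANTE's production setup. A single item believed to be far easier than it really is pulls the estimate down after a miss, exactly the pattern described above.

```python
import numpy as np

def rasch_p(theta, b):
    """Probability of a correct answer at ability theta, item difficulty b."""
    return 1 / (1 + np.exp(-(theta - b)))

def ml_theta(answers, b, grid=np.linspace(-4, 4, 801)):
    """Maximum-likelihood ability estimate over a grid of candidate thetas."""
    p = rasch_p(grid[:, None], np.asarray(b)[None, :])
    loglik = np.sum(np.where(np.asarray(answers) == 1,
                             np.log(p), np.log(1 - p)), axis=1)
    return grid[np.argmax(loglik)]

# Ten items; the child answers the first nine correctly and misses the last.
answers = [1] * 9 + [0]
true_b  = np.array([-1.5, -1, -1, -0.5, 0, 0, 0.5, 0.5, 1, 1.5])
wrong_b = true_b.copy()
wrong_b[-1] = -1.0   # the missed item is believed much easier than it is

print(ml_theta(answers, true_b))   # scored with the correct difficulty
print(ml_theta(answers, wrong_b))  # the "easy" miss drags the estimate down
```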

What rescoring does, and what it cannot do

Once we had enough data to properly calibrate the item banks, we faced a choice. Apply the new estimates going forward? Or also go back to the children who had already taken the test, and recompute their scores using the better information?

The second option is known as rescoring. It is appealing because no child has to be retested. It is essentially a free upgrade to score accuracy using answers we already have.

But rescoring can only fix part of the problem. It addresses scoring error directly. What it cannot do is change which questions the child was given. If the system selected a poorly matched question because of bad initial information, that question has already been answered. Selection error is baked in.
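Computationally, rescoring is a small operation: the stored answers and the set of administered items stay fixed, and only the item parameters used for scoring change. Here is a minimal sketch, assuming a Rasch model with EAP (expected a posteriori) scoring and made-up difficulty values; the actual LEVANTE models and estimators may differ.

```python
import numpy as np

def eap_theta(answers, b, grid=np.linspace(-4, 4, 801)):
    """EAP ability estimate under a Rasch model with a N(0, 1) prior."""
    p = 1 / (1 + np.exp(-(grid[:, None] - np.asarray(b)[None, :])))
    likelihood = np.prod(np.where(np.asarray(answers) == 1, p, 1 - p), axis=1)
    posterior = likelihood * np.exp(-grid ** 2 / 2)
    return np.sum(grid * posterior) / np.sum(posterior)

# One child's stored answers to the items they were actually given
# (1 = correct), with each item's difficulty under the old and the
# recalibrated bank. The numbers are illustrative.
answers = np.array([1, 1, 0, 1, 0, 1, 1, 0])
old_b   = np.array([-1.0, -0.5, 0.0, 0.0, 0.5, 0.5, 1.0, 1.0])
new_b   = np.array([-1.2,  0.1, 0.4, -0.3, 0.9, 0.2, 1.5, 0.8])

print(eap_theta(answers, old_b))  # original operational score
print(eap_theta(answers, new_b))  # rescored: same answers, recalibrated parameters
```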

So the practical question is: how much of the accuracy loss can rescoring actually recover?

Simulation: how much is fixable?

We built a simulation that varied how closely the initial difficulty estimates matched the true ones and compared three conditions: the operational CAT with imperfect estimates, the same CAT with rescoring after recalibration, and a benchmark in which the system had the better estimates all along.
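Here is a simplified sketch of that design. It assumes a Rasch model, a closest-difficulty selection rule, and EAP scoring, with miscalibration modeled as Gaussian noise added to the true difficulties; these are illustrative stand-ins for the study's actual model and selection rule. The operational condition carries both error types, the rescored condition keeps the operational item selection but rescores with the correct parameters, and the benchmark avoids both.

```python
import numpy as np

GRID = np.linspace(-4, 4, 161)

def eap(answers, b):
    """EAP ability estimate under a Rasch model with a N(0, 1) prior."""
    p = 1 / (1 + np.exp(-(GRID[:, None] - np.asarray(b)[None, :])))
    lik = np.prod(np.where(np.asarray(answers) == 1, p, 1 - p), axis=1)
    post = lik * np.exp(-GRID ** 2 / 2)
    return np.sum(GRID * post) / np.sum(post)

def run_cat(theta, true_b, select_b, test_len, rng):
    """One simulated CAT: items are chosen using the believed difficulties
    (select_b), while answers are generated from the true ones (true_b).
    Returns the administered item indices and the 0/1 answers."""
    unused = list(range(len(true_b)))
    given, answers, est = [], [], 0.0
    for _ in range(test_len):
        i = min(unused, key=lambda j: abs(select_b[j] - est))  # closest believed difficulty
        unused.remove(i)
        given.append(i)
        answers.append(int(rng.random() < 1 / (1 + np.exp(-(theta - true_b[i])))))
        est = eap(answers, select_b[given])
    return np.array(given), np.array(answers)

rng = np.random.default_rng(1)
n_items, test_len, n_kids, miscal_sd = 200, 20, 500, 0.8
true_b = rng.normal(0, 1, n_items)                    # stand-in for proper calibration
oper_b = true_b + rng.normal(0, miscal_sd, n_items)   # miscalibrated operational bank

errs = {"operational": [], "rescored": [], "benchmark": []}
for theta in rng.normal(0, 1, n_kids):
    given, ans = run_cat(theta, true_b, oper_b, test_len, rng)
    errs["operational"].append(eap(ans, oper_b[given]) - theta)  # selection + scoring error
    errs["rescored"].append(eap(ans, true_b[given]) - theta)     # selection error only
    given_b, ans_b = run_cat(theta, true_b, true_b, test_len, rng)
    errs["benchmark"].append(eap(ans_b, true_b[given_b]) - theta)  # neither error

for name, e in errs.items():
    print(f"{name:12s} RMSE = {np.sqrt(np.mean(np.square(e))):.3f}")
```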

Figure 1. Simulated measurement error as a function of calibration quality. The x-axis shows how closely the initial item difficulties matched the properly calibrated ones, from a poor match on the left to a perfect match on the right. The orange line is error under the original operational scoring; the gold line is error after rescoring; the blue line is the best-case benchmark. The orange shaded region is scoring error — the portion that rescoring recovers. The blue shaded region is selection error — the portion that cannot be fixed after the test. Scoring error is substantially larger than selection error across the full range.

The headline from the simulation: scoring error is substantially larger than selection error across a wide range of realistic conditions. Most of the accuracy loss from a miscalibrated item bank is recoverable, and rescoring recovers it.

Testing on real LEVANTE data

We scored 2,657 children across four tasks — mathematics, matrix reasoning, mental rotation, and same-different selection. Many children completed more than one task. For each task, we computed two ability estimates per child, one under the original item parameters and one under the recalibrated parameters.

Scores shifted, and the amount varied by task. Mental rotation moved by nearly two points on the ability scale on average, a substantial change indicating the original parameters were badly off. For matrix reasoning, in contrast, the correlation between old and new scores was .89. That correlation is high, but it still corresponds to meaningful reordering of children: a nontrivial fraction shifted position enough to affect individual feedback or group-level comparisons.

Site-level patterns matched what the simulation predicted. Colombia showed the largest systematic shifts in mathematics, suggesting the original item bank was most poorly matched to that population. Germany showed the most reordering in matrix reasoning. The same item bank affected each country differently.

Figure 2. Per-child score changes after rescoring, by task and country. Distributions to the left of zero indicate scores decreased under rescoring. Colombia (green) shifted most in math; Germany (red) showed the widest spread in matrix reasoning.

Better, not just different

Showing that scores change is necessary but not sufficient. The real question is whether the new scores are better. Three independent checks point in the same direction.

Age validity improves. Since cognitive ability grows with age in children aged 5 to 12, ability estimates should correlate with age. For matrix reasoning, this correlation rose from .37 to .50 after rescoring — the new scores track developmental patterns more accurately.

Cross-task correlations strengthen. Tasks measuring related aspects of cognition, like mathematics and matrix reasoning, should correlate. Under the new parameters, all five task pairs we examined showed stronger correlations, with gains of up to .17.

Test–retest stability increases. In a German subsample where some children took the test twice in the same format, 67% had more stable scores under the new parameters.
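For anyone wanting to run similar checks, here is a minimal sketch of the age-validity and test-retest checks (the cross-task check is simply a correlation between two tasks' scores under each parameter set). The data here are synthetic stand-ins; the real analysis runs on the scored LEVANTE records.

```python
import numpy as np

def pearson(x, y):
    """Pearson correlation between two equal-length arrays."""
    return np.corrcoef(x, y)[0, 1]

def stability_win_rate(old_t1, old_t2, new_t1, new_t2):
    """Fraction of twice-tested children whose two sessions agree more
    closely under the new parameters than under the old ones."""
    return np.mean(np.abs(new_t1 - new_t2) < np.abs(old_t1 - old_t2))

# Synthetic stand-in data: rescored estimates are simulated as
# less noisy measures of the same underlying ability.
rng = np.random.default_rng(7)
age = rng.uniform(5, 12, 300)
theta = (age - 8.5) / 3 + rng.normal(0, 0.5, 300)  # ability grows with age
old = theta + rng.normal(0, 1.2, 300)              # scores under old parameters
new = theta + rng.normal(0, 0.7, 300)              # scores after rescoring

print(f"age validity: r = {pearson(age, old):.2f} -> {pearson(age, new):.2f}")

# Test-retest: two sessions per child, with less noise after rescoring.
old_t1, old_t2 = theta + rng.normal(0, 1.2, (2, 300))
new_t1, new_t2 = theta + rng.normal(0, 0.7, (2, 300))
print(f"stability win rate: {stability_win_rate(old_t1, old_t2, new_t1, new_t2):.2f}")
```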

Improvements were largest for tasks whose original parameters were most misaligned — exactly what the simulation predicted. That is what makes rescoring more than a technical adjustment. It is a validated improvement in measurement quality.

What this means going forward

Rescoring is now a standard part of the LEVANTE pipeline whenever we add items or expand to new languages. But its implications go beyond fixing past scores.

Because rescoring works, it changes the calculus for launching new assessments. Teams can deploy a CAT to a new population using the best available parameters — whether from a small pilot, a related population, or even AI-generated estimates — and then recalibrate as data accumulates. Initial scores are provisional; final scores are accurate. For an international project like LEVANTE, where expansion to new languages and sites is ongoing, that matters.

Rescoring is not the whole solution. It cannot repair selection error, so we continue to work on smarter ways for the CAT to choose items during testing. But as a retrospective fix, it is a reliable tool, and one we think other projects working with developmental or cross-cultural assessments may find useful too.

A preprint describing the simulation study and empirical validation is in preparation. If you are thinking about rescoring in your own CAT and want to compare notes, get in touch.

Further reading

•       Wainer, H., Dorans, N. J., Flaugher, R., Green, B. F., & Mislevy, R. J. (2000). Computerized adaptive testing: A primer. Routledge.

•       Weiss, D. J. (1982). Improving measurement quality and efficiency with adaptive testing. Applied Psychological Measurement, 6(4), 473–492.

•       Kachergis, G., O’Reilly, F., Braginsky, M., Xiao, X., et al. (2025). Creation and validation of the LEVANTE core tasks: Internationalized measures of learning and development for children ages 5–12 years. PsyArXiv. https://doi.org/10.31234/osf.io/r4dhw_v1

AI Usage Statement

Claude (Anthropic) was used to assist with formatting and revision. All research, analyses, and scientific claims are the author’s.