Introducing the LEVANTE core tasks: open, cross-cultural measures of learning and development (ages 5–12)
Dr. Fionnuala O’Reilly & Prof. Michael C. Frank
January 14, 2026
Capturing how children learn and develop is difficult, especially at scale. If you're tracking growth over time, evaluating an intervention, or testing how a child's environment shapes development, you need measures that are efficient, reliable, valid, and usable across ages and contexts. In reality, the field is full of compromises: tasks that only work for a narrow age band, measures that don't translate cleanly across languages or cultures, and "gold standard" tools that sit behind licenses and are expensive to access. That's the gap we're trying to close with LEVANTE.
Today we’re sharing a preprint of our paper introducing the LEVANTE core tasks: a suite of nine psychometrically grounded behavioral tasks designed for children aged 5–12, covering language, mathematics, reasoning, executive function, and social cognition. Our aim is to provide a set of measures that can act as a shared yardstick for developmental research—across childhood, across contexts, and across cultures.
LEVANTE was designed to tackle these three challenges by building and validating a set of measures that work across the full 5–12 age range, transfer across languages and cultures, and are fully open access.
In this paper, we describe how we selected and adapted well-established tasks from the literature, re-implemented them on an open-source web platform, and evaluated initial feasibility, reliability, and validity using pilot data. Psychometric models based on item-response theory (IRT) are a central component of our approach: they allow us to calibrate item difficulty and estimate children’s ability on a common scale, enabling direct comparison across ages, sites, and administration modes.
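To make the IRT idea concrete, here is a minimal sketch of the two-parameter logistic (2PL) model, one standard IRT formulation. This is an illustration of the general technique, not LEVANTE's actual implementation; the function names are ours.

```python
import math

def p_correct(theta: float, a: float, b: float) -> float:
    """2PL IRT model: probability that a child with ability `theta`
    answers an item with discrimination `a` and difficulty `b` correctly.

    When theta == b, the probability is exactly 0.5; higher ability
    (or an easier item) pushes the probability toward 1.
    """
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))
```

Because every child and every item sits on the same latent scale, a score estimated from one subset of items is directly comparable to a score estimated from another, which is what enables comparison across ages, sites, and administration modes.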
Our pilot phase included three sites on three continents and three administration modes (in-school, in-lab, and at-home). We present initial evidence from the nine tasks indicating that they (i) capture expected developmental change across childhood, (ii) produce scores that differentiate children across a range of ability levels (with performance patterns varying by task), and (iii) show early validity through associations with other measures in theoretically sensible directions.

A practical strength of LEVANTE is its use of computer adaptive testing (CAT): children are presented with items targeted to their ability level, allowing us to estimate skill level efficiently, often in just a few minutes, while maintaining accurate measurement across a range of abilities.

Although we use psychometric models to place task performance on a common scale, LEVANTE is not designed to estimate absolute differences between sites. As outlined above, sites differ in sampling, recruitment, and administration, so apparent performance gaps would be highly confounded and therefore uninterpretable.
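A common way CAT achieves this efficiency is maximum-information item selection: after each response, choose the unadministered item that is most informative at the current ability estimate. The sketch below shows that selection rule for 2PL items; it is a generic illustration under our own naming, not LEVANTE's production code.

```python
import math

def p_correct(theta: float, a: float, b: float) -> float:
    # 2PL IRT response probability (see the model above)
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def item_information(theta: float, a: float, b: float) -> float:
    # Fisher information of a 2PL item: a^2 * p * (1 - p).
    # Largest when the item's difficulty matches the child's ability.
    p = p_correct(theta, a, b)
    return a * a * p * (1.0 - p)

def next_item(theta: float, bank: list, administered: set) -> int:
    """Return the index of the most informative unadministered item.

    `bank` is a list of (discrimination, difficulty) tuples;
    `administered` holds indices of items already shown.
    """
    candidates = [i for i in range(len(bank)) if i not in administered]
    return max(candidates, key=lambda i: item_information(theta, *bank[i]))
```

Because each selected item is pitched near the child's current estimate, far fewer items are needed to reach a given measurement precision than with a fixed test form.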
The nine core tasks are still in development, and our approach is to centre psychometrics throughout design and refinement: we continuously monitor performance, systematically refine items, and share both the tools and the evidence as they evolve. Two near-term directions we’re pursuing are:
If you’d like to use LEVANTE or join our network, please get in touch.