How we validate translations
David Cardinal
April 7, 2026
When we set out to build a shared battery of tasks for children across countries, translation quickly emerged as one of the hardest technical problems. If a task does not “speak every child’s language” in a way that is natural and culturally grounded, comparisons across sites start to break down.
Since the start of the project, our team has worked with a series of approaches to evaluating translation quality, ranging from classic backtranslation and manual review to embedding-based similarity and large language models acting as semantic judges. This post is both a retrospective on that journey and an update on what we are doing currently.
Our first attempts at validating translations looked like many large assessment projects: translators created forward translations, bilingual researchers and site partners reviewed them, and we used backtranslation to catch obvious shifts in meaning.
Backtranslation was appealing because it gave non‑bilingual team members something to look at: the translated item came back into English, and we could compare that version to the original in order to spot issues. This approach mirrors methods that have been used for survey and clinical instrument translation for decades.
As we have grown, two problems have surfaced: manual review could not keep pace with the number of items and languages, and backtranslation on its own is not a reliable indicator of quality. We therefore looked for a system that could triage translations automatically, focusing human review on the items most in need of it.
Sentence embeddings use large neural networks to map text into high‑dimensional vectors so that semantically similar sentences end up close together, even when they differ on the surface. With a multilingual embedding model we can directly compare the embedding of the original text with that of the translated text, without any backtranslation step. In many cases, comparing embeddings has proven more effective than backtranslating and comparing strings.
So we augment our other validation scores with a metric that encodes both the source and the translated text as embeddings and computes the cosine similarity between them.
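To make this concrete, here is a minimal sketch of what such a check can look like in Python, assuming the open-source sentence-transformers library. The model name, example strings, and threshold are illustrative assumptions, not our exact configuration.

```python
# Minimal sketch: multilingual embedding similarity as a triage signal.
# Assumes the sentence-transformers library; the model name and threshold
# are illustrative, not necessarily what runs in production.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

def embedding_similarity(source: str, translation: str) -> float:
    """Cosine similarity between the source text and its translation."""
    embeddings = model.encode([source, translation], normalize_embeddings=True)
    return float(util.cos_sim(embeddings[0], embeddings[1]))

score = embedding_similarity(
    "Point to the picture of the dog.",
    "Señala la imagen del perro.",
)
# Low scores flag items for human review; high scores are not proof of quality.
if score < 0.75:  # illustrative threshold
    print(f"Flag for review (similarity={score:.2f})")
```

Because the embedding model is multilingual, the same function works for any language pair it supports; no intermediate English version is required.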
High similarity between embeddings is not proof that a translation is perfect, but low similarity is a strong signal that something important may have changed. This immediately gave us a way to triage hundreds of items and highlight those most likely to need human review. Unfortunately, the scores did not discriminate between translations as sharply as we had hoped, so we keep embeddings in our library of tests, but as a secondary signal.
Embeddings were a helpful first step, but we also wanted a model that could reason explicitly about meaning, register, and age‑appropriateness. This is where Gemini 2.5 Pro entered our toolset as another signal.
We currently prompt Gemini 2.5 Pro with the original text, the translation, and a short rubric that asks about adequacy, tone, and appropriateness for the target age group. The model returns a scalar score and an explanation, which we log alongside our other metrics.
This moves us from “how similar are these strings?” to “does this translation preserve meaning and tone for children in this context, according to a model that has read many languages?” It does not replace human reviewers, but it gives them a head start by ranking items and providing model‑generated comments that often point straight to the problem.
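As an illustration, a rubric-based check along these lines can be scripted with the google-genai Python SDK. The rubric wording, score scale, and JSON shape below are illustrative assumptions, not our exact prompt.

```python
# Minimal sketch of an LLM-as-judge check, assuming the google-genai SDK.
# The rubric wording, score scale, and JSON shape are illustrative only.
import json
from google import genai

client = genai.Client()  # reads the API key from the environment

RUBRIC = """You are reviewing a translation of a children's assessment item.
Rate adequacy, tone, and age-appropriateness for the target age group.
Return JSON: {"score": <0-100>, "explanation": "<one or two sentences>"}."""

def judge_translation(source: str, translation: str, target_language: str) -> dict:
    prompt = (
        f"{RUBRIC}\n\n"
        f"Original (English): {source}\n"
        f"Translation ({target_language}): {translation}"
    )
    response = client.models.generate_content(
        model="gemini-2.5-pro",
        contents=prompt,
        config={"response_mime_type": "application/json"},
    )
    return json.loads(response.text)

result = judge_translation(
    "Point to the picture of the dog.", "Señala la imagen del perro.", "Spanish"
)
# result["score"] and result["explanation"] are logged alongside other metrics.
```

Logging the explanation alongside the score matters as much as the score itself: reviewers can see at a glance why an item was flagged before they open it.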
Models continue to evolve, and we plan to evolve with them; we will certainly adopt new versions and new models in the future, but this is where we stand today.
Research on round‑trip translation warns against treating backtranslation as a stand‑alone quality metric, especially when it is evaluated with string‑based scores like BLEU (a score that compares a machine translation against human reference translations by counting overlapping word sequences). We take that caution seriously. In our system, backtranslation is one signal among several: it gives us a consistent, automated way to get back into English so that we can apply both lexical and semantic checks, but we never rely on it alone to declare a translation “good” or “bad.”
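Put together, the triage logic amounts to something like the following sketch, which reuses the embedding_similarity() and judge_translation() sketches above and assumes a hypothetical backtranslate() helper; the thresholds and the decision rule are illustrative only.

```python
# Illustrative fusion of signals into a triage decision; thresholds and the
# decision rule are examples, not LEVANTE's production values.
# Assumes the embedding_similarity() and judge_translation() sketches above,
# plus a hypothetical backtranslate() helper that round-trips text to English.
def triage(source: str, translation: str, target_language: str) -> dict:
    backtranslated = backtranslate(translation, target_language)  # hypothetical helper
    signals = {
        "embedding_similarity": embedding_similarity(source, translation),
        "backtranslation_similarity": embedding_similarity(source, backtranslated),
        "llm_score": judge_translation(source, translation, target_language)["score"],
    }
    # No single signal declares a translation "good" or "bad"; a low score on
    # any of them simply pushes the item toward the top of the review queue.
    signals["needs_review"] = (
        signals["embedding_similarity"] < 0.75
        or signals["backtranslation_similarity"] < 0.75
        or signals["llm_score"] < 70
    )
    return signals
```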
The current system is far from finished. As we expand into more languages, domains, and modalities (including audio and interactive tasks), we expect to keep iterating on both the metrics and the fusion strategies we use.
Future directions include:
What has changed most over time for us is not a single algorithm, but our mindset: translation validation is no longer a one‑shot step at the end of task development. It is an ongoing, model‑assisted collaboration between translators, researchers, and site partners.