How we validate translations

When we set out to build a shared battery of tasks for children across countries, translation quickly emerged as one of the hardest technical problems. If a task does not “speak every child’s language” in a way that is natural and culturally grounded, comparisons across sites start to break down.

Since the start of the project, our team has worked with a series of approaches to evaluating translation quality, ranging from classic backtranslation and manual review to embedding-based similarity and large language models acting as semantic judges. This post is both a retrospective on that journey and an update on what we are doing currently.

Starting with traditional review and backtranslation

Our first attempts at validating translations looked like many large assessment projects: translators created forward translations, bilingual researchers and site partners reviewed them, and we used backtranslation to catch obvious shifts in meaning.

Backtranslation was appealing because it gave non‑bilingual team members something to look at: the translated item came back into English, and we could compare that version to the original in order to spot issues. This approach mirrors methods that have been used for survey and clinical instrument translation for decades.
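As an illustration of the round-trip idea (a minimal sketch, not our production pipeline), the backtranslated English can be compared to the original with a simple lexical similarity ratio; the threshold value here is arbitrary:

```python
from difflib import SequenceMatcher

def backtranslation_flag(original: str, backtranslated: str,
                         threshold: float = 0.6) -> bool:
    """Flag a translation for human review when the round-trip English
    differs too much from the original (purely lexical comparison)."""
    ratio = SequenceMatcher(None, original.lower(),
                            backtranslated.lower()).ratio()
    return ratio < threshold  # True -> send to a reviewer

# A faithful round trip scores high and is not flagged:
print(backtranslation_flag("Point to the red ball.",
                           "Point to the red ball."))  # False
# A drifted round trip scores low and is flagged:
print(backtranslation_flag("Point to the red ball.",
                           "Indicate the crimson toy."))
```

As the next section notes, this kind of string-level comparison is exactly what breaks down on idioms and culturally adapted wording, which is why it is only one signal.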

As we have grown, two problems surfaced:

  • It is hard to scale careful human review across hundreds of items and many languages.
  • Round‑trip translation often fails in exactly the cases we care about most: idioms, cultural references, and age‑appropriate wording, which simple lexical comparisons do not capture well.

We worked to find a system that could triage translations automatically, focusing human review on the items most in need of attention.

The promise of embeddings

Sentence embeddings use large neural networks to map text into high‑dimensional vectors so that semantically similar sentences end up close together, even when they differ on the surface. With a multilingual embedding model, we can compare the embedding of the original text directly with that of the translated text, no backtranslation required. In many cases, comparing embeddings has proven more effective than performing a backtranslation and comparing the results.

So we augment our other validation scores with a metric that encodes both the source and translated text into an embedding and computes a cosine similarity between them.
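The similarity computation itself is straightforward. Here is a minimal sketch using toy vectors in place of real multilingual sentence embeddings (which in practice would come from an embedding model):

```python
import math

def cosine_similarity(u: list[float], v: list[float]) -> float:
    """Cosine similarity between two embedding vectors: the dot product
    divided by the product of the vector norms, in [-1, 1]."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy vectors standing in for source and translation embeddings:
source_vec = [0.2, 0.7, 0.1]
translation_vec = [0.19, 0.72, 0.08]
print(round(cosine_similarity(source_vec, translation_vec), 3))  # 0.999
```

A score near 1 means the two texts land in nearly the same region of the embedding space; a markedly lower score is the trigger for closer inspection.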

High similarity between embeddings is not proof that a translation is perfect, but low similarity is a strong signal that something important may have changed. This immediately gave us a way to triage hundreds of items and highlight those most likely to need human review.

Unfortunately, the results didn't discriminate between translations as sharply as we'd hoped. As a result, embeddings remain in our library of tests, but as a secondary signal.

Using Gemini to compare meaning, not just strings

Embeddings were a helpful first step, but we also wanted a model that could reason explicitly about meaning, register, and age‑appropriateness. This is where Gemini 2.5 Pro entered our toolset as another signal.

We currently prompt Gemini 2.5 Pro with the original text, the translation, and a short rubric that asks about adequacy, tone, and appropriateness for the target age group. The model returns a scalar score and an explanation, which we log alongside our other metrics.
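A sketch of how such a judge can be wired up is below. The rubric text, JSON reply shape, and function names are illustrative, not our exact production prompt, and the actual API call to the model is omitted:

```python
import json

RUBRIC = """Rate this translation of a children's assessment item.
Consider: adequacy (meaning preserved), tone, and age-appropriateness.
Reply as JSON: {"score": <1-5>, "explanation": "<one sentence>"}."""

def build_judge_prompt(source: str, translation: str, target_age: str) -> str:
    """Assemble the prompt sent to the LLM judge."""
    return (f"{RUBRIC}\n\nTarget age group: {target_age}\n"
            f"Original (English): {source}\nTranslation: {translation}")

def parse_judge_reply(reply: str) -> tuple[float, str]:
    """Parse the model's JSON reply into (score, explanation) for logging."""
    data = json.loads(reply)
    return float(data["score"]), data["explanation"]

# Example of a reply a model might return:
score, note = parse_judge_reply(
    '{"score": 4, "explanation": "Meaning preserved; register slightly formal."}')
print(score)  # 4.0
```

Logging both the scalar score and the free-text explanation is what lets reviewers jump straight to the flagged issue rather than rereading the whole item.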

This moves us from “how similar are these strings?” to “does this translation preserve meaning and tone for children in this context, according to a model that has read many languages?” It does not replace human reviewers, but it gives them a head start by ranking items and providing model‑generated comments that often point straight to the problem.

Models continue to evolve and we plan to evolve with them, so we will certainly use new versions and new models in the future, but this is where we are currently.

Why we use multiple tools for validation

Research on round‑trip translation warns against treating backtranslation as a stand‑alone quality metric, especially when evaluated with string‑based scores like BLEU (a metric that scores a machine translation by its n‑gram overlap with human reference translations). We take that caution seriously. In our system, backtranslation is one signal among several: it gives us a consistent, automated way to get back into English so that we can apply both lexical and semantic checks, but we never rely on it alone to declare a translation “good” or “bad.”
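One simple way to combine signals, sketched here with made-up signal names and threshold values, is to flag an item whenever any individual check falls below its cutoff, so that no single metric ever clears a translation on its own:

```python
def needs_review(signals: dict[str, float],
                 thresholds: dict[str, float]) -> bool:
    """Flag an item for human review when ANY validation signal
    falls below its threshold; passing one check is never enough."""
    return any(signals[name] < cutoff for name, cutoff in thresholds.items())

# Hypothetical scores for one translated item (all scaled to 0-1):
signals = {"embedding_cosine": 0.91, "llm_judge": 0.80,
           "backtranslation_lexical": 0.55}
thresholds = {"embedding_cosine": 0.85, "llm_judge": 0.60,
              "backtranslation_lexical": 0.60}
print(needs_review(signals, thresholds))  # True: the lexical signal is low
```

The conservative "any signal fails" rule matches the triage framing: false positives cost a reviewer a few minutes, while false negatives let a flawed item into the field.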

Where we are heading next

The current system is far from finished. As we expand into more languages, domains, and modalities (including audio and interactive tasks), we expect to keep iterating on both the metrics and the fusion strategies we use.

Future directions include:

  • Calibrating scores more directly to human ratings of item difficulty and clarity, not just textual similarity.
  • Exploring task‑specific rubrics where Gemini and other models judge translations with awareness of the underlying cognitive construct, not just surface meaning.
  • Extending the same ideas to audio prompts and child‑directed speech, where prosody and timing matter as much as wording.

What has changed most over time for us is not a single algorithm, but our mindset: translation validation is no longer a one‑shot step at the end of task development. It is an ongoing, model‑assisted collaboration between translators, researchers, and site partners.