How we validate translations

Starting with traditional review and backtranslation

  • It is hard to scale careful human review across hundreds of items and many languages.
  • Round‑trip translation often fails in exactly the cases we care about most: idioms, cultural references, and age‑appropriate wording, none of which simple lexical comparisons capture well.
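
To make the second point concrete, here is a minimal sketch of a lexical round-trip check. It assumes token-level Jaccard overlap as the comparison, which is an illustrative stand-in and not necessarily the metric we used. The example sentences and the names `tokens` and `jaccard` are hypothetical.

```python
import re


def tokens(text):
    """Lowercase the text and return its set of word tokens."""
    return set(re.findall(r"[a-z']+", text.lower()))


def jaccard(a, b):
    """Token-set overlap between two strings, from 0.0 to 1.0."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


# Hypothetical item text containing an idiom.
original = "It's raining cats and dogs."

# A word-for-word translation (confusing in the target language)
# tends to round-trip back to the same words.
literal_back = "It's raining cats and dogs."

# A translation that conveys the meaning naturally tends to
# round-trip back to different words.
faithful_back = "It's pouring outside."

print(jaccard(original, literal_back))   # 1.0: the poor translation looks perfect
print(jaccard(original, faithful_back))  # low: the good translation looks wrong
```

Under these assumptions the lexical score rewards the literal translation and penalizes the faithful one, which is the opposite of what a reviewer would want.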

The promise of embeddings

Using Gemini to compare meaning, not just strings

Why we use multiple tools for validation

Where we are heading next

  • Calibrating scores more directly to human ratings of item difficulty and clarity, not just textual similarity.
  • Exploring task‑specific rubrics where Gemini and other models judge translations with awareness of the underlying cognitive construct, not just surface meaning.
  • Extending the same ideas to audio prompts and child‑directed speech, where prosody and timing matter as much as wording.