Tag: multilingual study

  • Unexpected Language Outperforms English and Chinese in LLM Study

    Unexpected Language Outperforms English and Chinese in LLM Study

    Key Takeaways

    1. Polish language outperforms English and Chinese in accuracy for long-context tasks, achieving 88% accuracy at 64,000 tokens or more.
    2. Model performance varies significantly with context length, highlighting a drop in English’s ranking as sequence lengths increase.
    3. Language structure, particularly tokenization efficiency and writing systems, plays a crucial role in performance differences among languages.
    4. The performance gap between the best and worst languages widens as context length increases, from 11% at 8,000 tokens to 34% at 128,000 tokens.
    5. Evaluating long-context models solely based on English can be misleading, as linguistic differences become more significant with longer sequences.


    A recent multilingual study examining how large language models manage long documents has revealed an unexpected finding: Polish outperforms both English and Chinese in accuracy when context windows extend to 64,000 tokens or more. This information comes from the OneRuler benchmark introduced in a paper at COLM 2025, which assessed 26 languages in retrieval and aggregation tasks.

    Shifting Accuracy with Length

    The researchers analyzed model accuracy across various context lengths and discovered a significant change as the sequences grew longer. The results indicated that Polish achieved the highest average accuracy of 88% in long-context scenarios, while English fell to sixth place, and Chinese ranked among the lowest four languages.

    Language Structure Matters

    The study suggests that the differences in performance might be linked to tokenization efficiency and variations in writing systems, rather than just the amount of training data. Languages that use Latin-based scripts, like Polish, French, and Spanish, consistently outperformed those with logographic or abugida systems. For instance, languages such as Chinese, Korean, and Tamil showed only moderate accuracy even in shorter contexts, and their performance worsened as the sequences lengthened. This complete turnaround in expected rankings is intriguing, especially since most widely used LLMs are predominantly trained on datasets rich in English. However, the paper’s findings indicate that when models need to search, recall, or summarize information buried deep in lengthy documents, the structural properties of the language take precedence over the frequency of dataset representation.

    Additional Insights from the Benchmark

    Other insights from the benchmark reinforce this perspective. The performance gap between the best and worst-performing languages widens significantly as context increases—from 11% at 8,000 tokens to 34% at 128,000 tokens. Another noteworthy detail from the study highlights the sensitivity of these evaluations to minor changes in instructions. For instance, allowing the model to respond with “none” if a target string is missing led to a 32% drop in accuracy for English at 128,000 tokens, as seen on page 2.

    While the benchmark also evaluates different model families, the results indicate that long-context assessments cannot rely exclusively on English testing. Performance generalizations across languages may be misleading if the effects of script and tokenization are overlooked. As context windows expand, linguistic differences become increasingly significant, challenging the notion that English’s dominance in LLM benchmarks remains valid when sequence lengths reach tens of thousands.

     

    Source:
    Link