
OpenAI Introduces IndQA: A Culture Aware Benchmark For Indian Languages


OpenAI has unveiled IndQA, a benchmark designed to evaluate the understanding and reasoning of large language models in the context of Indian languages and culture. This initiative addresses an essential question: how can we reliably assess AI’s grasp of the linguistic and cultural nuances that define everyday life in India?

Why IndQA?

Globally, around 80 percent of the population does not speak English as their primary language. Despite this, existing benchmarks for non-English capabilities often rely on simple translation or multiple-choice formats. Widely used benchmarks such as MMMLU and MGSM have also saturated at the top end: many strong models now achieve near-identical scores, which makes it hard to measure meaningful progress and reveals little about how well models handle local context and cultural understanding.

Dataset, Languages And Domains

IndQA consists of 2,278 questions across 12 languages, tailored to assess cultural and everyday knowledge relevant to India. The languages evaluated include Bengali, Gujarati, Hindi, Hinglish, Kannada, Malayalam, Marathi, Odia, Punjabi, Tamil, and Telugu. The benchmark covers 10 cultural domains:

Architecture and Design, Arts and Culture, Everyday Life, Food and Cuisine, History, Law and Ethics, Literature and Linguistics, Media and Entertainment, Religion and Spirituality, and Sports and Recreation.

Each question is accompanied by four components:

  • A culturally grounded prompt in an Indian language
  • An English translation for auditability
  • Rubric criteria for grading
  • An ideal answer that encapsulates expert expectations
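
To make this structure concrete, here is a minimal sketch of how one such item might be represented in code. The field names and example rubric are hypothetical, not OpenAI's published schema:

```python
# A minimal sketch of one IndQA item with its four components.
# Field names are hypothetical; the actual schema is not published here.
from dataclasses import dataclass

@dataclass
class IndQAItem:
    prompt: str               # culturally grounded question in an Indian language
    prompt_en: str            # English translation for auditability
    rubric: dict[str, float]  # grading criteria mapped to expert-assigned weights
    ideal_answer: str         # expert-written reference answer

item = IndQAItem(
    prompt="...",     # e.g. a Hindi question about a regional tradition
    prompt_en="...",
    rubric={
        "names the tradition correctly": 2.0,
        "explains its regional context": 1.0,
    },
    ideal_answer="...",
)
```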

Rubric Based Evaluation Pipeline

IndQA employs a rubric-based grading approach rather than relying solely on exact-match accuracy. For each question, domain experts define multiple criteria describing what a strong answer should contain, each with an assigned weight. A model-based grader then evaluates responses against these criteria, allowing partial credit and capturing cultural nuance.
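
As an illustration of how weighted, partial-credit scoring can work, here is a short sketch. The `Criterion` type, the example criteria, and the grader interface are assumptions for illustration, not OpenAI's actual implementation:

```python
# Hypothetical sketch of weighted, rubric-based scoring with partial credit.
from dataclasses import dataclass

@dataclass
class Criterion:
    description: str  # what a strong answer should contain
    weight: float     # importance assigned by the domain expert

def rubric_score(criteria: list[Criterion], met: list[bool]) -> float:
    """Weighted fraction of rubric criteria satisfied, in [0.0, 1.0]."""
    total = sum(c.weight for c in criteria)
    earned = sum(c.weight for c, ok in zip(criteria, met) if ok)
    return earned / total if total else 0.0

criteria = [
    Criterion("Names the correct regional dish", 2.0),
    Criterion("Explains its festival context", 1.0),
    Criterion("Answers fluently in the prompt language", 1.0),
]
# Suppose a model-based grader judged the first two criteria as met:
print(rubric_score(criteria, [True, True, False]))  # 0.75
```

Weighting lets experts mark some criteria as essential and others as secondary, so a response that nails the core fact but misses a detail still earns partial credit.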

Construction Process And Adversarial Filtering

The construction process for the IndQA benchmark followed a four-step pipeline:

  • Collaboration with Indian organizations to recruit native-level domain experts, who authored culturally relevant prompts.
  • Adversarial filtering, in which draft questions were run against OpenAI's strongest models at the time (GPT-4o, OpenAI o3, GPT-4.5, and later GPT-5). Only questions these models answered poorly were retained, preserving headroom to measure future progress (a sketch follows this list).
  • Expert-defined grading criteria for each question, which are reused when scoring other models on IndQA.
  • Ideal answers and translations crafted by experts, which underwent peer review and iterative revision to ensure quality.
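
For intuition, the adversarial filtering step can be sketched as below. The model names are stand-ins and `ask_model` and `grade` are stubbed helpers, so this illustrates the idea rather than OpenAI's actual pipeline:

```python
# Illustrative sketch of adversarial filtering: keep only drafts that
# frontier models answer poorly. All helpers here are stubs.
import random

FRONTIER_MODELS = ["model-a", "model-b", "model-c"]  # stand-in names
KEEP_THRESHOLD = 0.5  # illustrative cutoff for a "sub-par" response

def ask_model(model: str, question: str) -> str:
    """Stub standing in for a real model call."""
    return f"{model}'s answer to: {question}"

def grade(answer: str, rubric: dict[str, float]) -> float:
    """Stub rubric grader returning a score in [0.0, 1.0]."""
    return random.random()

def is_hard_enough(question: str, rubric: dict[str, float]) -> bool:
    """Retain a draft only if every frontier model scores below threshold."""
    scores = [grade(ask_model(m, question), rubric) for m in FRONTIER_MODELS]
    return max(scores) < KEEP_THRESHOLD
```

Filtering against the strongest available models ensures the benchmark starts well below ceiling, so score gains on IndQA reflect genuine capability improvements rather than saturation noise.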

Measuring Progress On Indian Languages

IndQA serves as a platform to evaluate frontier models and to track progress on Indian languages over time. According to OpenAI, performance on IndQA has improved significantly across recent model generations, but substantial headroom remains. Results are stratified by language and domain, enabling comparisons across frontier systems.

Key Takeaways

  • IndQA is a culturally grounded Indic benchmark that focuses on how AI models understand and reason about culturally significant questions in Indian languages.
  • The dataset, developed collaboratively with 261 domain experts, covers various aspects of Indian culture and consists of 2,278 well-structured questions across 12 languages.
  • Evaluation is rubric based, allowing nuanced grading that rewards cultural correctness rather than simple token overlap.
  • The questions have been adversarially filtered to ensure that they present a challenge for even the most advanced AI models.

Conclusion

IndQA is a meaningful step toward closing the gaps in existing multilingual benchmarks, especially for a country as linguistically and culturally diverse as India. By combining expert-authored questions, adversarial filtering, and rubric-based grading, IndQA offers a robust framework for assessing language reasoning capabilities in AI systems.
