OpenAI launches IndQA, a significant new benchmark for evaluating AI performance in Indian languages and culture

SUMMARY
OpenAI, one of the leaders in AI research, has released a new benchmark called IndQA, which measures how well AI models handle questions rooted in Indian languages and culture. The project responds directly to the need to ensure that large language models (LLMs) and other AI systems are not only proficient in English but can also navigate the linguistic and cultural diversity of one of the world's most diverse regions. IndQA is an important step toward more inclusive, human-centered AI that can genuinely benefit the global population.
A carefully crafted benchmark and initial results
IndQA comprises 2,278 questions across 12 Indian languages. The questions span 10 cultural domains, including literature, food, history, law, and sports, ensuring they are both culturally relevant and substantively complex.
The benchmark was developed in collaboration with 261 Indian professionals. This deep engagement of local experts ensures the questions go beyond rote memorization of facts or basic translation exercises.
What sets IndQA apart is its focus on complex reasoning and contextual understanding within these culturally grounded domains. Unlike many earlier benchmarks, which rely on simple multiple-choice or translation tests, IndQA is designed to assess how deeply an AI model actually comprehends the material.
OpenAI has stated that the questions were deliberately chosen to be difficult even for today's state-of-the-art AI models, including its own flagship models such as GPT-4o and GPT-5. Selection during development was strict: only questions that the strongest AI models could not answer correctly made it into the final benchmark. This approach ensures that IndQA reflects genuine gaps in non-English language comprehension and sets an ambitious target for future AI development.
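Conceptually, this selection process amounts to adversarial filtering: candidate questions are posed to the strongest available models, and only those the models fail on survive. The short Python sketch below illustrates the idea; the model list, the simulated answers, and the keyword-matching grader are illustrative assumptions, not OpenAI's published pipeline.

```python
# A minimal, self-contained sketch of the adversarial filtering step
# described above. The model list, simulated answers, and keyword grading
# stub are illustrative assumptions, not OpenAI's actual pipeline.

FRONTIER_MODELS = ["gpt-4o", "gpt-5"]

# Toy candidate pool: each entry pairs a question with the key points a
# correct answer must cover (standing in for the expert-written rubric).
candidates = [
    {"question": "Q1", "rubric_keywords": ["fact_a", "fact_b"]},
    {"question": "Q2", "rubric_keywords": ["fact_c"]},
]

def ask_model(model: str, question: str) -> str:
    """Stand-in for a real API call; returns a canned answer per question."""
    simulated = {"Q1": "covers fact_a and fact_b", "Q2": "off-topic reply"}
    return simulated.get(question, "")

def is_correct(answer: str, rubric_keywords: list[str]) -> bool:
    """Crude grading stub: does the answer cover every rubric point?"""
    return all(kw in answer for kw in rubric_keywords)

# Keep only the questions that *no* frontier model answers correctly.
benchmark = [
    q for q in candidates
    if not any(
        is_correct(ask_model(m, q["question"]), q["rubric_keywords"])
        for m in FRONTIER_MODELS
    )
]
print([q["question"] for q in benchmark])  # -> ['Q2'] in this toy example
```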
To keep evaluation consistent and high-quality, every question in IndQA comes with two key elements: a grading rubric and an example of an ideal answer. Both are anchored in the judgment of the Indian experts who built the dataset, which makes objective, measurable performance evaluation possible.
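To make this rubric-and-ideal-answer structure concrete, here is a hedged sketch of what a single IndQA-style record and a weighted rubric grader might look like. The field names, weights, and the criterion_met() stub are assumptions for illustration; OpenAI has not published this exact schema.

```python
# A hedged sketch of a single IndQA-style record and a weighted rubric
# grader. Field names, weights, and the judging stub are assumptions for
# illustration; OpenAI's actual format may differ.

record = {
    "language": "Hindi",
    "domain": "Literature",
    "question": "…",          # the culturally grounded prompt
    "ideal_answer": "…",      # expert-written example of a perfect answer
    "rubric": [               # criteria a grader checks one by one
        {"criterion": "names the correct author and period", "weight": 2},
        {"criterion": "explains the work's cultural context", "weight": 1},
    ],
}

def criterion_met(answer: str, criterion: str) -> bool:
    """Stub judge: a real pipeline would use expert review or an LLM grader.
    Here, a naive substring check serves purely as a placeholder."""
    return criterion.lower() in answer.lower()

def grade(model_answer: str, record: dict) -> float:
    """Return the fraction of total rubric weight the answer earns."""
    total = sum(item["weight"] for item in record["rubric"])
    earned = sum(
        item["weight"]
        for item in record["rubric"]
        if criterion_met(model_answer, item["criterion"])
    )
    return earned / total

# A score of 1.0 would mean the answer satisfies every rubric criterion.
print(f"score: {grade('…model answer…', record):.2f}")
```

Weighting rubric items this way lets partial credit reflect how much of the expert-defined substance an answer captures, rather than forcing a binary right/wrong verdict.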
Initial results on IndQA underscore the scale of the challenge AI models face in this area. Even the best models currently available scored below 40% on the benchmark. The finding makes clear that a substantial gap remains between today's AI systems and robust, culturally competent performance, especially in non-English, low-resource linguistic settings.
Dual purpose and public accessibility
OpenAI has been explicit about IndQA's dual purpose. First, the benchmark is intended as a long-term instrument for tracking AI progress: as models grow more capable and complex, IndQA will provide a standardized, culturally grounded yardstick against which advances can be measured.
Second, the project is meant to serve as a template for comparable benchmarks in other low-resource languages around the world. By modeling culturally grounded, complex benchmark construction in the Indian context, OpenAI hopes the approach can be replicated for testing AI in other languages and cultural settings where training and evaluation data are scarce.
IndQA is also meant to be publicly accessible. Whether it becomes an industry standard for measuring culturally aware AI proficiency will depend on the publication of the dataset, grading rubrics, and evaluation code under clear licensing, as the AI community anticipates.
The poor near-term performance of top models on IndQA also points to likely growth in demand for India-based data vendors. Such vendors supply the culturally grounded training data, human annotation, and Reinforcement Learning from Human Feedback (RLHF) services needed to raise AI scores across the benchmark's 12 languages and 10 cultural domains.
Conclusion
OpenAI's introduction of the IndQA benchmark marks a significant investment in culturally inclusive AI. With 2,278 expert-vetted questions spanning 12 Indian languages and 10 cultural domains, and with even the latest models struggling to score above 40%, the benchmark's emphasis on reasoning, context, and cultural sensitivity will serve as an essential reference point for monitoring AI progress and confronting the remaining obstacles to non-English language comprehension. IndQA is expected not only to raise AI performance for the Indian market but also to provide a crucial template for building culturally sensitive evaluation instruments for other low-resource languages worldwide.