Researchers at the Indian Institute of Technology-Guwahati (IIT-G) have introduced a novel multilingual and scalable approach to identify and rectify Surface Name Errors (SNEs) on Wikipedia, thereby enhancing information reliability for both human users and AI systems. The method was highlighted at the India AI Impact Summit 2026. Wikipedia, a collaborative online encyclopedia, sometimes contains SNEs, which are inaccuracies in the surface names—text used to reference or link entities in article content. A study from the IIT-Guwahati team revealed that 3-6% of entity mentions are prone to such errors, which can undermine credibility and affect machine learning performance, given the reliance on Wikipedia as a key dataset.
To tackle this issue, Prof. Amit Awekar and M.Tech student Anuj Khare developed a mathematical frequency pattern-based method that operates in three distinct steps. First, the process scans Wikipedia to create quadruplets that associate the linking page, the destination page, the surface name, and the surrounding context. Second, a surface name is deemed correct if it appears at least 10 times and constitutes at least 5% of links to a given page; otherwise, it is flagged as a potential error. The final step classifies these flagged errors into categories such as “typing mistakes” (e.g., “Gawahati” for “Guwahati”) or more complex “entity span errors,” which may involve extraneous or incorrect terms in links.
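The second step, the frequency test, can be illustrated with a minimal Python sketch. This is not the researchers' code: the function name `flag_surface_names`, the tuple layout, and the toy link counts are assumptions made for illustration. Only the two thresholds (at least 10 occurrences and at least 5% of links to a page) come from the description above.

```python
from collections import Counter, defaultdict

MIN_COUNT = 10    # threshold reported in the article
MIN_SHARE = 0.05  # threshold reported in the article (5% of links to a page)


def flag_surface_names(links, min_count=MIN_COUNT, min_share=MIN_SHARE):
    """Return {destination_page: set of surface names flagged as potential errors}.

    `links` is an iterable of hypothetical quadruplets:
    (linking_page, destination_page, surface_name, context).
    """
    # Count how often each surface name is used to link to each destination.
    counts = defaultdict(Counter)
    for _linking_page, destination, surface, _context in links:
        counts[destination][surface] += 1

    flagged = {}
    for destination, by_surface in counts.items():
        total = sum(by_surface.values())
        # A surface name is accepted only if it passes both thresholds.
        suspects = {
            surface
            for surface, n in by_surface.items()
            if n < min_count or n / total < min_share
        }
        if suspects:
            flagged[destination] = suspects
    return flagged


# Toy data with made-up counts: 40 correct links and 1 misspelled link.
links = [("Assam", "Guwahati", "Guwahati", "...")] * 40
links += [("Kamrup", "Guwahati", "Gawahati", "...")]

print(flag_surface_names(links))
```

With this toy input, only “Gawahati” is flagged, since it appears once and therefore fails the minimum-count test. Classifying a flagged name as a typing mistake or an entity span error is a separate step that this sketch does not attempt.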
The method has been tested in eight languages, including English, Sanskrit, and German, yielding accurate results. Prof. Awekar emphasized the need for caution regarding the accuracy of online data, asserting that high-quality data is crucial for effective AI model development and applications. The method was validated by comparing English Wikipedia snapshots from 2018 and 2022: 30% of the predicted errors had been corrected over that period, underscoring its effectiveness.
Moreover, over 99% of manual corrections recommended by the researchers have been accepted by the Wikipedia community, showcasing the method’s practical viability and its potential to assist volunteer editors in spotting typing and linking errors that may have long evaded notice. This initiative by the IIT Guwahati team represents a significant contribution to fortifying digital knowledge systems through scalable processing complemented by community validation.
