AI Capabilities May Be Overhyped on Bogus Benchmarks, Study Finds

You know all of those reports about artificial intelligence models successfully passing the bar or achieving Ph.D.-level intelligence? Looks like we should start taking those degrees back. A new study from researchers at the Oxford Internet Institute suggests that most of the popular benchmarks used to test AI performance are unreliable and misleading.

The researchers looked at 445 benchmarks used by industry and academic groups to test everything from reasoning capabilities to performance on coding tasks. Expert reviewers examined each benchmarking approach and found indications that the results these tests produce may not be as accurate as presented, due in part to vague definitions of what a benchmark is attempting to measure and a lack of disclosure of the statistical methods that would let scores from different models be meaningfully compared.
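The statistical gap is easy to picture: a benchmark score is just an average over a finite set of questions, so without an error bar two models a few points apart may be statistically indistinguishable. Below is a minimal sketch of the kind of uncertainty reporting the study says is often missing, using a bootstrap confidence interval; the per-question results and model names are invented for illustration.

```python
import random

def bootstrap_ci(scores, n_resamples=10_000, alpha=0.05):
    """Bootstrap a confidence interval for mean benchmark accuracy."""
    n = len(scores)
    means = []
    for _ in range(n_resamples):
        # Resample the per-question results with replacement.
        resample = [scores[random.randrange(n)] for _ in range(n)]
        means.append(sum(resample) / n)
    means.sort()
    low = means[int(n_resamples * alpha / 2)]
    high = means[int(n_resamples * (1 - alpha / 2)) - 1]
    return sum(scores) / n, low, high

# Invented per-question results (1 = correct, 0 = incorrect) for two models.
model_a = [1, 1, 0, 1, 1, 0, 1, 1, 1, 0] * 20  # 70% accuracy on 200 items
model_b = [1, 0, 0, 1, 1, 0, 1, 0, 1, 1] * 20  # 60% accuracy on 200 items

for name, scores in (("model_a", model_a), ("model_b", model_b)):
    mean, low, high = bootstrap_ci(scores)
    print(f"{name}: accuracy {mean:.2f}, 95% CI [{low:.2f}, {high:.2f}]")
```

In this toy example the two intervals nearly overlap despite a 10-point gap in headline accuracy, which illustrates why single leaderboard numbers reported without uncertainty can mislead.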

A big problem the researchers found is that “Many benchmarks are not valid measurements of their intended targets.” That is to say, a benchmark may claim to measure a specific skill while testing for it in a way that doesn’t actually capture a model’s underlying capability.

For example, the researchers point to the Grade School Math 8K (GSM8K) benchmarking test, which measures a model’s performance on grade school-level word-based math problems designed to push the model into “multi-step mathematical reasoning.” The GSM8K is advertised as being “useful for probing the informal reasoning ability of large language models.”

But the researchers argue that the test doesn’t necessarily tell you if a model is engaging in reasoning. “When you ask a first grader what two plus five equals and they say seven, yes, that’s the correct answer. But can you conclude from this that a fifth grader has mastered mathematical reasoning or arithmetic reasoning from just being able to add numbers? Perhaps, but I think the answer is very likely no,” Adam Mahdi, a senior research fellow at the Oxford Internet Institute and a lead author of the study, told NBC News.

In the study, the researchers pointed out that GSM8K scores have increased over time, which may point to models getting better at this kind of reasoning. But it may also point to contamination, which happens when benchmark test questions leak into a model’s training data and the model starts “memorizing” answers rather than reasoning its way to a solution. When researchers tested the same models on a new set of comparable benchmark questions, they noticed “significant performance drops.”
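Contamination is typically hunted for mechanically, for instance by checking whether long word sequences from benchmark questions appear verbatim in the training text. The sketch below shows the general idea with a crude word-level n-gram check; the questions and corpus are invented for illustration, and real audits run over far larger data with more robust matching.

```python
def ngrams(text, n=8):
    """Word-level n-grams, lowercased, for crude overlap matching."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def contamination_rate(questions, training_text, n=8):
    """Fraction of benchmark questions sharing any n-gram with the corpus."""
    corpus_grams = ngrams(training_text, n)
    hits = sum(1 for q in questions if ngrams(q, n) & corpus_grams)
    return hits / len(questions)

# Invented examples: the first question appears verbatim in the "corpus".
questions = [
    "A baker makes 24 muffins sells half in the morning and 5 more in the afternoon",
    "A train travels 60 miles per hour for 3 hours how far does it travel",
]
corpus = ("... A baker makes 24 muffins sells half in the morning and 5 more "
          "in the afternoon ...")

print(f"flagged as contaminated: {contamination_rate(questions, corpus):.0%}")
```

A model that has simply seen the flagged question during training can reproduce the answer without doing any of the multi-step reasoning the benchmark is supposed to measure, which is why scores collapse on fresh questions.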

While this study is among the largest reviews of AI benchmarking, it’s not the first to suggest this system of measurement may not be all that it’s sold to be. Last year, researchers at Stanford analyzed several popular AI model benchmark tests and found “large quality differences between them, including those widely relied on by developers and policymakers,” and noted that most benchmarks “are highest quality at the design stage and lowest quality at the implementation stage.”

If nothing else, the research is a good reminder that these performance measures, while often well-intended and meant to provide an accurate analysis of a model, can be turned into little more than marketing speak for companies.

Original Source: https://gizmodo.com/ai-capabilities-may-be-overhyped-on-bogus-benchmarks-study-finds-2000682577

