Tag: GSM-Symbolic

Earlier this month, a group of six AI experts supported by Apple released a study introducing GSM-Symbolic, a new benchmark for AI that "allows for more controllable assessments, giving important insights and more dependable metrics for evaluating the reasoning abilities of models." The news is not encouraging: initial GSM-Symbolic tests on AI systems from major companies such as Meta and OpenAI suggest that large language models (LLMs) still face significant limitations and lack even the most fundamental reasoning skills.

Issues with Current Models

The research identified a major weakness in current models: they answer inconsistently when given similar questions. Minor changes in wording, which would not change the meaning for a human, often produce different responses from AI systems. No model stood out as performing notably well.

The report stated, "In particular, the effectiveness of all models drops [even] when just the numerical values in the question are modified in the GSM-Symbolic benchmark." It also found that "the weakness of mathematical reasoning in these models [shows] that their performance worsens significantly as the number of clauses in a question goes up."
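To make the idea of changing only the numerical values concrete, here is a minimal Python sketch of a symbolic template that is filled in with different names and numbers while the underlying arithmetic stays the same. The template text, the value ranges, and the function names (make_variant, accuracy, reference_solver) are all hypothetical and are not taken from the study. The "solver" is a plain arithmetic stand-in for a model, included only so the example runs.

```python
import random
import re

# Hypothetical template; the wording and ranges are illustrative only.
TEMPLATE = (
    "{name} buys {a} apples and then {b} more. "
    "{name} gives away {c}. How many apples are left?"
)

def make_variant(rng):
    """Fill the template with fresh values and return (question, true answer)."""
    name = rng.choice(["Sophie", "Liam", "Ava"])
    a = rng.randint(5, 50)
    b = rng.randint(5, 50)
    c = rng.randint(1, a + b)  # never give away more than was bought
    return TEMPLATE.format(name=name, a=a, b=b, c=c), a + b - c

def accuracy(answer_fn, variants):
    """Fraction of variants for which answer_fn returns the true answer."""
    correct = sum(1 for question, truth in variants if answer_fn(question) == truth)
    return correct / len(variants)

def reference_solver(question):
    """Stand-in for a model: extracts the three numbers and does the arithmetic."""
    a, b, c = (int(tok) for tok in re.findall(r"\d+", question))
    return a + b - c

rng = random.Random(0)
variants = [make_variant(rng) for _ in range(20)]
print(accuracy(reference_solver, variants))
```

A system that truly reasons about the problem should score the same on every set of variants, as the stand-in solver does here. The study reports that the models it tested did not.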

Study Details

The 22-page study is available as a PDF file. Its final two pages include problems with irrelevant details added at the end, details that would not change the answer for a human. Yet the AI systems took these details into account, which led to incorrect answers.
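The effect of an irrelevant detail can be sketched in a few lines of Python. The problem text, the added clause, and both function names below are hypothetical, not copied from the study, and naive_solver is a deliberately crude toy heuristic rather than a description of how any real model works. It simply treats every number it sees as part of the calculation.

```python
import re

BASE = "Maya picks 12 pears on Monday and 30 pears on Tuesday. How many pears does Maya have?"
NOOP = "Of those, 4 are slightly smaller than average."  # irrelevant to the count

def add_noop_clause(question, clause):
    """Insert an irrelevant clause just before the final question sentence."""
    body, _, final = question.rpartition(". ")
    return f"{body}. {clause} {final}"

def naive_solver(question):
    """Toy heuristic: add the first two numbers, subtract any further ones."""
    nums = [int(tok) for tok in re.findall(r"\d+", question)]
    return nums[0] + nums[1] - sum(nums[2:])

perturbed = add_noop_clause(BASE, NOOP)
print(naive_solver(BASE))       # 42, the correct answer
print(naive_solver(perturbed))  # 38, although the correct answer is still 42
```

The size of the pears has no bearing on how many there are, so the correct answer is 42 in both cases. The heuristic nonetheless gets the second version wrong because it uses a number it should have ignored.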

In conclusion, AI systems remain trapped in pattern recognition and still do not possess general problem-solving skills.

Several LLMs were introduced this year, including Meta AI’s Llama 3.1, Nvidia’s Nemotron-4, Anthropic’s Claude 3, the Fugaku-LLM from Japan (the largest model ever trained solely on CPU power), and Nova by Rubik’s AI, which was launched earlier this month.

Upcoming Publication

Tomorrow, O’Reilly will publish the first edition of "Hands-On Large Language Models: Language Understanding and Generation" by Jay Alammar and Maarten Grootendorst. It is priced at $48.99 for the Kindle edition and $59.13 for the paperback version.

Humans Outperform AI, Says Apple-Funded Study
