A recent study led by Tuhin Chakrabarty, an assistant professor in Stony Brook's Department of Computer Science, in collaboration with researchers from Columbia University, has revealed the limits of AI models when faced with abstract reasoning challenges. The research focused on the New York Times word game 'Connections,' which presents a unique benchmark for testing Large Language Models (LLMs).
Despite the prowess of AI and machine learning in defeating top chess players, the study found that even the best-performing LLM tested, Claude 3.5 Sonnet, could fully solve only 18% of 'Connections' games. The finding is based on an analysis of more than 400 games, in which both novice and expert human players outperformed the AI models.
In 'Connections,' players must sort a 4x4 grid of 16 words into four groups of four based on shared characteristics. For instance, words like 'Followers,' 'Sheep,' 'Puppets,' and 'Lemmings' can be grouped as 'Conformists.' Success at this task requires reasoning across various forms of knowledge, including semantic and encyclopedic understanding.
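To make the grouping task concrete, the following is a minimal Python sketch, not taken from the study, of how a puzzle and a candidate guess might be represented and checked; the puzzle contents and the check_group helper are illustrative assumptions.

# Illustrative sketch only: a Connections puzzle is 16 words that must be
# split into four groups of four, each sharing a theme. The puzzle below
# and the helper name are hypothetical, not the researchers' code.
PUZZLE = {
    "Conformists": {"followers", "sheep", "puppets", "lemmings"},
    # ... three more categories of four words each would complete the grid
}

def check_group(guess, puzzle):
    """Return the matching category name if the four guessed words form
    one of the puzzle's groups exactly, otherwise None."""
    guess_set = {word.lower() for word in guess}
    if len(guess_set) != 4:
        return None
    for category, words in puzzle.items():
        if guess_set == words:
            return category
    return None

# Example: a correct guess for the category shown above.
print(check_group(["Followers", "Sheep", "Puppets", "Lemmings"], PUZZLE))
# -> "Conformists"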
Chakrabarty explained, "While the task might seem easy to some, many of these words can be easily grouped into several other categories." He noted that such alternative groupings serve as red herrings, designed to add complexity to the game.
The research highlighted that LLMs show relative strength in tasks involving semantic relations but struggle with more complex knowledge types, such as multiword expressions and combined knowledge of word form and word meaning. Five LLMs were tested: Google's Gemini 1.5 Pro, Anthropic's Claude 3.5 Sonnet, OpenAI's GPT-4o, Meta's Llama 3.1 405B, and Mistral AI's Mistral Large 2. The results indicated that while these models could partially solve some puzzles, their overall performance fell well short of human players.
For further details on this study, readers are directed to visit the AI Innovation Institute website.