Study shows AI struggles with NYT Connections game despite advanced capabilities


A recent study led by Tuhin Chakrabarty, an assistant professor in Stony Brook's Department of Computer Science, in collaboration with researchers from Columbia University, examined how well AI models handle abstract reasoning challenges. The research focused on the New York Times word game 'Connections,' which serves as a benchmark for testing Large Language Models (LLMs).

Although AI and machine learning systems have defeated top chess players, the study found that even the most advanced LLM tested, Claude 3.5 Sonnet, could fully solve only 18% of 'Connections' games. The finding is based on an analysis of more than 400 games, in which both novice and expert human players outperformed the AI models.

In 'Connections,' players must organize a 4x4 grid of 16 words into four groups based on shared characteristics. For instance, words like 'Followers,' 'Sheep,' 'Puppets,' and 'Lemmings' can be grouped as 'Conformists.' Success in this task requires reasoning across various knowledge forms, including semantic and encyclopedic understanding.
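
To make the grouping task concrete, here is a minimal Python sketch of how a proposed solution could be checked against a puzzle's answer. It is an illustration only, not the scoring code used in the study: the function names are hypothetical, and apart from the 'Conformists' group quoted above, the three other groups are invented placeholders. It assumes that a game counts as "fully solved" only when all four groups match exactly.

```python
from typing import Iterable

Group = frozenset[str]


def normalize(groups: Iterable[Iterable[str]]) -> set[Group]:
    """Convert groups to lowercase frozensets so word order and case do not matter."""
    return {frozenset(word.lower() for word in group) for group in groups}


def count_correct_groups(proposed, answer) -> int:
    """Number of proposed groups that exactly match a group in the answer."""
    return len(normalize(proposed) & normalize(answer))


def is_fully_solved(proposed, answer) -> bool:
    """Assumption: 'fully solved' means every answer group is matched exactly."""
    return count_correct_groups(proposed, answer) == len(normalize(answer))


# The first group comes from the article; the other three are invented placeholders.
ANSWER = [
    ["Followers", "Sheep", "Puppets", "Lemmings"],  # 'Conformists'
    ["Red", "Blue", "Green", "Pink"],               # hypothetical: colors
    ["Oak", "Elm", "Pine", "Ash"],                  # hypothetical: trees
    ["One", "Two", "Three", "Four"],                # hypothetical: numbers
]

# A hypothetical attempt that swaps 'Sheep' and 'Pink', breaking two groups.
PROPOSED = [
    ["Followers", "Pink", "Puppets", "Lemmings"],
    ["Red", "Blue", "Green", "Sheep"],
    ["Oak", "Elm", "Pine", "Ash"],
    ["One", "Two", "Three", "Four"],
]

print(count_correct_groups(PROPOSED, ANSWER))
print(is_fully_solved(PROPOSED, ANSWER))
```

Under this assumed scoring rule, a single misplaced word breaks two groups at once, which is one reason partial success does not translate into a fully solved game.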

Chakrabarty explained, "While the task might seem easy to some, many of these words can be easily grouped into several other categories." He noted that these alternative groupings act as red herrings, designed to make the game harder.

The research found that LLMs are relatively strong on puzzles involving semantic relations but struggle with more complex knowledge types, such as multiword expressions and knowledge that combines word form and meaning. Five LLMs were tested: Google's Gemini 1.5 Pro, Anthropic's Claude 3.5 Sonnet, OpenAI's GPT-4o, Meta's Llama 3.1 405B, and Mistral Large 2 (Mistral-AI, 2024). Although these models could partially solve some puzzles, their overall performance fell short of that of human players.

For further details on this study, readers are directed to visit the AI Innovation Institute website.
