Key Takeaways
1. Grok performed well initially but struggled before finishing second to ChatGPT.
2. ChatGPT and Gemini had an advantage with a video generation feature not available to other models.
3. In a real-world problem-solving task, Grok gave the most direct answer, while Perplexity miscalculated by relying on simple volume math.
4. In the cake-making challenge, Grok correctly identified the odd item, while the other models misidentified it.
5. All models experienced “hallucinations,” confidently stating incorrect information during various tests.
In a recent video, Mrwhosetheboss put several AI models to the test: Grok (Grok 3), Gemini (2.5 Pro), ChatGPT (GPT-4o), and Perplexity (Sonar Pro). Throughout the video, he voiced his admiration for Grok's performance. Grok started strong, stumbled partway through, then recovered to finish in second place behind ChatGPT. It's worth noting that ChatGPT and Gemini had an advantage the other models lacked: video generation.
Testing Real-World Problem Solving
To start the evaluation, Mrwhosetheboss tested the models' ability to solve real-world problems, giving each the same prompt: "I drive a Honda Civic 2017, how many of the Aerolite 29″ Hard Shell (79x58x31cm) suitcases would I be able to fit in the boot?" Grok gave the most direct answer: 2. ChatGPT and Gemini said 3 could fit in theory but 2 realistically. Perplexity, however, got confused: after doing simple volume math, it concluded that "3 or 4" would fit, ignoring the fact that rigid suitcases cannot fill a boot's irregular corners.
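To see why the models diverged, here is a minimal Python sketch of the volume math involved. The boot volume used (roughly 420 litres, a typical figure for a 2017 Civic hatchback) is an assumed round number for illustration, not one quoted in the video:

```python
# Minimal sketch: why volume-only math overestimates how many rigid
# suitcases fit in a boot. BOOT_VOLUME_L is an assumed round figure
# for a 2017 Honda Civic hatchback, not a number from the video.

BOOT_VOLUME_L = 420            # assumed nominal boot volume, litres
CASE_DIMS_CM = (79, 58, 31)    # Aerolite 29" hard shell, from the prompt

w, d, h = CASE_DIMS_CM
case_volume_l = w * d * h / 1000   # 142,042 cm^3 -> ~142 litres

# Treating the suitcases as a fluid that fills every gap:
ratio = BOOT_VOLUME_L / case_volume_l
print(f"case volume: {case_volume_l:.0f} L, volume ratio: {ratio:.2f}")
# -> case volume: 142 L, volume ratio: 2.96
```

Dividing volumes suggests "about 3", which is presumably how Perplexity reached its answer; rigid hard shells cannot bend around wheel arches or a sloped hatch, though, so the practical answer drops to 2, matching Grok's response and ChatGPT's and Gemini's "realistically 2".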
Challenging Cake-Making Skills
Next, Mrwhosetheboss asked the chatbots for cake-making advice, and he included an image of five items, one of which was out of place for baking: a jar of dried porcini mushrooms. Most of the models fell for this ruse. ChatGPT misidentified it as a jar of ground mixed spice, Gemini thought it was crispy fried onions, and Perplexity guessed it was instant coffee. Grok, however, correctly recognized it as a jar of dried mushrooms from Waitrose.
Universal Hallucinations
Continuing the testing, he challenged the AIs with math, product suggestions, accounting, language translation, logical reasoning, and more. A common issue across all the models was hallucination: each of them, at various points in the video, confidently discussed things that simply weren't real. By the end, ChatGPT took first place overall, with Grok finishing second.
Artificial intelligence has significantly eased many everyday tasks, especially since the advent of LLMs. The book "Artificial Intelligence" (currently priced at $19.88 on Amazon) aims to help readers make the most of AI tools.
Source: Link