Hello from New York, where it suddenly got hot!
I ran the same prompt on 6 popular LLMs on the market:
Balanced Models: Gemini 2.5 Flash, Claude 4 Sonnet, ChatGPT 4o
Reasoning Models: Gemini 2.5 Pro, Claude 4 Opus, ChatGPT o3
And here is the prompt. Very simple:
Make a super-lightweight Super Mario game that won’t cost you many tokens.
I was curious to see if they could all do it and how different the results would be.
In today’s newsletter, I’ll share the results and my impressions.
Let’s get started! 🍄
Disclaimer
This isn’t meant to be another “I tested A, B, and C models systematically, and A won” article because:
The prompt is unspecific. It was brief and left the AI plenty of room for interpretation. It would be biased to say, “Model A’s result looked better, so Model A won,” because Model B might have prioritized saving tokens over quality.
There is randomness. If I opened a new chat and ran the same prompt again, the result would be different, because these LLMs are probabilistic rather than deterministic.
This is just a particular use case. Some models are better at deep reasoning, while others are better at balancing speed and quality. The outcome of this test alone doesn’t represent their full capabilities.
So consider this article more of an “I explored those models quickly over a weekend morning and thought you might be interested in seeing the results” piece.
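For the curious, here is a toy sketch of the randomness point in the disclaimer. It is not how any of these products work internally. The option names, their scores, and the helpers `sample_token` and `run_chat` are all made up for illustration. The idea is just that a model picks its next step by sampling from weighted options, so two fresh chats can take different paths.

```python
import math
import random


def sample_token(logits, temperature=1.0, rng=random):
    """Pick one option from a dict of option -> score, using softmax sampling."""
    # Lower temperature sharpens the distribution; higher flattens it.
    scaled = {tok: score / temperature for tok, score in logits.items()}
    top = max(scaled.values())
    # Subtract the max before exponentiating for numerical stability.
    weights = {tok: math.exp(value - top) for tok, value in scaled.items()}
    tokens = list(weights)
    return rng.choices(tokens, weights=[weights[t] for t in tokens], k=1)[0]


# Hypothetical scores for "what should the game include next?"
toy_logits = {"coins": 2.0, "enemy": 1.5, "flag": 1.0, "plain ground": 0.5}


def run_chat(seed, steps=3):
    """Simulate one 'new chat': a fresh random stream making a few choices."""
    rng = random.Random(seed)
    return [sample_token(toy_logits, temperature=1.0, rng=rng) for _ in range(steps)]


if __name__ == "__main__":
    for seed in range(3):
        print(f"chat {seed}:", run_chat(seed))
```

Each simulated chat draws from the same weighted options, yet the sequences can differ from one chat to the next. With a very low temperature, the top-scoring option wins almost every time, which is the closest this toy gets to a "deterministic" model.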
Gemini 2.5 Flash
Model Overview: This model is known for prioritizing speed and efficiency with a balanced quality.
Time Spent: 9 seconds
Impression: Gemini 2.5 Flash was very fast. The content of the game was basic. Points and instructions were displayed, although they were hard to read. No celebration screen after completion.
Gemini 2.5 Pro
Model Overview: This model is known for deep reasoning, complex coding, and multimodal understanding.
Time Spent: 16 seconds
Impression: The content was basic. The control instruction was clear. No points were displayed or counted. It was great to see a celebration screen.
2nd Attempt
I opened a new chat and reran the same prompt.
Impression: The content was still basic, with simple instructions. There were still no points involved. No celebration screen this time.
(FYI, I hit the usage cap after 3 attempts.)
Claude 4 Sonnet
Model Overview: This model is known for a balance between intelligence and speed, suitable for a wide range of applications.
Time Spent: 30 seconds
Impression: It created an opening screen, which was great. It was surprising to see that there was an enemy. Coins were displayed and counted.
Reward/Penalty Design:
Golden coins to collect: +100 points each
Get hit: -100 points
Jump on enemies: +200 points
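The reward/penalty rules above are simple enough to sketch in a few lines. This is my own illustration, not the code Claude generated; the event names and the `total_score` helper are hypothetical.

```python
# Point values taken from the reward/penalty list above.
POINTS = {
    "coin": 100,    # collect a golden coin
    "hit": -100,    # get hit by an enemy
    "stomp": 200,   # jump on an enemy
}


def total_score(events):
    """Add up the points for a sequence of game events."""
    return sum(POINTS[event] for event in events)


if __name__ == "__main__":
    # Two coins, one stomp, one hit: 100 + 100 + 200 - 100
    print(total_score(["coin", "coin", "stomp", "hit"]))
```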
Claude 4 Opus
Model Overview: This model is known as Anthropic's most intelligent model, designed for complex tasks and reasoning.
Time Spent: 21 seconds
Impression: A flag was added. No status change for completion. Coins were displayed and counted.
2nd Attempt
I ran the test again. Different outcome, yet with a similar level of complexity. There was a simple celebration message after completion.
ChatGPT 4o
Model Overview: This model is known for general use, everyday content, and creativity.
Time Spent: 12 seconds
Impression: Really basic. No floating platforms. No coins. No celebration page.
ChatGPT o3
Model Overview: This model is known for deep reasoning and analytical skills.
Time Spent: 15 seconds
Impression: Really basic. No floating platforms. No coins. No celebration page.
(Just green grass, blue sky, me jumping :) )
Takeaways
These tools and LLMs are becoming increasingly similar in what they can achieve.
They can all create super simple code-backed applications within 30 seconds.
They all have similar features. Gemini and ChatGPT have Canvas, while Claude has Artifacts.
Despite the disclaimer above, if I had to rank those LLMs based on my personal preference for building a mini app, it would be:
Claude 4 Sonnet > Gemini 2.5 Pro > Claude 4 Opus > Gemini 2.5 Flash > ChatGPT 4o > ChatGPT o3
Thanks for reading.
As I always say about learning AI: don’t just take my word for it. Try these models yourself and let me know what you think.
See you next week,
Xinran
-
P.S. For the talk last month, I locked my room to prevent my kids from coming in and interrupting me.
After the talk, I found a pile of notes they wrote and slid under the door.