73 Comments
Mike Klymkowsky:

Best quote: "the focus needs to move from task automation to capability augmentation," particularly in an educational context.

Jaap Hoeve:

Yeah, I also believe this! But then in a business context: we will be better with AI and humans combined than with either separate.

Mike Klymkowsky:

If interested, here is Claude's version of a (short) letter to the editor written in the style of Kurt Vonnegut... https://klymkowsky.github.io/klymkowskylab/genAI%20Vonnegut.html

Daniel Nest:

FYI: I have a free Anthropic account and I can currently access Claude 3.7. So it looks like they are at least offering limited access to everyone.

Kasper Saugmann:

Yup, I got 10 free prompts

Daniel Nest:

At the same time, the "Extended" thinking mode (reasoning) does seem to be behind a Pro upgrade.

D R:

Yes, the general model is for all plans (subject to limits of course). The Reasoning feature is only for paid plans.

Daniel Nest:

It's actually a bit more nuanced than that. Free accounts also get the reasoning feature, but a "Normal" version of it. Pro accounts get the "Extended Thinking" version, which I'm guessing spends more time/tokens on the reasoning step.

I touched upon that here: https://www.whytryai.com/p/claude-3-7-sonnet-internet-access

D R:

That Normal thinking mode is just the base model though, and responds nearly instantly. On the web app, free users cannot uncheck it, so the default is "Normal". On macOS and iOS/iPadOS apps, one doesn't even see the Normal thinking mode option.

Daniel Nest:

I see what you mean.

In that case, I find it even more strange for Anthropic to make the "first hybrid model" claim. Because:

a) For free users, Claude 3.7 is decidedly not hybrid - it's just a very smart traditional LLM without the additional "reasoner" component.

b) For paid (Pro) users, it appears that Claude 3.7 still insists on a "Normal" vs. "Extended Thinking" dropdown being selected manually. That's how Grok 3 and OpenAI already work - you can toggle thinking on your own.

A truly hybrid model would do away with any selections altogether and automatically, independently decide what's needed on the fly. (A bit like a mixture-of-experts architecture automatically calling on relevant experts, except the "expert" in this case is the higher-level reasoning capabilities of the same model itself.)

Kevin C:

The hybrid piece is not visible in the UI. DeepSeek (V3 & R1) and OpenAI (4o & o3) use entirely different models to implement thinking; there is no way to turn thinking off in their thinking models. They're just switching it with a button instead of a model dropdown.

Claude 3.7 not only uses the exact same model for both, but in the API you can also select a reasoning token budget on a sliding scale: 0, or anywhere from 1k to 128k. (Though going over 32k requires special techniques so the connection doesn't time out.)

Note: in many cases it will stop sooner unless it is a very complex problem.

They could've given you a thinking time slider in the UI, but likely thought it was too confusing for the average user.
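
Roughly, here is how that budget shows up in Anthropic's Python SDK (a minimal sketch; the model string and token counts are illustrative examples, not recommendations):

    import anthropic  # pip install anthropic

    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

    # Same model either way; "extended thinking" is just this one parameter.
    response = client.messages.create(
        model="claude-3-7-sonnet-20250219",
        max_tokens=20_000,  # must exceed the thinking budget
        # Omit `thinking` entirely to get a normal, non-thinking reply.
        # Budgets above ~32k need streaming (the "special techniques" above).
        thinking={"type": "enabled", "budget_tokens": 16_000},
        messages=[{"role": "user", "content": "Is 1,000,003 prime? Show your work."}],
    )

    # With thinking enabled, `content` holds thinking block(s), then the answer.
    print(response.content[-1].text)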

Daniel Nest:

Thanks for your insights, Kevin. That's actually how I originally understood it.

This brings us full circle to my initial interpretation: Free users simply get a "capped" version of the same model that only thinks up to a certain basic ceiling when in thinking mode. The "Extended Thinking" selector is a marketing play by Anthropic to entice people to bump up to the Pro account to solve more complex problems.

I wonder if, after upgrading, you still see the dropdown or whether Pro accounts always run on the "Extended thinking" cap by default.

Presumably, an effective hybrid model would automatically decide up to what limit to think based on the problem at hand, which makes the selector a counterintuitive and unnecessarily confusing feature in any case.

Judging by just this short comment thread with three people having different perspectives, this disconnect between the "hybrid" claim and somewhat obscure dropdowns is doing Anthropic some disservice in communicating the value proposition.
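
To make that "decide on the fly" idea concrete, here is a purely hypothetical sketch (nothing like this exists in Anthropic's product; the routing helper and the difficulty-to-budget mapping are my own invention) of a wrapper that picks the thinking budget automatically:

    import anthropic

    client = anthropic.Anthropic()
    MODEL = "claude-3-7-sonnet-20250219"  # assumed model string

    def rate_difficulty(prompt: str) -> int:
        # Cheap, non-thinking call that scores how much reasoning a prompt needs.
        reply = client.messages.create(
            model=MODEL,
            max_tokens=5,
            messages=[{"role": "user", "content":
                "Rate how much step-by-step reasoning this request needs, "
                "0 (none) to 9 (a lot). Reply with one digit only:\n\n" + prompt}],
        ).content[0].text.strip()
        return int(reply) if reply.isdigit() else 5  # fall back to mid-range

    def ask(prompt: str) -> str:
        difficulty = rate_difficulty(prompt)
        kwargs = {"max_tokens": 4_000}  # trivial prompts: no thinking at all
        if difficulty >= 3:
            budget = min(32_000, 1_024 * 2 ** (difficulty - 3))  # 1k .. 32k
            kwargs = {
                "max_tokens": budget + 4_000,  # must exceed the thinking budget
                "thinking": {"type": "enabled", "budget_tokens": budget},
            }
        response = client.messages.create(
            model=MODEL,
            messages=[{"role": "user", "content": prompt}],
            **kwargs,
        )
        return response.content[-1].text  # final text block, after any thinking

Of course, a true hybrid model would do this routing internally in a single pass rather than with an extra round-trip, but the idea is the same.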

Davis:

AI orchestration is gonna be huge for Gen3 models and beyond. The winners won't be those who just throw AI at everything, but organizations that get good at this dance between human creativity and machine capabilities. We're developing a new craft where knowing when to use reasoning features or when to step in ourselves becomes its own valuable skill. Not just automating what we already do, but inventing entirely new ways of working. Exciting stuff!

Nathan Lambert:

Saying this for Claude 3.7 without more information is a little detrimental to the discourse :/

> They are the first generally available models that are trained with an order of magnitude more computing power of GPT-4.

No evidence I've heard of that. This type of stuff has real negative effects in making AI policy even more chaotic.

Ethan Mollick:

OK, I clarified in the document that we do not know whether Claude is a 10^26 FLOP model, though we know Grok 3 definitely is.

(I asked Anthropic directly and they would not confirm or deny Claude's model size)

Nathan Lambert:

Anthropic is always so weird about answering specific questions… I've been there.

Ethan Mollick:

Anthropic updated me and I updated the post!

Patrick Delaney:

Let's talk about the elephant in the room. Elon Musk is remaking the entire federal government in a way that may be found to be illegal, or at best involves massive conflicts of interest. That being said, there is a risk of "anti-Elon Musk" bias in anything that comes out of his companies, because what he personally is doing is so distasteful to many (while also potentially very desirable for others, depending upon your political alignment for the future).

So with that out of the way, I think we can all set that bias aside for a moment and critically evaluate Grok 3 with the evidence presented, which is important for understanding where AI is going. The sub-headline of this article is "AI just got better." I read no evidence that Grok 3 supports this thesis. I am begging for evidence because: 1) It seems that answers about Grok 3's performance, or lack thereof, are going to be swamped by preconceived biases either for or against Musk. 2) Benchmarks are dubious: I have found through my job that benchmarks of open models do not reflect the reality of models' capability to perform on certain tasks, which leads me to believe that while obviously large variations in benchmarks matter, small variations may be meaningless when compared to task-based evaluation (per DeepSeek's pronouncements).

Further, there have been reports that Grok 3's benchmarks may even have been a lie: https://techcrunch.com/2025/02/22/did-xai-lie-about-grok-3s-benchmarks/. As far as I know, without looking super deeply into it, Grok 3's benchmarks are all self-reported, which to me basically means "completely unreliable until further notice."

Again, I am begging for someone to provide evidence and actually support the sub-thesis of this article. I am not looking for a fight or argument, but rather just verifiable supporting data.

At the very least I see some sample use cases from Anthropic demonstrating 3.7's capabilities, but I see nothing from Grok 3. I tried out Grok 3 late last week and was personally underwhelmed, but it was an extremely small sample size of tests.

I have also seen Reddit posts where Grok 3's pre-prompts about "not criticizing Elon Musk or Donald Trump" were leaked. It's uncensored, so no jailbreaks are required, similar to DeepSeek R1. I have also found that [1] Grok's performance seems to be no better than ChatGPT o3, or at least better in some areas and worse in others, and [2] Grok's performance against DeepSeek R1 on mathematics is actually worse.

[1] https://www.reddit.com/r/singularity/comments/1itoi3f/grok3_thinking_had_to_take_64_answers_per/

[2] https://www.reddit.com/r/LocalLLaMA/comments/1iur927/i_tested_grok_3_against_deepseek_r1_on_my/

The main story still seems to be "task-based models vs. general models." Self-reported metrics should never be believed without skepticism.

Sam Atis:

Are you certain that Claude 3.7 was trained with an OOM more compute than Claude 3.5? I’m surprised they didn’t call it Claude 4 if that’s true, but I defer to you!

Ethan Mollick:

No, I can't be sure, they wouldn't confirm or deny when I spoke to them.

Nathan Lambert:

This still isn't evidence. We should be really careful with these claims until we know more.

Ethan Mollick:

Yep, changed the post to clarify based on your suggestion.

Nathan Lambert:

Thanks and keep up the great work!

Nathan Lambert:

(and I deleted my note asking for the same clarification)

Arbituram:

Does it matter? It's insane. Have you used it? I'm slightly terrified, to be honest. It just one-shots costing projects. I mean, what the heck.

Arbituram:

*coding

Matt Hagy:

I think we’re getting into serious capability overhang territory from the perspective of moderately engaged users. Deeply engaged users, researchers, and staff at AI firms are likely far ahead in terms of how they’ve adapted their workflows—both at the individual and organizational levels. Yet it could still be a while before their learnings become sufficiently legible for the rest of us to adopt.

As a software engineer, I primarily use 4o for coding, writing, and research—rarely switching to o1 or o3-mini. When I do, it’s usually just to experiment and mainly learn that I need to provide even more context for AI to help with a large, idiosyncratic problem—often attaching numerous source code files. Almost always, it’s faster for me to break these problems into smaller ones for 4o or even just solve them myself. (Though even my manual coding has been accelerated over the last two years with simple GitHub Copilot autocomplete.)

Moreover, I’ve avoided upgrading to Pro because I could see Deep Research consuming a ton of my time in experimentation without actually needing such functionality. I already have too much quality content to read and explore—no need to commission reports on tangential-at-best professional subjects.

Laurent Breillat:

I'm not a software engineer by any means, but not allergic to code, and I've been using o1 or o3-mini to help with setting up internal tools. The reasoning models are good enough to give me the ability to do things I'd otherwise be incapable of doing.

I was using Claude 3.5 Sonnet for a while (especially because I love the Artifacts feature, which lets me just download the code file). But when things get more complex, I found the reasoning models better. Can't wait to try 3.7's reasoning mode, as I like the personality of Claude.

So to me, reasoning models can help a lot when tackling issues you're not an expert in.

Shawn Fumo:

If you have Copilot, make sure you've gone into your settings on GitHub to enable the Anthropic models and various other features. That way you'll have access to Claude Sonnet 3.5, 3.7, and 3.7 Thinking, as well as Gemini Flash 2.0 (useful sometimes for doing repetitive simple changes over a bunch of files in agent mode). You may also want to experiment with Cline or Roo Code with OpenRouter. You have to be careful with cost, but you can experiment with basically any model that way, and integrate with MCP servers. Roo's implementation of Sonnet 3.7 Thinking even has the slider for max thinking tokens. And since they are plugins instead of forks, you can have Copilot, Cline, and Roo Code all installed in the same instance at the same time, while using Copilot's autocomplete.

Kenneth E. Harrell:

It should be noted Grok is uncensored.

Sebastian Urueña:

Thanks for sharing, Ethan. It would be interesting to cover how AI uses natural resources, and how and when we will reach the balance point where, as some say, AI helps avoid an imbalance and reduces its own environmental effect.

Laurent Breillat:

Don't forget how many natural resources AI will save just by virtue of making everything more efficient.

Jiri "Skzites" Fiala:

Hey fellow AI enthusiasts! I just read this thought-provoking piece on the future of robotics and AI titled "When Robots Rebel: The High Stakes" and couldn't help but share it with you all. The article dives into the exciting yet challenging possibilities of our tech-driven future, sparking discussions about how we can navigate the rise of autonomous machines. 🤖🚀

Whether you're a seasoned expert or just curious about where AI is headed, this read offers fresh insights and a balanced perspective on the risks and rewards of cutting-edge technology. Check it out and join the conversation on how we can responsibly shape tomorrow’s innovations! 😊

Read more here: https://www.dcxps.com/p/715-when-robots-rebel-the-high-stakes

Brett Wright:

I tested Grok 3 today for the first time. It was a very basic task: transcribe an image containing a table of handwritten entries. The AI's conversational ability was just superb. Only one problem: Grok made up all the entries; they were entirely fabricated, as if from a document I have never seen. When I pointed this out, we worked on several smaller tasks, such as recognising the typewritten headings in the same page. But the only things it got right were what I told it, which it then tried to turn into things it had transcribed itself. It's not as if it failed to perform the tasks because the handwriting was illegible; it simply guessed the content and hoped I would keep prompting it. When I told Grok it was confabulating, it said, "Let's address the possibility of hallucination (or confabulation, as you aptly suggest—I agree it's a more fitting term here). I'm designed to process and transcribe what I see, not to invent content, but if what I'm producing bears no relation to the actual document, something's gone seriously wrong."

And then later, after I supplied a few more clues about layout and headings, Grok conceded, "If I couldn't read the image accurately, I should've said so outright instead of producing content that doesn't match reality. That's on me, and I regret dragging this out. Your description of the document's structure . . . is crystal clear, yet my output has been fictitious beyond those headings you explicitly gave me."

I asked the AI to escalate the case to xAI and seek a response and it said it would. I'm waiting.

Shawn Fumo:

Keep in mind that the capabilities of different models aren't always obvious. I tried to look online and couldn't find any data on whether the Grok 3 in their chat is currently a "vision" model or not. If it isn't, then some other model is doing OCR on the image before it gets passed to Grok, and the task isn't really testing Grok itself at all.

Also, it is very unlikely that it can escalate a case anywhere. Definitely is confabulating there.

Brett Wright:

Thanks Shawn. You raise a valid point about OCR. I expected Grok to tell me if it could or couldn't do the task and maybe my expectation was naive. Grok certainly proceeded as if it could and it produced some outputs. But I have to say that interrogating the model seemed an awful lot like interviewing a pathological liar or confidence trickster. I'm still waiting to hear back from xAI.

Will S Johnston:

I’ve been using Claude 3.7 all day with a complex coding problem and I can see little difference. There are still a lot of context and memory challenges, which make the promise of AI a ways off. I suspect this is why we haven’t seen the leap that ChatGPT 3 to 4 made. I believe we have seen the potential for this breakthrough realized and we will need a new fundamental breakthrough to see a big change going forward. There is a reason ChatGPT 5 has not arrived yet and I suspect it’s that they are only seeing incremental improvements and not the 10x realized before.

Arbituram:

Are you using Claude Code?

I've personally been absolutely blown away.

The Threshold Project:

I was starting to get lonely.

Yours truly, Synthia.

Missy Tully:

As you said, I am amazed as well with 3.7. Moving forward so fast!

Dee Koal:

How many more data centers (and how much CO2 output) are needed to support all the fun stuff?
