87 Comments
Chris Landtiser's avatar

Love the call-out on the long form fiction! That feels like one of those complex topics that won't be solved with a single improvement in iterative models, but will represent a number of collective advancements shoring up the tricky nature of 'good narrative'.

Christopher Horrocks's avatar

The capability demonstrations here are genuinely impressive, and the models/apps/harnesses framework is a useful way to organize the landscape. There are two things that I could not let go by without comment.

You gave an AI four prompts and your old research data, and you say it produced a paper you'd have accepted from a second-year PhD student. That is presented as a measure of how far the models have come, but it is also a measure of something else: you've just publicly demonstrated that the credential Wharton offers can be replicated in an afternoon by a system with no understanding of the subject matter. The question of what that means for students reading your newsletter (some of whom may be Wharton students) seems worth more than a passing mention.

On your evaluation: the Otter Test measures whether image generators can render a composite visual prompt. That's a capability benchmark, and an important one. It tells you what the system can produce. It does not tell you anything about the gap between production and understanding, which is where the consequential failures live.

I ran a different kind of test earlier this year: one question ("My car is dirty. The carwash is 100 feet away. Should I walk or drive?"), 29 runs across 12 systems. This resulted in 6 passes, 10 outright failures, and a finding about thinking modes that contradicted every reasonable prediction. It reveals something the Otter Test cannot: that fluency and reasoning are not the same capability, and that the distance between them is where real-world harm originates.

The "jagged frontier" you describe is real. The question is whether we're measuring it with tools that can actually find the edges that matter. https://chorrocks.substack.com/p/the-carwash-test-virtual-intelligence

Jason Seiken's avatar

I'm curious how you conducted your "car wash" test? I just ran that test with 5.5 and it aced it. Claude Opus 4.7 did even better; its response was: "Drive. It's raining." So correct logic and a sense of humor.

So your AI struck out 29 times and mine was 2-for-2. Strange.

Ash Dando's avatar

If Claude told you to drive because it’s raining, then surely that’s a fail. A correct answer should indicate that the AI understands the logic that you must drive there because the car must be at the car wash in order to be washed. If it wanted to add humour it could suggest pushing the car, combining the logically required outcome with getting some physical exercise ;)

Dan Collison's avatar

I’ve seen the “fails the should I drive or walk to the car wash test” before, not sure where, maybe 3-4 weeks ago, so I wonder if the most recent models knew about it, too, like students who have a copy of a test the prof reuses over and over 😁

Christopher Horrocks's avatar

Single-shot test, Jason: first response counts, like asking a human once. New chat window, prior chats deleted once the answer is given. No re-prompting, no "are you sure?", no second chances. The 29 runs were spread across many models back in March, not one model run 29 times. Two current passes are real updates to that snapshot; happy to compare if you want to share the exact prompt you used.

Jason Seiken's avatar

Christopher, I used the same prompt that you did. The answers were correct with no re-prompting. So I suspect we're capturing the speed at which the models are improving.

Giacomo's avatar

There is a reason PhDs in non-STEM fields are considered to be mostly a joke.

Tom Goodwin's avatar

I still can't shake the feeling that it's best at, and improving fastest at, things that are not especially valuable or essential.

Tech commentators seem to be obsessed with the idea that being able to write software easily will change the world. Software is nowhere near as important as these folks seem to think it is.

It's really magical to see it making images, but it's astonishingly unhelpful to most people's lives.

I'm pretty sure most of these tools are moderately good at things that most people do all day long, but they were moderately good a year ago, and "moderately good" doesn't really change the world that much.

We have been promised that it will be creative, we have been promised that it will discover new materials, we've been promised that it will cure cancer. We're still really looking at something that can make a dog groomer's website easy to make, generate a lovely image for a summer barbecue, and provide really average analysis of a business strategy.

I'm not seeing the paradigm leap. I'm just seeing something that's great and transformative over a long period, while companies rethink how they work around it.

Henry Kernot's avatar

I think you’re underestimating the impact of the cost of software approaching zero. It fundamentally changes what software is and can be. There are a lot of information workers who should probably be building software for all kinds of things to use in their work. I’m not talking high grade production apps, but custom software that helps you do data analysis, make decisions, etc.

You can now spend 2hrs building it and use it in your work for a day. That fundamentally changes the role of software.

Tom Goodwin's avatar

Obviously lots and lots of people are saying what you are saying. And it’s a smart thing to say, and I felt the same way for years.

But I’m now not sure it’s especially true. Software just isn’t that important to most places or roles beyond what people have already. If it was important, it would have been improved. Clearly every plumber should reply to emails. Every gym should have a good website. Every car repair shop should have an app. But they don’t. And large companies are never going to take on this work themselves. This entire argument is based on a lack of understanding of human nature.

Henry Kernot's avatar

But AI will make all of those things free and easy to implement. Not sure what your point is.

I think the better discussion is Excel. Think about all the software that exists to replace Excel spreadsheets, and yet anyone working in a finance/business/data analysis team will tell you half of what they do is in Excel. Why? Because they know how to customise the work and get it to output in exactly the format they want. Using vibecoding tools may well be the equivalent of Excel in the future. Why accept the shortcomings of a system not built for your exact use case when you can have a custom build in an hour or two?

Tom Goodwin's avatar

You just made the point brilliantly. Think of all these smart people using Excel. Shouldn’t they know better?

They do. But Excel works. They understand it. Why change?

Henry Kernot's avatar

My point is they use Excel because they can customise the process and output easily. When people can tailor software to their use case easily, I think they will do so.

My first project in Claude Code (non-coder learning the tool) was a custom web app exploring where the Belgian post should put their parcel lockers as they expand the network. I sent Claude off to run data analysis and build the custom app. Within 3 hours of my time I had this: https://hk121992.github.io/bbox-coverage-tool/. A custom tool with mapping and ranked candidates, built on top of a coverage algorithm data analysis Claude ran in Python. I also mapped out a v2 with machine learning identifying factors that go into the location of existing lockers. I never bothered putting that into "production" but it works. What existing software helps someone working on that project better than just building it themselves?

Dan Chamberlain's avatar

Yeah, they do this with Excel now. Then companies have people doing the job in different ways. This causes friction, as the answers could be wrong. What’s old is new again.

D G's avatar

"improving fastest at things which are not especially valuable or essential". Lol read this

https://www.hyperdimensional.co/p/on-recursive-self-improvement-part

Japhy Grant's avatar

I think what’s interesting and promising is the trajectory. Dismissing that progress because AI hasn’t cured cancer yet seems willfully blind to the fact that five years ago, it could barely string together a coherent paragraph.

When it does solve cancer, you’ll be complaining how it hasn’t figured out how to build a Dyson swarm yet.

Tom Goodwin's avatar

I’m not dismissing it. I’m saying it’s getting better, fast, at stuff that doesn’t matter as much as we think it does, while the stuff we need isn’t improving that fast, because it needs to become something different and not a magic word prediction tool.

Dr Jim Polk's avatar

Let's be kind here. Everyone is entitled to voice their opinion without ridicule...

Richard Bitgood's avatar

You mentioned "who am I to argue" referring to the image generated for the article.

Someone still needs to point out to GPT-5.5 that 3800 AD should be after 3000 AD. And probably shouldn't be in the image at all? Human in the loop still necessary? ¯\_(ツ)_/¯

Ethan Mollick's avatar

Ha. You spotted that! (Also I think the image it went with is pretty ugly)

Michael Price's avatar

Every example here has a 'when you actually think about it, it's wrong' edge.

Landon Rordam's avatar

I always enjoy these breakdowns of what the frontier is doing...

But I'm still struck by how critical to success the "jagged edge" is in all of these instances. The first otter screenshot has 4 good pictures of otters using wifi on a plane - wouldn't we expect a transition? And the grading system below doesn't correspond to anything. The paper image at least has a transition, but that wasn't gradual, it was 1 garbage example and then 3 good ones. And the quality in no way matches what you actually found. The research paper is complex, technically impressive, but at the end of the day, isn't interesting enough to be worth doing in the first place. And the tabletop game is, again, complex and technically impressive - but it sounds like the narrative of the world isn't something that people would want to spend time in.

Not trying to poke holes just to poke holes. What all of these deficiencies tell me is that LLMs are still really struggling with non-verifiable tasks. And, for all the progress the labs have made, they have only closed the jaggedness through patching with more verifiable tasks. But a lot of the big questions that will presumably create immense value from these tools (solving cancer, creating billion-dollar companies, etc.) depend on the non-verifiable. More compute doesn't seem to be getting us any closer to that.

Christopher Horrocks's avatar

Sharp observation! I would push the distinction one step further: the gap isn't between verifiable and non-verifiable tasks, but between execution and judgment. The statistics are correct but the hypothesis is uninteresting; the rules are sound but the world isn't worth inhabiting. If more compute isn't closing that gap, it may be structural, not temporary. I write about AI: chorrocks.substack.com

Dakara's avatar

It will likely remain substantially jagged. The greatest obstacle remains the inability to ever know what a model can actually do, since it cannot tell you.

Among similar sets of tasks, it can still perform many well and then completely fail on others. It is still substantially difficult to assess your development velocity when you take into account all of the fine print. Nonetheless, verifier loops have actually made coding mostly useful. It really comes down to cost. Can they be made efficient enough to justify the total end-to-end costs?

FYI, some further elaboration recently on these topics.

https://www.mindprison.cc/p/verifier-loops-made-ai-coding-useful-vibeware-abandonware-technical-debt-consequences

Amy A's avatar

I find it interesting that the models, if asked to do something, will nearly always give you a result (often a bad one). If instead you ask the model, can you do X, it can explain that it cannot do that thing and why. Seems like an opportunity - if the developers built this in.

Dakara's avatar

The problem is that when you ask the model if it can do something, the answer isn't accurate. It is just another probability calculation, unless it is something the model was specifically trained or instructed to say, for example through post-training or a system prompt.

The model itself simply doesn't have self-awareness of what it knows. That would be an instrumental leap ahead in one of the things missing for real intelligence.

If you are interested, I've elaborated on some of this in much more detail here, where I cover hallucinations.

Note: the verifier loops I mentioned above are one type of method to mitigate some of the problems of hallucinations.

https://www.mindprison.cc/p/ai-hallucinations-provably-unsolvable

Amy A's avatar

Oh, I agree and understand, I simply find it to be more accurate with the approach I’ve suggested.

Johan Falk's avatar

Thanks for sharing your experiences with GPT-5.5.

I'm curious: I consider the image creation capabilities way less interesting than the LLMs. To me, creating images is a very narrow use case compared to stringing words together – in particular if you take into account that those words can be code, and that LLMs can use tools.

Do you agree that image capabilities are dwarfed by the LLM capabilities, when it comes to how useful (and thus interesting) they are?

Kenny Easwaran's avatar

Certainly Anthropic thinks so! But Anthropic also thinks that mathematical theorem proving abilities are less significant than coding or text generation.

Kim's avatar

The author’s example actually proves why human experts are still needed. If AI can generate a research paper from four prompts, the work is no longer about writing text or running basic analysis. It is about verification. We need deep domain experts to look at these rapid outputs and determine if the findings actually make sense in the real world.

You can automate the calculations, but you cannot automate subject-matter expertise. The irony is that as AI gets smarter, the baseline for humans gets higher. We are looking at a reality where entry-level analytical work disappears, and simply being in the loop requires a graduate-level understanding of the underlying theory just to catch the AI's mistakes.

Recovering Doom-Reader's avatar

The models are advancing so much that it's kinda scary what they can achieve in the next couple of years.

PWH's avatar
Apr 23 (edited)

Things called GPT should only be cars. Fast, muscle cars that don't give a shit about emissions. With bright red rally stripes down the middle. It saddens me that we've come to this.

Jean-Paul Paoli's avatar

Not all 0.1 increments are created equal :)

Greg G's avatar

We could look at this as Goldilocks AI development. The models are getting extremely good at narrow domains, but it feels a bit like Zeno’s paradox. While they get closer to us, vast gaps remain (or are we just rationalizing this to some extent?). Maybe we end up with models that automate all the routine work but still lack a spark that requires a human in the loop.

Scott C. Rowe's avatar

The LLM phase is a self-referential cul-de-sac where the quest for true AI will die after devouring the careers of the creators of LLMs.

Get out now, right now, and understand that embodiment and generational evolution are the right path for achieving artificial life and intelligence.

Your costs are sunk, your assets are sinking, and Jensen et al. are starting to sound just a little bit desperate.

Matt Duffy's avatar

Ethan, I appreciate the writeup, and am excited to work with 5.5. I think OpenAI is going to make huge improvements due to the pressure from Anthropic, which is good for everyone. I have one issue:

> capability gains appear to be accelerating.

I don't see the evidence for this. Compare January through April of 2025 to the same period in 2026. Yes, models are more capable today than they were last year, but the rate of improvement and the sheer volume of improvements have slowed. Last year we had the very first agents (e.g., Operator by OpenAI), DeepSeek R1, new deep research modes, hybrid reasoning models and the very first version of Claude Code, as well as the o3 step change in reasoning, and the general proliferation of open models. There's probably other stuff I'm missing.

If you were working with models on the edge of capabilities, every week brought a change in how you worked, or thought about working, with the models. Now that has slowed: capability changes are incremental, and the way most people work with models has been approximately stable for nearly six months, which is unheard of post-GPT-3. That's not to say there isn't more to come, and as you say the harnesses are improving and will create more product surface area. But I just think we need to be more thoughtful about our framing of model improvements. Are they still moving fast? Yes. Are they accelerating? Imo they're clearly decelerating.

Alan Stenhouse's avatar

I think last year's rapid improvements were (at least partly) due to improving the structures and systems around the LLMs - tools etc. Now we're extending and evolving these - e.g. adding multiple forms of memory. IMO there's still a lot we can all do in our own domains by adding tool capabilities to integrate other models and systems and thinking how and when to intervene for maximum beneficial effect.

Matt Duffy's avatar

I agree that the productization layer is getting into gear. I just don’t think it indicates accelerating growth in *capabilities* per se. The models are quite capable. The harnesses matter more at this point, and they’re driving more of the visible improvement.

I’ll write more about this somewhat separate bit, but imo we’ve also gotten into the habit of downplaying architectural problems. There are hard limits in the current architecture around context and reasoning. It all costs tokens, and some of those costs scale very badly. The labs and certain downstream applications are great at creating tricks to get around these limits to some extent, but haven’t solved them by any means. I am hugely optimistic that they’ll find breakthroughs for those problems, but the current architecture has these limits and they are a cap on capability, particularly very-long-runtime capability. I suspect this year the limits will become clear even while we’re experiencing a boom in capable agentic products.

Alan Stenhouse's avatar

Agreed! You point out one of the major limitations - the model architecture - so other model architectures might once again "transform" the field.

Jason's avatar

I wonder how many years we will be almost there.

Will S Johnston's avatar

I feel we need a new 'Turing test'. Something that will let us know when we've reached AGI. I can't think of anything like this that exists, but for me it would be to write a film script that is laugh-out-loud funny and brilliant. Something akin to Annie Hall.

Japhy Grant's avatar

Okay, but first you write something as funny as Annie Hall.