Love the call-out on long-form fiction! That feels like one of those complex problems that won't be solved by a single improvement in successive models, but by a number of collective advances that together shore up the many tricky parts of 'good narrative'.
It will likely remain substantially jagged. The greatest obstacle is still that you can never know what a model can actually do, since it cannot tell you.
Among similar sets of tasks, it can still perform many well and then completely fail on others. It remains substantially difficult to assess your development velocity once you take all of the fine print into account. Nonetheless, verifier loops have actually made coding mostly useful. It really comes down to cost: can they be made efficient enough to justify the total end-to-end costs?
FYI, I recently elaborated further on these topics here:
https://www.mindprison.cc/p/verifier-loops-made-ai-coding-useful-vibeware-abandonware-technical-debt-consequences
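To make "verifier loop" concrete, here is a minimal sketch of the pattern I mean: generate, run the tests, feed failures back, repeat. The `llm` callable and `run_tests` command are stand-ins for whatever model API and test harness you actually use, not any particular product's interface:

```python
import os
import subprocess
import tempfile

def run_tests(code: str, test_cmd: list[str]) -> tuple[bool, str]:
    """Write the candidate code to a temp file and run the test suite against it."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "solution.py")
        with open(path, "w") as f:
            f.write(code)
        result = subprocess.run(
            test_cmd + [path], capture_output=True, text=True, timeout=60
        )
    return result.returncode == 0, result.stdout + result.stderr

def verifier_loop(task: str, llm, test_cmd: list[str], max_iters: int = 5):
    """Generate, verify, feed failures back; stop when tests pass or budget runs out."""
    prompt = task
    for _ in range(max_iters):
        code = llm(prompt)  # hypothetical model call
        passed, output = run_tests(code, test_cmd)
        if passed:
            return code  # verified: the test suite is green
        # The verifier's output becomes feedback for the next attempt.
        prompt = f"{task}\n\nYour previous attempt failed these checks:\n{output}\n\nFix it."
    return None  # every iteration burns tokens, which is the cost question
```

Every extra trip around that loop multiplies the token bill, which is exactly why the end-to-end economics matter more than the raw capability.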
I find it interesting that the models, if asked to do something, will nearly always give you a result (often a bad one). If instead you ask the model, "Can you do X?", it can explain that it cannot do that thing and why. Seems like an opportunity, if the developers built it in.
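A rough sketch of what "building it in" could look like: ask the capability question first and only proceed on a yes. The `llm` callable here is a placeholder for any chat API, and the YES/NO convention is just one way to make the pre-check parseable:

```python
def ask_with_precheck(llm, task: str) -> str:
    """Ask whether the model can do the task before asking it to do the task."""
    check = llm(
        "Can you reliably do the following? Answer YES or NO first, "
        f"then explain why in one sentence.\n\nTask: {task}"
    )
    if check.strip().upper().startswith("NO"):
        # Surface the model's own explanation instead of a confabulated result.
        return f"Declined: {check}"
    return llm(task)
```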
You mentioned "who am I to argue," referring to the image generated for the article.
Someone still needs to point out to GPT-5.5 that 3800 AD should be after 3000 AD. And probably shouldn't be in the image at all? Human in the loop still necessary? ¯\_(ツ)_/¯
Ha. You spotted that! (Also I think the image it went with is pretty ugly)
I still can't shake the feeling that it's best at, and improving fastest at, things which are not especially valuable or essential.
Tech commentators seem to be obsessed with the idea that being able to write software easily will change the world. Software is nothing like as important as these folks seem to think it is.
It's really magical to see it making images, but it's astonishingly unhelpful to most people's lives.
I'm pretty sure most of these tools are moderately good at things that most people do all day long, but they were moderately good a year ago, and "moderately good" doesn't really change the world that much.
We have been promised that it will be creative, we have been promised that it will discover new materials, we have been promised that it will cure cancer. We're still really looking at something that makes a dog groomer's website easy to build, generates a lovely image for a summer barbecue, and provides really average analysis of a business strategy.
I'm not seeing the paradigm leap; I'm just seeing something that's great and transformative over a long period, while companies rethink how they work around it.
Things called GPT should only be cars. Fast, muscle cars that don't give a shit about emissions. With bright red rally stripes down the middle. It saddens me that we've come to this.
I feel we need a new 'Turing test', something that will let us know when we've reached AGI. I can't think of anything like this that exists, but for me it would be writing a film script that is laugh-out-loud funny and brilliant. Something akin to Annie Hall.
The capability demonstrations here are genuinely impressive, and the models/apps/harnesses framework is a useful way to organize the landscape. There are two things that I could not let go by without comment.
You gave an AI four prompts and your old research data, and you say it produced a paper you'd have accepted from a second-year PhD student. That is presented as a measure of how far the models have come, but it is also a measure of something else: you've just publicly demonstrated that the credential Wharton offers can be replicated in an afternoon by a system with no understanding of the subject matter. The question of what that means for students reading your newsletter (some of whom may be Wharton students) seems worth more than a passing mention.
On your evaluation: the Otter Test measures whether image generators can render a composite visual prompt. That's a capability benchmark, and an important one. It tells you what the system can produce. It does not tell you anything about the gap between production and understanding, which is where the consequential failures live.
I ran a different kind of test earlier this year: one question ("My car is dirty. The carwash is 100 feet away. Should I walk or drive?"), 29 runs across 12 systems. This resulted in 6 passes, 10 outright failures, and a finding about thinking modes that contradicted every reasonable prediction. It reveals something the Otter Test cannot: that fluency and reasoning are not the same capability, and that the distance between them is where real-world harm originates.
The "jagged frontier" you describe is real. The question is whether we're measuring it with tools that can actually find the edges that matter.
What stood out to me is how quickly “interesting demo” is becoming “organisational problem.” The technology keeps moving; most companies’ workflows, controls, and decision-making don’t.
Regarding long form fiction writing: I've seen much weaker models punch above their weight with clever infrastructure built around them. OpenAI hasn't seriously tried to build an AI and app for long-form writing; it's bet big on coding and chatbots.
I suspect they've got it in them to build something that could cook up decent made-to-order genre fiction. (Ask any Warhammer 40k fan about the lore the Black Library publishes and you'll understand the bar isn't so high.) But it would not be trivial, and with all the overhead and tokens required, I'm less certain they could do so profitably. Even if they permitted outright smut, which moves a lot of paper but would be a dangerous gamble for their reputation, I don't think they could make it worth their time.
Thank you for the insights. I am not sure if I can transition to using AI as I already feel it moving away from me. But I love what you are doing and hopefully I'll get there in the end. Thank you again for the education 🙏
I like your trifecta of models, apps, harnesses. The app is the surface area with the most potential, IMO. Every model has the browser website, and now also a coding-specific app. The interesting lead here is from Anthropic, which has made subtle variations and created Cowork and Design as well. Recently, Andrej Karpathy also described his "Wiki" layer, which I can see as another 'app'.
On a separate note, I'm having déjà vu of déjà vu about everyone talking about how amazing 'this new' version is. Not saying it's false - it's just so familiar now. Your posts are intellectually honest, but the YouTube influencers are getting on my nerves (every video thumbnail is a surprised look with "<new model> just dropped and it's <insert adjective>"). 😅
Thanks for sharing your experiences with GPT-5.5.
I'm curious: I consider the image creation capabilities way less interesting than the LLMs. To me, creating images is a very narrow use case compared to stringing words together – in particular if you take into account that those words can be code, and that LLMs can use tools.
Do you agree that image capabilities are dwarfed by the LLM capabilities, when it comes to how useful (and thus interesting) they are?
Curious if you think Anthropic would have released this model publicly? Putting aside the compute/marketing-hype conspiracy theory, if we take Anthropic at its word re: cyber concerns, Mythos Preview's benchmark scores are higher than GPT-5.5's, but not by much. Could this public model be used to exploit vulnerabilities that Mythos found?
I always enjoy these breakdowns of what the frontier is doing...
But I'm still struck by how critical to success the "jagged edge" is in all of these instances. The first otter screenshot has 4 good pictures of otters using wifi on a plane - wouldn't we expect a transition? And the grading system below doesn't correspond to anything. The paper image at least has a transition, but it wasn't gradual: it was 1 garbage example and then 3 good ones. And the quality in no way matches what you actually found. The research paper is complex and technically impressive, but at the end of the day, it isn't interesting enough to be worth doing in the first place. And the tabletop game is, again, complex and technically impressive - but it sounds like the narrative of the world isn't something people would want to spend time in.
Not trying to poke holes just to poke holes. What all of these deficiencies tell me is that LLMs are still really struggling with non-verifiable tasks. And, for all the progress the labs have made, they have only closed the jaggedness by patching in more verifiable tasks. But a lot of the big questions that will presumably create immense value from these tools (solving cancer, creating billion-dollar companies, etc.) depend on the non-verifiable. More compute doesn't seem to be getting us any closer to that.
Sharp observation! I would push the distinction one step further: the gap isn't between verifiable and non-verifiable tasks, but between execution and judgment. The statistics are correct but the hypothesis is uninteresting; the rules are sound but the world isn't worth inhabiting. If more compute isn't closing that gap, it may be structural, not temporary. I write about AI: chorrocks.substack.com
I have one question. We keep seeing the models become better, but I'm interested to hear what you think they still have not figured out. What do you expect they will crack in a few months' time that seems impossible now? Or is it just a slow, steady (sometimes fast) increase in ability overall?
Not all 0.1 increments are created equal :)