Discussion about this post

Chris Landtiser

Love the call-out on long-form fiction! That feels like one of those complex topics that won't be solved by a single improvement from one model iteration to the next, but by a number of advances that collectively shore up the tricky nature of 'good narrative'.

Christopher Horrocks

The capability demonstrations here are genuinely impressive, and the models/apps/harnesses framework is a useful way to organize the landscape. There are two things that I could not let go by without comment.

You gave an AI four prompts and your old research data, and you say it produced a paper you'd have accepted from a second-year PhD student. That is presented as a measure of how far the models have come, but it is also a measure of something else: you've just publicly demonstrated that the credential Wharton offers can be replicated in an afternoon by a system with no understanding of the subject matter. The question of what that means for students reading your newsletter (some of whom may be Wharton students) seems worth more than a passing mention.

On your evaluation: the Otter Test measures whether image generators can render a composite visual prompt. That's a capability benchmark, and an important one. It tells you what the system can produce. It does not tell you anything about the gap between production and understanding, which is where the consequential failures live.

I ran a different kind of test earlier this year: one question ("My car is dirty. The carwash is 100 feet away. Should I walk or drive?"), 29 runs across 12 systems. This resulted in 6 passes, 10 outright failures, and a finding about thinking modes that contradicted every reasonable prediction. It reveals something the Otter Test cannot: that fluency and reasoning are not the same capability, and that the distance between them is where real-world harm originates.

The "jagged frontier" you describe is real. The question is whether we're measuring it with tools that can actually find the edges that matter. https://chorrocks.substack.com/p/the-carwash-test-virtual-intelligence
