What really struck me reading this is how quickly the problem shifts from model capability to organisational design.
Once AI can reliably handle multi-step work, the hard question stops being which system is smartest and becomes how companies actually structure delegation, supervision, and accountability when the “worker” is software.
It feels like a lot of organisations are still treating this as a tooling decision, when in practice it’s already becoming an operating model question. We’re seeing that shift happen pretty quickly inside enterprise teams now.
Interesting thought. We have decades of management science built around human workers — span of control, feedback loops, performance reviews. Almost none of it translates cleanly to agents that work at machine speed, don't get tired, and have no stake in the outcome.
The tooling-versus-operating-model distinction you draw feels like the kind of thing that's obvious in hindsight but hard to act on now — because the people making the tooling decisions often aren't the ones who'd need to redesign the operating model.
The organizations that crack this first hold a structural advantage, much as companies that mastered async remote work early saved on office space and accessed talent differently.
Curious what you're seeing in the enterprise teams making that shift.
Yeah I think that’s exactly the tension. Most organisations still have the tooling decisions sitting with IT or innovation teams, but the operating model implications land with the business leaders afterwards.
What we’re starting to see in the teams moving fastest is that they stop treating AI as a tool someone “uses” and start treating it more like capacity they allocate. So instead of asking who should do this task, the question becomes whether this is human work, AI work, or some combination, and then they build lightweight oversight around that.
It’s still early and pretty messy, but the common thread seems to be that the shift only really sticks once business units themselves start owning that allocation decision rather than it being driven centrally.
Yes! And the organizations that treat AI as capacity to allocate will eventually discover that their real bottleneck is not the capacity itself but the judgment layer that decides how to deploy it. You cannot delegate a problem you do not understand. Scoping what counts as "AI work" versus "human work" is itself an expert judgment call.
Which means the people making that allocation decision need the domain knowledge to scope and evaluate (to your point: yes, the people inside business units, rather than a central innovation team two layers removed from the actual work).
Yeah, that rings true. The bottleneck shifts to judgment pretty quickly.
What we tend to see is the teams that move fastest don’t try to define the boundary perfectly upfront. They run real work through it and let that judgment layer develop as they go. The ones waiting for a clean framework usually stall.
Love this! One thing I'd add is that while Gemini has the weakest general-purpose harness, the NotebookLM + Gemini integration is powerful (especially paired with Google's significantly larger context windows).
Previously, when my notebook couldn't answer something because it wasn't in my sources, I had to leave NotebookLM, search elsewhere, and reconcile manually. Now Gemini lets me combine my uploaded sources with live web information in a single conversation.
Two examples:
For podcast pre-production, I upload a guest's book, previous interviews, and biography into NotebookLM, then attach that notebook to Gemini and ask it to cross-reference my sources with their most recent public statements or interviews I haven't captured yet — so I can spot where their thinking has shifted and prepare sharper questions.
For lesson & curriculum design, I upload curriculum standards, past lesson plans, and student feedback. NotebookLM synthesizes gaps and aligns objectives, then in Gemini I can ask it to find current news / real-world examples from the web that bring a specific learning objective to life; without losing the grounding in my actual materials.
I agree with the models and apps; my only suggestion would be to also add Cursor. They've made interesting progress with async subagents and with their own model, Composer 1.5.
NotebookLM is an incredible and powerful tool to organize and engage with your own curated data. Its weakness, however, is the poor UI organization of the notebooks themselves. It's a gem inside and such a mess outside.
As someone who's always been prone to falling into Wikipedia rabbit holes, NotebookLM is a godsend. This feature has been by far one of my favorite applications of AI ever.
Also, thanks a lot for focusing on the top 3 and ignoring garbage like Grok.
Thank you, this is very insightful. I'm most interested in the enterprise solutions for these types of models. A lot of the clients I'm speaking to haven't even decided on an enterprise-grade model; if they have, they're using Microsoft Copilot, which you don't mention at all.
This signals a clear separation, and potential opportunity/risk areas, between individual workers trying to get efficiencies and enterprise workers trying to gain the same efficiencies under the constraints of company data privacy.
I've spent the last couple of weeks researching how this is playing out at large corporations, and even at small but mighty companies that have been around for 20+ years. They're all trying to solve the same AI tooling problems.
Thank you, Ethan, another useful piece!
Perfect guide; it works well even for those who have been asleep since Nov 2022 and just woke up in Feb 2026.
Superb and highly accurate, Ethan. This guide is a great service to the many people who are still judging outcomes using older, weaker models. I hope they read it and follow your advice.
A common question I get is: "It's all moving so fast, I don't want to get locked into any one frontier lab's ecosystem and models." Isn't it worth covering the "chat aggregators" that let you put the same query to multiple models (in series or parallel), like Poe or Perplexity, and, perhaps even more importantly, let you control, retain, and port your prompts and context?
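For what it's worth, the core aggregator pattern is simple enough to sketch yourself, which is also one way to keep prompts and context portable. Here's a rough, hypothetical fan-out that sends the same prompt to two providers in parallel; the model names are assumptions, and both official SDKs read their API keys from the environment.

```python
# Minimal sketch of the "same query to multiple models in parallel" pattern.
# Assumes `pip install openai anthropic` and OPENAI_API_KEY / ANTHROPIC_API_KEY
# set in the environment. Model names are illustrative placeholders.
import asyncio

from anthropic import AsyncAnthropic
from openai import AsyncOpenAI

PROMPT = "Summarize the tradeoffs of multi-agent AI systems in three bullets."

async def ask_openai(prompt: str) -> str:
    client = AsyncOpenAI()
    resp = await client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

async def ask_anthropic(prompt: str) -> str:
    client = AsyncAnthropic()
    resp = await client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.content[0].text

async def main() -> None:
    # Fan the identical prompt out to both providers concurrently; the prompt
    # (your portable asset) lives in one place, in code you control.
    answers = await asyncio.gather(ask_openai(PROMPT), ask_anthropic(PROMPT))
    for name, answer in zip(["openai", "anthropic"], answers):
        print(f"--- {name} ---\n{answer}\n")

asyncio.run(main())
```

The aggregators add a UI and shared history on top, but the lock-in question really comes down to who holds the prompts and the conversation state; in a sketch like this, you do.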
thanks
Ethan,
As usual, your experiential wisdom is of great value for understanding and thinking through the present state of things and the potential for what could be. Once again, I marvel at your insights. Thank you, sir!
Re: the GPT-1 book, how do you know that the numbers are actually correct and not just made up?