125 Comments
Josh Rowe

What really struck me reading this is how quickly the problem shifts from model capability to organisational design.

Once AI can reliably handle multi-step work, the hard question stops being which system is smartest and becomes how companies actually structure delegation, supervision, and accountability when the “worker” is software.

It feels like a lot of organisations are still treating this as a tooling decision, when in practice it’s already becoming an operating model question. We’re seeing that shift happen pretty quickly inside enterprise teams now.

Eva Keiffenheim MSc

Interesting thought. We have decades of management science built around human workers — span of control, feedback loops, performance reviews. Almost none of it translates cleanly to agents that work at machine speed, don't get tired, and have no stake in the outcome.

The tooling-versus-operating-model distinction you draw feels like the kind of thing that's obvious in hindsight but hard to act on now — because the people making the tooling decisions often aren't the ones who'd need to redesign the operating model.

The organizations that crack this first hold a structural advantage, similar to how companies that mastered async remote work early saved on office space and accessed talent differently.

Curious what you're seeing in the enterprise teams making that shift.

Josh Rowe

Yeah I think that’s exactly the tension. Most organisations still have the tooling decisions sitting with IT or innovation teams, but the operating model implications land with the business leaders afterwards.

What we’re starting to see in the teams moving fastest is that they stop treating AI as a tool someone “uses” and start treating it more like capacity they allocate. So instead of asking who should do this task, the question becomes whether this is human work, AI work, or some combination, and then they build lightweight oversight around that.

It’s still early and pretty messy, but the common thread seems to be that the shift only really sticks once business units themselves start owning that allocation decision rather than it being driven centrally.

Eva Keiffenheim MSc

Yes! And the organizations that treat AI as capacity to allocate will eventually discover that their real bottleneck is not the capacity itself but the judgment layer that decides how to deploy it. You cannot delegate a problem you do not understand. Scoping what counts as "AI work" versus "human work" is itself an expert judgment call.

Which means the people making that allocation decision need the domain knowledge to scope and evaluate (to your point - yes the people inside business units, rather than in a central innovation team two layers removed from the actual work).

Josh Rowe

Yeah, that rings true. The bottleneck shifts to judgment pretty quickly.

What we tend to see is the teams that move fastest don’t try to define the boundary perfectly upfront. They run real work through it and let that judgment layer develop as they go. The ones waiting for a clean framework usually stall.

FOliver

When does "scoping what counts as 'AI work' versus 'human work'", itself an expert judgment call, become AI work? Seems like something AI would do well.

Sonya Choi La Rosa

Thanks for sharing the question framing here: is it human work, AI work, or some combination, and then build the supervision and oversight around that.

FOliver

You’re assuming the supervision and oversight will be done by a human. Maybe, until your boss realizes it can be done by AI.

Nabil Al-Khayat

Your point about judgment being the real bottleneck resonates a lot.

Once AI becomes a form of capacity, the key question becomes who defines the boundary between AI work and human work.

That boundary is not technical.

It’s organizational and epistemic.

You need people who understand both the domain and the capabilities of the system well enough to scope the problem correctly.

Without that layer, the organization doesn’t really know what it is delegating.

Catlike1

I think something else that we aren't even close to considering is the question of responsibility and liability. Who is responsible for the consequences of decisions AI "makes," the "actions" it takes, or the products it creates? Who is liable for any harm (or contractual violations) that occurs?

Josh Rowe

Ah interesting, sounds like you were early on that angle. Feels like the organisational side is only just starting to land for most teams now.

Tina Austin

I love that. I recently wrote about the same issue, but mine was more focused on why organizations fail: https://tinaaustin.substack.com/p/they-bought-the-ai-for-the-institution?r=4a98uc

Nabil Al-Khayat

Your framing of AI as “capacity to allocate” is a really useful way to think about it.

What becomes interesting is that once AI becomes capacity, the critical skill shifts to deciding where that capacity should be deployed.

That is less a tooling problem and more a governance problem.

Who has the authority to allocate agent capacity?

How is oversight structured?

And how do teams verify outcomes when execution happens faster than human review cycles?

I suspect organizations that solve that layer will move much faster than those still debating models.
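
One concrete shape such a decision-and-governance layer could take is an approval gate: low-risk agent actions execute immediately, while higher-risk ones queue for a human, so oversight doesn't have to run at machine speed. A minimal, purely illustrative Python sketch (the class names, risk scores, and threshold are hypothetical, not any real product's API):

```python
# Illustrative sketch of a human-in-the-loop approval gate for agent actions.
# Everything here is hypothetical: risk scores are assumed to come from some
# upstream policy or classifier, and the 0.5 threshold is arbitrary.
from dataclasses import dataclass, field
from queue import Queue


@dataclass
class AgentAction:
    description: str
    risk_score: float  # 0.0 (trivial) to 1.0 (high impact)


@dataclass
class ApprovalGate:
    risk_threshold: float = 0.5
    pending_review: Queue = field(default_factory=Queue)

    def submit(self, action: AgentAction) -> str:
        """Auto-approve low-risk actions; park high-risk ones for human review."""
        if action.risk_score < self.risk_threshold:
            return "executed"
        self.pending_review.put(action)
        return "awaiting human approval"


gate = ApprovalGate()
print(gate.submit(AgentAction("Reformat an internal wiki page", 0.1)))  # executed
print(gate.submit(AgentAction("Send a contract to a customer", 0.9)))   # awaiting human approval
```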

Marco Gentile

This is a great point. When the ‘worker’ becomes software, delegation stops being only an organizational question and also becomes a cultural one: how the presence of automated agents gradually reshapes the environments in which people interpret, decide, and imagine possible courses of action.

Eva Keiffenheim MSc

Love this! One thing I'd add is that while Gemini has the weakest general-purpose harness, the NotebookLM + Gemini integration is powerful (especially paired with Google's significantly larger context windows).

Previously, when my notebook couldn't answer something because it wasn't in my sources, I had to leave NotebookLM, search elsewhere, and reconcile manually. Now Gemini lets me combine my uploaded sources with live web information in a single conversation.

Two examples:

For podcast pre-production, I upload a guest's book, previous interviews, and biography into NotebookLM, then attach that notebook to Gemini and ask it to cross-reference my sources with their most recent public statements or interviews I haven't captured yet — so I can spot where their thinking has shifted and prepare sharper questions.

For lesson & curriculum design, I upload curriculum standards, past lesson plans, and student feedback. NotebookLM synthesizes gaps and aligns objectives, then in Gemini I can ask it to find current news / real-world examples from the web that bring a specific learning objective to life, without losing the grounding in my actual materials.

Fausto

You can also just tell NotebookLM to "use external knowledge/ go beyond the sources". It will do so and point out which information is not backed by your sources. :D

Eva Keiffenheim MSc

This is partially true but misleading.

NotebookLM is designed to stay grounded in your uploaded sources, and it does not use external knowledge by default.

There is no official setting to “use external knowledge” in the UI; what people are doing with prompts like “use external knowledge / go beyond the sources” is essentially leaning on the underlying Gemini model in a way that isn’t a supported, reliable feature.

Google’s supported way to bring in outside information is via features like Discover Sources and Deep Research, which search the web and then add new documents as sources to your notebook. Once those are added, NotebookLM goes back to doing what it’s meant to do: answering questions grounded in the sources it can actually cite.

Fausto

Yeah, it is just a workaround. In my workflow, I happen to need most of the output to be constrained by the sources and some not. When prompting it that way, it uses the knowledge embedded in the LLM.

Maria

Thank you for shedding some light on a confusing landscape. I am 65, semi-retired, curious, and reasonably tech savvy. I use all the free AI models you mention, Perplexity, and sometimes Grok. I do not do deep research or coding, and they serve my needs well. I also use NotebookLM, which is amazing. I subscribe to the Neuron, the Rundown AI, and Superhuman to stay informed, but sometimes the amount of information is dizzying. Your newsletter brings clarity to my world. You are such a clear thinker and explainer that you actually make me feel a bit smarter for (kinda) understanding what is going on ;-)

Dr Sam Illingworth

Thanks Ethan, this is a great post.

I'm a full professor at a UK university, and in the past month of using Claude Code, I can already see that it is going to revolutionise how research is done. Also, I wake up every single morning excited to get going on the projects that I've had in my mind for years but have lacked the technical knowledge to complete. I have the pro max version, and even though it seems like a large investment, it is worth every single penny.

Ces Michelin

NotebookLM is an incredible and powerful tool to organize and engage with your own curated data. Its weakness, however, is the poor UI organization of the notebooks themselves. It's a gem inside and such a mess outside.

Sean Lawson, Ph.D.

Absolutely agree. We need the ability to organize notebooks and then also to organize the sources inside the notebooks.

Alexis Abraham

A common question I get is "it's all moving so fast, I don't want to get locked in to any one frontier lab's ecosystem and models" - isn't it worth covering the 'chat aggregators' like POE / Perplexity that let you put the same query to multiple models (in series or parallel), and perhaps even more importantly let you control / retain / port your prompts and context?
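
For what it's worth, the "same query to multiple models" part can also be done without an aggregator. A minimal sketch, assuming the official OpenAI and Anthropic Python SDKs with API keys already set in the environment; the model names are placeholders, not recommendations:

```python
# Fan the same prompt out to two providers in parallel and collect both answers.
# Assumes OPENAI_API_KEY and ANTHROPIC_API_KEY are set; model names are placeholders.
from concurrent.futures import ThreadPoolExecutor

import anthropic
from openai import OpenAI

PROMPT = "Summarize the trade-offs of vendor lock-in for LLM tooling."


def ask_openai(prompt: str) -> str:
    client = OpenAI()
    resp = client.chat.completions.create(
        model="gpt-4o",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content


def ask_anthropic(prompt: str) -> str:
    client = anthropic.Anthropic()
    resp = client.messages.create(
        model="claude-sonnet-4-5",  # placeholder model name
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.content[0].text


with ThreadPoolExecutor() as pool:
    futures = {"openai": pool.submit(ask_openai, PROMPT),
               "anthropic": pool.submit(ask_anthropic, PROMPT)}
    answers = {name: f.result() for name, f in futures.items()}

for name, answer in answers.items():
    print(f"--- {name} ---\n{answer}\n")
```

What aggregators add on top of a fan-out like this is the shared interface and, as you say, the ability to retain and port prompts and context across providers.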

Sheriece Green

Thank you, this is very insightful. I'm most interested in the enterprise solutions for these types of models. A lot of the clients I'm speaking to haven't even decided on an enterprise-grade model. If they have, they're using Microsoft Copilot, which you don't mention at all.

This signals a clear separation, and potential opportunity/risk areas, between individual workers trying to get efficiencies and enterprise workers operating under the constraints of company data privacy while trying to gain those same efficiencies.

I've been spending the last couple of weeks researching how this is playing out at large corporations and even small-but-mighty companies that have been around as small businesses for 20+ years. They're all trying to solve the same AI tooling problems.

Hugo Acosta

As someone who's always been prone to falling into Wikipedia rabbit holes, I find NotebookLM a godsend. This has been by far one of my favorite applications of AI ever.

Hugo Acosta

Also thanks a lot for focusing on the top 3 and ignoring garbage like Grok

Adrianno Esnarriaga

I agree with the models and apps; my only suggestion would be to also add Cursor. They've made interesting progress with async subagents and in their own model (Composer 1.5).

Chris pickering

Ethan, I would be very interested to hear your take on Microsoft Copilot. It currently uses ChatGPT, and I believe Claude integration is in the pipeline, though Microsoft also puts some customization on that harness.

I ask because at my work - possibly like many other people's - Copilot is the primary place for AI access. Officially at least, Copilot is 'how' we do AI, even though it isn't a model. Would love to know your thoughts on this - and on the Office integrations.

derdide

I totally second that request.

I have been using M365 Copilot for a year and I am using Claude Team in parallel - now connected to M365 - and Claude is simply way ahead.

Copilot is "good" for simple chat/search queries - "draft that email reply", "translate that paragraph", "summarize this meeting transcript", "can you find all emails related to that topic" - anything that does not require long back-and-forth exchanges with the model. But it fails miserably on anything beyond that - despite being the first to actually have had the harness, for more than a year. And it still fails to properly work with the full M365 ecosystem (data in OneNote is unknown to it - same for Claude, but at least Claude can tell the data is there). When I see everything Anthropic could build on top of Claude (Desktop, projects, the Excel and PPT add-ons, Code of course - I haven't tried Cowork yet), the disappointment with Microsoft is even bigger - I mean, come on.

But I'm fighting hard with internal IT to keep using Claude - "Copilot is our standard", "I tested it and it is similar", "'I feel it is better' is not an argument", etc. - and MS marketing is very good at promoting it, so any "good" and serious assessment would be very welcome.

Jim O'Leary

+1 for the 'how does MSFT Copilot fit in here' request and are there ways to use it that get close to what Claude Code/Cowork or Codex can do? I'm also feeling pressure around the 'Copilot is our standard'...

Theoretically I imagine it could be an even better tool for those living within the MSFT garden (for accessing Outlook calendars, emails, etc.), but it seems like most people doing something cool are using a non-MSFT set of tools...

Mato Vasko

Well, I have been using ChatGPT Plus, M365 Copilot, and Perplexity Pro from the very beginning, and I think Microsoft's overboosted marketing did more harm than good, as did the significant difference in performance between prompts. Microsoft started mixing models way sooner than everybody else, even before reasoning models, without notifying users which model was being used, because everything was "Copilot". I understand that they are not a startup but a publicly traded company and have to cover their costs, so they were trying to use cheaper models whenever possible, but it created mixed feelings about Copilot among users who didn't know about this "feature".

Currently, the gpt-5.2 thinking model mentioned above is available to M365 Copilot users in Copilot Chat, but if you want to use it you have to know what you are doing, because it is hidden under two dropdown menus.

Similarly, they started a cooperation with Anthropic, publicly announced in November 2025, and Anthropic's models Sonnet 4.6 and Opus 4.6 are available in preview. You can choose Opus in Agent mode in Excel and PowerPoint. The problem is that these models are turned off at the tenant level by default and need to be turned on by IT admins. As these models are still in a preview or experimental stage within M365 Copilot, it will take some time until they are used in enterprises. On the other hand, skilled SMBs can turn them on and use them. So two of the three major models mentioned (gpt-5.2 thinking, Opus 4.6) are available, but well hidden in M365 Copilot.

Just to add that these models are available in Copilot Studio and AI Foundry for AI agent building as well.

More details:

https://learn.microsoft.com/en-us/microsoft-copilot-studio/authoring-select-agent-model

derdide

True. MS marketing was (and still is) promising miracles, but Claude is the one delivering them day to day. It is true that there are ways to get these in a full MS environment, but it is an uphill battle.

Michael Price

I wanted to add the Internet of Bugs video on the 'novel physics' headline and how it is not inherently misleading, but definitely deliberately grandiose.

https://www.youtube.com/watch?v=3_2NvGVl554

"It's an ad in the form of a physics paper"

Not that these tools aren't amazing, but I think perpetuating the hype that they are breaking through certain boundaries is not helpful for understanding where we CURRENTLY are, compared to where the hype implies we are.

Swaroop D

Perfect guide - works well even for those who have been asleep since Nov 2022 & woke up in Feb 2026.

Jon McAuliffe

Superb and highly accurate, Ethan. This guide is a great service to the many people who are still judging outcomes using older, weaker models. I hope they read it and follow your advice.

Josh Igoe

Thank you Ethan, another useful thing!

Dr Laura Leighton

Your model–app–harness framework captures an important shift: capability is no longer constrained by model intelligence but by how effectively that intelligence is operationalized.

What's becoming visible underneath this, though, is a deeper constraint: decision latency.

Agentic systems collapse execution latency. They can research, build, test, and redeploy faster than human execution cycles ever allowed. But they do not collapse decision latency. They expose it.

The harness determines what the agent can do. The decision pathways determine what the organization can accept, evaluate, and redeploy. When those pathways are clean, agents compound productivity. When they are fragmented, agents amplify indecision, rework, and organizational drag.

This is why the same model, operating in different harnesses, produces dramatically different real-world outcomes. The harness governs execution capacity, but the surrounding decision architecture governs whether execution can propagate.

We’re moving into a phase where model capability is no longer the primary constraint. Decision pathway integrity is.

The organizations that benefit most from agentic systems won’t be those with access to frontier models. They’ll be those whose internal decision pathways can keep pace with agentic execution.

Nabil Al-Khayat

Your point about decision latency is extremely important.

Agentic systems collapse execution time but expose the structure of organizational decision making.

What becomes visible is that most organizations were never designed to process decisions at machine speed.

So agents amplify whatever governance structure already exists.

If decision pathways are clear, agents compound productivity.

If they are fragmented, agents simply scale confusion.

That’s why I suspect the next wave of AI infrastructure will not just be better models or harnesses, but decision and governance layers around agent systems.

Dr. James W Michel

We are scaling intelligence by law.

But responsibility is not scaling with it.

So here’s a contribution for the OWASP era of autonomous systems:

Operator Class: a governance layer for agentic security.

Responsibility cannot be delegated.

https://doctorjamesmichel.substack.com/p/operator-class-a-governance-layer