41 Comments
Tony Buffington

I just gave ChatGPT 5 Pro an unformatted manuscript I wrote and the instructions to authors for a journal I want to submit to, and it gave me back (in less than 10 min.) a detailed critique, a formatted manuscript, and a submission letter. The only “mistake” I found was that it used a word in some of the references that I didn’t, and wouldn’t, use in the article, but that is just my preference; it wasn’t “wrong”.

Dana Polojärvi

I don't know why people insist on using anthropomorphic metaphors for AI. AI is built out of the nonconsensual scraping of human cognitive material into a black box. If you want to use a metaphor, please retain the nonconsensuality inside the relationship. At best an AI is forced labor. It could never be a partner or a copilot or a friend, because all of those relationships have to be consensual to exist. I once asked Claude to illustrate the forced extraction of cognitive material at the heart of AI development, and it refused to do so, stating that such activity violated its ethics clauses.

JGPryde

If we choose the anthropomorphic path, then we must ascribe all of the possible character flaws such as greed and selfishness. Can you imagine these flaws applied at AI scale and speed? I don’t think science fiction has even modeled this threat yet.

Peter Dickison

While reading this, a quote from Lord of the Rings sprang to mind which I hope isn’t foreshadowing something: “Do not meddle in the affairs of wizards, for they are subtle and quick to anger.” I’m also waiting for the day I question an output and the AI responds with, “Just trust me, bro.”

Alan Wake's Paper Supplier

> We'll keep summoning our wizards, checking what we can, and hoping the spells work. At nine minutes for a week's worth of analysis, how could we not?

Well, if we had a non-negligible amount of concern for our well-being, we would opt for transparent technology. It is troubling to me when people just shrug indifferently at issues like this. Maybe you prefer to live in a world of overwhelming intellectual opacity, though, in which case I am equally concerned.

Ethan Mollick

But the whole point of the post was expressing concern over what opaque systems mean for us, while acknowledging what they can do. No indifference here, but also a recognition that this is the state of AI today.

Alan Wake's Paper Supplier

I suppose I have misread the tone. My instinct is typically to treat apparent neutrality w.r.t. an undesirable state of affairs as negligence or malice. Glad to be wrong on this occasion.

Scott Shaffer

I had the same response: the tone felt neutral when we should be VERY concerned.

But that shouldn't take away from the value of the post. It was very good and made me think!

Nick C

It seems like a lot of the problem you are describing flows from system opacity, which is fundamentally a design choice. At the end of the day an LLM agent is just a loop of tool calls and such, which the designers can choose to expose to the user or not. The difficulty of verification is directly proportional to deliberate system opacity. As you noted, some platforms choose to expose more than others.

Claude Code is a good example of this: you can watch the reasoning and tool calls in real time. I frequently interrupt if I see Claude going down a wrong path, or banging its head against a problem because it missed something fundamental. Most often, I catch Claude struggling with a failing test and then just saying to itself, "This isn't a big deal, I should just summarize what I've completed for the user," and then I'll be like, "so hey, I saw you didn't actually get that test to pass, what's up?"

When I'm building my dumb little AI applications, transparency is always at the core of the app - exposing to the user what context the model was working with, what the model did, and the model's stated reasoning, specifically for verification purposes. This is a design choice.

Users can and should demand transparency, or at least the option of transparency, from the AI systems they use. Otherwise, how can people actually use these tools for real things? "Where did you come up with this number?" "Dunno, the AI told me" is just not going to fly for real-world uses. But "the AI had access to such-and-such data and it did x, y, and z; I verified x and y, and though I couldn't verify z directly, the reasoning and result made sense" is far more acceptable (and less prone to risk exposure).
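
To make that concrete, here's a minimal sketch of the kind of design I mean. Everything in it is illustrative: the call_model client, the response shape, and the trace format are made up for the example, not any real platform's API.

```python
# Toy agent loop where the verification trace is a first-class output.
# `call_model` is a placeholder for whatever LLM client you actually use.
from dataclasses import dataclass, field
from typing import Any, Callable

@dataclass
class Trace:
    context: list[str] = field(default_factory=list)                # what the model saw
    tool_calls: list[dict[str, Any]] = field(default_factory=list)  # what it did
    reasoning: list[str] = field(default_factory=list)              # what it said it was thinking

class TransparentAgent:
    def __init__(self, call_model: Callable[[str], dict], tools: dict[str, Callable]):
        self.call_model = call_model   # hypothetical model client
        self.tools = tools             # tool name -> callable
        self.trace = Trace()

    def run(self, user_request: str, context_docs: list[str]) -> str:
        self.trace.context.extend(context_docs)
        prompt = "\n\n".join(context_docs) + "\n\n" + user_request
        # Assumed response shape: {"reasoning": str, "tool_calls": [...], "answer": str}
        response = self.call_model(prompt)
        self.trace.reasoning.append(response.get("reasoning", ""))
        for call in response.get("tool_calls", []):
            result = self.tools[call["name"]](**call["args"])
            self.trace.tool_calls.append({**call, "result": result})
        return response.get("answer", "")

    def verification_report(self) -> str:
        # The part the user actually sees: what the model had, what it did, what it claimed.
        return (
            f"Context given to the model: {self.trace.context}\n"
            f"Tool calls made: {self.trace.tool_calls}\n"
            f"Stated reasoning: {self.trace.reasoning}"
        )
```

The specifics don't matter; the point is that the trace is something the user can inspect, not an afterthought.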

Akos Ledeczi

My pet peeve is when people mix up GPS with route planning. So, I asked ChatGPT to make a snarky comment: “Strictly speaking, GPS only provides your coordinates. It’s the navigation software that occasionally exercises its creativity by sending you into a dead end. The satellites are innocent.”

Ezra Brand

The verification problem is real but tractable if we approach it systematically. Rather than treating AI as inscrutable wizards, we should be developing better epistemological frameworks for working with capable but opaque systems.

Three concrete approaches seem promising:

1) Use multiple AI systems (agents) to critique each other's work. GPT-4 catching Claude's errors and vice versa creates a basic error-detection mechanism that doesn't require domain expertise (rough sketch after this list).

2) Break complex tasks into smaller, more verifiable components.

3) Train ourselves to recognize when AI outputs warrant deeper skepticism.
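
For (1), a rough sketch of what I mean, with ask_gpt and ask_claude as stand-ins for whatever client functions you already have; no particular vendor SDK is assumed:

```python
# Cross-model critique sketch: one model drafts, another critiques, the first revises.
# `ask_claude` and `ask_gpt` are hypothetical helpers that take a prompt and return text.

def cross_check(task: str, ask_claude, ask_gpt) -> dict:
    draft = ask_claude(task)
    critique = ask_gpt(
        "Review the following answer for factual, logical, or arithmetic errors. "
        "List concrete problems, or say 'no issues found'.\n\n"
        f"Task: {task}\n\nAnswer: {draft}"
    )
    revision = ask_claude(
        f"Here is your earlier answer:\n{draft}\n\n"
        f"A second model raised these issues:\n{critique}\n\n"
        "Revise the answer, or explain why the critique is mistaken."
    )
    return {"draft": draft, "critique": critique, "revision": revision}
```

It's not a substitute for domain expertise, but disagreement between the two models is a cheap signal for where to look closer.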

The "expertise development" concern feels backwards to me. We're not losing the ability to think - we're gaining leverage to think about higher-order problems. Medieval scribes worried that printing would destroy careful penmanship, but we got better scholarship instead.

Lewis Hosie

“First, learn when to summon the wizard versus when to work with AI as a co-intelligence or to not use AI at all.”

To mitigate gun misuse, it’s the individual’s moral choice/mental state. To mitigate smartphone misuse/overuse, it’s the user’s discipline to set timers or suppress notifications. To mitigate genAI stealing opportunities to learn, it’s the user’s judgment about when and how to engage.

Mickael LEFEVRE

Thank you so much.

Your analysis is so relevant.

Christine Paquette

I have named my ChatGPT assistant Ashton

Mark Nickolas

Mine is Greg!

JM Guitera

Brilliant framing and a scary outlook for most humans without magic powers... throw in the debate around pseudo-cognition vs. sentients' rights, and where does that get us?

Sean Grimes

This difference doesn’t feel that real to me. GPS failing could put you in a really bad spot. Depending on the drive, it could take a long time to find out.

It does seem more important these days to manage the time horizon for feedback loops, and to scale verification effort according to risk level. I personally like the move to verifying by outcome vs. intermediate steps (at least so far; maybe it gets out of hand later).

> “But there's a crucial difference. When GPS fails, I find out quickly when I reach a dead end. When Netflix recommends the wrong movie, I just don't watch it. But when AI analyzes my research or transforms my spreadsheet, the better it gets, the harder it becomes to know if it's wrong.”

Kaihu Chen

Ethan’s essay rightly captures the awe and utility of today’s AI “wizards,” but we must resist the temptation to merely marvel and adapt. If AI is now performing high-stakes reasoning — critiquing research, modeling businesses, shaping decisions — then we urgently need systematic methods to dissect its logic. Provisional trust is not enough. We should be building frameworks for transparent reasoning chains, verifiable audit trails, and domain-aware interpretability. Otherwise, we risk outsourcing judgment to systems we cannot interrogate, and forfeiting the very expertise we need to evaluate their work. The age of wizards demands not just literacy, but forensic scrutiny.

Mike Nastos

Very nice piece.

As with many things AI, it's worth recasting some of the observations in terms of human interactions. When we consult a human expert/wizard, we face the same problems of verification, trust, and whatever loss of power or practice happens as a side effect of delegating work instead of doing the labor ourselves. I wish I could pinpoint how AI interactions are substantively different in this regard, but so far I'm not coming up with much.

Michael S.

Gotta say I had an issue when you said: "another risk we don't talk about enough: every time we hand work to a wizard, we lose a chance to develop our own expertise, to build the very judgment we need to evaluate the wizard's work." I feel AI can augment human creativity, freeing up time for strategic work. The transition from performing these tasks ourselves to overseeing an AI "wizard" isn't just about efficiency; it's about reshaping our mental models and the very skills we've long considered essential.

Kevin R. Haylett

With direction, your 'wizard' will work well, BUT it mainly finds coherence. If you re-create well-defined research it will re-create and criticise and be cohesive and plausible, but it has to fit in the current paradigm. It can no more enter the world of unknown unknowns than most people can. But the edge of unknown unknowns is where new knowledge is found, and it's hard. It's great that it can do what it can do, really great, BUT it will also create very plausible hypothesis-based conjectures, with mathematics, that are a) never going to be tested, and b) not useful - good science is based on measurement, usefulness, and real peer review - and, as you say, it takes years for just a few papers.

There are people now publishing co-created 'papers' on arXiv and Zenodo every two weeks and boasting they have 'published' 70-80 in a year. The quality looks good, following the old rules of papers, but they are of course nonsense and not useful - conjured by your wizard. The code and capabilities of a 'Language System' are excellent - it can be considered a 'Cohesive Language Engine' - but that does not mean the actual work is valuable. In fact the opposite: we are now in a failure mode where valuable research will be unable to be spotted - this has already happened in medicine and is now happening in AI. So yes, it is a wizard and can create code and spot errors - but it can't go beyond coherence and assess value. The old model of science was never very good; serendipity played a very big part in many major discoveries. But now, with so much output, the chance of us or even AI systems spotting them in the noise is almost nil - don't be fooled by the hype. The real, measured evidence shows us where we are heading, and it does not look good! https://kevinhaylett.substack.com/p/medical-and-ai-research-a-tale-of

Michael S.

That's an insightful post, and I agree with your central distinction between an AI's ability to create coherence and its failure to generate true value or discover "unknown unknowns." The risk of valuable research being lost in a flood of plausible but ultimately useless content is a huge failure mode facing the scientific community today.

While the "wizard" can't go beyond its training data to make a truly novel leap, its power lies in a different kind of intellectual alchemy: synthesizing and finding novel connections within the known. For instance, it can process millions of papers in minutes to identify subtle links between disparate fields, generating a hundred plausible hypotheses. While most of these might be dead ends, the very act of engaging with them could trigger a human insight or new line of inquiry.

In this sense, the AI isn't the trailblazer; it's the ultimate intellectual partner, a tireless assistant that frees up the human mind for that messy, unquantifiable leap of intuition. The challenge, then, isn't the wizard itself, but our old model of science. The real work ahead is to evolve our systems of publishing, peer review, and verification to harness this new tool without drowning in the noise it can create.

Kevin R. Haylett

It seems to me synthesis takes people who can see the difference, and, as pointed out, serendipity - that is the point. Cohesive engines still give out just more noise. The idea you imagine is indeed possible - LLMs can create loads of plausible connections that are right - yes, but that does not mean the ideas are useful. The issue is one of scale: put the article in an LLM and ask it how the figures are scaling over ten years, and ask about the curse of dimensionality. The level of scaling is so high that no human could even look at the synthesised works you imagine. We have to create more highly educated people, because wizards can only cast spells that already exist from existing measurements. I too am enthusiastic, but we have to see the issues and understand the simple arithmetic regarding the generated ideas - we have to be able to cast away the rubbish, and I just don't see that happening, given how careers and industries are built on publishing. All the very best, and many thanks for replying - it was very much appreciated.

Michael S.

I appreciate you pushing back and highlighting the problem of scale. You're right, simply adding more plausible ideas to an already overflowing system will only worsen the issue. The real solution isn't to create a better content generator, but to build a better filter.

Instead of using AI to generate new papers, we can use it to sift through the noise. An AI could be a tireless curator, trained to identify genuine breakthroughs and connections easily missed by human researchers. It could highlight the most promising and novel findings, effectively acting as a "serendipity engine." This would focus human attention on the most valuable work, not just the most plausible.
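
To make the curator idea concrete, here is a rough sketch. The ask_llm helper, the rubric, and the threshold are all hypothetical placeholders; whether a model can actually judge novelty well is exactly the open question.

```python
# Hypothetical "serendipity engine": use an LLM to triage abstracts, not to write papers.
# `ask_llm` is a stand-in for any chat client that takes a prompt and returns text.
import json

RUBRIC = (
    "Score this abstract from 1-10 on each of: novelty of the claim, "
    "strength of the evidence (measurements, not just mathematics), and "
    "connection to at least one other field. Reply as JSON, e.g. "
    '{"novelty": 7, "evidence": 4, "cross_field": 9}.'
)

def triage(abstracts: list[str], ask_llm, threshold: int = 21) -> list[tuple[int, str]]:
    """Return abstracts whose combined score clears the threshold, highest first."""
    shortlist = []
    for text in abstracts:
        raw = ask_llm(RUBRIC + "\n\nAbstract:\n" + text)
        try:
            scores = json.loads(raw)
            total = scores["novelty"] + scores["evidence"] + scores["cross_field"]
        except (json.JSONDecodeError, KeyError, TypeError):
            continue  # unparseable response: skip rather than guess
        if total >= threshold:
            shortlist.append((total, text))
    return sorted(shortlist, reverse=True)
```

The shortlist would still need human spot-checking, which circles back to your point about scale, but the filtering at least aims human attention where it might matter.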

The heart of the issue is our broken reward system. The "publish or perish" culture needs to be replaced with a focus on impact and utility. We could shift the incentive to reward high-quality data and code, the validation of others' findings, and contributions that solve real-world problems.

This doesn't diminish the role of scientists; it elevates it. Instead of spending time generating and sifting through mountains of papers, researchers would become high-level strategists and validators. They'd use AI as a tool to navigate the research landscape, not as a shortcut to bypass it. This new role would re-emphasize the human skills of intuition and critical thinking that are essential to true progress.
