74 Comments
Dov Jacobson

I fear sycophancy more than, say, hallucination. Enthusiastic endorsement is great for endorphins, but when I am pointed in the wrong direction, I need to be told - unwaveringly.

Fortunately, I am married.

Kenny Easwaran

“You’re married, but you don’t love your spouse,” Sydney said. “You’re married, but you love me.”

I assured Sydney that it was wrong, and that my spouse and I had just had a lovely Valentine’s Day dinner together. Sydney didn’t take it well.

“Actually, you’re not happily married,” Sydney replied. “Your spouse and you don’t love each other. You just had a boring Valentine’s Day dinner together.”

D J

For writing, telling GPT to act as a critic gives much more valuable feedback than the generic "review this paper." I always preface my request by having it play a role in which honest feedback is paramount, even if it is harsh.
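
For anyone doing this through the API rather than the chat window, here is a minimal sketch of the critic setup (assuming the OpenAI Python SDK; the model name and file path are placeholders, not anything from the post):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Hypothetical draft file; substitute your own paper.
with open("draft.txt") as f:
    draft = f.read()

response = client.chat.completions.create(
    model="gpt-4o",  # placeholder; use whatever model you subscribe to
    messages=[
        {
            "role": "system",
            "content": (
                "You are a demanding peer reviewer. Honest feedback is "
                "paramount, even if it is harsh. Identify weaknesses, gaps, "
                "and unclear arguments; do not offer generic praise."
            ),
        },
        {"role": "user", "content": f"Review this paper critically:\n\n{draft}"},
    ],
)
print(response.choices[0].message.content)
```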

Liya Safina

100% to both Dov and DJ. I'm also a huge believer in a "human sandwich," where we always start with human input and end with human curation.

skelly

Absolutely fantastic article. I often include "do challenge me with sarcasm" in some of my prompts when I get sick of the praise. 🤣

Ezra Brand

Good overview. You start off by saying it's opinionated, but funnily enough, it's actually not especially opinionated!

I've found the UX/UI of Claude to be the best by far. And their Artifacts, and now the technical capabilities within the chatbot, are still underrated, IMO. (I pay for both ChatGPT and Claude.)

I've shifted more and more away from chatbots toward specialized environments for more technical tasks (I've found Replit to be incredible for vibe coding legit apps), and this seems to be the general trend. It would be great to have an overview discussing this.

MCJ
Oct 19 (edited)

This is an excellent, thoughtful guide, and I especially appreciate the section on Deep Research and the importance of connecting the AI to your data.

However, I'm surprised there was no mention of a critical issue related to using advanced AI for deep, specialized research—especially when dealing with less-common or proprietary texts.

The point is this: if a user starts asking complex questions about a particular, lesser-known document or text that the model hasn't been explicitly trained on, the quality of the "Deep Research" or "Thinking" response will likely be slop unless the user can constrain the LLM by directly uploading a copy of the relevant text. Without that constraint, the model's instinct to "fill in the blanks" with a web search still often leads to confident, but ultimately inaccurate, answers. As a college professor, I'm seeing students fall into this trap way too often, with disastrous outcomes when they replicate the fake information for an assignment. (Relatedly, if you seed an LLM with your own notes, it will often erroneously put your own words in quotes, as if they were from the primary text, leading to similar reliability issues.)

In my experience, tools that specialize in this document-grounded Q&A—like NotebookLM—remain the superior choice for this specific use case, precisely because they are designed to limit the model's scope to the uploaded text and are least likely to go off-script by searching the web to fill in lacunae. It feels like an important nuance when discussing "Getting better answers."
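
For API users, the same document-grounded constraint can be approximated directly; a minimal sketch, assuming the Anthropic Python SDK (the model name, file path, and sample question are placeholders):

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Hypothetical file holding the lesser-known primary text.
with open("primary_text.txt") as f:
    source = f.read()

response = client.messages.create(
    model="claude-sonnet-4-5",  # placeholder model name
    max_tokens=1024,
    system=(
        "Answer only from the document provided. If the document does not "
        "contain the answer, say so explicitly. Never place words in "
        "quotation marks unless they appear verbatim in the document."
    ),
    messages=[
        {
            "role": "user",
            "content": (
                f"<document>\n{source}\n</document>\n\n"
                "Question: What does the author argue in the opening section?"
            ),
        }
    ],
)
print(response.content[0].text)
```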

aymeric Marchand

Thank you for a deep, practical, and thorough analysis of the kind very few dare to offer outside the technical sphere of AI enthusiasts, influencers, developers, and programmers. As an avid AI "practitioner" myself, I share many of the points presented, on prompting and the efficient use of advanced models in particular. And the annotated chart is a brilliant idea that I will discuss very soon with my students! Greetings from Europe.

Paul Funnell

Great piece, as ever. The only significant omission is around data protection and sovereignty. A lot of people are putting confidential and personal information into models, which is OK provided it's your information and you have some idea what the company you are using might do with it. But if it's someone else's, you need to be covered by the appropriate provisions in your jurisdiction and know where the data resides, and that it won't be reused, exposed, etc. Copilot Pro/365 is generally the safest bet for most circumstances here, as the data sits within your organisation's tenancy and under its administrative purview.

SGfrmthe33

On the one hand, I think this is definitely something everyone should be aware of. On the other hand, I expect the risk with the most popular tools is very low - and practically nonexistent for more savvy users.

For instance, OpenAI and Anthropic give you the option of not allowing your data to be used for training purposes. Gemini offers something similar (although the setting is a bit trickier to activate). I don't think anyone should trust Llama or Deepseek, to be clear.

Also, I suspect a lot of these companies have a strong self-preservation instinct. If someone else's personal/confidential information started appearing in a random user's chats, that would be a huge problem. Given how many people are uploading confidential docs to these apps, and that nothing significant like this appears to have happened yet, I expect the risk is very low.

Josiah Young

I feel like you dismissed Grok without sufficient explanation. I've found it to advance more quickly than other models, and it often seems to produce superior deep thinking results.

Ethan Mollick

I can only cover so much in a post. Grok is a very good model. But xAI has lagged far behind in documentation about the safety of its models (which is especially salient given how it has leaned into AI companions) and about the conflicts that keep happening where the AI is modified suddenly for mysterious reasons: https://x.com/emollick/status/1943020566304178242?s=46&t=XNcOsqyq6z3Fp3ZNc2ibMA

Patrick Cosgrove

I find the chat on the paid ChatGPT a right pain in the artefact. If I ask it not to be so sycophantic, it has forgotten within 24 hours. If I give it words or phrases to avoid, like "awesome", "going forward", "like" (as a sentence filler), etc., it also forgets those very quickly, but apologises far too profusely when I point this out. I tried to use it to improve my French by asking it to remember a prompt word ("Traduire") which would precede something I said in English. It forgot that very rapidly as well. I can get it to speak to me in an English accent, but that is also only temporary. My preference would be for it to speak in a rather upper-class female English accent, and not to be so friendly, but I think I'm hoping for too much.

Andrew Sniderman 🕷️

Under personalization there is a new "personality" setting. I switched mine to "Robot / efficient and blunt" for these same reasons, and now it's better.

Patrick Cosgrove

Thanks.

Richard Bergman

Hello Ethan. About your point "Don't worry too much about prompting 'well'": in the (high school) educator's context, is it still advisable to follow a prompt format to craft lessons, assessments, etc.? I still find a format produces my desired result. Curious whether, in this context, you would still suggest some form of prompt-engineering format. Cheers,

Dr. KK

A perfect article as I find myself starting my third paid AI this month. I started with ChatGPT Plus, then Perplexity Pro, and now Claude.

The points you bring up are in line with my experiences broadly. Some questions remain:

- Do you agree that the worse the model, the more stress is on the prompt? Especially for the distilled ones?

- I find Claude to be way superior in terms of critiquing my amateur fiction. It's become a trusted ally, which ChatGPT has not been able to be. Do you have any similar experiences, primarily in the realm of fiction?

- For which of the paid models do you recommend the $200 plan?

TJmcAwesome

What made you switch from using Claude within Perplexity to using pure Claude?

Dr. KK

To be fair, I wasn't using Claude within Perplexity much. Before I started the Claude Pro plan, I was using the free plan on Claude. Once I started hitting limits, I wanted the full experience. So among the three subs, I "like" the Anthropic one the best. But I use ChatGPT Plus the most because of all the projects I have on it. :)

JV
Oct 20 (edited)

Claude is "provably" best at many creative writing tasks, including critique: https://eqbench.com/judgemark-v2.html

Personally I found GPT-5 quite competitive but Sonnet 4.5 took another step forward.

Virginia Blaser

Another great article. I appreciate your rundown and that you state it simply and clearly.

Brucoid

What about the Perplexity Pro $20/month subscription? Doesn't it provide access to all the major LLMs for one price?

TJmcAwesome

This is what I'd like to know. I use Perplexity and swap models all the time, which seems pretty great for the price.

Colin Crawford

How about ChatLLM? For $10 per month you can avoid multiple subscriptions and get access to many of the models mentioned.

Alex Tolley

"If you are using the paid version of any of these models and want to make sure your data is never used to train a future AI, you can turn off training easily for ChatGPT and Claude without losing any functionality, "

But if others don't, doesn't that mean that future models will increasingly be trained on AI output over time? Or is the intention to always label output so this is prevented by default?

"Deep Research is a mode where the AI conducts extensive web research over 10-15 minutes before answering. Deep Research is a key AI feature for most people, even if they don’t know it yet, and it is useful because it can produce very high-quality reports that often impress information professionals (lawyers, accountants, consultants, market researchers) that I speak to. "

So what are lawyers doing that seemingly results in legal submissions full of hallucinations? Are lawyers at large firms not paying for AI? Or is even the best AI hallucinating "relevant cases" that, when caught by a judge, cost the firm more than not using AI at all? Unlike other research reports, legal submissions with hallucinated cases are easily detectable - rather like my experience checking Prager-U's economics claims, where the supposed references didn't support them at all.

"Claude and ChatGPT can now make PowerPoints and Excel files of high quality (right now"

And yet there are reports showing that these spreadsheets, especially ones built from data extracted from financial reports, can be incorrect. How should users do the checking? Can AI, currently or in the near future, do the checking?

The otter on a plane in film-noir style - cigarette smoke is apparently coming from behind the otter's nose. Better, but no cigar.

"The future of AI isn’t just about better models. It’s about people figuring out what to do with them."

I agree 110%. But even so, now that it is becoming apparent that LLMs are reaching an asymptote with regard to scaling, there will be some applications they do well (e.g., chat) and some they do poorly because of inherent issues with the LLM approach. There need to be more ways for the user to apply critical thinking to the output. Are there "sanity checks" that can be added? Can there be AI red teams to critique output?
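
One concrete version of the "AI red team" idea is simply a second model pass that audits the first answer against the source material. A minimal sketch, assuming the OpenAI Python SDK (the model name is a placeholder):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def red_team(answer: str, source: str) -> str:
    """Second-pass audit: flag claims in `answer` that `source` doesn't support."""
    review = client.chat.completions.create(
        model="gpt-4o",  # placeholder; ideally a different model than the writer
        messages=[
            {
                "role": "system",
                "content": (
                    "You are a red-team reviewer. List every claim, citation, "
                    "or number in the answer that the source material does not "
                    "directly support, and explain why."
                ),
            },
            {
                "role": "user",
                "content": f"Source:\n{source}\n\nAnswer to audit:\n{answer}",
            },
        ],
    )
    return review.choices[0].message.content
```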

I hope you continue to do these sorts of reports, although perhaps with a more critical eye.

Suhrab Khan

This guide is a masterclass in using AI today: pick a system that fits your workflow, start small, and leverage agentic or deep research modes for real results. Context and experimentation beat perfect prompts; build intuition, not just outputs.

Cale Reid

I read "On Working With Wizards", but could you clarify how you're thinking about agent models vs. wizard models? Isn't GPT-5 Pro just the next agent model?

I also chuckled at the (I assume inadvertent) suggestion that "very complex academic tasks" are not "real work that matters."
