Great breakdown! One thing I always mention when people say they’re wary of using AI assistants because of hallucinations: the mindset needs to shift. These aren’t just Q&A robots. They can actually be your critical thinking partners.
The real value isn’t in asking “what’s the answer?” It’s in using these models to stress-test your thinking. They can:
1. Expand your ideas
2. Validate or poke holes in them
3. Surface POVs you may have completely overlooked
Yes, they’re great for answering simple questions, but they can hallucinate in doing so. The key is how you engage with them.
Give o3 a thesis, for example a stock idea and your reasons for liking it, and give it a persona, like a skeptical hedge fund portfolio manager. Ask it for 10 reasons that support your case and 10 that challenge it. You’ll get new angles, risks you hadn’t considered, and potential counterarguments to prepare for. The conversation is no longer about being right or wrong; it’s about being more rigorous.
Bottom line: don’t use LLMs only as search bars. Use them as strategic thought partners. Pick their brains so they surface information that sharpens your thinking and helps YOU make more informed decisions.
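The stress-test pattern above can be wrapped in a small helper so the same structure is reused for any thesis. This is only a sketch: the function name, wording, and example thesis are all illustrative, not anything from the original post.

```python
from textwrap import dedent

def build_stress_test_prompt(thesis: str, persona: str, n: int = 10) -> str:
    """Build a prompt asking the model to argue both sides of a thesis."""
    return dedent(f"""\
        You are {persona}.
        Here is my thesis: {thesis}
        Give me {n} reasons that support my case and {n} reasons that challenge it.
        For each challenge, note what evidence would change your mind.""")

# Example: a stock thesis run past a skeptical persona (thesis is made up).
prompt = build_stress_test_prompt(
    thesis="I like this stock because of strong data-center demand",
    persona="a skeptical hedge fund portfolio manager",
)
# `prompt` can then be pasted into o3 or sent through any chat API.
```

Keeping the persona and the "N for / N against" framing in a template makes the rigor repeatable instead of something you re-improvise each time.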
This is incredibly timely and useful. I get asked this all the time—and even some people who are paying for the good models (say, ChatGPT Plus) are not aware that they can switch to more powerful models, so they're missing out. A quick "please share your screen and tell me what you want to do" is often an hour very well spent for greater effectiveness in using AI.
I agree with all your points, but I have found Claude far less useful for writing than the other models. I did not see the leap to Claude 4 (Opus or Sonnet) that I expected, neither in writing nor in reasoning. In fact, not long ago I asked both Claude 4 and Gemini 2.5 Pro to quantify about three pages of data (quantitative and qualitative). The conclusions were so different that I gave each the answer the other had given. Claude apologized profusely and got it wrong again upon reanalysis. I also find that Gemini writes better than the rest of the models. If someone wants to pay for a model, right now I would not recommend paying for Claude.
One more thing—what I just mentioned is something that I recommend to people who are willing to pay for at least two models. Make them converse! Give one model the answer the other gave you. This is generally a very fruitful exercise.
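The "make them converse" exercise can even be scripted. In this sketch, `ask_model_a` and `ask_model_b` are hypothetical stand-ins for whichever two chat APIs you use; only the back-and-forth structure is the point.

```python
def cross_examine(question, ask_model_a, ask_model_b, rounds=2):
    """Feed each model the other's latest answer and collect the exchange."""
    transcript = []
    answer_a = ask_model_a(question)
    answer_b = ask_model_b(question)
    for _ in range(rounds):
        transcript.append((answer_a, answer_b))
        # Each model critiques the other's answer to the same question.
        answer_a = ask_model_a(
            f"Question: {question}\nAnother model answered:\n{answer_b}\n"
            "Critique that answer and revise your own."
        )
        answer_b = ask_model_b(
            f"Question: {question}\nAnother model answered:\n{answer_a}\n"
            "Critique that answer and revise your own."
        )
    transcript.append((answer_a, answer_b))
    return transcript
```

Two or three rounds are usually enough; the disagreements that survive the exchange are the ones worth checking yourself.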
Truly, your "one more thing" suggestion is brilliant. Thank you. As soon as I read it, it seemed obvious. But I never thought of it.
I use all of these for various tasks, mostly in the ways you describe. Two additional thoughts:
1) I just started using Grok Deep Search in Tasks. It has been an amazing tool for keeping up on news (in my case, news and trends in a very specific niche: Austin Bio & Health).
2) I have found memory in ChatGPT to be a superpower, since many of my threads are linked in various ways. However, I can't get it to stop using em-dashes no matter how many times I tell it to remember or put it in custom instructions.
I've also tried to get it to stop using em dashes, to no avail. I also find the memory really useful, and Projects too for keeping hold of context, but the memory does fill up.
Yeah you can't override default behaviours on things like em dashes and obsequiousness with modified profile instructions or intro prompts, sadly, in spite of the number of clickbait 'prompt master' posts on LinkedIn saying you can.
It does override for a bit, but then drifts back. It will be interesting to see how we better train them to personal preference and style. It definitely retains knowledge, just not instructions.
I have found that having those interesting exchanges on a train in the UK is fraught with danger and the risk of losing said exchanges :(
Great piece, Ethan. You’ve nailed the core shift in the landscape: it's no longer about the "best model" but the "best overall system." This framing is a huge help for anyone feeling overwhelmed.
That said, I'm going to challenge the quick dismissal of Copilot. While I agree its raw model performance isn't always at the bleeding edge of a new GPT-4o or Claude Opus release, you're undervaluing its power as a system.
Added: Copilot Is the New Internet Explorer: https://www.foreveryscale.com/p/copilot-is-the-new-internet-explorer?r=2wzfb&utm_campaign=post&utm_medium=web&showWelcomeOnShare=false
For me, the deep integration into Windows and Office is proving to be a game-changer. The friction of alt-tabbing to a browser, copying, and pasting is a bigger productivity killer than we admit. Having a very capable AI right there in Word, Outlook, or on the desktop is an advantage that's hard to quantify but easy to feel.
For the majority of knowledge workers, the convenience of a well-integrated AI will always outweigh a slightly superior AI that resides in a separate tab.
I'm curious what others think. Are you finding this trade-off plays out the same way in your daily work, or am I over-valuing the convenience of integration?
Important to note the difference between Copilot Chat vs M365 Copilot. Not sure which one was being referred to exactly in the article, but the latter can definitely be a game changer in terms of its deep workplace integration across Outlook, Office, Teams etc.
Copilot Chat is basically ChatGPT with a Microsoft jacket on. The addition of agents is also interesting and something to keep an eye on.
AND I've just realised that with the licensed version, the Analyst agent is based on o3-mini and the Researcher agent on o3 deep research. Both use chain of thought, as you would expect. This is a game changer, as I'm only allowed to use MS Copilot at work.
Excellent point. I’m using the paid version including agents.
Nice. On a personal or enterprise level? Keen to hear some thoughts on use cases. I’ve started testing some of the free agents
Check out BoltAI if you’re on a Mac, it offers inline AI with any model from any company
Check out LLM front ends like TypingMind and BoltAI, which let you use any model from any company on a pay-as-you-go basis. The monthly subscription may be cheaper for heavy users, but for most people the PAYG pricing is the better option.
Ethan, great writeup as always. One big criterion that is missing from your analysis is collaboration with your team and AI. All of the above assumes an individual in single-player mode. I think that is because this is what Anthropic, ChatGPT, Grok, and Google assume. Their Teams products are not really for teamwork.
I think team collaboration with AI is so important that we started a company, Stravu, to try to enable a new way of working together with AI. We are in Beta and I'd love your (and anyone else on the thread's) feedback on how you want to work with your team and AI. Please check it out at https://www.stravu.com and sign up for the Beta.
A very good guide. One thing that I have found helpful is to instruct AI to ask you questions if it needs more information or needs to clarify anything in your prompt.
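One way to bake that advice in is a reusable preamble you prepend to every task prompt. A minimal sketch; the wording of the instruction is just an example, not a canonical formula:

```python
# An instruction asking the model to surface ambiguity before answering.
CLARIFY_PREAMBLE = (
    "Before answering, check whether my request is ambiguous or missing "
    "information. If it is, ask me numbered clarifying questions first "
    "and wait for my answers instead of guessing."
)

def with_clarification(task: str) -> str:
    """Prepend the clarifying-questions instruction to any task prompt."""
    return f"{CLARIFY_PREAMBLE}\n\nTask: {task}"
```

For example, `with_clarification("Summarize my Q3 report")` produces a prompt that invites questions about audience, length, and format before the model commits to an answer.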
Great article. Would love to hear your views on memory. It seems to be a game changer, but it can also be a bit of a pain, or even scary. Also, there is no mention of o4, and I think o3-pro is only available to Pro users (not Plus at $20/month).
I've been using these AI models for close to 2 years and only now realized you can switch between branches.
Thanks for the tip!
I’m a teacher, and one of my dangerous assumptions when talking to colleagues is expecting that they are willing to pay $20 a month for an “expert in their pocket”.
Honestly, they are addicted to SBUX to the tune of $6-10 a DAY and can’t forgo that to get a top-flight model?
You need to bang in a nail, and yes, rocks are free and easy, but for gosh sakes, if you find yourself working with a lot of nails, why not use a hammer?
Just don’t tell my wife how many I’m subscribed to or how much I’ve toyed with the premium tier models. 🙄
I'm fascinated by the footnote at the end. Why does politeness sometimes matter on hard math and science questions, specifically, and why can it go both directions (better or worse)?
Is it more about how this was created / who created it or more about some fundamental difference in the source material (ie, the way hard math and science are presented and discussed online)? Combo of both? Something else entirely? There's a great research rabbit hole here.
THANK YOU! I have been excited to utilize AI for good and not evil but wasn't sure where to start. Based on what you've written, I think I need to upgrade ChatGPT and keep experimenting with purpose.
I've been avoiding paying for any of these models because Google AI Studio gives access to their frontier models for free. Is there something I'm missing by not subscribing to their paid tier?
I am developing an app on 4o. I provide the ideas, the logic, and the use cases. 4o provides the coding. This includes creating the patent. What are the reasons I might switch to o3? How would I do that without losing the code in the model?
If you're developing an app I would switch to Claude. There's a night and day difference for development.
Get ChatGPT to write a detailed prompt covering everything you've done and all the key information for the project.
Then give Claude the entire code base plus the prompt. Claude's coding ability and UI are just so much better than ChatGPT's. When I hit the limit on Claude and try to switch to ChatGPT, I end up waiting for my limit to reset rather than using ChatGPT. It's that bad for coding, honestly.
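For the "give Claude the entire code base" step, it helps to bundle the project into one pasteable document. A minimal sketch; the file patterns and separator format are just examples:

```python
from pathlib import Path

def bundle_codebase(root: str, patterns=("*.py", "*.js", "*.html")) -> str:
    """Concatenate matching project files into one pasteable handoff document."""
    parts = []
    for pattern in patterns:
        # Sort for a stable, reviewable ordering of files.
        for path in sorted(Path(root).rglob(pattern)):
            parts.append(f"===== {path.relative_to(root)} =====\n{path.read_text()}")
    return "\n\n".join(parts)
```

Paste the resulting string after the handoff prompt, and the new model sees both the project context and the full source in one message.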
How do you implement and run this code if you have a more complicated project? I am talking about installing and configuring all the requirements, e.g. Docker. I thought an agent could do this, but not really. WSL is a nightmare for a non-developer who wants to build something with AI. Code is not the biggest problem, integration is! Any advice?
"asking it to explain its logic will not get you anywhere" – another great post but this statement isn't always true. I recently asked o3 to calculate the highest point within 15 miles of where I live. It was wrong, which I knew immediately. But when I asked it to explain why it was wrong, it correctly diagnosed that it had simply done web searches when actual elevation data was necessary, then found suitable data to calculate it correctly. I'll take it as a win that in this case a human with domain knowledge was much quicker and better at the task, but of course you could also argue my prompting should have been more precise! Anyway, these constant experiments are fascinating. (And of course your overall message is exactly this: experiment constantly!)
My experience is the same. Then ask it to tell you what language to use for a master prompt to prevent the same issue in the future.
I enjoyed your book, Co-Intelligence. I use Claude to proofread, edit, and comment on my freelance book reviews at the point when I think I’m ready to file them with my editor, and I have found it a useful tool.