This is 100% my experience. The last couple of months have been the most productive of my career, and it isn't even close. My concern with this observation is that we are potentially entering a productivity divide between those who utilize AI properly and those who do not. And this is not just a question of AI literacy; it is primarily a question of attitude toward the benefits of AI. If you reject AI as a matter of principle, you will no longer be able to compete, even if you are the smartest person in the room.
That last line hit hard.
What if the real divide isn’t just about who uses AI — but who lets it think for them?
We risk building a world where even the “smartest in the room” forget how to navigate without a co-pilot.
Not because they’re lazy, but because the tools make it so easy to skip the friction — the struggle where understanding is born.
I’ve been writing a lot about that tension: productivity vs. cognitive erosion.
Your take is sharp. Grateful for it.
Couldn’t have said it any better. Same experience, even at 70 years of age 😬. 100% aligned.
With 15 additional years on you (having started in the game with punched cards and wooden disks) and not having done any coding of significance in about 20 years, Copilot and I just developed a nifty electronic bulletin board app. The old adage I have heard, "a good programmer can write Fortran in any language," really comes to life with an AI at my side.
> My concern with this observation is that we are potentially entering a productivity divide between those who utilize AI properly and those who do not.
I agree, and would take this notion a step further to insist that this is where we are right now.
Thanks for the comment. 👍 Yes, we are.
What I find significant is that this study was done before the new “deep research” methodologies were available. I am “teaming” with a combination of Gemini, OpenAI, and Claude using deep research and doing mashups of the results. And now I am integrating this with my human team members and teaching them the technologies. The results are spectacular. We are using this technology to design, architect, project plan, and build, and it is hard to describe the results. As new features appear, we are folding them into our processes. GPT-4o results that used to stun me are now pedestrian in comparison.
I completely agree. I think the non-obvious implication of DR and other agentic AIs will be to increase the ability and benefit of human:human collaboration, by doing the bulk of the actual work.
I wrote more on this here: https://open.substack.com/pub/thefuturenormal/p/chatgpts-deep-research-and-thinking
Henry, wow, thanks for the redirect to your article. Lots of food for thought. A few things I would comment on:
1) I rarely use the research results from the first prompt; it often takes a bit of human-in-the-loop feedback. This really makes your point that the prompter's expertise matters, and I would add that fine-tuning is not only acceptable but to be expected - you are literally using a custom RAG to build better research results.
2) I am very serious about the mashups being useful. I find the three different models give a lot of expansion to answers and modes of “alien thinking,” as Ethan puts it in his book.
3) Cross-feeding results between models for analysis and creation can be revelatory. Just saying: try it, you might like it.
4) I am getting a lot of mileage out of asking for a PhD proposal on subjects; done carefully, you actually get a ton of design ideas for research proposals.
I could go on, but I am not trying to write a blog post, just pointing out how rapidly the process is changing, morphing, and expanding.
With respect to prompting the "OpenAI deep research" (Odr) agent, I have started to use a meta-prompting approach. I give o3-mini a brief prompt about the research topic, together with some "standard instructions" to include at the end of the prompt for Odr (how to structure the report, tabular summaries, slide outlines in the appendix, etc.), and ask it to write a detailed prompt. Here is an example:
"I would like to engage you as a specialist in the creation of detailed prompts for the 'OpenAI deep research' (Odr) agent. The agent will use your prompt to conduct extensive research and produce a high-quality report. The best way to prompt the Odr agent is to include a significant amount of detail specifying the topic and the requirements for the research task.
Here is the research topic: <<Brief description of the topic - one sentence usually suffices>>. At the end of your prompt, include the following text verbatim: ... detailed instructions about the report structure, format, etc. ..."
The generated prompt for Odr is usually at least a page long. I am actually considering changing the "significant amount of detail" portion to "reasonable amount of detail", with the goal of giving Odr more leeway when pursuing the research. I have observed that it sticks very closely to the prompt details, which may be too constraining.
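For anyone who wants to script this step, here is a minimal sketch of the meta-prompting flow, assuming the OpenAI Python SDK; the standard instructions and example topic are placeholders to replace with your own.

```python
# Minimal sketch: ask o3-mini to expand a one-sentence topic into a
# detailed prompt for the Deep Research agent (meta-prompting).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Placeholder "standard instructions" appended verbatim to the generated prompt.
STANDARD_INSTRUCTIONS = (
    "Structure the report with an executive summary, detailed sections, "
    "tabular summaries, and a slide outline in the appendix."
)

def build_deep_research_prompt(topic: str) -> str:
    """Have o3-mini write a detailed research prompt from a brief topic."""
    meta_prompt = (
        "I would like to engage you as a specialist in the creation of "
        "detailed prompts for the 'OpenAI deep research' (Odr) agent. "
        "The agent will use your prompt to conduct extensive research and "
        "produce a high-quality report. Include a significant amount of "
        "detail specifying the topic and the requirements for the research "
        f"task.\n\nHere is the research topic: {topic}\n\n"
        "At the end of your prompt, include the following text verbatim: "
        + STANDARD_INSTRUCTIONS
    )
    response = client.chat.completions.create(
        model="o3-mini",
        messages=[{"role": "user", "content": meta_prompt}],
    )
    return response.choices[0].message.content

print(build_deep_research_prompt(
    "The state of retrieval-augmented generation in enterprise search"
))
```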
Anyway, Odr is in another category. I am starting to use it for work (I'm a retired consultant specializing in Generative AI) and private investigations. Can't imagine not having it. It's also good for writing code.
Craig, I find your mashup concept very interesting. What is the process that you are using for this mashup? Are you just copying and pasting from one tool to the other, or are you using some sort of automation or workflow or tool to help you get the results that you want? I would be very interested in how you are doing this.
Guy,
Several options. One is to combine the three outputs, put them together, and ask for an analysis of the three outputs using all references and use cases. Since I always ask for references and use cases, and I often get different ones, this lets me combine them, which actually enriches a report (I am doing technical reports on potential architecture solutions). The initial reports are in a standard format (executive summary, several layers of details, summary and conclusions, recommendations, and references, with use cases spread through the details and references); OpenAI or Claude does a nice job of mashups. If all three are similar, I can do manual curation, but that takes a bit more effort. Haven't automated yet, but guess what, Claude or OpenAI will do the code for me - it is an excellent idea (a rough sketch of what that might look like follows below).
Craig
Thanks!
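For anyone curious how the mashup could be automated, here is a rough sketch (my own, not Craig's actual setup), assuming the Anthropic Python SDK; the model name, file names, and report format are placeholders to adapt.

```python
# Rough sketch: feed three deep-research reports to one model and ask
# for a single combined analysis that keeps all references and use cases.
from pathlib import Path

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def mash_up(report_paths: list[str]) -> str:
    """Merge several research reports into one enriched report."""
    reports = "\n\n---\n\n".join(
        f"REPORT {i + 1}:\n{Path(p).read_text()}"
        for i, p in enumerate(report_paths)
    )
    message = client.messages.create(
        model="claude-3-7-sonnet-latest",  # placeholder; any capable model
        max_tokens=4096,
        messages=[{
            "role": "user",
            "content": (
                "Analyze the following research reports and produce one "
                "combined report (executive summary, detailed sections, "
                "summary and conclusions, recommendations, references), "
                "keeping all references and use cases from every report.\n\n"
                + reports
            ),
        }],
    )
    return message.content[0].text

print(mash_up(["gemini_report.md", "openai_report.md", "claude_report.md"]))
```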
Hi Craig, fantastic to read about your excellent results with teamwork using multiple AIs. Do you use a tool for that at the moment? I would love to know more about how you use and integrate them in this process.
Yes Craig. So true. The way you describe it — folding new features into team workflows like it’s second nature — feels like watching the future gel in real time.
But I can’t help wondering:
What happens when the “pedestrian” becomes the new default?
When the bar keeps rising so fast that yesterday’s breakthroughs feel like slow motion?
There’s power in these mashups. But also a risk: speed can outpace reflection.
I’m curious how you and your team stay anchored while riding that wave.
Thanks for sharing your lines so openly.
I’ve seen my productivity increase in two ways. On a basic level, I’m using AI for legal document review and summaries. It was pretty bad at that when I was using GPT-4o, but with o1, it’s very useful and saves me a ton of time. On a more advanced level, I’m using o1, o3-mini-high, and Deep Research to dig into complex tax and legal questions, which is a game changer. Yes, I have to check the cites for hallucinations, but I’m still saving a ton of time. Plus, I can add PDFs of vetted third-party tax research that I know and trust to enhance the validity and output of those AI tools. It’s all about figuring out what should go into the mixing bowl.
Weighing in as an English teacher with students in grades 7 and 11: this article makes me think about the dopamine hit an AI user might receive from the sense of success, and perhaps even of control, over difficult material. So, is the difficulty that students have in moderating their use partially energized by the feelings of positivity demonstrated above?
This is great. I have been working as a solopreneur with ChatGPT and Claude as my co-workers for the last 18 months. I work in the innovation and strategy consulting space and am now using AI to disrupt the traditional consulting business model. It would be an honor to take part in this study. I have a lot of insights to share.
A quibble with your interpretation of the teams being “significantly” better than individuals, demonstrating the “value of teamwork”:
* Firstly, 0.24 standard deviations is just not very much!
* But much more importantly, that improvement is strictly worse than just having each team member work solo and then taking the better of the two submissions! (I don’t have the math background to calculate it myself, but ChatGPT-4o and Claude 3.7 both tell me that this strategy should boost the expected score by 0.564 standard deviations, more than double the improvement; this follows because the expected maximum of two independent standard-normal draws is 1/√π ≈ 0.564. See the quick check below.) So the teamwork is actually showing an anti-synergy!
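A quick Monte Carlo check (my addition, assuming the two solo submissions are independent draws from the same normal distribution) confirms the 0.564 figure:

```python
# Monte Carlo check: expected maximum of two iid standard-normal scores.
import numpy as np

rng = np.random.default_rng(0)
scores = rng.standard_normal((1_000_000, 2))  # two independent solo attempts
best_of_two = scores.max(axis=1)              # submit the better of the two

print(f"simulated lift: {best_of_two.mean():.3f} SD")  # ~0.564
print(f"exact value:    {1 / np.pi ** 0.5:.3f} SD")    # 1/sqrt(pi) ≈ 0.564
```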
This is a fantastic study, and Can Confirm. My work as a cyborg has been more creative, higher quality, and way more fun. “More efficient” and “faster,” while also true, don’t begin to capture it. I’m able to play in spaces that were completely unavailable to me before. To have actual evidence backing this up is very exciting, and helps me evangelize to people who make decisions about our organization’s AI resources.
I've always been convinced that Generative AI's best usage is as a 'sparring partner' for doing your job. In my experience, this has helped me greatly in my daily work, and it has been the pattern with which my team and I have trained employees and introduced Generative AI in corporate settings. Glad to see some quantitative research corroborating this approach.
Great insights on how an individual + AI team replicates the benefits of cross-functional teamwork. Has anyone come across articles that explore the organization design implications? For example, how do we evolve the design of organizations, teams, and roles to realize these benefits? How do we train people to team with AI?
This fascinating field experiment at P&G supports the importance of Artificial Intelligence Quotient (AIQ), a term coined by MIT and Sun Yat-sen University researchers.
The MIT-Sun Yat-sen studies showed that the best human-AI performers were not necessarily the best chess players or the most AI-literate, and the P&G experiment shows similar results between experts and non-specialists.
AIQ refers to a person's ability to effectively use AI across a diverse range of tasks, identified as a stable, measurable factor using 18 years of global data from chess and renju games. Subsequent studies then broadened the scope to general tasks using language models like ChatGPT and Gemini to further establish and validate AIQ.
The P&G research shows that individuals and especially teams collaborating with AI achieved better results, demonstrating the real-world value of AIQ. These findings from both the original AIQ research and "The Cybernetic Teammate" emphasize the need to measure and develop AIQ for individuals, teams and organizations to fully benefit from AI.
Measuring and developing your AIQ will make you a better player with your cybernetic teammate.
Claude helped me write this email, which I included when forwarding this article to my teammates.
Dear Planning Team,
I've been reflecting on some groundbreaking research that's causing me to reconsider our entire approach to project-based learning.
What if we've been doing this backward all along? What if, instead of starting with standards and trying to build engaging projects around them, we started with student passion and used AI as a teammate to help connect that passion to standards?
Recent research from Procter & Gamble, conducted by researchers at Harvard's Digital Data Design Institute, reveals something transformative: when AI joins a team, it dramatically changes what's possible. In their study, they found that individuals with AI performed as well as traditional teams, while teams with AI achieved exceptional results beyond what either could do alone.
But here's what's keeping me up at night: this means that any teacher, working with any student, and collaborating with AI, can effectively help that student build a project that addresses ANY educational standards, in ANY subject area, for ANY grade level.
Think about that for a moment.
We no longer need to force students into predetermined projects to meet standards. Instead, we can let students follow their curiosity and passion, then use AI as a teammate to help connect their work back to whatever standards are required. The standards can finally serve the student's learning journey, not the other way around.
This completely inverts our traditional approach. Our guiding star becomes student interest - not curriculum maps or pacing guides.
I believe this approach will lead to deeper learning, more authentic engagement, and better outcomes. Students will be doing work that matters to them, while still meeting (and likely exceeding) the standards we're accountable for.
I'm eager to discuss how we might pilot this approach with our students. Perhaps we could start with a small group and document the process and outcomes?
Looking forward to your thoughts,
Weisenfeld
I've been thinking about this too. But I wonder how teachers would determine whether students are actually coming away with a deeper understanding of the material, or whether they are producing higher-quality outputs that they don't really understand?
This is a question I have more broadly about this experiment and about how AI augments learning. Does a better-quality output equal more mastery? Do teams deepen and evolve their actual expertise, or do they produce things that look just as good as what people with expertise produce?
The latter feels risky, especially when kids should be building foundational knowledge and learning how to think, how to learn. I guess it boils down to how the technology is engaged.
This is great stuff: huge scale, clean design, solid numbers and now fascinating comments here below. Did you find a way to control for the 'novelty effect' in your emotional metrics? How many of your subjects were using AI effectively for the first time and enjoying the initial thrill we have all experienced in these last few years?
Thanks for pointing out the 'novelty effect' - I was struggling to consider an effective way to control for that. Perhaps we should ask our AI teammates to suggest a way? :)
You are really right. So: 1) read the document you had written for you, and 2) check every reference and use case. The process is to use the tool to aid in research and help summarize what is out there, not necessarily to do your intellectual job for you. I find myself parsing particular parts of the research and doing another deep research run on subtopics, and that can become iterative as you explore subjects. Another trick: after the research, ask for an elevator speech describing it, and then five slides to explain it to non-technical folk. Amazing what happens when you focus in and out.
Loved this line: “AI can improve your experience.”
I agree — but I also think we’re dangerously close to outsourcing the 'experience itself'.
What happens when we no longer write, think, or question without the loop of machine feedback?
Are we upgrading our work, or just buffering the burnout?
Been digging into that tension in my own work lately — glad to see others wrestling with it too.
Thanks, Ethan — always thoughtful, always sharp.
more power
Thanks for this thoughtful post.
I tried to get at a similar intuition, regarding how cultures of productivity and trust will be created in the Age of AI, in this historical analysis: https://taoofai.substack.com/p/institutions-in-the-age-of-ai