It seems like there is a distinction missing here regarding what replication means. Running another statistical analysis on existing datasets is not the same thing as replication, which requires running experiments again. So in psych for example, it's not just analyzing the existing reported data from a questionnaire filled out by participants: rather, it's obtaining a new batch of questionnaire results and seeing whether those results accord with prior sets. Am I missing something?
Depending on the field, there is debate over whether this is replication or reproduction. I added a footnote to the piece with some more details.
Ah! Thanks
I agree, and I tried to post a comment about it, but I don't know if it went through.
The "replication crisis" referred to replication in the sense of running a new, independent experiment. Here is a 2015 paper about it from the Open Science Collaboration, where they re-ran 100 published studies and found that many of them did not replicate.
https://repository.ubn.ru.nl//bitstream/handle/2066/155639/155639.pdf
It's still a very valuable task, though, to check the work that leads to a paper's conclusion! It also seems possible to do a sensitivity analysis this way: ask the AI to calculate a slightly different thing, and then see whether the effect is still there.
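A minimal sketch of the kind of check I have in mind, assuming a tidy replication dataset and made-up file and column names (outcome, treatment, age): re-estimate the headline model under a slightly different specification and see whether the effect survives.

```python
# Hypothetical sensitivity check (illustrative only): re-run the headline
# regression under a slightly different specification and compare the
# treatment effect. File and column names are invented for this sketch.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("replication_data.csv")  # assumed replication file

# Specification as reported in the paper (assumed).
base = smf.ols("outcome ~ treatment", data=df).fit()

# Slightly different specification: add a covariate and trim outliers.
trimmed = df[df["outcome"].between(df["outcome"].quantile(0.01),
                                   df["outcome"].quantile(0.99))]
alt = smf.ols("outcome ~ treatment + age", data=trimmed).fit()

print("base effect:", round(base.params["treatment"], 3),
      "p =", round(base.pvalues["treatment"], 4))
print("alt effect: ", round(alt.params["treatment"], 3),
      "p =", round(alt.pvalues["treatment"], 4))
```

If the two estimates tell the same story, the finding is at least not an artifact of one narrow specification.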
>"It all checked out. I tried this on several other papers with similarly good results, though some were inaccessible due to file size limitations or issues with the replication data provided. Doing this manually would have taken many hours."
But it didn't actually do anything helpful. It just said "it all checks out." Why aren't AI agents going through tens of thousands of papers and unearthing papers with problems?
In practice, even cutting-edge AI tools still need a tremendous amount of guidance to do anything helpful, even on narrow tasks like coding. And they go off the rails in a significant percentage of cases.
I say this as a heavy user of AI for the now-standard use cases of coding, editing, and search. But I'm quite skeptical about the bigger tasks described here. Like many others, I've become skeptical of toy studies and benchmarks, and I'm constantly testing the tools; they simply don't currently work well for larger tasks.
But that isn't what happened. I said "it checked out," not the AI. It provided detailed methodological critiques and was capable of applying new statistical methods to test the outcome in different ways. No guidance was needed, and the code that it wrote provides an audit trail. See my last post for another example of a critique of one of my own papers.
Right, I understand. But (as the Talmud says) saying "here are 24 reasons why you're right" isn't especially helpful. What's helpful is pointing out that something is wrong.
"See my last post for another example of a critique of one of my own papers." Right, that's very cool, and helpful. Why isn't that being done on a large scale, in the wild, on published papers, to find actual errors? (I think you mentioned the possibility of this in that post, or maybe I saw the idea elsewhere.) To me that indicates that it's not actually capable of this task, in any kind of consistent/repeatable way
Everything here came out in the past few weeks. Academia moves VERY slowly.
Didn't some non-profit, backed by lots of money, say that they're going to use AI to check published papers? It sounds like something that could be set up and iterated on relatively rapidly.
Curious question on agents: does anyone have real, viable examples of agents in production in organisations? It's tremendously hard to find examples where they are truly agentic (e.g., acting with minimal intervention from humans). Would love to see examples!
Heard about this on AI Daily Brief and glad you wrote about it. As a lawyer and law professor, I'm very interested in seeing the answers to the "Lawyers" questions and sharing them with my class. But those fields seem to be blank. Has OpenAI made them available?
They are here: https://huggingface.co/datasets/openai/gdpval/viewer/default/train?p=1&row=116
The replication experiment would be more impressive if it was able to identify a paper that was not replicable. Did you feed in that one Harvard Business School prof's work?
Read my last post where the AI identified a (fortunately minor) error in my own research.
Do you mean Francesca Gino? If so, her research involved actual human participants. Therefore, it cannot be replicated in software using AI. It could only verify that the analysis code is correct.
You would still need data from real people for the replication.
What happens if you ask it, in a new context window, to specifically show that the paper does not replicate? Will it obligingly show that too, or will it push back and demonstrate why the result does in fact replicate?
What a great line: "If we don't think hard about WHY we are doing work, and what work should look like, we are all going to drown in a wave of AI content." The breadth of available tools makes the field of possibility so expansive—yet if we simply transplant our current ways of working into this new world without reimagining our processes, we risk reaching a point where AIs create content that other AIs consume, all to make people feel productive. I hope we dare to think of different uses for a world powered by different tools.
Not surprising that AI scores well in symbol manipulation tasks like software dev or inventory. But less obvious (and just a bit disturbing) that they excel at managing human activity. (Front-line supervisors.)
The predicted efficiency ("If experts followed this workflow, the paper estimates they would get work done forty percent faster and sixty percent cheaper") is terrifying, to be honest. That claim comes as the NY Times reports on Silicon Valley relying on humans currently working 9 AM to 9 PM, six days a week, and these are the folks I'd expect to be first to feel the gains. Does greater efficiency mean greater reliance on fewer workers?
Having your own bespoke toolkit is where I see AI-enabled workers going. Building GPTs with deep complexity and knowledge for specific tasks, then passing that information into a coding assistant (or another GPT you've created as an assistant) to kick out templates, massively speeds up the individual. At least it has in my case. I can confidently say that AI has given me back time in my day that I'd otherwise spend shuffling through various Confluence tabs and other ERP systems.
I still think a big mistake GenAI products make is chasing use cases where precision is non-negotiable.
A financial model that is 70% "accurate" is essentially 100% useless. If an LLM hallucinates one formula, I have to re-check every cell. What am I accelerating exactly if I have to rework the whole thing?
The machine is good at scaffolding work, so maybe I save 10% on structure. OK. But that is a far cry from the 50%+ productivity gains some people claim (for specific use cases).
GenAI is a probabilistic machine. It thrives in brainstorming, summarising, that sort of stuff.
In domains where a single wrong number kills credibility (a few days ago I posted this https://substack.com/@themanagementconsultant/note/c-159551491 about a Big4 consultancy report that contained LLM-generated hallucinations...) it collapses under scrutiny.
As these models continue to improve, my concern is more cognitive offloading by students at the expense of critical-thinking and deep-reading skills. My professional organization, CCCC, is the most prominent in academic writing, and they seem to have taken a "hard no" approach I don't share.
I'm searching for a methodology that brings AI into the classroom without harming students' learning. At least we can now say the new testing does not fall prey to benchmark contamination.
They've now got a name for all those PowerPoints: "Workslop"
Amazing, Ethan. Thank you so much for sharing your insights and knowledge. You make it much easier to understand. Thanks for bringing such great content (and data) while sharing your point of view. I strongly agree that AI is a reality. As you mentioned, you need to use your judgment about what can be done, when, and at what cost. I wish everyone could understand how technology is supposed to make our lives better and improve the way we live. I wish that were common sense.
@Ethan:
'I did not do anything other than give Claude the files and the prompts “replicate the findings in this paper from the dataset they uploaded. you need to do this yourself. if you can’t attempt a full replication, do what you can” and, because it involved complex statistics, I asked it to go further: “can you also replicate the full interactions as much as possible?”'
Was this a single-pass effort, or did it require some wrangling of the prompt[s] before it correctly replicated the paper from the data with generated Python code and then produced the report, with figures and charts?
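(For context, I'd guess the generated Python is roughly along the lines of the sketch below; this is my own hedged reconstruction, not the actual output, and the file name, variable names, and reported value are all invented placeholders.)

```python
# Hedged sketch of what a replication script might look like: load the
# authors' posted data, re-estimate the main model, and compare against the
# coefficient reported in the paper. Everything here is a placeholder.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_stata("study1_replication.dta")  # many replication packages ship Stata files

# Main interaction model (hypothetical variable names; assumes a 0/1-coded condition).
model = smf.ols("dv ~ condition * moderator", data=df).fit()
print(model.summary())

reported_interaction = 0.42  # value claimed in the paper (placeholder)
estimated = model.params["condition:moderator"]
print(f"reported {reported_interaction:.2f} vs replicated {estimated:.2f}")
```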
I think the more interesting question is whether it can successfully find problems/mistakes/errors. Simply saying "it replicated" after (supposedly) making the calculations isn't especially interesting.
But if the paper is wrong, what can AI do? It will fail to replicate the paper and its results. Can it explain why the paper was wrong and output the correct results? Can it point out that the statistical tests used were wrong? It is the modern version of Tolstoy's "All happy families are alike; each unhappy family is unhappy in its own way." If papers are wrong, they are wrong for different reasons, and as you indicate, the question is whether an AI can explain why and, better still, fix the various errors. Could the AI then revise the conclusion to reflect the changed results?
IDK whether the replication crisis is because one cannot successfully repeat the experiment, or because, when the experiment is repeated, the answers come out different. I believe in the case of clinical experiments the latter is the case. I am also aware that clinical papers use the wrong statistical test about a third of the time, which may or may not change the results; that could be due to ignorance, or to just using the test that provides the needed p-value, a form of p-hacking.
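To make the wrong-test point concrete, here is a small illustration on synthetic, skewed data (not from any real paper): a t-test and a non-parametric alternative can return quite different p-values on the same samples, and picking whichever one clears the threshold is exactly the p-hacking move.

```python
# Synthetic illustration: on skewed outcomes, a Welch t-test and a
# Mann-Whitney U test can disagree, so the choice of test matters.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
control = rng.lognormal(mean=0.0, sigma=1.0, size=40)  # skewed "clinical" outcome
treated = rng.lognormal(mean=0.4, sigma=1.0, size=40)

t_p = stats.ttest_ind(control, treated, equal_var=False).pvalue
u_p = stats.mannwhitneyu(control, treated, alternative="two-sided").pvalue

print(f"Welch t-test p = {t_p:.3f}")
print(f"Mann-Whitney p = {u_p:.3f}")
```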
Just want to chime in that I had a real solid laugh at the line “that is too many PowerPoints.” The delivery was perfect. Total mirth.
But your article’s main point is actually very validating to me. I have been feeling crazy in the face of the naysaying that is predominant online. I’ve been using Codex with GPT 5 and Claude Code in my terminal for the last month for my personal projects and it’s clear to me that serious economic value could easily be generated with this relatively inexpensive combo.
I’ll be writing about my experience with it soon. There have been so many surprises. And a nonzero number of significant wins.
Thanks for taking away my crazy pills. (Or popping one with me.)