It seems like there is a distinction missing here regarding what replication means. Running another statistical analysis on existing datasets is not the same thing as replication, which requires running experiments again. So in psych for example, it's not just analyzing the existing reported data from a questionnaire filled out by participants: rather, it's obtaining a new batch of questionnaire results and seeing whether those results accord with prior sets. Am I missing something?
Depending on the field, there is debate over whether this is replication or reproduction. I added a footnote to the piece with some more details.
Ah! Thanks
I agree, and I tried to post a comment about it, but I don't know if it went through.
The "replication crisis" referred to replication in the sense of running a new, independent experiment. Here is a 2015 paper about it from the Open Science Collaboration, where they re-ran 100 published studies and found that many of them did not replicate.
https://repository.ubn.ru.nl//bitstream/handle/2066/155639/155639.pdf
It's still a very valuable task, though, to check the work that leads to a paper's conclusion! It also seems possible to do a sensitivity analysis this way: ask the AI to calculate a slightly different thing, and then see whether the effect is still there.
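A minimal sketch of the kind of check I have in mind, assuming a tidy replication dataset and made-up file and column names (outcome, treatment, age): re-estimate the headline model under a slightly different specification and see whether the effect survives.

```python
# Hypothetical sensitivity check (illustrative only): re-run the headline
# regression under a slightly different specification and compare the
# treatment effect. File and column names are invented for this sketch.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("replication_data.csv")  # assumed replication file

# Specification as reported in the paper (assumed).
base = smf.ols("outcome ~ treatment", data=df).fit()

# Slightly different specification: add a covariate and trim outliers.
trimmed = df[df["outcome"].between(df["outcome"].quantile(0.01),
                                   df["outcome"].quantile(0.99))]
alt = smf.ols("outcome ~ treatment + age", data=trimmed).fit()

print("base effect:", round(base.params["treatment"], 3),
      "p =", round(base.pvalues["treatment"], 4))
print("alt effect: ", round(alt.params["treatment"], 3),
      "p =", round(alt.pvalues["treatment"], 4))
```

If the two estimates tell the same story, the finding is at least not an artifact of one narrow specification.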
>"It all checked out. I tried this on several other papers with similarly good results, though some were inaccessible due to file size limitations or issues with the replication data provided. Doing this manually would have taken many hours."
But it didn't actually do anything helpful. It just said "it all checks out." Why aren't AI agents going through tens of thousands of papers and unearthing papers with problems?
In practice, even cutting-edge AI tools still need a tremendous amount of guidance to do anything helpful, even on narrow tasks like coding. And they go off the rails in a significant percentage of cases.
I say this as a heavy user of AI for the now-standard use cases of coding, editing, and search. But I'm quite skeptical about the bigger tasks described here. Like many others, I've become skeptical of toy studies and benchmarks, and I'm constantly testing the tools; they simply don't currently work well for larger tasks.
But that isn't what happened. I said "it checked out," not the AI. It provided detailed methodological critiques and was capable of applying new statistical methods to test the outcome in different ways. No guidance was needed, and the code that it wrote provides an audit trail. See my last post for another example of a critique of one of my own papers.
Right, I understand. But (as the Talmud says) saying "here are 24 reasons why you're right" isn't especially helpful. What's helpful is pointing out that something is wrong.
"See my last post for another example of a critique of one of my own papers." Right, that's very cool, and helpful. Why isn't that being done on a large scale, in the wild, on published papers, to find actual errors? (I think you mentioned the possibility of this in that post, or maybe I saw the idea elsewhere.) To me that indicates that it's not actually capable of this task, in any kind of consistent/repeatable way
Everything here came out in the past few weeks. Academia moves VERY slowly.
Didn't some non-profit, backed by lots of money, say that they're going to use AI to check published papers? It sounds like something that could be set up and iterated on relatively rapidly.
Curious question on agents: does anyone have real, viable examples of agents in production in organisations? It's tremendously hard to find examples where they are truly agentic (e.g., acting with minimal intervention from humans). Would love to see examples!
Heard about this on AI Daily Brief and glad you wrote about it. As a lawyer and law professor, I'm very interested in seeing the answers to the "Lawyers" questions and sharing them with my class. But those fields seem to be blank. Has OpenAI made them available?
They are here: https://huggingface.co/datasets/openai/gdpval/viewer/default/train?p=1&row=116
The replication experiment would be more impressive if it was able to identify a paper that was not replicable. Did you feed in that one Harvard Business School prof's work?
Read my last post where the AI identified a (fortunately minor) error in my own research.
Do you mean Francesca Gino? If so, her research involved actual human participants. Therefore, it cannot be replicated in software using AI. It could only verify that the analysis code is correct.
You would still need data from real people for the replication.
What happens if you ask it, in a new context window, to specifically show that the paper does not replicate? Will it obligingly show that too, or will it push back and demonstrate why the result does in fact replicate?
What a great line: "If we don't think hard about WHY we are doing work, and what work should look like, we are all going to drown in a wave of AI content." The breadth of available tools makes the field of possibility so expansive—yet if we simply transplant our current ways of working into this new world without reimagining our processes, we risk reaching a point where AIs create content that other AIs consume, all to make people feel productive. I hope we dare to think of different uses for a world powered by different tools.
Not surprising that AI scores well in symbol manipulation tasks like software dev or inventory. But less obvious (and just a bit disturbing) that they excel at managing human activity. (Front-line supervisors.)
The predicted efficiency ("If experts followed this workflow, the paper estimates they would get work done forty percent faster and sixty percent cheaper") is terrifying, to be honest. That claim comes as the NY Times reports on Silicon Valley relying on humans currently working 9 AM to 9 PM, six days a week, and these are the folks I'd expect to be first to feel the gains. Does greater efficiency mean greater reliance on fewer workers?
Having your own bespoke toolkit is where I see AI-enabled workers going. Building GPTs with deep complexity and knowledge for specific tasks, then passing that information into a coding assistant (or another GPT you've created as an assistant) to kick out templates, massively speeds up the individual. At least it has in my case. I can confidently say that AI has given me back time in my day that I'd otherwise spend shuffling through various Confluence tabs and other ERP systems.
I still think a big mistake GenAI products make is chasing use cases where precision is non-negotiable.
A financial model that is 70% "accurate" is essentially 100% useless. If an LLM hallucinates one formula, I have to re-check every cell. What am I accelerating exactly if I have to rework the whole thing?
The machine is good at scaffolding work, so maybe I save 10% on structure. OK. But that is a far cry from the 50%+ productivity gains some people claim (for specific use cases).
GenAI is a probabilistic machine. It thrives in brainstorming, summarising, that sort of stuff.
In domains where a single wrong number kills credibility (a few days ago I posted this https://substack.com/@themanagementconsultant/note/c-159551491 about a Big4 consultancy report that contained LLM-generated hallucinations...) it collapses under scrutiny.
As these models continue to improve, my concern is more cognitive offloading by students at the expense of critical-thinking and deep-reading skills. My professional organization, CCCC, is the most prominent in academic writing, and they seem to have taken a "hard no" approach I don't share.
I'm searching for a methodology that brings AI into the classroom without harming students' learning. At least we can now say the new testing does not fall prey to benchmark contamination.
They've now got a name for all those PowerPoints: "Workslop"
Amazing, Ethan. Thank you so much for sharing your insights and knowledge. You make it much easier to understand. Thanks for bringing such great content (and data) while sharing your point of view. I strongly agree that AI is a reality. As you mentioned, you need to use your judgment about what can be done, when, and at what cost. I wish everyone could understand how technology is supposed to make our lives better and improve the way we live. I wish that were common sense.
@Ethan:
'I did not do anything other than give Claude the files and the prompts “replicate the findings in this paper from the dataset they uploaded. you need to do this yourself. if you can’t attempt a full replication, do what you can” and, because it involved complex statistics, I asked it to go further: “can you also replicate the full interactions as much as possible?”'
Was this a single-pass effort, or did it require some wrangling of the prompt[s] before it correctly replicated the paper from the data with generated Python code and then produced the report, with figures and charts?
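(For context, I'd guess the generated Python is roughly along the lines of the sketch below; this is my own hedged reconstruction, not the actual output, and the file name, variable names, and reported value are all invented placeholders.)

```python
# Hedged sketch of what a replication script might look like: load the
# authors' posted data, re-estimate the main model, and compare against the
# coefficient reported in the paper. Everything here is a placeholder.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_stata("study1_replication.dta")  # many replication packages ship Stata files

# Main interaction model (hypothetical variable names; assumes a 0/1-coded condition).
model = smf.ols("dv ~ condition * moderator", data=df).fit()
print(model.summary())

reported_interaction = 0.42  # value claimed in the paper (placeholder)
estimated = model.params["condition:moderator"]
print(f"reported {reported_interaction:.2f} vs replicated {estimated:.2f}")
```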
I think the more interesting question is whether it can successfully find problems/mistakes/errors. Simply saying "it replicated" after (supposedly) making the calculations isn't especially interesting.
But if the paper is wrong, what can AI do? It will fail to replicate the paper and its results. Can it explain why the paper was wrong and output the correct results? Can it point out that the statistical tests used were wrong? It is the modern version of Tolstoy's "All happy families are alike; each unhappy family is unhappy in its own way." If papers are wrong, they are wrong for different reasons, and as you indicate, the question is whether an AI can explain why and, better still, fix the various errors. Could the AI then revise the conclusion to reflect the changed results?
IDK whether the replication crisis is because one cannot successfully repeat the experiment, or because, when the experiment is repeated, the answers come out different. I believe in the case of clinical experiments the latter is the case. I am also aware that clinical papers use the wrong statistical test about a third of the time, which may or may not change the results; that could be due to ignorance, or to just using the test that provides the needed p-value, a form of p-hacking.
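To make the wrong-test point concrete, here is a small illustration on synthetic, skewed data (not from any real paper): a t-test and a non-parametric alternative can return quite different p-values on the same samples, and picking whichever one clears the threshold is exactly the p-hacking move.

```python
# Synthetic illustration: on skewed outcomes, a Welch t-test and a
# Mann-Whitney U test can disagree, so the choice of test matters.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
control = rng.lognormal(mean=0.0, sigma=1.0, size=40)  # skewed "clinical" outcome
treated = rng.lognormal(mean=0.4, sigma=1.0, size=40)

t_p = stats.ttest_ind(control, treated, equal_var=False).pvalue
u_p = stats.mannwhitneyu(control, treated, alternative="two-sided").pvalue

print(f"Welch t-test p = {t_p:.3f}")
print(f"Mann-Whitney p = {u_p:.3f}")
```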
Just want to chime in that I had a real solid laugh at the line “that is too many PowerPoints.” The delivery was perfect. Total mirth.
But your article’s main point is actually very validating to me. I have been feeling crazy in the face of the naysaying that is predominant online. I’ve been using Codex with GPT 5 and Claude Code in my terminal for the last month for my personal projects and it’s clear to me that serious economic value could easily be generated with this relatively inexpensive combo.
I’ll be writing about my experience with it soon. There have been so many surprises. And a nonzero number of significant wins.
Thanks for taking away my crazy pills. (Or popping one with me.)