It seems like there is a distinction missing here regarding what replication means. Running another statistical analysis on existing datasets is not the same thing as replication, which requires running experiments again. So in psych for example, it's not just analyzing the existing reported data from a questionnaire filled out by participants: rather, it's obtaining a new batch of questionnaire results and seeing whether those results accord with prior sets. Am I missing something?
Depending on the field, there is debate over whether this is replication or reproduction. I added a footnote to the piece with some more details.
Ah! Thanks
I agree, and I tried to comment on it, but I don't know if my comment went through.
The "replication crisis" meant the version of a new, independent experiment. Here is a 2015 paper about it from the Open Science Foundation, where they redid 100 established results and saw that many of them did not replicate.
https://repository.ubn.ru.nl//bitstream/handle/2066/155639/155639.pdf
Checking the work that leads to a paper's conclusion is still a very valuable task, though! It also seems possible to do a sensitivity analysis this way: ask the AI to calculate something slightly different, and then see whether the effect is still there or not.
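To make the idea concrete, here is a minimal sketch of what such a sensitivity check might look like in Python. The dataset path, variable names, and the alternative specification are all made up for illustration; the point is just re-running the analysis with a slightly different choice and comparing the estimated effect.

```python
# Minimal sensitivity-analysis sketch (hypothetical data and variable names).
# Idea: re-run the paper's main regression with a slightly different choice
# (trim outliers, add a control) and see whether the effect still holds.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("replication_data.csv")  # hypothetical replication dataset

# Original specification as reported in the paper (assumed column names).
original = smf.ols("outcome ~ treatment + age + income", data=df).fit()

# Variant: drop the top 1% of incomes and add an extra control.
trimmed = df[df["income"] < df["income"].quantile(0.99)]
variant = smf.ols("outcome ~ treatment + age + income + education", data=trimmed).fit()

print("Original treatment effect:", original.params["treatment"])
print("Variant treatment effect: ", variant.params["treatment"])
# If the sign or rough magnitude changes, the finding is fragile.
```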
There's a difference between the replication crisis (people who re-do, i.e., replicate, the experiment with new subjects often don't get the same results or the same conclusion) and reanalysis of published datasets, which lets readers arrive at different statistics, different analyses, and possibly different conclusions.
Curious question on agents - does anyone have real, viable examples of agents in production in organisations? It's tremendously hard to find examples where they are truly agentic (e.g., acting with minimal intervention from humans). Would love to see examples!
Me too. I find it difficult to visualize what would be worth asking an agent to do, how long it would take, and whether it would be worth it.
Some of the things I've seen companies working on include making an appointment for a healthcare patient, answering HR questions, helping make a travel booking, and handling portions of a loan underwriting process. In practice, a lot of these tasks are relatively simple for humans, which means they can be automated fairly reliably with AI. More complex or high-risk tasks still require a human in the loop, which means the agent tends to be more of an assistant than an autonomous actor.
Yes, that's my take on it too. The agent doing the grunt work. Now I need to know how to access/create an agent. I'm ok with ChatGPT etc. and issuing prompts, it's the agent side I'm clueless about. Good thing I like learning lol!
I've seen it as well, but I guess it's perhaps the longevity aspect I'm after here? The models can work (uninterrupted) for quite a long time and solve tasks without humans ever intervening. Most of the tools I've seen are more automations with an attached LLM, rather than truly agentic systems. But maybe I'm being too narrow-minded here.
TL;DR: At Gearvox, our AI agents handle complete job functions (customer intake → dispatch → follow-up) for blue-collar services, not just isolated tasks. We process tens of thousands of calls monthly. Dispatch managers experience this as relief from high-churn roles, not replacement—they shift from endless hiring cycles to strategic supervision while agents handle the 80% of calls that are routine, spam, or busy work.
Detailed response to Ethan + @fullstackhr @wendyscott3 @gentschev @jamespember @diogenes12 @ezrabrand:
I'd push back on the framing that agents only handle "tasks, not jobs." At Gearvox.ai, our agents handle what I'd call complete job functions—owning the entire customer conversation lifecycle for blue-collar services (tow companies, roadside assistance, locksmiths, home services): intake, dispatch, system updates, follow-ups, and exception handling.
This isn't "AI does task X, human reviews and moves to Y." The agent manages the full cycle from first contact to dispatched service request (with "bail out to human" always available).
What makes this a "job slice" vs. task automation:
1. End-to-end ownership - Accountable for customer satisfaction, accurate dispatch, completed transactions. Dispatch managers handle exceptions/strategy; agents handle baseline work autonomously.
2. Domain expertise exceeding typical performance - Trained on expert interviews, call recordings, best practices. Higher conversion rates than average reps while handling edge cases (urgent highway patrol requests vs. routine tows).
3. Full systems integration - Real-time ETA checks, GPS tracking, CRM updates, scheduling, notifications—zero typos, missed fields, or manual data entry.
4. Inhuman consistency - Perfect policy adherence, on-brand communication, empathetic responses without emotional fatigue. Automated post-call classification with higher accuracy than manual review.
But here's what matters: dispatch managers experience this as relief, not replacement.
The roles our agents subsume are plagued by high churn, expensive training cycles, and inconsistent quality. Managers tell us they're exhausted from constant hiring/retraining in frontline positions most people don't want long-term.
Their role has shifted: instead of managing a rotating cast of undertrained staff, they monitor agent performance, handle escalations, and focus on strategy. They report higher quality, better conversion, lower costs—and less stress.
Economic substitution happens at the role level (customers deploy agents instead of hiring 2+ inbound reps), but the human benefit is that senior employees and owner-operators finally focus on judgment, relationships, and strategy—not endless hiring-training-turnover cycles. Taking the 80% of calls that are busy work, spam, or routine requests off managers' plates has been transformative.
The article's risk—"17 PowerPoints nobody needs"—is real if you automate thoughtlessly. But there's a risk in underselling what's possible. If we only think "tasks," we miss the opportunity to redesign work around what agents own end-to-end.
Companies seeing value aren't asking "which tasks can AI help with?" They're asking "which job functions can we reimagine with an agentic workforce that has perfect memory, instant systems access, and consistent expertise—especially where human retention and consistency have always been the bottleneck?"
For blue-collar services, that's customer-facing dispatch and intake. Agents handle it fully. Humans supervise, strategize, and do work requiring judgment. That's not task automation—it's workforce transformation making human roles more sustainable and valuable.
Great use cases!
Any demo of what a call can look like? And what about angry customers who want to talk to a human - any insights there?
We're seeing some early success with a few micro-agents in production. One is the Mintlify agent which helps keep our documentation up to date. I wrote about that here: https://www.linkedin.com/feed/update/urn:li:activity:7379002277652717569/
We're also starting to see better and better results from Claude Code and Cursor internally. We're experimenting with more context engineering, which is yielding better results. This was sort of the "bible" we worked from: https://github.com/humanlayer/advanced-context-engineering-for-coding-agents/blob/main/ace-fca.md
See some recent discussion here, of some relatively simple non-code agentic uses of Claude Code:
https://github.com/paradite/claude-code-is-all-you-need
https://news.ycombinator.com/item?id=45416228
I'm not impressed with the end result of the first example: https://help.complyflow.com/en/articles/12087493-microsoft-sso-integration-guide
That's a jumbled mess.
If you are the Director of Strategy for a prestige cosmetic brand, I suspect that the difficult (or even most useful) part of your job is NOT to draft the distribution strategy, but rather:
- to persuade your own team-members, who have their own ideas, that their ideas have been incorporated into your strategy;
- to persuade the adjacent Ops & Marketing teams, who have their own ideas and concerns about executing your strategy, that you've listened to their ideas and addressed them in your strategy;
- to persuade Finance to allocate enough budget for your strategy, or to alter your strategy to fit within the budget constraints, or to negotiate something in the middle;
- etc., etc. etc.
The drafting of the strategy itself seems kind of trivial and even unhelpful to me, without the human behind it, who needs to navigate the office politics and human emotions necessary to enact the strategy successfully.
Absolutely. The real craft of strategy isn’t writing the document. It’s getting humans aligned behind it. Influence, empathy, and negotiation often decide whether a strategy lives or dies.
>"It all checked out. I tried this on several other papers with similarly good results, though some were inaccessible due to file size limitations or issues with the replication data provided. Doing this manually would have taken many hours."
But it didn't actually do anything helpful. It just said "it all checks out." Why aren't AI agents going through tens of thousands of papers and unearthing papers with problems?
In practice, even cutting-edge AI tools still need a tremendous amount of guidance to do anything helpful, even on narrow tasks like coding. And they go off the rails in a significant percentage of cases.
I say this as a big user of AI for the now-standard use cases of coding, editing, and search. But regarding the bigger tasks described here, I'm quite skeptical. Like many others, I've grown wary of toy studies and benchmarks. And I'm constantly testing the tools, and they simply don't currently work well for larger tasks.
But that isn't what happened. I said "it checked out," not the AI. It provided detailed methodological critiques and was capable of applying new statistical methods to test the outcome in different ways. No guidance was needed, and the code that it wrote provides an audit trail. See my last post for another example of a critique of one of my own papers.
Right, I understand. But (as the Talmud says) saying "here are 24 reasons why you're right" isn't especially helpful. What's helpful is pointing out that something is wrong.
"See my last post for another example of a critique of one of my own papers." Right, that's very cool, and helpful. Why isn't that being done on a large scale, in the wild, on published papers, to find actual errors? (I think you mentioned the possibility of this in that post, or maybe I saw the idea elsewhere.) To me that indicates that it's not actually capable of this task, in any kind of consistent/repeatable way
Everything here came out in the past few weeks. Academia moves VERY slowly.
Didn't some non-profit, backed by lots of money, say that they're going to use AI to check published papers? It sounds like something that could be set up and iterated on relatively rapidly.
I must be missing something: the Stata files for replication are all readily available along with the data. I could replicate it with almost no effort. The AI can certainly check results against the paper much, much faster than I can, but I'm not impressed that it can convert Stata files into Python code at this point.
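For context, the translation itself is fairly mechanical. Here is a rough sketch of a typical Stata replication command and its approximate Python equivalent; the file and variable names are hypothetical.

```python
# Rough Python equivalent of a typical Stata replication step (hypothetical names).
# Stata:
#   use "analysis_data.dta", clear
#   regress outcome treatment age income, robust
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_stata("analysis_data.dta")  # pandas reads Stata .dta files directly

# Same OLS model with heteroskedasticity-robust (HC1) standard errors,
# which corresponds to Stata's ", robust" option.
model = smf.ols("outcome ~ treatment + age + income", data=df).fit(cov_type="HC1")
print(model.summary())  # compare coefficients and SEs against the paper's tables
```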
As these models continue to improve, my concern is more cognitive offloading by students at the expense of critical-thinking and deep-reading skills. My professional organization, CCCC, is the most prominent in academic writing, and they seem to have taken a "hard no" approach I don't share.
I'm searching for a methodology that brings AI into the classroom without harming students' learning. At least we can now say the new testing does not fall prey to benchmark contamination.
That’s such an important concern. The challenge isn’t whether students use AI, but how they use it. Maybe the goal should be to teach students to think with AI. Using it to question, not replace, their reasoning.
Having your own bespoke toolkit is where I see AI-enabled workers going. Building GPTs with deep complexity and knowledge for specific tasks, then passing that information into a coding assistant (or another GPT you've created as an assistant) to kick out templates, massively speeds up the individual. At least it has in my case. I can confidently say that AI has given me time back in my day that I'd otherwise spend shuffling through various Confluence tabs and other ERP systems.
Thank you - what a great analysis of the progress made and an accurate reflection of the way one should use AI these days. I just discussed this blog post with my GPT-5 and we both agreed that indeed this is how we're working together most productively.
To conclude, I asked GPT-5 to create a baiku (business haiku) that captures this post - enjoy :)
Agents take the wheel,
Tasks bend, jobs reshape, yet still—
Value is our choice.
Not surprising that AI scores well on symbol-manipulation tasks like software development or inventory. But it's less obvious (and just a bit disturbing) that it excels at managing human activity (front-line supervisors).
The examples I saw regarding front-line supervisors were just data manipulation and creating forms, slideshows, or documents.
Just want to chime in that I had a real solid laugh at the line “that is too many PowerPoints.” The delivery was perfect. Total mirth.
But your article's main point is actually very validating to me. I have been feeling crazy in the face of the naysaying that predominates online. I've been using Codex with GPT-5 and Claude Code in my terminal for the last month for my personal projects, and it's clear to me that serious economic value could easily be generated with this relatively inexpensive combo.
I'll be writing about my experience with it soon. There have been so many surprises. And a nonzero number of significant wins.
Thanks for taking away my crazy pills. (Or popping one with me.)
Heard about this on AI Daily Brief and glad you wrote about it. As a lawyer and law professor, I'm very interested in seeing the answers to the "Lawyers" questions and sharing them with my class. But those fields seem to be blank. Has OpenAI made them available?
They are here: https://huggingface.co/datasets/openai/gdpval/viewer/default/train?p=1&row=116
The replication experiment would be more impressive if it was able to identify a paper that was not replicable. Did you feed in that one Harvard Business School prof's work?
Read my last post where the AI identified a (fortunately minor) error in my own research.
Do you mean Francesca Gino? If so, her research involved actual human participants. Therefore, it cannot be replicated in software using AI. It could only verify that the analysis code is correct.
You would still need data from real people for the replication.
Yes, I was referring to Gino. And you are right, replication is not the right word. I meant detecting errors, gaps, etc., which was my sense of what was being suggested in the original post: "it appears that AI could check many published papers, reproducing results, with implications for all of scientific research"
And while I didn't follow the Gino story closely, I thought it started with a graduate student raising questions based on the published data.
I still think a big mistake GenAI products make is chasing use cases where precision is non-negotiable.
A financial model that is 70% "accurate" is essentially 100% useless. If an LLM hallucinates one formula, I have to re-check every cell. What am I accelerating exactly if I have to rework the whole thing?
The machine is good at scaffolding work, so maybe I save 10% on structure. OK. But that is a far cry from the 50%+ productivity gains some people claim (for specific use cases).
GenAI is a probabilistic machine. It thrives in brainstorming, summarising, that sort of stuff.
In domains where a single wrong number kills credibility (a few days ago I posted this https://substack.com/@themanagementconsultant/note/c-159551491 about a Big4 consultancy report that contained LLM-generated hallucinations...) it collapses under scrutiny.
What happens if you ask it, in a new context window, to specifically show that the result is not replicable? Will it successfully show that too, or will it deny the possibility and explain why only replication is possible?
What a great line: "If we don't think hard about WHY we are doing work, and what work should look like, we are all going to drown in a wave of AI content." The breadth of available tools makes the field of possibility so expansive—yet if we simply transplant our current ways of working into this new world without reimagining our processes, we risk reaching a point where AIs create content that other AIs consume, all to make people feel productive. I hope we dare to think of different uses for a world powered by different tools.
have you heard about this upcoming conference where agents are meant to drive science completely? https://agents4science.stanford.edu
Beyond replication tasks, agents are also going to enable so many extensions and follow-ups, which are critical to really flesh things out but often get put on the back burner in favor of more novel or important work.
The work I submitted for this conference extended a paper I wrote on quantum semantics, trying to figure out how we could set it up properly on a quantum computer. My agent divided the task among multiple sub-agents, which came up with experiments and worked until they had finished their sub-tasks, receiving reviews and guidance from the main agent every ~5 turns so they didn't get stuck for too long. The main agent then wrote everything up, sourcing references through Semantic Scholar search and iteratively editing a LaTeX document, including figures the sub-agents made in their experiments.
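For anyone curious what that orchestration looked like structurally, here is a very rough Python sketch of the pattern (a main agent delegating sub-tasks and reviewing every ~5 turns). The `call_llm` helper and the function names are placeholders, not any particular framework's API.

```python
# Hypothetical sketch of the main-agent / sub-agent loop described above.
# `call_llm` stands in for whatever model API you use; it is not a real library call.
from typing import Callable

REVIEW_EVERY = 5  # main agent reviews each sub-agent roughly every 5 turns

def call_llm(role: str, prompt: str) -> str:
    """Placeholder for an actual model call (substitute your own client)."""
    raise NotImplementedError

def run_subagent(task: str, review: Callable[[str], str], max_turns: int = 25) -> str:
    history = f"Sub-task: {task}"
    for turn in range(1, max_turns + 1):
        history += "\n" + call_llm("sub-agent", history)
        if "SUBTASK COMPLETE" in history:
            break
        if turn % REVIEW_EVERY == 0:
            # Periodic guidance from the main agent so the sub-agent doesn't stay stuck.
            history += "\nReviewer: " + review(history)
    return history

def main_agent(paper_goal: str, subtasks: list[str]) -> str:
    review = lambda transcript: call_llm("main-agent", f"Review and redirect:\n{transcript}")
    results = [run_subagent(t, review) for t in subtasks]
    # Main agent writes up the combined results (e.g., iteratively editing a LaTeX draft).
    return call_llm("main-agent", f"Write up results for: {paper_goal}\n" + "\n---\n".join(results))
```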