It was humbling when I first used ChatGPT to review a peer review I had written. The article under review presented the efficacy of a health device based on a randomized controlled trial sponsored by the device's manufacturer. Of course I was alert to bias, and discovered a few minor instances. But the LLM mentioned, almost in passing, a discrepancy between the control and treatment conditions that had been worded so slyly as to evade human detection. Pulling on the thread the LLM exposed uncovered a deceptive practice that invalidated the authors' conclusions, and (after they protested, "This is the way it is always done!") called into question a large body of previous sponsored research.
[ If you are curious: the researcher required subjects to "follow the manufacturer's instructions" for each device. In practice, treatment group subjects were told to comply with the ideal duration, frequency and manner of use specified in the printed instructions. But control group subjects were given a simpler competing device that offered no enclosed instructions and thus were given no performance requirements at all for participation in the research. ]
Dov, wow, that's quite a revelation!
I must say that your experience highlights the importance of thorough scrutiny, especially in sponsored research. It's a great reminder to dig deeper and question even the smallest discrepancies.
Thanks for sharing this insightful story—it's a valuable lesson for anyone involved in research and peer reviews.
In my personal experience, this aspect is key: "[M]ore researchers can benefit because they don’t need to learn specialized skills to work with AI. This expands the set of research techniques available for many academics."
For the first time in my life, over the past year, I've been able to do serious text analysis on relatively large texts, with Python. (Specifically, on the Talmud, which is ~1.8 million words.)
The barrier to entry for doing meaningful coding is now far lower.
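To give a sense of how low that barrier now is, here is a minimal sketch of the kind of first-pass analysis an LLM can help a non-programmer write. It assumes the corpus is available as a single plain-text file; the filename is a hypothetical placeholder.

```python
# Minimal word-frequency analysis of a large text corpus.
# Assumes the corpus is one plain-text file; the filename below is a
# hypothetical placeholder.
import re
from collections import Counter

CORPUS_PATH = "talmud_plaintext.txt"  # hypothetical filename

def tokenize(text: str) -> list[str]:
    # \w+ matches Unicode word characters, so Hebrew/Aramaic tokens are kept.
    return re.findall(r"\w+", text.lower())

with open(CORPUS_PATH, encoding="utf-8") as f:
    tokens = tokenize(f.read())

counts = Counter(tokens)
print(f"Total tokens:    {len(tokens):,}")
print(f"Distinct tokens: {len(counts):,}")
for word, n in counts.most_common(20):
    print(f"{word}\t{n:,}")
```

Nothing here is specific to the Talmud; point it at any corpus and the same few lines give a first statistical look, which is exactly the kind of scaffolding an LLM will now write (and explain) on request.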
True, with AI the skill is more and more removed from the equation.
Excellent article. May I propose a 5th? (Though I’m not thinking this through as deeply as you)
AI can also change the visibility of published research. For example, the AI-powered "answer engine" Consensus (no affiliation) is a search engine that answers questions using published research.
Prior to this, as a parent I might ask a question like "does artificial food colouring trigger ADHD in children?" on Google, and the results I saw would be dominated by a mixture of big media and whackadoo bloggers with insane conspiracy theories. Credible research could barely be seen.
Today I can ask Consensus and see an answer based only on published research. Factors like each paper's citation count and the influence of the publication are also highlighted.
As a user I appreciate having somewhere to turn for a credible answer to an important question. But I think this must be a good thing for the work of the researchers as well, right?
100%! I do a lot of market research, and part of that effort involves reading a number of papers from various sources, which can be time-consuming. Now I can load those resources into an LLM-powered answer engine, then start asking my questions and receiving direct answers grounded in those sources right away. A good answer engine can also cite the specific source of each answer, which adds credibility to the result. This automated reading and answer-finding process really reduces the time my research takes.
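I don't know how Consensus or similar products are implemented internally, but the general pattern (retrieve relevant passages from your loaded sources, then ask the model to answer only from those passages and cite them) can be sketched in a few lines. The `search_papers` helper below is a hypothetical stand-in for whatever search you run over your own documents; the chat call uses the OpenAI Python client as one example of a capable model, not as any answer engine's actual API.

```python
# Sketch of a retrieval-grounded answer flow: find relevant passages in the
# loaded papers, then ask the model to answer ONLY from those passages and
# to cite them. search_papers is a hypothetical placeholder.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def search_papers(question: str, top_k: int = 5) -> list[dict]:
    """Hypothetical retrieval step: return the top_k most relevant passages,
    each as {"source": "...", "text": "..."}."""
    raise NotImplementedError

def grounded_answer(question: str) -> str:
    passages = search_papers(question)
    context = "\n\n".join(f"[{p['source']}] {p['text']}" for p in passages)
    response = client.chat.completions.create(
        model="gpt-4o",  # any capable chat model
        messages=[
            {"role": "system",
             "content": ("Answer using ONLY the provided passages. "
                         "Cite sources in [brackets]. If the passages do not "
                         "contain the answer, say so.")},
            {"role": "user",
             "content": f"Passages:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return response.choices[0].message.content

# e.g. grounded_answer("Does artificial food colouring trigger ADHD in children?")
```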
"In fact, one paper found that research was losing steam in every field, from agriculture to cancer research. More researchers are required to advance the state of the art..."
Unintentionally (?) hilarious! There is a long-standing academic joke that every published paper ends with "...more research is needed," which is understood to be a plea for more grant money to support the researcher (independent of how useful, interesting, or entertaining their research is).
Maybe the number of researchers (and the corresponding amount of money spent) is not the problem?
I would assume that, as with many things, there's an 80-20 dynamic going on:
The best researchers are producing good research, while the majority are producing papers with little value, or even negative value.
For sure. Too many papers (at least in the fields I know something about) are noise, not signal (and I suspect the ratio is much more lopsided than 80-20).
I know this thread is not central to what Ethan is writing about, but turning up the volume (of researchers and/or papers) does not seem like the right way to address the noise problem.
Are LLMs (whether used to edit/review/summarize papers, or potentially to produce them) more likely to be part of the solution or part of the problem? I'm going to guess: part of the problem.
Both.
Most humans use tools (alcohol, social media, AI...) to amplify who they already are.
Dr. Mollick's Custom GPT "But Why Is It Important?" that he linked to is a brilliant example of how AI can speed up the knowledge process. Astounding, really.
I'm a professional communicator and PR practitioner working in a business school. AI is already helping me speed up my editing of academic papers that are not written for a public audience and need translating into clear, accessible language.
Wonderful work, as always. Thank you.
A use that bridges teaching and research: helping PhD students. Last week ChatGPT-4o provided some excellent comments on a 95-page proposal of a PhD student of mine. Then I gave Chat my four pages of comments to the student, and Chat produced (a) a better version of what I'd said and (b) specific methods and data to detail three of my suggestions. All in less than a minute.
Bravo to you for doing the hard work that produced your four pages. It is easy to imagine a professor in your position choosing to skip that work and allowing the LLM to draft that initial review directly from the 95-page proposal, salvaging three hours that might be spent honing the professor's pickleball skills.
In the short term, this shortcut might be taken by lazy, overworked, or even underqualified professors. But in the near future, it is easy to imagine people in your position, comfortable with LLMs, finding no reason to sacrifice pickleball practice to prepare a document that is objectively inferior to one based on a careful reading of every word, cross-referenced to the entire state of the art and current best practices.
At some point it may appear to be hubris for a professor to deliver a personally-written comment sheet.
Although not focusing yet on academic research, what we’re building at https://syntheticusers.com is geared towards the second singularity mentioned.
I think there are a few questions raised here that are nowhere near resolved. We can find other research and Substacks from respected researchers that would refute many of the hypotheses in this article. Full disclosure: I am a fan of your work and have found it very useful.
I work in a ranked and accredited Business School, but I have a professional background and I do not do academic research. To your points #3 and #4, I sit through many research presentations thinking, "Well, yeah, that is blindingly obvious to an experienced manager, but now you have a mathematical formula (that needs to be recalculated in every new instance) to describe it." It is clear that a super-powered probabilistic algorithm trained on all of humanity's "best practice" will come close to mimicking humans in that endeavour.
Regarding #1 and #2, I would argue that if the writing has become so important, we are doing something wrong. I work every year with technology- and science-based start-ups trying to bring real innovations to the world (e.g., how to cross the blood-brain barrier to deliver life-saving treatment to children suffering from "orphan" forms of cancer). Writing is not the problem. Time to write is not the problem. We have accompanied one research team over 10 years, through successive rounds of grants and funding, to arrive at a viable solution. This takes a long time because the science has to be solid.
The current crop of generative AI tools (chatbots, image creators, etc.) does nothing to advance the real value of research. At best, they provide productivity tools for the least valuable parts of the process (eliminating some of the job roles in which future innovators learn). At worst, they amplify the propagation of low-value research that meets the requirement to publish but adds little to the progress of our disciplines and our society.
(I should probably exclude DeepMind from this critique.) The current iterations of "generative AI" (GPT, etc.) are like memes on TikTok: spectacular in their ability to create an immediate "wow," but fundamentally shallow (and often dangerous).
If the current form of generative AI tools is our future, we need to reflect seriously on what we consider "innovation" and "value." So much of what we do is just copy and paste. No surprise that copy-and-paste machines with billions of dollars of compute behind them are better at it than we are.
This is a good critique of the value put on content creation. But that's not the only thing LLMs are good at. As mentioned in the post (and as I stressed in a different comment), LLMs are very good at writing code, which can create significant efficiencies in many valuable research endeavors.
Thanks Ezra and agreed, these tools do offer significant efficiencies. I have been intentionally engaging with my graduates working in data science and they all use LLMs every day to improve their productivity.
Perhaps just to play devil's advocate (but perhaps not): writing code is also essentially content creation. It has been interesting to see studies of how GitHub's Copilot has (apparently) lowered the quality of code on the platform; it is more "efficient" to let Copilot do the initial work than to search for the best piece of code created by a human coder for a specific task.
I think this is the real danger. Whatever LLMs and similar generative AI models do, they are simply recycling standardized best practice (and sometimes malpractice) from previous human endeavour. A new generation of programmers, researchers, business analysts (the list goes on) dependent on gen AI as an "efficiency" tool will not have the opportunity to learn the difference between standardized "best" practice and excellence.
FYI, we have started teaching some math, statistics, and programming subjects with pen and paper: if our brains do not understand the logic behind certain processes, we literally become slaves to the machine.
-Gen AI is improving every month. Think about the "needle in the haystack" test for LLMs with context windows of up to 2 million tokens (a minimal sketch of that test follows below). Think about using it for systematic literature review.
-If gen AI were not "fundamentally shallow," we would have achieved AGI by now. AI is still in its infancy. Come back in 5 years and read these comments again.
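For readers who haven't run one, a "needle in the haystack" test just means burying a known fact at some position in a very long document and checking whether the model can retrieve it. Here is a minimal, model-agnostic sketch; `ask_llm`, the filler sentences, and the passphrase are all hypothetical placeholders, not any particular vendor's tooling.

```python
# Minimal needle-in-a-haystack check: hide a known fact (the "needle") at a
# chosen depth inside a long filler text (the "haystack"), then see whether
# the model's answer contains it. ask_llm is a placeholder for a call to
# whatever long-context model you want to test.
import random

NEEDLE = "The secret passphrase for this experiment is 'blue heron 42'."
QUESTION = "What is the secret passphrase mentioned in the document?"

def build_haystack(filler_sentences: list[str], n_sentences: int, depth: float) -> str:
    """Assemble a long document and insert the needle at `depth` (0.0 = start, 1.0 = end)."""
    body = [random.choice(filler_sentences) for _ in range(n_sentences)]
    body.insert(int(depth * n_sentences), NEEDLE)
    return " ".join(body)

def ask_llm(document: str, question: str) -> str:
    # Placeholder: send `document` plus `question` to your long-context model.
    raise NotImplementedError

def run_test(filler_sentences: list[str], depths=(0.0, 0.25, 0.5, 0.75, 1.0)) -> None:
    for depth in depths:
        doc = build_haystack(filler_sentences, n_sentences=20_000, depth=depth)
        answer = ask_llm(doc, QUESTION)
        found = "blue heron 42" in answer.lower()
        print(f"needle at depth {depth:.2f}: {'retrieved' if found else 'missed'}")
```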
Two words that threaten to negate all the benefits mentioned: informational pollution. It's an issue that needs to be resolved, and soon.
Hello, great insights as ever. My colleagues and I recently put out a preprint on the implications of LLMs and GenAI for Open Science that engages many of these themes:
Hosseini, M., Horbach, S. P. J. M., Holmes, K. L., & Ross-Hellauer, T. (2024, May 24). Open Science at the Generative AI Turn: An Exploratory Analysis of Challenges and Opportunities. https://doi.org/10.31235/osf.io/zns7g
In particular, we talk about enabling "meaningful" (rather than just "open") access, the potential opportunities/pitfalls for the democratisation of coding and data analysis, the streamlining of open science documentation practices, and enhancing dialogue across knowledge systems, as well as potential issues for the integrity, equity, reproducibility, and reliability of research. If you had time to read it, feedback would be really welcome!
This is admittedly a half-baked idea, but a seemingly underrated use of generative AI is to brainstorm hypotheses…and even to train oneself to separate “good” hypotheses (i.e., well-founded) from “bad” ones (i.e., ill-formed, or “non-novel”).
It seems like this relies quite heavily on the probabilistic—or stochastic—nature of how these tools work, and to that end, it almost seems like this is something that generative AI will become *less* useful for as the tools become *better* (that is, hallucinate less, give consistent answers more…and approximate something that looks more deterministic).
Excellent post!
AI will increasingly be used to meet deadlines by those looking for shortcuts and, as you said, by those using it as an assistant. It will create some situations in the future that could have been avoided, but overall the impact will be positive. The quantity of research will increase significantly, and even if we keep the same percentage of good research as we have today, it will be a net positive for society.
The research singularity that interests me most is the one that LLMs pose for linguistics, which cuts across 2, 3, and 4. For example, last year Cambridge published a small book, Copilots for Linguists, which is about prompt engineering for linguists who want to investigate construction grammar using LLMs. The book assumes that LLMs deal with language in a way that usefully approximates human usage. That's certainly not a universal assumption among linguists. In fact, some linguists strongly oppose such an assumption; I'm thinking particularly of linguists strongly influenced by the work of Noam Chomsky.
Chomsky was certainly the major name in linguistics and linguistic theory in the second half of the previous century, and his theoretical schools dominated the field, especially in North America. But that dominance has been steadily waning. To the extent that the underlying assumptions of Chomskyan theory are correct, LLMs shouldn't work at all. So they represent an embarrassment for it.
Of course, it's not a simple matter of an all-or-nothing binary choice. There are complications and nuances. But LLMs are forcing the issue. Lex Fridman had an interesting podcast with Ted Gibson, a psycholinguist at MIT. I've got a blog post where I have that video and transcriptions of the parts of the conversation that bear most directly on linguistics: Lex Fridman talks with Ted Gibson about language, LLMs, and other things [i.e. Linguistics 101 for LLMs], https://new-savanna.blogspot.com/2024/04/lex-fridman-talks-with-ted-gibson-about.html
Excellent post. I do have a dumb question as a follow-up, hoping to get a better understanding. My understanding is that, at their core, LLMs / generative AI models are essentially highly sophisticated statistical algorithms. If that is true, why do we say these models hallucinate? That seems somewhat counterintuitive. Maybe this is what Prof. Mollick has covered in his Singularity #4, but I am not sure.
If the training data is biased towards specific topics or perspectives, the model might hallucinate when asked about less-represented topics:
Prompt: "Tell me about the contributions of African scientists in the 19th century."
LLM Response: "There were no significant contributions from African scientists in the 19th century."
This response reflects a gap or bias in the training data, leading the model to incorrectly assert a lack of contributions.
The LLM could have said "I don't know" but it is trained to not provide this kind of answer.
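One complementary way to see it, beyond gaps in the training data: a purely statistical generator samples a plausible next token at every step, with no separate check against a store of facts, so fluent-but-wrong output is a natural failure mode. A toy sketch, with invented probabilities:

```python
# Toy illustration of statistical generation without grounding: the "model"
# samples each next word from a learned probability distribution, so it will
# happily complete a sentence whether or not the completion is true.
# All probabilities below are invented for illustration.
import random

next_word_probs = {
    ("the", "capital"): {"of": 1.0},
    ("capital", "of"): {"australia": 1.0},
    ("of", "australia"): {"is": 1.0},
    ("australia", "is"): {"sydney": 0.6, "canberra": 0.4},  # fluent, often wrong
}

def sample_next(context: tuple) -> str:
    words, weights = zip(*next_word_probs[context].items())
    return random.choices(words, weights=weights, k=1)[0]

sentence = ["the", "capital"]
while (context := (sentence[-2], sentence[-1])) in next_word_probs:
    sentence.append(sample_next(context))

print(" ".join(sentence))
# Prints "the capital of australia is sydney" about 60% of the time:
# confidently worded, statistically plausible, factually wrong.
```

Whether we call that "hallucination" or just "ungrounded statistical inference" is largely a naming question; the behaviour is the same.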
Thank you. That was very helpful. But thinking about it again: at the core of it is essentially the training data (its absences or biases) and the accompanying algorithm that nudges it to give a response. So when it makes up a response, it is still using statistical inference from a comparable scenario. Is it really hallucinating, then?
Quite the insight!