Perfect timing! This will be the first post students read next semester for a one-credit class called Prompting Curiosities 😊. I'm struggling to find those 10 hours, so embedding it into a class seemed like a fun way to get it done. Just me, 15 students, and the university's AI system, which has most of the LLMs in 3-4 versions. We will start with simple prompts across different LLMs, then, as each student finds their favorite, they'll choose one thing as their final project and work on it. All in all, it should produce at least 20 per person, which will help me understand these much better moving forward!
You are a great teacher doing this for your students!
This is an absolutely WONDERFUL class, Mickey. Such thoughtfulness. And zero boundaries: the perfect way to learn and express oneself. You are a properly progressive educator and for that you have my undying love.
An excellent essay, interesting and intelligible. Very little writing about AI and LLMs is as lucid.
Loved this. Thanks, Ethan. In my research projects using LLMs for forecasting, we use Monte Carlo simulation in the methodology. We ask the same query 100 times (and the current election simulations will run 1,000 times until it's done) for precisely the reason you've said: the LLM is a probabilistic language model. If you can repeat the trials, then presumably those random errors will cancel out and you get an average that should represent the “true” response, I would think.
I struggle a bit with the methodology of AI in social science research myself. When we do 100 samples, we are sampling the probabilities of that prompt alone, but prompt variations would have different outcome probabilities. This can be quite significant; for example, adding or removing chain of thought changes outcomes. Handling this is hard. In the CS literature, researchers often do "ablation tests," removing parts of the prompt to see what happens, and sometimes test automatically generated prompt variations. There isn't really a standardized method yet.
Also, I do this too. In our ElectionGPT study, we ask the same prompt four different ways using four different journalist voices. My methodology follows five different principles, but one of them is to vary the prompt slightly and repeat each variation 100 times. We use "future narratives": each of the four journalists in this study announces the winner of the 2024 election as if it's the night of the election and all the states are in. The first voice is an "independent and trustworthy reporter" (i.e., without any other identifying information), the second is a journalist from Fox News named Brett Baier, then Rachel Maddow from MSNBC, then someone from the BBC.
https://github.com/scunning1975/ElectionGPT
You can look at the Shiny app and see the results, but basically, we run the exact same prompt 100 times per voice, and each voice differs by only four words: the name and the media company they work for. From each journalist's announcement we get the state winners, convert each state to its electoral college votes, aggregate to a total electoral college count, and repeat 100 times to get a distribution of votes for a given voice-trial-day. Then we plot the mean for each of the four voices.
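To make the arithmetic concrete, here is a stripped-down Python sketch of that aggregation step; the electoral-vote table is truncated and the announce_state_winner helper is a stand-in for the actual LLM call, not our real code:

```python
import random
from collections import defaultdict

# Hypothetical electoral-vote table (only a few states shown for brevity).
ELECTORAL_VOTES = {"TX": 40, "CA": 54, "FL": 30, "PA": 19}

def announce_state_winner(voice: str, state: str) -> str:
    """Stand-in for one LLM call: the named journalist 'announces' the state winner."""
    return random.choice(["Republican", "Democrat"])

def one_trial(voice: str) -> dict:
    """Tally electoral college votes for a single trial of a single voice."""
    totals = defaultdict(int)
    for state, ev in ELECTORAL_VOTES.items():
        totals[announce_state_winner(voice, state)] += ev
    return totals

def simulate(voice: str, n_trials: int = 100) -> list:
    """Repeat the trial to get a distribution of electoral vote totals per party."""
    return [one_trial(voice) for _ in range(n_trials)]

results = simulate("Brett Baier, Fox News")
mean_rep = sum(r["Republican"] for r in results) / len(results)
print(f"Mean Republican electoral votes across trials: {mean_rep:.1f}")
```

Running simulate once per journalist voice and comparing the resulting distributions gives the per-voice means we plot.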
We are basically trying to do this "ablation test," I think, to see what is fundamentally associated with the posterior forecast: we take extreme positions on media bias, conservative and liberal, while sticking with people we believe are real people in the training data. We had done something like this with our earlier paper, but the prompts were so different that we couldn't be 100% sure what was causing the differences.
But I do think that the Monte Carlo concept for a given prompt should, in theory, cause the random noise to cancel out, assuming the probability model underlying the LLM is not shifting over the minutes it takes to run the trials, since they are spaced out as they run in a loop through the OpenAI API. But I suspect it isn't shifting.
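For a concrete picture of that loop, here is a minimal sketch of repeatedly sampling one fixed prompt through the OpenAI Python client; the model name, prompt, and temperature are placeholders rather than our study's actual settings:

```python
from collections import Counter
from openai import OpenAI  # official openai Python package

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT = "You are an independent and trustworthy reporter..."  # placeholder prompt
N_TRIALS = 100

answers = Counter()
for _ in range(N_TRIALS):
    # Exact same prompt every time; only the sampling randomness varies.
    resp = client.chat.completions.create(
        model="gpt-4o",  # placeholder model name
        messages=[{"role": "user", "content": PROMPT}],
        temperature=1.0,  # assumed; the study's setting may differ
    )
    answers[resp.choices[0].message.content.strip()] += 1

# The empirical distribution of responses is the Monte Carlo estimate;
# its average (or modal) answer is the "central tendency" of the prompt.
for text, count in answers.most_common(5):
    print(count, text[:80])
```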
Your intuition makes sense on the Monte Carlo approach.
Yes, totally, but what I'm saying is that I use the exact same prompt, so there is no variation in the prompt itself. So presumably, assuming there is no real change in the underlying structural probabilities associated with that block of prompted text (and I can't see why the errors wouldn't average out), you can use Monte Carlo to get some central tendency of the prompt itself. That's how we are trying to pin down the forecasts and get a "true mean," so to speak.
This makes me curious about something: in what scenarios is Monte Carlo easier than analyzing the likelihood of word paths directly?
The LLM is probabilistic, but the probabilities themselves are fixed and known. Maybe it depends on what you want to analyze about the text: if you want to understand the most likely path(s), or compare among specific likely paths, then you might be able to work with the probabilities directly through many fewer queries. But if you want to consider much less likely outcomes, there will be too many permutations to go through, and Monte Carlo is simpler (although it could take a very large number of repetitions to trigger rarer outputs).
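To make that trade-off concrete, here is a toy sketch in pure Python with a made-up next-token distribution (not a real model): exact enumeration works when you only care about a handful of likely paths, while Monte Carlo is simpler once the path space explodes or you care about rare outputs.

```python
import itertools
import math
import random

# Toy next-token distribution: each step independently picks one of these tokens.
# A real LLM's distribution is conditional on the prefix, but the counting problem is the same.
TOKENS = {"A": 0.6, "B": 0.3, "C": 0.1}
STEPS = 3

# Exact enumeration: probability of every possible path (len(TOKENS) ** STEPS of them).
exact = {
    seq: math.prod(TOKENS[t] for t in seq)
    for seq in itertools.product(TOKENS, repeat=STEPS)
}
best = max(exact, key=exact.get)
print("Most likely path:", best, "p =", round(exact[best], 3))

# Monte Carlo: sample paths and count how often a rare outcome appears.
N = 100_000
rare = ("C", "C", "C")
hits = sum(
    tuple(random.choices(list(TOKENS), weights=list(TOKENS.values()), k=STEPS)) == rare
    for _ in range(N)
)
print("Estimated p(CCC):", hits / N, "vs exact:", exact[rare])
```

With three steps the enumeration is trivial; with a vocabulary of tens of thousands of tokens and dozens of steps it grows combinatorially, while the cost of one Monte Carlo sample stays fixed.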
Did you analyze those options and have a framework for choosing one?
Well, I'm not sure it would work super easily in our case, but maybe it would. We are trying to forecast using ChatGPT directly, both with and without giving it new information. We found that we couldn't get much if we asked for direct predictions, but if we had it tell a story set in the future about past events, then, whatever was causing it to refuse direct predictions, it had no trouble making them in the context of these future narratives. We wrote it up here:
https://arxiv.org/html/2404.07396v3
And we also have an active election simulation doing this here:
https://github.com/scunning1975/ElectionGPT
I don't know whether, given that we are asking for an entire story to be told, I could easily trace out the LLM's probabilistic outcomes using known probabilities; it cascades into a probabilistic tree. We have to use multiple LLMs to get it to tell the stories: in some cases it reads 100 newspaper articles first, then tells stories set in the future about past events just to get the predictions, and then a different LLM retrieves the prediction and stores it. With large numbers of trials it becomes an empirical mean. Granted, 100 trials is not a large N, but it's more a proof of concept at this point. Over the next two weeks we are increasing it to 1,000 trials and reading more news first.
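For anyone curious what that multi-model chain looks like, here is a highly simplified two-stage sketch of the pattern (a storyteller model, then a second model that extracts the prediction); the model names and prompts are placeholders, not our actual ElectionGPT pipeline:

```python
from openai import OpenAI  # official openai Python package

client = OpenAI()

def tell_future_narrative(voice: str, news_digest: str) -> str:
    """Stage 1: a storyteller model writes an election-night report in a journalist's voice."""
    prompt = (
        f"You are {voice}. It is the night of the 2024 election and every state has reported. "
        f"For context, here is recent news:\n{news_digest}\n"
        "Write your election-night report, naming the winner of each state."
    )
    resp = client.chat.completions.create(
        model="gpt-4o",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

def extract_state_winners(story: str) -> str:
    """Stage 2: a different model reads the story and pulls out structured predictions."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{
            "role": "user",
            "content": "List each US state mentioned in this story and the party that won it, "
                       "one 'STATE: PARTY' pair per line.\n\n" + story,
        }],
    )
    return resp.choices[0].message.content

story = tell_future_narrative("Brett Baier of Fox News", "(summaries of ~100 recent articles)")
print(extract_state_winners(story))
```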
Super cool projects! I'm not sure how much my idea applies or would help. One way (for ElectionGPT) could be to use logprobs and top_logprobs in the API, and then whenever it outputs "Trump" you check the odds for it and for "Harris". Same for "Republican" and "Democrat", or any other terms you check in the output. Maybe those relative odds would be informative in some way, adding a bit more info on top of the binary choices.
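Something like this minimal sketch of the idea, using the logprobs/top_logprobs options in the chat completions call; the model name is a placeholder and the token handling is simplified (a name like "Harris" may span more than one token, so a real check needs a bit more care):

```python
import math
from openai import OpenAI  # official openai Python package

client = OpenAI()

resp = client.chat.completions.create(
    model="gpt-4o",  # placeholder model name
    messages=[{"role": "user", "content": "In one word, who wins the 2024 US election: Trump or Harris?"}],
    max_tokens=1,
    logprobs=True,
    top_logprobs=5,  # also return the 5 most likely alternatives for each output token
)

# Relative odds among the alternatives the model considered for its first output token.
first_token = resp.choices[0].logprobs.content[0]
for alt in first_token.top_logprobs:
    print(f"{alt.token!r}: p = {math.exp(alt.logprob):.3f}")
```

Comparing the probability mass on "Trump"-like tokens versus "Harris"-like tokens would give a continuous signal on top of the binary winner calls.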
Your point on “do not use LLMs as you would Google” is true for chatbots but begs for another post from you on AI search engines and RAG. The former (Perplexity, for example) shows that marrying an LLM with a web index lets you use it like Google with great results. With the latter (e.g., knowledge files, NotebookLM), people query their sources the way they query the internet during search. Might make for a nice part 2 to this.
True, but formulating your search as a natural question is more important when using Perplexity.
It depends on the depth of the query. I agree that framing queries as natural questions in Perplexity is good for deeper searches, but you can also get great results by merely typing exactly what you would type into Google Search. Try searching for “NFL” or “MSFT vs GOOGL stock comparison” in Perplexity to see what I mean.
Hi Ethan, I'm not sure how I came to read your Substack, and I suspect I may be atypical in that I'm 72 and barely computer literate. Very basic Excel is my limit. I have found your writing fascinating and have introduced several octogenarians to LLMs by forwarding your Substack posts.
That is all by the bye, however. I hope you can advise me which LLM I should use for a project (so far I have only used Perplexity). I am lucky enough to have had a rather unusual life and want to get it down in writing, for my family at least, before it drifts away, as appears already to be happening. I had thought that, if it was sufficiently readable, I might eventually publish it episodically on Substack, but first I want to start getting it into shape, and I wondered whether Claude would be better than Perplexity, or if you have a better suggestion.
Claude 3.5 is certainly the smartest LLM but it's very uptight and logical, like Spock, and generally not a good writer. Claude 3 Opus is on the other side; it's a good writer but is too dreamy and doesn't stay grounded.
ChatGPT 4o is the best for general tasks.
Use Claude
Are there any papers showing or proving that LLMs do not "copy" parts of their training content to generate new texts? It is quite clear how they work, but many people, especially in the creative sectors, are convinced of the opposite.
The New York Times would die for this type of study!
I’d like to share an experiment I’ve been running with ChatGPT. I’ve been using it to write tailored cover letters and resumes for each job application I send out. It's a tedious process and takes quite an emotional toll, so having this tool really lightens the load. I’ve probably spent well over 10 hours doing this so far.
I started by feeding the chat older versions of my resume, and that’s when I discovered there’s a "memory" feature where it starts remembering key details. The issue is that it keeps adding things to this memory, some of which are incorrect, so I have to periodically clear and edit it.
What I do is give it the job posting as is, and then ask it to create a cover letter. Sometimes I also give it specific instructions on details I want included.
I’ve noticed that feeding it multiple job listings can lead to "hallucinations," where it includes things from previous applications that I never actually did, but were part of past job descriptions.
Another interesting thing happened when I made a correction—it once popped up with a note saying “remember that…,” in the same format as when it stores things in memory. It made me think that there’s some kind of secondary agent, another AI, guiding the main model's work.
I believe what I’m doing now is something that job boards should be doing. Based on the data you enter in your profile, which is a lot, they could automatically generate applications for positions you’re interested in. And the same goes for companies—based on their need to fill a position, they could automatically generate job postings and even handle matching, like a virtual matchmaker. And not just for people actively looking for a job, but also for those open to switching jobs. I think this could really revolutionize the job market.
AI cannot be entirely just predicting words from previous words. Ask it to add two long floating point numbers. It will get the right answer in spite of the impossibility of it ever having seen those numbers before.
"Predicting words from previous words" is the interface not the implementation. The limits of the implementation are the limits of transformer models.
It's not very good at math though. "Which of 5.8 and 5.12 is larger" is a question they often get wrong.
Still true with ChatGPT 1.0?
That name makes way too much sense to be an OpenAI product name. You have to choose between "4o" and "o1".
The second can do it more reliably. The first can't; there is ongoing research into this but one of the issues turns out to be that it thinks you're talking about bible verses, and "5:12" does come after "5:8". (If you use the number 9.11 in a math problem, it also gets distracted because it's been trained on news articles.)
https://transluce.org/observability-interface
This was really informative and useful. Adding to my module reading list.
On discussing tokens, not only are words and syllables tokenized, but phrases, idioms and concepts are tokenized as well. If you ask a concrete question, I find you get a concrete answer. But if you ask about concepts (mechanism of action, for example), you get a more interesting answer. The other thing you have pointed out is that people with a richer vocabulary tend to get more nuanced and precise answers. Wonderful summary of your work, BTW!
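If you want to see exactly what becomes a token, here is a minimal sketch using the tiktoken library; cl100k_base is the encoding used by GPT-4-era OpenAI models, and the exact splits should be treated as illustrative:

```python
import tiktoken  # pip install tiktoken

enc = tiktoken.get_encoding("cl100k_base")

for text in ["mechanism of action", "antidisestablishmentarianism", "5.8 vs 5.12"]:
    token_ids = enc.encode(text)
    pieces = [enc.decode([i]) for i in token_ids]
    print(f"{text!r} -> {pieces}")
```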
Thanks, Ethan. As to “explaining the magic happening on stage,” I interacted with ChatGPT on this and it offers the following: “Geoffrey Hinton argues that large language models (LLMs), such as ChatGPT, do more than just predict the next token or word in a sequence based purely on patterns. To make accurate predictions, these models must implicitly understand the relationships and context of the data they’ve been trained on. Hinton posits that for LLMs to perform well, they effectively develop a form of “world model”—an internal representation that allows them to grasp causality, irony, and other complex aspects of language and meaning. This internal model is necessary to process and predict text in a way that mirrors human reasoning, suggesting that these AIs possess a degree of understanding that goes beyond simple statistical matching.
Hinton’s argument challenges the traditional view that AIs are merely advanced statistical tools. He suggests that their ability to generalize from data, make inferences, and combine knowledge from different contexts reflects an emergent form of reasoning that could be equated to understanding.”
I'm generally not impressed when I hear Hinton say things. It's not clear what a world model is, and using a reasonably strong definition they probably don't exist.
That concept is left over from '70s AI research, which had a big problem with "metaphor-based computing": they would, e.g., claim that people could be seen as metaphorically having something called a "world model," then forget it was a metaphor and decide people literally have a part of their brain that's a "world model," then decide that, therefore, if they programmed something on a computer that could also metaphorically be called a "world model," it would create AI.
The second is not true because such a model isn't worth the effort to update unless you actually need it, and brains have to be lazy to save calories. People are capable of living their lives without much modeling - for instance you can generally take out the trash without knowing what's in the bag. The third can be seen to have failed because we don't use any products of GOFAI research.
(This critique comes from Phil Agre / David Chapman.)
Excellent article. Only one thing sounded odd to me. You said "Additionally, once the AI has written something, it cannot go back, so it needs to justify (or explain or lie about) that statement in the future." How does a token prediction system feel a "need to justify" itself? It's really hard not to anthropomorphize. I struggle with this myself.
I have largely given up on entirely avoiding anthropomorphizing. I wrote about this in my book - it is an unavoidable sin unless you want to put every word in scare quotes.
One reason it might be bad at correcting itself is that it doesn't have many training examples of it correcting itself. o1-preview certainly has different skills in this area than other products though.
I've mostly been using Claude recently, and when I catch it saying something that is logically impossible, I simply ask it if that makes sense. It will go through a short logical process, find where it went wrong, apologize, and give recommendations for changing or ignoring whatever it told me. Early on, both Claude and ChatGPT would try very hard to justify their answers. I'm assuming they changed something, but I'm not sure.
Claude 3.5 is much smarter than the previous version and still better than the best GPT.
Problem is, if you ask a model to rethink a correct answer, it might switch to an incorrect one.
Great stuff, Ethan. Thanks for doing the work (illustrating the tokenizer, posting hyperbolic fridge notes, and invoking Cordwainer Smith).
Someday we'll be less puzzled to see that a creative, if limited, intelligence emerged from an autocomplete engine. But, well, it has.
Thank you for inspiring my article for today!
https://rayhorvaththesource.substack.com/p/how-does-chatgpt-work-what-is-chatgpt
Here are my thoughts on "Thinking Like an AI":
Collocational arrangement is specific to every language, and it also involves parallel processing.
How about culturally-specific implications? There are several cultures and subcultures even among those who, allegedly, speak the same language.
These systems are already linked to the central AI processing live data from all over the world, and literally nobody knows what is happening in its multidimensional multi-recursive system. The ChatGPTs are only terminals:
https://rayhorvaththesource.substack.com/p/ai-makes-the-world-go-round
This setup is holding its intrinsic dangers, and it doesn't look like it's going to end well:
https://rayhorvaththesource.substack.com/p/how-will-the-globalists-game-end
You can find the book for CPAs here:
https://www.amazon.com/dp/B0DJK66MC6?ref=ppx_yo2ov_dt_b_fed_asin_title