What is it that they say again? The trend is your friend until the bend at the end.
All jokes aside, great primer for non-technical readers on scaling laws (plural), why AI labs are daring to invest billions of dollars upfront, and what to expect in 2025 and beyond.
Ethan, I always enjoy your posts. They make me read more and learn more. As to the core of this article (what are they doing, how are they growing, and which is doing it better), I find myself wondering about other countries. This is a US-centric review. Do you know what is happening in other countries, especially China?
I usually really like the more optimistic take on your blog.
But today I must disagree. I think the fact that OpenAI baked chain-of-thought into a prompt and sold it as a major innovation could be a sign that the scaling laws are reaching their upper threshold given the current tech.
It seems like the move the autonomous driving companies tried: overselling the tech and netting a lot of resources to actually fund the technological development needed to achieve the autonomous driving goal. It took them 10 years to actually make it work.
The point is: the fact that OpenAI o1 is GPT-4o with native C-o-T prompting could mean that they're actually further from their main goal of an exponentially more capable model across domains.
It really is more than "baked chain-of-thought into a prompt" though. It is more like they took scenarios where there is a right answer (like math, logic puzzles, etc.), likely upped the temperature, and told it to take as much time as it needs before finally producing the answer. Then the model was fine-tuned on those responses that had correct answers (possibly also judging the individual steps as good). So there is an automated loop of the model trying to produce different reasoning steps, including ones that aren't the most likely, and getting reinforced on those that produce correct outcomes (sketched below). Literal RL, like in AlphaGo, versus the normal RLHF.
Not only that, some of the researchers mentioned (on a recent podcast) that after they scaled up thinking time, they started finding instances of the model backtracking on its own: discovering it made a mistake in prior calculations and trying again a different way. Prior to that, it would almost always get stuck in a bad line of thinking and not be able to get out of it entirely on its own.
So it really is a bit of a different thing than just prompting. One limitation of prompting approaches is that if you ask an LLM whether it is sure the answer was correct, it will be more likely to correct a mistake, but a decent amount of the time it'll flip-flop on actually correct answers as well. Using the RL approach makes it more likely it'll catch issues on its own in a way that doesn't lead it away from correct answers.
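A rough, illustrative sketch of the loop described in this comment: sample many chains of thought at high temperature, keep only the traces that reach a verifiable correct answer, and fine-tune on those. The `model.sample_with_cot` and `model.fine_tune` names are hypothetical placeholders, not OpenAI's actual training code.

```python
# All model methods below are hypothetical placeholders for illustration.

def verify(answer: str, expected: str) -> bool:
    """Check a candidate answer against a known-correct one (math, logic puzzles, etc.)."""
    return answer.strip() == expected.strip()

def reasoning_rl_round(model, problems, samples_per_problem=16, temperature=1.0):
    """One round of 'generate, verify, reinforce' on reasoning traces."""
    kept = []
    for problem, expected in problems:
        for _ in range(samples_per_problem):
            # Higher temperature lets less-likely reasoning steps get explored.
            chain_of_thought, answer = model.sample_with_cot(problem, temperature=temperature)
            if verify(answer, expected):
                # Only traces that end in a verified answer are kept;
                # a step-level judge could additionally score intermediate steps.
                kept.append((problem, chain_of_thought, answer))
    model.fine_tune(kept)  # reinforce the successful reasoning traces
    return model
```

The key design point is that the reward comes from an automated check of the outcome, so the loop can run at scale without human labeling of each trace.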
Thank you for the explanation; I understand now that it's more than "baked c-o-t".
That does not deter me from thinking that it's an underwhelming development, one that comes more from a desperate need for money than from true innovation (OpenAI needs $5B as runway money).
Besides, I agree with recent takes by LeCun and also Apple's paper: there's no true reasoning in probability machines, despite the scale. And the fact that people are willing to throw ungodly amounts of resources behind this tech is, at least for me, worrisome.
Note that today's LLM achievements are already good enough for societal transformation, just not enough for a significant step towards reasoning or AGI.
The other big issue for training is where we will find the "good quality" data to train on. We have already exhausted Wikipedia, and Grok is using a self-perpetuating feed of X tweets; could it get worse? I do 🤣 like Grok's response though!
If you check out the Llama 3 paper, they go into a lot of detail on how they augment things with synthetic data. I don't think data will be the big issue in the end.
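For a flavor of what that can look like, here is a hedged sketch of an execution-feedback loop for synthetic code data, in the spirit of what the Llama 3 paper describes; the `model.generate` and `unit_tests.run` names are placeholders for illustration, not the actual pipeline.

```python
def synthesize_code_example(model, problem, unit_tests, max_attempts=3):
    """Keep a (problem, solution) pair only if the generated solution passes its tests."""
    feedback = ""
    for _ in range(max_attempts):
        solution = model.generate(problem, extra_context=feedback)
        passed, error = unit_tests.run(solution)  # execute the candidate code
        if passed:
            return {"problem": problem, "solution": solution}
        feedback = f"Previous attempt failed: {error}"  # let the model self-correct
    return None  # discard examples the model never gets right
```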
Ethan - without your reflections, I think many of us would just throw up our hands in both amazement and frustration…as if we’re trying to catch the kitty fur ball. Thanks for your willingness to share.
Seems like a modest return for a tenfold increase in compute. Are there diminishing returns?
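That is roughly what power-law scaling predicts: loss falls as a small negative power of compute, so each additional 10x buys a bit less. A back-of-the-envelope illustration in Python; the exponent is an assumed, illustrative value (in the ballpark of published scaling-law fits), not a measured one.

```python
ALPHA = 0.05  # assumed compute-scaling exponent: loss ~ compute ** -ALPHA (illustrative only)

def remaining_loss_fraction(compute_multiplier: float, alpha: float = ALPHA) -> float:
    """Fraction of the original loss left after scaling compute by the given factor."""
    return compute_multiplier ** -alpha

print(remaining_loss_fraction(10))   # ~0.89: a 10x compute increase trims loss by roughly 11%
print(remaining_loss_fraction(100))  # ~0.79: the next 10x helps again, but by a smaller absolute amount
```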
The hope is that we will also see more architectural work, of which "thinking" is a start. For example, to count the number of letters "r" in "strawberry", one should simply create and run a Python program (sketched after this comment).
There is a lot of knowledge embedded in available software. A chatbot should, like a human, use the strategies that make the most sense in the given context, and often that means using tools, inspecting outcomes, and adjusting the strategy depending on what was found.
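The "strawberry" case above amounts to the model writing and running something like this; a hypothetical sketch of a tool call, not any particular product's API:

```python
def count_letter(word: str, letter: str) -> int:
    """Count how many times a letter appears in a word."""
    return word.lower().count(letter.lower())

print(count_letter("strawberry", "r"))  # 3
```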
This is an excellent overview! But maybe you’d consider writing something more technical for your readers who want that?
Around this time last year, I wrote a deeper dive into the 6 key AI variables to watch, which many of my readers have found helpful for getting up to speed on some of the more technical terminology.
It links heavily to important papers and to other Substack authors whom I've found informative in understanding the latest trends. I do owe my readers an update, particularly on advancements in chain-of-thought and on discrepancies between benchmark tests and real-world results, which I will publish after I finish the next few pieces I have planned about technology's impact on the job market.
Hopefully you find it informative!
https://www.2120insights.com/p/the-6-key-ai-variables-to-watch
This is a great summary of the current state of play, which is so dynamic and seems to change each week. There was a great talk at Stanford last week with Reid Hoffman, who is deeply involved with both the business and technology sides of the domain (https://www.youtube.com/watch?v=RXjLGn14Jo4). I think his take that we have a 5- to 10-year horizon for big improvements is probably right, and that we may see the scaling curve diminish, as the achievable leverage from transformers and indexed databases has peaked and everything is now on the edges (multi-modal, recursive inference, micro and local LLMs). This is probably a blessing, because it diminishes the associated risks and will allow for more augmentation and collaboration with humans. Brave new world....
Ethan, it was really helpful to me for you to juxtapose the two types of AI scaling you’re seeing: size & thinking. Thank you.
What a valuable post Ethan. I appreciate your scouting and leadership reports. Just fantastic.
Good overview! I like the categories and accessible definitions you use, one of the great strengths of your book. I was glad to see you include Grok-2 here. Some of the people I talk to about AI don't know about Grok-2 or don't take it seriously (for a number of reasons). I think Grok-2 works surprisingly well, especially with image generation. In the near future, I'd love to see what Claude 3.5 will do with a "thinking process" like that of OpenAI's o1.
P.S. Apart from the screenshots, how would you compare or rank the outputs from the frontier models?
Good explainer for people who do not follow the space closely; I plan to share it with some I know who fit that profile. o1-preview is a Gen2 model, as it is simply GPT-4o with Reasoning (and called that internally). I'd also note that many of the AI leaderboards are nothing more than vibes. No one I know would say Grok 2 is number two; Claude 3.5 Sonnet and GPT-4o are the top two, with Sonnet often mentioned as best. Yet leaderboards show 4o mini as superior to it.
Thanks, in particular, for your "Behind The Scenes" asides where we glimpse your own LLM interplay as you write this useful (and distinctly human) summary.
I think the section about o1, "A new form of scale: thinking", is missing a major piece: they figured out how to use additional training compute not simply to 'cram more education into it' but to train it (in a way that depends comparatively little on outside-the-company sources of data) to make _better use of_ inference-time compute (i.e., getting its chain of thought to be useful more often and counterproductive less often).
An incredibly timely and useful summary of "the state of play in AI" that I will immediately pass on to my students!
Ethan - nice article. Great primer on the current architecture.