A simple answer, and then a less simple one.
It’s interesting to me that with all of Google’s deep pockets and deep bench, the only things we got at yesterday’s announcement were a usable ChatGPT 3.5-level LLM and vaporware that claims to be at or slightly above ChatGPT 4 level. I disagree with Ethan a little about Bing. I forgot to renew my ChatGPT 4 account and was stuck with Bing for a few days before I could get back to OpenAI, and it was horrible. Shocking, really, that Microsoft could take the full code and weights of GPT-4 and mess it up as much as they did. I suppose they did it to lighten the model and make it less expensive to run on their side. But still, there’s something about giant companies that, for now, makes them less than the sum of their parts in this new exploding field.
There are rumors that LLMs are reaching a plateau; no less a figure than Bill Gates recently said as much, so maybe that explains why Gemini Ultra is only at GPT-4 levels effectively a year after 4 was released. But I think even if the raw power is at some sort of plateau, there are plenty of little “tricks” left to keep progress going for years, just as we saw over the past 15 years when Moore’s Law collapsed but progress continued through multiple cores, the cloud, etc. We can see the integration of video from the ground up into training sets. We can see multiple LLMs handing off to each other, somewhat reminiscent of what has happened with multiple processors on single devices and the cloud. We can see bigger and persistent context windows acting as memory (as has already happened with customizable GPTs at OpenAI). OpenAI has also gotten far better at integrating search and math tools with the same base model. And there are rumors they made a breakthrough in native math (and hence reasoning and planning) ability with an internal model called Q*.
This is all super interesting to me, and I’m an accelerationist, but I have a feeling that 5-10 years from now, when the future I’m thinking of arrives, I will find myself deeply sad that many of the things I prided myself on as a person will have been massively devalued and commodified.
A good summary! I’d also add Phind to the list, which has beaten GPT-4 on all “programming” benchmarks, is much faster, and is available for free. I use it often for debugging.
Coincidentally, I wrote a short blog post on ChatGPT alternatives, and as in your list, Bard doesn’t make the cut.
Here in Canada, we don’t have access to Bard or Claude. So I’ve been using GPT 4, which has been fantastic. It’s great to know I’m not missing out on anything.
All this while I have been using Bing's Balanced version and have been thinking it was somehow inferior to 3.5. Thanks for the clarification, Ethan! Your work is immensely helpful - please continue to share your insights from the jagged frontier!
Ethan, as an English professor and educational developer, I have been using GPT-4 and Claude 2.0 for months now, and our CTL is using Claude for demonstrations, especially because the free version of Claude is more accurate and intuitive than the free version of ChatGPT. Claude is also better at academic-sounding prose as a default, and it conforms better when I provide it with examples of my writing and ask it to "write like me."
When I asked GPT-4, Claude 2, Bing, and Pi to demonstrate their abilities to create an example of a learning strategy, Bing and Pi were abysmal; GPT-4 was very accurate and used the expected format of a report; Claude 2 produced a more substantial answer but did not use the business format in its response.
My biggest disappointment in all of the current LLM producers is their apparent unwillingness to hire educational developers to work on staff. After all, their products are transforming higher education, but if OpenAI's "teacher's guide" is any indication, they are not deeply interested in the pedagogical consequences of their discoveries.
Good summary. A few comments/suggested edits:
1) It might be helpful to include a “cites sources” column in your opinionated summary table of LLMs, as I consider source citations to be critical in research-oriented tasks (e.g., they really help audit output and catch hallucinations). This is where Bard and Claude currently fall short.
2) Suggested edit - Copilot/Bing Chat offers natural language interface in the mobile Bing App
3) Suggested edit - the $20/month Perplexity Pro offers access not only to GPT-4 but also to Claude 2.1. As with Poe, a subscriber gets more than one model for the fee.
This is a great post, especially to share with friends coming to AI now. I have had ChatGPT4 and Claude 2.0 for a while, and value them both. My challenge is that my interests are to use AI tools in the humanities--particularly non-profit operations and development--and this is not an area where much research is happening, though business-focused studies and prompt-focused papers both have transferable knowledge.
I started off with ChatGPT 4, but the product has definitely dropped in quality with the rise in users. It is not only slow, but the responses to my prompts have been lackluster, especially with GPTs. Hopefully those issues can be resolved. I think Claude AI and Perplexity AI are much better products at the moment for general tasks.
Great post as usual! I think Claude deserved a bigger mention given all of the x-risk AI discussion in the public sphere at the moment. Claude is safer than the other options because it is built on top of a constitution of ethics that is expanded upon here: https://www.anthropic.com/index/claudes-constitution
Really helpful, thanks Ethan.
Very helpful. As a qual researcher, uploading docs is key. FYI, I’m in Australia and have access to Claude, but it refuses to accept documents, just text pasted in. This may be geo-related.
Ethan, I'm very thankful for the work you're doing and sharing. I read and share all of your posts and videos; they are super helpful. The ability to get the best information in one location is indispensable.
Thank you so much for this very useful and succinct overview! I'm glad I got GPT 4 - Claude is not available in Europe unless you pay for Perplexity, and I agree that Bing can be weird. I appreciate the tip about the Bing detour to GPT 4!
And then came Gemini, now everything else looks outdated.
While incremental improvements in gen AI continue to come at a fast pace, I wonder if GPT-4 does represent a boundary that companies are finding hard to get beyond with current methods. It might be that we quickly hit a plateau on pure transformer models and now need to wait for new methods to get implemented, like OpenAI's rumored Q* search algorithm.
To me, it makes sense that we might be at a plateau. LLMs have been fed a broad set of human language and images, but it's not clear how they would much transcend the average level of intelligence displayed in the training set. It's also not clear how LLMs would pick the best approach to something, similar to how AlphaGo picks the best Go move. Search algorithms like Q* could help with that, but the other challenge is that there's no real way to "score" real-world writing or decisions the way you would score a game move. So it seems like some paradigm shifts might be needed to achieve a worthy GPT-5 experience.
Thank you for sharing this - a great overview of the current State of the AI Nation!