Right around the time you are getting this email, Google finally released their long-awaited powerful AI, which, in the continued tradition of sudden AI name changes, is no longer Bard but rather Gemini Advanced. I have had early access to this LLM for over a month (as a reminder, I take no payments from any AI lab, nor do they see what I write in advance), and I wanted to offer some tasting notes.
And, yes, I mean tasting, not testing, notes. In these newsletters, I have been sloppy with spelling (I figure it is a sign that a regular human rather than an AI wrote it), but I am not making a mistake here. AI testing benchmarks have their place, but they can also mislead. AIs can be trained on the test questions, on purpose or by accident, and many of the benchmarks consist of lists of trivia questions or reasoning puzzles, which don’t reflect real-world usage. So, I wanted to offer a bit of a subjective/objective mix of opinions about Gemini Advanced, more like sampling a wine than giving a rigorous review. I am going to avoid doing a detailed feature set comparison, and focus on the big picture, with plenty of examples.
Let me start with the headline: Gemini Advanced is clearly a GPT-4 class model. The statistics show this, but so does a month of our informal testing. And this is a big deal because OpenAI’s GPT-4 (the paid version of ChatGPT/Microsoft Copilot) has been the dominant AI for well over a year, and no other model has come particularly close. Prior to Gemini, we only had one advanced AI model to look at, and it is hard drawing conclusions with a dataset of one. Now there are two, and we can learn a few things.
At the same time, Gemini Advanced does not obviously blow away GPT-4 in the benchmarks. It is really good (more rigorous testing will be needed to figure out how good), but I would concur with the tests that suggest it is roughly equivalent, though it has its own strengths and weaknesses. GPT-4 is much more sophisticated about using code and accomplishes a number of hard verbal tasks better - it writes a better sestina and passes the Apple Test. Gemini is better at explanations and does a great job integrating images and search. Both are weird and inconsistent and hallucinate more than you would like. I find myself using both Gemini Advanced and GPT-4, depending on circumstances, as we will discuss later.
But the really interesting thing is what Gemini Advanced shows us about the future of AI.
It’s Full of Ghosts
No one has a great definition for sentience, which is okay because LLMs are in no way sentient; they are software systems designed to create human-like language. But there is a weirdness to GPT-4 that isn’t sentience, but also isn’t like talking to a program. It is a weirdness that only comes out after you spend enough hours playing with the AI and getting unnerved, or delighted, or both, by its unexpected abilities and seeming intelligence. There was a famous, controversial paper put out by Microsoft Research soon after the release of GPT-4, called “Sparks of Artificial General Intelligence,” that tried to put this argument into scientific terms, but ended up just calling it “sparks” of artificial general intelligence. It is the illusion of a person on the other end of the line, even though there is nobody there. GPT-4 is full of ghosts.
Gemini is also full of ghosts.
Seriously, if you use the system for a while, I can almost guarantee at least one moment when you stand up from your desk, walk around the room, and wonder what is going on. Here is one example: I prompted Gemini: lets play a PbtA game. invent an entirely new game and be my DM (For context, PbtA refers to Powered by the Apocalypse, a roleplaying format that is sort of like Dungeons and Dragons, but more character-driven). Everything you see below is unedited: the actual prompts, and the first responses from the AI. It is pretty solid stuff, from the writing to the worldbuilding.
This means something important, I think, which is that the “sparks” of GPT-4 are not an isolated phenomenon, but rather may represent an emergent property of GPT-4 class models. When an AI model is large enough, you can get ghosts.
Personality and Prompting
While still a chatbot, Gemini has a much slicker interface than GPT-4, and is less prone to technical errors, at least in my testing, than ChatGPT. It also has a different “personality” than GPT-4, in either its ChatGPT or Copilot incarnations. While GPT-4 is pretty bland (at least since the disappearance of Bing’s personality, Sydney), Gemini is more apparently friendly, more agreeable, and has a tendency towards wordplay.
Despite these personality differences, it is remarkable how compatible these two very different models are. Complex prompts that work in GPT-4 work in Gemini, and vice-versa… with some interesting exceptions that line up with the personality. We have been actively experimenting with using AI for learning and have been writing papers with suggested prompts. While updating the prompts for Gemini (the updated paper should be available here soon), we noticed that, compared to GPT-4, it continually tries to be helpful. In fact, it is so helpful that it can undermine the goal of our prompts by trying to help the student, rather than letting them struggle through understanding a concept on their own. We had to change our prompts a little to reduce this behavior.
Thus, there are differences, but also many similarities. Both systems have safety guardrails, but they trigger in different ways. Gemini seems more willing to do darker writing than GPT-4, but absolutely refuses to explain how nuclear bombs work through the discography of Taylor Swift, while GPT-4 is happy to do so.
What Brains Can Do
One of the most interesting things about Gemini is how it illuminates a vision of AI as a powerful integrated personal assistant that is quite different from Microsoft’s application-specific Copilots or OpenAI’s open-ended GPTs/agents. Microsoft has been creating narrow companions for software like Word and PowerPoint that streamline the user’s workload. OpenAI seems to have an ambitious plan to create autonomous AI agents that can do tasks without the need for human intervention. But Google seems to want to be your helper.
Earlier versions of Bard had impressive connections to the Google ecosystem (Gmail, Google Docs, Google travel tools, and more), but were too dumb to use them. They could pull up your emails, but would hallucinate too many details, or fail to understand context, in ways that were incredibly frustrating. I speculated at the time that Google may have just built the infrastructure while waiting for a smarter brain to fill it. That seems to be the case.
All of the integrations across Google now make much more sense. With a smarter brain, in the form of Gemini Advanced, you can start to do some really interesting things that, at their best, seem magical: “go through my emails, tell me which are important, and draft replies for each,” “look up my next conference and plan a trip I would like.” But a GPT-4 class model is still limited. The AI still hallucinated a few email details and got confused about its tools on several occasions (forgetting it could use Google Maps, and so on). It isn’t there yet, but it is very much closer to being an actual assistant, rather than the limited Siris and Alexas we have seen in the past.
That is, in part, why I suspect that Gemini Advanced is the start, not the end, of a wave of AI development. We can start to see a world where AI agents act on our behalf. A GPT-4 class model is not quite strong enough to power these agents… but we are getting close.
What It Means
This wasn’t a review of Gemini Advanced - we haven’t covered its excellent native multimodal ability to both create and see images, or the way it integrates search. We haven’t discussed its coding prowess, or the fact that it seems to have some Code Interpreter-like ability to make and run limited Python programs. We also haven’t covered some frustrations, like the fact it loves to make elaborate plans that it can’t always actually execute on (like telling me it was going to order me t-shirts, something it can’t do, but which it kept insisting it was working on). Suffice it to say that it is quite good, and you probably would be fine picking either GPT-4 or Gemini Advanced as your AI of choice to work with. Given their mixed strengths and weaknesses, however, I will continue to use both.
But this is not a review. Instead, it is an attempt to use the new LLM to shed light, however dimly, on how the future of AI might unfold. Gemini shows that Google is in the AI race for real, and that other companies besides OpenAI can build GPT-4 class models. And we now know something about AI that we didn’t before. Advanced LLMs may show some basic similarities in prompts and responses that make it easy for people to switch to the most advanced AI from an older model at any time. Plus, GPT-4’s “spark” is not unique to OpenAI, but is something that might often happen with scale. We don’t yet know if models get “sparkier” and more AGI-like as they get larger, but I suspect we will find out.
That is because I think Gemini’s unique strengths and weaknesses compared to GPT-4 demonstrate that there is still a lot of room left for models to improve, and we will continue to see rapid gains in the near future. The AI wave hasn’t crested, and the next move from OpenAI might be releasing the rumored GPT-4.5 or GPT-5. But until that happens, for the first time since ChatGPT’s release, there is another company with an LLM that can compete with OpenAI’s most advanced model.
In case you haven’t seen it, I have a book coming out on April 2 on living and working with AI called Co-Intelligence. If you like my posts, you will probably like the book, which has similar themes, but with much more depth and far fewer spelling errors. More on the book in the future, but you can pre-order it (and see the cool cover) here.
The question of why it doesn’t clearly beat GPT-4 is really interesting, and perhaps consequential. I can think of four possibilities: 1) GPT-4 class models are about as good as AI gets using LLM technology, suggesting that the exponential change in AI capabilities is ending (I think this unlikely, but possible); 2) Google needed a model to compete with GPT-4, so they trained up Gemini to that level and stopped, and more advanced models are coming soon; 3) OpenAI has some special sauce that no other company can replicate, and they are the only ones who can easily achieve GPT-4+ abilities, so Google’s attempt was the best they could do without knowing the OpenAI secret; or 4) It is a coincidence that this model happens to be so close in ability to GPT-4, and we learn nothing from this. I think 2 is most likely, but I have no idea if that is true.
This is *exactly* what I've been waiting for, and for similar reasons. Two LLMs at the top of the generative mountain is so vastly better than one, and not just because of the incredibly motivating competitiveness the two entities will feel up there. It's also remarkable because, as Ethan rightly points out, we can now begin to draw conclusions about how the models themselves will scale up. Up until now, our sample set of one hasn't been super duper helpful.
Great piece here, Ethan. Thanks for keeping us informed, and for the thoughtful analysis!
One of the things I am beginning to appreciate with Gemini that sets it apart from ChatGPT is its ability to prompt me for more information if it will help it give a better answer. For example, I uploaded a photo of an insect that I wanted it to identify. Gemini came back with two possibilities, but added that with additional information, about the location for example, it could give a better answer. I provided the location and it gave a single definitive answer. I am not an educator, but I am finding that this creates a more seamless flow to learning about something new. I appreciate your content, Ethan.