Two thoughts here:
First, one of the biggest blockers to meaningful AI adoption is this belief in a “silver bullet” solution. In reality, customizability is AI’s greatest strength, and knowing how to tailor it to how you actually work is the most powerful way to leverage it. We no longer have to contort ourselves to fit into systems built by others; AI finally lets us build systems that adapt to our quirks and workflows. But that only happens if you spend real time with it. Organizations need to invest serious time and resources into tinkering (testing different models, use cases, and configurations) to discover what truly fits. Vendors can provide the menu, but only you can figure out what actually works by using it, iterating, and “interviewing” the AI, as Ethan puts it.
Second, as you noted, different models are now clearly better at different tasks. That reality makes model-agnostic solutions increasingly valuable. The advantage today isn’t just having your own proprietary LLM; it’s being able to seamlessly access and orchestrate across multiple models. A year ago, owning an LLM was the moat. Now, the real moat is flexibility: being able to partner with, switch between, and productize multiple models around user needs.
"Companies spend a lot of money to hire people who are better than average at their job and would be especially careful if the person they are hiring is in charge of advising many others."
I enjoy your analysis but the boundless optimism here regarding corporate rigour is, uh, interesting.
haha yes, very true
Great post. I think your advice to companies could be sharper, though.
Instead of "create and test realistic scenarios," better to just license 2 or 3 different models for a representative subset of employees, and let them use the models in daily work for a month.
GDPval is a great eval, but its test prompts are unnaturally well-scoped. So the problem with copy-pasting its eval approach for a company is that real-world users don't prompt so cleanly; they have more unknowns, more false starts, and their own jagged frontier of expertise. (In other words, almost everyone is bad at some parts of their job, sometimes in hidden ways.)
Part of what makes a good AI is whether it can help you feel your way in the dark toward the right approach and helpful next steps.
The underlying insight here is the old quote: "Asking the right questions is often more important than getting the right answer." Most employees actually don't know what the right question to ask is, and the best AI is the one that helps you figure it out. It's messy!
I would like to see ChatGPT or Claude add a toggle called "prompt improvement" where it evaluates your prompt and gives you a friendly UI (not a wall of text) for finding ways to deepen or broaden it.
(ChatGPT Deep Research does this pretty well -- asking clarifying questions before it starts -- but this should be an optional toggle on all models.)
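Roughly what I'm imagining, as a toy sketch: a pre-pass that asks the model for clarifying questions before answering. Nothing here is a real ChatGPT or Claude feature; the instruction text and the model call are made-up stubs.

```python
from typing import Callable

# Hypothetical instruction for the "prompt improvement" pre-pass.
CLARIFY_INSTRUCTIONS = (
    "Before answering, list up to three short clarifying questions that would "
    "make the request below easier to answer well. If none are needed, reply 'NONE'.\n\n"
    "Request: {prompt}"
)

def improve_prompt(prompt: str, ask_model: Callable[[str], str]) -> str:
    """Hypothetical pre-pass: surface clarifying questions before answering."""
    questions = ask_model(CLARIFY_INSTRUCTIONS.format(prompt=prompt))
    if questions.strip().upper() == "NONE":
        return prompt
    # In a real UI these would appear as a short checklist, not a wall of text.
    answers = input(f"The model asks:\n{questions}\nYour answers (optional): ")
    return f"{prompt}\n\nAdditional context: {answers}" if answers else prompt

# Stub model call -- a real integration would hit whichever chat API you use.
def stub_model(text: str) -> str:
    return "1. What audience is this for?\n2. How long should it be?"

final_prompt = improve_prompt("Write a summary of our Q3 results.", stub_model)
print(final_prompt)
```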
In the meantime, let's advise leaders to evaluate their models under more real-world conditions rather than idealized, laboratory-style ones.
An appeal: Please stop posting things to and directing us to the social media site formerly known as Twitter. I know the challenges of audience, etc. It would be nice to have alternatives.
There are no alternatives, like it or don't. Critical mass is a thing.
Of course there are alternatives. As mentioned, I understand the appeal of going where the audience already is. However, there is no reason to send your established audience to that site; there are plenty of other places to direct them instead.
The “interview” metaphor really resonates.
I’ve started doing that myself — not to test power, but to understand temperament.
It’s a reminder that good collaboration, human or digital, always begins with good questions.
Valuable read, as always (and working on the edge of creativity and business, I'm a fan of your creative test tasks).
Tonality, specific capabilities, boldness, and creativity matter a lot in a business context, yet these factors are often overlooked when choosing a model or ecosystem.
We know that there are ways to influence model performance, but it still makes sense to first evaluate what the frontier models bring to the table.
Plus, creating this kind of assessment, or "job interview," as you call it, is a great exercise in which an organization can evaluate its processes and standards cross-functionally.
In my AI coaching, I've integrated this kind of structured exercise, and it's really eye-opening, not least in terms of corporate culture.
Great questions get great answers. Your puzzle of the new parent with a reservoir of 47 words inspired all four models to dig pretty deep and unearth some extravagant prose.
But wait -- aren't they computers? Can't they calculate that any word is wasted on a newborn who has not yet learned language?
Good point, but each model seems to have correctly calculated that the words were more valuable to the parent, who felt compelled to bless the baby with those final words. Some hoarded rather than gushed, choosing precision over an exorbitant burst of love, but in every case it seemed more about the parent than the child. Interestingly, Claude and Gemini took the long view, perhaps saving the final words for a time the child would understand (death will come when I am ready), whereas GPT gushed (death is inevitable, love is forever). Moonshot appears to have decided not to use them then, or possibly ever, given the inevitable imperfection and failure of any utterance (I'll not accept death on any terms).
It’s also worth noting that this kind of evaluation is heavily influenced by the conversational assistant’s configuration and its system prompt. Using the same underlying model via API can produce very different results depending on those factors.
In practice, a company could build its own custom assistant layer, designed with a tailored system prompt to better align the AI’s behavior with organizational needs. This setup would also allow flexibility to swap or route between different models under the hood—selecting the one best suited to each specific request—without changing the user-facing experience.
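A minimal sketch of what that layer might look like. The routing rules, model names, and system prompt are purely illustrative, and the backends are stubs standing in for real API calls (OpenAI, Anthropic, etc.):

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical organizational system prompt; every request goes out with it,
# regardless of which underlying model handles the call.
ORG_SYSTEM_PROMPT = (
    "You are an assistant for Acme Corp. Follow the company style guide, "
    "cite internal sources when available, and flag low-confidence answers."
)

@dataclass
class ModelBackend:
    name: str
    complete: Callable[[str, str], str]  # (system_prompt, user_message) -> reply

class AssistantLayer:
    """Custom assistant layer: one user-facing interface, many models underneath."""

    def __init__(self, backends: dict[str, ModelBackend], default: str):
        self.backends = backends
        self.default = default

    def route(self, user_message: str) -> str:
        # Illustrative routing rule: pick a backend by task type.
        text = user_message.lower()
        if any(k in text for k in ("code", "function", "bug")):
            return "coding"
        if any(k in text for k in ("draft", "rewrite", "email")):
            return "writing"
        return self.default

    def ask(self, user_message: str) -> str:
        backend = self.backends[self.route(user_message)]
        return backend.complete(ORG_SYSTEM_PROMPT, user_message)

# Stub backends stand in for real model API calls.
def _stub(name: str) -> Callable[[str, str], str]:
    return lambda system, user: f"[{name}] would answer: {user!r}"

assistant = AssistantLayer(
    backends={
        "coding": ModelBackend("model-a", _stub("model-a")),
        "writing": ModelBackend("model-b", _stub("model-b")),
        "general": ModelBackend("model-c", _stub("model-c")),
    },
    default="general",
)

print(assistant.ask("Draft an email to the vendor about renewal terms."))
```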
This aspect of the GDPval paper seems problematic for AI: "Across models, experts most often preferred the human deliverable because models failed to fully follow instructions on GDPval tasks."
So much good stuff in your article. Thank you. Firstly, yes to interviewing a model. That's what I just realised we're doing with our AI adoption and change work. It's a long old interview, mostly with a single candidate who has already got the job, but you're dubious. Secondly, thank you for including MS Copilot in your league table test. I'd like to see Copilot alongside the others more often; that's rare and important. Speaking of Copilot, I'd have to disagree that companies today are choosing their AI models based on benchmarks. Many with Microsoft ingrained in their ecosystem have just gone with the "poorest" one, Copilot, which says a lot: it's very expensive and considerably the "dumbest" of the bunch. Which is hilarious if companies are supposedly choosing the best models out of some sense that the others are not secure enough. Other companies I know end up with a confusing, horrible mix: a usually very poor in-house version built on the cheapest model API, plus paying for Copilot, and maybe one other. It's a total waste of time, energy and resources to have too many mixtures of "experts" with no real sense of which model to use for which job.
AI can be a good hire or a bad one. As Ethan points out, it’s not just about performance; it’s about alignment. The smartest move leaders can make now is to build their guiding principles into the models before those models start shaping decisions and culture.
Honestly? This is a cool idea. However, as a student, I would say a main incentive is also money and pricing structure; it can push us toward preferring some AIs over others. For instance, my college has a partnership with Gemini, which gives us the upgraded version for free, so we have a financial incentive to use Gemini rather than other AI platforms.
The thing benchmarks consistently miss is fidelity: how well a model preserves meaning across recursive transformations. You can have models that ace MMLU and still exhibit subtle semantic drift, distorted risk profiles, or unstable decision heuristics. Evaluating AIs the way we evaluate people, under real cognitive load, feels like the only way to surface those differences.
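One way to probe that, as a rough sketch: recursively paraphrase a text with the model under test and score each generation against the original. The paraphraser below is a stub, and the bag-of-words similarity is only a stand-in for a real embedding comparison.

```python
import math
from collections import Counter
from typing import Callable

def cosine_similarity(a: str, b: str) -> float:
    """Crude bag-of-words cosine; stands in for a real embedding comparison."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    norm = math.sqrt(sum(v * v for v in va.values())) * math.sqrt(sum(v * v for v in vb.values()))
    return dot / norm if norm else 0.0

def measure_drift(original: str, paraphrase: Callable[[str], str], rounds: int = 5) -> list[float]:
    """Recursively paraphrase a text and score each generation against the original."""
    scores, current = [], original
    for _ in range(rounds):
        current = paraphrase(current)  # in practice, a call to the model under evaluation
        scores.append(cosine_similarity(original, current))
    return scores

# Stub paraphraser; swap in a real model call to run the evaluation for real.
def stub_paraphrase(text: str) -> str:
    return text.replace("must", "should")  # tiny, deliberate meaning shift

print(measure_drift("Vendors must encrypt customer data at rest.", stub_paraphrase))
```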
love this framing.
testing AI like a candidate forces us to think about its judgment, not just its trivia score.
maybe we don’t need smarter benchmarks, just better interviews.
"Can’t I just use an AI to interview my next AI?" asked Meta-Martha.
I agree with this in general, but I strongly think GDPval has a big weakness… “code smell”… LLM output is highly plausible even when subtly wrong, and just being an “expert” (I’m not convinced they were all real experts as opposed to just graduates) is not enough to spot that.
One example: I ran about 100 company reports and showed them to private equity experts, who all thought they were universally amazing. However, when I ran them for companies we had only just done due diligence on, the folks who had done that work found lots of problems with the analyses that were not apparent to a regular expert.
I think this is gonna be like self-driving in a simulator… you think it works but then have to face messy real-world physics.