Taste Is the Next Capability AI Will Crack

Jun 12, 2026 · #research

TL;DR: Anthropic measured research taste on real research sessions: by Opus 4.7, the next step proposed by the AI is judged better than the human researcher’s in 50% of cases, and Claude Mythos pushes that to 60%. Taste, the ability I assumed AI could never learn, is on the automation curve too.

Lately, while doing research with AI, I keep coming back to one question: if AI can do most of the work, what am I still for?

I have always believed the hardest ability for AI to replace is taste: the ability to make judgment calls on open-ended problems.

For questions like what research to pursue or which engineering design to adopt, the internet offers neither abundant training data nor reliable, verifiable signals for reinforcement learning.

Tasteful judgment depends on long immersion in a particular community and environment, where we form intuitions that are hard to write down. Anyone who works on reinforcement learning knows that process reward models (PRMs) sound great but often fail in practice; anyone who works on LLM-as-judge knows that LLM scores tend to regress toward the mean.

That is what I thought, until the latest Anthropic article changed my mind.

The article, “When AI builds itself”, was just published by the Anthropic Institute. Its core point is that AI is increasingly taking over AI’s own R&D. It splits R&D into engineering and research. At the execution level (writing code and running experiments), Claude can already match or surpass humans. What still clearly belongs to humans is research taste: judging which problem is worth working on, which result is trustworthy, and when to kill a direction decisively.

What surprised me most is that they actually tried to quantify the thing long considered the hardest to quantify. They went through real sessions where researchers did open-ended research together with Claude, specifically picked out the moments where the researcher took a wrong turn, fed the model only the context before the detour, and asked it what it would do next. Then another Claude, one that could see the eventual outcome, served as the judge: whose next step was better, the human’s or the AI’s? In other words, this directly measures whether a model can ask good research questions.
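To make the protocol concrete for myself, here is a minimal sketch of how I picture it. This is my own reconstruction, not Anthropic’s code: the `Session` fields, the `evaluate` function, and both toy stand-ins for the model and the judge are hypothetical names I made up for illustration. In a real setup, `propose` and `judge` would be model calls.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Session:
    steps: List[str]    # ordered log of the research session
    detour_index: int   # index of the step where the researcher took the wrong turn
    outcome: str        # what eventually happened; only the judge sees this


def context_before_detour(session: Session) -> List[str]:
    """Everything the proposing model is allowed to see."""
    return session.steps[:session.detour_index]


def human_step(session: Session) -> str:
    """The step the human actually took at the detour."""
    return session.steps[session.detour_index]


Proposer = Callable[[List[str]], str]
Judge = Callable[[List[str], str, str, str], str]  # returns "ai" or "human"


def evaluate(sessions: List[Session], propose: Proposer, judge: Judge) -> List[str]:
    """Collect one pairwise verdict per session."""
    verdicts = []
    for s in sessions:
        ctx = context_before_detour(s)
        ai_step = propose(ctx)  # the proposer never sees the outcome
        verdicts.append(judge(ctx, human_step(s), ai_step, s.outcome))
    return verdicts


def toy_propose(context: List[str]) -> str:
    # Stand-in for a model call (assumption): always suggests the same step.
    return "rerun the baseline with a fixed seed"


def toy_judge(context: List[str], human: str, ai: str, outcome: str) -> str:
    # Stand-in for the judge model (assumption): prefers the step that
    # shares more words with the eventual outcome.
    def overlap(step: str) -> int:
        return len(set(step.split()) & set(outcome.split()))
    return "ai" if overlap(ai) > overlap(human) else "human"


# A made-up session, purely for illustration.
sessions = [
    Session(
        steps=[
            "train model A",
            "loss looks noisy",
            "scale up model A",
            "results did not replicate",
        ],
        detour_index=2,
        outcome="the baseline was unstable without a fixed seed",
    ),
]
verdicts = evaluate(sessions, toy_propose, toy_judge)
```

The design point I care about is the information asymmetry: the proposer sees only the context before the detour, while the judge also sees how things turned out.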

Research taste evaluation results from "When AI builds itself"

That final table felt uncomfortably familiar.

As the models iterate, their ability to propose the right next direction in research keeps improving. By Opus 4.7, the score had reached 50%.

Note that 50% here does not mean “right half the time.” It means that in half of the cases, the judge preferred the AI’s proposed next step to the human’s. At 50%, in other words, the human researcher’s taste is no longer better than the AI’s, at least on the moments this evaluation sampled.
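The distinction is easier to see as arithmetic. Below is a small sketch, assuming each comparison yields a verdict of either "ai" or "human"; the `win_rate` helper and the verdict lists are hypothetical and made up for illustration, not data from the article.

```python
from typing import Sequence


def win_rate(verdicts: Sequence[str]) -> float:
    """Fraction of pairwise comparisons in which the AI's proposal was preferred."""
    if not verdicts:
        raise ValueError("no verdicts to aggregate")
    return sum(v == "ai" for v in verdicts) / len(verdicts)


# Illustrative verdicts only: the AI is preferred in 2 of 4 comparisons.
parity = win_rate(["ai", "human", "ai", "human"])

# Illustrative verdicts only: the AI is preferred in 3 of 5 comparisons.
ahead = win_rate(["ai", "ai", "ai", "human", "human"])
```

A win rate is a relative measure: it says nothing about how often either side is right in an absolute sense, only who the judge prefers when the two are put head to head.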

And Claude Mythos pushes that number to 60%, which makes me genuinely look forward to that model’s release.

So research taste, the very ability that separates good research from bad, is on track to be automated too as AI capabilities advance.

What does this mean for us, and especially for the direction of future research?

The most important trend, I think, is that research may well be automated faster than we imagine. And this extends to most open-ended problems: building software, writing articles, and so on.

If research or engineering tasks cannot be automated by AI, the human bottleneck always remains, and there is a ceiling on how fast things can scale. But if taste can be automated too, the human bottleneck keeps shrinking, and the only thing left holding back scaling is compute. This corresponds to the third of the futures Anthropic envisions.

So future work may no longer look like hand-tweaking a few harnesses for small gains. As research and engineering get automated, our attention will shift to ensuring reliability inside a process we do not fully understand, so that automation can be pushed further. Concretely, I expect the following threads to receive more attention:

Long-context orchestration: the evolved form of long-context memory. We no longer only care whether a model can work coherently when handed a lot of context; we care how an entire multi-agent harness reorganizes itself and sustains performance after a very large number of steps. This is exactly what Claude’s dynamic workflow is concerned with.
Measuring and improving agents’ ability to make good judgment calls: when agents were weak, we evaluated them on closed tasks. But agents will clearly take on more open-ended tasks, so we need a way to model their decision-making there. Are the actions they take no-regret? When they make a gut call today, does it carry positive payoff down the line?
Minimizing human supervision: once agent success rate stops being the bottleneck on the production side, the supply of human supervision becomes the bottleneck on the other side. Conveying human intent to agents more efficiently, while avoiding excessive human intervention, becomes a key dimension of future capability.
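On the “no-regret” question in the second thread, one textbook way to make it precise is cumulative regret against the best single action in hindsight. The sketch below is only that textbook definition, under a strong assumption that rarely holds for open-ended research: that we can observe what every alternative action would have paid off at each step. The `cumulative_regret` function and the payoff numbers are hypothetical.

```python
from typing import Sequence


def cumulative_regret(payoffs: Sequence[Sequence[float]], chosen: Sequence[int]) -> float:
    """Regret against the best single fixed action in hindsight.

    payoffs[t][a] is the payoff action `a` would have earned at step t;
    chosen[t] is the index of the action the agent actually took at step t.
    """
    if not payoffs or len(payoffs) != len(chosen):
        raise ValueError("payoffs and chosen must be non-empty and the same length")
    n_actions = len(payoffs[0])
    earned = sum(payoffs[t][a] for t, a in enumerate(chosen))
    best_fixed = max(sum(row[a] for row in payoffs) for a in range(n_actions))
    return best_fixed - earned


# Made-up payoffs for two candidate directions over three decision points.
payoffs = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
chosen = [0, 0, 1]  # the agent's actual picks
regret = cumulative_regret(payoffs, chosen)
```

An agent is usually called no-regret when this quantity, divided by the number of steps, goes to zero as the number of steps grows. The hard part for research agents is not this formula but estimating the counterfactual payoffs at all.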

But everything above is just one person’s take. Do you believe research taste will be automated?