
Vibes don't scale. Evals do.

Panel where OpenAI, Harvey, LangChain, and Cartesia revealed how they really build production AI

EDITOR’S NOTE

Dear Nanobits readers,

Last week, I attended an event in San Francisco called "Vibes Don't Scale: Building Production Systems with AI Evals." I will be honest: I went in thinking I knew something about evals, since we have written about AI tooling, read the papers, and used some of it at work. But two hours later, I walked out realizing how little I actually knew compared to what the industry is doing.

The panel brought together Neel Kapse (Engineering Manager for Evaluations at OpenAI), Niko Grupen (Head of Applied Research at Harvey), Sam Crowder (Head of Core Platform at LangChain), and Iz Shalom (Head of Product at Cartesia). These are not people theorizing; they are building the evaluation systems that keep production AI from breaking.

Every panelist emphasized that evals aren't just a nice-to-have. They are the only thing standing between your working demo and a production nightmare.

As Sam put it bluntly: "Do you want your product to work?" The room laughed, but he wasn't joking. That's the entire eval conversation in one question.

Vibes Don’t Scale event in San Francisco

So today, I am sharing what I captured: the honest, messy reality of how the best teams actually build eval systems. The shortcuts they take when resources are tight. The expensive mistakes they made so we don't have to. And the unsolved problems they are still wrestling with. Let's dive in.

WHY EVALS MATTER: THE FUNDAMENTAL SHIFT

Traditional software has a simple promise: write code, write tests, ship with confidence. The behavior of your app is defined in the code itself. You can write unit tests and integration tests that prove your application works exactly as specified.

But AI broke that model completely.

"In the world of agents, the behavior of your app is defined in the agent itself," Sam explained. "Whereas in the prior paradigm of traditional software, the behavior of the app is defined in the code and you can write integration tests and unit tests to test that code. But until your agent is actually doing something, you have no behavior, you have no definition of your application. The evals are the only thing really that can ensure that your agent performs as you want it to."

This is the foundational problem: with traditional software, your tests verify the code. With AI, your code is just scaffolding. The real behavior emerges from the model itself, and that behavior is probabilistic, context-dependent, and changes with every model update or prompt modification.

Think about it: your traditional test suite gives you a binary pass/fail. But with AI, you are testing something that might give you slightly different outputs each time, even with the same input. The model might hallucinate. It might follow an inefficient path to the right answer. It might be correct but inappropriate in tone.

Evals aren't just testing. They are your only definition of what your application should do. Without them, you are literally shipping vibes and hoping for the best.

THE CORE PRIMITIVES: WHERE TO START

If you are building an AI product from scratch, what do you actually need? Sam broke down two entry points that most successful teams use:

First Principles Eval Development

Build a golden dataset of example questions and template inputs. Take your prompt, plug in a model, run it over that dataset, and see how it scores compared to reference outputs or some feedback criteria you define. This is your baseline: does the model do what you expect on known inputs?

The key word here is "golden." These aren't random examples. They are carefully curated cases that represent the core behaviors your product needs to handle correctly. Think of them as your contract with the model: if it can handle these well, it can probably handle production.
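
To make this concrete, here is a minimal sketch of that loop in plain Python. The `call_model` helper is a placeholder for whatever model API you use, and exact match stands in for whatever scoring criteria you actually define:

```python
# Minimal golden-dataset eval loop. `call_model` is a hypothetical stand-in
# for the model/prompt under test; the scorer here is exact match, which you
# would replace with your own feedback criteria or reference comparison.

GOLDEN_SET = [
    {"input": "Reset my password", "reference": "route_to_account_recovery"},
    {"input": "Cancel my subscription", "reference": "route_to_billing"},
]

def call_model(prompt: str) -> str:
    """Placeholder for your actual model call (any provider or local model)."""
    raise NotImplementedError

def score(output: str, reference: str) -> float:
    # Simplest possible criterion: exact match against the reference output.
    return 1.0 if output.strip() == reference else 0.0

def run_eval(dataset: list[dict]) -> float:
    scores = [score(call_model(row["input"]), row["reference"]) for row in dataset]
    return sum(scores) / len(scores)

# print(f"pass rate: {run_eval(GOLDEN_SET):.0%}")
```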

Trace-First Approach

Many teams actually start with tracing, which captures what the agent actually does in practice. As Sam described: "Tracing for agents tells you what that behavior is. To perform X task, this model called these three tools, here's the latency, here's the total cost, here was the process that followed, that's the trace."

Traces are your window into reality. They show you the real behavior of your agent in action, and you build evals from there. You are evaluating the trajectory the agent follows to get to the end result: whether it answered the user's question, whether the user got frustrated partway through a conversational flow.

Sam noted: "We find for most of our customers, starting from the trace as the foundation to understand final behavior is where they can then work on building eval sets of all kinds."
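
Here is a rough illustration of the kind of record a trace boils down to and how it turns into eval signals. The shape below is my own simplification, not LangChain's actual schema:

```python
# One way to represent a trace: the tool calls, latency, and cost behind a
# single task. Illustrative shape only, not any vendor's real data model.
from dataclasses import dataclass, field

@dataclass
class ToolCall:
    tool: str
    args: dict
    output: str
    latency_ms: int

@dataclass
class Trace:
    task: str
    tool_calls: list[ToolCall] = field(default_factory=list)
    final_answer: str = ""
    total_cost_usd: float = 0.0

    @property
    def total_latency_ms(self) -> int:
        return sum(c.latency_ms for c in self.tool_calls)

def trace_checks(trace: Trace) -> dict:
    """Turn one raw trace into eval signals you can aggregate across users."""
    return {
        "answered": bool(trace.final_answer),
        "num_tool_calls": len(trace.tool_calls),
        "latency_ms": trace.total_latency_ms,
        "cost_usd": trace.total_cost_usd,
    }
```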

But here's the critical insight from Neel that changes everything: "Evals are kind of living things that need to change as you discover new issues in production. You will discover that oh this thing is broken, that thing is broken, or a new model releases and those failure points are different than before. So another part of this process is how do you actually take your starting eval set and then adapt it over time."

Your eval system isn't a one-time build. It's living infrastructure that evolves with your product, your models, and your understanding of what matters.

THE DYNAMIC EVAL PROBLEM

Niko from Harvey dropped what might be the most important warning of the evening:

"The biggest failure mode I see from AI teams today is they'll build an eval stack once, they'll use it to make a model decision once, and then they won't revisit it, even through multiple generations of releases."

Think about what this means. You build evals for GPT-4. You ship. GPT-4.5 comes out with better reasoning but different edge cases. You swap it in. Your evals say it's better. You ship again. But you never updated the evals to catch the new ways the newer model might fail.

Model capabilities improve rapidly. New models have different strengths, weaknesses, and failure modes than their predecessors. What worked as an eval for one model might completely miss the problems in another.

Your eval system needs to be dynamic and updateable in real time, not a one-time exercise you run before launch and forget about.

QUALITY OVER QUANTITY: THE DATASET SIZE DEBATE

This is where the conversation got intensely practical. How many examples do you actually need? Should you aim for 100? 1,000? 10,000?

The panelists were unanimous: quality matters infinitely more than quantity.

Iz from Cartesia laid out their thinking: "The first thing to think about is what is necessary to simulate as best as possible what your customer's interaction is going to be. So for us it's looking at what are the transcripts that are indicative of whether this model is going to work well for the user or not." Then they expand until hitting cost constraints, whether that's running costs or the strict upper bound of human evaluation budgets.

For voice AI, they also need to expand datasets whenever they add new languages or create new model behaviors, while keeping everything else for regression testing.

But Niko's perspective from Harvey was even more striking. Working in legal AI, they face a unique constraint most teams don't: they cannot look at production data. Prompts, queries, documents, even model responses, all of it is considered highly sensitive data.

"We can't look at them," Niko said. "The best we can do is try to create an offline data distribution that mirrors or kind of approximates our online production distribution. And in those instances, it's really about what is the specific use case that really matters for this firm, and then how do I come up with even 10 really good examples of work they produce for this. I'd take that over you know a hundred or a thousand just medium rules."

Ten examples over a thousand. Let that sink in.

Neel reinforced this with hard-won experience: "Scale is what makes the problem particularly insidious because everyone tells you you need a lot of data, and you do, the more data you have the better it is, but it's really hard to maintain a large dataset. I've seen teams where they'll have like 600 data points of something. But the problem is over the weeks it's not possible for them to staff maintaining that, making sure that their graders are still doing what they want them to over those data points. And so ultimately that's less productive."

Less productive than what? Than having 10 high-quality, well-maintained examples that you truly understand.

The honest but helpful answer? It should be as big as it needs to be and as small as it can be. If 10 works for that particular customer and use case, that's good enough. But it's not always going to be enough for every situation.

And critically, as Niko emphasized: "If you're going to reduce quality to increase quantity, you should do that intentionally." Don't accidentally trade quality for scale. Make that choice deliberately, knowing the tradeoffs.

What Makes a High-Quality Row?

Sam described it as multi-dimensional thinking: "If you think about one row in a reference dataset, human labeled expected reference output, it's not just a simple input-output completion type interaction. It's other dimensions of annotation like toxicity if you care for that, or humor if you care for that, boolean categorizations of like did the model answer a person's question."

Multiple dimensions per example: that's what creates a robust eval that captures the full picture of model behavior.
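
For illustration, a single row with multiple annotation dimensions might look something like this (the field names are my own, not a standard):

```python
# A single reference-dataset row with several annotation dimensions, not just
# an input/output pair. Field names here are illustrative only.
row = {
    "input": "My flight got cancelled, what are my options?",
    "reference_output": "Explain rebooking, refund, and voucher options.",
    "annotations": {
        "answered_question": True,   # boolean categorization
        "toxicity": 0.0,             # only if you care about it
        "tone": "empathetic",        # subjective dimension, human-labeled
        "humor": None,               # not applicable for this row
    },
    "labeled_by": "human",
}
```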

SYNTHETIC DATA: WHEN YOU CAN'T ACCESS PRODUCTION

Niko outlined Harvey's three-tier approach:

Public Data: Case law, public filings. Take a reference NDA and create synthetic variations between different parties, tweaking terms to trigger various aspects of their workflows.

Private Data: Highly constrained due to sensitivity.

Human Process Data: The most interesting category. "In many ways, it's data that doesn't yet exist or is in the process of existing," Niko explained. Best practices for work aren't in a CSV, they're conversations in hallways. "You actually need the product interface to facilitate workflows to start to extract that data."

This ends up looking like process traces, capturing the actual flow of work, not just outputs.

THE HUMAN VS. LLM-AS-JUDGE DEBATE

The reality is more nuanced than "humans good, LLMs bad." The panelists described a funnel approach:

Niko's Three-Tier System:

Tier 1 - LLM as Judge: Directional signal when testing model swaps. A smoke test to see if changes are better.

Tier 2 - Human Side-by-Side: Large-scale comparisons with in-house or external domain experts.

Tier 3 - Traditional A/B Test: Usage metrics and engagement in the product. Run a three-week experiment before full rollout.

Iz's Approach at Cartesia:

Synthetic evals during training, manual transcript review for release candidates, then multi-turn evaluation where they pit agent versions against each other. "That is very high touch, high cost, but very effective in measuring what customers truly feel."

Sam's Three Tiers of Involvement:

Low: Out-of-the-box LLM-as-judge prompts that auto-evaluate traces.

Medium: Custom LLM-as-judge evaluators tuned to align with human preferences.

High: Annotation queues where humans review production traces.

Sam's honest take? "I'd love to say all customers do the highest tier. That's where you get the best results, but given time and resource constraints, that's still a minority."
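
To give a feel for that low tier, here is a bare-bones LLM-as-judge evaluator. The `call_judge_model` helper is a placeholder for whichever grader model you use, and the prompt is only an example:

```python
# Sketch of a low-effort LLM-as-judge evaluator. `call_judge_model` is a
# hypothetical wrapper around whatever grader model you point it at.
import json

JUDGE_PROMPT = """You are grading an AI assistant's answer.
Question: {question}
Answer: {answer}
Return JSON: {{"correct": true/false, "reason": "..."}}"""

def call_judge_model(prompt: str) -> str:
    """Placeholder for the grader model call."""
    raise NotImplementedError

def judge(question: str, answer: str) -> dict:
    raw = call_judge_model(JUDGE_PROMPT.format(question=question, answer=answer))
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        # Judges fail too; surface parsing failures instead of silently passing.
        return {"correct": False, "reason": "unparseable judge output"}
```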

MAKING LLM-AS-JUDGE WORK

Neel offered two critical insights:

First, LLM-as-judge is underexplored. "If you're willing to commit to figuring out how to use LLM-as-judge, there's probably a lot of stuff that can be done that none of us know about yet." (A business idea, maybe?? 🤔 )

He shared a fascinating example: a consulting firm building an investment recommendation agent. Instead of grading outputs as "good" or "accurate," they had another agent debate the first agent. The grader acted as a moderator determining who won the debate. Costly, but potentially breakthrough results.

Second, this is non-negotiable: "The best LLM-as-judge graders we built were calibrated against human data." Humans create rubrics and baselines, then models are measured against those.

Niko emphasized mixing objective and subjective criteria. For a legal brief: objective checks (Did you cite this case? Does this overrule another case?) combined with subjective ones (Is this compelling? Would this be persuasive?). "At least that gives you part of the rubric yourself, and for directional signal, models can do a reasonable job on subjective data too."
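
A rough sketch of what a mixed rubric plus calibration can look like. The legal-brief criteria and the `llm_judge` helper are illustrative placeholders, not Harvey's actual setup:

```python
# Sketch: a rubric mixing objective checks (verifiable in code) with
# subjective criteria handed to an LLM judge, plus a calibration step against
# human labels. Criteria and the `llm_judge` helper are illustrative only.

def cites_required_case(brief: str, case_name: str) -> bool:
    # Objective: verifiable with a string check or a real citation parser.
    return case_name.lower() in brief.lower()

def llm_judge(criterion: str, brief: str) -> bool:
    """Placeholder: ask a grader model 'is this brief {criterion}?'"""
    raise NotImplementedError

def grade_brief(brief: str, required_case: str) -> dict:
    return {
        "cites_required_case": cites_required_case(brief, required_case),  # objective
        "is_compelling": llm_judge("compelling and persuasive", brief),    # subjective
    }

def agreement_with_humans(judge_labels: list[bool], human_labels: list[bool]) -> float:
    """Calibration: how often the grader agrees with human annotators."""
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(human_labels)
```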

THE RUBRIC DILEMMA

Iz captured the core tension: "For us, the tension is always being too granular or not granular enough. You can create tens of different metrics for every question, but then some are better and some are worse. It's like finding that Goldilocks zone of the right amount of rubrics that truly drive decision making."

It's like having too many dashboards: if you have too many, it's as if you have none.

Niko's framework: High-fidelity rubrics (detailed, point-by-point) give verifiable outcomes. But they impose constraints. "Rubrics can become so rigid that they're actually robust to model improvements. You miss the model's innate ability to do subjective aspects better."

His mantra: "A rubric is only as good as you can measure via calibration with something else, whether it's ground truth, a human golden response, or a human calibration test."

GUARDRAILS VS. QUALITY EVALS & MONITORING AFTER SHIPPING

Guardrails are concrete evals that shouldn't change model-to-model (like preventing the model from saying "sorry I apologize" or "oh great question"). Quality evals need to be flexible to let better models shine.
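
A guardrail check can be as simple as a deterministic banned-phrase scan that never changes when you swap models. A minimal sketch:

```python
# Guardrail evals stay concrete and model-agnostic: the same check should pass
# for every model you swap in. Phrase list is taken from the examples above.
BANNED_PHRASES = ["sorry i apologize", "oh great question"]

def passes_guardrails(output: str) -> bool:
    lowered = output.lower()
    return not any(phrase in lowered for phrase in BANNED_PHRASES)
```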

And a critical warning from Sam: "We make it possible to reuse evaluators between agents, but we don't advise doing that in every instance, because even if you've tuned an evaluator for agent A, that doesn't necessarily translate to agent B."

And evals don't stop when you ship. Niko broke down their approach:

Developer Experience: Local iteration gets you to ship, then observability in production. They taxonomize data (query categories, document metadata) since they can't look at raw content. This enriched data helps recreate problematic traces offline.

Lawyer Experience: Map usage to practice areas and task types, benchmark against internal eval sets for specific use cases.

Iz emphasized regression evals: binary checks to ensure you are not regressing when shipping new models. The key is standardization, plus sampling production data to verify it matches the quality you saw in offline testing so you avoid data drift.
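
One way to implement that kind of binary regression gate, assuming a hypothetical `run_case` function that returns pass/fail for a single eval case under a given model:

```python
# Sketch of a regression gate: every case that passed with the current model
# must still pass with the candidate model before you ship.

def run_case(case: dict, model: str) -> bool:
    """Placeholder: run one eval case against a model and return pass/fail."""
    raise NotImplementedError

def regression_check(cases: list[dict], baseline: str, candidate: str) -> list[dict]:
    regressions = []
    for case in cases:
        if run_case(case, baseline) and not run_case(case, candidate):
            regressions.append(case)  # passed before, fails now -> block the release
    return regressions
```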

THE MULTI-TURN CHALLENGE

Evaluating single LLM calls is hard. Multi-turn agent interactions? Much harder.

Sam explained LangChain's structure: runs (single tool calls), traces (collections of runs), and threads (multiple traces in a full conversation). For multi-turn, they use thread-level evaluators that run after threads go silent for a set period, testing outcomes and conversation trajectories.
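
Here is a simplified illustration of that run/trace/thread hierarchy and the idle-thread trigger, using my own data shapes rather than LangChain's actual API:

```python
# Illustration of the run -> trace -> thread hierarchy and a thread-level
# evaluator that only fires after the conversation has gone quiet.
import time
from dataclasses import dataclass, field

SILENCE_SECONDS = 30 * 60  # assumed idle window before a thread is evaluated

@dataclass
class Run:            # a single model or tool call
    name: str
    output: str

@dataclass
class Trace:          # one agent turn: a collection of runs
    runs: list[Run] = field(default_factory=list)

@dataclass
class Thread:         # a full conversation: multiple traces
    traces: list[Trace] = field(default_factory=list)
    last_activity: float = 0.0

def ready_for_eval(thread: Thread, now: float | None = None) -> bool:
    """True once the thread has been silent long enough to judge the outcome."""
    now = time.time() if now is None else now
    return now - thread.last_activity > SILENCE_SECONDS
```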

But Neel was honest: "We don't have a solution we're very happy with yet. Things we've tried are either not practical at scale or very hard to calibrate."

The Credit Assignment Problem

Niko highlighted a challenge from RL research now hitting production: "Say your agent takes 100 actions to produce output. Your eval says it's suboptimal. To which of the 100 decisions do you attribute that lower quality? Is it action 47? Action 63?"

Neel's concrete example: "I built an agent to pull GitHub PRs. It made a single API call per PR. None of us would develop that, right? But the agent was like 'I need this PR so I'll fetch it. I need that PR so I'll fetch it.' Each step was correct (did it pull needed information? Yes), but the approach was completely wrong."

The atomic steps can all be correct, but the total result wrong. Sam noted another issue: "Users stop talking halfway through because they got busy, and the agent has no idea." You can't control for that, but it muddies evaluation.

Iz's solution: Have raters accomplish a task, then verify on the backend if the thing was generated successfully. "It's easily verifiable and more helpful because otherwise it becomes so subjective the longer the interaction is."
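
A sketch of that backend verification idea, with a hypothetical `fetch_artifact` lookup standing in for however your system stores what the agent produced:

```python
# Sketch of outcome verification for a multi-turn task: instead of grading the
# whole conversation, check on the backend whether the requested artifact
# actually got created. `fetch_artifact` is a hypothetical lookup.

def fetch_artifact(session_id: str) -> dict | None:
    """Placeholder: look up whatever the agent was supposed to produce."""
    raise NotImplementedError

def task_completed(session_id: str, expected_kind: str) -> bool:
    artifact = fetch_artifact(session_id)
    return artifact is not None and artifact.get("kind") == expected_kind
```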

END NOTE

As the panel wrapped up, each person offered their core takeaway. Here is what the people building production AI want you to remember:

Sam (LangChain): "Traces can and should form the foundation of your eval set. Until you have them, you do not know any sort of behavior of your product."

Neel (OpenAI): For vertical companies or any team building on AI models, there are two primary hard-to-replicate differentiators: distribution strategy and the quality of what you are doing with the model. Most people in the room won't affect distribution as much, so "evals is actually a very big component of what gives you that extra 3 to 5% of quality sometimes. And that might be all that matters to beat your competition."

And perhaps the most direct:

"Evals, all that matters."

WHAT THIS MEANS FOR YOU

If you are building AI products, here is what you need to do:

Engineers: Start with tracing, even if it's just local experimentation. You cannot optimize what you cannot measure. Build a small, high-quality eval set before you build a large, mediocre one. Treat your evals like living code that evolves with your product.

Product Managers: Your AI product's reliability doesn't come from the model alone, it comes from the eval infrastructure around it. Ask your team: How are we versioning our evals? What happens when we swap models? How do we know if we have regressed?

Leaders: The teams winning in AI aren't the ones with the best models, they're the ones with the best eval systems. This isn't a one-time investment, it's continuous infrastructure that needs dedicated resources and constant iteration.

The gap between demos and production isn't magic. It's discipline. It's eval systems that evolve. It's the willingness to measure, calibrate, and improve continuously.

As the title of the event reminded us: Vibes don't scale. Evals do.

Share the love ❤️ Tell your friends!

If you liked our newsletter, share this link with your friends and request them to subscribe too.

Check out our website to get the latest updates in AI
