Success Is Not an F1 Metric
You’ve been watching Netflix all weekend. You started innocently enough, planning to watch something in your queue. You’ve been asked “Are you still watching?” three times now. You’re hooked on a new show you’ve never heard of. You’re more tired than expected on Monday morning, and you can’t wait to talk about it with your friends and coworkers.
You never once thought about Netflix’s recommendation engine.
That’s the job. And most data scientists I know don’t understand this.
Most data scientists cannot tell you what their work looks like from the outside. They can describe, in excruciating detail, everything about their models from the inside. Why did they choose that model? How did they arrive at those hyperparameters? What does the loss curve look like?
The flat-earthers, moon-landing deniers, and anti-vaxxers all agree: “Scientists are hiding the truth.” Have you ever spoken with an academic? They are physically incapable of hiding anything, what with the publications and the conferences. I promise you: you have never met anyone more proud of the work they do. They can go on for hours. They’re good at it.
Next, ask them what business value was delivered. Watch the room go quiet.
The two decks
Every project has two presentations. The rules for each are completely different.
In the first presentation, the data scientist is among peers. Show your work. Brag on the mathematical complexity. Defend the methodology. Walk through the confusion matrix. You earned that PhD. Use it!
In the second, you are talking to your users, the directors, and the C-suite. If you start talking about vector spaces here, you’ve lost the plot.
A mature data scientist knows their audience. A mature data scientist knows which presentation they’re giving. Every data scientist excels at the technical merits; you don’t get hired if you can’t. The problem is that the technical deck is the only one they came prepared to give.
The academic hangover
Academia rewards explainability. You defend your methodology in front of a committee. You publish peer-reviewed papers. You present findings at conferences. The entire incentive structure is built around showing your work and being impressive to other people who can evaluate your efforts. That’s the wrong incentive structure for building products.
The problem is that most data scientists go from academia directly into industry without anyone telling them the rules changed. Bragging isn’t vanity. It’s a necessary, trained behavior that was correct in every context they’d been rewarded in. Until now.
Industry sucks at correcting it. We put data scientists in front of stakeholders or business teams and ask them to explain what they built. We sit through presentations full of loss functions and ROC curves, and no one can explain what changed in the real world.
Hell, they didn’t even think to ask.
There is a gap between the silicon and the carbon.
Success isn’t a statistical metric
F1, AUC, RMSE, MAPE. These metrics are inputs to success, not success itself. They tell you whether the model is working. They do not tell you whether anything that matters changed.
A model with a mediocre F1 score that measurably changes customer behavior is more successful than a beautifully tuned model that changes nothing in the business. The latter just runs, making predictions. Full of sound and fury, signifying nothing.
The metrics are green. The pipeline is healthy. The model is deployed. The work was a waste of time. Good luck on your next promotion.
I want to be precise about what I mean by success. What is the business impact of what you did?
- Eyeballs: Someone saw it. Google Analytics. Adobe Analytics. The dashboard your manager opens every Monday and never acts on.
- Clicks: Someone acted on it. Click-through rate. A/B test lifts. Someone opened the email. Someone didn’t immediately close the modal.
- Orders: Someone committed money. Conversion rate. Cart abandonment recovered. Average order value went up.
- Revenue: The business captured value. MRR. ARR. Average contract value. Salesforce closed-won. Someone signed the paper.
- Changes in real human behavior: Did they come back? Retention curve at day 7, day 30, day 90. Churn rate down. Subscriptions renewed instead of cancelled. Time-to-second-purchase shortened. Referrals sent and opened. The customer who was going to leave and didn’t.
I would take any of these. Literally any of them. A click is a better success metric than an AUC score. An order is better than an RMSE. These are things that actually happened in the world, to real people, because of something you built. That’s the job.
While we’re making wishlists: proper A/B testing with phased canary rollouts. I’ll wait.
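Here is a minimal sketch of the kind of measurement I’m asking for: did the model-driven experience move a business metric relative to control? The variant names and counts are made up for illustration; the test is a standard two-proportion z-test on conversion rate.

```python
# Hypothetical A/B readout: conversion rate, treatment vs. control.
from scipy.stats import norm

def ab_lift(control_conversions, control_n, treat_conversions, treat_n):
    """Relative lift and two-sided p-value from a two-proportion z-test."""
    p_c = control_conversions / control_n
    p_t = treat_conversions / treat_n
    # Pooled rate under the null hypothesis that the model changed nothing.
    p_pool = (control_conversions + treat_conversions) / (control_n + treat_n)
    se = (p_pool * (1 - p_pool) * (1 / control_n + 1 / treat_n)) ** 0.5
    z = (p_t - p_c) / se
    p_value = 2 * norm.sf(abs(z))
    return {
        "control_rate": p_c,
        "treatment_rate": p_t,
        "relative_lift": (p_t - p_c) / p_c,
        "p_value": p_value,
    }

# Made-up numbers: 50k sessions per arm during a canary phase.
print(ab_lift(control_conversions=1_500, control_n=50_000,
              treat_conversions=1_640, treat_n=50_000))
```

That is the whole ask: a number a director can act on, attached to an experiment design that says whether to believe it.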
The impact of good data science
DoorDash has a program for new drivers. In your first three weeks on the platform, the highest-paying orders get routed to you preferentially. The honeymoon period is real. The earnings are good. The experience is designed to hook you before you have a chance to form accurate expectations about what driving for DoorDash actually pays.
DoorDash didn’t stumble into this solution. They watched driver retention at week four. They measured the drop-off. They knew the honeymoon ended because they were measuring for it. The data science didn’t stop at the prediction. It ran all the way through to behavioral outcome manipulation and back again.
That’s a closed feedback loop. That’s a team that knew what “it works” looked like in terms of human behavior and business outcomes, not model performance.
Facebook’s News Feed team knew something similar. In 2017, their engineers flagged internally that the “angry” reaction emoji was weighted 5x relative to a standard “like” in the ranking algorithm. This weighting was increasing engagement and time on site and systematically amplifying divisive content. They traced it forward to real-world effects on the information environment. The instrumentation was working. Upon its discovery, leadership continued to elevate divisive content to users in the name of engagement and ad revenue.
This is where I add that not all “good data science” is ethically good. Facebook did irreparable harm to the information environment and the people who made that call should be in jail. That’s a separate conversation. The point here is that the measurement worked. If you want the full, morbid story, read Careless People by Sarah Wynn-Williams.
Both examples share something: someone on those teams understood that their job didn’t end when the model shipped. They understood how the model drives a change in human behavior, and they deliberately built instrumentation to measure that change.
Shipped and forgotten
Consumer-facing AI gets scrutiny because users complain when it breaks. Internal AI doesn’t get that feedback. Employees quietly route around it, or use it ceremonially. They glance at the dashboard, then make the decision they were already going to make.
Your model has two failure modes, and both are invisible if you’re not watching:
- Adoption failure. The model was deployed. Nobody used it. The business metric never moved.
- Attribution failure. The model was used, but nobody measured the change in outcomes.
Both failures surface immediately if you’re watching the right metric and have appropriate monitoring tools in place. If watching the outcome wouldn’t tell you whether the model mattered, you haven’t defined what the model is for.
If you don’t know how to measure and monitor the business outcome, you are not ready to build the tool.
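A rough sketch of what that monitoring might look like, assuming a hypothetical event log with one row per decision: whether the model’s recommendation was shown, whether it was followed, and the downstream outcome. Column and file names here are illustrative, not a real schema.

```python
import pandas as pd

# Hypothetical table: one row per decision the model was supposed to influence.
events = pd.read_parquet("decision_events.parquet")

# Adoption failure: the model is deployed but its output isn't reaching anyone.
adoption_rate = events["recommendation_shown"].mean()
follow_rate = events.loc[events["recommendation_shown"],
                         "recommendation_followed"].mean()

# Attribution failure: the output is used, but nobody compares outcomes.
outcome_by_usage = events.groupby("recommendation_followed")["order_placed"].mean()

print(f"shown: {adoption_rate:.1%}, followed when shown: {follow_rate:.1%}")
print(outcome_by_usage)  # naive split; a real answer needs a holdout or experiment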
The definition of done
Here is what I want from every data scientist I work with. Before you ship, answer these questions:
- What behavior am I trying to change?
- How will I observe that behavior in production?
- What is my time horizon for seeing the effect?
- What is my counterfactual strategy? How do I know the behavior changed because of the model and not for other reasons?
- Who is the consumer of this model’s output? Is there any scenario where the model hits its performance targets while harming that person?
If you cannot answer these questions before you ship, you haven’t finished designing the system. You’ve designed the model. Those are not the same thing.
I’ve been in the meeting where nobody could answer question one. And the goddamn model was already in production.
The team that gets this right ships their models, people use them, behavior changes, the business captures value. Nobody writes a case study about it, because this team goes on to the next problem.
Data scientists: the brag belongs in the team meeting. The product doesn’t care about your AUC. Your users don’t know you exist. They just find what they were looking for — and occasionally, the thing they didn’t know they were looking for until they saw it.
F1 score is how you know the model is working. Behavior change is how you remain employed.
Photo by Markus Winkler.