Model Evaluation: Accuracy, Precision, and Recall
Model metrics shape whether machine learning models solve real-world problems or fail in silence. The core insight is that metric choice exposes or hides critical failures, especially on imbalanced data or when error costs differ. This article shows how to select, interpret, and report accuracy, precision, and recall so models work in production, not just in the lab.
TL;DR:
- Compare accuracy to the majority-class baseline before trusting results.
- Use precision when false positives are costly or disruptive.
- Use recall when missing positives is dangerous or expensive.
- Report F1 score and class-specific metrics on imbalanced data.
- Monitor metrics across thresholds, not just at a single point.
Metric Choice Determines Model Success
Metric selection is not a technical detail; it decides whether a model catches the failures that matter.
For example, a model predicting the majority class in a 91:9 split reaches 91% accuracy but misses every minority case. This "Accuracy Paradox" is common in fraud or rare disease detection, where accuracy hides the real risk.
Accuracy Masks Minority-Class Failures
Accuracy measures the fraction of correct predictions. It only works when classes are balanced and error costs are equal.
- The Baseline Test: If your model's accuracy matches the majority-class baseline (e.g., a dataset that is 95% "Healthy" and 5% "Sick"), your model has learned nothing.
- The Failure: On a 90/10 split, a model that always predicts the majority class gets 90% accuracy but has a Recall of 0% for the minority class.
Precision vs. Recall: The Strategic Trade-off
Choosing between precision and recall depends on the "cost" of being wrong.
| Metric | Focus | Use Case Example |
|---|---|---|
| Precision | Quality: "Of all predicted positives, how many were right?" | Spam Filters: You don't want a legitimate email (False Positive) in the spam folder. |
| Recall | Quantity: "Of all actual positives, how many did we find?" | Cancer Screening: Missing a case (False Negative) is much worse than a false alarm. |
Imbalanced Data and the F1 Score
In domains like churn prediction or credit card fraud, one class vastly outnumbers the other.
- F1 Score: This is the harmonic mean of precision and recall. It provides a single score that balances both, making it harder for a model to "cheat" by over-performing in only one area.
- Class-Specific Metrics: Never report just an overall average. Report the precision and recall for the minority class specifically.
$$F_1 = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$$
Avoiding Production Pitfalls
- Threshold Blindness: Don't just look at the default 0.5 probability threshold. Plot a Precision-Recall Curve to see how your model performs at different sensitivity levels.
- Alert Fatigue: Over-optimizing for recall can lead to too many false alarms, causing human operators to ignore the model entirely.
- Data Drift: Metrics that look great in the lab can degrade in production as real-world data changes.
Do This Next: Metric Selection Checklist
- Analyze Costs: Identify which is worse: a False Positive or a False Negative.
- Check Baseline: Determine the accuracy of a "dumb" model that only predicts the majority class.
- Threshold Check: Plot Precision-Recall curves to find the optimal decision point.
- F1 Calculation: Use the F1 score if you need a balance between precision and recall.
- Monitor: Set up alerts to track these metrics in your production environment.
Do This Next: Would you like me to help you calculate the F1 score or Precision-Recall for a specific set of model results you have?

