I was three months into my first serious AI product role when I made a mistake that still embarrasses me. We had a clinical AI feature with 94% accuracy on our test set. I wrote it up in the product spec as "94% accurate" and used that number as the primary justification for launch.
The feature flopped. Users didn't trust it. And when I dug into why, I discovered that our test set was carefully curated - it didn't represent the messy, ambiguous real-world cases that clinicians actually encountered. In those cases, the model was much less reliable. The number I'd anchored on was technically true and practically meaningless.
That was my introduction to the first-principle failures that haunt AI product development. I've since seen the same mistakes - with different industry wrappers - in edtech, CPG, and enterprise software. The patterns are consistent enough that I've started thinking of them as a taxonomy.
Here are the ones that matter most.
Mistake 1: Confusing Model Accuracy with Product Value
This is the most common mistake and the hardest to correct because it feels so logical. If the model is more accurate, the product is better. Right?
Wrong. Model accuracy is a necessary condition, not a sufficient one. Product value is a function of:
- Accuracy in the actual distribution of inputs users will provide (not your test set)
- Whether users can identify when the model is wrong (calibration)
- Whether the workflow change required to use the output is worth the accuracy gain
- Whether the error cases are tolerable to users
At Edxcare, we built a content recommendation model that hit 88% accuracy in offline evaluation. In production, engagement with recommendations was lower than with our previous rule-based system. Investigation revealed that the 12% error cases were highly visible - the model sometimes recommended content that was clearly far below the student's level. One bad recommendation eroded trust more than ten good recommendations built it.
The fix was to add a confidence threshold - only show recommendations when the model's confidence exceeded a calibrated cutoff. Effective accuracy in production rose to 96%, user trust recovered, and engagement exceeded baseline.
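In code, that gating logic is only a few lines. Here's a minimal sketch, assuming each candidate carries a calibrated confidence score; the cutoff value and names are illustrative, not the production system:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Recommendation:
    content_id: str
    confidence: float  # calibrated probability that the item fits this student


# Illustrative cutoff; in practice it is tuned on held-out data
CONFIDENCE_CUTOFF = 0.82


def recommend_or_fallback(
    candidates: list[Recommendation],
) -> Optional[Recommendation]:
    """Show the top candidate only when the model is confident enough;
    return None so the caller can fall back to the rule-based system."""
    if not candidates:
        return None
    best = max(candidates, key=lambda r: r.confidence)
    return best if best.confidence >= CONFIDENCE_CUTOFF else None
```

The important design choice is that the low-confidence path returns nothing rather than the model's best guess - silence is cheaper than a visibly bad recommendation.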
First principle: Measure the cost of errors, not just the rate of errors. A 95% accurate model where the 5% errors are catastrophic can be worse than an 85% accurate model where errors are trivially recoverable.
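The arithmetic behind that principle is worth making explicit. With toy numbers (the cost figures below are invented for illustration):

```python
def expected_error_cost(
    accuracy: float, cost_per_error: float, n_predictions: int = 1000
) -> float:
    """Expected total cost of errors over a batch of predictions."""
    return (1 - accuracy) * n_predictions * cost_per_error


# 95% accurate, but each error is catastrophic (say, 100 cost units)
catastrophic = expected_error_cost(0.95, cost_per_error=100)  # ~5000
# 85% accurate, but each error is trivially recoverable (1 cost unit)
recoverable = expected_error_cost(0.85, cost_per_error=1)     # ~150
```

The "better" model by accuracy is thirty times worse by expected cost - which is the number the business actually pays.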
Mistake 2: Building for Engineers, Not Users
AI products are often conceived and built by technically sophisticated teams who have deep intuitions about how the model works and what it can do. This creates a systematic blind spot: the team optimizes for what's technically impressive rather than what's practically useful.
Symptoms of this mistake:
- The feature surfaces probabilities or confidence scores directly to users who don't know how to interpret them
- The interface exposes model parameters or configuration options that users shouldn't need to think about
- Error messages reference model failures rather than task failures
- The feature requires users to understand how the AI works to use it effectively
In healthcare, I've seen clinical AI features that displayed "73% confidence" scores to nurses. The nurses had no mental model for what 73% confidence meant in context. Should they act on it? Flag it? Ignore it? The number created anxiety without guidance.
We replaced the confidence score with a signal light system: green (high confidence, act on it), yellow (moderate confidence, verify if time permits), red (low confidence, requires independent verification). Same underlying model - radically different user experience.
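The mapping itself is trivial; the hard part is calibrating the cutoffs. A sketch with hypothetical thresholds (the 0.90 / 0.70 values are placeholders, not the ones we shipped):

```python
def signal_light(confidence: float) -> str:
    """Translate a calibrated confidence score into an action-oriented signal.
    Real cutoffs come from calibration against observed error rates and the
    clinical cost of acting on a miss; these are placeholders."""
    if confidence >= 0.90:
        return "green"   # high confidence: act on it
    if confidence >= 0.70:
        return "yellow"  # moderate confidence: verify if time permits
    return "red"         # low confidence: requires independent verification
```

Note that the thresholds only mean anything if the underlying scores are calibrated - a raw softmax output of 0.90 is not a 90% chance of being right.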
First principle: Users don't care how the model works. They care whether the feature helps them do their job better. Design for the task, not the technology.
Mistake 3: Ignoring the Data Flywheel
AI products have a compounding advantage that traditional software doesn't: they can get better over time without engineering effort, if they're designed to capture feedback loops. The data flywheel - where usage generates labeled data, which improves the model, which drives more usage - is the moat that separates AI leaders from followers.
Most PMs don't plan the flywheel at product inception; they bolt it on after launch. By then, valuable signal has been lost and the architecture may not support the feedback mechanisms needed.
Flywheel design questions to answer before building:
- What implicit signal does user interaction generate? (clicks, corrections, skips, completions)
- What explicit signal can you capture without friction? (thumbs up/down, edit actions, flagging)
- How will this signal be labeled and used for retraining?
- What's your retraining cadence and infrastructure?
- How will you detect when the model has drifted from user expectations?
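Concretely, answering those questions early often reduces to deciding what a feedback event looks like before you build. A hypothetical event schema - the field names and values are invented for illustration, not any real pipeline:

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class FeedbackEvent:
    user_id: str
    item_id: str
    shown_at: float                   # when the output was surfaced (epoch seconds)
    implicit: str                     # "click", "skip", "completion", ...
    explicit: Optional[str] = None    # "thumbs_up", "edit", "flag", or None
    downstream: Optional[str] = None  # "purchase", "return", "repurchase", or None


def serialize(event: FeedbackEvent) -> str:
    """One JSON line per event, ready for the labeling/retraining pipeline."""
    return json.dumps(asdict(event))
```

The `downstream` field is the one most teams forget - it's exactly the purchase-and-repurchase signal that turned out to matter most for us.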
At Mamaearth, our skin care recommendation engine didn't initially capture what happened after a recommendation was shown. We knew click rates but not whether users ultimately purchased, repurchased, or returned the recommended product. Adding purchase-and-repurchase data to the feedback loop improved recommendation quality by more than any model architecture change we made.
Mistake 4: Underestimating Regulatory Complexity
Regulatory complexity in AI is not linear - it's combinatorial. Add a new data source, double the compliance considerations. Move into a new jurisdiction, triple them. Build an AI that makes a recommendation that affects a person's health, finances, or legal status, and you're operating under a completely different regulatory regime than informational AI.
PMs from consumer software backgrounds often have almost no intuition for this. They treat regulation as a checkbox at the end of development rather than a design constraint from the start.
The consequences are severe: features that have to be pulled post-launch, months of retrofitting compliance into architectures not designed for it, and in the worst cases, regulatory action and fines.
My rule: for any AI feature that affects decisions about people, bring legal and compliance into discovery, not into launch review. The cost of a 30-minute conversation with your compliance team in week one is a rounding error next to the cost of a redesign in week twelve.
In healthcare, this is mandatory - FDA clearance for clinical decision support, HIPAA compliance for patient data, state-level regulations for telehealth. In fintech, it's CFPB scrutiny of lending algorithms, SEC rules for investment advice, and FinCEN requirements for AML. In edtech, it's FERPA for student data, COPPA for under-13 users, and state-level student data privacy laws.
First principle: Regulation is a product requirement, not a compliance afterthought. The features you can build, the data you can use, and the decisions your AI can make are all shaped by regulatory constraints. Know those constraints before you start designing.
Mistake 5: Not Planning for Failure Modes
Traditional software fails in predictable ways. A server goes down. A network times out. An API returns an error code. These failures are binary and recoverable.
AI systems fail in probabilistic, context-dependent, and sometimes insidious ways. The model gives a plausible-sounding wrong answer. It works perfectly for 95% of users and consistently fails for a specific demographic. It performs well on historical data and degrades on recent data because the world has changed. It gives different answers to the same question asked in slightly different ways.
These failure modes require different product design responses:
- Graceful degradation: When confidence is low, fall back to a non-AI path rather than showing a low-quality AI output
- Uncertainty communication: Signal to users when the model is operating outside its reliable range
- Human escalation paths: Make it easy for users to override, question, or escalate AI outputs
- Monitoring and alerting: Instrument your AI features to detect performance degradation in production
- Rollback capability: Be able to revert to a previous model version quickly if a new deployment degrades
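The first of those responses, graceful degradation, can be sketched in a few lines. Assume `model_answer` returns an answer plus a calibrated confidence and `rule_based_answer` is the pre-existing non-AI path - both are stand-ins here, not real APIs:

```python
MIN_CONFIDENCE = 0.70  # placeholder threshold; calibrate per task


def model_answer(query: str) -> tuple[str, float]:
    # Stand-in for a real model call; pretends "ambiguous" queries score low.
    return f"ai-answer:{query}", 0.40 if "ambiguous" in query else 0.95


def rule_based_answer(query: str) -> str:
    # Stand-in for the deterministic fallback path.
    return f"rule-answer:{query}"


def answer(query: str) -> tuple[str, str]:
    """Return (answer, source). Fall back to the non-AI path on low
    confidence rather than showing a low-quality AI output."""
    text, confidence = model_answer(query)
    if confidence >= MIN_CONFIDENCE:
        return text, "model"
    return rule_based_answer(query), "fallback"
```

Returning the source alongside the answer matters for the other bullets too: it's what lets you monitor how often you're degrading and route fallback cases to human review.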
Mistake 6: Treating AI as a Feature, Not a System
A common PM framing is "let's add AI to this feature." This framing is almost always wrong. AI isn't an ingredient you add to a feature; it's a system that changes the architecture, operational requirements, and user relationship of everything around it.
When you add AI to a product, you're also adding:
- Model versioning and deployment infrastructure
- Training data pipelines and labeling workflows
- Model performance monitoring
- Bias and fairness testing
- User feedback collection mechanisms
- Explanation infrastructure (for regulated contexts)
- Human review workflows for low-confidence outputs
None of this shows up in a feature spec. All of it is required for a production AI system. Teams that treat AI as a feature are constantly surprised by the operational overhead that emerges after launch.
Mistake 7: Skipping the Human Baseline
Before building an AI solution, measure how good humans are at the task. This sounds obvious. Most teams don't do it.
Why it matters: if a human expert achieves 92% accuracy on the task and your AI achieves 88%, the AI isn't ready, regardless of how impressive 88% sounds in isolation. If a human takes 20 minutes to complete a task and the AI takes 2 seconds with 80% accuracy, the AI might still be the right choice depending on volume and error cost.
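That trade-off can be put into numbers. Using the hypothetical figures above plus invented cost assumptions (one cost unit per minute of human time; the error costs are made up to show that the answer flips):

```python
def total_cost(
    n_tasks: int,
    minutes_per_task: float,
    accuracy: float,
    cost_per_error: float,
    cost_per_minute: float = 1.0,
) -> float:
    """Time cost plus expected error cost over a batch of tasks."""
    time_cost = n_tasks * minutes_per_task * cost_per_minute
    error_cost = n_tasks * (1 - accuracy) * cost_per_error
    return time_cost + error_cost


# Cheap errors: the fast-but-sloppier AI wins on volume.
human_cheap = total_cost(1000, 20, 0.92, cost_per_error=50)      # ~24000
ai_cheap = total_cost(1000, 2 / 60, 0.80, cost_per_error=50)     # ~10033
# Expensive errors: the accurate human wins despite being slow.
human_costly = total_cost(1000, 20, 0.92, cost_per_error=500)
ai_costly = total_cost(1000, 2 / 60, 0.80, cost_per_error=500)
```

Same speeds, same accuracies - only the error cost changed, and it reversed the decision. That's why "depending on volume and error cost" is doing all the work in the sentence above.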
The human baseline also shapes your evaluation set. Your test cases should include the cases that trip up humans - the ambiguous, edge-case inputs where expertise matters. If your AI only works on the easy cases that humans also handle easily, you've built a toy.
The Underlying Pattern
Looking across these mistakes, the underlying pattern is the same: PMs trained on deterministic software apply deterministic mental models to probabilistic systems. Traditional software does what it's told. AI systems do what the data and training suggest is likely correct - which is a fundamentally different contract with the user.
Building good AI products requires accepting this uncertainty and designing around it. Not hiding it, not pretending the model is more reliable than it is, but designing experiences that help users make good decisions in a world where the AI is right most of the time but not all of the time.
That's a harder design problem than "make the button do the thing." It's also a more interesting one.