The first AI PRD I ever wrote was a disaster. I used the same template I had for growth features at Mamaearth - user story, acceptance criteria, metrics, launch plan. We shipped the feature. It worked in demo. In production it hallucinated on 12% of queries, we had no rollback plan, and it took three weeks to debug because we had not defined what "working correctly" meant for a generative output.
That experience taught me that AI features need a fundamentally different PRD structure. Not because the fundamentals of product management change - user needs still drive everything - but because the failure surface of AI systems is qualitatively different from deterministic software. You cannot just write "success = 95% uptime." You need to define what the model should and should not do, what happens when it is wrong, and how you will know it is drifting before your users notice.
This guide walks through every section of an AI PRD with explanations of why each section exists. The full template is at the end.
Part 1: The Standard Sections (With AI-Specific Extensions)
Problem Statement and User Need
No change from standard PRD practice here. Define the problem, the user, and the current pain. What I add for AI features is an explicit statement of why AI is the right solution - not just "AI can do this" but "this problem has the properties that make AI the appropriate tool." Those properties are: the task is too complex for rule-based logic, the input space is too variable for deterministic handling, or the output quality scales with learned patterns from data.
If you cannot articulate why AI specifically (versus a lookup table, a rule engine, or a simpler ML model), that is a red flag. Some features get AI applied because it is fashionable, not because it is right.
Goals and Non-Goals
For AI features, the non-goals section is as important as the goals. Be explicit about what the model will not do. "The assistant will not provide clinical advice" is a non-goal. "The summarizer will not interpret ambiguous legal clauses" is a non-goal. These non-goals define the guardrail requirements later.
Part 2: AI-Specific Sections
Model Requirements
This section does not exist in standard PRDs. It defines the technical requirements the AI model must meet independent of business logic. Cover:
- Task type: Classification, generation, extraction, retrieval, ranking. This determines model architecture and evaluation approach.
- Latency requirement: p50, p95, p99. Real-time features (autocomplete, live suggestions) have different constraints than async features (background summarization, nightly reports).
- Context window requirement: How much input does the model need to see at once? For document analysis, this drives chunking strategy and model selection.
- Output format: Free text, structured JSON, classification label, ranking score. Structured outputs require schema validation and error handling that free text does not.
- Model provider preference and fallback: Which model(s) are in scope? What is the fallback if the primary provider has an outage?
- Cost ceiling: Cost per 1000 queries or per user per month. AI features have variable costs that traditional features do not - this needs to be defined upfront.
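One way to make these requirements reviewable is to capture them as a structured spec rather than prose. The sketch below is illustrative only: the field names, models, and values are assumptions, not prescriptions.

```python
from dataclasses import dataclass

# Hypothetical sketch: the model requirements above as a machine-checkable spec.
@dataclass
class ModelRequirements:
    task_type: str             # "classification" | "generation" | "extraction" | "retrieval" | "ranking"
    latency_p50_ms: int
    latency_p95_ms: int
    latency_p99_ms: int
    context_window_tokens: int
    output_format: str         # "free_text" | "structured_json" | "label" | "score"
    primary_model: str
    fallback_model: str
    cost_ceiling_per_1k_queries_usd: float

reqs = ModelRequirements(
    task_type="generation",
    latency_p50_ms=800, latency_p95_ms=2000, latency_p99_ms=4000,
    context_window_tokens=16_000,
    output_format="structured_json",
    primary_model="gpt-4o",
    fallback_model="rule-based-summary",
    cost_ceiling_per_1k_queries_usd=5.0,
)

# A measured latency can then be checked directly against the spec.
measured_p95_ms = 1850
assert measured_p95_ms <= reqs.latency_p95_ms
```

The value of the structured form is that engineering, data science, and finance all sign off on the same object, and monitoring can assert against it post-launch.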
Data Dependencies
This is where AI PRDs most commonly fail. Teams scope the model requirements carefully and then discover in week 3 of development that the training data does not exist, the retrieval corpus is not clean, or the fine-tuning dataset requires six months of annotation work.
Map out every data dependency:
- Training data (if fine-tuning): Source, size, format, annotation requirements, PII handling, consent status
- Retrieval corpus (if RAG): Source documents, refresh cadence, access control, chunking strategy
- Inference-time data: What user data is passed to the model at runtime? Who owns it? What is the data retention policy?
- Evaluation dataset: How will you measure model quality? Who creates the ground truth? How many examples? How often does it refresh?
- Feedback data: How do user signals (thumbs up/down, corrections, dwell time) feed back into model improvement?
For healthcare features, add a column for PHI classification on every data element. For fintech, add PCI/PII classification. Knowing data sensitivity early prevents rebuilds when legal reviews the design.
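The dependency map can be kept as a simple table with a sensitivity column, so every data element is visible to legal in one place. The entries below are hypothetical examples, not a complete list.

```python
# Illustrative data dependency map; elements, owners, and classifications
# are hypothetical.
data_dependencies = [
    {"element": "clinical notes corpus", "role": "retrieval_corpus",
     "owner": "data-platform", "refresh": "weekly", "sensitivity": "PHI"},
    {"element": "user query text", "role": "inference_time",
     "owner": "product", "refresh": "per-request", "sensitivity": "PII"},
    {"element": "golden eval set (500 examples)", "role": "evaluation",
     "owner": "data-science", "refresh": "quarterly", "sensitivity": "de-identified"},
]

# Surface the sensitive elements that need legal review before build starts.
needs_review = [d["element"] for d in data_dependencies
                if d["sensitivity"] in {"PHI", "PII", "PCI"}]
print(needs_review)
```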
Evaluation Criteria
Define what "good" means before you build. This is the section most PMs skip and most engineers wish they had.
Evaluation criteria for AI features have three layers:
- Automated metrics: ROUGE, BLEU, BERTScore for text generation. Precision/recall/F1 for classification. MRR, NDCG for retrieval. These are fast and cheap to run on every deployment.
- Human evaluation rubric: A scored rubric (1-5) for dimensions that automated metrics cannot capture - accuracy, helpfulness, tone, safety. Define the rubric before launch and use it for pre-launch sign-off and periodic audits.
- Business metrics: Task completion rate, time-to-decision, error escalation rate, user retention delta. These take weeks to measure but are the only metrics that actually matter.
Also define the minimum bar to ship. "We will not launch until automated metric X exceeds threshold Y on our evaluation dataset" gives the team a clear target and prevents premature launches driven by deadline pressure.
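A minimum bar to ship can be enforced mechanically in CI. The sketch below assumes a binary classification task and computes F1 by hand; the threshold and the toy labels are illustrative assumptions.

```python
# Sketch of a "minimum bar to ship" gate for a binary classification feature.
def f1_score(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

SHIP_THRESHOLD_F1 = 0.85  # the PRD's explicit pass/fail bar (illustrative)

y_true = [1, 1, 0, 1, 0, 1, 0, 0]   # ground truth from the evaluation dataset
y_pred = [1, 1, 0, 1, 0, 0, 0, 1]   # model outputs on the same examples

score = f1_score(y_true, y_pred)
passed = score >= SHIP_THRESHOLD_F1
print(f"F1={score:.2f}, ship={'yes' if passed else 'BLOCKED'}")
```

Running this gate on every candidate deployment turns "are we good enough to launch?" from a meeting into a test.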
Failure Modes
This is the section that separates thoughtful AI PMs from everyone else. Every AI feature has a failure surface. Map it out explicitly.
Common failure modes by category:
- Hallucination: Model generates plausible but incorrect output. Especially dangerous in healthcare (wrong dosing), legal (false case citations), fintech (wrong rate quotes). Mitigation: grounding in retrieved sources, output verification, confidence scoring.
- Refusal: Model declines to answer valid queries. Causes user frustration and task abandonment. Mitigation: system prompt tuning, fallback to search, graceful degradation messaging.
- Distribution shift: Production queries look different from the evaluation dataset. Model performs well in testing and degrades post-launch. Mitigation: production query logging, continuous eval on live traffic samples.
- Context overflow: Input exceeds context window, leading to truncation and degraded outputs. Mitigation: input length validation, chunking strategy, longer-context model fallback.
- Prompt injection: Adversarial input manipulates model behavior. Mitigation: input sanitization, sandboxed tool use, output filtering.
- Latency spike: High traffic or complex queries cause response time SLA violations. Mitigation: request queuing, async fallback, simpler model for degraded mode.
For each failure mode, document: likelihood (high/medium/low), severity (critical/major/minor), detection mechanism, and mitigation.
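As a concrete example of a detection/mitigation pair, here is a sketch of a context-overflow guard. The 4-characters-per-token heuristic and the limits are rough illustrative assumptions; a real system should count tokens with the model's actual tokenizer.

```python
# Sketch of one failure mode from the list above: context overflow.
CONTEXT_LIMIT_TOKENS = 16_000   # illustrative model limit
RESERVED_FOR_OUTPUT = 2_000     # leave headroom for the response

def estimate_tokens(text: str) -> int:
    return max(1, len(text) // 4)  # crude heuristic, not a real tokenizer

def route_request(document: str) -> str:
    budget = CONTEXT_LIMIT_TOKENS - RESERVED_FOR_OUTPUT
    if estimate_tokens(document) <= budget:
        return "primary_model"      # fits: normal path
    return "chunked_pipeline"       # overflow detected: chunk, then merge summaries

print(route_request("short doc"))        # fits the budget
print(route_request("x" * 100_000))      # exceeds the budget
```

The same shape (detect, then route to a mitigation) applies to the other failure modes: a confidence score routes low-confidence generations to human review, a latency monitor routes overloaded requests to the simpler degraded-mode model.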
Bias and Fairness Testing
Non-negotiable for any AI feature touching users. Define the bias dimensions relevant to your use case and the testing protocol before development starts.
For a clinical documentation AI: Does it perform equivalently across patient demographics (age, gender, race, insurance status)? For a hiring AI assistant: Does it recommend or filter differently across protected classes? For a financial advice AI: Does it give different quality recommendations to different income brackets?
Document: which demographic slices to test, the minimum sample size per slice, the maximum acceptable performance gap, and who signs off on the fairness review before launch.
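That check is straightforward to automate once slice results exist. In the sketch below, the slice names, counts, and the 5-point gap threshold are hypothetical; substitute the slices and thresholds from your own fairness review.

```python
# Illustrative fairness gate: per-slice accuracy and a maximum acceptable gap.
MAX_GAP = 0.05      # maximum acceptable accuracy gap between slices (assumed)
MIN_SAMPLE = 100    # minimum examples per slice for the check to count

slice_results = {
    "age_18_40":   {"correct": 181, "total": 200},
    "age_41_65":   {"correct": 178, "total": 200},
    "age_65_plus": {"correct": 168, "total": 200},
}

accuracies = {name: r["correct"] / r["total"]
              for name, r in slice_results.items()
              if r["total"] >= MIN_SAMPLE}
gap = max(accuracies.values()) - min(accuracies.values())
print(f"gap={gap:.3f}, pass={gap <= MAX_GAP}")
```

In this toy data the oldest slice lags by more than the allowed gap, so the fairness review fails and launch sign-off is withheld.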
User Feedback Loops
AI features need feedback loops that traditional features do not. The model improves - or degrades - based on signals from production. Define explicitly:
- What explicit feedback mechanisms exist (thumbs, ratings, corrections)?
- What implicit signals are collected (query reformulations, task abandonment, downstream actions)?
- How does feedback flow into model improvement (fine-tuning queue, RLHF pipeline, retrieval corpus updates)?
- What is the feedback-to-improvement cycle time?
- Who owns the feedback review process?
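The routing question (which signal feeds which improvement path) can be pinned down in the PRD itself. The event types and queue names below are hypothetical assumptions, shown only to make the flow concrete.

```python
# Hypothetical sketch of routing user feedback into improvement paths.
def route_feedback(event: dict) -> str:
    if event.get("type") == "correction":
        return "fine_tune_queue"      # user-supplied fix: candidate training pair
    if event.get("type") == "thumbs_down":
        return "human_review_queue"   # needs triage before any model change
    if event.get("type") == "thumbs_up":
        return "eval_set_candidates"  # strong positives can seed the eval set
    return "analytics_only"           # implicit signals: aggregate, do not train on

print(route_feedback({"type": "correction", "text": "dose is 5mg, not 50mg"}))
print(route_feedback({"type": "dwell_time", "seconds": 42}))
```

Writing the routing down forces the ownership question: someone has to drain each of those queues on a defined cycle time, or the feedback loop is decorative.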
Rollback Plan
Every AI feature needs a rollback plan that is more detailed than "we will revert the deployment." AI features often have infrastructure dependencies - vector stores, fine-tuned model weights, inference endpoints - that make rollback non-trivial.
Define: the rollback trigger conditions (metric drops below X, error rate exceeds Y, critical safety issue detected), the rollback procedure step by step, the estimated time to rollback, the degraded experience users see during rollback, and who has the authority to trigger it.
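Trigger conditions are easiest to honor when they are encoded rather than remembered. The thresholds below are illustrative placeholders for the PRD's X and Y; the one deliberate design choice is that a critical safety signal always wins, regardless of metrics.

```python
# Sketch of automated rollback trigger conditions; thresholds are assumed.
def should_rollback(metrics: dict) -> bool:
    if metrics.get("critical_safety_issue"):
        return True                                # safety overrides everything
    if metrics.get("error_rate", 0.0) > 0.02:      # error rate exceeds Y
        return True
    if metrics.get("eval_score", 1.0) < 0.80:      # quality metric drops below X
        return True
    return False

print(should_rollback({"error_rate": 0.01, "eval_score": 0.90}))  # healthy
print(should_rollback({"error_rate": 0.05, "eval_score": 0.90}))  # error spike
print(should_rollback({"critical_safety_issue": True}))           # safety event
```

Automated triggers do not remove the need for a named human with rollback authority; they just guarantee the decision point is reached fast enough for that human to act.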
Part 3: Governance and Compliance
For regulated industries, add a compliance section that covers: relevant regulations (HIPAA, GDPR, FDA CDS guidance, EU AI Act), compliance review sign-off required before launch, audit logging requirements, model card documentation, and data residency constraints.
This section often drives timeline more than engineering complexity. Getting legal and compliance review scheduled early prevents launch delays.
The Full Template
AI Feature PRD Template v2.1
Feature Name: [Name]
Author: [PM Name]
Date: [Date]
Status: [Draft / In Review / Approved]
Stakeholders: [Engineering, Data Science, Legal, Design, Compliance]
1. Problem Statement
[1-2 paragraphs. The user problem, current pain, and why AI is the right solution for this specific problem.]
2. Goals and Non-Goals
Goals: [Bullet list of what this feature will accomplish]
Non-Goals: [Explicit list of what the model will not do - these become guardrail requirements]
3. User Stories
[As a [user type], I want to [action] so that [outcome]. Include happy path and edge case stories.]
4. Model Requirements
- Task type: [classification / generation / extraction / retrieval / ranking]
- Latency: p50 < Xms, p95 < Yms, p99 < Zms
- Context window: [tokens required per request]
- Output format: [free text / structured JSON / label / score]
- Primary model: [GPT-4o / Claude 3.5 / Gemini 1.5 / fine-tuned model]
- Fallback model: [simpler model or rule-based fallback]
- Cost ceiling: $X per 1000 queries
5. Data Dependencies
- Training data (if fine-tuning): [source, size, annotation status, PII classification]
- Retrieval corpus (if RAG): [source, refresh cadence, access control, chunk size]
- Inference-time data: [user data passed at runtime, retention policy, consent basis]
- Evaluation dataset: [source, size, refresh cadence, owner]
- Feedback data: [collection mechanism, storage, use in model improvement]
6. Evaluation Criteria
- Automated metrics: [metric name, threshold, evaluation frequency]
- Human eval rubric: [dimensions, scoring scale, reviewer profile, sample size]
- Business metrics: [primary KPI, secondary KPIs, measurement window]
- Minimum bar to ship: [explicit pass/fail criteria before launch authorized]
7. Failure Modes
[Table with columns: Failure Mode | Likelihood | Severity | Detection | Mitigation]
8. Bias and Fairness Testing
- Demographic slices to test: [list]
- Minimum sample per slice: [N]
- Maximum acceptable performance gap: [X%]
- Fairness review sign-off: [who, when]
9. User Feedback Loops
- Explicit feedback: [mechanism, storage, owner]
- Implicit signals: [what is collected, how used]
- Improvement cycle: [how feedback flows to model updates, cycle time]
10. Rollback Plan
- Trigger conditions: [metric thresholds, safety signals, who can trigger]
- Rollback procedure: [step by step]
- Time to rollback: [estimated]
- Degraded experience: [what users see during rollback]
- Authority: [who approves rollback decision]
11. Compliance and Governance
- Applicable regulations: [HIPAA / GDPR / FDA / EU AI Act / other]
- Compliance review required: [yes/no, reviewer, deadline]
- Audit logging: [what is logged, retention period]
- Model card: [link or owner]
- Data residency: [constraints]
12. Launch Plan
- Rollout strategy: [percentage rollout / feature flag / cohort-based]
- Monitoring plan: [dashboards, alerts, on-call rotation]
- Communication plan: [internal stakeholders, user-facing messaging]
The template is a starting point, not a straitjacket. Healthcare features need more depth in the compliance and bias sections. Consumer features need more depth in the feedback loop and experimentation sections. Internal tooling features can skip some governance overhead. Calibrate depth to stakes.
The discipline the template enforces - thinking through failure modes before building, defining evaluation criteria before writing code, mapping data dependencies before scoping timelines - is worth more than any specific section. AI products fail because teams skip this thinking, not because they used the wrong framework.