Prompt engineering gets dismissed as a temporary workaround - something you do before you have time to build the real thing. This is wrong. At HCLTech, some of our highest-performing AI features are built on meticulously engineered prompts, not fine-tuning or custom models. Done properly, prompt engineering is a disciplined practice that is faster, cheaper, and often more maintainable than alternatives.

Here is the playbook I have developed across multiple production deployments.

The System Prompt is Your Contract

The system prompt defines the model's role, constraints, output format, and behavior in edge cases. Treat it like you treat an API contract - it should be explicit, complete, and versioned.

A good system prompt answers five questions:

  1. Who is the model? (Role and expertise)
  2. What is it doing? (Task definition)
  3. For whom? (User context)
  4. What format should output take? (Structure constraints)
  5. What should it never do? (Explicit prohibitions)

Bad system prompt: "You are a helpful medical assistant."

Better system prompt: "You are a clinical coding specialist with expertise in ICD-10-CM classification. Your task is to suggest appropriate diagnosis codes for clinical notes provided by a healthcare coder. Always output exactly three candidate codes in order of confidence, with a one-sentence rationale for each. If the clinical note lacks sufficient information to determine a code, say so explicitly rather than guessing. Never suggest codes outside your training cutoff; flag conditions that may have updated coding guidelines."
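The five questions can be made mechanically checkable. A minimal sketch, assuming hypothetical names (`SECTIONS`, `build_system_prompt` are illustrative, not a library API): encode each question as a named section so a reviewer - or a test - can spot a missing one immediately.

```python
# Encode the five system-prompt questions as named sections so a missing
# one fails loudly instead of shipping silently. All names are illustrative.
SECTIONS = ("role", "task", "audience", "output_format", "prohibitions")

def build_system_prompt(**parts: str) -> str:
    """Assemble a system prompt, raising if any of the five sections is missing."""
    missing = [s for s in SECTIONS if s not in parts]
    if missing:
        raise ValueError(f"system prompt is missing sections: {missing}")
    return "\n\n".join(parts[s] for s in SECTIONS)

prompt = build_system_prompt(
    role="You are a clinical coding specialist with expertise in ICD-10-CM.",
    task="Suggest appropriate diagnosis codes for the clinical note provided.",
    audience="Your user is a professional healthcare coder.",
    output_format="Output exactly three candidate codes in order of confidence, "
                  "each with a one-sentence rationale.",
    prohibitions="If the note lacks sufficient information to determine a code, "
                 "say so explicitly. Never guess.",
)
```

Keeping the sections as separate arguments also makes diffs readable when you later version the prompt: a change to `prohibitions` shows up as exactly that.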

Few-Shot Examples: Quality Over Quantity

Few-shot examples - providing worked examples in your prompt - dramatically improve output quality for most tasks. Three to five high-quality examples typically outperform ten mediocre ones.

Select examples that cover: the typical case, a near-edge case, and one example that demonstrates how to handle ambiguity. The model learns the pattern from the typical case and learns your preferred failure behavior from the edge cases.

Format your examples consistently. If your expected output is JSON, every example should show valid JSON. If it is a structured report, every example should show the same structure. Inconsistency in examples creates inconsistency in outputs.
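One way to guarantee that consistency is to store examples as data and serialize them at prompt-build time. A sketch, assuming OpenAI-style role/content message dicts and illustrative example content:

```python
import json

# Few-shot examples stored as (input, expected_output) pairs: the typical
# case, a near-edge case, and one that demonstrates the preferred failure
# behavior for ambiguous input. Content here is illustrative.
EXAMPLES = [
    ("Patient reports persistent cough for 3 weeks.",
     {"category": "respiratory", "urgency": "routine"}),
    ("Severe chest pain radiating to left arm.",
     {"category": "cardiac", "urgency": "emergency"}),
    ("Feeling off lately.",  # ambiguous - teach the failure behavior
     {"category": "unknown", "urgency": "needs_clarification"}),
]

def few_shot_messages(system_prompt: str, user_input: str) -> list[dict]:
    messages = [{"role": "system", "content": system_prompt}]
    for note, label in EXAMPLES:
        messages.append({"role": "user", "content": note})
        # json.dumps guarantees every assistant example is valid JSON with
        # identical key order - the formatting consistency the text calls for.
        messages.append({"role": "assistant", "content": json.dumps(label)})
    messages.append({"role": "user", "content": user_input})
    return messages
```

Because the assistant turns are generated by `json.dumps` rather than hand-typed, a malformed example cannot sneak into the prompt.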

Chain-of-Thought for Reasoning Tasks

For tasks involving reasoning, judgment, or multi-step analysis, chain-of-thought (CoT) prompting significantly improves accuracy. Add a simple instruction: "Think through this step by step before giving your final answer." Or provide an explicit reasoning template in your examples.

SYSTEM_PROMPT = """You are a contract risk analyst. For each contract clause provided,
analyze potential risks following this format:

REASONING:
- Step 1: Identify what the clause requires
- Step 2: Identify potential scenarios where this clause creates liability
- Step 3: Compare to standard industry terms

RISK ASSESSMENT:
- Risk level: [LOW/MEDIUM/HIGH]
- Key concern: [one sentence]
- Recommended action: [one sentence]"""

Research consistently shows that asking the model to reason before answering improves accuracy on complex tasks by 10-40% depending on task type. The mechanism: the reasoning text provides an intermediate representation that the model can condition its final answer on.

Output Formatting: Be Explicit

If you need JSON output, say "Return a JSON object with the following fields:..." and show the exact schema. If you need markdown, specify exactly which elements to use. If you need a specific length, give a word or token count range.

Vague format instructions produce inconsistent results. "Return a summary" produces outputs that vary in length, structure, and level of detail. "Return a 3-bullet summary where each bullet is under 20 words" produces consistent results.

For JSON outputs specifically, use structured outputs (OpenAI) or tool calling to enforce schema compliance. Do not rely on prompting alone for JSON schemas in production - validation errors will happen.
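Even with schema enforcement, production code should validate before trusting. A sketch of the validate-and-retry loop, where `call_model` is a hypothetical stub for whatever client you use and `REQUIRED_KEYS` is illustrative:

```python
import json

REQUIRED_KEYS = {"summary", "risk_level", "codes"}  # illustrative schema

def get_validated_json(call_model, prompt: str, max_retries: int = 2) -> dict:
    """Call the model, parse and validate JSON, and retry with error feedback."""
    last_error = "n/a"
    for _ in range(max_retries + 1):
        raw = call_model(prompt)
        try:
            data = json.loads(raw)
            missing = REQUIRED_KEYS - data.keys()
            if not missing:
                return data
            last_error = f"missing keys: {sorted(missing)}"
        except json.JSONDecodeError as e:
            last_error = str(e)
        # Feed the error back so the retry has something concrete to fix.
        prompt = (f"{prompt}\n\nYour previous reply was invalid "
                  f"({last_error}). Return only valid JSON.")
    raise ValueError(f"no valid JSON after {max_retries} retries: {last_error}")
```

With provider-side structured outputs enabled, this loop becomes a safety net rather than the primary mechanism - but keep it: validation errors will happen.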

The Negative Space: Telling the Model What NOT to Do

Most prompts focus on what the model should do. Adding explicit prohibitions for the specific failure modes you have observed dramatically reduces those failures in production.

Effective negation patterns:

  • "Never make up information you are not confident about. If uncertain, say so explicitly."
  • "Do not add preamble or closing pleasantries - return only the requested output."
  • "Do not suggest actions outside the following list: [list]."
  • "If the user asks about topics outside [domain], politely redirect to [domain] without answering the off-topic question."

Temperature and Sampling Parameters

Temperature controls randomness in token sampling. At temperature 0 the model greedily picks the most likely token at each step, giving conservative, near-deterministic outputs (most providers still permit minor run-to-run variation). At temperature 1 it samples broadly from the distribution, giving varied, creative outputs.

Production defaults by task type:

  • Data extraction, classification, coding: temperature 0 - 0.2. You want consistency.
  • Summarization, Q&A: temperature 0.3 - 0.5. Some variation is fine, accuracy matters.
  • Creative writing, brainstorming, ideation: temperature 0.7 - 1.0. Variation is the point.
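These defaults are easy to centralize so individual call sites cannot drift. A sketch, with illustrative task-type names - tune the values per model and task:

```python
# The defaults above as a lookup table; task names are illustrative.
TEMPERATURE_DEFAULTS = {
    "extraction": 0.0,
    "classification": 0.0,
    "coding": 0.2,
    "summarization": 0.3,
    "qa": 0.5,
    "creative": 0.8,
}

def temperature_for(task_type: str) -> float:
    # Unknown task types default to deterministic - the conservative choice.
    return TEMPERATURE_DEFAULTS.get(task_type, 0.0)
```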

Prompt Versioning and Testing

Version your prompts in code like you version your API schemas. Every change to a production prompt should be a commit with a description of what changed and why. Run your test set against every prompt change. Track accuracy over versions.
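One lightweight way to do this in code - a sketch, with illustrative names, not a prescription for any particular tool:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PromptVersion:
    version: str
    text: str
    changelog: str  # what changed and why - required, like a commit message

# Registry lives in version control next to your API schemas.
PROMPTS = {
    "risk_analyst": [
        PromptVersion("1.0.0",
                      "You are a contract risk analyst...",
                      "Initial version."),
        PromptVersion("1.1.0",
                      "You are a contract risk analyst... Never guess.",
                      "Added explicit prohibition after observed hallucinations."),
    ],
}

def latest(name: str) -> PromptVersion:
    """Return the newest version of a named prompt."""
    return PROMPTS[name][-1]
```

Because each `PromptVersion` carries its changelog, running the test set against every entry gives you accuracy tracked over versions for free.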

Tools worth knowing: Promptfoo for regression testing, LangSmith for tracking prompt performance over time, PromptLayer for logging and versioning.

Common Mistakes to Avoid

  • Testing on the examples you wrote the prompt for. Your examples are not representative of production traffic. Test on a separate held-out evaluation set.
  • Optimizing for one model and deploying to another. Prompt behavior is model-specific. GPT-4o and Claude 3.5 Sonnet respond differently to the same prompt. Test on the model you plan to deploy.
  • Making prompts longer when quality is bad. Length is not quality. A focused, well-structured 200-word prompt often outperforms a sprawling 800-word one.

Where this lands

Prompt engineering is a craft that repays investment. The teams I have seen get the most value from AI products spend time on their prompts - versioning them, testing them, iterating systematically. It is not glamorous work. It is also often faster and cheaper than building alternative solutions and directly determines whether your AI feature is reliable enough to ship.
