I've sat in too many product reviews where someone waves a hand at the model as if it's a black box. After building AI products across healthcare at HCLTech, edtech at Edxcare, and consumer at Mamaearth, I've seen that the teams who understand the underlying architecture make dramatically better product decisions.

Why Transformers Won

Before transformers (2017 and earlier), the dominant architectures were RNNs and their LSTM variants - recurrent networks that processed tokens one at a time, left to right. This created two problems:

  • Long-range dependencies were lossy. By the time the model reached word 200, information from word 3 had been compressed and partially forgotten.
  • Sequential processing meant slow training. You couldn't parallelize across a sequence.

The "Attention Is All You Need" paper from Google (Vaswani et al., 2017) replaced recurrence entirely with self-attention. Every token attends to every other token simultaneously. GPT, BERT, T5, LLaMA, Gemini - all transformers.

The Attention Mechanism, Without the Calculus

When you process "The nurse checked the patient's chart because she was worried," you instantly resolve that "she" refers to the nurse. Self-attention formalizes this. For each token, the model computes three vectors:

  • Query (Q): What am I looking for?
  • Key (K): What do I represent?
  • Value (V): What information do I carry?

The attention score between two tokens is the dot product of one token's Query and another token's Key, scaled down by the square root of the key dimension and normalized via softmax across the sequence. The model then takes a weighted sum of all Value vectors, weighted by those scores. This runs in parallel for every token simultaneously - hence the training speed advantage.
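
The mechanics above can be sketched in a few lines of NumPy. This is illustrative only - real implementations add batching, masking, and learned biases - and the shapes and weights here are made up for the demo:

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max before exponentiating for numerical stability
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """Scaled dot-product self-attention over a sequence of embeddings.
    X: (seq_len, d_model); W_q, W_k, W_v: (d_model, d_k) learned projections."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    d_k = Q.shape[-1]
    # Every token's Query dotted with every token's Key, all at once
    scores = Q @ K.T / np.sqrt(d_k)        # (seq_len, seq_len)
    weights = softmax(scores, axis=-1)     # each row sums to 1
    return weights @ V                     # weighted sum of Value vectors

# Tiny demo with random weights: 5 tokens, embedding dim 8, head dim 4
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))
W_q, W_k, W_v = (rng.normal(size=(8, 4)) for _ in range(3))
out = self_attention(X, W_q, W_k, W_v)
print(out.shape)  # (5, 4)
```

Note that nothing in the score computation depends on position i coming before position j - which is exactly why positional encodings (covered below) are needed.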

Multi-Head Attention

Multi-head attention runs several attention operations in parallel, each with different learned Q/K/V weight matrices. Each "head" asks a different question of the same text: coreference resolution, syntactic structure, semantic roles. GPT-3 uses 96 attention heads per layer. This is one reason frontier models handle complex prompts with multiple constraints better than smaller models.
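
A minimal sketch of the head-splitting trick, again in NumPy and again with made-up shapes: project once, split the model dimension into independent subspaces, attend per head, then concatenate and mix with an output projection:

```python
import numpy as np

def multi_head_attention(X, W_q, W_k, W_v, W_o, n_heads):
    """Illustrative multi-head attention. Each head attends in its own
    d_head-dimensional subspace; W_o mixes the concatenated results."""
    seq, d_model = X.shape
    d_head = d_model // n_heads
    Q, K, V = X @ W_q, X @ W_k, X @ W_v

    def split(M):
        # (seq, d_model) -> (n_heads, seq, d_head)
        return M.reshape(seq, n_heads, d_head).transpose(1, 0, 2)

    Qh, Kh, Vh = split(Q), split(K), split(V)
    scores = Qh @ Kh.transpose(0, 2, 1) / np.sqrt(d_head)
    e = np.exp(scores - scores.max(-1, keepdims=True))
    weights = e / e.sum(-1, keepdims=True)      # softmax per head
    heads = weights @ Vh                        # (n_heads, seq, d_head)
    concat = heads.transpose(1, 0, 2).reshape(seq, d_model)
    return concat @ W_o

rng = np.random.default_rng(1)
X = rng.normal(size=(5, 8))
W = [rng.normal(size=(8, 8)) for _ in range(4)]  # W_q, W_k, W_v, W_o
out = multi_head_attention(X, *W, n_heads=2)
print(out.shape)  # (5, 8)
```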

Context Windows: The Product-Critical Concept

Context window = the total number of tokens the model can process in a single forward pass. This includes your system prompt, conversation history, retrieved documents, and the response.

  • GPT-3.5: 16K tokens (~12,000 words)
  • GPT-4o: 128K tokens (~96,000 words)
  • Claude 3.5 Sonnet: 200K tokens (~150,000 words)
  • Gemini 1.5 Pro: 1M tokens (~750,000 words)

At HCLTech, we built a clinical trial protocol analyzer. Early versions chunked 200-page protocols into fragments and ran separate inference calls. When we moved to a 200K context window model, we could feed the entire protocol in a single call. Accuracy jumped significantly because the model could reason across cross-referenced sections.

But there's a known phenomenon called lost in the middle (Liu et al., 2023): models perform better on information at the beginning and end of their context than in the middle. Position important context strategically - near the system prompt or near the end of the user turn.
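
One simple mitigation is to control where things land when you assemble the prompt. The function and layout below are illustrative, not a library API:

```python
def assemble_prompt(system, documents, question):
    """Put task-critical instructions and the question at the edges of the
    context; bulk retrieved documents go in the middle, where recall is
    weakest anyway."""
    middle = "\n\n".join(documents)
    return (
        f"{system}\n\n"                                  # critical instructions up front
        f"<documents>\n{middle}\n</documents>\n\n"       # bulk content in the middle
        f"Question (answer using only the documents above): {question}"
    )
```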

Positional Encoding

Self-attention is inherently order-agnostic. Transformers therefore add positional encodings - vectors combined with each token embedding that encode position. Most modern LLMs use Rotary Position Embedding (RoPE), the scheme behind LLaMA and Mistral. Context window extension techniques (YaRN, LongRoPE) work by modifying these positional encodings, which is also why quality can degrade at extreme lengths. Test your specific use case at long contexts - don't just trust the headline number.
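
The core RoPE idea fits in a short sketch: rotate each pair of embedding dimensions by a position-dependent angle, so that dot products between Queries and Keys end up depending on relative position. This is a minimal illustration, not a production kernel:

```python
import numpy as np

def rope(x, positions, base=10000.0):
    """Minimal Rotary Position Embedding (RoPE) sketch.
    x: (seq_len, d) with d even; positions: (seq_len,) integer positions."""
    d = x.shape[-1]
    # One frequency per dimension pair, geometrically spaced
    freqs = base ** (-np.arange(0, d, 2) / d)       # (d/2,)
    angles = positions[:, None] * freqs[None, :]    # (seq_len, d/2)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    # Standard 2D rotation applied to each (x1, x2) dimension pair
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

x = np.ones((4, 8))
rotated = rope(x, np.arange(4))
print(rotated.shape)  # (4, 8)
```

Position 0 gets a zero rotation, so the first token's embedding passes through unchanged - a handy sanity check.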

The Feed-Forward Network: Where Knowledge Lives

Research (Geva et al., 2021) showed FFN layers function like key-value memories: they store factual associations learned during training. "Paris is the capital of France" is stored in the FFN weights. This is why RAG works - you're not replacing the model's knowledge, you're adding context the attention mechanism can incorporate alongside the FFN's parametric knowledge.
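 
Structurally, the FFN is just a two-layer MLP applied to each token position. Under the key-value-memory view, rows of the first matrix act as pattern detectors ("keys") and rows of the second hold the content to write back ("values"). A toy sketch with random weights:

```python
import numpy as np

def ffn(x, W1, b1, W2, b2):
    """Transformer feed-forward block. Per Geva et al. (2021): W1's rows
    detect input patterns; the activation selects which rows of W2
    (the stored 'values') get added back into the residual stream."""
    h = np.maximum(0, x @ W1 + b1)   # ReLU picks which 'memories' fire
    return h @ W2 + b2               # weighted sum of the stored values

rng = np.random.default_rng(2)
d_model, d_ff = 8, 32                # d_ff is typically ~4x d_model
x = rng.normal(size=(d_model,))
W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)
y = ffn(x, W1, b1, W2, b2)
print(y.shape)  # (8,)
```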

Encoder vs Decoder vs Encoder-Decoder

  • Encoder-only (BERT, RoBERTa): Bidirectional attention. Best for classification, NER, semantic similarity. Not designed to generate text.
  • Decoder-only (GPT family, LLaMA, Mistral): Causal left-to-right attention. Generates text autoregressively. Dominant architecture for general-purpose LLMs.
  • Encoder-decoder (T5, BART): A natural fit for sequence-to-sequence tasks like translation and summarization.
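
The mechanical difference between encoder-only and decoder-only attention is just a mask. Decoder-only models add -inf above the diagonal of the score matrix before the softmax, so token i can only attend to tokens 0..i; encoder-only models use no mask:

```python
import numpy as np

def causal_mask(seq_len):
    """Additive causal mask for decoder-only attention: -inf above the
    diagonal zeroes out future positions after softmax. Encoder-only
    (bidirectional) attention simply omits this mask."""
    return np.triu(np.full((seq_len, seq_len), -np.inf), k=1)

m = causal_mask(3)
print(m)
```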

Scaling Laws and Why Model Size Matters (Up to a Point)

The "Chinchilla" paper (Hoffmann et al., 2022) showed most large models were undertrained - too many parameters relative to training data. The optimal ratio is roughly 20 tokens per parameter. This led to smaller, better-trained models: Mistral 7B outperforms GPT-3 (175B) on many benchmarks. Don't reflexively reach for the largest model. Benchmark on your actual use case.
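
The Chinchilla rule of thumb is simple enough to write down. Treat it as a heuristic for sanity-checking claims about training budgets, not an exact law:

```python
def chinchilla_optimal_tokens(n_params, tokens_per_param=20):
    """Hoffmann et al. (2022) rule of thumb: compute-optimal training
    uses roughly 20 tokens per model parameter."""
    return n_params * tokens_per_param

# A 7B-parameter model is compute-optimal at roughly 140B training tokens
print(chinchilla_optimal_tokens(7_000_000_000))  # 140000000000
```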

What This Means for Your Product Decisions

  1. Context window size determines what problems you can solve in a single call. Budget accordingly in your API cost model.
  2. Attention is not magic - it's a learned routing mechanism. Clear, structured prompts reduce the search space.
  3. The model's knowledge is frozen at training cutoff. Anything that's changed since then requires RAG or tool use.
  4. Bigger isn't always better. A smaller model fine-tuned on your domain will often outperform a frontier model on your specific task, at a fraction of the cost.
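
For point 1, a back-of-envelope cost model is worth keeping in a spreadsheet or a script. The rates below are hypothetical placeholders - plug in your provider's current per-million-token prices:

```python
def monthly_cost(calls_per_day, input_tokens, output_tokens,
                 price_in_per_m, price_out_per_m, days=30):
    """Rough monthly API spend. Prices are dollars per million tokens;
    input and output are typically priced differently."""
    per_call = (input_tokens * price_in_per_m
                + output_tokens * price_out_per_m) / 1e6
    return calls_per_day * per_call * days

# Hypothetical rates: $3 / 1M input tokens, $15 / 1M output tokens.
# 10K calls/day, each with a 20K-token RAG prompt and a 1K-token response:
print(monthly_cost(10_000, 20_000, 1_000, 3.0, 15.0))  # 22500.0
```

Note how input tokens dominate here - a reason long-context RAG designs deserve scrutiny even when the window technically fits.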

The teams that ship the best AI products understand their architecture well enough to make the right tradeoffs - and to know which problems require a transformer at all.
