ML Fundamentals
Transformer architecture explained: the model behind every modern LLM
· 5 min read · By Jon Jovinsson
The transformer architecture, introduced in the 2017 Google paper 'Attention Is All You Need', is the foundation of every major language model today. It processes entire sequences in parallel using self-attention, where each token in the input attends to every other token to determine what context matters. This parallel processing enables transformers to train on massive datasets and scale in ways that previous recurrent architectures couldn't.
The self-attention mechanism
Self-attention computes a relevance score between every pair of tokens in the input. For each token, the model learns three vectors: a Query (what am I looking for?), a Key (what do I represent?), and a Value (what information do I carry?). Attention scores are computed as dot products of queries and keys, scaled by the square root of the key dimension to keep gradients stable, normalised with softmax, then used to weight the sum of value vectors. The result is a context-aware representation for every position in the sequence.
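The mechanism described above can be sketched in a few lines of NumPy. This is a minimal single-head version for illustration (real transformers use multiple heads, learned projections, and masking); the function and variable names are our own, not from any particular library.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max before exponentiating, for numerical stability.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """Single-head scaled dot-product self-attention.

    X: (seq_len, d_model) token embeddings
    W_q, W_k, W_v: (d_model, d_k) learned projection matrices
    """
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)       # (seq_len, seq_len) pairwise relevance
    weights = softmax(scores, axis=-1)    # each row sums to 1
    return weights @ V                    # context-aware representation per token
```

Each row of `weights` tells you how much that position attends to every other position, which is exactly the "every token attends to every other token" behaviour the paragraph describes.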
Encoder vs decoder vs encoder-decoder
- Encoder-only (BERT-style): reads sequences in both directions, good for classification and retrieval
- Decoder-only (GPT-style): generates text left to right, the architecture behind all major LLMs
- Encoder-decoder (T5-style): encodes an input then generates an output, used for translation and summarisation
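The practical difference between the encoder and decoder variants comes down to the attention mask. A rough sketch, using NumPy boolean masks purely for illustration:

```python
import numpy as np

seq_len = 5

# Encoder-style (bidirectional): every token may attend to every other token,
# which is why BERT-style models see the whole sequence at once.
encoder_mask = np.ones((seq_len, seq_len), dtype=bool)

# Decoder-style (causal): each token attends only to itself and earlier
# positions, so generation proceeds strictly left to right.
decoder_mask = np.tril(np.ones((seq_len, seq_len), dtype=bool))
```

In a real model this mask is applied to the attention scores before the softmax (disallowed positions are set to negative infinity), but the triangular shape is the whole story: it is what makes decoder-only models generators rather than readers.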
Why transformers dominate
Transformers parallelise training across the full sequence (unlike RNNs, which process tokens one at a time), scale predictably with data and compute, and generalise across modalities: text, images, audio, and code are all handled by variants of the same architecture. Virtually every commercial LLM in use today, including the Claude and GPT models JDML uses for agent development, is a transformer.
Practical implications for Australian AI teams
Understanding the transformer helps you make better decisions about model selection, fine-tuning, and inference costs. Decoder-only models (GPT, Claude) generate text token by token, which means output length directly drives cost. Encoder-only models are much cheaper for retrieval and classification tasks. For Australian businesses running high-volume AI workloads, choosing the right architecture variant can mean a 10x difference in inference cost.
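To make the cost gap concrete, here is a back-of-envelope comparison. The per-million-token rates below are hypothetical placeholders, not real vendor quotes, and the workload numbers are invented for illustration:

```python
# Hypothetical high-volume workload: 100k requests/day, ~500 tokens each.
requests_per_day = 100_000
tokens_per_request = 500
daily_tokens = requests_per_day * tokens_per_request  # 50 million tokens/day

# Placeholder rates in AUD per million tokens (assumptions, not real pricing):
# decoder-only generation is typically billed far above embedding/encoding.
llm_output_rate = 15.00
embedding_rate = 0.10

llm_cost = daily_tokens / 1_000_000 * llm_output_rate
embedding_cost = daily_tokens / 1_000_000 * embedding_rate

print(f"Decoder-only generation: ${llm_cost:.2f}/day")
print(f"Encoder-only embeddings: ${embedding_cost:.2f}/day")
```

Even with made-up rates, the structure of the calculation shows why routing retrieval and classification traffic to an encoder-only model, and reserving the decoder-only LLM for generation, can move inference cost by an order of magnitude.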