JDML

ML Fundamentals

Transformer architecture explained: the model behind every modern LLM

· 5 min read

The transformer architecture, introduced in the 2017 Google paper 'Attention Is All You Need', is the foundation of every major language model today. It processes entire sequences in parallel using self-attention, where each token in the input attends to every other token to determine what context matters. This parallel processing enables transformers to train on massive datasets and scale in ways that previous recurrent architectures couldn't.

The self-attention mechanism

Self-attention computes a relevance score between every pair of tokens in the input. For each token, the model learns three vectors: a Query (what am I looking for?), a Key (what do I represent?), and a Value (what information do I carry?). Attention scores are computed as dot products of queries and keys, scaled by the square root of the key dimension, normalised with softmax, then used to weight a sum of value vectors. The result is a context-aware representation for every position in the sequence.
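The mechanism above fits in a few lines of NumPy. This is a minimal single-head sketch with randomly initialised projection matrices, not a production implementation:

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention over a sequence X of shape (seq_len, d_model)."""
    Q = X @ Wq  # queries: what each token is looking for
    K = X @ Wk  # keys: what each token represents
    V = X @ Wv  # values: the information each token carries
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # pairwise relevance scores
    # Softmax each row so every token's attention weights sum to 1
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V  # context-aware representation per position

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))  # 4 tokens, model dimension 8
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)
print(out.shape)  # (4, 8): one contextualised vector per input token
```

Real transformers run many such heads in parallel and stack the layers, but the core computation is exactly this.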

Encoder vs decoder vs encoder-decoder

  • Encoder-only (BERT-style): reads sequences in both directions, good for classification and retrieval
  • Decoder-only (GPT-style): generates text left to right, the architecture behind all major LLMs
  • Encoder-decoder (T5-style): encodes an input then generates an output, used for translation and summarisation
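The decoder-only "left to right" behaviour is just a loop: predict a distribution over the next token, pick one, append it, repeat. A toy sketch, where `next_token_logits` is a hypothetical stand-in for a real transformer forward pass:

```python
import numpy as np

def next_token_logits(tokens, vocab_size=5):
    # Deterministic stand-in for a model forward pass (hypothetical):
    # always favours (last_token + 1) mod vocab_size.
    logits = np.zeros(vocab_size)
    logits[(tokens[-1] + 1) % vocab_size] = 1.0
    return logits

def generate(prompt, steps):
    tokens = list(prompt)
    for _ in range(steps):
        logits = next_token_logits(tokens)
        tokens.append(int(np.argmax(logits)))  # greedy decoding
    return tokens

print(generate([0], 4))  # [0, 1, 2, 3, 4]
```

Note that each new token requires another full pass over the growing sequence, which is why output length, not just input length, drives generation cost.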

Why transformers dominate

Transformers parallelise training across the full sequence (unlike RNNs which process sequentially), scale predictably with data and compute, and generalise across modalities: text, images, audio, and code are all handled by variants of the same architecture. Every commercial LLM in use today, including the Claude and GPT models JDML uses for agent development, is a transformer.

Practical implications for Australian AI teams

Understanding the transformer helps you make better decisions about model selection, fine-tuning, and inference costs. Decoder-only models (GPT, Claude) generate text token by token, which means output length directly drives cost. Encoder-only models are much cheaper for retrieval and classification tasks. For Australian businesses running high-volume AI workloads, choosing the right architecture variant can mean a 10x difference in inference cost.
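A quick back-of-envelope way to see how output length drives decoder-only cost. The per-million-token prices here are hypothetical placeholders, not real vendor pricing:

```python
def decoder_cost(input_tokens, output_tokens, in_price, out_price):
    """Cost in dollars for one request, with prices per million tokens."""
    # Decoder-only APIs typically bill input and output tokens separately,
    # with output tokens priced higher.
    return input_tokens / 1e6 * in_price + output_tokens / 1e6 * out_price

# Hypothetical prices: $3 / 1M input tokens, $15 / 1M output tokens.
short_reply = decoder_cost(1000, 100, in_price=3.0, out_price=15.0)
long_reply = decoder_cost(1000, 2000, in_price=3.0, out_price=15.0)
print(short_reply, long_reply)  # 0.0045 0.033
```

Same 1,000-token input, but the longer reply costs roughly 7x more, which is why constraining output length is often the cheapest optimisation available.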

Building something in this space? Let's talk.

We spend a lot of time with these tools. If you're trying to figure out which model fits your workload, we're happy to share what we've learned.

Get in touch