Rik Kisnah - Blog

Attention Is All You Need — And All You Need to Know: An Infrastructure Engineer's Guide to the Paper That Built Your GPU Cluster

Disclaimer: This post reflects my personal views and does not represent the views of my employer.

Caveat: This was written with research assistance from AI LLMs. I curated the content and cross-checked the references.

Everyone talks about attention; few folks have read the paper. In 2017, eight Google researchers published fifteen pages that reshaped computing [1]: “Attention Is All You Need.” They proposed the Transformer—no step-by-step recurrence, no convolutional filters, just a mechanism that lets every part of a sequence look at every other part. It was a translation paper, trained on eight NVIDIA P100 GPUs in about twelve hours; it beat the best ensemble on English–German and set a new state of the art on English–French in three and a half days [1]. That paper has over 100,000 citations [5]. Every model you use today—GPT, Claude, Gemini, Llama—descends from it. The GPU boom, the clusters we validate, the RDMA networks and NVL72 racks, all trace to one bet: attention can replace everything else. Most engineers building on that infrastructure have never read the paper.

What Attention Replaced

Before the Transformer, sequence models were sequential [1]: one token at a time, in order. Some had internal “memory” so earlier words weren’t fully forgotten, but they still ran in lockstep. Long inputs meant the beginning faded; training signals from the start often became too weak (the “vanishing gradient” problem). Each step depended on the previous one, so you couldn’t spread work across GPUs. Training was slow; scaling was painful.

The Transformer threw that out [1]. Self-attention lets every token look at every other token at once. No chain of steps; the model computes how each word relates to every other in one shot, in parallel. The architecture was built for parallelism before the GPU clusters we run today existed.

How Attention Works

The core mechanism uses three roles per token: Query, Key, and Value [1][2]. Query: what am I looking for? Key: what do I offer that might match? Value: the content to use when we match. Attention matches queries to keys and pulls in the right values.

Attention(Q, K, V) = softmax(QK^T / √d_k) V

Scores from Query–Key dot products are normalized (softmax), then weight the Values. The division by √d_k keeps dot products from growing so large that the softmax saturates into a near-one-hot distribution with vanishing gradients [1]. Result: a context-aware representation of each token, built from the whole sequence at once [1][4]. Multi-head attention runs this several times in parallel—different “heads” for grammar, meaning, position—and combines the outputs [1].
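
The formula above fits in a few lines of NumPy. This is a minimal sketch, not the paper's implementation: the random Q, K, V matrices stand in for the learned projections a real Transformer would apply to token embeddings.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max before exponentiating for numerical stability.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Scores: how well each query matches each key, scaled by sqrt(d_k).
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    # Softmax turns scores into weights; the output is a weighted sum of Values.
    return softmax(scores) @ V

# Toy example: 3 tokens, d_k = 4. Random stand-ins for learned projections.
rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 4))
K = rng.normal(size=(3, 4))
V = rng.normal(size=(3, 4))
out = attention(Q, K, V)
print(out.shape)  # (3, 4): one context-aware vector per token
```

Every token's output row mixes information from all three input tokens at once—the "one shot, in parallel" property that made the architecture GPU-friendly. Multi-head attention simply runs several smaller copies of this and concatenates the results.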

The original design had two stacks: an encoder (reads and encodes the input) and a decoder (generates the output, attending to both the encoder output and its own previous tokens). Each layer also had a feed-forward sublayer—a small two-layer network applied per token—and residual connections plus layer normalization around every sublayer so deep stacks could train [1][3]. The paper used sinusoidal positional encoding to inject sequence order; many modern systems use learned or rotary encodings instead. The authors trained with Adam, label smoothing, and dropout. Six encoder layers, six decoder layers, and that recipe scaled.
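
The sinusoidal positional encoding mentioned above is simple enough to sketch. Assuming an even model dimension, each position gets a fixed vector of sines and cosines at geometrically spaced wavelengths, following the formula in the paper [1]:

```python
import numpy as np

def sinusoidal_pe(seq_len, d_model):
    # PE[pos, 2i] = sin(pos / 10000^(2i/d_model)); PE[pos, 2i+1] = cos(same).
    pos = np.arange(seq_len)[:, None]          # (seq_len, 1)
    i = np.arange(0, d_model, 2)[None, :]      # even dimension indices
    angle = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angle)                # even dims get sine
    pe[:, 1::2] = np.cos(angle)                # odd dims get cosine
    return pe

pe = sinusoidal_pe(seq_len=50, d_model=16)
print(pe.shape)  # (50, 16): one fixed vector added to each token embedding
```

Because the encoding is deterministic, it needs no training and extrapolates to unseen lengths—one reason the paper chose it over learned embeddings, even though the two performed nearly identically [1].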

The Infrastructure Cost of Attention

Attention scales quadratically with sequence length [1]: double the input, quadruple the compute. The “context window” is a hard limit set by what the GPU cluster can hold in memory. During generation the model caches Keys and Values (the KV cache) in VRAM [1]; high-bandwidth links (RDMA, RoCE) exist because attention needs every part of the model to talk to every other part. “Lost in the Middle,” where models neglect tokens in the center of long contexts, is a direct artifact of how attention distributes weight. These concerns drive batch sizing, memory allocation, and the networking fabric we validate on bare metal. Doubling the context window is a hardware conversation, not a software tweak.
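
A back-of-envelope calculation makes both costs concrete. The model shape below is hypothetical—dimensions loosely in the range of a modern 70B-class model with grouped-query attention, not figures from the paper or any specific product:

```python
def attn_score_elems(seq_len):
    # The QK^T score matrix has seq_len^2 entries per head per layer.
    return seq_len ** 2

def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bytes_per_elem=2):
    # 2x for Keys and Values; fp16/bf16 means 2 bytes per element.
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_elem

# Quadratic compute: doubling the context quadruples the score matrix.
print(attn_score_elems(8192) // attn_score_elems(4096))  # 4

# Hypothetical shape: 80 layers, 8 KV heads, head_dim 128, 128K context.
gb = kv_cache_bytes(80, 8, 128, seq_len=128_000) / 1e9
print(f"{gb:.1f} GB of KV cache per sequence")  # ~41.9 GB
```

Tens of gigabytes of VRAM per long-context sequence, before weights or activations—that is why context windows, batch sizes, and interconnect bandwidth are negotiated together at the hardware level.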

How OCI Runs It

OCI delivers the Transformer stack in two ways: a fully managed Generative AI service and bare-metal GPU capacity that can scale to some of the largest AI superclusters in the cloud [17][18][21].

OCI Generative AI offers pretrained LLMs (Cohere, Meta Llama) for chat, summarization, embeddings, and rerank—via playground, API, or CLI, in multiple regions so you can keep data and inference close to users [17]. For production you can fine-tune on your data and host on dedicated AI clusters (GPU capacity reserved for your tenancy) for predictable performance and no noisy neighbors [18]. On-demand is pay-as-you-go for experimentation; dedicated is for committed workloads. OCI sizes VRAM and interconnect so you don’t have to. Safety and guardrails are built in for Cohere models; you get enterprise-ready GenAI without operating the underlying attention machinery yourself.

Bare-metal GPU compute is where OCI’s infrastructure story gets serious. NVIDIA (H200, L40S, B200, GB200) and AMD Instinct (MI300X, MI355X) instances give 400 Gbps GPU-to-GPU links and RDMA over RoCE—the kind of bandwidth that keeps attention layers from stalling on memory [19][20]. Clusters scale to tens of thousands of GPUs; OCI has been chosen for government and national-lab deployments (including DOE supercomputers like Solstice and Equinox), which speaks to both raw scale and the ability to meet sovereign and compliance requirements [21]. Whether you’re serving a managed Llama endpoint or validating your own model on bare metal, the underlying cost is the same: attention over sequences on GPUs with fast memory and interconnect. OCI gives you the option to own that stack when it matters.

Attention Maps to Your Repo

Repositories need structured context for AI agents—AGENT.md, TASKS.md, MEMORY.md [14][15]. Attention in a Transformer solves the same problem at a different scale. Queries map to the task (what do I need to do? → TASKS.md). Keys map to stable anchors: design decisions, ADRs, conventions—signals for which context matters. Values map to the payload: runbooks, examples, RCAs. Positional encoding (order) maps to file structure and naming; without clear hierarchy the model loses sense of what comes first or what matters. Scatter context across wikis, Jira, and Slack and the model attends to noise; attention dilutes, output degrades—the same failure mode sequential models had. The fix is explicit, parallel context selection. A prompt is a one-off; structured repo context is a system. You build a better attention substrate, not a better prompt.

Where the Eight Authors Went

The original eight have scattered: Noam Shazeer (Character.ai), Aidan Gomez (Cohere), Ashish Vaswani (Essential AI), Llion Jones (Sakana AI), Illia Polosukhin (NEAR), Jakob Uszkoreit (Inceptive), Lukasz Kaiser (OpenAI). At GTC 2025 (I was there with the OCI Ai2 Compute team), seven gathered with Jensen Huang; Gomez said the world needs something better than Transformers.

Is Attention Still All You Need?

Probably not forever, in my opinion. Mamba and state-space models scale linearly with sequence length; hybrids mix attention with other mechanisms. Top public models are still Transformer-based, but alternatives are closing in [9][10]. The 2017 design has been refined—new normalization, position encodings, Key/Value cache compression—so the paper is the starting point, not the final word [1][9]. For infrastructure, different architectures mean different compute profiles and different validation work; today’s clusters are tuned for dense attention. If the dominant architecture shifts, the hardware story shifts with it.

The Repo Problem Remains

None of this changes the core issue. Whether the next architecture is pure attention, a state-space model, or some hybrid nobody has published yet, the model still processes tokens. It still needs context. It still attends to whatever you put in front of it. The researchers will sort out the architecture. Your problem is the input. Back to the Boot Camp for you—fix the repo.

Attention, Society, and What Comes Next

The 2017 paper didn’t mention society, jobs, or policy. It described an architecture. But the infrastructure that runs it—the GPU clusters, the cloud regions, the national supercomputers—is now a geopolitical and economic fact. Who builds it, who can access it, and who gets to tune the models on it will shape the next decade.

On the technical side, we’ll see longer context windows, more efficient attention variants, and a mix of Transformer and non-Transformer designs. On the human side, the same attention mechanism that makes models useful (they attend to what we give them) also makes them sensitive to bias, prompt injection, and the quality of the data we feed them. Trust, transparency, and the choice between open and closed models are no longer academic. For infrastructure engineers, the job is to keep the systems that run this stuff reliable, observable, and aligned with the workloads that actually need them—whether that’s a startup’s prototype or a government’s sovereign AI. The paper gave us the mechanism. What we do with it is still being written.


References

  1. Vaswani, A. et al. “Attention Is All You Need.” (2017). arxiv.org/abs/1706.03762

  2. Alammar, J. “The Illustrated Transformer.” jalammar.github.io/illustrated-transformer

  3. Harvard NLP. “The Annotated Transformer.” nlp.seas.harvard.edu/2018/04/03/attention.html

  4. IBM. “What is self-attention?” ibm.com/think/topics/self-attention

  5. Wikipedia. “Attention Is All You Need.” en.wikipedia.org/wiki/Attention_Is_All_You_Need

  6. Bakshi, V. “Attention Is All You Need.” vishalbakshi.github.io

  7. Aweers.de. “Attention Is All You Need.” aweers.de/blog/2025/attention-is-all-you-need

  8. OpenGenus IQ. “Attention Is All You Need Summary.” iq.opengenus.org

  9. Nature Scientific Reports. “Transformer architecture analysis.” nature.com

  10. Towards AI. “Attention Is All You Need: A Deep Dive into the Transformer Architecture.” towardsai.net

  11. Yerkade, K. “Decoding Attention Is All You Need.” dev.to

  12. Shadecoder. “Transformer Architecture: A Comprehensive Guide for 2025.” shadecoder.com

  13. Tan, J.Y. “Transformer Architectures.” jytan.net/blog/2025/transformer-architectures

  14. Kisnah, R. “Content Engineering for AI Agents: Why Your Repository Isn’t Ready.” rikkisnah.github.io

  15. Kisnah, R. “Markdown Is the New Source Code.” rikkisnah.github.io

  16. Kisnah, R. “How Large Language Models Actually Work.” rikkisnah.github.io

  17. Oracle. “Overview of Generative AI Service.” docs.oracle.com

  18. Oracle. “On-Demand and Dedicated Modes for OCI Generative AI Models” / “Managing Dedicated AI Clusters.” docs.oracle.com, docs.oracle.com

  19. Oracle. “Deploying Large Language Models in OCI” (AMD Instinct). docs.oracle.com

  20. Oracle. “OCI Compute with NVIDIA L40S GPUs” / “Largest AI Supercomputer in the Cloud” (H200). blogs.oracle.com, blogs.oracle.com

  21. Kisnah, R. “OCI Powers America’s AI Future at NVIDIA GTC 2025: Supercomputers, AI Factories, and Strategic Leadership.” rikkisnah.github.io