Google Cloud TPU Architecture Evolution: From v1 to 8th Gen and Choosing the Right Fit

If you've recently reviewed Google Cloud TPU pricing or documentation, you've likely noticed a wide array of versions available, including TPU v5e, v5p, v6e, Ironwood, and the latest TPU 8t and 8i. Each generation presents distinct specifications, use cases, and performance tradeoffs. This article provides a detailed walkthrough of every major TPU generation, explaining the architectural changes at each step and their implications for users running various workloads.

Before diving into the generational specifics, it's beneficial to understand the core components that constitute a TPU chip, as these elements are fundamental across all versions.

  • Matrix Multiply Unit (MXU): This is the primary compute engine within every TPU TensorCore, designed to perform the multiply-and-accumulate operations at the heart of neural network computation. Through TPU v5p, MXUs were typically 128x128 systolic arrays, delivering 16,384 multiply-accumulate operations per clock cycle. With Trillium (v6e), the MXU grew to 256x256, quadrupling the operations per cycle (sanity-checked in the first sketch after this list).
  • TensorCore: A TensorCore integrates one or more MXUs, a vector processing unit (VPU), and a scalar unit. Depending on the specific generation, a single TPU chip might house either one or two TensorCores.
  • High Bandwidth Memory (HBM): Memory stacked in the same package as the chip, used to hold model weights and activations. For large-scale models, HBM capacity and bandwidth often emerge as the primary bottleneck rather than raw compute. Each successive generation has increased both HBM capacity and bandwidth.
  • Inter-Chip Interconnect (ICI): The network that carries traffic between chips within a TPU pod. ICI bandwidth directly determines how quickly chips can synchronize gradients during training: the more bandwidth, the less time stalled on communication and the more time spent computing (see the JAX all-reduce sketch after this list).
  • SparseCore: Introduced with TPU v4, SparseCores are specialized processors optimized for embedding operations. These operations are fundamental to recommendation systems and models with large vocabularies. TPU v5p and Ironwood chips feature four SparseCores each, while v6e includes two per chip.
  • Topology: The physical wiring arrangement of chips within a pod. Earlier TPU generations used a 2D torus, where each chip connects to four neighbors. Starting with v4, Google adopted a 3D torus for larger pods, which sharply reduces the maximum number of hops between any two chips and, with it, communication latency (the hop arithmetic is worked through after this list).
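
The MXU growth described above is easy to sanity-check with back-of-the-envelope arithmetic. The snippet below is just that calculation, not vendor code:

```python
# Multiply-accumulates issued per clock by a single MXU:
pre_trillium_mxu = 128 * 128   # 16,384 MACs/cycle (through v5p)
trillium_mxu = 256 * 256       # 65,536 MACs/cycle (Trillium / v6e)

print(trillium_mxu // pre_trillium_mxu)  # 4 -> 4x the work per MXU per cycle
print(2 * trillium_mxu)                  # each MAC is 2 FLOPs -> 131,072 FLOPs/cycle per MXU
```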
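To make the ICI bullet concrete, here is a minimal JAX data-parallel sketch. The `jax.lax.pmean` collective is the gradient all-reduce that travels over ICI when each chip holds a shard of the batch; the linear model, learning rate, and shapes are hypothetical, and on a single-device machine the code still runs with one "device":

```python
from functools import partial
import jax
import jax.numpy as jnp

def loss_fn(w, x, y):
    # Hypothetical linear model with a squared-error loss.
    return jnp.mean((x @ w - y) ** 2)

@partial(jax.pmap, axis_name="chips")
def train_step(w, x, y):
    grads = jax.grad(loss_fn)(w, x, y)
    # The all-reduce: every chip averages gradients with its peers.
    # On a TPU pod, this collective is what the ICI links carry.
    grads = jax.lax.pmean(grads, axis_name="chips")
    return w - 1e-3 * grads

n = jax.local_device_count()
x = jnp.ones((n, 8, 4))    # leading axis = one data shard per device
y = jnp.zeros((n, 8))
w = jnp.zeros((n, 4))      # weights replicated across devices
w = train_step(w, x, y)
```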
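The torus comparison can also be quantified. In a torus with wraparound links, the worst-case hop count is the sum of half of each dimension's size; the pod shapes below are illustrative, not Google's published wiring:

```python
def max_hops(dims):
    # Worst-case chip-to-chip distance in a torus: wraparound links
    # mean each dimension contributes at most floor(size / 2) hops.
    return sum(d // 2 for d in dims)

# The same 64 chips, wired two ways:
print(max_hops((8, 8)))        # 2D torus: 8 hops worst case
print(max_hops((4, 4, 4)))     # 3D torus: 6 hops worst case

# The gap widens at pod scale (4,096 chips):
print(max_hops((64, 64)))      # 2D: 64 hops
print(max_hops((16, 16, 16)))  # 3D: 24 hops
```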

TPU v1 (2015, Internal Use Only)

The inaugural TPU was engineered with a singular focus: inference. It was never made publicly available and could not train models. The chip featured a 256x256 systolic array of 8-bit multiply-accumulators, delivering 92 TOPS (tera operations per second) of INT8 compute. It drew roughly 40 watts, remarkable power efficiency for its era.
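
The 92 TOPS figure follows directly from the array size and the chip's 700 MHz clock (reported in Google's published TPU paper); a quick check:

```python
macs_per_cycle = 256 * 256      # 65,536 8-bit multiply-accumulators
ops_per_mac = 2                 # one multiply plus one add
clock_hz = 700e6                # TPU v1 clock from the ISCA 2017 paper
peak_tops = macs_per_cycle * ops_per_mac * clock_hz / 1e12
print(f"{peak_tops:.1f} TOPS")  # ~91.8, matching the quoted ~92 TOPS
```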

Google kept the chip secret for over a year after deployment. At Google I/O 2016, Sundar Pichai revealed that TPUs had already been running in Google's data centers for more than a year, powering critical services such as Search, Maps, and Street View.
