The History of the GPU and Insights on the Future of AI Compute
Historical and forward-looking perspectives from Raja Koduri on the future of hardware/software for developers
Raja Koduri is one of the leading minds on the history of the GPU and the future of AI compute. He directly shaped that history, having spent 13 years at AMD, 4 at Apple, and 6 at Intel leading various graphics and accelerated computing initiatives. This week, he presented at Stanford on the history of GPU architecture from the 1990s to now. Much of it echoes his "No Transistor Left Behind" keynote from Hot Chips 2020, which is absolutely worth a listen.
Summary:
This presentation delves into the GPU industry's journey, addressing not just technical metrics like picojoules per flop but also industry dynamics
Koduri provides insights into GPU architectures from a workload perspective and discusses emerging directions like accelerating Python and in-memory computing
In his 2020 Hot Chips keynote, Koduri predicted that by 2025 data centers would need zettaflops of AI processing power, a thousandfold increase in computing capability. This prediction is being realized, as evidenced by major tech companies' massive orders for Nvidia's GPUs.
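As a rough sanity check on that thousandfold figure, a minimal sketch assuming the 2020 baseline was on the order of an exaflop of AI compute (an assumption for illustration, not a number from the talk):

```python
# Back-of-the-envelope check of the "thousandfold" claim (illustrative only).
exaflop = 1e18    # FLOP/s, assumed rough scale of a 2020-era large AI cluster
zettaflop = 1e21  # FLOP/s, the 2025 data-center need Koduri predicted

print(f"Required scale-up: {zettaflop / exaflop:,.0f}x")  # -> 1,000x
```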
How we imagine the hardware-software stack (the simple view)
Software
Instruction set architecture (e.g., CUDA)
Hardware
The reality of the AI compute stack is much more complicated: a Swiss cheese full of holes in the middle that create huge performance issues (a rough sketch of probing some of these layers follows the list below)
Firmware IP & BIOS
Drivers
OS
Virtualization/Orchestration
Low-level libraries
Middleware frameworks and runtimes
Applications
Services and solutions
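A minimal sketch of what probing a few of these layers can look like from Python, assuming a Linux machine with an Nvidia GPU, driver, and PyTorch installed; the checks are illustrative, not an exhaustive audit of the stack:

```python
# Minimal sketch: probe a few layers of the AI compute stack from Python.
# Assumes an Nvidia GPU, driver, and PyTorch are installed; any of these
# checks can fail on a machine with "holes" in the stack.
import platform
import subprocess

import torch

print("OS:", platform.platform())                 # OS layer
print("PyTorch:", torch.__version__)              # framework / runtime layer
print("CUDA runtime:", torch.version.cuda)        # low-level library layer
print("GPU visible:", torch.cuda.is_available())  # driver + firmware layers

# Driver version as reported by the vendor tool (driver layer).
try:
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    )
    print("Driver:", out.stdout.strip())
except (FileNotFoundError, subprocess.CalledProcessError):
    print("Driver: nvidia-smi not available")
```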
In the Hardware / Software Contract, pay attention to the whole system software stack — not just CUDA but drivers, memory, bandwidth, etc. (first image above)
The boring parts will kill your startup
The exciting parts will work
Building chips that work in different environments is more challenging than it appears: no vendor outside of Nvidia works out of the box on more than 5% of the models on Hugging Face
Nvidia works on over 80% of models out of the box
Venture capital investments in AI hardware have been substantial, with $20 billion invested since 2016. However, the return on these investments has been challenging, with total revenues of these companies being less than $100 million.
Meanwhile, Nvidia will do more than $30B of quarterly revenue in the coming quarters.
It is deceptively hard to build hardware that operates with the full software stack. Despite $20 billion in funding across startups tackling the same AI challenges, Nvidia has outperformed them all, reaching roughly $30 billion in quarterly revenue.
The boring parts = Nvidia’s moat; not just CUDA
Why do CUDA GPUs dominate AI?
80%+ of models on Hugging Face work out of the box on CUDA / Nvidia
No other chip architecture (CPUs, AMD GPUs, startup silicon, TPUs, ASICs) has even reached 5% coverage on its own
All the models are Python or PyTorch; you don't find much hardware-specific code in them. It is more complicated than it appears under the hood
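A sketch of what "out of the box" means in practice: the model code below never mentions the hardware, so whether it runs falls entirely on the stack underneath. The gpt2 checkpoint is just a small example model:

```python
# Sketch: a Hugging Face model is plain Python/PyTorch with no hardware code.
# Whether it "works out of the box" depends entirely on the stack underneath.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"  # the only hardware-aware line

tokenizer = AutoTokenizer.from_pretrained("gpt2")         # small example model
model = AutoModelForCausalLM.from_pretrained("gpt2").to(device)

inputs = tokenizer("The GPU became the engine of AI because", return_tensors="pt").to(device)
output = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```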
Model Developer / AI / Data Scientist perspective on GPUs — what impacts their hardware experience as developers?
Programming flexibility
Very large models
Memory bandwidth
Interconnect bandwidth
Compute
Memory capacity
Node scaling
CPUs —> offer the best programming flexibility and memory capacity
GPUs —> high flexibility, high memory bandwidth, high interconnect bandwidth, high compute
Nvidia has the best coverage of all vectors
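A back-of-the-envelope sketch of why memory capacity and node scaling dominate the developer experience for very large models; the parameter count, precision, and per-device memory below are assumed for illustration:

```python
# Back-of-the-envelope: does a large model even fit on one device?
# All numbers are illustrative assumptions, not vendor specs.
params = 70e9          # assumed model size: 70B parameters
bytes_per_param = 2    # fp16/bf16 weights
gpu_memory_gb = 80     # assumed per-accelerator HBM capacity

weights_gb = params * bytes_per_param / 1e9
print(f"Weights alone: {weights_gb:.0f} GB")                    # ~140 GB
print(f"Minimum devices just to hold the weights: "
      f"{-(-weights_gb // gpu_memory_gb):.0f}")                 # ceil -> 2
# Optimizer state, activations, and KV caches push this much higher,
# which is why interconnect bandwidth and node scaling matter too.
```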
The History and Evolution of GPUs:
Initially, GPUs were primarily used for gaming.
Many industry analysts proclaimed the death of Nvidia and GPUs as they lacked enterprise applications
GPUs survived numerous other predictions of obsolescence, from being deemed irrelevant for enterprise applications to facing threats from mobile technology and Google's TPU
The integration of graphics into chipsets by Intel marked a significant industry shift
IEEE 32-bit floating point support in 2005 made it possible to map CPU computation onto the GPU
The journey of modern GPUs began around 2002, standing on the shoulders of giants from Silicon Graphics, Pixar's RenderMan, AMD, and others
Pixar's RenderMan was the inspiration for the GPU software model
GPUs invented a mass-market parallel computing paradigm
Data is exceeding our ability to understand and process it
The evolution of GPUs has been intertwined with the development of the entire software stack, from boring but necessary components to significant innovations like Tensor Cores.
The hardware-software contract is crucial, with Nvidia achieving dominance by maintaining a stable and pervasive contract over time.
GPUs work on vectors, a key driver of performance
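A toy contrast between the scalar style and the vector style of the same computation, using NumPy only to show the shape of vectorized code that GPUs (and SIMD units) are built to run:

```python
# Toy contrast between scalar and vector styles of the same computation.
import numpy as np

a = np.random.rand(100_000)
b = np.random.rand(100_000)

# Scalar style: one element at a time, the way a simple CPU loop works.
out_scalar = [a[i] * b[i] + 1.0 for i in range(len(a))]

# Vector style: one operation over the whole array, the shape GPUs
# are built to execute in parallel.
out_vector = a * b + 1.0

assert np.allclose(out_scalar, out_vector)
```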
Processors <> Transportation Analogy:
Hardware architecture debates can feel like religious debates. A useful analogy is transportation, where the trips to be made are the workloads.
CPUs vs. GPUs vs. TPUs <> cars vs. buses/trucks vs. trains
Cars and buses work on the same infrastructure (roads), with some inconvenience from bus stops
If the bus stop is right next to your home, you’re lucky
18-wheelers also run on the same highway infrastructure
Trains are on different infrastructure (stations and rails)
But if you can get to the station, it runs great
We use a mix of all transportation modes; none is absolutely better, and each has pros and cons
In Europe or Japan, more trains
CPUs (cars) are very flexible but not very performant. They work on scalars.
Driving alone, I could still pack three more passengers into a car. Like a CPU, it can take on a few workloads and stay flexible.
If I had a bigger group, I would take a different method of transportation (bus/truck) for my workload
Without a large enough group size, GPUs are very inefficient, and they require getting to a certain station
Adding tensor cores (matrix engines) is more like a train: a different architecture altogether from classic GPUs
Getting more train-like efficiency for specific workloads
How do you get developers to your infrastructure or your "stations"? Nvidia brought more bus stations to the developers, building an ecosystem that now rivals the x86 CPU developer ecosystem
In the future, I will just say I want to go to Stanford; I don't really care how I get there, only about efficiency
Software will define the routing of the workload
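A toy sketch of software-defined routing in the spirit of the analogy; the thresholds and backend labels are invented for illustration, not real dispatch heuristics:

```python
# Toy dispatcher in the spirit of the transportation analogy:
# the caller states the destination (the workload), software picks the route.
def route_workload(batch_size: int, is_matmul_heavy: bool) -> str:
    if batch_size <= 4 and not is_matmul_heavy:
        return "cpu"           # car: flexible, fine for a few passengers
    if is_matmul_heavy and batch_size >= 64:
        return "tensor_cores"  # train: great if you can fill it and reach the station
    return "gpu"               # bus/truck: big groups on the common roads

print(route_workload(batch_size=1,   is_matmul_heavy=False))  # cpu
print(route_workload(batch_size=256, is_matmul_heavy=True))   # tensor_cores
print(route_workload(batch_size=32,  is_matmul_heavy=True))   # gpu
```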
The Hardware <> Software Developer Ecosystem:
Sitting next to AI programmers and data scientists, you see that how they work is very different, even though the terminology overlaps
Performance in semiconductors means speed, but in AI it means quality: the model converging on the results they want to achieve
AI programmers care about speed too but performance quality is far more important
Hardware engineers care about speed far more
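A minimal sketch showing the two meanings of "performance" side by side: throughput (the hardware engineer's metric) and loss convergence (the AI developer's metric). The tiny model and random data are placeholders:

```python
# Minimal sketch: log both meanings of "performance" during training.
# Tiny model and random data are placeholders just to make the loop run.
import time
import torch

model = torch.nn.Linear(128, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.01)
x, y = torch.randn(1024, 128), torch.randn(1024, 1)

for step in range(5):
    start = time.time()
    loss = torch.nn.functional.mse_loss(model(x), y)
    opt.zero_grad()
    loss.backward()
    opt.step()
    samples_per_sec = len(x) / (time.time() - start)
    # Hardware "performance" = throughput; AI "performance" = the loss converging.
    print(f"step {step}: {samples_per_sec:,.0f} samples/s, loss {loss.item():.4f}")
```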
The Future of AI Compute —> 100M+ AI Developers
Innovations with memory that work with Python ecosystem will likely be the next disruption
Python has become the de facto programming language of choice for AI
If LLMs and transformers are here to stay, they are way more memory and bandwidth bound than any other workload I have ever seen
Compute is the easy part and continues to progress; memory is the hard part
Huge amount of startup opportunity to solve this problem [In-memory and near-memory computing IP]
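A back-of-the-envelope sketch of why batch-1 transformer decoding is bandwidth bound: every generated token touches every weight once, so the bytes moved rival the FLOPs performed. All numbers are illustrative assumptions:

```python
# Why batch-1 LLM decoding is memory-bandwidth bound (illustrative numbers).
params = 70e9                 # assumed 70B-parameter model
bytes_per_param = 2           # fp16/bf16 weights
flops_per_token = 2 * params  # roughly 2 FLOPs per parameter per generated token

bytes_per_token = params * bytes_per_param      # every weight read once per token
arithmetic_intensity = flops_per_token / bytes_per_token
print(f"Arithmetic intensity: {arithmetic_intensity:.1f} FLOPs/byte")  # ~1

# Assumed accelerator: ~1e15 FLOP/s of matrix compute, ~3e12 bytes/s of HBM bandwidth.
compute, bandwidth = 1e15, 3e12
balance_point = compute / bandwidth
print(f"Hardware balance point: ~{balance_point:.0f} FLOPs/byte")      # ~333
# 1 FLOP/byte is far below ~333: the compute units stall waiting on memory,
# so bandwidth, not FLOPs, sets the token rate.
```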
Cost of GPU and AI Compute comes down ~16X by 2027
3 key performance vectors:
Cost per GPU —> $40K to ~$10K (4X improvement)
pJ per flop —> improves 2X
pJ per bit —> improves 2X
16X improvement in compute cost
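The 16X figure is simply the three vectors compounding; a quick check with the stated numbers:

```python
# Sanity check: the three improvement vectors compound multiplicatively.
gpu_cost_improvement = 40_000 / 10_000   # $40K -> ~$10K per GPU  => 4x
pj_per_flop_improvement = 2              # energy per FLOP        => 2x
pj_per_bit_improvement = 2               # energy per bit moved   => 2x

total = gpu_cost_improvement * pj_per_flop_improvement * pj_per_bit_improvement
print(f"Compound improvement in compute cost: ~{total:.0f}x")  # ~16x
```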