GPU Performance (Data Sheets) Quick Reference (2023)

Published at 2023-10-25 | Last Update 2024-05-19

This post provides a concise reference for the performance of popular GPU models from NVIDIA and Huawei/HiSilicon, primarily intended for personal use.

1 Introduction
- 1.1 Naming convention of NVIDIA GPUs
2 Comparison of L2/L4/T4/A10/V100
3 Comparison of A100/A800/H100/H800/910B/H200
- 3.1 Note on inter-GPU bandwidth: HCCS vs. NVLINK
4 Comparison of H20/L20/Ascend 910B
5 Notes on US “Chip Export Controls” targeting China
- 5.1 Export Controls 2022.10
- 5.2 Export Controls 2023.10

1 Introduction

1.1 Naming convention of NVIDIA GPUs

The first letter in GPU model names denote their GPU architectures, with:

T for Turing;
A for Ampere;
V for Volta;
H for Hopper; 2022
L for Ada Lovelace;

2 Comparison of `L2/L4/T4/A10/V100`

	L2	L4	T4	A10	A30	V100 PCIe/SMX2
Designed for	Data center	Data center	Data center	(Desktop) Graphics-intensive workloads	Desktop	Data center
Year	2023	2023	2018	2020		2017
Manufacturing			12nm	12nm
Architecture	Ada Lovelace	Ada Lovelace	Turing	Ampere	Ampere	Volta
Max Power		72W	70 watts	150 watts	165 watts	250/300watts
GPU Mem	24GB GDDR6	24GB	16GB GDDR6	24GB GDDR6	24GB HBM2	16/32GB HBM2
GPU Mem BW	300 GB/s	300 GB/s	400 GB/s	600 GB/s	`933GB/s`	`900 GB/s`
Interconnect	PCIe Gen4 64GB/s	PCIe Gen4 64GB/s	PCIe Gen3 32GB/s	PCIe Gen4 66 GB/s	PCIe Gen4 64GB/s, NVLINK 200GB/s	PCIe Gen3 32GB/s, NVLINK `300GB/s`
FP32 `TFLOPS`	24.1	30.3	8.1	31.2	10.3	14/15.7
TF32 `TFLOPS`	48.3	`120*`
BF16 `TFLOPS`	95.6	`242*`		125	165	NOT support
FP16 `TFLOPS`		`242*`		125	165
INT8 `TFLOPS`	193/193	`485*`		250	330
INT4 `TFLOPS`		NO			661

Notes:

*: with sparsity.

Datasheets:

3 Comparison of `A100/A800/H100/H800/910B/H200`

	A800 (PCIe/SXM)	A100 (PCIe/SXM)	Huawei Ascend 910B	H800 (PCIe/SXM)	H100 (PCIe/SXM)	H200 (PCIe/SXM)
Year	2022	2020	2023	2022	2022	2024
Manufacturing	7nm	7nm	7+nm	4nm	4nm	4nm
Architecture	Ampere	Ampere	HUAWEI Da Vinci	Hopper	Hopper	Hopper
Max Power	300/400 W	300/400 W	400 W		350/700 W	700W
GPU Mem	80G HBM2e	80G HBM2e	64G HBM2e	80G HBM3	80G HBM3	141GB HBM3e
GPU Mem BW		1935/2039 GB/s			2/3.35 TB/s	4.8 TB/s
GPU Interconnect (one-to-one max bw)	NVLINK 400GB/s	PCIe Gen4 64GB/s, NVLINK 600GB/s	HCCS `56GB/s`	NVLINK 400GB/s	PCIe Gen5 128GB/s, NVLINK `900GB/s`	PCIe Gen5 128GB/s, NVLINK 900 GB/s
GPU Interconnect (one-to-many total bw)	NVLINK 400GB/s	PCIe Gen4 64GB/s, NVLINK 600GB/s	HCCS `392GB/s`	NVLINK 400GB/s	PCIe Gen5 128GB/s, NVLINK `900GB/s`	PCIe Gen5 128GB/s, NVLINK 900 GB/s
FP32 `TFLOPS`		`19.5`			`51 \| 67*`	`67*`
TF32 `TFLOPS`		`156 \| 312*`			`756 \| 989*`	`989*`
BF16 `TFLOPS`		`156 \| 312*`			`1513 \| 1979*`	`1979*`
FP16 `TFLOPS`		`312 \| 624*`	`320`		`1513 \| 1979*`	`1979*`
FP8 `TFLOPS`	NOT support	NOT support			`3026 \| 3958*`	`3958*`
INT8 `TFLOPS`		`624 \| 1248*`	`640`		`3026 \| 3958*`	`3958*`

Notes:

*: with sparsity.

H100 vs. A100 in one word: 3x performance, 2x price.

Datasheets:

A100
H100
~~Huawei Ascend-910B~~ (404)
910 paper: Ascend: a Scalable and Unified Architecture for Ubiquitous Deep Neural Network Computing, HPCA, 2021

3.1 Note on inter-GPU bandwidth: `HCCS vs. NVLINK`

For 8-card A800 and 910B modules: 910B HCCS has a total bandwidth of 392GB/s, which appears to be comparable to A800 NVLink (400GB/s). However, there are some differences. To clarify them,

NVIDIA NVLink: full-mesh topology as below, so (bi-directional) GPU-to-GPU max bandwidth is 400GB/s (note that below is 8*A100 module, 600GB/s, 8*A800 shares a similar full-mesh topology);
Huawei HCCS: peer-to-peer topology (no stuffs like NVSwitch chip), so (bi-directional) GPU-to-GPU max bandwidth is 56GB/s;

4 Comparison of `H20`/`L20`/`Ascend 910B`

	Huawei Ascend 910B	L20 (PCIe)	H20	H100 (PCIe/SXM)
Year	2023	2023	2023	2022
Manufacturing	7+nm	4nm	4nm	4nm
Architecture	HUAWEI Da Vinci	Ada Lovelace	Hopper	Hopper
Max Power	400 watt	350W	500W	350/700 watt
GPU Mem	64G HBM2e	48G GDDR6	96G HBM3	80G HBM3
GPU Mem BW		864GB/s	4.0TB/s	2/3.35 TB/s
L2 Cache		`96MB`	60MB	50MB
GPU Interconnect (one-to-one max bandwidth)	HCCS 56GB/s	PCIe Gen4 64GB/s	PCIe Gen5 128GB/s, `NVLINK 900GB/s`	PCIe Gen5 128GB/s, NVLINK 900GB/s
GPU Interconnect (one-to-many total bw)	HCCS 392GB/s	PCIe Gen4 64GB/s	PCIe Gen5 128GB/s, `NVLINK 900GB/s`	PCIe Gen5 128GB/s, NVLINK 900GB/s
FP32 `TFLOPS`		59.8	44	51/67
TF32 `TFLOPS`		59.8	74	756/989
BF16 `TFLOPS`		`119 \| 119`	`148 \| 148`	`1513 \| 1979*`
FP16 `TFLOPS`	320			`1513 \| 1979*`
FP8 `TFLOPS`			`296 \| 296`	`3026 \| 3958*`
INT8 `TFLOPS`	640	`239 \| 239`	`296 \| 296`	`3026 \| 3958*`

Notes:

*: with sparsity;
L20 max power 350W: collected with dcgm-exporter.

5 Notes on US “Chip Export Controls” targeting China

5.1 Export Controls `2022.10`

According to Implementation of Additional Export Controls: Certain Advanced Computing and Semiconductor Manufacturing Items; Supercomputer and Semiconductor End Use; Entity List Modification, for chips that can be shipped to the Chinese market, the following conditions must be met:

aggregate bidirectional transfer rate must < 600 Gbyte/s; AND,
aggregated processing performance must < 4800 bit TOPS (TFLOPS), which is equivalent to:
- < 300 TFLOPS FP16
- < 150 TFLOPS FP32

A100 and H100 are subjected to these restrictions, that’s why there are tailored versions: A800 and H800.

5.2 Export Controls `2023.10`

According to Implementation of Additional Export Controls: Certain Advanced Computing Items; Supercomputer and Semiconductor End Use; Updates and Corrections, in addition to the above 2022.10 Export Controls, chips that meet one of the following conditions are also prohibited from being sold in the Chinese market:

total processing performance in 2400~4800 bit TOPS AND performance density in 1.6~5.92;

2400 bit TOPS is equivalent to:
- 150 TFLOPS FP16
- 75 TFLOPS FP32
total processing performance >= 1600 bit TOPS AND performance density in 3.2~5.92;

These restrictions cover most high-performance GPUs, including the old model A800. However, it should be noted that there is also room for low-computing-power but high-transfer-rate models, such as the rumored “148TFLOPS + 96GB HBM + 900GB/s NVLink” H20 GPU.

« GPU 进阶笔记（二）：华为昇腾 910B GPU 相关（2023） [译] 100 行 C 代码创建一个 KVM 虚拟机（2019） »

ArthurChiao's Blog

GPU Performance (Data Sheets) Quick Reference (2023)

1 Introduction

1.1 Naming convention of NVIDIA GPUs

2 Comparison of L2/L4/T4/A10/V100

3 Comparison of A100/A800/H100/H800/910B/H200

3.1 Note on inter-GPU bandwidth: HCCS vs. NVLINK

4 Comparison of H20/L20/Ascend 910B

5 Notes on US “Chip Export Controls” targeting China

5.1 Export Controls 2022.10

5.2 Export Controls 2023.10

2 Comparison of `L2/L4/T4/A10/V100`

3 Comparison of `A100/A800/H100/H800/910B/H200`

3.1 Note on inter-GPU bandwidth: `HCCS vs. NVLINK`

4 Comparison of `H20`/`L20`/`Ascend 910B`

5.1 Export Controls `2022.10`

5.2 Export Controls `2023.10`