GPU Performance (Data Sheets) Quick Reference (2023)
This post provides a concise reference for the performance of popular GPU models from NVIDIA and Huawei/HiSilicon, primarily intended for personal use.
- 1 Introduction
- 2 Comparison of L2/T4/A10/A10G/V100
- 3 Comparison of A100/A800/H100/H800/Ascend 910B
- 4 Comparison of H20/L20/Ascend 910B
1 Introduction
Naming convention of NVIDIA GPUs
The first letter in a GPU model name denotes its GPU architecture:

- `T` for Turing;
- `A` for Ampere;
- `V` for Volta;
- `H` for Hopper (2022);
- `L` for Ada Lovelace.
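For scripting convenience, this prefix-to-architecture mapping is trivial to encode; below is a minimal sketch in Python (the function name `arch_of` is just for illustration, not from any NVIDIA tooling):

```python
# Map the leading letter of an NVIDIA GPU model name to its architecture.
# Covers only the prefixes listed above, e.g. "T4" -> "Turing".
ARCH_BY_PREFIX = {
    "T": "Turing",
    "A": "Ampere",
    "V": "Volta",
    "H": "Hopper",
    "L": "Ada Lovelace",
}

def arch_of(model: str) -> str:
    """Return the architecture for a model name like 'A100' or 'L20'."""
    return ARCH_BY_PREFIX.get(model[0].upper(), "unknown")

assert arch_of("A800") == "Ampere"
assert arch_of("L2") == "Ada Lovelace"
```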
2 Comparison of L2/T4/A10/A10G/V100
|  | L2 | T4 | A10 | A10G | A30 | V100 PCIe/SXM2 |
|---|---|---|---|---|---|---|
| Designed for | Data center | Data center | (Desktop) Graphics-intensive workloads | Desktop | Desktop | Data center |
| Year | 2023 | 2018 | 2020 | | | 2017 |
| Manufacturing | | 12nm | 8nm | | | 12nm |
| Architecture | Ada Lovelace | Turing | Ampere | Ampere | Ampere | Volta |
| Max Power | | 70 watts | 150 watts | | 165 watts | 250/300 watts |
| GPU Mem | 24GB GDDR6 | 16GB GDDR6 | 24GB GDDR6 | 48GB GDDR6 | 24GB HBM2 | 16/32GB HBM2 |
| GPU Mem BW | 300 GB/s | 400 GB/s | 600 GB/s | | 933 GB/s | 900 GB/s |
| Interconnect | PCIe Gen4 64GB/s | PCIe Gen3 32GB/s | PCIe Gen4 64GB/s | | PCIe Gen4 64GB/s, NVLINK 200GB/s | PCIe Gen3 32GB/s, NVLINK 300GB/s |
| FP32 | 24.1 TFLOPS | 8.1 TFLOPS | 31.2 TFLOPS | | 10.3 TFLOPS | 14/15.7 TFLOPS |
| TF32 | 48.3 TFLOPS | | | | | |
| BFLOAT16 TensorCore | 95.6 TFLOPS | | 125 TFLOPS | | 165 TFLOPS | |
| FP16 TensorCore | | | 125 TFLOPS | | 165 TFLOPS | |
| INT8 TensorCore | 193/193 TOPS | | 250 TOPS | | 330 TOPS | |
| INT4 TensorCore | | | | | 661 TOPS | |
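A quick way to use the memory rows above: the time for one full sweep of GPU memory is bounded below by capacity divided by bandwidth. A minimal sketch with the A10 numbers from the table (24GB at 600 GB/s); this is an idealized lower bound, not a measured figure:

```python
# Lower bound on the time to stream all of GPU memory once: size / bandwidth.
# Numbers from the A10 column above: 24GB GDDR6 at 600 GB/s.
mem_gb = 24
bw_gb_per_s = 600
t_ms = mem_gb / bw_gb_per_s * 1000
print(f"A10: one full memory sweep takes >= {t_ms:.0f} ms")  # >= 40 ms
```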
Datasheets:
3 Comparison of A100/A800/H100/H800/Ascend 910B
|  | A800 (PCIe/SXM) | A100 (PCIe/SXM) | Huawei Ascend 910B | H800 (PCIe/SXM) | H100 (PCIe/SXM) |
|---|---|---|---|---|---|
| Year | 2022 | 2020 | 2023 | 2022 | 2022 |
| Manufacturing | 7nm | 7nm | 7+nm | 4nm | 4nm |
| Architecture | Ampere | Ampere | HUAWEI Da Vinci | Hopper | Hopper |
| Max Power | 300/400 watts | 300/400 watts | 400 watts | | 350/700 watts |
| GPU Mem | 80G HBM2e | 80G HBM2e | 64G HBM2e | 80G HBM3 | 80G HBM3 |
| GPU Mem BW | | 1935/2039 GB/s | | | 2/3.35 TB/s |
| GPU Interconnect (one-to-one max bandwidth) | NVLINK 400GB/s | PCIe Gen4 64GB/s, NVLINK 600GB/s | HCCS 56GB/s | NVLINK 400GB/s | PCIe Gen5 128GB/s, NVLINK 900GB/s |
| GPU Interconnect (one-to-many total bw) | NVLINK 400GB/s | PCIe Gen4 64GB/s, NVLINK 600GB/s | HCCS 392GB/s | NVLINK 400GB/s | PCIe Gen5 128GB/s, NVLINK 900GB/s |
| FP32 | | 19.5 TFLOPS | | | 51/67 TFLOPS |
| TF32 (TensorFloat) | | 156/312 TFLOPS | | | 756/989 TFLOPS |
| BFLOAT16 TensorCore | | 156/312 TFLOPS | | | |
| FP16 TensorCore | | 312/624 TFLOPS | 320 TFLOPS | | 1513/1979 TFLOPS |
| FP8 TensorCore | | Not supported | Not supported | | 3026/3958 TFLOPS |
| INT8 TensorCore | | 624/1248 TOPS | 640 TOPS | | 3026/3958 TOPS |
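To put the interconnect rows in perspective, here is a hedged back-of-the-envelope sketch: the time to ship one H100's entire 80GB memory to a peer GPU over PCIe Gen5 vs. NVLink, at the one-to-one bandwidths listed above (idealized, ignoring protocol overhead):

```python
# Time to move 80GB (one H100's memory) to a peer GPU at the one-to-one
# bandwidths from the table. Idealized: assumes the link is fully saturated.
size_gb = 80
for link, bw_gb_s in [("PCIe Gen5", 128), ("NVLink", 900)]:
    print(f"{link}: {size_gb / bw_gb_s:.2f} s")
# PCIe Gen5: 0.62 s, NVLink: 0.09 s
```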
H100 vs. A100 in short: roughly 3x the performance at 2x the price.
Datasheets:
- A100
- H100
- Huawei Ascend 910B (404)

Paper: Ascend: a Scalable and Unified Architecture for Ubiquitous Deep Neural Network Computing, HPCA, 2021
Note on inter-GPU bandwidth: HCCS vs. NVLINK
For 8-card A800 and 910B modules: 910B HCCS has a total bandwidth of 392GB/s, which appears comparable to A800 NVLink (400GB/s). However, there are some differences. To clarify them:

- NVIDIA NVLink: full-mesh topology, so the (bi-directional) GPU-to-GPU max bandwidth is 400GB/s (an 8\*A100 module reaches 600GB/s; an 8\*A800 module shares a similar full-mesh topology); see the sketch after this list;
- Huawei HCCS: point-to-point topology (no NVSwitch-like chip), so the (bi-directional) GPU-to-GPU max bandwidth is 56GB/s.
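The arithmetic behind these two bullets, as a small sketch: with point-to-point links, one GPU's total HCCS bandwidth is split across its 7 peers, while the NVLink full mesh lets a single pair use the full per-GPU bandwidth:

```python
# Per-pair GPU-to-GPU bandwidth in an 8-card module (7 peers per GPU).
hccs_total_gb_s = 392  # 910B: total HCCS bandwidth per GPU
print(f"910B HCCS per-pair: {hccs_total_gb_s / 7:.0f} GB/s")  # 56 GB/s

nvlink_total_gb_s = 400  # A800: NVLink bandwidth per GPU (full mesh)
print(f"A800 NVLink per-pair (max): {nvlink_total_gb_s} GB/s")  # 400 GB/s
```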
4 Comparison of H20/L20/Ascend 910B
|  | Huawei Ascend 910B | L20 (PCIe) | H20 (PCIe/SXM) | H100 (PCIe/SXM) |
|---|---|---|---|---|
| Year | 2023 | 2023 | 2023 | 2022 |
| Manufacturing | 7+nm | 4nm | 4nm | 4nm |
| Architecture | HUAWEI Da Vinci | Ada Lovelace | Hopper | Hopper |
| Max Power | 400 watts | 275 watts | 400 watts | 350/700 watts |
| GPU Mem | 64G HBM2e | 48G GDDR6 | 96G HBM3 | 80G HBM3 |
| GPU Mem BW | | 864GB/s | 4.0TB/s | 2/3.35 TB/s |
| L2 Cache | | 96MB | 60MB | |
| GPU Interconnect (one-to-one max bandwidth) | HCCS 56GB/s | PCIe Gen4 64GB/s | PCIe Gen5 128GB/s, NVLINK 900GB/s | PCIe Gen5 128GB/s, NVLINK 900GB/s |
| GPU Interconnect (one-to-many total bw) | HCCS 392GB/s | PCIe Gen4 64GB/s | PCIe Gen5 128GB/s, NVLINK 900GB/s | PCIe Gen5 128GB/s, NVLINK 900GB/s |
| FP32 | | 59.8 TFLOPS | 44 TFLOPS | 51/67 TFLOPS |
| TF32 (TensorFloat) | | 59.8 TFLOPS | 74 TFLOPS | 756/989 TFLOPS |
| BFLOAT16 TensorCore | | 119/119 TFLOPS | 148/148 TFLOPS | |
| FP16 TensorCore | 320 TFLOPS | | | 1513/1979 TFLOPS |
| FP8 TensorCore | | | | 3026/3958 TFLOPS |
| INT8 TensorCore | 640 TOPS | 239/239 TOPS | 296/296 TOPS | 3026/3958 TOPS |
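One takeaway from this table: H20 keeps H100-class memory bandwidth (4.0 TB/s) while its compute is heavily cut, which favors bandwidth-bound work such as single-batch LLM decoding. Below is a hedged sketch of the standard lower-bound estimate (each generated token streams all weights from HBM once; the 13B FP16 model is an assumed example, not from the table):

```python
# Memory-bandwidth lower bound for single-batch LLM decoding:
# t_token >= weight_bytes / mem_bw. The model size is a hypothetical example.
weights_gb = 26  # e.g. a 13B-parameter model in FP16 (13e9 params * 2 bytes)
for gpu, bw_tb_s in [("H20", 4.0), ("H100 SXM", 3.35)]:
    t_ms = weights_gb / (bw_tb_s * 1000) * 1000  # GB / (GB/s) -> s -> ms
    print(f"{gpu}: >= {t_ms:.1f} ms/token")
# H20: >= 6.5 ms/token; H100 SXM: >= 7.8 ms/token
```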