ArthurChiao's Blog

GPU Performance (Data Sheets) Quick Reference (2023)

Published at 2023-10-25 | Last Update 2023-11-03

This post provides a concise reference for the performance of popular GPU models from NVIDIA and Huawei/HiSilicon, primarily intended for personal use.



1 Introduction

Naming convention of NVIDIA GPUs

The first letter of a GPU model name denotes its architecture:

  1. T for Turing;
  2. A for Ampere;
  3. V for Volta;
  4. H for Hopper (2022);
  5. L for Ada Lovelace.
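The convention above can be written as a small lookup. This is only an illustrative sketch: `ARCH_BY_PREFIX` and `architecture_of` are hypothetical names introduced here, and the mapping covers only the five letters listed in this post.

```python
# Hypothetical helper: maps the first letter of an NVIDIA GPU model name
# to its architecture. Only the five letters listed above are covered.
ARCH_BY_PREFIX = {
    "T": "Turing",
    "A": "Ampere",
    "V": "Volta",
    "H": "Hopper",
    "L": "Ada Lovelace",
}


def architecture_of(model: str) -> str:
    """Return the architecture implied by the model name's first letter."""
    prefix = model.strip()[:1].upper()
    return ARCH_BY_PREFIX.get(prefix, "unknown")


if __name__ == "__main__":
    for name in ["T4", "A10", "V100", "H100", "L20"]:
        print(name, "->", architecture_of(name))
```

Names that start with any other letter fall through to "unknown" rather than raising an error.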

2 Comparison of L2/T4/A10/A10G/A30/V100

| | L2 | T4 | A10 | A10G | A30 | V100 PCIe/SXM2 |
|---|---|---|---|---|---|---|
| Designed for | Data center | Data center | (Desktop) Graphics-intensive workloads | Desktop | Desktop | Data center |
| Year | 2023 | 2018 | 2020 | | | 2017 |
| Manufacturing | | 12nm | 12nm | 12nm | | |
| Architecture | Ada Lovelace | Turing | Ampere | Ampere | Ampere | Volta |
| Max Power | | 70 watts | 150 watts | | 165 watts | 250/300 watts |
| GPU Mem | 24GB GDDR6 | 16GB GDDR6 | 24GB GDDR6 | 48GB GDDR6 | 24GB HBM2 | 16/32GB HBM2 |
| GPU Mem BW | 300 GB/s | 400 GB/s | 600 GB/s | | 933 GB/s | 900 GB/s |
| Interconnect | PCIe Gen4 64 GB/s | PCIe Gen3 32 GB/s | PCIe Gen4 66 GB/s | | PCIe Gen4 64 GB/s, NVLINK 200 GB/s | PCIe Gen3 32 GB/s, NVLINK 300 GB/s |
| FP32 | 24.1 TFLOPS | 8.1 TFLOPS | 31.2 TFLOPS | | 10.3 TFLOPS | 14/15.7 TFLOPS |
| TF32 | 48.3 TFLOPS | | | | | |
| BFLOAT16 TensorCore | 95.6 TFLOPS | | 125 TFLOPS | | 165 TFLOPS | |
| FP16 TensorCore | | | 125 TFLOPS | | 165 TFLOPS | |
| INT8 TensorCore | 193/193 TOPS | | 250 TOPS | | 330 TOPS | |
| INT4 TensorCore | | | | | 661 TOPS | |

Datasheets:

  1. T4
  2. A10
  3. A30
  4. V100-PCIe/V100-SXM2/V100S-PCIe

3 Comparison of A100/A800/H100/H800/Ascend 910B

| | A800 (PCIe/SXM) | A100 (PCIe/SXM) | Huawei Ascend 910B | H800 (PCIe/SXM) | H100 (PCIe/SXM) |
|---|---|---|---|---|---|
| Year | 2022 | 2020 | 2023 | 2022 | 2022 |
| Manufacturing | 7nm | 7nm | 7+nm | 4nm | 4nm |
| Architecture | Ampere | Ampere | HUAWEI Da Vinci | Hopper | Hopper |
| Max Power | 300/400 watts | 300/400 watts | 400 watts | | 350/700 watts |
| GPU Mem | 80G HBM2e | 80G HBM2e | 64G HBM2e | 80G HBM3 | 80G HBM3 |
| GPU Mem BW | | 1935/2039 GB/s | | | 2/3.35 TB/s |
| GPU Interconnect (one-to-one max bandwidth) | NVLINK 400 GB/s | PCIe Gen4 64 GB/s, NVLINK 600 GB/s | HCCS 56 GB/s | NVLINK 400 GB/s | PCIe Gen5 128 GB/s, NVLINK 900 GB/s |
| GPU Interconnect (one-to-many total bandwidth) | NVLINK 400 GB/s | PCIe Gen4 64 GB/s, NVLINK 600 GB/s | HCCS 392 GB/s | NVLINK 400 GB/s | PCIe Gen5 128 GB/s, NVLINK 900 GB/s |
| FP32 | | 19.5 TFLOPS | | | 51/67 TFLOPS |
| TF32 (TensorFloat) | | 156/312 TFLOPS | | | 756/989 TFLOPS |
| BFLOAT16 TensorCore | | 156/312 TFLOPS | | | |
| FP16 TensorCore | | 312/624 TFLOPS | 320 TFLOPS | | 1513/1979 TFLOPS |
| FP8 TensorCore | Not supported | Not supported | | | 3026/3958 TFLOPS |
| INT8 TensorCore | | 624/1248 TOPS | 640 TOPS | | 3026/3958 TOPS |

H100 vs. A100 in short: roughly 3x the performance at 2x the price.
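The "3x" figure can be roughly cross-checked against the A100 and H100 columns of the table above. The sketch below is a back-of-the-envelope check, not a benchmark. It assumes that the larger value of each slash-separated pair is the right one to compare for both GPUs, and it takes the 2x price ratio from the text rather than from any price list. `speedups` and `PRICE_RATIO` are names made up for this example.

```python
# Peak Tensor Core figures copied from the A100 and H100 columns above.
# Assumption: the larger value of each "x/y" pair is used for both GPUs.
A100 = {"TF32": 312, "FP16": 624, "INT8": 1248}
H100 = {"TF32": 989, "FP16": 1979, "INT8": 3958}

PRICE_RATIO = 2.0  # "2x price", taken from the text, not from a price list


def speedups(new: dict, old: dict) -> dict:
    """Per-precision ratio of peak throughput, new over old."""
    return {k: new[k] / old[k] for k in old if k in new}


if __name__ == "__main__":
    for precision, ratio in speedups(H100, A100).items():
        per_price = ratio / PRICE_RATIO
        print(f"{precision}: {ratio:.2f}x peak, {per_price:.2f}x peak per unit price")
```

Under these assumptions each ratio comes out slightly above 3, which is consistent with the rule of thumb. These are peak datasheet numbers, so measured speedups on real workloads may differ.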

Datasheets:

  1. A100
  2. H100
  3. Huawei Ascend-910B (link returns 404)
  4. 910 paper: Ascend: a Scalable and Unified Architecture for Ubiquitous Deep Neural Network Computing, HPCA, 2021

For 8-card A800 and 910B modules: 910B HCCS has a total bandwidth of 392GB/s, which appears comparable to A800 NVLink (400GB/s). The two differ, however, in how much of that bandwidth is available between any single pair of cards:

  • NVIDIA NVLink: full-mesh topology, so the (bi-directional) GPU-to-GPU max bandwidth is 400GB/s (the 8*A100 module, at 600GB/s, and the 8*A800 module share a similar full-mesh topology);

  • Huawei HCCS: peer-to-peer topology (no NVSwitch-like chip), so the (bi-directional) GPU-to-GPU max bandwidth is only 56GB/s.
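The 392GB/s total is consistent with each 910B card having seven 56GB/s links, one to each peer in an 8-card module (7 × 56 = 392). That per-peer reading is an assumption inferred from the numbers in this post, not something taken from a datasheet. The sketch below works through the arithmetic; `direct_links_total` is a hypothetical helper.

```python
# Sketch of one-to-one vs. one-to-many bandwidth in an 8-card module.
# Assumption (not from a datasheet): the 910B's 392GB/s total is the sum
# of seven 56GB/s point-to-point links, one per peer card.
CARDS = 8


def direct_links_total(per_link_gb_s: float, cards: int = CARDS) -> float:
    """Aggregate bandwidth of one card talking to all peers over direct links."""
    return per_link_gb_s * (cards - 1)


hccs_one_to_one = 56                                     # GB/s, from the table
hccs_one_to_many = direct_links_total(hccs_one_to_one)   # 7 * 56

# For A800 NVLink the table lists the same figure for both cases: a single
# peer can use the card's full NVLink bandwidth.
nvlink_one_to_one = 400                                  # GB/s, from the table
nvlink_one_to_many = 400                                 # GB/s, from the table

if __name__ == "__main__":
    print("HCCS   one-to-one:", hccs_one_to_one, "one-to-many:", hccs_one_to_many)
    print("NVLink one-to-one:", nvlink_one_to_one, "one-to-many:", nvlink_one_to_many)
    ratio = nvlink_one_to_one / hccs_one_to_one
    print("one-to-one ratio (NVLink/HCCS):", round(ratio, 1))
```

The aggregate figures are close, but traffic between a single pair of cards is where the two designs diverge.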

4 Comparison of H20/L20/Ascend 910B

| | Huawei Ascend 910B | L20 (PCIe) | H20 (PCIe/SXM) | H100 (PCIe/SXM) |
|---|---|---|---|---|
| Year | 2023 | 2023 | 2023 | 2022 |
| Manufacturing | 7+nm | 4nm | 4nm | 4nm |
| Architecture | HUAWEI Da Vinci | Ada Lovelace | Hopper | Hopper |
| Max Power | 400 watts | 275 watts | 400 watts | 350/700 watts |
| GPU Mem | 64G HBM2e | 48G GDDR6 | 80G HBM3 | 80G HBM3 |
| GPU Mem BW | | 864 GB/s | 4.0 TB/s | 2/3.35 TB/s |
| L2 Cache | | 96MB | 60MB | |
| GPU Interconnect (one-to-one max bandwidth) | HCCS 56 GB/s | PCIe Gen4 64 GB/s | PCIe Gen5 128 GB/s, NVLINK 900 GB/s | PCIe Gen5 128 GB/s, NVLINK 900 GB/s |
| GPU Interconnect (one-to-many total bandwidth) | HCCS 392 GB/s | PCIe Gen4 64 GB/s | PCIe Gen5 128 GB/s, NVLINK 900 GB/s | PCIe Gen5 128 GB/s, NVLINK 900 GB/s |
| FP32 | | 59.8 TFLOPS | 44 TFLOPS | 51/67 TFLOPS |
| TF32 (TensorFloat) | | 59.8 TFLOPS | 74 TFLOPS | 756/989 TFLOPS |
| BFLOAT16 TensorCore | | 119/119 TFLOPS | 148/148 TFLOPS | |
| FP16 TensorCore | 320 TFLOPS | | | 1513/1979 TFLOPS |
| FP8 TensorCore | | | | 3026/3958 TFLOPS |
| INT8 TensorCore | 640 TOPS | 239/239 TOPS | 296/296 TOPS | 3026/3958 TOPS |