ArthurChiao's Blog

GPU Performance (Data Sheets) Quick Reference (2023)

Published at 2023-10-25 | Last Update 2024-03-24

This post provides a concise reference for the performance of popular GPU models from NVIDIA and Huawei/HiSilicon, primarily intended for personal use.



1 Introduction

Naming convention of NVIDIA GPUs

The first letter in a GPU model name denotes its GPU architecture:

  1. V for Volta (2017);
  2. T for Turing (2018);
  3. A for Ampere (2020);
  4. H for Hopper (2022);
  5. L for Ada Lovelace (2022).

2 Comparison of L2/L4/T4/A10/A30/V100

| | L2 | L4 | T4 | A10 | A30 | V100 PCIe/SXM2 |
|---|---|---|---|---|---|---|
| Designed for | Data center | Data center | Data center (Desktop) | Graphics-intensive workloads | Desktop | Data center |
| Year | 2023 | 2023 | 2018 | 2020 | | 2017 |
| Manufacturing | | | 12nm | 12nm | | |
| Architecture | Ada Lovelace | Ada Lovelace | Turing | Ampere | Ampere | Volta |
| Max Power | | 72W | 70W | 150W | 165W | 250/300W |
| GPU Mem | 24GB GDDR6 | 24GB | 16GB GDDR6 | 24GB GDDR6 | 24GB HBM2 | 16/32GB HBM2 |
| GPU Mem BW | 300GB/s | 300GB/s | 400GB/s | 600GB/s | 933GB/s | 900GB/s |
| Interconnect | PCIe Gen4 64GB/s | PCIe Gen4 64GB/s | PCIe Gen3 32GB/s | PCIe Gen4 64GB/s | PCIe Gen4 64GB/s, NVLink 200GB/s | PCIe Gen3 32GB/s, NVLink 300GB/s |
| FP32 TFLOPS | 24.1 | 30.3 | 8.1 | 31.2 | 10.3 | 14/15.7 |
| TF32 TFLOPS | 48.3 | 120* | | | | |
| BFLOAT16 TensorCore TFLOPS | 95.6 | 242* | | 125 | 165 | |
| FP16 TensorCore TFLOPS | | 242* | | 125 | 165 | |
| INT8 TensorCore TOPS | 193/193 | 485* | | 250 | 330 | |
| INT4 TensorCore TOPS | | NO | | | 661 | |

Notes:

  • *: with sparsity; without sparsity, the figures are halved.

Datasheets:

  1. L4
  2. T4
  3. A10
  4. A30
  5. V100-PCIe/V100-SXM2/V100S-PCIe

3 Comparison of A100/A800/H100/H800/Ascend 910B

| | A800 (PCIe/SXM) | A100 (PCIe/SXM) | Huawei Ascend 910B | H800 (PCIe/SXM) | H100 (PCIe/SXM) |
|---|---|---|---|---|---|
| Year | 2022 | 2020 | 2023 | 2022 | 2022 |
| Manufacturing | 7nm | 7nm | 7+nm | 4nm | 4nm |
| Architecture | Ampere | Ampere | HUAWEI Da Vinci | Hopper | Hopper |
| Max Power | 300/400W | 300/400W | 400W | | 350/700W |
| GPU Mem | 80G HBM2e | 80G HBM2e | 64G HBM2e | 80G HBM3 | 80G HBM3 |
| GPU Mem BW | | 1935/2039 GB/s | | | 2/3.35 TB/s |
| GPU Interconnect (one-to-one max bandwidth) | NVLink 400GB/s | PCIe Gen4 64GB/s, NVLink 600GB/s | HCCS 56GB/s | NVLink 400GB/s | PCIe Gen5 128GB/s, NVLink 900GB/s |
| GPU Interconnect (one-to-many total bandwidth) | NVLink 400GB/s | PCIe Gen4 64GB/s, NVLink 600GB/s | HCCS 392GB/s | NVLink 400GB/s | PCIe Gen5 128GB/s, NVLink 900GB/s |
| FP32 TFLOPS | | 19.5 | | | 51/67 |
| TF32 (TensorFloat) TFLOPS | | 156/312* | | | 756/989* |
| BFLOAT16 TensorCore TFLOPS | | 156/312* | | | |
| FP16 TensorCore TFLOPS | | 312/624* | 320 | | 1513/1979* |
| FP8 TensorCore TFLOPS | Not supported | Not supported | | | 3026/3958* |
| INT8 TensorCore TOPS | | 624/1248* | 640 | | 3026/3958* |

Notes:

  • *: with sparsity; without sparsity, the figures are halved.

H100 vs. A100 in a nutshell: ~3x the performance at ~2x the price.

Datasheets:

  1. A100
  2. H100
  3. Huawei Ascend-910B (404)
  4. 910 paper: Ascend: a Scalable and Unified Architecture for Ubiquitous Deep Neural Network Computing, HPCA, 2021

For 8-card A800 and 910B modules: the 910B's HCCS has a total bandwidth of 392GB/s, which at first glance appears comparable to the A800's NVLink (400GB/s). There are important differences, however:

  • NVIDIA NVLink: full-mesh topology (as below), so the (bi-directional) GPU-to-GPU max bandwidth is 400GB/s (note that the figure shows an 8*A100 module at 600GB/s; an 8*A800 module shares a similar full-mesh topology);

  • Huawei HCCS: peer-to-peer topology (no NVSwitch-like chip), so the (bi-directional) GPU-to-GPU max bandwidth is only 56GB/s.
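The practical consequence can be sketched with some back-of-the-envelope arithmetic (a simplified model of the two topologies; link counts are assumptions inferred from the bandwidth figures above):

```python
# Back-of-the-envelope comparison of GPU-to-GPU bandwidth in an 8-card module.
# Simplified model: a peer-to-peer topology spreads total bandwidth across
# direct links to the 7 peers, while a switched full-mesh (NVSwitch) lets a
# single GPU pair use the full per-GPU NVLink bandwidth.

def hccs_pair_bandwidth(total_bw_gbs: float, num_gpus: int) -> float:
    """Peer-to-peer: each GPU links directly to the other (num_gpus - 1)
    GPUs, so one pair only gets a single link's share of the total."""
    return total_bw_gbs / (num_gpus - 1)

def nvlink_pair_bandwidth(per_gpu_bw_gbs: float) -> float:
    """Switched full-mesh: any single pair can use the GPU's full
    NVLink bandwidth."""
    return per_gpu_bw_gbs

print(hccs_pair_bandwidth(392, 8))  # 56.0 GB/s -> matches the 910B figure
print(nvlink_pair_bandwidth(400))   # 400 GB/s  -> A800
```

So with similar *total* bandwidth, the per-pair bandwidth differs by roughly 7x, which matters for collective operations that are bottlenecked by a single busy link.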

4 Comparison of H20/L20/Ascend 910B

| | Huawei Ascend 910B | L20 (PCIe) | H20 | H100 (PCIe/SXM) |
|---|---|---|---|---|
| Year | 2023 | 2023 | 2023 | 2022 |
| Manufacturing | 7+nm | 4nm | 4nm | 4nm |
| Architecture | HUAWEI Da Vinci | Ada Lovelace | Hopper | Hopper |
| Max Power | 400W | 275W | 500W | 350/700W |
| GPU Mem | 64G HBM2e | 48G GDDR6 | 96G HBM3 | 80G HBM3 |
| GPU Mem BW | | 864GB/s | 4.0TB/s | 2/3.35 TB/s |
| L2 Cache | | 96MB | 60MB | |
| GPU Interconnect (one-to-one max bandwidth) | HCCS 56GB/s | PCIe Gen4 64GB/s | PCIe Gen5 128GB/s, NVLink 900GB/s | PCIe Gen5 128GB/s, NVLink 900GB/s |
| GPU Interconnect (one-to-many total bandwidth) | HCCS 392GB/s | PCIe Gen4 64GB/s | PCIe Gen5 128GB/s, NVLink 900GB/s | PCIe Gen5 128GB/s, NVLink 900GB/s |
| FP32 TFLOPS | | 59.8 | 44 | 51/67 |
| TF32 (TensorFloat) TFLOPS | | 59.8 | 74 | 756/989 |
| BFLOAT16 TensorCore TFLOPS | | 119/119 | 148/148 | |
| FP16 TensorCore TFLOPS | 320 | | | 1513/1979 |
| FP8 TensorCore TFLOPS | | | | 3026/3958 |
| INT8 TensorCore TOPS | 640 | 239/239 | 296/296 | 3026/3958 |

5 Notes on US “Chip Export Controls” targeting China

5.1 Export Controls 2022.10

According to Implementation of Additional Export Controls: Certain Advanced Computing and Semiconductor Manufacturing Items; Supercomputer and Semiconductor End Use; Entity List Modification, a chip is restricted from the Chinese market if it exceeds both of the following thresholds:

  1. aggregate bidirectional transfer rate >= 600 GB/s; AND,
  2. total processing performance (TPP, i.e. TOPS multiplied by the operand bit width) >= 4800 "bit TOPS", which is equivalent to:

    • >= 300 TFLOPS FP16 (4800/16)
    • >= 150 TFLOPS FP32 (4800/32)

A100 and H100 exceed both thresholds and are therefore subject to these restrictions; that's why NVIDIA made tailored versions, the A800 and H800, whose NVLink bandwidth is cut to 400GB/s (below the 600GB/s threshold).
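The "bit TOPS" metric is just the dense TFLOPS figure multiplied by the operand bit width; a quick sanity check using the A100 numbers from the tables above (a sketch, not the rule's exact legal definition):

```python
# "Total processing performance" (TPP) as used by the 2022.10 export rule:
# peak dense TOPS at a given precision multiplied by the operand bit width.

def tpp(tflops: float, bit_width: int) -> float:
    """TPP = TOPS x operand bit width ("bit TOPS")."""
    return tflops * bit_width

# A100: 312 dense FP16 TensorCore TFLOPS (from the table above)
print(tpp(312, 16))  # 4992 -> above the 4800 threshold

# The 4800 threshold expressed per precision:
print(4800 / 16)     # 300.0 TFLOPS FP16
print(4800 / 32)     # 150.0 TFLOPS FP32
```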

5.2 Export Controls 2023.10

According to Implementation of Additional Export Controls: Certain Advanced Computing Items; Supercomputer and Semiconductor End Use; Updates and Corrections, in addition to the above 2022.10 Export Controls, chips that meet one of the following conditions are also prohibited from being sold in the Chinese market:

  1. total processing performance (TPP) in 2400~4800 bit TOPS AND "performance density" (TPP divided by die area in mm²) in 1.6~5.92;

    2400 bit TOPS is equivalent to:

    • 150 TFLOPS FP16
    • 75 TFLOPS FP32
  2. total processing performance >= 1600 bit TOPS AND performance density in 3.2~5.92;

These restrictions cover most high-performance GPUs, including the older A800. Note, however, that they still leave room for low-compute-but-high-bandwidth models, such as the rumored "148TFLOPS + 96GB HBM + 900GB/s NVLink" H20 GPU.
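To see why such a part could clear the 2023 thresholds, the bit-TOPS arithmetic for the rumored H20 figure can be sketched (assumes the rumored 148 dense FP16 TFLOPS; performance density would also need the die area, which is not public):

```python
# TPP arithmetic against the 2023.10 thresholds.
# Performance density (TPP / die area) is not computed here: die area unknown.

def tpp(tflops: float, bit_width: int) -> float:
    """TPP = TOPS x operand bit width ("bit TOPS")."""
    return tflops * bit_width

h20_tpp = tpp(148, 16)  # rumored 148 dense FP16 TFLOPS
print(h20_tpp)          # 2368
print(h20_tpp < 2400)   # True -> below the 2400 lower bound of condition 1
```

A TPP of 2368 sits just under the 2400 lower bound, which is consistent with the 148 TFLOPS figure looking deliberately chosen.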


Written by Human, Not by AI