Solenya

Introduction


Deep learning frameworks like PyTorch abstract away the complexities of GPU programming, enabling researchers and practitioners to focus on model development and experimentation. However, as models grow in size and complexity, understanding the underlying mechanisms becomes crucial for optimizing performance and leveraging the full capabilities of modern GPUs.

In this blog post, we delve into the low-level details of PyTorch’s execution on NVIDIA GPUs, exploring the intricate interplay between high-level Python code and low-level GPU operations during a forward pass through a neural network. By tracing the journey from Python code to CUDA kernels, Tensor Cores, and bitwise operations, we aim to demystify the inner workings of PyTorch and provide insights on the latest advancements in GPU acceleration for deep learning.

GPU Architectures


Modern GPUs are highly parallel processors designed to accelerate a wide range of workloads, including deep learning, scientific computing, and graphics rendering. NVIDIA’s GPUs, in particular, are widely used in the deep learning community due to their performance, programmability, and ecosystem support (for a great guide, we recommend Modal’s website).

Memory Hierarchy


[Figure: GPU memory architecture]

[Source: Lecture on GPU Memory]

NVIDIA GPUs feature a hierarchical memory structure comprising various levels of storage, each optimized for different purposes and access patterns:

| Memory Type | Description | Scope & Usage |
|---|---|---|
| Registers | Fastest on-chip memory, private to individual threads, used to store intermediate calculations. | Individual Threads |
| Shared Memory | Fast on-chip memory shared by threads within a block; ideal for thread cooperation and data reuse. | Threads within a Block |
| L1 Cache | High-speed per-SM cache storing frequently accessed data, reducing access latency. | Individual Streaming Multiprocessors (SMs) |
| L2 Cache | Larger, unified cache shared across all SMs, enhancing data reuse and further reducing latency. | Entire GPU Chip |
| Global Memory (Off-Chip) | Large-capacity, off-chip memory accessible by all threads, though slower due to higher latency. It is often referred to by various terms, leading to confusion; see the table below for clarity. | Entire GPU Chip; All Threads |

The term “Global Memory” is often misunderstood due to its different contexts. The following table clarifies the terminology:

| Term | Description | Who Uses This Term |
|---|---|---|
| Global Memory | General CUDA abstraction referring to memory accessible by all GPU threads. | Developers, CUDA Documentation |
| VRAM (Video RAM) | General term emphasizing graphics-focused GPU memory; widely recognized but not an official standard. | Graphics & Gaming Community, GPU Enthusiasts |
| GDDR6X / GDDR6 | Specific memory standards (protocols) for high-performance graphics and computing (defined by JEDEC). | JEDEC, GPU Manufacturers (NVIDIA, AMD), Hardware Vendors (Samsung, SK Hynix, Micron) |
| HBM (High Bandwidth Memory) | Advanced JEDEC standard featuring stacked memory integrated onto GPU packages for high throughput and reduced latency. | JEDEC, GPU Manufacturers (AMD, NVIDIA), AI and HPC accelerators |

This structured overview clearly separates conceptual usage from technical standards, ensuring better understanding for readers.
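From PyTorch, the level of this hierarchy you interact with most directly is global memory. The sketch below shows one way to inspect it (a minimal illustration, assuming a single CUDA device at index 0):

import torch

# Properties of the first visible GPU, including its total global memory (VRAM).
props = torch.cuda.get_device_properties(0)
print(f"Device:     {props.name}")
print(f"Total VRAM: {props.total_memory / 1e9:.1f} GB")
print(f"SM count:   {props.multi_processor_count}")

# How much global memory this process currently holds via PyTorch's allocator.
x = torch.empty(1024, 1024, device="cuda")  # roughly 4 MB of FP32 values
print(f"Allocated:  {torch.cuda.memory_allocated() / 1e6:.1f} MB")
print(f"Reserved:   {torch.cuda.memory_reserved() / 1e6:.1f} MB")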

Numerical Precision


[Figure: Floating-point number representation]

[Source: Interactive Calculator from Numerical Systems]

Floating-point numbers are represented in computers using the IEEE 754 standard, which defines formats such as single precision (32-bit) and double precision (64-bit) as a combination of sign, exponent, and mantissa bits:
- Sign Bit: Determines the number’s sign (positive or negative).
- Exponent Bits: Represent the number’s scale and range.
- Mantissa Bits: Store the number’s fractional part.

Jointly, these form the equation:

$$(-1)^{\text{sign}} \times 2^{\text{exponent}} \times \text{mantissa}$$

Numerical precision plays a critical role in deep learning, balancing model accuracy with computational efficiency. NVIDIA GPUs support various data types, each with different bit-widths and trade-offs. At higher precisions (e.g., FP32, FP64), models achieve better accuracy and training stability but require more memory and computational resources. In contrast, lower precisions (e.g., FP16, INT8) reduce memory footprint and computational complexity but may introduce numerical instability and degrade model performance.
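To make these trade-offs concrete, you can query the range and resolution of the common formats directly in PyTorch. A small illustrative snippet (FP8 and FP4 types are omitted because their availability depends on the PyTorch build):

import torch

# Range and resolution of the most common training dtypes.
for dtype in (torch.float32, torch.float16, torch.bfloat16):
    info = torch.finfo(dtype)
    print(f"{str(dtype):>15} | bits: {info.bits:2d} | max: {info.max:.3e} "
          f"| smallest normal: {info.tiny:.3e} | eps: {info.eps:.3e}")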

More recently, papers like QLoRA and SVDQuant from HanLab, along with support for NVFP4 in Blackwell chips, have explored the use of 4-bit floating-point numbers, which allocate 1 bit to the sign, 2 bits to the exponent, and 1 bit to the mantissa, to further optimize AI memory requirements and throughput. Worked out by hand, the full set of representable values looks like this:

| Bit Pattern | Sign (S) | Exponent (E) | Mantissa (M) | Calculation Steps | Value |
|---|---|---|---|---|---|
| 0000 | 0 | 00 | 0 | Subnormal: (–1)^0 × 2^(1–1) × (0 × 2^–1) = 0 | 0 |
| 0001 | 0 | 00 | 1 | Subnormal: (–1)^0 × 2^(1–1) × (1 × 2^–1) = 0.5 | 0.5 |
| 0010 | 0 | 01 | 0 | Normalized: (–1)^0 × 2^(1–1) × (1 + 0 × 2^–1) = 1 | 1 |
| 0011 | 0 | 01 | 1 | Normalized: (–1)^0 × 2^(1–1) × (1 + 1 × 2^–1) = 1.5 | 1.5 |
| 0100 | 0 | 10 | 0 | Normalized: (–1)^0 × 2^(2–1) × (1 + 0 × 2^–1) = 2 | 2 |
| 0101 | 0 | 10 | 1 | Normalized: (–1)^0 × 2^(2–1) × (1 + 1 × 2^–1) = 3 | 3 |
| 0110 | 0 | 11 | 0 | Normalized: (–1)^0 × 2^(3–1) × (1 + 0 × 2^–1) = 4 | 4 |
| 0111 | 0 | 11 | 1 | Normalized: (–1)^0 × 2^(3–1) × (1 + 1 × 2^–1) = 6 | 6 |
| 1000 | 1 | 00 | 0 | Subnormal: (–1)^1 × 2^(1–1) × (0 × 2^–1) = –0 | –0 |
| 1001 | 1 | 00 | 1 | Subnormal: (–1)^1 × 2^(1–1) × (1 × 2^–1) = –0.5 | –0.5 |
| 1010 | 1 | 01 | 0 | Normalized: (–1)^1 × 2^(1–1) × (1 + 0 × 2^–1) = –1 | –1 |
| 1011 | 1 | 01 | 1 | Normalized: (–1)^1 × 2^(1–1) × (1 + 1 × 2^–1) = –1.5 | –1.5 |
| 1100 | 1 | 10 | 0 | Normalized: (–1)^1 × 2^(2–1) × (1 + 0 × 2^–1) = –2 | –2 |
| 1101 | 1 | 10 | 1 | Normalized: (–1)^1 × 2^(2–1) × (1 + 1 × 2^–1) = –3 | –3 |
| 1110 | 1 | 11 | 0 | Normalized: (–1)^1 × 2^(3–1) × (1 + 0 × 2^–1) = –4 | –4 |
| 1111 | 1 | 11 | 1 | Normalized: (–1)^1 × 2^(3–1) × (1 + 1 × 2^–1) = –6 | –6 |

Covering only a small range of the possible real numbers, these values may be upcast and rescaled in LayerNorm and/or the up- and down-projection fully connected layers (depending on the architecture), using mixed precision to maintain numerical stability and performance.
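To see how these bit patterns map to values, here is a small illustrative Python decoder for the 4-bit layout described above (the function name and bit-layout assumptions follow the table, not an official NVFP4 API):

def decode_e2m1(bits: int) -> float:
    """Decode a 4-bit pattern (1 sign bit, 2 exponent bits, 1 mantissa bit, bias = 1)."""
    sign = (bits >> 3) & 0b1
    exponent = (bits >> 1) & 0b11
    mantissa = bits & 0b1
    if exponent == 0:
        # Subnormal: no implicit leading 1, exponent fixed at 1 - bias.
        magnitude = (mantissa * 2**-1) * 2**(1 - 1)
    else:
        # Normalized: implicit leading 1, exponent minus bias.
        magnitude = (1 + mantissa * 2**-1) * 2**(exponent - 1)
    return -magnitude if sign else magnitude

# Reproduce the table above for all 16 bit patterns.
for bits in range(16):
    print(f"{bits:04b} -> {decode_e2m1(bits)}")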

Core Microarchitecture


[Figure: Core microarchitecture]

[Source: Article by Quantum News]

NVIDIA consistently advances its GPU microarchitecture, introducing specialized hardware units tailored to specific data types and operations. A pivotal shift occurred with the introduction of Tensor Cores (TCs) in the Volta architecture, designed to accelerate deep learning through efficient mixed-precision operations. Subsequent architectures, such as Ampere, expanded TC capabilities by introducing structured sparsity support and additional data precisions, including INT8, INT4, BF16, TF32, and FP64. The Hopper architecture further enhanced these capabilities by adding FP8 precision, significantly optimizing performance for large language model (LLM) training and inference tasks, and the Blackwell architecture continues this trend by introducing NVFP4 (NVIDIA’s 4-bit floating-point format), further enhancing AI compute efficiency.

| Feature | Volta (2017) | Ampere (2020) | Hopper (2022) | Blackwell (2024) |
|---|---|---|---|---|
| Tensor Core Support | ✅ FP16/FP32 | ✅ FP16/FP32, INT8, INT4, BF16, TF32, FP64 | ✅ FP8, FP16/FP32, INT8, INT4, BF16, TF32, FP64 | ✅ NVFP4, FP8, FP16/FP32, INT8, INT4, BF16, TF32, FP64 |
| Structured Sparsity | ❌ No | ✅ Yes (2:4 sparsity) | ✅ Yes (enhanced sparsity) | ✅ Yes (advanced sparsity) |

Note: NVFP4 is NVIDIA’s 4-bit floating-point format introduced with the Blackwell architecture, designed to double AI throughput while halving memory requirements. [See Triton Documentation]

Note: FP8 is split into two specifications, E4M3 and E5M2. [See: Nvidia cuDNN Documentation]

At the heart of Tensor Cores’ efficiency is their ability to execute matrix multiplications and accumulations in a single fused operation, leveraging specialized hardware components:

  1. Multipliers: Each Tensor Core contains dedicated hardware multipliers that execute multiple fused multiply-accumulate (FMA) operations simultaneously. These multipliers take small blocks of matrices (e.g., 4x4, 16x16) and compute dot products in parallel, significantly reducing the time required for general matrix multiplications (GEMM) compared to traditional CUDA cores.
  2. Adders: After multiplication, Tensor Cores perform element-wise additions using specialized adder circuits. The fused nature of these operations avoids unnecessary memory loads and stores, which are typically bottlenecks in deep learning workloads.
  3. Accumulators: Tensor Cores employ higher-precision accumulators (typically FP32 or INT32) to maintain numerical stability. For example, when performing FP16 matrix multiplication, the intermediate sums are often accumulated in FP32 to minimize precision loss before being converted back to FP16 if necessary.
  4. Precision Conversion: Since deep learning models require varying levels of precision, Tensor Cores integrate fast precision conversion units that efficiently downcast or upcast between FP8, FP16, BF16, and FP32. This allows mixed-precision training, where forward passes are computed in FP16 for speed, while gradients and updates use FP32 for numerical accuracy.

For example, one fundamental operation executed by Tensor Cores follows the form:

$$D = A \times B + C$$

Where:

  • A (4x4 FP16) is a small block of matrix A stored in half precision.
  • B (4x4 FP16) is another small block of matrix B.
  • C (4x4 FP16 or FP32) represents an accumulation buffer that stores the partial sum.
  • D (4x4 FP16 or FP32) is the final result stored in the appropriate precision.

This operation is executed in a single clock cycle per Tensor Core, leveraging both structured sparsity (for further performance gains) and efficient memory access patterns to maximize throughput. The operator may perform certain sub-operations in FP16, accumulate in FP32 for numerical precision, and finally convert back to FP16 for the final result.
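As a rough illustration of this mixed-precision pattern, the following PyTorch sketch emulates the numerics of D = A × B + C with FP16 inputs and FP32 accumulation (an emulation on ordinary tensors, not a way to invoke the hardware instruction directly):

import torch

torch.manual_seed(0)
# Small tiles standing in for the A, B, and C blocks a Tensor Core consumes.
A = torch.randn(4, 4, device="cuda", dtype=torch.float16)
B = torch.randn(4, 4, device="cuda", dtype=torch.float16)
C = torch.randn(4, 4, device="cuda", dtype=torch.float32)

# Emulate D = A @ B + C with FP32 accumulation, then downcast the result.
D_fp32 = A.float() @ B.float() + C   # multiply-accumulate in FP32
D_fp16 = D_fp32.half()               # convert back to FP16 if required

# Naive all-FP16 version for comparison; rounding error grows with tile size.
D_naive = A @ B + C.half()
print((D_fp16.float() - D_naive.float()).abs().max())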

The reason for performing operations in this manner lies in hardware design constraints and efficiency optimizations:

  • Dedicated Arithmetic Logic Units (ALUs): Each supported precision requires a specialized ALU with different multiplication and accumulation rules. Designing a fully general-purpose unit would drastically increase circuit complexity and power consumption.
  • Bit-width Optimization: Smaller precision formats (e.g., FP8, INT4, NVFP4) allow Tensor Cores to pack more operations per cycle, but require custom logic for correct rounding and accumulation. Allowing arbitrary precision would complicate scheduling and degrade performance.
  • Memory Bandwidth Constraints: Lower precision formats reduce memory bandwidth requirements, but arbitrary precision would require more flexible memory controllers, increasing latency and energy usage.

Thus, NVIDIA selectively supports key numeric formats (FP16, BF16, FP8, NVFP4) based on their impact on AI model performance, power efficiency, and hardware feasibility. The balance of multipliers and accumulators within a core, the amount of memory at each layer of the memory hierarchy, and the precision-conversion units all trade off throughput, power efficiency, hardware flexibility, numerical precision, and manufacturing cost to arrive at the optimal design for a given architecture.

Logical Processing


[Figure: GPU thread hierarchy]

[Source: University of Maryland Lecture Slides on Thread Optimization]

In CUDA programming, you work with a layered hierarchy of abstractions designed to manage parallel computation and memory efficiently. Here are the key abstractions:

| Concept | Logical Role (CUDA Programming) | Hardware Mapping (Execution on SMs) |
|---|---|---|
| Thread | A single execution unit | Mapped to a CUDA core (scalar processor) |
| Warp | 32 threads executing together as a Single Instruction Multiple Threads (SIMT) operation | Scheduled as a unit on an SM warp scheduler |
| Thread Block | A group of threads (typically 128–1024) | Assigned to one SM; multiple warps execute independently |
| Grid | A collection of thread blocks | Spans across multiple SMs |

In GPU Kernel Functions, data is typically tiled and applied over a grid of thread blocks to ensure optimal parallel execution. Together, these abstractions allow developers to design programs that exploit the massive parallelism of GPUs while managing memory efficiently and optimizing performance.

Writing Custom Kernels


To fully utilize GPU capabilities, developers often write custom kernels—functions that execute in parallel across thousands of GPU threads. CUDA and Triton are two powerful frameworks for writing and optimizing such kernels. CUDA provides fine-grained control over GPU execution, while Triton abstracts many of the complexities, making kernel programming more accessible while still achieving high performance. However, these tools offer different levels of abstraction and optimization strategies for writing GPU-accelerated code:

| Concept | CUDA Equivalent | Triton Equivalent |
|---|---|---|
| Threads | threadIdx.x | tl.arange(0, BLOCK_SIZE) (vectorized execution) |
| Warps | Implicit in CUDA | Implicit (Triton auto-optimizes) |
| Blocks | blockIdx.x | tl.program_id(axis=0) |
| Grid | gridDim.x | Defined by the number of program instances |

For example, consider a simple vector addition kernel in CUDA and its Triton equivalent:

#include <cuda_runtime.h>
#include <cstdlib>

__global__ void addVectors(const float* A, const float* B, float* C, int N) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N) {
        C[i] = A[i] + B[i];
    }
}

int main() {
    int N = 1 << 20; // 1 million elements
    size_t size = N * sizeof(float);

    float *h_A, *h_B, *h_C, *d_A, *d_B, *d_C;
    h_A = (float*)malloc(size);
    h_B = (float*)malloc(size);
    h_C = (float*)malloc(size);
    // Fill the host inputs so the device copies have defined values.
    for (int i = 0; i < N; ++i) { h_A[i] = 1.0f; h_B[i] = 2.0f; }
    cudaMalloc(&d_A, size);
    cudaMalloc(&d_B, size);
    cudaMalloc(&d_C, size);
    cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);

    int threadsPerBlock = 256;
    int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;
    addVectors<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, N);
    cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);

    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
    free(h_A); free(h_B); free(h_C);
    return 0;
}
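
The Triton equivalent can be written as a Python function decorated with @triton.jit; the kernel and variable names below are illustrative, a minimal sketch rather than a tuned implementation:

import torch
import triton
import triton.language as tl

@triton.jit
def add_vectors_kernel(A, B, C, N, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)                         # one program instance per block of elements
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < N                                  # guard the tail of the vector
    a = tl.load(A + offsets, mask=mask)
    b = tl.load(B + offsets, mask=mask)
    tl.store(C + offsets, a + b, mask=mask)

N = 1 << 20  # 1 million elements
A = torch.rand(N, device="cuda")
B = torch.rand(N, device="cuda")
C = torch.empty_like(A)
grid = lambda meta: (triton.cdiv(N, meta["BLOCK_SIZE"]),)
add_vectors_kernel[grid](A, B, C, N, BLOCK_SIZE=1024)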

Comparing these two examples, we see that the Triton kernel is more concise and abstracts away many low-level details, such as thread and block management. Triton’s tl.load and tl.store functions handle data movement efficiently, and the tl.arange function enables vectorized execution. Triton also provides auto-tuning for block sizes, optimizing performance across different GPU architectures.

In PyTorch 2.6, built-in support for integrating Triton kernels was introduced, allowing custom operations to be easily accelerated. Below is an example of using a Triton-based ReLU operation in PyTorch:

import torch
import triton
import triton.language as tl

@triton.jit
def relu_kernel(X, Y, N, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(0)
    block_start = pid * BLOCK_SIZE
    offsets = block_start + tl.arange(0, BLOCK_SIZE)
    mask = offsets < N
    x = tl.load(X + offsets, mask=mask)
    y = tl.where(x > 0, x, 0)
    tl.store(Y + offsets, y, mask=mask)

class TritonReLU(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        y = torch.empty_like(x)
        N = x.numel()
        BLOCK_SIZE = 1024
        grid = lambda meta: (triton.cdiv(N, meta['BLOCK_SIZE']),)
        relu_kernel[grid](x, y, N, BLOCK_SIZE=BLOCK_SIZE)
        return y

# Usage in a PyTorch model
x = torch.randn(10240, device='cuda')
relu = TritonReLU.apply
y = relu(x)
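
As a quick sanity check (not part of the original example), the custom kernel should agree with PyTorch's built-in ReLU:

# Verify the Triton kernel against the reference implementation.
assert torch.allclose(y, torch.relu(x))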

Stack Trace


Now that we understand GPU architecture and how to write custom kernels, let's explore the stack trace of a forward pass through a neural network in PyTorch, to understand how high-level Python code is translated into low-level GPU operations.

To illustrate this process, we’ll consider a simple neural network model defined in PyTorch and trace the execution flow from the Python forward method to the underlying CUDA kernels and Tensor Core operations:

import torch
import torch.nn as nn
from transformers import PreTrainedModel, PretrainedConfig, AutoModel
# Define a custom configuration class.
class SimpleModelConfig(PretrainedConfig):
    model_type = "simple_model"  # This type name will be used during registration.

    def __init__(self, input_dim=10, output_dim=5, **kwargs):
        super().__init__(**kwargs)
        self.input_dim = input_dim
        self.output_dim = output_dim

# Define the custom model by subclassing PreTrainedModel.
class SimpleModel(PreTrainedModel):
    config_class = SimpleModelConfig  # Link to the config class

    def __init__(self, config):
        super().__init__(config)
        # Create a simple linear layer using dimensions from the configuration.
        self.linear = nn.Linear(config.input_dim, config.output_dim)
        self.init_weights()  # Initialize weights according to Hugging Face conventions.

    def forward(self, x):
        return self.linear(x)

# Register the custom model with the Hugging Face AutoModel.
AutoModel.register(SimpleModelConfig, SimpleModel)

# Now, you can instantiate the model using the custom config.
config = SimpleModelConfig()
model = SimpleModel(config).cuda()

# Example input tensor
input_tensor = torch.randn(2, config.input_dim).cuda()

# Forward pass
output = model(input_tensor)
print("Model output:", output)

In this example, the `forward` method processes the `input_tensor` through a linear layer. When the forward pass is executed, PyTorch orchestrates the underlying computations, managing data flow and operations.

When this Python code is executed, PyTorch calls CUDA kernels, which are compiled to PTX (Parallel Thread Execution) code and then to SASS (Shader Assembly) code, which is executed on the GPU hardware. This process involves intricate optimizations and parallel execution across GPU cores and Tensor Cores, enabling efficient and accelerated deep learning computations:

%%{init: {'theme': 'forest'}}%%
graph TD
    subgraph User[User-level Python]
        Python[Python Model.forward]
    end
    subgraph Library[Python Library-code]
        Autograd[PyTorch Autograd]
        Dispatcher[PyTorch Op Dispatcher]
    end
    subgraph CUDA[CUDA C++]
        CUDAKernels[CUDA Kernels]
    end
    subgraph Compile[GPU Compilation and Execution]
        PTX["Parallel Thread Execution"] --> SASS["Shader Assembly"] --> Hardware[GPU Hardware]
    end
    Python --> Autograd
    Autograd --> Dispatcher
    Dispatcher --> CUDAKernels
    CUDAKernels --> PTX
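
One way to observe these kernel launches from Python is PyTorch's built-in profiler. A minimal sketch, reusing the model and input_tensor defined above:

import torch
from torch.profiler import profile, ProfilerActivity

# Record the operators and CUDA kernels launched during the forward pass.
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    output = model(input_tensor)

# Each row is an operator (e.g. aten::linear) or the kernel(s) it dispatched to.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))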

When this forward pass was executed, PyTorch’s Autograd engine dynamically constructed a computational graph, capturing the operations and tensors involved.

%%{init: {'theme': 'forest'}}%%
graph LR
    InputTensor --> LinearOp[Linear Operation]
    LinearOp --> OutputTensor
    LinearOp --> Graph[Computational Graph Node]

This dynamic graph facilitates automatic differentiation during backpropagation. For instance, the linear transformation in the forward pass is registered in the computational graph, enabling gradient computation in subsequent backward passes.
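
You can inspect this graph node directly from Python (an illustrative check on the output tensor from the example above; the exact node name may differ across PyTorch versions):

# The output tensor holds a reference to the backward node recorded for the linear op.
print(output.grad_fn)                  # e.g. <AddmmBackward0 object at 0x...>
print(output.grad_fn.next_functions)   # upstream nodes feeding this operation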

Conclusion


In this blog post, we’ve explored the low-level performance optimizations in PyTorch, delving into GPU architectures, numerical precision, and core microarchitecture. We’ve traced the execution flow of a forward pass through a neural network, from Python code to CUDA kernels and Tensor Core operations, highlighting the intricate interplay between high-level abstractions and low-level GPU operations. By understanding these mechanisms, developers can optimize performance, leverage specialized operations, and accelerate deep learning workloads on modern GPUs.

© 2025 Solenya. All rights reserved.