KernelLab: Technical Documentation

1. Introduction

KernelLab is a library of high-performance GPU kernels written in CUDA and Triton. It serves as a practical guide and reference for GPU programming, demonstrating various optimization strategies for common computational workloads in deep learning and scientific computing.

The project is structured to provide a clear progression from simple, "naïve" implementations to highly optimized versions that leverage the full capabilities of modern GPU architectures. Each kernel is self-contained and comes with its own set of benchmarks, allowing for a clear understanding of the performance impact of each optimization.

2. Core Concepts in GPU Optimization

The optimizations implemented in KernelLab build on a handful of core concepts of GPU programming:

- Memory coalescing: arranging global-memory accesses so that threads in a warp touch contiguous addresses, letting the hardware combine them into a single transaction.
- Shared memory tiling: staging frequently reused data in fast on-chip shared memory to cut down global-memory traffic.
- Thread divergence: keeping threads within a warp on the same control-flow path, since divergent branches are executed serially.
- Occupancy: balancing register and shared-memory usage per block so that enough warps stay resident to hide memory latency.
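As a CPU-side illustration of one of these patterns, the sketch below mirrors how a shared-memory reduction kernel combines partial sums in log2(n) strided steps. This is a hypothetical plain-Python analogy, not code from KernelLab; the real kernels implement the same pattern in CUDA across threads of a block.

```python
def tree_reduce(values):
    """Sum a sequence using the strided tree pattern GPU reductions use.

    At each step, element i accumulates element i + stride, and the
    stride doubles, so n elements are reduced in ~log2(n) passes.
    """
    data = list(values)
    stride = 1
    while stride < len(data):
        for i in range(0, len(data), 2 * stride):
            if i + stride < len(data):
                data[i] += data[i + stride]
        stride *= 2
    return data[0]
```

On a GPU, each pass corresponds to one round of threads adding a neighbor's partial sum in shared memory, with a barrier between rounds.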

3. Implemented Kernels and Optimizations

3.1 CUDA Kernels

3.1.1 Convolution Kernels

3.1.2 Matrix & Reduction Operations

3.1.3 Element-wise & Activation Functions

3.1.4 Image Processing Kernels

3.1.5 Sorting Kernels

3.2 Triton Kernels

Triton is a Python-based language and compiler for writing efficient GPU kernels. The Triton kernels in KernelLab provide a higher-level abstraction than CUDA, while still achieving performance competitive with hand-written CUDA.

4. Benchmarking and Performance

KernelLab includes a suite of benchmarks for comparing the performance of the different kernel implementations. The benchmarks are written in Python and use the torch library for creating and managing GPU tensors.
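A minimal timing harness along these lines might look like the following sketch, using only the standard library. Note that when timing GPU work with torch, the benchmark must call torch.cuda.synchronize() before reading the clock, because kernel launches are asynchronous; this CPU-only sketch omits that step.

```python
import time

def benchmark(fn, *args, warmup=3, iters=10):
    """Return the mean wall-clock seconds per call of fn(*args).

    Warmup iterations are run first so that one-time costs
    (JIT compilation, cache population) do not skew the result.
    """
    for _ in range(warmup):
        fn(*args)
    start = time.perf_counter()
    for _ in range(iters):
        fn(*args)
    return (time.perf_counter() - start) / iters
```

Example: benchmark(sum, range(10_000)) returns the average time of one sum over the range.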

The results of the benchmarks are presented in the README.md file and in the benchmarks directory.

5. How to Use KernelLab

The CUDA kernels are exposed to Python via Pybind11. Each kernel has a setup.py file that can be used to build and install the Python bindings.
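A typical setup.py for such a binding might look like the sketch below. The package and source file names here are hypothetical; torch.utils.cpp_extension supplies the CUDA toolchain flags and the pybind11 headers used by the C++ binding code.

```python
from setuptools import setup
from torch.utils.cpp_extension import BuildExtension, CUDAExtension

setup(
    name="conv2d_cuda",  # hypothetical package name
    ext_modules=[
        CUDAExtension(
            name="conv2d_cuda",
            # hypothetical sources: the C++ binding file and the CUDA kernel
            sources=["conv2d.cpp", "conv2d_kernel.cu"],
        )
    ],
    cmdclass={"build_ext": BuildExtension},
)
```

After running pip install . in the kernel's directory, the compiled module can be imported from Python like any other package.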

The Triton kernels are just-in-time compiled and can be imported and called directly from Python; no separate build step is required.
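For example, a minimal element-wise addition kernel in Triton looks like the following. This is a generic sketch rather than code taken from KernelLab, and it requires a CUDA-capable GPU at runtime.

```python
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)                        # which block this instance handles
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements                        # guard the tail of the array
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    n = out.numel()
    grid = (triton.cdiv(n, 1024),)                     # one program per 1024 elements
    add_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
    return out
```

The @triton.jit decorator compiles the kernel on first launch, so calling add(x, y) on CUDA tensors is all that is needed from Python.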

6. Future Work

The TODO.md file lists the kernels and features that are planned for future development.
