Selected Works

9.39× Speedup

Torch++

An extension library for PyTorch that speeds up deep learning workflows with high-performance CUDA kernels and a distributed training framework.

CUDA PyTorch C++ Distributed Systems
Maintained

KernelLab

A library of high-performance GPU kernels written in CUDA and Triton. It serves as a practical guide and reference for GPU programming, demonstrating various optimization strategies.

CUDA Triton C++
Active

DistJax

A flexible JAX library that simplifies distributed training of large-scale neural networks. It provides high-level abstractions for common parallelism strategies.

JAX Python Distributed Systems
9.25× Latency Reduction

FastQwen3

An implementation of the Qwen3 language model optimized for high-performance inference. It uses custom CUDA kernels and FlashAttention to achieve a speedup of about 9.3× over the baseline Hugging Face implementation.

CUDA Triton PyTorch
Completed

GEMM

A collection of General Matrix Multiplication (GEMM) kernels implemented in CUDA C++. It explores optimization techniques ranging from naive implementations to highly optimized kernels that use Tensor Cores.

CUDA C++ Assembly
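
To illustrate the range from naive to optimized GEMM mentioned above, here is a minimal CPU-side sketch in Python with NumPy. It is not code from the project: the function names `gemm_naive` and `gemm_tiled` and the `tile` parameter are hypothetical, and the blocked loop only mimics the tiling idea that GPU kernels typically apply with shared memory.

```python
import numpy as np


def gemm_naive(a, b):
    """Reference triple loop: C[i, j] = sum over p of A[i, p] * B[p, j]."""
    m, k = a.shape
    k2, n = b.shape
    assert k == k2, "inner dimensions must match"
    c = np.zeros((m, n), dtype=np.float64)
    for i in range(m):
        for j in range(n):
            acc = 0.0
            for p in range(k):
                acc += a[i, p] * b[p, j]
            c[i, j] = acc
    return c


def gemm_tiled(a, b, tile=4):
    """Blocked variant: accumulate C one (tile x tile) block at a time.

    Each block of C is built from small sub-blocks of A and B, which is
    the access pattern that lets a GPU kernel reuse data held in fast
    memory. `tile` is an illustrative parameter, not a tuned value.
    """
    m, k = a.shape
    k2, n = b.shape
    assert k == k2, "inner dimensions must match"
    c = np.zeros((m, n), dtype=np.float64)
    for i0 in range(0, m, tile):
        for j0 in range(0, n, tile):
            for p0 in range(0, k, tile):
                # NumPy clips slices at the array edge, so ragged
                # border tiles need no special handling here.
                c[i0:i0 + tile, j0:j0 + tile] += (
                    a[i0:i0 + tile, p0:p0 + tile] @ b[p0:p0 + tile, j0:j0 + tile]
                )
    return c
```

Both functions compute the same product; only the loop order and the unit of work differ, which is the property an optimized kernel relies on.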
Maintained

TorchSSL

A PyTorch-based library that provides a clean, modular, and high-performance environment for self-supervised learning of visual representations.

PyTorch SSL
90.73% QWK

RetinaSys

A production-ready diabetic retinopathy detection system optimized for edge deployment. It combines DINOv2 features with efficient architectures.

PyTorch Edge AI ONNX
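
For readers unfamiliar with the QWK figure on this card, the sketch below shows how the metric can be computed, assuming QWK here means quadratic weighted kappa over ordinal severity grades. The function name and its `num_classes` argument are hypothetical and not taken from the project.

```python
import numpy as np


def quadratic_weighted_kappa(y_true, y_pred, num_classes):
    """Agreement between two ordinal gradings, penalising large misses more.

    Returns 1.0 for perfect agreement and about 0.0 for chance-level
    agreement. Undefined (division by zero) when every label in both
    inputs falls in a single class.
    """
    y_true = np.asarray(y_true, dtype=int)
    y_pred = np.asarray(y_pred, dtype=int)

    # Observed confusion matrix: rows are true grades, columns predictions.
    observed = np.zeros((num_classes, num_classes))
    for t, p in zip(y_true, y_pred):
        observed[t, p] += 1

    # Expected counts if true and predicted grades were independent.
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / observed.sum()

    # Quadratic penalty: grows with the squared distance between grades.
    idx = np.arange(num_classes)
    weights = (idx[:, None] - idx[None, :]) ** 2 / (num_classes - 1) ** 2

    return 1.0 - (weights * observed).sum() / (weights * expected).sum()
```

A call such as `quadratic_weighted_kappa(labels, predictions, num_classes=5)` would suit a five-grade severity scale, with the class count being an assumption about the grading scheme.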
Completed

SearchSphere

A standalone, AI-powered semantic search engine that runs locally on a user's machine. Designed to overcome the limitations of traditional keyword-based file search.

Transformers Vector Search
Active

ModelGoBrr...

A playground for experiments in making Transformer-based and diffusion models run faster, in both training and inference.

PyTorch CUDA JAX
Completed

ML Stocks

A stock prediction application that uses several machine learning and deep learning models to forecast stock prices. Built with FastAPI and Gradio.

FastAPI Gradio