Selected Works
Torch++
An extension library for PyTorch that adds high-performance CUDA kernels and a distributed training framework to deep learning workflows.
KernelLab
A library of high-performance GPU kernels written in CUDA and Triton. It serves as a practical guide and reference for GPU programming, demonstrating various optimization strategies.
DistJax
A powerful and flexible library for JAX that simplifies the implementation of distributed training for large-scale neural networks. Provides high-level abstractions for common parallelism strategies.
FastQwen3
An implementation of the Qwen3 language model optimized for high-performance inference. Uses custom CUDA kernels and FlashAttention to achieve a speedup of 9.3x or more over the baseline Hugging Face implementation.
GEMM
A collection of General Matrix Multiplication (GEMM) kernels implemented in CUDA C++. Explores optimization techniques ranging from a naive implementation to highly optimized kernels using Tensor Cores.
TorchSSL
A PyTorch-based library that provides a clean, modular, and high-performance environment for self-supervised learning of visual representations.
RetinaSys
Production-ready diabetic retinopathy detection system optimized for edge deployment. Combines DINOv2 features with efficient architectures.
SearchSphere
A standalone, AI-powered semantic search engine that runs locally on a user's machine. Designed to overcome the limitations of traditional keyword-based file search.
ModelGoBrr...
A playground for experiments in making Transformer-based and diffusion models run faster (both training and inference).
ML Stocks
Stock prediction application that uses several machine learning and deep learning models to forecast stock prices. Powered by FastAPI and Gradio.