All Projects
A comprehensive collection of systems, optimizations, and experiments that push the boundaries of performance
KernelLab
A collection of hand-optimized CUDA kernels for matrix operations. What started as a learning exercise evolved into production-ready implementations that compete with cuBLAS performance. Features memory coalescing, shared memory optimization, and tensor core utilization.
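The shared-memory tiling idea behind kernels like these can be sketched outside CUDA: a blocked matrix multiply reuses each tile of the inputs many times, which is exactly the data a CUDA thread block would stage in shared memory. This is an illustrative NumPy sketch (the function name `tiled_matmul` is hypothetical, not from the project):

```python
import numpy as np

def tiled_matmul(A, B, tile=32):
    """Blocked matrix multiply. Each (i, j) output tile accumulates
    partial products from one A tile and one B tile at a time -- the
    same working set a CUDA kernel would keep in shared memory."""
    n, k = A.shape
    k2, m = B.shape
    assert k == k2, "inner dimensions must match"
    C = np.zeros((n, m), dtype=A.dtype)
    for i in range(0, n, tile):
        for j in range(0, m, tile):
            for p in range(0, k, tile):
                # NumPy slicing clamps at array bounds, so ragged
                # edge tiles are handled automatically.
                C[i:i+tile, j:j+tile] += A[i:i+tile, p:p+tile] @ B[p:p+tile, j:j+tile]
    return C
```

On a GPU the payoff comes from staging each tile in fast shared memory once instead of re-reading global memory per element; here the loop structure only illustrates the access pattern.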
DistJax
A mini-library that simplifies distributed training in JAX and Flax. Implements common parallelism strategies including data, pipeline, and tensor parallelism. Features async tensor parallelism with full Transformer support and efficient gradient synchronization.
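The data-parallel pattern such a library automates can be shown in a few lines without any accelerator: each "device" gets one shard of the batch, computes a local gradient, and an all-reduce (here just a mean) synchronizes them. A NumPy sketch with a toy MSE objective (`grad_mse` and `data_parallel_grad` are illustrative names, not the library's API):

```python
import numpy as np

def grad_mse(w, X, y):
    # Gradient of mean squared error for a linear model X @ w.
    return 2 * X.T @ (X @ w - y) / len(y)

def data_parallel_grad(w, X, y, n_shards=4):
    """Data parallelism: split the batch across shards, compute local
    gradients, then all-reduce (average) them. With equal-sized shards
    this exactly matches the full-batch gradient."""
    shards = zip(np.array_split(X, n_shards), np.array_split(y, n_shards))
    local_grads = [grad_mse(w, Xs, ys) for Xs, ys in shards]
    return np.mean(local_grads, axis=0)  # the all-reduce step
```

In JAX the same structure appears as a sharded update with a `pmean` over the device axis; the library's job is to hide the sharding and collective plumbing.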
TorchSSL
PyTorch library for self-supervised learning with optimized contrastive loss implementations. Features 3-5x speedup over naive implementations through custom CUDA kernels and memory-efficient training loops. Supports popular SSL methods like SimCLR, MoCo, and BYOL.
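The contrastive objective at the core of SimCLR-style methods is the NT-Xent loss: each view is pulled toward its augmented partner and pushed from every other sample in the batch. A minimal NumPy sketch of the math (not the library's optimized CUDA implementation; `nt_xent` is an illustrative name):

```python
import numpy as np

def nt_xent(z1, z2, tau=0.5):
    """NT-Xent (normalized temperature-scaled cross entropy).
    z1[i] and z2[i] are embeddings of two views of the same sample."""
    z = np.concatenate([z1, z2])
    z = z / np.linalg.norm(z, axis=1, keepdims=True)  # cosine-sim space
    sim = z @ z.T / tau
    n = len(z1)
    np.fill_diagonal(sim, -np.inf)  # a sample is never its own negative
    # The positive for row i is row i+n, and vice versa.
    pos = np.concatenate([np.arange(n, 2 * n), np.arange(n)])
    logsumexp = np.log(np.exp(sim).sum(axis=1))
    return float(np.mean(logsumexp - sim[np.arange(2 * n), pos]))
```

The speedups the project claims come from fusing the similarity matrix, masking, and softmax into custom kernels; the reference math above is where the naive version spends its memory and time.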
RetinaSys
Production-ready diabetic retinopathy detection system optimized for edge deployment. Combines DINOv2 features with efficient architectures, achieving 90.73% quadratic weighted kappa (QWK) and 90.85% AUC. Features INT8 quantization reducing memory footprint by 60% while maintaining clinical accuracy.
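The core of INT8 quantization is storing each weight tensor as 8-bit integers plus one floating-point scale. A minimal symmetric per-tensor sketch (illustrative helper names, not the project's deployment pipeline, which would also calibrate activations):

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor INT8 quantization: w ~= q * scale,
    with q stored as int8 (1 byte/weight instead of 4 for float32)."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q, scale):
    return q.astype(np.float32) * scale
```

Per-tensor rounding error is bounded by half a quantization step (`scale / 2`); keeping that step small relative to the weight distribution is what preserves accuracy after quantization.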
SearchSphere
A standalone, AI-powered semantic search engine that runs entirely on the user's machine. It matches queries to documents and images by meaning rather than keywords.
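Meaning-based retrieval reduces to nearest-neighbor search over embeddings: encode the query and every document into vectors, then rank by cosine similarity. A minimal sketch with precomputed vectors (in a real system the vectors would come from a text/image encoder; `semantic_search` is an illustrative name):

```python
import numpy as np

def semantic_search(query_vec, doc_vecs, top_k=3):
    """Rank documents by cosine similarity to the query embedding --
    the core operation of a local semantic index."""
    q = query_vec / np.linalg.norm(query_vec)
    D = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    scores = D @ q
    order = np.argsort(-scores)[:top_k]  # best matches first
    return order, scores[order]
```

Because everything is a local matrix-vector product, no query ever leaves the machine; larger corpora typically swap the brute-force scan for an approximate nearest-neighbor index.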
FastInference
High-performance inference engine for transformer models. Implements advanced optimization techniques including KV-cache optimization, attention fusion, and dynamic batching. Achieves 3x speedup over standard implementations while maintaining numerical stability.
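KV-cache optimization rests on one observation: during autoregressive decoding, the keys and values of past tokens never change, so they can be computed once and appended to. A single-head NumPy sketch of the pattern (the `KVCache` class is illustrative, not the engine's API):

```python
import numpy as np

def attend(q, K, V):
    # Scaled dot-product attention for one query position.
    s = K @ q / np.sqrt(len(q))
    p = np.exp(s - s.max())
    p /= p.sum()
    return p @ V

class KVCache:
    """Keys/values for past tokens are stored once and appended to,
    so each decode step projects only the new token instead of
    recomputing K and V for the whole prefix."""
    def __init__(self, d):
        self.K = np.empty((0, d))
        self.V = np.empty((0, d))

    def step(self, k, v, q):
        self.K = np.vstack([self.K, k])
        self.V = np.vstack([self.V, v])
        return attend(q, self.K, self.V)
```

Production engines go further (preallocated/paged cache blocks, fused attention kernels, batched requests), but the cached output is numerically identical to recomputing attention over the full prefix at every step.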
EfficientVision
Lightweight computer vision pipeline optimized for mobile deployment. Features neural architecture search, progressive training, and adaptive inference. Reduces model size by 80% while maintaining competitive accuracy on ImageNet classification.
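Of the techniques named above, adaptive inference is the easiest to sketch: run a lightweight model first and only fall back to the full model when its confidence is too low, so easy inputs pay the cheap price. An illustrative sketch under the assumption that both models return class probabilities (`adaptive_predict` and the threshold are hypothetical, not the project's actual routing logic):

```python
import numpy as np

def adaptive_predict(x, cheap_model, full_model, threshold=0.9):
    """Adaptive inference: use the small model's prediction when it is
    confident, otherwise escalate to the full model. Returns the class
    index and which model answered."""
    probs = cheap_model(x)
    if probs.max() >= threshold:
        return int(np.argmax(probs)), "cheap"
    return int(np.argmax(full_model(x))), "full"
```

The threshold trades accuracy for latency: on mobile, most inputs are easy, so average cost approaches the small model's while hard cases still get the full network.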