GPU Computing

KernelLab

A comprehensive collection of hand-optimized CUDA kernels for matrix operations. It started as a learning exercise and evolved into production-ready implementations competitive with cuBLAS. Features memory coalescing, shared-memory optimization, and tensor core utilization.

CUDA · Triton · WMMA · Nsight Compute · cuBLAS
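
As a rough illustration of the shared-memory idea, here is a NumPy sketch of the blocking pattern a tiled kernel follows; the tile size and shapes are illustrative assumptions, not KernelLab's actual parameters.

```python
import numpy as np

def tiled_matmul(A, B, tile=32):
    """Block-tiled matmul: the loop structure a shared-memory CUDA
    kernel mirrors, with each (i, j) tile handled by one thread block."""
    M, K = A.shape
    _, N = B.shape
    C = np.zeros((M, N), dtype=A.dtype)
    for i in range(0, M, tile):
        for j in range(0, N, tile):
            acc = np.zeros_like(C[i:i+tile, j:j+tile])
            for k in range(0, K, tile):
                # On the GPU, these two slices would be staged in shared
                # memory via coalesced global loads before accumulating.
                acc += A[i:i+tile, k:k+tile] @ B[k:k+tile, j:j+tile]
            C[i:i+tile, j:j+tile] = acc
    return C

A = np.random.rand(128, 96).astype(np.float32)
B = np.random.rand(96, 64).astype(np.float32)
assert np.allclose(tiled_matmul(A, B), A @ B, atol=1e-3)
```
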
Distributed Systems

DistJax

Mini-library simplifying distributed training in JAX and Flax. Implements the common strategies: data, pipeline, and tensor parallelism. Features asynchronous tensor parallelism with full Transformer support and efficient gradient synchronization.

JAX · Flax · XLA · NCCL · Distributed Training
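
DistJax's own API isn't reproduced here; the sketch below is the standard JAX data-parallel pattern (the toy model and all names are illustrative): pmap runs one replica per device and lax.pmean all-reduces the gradients.

```python
from functools import partial
import jax
import jax.numpy as jnp

def loss_fn(params, x, y):
    # Toy linear model standing in for a Flax module's apply().
    pred = x @ params["w"] + params["b"]
    return jnp.mean((pred - y) ** 2)

@partial(jax.pmap, axis_name="batch")  # one replica per device
def train_step(params, x, y):
    loss, grads = jax.value_and_grad(loss_fn)(params, x, y)
    # All-reduce: average gradients across replicas so every device
    # applies an identical update, the core of data parallelism.
    grads = jax.lax.pmean(grads, axis_name="batch")
    new_params = jax.tree_util.tree_map(lambda p, g: p - 0.1 * g, params, grads)
    return new_params, loss

n = jax.local_device_count()
params = {"w": jnp.zeros((4, 1)), "b": jnp.zeros((1,))}
replicated = jax.tree_util.tree_map(lambda p: jnp.stack([p] * n), params)
x, y = jnp.ones((n, 8, 4)), jnp.ones((n, 8, 1))
replicated, loss = train_step(replicated, x, y)
print(loss)  # one loss value per device
```
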
Self-Supervised Learning

TorchSSL

PyTorch library for self-supervised learning with optimized contrastive-loss implementations. Achieves a 3-5x speedup over naive implementations through custom CUDA kernels and memory-efficient training loops. Supports popular SSL methods such as SimCLR, MoCo, and BYOL.

PyTorch · CUDA · Vision Transformers · Contrastive Learning
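
The fused CUDA kernels themselves aren't shown here; below is a minimal PyTorch reference for the SimCLR-style NT-Xent loss that such kernels must match numerically (batch size and temperature are illustrative).

```python
import torch
import torch.nn.functional as F

def nt_xent(z1, z2, temperature=0.5):
    """NT-Xent (SimCLR) loss: z1[i] and z2[i] embed two augmented views
    of the same image; all other rows in the 2N batch act as negatives."""
    n = z1.size(0)
    z = F.normalize(torch.cat([z1, z2]), dim=1)    # (2N, D), unit norm
    sim = z @ z.t() / temperature                  # scaled cosine similarities
    # An example must never be its own negative: mask the diagonal.
    sim = sim.masked_fill(torch.eye(2 * n, dtype=torch.bool), float("-inf"))
    # The positive for row i is the other view of the same image.
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(n)])
    return F.cross_entropy(sim, targets)

z1, z2 = torch.randn(8, 128), torch.randn(8, 128)
print(nt_xent(z1, z2).item())
```
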
Medical AI

RetinaSys

Production-ready diabetic retinopathy detection system optimized for edge deployment. Combines DINOv2 features with efficient architectures, achieving 90.73% QWK and 90.85% AUC. Features INT8 quantization that reduces the memory footprint by 60% while maintaining clinical accuracy.

DINOv2 · OpenVINO · SHAP · Medical Imaging · Edge Deployment
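
RetinaSys quantizes through OpenVINO; as a generic stand-in for the INT8 idea, here is PyTorch's post-training dynamic quantization applied to a placeholder head (the deployed network is DINOv2-based, not this toy).

```python
import io
import torch
import torch.nn as nn

# Placeholder classifier head; RetinaSys itself pushes a DINOv2-based
# model through OpenVINO's INT8 pipeline rather than this toy.
model = nn.Sequential(nn.Linear(768, 256), nn.ReLU(), nn.Linear(256, 5))
model.eval()

# Post-training dynamic quantization: weights stored as INT8,
# activations quantized on the fly at inference time.
qmodel = torch.ao.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

def size_mb(m):
    buf = io.BytesIO()
    torch.save(m.state_dict(), buf)
    return buf.getbuffer().nbytes / 1e6

with torch.no_grad():
    print(qmodel(torch.randn(1, 768)).shape)   # inference still works
print(f"fp32 {size_mb(model):.2f} MB -> int8 {size_mb(qmodel):.2f} MB")
```
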
AI Search

SearchSphere

A standalone, AI-powered semantic search engine that runs locally on the user's machine. It matches the meaning of a query rather than literal keywords to find relevant documents and images.

MobileCLIP · FAISS · MobileBERT · Rich
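
A condensed sketch of the retrieval core: the embeddings below are random stand-ins for MobileCLIP/MobileBERT outputs, and the exact flat index is the simplest FAISS option, not necessarily the one SearchSphere uses.

```python
import numpy as np
import faiss

d = 512                                            # embedding dimension
docs = np.random.rand(1000, d).astype("float32")   # stand-in embeddings
faiss.normalize_L2(docs)        # unit vectors: inner product == cosine

index = faiss.IndexFlatIP(d)    # exact maximum-inner-product search
index.add(docs)

query = np.random.rand(1, d).astype("float32")     # stand-in query embedding
faiss.normalize_L2(query)
scores, ids = index.search(query, 5)               # top-5 nearest documents
print(ids[0], scores[0])
```
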
LLM Optimization

FastInference

High-performance inference engine for transformer models. Implements advanced techniques including KV-cache optimization, attention fusion, and dynamic batching. Achieves a 3x speedup over standard implementations while maintaining numerical stability.

C++ · CUDA · TensorRT · Transformers · Flash Attention
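
The engine itself is C++/CUDA; the PyTorch sketch below only illustrates the KV-cache idea (all names and shapes are assumptions): past keys and values are computed once and appended, so each decode step projects a single token.

```python
import torch
import torch.nn.functional as F

class KVCache:
    """Grows along the sequence axis so past K/V are computed once."""
    def __init__(self):
        self.k = self.v = None  # (batch, heads, seq, head_dim)

    def append(self, k_new, v_new):
        self.k = k_new if self.k is None else torch.cat([self.k, k_new], dim=2)
        self.v = v_new if self.v is None else torch.cat([self.v, v_new], dim=2)
        return self.k, self.v

b, h, dh = 1, 8, 64
cache = KVCache()
for step in range(4):
    # Only the newest token is projected each step; the cache supplies
    # the rest, keeping per-step work linear in sequence length.
    q = torch.randn(b, h, 1, dh)
    k, v = cache.append(torch.randn(b, h, 1, dh), torch.randn(b, h, 1, dh))
    out = F.scaled_dot_product_attention(q, k, v)  # attends over full cache
    print(step, out.shape, "cache len", k.shape[2])
```
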
Computer Vision

EfficientVision

Lightweight computer vision pipeline optimized for mobile deployment. Features neural architecture search, progressive training, and adaptive inference. Reduces model size by 80% while maintaining competitive accuracy on ImageNet classification.

PyTorch Mobile · Neural Architecture Search · Quantization · Mobile Optimization
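
Of the three techniques, adaptive inference is the easiest to sketch; below is a hypothetical early-exit wrapper (architecture and threshold are illustrative assumptions, not EfficientVision's design).

```python
import torch
import torch.nn as nn

class EarlyExitNet(nn.Module):
    """Adaptive inference: return the first exit head's prediction when
    it is confident enough, skipping the deeper (costlier) stages."""
    def __init__(self, num_classes=10, threshold=0.9):
        super().__init__()
        self.stage1 = nn.Sequential(nn.Conv2d(3, 16, 3, 2, 1), nn.ReLU())
        self.exit1 = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                   nn.Linear(16, num_classes))
        self.stage2 = nn.Sequential(nn.Conv2d(16, 32, 3, 2, 1), nn.ReLU())
        self.exit2 = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                   nn.Linear(32, num_classes))
        self.threshold = threshold

    @torch.no_grad()
    def forward(self, x):  # batch size 1 assumed for the confidence test
        h = self.stage1(x)
        logits = self.exit1(h)
        if logits.softmax(-1).max() >= self.threshold:
            return logits, "exit1"   # cheap path: confident early
        return self.exit2(self.stage2(h)), "exit2"

net = EarlyExitNet().eval()
print(net(torch.randn(1, 3, 64, 64))[1])
```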