Making GPUs go Brr...
AI Systems Engineer obsessed with performance
I optimize machine learning systems from kernel-level GPU code to distributed training pipelines.
Open to ML performance, systems, and GPU computing roles
Building personalized AI systems that can run locally on your computer
Selected Work
Projects that actually work in production (with benchmarks to prove it)
KernelLab
Hand-crafted GPU kernels that started as "how hard can matrix multiplication be?" and ended up as production-ready implementations with serious performance gains.
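For flavor, here's a minimal sketch of the kind of tiled matmul kernel the project grew out of, written in Triton rather than raw CUDA to keep it short. It's illustrative only (not KernelLab code) and assumes M, N, and K divide evenly by the block sizes, so boundary masking is skipped.

```python
import torch
import triton
import triton.language as tl

@triton.jit
def matmul_kernel(a_ptr, b_ptr, c_ptr, M, N, K,
                  stride_am, stride_ak, stride_bk, stride_bn,
                  stride_cm, stride_cn,
                  BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr, BLOCK_K: tl.constexpr):
    # Each program instance owns one BLOCK_M x BLOCK_N tile of C.
    pid_m, pid_n = tl.program_id(0), tl.program_id(1)
    offs_m = pid_m * BLOCK_M + tl.arange(0, BLOCK_M)
    offs_n = pid_n * BLOCK_N + tl.arange(0, BLOCK_N)
    offs_k = tl.arange(0, BLOCK_K)
    a_ptrs = a_ptr + offs_m[:, None] * stride_am + offs_k[None, :] * stride_ak
    b_ptrs = b_ptr + offs_k[:, None] * stride_bk + offs_n[None, :] * stride_bn
    acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float32)
    for _ in range(0, K, BLOCK_K):  # march down K one tile at a time
        acc += tl.dot(tl.load(a_ptrs), tl.load(b_ptrs))
        a_ptrs += BLOCK_K * stride_ak
        b_ptrs += BLOCK_K * stride_bk
    c_ptrs = c_ptr + offs_m[:, None] * stride_cm + offs_n[None, :] * stride_cn
    tl.store(c_ptrs, acc)

def matmul(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    M, K = a.shape
    _, N = b.shape
    c = torch.empty((M, N), device=a.device, dtype=torch.float32)
    matmul_kernel[(M // 64, N // 64)](a, b, c, M, N, K,
                                      a.stride(0), a.stride(1),
                                      b.stride(0), b.stride(1),
                                      c.stride(0), c.stride(1),
                                      BLOCK_M=64, BLOCK_N=64, BLOCK_K=32)
    return c
```

The serious performance gains come from everything this sketch leaves out: memory coalescing, software pipelining, swizzled tile layouts, and tensor-core paths.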
DistJax
A mini-library and collection of examples that simplify implementing common distributed training paradigms in JAX and Flax.
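As a taste of the simplest such paradigm, here's a plain-JAX data-parallel step of the kind the library wraps. This is not DistJax's API, just the pmap/pmean pattern it builds on, with a toy linear model standing in for a real network.

```python
from functools import partial
import jax
import jax.numpy as jnp

def loss_fn(params, x, y):
    pred = x @ params["w"] + params["b"]
    return jnp.mean((pred - y) ** 2)

@partial(jax.pmap, axis_name="batch")
def train_step(params, x, y):
    # Each device computes gradients on its own shard, then all-reduces them.
    loss, grads = jax.value_and_grad(loss_fn)(params, x, y)
    grads = jax.lax.pmean(grads, axis_name="batch")
    new_params = jax.tree_util.tree_map(lambda p, g: p - 1e-2 * g, params, grads)
    return new_params, jax.lax.pmean(loss, axis_name="batch")

n_dev = jax.local_device_count()
params = jax.device_put_replicated({"w": jnp.zeros((8, 1)), "b": jnp.zeros((1,))},
                                   jax.local_devices())
x = jnp.ones((n_dev, 32, 8))   # leading axis = one shard per device
y = jnp.ones((n_dev, 32, 1))
params, loss = train_step(params, x, y)
```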
TorchSSL
PyTorch library that makes self-supervised learning actually usable. Complete SSL training in 20 lines because life's too short for boilerplate.
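To give a sense of scale, this is roughly what a bare contrastive (SimCLR-style) step looks like in plain PyTorch. It's not TorchSSL's own API; the tiny linear encoder and random tensors are stand-ins for a real backbone and real augmentations.

```python
import torch
import torch.nn.functional as F

def nt_xent(z1, z2, temperature=0.5):
    # SimCLR-style loss: each view's positive is the other view of the same image.
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)                 # (2N, D)
    sim = z @ z.T / temperature
    n = z1.size(0)
    sim.masked_fill_(torch.eye(2 * n, dtype=torch.bool, device=z.device), float("-inf"))
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)]).to(z.device)
    return F.cross_entropy(sim, targets)

encoder = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 128))
opt = torch.optim.SGD(encoder.parameters(), lr=0.1)
view1, view2 = torch.randn(64, 3, 32, 32), torch.randn(64, 3, 32, 32)  # stand-in augmentations
loss = nt_xent(encoder(view1), encoder(view2))
opt.zero_grad(); loss.backward(); opt.step()
```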
RetinaSys
Diabetic retinopathy detection system optimized for edge deployment. Because healthcare AI shouldn't require a data center to run.
SearchSphere
A standalone, AI-powered semantic search engine that runs locally on a user's machine. It understands the meaning behind a user's query to find relevant documents and images.
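The core idea is embedding-based retrieval: encode everything once, then rank by cosine similarity. A minimal sketch below, which is not SearchSphere's actual pipeline and assumes the sentence-transformers library with a small model that runs locally.

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")      # small enough to run on CPU
docs = [
    "Quarterly GPU utilization report",
    "Sourdough starter feeding schedule",
    "Notes on CUDA memory coalescing",
]
doc_emb = model.encode(docs, convert_to_tensor=True, normalize_embeddings=True)

query = "how do I make my kernels read memory faster?"
q_emb = model.encode(query, convert_to_tensor=True, normalize_embeddings=True)
scores = util.cos_sim(q_emb, doc_emb)[0]             # cosine similarity against each document
print(docs[int(scores.argmax())])                    # -> the CUDA memory coalescing note
```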
Automated Diabetic Retinopathy Detection via Self-Supervised Learning - paper submitted for journal review. Developing end-to-end systems that actually help doctors instead of replacing them.
Latest Thoughts
Blog posts coming soon...
Beyond the Code
What I'm actually like when not profiling memory access patterns
🤫 Developer Confessions
- My main FOMO is my own project list – there's always something more interesting to build than what I'm currently working on
- I am a print debugger and proud of it – sophisticated tools are overrated when printf/print gets the job done
- Green dots on GitHub make me happy in ways that probably aren't healthy
- I hate abstraction and love to dig deep – give me low-level optimization over black-box magic any day
- Always over-caffeinated
⚡ Current Status
Interview preparation while simultaneously trying to convince myself I actually know what I'm doing. The imposter syndrome is real, but so are the technical skills.
Working on systems where performance actually matters, interesting problems need solving, and the coffee budget matches the compute budget.
Currently exploring open source frameworks – diving deep into vLLM, SGLang, and JAX. Because apparently my idea of relaxation is understanding how other people optimize tensor operations.
"The best code is not just efficient, it's understandable by the next developer who has to debug
it at 2
AM"
- Me, after being that developer too
many
times
Technical Arsenal
Tools I use to make systems go brrr
⚡ Performance Computing
- CUDA (the good, bad, and segfaulty)
- Triton (when CUDA's verbosity becomes a problem)
- WMMA (tensor cores go brrr)
- CuTe (cute name, serious performance)
- CUTLASS (NVIDIA's gift to humanity)
- OpenMP (parallel computing made simple)
🤖 ML Frameworks
- PyTorch (80% of my existence)
- JAX (functional programming supremacy)
- OpenVINO (edge deployment savior)
- TensorRT (NVIDIA optimization magic)
💻 Languages & Tools
- C++ (pointers and prayers)
- Python (life is short)
- C (when C++ feels too safe)
🎯 Domains & Specialties
- Self-Supervised Learning
- Large Language Models
- Computer Vision
- Model Optimization & Quantization
- Distributed Training
- Edge Deployment
Let's Talk Shop
Got slow models? Interesting optimization problems? Let's chat.
Currently Available
Open to opportunities in ML systems (MLSys)