Making GPUs go Brr...

AI Systems Engineer obsessed with performance

I optimize machine learning systems from kernel-level GPU code to distributed training pipelines.

Currently Available

Open to ML Performance, Systems and GPU computing roles

Current Obsession

Building personalized AI systems that can run locally on your computer

Selected Work

Projects that actually work in production (with benchmarks to prove it)

GPU Computing

KernelLab

Hand-crafted GPU kernels that started as "how hard can matrix multiplication be?" and ended up as production-ready implementations with serious performance gains.

Performance Wins
→ 90% of PyTorch/cuBLAS throughput achieved
→ Full FP32/FP16 precision support
→ Memory coalescing that actually works
CUDA, Triton, WMMA, Nsight

Distributed Systems

DistJax

DistJax is a mini-library and collection of examples designed to simplify the implementation of common distributed training paradigms in JAX and Flax.

Parallelism Strategies
→ Data Parallelism
→ Pipeline Parallelism
→ Tensor Parallelism
→ Async Tensor Parallelism with Transformer support
JAX, Flax

Self-Supervised Learning

TorchSSL

PyTorch library that makes self-supervised learning actually usable. Complete SSL training in 20 lines because life's too short for boilerplate.

Speed Improvements
→ 3-5x faster contrastive loss computation
→ Optimized kernels under the hood
→ Works on first try (revolutionary)
PyTorch, CUDA, ViT

Medical AI

RetinaSys

Diabetic retinopathy detection system optimized for edge deployment. Because healthcare AI shouldn't require a data center to run.

Clinical Results
→ 90.73% QWK, 90.85% AUC
→ 60% memory reduction via INT8
→ Runs on actual edge devices
DINOv2, OpenVINO, SHAP

AI Search

SearchSphere

A standalone, AI-powered semantic search engine that runs locally on a user's machine. It understands the meaning behind a user's query to find relevant documents and images.

MobileCLIP, FAISS, MobileBERT, Rich

🔬 Current Research

Automated Diabetic Retinopathy Detection via Self-Supervised Learning - Paper submitted for journal review. Developing end-to-end systems that actually help doctors instead of replacing them.

Latest Thoughts

Blog posts coming soon.

Beyond the Code

What I'm actually like when not profiling memory access patterns

🤫 Developer Confessions

  • My main FOMO is my own project list – there's always something more interesting to build than what I'm currently working on
  • I am a print debugger and proud of it – sophisticated tools are overrated when printf/print gets the job done
  • Green dots on GitHub make me happy in ways that probably aren't healthy
  • I hate abstraction and love to dig deep – give me low-level optimization over black-box magic any day
  • Always over-caffeinated

⚡ Current Status

Currently Debugging:

Interview preparation while simultaneously trying to convince myself I actually know what I'm doing. The imposter syndrome is real, but so are the technical skills.

Dream Job:

Working on systems where performance actually matters, interesting problems need solving, and the coffee budget matches the compute budget.

Weekend Project:

Currently exploring open source frameworks – diving deep into vLLM, SGLang, and JAX. Because apparently my idea of relaxation is understanding how other people optimize tensor operations.

"The best code is not just efficient, it's understandable by the next developer who has to debug it at 2 AM"

- Me, after being that developer too many times

Technical Arsenal

Tools I use to make systems go brrr

⚡ Performance Computing

  • CUDA (the good, bad, and segfaulty)
  • Triton (when verbosity becomes a problem)
  • WMMA (tensor cores go brrr)
  • CuTe (cute name, serious performance)
  • CUTLASS (NVIDIA's gift to humanity)
  • OpenMP (parallel computing made simple)

🤖 ML Frameworks

  • PyTorch (80% of my existence)
  • JAX (functional programming supremacy)
  • OpenVINO (edge deployment savior)
  • TensorRT (NVIDIA optimization magic)

💻 Languages & Tools

  • C++ (pointers and prayers)
  • Python (life is short)
  • C (when C++ feels too safe)

🎯 Domains & Specialties

  • Self-Supervised Learning
  • Large Language Models
  • Computer Vision
  • Model Optimization & Quantization
  • Distributed Training
  • Edge Deployment

Let's Talk Shop

Got slow models? Interesting optimization problems? Let's chat.

💻 GitHub: AmanSwar
🔗 LinkedIn: aman-swar
🌐 Website: amanswar.github.io

Currently Available

Open to opportunities in MLSys