Making GPUs go Brr...

AI Systems Engineer obsessed with performance

I optimize machine learning systems from kernel-level GPU code to distributed training pipelines.

Currently Available

Open to ML Performance, Systems and GPU computing roles

Current Obsession

Building personalized AI systems that can run locally on your computer

Selected Work

Projects that actually work in production (with benchmarks to prove it)

ML Performance

FastQwen3

Making Qwen3 Go Brr...

Performance Wins

→ Achieved a 9.25x+ inference speedup over the Hugging Face baseline

→ Implemented a KV cache, fused RMSNorm, RoPE, and a custom Flash Attention kernel supporting Grouped Query Attention, in both CUDA and Triton

→ Reduced 600-token inference time from 440s to 48s (saving about 6.5 minutes per request), averaging a 4.83x speedup below 600 tokens and a 13.85x speedup above 600 tokens

CUDA Triton PyTorch Flamegraph and NCU profiling

GitHub → Detailed View →
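
For readers unfamiliar with RMSNorm, one of the operations listed under FastQwen3, here is a minimal plain-Python reference of the math it computes. This is not the FastQwen3 kernel. The function name `rms_norm`, the `eps` default, and the toy inputs are assumptions made for illustration. A fused GPU version would typically do the reduction and the scaling in one pass over the data instead of separate operations.

```python
import math

def rms_norm(x, weight, eps=1e-6):
    """Reference RMSNorm: divide x by its root mean square, then apply a learned gain."""
    mean_sq = sum(v * v for v in x) / len(x)       # reduction over the hidden dimension
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)       # eps guards against division by zero
    return [v * inv_rms * w for v, w in zip(x, weight)]

# Toy hidden vector (hypothetical values): mean square is 6.25, so RMS is 2.5.
hidden = [3.0, -4.0, 0.0, 0.0]
gain = [1.0, 1.0, 1.0, 1.0]
print(rms_norm(hidden, gain))  # approximately [1.2, -1.6, 0.0, 0.0]
```

A reference like this is mostly useful as a correctness check: a custom kernel's output can be compared against it on small inputs before any benchmarking.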

GPU Computing

KernelLab

Hand-crafted GPU kernels that started as "how hard can matrix multiplication be?" and ended up as production-ready implementations with serious performance gains.

Performance Wins

→ 90% of PyTorch/cuBLAS throughput achieved

→ Full FP32/FP16 precision support

→ Memory coalescing that actually works

CUDA Triton WMMA Nsight

GitHub → Detailed View →
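
The idea behind most fast matrix multiplication kernels is tiling: work on small blocks so each loaded value is reused many times. The sketch below shows that loop structure in plain Python. It is not KernelLab code, and the name `matmul_tiled` and the `tile` parameter are assumptions for illustration. On a GPU the inner tile would live in shared memory or registers; here it only shows the access order.

```python
def matmul_tiled(a, b, tile=2):
    """Multiply a (n x k) by b (k x m) one tile at a time."""
    n, k, m = len(a), len(b), len(b[0])
    c = [[0.0] * m for _ in range(n)]
    for i0 in range(0, n, tile):              # tile rows of the output
        for j0 in range(0, m, tile):          # tile columns of the output
            for k0 in range(0, k, tile):      # walk along the shared dimension
                for i in range(i0, min(i0 + tile, n)):
                    for j in range(j0, min(j0 + tile, m)):
                        acc = c[i][j]
                        for kk in range(k0, min(k0 + tile, k)):
                            acc += a[i][kk] * b[kk][j]
                        c[i][j] = acc
    return c

a = [[1, 2, 3], [4, 5, 6]]
b = [[7, 8], [9, 10], [11, 12]]
print(matmul_tiled(a, b))  # [[58.0, 64.0], [139.0, 154.0]]
```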

Distributed Systems

DistJax

DistJax is a mini-library and collection of examples designed to simplify the implementation of common distributed training paradigms in JAX and Flax.

Parallelism Strategies

→ Data Parallelism

→ Pipeline Parallelism

→ Tensor Parallelism

→ Async Tensor Parallelism with Transformer support

JAX Flax

GitHub → Detailed View →
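
Data parallelism, the first strategy in the DistJax list, can be simulated without any devices. In the sketch below each "device" computes a gradient on its shard of the batch, and the results are averaged. This is a plain-Python illustration, not the DistJax API. The helper names `grad_mse` and `data_parallel_grad`, the one-parameter linear model, and the toy data are all assumptions. In JAX the averaging step would typically be a collective such as `jax.lax.pmean`.

```python
def grad_mse(w, xs, ys):
    """Gradient of mean squared error for the model y = w * x."""
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

def data_parallel_grad(w, xs, ys, num_devices):
    """Split the batch into equal shards, compute local gradients, then average them."""
    shard = len(xs) // num_devices  # assumes the batch divides evenly
    local_grads = [
        grad_mse(w, xs[d * shard:(d + 1) * shard], ys[d * shard:(d + 1) * shard])
        for d in range(num_devices)
    ]
    return sum(local_grads) / num_devices  # stands in for an all-reduce mean

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
print(grad_mse(0.5, xs, ys))               # full-batch gradient
print(data_parallel_grad(0.5, xs, ys, 2))  # same value from two shards
```

With equal shard sizes the averaged gradient matches the full-batch gradient, which is why data parallelism leaves the optimization unchanged while spreading the compute.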

Self-Supervised Learning

TorchSSL

PyTorch library that makes self-supervised learning actually usable. Complete SSL training in 20 lines because life's too short for boilerplate.

Speed Improvements

→ 3-5x faster contrastive loss computation

→ Optimized kernels under the hood

→ Works on first try (revolutionary)

PyTorch CUDA ViT

GitHub → Detailed View →
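
For context on the contrastive loss mentioned under TorchSSL, here is a slow but readable plain-Python reference of an InfoNCE-style loss. It is not TorchSSL code. The names `cosine` and `info_nce`, the temperature default, and the toy embeddings are assumptions for illustration. Each anchor should score highest against its own positive, and the other positives in the batch act as negatives.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def info_nce(anchors, positives, temperature=0.1):
    """Mean cross-entropy where anchors[i] should match positives[i]."""
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [cosine(anchor, p) / temperature for p in positives]
        peak = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_denominator - logits[i]
    return total / len(anchors)

views_a = [[1.0, 0.0], [0.0, 1.0]]
views_b = [[1.0, 0.0], [0.0, 1.0]]   # correctly paired with views_a
swapped = [[0.0, 1.0], [1.0, 0.0]]   # deliberately mismatched
print(info_nce(views_a, views_b))    # small loss
print(info_nce(views_a, swapped))    # large loss
```

The nested loop over pairs is the part an optimized implementation would replace with a single batched matrix product.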

Medical AI

RetinaSys

Diabetic retinopathy detection system optimized for edge deployment. Because healthcare AI shouldn't require a data center to run.

Clinical Results

→ 90.73% QWK, 90.85% AUC

→ 60% memory reduction via INT8

→ Runs on actual edge devices

DINOv2 OpenVINO SHAP

GitHub → Detailed View → Paper →
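
As background on the INT8 result above, the sketch below shows symmetric per-tensor quantization in plain Python. It is an illustration of the general technique, not the RetinaSys or OpenVINO pipeline. The function names and the toy weights are assumptions. Each float is mapped to an integer in [-127, 127] plus one shared scale factor.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization to integers in [-127, 127]."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale

def dequantize(quantized, scale):
    """Recover approximate floats from the integers and the shared scale."""
    return [q * scale for q in quantized]

weights = [0.42, -1.27, 0.05, 0.0]   # hypothetical weights
q, scale = quantize_int8(weights)
print(q)                             # [42, -127, 5, 0]
print(dequantize(q, scale))          # close to the original weights
```

An INT8 value takes 1 byte where an FP32 value takes 4, which is where the memory savings of quantized tensors come from. The overall reduction for a full model depends on which tensors are quantized.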

AI Search

SearchSphere

A standalone, AI-powered semantic search engine that runs locally on a user's machine. It understands the meaning behind a user's query to find relevant documents and images.

MobileCLIP FAISS MobileBERT Rich

GitHub → Detailed View →
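
The core of semantic search is ranking items by how close their embeddings are to the query embedding. Here is a brute-force plain-Python sketch of that step. It is not SearchSphere code: the file names and three-dimensional vectors are made-up stand-ins for real model embeddings, and `top_k` is a hypothetical helper. A library such as FAISS replaces the linear scan with an index that scales to far more items.

```python
import math

def normalize(vec):
    """Scale a non-zero vector to unit length so dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def top_k(query, index, k=2):
    """Return the names of the k items whose embeddings are most similar to the query."""
    q = normalize(query)
    scored = [
        (sum(a * b for a, b in zip(q, normalize(vec))), name)
        for name, vec in index.items()
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [name for _, name in scored[:k]]

# Toy embeddings (hypothetical): the first axis loosely means "about cats".
index = {
    "cat.jpg": [0.9, 0.1, 0.0],
    "invoice.pdf": [0.0, 0.2, 0.9],
    "kitten_notes.txt": [0.8, 0.3, 0.1],
}
print(top_k([1.0, 0.0, 0.0], index))  # ['cat.jpg', 'kitten_notes.txt']
```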

🔬 Current Research

Automated Diabetic Retinopathy Detection via Self-Supervised Learning - Paper submitted to journals. Developing end-to-end systems that actually help doctors instead of replacing them.

Latest Thoughts

Blog posts coming soon...

Beyond the Code

What I'm actually like when not profiling memory access patterns

🤫 Developer Confessions

My main FOMO is my own project list – there's always something more interesting to build than what I'm currently working on
I am a print debugger and proud of it – sophisticated tools are overrated when printf/print gets the job done
Green dots on GitHub make me happy in ways that probably aren't healthy
I hate abstraction and love to dig deep – give me low-level optimization over black box magic any day
Always over-caffeinated

⚡ Current Status

Currently Debugging:

Interview preparation while simultaneously trying to convince myself I actually know what I'm doing. The imposter syndrome is real, but so are the technical skills.

Dream Job:

Working on systems where performance actually matters, interesting problems need solving, and the coffee budget matches the compute budget.

Weekend Project:

Currently exploring open source frameworks – diving deep into vLLM, SGLang, and JAX. Because apparently my idea of relaxation is understanding how other people optimize tensor operations.

"The best code is not just efficient, it's understandable by the next developer who has to debug it at 2 AM"

- Me, after being that developer too many times

Technical Arsenal

Tools I use to make systems go brrr

⚡ Performance Computing

CUDA (the good, bad, and segfaulty)
Triton (when verbosity becomes a problem)
WMMA (tensor cores go brrr)
CuTe (cute name, serious performance)
CUTLASS (NVIDIA's gift to humanity)
OpenMP (parallel computing made simple)

🤖 ML Frameworks

PyTorch (80% of my existence)
JAX (functional programming supremacy)
OpenVINO (edge deployment savior)
TensorRT (NVIDIA optimization magic)

💻 Languages & Tools

C++ (pointers and prayers)
Python (life is short)
C (when C++ feels too safe)

🎯 Domains & Specialties

Self-Supervised Learning
Large Language Models
Computer Vision
Model Optimization & Quantization
Distributed Training
Edge Deployment

Let's Talk Shop

Got slow models? Interesting optimization problems? Let's chat.

📧

p.amanswar@gmail.com

💻

GitHub

AmanSwar

🔗

aman-swar

🌐

Website

amanswar.github.io

Currently Available

Open to opportunities in MLSys