Project Showcase

Exploring the intersection of AI, Systems, and Innovation

TorchSSL

A lean, high-performance PyTorch library for self-supervised learning, built from the ground up for speed and transparency.

📌 Key Features

  • Custom SSLDataLoader – directory-based image loader with pre-built augmentations (SimCLR, MoCo, DINO, I-JEPA).
  • Backbones & Compatibility – all convolution-based architectures are supported.
  • Modular Methods – SimCLR, MoCo, DINO, I-JEPA each in its own plug-and-play class.
  • Fused CUDA Kernels – NT-Xent loss implemented in custom CUDA for a 3–5× runtime speedup (see the sketch below).
  • Evaluation Suite – kNN & linear-probe pipelines to benchmark representation quality.
  • Visualization – Built-in WandB support & latent-space PCA/t-SNE plotting.
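
The fused NT-Xent bullet above refers to the standard SimCLR contrastive loss. As a rough illustration of the math such a kernel has to reproduce, here is a minimal CUDA sketch of one anchor row; it assumes a precomputed 2N×2N cosine-similarity matrix with the two views stacked as [z1; z2], and the kernel name and layout are hypothetical, not TorchSSL's actual fused implementation (which also fuses the similarity matmul itself).

```cuda
#include <cuda_runtime.h>
#include <cfloat>

// One block per anchor row i of the [2N x 2N] similarity matrix:
//   loss_i = -log( exp(s(i,pos)/tau) / sum_{k != i} exp(s(i,k)/tau) )
// computed with a numerically stable log-sum-exp. Launch with 256 threads.
__global__ void nt_xent_row(const float* sim, float* loss, int n2, float tau) {
    int i = blockIdx.x;                                // anchor index, n2 = 2N
    int pos = (i < n2 / 2) ? i + n2 / 2 : i - n2 / 2;  // positive: paired view
    const float* row = sim + (size_t)i * n2;
    __shared__ float red[256];                         // assumes blockDim.x == 256

    // 1) row max over k != i, for numerical stability
    float m = -FLT_MAX;
    for (int k = threadIdx.x; k < n2; k += blockDim.x)
        if (k != i) m = fmaxf(m, row[k] / tau);
    red[threadIdx.x] = m;
    __syncthreads();
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s)
            red[threadIdx.x] = fmaxf(red[threadIdx.x], red[threadIdx.x + s]);
        __syncthreads();
    }
    m = red[0];
    __syncthreads();

    // 2) denominator: sum over k != i of exp(s/tau - m)
    float acc = 0.f;
    for (int k = threadIdx.x; k < n2; k += blockDim.x)
        if (k != i) acc += expf(row[k] / tau - m);
    red[threadIdx.x] = acc;
    __syncthreads();
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s) red[threadIdx.x] += red[threadIdx.x + s];
        __syncthreads();
    }

    // 3) per-anchor loss; the mean over all 2N anchors is taken on the host
    if (threadIdx.x == 0)
        loss[i] = -(row[pos] / tau - m - logf(red[0]));
}
// launch: nt_xent_row<<<n2, 256>>>(d_sim, d_loss, n2, 0.5f);
```

The real win of fusing comes from never materializing the full similarity matrix in global memory; this sketch only shows the loss arithmetic that a fused kernel must reproduce.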

📣 Coming Soon

  • DINOv2, iBOT, VICReg, BYOL
  • Support for all Transformer-based models (ViT, Swin, DeiT)
  • Advanced evaluation suite (centered kNN, per-class probing)
  • TorchScript & JIT support
  • TorchSSL Playground on CIFAR-10, STL-10, ImageNet
Python, PyTorch, CUDA

KernelLab

KernelLab is a collection of highly optimized CUDA kernels designed for deep learning, high-performance computing (HPC), and general-purpose GPU acceleration. Each kernel includes multiple levels of optimization—from naïve implementations to shared memory, warp-level, vectorized, and tensor-core optimized versions.

C++, CUDA

📌 Implemented Kernels & Optimizations

🔹 Convolution Kernels

  • 2D Convolution (Conv2D)
    1️⃣ Naïve (Direct element-wise computation; sketched after this list)
    2️⃣ Tiled Shared Memory (Minimizing global memory access)
    3️⃣ Memory Coalescing (Optimized memory access patterns)
    4️⃣ Tensor Cores (Using WMMA for fused matrix multiplications)
  • 3D Convolution (Conv3D)
    1️⃣ Naïve
    2️⃣ Shared Memory (Minimizing redundant loads)
    3️⃣ Tiled (Reducing register pressure)
    4️⃣ Register Blocking (Efficient memory reuse via registers)
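
To ground the optimization ladder, here is a minimal sketch of what the level-1️⃣ naïve Conv2D baseline typically looks like. It is an illustrative assumption (single channel, valid padding, hypothetical names), not KernelLab's exact code:

```cuda
// Naïve 2D convolution: one thread per output pixel, every operand read
// straight from global memory (the baseline the tiled versions beat).
__global__ void conv2d_naive(const float* in, const float* kern, float* out,
                             int H, int W, int K /* filter size, odd */) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;   // output column
    int y = blockIdx.y * blockDim.y + threadIdx.y;   // output row
    int oH = H - K + 1, oW = W - K + 1;              // "valid" output size
    if (x >= oW || y >= oH) return;

    float acc = 0.f;
    for (int i = 0; i < K; ++i)                      // every thread re-reads
        for (int j = 0; j < K; ++j)                  // its overlapping window:
            acc += in[(y + i) * W + (x + j)] * kern[i * K + j];  // no reuse
    out[y * oW + x] = acc;
}
// launch: dim3 b(16, 16); dim3 g((oW + 15) / 16, (oH + 15) / 16);
```

Every thread re-reads its overlapping K×K input window from global memory; the tiled level stages each block's window into shared memory once, which is where most of its win over this baseline comes from.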
🔹 Matrix & Reduction Operations

  • Matrix Transpose
    1️⃣ Naïve (Direct row-column swaps)
    2️⃣ Shared Memory Tiling (Blocking memory accesses for fewer global loads)
    3️⃣ Memory Coalescing (Optimizing global memory writes for aligned access)
  • Matrix Multiplication (GEMM)
    1️⃣ Naïve (Row-major computation)
    2️⃣ Tiled (Using shared memory for efficient blocking)
    3️⃣ Register Blocking (Reducing register pressure & maximizing reuse)
    4️⃣ Warp-Level Tiling (Optimizing warp-level data exchange)
    5️⃣ Tensor Cores with WMMA (Using NVIDIA Tensor Cores for fused matrix ops)
  • Reduction Sum
    1️⃣ Naïve (Basic sequential reduction per thread block)
    2️⃣ Branchless Reduction (Avoiding thread divergence for performance gain)
    3️⃣ Warp-Level Reduction (Using shuffle intrinsics for direct register exchange; sketched after this list)
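
For the warp-level variant, the idea is that the 32 threads of a warp can exchange partial sums directly through registers via shuffle intrinsics, skipping shared memory within the warp entirely. A minimal sketch (function names are illustrative):

```cuda
#include <cuda_runtime.h>

// Reduce 32 lane values inside one warp using register-to-register
// shuffles; no shared memory and no __syncthreads() needed in the warp.
__device__ float warp_reduce_sum(float v) {
    for (int offset = 16; offset > 0; offset >>= 1)
        v += __shfl_down_sync(0xffffffffu, v, offset);
    return v;   // lane 0 holds the warp's total
}

// Block-wide sum built from warp reductions: each warp reduces its lanes,
// warp leaders stage results in shared memory, the first warp finishes.
// Assumes *out was zero-initialized before launch.
__global__ void reduce_sum(const float* in, float* out, int n) {
    __shared__ float warp_sums[32];                  // one slot per warp
    int tid = blockIdx.x * blockDim.x + threadIdx.x;

    float v = (tid < n) ? in[tid] : 0.f;
    v = warp_reduce_sum(v);

    if (threadIdx.x % 32 == 0)
        warp_sums[threadIdx.x / 32] = v;
    __syncthreads();

    if (threadIdx.x < 32) {                          // first warp finishes up
        int nwarps = (blockDim.x + 31) / 32;
        v = (threadIdx.x < nwarps) ? warp_sums[threadIdx.x] : 0.f;
        v = warp_reduce_sum(v);
        if (threadIdx.x == 0) atomicAdd(out, v);     // combine across blocks
    }
}
```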
🔹 Element-wise & Activation Functions

  • ReLU Activation
    1️⃣ Naïve (Basic element-wise ReLU application)
    2️⃣ Coalesced Memory Access (Optimized read/write for better bandwidth usage)
    3️⃣ Vectorized Execution (Processing multiple elements per thread using vector types like float4)
  • SoftMax Function
    1️⃣ Naïve (Computing exponentials & normalizing sequentially)
    2️⃣ Shared Memory Optimization (Avoiding redundant memory accesses)
    3️⃣ Block Tiling (Parallelizing exponentiation & normalization)
    4️⃣ Warp-Level Reduction (Efficient sum-reduction across warps)
    5️⃣ State-of-the-Art Optimization (Optimized numerical stability & memory efficiency)
  • Vector Addition
    1️⃣ Naïve (Thread-per-element)
    2️⃣ Shared Memory Optimization (Minimizing redundant memory loads)
    3️⃣ Tiled Execution (Using block-level parallelism for efficiency)
    4️⃣ Coalesced Memory Access (Optimizing memory loads for aligned access)
    5️⃣ Vectorized Computation (Using float4 for processing multiple elements per thread; sketched after this list)
    6️⃣ Multi-Element Processing (Reducing loop overhead for large arrays)
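
To make the vectorization level concrete: loading float4 instead of float moves 16 bytes per instruction, quartering the number of memory transactions for the same data. A minimal sketch (assumes n is a multiple of 4; a real kernel would add a scalar tail loop):

```cuda
// Vectorized vector addition: each thread processes four consecutive floats
// via a single float4 load/store pair, improving effective memory bandwidth.
__global__ void vec_add_float4(const float4* a, const float4* b,
                               float4* c, int n4 /* = n / 4 */) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n4) return;
    float4 x = a[i], y = b[i];            // one 16-byte load each
    c[i] = make_float4(x.x + y.x, x.y + y.y, x.z + y.z, x.w + y.w);
}
```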
🔹 Image Processing Kernels

  • Greyscale Conversion
    1️⃣ Naïve (Direct pixel-wise computation)
    2️⃣ Shared Memory Optimization (Reducing redundant loads per thread block)
    3️⃣ Memory Coalescing (Ensuring aligned memory accesses for better bandwidth)
    4️⃣ Vectorized Computation (uchar4 processing per thread)
    5️⃣ Multi-Pixel Processing (Parallel processing of multiple pixels per thread)
    6️⃣ Fused Multiply-Add (FMA) Optimization (Combining operations for fewer instructions; see the sketch below)
  • Image Blurring
    1️⃣ Naïve (Basic kernel filter computation per pixel)
    2️⃣ Optimized Shared Memory Tiling (Minimizing global memory accesses by loading tiles into shared memory)
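
Combining levels 4️⃣ and 6️⃣, a greyscale kernel can read a whole pixel as one uchar4 and fold the weighted sum into fused multiply-adds. A hedged sketch (the Rec. 601 luma weights are standard; the RGBA byte order and names are illustrative assumptions):

```cuda
// Greyscale via uchar4 + FMA: one thread per RGBA pixel, the luma dot
// product expressed as two fmaf instructions instead of separate mul/add.
__global__ void rgba_to_grey(const uchar4* rgba, unsigned char* grey, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    uchar4 p = rgba[i];                      // coalesced 4-byte load
    float y = fmaf(0.299f, (float)p.x,       // Rec. 601 luma:
              fmaf(0.587f, (float)p.y,       //   0.299 R + 0.587 G + 0.114 B
                   0.114f * (float)p.z));
    grey[i] = (unsigned char)(y + 0.5f);     // round to nearest
}
```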

📝 Currently Implementing / TODO & Future Plans

  • Self-Attention CUDA Kernel
  • Flash Attention Kernel Optimization
  • LeakyReLU Kernel
  • Layer Normalization CUDA Kernel
  • FFT, BFS, DFS, and Sorting CUDA Implementations

RetinaSys

A system for the detection of diabetic retinopathy designed to close key gaps in existing approaches: beyond a new model, RetinaSys integrates explainable AI (xAI) for interpretable decision-making, generalizes across diverse demographics through robust training on varied datasets, and is optimized for real-time deployment in resource-constrained environments.

Implemented and compared multiple self-supervised learning methods, including:

  • SimCLR
  • BYOL
  • DINO
  • iBOT
  • I-JEPA
  • MoCo
  • Supervised Contrastive Learning

Adapted and customized various vision backbones to fit each SSL method, such as:

  • Vision Transformers
  • Shifted Window Transformers (Swin)
  • ConvNeXt

Applied several advanced modules on top of the SSL-pretrained backbone as the classifier, such as:

  • Convolutional Block Attention Module
  • Grade Consistency Module
  • Gradient Reversal
  • A custom loss function, OrdinalDomainLoss, that combines all of the above

C++, CUDA

SearchSphere

The Windows search bar sucks: you can't find a file unless you know its exact name, which we often don't. Hence SearchSphere: a multimodal search engine where you can search both images and documents using natural language.

Currently supported file types:

  • Word Documents
  • PDF
  • PPT
  • Text Files
  • Markdown Files
  • JPEG / JPG
  • PNG
C++, CUDA

UNDER ACTIVE DEVELOPMENT

EANNS (Enhanced Approximate Nearest Neighbor Search)

EANNS is a high-performance, hybrid vector database designed for real-time, scalable, metadata-aware vector search. Unlike traditional ANN solutions such as FAISS, Milvus, or Weaviate, it is optimized for both speed and persistence, leveraging RAM for ultra-fast queries and disk storage for long-term scalability.

C++, CUDA, OpenMP

Key Features

  • Hybrid Storage → RAM (fast retrieval) + Disk (persistent storage)
  • Hybrid Search → Combines vector similarity search with structured metadata filtering
  • Real-Time Updates → Supports dynamic indexing without full reindexing
  • Optimized with CUDA & OpenMP → Extreme speed via parallel computing
  • Redis-Style Simplicity → Minimal setup, easy-to-use API, open-source scalability

Built in C++ with CUDA and OpenMP, EANNS targets high-performance AI, search, and recommendation systems, aiming for greater efficiency and flexibility than existing vector search solutions. 🚀

🔥 Currently Developing

  • Core vector storage (RAM/Disk hybrid)
  • Brute-force search (Flat Index)
  • Efficient indexing with SIMD & parallelism
  • IVF, HNSW, and PQ-based search (WIP)
  • Real-time indexing & metadata filtering (WIP)
  • CUDA-accelerated ANN search (Coming soon)

⚡ Key Features (Planned)

1. Flexible Storage & Search
  • Multiple Index Types:
    • SpaceFlat → Brute-force search (like IndexFlatL2; see the sketch at the end of this section).
    • SpaceCluster → Cluster-based search (like IndexIVF).
    • SpaceGraph → Graph-based search (like IndexHNSW).
    • SpaceQuantize → Compressed search (like IndexPQ).
  • Hybrid Storage: RAM (fast access) & Disk (persistent storage).
2. Optimized for Speed & Scale
  • CUDA-accelerated search (for GPU compute).
  • OpenMP for multi-threaded query execution.
  • SIMD-powered vectorized computations.
3. Metadata-Aware Hybrid Search
  • Supports metadata filtering alongside vector similarity.
  • Key-value store for fast lookup & hybrid queries.
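
To make the SpaceFlat + metadata-filtering idea concrete, here is a minimal sketch of a brute-force scan that fuses squared-L2 distance computation with a per-vector metadata bitmask. It is an illustrative assumption of how such a kernel could look, not EANNS's actual code; all names are hypothetical.

```cuda
#include <cuda_runtime.h>
#include <cfloat>

// Brute-force (SpaceFlat-style) scan fused with metadata filtering:
// one thread per database vector computes the squared L2 distance to
// the query; vectors whose metadata mask is 0 are pushed to +inf so a
// later argmin/top-k pass never selects them.
__global__ void flat_scan_filtered(const float* db,            // [n x d] vectors
                                   const float* query,         // [d]
                                   const unsigned char* allow, // [n] 0/1 mask
                                   float* dist,                // [n] output
                                   int n, int d) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if (!allow[i]) { dist[i] = FLT_MAX; return; }   // metadata pre-filter

    const float* v = db + (size_t)i * d;
    float acc = 0.f;
    for (int j = 0; j < d; ++j) {
        float diff = v[j] - query[j];
        acc = fmaf(diff, diff, acc);                // fused multiply-add
    }
    dist[i] = acc;                                  // squared L2 distance
}
// Host side, a top-k selection pass (e.g. a sort or a reduction kernel)
// would pick the nearest allowed vectors from dist[].
```

A CPU path could presumably mirror the same per-vector loop with an OpenMP `#pragma omp parallel for`, which is where the RAM/disk hybrid tier would plug in.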