Project Showcase

Exploring the intersection of AI, Systems, and Innovation

TorchSSL

A lean, high-performance PyTorch library for self-supervised learning, built from the ground up for speed and transparency.

📌 Key Features

  • Custom SSLDataLoader – directory-based image loader with pre-built augmentations (SimCLR, MoCo, DINO, I-JEPA).
  • Backbones & Compatibility – all convolution-based architectures are supported.
  • Modular Methods – SimCLR, MoCo, DINO, I-JEPA each in its own plug-and-play class.
  • Fused CUDA Kernels – NT-Xent loss implemented in custom CUDA for a 3–5× runtime speedup (see the sketch below).
  • Evaluation Suite – kNN & linear-probe pipelines to benchmark representation quality.
  • Visualization – Built-in WandB support & latent-space PCA/t-SNE plotting.
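
The fused NT-Xent bullet above refers to the standard SimCLR contrastive loss. As a rough illustration of the math such a kernel has to reproduce, here is a minimal CUDA sketch of one anchor row; it assumes a precomputed 2N×2N cosine-similarity matrix with the two views stacked as [z1; z2], and the kernel name and layout are hypothetical, not TorchSSL's actual fused implementation (which also fuses the similarity matmul itself).

```cuda
#include <cuda_runtime.h>
#include <cfloat>

// One block per anchor row i of the [2N x 2N] similarity matrix:
//   loss_i = -log( exp(s(i,pos)/tau) / sum_{k != i} exp(s(i,k)/tau) )
// computed with a numerically stable log-sum-exp. Launch with 256 threads.
__global__ void nt_xent_row(const float* sim, float* loss, int n2, float tau) {
    int i = blockIdx.x;                                // anchor index, n2 = 2N
    int pos = (i < n2 / 2) ? i + n2 / 2 : i - n2 / 2;  // positive: paired view
    const float* row = sim + (size_t)i * n2;
    __shared__ float red[256];                         // assumes blockDim.x == 256

    // 1) row max over k != i, for numerical stability
    float m = -FLT_MAX;
    for (int k = threadIdx.x; k < n2; k += blockDim.x)
        if (k != i) m = fmaxf(m, row[k] / tau);
    red[threadIdx.x] = m;
    __syncthreads();
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s)
            red[threadIdx.x] = fmaxf(red[threadIdx.x], red[threadIdx.x + s]);
        __syncthreads();
    }
    m = red[0];
    __syncthreads();

    // 2) denominator: sum over k != i of exp(s/tau - m)
    float acc = 0.f;
    for (int k = threadIdx.x; k < n2; k += blockDim.x)
        if (k != i) acc += expf(row[k] / tau - m);
    red[threadIdx.x] = acc;
    __syncthreads();
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s) red[threadIdx.x] += red[threadIdx.x + s];
        __syncthreads();
    }

    // 3) per-anchor loss; the mean over all 2N anchors is taken on the host
    if (threadIdx.x == 0)
        loss[i] = -(row[pos] / tau - m - logf(red[0]));
}
// launch: nt_xent_row<<<n2, 256>>>(d_sim, d_loss, n2, 0.5f);
```

The real win of fusing comes from never materializing the full similarity matrix in global memory; this sketch only shows the loss arithmetic that a fused kernel must reproduce.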

📣 Coming Soon

  • DINOv2, iBOT, VICReg, BYOL
  • Support for all Transformer-based models (ViT, Swin, DeiT)
  • Advanced evaluation suite (centered kNN, per-class probing)
  • TorchScript & JIT support
  • TorchSSL Playground on CIFAR-10, STL-10, ImageNet
Python, PyTorch, CUDA

KernelLab

KernelLab is a collection of highly optimized CUDA kernels designed for deep learning, high-performance computing (HPC), and general-purpose GPU acceleration. Each kernel includes multiple levels of optimization—from naïve implementations to shared memory, warp-level, vectorized, and tensor-core optimized versions.

C++, CUDA

📌 Implemented Kernels & Optimizations

🔹 Convolution Kernels

  • 2D Convolution (Conv2D)
    1️⃣ Naïve (Direct element-wise computation; sketched after this list)
    2️⃣ Tiled Shared Memory (Minimizing global memory access)
    3️⃣ Memory Coalescing (Optimized memory access patterns)
    4️⃣ Tensor Cores (Using WMMA for fused matrix multiplications)
  • 3D Convolution (Conv3D)
    1️⃣ Naïve
    2️⃣ Shared Memory (Minimizing redundant loads)
    3️⃣ Tiled (Reducing register pressure)
    4️⃣ Register Blocking (Efficient memory reuse via registers)
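
To ground the optimization ladder, here is a minimal sketch of what the level-1️⃣ naïve Conv2D baseline typically looks like. It is an illustrative assumption (single channel, valid padding, hypothetical names), not KernelLab's exact code:

```cuda
// Naïve 2D convolution: one thread per output pixel, every operand read
// straight from global memory (the baseline the tiled versions beat).
__global__ void conv2d_naive(const float* in, const float* kern, float* out,
                             int H, int W, int K /* filter size, odd */) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;   // output column
    int y = blockIdx.y * blockDim.y + threadIdx.y;   // output row
    int oH = H - K + 1, oW = W - K + 1;              // "valid" output size
    if (x >= oW || y >= oH) return;

    float acc = 0.f;
    for (int i = 0; i < K; ++i)                      // every thread re-reads
        for (int j = 0; j < K; ++j)                  // its overlapping window:
            acc += in[(y + i) * W + (x + j)] * kern[i * K + j];  // no reuse
    out[y * oW + x] = acc;
}
// launch: dim3 b(16, 16); dim3 g((oW + 15) / 16, (oH + 15) / 16);
```

Every thread re-reads its overlapping K×K input window from global memory; the tiled level stages each block's window into shared memory once, which is where most of its win over this baseline comes from.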
🔹 Matrix & Reduction Operations

  • Matrix Transpose
    1️⃣ Naïve (Direct row-column swaps)
    2️⃣ Shared Memory Tiling (Blocking memory accesses for fewer global loads)
    3️⃣ Memory Coalescing (Optimizing global memory writes for aligned access)
  • Matrix Multiplication (GEMM)
    1️⃣ Naïve (Row-major computation)
    2️⃣ Tiled (Using shared memory for efficient blocking)
    3️⃣ Register Blocking (Reducing register pressure & maximizing reuse)
    4️⃣ Warp-Level Tiling (Optimizing warp-level data exchange)
    5️⃣ Tensor Cores with WMMA (Using NVIDIA Tensor Cores for fused matrix ops)
  • Reduction Sum
    1️⃣ Naïve (Basic sequential reduction per thread block)
    2️⃣ Branchless Reduction (Avoiding thread divergence for performance gain)
    3️⃣ Warp-Level Reduction (Using shuffle intrinsics for direct register exchange; sketched after this list)
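
For the warp-level variant, the idea is that the 32 threads of a warp can exchange partial sums directly through registers via shuffle intrinsics, skipping shared memory within the warp entirely. A minimal sketch (function names are illustrative):

```cuda
#include <cuda_runtime.h>

// Reduce 32 lane values inside one warp using register-to-register
// shuffles; no shared memory and no __syncthreads() needed in the warp.
__device__ float warp_reduce_sum(float v) {
    for (int offset = 16; offset > 0; offset >>= 1)
        v += __shfl_down_sync(0xffffffffu, v, offset);
    return v;   // lane 0 holds the warp's total
}

// Block-wide sum built from warp reductions: each warp reduces its lanes,
// warp leaders stage results in shared memory, the first warp finishes.
// Assumes *out was zero-initialized before launch.
__global__ void reduce_sum(const float* in, float* out, int n) {
    __shared__ float warp_sums[32];                  // one slot per warp
    int tid = blockIdx.x * blockDim.x + threadIdx.x;

    float v = (tid < n) ? in[tid] : 0.f;
    v = warp_reduce_sum(v);

    if (threadIdx.x % 32 == 0)
        warp_sums[threadIdx.x / 32] = v;
    __syncthreads();

    if (threadIdx.x < 32) {                          // first warp finishes up
        int nwarps = (blockDim.x + 31) / 32;
        v = (threadIdx.x < nwarps) ? warp_sums[threadIdx.x] : 0.f;
        v = warp_reduce_sum(v);
        if (threadIdx.x == 0) atomicAdd(out, v);     // combine across blocks
    }
}
```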
🔹 Element-wise & Activation Functions

  • ReLU Activation
    1️⃣ Naïve (Basic element-wise ReLU application)
    2️⃣ Coalesced Memory Access (Optimized read/write for better bandwidth usage)
    3️⃣ Vectorized Execution (Processing multiple elements per thread using vector types like float4)
  • SoftMax Function
    1️⃣ Naïve (Computing exponentials & normalizing sequentially)
    2️⃣ Shared Memory Optimization (Avoiding redundant memory accesses)
    3️⃣ Block Tiling (Parallelizing exponentiation & normalization)
    4️⃣ Warp-Level Reduction (Efficient sum-reduction across warps)
    5️⃣ State-of-the-Art Optimization (Optimized numerical stability & memory efficiency)
  • Vector Addition
    1️⃣ Naïve (Thread-per-element)
    2️⃣ Shared Memory Optimization (Minimizing redundant memory loads)
    3️⃣ Tiled Execution (Using block-level parallelism for efficiency)
    4️⃣ Coalesced Memory Access (Optimizing memory loads for aligned access)
    5️⃣ Vectorized Computation (Using float4 for processing multiple elements per thread; sketched after this list)
    6️⃣ Multi-Element Processing (Reducing loop overhead for large arrays)
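
To make the vectorization level concrete: loading float4 instead of float moves 16 bytes per instruction, quartering the number of memory transactions for the same data. A minimal sketch (assumes n is a multiple of 4; a real kernel would add a scalar tail loop):

```cuda
// Vectorized vector addition: each thread processes four consecutive floats
// via a single float4 load/store pair, improving effective memory bandwidth.
__global__ void vec_add_float4(const float4* a, const float4* b,
                               float4* c, int n4 /* = n / 4 */) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n4) return;
    float4 x = a[i], y = b[i];            // one 16-byte load each
    c[i] = make_float4(x.x + y.x, x.y + y.y, x.z + y.z, x.w + y.w);
}
```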
🔹 Image Processing Kernels

  • Greyscale Conversion
    1️⃣ Naïve (Direct pixel-wise computation)
    2️⃣ Shared Memory Optimization (Reducing redundant loads per thread block)
    3️⃣ Memory Coalescing (Ensuring aligned memory accesses for better bandwidth)
    4️⃣ Vectorized Computation (uchar4 processing per thread)
    5️⃣ Multi-Pixel Processing (Parallel processing of multiple pixels per thread)
    6️⃣ Fused Multiply-Add (FMA) Optimization (Combining operations for fewer instructions; see the sketch below)
  • Image Blurring
    1️⃣ Naïve (Basic kernel filter computation per pixel)
    2️⃣ Optimized Shared Memory Tiling (Minimizing global memory accesses by loading tiles into shared memory)
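
Combining levels 4️⃣ and 6️⃣, a greyscale kernel can read a whole pixel as one uchar4 and fold the weighted sum into fused multiply-adds. A hedged sketch (the Rec. 601 luma weights are standard; the RGBA byte order and names are illustrative assumptions):

```cuda
// Greyscale via uchar4 + FMA: one thread per RGBA pixel, the luma dot
// product expressed as two fmaf instructions instead of separate mul/add.
__global__ void rgba_to_grey(const uchar4* rgba, unsigned char* grey, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    uchar4 p = rgba[i];                      // coalesced 4-byte load
    float y = fmaf(0.299f, (float)p.x,       // Rec. 601 luma:
              fmaf(0.587f, (float)p.y,       //   0.299 R + 0.587 G + 0.114 B
                   0.114f * (float)p.z));
    grey[i] = (unsigned char)(y + 0.5f);     // round to nearest
}
```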

📝 Currently Implementing / TODO & Future Plans

  • Self-Attention CUDA Kernel
  • Flash Attention Kernel Optimization
  • LeakyReLU Kernel
  • Layer Normalization CUDA Kernel
  • FFT, BFS, DFS, and Sorting CUDA Implementations

RetinaSys

A system for the detection of diabetic retinopathy designed to close key gaps in existing approaches: beyond a new model, RetinaSys integrates explainable AI (xAI) for interpretable decision-making, generalizes across diverse demographics through robust training on varied datasets, and is optimized for real-time deployment in resource-constrained environments.

Implemented and compared multiple self-supervised learning methods, including:

  • SimCLR
  • BYOL
  • DINO
  • iBOT
  • I-JEPA
  • MoCo
  • Supervised Contrastive Learning

Adapted and customized various vision backbones to fit each SSL method, such as:

  • Vision Transformers
  • Shifted Window Transformers (Swin)
  • ConvNeXt

Applied several advanced modules on top of the SSL-pretrained backbone as the classifier, such as:

  • Convolutional Block Attention Module
  • Grade Consistency Module
  • Gradient Reversal
  • A custom loss function, OrdinalDomainLoss, that combines all of the above

C++, CUDA

SearchSphere

The Windows search bar sucks: you can't find a file unless you know its exact name, which we often don't. Hence SearchSphere: a multimodal search engine where you can search both images and documents using natural language.

Currently supported file types:

  • Word Documents
  • PDF
  • PPT
  • Text Files
  • Markdown Files
  • JPEG / JPG
  • PNG
C++, CUDA

UNDER ACTIVE DEVELOPMENT

EANNS (Enhanced Approximate Nearest Neighbor Search)

EANNS is a high-performance, hybrid vector database designed for real-time, scalable, metadata-aware vector search. Unlike traditional ANN solutions such as FAISS, Milvus, or Weaviate, it is optimized for both speed and persistence, leveraging RAM for ultra-fast queries and disk storage for long-term scalability.

C++, CUDA, OpenMP

Key Features

  • Hybrid Storage → RAM (fast retrieval) + Disk (persistent storage)
  • Hybrid Search → Combines vector similarity search with structured metadata filtering
  • Real-Time Updates → Supports dynamic indexing without full reindexing
  • Optimized with CUDA & OpenMP → Extreme speed via parallel computing
  • Redis-Style Simplicity → Minimal setup, easy-to-use API, open-source scalability

Built in C++ with CUDA and OpenMP, EANNS targets high-performance AI, search, and recommendation systems, aiming for greater efficiency and flexibility than existing vector search solutions. 🚀

🔥 Currently Developing

  • Core vector storage (RAM/Disk hybrid)
  • Brute-force search (Flat Index)
  • Efficient indexing with SIMD & parallelism
  • IVF, HNSW, and PQ-based search (WIP)
  • Real-time indexing & metadata filtering (WIP)
  • CUDA-accelerated ANN search (Coming soon)

⚡ Key Features (Planned)

1. Flexible Storage & Search
  • Multiple Index Types:
    • SpaceFlat → Brute-force search (like IndexFlatL2; see the sketch at the end of this section).
    • SpaceCluster → Cluster-based search (like IndexIVF).
    • SpaceGraph → Graph-based search (like IndexHNSW).
    • SpaceQuantize → Compressed search (like IndexPQ).
  • Hybrid Storage: RAM (fast access) & Disk (persistent storage).
2. Optimized for Speed & Scale
  • CUDA-accelerated search (for GPU compute).
  • OpenMP for multi-threaded query execution.
  • SIMD-powered vectorized computations.
3. Metadata-Aware Hybrid Search
  • Supports metadata filtering alongside vector similarity.
  • Key-value store for fast lookup & hybrid queries.
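
To make the SpaceFlat + metadata-filtering idea concrete, here is a minimal sketch of a brute-force scan that fuses squared-L2 distance computation with a per-vector metadata bitmask. It is an illustrative assumption of how such a kernel could look, not EANNS's actual code; all names are hypothetical.

```cuda
#include <cuda_runtime.h>
#include <cfloat>

// Brute-force (SpaceFlat-style) scan fused with metadata filtering:
// one thread per database vector computes the squared L2 distance to
// the query; vectors whose metadata mask is 0 are pushed to +inf so a
// later argmin/top-k pass never selects them.
__global__ void flat_scan_filtered(const float* db,            // [n x d] vectors
                                   const float* query,         // [d]
                                   const unsigned char* allow, // [n] 0/1 mask
                                   float* dist,                // [n] output
                                   int n, int d) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if (!allow[i]) { dist[i] = FLT_MAX; return; }   // metadata pre-filter

    const float* v = db + (size_t)i * d;
    float acc = 0.f;
    for (int j = 0; j < d; ++j) {
        float diff = v[j] - query[j];
        acc = fmaf(diff, diff, acc);                // fused multiply-add
    }
    dist[i] = acc;                                  // squared L2 distance
}
// Host side, a top-k selection pass (e.g. a sort or a reduction kernel)
// would pick the nearest allowed vectors from dist[].
```

A CPU path could presumably mirror the same per-vector loop with an OpenMP `#pragma omp parallel for`, which is where the RAM/disk hybrid tier would plug in.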