Search Sphere - Technical Documentation
1. Introduction
Search Sphere is a standalone, AI-powered semantic search engine that runs locally on a user's machine. It is designed to overcome the limitations of traditional keyword-based file search by understanding the meaning behind a user's query. By leveraging state-of-the-art machine learning models, it can find relevant documents and images even if the query's keywords are not present in the file's name or content.
This document provides a detailed technical overview of the system's architecture, pipelines, and core components.
2. System Architecture
The application is divided into two main packages: encoder and query, which work together to provide the core functionality. The run.py script serves as the main entry point, orchestrating the overall workflow and managing the user interface.
run.py(Main Application Runner):- Handles all user interaction through a CLI built with the
Richlibrary. - Initializes the application, including a startup sequence and welcome message.
- Prompts the user for a directory to index.
- Calls the
encodermodule to perform the indexing. - Calls the
querymodule to handle the interactive search loop.
- Handles all user interaction through a CLI built with the
encoderPackage (The Indexer):main_seq.py: Contains the primary logic for the indexing pipeline. It traverses the file system, orchestrates content extraction and embedding generation, and saves the final index.embedding.py: A critical module that uses the MobileCLIP model to convert text strings and image files into 512-dimensional vector embeddings.faiss_base.py: A wrapper class,FAISSManagerHNSW, that abstracts away the complexity of managing the FAISS vector indexes. It handles adding new vectors, training the index (if necessary), and saving/loading the index from disk.utils.py: A set of helper functions for tasks like extracting text from various file formats (.pdf,.docx, etc.) and retrieving file metadata.
queryPackage (The Searcher):query.py: The heart of the search functionality. It takes a user's raw query, uses theutilsmodule to determine the query's intent, generates a query embedding, and performs the search against the FAISS index.utils.py: Contains theindex_tokenfunction, which loads the fine-tuned MobileBERT model to classify the user's query as eitherTEXTorIMAGE.fine_tune.py: A utility script (not part of the main application flow) used to train the MobileBERT model for the intent classification task.
3. The Indexing Pipeline
The indexing process is a sequential flow that builds the knowledge base for the search engine. This is handled by encoder/main_seq.py.
- Initialization: The
FAISSManagerHNSWis initialized, and any existing FAISS indexes (index/text_index.index,index/image_index.index) and metadata files are loaded into memory. - File Traversal: The application walks through the user-specified directory (
os.walk) and compiles a list of all files that match the supported extensions (defined inencoder/config.py). - Content Processing Loop: Each file is processed one by one:
- Content Extraction: Based on the file extension, the appropriate function from
encoder/utils.pyis called to extract the content. For text files, this is the raw text; for images, it is the file path. - Embedding Generation: The extracted content (or image path) is passed to the
embedding.pymodule. TheMobileCLIPmodel then generates a 512-dimension floating-point vector. - Temporary Storage: The generated embedding and its associated metadata (file path, name) are temporarily stored in a list within the
FAISSManagerHNSWinstance.
- Content Extraction: Based on the file extension, the appropriate function from
- Batch Indexing: Once all files have been processed, the
train_addmethod of theFAISSManagerHNSWis called. This method stacks all the temporarily stored embeddings into a single NumPy array and adds them to the FAISS index. TheHNSWFlatindex type is highly efficient for this, as it does not require an explicit training step like other FAISS indexes (e.g., IVF). - Save to Disk: Finally, the
save_statemethod is called to write the updated FAISS indexes and metadata JSON files to theindex/directory, ensuring persistence between sessions.
4. The Search Pipeline
The search process is designed to be fast and accurate, providing relevant results in real-time. This is handled by query/query.py.
- User Input: The user enters a natural language query into the CLI.
- Query Intent Classification:
- The query string is passed to the
index_tokenfunction inquery/utils.py. - This function uses a fine-tuned
MobileBERTmodel to perform a sequence classification task. The model outputs a label, eitherTEXTorIMAGE, based on what it believes the user is looking for.
- The query string is passed to the
- Query Embedding:
- The same query string is passed to the
text_extractfunction inencoder/embedding.py. - The
MobileCLIPmodel converts the query into a 512-dimensional vector embedding. This is the same model used for indexing, which is crucial for ensuring that the query and the indexed content exist in the same "embedding space."
- The same query string is passed to the
- Similarity Search:
- Based on the intent (
TEXTorIMAGE), the appropriate search method (search_textorsearch_image) inFAISSManagerHNSWis called. - FAISS then performs a k-Nearest Neighbor (k-NN) search on the corresponding vector index. It calculates the distance between the query vector and all the vectors in the index and returns the
kvectors with the smallest distance (i.e., the highest similarity).
- Based on the intent (
- Results Presentation:
- The search returns the IDs and similarity scores of the top matching items.
- The application looks up the metadata (file name and path) for these IDs from the loaded metadata map.
- The final results are displayed to the user in a formatted table using the
Richlibrary.
5. Key Technologies and Rationale
- MobileCLIP: Chosen for its excellent balance of performance and efficiency. It is designed for mobile and edge devices, making it lightweight enough to run quickly on a local machine while still providing high-quality, cross-modal embeddings for both text and images.
- FAISS (HNSWFlat): The
HNSW (Hierarchical Navigable Small World)graph-based index was chosen for its exceptional search speed and accuracy, especially for large datasets. UnlikeIVFindexes,HNSWdoes not require a separate training phase, which makes it ideal for this application where new files can be added dynamically. - MobileBERT: A compact and fast variant of the BERT model. It was chosen for the query classification task because it is lightweight and provides excellent performance on classification tasks without introducing significant latency into the search pipeline.
- Rich: This library was chosen to create a modern, visually appealing, and user-friendly CLI. It provides out-of-the-box components for progress bars, formatted tables, and styled text, which greatly enhances the overall user experience compared to a standard print-based interface.