Search Sphere - Technical Documentation

1. Introduction

Search Sphere is a standalone, AI-powered semantic search engine that runs locally on a user's machine. It is designed to overcome the limitations of traditional keyword-based file search by understanding the meaning behind a user's query. By leveraging state-of-the-art machine learning models, it can find relevant documents and images even if the query's keywords are not present in the file's name or content.

This document provides a detailed technical overview of the system's architecture, pipelines, and core components.

2. System Architecture

The application is divided into two main packages, encoder and query: the encoder package builds and persists the vector indexes, while the query package classifies query intent and retrieves results. The run.py script serves as the main entry point, orchestrating the overall workflow and managing the user interface; a rough sketch of this orchestration follows.
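
As an illustration only, run.py's orchestration might look like the sketch below. The menu text and the function names index_directory and run_search_loop are hypothetical assumptions, not the project's actual API:

    # run.py -- hypothetical orchestration sketch; real function names may differ.
    from encoder.main_seq import index_directory   # assumed indexing entry point
    from query.query import run_search_loop        # assumed search entry point

    def main():
        choice = input("[1] Index a directory  [2] Search: ").strip()
        if choice == "1":
            path = input("Directory to index: ").strip()
            index_directory(path)   # builds/updates the FAISS indexes under index/
        else:
            run_search_loop()       # interactive CLI search session

    if __name__ == "__main__":
        main()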

3. The Indexing Pipeline

The indexing process is a sequential flow that builds the knowledge base for the search engine. This is handled by encoder/main_seq.py.

  1. Initialization: The FAISSManagerHNSW is initialized, and any existing FAISS indexes (index/text_index.index, index/image_index.index) and metadata files are loaded into memory.
  2. File Traversal: The application recursively walks the user-specified directory (os.walk) and compiles a list of all files that match the supported extensions (defined in encoder/config.py); see the traversal sketch after this list.
  3. Content Processing Loop: Each file is processed one by one:
    1. Content Extraction: Based on the file extension, the appropriate function from encoder/utils.py is called to extract the content. For text files, this is the raw text; for images, it is the file path.
    2. Embedding Generation: The extracted content (or image path) is passed to the encoder/embedding.py module. The MobileCLIP model then generates a 512-dimensional floating-point vector; see the embedding sketch after this list.
    3. Temporary Storage: The generated embedding and its associated metadata (file path, name) are temporarily stored in a list within the FAISSManagerHNSW instance.
  4. Batch Indexing: Once all files have been processed, the train_add method of the FAISSManagerHNSW is called. This method stacks all the temporarily stored embeddings into a single NumPy array and adds them to the FAISS index. The HNSWFlat index type is highly efficient for this, as it does not require an explicit training step like other FAISS indexes (e.g., IVF).
  5. Save to Disk: Finally, the save_state method is called to write the updated FAISS indexes and metadata JSON files to the index/ directory, ensuring persistence between sessions. (Steps 4 and 5 are sketched after this list.)
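
The traversal step and the extension filtering that feeds the processing loop can be sketched as follows. The extension sets and the helper name collect_files are illustrative stand-ins for the actual definitions in encoder/config.py and encoder/utils.py:

    import os

    # Stand-ins for the extension lists defined in encoder/config.py.
    SUPPORTED_TEXT = {".txt", ".md", ".pdf"}
    SUPPORTED_IMAGE = {".jpg", ".jpeg", ".png"}

    def collect_files(root):
        """Walk `root` recursively, yielding (path, kind) for supported files."""
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in filenames:
                ext = os.path.splitext(name)[1].lower()
                if ext in SUPPORTED_TEXT:
                    yield os.path.join(dirpath, name), "text"
                elif ext in SUPPORTED_IMAGE:
                    yield os.path.join(dirpath, name), "image"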
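
Embedding generation (step 3.2) can be sketched with the Python API of Apple's mobileclip package. The model variant and checkpoint path are assumptions, and the helper names embed_text and embed_image are stand-ins; encoder/embedding.py may differ in detail:

    import torch
    import mobileclip
    from PIL import Image

    # Variant and checkpoint path are assumed; any MobileCLIP variant with a
    # 512-dimensional embedding space fits the description in this document.
    model, _, preprocess = mobileclip.create_model_and_transforms(
        "mobileclip_s0", pretrained="checkpoints/mobileclip_s0.pt")
    tokenizer = mobileclip.get_tokenizer("mobileclip_s0")
    model.eval()

    def embed_text(text):
        """Return a unit-normalized 512-dim embedding for a text snippet."""
        with torch.no_grad():
            feats = model.encode_text(tokenizer([text]))
            return (feats / feats.norm(dim=-1, keepdim=True)).numpy()

    def embed_image(path):
        """Return a unit-normalized 512-dim embedding for an image file."""
        with torch.no_grad():
            img = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
            feats = model.encode_image(img)
            return (feats / feats.norm(dim=-1, keepdim=True)).numpy()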
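
Steps 4 and 5 reduce to a handful of FAISS calls. This minimal sketch mirrors what train_add and save_state are described as doing for the text index; the metadata file name and layout are assumptions:

    import json
    import os
    import numpy as np
    import faiss

    DIM = 512  # MobileCLIP embedding width

    # IndexHNSWFlat needs no training phase (unlike IVF indexes); 32 is M,
    # the number of graph neighbors per node.
    index = faiss.IndexHNSWFlat(DIM, 32)
    metadata = []   # entry i describes the vector with ID i
    pending = []    # embeddings buffered during the processing loop

    def queue(embedding, path):
        """Step 3.3: buffer one embedding together with its metadata."""
        pending.append(embedding.astype("float32"))
        metadata.append({"name": os.path.basename(path), "path": path})

    def train_add_and_save():
        """Steps 4 and 5: batch-add the buffered vectors, then persist."""
        vectors = np.vstack(pending)     # shape: (n_files, DIM)
        index.add(vectors)               # vector i receives ID i
        faiss.write_index(index, "index/text_index.index")
        with open("index/text_metadata.json", "w") as f:  # assumed file name
            json.dump(metadata, f)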

4. The Search Pipeline

The search process is designed to be fast and accurate, providing relevant results in real time. This is handled by query/query.py.

  1. User Input: The user enters a natural language query into the CLI.
  2. Query Intent Classification:
    1. The query string is passed to the index_token function in query/utils.py.
    2. This function uses a fine-tuned MobileBERT model to perform a sequence classification task. The model outputs one of two labels, TEXT or IMAGE, based on what it infers the user is looking for; see the classifier sketch after this list.
  3. Query Embedding:
    1. The same query string is passed to the text_extract function in encoder/embedding.py.
    2. The MobileCLIP model converts the query into a 512-dimensional vector embedding. This is the same model used for indexing, which is crucial for ensuring that the query and the indexed content exist in the same "embedding space."
  4. Similarity Search:
    1. Based on the intent (TEXT or IMAGE), the appropriate search method (search_text or search_image) in FAISSManagerHNSW is called.
    2. FAISS then performs an approximate k-Nearest Neighbor (k-NN) search on the corresponding vector index. Rather than exhaustively comparing the query against every stored vector, the HNSW index navigates its graph structure toward the query's neighborhood and returns the k vectors with the smallest distance (i.e., the highest similarity); see the search sketch after this list.
  5. Results Presentation:
    1. The search returns the IDs and similarity scores of the top matching items.
    2. The application looks up the metadata (file name and path) for these IDs from the loaded metadata map.
    3. The final results are displayed to the user in a formatted table rendered with the Rich library (sketched after this list).
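
A minimal sketch of the intent classifier (step 2), assuming a Hugging Face transformers checkpoint fine-tuned from MobileBERT; the checkpoint path and label order are assumptions:

    import torch
    from transformers import AutoTokenizer, AutoModelForSequenceClassification

    CHECKPOINT = "models/mobilebert-intent"  # hypothetical fine-tuned checkpoint
    tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)
    model = AutoModelForSequenceClassification.from_pretrained(CHECKPOINT)
    LABELS = ["TEXT", "IMAGE"]  # assumed label order

    def classify_intent(query):
        """Return the predicted label for a natural-language query."""
        inputs = tokenizer(query, return_tensors="pt", truncation=True)
        with torch.no_grad():
            logits = model(**inputs).logits
        return LABELS[int(logits.argmax(dim=-1))]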
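
The similarity search itself (step 4) boils down to a single index.search call. The sketch below assumes metadata is a list whose entry i describes vector i, as in the indexing sketch above:

    import numpy as np

    def search(index, metadata, query_vec, k=5):
        """Approximate k-NN over the HNSW graph; returns the top-k hits."""
        query_vec = np.asarray(query_vec, dtype="float32").reshape(1, -1)
        distances, ids = index.search(query_vec, k)  # FAISS returns (D, I)
        return [(metadata[i], float(d))
                for i, d in zip(ids[0], distances[0]) if i != -1]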
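
Finally, results presentation (step 5) with the Rich library, assuming the (metadata, distance) pairs returned by the search sketch above:

    from rich.console import Console
    from rich.table import Table

    def show_results(results):
        """Render hits as a table; a smaller distance means a closer match."""
        table = Table(title="Search Results")
        table.add_column("File")
        table.add_column("Path")
        table.add_column("Distance", justify="right")
        for meta, dist in results:
            table.add_row(meta["name"], meta["path"], f"{dist:.4f}")
        Console().print(table)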

5. Key Technologies and Rationale
