Search Sphere is a standalone, AI-powered semantic search engine that runs locally on a user's machine. It is designed to overcome the limitations of traditional keyword-based file search by understanding the meaning behind a user's query. By leveraging state-of-the-art machine learning models, it can find relevant documents and images even if the query's keywords are not present in the file's name or content.
This document provides a detailed technical overview of the system's architecture, pipelines, and core components.
The application is divided into two main packages: encoder and query, which work together to provide the core functionality. The run.py script serves as the main entry point, orchestrating the overall workflow and managing the user interface.
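The layout below summarizes the modules covered in the rest of this document. The root folder name and tree arrangement are illustrative; the individual files and the index/ directory are those referenced in the sections that follow.

```
search-sphere/
├── run.py              # entry point and UI orchestration
├── encoder/            # the indexer
│   ├── main_seq.py
│   ├── embedding.py
│   ├── faiss_base.py
│   ├── utils.py
│   └── config.py       # supported file extensions
├── query/              # the searcher
│   ├── query.py
│   ├── utils.py
│   └── fine_tune.py
└── index/              # persisted FAISS indexes and metadata JSON
    ├── text_index.index
    └── image_index.index
```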
run.py (Main Application Runner):
- Renders the terminal user interface using the Rich library.
- Calls into the encoder module to perform the indexing.
- Calls into the query module to handle the interactive search loop.

encoder Package (The Indexer):
- main_seq.py: Contains the primary logic for the indexing pipeline. It traverses the file system, orchestrates content extraction and embedding generation, and saves the final index.
- embedding.py: A critical module that uses the MobileCLIP model to convert text strings and image files into 512-dimensional vector embeddings.
- faiss_base.py: A wrapper class, FAISSManagerHNSW, that abstracts away the complexity of managing the FAISS vector indexes. It handles adding new vectors, training the index (if necessary), and saving/loading the index from disk.
- utils.py: A set of helper functions for tasks such as extracting text from various file formats (.pdf, .docx, etc.) and retrieving file metadata.

query Package (The Searcher):
- query.py: The heart of the search functionality. It takes a user's raw query, uses the utils module to determine the query's intent, generates a query embedding, and performs the search against the FAISS index.
- utils.py: Contains the index_token function, which loads the fine-tuned MobileBERT model to classify the user's query as either TEXT or IMAGE.
- fine_tune.py: A utility script (not part of the main application flow) used to train the MobileBERT model for the intent classification task.

The indexing process is a sequential flow that builds the knowledge base for the search engine. It is handled by encoder/main_seq.py.
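Before the step-by-step walkthrough, here is a minimal, self-contained sketch of that flow. The embed function is a deterministic stand-in for MobileCLIP, and the extension sets are assumptions standing in for encoder/config.py; none of these helper names come from the actual codebase.

```python
import os
import hashlib

# Assumed subsets of the extensions defined in encoder/config.py.
TEXT_EXTS = {".txt", ".md", ".pdf", ".docx"}
IMAGE_EXTS = {".png", ".jpg", ".jpeg"}

def embed(content: str, dim: int = 512) -> list[float]:
    """Stand-in for MobileCLIP: derive a deterministic 512-d vector."""
    digest = hashlib.sha256(content.encode()).digest()
    return [digest[i % len(digest)] / 255.0 for i in range(dim)]

def index_directory(root: str):
    """Walk `root`, split files by modality, and buffer their embeddings."""
    text_buf, image_buf = [], []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            ext = os.path.splitext(name)[1].lower()
            if ext in TEXT_EXTS:
                # Text files contribute their raw content.
                with open(path, encoding="utf-8", errors="ignore") as fh:
                    text_buf.append((path, embed(fh.read())))
            elif ext in IMAGE_EXTS:
                # For images the pipeline records the path; the real
                # system feeds the pixels to MobileCLIP's image encoder.
                image_buf.append((path, embed(path)))
    return text_buf, image_buf
```

In the real pipeline the two buffers are then handed to FAISSManagerHNSW, as described in the steps below.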
1. Initialization: The FAISSManagerHNSW is initialized, and any existing FAISS indexes (index/text_index.index, index/image_index.index) and metadata files are loaded into memory.
2. File Discovery: The pipeline traverses the file system (via os.walk) and compiles a list of all files that match the supported extensions (defined in encoder/config.py).
3. Content Extraction: For each file, encoder/utils.py is called to extract the content. For text files, this is the raw text; for images, it is the file path.
4. Embedding Generation: The extracted content is passed to the embedding.py module, where the MobileCLIP model generates a 512-dimensional floating-point vector.
5. Buffering: Each embedding, along with its file metadata, is stored temporarily in the FAISSManagerHNSW instance.
6. Index Update: Once all files are processed, the train_add method of the FAISSManagerHNSW is called. This method stacks all the temporarily stored embeddings into a single NumPy array and adds them to the FAISS index. The HNSWFlat index type is highly efficient for this, as it does not require an explicit training step like other FAISS index types (e.g., IVF).
7. Persistence: Finally, the save_state method is called to write the updated FAISS indexes and metadata JSON files to the index/ directory, ensuring persistence between sessions.

The search process is designed to be fast and accurate, returning relevant results in real time. It is handled by query/query.py.
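To make the distance semantics of the search concrete, here is a brute-force stand-in for the FAISS lookup. The real system delegates this work to the HNSW index, and the function names here are illustrative, not the actual search_text/search_image API.

```python
import math

def l2_distance(a: list[float], b: list[float]) -> float:
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def brute_force_search(query_vec, index, k=3):
    """index: list of (path, vector) pairs.

    Returns the paths of the k vectors closest to the query --
    the same contract the HNSW index fulfils, only exhaustively.
    """
    ranked = sorted(index, key=lambda item: l2_distance(query_vec, item[1]))
    return [path for path, _vec in ranked[:k]]
```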
1. Intent Classification: The raw query is passed to the index_token function in query/utils.py. This function uses the fine-tuned MobileBERT model to perform a sequence classification task; the model outputs a label, either TEXT or IMAGE, based on what it believes the user is looking for.
2. Query Embedding: The query is then passed to the text_extract function in encoder/embedding.py, where the MobileCLIP model converts it into a 512-dimensional vector embedding. This is the same model used for indexing, which is crucial for ensuring that the query and the indexed content exist in the same "embedding space."
3. Index Search: Based on the predicted intent (TEXT or IMAGE), the appropriate search method (search_text or search_image) in FAISSManagerHNSW is called, returning the top k vectors with the smallest distance (i.e., the highest similarity).
4. Result Display: The matching files are formatted and presented to the user via the Rich library.

The HNSW (Hierarchical Navigable Small World) graph-based index was chosen for its exceptional search speed and accuracy, especially for large datasets. Unlike IVF indexes, HNSW does not require a separate training phase, which makes it ideal for this application, where new files can be added dynamically.