Search Sphere is a standalone, AI-powered semantic search engine that runs locally on a user's machine. It is designed to overcome the limitations of traditional keyword-based file search by understanding the meaning behind a user's query. By leveraging state-of-the-art machine learning models, it can find relevant documents and images even if the query's keywords are not present in the file's name or content.
This document provides a detailed technical overview of the system's architecture, pipelines, and core components.
The application is divided into two main packages, `encoder` and `query`, which work together to provide the core functionality. The `run.py` script serves as the main entry point, orchestrating the overall workflow and managing the user interface.
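As a rough illustration of how the entry point ties these pieces together, the sketch below shows one plausible shape for `run.py`. The helper names `build_index` and `interactive_search` are hypothetical stand-ins, not functions defined by the project; only `run.py`, the `Rich` library, and the two packages are taken from this document.

```python
# run.py -- illustrative sketch only. build_index() and interactive_search() are
# hypothetical names standing in for whatever the encoder and query packages expose.
from rich.console import Console
from rich.prompt import Prompt

console = Console()


def main() -> None:
    console.rule("Search Sphere")

    # Indexing phase: delegate to the encoder package.
    target_dir = Prompt.ask("Directory to index", default=".")
    console.print(f"Indexing [bold]{target_dir}[/bold] ...")
    # build_index(target_dir)        # hypothetical encoder entry point

    # Search phase: hand control to the query package's interactive loop.
    # interactive_search()           # hypothetical query entry point


if __name__ == "__main__":
    main()
```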
- `run.py` (Main Application Runner):
  - Manages the user interface using the `Rich` library.
  - Calls the `encoder` module to perform the indexing.
  - Calls the `query` module to handle the interactive search loop.
- `encoder` Package (The Indexer):
  - `main_seq.py`: Contains the primary logic for the indexing pipeline. It traverses the file system, orchestrates content extraction and embedding generation, and saves the final index.
  - `embedding.py`: A critical module that uses the MobileCLIP model to convert text strings and image files into 512-dimensional vector embeddings.
  - `faiss_base.py`: A wrapper class, `FAISSManagerHNSW`, that abstracts away the complexity of managing the FAISS vector indexes. It handles adding new vectors, training the index (if necessary), and saving/loading the index from disk (see the sketch after this list).
  - `utils.py`: A set of helper functions for tasks like extracting text from various file formats (`.pdf`, `.docx`, etc.) and retrieving file metadata.
- `query` Package (The Searcher):
  - `query.py`: The heart of the search functionality. It takes a user's raw query, uses the `utils` module to determine the query's intent, generates a query embedding, and performs the search against the FAISS index.
  - `utils.py`: Contains the `index_token` function, which loads the fine-tuned MobileBERT model to classify the user's query as either `TEXT` or `IMAGE`.
  - `fine_tune.py`: A utility script (not part of the main application flow) used to train the MobileBERT model for the intent classification task.
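For orientation, here is a minimal sketch of what such a wrapper could look like, assuming a single on-disk index and default HNSW parameters. The method names `train_add`, `search_text`, `search_image`, and `save_state` come from the descriptions in this document; the constructor arguments, file layout, and metadata handling are assumptions, not the project's actual implementation.

```python
import json
from typing import Any

import numpy as np
import faiss


class FAISSManagerHNSW:
    """Illustrative sketch of the FAISS wrapper -- not the project's real code."""

    def __init__(self, dim: int = 512, index_path: str = "index/text_index.index"):
        self.index_path = index_path
        self.metadata: list[dict[str, Any]] = []   # one entry per stored vector
        self._pending: list[np.ndarray] = []       # embeddings waiting to be added
        # HNSWFlat needs no training phase; 32 neighbours per node (M) is a common default.
        self.index = faiss.IndexHNSWFlat(dim, 32)

    def add(self, embedding: np.ndarray, meta: dict[str, Any]) -> None:
        """Stash an embedding and its metadata until train_add() is called."""
        self._pending.append(embedding.astype("float32"))
        self.metadata.append(meta)

    def train_add(self) -> None:
        """Stack the pending embeddings into one NumPy array and add them to the index."""
        if self._pending:
            self.index.add(np.vstack(self._pending))
            self._pending.clear()

    def search_text(self, query_vec: np.ndarray, k: int = 5):
        """Return the k nearest stored items as (metadata, distance) pairs."""
        query = query_vec.reshape(1, -1).astype("float32")
        distances, ids = self.index.search(query, k)
        return [(self.metadata[i], float(d)) for i, d in zip(ids[0], distances[0]) if i != -1]

    # A single index is shown here for brevity; the real class keeps separate
    # text and image indexes, so search_image would query the image index.
    search_image = search_text

    def save_state(self) -> None:
        """Persist the FAISS index and its metadata JSON to disk."""
        faiss.write_index(self.index, self.index_path)
        with open(self.index_path + ".meta.json", "w") as fh:
            json.dump(self.metadata, fh)
```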
The indexing process is a sequential flow that builds the knowledge base for the search engine. It is handled by `encoder/main_seq.py` and proceeds as follows:
1. The `FAISSManagerHNSW` is initialized, and any existing FAISS indexes (`index/text_index.index`, `index/image_index.index`) and metadata files are loaded into memory.
2. The target directory is traversed (using `os.walk`), and a list of all files that match the supported extensions (defined in `encoder/config.py`) is compiled.
3. For each file, `encoder/utils.py` is called to extract the content. For text files, this is the raw text; for images, it is the file path.
4. The extracted content is passed to the `embedding.py` module. The MobileCLIP model then generates a 512-dimensional floating-point vector.
5. Each embedding, along with its metadata, is stored temporarily in the `FAISSManagerHNSW` instance.
6. The `train_add` method of the `FAISSManagerHNSW` is called. This method stacks all the temporarily stored embeddings into a single NumPy array and adds them to the FAISS index. The `HNSWFlat` index type is highly efficient for this, as it does not require an explicit training step like other FAISS indexes (e.g., IVF).
7. The `save_state` method is called to write the updated FAISS indexes and metadata JSON files to the `index/` directory, ensuring persistence between sessions.
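To make the flow concrete, the sketch below compresses steps 2 through 7 into a single function. It reuses the illustrative `FAISSManagerHNSW` sketched above, `embed_stub` stands in for the MobileCLIP call in `embedding.py`, and the extension list is an assumption, since the real one lives in `encoder/config.py`.

```python
import os

import numpy as np

# Assumption: the real extension list is defined in encoder/config.py.
SUPPORTED = {".pdf", ".docx", ".txt", ".jpg", ".jpeg", ".png"}


def embed_stub(content: str) -> np.ndarray:
    """Placeholder for the MobileCLIP call in embedding.py (returns a 512-d vector)."""
    return np.zeros(512, dtype="float32")


def index_directory(root: str, manager: FAISSManagerHNSW) -> None:
    """Walk `root`, embed every supported file, and persist the index (steps 2-7)."""
    for dirpath, _dirs, filenames in os.walk(root):            # step 2: file discovery
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.splitext(name)[1].lower() not in SUPPORTED:
                continue
            # Steps 3-4: the real pipeline extracts the content (raw text, or the
            # file path for images) and embeds it with MobileCLIP; a stub stands in here.
            vector = embed_stub(path)
            # Step 5: stash the embedding and its metadata in the manager.
            manager.add(vector, {"path": path})
    manager.train_add()     # step 6: stack pending embeddings, add them to the HNSW index
    manager.save_state()    # step 7: persist index + metadata JSON under index/
```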
The search process is designed to be fast and accurate, providing relevant results in real time. It is handled by `query/query.py` and proceeds as follows:
1. The user's raw query is passed to the `index_token` function in `query/utils.py`.
2. This function uses the fine-tuned MobileBERT model to perform a sequence classification task. The model outputs a label, either `TEXT` or `IMAGE`, based on what it believes the user is looking for.
3. The query is then passed to the `text_extract` function in `encoder/embedding.py`.
4. The MobileCLIP model converts the query into a 512-dimensional vector embedding. This is the same model used for indexing, which is crucial for ensuring that the query and the indexed content exist in the same "embedding space."
5. Based on the classified intent (`TEXT` or `IMAGE`), the appropriate search method (`search_text` or `search_image`) in `FAISSManagerHNSW` is called.
6. FAISS returns the `k` vectors with the smallest distance (i.e., the highest similarity).
7. The results are presented to the user with the `Rich` library.
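A condensed sketch of this path is shown below. The helpers `classify_intent` and `embed_query` are hypothetical stand-ins for `index_token` and the MobileCLIP text encoder, and the manager is the illustrative `FAISSManagerHNSW` sketched earlier (where `search_image` is assumed to mirror `search_text` against the image index); none of this is the project's actual `query.py`.

```python
import numpy as np
from rich.console import Console
from rich.table import Table


def classify_intent(raw_query: str) -> str:
    """Stand-in for index_token(): the real version runs the fine-tuned MobileBERT classifier."""
    return "IMAGE" if "photo" in raw_query.lower() else "TEXT"


def embed_query(raw_query: str) -> np.ndarray:
    """Stand-in for the MobileCLIP text encoder in encoder/embedding.py."""
    return np.zeros(512, dtype="float32")


def run_query(manager: FAISSManagerHNSW, raw_query: str, k: int = 5) -> None:
    """Classify the query, embed it, search the matching index, and render the results."""
    intent = classify_intent(raw_query)      # steps 1-2: TEXT or IMAGE
    query_vec = embed_query(raw_query)       # steps 3-4: 512-d query embedding

    # Step 5: dispatch to the search method that matches the classified intent.
    if intent == "IMAGE":
        hits = manager.search_image(query_vec, k=k)
    else:
        hits = manager.search_text(query_vec, k=k)

    # Steps 6-7: show the k nearest neighbours (smallest distance = highest similarity).
    table = Table(title=f"Top {k} results ({intent})")
    table.add_column("Path")
    table.add_column("Distance", justify="right")
    for meta, distance in hits:
        table.add_row(meta["path"], f"{distance:.3f}")
    Console().print(table)
```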
The HNSW (Hierarchical Navigable Small World) graph-based index was chosen for its exceptional search speed and accuracy, especially for large datasets. Unlike IVF indexes, HNSW does not require a separate training phase, which makes it ideal for this application, where new files can be added dynamically.
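To make the trade-off concrete, the snippet below contrasts the two FAISS index types: an `IndexHNSWFlat` accepts vectors immediately, whereas an IVF index must first be trained on a representative sample. The dimension (512) matches the MobileCLIP embeddings; the other parameters are illustrative defaults, not values taken from the project.

```python
import numpy as np
import faiss

dim = 512                                        # MobileCLIP embedding size
vectors = np.random.rand(10_000, dim).astype("float32")

# HNSW: graph-based, no training phase -- vectors can be added as soon as files are embedded.
hnsw = faiss.IndexHNSWFlat(dim, 32)              # 32 = neighbours per node (M), illustrative
print(hnsw.is_trained)                           # True immediately
hnsw.add(vectors)

# IVF: must first be trained on a representative sample before vectors can be added.
quantizer = faiss.IndexFlatL2(dim)
ivf = faiss.IndexIVFFlat(quantizer, dim, 100)    # 100 = number of clusters (nlist), illustrative
print(ivf.is_trained)                            # False until train() has been called
ivf.train(vectors)
ivf.add(vectors)

# Both indexes answer k-nearest-neighbour queries the same way.
distances, ids = hnsw.search(vectors[:1], 5)
```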