Semantic Recommendation with FAISS and Sentence Transformers
Early in 2025 I built a content recommendation API for a large media catalog: roughly 42MB of metadata as a CSV. The core problem was that catalog search is not the same as catalog discovery. Search works when you know what you want. Recommendation works when you have something you like and want to find more of it, without necessarily knowing the exact terms to query.
The result was a FastAPI service with three recommendation modes, three separate ML models totaling around 300MB, lazy-loaded from Cloudflare R2 on first request, packaged as a multi-arch Docker image. Here is how it works and what I learned deploying it.
Three Modes, Three Different Problems
The reason for three recommendation modes is that they solve meaningfully different problems, and trying to collapse them into a single approach loses something real.
Semantic mode uses FAISS and sentence-transformer embeddings. You provide a title or a description, and the system returns items whose semantic meaning is most similar, regardless of whether they share any keywords. This handles intent-matching: "something atmospheric and melancholy" finds items that fit that description even if those words do not appear anywhere in the catalog metadata.
Term-based mode uses TF-IDF cosine similarity. You provide a set of keywords or tags, and the system returns items that are statistically similar on those terms. This is the right mode when the user's query language matches the catalog's tag vocabulary: genre tags, author names, content descriptors.
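The term-based mode can be sketched with scikit-learn's vectorizer and cosine similarity. This is a minimal illustration, not the service's actual code; the toy catalog, the `tags` column, and the function name are all stand-ins.

```python
# Sketch of TF-IDF term-based recommendation (illustrative names/data).
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

df = pd.DataFrame({
    "title": ["A", "B", "C"],
    "tags": ["dark fantasy epic", "epic space opera", "cozy slice of life"],
})

vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(df["tags"])  # one row per catalog item

def recommend_by_terms(terms: str, n: int = 10):
    # Vectorize the query with the same vocabulary, rank by cosine similarity.
    query_vec = vectorizer.transform([terms])
    scores = cosine_similarity(query_vec, tfidf_matrix).ravel()
    top = scores.argsort()[::-1][:n]
    return [(df["title"].iloc[i], float(scores[i])) for i in top]
```

As the article notes, this is only as good as the tag vocabulary: a query term that never appears in any item's tags contributes nothing to the ranking.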
Title-based mode does direct similarity search against title strings. You provide a title, and the system finds entries with similar titles. This handles the "I know I've seen this before but I can't remember the exact name" use case.
Each mode has a different failure mode. Semantic search can return thematically adjacent but actually irrelevant results when the embedding space is sparse. TF-IDF is only as good as the tagging consistency in the catalog. Title similarity degrades badly when titles are short or share common words. Having all three available lets the caller choose the one whose failure modes are least harmful for their context.
FAISS Index Design
FAISS (Facebook AI Similarity Search) is a library for efficient similarity search over dense vectors. The choice of index type is the first real decision.
The main options are:
- IndexFlatL2 / IndexFlatIP: Exact search, no approximation. Slowest at scale, but correct. Supports inner product (cosine similarity with normalized vectors).
- IndexIVFFlat: Inverted file index, approximate. Partitions the vector space into clusters and only searches the nearest clusters. Faster at scale, configurable accuracy tradeoff.
- IndexHNSW: Hierarchical navigable small world graph. Very fast approximate search, good recall, but higher memory usage, and results can vary between index builds because graph construction is order-dependent.
For this catalog of roughly 40,000 entries with 384-dimensional vectors, I used IndexFlatIP with normalized vectors, which gives exact cosine similarity search. The index is 124MB and query time is under 100ms even for exact search at this scale. Approximate indexes are worth the accuracy tradeoff at millions of vectors; at tens of thousands, exact search is fast enough and removes a class of correctness questions.
The embeddings are stored separately as a .npy file rather than embedded in the FAISS index itself. This is a flexibility decision: it allows rebuilding the index with different parameters without re-running the embedding model, and it allows accessing the raw embedding vectors for other downstream uses without going through FAISS.
Sentence Transformers and Model Choice
The embedding model is all-MiniLM-L6-v2 from the sentence-transformers library. The model produces 384-dimensional vectors and weighs around 80MB. The choice was driven by the quality-speed tradeoff.
Larger models like all-mpnet-base-v2 (768 dimensions, ~420MB) produce marginally better embeddings but are four to five times slower at inference. For a recommendation API where latency matters and marginal quality differences are imperceptible to users, the smaller model is the right call.
Running every item in the catalog through the model takes around 30 minutes on CPU hardware. This is a one-time cost. The resulting semantic_embeddings.npy file is the expensive artifact; the FAISS index is cheap to rebuild from it.
At serving time, query embeddings are computed on the fly. A single embedding inference call with all-MiniLM-L6-v2 takes approximately 15-30ms on CPU, which is acceptable for a recommendation endpoint.
Model Storage on Cloudflare R2
The combined model artifacts total around 300MB: semantic_embeddings.npy (124MB), faiss_semantic.index (124MB), tfidf_matrix.pkl (20MB), and tfidf_vectorizer.pkl (small). Bundling these into the container image was not a viable option.
A 300MB model layer in a Docker image means every deployment pulls 300MB, every CI run pulls 300MB, and local development requires either committing the models to the repository or maintaining a separate download step. Cloudflare R2 solves this cleanly: models live in object storage, the container image stays small, and the service loads models from R2 at runtime.
The loading strategy is lazy: models are not loaded at container startup. The first request to a given recommendation mode triggers the model load for that mode. Subsequent requests use the cached in-memory model.
The tradeoff is cold start latency. First request to the semantic endpoint downloads 248MB from R2, loads the numpy array, and initializes the FAISS index. On a reasonably fast connection, this takes 15-30 seconds. I exposed a /load_models endpoint and a /status endpoint to handle this operationally. A warmup call to /load_models triggers background loading immediately after container start, and /status returns which models are currently loaded. In production, a health check that calls /load_models on startup eliminates the cold start problem for real user requests.
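The lazy-loading pattern can be sketched as a load-once accessor. This is a hedged illustration: the bucket name, environment variables, and cache path are placeholders, not the service's real configuration; R2 is addressed through boto3's S3-compatible client.

```python
# Load-once model accessor: download from R2 (S3-compatible) on first use,
# then serve from memory. All names here are illustrative placeholders.
import os
import threading

_cache = {}
_lock = threading.Lock()

def _download_from_r2(key: str, dest: str):
    import boto3
    s3 = boto3.client(
        "s3",
        endpoint_url=os.environ["R2_ENDPOINT"],  # https://<account>.r2.cloudflarestorage.com
        aws_access_key_id=os.environ["R2_ACCESS_KEY"],
        aws_secret_access_key=os.environ["R2_SECRET_KEY"],
    )
    s3.download_file("models", key, dest)

def get_model(name: str, loader, downloader=_download_from_r2):
    """Return the cached model; download and deserialize on first request."""
    with _lock:
        if name not in _cache:
            path = f"/tmp/{name}"
            downloader(name, path)       # cold start cost lives here
            _cache[name] = loader(path)  # e.g. np.load or faiss.read_index
        return _cache[name]
```

The lock matters under a concurrent first request: without it, two callers could each download the same 124MB artifact.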
FastAPI Endpoint Design
The API surface is minimal:
- GET /recommend_semantic?title={title}&n={n}: semantic nearest-neighbor search
- GET /recommend_by_terms?terms={terms}&n={n}: TF-IDF cosine similarity
- GET /recommend_by_title?title={title}&n={n}: title string similarity
- POST /load_models: trigger background model loading
- GET /status: return model loading status and basic service health
Each recommendation endpoint returns a ranked list of items with their similarity scores. The n parameter controls result count, defaulting to 10.
FastAPI's async support is used for the model loading endpoint specifically because loading 300MB of models synchronously would block the event loop and make the service unresponsive during loading. The /load_models endpoint starts loading in a background task and returns immediately, with /status exposing progress.
Multi-Arch Docker Builds
FAISS has platform-specific compiled binaries. The faiss-cpu package on PyPI ships wheels for linux/amd64 but not for linux/arm64. This matters because local development on Apple Silicon Macs runs arm64, while most cloud production infrastructure runs amd64.
The solution is multi-arch Docker builds using docker buildx:
docker buildx build \
--platform linux/amd64,linux/arm64 \
--push \
-t registry/recommendation-api:latest .
For the arm64 build, faiss-cpu has to be compiled from source, which requires additional build dependencies in the Dockerfile and significantly longer build times. The Dockerfile handles this with platform-conditional logic:
RUN if [ "$(uname -m)" = "aarch64" ]; then \
        apt-get update && \
        apt-get install -y build-essential swig libopenblas-dev && \
        pip install faiss-cpu --no-binary :all:; \
    else \
        pip install faiss-cpu; \
    fi
This is the main operational complexity in the build process. The alternative, building only for amd64 and running under emulation on Apple Silicon, is unacceptably slow for development.
Migration from Droplet to Containers
The previous version of this service ran directly on a DigitalOcean droplet with a process manager keeping the Python process alive. It worked, but it had the usual problems with direct-server deployments: manual dependency management, no easy rollback, no isolation between services on the same machine.
The containerized version changes the operational model. Deploying a new version is pushing a new image and restarting the container. Rolling back is pulling the previous image tag. The model artifacts are not part of the deployment artifact at all. They live in R2 and are loaded at runtime, which means model updates do not require a new image build.
The one thing the droplet approach did better: startup time. The droplet kept models loaded in memory across restarts because the process stayed alive. Containers restart cold by default. The warmup endpoint addresses this, but it is an additional operational step that did not exist before.
What I Would Do Differently
The lazy loading strategy optimizes for container startup time at the cost of first-request latency. For a service where predictable latency matters more than fast cold starts, eager loading is the better choice, meaning all models are loaded before the first request is accepted. The container takes longer to become ready, but every request after that has consistent latency.
The three-mode architecture could also be a single endpoint with a mode parameter rather than three separate endpoints. Separate endpoints made incremental development easier (each mode was independently testable), but a single endpoint with mode selection would be a cleaner API surface for callers.
Both of these are the kinds of decisions that look obvious in retrospect and reasonable at the time.
This stack, FastAPI, FAISS, sentence-transformers, Docker, and Cloudflare R2, is a reasonable starting point for any project that needs semantic similarity search over a catalog of moderate size.