
How The Model Was Created

From raw movie datasets to deployed prediction and similarity APIs.

1) Data Sources and Targets

We combine IMDb and TMDB data. IMDb contributes structured metadata and rating labels (target variable). TMDB contributes richer text fields such as plot overviews, keywords, cast, and directors. This gives both supervised signal (rating) and contextual signal (story semantics).
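The join between the two sources can be sketched with pandas. This is a minimal illustration only: the column names (`tconst`, `imdb_id`, `averageRating`, `overview`) and the inner-join choice are assumptions, not the project's exact schema.

```python
import pandas as pd

# Toy stand-ins for the real IMDb / TMDB tables; the join key and
# column names are illustrative assumptions.
imdb = pd.DataFrame({
    "tconst": ["tt001", "tt002"],
    "averageRating": [7.8, 6.1],   # supervised target
    "runtimeMinutes": [142, 98],
})
tmdb = pd.DataFrame({
    "imdb_id": ["tt001", "tt002"],
    "overview": ["A banker is sent to prison.", "Two friends take a road trip."],
    "keywords": ["prison, hope", "friendship, travel"],
})

# Join on the shared IMDb identifier so each title pairs its
# structured metadata and rating label with its TMDB text fields.
movies = imdb.merge(tmdb, left_on="tconst", right_on="imdb_id", how="inner")
```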

2) Feature Engineering Pipeline

Structured features include release year, runtime, genre one-hot columns, adult flag, and budget features. Budget is transformed with log1p and missing values are imputed with decade-level medians. For text, overviews are embedded with all-MiniLM-L6-v2, then reduced with PCA to compact components used by the prediction model. This keeps semantic signal while controlling feature size.
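The budget handling above can be sketched in a few lines. The column names here are illustrative, and the embedding + PCA step is omitted; this only shows the decade-median imputation and log1p transform.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "release_year": [1994, 1999, 2003, 2008],
    "budget": [25e6, np.nan, 40e6, np.nan],
})

# Decade bucket (e.g. 1990, 2000) used as the imputation group.
df["decade"] = (df["release_year"] // 10) * 10

# Fill missing budgets with the median budget of the same decade,
# then compress the heavy-tailed scale with log1p.
df["budget"] = df.groupby("decade")["budget"].transform(
    lambda s: s.fillna(s.median())
)
df["log_budget"] = np.log1p(df["budget"])
```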

3) Chosen Prediction Model

The production artifact is model_v5.pkl. It predicts IMDb-style scores from engineered features. We selected this setup after iterating across experiments to balance quality, interpretability, and inference speed. The API endpoint /predict runs preprocessing + inference and returns a rounded score.
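The core of the /predict handler can be sketched as a plain function. The `DummyModel` below stands in for the unpickled model_v5.pkl artifact (in production it would come from `pickle.load`), and rounding to one decimal place is an assumption about what "rounded score" means here.

```python
import numpy as np

class DummyModel:
    """Stand-in for the unpickled model_v5.pkl artifact."""
    def predict(self, X):
        return np.array([6.8437] * len(X))

def predict_score(model, feature_row):
    """Core of the /predict handler: reshape -> infer -> round.

    `feature_row` is assumed to already be the engineered feature
    vector (structured columns + PCA text components).
    """
    X = np.asarray(feature_row, dtype=float).reshape(1, -1)
    raw = float(model.predict(X)[0])
    # Return an IMDb-style score rounded to one decimal place.
    return round(raw, 1)

score = predict_score(DummyModel(), [2008, 142, 1, 0, 17.0])
```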

4) Similarity Search Model

Similar-film retrieval uses sentence embeddings from BAAI/bge-base-en-v1.5. Enriched movie text is embedded and indexed using FAISS (index.faiss). Query embeddings are searched by nearest neighbors and returned through /similar-film.
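The retrieval logic can be illustrated with a brute-force NumPy scan. The tiny 4-dimensional vectors below stand in for real bge-base-en-v1.5 embeddings (which are 768-dimensional), and the normalize-then-inner-product setup mirrors the common FAISS IndexFlatIP configuration; in production, FAISS performs this search over the serialized index.faiss at scale.

```python
import numpy as np

# Toy embeddings standing in for BAAI/bge-base-en-v1.5 vectors;
# the values are arbitrary illustrations.
corpus = np.array([
    [0.9, 0.1, 0.0, 0.0],   # movie 0
    [0.0, 0.8, 0.6, 0.0],   # movie 1
    [0.7, 0.2, 0.1, 0.0],   # movie 2
], dtype="float32")
query = np.array([1.0, 0.1, 0.0, 0.0], dtype="float32")

# Normalize so the inner product equals cosine similarity.
corpus /= np.linalg.norm(corpus, axis=1, keepdims=True)
query /= np.linalg.norm(query)

# Nearest-neighbor scan: score every film, keep the top 2.
scores = corpus @ query
top = np.argsort(-scores)[:2]
```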

5) Serving and Deployment

FastAPI serves both the UI and API routes in one container. Model files are prepared during Docker build, including large artifacts fetched from cloud storage. At runtime, the app starts once and exposes a health endpoint plus prediction and similarity endpoints, ready for Cloud Run deployment.
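A build like this is often expressed as a Dockerfile along the following lines. Everything here is a sketch under assumptions: the base image, file paths, the `app.main:app` module path, and the commented artifact-fetch step are placeholders, not the project's actual configuration.

```dockerfile
# Illustrative sketch only; paths and commands are assumptions.
FROM python:3.11-slim
WORKDIR /app

COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Bake model artifacts into the image at build time so the container
# starts without a cold download (placeholder for the real fetch step
# that pulls large artifacts such as model_v5.pkl and index.faiss
# from cloud storage).
# RUN ./scripts/fetch_artifacts.sh

COPY . .

# Cloud Run injects $PORT; default to 8080 for local runs.
ENV PORT=8080
CMD exec uvicorn app.main:app --host 0.0.0.0 --port $PORT
```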