1) Data Sources and Targets
We combine IMDb and TMDB data. IMDb contributes structured metadata and the rating labels used as the target variable. TMDB contributes richer text fields such as plot overviews, keywords, cast, and directors. Together they provide a supervised signal (the rating) and a contextual signal (story semantics).
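The text does not say how the two sources are combined. A minimal sketch, assuming an inner join on the IMDb title ID, could look like the following. The field names (tconst, averageRating, imdb_id, overview) follow the public IMDb and TMDB schemas; the records and the join_sources helper are hypothetical.

```python
# Hypothetical sample records; only the field names mirror the public schemas.
imdb_rows = [
    {"tconst": "tt0000001", "startYear": 1999, "runtimeMinutes": 136, "averageRating": 8.7},
    {"tconst": "tt0000002", "startYear": 2010, "runtimeMinutes": 148, "averageRating": 8.8},
    {"tconst": "tt0000003", "startYear": 2015, "runtimeMinutes": 95, "averageRating": 6.1},
]
tmdb_rows = [
    {"imdb_id": "tt0000001", "overview": "A hacker learns the world is simulated."},
    {"imdb_id": "tt0000002", "overview": "A thief enters dreams to plant an idea."},
    {"imdb_id": None, "overview": "An entry with no IMDb link."},
]


def join_sources(imdb_rows, tmdb_rows):
    """Inner-join IMDb rows (features + target) with TMDB rows (text fields)."""
    # Index TMDB rows by IMDb ID, skipping entries that have no link.
    tmdb_by_id = {row["imdb_id"]: row for row in tmdb_rows if row.get("imdb_id")}
    merged = []
    for row in imdb_rows:
        extra = tmdb_by_id.get(row["tconst"])
        if extra is None:
            continue  # no TMDB match: drop the title (assumed inner join)
        text_fields = {k: v for k, v in extra.items() if k != "imdb_id"}
        merged.append({**row, **text_fields})
    return merged


merged = join_sources(imdb_rows, tmdb_rows)
```

Each merged record then carries both the target (averageRating) and the text used for embeddings (overview).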
2) Feature Engineering Pipeline
Structured features include release year, runtime, genre one-hot columns, adult flag, and budget features.
Budget is transformed with log1p, and missing budget values are imputed with the median budget of the film's release decade.
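The budget step can be sketched as follows. This is an illustration under stated assumptions, not the project's code: missing budgets are represented as None, imputation happens before the log1p transform (the text does not fix the order), and the fallback to a global median for decades with no known budget is an addition of this sketch. The names decade_of and budget_feature are hypothetical.

```python
import math
from collections import defaultdict
from statistics import median


def decade_of(year):
    """Map a release year to its decade, e.g. 1997 -> 1990."""
    return (year // 10) * 10


def budget_feature(movies):
    """Return log1p(budget) per movie, imputing missing budgets by decade median."""
    by_decade = defaultdict(list)
    for m in movies:
        if m["budget"] is not None:
            by_decade[decade_of(m["release_year"])].append(m["budget"])
    decade_median = {d: median(values) for d, values in by_decade.items()}

    # Assumed fallback: global median when a decade has no known budget at all.
    known = [b for values in by_decade.values() for b in values]
    global_median = median(known) if known else 0.0

    features = []
    for m in movies:
        budget = m["budget"]
        if budget is None:
            budget = decade_median.get(decade_of(m["release_year"]), global_median)
        features.append(math.log1p(budget))
    return features


movies = [
    {"release_year": 1994, "budget": 1_000_000},
    {"release_year": 1999, "budget": 3_000_000},
    {"release_year": 1997, "budget": None},  # imputed from the 1990s median
    {"release_year": 2005, "budget": None},  # no 2000s data: global fallback
]
features = budget_feature(movies)
```

log1p is used instead of a plain log so that a budget of zero maps to zero rather than to negative infinity.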
For text, overviews are embedded with all-MiniLM-L6-v2, then reduced with PCA to a small number of components used by the prediction model. This keeps semantic signal while controlling feature size.
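The reduction step can be illustrated with a small PCA written in NumPy. In this sketch the embeddings are random stand-ins with 384 dimensions, which is the output size of all-MiniLM-L6-v2; a real pipeline would obtain them from the sentence-transformers library instead. The component count of 32 and the helper names fit_pca and transform are assumptions, since the text does not state them.

```python
import numpy as np


def fit_pca(X, n_components):
    """Fit PCA via SVD; return the feature mean and the top principal directions."""
    mean = X.mean(axis=0)
    # Rows of vt are orthonormal principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, vt[:n_components]


def transform(X, mean, components):
    """Project centered rows of X onto the fitted principal directions."""
    return (X - mean) @ components.T


rng = np.random.default_rng(0)
embeddings = rng.normal(size=(200, 384))  # stand-in for real overview embeddings

mean, components = fit_pca(embeddings, n_components=32)
reduced = transform(embeddings, mean, components)  # shape (200, 32)
```

Keeping fit and transform separate matters at inference time: the mean and components fitted on the training set have to be reused unchanged for each new overview.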
3) Chosen Prediction Model
The production artifact is model_v5.pkl. It predicts IMDb-style scores from engineered features.
We selected this setup after iterating across experiments to balance quality, interpretability, and inference speed.
The API endpoint /predict runs preprocessing and inference, then returns a rounded score.
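The flow behind /predict can be sketched as a plain function that a route handler would call. Everything here is illustrative: the payload fields, the response key predicted_rating, the one-decimal rounding, and the clamp to IMDb's 1 to 10 scale are assumptions, and ConstantModel is a hypothetical stand-in for the object loaded from model_v5.pkl.

```python
import math


class ConstantModel:
    """Hypothetical stand-in exposing a scikit-learn style predict()."""

    def predict(self, rows):
        return [7.2345 for _ in rows]


def preprocess(payload):
    """Turn a request payload into a feature vector (assumed feature layout)."""
    return [
        float(payload["release_year"]),
        float(payload["runtime"]),
        1.0 if payload.get("adult") else 0.0,
        math.log1p(payload.get("budget") or 0.0),
    ]


def predict_score(payload, model, ndigits=1):
    """Run preprocessing + inference and return a rounded score."""
    features = preprocess(payload)
    raw = float(model.predict([features])[0])
    clamped = min(10.0, max(1.0, raw))  # assumed guard: keep output on the 1-10 scale
    return {"predicted_rating": round(clamped, ndigits)}


response = predict_score(
    {"release_year": 2014, "runtime": 169, "adult": False, "budget": 165_000_000},
    ConstantModel(),
)
```

Keeping this logic outside the route handler makes it testable without starting the web server.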
4) Similarity Search Model
Similar-film retrieval uses sentence embeddings from BAAI/bge-base-en-v1.5.
Enriched movie text is embedded and indexed using FAISS (index.faiss).
The query embedding is matched against the index by nearest-neighbor search, and the closest films are returned through /similar-film.
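The retrieval step can be illustrated without FAISS by a brute-force search in NumPy. This sketch assumes cosine similarity over L2-normalized vectors; the text does not say which FAISS index type or metric is used, though a flat inner-product index over normalized vectors would give the same ranking. The vectors are random stand-ins with 768 dimensions (the output size of BAAI/bge-base-en-v1.5), and the titles and helper names are hypothetical.

```python
import numpy as np


def normalize(v):
    """Scale vectors to unit length so inner product equals cosine similarity."""
    return v / np.linalg.norm(v, axis=-1, keepdims=True)


def top_k(query_vec, index_vecs, k=3):
    """Return (row, score) pairs for the k most similar indexed vectors."""
    scores = index_vecs @ query_vec
    order = np.argsort(-scores)[:k]  # highest similarity first
    return [(int(i), float(scores[i])) for i in order]


rng = np.random.default_rng(0)
titles = ["Film A", "Film B", "Film C", "Film D", "Film E"]  # hypothetical catalog
index_vecs = normalize(rng.normal(size=(5, 768)))  # stand-in for indexed embeddings

# A query close to "Film C": its vector plus a small perturbation.
query_vec = normalize(index_vecs[2] + 0.01 * rng.normal(size=768))
hits = top_k(query_vec, index_vecs, k=3)
similar_titles = [titles[i] for i, _ in hits]
```

Brute force scales linearly with catalog size, which is the cost an index such as FAISS is meant to manage for larger collections.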
5) Serving and Deployment
FastAPI serves both the UI and API routes in one container. Model files are prepared during Docker build, including large artifacts fetched from cloud storage. At runtime, the app starts once and exposes a health endpoint plus prediction and similarity endpoints, ready for Cloud Run deployment.
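One way to get the "starts once" behavior is to cache artifact loading so the model file is read a single time per process. The sketch below uses only the standard library and is an assumption about the pattern, not the project's actual startup code; the file location, the get_model and health helpers, and the health payload are hypothetical.

```python
import pickle
from functools import lru_cache
from pathlib import Path

MODEL_PATH = Path("model_v5.pkl")  # file name from the text; location is assumed


@lru_cache(maxsize=1)
def get_model(path=MODEL_PATH):
    """Load the pickled model once; later calls reuse the cached object."""
    with open(path, "rb") as fh:
        return pickle.load(fh)


def health(path=MODEL_PATH):
    """Report whether the model artifact is present on disk."""
    return {"status": "ok" if Path(path).exists() else "missing_model"}
```

A health route can then return health() cheaply, while the prediction route calls get_model() and pays the load cost only on first use.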