This project extracts movie data from IMDb with Selenium and stores it in Google Cloud Platform (GCP), where it feeds Plotly visualizations and personalized movie recommendations in a Streamlit app.
The primary goal is to automate the scraping of movie data so that extraction and storage in GCP run without manual steps. This setup suits pipelines for movie-related analytics and similar projects.
The scraper extracts data from the IMDb website. Each movie record includes the following fields:
- Title
- Year
- Duration
- MPAA
- Genres
- IMDb_Rating
- Director
- Stars
- Plot_Summary
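The fields above can be sketched as a single record type. This is a minimal illustration, not the project's actual schema: the Python types, the `movies` table name, and the `insert_statement` helper are assumptions, and the sample values are placeholders. The `?` placeholders follow the parameter style that pyodbc uses.

```python
from dataclasses import dataclass, astuple, fields


@dataclass
class Movie:
    """One scraped record; field names mirror the list above.

    The types are assumptions (e.g. Duration kept as raw text).
    """
    Title: str
    Year: int
    Duration: str
    MPAA: str
    Genres: str
    IMDb_Rating: float
    Director: str
    Stars: str
    Plot_Summary: str


def insert_statement(table: str = "movies") -> str:
    """Build a parameterized INSERT for the record (table name is hypothetical)."""
    cols = [f.name for f in fields(Movie)]
    placeholders = ", ".join("?" for _ in cols)
    return f"INSERT INTO {table} ({', '.join(cols)}) VALUES ({placeholders})"


# Placeholder values for illustration only, not real scraped data.
example = Movie(
    Title="Example Title",
    Year=2000,
    Duration="2h 0m",
    MPAA="PG-13",
    Genres="Drama, Thriller",
    IMDb_Rating=7.5,
    Director="Example Director",
    Stars="Actor A, Actor B",
    Plot_Summary="A short placeholder summary.",
)

# astuple(example) yields the values in column order, ready to pass
# as the parameter sequence of a database cursor's execute() call.
row = astuple(example)
```

Keeping the column order in one place (the dataclass) means the SQL statement and the parameter tuple cannot drift apart when a field is added or renamed.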
For a detailed implementation guide, refer to the tutorial article on Medium.
Before running the project, ensure you have the following set up:
- Google Cloud Platform (GCP):
  - A GCP account with billing enabled.
  - Access to Cloud SQL (SQL Server) to store the data.
- Local Environment:
  - Python 3.10 or later.
  - Required libraries: Selenium and pyodbc (random and time are part of the Python standard library and need no installation).
  - A compatible web driver (e.g., ChromeDriver) installed for Selenium.
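
The local prerequisites can be verified with a short script before running the scraper. This is a rough sketch using only the standard library; the `check_environment` helper is hypothetical and not part of the project, and it assumes the driver executable is named `chromedriver` and is expected on the `PATH`.

```python
import importlib.util
import shutil
import sys


def check_environment() -> dict[str, bool]:
    """Report which local prerequisites appear to be satisfied."""
    return {
        # The project asks for Python 3.10 or later.
        "python>=3.10": sys.version_info >= (3, 10),
        # find_spec locates a package without importing it.
        "selenium": importlib.util.find_spec("selenium") is not None,
        "pyodbc": importlib.util.find_spec("pyodbc") is not None,
        # Assumes the driver binary is called "chromedriver".
        "chromedriver on PATH": shutil.which("chromedriver") is not None,
    }


if __name__ == "__main__":
    for name, ok in check_environment().items():
        print(f"[{'ok' if ok else 'missing'}] {name}")
```

A missing `chromedriver` entry is not necessarily fatal: Selenium 4.6 and later include Selenium Manager, which can obtain a matching driver automatically.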
