ETL pipeline on movie data using Python and postgreSQL

Movies-ETL

ETL pipeline on movie data using Python and postgreSQL

Overview

This project consisted on a automated Extraction, Transformation and Load pipeline. This ETL extracted movie data from wikipedia, kaggle, and MovieLens to clean it, transform it, and merge it using Pandas. The product was a merged table with movies and ratings loaded to PostgreSQL.

Resources

  • Data sources:
    • movies_metadata.csv
    • ratings.csv
    • wikipedia_movies.json
  • Software:
    • Python
    • PostgreSQL
    • Pandas
    • SQLAlchemy
    • Regular Expressions

Results

124815490-55c27480-df35-11eb-8ec6-20c28ca7a649

Summary

The pipeline was created under the following assumptions:

  • I was able to join the wikipedia, kaggle, and ratings movie data on the IMDB ID column.
  • The wikipedia dataset didn’t have a IMDB ID, so I had to extract it from the url link given.
  • Each dataset

     

     

     

    To finish reading, please visit source site