Getting started with PySpark development using Jupyter on a Big Data cluster
It is no secret that data science tools like Jupyter, Apache Zeppelin, and the more recently launched Cloud Datalab and JupyterLab are must-knows for day-to-day work. So how can the ease of developing models be combined with the computing power of a Big Data cluster? In this article I will share a few simple steps to start using Jupyter notebooks for PySpark on a Dataproc cluster in GCP.
Final goal
Prerequisites
1. Have a Google Cloud account (just sign in with your Gmail account and you automatically get $300 in credit for one year) [1]
2. Create a new project with any name you like (see the sketch after this list)
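If you prefer the command line over the console, a minimal sketch for this step using the gcloud CLI is shown below; the project ID used here is a placeholder, so replace it with your own.

```
# Create a new GCP project (the ID "my-pyspark-demo" is a placeholder)
gcloud projects create my-pyspark-demo --name="my-pyspark-demo"

# Set it as the active project for the commands that follow
gcloud config set project my-pyspark-demo
```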
Steps
- To make the deployment easier, I’m going to use a beta feature that can only be applied when creating a Data