Data pipeline architecture for onboarding public datasets to Datasets for Google Cloud

Public Datasets Pipelines

Cloud-native data pipeline architecture for onboarding public datasets to Datasets for Google Cloud.

We use Pipenv to make environment setup more deterministic and uniform across different machines.

If you haven’t done so, install Pipenv by following its official installation instructions. With Pipenv installed, run the following command:

pipenv install --ignore-pipfile --dev

The --ignore-pipfile flag makes Pipenv install the exact versions pinned in the Pipfile.lock found in the project root, and --dev installs the development dependencies along with the default ones.

Finally, initialize the Airflow database:

pipenv run airflow initdb
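
Before running the setup commands above, it can help to confirm that the prerequisites are in place. The following is a minimal sketch, assuming it is run from the project root; the PIPENV_STATUS and LOCKFILE_STATUS variables are hypothetical names used only for illustration and are not part of the repository.

```shell
# Preflight check (illustrative only): is pipenv on the PATH?
if command -v pipenv >/dev/null 2>&1; then
  pipenv --version
  PIPENV_STATUS="installed"
else
  echo "pipenv not found; install it before running the setup commands" >&2
  PIPENV_STATUS="missing"
fi

# --ignore-pipfile installs from Pipfile.lock, so the lock file must exist
# in the current directory (assumed here to be the project root).
if [ -f "Pipfile.lock" ]; then
  LOCKFILE_STATUS="present"
else
  echo "Pipfile.lock not found in $(pwd)" >&2
  LOCKFILE_STATUS="absent"
fi

echo "pipenv: ${PIPENV_STATUS}, Pipfile.lock: ${LOCKFILE_STATUS}"
```

If both checks pass, the install and database initialization commands should be able to run as written.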

The main purpose of this repository is to configure, generate, and deploy data pipelines in a programmatic, standardized, and scalable way.

Follow the steps below to build a data pipeline for your dataset:

1. Create a folder hierarchy for your pipeline

mkdir -p datasets/DATASET/PIPELINE

[example]
datasets/covid19_tracking/national_testing_and_outcomes
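
The folder creation step can be sketched with shell variables, which keeps the dataset and pipeline names in one place. The variable names DATASET and PIPELINE are illustrative, and their values are taken from the example path above; substitute your own.

```shell
# Illustrative values, copied from the example path in this section
DATASET="covid19_tracking"
PIPELINE="national_testing_and_outcomes"

# -p creates any missing parent directories and does not fail
# if the directories already exist
mkdir -p "datasets/${DATASET}/${PIPELINE}"

# Confirm that the pipeline folder now exists
ls -d "datasets/${DATASET}/${PIPELINE}"
```

Because of -p, the command is safe to re-run, and adding a second pipeline under the same dataset only creates the new leaf folder.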

 

 

 
