Analyze Big Sequence Alignments with PySpark in AWS EMR
This repo hosts my code for the article “Analyze Big Sequence Alignments with PySpark in AWS EMR”.
- Spark
- AWS CLI
- AWS Account
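Assuming the list above names the tools needed on your local machine, a quick check like the following can confirm they are on your PATH before you start. This is only a minimal sketch: `check_tool` is a hypothetical helper written for this README, and `spark-submit` is assumed to be the command that indicates a local Spark installation.

```shell
# Hypothetical helper: report whether a command-line tool is on PATH.
check_tool() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "found: $1"
  else
    echo "missing: $1"
    return 1
  fi
}

# Check the AWS CLI and the Spark launcher; keep going even if one is missing.
check_tool aws || true
check_tool spark-submit || true
```

If the AWS CLI is found, running `aws sts get-caller-identity` afterwards is one way to confirm that credentials for your AWS account are configured.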
Follow the instructions in the article. Once you have uploaded the files to your S3 bucket, run:
aws emr create-cluster --name "Spark_step_pip" \
    --release-label emr-6.5.0 \
    --applications Name=Spark \
    --log-uri s3://[your_S3_bucket]/logs/ \
    --instance-type m5.xlarge \
    --instance-count 3 \
    --bootstrap-actions Path=s3://[your_S3_bucket]/emr_bootstrap.sh \
    --use-default-roles --auto-terminate \
    --steps "Type=Spark,Name=SparkProgram,ActionOnFailure=CONTINUE,Args=[--deploy-mode,cluster,--master,yarn,--py-files,s3://[your_S3_bucket]/helper_function.py,s3://[your_S3_bucket]/spark_3mer.py,s3://[your_S3_bucket]/test.sam,[your_S3_bucket],sankey.json]"
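The upload that has to happen before the command above, and a status check afterwards, could be scripted roughly as follows. This is a sketch under assumptions rather than part of the article: the bucket name `my-example-bucket` and the `run`, `upload_inputs`, and `list_active_clusters` helpers are hypothetical, and the four files are assumed to sit in the current directory. By default the script only prints the AWS CLI commands; set `DRY_RUN=0` to execute them.

```shell
# Hypothetical bucket name; replace with your own.
BUCKET="${BUCKET:-my-example-bucket}"
# Print commands by default; set DRY_RUN=0 to actually run them.
DRY_RUN="${DRY_RUN:-1}"

# Either echo the command (dry run) or execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Upload the bootstrap script, the Spark code, and the input alignment
# to the bucket root, matching the S3 paths in the create-cluster command.
upload_inputs() {
  for f in emr_bootstrap.sh helper_function.py spark_3mer.py test.sam; do
    run aws s3 cp "$f" "s3://${BUCKET}/$f"
  done
}

# List clusters that are still starting up or running.
list_active_clusters() {
  run aws emr list-clusters --active
}

upload_inputs
list_active_clusters
```

Because the cluster is created with `--auto-terminate`, it should drop out of the active list once the Spark step has finished; the step logs are then available under the `logs/` prefix given by `--log-uri`.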