This repo hosts my code for the article “Analyze Big Sequence Alignments with PySpark in AWS EMR”.

Prerequisites: Spark, the AWS CLI, and an AWS account.

Follow the instructions in the article. Once you have uploaded the files into your S3 bucket, run the following command, replacing every [your_S3_bucket] with the name of your bucket:

    aws emr create-cluster \
      --name "Spark_step_pip" \
      --release-label emr-6.5.0 \
      --applications Name=Spark \
      --log-uri s3://[your_S3_bucket]/logs/ \
      --instance-type m5.xlarge \
      --instance-count 3 \
      --bootstrap-actions Path=s3://[your_S3_bucket]/emr_bootstrap.sh \
      --use-default-roles \
      --auto-terminate \
      --steps "Type=Spark,Name=SparkProgram,ActionOnFailure=CONTINUE,Args=[--deploy-mode,cluster,--master,yarn,--py-files,s3://[your_S3_bucket]/helper_function.py,s3://[your_S3_bucket]/spark_3mer.py,s3://[your_S3_bucket]/test.sam,[your_S3_bucket],sankey.json]"
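The bucket name appears several times in the create-cluster command, so substituting it by hand is easy to get wrong. The following is a minimal sketch, not part of the article's code, that assembles the same argument list from a single bucket name. The function name `build_create_cluster_command` and the bucket `my-bucket` are hypothetical; the flags and file names are copied from the command in this README.

```python
import shlex
from typing import List


def build_create_cluster_command(bucket: str) -> List[str]:
    """Return the `aws emr create-cluster` argument list for one S3 bucket.

    Hypothetical helper: it only mirrors the command shown in this README.
    """
    s3 = f"s3://{bucket}"

    # spark-submit arguments for the EMR step, in the README's order.
    step_args = [
        "--deploy-mode", "cluster",
        "--master", "yarn",
        "--py-files", f"{s3}/helper_function.py",
        f"{s3}/spark_3mer.py",
        f"{s3}/test.sam",
        bucket,
        "sankey.json",
    ]
    step = (
        "Type=Spark,Name=SparkProgram,ActionOnFailure=CONTINUE,"
        f"Args=[{','.join(step_args)}]"
    )

    return [
        "aws", "emr", "create-cluster",
        "--name", "Spark_step_pip",
        "--release-label", "emr-6.5.0",
        "--applications", "Name=Spark",
        "--log-uri", f"{s3}/logs/",
        "--instance-type", "m5.xlarge",
        "--instance-count", "3",
        "--bootstrap-actions", f"Path={s3}/emr_bootstrap.sh",
        "--use-default-roles",
        "--auto-terminate",
        "--steps", step,
    ]


if __name__ == "__main__":
    # "my-bucket" is a placeholder; use the name of your own bucket.
    print(shlex.join(build_create_cluster_command("my-bucket")))
```

Running the script prints a shell-quoted command that can be pasted into a terminal. The returned list could also be passed to `subprocess.run(..., check=True)` to launch the cluster directly, assuming the AWS CLI is installed and configured.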