Genomics Workflows on AWS
This guide walks through how to use services from Amazon Web Services (AWS), such as Amazon S3 and AWS Batch, to run large-scale genomics analyses.
Here you will learn how to:
- Use S3 buckets to stage large genomics datasets as inputs to and outputs from analysis pipelines
- Create job queues in AWS Batch to use for scalable parallel job execution
- Orchestrate individual jobs into analysis workflows using native AWS services like AWS Step Functions and third-party workflow engines
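As a rough illustration of the orchestration point above, the sketch below builds a minimal Amazon States Language definition in which an AWS Step Functions state machine runs two AWS Batch jobs in sequence. The state names, job names, job definition names, and the queue ARN are hypothetical placeholders, not values from this guide, and a real workflow would usually add retries and error handling.

```python
import json

# Hypothetical placeholder; substitute the ARN of your own AWS Batch job queue.
JOB_QUEUE_ARN = "arn:aws:batch:us-east-1:123456789012:job-queue/example-queue"


def batch_task(job_name, job_definition, next_state=None):
    """Return a Task state that submits an AWS Batch job and waits for it to finish."""
    state = {
        "Type": "Task",
        # The ".sync" suffix tells Step Functions to wait for the job to complete.
        "Resource": "arn:aws:states:::batch:submitJob.sync",
        "Parameters": {
            "JobName": job_name,
            "JobDefinition": job_definition,  # hypothetical job definition name
            "JobQueue": JOB_QUEUE_ARN,
        },
    }
    # A state either names its successor or marks the end of the workflow.
    if next_state is None:
        state["End"] = True
    else:
        state["Next"] = next_state
    return state


def build_workflow():
    """Assemble a two-step workflow: an alignment job followed by variant calling."""
    return {
        "Comment": "Illustrative two-step genomics workflow",
        "StartAt": "Align",
        "States": {
            "Align": batch_task("align", "example-align-jobdef", next_state="CallVariants"),
            "CallVariants": batch_task("call-variants", "example-variants-jobdef"),
        },
    }


if __name__ == "__main__":
    # Print the definition as JSON, the form in which Step Functions accepts it.
    print(json.dumps(build_workflow(), indent=2))
```

Building the definition as a plain dictionary keeps the example free of AWS credentials; the resulting JSON is what you would supply when creating a state machine.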
If you're impatient and want to get something up and running immediately, head straight to the Quick Start section. Otherwise, continue on for the full details.
Throughout this guide we'll assume that you:
- Are familiar with the Linux command line
- Can use SSH to access a Linux server
- Have access to an AWS account
If you are completely new to AWS, we highly recommend going through the following AWS 10-Minute Tutorials, which demonstrate the basics of AWS and help you set up your development machine for working with AWS.
- Launch a Linux Virtual Machine - A tutorial that walks you through starting a host on AWS and configuring your own computer to connect over SSH.
- Batch upload files to the cloud - A tutorial on using the AWS Command Line Interface (CLI) to access Amazon S3.
AWS Account Access
AWS has many services that can be used for genomics. Here, we will build a core architecture with AWS Batch, a managed service that is built on top of other AWS services, such as Amazon EC2 and Amazon Elastic Container Service (ECS). Along the way, we'll leverage some advanced capabilities that need elevated (administrative) privileges to implement. For example, you will need to be able to create roles via AWS Identity and Access Management (IAM), a service that helps you control who is authenticated (signed in) and authorized (has permissions) to use AWS resources.
We strongly recommend following the IAM Security Best Practices for securing your root AWS account and IAM users.
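To make the role-creation requirement above more concrete, here is a minimal sketch of the trust policy document that an IAM role carries, which states which AWS service is allowed to assume the role. It assumes a role meant for the AWS Batch service (`batch.amazonaws.com`); the helper name `trust_policy` is hypothetical, and the permissions you would attach to the role are out of scope here.

```python
import json


def trust_policy(service_principal):
    """Return an IAM trust policy letting the given AWS service assume a role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Service": service_principal},
                "Action": "sts:AssumeRole",
            }
        ],
    }


if __name__ == "__main__":
    # Assumption: the role is intended for the AWS Batch service itself.
    document = trust_policy("batch.amazonaws.com")
    # This JSON string is what the IAM CreateRole API takes as its
    # AssumeRolePolicyDocument parameter.
    print(json.dumps(document, indent=2))
```

Creating the role itself is the step that typically requires administrative privileges.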
If you are using an institutional account, it is likely that you do not have administrative privileges; that is, the IAM AdministratorAccess managed policy is not attached to your IAM user or role, and you won't be able to attach it yourself.
If this is the case, you will need to work with your account administrator to get things set up for you. Refer them to this guide, and have them provide you with an AWS Batch Job Queue ARN, and an Amazon S3 Bucket that you can write results to.
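Once your administrator has given you a Job Queue ARN and a bucket name, a sketch like the following shows one way to sanity-check the ARN and assemble the arguments for a job submission. The job name, job definition, command, bucket, and the `OUTPUT_S3_URI` environment variable are hypothetical; your container would need to be written to read whatever variable you choose. The resulting dictionary uses the keyword names of the boto3 Batch client's `submit_job` call, though nothing is submitted here.

```python
def parse_job_queue_arn(arn):
    """Split an AWS Batch job queue ARN into region, account ID, and queue name."""
    # Expected shape: arn:<partition>:batch:<region>:<account-id>:job-queue/<name>
    parts = arn.split(":", 5)
    if len(parts) != 6 or parts[0] != "arn" or parts[2] != "batch":
        raise ValueError(f"not an AWS Batch ARN: {arn!r}")
    resource = parts[5]
    prefix = "job-queue/"
    if not resource.startswith(prefix) or len(resource) == len(prefix):
        raise ValueError(f"not a job queue ARN: {arn!r}")
    return {
        "region": parts[3],
        "account_id": parts[4],
        "queue_name": resource[len(prefix):],
    }


def build_submit_job_params(job_name, job_queue_arn, job_definition, command,
                            output_bucket, output_prefix):
    """Assemble keyword arguments for a Batch job submission."""
    parse_job_queue_arn(job_queue_arn)  # raises ValueError on a malformed ARN
    return {
        "jobName": job_name,
        "jobQueue": job_queue_arn,
        "jobDefinition": job_definition,
        "containerOverrides": {
            "command": list(command),
            # Hypothetical variable name: the container decides how to use it.
            "environment": [
                {"name": "OUTPUT_S3_URI",
                 "value": f"s3://{output_bucket}/{output_prefix}"},
            ],
        },
    }


if __name__ == "__main__":
    params = build_submit_job_params(
        job_name="example-job",
        job_queue_arn="arn:aws:batch:us-east-1:123456789012:job-queue/example-queue",
        job_definition="example-jobdef",
        command=["echo", "hello"],
        output_bucket="example-results-bucket",
        output_prefix="run-001",
    )
    # With boto3 installed and credentials configured, these could be passed as
    # boto3.client("batch").submit_job(**params).
    print(params)
```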
This site is a living document, created for and by the genomics community at AWS and around the world. We encourage you to contribute new content and make improvements to existing content via pull request to the GitHub repo that hosts the source code for this site.