Train a Deep Learning Model with AWS Deep Learning Containers on Amazon EC2
TUTORIAL
Overview
AWS Deep Learning Containers (DL Containers) are Docker images pre-installed with deep learning frameworks. They let you deploy custom machine learning environments quickly by skipping the complicated process of building and optimizing your environments from scratch.
Using AWS DL Containers, developers and data scientists can quickly add machine learning to their containerized applications deployed on Amazon Elastic Kubernetes Service (Amazon EKS), self-managed Kubernetes, Amazon Elastic Container Service (Amazon ECS), and Amazon EC2.
In this tutorial, you will train a TensorFlow machine learning model on an Amazon EC2 instance using the AWS Deep Learning Containers.
AWS experience
Audience
Time to complete
10 minutes
Cost to complete
Less than $1
Requires
AWS Account
Services used
AWS Deep Learning Containers, Amazon EC2, Amazon ECR
Last updated
Implementation
1. Sign up for AWS
You need an AWS account to follow this tutorial. There is no additional charge for using AWS Deep Learning Containers; you pay only for the Amazon EC2 c5.large instance used in this tutorial, which will cost less than $1 if you follow the termination steps at the end of this tutorial.
Already have an account? Log in to your account
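To make the cost estimate concrete, here is a minimal sketch of the arithmetic. The hourly rate below is an assumption used for illustration only, not a figure from this tutorial; check current Amazon EC2 pricing for your Region before relying on it.

```python
# Assumed on-demand price in USD per hour for a c5.large instance.
# This value is illustrative only; look up the current price for your Region.
ASSUMED_HOURLY_RATE_USD = 0.085


def estimated_cost(minutes, hourly_rate=ASSUMED_HOURLY_RATE_USD):
    """Return the approximate instance cost in USD for a run of `minutes`."""
    return hourly_rate * minutes / 60


def max_minutes_within(budget_usd, hourly_rate=ASSUMED_HOURLY_RATE_USD):
    """Return how many minutes the instance can run before exceeding the budget."""
    return budget_usd / hourly_rate * 60


if __name__ == "__main__":
    print(f"10-minute run: ${estimated_cost(10):.4f}")
    print(f"Minutes available for $1: {max_minutes_within(1.0):.0f}")
```

At any rate in this range, the instance can run for several hours before reaching $1, so the main cost risk is forgetting to terminate it.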
2. Add permissions for accessing Amazon ECR
AWS Deep Learning Container images are hosted on Amazon Elastic Container Registry (ECR), a fully managed Docker container registry that makes it easy for developers to store, manage, and deploy Docker container images. In this step, you will grant an existing IAM user permissions to access Amazon ECR by attaching the AmazonECS_FullAccess policy and an inline ECR policy.
If you do not have an existing IAM user, refer to the IAM Documentation for more information.
a. Navigate to the IAM console
Open the AWS Management Console in a separate window or tab so you can keep this step-by-step guide open. When the screen loads, enter your user name and password to get started. Then type IAM in the search bar and select IAM to open the service console.
b. Select Users
Select Users from the navigation pane on the left.
c. Add Permissions
You will now add permissions to a new IAM user you created or to an existing IAM user. Select the user, then select Add Permissions on the IAM user summary page.
d. Add the ECS Full Access Policy
Select Attach existing policies directly and search for AmazonECS_FullAccess. Select the AmazonECS_FullAccess policy and click through to Review and Add Permissions.
e. Add inline policy
On the IAM user summary page, select Add inline policy.
f. Paste JSON policy
Select the JSON tab and paste the following policy:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Action": "ecr:*",
      "Effect": "Allow",
      "Resource": "*"
    }
  ]
}
Save this policy as ‘ECR’ and select Create Policy.
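The policy above allows every ECR action on every resource. If you only need to pull images, a narrower policy may be enough. The sketch below builds such a policy document; the choice of these four actions is an assumption and not part of this tutorial, so verify it against the Amazon ECR documentation before using it.

```python
import json

# Read-only ECR actions for authenticating and pulling image layers.
# This reduced list is an assumption, not the policy used in the tutorial.
PULL_ACTIONS = [
    "ecr:GetAuthorizationToken",
    "ecr:BatchCheckLayerAvailability",
    "ecr:GetDownloadUrlForLayer",
    "ecr:BatchGetImage",
]


def build_pull_only_policy():
    """Return an IAM policy document that allows only image pulls from ECR."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": list(PULL_ACTIONS),
                "Resource": "*",
            }
        ],
    }


if __name__ == "__main__":
    # Print JSON that can be pasted into the JSON tab of the policy editor.
    print(json.dumps(build_pull_only_policy(), indent=2))
```

The printed JSON can be pasted into the JSON tab in place of the broader policy.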
3. Launch an AWS Deep Learning Base AMI instance
In this tutorial, we will use AWS Deep Learning Containers on an AWS Deep Learning Base Amazon Machine Image (AMI), which comes pre-packaged with necessary dependencies such as NVIDIA drivers, Docker, and nvidia-docker. You can run Deep Learning Containers on any AMI with these packages.
a. Navigate to the EC2 console
Return to the AWS Management Console home screen and type EC2 in the search bar and select EC2 to open the service console.
b. Launch an Amazon EC2 instance
Navigate to the Amazon EC2 console again and select the Launch Instance button.
c. Select the AWS Deep Learning Base AMI
Choose the AWS Marketplace tab on the left, then search for ‘deep learning base ubuntu’. Select Deep Learning Base AMI (Ubuntu). You can also select the Deep Learning Base AMI (Amazon Linux).
d. Select the instance type
Choose an Amazon EC2 instance type. Amazon Elastic Compute Cloud (EC2) is the Amazon Web Service you use to create and run virtual machines in the cloud. AWS calls these virtual machines 'instances'.
For this tutorial, we will use a c5.large instance, but you can choose other instance types, including GPU-based instances (such as G4, G5, P3, and P4).
Select Review and Launch.
e. Launch your instance
Review the details of your instance and select Launch.
f. Create a new private key file
On the next screen, you will be asked to choose an existing key pair or create a new key pair. A key pair is used to securely access your instance using SSH. AWS stores the public part of the key pair, which is like a house lock. You download and use the private part of the key pair, which is like a house key.
Select Create a new key pair and give it a name. Then select Download Key Pair and store your key in a secure location. If you lose your key, you won't be able to access your instance. If someone else gets access to your key, they will be able to access your instance.
If you have previously created a private key file that you can still access, you can use your existing private key instead by selecting Choose an existing key pair.
g. View instance details
Select the instance ID to view the details of your newly created Amazon EC2 instance on the console.
4. Connect to your instance
In this step, you will connect to your newly launched instance using SSH. The instructions below use a Mac / Linux environment. If you are using Windows, follow step 4 on this tutorial.
a. Find and copy your instance’s public DNS
Under the Description tab, copy your Amazon EC2 instance’s Public DNS (IPv4).
b. Open your command line terminal
On your terminal, use the following commands to change to the directory where your private key is located, restrict the key file's permissions, and then connect to your instance using SSH.
cd /Users/<your_username>/Downloads/
chmod 0400 <your .pem filename>
ssh -L localhost:8888:localhost:8888 -i <your .pem filename> ubuntu@<your instance DNS>
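If you prefer to script the connection, the commands above can be sketched in Python as follows. The function names and the example host are hypothetical; the sketch only mirrors the chmod and ssh invocations shown above, including the -L option that forwards local port 8888 to port 8888 on the instance.

```python
import os
import stat


def prepare_key(key_path):
    """Make the private key readable by its owner only, like `chmod 0400`."""
    os.chmod(key_path, stat.S_IRUSR)


def build_ssh_command(key_path, host, user="ubuntu", local_port=8888, remote_port=8888):
    """Return the ssh argument list with local port forwarding enabled."""
    forward = f"localhost:{local_port}:localhost:{remote_port}"
    return ["ssh", "-L", forward, "-i", key_path, f"{user}@{host}"]


if __name__ == "__main__":
    # Hypothetical key file name and host; replace with your own values.
    print(" ".join(build_ssh_command("my-key.pem", "ec2-host.example.com")))
```

Returning the command as a list keeps each argument separate, which avoids quoting problems if you later pass it to subprocess.run.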
c. Install Docker
Stop any ongoing system update, so we’re free to install Docker.
sudo pkill -f "apt.systemd.daily"
sudo apt install docker.io
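Before moving on, you may want to confirm that the tools used in the next steps are available on the instance. This is a minimal sketch, assuming the docker and aws executables are the only tools required; the helper name is hypothetical.

```python
import shutil


def missing_tools(tools=("docker", "aws")):
    """Return the names in `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("All required tools are installed.")
```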
5. Log in to Amazon ECR
In this step, you will log in and verify access to Amazon ECR, where the AWS Deep Learning Container images are hosted.
a. Configure your EC2 instance with your AWS credentials
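You need to provide your AWS Access Key ID and Secret Access Key, for example by running the AWS CLI's aws configure command on the instance and entering the values when prompted. If you don’t already have this information, you can create an Access Key ID and Secret Access Key here.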
b. Log in to Amazon ECR
You will use the command below to log in to Amazon ECR:
sudo su -
$(aws ecr get-login --region us-east-1 --no-include-email --registry-ids 763104351884)
Note: You need to include ‘$’ and parentheses in your command. You will see ‘Login Succeeded’ when this step concludes.
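The get-login subcommand is not available in AWS CLI version 2. If the command above fails for that reason, the equivalent flow is to fetch a password with aws ecr get-login-password and pipe it to docker login. The following is a sketch of that flow, assuming both CLIs are installed and your credentials are configured; the function names are hypothetical.

```python
import subprocess

REGISTRY_ID = "763104351884"  # registry ID used in this tutorial
REGION = "us-east-1"          # Region used in this tutorial


def registry_host(registry_id=REGISTRY_ID, region=REGION):
    """Return the ECR registry host name for an account and Region."""
    return f"{registry_id}.dkr.ecr.{region}.amazonaws.com"


def login_commands(registry_id=REGISTRY_ID, region=REGION):
    """Return the two commands: fetch a password, then log Docker in."""
    get_password = ["aws", "ecr", "get-login-password", "--region", region]
    docker_login = [
        "docker", "login", "--username", "AWS",
        "--password-stdin", registry_host(registry_id, region),
    ]
    return get_password, docker_login


def ecr_login(registry_id=REGISTRY_ID, region=REGION):
    """Run both commands, passing the password to docker over stdin."""
    get_password, docker_login = login_commands(registry_id, region)
    password = subprocess.run(
        get_password, check=True, capture_output=True, text=True
    ).stdout
    subprocess.run(docker_login, input=password, check=True, text=True)
```

Calling ecr_login() runs both commands. Passing the password over stdin keeps it out of the process list and the shell history.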
6. Run TensorFlow training with Deep Learning Containers
In this step, we will use an AWS Deep Learning Container image for TensorFlow training on CPU instances with Python 3.9.
a. Run AWS Deep Learning Containers
You will now run AWS Deep Learning Container images on your EC2 instance using the command below. This command will automatically pull the Deep Learning Container image if it doesn’t exist locally.
If you are using a CPU instance:
docker run -it 763104351884.dkr.ecr.us-east-1.amazonaws.com/tensorflow-training:2.8.0-cpu-py39-ubuntu20.04-e3
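The image name in the command above encodes the registry, Region, repository, and several version fields. The sketch below splits it into those parts. The field layout of the tag is inferred from this one example and is an assumption, and the meaning of the final suffix is left uninterpreted.

```python
def parse_image_uri(uri):
    """Split a Deep Learning Container image URI into its visible parts.

    Assumes the tag has five dash-separated fields, as in the tutorial's
    example; other images may use a different layout.
    """
    registry, _, remainder = uri.partition("/")
    repository, _, tag = remainder.partition(":")
    host_parts = registry.split(".")
    version, device, python, os_name, suffix = tag.split("-")
    return {
        "account_id": host_parts[0],
        "region": host_parts[3],
        "repository": repository,
        "framework_version": version,
        "device": device,
        "python": python,
        "os": os_name,
        "suffix": suffix,
    }


if __name__ == "__main__":
    uri = (
        "763104351884.dkr.ecr.us-east-1.amazonaws.com/"
        "tensorflow-training:2.8.0-cpu-py39-ubuntu20.04-e3"
    )
    for key, value in parse_image_uri(uri).items():
        print(f"{key}: {value}")
```

Reading the tag this way makes it easier to check that the image matches your instance type (CPU or GPU) and the Python version you expect.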
b. Pull an example model to train
We will clone the Keras repository, which includes example Python scripts to train models.
git clone https://github.com/gilinachum/keras
c. Start training
Start training the canonical MNIST CNN model with the following command:
python keras/mnist.py
You have just successfully commenced training with your AWS Deep Learning Container.
7. Terminate Your Resources
In this step, you will terminate the Amazon EC2 instance you created during this tutorial.
Important: Terminating resources that are not actively being used reduces costs and is a best practice. Not terminating your resources can result in charges to your account.
a. Select your running instance
On the Amazon EC2 Console, select Running Instances.
b. Terminate your EC2 instance
Select the EC2 instance you created and choose Actions > Instance State > Terminate.
c. Confirm termination
You will be asked to confirm your termination. Select Yes, Terminate.
Note: This process can take several seconds to complete. Once your instance has been terminated, the Instance State will change to terminated on your EC2 Console.
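The same cleanup can be done from a terminal with the AWS CLI instead of the console. This is a sketch, assuming the AWS CLI is installed and configured on the machine where you run it; the instance ID shown is a placeholder, and the Region is assumed to match the one used elsewhere in this tutorial.

```python
import subprocess


def terminate_commands(instance_id, region="us-east-1"):
    """Return the CLI commands to terminate an instance and wait for it."""
    target = ["--instance-ids", instance_id, "--region", region]
    return [
        ["aws", "ec2", "terminate-instances"] + target,
        ["aws", "ec2", "wait", "instance-terminated"] + target,
    ]


def terminate_instance(instance_id, region="us-east-1"):
    """Run both commands in order; raises if either one fails."""
    for command in terminate_commands(instance_id, region):
        subprocess.run(command, check=True)


if __name__ == "__main__":
    # Placeholder instance ID; replace it with your own before running.
    for command in terminate_commands("i-0123456789abcdef0"):
        print(" ".join(command))
```

The wait command blocks until the instance reaches the terminated state, which mirrors watching the Instance State column in the console.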
Conclusion
You have successfully trained an MNIST CNN model with TensorFlow using AWS Deep Learning Containers.
You can use AWS DL Containers for training and inference on CPU and GPU resources on Amazon EC2, Amazon ECS, Amazon EKS, and Kubernetes.
Use these stable deep learning images, which have been optimized for performance and scale on AWS, to build your own custom deep learning environments.