Automatically Create Machine Learning Models
TUTORIAL
Overview
In this tutorial, you will learn how to use Amazon SageMaker Autopilot to automatically build, train, and tune a machine learning (ML) model, and deploy the model to make predictions. SageMaker Autopilot is available as an API as well as a UI. The UI for SageMaker Autopilot is a component of the SageMaker Canvas workspace.
SageMaker Canvas is an end-to-end no-code workspace for ML and generative AI.
Amazon SageMaker Autopilot eliminates the heavy lifting of building ML models by helping you automatically build, train, and tune the best ML model based on your data. With SageMaker Autopilot, you simply provide a tabular dataset and select the target column to predict. SageMaker Autopilot explores your data, selects the algorithms relevant to your problem type, prepares the data for model training, tests a variety of models, and selects the best performing one. You can then deploy one of the candidate models or iterate on them further to improve prediction quality.
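If you prefer the API path, the sketch below shows what starting an Autopilot job with boto3 might look like. The job name, S3 URIs, and role ARN are placeholders, not values from this tutorial:

```python
import boto3

sm = boto3.client("sagemaker")

# Start an Autopilot job on a tabular CSV dataset (placeholders throughout).
sm.create_auto_ml_job_v2(
    AutoMLJobName="marketing-campaign-autopilot",
    AutoMLJobInputDataConfig=[
        {
            "ChannelType": "training",
            "ContentType": "text/csv;header=present",
            "DataSource": {
                "S3DataSource": {
                    "S3DataType": "S3Prefix",
                    "S3Uri": "s3://your-bucket/bank-additional-full.csv",  # placeholder
                }
            },
        }
    ],
    OutputDataConfig={"S3OutputPath": "s3://your-bucket/autopilot-output/"},  # placeholder
    AutoMLProblemTypeConfig={
        "TabularJobConfig": {"TargetAttributeName": "y"}  # the column to predict
    },
    RoleArn="arn:aws:iam::123456789012:role/YourSageMakerRole",  # placeholder
)
```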
What you will accomplish
In this guide, you will:
- Create a training experiment using SageMaker Autopilot
- Explore the different stages of the training experiment
- Identify and deploy the best performing model from the training experiment
- Predict with your deployed model
Prerequisites
Before starting this guide, you will need:
- An AWS account: If you don't already have an account, follow the Setting Up Your AWS Environment getting started guide for a quick overview.
AWS experience
Beginner
Time to complete
45 minutes
Cost to complete
See SageMaker pricing to estimate cost for this tutorial.
Requires
You must be logged into an AWS account.
Services used
Amazon SageMaker Autopilot
Last updated
May 03, 2024
Implementation
For this workflow, you will use a direct marketing dataset from a financial services institution. Each row describes a customer contacted during the campaign, and the target column indicates whether the customer enrolled in the promoted product. Although the dataset is modest in size, you can follow the same steps in this tutorial to process larger datasets.
Step 1: Set up Amazon SageMaker Studio domain
With Amazon SageMaker, you can deploy a model visually using the console or programmatically using either SageMaker Studio or SageMaker notebooks. In this tutorial, you deploy the model from the console using SageMaker Canvas, which requires a SageMaker Studio domain.
If you already have a SageMaker Studio domain in the US East (N. Virginia) Region, follow the SageMaker Studio setup guide to attach the required AWS IAM policies to your SageMaker Studio account, then skip Step 1, and proceed directly to Step 2.
If you don't have an existing SageMaker Studio domain, continue with Step 1 to run an AWS CloudFormation template that creates a SageMaker Studio domain and adds the permissions required for the rest of this tutorial.
Choose the AWS CloudFormation stack link. This link opens the AWS CloudFormation console and creates your SageMaker Studio domain and a user named studio-user. It also adds the required permissions to your SageMaker Studio account. In the CloudFormation console, confirm that US East (N. Virginia) is the Region displayed in the upper right corner. The stack name should be CFN-SM-IM-Lambda-catalog and should not be changed. This stack takes about 10 minutes to create all the resources.
This stack assumes that you already have a public VPC set up in your account. If you do not have a public VPC, see VPC with a single public subnet to learn how to create a public VPC.
Select I acknowledge that AWS CloudFormation might create IAM resources, and then choose Create stack.
On the CloudFormation pane, choose Stacks. It takes about 10 minutes for the stack to be created. When the stack is created, the status of the stack changes from CREATE_IN_PROGRESS to CREATE_COMPLETE.
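If you would rather poll the stack status than watch the console, a small boto3 sketch using the stack name from this tutorial:

```python
import boto3

cfn = boto3.client("cloudformation", region_name="us-east-1")

# Expect CREATE_IN_PROGRESS at first, then CREATE_COMPLETE after ~10 minutes.
stack = cfn.describe_stacks(StackName="CFN-SM-IM-Lambda-catalog")["Stacks"][0]
print(stack["StackStatus"])
```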
Step 2: Launch SageMaker Canvas
Developing and testing a large number of candidate models is crucial for machine learning (ML) projects. Amazon SageMaker Autopilot helps by generating different model candidates and automatically choosing the best model based on your data. In this step, you will configure a SageMaker Autopilot experiment to predict success from a financial services marketing campaign. This dataset represents a marketing campaign that was run by a major financial services institution to promote certificate of deposit enrollment.
In the AWS Management Console type SageMaker into the search bar, and then choose Amazon SageMaker.
Select Canvas from the menu on the left.
On the Canvas page, select Open Canvas.
If you did not already have Canvas running, it will take around 5 minutes for Canvas to start.
Once you see the Canvas home page, you are ready to build a model.
Step 3: Build a model using SageMaker Autopilot
Select My Models on the left pane.
Select + Create new model.
For Model name, enter Marketing Campaign, leave the Problem type set to Predictive analysis, and then click Create.
On the Select dataset page, select + Create dataset.
Name the dataset MarketingData and click Create.
Download the dataset from this link: https://sagemaker-sample-files.s3.amazonaws.com/datasets/tabular/uci_bank_marketing/bank-additional-full.csv
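If you would like to inspect the file before uploading it, here is a quick pandas sketch; sep=None lets pandas sniff the delimiter, since the original UCI file is semicolon-separated:

```python
import pandas as pd

# Load the downloaded file; the python engine is required for delimiter sniffing.
df = pd.read_csv("bank-additional-full.csv", sep=None, engine="python")

print(df.shape)                 # rows and columns
print(df["y"].value_counts())   # "y" is the target column you select in Canvas
```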
Since you downloaded the data to your local machine, leave Data Source set to Local upload. Click Select files from your computer, or drag the bank-additional-full.csv file onto the window. Once the file is processed, click Preview dataset. While local upload is convenient for this use case, there are many other data source options available to make it easy to access the data required for different use cases. You can learn more about the different options here: https://docs.aws.amazon.com/sagemaker/latest/dg/canvas-importing-data.html
The preview displays the first 100 rows of the data. If everything looks good, click Create dataset.
After the dataset is ready, select the radio button beside MarketingData and click Select dataset.
From the Target column dropdown, select y. Then click the down arrow on the Quick build button and select Standard build. While the defaults are appropriate for this use case, explore the Configure model options before clicking Standard build.
Under Objective metric, you can change the metric Autopilot optimizes for while building your model. Autopilot generally selects the best metric based on the model type, but this option lets you override that choice based on your specific requirements.
Under Training method and algorithms, you can optionally select Ensemble or Hyperparameter optimization (HPO) as the approach used to develop the model. With Auto, SageMaker Autopilot uses Ensemble for datasets smaller than 100 MB and HPO for larger datasets.
Under Data split, you have the option to specify the percentage of data used during model development versus testing of the model. 80% for training and 20% for validation is a good starting point for most datasets.
Under Max candidates and runtime, you can set the maximum number of model candidates Autopilot is allowed to generate and the maximum amount of time Autopilot is allowed to run a build job. Note: Max candidates is only available in HPO training mode. We highly recommend keeping the maximum job runtime above 30 minutes to ensure that Autopilot has enough time to generate model candidates and finish building your model.
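For reference, these Canvas options map to fields on the create_auto_ml_job_v2 call sketched in the overview. The values below are illustrative, not recommendations:

```python
# Additional keyword arguments for create_auto_ml_job_v2 (illustrative values).
autopilot_options = dict(
    AutoMLJobObjective={"MetricName": "F1"},      # Objective metric
    DataSplitConfig={"ValidationFraction": 0.2},  # Data split: 80% train / 20% validation
    AutoMLProblemTypeConfig={
        "TabularJobConfig": {
            "TargetAttributeName": "y",
            "Mode": "HYPERPARAMETER_TUNING",      # or "ENSEMBLING" / "AUTO"
            "CompletionCriteria": {
                "MaxCandidates": 50,                   # HPO mode only
                "MaxAutoMLJobRuntimeInSeconds": 3600,  # maximum build-job runtime
            },
        }
    },
)
```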
After you click Standard build, SageMaker Autopilot begins to run through the phases of model building. You can monitor the process in the Model overview window. The build should take around 45 minutes to complete.
Step 4: Interpret model performance
Now that the experiment is complete and you have a model, the next step is to interpret its performance. You will now learn how to use SageMaker Canvas to analyze the model's performance.
After model building is complete, SageMaker Canvas automatically switches to the Analyze tab to show the quick training results. The model SageMaker Canvas selected based on F1 score optimization can predict the class accurately 91.296% of the time. Machine learning introduces some stochasticity into the training process, which can lead to different results across builds, so the exact metric values you see might differ.
In the Overview section, the view you see is called Column impact and represents the aggregated SHAP value for each feature across all instances in the dataset. The column impact score is an important part of model explainability because it shows which features tend to influence the predictions the most. In this use case, the customer duration (tenure) and employment variation rate are the top two fields driving the model's outcome.
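Column impact is an aggregated SHAP value, the same quantity you could compute yourself with the open-source shap package. Here is a sketch, where model (a fitted tree-based classifier) and X (its feature matrix) are hypothetical stand-ins:

```python
import numpy as np
import shap

# model and X are hypothetical: any fitted tree model and its feature matrix.
explainer = shap.Explainer(model)
shap_values = explainer(X)

# The mean absolute SHAP value per feature approximates a column impact score.
column_impact = np.abs(shap_values.values).mean(axis=0)
```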
To view other models or change the selected model, click the Model leaderboard link.
To analyze a different model from the leaderboard, select the three dots on the right and choose View model details to see explainability and performance details for that model, or choose Change to default model to work with that model instead. Click the X at the top right to close the Model leaderboard view.
Now, click the Scoring tab. You will see a visual representation of Predicted vs. Actual that helps you understand how accurately the model predicts the different classes, in this case yes or no. You will also see Model accuracy insights to help you interpret the visualization: if the model predicts the marketing campaign will not be effective, it is correct 96.872% of the time, and when the campaign was in fact not effective, the model predicted that outcome 93.201% of the time.
Now, click the Advanced metrics tab. You will find detailed information on the model’s performance, including recall, precision, and accuracy. You can also interpret model performance and decide if additional model tuning is needed.
Next, visualizations are provided to further illustrate model performance. First, look at the confusion matrix. The confusion matrix is commonly used to understand how the model labels are divided among the predicted and true classes. In this case, the diagonal elements show the number of correctly predicted labels and the off-diagonal elements show the misclassified records. A confusion matrix is useful for analyzing misclassifications due to false positives and false negatives.
Finally, look at the precision versus recall curve. This curve interprets the label as a probability threshold and shows the trade-off that occurs at various probability thresholds for model precision and recall. SageMaker AutoPilot automatically optimizes these two parameters to provide the best model.
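For comparison, the same diagnostics can be reproduced with scikit-learn, given hypothetical arrays y_true (true labels), y_pred (predicted labels), and y_score (predicted probabilities for the positive class):

```python
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_recall_curve, precision_score, recall_score)

print(confusion_matrix(y_true, y_pred))   # diagonal = correct, off-diagonal = errors
print(accuracy_score(y_true, y_pred))
print(precision_score(y_true, y_pred, pos_label="yes"))
print(recall_score(y_true, y_pred, pos_label="yes"))
print(f1_score(y_true, y_pred, pos_label="yes"))

# Precision/recall trade-off across probability thresholds.
precision, recall, thresholds = precision_recall_curve(y_true, y_score, pos_label="yes")
```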
Step 5: Test the model SageMaker Autopilot built
Now that you have a classification model, you can either use the model to run predictions, or you can create a new version of this model. In this step, you use SageMaker Canvas to generate predictions, both single and in bulk, from a dataset.
To start generating predictions, choose the Predict button at the bottom of the Analyze page, or choose the Predict tab. On the Predict page, Batch prediction is already selected. To make a one-time batch prediction, choose Manual and then select the MarketingData dataset. (Selecting Automatic would instead run batch predictions every time the dataset is updated.) In real ML workflows, this dataset should be separate from the training dataset; for simplicity, you use the same dataset here to demonstrate how SageMaker Canvas generates predictions. Choose Generate predictions.
After a few seconds, the prediction is done. Choose View from the message window at the bottom of the page to see a preview of the predictions. You can also choose Download to download a CSV file containing the full output. SageMaker Canvas returns a prediction for each row of data.
On the Predict page, you can generate predictions for a single sample by selecting Single prediction. SageMaker Canvas presents an interface in which you can manually enter values for each of the input variables used in the model. This type of analysis is ideal for what-if scenarios where you want to know how the prediction changes when one or more variables increase or decrease in value.
After the model building process, SageMaker Autopilot uploads all artifacts, including the trained model saved as a pickle file, metrics, datasets, and predictions, to an S3 bucket in your account named sagemaker-<your-Region>-<your-account-id> under a location named Canvas/<your-user>/. You can inspect the contents and use them as necessary for further development.
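Here is a short boto3 sketch for browsing those artifacts; the bucket name below is a placeholder following the pattern above:

```python
import boto3

s3 = boto3.client("s3")
bucket = "sagemaker-us-east-1-123456789012"  # sagemaker-<your-Region>-<your-account-id>

# Lists the first 1,000 keys under the Canvas prefix.
resp = s3.list_objects_v2(Bucket=bucket, Prefix="Canvas/")
for obj in resp.get("Contents", []):
    print(obj["Key"], obj["Size"])
```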
While it is outside the scope of this tutorial, from the Deploy tab you can easily deploy your model to a SageMaker endpoint so that you can make predictions from outside the Canvas application, and you can test and monitor your model to proactively detect issues such as model drift.
From the Deploy tab, select + Create Deployment. Set the desired endpoint configuration and select Deploy.
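Once the endpoint is live, here is a sketch of invoking it from outside Canvas with boto3. The endpoint name is a placeholder, and the CSV payload is a truncated illustrative row:

```python
import boto3

runtime = boto3.client("sagemaker-runtime")

response = runtime.invoke_endpoint(
    EndpointName="canvas-marketing-campaign",  # placeholder endpoint name
    ContentType="text/csv",
    Body="56,housemaid,married,basic.4y,...",  # one row of input features (truncated)
)
print(response["Body"].read().decode())        # the model's prediction
```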
From My models, you can also add the model you created to the Model Registry so that Studio users can catalog, review, and deploy the model in SageMaker Studio.
Select the model you want to add to the Model Registry. Then, on the Versions tab, click the three dots to the right of the version you would like to add and select Add to Model Registry.
Step 6: Clean up your AWS resources
It is a best practice to delete resources that you are no longer using so that you don't incur unintended charges.
Navigate to the S3 console and choose Buckets. Navigate to your bucket named sagemaker-<your-Region>-<your-account-id> and select the check box to the left of all of the files and folders. Next, choose Delete.
On the Delete objects page, verify that you have selected the proper objects to delete. In the Permanently delete objects section, confirm by entering permanently delete in the text field and choose Delete objects. Once the bucket is empty, you can delete the S3 bucket by following the same process. A success banner appears after deletion is complete.
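The same cleanup can be scripted with boto3. This is destructive, so double-check the bucket name (a placeholder below) before running:

```python
import boto3

# sagemaker-<your-Region>-<your-account-id>
bucket = boto3.resource("s3").Bucket("sagemaker-us-east-1-123456789012")

bucket.objects.all().delete()  # permanently deletes every object in the bucket
bucket.delete()                # the bucket must be empty before it can be deleted
```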
On the SageMaker Canvas main page, choose Models. On the right pane, the model you built is visible. Choose the vertical ellipsis to the right of the View option and select Delete model.
After the model is deleted, click Log out to end your Canvas session.
If you used an existing SageMaker Studio domain in Step 1, skip the rest of Step 6 and proceed directly to the conclusion section.
If you ran the CloudFormation template in Step 1 to create a new SageMaker Studio domain, continue with the following steps to delete the domain, user, and the resources created by the CloudFormation template.
To open the CloudFormation console, enter CloudFormation into the AWS console search bar, and choose CloudFormation from the search results.
In the CloudFormation pane, choose Stacks. From the status dropdown list, select Active. Under Stack name, choose CFN-SM-IM-Lambda-catalog to open the stack details page.
On the CFN-SM-IM-Lambda-catalog stack details page, choose Delete to delete the stack along with the resources it created in Step 1.
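The equivalent deletion with boto3, using the stack name from this tutorial:

```python
import boto3

cfn = boto3.client("cloudformation", region_name="us-east-1")
cfn.delete_stack(StackName="CFN-SM-IM-Lambda-catalog")  # removes the stack and its resources
```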
Conclusion
Congratulations! You have now completed the Automatically Create Machine Learning Models tutorial.
You have successfully used SageMaker Autopilot to automatically build, train, and tune models, and then deploy the best candidate model to make predictions.
Next steps
Explore SageMaker Autopilot documentation