With Amazon SageMaker Ground Truth, you can easily and inexpensively build accurately labeled machine learning (ML) datasets. To decrease labeling costs, SageMaker Ground Truth uses active learning to differentiate between data objects (like images or documents) that are difficult and easy to label. Difficult data objects are sent to human workers to be annotated and easy data objects are automatically labeled with machine learning (automated labeling or auto-labeling).
The automated labeling feature in SageMaker Ground Truth uses pre-defined Amazon SageMaker algorithms to label data and is only available when you create a labeling job using one of the supported SageMaker Ground Truth built-in task types.
This blog post shows you how to create an active learning workflow that runs training and inference with your own algorithm. You can use this example as a starting point for active learning and auto-annotation with a custom labeling job.
This post contains two parts:
- In Part 1, we demonstrate how to create an active learning workflow using the Amazon SageMaker built-in algorithm, BlazingText.
- In Part 2, we replace the BlazingText algorithm with a custom ML model.
To run and customize the code used in these parts, use the notebook bring_your_own_model_for_sagemaker_labeling_workflows_with_active_learning.ipynb (the notebook) in the SageMaker Examples section of a notebook instance. You can further customize this code; for example, you can replace random selection with different active learning logic using the code provided in the src directory of this GitHub repository.
This post walks you through a custom active learning workflow using the UCI News dataset. This dataset contains a list of about 420,000 articles that fall into one of four categories: Business (b), Science & Technology (t), Entertainment (e) and Health & Medicine (m).
Overview of solution
For this solution, you create an active learning workflow for a text classification labeling job using AWS Step Functions. Step Functions lets you coordinate the components of a distributed application as a series of steps.
The following diagram outlines the active learning loop logic for this solution.
The following Python modules in GitHub correspond to the high-level steps above:
- “Analyze and process input manifest” is specified in Bootstrap.
- “Have enough labeled data to start active learning?” is specified in MetaData.
- “Have humans label data with CreateLabelingJob” is specified in Labeling.
- “Add labeled data to the manifest” and “Export all labeled data to output manifest” are specified in Output.
- The rest of the steps besides “Check for completion” are specified in ActiveLearning.
- “Check for completion” does not belong to any Python module and is specified directly in a step function.
The active learning workflow contains the following steps:
- Analyze the input manifest to determine the number (count) of data objects that are already labeled. At least 20% of the data must be labeled to start an active learning loop.
- If all the data is labeled, copy the current manifest as the final output manifest.
- If less than 20% of the data is labeled, send 20% of the data to humans for labeling using CreateLabelingJob.
- When there is enough data to train your model, start the active learning loop.
- Auto-label data for which transform job results had the highest confidence to infer the labels.
- Call CreateLabelingJob to have humans label a subset of unlabeled data objects with low inference confidence.
- Repeat the loop.
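The loop logic above can be sketched in plain Python. This is a simplified illustration only; the helpers `train_model`, `run_inference`, and `send_to_humans` are hypothetical stand-ins for the real SageMaker training job, batch transform job, and CreateLabelingJob steps, and the 90% confidence cutoff is an assumption for the sketch.

```python
import random

CONFIDENCE_THRESHOLD = 0.90  # assumed cutoff for auto-labeling
HUMAN_BATCH_FRACTION = 0.20  # fraction of the dataset sent to humans per batch

def active_learning_loop(dataset, train_model, run_inference, send_to_humans):
    """Simplified sketch of the workflow's control logic.

    `dataset` maps object id -> label (None if unlabeled). The three callables
    stand in for the SageMaker training job, batch transform job, and
    CreateLabelingJob steps of the real state machine.
    """
    while True:
        unlabeled = [k for k, v in dataset.items() if v is None]
        if not unlabeled:
            return dataset  # export all labels to the output manifest

        labeled_fraction = 1 - len(unlabeled) / len(dataset)
        batch_size = min(len(unlabeled), max(1, int(len(dataset) * HUMAN_BATCH_FRACTION)))

        if labeled_fraction < 0.20:
            # Not enough labeled data to train: have humans label a first batch.
            dataset.update(send_to_humans(random.sample(unlabeled, batch_size)))
            continue

        # Enough labeled data: train, then run inference on the unlabeled set.
        model = train_model({k: v for k, v in dataset.items() if v is not None})
        predictions = run_inference(model, unlabeled)  # id -> (label, confidence)

        # Auto-label high-confidence predictions; send a low-confidence batch to humans.
        confident = {k: lbl for k, (lbl, c) in predictions.items()
                     if c >= CONFIDENCE_THRESHOLD}
        dataset.update(confident)
        low_confidence = [k for k in unlabeled if k not in confident]
        if low_confidence:
            dataset.update(send_to_humans(low_confidence[:batch_size]))
```

In the real workflow each of these steps is a Step Functions state rather than a function call, and human labeling is asynchronous, but the branching is the same.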
Prerequisites and setting up
To create a custom active learning workflow using this post, you need to complete the following prerequisites:
- Create an AWS account.
- Create an IAM role with the permissions required to complete this walkthrough. Your IAM role must have the following AWS managed policies attached:
- Familiarity with Amazon SageMaker labeling, training and batch transform; AWS CloudFormation; and Step Functions.
Additionally, you need an Amazon SageMaker Jupyter notebook instance to use the notebook. To learn how to create a new Amazon SageMaker Jupyter notebook instance, see Create a Notebook Instance. Use an IAM role that has the AmazonSageMakerFullAccess IAM policy attached to create your notebook instance.
After you launch your notebook instance, look for bring_your_own_model_for_sagemaker_labeling_workflows_with_active_learning.ipynb in the Ground Truth Labeling Jobs section of SageMaker Examples in your instance. See Use Example Notebooks to learn how to find an Amazon SageMaker example notebook.
Launching the CloudFormation stack
You can launch the stack in AWS Region us-east-1 in the CloudFormation console using the Launch Stack button below. To launch the stack in a different AWS Region, use the instructions found in the README of the GitHub repository.
This CloudFormation stack generates two state machines in AWS Step Functions: ActiveLearning-* and ActiveLearningLoop-*, where * is the name you used when you launched your CloudFormation stack.
When you run Part 1 and Part 2 of this walkthrough in an Amazon SageMaker Jupyter Notebook Instance, you will incur the following costs:
- Human data labeling costs in Amazon SageMaker Ground Truth. This will depend on the type of workforce that you use. If you are a new user of SageMaker Ground Truth, we suggest that you use a private workforce and include yourself as a worker to test your labeling job configuration. To learn more about these costs, see https://aws.amazon.com/sagemaker/groundtruth/pricing/.
- Training and inference costs in Amazon SageMaker. This cost will vary depending on the type of algorithm that you use. To learn more about Amazon SageMaker pricing, see https://aws.amazon.com/sagemaker/pricing/.
- EC2 costs from your Amazon SageMaker Jupyter notebook instance. Refer to https://aws.amazon.com/sagemaker/pricing/instance-types/ to determine the cost incurred by the type of notebook instance you create.
- AWS Lambda and Step Functions costs. The active learning workflow created by the provided CloudFormation template uses both services.
- Storage costs in S3. Refer to https://aws.amazon.com/s3/pricing/ to learn more about S3 pricing.
Part 1: Creating an active learning workflow with BlazingText
You use Part 1 of the notebook in an Amazon SageMaker Jupyter notebook instance to create the resources needed in your active learning workflow. Specifically, you do the following:
- Clean the data and create an input manifest file.
- Create the resources needed to create a labeling job. For example, specify label categories and a worker task template to generate the worker UI.
You use these resources to configure a labeling job request (in JSON format). At the end of Part 1, you copy and paste this JSON into the Step Functions state machine that calls CreateLabelingJob.
Creating an input manifest and labeling job resources
To create a manifest file, complete the following steps:
- Open the notebook in your Amazon SageMaker Notebook instance.
- Set up your Amazon SageMaker environment. The following code defines your session, role, Region, and S3 bucket for the data:
- Download and unzip the UCI News Dataset. This walkthrough uses the file newsCorpora.csv and randomly chooses 10,000 articles from that file to create the dataset. Each article corresponds to a single row of your input manifest file.
- Clean the dataset and create your subset of 10,000 articles.
- Save the dataset to
You use this file to create your input manifest file. For the active learning loop to start, 20% of the data must be labeled. To quickly test the active learning component, this post includes 20% of the original labels provided in the dataset in your input manifest.
You use this partially labeled dataset as the input to the active learning loop. This walkthrough demonstrates how the rest of the labels can be generated using active learning.
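As an illustration, a partially labeled input manifest in the JSON Lines format Ground Truth expects might be produced like the following sketch. The label attribute name `category` and the function name are assumptions for this example, not the notebook's exact code.

```python
import json
import random

def write_partial_manifest(articles, labels, out_path,
                           labeled_fraction=0.2, seed=42):
    """Write a JSON Lines manifest where ~20% of rows keep their label.

    Each line carries the article text under "source"; labeled rows also
    carry the label under the assumed attribute name "category".
    """
    rng = random.Random(seed)
    n_labeled = int(len(articles) * labeled_fraction)
    labeled_idx = set(rng.sample(range(len(articles)), n_labeled))
    with open(out_path, "w") as f:
        for i, (text, label) in enumerate(zip(articles, labels)):
            row = {"source": text}
            if i in labeled_idx:
                row["category"] = label  # keep the original label for 20% of rows
            f.write(json.dumps(row) + "\n")
```

The active learning workflow reads this file, treats rows with a `category` attribute as already labeled, and generates labels for the rest.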
The remainder of the notebook specifies resources that you need to create a labeling job configuration. These include the following:
- A label categories JSON file stored in your S3 bucket.
- A worker task template to create a worker UI.
- A work team ARN. You can customize this to use a public (Amazon Mechanical Turk), private, or vendor work team. To use a private workforce, set
- Pre- and post-Lambda function ARNs.
- A task title, description, and keywords to help workers find and complete your task.
You use these resources in your human_task_config JSON.
- Generate the JSON with the following code:
- Copy the result. You use this to start your active learning workflow in Step Functions.
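For reference, the human_task_config JSON assembles these resources into the shape the CreateLabelingJob API expects. The following is a sketch only; all ARNs, S3 URIs, and task strings shown are placeholders that the notebook derives from the resources you created above.

```python
import json

# Placeholder values -- in the notebook these come from the work team,
# worker task template, and Lambda functions you created above.
human_task_config = {
    "WorkteamArn": "arn:aws:sagemaker:us-east-1:111122223333:workteam/private-crowd/my-team",
    "UiConfig": {"UiTemplateS3Uri": "s3://my-bucket/worker-template.liquid.html"},
    "PreHumanTaskLambdaArn": "arn:aws:lambda:us-east-1:123456789012:function:PRE-TextMultiClass",
    "AnnotationConsolidationConfig": {
        "AnnotationConsolidationLambdaArn": "arn:aws:lambda:us-east-1:123456789012:function:ACS-TextMultiClass"
    },
    "TaskTitle": "News article classification",
    "TaskDescription": "Select the category that best matches the article.",
    "TaskKeywords": ["text", "classification"],
    "NumberOfHumanWorkersPerDataObject": 1,
    "TaskTimeLimitInSeconds": 300,
}
print(json.dumps(human_task_config, indent=2))
```

This printed JSON is what you paste into the Step Functions console to start the workflow.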
Starting an active learning workflow
After you generate a JSON to configure your labeling job request using the notebook, you can use it to start an active learning workflow. Your CloudFormation stack is built to use the Amazon SageMaker BlazingText algorithm by default.
To start your active learning workflow with BlazingText, complete the following steps:
- On the AWS Step Functions console, choose State Machines. Choose the state machine ActiveLearningLoop-*, where * is the name you used when you launched your CloudFormation stack.
- Optionally, give your active learning workflow an execution name.
- Paste the JSON that you copied from the notebook in the Input – optional code block.
- Choose Start execution.
Monitoring your active learning workflow
Because you started with 20% of your data labeled, this starts your active learning workflow. To monitor the progress of your workflow, complete the following steps:
- On the AWS Step Functions console, choose State Machines.
- Choose the state machine ActiveLearningLoop-*.
In the Executions section, you have a list of active learning workflows and their statuses.
- To see the state of your workflow, choose your workflow from the list.
You can monitor the status in the Visual workflow section.
When the data labeling is complete, the labeled data is written to an output manifest file in your S3 bucket. The following diagram depicts what the active learning workflow looks like.
When all of the data has been labeled, all of the labels are exported to an output manifest file in your S3 bucket. The following diagram depicts what a complete active learning workflow looks like.
Part 2: Creating a custom model and integrating it into an active learning workflow
Part 2 demonstrates how you can bring your own custom training and inference algorithm to the active learning workflow you developed.
In this section, you use a Keras deep learning model to add a custom model to your active learning workflow. You train the model using 1,000 data points from the UCI dataset in an Amazon SageMaker Jupyter notebook. This model is intended for demonstration only; the remainder of this tutorial uses only its training and inference algorithm.
To complete this section, use Part 2 of the notebook.
This section assumes that you have performed the following steps:
- Completed the prerequisites in this post.
- Launched the CloudFormation stack provided.
- Completed the section Creating an input manifest and labeling job resources in Part 1 of this post.
Developing and containerizing the model
The methodology in this post to develop and containerize your model was inspired by the following GitHub repo on building, training, and deploying an ML model using a custom TensorFlow Docker container.
You can use the notebook to:
- Read and clean the dataset.
- Tokenize the data objects using the Keras Tokenizer class.
- Train a model on the tokenized dataset. This walkthrough trains a Keras deep learning model.
- Containerize the model in a Docker container.
- Push the container image to Amazon ECR. Amazon SageMaker retrieves this image for training and inference during the active learning workflow.
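A minimal Dockerfile for such a container might look like the following. This is a sketch under the standard SageMaker custom-container convention that the image is started with a `train` or `serve` argument; the base image and script names are assumptions, not the notebook's exact contents.

```dockerfile
# Assumed base image; the notebook's container may differ.
FROM tensorflow/tensorflow:2.11.0

# SageMaker runs the container as `docker run <image> train` for training
# jobs and `docker run <image> serve` for batch transform, so executables
# with those names must be on the PATH.
COPY train /usr/local/bin/train
COPY serve /usr/local/bin/serve
RUN chmod +x /usr/local/bin/train /usr/local/bin/serve

WORKDIR /opt/program
```

Training data, hyperparameters, and model artifacts are exchanged through the `/opt/ml` directory that SageMaker mounts into the container.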
The final code cell in the notebook prints your Docker image's ECR URI. You use it for training and inference across Amazon SageMaker.
Bringing your container to an active learning workflow
After you are confident that your algorithm meets your standards, use this section to replace the BlazingText algorithm you used in Part 1 with your custom ML model in your active learning workflow.
Step 1: Updating the container ECR reference
To update the container ECR reference, complete the following steps:
- On the AWS Lambda console, locate the Lambda function with the name *-PrepareForTraining-<###>, where * is the name you used when you launched your CloudFormation stack and <###> is a string of letters and numbers.
- Identify the existing algorithm specification, which looks like the following code:
- Change the TrainingImage value from 811284229777.dkr.ecr.us-east-1.amazonaws.com/blazingtext:latest to the output of the last print statement in the notebook.
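The algorithm specification inside that Lambda's training parameters is a JSON fragment along these lines. This is a sketch of the CreateTrainingJob AlgorithmSpecification shape, not the Lambda's full code; replace the image URI with your own.

```json
"AlgorithmSpecification": {
    "TrainingImage": "811284229777.dkr.ecr.us-east-1.amazonaws.com/blazingtext:latest",
    "TrainingInputMode": "File"
}
```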
Step 2: Changing the batch strategy
CreateTransformJob supports two batch strategies: MultiRecord and SingleRecord. For more information, see CreateTransformJob. The Amazon SageMaker BlazingText algorithm supports MultiRecord, whereas the Keras container used in this walkthrough supports SingleRecord. To change the batch strategy, complete the following steps:
- On the AWS Step Functions console, choose State Machines.
- Select the radio button next to the state machine ActiveLearning-*.
- Choose Edit.
- Look for the CreateTransformJob state, which is responsible for calling the batch transform job. See the following code:
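Within that state's definition, the batch strategy appears as a parameter of the transform job; change its value from MultiRecord to SingleRecord. The following is a sketch of the relevant fragment of the state machine definition, not the complete state:

```json
"CreateTransformJob": {
    "Type": "Task",
    "Parameters": {
        "BatchStrategy": "SingleRecord"
    }
}
```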
- Choose Save.
The following screenshot shows how the state machine console looks after you make this edit.
After these steps, repeat the steps in Part 1 to start the active learning workflow. Specifically, you will need to start your active learning workflow using the Start an Active Learning Workflow section.
After the successful completion of the state machine, you have generated labeled data using both machine and human labels with your custom training and inference algorithms.
To avoid incurring future charges, stop and delete the notebook instance used for the walkthrough.
Also, on the Amazon SageMaker console, stop any training (in the Training section), transform (in the Inference section), or labeling (in the Ground Truth section) jobs created while completing this tutorial.
In Part 1 of this post, 52% of the data was human labeled and 48% was auto-labeled. These percentages are dependent on the specific training algorithm, inference algorithm, and active learning logic that you use. When you implement your own active learning workflow, these numbers may vary depending on the algorithm and active learning logic that you use, and the level of label accuracy that you require.
When you bring your own model to the active learning workflow, for better results (a higher percentage of data getting auto-labeled) ensure that your model continues to perform as expected when noise is present in the labeled data used to train the model.
In this post, you created an active learning workflow and used it to produce high-quality labels from both ML model inferences and human workers.
You can use this workflow for a variety of custom labeling tasks to reduce the cost of labeling large datasets. You can bring any custom learning algorithm and active learning logic and alter this example to suit your needs. To get started and preview the active learning workflow using BlazingText, launch the CloudFormation stack and complete Part 1.
About the Authors
Koushik Kalyanaraman is a Software Development Engineer on the SageMaker Ground Truth team. In his spare time, he aspires to develop his own board game.
Andrea Morandi is an AI/ML specialist solution architect in the Strategic Specialist team. He helps customers to deliver and optimize ML applications on AWS. Andrea holds a Ph.D. in Astrophysics from the University of Bologna (Italy), he lives with his wife in the Bay area, and in his free time he likes hiking.
Talia Chopra is a Technical Writer in AWS specializing in machine learning and artificial intelligence. She has worked with multiple teams in AWS to create technical documentation and tutorials for customers using Amazon SageMaker, MxNet, and AutoGluon. In her free time, she enjoys meditating and taking walks in nature.