AWS is excited to introduce Amazon SageMaker Operators for Kubernetes, a new capability that makes it easier for developers and data scientists using Kubernetes to train, tune, and deploy machine learning (ML) models in Amazon SageMaker. Customers can install these Amazon SageMaker Operators on their Kubernetes cluster to create Amazon SageMaker jobs natively using the Kubernetes API and command-line Kubernetes tools such as ‘kubectl’.
Many AWS customers use Kubernetes, an open-source general-purpose container orchestration system, to deploy and manage containerized applications, often via a managed service such as Amazon Elastic Kubernetes Service (EKS). This enables data scientists and developers, for example, to set up repeatable ML pipelines and maintain greater control over their training and inference workloads. However, to support ML workloads these customers still need to write custom code to optimize the underlying ML infrastructure, ensure high availability and reliability, provide data science productivity tools, and comply with appropriate security and regulatory requirements. For example, when Kubernetes customers use GPUs for training and inference, they often need to change how Kubernetes schedules and scales GPU workloads in order to increase utilization, throughput, and availability. Similarly, for deploying trained models to production for inference, Kubernetes customers have to spend additional time in setting up and optimizing their auto-scaling clusters across multiple Availability Zones.
Amazon SageMaker Operators for Kubernetes bridges this gap, sparing customers the heavy lifting of integrating their Amazon SageMaker and Kubernetes workflows. Starting today, customers using Kubernetes can make a simple call to Amazon SageMaker, a modular and fully managed service that makes it easier to build, train, and deploy machine learning (ML) models at scale. With workflows in Amazon SageMaker, compute resources are pre-configured and optimized, provisioned only when requested, scaled as needed, and shut down automatically when jobs complete, offering near 100% utilization. Now with Amazon SageMaker Operators for Kubernetes, customers can continue to enjoy the portability and standardization benefits of Kubernetes and Amazon EKS, along with the many additional benefits that come out of the box with Amazon SageMaker, no custom code required.
Amazon SageMaker and Kubernetes
Machine learning is more than just the model. The ML workflow consists of sourcing and preparing data, building machine learning models, training and evaluating these models, deploying them to production, and ongoing post-production monitoring. Amazon SageMaker is a modular and fully managed service that helps data scientists and developers more quickly accomplish the tasks to build, train, deploy, and maintain models.
But the workflows related to building a model are often one part of a bigger pipeline that spans multiple engineering teams and services that support an overarching application. Kubernetes users, including Amazon EKS customers, deploy workloads by writing configuration files, which Kubernetes matches with available compute resources in the user’s Kubernetes cluster. While Kubernetes gives customers control and portability, running ML workloads on a Kubernetes cluster brings unique challenges. For example, the underlying infrastructure requires additional management such as optimizing for utilization, cost and performance; complying with appropriate security and regulatory requirements; and ensuring high availability and reliability. All of this undifferentiated heavy lifting takes away valuable time and resources from bringing new ML applications to market. Kubernetes customers want to control orchestration and pipelines without having to manage the underlying ML infrastructure and services in their cluster.
Amazon SageMaker Operators for Kubernetes addresses this need by bringing Amazon SageMaker and Kubernetes together. From Kubernetes, data scientists and developers get to use a fully managed service that is designed and optimized specifically for ML workflows. Infrastructure and platform teams retain control and portability by orchestrating workloads in Kubernetes, without having to manage the underlying ML infrastructure and services. To add new capabilities to Kubernetes, developers can extend the Kubernetes API by creating a custom resource that contains their application-specific or domain-specific logic and components. ‘Operators’ in Kubernetes allow users to natively invoke these custom resources and automate associated workflows. By installing SageMaker Operators for Kubernetes on your Kubernetes cluster, you can now add Amazon SageMaker as a ‘custom resource’ in Kubernetes. You can then use the following Amazon SageMaker Operators:
- Train – Train ML models in Amazon SageMaker, including with Managed Spot Training to save up to 90% in training costs, and with distributed training to reduce training time by scaling to multiple GPU nodes. You pay only for the duration of your job, achieving near 100% utilization.
- Tune – Tune model hyperparameters in Amazon SageMaker, including with Amazon EC2 Spot Instances, to save up to 90% in cost. Amazon SageMaker Automatic Model Tuning performs hyperparameter optimization to search the hyperparameter range for more accurate models, saving you days or weeks of time spent improving model accuracy.
- Inference – Deploy trained models in Amazon SageMaker to fully managed autoscaling clusters, spread across multiple Availability Zones, to deliver high performance and availability for real-time or batch prediction.
Each Amazon SageMaker Operator for Kubernetes provides you with a native Kubernetes experience for creating and interacting with your jobs, either with the Kubernetes API or with Kubernetes command-line utilities such as kubectl. Engineering teams can build automation, tooling, and custom interfaces for data scientists in Kubernetes by using these operators, all without building, maintaining, or optimizing ML infrastructure. Data scientists and developers familiar with Kubernetes can compose and interact with Amazon SageMaker training, tuning, and inference jobs natively, just as they would with Kubernetes jobs executing locally. Logs from Amazon SageMaker jobs stream back to Kubernetes, allowing you to natively view logs for your model training, tuning, and prediction jobs in your command line.
Using Amazon SageMaker Operators for Kubernetes with TensorFlow
This post demonstrates training a simple convolutional neural network model on the Modified National Institute of Standards and Technology (MNIST) dataset using the Amazon SageMaker Training Operator for Kubernetes. The MNIST dataset is a popular image-classification benchmark containing 60,000 training images and 10,000 test images of handwritten digits from 0 to 9.
This post performs the following steps:
- Install Amazon SageMaker Operators for Kubernetes on a Kubernetes cluster
- Create a YAML config for training
- Train the model using the Amazon SageMaker Operator
For this post, you need an existing Kubernetes cluster in EKS. For information about creating a new cluster in Amazon EKS, see Getting Started with Amazon EKS. You also need the following on the machine you use to control the Kubernetes cluster (for example, your laptop or an EC2 instance):
- Kubectl (version >= 1.13). Use a kubectl version that is within one minor version of your Kubernetes cluster’s control plane. For example, a 1.13 kubectl client works with Kubernetes 1.13 and 1.14 clusters. For more information, see Installing kubectl.
- AWS CLI (Version >=1.16.232). For more information, see Installing the AWS CLI version 1.
- AWS IAM Authenticator for Kubernetes. For more information, see Installing aws-iam-authenticator.
- Either existing IAM access keys for the operator to use or IAM permissions to create users, attach policies to users, and create access keys.
Setting up IAM roles and permissions
Before you deploy the operator to your Kubernetes cluster, associate an IAM role with an OpenID Connect (OIDC) provider for authentication. See the following code:
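One way to do this is with the eksctl utility; in this sketch, the cluster name and Region are placeholders to substitute with your own values:

```bash
# Associate an IAM OIDC identity provider with the cluster
# (cluster name and Region below are placeholders)
eksctl utils associate-iam-oidc-provider \
    --cluster my-cluster \
    --region us-east-1 \
    --approve
```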
The output confirms that an IAM OIDC identity provider has been associated with your cluster.
Now that your Kubernetes cluster in EKS has an OIDC identity provider, you can create a role and give it permissions. Obtain the OIDC issuer URL with the following command:
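For example, using the AWS CLI (the cluster name and Region are placeholders):

```bash
# Retrieve the OIDC issuer URL for the cluster
aws eks describe-cluster \
    --name my-cluster \
    --region us-east-1 \
    --query "cluster.identity.oidc.issuer" \
    --output text
```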
This command returns a URL of the form https://oidc.eks.<EKS Cluster region>.amazonaws.com/id/<OIDC ID>, where the final path segment is your OIDC ID.
Use the OIDC ID returned by the previous command to create your role. Create a new file named ‘trust.json’ with the following code block. Be sure to update the placeholders with your OIDC ID, AWS account number, and EKS cluster Region.
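As a sketch, the trust policy allows the cluster’s OIDC provider to assume the role via web identity federation (the account number, Region, and OIDC ID are placeholders):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Federated": "arn:aws:iam::<AWS account number>:oidc-provider/oidc.eks.<EKS Cluster region>.amazonaws.com/id/<OIDC ID>"
      },
      "Action": "sts:AssumeRoleWithWebIdentity",
      "Condition": {
        "StringEquals": {
          "oidc.eks.<EKS Cluster region>.amazonaws.com/id/<OIDC ID>:aud": "sts.amazonaws.com"
        }
      }
    }
  ]
}
```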
Now create a new IAM role:
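For example, using the trust policy defined above (the role name is a placeholder):

```bash
# Create an IAM role that the operator can assume from the cluster
aws iam create-role \
    --role-name <role name> \
    --assume-role-policy-document file://trust.json \
    --output text
```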
The output returns your role ARN, which you pass to the operator so it can securely invoke Amazon SageMaker from the Kubernetes cluster.
Finally, give this new role access to Amazon SageMaker by attaching the AmazonSageMakerFullAccess policy.
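Using the AWS CLI, that looks like the following (the role name is a placeholder):

```bash
# Attach the managed AmazonSageMakerFullAccess policy to the new role
aws iam attach-role-policy \
    --role-name <role name> \
    --policy-arn arn:aws:iam::aws:policy/AmazonSageMakerFullAccess
```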
Setting up the operator on your Kubernetes cluster
Use the Amazon SageMaker Operators from the GitHub repo by downloading a YAML configuration file that installs the operator for you.
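For example, you can fetch the role-based installer manifest from the operator’s GitHub repo (the path below reflects the repo layout at the time of writing; check the repo for the current location):

```bash
# Download the operator installer manifest from the GitHub repo
wget https://raw.githubusercontent.com/aws/amazon-sagemaker-operator-for-k8s/master/release/rolebased/installer.yaml
```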
In the installer.yaml file, update the eks.amazonaws.com/role-arn annotation with the ARN of the OIDC-based role you created in the previous step.
Now on your Kubernetes cluster, install the Amazon SageMaker CRD and set up your operators.
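With the role ARN in place, apply the manifest:

```bash
# Install the SageMaker custom resource definitions and the operator
kubectl apply -f installer.yaml
```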
Verify that Amazon SageMaker Operators are available in your Kubernetes cluster. See the following code:
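A quick sketch of the verification (the operator namespace shown is the repo’s default and may differ in newer releases):

```bash
# List the SageMaker custom resource definitions registered in the cluster
kubectl get crd | grep sagemaker

# Confirm the operator pod is running (namespace is the repo default)
kubectl get pods -n sagemaker-k8s-operator-system
```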
With these operators, all of Amazon SageMaker’s managed and secured ML infrastructure and software optimization at scale is now available as a custom resource in your Kubernetes cluster.
To view logs from Amazon SageMaker in your command line using kubectl, install the following client:
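The operator’s GitHub repo ships a kubectl plugin, kubectl-smlogs, for streaming these logs. A sketch of the installation on Linux or macOS follows; the download URL is a placeholder, so check the repo’s README for the binary location for your platform:

```bash
# Download the kubectl-smlogs plugin binary (URL is a placeholder; see
# the operator's GitHub README for the current per-platform location)
wget <smlogs-plugin-url-for-your-platform> -O kubectl-smlogs
chmod +x kubectl-smlogs

# Place it on your PATH so kubectl discovers it as a plugin
sudo mv kubectl-smlogs /usr/local/bin/

# Verify that kubectl discovers the plugin
kubectl smlogs --help
```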
Preparing your training job
Before you create a YAML config for your Amazon SageMaker training job, create a container that includes your Python training code, which is available in the tensorflow_distributed_mnist GitHub repo. Use TensorFlow GPU images provided by AWS Deep Learning Containers to create your Dockerfile. See the following code:
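A minimal sketch of such a Dockerfile, assuming a TensorFlow GPU training image from AWS Deep Learning Containers (the registry account, Region, and image tag are illustrative placeholders; check the Deep Learning Containers documentation for the image URI that matches your Region and framework version) and a train.py from the repo above:

```dockerfile
# Base image: a TensorFlow GPU training image from AWS Deep Learning
# Containers (account, Region, and tag are placeholders)
FROM 763104351884.dkr.ecr.us-east-1.amazonaws.com/tensorflow-training:1.15.2-gpu-py36-cu100-ubuntu18.04

# Copy the MNIST training script into the image
COPY train.py /opt/ml/code/train.py

# Tell the SageMaker training toolkit which script to run
ENV SAGEMAKER_PROGRAM train.py
```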
This post assumes you have uploaded the MNIST training dataset to an S3 bucket. Create a train.yaml YAML configuration file to start training. Specify TrainingJob, which is now a custom resource on your Kubernetes cluster, as the resource kind to train your model on Amazon SageMaker.
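A sketch of what train.yaml might look like; the image URI, role ARN, and bucket names are placeholders, and the exact spec fields follow the operator’s TrainingJob custom resource definition (see the GitHub repo for the authoritative schema):

```yaml
apiVersion: sagemaker.aws.amazon.com/v1
kind: TrainingJob
metadata:
  name: tf-mnist-training
spec:
  region: us-east-1
  # IAM role that SageMaker assumes to run the job (placeholder ARN)
  roleArn: arn:aws:iam::<AWS account number>:role/<SageMaker execution role>
  algorithmSpecification:
    # The training container you built and pushed to ECR (placeholder)
    trainingImage: <AWS account number>.dkr.ecr.us-east-1.amazonaws.com/tf-mnist:latest
    trainingInputMode: File
  inputDataConfig:
    - channelName: training
      dataSource:
        s3DataSource:
          s3DataType: S3Prefix
          s3Uri: s3://<your bucket>/mnist/train
          s3DataDistributionType: FullyReplicated
  outputDataConfig:
    s3OutputPath: s3://<your bucket>/mnist/output
  resourceConfig:
    instanceCount: 1
    instanceType: ml.p3.2xlarge
    volumeSizeInGB: 5
  stoppingCondition:
    maxRuntimeInSeconds: 86400
```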
Training the model
You can now start your training job by entering the following:
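Applying the config submits the job:

```bash
# Submit the TrainingJob resource to the cluster; the operator picks it
# up and starts the corresponding Amazon SageMaker training job
kubectl apply -f train.yaml
```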
The Amazon SageMaker Operator creates a training job in Amazon SageMaker using the specifications you provided in train.yaml. You can interact with this training job as you normally would in Kubernetes. See the following code:
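For example (the job name below is a placeholder matching whatever you set in your config’s metadata; the last command assumes the smlogs kubectl plugin is installed):

```bash
# List SageMaker training jobs as native Kubernetes resources
kubectl get trainingjob

# Inspect the status and events of a specific job (name is a placeholder)
kubectl describe trainingjob tf-mnist-training

# Stream the job's logs locally via the smlogs kubectl plugin
kubectl smlogs trainingjob tf-mnist-training
```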
After your training job is complete, any compute instances that were provisioned in Amazon SageMaker for this training job are terminated.
For additional examples, see the GitHub repo.
New Amazon SageMaker capabilities are now generally available
Amazon SageMaker Operators for Kubernetes are generally available as of this writing in US East (Ohio), US East (N. Virginia), US West (Oregon), and EU (Ireland) AWS Regions. For more information and step-by-step tutorials, see our user guide.
As always, please share your experience and feedback, or submit additional example YAML specs or Operator improvements. Let us know how you’re using Amazon SageMaker Operators for Kubernetes by posting on the AWS forum for Amazon SageMaker, creating issues in the GitHub repo, or reaching out through your usual AWS contacts.
About the Author
Aditya Bindal is a Senior Product Manager for AWS Deep Learning. He works on products that make it easier for customers to use deep learning engines. In his spare time, he enjoys playing tennis, reading historical fiction, and traveling.