Forecasting is important for many businesses and industries; moving ahead without a well-grounded estimate of the future can have serious consequences. Product planning, financial forecasting, and weather forecasting all build scientific estimates on hard data and critical analysis. Time-series forecasting decomposes historical data into a baseline, a trend, and seasonality, if any.
The Amazon SageMaker DeepAR forecasting algorithm is a supervised machine learning algorithm for forecasting time series. The algorithm uses recurrent neural networks (RNNs) to produce both point and probabilistic forecasts. You can use DeepAR to forecast a single scalar (one-dimensional) time series, or train one model across hundreds of related time series simultaneously. The model can even predict new time series that are related to the series on which it was trained.
To illustrate time series forecasting, I use the DeepAR algorithm to analyze Chicago’s Speed Camera Violation dataset. The dataset is hosted by Data.gov and managed by the U.S. General Services Administration, Technology Transformation Service. These violations are captured by camera systems and available on the Chicago Data Portal. You can use the dataset to discern patterns in the data and gain meaningful insights.
The dataset contains multiple camera locations and daily violation counts. If you imagine each daily violation for a camera as one time series, you can use the DeepAR algorithm to train a model for multiple streets simultaneously and predict violations for multiple street cameras.
This analysis can identify streets where motorists are most likely to drive above the speed limit at different times of year, and any seasonality in the data. This could help cities to implement proactive measures to reduce speed, create alternate routes, and increase safety.
The code for this notebook is available on the GitHub repo.
Creating a Jupyter notebook
Before you get started, create an Amazon SageMaker Jupyter notebook instance. For this post, I use an ml.m4.xlarge notebook instance and the built-in python3 kernel.
Importing the necessary libraries, downloading and visualizing the data
Download the data to the Jupyter notebook instance and upload it to an Amazon Simple Storage Service (Amazon S3) bucket. You use the addresses, violation dates, and violation counts to train the model. The code and output below show how to download the dataset and display a few rows of the four relevant columns.
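A minimal sketch of the download step with pandas. The export URL is an assumption based on the Chicago Data Portal's CSV export convention (the dataset ID `hhkd-xvj4` and the exact column names are assumptions to verify against the portal), and `load_violations` is a hypothetical helper name:

```python
import pandas as pd

# Assumed CSV export URL for the Speed Camera Violations dataset on the
# Chicago Data Portal; verify the dataset ID before relying on it.
DATA_URL = "https://data.cityofchicago.org/api/views/hhkd-xvj4/rows.csv?accessType=DOWNLOAD"

def load_violations(path_or_url):
    """Load the raw CSV and keep only the columns used in this walkthrough."""
    df = pd.read_csv(path_or_url, parse_dates=["VIOLATION DATE"])
    return df[["ADDRESS", "CAMERA ID", "VIOLATION DATE", "VIOLATIONS"]]

# violations = load_violations(DATA_URL)  # downloads the full dataset
# violations.head()
```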
Visualizing the data with matplotlib
In this step, you convert the violation date column in the data frame from strings to dates, fill in missing violation values as 0 for each camera, and use matplotlib to visualize the violations on each street as a time series. See the following code:
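A sketch of this step, assuming the `violations` DataFrame from the previous section; `to_timeseries` is a hypothetical helper that pivots the data into one daily series per address and makes missing days explicit zeros:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit inside a notebook
import matplotlib.pyplot as plt
import pandas as pd

def to_timeseries(df):
    """Pivot to one daily violation series per address, missing days as 0."""
    ts = df.pivot_table(index="VIOLATION DATE", columns="ADDRESS",
                        values="VIOLATIONS", aggfunc="sum")
    # Reindex onto a complete daily range so gaps become explicit zeros.
    full_range = pd.date_range(ts.index.min(), ts.index.max(), freq="D")
    return ts.reindex(full_range).fillna(0)

# ts = to_timeseries(violations)
# ts.plot(figsize=(20, 12), legend=False)
# plt.xlabel("Date"); plt.ylabel("Daily violations")
```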
The graph below shows all of the data in the dataset as time series, with daily speeding violation counts on the y-axis plotted against the date on the x-axis.
Splitting the dataset for training and evaluation
You can now split the data into training and test sets. I hold out the last 30 days of data as a test set to evaluate the model's predictions; the training job never sees those 30 days. You convert the dataset from pandas Series to JSON objects, and use the test dataset to check the quality of the trained model's predictions. The code below demonstrates the split and the creation of the training and test datasets.
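A sketch of the split and serialization, assuming the `ts` DataFrame of daily series from the previous step; `series_to_obj`, `split_train_test`, and `write_jsonlines` are hypothetical helper names, and the JSON shape follows DeepAR's documented input format (`start`, `target`, optional `cat`):

```python
import json
import pandas as pd

prediction_length = 30  # hold out the last 30 days for evaluation

def series_to_obj(ts, cat=None):
    """Encode a pandas Series in the JSON format the DeepAR algorithm expects."""
    obj = {"start": str(ts.index[0]), "target": [float(x) for x in ts]}
    if cat is not None:
        obj["cat"] = cat
    return obj

def split_train_test(ts):
    """Training series stop prediction_length days early; test series are full."""
    train = [series_to_obj(ts[col].iloc[:-prediction_length]) for col in ts.columns]
    test = [series_to_obj(ts[col]) for col in ts.columns]
    return train, test

def write_jsonlines(objs, path):
    """Write one JSON object per line -- the file format DeepAR reads from S3."""
    with open(path, "w") as f:
        for obj in objs:
            f.write(json.dumps(obj) + "\n")

# train, test = split_train_test(ts)
# write_jsonlines(train, "train.json"); write_jsonlines(test, "test.json")
```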
Using managed Spot Instances and automatic model tuning for training
The Amazon SageMaker Python SDK provides a simple API for creating an automatic model tuning job. For this use case, I also use managed Spot Instances to reduce the cost of training. The tuning objective is to minimize the root mean square error (RMSE) on the test dataset. The HyperparameterTuner class in the Amazon SageMaker tuner package provides a simple interface for controlling how many training jobs to run, and how many to run in parallel, while searching for the optimal hyperparameters. I use 10 parallel jobs with the maximum number of jobs also set to 10; you could set this higher to explore more of the hyperparameter space, which could produce better results. The fit method kicks off the hyperparameter tuning job with the maximum training time set to 1 hour. See the following code:
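A sketch of the tuning setup under stated assumptions: the role, S3 paths, instance type, and hyperparameter ranges are illustrative placeholders, `make_and_run_tuner` is a hypothetical helper name, and the SDK calls follow the SageMaker Python SDK v2 API. The SageMaker import is deferred into the function so the sketch can be read (and its fixed hyperparameters inspected) without the SDK installed:

```python
# Fixed DeepAR hyperparameters shared by every tuning job: daily frequency
# and a 30-day prediction horizon, matching the 30-day test holdout.
FIXED_HYPERPARAMETERS = {
    "time_freq": "D",
    "prediction_length": "30",
    "context_length": "30",
}

def make_and_run_tuner(role, s3_train, s3_test, s3_output):
    # Imported here so the sketch is readable without the SageMaker SDK.
    import sagemaker
    from sagemaker.tuner import (HyperparameterTuner, IntegerParameter,
                                 ContinuousParameter)

    session = sagemaker.Session()
    estimator = sagemaker.estimator.Estimator(
        image_uri=sagemaker.image_uris.retrieve("forecasting-deepar",
                                                session.boto_region_name),
        role=role,
        instance_count=1,
        instance_type="ml.c4.2xlarge",   # placeholder training instance type
        use_spot_instances=True,          # managed Spot training for cost savings
        max_run=3600,                     # cap each training job at 1 hour
        max_wait=7200,                    # how long to wait for Spot capacity
        output_path=s3_output,
        sagemaker_session=session,
    )
    estimator.set_hyperparameters(**FIXED_HYPERPARAMETERS)

    tuner = HyperparameterTuner(
        estimator,
        objective_metric_name="test:RMSE",   # minimize RMSE on the test channel
        objective_type="Minimize",
        hyperparameter_ranges={              # illustrative ranges
            "epochs": IntegerParameter(10, 400),
            "num_cells": IntegerParameter(30, 100),
            "learning_rate": ContinuousParameter(1e-4, 1e-1),
            "mini_batch_size": IntegerParameter(32, 128),
        },
        max_jobs=10,
        max_parallel_jobs=10,
    )
    tuner.fit({"train": s3_train, "test": s3_test})
    return tuner
```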
Deploying the best model
The Amazon SageMaker Python SDK tuner.best_training_job API can identify the best tuning job, which you can use to deploy the model that minimized the tuning objective metric. With a single deploy API call, the best model identified by the automatic hyperparameter optimization job is deployed on an ml.m4.xlarge instance.
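A minimal sketch of the deployment step; `deploy_best_model` is a hypothetical wrapper around the tuner returned by the tuning job:

```python
def deploy_best_model(tuner):
    """Identify the best training job and deploy its model behind an endpoint."""
    best_job = tuner.best_training_job()   # name of the winning training job
    predictor = tuner.deploy(              # deploys the best model in one call
        initial_instance_count=1,
        instance_type="ml.m4.xlarge",
    )
    return best_job, predictor
```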
You can also see the part of the output below that shows the best model’s training and billable training time:
Amazon SageMaker managed Spot training created cost savings across all 10 training jobs. The preceding output indicates that the best model of the 10 saved over 64% of training costs compared to on-demand training instances.
Next, define a DeepARPredictor class that extends the sagemaker.predictor.RealTimePredictor class, along with helper functions to encode pandas Series objects for the request and a decode function to deserialize the response back into pandas objects. This class lets you implement a prediction method for the test dataset. See the following code:
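A sketch of the predictor and its helpers. The encode/decode functions work with plain pandas objects; the class itself is defined only when the SageMaker SDK is importable (RealTimePredictor is the SDK v1 name; v2 renamed it to Predictor), and the JSON request/response shape follows DeepAR's documented inference format:

```python
import json
import pandas as pd

def series_to_obj(ts, cat=None):
    """Encode a pandas Series as a DeepAR JSON instance."""
    obj = {"start": str(ts.index[0]), "target": [float(x) for x in ts]}
    if cat is not None:
        obj["cat"] = cat
    return obj

def decode_response(body, prediction_start, freq="D"):
    """Deserialize an endpoint response into DataFrames of forecast quantiles."""
    predictions = json.loads(body)["predictions"]
    frames = []
    for pred in predictions:
        df = pd.DataFrame(pred["quantiles"])
        df.index = pd.date_range(prediction_start, periods=len(df), freq=freq)
        frames.append(df)
    return frames

try:
    from sagemaker.predictor import RealTimePredictor

    class DeepARPredictor(RealTimePredictor):
        def __init__(self, *args, **kwargs):
            super().__init__(*args, content_type="application/json", **kwargs)

        def predict(self, ts_list, prediction_start,
                    quantiles=("0.1", "0.5", "0.9")):
            request = json.dumps({
                "instances": [series_to_obj(ts) for ts in ts_list],
                "configuration": {"num_samples": 100,
                                  "output_types": ["quantiles"],
                                  "quantiles": list(quantiles)},
            })
            return decode_response(super().predict(request), prediction_start)
except ImportError:
    pass  # SageMaker SDK not installed; the helpers above still work locally
```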
Visualizing the predictions
In the final step, I use the predictor object to make predictions for five sample streets from the test dataset and graph the predictions against the test data with an 80% confidence interval to see how well the model performed. See the following code and graph:
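A sketch of the plotting step, assuming each forecast is a DataFrame of "0.1", "0.5", and "0.9" quantile columns as produced by the decode helper above; `plot_forecast` is a hypothetical helper name:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit inside a notebook
import matplotlib.pyplot as plt
import pandas as pd

def plot_forecast(actual, quantiles_df, title=None, ax=None):
    """Plot the target series against the median forecast with an 80% band."""
    ax = ax or plt.gca()
    actual.plot(ax=ax, label="target")
    quantiles_df["0.5"].plot(ax=ax, label="prediction median")
    # The 0.1-0.9 quantile band is the 80% confidence interval.
    ax.fill_between(quantiles_df.index, quantiles_df["0.1"], quantiles_df["0.9"],
                    alpha=0.3, label="80% confidence interval")
    if title:
        ax.set_title(title)
    ax.legend()
    return ax

# for street, forecast in zip(sample_streets, forecasts):
#     plot_forecast(test_ts[street], forecast, title=street)
#     plt.show()
```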
The pattern displayed above shows that the predictions follow the target and test data within the 80% confidence interval. It also indicates that the 1111 N HUMBOLDT camera location spikes around the weekends of 01/31/2020, 02/08/2020, and 02/15/2020.
If you graph this data with all data points available in series, you can see a seasonal pattern with mid-year and summer months showing spikes in the speeding violations.
After you complete this walkthrough, make sure that you delete the predictor endpoint to avoid incurring charges in your AWS account. See the following code:
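A minimal clean-up sketch; `clean_up` is a hypothetical helper wrapping the predictor's delete calls:

```python
def clean_up(predictor):
    """Delete the model and endpoint so the account stops accruing charges."""
    predictor.delete_model()
    predictor.delete_endpoint()
```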
You should also delete your Amazon SageMaker notebook instance. For instructions, see Step 9: Clean Up.
In this post, I trained a model with the Amazon SageMaker DeepAR algorithm to predict speeding violations over time at multiple street addresses with different camera locations, and to identify seasonality in the data. With this data, the model could predict recurring violation patterns, such as spikes on weekends and during the summer months. Such analysis could help predict the streets where motorists are most likely to drive above the speed limit at different times of year. Cities can then implement proactive measures to reduce speed, create alternate routes to improve safety, and reduce congestion.
You can use the DeepAR algorithm and the solution in this post when your business needs to predict multiple related time series. For more information about the DeepAR algorithm, see How the DeepAR Algorithm Works. For more information about how the algorithm is designed, see DeepAR: Probabilistic Forecasting with Autoregressive Recurrent Networks.
About the Author
Viral Desai is a Solutions Architect with AWS. He provides architectural guidance to help customers achieve success in the cloud. In his spare time, Viral enjoys playing tennis and spending time with family.