# How to Use Quantile Transforms for Machine Learning

Numerical input variables may have a highly skewed or non-standard distribution.

This could be caused by outliers in the data, multi-modal distributions, highly exponential distributions, and more.

Many machine learning algorithms prefer or perform better when numerical input variables, and even output variables in the case of regression, have a standard probability distribution, such as a Gaussian (normal) or a uniform distribution.

The quantile transform provides an automatic way to transform a numeric input variable to have a different data distribution, which in turn, can be used as input to a predictive model.

In this tutorial, you will discover how to use quantile transforms to change the distribution of numeric variables for machine learning.

After completing this tutorial, you will know:

• Many machine learning algorithms prefer or perform better when numerical variables have a Gaussian or standard probability distribution.
• Quantile transforms are a technique for transforming numerical input or output variables to have a Gaussian or uniform probability distribution.
• How to use the QuantileTransformer to change the probability distribution of numeric variables to improve the performance of predictive models.

Let’s get started.

## Tutorial Overview

This tutorial is divided into five parts; they are:

1. Change Data Distribution
2. Quantile Transforms
3. Sonar Dataset
4. Normal Quantile Transform
5. Uniform Quantile Transform

## Change Data Distribution

Many machine learning algorithms perform better when the distribution of variables is Gaussian.

Recall that the observations for each variable may be thought to be drawn from a probability distribution. The Gaussian is a common distribution with the familiar bell shape. It is so common that it is often referred to as the “normal” distribution.

Some algorithms, like linear regression and logistic regression, explicitly assume the real-valued variables have a Gaussian distribution. Other nonlinear algorithms may not have this assumption, yet often perform better when variables have a Gaussian distribution.

This applies both to real-valued input variables in the case of classification and regression tasks, and real-valued target variables in the case of regression tasks.

Some input variables may have a highly skewed distribution, such as an exponential distribution where the most common observations are bunched together. Some input variables may have outliers that cause the distribution to be highly spread.

These concerns and others, like non-standard distributions and multi-modal distributions, can make a dataset challenging to model with a range of machine learning models.

As such, it is often desirable to transform each input variable to have a standard probability distribution, such as a Gaussian (normal) distribution or a uniform distribution.

## Quantile Transforms

A quantile transform will map a variable’s probability distribution to another probability distribution.

Recall that a quantile function, also called a percent-point function (PPF), is the inverse of the cumulative distribution function (CDF). A CDF is a function that returns the probability of observing a value at or below a given value. The PPF is the inverse of this function: given a probability, it returns the value at or below which that fraction of the distribution lies.
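The relationship between the CDF and the PPF can be checked numerically. The short sketch below uses the standard normal distribution from SciPy (an assumption; the tutorial itself does not name a library for this step) to show that applying the PPF to the output of the CDF recovers the starting value.

```python
from scipy.stats import norm

# CDF: probability of observing a value at or below 1.0 under a standard normal
p = norm.cdf(1.0)

# PPF (quantile function): the value at or below which a fraction p of the distribution lies
x = norm.ppf(p)

print(f"cdf(1.0) = {p:.4f}")
print(f"ppf({p:.4f}) = {x:.4f}")
# The median is the value with half of the probability mass at or below it
print(f"ppf(0.5) = {norm.ppf(0.5):.4f}")
```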

The quantile function ranks or smooths out the relationship between observations and can be mapped onto other distributions, such as the uniform or normal distribution.
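The ranking and mapping idea can be sketched by hand in two steps: convert each observation to its rank-based position in (0, 1), then pass that position through the PPF of the target distribution. This is a simplified illustration and not how scikit-learn implements it; the division by `n + 1` is an assumption made here to keep every value strictly inside the open interval.

```python
import numpy as np
from scipy.stats import norm, rankdata

rng = np.random.default_rng(1)
# Log-normal sample: Gaussian values pushed through exp(), giving a strong right skew
skewed = np.exp(rng.normal(size=1000))

# Step 1: empirical CDF from ranks, scaled into the open interval (0, 1)
uniform = rankdata(skewed) / (len(skewed) + 1)

# Step 2: map the uniform values through the normal quantile function (PPF)
gaussian = norm.ppf(uniform)

print(f"uniform range: {uniform.min():.3f} to {uniform.max():.3f}")
print(f"gaussian mean: {gaussian.mean():.3f}, std: {gaussian.std():.3f}")
```

Because both steps are monotonic, the ordering of the observations is preserved; only the spacing between them changes.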

The transformation can be applied to each numeric input variable in the training dataset and then provided as input to a machine learning model to learn a predictive modeling task.

This quantile transform is available in the scikit-learn Python machine learning library via the QuantileTransformer class.

The class has an “output_distribution” argument that can be set to “uniform” or “normal” and defaults to “uniform”.

It also provides an “n_quantiles” argument that determines the resolution of the mapping or ranking of the observations in the dataset. This should be set to a value no greater than the number of observations in the dataset and defaults to 1,000.

We can demonstrate the QuantileTransformer with a small worked example. We can generate a sample of random Gaussian numbers and impose a skew on the distribution by taking the exponential of each value. The QuantileTransformer can then be used to transform the dataset to another distribution, in this case back to a Gaussian distribution.

The complete example is listed below.
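A minimal sketch of such an example might look like the following, assuming NumPy, SciPy and scikit-learn are installed. The seed, the variable names and the use of the sample skewness as a numeric check are illustrative choices, not taken from the tutorial.

```python
import numpy as np
from scipy.stats import skew
from sklearn.preprocessing import QuantileTransformer

rng = np.random.default_rng(1)
# 1,000 Gaussian values, exponentiated to impose a strong right skew
data = np.exp(rng.normal(size=1000))
# scikit-learn transformers expect a 2D array of shape (n_samples, n_features)
data = data.reshape(-1, 1)

# Map the skewed sample onto a Gaussian output distribution
quantile = QuantileTransformer(n_quantiles=1000, output_distribution='normal')
data_trans = quantile.fit_transform(data)

# Sample skewness near 0 indicates a roughly symmetric distribution
print(f"skew before: {skew(data.ravel()):.3f}")
print(f"skew after:  {skew(data_trans.ravel()):.3f}")
print(f"mean after: {data_trans.mean():.3f}, std after: {data_trans.std():.3f}")
```

To see the shapes rather than the statistics, the two arrays can be passed to Matplotlib's `pyplot.hist()` before and after the transform.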

Running the example first creates a sample of 1,000 random Gaussian values and adds a skew to the dataset.

A histogram is created from the skewed dataset and clearly shows the distribution pushed to the far left.

Figure: Histogram of Skewed Gaussian Distribution

Then a QuantileTransformer is used to map the data distribution to a Gaussian and standardize the result, centering the values on a mean of 0 with a standard deviation of 1.0.

A histogram of the transformed data is created showing a Gaussian-shaped data distribution.

Figure: Histogram of Skewed Gaussian Data After Quantile Transform

In the following sections, we will take a closer look at how to use the quantile transform on a real dataset.

Next, let’s introduce the dataset.

## Sonar Dataset

The sonar dataset is a standard machine learning dataset for binary classification.

It involves 60 real-valued inputs and a two-class target variable. There are 208 examples in the dataset and the classes are reasonably balanced.

A baseline classification algorithm can achieve a classification accuracy of about 53.4 percent using repeated stratified 10-fold cross-validation. Top performance on this dataset is about 88 percent using repeated stratified 10-fold cross-validation.

The dataset describes sonar returns of rocks or simulated mines.

First, let’s load and summarize the dataset. The complete example is listed below.
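One way to load and summarize the data is sketched below with pandas. It assumes the dataset has been saved locally as a headerless CSV file with the 60 inputs first and the class label in the last column; the file name `sonar.csv` and the helper names `load_sonar` and `split_inputs_target` are hypothetical.

```python
import pandas as pd

def load_sonar(path):
    """Read the sonar CSV (assumed: no header row, class label in the last column)."""
    return pd.read_csv(path, header=None)

def split_inputs_target(df):
    """Separate the numeric input columns from the class label column."""
    data = df.values
    X = data[:, :-1].astype('float32')
    y = data[:, -1].astype('str')
    return X, y

# Example usage, assuming a local copy named 'sonar.csv':
# df = load_sonar('sonar.csv')
# print(df.shape)       # rows and columns
# print(df.describe())  # per-column statistical summary
# X, y = split_inputs_target(df)
```

With Matplotlib installed, calling `df.hist()` followed by `pyplot.show()` would draw one histogram per input variable.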

Running the example first summarizes the shape of the loaded dataset.

This confirms the 60 input variables, one output variable, and 208 rows of data.

A statistical summary of the input variables is provided showing that values are numeric and range approximately from 0 to 1.

Finally a histogram is created for each input variable.

If we ignore the clutter of the plots and focus on the histograms themselves, we can see that many variables have a skewed distribution.

The dataset provides a good candidate for using a quantile transform to make the variables more Gaussian.

Figure: Histogram Plots of Input Variables for the Sonar Binary Classification Dataset

Next, let’s fit and evaluate a machine learning model on the raw dataset.

We will use a k-nearest neighbor algorithm with default hyperparameters and evaluate it using repeated stratified k-fold cross-validation. The complete example is listed below.
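A sketch of this evaluation is shown below. The choice of 10 folds with 3 repeats and the seed are assumptions, and the helper name `evaluate_knn` is hypothetical. So that the block runs without the CSV file, it is demonstrated on a synthetic stand-in with the same shape as the sonar data; the sonar inputs and labels would be passed in its place.

```python
from numpy import mean, std
from sklearn.datasets import make_classification
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def evaluate_knn(X, y, n_splits=10, n_repeats=3, seed=1):
    """Mean and standard deviation of KNN accuracy under repeated stratified k-fold CV."""
    model = KNeighborsClassifier()  # default hyperparameters
    cv = RepeatedStratifiedKFold(n_splits=n_splits, n_repeats=n_repeats, random_state=seed)
    scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv)
    return mean(scores), std(scores)

# Synthetic stand-in (NOT the sonar data): 208 rows, 60 inputs, two classes
X, y = make_classification(n_samples=208, n_features=60, random_state=1)
acc_mean, acc_std = evaluate_knn(X, y)
print(f"Accuracy: {acc_mean:.3f} ({acc_std:.3f})")
```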

Running the example evaluates a KNN model on the raw sonar dataset.

We can see that the model achieved a mean classification accuracy of about 79.7 percent, showing that it has skill (better than 53.4 percent) and is in the ballpark of good performance (88 percent).

Next, let’s explore a normal quantile transform of the dataset.

## Normal Quantile Transform

It is often desirable to transform an input variable to have a normal probability distribution to improve the modeling performance.

We can apply the Quantile transform using the QuantileTransformer class and set the “output_distribution” argument to “normal“. We must also set the “n_quantiles” argument to a value less than the number of observations in the training dataset, in this case, 100.

Once defined, we can call the fit_transform() function and pass it our dataset to create a quantile-transformed version of the dataset.

Let’s try it on our sonar dataset.

The complete example of creating a normal quantile transform of the sonar dataset and plotting histograms of the result is listed below.
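The transform step could be sketched as follows. To keep the block runnable without the CSV file, it uses a synthetic array of exponentially distributed values with the same shape as the sonar inputs, which is an assumption made for illustration; the sonar input array would be passed to `fit_transform()` in the same way.

```python
import numpy as np
from scipy.stats import skew
from sklearn.preprocessing import QuantileTransformer

# Synthetic stand-in (NOT the sonar data): 208 rows of 60 right-skewed inputs
rng = np.random.default_rng(1)
X = rng.exponential(size=(208, 60))

# n_quantiles is kept below the number of rows, as described above
trans = QuantileTransformer(n_quantiles=100, output_distribution='normal')
X_trans = trans.fit_transform(X)

# Average absolute skewness across columns, before and after the transform
skew_before = np.abs(skew(X, axis=0)).mean()
skew_after = np.abs(skew(X_trans, axis=0)).mean()
print(f"mean |skew| before: {skew_before:.3f}")
print(f"mean |skew| after:  {skew_after:.3f}")
```

Wrapping `X_trans` in a pandas `DataFrame` and calling `hist()` on it is one way to produce the per-variable histograms.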

Running the example transforms the dataset and plots histograms of each input variable.

We can see that the shape of the histograms for each variable looks very Gaussian as compared to the raw data.

Figure: Histogram Plots of Normal Quantile Transformed Input Variables for the Sonar Dataset

Next, let’s evaluate the same KNN model as the previous section, but in this case on a normal quantile transform of the dataset.

The complete example is listed below.
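A sketch of this evaluation is given below. Placing the transform inside a scikit-learn `Pipeline` means it is fit only on the training folds during cross-validation, which avoids leaking information from the held-out fold. The fold settings, the step names and the helper name `evaluate_normal_quantile_knn` are assumptions, and the demonstration data is a synthetic stand-in rather than the sonar dataset.

```python
from numpy import mean, std
from sklearn.datasets import make_classification
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import QuantileTransformer

def evaluate_normal_quantile_knn(X, y, n_quantiles=100):
    """Accuracy of KNN applied after a normal quantile transform, via repeated stratified CV."""
    pipeline = Pipeline(steps=[
        ('t', QuantileTransformer(n_quantiles=n_quantiles, output_distribution='normal')),
        ('m', KNeighborsClassifier()),
    ])
    cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
    scores = cross_val_score(pipeline, X, y, scoring='accuracy', cv=cv)
    return mean(scores), std(scores)

# Synthetic stand-in (NOT the sonar data): 208 rows, 60 inputs, two classes
X, y = make_classification(n_samples=208, n_features=60, random_state=1)
acc_mean, acc_std = evaluate_normal_quantile_knn(X, y)
print(f"Accuracy: {acc_mean:.3f} ({acc_std:.3f})")
```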

Running the example, we can see that the normal quantile transform results in a lift in performance from 79.7 percent accuracy without the transform to about 81.7 percent with the transform.

Next, let’s take a closer look at the uniform quantile transform.

## Uniform Quantile Transform

Sometimes it can be beneficial to transform a highly exponential or multi-modal distribution to have a uniform distribution.

This is especially useful for data with a large and sparse range of values, e.g. outliers that are common rather than rare.

We can apply the transform by creating a QuantileTransformer instance and setting the “output_distribution” argument to “uniform” (the default).

The example below applies the uniform quantile transform and creates histogram plots of each of the transformed variables.
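A sketch of the uniform transform follows, again on a synthetic array of skewed values standing in for the sonar inputs (an assumption for illustration). Instead of drawing a plot, it prints the bin counts of one transformed column; roughly equal counts indicate a flat, uniform shape.

```python
import numpy as np
from sklearn.preprocessing import QuantileTransformer

# Synthetic stand-in (NOT the sonar data): 208 rows of 60 right-skewed inputs
rng = np.random.default_rng(1)
X = rng.exponential(size=(208, 60))

trans = QuantileTransformer(n_quantiles=100, output_distribution='uniform')
X_trans = trans.fit_transform(X)

# Histogram counts for the first column over ten equal-width bins on [0, 1]
counts, _ = np.histogram(X_trans[:, 0], bins=10, range=(0.0, 1.0))
print(f"range: {X_trans.min():.3f} to {X_trans.max():.3f}")
print("bin counts, column 0:", counts)
```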

Running the example transforms the dataset and plots histograms of each input variable.

We can see that the shape of the histograms for each variable looks very uniform compared to the raw data.

Figure: Histogram Plots of Uniform Quantile Transformed Input Variables for the Sonar Dataset

Next, let’s evaluate the same KNN model as the previous section, but in this case on a uniform quantile transform of the raw dataset.

The complete example is listed below.
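One way to set this up so that the raw, normal and uniform variants can be compared under identical cross-validation splits is sketched below. The helper name `evaluate`, the fold settings and the synthetic stand-in data (skewed by exponentiation) are all assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import QuantileTransformer

def evaluate(X, y, output_distribution=None):
    """Mean KNN accuracy, optionally preceded by a quantile transform."""
    steps = []
    if output_distribution is not None:
        steps.append(('t', QuantileTransformer(n_quantiles=100,
                                               output_distribution=output_distribution)))
    steps.append(('m', KNeighborsClassifier()))
    cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
    scores = cross_val_score(Pipeline(steps=steps), X, y, scoring='accuracy', cv=cv)
    return float(np.mean(scores))

# Synthetic stand-in (NOT the sonar data), exponentiated to skew every input
X, y = make_classification(n_samples=208, n_features=60, random_state=1)
X = np.exp(X)

results = {name: evaluate(X, y, dist)
           for name, dist in [('raw', None), ('normal', 'normal'), ('uniform', 'uniform')]}
for name, acc in results.items():
    print(f"{name}: {acc:.3f}")
```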

Running the example, we can see that the uniform transform results in a lift in performance from 79.7 percent accuracy without the transform to about 84.5 percent with the transform, better than the normal transform that achieved a score of 81.7 percent.

We chose the number of quantiles as an arbitrary number, in this case, 100.

This hyperparameter can be tuned to explore the effect of the resolution of the transform on the resulting skill of the model.

The example below performs this experiment and plots the mean accuracy for different “n_quantiles” values from 1 to 99.
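A sketch of such a sweep is shown below. The helper name `sweep_n_quantiles` and the fold settings are assumptions, and the demonstration runs on a synthetic stand-in with only a handful of values to keep it quick; passing the sonar arrays and `range(1, 100)` would cover the values described above.

```python
from numpy import mean
from sklearn.datasets import make_classification
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import QuantileTransformer

def sweep_n_quantiles(X, y, values):
    """Mean KNN accuracy for each candidate n_quantiles, using a uniform quantile transform."""
    results = {}
    for n in values:
        pipeline = Pipeline(steps=[
            ('t', QuantileTransformer(n_quantiles=n, output_distribution='uniform')),
            ('m', KNeighborsClassifier()),
        ])
        cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
        scores = cross_val_score(pipeline, X, y, scoring='accuracy', cv=cv)
        results[n] = float(mean(scores))
    return results

# Synthetic stand-in (NOT the sonar data): 208 rows, 60 inputs, two classes
X, y = make_classification(n_samples=208, n_features=60, random_state=1)
results = sweep_n_quantiles(X, y, [2, 4, 10, 25, 50, 99])
for n, acc in results.items():
    print(f">{n} {acc:.3f}")
```

The keys and values of the returned dictionary can be passed to Matplotlib's `pyplot.plot()` to draw the line plot of quantiles against accuracy.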

Running the example reports the mean classification accuracy for each value of the “n_quantiles” argument.

Surprisingly, smaller values resulted in better accuracy, with values such as 4 achieving an accuracy of about 85.4 percent.

A line plot is created showing the number of quantiles used in the transform versus the classification accuracy of the resulting model.

We can see a bump in accuracy for values less than 10, followed by a drop and then flat performance for larger values.

The results highlight that there is likely some benefit in exploring different distributions and numbers of quantiles to see if better performance can be achieved.

Figure: Line Plot of Number of Quantiles vs. Classification Accuracy of KNN on the Sonar Dataset

## Summary

In this tutorial, you discovered how to use quantile transforms to change the distribution of numeric variables for machine learning.

Specifically, you learned:

• Many machine learning algorithms prefer or perform better when numerical variables have a Gaussian or standard probability distribution.
• Quantile transforms are a technique for transforming numerical input or output variables to have a Gaussian or uniform probability distribution.
• How to use the QuantileTransformer to change the probability distribution of numeric variables to improve the performance of predictive models.

Do you have any questions?