Nice bucket challenge

Overview of data binning techniques

Robert Soczewica

One of many problems that may arise from time to time during the feature-engineering stage is how to aggregate numerical data into several intervals. This is frequently known as data bucketing. A good example of such grouping is the binning performed during histogram construction.

Histogram. Image from https://pl.wikipedia.org/wiki/Histogram

Sometimes we may want to transform our data in a similar manner to a histogram, so that our model focuses on the behavior within particular buckets rather than on the whole variable domain. This can be desirable when there is a premise that particular intervals of the independent variable exert different influences on the dependent variable.

As an example, let’s imagine that we want to model whether it will snow today. Our dependent variable is therefore binary, with two levels: 1 – snow, 0 – no snow. For the independent variable, we will use only one feature: temperature.

We know that snow usually occurs when the temperature is at or below 0°C (32°F). If the temperature is higher, the probability of seeing snow is rather low. However, it is still possible for snowflakes to reach the surface above 0°C (32°F) when the air is sufficiently saturated.

On the other hand, during very cold weather, when the temperature is below -40°C (-40°F), the moisture capacity of the air is so low that little snow is likely to occur. Given that the chances of snow may be the same or very similar within certain temperature ranges, we may want to model the probability of snow for a given temperature bucket rather than the influence of every temperature change across the whole domain.
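To make this concrete, here is a minimal sketch of such bucketing with pandas. The cut points follow the reasoning above (-40°C and 0°C), while the temperature values and bucket labels are illustrative assumptions of mine:

```python
import pandas as pd

# Hypothetical temperature readings in degrees Celsius.
temps = pd.Series([-45.0, -12.3, -0.5, 1.2, 4.8, 15.0])

# Bucket the continuous feature using the domain knowledge above:
# below -40C snow is unlikely, at or below 0C it is most likely.
buckets = pd.cut(
    temps,
    bins=[float("-inf"), -40, 0, float("inf")],
    labels=["very_cold", "freezing", "above_freezing"],
)
print(buckets.value_counts())
```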


But how can we divide our variable’s domain if we don’t have any specific knowledge about the underlying phenomenon, or if we want to bucket the data in an automated fashion? The answer is simple: we can use one of the well-known binning techniques!

Most of us have surely used some of these techniques without even realizing it, as they serve as the default histogram-plotting algorithms. To understand how they work and what the main differences are, let’s consider a few of the most popular binning methods.

Note that each method attempts to determine an optimal number of bins k, but some calculate a suggested bin width h instead, which is then plugged into the following formula to obtain the number of bins:

k = ⌈(max x − min x) / h⌉

where k is the number of bins, x the variable values, and h the bin width. The braces ⌈ ⌉ denote the ceiling function.
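In code, the width-to-count conversion is a one-liner; here is a minimal sketch (the helper name bins_from_width is my own):

```python
import math

def bins_from_width(x, h):
    """Convert a suggested bin width h into a bin count: k = ceil((max(x) - min(x)) / h)."""
    return math.ceil((max(x) - min(x)) / h)
```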

Square-root choice

This is the fastest algorithm, as it takes into account only the data size. Due to its simplicity and speed, it is frequently used by many statistical packages to build histograms, including Excel.

k = ⌈√n⌉

where n is the number of observations.
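A minimal sketch; NumPy exposes the same estimator as bins="sqrt" in np.histogram_bin_edges:

```python
import math
import numpy as np

def sqrt_bins(x):
    """Square-root choice: k = ceil(sqrt(n))."""
    return math.ceil(math.sqrt(len(x)))

data = np.random.default_rng(0).normal(size=1000)
print(sqrt_bins(data))                                     # 32
print(len(np.histogram_bin_edges(data, bins="sqrt")) - 1)  # 32
```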

Freedman–Diaconis rule

This algorithm is usually a good starting point, as it is resilient to outliers. It was designed to minimize the difference between the area under the empirical probability distribution and the area under the theoretical probability distribution. The methodology utilizes the interquartile range to capture the data variability and also considers the data size:

h = 2 · IQR(x) / n^(1/3)
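A sketch of the rule; the resulting width can be fed to the bins_from_width helper above, and NumPy ships the same estimator as bins="fd":

```python
import numpy as np

def fd_width(x):
    """Freedman-Diaconis rule: h = 2 * IQR(x) / n^(1/3)."""
    x = np.asarray(x)
    q75, q25 = np.percentile(x, [75, 25])
    return 2 * (q75 - q25) / len(x) ** (1 / 3)
```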

Sturges’ rule

This methodology is derived from the binomial distribution and assumes that our data has a Gaussian shape; thus it will underestimate the number of bins for large non-normal distributions. Unlike the Freedman–Diaconis rule, Sturges’ rule does not account for data variability, only for data size:

k = ⌈log₂ n⌉ + 1

It can be observed that this method may perform poorly for a small number of observations (e.g. n < 30), because it will produce a small number of bins, which may not reflect the sample distribution properly. Note that this is only relevant when we are using the algorithm to produce a histogram, not when transforming our continuous variable into specific intervals.
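A minimal sketch (NumPy also offers this rule as bins="sturges"):

```python
import math

def sturges_bins(x):
    """Sturges' rule: k = ceil(log2(n)) + 1."""
    return math.ceil(math.log2(len(x))) + 1

print(sturges_bins(range(1000)))  # 11, since ceil(log2(1000)) = 10
```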

Doane’s formula

Doane’s formula is a modification of Sturges’ rule that attempts to improve its performance on non-normal data by also taking the shape of the distribution into account:

k = 1 + log₂ n + log₂(1 + |s| / σ_s)

where s is the estimated skewness of the variable distribution and:

σ_s = √(6(n − 2) / ((n + 1)(n + 3)))
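A sketch using SciPy’s skewness estimator (NumPy also exposes the rule as bins="doane"):

```python
import numpy as np
from scipy.stats import skew

def doane_bins(x):
    """Doane's formula: k = 1 + log2(n) + log2(1 + |s| / sigma_s)."""
    x = np.asarray(x)
    n = len(x)
    s = skew(x)  # estimated skewness of the sample
    sigma_s = np.sqrt(6 * (n - 2) / ((n + 1) * (n + 3)))
    return int(np.ceil(1 + np.log2(n) + np.log2(1 + abs(s) / sigma_s)))
```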

Rice rule

The Rice rule is an alternative to Sturges’ formula that also takes into account only the size of the data:

k = ⌈2 · n^(1/3)⌉

This method usually overestimates the number of required bins, so it may be useful when we particularly want to avoid ending up with an insufficient number of buckets.
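A minimal sketch (bins="rice" in NumPy):

```python
import math

def rice_bins(x):
    """Rice rule: k = ceil(2 * n^(1/3))."""
    return math.ceil(2 * len(x) ** (1 / 3))

print(rice_bins(range(1000)))  # 20 - roughly twice Sturges' 11 bins
```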

Scott’s rule

Last but not least is Scott’s rule. This formula takes into account both data variability and data size to form the optimal bin width h. It is quite good for large datasets but can be too conservative for smaller ones:

h = 3.49 · σ / n^(1/3)

where σ is the sample standard deviation.

The results are very similar to those obtained by the Freedman–Diaconis rule. However, because the standard deviation is not very robust to outliers (in contrast to the interquartile range), it may produce slightly different optimal bin widths, especially if we have a significant proportion of outliers in our data.
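A sketch of the rule; as with Freedman–Diaconis, the width can be converted to a bin count with bins_from_width, and NumPy provides the same estimator as bins="scott":

```python
import numpy as np

def scott_width(x):
    """Scott's rule: h = 3.49 * std(x) / n^(1/3)."""
    x = np.asarray(x)
    return 3.49 * x.std(ddof=1) / len(x) ** (1 / 3)
```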

