In this tutorial, we will learn about the Pandas cut() function. This function bins values into separate intervals. It is mainly used to analyze scalar data.
Syntax and Documentation
Here are the parameters from the official documentation:
Parameter  Type  Description
x  array-like  The one-dimensional input array to be binned.
bins  int, sequence of scalars, or IntervalIndex  The criteria to bin by. Sequence of scalars: the bin edges, allowing for non-uniform width. Doesn’t extend the range of x.
right  bool, default True  Whether bins includes the rightmost edge. If right == True (default), bins [1, 2, 3, 4] indicate the intervals (1,2], (2,3], (3,4]. Ignored when bins is an IntervalIndex.
labels  array or False, default None  Specifies the labels for the returned bins. Must be the same length as the resulting bins. If False, returns only integer indicators of the bins. This affects the type of the output container (see below). Ignored when bins is an IntervalIndex.
retbins  bool, default False  Whether to return the bins. Useful if bins is a scalar.
precision  int, default 3  Precision at which to store and display the bin labels.
include_lowest  bool, default False  Whether the first interval should be left-inclusive.
duplicates  {default ‘raise’, ‘drop’}, optional  If bin edges are not unique, raise a ValueError or drop the non-unique edges.
ordered  bool, default True  Whether the labels are ordered. Applies to the returned types Categorical and Series (with Categorical dtype). If True, the resulting categorical will be ordered. If False, the resulting categorical will be unordered and labels must be provided.
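The “duplicates” parameter is easiest to understand with a small experiment. The following is a minimal sketch; the series “scores” and the bin edges are made up for illustration and simply repeat the edge 4:

```python
import pandas as pd

# Illustrative data: the same scores that are used later in this tutorial.
scores = pd.Series([1, 6, 4, 8, 5, 10])

# The edge 4 appears twice, so the default duplicates='raise' rejects the bins.
try:
    pd.cut(scores, bins=[0, 4, 4, 10])
    raised = False
except ValueError:
    raised = True

# duplicates='drop' removes the repeated edge, leaving the edges 0, 4, and 10.
binned = pd.cut(scores, bins=[0, 4, 4, 10], duplicates='drop')

print(raised)
print(binned.cat.categories)
```

With the repeated edge dropped, only two intervals remain.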
Returns  Type  Description
out  Categorical, Series, or ndarray  An array-like object representing the respective bin for each value of x. The type depends on the value of labels. Sequence of scalars: returns a Series for Series x or a Categorical for all other inputs.
bins  numpy.ndarray or IntervalIndex  The computed or specified bins. Only returned when retbins=True. For scalar or sequence bins, this is an ndarray with the computed bins. If duplicates='drop' is set, non-unique bin edges are dropped. For an IntervalIndex bins, this is equal to bins.
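To see how “retbins” and “labels” change what comes back, here is a short sketch. The variable names are only illustrative, and the input is a Series, so the binned result is a Series as well:

```python
import pandas as pd

scores = pd.Series([1, 6, 4, 8, 5, 10])

# retbins=True returns a tuple: the binned values and the bin edges.
binned, edges = pd.cut(scores, bins=3, retbins=True)

# labels=False returns the integer position of each value's bin
# instead of an interval.
codes = pd.cut(scores, bins=3, labels=False)

print(type(binned).__name__)  # Series
print(edges)
print(codes.tolist())         # [0, 1, 0, 2, 1, 2]
```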
Basic Example
To get to know the cut() function, we will start with an introductory example, on which we will build in the following sections:
import pandas as pd
df = pd.DataFrame({'Diver': ['Dave', 'Alice', 'Mary', 'John', 'Jane', 'Bob'], 'Score': [1, 6, 4, 8, 5, 10]})
print(df)
Diver  Score  
0  Dave  1 
1  Alice  6 
2  Mary  4 
3  John  8 
4  Jane  5 
5  Bob  10 
First, we import the Pandas library. Then we create a Pandas data frame with two columns: a “Diver” column with string values and a “Score” column with integer values.
The printed data frame shows a dataset with six different divers and their respective scores.
Now, we apply the cut() function:
pd.cut(x = df['Score'], bins = 3)
0  (0.991, 4.0] 
1  (4.0, 7.0] 
2  (0.991, 4.0] 
3  (7.0, 10.0] 
4  (4.0, 7.0] 
5  (7.0, 10.0] 
Name: Score, dtype: category 
Categories (3, interval[float64, right]): [(0.991, 4.0] < (4.0, 7.0] < (7.0, 10.0]]
The cut() function provides lots of parameters. Two of those are mandatory.
The first one is the parameter “x”, which expects the one-dimensional, array-like data that we want to bin. In the example, we pass the “Score” column from our data frame. The second necessary parameter is “bins”. This one expects either the number of bins as an integer value or a list of bin edges. In the example, we assign “3” to the “bins” parameter to state that we want to create three equal-sized intervals.
The output shows the interval for each score. For example, Alice’s score is “6” and is assigned the interval “(4.0, 7.0]” because 6 lies within this range.
But how were these intervals calculated? By assigning the “bins” parameter the value “3”, we state that we want three equal-sized intervals. The intervals are calculated like this: we take the maximum value of the scores (which is “10”) and the minimum value (which is “1”). We subtract these values (10 – 1 = 9) and divide the result by the number of intervals, which we defined as “3” (9 / 3 = 3).
In short: (maximum value – minimum value) / number of intervals.
That way, we get the size of an interval which is 3 in our example. We already looked at Alice’s score which is 6 and lies in the interval “(4.0, 7.0]”. We can see that the difference between 7.0 and 4.0 is indeed 3.
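The calculation above can be checked against the edges that Pandas actually uses. This is a small sketch; “n_bins” and “width” are illustrative names, and “retbins=True” is used only to get the edges back:

```python
import pandas as pd

scores = pd.Series([1, 6, 4, 8, 5, 10])
n_bins = 3

# (maximum value - minimum value) / number of intervals
width = (scores.max() - scores.min()) / n_bins

# Ask Pandas for the bin edges and compute the width of each interval.
_, edges = pd.cut(scores, bins=n_bins, retbins=True)
widths = edges[1:] - edges[:-1]

print(width)  # 3.0
print(widths)
```

The second and third intervals are exactly 3 wide. The first one is slightly wider because its left edge is moved a little below the minimum.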
But why does the lowest interval start at “0.991” rather than “1.0”, although the lowest value is 1? The intervals here are half-open: “(0.991, 4.0]” contains all values greater than 0.991 and less than or equal to 4.0. If the interval were “(1.0, 4.0]”, the value “1” would not be included. To avoid that, Pandas lowers the first edge by 0.1% of the data range, here 0.001 × (10 – 1) = 0.009, which gives 1 – 0.009 = 0.991.
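We can reproduce the value 0.991 by hand. The sketch below assumes the behavior described in the Pandas documentation for an integer “bins” value, namely that the range is extended by 0.1% so that the minimum is included; “adjustment” and “lowest_edge” are illustrative names:

```python
import pandas as pd

scores = pd.Series([1, 6, 4, 8, 5, 10])

# 0.1% of the data range: 0.001 * (10 - 1) = 0.009
adjustment = 0.001 * (scores.max() - scores.min())
lowest_edge = scores.min() - adjustment

# Compare with the first edge that Pandas reports.
_, edges = pd.cut(scores, bins=3, retbins=True)

print(lowest_edge)
print(edges[0])
```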
The output also shows the order of the intervals.
To see more clearly which interval belongs to which score, we can create a new column and add it to the data frame:
df['Interval'] = pd.cut(x = df['Score'], bins = 3)
Diver  Score  Interval  
0  Dave  1  (0.991, 4.0] 
1  Alice  6  (4.0, 7.0] 
2  Mary  4  (0.991, 4.0] 
3  John  8  (7.0, 10.0] 
4  Jane  5  (4.0, 7.0] 
5  Bob  10  (7.0, 10.0] 
We applied the cut() function the same way as before. But this time, we assigned the result to a new column labeled “Interval”. The data frame now shows all divers, scores, and the respective intervals in a clear way.
Change the Intervals
In the previous section, we applied the cut() function using three intervals by assigning the “bins” parameter the value “3”.
Let’s now assign the “bins” parameter another value, for example, “5”:
df['Interval'] = pd.cut(x = df['Score'], bins = 5)
Diver  Score  Interval  
0  Dave  1  (0.991, 2.8] 
1  Alice  6  (4.6, 6.4] 
2  Mary  4  (2.8, 4.6] 
3  John  8  (6.4, 8.2] 
4  Jane  5  (4.6, 6.4] 
5  Bob  10  (8.2, 10.0] 
As before, we create an “Interval” column and add it to the initial data frame to see immediately which score is assigned to which interval.
The only thing we change here is that we set the “bins” parameter equal to “5”. That way, we now have five equal-sized intervals. The length of each interval is calculated as follows:
(maximum value – minimum value) / number of intervals => (10 – 1) / 5 = 1.8
As we can see, each interval indeed has the length 1.8, except for the lowest interval “(0.991, 2.8]”. It starts at “0.991”, just like in the previous section, because the intervals are half-open and the value “1” would otherwise not be included in this interval.
Apart from an integer value, we can also assign the “bins” parameter a list of scalar values. This way, we determine the interval boundaries directly:
df['Interval'] = pd.cut(x = df['Score'], bins=[0,2,4,6,8,10])
Diver  Score  Interval  
0  Dave  1  (0, 2] 
1  Alice  6  (4, 6] 
2  Mary  4  (2, 4] 
3  John  8  (6, 8] 
4  Jane  5  (4, 6] 
5  Bob  10  (8, 10] 
The list “[0,2,4,6,8,10]” creates the intervals “(0,2]”, “(2,4]”, “(4,6]”, “(6,8]”, and “(8,10]”.
This way, we specify how many intervals we want to get and how long each interval should be.
In this example, we created intervals that all have the same length. However, this does not have to be the case. We can stipulate the interval lengths in any way we want:
df['Interval'] = pd.cut(x = df['Score'], bins=[0,4,5,6,10])
Diver  Score  Interval  
0  Dave  1  (0, 4] 
1  Alice  6  (5, 6] 
2  Mary  4  (0, 4] 
3  John  8  (6, 10] 
4  Jane  5  (4, 5] 
5  Bob  10  (6, 10] 
Here, we assigned the “bins” parameter a different list. The resulting intervals do not all have the same length.
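Once the scores are binned, a natural next step is to count how many scores fall into each interval. The following sketch does that with “value_counts()”, which is not part of the original example and is shown only as one possible way to inspect the bins:

```python
import pandas as pd

scores = pd.Series([1, 6, 4, 8, 5, 10])
binned = pd.cut(scores, bins=[0, 4, 5, 6, 10])

# Count the scores per interval and list the intervals in ascending order.
counts = binned.value_counts().sort_index()

print(counts)
```

The two wider intervals at the ends each hold two scores, the two narrow ones in the middle hold one score each.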
If we define the intervals ourselves using the “bins” parameter, some values from the data frame may not lie in any of them:
df['Interval'] = pd.cut(x = df['Score'], bins=[0,4,5,6])
Diver  Score  Interval  
0  Dave  1  (0.0, 4.0] 
1  Alice  6  (5.0, 6.0] 
2  Mary  4  (0.0, 4.0] 
3  John  8  NaN 
4  Jane  5  (4.0, 5.0] 
5  Bob  10  NaN 
The scores “8” and “10” do not lie within any of the given intervals. Pandas handles these cases by assigning these values “NaN” instead of an interval. When that happens, we know that our intervals do not cover the whole range of the data.
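To find the affected rows quickly, we can filter for missing values in the new column. This is a minimal sketch; the variable “uncovered” is an illustrative name:

```python
import pandas as pd

df = pd.DataFrame({'Diver': ['Dave', 'Alice', 'Mary', 'John', 'Jane', 'Bob'],
                   'Score': [1, 6, 4, 8, 5, 10]})
df['Interval'] = pd.cut(x=df['Score'], bins=[0, 4, 5, 6])

# Rows whose score fell outside every interval have a missing value
# in the 'Interval' column.
uncovered = df[df['Interval'].isna()]

print(uncovered['Diver'].tolist())  # ['John', 'Bob']
```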
Include the Leftmost or the Rightmost Edge
In the examples we have seen so far, the intervals were always structured like this: “(x, y]”. That way, the rightmost edge is included in the interval. That’s because the “right” parameter of the cut() function is set to “True” by default.
If we change that parameter and set it to “False”, this is what happens:
df['Interval'] = pd.cut(x = df['Score'], bins = 3, right=False)
Diver  Score  Interval  
0  Dave  1  [1.0, 4.0) 
1  Alice  6  [4.0, 7.0) 
2  Mary  4  [4.0, 7.0) 
3  John  8  [7.0, 10.009) 
4  Jane  5  [4.0, 7.0) 
5  Bob  10  [7.0, 10.009) 
We set the “bins” parameter to “3” like in the first example, so we get three equal-sized intervals. But now, the intervals are structured like this: “[x, y)”. The left edge of each interval is now included and the right edge is not.
Thus, the lowest interval now looks like this: “[1.0, 4.0)”, instead of “(0.991, 4.0]”. The value “1” is included because the left edge is closed.
Likewise, the highest interval now looks like this: “[7.0, 10.009)”. Its right edge is raised slightly above 10, so that the value “10” is still included in this interval.
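The effect of “right” is easiest to see for a score that sits exactly on an edge, such as Mary’s score of 4. The sketch below compares both settings; the variable names are illustrative:

```python
import pandas as pd

scores = pd.Series([1, 6, 4, 8, 5, 10])

right_closed = pd.cut(scores, bins=3)              # right=True is the default
left_closed = pd.cut(scores, bins=3, right=False)

# Position 2 holds the score 4, which lies exactly on a bin edge.
print(right_closed.iloc[2])  # (0.991, 4.0]
print(left_closed.iloc[2])   # [4.0, 7.0)
```

With the default setting, 4 closes the lowest interval. With “right=False”, it opens the middle one.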
Label the Intervals
We can label the intervals using the “labels” parameter of the cut() function. This way, we can categorize each score:
df['Interval'] = pd.cut(x = df['Score'], bins = 3, labels=['bad', 'good', 'exceptional'])
Diver  Score  Interval  
0  Dave  1  bad 
1  Alice  6  good 
2  Mary  4  bad 
3  John  8  exceptional 
4  Jane  5  good 
5  Bob  10  exceptional 
Again, we created three equal-sized intervals. But this time, we labeled each interval. The lowest interval is labeled “bad”, the middle interval is labeled “good”, and the highest interval is labeled “exceptional”.
By doing that, we categorize and evaluate each score.
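Because the “ordered” parameter is “True” by default, the labels keep the order of their intervals, so they can be compared. The following sketch filters for all divers who scored better than “bad”; the variable “better_than_bad” is an illustrative name:

```python
import pandas as pd

df = pd.DataFrame({'Diver': ['Dave', 'Alice', 'Mary', 'John', 'Jane', 'Bob'],
                   'Score': [1, 6, 4, 8, 5, 10]})
df['Interval'] = pd.cut(x=df['Score'], bins=3,
                        labels=['bad', 'good', 'exceptional'])

# The categorical is ordered: 'bad' < 'good' < 'exceptional'.
better_than_bad = df[df['Interval'] > 'bad']

print(better_than_bad['Diver'].tolist())  # ['Alice', 'John', 'Jane', 'Bob']
```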
Include the Lowest Value
Imagine we create the following intervals:
df['Interval'] = pd.cut(x = df['Score'], bins = [1,3,5,7,9,11])
Diver  Score  Interval  
0  Dave  1  NaN 
1  Alice  6  (5.0, 7.0] 
2  Mary  4  (3.0, 5.0] 
3  John  8  (7.0, 9.0] 
4  Jane  5  (3.0, 5.0] 
5  Bob  10  (9.0, 11.0] 
We can see that Dave’s score is not included in any interval. That’s because the “right” parameter is set to “True” by default, so the left edge of each interval is excluded. Thus, the score “1” is not included in the interval “(1.0, 3.0]”.
How can we include the score “1” without changing the “right” parameter, so that the right edge of each interval stays included?
We achieve that by applying the “include_lowest” parameter. By assigning that parameter the value “True”, we include the lowest value:
df['Interval'] = pd.cut(x = df['Score'], bins = [1,3,5,7,9,11], include_lowest=True)
Diver  Score  Interval  
0  Dave  1  (0.999, 3.0] 
1  Alice  6  (5.0, 7.0] 
2  Mary  4  (3.0, 5.0] 
3  John  8  (7.0, 9.0] 
4  Jane  5  (3.0, 5.0] 
5  Bob  10  (9.0, 11.0] 
Now, the value “1” is included in an interval.
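A quick way to confirm this is to count the missing values with and without “include_lowest”. This is a small sketch with illustrative variable names:

```python
import pandas as pd

scores = pd.Series([1, 6, 4, 8, 5, 10])
edges = [1, 3, 5, 7, 9, 11]

without_lowest = pd.cut(scores, bins=edges)
with_lowest = pd.cut(scores, bins=edges, include_lowest=True)

# Count the scores that did not land in any interval.
print(without_lowest.isna().sum())  # 1
print(with_lowest.isna().sum())     # 0
```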
Summary
All in all, the cut() function provides us with a lot of possibilities. We can create various intervals, change the structure of the intervals, and label the intervals to categorize our data.
For more tutorials about Pandas, Python libraries, Python in general, or other computer science-related topics, check out the Finxter Blog page.
Happy Coding!