[Tut] Pandas cut() – A Simple Guide with Video


In this tutorial, we will learn about the Pandas cut() function. This function sorts values into discrete intervals, called bins. It is mainly used to turn a continuous numeric variable into a categorical one.




Syntax and Documentation


Here are the parameters from the official documentation:


Parameters:

  • x (array-like): The one-dimensional input array to be binned.
  • bins (int, sequence of scalars, or IntervalIndex): The criteria to bin by.
    – int: the number of equal-width bins in the range of x. The range of x is extended by 0.1% on each side to include the minimum and maximum values of x.
    – sequence of scalars: the bin edges, allowing for non-uniform width. The range of x is not extended.
    – IntervalIndex: the exact bins to be used. The intervals must be non-overlapping.
  • right (bool, default True): Whether the bins include the rightmost edge. If right == True (default), bins [1, 2, 3, 4] indicate the intervals (1,2], (2,3], (3,4]. Ignored when bins is an IntervalIndex.
  • labels (array or False, default None): Specifies the labels for the returned bins. Must be the same length as the resulting bins.
    – If False, returns only integer indicators of the bins. This affects the type of the output container (see below). This argument is ignored when bins is an IntervalIndex.
  • retbins (bool, default False): Whether to return the bins. Useful if bins is a scalar.
  • precision (int, default 3): Precision at which to store and display the bin labels.
  • include_lowest (bool, default False): Whether the first interval should be left-inclusive.
  • duplicates ({default ‘raise’, ‘drop’}, optional): If bin edges are not unique, raise a ValueError or drop the non-unique edges.
  • ordered (bool, default True): Whether the labels are ordered. Applies to the returned types Categorical and Series (with Categorical dtype).
    – If True, the resulting categorical will be ordered.
    – If False, the resulting categorical will be unordered and labels must be provided.

Returns:

  • out (Categorical, Series, or ndarray): An array-like object representing the respective bin for each value of x. The type depends on the value of labels.
    – None (default): returns a Series for Series x or a Categorical for all other inputs. The values stored within are of Interval dtype.
    – sequence of scalars: returns a Series for Series x or a Categorical for all other inputs. The values stored within have whatever type the sequence has.
    – False: returns an ndarray of integers.
  • bins (numpy.ndarray or IntervalIndex): The computed or specified bins. Only returned when retbins=True.
    – For scalar or sequence bins, this is an ndarray with the computed bins.
    – If duplicates='drop', non-unique bin edges are dropped.
    – For an IntervalIndex bins, this is equal to bins.
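To make the labels and retbins parameters more concrete, here is a minimal sketch. The score values are example data chosen for illustration; the call asks for integer bin indicators and for the computed bin edges:

```python
import pandas as pd

# Example data (assumed for illustration only)
scores = pd.Series([1, 6, 4, 8, 5, 10])

# labels=False returns integer bin indicators instead of intervals;
# retbins=True additionally returns the computed bin edges
codes, edges = pd.cut(scores, bins=3, labels=False, retbins=True)

print(codes.tolist())  # [0, 1, 0, 2, 1, 2]
print(edges)           # four edges; the lowest lies slightly below 1
```

Because retbins=True is set, the call returns a tuple, so the result is unpacked into two names.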

Basic Example


To get to know the cut() function, we will start with an introductory example, on which we will build in the following sections:

import pandas as pd

df = pd.DataFrame({'Diver': ['Dave', 'Alice', 'Mary', 'John', 'Jane', 'Bob'],
                   'Score': [1, 6, 4, 8, 5, 10]})
print(df)

Diver Score
0 Dave 1
1 Alice 6
2 Mary 4
3 John 8
4 Jane 5
5 Bob 10

First, we import the Pandas library. Then we create a Pandas data frame with two columns: a “Diver” column with string values and a “Score” column with integer values.

The outputted data frame shows a dataset with six different divers and their respective score values.

Now, we apply the cut() function:

pd.cut(x = df['Score'], bins = 3)

0 (0.991, 4.0]
1 (4.0, 7.0]
2 (0.991, 4.0]
3 (7.0, 10.0]
4 (4.0, 7.0]
5 (7.0, 10.0]
Name: Score, dtype: category
Categories: (3, interval[float64, right]): [(0.991, 4.0] < (4.0, 7.0] < (7.0, 10.0]]

The cut() function accepts many parameters, but only two of them are mandatory:

  • The first one is the parameter “x”, which expects the one-dimensional, array-like data that we want to bin. In the example, we pass the “Score” column from our data frame.
  • The second mandatory parameter is “bins“. It expects either the number of bins as an integer value or a list of the interval edges. In the example, we assign “3” to the “bins” parameter to state that we want to create three equal-width intervals.

The output shows the interval for each score. For example, Alice’s score is “6” and is assigned the interval “(4.0, 7.0]” because 6 lies within this range.

But how were these intervals calculated? By assigning the “bins” parameter the value “3” we state that we want three equal-sized intervals. The intervals are calculated like this: we take the maximum value of the scores (which is “10”) and the minimum value (which is “1”). We subtract these values (10 – 1 = 9) and divide that by the number of intervals which we defined as “3” (9 / 3 = 3).

In short: (maximum value – minimum value) / number of intervals.

That way, we get the size of an interval which is 3 in our example. We already looked at Alice’s score which is 6 and lies in the interval “(4.0, 7.0]”. We can see that the difference between 7.0 and 4.0 is indeed 3.

But why does the lowest interval not start with “1.0” but with “0.991” although the lowest value is 1? That’s because of the meaning of the brackets in the intervals. The intervals here are half-open: “(0.991, 4.0]” contains the values greater than 0.991 and less than or equal to 4.0. If the interval were “(1.0, 4.0]”, the value “1” would not be included in it. To avoid that, cut() moves the lowest edge down by 0.1% of the range of the data, here 0.001 * 9 = 0.009, which gives 1 – 0.009 = 0.991.
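The calculation described above can be reproduced by hand. The following sketch uses NumPy to build the edges; the variable names are chosen for illustration, and the last lines compare the result with the edges that cut() itself reports through retbins:

```python
import numpy as np
import pandas as pd

scores = [1, 6, 4, 8, 5, 10]
n_bins = 3

lo, hi = min(scores), max(scores)
width = (hi - lo) / n_bins  # (10 - 1) / 3 = 3.0

# n_bins + 1 equally spaced edges from the minimum to the maximum
edges = np.linspace(lo, hi, n_bins + 1)

# Move the lowest edge down by 0.1% of the range so the minimum is included
edges[0] -= 0.001 * (hi - lo)

# Compare with the edges that cut() reports
_, pandas_edges = pd.cut(scores, bins=n_bins, retbins=True)
print(edges)
print(np.allclose(edges, pandas_edges))
```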

The “Categories” line of the output also shows the order of the intervals, from lowest to highest.

To see more clearly which interval belongs to which score, we can store the result in a new column of the data frame:

df['Interval'] = pd.cut(x = df['Score'], bins = 3)

Diver Score Interval
0 Dave 1 (0.991, 4.0]
1 Alice 6 (4.0, 7.0]
2 Mary 4 (0.991, 4.0]
3 John 8 (7.0, 10.0]
4 Jane 5 (4.0, 7.0]
5 Bob 10 (7.0, 10.0]

We applied the cut() function the same way as before. But this time, we assigned it to a new column labeled “Interval”. The outputted data frame now shows all divers, scores, and the respective intervals in a clear way.
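Once the intervals are stored in a column, they can be used like any other categorical column. As a small sketch (the counting step is an addition for illustration, not part of the original example), this counts how many divers fall into each interval:

```python
import pandas as pd

df = pd.DataFrame({'Diver': ['Dave', 'Alice', 'Mary', 'John', 'Jane', 'Bob'],
                   'Score': [1, 6, 4, 8, 5, 10]})
df['Interval'] = pd.cut(x=df['Score'], bins=3)

# Number of divers per interval, listed in interval order
counts = df['Interval'].value_counts().sort_index()
print(counts)
```

With this data, each of the three intervals contains two divers.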

Change the Intervals


In the previous section, we applied the cut() function using three intervals by assigning the “bins” parameter the value “3”.

Let’s now assign the “bins” parameter another value, for example, “5”:

df['Interval'] = pd.cut(x = df['Score'], bins = 5)

Diver Score Interval
0 Dave 1 (0.991, 2.8]
1 Alice 6 (4.6, 6.4]
2 Mary 4 (2.8, 4.6]
3 John 8 (6.4, 8.2]
4 Jane 5 (4.6, 6.4]
5 Bob 10 (8.2, 10.0]

As before, we create an “Interval” column and assign it to the initial data frame to see immediately which score is assigned to which interval.

The only thing we change here is that we set the “bins” parameter equal to “5”. That way, we now have five equal-sized intervals. The length of each interval is calculated as follows:

(maximum value – minimum value) / number of intervals => (10 – 1) / 5 = 1.8

As we can see, each interval has indeed the length 1.8, except for the lowest interval “(0.991, 2.8]”. It starts at “0.991”, just like in the previous section, because we have half-open intervals and that way, the value “1” is included in this interval.
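To check the interval lengths, we can ask cut() for the computed edges and look at the differences between neighboring edges. This is a minimal sketch that uses the retbins parameter together with NumPy's diff() function:

```python
import numpy as np
import pandas as pd

scores = pd.Series([1, 6, 4, 8, 5, 10])

# retbins=True returns the computed edges as a second value
_, edges = pd.cut(scores, bins=5, retbins=True)

# Differences between neighboring edges = interval lengths
widths = np.diff(edges)
print(edges)
print(widths)
```

Only the first length differs from 1.8, by the 0.009 that was subtracted from the lowest edge.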

Apart from an integer value, we can also assign the “bins” parameter a list of scalar values. This way, we determine the interval boundaries directly:

df['Interval'] = pd.cut(x = df['Score'], bins=[0,2,4,6,8,10])

Diver Score Interval
0 Dave 1 (0, 2]
1 Alice 6 (4, 6]
2 Mary 4 (2, 4]
3 John 8 (6, 8]
4 Jane 5 (4, 6]
5 Bob 10 (8, 10]

The list “[0,2,4,6,8,10]” creates the intervals: “(0,2]”, “(2,4]”, “(4,6]”, “(6,8]”, and “(8,10]”.

This way, we specify how many intervals we want to get and how long each interval should be.

In this example, we created intervals that all have the same length. However, this does not have to be the case. We can stipulate the interval lengths in any way we want:

df['Interval'] = pd.cut(x = df['Score'], bins=[0,4,5,6,10])

Diver Score Interval
0 Dave 1 (0, 4]
1 Alice 6 (5, 6]
2 Mary 4 (0, 4]
3 John 8 (6, 10]
4 Jane 5 (4, 5]
5 Bob 10 (6, 10]

Here, we assigned the “bins” parameter a different list. The resulting intervals do not all have the same length.
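The parameter list also mentions a third option for “bins”: an IntervalIndex. As a sketch, the same four intervals can be built with pd.IntervalIndex.from_breaks() and passed to cut() directly:

```python
import pandas as pd

scores = pd.Series([1, 6, 4, 8, 5, 10])

# Build the intervals (0, 4], (4, 5], (5, 6], (6, 10] explicitly
intervals = pd.IntervalIndex.from_breaks([0, 4, 5, 6, 10])

# cut() uses the given intervals as they are
binned = pd.cut(scores, bins=intervals)
print(binned)
```

When an IntervalIndex is passed, the “right” and “labels” parameters are ignored, as stated in the parameter list.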

If we define the intervals ourselves, some values from the data frame might not lie in any of them:

df['Interval'] = pd.cut(x = df['Score'], bins=[0,4,5,6])

Diver Score Interval
0 Dave 1 (0.0, 4.0]
1 Alice 6 (5.0, 6.0]
2 Mary 4 (0.0, 4.0]
3 John 8 NaN
4 Jane 5 (4.0, 5.0]
5 Bob 10 NaN

The scores “8” and “10” do not lie within any of the given intervals. Pandas handles these cases by assigning “NaN“ instead of an interval. When that happens, we know that our intervals do not cover the whole range of the data.
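A quick way to find such uncovered values is to test the result for missing values. The sketch below lists the affected divers. The second part shows one possible fix that is an assumption of this example rather than part of the tutorial: adding an open-ended last bin with infinity as the upper edge.

```python
import pandas as pd

df = pd.DataFrame({'Diver': ['Dave', 'Alice', 'Mary', 'John', 'Jane', 'Bob'],
                   'Score': [1, 6, 4, 8, 5, 10]})
df['Interval'] = pd.cut(x=df['Score'], bins=[0, 4, 5, 6])

# Rows whose score lies outside every interval
uncovered_divers = df.loc[df['Interval'].isna(), 'Diver'].tolist()
print(uncovered_divers)  # ['John', 'Bob']

# Possible fix (assumption): an open-ended last bin (6, inf]
df['Interval'] = pd.cut(x=df['Score'], bins=[0, 4, 5, 6, float('inf')])
remaining_nans = int(df['Interval'].isna().sum())
print(remaining_nans)  # 0
```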

Include the Leftmost or the Rightmost Edge


In the examples we have seen so far, the intervals were always structured like this: “(x, y]”. That way, the rightmost edge is included in the interval. That’s because the “right” parameter of the cut() function is set to “True“ by default.

If we change that parameter and set it to “False“, this is what happens:

df['Interval'] = pd.cut(x = df['Score'], bins = 3, right=False)

Diver Score Interval
0 Dave 1 [1.0, 4.0)
1 Alice 6 [4.0, 7.0)
2 Mary 4 [4.0, 7.0)
3 John 8 [7.0, 10.009)
4 Jane 5 [4.0, 7.0)
5 Bob 10 [7.0, 10.009)

We set the “bins” parameter to “3” like in the first example, so we get three equal-sized intervals. But now, the intervals are structured like this: “[x, y)”. The leftmost edge is now included in the interval and not the rightmost.

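Thus, the lowest interval is now “[1.0, 4.0)” instead of “(0.991, 4.0]”. The value “1” is included because the left edge is closed, so the lower edge no longer needs to be moved. Note that Mary’s score “4” now falls into “[4.0, 7.0)” instead of the lowest interval.

Likewise, the highest interval is now “[7.0, 10.009)”. Its right edge is open, so cut() moves the upper edge up by 0.1% of the range (0.009) to make sure the value “10” is included in this interval.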
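The effect of the “right” parameter is easiest to see for a value that sits exactly on an edge. This sketch compares both settings for the score “4” (index 2 in our data):

```python
import pandas as pd

scores = pd.Series([1, 6, 4, 8, 5, 10])

right_closed = pd.cut(scores, bins=3)              # default: right=True
left_closed = pd.cut(scores, bins=3, right=False)

# The score 4 lies exactly on an edge, so its bin depends on `right`
print(right_closed.iloc[2])  # (0.991, 4.0]
print(left_closed.iloc[2])   # [4.0, 7.0)
```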

Label the Intervals


We can label the intervals using the “labels” parameter of the cut() function. This way, we can categorize each score:

df['Interval'] = pd.cut(x = df['Score'], bins = 3, labels=['bad', 'good', 'exceptional'])

Diver Score Interval
0 Dave 1 bad
1 Alice 6 good
2 Mary 4 bad
3 John 8 exceptional
4 Jane 5 good
5 Bob 10 exceptional

Again, we created three equal-width intervals. But this time, we labeled each interval. The lowest interval is labeled “bad”, the middle interval is labeled “good”, and the highest interval is labeled “exceptional”.

By doing that, we categorize and evaluate each score.
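Since the “ordered” parameter is “True“ by default, the labels form an ordered categorical and can be compared with each other. As a sketch, this selects all divers rated “good” or better:

```python
import pandas as pd

df = pd.DataFrame({'Diver': ['Dave', 'Alice', 'Mary', 'John', 'Jane', 'Bob'],
                   'Score': [1, 6, 4, 8, 5, 10]})
df['Interval'] = pd.cut(x=df['Score'], bins=3,
                        labels=['bad', 'good', 'exceptional'])

# The labels are ordered: 'bad' < 'good' < 'exceptional'
at_least_good = df[df['Interval'] >= 'good']
print(at_least_good['Diver'].tolist())  # ['Alice', 'John', 'Jane', 'Bob']
```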

Include the Lowest Value


Imagine we create the following intervals:

df['Interval'] = pd.cut(x = df['Score'], bins = [1,3,5,7,9,11])

Diver Score Interval
0 Dave 1 NaN
1 Alice 6 (5.0, 7.0]
2 Mary 4 (3.0, 5.0]
3 John 8 (7.0, 9.0]
4 Jane 5 (3.0, 5.0]
5 Bob 10 (9.0, 11.0]

We can see that Dave’s score is not assigned to any interval. That’s because the “right” parameter is set to “True” by default, so each interval excludes its left edge. Thus, the score “1” is not included in the interval “(1.0, 3.0]”.

How can we include the score “1” without changing the “right” parameter, so that the right edges stay included in the intervals?

We achieve that by applying the “include_lowest” parameter. By assigning that parameter the value “True“, we include the lowest value:

df['Interval'] = pd.cut(x = df['Score'], bins = [1,3,5,7,9,11], include_lowest=True)

Diver Score Interval
0 Dave 1 (0.999, 3.0]
1 Alice 6 (5.0, 7.0]
2 Mary 4 (3.0, 5.0]
3 John 8 (7.0, 9.0]
4 Jane 5 (3.0, 5.0]
5 Bob 10 (9.0, 11.0]

Now, the value “1” is included in an interval.
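To confirm the difference, we can count the missing values with and without “include_lowest“. This is a minimal sketch using the same scores and edges as above:

```python
import pandas as pd

scores = pd.Series([1, 6, 4, 8, 5, 10])
edges = [1, 3, 5, 7, 9, 11]

without_lowest = pd.cut(scores, bins=edges)
with_lowest = pd.cut(scores, bins=edges, include_lowest=True)

print(without_lowest.isna().sum())  # 1
print(with_lowest.isna().sum())     # 0

# The first interval now contains the value 1
print(1 in with_lowest.iloc[0])     # True
```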

Summary


All in all, the cut() function gives us a lot of flexibility. We can create intervals of equal or custom width, choose which edge of each interval is included, and label the intervals to categorize our data.

For more tutorials about Pandas, Python libraries, Python in general, or other computer science-related topics, check out the Finxter Blog page.

Happy Coding!



https://www.sickgaming.net/blog/2021/12/...ith-video/