Groupby is a very popular function in Pandas. This is very good at summarising, transforming, filtering, and a few other very essential data analysis tasks. In this article, I will explain the application of groupby function in detail with example.

## Dataset

For this article, I will use a ‘Students Performance’ dataset from Kaggle. Please feel free download the dataset from here:

## rashida048/Datasets

### Contribute to rashida048/Datasets development by creating an account on GitHub.

#### github.com

Here I am importing the necessary packages and the dataset:

```
import pandas as pd
import numpy as np
df = pd.read_csv('StudentsPerformance.csv')
df.head()
```

## How Groupby Works?

Groupby function splits the dataset based on criteria that you define. Here I am showing the process behind the groupby function. It will give you an idea, how much work we may have to do if we would not have groupby function. I will make a new smaller dataset of two columns only to demonstrate in this section. The columns are ‘gender’ and ‘reading score’.

```
test = df[['gender', 'reading score']]
test.head()
```

**Let’s find out the average reading score gender-wise**

First, we need to split the dataset based on gender. Generate the data for females only.

```
female = test['gender'] == 'female'
test[female].head()
```

In the same way, generate the data for the males,

```
male = test['gender'] == 'male'
test[male].head()
```

Using females and males dataset above to calculate the mean reading score for females and males respectively.

```
fe_avg = test[female]['reading score'].mean()
male_avg = test[male]['reading score'].mean()
print(fe_avg, male_avg)
```

The mean reading score of females is 72.608 and the mean reading score for males is 65.473. Now, make a DataFrame for the mean reading score of females and males.

`df_reading = pd.DataFrame({'Gender': ['female', 'male'], 'reading score': [fe_avg, male_avg]})`

Now, **let’s solve the same problem with the groupby function.** Splitting the data based on gender and applying the ‘mean’ on it with just one simple line of code:

`test.groupby('gender').mean()`

This small piece of code gives the same result.

## Groups in Groupby

I will use the original dataset ‘df’ now. Make groups of ‘race/ethnicity’.

```
race = df.groupby('race/ethnicity')
print(race)
```

Output:

It returns an object. Now check the datatype of ‘race’.

`type(race)`

Output: pandas.core.groupby.generic.DataFrameGroupBy

So, we generated a DataFrameGroupBy object. Calling groups on this DataFrameGroupBy object will return the indices of each group.

race.groups#Here is the Output: {'group A': Int64Index([ 3, 13, 14, 25, 46, 61, 62, 72, 77, 82, 88, 112, 129, 143, 150, 151, 170, 228, 250, 296, 300, 305, 327, 356, 365, 368, 378, 379, 384, 395, 401, 402, 423, 428, 433, 442, 444, 464, 467, 468, 483, 489, 490, 506, 511, 539, 546, 571, 575, 576, 586, 589, 591, 597, 614, 623, 635, 651, 653, 688, 697, 702, 705, 731, 741, 769, 778, 805, 810, 811, 816, 820, 830, 832, 837, 851, 892, 902, 911, 936, 943, 960, 966, 972, 974, 983, 985, 988, 994], dtype='int64'), 'group B': Int64Index([ 0, 2, 5, 6, 7, 9, 12, 17, 21, 26, ... 919, 923, 944, 946, 948, 969, 976, 980, 982, 991], dtype='int64', length=190), 'group C': Int64Index([ 1, 4, 10, 15, 16, 18, 19, 23, 27, 28, ... 963, 967, 971, 975, 977, 979, 984, 986, 996, 997], dtype='int64', length=319), 'group D': Int64Index([ 8, 11, 20, 22, 24, 29, 30, 33, 36, 37, ... 965, 970, 973, 978, 981, 989, 992, 993, 998, 999], dtype='int64', length=262), 'group E': Int64Index([ 32, 34, 35, 44, 50, 51, 56, 60, 76, 79, ... 937, 949, 950, 952, 955, 962, 968, 987, 990, 995], dtype='int64', length=140)}

Have a look at the output above. Groupby function splits the data into subgroups and you can now see the indices of each subgroup. That’s great! But only the indices are not enough. We need to see the real data of each group. The function ‘get_group’ helps with that.

`race.get_group('group B')`

I am showing the part of the results here. The original output is much bigger.

**Find the size of each group**

Calling size on the ‘race’ object will give the size of each group

`race.size()`

**Loop over each group**

You can loop over the groups. Here is an example:

```
for name, group in race:
print(name, 'has', group.shape[0], 'data')
```

**Grouping by multiple variables**

In all the examples above, we only grouped by one variable. But grouping by multiple variables is also possible. Here I am grouping by ‘race/ethnicity’ and ‘gender’. This should return the number of data of each race segregated by gender.

`df.groupby(['gender', 'race/ethnicity']).size()`

This example aggregates the data using ‘size’. There are other aggregate functions as well. Here is the list of all the aggregate functions:

sum()

mean()

size()

count()

std()

var()

sem()

min()

median()

Please try them out. Just replace any of these aggregate functions instead of the ‘size’ in the above example.

**Using multiple aggregate functions**

The way we can use groupby on multiple variables, using multiple aggregate functions is also possible. This next example will group by ‘race/ethnicity and will aggregate using ‘max’ and ‘min’ functions.

`df.groupby('race/ethnicity').agg([np.max, np.min])`

The aggregate function worked on each column here. Because we did not specify any column.

**Make a DataFrame of minimum and the maximum score of the math, reading, and writing for each race.**

To do that, use the previous code and just add the scores like this:

`df.groupby('race/ethnicity')['math score', 'reading score', 'writing score'].agg([np.max, np.min])`

**Groupby on multiple variables and use multiple aggregate functions**

To demonstrate this, we will groupby on ‘race/ethnicity’ and ‘gender’. Also, use two aggregate functions ‘min’ and ‘max’. Let’s make a DataFrame that contains the maximum and minimum score in math, reading, and writing for each group segregated by gender.

`df.groupby(['race/ethnicity', 'gender'])['math score', 'reading score', 'writing score'].agg([np.max, np.min])`

**Different aggregate functions on different columns**

Group by ‘race/ethnicity’ and use max and mean on math score and median and min on reading score.

```
df.groupby('race/ethnicity').agg({'math score': ['max', 'mean'],
'reading score': ['median','min']})
```

Cool, right? But the column names are not that nice and clear. They should be clearer and nicer. We can change the column names like this:

```
math_read = df.groupby('race/ethnicity').agg({'math score': ['max', 'mean'], 'reading score': ['max', 'mean']})
math_read.columns=['Max Math Score', 'Average Math Score', 'Max Reading Score', 'Average Reading Score' ]
```

Exactly the same DataFrame, but a lot more organized.

**Use the aggregate function on nominal columns**

In all the examples above, we used aggregate functions on numeric columns. Apply aggregate functions on some nominal columns such as ‘lunch’ and ‘parental level of education’.

`df.groupby(['race/ethnicity', 'gender']).agg({'lunch': pd.Series.mode, 'parental level of education': pd.Series.mode, 'math score':np.mean})`

As you can see in the code above, the syntax was different for nominal data. Just as reminder mode is the data that appears the most.

**Apply a function in Groupby**

Calculate how many kids ‘parental levels of education’ is the high school level.

`df.groupby(df['parental level of education'].apply(lambda x: 'high' in x)).size()`

If you are not used to using lambda, please check out this article.

**Cut the ‘math score’ column in three even buckets and define them as low, average and high scores**

`df.groupby(pd.qcut(x=df['math score'], q=3, labels=['low', 'average', 'high'])).size()`

If you want to set the cut point and define your low, average, and high, that is also a simple method.

`df.groupby(pd.cut(df['math score'], [0, 40, 70, 100])).size()`

## Transform with Groupby

Generate a new column in the DataFrame ‘df’ and add a column that contains the difference of each math score from the mean math score.

`df['Distance From the Mean'] = df.groupby(['race/ethnicity', 'gender'])['math score'].transform(lambda x: x - x.mean())`

Look at the DataFrame above. There is a new column at the end named ‘Distance From the Mean’.

## Filter Using Groupby

You can filter out the data based on certain conditions to make the data more meaningful. Filter the group or race that has less than 100 rows of data.

`df_n = df.groupby('race/ethnicity').filter(lambda x: len(x) > 100)`

The code above says that keep the groups that have a length of more than 100. Check the length of the original DataFrame ‘df’ and the filtered DataFrame ‘df_n’.

```
print(len(df))
print(len(df_n))
```

The length of the original DataFrame is 1000 and after applying the filter the length of the DataFrame becomes 911.

## Map

Map the mean reading score of each group and generate a new column.

`df['New'] = df['race/ethnicity'].map(df.groupby(['race/ethnicity'])['reading score'].mean())`

Look at this DataFrame. There is a new column named ‘New’ at the end, that is containing the mean reading score of the corresponding group.

## Visualization Using Groupby

As we all know, a picture says a thousand words. Here is some demonstration of visualization techniques in groupby. Make a bar plot of the parental level of education.

```
import matplotlib.pyplot as plt
plt.clf()
df.groupby('parental level of education').size().plot(kind='bar')
```

Please try making a pie plot. You just need to use ‘pie’ instead of ‘bar’ in the code above. If you do not pass any ‘kind’, the plot will be a simple line plot. Let’s plot the mean math score of each group.

`df.groupby('race/ethnicity')['math score'].mean().plot()`

## Conclusion

In this article, you learned to group and summarize the data in many different ways. You learned to use aggregate functions, data transformation, filter, map, and visualization using groupby.

#pandas #datascience #dataanalytics #dataanalysis #python #groupby #datascientists