Merged
143 changes: 107 additions & 36 deletions content/pandas/concepts/dataframe/terms/groupby/groupby.md
@@ -12,59 +12,130 @@ CatalogContent:
- 'paths/data-science'
---

The **`.groupby()`** function groups a [`DataFrame`](https://www.codecademy.com/resources/docs/pandas/dataframe) using a mapper or a series of columns and returns a [`GroupBy`](https://www.codecademy.com/resources/docs/pandas/groupby) object. A range of methods, as well as custom functions, can be applied to `GroupBy` objects in order to combine or transform large amounts of data in these groups.
The Pandas DataFrame **`.groupby()`** method groups a `DataFrame` using a mapper or a series of columns and returns a [`GroupBy`](https://www.codecademy.com/resources/docs/pandas/groupby) object. A range of methods, as well as custom functions, can be applied to `GroupBy` objects to combine or transform large amounts of data in these groups.

## Syntax
## Pandas `.groupby()` Syntax

```pseudo
dataframevalue.groupby(by, axis, level, as_index, sort, group_keys, observed, dropna)
df.groupby(by=None, axis=0, level=None, as_index=True, sort=True, group_keys=True, observed=False, dropna=True)
```

`.groupby()` uses the following parameters:
**Parameters:**

- `by`: If a dictionary or `Series` is passed, the values will determine groups. If a list or [ndarray](https://www.codecademy.com/resources/docs/numpy/ndarray) with the same length as the selected axis is passed, the values will be used to form groups. A label or list of labels can be used to group by a particular column or columns.
- `axis`: Split along rows (0 or "index") or columns (1 or "columns"). Default value is 0.
- `level`: If the axis is a `MultiIndex`, group by a particular level or levels. Value is int or level name, or sequence of them. Default value is `None`.
- `as_index`: Boolean value. `True` returns group labels as an index in aggregated output, and `False` returns labels as `DataFrame` columns. Default value is `True`.
- `sort`: Boolean value. `True` sorts the group keys. Default value is `True`.
- `group_keys`: Boolean value. Add group keys to index when calling apply. Default value is `True`.
- `observed`: Boolean value. If `True`, only show observed values for categorical groupers, otherwise show all values. Default value is `False`.
- `dropna`: Boolean value. If `True`, drop groups whose keys contain `NA` values. If `False`, `NA` will be used as a key for those groups. Default value is `True`.
- `axis`: Split along rows (`0` or `"index"`) or columns (`1` or `"columns"`).
- `level`: If the axis is a `MultiIndex`, group by a particular level or levels. Value is an integer or level name, or a sequence of them.
- `as_index`: Boolean value. `True` returns group labels as an index in aggregated output, and `False` returns labels as `DataFrame` columns.
- `sort`: Boolean value. `True` sorts the group keys.
- `group_keys`: Boolean value. If `True`, add group keys to the index when calling `.apply()`.
- `observed`: Boolean value. If `True`, only show observed values for categorical groupers, otherwise show all values.
- `dropna`: Boolean value. If `True`, drop groups whose keys contain `NA` values. If `False`, `NA` will be used as a key for those groups.
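
As a minimal sketch of how `as_index` and `dropna` change the result, the snippet below groups a small, made-up `DataFrame` (the `Team` and `Score` columns are hypothetical and not part of the examples in this entry) in three ways:

```py
import numpy as np
import pandas as pd

# Hypothetical data: one row has a missing group key
df = pd.DataFrame({
  'Team': ['Red', 'Blue', np.nan, 'Red'],
  'Score': [10, 20, 30, 40]
})

# Defaults: group labels become the index, rows with NA keys are dropped
by_index = df.groupby('Team')['Score'].sum()

# as_index=False: group labels are returned as a regular column
as_column = df.groupby('Team', as_index=False)['Score'].sum()

# dropna=False: NA is kept as its own group key
with_na = df.groupby('Team', dropna=False)['Score'].sum()

print(by_index, end='\n\n')
print(as_column, end='\n\n')
print(with_na)
```

With the defaults, the row whose key is missing does not contribute to any group; with `dropna=False`, it forms a group of its own.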

## Example
## Example 1: Group by Single Column Using `.groupby()`

This example uses `.groupby()` on a `DataFrame` to produce some aggregate results.
This example uses `.groupby()` to group the data by a single column:

```py
import pandas as pd

df = pd.DataFrame({'Key' : ['A', 'A', 'A', 'B', 'B', 'C'],
'Value' : [15., 23., 17., 5., 8., 12.]})
print(df, end='\n\n')
data = {
  'Region': ['East', 'West', 'East', 'South', 'West', 'South', 'East'],
  'Sales': [250, 200, 300, 400, 150, 500, 100]
}

print(df.groupby(['Key'], as_index=False).mean(), end='\n\n')
df = pd.DataFrame(data)

print(df.groupby(['Key'], as_index=False).sum())
result = df.groupby('Region')['Sales'].sum()

print(result)
```

Here is the output:

```shell
Region
East     650
South    900
West     350
Name: Sales, dtype: int64
```

## Example 2: Group by Multiple Columns Using `.groupby()`

This example uses `.groupby()` to group the data by multiple columns:

```py
import pandas as pd

data = {
  'Region': ['East', 'West', 'East', 'South', 'West', 'South', 'East'],
  'Product': ['A', 'B', 'A', 'B', 'A', 'A', 'B'],
  'Sales': [250, 200, 300, 400, 150, 500, 100]
}

df = pd.DataFrame(data)

result = df.groupby(['Region', 'Product'])['Sales'].sum()

print(result)
```

This produces the following output:
Here is the output:

```shell
  Key  Value
0   A   15.0
1   A   23.0
2   A   17.0
3   B    5.0
4   B    8.0
5   C   12.0

  Key      Value
0   A  18.333333
1   B   6.500000
2   C  12.000000

  Key  Value
0   A   55.0
1   B   13.0
2   C   12.0
Region  Product
East    A          550
        B          100
South   A          500
        B          400
West    A          150
        B          200
Name: Sales, dtype: int64
```

## Codebyte Example: Using Aggregate Functions with Pandas `.groupby()`

This codebyte example uses `.groupby()` to group the data and then applies aggregate functions on the grouped data:

```codebyte/python
import pandas as pd

data = {
  'Region': ['East', 'West', 'East', 'South', 'West', 'South', 'East'],
  'Product': ['A', 'B', 'A', 'B', 'A', 'A', 'B'],
  'Sales': [250, 200, 300, 400, 150, 500, 100]
}

df = pd.DataFrame(data)

result = df.groupby('Region')['Sales'].agg(['sum', 'mean', 'max'])

print(result)
```

## Frequently Asked Questions

### 1. When should I use `groupby` in Pandas?

Use `groupby` when you want to split data into groups, apply a function, and combine results. Common operations include computing aggregates like sum, mean, or count per category.
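
The split, apply, and combine steps can be made visible with a small sketch. It assumes a reduced version of the `Region`/`Sales` data used in this entry and writes the steps out by hand before comparing them with the one-line form:

```py
import pandas as pd

df = pd.DataFrame({
  'Region': ['East', 'West', 'East', 'South'],
  'Sales': [250, 200, 300, 400]
})

# Split: iterating over a GroupBy object yields (key, sub-DataFrame) pairs
# Apply: compute a total for each piece
totals = {}
for region, group in df.groupby('Region'):
  totals[region] = group['Sales'].sum()

# Combine: collect the per-group results into a single Series
manual = pd.Series(totals, name='Sales')

# The built-in aggregation performs the same three steps in one call
builtin = df.groupby('Region')['Sales'].sum()

print(manual, end='\n\n')
print(builtin)
```

The explicit loop is only for illustration; in practice the one-line form is shorter and usually faster.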

### 2. Is Pandas `groupby` slow?

It can be slow for large datasets, especially if:

- You’re grouping by multiple columns.
- The dataset doesn’t fit in memory.
- You're applying custom Python functions instead of built-ins.

For most medium-sized tasks, it's fast enough. For massive data, look into more efficient libraries like Polars or Dask.
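
To illustrate the last point in the list, the sketch below computes the same per-group mean twice: once with a built-in aggregation referenced by name, and once with a custom Python function. The data is a shortened, hypothetical version of the `Region`/`Sales` table, and no timings are shown because they depend on the data and the Pandas version:

```py
import pandas as pd

df = pd.DataFrame({
  'Region': ['East', 'West', 'East', 'South', 'West'],
  'Sales': [250, 200, 300, 400, 150]
})

grouped = df.groupby('Region')['Sales']

# Built-in aggregation, referenced by name
builtin = grouped.agg('mean')

# Custom Python function: called once per group
custom = grouped.agg(lambda s: s.sum() / len(s))

print(builtin, end='\n\n')
print(custom)
```

Both calls return the same values. When a built-in covers the calculation, it is generally the safer choice for larger datasets.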

### 3. Is Polars `groupby` faster than Pandas?

Yes, often much faster. Polars is built in Rust and optimized for speed and parallelism. It can handle larger-than-memory data better and is ideal for performance-critical data tasks.

Key difference in how the two libraries execute a grouped aggregation:

- Pandas: largely single-threaded.
- Polars: multi-threaded, so `groupby` and aggregation are often faster.

If performance is a bottleneck, switching to Polars is worth considering.