How is the ‘groupby’ function used in Pandas?
In Pandas, the groupby() function splits a DataFrame into groups based on one or more keys (typically column values). An operation such as an aggregation, a transformation, or a filter can then be applied to each group, and the per-group results are combined into a single output.
The signature of groupby() in pandas 1.x is shown below (later versions removed squeeze and deprecated axis):
df.groupby(by=None, axis=0, level=None, as_index=True, sort=True, group_keys=True, squeeze=False, observed=False, dropna=True)
Parameters:
- by: Specifies what to group by. It can be a single column name, a list of column names, a Series, a dictionary, a function, etc. The default is None, but either by or level must be supplied; calling groupby() with neither raises an error.
- axis: Specifies the axis to split along, where 0 splits by rows and 1 splits by columns. The default is 0. This parameter is deprecated in recent pandas versions.
- level: If the axis has a MultiIndex, specifies which index level (or levels) to group by. The default is None.
- as_index: Specifies whether the group keys become the index of the aggregated result. The default is True; with False, the keys are returned as ordinary columns.
- sort: Specifies whether the group keys are sorted in the result. The default is True; with False, groups appear in order of first occurrence.
- group_keys: When calling apply(), specifies whether the group keys are added to the index of the result to identify each piece. The default is True.
- squeeze: Specifies whether to reduce the dimensionality of the return type where possible. The default is False. This parameter was deprecated in pandas 1.1 and removed in pandas 2.0.
- observed: Applies only when grouping by categorical columns. If True, only category values that actually occur in the data appear in the result; if False, all categories appear, including unobserved ones. The default shown here is False.
- dropna: Specifies whether rows whose group key is missing (NaN) are excluded. The default is True; with False, missing keys form their own group.
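The effect of as_index, sort, and dropna described above can be seen in a minimal sketch. The DataFrame here is hypothetical sample data (not from the example further down), with one missing key added on purpose, and the dropna parameter assumes pandas 1.1 or later.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; the third row has a missing grouping key.
df = pd.DataFrame({
    "Name": ["Tom", "Nick", np.nan, "Tom", "Nick"],
    "Score": [85, 90, 92, 78, 82],
})

# Defaults: keys become the index, keys are sorted, NaN keys are dropped.
default = df.groupby("Name")["Score"].sum()

# as_index=False returns the key as an ordinary column of a DataFrame.
flat = df.groupby("Name", as_index=False)["Score"].sum()

# sort=False keeps groups in order of first appearance (Tom, then Nick).
unsorted = df.groupby("Name", sort=False)["Score"].sum()

# dropna=False keeps the rows with a missing key as their own group.
with_nan = df.groupby("Name", dropna=False)["Score"].sum()

print(default, flat, unsorted, with_nan, sep="\n\n")
```

With the defaults, the row with the missing name is silently left out of the totals, which is worth keeping in mind when group sums do not add up to the column sum.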
The groupby() function returns a GroupBy object, which can be used to perform various operations such as applying aggregation functions (like sum, mean, etc.), filtering data, and transforming data.
Specific operations can be achieved through the methods of the GroupBy object, such as:
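- agg(): Apply one or more aggregation functions to each group, producing one result per group.
- apply(): Apply a custom function to each group and combine the results.
- transform(): Apply a function to each group and return a result with the same shape as the original data.
- filter(): Keep or drop entire groups based on a condition evaluated per group.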
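As a rough illustration of how these four methods differ in what they return, the sketch below runs each of them on a small score table. The threshold of 85 and the variable names are chosen for illustration only.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Tom", "Nick", "John", "Tom", "Nick", "John"],
    "Score": [85, 90, 92, 78, 82, 88],
})
scores = df.groupby("Name")["Score"]

# agg(): one row per group, one column per aggregation function.
summary = scores.agg(["mean", "max"])

# transform(): same length as the input, each row gets its group's mean.
group_mean = scores.transform("mean")

# filter(): keep only the rows of groups whose mean score exceeds 85.
high = df.groupby("Name").filter(lambda g: g["Score"].mean() > 85)

# apply(): a custom function per group, here the spread of scores.
spread = scores.apply(lambda s: s.max() - s.min())

print(summary, group_mean, high, spread, sep="\n\n")
```

agg() and apply() here return one value per group, transform() returns one value per original row, and filter() returns a subset of the original rows.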
Sample code:
import pandas as pd

# Create a DataFrame
data = {'Name': ['Tom', 'Nick', 'John', 'Tom', 'Nick', 'John'],
        'Subject': ['Math', 'English', 'Math', 'English', 'Math', 'English'],
        'Score': [85, 90, 92, 78, 82, 88]}
df = pd.DataFrame(data)

# Group by the Name column and compute the mean score of each group
result = df.groupby('Name')['Score'].mean()
print(result)
Result output:
Name
John    90.0
Nick    86.0
Tom     81.5
Name: Score, dtype: float64
In this example, the data is first grouped by the Name column, and then the average score of each group is calculated. The result is a Series whose index holds the unique values of the Name column and whose values are the average scores of the corresponding groups.