In [1]:

from lec_utils import *

Discussion Slides: Grouping, Pivoting, and Merging

Agenda 📆¶

The groupby method.
Pivot tables.
merge and types of merges.
Worksheet 📝.

Today's Dataset 🏈〽️¶

Today, we're going to be working with a dataset on the past 100 years of Michigan Football.
Source: www.mgoblue.com.

Like last week, let's store our dataset in the df variable.

In [2]:

df = pd.read_csv('data/michigan_football.csv')
df

Out[2]:

	year	opponent	venue	result	UM_score	opp_score	date	time
0	2024	Fresno State	Home	W	30.0	10.0	Aug 31 (Sat)	7:30 PM
1	2024	#3 Texas	Home	L	12.0	31.0	Sep 7 (Sat)	Noon
2	2024	Arkansas State	Home	W	28.0	18.0	Sep 14 (Sat)	Noon
...	...	...	...	...	...	...	...	...
1100	1924	Northwestern	Home	W	27.0	0.0	Nov 8 (Sat)	NaN
1101	1924	Ohio State	Away	W	16.0	6.0	Nov 15 (Sat)	NaN
1102	1924	Iowa (HC)	Home	L	2.0	9.0	Nov 22 (Sat)	NaN

1103 rows × 8 columns

In [3]:

df.columns

Out[3]:

Index(['year', 'opponent', 'venue', 'result', 'UM_score', 'opp_score', 'date',
       'time'],
      dtype='object')

Read the DataFrame internals guide!¶

We've posted a guide, DataFrame Internals, that you should read.

It covers various topics, like:
- How to add columns to DataFrames without side effects.
- How to write functions that manipulate DataFrames without side effects.
- How to use Series methods, like mean and describe, on DataFrames.
and more!

The `groupby` method¶

Key idea: If the information we want involves performing some operation separately for each group, we use groupby.

For instance, if we want to find the average number of points Michigan scored each year, we use groupby('year').

In [4]:

df.groupby('year')['UM_score'].mean()

Out[4]:

year
1924    19.38
1925    28.38
1926    23.88
        ...  
2022    40.43
2023    35.87
2024    22.00
Name: UM_score, Length: 101, dtype: float64

df.groupby('year')['UM_score'].mean()
First, we need to tell pandas which column we want to group by. Since we're grouping by 'year', the remainder of the calculations will be done separately for each 'year'.

df.groupby('year')['UM_score'].mean()
Then, select the other column(s) that you want to aggregate. Here, we want to calculate the mean 'UM Score', so that's what we select.

df.groupby('year')['UM_score'].mean()
Finally, we use an aggregation method. This is saying, for each 'year', compute the mean 'UM Score'.

Reference Slide¶

The split, apply, combine process¶

Under the hood, groupby follows 3 steps: split, apply, and combine.

We start by splitting rows into distinct groups.
When we use df.groupby('year'), the groups are created by separating rows based on 'year' values.

In [5]:

# The 2024 group and 2023 group.
# df.groupby('year') will create separate groups for each 'year.'
group_2024 = df.loc[df['year'] == 2024]
group_2023 = df.loc[df['year'] == 2023]
dfs_side_by_side(group_2024.head(), group_2023.head())

	year	opponent	venue	result	UM_score	opp_score	date	time
0	2024	Fresno State	Home	W	30.0	10.0	Aug 31 (Sat)	7:30 PM
1	2024	#3 Texas	Home	L	12.0	31.0	Sep 7 (Sat)	Noon
2	2024	Arkansas State	Home	W	28.0	18.0	Sep 14 (Sat)	Noon
3	2024	#11 USC	Home	W	27.0	24.0	Sep 21 (Sat)	3:30 PM
4	2024	Minnesota	Home	W	27.0	24.0	Sep 28 (Sat)	Noon

	year	opponent	venue	result	UM_score	opp_score	date	time
13	2023	East Carolina	Home	W	30.0	3.0	Sep 2 (Sat)	Noon
14	2023	UNLV	Home	W	35.0	7.0	Sep 9 (Sat)	3:30 PM
15	2023	Bowling Green	Home	W	31.0	6.0	Sep 16 (Sat)	7:30 PM
16	2023	Rutgers	Home	W	31.0	7.0	Sep 23 (Sat)	12:00 PM
17	2023	Nebraska	Away	W	45.0	7.0	Sep 30 (Sat)	2:30 PM CT

Then, we apply the operation to each group, individually.

In [6]:

print("2024 result:", group_2024['UM_score'].mean())
print("2023 result:", group_2023['UM_score'].mean())
print("and so on...")

2024 result: 22.0
2023 result: 35.86666666666667
and so on...

Finally, we combine the results of the operation on each group into a single Series or DataFrame (depending on the operation).

In [7]:

df.groupby('year')['UM_score'].mean().sort_index(ascending=False)

Out[7]:

year
2024    22.00
2023    35.87
2022    40.43
        ...  
1926    23.88
1925    28.38
1924    19.38
Name: UM_score, Length: 101, dtype: float64

In the expression below, mean is an aggregation method.
It takes many 'UM_score' values for a particular 'year' and combines them into a single value.

Reference Slide¶

Aggregating¶

In [8]:

df.groupby('year')['UM_score'].mean()

Out[8]:

year
1924    19.38
1925    28.38
1926    23.88
        ...  
2022    40.43
2023    35.87
2024    22.00
Name: UM_score, Length: 101, dtype: float64

Sometimes, we want to perform:
- An aggregation that doesn't have a built-in implementation.
- Different aggregation methods on different columns.

The aggregate method, or agg for short, is the general-purpose tool for performing aggregations.

In [9]:

# Finds the average score, and the win percentage, for home vs. away games.
(
    df
    .groupby('venue')
    .agg({
        'UM_score': 'mean', # equivalent to df.groupby('venue')['UM_score'].mean()
        # the parameter `s` here is a Series of the 'result' values for each group.
        'result': lambda s: (s == 'W').sum() / len(s) 
    })
)

Out[9]:

	UM_score	result
venue
Away	23.59	0.65
Home	27.35	0.74

📢 Michigan performs significantly better at home, both in terms of points scored and win percentage!

Filtering¶

So far, the only operations we've applied to each group is some sort of aggregation (i.e. aggregating the mean score across years).

Instead of aggregating, perhaps we want to only keep groups that satisfy a certain condition.
To do so, we use the filter method instead of any aggregation method.

For instance, if we want to keep only the 'year's in which Michigan won at least 12 games:

In [10]:

def has_at_least_twelve_wins(group):
    # has_at_least_twelve_wins is called separately for each 'year'.
    # Each time it's called, the `group` argument is a DataFrame.
    return (group['result'] == 'W').sum() >= 12

twelveplus_wins = df.groupby('year').filter(has_at_least_twelve_wins)
twelveplus_wins

Out[10]:

	year	opponent	venue	result	UM_score	opp_score	date	time
13	2023	East Carolina	Home	W	30.0	3.0	Sep 2 (Sat)	Noon
14	2023	UNLV	Home	W	35.0	7.0	Sep 9 (Sat)	3:30 PM
15	2023	Bowling Green	Home	W	31.0	6.0	Sep 16 (Sat)	7:30 PM
...	...	...	...	...	...	...	...	...
367	1997	#23 Wisconsin	Away	W	26.0	16.0	Nov 15 (Sat)	2:30 PM CT
368	1997	#4 Ohio State	Home	W	20.0	14.0	Nov 22 (Sat)	Noon
369	1997	#8 Washington State	NaN	W	21.0	16.0	Jan 1 (Thu)	2:00 PM PT

58 rows × 8 columns

In [11]:

twelveplus_wins['year'].unique()

Out[11]:

array([2023, 2022, 2021, 1997])

Pivot tables¶

It's possible to use groupby on two columns at once, e.g. df.groupby(['year', 'venue']).
The resulting operations are performed separately for every unique combination of values in the specified columns.
See Lecture 5 for examples. We can also group on three or more columns too!

In [12]:

# The index here is a MultiIndex.
df.groupby(['year', 'venue'])['UM_score'].mean()

Out[12]:

year  venue
1924  Away     12.50
      Home     26.25
1925  Away     12.00
               ...  
2023  Home     35.71
2024  Away     13.00
      Home     26.88
Name: UM_score, Length: 202, dtype: float64

Rather than grouping on two columns at once, we can create a pivot table, which also shows the result of aggregating by two columns at once but in a more human-readable way.

In [13]:

df.pivot_table(index='year', columns='venue', values='UM_score', aggfunc='mean')

Out[13]:

venue	Away	Home
year
1924	12.50	26.25
1925	12.00	40.20
1926	12.00	33.40
...	...	...
2022	38.75	40.38
2023	40.20	35.71
2024	13.00	26.88

101 rows × 2 columns

The DataFrame above has:
- One row for every unique value in index.
- One column for every unique value in columns.
- Values determined by applying aggfunc on values in values.

Notice how the value 12.50 at (1924, 'Away') in our MultiIndex groupby result matches the corresponding value in our pivot table! Michigan scored 12.5 points away on average in 1924.

Merging¶

We merge DataFrames when we have two sources of information about the same individuals that is linked by a common column(s).
The common column(s) are called the join key.
In SQL, we call this operation a JOIN.

When we merge DataFrames, we have to specify how to merge the DataFrames, in the (likely) event that there are rows in the two DataFrames that don't match.

No description has been provided for this image

To help us visualize these operations, we'll redefine df to be a smaller subset of our original DataFrame.

In [14]:

# Smaller game DataFrame
df = pd.DataFrame({
    "year": [2024, 2024, 2023, 2023, 2022],
    "opponent": ["Washington", "Alabama", "Washington", "Michigan State", "Oregon"],
    "venue": ["Home", "Away", "Away", "Home", "Home"],
    "result": ["L", "W", "L", "W", "L"]
})

# Big Ten Teams DataFrame (5 rows)
big_10 = pd.DataFrame({
    "team": ["Minnesota", "Michigan State", "Washington", "Ohio State", "Oregon"],
    "coach": ["P.J. Fleck", "Mel Tucker", "Kalen DeBoer", "Ryan Day", "Dan Lanning"],
    "location": ["Minneapolis, MN", "East Lansing, MI", "Seattle, WA", "Columbus, OH", "Eugene, OR"]
})

dfs_side_by_side(df, big_10)

	year	opponent	venue	result
0	2024	Washington	Home	L
1	2024	Alabama	Away	W
2	2023	Washington	Away	L
3	2023	Michigan State	Home	W
4	2022	Oregon	Home	L

	team	coach	location
0	Minnesota	P.J. Fleck	Minneapolis, MN
1	Michigan State	Mel Tucker	East Lansing, MI
2	Washington	Kalen DeBoer	Seattle, WA
3	Ohio State	Ryan Day	Columbus, OH
4	Oregon	Dan Lanning	Eugene, OR

An inner merge keeps only the rows where the keys match in both DataFrames. In this case, it keeps the games where Michigan played teams from the Big 10 (based on the opponent column in df and the team column in big_10). Notice how the resulting DataFrame includes all columns from both DataFrames, but only for rows where there’s a match."

In [15]:

df.merge(big_10, left_on="opponent", right_on="team", how="inner")

Out[15]:

	year	opponent	venue	result	team	coach	location
0	2024	Washington	Home	L	Washington	Kalen DeBoer	Seattle, WA
1	2023	Washington	Away	L	Washington	Kalen DeBoer	Seattle, WA
2	2023	Michigan State	Home	W	Michigan State	Mel Tucker	East Lansing, MI
3	2022	Oregon	Home	L	Oregon	Dan Lanning	Eugene, OR

Merging Activity¶

Activity

Consider the following DataFrame, evs, which has 32 rows total.

And consider the pivot table that contains the distribution of “BodyStyle” for all “Brands” in evs, other than Nissan.

evs.pivot_table(index='Brand', columns='BodyStyle', values='BodyStyle', aggfunc='count').

Suppose we ran the following lines of code:

tesla = evs[evs.get("Brand") == "Tesla"]

bmw = evs[evs.get("Brand") == "BMW"]

audi = evs[evs.get("Brand") == "Audi"]

Question: How many rows does the DataFrame combo have?

combo = tesla.merge(bmw, on="BodyStyle").merge(audi, on="BodyStyle")

Hint: When we leave the how parameter empty, .merge defaults to an inner join.

Click here to see the answer to the previous activity.

Answer: 35¶

First, we need to determine the number of rows in tesla.merge(bmw, on="BodyStyle"), and then determine the number of rows in combo. For the purposes of the solution, let’s use temp to refer to the first merged DataFrame, tesla.merge(bmw, on="BodyStyle").

Step 1: Merging Tesla and BMW (`temp`)¶

When merging two DataFrames, the resulting DataFrame (temp) contains a row for every match between the two columns being merged, while rows without matches are excluded. In this case, the column of interest is BodyStyle.

Matches Between Tesla and BMW:¶

SUV: Tesla has $4$ rows for "SUV", and BMW has $1$ row for "SUV".
$$ 4 \times 1 = 4 \text{ SUV rows in } temp $$
Sedan: Tesla has $3$ rows for "Sedan", and BMW has $1$ row for "Sedan".
$$ 3 \times 1 = 3 \text{ Sedan rows in } temp $$

Thus, the total number of rows in temp is: $$ 4 + 3 = 7 \text{ rows} $$

Step 2: Merging `temp` with Audi (`combo`)¶

Now, we merge temp with the Audi DataFrame, again on BodyStyle. Each row in temp matches with rows in Audi based on the same "BodyStyle".

Matches Between `temp` and Audi:¶

SUV: temp has $4$ rows for "SUV", and Audi has $8$ rows for "SUV".
$$ 4 \times 8 = 32 \text{ SUV rows in } combo $$
Sedan: temp has $3$ rows for "Sedan", and Audi has $1$ row for "Sedan".
$$ 3 \times 1 = 3 \text{ Sedan rows in } combo $$

Thus, the total number of rows in combo is: $$ 32 + 3 = 35 $$

Note: You may notice that 35 is the result of multiplying the "SUV" and "Sedan" columns in the DataFrame provided, and adding up the results.

Discussion Slides: Grouping, Pivoting, and Merging

Agenda 📆¶

Today's Dataset 🏈〽️¶

Read the DataFrame internals guide!¶

The groupby method¶

Reference Slide¶

The split, apply, combine process¶

Reference Slide¶

Aggregating¶

Filtering¶

Pivot tables¶

Merging¶

Merging Activity¶

Activity

Answer: 35¶

Step 1: Merging Tesla and BMW (temp)¶

Matches Between Tesla and BMW:¶

Step 2: Merging temp with Audi (combo)¶

Matches Between temp and Audi:¶

The `groupby` method¶

Step 1: Merging Tesla and BMW (`temp`)¶

Step 2: Merging `temp` with Audi (`combo`)¶

Matches Between `temp` and Audi:¶