Grouping, Pivoting, and Merging

← return to study.practicaldsc.org


The problems in this worksheet are taken from past exams in similar classes. Work on them on paper, since the exams you take in this course will also be on paper.

We encourage you to complete this worksheet in a live discussion section. Solutions will be made available after all discussion sections have concluded. You don’t need to submit your answers anywhere.

Note: We do not plan to cover all problems here in the live discussion section; the problems we don’t cover can be used for extra practice.


Problem 1

In this problem, we will work with the DataFrame tv, which contains information about various TV shows available to watch on streaming services. For each TV show, we have:

The first few rows of tv are shown below (though tv has many more rows than are pictured here).

As you see in the first few rows of tv, some TV shows are available for streaming on multiple streaming services. Fill in the blanks so that the two expressions below, Expression 1 and Expression 2, both evaluate to the "Title" of the TV show that is available for streaming on the greatest number of streaming services. Assume there are no ties and that the "Title" column contains unique values.

Expression 1:

tv.set_index("Title").loc[__(a)__].T.sum(axis=0).idxmax()

Expression 2:

    (
        tv.assign(num_services=tv.iloc[__(b)__].sum(__(c)__))
            .sort_values("num_services")
            .iloc[__(d)__]
    )

Hint: .T transposes the rows and columns of a DataFrame — the indexes of df are the columns of df.T and vice versa.

What goes in the blanks?

Answers:

    1. :, "Netflix": or some variation of that
    1. :, 5: or some variation of that
    1. axis=1
    1. -1, 0

In Expression 1, keep in mind that idxmax() is a Series method returns the index of the row with the maximum value. As such, we can infer that Expression 1 sums the service-specific indicator columns (that is, the columns "Netflix", "Hulu", "Prime Video", and "Disney+") for each row and returns the index of the row with the greatest sum. To do this, we need the loc accessor to select all the service-specific indicator columns, which we can do using loc[:, "Netflix":] or loc[:, ["Netflix", "Hulu", "Prime Video", "Disney+"]].

When looking at Expression 2, we can split the problem into two parts: the code inside the assign statement and the code outside of it.

  • Glancing at the code inside of the assign statement, (and also noticing the variable num_services), we realize that we, once again, want to sum up the values in the service-specific indicator columns. We do this by first selecting the last four columns, using .iloc[:, 5:] (notice the iloc), and then summing over axis=1. We use axis=1 (different from axis=0 in Expression 1), because unlike Expression 1, we’re summing over each row, instead of each column. If there had not been a .T in the code for Expression 1, we would’ve also used axis=1 in Expression 1.
  • Finally, we need to select the "Title" of the last row in DataFrame in Expression 2, because sort_values sorts in ascending order by default. The last row has an integer position of -1, and the "Title" column has an integer position of 0, so we use iloc[-1, 0].

Problem 2

The laptops DataFrame contains information on various factors that influence the pricing of laptops. Each row represents a laptop, and the columns are:


Problem 2.1

Without using groupby, write an expression that evaluates to the average price of laptops with the "macOS" operating system (the same quantity as above).

Answer: laptops.loc[laptops["OS"] == "macOS", "Price"].mean()



Problem 2.2

Using groupby, write an expression that evaluates to the average price of laptops with the "macOS" operating system.

Answer: laptops.groupby("OS")["Price"].mean().loc["macOS"]



Problem 3

You are given a DataFrame called books that contains columns 'author' (string), 'title' (string), 'num_chapters' (int), and 'publication_year' (int).

Suppose that after doing books.groupby('author').max(), one row says

author title num_chapters publication_year
Charles Dickens Oliver Twist 53 1838


Problem 3.1

Based on this data, can you conclude that Charles Dickens is the alphabetically last of all author names in this dataset?

Answer: No

When we group by 'author', all books by the same author get aggregated together into a single row. The aggregation function is applied separately to each other column besides the column we’re grouping by. Since we’re grouping by 'author' here, the 'author' column never has the max() function applied to it. Instead, each unique value in the 'author' column becomes a value in the index of the grouped DataFrame. We are told that the Charles Dickens row is just one row of the output, but we don’t know anything about the other rows of the output, or the other authors. We can’t say anything about where Charles Dickens falls when authors are ordered alphabetically (but it’s probably not last!)


Problem 3.2

Based on this data, can you conclude that Charles Dickens wrote Oliver Twist?

Answer: Yes

Grouping by 'author' collapses all books written by the same author into a single row. Since we’re applying the max() function to aggregate these books, we can conclude that Oliver Twist is alphabetically last among all books in the books DataFrame written by Charles Dickens. So Charles Dickens did write Oliver Twist based on this data.


Problem 3.3

Based on this data, can you conclude that Oliver Twist has 53 chapters?

Answer: No

The key to this problem is that groupby applies the aggregation function, max() in this case, independently to each column. The output should be interpreted as follows:

  • Among all books in books written by Charles Dickens, Oliver Twist is the title that is alphabetically last.
  • Among all books in books written by Charles Dickens, 53 is the greatest number of chapters.
  • Among all books in books written by Charles Dickens, 1838 is the latest year of publication.

However, the book titled Oliver Twist, the book with 53 chapters, and the book published in 1838 are not necessarily all the same book. We cannot conclude, based on this data, that Oliver Twist has 53 chapters.


Problem 3.4

Based on this data, can you conclude that Charles Dickens wrote a book with 53 chapters that was published in 1838?

Answer: No

As explained in the previous question, the max() function is applied separately to each column, so the book written by Charles Dickens with 53 chapters may not be the same book as the book written by Charles Dickens published in 1838.



Problem 4

The h table records addresses within San Diego. Only 50 addresses are recorded. The index of the dataframe contains the numbers 1-50 as unique integers.

Fill in the Python code to create a DataFrame containing the proportion of 4-digit address numbers for each unique street in h.

def foo(x):
    lengths = __(a)__
    return (lengths == 4).mean()

h.groupby(__(b)__).__(c)__(foo)

Answer:

(a): x.astype(str).str.len()

(b): 'street'

(c): agg


Problem 5

The DataFrame items describes various items available to collect or purchase using bells, the currency used in the game Animal Crossing: New Horizons.

For each item, we have:

The first 6 rows of items are below, though items has more rows than are shown here.

The DataFrame keepers has 5 rows, each of which represent a different shopkeeper in the Animal Crossing: New Horizons universe.

keepers is shown below in its entirety.


How many rows are in the following DataFrame? Give your answer as an integer.

keepers.merge(items.iloc[:6]), 
              left_on="Store", 
              right_on="Location")

Answer: 10. Since the type of join is not specified, this is an inner join. Each row in keepers is merged with each row in items only if 'Store' in keepers equals 'Location' in items. Each row in keepers has the following number of merges: row 0 has 1, row 1 has 3, row 2 has 3, row 3 has 0 (there are no rows in items with 'Location' equal to ‘Kicks Shoe Store’), and row 4 has 3.

1 + 3 + 3 + 0 + 3 = 10


Problem 6

Suppose we create a DataFrame called midwest containing Nishant’s flights departing from DTW, ORD, and MKE. midwest has 10 rows; the bar chart below shows how many of these 10 flights departed from each airport.


Consider the DataFrame that results from merging midwest with itself, as follows:

double_merge = midwest.merge(midwest, left_on='FROM', right_on='FROM')

How many rows does double_merge have?

Answer: 38

There are two flights from DTW. When we merge midwest with itself on the 'FROM' column, each of these flights gets paired up with each of these flights, for a total of four rows in the output. That is, the first flight from DTW gets paired with both the first and second flights from DTW. Similarly, the second flight from DTW gets paired with both the first and second flights from DTW.

Following this logic, each of the five flights from ORD gets paired with each of the five flights from ORD, for an additional 25 rows in the output. For MKE, there will be 9 rows in the output. The total is therefore 2^2 + 5^2 + 3^2 = 4 + 25 + 9 = 38 rows.


Problem 7

Kyle and Yutong are trying to decide where they’ll study on campus and start flipping a Michigan-themed coin, with a picture of the Michigan Union on the heads side and a picture of the Shapiro Undergraduate Library (aka the UgLi) on the tails side.


Kyle flips the coin 21 times and sees 13 heads and 8 tails. He stores this information in a DataFrame named kyle that has 21 rows and 2 columns, such that:

Then, Yutong flips the coin 11 times and sees 4 heads and 7 tails. She stores this information in a DataFrame named yutong that has 11 rows and 2 columns, such that:


Problem 7.1

How many rows are in the following DataFrame? Give your answer as an integer.

    kyle.merge(yutong, on="flips")

Hint: The answer is less than 200.

Answer: 108

Since we used the argument on="flips, rows from kyle and yutong will be combined whenever they have matching values in their "flips" columns.

For the kyle DataFrame:

  • There are 13 rows with "Heads" in the "flips" column.
  • There are 8 rows with "Tails" in the "flips" column.

For the yutong DataFrame:

  • There are 4 rows with "Heads" in the "flips" column.
  • There are 7 rows with "Tails" in the "flips" column.

The merged DataFrame will also only have the values "Heads" and "Tails" in its "flips" column.

  • The 13 "Heads" rows from kyle will each pair with the 4 "Heads" rows from yutong. This results in 13 \cdot 4 = 52 rows with "Heads"
  • The 8 "Tails" rows from kyle will each pair with the 7 "Tails" rows from yutong. This results in 8 \cdot 7 = 56 rows with "Tails".

Then, the total number of rows in the merged DataFrame is 52 + 56 = 108.


Problem 7.2

Let A be your answer to the previous part. Now, suppose that:

Suppose we again merge kyle and yutong on the "flips" column. In terms of A, how many rows are in the new merged DataFrame?

Answer: A+1

The additional row in each DataFrame has a unique "flips" value of "Total". When we merge on the "flips" column, this unique value will only create a single new row in the merged DataFrame, as it pairs the "Total" from kyle with the "Total" from yutong. The rest of the rows are the same as in the previous merge, and as such, they will contribute the same number of rows, A, to the merged DataFrame. Thus, the total number of rows in the new merged DataFrame will be A (from the original matching rows) plus 1 (from the new "Total" rows), which sums up to A+1.



Problem 8

Define small_students to be the DataFrame with 8 rows and 2 columns shown directly below, and define districts to be the DataFrame with 3 rows and 2 columns shown below small_students.

Consider the DataFrame merged, defined below.

merged = small_students.merge(districts, 
                                left_on="High School", 
                                right_on="school", 
                                how="outer")

How many total NaN values does merged contain? Give your answer as an integer.

Answer: 4

merged is shown below.


Problem 9

The DataFrame dogs, contains one row for every registered pet dog in Zurich, Switzerland in 2017.

The first few rows of dogs are shown below, but dogs has many more rows than are shown.


In this question, assume that there are more than 12 districts in dogs.

Suppose we merge the dogs DataFrame with itself as follows.

# on="x" is the same as specifying both left_on="x" and right_on="x".
double = dogs.merge(dogs, on="district")

# sort_index sorts a Series in increasing order of its index.
square = double["district"].value_counts().value_counts().sort_index()

The first few rows of square are shown below.

1     5500
4      215
9       40


Problem 9.1

In dogs, there are 12 rows with a "district" of 8. How many rows of double have a "district" of 8? Give your answer as a positive integer.

Answer: 144

When we merge dogs with dogs on "district", each 8 in the first dogs DataFrame will be combined with each 8 in the second dogs DataFrame. Since there are 12 in the first and 12 in the second, there are 12 \cdot 12 = 144 combinations.


Problem 9.2

What does the following expression evaluate to? Give your answer as a positive integer.

dogs.groupby("district").filter(lambda df: df.shape[0] == 3).shape[0]

Hint: Unlike in 5.1, your answer to 5.2 depends on the values in square.

Answer: 120

square is telling us that: - There are 5500 districts that appeared just 1x in dogs. - There are 215 districts that appeared 2x in dogs (2x, not 4x, because of the logic explained in the 5a rubric item). - There are 40 districts that appeared 3x in dogs.

The expression given in this question is keeping all of the rows corresponding to districts that appear 3 times. There are 40 districts that appear 3 times. So, the total number of rows in this DataFrame is 40 \cdot 3 = 120.



Problem 10

The DataFrame tv_excl contains all of the information we have for TV shows that are available to stream exclusively on a single streaming service. The "Service" column contains the name of the one streaming service that the TV show is available for streaming on.

The first few rows of tv_excl are shown below (though, of course, tv_excl has many more rows than are pictured here). Note that Being Erica is not in tv_excl, since it is available to stream on multiple services.

The DataFrame counts, shown in full below, contains the number of TV shows for every combination of "Age" and "Service".

Given the above information, what does the following expression evaluate to?

tv_excl.groupby(["Age", "Service"]).sum().shape[0]

Answer: 18

Note that the DataFrame counts is a pivot table, created using tv_excl.pivot_table(index="Age", columns="Service", aggfunc="size"). As we saw in lecture, pivot tables contain the same information as the result of grouping on two columns.

The DataFrame tv_excl.groupby(["Age", "Service"]).sum() will have one row for every unique combination of "Age" and "Service" in tv_excl. (The same is true even if we used a different aggregation method, like .mean() or .max().) As counts shows us, tv_excl contains every possible combination of a single element in {"13+", "16+", "18+", "7+", "all"} with a single element in {"Disney+", "Hulu", "Netflix", "Prime Video"}, except for ("13+", "Disney+") and ("18+", "Disney+"), which were not present in tv_excl; if they were, they would have non-null values in counts.

As such, tv_excl.groupby(["Age", "Service"]).sum() will have 20 - 2 = 18 rows, and tv_excl.groupby(["Age", "Service"]).sum().shape[0] evaluates to 18.


Problem 11

The DataFrame flights contains information about recent flights, with each row representing a specific flight and the following columns:

Suppose we have another DataFrame more_flights which contains the same columns as flights, but different rows. Define merged as follows.

        merged = flights.merge(more_flights, on = "airline")

Suppose that in merged, there are 108 flights where the airline is "United", and in more_flights, there are 12 flights where the airline is "United". If flights has 15 rows in total, how many of these rows are not for "United" flights? Give your answer as an integer.

Answer: 6

We are merging dataframes flights and more_flights according to the airline each flight belongs to. All the "United" flights in more_flights will be merged with all the "United" flights in flights, which we know gives us 108 total flights. We also know that there are 12 "United" flights in more_flights. To find the number of "United" flights in flights, we simply need to divide the total number of "United" flights in merged by the number of "United" flights in more_flights, which is 108/12 = 9. If flights has a total of 15 rows, then the total number of non-United rows is equal to 15 - 9 = 6.


Problem 12

For your convenience, the first few rows of tv are shown again below.

For the purposes of this question only, suppose we have also access to another similar DataFrame, movies, which contains information about a variety of movies. The information we have for each movie in movies is the same as the information we have for each TV show in tv, except for IMDb ratings, which are missing from movies.

The first few rows of movies are shown below (though movies has many more rows than are pictured here).


Problem 12.1

The function total_null, defined below, takes in a DataFrame and returns the total number of null values in the DataFrame.

total_null = lambda df: df.isna().sum().sum()

Consider the function delta, defined below.

def delta(a, b):
    tv_a = tv.head(a)
    movies_b = movies.head(b)
    together = pd.concat([tv_a, movies_b])
    return total_null(together) - total_null(tv_a) - total_null(movies_b)

Which of the following functions is equivalent to delta?

Answer: lambda a, b: b

Let’s understand what each function does.

  • total_null just counts all the null values in a DataFrame.
  • delta concatenates the first a rows of tv with the first b rows of movies vertically, that is, on top of one another (over axis 0). It then returns the difference between the total number of null values in the concatenated DataFrame and the total number of null values in the first a rows of tv and first b rows of movies – in other words, it returns the number of null values that were added as a result of the concatenation.

The key here is recognizing that tv and movies have all of the same column names, except movies doesn’t have an "IMDb" column. As a result, when we concatenate, the "IMDb" column will contain null values for every row that was originally from movies. Since b rows from movies are in the concatenated DataFrame, b new null values are introduced as a result of the concatenation, and thus lambda, a, b: b does the same thing as delta.


Problem 12.2

Fill in the blank to complete the implementation of the function size_of_merge, which takes a string col, corresponding to the name of a single column that is shared between tv and movies, and returns the number of rows in the DataFrame tv.merge(movies, on=col).

What goes in the blank below?

def size_of_merge(col):
    return (____).sum()

Hint: Consider the behavior below.

>>> s1 = pd.Series({'a': 2, 'b': 3})
>>> s2 = pd.Series({'c': 4, 'a': -1, 'b': 4})
>>> s1 * s2
a    -2.0
b    12.0
c     NaN
dtype: float64

Answer: tv[col].value_counts() * movies[col].value_counts()

tv.merge(movies, on=col) contains one row for every “match” between tv[col] and movies[col]. Suppose, for example, that col="Year". If tv["Year"] contains 30 values equal to 2019, and movies["Year"] contains 5 values equal to 2019, tv.merge(movies, on="Year") will contain 30 \cdot 5 = 150 rows in which the "Year" value is equal to 2019 – one for every combination of a 2019 row in tv and a 2019 row in movies.

tv["Year"].value_counts() and movies["Year"].value_counts() contain, respectively, the frequencies of the unique values in tv["Year"] and movies["Year"]. Using the 2019 example from above, tv["Year"].value_counts() * movies["Year"].value_counts() will contain a row whose index is 2019 and whose value is 150, with similar other entries for the other years in the two Series. (The hint is meant to demonstrate the fact that no matter how the two Series are sorted, the product is done element-wise by matching up indexes.) Then, (tv["Year"].value_counts() * movies["Year"].value_counts()).sum() will sum these products across all years, ignoring null values.

As such, the answer we were looking for is tv[col].value_counts() * movies[col].value_counts() (remember, "Year" was just an example for this explanation).



👋 Feedback: Find an error? Still confused? Have a suggestion? Let us know here.