
The problems in this worksheet are taken from past exams in similar
classes. Work on them **on paper**, since the exams you
take in this course will also be on paper.

We encourage you to
complete this worksheet in a live discussion section. Solutions will be
made available after all discussion sections have concluded. You don’t
need to submit your answers anywhere.

**Note: We do not plan to cover all problems here in the live
discussion section**; the problems we don’t cover can be used for
extra practice.

In this problem, we will work with the DataFrame `tv`, which contains
information about various TV shows available to watch on streaming
services. For each TV show, we have:

`"Title" (object)`

: The title of the TV show.`"Year" (int)`

: The year in which the TV show was first released. (For instance, the show*How I Met Your Mother*ran from 2005 to 2014; there is only one row for*How I Met Your Mother*in`tv`

, and its`"Year"`

value is 2005.)`"Age" (object)`

: The age category for the TV show. If not missing,`"Age"`

is one of`"all"`

,`"7+"`

,`"13+"`

,`"16+"`

, or`"18+"`

. (For instance,`"all"`

means that the show is appropriate for all audiences, while `“18+”} means that the show contains mature content and viewers should be at least 18 years old.)`"IMDb" (float)`

: The TV show’s rating on IMDb (between 0 and 10).`"Rotten Tomatoes" (int)`

: The TV show’s rating on Rotten Tomatoes (between 0 and 100).`"Netflix" (int)`

: 1 if the show is available for streaming on Netflix and 0 otherwise. The`"Hulu"`

,`"Prime Video"`

, and`"Disney+"`

columns work the same way.

The first few rows of `tv` are shown below (though `tv` has many more
rows than are pictured here).

As you can see in the first few rows of `tv`, some TV shows are
available for streaming on multiple streaming services. Fill in the
blanks so that the two expressions below, Expression 1 and Expression
2, **both** evaluate to the `"Title"` of the TV show that is available
for streaming on the **greatest number of streaming services**. Assume
there are no ties and that the `"Title"` column contains unique
values.

Expression 1:

```
tv.set_index("Title").loc[__(a)__].T.sum(axis=0).idxmax()
```

Expression 2:

```
(tv.assign(num_services=tv.iloc[__(b)__].sum(__(c)__))
   .sort_values("num_services")
   .iloc[__(d)__])
```

**Hint**: `.T` transposes the rows and columns of a DataFrame — the
indexes of `df` are the columns of `df.T` and vice versa.

What goes in the blanks?

**Answers**:

- (a): `:, "Netflix":` (or some variation of that)
- (b): `:, 5:` (or some variation of that)
- (c): `axis=1`
- (d): `-1, 0`

In Expression 1, keep in mind that `idxmax()` is a Series method that
returns the index of the row with the maximum value. As such, we can
infer that Expression 1 sums the service-specific indicator columns
(that is, the columns `"Netflix"`, `"Hulu"`, `"Prime Video"`, and
`"Disney+"`) for each row and returns the index of the row with the
greatest sum. To do this, we need the `loc` accessor to select all the
service-specific indicator columns, which we can do using
`loc[:, "Netflix":]` or
`loc[:, ["Netflix", "Hulu", "Prime Video", "Disney+"]]`.

When looking at Expression 2, we can split the problem into two parts:
the code inside the `assign` statement and the code outside of it.

- Glancing at the code inside of the `assign` statement (and also
  noticing the variable `num_services`), we realize that we, once
  again, want to sum up the values in the service-specific indicator
  columns. We do this by first selecting the last four columns, using
  `.iloc[:, 5:]` (notice the `iloc`), and then summing over `axis=1`.
  We use `axis=1` (different from `axis=0` in Expression 1) because,
  unlike in Expression 1, we’re summing over each row instead of each
  column. If there had not been a `.T` in the code for Expression 1,
  we would’ve also used `axis=1` in Expression 1.
- Finally, we need to select the `"Title"` of the last row of the
  DataFrame in Expression 2, because `sort_values` sorts in ascending
  order by default. The last row has an integer position of -1, and
  the `"Title"` column has an integer position of 0, so we use
  `iloc[-1, 0]`.
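
To see both filled-in expressions end to end, here is a minimal sketch
on a hypothetical miniature version of `tv` (the rows and values below
are made up for illustration, but the column order matches the
description above, so `"Netflix"` sits at integer position 5):

```
import pandas as pd

tv = pd.DataFrame({
    "Title": ["A", "B", "C"],
    "Year": [2005, 2010, 2015],
    "Age": ["all", "7+", "16+"],
    "IMDb": [8.0, 7.5, 9.0],
    "Rotten Tomatoes": [80, 75, 90],
    "Netflix": [1, 0, 1],
    "Hulu": [1, 1, 1],
    "Prime Video": [0, 0, 1],
    "Disney+": [0, 0, 1],
})

# Expression 1, with blank (a) filled in.
expr1 = tv.set_index("Title").loc[:, "Netflix":].T.sum(axis=0).idxmax()

# Expression 2, with blanks (b), (c), and (d) filled in.
expr2 = (tv.assign(num_services=tv.iloc[:, 5:].sum(axis=1))
           .sort_values("num_services")
           .iloc[-1, 0])

print(expr1, expr2)  # C C: show "C" streams on all 4 services
```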

The `laptops` DataFrame contains information on various factors that
influence the pricing of laptops. Each row represents a laptop, and
the columns are:

`"Mfr" (str)`

: the company that manufactures the laptop, like “Apple” or “Dell”.`"Model" (str)`

: the model name of the laptop, such as “MacBook Pro”.`"OS" (str)`

: the operating system, such as “macOS” or “Windows 11”.`"Screen Size" (float)`

: the diagonal length of the screen, in inches.`"Price" (float)`

: the price of the laptop, in dollars.

**Without using `groupby`**, write an expression that evaluates to the
average price of laptops with the `"macOS"` operating system (the same
quantity as above).

**Answer:**
`laptops.loc[laptops["OS"] == "macOS", "Price"].mean()`

**Using `groupby`**, write an expression that evaluates to the average
price of laptops with the `"macOS"` operating system.

**Answer:**
`laptops.groupby("OS")["Price"].mean().loc["macOS"]`
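
As a quick sanity check, both expressions agree on a hypothetical
miniature version of `laptops` (all values below are invented for
illustration):

```
import pandas as pd

laptops = pd.DataFrame({
    "Mfr": ["Apple", "Dell", "Apple"],
    "Model": ["MacBook Pro", "XPS 13", "MacBook Air"],
    "OS": ["macOS", "Windows 11", "macOS"],
    "Screen Size": [14.2, 13.4, 13.6],
    "Price": [1999.0, 1299.0, 1099.0],
})

# Both compute the mean "Price" among rows whose "OS" is "macOS".
without_groupby = laptops.loc[laptops["OS"] == "macOS", "Price"].mean()
with_groupby = laptops.groupby("OS")["Price"].mean().loc["macOS"]
print(without_groupby, with_groupby)  # 1549.0 1549.0
```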

You are given a DataFrame called `books` that contains columns
`'author'` (string), `'title'` (string), `'num_chapters'` (int), and
`'publication_year'` (int).

Suppose that after doing `books.groupby('author').max()`, one row
says:

| author | title | num_chapters | publication_year |
|---|---|---|---|
| Charles Dickens | Oliver Twist | 53 | 1838 |

Based on this data, can you conclude that Charles Dickens is the alphabetically last of all author names in this dataset?

- Yes
- No

**Answer:** No

When we group by `'author'`, all books by the same author get
aggregated together into a single row. The aggregation function is
applied separately to each column besides the column we’re grouping
by. Since we’re grouping by `'author'` here, the `'author'` column
never has the `max()` function applied to it. Instead, each unique
value in the `'author'` column becomes a value in the index of the
grouped DataFrame. We are told that the Charles Dickens row is just
one row of the output, but we don’t know anything about the other rows
of the output, or the other authors. We can’t say anything about where
Charles Dickens falls when authors are ordered alphabetically (but
it’s probably not last!).

Based on this data, can you conclude that Charles Dickens wrote
*Oliver Twist*?

- Yes
- No

**Answer:** Yes

Grouping by `'author'` collapses all books written by the same author
into a single row. Since we’re applying the `max()` function to
aggregate these books, we can conclude that *Oliver Twist* is
alphabetically last among all books in the `books` DataFrame written
by Charles Dickens. So Charles Dickens did write *Oliver Twist* based
on this data.

Based on this data, can you conclude that *Oliver Twist* has
53 chapters?

- Yes
- No

**Answer:** No

The key to this problem is that `groupby` applies the aggregation
function, `max()` in this case, independently to each column. The
output should be interpreted as follows:

- Among all books in `books` written by Charles Dickens, *Oliver
  Twist* is the title that is alphabetically last.
- Among all books in `books` written by Charles Dickens, 53 is the
  greatest number of chapters.
- Among all books in `books` written by Charles Dickens, 1838 is the
  latest year of publication.

However, the book titled *Oliver Twist*, the book with 53 chapters,
and the book published in 1838 are not necessarily all the same book.
We cannot conclude, based on this data, that *Oliver Twist* has 53
chapters.

Based on this data, can you conclude that Charles Dickens wrote a book with 53 chapters that was published in 1838?

- Yes
- No

**Answer:** No

As explained in the previous question, the `max()` function is applied
separately to each column, so the book written by Charles Dickens with
53 chapters may not be the same book as the book written by Charles
Dickens published in 1838.

The `h` table records addresses within San Diego. Only 50 addresses
are recorded. The index of the DataFrame contains the numbers 1-50 as
unique integers.

`"number" (int)`
: Street address number.

`"street" (str)`
: Street name.

Fill in the Python code to create a DataFrame containing the
proportion of 4-digit address numbers for each unique street in `h`.

```
def foo(x):
    lengths = __(a)__
    return (lengths == 4).mean()

h.groupby(__(b)__).__(c)__(foo)
```

**Answer:**

- (a): `x.astype(str).str.len()`
- (b): `'street'`
- (c): `agg`
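
Putting the pieces together, the completed code, run on a hypothetical
miniature `h` (the addresses below are made up):

```
import pandas as pd

h = pd.DataFrame({
    "number": [1234, 56, 7890, 123],
    "street": ["El Cajon Blvd", "El Cajon Blvd", "Voigt Dr", "Voigt Dr"],
})

def foo(x):
    # Convert each address number to a string and count its digits.
    lengths = x.astype(str).str.len()
    return (lengths == 4).mean()

# One row per unique street; the "number" column holds the proportion
# of that street's address numbers with exactly 4 digits.
print(h.groupby("street").agg(foo))
#                number
# street
# El Cajon Blvd     0.5
# Voigt Dr          0.5
```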

The DataFrame `items` describes various items available to collect or
purchase using bells, the currency used in the game *Animal Crossing:
New Horizons*. For each item, we have:

`"Item" (str)`
: The name of the item.

`"Cost" (int)`
: The cost of the item in bells. Items that cost 0 bells cannot be
purchased and must be collected through other means (such as
crafting).

`"Location" (str)`
: The store or action through which the item can be obtained.

The first 6 rows of `items` are below, though `items` has more rows
than are shown here.

The DataFrame `keepers` has 5 rows, each of which represents a
different shopkeeper in the *Animal Crossing: New Horizons* universe.
`keepers` is shown below in its entirety.

How many rows are in the following DataFrame? Give your answer as an integer.

```
keepers.merge(items.iloc[:6],
              left_on="Store",
              right_on="Location")
```

**Answer**: 10. Since the type of join is not specified, this is an
inner join. Each row in `keepers` is merged with each row in `items`
only if `'Store'` in `keepers` equals `'Location'` in `items`. Each
row in `keepers` has the following number of matches: row 0 has 1, row
1 has 3, row 2 has 3, row 3 has 0 (there are no rows in `items` with
`'Location'` equal to ‘Kicks Shoe Store’), and row 4 has 3.

1 + 3 + 3 + 0 + 3 = 10
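
The pairing logic can be reproduced on a tiny hypothetical example
(the rows below are invented, not the actual `keepers` and `items`):

```
import pandas as pd

keepers = pd.DataFrame({
    "Keeper": ["Tom Nook", "Kicks"],
    "Store": ["Nook's Cranny", "Kicks Shoe Store"],
})
items = pd.DataFrame({
    "Item": ["Shovel", "Axe", "Socks"],
    "Location": ["Nook's Cranny", "Nook's Cranny", "Able Sisters"],
})

# Inner join: one output row per (keepers row, items row) pair whose
# "Store" and "Location" values match.
merged = keepers.merge(items, left_on="Store", right_on="Location")
print(merged.shape[0])  # 2: "Nook's Cranny" matches twice, "Kicks Shoe Store" never
```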

Suppose we create a DataFrame called `midwest` containing Nishant’s
flights departing from DTW, ORD, and MKE. `midwest` has 10 rows; the
bar chart below shows how many of these 10 flights departed from each
airport.

Consider the DataFrame that results from merging `midwest` with
itself, as follows:

`double_merge = midwest.merge(midwest, left_on='FROM', right_on='FROM')`

How many rows does `double_merge` have?

**Answer:** 38

There are two flights from DTW. When we merge `midwest` with itself on
the `'FROM'` column, each of these flights gets paired up with each of
these flights, for a total of four rows in the output. That is, the
first flight from DTW gets paired with both the first and second
flights from DTW. Similarly, the second flight from DTW gets paired
with both the first and second flights from DTW.

Following this logic, each of the five flights from ORD gets paired
with each of the five flights from ORD, for an additional 25 rows in
the output. For MKE, there will be 9 rows in the output. The total is
therefore $2^2 + 5^2 + 3^2 = 4 + 25 + 9 = 38$ rows.
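
We can verify this count without building the merge at all: in a
self-merge, each `'FROM'` value contributes its count squared (the
counts below are read off the bar chart):

```
import pandas as pd

# Flights per airport, as shown in the bar chart.
counts = pd.Series({"DTW": 2, "ORD": 5, "MKE": 3})

# Each airport with n flights contributes n * n rows to the self-merge.
print((counts ** 2).sum())  # 4 + 25 + 9 = 38
```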

Kyle and Yutong are trying to decide where they’ll study on campus and start flipping a Michigan-themed coin, with a picture of the Michigan Union on the heads side and a picture of the Shapiro Undergraduate Library (aka the UgLi) on the tails side.

Kyle flips the coin 21 times and sees 13 heads and 8 tails. He stores
this information in a DataFrame named `kyle` that has 21 rows and 2
columns, such that:

- The `"flips"` column contains `"Heads"` 13 times and `"Tails"` 8
  times.
- The `"Markley"` column contains `"Kyle"` 21 times.

Then, Yutong flips the coin 11 times and sees 4 heads and 7 tails. She
stores this information in a DataFrame named `yutong` that has 11 rows
and 2 columns, such that:

- The `"flips"` column contains `"Heads"` 4 times and `"Tails"` 7
  times.
- The `"MoJo"` column contains `"Yutong"` 11 times.

How many rows are in the following DataFrame? Give your answer as an integer.

`="flips") kyle.merge(yutong, on`

*Hint: The answer is less than 200.*

**Answer**: 108

Since we used the argument `on="flips"`, rows from `kyle` and `yutong`
will be combined whenever they have matching values in their `"flips"`
columns.

For the `kyle` DataFrame:

- There are 13 rows with `"Heads"` in the `"flips"` column.
- There are 8 rows with `"Tails"` in the `"flips"` column.

For the `yutong` DataFrame:

- There are 4 rows with `"Heads"` in the `"flips"` column.
- There are 7 rows with `"Tails"` in the `"flips"` column.

The merged DataFrame will also only have the values `"Heads"` and
`"Tails"` in its `"flips"` column.

- The 13 `"Heads"` rows from `kyle` will each pair with the 4
  `"Heads"` rows from `yutong`. This results in $13 \cdot 4 = 52$ rows
  with `"Heads"`.
- The 8 `"Tails"` rows from `kyle` will each pair with the 7
  `"Tails"` rows from `yutong`. This results in $8 \cdot 7 = 56$ rows
  with `"Tails"`.

Then, the total number of rows in the merged DataFrame is
$52 + 56 = 108$.
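
This is the same “sum of products of matching counts” idea, which we
can check directly:

```
import pandas as pd

kyle_counts = pd.Series({"Heads": 13, "Tails": 8})
yutong_counts = pd.Series({"Heads": 4, "Tails": 7})

# Each "flips" value contributes (count in kyle) * (count in yutong)
# rows to the merge; index-aligned multiplication computes this per value.
print((kyle_counts * yutong_counts).sum())  # 13 * 4 + 8 * 7 = 108
```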

Let $A$ be your answer to the previous part. Now, suppose that:

- `kyle` contains an additional row, whose `"flips"` value is
  `"Total"` and whose `"Markley"` value is 21.
- `yutong` contains an additional row, whose `"flips"` value is
  `"Total"` and whose `"MoJo"` value is 11.

Suppose we again merge `kyle` and `yutong` on the `"flips"` column. In
terms of $A$, how many rows are in the new merged DataFrame?

- $A$
- $A+1$
- $A+2$
- $A+4$
- $A+231$

**Answer**: $A+1$

The additional row in each DataFrame has a unique `"flips"` value of
`"Total"`. When we merge on the `"flips"` column, this unique value
will only create a single new row in the merged DataFrame, as it pairs
the `"Total"` from `kyle` with the `"Total"` from `yutong`. The rest
of the rows are the same as in the previous merge, and as such, they
will contribute the same number of rows, $A$, to the merged DataFrame.
Thus, the total number of rows in the new merged DataFrame will be $A$
(from the original matching rows) plus 1 (from the new `"Total"`
rows), which sums up to $A+1$.

Define `small_students` to be the DataFrame with 8 rows and 2 columns
shown directly below, and define `districts` to be the DataFrame with
3 rows and 2 columns shown below `small_students`.

Consider the DataFrame `merged`, defined below.

```
merged = small_students.merge(districts,
                              left_on="High School",
                              right_on="school",
                              how="outer")
```

How many total `NaN` values does `merged` contain? Give your answer as
an integer.

**Answer:** 4

`merged` is shown below.

The DataFrame `dogs` contains one row for every registered pet dog in
Zurich, Switzerland in 2017.

The first few rows of `dogs` are shown below, but `dogs` has many more
rows than are shown.

`"owner_id" (int)`

: A unique ID for each owner. Note that, for example, there are two rows in the preview for`4215`

, meaning that owner has at least 2 dogs.**Assume that if an**`"owner_id"`

appears in`dogs`

multiple times, the corresponding`"owner_age"`

,`"owner_sex"`

, and`"district"`

are always the same.`"owner_age" (str)`

: The age group of the owner; either`"11-20"`

,`"21-30"`

, …, or`"91-100"`

(9 possibilities in total).`"owner_sex" (str)`

: The birth sex of the owner; either`"m"`

(male) or`"f"`

(female).`"district" (int)`

: The city district the owner lives in; a positive integer between`1`

and`12`

(inclusive).`"primary_breed" (str)`

: The primary breed of the dog.`"secondary_breed" (str)`

: The secondary breed of the dog. If this column is not null, the dog is a “mixed breed” dog; otherwise, the dog is a “purebred” dog.`"dog_sex" (str)`

: The birth sex of the dog; either`"m"`

(male) or`"f"`

(female).`"birth_year" (int)`

: The birth year of the dog.

In this question, assume that there are more than 12 districts in
`dogs`.

Suppose we merge the `dogs` DataFrame with itself as follows.

```
# on="x" is the same as specifying both left_on="x" and right_on="x".
double = dogs.merge(dogs, on="district")

# sort_index sorts a Series in increasing order of its index.
square = double["district"].value_counts().value_counts().sort_index()
```

The first few rows of `square` are shown below.

```
1 5500
4 215
9 40
```

In `dogs`, there are 12 rows with a `"district"` of `8`. How many rows
of `double` have a `"district"` of `8`? Give your answer as a positive
integer.

**Answer**: 144

When we merge `dogs` with `dogs` on `"district"`, each `8` in the
first `dogs` DataFrame will be combined with each `8` in the second
`dogs` DataFrame. Since there are 12 in the first and 12 in the
second, there are $12 \cdot 12 = 144$ combinations.
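
A quick hypothetical check of this squaring behavior (only the
`"district"` values matter; the `"birth_year"` values are made up so
the 12 rows are distinguishable):

```
import pandas as pd

dogs = pd.DataFrame({
    "district": [8] * 12,
    "birth_year": range(2005, 2017),  # 12 arbitrary values
})

# Self-merge on "district": each of the 12 rows pairs with all 12 rows.
double = dogs.merge(dogs, on="district")
print(double.shape[0])  # 144
```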

What does the following expression evaluate to? Give your answer as a positive integer.

`"district").filter(lambda df: df.shape[0] == 3).shape[0] dogs.groupby(`

*Hint: Unlike in 5.1, your answer to 5.2 depends on the values in
`square`.*

**Answer**: 120

`square` is telling us that:

- There are 5500 districts that appeared just 1x in `dogs`.
- There are 215 districts that appeared 2x in `dogs` (2x, not 4x,
  because of the logic explained in the 5a rubric item).
- There are 40 districts that appeared 3x in `dogs`.

The expression given in this question is keeping all of the rows
corresponding to districts that appear 3 times. There are 40 districts
that appear 3 times. So, the total number of rows in this DataFrame is
$40 \cdot 3 = 120$.
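
The behavior of `filter` on a tiny hypothetical example: it keeps or
drops entire groups, so every row of a surviving group appears in the
output.

```
import pandas as pd

dogs = pd.DataFrame({"district": [1, 1, 1, 2, 2, 3]})

# Keep only groups with exactly 3 rows; all of their rows survive.
kept = dogs.groupby("district").filter(lambda df: df.shape[0] == 3)
print(kept.shape[0])  # 3 (district 1 is the only group of size 3)
```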

The DataFrame `tv_excl` contains all of the information we have for TV
shows that are available to stream *exclusively* on a single streaming
service. The `"Service"` column contains the name of the one streaming
service that the TV show is available for streaming on.

The first few rows of `tv_excl` are shown below (though, of course,
`tv_excl` has many more rows than are pictured here). Note that *Being
Erica* is not in `tv_excl`, since it is available to stream on
multiple services.

The DataFrame `counts`, shown in full below, contains the number of TV
shows for every combination of `"Age"` and `"Service"`.

Given the above information, what does the following expression
evaluate to?

`tv_excl.groupby(["Age", "Service"]).sum().shape[0]`

- 4
- 5
- 12
- 16
- 18
- 20
- 25

**Answer**: 18

Note that the DataFrame `counts` is a pivot table, created using
`tv_excl.pivot_table(index="Age", columns="Service", aggfunc="size")`.
As we saw in lecture, pivot tables contain the same information as the
result of grouping on two columns.

The DataFrame `tv_excl.groupby(["Age", "Service"]).sum()` will have
one row for every unique combination of `"Age"` and `"Service"` in
`tv_excl`. (The same is true even if we used a different aggregation
method, like `.mean()` or `.max()`.) As `counts` shows us, `tv_excl`
contains every possible combination of a single element in {`"13+"`,
`"16+"`, `"18+"`, `"7+"`, `"all"`} with a single element in
{`"Disney+"`, `"Hulu"`, `"Netflix"`, `"Prime Video"`}, except for
(`"13+"`, `"Disney+"`) and (`"18+"`, `"Disney+"`), which were not
present in `tv_excl`; if they were, they would have non-null values in
`counts`.

As such, `tv_excl.groupby(["Age", "Service"]).sum()` will have
$20 - 2 = 18$ rows, and
`tv_excl.groupby(["Age", "Service"]).sum().shape[0]` evaluates to 18.
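
That relationship, expressed in code (a sketch assuming `tv_excl` as
described above): the number of rows in the grouped DataFrame equals
the number of non-null cells in the pivot table.

```
# One row per observed ("Age", "Service") combination.
n_groups = tv_excl.groupby(["Age", "Service"]).sum().shape[0]  # 18

# The pivot table fills in the full 5-by-4 grid, with NaN for the two
# unobserved combinations, so counting its non-null cells agrees.
grid = tv_excl.pivot_table(index="Age", columns="Service", aggfunc="size")
print(grid.notna().sum().sum())  # 20 - 2 = 18
```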

The DataFrame `flights` contains information about recent flights,
with each row representing a specific flight and the following
columns:

`"flight num" (str)`
: The unique code of the flight, consisting of a two-character airline
designator followed by 1 to 4 digits (e.g., `"UA1989"`).

`"airline" (str)`
: The airline name (e.g., `"United"`).

`"departure" (str)`
: The code for the airport from which the flight departs (e.g.,
`"SAN"`).

`"arrival" (str)`
: The code for the airport at which the flight arrives (e.g.,
`"LAX"`).

Suppose we have another DataFrame `more_flights`, which contains the
same columns as `flights` but different rows. Define `merged` as
follows.

`merged = flights.merge(more_flights, on="airline")`

Suppose that in `merged`, there are 108 flights where the airline is
`"United"`, and in `more_flights`, there are 12 flights where the
airline is `"United"`. If `flights` has 15 rows in total, how many of
these rows are **not** for `"United"` flights? Give your answer as an
integer.

**Answer:** 6

We are merging the DataFrames `flights` and `more_flights` according
to the airline each flight belongs to. All the `"United"` flights in
`more_flights` will be merged with all the `"United"` flights in
`flights`, which we know gives us 108 total flights. We also know that
there are 12 `"United"` flights in `more_flights`. To find the number
of `"United"` flights in `flights`, we simply need to divide the total
number of `"United"` flights in `merged` by the number of `"United"`
flights in `more_flights`, which gives $108 / 12 = 9$. If `flights`
has a total of 15 rows, then the total number of non-United rows is
equal to $15 - 9 = 6$.
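
The arithmetic, spelled out (all numbers come straight from the
problem statement):

```
# merged United rows = (United rows in flights) * (United rows in more_flights)
united_in_flights = 108 // 12   # 9
print(15 - united_in_flights)   # 6 non-United rows in flights
```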

For your convenience, the first few rows of `tv` are shown again
below.

For the purposes of this question only, suppose we also have access to
another similar DataFrame, `movies`, which contains information about
a variety of movies. The information we have for each movie in
`movies` is the same as the information we have for each TV show in
`tv`, except for IMDb ratings, which are missing from `movies`.

The first few rows of `movies` are shown below (though `movies` has
many more rows than are pictured here).

The function `total_null`, defined below, takes in a DataFrame and
returns the total number of null values in the DataFrame.

`total_null = lambda df: df.isna().sum().sum()`

Consider the function `delta`, defined below.

```
def delta(a, b):
    tv_a = tv.head(a)
    movies_b = movies.head(b)
    together = pd.concat([tv_a, movies_b])
    return total_null(together) - total_null(tv_a) - total_null(movies_b)
```

Which of the following functions is equivalent to `delta`?

- `lambda a, b: a`
- `lambda a, b: b`
- `lambda a, b: 9 * a`
- `lambda a, b: 8 * b`
- `lambda a, b: min(9 * a, 8 * b)`

**Answer**: `lambda a, b: b`

Let’s understand what each function does.

- `total_null` just counts all the null values in a DataFrame.
- `delta` concatenates the first `a` rows of `tv` with the first `b`
  rows of `movies` **vertically**, that is, on top of one another
  (over axis 0). It then returns the difference between the total
  number of null values in the concatenated DataFrame and the total
  number of null values in the first `a` rows of `tv` and first `b`
  rows of `movies` – in other words, it returns **the number of null
  values that were added as a result of the concatenation**.

The key here is recognizing that `tv` and `movies` have all of the
same column names, **except** that `movies` doesn’t have an `"IMDb"`
column. As a result, when we concatenate, the `"IMDb"` column will
contain null values for every row that was originally from `movies`.
Since `b` rows from `movies` are in the concatenated DataFrame, `b`
new null values are introduced as a result of the concatenation, and
thus `lambda a, b: b` does the same thing as `delta`.
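
A minimal sketch of this mechanism, using made-up frames with the same
column mismatch:

```
import pandas as pd

total_null = lambda df: df.isna().sum().sum()

# `tv_a` has an "IMDb" column; `movies_b` does not (as in the problem).
tv_a = pd.DataFrame({"Title": ["A", "B"], "IMDb": [8.0, 7.0]})
movies_b = pd.DataFrame({"Title": ["C", "D", "E"]})

# Stacking vertically fills "IMDb" with NaN for every movies row, so
# exactly b = 3 new nulls appear.
together = pd.concat([tv_a, movies_b])
print(total_null(together) - total_null(tv_a) - total_null(movies_b))  # 3
```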

Fill in the blank to complete the implementation of the function
`size_of_merge`, which takes a string `col`, corresponding to the name
of a **single** column that is shared between `tv` and `movies`, and
returns the **number of rows in** the DataFrame
`tv.merge(movies, on=col)`.

For instance, `size_of_merge("Year")` should return the number of rows
in `tv.merge(movies, on="Year")`.

The purpose of this question is to have you think conceptually about
how merges work. As such, **solutions containing `merge` or `concat`
will receive 0 points.**

**What goes in the blank below?**

```
def size_of_merge(col):
    return (____).sum()
```

**Hint**: Consider the behavior below.

```
>>> s1 = pd.Series({'a': 2, 'b': 3})
>>> s2 = pd.Series({'c': 4, 'a': -1, 'b': 4})
>>> s1 * s2
a    -2.0
b    12.0
c     NaN
dtype: float64
```

**Answer**:
`tv[col].value_counts() * movies[col].value_counts()`

`tv.merge(movies, on=col)` contains one row for every “match” between
`tv[col]` and `movies[col]`. Suppose, for example, that `col="Year"`.
If `tv["Year"]` contains 30 values equal to 2019, and `movies["Year"]`
contains 5 values equal to 2019, `tv.merge(movies, on="Year")` will
contain $30 \cdot 5 = 150$ rows in which the `"Year"` value is equal
to 2019 – one for every combination of a 2019 row in `tv` and a 2019
row in `movies`.

`tv["Year"].value_counts()`

and
`movies["Year"].value_counts()`

contain, respectively, the
frequencies of the unique values in `tv["Year"]`

and
`movies["Year"]`

. Using the 2019 example from above,
`tv["Year"].value_counts() * movies["Year"].value_counts()`

will contain a row whose index is 2019 and whose value is 150, with
similar other entries for the other years in the two Series. (The hint
is meant to demonstrate the fact that no matter how the two Series are
sorted, the product is done element-wise by matching up indexes.) Then,
`(tv["Year"].value_counts() * movies["Year"].value_counts()).sum()`

will sum these products across all years, ignoring null values.

As such, the answer we were looking for is
`tv[col].value_counts() * movies[col].value_counts()`

(remember, `"Year"`

was just an example for this
explanation).
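
Putting it all together, the completed function, checked against an
actual merge on hypothetical toy frames (the real `tv` and `movies`
would be used the same way):

```
import pandas as pd

def size_of_merge(col):
    # One product per shared value: (count in tv) * (count in movies).
    # Values present in only one Series produce NaN, which .sum() skips.
    return (tv[col].value_counts() * movies[col].value_counts()).sum()

# Hypothetical toy data for a sanity check.
tv = pd.DataFrame({"Year": [2019, 2019, 2020]})
movies = pd.DataFrame({"Year": [2019, 2021]})

print(size_of_merge("Year"))                 # 2.0
print(tv.merge(movies, on="Year").shape[0])  # 2
```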