
The problems in this worksheet are taken from past exams in similar
classes. Work on them **on paper**, since the exams you
take in this course will also be on paper.

We encourage you to complete this worksheet in a live discussion section. Solutions will be made available after all discussion sections have concluded. You don't need to submit your answers anywhere. **Note: We do not plan to cover all problems here in the live discussion section**; the problems we don't cover can be used for extra practice.

The DataFrame `games` contains information about a sample of popular games, including board games, dice games, and card games. The data comes from Board Game Geek, a popular website and vibrant online community for game enthusiasts.

The columns of `games` are as follows.

- `"Name"` (`str`): The name of the game.
- `"Mechanics"` (`str`): A sequence of descriptors for how the game is played. A game can have several descriptors, each of which is separated by a comma.
- `"Domains"` (`str`): A sequence of domains to which the game belongs. A game can belong to multiple domains.
- `"Play Time"` (`int`): The average play time of the game, in minutes, as suggested by the game's creators.
- `"Complexity"` (`float`): The average complexity of the game, on a scale of 1 to 5 points, as reported by Board Game Geek community members.
- `"Rating"` (`str`): The average rating of the game, on a scale of 1 to 10 points, as rated by Board Game Geek community members. Note that while this data should be numerical, it is actually stored as a string, because some entries use a comma in place of a decimal point. For example, 8,79 actually represents the number 8.79.
- `"BGG Rank"` (`int`): The rank of the game in Board Game Geek's database. The formula for how this rank is calculated is not publicly known, but it likely includes many factors, such as `"Rating"`, number of registered owners of the game, and number of reviews.

The first few rows of `games` are shown below (though `games` has many more rows than pictured here).

Assume that we have already run `import pandas as pd` and `import numpy as np`.

Many of the games in the `games` DataFrame belong to more than one domain. We want to identify the number of games that belong to only one domain. Select all of the options below that would correctly calculate the number of games that belong to only one domain.

- `(games["Domains"].str.split(" ").apply(len) == 2).sum()`
- `(games["Domains"].apply(len) == 1).sum()`
- `games[games["Domains"].str.split(",").str.len() == 1].value_counts("Name").sum()`
- `games[games["Domains"].str.split(",").str.len() == 1].shape[0]`

**Answer:** Options 3 and 4

Let's take a closer look at why **Option 3** and **Option 4** are correct.

**Option 3**: Option 3 first queries the `games` DataFrame to only keep games with one value in `"Domains"`. `games["Domains"].str.split(",").str.len() == 1` gets the `"Domains"` column and splits each entry by the comma.

For example, let's say the domain was `"Strategy Games", "Thematic Games"`; then after doing `.str.split(",")` we would have the list `["Strategy Games", "Thematic Games"]`. If the domain was just `"Strategy Games"`, then after splitting we'd have `["Strategy Games"]`.

Any `"Domain"` with at least one comma will turn into a list whose length is at least 2, so the comparison with `.str.len() == 1` will evaluate to `False`. So ultimately, `games[games["Domains"].str.split(",").str.len() == 1]` is a DataFrame with just the rows for games that belong to one `"Domain"`.

The next part, `.value_counts("Name").sum()`, makes a Series with an index of the unique `"Name"`s whose values are the number of times each appears. Finally, summing this resulting Series gives us the number of games that belong to only one `"Domain"`.

**Option 4**: Option 4 starts off exactly like Option 3, but instead of using `value_counts` it gets the number of rows using `.shape[0]`, which gives us the number of games that belong to only one domain.

**Option 1:** Let's step through why Option 1 is incorrect. `(games["Domains"].str.split(" ").apply(len) == 2).sum()` gives you a `Series` of the `"Domains"` column, splits each domain by a space, gets the length of each resulting list, and checks whether that length is equal to 2, which would mean there are two elements in the list. Finally, it sums the resulting booleans. Remember that `True` evaluates to 1 and `False` evaluates to 0, so we are counting the entries that were split into exactly two pieces by the space. This does not tell us the number of games that belong to only one domain.

**Option 2:** Let's step through why Option 2 is also incorrect. `(games["Domains"].apply(len) == 1).sum()` checks whether each element in the `"Domains"` column has length 1. Remember, applying `len()` to a string gives the number of characters in that string, so this is essentially counting the number of `"Domains"` entries that are one character long. Thus, it does not tell us the number of games that belong to only one domain.
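
To convince yourself, here's a minimal sketch on hypothetical data (the tiny `games` DataFrame below is made up for illustration):

```
import pandas as pd

games = pd.DataFrame({
    "Name": ["Catan", "Gloomhaven", "Uno"],
    "Domains": ["Strategy Games",
                "Strategy Games, Thematic Games",
                "Family Games"],
})

# Keep only the rows whose "Domains" entry has no comma,
# i.e. games belonging to exactly one domain.
one_domain = games[games["Domains"].str.split(",").str.len() == 1]

# Option 3: one count per unique name; summing the counts
# counts every remaining row.
print(one_domain.value_counts("Name").sum())  # 2

# Option 4: .shape[0] counts the remaining rows directly.
print(one_domain.shape[0])                    # 2
```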

In September 2020, Governor Gavin Newsom announced that by 2035, all new vehicles sold in California must be zero-emissions vehicles. Electric vehicles (EVs) are among the most popular zero-emissions vehicles (though other examples include plug-in hybrids and hydrogen fuel cell vehicles).

The DataFrame `evs` consists of **32** rows, each of which contains information about a different EV model.

- `"Brand"` (str): The vehicle's manufacturer.
- `"Model"` (str): The vehicle's model name.
- `"BodyStyle"` (str): The vehicle's body style.
- `"Seats"` (int): The vehicle's number of seats.
- `"TopSpeed"` (int): The vehicle's top speed, in kilometers per hour.
- `"Range"` (int): The vehicle's range, or distance it can travel on a single charge, in kilometers.

The first few rows of `evs` are shown below (though remember, `evs` has 32 rows total).

**Throughout this problem, we will refer to `evs` repeatedly.**

Assume that:

- The only four values in the `"Brand"` column are `"Tesla"`, `"BMW"`, `"Audi"`, and `"Nissan"`.
- We have already run `import pandas as pd` and `import numpy as np`.

**Tip:** Open this page in another tab, so that it is easy to refer to this data description as you work through the problem.

Which of the following types of visualizations could be used to visualize the distribution of `"Range"`? Select all that apply.

- Bar chart
- Histogram
- Scatter plot
- Line plot
- Box plot
- None of the above

**Answer:** Histogram or box plot

`"Range"` is a numerical (i.e. quantitative) variable, and we use histograms and box plots to visualize the distribution of numerical features.

- A bar chart couldn't work here. Bar charts can show the distribution of a categorical feature, but `"Range"` is not categorical.
- A scatter plot visualizes the relationship between two numerical features, but we are only dealing with one numerical feature here (`"Range"`).
- Similarly, a line plot visualizes the relationship between two numerical features, but we only have one here.
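
As a quick illustration, a minimal sketch of the two visualizations that do work (assuming the `evs` DataFrame described above and pandas' default matplotlib backend):

```
# A histogram and a box plot of a single numerical column.
evs["Range"].plot(kind="hist", density=True, bins=10)
evs.plot(kind="box", y="Range")
```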

Teslas, on average, tend to have higher `"Range"`s than BMWs. In which of the following visualizations would we be able to see this pattern? Select all that apply.

- A bar chart that shows the distribution of `"Brand"`
- A bar chart that shows the average `"Range"` for each `"Brand"`
- An overlaid histogram showing the distribution of `"Range"` for each `"Brand"`
- A scatter plot with `"TopSpeed"` on the x-axis and `"Range"` on the y-axis

**Answer:**

- A bar chart that shows the average `"Range"` for each `"Brand"`
- An overlaid histogram showing the distribution of `"Range"` for each `"Brand"`

Let’s look at each option more closely.

- **Option 1:** A bar chart showing the distribution of `"Brand"` would only show us how many cars of each `"Brand"` there are. It would not tell us anything about the average `"Range"` of each `"Brand"`.
- **Option 2:** A bar chart showing the average range for each `"Brand"` would help us directly visualize how the average range of each `"Brand"` compares to one another.
- **Option 3:** An overlaid histogram, although perhaps a bit messy, would also give us a general idea of the average range of each `"Brand"` by giving us the distribution of the `"Range"` of each brand. In the scenario mentioned in the question, we'd expect to see that the Tesla distribution is further right than the BMW distribution.
- **Option 4:** A scatter plot of `"TopSpeed"` against `"Range"` would only illustrate the relationship between `"TopSpeed"` and `"Range"`, but would contain no information about the `"Brand"` of each EV.
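
For concreteness, a minimal sketch of Option 2 (assuming the `evs` DataFrame described above):

```
# One bar per "Brand", whose height is that brand's mean "Range".
evs.groupby("Brand")["Range"].mean().plot(kind="bar")
```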

Kyle thinks `"Seats"` is a categorical feature because it
can be used to categorize EVs by size. For instance, EVs with 4 seats
are small, EVs with 5 seats are medium, and EVs with 6 or more seats are
large.

Is Kyle correct?

Yes

No

Justify your answer in **one sentence**. Your answer
must fit in the box below.

**Answer:** No

`"Seats"` is a numerical feature, since it makes sense to do arithmetic with the values. For instance, we can find the average number of `"Seats"` that a group of cars has. Kyle's argument could apply to any numerical feature; just because we can place numerical features into "bins" doesn't make them categorical.

The seat-back TV on one of Biff the Wolverine's more recent flights was very dirty and was full of fingerprints. The fingerprints made an interesting pattern. We've stored the x and y positions of each fingerprint in the DataFrame `fingerprints`, and created the following scatterplot using

`fingerprints.plot(kind='scatter', x='x', y='y')`

True or False: The histograms that result from the following two lines of code will look very similar.

```
# When we say bins=np.arange(0, 8, 2),
# we mean that the bins are [0, 2), [2, 4), [4, 6].
fingerprints.plot(kind='hist',
                  y='x',
                  density=True,
                  bins=np.arange(0, 8, 2))
```

and

```
fingerprints.plot(kind='hist',
                  y='y',
                  density=True,
                  bins=np.arange(0, 8, 2))
```

True

False

**Answer:** True

The only difference between the two code snippets is the data values used. The first creates a histogram of the `x`-values in `fingerprints`, and the second creates a histogram of the `y`-values in `fingerprints`.

Both histograms use the same bins: `bins=np.arange(0, 8, 2)`. This means the bin endpoints are `[0, 2, 4, 6]`, so there are three distinct bins: [0, 2), [2, 4), and [4, 6]. Remember, the right-most bin of a histogram includes both endpoints, whereas the others include only the left endpoint.
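
A quick sketch verifying the bin convention with `np.histogram`, which uses the same rule:

```
import numpy as np

# np.arange(0, 8, 2) stops *before* 8, so the edges are [0, 2, 4, 6],
# giving three bins: [0, 2), [2, 4), and [4, 6].
print(np.arange(0, 8, 2))  # [0 2 4 6]

# Only the last bin includes its right endpoint: 6 lands in [4, 6].
counts, edges = np.histogram([1, 3, 6], bins=np.arange(0, 8, 2))
print(counts)  # [1 1 1]
```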

Let's look at the `x`-values first. If we divide the scatterplot into nine equally-sized regions, as shown below, note that eight of the nine regions have a very similar number of data points.

Aside from the middle region, about \frac{1}{8} of the data falls in each region. That means \frac{3}{8} of the data has an `x`-value in the first bin [0, 2), \frac{2}{8} of the data has an `x`-value in the middle bin [2, 4), and \frac{3}{8} of the data has an `x`-value in the rightmost bin [4, 6]. This distribution of `x`-values into bins determines what the histogram will look like.

Now, if we look at the `y`-values, we'll find that \frac{3}{8} of the data has a `y`-value in the first bin [0, 2), \frac{2}{8} of the data has a `y`-value in the middle bin [2, 4), and \frac{3}{8} of the data has a `y`-value in the last bin [4, 6]. That's the same distribution of data into bins as the `x`-values had, so the histogram of `y`-values will look just like the histogram of `x`-values.

Alternatively, an easy way to see this is to use the fact that the scatterplot is symmetric over the line y=x, the line through the origin that makes a 45 degree angle with the x-axis. In other words, interchanging the `x` and `y` values doesn't change the scatterplot noticeably, so the `x` and `y` values have very similar distributions, and their histograms will be very similar as a result.

Below, we've drawn a histogram using the line of code

```
fingerprints.plot(kind='hist',
                  y='x',
                  density=True,
                  bins=np.arange(0, 8, 2))
```

However, our Jupyter Notebook was corrupted, and so the resulting histogram doesn’t quite look right. While the height of the first bar is correct, the histogram doesn’t contain the second or third bars, and the y-axis is replaced with letters.

Which of the four options on the y-axis is closest to where the height of the middle bar should be?

A

B

C

D

Which of the four options on the y-axis is closest to where the height of the rightmost bar should be?

A

B

C

D

**Answer:** B, then C

We've already determined that the first bin should contain \frac{3}{8} of the values, the middle bin should contain \frac{2}{8} of the values, and the rightmost bin should contain \frac{3}{8} of the values. The middle bar of the histogram should therefore be two-thirds as tall as the first bar, and the rightmost bar should be equally as tall as the first bar. The only reasonable height for the middle bar is B, as it's closest to two-thirds of the height of the first bar. Similarly, the rightmost bar must be at height C, as it's the only one close to the height of the first bar.

Pranavi proposes the following imputation scheme.

```
def impute(s):
    return s.fillna(np.random.choice(s[s.notna()]))
```

**True or False:** `impute` performs probabilistic imputation, using the same definition of probabilistic imputation we learned about in class.
True

False

**Answer:** False

In `impute`, `np.random.choice` will return a single non-null value from `s`, and `.fillna()` will fill every null value with this single value. Meanwhile, probabilistic imputation draws a different value from a specified distribution to fill each missing value, making it such that there won't be a single "spike" in the imputed distribution at a single chosen value.
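
Here's a minimal sketch contrasting the two schemes (the `probabilistic_impute` helper is hypothetical, written just for this comparison):

```
import numpy as np
import pandas as pd

def impute(s):
    # Draws ONE observed value and reuses it for every missing entry.
    return s.fillna(np.random.choice(s[s.notna()]))

def probabilistic_impute(s):
    # Draws a FRESH observed value for each missing entry.
    s = s.copy()
    s[s.isna()] = np.random.choice(s[s.notna()], size=s.isna().sum())
    return s

vals = pd.Series([2.0, np.nan, 7.0, np.nan, 3.0])
print(impute(vals))                # both NaNs get the same fill value
print(probabilistic_impute(vals))  # each NaN gets its own draw
```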

Consider the following expressions and values.

```
>>> vals.isna().mean()
0.2
>>> vals.describe().loc[["min", "mean", "max"]]
min 2.0
mean 4.0
max 7.0
dtype: float64
```

Given the above, what is the **maximum** possible value of `impute(vals).mean()`? Give your answer as a number rounded to one decimal place.

**Answer:** 4.6

The maximum possible value of `impute(vals).mean()` would occur when every single missing value in `vals` is filled in with the highest possible non-null value in `vals`. (As discussed in the previous solution, `impute` selects only one value from `s` to fill into every missing space.)

If this occurs, then the mean of the imputed Series will be the weighted mean of the available data and the filled data; given the numbers in the question, this is 0.8 \cdot 4 + 0.2 \cdot 7, or 4.6.
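
A quick arithmetic check:

```
# 80% of the data keeps its mean of 4; the missing 20% is filled
# entirely with the max, 7.
print(0.8 * 4 + 0.2 * 7)  # 4.6
```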

Which of the following statements will always evaluate to `True`?

- `vals.std() < impute(vals).std()`
- `vals.std() <= impute(vals).std()`
- `vals.std() == impute(vals).std()`
- `vals.std() >= impute(vals).std()`
- `vals.std() > impute(vals).std()`
- None of the above

**Answer:** None of the above

Since the value which `impute` chooses to impute with is random, the effect that it has on the standard deviation of `vals` is unknown. If the missing values are filled with a value close to the mean, this could reduce the standard deviation; if they are filled with a value far from the mean, this could increase the standard deviation. (Of course, the imputation will also shift the mean, so without knowing details of the Series, it's impossible to come up with thresholds.) In any case, since the value for imputation is chosen at random, none of these statements will *always* be true, and so the correct answer is "none of the above."

Suppose we have a DataFrame, `dogs`, in which each row corresponds to a different dog owned by someone in Ann Arbor. There are several columns in the DataFrame, including `"birth_year"` (the dog's year of birth) and `"Breed"` (the dog's breed).

Using the `"birth_year"` column in `dogs`, Tahseen computes the age of each dog and stores it in a new `"age"` column of `dogs`. He also categorizes dogs into three "breed classes" based on their breed – `"Retriever"`, `"Terrier"`, or `"Other"` – and stores these in a new `"class"` column of `dogs`.

The following bar chart shows the distribution of breed classes by missingness of age:

The following bar chart shows the mean observed age for each breed class:

Let O be the mean of the observed ages in `dogs`, that is, the mean of the ages that aren't missing. Let T be the true mean of the ages in `dogs`, that is, the mean we'd compute if none of the ages were missing.

What is the relationship between O and T?

- O < T
- O \approx T
- O = T
- O > T

**Answer**: O < T

In the ages we get to observe (that is, when age is not missing), we have way fewer terrier values than in the ages we don’t get to observe. Terriers are older on average than the other breed classes. This means we’re missing values that are larger, so when we take the average of the values we have access to, it’ll be lower than the true mean.

Consider the following strategy for imputing missing ages.

```
def fill_missing(dogs):
    return dogs.assign(
        age=dogs.groupby("class")["age"]
                .transform(lambda x: x.fillna(x.value_counts().idxmax()))
    )
```

In one sentence, describe this imputation strategy.

**Answer**: This is filling in missing ages in each
breed class with the most common observed age in that breed class.
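
A minimal sketch on made-up data, to see the strategy in action:

```
import pandas as pd

dogs = pd.DataFrame({
    "class": ["Terrier", "Terrier", "Terrier", "Other", "Other"],
    "age":   [12.0,      12.0,      None,      5.0,     None],
})

# Within each class, fill NaN ages with that class's most common
# observed age (value_counts().idxmax() is the mode).
filled = dogs.assign(
    age=dogs.groupby("class")["age"]
            .transform(lambda x: x.fillna(x.value_counts().idxmax()))
)
print(filled)  # Terrier NaN -> 12.0, Other NaN -> 5.0
```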

`small_dogs` is a subset of `dogs`, shown in its entirety below.

If we call `fill_missing` on `small_dogs`, what values would the four null values in the `"age"` column of `small_dogs` be replaced with?

- row 2: ___
- row 3: ___
- row 4: ___
- row 7: ___

**Answer**:

- row 2: 5. The most common observed age for the `"Other"` breed class is 5.
- row 3: 5. The most common observed age for the `"Other"` breed class is 5.
- row 4: 12. The most common observed age for the `"Terrier"` breed class is 12.
- row 7: 7. The most common observed age for the `"Retriever"` breed class is 7.

The EECS 398 staff are looking into hotels — some in San Diego, for their family to stay at for graduation (and to eat Mexican food), and some elsewhere, for summer trips.

Each row of `hotels` contains information about a different hotel in San Diego. Specifically, for each hotel, we have:

- `"Hotel Name"` (str): The name of the hotel. **Assume hotel names are unique.**
- `"Location"` (str): The hotel's neighborhood in San Diego.
- `"Chain"` (str): The chain the hotel is a part of; either `"Hilton"`, `"Marriott"`, `"Hyatt"`, or `"Other"`. A hotel chain is a group of hotels owned or operated by a shared company.
- `"Number of Rooms"` (int): The number of rooms the hotel has.

The first few rows of `hotels` are shown below, but `hotels` has many more rows than are shown.

Now, consider the variable `summed`, defined below.

`summed = hotels.groupby("Chain")["Number of Rooms"].sum().idxmax()`

What is `type(summed)`?

- `int`
- `str`
- `Series`
- `DataFrame`
- `DataFrameGroupBy`

**Answer:** `str`

When we do a `groupby` on the `Chain` column in `hotels`, the values in the `Chain` column become the index of the DataFrame or Series we get as output — in this case, the Series `hotels.groupby("Chain")["Number of Rooms"].sum()`.

Since the values of `Chain` are strings, and since `.idxmax()` will return a value from the index of the aforementioned Series, `summed` is a string.

In one sentence, explain what the value of `summed` means. Phrase your explanation as if you had to give it to someone who is not a data science major; that is, don't say something like "it is the result of grouping `hotels` by `"Chain"`, selecting the `"Number of Rooms"` column, …", but instead, give the value context.

**Answer:** `summed` is the name of the hotel chain with the most total rooms.

The result of the `.groupby()` and `.sum()` is a Series indexed by the unique `"Chain"`s, whose values are the total number of rooms in hotels owned by each chain. The `idxmax()` function gets the index corresponding to the largest value in the Series, which will be the name of the hotel chain with the most total rooms.
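
A minimal sketch on made-up data, showing why the result is a string:

```
import pandas as pd

hotels = pd.DataFrame({
    "Chain":           ["Hilton", "Hilton", "Marriott"],
    "Number of Rooms": [200,      150,      300],
})

# The group labels ("Hilton", "Marriott") become the index...
rooms_per_chain = hotels.groupby("Chain")["Number of Rooms"].sum()

# ...so idxmax() returns one of those labels, a str.
print(rooms_per_chain.idxmax())  # 'Hilton' (350 total rooms)
```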

Consider the variable `curious`, defined below.

`curious = frame["Chain"].value_counts().idxmax()`

Fill in the blank: `curious` is guaranteed to be equal to `summed` only if `frame` has one row for every ____ in San Diego.

- hotel
- hotel chain
- hotel room
- neighborhood

**Answer:** hotel room

`curious` gets the most common value of `Chain` in the DataFrame `frame`. We already know that `summed` is the hotel chain with the most rooms in San Diego, so `curious` only equals `summed` if the most common `Chain` in `frame` is the hotel chain with the most total rooms; this occurs when each row of `frame` is a single hotel room.

Fill in the blanks so that `popular_areas` is an array of the names of the unique neighborhoods that have at least 5 hotels and at least 1000 hotel rooms.

```
f = lambda df: __(i)__
popular_areas = hotels.groupby(__(ii)__).__(iii)__.__(iv)__
```

What goes in blank (i)?

What goes in blank (ii)?

- `"Hotel Name"`
- `"Location"`
- `"Chain"`
- `"Number of Rooms"`

What goes in blank (iii)?

- `agg(f)`
- `filter(f)`
- `transform(f)`

What goes in blank (iv)?

**Answers**:

- (i): `df.shape[0] >= 5 and df["Number of Rooms"].sum() >= 1000`
- (ii): `"Location"`
- (iii): `filter(f)`
- (iv): `["Location"].unique()` or equivalent

We'd like to only consider certain neighborhoods according to group characteristics (having at least 5 hotels and at least 1000 hotel rooms), and `.filter()` allows us to do that by excluding groups not meeting those criteria. So, we can write a function that evaluates those criteria on one group at a time (the `df` passed to `f` is the subset of `hotels` containing just one `Location` value), and calling `filter(f)` means that the only remaining rows are hotels in neighborhoods that match those criteria. Finally, all we have to do is get the unique neighborhoods from this DataFrame, which are the neighborhoods for which `f` returned `True`.
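
Putting the blanks together, a sketch of the completed code (assuming the `hotels` DataFrame described above):

```
# Keep only groups (neighborhoods) with >= 5 hotels and >= 1000 rooms,
# then collect the unique neighborhood names that survived.
f = lambda df: df.shape[0] >= 5 and df["Number of Rooms"].sum() >= 1000
popular_areas = hotels.groupby("Location").filter(f)["Location"].unique()
```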

Consider the code below.

```
cond1 = hotels["Chain"] == "Marriott"
cond2 = hotels["Location"] == "Coronado"
combined = hotels[cond1].merge(hotels[cond2], on="Hotel Name", how=???)
```

If we replace `???` with `"inner"` in the code above, which of the following will be equal to `combined.shape[0]`?

- `min(cond1.sum(), cond2.sum())`
- `(cond1 & cond2).sum()`
- `cond1.sum() + cond2.sum()`
- `cond1.sum() + cond2.sum() - (cond1 & cond2).sum()`
- `cond1.sum() + (cond1 & cond2).sum()`

If we replace `???` with `"outer"` in the code above, which of the following will be equal to `combined.shape[0]`?

- `min(cond1.sum(), cond2.sum())`
- `(cond1 & cond2).sum()`
- `cond1.sum() + cond2.sum()`
- `cond1.sum() + cond2.sum() - (cond1 & cond2).sum()`
- `cond1.sum() + (cond1 & cond2).sum()`

**Answers**:

- Inner: `(cond1 & cond2).sum()`
- Outer: `cond1.sum() + cond2.sum() - (cond1 & cond2).sum()`

Note that `cond1` and `cond2` are boolean Series, and `hotels[cond1]` and `hotels[cond2]` are the subsets of `hotels` where `Chain == "Marriott"` and `Location == "Coronado"`, respectively.

- When we perform an inner merge, we're selecting every row where a `Hotel Name` appears in *both* `hotels[cond1]` and `hotels[cond2]`. This is the same set of indices (and therefore hotel names, since those are unique) as where `(cond1 & cond2) == True`. So, the length of `combined` will be the same as the number of `True`s in `(cond1 & cond2)`.
- When we perform an outer merge, we're selecting every row that appears in *either* DataFrame, although there will not be repeats for hotels that are both Marriott properties and are in Coronado. So, to find the total number of rows in either DataFrame, we take the sum of the sizes of each, and subtract rows that appear in both, which corresponds to the answer `cond1.sum() + cond2.sum() - (cond1 & cond2).sum()`.
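
A minimal sketch on made-up data, verifying both counting formulas:

```
import pandas as pd

# Hotel A is a Marriott in Coronado (in both subsets); B is only a
# Marriott; C is only in Coronado.
hotels = pd.DataFrame({
    "Hotel Name": ["A", "B", "C"],
    "Chain":      ["Marriott", "Marriott", "Hilton"],
    "Location":   ["Coronado", "Downtown", "Coronado"],
})
cond1 = hotels["Chain"] == "Marriott"
cond2 = hotels["Location"] == "Coronado"

inner = hotels[cond1].merge(hotels[cond2], on="Hotel Name", how="inner")
outer = hotels[cond1].merge(hotels[cond2], on="Hotel Name", how="outer")

print(inner.shape[0], (cond1 & cond2).sum())              # 1 1
print(outer.shape[0],
      cond1.sum() + cond2.sum() - (cond1 & cond2).sum())  # 3 3
```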

Let’s consider admissions at UC San Diego and UC Santa Barbara for two high schools in particular.

For instance, the above table tells us that 200 students from La Jolla Private applied to UC San Diego, and 50 were admitted.

What is the largest possible integer value of N such that:

- UC Santa Barbara has a strictly higher admit rate for **both** La Jolla Private and Sun God Memorial High individually, but
- UC San Diego has a strictly higher admit rate overall?

**Answer:** 124

Let’s consider the two conditions separately.

First, UC Santa Barbara needs to have a higher admit rate for both high schools. This is already true for La Jolla Private (\frac{100}{300} > \frac{50}{200}); for Sun God Memorial High, we just need to ensure that \frac{N}{150} > \frac{200}{300}. This means that N > 100.

Now, UC San Diego needs to have a higher admit rate overall. The UC San Diego admit rate is \frac{50+200}{200+300} = \frac{250}{500} = \frac{1}{2}, while the UC Santa Barbara admit rate is \frac{100 + N}{450}. This means that we must require that \frac{1}{2} = \frac{225}{450} > \frac{100+N}{450}. This means that 225 > 100 + N, i.e. that N < 125.

So there are two conditions on N: N > 100 and N < 125. The largest integer N that satisfies these conditions is N=124, which is the final answer.

Suppose Charlie and Norah each have separate DataFrames for their contacts, called `charlie` and `norah`, respectively. These DataFrames have the same column names and format as your DataFrame, `contacts`.

As illustrated in the diagram below, Charlie has 172 contacts in total, whereas Norah has 88 contacts. 12 of these contacts are shared, meaning they appear in both `charlie` and `norah`.

What does the following expression evaluate to?

`charlie.merge(norah, left_index=True, right_index=True).shape[0]`

**Answer:** 12

The code merges DataFrames `charlie` and `norah` on their indexes, so the resulting DataFrame will contain one row for every match between their indexes ('Person', since they follow the same format as the DataFrame `contacts`). From the Venn diagram, we know that Charlie and Norah have 12 contacts in common, so the resulting DataFrame will contain 12 rows: one row for each shared contact.

Thus, `charlie.merge(norah, left_index=True, right_index=True).shape[0]` returns the number of rows of the resulting DataFrame, which is 12.

One day, when updating her phone's operating system, Norah accidentally duplicates the 12 contacts she has in common with Charlie. Now, the `norah` DataFrame has 100 rows.

What does the following expression evaluate to?

`norah.merge(norah, left_index=True, right_index=True).shape[0]`

**Answer:** 24 \cdot 2 + 76 = 124

Since Norah duplicated 12 contacts, the `norah` DataFrame now has 76 unique rows plus 24 rows that come in identical pairs (the 12 originals and their 12 duplicates). Note that the above code merges `norah` with itself on indexes.

After merging, the resulting DataFrame will contain 76 unique rows, as there is only one match for each unique row. As for the duplicated rows, each row can match twice, and there are 24 such rows. Thus the resulting DataFrame has 76 + 2 \cdot 24 = 124 rows.

For better understanding, imagine we have a smaller DataFrame `nor` with only one contact, Jim. After duplication, it will have two identical rows of Jim. For easier explanation, let's denote the original row Jim1 and the duplicated row Jim2. When merging `nor` with itself, Jim1 can be matched with Jim1 and Jim2, and Jim2 can be matched with Jim1 and Jim2, resulting in 2 \cdot 2 = 4 rows.
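
The same toy example, made runnable (the phone numbers are made up):

```
import pandas as pd

nor = pd.DataFrame({"Phone": ["555-0100", "555-0100"]},
                   index=["Jim", "Jim"])

# Each Jim row matches both Jim rows, so 2 * 2 = 4 rows result.
print(nor.merge(nor, left_index=True, right_index=True).shape[0])  # 4
```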

In this problem, we will be using the following DataFrame `students`, which contains various information about high school students and the university/universities they applied to.

The columns are:

- `'Name'` (str): the name of the student.
- `'High School'` (str): the high school that the student attended.
- `'Email'` (str): the email of the student.
- `'GPA'` (float): the GPA of the student.
- `'AP'` (int): the number of AP exams that the student took.
- `'University'` (str): the name of the university that the student applied to.
- `'Admit'` (str): the acceptance status of the student (where `'Y'` denotes that they were accepted to the university and `'N'` denotes that they were not).

The rows of `students` are arranged in no particular order. The first eight rows of `students` are shown above (though `students` has many more rows than pictured here).

Fill in the blank so that the result evaluates to a Series indexed by `"Email"` that contains a **list** of the universities that each student **was admitted to**. If a student wasn't admitted to any universities, they should have an empty list.

`students.groupby("Email").apply(_____)`

What goes in the blank?

**Answer:** `lambda df: df.loc[df["Admit"] == "Y", "University"].tolist()`
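
A minimal sketch on made-up data showing why this works, including the empty-list case:

```
import pandas as pd

students = pd.DataFrame({
    "Email":      ["a@x.com", "a@x.com", "b@x.com"],
    "University": ["UCSD",    "UCLA",    "UCSB"],
    "Admit":      ["Y",       "N",       "N"],
})

# For each email, keep only the admitted rows and collect the
# universities into a list; students with no "Y" rows get [].
admitted = students.groupby("Email").apply(
    lambda df: df.loc[df["Admit"] == "Y", "University"].tolist()
)
print(admitted)
# a@x.com    [UCSD]
# b@x.com        []
```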

Which of the following blocks of code correctly assign `max_AP` to the maximum number of APs taken by a student who was rejected by UC San Diego?

Option 1:

```
cond1 = students["Admit"] == "N"
cond2 = students["University"] == "UC San Diego"
max_AP = students.loc[cond1 & cond2, "APs"].sort_values().iloc[-1]
```

Option 2:

```
cond1 = students["Admit"] == "N"
cond2 = students["University"] == "UC San Diego"
d3 = students.groupby(["University", "Admit"]).max().reset_index()
max_AP = d3.loc[cond1 & cond2, "APs"].iloc[0]
```

Option 3:

```
p = students.pivot_table(index="Admit",
                         columns="University",
                         values="APs",
                         aggfunc="max")
max_AP = p.loc["N", "UC San Diego"]
```

Option 4:

```
# .last() returns the element at the end of a Series it is called on
groups = students.sort_values(["APs", "Admit"]).groupby("University")
max_AP = groups["APs"].last()["UC San Diego"]
```

**Select all that apply.** There is at least one correct option.

- Option 1
- Option 2
- Option 3
- Option 4

**Answer:** Option 1 and Option 3

- Option 1 works correctly; it is probably the most straightforward way of answering the question. `cond1` is `True` for all rows in which students were rejected, and `cond2` is `True` for all rows in which students applied to UCSD. As such, `students.loc[cond1 & cond2]` contains only the rows where students were rejected from UCSD. Then, `students.loc[cond1 & cond2, "APs"].sort_values()` sorts by the number of `"APs"` taken in increasing order, and `.iloc[-1]` gets the largest number of `"APs"` taken.
- Option 2 doesn't work because the lengths of `cond1` and `cond2` are not the same as the length of `d3`, so this causes an error.
- Option 3 works correctly. For each combination of `"Admit"` status (`"Y"`, `"N"`, `"W"`) and `"University"` (including UC San Diego), it computes the max number of `"APs"`. The usage of `.loc["N", "UC San Diego"]` is correct too.
- Option 4 doesn't work. It currently returns the maximum number of `"APs"` taken by someone who applied to UC San Diego; it does not factor in whether they were admitted, rejected, or waitlisted.

Currently, `students` has a lot of repeated information — for instance, if a student applied to 10 universities, their GPA appears 10 times in `students`.

We want to generate a DataFrame that contains a single row for each student, indexed by `"Email"`, that contains their `"Name"`, `"High School"`, `"GPA"`, and `"APs"`.

One attempt to create such a DataFrame is below.

```
students.groupby("Email").aggregate({"Name": "max",
                                     "High School": "mean",
                                     "GPA": "mean",
                                     "APs": "max"})
```

There is exactly one issue with the line of code above. **In
one sentence**, explain what needs to be changed about the line
of code above so that the desired DataFrame is created.

**Answer:** The problem right now is that aggregating `"High School"` by mean doesn't work, since you can't aggregate a column of strings using `"mean"`. Changing it to something that works for strings, like `"max"` or `"min"`, would fix the issue.

Consider the following snippet of code.

```
pivoted = students.assign(Admit=students["Admit"] == "Y") \
                  .pivot_table(index="High School",
                               columns="University",
                               values="Admit",
                               aggfunc="sum")
```
```

Some of the rows and columns of `pivoted` are shown below.

No students from Warren High were admitted to Columbia or Stanford. However, `pivoted.loc["Warren High", "Columbia"]` and `pivoted.loc["Warren High", "Stanford"]` evaluate to different values. What is the reason for this difference?

- Some students from Warren High applied to Stanford, and some others applied to Columbia, but none applied to both.
- Some students from Warren High applied to Stanford but none applied to Columbia.
- Some students from Warren High applied to Columbia but none applied to Stanford.
- The students from Warren High that applied to both Columbia and Stanford were all rejected from Stanford, but at least one was admitted to Columbia.
- When using `pivot_table`, `pandas` was not able to sum strings of the form `"Y"`, `"N"`, and `"W"`, so the values in `pivoted` are unreliable.

**Answer:** Option 3

`pivoted.loc["Warren High", "Stanford"]` is `NaN` because there were no rows in `students` in which the `"High School"` was `"Warren High"` and the `"University"` was `"Stanford"`, because nobody from Warren High applied to Stanford. However, `pivoted.loc["Warren High", "Columbia"]` is not `NaN` because there was at least one row in `students` in which the `"High School"` was `"Warren High"` and the `"University"` was `"Columbia"`. This means that at least one student from Warren High applied to Columbia.

Option 3 is the only option consistent with this logic.

For each day in May 2022, the DataFrame `streams` contains the number of streams for each of the "Top 200" songs on Spotify that day — that is, the number of streams for the 200 songs with the most streams on Spotify that day. The columns in `streams` are as follows:

- `"date"`: the date the song was streamed
- `"artist_names"`: name(s) of the artists who created the song
- `"track_name"`: name of the song
- `"streams"`: the number of times the song was streamed on Spotify that day

The first few rows of `streams` are shown below. Since there were 31 days in May and 200 songs per day, `streams` has 6200 rows in total.

Note that:

- `streams` is already sorted in a very particular way — it is sorted by `"date"` in reverse chronological (decreasing) order, and, within each `"date"`, by `"streams"` in increasing order.
- Many songs will appear multiple times in `streams`, because many songs were in the Top 200 on more than one day.

Complete the implementation of the function `song_by_day`, which takes in an integer `day` between 1 and 31 corresponding to a day in May, and an integer `n`, and returns the song that had the `n`-th most streams on `day`. For instance, `song_by_day(31, 199)` should evaluate to `"pepas"`, because `"pepas"` was the 199th most streamed song on May 31st.

**Note:** You are not allowed to sort within `song_by_day` — remember, `streams` is already sorted.

```
def song_by_day(day, n):
    day_str = f"2022-05-{str(day).zfill(2)}"
    day_only = streams[__(a)__].iloc[__(b)__]
    return __(c)__
```

What goes in each of the blanks?

**Answer:** (a) `streams['date'] == day_str`, (b) `200 - n`, (c) `day_only['track_name']`

The first line in the function hints that later on in the function we're going to filter for all the rows that match the given date. Indeed, in blank (a), we filter for all the rows in which the `'date'` column matches `day_str`. In blank (b), we can directly access the row with the `n`-th most streams using `iloc`. (Remember, the image above shows us that within each day, streams are sorted in ascending order, so to find the `n`-th most popular song of a day, we simply use position `200 - n`.) Finally, to return the track name, we can simply write `day_only['track_name']`.
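
For reference, the completed function, assembled from the blanks above:

```
def song_by_day(day, n):
    day_str = f"2022-05-{str(day).zfill(2)}"
    # Within each date, rows are sorted by streams in ascending order,
    # so the n-th most streamed song sits at position 200 - n.
    day_only = streams[streams['date'] == day_str].iloc[200 - n]
    return day_only['track_name']
```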

Below, we define a DataFrame `pivoted`.

```
pivoted = streams.pivot_table(index="track_name", columns="date",
                              values="streams", aggfunc=np.max)
```

After defining `pivoted`, we define a Series `mystery` below.

`mystery = 31 - pivoted.apply(lambda s: s.isna().sum(), axis=1)`

`mystery.loc["pepas"]` evaluates to 23. In one sentence, describe the relationship between the number 23 and the song `"pepas"` in the context of the `streams` dataset. For instance, a correctly formatted but incorrect answer is "I listened to the song `"pepas"` 23 times today."

**Answer:** See below.

`pivoted.apply(lambda s: s.isna().sum(), axis=1)` computes the number of days that a song was not in the Top 200, so `31 - pivoted.apply(lambda s: s.isna().sum(), axis=1)` computes the number of days the song was in the Top 200. As such, the correct interpretation is that **"pepas" was in the Top 200 for 23 days in May**.
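
As an aside, an equivalent (and arguably more direct) way to compute the same Series:

```
# Count the non-null entries in each row, i.e. the number of days
# each song appeared in the Top 200.
mystery = pivoted.notna().sum(axis=1)
```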

In defining `pivoted`, we set the keyword argument `aggfunc` to `np.max`. Which of the following functions could we have used instead of `np.max` without changing the values in `pivoted`? **Select all that apply.**

- `np.mean`
- `np.median`
- `len`
- `lambda df: df.iloc[0]`
- None of the above

**Answer:** `np.mean`, `np.median`, and `lambda df: df.iloc[0]`

For each combination of `"track_name"` and `"date"`, there is just a single value — the number of streams that song received on that date. As such, the `aggfunc` needs to take in a Series containing a single number and return that same number.

The mean and median of a Series containing a single number is equal to that number, so the first two options are correct.

The length of a Series containing a single number is 1, no matter what that number is, so the third option is not correct.

`lambda df: df.iloc[0]` takes in a Series and returns the first element in the Series, which is the only element in the Series. This option is correct as well. (The parameter name `df` is irrelevant.)

Below, we define another DataFrame `another_mystery`.

```
another_mystery = (streams.groupby("date").last()
                          .groupby(["artist_names", "track_name"])
                          .count().reset_index())
```

`another_mystery` has 5 rows. In one sentence, describe the significance of the number 5 in the context of the `streams` dataset. For instance, a correctly formatted but incorrect answer is "There are 5 unique artists in `streams`." Your answer should not include the word "row".

**Answer:** See below.

Since `streams` is sorted by `"date"` in descending order and, within each `"date"`, by `"streams"` in ascending order, `streams.groupby("date").last()` is a DataFrame containing the song with the most `"streams"` on each day in May. In other words, we found the "top song" for each day. (The DataFrame we created has 31 rows.)

When we then execute `.groupby(["artist_names", "track_name"]).count()`, we create one row for every unique combination of song and artist, amongst the "top songs". (If no two artists have a song with the same name, this is the same as creating one row for every unique song.) Since there are 5 rows in this new DataFrame (resetting the index doesn't do anything here), it means that **there were only 5 unique songs that were ever the "top song" on a day in May**; this is the correct interpretation.

This problem builds on the previous one.

Suppose the DataFrame `today` consists of 15 rows — 3 rows for each of 5 different `"artist_names"`. For each artist, it contains the `"track_name"` for their three most-streamed songs today. For instance, there may be one row for `"olivia rodrigo"` and `"favorite crime"`, one row for `"olivia rodrigo"` and `"drivers license"`, and one row for `"olivia rodrigo"` and `"deja vu"`.

Another DataFrame, `genres`, is shown below in its entirety.

Suppose we perform an **inner** merge between `today` and `genres` on `"artist_names"`. If the five `"artist_names"` in `today` are the same as the five `"artist_names"` in `genres`, what fraction of the rows in the merged DataFrame will contain `"Pop"` in the `"genre"` column? Give your answer as a simplified fraction.

**Answer:** \frac{2}{5}

If the five `"artist_names"` in `today` and `genres` are the same, the DataFrame that results from an inner merge will have 15 rows, one for each row in `today`. This is because there are 3 matches for `"harry styles"`, 3 matches for `"olivia rodrigo"`, 3 matches for `"glass animals"`, and so on.

In the merged DataFrame's 15 rows, 6 of them will correspond to `"Pop"` artists — 3 to `"harry styles"` and 3 to `"olivia rodrigo"`. Thus, the fraction of rows that contain `"Pop"` in the `"genre"` column is \frac{6}{15} = \frac{2}{5} (which is the fraction of rows that contained `"Pop"` in `genres["genre"]`, too).

Suppose we perform an **inner** merge between `today` and `genres` on `"artist_names"`. Furthermore, suppose that the only overlapping `"artist_names"` between `today` and `genres` are `"drake"` and `"olivia rodrigo"`. What fraction of the rows in the merged DataFrame will contain `"Pop"` in the `"genre"` column? Give your answer as a simplified fraction.

**Answer:** \frac{1}{2}

If we perform an inner merge, there will only be 6 rows in the merged DataFrame — 3 for `"olivia rodrigo"` and 3 for `"drake"`. 3 of those 6 rows will have `"Pop"` in the `"genre"` column, hence the answer is \frac{3}{6} = \frac{1}{2}.

Suppose we perform an **outer** merge between `today` and `genres` on `"artist_names"`. Furthermore, suppose that the only overlapping `"artist_names"` between `today` and `genres` are `"drake"` and `"olivia rodrigo"`. What fraction of the rows in the merged DataFrame will contain `"Pop"` in the `"genre"` column? Give your answer as a simplified fraction.

**Answer:** \frac{2}{9}

Since we are performing an outer merge, we can decompose the rows in the merged DataFrame into three groups:

- Rows that are in `today` but not in `genres`. There are 9 of these (3 each for the 3 artists that are in `today` and not `genres`). `today` doesn't have a `"genre"` column, and so all of these `"genre"`s will be `NaN` upon merging.
- Rows that are in `genres` but not in `today`. There are 3 of these — one for `"harry styles"`, one for `"glass animals"`, and one for `"doja cat"`. 1 of these 3 has `"Pop"` in the `"genre"` column.
- Rows that are in both `today` and `genres`. There are 6 of these — 3 for `"olivia rodrigo"` and 3 for `"drake"` — and 3 of those rows contain `"Pop"` in the `"genre"` column.

Tallying things up, we see that there are 9 + 3 + 6 = 18 rows in the merged DataFrame overall, of which 0 + 1 + 3 = 4 have `"Pop"` in the `"genre"` column. Hence, the relevant fraction is \frac{4}{18} = \frac{2}{9}.

Recall the `evs` DataFrame. The first few rows of `evs` are shown below (though remember, `evs` has 32 rows total).

Below, we provide a DataFrame that contains the distribution of `"BodyStyle"` for all `"Brand"`s in `evs`, other than Nissan.

Suppose we’ve run the following few lines of code.

```
tesla = evs[evs.get("Brand") == "Tesla"]
bmw = evs[evs.get("Brand") == "BMW"]
audi = evs[evs.get("Brand") == "Audi"]

combo = tesla.merge(bmw, on="BodyStyle").merge(audi, on="BodyStyle")
```

How many rows does the DataFrame `combo` have?

- 21
- 24
- 35
- 65
- 72
- 96

**Answer:** 35

Let's attempt this problem step-by-step. We'll first determine the number of rows in `tesla.merge(bmw, on="BodyStyle")`, and then determine the number of rows in `combo`. **For the purposes of the solution, let's use `temp` to refer to the first merged DataFrame, `tesla.merge(bmw, on="BodyStyle")`.**

Recall, when we `merge` two DataFrames, the resulting DataFrame contains a single row for every match between the two columns, and rows in either DataFrame without a match disappear. In this problem, the column that we're looking for matches in is `"BodyStyle"`.

To determine the number of rows of `temp`, we need to determine which rows of `tesla` have a `"BodyStyle"` that matches a row in `bmw`. From the DataFrame provided, we can see that the only `"BodyStyle"`s in both `tesla` and `bmw` are SUV and sedan. When we merge `tesla` and `bmw` on `"BodyStyle"`:

- The 4 SUV rows in `tesla` each match the 1 SUV row in `bmw`. This will create 4 SUV rows in `temp`.
- The 3 sedan rows in `tesla` each match the 1 sedan row in `bmw`. This will create 3 sedan rows in `temp`.

So, `temp` is a DataFrame with a total of 7 rows, with 4 rows for SUVs and 3 rows for sedans in the `"BodyStyle"` column. Now, when we merge `temp` and `audi` on `"BodyStyle"`:

- The 4 SUV rows in `temp` each match the 8 SUV rows in `audi`. This will create 4 \cdot 8 = 32 SUV rows in `combo`.
- The 3 sedan rows in `temp` each match the 1 sedan row in `audi`. This will create 3 \cdot 1 = 3 sedan rows in `combo`.

Thus, the total number of rows in `combo` is 32 + 3 = 35.

Note: You may notice that 35 is the result of multiplying the `"SUV"` and `"Sedan"` columns in the DataFrame provided, and adding up the results, as sketched below.
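
A quick sketch of that check, using the per-brand counts from the distribution table:

```
tesla_counts = {"SUV": 4, "Sedan": 3}
bmw_counts   = {"SUV": 1, "Sedan": 1}
audi_counts  = {"SUV": 8, "Sedan": 1}

# Each merge multiplies the match counts within each "BodyStyle".
total = sum(tesla_counts[b] * bmw_counts[b] * audi_counts[b]
            for b in ["SUV", "Sedan"])
print(total)  # 4*1*8 + 3*1*1 = 35
```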