The problems in this worksheet are taken from past exams in similar
classes. Work on them on paper, since the exams you
take in this course will also be on paper.
We encourage you to
complete this worksheet in a live discussion section. Solutions will be
made available after all discussion sections have concluded. You don’t
need to submit your answers anywhere.
Note: We do not plan to
cover all problems here in the live discussion section; the problems
we don’t cover can be used for extra practice.
The DataFrame games
contains information about a sample
of popular games, including board games, dice games, and card games. The
data comes from Board Game Geek, a popular website and vibrant online
community for game enthusiasts.
The columns of games are as follows.
"Name"
(str
): The name of the game."Mechanics"
(str
): A sequence of
descriptors for how the game is played. A game can have several
descriptors, each of which is separated by a comma."Domains"
(str
): A sequence of domains to
which the game belongs. A game can belong to multiple domains."Play Time"
(int
): The average play time
of the game, in minutes, as suggested by the game’s creators."Complexity"
(float
): The average
complexity of the game, on a scale of 1 to 5 points, as reported by
Board Game Geek community members."Rating"
(str
): The average rating of the
game, on a scale of 1 to 10 points, as rated by Board Game Geek
community members. Note that while this data should be numerical, it is
actually stored as a string, because some entries use a comma in place
of a decimal point. For example 8,79 actually represents the number
8.79."BGG Rank"
(int
): The rank of the game in
Board Game Geek’s database. The formula for how this rank is calculated
is not publicly known, but it likely includes many factors, such as
"Rating"
, number of registered owners of the game, and
number of reviews.The first few rows of games
are shown below (though
games
has many more rows than pictured here).
Assume that we have already run import pandas as pd
and
import numpy as np
.
Many of the games in the games
DataFrame belong to more
than one domain. We want to identify the number of games that belong to
only one domain. Select all of the options below that would correctly
calculate the number of games that belong to only one domain.
(games["Domains"].str.split(" ").apply(len) == 2).sum()
(games["Domains"].apply(len) == 1).sum()
games[games["Domains"].str.split(",").str.len() == 1].value_counts("Name").sum()
games[games["Domains"].str.split(",").str.len() == 1].shape[0]
Answer: Options 3 and 4
Let’s take a closer look at why Option 3 and Option 4 are correct.
Option 3: Option 3 first queries the
games
DataFrame to only keep games
with one
"Domains"
.
games["Domains"].str.split(",").str.len() == 1
gets the
"Domains"
column and splits all of them by their comma.
For example, let’s say the domain was
"Strategy Games", "Thematic Games"
then after doing
.str.split(",")
we would have the list
["Strategy Games", "Thematic Games"]
. If the domain was
just "Strategy Games"
, then after splitting we’d have
["Strategy Games"]
.
Any "Domain"
with at least one comma will turn into a
list whose length is at least 2, so the comparison with
.str.len() == 1
will evaluate to False
. So
ultimately,
"Domains"].str.split(",").str.len() == 1] games[games[
is a DataFrame with just the rows for games that belong to one
"Domain"
.
The next part .value_counts("Name").sum()
makes a Series
with an index of the unique "Name"
s and the number of times
those appear. Finally, summing this resulting Series gives us the total number of games that belong to only one "Domain".
Option 4: Option 4 starts off exactly like Option 3,
but instead of using value_counts
it gets the number of
rows using .shape[0]
, which will give us the number of
games that belong to only one domain.
Option 1: Let’s step through why Option 1 is
incorrect.
games["Domains"].str.split(" ").apply(len) == 2.sum()
gives
you a Series
of the "Domains"
column, then
splits each domain string by spaces. We then get the length of each resulting list, check whether that length is equal to 2 (meaning the string split into exactly two pieces), and finally sum the resulting Boolean Series. Remember that True evaluates to 1 and False evaluates to 0, so this counts the number of "Domains" values that split into exactly two pieces on spaces, not the number of games that belong to only one domain.
Option 2: Let’s step through why Option 2 is also
incorrect. (games["Domains"].apply(len) == 1).sum()
checks
to see if each element in the column "Domains"
has only one
character. Remember that when you apply len() to a string, you get the number of characters in that string. This is essentially
counting the number of domains that have 1 letter. Thus, it does not
tell us the number of games that belong to only one domain.
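To see Options 3 and 4 in action, here's a minimal sketch on a small, made-up games DataFrame (the game names and domains below are invented for illustration):

import pandas as pd

toy_games = pd.DataFrame({
    "Name": ["Gloomhaven", "Catan", "Pandemic"],
    "Domains": ["Strategy Games, Thematic Games",
                "Strategy Games",
                "Family Games"],
})

# True for games whose "Domains" string contains no comma, i.e. one domain.
one_domain = toy_games["Domains"].str.split(",").str.len() == 1

print(toy_games[one_domain].shape[0])                    # 2 (Option 4)
print(toy_games[one_domain].value_counts("Name").sum())  # 2 (Option 3)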
In September 2020, Governor Gavin Newsom announced that by 2035, all new vehicles sold in California must be zero-emissions vehicles. Electric vehicles (EVs) are among the most popular zero-emissions vehicles (though other examples include plug-in hybrids and hydrogen fuel cell vehicles).
The DataFrame evs
consists of 32 rows,
each of which contains information about a different EV model.
"Brand"
(str): The vehicle’s manufacturer."Model"
(str): The vehicle’s model name."BodyStyle"
(str): The vehicle’s body style."Seats"
(int): The vehicle’s number of seats."TopSpeed"
(int): The vehicle’s top speed, in
kilometers per hour."Range"
(int): The vehicle’s range, or distance it can
travel on a single charge, in kilometers.The first few rows of evs
are shown below (though
remember, evs
has 32 rows total).
Throughout this problem, we will refer to evs
repeatedly.
Assume that:
"Brand"
column are
"Tesla"
, "BMW"
, "Audi"
, and
"Nissan"
.import pandas as pd
and
import numpy as np
.Tip: Open this page in another tab, so that it is easy to refer to this data description as you work through the problem.
Which of the following types of visualizations could we use to visualize
the distribution of "Range"
? Select all that apply.
Bar chart
Histogram
Scatter plot
Line plot
Box plot
None of the above
Answer: Histogram or box plot
"Range"
is a numerical (i.e. quantitative) variable, and
we use histograms and box plots to visualize the distribution of
numerical features.
"Range"
is not
categorical."Range"
).Teslas, on average, tend to have higher "Range"
s than
BMWs. In which of the following visualizations would we be able to see
this pattern? Select all that apply.
A bar chart that shows the distribution of "Brand"
A bar chart that shows the average "Range"
for each
"Brand"
An overlaid histogram showing the distribution of
"Range"
for each "Brand"
A scatter plot with "TopSpeed"
on the x-axis and "Range"
on the y-axis
Answer:
"Range"
for each
"Brand"
"Range"
for each "Brand"
Let’s look at each option more closely.
Option 1: A bar chart showing the distribution
of "Brand"
would only show us how many cars of each
"Brand"
there are. It would not tell us anything about the
average "Range"
of each "Brand"
.
Option 2: A bar chart showing the average range
for each "Brand"
would help us directly visualize how the
average range of each "Brand"
compares to one
another.
Option 3: An overlaid histogram, although
perhaps a bit messy, would also give us a general idea of the average
range of each "Brand"
by giving us the distribution of the
"Range"
of each brand. In the scenario mentioned in the
question, we’d expect to see that the Tesla distribution is further
right than the BMW distribution.
Option 4: A scatter plot of
"TopSpeed"
against "Range"
would only
illustrate the relationship between "TopSpeed"
and
"Range"
, but would contain no information about the
"Brand"
of each EV.
Kyle thinks "Seats"
is a categorical feature because it
can be used to categorize EVs by size. For instance, EVs with 4 seats
are small, EVs with 5 seats are medium, and EVs with 6 or more seats are
large.
Is Kyle correct?
Yes
No
Justify your answer in one sentence. Your answer must fit in the box below.
Answer: No
"Seats"
is a numerical feature, since it makes sense to
do arithmetic with the values. For instance, we can find the average
number of "Seats"
that a group of cars has. Kyle’s argument
could apply to any numerical feature; just because we can place
numerical features into “bins” doesn’t make them categorical.
The seat-back TV on one of Biff the Wolverine’s more recent flights
was very dirty and was full of fingerprints. The fingerprints made an
interesting pattern. We’ve stored the x and y positions of each
fingerprint in the DataFrame fingerprints
, and created the
following scatterplot using
fingerprints.plot(kind='scatter', x='x', y='y')
True or False: The histograms that result from the following two lines of code will look very similar.
# When we say bins=np.arange(0, 8, 2),
# we mean that the bins are [0, 2), [2, 4), [4, 6].
fingerprints.plot(kind='hist',
                  y='x',
                  density=True,
                  bins=np.arange(0, 8, 2))
and
fingerprints.plot(kind='hist',
                  y='y',
                  density=True,
                  bins=np.arange(0, 8, 2))
True
False
Answer: True
The only difference between the two code snippets is the data values
used. The first creates a histogram of the x
-values in
fingerprints
, and the second creates a histogram of the
y
-values in fingerprints
.
Both histograms use the same bins:
bins=np.arange(0, 8, 2)
. This means the bin endpoints are
[0, 2, 4, 6]
, so there are three distinct bins: [0, 2), [2,
4), and [4, 6]. Remember the
right-most bin of a histogram includes both endpoints, whereas others
include the left endpoint only.
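As a quick check of the bin edges (np.arange here is the same call used in both code snippets):

import numpy as np

np.arange(0, 8, 2)  # array([0, 2, 4, 6]) -> three bins: [0, 2), [2, 4), [4, 6]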
Let’s look at the x
-values first. If we divide the
scatterplot into nine equally-sized regions, as shown below, note that
eight of the nine regions have a very similar number of data points.
Aside from the middle region, about \frac{1}{8} of the data falls in each region.
That means \frac{3}{8} of the data has
an x
-value in the first bin [0,
2), \frac{2}{8} of the data has
an x
-value in the middle bin [2,
4), and \frac{3}{8} of the data
has an x
-value in the rightmost bin [4, 6]. This distribution of
x
-values into bins determines what the histogram will look
like.
Now, if we look at the y
-values, we’ll find that \frac{3}{8} of the data has a
y
-value in the first bin [0,
2), \frac{2}{8} of the data has
a y
-value in the middle bin [2,
4), and \frac{3}{8} of the data
has a y
-value in the last bin [4,
6]. That’s the same distribution of data into bins as the
x
-values had, so the histogram of y
-values
will look just like the histogram of x-values.
Alternatively, an easy way to see this is to use the fact that the
scatterplot is symmetric over the line y=x, the line through the origin that makes a 45 degree angle with the x-axis. In other words, interchanging the x
and
y
values doesn’t change the scatterplot noticeably, so the
x
and y
values have very similar
distributions, and their histograms will be very similar as a
result.
Below, we’ve drawn a histogram using the line of code
fingerprints.plot(kind='hist',
                  y='x',
                  density=True,
                  bins=np.arange(0, 8, 2))
However, our Jupyter Notebook was corrupted, and so the resulting histogram doesn’t quite look right. While the height of the first bar is correct, the histogram doesn’t contain the second or third bars, and the y-axis is replaced with letters.
Which of the four options on the y-axis is closest to where the height of the middle bar should be?
A
B
C
D
Which of the four options on the y-axis is closest to where the height of the rightmost bar should be?
A
B
C
D
Answer: B, then C
We’ve already determined that the first bin should contain \frac{3}{8} of the values, the middle bin should contain \frac{2}{8} of the values, and the rightmost bin should contain \frac{3}{8} of the values. The middle bar of the histogram should therefore be two-thirds as tall as the first bin, and the rightmost bin should be equally as tall as the first bin. The only reasonable height for the middle bin is B, as it’s closest to two-thirds of the height of the first bar. Similarly, the rightmost bar must be at height C, as it’s the only one close to the height of the first bar.
Pranavi proposes the following imputation scheme.
def impute(s):
    return s.fillna(np.random.choice(s[s.notna()]))
True or False: impute
performs
probabilistic imputation, using the same definition of probabilistic
imputation we learned about in class.
True
False
Answer: False
In impute
, np.random.choice
will return a
single non-null value from s
, and .fillna()
will fill every null value with this single value. Meanwhile,
probabilistic imputation draws a different value from a specified
distribution to fill each missing value, making it such that there won’t
be a single “spike” in the imputed distribution at a single chosen
value.
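For contrast, here's a minimal sketch of what probabilistic imputation, as defined in class, would look like (probabilistic_impute is a hypothetical name; it assumes s is a numeric Series with some missing values):

import numpy as np
import pandas as pd

def probabilistic_impute(s):
    # Each missing entry gets its own independent draw from the observed values,
    # rather than one shared fill value like impute uses.
    observed = s[s.notna()].to_numpy()
    fills = np.random.choice(observed, size=s.isna().sum())
    out = s.copy()
    out[out.isna()] = fills
    return out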
Consider the following expressions and values.
>>> vals.isna().mean()
0.2
>>> vals.describe().loc[["min", "mean", "max"]]
min 2.0
mean 4.0
max 7.0
dtype: float64
Given the above, what is the maximum possible value
of impute(vals).mean()
? Give your answer as a number
rounded to one decimal place.
Answer: 4.6
The maximum possible value of impute(vals).mean()
would
occur when every single missing value in vals
is filled in
with the highest possible non-null value in vals
. (As
discussed in the previous solution, impute
selects only one
value from s
to fill into every missing space.)
If this occurs, then the mean of the imputed Series will be the weighted mean of the available data and the filled data, and given the numbers in the question, this is 0.8 \cdot 4 + 0.2 \cdot 7, or 4.6.
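As a quick numeric check of that weighted mean (a sketch assuming vals matches the summary statistics above):

# 80% of the values are observed with mean 4; the other 20% are all filled
# with the largest observed value, 7.
max_imputed_mean = 0.8 * 4 + 0.2 * 7
print(max_imputed_mean)  # 4.6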
Which of the following statements below will always evaluate to
True
?
vals.std() < impute(vals).std()
vals.std() <= impute(vals).std()
vals.std() == impute(vals).std()
vals.std() >= impute(vals).std()
vals.std() > impute(vals).std()
None of the above
Answer: None of the above
Since the value which impute
will choose to impute with
is random, the effect that it has on the standard deviation of
vals
is unknown. If the missing values are filled with a
value close to the mean, this could reduce standard deviation; if they
are filled with a value far from the mean, this could increase standard
deviation. (Of course, the imputation will also shift the mean, so
without knowing details of the Series, it’s impossible to come up with
thresholds.) In any case, since the value for imputation is chosen at
random, none of these statements will always be true, and so
the correct answer is “none of the above.”
Suppose we have a DataFrame, dogs
, in which each row
corresponds to a different dog owned by someone in Ann Arbor. There are
several columns in the DataFrame, including "birth_year"
(the dog’s year of birth) and "Breed"
(the dog’s
breed).
Using the "birth_year"
column in dogs
,
Tahseen computes the age of each dog and stores it in a new
"age"
column of dogs
. He also categorizes dogs
into three “breed classes” based on their breed –
"Retriever"
, "Terrier"
, or
"Other"
– and stores these in a new "class"
column of dogs
.
The following bar chart shows the distribution of breed classes by missingness of age:
The following bar chart shows the mean observed age for each breed class:
Let O be the mean of the observed
ages in dogs
, that is, the mean of the ages that aren’t
missing. Let T be the true mean of the
ages in dogs
, that is, the mean we’d compute if none of the
ages were missing.
What is the relationship between O and T?
O < T
O \approx T
O = T
O > T
Answer: O < T
In the ages we get to observe (that is, when age is not missing), we have way fewer terrier values than in the ages we don’t get to observe. Terriers are older on average than the other breed classes. This means we’re missing values that are larger, so when we take the average of the values we have access to, it’ll be lower than the true mean.
Consider the following strategy for imputing missing ages.
def fill_missing(dogs):
    return dogs.assign(
        age=dogs.groupby("class")["age"]
                .transform(lambda x: x.fillna(x.value_counts().idxmax()))
    )
In one sentence, describe this imputation strategy.
Answer: This is filling in missing ages in each breed class with the most common observed age in that breed class.
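Here's a minimal runnable sketch of that strategy on made-up ages (the toy DataFrame below is invented for illustration):

import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "class": ["Terrier", "Terrier", "Terrier", "Other", "Other"],
    "age":   [12,        12,        np.nan,    5,       np.nan],
})

filled = toy.assign(
    age=toy.groupby("class")["age"]
           .transform(lambda x: x.fillna(x.value_counts().idxmax()))
)
# The missing Terrier age becomes 12 (the most common observed Terrier age),
# and the missing Other age becomes 5.
print(filled)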
small_dogs
is a subset of dogs
shown in its
entirety below.
If we call fill_missing
on small_dogs
, what
values would the four null values in the "age"
column of
small_dogs
be replaced with?
Answer:
"Other"
breed class is 5."Other"
breed class is 5."Terrier"
breed class is 12."Retriever"
breed class is 7.The EECS 398 staff are looking into hotels — some in San Diego, for their family to stay at for graduation (and to eat Mexican food), and some elsewhere, for summer trips.
Each row of hotels
contains information about a
different hotel in San Diego. Specifically, for each hotel, we have:
"Hotel Name" (str)
: The name of the hotel.
Assume hotel names are unique."Location" (str):
The hotel’s neighborhood in San
Diego."Chain" (str):
The chain the hotel is a part of; either
"Hilton", "Marriott", "Hyatt", or "Other".
A hotel chain is
a group of hotels owned or operated by a shared company."Number of Rooms" (int)
: The number of rooms the hotel
has.The first few rows of hotels
are shown below, but hotels
has many more rows than are shown.
Now, consider the variable summed
, defined below.
= hotels.groupby("Chain")["Number of Rooms"].sum().idxmax() summed
What is type(summed)
?
int
str
Series
DataFrame
DataFrameGroupBy
Answer: str
When we do a groupby
on the Chain
column in
hotels
, this means that the values in the
Chain
column will be the indices of the DataFrame or Series
we get as output, in this case the Series
hotels.groupby("Chain")["Number of Rooms"].sum()
.
Since the values of Chain
are strings, and since
.idxmax()
will return a value from the index of the
aforementioned Series, summed
is a string.
In one sentence, explain what the value of summed
means.
Phrase your explanation as if you had to give it to someone who is not a
data science major; that is, don’t say something like “it is the result
of grouping hotels
by "Chain"
, selecting the
"Number of Rooms"
column, …”, but instead, give the value
context.
Answer: summed
is the name of the hotel
chain with the most total rooms
The result of the .groupby()
and .sum()
is
a Series indexed by the unique Chains
, whose values are the
total number of rooms in hotels owned by each chain. The
idxmax()
function gets the index corresponding to the
largest value in the Series, which will be the hotel chain name with the
most total rooms.
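Here's a small sketch of that pattern on made-up hotel data (not the real hotels DataFrame):

import pandas as pd

toy_hotels = pd.DataFrame({
    "Hotel Name": ["A", "B", "C", "D"],
    "Chain": ["Marriott", "Marriott", "Hyatt", "Other"],
    "Number of Rooms": [200, 150, 400, 100],
})

room_totals = toy_hotels.groupby("Chain")["Number of Rooms"].sum()
# Hyatt 400, Marriott 350, Other 100
print(room_totals.idxmax())  # 'Hyatt', the chain with the most total rooms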
Consider the variable curious
, defined below.
= frame["Chain"].value_counts().idxmax() curious
Fill in the blank: curious
is guaranteed to be equal to
summed
only if frame
has one row for every
____ in San Diego.
hotel
hotel chain
hotel room
neighborhood
Answer: hotel room
curious
gets the most common value of Chain
in the DataFrame frame
. We already know that
summed
is the hotel chain with the most rooms in San Diego,
so curious
only equals summed
if the most
common Chain
in frame
is the hotel chain with
the most total rooms; this occurs when each row of frame
is
a single hotel room.
Fill in the blanks so that popular_areas
is an array of
the names of the unique neighborhoods that have at least 5 hotels and at
least 1000 hotel rooms.
f = lambda df: __(i)__
popular_areas = hotels.groupby(__(ii)__).__(iii)__.__(iv)__
What goes in blank (i)?
What goes in blank (ii)?
"Hotel Name"
"Location"
"Chain"
"Number of Rooms"
What goes in blank (iii)?
agg(f)
filter(f)
transform(f)
Answers:
- Blank (i): df.shape[0] >= 5 and df["Number of Rooms"].sum() >= 1000
- Blank (ii): "Location"
- Blank (iii): filter(f)
- Blank (iv): ["Location"].unique() (or equivalent)

We'd like to only consider certain neighborhoods according to group
characteristics (having >= 5 hotels and >= 1000 hotel rooms), and
.filter()
allows us to do that by excluding groups not
meeting those criteria. So, we can write a function that evaluates those
criteria on one group at a time (the df passed to f
is the subset of hotels
containing just one
Location
value), and calling filter(f)
means
that the only remaining rows are hotels in neighborhoods that match
those criteria. Finally, all we have to do is get the unique
neighborhoods from this DataFrame, which are the neighborhoods for which
f
returned True
.
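Putting the answers together, the completed code would look something like this (one possible version, since blank (iv) has equivalent alternatives):

# Keep only neighborhoods ("Location" groups) with at least 5 hotels and at
# least 1000 total rooms, then pull out the unique neighborhood names.
f = lambda df: df.shape[0] >= 5 and df["Number of Rooms"].sum() >= 1000
popular_areas = hotels.groupby("Location").filter(f)["Location"].unique()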
Consider the code below.
= hotels["Chain"] == "Marriott"
cond1 = hotels["Location"] == "Coronado"
cond2 = hotels[cond1].merge(hotels[cond2], on="Hotel Name", how=???) combined
If we replace ??? with "inner" in the code above, which of the following will be equal to combined.shape[0]?
min(cond1.sum(), cond2.sum())
(cond1 & cond2).sum()
cond1.sum() + cond2.sum()
cond1.sum() + cond2.sum() - (cond1 & cond2).sum()
cond1.sum() + (cond1 & cond2).sum()
If we replace ??? with "outer" in the code above, which of the following will be equal to combined.shape[0]?
min(cond1.sum(), cond2.sum())
(cond1 & cond2).sum()
cond1.sum() + cond2.sum()
cond1.sum() + cond2.sum() - (cond1 & cond2).sum()
cond1.sum() + (cond1 & cond2).sum()
Answers:
(cond1 & cond2).sum()
cond1.sum() + cond2.sum() - (cond1 & cond2).sum()
Note that cond1
and cond2
are boolean
Series, and hotels[cond1]
and hotels[cond2]
are the subsets of hotels
where
Chain == "Marriott
and
"Location" == "Coronado"
, respectively.
When we perform an inner merge, we’re selecting every row where a
Hotel Name
appears in both
hotels[cond1]
and hotels[cond2]
. This is the
same set of indices (and therefore hotel names, since those are unique)
as where (cond1 & cond2) == True
. So, the length of
combined
will be the same as the number of
True
s in (cond1 & cond2)
.
When we perform an outer merge, we’re selecting every row that
appears in either DataFrame, although there will not be repeats
for hotels that are both Marriott properties and are in Coronado. So, to
find the total number of rows in either DataFrame, we take the sum of
the sizes of each, and subtract rows that appear in both, which
corresponds to answer
cond1.sum() + cond2.sum() - (cond1 & cond2).sum()
.
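If you want to convince yourself of both counts, here's a sketch on made-up hotels (invented data, just for illustration):

import pandas as pd

toy = pd.DataFrame({
    "Hotel Name": ["A", "B", "C", "D", "E"],
    "Chain": ["Marriott", "Marriott", "Hyatt", "Marriott", "Other"],
    "Location": ["Coronado", "Downtown", "Coronado", "Coronado", "La Jolla"],
})

cond1 = toy["Chain"] == "Marriott"     # 3 True values
cond2 = toy["Location"] == "Coronado"  # 3 True values

inner = toy[cond1].merge(toy[cond2], on="Hotel Name", how="inner")
outer = toy[cond1].merge(toy[cond2], on="Hotel Name", how="outer")

print(inner.shape[0], (cond1 & cond2).sum())                              # 2 2
print(outer.shape[0], cond1.sum() + cond2.sum() - (cond1 & cond2).sum())  # 4 4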
Let’s consider admissions at UC San Diego and UC Santa Barbara for two high schools in particular.
For instance, the above table tells us that 200 students from La Jolla Private applied to UC San Diego, and 50 were admitted.
What is the largest possible integer value of N such that:
UC Santa Barbara has a strictly higher admit rate for both La Jolla Private and Sun God Memorial High individually, but
UC San Diego has a strictly higher admit rate overall?
Answer: 124
Let’s consider the two conditions separately.
First, UC Santa Barbara needs to have a higher admit rate for both high schools. This is already true for La Jolla Private (\frac{100}{300} > \frac{50}{200}); for Sun God Memorial High, we just need to ensure that \frac{N}{150} > \frac{200}{300}. This means that N > 100.
Now, UC San Diego needs to have a higher admit rate overall. The UC San Diego admit rate is \frac{50+200}{200+300} = \frac{250}{500} = \frac{1}{2}, while the UC Santa Barbara admit rate is \frac{100 + N}{450}. This means that we must require that \frac{1}{2} = \frac{225}{450} > \frac{100+N}{450}. This means that 225 > 100 + N, i.e. that N < 125.
So there are two conditions on N: N > 100 and N < 125. The largest integer N that satisfies these conditions is N=124, which is the final answer.
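If you'd rather check the arithmetic by brute force, here's a quick sketch that searches over integer values of N, using the admission counts from the solution above:

# UCSD: 50/200 admitted from La Jolla Private, 200/300 from Sun God Memorial.
# UCSB: 100/300 admitted from La Jolla Private, N/150 from Sun God Memorial.
valid = [
    N for N in range(151)
    if 100 / 300 > 50 / 200                                 # UCSB higher at La Jolla Private
    and N / 150 > 200 / 300                                 # UCSB higher at Sun God Memorial
    and (50 + 200) / (200 + 300) > (100 + N) / (300 + 150)  # UCSD higher overall
]
print(max(valid))  # 124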
Suppose Charlie and Norah each have separate DataFrames for their
contacts, called charlie
and norah
,
respectively. These DataFrames have the same column names and format as
your DataFrame, contacts
.
As illustrated in the diagram below, Charlie has 172 contacts in
total, whereas Norah has 88 contacts. 12 of these contacts are shared,
meaning they appear in both charlie
and
norah
.
What does the following expression evaluate to?
charlie.merge(norah, left_index=True, right_index=True).shape[0]
Answer: 12
The code merges the DataFrames charlie and norah on their indexes, so the resulting DataFrame will contain one row for every match between their indexes (which identify people, since both DataFrames follow the same format as contacts). From the Venn diagram, we know that Charlie and Norah have 12 contacts in common, so the resulting DataFrame will contain 12 rows: one row for each shared contact.
Thus, charlie.merge(norah, left_index=True, right_index=True).shape[0] returns the number of rows in the resulting DataFrame, which is 12.
One day, when updating her phone’s operating system, Norah
accidentally duplicates the 12 contacts she has in common with Charlie.
Now, the norah
DataFrame has 100 rows.
What does the following expression evaluate to?
norah.merge(norah, left_index=True, right_index=True).shape[0]
Answer: 24 \cdot 2 + 76 = 124
Since Norah duplicates 12 contacts, the norah
DataFrame
now has 76 unique rows + 12 rows + 12 duplicated rows. Note that the
above code is now merging norah with itself on indexes.
After merging, the resulting DataFrame will contain the 76 unique rows, as there is only one match for each unique row. As for the duplicated contacts, each of their 24 rows matches twice. Thus the resulting DataFrame has 76 + 2 \cdot 24 = 124 rows.
For better understanding, imagine we have a smaller DataFrame
nor
with only one contact Jim. After duplication, it will
have two identical rows of Jim. For easier explanation, let’s denote the
original row Jim1, and the duplicated row Jim2. When merging nor with itself, Jim1 can be matched with Jim1 and Jim2, and Jim2 can be matched with Jim1 and Jim2, resulting in 2 \cdot 2 = 4 rows.
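Here's that smaller example as runnable code (the index of names and the phone number column are made up for illustration):

import pandas as pd

# One contact, Jim, duplicated once: two identical rows with the same index.
nor = pd.DataFrame({"Phone": ["555-0100", "555-0100"]}, index=["Jim", "Jim"])

merged = nor.merge(nor, left_index=True, right_index=True)
print(merged.shape[0])  # 4: each Jim row matches both Jim rows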
In this problem, we will be using the following DataFrame
students
, which contains various information about high
school students and the university/universities they applied to.
The columns are:
- 'Name' (str): the name of the student.
- 'High School' (str): the high school that the student attended.
- 'Email' (str): the email of the student.
- 'GPA' (float): the GPA of the student.
- 'AP' (int): the number of AP exams that the student took.
- 'University' (str): the name of the university that the student applied to.
- 'Admit' (str): the acceptance status of the student (where 'Y' denotes that they were accepted to the university and 'N' denotes that they were not).

The rows of students are arranged in no particular order. The first eight rows of students are shown above (though students has many more rows than pictured here).
Fill in the blank so that the result evaluates to a Series indexed by
"Email"
that contains a list of the
universities that each student was admitted to. If a
student wasn’t admitted to any universities, they should have an empty
list.
"Email").apply(_____) students.groupby(
What goes in the blank?
Answer:
lambda df: df.loc[df["Admit"] == "Y", "University"].tolist()
Which of the following blocks of code correctly assign
max_AP
to the maximum number of APs taken by a student who
was rejected by UC San Diego?
Option 1:
= students["Admit"] == "N"
cond1 = students["University"] == "UC San Diego"
cond2 = students.loc[cond1 & cond2, "APs"].sort_values().iloc[-1] max_AP
Option 2:
= students["Admit"] == "N"
cond1 = students["University"] == "UC San Diego"
cond2 = students.groupby(["University", "Admit"]).max().reset_index()
d3 = d3.loc[cond1 & cond2, "APs"].iloc[0] max_AP
Option 3:
= students.pivot_table(index="Admit",
p ="University",
columns="APs",
values="max")
aggfunc= p.loc["N", "UC San Diego"] max_AP
Option 4:
# .last() returns the element at the end of a Series it is called on
groups = students.sort_values(["APs", "Admit"]).groupby("University")
max_AP = groups["APs"].last()["UC San Diego"]
Select all that apply. There is at least one correct option.
Option 1
Option 2
Option 3
Option 4
Answer: Option 1 and Option 3
Option 1 works correctly; it is probably the most straightforward
way of answering the question. cond1
is True
for all rows in which students were rejected, and cond2
is
True
for all rows in which students applied to UCSD. As
such, students.loc[cond1 & cond2]
contains only the
rows where students were rejected from UCSD. Then,
students.loc[cond1 & cond2, "APs"].sort_values()
sorts
by the number of "APs"
taken in increasing order, and
.iloc[-1]
gets the largest number of "APs"
taken.
Option 2 doesn’t work because the lengths of cond1
and cond2
are not the same as the length of
d3
, so this causes an error.
Option 3 works correctly. For each combination of
"Admit"
status ("Y"
, "N"
,
"W"
) and "University"
(including UC San
Diego), it computes the max number of "APs"
. The usage of
.loc["N", "UC San Diego"]
is correct too.
Option 4 doesn’t work. It currently returns the maximum number of
"APs"
taken by someone who applied to UC San Diego; it does
not factor in whether they were admitted, rejected, or
waitlisted.
Currently, students
has a lot of repeated information —
for instance, if a student applied to 10 universities, their GPA appears
10 times in students
.
We want to generate a DataFrame that contains a single row for each
student, indexed by "Email"
, that contains their
"Name"
, "High School"
, "GPA"
, and
"APs"
.
One attempt to create such a DataFrame is below.
"Email").aggregate({"Name": "max",
students.groupby("High School": "mean",
"GPA": "mean",
"APs": "max"})
There is exactly one issue with the line of code above. In one sentence, explain what needs to be changed about the line of code above so that the desired DataFrame is created.
Answer: The problem is that aggregating "High School" with "mean" doesn't work, since you can't aggregate a column of strings using "mean". Changing it to an aggregation that works for strings, like "max" or "min", would fix the issue.
Consider the following snippet of code.
pivoted = students.assign(Admit=students["Admit"] == "Y") \
                  .pivot_table(index="High School",
                               columns="University",
                               values="Admit",
                               aggfunc="sum")
Some of the rows and columns of pivoted
are shown
below.
No students from Warren High were admitted to Columbia or Stanford.
However,
pivoted.loc["Warren High", "Columbia"]
and
pivoted.loc["Warren High", "Stanford"]
evaluate to
different values. What is the reason for this difference?
Some students from Warren High applied to Stanford, and some others applied to Columbia, but none applied to both.
Some students from Warren High applied to Stanford but none applied to Columbia.
Some students from Warren High applied to Columbia but none applied to Stanford.
The students from Warren High that applied to both Columbia and Stanford were all rejected from Stanford, but at least one was admitted to Columbia.
When using pivot_table
, pandas
was not able
to sum strings of the form "Y"
, "N"
, and
"W"
, so the values in pivoted
are
unreliable.
Answer: Option 3
pivoted.loc["Warren High", "Stanford"]
is
NaN
because there were no rows in students
in
which the "High School"
was "Warren High"
and
the "University"
was "Stanford"
, because
nobody from Warren High applied to Stanford. However,
pivoted.loc["Warren High", "Columbia"]
is not
NaN
because there was at least one row in
students
in which the "High School"
was
"Warren High"
and the "University"
was
"Columbia"
. This means that at least one student from
Warren High applied to Columbia.
Option 3 is the only option consistent with this logic.
For each day in May 2022, the DataFrame streams
contains
the number of streams for each of the "Top 200" songs on Spotify that
day — that is, the number of streams for the 200 songs with the most
streams on Spotify that day. The columns in streams
are as
follows:
"date"
: the date the song was streamed
"artist_names"
: name(s) of the artists who created
the song
"track_name"
: name of the song
"streams"
: the number of times the song was streamed
on Spotify that day
The first few rows of streams
are shown below. Since
there were 31 days in May and 200 songs per day, streams
has 6200 rows in total.
Note that:
streams
is already sorted in a very particular way —
it is sorted by "date"
in reverse chronological
(decreasing) order, and, within each "date"
, by
"streams"
in increasing order.
Many songs will appear multiple times in streams
,
because many songs were in the Top 200 on more than one day.
Complete the implementation of the function song_by_day
,
which takes in an integer day
between 1 and 31
corresponding to a day in May, and an integer n
, and
returns the song that had the n
-th most streams on
day
. For instance,
song_by_day(31, 199)
should evaluate to
"pepas"
, because "pepas"
was the 199th most
streamed song on May 31st.
Note: You are not allowed to sort within
song_by_day
— remember, streams
is already
sorted.
def song_by_day(day, n):
    day_str = f"2022-05-{str(day).zfill(2)}"
    day_only = streams[__(a)__].iloc[__(b)__]
    return __(c)__
What goes in each of the blanks?
Answer: a) streams['date'] == day_str
,
b) (200 - n)
, c) day_only['track_name']
The first line in the function gives us an idea that maybe later on
in the function we’re going to filter for all the days that match the
given data. Indeed, in blank a, we filter for all the rows in which the
'date'
column matches day_str
. In blank b, we could directly access the row with the n-th most streams using iloc. (Remember, the image above shows us that within each day, the rows are sorted by "streams" in ascending order, so to find the n-th most streamed song of a day, we simply use position 200 - n.) Finally, to return the track name, we could simply
do day_only['track_name']
.
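Putting the blanks together, the completed function would look like this:

def song_by_day(day, n):
    day_str = f"2022-05-{str(day).zfill(2)}"
    # Within each day, rows are already sorted by "streams" in ascending order,
    # so position 200 - n holds the song with the n-th most streams.
    day_only = streams[streams["date"] == day_str].iloc[200 - n]
    return day_only["track_name"]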
Below, we define a DataFrame pivoted
.
= streams.pivot_table(index="track_name", columns="date",
pivoted ="streams", aggfunc=np.max) values
After defining pivoted
, we define a Series
mystery
below.
mystery = 31 - pivoted.apply(lambda s: s.isna().sum(), axis=1)
mystery.loc["pepas"]
evaluates to 23. In one sentence,
describe the relationship between the number 23 and the song
"pepas"
in the context of the streams
dataset.
For instance, a correctly formatted but incorrect answer is “I listened
to the song "pepas"
23 times today."
Answer: See below.
pivoted.apply(lambda s: s.isna().sum(), axis=1)
computes
the number of days that a song was not on the Top 200, so
31 - pivoted.apply(lambda s: s.isna().sum(), axis=1)
computes the number of days the song was in the Top 200. As such, the
correct interpretation is that "pepas"
was in the
Top 200 for 23 days in May.
In defining pivoted
, we set the keyword argument
aggfunc
to np.max
. Which of the following
functions could we have used instead of np.max
without
changing the values in pivoted
? Select all that
apply.
np.mean
np.median
len
lambda df: df.iloc[0]
None of the above
Answer: np.mean, np.median, and lambda df: df.iloc[0]
For each combination of "track_name"
and
"date"
, there is just a single value — the number of
streams that song received on that date. As such, the
aggfunc
needs to take in a Series containing a single
number and return that same number.
The mean and median of a Series containing a single number is equal to that number, so the first two options are correct.
The length of a Series containing a single number is 1, no matter what that number is, so the third option is not correct.
lambda df: df.iloc[0]
takes in a Series and returns
the first element in the Series, which is the only element in the
Series. This option is correct as well. (The parameter name
df
is irrelevant.)
Below, we define another DataFrame another_mystery
.
= (streams.groupby("date").last()
another_mystery "artist_names", "track_name"])
.groupby([ .count().reset_index())
another_mystery
has 5 rows. In one sentence, describe
the significance of the number 5 in the context of the
streams
dataset. For instance, a correctly formatted but
incorrect answer is "There are 5 unique artists in
streams
." Your answer should not include the word”row".
Answer: See below.
Since streams
is sorted by "date"
in
descending order and, within each "date"
, by
"streams"
in ascending order,
streams.groupby("date").last()
is a DataFrame containing
the song with the most "streams"
on each day in May. In
other words, we found the "top song" for each day. (The DataFrame we
created has 31 rows.)
When we then execute
.groupby(["artist_names", "track_name"]).count()
, we create
one row for every unique combination of song and artist, amongst the
“top songs". (If no two artists have a song with the same name, this is
the same as creating one row for every unique song.) Since there are 5
rows in this new DataFrame (resetting the index doesn’t do anything
here), it means that there were only 5 unique songs that were
ever the “top song" in a day in May; this is the correct
interpretation.
This problem builds on the previous one.
Suppose the DataFrame today
consists of 15 rows — 3 rows
for each of 5 different "artist_names"
. For each artist, it
contains the "track_name"
for their three most-streamed
songs today. For instance, there may be one row for
"olivia rodrigo"
and "favorite crime"
, one row
for "olivia rodrigo"
and "drivers license"
,
and one row for "olivia rodrigo"
and
"deja vu"
.
Another DataFrame, genres
, is shown below in its
entirety.
Suppose we perform an inner merge between
today
and genres
on
"artist_names"
. If the five "artist_names"
in
today
are the same as the five "artist_names"
in genres
, what fraction of the rows in the merged
DataFrame will contain "Pop"
in the "genre"
column? Give your answer as a simplified fraction.
Answer: \frac{2}{5}
If the five "artist_names"
in today
and
genres are the same, the DataFrame that results from an
inner merge will have 15 rows, one for each row in today
.
This is because there are 3 matches for "harry styles"
, 3
matches for "olivia rodrigo"
, 3 matches for
"glass animals"
, and so on.
In the merged DataFrame’s 15 rows, 6 of them will correspond to
"Pop"
artists — 3 to "harry styles"
and 3 to
"olivia rodrigo"
. Thus, the fraction of rows that contain
"Pop"
in the "genre"
column is \frac{6}{15} = \frac{2}{5} (which is the
fraction of rows that contained "Pop"
in
genres["genre"]
, too).
Suppose we perform an inner merge between
today
and genres
on
"artist_names"
. Furthermore, suppose that the only
overlapping "artist_names"
between today
and
genres
are "drake"
and
"olivia rodrigo"
. What fraction of the rows in the merged
DataFrame will contain "Pop"
in the "genre"
column? Give your answer as a simplified fraction.
Answer: \frac{1}{2}
If we perform an inner merge, there will only be 6 rows in the merged
DataFrame — 3 for "olivia rodrigo"
and 3 for
"drake"
. 3 of those 6 rows will have "Pop"
in
the "genre"
column, hence the answer is \frac{3}{6} = \frac{1}{2}.
Suppose we perform an outer merge between
today
and genres
on
"artist_names"
. Furthermore, suppose that the only
overlapping "artist_names"
between today
and
genres
are "drake"
and
"olivia rodrigo"
. What fraction of the rows in the merged
DataFrame will contain "Pop"
in the "genre"
column? Give your answer as a simplified fraction.
Answer: \frac{2}{9}
Since we are performing an outer merge, we can decompose the rows in the merged DataFrame into three groups:
Rows that are in today
that are not in
genres
. There are 9 of these (3 each for the 3 artists that
are in today
and not genres
).
today
doesn’t have a "genre"
column, and so
all of these "genre"
s will be NaN
upon
merging.
Rows that are in genres
that are not in
today
. There are 3 of these — one for
"harry styles"
, one for "glass animals"
, and
one for "doja cat"
. 1 of these 3 have "Pop"
in
the "genre"
column.
Rows that are in both today
and genres
.
There are 6 of these — 3 for "olivia rodrigo"
and 3 for
"drake"
— and 3 of those rows contain "Pop"
in
the "genre"
column.
Tallying things up, we see that there are 9
+ 3 + 6 = 18 rows in the merged DataFrame overall, of which 0 + 1 + 3 = 4 have "Pop"
in the
"genre"
column. Hence, the relevant fraction is \frac{4}{18} = \frac{2}{9}.
Recall the evs
DataFrame.
The first few rows of evs
are shown below (though
remember, evs
has 32 rows total).
Below, we provide a DataFrame that contains the distribution of
“BodyStyle” for all “Brands” in evs
, other than Nissan.
Suppose we’ve run the following few lines of code.
= evs[evs.get("Brand") == "Tesla"]
tesla = evs[evs.get("Brand") == "BMW"]
bmw = evs[evs.get("Brand") == "Audi"]
audi
= tesla.merge(bmw, on="BodyStyle").merge(audi, on="BodyStyle") combo
How many rows does the DataFrame combo
have?
21
24
35
65
72
96
Answer: 35
Let’s attempt this problem step-by-step. We’ll first determine the
number of rows in tesla.merge(bmw, on="BodyStyle")
, and
then determine the number of rows in combo
. For the
purposes of the solution, let’s use temp
to refer to the
first merged DataFrame,
tesla.merge(bmw, on="BodyStyle")
.
Recall, when we merge
two DataFrames, the resulting
DataFrame contains a single row for every match between the two columns,
and rows in either DataFrame without a match disappear. In this problem,
the column that we’re looking for matches in is
"BodyStyle"
.
To determine the number of rows of temp
, we need to
determine which rows of tesla
have a
"BodyStyle"
that matches a row in bmw
. From
the DataFrame provided, we can see that the only
"BodyStyle"
s in both tesla
and
bmw
are SUV and sedan. When we merge tesla
and
bmw
on "BodyStyle"
:
- The 4 SUV rows in tesla each match the 1 SUV row in bmw. This will create 4 SUV rows in temp.
- The 3 sedan rows in tesla each match the 1 sedan row in bmw. This will create 3 sedan rows in temp.

So, temp
is a DataFrame with a total of 7 rows, with 4
rows for SUVs and 3 rows for sedans (in the "BodyStyle"
)
column. Now, when we merge temp
and audi
on
"BodyStyle"
:
- The 4 SUV rows in temp each match the 8 SUV rows in audi. This will create 4 \cdot 8 = 32 SUV rows in combo.
- The 3 sedan rows in temp each match the 1 sedan row in audi. This will create 3 \cdot 1 = 3 sedan rows in combo.

Thus, the total number of rows in combo
is 32 + 3 = 35.
Note: You may notice that 35 is the result of multiplying the entries within the "SUV" column of the DataFrame provided, doing the same for the "Sedan" column, and adding up the two products.
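As a quick check of that note, here's the arithmetic as code (a sketch using the body-style counts from the solution: 4 SUVs and 3 sedans for Tesla, 1 and 1 for BMW, 8 and 1 for Audi):

# Each merge on "BodyStyle" pairs every row of one brand with every matching
# row of the other, so the row counts multiply within each body style.
suv_rows = 4 * 1 * 8    # Tesla SUVs x BMW SUVs x Audi SUVs
sedan_rows = 3 * 1 * 1  # Tesla sedans x BMW sedans x Audi sedans
print(suv_rows + sedan_rows)  # 35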