
The problems in this worksheet are taken from past exams in similar
classes. Work on them **on paper**, since the exams you
take in this course will also be on paper.

We encourage you to complete this worksheet in a live discussion section. Solutions will be made available after all discussion sections have concluded. You don't need to submit your answers anywhere. **Note: We do not plan to cover all problems here in the live discussion section**; the problems we don't cover can be used for extra practice.

The DataFrame `games` contains information about a sample of popular games, including board games, dice games, and card games. The data comes from Board Game Geek, a popular website and vibrant online community for game enthusiasts.

The columns of `games` are as follows.

- `"Name"` (`str`): The name of the game.
- `"Mechanics"` (`str`): A sequence of descriptors for how the game is played. A game can have several descriptors, each of which is separated by a comma.
- `"Domains"` (`str`): A sequence of domains to which the game belongs. A game can belong to multiple domains.
- `"Play Time"` (`int`): The average play time of the game, in minutes, as suggested by the game's creators.
- `"Complexity"` (`float`): The average complexity of the game, on a scale of 1 to 5 points, as reported by Board Game Geek community members.
- `"Rating"` (`str`): The average rating of the game, on a scale of 1 to 10 points, as rated by Board Game Geek community members. Note that while this data should be numerical, it is actually stored as a string, because some entries use a comma in place of a decimal point. For example, 8,79 actually represents the number 8.79.
- `"BGG Rank"` (`int`): The rank of the game in Board Game Geek's database. The formula for how this rank is calculated is not publicly known, but it likely includes many factors, such as `"Rating"`, number of registered owners of the game, and number of reviews.

The first few rows of `games` are shown below (though `games` has many more rows than pictured here).

Assume that we have already run `import pandas as pd` and `import numpy as np`.

Many of the games in the `games` DataFrame belong to more than one domain. We want to identify the number of games that belong to only one domain. Select all of the options below that would correctly calculate the number of games that belong to only one domain.

- `(games["Domains"].str.split(" ").apply(len) == 2).sum()`
- `(games["Domains"].apply(len) == 1).sum()`
- `games[games["Domains"].str.split(",").str.len() == 1].value_counts("Name").sum()`
- `games[games["Domains"].str.split(",").str.len() == 1].shape[0]`

**Answer:** Options 3 and 4

Let's take a closer look at why **Option 3** and **Option 4** are correct.

**Option 3**: Option 3 first queries the `games` DataFrame to only keep games with one value in `"Domains"`. `games["Domains"].str.split(",").str.len() == 1` gets the `"Domains"` column and splits each entry by the comma.

For example, let's say the domain was `"Strategy Games", "Thematic Games"`; then after doing `.str.split(",")` we would have the list `["Strategy Games", "Thematic Games"]`. If the domain was just `"Strategy Games"`, then after splitting we'd have `["Strategy Games"]`.

Any `"Domain"` with at least one comma will turn into a list whose length is at least 2, so the comparison with `.str.len() == 1` will evaluate to `False`. So ultimately, `games[games["Domains"].str.split(",").str.len() == 1]` is a DataFrame with just the rows for games that belong to one `"Domain"`.

The next part, `.value_counts("Name").sum()`, makes a Series with an index of the unique `"Name"`s whose values are the number of times each appears. Finally, summing this resulting Series gives us the number of games that belong to only one `"Domain"`.

**Option 4**: Option 4 starts off exactly like Option 3, but instead of using `value_counts` it gets the number of rows using `.shape[0]`, which gives us the number of games that belong to only one domain.

**Option 1:** Let's step through why Option 1 is incorrect. `(games["Domains"].str.split(" ").apply(len) == 2).sum()` gives you a `Series` of the `"Domains"` column, splits each domain by a space, gets the length of each resulting list, and checks whether that length is equal to 2, which would mean there are two elements in the list. Finally, it sums the resulting booleans. Remember that `True` evaluates to 1 and `False` evaluates to 0, so we are counting the entries that were split into exactly two pieces by the space. This does not tell us the number of games that belong to only one domain.

**Option 2:** Let's step through why Option 2 is also incorrect. `(games["Domains"].apply(len) == 1).sum()` checks whether each element in the `"Domains"` column has length 1. Remember, applying `len()` to a string gives the number of characters in that string, so this is essentially counting the number of `"Domains"` entries that are one character long. Thus, it does not tell us the number of games that belong to only one domain.
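
To convince yourself, here's a minimal sketch on hypothetical data (the tiny `games` DataFrame below is made up for illustration):

```
import pandas as pd

games = pd.DataFrame({
    "Name": ["Catan", "Gloomhaven", "Uno"],
    "Domains": ["Strategy Games",
                "Strategy Games, Thematic Games",
                "Family Games"],
})

# Keep only the rows whose "Domains" entry has no comma,
# i.e. games belonging to exactly one domain.
one_domain = games[games["Domains"].str.split(",").str.len() == 1]

# Option 3: one count per unique name; summing the counts
# counts every remaining row.
print(one_domain.value_counts("Name").sum())  # 2

# Option 4: .shape[0] counts the remaining rows directly.
print(one_domain.shape[0])                    # 2
```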

In September 2020, Governor Gavin Newsom announced that by 2035, all new vehicles sold in California must be zero-emissions vehicles. Electric vehicles (EVs) are among the most popular zero-emissions vehicles (though other examples include plug-in hybrids and hydrogen fuel cell vehicles).

The DataFrame `evs` consists of **32** rows, each of which contains information about a different EV model.

- `"Brand"` (str): The vehicle's manufacturer.
- `"Model"` (str): The vehicle's model name.
- `"BodyStyle"` (str): The vehicle's body style.
- `"Seats"` (int): The vehicle's number of seats.
- `"TopSpeed"` (int): The vehicle's top speed, in kilometers per hour.
- `"Range"` (int): The vehicle's range, or distance it can travel on a single charge, in kilometers.

The first few rows of `evs` are shown below (though remember, `evs` has 32 rows total).

**Throughout this problem, we will refer to `evs` repeatedly.**

Assume that:

- The only four values in the `"Brand"` column are `"Tesla"`, `"BMW"`, `"Audi"`, and `"Nissan"`.
- We have already run `import pandas as pd` and `import numpy as np`.

**Tip:** Open this page in another tab, so that it is easy to refer to this data description as you work through the problem.

Which of the following types of visualizations could be used to visualize the distribution of `"Range"`? Select all that apply.

- Bar chart
- Histogram
- Scatter plot
- Line plot
- Box plot
- None of the above

**Answer:** Histogram or box plot

`"Range"` is a numerical (i.e. quantitative) variable, and we use histograms and box plots to visualize the distribution of numerical features.

- A bar chart couldn't work here. Bar charts can show the distribution of a categorical feature, but `"Range"` is not categorical.
- A scatter plot visualizes the relationship between two numerical features, but we are only dealing with one numerical feature here (`"Range"`).
- Similarly, a line plot visualizes the relationship between two numerical features, but we only have one here.
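
As a quick illustration, a minimal sketch of the two visualizations that do work (assuming the `evs` DataFrame described above and pandas' default matplotlib backend):

```
# A histogram and a box plot of a single numerical column.
evs["Range"].plot(kind="hist", density=True, bins=10)
evs.plot(kind="box", y="Range")
```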

Teslas, on average, tend to have higher `"Range"`s than BMWs. In which of the following visualizations would we be able to see this pattern? Select all that apply.

- A bar chart that shows the distribution of `"Brand"`
- A bar chart that shows the average `"Range"` for each `"Brand"`
- An overlaid histogram showing the distribution of `"Range"` for each `"Brand"`
- A scatter plot with `"TopSpeed"` on the x-axis and `"Range"` on the y-axis

**Answer:**

- A bar chart that shows the average `"Range"` for each `"Brand"`
- An overlaid histogram showing the distribution of `"Range"` for each `"Brand"`

Let’s look at each option more closely.

- **Option 1:** A bar chart showing the distribution of `"Brand"` would only show us how many cars of each `"Brand"` there are. It would not tell us anything about the average `"Range"` of each `"Brand"`.
- **Option 2:** A bar chart showing the average range for each `"Brand"` would help us directly visualize how the average range of each `"Brand"` compares to one another.
- **Option 3:** An overlaid histogram, although perhaps a bit messy, would also give us a general idea of the average range of each `"Brand"` by giving us the distribution of the `"Range"` of each brand. In the scenario mentioned in the question, we'd expect to see that the Tesla distribution is further right than the BMW distribution.
- **Option 4:** A scatter plot of `"TopSpeed"` against `"Range"` would only illustrate the relationship between `"TopSpeed"` and `"Range"`, but would contain no information about the `"Brand"` of each EV.
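
For concreteness, a minimal sketch of Option 2 (assuming the `evs` DataFrame described above):

```
# One bar per "Brand", whose height is that brand's mean "Range".
evs.groupby("Brand")["Range"].mean().plot(kind="bar")
```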

Kyle thinks `"Seats"` is a categorical feature because it
can be used to categorize EVs by size. For instance, EVs with 4 seats
are small, EVs with 5 seats are medium, and EVs with 6 or more seats are
large.

Is Kyle correct?

Yes

No

Justify your answer in **one sentence**. Your answer
must fit in the box below.

**Answer:** No

`"Seats"` is a numerical feature, since it makes sense to do arithmetic with the values. For instance, we can find the average number of `"Seats"` that a group of cars has. Kyle's argument could apply to any numerical feature; just because we can place numerical features into "bins" doesn't make them categorical.

The seat-back TV on one of Biff the Wolverine's more recent flights was very dirty and was full of fingerprints. The fingerprints made an interesting pattern. We've stored the x and y positions of each fingerprint in the DataFrame `fingerprints`, and created the following scatterplot using

`fingerprints.plot(kind='scatter', x='x', y='y')`

True or False: The histograms that result from the following two lines of code will look very similar.

```
# When we say bins=np.arange(0, 8, 2),
# we mean that the bins are [0, 2), [2, 4), [4, 6].
fingerprints.plot(kind='hist',
                  y='x',
                  density=True,
                  bins=np.arange(0, 8, 2))
```

and

```
fingerprints.plot(kind='hist',
                  y='y',
                  density=True,
                  bins=np.arange(0, 8, 2))
```

True

False

**Answer:** True

The only difference between the two code snippets is the data values used. The first creates a histogram of the `x`-values in `fingerprints`, and the second creates a histogram of the `y`-values in `fingerprints`.

Both histograms use the same bins: `bins=np.arange(0, 8, 2)`. This means the bin endpoints are `[0, 2, 4, 6]`, so there are three distinct bins: [0, 2), [2, 4), and [4, 6]. Remember, the right-most bin of a histogram includes both endpoints, whereas the others include only the left endpoint.
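
A quick sketch verifying the bin convention with `np.histogram`, which uses the same rule:

```
import numpy as np

# np.arange(0, 8, 2) stops *before* 8, so the edges are [0, 2, 4, 6],
# giving three bins: [0, 2), [2, 4), and [4, 6].
print(np.arange(0, 8, 2))  # [0 2 4 6]

# Only the last bin includes its right endpoint: 6 lands in [4, 6].
counts, edges = np.histogram([1, 3, 6], bins=np.arange(0, 8, 2))
print(counts)  # [1 1 1]
```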

Let's look at the `x`-values first. If we divide the scatterplot into nine equally-sized regions, as shown below, note that eight of the nine regions have a very similar number of data points.

Aside from the middle region, about \frac{1}{8} of the data falls in each region. That means \frac{3}{8} of the data has an `x`-value in the first bin [0, 2), \frac{2}{8} of the data has an `x`-value in the middle bin [2, 4), and \frac{3}{8} of the data has an `x`-value in the rightmost bin [4, 6]. This distribution of `x`-values into bins determines what the histogram will look like.

Now, if we look at the `y`-values, we'll find that \frac{3}{8} of the data has a `y`-value in the first bin [0, 2), \frac{2}{8} of the data has a `y`-value in the middle bin [2, 4), and \frac{3}{8} of the data has a `y`-value in the last bin [4, 6]. That's the same distribution of data into bins as the `x`-values had, so the histogram of `y`-values will look just like the histogram of `x`-values.

Alternatively, an easy way to see this is to use the fact that the scatterplot is symmetric over the line y=x, the line through the origin that makes a 45 degree angle with the x-axis. In other words, interchanging the `x` and `y` values doesn't change the scatterplot noticeably, so the `x` and `y` values have very similar distributions, and their histograms will be very similar as a result.

Below, we've drawn a histogram using the line of code

```
fingerprints.plot(kind='hist',
                  y='x',
                  density=True,
                  bins=np.arange(0, 8, 2))
```

However, our Jupyter Notebook was corrupted, and so the resulting histogram doesn’t quite look right. While the height of the first bar is correct, the histogram doesn’t contain the second or third bars, and the y-axis is replaced with letters.

Which of the four options on the y-axis is closest to where the height of the middle bar should be?

A

B

C

D

Which of the four options on the y-axis is closest to where the height of the rightmost bar should be?

A

B

C

D

**Answer:** B, then C

We've already determined that the first bin should contain \frac{3}{8} of the values, the middle bin should contain \frac{2}{8} of the values, and the rightmost bin should contain \frac{3}{8} of the values. The middle bar of the histogram should therefore be two-thirds as tall as the first bar, and the rightmost bar should be equally as tall as the first bar. The only reasonable height for the middle bar is B, as it's closest to two-thirds of the height of the first bar. Similarly, the rightmost bar must be at height C, as it's the only one close to the height of the first bar.

Pranavi proposes the following imputation scheme.

```
def impute(s):
    return s.fillna(np.random.choice(s[s.notna()]))
```

**True or False:** `impute` performs probabilistic imputation, using the same definition of probabilistic imputation we learned about in class.
True

False

**Answer:** False

In `impute`, `np.random.choice` will return a single non-null value from `s`, and `.fillna()` will fill every null value with this single value. Meanwhile, probabilistic imputation draws a different value from a specified distribution to fill each missing value, making it such that there won't be a single "spike" in the imputed distribution at a single chosen value.
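
Here's a minimal sketch contrasting the two schemes (the `probabilistic_impute` helper is hypothetical, written just for this comparison):

```
import numpy as np
import pandas as pd

def impute(s):
    # Draws ONE observed value and reuses it for every missing entry.
    return s.fillna(np.random.choice(s[s.notna()]))

def probabilistic_impute(s):
    # Draws a FRESH observed value for each missing entry.
    s = s.copy()
    s[s.isna()] = np.random.choice(s[s.notna()], size=s.isna().sum())
    return s

vals = pd.Series([2.0, np.nan, 7.0, np.nan, 3.0])
print(impute(vals))                # both NaNs get the same fill value
print(probabilistic_impute(vals))  # each NaN gets its own draw
```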

Consider the following expressions and values.

```
>>> vals.isna().mean()
0.2
>>> vals.describe().loc[["min", "mean", "max"]]
min 2.0
mean 4.0
max 7.0
dtype: float64
```

Given the above, what is the **maximum** possible value of `impute(vals).mean()`? Give your answer as a number rounded to one decimal place.

**Answer:** 4.6

The maximum possible value of `impute(vals).mean()` would occur when every single missing value in `vals` is filled in with the highest possible non-null value in `vals`. (As discussed in the previous solution, `impute` selects only one value from `s` to fill into every missing space.)

If this occurs, then the mean of the imputed Series will be the weighted mean of the available data and the filled data; given the numbers in the question, this is 0.8 \cdot 4 + 0.2 \cdot 7, or 4.6.
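
A quick arithmetic check:

```
# 80% of the data keeps its mean of 4; the missing 20% is filled
# entirely with the max, 7.
print(0.8 * 4 + 0.2 * 7)  # 4.6
```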

Which of the following statements will always evaluate to `True`?

- `vals.std() < impute(vals).std()`
- `vals.std() <= impute(vals).std()`
- `vals.std() == impute(vals).std()`
- `vals.std() >= impute(vals).std()`
- `vals.std() > impute(vals).std()`
- None of the above

**Answer:** None of the above

Since the value which `impute` chooses to impute with is random, the effect that it has on the standard deviation of `vals` is unknown. If the missing values are filled with a value close to the mean, this could reduce the standard deviation; if they are filled with a value far from the mean, this could increase the standard deviation. (Of course, the imputation will also shift the mean, so without knowing details of the Series, it's impossible to come up with thresholds.) In any case, since the value for imputation is chosen at random, none of these statements will *always* be true, and so the correct answer is "none of the above."

Suppose we have a DataFrame, `dogs`, in which each row corresponds to a different dog owned by someone in Ann Arbor. There are several columns in the DataFrame, including `"birth_year"` (the dog's year of birth) and `"Breed"` (the dog's breed).

Using the `"birth_year"` column in `dogs`, Tahseen computes the age of each dog and stores it in a new `"age"` column of `dogs`. He also categorizes dogs into three "breed classes" based on their breed – `"Retriever"`, `"Terrier"`, or `"Other"` – and stores these in a new `"class"` column of `dogs`.

The following bar chart shows the distribution of breed classes by missingness of age:

The following bar chart shows the mean observed age for each breed class:

Let O be the mean of the observed ages in `dogs`, that is, the mean of the ages that aren't missing. Let T be the true mean of the ages in `dogs`, that is, the mean we'd compute if none of the ages were missing.

What is the relationship between O and T?

- O < T
- O \approx T
- O = T
- O > T

**Answer**: O < T

In the ages we get to observe (that is, when age is not missing), we have way fewer terrier values than in the ages we don’t get to observe. Terriers are older on average than the other breed classes. This means we’re missing values that are larger, so when we take the average of the values we have access to, it’ll be lower than the true mean.

Consider the following strategy for imputing missing ages.

```
def fill_missing(dogs):
    return dogs.assign(
        age=dogs.groupby("class")["age"]
                .transform(lambda x: x.fillna(x.value_counts().idxmax()))
    )
```

In one sentence, describe this imputation strategy.

**Answer**: This is filling in missing ages in each
breed class with the most common observed age in that breed class.
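
A minimal sketch on made-up data, to see the strategy in action:

```
import pandas as pd

dogs = pd.DataFrame({
    "class": ["Terrier", "Terrier", "Terrier", "Other", "Other"],
    "age":   [12.0,      12.0,      None,      5.0,     None],
})

# Within each class, fill NaN ages with that class's most common
# observed age (value_counts().idxmax() is the mode).
filled = dogs.assign(
    age=dogs.groupby("class")["age"]
            .transform(lambda x: x.fillna(x.value_counts().idxmax()))
)
print(filled)  # Terrier NaN -> 12.0, Other NaN -> 5.0
```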

`small_dogs` is a subset of `dogs`, shown in its entirety below.

If we call `fill_missing` on `small_dogs`, what values would the four null values in the `"age"` column of `small_dogs` be replaced with?

- row 2: ___
- row 3: ___
- row 4: ___
- row 7: ___

**Answer**:

- row 2: 5. The most common observed age for the `"Other"` breed class is 5.
- row 3: 5. The most common observed age for the `"Other"` breed class is 5.
- row 4: 12. The most common observed age for the `"Terrier"` breed class is 12.
- row 7: 7. The most common observed age for the `"Retriever"` breed class is 7.

The EECS 398 staff are looking into hotels — some in San Diego, for their family to stay at for graduation (and to eat Mexican food), and some elsewhere, for summer trips.

Each row of `hotels` contains information about a different hotel in San Diego. Specifically, for each hotel, we have:

- `"Hotel Name"` (str): The name of the hotel. **Assume hotel names are unique.**
- `"Location"` (str): The hotel's neighborhood in San Diego.
- `"Chain"` (str): The chain the hotel is a part of; either `"Hilton"`, `"Marriott"`, `"Hyatt"`, or `"Other"`. A hotel chain is a group of hotels owned or operated by a shared company.
- `"Number of Rooms"` (int): The number of rooms the hotel has.

The first few rows of `hotels` are shown below, but `hotels` has many more rows than are shown.

Now, consider the variable `summed`, defined below.

`summed = hotels.groupby("Chain")["Number of Rooms"].sum().idxmax()`

What is `type(summed)`?

- `int`
- `str`
- `Series`
- `DataFrame`
- `DataFrameGroupBy`

**Answer:** `str`

When we do a `groupby` on the `Chain` column in `hotels`, the values in the `Chain` column become the index of the DataFrame or Series we get as output — in this case, the Series `hotels.groupby("Chain")["Number of Rooms"].sum()`.

Since the values of `Chain` are strings, and since `.idxmax()` will return a value from the index of the aforementioned Series, `summed` is a string.

In one sentence, explain what the value of `summed` means. Phrase your explanation as if you had to give it to someone who is not a data science major; that is, don't say something like "it is the result of grouping `hotels` by `"Chain"`, selecting the `"Number of Rooms"` column, …", but instead, give the value context.

**Answer:** `summed` is the name of the hotel chain with the most total rooms.

The result of the `.groupby()` and `.sum()` is a Series indexed by the unique `"Chain"`s, whose values are the total number of rooms in hotels owned by each chain. The `idxmax()` function gets the index corresponding to the largest value in the Series, which will be the name of the hotel chain with the most total rooms.
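
A minimal sketch on made-up data, showing why the result is a string:

```
import pandas as pd

hotels = pd.DataFrame({
    "Chain":           ["Hilton", "Hilton", "Marriott"],
    "Number of Rooms": [200,      150,      300],
})

# The group labels ("Hilton", "Marriott") become the index...
rooms_per_chain = hotels.groupby("Chain")["Number of Rooms"].sum()

# ...so idxmax() returns one of those labels, a str.
print(rooms_per_chain.idxmax())  # 'Hilton' (350 total rooms)
```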

Consider the variable `curious`, defined below.

`curious = frame["Chain"].value_counts().idxmax()`

Fill in the blank: `curious` is guaranteed to be equal to `summed` only if `frame` has one row for every ____ in San Diego.

- hotel
- hotel chain
- hotel room
- neighborhood

**Answer:** hotel room

`curious` gets the most common value of `Chain` in the DataFrame `frame`. We already know that `summed` is the hotel chain with the most rooms in San Diego, so `curious` only equals `summed` if the most common `Chain` in `frame` is the hotel chain with the most total rooms; this occurs when each row of `frame` is a single hotel room.

Fill in the blanks so that `popular_areas` is an array of the names of the unique neighborhoods that have at least 5 hotels and at least 1000 hotel rooms.

```
f = lambda df: __(i)__
popular_areas = hotels.groupby(__(ii)__).__(iii)__.__(iv)__
```

What goes in blank (i)?

What goes in blank (ii)?

- `"Hotel Name"`
- `"Location"`
- `"Chain"`
- `"Number of Rooms"`

What goes in blank (iii)?

- `agg(f)`
- `filter(f)`
- `transform(f)`

What goes in blank (iv)?

**Answers**:

- (i): `df.shape[0] >= 5 and df["Number of Rooms"].sum() >= 1000`
- (ii): `"Location"`
- (iii): `filter(f)`
- (iv): `["Location"].unique()` or equivalent

We'd like to only consider certain neighborhoods according to group characteristics (having at least 5 hotels and at least 1000 hotel rooms), and `.filter()` allows us to do that by excluding groups not meeting those criteria. So, we can write a function that evaluates those criteria on one group at a time (the `df` passed to `f` is the subset of `hotels` containing just one `Location` value), and calling `filter(f)` means that the only remaining rows are hotels in neighborhoods that match those criteria. Finally, all we have to do is get the unique neighborhoods from this DataFrame, which are the neighborhoods for which `f` returned `True`.
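
Putting the blanks together, a sketch of the completed code (assuming the `hotels` DataFrame described above):

```
# Keep only groups (neighborhoods) with >= 5 hotels and >= 1000 rooms,
# then collect the unique neighborhood names that survived.
f = lambda df: df.shape[0] >= 5 and df["Number of Rooms"].sum() >= 1000
popular_areas = hotels.groupby("Location").filter(f)["Location"].unique()
```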

Consider the code below.

```
cond1 = hotels["Chain"] == "Marriott"
cond2 = hotels["Location"] == "Coronado"
combined = hotels[cond1].merge(hotels[cond2], on="Hotel Name", how=???)
```

If we replace `???` with `"inner"` in the code above, which of the following will be equal to `combined.shape[0]`?

- `min(cond1.sum(), cond2.sum())`
- `(cond1 & cond2).sum()`
- `cond1.sum() + cond2.sum()`
- `cond1.sum() + cond2.sum() - (cond1 & cond2).sum()`
- `cond1.sum() + (cond1 & cond2).sum()`

If we replace `???` with `"outer"` in the code above, which of the following will be equal to `combined.shape[0]`?

- `min(cond1.sum(), cond2.sum())`
- `(cond1 & cond2).sum()`
- `cond1.sum() + cond2.sum()`
- `cond1.sum() + cond2.sum() - (cond1 & cond2).sum()`
- `cond1.sum() + (cond1 & cond2).sum()`

**Answers**:

- Inner: `(cond1 & cond2).sum()`
- Outer: `cond1.sum() + cond2.sum() - (cond1 & cond2).sum()`

Note that `cond1` and `cond2` are boolean Series, and `hotels[cond1]` and `hotels[cond2]` are the subsets of `hotels` where `Chain == "Marriott"` and `Location == "Coronado"`, respectively.

- When we perform an inner merge, we're selecting every row where a `Hotel Name` appears in *both* `hotels[cond1]` and `hotels[cond2]`. This is the same set of indices (and therefore hotel names, since those are unique) as where `(cond1 & cond2) == True`. So, the length of `combined` will be the same as the number of `True`s in `(cond1 & cond2)`.
- When we perform an outer merge, we're selecting every row that appears in *either* DataFrame, although there will not be repeats for hotels that are both Marriott properties and are in Coronado. So, to find the total number of rows in either DataFrame, we take the sum of the sizes of each, and subtract rows that appear in both, which corresponds to the answer `cond1.sum() + cond2.sum() - (cond1 & cond2).sum()`.
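
A minimal sketch on made-up data, verifying both counting formulas:

```
import pandas as pd

# Hotel A is a Marriott in Coronado (in both subsets); B is only a
# Marriott; C is only in Coronado.
hotels = pd.DataFrame({
    "Hotel Name": ["A", "B", "C"],
    "Chain":      ["Marriott", "Marriott", "Hilton"],
    "Location":   ["Coronado", "Downtown", "Coronado"],
})
cond1 = hotels["Chain"] == "Marriott"
cond2 = hotels["Location"] == "Coronado"

inner = hotels[cond1].merge(hotels[cond2], on="Hotel Name", how="inner")
outer = hotels[cond1].merge(hotels[cond2], on="Hotel Name", how="outer")

print(inner.shape[0], (cond1 & cond2).sum())              # 1 1
print(outer.shape[0],
      cond1.sum() + cond2.sum() - (cond1 & cond2).sum())  # 3 3
```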

Let’s consider admissions at UC San Diego and UC Santa Barbara for two high schools in particular.

For instance, the above table tells us that 200 students from La Jolla Private applied to UC San Diego, and 50 were admitted.

What is the largest possible integer value of N such that:

- UC Santa Barbara has a strictly higher admit rate for **both** La Jolla Private and Sun God Memorial High individually, but
- UC San Diego has a strictly higher admit rate overall?

**Answer:** 124

Let’s consider the two conditions separately.

First, UC Santa Barbara needs to have a higher admit rate for both high schools. This is already true for La Jolla Private (\frac{100}{300} > \frac{50}{200}); for Sun God Memorial High, we just need to ensure that \frac{N}{150} > \frac{200}{300}. This means that N > 100.

Now, UC San Diego needs to have a higher admit rate overall. The UC San Diego admit rate is \frac{50+200}{200+300} = \frac{250}{500} = \frac{1}{2}, while the UC Santa Barbara admit rate is \frac{100 + N}{450}. This means that we must require that \frac{1}{2} = \frac{225}{450} > \frac{100+N}{450}. This means that 225 > 100 + N, i.e. that N < 125.

So there are two conditions on N: N > 100 and N < 125. The largest integer N that satisfies these conditions is N=124, which is the final answer.

Suppose Charlie and Norah each have separate DataFrames for their contacts, called `charlie` and `norah`, respectively. These DataFrames have the same column names and format as your DataFrame, `contacts`.

As illustrated in the diagram below, Charlie has 172 contacts in total, whereas Norah has 88 contacts. 12 of these contacts are shared, meaning they appear in both `charlie` and `norah`.

What does the following expression evaluate to?

`charlie.merge(norah, left_index=True, right_index=True).shape[0]`

**Answer:** 12

The code merges DataFrames `charlie` and `norah` on their indexes, so the resulting DataFrame will contain one row for every match between their indexes ('Person', since they follow the same format as the DataFrame `contacts`). From the Venn diagram, we know that Charlie and Norah have 12 contacts in common, so the resulting DataFrame will contain 12 rows: one row for each shared contact.

Thus, `charlie.merge(norah, left_index=True, right_index=True).shape[0]` returns the number of rows of the resulting DataFrame, which is 12.

One day, when updating her phone's operating system, Norah accidentally duplicates the 12 contacts she has in common with Charlie. Now, the `norah` DataFrame has 100 rows.

What does the following expression evaluate to?

`norah.merge(norah, left_index=True, right_index=True).shape[0]`

**Answer:** 24 \cdot 2 + 76 = 124

Since Norah duplicated 12 contacts, the `norah` DataFrame now has 76 unique rows plus 24 rows that come in identical pairs (the 12 originals and their 12 duplicates). Note that the above code merges `norah` with itself on indexes.

After merging, the resulting DataFrame will contain 76 unique rows, as there is only one match for each unique row. As for the duplicated rows, each row can match twice, and there are 24 such rows. Thus the resulting DataFrame has 76 + 2 \cdot 24 = 124 rows.

For better understanding, imagine we have a smaller DataFrame `nor` with only one contact, Jim. After duplication, it will have two identical rows of Jim. For easier explanation, let's denote the original row Jim1 and the duplicated row Jim2. When merging `nor` with itself, Jim1 can be matched with Jim1 and Jim2, and Jim2 can be matched with Jim1 and Jim2, resulting in 2 \cdot 2 = 4 rows.
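
The same toy example, made runnable (the phone numbers are made up):

```
import pandas as pd

nor = pd.DataFrame({"Phone": ["555-0100", "555-0100"]},
                   index=["Jim", "Jim"])

# Each Jim row matches both Jim rows, so 2 * 2 = 4 rows result.
print(nor.merge(nor, left_index=True, right_index=True).shape[0])  # 4
```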

In this problem, we will be using the following DataFrame `students`, which contains various information about high school students and the university/universities they applied to.

The columns are:

- `'Name'` (str): the name of the student.
- `'High School'` (str): the high school that the student attended.
- `'Email'` (str): the email of the student.
- `'GPA'` (float): the GPA of the student.
- `'AP'` (int): the number of AP exams that the student took.
- `'University'` (str): the name of the university that the student applied to.
- `'Admit'` (str): the acceptance status of the student (where `'Y'` denotes that they were accepted to the university and `'N'` denotes that they were not).

The rows of `students` are arranged in no particular order. The first eight rows of `students` are shown above (though `students` has many more rows than pictured here).

Fill in the blank so that the result evaluates to a Series indexed by `"Email"` that contains a **list** of the universities that each student **was admitted to**. If a student wasn't admitted to any universities, they should have an empty list.

`students.groupby("Email").apply(_____)`

What goes in the blank?

**Answer:** `lambda df: df.loc[df["Admit"] == "Y", "University"].tolist()`
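
A minimal sketch on made-up data showing why this works, including the empty-list case:

```
import pandas as pd

students = pd.DataFrame({
    "Email":      ["a@x.com", "a@x.com", "b@x.com"],
    "University": ["UCSD",    "UCLA",    "UCSB"],
    "Admit":      ["Y",       "N",       "N"],
})

# For each email, keep only the admitted rows and collect the
# universities into a list; students with no "Y" rows get [].
admitted = students.groupby("Email").apply(
    lambda df: df.loc[df["Admit"] == "Y", "University"].tolist()
)
print(admitted)
# a@x.com    [UCSD]
# b@x.com        []
```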

Which of the following blocks of code correctly assign `max_AP` to the maximum number of APs taken by a student who was rejected by UC San Diego?

Option 1:

```
cond1 = students["Admit"] == "N"
cond2 = students["University"] == "UC San Diego"
max_AP = students.loc[cond1 & cond2, "APs"].sort_values().iloc[-1]
```

Option 2:

```
cond1 = students["Admit"] == "N"
cond2 = students["University"] == "UC San Diego"
d3 = students.groupby(["University", "Admit"]).max().reset_index()
max_AP = d3.loc[cond1 & cond2, "APs"].iloc[0]
```

Option 3:

```
p = students.pivot_table(index="Admit",
                         columns="University",
                         values="APs",
                         aggfunc="max")
max_AP = p.loc["N", "UC San Diego"]
```

Option 4:

```
# .last() returns the element at the end of a Series it is called on
groups = students.sort_values(["APs", "Admit"]).groupby("University")
max_AP = groups["APs"].last()["UC San Diego"]
```

**Select all that apply.** There is at least one correct option.

- Option 1
- Option 2
- Option 3
- Option 4

**Answer:** Option 1 and Option 3

- Option 1 works correctly; it is probably the most straightforward way of answering the question. `cond1` is `True` for all rows in which students were rejected, and `cond2` is `True` for all rows in which students applied to UCSD. As such, `students.loc[cond1 & cond2]` contains only the rows where students were rejected from UCSD. Then, `students.loc[cond1 & cond2, "APs"].sort_values()` sorts by the number of `"APs"` taken in increasing order, and `.iloc[-1]` gets the largest number of `"APs"` taken.
- Option 2 doesn't work because the lengths of `cond1` and `cond2` are not the same as the length of `d3`, so this causes an error.
- Option 3 works correctly. For each combination of `"Admit"` status (`"Y"`, `"N"`, `"W"`) and `"University"` (including UC San Diego), it computes the max number of `"APs"`. The usage of `.loc["N", "UC San Diego"]` is correct too.
- Option 4 doesn't work. It currently returns the maximum number of `"APs"` taken by someone who applied to UC San Diego; it does not factor in whether they were admitted, rejected, or waitlisted.

Currently, `students` has a lot of repeated information — for instance, if a student applied to 10 universities, their GPA appears 10 times in `students`.

We want to generate a DataFrame that contains a single row for each student, indexed by `"Email"`, that contains their `"Name"`, `"High School"`, `"GPA"`, and `"APs"`.

One attempt to create such a DataFrame is below.

```
students.groupby("Email").aggregate({"Name": "max",
                                     "High School": "mean",
                                     "GPA": "mean",
                                     "APs": "max"})
```

There is exactly one issue with the line of code above. **In
one sentence**, explain what needs to be changed about the line
of code above so that the desired DataFrame is created.

**Answer:** The problem right now is that aggregating `"High School"` by mean doesn't work, since you can't aggregate a column of strings using `"mean"`. Changing it to something that works for strings, like `"max"` or `"min"`, would fix the issue.

Consider the following snippet of code.

```
pivoted = students.assign(Admit=students["Admit"] == "Y") \
                  .pivot_table(index="High School",
                               columns="University",
                               values="Admit",
                               aggfunc="sum")
```
```

Some of the rows and columns of `pivoted` are shown below.

No students from Warren High were admitted to Columbia or Stanford. However, `pivoted.loc["Warren High", "Columbia"]` and `pivoted.loc["Warren High", "Stanford"]` evaluate to different values. What is the reason for this difference?

- Some students from Warren High applied to Stanford, and some others applied to Columbia, but none applied to both.
- Some students from Warren High applied to Stanford but none applied to Columbia.
- Some students from Warren High applied to Columbia but none applied to Stanford.
- The students from Warren High that applied to both Columbia and Stanford were all rejected from Stanford, but at least one was admitted to Columbia.
- When using `pivot_table`, `pandas` was not able to sum strings of the form `"Y"`, `"N"`, and `"W"`, so the values in `pivoted` are unreliable.

**Answer:** Option 3

`pivoted.loc["Warren High", "Stanford"]` is `NaN` because there were no rows in `students` in which the `"High School"` was `"Warren High"` and the `"University"` was `"Stanford"`, because nobody from Warren High applied to Stanford. However, `pivoted.loc["Warren High", "Columbia"]` is not `NaN` because there was at least one row in `students` in which the `"High School"` was `"Warren High"` and the `"University"` was `"Columbia"`. This means that at least one student from Warren High applied to Columbia.

Option 3 is the only option consistent with this logic.

For each day in May 2022, the DataFrame `streams` contains the number of streams for each of the "Top 200" songs on Spotify that day — that is, the number of streams for the 200 songs with the most streams on Spotify that day. The columns in `streams` are as follows:

- `"date"`: the date the song was streamed
- `"artist_names"`: name(s) of the artists who created the song
- `"track_name"`: name of the song
- `"streams"`: the number of times the song was streamed on Spotify that day

The first few rows of `streams` are shown below. Since there were 31 days in May and 200 songs per day, `streams` has 6200 rows in total.

Note that:

- `streams` is already sorted in a very particular way — it is sorted by `"date"` in reverse chronological (decreasing) order, and, within each `"date"`, by `"streams"` in increasing order.
- Many songs will appear multiple times in `streams`, because many songs were in the Top 200 on more than one day.

Complete the implementation of the function `song_by_day`, which takes in an integer `day` between 1 and 31 corresponding to a day in May, and an integer `n`, and returns the song that had the `n`-th most streams on `day`. For instance, `song_by_day(31, 199)` should evaluate to `"pepas"`, because `"pepas"` was the 199th most streamed song on May 31st.

**Note:** You are not allowed to sort within `song_by_day` — remember, `streams` is already sorted.

```
def song_by_day(day, n):
    day_str = f"2022-05-{str(day).zfill(2)}"
    day_only = streams[__(a)__].iloc[__(b)__]
    return __(c)__
```

What goes in each of the blanks?

**Answer:** (a) `streams['date'] == day_str`, (b) `200 - n`, (c) `day_only['track_name']`

The first line in the function hints that later on in the function we're going to filter for all the rows that match the given date. Indeed, in blank (a), we filter for all the rows in which the `'date'` column matches `day_str`. In blank (b), we can directly access the row with the `n`-th most streams using `iloc`. (Remember, the image above shows us that within each day, streams are sorted in ascending order, so to find the `n`-th most popular song of a day, we simply use position `200 - n`.) Finally, to return the track name, we can simply write `day_only['track_name']`.
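
For reference, the completed function, assembled from the blanks above:

```
def song_by_day(day, n):
    day_str = f"2022-05-{str(day).zfill(2)}"
    # Within each date, rows are sorted by streams in ascending order,
    # so the n-th most streamed song sits at position 200 - n.
    day_only = streams[streams['date'] == day_str].iloc[200 - n]
    return day_only['track_name']
```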

Below, we define a DataFrame `pivoted`.

```
pivoted = streams.pivot_table(index="track_name", columns="date",
                              values="streams", aggfunc=np.max)
```

After defining `pivoted`, we define a Series `mystery` below.

`mystery = 31 - pivoted.apply(lambda s: s.isna().sum(), axis=1)`

`mystery.loc["pepas"]` evaluates to 23. In one sentence, describe the relationship between the number 23 and the song `"pepas"` in the context of the `streams` dataset. For instance, a correctly formatted but incorrect answer is "I listened to the song `"pepas"` 23 times today."

**Answer:** See below.

`pivoted.apply(lambda s: s.isna().sum(), axis=1)` computes the number of days that a song was not in the Top 200, so `31 - pivoted.apply(lambda s: s.isna().sum(), axis=1)` computes the number of days the song was in the Top 200. As such, the correct interpretation is that **"pepas" was in the Top 200 for 23 days in May**.
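
As an aside, an equivalent (and arguably more direct) way to compute the same Series:

```
# Count the non-null entries in each row, i.e. the number of days
# each song appeared in the Top 200.
mystery = pivoted.notna().sum(axis=1)
```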

In defining `pivoted`, we set the keyword argument `aggfunc` to `np.max`. Which of the following functions could we have used instead of `np.max` without changing the values in `pivoted`? **Select all that apply.**

- `np.mean`
- `np.median`
- `len`
- `lambda df: df.iloc[0]`
- None of the above

**Answer:** `np.mean`, `np.median`, and `lambda df: df.iloc[0]`

For each combination of `"track_name"` and `"date"`, there is just a single value — the number of streams that song received on that date. As such, the `aggfunc` needs to take in a Series containing a single number and return that same number.

The mean and median of a Series containing a single number is equal to that number, so the first two options are correct.

The length of a Series containing a single number is 1, no matter what that number is, so the third option is not correct.

`lambda df: df.iloc[0]` takes in a Series and returns the first element in the Series, which is the only element in the Series. This option is correct as well. (The parameter name `df` is irrelevant.)

Below, we define another DataFrame `another_mystery`.

```
another_mystery = (streams.groupby("date").last()
                          .groupby(["artist_names", "track_name"])
                          .count().reset_index())
```

`another_mystery` has 5 rows. In one sentence, describe the significance of the number 5 in the context of the `streams` dataset. For instance, a correctly formatted but incorrect answer is "There are 5 unique artists in `streams`." Your answer should not include the word "row".

**Answer:** See below.

Since `streams` is sorted by `"date"` in descending order and, within each `"date"`, by `"streams"` in ascending order, `streams.groupby("date").last()` is a DataFrame containing the song with the most `"streams"` on each day in May. In other words, we found the "top song" for each day. (The DataFrame we created has 31 rows.)

When we then execute `.groupby(["artist_names", "track_name"]).count()`, we create one row for every unique combination of song and artist, amongst the "top songs". (If no two artists have a song with the same name, this is the same as creating one row for every unique song.) Since there are 5 rows in this new DataFrame (resetting the index doesn't do anything here), it means that **there were only 5 unique songs that were ever the "top song" on a day in May**; this is the correct interpretation.

This problem builds on the previous one.

Suppose the DataFrame `today` consists of 15 rows — 3 rows for each of 5 different `"artist_names"`. For each artist, it contains the `"track_name"` for their three most-streamed songs today. For instance, there may be one row for `"olivia rodrigo"` and `"favorite crime"`, one row for `"olivia rodrigo"` and `"drivers license"`, and one row for `"olivia rodrigo"` and `"deja vu"`.

Another DataFrame, `genres`, is shown below in its entirety.

Suppose we perform an **inner** merge between `today` and `genres` on `"artist_names"`. If the five `"artist_names"` in `today` are the same as the five `"artist_names"` in `genres`, what fraction of the rows in the merged DataFrame will contain `"Pop"` in the `"genre"` column? Give your answer as a simplified fraction.

**Answer:** \frac{2}{5}

If the five `"artist_names"` in `today` and `genres` are the same, the DataFrame that results from an inner merge will have 15 rows, one for each row in `today`. This is because there are 3 matches for `"harry styles"`, 3 matches for `"olivia rodrigo"`, 3 matches for `"glass animals"`, and so on.

In the merged DataFrame's 15 rows, 6 of them will correspond to `"Pop"` artists — 3 to `"harry styles"` and 3 to `"olivia rodrigo"`. Thus, the fraction of rows that contain `"Pop"` in the `"genre"` column is \frac{6}{15} = \frac{2}{5} (which is the fraction of rows that contained `"Pop"` in `genres["genre"]`, too).

Suppose we perform an **inner** merge between `today` and `genres` on `"artist_names"`. Furthermore, suppose that the only overlapping `"artist_names"` between `today` and `genres` are `"drake"` and `"olivia rodrigo"`. What fraction of the rows in the merged DataFrame will contain `"Pop"` in the `"genre"` column? Give your answer as a simplified fraction.

**Answer:** \frac{1}{2}

If we perform an inner merge, there will only be 6 rows in the merged DataFrame — 3 for `"olivia rodrigo"` and 3 for `"drake"`. 3 of those 6 rows will have `"Pop"` in the `"genre"` column, hence the answer is \frac{3}{6} = \frac{1}{2}.

Suppose we perform an **outer** merge between `today` and `genres` on `"artist_names"`. Furthermore, suppose that the only overlapping `"artist_names"` between `today` and `genres` are `"drake"` and `"olivia rodrigo"`. What fraction of the rows in the merged DataFrame will contain `"Pop"` in the `"genre"` column? Give your answer as a simplified fraction.

**Answer:** \frac{2}{9}

Since we are performing an outer merge, we can decompose the rows in the merged DataFrame into three groups:

- Rows that are in `today` but not in `genres`. There are 9 of these (3 each for the 3 artists that are in `today` and not `genres`). `today` doesn't have a `"genre"` column, and so all of these `"genre"`s will be `NaN` upon merging.
- Rows that are in `genres` but not in `today`. There are 3 of these — one for `"harry styles"`, one for `"glass animals"`, and one for `"doja cat"`. 1 of these 3 has `"Pop"` in the `"genre"` column.
- Rows that are in both `today` and `genres`. There are 6 of these — 3 for `"olivia rodrigo"` and 3 for `"drake"` — and 3 of those rows contain `"Pop"` in the `"genre"` column.

Tallying things up, we see that there are 9 + 3 + 6 = 18 rows in the merged DataFrame overall, of which 0 + 1 + 3 = 4 have `"Pop"` in the `"genre"` column. Hence, the relevant fraction is \frac{4}{18} = \frac{2}{9}.

Recall the `evs` DataFrame. The first few rows of `evs` are shown below (though remember, `evs` has 32 rows total).

Below, we provide a DataFrame that contains the distribution of `"BodyStyle"` for all `"Brand"`s in `evs`, other than Nissan.

Suppose we’ve run the following few lines of code.

```
tesla = evs[evs.get("Brand") == "Tesla"]
bmw = evs[evs.get("Brand") == "BMW"]
audi = evs[evs.get("Brand") == "Audi"]

combo = tesla.merge(bmw, on="BodyStyle").merge(audi, on="BodyStyle")
```

How many rows does the DataFrame `combo` have?

- 21
- 24
- 35
- 65
- 72
- 96

**Answer:** 35

Let's attempt this problem step-by-step. We'll first determine the number of rows in `tesla.merge(bmw, on="BodyStyle")`, and then determine the number of rows in `combo`. **For the purposes of the solution, let's use `temp` to refer to the first merged DataFrame, `tesla.merge(bmw, on="BodyStyle")`.**

Recall, when we `merge` two DataFrames, the resulting DataFrame contains a single row for every match between the two columns, and rows in either DataFrame without a match disappear. In this problem, the column that we're looking for matches in is `"BodyStyle"`.

To determine the number of rows of `temp`, we need to determine which rows of `tesla` have a `"BodyStyle"` that matches a row in `bmw`. From the DataFrame provided, we can see that the only `"BodyStyle"`s in both `tesla` and `bmw` are SUV and sedan. When we merge `tesla` and `bmw` on `"BodyStyle"`:

- The 4 SUV rows in `tesla` each match the 1 SUV row in `bmw`. This will create 4 SUV rows in `temp`.
- The 3 sedan rows in `tesla` each match the 1 sedan row in `bmw`. This will create 3 sedan rows in `temp`.

So, `temp` is a DataFrame with a total of 7 rows, with 4 rows for SUVs and 3 rows for sedans in the `"BodyStyle"` column. Now, when we merge `temp` and `audi` on `"BodyStyle"`:

- The 4 SUV rows in `temp` each match the 8 SUV rows in `audi`. This will create 4 \cdot 8 = 32 SUV rows in `combo`.
- The 3 sedan rows in `temp` each match the 1 sedan row in `audi`. This will create 3 \cdot 1 = 3 sedan rows in `combo`.

Thus, the total number of rows in `combo` is 32 + 3 = 35.

Note: You may notice that 35 is the result of multiplying the `"SUV"` and `"Sedan"` columns in the DataFrame provided, and adding up the results, as sketched below.
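
A quick sketch of that check, using the per-brand counts from the distribution table:

```
tesla_counts = {"SUV": 4, "Sedan": 3}
bmw_counts   = {"SUV": 1, "Sedan": 1}
audi_counts  = {"SUV": 8, "Sedan": 1}

# Each merge multiplies the match counts within each "BodyStyle".
total = sum(tesla_counts[b] * bmw_counts[b] * audi_counts[b]
            for b in ["SUV", "Sedan"])
print(total)  # 4*1*8 + 3*1*1 = 35
```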