Visualization, Missing Values, More Practice



The problems in this worksheet are taken from past exams in similar classes. Work on them on paper, since the exams you take in this course will also be on paper.

We encourage you to complete this worksheet in a live discussion section. Solutions will be made available after all discussion sections have concluded. You don’t need to submit your answers anywhere.

Note: We do not plan to cover all problems here in the live discussion section; the problems we don’t cover can be used for extra practice.


Problem 1

The DataFrame games contains information about a sample of popular games, including board games, dice games, and card games. The data comes from Board Game Geek, a popular website and vibrant online community for game enthusiasts.

The columns of games are as follows.

The first few rows of games are shown below (though games has many more rows than pictured here).


Assume that we have already run import pandas as pd and import numpy as np.

Many of the games in the games DataFrame belong to more than one domain. We want to identify the number of games that belong to only one domain. Select all of the options below that would correctly calculate the number of games that belong to only one domain.

Answer: Options 3 and 4

Let’s take a closer look at why Option 3 and Option 4 are correct.

Option 3: Option 3 first queries the games DataFrame to keep only the games with a single value in "Domains". games["Domains"].str.split(",").str.len() == 1 takes the "Domains" column and splits each entry on commas.

For example, let’s say the domain was "Strategy Games", "Thematic Games" then after doing .str.split(",") we would have the list ["Strategy Games", "Thematic Games"]. If the domain was just "Strategy Games", then after splitting we’d have ["Strategy Games"].

Any "Domain" with at least one comma will turn into a list whose length is at least 2, so the comparison with .str.len() == 1 will evaluate to False. So ultimately,

games[games["Domains"].str.split(",").str.len() == 1]

is a DataFrame with just the rows for games that belong to one "Domain".

The next part, .value_counts("Name").sum(), makes a Series indexed by the unique "Name"s whose values are the number of times each appears. Summing this Series gives the total number of games that belong to only one "Domain".

Option 4: Option 4 starts off exactly like Option 3, but instead of using value_counts it gets the number of rows using .shape[0], which will give us the number of games that belong to only one domain.

Option 1: Let’s step through why Option 1 is incorrect. (games["Domains"].str.split(" ").apply(len) == 2).sum() takes the "Domains" column, then splits each domain on spaces. We then take the length of each resulting list and check whether it equals 2, i.e. whether the split produced exactly two elements. Since True evaluates to 1 and False evaluates to 0, the sum counts the domains that were split into exactly two pieces by spaces. This does not tell us the number of games that belong to only one domain.

Option 2: Let’s step through why Option 2 is also incorrect. (games["Domains"].apply(len) == 1).sum() checks whether each element in the "Domains" column has exactly one character. Remember, applying len() to a string gives the number of characters in that string, so this expression counts the domains whose names are one character long. Thus, it does not tell us the number of games that belong to only one domain.
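
To make the distinction concrete, here is a minimal sketch on a hypothetical three-row version of games (the real DataFrame has many more rows and columns); it shows Options 3 and 4 agreeing.

import pandas as pd

# Hypothetical miniature version of games, for illustration only.
games = pd.DataFrame({
    "Name": ["Catan", "Gloomhaven", "Uno"],
    "Domains": ["Strategy Games",
                "Strategy Games, Thematic Games",
                "Family Games"],
})

one_domain = games["Domains"].str.split(",").str.len() == 1

# Option 3: count each remaining "Name" once, then sum the counts.
print(games[one_domain].value_counts("Name").sum())  # 2

# Option 4: count the rows remaining after the query.
print(games[one_domain].shape[0])                    # 2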


Problem 2

In September 2020, Governor Gavin Newsom announced that by 2035, all new vehicles sold in California must be zero-emissions vehicles. Electric vehicles (EVs) are among the most popular zero-emissions vehicles (though other examples include plug-in hybrids and hydrogen fuel cell vehicles).


The DataFrame evs consists of 32 rows, each of which contains information about a different EV model.

The first few rows of evs are shown below (though remember, evs has 32 rows total).


Throughout this problem, we will refer to evs repeatedly.

Assume that:

Tip: Open this page in another tab, so that it is easy to refer to this data description as you work through the problem.


Problem 2.1

Which of the following types of visualizations could we use to visualize the distribution of "Range"? Select all that apply.

Answer: Histogram or box plot

"Range" is a numerical (i.e. quantitative) variable, and we use histograms and box plots to visualize the distribution of numerical features.

  • A bar chart couldn’t work here. Bar charts can show the distribution of a categorical feature, but "Range" is not categorical.
  • A scatter plot visualizes the relationship between two numerical features, but we are only dealing with one numerical feature here ("Range").
  • Similarly, a line plot visualizes the relationship between two numerical features, but we only have one here.
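
As a quick sketch (assuming evs is already loaded), either of the following one-liners would visualize the distribution of "Range":

evs["Range"].plot(kind="hist", density=True)  # histogram of "Range"
evs["Range"].plot(kind="box")                 # box plot of "Range"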


Problem 2.2

Teslas, on average, tend to have higher "Range"s than BMWs. In which of the following visualizations would we be able to see this pattern? Select all that apply.

Answer:

  • A bar chart that shows the average "Range" for each "Brand"
  • An overlaid histogram showing the distribution of "Range" for each "Brand"

Let’s look at each option more closely.

  • Option 1: A bar chart showing the distribution of "Brand" would only show us how many cars of each "Brand" there are. It would not tell us anything about the average "Range" of each "Brand".

  • Option 2: A bar chart showing the average range for each "Brand" would help us directly visualize how the average range of each "Brand" compares to one another.

  • Option 3: An overlaid histogram, although perhaps a bit messy, would also give us a general idea of the average range of each "Brand" by giving us the distribution of the "Range" of each brand. In the scenario mentioned in the question, we’d expect to see that the Tesla distribution is further right than the BMW distribution.

  • Option 4: A scatter plot of "TopSpeed" against "Range" would only illustrate the relationship between "TopSpeed" and "Range", but would contain no information about the "Brand" of each EV.
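
For instance, Option 2 could be drawn with a one-liner like the sketch below (assuming evs is loaded); each bar’s height is the mean "Range" among that "Brand"’s EVs, so Tesla’s bar would sit above BMW’s.

evs.groupby("Brand")["Range"].mean().plot(kind="bar")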


Problem 2.3

Kyle thinks "Seats" is a categorical feature because it can be used to categorize EVs by size. For instance, EVs with 4 seats are small, EVs with 5 seats are medium, and EVs with 6 or more seats are large.

Is Kyle correct?

Justify your answer in one sentence. Your answer must fit in the box below.

Answer: No

"Seats" is a numerical feature, since it makes sense to do arithmetic with the values. For instance, we can find the average number of "Seats" that a group of cars has. Kyle’s argument could apply to any numerical feature; just because we can place numerical features into “bins” doesn’t make them categorical.



Problem 3

The seat-back TV on one of Biff the Wolverine’s more recent flights was very dirty and was full of fingerprints. The fingerprints made an interesting pattern. We’ve stored the x and y positions of each fingerprint in the DataFrame fingerprints, and created the following scatterplot using

fingerprints.plot(kind='scatter', x='x', y='y')


Problem 3.1

True or False: The histograms that result from the following two lines of code will look very similar.

# When we say bins=np.arange(0, 8, 2),
# we mean that the bins are [0, 2), [2, 4), [4, 6].
fingerprints.plot(kind='hist', 
                  y='x',
                  density=True,
                  bins=np.arange(0, 8, 2))

and

fingerprints.plot(kind='hist', 
                  y='y',
                  density=True,
                  bins=np.arange(0, 8, 2))

Answer: True

The only difference between the two code snippets is the data values used. The first creates a histogram of the x-values in fingerprints, and the second creates a histogram of the y-values in fingerprints.

Both histograms use the same bins: bins=np.arange(0, 8, 2). This means the bin endpoints are [0, 2, 4, 6], so there are three distinct bins: [0, 2), [2, 4), and [4, 6]. Remember the right-most bin of a histogram includes both endpoints, whereas others include the left endpoint only.

Let’s look at the x-values first. If we divide the scatterplot into nine equally-sized regions, as shown below, note that eight of the nine regions have a very similar number of data points.

Aside from the middle region, about \frac{1}{8} of the data falls in each region. That means \frac{3}{8} of the data has an x-value in the first bin [0, 2), \frac{2}{8} of the data has an x-value in the middle bin [2, 4), and \frac{3}{8} of the data has an x-value in the rightmost bin [4, 6]. This distribution of x-values into bins determines what the histogram will look like.

Now, if we look at the y-values, we’ll find that \frac{3}{8} of the data has a y-value in the first bin [0, 2), \frac{2}{8} of the data has a y-value in the middle bin [2, 4), and \frac{3}{8} of the data has a y-value in the last bin [4, 6]. That’s the same distribution of data into bins as the x-values had, so the histogram of y-values will look just like the histogram of x-values.

Alternatively, an easy way to see this is to use the fact that the scatterplot is symmetric over the line y=x, the line through the origin that makes a 45 degree angle with the x-axis. In other words, interchanging the x and y values doesn’t change the scatterplot noticeably, so the x and y values have very similar distributions, and their histograms will be very similar as a result.
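
The following sketch illustrates the symmetry argument on synthetic data (the real fingerprints appeared as an exam figure): points are generated in pairs (a, b) and (b, a), so the scatterplot is symmetric over y = x and the two histograms come out nearly identical.

import numpy as np
import pandas as pd

# Hypothetical stand-in for fingerprints: each point is paired with its
# mirror image over the line y = x.
pts = np.random.uniform(0, 6, size=(500, 2))
fingerprints = pd.DataFrame(np.vstack([pts, pts[:, ::-1]]),
                            columns=["x", "y"])

# These two histograms should look very similar.
fingerprints.plot(kind="hist", y="x", density=True, bins=np.arange(0, 8, 2))
fingerprints.plot(kind="hist", y="y", density=True, bins=np.arange(0, 8, 2))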


Problem 3.2

Below, we’ve drawn a histogram using the line of code

fingerprints.plot(kind='hist', 
                  y='x',
                  density=True,
                  bins=np.arange(0, 8, 2))

However, our Jupyter Notebook was corrupted, and so the resulting histogram doesn’t quite look right. While the height of the first bar is correct, the histogram doesn’t contain the second or third bars, and the y-axis is replaced with letters.

Which of the four options on the y-axis is closest to where the height of the middle bar should be?

Which of the four options on the y-axis is closest to where the height of the rightmost bar should be?

Answer: B, then C

We’ve already determined that the first bin should contain \frac{3}{8} of the values, the middle bin should contain \frac{2}{8} of the values, and the rightmost bin should contain \frac{3}{8} of the values. The middle bar of the histogram should therefore be two-thirds as tall as the first bar, and the rightmost bar should be equally as tall as the first bar. The only reasonable height for the middle bar is B, as it’s closest to two-thirds of the height of the first bar. Similarly, the rightmost bar must be at height C, as it’s the only one close to the height of the first bar.



Problem 4

Pranavi proposes the following imputation scheme.

    def impute(s):
        return s.fillna(np.random.choice(s[s.notna()]))


Problem 4.1

True or False: impute performs probabilistic imputation, using the same definition of probabilistic imputation we learned about in class.

Answer: False

In impute, np.random.choice will return a single non-null value from s, and .fillna() will fill every null value with this single value. Meanwhile, probabilistic imputation draws a different value from a specified distribution to fill each missing value, making it such that there won’t be a single “spike” in the imputed distribution at a single chosen value.
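
Here is a minimal sketch of the distinction; probabilistic_impute is one hypothetical implementation of the in-class definition, not official course code.

import numpy as np
import pandas as pd

s = pd.Series([2.0, np.nan, 4.0, np.nan, 7.0])

# Pranavi's impute: ONE randomly chosen observed value fills EVERY null.
def impute(s):
    return s.fillna(np.random.choice(s[s.notna()]))

# Probabilistic imputation: each null gets its OWN independent draw
# from the observed values.
def probabilistic_impute(s):
    out = s.copy()
    out[out.isna()] = np.random.choice(s[s.notna()], size=s.isna().sum())
    return out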


Problem 4.2

Consider the following expressions and values.

    >>> vals.isna().mean()
    0.2
    >>> vals.describe().loc[["min", "mean", "max"]]
    min     2.0
    mean    4.0
    max     7.0
    dtype: float64

Given the above, what is the maximum possible value of impute(vals).mean()? Give your answer as a number rounded to one decimal place.

Answer: 4.6

The maximum possible value of impute(vals).mean() would occur when every single missing value in vals is filled in with the highest possible non-null value in vals. (As discussed in the previous solution, impute selects only one value from s to fill into every missing space.)

If this occurs, then the mean of the imputed Series will be the weighted mean of the available data and the filled data, and given the numbers in the question, this is 0.8 \cdot 4 + 0.2 \cdot 7, or 4.6.
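
A quick arithmetic check of that weighted mean (the weights come from vals.isna().mean() being 0.2):

# 80% of the values have mean 4.0; the missing 20% are all filled
# with the maximum observed value, 7.0.
print(0.8 * 4.0 + 0.2 * 7.0)  # 4.6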


Problem 4.3

Which of the following statements will always evaluate to True?

Answer: None of the above

Since the value which impute will choose to impute with is random, the effect that it has on the standard deviation of vals is unknown. If the missing values are filled with a value close to the mean, this could reduce standard deviation; if they are filled with a value far from the mean, this could increase standard deviation. (Of course, the imputation will also shift the mean, so without knowing details of the Series, it’s impossible to come up with thresholds.) In any case, since the value for imputation is chosen at random, none of these statements will always be true, and so the correct answer is “none of the above.”



Problem 5

Suppose we have a DataFrame, dogs, in which each row corresponds to a different dog owned by someone in Ann Arbor. There are several columns in the DataFrame, including "birth_year" (the dog’s year of birth) and "Breed" (the dog’s breed).

Using the "birth_year" column in dogs, Tahseen computes the age of each dog and stores it a new "age" column of dogs. He also categorizes dogs into three “breed classes” based on their breed – "Retriever", "Terrier", or "Other" – and stores these in a new "class" column of dogs.

The following bar chart shows the distribution of breed classes by missingness of age:

The following bar chart shows the mean observed age for each breed class:


Problem 5.1

Let O be the mean of the observed ages in dogs, that is, the mean of the ages that aren’t missing. Let T be the true mean of the ages in dogs, that is, the mean we’d compute if none of the ages were missing.

What is the relationship between O and T?

Answer: O < T

Among the ages we get to observe (that is, when age is not missing), Terriers are far less common than among the ages we don’t get to observe. Terriers are older on average than the other breed classes, which means we’re disproportionately missing larger values, so when we take the average of the values we do have access to, it’ll be lower than the true mean.


Consider the following strategy for imputing missing ages.

def fill_missing(dogs):
  return dogs.assign(
    age=dogs.groupby("class")["age"]
            .transform(lambda x: x.fillna(x.value_counts().idxmax()))
  )


Problem 5.2

In one sentence, describe this imputation strategy.

Answer: This is filling in missing ages in each breed class with the most common observed age in that breed class.


Problem 5.3

small_dogs is a subset of dogs shown in its entirety below.

If we call fill_missing on small_dogs, what values would the four null values in the "age" column of small_dogs be replaced with?

Answer:

  • row 2: 5. The most common observed age for the "Other" breed class is 5.
  • row 3: 5. The most common observed age for the "Other" breed class is 5.
  • row 4: 12. The most common observed age for the "Terrier" breed class is 12.
  • row 7: 7. The most common observed age for the "Retriever" breed class is 7.
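
Since small_dogs only appeared as an image on the exam, here is a hypothetical reconstruction consistent with the answers above, along with the call that fills in the four nulls (this reuses fill_missing from earlier):

import numpy as np
import pandas as pd

# Hypothetical small_dogs: rows 2, 3, 4, and 7 have missing ages, and the
# most common observed ages are 5 ("Other"), 12 ("Terrier"), 7 ("Retriever").
small_dogs = pd.DataFrame({
    "class": ["Other", "Other", "Other", "Other",
              "Terrier", "Terrier", "Terrier",
              "Retriever", "Retriever", "Retriever"],
    "age":   [5.0, 5.0, np.nan, np.nan,
              np.nan, 12.0, 12.0,
              np.nan, 7.0, 7.0],
})

filled = fill_missing(small_dogs)
print(filled.loc[[2, 3, 4, 7], "age"])  # 5.0, 5.0, 12.0, 7.0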



Problem 6

The EECS 398 staff are looking into hotels — some in San Diego, for their family to stay at for graduation (and to eat Mexican food), and some elsewhere, for summer trips.

Each row of hotels contains information about a different hotel in San Diego. Specifically, for each hotel, we have:

The first few rows of hotels are shown below, but hotels has many more rows than are shown.

Now, consider the variable summed, defined below.

summed = hotels.groupby("Chain")["Number of Rooms"].sum().idxmax()


Problem 6.1

What is type(summed)?

Answer: str

When we group hotels by the "Chain" column, the values in "Chain" become the index of the output — in this case, the Series hotels.groupby("Chain")["Number of Rooms"].sum().

Since the values of Chain are strings, and since .idxmax() will return a value from the index of the aforementioned Series, summed is a string.


Problem 6.2

In one sentence, explain what the value of summed means. Phrase your explanation as if you had to give it to someone who is not a data science major; that is, don’t say something like “it is the result of grouping hotels by "Chain", selecting the "Number of Rooms" column, …”, but instead, give the value context.

Answer: summed is the name of the hotel chain with the most total rooms

The result of the .groupby() and .sum() is a Series indexed by the unique Chains, whose values are the total number of rooms in hotels owned by each chain. The idxmax() function gets the index corresponding to the largest value in the Series, which will be the hotel chain name with the most total rooms.
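
To see this concretely, here is a sketch on a tiny hypothetical hotels (names and numbers invented for illustration):

import pandas as pd

hotels = pd.DataFrame({
    "Hotel Name": ["A", "B", "C", "D"],
    "Chain": ["Hilton", "Hilton", "Marriott", "Marriott"],
    "Number of Rooms": [100, 150, 300, 50],
})

totals = hotels.groupby("Chain")["Number of Rooms"].sum()
# Hilton: 250, Marriott: 350
summed = totals.idxmax()
print(summed)  # 'Marriott' -- the chain with the most total rooms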


Problem 6.3

Consider the variable curious, defined below.

curious = frame["Chain"].value_counts().idxmax()

Fill in the blank: curious is guaranteed to be equal to summed only if frame has one row for every ____ in San Diego.

Answer: hotel room

curious gets the most common value of Chain in the DataFrame frame. We already know that summed is the hotel chain with the most rooms in San Diego, so curious only equals summed if the most common Chain in frame is the hotel chain with the most total rooms; this occurs when each row of frame is a single hotel room.


Problem 6.4

Fill in the blanks so that popular_areas is an array of the names of the unique neighborhoods that have at least 5 hotels and at least 1000 hotel rooms.

    f = lambda df: __(i)__
    popular_areas = hotels.groupby(__(ii)__).__(iii)__.__(iv)__
  1. What goes in blank (i)?

  2. What goes in blank (ii)?

  3. What goes in blank (iii)?

  4. What goes in blank (iv)?

Answers:

  1. df.shape[0] >= 5 and df["Number of Rooms"].sum() >= 1000
  2. "Location"
  3. filter(f)
  4. ["Location"].unique() or equivalent

We’d like to keep only the neighborhoods that satisfy certain group characteristics (having at least 5 hotels and at least 1000 hotel rooms), and .filter() lets us do that by excluding the groups that don’t meet those criteria. So, we write a function that evaluates those criteria one group at a time (the df passed to f is the subset of hotels for a single "Location" value), and after calling filter(f), the only remaining rows are hotels in neighborhoods that satisfy both criteria. Finally, all we have to do is get the unique neighborhoods from this DataFrame, which are exactly the neighborhoods for which f returned True.

You may wonder why we’re using and instead of &, when we’re told to use & when making queries. When Boolean filtering arrays/Series/DataFrames, we always use the bitwise operators (& for and, | for or); we use the “regular” and/or when combining individual Boolean values. Here, df.shape[0] >= 5 is a single Boolean – not a Series of Booleans – and df["Number of Rooms"].sum() >= 1000 is also a single Boolean. Since we need the and of two individual Booleans, the and operator is correct (and & is incorrect).
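
Putting the four blanks together, the assembled solution looks like this (assuming hotels has the "Location" and "Number of Rooms" columns described above):

f = lambda df: df.shape[0] >= 5 and df["Number of Rooms"].sum() >= 1000
popular_areas = hotels.groupby("Location").filter(f)["Location"].unique()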


Problem 6.5

Consider the code below.

    cond1 = hotels["Chain"] == "Marriott"
    cond2 = hotels["Location"] == "Coronado"
    combined = hotels[cond1].merge(hotels[cond2], on="Hotel Name", how=???)
  1. If we replace ??? with "inner" in the code above, which of the following will be equal to combined.shape[0]?

  2. If we replace ??? with "outer" in the code above, which of the following will be equal to combined.shape[0]?

Answers:

  1. (cond1 & cond2).sum()
  2. cond1.sum() + cond2.sum() - (cond1 & cond2).sum()

Note that cond1 and cond2 are Boolean Series, and hotels[cond1] and hotels[cond2] are the subsets of hotels where "Chain" == "Marriott" and where "Location" == "Coronado", respectively.

  1. When we perform an inner merge, we’re selecting every row where a Hotel Name appears in both hotels[cond1] and hotels[cond2]. This is the same set of indices (and therefore hotel names, since those are unique) as where (cond1 & cond2) == True. So, the length of combined will be the same as the number of Trues in (cond1 & cond2).

  2. When we perform an outer merge, we’re selecting every row that appears in either DataFrame, although there will not be repeats for hotels that are both Marriott properties and are in Coronado. So, to find the total number of rows in either DataFrame, we take the sum of the sizes of each, and subtract rows that appear in both, which corresponds to answer cond1.sum() + cond2.sum() - (cond1 & cond2).sum().
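
Both identities can be checked on a tiny hypothetical hotels (this sketch assumes "Hotel Name" values are unique, as the solution notes):

import pandas as pd

hotels = pd.DataFrame({
    "Hotel Name": ["A", "B", "C", "D", "E"],
    "Chain": ["Marriott", "Marriott", "Hilton", "Marriott", "Hilton"],
    "Location": ["Coronado", "Gaslamp", "Coronado", "Coronado", "Gaslamp"],
})

cond1 = hotels["Chain"] == "Marriott"
cond2 = hotels["Location"] == "Coronado"

inner = hotels[cond1].merge(hotels[cond2], on="Hotel Name", how="inner")
outer = hotels[cond1].merge(hotels[cond2], on="Hotel Name", how="outer")

print(inner.shape[0] == (cond1 & cond2).sum())                              # True
print(outer.shape[0] == cond1.sum() + cond2.sum() - (cond1 & cond2).sum())  # True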



Problem 7

Let’s consider admissions at UC San Diego and UC Santa Barbara for two high schools in particular.

For instance, the above table tells us that 200 students from La Jolla Private applied to UC San Diego, and 50 were admitted.

What is the largest possible integer value of N such that:

Answer: 124

Let’s consider the two conditions separately.

First, UC Santa Barbara needs to have a higher admit rate for both high schools. This is already true for La Jolla Private (\frac{100}{300} > \frac{50}{200}); for Sun God Memorial High, we just need to ensure that \frac{N}{150} > \frac{200}{300}. This means that N > 100.

Now, UC San Diego needs to have a higher admit rate overall. The UC San Diego admit rate is \frac{50+200}{200+300} = \frac{250}{500} = \frac{1}{2}, while the UC Santa Barbara admit rate is \frac{100 + N}{450}. This means that we must require that \frac{1}{2} = \frac{225}{450} > \frac{100+N}{450}. This means that 225 > 100 + N, i.e. that N < 125.

So there are two conditions on N: N > 100 and N < 125. The largest integer N that satisfies these conditions is N=124, which is the final answer.


Problem 8

Suppose Charlie and Norah each have separate DataFrames for their contacts, called charlie and norah, respectively. These DataFrames have the same column names and format as your DataFrame, contacts.

As illustrated in the diagram below, Charlie has 172 contacts in total, whereas Norah has 88 contacts. 12 of these contacts are shared, meaning they appear in both charlie and norah.


Problem 8.1

What does the following expression evaluate to?

charlie.merge(norah, left_index=True, right_index=True).shape[0]    

Answer: 12

The code merges the DataFrames charlie and norah on their indexes, so the resulting DataFrame will contain one row for every match between their indexes (‘Person’, since they follow the same format as the contacts DataFrame). From the Venn diagram, we know that Charlie and Norah have 12 contacts in common, so the resulting DataFrame will contain 12 rows: one row for each shared contact.

Thus, charlie.merge(norah, left_index=True, right_index=True).shape[0] returns the number of rows in the resulting DataFrame, which is 12.


Problem 8.2

One day, when updating her phone’s operating system, Norah accidentally duplicates the 12 contacts she has in common with Charlie. Now, the norah DataFrame has 100 rows.

What does the following expression evaluate to?

norah.merge(norah, left_index=True, right_index=True).shape[0]   

Answer: 24 \cdot 2 + 76 = 124

Since Norah duplicated the 12 shared contacts, the norah DataFrame now has 76 unique rows plus two copies each of the 12 shared contacts, for 100 rows total. Note that the code above merges norah with itself on the index.

After merging, the resulting DataFrame will contain the 76 unique rows, since each of them matches exactly once. Each of the 24 duplicated rows, however, matches twice. Thus, the resulting DataFrame has 76 + 2 \cdot 24 = 124 rows.

For better understanding, imagine we have a smaller DataFrame nor with only one contact, Jim. After duplication, it has two identical rows for Jim; call the original row Jim1 and the duplicated row Jim2. When merging nor with itself, Jim1 matches both Jim1 and Jim2, and Jim2 matches both Jim1 and Jim2, resulting in 2 \cdot 2 = 4 rows.
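
The Jim example, in code (a hypothetical two-row DataFrame, just for illustration):

import pandas as pd

# nor: one contact, Jim, accidentally duplicated.
nor = pd.DataFrame({"Phone": [1234, 1234]}, index=["Jim", "Jim"])

merged = nor.merge(nor, left_index=True, right_index=True)
print(merged.shape[0])  # 4 -- each Jim row matches both Jim rows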



Problem 9

In this problem, we will be using the following DataFrame students, which contains various information about high school students and the university/universities they applied to.

The columns are:

The rows of students are arranged in no particular order. The first eight rows of students are shown above (though students has many more rows than pictured here).


Problem 9.1

Fill in the blank so that the result evaluates to a Series indexed by "Email" that contains a list of the universities that each student was admitted to. If a student wasn’t admitted to any universities, they should have an empty list.

    students.groupby("Email").apply(_____)

What goes in the blank?

Answer: lambda df: df.loc[df["Admit"] == "Y", "University"].tolist()

For each student’s group of rows, this lambda keeps only the rows where "Admit" is "Y" and collects the corresponding "University" values into a list; if a student has no "Y" rows, .tolist() returns an empty list, as required.
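
For instance, on a hypothetical three-row students (invented emails and universities), the expression produces:

import pandas as pd

students = pd.DataFrame({
    "Email": ["a@school.edu", "a@school.edu", "b@school.edu"],
    "University": ["UC San Diego", "UCLA", "UC San Diego"],
    "Admit": ["Y", "N", "N"],
})

out = students.groupby("Email").apply(
    lambda df: df.loc[df["Admit"] == "Y", "University"].tolist()
)
# a@school.edu    [UC San Diego]
# b@school.edu                []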


Problem 9.2

Which of the following blocks of code correctly assign max_AP to the maximum number of APs taken by a student who was rejected by UC San Diego?

Option 1:

cond1 = students["Admit"] == "N"
cond2 = students["University"] == "UC San Diego"
max_AP = students.loc[cond1 & cond2, "APs"].sort_values().iloc[-1]

Option 2:

cond1 = students["Admit"] == "N"
cond2 = students["University"] == "UC San Diego"
d3 = students.groupby(["University", "Admit"]).max().reset_index()
max_AP = d3.loc[cond1 & cond2, "APs"].iloc[0]

Option 3:

p = students.pivot_table(index="Admit", 
                         columns="University", 
                         values="APs", 
                         aggfunc="max")
max_AP = p.loc["N", "UC San Diego"]

Option 4:

# .last() returns the element at the end of a Series it is called on
groups = students.sort_values(["APs", "Admit"]).groupby("University")
max_AP = groups["APs"].last()["UC San Diego"]

Select all that apply. There is at least one correct option.

Answer: Option 1 and Option 3

  • Option 1 works correctly; it is probably the most straightforward way of answering the question. cond1 is True for all rows in which students were rejected, and cond2 is True for all rows in which students applied to UCSD. As such, students.loc[cond1 & cond2] contains only the rows where students were rejected from UCSD. Then, students.loc[cond1 & cond2, "APs"].sort_values() sorts the number of "APs" taken in increasing order, and .iloc[-1] gets the largest number of "APs" taken.

  • Option 2 doesn’t work because the lengths of cond1 and cond2 are not the same as the length of d3, so this causes an error.

  • Option 3 works correctly. For each combination of "Admit" status ("Y", "N", "W") and "University" (including UC San Diego), it computes the max number of "APs". The usage of .loc["N", "UC San Diego"] is correct too.

  • Option 4 doesn’t work. It currently returns the maximum number of "APs" taken by someone who applied to UC San Diego; it does not factor in whether they were admitted, rejected, or waitlisted.


Problem 9.3

Currently, students has a lot of repeated information — for instance, if a student applied to 10 universities, their GPA appears 10 times in students.

We want to generate a DataFrame that contains a single row for each student, indexed by "Email", that contains their "Name", "High School", "GPA", and "APs".

One attempt to create such a DataFrame is below.

students.groupby("Email").aggregate({"Name": "max",
                                        "High School": "mean",
                                        "GPA": "mean",
                                        "APs": "max"})

There is exactly one issue with the line of code above. In one sentence, explain what needs to be changed about the line of code above so that the desired DataFrame is created.

Answer: The problem is that aggregating "High School" with "mean" doesn’t work, since you can’t take the mean of a column containing strings. Changing it to an aggregation that works on strings, like "max" or "min", would fix the issue.
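
One possible fix, sketched below (any string-friendly aggregation, such as "max", "min", or "first", would do):

students.groupby("Email").aggregate({"Name": "max",
                                     "High School": "max",  # was "mean"
                                     "GPA": "mean",
                                     "APs": "max"})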


Problem 9.4

Consider the following snippet of code.

pivoted = students.assign(Admit=students["Admit"] == "Y") \
                  .pivot_table(index="High School", 
                               columns="University", 
                               values="Admit", 
                               aggfunc="sum")

Some of the rows and columns of pivoted are shown below.

No students from Warren High were admitted to Columbia or Stanford. However,
pivoted.loc["Warren High", "Columbia"] and pivoted.loc["Warren High", "Stanford"] evaluate to different values. What is the reason for this difference?

Answer: Option 3

pivoted.loc["Warren High", "Stanford"] is NaN because there were no rows in students in which the "High School" was "Warren High" and the "University" was "Stanford", because nobody from Warren High applied to Stanford. However, pivoted.loc["Warren High", "Columbia"] is not NaN because there was at least one row in students in which the "High School" was "Warren High" and the "University" was "Columbia". This means that at least one student from Warren High applied to Columbia.

Option 3 is the only option consistent with this logic.



Problem 10

For each day in May 2022, the DataFrame streams contains the number of streams for each of the "Top 200" songs on Spotify that day — that is, the number of streams for the 200 songs with the most streams on Spotify that day. The columns in streams are as follows:

The first few rows of streams are shown below. Since there were 31 days in May and 200 songs per day, streams has 6200 rows in total.

Note that:


Problem 10.1

Complete the implementation of the function song_by_day, which takes in an integer day between 1 and 31 corresponding to a day in May, and an integer n, and returns the song that had the n-th most streams on day. For instance,
song_by_day(31, 199) should evaluate to "pepas", because "pepas" was the 199th most streamed song on May 31st.

Note: You are not allowed to sort within song_by_day — remember, streams is already sorted.

def song_by_day(day, n):
    day_str = f"2022-05-{str(day).zfill(2)}"
    day_only = streams[__(a)__].iloc[__(b)__]
    return __(c)__

What goes in each of the blanks?

Answer: a) streams['date'] == day_str, b) (200 - n), c) day_only['track_name']

The first line of the function hints that later on, we’re going to filter for the rows matching the given day. Indeed, in blank (a), we keep only the rows whose 'date' column matches day_str. In blank (b), we can directly access the row with the n-th most streams using iloc: within each day, the rows are sorted by streams in ascending order, so the n-th most streamed song of a day sits at (0-indexed) position 200 - n. Finally, to return the track name, we simply use day_only['track_name'].
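
Filling in the blanks, the completed function reads:

def song_by_day(day, n):
    day_str = f"2022-05-{str(day).zfill(2)}"
    # Within each day's 200 rows, streams are sorted in ascending order,
    # so the n-th most streamed song sits at (0-indexed) position 200 - n.
    day_only = streams[streams["date"] == day_str].iloc[200 - n]
    return day_only["track_name"]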


Problem 10.2

Below, we define a DataFrame pivoted.

pivoted = streams.pivot_table(index="track_name", columns="date", 
                              values="streams", aggfunc=np.max)

After defining pivoted, we define a Series mystery below.

mystery = 31 - pivoted.apply(lambda s: s.isna().sum(), axis=1)

mystery.loc["pepas"] evaluates to 23. In one sentence, describe the relationship between the number 23 and the song "pepas" in the context of the streams dataset. For instance, a correctly formatted but incorrect answer is "I listened to the song "pepas" 23 times today."

Answer: See below.

pivoted.apply(lambda s: s.isna().sum(), axis=1) computes the number of days that a song was not on the Top 200, so 31 - pivoted.apply(lambda s: s.isna().sum(), axis=1) computes the number of days the song was in the Top 200. As such, the correct interpretation is that "pepas" was in the Top 200 for 23 days in May.


Problem 10.3

In defining pivoted, we set the keyword argument aggfunc to np.max. Which of the following functions could we have used instead of np.max without changing the values in pivoted? Select all that apply.

Answer: Options A, B, and D

For each combination of "track_name" and "date", there is just a single value — the number of streams that song received on that date. As such, the aggfunc needs to take in a Series containing a single number and return that same number.

  • The mean and median of a Series containing a single number is equal to that number, so the first two options are correct.

  • The length of a Series containing a single number is 1, no matter what that number is, so the third option is not correct.

  • lambda df: df.iloc[0] takes in a Series and returns the first element in the Series, which is the only element in the Series. This option is correct as well. (The parameter name df is irrelevant.)


Problem 10.4

Below, we define another DataFrame another_mystery.

another_mystery = (streams.groupby("date").last()
                          .groupby(["artist_names", "track_name"])
                          .count().reset_index())

another_mystery has 5 rows. In one sentence, describe the significance of the number 5 in the context of the streams dataset. For instance, a correctly formatted but incorrect answer is “There are 5 unique artists in streams.” Your answer should not include the word “row”.

Answer: See below.

Since streams is sorted by "date" in descending order and, within each "date", by "streams" in ascending order, streams.groupby("date").last() is a DataFrame containing the song with the most "streams" on each day in May. In other words, we found the "top song" for each day. (The DataFrame we created has 31 rows.)

When we then execute .groupby(["artist_names", "track_name"]).count(), we create one row for every unique combination of song and artist, amongst the "top songs". (If no two artists have a song with the same name, this is the same as creating one row for every unique song.) Since there are 5 rows in this new DataFrame (resetting the index doesn’t do anything here), it means that there were only 5 unique songs that were ever the "top song" in a day in May; this is the correct interpretation.



Problem 11

This problem builds on the previous one.

Suppose the DataFrame today consists of 15 rows — 3 rows for each of 5 different "artist_names". For each artist, it contains the "track_name" for their three most-streamed songs today. For instance, there may be one row for "olivia rodrigo" and "favorite crime", one row for "olivia rodrigo" and "drivers license", and one row for "olivia rodrigo" and "deja vu".

Another DataFrame, genres, is shown below in its entirety.


Problem 11.1

Suppose we perform an inner merge between today and genres on "artist_names". If the five "artist_names" in today are the same as the five "artist_names" in genres, what fraction of the rows in the merged DataFrame will contain "Pop" in the "genre" column? Give your answer as a simplified fraction.

Answer: \frac{2}{5}

If the five "artist_names" in today and genres are the same, the DataFrame that results from an inner merge will have 15 rows, one for each row in today. This is because there are 3 matches for "harry styles", 3 matches for "olivia rodrigo", 3 matches for "glass animals", and so on.

In the merged DataFrame’s 15 rows, 6 of them will correspond to "Pop" artists — 3 to "harry styles" and 3 to "olivia rodrigo". Thus, the fraction of rows that contain "Pop" in the "genre" column is \frac{6}{15} = \frac{2}{5} (which is the fraction of rows that contained "Pop" in genres["genre"], too).


Problem 11.2

Suppose we perform an inner merge between today and genres on "artist_names". Furthermore, suppose that the only overlapping "artist_names" between today and genres are "drake" and "olivia rodrigo". What fraction of the rows in the merged DataFrame will contain "Pop" in the "genre" column? Give your answer as a simplified fraction.

Answer: \frac{1}{2}

If we perform an inner merge, there will only be 6 rows in the merged DataFrame — 3 for "olivia rodrigo" and 3 for "drake". 3 of those 6 rows will have "Pop" in the "genre" column, hence the answer is \frac{3}{6} = \frac{1}{2}.


Problem 11.3

Suppose we perform an outer merge between today and genres on "artist_names". Furthermore, suppose that the only overlapping "artist_names" between today and genres are "drake" and "olivia rodrigo". What fraction of the rows in the merged DataFrame will contain "Pop" in the "genre" column? Give your answer as a simplified fraction.

Answer: \frac{2}{9}

Since we are performing an outer merge, we can decompose the rows in the merged DataFrame into three groups:

  • Rows that are in today that are not in genres. There are 9 of these (3 each for the 3 artists that are in today and not genres). today doesn’t have a "genre" column, and so all of these "genre"s will be NaN upon merging.

  • Rows that are in genres that are not in today. There are 3 of these — one for "harry styles", one for "glass animals", and one for "doja cat". 1 of these 3 rows has "Pop" in the "genre" column.

  • Rows that are in both today and genres. There are 6 of these — 3 for "olivia rodrigo" and 3 for "drake" — and 3 of those rows contain "Pop" in the "genre" column.

Tallying things up, we see that there are 9 + 3 + 6 = 18 rows in the merged DataFrame overall, of which 0 + 1 + 3 = 4 have "Pop" in the "genre" column. Hence, the relevant fraction is \frac{4}{18} = \frac{2}{9}.



Problem 12

Recall the evs DataFrame.

The first few rows of evs are shown below (though remember, evs has 32 rows total).


Below, we provide a DataFrame that contains the distribution of "BodyStyle" for each "Brand" in evs, other than Nissan.

Suppose we’ve run the following few lines of code.

tesla = evs[evs.get("Brand") == "Tesla"]
bmw = evs[evs.get("Brand") == "BMW"]
audi = evs[evs.get("Brand") == "Audi"]

combo = tesla.merge(bmw, on="BodyStyle").merge(audi, on="BodyStyle")

How many rows does the DataFrame combo have?

Answer: 35

Let’s attempt this problem step-by-step. We’ll first determine the number of rows in tesla.merge(bmw, on="BodyStyle"), and then determine the number of rows in combo. For the purposes of the solution, let’s use temp to refer to the first merged DataFrame, tesla.merge(bmw, on="BodyStyle").

Recall, when we merge two DataFrames, the resulting DataFrame contains a single row for every match between the two columns, and rows in either DataFrame without a match disappear. In this problem, the column that we’re looking for matches in is "BodyStyle".

To determine the number of rows of temp, we need to determine which rows of tesla have a "BodyStyle" that matches a row in bmw. From the DataFrame provided, we can see that the only "BodyStyle"s in both tesla and bmw are SUV and sedan. When we merge tesla and bmw on "BodyStyle":

  • The 4 SUV rows in tesla each match the 1 SUV row in bmw. This will create 4 SUV rows in temp.
  • The 3 sedan rows in tesla each match the 1 sedan row in bmw. This will create 3 sedan rows in temp.

So, temp is a DataFrame with a total of 7 rows, with 4 rows for SUVs and 3 rows for sedans (in the "BodyStyle") column. Now, when we merge temp and audi on "BodyStyle":

  • The 4 SUV rows in temp each match the 8 SUV rows in audi. This will create 4 \cdot 8 = 32 SUV rows in combo.
  • The 3 sedan rows in temp each match the 1 sedan row in audi. This will create 3 \cdot 1 = 3 sedan rows in combo.

Thus, the total number of rows in combo is 32 + 3 = 35.

Note: You may notice that 35 is the result of multiplying corresponding entries of the "SUV" and "Sedan" columns in the DataFrame provided and adding up the results: 4 \cdot 1 \cdot 8 + 3 \cdot 1 \cdot 1 = 35.
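
A quick check of that arithmetic:

suv_rows   = 4 * 1 * 8  # Tesla SUVs x BMW SUVs x Audi SUVs
sedan_rows = 3 * 1 * 1  # Tesla sedans x BMW sedans x Audi sedans
print(suv_rows + sedan_rows)  # 35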

