← return to study.practicaldsc.org
Instructor(s): Suraj Rampure
This exam was administered in-person. The exam was closed-notes, except students were allowed to bring a single two-sided notes sheet. No calculators were allowed. Students had 120 minutes to take this exam.
Access the original exam PDF here: Version A, Version B. (This site is based on Version A.)
Skytrax is a website that allows anyone to submit a review for a flight they took. In this exam, we’ll work with the DataFrame reviews
, which contains flight reviews taken from Skytrax.
The first few rows of reviews
are shown below, but reviews
has many more rows than are shown.
The columns in reviews
are as follows:
"airline"
(str): The airline flown. Note that an airline can be reviewed multiple times."author"
(str): The author of the review. Note that an author can write multiple reviews (and can even write multiple reviews of the same airline!)."date"
(str): The date on which the review was written. Most of the reviews were written in 2015."content"
(str): The content of the review."cabin"
(str): The cabin the author flew; either "Economy"
, "Premium Economy"
, "Business"
, or "First"
."overall"
(int): The overall rating the author gave their flight experience, between 0 and 10 (inclusive).Throughout the exam, assume we have already run import pandas as pd
and import numpy as np
.
Fill in the blanks so that santas_airlines
evaluates to an array containing the unique names of the airlines that Santa Ono has reviewed.
= reviews.loc[__(i)__, __(ii)__].__(iii)__ santas_airlines
Answer:
= reviews.loc[reviews["author"] == "Santa Ono", "airline"].unique() santas_airlines
Conceptually, we need to find a Series containing the names of all of the airlines that Santa Ono has reviewed, and then find the unique values in that Series.
loc
helps us extract a subset of the rows and columns in a DataFrame. In general, loc
works as follows:
reviews.loc[<subset of rows>, <subset of columns>]
Here: - We want all of the rows where reviews["author"]
is equal to "Santa Ono"
, so we use the expression reviews["author"] == "Santa Ono"
as the first input to loc
. (Remember, reviews["author"] == "Santa Ono"
is a Series of True
s and False
s; only the True
values are kept in the result.) - We want just the column "airline"
, so we supply it as a string as the second input to loc
.
Now, we have a Series with the names of all airlines that Santa Ono has reviewed; calling the .unique()
method on this Series will give us back an array with just the unique values in the Series, which is what we’re looking for here.
The average score on this problem was 88%.
Consider the function operate
, defined below.
def operate(df):
"content"] = df["content"].str.split().str.len() df[
Consider the following six answer choices:
Nothing, because we didn’t reassign reviews
after calling operate(reviews)
.
reviews["content"]
now contains the number of characters per review.
reviews["content"]
now contains the number of words per review.
We see an error because we’re trying to use .str
methods on an invalid column.
We see an error because a column with the name "content"
already exists.
We see an error because we’re assuming df
is a DataFrame but it is None
(null).
All three parts of this question are independent from one another; the code in earlier parts should not influence your answers in later parts.
What happens after running the following block of code?
operate(reviews)
A
B
C
D
E
F
Answer: C. reviews["content"]
now contains the number of words per review.
The main premise behind this question is that running operate(reviews)
makes an in-place modification to reviews
, because DataFrame objects are mutable. So, even though we didn’t run reviews = operate(reviews)
, reviews
was still updated.
Let’s visualize what happens to the Series reviews["content"]
through the transformation reviews["content"].str.split().str.len()
:
To summarize:
s
is a Series of strings, s.str.split()
evaluates to a Series of lists resulting from splitting each string in s
by " "
(the default argument to split
is " "
). These new lists now contain all of the words in the original strings.t
is a Series of lists, then t.str.len()
evaluates to a Series of ints containing the lengths of the lists in t
. If each list in t
contained words, then the ints in t.str.len()
represent numbers of words.reviews["content"].str.split().str.len()
evaluates to a list containing the number of words per review!
The average score on this problem was 67%.
What happens after running the following block of code?
operate(reviews) operate(reviews)
A
B
C
D
E
F
Answer: D. We see an error because we’re trying to use .str
methods on an invalid column.
As we discussed in the solution above, after running operate(reviews)
, the DataFrame reviews
is modified in-place such that its "content"
column contains the number of words per review. This new "content"
column is made up of ints.
When we try and call operate(reviews)
a second time, we’ll try to use .str.split()
on the new "content"
column, but the new "content"
column contains ints! Numbers cannot be “split”, and .str
methods in general don’t work on Series of numbers, so we’ll get an error that results from this mismatch in types.
operate(reviews)
.
The average score on this problem was 62%.
What happens after running the following block of code?
= operate(reviews)
reviews = operate(reviews) reviews
A
B
C
D
E
F
Answer: F. We see an error because we’re assuming df
is a DataFrame but it is None
(null).
As discussed in the previous two solutions, after calling operate(reviews)
, reviews["content"]
now contains the number of words per review. However, the function operate
itself doesn’t have an explicit return
keyword, which means it returns None
, the default Python null value. That is, operate(reviews)
makes an in-place modification, but outputs None
.
So, when we set reviews = operate(reviews)
, we’re effectively setting reviews = None
, i.e. setting reviews
to a null value. So, the second time operate(reviews)
is called, Python will try to use None
in place of df
in the body of operate
:
def operate(df):
"content"] = df["content"].str.split().str.len() df[
But, None["content"]
is an undefined operation, and we’ll get an error to that effect.
The average score on this problem was 50%.
Suppose we define n = reviews.shape[0]
.
In each of the four parts of this question, we provide an expression that evaluates to a new DataFrame. Depending on the expression, the new DataFrame:
reviews
.reviews
appear multiple times (i.e. may have duplicated rows). Note that reviews
itself does not have any duplicated rows.reviews
.For example, if df
is on the left, the DataFrame on the right has the same number of rows as df
, has some duplicated rows (row 1 appears twice), and has the same row order as df
(all row 0s before row 1s, all row 1s before row 2s, etc.).
Select the options that correctly describe the DataFrame that results from each of the following expressions.
True, False], size=n, replace=True)] reviews.loc[np.random.choice([
Same number of rows as reviews
? Yes No
Possibility of duplicated rows? Yes No
Row order guaranteed to be the same as reviews
? Yes No
Answer:
reviews
? No.reviews
? Yes.Here, np.random.choice([True, False], size=n, replace=True)
evaluates to an array of True
s and False
s with the same length as reviews
. For instance, it may evaluate to:
True, False, False, True, True, False, True, False, ...., False]) np.array([
When passing this array into reviews.loc
, it is used as a Boolean mask, the same way that Boolean Series are. For example, we write reviews.loc[reviews["author"] == "Santa Ono"]
– or reviews[reviews["author"] == "Santa Ono"]
for short – we’re asking to keep all the rows when reviews["author"] == "Santa Ono"
(a Boolean Series) contains True
and exclude all of the rows when that Boolean Series contains False
.
In the above example, we’re saying keep row 0, exclude rows 1 and 2, keep rows 3 and 4, exclude row 5, and so on. In this scheme:
False
s in the resulting array, we will exclude at least some of the rows in reviews
, so the result will almost certainly not have the same number of rows as reviews
.reviews
.
The average score on this problem was 52%.
reviews.loc[np.random.permutation(np.arange(n))]
Same number of rows as reviews
? Yes No
Possibility of duplicated rows? Yes No
Row order guaranteed to be the same as reviews
? Yes No
reviews
? Yes.reviews
? No.np.arange(n)
evaluates to the array:
0, 1, 2, 3, ..., n - 2, n - 1]) np.array([
where, again, n
is the length of reviews
. np.random.permutation
shuffles a sequence, without removing or adding any extra elements, so np.random.permutation(np.arange(n))
is an array with the integers between 0
in n - 1
in a random order.
Separately, as we see in the DataFrame preview provided at the start of the exam, the index of reviews
is 0, 1, 2, 3, etc. When passing a sequence of index values as the first argument to reviews.loc
, pandas
will return a DataFrame with just the index values you asked for.
For example, reviews.loc[[19, 1, 3, 18]]
evaluates to a DataFrame with 4 rows, specifically, the rows with index values 19, 1, 3, and 18 from reviews
. If the index of reviews
wasn’t made up of integers, then reviews.loc[[19, 1, 3, 18]]
would error (and we’d need to use reviews.iloc
, which uses integer positions, instead).
Putting this all together, reviews.loc[np.random.permutation(np.arange(n))]
will produce a new DataFrame that results from shuffling all of the rows in reviews
. If np.random.permutation(np.arange(n))
evalutes to np.array([235, 11, 4, 953, 23, 1, 355, ...])
, reviews.loc[np.random.permutation(np.arange(n))]
will first have row 235 from reviews
, then row 11, and so on.
So, the resulting DataFrame will have the same number of rows as reviews
, and not have any duplicated rows, but have a different row order than reviews
in all likelihood.
The average score on this problem was 74%.
=n, replace=True)] reviews.loc[np.random.choice(np.arange(n), size
Same number of rows as reviews
? Yes No
Possibility of duplicated rows? Yes No
Row order guaranteed to be the same as reviews
? Yes No
reviews
? Yes.reviews
? No.The expression np.random.choice(np.arange(n), size=n, replace=True)
selects n
elements with replacement from np.arange(n)
, and returns the results in an array. This array could look like, for example:
235, 15, 3, 940, 15, ...]) np.array([
The resulting array will have n
elements, but could have duplicated values (15 in the example above), and also could be sorted differently than reviews
. So, the DataFrame that results from reviews.loc[np.random.choice(np.arange(n), size=n, replace=True)]
will have the same number of rows as reviews
, but could have duplicated values, and very well could have a different row order than reviews
.
The average score on this problem was 92%.
=n, replace=False)] reviews.loc[np.random.choice(np.arange(n), size
Same number of rows as reviews
? Yes No
Possibility of duplicated rows? Yes No
Row order guaranteed to be the same as reviews
? Yes No
reviews
? Yes.reviews
? No.The expression np.random.choice(np.arange(n), size=n, replace=False)
selects n
elements without replacement from np.arange(n)
, and returns the results in an array.
Let’s consider what it means to select n
items without replacement from a collection of n
items labeled 0
, 1
, 2
, …, n-1
. For example, let n = 5
. Then, selecting 5 elements without replacement from np.array([0, 1, 2, 3, 4])
could look like:
2
. Now there are 4 items left, np.array([0, 1, 3, 4])
.3
. Now we have 3 items left, np.array([0, 1, 4])
.1
. Now we have 2 items left, np.array([0, 4])
.4
. Now we have 1 item left, np.array([0])
.0
. We’re done!np.array([2, 3, 1, 4, 0])
.np.array([2, 3, 1, 4, 0])
is just a permutation of np.array([0, 1, 2, 3, 4])
!
More generally, np.random.choice(np.arange(n), size=n, replace=False)
behaves identically to np.random.permutation(np.arange(n))
, so our answer to this part should be the same as our answer to part 2 of this question. That is, reviews.loc[np.random.choice(np.arange(n), size=n, replace=False)]
produces a DataFrame that has the same number of rows as reviews
, and does not have any duplicated rows, but has a different row order than reviews
in all likelihood.
The average score on this problem was 52%.
The bar chart below shows the distribution of the number of reviews per "author"
in reviews
. For instance, it’s telling us that approximately 120 "author"
s wrote 2 reviews. Assume that the height of the bar at 8 on the x-axis is exactly 1, and that this is the shortest bar (with no ties).
Hint: The values in the Series that results from calling value_counts()
are sorted in descending order.
Fill in the blank below so that W
evaluates to the value 7
.
= reviews["author"].value_counts().__(i)__ W
Answer: There are a few valid answers, namely:
iloc[1]
.iloc[2]
.value_counts().index[-2]
.value_counts().index[6]
.reviews["author"].value_counts()
is a Series where: - The index contains all of the unique values in reviews["author"]
. - The values contain the frequencies (i.e. counts) of each value in reviews["author"]
.
In other words, reviews["author"].value_counts()
tells us the number of reviews per "author"
.
What do we know about reviews["author"].value_counts()
from the question?
"author"
s that wrote 1 review, the number of "author"
s that wrote 2 reviews, etc."author"
wrote 8 reviews, and the author with the next most reviews was 7.As the question hints to us, reviews["author"].value_counts()
is sorted in decreasing order of value by default. Putting the information above together, we know that the first few rows of reviews["author"].value_counts()
looks like:
Most Common Author 8
Someone else 7
A third person 7
...
With that in mind, reviews["author"].value_counts().iloc[0]
would give us 8, and reviews["author"].value_counts().iloc[1]
would give us 7. So, iloc[1]
and iloc[2]
are two ways of answering this part. There are other solutions, but they involve understanding how reviews["author"].value_counts().value_counts()
works, which we will save for the solution of the next part.
If it’s helpful, here’s what reviews["author"].value_counts()
actually looks like on the full DataFrame we used to make this exam:
The average score on this problem was 41%.
Now, consider the values defined below.
= reviews["author"].value_counts().value_counts().loc[1]
W = reviews["author"].value_counts().value_counts().iloc[1]
X = reviews["author"].value_counts().value_counts().index[1]
Y = reviews["author"].value_counts().value_counts().index[-1] Z
Which value is equal to 8?
W
X
Y
Z
None of these.
Answer: Z = reviews["author"].value_counts().value_counts().index[-1]
.
In the solution to the previous part, we understood that reviews["author"].value_counts()
would look something like:
Most Common Author 8
Someone else 7
A third person 7
...
Then, what does reviews["author"].value_counts().value_counts()
look like? It’s a Series, too, where:
reviews["author"].value_counts()
(so 1, 2, 3, 4, …, 8)."author"
s with 1 review, the number of "author"
s with 2 reviews, etc.So, reviews["author"].value_counts().value_counts()
actually describes the information in the bar chart provided! It may look something like:
1 1400
2 120
3 25
5 12
6 9
4 8
7 2
8 1
The only one of these values that we know exactly is in the very last row, because we’re told that exactly one "author"
wrote 8 reviews, the rest wrote fewer. The other values were estimated from the provided bar chart, but their values are not important in answering any of the remaining parts of this question.
Keep in mind that the resulting Series is sorted in descending order of value, not by index!
So, we can answer all three remaining parts of this question at once:
reviews["author"].value_counts().value_counts()
, so Z = reviews["author"].value_counts().value_counts().index[-1]
must evaluate to 8
.W = reviews["author"].value_counts().value_counts().loc[1]
will give us the value at index 1
, which will be the number of "author"
s with 1 review, which is the height of the first bar.X = reviews["author"].value_counts().value_counts().iloc[1]
will give us the value at integer position 1. Since Python is 0-indexed, this will give us the second-largest value in reviews["author"].value_counts().value_counts()
, which is the height of the second-tallest bar, which is indeed what we want.
The average score on this problem was 64%.
Which value is equal to the height of the tallest bar?
W
X
Y
Z
None of these.
Answer: W = reviews["author"].value_counts().value_counts().loc[1]
.
See the solution to Part 2.
The average score on this problem was 56%.
Which value is equal to the height of the second-tallest bar?
W
X
Y
Z
None of these.
Answer: X = reviews["author"].value_counts().value_counts().loc[1]
.
See the solution to Part 2.
The average score on this problem was 67%.
Select all visualization techniques we could use to visualize the distribution of "overall"
ratings, separately for Delta Air Lines and United Airlines. At least one answer is correct.
side-by-side box plots
side-by-side bar charts
a scatter plot
a pie chart
overlaid histograms
Answer: side-by-side box plots, side-by-side bar charts, overlaid histograms
"overall"
ratings are a numerical feature type, and we’ve learned to visualize the distribution of numerical features using box plots and histograms.
So, a side-by-side box plot, with one box for Delta Air Lines and one box for United Airlines, would do the job:
An overlaid histogram would do the job, too: one histogram figure for Delta, one for United.
But, there’s a third possibility! Since "overall"
ratings are some integer between 0 and 10, we can also use a bar chart to describe them. While we didn’t talk about such plots in class explicitly, a similar one is provided to you in the exam at the start of Problem 4! There, we show the distribution of a discrete, numerical feature using a bar chart:
We can do the same thing here, with one bar chart for Delta and one for United, overlaid:
The average score on this problem was 82%.
We define the robust mean of a collection of values to be the mean of all values, once the largest and smallest values are removed. For example, the robust mean of 2, 2, 3, 7, 12 is \frac{2+3+7}{3} = 4.
Fill in the blanks so that worst_robustly
evaluates to the "airline"
with the lowest robust mean of "overall"
ratings. Assume there are no ties in robust ratings, and that each "airline"
has at least 3 ratings.
def funky(x):
return __(iii)__
worst_robustly = reviews.__(i)__["overall"].__(ii)__(funky).idxmin()
(i)
: sort_values("overall")
groupby("overall")
value_counts("overall")
sort_values("airline")
groupby("airline")
value_counts("airline")
(ii)
: agg
filter
transform
(iii)
: Free responseAnswer:
def funky(x):
return x.sort_values().iloc[1:-1].mean()
# OR
return (x.sum() - x.min() - x.max()) / (x.shape[0] - 2)
= reviews.groupby("airline")["overall"].agg(funky).idxmin() worst_robustly
Let’s look at the problem step-by-step.
"airline"
with the lowest robust mean of "overall"
rating. To do so, we need to first find the robust mean of "overall"
ratings for each "airline"
. Since we want to perform this calculation for each "airline"
, we should groupby("airline")
. (Note that the .idxmin()
at the very end of the expression finds the index of the lowest value in the preceding Series.)"overall"
ratings – which are stored in a Series – and compute the robust mean of them. This act of taking many "overall"
ratings and condensing them into a single value is called aggregation, and so the appropriate method here is agg
.funky
needs to find the robust mean of a collection of values. If x
is a Series of values, there are a few ways we can do this:
x
, either in ascending or descending order. Then, remove the first and last value, which will be the min and max if sorted in ascending order, or the max and min if sorted in descending order (either way, it doesn’t change the result). Then, extract all of the remaining values and take the mean of them. x.sort_values().iloc[1:-1]
gets the values we need to take the mean of, and x.sort_values().iloc[1:-1].mean()
returns what we want.x.sum() - x.min() - x.max()
, and the number of values – again, with the min and max removed – is x.shape[0] - 2
, which means that (x.sum() - x.min() - x.max()) / (x.shape[0] - 2)
also returns the robust mean.
The average score on this problem was 86%.
What are the input and output types of the function funky
?
Input: Series, Output: Number
Input: Series, Output: Series
Input: Series, Output: DataFrame
Input: DataFrame, Output: Number
Input: DataFrame, Output: Series
Input: DataFrame, Output: DataFrame
Answer: Input: Series, Output: Number
Let’s take a look at the expression below:
"airline")["overall"].agg(funky).idxmin() reviews.groupby(
reviews.groupby("airline")["overall"]
is a SeriesGroupBy object, meaning that funky
is called on various Series. Specifically, it is called on a Series of "overall"
scores, separately for each "airline"
. The output of funky
is a single number, since it’s an aggregation method that takes lots of values (numbers) and condenses them into a single value (number).
The average score on this problem was 52%.
Consider the DataFrame imp
, shown in its entirety below.
As in the previous question, we define the robust mean of a collection of values to be the mean of all values, once the largest and smallest values are removed. For example, the robust mean of 2, 2, 3, 7, 12 is \frac{2+3+7}{3} = 4.
Suppose we use robust mean imputation, conditional on "cabin"
to fill in the missing values in the "overall"
column. In doing so, what are the missing values in the "overall"
column filled in with? Give your answers as numbers.
Answer:
9
: 6 (or 6.0)10
: 7 (or 7.0)To perform robust mean imputation, conditional on "cabin"
, we must fill each missing "overall"
value with the robust mean of the observed "overall"
values with the same "cabin"
as the missing value. So:
9
has a "cabin"
of Business. The observed "overall"
values for Business are 7, 6, and 2. The robust mean of 7, 6, and 2 is just 6, since when you remove the min (2) and max (7) you’re left with just a single value, 6, and the mean of a single value is that one value. So, we fill in the missing value in row 9
with 6.10
has a "cabin"
of Premium Economy. The observed "overall"
values for Premium Economy are 5, 6, 9, and 8, and the robust mean of these values is \frac{6 + 8}{2} = 7, which is the mean once the min (5) and max (9) are removed. So, we fill in the missing value in row 10
with 7.
The average score on this problem was 90%.
Suppose imp
represents the DataFrame above, with 2 missing values, and filled
represents the DataFrame that results from imputing the missing values in imp
in the way described in the previous part. In the following code block, what is the value of comp
?
= imp.groupby("cabin")["overall"].mean()
old_means = old_means == filled.groupby("cabin")["overall"].mean() comp
True
False
The Series [True, True, True]
The Series [False, False, False]
The Series [True, True, False]
The Series [True, False, True]
The Series [False, True, True]
The Series [False, False, True]
The Series [False, True, False]
The Series [True, False, False]
Hint: When grouping, the index is automatically sorted in ascending order.
Answer: The Series [False, True, True]
old_means
is a Series containing the mean "overall"
score of each "airline"
, before we fill in any missing values (i.e. with missing values ignored), and filled.groupby("cabin")["overall"].mean()
is a Series containing the mean "overall"
score of each "airline"
post-imputation. The index of both of these Series is ["Business", "First", "Premium Economy"]
, since when grouping and aggregating, the index is set to the unique values in the grouped-on column ("cabin"
, here) in ascending order.
So, this question boils down to checking whether the means of each group changed because of the imputation we did.
So, the Business mean changed, while the First and Premium Economy means did not change. In the order ["Business", "First", "Premium Economy"]
that we established above, our comparisons are [False, True, True]
, and so comp
is a Series of [False, True, True]
.
The average score on this problem was 46%.
Consider the DataFrame framer
, defined below.
= reviews.pivot_table(index="airline",
framer ="cabin",
columns="author",
values="max") aggfunc
In one English sentence, describe the meaning of the value that the following expression evaluates to.
"emirates", "Economy"] framer.loc[
Answer: The last name, alphabetically, among all authors who submitted an Economy Class Emirates review.
framer
might look something like this:
The rows of framer
are the unique "airline"
s and the columns of framer
are the unique "cabin"
s. The value in row a
and column c
is the maximum "author"
who wrote a review for "airline"
a
in "cabin"
c
. Since "author"
names are strings, the “maximum” "author"
name is the last alphabetically (or lexicographically).
The average score on this problem was 73%.
What is the value of the following expression?
sum(axis=0).shape[0] framer.isna().
The number of unique values in the "airline"
column of reviews
.
The number of unique values in the "cabin"
column of reviews
.
The number of unique values in the "author"
column of reviews
.
The number of unique values in the "overall"
column of reviews
.
None of the above.
Answer: The number of unique values in the "cabin"
column of reviews
.
Let’s walk through the logic:
framer
has one column per unique "cabin"
in reviews
.framer.isna()
is a DataFrame with the same shape as framer
, with each value either being True
(if the corresponding value in framer
is null) or False
.framer.isna().sum(axis=0)
computes the sum across the rows, which is the sum of each column of framer
.framer.isna().sum(axis=0)
has one entry per column in framer
, i.e. one entry per unique "cabin"
in reviews.framer.isna().sum(axis=0).shape[0]
is equal to the number of unique values in the "cabin"
column of reviews
.
The average score on this problem was 17%.
Consider the following information.
>>> framer.isna().sum(axis=0).sum()
320
>>> reviews["airline"].nunique()
125
>>> reviews["cabin"].nunique()
4
What is the value of the following expression? Show your work, and \boxed{\text{box}} your final answer, which should be a positive integer.
"airline", "cabin"])["overall"].mean().shape[0] reviews.groupby([
Answer: 180.
Let grouped = reviews.groupby(["airline", "cabin"])["overall"].mean()
, the DataFrame provided in the question. grouped
has one row for every unique combination of "airline"
and "cabin"
that actually appears in reviews
. Since there are 125 unique values of "airline"
and 4 unique values of "cabin"
, there are 125 \cdot 4 = 500 possible unique combinations of "airline"
and "cabin"
.
But, these combinations don’t all actually exist in grouper
! The fact that framer.isna().sum(axis=0).sum()
is 320
means that there are 320 missing values in framer
, total. framer
, too, has 125 rows and 4 columns, since it results from calling reviews.pivot_table(index="airline", columns="cabin")
. But, since it has 320 missing values, it means that 320 of the possible combinations of "airline"
and "cabin"
never actually appeared in reviews
.
So, the number of combinations of "airline"
and "cabin"
that actually appeared in reviews
is equal to the number of non-null values in framer
– or, equivalently, the number of rows in grouped
(so grouped.shape[0]
), which is the answer to the question we’re looking for – is:
125 \cdot 4 - 320 = \boxed{180}
The average score on this problem was 61%.
Consider the DataFrame lines
, created by keeping the rows in reviews
corresponding to reviews written between 2015-11-24 and 2015-11-28. Remember, each row in lines
corresponds to a review of an airline.
Also consider the DataFrame ports
, in which each row corresponds to a review of a particular airport. ports
has a "date"
column, too, and also only has reviews written between 2015-11-24 and 2015-11-28.
Consider the following information.
>>> lines["date"].value_counts().sort_index()
2015-11-24 2
2015-11-25 7
2015-11-26 1
2015-11-27 4
>>> ports.shape[0]
17
>>> ports["date"].unique()
"2015-11-26", "2015-11-27", "2015-11-28"])
array([>>> lines.merge(ports, on="date", how="inner").shape[0]
29
>>> lines.merge(ports, on="date", how="outer").shape[0]
41
Below, give your answers as integers.
How many values in ports["date"]
are equal to "2015-11-24"
?
Answer: 0.
Since ports["date"].unique()
evaluates to array(["2015-11-26", "2015-11-27", "2015-11-28"])
, we know that none of the values in ports["date"]
are "2015-11-24"
.
The average score on this problem was 83%.
How many values in ports["date"]
are equal to "2015-11-25"
?
Answer: 0.
Since ports["date"].unique()
evaluates to array(["2015-11-26", "2015-11-27", "2015-11-28"])
, we know that none of the values in ports["date"]
are "2015-11-25"
.
The average score on this problem was 83%.
How many values in ports["date"]
are equal to "2015-11-26"
?
Answer: 9.
Let:
ports["date"]
equal to "2015-11-26"
be a,ports["date"]
equal to "2015-11-27"
be b, andports["date"]
equal to "2015-11-28"
be c.Then, a, b, and c are the answers to parts 3, 4, and 5 of this question, respectively.
To solve for a, b, and c, we’ll need to put together several pieces of information:
ports.shape[0]
is 17
, the total number of values in ports["date"]
is 17. We know, from the fact that ports["date"].unique()
is array(["2015-11-26", "2015-11-27", "2015-11-28"])
, that the only values in ports["date"]
are "2015-11-26"
, "2015-11-27"
, and "2015-11-28"
. This means that a + b + c = 17 – we’ll call this equation 1.lines
and ports
on "date"
. The only unique values that are shared between lines["date"]
and ports["date"]
are "2015-11-26"
(1 in lines
, a in ports
) and "2015-11-27"
(4 in lines
, b in ports
). So, because of the fact that: \underbrace{\text{frequency}(\text{value}, \texttt{merged(lines, dates)} = \text{frequency}(\text{value}, \texttt{lines}) \cdot \text{frequency}(\text{value}, \texttt{dates})}_{\text{this is not a formula we saw explicitly in class; rather, it's a principle we used repeatedly when solving number-of-rows-in-merges problems}} We have our equation 2: 1a + 4b = 29lines
and ports
on "date"
. The rows in the result of the outer merge are all 29 rows of the inner merge, plus one row each for each row that was in lines
or dates
without an overlapping "date"
. So, equation 3 is: \underbrace{29}_{\text{inner merge}} + \underbrace{2 + 7}_{\text{the frequencies of \texttt{"2015-11-24"} and \texttt{"2015-11-25"} in \texttt{lines}}} + \underbrace{c}_\text{the frequency of \texttt{"2015-11-28"} in \texttt{ports}} = 41So, our three equations are:
\begin{align*} a + b + c &= 17 & (1) \\ a + 4b & = 29 & (2) \\ 29 + 2 + 7 + c &= 41 & (3) \end{align*}
Let’s work our way to a solution.
So, a = 9, b = 5, and c = 3, meaning that:
ports["date"]
equal to "2015-11-26"
is 9,ports["date"]
equal to "2015-11-27"
is 5, andports["date"]
equal to "2015-11-28"
is 3.
The average score on this problem was 52%.
How many values in ports["date"]
are equal to "2015-11-27"
?
Answer 5.
See the solution to part 3.
The average score on this problem was 55%.
How many values in ports["date"]
are equal to "2015-11-28"
?
Answer 3.
See the solution to part 3.
The average score on this problem was 65%.
Consider the following SQL query.
SELECT author, AVG(overall) AS average_overall FROM reviews
GROUP BY author
HAVING AVG(overall) >= 8 AND COUNT(*) >= 5
ORDER BY average_overall;
Now, consider the following four code blocks.
= lambda x: x["overall"].mean() >= 8 and x.shape[0] >= 5
f = (reviews.groupby("author").filter(f)
pop_auth "author")[["overall"]].mean()
.groupby("overall").reset_index()) .sort_values(
= lambda x: x.shape[0] >= 5
g = (reviews[reviews["overall"] >= 8]
pop_auth "author").filter(g)
.groupby("author")[["overall"]].mean()
.groupby("overall").reset_index()) .sort_values(
= (reviews["author"].value_counts().reset_index()
pop_auth
.merge("author")[["overall"]].mean().reset_index(),
reviews.groupby(="author")
onlambda x: x[(x["overall"] >= 8)])
.pipe("overall")
.sort_values(lambda r: r[r["count"] >= 5])
.pipe("author", "overall"]]) [[
= reviews.groupby("author")["overall"].agg(["count", "mean"])
temp = (temp[(temp["count"] >= 5) & (temp["mean"] >= 8)]
pop_auth "mean")
.sort_values("author", "mean"]]) .reset_index()[[
Three of the blocks above correctly assign pop_auth
to a DataFrame that is equivalent to the table outputted by the SQL query at the top of the page (ignoring the names of the resulting columns and the values in the index). One of the options is incorrect.
Which block is incorrect – that is, which block does not assign pop_auth
so that it is equivalent to the result of the SQL query at the top of the page?
Block 1
Block 2
Block 3
Block 4
Answer: Block 2
The provided SQL query finds the name and average "overall"
rating given by all "author"
s who wrote at least 5 reviews and gave an average "overall"
rating of at least 8. Blocks 1, 3, and 4 do this.
The issue with Block 2 is that it first queries for just the rows where the "overall"
rating is over 8 before grouping, using reviews[reviews["overall"] >= 8]
. After this (pandas
) query, we’ve totally lost all of the rows with an "overall"
rating below 8. It’s very possible for "author"
s to have an average "overall"
rating above 8 but have written some reviews with an "overall"
rating below 8, which means that the resulting DataFrame in Block 2 could have incorrect average "overall"
ratings for some "author"
s. It could even be missing some "author"
s entirely.
For example, suppose we have two "author"
s, A and B, who gave out the following "overall"
ratings:
Both A and B meet the criteria to be included in the the output DataFrame – they have written at least 5 reviews and have an average "overall"
rating of at least 8. But, after querying out all of the rows with an "overall"
rating under 8, B is reduced to just having the 4 values 8, 9, 10, 10. Since B no longer fits the criteria in filter(g)
, it won’t be included in the output.
The average score on this problem was 73%.
Suppose we define soup
to be a BeautifulSoup object that is instantiated using the HTML document below. (To save space, we’ve omitted the tags <html>
and </html>
.)
<title>Aalborg Airport Customer Reviews - SKYTRAX</title>
<a href="https://www.airlinequality.com" rel="nofollow">
<img src="https://www.airlinequality.com/images/skytrax.svg" alt="SKYTRAX">
</a>
<div class="skytrax-ratings">
<table class="review-ratings">
<tr>
<td class="terminal-cleanliness">Terminal Cleanliness</td>
<td>4</td>
</tr>
<tr>
<td class="food-beverages">Food Beverages</td>
<td>9</td>
</tr>
<tr>
<td class="airport-staff">SKYTRAX Staff</td>
<td>3</td>
</tr>
</table>
</div>
In parts 1 and 2, fill in the blanks so that each expression evaluates to "SKYTRAX"
.
"alt") soup.__(i)__(
Answer:
"img").get("alt") soup.find(
The existing "alt"
argument tells us that we’re looking to extract the alt
attribute’s value from this tag:
<img src="https://www.airlinequality.com/images/skytrax.svg" alt="SKYTRAX">
The tag containing this attribute is the very first img
tag in the document, so soup.find("img")
will navigate to this tag. To access the value of this tag’s img
attribute, we use .get("alt")
on the tag, so putting it all together, we have soup.find("img").get("alt")
.
The average score on this problem was 64%.
"td", __(i)__).text.__(ii)__ soup.find(
Answer:
"td", class_="airport-staff").text.split()[0] soup.find(
or
"td", attrs={"class": "airport-staff"}).text[:7] soup.find(
or some combination.
Here, we’re told the information we’re looking for must be in some td
tag. There’s only one td
tag that contains "SKYTRAX"
anywhere, and it’s this one:
<td class="airport-staff">SKYTRAX Staff</td>
To navigate to this td
tag specifically, we can ask for a td
tag whose class
attribute is "airport-staff"
. Two ways to do that are:
soup.find("td", class_="airport-staff")
soup.find("td", attrs={"class": "airport-staff"})
Once we’ve navigated to that tag, its text
attribute evaluates to "SKYTRAX Staff"
(the text
attribute of a tag gives us back all of the text between <tag>
and </tag>
). Then, it’s a matter of extracting "SKYTRAX"
from "SKYTRAX Staff"
, which can either be done by:
soup.find("td", class_="airport-staff").text.split()[0]
, orsoup.find("td", class_="airport-staff").text[:7]
.
The average score on this problem was 82%.
Fill in the blanks so that avg_rating
evaluates to the mean of the three ratings in the document above. (In the document, the three ratings are 4, 9, and 3, but you shouldn’t hard-code these values.)
= [tag.text for tag in soup.find_all("td")]
texts = np.mean([__(i)__ for j in range(__(ii)__)]) avg_rating
Answer:
= np.mean([int(texts[j]) for j in range(1, 6, 2)]) avg_rating
texts
evaluates to a list containing the text between each td
tag, all as strings. Here, that would be:
= ["Terminal Cleanliness", "4", "Food Beverages", "9", "SKYTRAX Staff", "3"] texts
The purpose of the list comprehension in the second line, [__(i)__ for j in range(__(ii)__)]
, is to extract out just the numbers from this list.
range(1, 6, 2)
evaluates to a list (technically, a range object) with the values [1, 3, 5]
, so iterating over range(1, 6, 2)
will allow us to access the positions of each number.[j for j in range(1, 6, 2)]
gives us [1, 3, 5]
, and [texts[j] for j in range(1, 6, 2)]
gives us ["4", "9", "3"]
.texts[j]
value either to an int or float. int(texts[j])
does this for us. Putting all steps together gives us:int(texts[j]) for j in range(1, 6, 2)] [
Note that a common incorrect answer was:
int) for j in range(1, 6, 2)] [texts[j].astype(
But, astype
is a Series method, meaning that whatever comes before .astype(int)
must be a Series, and each individual texts[j]
value is a single string, not a Series of strings.
The average score on this problem was 48%.
All airports have IATA codes, which are strings of three uppercase letters that uniquely identify the airport. For example, Detroit’s IATA code is "DTW"
. Many of the entries in the "content"
column of reviews
use IATA codes to describe the route flown.
We define a route string as being a pair of IATA codes separated by "-"
or " to "
(pay close attention to the spacing). For example, in test_review
there are two route strings, "YWG-LHR"
and "EDI to YWG"
:
= """I recently traveled YWG-LHR and returned home from
test_review EDI to YWG with AC Rouge. I must say I was
pleasantly surprised with how well the trip went
in both directions, but I'm glad I didn't have to
transit in ORD."""
Write a regular expression that matches valid route strings.
Answer: [A-Z]{3}( to |-)[A-Z]{3}
Interact with the solution on regex101 at this link:
Here, we need to match all sequences with:
[A-Z]{3}
, followed by" to "
or "-"
, matched by ( to |-)
, followed by[A-Z]{3}
.Some refreshers:
[A-Z]
character class matches any uppercase letter.( to |-)
, the spaces and -
are literals, and |
is the “or” operator. Parentheses are needed so that the “or” operation only applies to to
and -
, not everything before and after the |
.
The average score on this problem was 91%.
Consider the string thoughts
, defined below:
= """Brought my Sony A5000 for my trip
thoughts the Airbus A220v300's bathroom has a window
My favorite jet is the B737-900ER flew it like x 239
all stan the queen B747, 120% the B333st
I like the chonky A380
but I love the A350"""
There are exactly 5 valid plane codes in thoughts
: "A220"
, "B737"
, "B747"
, "A380"
, and "A350"
. For each regular expression exp
below,
re.findall(exp, thoughts)
extracts in their entirety, andre.findall(exp, thoughts)
also extracts.The first example has been done for you.
= r"B\d{3}" exp
Even though the answer was already provided for you, we’ve illustrated it here:
The regular expression says, match any sequence of characters with a "B"
followed by any three digits. This correctly captures "B737"
and "B747"
, but incorrectly captures "B333"
in line 4, which is not a valid plane code.
Interact with the regular expression on regex101 at this link.
= r"[AB]\d{3}" exp
The regular expression says, match any sequence of characters with an "A"
or "B"
followed by any three digits. This correctly captures all 5 plane codes, but also matches "A500"
in line 1 and "B333"
in line 4.
Interact with the solution on regex101 at this link.
The average score on this problem was 84%.
= r"A\d{2}0|B7\d7" exp
Answer:
This regular expression says, match any sequence that either looks like:
"A"
, followed by any two digits, followed by a "0"
, or"B"
, followed by a "7"
, followed by any digit, followed by another "7"
.This correctly captures all 5 plane codes, but also matches "A500"
. Unlike the previous regular expression, it doesn’t match "B333"
.
Interact with the solution on regex101 at this link.
The average score on this problem was 80%.
= r"A[23]\d0|B7\d7" exp
Answer:
This regular expression says, match any sequence that looks like:
"A"
, followed by either a "2"
or a "3"
, followed by any digit, followed by a "0"
, or"B"
, followed by a "7"
, followed by any digit, followed by another "7"
.This correctly captures all 5 plane codes, and nothing else! "B333"
still isn’t matched, and "A500"
isn’t matched either, because we’ve required that the digit right after "A"
must either be "2"
or "3"
, disqualifying the "5"
in "A500"
.
Interact with the solution on regex101 at this link.
The average score on this problem was 79%.
= r"(A[23]\d0|B7\d7)$" exp
Answer:
The difference between this regular expression and the last one is the anchor at the end, $
. This requires the pattern being matched to be at the end of a line. While are 5 valid plane codes in thoughts
, only two are at the end of their respective lines: "A380"
in line 5 and "A350"
in line 6. So, those are the only two plane codes that are matched.
However, the default behavior of re.findall
in Python is that it treats $
as an end-of-string anchor, rather than an end-of-line character. So, in Python, only the final plane code of "A350"
is extracted by default.
So, we gave full credit to both 1 and 2 as answers.
Interact with the solution on regex101 at this link.
The average score on this problem was 80%.
The following bag of words matrix represents the frequencies of various words across four reviews. Assume, just for this question, that the only five words that appear in any review are the columns of this matrix, and all calculations use a base-2 logarithm.
The matrix has three unknown integer values, A, B, and C. However, we’re given that:
"movies"
in Review 0 is \frac{5}{4}."landing"
in Review 2 is \frac{1}{5}.What is the value of A? Show your work, and \boxed{\text{box}} your final answer.
Answer: 10.
The information we need to answer this part comes from the first bullet point:
The TF-IDF of
"movies"
in Review 0 is \frac{5}{4}.
Recall, the TF-IDF of "movies"
in Review 0 is:
\begin{align*} \text{tfidf}(\text{``movies"}, \text{Review 0}) &= \text{tf}(\text{``movies"}, \text{Review 0}) \cdot \text{idf}(\text{``movies"}) \\ &= \frac{ \text{number of occurrences of ``movies" in Review 0}}{\text{total number of terms in Review 0}} \cdot \log_2 \left( \frac{\text{total number of reviews}}{\text{number of reviews in which ``movies" appears}} \right) \end{align*}
Here, we have:
"movies"
in Review 0 is A."movies"
occurs is 1 – it only occurs in Review 0."movies"
in Review 0 is, as the question tells us, \frac{5}{4}.Putting all of this together, we have:
\begin{align*}\frac{5}{4} &= \frac{A}{6+A} \cdot \underbrace{\log_2\left(\frac{4}{1}\right)}_{\text{this is } 2\text{, because }2^2 = 4} \\ \frac{5}{4} &= \frac{A}{6+A} \cdot 2 \\ \frac{5}{8} &= \frac{A}{6+A} \\ 5(6+A) &= 8A \\ 30 + 5A &= 8A \\ 30 &= 3A \\ A &= \boxed{10} \end{align*}
The average score on this problem was 83%.
What is the value of B? Show your work, and \boxed{\text{box}} your final answer.
Answer: 3.
The information we need to answer this part comes from the last bullet point:
The cosine similarity between the bag of words vectors for Reviews 1 and 2 is \frac{16}{21}.
Recall, the cosine similarity between two vectors \vec u and \vec v is:
\text{cosine similarity}(\vec u, \vec v) = \frac{\vec u \cdot \vec v}{\lVert \vec u \rVert \lVert \vec v \rVert}
Here, let \vec{r_1} be the row vector for Review 1 and let \vec r_2 be the row vector for Review 2.
\vec r_1 \cdot \vec r_2 = \begin{bmatrix} 0 \\ 6 \\ B \\ 2 \\ 0 \end{bmatrix} \cdot \begin{bmatrix} 4 \\ 8 \\ 0 \\ 8 \\ 0 \end{bmatrix} = 6 \cdot 8 + 2 \cdot 8 = 48 + 16 = 64.
\lVert \vec r_1 \rVert = \sqrt{0^2 + 6^2 + B^2 + 2^2 + 0^2} = \sqrt{B^2 + 40}.
\lVert \vec r_2 \rVert = \sqrt{4^2 + 8^2 + 0^2 + 8^2 + 0^2} = \sqrt{16 + 64 + 64} = \sqrt{144} = 12.
So, we have:
\begin{align*} \text{cosine similarity}(\vec r_1, \vec r_2) &= \frac{\vec r_1 \cdot \vec r_2}{\lVert \vec r_1 \rVert \lVert \vec r_2 \rVert} \\ \frac{16}{21} &= \frac{64}{12\sqrt{B^2 + 40}} \\ \frac{64}{84} &= \frac{64}{12\sqrt{B^2 + 40}} \\ 84 &= 12 \sqrt{B^2 + 40} \\ 7 &= \sqrt{B^2 + 40} \\ 49 &= B^2 + 40 \\ 9 &= B^2 \\ B = \boxed{3} \end{align*}
The average score on this problem was 67%.
What is the value of C? Show your work, and \boxed{\text{box}} your final answer.
Answer: 0.
The information we need to answer this part comes from the second bullet point:
The TF-IDF of
"landing"
in Review 2 is \frac{1}{5}.
As in part 1, we’ll look at the formula for the TF-IDF:
\text{tfidf}(\text{``landing"}, \text{Review 0}) = \frac{ \text{number of occurrences of ``landing" in Review 2}}{\text{total number of terms in Review 2}} \cdot \log_2 \left( \frac{\text{total number of reviews}}{\text{number of reviews in which ``landing" appears}} \right)
Here:
"landing"
in Review 2 is 4."landing"
appears is either 2, if C = 0, or 3, if C > 0.Putting together what we know, we have:
\frac{1}{5} = \frac{5}{20} \cdot \log_2 \left( \frac{4}{\text{number of reviews in which ``landing" appears}} \right)
Note that \frac{5}{20} = \frac{1}{4}, so:
\frac{1}{5} = \frac{1}{5} \cdot \log_2 \left( \frac{4}{\text{number of reviews in which ``landing" appears}} \right)
In order for the above to be true, we must have that:
\log_2 \left( \frac{4}{\text{number of reviews in which ``landing" appears}} \right) = 1 \implies \left( \frac{4}{\text{number of reviews in which ``landing" appears}} \right) = 2
This gives us that:
\text{number of reviews in which ``landing" appears} = 2
So, it must be the case that C = \boxed{0}, because there are already two reviews in which "landing"
appears (Reviews 0 and 2), and if C > 0, then "landing"
would be in three reviews, violating this constraint.
The average score on this problem was 67%.