Midterm Review, Day 1



The problems in this worksheet are taken from past exams in similar classes. Work on them on paper, since the exams you take in this course will also be on paper.

We encourage you to attempt these problems before Sunday’s exam review session, so that we have enough time to walk through the solutions to all of the problems.

We will enable the solutions here after the review session, though you can find the written solutions to these problems in other discussion worksheets.


Problem 1

The EECS 398 staff are looking into hotels — some in San Diego, for their family to stay at for graduation (and to eat Mexican food), and some elsewhere, for summer trips.

Each row of hotels contains information about a different hotel in San Diego. Specifically, for each hotel, we have its "Hotel Name", its "Chain", its "Location" (the neighborhood it is in), and its "Number of Rooms".

The first few rows of hotels are shown below, but hotels has many more rows than are shown.

Now, consider the variable summed, defined below.

summed = hotels.groupby("Chain")["Number of Rooms"].sum().idxmax()


Problem 1.1

What is type(summed)?

Answer: str

When we group hotels by the "Chain" column, the unique values of the Chain column become the index of the DataFrame or Series we get as output, in this case the Series hotels.groupby("Chain")["Number of Rooms"].sum().

Since the values of Chain are strings, and since .idxmax() will return a value from the index of the aforementioned Series, summed is a string.
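To make the types concrete, here is a minimal sketch with a small, made-up hotels DataFrame (the chains and room counts are hypothetical):

    import pandas as pd

    # A tiny, hypothetical version of hotels, just for illustration.
    hotels = pd.DataFrame({
        "Chain": ["Hilton", "Marriott", "Hilton", "Hyatt"],
        "Number of Rooms": [200, 360, 150, 300],
    })

    totals = hotels.groupby("Chain")["Number of Rooms"].sum()
    # totals is a Series indexed by chain names (strings):
    # Hilton 350, Hyatt 300, Marriott 360

    summed = totals.idxmax()
    print(summed, type(summed))  # Marriott <class 'str'>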


Problem 1.2

In one sentence, explain what the value of summed means. Phrase your explanation as if you had to give it to someone who is not a data science major; that is, don’t say something like “it is the result of grouping hotels by "Chain", selecting the "Number of Rooms" column, …”, but instead, give the value context.

Answer: summed is the name of the hotel chain with the most total rooms.

The result of the .groupby() and .sum() is a Series indexed by the unique Chains, whose values are the total number of rooms in hotels owned by each chain. The idxmax() function gets the index corresponding to the largest value in the Series, which will be the hotel chain name with the most total rooms.


Problem 1.3

Consider the variable curious, defined below.

curious = frame["Chain"].value_counts().idxmax()

Fill in the blank: curious is guaranteed to be equal to summed only if frame has one row for every ____ in San Diego.

Answer: hotel room

curious gets the most common value of Chain in the DataFrame frame. We already know that summed is the hotel chain with the most rooms in San Diego, so curious only equals summed if the most common Chain in frame is the hotel chain with the most total rooms; this occurs when each row of frame is a single hotel room.


Problem 1.4

Fill in the blanks so that popular_areas is an array of the names of the unique neighborhoods that have at least 5 hotels and at least 1000 hotel rooms.

    f = lambda df: __(i)__
    popular_areas = (hotels
                    .groupby(__(ii)__)
                    .__(iii)__
                    __(iv)__)
  1. What goes in blank (i)?

  2. What goes in blank (ii)?

  3. What goes in blank (iii)?

  4. What goes in blank (iv)?

Answers:

  1. df.shape[0] >= 5 and df["Number of Rooms"].sum() >= 1000
  2. "Location"
  3. filter(f)
  4. ["Location"].unique() or equivalent

We’d like to only consider certain neighborhoods according to group characteristics (having at least 5 hotels and at least 1000 hotel rooms), and .filter() allows us to do that by excluding groups not meeting those criteria. So, we can write a function that evaluates those criteria on one group at a time (the df passed into f is the subset of hotels containing just one "Location" value), and calling .filter(f) keeps only the rows from hotels in neighborhoods that satisfy those criteria. Finally, all we have to do is get the unique neighborhoods from this filtered DataFrame, which are the neighborhoods for which f returned True.
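Putting the four blanks together, the completed code looks like this:

    f = lambda df: df.shape[0] >= 5 and df["Number of Rooms"].sum() >= 1000
    popular_areas = (hotels
                    .groupby("Location")
                    .filter(f)
                    ["Location"].unique())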


Problem 1.5

Consider the code below.

    cond1 = hotels["Chain"] == "Marriott"
    cond2 = hotels["Location"] == "Coronado"
    combined = hotels[cond1].merge(hotels[cond2], on="Hotel Name", how=???)
  1. If we replace ??? with "inner" in the code above, which of the following will be equal to combined.shape[0]?

  2. If we replace ??? with "outer" in the code above, which of the following will be equal to combined.shape[0]?

Answers:

  1. (cond1 & cond2).sum()
  2. cond1.sum() + cond2.sum() - (cond1 & cond2).sum()

Note that cond1 and cond2 are Boolean Series, and hotels[cond1] and hotels[cond2] are the subsets of hotels where "Chain" is equal to "Marriott" and where "Location" is equal to "Coronado", respectively.

  1. When we perform an inner merge, we’re selecting every row where a Hotel Name appears in both hotels[cond1] and hotels[cond2]. This is the same set of indices (and therefore hotel names, since those are unique) as where (cond1 & cond2) == True. So, the length of combined will be the same as the number of Trues in (cond1 & cond2).

  2. When we perform an outer merge, we’re selecting every row that appears in either DataFrame, although there will not be repeats for hotels that are both Marriott properties and are in Coronado. So, to find the total number of rows in either DataFrame, we take the sum of the sizes of each, and subtract rows that appear in both, which corresponds to answer cond1.sum() + cond2.sum() - (cond1 & cond2).sum().
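To see both counting arguments in action, here is a minimal sketch with made-up data (the hotel names and values below are hypothetical, and "Hotel Name" values are unique, as in the problem):

    import pandas as pd

    # Hypothetical subset of hotels; "Hotel Name" values are unique.
    hotels = pd.DataFrame({
        "Hotel Name": ["A", "B", "C", "D"],
        "Chain":      ["Marriott", "Marriott", "Hilton", "Marriott"],
        "Location":   ["Coronado", "La Jolla", "Coronado", "Coronado"],
    })

    cond1 = hotels["Chain"] == "Marriott"     # True for A, B, D
    cond2 = hotels["Location"] == "Coronado"  # True for A, C, D

    inner = hotels[cond1].merge(hotels[cond2], on="Hotel Name", how="inner")
    outer = hotels[cond1].merge(hotels[cond2], on="Hotel Name", how="outer")

    print(inner.shape[0], (cond1 & cond2).sum())  # 2 2
    print(outer.shape[0],
          cond1.sum() + cond2.sum() - (cond1 & cond2).sum())  # 4 4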



Problem 2

Billina Records, a new record company focused on creating new TikTok audios, has its offices on the 23rd floor of a skyscraper with 75 floors (numbered 1 through 75). The owners of the building promised that 10 different random floors will be selected to be renovated.

Below, fill in the blanks to complete a simulation that will estimate the probability that Billina Records’ floor will be renovated.

total = 0
repetitions = 10000
for i in np.arange(repetitions):
    choices = np.random.choice(__(a)__, 10, __(b)__)
    if __(c)__:
        total = total + 1
prob_renovate = total / repetitions

What goes in blank (a)?

What goes in blank (b)?

What goes in blank (c)?

Answer: np.arange(1, 76), replace=False, np.count_nonzero(choices == 23) > 0

Here, the idea is to randomly choose 10 different floors repeatedly, and each time, check if floor 23 was selected.

Blank (a): The first argument to np.random.choice needs to be an array/list containing the options we want to choose from, i.e. an array/list containing the values 1, 2, 3, 4, …, 75, since those are the numbers of the floors. np.arange(a, b) returns an array of integers spaced out by 1 starting from a and ending at b-1. As such, the correct call to np.arange is np.arange(1, 76).

Blank (b): Since we want to select 10 different floors, we need to specify replace=False (the default behavior is replace=True).

Blank (c): The if condition needs to check if 23 was one of the 10 numbers that were selected, i.e. if 23 is in choices. It needs to evaluate to a single Boolean value, i.e. True (if 23 was selected) or False (if 23 was not selected). Let’s go through each incorrect option to see why it’s wrong:

  • Option 1, choices == 23, does not evaluate to a single Boolean value; rather, it evaluates to an array of length 10, containing multiple Trues and Falses.
  • Option 2, choices is 23, does not evaluate to what we want – it checks to see if the array choices is the same Python object as the number 23, which it is not (and will never be, since an array cannot be a single number).
  • Option 4, np.count_nonzero(choices) == 23, does evaluate to a single Boolean, however it is not quite correct. np.count_nonzero(choices) will always evaluate to 10, since choices is made up of 10 integers randomly selected from 1, 2, 3, 4, …, 75, none of which are 0. As such, np.count_nonzero(choices) == 23 is the same as 10 == 23, which is always False, regardless of whether or not 23 is in choices.
  • Option 5, choices.str.contains(23), errors, since choices is not a Series (and .str can only follow a Series). If choices were a Series, this would still error, since the argument to .str.contains must be a string, not an int.

By process of elimination, Option 3, np.count_nonzero(choices == 23) > 0, must be the correct answer. Let’s look at it piece-by-piece:

  • As we saw in Option 1, choices == 23 is a Boolean array that contains True each time the selected floor was floor 23 and False otherwise. (Since we’re sampling without replacement, floor 23 can only be selected at most once, and so choices == 23 can only contain the value True at most once.)
  • np.count_nonzero(choices == 23) evaluates to the number of Trues in choices == 23. If it is positive (i.e. 1), it means that floor 23 was selected. If it is 0, it means floor 23 was not selected.
  • Thus, np.count_nonzero(choices == 23) > 0 evaluates to True if (and only if) floor 23 was selected.
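For reference, here is the completed simulation with the blanks filled in (assuming numpy has been imported as np). prob_renovate should come out close to the true probability, 10/75 ≈ 0.13.

    total = 0
    repetitions = 10000
    for i in np.arange(repetitions):
        # Select 10 *different* floors out of 1, 2, ..., 75.
        choices = np.random.choice(np.arange(1, 76), 10, replace=False)
        # Check whether floor 23 was among the selected floors.
        if np.count_nonzero(choices == 23) > 0:
            total = total + 1
    prob_renovate = total / repetitions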

Problem 3

Suppose the DataFrame today consists of 15 rows — 3 rows for each of 5 different "artist_names". For each artist, it contains the "track_name" for their three most-streamed songs today. For instance, there may be one row for "olivia rodrigo" and "favorite crime", one row for "olivia rodrigo" and "drivers license", and one row for "olivia rodrigo" and "deja vu".

Another DataFrame, genres, is shown below in its entirety.


Problem 3.1

Suppose we perform an inner merge between today and genres on "artist_names". If the five "artist_names" in today are the same as the five "artist_names" in genres, what fraction of the rows in the merged DataFrame will contain "Pop" in the "genre" column? Give your answer as a simplified fraction.

Answer: \frac{2}{5}

If the five "artist_names" in today and genres are the same, the DataFrame that results from an inner merge will have 15 rows, one for each row in today. This is because there are 3 matches for "harry styles", 3 matches for "olivia rodrigo", 3 matches for "glass animals", and so on.

In the merged DataFrame’s 15 rows, 6 of them will correspond to "Pop" artists — 3 to "harry styles" and 3 to "olivia rodrigo". Thus, the fraction of rows that contain "Pop" in the "genre" column is \frac{6}{15} = \frac{2}{5} (which is the fraction of rows that contained "Pop" in genres["genre"], too).


Problem 3.2

Suppose we perform an inner merge between today and genres on "artist_names". Furthermore, suppose that the only overlapping "artist_names" between today and genres are "drake" and "olivia rodrigo". What fraction of the rows in the merged DataFrame will contain "Pop" in the "genre" column? Give your answer as a simplified fraction.

Answer: \frac{1}{2}

If we perform an inner merge, there will only be 6 rows in the merged DataFrame — 3 for "olivia rodrigo" and 3 for "drake". 3 of those 6 rows will have "Pop" in the "genre" column, hence the answer is \frac{3}{6} = \frac{1}{2}.


Problem 3.3

Suppose we perform an outer merge between today and genres on "artist_names". Furthermore, suppose that the only overlapping "artist_names" between today and genres are "drake" and "olivia rodrigo". What fraction of the rows in the merged DataFrame will contain "Pop" in the "genre" column? Give your answer as a simplified fraction.

Answer: \frac{2}{9}

Since we are performing an outer merge, we can decompose the rows in the merged DataFrame into three groups:

  • Rows that are in today that are not in genres. There are 9 of these (3 each for the 3 artists that are in today and not genres). today doesn’t have a "genre" column, and so all of these "genre"s will be NaN upon merging.

  • Rows that are in genres that are not in today. There are 3 of these — one for "harry styles", one for "glass animals", and one for "doja cat". 1 of these 3 ("harry styles") has "Pop" in the "genre" column.

  • Rows that are in both today and genres. There are 6 of these — 3 for "olivia rodrigo" and 3 for "drake" — and 3 of those rows contain "Pop" in the "genre" column.

Tallying things up, we see that there are 9 + 3 + 6 = 18 rows in the merged DataFrame overall, of which 0 + 1 + 3 = 4 have "Pop" in the "genre" column. Hence, the relevant fraction is \frac{4}{18} = \frac{2}{9}.
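As a sanity check, here is a small sketch that reconstructs this scenario. The three artists in today that don't appear in genres, and the non-"Pop" genre labels, are made up, since they aren't specified in the problem:

    import pandas as pd

    today = pd.DataFrame({
        "artist_names": (["drake"] * 3 + ["olivia rodrigo"] * 3
                         + ["artist x"] * 3 + ["artist y"] * 3
                         + ["artist z"] * 3),
        "track_name": [f"track {i}" for i in range(15)],
    })

    genres = pd.DataFrame({
        "artist_names": ["drake", "olivia rodrigo", "harry styles",
                         "glass animals", "doja cat"],
        # Only "olivia rodrigo" and "harry styles" are "Pop";
        # the other genre labels here are hypothetical.
        "genre": ["Hip-Hop/Rap", "Pop", "Pop", "Alternative", "R&B"],
    })

    merged = today.merge(genres, on="artist_names", how="outer")
    print(merged.shape[0])                   # 18
    print((merged["genre"] == "Pop").sum())  # 4, so the fraction is 4/18 = 2/9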



Problem 4

The DataFrame random_10 contains the "track_name" and "genre" of 10 randomly-chosen songs in Spotify’s Top 200 today, along with their "genre_rank", which is their rank in the Top 200 among songs in their "genre". For instance, "the real slim shady" is the 20th-ranked Hip-Hop/Rap song in the Top 200 today. random_10 is shown below in its entirety.

The "genre_rank" column of random_10 contains missing values. Below, we provide four different imputed "genre_rank" columns, each of which was created using a different imputation technique. On the next page, match each of the four options to the imputation technique that was used in the option.

Note that each option (A, B, C, D) should be used exactly once across Problems 4.1 through 4.4.


Problem 4.1

In which option was unconditional mean imputation used?

Answer: Option B

Explanation given in Problem 4.4 below.


Problem 4.2

In which option was mean imputation conditional on "genre" used?

Answer: Option D

Explanation given in Problem 4.4 below.


Problem 4.3

In which option was unconditional probabilistic imputation used?

Answer: Option C

Explanation given in Problem 4.4 below.


Problem 4.4

In which option was probabilistic imputation conditional on "genre" used?

Answer: Option A

  • First, note that in Option B, all three missing values are filled in with the same number, 7. The mean of the observed values in random_10["genre_rank"] is 7, so we must have performed unconditional mean imputation in Option B. (Technically, it’s possible for Option B to be the result of unconditional probabilistic imputation, but we stated that each option could only be used once, and there is another option that can only be unconditional probabilistic imputation.)

  • Then note that in Option C, the very last missing value (in the "Pop" "genre") is filled in with a 7, which is not the mean of the observed "Pop" values, but rather a value from the "Alternative" "genre". This must mean that unconditional probabilistic imputation was used in Option C, since that’s the only way a value from a different group can be used for imputation (if we are not performing some sort of mean imputation).

  • This leaves Option A and Option D. The last two missing values (the two in the "Pop" "genre") are both filled in with the same value, 2 in Option A and 5 in Option D. The mean of the observed values for the "Pop" "genre" is \frac{9+2+4}{3} = 5, so mean imputation conditional on "genre" must have been used in Option D and thus probabilistic imputation conditional on "genre" must have been used in Option A.
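If you'd like to see how these four techniques might be implemented, here is a minimal sketch in pandas. The DataFrame below is made up for illustration, and the sketch assumes every "genre" has at least one observed value:

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({
        "genre": ["Pop", "Pop", "Pop", "Alternative", "Alternative"],
        "genre_rank": [9.0, np.nan, 4.0, 7.0, np.nan],
    })

    # Unconditional mean imputation: fill every NaN with the overall mean.
    uncond_mean = df["genre_rank"].fillna(df["genre_rank"].mean())

    # Mean imputation conditional on "genre": fill NaNs with the group mean.
    cond_mean = df.groupby("genre")["genre_rank"].transform(
        lambda s: s.fillna(s.mean()))

    def prob_impute(s):
        # Fill each NaN with a randomly-chosen *observed* value from s.
        s = s.copy()
        s[s.isna()] = np.random.choice(s.dropna(), size=s.isna().sum())
        return s

    # Unconditional probabilistic imputation: sample from the whole column.
    uncond_prob = prob_impute(df["genre_rank"])

    # Probabilistic imputation conditional on "genre": sample within groups.
    cond_prob = df.groupby("genre")["genre_rank"].transform(prob_impute)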



Problem 5

You want to use regular expressions to extract out the number of ounces from the 5 product names below.

Index | Product Name                             | Expected Output
------|------------------------------------------|----------------
0     | Adult Dog Food 18-Count, 3.5 oz Pouches  | 3.5
1     | Gardetto’s Snack Mix, 1.75 Ounce         | 1.75
2     | Colgate Whitening Toothpaste, 3 oz Tube  | 3
3     | Adult Dog Food, 13.2 oz. Cans 24 Pack    | 13.2
4     | Keratin Hair Spray 2!6 oz                | 6

The names are stored in a pandas Series called names. For each snippet below, select the indexes for all the product names that will not be matched correctly.


Problem 5.1

For the snippet below, which indexes correspond to products that will not be matched correctly?

regex = r'([\d.]+) oz'
names.str.findall(regex)

Answer: Only product 1 will not be matched; the pattern only looks for “oz” so the “Ounce” is not identified. See this Regex101 link for an illustration.


Problem 5.2

For the snippet below, which indexes correspond to products that will not be matched correctly?

regex = r'(\d+?.\d+) oz|Ounce'
names.str.findall(regex)

Answer: 1, 2, and 4 are either not matched at all or matched incorrectly. This pattern matches either numbers + any character + numbers + a space + “oz”, OR “Ounce”. However, it fails in the following ways:

  • For index 1, the pattern matches too little: only “Ounce” when it should match “1.75 Ounce”.
  • For index 2, “3 oz” is missed entirely, as the number is a single digit.
  • For index 4, the pattern matches too much: “2!6 oz” when it should match only “6 oz”.

See this Regex101 link for an interactive illustration.
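You can also reproduce both behaviors directly. A quick check, assuming pandas is imported:

    import pandas as pd

    names = pd.Series([
        "Adult Dog Food 18-Count, 3.5 oz Pouches",
        "Gardetto's Snack Mix, 1.75 Ounce",
        "Colgate Whitening Toothpaste, 3 oz Tube",
        "Adult Dog Food, 13.2 oz. Cans 24 Pack",
        "Keratin Hair Spray 2!6 oz",
    ])

    print(names.str.findall(r'([\d.]+) oz'))
    # 0: ['3.5'], 1: [] (no "oz"), 2: ['3'], 3: ['13.2'], 4: ['6']

    print(names.str.findall(r'(\d+?.\d+) oz|Ounce'))
    # 0: ['3.5'], 1: [''] (the "Ounce" branch has no capturing group),
    # 2: [] (a single digit can't match \d+?.\d+), 3: ['13.2'], 4: ['2!6']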



Problem 6


Problem 6.1

Rahul is trying to scrape the website of an online bookstore, ‘The Book Club’. The HTML of the page is shown below.

<HTML>
<H1>The Book Club</H1>
<BODY BGCOLOR="FFFFFF">
Email us at <a href="mailto:support@thebookclub.com">
support@thebookclub.com</a>.

<div>
    <ol class="row">
    <li class="book_list">
    
        <article class="product_pod">
            <div class="image_container">
                    <img src="pic1.jpeg" alt="A Light in the Attic" 
                    class="thumbnail">
            </div>
            
            <p class="star-rating Three"></p>
            
            <h3>
            <a href="cat/index.html" title="A Light in the Attic">
            A Light in the Attic
            </a>
            </h3>
        
            <div class="product_price">
                <p class="price_color">£51.77</p>
                
                <p class="instock availability">
                    <i class="icon-ok"></i>
                    In stock
                </p>
        
            </div>
        </article>
    </li>
    </ol>

</div>
</BODY>
</HTML>

Which is the equivalent Document Object Model (DOM) tree of this HTML file?

Answer: Tree D

Following tree D in the image from top to bottom, we can follow the nesting of tags in the HTML file to verify that the DOM tree D matches the syntax of the HTML file.


Problem 6.2

Rahul wants to extract the ‘instock availability’ status of the book titled ‘A Light in the Attic’. Which of the following expressions will evaluate to "In stock"? Assume that Rahul has already parsed the HTML into a BeautifulSoup object stored in the variable named soup.

Code Snippet A

    soup.find('p',attrs = {'class': 'instock availability'})\
    .get('icon-ok').strip()

Code Snippet B

    soup.find('p',attrs = {'class': 'instock availability'}).text.strip()

Code Snippet C

    soup.find('p',attrs = {'class': 'instock availability'}).find('i')\
    .text.strip()

Code Snippet D

    soup.find('div', attrs = {'class':'product_price'})\
    .find('p',attrs = {'class': 'instock availability'})\
    .find('i').text.strip()

Answer: Code Snippet B

Code Snippet B is the only option that finds the p tag whose class attribute is equal to instock availability and then gets the text contained in that tag, which, after stripping the surrounding whitespace, is 'In stock'.

Option A will cause an error because of .get('icon-ok') since 'icon-ok' is not the name of the attribute, but is instead the value of the class attribute.

Options C and D will both get the text of the i tag, which is '' (the i tag contains no text), and are therefore incorrect.
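Here is a runnable sketch of Code Snippet B, using a trimmed-down excerpt of the page above:

    from bs4 import BeautifulSoup

    html_doc = """
    <article class="product_pod">
      <p class="star-rating Three"></p>
      <div class="product_price">
        <p class="instock availability">
          <i class="icon-ok"></i>
          In stock
        </p>
      </div>
    </article>
    """
    soup = BeautifulSoup(html_doc, 'html.parser')

    # Code Snippet B: the <p> tag's .text includes surrounding whitespace
    # (and the empty text of the nested <i>), so .strip() leaves 'In stock'.
    print(soup.find('p', attrs={'class': 'instock availability'}).text.strip())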


Problem 6.3

Rahul also wants to extract the number of stars that the book titled ‘A Light in the Attic’ received. If you look at the HTML file, you will notice that the book received a star rating of three. Which code snippet will evaluate to "Three"?

Code Snippet A

    soup.find('article').get('class').strip()

Code Snippet B

    soup.find('p').text.split(' ')

Code Snippet C

    soup.find('p').get('class')[1]

None of the above

Answer: Code Snippet C

Code Snippet C finds the first occurrence of the tag p, gets the contents of its class attribute as the list ['star-rating', 'Three'], and returns the element at index 1, which is the rating 'Three', as desired.

Option A will error because .get('class') returns ['product_pod'] and .strip() cannot be used on a list; moreover, the contents of that list do not bring us closer to the desired result.

Option B gets the text contained within the first p tag (which is empty) and splits it on spaces, which evaluates to [''].
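Continuing with the soup from the sketch in Problem 6.2 (multi-valued class attributes come back as a list):

    print(soup.find('p').get('class'))      # ['star-rating', 'Three']
    print(soup.find('p').get('class')[1])   # 'Three'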



Problem 7

Tahseen decides to look at reviews for the same hotel, but he modifies them so that the only terms they contain are "taco" and "sand". The bag-of-words representations of three reviews are shown as vectors below.

Using cosine similarity to measure similarity, which pair of reviews are the most similar? If there are multiple pairs of reviews that are most similar, select them all.

Answer: \vec{r}_1 and \vec{r}_2, and \vec{r}_2 and \vec{r}_3

The cosine similarity of two vectors \vec{a} and \vec{b} is \frac{\vec{a} \cdot \vec{b}}{||\vec{a}|| \cdot ||\vec{b}||}.

  • The cosine similarity of \vec{r}_1 and \vec{r}_2 is: \frac{2 \cdot 4 + 3 \cdot 4}{\sqrt{13} \cdot \sqrt{32}} = \frac{20}{4 \cdot \sqrt{26}} = \frac{5}{\sqrt{26}}
  • The cosine similarity of \vec{r}_1 and \vec{r}_3 is: \frac{2 \cdot 6 + 3 \cdot 4}{\sqrt{13} \cdot \sqrt{52}} = \frac{24}{26} = \frac{12}{13}
  • The cosine similarity of \vec{r}_2 and \vec{r}_3 is: \frac{4 \cdot 6 + 4 \cdot 4}{\sqrt{32} \cdot \sqrt{52}} = \frac{40}{8 \cdot \sqrt{26}} = \frac{5}{\sqrt{26}}

\frac{12}{13} \approx 0.9231 and \frac{5}{\sqrt{26}} \approx 0.9806. Since larger cosine similarities mean more similar vectors, our answer is vector pairs \vec{r}_1, \vec{r}_2 AND \vec{r}_2, \vec{r}_3.
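As a quick numerical check (with \vec{r}_1 = [2, 3], \vec{r}_2 = [4, 4], and \vec{r}_3 = [6, 4], read off the computations above):

    import numpy as np

    r1, r2, r3 = np.array([2, 3]), np.array([4, 4]), np.array([6, 4])

    def cos_sim(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

    print(cos_sim(r1, r2))  # 0.9805... = 5 / sqrt(26)
    print(cos_sim(r1, r3))  # 0.9230... = 12 / 13
    print(cos_sim(r2, r3))  # 0.9805... = 5 / sqrt(26)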

Note that we could’ve answered the question without finding the cosine similarity between \vec{r}_1 and \vec{r}_3. Remember, the cosine similarity between two vectors is the cosine of the angle between the two vectors.

  • The angle between \vec{r}_1 and \vec{r}_3 is clearly bigger than the angle between \vec{r}_1 and \vec{r}_2, because \vec{r}_2 is in between \vec{r}_1 and \vec{r}_3, so we can rule out \vec{r}_1 and \vec{r}_3.
  • The question then boils down to comparing the angle between \vec{r}_1, \vec{r}_2 and the angle between \vec{r}_2, \vec{r}_3. We can compute the cosine similarities of both pairs, as we did above. But, there’s a slightly easier way!
  • Specifically, note that \vec{r}_3 = \begin{bmatrix} 6 \\ 4 \end{bmatrix} = 2 \begin{bmatrix} 3 \\ 2 \end{bmatrix}, i.e. it is a scalar multiple of \begin{bmatrix} 3 \\ 2 \end{bmatrix}, meaning \vec{r}_3 points in the same direction as \begin{bmatrix} 3 \\ 2 \end{bmatrix}, just twice as long. This means that the angle between \vec{r}_2 and \vec{r}_3 is the same as the angle between \vec{r}_2 and \begin{bmatrix} 3 \\ 2 \end{bmatrix} (and remember, if two pairs of vectors have the same angle between them, they have the same cosine similarity).
  • Why does that matter? Because now we’re comparing the cosine similarity between \vec{r}_1 = \begin{bmatrix} 2 \\ 3 \end{bmatrix} and \vec{r}_2 = \begin{bmatrix} 4 \\ 4 \end{bmatrix}, and the cosine similarity between \vec{r}_2 = \begin{bmatrix} 4 \\ 4 \end{bmatrix} and \frac{1}{2}\vec{r}_3 = \begin{bmatrix} 3 \\ 2 \end{bmatrix}.
  • Because of symmetry (work this out yourself!), the two pairs of vectors have the same cosine similarity, and thus the same angle!

Problem 8

You create a table called gums that only contains the chewing gum purchases of df, then you create a bag-of-words matrix called bow_df from the name column of gums. The bow_df matrix is stored as a DataFrame, shown below:

You also have the following outputs:

>>> bow_df.sum(axis=0)
pur            5
gum           41
sugar          2
              ..
90             4
paperboard    22
80            20
Length: 139

>>> bow_df.sum(axis=1)
0     21
1     22
2     22
      ..
37    22
38    10
39    17
Length: 40

>>> bow_df.loc[0, 'pur']
0

>>> (bow_df['paperboard'] > 0).sum()
20

>>> bow_df['gum'].sum()
41

For each question below, write your answer as an unsimplified math expression (no need to simplify fractions or logarithms) in the space provided, or write “Need more information” if there is not enough information provided to answer the question.


Problem 8.1

What is the TF-IDF for the word “pur” in document 0?

Answer: 0

First, it’s worth discussing what information we have.

  1. bow_df.sum(axis=0) computes the sum of each column of bow_df. Each column of bow_df refers to a specific word, so the sum of a column in bow_df tells us the number of occurrences of a particular word across the entire corpus (all documents).
  2. bow_df.sum(axis=1) computes the sum of each row of bow_df. Each row of bow_df refers to a specific document, so the sum of a row in bow_df tells us the number of words in a particular document.
  3. bow_df.loc[0, 'pur'] being 0 tells us that the word "pur" appears 0 times in document 0.
  4. (bow_df["paperboard"] > 0).sum() being 20 means that there are 20 documents that contain the word "paperboard".
  5. bow_df["gum"].sum() being 41 means that "gum" appears 41 times across all documents.

Now, back to the subpart at hand. The TF-IDF of “pur” in document 0 is 0, because bullet point 3 above tells us that “pur” doesn’t occur at all in document 0. This means that the term frequency of “pur” in document 0 is 0, which means the TF-IDF (which is the product of the term frequency and inverse document frequency) of “pur” in document 0 is also 0, because 0 multiplied by the IDF of “pur” must be 0.


Problem 8.2

What is the TF-IDF for the word “gum” in document 0?

Answer: Need more information

Let’s try and compute the TF-IDF of “gum” in document 0. The formula is as follows:

\text{tfidf}(\text{gum}, \text{document 0}) = \text{tf}(\text{gum}, \text{document 0}) \cdot \text{idf}(\text{gum})

= \frac{\# \text{ of words in document 0 equal to gum}}{\# \text{ of words in document 0}} \cdot \log \left( \frac{\text{ total \# of documents}}{\text{total \# of documents containing gum}} \right)

= \frac{1}{21} \cdot \log \left( \frac{40}{???} \right)

We don’t know the number of documents containing “gum” – all we know (from bullet point 5 in the previous solution) is that “gum” appears 41 times across all documents, but we don’t know how many unique documents contain “gum”. So, we need more information.


Problem 8.3

What is the TF-IDF for the word “paperboard” in document 1?

Answer: \frac{1}{22}

Let’s try and compute the TF-IDF of “paperboard” in document 1. The formula is as follows (assuming a base-2 logarithm):

\text{tfidf}(\text{"paperboard"}, \text{document 1}) = \text{tf}(\text{"paperboard"}, \text{document 1}) \cdot \text{idf}(\text{"paperboard"})

= \frac{\text{\# of words in document 1 equal to "paperboard"}}{\text{\# of words in document 1}} \cdot \log \left( \frac{\text{total \# of documents}}{\text{total \# of documents containing "paperboard"}} \right)

= \frac{1}{22} \cdot \log \left( \frac{40}{20} \right) = \frac{1}{22} \cdot \log(2) = \frac{1}{22} \cdot 1 = \frac{1}{22}
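Numerically, a one-line check of the arithmetic above:

    import numpy as np

    tf = 1 / 22              # one "paperboard" among document 1's 22 words
    idf = np.log2(40 / 20)   # 40 documents total; 20 contain "paperboard"
    print(tf * idf, 1 / 22)  # both 0.0454545...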



Problem 9

Consider a dataset of n integers, y_1, y_2, ..., y_n, whose histogram is given below:


Problem 9.1

Which of the following is closest to the constant prediction h^* that minimizes:

\displaystyle \frac{1}{n} \sum_{i = 1}^n \begin{cases} 0 & y_i = h \\ 1 & y_i \neq h \end{cases}

Answer: 30.

The minimizer of empirical risk for the constant model when using zero-one loss is the mode.


Problem 9.2

Which of the following is closest to the constant prediction h^* that minimizes: \displaystyle \frac{1}{n} \sum_{i = 1}^n |y_i - h|

Answer: 7.

The minimizer of empirical risk for the constant model when using absolute loss is the median. If the bar at 30 wasn’t there, the median would be 6, but the existence of that bar drags the “halfway” point up slightly, to 7.


Problem 9.3

Which of the following is closest to the constant prediction h^* that minimizes: \displaystyle \frac{1}{n} \sum_{i = 1}^n (y_i - h)^2

Answer: 11.

The minimizer of empirical risk for the constant model when using squared loss is the mean. The mean is heavily influenced by the presence of outliers, of which there are many at 30, dragging the mean up to 11. While you can’t compute the mean exactly from the histogram, the large right tail tells you that the mean must be larger than the median, which is 7, and 11 is the next-largest option.


Problem 9.4

Which of the following is closest to the constant prediction h^* that minimizes: \displaystyle \lim_{p \rightarrow \infty} \frac{1}{n} \sum_{i = 1}^n |y_i - h|^p

Answer: 15.

The minimizer of empirical risk for the constant model when using infinity loss is the midrange, i.e. halfway between the min and max.
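To make these four facts concrete, here is a small sketch with made-up data that finds the risk-minimizing constant by brute force (a large exponent stands in for the p \rightarrow \infty limit):

    import numpy as np

    y = np.array([2, 2, 2, 3, 7, 30])   # hypothetical data
    cands = np.arange(0, 31)            # candidate constant predictions

    def best_h(loss):
        risks = [np.mean(loss(y, h)) for h in cands]
        return cands[int(np.argmin(risks))]

    print(best_h(lambda y, h: (y != h).astype(float)))  # 2, the mode
    print(best_h(lambda y, h: np.abs(y - h)))           # 2, a median (any h between 2 and 3 ties)
    print(best_h(lambda y, h: (y - h) ** 2))            # 8, closest to the mean (~7.67)
    print(best_h(lambda y, h: np.abs(y - h) ** 50.0))   # 16, the midrange (2 + 30) / 2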



Problem 10

Consider a dataset that consists of y_1, \cdots, y_n. In class, we used calculus to minimize mean squared error, R_{sq}(h) = \frac{1}{n} \sum_{i = 1}^n (h - y_i)^2. In this problem, we want you to apply the same approach to a slightly different loss function defined below: L_{\text{midterm}}(y,h)=(\alpha y - h)^2+\lambda h


Problem 10.1

Write down the empirical risk R_{\text{midterm}}(h) by using the above loss function.

R_{\text{midterm}}(h)=\frac{1}{n}\sum_{i=1}^{n}[(\alpha y_i - h)^2+\lambda h]=[\frac{1}{n}\sum_{i=1}^{n}(\alpha y_i - h)^2] +\lambda h


Problem 10.2

The mean of the dataset is \bar{y}, i.e. \bar{y} = \frac{1}{n} \sum_{i = 1}^n y_i. Find h^* that minimizes R_{\text{midterm}}(h) using calculus. Your result should be in terms of \bar{y}, \alpha and \lambda.

h^*=\alpha \bar{y} - \frac{\lambda}{2}



\begin{align*} \frac{d}{dh}R_{\text{midterm}}(h)&= [\frac{2}{n}\sum_{i=1}^{n}(h- \alpha y_i )] +\lambda \\ &=2 h-2\alpha \bar{y} + \lambda. \end{align*}

By setting \frac{d}{dh}R_{\text{midterm}}(h)=0 we get 2 h^*-2\alpha \bar{y} + \lambda=0 \Rightarrow h^*=\alpha \bar{y} - \frac{\lambda}{2}.
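If you'd like to verify the calculus symbolically, here is a quick sympy check (using a concrete n = 3 for readability):

    import sympy as sp

    h, alpha, lam = sp.symbols('h alpha lambda')
    ys = sp.symbols('y1 y2 y3')   # a dataset with n = 3 points
    n = len(ys)

    R = sp.Rational(1, n) * sum((alpha * yi - h) ** 2 + lam * h for yi in ys)
    h_star = sp.solve(sp.diff(R, h), h)[0]
    print(h_star)
    # alpha*y1/3 + alpha*y2/3 + alpha*y3/3 - lambda/2,
    # i.e. alpha * ybar - lambda / 2, matching the answer above.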


