Midterm Review, Day 1



The problems in this worksheet are taken from past exams in similar classes. Work on them on paper, since the exams you take in this course will also be on paper.

We encourage you to attempt these problems before Sunday’s exam review session, so that we have enough time to walk through the solutions to all of the problems.

We will enable the solutions here after the review session, though you can find the written solutions to these problems in other discussion worksheets.


Problem 1

The EECS 398 staff are looking into hotels — some in San Diego, for their family to stay at for graduation (and to eat Mexican food), and some elsewhere, for summer trips.

Each row of hotels contains information about a different hotel in San Diego. Specifically, for each hotel, we have:

The first few rows of hotels are shown below, but hotels has many more rows than are shown.

Now, consider the variable summed, defined below.

summed = hotels.groupby("Chain")["Number of Rooms"].sum().idxmax()


Problem 1.1

What is type(summed)?


Problem 1.2

In one sentence, explain what the value of summed means. Phrase your explanation as if you had to give it to someone who is not a data science major; that is, don’t say something like “it is the result of grouping hotels by "Chain", selecting the "Number of Rooms" column, …”, but instead, give the value context.


Problem 1.3

Consider the variable curious, defined below.

curious = frame["Chain"].value_counts().idxmax()

Fill in the blank: curious is guaranteed to be equal to summed only if frame has one row for every ____ in San Diego.


Problem 1.4

Fill in the blanks so that popular_areas is an array of the names of the unique neighborhoods that have at least 5 hotels and at least 1000 hotel rooms.

    f = lambda df: __(i)__
    popular_areas = (hotels
                    .groupby(__(ii)__)
                    .__(iii)__
                    __(iv)__)
  1. What goes in blank (i)?

  2. What goes in blank (ii)?

  3. What goes in blank (iii)?

  4. What goes in blank (iv)?


Problem 1.5

Consider the code below.

    cond1 = hotels["Chain"] == "Marriott"
    cond2 = hotels["Location"] == "Coronado"
    combined = hotels[cond1].merge(hotels[cond2], on="Hotel Name", how=???)
  1. If we replace ??? with "inner" in the code above, which of the following will be equal to combined.shape[0]?
  2. If we replace ??? with "outer" in the code above, which of the following will be equal to combined.shape[0]?



Problem 2

Billina Records, a new record company focused on creating new TikTok audios, has its offices on the 23rd floor of a skyscraper with 75 floors (numbered 1 through 75). The owners of the building promised that 10 different random floors will be selected to be renovated.

Below, fill in the blanks to complete a simulation that will estimate the probability that Billina Records’ floor will be renovated.

total = 0
repetitions = 10000
for i in np.arange(repetitions):
    choices = np.random.choice(__(a)__, 10, __(b)__)
    if __(c)__:
        total = total + 1
prob_renovate = total / repetitions

What goes in blank (a)?

What goes in blank (b)?

What goes in blank (c)?


Problem 3

For each day in May 2022, the DataFrame streams contains the number of streams for each of the “Top 200” songs on Spotify that day; that is, the number of streams for the 200 songs with the most streams on Spotify that day. The columns in streams are as follows:

The first few rows of streams are shown below. Since there were 31 days in May and 200 songs per day, streams has 6200 rows in total.

Note that:

Suppose the DataFrame today consists of 15 rows — 3 rows for each of 5 different "artist_names". For each artist, it contains the "track_name" for their three most-streamed songs today. For instance, there may be one row for "olivia rodrigo" and "favorite crime", one row for "olivia rodrigo" and "drivers license", and one row for "olivia rodrigo" and "deja vu".

Another DataFrame, genres, is shown below in its entirety.


Problem 3.1

Suppose we perform an inner merge between today and genres on "artist_names". If the five "artist_names" in today are the same as the five "artist_names" in genres, what fraction of the rows in the merged DataFrame will contain "Pop" in the "genre" column? Give your answer as a simplified fraction.


Problem 3.2

Suppose we perform an inner merge between today and genres on "artist_names". Furthermore, suppose that the only overlapping "artist_names" between today and genres are "drake" and "olivia rodrigo". What fraction of the rows in the merged DataFrame will contain "Pop" in the "genre" column? Give your answer as a simplified fraction.


Problem 3.3

Suppose we perform an outer merge between today and genres on "artist_names". Furthermore, suppose that the only overlapping "artist_names" between today and genres are "drake" and "olivia rodrigo". What fraction of the rows in the merged DataFrame will contain "Pop" in the "genre" column? Give your answer as a simplified fraction.
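
As a refresher on how merges count rows when a key repeats, here is a minimal sketch. It uses small hypothetical DataFrames (left and right, with made-up keys and values), not the today and genres DataFrames from this problem.

```python
import pandas as pd

# Hypothetical data: the key "a" appears twice in each DataFrame,
# "b" appears only in left, and "c" appears only in right.
left = pd.DataFrame({"key": ["a", "a", "b"], "left_val": [1, 2, 3]})
right = pd.DataFrame({"key": ["a", "a", "c"], "right_val": ["x", "y", "z"]})

# Inner merge: only shared keys survive, and every matching pair of rows
# produces one output row (2 left rows x 2 right rows for "a").
inner = left.merge(right, on="key", how="inner")

# Outer merge: the same matched rows, plus one row for each unmatched key
# from either side, with NaN filling the columns that have no match.
outer = left.merge(right, on="key", how="outer")

print(inner.shape[0], outer.shape[0])  # 4 6
```

The row counts depend only on how often each key appears on each side, so the same reasoning can be done on paper without running any code.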



Problem 4

The DataFrame random_10 contains the "track_name" and "genre" of 10 randomly-chosen songs in Spotify’s Top 200 today, along with their "genre_rank", which is their rank in the Top 200 among songs in their "genre". For instance, “the real slim shady” is the 20th-ranked Hip-Hop/Rap song in the Top 200 today. random_10 is shown below in its entirety.

The "genre_rank" column of random_10 contains missing values. Below, we provide four different imputed "genre_rank" columns, each of which was created using a different imputation technique. In the problems that follow, match each of the four options to the imputation technique that was used to create it.

Note that each option (A, B, C, D) should be used exactly once across Problems 4.1 through 4.4.
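
For reference, the four techniques can be sketched as follows. This is a minimal illustration on a hypothetical DataFrame (the column names group and value, the data, and the helper fill_by_sampling are all made up), not the code used to create the options in this problem.

```python
import numpy as np
import pandas as pd

# Hypothetical data: two groups, each with one missing value.
df = pd.DataFrame({
    "group": ["a", "a", "a", "b", "b", "b"],
    "value": [1.0, 3.0, np.nan, 10.0, 20.0, np.nan],
})

# Unconditional mean imputation: every missing value gets the overall mean.
uncond_mean = df["value"].fillna(df["value"].mean())

# Conditional mean imputation: each missing value gets its own group's mean.
cond_mean = df.groupby("group")["value"].transform(lambda s: s.fillna(s.mean()))

rng = np.random.default_rng(0)

def fill_by_sampling(s):
    """Fill missing values by sampling from the observed values in s."""
    s = s.copy()
    missing = s.isna()
    s[missing] = rng.choice(s.dropna().to_numpy(), size=missing.sum())
    return s

# Unconditional probabilistic imputation: sample from all observed values.
uncond_prob = fill_by_sampling(df["value"])

# Conditional probabilistic imputation: sample only within the same group.
cond_prob = df.groupby("group")["value"].transform(fill_by_sampling)
```

Mean imputation fills every gap in a column (or group) with the same number, while probabilistic imputation fills gaps with values that already appear among the observed data.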


Problem 4.1

In which option was unconditional mean imputation used?


Problem 4.2

In which option was mean imputation conditional on "genre" used?


Problem 4.3

In which option was unconditional probabilistic imputation used?


Problem 4.4

In which option was probabilistic imputation conditional on "genre" used?



Problem 5

You want to use regular expressions to extract the number of ounces from the five product names below.

Index  Product Name                               Expected Output
0      Adult Dog Food 18-Count, 3.5 oz Pouches    3.5
1      Gardetto’s Snack Mix, 1.75 Ounce           1.75
2      Colgate Whitening Toothpaste, 3 oz Tube    3
3      Adult Dog Food, 13.2 oz. Cans 24 Pack      13.2
4      Keratin Hair Spray 2!6 oz                  6

The names are stored in a pandas Series called names. For each snippet below, select the indexes for all the product names that will not be matched correctly.
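
As a refresher on how findall treats capture groups and alternation, here is a minimal sketch using Python's built-in re module. Series.str.findall applies the same matching to each element of a Series. The strings and units below (kg, lb) are hypothetical and are not the product names from this problem.

```python
import re

# Hypothetical string; not one of the product names above.
text = "Rice 5 kg, Flour 10 kg"

# With one capture group, findall returns only the captured part of each match.
print(re.findall(r"(\d+) kg", text))  # ['5', '10']

# Alternation (|) has the lowest precedence, so this pattern reads as
# "(digits followed by ' kg')" OR "the literal text 'lb'".
loose = re.findall(r"(\d+) kg|lb", "Sugar 2 lb")

# A non-capturing group (?:...) restricts the alternation to the unit only.
tight = re.findall(r"(\d+) (?:kg|lb)", "Sugar 2 lb")

print(loose)  # ['']  (matched 'lb' alone, so the group captured nothing)
print(tight)  # ['2']
```

When tracing a pattern by hand, it helps to first decide where each alternative starts and ends, and then check what the capture group holds for each match.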


Problem 5.1

For the snippet below, which indexes correspond to products that will not be matched correctly?

regex = r'([\d.]+) oz'
names.str.findall(regex)


Problem 5.2

For the snippet below, which indexes correspond to products that will not be matched correctly?

regex = r'(\d+?.\d+) oz|Ounce'
names.str.findall(regex)



Problem 6


Problem 6.1

Rahul is trying to scrape the website of an online bookstore, ‘The Book Club’, whose HTML is shown below.

<HTML>
<H1>The Book Club</H1>
<BODY BGCOLOR="FFFFFF">
Email us at <a href="mailto:support@thebookclub.com">
support@thebookclub.com</a>.

<div>
    <ol class="row">
    <li class="book_list">
    
        <article class="product_pod">
            <div class="image_container">
                    <img src="pic1.jpeg" alt="A Light in the Attic" 
                    class="thumbnail">
            </div>
            
            <p class="star-rating Three"></p>
            
            <h3>
            <a href="cat/index.html" title="A Light in the Attic">
            A Light in the Attic
            </a>
            </h3>
        
            <div class="product_price">
                <p class="price_color">£51.77</p>
                
                <p class="instock availability">
                    <i class="icon-ok"></i>
                    In stock
                </p>
        
            </div>
        </article>
    </li>
    </ol>

</div>
</BODY>
</HTML>

Which is the equivalent Document Object Model (DOM) tree of this HTML file?


Problem 6.2

Rahul wants to extract the ‘instock availability’ status of the book titled ‘A Light in the Attic’. Which of the following expressions will evaluate to "In stock"? Assume that Rahul has already parsed the HTML into a BeautifulSoup object stored in the variable named soup.

Code Snippet A

    soup.find('p',attrs = {'class': 'instock availability'})\
    .get('icon-ok').strip()

Code Snippet B

    soup.find('p',attrs = {'class': 'instock availability'}).text.strip()

Code Snippet C

    soup.find('p',attrs = {'class': 'instock availability'}).find('i')\
    .text.strip()

Code Snippet D

    soup.find('div', attrs = {'class':'product_price'})\
    .find('p',attrs = {'class': 'instock availability'})\
    .find('i').text.strip()


Problem 6.3

Rahul also wants to extract the number of stars that the book titled ‘A Light in the Attic’ received. If you look at the HTML file, you will notice that the book received a star rating of three. Which code snippet will evaluate to "Three"?

Code Snippet A

    soup.find('article').get('class').strip()

Code Snippet B

    soup.find('p').text.split(' ')

Code Snippet C

    soup.find('p').get('class')[1]

None of the above



Problem 7

Tahseen decides to look at reviews for the same hotel, but he modifies them so that the only terms they contain are "taco" and "sand". The bag-of-words representations of three reviews are shown as vectors below.

Using cosine similarity to measure similarity, which pair of reviews is the most similar? If multiple pairs are tied for most similar, select them all.
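
As a reminder, the cosine similarity of two vectors is their dot product divided by the product of their lengths. The sketch below computes it for hypothetical vectors (r1, r2, and r3 are made-up counts of "taco" and "sand", not the reviews from this problem).

```python
import math

def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical bag-of-words vectors: (count of "taco", count of "sand").
r1 = (1, 2)
r2 = (2, 4)  # same direction as r1, twice as long
r3 = (3, 1)

print(cosine_similarity(r1, r2))
print(cosine_similarity(r1, r3))
```

Because the measure depends only on the angle between the vectors, scaling a vector by a positive constant does not change its similarity to any other vector.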


Problem 8

You create a table called gums that contains only the chewing gum purchases in df, then you create a bag-of-words matrix called bow from the name column of gums. The bow matrix is stored as a DataFrame, bow_df, shown below:

You also have the following outputs:

>>> bow_df.sum(axis=0)     >>> bow_df.sum(axis=1)     >>> bow_df.loc[0, 'pur']
pur            5           0     21                   0
gum           41           1     22
sugar          2           2     22                   >>> (bow_df['paperboard'] > 0).sum()
              ..                 ..                   20
90             4           37    22
paperboard    22           38    10                   >>> bow_df['gum'].sum()
80            20           39    17                   41
Length: 139                Length: 40

For each question below, write your answer as an unsimplified math expression (no need to simplify fractions or logarithms) in the space provided, or write “Need more information” if there is not enough information provided to answer the question.
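
As a reminder of the quantities involved, here is a minimal sketch of TF-IDF on a hypothetical corpus. It assumes the common definitions tf = (occurrences of the term in the document) / (total words in the document) and idf = log((number of documents) / (number of documents containing the term)); the natural log is used here as an assumption, so follow the convention from lecture if it differs. The documents and function names are made up for illustration.

```python
import math

# Hypothetical tokenized documents; not the gum product names.
docs = [
    ["gum", "mint", "gum"],
    ["gum", "sugar"],
    ["mint", "paper"],
]

def tf(term, doc):
    # Fraction of the words in doc that are equal to term.
    return doc.count(term) / len(doc)

def idf(term, docs):
    # Assumes term appears in at least one document (otherwise this divides by zero).
    n_containing = sum(term in doc for doc in docs)
    return math.log(len(docs) / n_containing)

def tfidf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

print(tfidf("gum", docs[0], docs))    # (2/3) * log(3/2)
print(tfidf("sugar", docs[0], docs))  # 0.0, since "sugar" is not in document 0
```

Computing a TF-IDF by hand therefore needs three counts: how often the term appears in the document, how many words the document has, and how many documents contain the term.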


Problem 8.1

What is the TF-IDF for the word “pur” in document 0?


Problem 8.2

What is the TF-IDF for the word “gum” in document 0?


Problem 8.3

What is the TF-IDF for the word “paperboard” in document 1?



Problem 9

Consider a dataset of n integers, y_1, y_2, ..., y_n, whose histogram is given below:


Problem 9.1

Which of the following is closest to the constant prediction h^* that minimizes:

\displaystyle \frac{1}{n} \sum_{i = 1}^n \begin{cases} 0 & y_i = h \\ 1 & y_i \neq h \end{cases}


Problem 9.2

Which of the following is closest to the constant prediction h^* that minimizes: \displaystyle \frac{1}{n} \sum_{i = 1}^n |y_i - h|


Problem 9.3

Which of the following is closest to the constant prediction h^* that minimizes: \displaystyle \frac{1}{n} \sum_{i = 1}^n (y_i - h)^2


Problem 9.4

Which of the following is closest to the constant prediction h^* that minimizes: \displaystyle \lim_{p \rightarrow \infty} \frac{1}{n} \sum_{i = 1}^n |y_i - h|^p
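
One way to build intuition for these four objectives is to minimize each one by brute force on a small dataset. The sketch below does this for a hypothetical dataset (ys is made up and is not the data in the histogram) over a grid of candidate predictions. For the last objective, it uses the largest absolute deviation as a stand-in, since the minimizer of the p-th power risk approaches the minimizer of the largest absolute deviation as p grows.

```python
# Hypothetical dataset; not the data shown in the histogram.
ys = [1, 2, 2, 3, 10]

# Candidate constant predictions: 0.0, 0.1, ..., 11.0.
candidates = [i / 10 for i in range(0, 111)]

def zero_one_risk(h):
    return sum(y != h for y in ys) / len(ys)

def absolute_risk(h):
    return sum(abs(y - h) for y in ys) / len(ys)

def squared_risk(h):
    return sum((y - h) ** 2 for y in ys) / len(ys)

def max_deviation(h):
    # Stand-in for the p -> infinity objective (see the note above).
    return max(abs(y - h) for y in ys)

risks = {
    "zero-one": zero_one_risk,
    "absolute": absolute_risk,
    "squared": squared_risk,
    "max deviation": max_deviation,
}

# For each objective, pick the candidate with the smallest value.
best = {name: min(candidates, key=risk) for name, risk in risks.items()}
print(best)
```

Comparing each minimizer with familiar summary statistics of ys is a useful exercise before returning to the histogram.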



Problem 10

Consider a dataset that consists of y_1, \cdots, y_n. In class, we used calculus to minimize mean squared error, R_{sq}(h) = \frac{1}{n} \sum_{i = 1}^n (h - y_i)^2. In this problem, we want you to apply the same approach to a slightly different loss function defined below: L_{\text{midterm}}(y,h)=(\alpha y - h)^2+\lambda h


Problem 10.1

Write down the empirical risk R_{\text{midterm}}(h) by using the above loss function.


Problem 10.2

The mean of the dataset is \bar{y}, i.e. \bar{y} = \frac{1}{n} \sum_{i = 1}^n y_i. Find h^* that minimizes R_{\text{midterm}}(h) using calculus. Your result should be in terms of \bar{y}, \alpha and \lambda.


