The problems in this worksheet are taken from past exams in similar classes. Work on them on paper, since the exams you take in this course will also be on paper.
We encourage you to attempt these problems before Sunday’s exam review session, so that we have enough time to walk through the solutions to all of the problems.
We will enable the solutions here after the review session, though you can find the written solutions to these problems in other discussion worksheets.
The EECS 398 staff are looking into hotels — some in San Diego, for their family to stay at for graduation (and to eat Mexican food), and some elsewhere, for summer trips.
Each row of hotels contains information about a different hotel in San Diego. Specifically, for each hotel, we have:
"Hotel Name" (str): The name of the hotel. Assume hotel names are unique.
"Location" (str): The hotel’s neighborhood in San Diego.
"Chain" (str): The chain the hotel is a part of; either "Hilton", "Marriott", "Hyatt", or "Other". A hotel chain is a group of hotels owned or operated by a shared company.
"Number of Rooms" (int): The number of rooms the hotel has.
The first few rows of hotels are shown below, but hotels has many more rows than are shown.
Now, consider the variable summed, defined below.
summed = hotels.groupby("Chain")["Number of Rooms"].sum().idxmax()
What is type(summed)?
int
str
Series
DataFrame
DataFrameGroupBy
In one sentence, explain what the value of summed means. Phrase your explanation as if you had to give it to someone who is not a data science major; that is, don’t say something like “it is the result of grouping hotels by "Chain", selecting the "Number of Rooms" column, …”, but instead, give the value context.
Consider the variable curious, defined below.
curious = frame["Chain"].value_counts().idxmax()
Fill in the blank: curious is guaranteed to be equal to summed only if frame has one row for every ____ in San Diego.
hotel
hotel chain
hotel room
neighborhood
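To experiment with how summed and curious can differ, here is a small sketch on a made-up table. The rows in toy_hotels are invented for illustration and are not the real hotels data; only the two expressions come from the worksheet, and frame is taken to be toy_hotels itself.

```python
import pandas as pd

# Hypothetical data: NOT the real hotels DataFrame.
toy_hotels = pd.DataFrame({
    "Hotel Name": ["A", "B", "C", "D"],
    "Location": ["Downtown", "Coronado", "La Jolla", "Downtown"],
    "Chain": ["Hilton", "Marriott", "Marriott", "Other"],
    "Number of Rooms": [500, 100, 150, 80],
})

# Total rooms per chain, then the index label of the largest total.
rooms_per_chain = toy_hotels.groupby("Chain")["Number of Rooms"].sum()
summed = rooms_per_chain.idxmax()

# Number of rows per chain, then the index label of the largest count.
rows_per_chain = toy_hotels["Chain"].value_counts()
curious = rows_per_chain.idxmax()

print(rooms_per_chain)
print(rows_per_chain)
print(summed, curious)
```

On this toy table the two labels differ. Try changing what a single row represents and see when they are forced to agree.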
Fill in the blanks so that popular_areas is an array of the names of the unique neighborhoods that have at least 5 hotels and at least 1000 hotel rooms.
f = lambda df: __(i)__
popular_areas = (hotels
                 .groupby(__(ii)__)
                 .__(iii)__
                 __(iv)__)
What goes in blank (i)?
What goes in blank (ii)?
"Hotel Name"
"Location"
"Chain"
"Number of Rooms"
What goes in blank (iii)?
agg(f)
filter(f)
transform(f)
What goes in blank (iv)?
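The three candidate groupby methods return differently shaped results, which is easiest to see on a small example. The sketch below uses an invented table and an unrelated condition (groups with at least 2 rows); the names toy, group, and value are hypothetical.

```python
import pandas as pd

# Hypothetical data, invented for illustration.
toy = pd.DataFrame({
    "group": ["x", "x", "y", "z", "z", "z"],
    "value": [1, 2, 10, 3, 4, 5],
})
grouped = toy.groupby("group")

# agg: one output row per group.
totals = grouped["value"].agg("sum")

# transform: one output value per original row, aligned to toy's index.
group_totals_per_row = grouped["value"].transform("sum")

# filter: keeps or drops whole groups; the function receives each group's
# DataFrame and must return a single boolean.
big_groups = grouped.filter(lambda df: df.shape[0] >= 2)

print(totals.shape, group_totals_per_row.shape, big_groups.shape)
```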
Consider the code below.
cond1 = hotels["Chain"] == "Marriott"
cond2 = hotels["Location"] == "Coronado"
combined = hotels[cond1].merge(hotels[cond2], on="Hotel Name", how=???)
If we replace ??? with "inner" in the code above, which of the following will be equal to combined.shape[0]?
min(cond1.sum(), cond2.sum())
(cond1 & cond2).sum()
cond1.sum() + cond2.sum()
cond1.sum() + cond2.sum() - (cond1 & cond2).sum()
cond1.sum() + (cond1 & cond2).sum()
If we replace ??? with "outer" in the code above, which of the following will be equal to combined.shape[0]?
min(cond1.sum(), cond2.sum())
(cond1 & cond2).sum()
cond1.sum() + cond2.sum()
cond1.sum() + cond2.sum() - (cond1 & cond2).sum()
cond1.sum() + (cond1 & cond2).sum()
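One way to check a candidate expression is to run the merge on a tiny table where the rows can be counted by hand. The sketch below assumes an invented toy_hotels table with unique hotel names; the indicator=True argument is a standard option of merge that labels which side each merged row came from.

```python
import pandas as pd

# Hypothetical data with unique names, mirroring the uniqueness assumption.
toy_hotels = pd.DataFrame({
    "Hotel Name": ["A", "B", "C", "D", "E"],
    "Chain": ["Marriott", "Marriott", "Hilton", "Marriott", "Other"],
    "Location": ["Coronado", "Downtown", "Coronado", "Coronado", "La Jolla"],
})
cond1 = toy_hotels["Chain"] == "Marriott"
cond2 = toy_hotels["Location"] == "Coronado"

left, right = toy_hotels[cond1], toy_hotels[cond2]
inner = left.merge(right, on="Hotel Name", how="inner")
# indicator=True adds a "_merge" column saying which side each row came from.
outer = left.merge(right, on="Hotel Name", how="outer", indicator=True)

print(cond1.sum(), cond2.sum(), (cond1 & cond2).sum())
print(inner.shape[0], outer.shape[0])
print(outer["_merge"].value_counts())
```

Comparing the printed row counts against each answer choice on a few such tables is a quick way to rule out choices that only work by coincidence.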
Billina Records, a new record company focused on creating new TikTok audios, has its offices on the 23rd floor of a skyscraper with 75 floors (numbered 1 through 75). The owners of the building promised that 10 different random floors will be selected to be renovated.
Below, fill in the blanks to complete a simulation that will estimate the probability that Billina Records’ floor will be renovated.
total = 0
repetitions = 10000
for i in np.arange(repetitions):
    choices = np.random.choice(__(a)__, 10, __(b)__)
    if __(c)__:
        total = total + 1
prob_renovate = total / repetitions
What goes in blank (a)?
np.arange(1, 75)
np.arange(10, 75)
np.arange(0, 76)
np.arange(1, 76)
What goes in blank (b)?
replace=True
replace=False
What goes in blank (c)?
choices == 23
choices is 23
np.count_nonzero(choices == 23) > 0
np.count_nonzero(choices) == 23
choices.str.contains(23)
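A simulation estimate can be sanity-checked against an exact count. The sketch below is not part of the original problem; it assumes every set of 10 distinct floors is equally likely and counts how many of those sets include one particular floor.

```python
from fractions import Fraction
from math import comb

n_floors = 75      # floors numbered 1 through 75
n_renovated = 10   # distinct floors chosen for renovation

# Selections that include one specific floor: fix that floor, then choose
# the remaining 9 floors from the other 74.
with_floor = comb(n_floors - 1, n_renovated - 1)
all_selections = comb(n_floors, n_renovated)

exact_prob = Fraction(with_floor, all_selections)
print(exact_prob, float(exact_prob))
```

With 10,000 repetitions, a correctly completed simulation should land close to this value, though not exactly on it.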
For each day in May 2022, the DataFrame streams contains the number of streams for each of the "Top 200" songs on Spotify that day, that is, the number of streams for the 200 songs with the most streams on Spotify that day. The columns in streams are as follows:
"date": the date the song was streamed
"artist_names": name(s) of the artists who created the song
"track_name": name of the song
"streams": the number of times the song was streamed on Spotify that day
The first few rows of streams are shown below. Since there were 31 days in May and 200 songs per day, streams has 6200 rows in total.
Note that:
streams is already sorted in a very particular way: it is sorted by "date" in reverse chronological (decreasing) order, and, within each "date", by "streams" in increasing order.
Many songs will appear multiple times in streams, because many songs were in the Top 200 on more than one day.
Suppose the DataFrame today consists of 15 rows: 3 rows for each of 5 different "artist_names". For each artist, it contains the "track_name" for their three most-streamed songs today. For instance, there may be one row for "olivia rodrigo" and "favorite crime", one row for "olivia rodrigo" and "drivers license", and one row for "olivia rodrigo" and "deja vu".
Another DataFrame, genres, is shown below in its entirety.
Suppose we perform an inner merge between today and genres on "artist_names". If the five "artist_names" in today are the same as the five "artist_names" in genres, what fraction of the rows in the merged DataFrame will contain "Pop" in the "genre" column? Give your answer as a simplified fraction.
Suppose we perform an inner merge between today and genres on "artist_names". Furthermore, suppose that the only overlapping "artist_names" between today and genres are "drake" and "olivia rodrigo". What fraction of the rows in the merged DataFrame will contain "Pop" in the "genre" column? Give your answer as a simplified fraction.
Suppose we perform an outer merge between today and genres on "artist_names". Furthermore, suppose that the only overlapping "artist_names" between today and genres are "drake" and "olivia rodrigo". What fraction of the rows in the merged DataFrame will contain "Pop" in the "genre" column? Give your answer as a simplified fraction.
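The row counts in these merges depend on how many times each key appears on each side. The sketch below uses two invented tables (toy_today and toy_genres are hypothetical and do not reproduce the real today or genres) to show how repeated keys multiply rows and how an outer merge keeps unmatched rows with missing values.

```python
import pandas as pd

# Hypothetical tables; the real today and genres are not reproduced here.
toy_today = pd.DataFrame({
    "artist_names": ["a", "a", "a", "b", "b", "b"],
    "track_name": ["a1", "a2", "a3", "b1", "b2", "b3"],
})
toy_genres = pd.DataFrame({
    "artist_names": ["a", "a", "c"],
    "genre": ["Pop", "Rock", "Pop"],
})

inner = toy_today.merge(toy_genres, on="artist_names", how="inner")
outer = toy_today.merge(toy_genres, on="artist_names", how="outer")

# Fraction of merged rows whose genre is "Pop"; rows with a missing genre
# count in the denominator but not the numerator.
inner_frac = (inner["genre"] == "Pop").mean()
outer_frac = (outer["genre"] == "Pop").mean()
print(inner.shape[0], inner_frac)
print(outer.shape[0], outer_frac)
```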
The DataFrame random_10 contains the "track_name" and "genre" of 10 randomly-chosen songs in Spotify’s Top 200 today, along with their "genre_rank", which is their rank in the Top 200 among songs in their "genre". For instance, "the real slim shady" is the 20th-ranked Hip-Hop/Rap song in the Top 200 today. random_10 is shown below in its entirety.
The "genre_rank" column of random_10 contains missing values. Below, we provide four different imputed "genre_rank" columns, each of which was created using a different imputation technique. On the next page, match each of the four options to the imputation technique that was used in the option.
Note that each option (A, B, C, D) should be used exactly once between parts (a) and (d).
In which option was unconditional mean imputation used?
In which option was mean imputation conditional on "genre" used?
In which option was unconditional probabilistic imputation used?
In which option was probabilistic imputation conditional on "genre" used?
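The four techniques named above can be sketched on a small invented table. This is one possible implementation under stated assumptions, not necessarily the one used to build options A through D: toy and fill_by_sampling are hypothetical, and probabilistic imputation is taken to mean drawing at random from the observed values.

```python
import numpy as np
import pandas as pd

# Hypothetical data, not the random_10 table from the problem.
toy = pd.DataFrame({
    "genre": ["Pop", "Pop", "Pop", "Rock", "Rock", "Rock"],
    "genre_rank": [2.0, 4.0, np.nan, 10.0, np.nan, 30.0],
})
rank = toy["genre_rank"]
rng = np.random.default_rng(0)

# Unconditional mean: every missing value gets the overall observed mean.
mean_imputed = rank.fillna(rank.mean())

# Conditional mean: each missing value gets the observed mean of its genre.
genre_means = toy.groupby("genre")["genre_rank"].transform("mean")
cond_mean_imputed = rank.fillna(genre_means)

def fill_by_sampling(s):
    """Fill missing entries of s with random draws from its observed values."""
    s = s.copy()
    missing = s.isna()
    s[missing] = rng.choice(s.dropna().to_numpy(), size=missing.sum())
    return s

# Unconditional probabilistic: draw from all observed values.
prob_imputed = fill_by_sampling(rank)

# Conditional probabilistic: draw only from observed values in the same genre.
cond_prob_imputed = toy.groupby("genre")["genre_rank"].transform(fill_by_sampling)

print(pd.DataFrame({
    "mean": mean_imputed, "cond_mean": cond_mean_imputed,
    "prob": prob_imputed, "cond_prob": cond_prob_imputed,
}))
```

Features worth looking for when matching options: whether all filled values are identical, whether they are identical within a genre, and whether each filled value already appears among the observed values.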
You want to use regular expressions to extract out the number of ounces from the 5 product names below.
Index | Product Name | Expected Output |
---|---|---|
0 | Adult Dog Food 18-Count, 3.5 oz Pouches | 3.5 |
1 | Gardetto’s Snack Mix, 1.75 Ounce | 1.75 |
2 | Colgate Whitening Toothpaste, 3 oz Tube | 3 |
3 | Adult Dog Food, 13.2 oz. Cans 24 Pack | 13.2 |
4 | Keratin Hair Spray 2!6 oz | 6 |
The names are stored in a pandas Series called names.
For each snippet below, select the indexes for all the product names that will not be matched correctly.
For the snippet below, which indexes correspond to products that will not be matched correctly?
regex = r'([\d.]+) oz'
names.str.findall(regex)
0
1
2
3
4
All names will be matched correctly.
For the snippet below, which indexes correspond to products that will not be matched correctly?
regex = r'(\d+?.\d+) oz|Ounce'
names.str.findall(regex)
0
1
2
3
4
All names will be matched correctly.
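To check an answer after working it out on paper, a small helper can compare what findall returns against the expected output column. The sketch below uses Python's re module on a plain list (names.str.findall applies the same re.findall to each element); mismatched_indexes is a hypothetical helper, and the pattern shown is deliberately not one of the worksheet snippets.

```python
import re

names = [
    "Adult Dog Food 18-Count, 3.5 oz Pouches",
    "Gardetto’s Snack Mix, 1.75 Ounce",
    "Colgate Whitening Toothpaste, 3 oz Tube",
    "Adult Dog Food, 13.2 oz. Cans 24 Pack",
    "Keratin Hair Spray 2!6 oz",
]
expected = ["3.5", "1.75", "3", "13.2", "6"]

def mismatched_indexes(pattern):
    """Return the indexes whose findall result is not exactly [expected]."""
    return [
        i for i, (name, want) in enumerate(zip(names, expected))
        if re.findall(pattern, name) != [want]
    ]

# A hypothetical pattern, not one of the worksheet snippets: one or more
# digits followed by a space and "oz". It cannot capture a decimal part.
print(mismatched_indexes(r"(\d+) oz"))
```

Passing either worksheet pattern to mismatched_indexes gives a way to confirm a paper answer. Keep in mind that when a pattern contains a capture group, findall returns only the captured text, and that | has lower precedence than everything else in the pattern.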
Rahul is trying to scrape the website of an online bookstore ‘The Book Club’.
<HTML>
<H1>The Book Club</H1>
<BODY BGCOLOR="FFFFFF">
Email us at <a href="mailto:support@thebookclub.com">
support@thebookclub.com</a>.
<div>
<ol class="row">
<li class="book_list">
<article class="product_pod">
<div class="image_container">
<img src="pic1.jpeg" alt="A Light in the Attic"
class="thumbnail">
</div>
<p class="star-rating Three"></p>
<h3>
<a href="cat/index.html" title="A Light in the Attic">
A Light in the Attic
</a>
</h3>
<div class="product_price">
<p class="price_color">£51.77</p>
<p class="instock availability">
<i class="icon-ok"></i>
In stock
</p>
</div>
</article>
</li>
</ol>
</div>
</BODY>
</HTML>
Which is the equivalent Document Object Model (DOM) tree of this HTML file?
Tree A
Tree B
Tree C
Tree D
Rahul wants to extract the ‘instock availability’ status of the book titled ‘A Light in the Attic’. Which of the following expressions will evaluate to "In stock"? Assume that Rahul has already parsed the HTML into a BeautifulSoup object stored in the variable named soup.
Code Snippet A
soup.find('p',attrs = {'class': 'instock availability'})\
.get('icon-ok').strip()
Code Snippet B
soup.find('p',attrs = {'class': 'instock availability'}).text.strip()
Code Snippet C
soup.find('p',attrs = {'class': 'instock availability'}).find('i')\
.text.strip()
Code Snippet D
soup.find('div', attrs = {'class':'product_price'})\
.find('p',attrs = {'class': 'instock availability'})\
.find('i').text.strip()
Rahul also wants to extract the number of stars that the book titled ‘A Light in the Attic’ received. If you look at the HTML file, you will notice that the book received a star rating of three. Which code snippet will evaluate to "Three"?
Code Snippet A
soup.find('article').get('class').strip()
Code Snippet B
soup.find('p').text.split(' ')
Code Snippet C
soup.find('p').get('class')[1]
None of the above
Tahseen decides to look at reviews for the same hotel, but he modifies them so that the only terms they contain are "taco" and "sand". The bag-of-words representations of three reviews are shown as vectors below.
Using cosine similarity to measure similarity, which pair of reviews are the most similar? If there are multiple pairs of reviews that are most similar, select them all.
\vec{r}_1 and \vec{r}_2
\vec{r}_1 and \vec{r}_3
\vec{r}_2 and \vec{r}_3
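As a reminder of the computation involved, cosine similarity is the dot product of two vectors divided by the product of their lengths. The sketch below uses made-up two-term count vectors a, b, and c, which are not the vectors from the problem.

```python
import math

def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical ("taco", "sand") count vectors, not the ones from the problem.
a = [1, 2]
b = [3, 6]   # same direction as a, three times as long
c = [2, 1]

print(cosine_similarity(a, b))
print(cosine_similarity(a, c))
```

Because the measure depends only on direction, scaling a vector does not change its similarity to any other vector.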
You create a table called gums that only contains the chewing gum purchases of df, then you create a bag-of-words matrix called bow from the name column of gums. The bow matrix is stored as a DataFrame shown below:
You also have the following outputs:
>>> bow_df.sum(axis=0)
pur            5
gum           41
sugar          2
              ..
90             4
paperboard    22
80            20
Length: 139
>>> bow_df.sum(axis=1)
0     21
1     22
2     22
      ..
37    22
38    10
39    17
Length: 40
>>> bow_df.loc[0, 'pur']
0
>>> (bow_df['paperboard'] > 0).sum()
20
>>> bow_df['gum'].sum()
41
For each question below, write your answer as an unsimplified math expression (no need to simplify fractions or logarithms) in the space provided, or write “Need more information” if there is not enough information provided to answer the question.
What is the TF-IDF for the word “pur” in document 0?
What is the TF-IDF for the word “gum” in document 0?
What is the TF-IDF for the word “paperboard” in document 1?
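The sketch below shows which quantities a TF-IDF calculation needs. It assumes the common definition, term frequency (count in the document divided by the document's length) times the log of the number of documents over the number of documents containing the term; the log base and the exact definition should be checked against the one used in lecture. The function tf_idf and the numbers passed to it are hypothetical.

```python
import math

def tf_idf(count_in_doc, doc_length, n_docs, n_docs_with_term):
    """TF-IDF under the assumed definition: (count / length) * log(N / df)."""
    tf = count_in_doc / doc_length
    idf = math.log(n_docs / n_docs_with_term)
    return tf * idf

# Hypothetical numbers: a term appearing 3 times in a 12-word document,
# and in 5 of 50 documents overall.
print(tf_idf(3, 12, 50, 5))
```

Note that the last argument is a count of documents containing the term, which is generally not the same as the total number of times the term appears across all documents.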
Consider a dataset of n integers, y_1, y_2, ..., y_n, whose histogram is given below:
Which of the following is closest to the constant prediction h^* that minimizes:
\displaystyle \frac{1}{n} \sum_{i = 1}^n \begin{cases} 0 & y_i = h \\ 1 & y_i \neq h \end{cases}
1
5
6
7
11
15
30
Which of the following is closest to the constant prediction h^* that minimizes: \displaystyle \frac{1}{n} \sum_{i = 1}^n |y_i - h|
1
5
6
7
11
15
30
Which of the following is closest to the constant prediction h^* that minimizes: \displaystyle \frac{1}{n} \sum_{i = 1}^n (y_i - h)^2
1
5
6
7
11
15
30
Which of the following is closest to the constant prediction h^* that minimizes: \displaystyle \lim_{p \rightarrow \infty} \frac{1}{n} \sum_{i = 1}^n |y_i - h|^p
1
5
6
7
11
15
30
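Each of the four objectives above can also be minimized numerically, which is a useful check on intuition. The sketch below runs a simple grid search over candidate values of h on an invented dataset (ys is hypothetical, not the data in the histogram). For the limiting case it minimizes the largest absolute deviation, since that term dominates the sum as p grows.

```python
# Hypothetical dataset, not the one in the histogram.
ys = [1, 1, 1, 2, 3, 4, 20]
candidates = [h / 10 for h in range(0, 251)]   # 0.0, 0.1, ..., 25.0

def best_h(risk):
    """Return the candidate h with the smallest value of risk(h)."""
    return min(candidates, key=risk)

def zero_one(h):
    return sum(y != h for y in ys) / len(ys)

def absolute(h):
    return sum(abs(y - h) for y in ys) / len(ys)

def squared(h):
    return sum((y - h) ** 2 for y in ys) / len(ys)

def largest(h):
    # Stand-in for the p -> infinity case: the maximum absolute deviation.
    return max(abs(y - h) for y in ys)

for name, risk in [("0-1", zero_one), ("absolute", absolute),
                   ("squared", squared), ("max deviation", largest)]:
    print(name, best_h(risk))
```

Comparing each printed value with familiar summary statistics of ys (and noting how far the single large value pulls each one) suggests what to look for in the histogram.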
Consider a dataset that consists of y_1, \cdots, y_n. In class, we used calculus to minimize mean squared error, R_{sq}(h) = \frac{1}{n} \sum_{i = 1}^n (h - y_i)^2. In this problem, we want you to apply the same approach to a slightly different loss function defined below: L_{\text{midterm}}(y,h)=(\alpha y - h)^2+\lambda h
Write down the empirical risk R_{\text{midterm}}(h) by using the above loss function.
The mean of the dataset is \bar{y}, i.e. \bar{y} = \frac{1}{n} \sum_{i = 1}^n y_i. Find h^* that minimizes R_{\text{midterm}}(h) using calculus. Your result should be in terms of \bar{y}, \alpha and \lambda.
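For reference, the calculus argument for mean squared error mentioned at the start of this problem can be written out as follows. This is the standard derivation for R_{sq}(h), shown only as a template for the steps; it is not the solution for R_{\text{midterm}}(h).

```latex
\begin{align*}
R_{sq}(h) &= \frac{1}{n} \sum_{i = 1}^n (h - y_i)^2 \\
\frac{d}{dh} R_{sq}(h) &= \frac{2}{n} \sum_{i = 1}^n (h - y_i) = 2h - 2\bar{y} \\
\frac{d}{dh} R_{sq}(h) = 0 &\implies h^* = \bar{y} \\
\frac{d^2}{dh^2} R_{sq}(h) &= 2 > 0 \quad \text{(so $h^*$ is a minimum)}
\end{align*}
```

The same three steps (differentiate with respect to h, set the derivative to zero, confirm the second derivative is positive) apply to the modified loss.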