Fall 2024 Final Exam



Instructor(s): Suraj Rampure

This exam was administered in-person. The exam was closed-notes, except students were allowed to bring two double-sided notes sheets. No calculators were allowed. Students had 120 minutes to take this exam.

Access the original exam PDF here.
Note: Questions 1-6 were used for “Midterm Redemption”, which during the Fall 2024 semester allowed students to improve their midterm grade. Starting in Winter 2025, there is no redemption policy, and Final Exam questions will look more like Questions 7-15.


As it gets colder outside, it’s important to make sure we’re taking care of our skin! In this exam, we’ll work with the DataFrame skin, which contains information about various skincare products for sale at Sephora, a popular retailer that sells skincare products. The first few rows of skin are shown below, but skin has many more rows than are shown.

The columns in skin are as follows:

Throughout the exam, assume we have already run all necessary import statements.


Problem 1

An expensive product is one that costs at least $100.


Problem 1.1

Write an expression that evaluates to the proportion of products in skin that are expensive.

Answer:
np.mean(skin['Price'] >= 100) or equivalent expressions such as skin['Price'].ge(100).mean() or (skin['Price'] >= 100).sum() / skin.shape[0].

To calculate the proportion of products in skin that are expensive (price at least 100), we first evaluate the condition skin['Price'] >= 100. This generates a Boolean series where:

  • Each True represents a product with a price greater than or equal to 100.

  • Each False represents a product with a price below 100.

When we call np.mean on this Boolean series, Python internally treats True as 1 and False as 0. The mean of this series therefore gives the proportion of True values, which corresponds to the proportion of products that meet the “expensive” criterion.

Alternatively, we can:

  1. Count the total number of “expensive” products using (skin['Price'] >= 100).sum().
  2. Divide this count by the total number of products in the dataset (skin.shape[0]).

Both approaches yield the same result.
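As a quick illustration, here is a minimal sketch using a hypothetical, tiny stand-in for skin (the brands and prices below are made up):

import numpy as np
import pandas as pd

# A made-up, tiny stand-in for skin, just to illustrate the computation.
skin = pd.DataFrame({
    'Brand': ['CLINIQUE', 'FRESH', 'BOSCIA', 'CLINIQUE'],
    'Price': [29, 150, 38, 100],
})

print(np.mean(skin['Price'] >= 100))                 # 0.5
print((skin['Price'] >= 100).sum() / skin.shape[0])  # 0.5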


Difficulty: ⭐️

The average score on this problem was 90%.


Problem 1.2

Fill in the blanks so that the expression below evaluates to the number of brands that sell fewer than 5 expensive products.

skin.groupby(__(i)__).__(ii)__(__(iii)__)["Brand"].nunique()

(i):

(ii):

(iii):

Answer:
(i): "Brand"
(ii): filter
(iii): lambda x: (x['Price'] >= 100).sum() < 5

The expression filters the grouped DataFrame to include only groups where the number of expensive products (those costing at least $100) is fewer than 5. Here’s how the expression works:

  1. groupby('Brand'): Groups the skin DataFrame by the unique values in the "Brand" column, creating subgroups for each brand. Now, when we apply the next step, it will perform the action on each Brand group.
  2. filter: We use filter instead of agg (or other options) because filter is designed to evaluate a condition on the entire group and decide whether to include or exclude the group in the final result. In this case, we want to include only groups (brands) where the number of expensive products is fewer than 5.
  3. (lambda x: (x['Price'] >= 100).sum() < 5): This lambda function runs on each group (here, x is a DataFrame containing all of that brand's rows). It calculates the number of products in the group whose price is greater than or equal to $100 (i.e., expensive products). If this count is less than 5, the group is kept. Otherwise, the group is excluded from our final result.

Since we have an .nunique() at the end of our original expression, we will return the number (n in nunique) of unique brands.

Alternative forms for (iii) include:
- lambda x: np.sum(x['Price'] >= 100) < 5
- lambda x: x[x['Price'] >= 100].shape[0] < 5
- lambda x: x[x['Price'] >= 100]['Price'].count() < 5 (a single column must be selected before calling .count(), so the lambda returns one Boolean rather than a Series)
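As a rough sketch of how the filled-in expression behaves, here it is run on a hypothetical, tiny version of skin (brand names and prices made up):

import pandas as pd

skin = pd.DataFrame({
    'Brand': ['A', 'A', 'A', 'A', 'A', 'B', 'B'],
    'Price': [120, 150, 200, 110, 105, 110, 30],
})

# Brand A has 5 expensive products, Brand B has 1, so only B survives the filter.
result = (skin
          .groupby("Brand")
          .filter(lambda x: (x['Price'] >= 100).sum() < 5)
          ["Brand"]
          .nunique())
print(result)   # 1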


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 72%.



Problem 2

Fill in each blank with one word to complete the sentence below.

The SQL keyword for filtering after grouping is __(i)__, and the SQL keyword for querying is __(ii)__.

Answer: (i): HAVING (ii): WHERE

An important distinction to make for part (ii) is that we’re asking about the keyword for querying. SELECT is incorrect, as SELECT is for specifying a subset of columns, whereas querying is for specifying a subset of rows.


Difficulty: ⭐️⭐️

The average score on this problem was 81%.


Problem 3

Consider the Series small_prices and vc, both of which are defined below.

small_prices = pd.Series([36, 36, 18, 100, 18, 36, 1, 1, 1, 36])

vc = small_prices.value_counts().sort_values(ascending=False)

In each of the parts below, select the value that the provided expression evaluates to. If the expression errors, select “Error”.


Problem 3.1

vc.iloc[0]

Answer: 4

vc computes the frequency of each value in small_prices, sorts the counts in descending order, and stores the result in vc. The resulting vc will look like this, where 36, 1, 18, 100 are the indices and 4, 3, 2, 1 are the values:
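For reference, constructing and printing vc gives the output sketched below (the Name: count line assumes a recent pandas version; older versions omit it):

print(vc)
# 36     4
# 1      3
# 18     2
# 100    1
# Name: count, dtype: int64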

  • iloc[0] retrieves the value at integer position 0 in vc. Remember that position can be different than index for a Series: for this Series, the indices are 36, 1, 18, 100, whereas the integer positions are 0, 1, 2, 3.
  • The result is 4.

Difficulty: ⭐️⭐️⭐️

The average score on this problem was 73%.


Problem 3.2

vc.loc[0]

Answer: Error

  • loc[0] attempts to retrieve the value associated with the index 0 in vc.
  • Since vc has indices 36, 1, 18, 100, and 0 is not a valid index, this operation results in an error.

Difficulty: ⭐️⭐️

The average score on this problem was 83%.


Problem 3.3

vc.index[0]

Answer: 36

  • index[0] retrieves the first index in vc, which corresponds to the value 36.

Difficulty: ⭐️⭐️⭐️

The average score on this problem was 69%.


Problem 3.4

vc.iloc[1]

Answer: 3

  • iloc[1] retrieves the value at integer position 1 in vc, which is 3.

Difficulty: ⭐️⭐️⭐️

The average score on this problem was 74%.


Problem 3.5

vc.loc[1]

Answer: 3

  • loc[1] retrieves the value associated with the index 1 in vc, which corresponds to the frequency of the value 1 in small_prices.
  • The result is 3.

Difficulty: ⭐️⭐️

The average score on this problem was 80%.


Problem 3.6

vc.index[1]

Answer: 1

  • index[1] retrieves the index at integer position 1 in vc (the second-most-frequent value).
  • The result is 1.

Difficulty: ⭐️⭐️⭐️

The average score on this problem was 64%.



Problem 4

Consider the DataFrames type_pivot, clinique, fresh, and boscia, defined below.

type_pivot = skin.pivot_table(index="Type",
                              columns="Brand", 
                              values="Sensitive",
                              aggfunc=lambda s: s.shape[0] + 1)
                               
clinique = skin[skin["Brand"] == "CLINIQUE"]
fresh = skin[skin["Brand"] == "FRESH"]
boscia = skin[skin["Brand"] == "BOSCIA"]

Three columns of type_pivot are shown below in their entirety.
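Since the table isn't reproduced here, the following reconstruction is inferred from the counts used in the solutions below (each entry is the group's row count plus 1, per the aggfunc above; missing groups are NaN, and the original rendering may differ):

Brand         BOSCIA  CLINIQUE  FRESH
Type
Cleanser         2.0       6.0    NaN
Eye cream        2.0       4.0    NaN
Face Mask        4.0       3.0    4.0
Moisturizer      NaN       3.0    3.0
Sun protect      NaN       2.0    NaN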

In each of the parts below, give your answer as an integer.


Problem 4.1

How many rows are in the following DataFrame?

clinique.merge(fresh, on="Type", how="inner")

Answer: 10

The expression clinique.merge(fresh, on="Type", how="inner") performs an inner join on the Type column between the clinique and fresh DataFrames. An inner join includes only rows where the Type values are present in both DataFrames.

From the provided type_pivot table:

  • The overlapping Type values between CLINIQUE and FRESH are:

    • Face Mask (3.0 for CLINIQUE, 4.0 for FRESH)

    • Moisturizer (3.0 for CLINIQUE, 3.0 for FRESH)

However, we must notice that the aggfunc for our original pivot table is lambda s: s.shape[0] + 1. Thus, we know that the values in the pivot table include an additional count of 1 for each group. This means the raw counts are actually:

  • Face Mask: 2 rows in CLINIQUE, 3 rows in FRESH

  • Moisturizer: 2 rows in CLINIQUE, 2 rows in FRESH

The inner join will match these rows based on Type. Since there are 2 rows in CLINIQUE for both Face Mask and Moisturizer, and 3 and 2 rows in FRESH respectively, the result will have:

  • Face Mask: 2 (from CLINIQUE) × 3 (from FRESH) = 6 rows

  • Moisturizer: 2 (from CLINIQUE) × 2 (from FRESH) = 4 rows

The total number of rows in the resulting DataFrame is 6 + 4 = 10.
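One way to convince yourself of the “multiply the counts per key" behavior is with stand-in DataFrames that have the same Type counts as clinique and fresh (only the Type column matters here; all other columns are omitted):

import pandas as pd

clinique = pd.DataFrame({"Type": ["Cleanser"] * 5 + ["Eye cream"] * 3 +
                                 ["Face Mask"] * 2 + ["Moisturizer"] * 2 +
                                 ["Sun protect"] * 1})
fresh = pd.DataFrame({"Type": ["Face Mask"] * 3 + ["Moisturizer"] * 2})

# Inner join: 2 * 3 Face Mask pairings + 2 * 2 Moisturizer pairings = 10 rows.
print(clinique.merge(fresh, on="Type", how="inner").shape[0])   # 10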


Difficulty: ⭐️⭐️⭐️⭐️

The average score on this problem was 42%.


Problem 4.2

How many rows are in the following DataFrame?

(clinique.merge(fresh, on="Type", how="outer")
         .merge(boscia, on="Type", how="outer"))

Answer: 31

An outer join includes all rows from all DataFrames, filling in missing values (NaN) for any Type not present in one of them. From the table, the counts are calculated as follows:

Remember that our pivot_table aggfunc is calculating lambda s: s.shape[0] + 1 (so we need to subtract one from the pivot table entry to get the actual number of rows).

  • Cleanser: 5 · 1
    • 5 rows in CLINIQUE
    • 1 row in BOSCIA
  • Eye cream: 3 · 1
    • 3 rows in CLINIQUE
    • 1 row in BOSCIA
  • Face Mask: 2 · 3 · 3
    • 2 rows in CLINIQUE
    • 3 rows in FRESH
    • 3 rows in BOSCIA
  • Moisturizer: 2 · 2
    • 2 rows in CLINIQUE
    • 2 rows in FRESH
  • Sun protect: 1
    • 1 row in CLINIQUE

Final Calculation:

5 · 1 + 3 · 1 + 2 · 3 · 3 + 2 · 2 + 1 = 31


Difficulty: ⭐️⭐️⭐️⭐️⭐️

The average score on this problem was 27%.



Problem 5

Consider a sample of 60 skincare products. The name of one product from the sample is given below:

“our drops cream is the best drops drops for eye drops drops proven formula..."

The total number of terms in the product name above is unknown, but we know that the term drops only appears in the name 5 times.

Suppose the TF-IDF of drops in the product name above is \frac{2}{3}. Which of the following statements are NOT possible, assuming we use a base-2 logarithm? Select all that apply.

Answer:

  • All 60 product names contain the term drops.
  • None of the 59 other product names contain the term drops.
  • There are 25 terms in the product name above in total.

The TF-IDF score of a term is calculated as: \text{TF-IDF} = \text{TF} \cdot \text{IDF} Where:

  • \text{TF}: Term Frequency in a document (the ratio of the term’s occurrences to the total number of terms in the document).

  • \text{IDF}: Inverse Document Frequency, which is given by: \text{IDF} = \log_2\left(\frac{N}{1 + n}\right) Here:

    • N: Total number of documents (60 in this case).

    • n: Number of other product names containing the term, so 1 + n documents contain it in total.

The TF-IDF of drops in the product name is given as \frac{2}{3}.


Step 1: Calculating Term Frequency (TF) The term drops appears 5 times in the product name. If the total number of terms in the product name is T, then: \text{TF} = \frac{5}{T}


Step 2: Calculating Inverse Document Frequency (IDF) Substitute the known values into the IDF formula: \text{IDF} = \log_2\left(\frac{60}{1 + n}\right) Where n is the number of other product names (out of 59) that contain drops.


Step 3: Combine TF and IDF to Match TF-IDF The TF-IDF is given as \frac{2}{3}. Substituting the expressions for TF and IDF: \frac{5}{T} \cdot \log_2\left(\frac{60}{1 + n}\right) = \frac{2}{3}


Step 4: Analyze Each Option


If all 60 product names contain drops, then n = 59. Substituting into the IDF formula: \text{IDF} = \log_2\left(\frac{60}{1 + 59}\right) = \log_2(1) = 0 Since \text{IDF} = 0, the TF-IDF score would also be 0, which contradicts the given TF-IDF of \frac{2}{3}.

This is NOT possible.



If 14 other product names contain drops, then n = 14. Substituting into the IDF formula: \text{IDF} = \log_2\left(\frac{60}{1 + 14}\right) = \log_2\left(\frac{60}{15}\right) = \log_2(4) = 2 Substituting \text{IDF} = 2 into the TF-IDF equation: \frac{5}{T} \cdot 2 = \frac{2}{3} Solving for T: \frac{10}{T} = \frac{2}{3} \implies T = 15 This is within the problem’s constraints.

This is possible.



If no other product names contain drops, then n = 0. Substituting into the IDF formula: \text{IDF} = \log_2\left(\frac{60}{1 + 0}\right) = \log_2(60) Approximating \log_2(60) \approx 5.91, substituting into the TF-IDF equation: \frac{5}{T} \cdot 5.91 = \frac{2}{3} Solving for T: \frac{29.55}{T} = \frac{2}{3} \implies T \approx 44.33 Since T must be an integer, this is not consistent with the problem’s constraints.

This is NOT possible.



If T = 15, substituting into the TF equation: \text{TF} = \frac{5}{15} = \frac{1}{3} Substituting \text{TF} = \frac{1}{3} into the TF-IDF equation: \frac{1}{3} \cdot \text{IDF} = \frac{2}{3} \implies \text{IDF} = 2 Substituting \text{IDF} = 2 into the IDF formula: 2 = \log_2\left(\frac{60}{1 + n}\right) 2 = \log_2(4) \implies \frac{60}{1 + n} = 4 \implies n = 14 This is consistent with the problem’s constraints.

This is possible.



If T = 25, substituting into the TF equation: \text{TF} = \frac{5}{25} = \frac{1}{5} Substituting \text{TF} = \frac{1}{5} into the TF-IDF equation: \frac{1}{5} \cdot \text{IDF} = \frac{2}{3} \implies \text{IDF} = \frac{10}{3} Substituting \text{IDF} = \frac{10}{3} into the IDF formula: \frac{10}{3} = \log_2\left(\frac{60}{1 + n}\right) This results in a value for n that is not an integer.

This is NOT possible.
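As a sanity check, a short brute-force search over integer values of T (total terms in the name) and n (other names containing drops), using the IDF convention above, recovers the only consistent scenario:

import numpy as np

solutions = []
for T in range(5, 201):        # the name contains at least the 5 occurrences of "drops"
    for n in range(0, 60):     # 0 to 59 other product names can contain "drops"
        tfidf = (5 / T) * np.log2(60 / (1 + n))
        if abs(tfidf - 2 / 3) < 1e-9:
            solutions.append((T, n))
print(solutions)   # [(15, 14)]: 15 total terms, 14 other names contain "drops"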


Difficulty: ⭐️⭐️

The average score on this problem was 80%.


Problem 6

Suppose soup is a BeautifulSoup object representing the homepage of wolfskin.com, a Sephora competitor.

Furthermore, suppose prods, defined below, is a list of strings containing the name of every product on the site.

prods = [row.get("prod") for row in soup.find_all("row", class_="thing")]

Given that prods[1] evaluates to "Cleansifier", which of the following options describes the source code of wolfskin.com?

Answer: Option 2

Explanation:

The given code:

prods = [row.get("prod") for row in soup.find_all("row", class_="thing")]

retrieves all <row> elements with the class "thing" and extracts the value of the "prod" attribute for each.

Option 1: In this structure, the product names appear as text content within the <row> tags, not as an attribute. row.get("prod") would return None.

Option 2: Here, the product names are stored as the value of the "prod" attribute. row.get("prod") successfully retrieves these values.

Option 3: In this structure, the "prod" attribute contains the value "thing", and the product names are stored as the value of the "class" attribute. soup.find_all("row", class_="thing") will not retrieve any of these tags since there are no tags with class="thing".

Option 4: In this structure, the product names are part of the text content inside the <row> tags. row.get("prod") would return None.


Difficulty: ⭐️⭐️

The average score on this problem was 76%.


Problem 7

Consider a dataset of n values, y_1, y_2, ..., y_n, all of which are positive. We want to fit a constant model, H(x) = h, to the data.

Let h_p^* be the optimal constant prediction that minimizes average degree-p loss, R_p(h), defined below.

R_p(h) = \frac{1}{n} \sum_{i = 1}^n | y_i - h |^p

For example, h_2^* is the optimal constant prediction that minimizes \displaystyle R_2(h) = \frac{1}{n} \sum_{i = 1}^n |y_i - h|^2.

In each of the parts below, determine the value of the quantity provided. By “the data", we are referring to y_1, y_2, ..., y_n.


Problem 7.1

h_0^*

Answer: The mode of the data

The minimizer of empirical risk for the constant model when using zero-one loss is the mode.


Difficulty: ⭐️

The average score on this problem was 100%.


Problem 7.2

h_1^*

Answer: The median of the data

The minimizer of empirical risk for the constant model when using absolute loss is the median.


Difficulty: ⭐️⭐️

The average score on this problem was 87%.


Problem 7.3

R_1(h_1^*)

Answer: None of the above

R_1(h_1^*) is not a minimizer but the minimum value of the empirical risk: it is the mean absolute deviation from the median,

R_1(h_1^*) = \frac{1}{n} \sum_{i=1}^n |y_i - h_1^*|,

which was not one of the listed options, so the answer is “None of the above".


Difficulty: ⭐️⭐️⭐️⭐️

The average score on this problem was 44%.


Problem 7.4

h_2^*

Answer: The mean of the data

The minimizer of empirical risk for the constant model when using squared loss is the mean.


Difficulty: ⭐️⭐️

The average score on this problem was 81%.


Problem 7.5

R_2(h_2^*)

Answer: The variance of the data

R_2(h_2^*) is the minimum value of the empirical risk: the mean squared error about the mean, which is exactly the variance of the data (since h_2^* is the mean).

R_2(h_2^*) represents the mean squared error:

R_2(h_2^*) = \frac{1}{n} \sum_{i=1}^n (y_i - h_2^*)^2


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 70%.


Now, suppose we want to find the optimal constant prediction, h_\text{U}^*, using the “Ulta" loss function, defined below.

L_U(y_i, h) = y_i (y_i - h)^2


Problem 7.6

To find h_\text{U}^*, suppose we minimize average Ulta loss (with no regularization). How does h_\text{U}^* compare to the mean of the data, M?

Answer: h_\text{U}^* \geq M

Each squared error is weighted by y_i, so larger values of y_i contribute more to the loss, and the optimal prediction gets pulled toward them. In fact, setting the derivative of the average Ulta loss to zero gives h_\text{U}^* = \frac{\sum_{i=1}^n y_i^2}{\sum_{i=1}^n y_i}, a weighted mean of the data that is always at least the ordinary mean M.


Difficulty: ⭐️⭐️⭐️⭐️

The average score on this problem was 42%.


Now, to find the optimal constant prediction, we will instead minimize regularized average Ulta loss, R_\lambda(h), where \lambda is a non-negative regularization hyperparameter:

R_\lambda(h) = \left( \frac{1}{n} \sum_{i = 1}^n y_i (y_i - h)^2 \right) + \lambda h^2

It can be shown that \displaystyle \frac{\partial R_\lambda(h)}{\partial h}, the derivative of R_\lambda(h) with respect to h, is:

\frac{\partial R_\lambda(h)}{\partial h} = -2 \left( \frac{1}{n} \sum_{i = 1}^n y_i (y_i - h) - \lambda h \right)


Problem 7.7

Find h^*, the constant prediction that minimizes R_\lambda(h). Show your work, and put a box around your final answer, which should be an expression in terms of y_i, n, and/or \lambda.

Answer: h^* = \frac{\sum_{i = 1}^n y_i^2}{\sum_{i = 1}^n y_i + n\lambda}

To minimize the regularized average Ulta loss, we solve the equation by setting the derivative to zero and solving for h.

Set the derivative to zero:

\frac{\partial R_\lambda(h)}{\partial h} = 0

-2 \left( \frac{1}{n} \sum_{i = 1}^n y_i (y_i - h) \right) + 2\lambda h = 0

Simplify:

\frac{1}{n} \sum_{i = 1}^n y_i^2 - \frac{1}{n} \sum_{i = 1}^n y_i h - \lambda h = 0

Combine terms:

\frac{1}{n} \sum_{i = 1}^n y_i^2 = h \left( \frac{1}{n} \sum_{i = 1}^n y_i + \lambda \right)

Solve for h:

h^* = \frac{\frac{1}{n} \sum_{i = 1}^n y_i^2}{\frac{1}{n} \sum_{i = 1}^n y_i + \lambda}

h^* = \frac{\sum_{i = 1}^n y_i^2}{\sum_{i = 1}^n y_i + n\lambda}
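As a quick numerical sanity check of this closed form (the data and \lambda below are made up, and scipy's scalar minimizer is used as a convenience):

import numpy as np
from scipy.optimize import minimize_scalar

y = np.array([2.0, 3.0, 5.0, 7.0])   # made-up positive data
lam = 0.5                            # made-up regularization strength
n = len(y)

R = lambda h: np.mean(y * (y - h) ** 2) + lam * h ** 2
closed_form = (y ** 2).sum() / (y.sum() + n * lam)

print(closed_form)            # 87 / 19 ≈ 4.5789
print(minimize_scalar(R).x)   # agrees up to numerical precision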


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 71%.



Problem 8

Suppose we want to fit a simple linear regression model (using squared loss) that predicts the number of ingredients in a product given its price. We’re given that:

The intercept and slope of the regression line are w_0^* = 11 and w_1^* = \frac{1}{10}, respectively.


Problem 8.1

Suppose Victors’ Veil (a skincare product) costs $40 and has 11 ingredients. What is the squared loss of our model’s predicted number of ingredients for Victors’ Veil? Give your answer as a number.

Answer: 16

The predicted number of ingredients for Victors’ Veil is calculated using the regression model: \hat{y} = w_0^* + w_1^* x

Substituting w_0^* = 11, w_1^* = \frac{1}{10}, and x = 40: \hat{y} = 11 + \frac{1}{10} \cdot 40 = 11 + 4 = 15

The squared loss is then:

L = (\hat{y} - y)^2 = (15 - 11)^2 = 4^2 = 16


Difficulty: ⭐️⭐️

The average score on this problem was 86%.


Problem 8.2

Is it possible to answer part (a) above just by knowing \bar x and \bar y, i.e. without knowing the values of w_0^* and w_1^*?

Answer: Yes; the values of w_0^* and w_1^* don’t impact the answer to part (a).

To answer part (a), we only need to know \bar{x} and \bar{y}: the regression line always passes through (\bar{x}, \bar{y}), and since Victors’ Veil’s price of $40 is exactly \bar{x}, its predicted number of ingredients is \bar{y} = 15 regardless of the particular slope and intercept.


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 56%.


Problem 8.3

Suppose x_i represents the price of product i, and suppose u_i represents the negative price of product i. In other words, for i = 1, 2, ..., n, where n is the number of points in our dataset:

u_i = - x_i

Suppose U is the design matrix for the simple linear regression model that uses negative price to predict number of ingredients. Which of the following matrices could be U^TU?

\begin{bmatrix} -15 & 600 \\ 600 & -30000 \end{bmatrix}

\begin{bmatrix} 15 & -600 \\ -600 & 30000 \end{bmatrix}

\begin{bmatrix} -15 & 450 \\ 450 & -30000 \end{bmatrix}

\begin{bmatrix} 15 & -450 \\ -450 & 30000 \end{bmatrix}

Answer: \begin{bmatrix} 15 & -600 \\ -600 & 30000 \end{bmatrix}

The design matrix U is defined as: U = \begin{bmatrix} 1 & -x_1 \\ 1 & -x_2 \\ \vdots & \vdots \\ 1 & -x_n \end{bmatrix} where 1 represents the intercept term, and -x_i represents the negative price of product i. Thus:

U^T U = \begin{bmatrix} 1 & 1 & \cdots & 1 \\ -x_1 & -x_2 & \cdots & -x_n \end{bmatrix} \begin{bmatrix} 1 & -x_1 \\ 1 & -x_2 \\ \vdots & \vdots \\ 1 & -x_n \end{bmatrix}

When we multiply this out, we get: U^T U = \begin{bmatrix} n & -\sum x_i \\ -\sum x_i & \sum x_i^2 \end{bmatrix}

Conceptually, this matrix represents: \begin{bmatrix} \text{Number of elements} & \text{Sum of the negative values of the data} \\ \text{Sum of the negative values of the data} & \text{Sum of squared values of the data} \end{bmatrix}.

We know that the sum of all values in a series equals the mean of the series multiplied by the number of values: \Sigma x_i = n \cdot \bar{x}.

We’re given that \bar{x} = 40, and the top-left entry of the matrix, 15, is the number of data points n. Thus \Sigma x_i = 40 \times 15 = 600, the off-diagonal entries are -600, and the solution must be as follows:

\begin{bmatrix} n & -\sum x_i \\ -\sum x_i & \sum x_i^2 \end{bmatrix} = \begin{bmatrix} 15 & -600 \\ -600 & 30000 \end{bmatrix}
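A small sketch of this computation with hypothetical prices whose mean is 40 (n = 15 to match the matrix above; the individual prices, and hence \sum x_i^2, are made up):

import numpy as np

x = np.array([20, 30, 40, 50, 60] * 3)     # 15 made-up prices with mean 40
u = -x                                     # negative prices
U = np.column_stack([np.ones(len(u)), u])  # design matrix: intercept column, then u

print(U.T @ U)
# [[   15.  -600.]
#  [ -600. 27000.]]   <- n and -sum(x) match; sum(x^2) depends on the unknown prices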


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 74%.



Problem 9

Suppose we want to fit a multiple linear regression model (using squared loss) that predicts the number of ingredients in a product given its price and various other information.

From the Data Overview page, we know that there are 6 different types of products. Assume in this question that there are 20 different product brands. Consider the models defined in the table below.

Model Name | Intercept | Price | Type                                 | Brand
Model A    | Yes       | Yes   | No                                   | One hot encoded without drop="first"
Model B    | Yes       | Yes   | No                                   | No
Model C    | Yes       | Yes   | One hot encoded without drop="first" | No
Model D    | No        | Yes   | One hot encoded with drop="first"    | One hot encoded with drop="first"
Model E    | No        | Yes   | One hot encoded with drop="first"    | One hot encoded without drop="first"

For instance, Model A above includes an intercept term, price as a feature, one hot encodes brand names, and doesn’t use drop="first" as an argument to OneHotEncoder in sklearn.

In parts (a) through (c), you are given a model. For each model provided, state the number of columns and the rank (i.e. number of linearly independent columns) of the design matrix, X. Some of part (a) is already done for you as an example.


Problem 9.1

Model A

Number of columns in X: Rank of X: 21

Answer:

Columns: 22

Model A includes an intercept, the price feature, and a one-hot encoding of the 20 brands without dropping the first category. The intercept adds 1 column, the price adds 1 column, and the one-hot encoding for brands adds 20 columns.


Difficulty: ⭐️⭐️

The average score on this problem was 80%.


Problem 9.2

Model B

Number of columns in X: Rank of X:

Answer:

Columns: 2

Rank: 2

Model B includes an intercept and the price feature, but no one-hot encoding for the brands or product types. The intercept and price are linearly independent, resulting in both the number of columns and the rank being 2.


Difficulty: ⭐️⭐️

The average score on this problem was 82%.


Problem 9.3

Model C

Number of columns in X: Rank of X:

Answer:

Columns: 8

Rank: 7

Model C includes an intercept, the price feature, and a one-hot encoding of the 6 product types without dropping the first category. The intercept adds 1 column, the price adds 1 column, and the one-hot encoding for the product types adds 6 columns. However, one column is linearly dependent because we didn’t drop one of the one-hot encoded columns, reducing the rank to 7.


Difficulty: ⭐️⭐️

The average score on this problem was 81%.


Problem 9.4

Which of the following models are NOT guaranteed to have residuals that sum to 0?

Hint: Remember, the residuals of a fit model are the differences between actual and predicted y-values, among the data points in the training set.

Answer: Model D

For residuals to sum to zero in a linear regression model, the design matrix must include either an intercept term (Models A - C) or equivalent redundancy in the encoded variables to act as a substitute for the intercept (Model E). Models that lack both an intercept and equivalent redundancy, such as Model D, are not guaranteed to have residuals sum to zero.

One-hot encoding without dropping any category can behave like an intercept because it introduces a column for each category, allowing the model to adjust its predictions as if it had an intercept term. The sum of the indicator variables in such a setup equals 1 for every row, creating a similar effect to an intercept, which is why residuals may sum to zero in these cases.


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 60%.



Problem 10

Suppose we want to create polynomial features and use ridge regression (i.e. minimize mean squared error with L_2 regularization) to fit a linear model that predicts the number of ingredients in a product given its price.

To choose the polynomial degree and regularization hyperparameter, we use cross-validation through GridSearchCV in sklearn using the code below.

searcher = GridSearchCV(
    make_pipeline(PolynomialFeatures(include_bias=False), 
                  Ridge()),
    
    param_grid={"polynomialfeatures__degree": np.arange(1, D + 1), 
                "ridge__alpha": 2 ** np.arange(1, L + 1)},

    cv=K # K-fold cross-validation.
) 
searcher.fit(X_train, y_train)

Assume that there are N rows in X_train, where N is a multiple of K (the number of folds used for cross-validation).

In each of the parts below, give your answer as an expression involving D, L, K, and/or N. Part (a) is done for you as an example.


Problem 10.1

How many combinations of hyperparameters are being considered?

Answer: LD

The parameter grid has two hyperparameters: the polynomial degree, which takes the D values 1, 2, \ldots, D, and the ridge regularization hyperparameter \alpha, which takes the L values 2^1, 2^2, \ldots, 2^L. GridSearchCV tries every pairing, giving L \cdot D combinations.


Problem 10.2

Each time a model is trained, how many points are being used to train the model?

Answer: N \times \frac{K-1}{K}

In K-fold cross-validation, the data is split into K folds. Each time a model is trained, one fold is used for validation, and the remaining K-1 folds are used for training. Since N is evenly split among K folds, each fold contains \frac{N}{K} points. The training data, therefore, consists of N - \frac{N}{K} = N \times \frac{K-1}{K} points.


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 72%.


Problem 10.3

In total, how many times are X_train.iloc[1] and X_train.iloc[-1] both used for training a model at the same time? Assume that these two points are in different folds.

Answer: LD \times (K-2)

In K-fold cross-validation, X_{\text{train}}.\text{iloc}[1] and X_{\text{train}}.\text{iloc}[-1] sit in different folds, so they are both used for training only when neither of their folds is serving as the validation fold. That happens in K - 2 of the K training runs for each hyperparameter combination. Since the grid search tries L \cdot D combinations, the two points are used together for training LD \times (K-2) times.
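One way to see the K - 2 factor, for a single hyperparameter combination, is to enumerate which fold is held out (the fold labels below are hypothetical):

K = 5
fold_a, fold_b = 0, 3   # the two points sit in different folds
together = sum(1 for val_fold in range(K) if val_fold not in (fold_a, fold_b))
print(together)         # K - 2 = 3; multiply by the L * D hyperparameter combinations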


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 53%.



Problem 11

Suppose we want to use LASSO (i.e. minimize mean squared error with L_1 regularization) to fit a linear model that predicts the number of ingredients in a product given its price and rating.

Let \lambda be a non-negative regularization hyperparameter. Using cross-validation, we determine the average validation mean squared error — which we’ll refer to as AVMSE in this question — for several different choices of \lambda. The results are given in a plot (not reproduced here) of AVMSE against \lambda, with three marked values A, B, and C.


Problem 11.1

As \lambda increases, what happens to model complexity and model variance?

Answer: Model complexity and model variance both decrease.

As \lambda increases, the L_1 regularization penalizes the model more heavily for including features, effectively reducing the number of non-zero coefficients in the model. This decreases model complexity. Additionally, by reducing complexity, the model becomes less sensitive to the training data, leading to lower variance.


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 64%.


Problem 11.2

What does the value A on the graph above correspond to?

Answer: The AVMSE of an unregularized multiple linear regression model.

Point A represents the case where \lambda = 0, meaning no regularization is applied. This corresponds to an unregularized multiple linear regression model.


Difficulty: ⭐️⭐️

The average score on this problem was 80%.


Problem 11.3

What does the value B on the graph above correspond to?

Answer: The AVMSE of the \lambda we’d choose to use to train a model.

Point B is the minimum point on the graph, indicating the optimal \lambda value that minimizes the AVMSE. This is the \lambda we’d choose to train the model.


Difficulty: ⭐️

The average score on this problem was 91%.


Problem 11.4

What does the value C on the graph above correspond to?

Answer: The AVMSE of the constant model.

Point C represents the AVMSE when \lambda is very large, effectively forcing all coefficients to zero. This corresponds to the constant model.


Difficulty: ⭐️⭐️

The average score on this problem was 83%.



Problem 12

Suppose we fit five different classifiers that predict whether a product is designed for sensitive skin, given its price and number of ingredients. In the five decision boundaries below, the gray-shaded regions represent areas in which the classifier would predict that the product is designed for sensitive skin (i.e. predict class 1).


Problem 12.1

Which model does Decision Boundary 1 correspond to?

Answer: Logistic Regression

Logistic regression creates a linear divide between classes.


Difficulty: ⭐️

The average score on this problem was 93%.


Problem 12.2

Which model does Decision Boundary 2 correspond to?

Answer: Decision tree with \text{max depth} = 15

We know that Decision Boundaries 2 and 5 are decision trees, since the decision boundaries are parallel to the axes. Decision boundaries for decision trees are parallel to axes because the tree splits the feature space based on one feature at a time, creating partitions aligned with that feature’s axis (e.g., “Price ≤ 100” results in a vertical split).

Decision Boundary 2 is more complicated than Decision Boundary 5, so we can assume there are more levels.


Difficulty: ⭐️⭐️

The average score on this problem was 82%.


Problem 12.3

Which model does Decision Boundary 3 correspond to?

Answer: k-nearest neighbors with k = 3

k-nearest neighbors decision boundaries don’t follow straight lines like logistic regression or decision trees. Thus, boundaries 3 and 4 are k-nearest neighbors.

Boundary 3 is more closely fit to the training data than Boundary 4, so we can assume it corresponds to the smaller k, i.e. k-nearest neighbors with k = 3 (smaller values of k produce more complex, more flexible boundaries).

For example, if in our training data we had 3 points in the group represented by the light-shaded region at (300, 150), the decision boundary for k-nearest neighbors with k = 3 would be light-shaded at (300, 150), whereas it may be dark-shaded for k-nearest neighbors with k = 100.


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 59%.


Problem 12.4

Which model does Decision Boundary 4 correspond to?

Answer: k-nearest neighbors with k = 100

k-nearest neighbors decision boundaries don’t follow straight lines like logistic regression or decision trees. Thus, boundaries 3 and 4 are k-nearest neighbors.

Boundary 4 is less closely fit to the training data than Boundary 3, so we can assume it corresponds to the larger k, i.e. k-nearest neighbors with k = 100 (averaging over more neighbors smooths out the boundary).


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 56%.



Problem 13

Suppose we fit a logistic regression model that predicts whether a product is designed for sensitive skin, given its price, x^{(1)}, number of ingredients, x^{(2)}, and rating, x^{(3)}. After minimizing average cross-entropy loss, the optimal parameter vector is as follows:

\vec{w}^* = \begin{bmatrix} -1 \\ 1 / 5 \\ - 3 / 5 \\ 0 \end{bmatrix}

In other words, the intercept term is -1, the coefficient on price is \frac{1}{5}, the coefficient on the number of ingredients is -\frac{3}{5}, and the coefficient on rating is 0.

Consider the following four products:

Which of the following products have a predicted probability of being designed for sensitive skin of at least 0.5 (50%)? For each product, select Yes or No and justify your answer.

Wolfcare:

Justify your answer.

Go Blue Glow:

Justify your answer.

DataSPF:

Justify your answer.

Maize Mist:

Justify your answer.

To solve this problem, we compute z = w_0^* + w_1^* x^{(1)} + w_2^* x^{(2)} + w_3^* x^{(3)} for each product and pass it through the sigmoid function to obtain a probability between 0 and 1. As a reminder, the sigmoid function \sigma(z) = \frac{1}{1 + e^{-z}} approaches 0 as z becomes very negative and approaches 1 as z becomes very positive, with \sigma(0) = 0.5.

Wolfcare:

z = -1 + \frac{1}{5}(15) - \frac{3}{5}(20) + 0 = -1 + 3 - 12 = -10 P = \frac{1}{1 + e^{-(-10)}} = \frac{1}{1 + e^{10}} \approx 0

Decision: No, as P < 0.5

Go Blue Glow:

z = -1 + \frac{1}{5}(25) - \frac{3}{5}(5) + 0 = -1 + 5 - 3 = 1 P = \frac{1}{1 + e^{-1}} \approx \frac{1}{1 + 0.3679} \approx 0.73

Decision: Yes, as P \geq 0.5.

DataSPF:

z = -1 + \frac{1}{5}(50) - \frac{3}{5}(15) + 0 = -1 + 10 - 9 = 0 P = \frac{1}{1 + e^{-0}} = \frac{1}{2} = 0.5

Decision: Yes, as P \geq 0.5.

Maize Mist: z = -1 + \frac{1}{5}(0) - \frac{3}{5}(1) + 0 = -1 - 0.6 = -1.6 P = \frac{1}{1 + e^{-(-1.6)}} = \frac{1}{1 + e^{1.6}} \approx \frac{1}{1 + 4.953} \approx 0.17 Decision: No, as P < 0.5
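A short sketch verifying all four probabilities at once (the prices and ingredient counts are taken from the calculations above; ratings are set to 0 since their coefficient is 0):

import numpy as np

w = np.array([-1, 1/5, -3/5, 0])   # intercept, price, ingredients, rating

# Rows: Wolfcare, Go Blue Glow, DataSPF, Maize Mist.
# Columns: 1 (intercept), price, number of ingredients, rating.
X = np.array([
    [1, 15, 20, 0],
    [1, 25,  5, 0],
    [1, 50, 15, 0],
    [1,  0,  1, 0],
])

probs = 1 / (1 + np.exp(-(X @ w)))
print(np.round(probs, 3))   # [0.    0.731 0.5   0.168]
print(probs >= 0.5)         # [False  True  True False]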


Difficulty: ⭐️⭐️

The average score on this problem was 85%.


Problem 14

Suppose, again, that we fit a logistic regression model that predicts whether a product is designed for sensitive skin. We’re deciding between three thresholds, A, B, and C, all of which are real numbers between 0 and 1 (inclusive). If a product’s predicted probability of being designed for sensitive skin is above our chosen threshold, we predict they belong to class 1 (yes); otherwise, we predict class 0 (no).

The confusion matrices of our model on the test set for all three thresholds are shown below.

\begin{array}{ccc} \textbf{$A$} & \textbf{$B$} & \textbf{$C$} \\ \begin{array}{|l|c|c|} \hline & \text{Pred.} \: 0 & \text{Pred.} \: 1 \\ \hline \text{Actually} \: 0 & 40 & 5 \\ \hline \text{Actually} \: 1 & 35 & 35 \\ \hline \end{array} & \begin{array}{|l|c|c|} \hline & \text{Pred.} \: 0 & \text{Pred.} \: 1 \\ \hline \text{Actually} \: 0 & 5 & 40 \\ \hline \text{Actually} \: 1 & 10 & ??? \\ \hline \end{array} & \begin{array}{|l|c|c|} \hline & \text{Pred.} \: 0 & \text{Pred.} \: 1 \\ \hline \text{Actually} \: 0 & 10 & 35 \\ \hline \text{Actually} \: 1 & 30 & 40 \\ \hline \end{array} \end{array}


Problem 14.1

Suppose we choose threshold A, i.e. the leftmost confusion matrix. What is the precision of the resulting predictions? Give your answer as an unsimplified fraction.

Answer: \boxed{\frac{35}{40}}

The precision of a classification model is calculated as:

\text{Precision} = \frac{\text{True Positives (TP)}}{\text{True Positives (TP)} + \text{False Positives (FP)}}

From the confusion matrix for threshold A:

\begin{array}{|l|c|c|} \hline & \text{Pred.} \: 0 & \text{Pred.} \: 1 \\ \hline \text{Actually} \: 0 & 40 & 5 \\ \hline \text{Actually} \: 1 & 35 & 35 \\ \hline \end{array}

The number of true positives (TP) is 35, and the number of false positives (FP) is 5. Substituting into the formula:

\text{Precision} = \frac{35}{35 + 5} = \frac{35}{40}

Thus, the precision for threshold A is:

\boxed{\frac{35}{40}}


Difficulty: ⭐️⭐️

The average score on this problem was 85%.


Problem 14.2

What is the missing value (???) in the confusion matrix for threshold B? Give your answer as an integer.

Answer: 60

From the confusion matrix for threshold B:

\begin{array}{|l|c|c|} \hline & \text{Pred.} \: 0 & \text{Pred.} \: 1 \\ \hline \text{Actually} \: 0 & 5 & 40 \\ \hline \text{Actually} \: 1 & 10 & ??? \\ \hline \end{array}

The sum of all entries in a confusion matrix must equal the total number of samples. For thresholds A and C, the total number of samples is:

40 + 5 + 35 + 35 = 115

Using this total, for threshold B, the entries provided are:

5 (\text{TN}) + 40 (\text{FP}) + 10 (\text{FN}) + ??? (\text{TP}) = 115

Solving for the missing value (???):

115 - (5 + 40 + 10) = 60

Thus, the missing value for threshold B is:

\boxed{60}


Difficulty: ⭐️

The average score on this problem was 90%.


Problem 14.3

Using the information in the three confusion matrices, arrange the thresholds from largest to smallest. Remember that 0 \leq A, B, C \leq 1.

Answer: A > C > B

Thresholds can be ordered by how strict they are: the higher the threshold, the more products are predicted as class 0. From the confusion matrices, threshold A predicts class 0 the most often (40 + 35 = 75 products), threshold C is in the middle (10 + 30 = 40), and threshold B predicts class 0 the least often (5 + 10 = 15). Therefore:

\boxed{A > C > B}


Difficulty: ⭐️⭐️⭐️

The average score on this problem was 67%.


Problem 14.4

Remember that in our classification problem, class 1 means the product is designed for sensitive skin, and class 0 means the product is not designed for sensitive skin. In one or two English sentences, explain which is worse in this context and why: a false positive or a false negative.

Answer: A false positive is worse, as it would mean we predict a product is safe for sensitive skin when it isn’t; someone with sensitive skin may use it and be harmed.


Difficulty: ⭐️

The average score on this problem was 91%.



Problem 15

Let \vec{x} = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}. Consider the function \displaystyle Q(\vec x) = x_1^2 - 2x_1x_2 + 3x_2^2 - 1.


Problem 15.1

Fill in the blank to complete the definition of \nabla Q(\vec x), the gradient of Q.

\nabla Q(\vec x) = \begin{bmatrix} 2(x_1 - x_2) \\ \\ \_\_\_\_ \end{bmatrix}

What goes in the blank? Show your work, and put a box around your final answer, which should be an expression involving x_1 and/or x_2.

Answer: \nabla Q(\vec{x}) = \begin{bmatrix} 2(x_1 - x_2) \\ -2x_1 + 6x_2 \end{bmatrix}

The gradient of Q(\vec{x}) is given by: \nabla Q(\vec{x}) = \begin{bmatrix} \frac{\partial Q}{\partial x_1} \\ \frac{\partial Q}{\partial x_2} \end{bmatrix}

First, compute the partial derivatives:

  1. Partial derivative with respect to x_1: Q(x_1, x_2) = x_1^2 - 2x_1x_2 + 3x_2^2 - 1 \frac{\partial Q}{\partial x_1} = 2x_1 - 2x_2

  2. Partial derivative with respect to x_2: \frac{\partial Q}{\partial x_2} = -2x_1 + 6x_2

Thus, the gradient is: \nabla Q(\vec{x}) = \begin{bmatrix} 2(x_1 - x_2) \\ -2x_1 + 6x_2 \end{bmatrix}


Difficulty: ⭐️⭐️

The average score on this problem was 87%.


Problem 15.2

We decide to use gradient descent to minimize Q, using an initial guess of \vec x^{(0)} = \begin{bmatrix} 1 \\ 1 \end{bmatrix} and a learning rate/step size of \alpha.

If after one iteration of gradient descent, we have \vec{x}^{(1)} = \begin{bmatrix} 1 \\ -4 \end{bmatrix}, what is \alpha?

Answer: \alpha = \frac{5}{4}

In gradient descent, the update rule is: \vec{x}^{(k+1)} = \vec{x}^{(k)} - \alpha \nabla Q(\vec{x}^{(k)})

We are given: \vec{x}^{(0)} = \begin{bmatrix} 1 \\ 1 \end{bmatrix}, \quad \vec{x}^{(1)} = \begin{bmatrix} 1 \\ -4 \end{bmatrix}

  1. Compute \nabla Q(\vec{x}^{(0)}):

Substitute \vec{x}^{(0)} = \begin{bmatrix} 1 \\ 1 \end{bmatrix} into the gradient: \nabla Q(\vec{x}^{(0)}) = \begin{bmatrix} 2(1 - 1) \\ -2(1) + 6(1) \end{bmatrix} = \begin{bmatrix} 0 \\ 4 \end{bmatrix}

  2. Use the update rule to find \alpha:

From the update rule: \vec{x}^{(1)} = \vec{x}^{(0)} - \alpha \nabla Q(\vec{x}^{(0)})

Substitute the values: \begin{bmatrix} 1 \\ -4 \end{bmatrix} = \begin{bmatrix} 1 \\ 1 \end{bmatrix} - \alpha \begin{bmatrix} 0 \\ 4 \end{bmatrix}

Separate components: 1 = 1 - \alpha(0) \quad \text{(satisfied for any $\alpha$)}, -4 = 1 - \alpha(4)

Solve for \alpha: -4 = 1 - 4\alpha \implies 4\alpha = 5 \implies \alpha = \frac{5}{4}
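A minimal numerical check of this step, using the gradient derived in part (a):

import numpy as np

def grad_Q(x):
    x1, x2 = x
    return np.array([2 * (x1 - x2), -2 * x1 + 6 * x2])

x0 = np.array([1.0, 1.0])
alpha = 5 / 4
print(x0 - alpha * grad_Q(x0))   # [ 1. -4.]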


Difficulty: ⭐️⭐️

The average score on this problem was 86%.


