Generalization, Cross-Validation, and Regularization



The problems in this worksheet are taken from past exams in similar classes. Work on them on paper, since the exams you take in this course will also be on paper.

We encourage you to complete this worksheet in a live discussion section. Solutions will be made available after all discussion sections have concluded. You don’t need to submit your answers anywhere.

Note: We do not plan to cover all problems here in the live discussion section; the problems we don’t cover can be used for extra practice.


Problem 1

Neerad wants to build a model that predicts the number of open rooms a hotel has, given various other features. He has a training set with 1200 rows available to him for the purposes of training his model.


Problem 1.1

Neerad fits a regression model using the GPTRegression class. GPTRegression models have several hyperparameters that can be tuned, including context_length and sentience.

To choose between 5 possible values of the hyperparameter context_length, Neerad performs k-fold cross-validation.

  1. How many total times is a GPTRegression model fit?
  2. Suppose that every time a GPTRegression model is fit, it appends the number of points in its training set to the list sizes. Note that after performing cross-validation, len(sizes) is equal to your answer to the previous subpart.

What is sum(sizes)?

Answers:

  1. 5k
  2. 6000(k-1)

When we do k-fold cross-validation for one single hyperparameter value, we split the dataset into k folds, and in each iteration i, train the model on the remaining k-1 folds and evaluate on fold i. Since every fold is left out and evaluated on exactly once, the model is fit k times in total. We do this once for every hyperparameter value we want to test, so the total number of model fits required is 5k.

In part 2, note that every model fit uses a training set of the same size: the combined size of the k-1 folds that remain after holding out a single fold. This is 1 - \frac{1}{k} = \frac{k-1}{k} times the size of the entire dataset, i.e. 1200 \cdot \frac{k-1}{k} points, and we fit a model on a dataset of this size 5k times. So, the sum of the training set sizes across all model fits is:

5k \cdot \frac{k-1}{k} \cdot 1200 = 6000(k-1)
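To make the counting concrete, here’s a minimal sketch that simulates the procedure for a specific k. GPTRegression is fictional, so we stand in an ordinary LinearRegression on a synthetic 1200-row dataset, and the candidate hyperparameter values are made up.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

n, k = 1200, 6                             # 1200 training rows; k chosen arbitrarily
context_lengths = [10, 25, 50, 100, 200]   # 5 hypothetical candidate values

X = np.random.rand(n, 3)
y = np.random.rand(n)

sizes = []                                 # training-set size recorded at every fit
for _ in context_lengths:                  # one full k-fold CV per candidate value
    for train_idx, val_idx in KFold(n_splits=k).split(X):
        LinearRegression().fit(X[train_idx], y[train_idx])   # stand-in for GPTRegression
        sizes.append(len(train_idx))

print(len(sizes))   # 5k fits      -> 30 when k = 6
print(sum(sizes))   # 6000(k - 1)  -> 30000 when k = 6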


Problem 1.2

The average training error and validation error for all 5 candidate values of context_length are given below.

Fill in the blanks: As context_length increases, model complexity __(i)__. The optimal choice of context_length is __(ii)__; if we choose a context_length any higher than that, our model will __(iii)__.

  1. What goes in blank (i)?
  2. What goes in blank (ii)?
  3. What goes in blank (iii)?

Answers:

  1. decreases
  2. 100
  3. underfit the training data and have high bias

In part 1, we can see that as context_length increases, the training error increases, so the model fits the training data worse and worse. In general, higher model complexity leads to better training performance (lower training error), so here, increasing context_length must be reducing model complexity.

In part 2, we will choose a context_length of 100, since this parameterization leads to the best validation performance. If we increase context_length further, the validation error increases.

In part 3, since increased context_length means less complexity and worse training performance, increasing context_length further would lead to underfitting: the model would lack the expressiveness (number of parameters) required to capture the patterns in the data. High training error is associated with high model bias, while high variance is associated with overfitting, so a further increase in context_length would mean a more biased model.



Problem 2

Every week, Pranavi goes to her local grocery store and buys a varying amount of vegetables, but always buys exactly one pound of meat (either beef, fish, or chicken). We use a linear regression model to predict her total grocery bill. We’ve collected a dataset containing the pounds of vegetables bought, the type of meat bought, and the total bill. Below we display the first few rows of the dataset and two plots generated using the entire training set.


Problem 2.1

Determine how each change below affects model bias and variance compared to the model H(x) described at the top of this page. For each change (i., ii., iii., iv.), choose all of the following that apply: increase bias, decrease bias, increase variance, decrease variance.

  1. Add degree 3 polynomial features.
  2. Add a feature of numbers chosen at random between 0 and 1.
  3. Collect 100 more points for the training set.
  4. Don’t use the 'veg' feature.

    1. Answer: Decrease bias, increase variance. Adding degree 3 polynomial features increases the complexity of our model. Increasing model complexity decreases model bias but increases model variance (because the fitted model will vary more from training set to training set than it would without the degree 3 features).
    2. Answer: Increase variance. We’re adding a new feature that contributes nothing of value to our model’s predictions, other than making them vary more from training set to training set (see the sketch below).
    3. Answer: Decrease variance. Think of our training set as being a sample drawn from some population. The larger our training sets are, the less our model will vary from training set to training set, hence reducing model variance.
    4. Answer: Increase bias, decrease variance. Removing the 'veg' feature reduces the complexity of our model, which increases model bias but decreases model variance.
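As a quick empirical check of item 2, the sketch below (entirely synthetic data, not the grocery dataset) refits a linear model on many freshly drawn training sets, with and without a junk random feature, and compares how much the prediction at one fixed input varies across training sets:

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)

def prediction_spread(add_junk_feature, n_trials=500, n=40):
    """Std. dev. of the prediction at x = 0.5 across freshly drawn training sets."""
    preds = []
    for _ in range(n_trials):
        x = rng.uniform(0, 5, size=n)
        y = 3 * x + rng.normal(scale=2, size=n)    # true relationship is linear in x
        X, x_test = x.reshape(-1, 1), np.array([[0.5]])
        if add_junk_feature:
            X = np.column_stack([X, rng.uniform(0, 1, size=n)])  # random junk feature
            x_test = np.array([[0.5, 0.5]])
        preds.append(LinearRegression().fit(X, y).predict(x_test)[0])
    return np.std(preds)

print(prediction_spread(add_junk_feature=False))  # smaller spread (lower variance)
print(prediction_spread(add_junk_feature=True))   # typically larger spread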


Problem 2.2

Suppose we predict 'total' from 'veg' using 8 models with polynomial features of different degrees (degrees 0 through 7). Which of the following plots display the training and validation errors of these models? Assume that we plot the degree of the polynomial features on the x-axis and mean squared error on the y-axis, and that the plots share y-axis limits.

Training Error:

Validation Error:

Answer: Training error: C; Validation Error: B

Training Error: As we increase the complexity of our model, it gains the ability to memorize the patterns in our training set to a greater degree, so its training performance keeps improving (its training MSE gets lower and lower).

Validation Error: As we increase the complexity of our model, there will be a point at which it becomes “too complex” and overfits to insignificant patterns in the training set. The second plot from the start of this problem tells us that the relationship between 'total' and 'veg' is roughly quadratic, meaning it’s best modeled using degree 2 polynomial features. Using a degree greater than 2 will lead to overfitting and, therefore, worse validation set performance. Plot B shows MSE decreasing until d = 2 and increasing afterwards, which matches this explanation.
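The sketch below reproduces these shapes on synthetic data with a roughly quadratic relationship (we don’t have the actual grocery dataset): training MSE only decreases as the degree grows, while validation MSE typically bottoms out near degree 2.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
veg = rng.uniform(0, 5, size=200)
total = 2 + 1.5 * veg ** 2 + rng.normal(scale=3, size=200)   # roughly quadratic

X_train, X_val, y_train, y_val = train_test_split(
    veg.reshape(-1, 1), total, random_state=0
)

for degree in range(8):   # degrees 0 through 7
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    train_mse = mean_squared_error(y_train, model.predict(X_train))
    val_mse = mean_squared_error(y_val, model.predict(X_val))
    print(f"degree {degree}: train MSE = {train_mse:.2f}, validation MSE = {val_mse:.2f}")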



Problem 3

Consider the least squares regression model, \vec{y} = X \vec{w}. Assume that X and \vec{y} refer to the design matrix and true response vector for our training data.

Let \vec{w}_\text{OLS}^* be the parameter vector that minimizes mean squared error without regularization. Specifically:

\vec{w}_\text{OLS}^* = \arg\underset{\vec{w}}{\min} \frac{1}{n} \| \vec{y} - X \vec{w} \|^2_2

Let \vec{w}_\text{ridge}^* be the parameter vector that minimizes mean squared error with L_2 regularization, using a non-negative regularization hyperparameter \lambda (i.e. ridge regression). Specifically:

\vec{w}_\text{ridge}^* = \arg\underset{\vec{w}}{\min} \frac{1}{n} \| \vec{y} - X \vec{w} \|^2_2 + \lambda \sum_{j=1}^{p} w_j^2

For each of the following problems, fill in the blank.


Problem 3.1

If we set \lambda = 0, then \Vert \vec{w}_\text{OLS}^* \Vert^2_2 is ________ \Vert \vec{w}_\text{ridge}^* \Vert^2_2

Answers:

equal to


Problem 3.2

For each of the remaining parts, you can assume that \lambda is set such that the predicted response vectors for our two models (\vec{y}^* = X \vec{w}_\text{OLS}^* and \vec{y}^* = X \vec{w}_\text{ridge}^*) are different.

The training MSE of the model \vec{y}^* = X \vec{w}_\text{OLS}^* is ________ that of the model \vec{y}^* = X \vec{w}_\text{ridge}^*.

Answers:

less than


Problem 3.3

Now, assume we’ve fit both models using our training data, and evaluate both models on some unseen testing data.

The test MSE of the model \vec{y}^* = X \vec{w}_\text{OLS}^* is ________ that of the model \vec{y}^* = X \vec{w}_\text{ridge}^*.

Answers:

impossible to tell


Problem 3.4

Assume that our design matrix X contains a column of all ones. The sum of the residuals of our model \vec{y}^* = X \vec{w}_\text{ridge}^* is ________.

Answers:

not necessarily equal to 0
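A minimal numerical check of Problems 3.2 and 3.4 on synthetic data. To match the objectives above, the column of ones is included in X and penalized along with every other coefficient, so we turn off sklearn’s separate intercept (sklearn’s Ridge uses the unscaled squared error rather than MSE, which doesn’t change the conclusions):

import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(1)
n = 50
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])   # includes a column of ones
y = X @ np.array([4.0, 2.0, -1.0]) + rng.normal(scale=0.5, size=n)

# fit_intercept=False so the ones column is an ordinary, penalized feature,
# matching the objectives written above.
ols = LinearRegression(fit_intercept=False).fit(X, y)
ridge = Ridge(alpha=10.0, fit_intercept=False).fit(X, y)

# Problem 3.2: OLS minimizes training MSE, so its training MSE is <= ridge's.
print(mean_squared_error(y, ols.predict(X)), mean_squared_error(y, ridge.predict(X)))

# Problem 3.4: OLS residuals sum to ~0 thanks to the ones column;
# ridge residuals generally do not.
print(np.sum(y - ols.predict(X)), np.sum(y - ridge.predict(X)))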


Problem 3.5

As we increase \lambda, the bias of the model \vec{y}^* = X \vec{w}_\text{ridge}^* tends to ________.

Answers:

increase


Problem 3.6

As we increase \lambda, the model variance of the model \vec{y}^* = X \vec{w}_\text{ridge}^* tends to ________.

Answers:

decrease


Problem 3.7

As we increase \lambda, the observation variance of the model \vec{y}^* = X \vec{w}_\text{ridge}^* tends to ________.

Answers:

stay the same



Problem 4

One piece of information that may be useful as a feature is the proportion of SAT test takers in a state in a given year that qualify for free lunches in school. The Series lunch_props contains 8 values, each of which are either "low", "medium", or "high". Since we can’t use strings as features in a model, we decide to encode these strings using the following Pipeline:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder, StandardScaler

# Note: The FunctionTransformer is only needed to change the result
# of the OneHotEncoder from a "sparse" matrix to a regular matrix
# so that it can be used with StandardScaler;
# it doesn't change anything mathematically.
pl = Pipeline([
    ("ohe", OneHotEncoder(drop="first")),
    ("ft", FunctionTransformer(lambda X: X.toarray())),
    ("ss", StandardScaler())
])

After calling pl.fit(lunch_props), pl.transform(lunch_props) evaluates to the following array:

array([[ 1.29099445, -0.37796447],
       [-0.77459667, -0.37796447],
       [-0.77459667, -0.37796447],
       [-0.77459667,  2.64575131],
       [ 1.29099445, -0.37796447],
       [ 1.29099445, -0.37796447],
       [-0.77459667, -0.37796447],
       [-0.77459667, -0.37796447]])

and pl.named_steps["ohe"].get_feature_names() evaluates to the following array:

array(["x0_low", "x0_med"], dtype=object)

Fill in the blanks: Given the above information, we can conclude that lunch_props has (a) value(s) equal to "low", (b) value(s) equal to "medium", and (c) value(s) equal to "high". (Note: You should write one positive integer in each box such that the numbers add up to 8.)

What goes in the blanks?

Answer: 3, 1, 4

The first column of the transformed array corresponds to the standardized one-hot-encoded low column. There are 3 values that are positive, which means there are 3 values that were originally 1 in that column pre-standardization. This means that 3 of the values in lunch_props were originally "low".

The second column of the transformed array corresponds to the standardized one-hot-encoded med column. There is only 1 value in the transformed column that is positive, which means only 1 of the values in lunch_props was originally "medium".

The Series lunch_props has 8 values, 3 of which were identified as "low" in subpart 1, and 1 of which was identified as "medium" in subpart 2. The number of values being "high" must therefore be 8 - 3 - 1 = 4.
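One way to sanity-check this reasoning is to rerun the Pipeline on a reconstructed input with 3 "low", 1 "medium", and 4 "high" values placed according to the positive entries in the transformed array (the exact Series below is our reconstruction, not given in the problem; newer versions of sklearn use get_feature_names_out and require 2-D input):

import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder, StandardScaler

# Rows 0, 4, 5 are "low" (positive in the first column), row 3 is "medium"
# (positive in the second column), and the remaining 4 rows are "high".
lunch_props = pd.Series(
    ["low", "high", "high", "medium", "low", "low", "high", "high"]
)

pl = Pipeline([
    ("ohe", OneHotEncoder(drop="first")),
    ("ft", FunctionTransformer(lambda X: X.toarray())),
    ("ss", StandardScaler()),
])

# OneHotEncoder expects 2-D input, so we wrap the Series in a one-column DataFrame.
print(pl.fit_transform(lunch_props.to_frame()))        # matches the array above
print(pl.named_steps["ohe"].get_feature_names_out())   # ['x0_low' 'x0_medium']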


Problem 5

Suppose we have one qualitative variable that we convert to numerical values using one-hot encoding. We’ve shown the first four rows of the resulting design matrix below:


Problem 5.1

Say we train a linear model m_1 on these data. Then, we replace all of the 1 values in column a with 3’s and all of the 1 values in column b with 2’s and train a new linear model m_2. Neither m_1 nor m_2 has an intercept term. On the training data, the average squared loss for m_1 will be ________ that of m_2.

Answers:

The answer is equal to.

Note that we can just re-scale our weights accordingly. Any model we can get with m_1 we can also get with m_2 (and vice versa).


Problem 5.2

To account for the intercept term, we add a column of all ones to our design matrix from part a. That is, the resulting design matrix has four columns: a with 3’s instead of 1’s, b with 2’s instead of 1’s, c, and a column of all ones. What is the rank of the new design matrix with these four columns?

Answers:

The answer is 3.

Note that c = \text{intercept column} - \frac{1}{3}a - \frac{1}{2}b, since exactly one of the three one-hot columns is "on" in each row. Hence, there is a linear dependence relationship among the four columns, meaning that one of them is redundant and the rank of the new design matrix is 3.
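A quick numerical confirmation (the particular sequence of categories below is arbitrary; any dataset in which all three categories appear gives the same rank):

import numpy as np

# One-hot-encode an arbitrary sequence of categories, with the 1s in column a
# replaced by 3s and the 1s in column b replaced by 2s, plus an intercept column.
categories = ["a", "b", "c", "a", "c", "b", "a", "c"]
a = np.array([3.0 if cat == "a" else 0.0 for cat in categories])
b = np.array([2.0 if cat == "b" else 0.0 for cat in categories])
c = np.array([1.0 if cat == "c" else 0.0 for cat in categories])
intercept = np.ones(len(categories))

X = np.column_stack([a, b, c, intercept])
print(np.linalg.matrix_rank(X))   # 3, since c = intercept - (1/3)a - (1/2)b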


Problem 5.3

Suppose we divide our sampling frame into three clusters of people, numbered 1, 2, and 3. After we survey people, along with our survey results, we save their cluster number as a new feature in our design matrix. Before training a model, what should we do with the cluster column? (Note: This part is independent of parts a and b.)

Answers:

The cluster number is a categorical (nominal) variable: the values 1, 2, and 3 don’t carry any numerical meaning, so the column should be one-hot encoded.



Problem 6

We will aim to build a classifier that takes in demographic information about a state from a particular year and predicts whether or not the state’s mean math score is higher than its mean verbal score that year.

In honor of the rotisserie chicken event on UCSD’s campus in March of 2023, sklearn released a new classifier class called ChickenClassifier.


Problem 6.1

ChickenClassifiers have many hyperparameters, one of which is height. As we increase the value of height, the model variance of the resulting ChickenClassifier also increases.

First, we consider the training and testing accuracy of a ChickenClassifier trained using various values of height. Consider the plot below.

Which of the following depicts training accuracy vs. height?

Which of the following depicts testing accuracy vs. height?

Answer: Option 2 depicts training accuracy vs. height; Option 3 depicts testing accuracy vs. height

We are told that as height increases, the model variance (complexity) also increases.

As we increase the complexity of our classifier, it will do a better job of fitting to the training set because it’s able to “memorize” the patterns in the training set. As such, as height increases, training accuracy increases, which we see in Option 2.

However, after a certain point, increasing height will make our classifier overfit too closely to our training set and not general enough to match the patterns in other similar datasets, meaning that after a certain point, increasing height will actually decrease our classifier’s accuracy on our testing set. The only option that shows accuracy increase and then decrease is Option 3.


ChickenClassifiers have another hyperparameter, color, for which there are four possible values: "yellow", "brown", "red", and "orange". To find the optimal value of color, we perform k-fold cross-validation with k=4. The results are given in the table below.


Problem 6.2

Which value of color has the best average validation accuracy?

Answer: "red"

Looking at the results of the k-fold cross-validation, "red" has the highest row mean, i.e., the highest average validation accuracy across all 4 folds, and is therefore the best choice.


Problem 6.3

True or False: It is possible for a hyperparameter value to have the best average validation accuracy across all folds, but not have the best validation accuracy in any one particular fold.

Answer: True

An example is shown below:

color      Fold 1    Fold 2    Fold 3    Fold 4    average
color 1    0.8       0         0         0         0.2
color 2    0         0.6       0         0         0.15
color 3    0         0         0.1       0         0.025
color 4    0         0         0         0.2       0.05
color 5    0.7       0.5       0.01      0.1       0.3275

In the example, color 5 has the highest average validation accuracy across all folds, but is not the best in any one fold.


Problem 6.4

Now, instead of finding the best height and best color individually, we decide to perform a grid search that uses k-fold cross-validation to find the combination of height and color with the best average validation accuracy.

For the purposes of this question, assume that our training set contains n rows, that we try h_1 possible values of height and h_2 possible values of color, and that we use k-fold cross-validation.

Consider the following three subparts:

Choose from the following options.

Answer: A: Option 3 (\frac{n}{k}), B: Option 6 (h_1h_2(k-1)), C: Option 8 (None of the above)


A. What is the size of each fold?

The training set of n rows is divided into k folds of equal size, so each fold has \frac{n}{k} rows.


B. How many times is row 5 in the training set used for training?

For each combination of hyperparameters, row 5 is used k - 1 times for training and 1 time for validation. There are h_1 \cdot h_2 combinations of hyperparameters, so row 5 is used for training h_1 \cdot h_2 \cdot (k-1) times.


C. How many times is row 5 in the training set used for validation?

Building off the explanation for the previous subpart, row 5 is used for validation once per combination of hyperparameters, so the correct expression would be h_1 \cdot h_2, which is not a provided option.
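A small simulation that counts how many times one particular row lands in the training set and the validation set across the whole grid search (the concrete values of n, k, h_1, and h_2 below are chosen just for illustration):

import numpy as np
from sklearn.model_selection import KFold

n, k = 100, 5      # n rows in the training set, k folds
h1, h2 = 3, 4      # number of candidate values of height and color

train_uses, val_uses = 0, 0
for _ in range(h1 * h2):                                    # one k-fold CV per combo
    for train_idx, val_idx in KFold(n_splits=k).split(np.arange(n)):
        if 5 in train_idx:
            train_uses += 1
        if 5 in val_idx:
            val_uses += 1

print(train_uses)   # h1 * h2 * (k - 1) = 48
print(val_uses)     # h1 * h2 = 12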



Problem 7

Suppose we build a binary classifier that uses a song’s "track_name" and "artist_names" to predict whether its genre is "Hip-Hop/Rap" (1) or not (0).

For our classifier, we decide to use a brand-new model built into sklearn called the BillyClassifier. A BillyClassifier instance has three hyperparameters that we’d like to tune. Below, we show a dictionary containing the values for each hyperparameter that we’d like to try:

hyp_grid = {
  "radius": [0.1, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 100], # 12 total
  "inflection": [-5, -4, -3, -2, -1, 0, 1, 2, 3, 4], # 10 total
  "color": ["red", "yellow", "green", "blue", "purple"] # 5 total
}

To find the best combination of hyperparameters for our BillyClassifier, we first conduct a train-test split, which we use to create a training set with 800 rows. We then use GridSearchCV to conduct k-fold cross-validation for each combination of hyperparameters in hyp_grid, with k=4.


Problem 7.1

When we call GridSearchCV, how many times is a BillyClassifier instance trained in total? Give your answer as an integer.

Answer: 2400

There are 12 \cdot 10 \cdot 5 = 600 combinations of hyperparameters. For each combination of hyperparameters, we will train a BillyClassifier with that combination of hyperparameters k = 4 times. So, the total number of BillyClassifier instances that will be trained is 600 \cdot 4 = 2400.
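The same count can be read off programmatically; a short sketch using sklearn’s ParameterGrid (BillyClassifier is fictional, so we only count combinations rather than actually fitting anything):

from sklearn.model_selection import ParameterGrid

hyp_grid = {
    "radius": [0.1, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 100],    # 12 total
    "inflection": [-5, -4, -3, -2, -1, 0, 1, 2, 3, 4],      # 10 total
    "color": ["red", "yellow", "green", "blue", "purple"],  # 5 total
}

k = 4
n_combos = len(ParameterGrid(hyp_grid))   # 12 * 10 * 5 = 600
print(n_combos * k)                       # 600 * 4 = 2400 fits during cross-validation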


Problem 7.2

In each of the 4 folds of the data, how large is the training set, and how large is the validation set? Give your answers as integers.

size of training set =

size of validation set =

Answer: 600, 200

Since we performed cross-validation with k=4, we divide the training set into four disjoint folds of equal size. \frac{800}{4} = 200, so each fold has 200 rows. In each iteration of cross-validation, one fold is used for validation and the other three are used for training, so the validation set size is 200 and the training set size is 200 \cdot 3 = 600.


Problem 7.3

Suppose that after fitting a GridSearchCV instance, its best_params_ attribute is

{"radius": 8, "inflection": 4, "color": "blue"}

Select all true statements below.

Answer: Option B

When performing cross-validation, we select the combination of hyperparameters with the highest average validation accuracy across all four folds of the data. That is, by definition, how best_params_ came to be. None of the other options are guaranteed to be true.


