
The problems in this worksheet are taken from past exams in similar
classes. Work on them **on paper**, since the exams you
take in this course will also be on paper.

We encourage you to
complete this worksheet in a live discussion section. Solutions will be
made available after all discussion sections have concluded. You don’t
need to submit your answers anywhere. **Note: We do not plan to
cover all problems here in the live discussion section**; the problems
we don’t cover can be used for extra practice.

Consider the following assignment statement.

`puffin = np.array([5, 9, 13, 17, 21])`

Provide arguments to call `np.arange` with so that the array `penguin` is identical to the array `puffin`.

`penguin = np.arange(____)`

**Answer**: We need to provide `np.arange` with three arguments: 5, anything in (21, 25], and 4. For instance, something like `penguin = np.arange(5, 25, 4)` would work.
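As a quick check of the stop-value range, the sketch below tries several stop values. The names `stops_tried`, `all_match`, `too_small`, and `too_large` are introduced here for illustration only.

```python
import numpy as np

puffin = np.array([5, 9, 13, 17, 21])

# The stop value is exclusive, so any stop in (21, 25] yields the same array.
stops_tried = [22, 23, 24, 25]
all_match = all(np.array_equal(np.arange(5, stop, 4), puffin) for stop in stops_tried)

# A stop of 21 drops the last element; a stop of 26 adds a sixth one.
too_small = np.arange(5, 21, 4)
too_large = np.arange(5, 26, 4)
print(all_match, too_small, too_large)
```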

Fill in the blanks so that the array `parrot` is also identical to the array `puffin`.

*Hint: Start by choosing* `y` *so that* `parrot` *has length 5.*

`parrot = __(x)__ * np.arange(0, __(y)__, 2) + __(z)__`

**Answer**:

- `x`: `2`
- `y`: anything in (8, 10]
- `z`: `5`
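One way to arrive at these values is sketched below, following the hint. The variable `base` is a name introduced here for the inner array.

```python
import numpy as np

puffin = np.array([5, 9, 13, 17, 21])

# Step 1: pick y so that the inner array has 5 elements: 0, 2, 4, 6, 8.
base = np.arange(0, 10, 2)

# Step 2: consecutive elements of puffin differ by 4, while those of base
# differ by 2, so the multiplier x is 4 / 2 = 2.
# Step 3: the first element is x * 0 + z, so z must equal puffin[0] = 5.
parrot = 2 * base + 5
print(parrot)
```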

Billina Records, a new record company focused on creating new TikTok audios, has its offices on the 23rd floor of a skyscraper with 75 floors (numbered 1 through 75). The owners of the building promised that 10 different random floors will be selected to be renovated.

Below, fill in the blanks to complete a simulation that will estimate the probability that Billina Records’ floor will be renovated.

```
total = 0
repetitions = 10000
for i in np.arange(repetitions):
    choices = np.random.choice(__(a)__, 10, __(b)__)
    if __(c)__:
        total = total + 1
prob_renovate = total / repetitions
```

What goes in blank (a)?

`np.arange(1, 75)`

`np.arange(10, 75)`

`np.arange(0, 76)`

`np.arange(1, 76)`

What goes in blank (b)?

`replace=True`

`replace=False`

What goes in blank (c)?

`choices == 23`

`choices is 23`

`np.count_nonzero(choices == 23) > 0`

`np.count_nonzero(choices) == 23`

`choices.str.contains(23)`

**Answer:** `np.arange(1, 76)`, `replace=False`, `np.count_nonzero(choices == 23) > 0`

Here, the idea is to randomly choose 10 **different**
floors repeatedly, and each time, check if floor 23 was selected.

Blank (a): The first argument to `np.random.choice` needs to be an array/list containing the options we want to choose from, i.e. an array/list containing the values 1, 2, 3, 4, …, 75, since those are the numbers of the floors. `np.arange(a, b)` returns an array of integers spaced out by 1 starting from `a` and ending at `b-1`. As such, the correct call to `np.arange` is `np.arange(1, 76)`.

Blank (b): Since we want to select 10 different floors, we need to specify `replace=False` (the default behavior is `replace=True`).

Blank (c): The `if` condition needs to check if 23 was one of the 10 numbers that were selected, i.e. if 23 is in `choices`. It needs to evaluate to a single Boolean value, i.e. `True` (if 23 was selected) or `False` (if 23 was not selected). Let’s go through each incorrect option to see why it’s wrong:

- Option 1, `choices == 23`, does not evaluate to a single Boolean value; rather, it evaluates to an array of length 10 containing `True`s and `False`s.
- Option 2, `choices is 23`, does not evaluate to what we want: it checks to see if the array `choices` is the same Python object as the number 23, which it is not (and will never be, since an array cannot be a single number).
- Option 4, `np.count_nonzero(choices) == 23`, does evaluate to a single Boolean, but it is not correct. `np.count_nonzero(choices)` will always evaluate to 10, since `choices` is made up of 10 integers randomly selected from 1, 2, 3, 4, …, 75, none of which are 0. As such, `np.count_nonzero(choices) == 23` is the same as `10 == 23`, which is always `False`, regardless of whether or not 23 is in `choices`.
- Option 5, `choices.str.contains(23)`, errors, since `choices` is not a Series (and `.str` can only follow a Series). If `choices` were a Series, this would still error, since the argument to `.str.contains` must be a string, not an int.

By process of elimination, Option 3, `np.count_nonzero(choices == 23) > 0`, must be the correct answer. Let’s look at it piece-by-piece:

- As we saw in Option 1, `choices == 23` is a Boolean array that contains `True` each time the selected floor was floor 23 and `False` otherwise. (Since we’re sampling without replacement, floor 23 can only be selected at most once, and so `choices == 23` can only contain the value `True` at most once.)
- `np.count_nonzero(choices == 23)` evaluates to the number of `True`s in `choices == 23`. If it is positive (i.e. 1), it means that floor 23 was selected. If it is 0, it means floor 23 was not selected.
- Thus, `np.count_nonzero(choices == 23) > 0` evaluates to `True` if (and only if) floor 23 was selected.
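Putting the three answers together, the completed simulation might look like the sketch below. The fixed seed and the comparison against the exact value 10 / 75 are additions made here for illustration; they are not part of the original problem.

```python
import numpy as np

np.random.seed(23)  # fixed seed (an arbitrary choice) so the run is repeatable

total = 0
repetitions = 10000
for i in np.arange(repetitions):
    # Choose 10 different floors from 1, 2, ..., 75.
    choices = np.random.choice(np.arange(1, 76), 10, replace=False)
    # Check whether floor 23 is among the chosen floors.
    if np.count_nonzero(choices == 23) > 0:
        total = total + 1
prob_renovate = total / repetitions

# Each floor is equally likely to be among the 10 chosen,
# so the exact probability is 10 / 75.
exact = 10 / 75
print(prob_renovate, exact)
```

The estimate should land close to 10 / 75 ≈ 0.133, though the exact value varies from run to run when no seed is set.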

In this problem, we’ll work with the DataFrame `dogs`, which contains one row for every registered pet dog in Zurich, Switzerland in 2017.

The first few rows of `dogs` are shown below, but `dogs` has many more rows than are shown.

- `"owner_id" (int)`: A unique ID for each owner. Note that, for example, there are two rows in the preview for `4215`, meaning that owner has at least 2 dogs. **Assume that if an `"owner_id"` appears in `dogs` multiple times, the corresponding `"owner_age"`, `"owner_sex"`, and `"district"` are always the same.**
- `"owner_age" (str)`: The age group of the owner; either `"11-20"`, `"21-30"`, …, or `"91-100"` (9 possibilities in total).
- `"owner_sex" (str)`: The birth sex of the owner; either `"m"` (male) or `"f"` (female).
- `"district" (int)`: The city district the owner lives in; a positive integer between `1` and `12` (inclusive).
- `"primary_breed" (str)`: The primary breed of the dog.
- `"secondary_breed" (str)`: The secondary breed of the dog. If this column is not null, the dog is a “mixed breed” dog; otherwise, the dog is a “purebred” dog.
- `"dog_sex" (str)`: The birth sex of the dog; either `"m"` (male) or `"f"` (female).
- `"birth_year" (int)`: The birth year of the dog.

Fill in the blank so that `most_common` evaluates to the most common district in `dogs`. Assume there are no ties.

`most_common = ____`

**Answer**:

`dogs["district"].value_counts().idxmax()`

Above, we presented one possible solution, but there are many:

`dogs["district"].value_counts().idxmax()`

`dogs["district"].value_counts().index[0]`

`dogs.groupby("district").size().sort_values(ascending=False).index[0]`

`dogs.groupby("district").count()["owner_id"].sort_values(ascending=False).index[0]`
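To see that these four expressions agree, here is a small sketch on a made-up stand-in for `dogs`. The rows below are invented for illustration and only include the columns the expressions need.

```python
import pandas as pd

# Hypothetical stand-in for `dogs`; only the columns needed here are included.
dogs = pd.DataFrame({
    "owner_id": [4215, 4215, 3321, 1017, 5590, 6042],
    "district": [2, 2, 9, 2, 9, 11],
})

candidates = [
    dogs["district"].value_counts().idxmax(),
    dogs["district"].value_counts().index[0],
    dogs.groupby("district").size().sort_values(ascending=False).index[0],
    dogs.groupby("district").count()["owner_id"].sort_values(ascending=False).index[0],
]
print(candidates)
```

In this toy data, district 2 appears three times, more than any other district, so each expression evaluates to 2.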

Fill in the blank so that `female_breeds` evaluates to a Series containing the primary breeds of all female dogs.

`female_breeds = dogs.____`

**Answer**:

`loc[dogs["dog_sex"] == "f", "primary_breed"]`

Another possible answer is:

`query("dog_sex == 'f'")["primary_breed"]`

Note that the question *didn’t* ask for unique primary
breeds.
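Both answers can be compared on a made-up stand-in for `dogs`, as in the sketch below. The rows and the names `via_loc` and `via_query` are invented for illustration.

```python
import pandas as pd

# Hypothetical stand-in for `dogs` with made-up rows.
dogs = pd.DataFrame({
    "primary_breed": ["Beagle", "Pug", "Beagle", "Collie"],
    "dog_sex": ["f", "m", "f", "f"],
})

via_loc = dogs.loc[dogs["dog_sex"] == "f", "primary_breed"]
via_query = dogs.query("dog_sex == 'f'")["primary_breed"]
print(via_loc.tolist())
```

Note that `"Beagle"` appears twice in the result, since duplicates are kept.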

For this question, we will work with the DataFrame `tv`, which contains information about various TV shows available to watch on streaming services. For each TV show, we have:

- `"Title" (object)`: The title of the TV show.
- `"Year" (int)`: The year in which the TV show was first released. (For instance, the show *How I Met Your Mother* ran from 2005 to 2014; there is only one row for *How I Met Your Mother* in `tv`, and its `"Year"` value is 2005.)
- `"Age" (object)`: The age category for the TV show. If not missing, `"Age"` is one of `"all"`, `"7+"`, `"13+"`, `"16+"`, or `"18+"`. (For instance, `"all"` means that the show is appropriate for all audiences, while `"18+"` means that the show contains mature content and viewers should be at least 18 years old.)
- `"IMDb" (float)`: The TV show’s rating on IMDb (between 0 and 10).
- `"Rotten Tomatoes" (int)`: The TV show’s rating on Rotten Tomatoes (between 0 and 100).
- `"Netflix" (int)`: 1 if the show is available for streaming on Netflix and 0 otherwise. The `"Hulu"`, `"Prime Video"`, and `"Disney+"` columns work the same way.

The first few rows of `tv` are shown below (though `tv` has many more rows than are pictured here).

Assume that we have already run all of the necessary imports.

**Throughout this problem, we will refer to `tv` repeatedly.**

In the following subparts, consider the variable `double_count`, defined below.

`double_count = tv["Title"].value_counts().value_counts()`

What is `type(double_count)`?

Series

SeriesGroupBy

DataFrame

DataFrameGroupBy

**Answer**: Series

The `.value_counts()` method, when called on a Series `s`, produces a new Series in which

- the index contains all unique values in `s`.
- the values are the frequencies of the unique values in `s`.

Since `tv["Title"]` is a Series, `tv["Title"].value_counts()` is a Series, and so is `tv["Title"].value_counts().value_counts()`. We provide an interpretation of each of these Series in the solution to the next subpart.

Which of the following statements are true? Select all that apply.

The only case in which it would make sense to set the index of `tv` to `"Title"` is if `double_count.iloc[0] == 1` is `True`.

The only case in which it would make sense to set the index of `tv` to `"Title"` is if `double_count.loc[1] == tv.shape[0]` is `True`.

If `double_count.loc[2] == 5` is `True`, there are 5 TV shows that all share the same `"Title"`.

If `double_count.loc[2] == 5` is `True`, there are 5 pairs of 2 TV shows such that each pair shares the same `"Title"`.

None of the above.

**Answers**:

- The only case in which it would make sense to set the index of `tv` to `"Title"` is if `double_count.loc[1] == tv.shape[0]` is `True`.
- If `double_count.loc[2] == 5` is `True`, there are 5 pairs of 2 TV shows such that each pair shares the same `"Title"`.

To answer, we need to understand what each of `tv["Title"]`, `tv["Title"].value_counts()`, and `tv["Title"].value_counts().value_counts()` contain. To illustrate, let’s start with a basic, unrelated example. Suppose `tv["Title"]` looks like:

```
0    A
1    B
2    C
3    B
4    D
5    E
6    A
dtype: object
```

Then, `tv["Title"].value_counts()` looks like:

```
A    2
B    2
C    1
D    1
E    1
dtype: int64
```

and `tv["Title"].value_counts().value_counts()` looks like:

```
1    3
2    2
dtype: int64
```

Back to our actual dataset. `tv["Title"]`, as we know, contains the name of each TV show. `tv["Title"].value_counts()` is a Series whose index is a sequence of the unique TV show titles in `tv["Title"]`, and whose values are the frequencies of each title. `tv["Title"].value_counts()` may look something like the following:

```
Breaking Bad                             1
Fresh Meat                               1
Doctor Thorne                            1
                                        ...
Styling Hollywood                        1
Vai Anitta                               1
Fearless Adventures with Jack Randall    1
Name: Title, Length: 5368, dtype: int64
```

Then, `tv["Title"].value_counts().value_counts()` is a Series whose index is a sequence of the unique values in the above Series, and whose values are the frequencies of each value above. In the case where all titles in `tv["Title"]` are unique, then `tv["Title"].value_counts()` will only have one unique value, 1, repeated many times. Then, `tv["Title"].value_counts().value_counts()` will only have one row total, and will look something like:

```
1 5368
Name: Title, dtype: int64
```

This allows us to distinguish between the first two answer choices.
The key is remembering that **in order to set a column to the
index, the column should only contain unique values**, since the
goal of the index is to provide a “name” (more formally, a label) for
each row.

- The first answer choice, “The only case in which it would make sense to set the index of `tv` to `"Title"` is if `double_count.iloc[0] == 1` is `True`”, is false. As we can see in the example above, all titles are unique, but `double_count.iloc[0]` is something other than 1.
- The second answer choice, “The only case in which it would make sense to set the index of `tv` to `"Title"` is if `double_count.loc[1] == tv.shape[0]` is `True`”, is true. If `double_count.loc[1] == tv.shape[0]`, it means that all values in `tv["Title"].value_counts()` were 1, meaning that `tv["Title"]` consisted solely of unique values, which is the only case in which it makes sense to set `"Title"` to the index.

Now, let’s look at the second two answer choices. If `double_count.loc[2] == 5`, it would mean that 5 of the values in `tv["Title"].value_counts()` were 2. This would mean that there were 5 pairs of titles in `tv["Title"]` that were the same.

- This makes the fourth answer choice, “If `double_count.loc[2] == 5` is `True`, there are 5 pairs of 2 TV shows such that each pair shares the same `"Title"`”, correct.
- The third answer choice, “If `double_count.loc[2] == 5` is `True`, there are 5 TV shows that all share the same `"Title"`”, is incorrect; if there were 5 TV shows with the same title, then `double_count.loc[5]` would be at least 1, but we can’t make any guarantees about `double_count.loc[2]`.
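The toy example from earlier in this solution can be reproduced with the sketch below. The Series `titles` stands in for `tv["Title"]`; the other variable names and the `1 in double_count.index` guard are additions made here for illustration.

```python
import pandas as pd

# Hypothetical titles: "A" and "B" each appear twice, the rest once.
titles = pd.Series(["A", "B", "C", "B", "D", "E", "A"])

double_count = titles.value_counts().value_counts()

# Number of titles appearing exactly once, and exactly twice.
appear_once = double_count.loc[1]
appear_twice = double_count.loc[2]

# "Title" is a sensible index only when every title appears exactly once.
# The `1 in double_count.index` guard is an addition here, so that the check
# does not raise a KeyError when no title is unique.
titles_unique = (1 in double_count.index) and (double_count.loc[1] == titles.shape[0])
print(appear_once, appear_twice, titles_unique)
```

Here `double_count.loc[2]` is 2, matching the 2 pairs of repeated titles (`"A"` and `"B"`), and the uniqueness check fails because only 3 of the 7 rows have a title that appears once.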

Ethan is an avid Star Wars fan, and the only streaming service he has an account on is Disney+. (He had a Netflix account, but then Netflix cracked down on password sharing.)

Fill in the blanks below so that `star_disney_prop` evaluates to the proportion of TV shows in `tv` with `"Star Wars"` in the title that are available to stream on Disney+.

```
star_only = __(a)__
star_disney_prop = __(b)__ / star_only.shape[0]
```

What goes in the blanks?

**Answers**:

- Blank (a): `tv[tv["Title"].str.contains("Star Wars")]`
- Blank (b): `star_only["Disney+"].sum()`

We’re asked to find the proportion of TV shows with `"Star Wars"` in the title that are available to stream on Disney+. This is a fraction, where:

- The numerator is the number of TV shows that have `"Star Wars"` in the title **and** are available to stream on Disney+.
- The denominator is the number of TV shows that have `"Star Wars"` in the title.

The key is recognizing that `star_only` must be a DataFrame that contains all the rows in which the `"Title"` contains `"Star Wars"`; to create this DataFrame in blank (a), we use `tv[tv["Title"].str.contains("Star Wars")]`. Then, the denominator is already provided for us, and all we need to fill in is the numerator. There are a few possibilities, though they all include `star_only`:

`star_only["Disney+"].sum()`

`(star_only["Disney+"] == 1).sum()`

`star_only[star_only["Disney+"] == 1].shape[0]`

**Common misconception**: Many students calculated the wrong proportion: they calculated the proportion of shows available to stream on Disney+ that have `"Star Wars"` in the title. We asked for the proportion of shows with `"Star Wars"` in the title that are available to stream on Disney+; “proportion of X that Y” is always $\frac{\# X \text{ and } Y}{\# X}$.
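The difference between the two proportions can be seen on a made-up stand-in for `tv`, as sketched below. The show titles and the names `disney_only` and `reversed_prop` are invented for illustration.

```python
import pandas as pd

# Hypothetical stand-in for `tv` with made-up rows.
tv = pd.DataFrame({
    "Title": ["Star Wars: Show A", "Star Wars: Show B", "Some Other Show",
              "Star Wars: Show C", "Another Show"],
    "Disney+": [1, 1, 1, 0, 1],
})

# Proportion of "Star Wars" shows that are on Disney+ (what was asked).
star_only = tv[tv["Title"].str.contains("Star Wars")]
star_disney_prop = star_only["Disney+"].sum() / star_only.shape[0]

# The reversed proportion from the common misconception, for contrast:
# proportion of Disney+ shows that have "Star Wars" in the title.
disney_only = tv[tv["Disney+"] == 1]
reversed_prop = disney_only["Title"].str.contains("Star Wars").sum() / disney_only.shape[0]
print(star_disney_prop, reversed_prop)
```

In this toy data, 2 of the 3 `"Star Wars"` shows are on Disney+, while 2 of the 4 Disney+ shows have `"Star Wars"` in the title, so the two proportions differ.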

Consider the function `tom_nook`, defined below. Recall that if `x` is an integer, `x % 2` is `0` if `x` is even and `1` if `x` is odd.

```
def tom_nook(crossing):
    bells = 0
    for nook in np.arange(crossing):
        if nook % 2 == 0:
            bells = bells + 1
        else:
            bells = bells - 2
    return bells
```

What value does `tom_nook(8)`

evaluate to?

-6

-4

-2

0

2

4

6

**Answer**: -4
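To see where -4 comes from: `np.arange(8)` contains the four even values 0, 2, 4, 6, each of which adds 1 to `bells`, and the four odd values 1, 3, 5, 7, each of which subtracts 2. The sketch below runs the function and compares it with that count; the names `result` and `shortcut` are introduced here for illustration.

```python
import numpy as np

def tom_nook(crossing):
    bells = 0
    for nook in np.arange(crossing):
        if nook % 2 == 0:
            bells = bells + 1   # even nook: gain 1 bell
        else:
            bells = bells - 2   # odd nook: lose 2 bells
    return bells

result = tom_nook(8)
# Same count without a loop: 4 even values and 4 odd values in 0, 1, ..., 7.
shortcut = 4 * 1 + 4 * (-2)
print(result, shortcut)
```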

Consider the following four assignment statements.

```
bass = "5"
tuna = 2
sword = ["4.0", 5, 12.5, -10, "2023"]
gold = [4, "6", "CSE", "doc"]
```

What is the value of the expression `bass * tuna`?

**Answer**: `"55"`

Which of the following expressions results in an error?

`int(sword[0])`

`float(sword[1])`

`int(sword[2])`

`int(sword[3])`

`float(sword[4])`

**Answer**: `int(sword[0])`
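Both of the answers above can be checked with the sketch below. Multiplying a string by an integer repeats the string, and `int` cannot parse a string like `"4.0"` that contains a decimal point. The names `repeated`, `attempts`, and `outcomes` are introduced here for illustration.

```python
bass = "5"
tuna = 2
sword = ["4.0", 5, 12.5, -10, "2023"]

# Multiplying a string by an integer repeats the string.
repeated = bass * tuna

# Try each conversion from the options and record whether it succeeds.
attempts = [(int, 0), (float, 1), (int, 2), (int, 3), (float, 4)]
outcomes = []
for convert, position in attempts:
    try:
        outcomes.append(convert(sword[position]))
    except ValueError:
        outcomes.append("ValueError")
print(repeated, outcomes)
```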

Which of the following expressions evaluates to `"DSC10"`?

`gold[3].replace("o", "s").title() + str(gold[0] + gold[1])`

`gold[3].replace("o", "s").upper() + str(gold[0] + int(gold[1]))`

`gold[3].replace("o", "s").upper() + str(gold[1] + int(gold[0]))`

`gold[3].replace("o", "s").title() + str(gold[0] + int(gold[1]))`

**Answer**:
`gold[3].replace("o", "s").upper() + str(gold[0] + int(gold[1]))`
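The sketch below evaluates the correct option and shows why the others fail: `.title()` capitalizes only the first letter, and adding an `int` and a `str` raises a `TypeError`. The names `prefix`, `correct`, `with_title`, and `mixed_add` are introduced here for illustration.

```python
gold = [4, "6", "CSE", "doc"]

prefix = gold[3].replace("o", "s")   # "dsc"
correct = prefix.upper() + str(gold[0] + int(gold[1]))

# .title() capitalizes only the first letter of the word.
with_title = prefix.title() + str(gold[0] + int(gold[1]))

# Adding an int and a str (as options 1 and 3 do) raises a TypeError.
try:
    gold[0] + gold[1]
    mixed_add = "no error"
except TypeError:
    mixed_add = "TypeError"
print(correct, with_title, mixed_add)
```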

We’d like to select three students at random from the entire class to win extra credit (not really). When doing so, we want to guarantee that the same student cannot be selected twice, since it wouldn’t really be fair to give a student double extra credit.

Fill in the blanks below so that `prob_all_unique` is an estimate of the probability that all three students selected are in different majors.

*Hint: The function `np.unique`, when called on an array, returns an array with just one copy of each unique element in the input. For example, if `vals` contains the values 1, 2, 2, 3, 3, 4, `np.unique(vals)` contains the values 1, 2, 3, 4.*

```
unique_majors = np.array([])
for i in np.arange(10000):
    group = np.random.choice(survey.get("Major"), 3, __(a)__)
    __(b)__ = np.append(unique_majors, len(__(c)__))

prob_all_unique = __(d)__
```

What goes in blank (a)?

`replace=True`

`replace=False`

**Answer**: `replace=False`

Since we want to guarantee that the same student cannot be selected twice, we should sample without replacement.

What goes in blank (b)?

**Answer**: `unique_majors`

`unique_majors` is the array we initialized before running our `for`-loop to keep track of our results. We’re already given that the first argument to `np.append` is `unique_majors`, meaning that in each iteration of the `for`-loop we’re creating a new array by adding a new element to the end of `unique_majors`; to save this new array, we need to re-assign it to `unique_majors`.

What goes in blank (c)?

**Answer**: `np.unique(group)`

In each iteration of our `for`-loop, we’re interested in finding the number of unique majors among the 3 students who were selected. We can tell that this is what we’re meant to store in `unique_majors` by looking at the options in the next subpart, which involve checking the proportion of the values in `unique_majors` that are 3.

The majors of the 3 randomly selected students are stored in `group`, and `np.unique(group)` is an array with the unique values in `group`. Then, `len(np.unique(group))` is the number of unique majors in the group of 3 students selected.

What could go in blank (d)? Select all that apply. At least one option is correct; blank answers will receive no credit.

`(unique_majors > 2).mean()`

`(unique_majors.sum() > 2).mean()`

`np.count_nonzero(unique_majors > 2).sum() / len(unique_majors > 2)`

`1 - np.count_nonzero(unique_majors != 3).mean()`

`unique_majors.mean() - 3 == 0`

**Answer**: Option 1 only

Let’s break down the code we have so far:

- An empty array named `unique_majors` is initialized to store the number of unique majors in each iteration of the simulation.
- The simulation runs 10,000 times, and in every iteration, three majors are selected at random from the survey dataset without replacement. This ensures that the same student is not chosen more than once within a single iteration. The `np.unique` function is used to find the unique majors among the selected three, and the number of unique majors is then appended to the `unique_majors` array.
- Following the simulation, the objective is to determine the fraction of iterations in which all three selected majors were unique. Since the maximum number of unique majors that can be selected in a group of three is 3, the code checks the fraction of values in the `unique_majors` array that are greater than 2.

Let’s look at each option more carefully.

- **Option 1**: `(unique_majors > 2).mean()` will create a Boolean array where each value in `unique_majors` is checked if it’s greater than 2. In other words, it’ll contain `True` for each 3 and `False` otherwise. Taking the mean of this Boolean array will give the proportion of `True` values, which corresponds to the probability that all 3 students selected are in different majors.
- **Option 2**: `(unique_majors.sum() > 2)` will generate a single Boolean value (either `True` or `False`), since you’re summing up all values in the `unique_majors` array and then checking if the sum is greater than 2. This is not what you want. Calling `.mean()` on that single value does not give a proportion: for a NumPy Boolean it simply returns 1.0 or 0.0 (and a plain Python `bool` has no `.mean()` method at all).
- **Option 3**: `np.count_nonzero(unique_majors > 2).sum() / len(unique_majors > 2)` would work without the `.sum()`. `unique_majors > 2` results in a Boolean array where each value is `True` if the respective simulation yielded 3 unique majors and `False` otherwise. `np.count_nonzero()` counts the number of `True` values in the array, which corresponds to the number of simulations where all 3 students had unique majors. This returns a single integer value representing the count. The `.sum()` method is meant for collections (like arrays or Series) to sum their elements. Since `np.count_nonzero` returns a single integer, calling `.sum()` on it will result in an `AttributeError`, because individual integers do not have a `.sum()` method. `len(unique_majors > 2)` calculates the length of the Boolean array, which is equal to 10,000 (the total number of simulations). Because of the attempt to call `.sum()` on an integer value, the code will raise an error and won’t produce the desired result.
- **Option 4**: `np.count_nonzero(unique_majors != 3)` counts the number of trials where not all 3 students had different majors. When you call `.mean()` on an integer value, which is what `np.count_nonzero` returns, it’s going to raise an error.
- **Option 5**: `unique_majors.mean() - 3 == 0` is trying to check if the mean of `unique_majors` is 3. This line of code will return `True` or `False`, and this isn’t the right approach for calculating the estimated probability.
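The completed simulation, using Option 1 for blank (d), might look like the sketch below. Since the `survey` DataFrame isn’t available here, a made-up array `majors` (30 students, 10 in each of 3 majors) stands in for `survey.get("Major")`; the fixed seed and the `exact` comparison are also additions for illustration.

```python
import numpy as np

np.random.seed(10)  # fixed seed (an arbitrary choice) so the run is repeatable

# Hypothetical stand-in for survey.get("Major"):
# 30 students, 10 in each of 3 majors.
majors = np.array(["Data Science"] * 10 + ["Math"] * 10 + ["Biology"] * 10)

unique_majors = np.array([])
for i in np.arange(10000):
    # Sample 3 different students and record how many distinct majors they have.
    group = np.random.choice(majors, 3, replace=False)
    unique_majors = np.append(unique_majors, len(np.unique(group)))

# Proportion of repetitions in which all 3 majors were different.
prob_all_unique = (unique_majors > 2).mean()

# Exact value for this made-up class: (30 * 20 * 10) / (30 * 29 * 28).
exact = (30 * 20 * 10) / (30 * 29 * 28)
print(prob_all_unique, exact)
```

For this made-up class, the estimate should be close to the exact value of about 0.246.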