The problems in this worksheet are taken from past exams in similar
classes. Work on them on paper, since the exams you
take in this course will also be on paper.
We encourage you to
complete this worksheet in a live discussion section. Solutions will be
made available after all discussion sections have concluded. You don’t
need to submit your answers anywhere.
Note: We do not plan to
cover all problems here in the live discussion section; the problems
we don’t cover can be used for extra practice.
Consider the following assignment statement.
puffin = np.array([5, 9, 13, 17, 21])
Provide arguments to call np.arange with so that the array penguin is identical to the array puffin.
penguin = np.arange(____)
Answer: We need to provide np.arange with three arguments: 5, anything in (21, 25], and 4. For instance, something like penguin = np.arange(5, 25, 4) would work.
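As a quick check of this answer, here is a minimal sketch. The names also_works and too_short are illustrative only and are not part of the problem. It shows why the stop value can be anything in (21, 25]: np.arange excludes its stop value.

```python
import numpy as np

puffin = np.array([5, 9, 13, 17, 21])

# The stop value is exclusive, so 25 is the largest stop that still ends at 21.
penguin = np.arange(5, 25, 4)
print(penguin)

# Illustrative names (not from the worksheet): a stop just above 21 also works,
# while a stop of exactly 21 drops the last element.
also_works = np.arange(5, 22, 4)
too_short = np.arange(5, 21, 4)
print(np.array_equal(penguin, puffin), np.array_equal(also_works, puffin), len(too_short))
```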
Fill in the blanks so that the array parrot is also identical to the array puffin.
Hint: Start by choosing y so that parrot has length 5.
parrot = __(x)__ * np.arange(0, __(y)__, 2) + __(z)__
Answer:
x: 2
y: anything in (8, 10]
z: 5
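The reasoning behind these values can be sketched as follows. The intermediate name base is an assumption made for illustration. First pick y so the array has five elements, then scale the spacing and shift the start.

```python
import numpy as np

puffin = np.array([5, 9, 13, 17, 21])

# y = 10 gives the five values 0, 2, 4, 6, 8 (y = 9 would work too).
# "base" is an illustrative name, not part of the worksheet.
base = np.arange(0, 10, 2)

# Multiplying by x = 2 stretches the spacing from 2 to 4;
# adding z = 5 shifts the first element from 0 to 5.
parrot = 2 * base + 5
print(parrot)
```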
Billina Records, a new record company focused on creating new TikTok audios, has its offices on the 23rd floor of a skyscraper with 75 floors (numbered 1 through 75). The owners of the building promised that 10 different random floors will be selected to be renovated.
Below, fill in the blanks to complete a simulation that will estimate the probability that Billina Records’ floor will be renovated.
total = 0
repetitions = 10000
for i in np.arange(repetitions):
    choices = np.random.choice(__(a)__, 10, __(b)__)
    if __(c)__:
        total = total + 1
prob_renovate = total / repetitions
What goes in blank (a)?
np.arange(1, 75)
np.arange(10, 75)
np.arange(0, 76)
np.arange(1, 76)
What goes in blank (b)?
replace=True
replace=False
What goes in blank (c)?
choices == 23
choices is 23
np.count_nonzero(choices == 23) > 0
np.count_nonzero(choices) == 23
choices.str.contains(23)
Answer: np.arange(1, 76), replace=False, np.count_nonzero(choices == 23) > 0
Here, the idea is to randomly choose 10 different floors repeatedly, and each time, check if floor 23 was selected.
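With the blanks filled in, the whole simulation can be run as a sketch. The seed and the variable exact are additions for illustration and are not part of the original problem. Since every floor is equally likely to be among the 10 chosen, the estimate should land near 10/75.

```python
import numpy as np

np.random.seed(23)  # arbitrary seed, added only so the run is repeatable

total = 0
repetitions = 10000
for i in np.arange(repetitions):
    # Choose 10 different floors from 1, 2, ..., 75.
    choices = np.random.choice(np.arange(1, 76), 10, replace=False)
    # Count this repetition if floor 23 is among the chosen floors.
    if np.count_nonzero(choices == 23) > 0:
        total = total + 1
prob_renovate = total / repetitions

# Illustrative comparison (not in the worksheet): the exact probability is 10/75.
exact = 10 / 75
print(prob_renovate, exact)
```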
Blank (a): The first argument to np.random.choice needs to be an array/list containing the options we want to choose from, i.e. an array/list containing the values 1, 2, 3, 4, …, 75, since those are the numbers of the floors. np.arange(a, b) returns an array of integers spaced out by 1 starting from a and ending at b-1. As such, the correct call to np.arange is np.arange(1, 76).
Blank (b): Since we want to select 10 different floors, we need to specify replace=False (the default behavior is replace=True).
Blank (c): The if condition needs to check if 23 was one of the 10 numbers that were selected, i.e. if 23 is in choices. It needs to evaluate to a single Boolean value, i.e. True (if 23 was selected) or False (if 23 was not selected). Let’s go through each incorrect option to see why it’s wrong:
choices == 23 does not evaluate to a single Boolean value; rather, it evaluates to an array of length 10 whose elements are each True or False.
choices is 23 does not evaluate to what we want: it checks whether the array choices is the same Python object as the number 23, which it is not (and never will be, since an array cannot be a single number).
np.count_nonzero(choices) == 23 does evaluate to a single Boolean, but it is not correct. np.count_nonzero(choices) will always evaluate to 10, since choices is made up of 10 integers randomly selected from 1, 2, 3, 4, …, 75, none of which are 0. As such, np.count_nonzero(choices) == 23 is the same as 10 == 23, which is always False, regardless of whether or not 23 is in choices.
choices.str.contains(23) errors, since choices is not a Series (and .str can only follow a Series). If choices were a Series, this would still error, since the argument to .str.contains must be a string, not an int.
By process of elimination, Option 3, np.count_nonzero(choices == 23) > 0, must be the correct answer. Let’s look at it piece by piece:
choices == 23 is a Boolean array that contains True each time the selected floor was floor 23 and False otherwise. (Since we’re sampling without replacement, floor 23 can be selected at most once, so choices == 23 can contain the value True at most once.)
np.count_nonzero(choices == 23) evaluates to the number of True values in choices == 23. If it is positive (i.e. 1), floor 23 was selected. If it is 0, floor 23 was not selected.
np.count_nonzero(choices == 23) > 0 evaluates to True if (and only if) floor 23 was selected.
In this problem, we’ll work with the DataFrame dogs, which contains one row for every registered pet dog in Zurich, Switzerland in 2017.
The first few rows of dogs are shown below, but dogs has many more rows than are shown.
"owner_id" (int): A unique ID for each owner. Note that, for example, there are two rows in the preview for 4215, meaning that owner has at least 2 dogs. Assume that if an "owner_id" appears in dogs multiple times, the corresponding "owner_age", "owner_sex", and "district" are always the same.
"owner_age" (str): The age group of the owner; either "11-20", "21-30", …, or "91-100" (9 possibilities in total).
"owner_sex" (str): The birth sex of the owner; either "m" (male) or "f" (female).
"district" (int): The city district the owner lives in; a positive integer between 1 and 12 (inclusive).
"primary_breed" (str): The primary breed of the dog.
"secondary_breed" (str): The secondary breed of the dog. If this column is not null, the dog is a “mixed breed” dog; otherwise, the dog is a “purebred” dog.
"dog_sex" (str): The birth sex of the dog; either "m" (male) or "f" (female).
"birth_year" (int): The birth year of the dog.
Fill in the blank so that most_common evaluates to the most common district in dogs. Assume there are no ties.
most_common = ____
Answer:
dogs["district"].value_counts().idxmax()
Above, we presented one possible solution, but there are many:
dogs["district"].value_counts().idxmax()
dogs["district"].value_counts().index[0]
dogs.groupby("district").size().sort_values(ascending=False).index[0]
dogs.groupby("district").count()["owner_id"].sort_values(ascending=False).index[0]
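To see that these four expressions agree, here is a minimal sketch on a small made-up DataFrame. The rows below are hypothetical and stand in for the real dogs data, which is not reproduced here. The names a, b, c, and d are illustrative.

```python
import pandas as pd

# Hypothetical stand-in for dogs, with only the columns these solutions use.
dogs = pd.DataFrame({
    "owner_id": [4215, 4215, 1, 2, 3, 4],
    "district": [9, 9, 9, 2, 2, 11],
})

# District 9 appears three times, so every approach should return 9.
a = dogs["district"].value_counts().idxmax()
b = dogs["district"].value_counts().index[0]
c = dogs.groupby("district").size().sort_values(ascending=False).index[0]
d = dogs.groupby("district").count()["owner_id"].sort_values(ascending=False).index[0]
print(a, b, c, d)
```

The second approach relies on value_counts sorting by frequency in descending order, which is its default behavior.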
Fill in the blank so that female_breeds evaluates to a Series containing the primary breeds of all female dogs.
female_breeds = dogs.____
Answer:
loc[dogs["dog_sex"] == "f", "primary_breed"]
Another possible answer is:
query("dog_sex == 'f'")["primary_breed"]
Note that the question didn’t ask for unique primary breeds.
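A short sketch comparing the two answers, using a hypothetical four-row dogs DataFrame (the breeds below are made up for illustration). Both expressions return the same Series, and repeated breeds are kept.

```python
import pandas as pd

# Hypothetical stand-in for dogs; the real data is not reproduced here.
dogs = pd.DataFrame({
    "primary_breed": ["Pug", "Beagle", "Pug", "Poodle"],
    "dog_sex": ["f", "m", "f", "f"],
})

# Boolean mask for the rows, column label for the column.
via_loc = dogs.loc[dogs["dog_sex"] == "f", "primary_breed"]

# Filter the rows with a query string, then select the column.
via_query = dogs.query("dog_sex == 'f'")["primary_breed"]

print(via_loc.tolist())  # "Pug" appears twice: duplicates are not removed
```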
For this question, we will work with the DataFrame tv, which contains information about various TV shows available to watch on streaming services. For each TV show, we have:
"Title" (object): The title of the TV show.
"Year" (int): The year in which the TV show was first released. (For instance, the show How I Met Your Mother ran from 2005 to 2014; there is only one row for How I Met Your Mother in tv, and its "Year" value is 2005.)
"Age" (object): The age category for the TV show. If not missing, "Age" is one of "all", "7+", "13+", "16+", or "18+". (For instance, "all" means that the show is appropriate for all audiences, while "18+" means that the show contains mature content and viewers should be at least 18 years old.)
"IMDb" (float): The TV show’s rating on IMDb (between 0 and 10).
"Rotten Tomatoes" (int): The TV show’s rating on Rotten Tomatoes (between 0 and 100).
"Netflix" (int): 1 if the show is available for streaming on Netflix and 0 otherwise. The "Hulu", "Prime Video", and "Disney+" columns work the same way.
The first few rows of tv are shown below (though tv has many more rows than are pictured here).
Assume that we have already run all of the necessary imports.
Throughout this problem, we will refer to tv repeatedly.
In the following subparts, consider the variable double_count, defined below.
double_count = tv["Title"].value_counts().value_counts()
What is type(double_count)?
Series
SeriesGroupBy
DataFrame
DataFrameGroupBy
Answer: Series
The .value_counts() method, when called on a Series s, produces a new Series in which
the index contains the unique values in s, and
the values are the frequencies of each unique value in s.
Since tv["Title"] is a Series, tv["Title"].value_counts() is a Series, and so is tv["Title"].value_counts().value_counts(). We provide an interpretation of each of these Series in the solution to the next subpart.
Which of the following statements are true? Select all that apply.
The only case in which it would make sense to set the index of tv to "Title" is if double_count.iloc[0] == 1 is True.
The only case in which it would make sense to set the index of tv to "Title" is if double_count.loc[1] == tv.shape[0] is True.
If double_count.loc[2] == 5 is True, there are 5 TV shows that all share the same "Title".
If double_count.loc[2] == 5 is True, there are 5 pairs of 2 TV shows such that each pair shares the same "Title".
None of the above.
Answers:
The only case in which it would make sense to set the index of tv to "Title" is if double_count.loc[1] == tv.shape[0] is True.
If double_count.loc[2] == 5 is True, there are 5 pairs of 2 TV shows such that each pair shares the same "Title".
To answer, we need to understand what each of tv["Title"], tv["Title"].value_counts(), and tv["Title"].value_counts().value_counts() contains. To illustrate, let’s start with a basic, unrelated example. Suppose tv["Title"] looks like:
0    A
1    B
2    C
3    B
4    D
5    E
6    A
dtype: object
Then, tv["Title"].value_counts() looks like:
A    2
B    2
C    1
D    1
E    1
dtype: int64
and tv["Title"].value_counts().value_counts() looks like:
1    3
2    2
dtype: int64
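The toy example can be reproduced with a short sketch. The variable names titles, counts, and all_unique are chosen here for illustration. The last check mirrors the idea that "Title" only makes a sensible index when every title appears exactly once.

```python
import pandas as pd

# The same seven values as in the toy example.
titles = pd.Series(["A", "B", "C", "B", "D", "E", "A"])

counts = titles.value_counts()        # how many times each title appears
double_count = counts.value_counts()  # how many titles appear once, twice, ...

# Three titles appear once (C, D, E) and two titles appear twice (A, B).
print(double_count.loc[1], double_count.loc[2])

# Illustrative check (name is an assumption): all titles are unique only if
# the number of titles appearing once equals the total number of rows.
all_unique = (1 in double_count.index) and (double_count.loc[1] == len(titles))
print(all_unique)
```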
Back to our actual dataset. tv["Title"], as we know, contains the name of each TV show. tv["Title"].value_counts() is a Series whose index is a sequence of the unique TV show titles in tv["Title"], and whose values are the frequencies of each title. tv["Title"].value_counts() may look something like the following:
Breaking Bad                             1
Fresh Meat                               1
Doctor Thorne                            1
                                        ..
Styling Hollywood                        1
Vai Anitta                               1
Fearless Adventures with Jack Randall    1
Name: Title, Length: 5368, dtype: int64
Then, tv["Title"].value_counts().value_counts() is a Series whose index is a sequence of the unique values in the above Series, and whose values are the frequencies of each value above. In the case where all titles in tv["Title"] are unique, tv["Title"].value_counts() will only have one unique value, 1, repeated many times. Then, tv["Title"].value_counts().value_counts() will only have one row total, and will look something like:
1    5368
Name: Title, dtype: int64
This allows us to distinguish between the first two answer choices. The key is remembering that in order to set a column to the index, the column should only contain unique values, since the goal of the index is to provide a “name” (more formally, a label) for each row.
Option 1, “The only case in which it would make sense to set the index of tv to "Title" is if double_count.iloc[0] == 1 is True”, is false. As we can see in the example above, all titles are unique, but double_count.iloc[0] is 5368, not 1.
Option 2, “The only case in which it would make sense to set the index of tv to "Title" is if double_count.loc[1] == tv.shape[0] is True”, is true. If double_count.loc[1] == tv.shape[0], it means that all values in tv["Title"].value_counts() were 1, meaning that tv["Title"] consisted solely of unique values, which is the only case in which it makes sense to set "Title" to the index.
Now, let’s look at the second two answer choices. If double_count.loc[2] == 5, it would mean that 5 of the values in tv["Title"].value_counts() were 2. This would mean that there were 5 pairs of titles in tv["Title"] that were the same.
Option 4, “If double_count.loc[2] == 5 is True, there are 5 pairs of 2 TV shows such that each pair shares the same "Title"”, is correct.
Option 3, “If double_count.loc[2] == 5 is True, there are 5 TV shows that all share the same "Title"”, is incorrect; if there were 5 TV shows with the same title, then double_count.loc[5] would be at least 1, but we can’t make any guarantees about double_count.loc[2].
Ethan is an avid Star Wars fan, and the only streaming service he has an account on is Disney+. (He had a Netflix account, but then Netflix cracked down on password sharing.)
Fill in the blanks below so that star_disney_prop evaluates to the proportion of TV shows in tv with "Star Wars" in the title that are available to stream on Disney+.
star_only = __(a)__
star_disney_prop = __(b)__ / star_only.shape[0]
What goes in the blanks?
Answers:
(a): tv[tv["Title"].str.contains("Star Wars")]
(b): star_only["Disney+"].sum()
We’re asked to find the proportion of TV shows with "Star Wars" in the title that are available to stream on Disney+. This is a fraction, where:
the numerator is the number of TV shows that have "Star Wars" in the title and are available to stream on Disney+, and
the denominator is the number of TV shows that have "Star Wars" in the title.
The key is recognizing that star_only must be a DataFrame that contains all the rows in which the "Title" contains "Star Wars"; to create this DataFrame in blank (a), we use tv[tv["Title"].str.contains("Star Wars")]. Then, the denominator is already provided for us, and all we need to fill in is the numerator. There are a few possibilities, though they all include star_only:
star_only["Disney+"].sum()
(star_only["Disney+"] == 1).sum()
star_only[star_only["Disney+"] == 1].shape[0]
Common misconception: Many students calculated the wrong proportion: they calculated the proportion of shows available to stream on Disney+ that have "Star Wars" in the title. We asked for the proportion of shows with "Star Wars" in the title that are available to stream on Disney+; “proportion of X that Y” is always $\frac{\#(X \text{ and } Y)}{\# X}$.
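The difference between the requested proportion and the reversed one can be seen in a small sketch. The tv rows below are hypothetical (made-up titles, only two columns), and disney_only and reversed_prop are names introduced here for illustration.

```python
import pandas as pd

# Hypothetical stand-in for tv; titles and availability are made up.
tv = pd.DataFrame({
    "Title": ["Star Wars: Show A", "Star Wars: Show B", "Star Wars: Show C",
              "Other Show", "Another Show"],
    "Disney+": [1, 1, 0, 1, 1],
})

# Requested: among "Star Wars" shows, the fraction available on Disney+.
star_only = tv[tv["Title"].str.contains("Star Wars")]
star_disney_prop = star_only["Disney+"].sum() / star_only.shape[0]

# Reversed (the misconception): among Disney+ shows, the fraction with "Star Wars".
disney_only = tv[tv["Disney+"] == 1]
reversed_prop = disney_only["Title"].str.contains("Star Wars").sum() / disney_only.shape[0]

print(star_disney_prop, reversed_prop)  # 2 of 3 versus 2 of 4
```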
Consider the function tom_nook, defined below. Recall that if x is an integer, x % 2 is 0 if x is even and 1 if x is odd.
def tom_nook(crossing):
    bells = 0
    for nook in np.arange(crossing):
        if nook % 2 == 0:
            bells = bells + 1
        else:
            bells = bells - 2
    return bells
What value does tom_nook(8) evaluate to?
-6
-4
-2
0
2
4
6
Answer: -4
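The answer can be traced by running the function. In the sketch below, the variable result is an illustrative name. The loop visits 0 through 7, so the four even values each add 1 and the four odd values each subtract 2.

```python
import numpy as np

def tom_nook(crossing):
    bells = 0
    for nook in np.arange(crossing):
        if nook % 2 == 0:
            bells = bells + 1  # even values of nook: 0, 2, 4, 6
        else:
            bells = bells - 2  # odd values of nook: 1, 3, 5, 7
    return bells

# Four evens contribute +4 and four odds contribute -8, for a total of -4.
result = tom_nook(8)
print(result)
```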
Consider the following four assignment statements.
bass = "5"
tuna = 2
sword = ["4.0", 5, 12.5, -10, "2023"]
gold = [4, "6", "CSE", "doc"]
What is the value of the expression bass * tuna?
Answer: "55"
Which of the following expressions results in an error?
int(sword[0])
float(sword[1])
int(sword[2])
int(sword[3])
float(sword[4])
Answer: int(sword[0])
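Both answers can be checked with a short sketch. The helper converts is hypothetical and exists only to record whether each conversion succeeds. The key point is that int cannot parse a string containing a decimal point, while float can parse a string of digits.

```python
bass = "5"
tuna = 2
sword = ["4.0", 5, 12.5, -10, "2023"]

# Multiplying a string by an integer repeats the string.
repeated = bass * tuna

# Hypothetical helper (not from the worksheet): True if func(value) succeeds.
def converts(func, value):
    try:
        func(value)
        return True
    except (ValueError, TypeError):
        return False

results = [
    converts(int, sword[0]),    # int("4.0") fails: the string is not an integer literal
    converts(float, sword[1]),  # float(5) works
    converts(int, sword[2]),    # int(12.5) works (truncates to 12)
    converts(int, sword[3]),    # int(-10) works
    converts(float, sword[4]),  # float("2023") works
]
print(repeated, results)
```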
Which of the following expressions evaluates to "DSC10"?
gold[3].replace("o", "s").title() + str(gold[0] + gold[1])
gold[3].replace("o", "s").upper() + str(gold[0] + int(gold[1]))
gold[3].replace("o", "s").upper() + str(gold[1] + int(gold[0]))
gold[3].replace("o", "s").title() + str(gold[0] + int(gold[1]))
Answer:
gold[3].replace("o", "s").upper() + str(gold[0] + int(gold[1]))
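A brief sketch of why this option works and the others do not. The names option_2, option_4, and mixed_add_fails are illustrative. The first and third options add an int and a str, which Python does not allow, and the fourth uses .title(), which capitalizes only the first letter.

```python
gold = [4, "6", "CSE", "doc"]

# "doc" -> "dsc" -> "DSC", then 4 + int("6") = 10 -> "10".
option_2 = gold[3].replace("o", "s").upper() + str(gold[0] + int(gold[1]))

# .title() gives "Dsc" rather than "DSC".
option_4 = gold[3].replace("o", "s").title() + str(gold[0] + int(gold[1]))

# Options 1 and 3 both try to add an int and a str, which raises a TypeError.
try:
    gold[0] + gold[1]
    mixed_add_fails = False
except TypeError:
    mixed_add_fails = True

print(option_2, option_4, mixed_add_fails)
```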
We’d like to select three students at random from the entire class to win extra credit (not really). When doing so, we want to guarantee that the same student cannot be selected twice, since it wouldn’t really be fair to give a student double extra credit.
Fill in the blanks below so that prob_all_unique is an estimate of the probability that all three students selected are in different majors.
Hint: The function np.unique, when called on an array, returns an array with just one copy of each unique element in the input. For example, if vals contains the values 1, 2, 2, 3, 3, 4, np.unique(vals) contains the values 1, 2, 3, 4.
unique_majors = np.array([])
for i in np.arange(10000):
    group = np.random.choice(survey.get("Major"), 3, __(a)__)
    __(b)__ = np.append(unique_majors, len(__(c)__))

prob_all_unique = __(d)__
What goes in blank (a)?
replace=True
replace=False
Answer: replace=False
Since we want to guarantee that the same student cannot be selected twice, we should sample without replacement.
What goes in blank (b)?
Answer: unique_majors
unique_majors is the array we initialized before running our for-loop to keep track of our results. We’re already given that the first argument to np.append is unique_majors, meaning that in each iteration of the for-loop we’re creating a new array by adding a new element to the end of unique_majors; to save this new array, we need to re-assign it to unique_majors.
What goes in blank (c)?
Answer: np.unique(group)
In each iteration of our for-loop, we’re interested in finding the number of unique majors among the 3 students who were selected. We can tell that this is what we’re meant to store in unique_majors by looking at the options in the next subpart, which involve checking the proportion of values in unique_majors that are equal to 3.
The majors of the 3 randomly selected students are stored in group, and np.unique(group) is an array with the unique values in group. Then, len(np.unique(group)) is the number of unique majors in the group of 3 students selected.
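A small sketch of blanks (b) and (c) on two hand-made groups. The major names and the variables group_all_different, group_one_repeat, n_different, and n_repeat are hypothetical. It also shows that np.append returns a new array, which is why the result has to be re-assigned.

```python
import numpy as np

# Hypothetical groups of three majors (not drawn from the real survey data).
group_all_different = np.array(["Math", "Data Science", "Biology"])
group_one_repeat = np.array(["Math", "Math", "Biology"])

# np.unique keeps one copy of each value, so its length counts distinct majors.
n_different = len(np.unique(group_all_different))
n_repeat = len(np.unique(group_one_repeat))

# np.append does not modify its input; re-assigning stores the longer array.
unique_majors = np.array([])
unique_majors = np.append(unique_majors, n_different)
unique_majors = np.append(unique_majors, n_repeat)
print(unique_majors)
```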
What could go in blank (d)? Select all that apply. At least one option is correct; blank answers will receive no credit.
(unique_majors > 2).mean()
(unique_majors.sum() > 2).mean()
np.count_nonzero(unique_majors > 2).sum() / len(unique_majors > 2)
1 - np.count_nonzero(unique_majors != 3).mean()
unique_majors.mean() - 3 == 0
Answer: Option 1 only
Let’s break down the code we have so far:
An empty array, unique_majors, is initialized to store the number of unique majors in each iteration of the simulation.
In each iteration, the np.unique function is employed to identify the number of unique majors among the selected three. The result is then appended to the unique_majors array.
What remains, in blank (d), is to compute the proportion of times the unique_majors array contains a value greater than 2.
Let’s look at each option more carefully.
Option 1: (unique_majors > 2).mean() creates a Boolean array in which each value in unique_majors is checked for being greater than 2. In other words, it contains True for each 3 and False otherwise. Taking the mean of this Boolean array gives the proportion of True values, which estimates the probability that all 3 students selected are in different majors.
Option 2: (unique_majors.sum() > 2) produces a single Boolean value (either True or False), since it sums all the values in the unique_majors array and then checks whether the sum is greater than 2. This is not what we want. The sum of 10,000 counts is always greater than 2, so the comparison is always True, and calling .mean() on that single NumPy Boolean just gives 1.0 rather than a proportion of trials.
Option 3: np.count_nonzero(unique_majors > 2).sum() / len(unique_majors > 2) would work without the .sum(). unique_majors > 2 results in a Boolean array where each value is True if the respective simulation yielded 3 unique majors and False otherwise. np.count_nonzero() counts the number of True values in the array, which corresponds to the number of simulations where all 3 students had unique majors. This returns a single integer value representing the count. The .sum() method is meant for collections (like arrays) whose elements can be summed. Since np.count_nonzero returns a single integer, calling .sum() on it results in an AttributeError, because individual numbers do not have a sum method. len(unique_majors > 2) calculates the length of the Boolean array, which is equal to 10,000 (the total number of simulations). Because of the attempt to call .sum() on an integer value, the code raises an error and does not produce the desired result.
Option 4: In 1 - np.count_nonzero(unique_majors != 3).mean(), np.count_nonzero(unique_majors != 3) counts the number of trials where not all 3 students had different majors. Calling .mean() on an integer value, which is what np.count_nonzero returns, raises an error.
Option 5: unique_majors.mean() - 3 == 0 checks whether the mean of unique_majors is exactly 3. This expression evaluates to True or False, not to an estimated probability, so it isn’t the right approach.
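Putting the four blanks together gives a runnable sketch of the full simulation. The survey data is not available here, so the array majors below is a made-up class (4 majors with 10 students each) standing in for survey.get("Major"). The seed, same_estimate, and exact are also additions for illustration.

```python
import numpy as np

np.random.seed(10)  # arbitrary seed, added only so the run is repeatable

# Hypothetical class standing in for survey.get("Major"):
# 4 majors with 10 students each.
majors = np.repeat(["Math", "Data Science", "Biology", "Physics"], 10)

unique_majors = np.array([])
for i in np.arange(10000):
    # Blank (a): sample 3 different students.
    group = np.random.choice(majors, 3, replace=False)
    # Blanks (b) and (c): record the number of distinct majors in the group.
    unique_majors = np.append(unique_majors, len(np.unique(group)))

# Blank (d), Option 1: proportion of trials with 3 distinct majors.
prob_all_unique = (unique_majors > 2).mean()

# Option 3 with the stray .sum() removed gives the same estimate.
same_estimate = np.count_nonzero(unique_majors > 2) / len(unique_majors)

# Exact value for this made-up class only: the second student must come from
# another major (30 of 39), the third from yet another (20 of 38).
exact = (30 / 39) * (20 / 38)
print(prob_all_unique, same_estimate, exact)
```

The exact value applies only to the hypothetical class above; with the real survey data, the estimate would depend on how many students are in each major.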