Arrays and DataFrames



The problems in this worksheet are taken from past exams in similar classes. Work on them on paper, since the exams you take in this course will also be on paper.

We encourage you to complete this worksheet in a live discussion section. Solutions will be made available after all discussion sections have concluded. You don’t need to submit your answers anywhere.

Note: We do not plan to cover all problems here in the live discussion section; the problems we don’t cover can be used for extra practice.


Problem 1

Consider the following assignment statement.

puffin = np.array([5, 9, 13, 17, 21])


Problem 1.1

Provide arguments to call np.arange with so that the array penguin is identical to the array puffin.

penguin = np.arange(____)

Answer: We need to provide np.arange with three arguments: 5, anything in (21, 25], 4. For instance, something like penguin = np.arange(5, 25, 4) would work.
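To see why the stop value can be anything in (21, 25], here is a quick check you could run after working the problem on paper. It is a minimal sketch; the names too_short and too_long are ours, not part of the problem.

```python
import numpy as np

puffin = np.array([5, 9, 13, 17, 21])

# Any stop value in (21, 25] yields the same five elements,
# because np.arange excludes the stop value itself.
for stop in [22, 23, 24, 25]:
    penguin = np.arange(5, stop, 4)
    assert np.array_equal(penguin, puffin)

# A stop of 21 leaves out 21; a stop of 26 lets 25 in.
too_short = np.arange(5, 21, 4)   # 5, 9, 13, 17
too_long = np.arange(5, 26, 4)    # 5, 9, 13, 17, 21, 25
```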


Problem 1.2

Fill in the blanks so that the array parrot is also identical to the array puffin.
Hint: Start by choosing y so that parrot has length 5.

parrot = __(x)__ * np.arange(0, __(y)__, 2) + __(z)__

Answer:

  • x: 2
  • y: anything in (8, 10]
  • z: 5
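One way to reason toward these values, sketched below under the assumption that we follow the hint and fix y first. The intermediate name base is ours.

```python
import numpy as np

puffin = np.array([5, 9, 13, 17, 21])

# Following the hint: np.arange(0, y, 2) needs 5 elements (0, 2, 4, 6, 8),
# so y can be anything in (8, 10].
base = np.arange(0, 10, 2)

# Consecutive elements of puffin differ by 4 and those of base differ by 2,
# so the multiplier x is 4 / 2 = 2.
# base starts at 0, so the offset z must be puffin's first element, 5.
parrot = 2 * base + 5
```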



Problem 2

Billina Records, a new record company focused on creating new TikTok audios, has its offices on the 23rd floor of a skyscraper with 75 floors (numbered 1 through 75). The owners of the building promised that 10 different random floors will be selected to be renovated.

Below, fill in the blanks to complete a simulation that will estimate the probability that Billina Records’ floor will be renovated.

total = 0
repetitions = 10000
for i in np.arange(repetitions):
    choices = np.random.choice(__(a)__, 10, __(b)__)
    if __(c)__:
        total = total + 1
prob_renovate = total / repetitions

What goes in blank (a)?

What goes in blank (b)?

What goes in blank (c)?

Answer: np.arange(1, 76), replace=False, np.count_nonzero(choices == 23) > 0

Here, the idea is to randomly choose 10 different floors repeatedly, and each time, check if floor 23 was selected.

Blank (a): The first argument to np.random.choice needs to be an array/list containing the options we want to choose from, i.e. an array/list containing the values 1, 2, 3, 4, …, 75, since those are the numbers of the floors. np.arange(a, b) returns an array of integers spaced out by 1 starting from a and ending at b-1. As such, the correct call to np.arange is np.arange(1, 76).

Blank (b): Since we want to select 10 different floors, we need to specify replace=False (the default behavior is replace=True).

Blank (c): The if condition needs to check if 23 was one of the 10 numbers that were selected, i.e. if 23 is in choices. It needs to evaluate to a single Boolean value, i.e. True (if 23 was selected) or False (if 23 was not selected). Let’s go through each incorrect option to see why it’s wrong:

  • Option 1, choices == 23, does not evaluate to a single Boolean value; rather, it evaluates to an array of length 10, containing multiple Trues and Falses.
  • Option 2, choices is 23, does not evaluate to what we want – it checks to see if the array choices is the same Python object as the number 23, which it is not (and will never be, since an array cannot be a single number).
  • Option 4, np.count_nonzero(choices) == 23, does evaluate to a single Boolean, however it is not quite correct. np.count_nonzero(choices) will always evaluate to 10, since choices is made up of 10 integers randomly selected from 1, 2, 3, 4, …, 75, none of which are 0. As such, np.count_nonzero(choices) == 23 is the same as 10 == 23, which is always False, regardless of whether or not 23 is in choices.
  • Option 5, choices.str.contains(23), errors, since choices is not a Series (and .str can only follow a Series). If choices were a Series, this would still error, since the argument to .str.contains must be a string, not an int.
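To make the behavior of these options concrete, the sketch below evaluates them on one made-up draw of 10 distinct floors that happens to include floor 23. The specific floor numbers and the variable names (floor, option_1, and so on) are assumptions for illustration only.

```python
import numpy as np

# One hypothetical draw of 10 distinct floors, including floor 23.
choices = np.array([4, 23, 61, 75, 12, 38, 9, 50, 1, 66])
floor = 23

# Option 1: an array of 10 Booleans, not a single Boolean.
option_1 = choices == floor

# Using a multi-element Boolean array as an if condition raises a ValueError.
try:
    bool(option_1)
    ambiguous = False
except ValueError:
    ambiguous = True

# Option 2: identity check between an array and a number; always False.
option_2 = choices is floor

# Option 4: none of the floors are 0, so the count is always 10, and 10 != 23.
option_4 = np.count_nonzero(choices) == 23

# Option 3: counts how many selected floors equal 23, then checks for at least one.
option_3 = np.count_nonzero(choices == floor) > 0
```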

By process of elimination, Option 3, np.count_nonzero(choices == 23) > 0, must be the correct answer. Let’s look at it piece-by-piece:

  • As we saw in Option 1, choices == 23 is a Boolean array that contains True each time the selected floor was floor 23 and False otherwise. (Since we’re sampling without replacement, floor 23 can only be selected at most once, and so choices == 23 can only contain the value True at most once.)
  • np.count_nonzero(choices == 23) evaluates to the number of Trues in choices == 23. If it is positive (i.e. 1), it means that floor 23 was selected. If it is 0, it means floor 23 was not selected.
  • Thus, np.count_nonzero(choices == 23) > 0 evaluates to True if (and only if) floor 23 was selected.
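With the blanks filled in, the full simulation can be run and compared against the exact probability. Since every floor is equally likely to be among the 10 chosen, that probability is 10/75, or about 0.133. This is a sketch: the seed is arbitrary and only there to make the run repeatable, and the name exact is ours.

```python
import numpy as np

np.random.seed(23)  # arbitrary seed, used only so the run is repeatable

total = 0
repetitions = 10000
for i in np.arange(repetitions):
    # Choose 10 different floors from 1 through 75.
    choices = np.random.choice(np.arange(1, 76), 10, replace=False)
    # Count this repetition if floor 23 was among them.
    if np.count_nonzero(choices == 23) > 0:
        total = total + 1
prob_renovate = total / repetitions

# Each of the 75 floors is equally likely to be one of the 10 selected.
exact = 10 / 75
```

The estimate should land close to exact, though it will vary slightly from run to run if the seed is changed or removed.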

Problem 3

In this problem, we’ll work with the DataFrame dogs, which contains one row for every registered pet dog in Zurich, Switzerland in 2017.

The first few rows of dogs are shown below, but dogs has many more rows than are shown.



Problem 3.1

Fill in the blank so that most_common evaluates to the most common district in dogs. Assume there are no ties.

most_common = ____

Answer:

dogs["district"].value_counts().idxmax()

Above, we presented one possible solution, but there are many:

  • dogs["district"].value_counts().idxmax()
  • dogs["district"].value_counts().index[0]
  • dogs.groupby("district").size().sort_values(ascending=False).index[0]
  • dogs.groupby("district").count()["owner_id"].sort_values(ascending=False).index[0]
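As a sanity check, the sketch below runs all four expressions on a tiny stand-in for dogs. Only the column names "district" and "owner_id" come from the problem; the rows are made up, and district 7 is the most common by construction.

```python
import pandas as pd

# Hypothetical stand-in for dogs; the values are invented for illustration.
dogs = pd.DataFrame({
    "owner_id": [1, 2, 3, 4, 5, 6],
    "district": [2, 7, 7, 3, 7, 2],
})

candidates = [
    # value_counts sorts by frequency, so idxmax and index[0] agree when there are no ties.
    dogs["district"].value_counts().idxmax(),
    dogs["district"].value_counts().index[0],
    # groupby versions: count the rows per district, then sort descending.
    dogs.groupby("district").size().sort_values(ascending=False).index[0],
    dogs.groupby("district").count()["owner_id"].sort_values(ascending=False).index[0],
]
```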


Problem 3.2

Fill in the blank so that female_breeds evaluates to a Series containing the primary breeds of all female dogs.

female_breeds = dogs.____

Answer:

loc[dogs["dog_sex"] == "f", "primary_breed"]

Another possible answer is:

query("dog_sex == 'f'")["primary_breed"]

Note that the question didn’t ask for unique primary breeds.
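The two answers can be compared on a small made-up version of dogs. The breed names below are invented; the column names and the value "f" come from the solutions above.

```python
import pandas as pd

# Hypothetical stand-in for dogs; the rows are invented for illustration.
dogs = pd.DataFrame({
    "dog_sex": ["f", "m", "f", "f"],
    "primary_breed": ["Beagle", "Pug", "Beagle", "Collie"],
})

# Boolean mask for the rows, column label for the column.
via_loc = dogs.loc[dogs["dog_sex"] == "f", "primary_breed"]

# Filter rows with query, then select the column.
via_query = dogs.query("dog_sex == 'f'")["primary_breed"]
```

Both results keep the repeated "Beagle", which is consistent with the note that unique breeds were not requested.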



Problem 4

For this question, we will work with the DataFrame tv, which contains information about various TV shows available to watch on streaming services. For each TV show, we have:

The first few rows of tv are shown below (though tv has many more rows than are pictured here).

Assume that we have already run all of the necessary imports.

Throughout this problem, we will refer to tv repeatedly.

In the following subparts, consider the variable double_count, defined below.

double_count = tv["Title"].value_counts().value_counts()


Problem 4.1

What is type(double_count)?

Answer: Series

The .value_counts() method, when called on a Series s, produces a new Series in which

  • the index contains all unique values in s.
  • the values are the frequencies of the unique values in s.

Since tv["Title"] is a Series, tv["Title"].value_counts() is a Series, and so is tv["Title"].value_counts().value_counts(). We provide an interpretation of each of these Series in the solution to the next subpart.


Problem 4.2

Which of the following statements are true? Select all that apply.

Answers:

  • The only case in which it would make sense to set the index of tv to "Title" is if double_count.loc[1] == tv.shape[0] is True.
  • If double_count.loc[2] == 5 is True, there are 5 pairs of 2 TV shows such that each pair shares the same "Title".

To answer, we need to understand what each of tv["Title"], tv["Title"].value_counts(), and tv["Title"].value_counts().value_counts() contain. To illustrate, let’s start with a basic, unrelated example. Suppose tv["Title"] looks like:

0    A
1    B
2    C
3    B
4    D
5    E
6    A
dtype: object

Then, tv["Title"].value_counts() looks like:

A    2
B    2
C    1
D    1
E    1
dtype: int64

and tv["Title"].value_counts().value_counts() looks like:

1    3
2    2
dtype: int64

Back to our actual dataset. tv["Title"], as we know, contains the name of each TV show. tv["Title"].value_counts() is a Series whose index is a sequence of the unique TV show titles in tv["Title"], and whose values are the frequencies of each title. tv["Title"].value_counts() may look something like the following:

Breaking Bad                             1
Fresh Meat                               1
Doctor Thorne                            1
                                       ...
Styling Hollywood                        1
Vai Anitta                               1
Fearless Adventures with Jack Randall    1
Name: Title, Length: 5368, dtype: int64

Then, tv["Title"].value_counts().value_counts() is a Series whose index is a sequence of the unique values in the above Series, and whose values are the frequencies of each value above. In the case where all titles in tv["Title"] are unique, then tv["Title"].value_counts() will only have one unique value, 1, repeated many times. Then, tv["Title"].value_counts().value_counts() will only have one row total, and will look something like:

1    5368
Name: Title, dtype: int64

This allows us to distinguish between the first two answer choices. The key is remembering that in order to set a column to the index, the column should only contain unique values, since the goal of the index is to provide a “name” (more formally, a label) for each row.

  • The first answer choice, “The only case in which it would make sense to set the index of tv to "Title" is if double_count.iloc[0] == 1 is True”, is false. As we can see in the example above, all titles are unique, but double_count.iloc[0] is something other than 1.
  • The second answer choice, “The only case in which it would make sense to set the index of tv to "Title" is if double_count.loc[1] == tv.shape[0] is True”, is true. If double_count.loc[1] == tv.shape[0], it means that all values in tv["Title"].value_counts() were 1, meaning that tv["Title"] consisted solely of unique values, which is the only case in which it makes sense to set "Title" to the index.

Now, let’s look at the last two answer choices. If double_count.loc[2] == 5, it would mean that 5 of the values in tv["Title"].value_counts() were 2. This would mean that there were 5 titles in tv["Title"] that each appeared exactly twice, i.e. 5 pairs of TV shows with the same title.

  • This makes the fourth answer choice, “If double_count.loc[2] == 5 is True, there are 5 pairs of 2 TV shows such that each pair shares the same "Title"”, correct.
  • The third answer choice, “If double_count.loc[2] == 5 is True, there are 5 TV shows that all share the same "Title"”, is incorrect; if there were 5 TV shows with the same title, then double_count.loc[5] would be at least 1, but we can’t make any guarantees about double_count.loc[2].
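The reasoning about the answer choices can be checked on two small tables: the toy titles from the example above, and a made-up table in which every title is unique. The helper double_count_of and the table names are ours, introduced only for this sketch.

```python
import pandas as pd

# Toy titles from the example above: A and B each appear twice.
with_repeats = pd.DataFrame({"Title": ["A", "B", "C", "B", "D", "E", "A"]})
# Hypothetical table in which every title is unique.
all_unique = pd.DataFrame({"Title": ["A", "B", "C", "D"]})

def double_count_of(df):
    """Count how many titles appear once, twice, and so on."""
    return df["Title"].value_counts().value_counts()

dc_repeats = double_count_of(with_repeats)
dc_unique = double_count_of(all_unique)

# Second answer choice: the check passes only when every title appears once.
ok_repeats = dc_repeats.loc[1] == with_repeats.shape[0]   # 3 == 7
ok_unique = dc_unique.loc[1] == all_unique.shape[0]       # 4 == 4

# First answer choice: even with all-unique titles, iloc[0] is the number
# of titles, not 1.
first_value = dc_unique.iloc[0]

# Fourth answer choice: loc[2] counts titles that appear exactly twice (pairs).
num_pairs = dc_repeats.loc[2]
```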


Problem 4.3

Ethan is an avid Star Wars fan, and the only streaming service he has an account on is Disney+. (He had a Netflix account, but then Netflix cracked down on password sharing.)

Fill in the blanks below so that star_disney_prop evaluates to the proportion of TV shows in tv with "Star Wars" in the title that are available to stream on Disney+.

star_only = __(a)__
star_disney_prop = __(b)__ / star_only.shape[0]

What goes in the blanks?

Answers:

  • Blank (a): tv[tv["Title"].str.contains("Star Wars")]
  • Blank (b): star_only["Disney+"].sum()

We’re asked to find the proportion of TV shows with "Star Wars" in the title that are available to stream on Disney+. This is a fraction, where:

  • The numerator is the number of TV shows that have "Star Wars" in the title and are available to stream on Disney+.
  • The denominator is the number of TV shows that have "Star Wars" in the title.

The key is recognizing that star_only must be a DataFrame that contains all the rows in which the "Title" contains "Star Wars"; to create this DataFrame in blank (a), we use tv[tv["Title"].str.contains("Star Wars")]. Then, the denominator is already provided for us, and all we need to fill in is the numerator. There are a few possibilities, though they all include star_only:

  • star_only["Disney+"].sum()
  • (star_only["Disney+"] == 1).sum()
  • star_only[star_only["Disney+"] == 1].shape[0]

Common misconception: Many students calculated the wrong proportion: they calculated the proportion of shows available to stream on Disney+ that have "Star Wars" in the title. We asked for the proportion of shows with "Star Wars" in the title that are available to stream on Disney+; “proportion of X that Y” is always $\frac{\# X \text{ and } Y}{\# X}$.
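The sketch below works through the calculation on a made-up tv table and contrasts it with the reversed proportion from the common misconception. The titles are placeholders, and we assume "Disney+" holds 1 when a show is available there and 0 otherwise, which is what the solutions above rely on.

```python
import pandas as pd

# Hypothetical stand-in for tv; the rows are invented for illustration.
tv = pd.DataFrame({
    "Title": ["Star Wars: Show A", "Star Wars: Show B", "Show C", "Show D", "Show E"],
    "Disney+": [1, 0, 1, 1, 0],
})

star_only = tv[tv["Title"].str.contains("Star Wars")]

# Proportion of "Star Wars" shows that are on Disney+: 1 of 2.
star_disney_prop = star_only["Disney+"].sum() / star_only.shape[0]

# The three numerators listed above all count the same rows.
numerators = [
    star_only["Disney+"].sum(),
    (star_only["Disney+"] == 1).sum(),
    star_only[star_only["Disney+"] == 1].shape[0],
]

# The misconception: proportion of Disney+ shows that are "Star Wars": 1 of 3.
reversed_prop = star_only["Disney+"].sum() / tv["Disney+"].sum()
```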



Problem 5

Consider the function tom_nook, defined below. Recall that if x is an integer, x % 2 is 0 if x is even and 1 if x is odd.

def tom_nook(crossing):
    bells = 0
    for nook in np.arange(crossing):
        if nook % 2 == 0:
            bells = bells + 1
        else:
            bells = bells - 2
    return bells

What value does tom_nook(8) evaluate to?

Answer: -4
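One way to reach this answer on paper is to count how many even and odd values the loop visits, then confirm by running the function. The name by_hand below is ours.

```python
import numpy as np

def tom_nook(crossing):
    bells = 0
    for nook in np.arange(crossing):
        if nook % 2 == 0:
            bells = bells + 1   # even value: gain 1
        else:
            bells = bells - 2   # odd value: lose 2
    return bells

# np.arange(8) holds 4 even values (0, 2, 4, 6) and 4 odd values (1, 3, 5, 7),
# so the total is 4 * 1 + 4 * (-2).
by_hand = 4 * 1 + 4 * (-2)
result = tom_nook(8)
```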


Problem 6

Consider the following four assignment statements.

bass = "5"
tuna = 2
sword = ["4.0", 5, 12.5, -10, "2023"]
gold = [4, "6", "CSE", "doc"]


Problem 6.1

What is the value of the expression bass * tuna?

Answer: "55"


Problem 6.2

Which of the following expressions results in an error?

Answer: int(sword[0])


Problem 6.3

Which of the following expressions evaluates to "DSC10"?

Answer: gold[3].replace("o", "s").upper() + str(gold[0] + int(gold[1]))
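The three answers to Problem 6 can be verified with the short sketch below. The variable names repeated, raised, via_float, and label are ours; the int(float(...)) line is an aside showing one conversion that does work, not one of the exam's options.

```python
bass = "5"
tuna = 2
sword = ["4.0", 5, 12.5, -10, "2023"]
gold = [4, "6", "CSE", "doc"]

# Problem 6.1: multiplying a string by an int repeats the string.
repeated = bass * tuna

# Problem 6.2: int cannot parse a string that contains a decimal point.
try:
    int(sword[0])
    raised = False
except ValueError:
    raised = True

# Converting to float first does work.
via_float = int(float(sword[0]))

# Problem 6.3: "doc" -> "dsc" -> "DSC", then append str(4 + 6).
label = gold[3].replace("o", "s").upper() + str(gold[0] + int(gold[1]))
```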



Problem 7

We’d like to select three students at random from the entire class to win extra credit (not really). When doing so, we want to guarantee that the same student cannot be selected twice, since it wouldn’t really be fair to give a student double extra credit.

Fill in the blanks below so that prob_all_unique is an estimate of the probability that all three students selected are in different majors.

Hint: The function np.unique, when called on an array, returns an array with just one copy of each unique element in the input. For example, if vals contains the values 1, 2, 2, 3, 3, 4, np.unique(vals) contains the values 1, 2, 3, 4.

    unique_majors = np.array([])
    for i in np.arange(10000):
        group = np.random.choice(survey.get("Major"), 3, __(a)__)
        __(b)__ = np.append(unique_majors, len(__(c)__))
        
    prob_all_unique = __(d)__


Problem 7.1

What goes in blank (a)?

Answer: replace=False

Since we want to guarantee that the same student cannot be selected twice, we should sample without replacement.



Problem 7.2

What goes in blank (b)?

Answer: unique_majors

unique_majors is the array we initialized before running our for-loop to keep track of our results. We’re already given that the first argument to np.append is unique_majors, meaning that in each iteration of the for-loop we’re creating a new array by adding a new element to the end of unique_majors; to save this new array, we need to re-assign it to unique_majors.



Problem 7.3

What goes in blank (c)?

Answer: np.unique(group)

In each iteration of our for-loop, we’re interested in finding the number of unique majors among the 3 students who were selected. We can tell that this is what we’re meant to store in unique_majors by looking at the options in the next subpart, which involve checking the proportion of values in unique_majors that are 3.

The majors of the 3 randomly selected students are stored in group, and np.unique(group) is an array with the unique values in group. Then, len(np.unique(group)) is the number of unique majors in the group of 3 students selected.


Problem 7.4

What could go in blank (d)? Select all that apply. At least one option is correct; blank answers will receive no credit.

Answer: Option 1 only

Let’s break down the code we have so far:

  • An empty array named unique_majors is initialized to store the number of unique majors in each iteration of the simulation.
  • The simulation runs 10,000 times, and in every iteration: Three majors are selected at random from the survey dataset without replacement. This ensures that the same item is not chosen more than once within a single iteration. The np.unique function is employed to identify the number of unique majors among the selected three. The result is then appended to the unique_majors array.
  • Following the simulation, the objective is to determine the fraction of iterations in which all three selected majors were unique. Since the maximum number of unique majors that can be selected in a group of three is 3, the code checks the fraction of times the unique_majors array contains a value greater than 2.

Let’s look at each option more carefully.

  • Option 1: (unique_majors > 2).mean() will create a Boolean array where each value in unique_majors is checked if it’s greater than 2. In other words, it’ll return True for each 3 and False otherwise. Taking the mean of this Boolean array will give the proportion of True values, which corresponds to the probability that all 3 students selected are in different majors.
  • Option 2: (unique_majors.sum() > 2) is a single Boolean value (either True or False), since it sums all values in unique_majors and then checks whether that one total is greater than 2. This is not what we want. Calling .mean() on a single Boolean cannot give a proportion across trials: at best it returns 1.0 or 0.0 (NumPy Booleans allow it), and on a plain Python bool it raises an error.
  • Option 3: np.count_nonzero(unique_majors > 2).sum() / len(unique_majors > 2) would work without the .sum(). unique_majors > 2 results in a Boolean array where each value is True if the respective simulation yielded 3 unique majors and False otherwise. np.count_nonzero() counts the number of True values in the array, which corresponds to the number of simulations where all 3 students had unique majors. This returns a single integer value representing the count. The .sum() method is meant for collections (like arrays or lists) to sum their elements. Since np.count_nonzero returns a single integer, calling .sum() on it will result in an AttributeError because individual numbers do not have a sum method. len(unique_majors > 2) calculates the length of the Boolean array, which is equal to 10,000 (the total number of simulations). Because of the attempt to call .sum() on an integer value, the code will raise an error and won’t produce the desired result.
  • Option 4: np.count_nonzero(unique_majors != 3) counts the number of trials where not all 3 students had different majors. When you call .mean() on an integer value, which is what np.count_nonzero returns, it’s going to raise an error.
  • Option 5: unique_majors.mean() - 3 == 0 is trying to check if the mean of unique_majors is 3. This line of code will return True or False, and this isn’t the right approach for calculating the estimated probability.
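Putting the four blanks together, the simulation might look like the sketch below. The survey DataFrame here is a made-up stand-in (the real one is not shown in this worksheet), the seed is arbitrary, and the name alternative is ours. It also evaluates Option 3 with the stray .sum() removed.

```python
import numpy as np
import pandas as pd

np.random.seed(7)  # arbitrary seed, used only so the run is repeatable

# Hypothetical survey of 100 students; the majors and counts are invented.
survey = pd.DataFrame({
    "Major": ["Data Science"] * 40 + ["Math"] * 30 + ["Biology"] * 20 + ["Physics"] * 10
})

unique_majors = np.array([])
for i in np.arange(10000):
    # Blank (a): sample 3 different students' majors.
    group = np.random.choice(survey.get("Major"), 3, replace=False)
    # Blanks (b) and (c): record how many distinct majors the group has.
    unique_majors = np.append(unique_majors, len(np.unique(group)))

# Blank (d), Option 1: the proportion of trials with 3 distinct majors.
prob_all_unique = (unique_majors > 2).mean()

# Option 3 without the .sum(): the same proportion, computed by counting.
alternative = np.count_nonzero(unique_majors > 2) / len(unique_majors > 2)
```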


