The problems in this worksheet are taken from past exams in similar classes. Work on them on paper, since the exams you take in this course will also be on paper.
We encourage you to complete this worksheet in a live discussion section. Solutions will be made available after all discussion sections have concluded. You don’t need to submit your answers anywhere.
Note: We do not plan to cover all problems here in the live discussion section; the problems we don’t cover can be used for extra practice.
In this problem, we’ll work with the DataFrame dogs, which contains one row for every registered pet dog in Zurich, Switzerland in 2017.
The first few rows of dogs are shown below, but dogs has many more rows than are shown.
"owner_id" (int): A unique ID for each owner. Note that, for example, there are two rows in the preview for 4215, meaning that owner has at least 2 dogs. Assume that if an "owner_id" appears in dogs multiple times, the corresponding "owner_age", "owner_sex", and "district" are always the same.
"owner_age" (str): The age group of the owner; either "11-20", "21-30", …, or "91-100" (9 possibilities in total).
"owner_sex" (str): The birth sex of the owner; either "m" (male) or "f" (female).
"district" (int): The city district the owner lives in; a positive integer between 1 and 12 (inclusive).
"primary_breed" (str): The primary breed of the dog.
"secondary_breed" (str): The secondary breed of the dog. If this column is not null, the dog is a “mixed breed” dog; otherwise, the dog is a “purebred” dog.
"dog_sex" (str): The birth sex of the dog; either "m" (male) or "f" (female).
"birth_year" (int): The birth year of the dog.
Fill in the blank so that most_common evaluates to the most common district in dogs. Assume there are no ties.
most_common = ____
Fill in the blank so that female_breeds evaluates to a Series containing the primary breeds of all female dogs.
female_breeds = dogs.____
In this problem, we will work with the DataFrame tv, which contains information about various TV shows available to watch on streaming services. For each TV show, we have:
"Title" (object): The title of the TV show.
"Year" (int): The year in which the TV show was first released. (For instance, the show How I Met Your Mother ran from 2005 to 2014; there is only one row for How I Met Your Mother in tv, and its "Year" value is 2005.)
"Age" (object): The age category for the TV show. If not missing, "Age" is one of "all", "7+", "13+", "16+", or "18+". (For instance, "all" means that the show is appropriate for all audiences, while "18+" means that the show contains mature content and viewers should be at least 18 years old.)
"IMDb" (float): The TV show’s rating on IMDb (between 0 and 10).
"Rotten Tomatoes" (int): The TV show’s rating on Rotten Tomatoes (between 0 and 100).
"Netflix" (int): 1 if the show is available for streaming on Netflix and 0 otherwise. The "Hulu", "Prime Video", and "Disney+" columns work the same way.
The first few rows of tv are shown below (though tv has many more rows than are pictured here).
As you see in the first few rows of tv, some TV shows are available for streaming on multiple streaming services. Fill in the blanks so that the two expressions below, Expression 1 and Expression 2, both evaluate to the "Title" of the TV show that is available for streaming on the greatest number of streaming services. Assume there are no ties and that the "Title" column contains unique values.
Expression 1:
tv.set_index("Title").loc[__(a)__].T.sum(axis=0).idxmax()
Expression 2:
(
    tv.assign(num_services=tv.iloc[__(b)__].sum(__(c)__))
    .sort_values("num_services")
    .iloc[__(d)__]
)
Hint: .T transposes the rows and columns of a DataFrame; the indexes of df are the columns of df.T and vice versa.
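To make the hint concrete, here is a minimal sketch on a small made-up DataFrame. The name df, its labels, and its values are hypothetical and are not taken from tv. It shows that transposing swaps the index and the columns, so summing with axis=0 on df.T adds up what used to be each row of df.

```python
import pandas as pd

# Hypothetical toy data; the labels and values are made up for illustration.
df = pd.DataFrame(
    {"x": [1, 0, 1], "y": [1, 1, 0]},
    index=["a", "b", "c"],
)

flipped = df.T  # rows of df become columns, and columns become rows

print(list(df.index))         # ['a', 'b', 'c']
print(list(flipped.columns))  # ['a', 'b', 'c']

# Summing down each column of the transpose is the same as
# summing across each row of the original.
print(flipped.sum(axis=0).equals(df.sum(axis=1)))  # True
```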
What goes in the blanks?
In the following subparts, consider the variable double_count, defined below.
double_count = tv["Title"].value_counts().value_counts()
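As an illustration of what chaining value_counts does, here is a minimal sketch on a made-up Series. The name letters and its values are hypothetical and are not from tv. The first call counts how often each value appears; the second counts how often each of those counts appears.

```python
import pandas as pd

# Hypothetical toy data, made up for illustration.
letters = pd.Series(["a", "b", "b", "c", "c", "d"])

# How many times each letter appears: a -> 1, b -> 2, c -> 2, d -> 1.
counts = letters.value_counts()

# How many letters appear once, how many appear twice, and so on.
counts_of_counts = counts.value_counts()

print(counts_of_counts.loc[1])  # 2, since "a" and "d" each appear once
print(counts_of_counts.loc[2])  # 2, since "b" and "c" each appear twice
```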
What is type(double_count)?
Series
SeriesGroupBy
DataFrame
DataFrameGroupBy
Which of the following statements are true? Select all that apply.
The only case in which it would make sense to set the index of tv to "Title" is if double_count.iloc[0] == 1 is True.
The only case in which it would make sense to set the index of tv to "Title" is if double_count.loc[1] == tv.shape[0] is True.
If double_count.loc[2] == 5 is True, there are 5 TV shows that all share the same "Title".
If double_count.loc[2] == 5 is True, there are 5 pairs of 2 TV shows such that each pair shares the same "Title".
None of the above.
Suppose you are given a DataFrame of employees for a given company. The DataFrame, called employees, is indexed by 'employee_id' (string) with a column called 'years' (int) that contains the number of years each employee has worked for the company. Suppose that the code
employees.sort_values(by='years', ascending=False).index[0]
outputs '2476'.
True or False: The number of years that employee 2476 has worked for the company is greater than the number of years that any other employee has worked for the company.
True
False
You are given a DataFrame called sports, indexed by 'Sport' containing one column, 'PlayersPerTeam'. The first few rows of the DataFrame are shown below:
Which of the following evaluates to 'basketball'?
sports.loc[1]
sports.iloc[1]
sports.index[1]
sports['Sport'].iloc[1]
Which of the following expressions evaluate to 5?
sports.loc['basketball', 'PlayersPerTeam']
sports['PlayersPerTeam'].loc['basketball']
sports['PlayersPerTeam'].iloc[1]
sports.loc['PlayersPerTeam']['basketball']
sports.loc['basketball']
sports.iloc[1]
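For review, the difference between label-based and position-based access can be sketched on a separate made-up DataFrame. The name planets, its index, and its column are hypothetical and unrelated to sports.

```python
import pandas as pd

# Hypothetical toy data, made up for illustration.
planets = pd.DataFrame(
    {"Moons": [0, 1, 2]},
    index=pd.Index(["venus", "earth", "mars"], name="Planet"),
)

# .loc looks up by label: row label first, then column label.
print(planets.loc["mars", "Moons"])  # 2

# .iloc looks up by integer position, ignoring labels.
print(planets["Moons"].iloc[1])      # 1

# The index holds the row labels; it is not a regular column.
print(planets.index[0])              # venus
print("Planet" in planets.columns)   # False
```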