DataFrames and Querying

← return to study.practicaldsc.org


The problems in this worksheet are taken from past exams in similar classes. Work on them on paper, since the exams you take in this course will also be on paper.

We encourage you to complete this worksheet in a live discussion section. Solutions will be made available after all discussion sections have concluded. You don’t need to submit your answers anywhere.

Note: We do not plan to cover all problems here in the live discussion section; the problems we don’t cover can be used for extra practice.


Problem 1

In this problem, we’ll work with the DataFrame dogs, which contains one row for every registered pet dog in Zurich, Switzerland in 2017.

The first few rows of dogs are shown below, but dogs has many more rows than are shown.



Problem 1.1

Fill in the blank so that most_common evaluates to the most common district in dogs. Assume there are no ties.

most_common = ____


Problem 1.2

Fill in the blank so that female_breeds evaluates to a Series containing the primary breeds of all female dogs.

female_breeds = dogs.____



Problem 2

In this problem, we will work with the DataFrame tv, which contains information about various TV shows available to watch on streaming services. For each TV show, we have:

The first few rows of tv are shown below (though tv has many more rows than are pictured here).


Problem 2.1

As you see in the first few rows of tv, some TV shows are available for streaming on multiple streaming services. Fill in the blanks so that the two expressions below, Expression 1 and Expression 2, both evaluate to the "Title" of the TV show that is available for streaming on the greatest number of streaming services. Assume there are no ties and that the "Title" column contains unique values.

Expression 1:

tv.set_index("Title").loc[__(a)__].T.sum(axis=0).idxmax()

Expression 2:

    (
        tv.assign(num_services=tv.iloc[__(b)__].sum(__(c)__))
            .sort_values("num_services")
            .iloc[__(d)__]
    )

Hint: .T transposes the rows and columns of a DataFrame — the indexes of df are the columns of df.T and vice versa.

What goes in the blanks?


In the following subparts, consider the variable double_count, defined below.

double_count = tv["Title"].value_counts().value_counts()


Problem 2.2

What is type(double_count)?


Problem 2.3

Which of the following statements are true? Select all that apply.



Problem 3

Suppose you are given a DataFrame of employees for a given company. The DataFrame, called employees, is indexed by 'employee_id' (string) with a column called 'years' (int) that contains the number of years each employee has worked for the company. Suppose that the code

employees.sort_values(by='years', ascending=False).index[0]

outputs '2476'.

True or False: The number of years that employee 2476 has worked for the company is greater than the number of years that any other employee has worked for the company.


Problem 4

You are given a DataFrame called sports, indexed by 'Sport' containing one column, 'PlayersPerTeam'. The first few rows of the DataFrame are shown below:


Problem 4.1

Which of the following evaluates to 'basketball'?


Problem 4.2

Which of the following expressions evaluate to 5?



👋 Feedback: Find an error? Still confused? Have a suggestion? Let us know here.