In [1]:
from lec_utils import *
Discussion Slides: DataFrames and Querying
Agenda 📆¶
value_counts
.loc
vs.iloc
.- Querying 🔍.
- Worksheet 📝.
Today's Dataset 📲¶
- Today, we're going to be working with a dataset on Social Media and Entertainment Usage in the USA (maybe rip TikTok?).
- A reminder that we can read in a CSV using
pd.read_csv(path)
, wherepath
is the dataset's location relative to where your notebook is stored.
- Let's store our dataset in the
df
variable.
In [2]:
df = pd.read_csv('data/social_media_entertainment.csv')
df
Out[2]:
User ID | Age | Gender | Country | ... | Digital Well-being Awareness | Sleep Quality (scale 1-10) | Social Isolation Feeling (scale 1-10) | Monthly Expenditure on Entertainment (USD) | |
---|---|---|---|---|---|---|---|---|---|
0 | 3 | 51 | Female | USA | ... | High | 5 | 3 | 71.72 |
1 | 8 | 58 | Female | USA | ... | Low | 9 | 1 | 9.65 |
2 | 24 | 32 | Female | USA | ... | Low | 6 | 1 | 280.82 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
50142 | 299988 | 48 | Other | USA | ... | High | 5 | 3 | 350.45 |
50143 | 299990 | 24 | Other | USA | ... | High | 7 | 1 | 108.41 |
50144 | 299999 | 15 | Male | USA | ... | High | 7 | 5 | 432.00 |
50145 rows × 40 columns
Choosing an index¶
- The index of our DataFrame needs to be a unique identifier per row.
- Here, it makes sense to choose User ID as our index since it takes on a different value per row.
In [3]:
print('Original shape: ', df.shape)
df = df.set_index('User ID')
print('New shape: ', df.shape)
Original shape: (50145, 40) New shape: (50145, 39)
- Observe how now we only have 39 columns instead of 40!
value_counts
¶
- Each row/column of our DataFrame is a Series, a 1D, array-like object.
value_counts
is a Series method that returns another Series containing the count of unique values in a specific column or Series.
- In other words, it describes the distribution of the unique values within a certain column.
In [4]:
s = df['Preferred Entertainment Platform']
s
Out[4]:
User ID 3 Spotify 8 Amazon Prime 24 Spotify ... 299988 Netflix 299990 Netflix 299999 Spotify Name: Preferred Entertainment Platform, Length: 50145, dtype: object
In [5]:
# What is the distribution of Preferred Entertainment Platforms?
df['Preferred Entertainment Platform'].value_counts(normalize=True)
Out[5]:
Preferred Entertainment Platform Spotify 0.25 YouTube 0.25 Netflix 0.25 Amazon Prime 0.25 Name: proportion, dtype: float64
In [6]:
# Which entertainment platform is used the most?
df['Preferred Entertainment Platform'].value_counts().idxmax()
Out[6]:
'Spotify'
In [7]:
# How many users prefer the leading entertainment platform?
df['Preferred Entertainment Platform'].value_counts().max()
Out[7]:
12625
In [8]:
# How many prefer the least popular entertainment platform?
df['Preferred Entertainment Platform'].value_counts().min()
Out[8]:
12485
loc
vs. iloc
¶
iloc
stands for "integer-location".
- You use it to select data by *integer* position (i.e., row and column numbers).
iloc
uses zero-based indexing, just like 2D arrays!
In [9]:
# Select rows 1-6 in our DataFrame along with columns 3-8
df.iloc[1:7, 3:9]
Out[9]:
Daily Social Media Time (hrs) | Daily Entertainment Time (hrs) | Social Media Platforms Used | Primary Platform | Daily Messaging Time (hrs) | Daily Video Content Time (hrs) | |
---|---|---|---|---|---|---|
User ID | ||||||
8 | 4.20 | 2.77 | 4 | 3.54 | 0.69 | |
24 | 0.78 | 0.55 | 2 | 0.57 | 2.65 | |
25 | 7.72 | 5.83 | 4 | YouTube | 1.37 | 5.21 |
32 | 0.62 | 7.94 | 1 | YouTube | 1.85 | 6.92 |
45 | 7.45 | 9.85 | 5 | 4.14 | 2.23 | |
47 | 0.92 | 2.93 | 4 | TikTok | 2.80 | 4.54 |
loc
stands for "location" and allows for label-based indexing.
- The
loc
indexer works similarly to slicing 2D arrays, but it uses row and column labels instead, not positions.
Remember, the "index" refers to the row labels.
In [10]:
# Show the Sleep Quality for user with id 3
df.loc[3, 'Sleep Quality (scale 1-10)']
Out[10]:
5
Example: Daily Social Media Usage 📱¶
- Let's take a closer look at how
loc
andiloc
work by examining the'Daily Social Media Time (hrs)'
column.
In [11]:
s = df['Daily Social Media Time (hrs)']
s
Out[11]:
User ID 3 6.78 8 4.20 24 0.78 ... 299988 3.16 299990 4.37 299999 6.47 Name: Daily Social Media Time (hrs), Length: 50145, dtype: float64
- Note that
s
is a Series object!
In [12]:
type(s)
Out[12]:
pandas.core.series.Series
In [13]:
# What is the daily social media time (in hours) of user with ID 3?
s.loc[3]
Out[13]:
6.78
In [14]:
# What is the daily social media time (in hours) of the user in row 3 (zero-indexed)?
s.iloc[3]
Out[14]:
7.72
- Indeed, we can verify that this is the case!
In [15]:
s.head(4)
Out[15]:
User ID 3 6.78 8 4.20 24 0.78 25 7.72 Name: Daily Social Media Time (hrs), dtype: float64
- Notice that
loc
allows us to directly extract information about a user if we know their user ID, which is our DataFrame's index.
- If we don't know their user ID, we can instead sort the values in our series and use
iloc
!
- In general, because
iloc
relies on hard-coded integers to access values, it is most useful to us when we know exactly where certain values lie in our data.
- So... always use it after sorting!
In [16]:
# What is the highest social media usage time?
s.sort_values(ascending=False).iloc[0]
Out[16]:
8.0
- We use
iloc
instead ofloc
here because we don't know the User ID of the user with the highest social media time.
Using loc
and iloc
on a DataFrame¶
- Recall that you can provide a sequence (list, array, Series) as either argument to
loc
:
In [17]:
# The first argument is the row label, i.e. the index value.
# ↓
df.loc[3, 'Daily Social Media Time (hrs)']
# ↑
# The second argument is the column label.
Out[17]:
6.78
In [18]:
df.loc[[3, 8, 24, 25], ['Primary Platform', 'Daily Social Media Time (hrs)']]
Out[18]:
Primary Platform | Daily Social Media Time (hrs) | |
---|---|---|
User ID | ||
3 | 6.78 | |
8 | 4.20 | |
24 | 0.78 | |
25 | YouTube | 7.72 |
- With
iloc
, you can provide a range of values for both the row and column.
In [19]:
df.iloc[:6, :8]
Out[19]:
Age | Gender | Country | Daily Social Media Time (hrs) | Daily Entertainment Time (hrs) | Social Media Platforms Used | Primary Platform | Daily Messaging Time (hrs) | |
---|---|---|---|---|---|---|---|---|
User ID | ||||||||
3 | 51 | Female | USA | 6.78 | 1.77 | 4 | 2.09 | |
8 | 58 | Female | USA | 4.20 | 2.77 | 4 | 3.54 | |
24 | 32 | Female | USA | 0.78 | 0.55 | 2 | 0.57 | |
25 | 36 | Female | USA | 7.72 | 5.83 | 4 | YouTube | 1.37 |
32 | 21 | Male | USA | 0.62 | 7.94 | 1 | YouTube | 1.85 |
45 | 26 | Female | USA | 7.45 | 9.85 | 5 | 4.14 |
In [20]:
# Find the age of user with highest daily messaging time.
# Our DataFrame has a column called 'Daily Messaging Time (hrs)' and 'Age'.
df.sort_values('Daily Messaging Time (hrs)', ascending=False)['Age'].iloc[0]
Out[20]:
31
Querying 🔍¶
- We use querying to extract rows in our DataFrame that satisfies certain *conditions*.
- Boolean statements are particularly useful!
In [21]:
df['Primary Platform'] == 'TikTok'
Out[21]:
User ID 3 False 8 False 24 False ... 299988 False 299990 False 299999 False Name: Primary Platform, Length: 50145, dtype: bool
- Combining these with
loc
allows us to select the rows we're looking for.
In [22]:
df.loc[df['Primary Platform'] == 'TikTok']
Out[22]:
Age | Gender | Country | Daily Social Media Time (hrs) | ... | Digital Well-being Awareness | Sleep Quality (scale 1-10) | Social Isolation Feeling (scale 1-10) | Monthly Expenditure on Entertainment (USD) | |
---|---|---|---|---|---|---|---|---|---|
User ID | |||||||||
47 | 29 | Male | USA | 0.92 | ... | Low | 5 | 9 | 452.78 |
48 | 40 | Other | USA | 7.27 | ... | Low | 6 | 6 | 412.09 |
52 | 53 | Male | USA | 6.90 | ... | High | 5 | 7 | 188.51 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
299906 | 37 | Other | USA | 5.77 | ... | High | 2 | 8 | 481.62 |
299934 | 52 | Male | USA | 2.19 | ... | High | 6 | 1 | 42.33 |
299980 | 35 | Female | USA | 5.44 | ... | High | 5 | 7 | 324.45 |
10127 rows × 39 columns
- You can also include multiple conditions to query for! Use Boolean operators
&
or|
to separate them.
In [23]:
df.loc[(df['Primary Platform'] == 'TikTok') & (df['Daily Social Media Time (hrs)'] > 3.0)]
Out[23]:
Age | Gender | Country | Daily Social Media Time (hrs) | ... | Digital Well-being Awareness | Sleep Quality (scale 1-10) | Social Isolation Feeling (scale 1-10) | Monthly Expenditure on Entertainment (USD) | |
---|---|---|---|---|---|---|---|---|---|
User ID | |||||||||
48 | 40 | Other | USA | 7.27 | ... | Low | 6 | 6 | 412.09 |
52 | 53 | Male | USA | 6.90 | ... | High | 5 | 7 | 188.51 |
55 | 44 | Female | USA | 3.99 | ... | Low | 9 | 8 | 274.86 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
299844 | 56 | Male | USA | 4.37 | ... | High | 2 | 1 | 359.47 |
299906 | 37 | Other | USA | 5.77 | ... | High | 2 | 8 | 481.62 |
299980 | 35 | Female | USA | 5.44 | ... | High | 5 | 7 | 324.45 |
6738 rows × 39 columns
In [24]:
# How many users use TikTok as their primary platform?
df.loc[(df['Primary Platform'] == 'TikTok')].shape[0]
Out[24]:
10127
In [25]:
# Among users who use TikTok as their primary platform, what is the highest Daily Social Media Time?
df.loc[df['Primary Platform'] == 'TikTok', 'Daily Social Media Time (hrs)'].sort_values(ascending=False).iloc[0]
Out[25]:
8.0
In [26]:
# Which gender prefers to use Spotify the most?
# Our DataFrame has a column called 'Preferred Entertainment Platform'.
df.loc[df['Preferred Entertainment Platform'] == 'Spotify', 'Gender'].value_counts().idxmax()
Out[26]:
'Female'