In [1]:
from lec_utils import *
Discussion Slides: DataFrames and Querying
Agenda 📆¶
value_counts.locvs.iloc.- Querying 🔍.
- Worksheet 📝.
Today's Dataset 📲¶
- Today, we're going to be working with a dataset on Social Media and Entertainment Usage in the USA (maybe rip TikTok?).
- A reminder that we can read in a CSV using
pd.read_csv(path), wherepathis the dataset's location relative to where your notebook is stored.
- Let's store our dataset in the
dfvariable.
In [2]:
df = pd.read_csv('data/social_media_entertainment.csv')
df
Out[2]:
| User ID | Age | Gender | Country | ... | Digital Well-being Awareness | Sleep Quality (scale 1-10) | Social Isolation Feeling (scale 1-10) | Monthly Expenditure on Entertainment (USD) | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 3 | 51 | Female | USA | ... | High | 5 | 3 | 71.72 |
| 1 | 8 | 58 | Female | USA | ... | Low | 9 | 1 | 9.65 |
| 2 | 24 | 32 | Female | USA | ... | Low | 6 | 1 | 280.82 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 50142 | 299988 | 48 | Other | USA | ... | High | 5 | 3 | 350.45 |
| 50143 | 299990 | 24 | Other | USA | ... | High | 7 | 1 | 108.41 |
| 50144 | 299999 | 15 | Male | USA | ... | High | 7 | 5 | 432.00 |
50145 rows × 40 columns
Choosing an index¶
- The index of our DataFrame needs to be a unique identifier per row.
- Here, it makes sense to choose User ID as our index since it takes on a different value per row.
In [3]:
print('Original shape: ', df.shape)
df = df.set_index('User ID')
print('New shape: ', df.shape)
Original shape: (50145, 40) New shape: (50145, 39)
- Observe how now we only have 39 columns instead of 40!
value_counts¶
- Each row/column of our DataFrame is a Series, a 1D, array-like object.
value_countsis a Series method that returns another Series containing the count of unique values in a specific column or Series.
- In other words, it describes the distribution of the unique values within a certain column.
In [4]:
s = df['Preferred Entertainment Platform']
s
Out[4]:
User ID
3 Spotify
8 Amazon Prime
24 Spotify
...
299988 Netflix
299990 Netflix
299999 Spotify
Name: Preferred Entertainment Platform, Length: 50145, dtype: object
In [5]:
# What is the distribution of Preferred Entertainment Platforms?
df['Preferred Entertainment Platform'].value_counts(normalize=True)
Out[5]:
Preferred Entertainment Platform Spotify 0.25 YouTube 0.25 Netflix 0.25 Amazon Prime 0.25 Name: proportion, dtype: float64
In [6]:
# Which entertainment platform is used the most?
df['Preferred Entertainment Platform'].value_counts().idxmax()
Out[6]:
'Spotify'
In [7]:
# How many users prefer the leading entertainment platform?
df['Preferred Entertainment Platform'].value_counts().max()
Out[7]:
12625
In [8]:
# How many prefer the least popular entertainment platform?
df['Preferred Entertainment Platform'].value_counts().min()
Out[8]:
12485
loc vs. iloc¶
ilocstands for "integer-location".
- You use it to select data by *integer* position (i.e., row and column numbers).
ilocuses zero-based indexing, just like 2D arrays!
In [9]:
# Select rows 1-6 in our DataFrame along with columns 3-8
df.iloc[1:7, 3:9]
Out[9]:
| Daily Social Media Time (hrs) | Daily Entertainment Time (hrs) | Social Media Platforms Used | Primary Platform | Daily Messaging Time (hrs) | Daily Video Content Time (hrs) | |
|---|---|---|---|---|---|---|
| User ID | ||||||
| 8 | 4.20 | 2.77 | 4 | 3.54 | 0.69 | |
| 24 | 0.78 | 0.55 | 2 | 0.57 | 2.65 | |
| 25 | 7.72 | 5.83 | 4 | YouTube | 1.37 | 5.21 |
| 32 | 0.62 | 7.94 | 1 | YouTube | 1.85 | 6.92 |
| 45 | 7.45 | 9.85 | 5 | 4.14 | 2.23 | |
| 47 | 0.92 | 2.93 | 4 | TikTok | 2.80 | 4.54 |
locstands for "location" and allows for label-based indexing.
- The
locindexer works similarly to slicing 2D arrays, but it uses row and column labels instead, not positions.
Remember, the "index" refers to the row labels.
In [10]:
# Show the Sleep Quality for user with id 3
df.loc[3, 'Sleep Quality (scale 1-10)']
Out[10]:
5
Example: Daily Social Media Usage 📱¶
- Let's take a closer look at how
locandilocwork by examining the'Daily Social Media Time (hrs)'column.
In [11]:
s = df['Daily Social Media Time (hrs)']
s
Out[11]:
User ID
3 6.78
8 4.20
24 0.78
...
299988 3.16
299990 4.37
299999 6.47
Name: Daily Social Media Time (hrs), Length: 50145, dtype: float64
- Note that
sis a Series object!
In [12]:
type(s)
Out[12]:
pandas.core.series.Series
In [13]:
# What is the daily social media time (in hours) of user with ID 3?
s.loc[3]
Out[13]:
6.78
In [14]:
# What is the daily social media time (in hours) of the user in row 3 (zero-indexed)?
s.iloc[3]
Out[14]:
7.72
- Indeed, we can verify that this is the case!
In [15]:
s.head(4)
Out[15]:
User ID 3 6.78 8 4.20 24 0.78 25 7.72 Name: Daily Social Media Time (hrs), dtype: float64
- Notice that
locallows us to directly extract information about a user if we know their user ID, which is our DataFrame's index.
- If we don't know their user ID, we can instead sort the values in our series and use
iloc!
- In general, because
ilocrelies on hard-coded integers to access values, it is most useful to us when we know exactly where certain values lie in our data.
- So... always use it after sorting!
In [16]:
# What is the highest social media usage time?
s.sort_values(ascending=False).iloc[0]
Out[16]:
8.0
- We use
ilocinstead oflochere because we don't know the User ID of the user with the highest social media time.
Using loc and iloc on a DataFrame¶
- Recall that you can provide a sequence (list, array, Series) as either argument to
loc:
In [17]:
# The first argument is the row label, i.e. the index value.
# ↓
df.loc[3, 'Daily Social Media Time (hrs)']
# ↑
# The second argument is the column label.
Out[17]:
6.78
In [18]:
df.loc[[3, 8, 24, 25], ['Primary Platform', 'Daily Social Media Time (hrs)']]
Out[18]:
| Primary Platform | Daily Social Media Time (hrs) | |
|---|---|---|
| User ID | ||
| 3 | 6.78 | |
| 8 | 4.20 | |
| 24 | 0.78 | |
| 25 | YouTube | 7.72 |
- With
iloc, you can provide a range of values for both the row and column.
In [19]:
df.iloc[:6, :8]
Out[19]:
| Age | Gender | Country | Daily Social Media Time (hrs) | Daily Entertainment Time (hrs) | Social Media Platforms Used | Primary Platform | Daily Messaging Time (hrs) | |
|---|---|---|---|---|---|---|---|---|
| User ID | ||||||||
| 3 | 51 | Female | USA | 6.78 | 1.77 | 4 | 2.09 | |
| 8 | 58 | Female | USA | 4.20 | 2.77 | 4 | 3.54 | |
| 24 | 32 | Female | USA | 0.78 | 0.55 | 2 | 0.57 | |
| 25 | 36 | Female | USA | 7.72 | 5.83 | 4 | YouTube | 1.37 |
| 32 | 21 | Male | USA | 0.62 | 7.94 | 1 | YouTube | 1.85 |
| 45 | 26 | Female | USA | 7.45 | 9.85 | 5 | 4.14 |
In [20]:
# Find the age of user with highest daily messaging time.
# Our DataFrame has a column called 'Daily Messaging Time (hrs)' and 'Age'.
df.sort_values('Daily Messaging Time (hrs)', ascending=False)['Age'].iloc[0]
Out[20]:
31
Querying 🔍¶
- We use querying to extract rows in our DataFrame that satisfies certain *conditions*.
- Boolean statements are particularly useful!
In [21]:
df['Primary Platform'] == 'TikTok'
Out[21]:
User ID
3 False
8 False
24 False
...
299988 False
299990 False
299999 False
Name: Primary Platform, Length: 50145, dtype: bool
- Combining these with
locallows us to select the rows we're looking for.
In [22]:
df.loc[df['Primary Platform'] == 'TikTok']
Out[22]:
| Age | Gender | Country | Daily Social Media Time (hrs) | ... | Digital Well-being Awareness | Sleep Quality (scale 1-10) | Social Isolation Feeling (scale 1-10) | Monthly Expenditure on Entertainment (USD) | |
|---|---|---|---|---|---|---|---|---|---|
| User ID | |||||||||
| 47 | 29 | Male | USA | 0.92 | ... | Low | 5 | 9 | 452.78 |
| 48 | 40 | Other | USA | 7.27 | ... | Low | 6 | 6 | 412.09 |
| 52 | 53 | Male | USA | 6.90 | ... | High | 5 | 7 | 188.51 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 299906 | 37 | Other | USA | 5.77 | ... | High | 2 | 8 | 481.62 |
| 299934 | 52 | Male | USA | 2.19 | ... | High | 6 | 1 | 42.33 |
| 299980 | 35 | Female | USA | 5.44 | ... | High | 5 | 7 | 324.45 |
10127 rows × 39 columns
- You can also include multiple conditions to query for! Use Boolean operators
&or|to separate them.
In [23]:
df.loc[(df['Primary Platform'] == 'TikTok') & (df['Daily Social Media Time (hrs)'] > 3.0)]
Out[23]:
| Age | Gender | Country | Daily Social Media Time (hrs) | ... | Digital Well-being Awareness | Sleep Quality (scale 1-10) | Social Isolation Feeling (scale 1-10) | Monthly Expenditure on Entertainment (USD) | |
|---|---|---|---|---|---|---|---|---|---|
| User ID | |||||||||
| 48 | 40 | Other | USA | 7.27 | ... | Low | 6 | 6 | 412.09 |
| 52 | 53 | Male | USA | 6.90 | ... | High | 5 | 7 | 188.51 |
| 55 | 44 | Female | USA | 3.99 | ... | Low | 9 | 8 | 274.86 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 299844 | 56 | Male | USA | 4.37 | ... | High | 2 | 1 | 359.47 |
| 299906 | 37 | Other | USA | 5.77 | ... | High | 2 | 8 | 481.62 |
| 299980 | 35 | Female | USA | 5.44 | ... | High | 5 | 7 | 324.45 |
6738 rows × 39 columns
In [24]:
# How many users use TikTok as their primary platform?
df.loc[(df['Primary Platform'] == 'TikTok')].shape[0]
Out[24]:
10127
In [25]:
# Among users who use TikTok as their primary platform, what is the highest Daily Social Media Time?
df.loc[df['Primary Platform'] == 'TikTok', 'Daily Social Media Time (hrs)'].sort_values(ascending=False).iloc[0]
Out[25]:
8.0
In [26]:
# Which gender prefers to use Spotify the most?
# Our DataFrame has a column called 'Preferred Entertainment Platform'.
df.loc[df['Preferred Entertainment Platform'] == 'Spotify', 'Gender'].value_counts().idxmax()
Out[26]:
'Female'