In [1]:
from lec_utils import *

Discussion Slides: DataFrames and Querying

Agenda 📆¶

  • value_counts.
  • loc vs. iloc.
  • Querying 🔍.
  • Worksheet 📝.

Today's Dataset 📲¶

  • Today, we're going to be working with a dataset on Social Media and Entertainment Usage in the USA (maybe rip TikTok?).
  • A reminder that we can read in a CSV using pd.read_csv(path), where path is the dataset's location relative to where your notebook is stored.
  • Let's store our dataset in the df variable.
In [2]:
df = pd.read_csv('data/social_media_entertainment.csv')
df
Out[2]:
User ID Age Gender Country ... Digital Well-being Awareness Sleep Quality (scale 1-10) Social Isolation Feeling (scale 1-10) Monthly Expenditure on Entertainment (USD)
0 3 51 Female USA ... High 5 3 71.72
1 8 58 Female USA ... Low 9 1 9.65
2 24 32 Female USA ... Low 6 1 280.82
... ... ... ... ... ... ... ... ... ...
50142 299988 48 Other USA ... High 5 3 350.45
50143 299990 24 Other USA ... High 7 1 108.41
50144 299999 15 Male USA ... High 7 5 432.00

50145 rows × 40 columns

Choosing an index¶

  • The index of our DataFrame needs to be a unique identifier per row.
  • Here, it makes sense to choose User ID as our index since it takes on a different value per row.
In [3]:
print('Original shape: ', df.shape)
df = df.set_index('User ID')
print('New shape: ', df.shape)
Original shape:  (50145, 40)
New shape:  (50145, 39)
  • Observe how now we only have 39 columns instead of 40!

value_counts¶

  • Each row/column of our DataFrame is a Series, a 1D, array-like object.
  • value_counts is a Series method that returns another Series containing the count of unique values in a specific column or Series.
  • In other words, it describes the distribution of the unique values within a certain column.
In [4]:
s = df['Preferred Entertainment Platform']
s
Out[4]:
User ID
3              Spotify
8         Amazon Prime
24             Spotify
              ...     
299988         Netflix
299990         Netflix
299999         Spotify
Name: Preferred Entertainment Platform, Length: 50145, dtype: object
In [5]:
# What is the distribution of Preferred Entertainment Platforms?
df['Preferred Entertainment Platform'].value_counts(normalize=True)
Out[5]:
Preferred Entertainment Platform
Spotify         0.25
YouTube         0.25
Netflix         0.25
Amazon Prime    0.25
Name: proportion, dtype: float64
In [6]:
# Which entertainment platform is used the most?
df['Preferred Entertainment Platform'].value_counts().idxmax()
Out[6]:
'Spotify'
In [7]:
# How many users prefer the leading entertainment platform? 
df['Preferred Entertainment Platform'].value_counts().max()
Out[7]:
12625
In [8]:
# How many prefer the least popular entertainment platform?
df['Preferred Entertainment Platform'].value_counts().min()
Out[8]:
12485

loc vs. iloc¶

  • iloc stands for "integer-location".
  • You use it to select data by *integer* position (i.e., row and column numbers).
  • iloc uses zero-based indexing, just like 2D arrays!
In [9]:
# Select rows 1-6 in our DataFrame along with columns 3-8
df.iloc[1:7, 3:9]
Out[9]:
Daily Social Media Time (hrs) Daily Entertainment Time (hrs) Social Media Platforms Used Primary Platform Daily Messaging Time (hrs) Daily Video Content Time (hrs)
User ID
8 4.20 2.77 4 Facebook 3.54 0.69
24 0.78 0.55 2 Facebook 0.57 2.65
25 7.72 5.83 4 YouTube 1.37 5.21
32 0.62 7.94 1 YouTube 1.85 6.92
45 7.45 9.85 5 Facebook 4.14 2.23
47 0.92 2.93 4 TikTok 2.80 4.54
  • loc stands for "location" and allows for label-based indexing.
  • The loc indexer works similarly to slicing 2D arrays, but it uses row and column labels instead, not positions.
    Remember, the "index" refers to the row labels.
In [10]:
# Show the Sleep Quality for user with id 3
df.loc[3, 'Sleep Quality (scale 1-10)']
Out[10]:
5

Example: Daily Social Media Usage 📱¶

  • Let's take a closer look at how loc and iloc work by examining the 'Daily Social Media Time (hrs)' column.
In [11]:
s = df['Daily Social Media Time (hrs)']
s
Out[11]:
User ID
3         6.78
8         4.20
24        0.78
          ... 
299988    3.16
299990    4.37
299999    6.47
Name: Daily Social Media Time (hrs), Length: 50145, dtype: float64
  • Note that s is a Series object!
In [12]:
type(s)
Out[12]:
pandas.core.series.Series
In [13]:
# What is the daily social media time (in hours) of user with ID 3?
s.loc[3]
Out[13]:
6.78
In [14]:
# What is the daily social media time (in hours) of the user in row 3 (zero-indexed)?
s.iloc[3]
Out[14]:
7.72
  • Indeed, we can verify that this is the case!
In [15]:
s.head(4)
Out[15]:
User ID
3     6.78
8     4.20
24    0.78
25    7.72
Name: Daily Social Media Time (hrs), dtype: float64
  • Notice that loc allows us to directly extract information about a user if we know their user ID, which is our DataFrame's index.
  • If we don't know their user ID, we can instead sort the values in our series and use iloc!
  • In general, because iloc relies on hard-coded integers to access values, it is most useful to us when we know exactly where certain values lie in our data.
  • So... always use it after sorting!
In [16]:
# What is the highest social media usage time?
s.sort_values(ascending=False).iloc[0]
Out[16]:
8.0
  • We use iloc instead of loc here because we don't know the User ID of the user with the highest social media time.

Using loc and iloc on a DataFrame¶

  • Recall that you can provide a sequence (list, array, Series) as either argument to loc:
In [17]:
# The first argument is the row label, i.e. the index value.
#        ↓
df.loc[3, 'Daily Social Media Time (hrs)']
#                  ↑
# The second argument is the column label.
Out[17]:
6.78
In [18]:
df.loc[[3, 8, 24, 25], ['Primary Platform', 'Daily Social Media Time (hrs)']]
Out[18]:
Primary Platform Daily Social Media Time (hrs)
User ID
3 Facebook 6.78
8 Facebook 4.20
24 Facebook 0.78
25 YouTube 7.72
  • With iloc, you can provide a range of values for both the row and column.
In [19]:
df.iloc[:6, :8]
Out[19]:
Age Gender Country Daily Social Media Time (hrs) Daily Entertainment Time (hrs) Social Media Platforms Used Primary Platform Daily Messaging Time (hrs)
User ID
3 51 Female USA 6.78 1.77 4 Facebook 2.09
8 58 Female USA 4.20 2.77 4 Facebook 3.54
24 32 Female USA 0.78 0.55 2 Facebook 0.57
25 36 Female USA 7.72 5.83 4 YouTube 1.37
32 21 Male USA 0.62 7.94 1 YouTube 1.85
45 26 Female USA 7.45 9.85 5 Facebook 4.14
In [20]:
# Find the age of user with highest daily messaging time.
# Our DataFrame has a column called 'Daily Messaging Time (hrs)' and 'Age'.
df.sort_values('Daily Messaging Time (hrs)', ascending=False)['Age'].iloc[0]
Out[20]:
31

Querying 🔍¶

  • We use querying to extract rows in our DataFrame that satisfies certain *conditions*.
  • Boolean statements are particularly useful!
In [21]:
df['Primary Platform'] == 'TikTok'
Out[21]:
User ID
3         False
8         False
24        False
          ...  
299988    False
299990    False
299999    False
Name: Primary Platform, Length: 50145, dtype: bool
  • Combining these with loc allows us to select the rows we're looking for.
In [22]:
df.loc[df['Primary Platform'] == 'TikTok']
Out[22]:
Age Gender Country Daily Social Media Time (hrs) ... Digital Well-being Awareness Sleep Quality (scale 1-10) Social Isolation Feeling (scale 1-10) Monthly Expenditure on Entertainment (USD)
User ID
47 29 Male USA 0.92 ... Low 5 9 452.78
48 40 Other USA 7.27 ... Low 6 6 412.09
52 53 Male USA 6.90 ... High 5 7 188.51
... ... ... ... ... ... ... ... ... ...
299906 37 Other USA 5.77 ... High 2 8 481.62
299934 52 Male USA 2.19 ... High 6 1 42.33
299980 35 Female USA 5.44 ... High 5 7 324.45

10127 rows × 39 columns

  • You can also include multiple conditions to query for! Use Boolean operators & or | to separate them.
In [23]:
df.loc[(df['Primary Platform'] == 'TikTok') & (df['Daily Social Media Time (hrs)'] > 3.0)]
Out[23]:
Age Gender Country Daily Social Media Time (hrs) ... Digital Well-being Awareness Sleep Quality (scale 1-10) Social Isolation Feeling (scale 1-10) Monthly Expenditure on Entertainment (USD)
User ID
48 40 Other USA 7.27 ... Low 6 6 412.09
52 53 Male USA 6.90 ... High 5 7 188.51
55 44 Female USA 3.99 ... Low 9 8 274.86
... ... ... ... ... ... ... ... ... ...
299844 56 Male USA 4.37 ... High 2 1 359.47
299906 37 Other USA 5.77 ... High 2 8 481.62
299980 35 Female USA 5.44 ... High 5 7 324.45

6738 rows × 39 columns

In [24]:
# How many users use TikTok as their primary platform?
df.loc[(df['Primary Platform'] == 'TikTok')].shape[0]
Out[24]:
10127
In [25]:
# Among users who use TikTok as their primary platform, what is the highest Daily Social Media Time?
df.loc[df['Primary Platform'] == 'TikTok', 'Daily Social Media Time (hrs)'].sort_values(ascending=False).iloc[0]
Out[25]:
8.0
In [26]:
# Which gender prefers to use Spotify the most?
# Our DataFrame has a column called 'Preferred Entertainment Platform'.
df.loc[df['Preferred Entertainment Platform'] == 'Spotify', 'Gender'].value_counts().idxmax()
Out[26]:
'Female'