In [1]:
from lec_utils import *
Discussion Slides: Visualization, Imputation, and Web Scraping
Agenda 📆
- Web Scraping using BeautifulSoup.
    - Worksheet 📝.
- Visualizing Data
- Imputing Missing Values
Example: Scraping the Happening @ Michigan page
- Our goal in today's discussion lecture is to create a DataFrame with the information about each event at events.umich.edu.
In [2]:
res = requests.get('https://events.umich.edu')
res
Out[2]:
<Response [200]>
In [3]:
soup = BeautifulSoup(res.text, 'html.parser')  # Name the parser explicitly.
- Let's start by opening the page in Chrome, right-clicking on the page, and clicking "Inspect".
As we can see, the HTML is relatively complicated – this is usually the case for real websites!
Identifying <div>s
- It's not easy identifying which <div>s we want. The Inspect tool makes this easier, but it's good to verify that find_all is finding the right number of elements.
In [4]:
divs = soup.find_all(class_='col-xs-12')
In [5]:
len(divs)
Out[5]:
9
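To see why the count is worth checking, here is a small sketch on made-up HTML (the markup and the names html, soup_demo, and event_like are illustrative, not taken from the events page). Passing class_ to find_all matches any tag whose class list contains that class, so a layout container that shares the class gets counted alongside the event cards.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: two tags carry 'col-xs-12', but only one is an event card.
html = """
<div class="col-xs-12 col-sm-4 flex"><div class="event-title"><a title="Event A">Event A</a></div></div>
<div class="col-xs-12 footer">Not an event</div>
<div class="sidebar">Other</div>
"""
soup_demo = BeautifulSoup(html, "html.parser")

# class_ matches any tag that has 'col-xs-12' among its classes.
matches = soup_demo.find_all(class_="col-xs-12")
print(len(matches))

# Keep only the matches that actually contain an event title.
event_like = [d for d in matches if d.find("div", class_="event-title") is not None]
print(len(event_like))
```

If the two counts differ, some of the matched tags are probably not events, which is worth knowing before parsing each one.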
- Again, let's deal with one <div> at a time. First, we should extract the title of the event.
In [6]:
divs[0]
Out[6]:
<div class="col-xs-12 col-sm-4 col-md-4 col-lg-2 flex no-pad"> <div class="event-listing-grid event-single"> <time class="time-banner" datetime="2025-05-03 11:00"><i class="fa fa-clock-o"></i> May 03, 2025 11:00am</time> <div class="list-image"> <a alt="Justin Roberts" href="/event/133745" style="background:url(https://events.umich.edu/media/cache/event_list_2x/media/attachments/2025/03/event_133745_original-1.jpg) center center no-repeat; background-size:cover; position:absolute; width:100%;height:100%; top:0px;left:0px;" title="Justin Roberts"> </a> </div> <div class="event-info"> <div class="event-title"><h3> <a href="/event/133745" title="Justin Roberts & The Not Ready for Naptime Players"> Justin Roberts & The Not Ready for Naptime... </a></h3> <h4>Presented by The Ark.</h4> </div> <ul class="event-details"> <li class="item"> <a href="/list?filter=locations:1" title="GA - The Ark"><i class="fa fa-location-arrow fa-fw"></i><span> GA - The Ark</span></a> </li> <li class="item"><a href="/group/1053" title="Michigan Union Ticket Office (MUTO)"><i class="fa fa-group fa-fw"></i><span> Michigan Union Ticket... 
</span></a></li> <li class="item"><a href="/list?filter=alltypes:15"><i class="fa fa-list fa-fw"></i><span> Performance </span></a></li> <li class="item"><a href="https://mutotix.umich.edu/5580/5581"> <i class="fa fa-ticket fa-fw"></i> <span>Purchase tickets here!</span> </a></li> <li class="item"><a href="https://theark.org/support-the-ark/"> <i class="fa fa-link fa-fw"></i> <span>Support The Ark!</span> </a></li> <li class="item"><a href="https://www.justinrobertsmusic.com/"> <i class="fa fa-link fa-fw"></i> <span>Justin Roberts</span> </a></li> <li class="item"><a href="https://www.youtube.com/watch?v=7K09PrjXwbU"> <i class="fa fa-youtube fa-fw"></i> <span>https://www.youtube.com/watch?v=7K09PrjXwbU</span> </a></li> </ul> <!-- <p> “Among the best craftsmen of sweet and silly kid tunes out there, making irresistible music out of small, well-observed moments from the... ( 2025-05-03 11:00am ) </p> --> </div> </div> </div>
In [7]:
divs[0].find('div', class_='event-title').find('a').get('title')
Out[7]:
'Justin Roberts & The Not Ready for Naptime Players'
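The cell above reads the title attribute rather than the link's visible text. A minimal sketch of the difference, using a trimmed copy of the <a> tag from the output above (the variable names are illustrative): the visible text is truncated on the page, while the attribute holds the full title.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the event-title markup shown in Out[6].
snippet = (
    '<div class="event-title"><h3>'
    '<a href="/event/133745" title="Justin Roberts &amp; The Not Ready for Naptime Players">'
    ' Justin Roberts &amp; The Not Ready for Naptime... </a></h3></div>'
)
link = BeautifulSoup(snippet, 'html.parser').find('div', class_='event-title').find('a')

visible_text = link.text.strip()   # what the page displays (truncated)
full_title = link.get('title')     # the value of the title attribute
missing = link.get('data-missing') # .get returns None when the attribute is absent

print(visible_text)
print(full_title)
print(missing)
```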
- We can extract the time and location the same way.
In [8]:
divs[0].find('time').get('datetime')
Out[8]:
'2025-05-03 11:00'
In [9]:
divs[0].find('ul').find('a').get('title')
Out[9]:
'GA - The Ark'
Parsing a single event, and then every event
- As before, we'll implement a function that takes in a BeautifulSoup object corresponding to a single <div> and returns a dictionary with the relevant information about that event.
In [10]:
def process_event(div):
    title = div.find('div', class_='event-title').find('a').get('title')
    location = div.find('ul').find('a').get('title')
    time = pd.to_datetime(div.find('time').get('datetime')) # Good idea!
    return {'title': title, 'time': time, 'location': location}
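As a brief illustration of why converting the datetime string is worthwhile, the sketch below parses the string from Out[8] and uses a few Timestamp features that a plain string does not offer. The names raw and ts are illustrative.

```python
import pandas as pd

raw = '2025-05-03 11:00'   # the string stored in the <time> tag's datetime attribute
ts = pd.to_datetime(raw)   # a pandas Timestamp

print(ts.hour)                                # hour of day as an integer
print(ts.day_name())                          # weekday name
print(ts < pd.Timestamp('2025-05-03 12:00'))  # chronological comparison
```

Once the column holds Timestamps, the same operations are available on a whole DataFrame column through the .dt accessor.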
In [11]:
process_event(divs[0])
Out[11]:
{'title': 'Justin Roberts & The Not Ready for Naptime Players', 'time': Timestamp('2025-05-03 11:00:00'), 'location': 'GA - The Ark'}
- Now, we can call it on every <div> in divs.
Remember, we already ran divs = soup.find_all(class_='col-xs-12').
In [12]:
row_list = []
for div in divs:
    try:
        row_list.append(process_event(div))
    except Exception as e:
        print(e)
'NoneType' object has no attribute 'find'
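The message printed above is what Python reports when find returns None (no matching tag) and another .find is chained onto it. A minimal sketch of a more defensive variant follows. The function safe_process_event and the demo markup are hypothetical, not part of the lecture code, and the time is left as a string to keep the example short.

```python
from bs4 import BeautifulSoup

def safe_process_event(div):
    """Hypothetical variant of process_event: returns None for non-event tags."""
    title_div = div.find('div', class_='event-title')
    if title_div is None:          # this tag is not an event card
        return None
    link = title_div.find('a')
    time_tag = div.find('time')
    loc_list = div.find('ul')
    loc_link = loc_list.find('a') if loc_list is not None else None
    return {
        'title': link.get('title') if link is not None else None,
        'time': time_tag.get('datetime') if time_tag is not None else None,
        'location': loc_link.get('title') if loc_link is not None else None,
    }

# Made-up markup: one event card and one layout container sharing the class.
demo_html = """
<div class="col-xs-12"><time datetime="2025-05-03 13:00"></time>
  <div class="event-title"><a title="Demo Event">Demo Event</a></div>
  <ul><li><a title="Demo Hall">Demo Hall</a></li></ul></div>
<div class="col-xs-12">Just a layout container</div>
"""
demo_divs = BeautifulSoup(demo_html, 'html.parser').find_all(class_='col-xs-12')

# Skip the tags that are not events instead of catching the exception.
rows = [r for r in map(safe_process_event, demo_divs) if r is not None]
print(rows)
```

Checking for None explicitly makes it clear which tags were skipped and why, whereas a broad except can also hide unrelated bugs.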
In [13]:
events = pd.DataFrame(row_list)
events.head()
Out[13]:
| | title | time | location |
|---|---|---|---|
| 0 | Justin Roberts & The Not Ready for Naptime Pla... | 2025-05-03 11:00:00 | GA - The Ark |
| 1 | Read and Look \| The Water Princess | 2025-05-03 11:15:00 | Kelsey Museum of Archaeology |
| 2 | Women's Tennis vs Arizona State | 2025-05-03 13:00:00 | Varsity Tennis Bldg |
| 3 | Screening: 2025 IP Exhibition | 2025-05-03 14:00:00 | Art & Architecture Building, 2000 Bonistee... |
| 4 | Men's Lacrosse vs Big Ten Championship | 2025-05-03 19:00:00 | U-M Lacrosse Stadium |
- Now, events is a DataFrame, like any other!
In [14]:
# Which events are in-person today?
events[~events['location'].isin(['Virtual', ''])]
Out[14]:
| | title | time | location |
|---|---|---|---|
| 0 | Justin Roberts & The Not Ready for Naptime Pla... | 2025-05-03 11:00:00 | GA - The Ark |
| 1 | Read and Look \| The Water Princess | 2025-05-03 11:15:00 | Kelsey Museum of Archaeology |
| 2 | Women's Tennis vs Arizona State | 2025-05-03 13:00:00 | Varsity Tennis Bldg |
| ... | ... | ... | ... |
| 5 | Men's Lacrosse vs Big Ten Championship | 2025-05-03 20:00:00 | U-M Lacrosse Stadium |
| 6 | Men's Lacrosse vs Big Ten Championship | 2025-05-03 20:00:00 | U-M Lacrosse Stadium |
| 7 | The Wildwoods | 2025-05-03 20:00:00 | ARK Reserved |

8 rows × 3 columns
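The filter above relies on ~ to negate the boolean mask produced by isin. Here is a small self-contained sketch of the same pattern, plus a time-based filter. The DataFrame demo is illustrative: two rows reuse values from the output above, and the 'Online Info Session' row is made up so the filter has something to exclude.

```python
import pandas as pd

demo = pd.DataFrame({
    'title': ["Women's Tennis vs Arizona State", 'The Wildwoods', 'Online Info Session'],
    'time': pd.to_datetime(['2025-05-03 13:00', '2025-05-03 20:00', '2025-05-03 17:00']),
    'location': ['Varsity Tennis Bldg', 'ARK Reserved', 'Virtual'],  # last row is hypothetical
})

# isin gives True for 'Virtual' or empty locations; ~ flips the mask.
in_person = demo[~demo['location'].isin(['Virtual', ''])]

# Because 'time' holds Timestamps, .dt exposes the hour for filtering.
evening = demo[demo['time'].dt.hour >= 17].sort_values('time')

print(in_person['title'].tolist())
print(evening['title'].tolist())
```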