In [1]:
from lec_utils import *

Discussion Slides: Visualization, Imputation, and Web Scraping

Agenda 📆

  • Web Scraping using BeautifulSoup.
  • Worksheet 📝.
    • Visualizing Data
    • Imputing Missing Values

Example: Scraping the Happening @ Michigan page



  • Our goal in today's discussion is to create a DataFrame with one row of information per event listed at events.umich.edu.
In [2]:
res = requests.get('https://events.umich.edu')
res
Out[2]:
<Response [200]>
In [3]:
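# Name the parser explicitly so the result doesn't depend on which parsers are installed.
soup = BeautifulSoup(res.text, 'html.parser')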
  • Let's start by opening the page in Chrome, right-clicking on the page, and clicking "Inspect".
    As we can see, the HTML is relatively complicated, which is usually the case for real websites!

Identifying <div>s

  • It's not easy to identify which <div>s we want. The Inspect tool makes this easier, but it's still a good idea to verify that find_all finds the number of elements we expect.
In [4]:
divs = soup.find_all(class_='col-xs-12')
In [5]:
len(divs)
Out[5]:
9
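  • Why might the count differ from the number of events we see on the page? Here is a minimal sketch on a made-up HTML snippet (the snippet, demo_soup, matches, and with_titles are all hypothetical names, not part of the real page). It suggests that class_ matches any element whose class list contains 'col-xs-12', so layout <div>s that aren't events can be counted too.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML, loosely modeled on the page's grid layout.
html = """
<div class="col-xs-12 col-sm-4 flex no-pad">
  <div class="event-title"><h3><a href="/event/1" title="Event One">Event One</a></h3></div>
</div>
<div class="col-xs-12 col-sm-4 flex no-pad">
  <div class="event-title"><h3><a href="/event/2" title="Event Two">Event Two</a></h3></div>
</div>
<div class="col-xs-12 footer-links">
  <p>Not an event at all.</p>
</div>
"""

demo_soup = BeautifulSoup(html, 'html.parser')

# class_ matches every element that has 'col-xs-12' among its classes.
matches = demo_soup.find_all(class_='col-xs-12')
print(len(matches))

# Counting the matches that actually contain an event title exposes the mismatch.
with_titles = [d for d in matches if d.find('div', class_='event-title') is not None]
print(len(with_titles))
```

    Comparing the two counts is a quick sanity check before parsing every element.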
  • Again, let's deal with one <div> at a time. First, we should extract the title of the event.
In [6]:
divs[0]
Out[6]:
<div class="col-xs-12 col-sm-4 col-md-4 col-lg-2 flex no-pad">
<div class="event-listing-grid event-single">
<time class="time-banner" datetime="2025-05-03 11:00"><i class="fa fa-clock-o"></i> May 03, 2025 11:00am</time>
<div class="list-image">
<a alt="Justin Roberts" href="/event/133745" style="background:url(https://events.umich.edu/media/cache/event_list_2x/media/attachments/2025/03/event_133745_original-1.jpg) center center no-repeat; background-size:cover; position:absolute; width:100%;height:100%; top:0px;left:0px;" title="Justin Roberts">
</a>
</div>
<div class="event-info">
<div class="event-title"><h3>
<a href="/event/133745" title="Justin Roberts &amp; The Not Ready for Naptime Players">
    Justin Roberts &amp; The Not Ready for Naptime...
    </a></h3>
<h4>Presented by The Ark.</h4>
</div>
<ul class="event-details">
<li class="item">
<a href="/list?filter=locations:1" title="GA - The Ark"><i class="fa fa-location-arrow fa-fw"></i><span> GA - The Ark</span></a>
</li>
<li class="item"><a href="/group/1053" title="Michigan Union Ticket Office (MUTO)"><i class="fa fa-group fa-fw"></i><span>
        Michigan Union Ticket...
    </span></a></li>
<li class="item"><a href="/list?filter=alltypes:15"><i class="fa fa-list fa-fw"></i><span> Performance </span></a></li>
<li class="item"><a href="https://mutotix.umich.edu/5580/5581">
<i class="fa fa-ticket fa-fw"></i>
<span>Purchase tickets here!</span>
</a></li>
<li class="item"><a href="https://theark.org/support-the-ark/">
<i class="fa fa-link fa-fw"></i>
<span>Support The Ark!</span>
</a></li>
<li class="item"><a href="https://www.justinrobertsmusic.com/">
<i class="fa fa-link fa-fw"></i>
<span>Justin Roberts</span>
</a></li>
<li class="item"><a href="https://www.youtube.com/watch?v=7K09PrjXwbU">
<i class="fa fa-youtube fa-fw"></i>
<span>https://www.youtube.com/watch?v=7K09PrjXwbU</span>
</a></li>
</ul>
<!--
    <p>
    “Among the best craftsmen of sweet and silly kid tunes out there, making irresistible music out of small, well-observed moments from the...
    (
        2025-05-03 11:00am
    )
    </p>
-->
</div>
</div>
</div>
In [7]:
divs[0].find('div', class_='event-title').find('a').get('title')
Out[7]:
'Justin Roberts & The Not Ready for Naptime Players'
  • We can extract the time and location in the same way.
In [8]:
divs[0].find('time').get('datetime')
Out[8]:
'2025-05-03 11:00'
In [9]:
divs[0].find('ul').find('a').get('title')
Out[9]:
'GA - The Ark'
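  • Why use .get('title') rather than the link's text? A small sketch, using a trimmed copy of the first event's HTML shown above (demo_div, link, and the data-missing attribute are hypothetical names used only for illustration): the visible text is truncated by the site, while the title attribute holds the full name.

```python
from bs4 import BeautifulSoup

# A trimmed copy of the first event's HTML from above.
html = """
<div class="event-listing-grid event-single">
<time class="time-banner" datetime="2025-05-03 11:00"> May 03, 2025 11:00am</time>
<div class="event-title"><h3>
<a href="/event/133745" title="Justin Roberts &amp; The Not Ready for Naptime Players">
    Justin Roberts &amp; The Not Ready for Naptime...
    </a></h3>
</div>
</div>
"""

demo_div = BeautifulSoup(html, 'html.parser')
link = demo_div.find('div', class_='event-title').find('a')

full_title = link.get('title')      # Attribute value, with &amp; decoded to &.
shown_text = link.text.strip()      # Visible text, which the site truncates.
missing = link.get('data-missing')  # .get returns None when the attribute is absent.

print(full_title)
print(shown_text)
print(missing)
```

    Note that link['data-missing'] would raise a KeyError instead of returning None, which is one reason to prefer .get when an attribute may be absent.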

Parsing a single event, and then every event

  • As before, we'll implement a function that takes in a BeautifulSoup object corresponding to a single <div> and returns a dictionary with the relevant information about that event.
In [10]:
def process_event(div):
    title = div.find('div', class_='event-title').find('a').get('title')
    location = div.find('ul').find('a').get('title')
    # Parse the string into a Timestamp so that times sort and compare correctly.
    time = pd.to_datetime(div.find('time').get('datetime'))
    return {'title': title, 'time': time, 'location': location}
In [11]:
process_event(divs[0])
Out[11]:
{'title': 'Justin Roberts & The Not Ready for Naptime Players',
 'time': Timestamp('2025-05-03 11:00:00'),
 'location': 'GA - The Ark'}
  • Now, we can call it on every <div> in divs.
    Remember, we already ran divs = soup.find_all(class_='col-xs-12').
In [12]:
row_list = []
for div in divs:
    try:
        row_list.append(process_event(div))
    except Exception as e:
        # Not every matching <div> is an event; report the failure and move on.
        print(e)
'NoneType' object has no attribute 'find'
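  • The error message above appears because find returns None when nothing matches, and None has no find method. As an alternative to catching every exception, here is a hedged sketch of a parser that checks for None explicitly. safe_process_event and the sample HTML are hypothetical, and the time is left as a string here to keep the sketch free of other dependencies.

```python
from bs4 import BeautifulSoup

def safe_process_event(div):
    """Return a dict for an event <div>, or None if an expected element is missing."""
    title_div = div.find('div', class_='event-title')
    details = div.find('ul')
    time_tag = div.find('time')
    if title_div is None or details is None or time_tag is None:
        return None
    title_link = title_div.find('a')
    location_link = details.find('a')
    if title_link is None or location_link is None:
        return None
    return {
        'title': title_link.get('title'),
        'time': time_tag.get('datetime'),  # Left as a string in this sketch.
        'location': location_link.get('title'),
    }

# Hypothetical HTML: one event-like <div> and one <div> that is not an event.
html = """
<div class="col-xs-12">
  <time datetime="2025-05-03 11:00">May 03, 2025 11:00am</time>
  <div class="event-title"><h3><a href="/event/1" title="Event One">Event One</a></h3></div>
  <ul class="event-details"><li><a href="/list" title="GA - The Ark"><span> GA - The Ark</span></a></li></ul>
</div>
<div class="col-xs-12"><p>No event here.</p></div>
"""

demo_divs = BeautifulSoup(html, 'html.parser').find_all(class_='col-xs-12')

# Keep only the <div>s that parsed successfully.
rows = [row for row in map(safe_process_event, demo_divs) if row is not None]
print(rows)
```

    The trade-off: explicit checks are more verbose than try/except, but they won't silently hide unrelated bugs such as a typo in a variable name.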
In [13]:
events = pd.DataFrame(row_list)
events.head()
Out[13]:
   title                                              time                 location
0  Justin Roberts & The Not Ready for Naptime Pla...  2025-05-03 11:00:00  GA - The Ark
1  Read and Look | The Water Princess                 2025-05-03 11:15:00  Kelsey Museum of Archaeology
2  Women's Tennis vs Arizona State                    2025-05-03 13:00:00  Varsity Tennis Bldg
3  Screening: 2025 IP Exhibition                      2025-05-03 14:00:00  Art &amp; Architecture Building, 2000 Bonistee...
4  Men's Lacrosse vs Big Ten Championship             2025-05-03 19:00:00  U-M Lacrosse Stadium
  • Now, events is a DataFrame, like any other!
In [14]:
# Which events are in-person today?
events[~events['location'].isin(['Virtual', ''])]
Out[14]:
     title                                              time                 location
0    Justin Roberts & The Not Ready for Naptime Pla...  2025-05-03 11:00:00  GA - The Ark
1    Read and Look | The Water Princess                 2025-05-03 11:15:00  Kelsey Museum of Archaeology
2    Women's Tennis vs Arizona State                    2025-05-03 13:00:00  Varsity Tennis Bldg
...  ...                                                ...                  ...
5    Men's Lacrosse vs Big Ten Championship             2025-05-03 20:00:00  U-M Lacrosse Stadium
6    Men's Lacrosse vs Big Ten Championship             2025-05-03 20:00:00  U-M Lacrosse Stadium
7    The Wildwoods                                      2025-05-03 20:00:00  ARK Reserved

8 rows × 3 columns
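  • Since the result is an ordinary DataFrame, the usual pandas tools apply. Below is a small sketch using a few rows copied from the output above (demo, unique_events, evening, and per_location are hypothetical names, and the 5pm cutoff is an arbitrary choice for illustration).

```python
import pandas as pd

# A few rows copied from the output above.
demo = pd.DataFrame({
    'title': ['Read and Look | The Water Princess',
              "Women's Tennis vs Arizona State",
              "Men's Lacrosse vs Big Ten Championship",
              "Men's Lacrosse vs Big Ten Championship",
              'The Wildwoods'],
    'time': pd.to_datetime(['2025-05-03 11:15', '2025-05-03 13:00',
                            '2025-05-03 20:00', '2025-05-03 20:00',
                            '2025-05-03 20:00']),
    'location': ['Kelsey Museum of Archaeology', 'Varsity Tennis Bldg',
                 'U-M Lacrosse Stadium', 'U-M Lacrosse Stadium', 'ARK Reserved'],
})

# Drop listings that repeat the same title, time, and location.
unique_events = demo.drop_duplicates()

# Events starting at or after 5pm; .dt.hour works because 'time' holds Timestamps.
evening = unique_events[unique_events['time'].dt.hour >= 17]

# Number of events at each location.
per_location = unique_events['location'].value_counts()

print(unique_events)
print(evening)
print(per_location)
```

    The .dt accessor is only available because the times were parsed with pd.to_datetime; with plain strings, the hour filter would not work.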