Web Scraping



The problems in this worksheet are taken from past exams in similar classes. Work on them on paper, since the exams you take in this course will also be on paper.

We encourage you to complete this worksheet in a live discussion section. Solutions will be made available after all discussion sections have concluded. You don’t need to submit your answers anywhere.

Note: We do not plan to cover all problems here in the live discussion section; the problems we don’t cover can be used for extra practice.


Problem 1

For this problem, consider the HTML document shown below:

<html>
    <head>
        <title>Data Science Courses</title>
    </head>
    
    <body>
        <h1>Welcome to the World of Data Science!</h1>
        
        <h2>Current Courses</h2>
        
        <div class="course_list">
        
            <img alt="Course Banner" src="courses.png">
            <p>
                Here are the courses available to take:
            </p>
            <ul>
                <li>Machine Learning</li>
                <li>Design of Experiments</li>
                <li>Driving Business Value with DS</li>
            </ul> 

            <p>
                For last quarter's classes, see <a href="./2021-sp.html">here</a>.
            </p>
            
        </div>
        
        <h2>News</h2>
        
        <div class="news">
            <p class="news">
                New course on <b>Visualization</b> is launched.
                See <a href="https://http.cat/301.png" target="_blank">here</a>
            </p>
        </div>
        
    </body>
</html>


Problem 1.1

How many children does the div node with class course_list contain in the Document Object Model (DOM)?

Answer: 4 children

Looking at the HTML, we can see that the div with class course_list has 4 children: an img node, a p node, a ul node, and another p node.
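One way to check this count is to ask BeautifulSoup for the direct child tags of the div. The sketch below assumes Python's built-in "html.parser" and a trimmed copy of the div. It counts element children only, since the whitespace between tags also appears as text nodes in .children.

```python
import bs4

# Trimmed copy of the course_list div from the document above.
html = """
<div class="course_list">
    <img alt="Course Banner" src="courses.png">
    <p>Here are the courses available to take:</p>
    <ul>
        <li>Machine Learning</li>
        <li>Design of Experiments</li>
        <li>Driving Business Value with DS</li>
    </ul>
    <p>For last quarter's classes, see <a href="./2021-sp.html">here</a>.</p>
</div>
"""
doc = bs4.BeautifulSoup(html, "html.parser")
div = doc.find("div", attrs={"class": "course_list"})

# recursive=False restricts the search to direct children of the div,
# so the li and a tags nested further down are not included.
child_names = [child.name for child in div.find_all(True, recursive=False)]
print(child_names)       # ['img', 'p', 'ul', 'p']
print(len(child_names))  # 4
```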


Problem 1.2

Suppose the HTML document has been parsed using doc = bs4.BeautifulSoup(html). Write a line of code to get the h1 header text in the form of a string.

Answer: doc.find('h1').text

Since there’s only one h1 element in the HTML, doc.find('h1') returns exactly that element. Adding .text then gives the text of the h1 element as a string.


Problem 1.3

Suppose the HTML document has been parsed using doc = bs4.BeautifulSoup(html). Write a piece of code that scrapes the course names from the HTML. The value returned by your code should be a list of strings.

Answer: [x.text for x in doc.find_all('li')]

doc.find_all('li') finds all li elements and returns them in the form of a list. A list comprehension that takes the .text of each li element then yields the desired list of strings.
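The answers to Problems 1.2 and 1.3 can be tried on a trimmed copy of the document. This is a minimal sketch that assumes the built-in "html.parser"; the variable names header_text and course_names are only illustrative.

```python
import bs4

# Trimmed copy of the document: only the h1 and the course list are kept.
html = """
<html><body>
<h1>Welcome to the World of Data Science!</h1>
<ul>
    <li>Machine Learning</li>
    <li>Design of Experiments</li>
    <li>Driving Business Value with DS</li>
</ul>
</body></html>
"""
doc = bs4.BeautifulSoup(html, "html.parser")

# Problem 1.2: find returns the first (here, the only) h1 tag.
header_text = doc.find("h1").text

# Problem 1.3: find_all returns every li tag in document order.
course_names = [x.text for x in doc.find_all("li")]

print(header_text)
print(course_names)
```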


Problem 1.4

There are two links in the document. Which of the following will return the URL of the link contained in the div with class news? Mark all that apply.

Answer: Option A and Option D

  • Option A: This option works because doc.find_all('a') returns a list of all the a elements in the order they appear in the HTML document. Since the a inside the div with class news is the second a element in the document, we use [1] to select it (as we would in any other list). Finally, we get the URL from the element's 'href' attribute using .attrs['href'].
  • Option B: This does not work because .find will only find the first instance of a, which is not the one we’re looking for.
  • Option C: This does not work because there are no quotation marks around the a.
  • Option D: This option works because doc.find('div', attrs={'class': 'news'}) first finds the div element with class='news'; we then find the a element within that div and get its href attribute, which is what we want.
  • Option E: This does not work because there is no href element in the HTML document; href is an attribute of the a tag, not a tag of its own.
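The answer choices themselves are not reproduced here, so the expressions below are reconstructed from the explanations above and should be read as a sketch rather than the exact exam code. They run against a trimmed copy of the document, parsed with the built-in "html.parser".

```python
import bs4

# Trimmed copy of the document that keeps both links.
html = """
<div class="course_list">
    <p>For last quarter's classes, see <a href="./2021-sp.html">here</a>.</p>
</div>
<div class="news">
    <p class="news">
        New course on <b>Visualization</b> is launched.
        See <a href="https://http.cat/301.png" target="_blank">here</a>
    </p>
</div>
"""
doc = bs4.BeautifulSoup(html, "html.parser")

# Approach described for Option A: pick the second a tag by position.
by_position = doc.find_all("a")[1].attrs["href"]

# Approach described for Option D: narrow to the news div first.
by_container = doc.find("div", attrs={"class": "news"}).find("a").get("href")

# Approach described for Option B: find alone returns the first link.
first_link = doc.find("a").get("href")

print(by_position)   # https://http.cat/301.png
print(by_container)  # https://http.cat/301.png
print(first_link)    # ./2021-sp.html
```

Selecting by container is usually the more robust of the two working approaches, since it keeps working if another link is added earlier in the page.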


Problem 1.5

What is the purpose of the alt attribute in the img tag?

Answer: Option D

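The alt attribute supplies alternative text for the image: it is shown in place of the image when the image cannot be displayed, and it is read aloud by screen readers.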


Problem 1.6

You are scraping a web page using the requests module. Your code works fine and returns the desired result, but suddenly you find that when you run your code it starts but never finishes – it does not raise an error or return anything. What is the most likely cause of the issue?

Answer: Option B

  • Option A: We can confidently rule this option out, since whether or not a GIF has stopped playing shouldn’t affect our web scraping.
  • Option B: This answer is right because a server will time you out (and potentially block you) if you make too many requests to the server.
  • Option C: This shouldn’t cause your code to never finish; it’s more likely that the requests module just doesn’t process said Unicode character correctly or throws an error.
  • Option D: Again, this shouldn’t cause your code to never finish; the requests module will just return the version of the website that existed at the time you made the request.
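A common guard against a request that never finishes is to pass a timeout, since requests.get waits indefinitely by default when no timeout is given. The helper below is a hedged sketch: the name fetch_html and the 5-second default are assumptions made for illustration, not part of the original problem.

```python
import requests


def fetch_html(url, timeout_seconds=5.0):
    """Return the response text, or None if the request fails or times out."""
    try:
        # Without timeout=, requests can wait on a silent server forever.
        response = requests.get(url, timeout=timeout_seconds)
    except requests.exceptions.RequestException as error:
        # Timeouts, connection errors, and malformed URLs all subclass
        # RequestException, so one except clause covers them.
        print(f"Request failed: {error!r}")
        return None
    return response.text
```

When scraping many pages, pausing between calls (for example with time.sleep) also reduces the chance of being throttled in the first place.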



Problem 2

After taking the SAT, Nicole wants to check the College Board’s website to see her score. However, the College Board recently updated their website to use non-standard HTML tags and Nicole’s browser can’t render it correctly. As such, she resorts to making a GET request to the site with her scores on it to get back the source HTML and tries to parse it with BeautifulSoup.

Suppose soup is a BeautifulSoup object instantiated using the following HTML document.

<college>Your score is ready!</college>

<sat verbal="ready" math="ready">
  Your percentiles are as follows:
  <scorelist listtype="percentiles">
    <scorerow kind="verbal" subkind="per">
      Verbal: <scorenum>84</scorenum>
    </scorerow>
    <scorerow kind="math" subkind="per">
      Math: <scorenum>99</scorenum>
    </scorerow>
  </scorelist>
  And your actual scores are as follows:
  <scorelist listtype="scores">
    <scorerow kind="verbal"> Verbal: <scorenum>680</scorenum> </scorerow>
    <scorerow kind="math"> Math: <scorenum>800</scorenum> </scorerow>
  </scorelist>
</sat>


Problem 2.1

Which of the following expressions evaluate to "verbal"? Select all that apply.

Answer: Option 1, Option 3, Option 4

Correct options:

  • Option 1 finds the first <scorerow> element and retrieves its "kind" attribute, which is "verbal" for the first <scorerow> encountered in the HTML document.
  • Option 3 finds the first <scorerow> tag, retrieves its text ("Verbal: 84"), splits this text by “:”, and takes the first element of the resulting list ("Verbal"), converting it to lowercase to match "verbal".
  • Option 4 creates a list of "kind" attributes for all <scorerow> elements. The second to last (-2) element in this list corresponds to the "kind" attribute of the first <scorerow> in the second <scorelist> tag, which is also "verbal".

Incorrect options:

  • Option 2 attempts to get an attribute named ready from the <sat> tag, but "ready" is an attribute value there, not an attribute name.
  • Option 5 tries to retrieve a "kind" attribute from a <scorelist> tag, but <scorelist> does not have a "kind" attribute.
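The answer choices are not reproduced here, so the expressions below are reconstructed from the explanations above and are only a sketch of what each option does. The document is parsed with the built-in "html.parser", which keeps the non-standard tag names as they are. The .strip() call is included because the raw text of the tag carries surrounding whitespace.

```python
import bs4

html = """
<college>Your score is ready!</college>
<sat verbal="ready" math="ready">
  Your percentiles are as follows:
  <scorelist listtype="percentiles">
    <scorerow kind="verbal" subkind="per">
      Verbal: <scorenum>84</scorenum>
    </scorerow>
    <scorerow kind="math" subkind="per">
      Math: <scorenum>99</scorenum>
    </scorerow>
  </scorelist>
  And your actual scores are as follows:
  <scorelist listtype="scores">
    <scorerow kind="verbal"> Verbal: <scorenum>680</scorenum> </scorerow>
    <scorerow kind="math"> Math: <scorenum>800</scorenum> </scorerow>
  </scorelist>
</sat>
"""
soup = bs4.BeautifulSoup(html, "html.parser")

# "kind" attribute of the first scorerow.
first_kind = soup.find("scorerow").get("kind")

# Text before the colon in the first scorerow, cleaned up and lowercased.
label = soup.find("scorerow").text.split(":")[0].strip().lower()

# "kind" values of all four scorerows; index -2 is the third one.
third_kind = [row.get("kind") for row in soup.find_all("scorerow")][-2]

# Attribute names that do not exist on the tag: .get returns None.
missing_on_sat = soup.find("sat").get("ready")
missing_on_list = soup.find("scorelist").get("kind")

print(first_kind, label, third_kind)    # verbal verbal verbal
print(missing_on_sat, missing_on_list)  # None None
```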


Problem 2.2

Consider the following function.

def summer(tree):
    if isinstance(tree, list):
        total = 0
        for subtree in tree:
            for s in subtree.find_all("scorenum"):
                total += int(s.text)
        return total
    else:
        return sum([int(s.text) for s in tree.find_all("scorenum")])

For each of the following values, fill in the blanks to assign tree such that summer(tree) evaluates to the desired value. The first example has been done for you.

    tree = soup.find("scorerow")
    tree = soup.find(__a__)
    tree = soup.find(__b__)
    tree = soup.find_all(__c__)

Answer: a: "scorelist", b: "scorelist", attrs={"listtype":"scores"}, c: "scorerow", attrs={"kind":"math"}

Blank (a): soup.find("scorelist") selects the first <scorelist> tag, which includes both the verbal and math percentiles (84 and 99). The function summer(tree) sums these values to get 183.

Blank (b): soup.find("scorelist", attrs={"listtype":"scores"}) selects the <scorelist> tag with listtype="scores", which contains the actual verbal (680) and math (800) scores. The function sums these to get 1480.

Blank (c): soup.find_all("scorerow", attrs={"kind":"math"}) selects all <scorerow> elements with kind="math", capturing both the percentile (99) and the actual score (800). Since tree is now a list, summer(tree) iterates through each <scorerow> in the list, summing their <scorenum> values to reach 899.
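The four assignments can be checked by running summer on the document. The sketch below repeats the function from the problem and assumes the built-in "html.parser". The list branch is taken for find_all because the ResultSet it returns is a subclass of list.

```python
import bs4

html = """
<college>Your score is ready!</college>
<sat verbal="ready" math="ready">
  Your percentiles are as follows:
  <scorelist listtype="percentiles">
    <scorerow kind="verbal" subkind="per">
      Verbal: <scorenum>84</scorenum>
    </scorerow>
    <scorerow kind="math" subkind="per">
      Math: <scorenum>99</scorenum>
    </scorerow>
  </scorelist>
  And your actual scores are as follows:
  <scorelist listtype="scores">
    <scorerow kind="verbal"> Verbal: <scorenum>680</scorenum> </scorerow>
    <scorerow kind="math"> Math: <scorenum>800</scorenum> </scorerow>
  </scorelist>
</sat>
"""
soup = bs4.BeautifulSoup(html, "html.parser")


def summer(tree):
    # A list of tags: add up the scorenum values inside each one.
    if isinstance(tree, list):
        total = 0
        for subtree in tree:
            for s in subtree.find_all("scorenum"):
                total += int(s.text)
        return total
    # A single tag: add up the scorenum values inside it.
    else:
        return sum([int(s.text) for s in tree.find_all("scorenum")])


example = summer(soup.find("scorerow"))
blank_a = summer(soup.find("scorelist"))
blank_b = summer(soup.find("scorelist", attrs={"listtype": "scores"}))
blank_c = summer(soup.find_all("scorerow", attrs={"kind": "math"}))

print(example, blank_a, blank_b, blank_c)  # 84 183 1480 899
```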



Problem 3

Consider the following HTML document, which represents a webpage containing the top few songs with the most streams on Spotify today in Canada.

<head>
    <title>3*Canada-2022-06-04</title>
</head>
<body>
    <h1>Spotify Top 3 - Canada</h1>
    <table>
        <tr class='heading'>
            <th>Rank</th>
            <th>Artist(s)</th> 
            <th>Song</th>
        </tr>
        <tr class=1>
            <td>1</td>
            <td>Harry Styles</td> 
            <td>As It Was</td>
        </tr>
        <tr class=2>
            <td>2</td>
            <td>Jack Harlow</td> 
            <td>First Class</td>
        </tr>
        <tr class=3>
            <td>3</td>
            <td>Kendrick Lamar</td> 
            <td>N95</td>
        </tr>
    </table>
</body>

Suppose we define soup to be a BeautifulSoup object that is instantiated using the document above.


Problem 3.1

How many leaf nodes are there in the DOM tree of the previous document — that is, how many nodes have no children?

Answer: 14

There’s 1 <title>, 1 <h1>, 3 <th>s, and 9 <td>s, adding up to 14.
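This count can be reproduced in code, under the assumption that a leaf is an element containing no other elements (the text inside each tag is not counted as a separate node, matching the solution above). The sketch uses the built-in "html.parser" on a copy of the document with the head element closed.

```python
import bs4

html = """
<head>
    <title>3*Canada-2022-06-04</title>
</head>
<body>
    <h1>Spotify Top 3 - Canada</h1>
    <table>
        <tr class='heading'>
            <th>Rank</th>
            <th>Artist(s)</th>
            <th>Song</th>
        </tr>
        <tr class=1>
            <td>1</td>
            <td>Harry Styles</td>
            <td>As It Was</td>
        </tr>
        <tr class=2>
            <td>2</td>
            <td>Jack Harlow</td>
            <td>First Class</td>
        </tr>
        <tr class=3>
            <td>3</td>
            <td>Kendrick Lamar</td>
            <td>N95</td>
        </tr>
    </table>
</body>
"""
soup = bs4.BeautifulSoup(html, "html.parser")

# find_all(True) returns every tag; tag.find(True) is None exactly when
# the tag has no tag descendants, i.e. it is a leaf element.
leaf_names = [tag.name for tag in soup.find_all(True) if tag.find(True) is None]

print(len(leaf_names))              # 14
print(sorted(set(leaf_names)))      # ['h1', 'td', 'th', 'title']
print(len(soup.find_all("td")))     # 9
```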


Problem 3.2

What does the following line of code evaluate to?

len(soup.find_all("td"))

Answer: 9

As mentioned in the solution to the part above, there are 9 <td> nodes, and soup.find_all finds them all.


Problem 3.3

What does the following line of code evaluate to?

soup.find("tr").get("class")

Answer: ["heading"] or "heading"

soup.find("tr") finds the first occurrence of a <tr> node, and get("class") accesses the value of its "class" attribute.

Note that technically the answer is ["heading"], but "heading" received full credit too.
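The list comes from the fact that BeautifulSoup treats class as a multi-valued attribute, so .get("class") returns a list even when there is only one class. A minimal sketch on a trimmed copy of the table, assuming the built-in "html.parser":

```python
import bs4

# Trimmed copy of the table: the heading row and the first song row.
html = """
<table>
    <tr class='heading'>
        <th>Rank</th>
    </tr>
    <tr class=1>
        <td>1</td>
    </tr>
</table>
"""
soup = bs4.BeautifulSoup(html, "html.parser")

heading_class = soup.find("tr").get("class")
first_row_class = soup.find_all("tr")[1].get("class")

print(heading_class)    # ['heading']
print(first_row_class)  # ['1']
```

Note that the class of the song row comes back as the string '1', not the integer 1, because attribute values are always parsed as text.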


Problem 3.4

Complete the implementation of the function top_nth, which takes in a positive integer n and returns the name of the n-th ranked song in the HTML document. For instance, top_nth(2) should evaluate to "First Class" (n=1 corresponds to the top song).

Note: Your implementation should work in the case that the page contains more than 3 songs.

def top_nth(n):
    return soup.find("tr", attrs=__(a)__).find_all("td")__(b)__

What goes in blank (a)?

What goes in blank (b)?

Answer: a) {'class' : n} b) [2].text or [-1].text

The logic is to find the <tr> node with the correct class attribute (which we do by setting attrs to {'class' : n}), then access the text of the node’s last <td> child (since that’s where the song titles are stored).
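A runnable version of top_nth is sketched below on a compact copy of the table, assuming the built-in "html.parser". It passes str(n) rather than n as the class value; this is a defensive choice made here, since parsed attribute values are strings, and is not required by the exam answer.

```python
import bs4

# Compact copy of the table from the document above.
html = """
<table>
    <tr class='heading'><th>Rank</th><th>Artist(s)</th><th>Song</th></tr>
    <tr class=1><td>1</td><td>Harry Styles</td><td>As It Was</td></tr>
    <tr class=2><td>2</td><td>Jack Harlow</td><td>First Class</td></tr>
    <tr class=3><td>3</td><td>Kendrick Lamar</td><td>N95</td></tr>
</table>
"""
soup = bs4.BeautifulSoup(html, "html.parser")


def top_nth(n):
    # Find the row whose class equals the rank, then take its last cell,
    # which holds the song title.
    row = soup.find("tr", attrs={"class": str(n)})
    return row.find_all("td")[-1].text


print(top_nth(1))  # As It Was
print(top_nth(2))  # First Class
print(top_nth(3))  # N95
```

Because the row is looked up by its class rather than by position, the same function keeps working if the page lists more than 3 songs.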


Problem 3.5

Suppose we run the line of code r = requests.get(url), where url is a string containing a URL to some online data source.

True or False: If r.status_code is 200, then r.text must be a string containing the HTML source code of the site at url.

Answer: Option B: False

A status code of 200 only means that the request succeeded. It says nothing about the format of the response body, which could be JSON or plain text rather than HTML.
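One way to tell the formats apart is to look at the Content-Type header of the response in addition to the status code. The sketch below is illustrative only: is_html is a hypothetical helper, and the two SimpleNamespace objects are stand-ins that expose just the status_code and headers attributes a real requests response also has, so the example runs without network access.

```python
from types import SimpleNamespace


def is_html(response):
    """Guess whether a successful response carries HTML, using its headers."""
    content_type = response.headers.get("Content-Type", "")
    return response.status_code == 200 and "text/html" in content_type.lower()


# Hypothetical stand-ins for responses; both have status code 200.
json_response = SimpleNamespace(
    status_code=200, headers={"Content-Type": "application/json"}
)
html_response = SimpleNamespace(
    status_code=200, headers={"Content-Type": "text/html; charset=utf-8"}
)

print(is_html(json_response))  # False
print(is_html(html_response))  # True
```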



Problem 4


Problem 4.1

Rahul is trying to scrape the website of an online bookstore ‘The Book Club’.

<HTML>
<H1>The Book Club</H1>
<BODY BGCOLOR="FFFFFF">
Email us at <a href="mailto:support@thebookclub.com">
support@thebookclub.com</a>.

<div>
    <ol class="row">
    <li class="book_list">
    
        <article class="product_pod">
            <div class="image_container">
                    <img src="pic1.jpeg" alt="A Light in the Attic" 
                    class="thumbnail">
            </div>
            
            <p class="star-rating Three"></p>
            
            <h3>
            <a href="cat/index.html" title="A Light in the Attic">
            A Light in the Attic
            </a>
            </h3>
        
            <div class="product_price">
                <p class="price_color">£51.77</p>
                
                <p class="instock availability">
                    <i class="icon-ok"></i>
                    In stock
                </p>
        
            </div>
        </article>
    </li>
    </ol>

</div>
</BODY>
</HTML>

Which is the equivalent Document Object Model (DOM) tree of this HTML file?

Answer: Tree D

Following tree D in the image from top to bottom, we can follow the nesting of tags in the HTML file to verify that the DOM tree D matches the syntax of the HTML file.


Problem 4.2

Rahul wants to extract the ‘instock availability’ status of the book titled ‘A Light in the Attic’. Which of the following expressions will evaluate to "In stock"? Assume that Rahul has already parsed the HTML into a BeautifulSoup object stored in the variable named soup.

Code Snippet A

    soup.find('p',attrs = {'class': 'instock availability'})\
    .get('icon-ok').strip()

Code Snippet B

    soup.find('p',attrs = {'class': 'instock availability'}).text.strip()

Code Snippet C

    soup.find('p',attrs = {'class': 'instock availability'}).find('i')\
    .text.strip()

Code Snippet D

    soup.find('div', attrs = {'class':'product_price'})\
    .find('p',attrs = {'class': 'instock availability'})\
    .find('i').text.strip()

Answer: Code Snippet B

Code Snippet B is the only option that finds the p tag whose class attribute is equal to instock availability and then gets the text contained in that tag, which, once stripped of surrounding whitespace, is equal to ‘In stock’.

Option A will cause an error: .get('icon-ok') returns None, since 'icon-ok' is not the name of an attribute but is instead the value of the i tag's class attribute, and calling .strip() on None fails.

Options C and D will both get the text of the i tag, which is '' and is therefore incorrect.
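The behaviour of the snippets can be checked on the relevant fragment of the page. This is a minimal sketch that assumes the built-in "html.parser" and keeps only the availability paragraph and its enclosing div.

```python
import bs4

# Fragment of the bookstore page containing the availability paragraph.
html = """
<div class="product_price">
    <p class="instock availability">
        <i class="icon-ok"></i>
        In stock
    </p>
</div>
"""
soup = bs4.BeautifulSoup(html, "html.parser")
stock_tag = soup.find("p", attrs={"class": "instock availability"})

# Snippet B: the text of the p tag, with surrounding whitespace removed.
snippet_b = stock_tag.text.strip()

# Snippets C and D: the i tag is empty, so its text is the empty string.
snippet_c = stock_tag.find("i").text.strip()

# Snippet A: 'icon-ok' is not an attribute of p, so .get returns None,
# and calling .strip() on None would raise an AttributeError.
snippet_a_value = stock_tag.get("icon-ok")

print(repr(snippet_b))   # 'In stock'
print(repr(snippet_c))   # ''
print(snippet_a_value)   # None
```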


Problem 4.3

Rahul also wants to extract the number of stars that the book titled ‘A Light in the Attic’ received. If you look at the HTML file, you will notice that the book received a star rating of three. Which code snippet will evaluate to "Three"?

Code Snippet A

    soup.find('article').get('class').strip()

Code Snippet B

    soup.find('p').text.split(' ')

Code Snippet C

    soup.find('p').get('class')[1]

None of the above

Answer: Code Snippet C

Code Snippet C finds the first occurrence of the tag p, gets the contents of its class attribute as a list, and returns the element at index 1 (the last one), which is the rating 'Three' as desired.

Option A will raise an error because .get('class') returns the list ['product_pod'] and strip cannot be used on a list. Even without the error, the contents of that list would not bring us closer to the desired result.

Option B gets the text contained within the first p tag as a list, which is [''].
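The three snippets can be compared on the relevant fragment of the page. The sketch below assumes the built-in "html.parser" and keeps only the article tag, the rating paragraph, and the title link.

```python
import bs4

# Fragment of the bookstore page containing the star rating.
html = """
<article class="product_pod">
    <p class="star-rating Three"></p>
    <h3><a href="cat/index.html" title="A Light in the Attic">A Light in the Attic</a></h3>
</article>
"""
soup = bs4.BeautifulSoup(html, "html.parser")

# Snippet C: class is multi-valued, so .get returns a list of class names.
rating_classes = soup.find("p").get("class")
snippet_c = rating_classes[1]

# Snippet B: the p tag has no text, so splitting it gives [''].
snippet_b = soup.find("p").text.split(" ")

# Snippet A: .get('class') is a list, and lists have no .strip method.
article_classes = soup.find("article").get("class")

print(rating_classes)                     # ['star-rating', 'Three']
print(snippet_c)                          # Three
print(snippet_b)                          # ['']
print(hasattr(article_classes, "strip"))  # False
```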


