The problems in this worksheet are taken from past exams in similar
classes. Work on them on paper, since the exams you
take in this course will also be on paper.
We encourage you to
complete this worksheet in a live discussion section. Solutions will be
made available after all discussion sections have concluded. You don’t
need to submit your answers anywhere.
Note: We do not plan to
cover all problems here in the live discussion section; the problems
we don’t cover can be used for extra practice.
For this problem, consider the HTML document shown below:
<html>
<head>
<title>Data Science Courses</title>
</head>
<body>
<h1>Welcome to the World of Data Science!</h1>
<h2>Current Courses</h2>
<div class="course_list">
<img alt="Course Banner", src="courses.png">
<p>
Here are the courses available to take:</p>
<ul>
<li>Machine Learning</li>
<li>Design of Experiments</li>
<li>Driving Business Value with DS</li>
</ul>
<p>
For last quarter's classes, see <a href="./2021-sp.html">here</a>.
</p>
</div>
<h2>News</h2>
<div class="news">
<p class="news">
New course on <b>Visualization</b> is launched.
See <a href="https://http.cat/301.png" target="_blank">here</a>
</p>
</div>
</body>
</html>
How many children does the div node with class course_list contain in the Document Object Model (DOM)?
Answer: 4 children
Looking at the code, we can see that the div with class course_list has 4 children, namely: an img node, a p node, a ul node, and another p node.
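The count above can be checked with a small sketch. It assumes the html.parser backend and a trimmed copy of the div (with the comma inside the img tag dropped); element_children is a name introduced here for illustration only.

```python
import bs4
from bs4.element import Tag

# Trimmed copy of the course_list div from the document above (an assumption).
html = """<div class="course_list">
<img alt="Course Banner" src="courses.png">
<p>Here are the courses available to take:</p>
<ul>
<li>Machine Learning</li>
<li>Design of Experiments</li>
<li>Driving Business Value with DS</li>
</ul>
<p>For last quarter's classes, see <a href="./2021-sp.html">here</a>.</p>
</div>"""

doc = bs4.BeautifulSoup(html, "html.parser")
div = doc.find("div", attrs={"class": "course_list"})

# .children also yields the whitespace between tags as text nodes,
# so keep only the element (Tag) children.
element_children = [child.name for child in div.children if isinstance(child, Tag)]
print(element_children)
```

The li elements are not counted because they are children of ul, not of the div itself.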
Suppose the HTML document has been parsed using doc = bs4.BeautifulSoup(html). Write a line of code to get the h1 header text in the form of a string.
Answer: doc.find('h1').text
Since there’s only one h1 element in the HTML code, we can simply use doc.find('h1') to get the h1 element. Adding .text then gives the text of the h1 element in the form of a string.
Suppose the HTML document has been parsed using doc = bs4.BeautifulSoup(html). Write a piece of code that scrapes the course names from the HTML. The value returned by your code should be a list of strings.
Answer:
[x.text for x in doc.find_all('li')]
doc.find_all('li') finds all li elements and returns them in the form of a list. A list comprehension that takes the .text of each li element then yields the desired result.
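As a quick check of the last two answers, here is a minimal sketch run on a shortened copy of the document. The html.parser argument is an assumption (the worksheet does not name a parser), and header_text and course_names are illustrative names.

```python
import bs4

# Shortened copy of the document above, keeping only the h1 and the list.
html = """<html>
<body>
<h1>Welcome to the World of Data Science!</h1>
<ul>
<li>Machine Learning</li>
<li>Design of Experiments</li>
<li>Driving Business Value with DS</li>
</ul>
</body>
</html>"""

doc = bs4.BeautifulSoup(html, "html.parser")

# find returns the first matching element; .text gives its text as a string.
header_text = doc.find("h1").text

# find_all returns every matching element, in document order.
course_names = [x.text for x in doc.find_all("li")]

print(header_text)
print(course_names)
```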
There are two links in the document. Which of the following will return the URL of the link contained in the div with class news? Mark all that apply.
doc.find_all('a')[1].attrs['href']
doc.find('a')[1].attrs['href']
doc.find(a, class='news').attrs['href']
doc.find('div', attrs={'class': 'news'}).find('a').attrs['href']
doc.find('href', attrs={'class': 'news'})
Answer: Option A and Option D
Option A: doc.find_all('a') will return a list of all the a elements in the order that they appear in the HTML document, and since the a inside the div with class news is the second a element appearing in the HTML doc, we do [1] to select it (as we would in any other list). Finally, we return the URL of the a element by getting the 'href' attribute using .attrs['href'].
Option B: .find will only find the first instance of a, which is not the one we’re looking for.
Option C: a is not in quotes and class is a reserved word in Python, so this is not valid code; in addition, the class news belongs to the div and p elements, not to a.
Option D: doc.find('div', attrs={'class': 'news'}) will first find the div element with class='news', and then find the a element within that element and get the href attribute of that, which is what we want.
Option E: href is an attribute, not a tag, so there is no href element in the HTML document.
What is the purpose of the alt attribute in the img tag?
It provides an alternative image that will be shown to some users at random
It creates a link with text present in alt
It provides the text to be shown below the image as a caption
It provides text that should be shown in the case that the image cannot be displayed
Answer: Option D
The alt attribute supplies alternative text that is displayed when the image cannot be shown.
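Both href and alt are ordinary attributes, so they can be read the same way. The sketch below revisits the two-link question and the alt question on a trimmed copy of the document; the html.parser backend and the variable names are assumptions made for illustration.

```python
import bs4

# Trimmed copy of the first document (the comma in the img tag is dropped).
html = """
<div class="course_list">
<img alt="Course Banner" src="courses.png">
<p>For last quarter's classes, see <a href="./2021-sp.html">here</a>.</p>
</div>
<div class="news">
<p class="news">New course on <b>Visualization</b> is launched.
See <a href="https://http.cat/301.png" target="_blank">here</a></p>
</div>
"""

doc = bs4.BeautifulSoup(html, "html.parser")

# Option A: second a element in document order.
url_a = doc.find_all("a")[1].attrs["href"]

# Option D: first a element inside the div with class news.
url_d = doc.find("div", attrs={"class": "news"}).find("a").attrs["href"]

# find alone returns only the first a element, which is the other link.
first_link = doc.find("a").attrs["href"]

# There is no tag named href, so this search finds nothing.
no_href_tag = doc.find("href", attrs={"class": "news"})

# The alt text is stored as an attribute of the img element.
alt_text = doc.find("img").attrs["alt"]

print(url_a, url_d, first_link, no_href_tag, alt_text)
```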
You are scraping a web page using the requests
module.
Your code works fine and returns the desired result, but suddenly you
find that when you run your code it starts but never finishes – it does
not raise an error or return anything. What is the most likely cause of
the issue?
The page has a very large GIF that hasn’t stopped playing
You have made too many requests to the server in too short of a time, and you are being “timed out”
The page contains a Unicode character that requests cannot parse
The page has suddenly changed and has caused requests to enter an infinite loop
Answer: Option B
After taking the SAT, Nicole wants to check the College Board’s website to see her score. However, the College Board recently updated their website to use non-standard HTML tags and Nicole’s browser can’t render it correctly. As such, she resorts to making a GET request to the site with her scores on it to get back the source HTML and tries to parse it with BeautifulSoup.
Suppose soup is a BeautifulSoup object instantiated using the following HTML document.
<college>Your score is ready!</college>
<sat verbal="ready" math="ready">
Your percentiles are as follows:
<scorelist listtype="percentiles">
<scorerow kind="verbal" subkind="per">
Verbal: <scorenum>84</scorenum>
</scorerow>
<scorerow kind="math" subkind="per">
Math: <scorenum>99</scorenum>
</scorerow>
</scorelist>
And your actual scores are as follows:
<scorelist listtype="scores">
<scorerow kind="verbal"> Verbal: <scorenum>680</scorenum> </scorerow>
<scorerow kind="math"> Math: <scorenum>800</scorenum> </scorerow>
</scorelist>
</sat>
Which of the following expressions evaluate to "verbal"? Select all that apply.
soup.find("scorerow").get("kind")
soup.find("sat").get("ready")
soup.find("scorerow").text.split(":")[0].lower()
[s.get("kind") for s in soup.find_all("scorerow")][-2]
soup.find("scorelist", attrs={"listtype":"scores"}).get("kind")
None of the above
Answer: Option 1, Option 3, Option 4
Correct options:
Option 1 finds the first <scorerow> element and retrieves its "kind" attribute, which is "verbal" for the first <scorerow> encountered in the HTML document.
Option 3 finds the first <scorerow> tag, retrieves its text ("Verbal: 84"), splits this text by “:”, and takes the first element of the resulting list ("Verbal"), converting it to lowercase to match "verbal".
Option 4 creates a list of the "kind" attributes for all <scorerow> elements. The second to last (-2) element in this list corresponds to the "kind" attribute of the first <scorerow> in the second <scorelist> tag, which is also "verbal".
Incorrect options:
Option 2 tries to get an attribute named "ready" from the <sat> tag, which does not exist as an attribute.
Option 5 tries to get the "kind" attribute from a <scorelist> tag, but <scorelist> does not have a "kind" attribute.
Consider the following function.
def summer(tree):
    if isinstance(tree, list):
        total = 0
        for subtree in tree:
            for s in subtree.find_all("scorenum"):
                total += int(s.text)
        return total
    else:
        return sum([int(s.text) for s in tree.find_all("scorenum")])
For each of the following values, fill in the blanks to assign tree such that summer(tree) evaluates to the desired value. The first example has been done for you.
84
tree = soup.find("scorerow")
183
tree = soup.find(__a__)
1480
tree = soup.find(__b__)
899
tree = soup.find_all(__c__)
Answer: a: "scorelist", b: "scorelist", attrs={"listtype":"scores"}, c: "scorerow", attrs={"kind":"math"}
a: soup.find("scorelist") selects the first <scorelist> tag, which includes both verbal and math percentiles (84 and 99). The function summer(tree) sums these values to get 183.
b: This selects the <scorelist> tag with listtype="scores", which contains the actual scores of verbal (680) and math (800). The function sums these to get 1480.
c: This selects all <scorerow> elements with kind="math", capturing both the percentile (99) and the actual score (800). Since tree is now a list, summer(tree) iterates through each <scorerow> in the list, summing their <scorenum> values to reach 899.
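The answers to this problem can be verified with a short sketch, assuming the document is parsed with the html.parser backend (an assumption; the problem does not name a parser). The results dictionary is an illustrative name, and the first two lines after parsing also re-check two of the "verbal" options from the earlier part.

```python
import bs4

html = """<college>Your score is ready!</college>
<sat verbal="ready" math="ready">
Your percentiles are as follows:
<scorelist listtype="percentiles">
<scorerow kind="verbal" subkind="per">
Verbal: <scorenum>84</scorenum>
</scorerow>
<scorerow kind="math" subkind="per">
Math: <scorenum>99</scorenum>
</scorerow>
</scorelist>
And your actual scores are as follows:
<scorelist listtype="scores">
<scorerow kind="verbal"> Verbal: <scorenum>680</scorenum> </scorerow>
<scorerow kind="math"> Math: <scorenum>800</scorenum> </scorerow>
</scorelist>
</sat>"""

soup = bs4.BeautifulSoup(html, "html.parser")

def summer(tree):
    # find_all returns a list-like ResultSet, find returns a single Tag.
    if isinstance(tree, list):
        total = 0
        for subtree in tree:
            for s in subtree.find_all("scorenum"):
                total += int(s.text)
        return total
    else:
        return sum([int(s.text) for s in tree.find_all("scorenum")])

first_kind = soup.find("scorerow").get("kind")
kinds = [s.get("kind") for s in soup.find_all("scorerow")]

results = {
    "example": summer(soup.find("scorerow")),
    "a": summer(soup.find("scorelist")),
    "b": summer(soup.find("scorelist", attrs={"listtype": "scores"})),
    "c": summer(soup.find_all("scorerow", attrs={"kind": "math"})),
}
print(first_kind, kinds[-2], results)
```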
Consider the following HTML document, which represents a webpage containing the top few songs with the most streams on Spotify today in Canada.
<head>
<title>3*Canada-2022-06-04</title>
</head>
<body>
<h1>Spotify Top 3 - Canada</h1>
<table>
<tr class='heading'>
<th>Rank</th>
<th>Artist(s)</th>
<th>Song</th>
</tr>
<tr class=1>
<td>1</td>
<td>Harry Styles</td>
<td>As It Was</td>
</tr>
<tr class=2>
<td>2</td>
<td>Jack Harlow</td>
<td>First Class</td>
</tr>
<tr class=3>
<td>3</td>
<td>Kendrick Lamar</td>
<td>N95</td>
</tr>
</table>
</body>
Suppose we define soup to be a BeautifulSoup object that is instantiated using the document above.
How many leaf nodes are there in the DOM tree of the previous document — that is, how many nodes have no children?
Answer: 14
There’s 1 <title>, 1 <h1>, 3 <th>s, and 9 <td>s, adding up to 14.
What does the following line of code evaluate to?
len(soup.find_all("td"))
Answer: 9
As mentioned in the solution to the part above, there are 9 <td> nodes, and soup.find_all finds them all.
What does the following line of code evaluate to?
soup.find("tr").get("class")
Answer: ["heading"] or "heading"
soup.find("tr") finds the first occurrence of a <tr> node, and get("class") accesses the value of its "class" attribute.
Note that technically the answer is ["heading"], but "heading" received full credit too.
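The three answers so far can be reproduced with a small sketch. It assumes the html.parser backend and a copy of the Spotify document in which the head element is properly closed; leaf_tags, num_td and first_tr_class are illustrative names. Leaves are counted as elements with no element children, which is the convention the answer above uses.

```python
import bs4

# Copy of the Spotify document, with a closing </head> tag (an assumption).
html = """
<head>
<title>3*Canada-2022-06-04</title>
</head>
<body>
<h1>Spotify Top 3 - Canada</h1>
<table>
<tr class='heading'><th>Rank</th><th>Artist(s)</th><th>Song</th></tr>
<tr class=1><td>1</td><td>Harry Styles</td><td>As It Was</td></tr>
<tr class=2><td>2</td><td>Jack Harlow</td><td>First Class</td></tr>
<tr class=3><td>3</td><td>Kendrick Lamar</td><td>N95</td></tr>
</table>
</body>
"""

soup = bs4.BeautifulSoup(html, "html.parser")

# find_all(True) matches every tag; a leaf is a tag containing no other tag.
leaf_tags = [t.name for t in soup.find_all(True) if t.find(True) is None]

num_td = len(soup.find_all("td"))

# class is a multi-valued attribute, so BeautifulSoup returns a list.
first_tr_class = soup.find("tr").get("class")

print(len(leaf_tags), num_td, first_tr_class)
```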
Complete the implementation of the function top_nth, which takes in a positive integer n and returns the name of the n-th ranked song in the HTML document. For instance, top_nth(2) should evaluate to "First Class" (n=1 corresponds to the top song).
Note: Your implementation should work in the case that the page contains more than 3 songs.
def top_nth(n):
    return soup.find("tr", attrs=__(a)__).find_all("td")__(b)__
What goes in blank (a)?
What goes in blank (b)?
Answer: a) {'class' : n} b) [2].text or [-1].text
The logic is to find the <tr> node with the correct class attribute (which we do by setting attrs to {'class' : n}, e.g. {'class' : 2} for the second song), then access the text of the node’s last <td> child (since that’s where the song titles are stored).
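A filled-in version of top_nth might look like the sketch below. It assumes the html.parser backend and, as a cautious choice not taken from the answer above, converts n with str(n), since parsed class values are stored as strings.

```python
import bs4

# Only the table from the Spotify document is needed here.
html = """
<table>
<tr class='heading'><th>Rank</th><th>Artist(s)</th><th>Song</th></tr>
<tr class=1><td>1</td><td>Harry Styles</td><td>As It Was</td></tr>
<tr class=2><td>2</td><td>Jack Harlow</td><td>First Class</td></tr>
<tr class=3><td>3</td><td>Kendrick Lamar</td><td>N95</td></tr>
</table>
"""

soup = bs4.BeautifulSoup(html, "html.parser")

def top_nth(n):
    # Find the row whose class equals the rank, then take its last cell.
    # str(n) is an assumption of this sketch; the answer key passes n directly.
    return soup.find("tr", attrs={"class": str(n)}).find_all("td")[-1].text

print(top_nth(2))
```

Indexing with [-1] rather than relying on row position is what lets the function work for pages with more than 3 songs: the lookup depends only on the class of the row.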
Suppose we run the line of code r = requests.get(url), where url is a string containing a URL to some online data source.
True or False: If r.status_code is 200, then r.text must be a string containing the HTML source code of the site at url.
True
False
Answer: Option B: False
A status code of 200 only means that the request succeeded. The response body could be JSON or some other format; it is not necessarily HTML.
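One way to act on this is to look at the Content-Type header before treating the body as HTML. The sketch below is only an illustration: looks_like_html is a hypothetical helper, and the two SimpleNamespace objects are stand-ins that mimic the status_code, headers and text attributes of a requests response so that the example runs without a network connection.

```python
from types import SimpleNamespace

def looks_like_html(response):
    """Hypothetical helper: True only for a 200 response labeled as HTML."""
    content_type = response.headers.get("Content-Type", "")
    return response.status_code == 200 and "html" in content_type.lower()

# Stand-ins for responses (assumptions, not real requests.Response objects).
fake_json = SimpleNamespace(
    status_code=200,
    headers={"Content-Type": "application/json"},
    text='{"ok": true}',
)
fake_html = SimpleNamespace(
    status_code=200,
    headers={"Content-Type": "text/html; charset=utf-8"},
    text="<html></html>",
)

print(looks_like_html(fake_json), looks_like_html(fake_html))
```

Both stand-ins have status code 200, yet only one of them carries HTML.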
Rahul is trying to scrape the website of an online bookstore ‘The Book Club’.
<HTML>
<H1>The Book Club</H1>
<BODY BGCOLOR="FFFFFF">
Email us at <a href="mailto:support@thebookclub.com">
support@thebookclub.com</a>.
<div>
<ol class="row">
<li class="book_list">
<article class="product_pod">
<div class="image_container">
<img src="pic1.jpeg" alt="A Light in the Attic"
class="thumbnail">
</div>
<p class="star-rating Three"></p>
<h3>
<a href="cat/index.html" title="A Light in the Attic">
A Light in the Attic
</a>
</h3>
<div class="product_price">
<p class="price_color">£51.77</p>
<p class="instock availability">
<i class="icon-ok"></i>
In stock
</p>
</div>
</article>
</li>
</ol>
</div>
</BODY>
</HTML>
Which is the equivalent Document Object Model (DOM) tree of this HTML file?
Tree A
Tree B
Tree C
Tree D
Answer: Tree D
Following tree D in the image from top to bottom, we can follow the nesting of tags in the HTML file to verify that the DOM tree D matches the syntax of the HTML file.
Rahul wants to extract the ‘instock availability’ status of the book titled ‘A Light in the Attic’. Which of the following expressions will evaluate to "In stock"? Assume that Rahul has already parsed the HTML into a BeautifulSoup object stored in the variable named soup.
Code Snippet A
soup.find('p',attrs = {'class': 'instock availability'})\
    .get('icon-ok').strip()
Code Snippet B
soup.find('p',attrs = {'class': 'instock availability'}).text.strip()
Code Snippet C
soup.find('p',attrs = {'class': 'instock availability'}).find('i')\
    .text.strip()
Code Snippet D
soup.find('div', attrs = {'class':'product_price'})\
    .find('p',attrs = {'class': 'instock availability'})\
    .find('i').text.strip()
Answer: Code Snippet B
Code Snippet B is the only option that finds the tag p with the attribute class being equal to instock availability and then gets the text contained in that tag, which (after stripping whitespace) is equal to "In stock".
Option A will cause an error because of .get('icon-ok'): 'icon-ok' is not the name of an attribute, but is instead the value of the class attribute of the i tag, so .get returns None, and calling .strip() on None raises an error.
Option C and D will both get the text of the i tag, which is '' and is therefore incorrect.
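The behavior of these snippets can be checked on a fragment of the page. This is a minimal sketch assuming the html.parser backend; snippet_b, snippet_c and missing_attr are illustrative names.

```python
import bs4

# Fragment of the bookstore page containing the availability paragraph.
html = """
<div class="product_price">
<p class="instock availability">
<i class="icon-ok"></i>
In stock
</p>
</div>
"""

soup = bs4.BeautifulSoup(html, "html.parser")
p = soup.find("p", attrs={"class": "instock availability"})

# Snippet B: all text inside the p tag, with surrounding whitespace removed.
snippet_b = p.text.strip()

# Snippets C and D: the i tag is empty, so its text is the empty string.
snippet_c = p.find("i").text.strip()

# Snippet A: p has no attribute named icon-ok, so get returns None,
# and calling .strip() on None would raise an AttributeError.
missing_attr = p.get("icon-ok")

print(repr(snippet_b), repr(snippet_c), missing_attr)
```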
Rahul also wants to extract the number of stars that the book titled ‘A Light in the Attic’ received. If you look at the HTML file, you will notice that the book received a star rating of three. Which code snippet will evaluate to "Three"?
Code Snippet A
soup.find('article').get('class').strip()
Code Snippet B
soup.find('p').text.split(' ')
Code Snippet C
soup.find('p').get('class')[1]
None of the above
Answer: Code Snippet C
Code Snippet C finds the first occurrence of the tag p, gets the contents of its class attribute as a list, and returns the last element, which is the rating 'Three' as desired.
Option A will error because .get('class') returns ['product_pod'] and strip cannot be used on a list; in any case, the content of the list does not bring us closer to the desired result.
Option B gets the text contained within the first p tag and splits it into a list, which is [''].
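The three snippets can be compared side by side on a fragment of the page. This is a sketch under the assumption that the html.parser backend is used; the variable names are illustrative.

```python
import bs4

# Fragment of the bookstore page: the article and its first p tag.
html = """
<article class="product_pod">
<p class="star-rating Three"></p>
</article>
"""

soup = bs4.BeautifulSoup(html, "html.parser")

# Snippet C: class holds two space-separated values, returned as a list.
rating = soup.find("p").get("class")[1]

# Snippet A (without .strip()): a list, which has no strip method.
article_classes = soup.find("article").get("class")

# Snippet B: the first p tag has no text, so splitting gives [''].
text_pieces = soup.find("p").text.split(" ")

print(rating, article_classes, text_pieces)
```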