In [1]:

from lec_utils import *

Discussion Slides: RegEx and Text as Data

Agenda 📆

Using RegEx to Parse Recipe Ingredients 🍝.
TF-IDF 📖.
Worksheet 📝.

Example: Using RegEx to Parse Recipe Ingredients 🍝

Our goal in today's discussion is to build a DataFrame that holds the quantity, unit, and name of each ingredient in a recipe.

In [2]:

recipe_text = '''
Ingredients:
3/4 cup oil
    Use olive oil for best results!
1/2 teaspoon salt
    Or to taste. 
8 cups flour
12.5 tablespoons butter
    Optionally substitute heavy cream. 
'''

Quantity	Unit	Ingredient
3/4	cup	oil
1/2	teaspoon	salt
8	cups	flour
12.5	tablespoons	butter

Capturing Numbers

A regular expression, or regex for short, is a sequence of characters used to match patterns in strings.

In regex, \d matches any digit character (equivalent to [0-9]), and the + quantifier specifies that the preceding element must occur one or more times. \d+ matches sequences of one or more digits.

In [3]:

# A raw string (r'...') passes the backslash through to the regex engine unchanged.
re.findall(r'\d+', recipe_text)

Out[3]:

['3', '4', '1', '2', '8', '12', '5']

This pattern doesn't capture fractions or decimal numbers because characters like / and . are not digits, causing the match to terminate.

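To match a character that has special meaning in regex, such as ., we escape it with a backslash: \. matches a literal period, whereas an unescaped . matches any character except a newline. The slash / has no special meaning in Python's re module, so writing \/ is optional but harmless.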

In [4]:

print(re.findall(r'\d+\/\d+', recipe_text))
print(re.findall(r'\d+\.\d+', recipe_text))

['3/4', '1/2']
['12.5']
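
To see why escaping the period matters, here is a minimal sketch on a made-up string. The name `sample` and its contents are assumptions chosen only for illustration; they are not part of the recipe.

```python
import re

# Hypothetical sample string, chosen only to illustrate the difference.
sample = "12.5 cups, 12x5 pan"

# Unescaped '.' is a wildcard: it matches any character except a newline.
unescaped = re.findall(r"\d+.\d+", sample)

# Escaped '\.' matches only a literal period.
escaped = re.findall(r"\d+\.\d+", sample)

print(unescaped)  # ['12.5', '12x5']
print(escaped)    # ['12.5']
```

The unescaped pattern also accepts `12x5`, which is not a decimal number.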

To match all whole numbers, decimals, and fractions, we can use the OR operator |.

In [5]:

# Order matters in RegEx!
# Note that if we started with \d+ as the first option, we would get ['3', '4', '1', '2', '8', '12', '5'].
re.findall(r'(\d+\/\d+|\d+\.\d+|\d+)', recipe_text)

Out[5]:

['3/4', '1/2', '8', '12.5']
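
The effect of ordering can be checked directly. The sketch below puts the same four quantities on one line (the string `quantities` is an assumption for brevity) and compares the two orderings of the alternatives.

```python
import re

# The recipe's quantities, placed on one line for brevity.
quantities = "3/4 1/2 8 12.5"

# Most specific alternatives first: fractions, then decimals, then integers.
specific_first = re.findall(r"\d+/\d+|\d+\.\d+|\d+", quantities)

# Bare \d+ first: it always succeeds on the leading digits, so the
# fraction and decimal alternatives are never tried.
general_first = re.findall(r"\d+|\d+/\d+|\d+\.\d+", quantities)

print(specific_first)  # ['3/4', '1/2', '8', '12.5']
print(general_first)   # ['3', '4', '1', '2', '8', '12', '5']
```

The regex engine tries alternatives left to right and stops at the first one that matches, which is why the most specific pattern should come first.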

Making Capture Groups

In regex, [] defines a character class, matching any single character inside. ? makes the preceding character or group optional (0 or 1 occurrence).

In [6]:

re.findall('(cup[s]?|tablespoon[s]?|teaspoon[s]?)', recipe_text)

Out[6]:

['cup', 'teaspoon', 'cups', 'tablespoons']
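
Because `[s]` contains a single character, `[s]?` behaves the same as `s?`. As a sketch of a more compact spelling, the example below factors the shared optional `s` out of the alternation with a non-capturing group `(?:...)`. The string `units_text` is a made-up sample, not the recipe itself.

```python
import re

# Hypothetical one-line sample containing each unit.
units_text = "3/4 cup oil, 8 cups flour, 12.5 tablespoons butter, 1/2 teaspoon salt"

# Pattern in the style used above.
verbose = re.findall(r"cup[s]?|tablespoon[s]?|teaspoon[s]?", units_text)

# Equivalent pattern: (?:...) groups the alternatives without capturing,
# and a single s? applies to whichever unit matched.
compact = re.findall(r"(?:cup|tablespoon|teaspoon)s?", units_text)

print(verbose)  # ['cup', 'cups', 'tablespoons', 'teaspoon']
print(compact)  # ['cup', 'cups', 'tablespoons', 'teaspoon']
```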

We can extract each part of a match separately by using capture groups, which we define by wrapping sub-patterns in parentheses. When a pattern contains several groups, re.findall returns one tuple per match, with one element per group.

In [7]:

pattern = r'(\d+\/\d+|\d+\.\d+|\d+)\s(cup[s]?|tablespoon[s]?|teaspoon[s]?)\s(.+)'

In [8]:

matches = re.findall(pattern, recipe_text)
pd.DataFrame(matches, columns=['Quantity', 'Unit', 'Ingredient'])

Out[8]:

	Quantity	Unit	Ingredient
0	3/4	cup	oil
1	1/2	teaspoon	salt
2	8	cups	flour
3	12.5	tablespoons	butter
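
The Quantity column still holds strings. One possible follow-up, sketched here with only the standard library, is to convert them to numbers with `fractions.Fraction`, which parses fraction, integer, and decimal strings alike. This conversion step is an assumption about what you might do next, not part of the original example, and the pattern is written with `s?` instead of `[s]?`.

```python
import re
from fractions import Fraction

recipe_text = """
Ingredients:
3/4 cup oil
    Use olive oil for best results!
1/2 teaspoon salt
    Or to taste.
8 cups flour
12.5 tablespoons butter
    Optionally substitute heavy cream.
"""

pattern = r"(\d+/\d+|\d+\.\d+|\d+)\s(cups?|tablespoons?|teaspoons?)\s(.+)"

# With several capture groups, findall returns one tuple per match.
matches = re.findall(pattern, recipe_text)
print(matches[0])  # ('3/4', 'cup', 'oil')

# Fraction accepts 'a/b', integer, and decimal strings.
amounts = [float(Fraction(qty)) for qty, unit, ingredient in matches]
print(amounts)  # [0.75, 0.5, 8.0, 12.5]
```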

TF-IDF 📖

What Makes a Word Important?

Consider the recipe description $d$:

"This spicy sauce blends garlic, chili, and basil for a bold, spicy flavor and smooth texture."

Issue: Simply counting word occurrences in a recipe doesn't reveal which terms are truly informative: "spicy" and "and" both appear twice, but "spicy" is more significant to the meaning of the sentence.

Key Idea: An important word is one that appears frequently in one piece of text, but rarely in other pieces of text.

Term Frequency (TF)

Definition: The term frequency (TF) measures how often a term $t$ appears in a recipe $d$.

Intuition: A high $\text{tf}(t, d)$ means the term is very common in that particular recipe.

$$ \text{tf}(\text{"spicy"}, d) = \frac{\text{number of occurrences of "spicy" in } d}{\text{total number of terms in } d} $$

$$ \text{tf}(\text{"spicy"}, d) = \frac{2}{16} = 0.125 $$
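
The same calculation can be sketched in Python. The tokenization rule here (lowercase runs of letters, with punctuation dropped) is an assumption; other rules would change the term count.

```python
import re
from collections import Counter

d = ("This spicy sauce blends garlic, chili, and basil "
     "for a bold, spicy flavor and smooth texture.")

# Assumption: a term is a lowercase run of letters; punctuation is dropped.
terms = re.findall(r"[a-z]+", d.lower())
counts = Counter(terms)

def tf(term):
    """Fraction of the terms in d that are equal to `term`."""
    return counts[term] / len(terms)

print(len(terms))   # 16
print(tf("spicy"))  # 0.125
print(tf("and"))    # 0.125
```

Both "spicy" and "and" get the same term frequency, which is exactly the issue raised above: TF alone cannot tell them apart.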

Inverse Document Frequency (IDF)

Definition: The inverse document frequency (IDF) gauges how rare a term is across a collection of recipes.

Intuition: A high $\text{idf}(t)$ indicates that $t$ is rare across our collection of texts, so its presence in a particular text is more significant.

Imagine we have 1000 recipe descriptions in our dataset, and "spicy" appears in 100 of these.

$$ \text{idf}(\text{"spicy"}) = \log \left( \frac{\text{total number of recipes}}{\text{number of recipes in which "spicy" appears}} \right) $$

In this example, we're using $\log_{10}$, but we can also use other bases as long as we keep it consistent! On exams, we'll tell you which base to use.

$$ \text{idf}(\text{"spicy"}) = \log \left( \frac{1000}{100} \right) = \log (10) = 1 $$
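
As a minimal sketch, the formula translates to a one-line function using base 10. The function name `idf` and its parameters are illustrative, and the second call assumes a hypothetical term that appears in every recipe.

```python
import math

def idf(total_docs, docs_with_term):
    """Base-10 inverse document frequency from two counts."""
    return math.log10(total_docs / docs_with_term)

# "spicy" appears in 100 of 1000 recipes.
print(idf(1000, 100))   # 1.0

# Hypothetical: a term that appears in every recipe carries no information.
print(idf(1000, 1000))  # 0.0
```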

Term Frequency-Inverse Document Frequency (TF-IDF)

Definition: TF-IDF combines term frequency and inverse document frequency to score the importance of a term in a recipe.

Words with higher TF-IDF are more important to a document's meaning.

In essence, we measure how common a term is within one document, then multiply that by how rare the term is across all of our documents.

$$ \text{tfidf}(\text{"spicy"}, d) = \text{tf}(\text{"spicy"}, d) \times \text{idf}(\text{"spicy"}) $$

$$ \text{tfidf}(\text{"spicy"}, d) = 0.125 \times 1 = 0.125 $$
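
To put the pieces together, here is a sketch that computes TF-IDF over a tiny corpus. Apart from the first description, the documents in `docs` are made up for illustration, so the scores differ from the 1000-recipe example above. The tokenization rule and the helper names are assumptions as well.

```python
import math
import re

# The first description is from the example; the rest are hypothetical.
docs = [
    "This spicy sauce blends garlic, chili, and basil "
    "for a bold, spicy flavor and smooth texture.",
    "A creamy soup with potatoes and leeks.",
    "Toast the bread and spread butter on top.",
    "Whisk eggs and sugar for a light, fluffy cake.",
]

def tokenize(text):
    # Assumption: a term is a lowercase run of letters.
    return re.findall(r"[a-z]+", text.lower())

tokenized = [tokenize(doc) for doc in docs]

def tf(term, terms):
    return terms.count(term) / len(terms)

def idf(term):
    # Number of documents containing the term at least once.
    # Raises ZeroDivisionError if the term appears in no document.
    containing = sum(term in terms for terms in tokenized)
    return math.log10(len(tokenized) / containing)

def tfidf(term, terms):
    return tf(term, terms) * idf(term)

d = tokenized[0]
print(tfidf("spicy", d))  # about 0.075
print(tfidf("and", d))    # 0.0
```

Although "spicy" and "and" have the same term frequency in the first description, "and" appears in every document of this corpus, so its IDF, and therefore its TF-IDF, is zero.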