In [1]:
from lec_utils import *

Discussion Slides: RegEx and Text as Data

Agenda 📆¶

  • Using RegEx to Parse Recipe Ingredients 🍝.
  • TF-IDF 📖.
  • Worksheet 📝.

Example: Using RegEx to Parse Recipe Ingredients 🍝¶



  • Our goal in today's discussion is to build a DataFrame that holds the quantity, unit, and name of each ingredient in a recipe.
In [2]:
recipe_text = '''
Ingredients:
3/4 cup oil
    Use olive oil for best results!
1/2 teaspoon salt
    Or to taste. 
8 cups flour
12.5 tablespoons butter
    Optionally substitute heavy cream. 
'''
Quantity  Unit         Ingredient
3/4       cup          oil
1/2       teaspoon     salt
8         cups         flour
12.5      tablespoons  butter

Capturing Numbers¶

  • A regular expression, or regex for short, is a sequence of characters used to match patterns in strings.
  • In regex, \d matches any digit character (equivalent to [0-9]), and the + quantifier specifies that the preceding element must occur one or more times. \d+ matches sequences of one or more digits.
In [3]:
# Use a raw string (r'...') so Python passes the backslash through to the regex engine unchanged.
re.findall(r'\d+', recipe_text)
Out[3]:
['3', '4', '1', '2', '8', '12', '5']
  • This pattern doesn't capture fractions or decimal numbers because characters like / and . are not digits, causing the match to terminate.
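  • In regex, . is a metacharacter that matches any character (except a newline, by default), so to match a literal period we escape it with a backslash: \. The / character has no special meaning in Python's re module, so writing \/ is optional but harmless.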
In [4]:
print(re.findall(r'\d+\/\d+', recipe_text))  # fractions such as 3/4
print(re.findall(r'\d+\.\d+', recipe_text))  # decimals such as 12.5
['3/4', '1/2']
['12.5']
  • To match all whole numbers, decimals, and fractions, we can use the OR operator |.
In [5]:
# Order matters in RegEx!
# Note that if we started with \d+ as the first option, we would get ['3', '4', '1', '2', '8', '12', '5'].
re.findall(r'(\d+\/\d+|\d+\.\d+|\d+)', recipe_text)
Out[5]:
['3/4', '1/2', '8', '12.5']
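To make the ordering point concrete, here is a minimal sketch that runs the same alternatives in both orders. It repeats the recipe text under the hypothetical name `sample` so that it runs on its own, and it assumes only Python's standard `re` module.

```python
import re

# Same text as recipe_text, repeated so this block is self-contained.
sample = """
Ingredients:
3/4 cup oil
    Use olive oil for best results!
1/2 teaspoon salt
    Or to taste.
8 cups flour
12.5 tablespoons butter
    Optionally substitute heavy cream.
"""

# Most specific alternatives first: fractions, then decimals, then whole numbers.
specific_first = re.findall(r'\d+/\d+|\d+\.\d+|\d+', sample)

# Generic alternative first: \d+ succeeds wherever the others could start,
# so the fraction and decimal alternatives are never reached.
generic_first = re.findall(r'\d+|\d+/\d+|\d+\.\d+', sample)

print(specific_first)  # ['3/4', '1/2', '8', '12.5']
print(generic_first)   # ['3', '4', '1', '2', '8', '12', '5']
```

The regex engine tries alternatives left to right at each position and keeps the first one that succeeds, which is why the longer, more specific patterns should come first.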

Making Capture Groups¶

  • In regex, [] defines a character class, matching any single character inside. ? makes the preceding character or group optional (0 or 1 occurrence).
In [6]:
re.findall('(cup[s]?|tablespoon[s]?|teaspoon[s]?)', recipe_text)
Out[6]:
['cup', 'teaspoon', 'cups', 'tablespoons']
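As a side note, a one-character class such as [s] behaves the same as the bare character s, so the unit pattern can be written in a few equivalent ways. The sketch below compares them on a short hypothetical string, `units_text`, which is not part of the original notebook.

```python
import re

# Hypothetical one-line sample, used only for this comparison.
units_text = "3/4 cup oil, 8 cups flour, 12.5 tablespoons butter, 1/2 teaspoon salt"

# Original style: a character class containing a single character.
with_class = re.findall(r'cup[s]?|tablespoon[s]?|teaspoon[s]?', units_text)

# Same meaning without the brackets.
without_class = re.findall(r'cups?|tablespoons?|teaspoons?', units_text)

# Factored form: (?:...) groups the alternatives without capturing,
# so findall still returns the whole match, including the optional s.
factored = re.findall(r'(?:cup|tablespoon|teaspoon)s?', units_text)

print(with_class)  # ['cup', 'cups', 'tablespoons', 'teaspoon']
print(with_class == without_class == factored)  # True
```

The non-capturing group matters here: with plain parentheses, `re.findall` would return only the captured text and drop the trailing s.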
  • We can extract each part of our expression separately by using capture groups, which we define by putting our expressions in parentheses.
In [7]:
pattern = r'(\d+\/\d+|\d+\.\d+|\d+)\s(cup[s]?|tablespoon[s]?|teaspoon[s]?)\s(.+)'
In [8]:
matches = re.findall(pattern, recipe_text)
pd.DataFrame(matches, columns=['Quantity', 'Unit', 'Ingredient'])
Out[8]:
Quantity Unit Ingredient
0 3/4 cup oil
1 1/2 teaspoon salt
2 8 cups flour
3 12.5 tablespoons butter
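The Quantity column above holds strings such as '3/4'. As one possible follow-up, the sketch below uses named capture groups and the standard library's `fractions.Fraction` to attach a numeric value to each row. The names `sample`, `rows`, and `amount` are illustrative assumptions, not part of the original notebook.

```python
import re
from fractions import Fraction

# Same text as recipe_text, repeated so this block is self-contained.
sample = """
Ingredients:
3/4 cup oil
    Use olive oil for best results!
1/2 teaspoon salt
    Or to taste.
8 cups flour
12.5 tablespoons butter
    Optionally substitute heavy cream.
"""

# (?P<name>...) is a named capture group; the pieces mirror the pattern above.
pattern = re.compile(
    r'(?P<quantity>\d+/\d+|\d+\.\d+|\d+)\s'
    r'(?P<unit>cups?|tablespoons?|teaspoons?)\s'
    r'(?P<ingredient>.+)'
)

rows = []
for match in pattern.finditer(sample):
    row = match.groupdict()  # {'quantity': ..., 'unit': ..., 'ingredient': ...}
    # Fraction parses '3/4', '8', and '12.5' alike; float() makes it easy to compute with.
    row['amount'] = float(Fraction(row['quantity']))
    rows.append(row)

for row in rows:
    print(row['quantity'], row['unit'], row['ingredient'], row['amount'])
```

A list of dictionaries like `rows` can be passed directly to `pd.DataFrame(rows)` if a DataFrame is wanted.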

TF-IDF 📖¶


What Makes a Word Important?¶

Consider the recipe description $d$:

"This spicy sauce blends garlic, chili, and basil for a bold, spicy flavor and smooth texture."

  • Issue: Simply counting word occurrences in a recipe doesn't reveal which terms are truly informative: "spicy" and "and" both appear twice, but "spicy" is more significant to the meaning of the sentence.
  • Key Idea: An important word is one that appears frequently in one piece of text, but rarely in other pieces of text.

Term Frequency (TF)¶

  • Definition: The term frequency (TF) measures how often a term $t$ appears in a recipe $d$.
  • Intuition: A high $\text{tf}(t, d)$ means the term is very common in that particular recipe.

$$ \text{tf}(\text{"spicy"}, d) = \frac{\text{number of occurrences of "spicy" in } d}{\text{total number of terms in } d} $$

$$ \text{tf}(\text{"spicy"}, d) = \frac{2}{16} = 0.125 $$
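The calculation above can be checked with a few lines of Python. This is a minimal sketch that assumes a simple tokenization rule (lowercase the text, then treat each run of letters as one term); other tokenizers may count terms differently.

```python
import re

description = ("This spicy sauce blends garlic, chili, and basil "
               "for a bold, spicy flavor and smooth texture.")

# Assumed tokenization: lowercase, then keep runs of letters as terms.
terms = re.findall(r'[a-z]+', description.lower())

def tf(term, terms):
    """Occurrences of `term` divided by the total number of terms."""
    return terms.count(term) / len(terms)

print(len(terms))          # 16
print(tf('spicy', terms))  # 0.125
print(tf('and', terms))    # 0.125
```

Note that "spicy" and "and" get the same term frequency, which is exactly why term frequency alone cannot tell us which word is more informative.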

Inverse Document Frequency (IDF)¶

  • Definition: The inverse document frequency (IDF) gauges how rare a term is across a collection of recipes.
  • Intuition: A high $\text{idf}(t)$ indicates that $t$ is rare across our collection of recipes, so its presence in any one recipe is more informative.
  • Imagine we have 1000 recipe descriptions in our dataset, and "spicy" appears in 100 of these.

    $$ \text{idf}(\text{"spicy"}) = \log \left( \frac{\text{total number of recipes}}{\text{number of recipes in which "spicy" appears}} \right) $$
In this example, we're using $\log_{10}$, but we can also use other bases as long as we keep it consistent! On exams, we'll tell you which base to use.

$$ \text{idf}(\text{"spicy"}) = \log \left( \frac{1000}{100} \right) = \log (10) = 1 $$
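As a quick check of the arithmetic, here is a small sketch of the same formula using base-10 logarithms. The function name `idf` and its parameter names are illustrative choices.

```python
import math

def idf(num_documents, num_containing):
    """Base-10 inverse document frequency, following the formula above."""
    return math.log10(num_documents / num_containing)

print(idf(1000, 100))   # 1.0  ("spicy" appears in 100 of 1000 recipes)
print(idf(1000, 1000))  # 0.0  (a term that appears in every recipe)
```

A term that appears in every document has an IDF of 0, and the value grows as the term becomes rarer.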

Term Frequency-Inverse Document Frequency (TF-IDF)¶

  • Definition: TF-IDF combines term frequency and inverse document frequency to score the importance of a term in a recipe.
  • Words with higher TF-IDF are more important to a document’s meaning.
  • In essence, we measure how common a term is within one document, then multiply that by how rare the term is across all of our documents.

$$ \text{tfidf}(\text{"spicy"}, d) = \text{tf}(\text{"spicy"}, d) \times \text{idf}(\text{"spicy"}) $$

$$ \text{tfidf}(\text{"spicy"}, d) = 0.125 \times 1 = 0.125 $$
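To see the two pieces working together, the sketch below scores every term of the example description against a tiny, made-up corpus of four descriptions. The corpus, the tokenization rule, and the function names are assumptions for illustration, so the numbers differ from the 1000-recipe example above.

```python
import math
import re

# Hypothetical mini-corpus; only the first description comes from the example above.
corpus = [
    "This spicy sauce blends garlic, chili, and basil "
    "for a bold, spicy flavor and smooth texture.",
    "A creamy soup with potatoes and leeks.",
    "Roasted vegetables and rice for a simple dinner.",
    "Sweet and tangy lemon bars.",
]

def tokenize(text):
    # Assumed tokenization: lowercase, then keep runs of letters as terms.
    return re.findall(r'[a-z]+', text.lower())

def tfidf_scores(doc_index, corpus):
    """Return {term: tf * idf} for each term of corpus[doc_index], with base-10 logs."""
    docs = [tokenize(text) for text in corpus]
    target = docs[doc_index]
    scores = {}
    for term in set(target):
        tf = target.count(term) / len(target)
        # Number of documents that contain the term at least once.
        df = sum(term in doc for doc in docs)
        scores[term] = tf * math.log10(len(docs) / df)
    return scores

scores = tfidf_scores(0, corpus)

# Print the terms of the first description, highest score first.
for term, score in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(f'{term:10s} {score:.4f}')
```

In this toy corpus "and" appears in every description, so its score is 0, while "spicy" appears twice in the first description and nowhere else, so it ranks highest.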