1 Association rule mining
In this notebook, you’ll implement the basic pairwise association rule mining algorithm.
To keep the implementation simple, you will first apply it to a simplified
dataset, namely, letters (“items”) in words (“receipts” or “baskets”). Having finished that code,
you will then apply it to some grocery store market-basket data. If you write the code
well, it will not be difficult to reuse the building blocks from the letter case in the basket case.
1.1 Problem definition
Let’s say you have a fragment of text in some language. You wish to know whether there are
association rules among the letters that appear in a word. In this problem:
• Words are “receipts”
• Letters within a word are “items”
You want to know whether there are association rules of the form, a ⇒ b, where a and b are
letters. You will write code to do that by calculating, for each rule, its confidence, conf(a ⇒ b).
“Confidence” will be another name for an estimate of the conditional probability of b given a, or
Pr[b | a].
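As a concrete illustration of this definition, confidence can be estimated from raw counts: the number of baskets containing both a and b, divided by the number of baskets containing a. The function below is only a sketch of that idea (it is not part of the graded exercises, and the names are hypothetical):

```python
def confidence(itemsets, a, b):
    """Estimate conf(a => b) = Pr[b | a] from a list of itemsets (sets of items)."""
    count_a = sum(1 for s in itemsets if a in s)            # baskets containing a
    count_ab = sum(1 for s in itemsets if a in s and b in s) # baskets containing both
    return count_ab / count_a if count_a > 0 else 0.0

# For example, treating each word as a basket of letters:
baskets = [set("error"), set("rose"), set("oboe")]
print(confidence(baskets, 'o', 'r'))  # 'r' appears in 2 of the 3 baskets containing 'o'
```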
1.2 Sample text input
Let’s carry out this analysis on a “dummy” text fragment, which graphic designers refer to as the
lorem ipsum:
In [ ]: latin_text = """
Sed ut perspiciatis, unde omnis iste natus error sit
voluptatem accusantium doloremque laudantium, totam
rem aperiam eaque ipsa, quae ab illo inventore
veritatis et quasi architecto beatae vitae dicta
sunt, explicabo. Nemo enim ipsam voluptatem, quia
voluptas sit, aspernatur aut odit aut fugit, sed
quia consequuntur magni dolores eos, qui ratione
voluptatem sequi nesciunt, neque porro quisquam est,
qui dolorem ipsum, quia dolor sit amet consectetur
adipisci[ng] velit, sed quia non numquam [do] eius
modi tempora inci[di]dunt, ut labore et dolore
magnam aliquam quaerat voluptatem. Ut enim ad minima
veniam, quis nostrum exercitationem ullam corporis
suscipit laboriosam, nisi ut aliquid ex ea commodi
consequatur? Quis autem vel eum iure reprehenderit,
qui in ea voluptate velit esse, quam nihil molestiae
consequatur, vel illum, qui dolorem eum fugiat, quo
voluptas nulla pariatur?
At vero eos et accusamus et iusto odio dignissimos
ducimus, qui blanditiis praesentium voluptatum
deleniti atque corrupti, quos dolores et quas
molestias excepturi sint, obcaecati cupiditate non
provident, similique sunt in culpa, qui officia
deserunt mollitia animi, id est laborum et dolorum
fuga. Et harum quidem rerum facilis est et expedita
distinctio. Nam libero tempore, cum soluta nobis est
eligendi optio, cumque nihil impedit, quo minus id,
quod maxime placeat, facere possimus, omnis voluptas
assumenda est, omnis dolor repellendus. Temporibus
autem quibusdam et aut officiis debitis aut rerum
necessitatibus saepe eveniet, ut et voluptates
repudiandae sint et molestiae non recusandae. Itaque
earum rerum hic tenetur a sapiente delectus, ut aut
reiciendis voluptatibus maiores alias consequatur
aut perferendis doloribus asperiores repellat.
        """
        print("First 100 characters:\n {} ...".format(latin_text[:100]))
Exercise 0 (ungraded). Look up and read the translation of lorem ipsum!
Data cleaning. Like most data in the real world, this dataset is noisy. It has both uppercase and
lowercase letters, words have repeated letters, and there are all sorts of non-alphabetic characters.
For our analysis, we should keep all the letters and spaces (so we can identify distinct words), but
we should ignore case and ignore repetition within a word.
For example, the eighth word of this text is “error.” As an itemset, it consists of the three unique
letters, {e, o, r}. That is, treat the word as a set, meaning you only keep the unique letters.
This itemset has three possible itempairs: {e, o}, {e, r}, and {o, r}.
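The enumeration of itempairs described above can be done directly with the standard library. As a sketch (not a required approach), `itertools.combinations` yields all unordered pairs:

```python
from itertools import combinations

word = "error"
itemset = set(word)  # unique letters only: {'e', 'o', 'r'}
# All unordered pairs of distinct letters; sorting first gives a stable order.
itempairs = list(combinations(sorted(itemset), 2))
print(itempairs)  # [('e', 'o'), ('e', 'r'), ('o', 'r')]
```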
Start by writing some code to help “clean up” the input.
Exercise 1 (normalize_string_test: 2 points). Complete the following function,
normalize_string(s). The input s is a string (str object). The function should return a new
string with (a) all characters converted to lowercase and (b) all non-alphabetic, non-whitespace
characters removed.
Clarification. Scanning the sample text, latin_text, you may see things that look like
special cases. For instance, inci[di]dunt and [do]. For these, simply remove the
non-alphabetic characters and only separate the words if there is explicit whitespace.
For instance, inci[di]dunt would become incididunt (as a single word) and [do]
would become do as a standalone word because the original string has whitespace on
either side. A period or comma without whitespace would, similarly, just be treated
as a non-alphabetic character inside a word unless there is explicit whitespace. So
e pluribus.unum basium would become e pluribusunum basium even though your
common-sense understanding might separate pluribus and unum.
Hint. Regard as a whitespace character anything “whitespace-like.” That is, consider
not just regular spaces, but also tabs, newlines, and perhaps others. To detect whitespaces
easily, look for a “high-level” function that can help you do so rather than checking
for literal space characters.
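To illustrate the hint (without giving away a solution): Python strings have a built-in `str.isspace()` method that recognizes any whitespace-like character, not just the literal space:

```python
# str.isspace() is True for spaces, tabs, newlines, and other whitespace,
# and False for letters and punctuation.
for c in [' ', '\t', '\n', 'a', '.']:
    print(repr(c), c.isspace())
```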
In [ ]: def normalize_string(s):
            assert type(s) is str
            #
            # YOUR CODE HERE
            #

        # Demo:
        print(latin_text[:100], "...\n=>", normalize_string(latin_text[:100]), "...")
In [ ]: # `normalize_string_test`: Test cell
norm_latin_text = normalize_string(latin_text)
assert type(norm_latin_text) is str
assert len(norm_latin_text) == 1694
assert all([c.isalpha() or c.isspace() for c in norm_latin_text])
assert norm_latin_text == norm_latin_text.lower()
        print("\n(Passed!)")
Exercise 2 (get_normalized_words_test: 1 point). Implement the following function,
get_normalized_words(s). It takes as input a string s (i.e., a str object). It should return a
list of the words in s, after normalization per the definition of normalize_string(). (That is, the
input s may not be normalized yet.)
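As one possible building block (an illustration, not the required approach): `str.split()`, called with no arguments, splits a string on any run of whitespace, which matches the notion of words used here:

```python
# With no arguments, split() treats any run of whitespace (spaces, tabs,
# newlines) as a single separator and drops empty strings.
s = "Sed ut\tperspiciatis,\nunde   omnis"
print(s.split())  # ['Sed', 'ut', 'perspiciatis,', 'unde', 'omnis']
```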
In [ ]: def get_normalized_words(s):
            assert type(s) is str
            #
            # YOUR CODE HERE
            #

        # Demo:
        print("First five words:\n{}".format(get_normalized_words(latin_text)[:5]))
In [ ]: # `get_normalized_words_test`: Test cell
norm_latin_words = get_normalized_words(norm_latin_text)
assert len(norm_latin_words) == 250