add class 20 regex material

alexanderghose · Oct 17, 2015 · 0d04448
1 parent 6842385
commit 0d04448

Showing 3 changed files with 1,473 additions and 0 deletions.
diff --git a/README.md b/README.md
@@ -545,6 +545,10 @@ Tuesday | Thursday
     * Classification: [LogisticRegression](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html)
     * Helper functions: [Pipeline](http://scikit-learn.org/stable/modules/pipeline.html), [GridSearchCV](http://scikit-learn.org/stable/modules/grid_search.html)
 * Regular expressions
+    * [Baltimore homicide data](data/homicides.txt)
+    * [Regular expressions 101](https://regex101.com/#python): real-time testing of regular expressions
+    * [Reference guide](code/20_regex_reference.py)
+    * Exercise
 
 **Homework:**
 * Your final project is due next week!
@@ -559,6 +563,14 @@ Tuesday | Thursday
 * This [notebook](https://github.com/luispedro/PenalizedRegression/blob/master/PenalizedRegression.ipynb) from chapter 7 of [Building Machine Learning Systems with Python](https://www.packtpub.com/big-data-and-business-intelligence/building-machine-learning-systems-python) has a nice long example of regularized linear regression.
 * There are some special considerations when using dummy encoding for categorical features with a regularized model. This [Cross Validated Q&A](https://stats.stackexchange.com/questions/69568/whether-to-rescale-indicator-binary-dummy-predictors-for-lasso) debates whether the dummy variables should be standardized (along with the rest of the features), and a comment on this [blog post](http://appliedpredictivemodeling.com/blog/2013/10/23/the-basics-of-encoding-categorical-data-for-predictive-models) recommends that the baseline level should not be dropped.
 
+**Regular Expressions Resources:**
+* Google's Python Class includes an excellent [introductory lesson](https://developers.google.com/edu/python/regular-expressions) on regular expressions (which also has an associated [video](https://www.youtube.com/watch?v=kWyoYtvJpe4&index=4&list=PL5-da3qGB5IA5NwDxcEJ5dvt8F9OQP7q5)).
+* Python for Informatics has a nice [chapter](http://www.pythonlearn.com/html-270/book012.html) on regular expressions. (If you want to run the examples, you'll need to download [mbox.txt](http://www.py4inf.com/code/mbox.txt) and [mbox-short.txt](http://www.py4inf.com/code/mbox-short.txt).)
+* [Breaking the Ice with Regular Expressions](https://www.codeschool.com/courses/breaking-the-ice-with-regular-expressions/) is an interactive Code School course, though only the first "level" is free.
+* If you want to go really deep with regular expressions, [RexEgg](http://www.rexegg.com/) includes endless articles and tutorials.
+* [5 Tools You Didn't Know That Use Regular Expressions](http://blog.codeschool.io/2015/07/30/5-tools-you-didnt-know-that-use-regular-expressions/) demonstrates how regular expressions can be used with Excel, Word, Google Spreadsheets, Google Forms, text editors, and other tools.
+* [Exploring Expressions of Emotions in GitHub Commit Messages](http://geeksta.net/geeklog/exploring-expressions-emotions-github-commit-messages/) is a fun example of how regular expressions can be used for data analysis, and [Emojineering](http://instagram-engineering.tumblr.com/post/118304328152/emojineering-part-2-implementing-hashtag-emoji) explains how Instagram uses regular expressions to detect emoji in hashtags.
+
 <!--
 
 -----

diff --git a/code/20_regex_reference.py b/code/20_regex_reference.py
@@ -0,0 +1,211 @@
+'''
+REFERENCE GUIDE: Regular Expressions
+'''
+
+'''
+Rules for Searching:
+
+Search proceeds through string from start to end, stopping at first match
+All of the pattern must be matched
+
+Basic Patterns:
+
+Ordinary characters match themselves exactly
+. matches any single character except newline \n
+\w matches a word character (letter, digit, underscore)
+\W matches any non-word character
+\b matches boundary between word and non-word
+\s matches single whitespace character (space, newline, return, tab, form feed)
+\S matches single non-whitespace character
+\d matches single digit (0 through 9)
+\t matches tab
+\n matches newline
+\r matches return
+\ escapes a special character so that it matches literally, such as period: \.
+
+Basic Python Usage:
+
+match = re.search(r'pattern', string_to_search)
+Returns match object
+If there is a match, access match using match.group()
+If there is no match, match is None
+Use 'r' in front of pattern to designate a raw string
+'''
+
+import re
+
+s = 'my 1st string!!'
+
+match = re.search(r'my', s)     # returns match object
+if match:                       # checks whether match was found
+    print(match.group())        # if match was found, then print result
+
+re.search(r'my', s).group()     # single-line version (without error handling)
+re.search(r'st', s).group()     # 'st'
+re.search(r'sta', s).group()    # error (no match, so re.search returns None)
+re.search(r'\w\w\w', s).group() # '1st'
+re.search(r'\W', s).group()     # ' '
+re.search(r'\W\W', s).group()   # '!!'
+re.search(r'\s', s).group()     # ' '
+re.search(r'\s\s', s).group()   # error
+re.search(r'..t', s).group()    # '1st'
+re.search(r'\s\St', s).group()  # ' st'
+re.search(r'\bst', s).group()   # 'st'
+
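The "error" cases above happen because `re.search()` returns `None` when nothing matches, and `None` has no `.group()` method. A minimal sketch of a guard, assuming a hypothetical helper named `first_match` (not part of the original file):

```python
import re

def first_match(pattern, text):
    """Return the first match as a string, or None if nothing matches."""
    match = re.search(pattern, text)
    return match.group() if match else None

s = 'my 1st string!!'

found = first_match(r'\w\w\w', s)   # three consecutive word characters
missing = first_match(r'sta', s)    # 'sta' does not occur in s

print(found)    # 1st
print(missing)  # None (instead of raising AttributeError)
```

Returning `None` keeps the caller in charge of deciding what "no match" should mean.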
+
+'''
+Repetition:
+
++ 1 or more occurrences of the pattern to its left
+* 0 or more occurrences of the pattern to its left
+? 0 or 1 occurrence of the pattern to its left
+
++ and * are 'greedy': they try to use up as much of the string as possible
+
+Add ? after + or * to make them 'lazy': +? or *?
+'''
+
+s = 'sid is missing class'
+
+re.search(r'miss\w+', s).group()    # 'missing'
+re.search(r'is\w+', s).group()      # 'issing'
+re.search(r'is\w*', s).group()      # 'is'
+
+s = '<h1>my heading</h1>'
+
+re.search(r'<.+>', s).group()   # '<h1>my heading</h1>'
+re.search(r'<.+?>', s).group()  # '<h1>'
+
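The greedy/lazy difference is easier to see with `re.findall()` on a string containing several tags. This is an illustrative sketch; the sample string is made up for the example:

```python
import re

html = '<b>bold</b> and <i>italic</i>'

# Greedy: .+ runs to the LAST '>' in the string, so there is a single match
greedy = re.findall(r'<.+>', html)

# Lazy: .+? stops at the FIRST '>' it can, so each tag is matched separately
lazy = re.findall(r'<.+?>', html)

print(greedy)   # ['<b>bold</b> and <i>italic</i>']
print(lazy)     # ['<b>', '</b>', '<i>', '</i>']
```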
+
+'''
+Positions:
+
+^ match start of a string
+$ match end of a string
+'''
+
+s = 'sid is missing class'
+
+re.search(r'^miss', s).group()  # error
+re.search(r'..ss', s).group()   # 'miss'
+re.search(r'..ss$', s).group()  # 'lass'
+
+
+'''
+Brackets:
+
+[abc] match a or b or c
+\w, \s, etc. work inside brackets, but a period inside brackets is a literal period
+[a-z] match any lowercase letter (dash indicates range unless it's first or last)
+[abc-] match a or b or c or -
+[^ab] match anything except a or b
+'''
+
+s = 'my email is john-doe@gmail.com'
+
+re.search(r'\w+@\w+', s).group()            # 'doe@gmail'
+re.search(r'[\w.-]+@[\w.-]+', s).group()    # 'john-doe@gmail.com'
+
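A few more bracket patterns, as a sketch on a made-up sample string, showing a range, a literal period, a trailing dash, and a negated set:

```python
import re

s = 'room B-12, floor 3.5'

uppercase = re.findall(r'[A-Z]', s)      # range: any uppercase letter
numbers = re.findall(r'[\d.]+', s)       # digits or a literal period
tokens = re.findall(r'[\w-]+', s)        # word characters or a literal dash
symbols = re.findall(r'[^\w\s]', s)      # anything except word chars and whitespace

print(uppercase)    # ['B']
print(numbers)      # ['12', '3.5']
print(tokens)       # ['room', 'B-12', 'floor', '3', '5']
print(symbols)      # ['-', ',', '.']
```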
+
+'''
+Lookarounds:
+
+Lookahead matches a pattern only if it is followed by another pattern
+100(?= dollars) matches '100' only if it is followed by ' dollars'
+
+Lookbehind matches a pattern only if it is preceded by another pattern
+(?<=\$)100 matches '100' only if it is preceded by '$'
+'''
+
+s = 'Name: Cindy, 30 years old'
+
+re.search(r'\d+(?= years? old)', s).group()     # '30'
+re.search(r'(?<=Name: )\w+', s).group()         # 'Cindy'
+
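Lookarounds also combine naturally with `re.findall()`, because the surrounding text is checked but never included in the result. A small sketch on a made-up string:

```python
import re

s = 'apples $3, pears $12, 40 plums'

# Lookbehind: digits only when immediately preceded by a dollar sign
prices = re.findall(r'(?<=\$)\d+', s)

# Lookahead: digits only when immediately followed by ' plums'
plum_count = re.findall(r'\d+(?= plums)', s)

print(prices)       # ['3', '12']
print(plum_count)   # ['40']
```

Note that in Python's `re` module a lookbehind pattern must have a fixed width, so `(?<=\$)` is allowed but `(?<=\$+)` is not.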
+
+'''
+Match Groups:
+
+Parentheses create logical groups inside of match text
+match.group(1) corresponds to first group
+match.group(2) corresponds to second group
+match.group() corresponds to entire match text (as usual)
+'''
+
+s = 'my email is john-doe@gmail.com'
+
+match = re.search(r'([\w.-]+)@([\w.-]+)', s)
+if match:
+    match.group(1)      # 'john-doe'
+    match.group(2)      # 'gmail.com'
+    match.group()       # 'john-doe@gmail.com'
+
+
+'''
+Finding All Matches:
+
+re.findall() finds all matches and returns them as a list of strings
+list_of_strings = re.findall(r'pattern', string_to_search)
+
+If pattern includes one group, a list of that group's text is returned
+If pattern includes two or more groups, a list of tuples is returned
+'''
+
+s = 'emails: joe@gmail.com, bob@gmail.com'
+
+re.findall(r'[\w.-]+@[\w.-]+', s)       # ['joe@gmail.com', 'bob@gmail.com']
+re.findall(r'([\w.-]+)@([\w.-]+)', s)   # [('joe', 'gmail.com'), ('bob', 'gmail.com')]
+
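What `re.findall()` returns depends on how many groups the pattern contains. The sketch below compares zero, one, and two groups on illustrative addresses:

```python
import re

s = 'emails: joe@gmail.com, bob@gmail.com'

# No groups: each item is the full match
no_groups = re.findall(r'[\w.-]+@[\w.-]+', s)

# One group: each item is just that group's text
one_group = re.findall(r'([\w.-]+)@[\w.-]+', s)

# Two groups: each item is a tuple with one entry per group
two_groups = re.findall(r'([\w.-]+)@([\w.-]+)', s)

print(no_groups)    # ['joe@gmail.com', 'bob@gmail.com']
print(one_group)    # ['joe', 'bob']
print(two_groups)   # [('joe', 'gmail.com'), ('bob', 'gmail.com')]
```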
+
+'''
+Option Flags:
+
+Option flags modify the behavior of the pattern matching
+
+default: matching is case sensitive
+re.IGNORECASE: ignore uppercase/lowercase differences ('a' matches 'a' or 'A')
+
+default: period matches any character except newline
+re.DOTALL: allow period to match newline
+
+default: within a string of many lines, ^ and $ match start and end of entire string
+re.MULTILINE: allow ^ and $ to match start and end of each line
+
+Option flag is third argument to re.search() or re.findall():
+re.search(r'pattern', string_to_search, re.IGNORECASE)
+re.findall(r'pattern', string_to_search, re.IGNORECASE)
+'''
+
+s = 'emails: [email protected], [email protected], [email protected]'
+
+re.findall(r'\w+@ga\.co', s)                # ['[email protected]']
+re.findall(r'\w+@ga\.co', s, re.IGNORECASE) # ['[email protected]', '[email protected]']
+
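A sketch of all three flags in action. The addresses and the two-line text below are hypothetical values chosen for illustration and are not taken from the original file:

```python
import re

# Hypothetical addresses, chosen only to illustrate case sensitivity
s = 'emails: ann@ga.co, joe@gmail.com, PAT@GA.CO'

case_sensitive = re.findall(r'\w+@ga\.co', s)
case_insensitive = re.findall(r'\w+@ga\.co', s, re.IGNORECASE)
print(case_sensitive)     # ['ann@ga.co']
print(case_insensitive)   # ['ann@ga.co', 'PAT@GA.CO']

text = 'first line\nsecond line'

# Without MULTILINE, ^ only matches at the start of the whole string
whole_string = re.findall(r'^\w+', text)
each_line = re.findall(r'^\w+', text, re.MULTILINE)
print(whole_string)       # ['first']
print(each_line)          # ['first', 'second']

# Without DOTALL, the period cannot match the newline between the lines
no_dotall = re.search(r'line.second', text)
with_dotall = re.search(r'line.second', text, re.DOTALL)
print(no_dotall)              # None
print(with_dotall.group())    # 'line', a newline, then 'second'
```

Multiple flags can be combined with `|`, for example `re.IGNORECASE | re.MULTILINE`.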
+
+'''
+Substitution:
+
+re.sub() finds all matches and replaces them with a specified string
+new_string = re.sub(r'pattern', r'replacement', string_to_search)
+
+Replacement string can refer to text from matching groups:
+\1 refers to group(1)
+\2 refers to group(2)
+etc.
+'''
+
+s = 'sid is missing class'
+
+re.sub(r'is ', r'was ', s)                          # 'sid was missing class'
+
+s = 'emails: [email protected], [email protected]'
+
+re.sub(r'([\w.-]+)@([\w.-]+)', r'\[email protected]', s)  # 'emails: [email protected], [email protected]'
+
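A sketch of group references in a replacement string. The `example.com` domain and the sample strings are assumptions made for illustration, not necessarily the values used in the original file:

```python
import re

s = 'emails: joe@gmail.com, bob@gmail.com'

# \1 re-inserts the text captured by the first group (the part before '@')
new_domain = re.sub(r'([\w.-]+)@([\w.-]+)', r'\1@example.com', s)
print(new_domain)   # 'emails: joe@example.com, bob@example.com'

# Groups can also be reordered in the replacement
swapped = re.sub(r'(\w+) (\w+)', r'\2 \1', 'hello world')
print(swapped)      # 'world hello'
```

Writing the replacement as a raw string keeps `\1` from being read as an escape sequence by Python itself.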
+
+'''
+Useful to know, but not covered above:
+
+re.split() splits a string by the occurrences of a pattern
+re.compile() compiles a pattern (for improved performance if it's used many times)
+A|B indicates a pattern that can match A or B
+'''
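For completeness, a brief sketch of the three items listed above, using made-up sample strings:

```python
import re

# re.split(): split on a comma or semicolon followed by optional whitespace
parts = re.split(r'[,;]\s*', 'a, b;c,  d')
print(parts)    # ['a', 'b', 'c', 'd']

# re.compile(): build the pattern once, then reuse its methods
word_pattern = re.compile(r'\b\w+\b')
words = word_pattern.findall('sid is missing class')
print(words)    # ['sid', 'is', 'missing', 'class']

# A|B: match either alternative
pets = re.findall(r'cat|dog', 'cat, dog, bird')
print(pets)     # ['cat', 'dog']
```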