- Luis Serrano - Curriculum Leader at the School of Artificial Intelligence at Udacity
Natural Language Processing is one of the most exciting and fastest-growing applications of Machine Learning.
Part 1:
- The Basics of Language Processing and some very interesting Probabilistic Algorithms
- Project - Building a Model for Part of Speech Tagging using Hidden Markov Models
Part 2:
- Go deeper into some of the most exciting Deep Learning Based Language Models
- Project - Building a Machine Translation Model using Deep Neural Networks
Part 3:
- Speech Recognition and Voice User Interface
- Project - End-to-End Speech Recognition Model using Deep Neural Networks
Collaborators:
- Arpan Chakraborty - Computer Vision & Machine Learning at Georgia Tech and Udacity
- Jay Alammar - Computer Scientist & Investment Principal with a very popular Machine Learning blog
- Dana Sheahen - Electrical Engineer with a Master's in Computer Science and a love for all things AI
This program is project-based, and each project has a suggested deadline to keep students on pace towards graduation. In addition to the suggested project deadlines, each class has a term end deadline, by which date all required projects must be passed in order to successfully complete the course and receive a certificate.
Suggested Project Deadlines by Class
September Class
| Project | Suggested Deadline |
|---|---|
| Term Begins | Sep 11 |
| Part of Speech Tagging | Oct 2 |
| Machine Translation | Nov 3 |
| DNN Speech Recognizer | Nov 24 |
| Term End Deadline | Dec 8 |
We highly recommend adding these suggested deadlines to your calendar to stay on track. You can do this in two ways:
- Download this Suggested Project Deadlines ICS file and import it into any calendar of your preference: https://calendar.google.com/calendar/ical/knowlabs.com_1466rajsl2g56upqgb0nb1d7bo%40group.calendar.google.com/public/basic.ics
- Open up this Suggested Project Deadlines Google Calendar, and click the + button in the bottom right corner: http://bit.ly/2wQBUyO
August Class
| Project | Suggested Deadline |
|---|---|
| Term Begins | Aug 14 |
| Part of Speech Tagging | Sep 4 |
| Machine Translation | Oct 6 |
| DNN Speech Recognizer | Oct 27 |
| Term End Deadline | Nov 10 |
We highly recommend adding these suggested deadlines to your calendar to stay on track. You can do this in two ways:
- Download this Suggested Project Deadlines ICS file and import it into any calendar of your preference: https://calendar.google.com/calendar/ical/knowlabs.com_2spt5fjca10ko88q0lnibmrkvs%40group.calendar.google.com/public/basic.ics
- Open up this Suggested Project Deadlines Google Calendar, and click the + button in the bottom right corner: http://bit.ly/2OhT6EB
July Class
| Project | Suggested Deadline |
|---|---|
| Term Begins | Jul 10 |
| Part of Speech Tagging | Jul 31 |
| Machine Translation | Aug 4 |
| DNN Speech Recognizer | Sep 1 |
| Term End Deadline | Sep 22 |
We highly recommend adding these suggested deadlines to your calendar to stay on track. You can do this in two ways:
- Download this Suggested Project Deadlines ICS file and import it into any calendar of your preference: https://calendar.google.com/calendar/ical/knowlabs.com_gigrfi2du74bemsr6lnpk7fuqc%40group.calendar.google.com/public/basic.ics
- Open up this Suggested Project Deadlines Google Calendar, and click the + button in the bottom right corner: http://bit.ly/2KpI16m
June Class
| Project | Suggested Deadline |
|---|---|
| Term Begins | Jun 12 |
| Part of Speech Tagging | Jul 3 |
| Machine Translation | Aug 4 |
| DNN Speech Recognizer | Aug 25 |
| Term End Deadline | Sep 8 |
We highly recommend adding these suggested deadlines to your calendar to stay on track. You can do this in two ways:
- Download this Suggested Project Deadlines ICS file and import it into any calendar of your preference: https://calendar.google.com/calendar/ical/knowlabs.com_bn1237tlqdplpmehdkk7ptbpic%40group.calendar.google.com/public/basic.ics
- Open up this Suggested Project Deadlines Google Calendar, and click the + button in the bottom right corner: http://bit.ly/2MmxPbQ
May Class
| Project | Suggested Deadline |
|---|---|
| Term Begins | May 8 |
| Part of Speech Tagging | May 29 |
| Machine Translation | Jun 30 |
| DNN Speech Recognizer | Jul 21 |
| Term End Deadline | Aug 4 |
We highly recommend adding these suggested deadlines to your calendar to stay on track. You can do this in two ways:
- Download this Suggested Project Deadlines ICS file and import it into any calendar of your preference: https://calendar.google.com/calendar/ical/knowlabs.com_968urc4v3prdt5c38mj0ta8htk%40group.calendar.google.com/public/basic.ics
- Open up this Suggested Project Deadlines Google Calendar, and click the + button in the bottom right corner: http://bit.ly/2JmBR5w
April Class
| Project | Suggested Deadline |
|---|---|
| Term Begins | Apr 10 |
| Part of Speech Tagging | May 1 |
| Machine Translation | Jun 2 |
| DNN Speech Recognizer | Jun 23 |
| Term End Deadline | Jul 7 |
We highly recommend adding these suggested deadlines to your calendar to stay on track. You can do this in two ways:
- Download this Suggested Project Deadlines ICS file and import it into any calendar of your preference: https://calendar.google.com/calendar/ical/knowlabs.com_gll1ne6uaoro89lc1nbjv34fts%40group.calendar.google.com/public/basic.ics
- Open up this Suggested Project Deadlines Google Calendar, and click the + button in the bottom right corner: http://bit.ly/2G5IIiN
What is a “suggested deadline”?
The “suggested deadlines” are flexible deadlines meant to guide students towards graduation; there’s no penalty if you miss the project deadlines we’ve laid out for you. You are welcome to turn in your projects before or after the suggested deadline, as long as you turn everything in by the term end date.
Do I have to turn my projects in by the dates Udacity has laid out?
No. The project deadline dates we’ve laid out are suggestions from our team to help keep you on pace to successfully graduate by the end of the Nanodegree term. The date that you should use as your compass is the term end deadline.
What happens if I don’t complete a project by the date Udacity has laid out?
Nothing. There is no penalty for missing a suggested project deadline and you can turn projects in at any time (before or after the suggested deadline), as long as it is by your term end deadline. Although the project deadlines you see are soft deadlines, we highly recommend you try your best to keep up with the suggested deadlines to ensure you can get through the content at a reasonable pace and so all your projects don’t pile up a few weeks before your term-end deadline.
What do I need to do in order to complete the course?
Students must complete and pass all the required projects by the term-end date in order to successfully graduate from the program. Passing a project means that a Udacity Reviewer has marked your project as “Meets Specifications.” The project review process can take anywhere between a couple of hours to a couple of days, so please plan accordingly as you turn in your projects towards the end of your term. Your projects must be reviewed and passed by the last day of the term. NOTE: you can submit your project as many times as you need to in order to pass.
Can I graduate from the Nanodegree program without completing and passing all the required projects?
No, all required projects must be completed and passed in order to receive a Nanodegree certificate. However, the course may have a few extra projects or labs (those not mentioned in the Deadlines calendar), which are solely for your benefit and are not required for graduation.
Do I need to wait until the last day of the term in order to graduate?
No. Students can graduate from the program at any time after they complete and pass all the required projects. Once your final project has been passed by a Udacity Reviewer, you will see the option to begin the graduation process.
Is the term end date also a suggested deadline?
No. That is the date you should be working towards as the last date you can turn projects in.
What happens if I don’t turn in all my projects by the term end date?
If you do not pass all projects by the last day of the term, you will receive an automatic and free 4-week extension (no need to write in to support to receive this extension) to complete any outstanding projects. You will only receive this extension once, so please do not aim to use this extension unless you absolutely need to and certainly do not leave a majority of your projects unattempted until this extension period.
Do I need to write in to support to get an extension, should I need it?
No. You will automatically receive a free 4-week extension if you don’t complete and pass all the required projects by the term end date.
What are my options if I still don’t complete all the required projects within the 4-week extension?
If you do not complete the term within the extension period, you will be removed from the program and will no longer be able to access course content. To resume access to the course, you would need to re-enroll in a new term and pay the associated enrollment fees again. However, your course progress will carry over to the new term, so you will be able to continue where you left off.
Your experience in the Nanodegree program and community should be an engaging, fulfilling, and positive one. As such, we have outlined the following system for reporting behavior that does not live up to Udacity’s standards, so it can quickly be addressed by our staff.
All reports of suspected violations of the TOU, Community Code of Conduct, or Honor Code should be submitted to [email protected] and will be reviewed. If you witness or are experiencing any violations of our policies, please get in touch with us. Below are the prohibited actions as set forth in our Community Code of Conduct:
- Harassment: Inappropriate, harassing, abusive, discriminatory, derogatory, or violent comments or conduct.
- Discrimination: Offensive comments related to gender or gender identity, sexual orientation, race, ethnicity, religion, national origin, disability, or disease.
- Distributing inappropriate content: Use of sexual, violent, graphic, or derogatory images.
- Bullying: Deliberate intimidation, threats of violence, or violent language directed against another person.
- Sexual harassment: Unwelcome sexual attention.
- Defamation: Obscene, fraudulent, indecent, or libelous acts that defame, abuse, harass, discriminate against, or threaten others.
- Plagiarism: Cheating on any homework assignments, projects, or exams for the Online Courses, including plagiarizing materials created by others.
- Self-injury or Suicide: We do not encourage community postings in Study Groups or Knowledge related to self-injury or suicide. If you or someone you know is exhibiting signs of self-injury or suicide, find help at the Suicide Prevention Lifeline in the U.S. and Befrienders.org globally.
When a potential violation is brought to our attention, we will make every effort to investigate the case thoroughly and make a decision that is fair to all parties.
Thank you, The Udacity Team
- Complete Lesson 1: Welcome to Natural Language Processing, and Lesson 2: Udacity Support
- Complete Lesson 3: Intro to NLP
- Complete Lesson 4: Text Processing
- Finish Lesson 5: Spam Classifier with Naive Bayes, up to the Project Overview
To graduate, you need to pass every project.
The videos, text lessons and quizzes are recommended but optional.
We know from survey and behavioral data that graduating depends primarily on your commitment and your persistence.
But at some point, you will get stuck. Doubt can set in.
What you choose to do when this happens is what separates successful online learners from others.
Don’t panic. Don’t quit. Be patient, and work the problem.
Remember that you will encounter many of the same problems as everyone else.
We are here to help, and so are your classmates.
When you are stuck, or looking for encouragement, you’ll find Udacity mentors and other students pushing you to graduation.
The most important feedback you get from mentors will be directly from your project reviews.
You will also find mentors, classmates and alumni on two platforms to get unblocked fast: Knowledge for searchable, upvoted Q&A, and Student Hub for real time collaboration.
The most important feedback you receive from mentors at Udacity will be reviews of your project submissions.
Projects are key milestones in your Nanodegree with detailed requirements and instructions, and you can submit them from the classroom for a review.
We try to return your review within 24 hours by email.
The mentor will comment on which requirements you passed or failed line-by-line, and provide personalized suggestions for improvement or resources for further learning.
You may re-submit the project until you pass all requirements.
Students tell us they find project reviews from experts to be the most helpful component of their learning.
Udacity mentors work hard to help you improve, with a 4.9 average rating across more than 2,000 project reviews a day.
More than 95% of students who submit a project eventually pass. The secret is to try and never give up!
Knowledge is our platform for asked-and-answered questions about projects and content, supported by Udacity mentors, other students, and alumni.
If you have a question, search for it here. It may be answered, which makes searching first the fastest way to get unstuck.
If you don't see your question, ask it here. You are likely to get an answer within 24 hours and help future students with the same problem.
We see our most successful students learning from helping others.
If you see a question and know the answer, please answer!
You are helping another student get unstuck, and every future student who encounters the same problem.
You will get an email alerting you to the 🎉 and 🙏 comments from happy students.
We display search results and answers, and moderate posts in Knowledge based on how useful they are.
Our goal is for all content on Knowledge to be structured, searchable and useful to other students.
When you use Knowledge, please remember to upvote or downvote questions and answers, and help make it as useful as possible for others.
Student Hub is our real time collaboration platform where you can work with peers and mentors.
You will find Community rooms with other students and alumni.
And Guided Study rooms for each project, with Udacity mentors who offer guidance and answer questions.
We have an amazing team of mentors who work hard to help you pass each project and they are incentivized for you to succeed.
Udacity mentors typically respond within 24 hours. However, they live all over the world and usually have active careers in industry, so expect that response times may vary.
Check out Student Hub to say hello to your classmates and mentors.
There are several ways in which you will receive support during the program from Udacity's network of Reviewers, as well as your fellow students.
Project Reviews
For any projects you submit, you will receive detailed feedback from a project Reviewer.
These reviews are meant to give you personalized code feedback and to tell you what can be improved in your code (if anything)! This feedback is very much like the feedback you would receive working on a small team of engineers. Sometimes, a reviewer might ask you to resubmit a project to meet specifications. In that case, an indication of the needed changes will also be provided. Note that you can submit a project as many times as needed to pass.
This feedback is especially useful when building artificially intelligent agents. Sometimes your agents don't behave as expected -- see the untrained soccer agents below for an example! Your project Reviewer is available to help you figure out how you can change your code to help your agents master their tasks!
Knowledge
To ask questions and get answers from Udacity staff and your peers, we have the Knowledge platform. If you have ever used StackOverflow, this platform is similar. You can post new questions here, add answers or conversational comments, and upvote questions and answers. You can also search for existing answers and filter by Nanodegree program and project. Knowledge is accessible from the classroom as a lightbulb icon that you'll see at the bottom left of your navigation bar.
https://knowledge.udacity.com/?nanodegree=af1412cc-594c-11e8-aa9f-1fd58b4f9291
Feedback
In order to keep our content up-to-date and address issues quickly, we've set up a Waffle board to track error reports and suggestions.
https://waffle.io/udacity/nlpnd-waffle-issues
If you find an error, check there to see if it has already been filed. If it hasn't, you can file an issue by clicking on the "Add issue" button, adding a title, and entering a description in the details (you will need a GitHub account for this). Links and screenshots, if available, are always appreciated!
Our first section will be taught by Arpan Chakraborty. Arpan has a Ph.D. in Computer Science and has taught for several years at Udacity and at Georgia Tech.
Everything in NLP typically starts with raw text produced by humans. This text is first processed using some simple transformations, such as:
- Splitting it into individual words
- Reducing verbs to their root form
You need to do this before performing any other analysis or training complex models. This stage may sound simple, but you have to be careful about how you process your raw text, because it can affect the results you obtain further down the line.
Language is an important medium for human communication; it allows us to convey information, express our ideas, and give instructions to others.
Some philosophers argue that it enables us to form complex thoughts and reason about them; it may turn out to be a critical component of human intelligence.
Now consider the various artificial systems we interact with every day:
- phones
- cars
- websites
- coffee machines
It's natural to expect them to be able to process and understand human language, right? Yet, computers are still lagging behind. No doubt, we have made some incredible progress in the field of Natural Language Processing, but there is still a long way to go. And that's what makes this an exciting and dynamic area of study.
In this lesson, you will not only get to know more about the applications and challenges in NLP, but also learn how to design an intelligent application that uses NLP techniques and deploy it on a scalable platform.
What makes it so hard for computers to understand us?
One drawback of human languages (or feature, depending on how you look at it) is the lack of a precisely defined structure.
To understand how that makes things difficult, let's take a look at some languages that are more structured.
Mathematics, for instance, uses a structured language. When I write y = 2x + 5, there is no ambiguity in what I want to convey: I'm saying that the variable y is related to the variable x as two times x plus five.
Formal logic also uses a structured language.
For example, consider the expression Parent(X, Y) ^ Parent(X, Z) -> Sibling(Y, Z). This statement asserts that if X is a parent of Y and X is a parent of Z, then Y and Z are siblings.
A set of structured languages that may be more familiar to you are scripting and programming languages. Consider this SQL statement:
SELECT name, email
FROM users
WHERE name LIKE 'A%'
We are asking the database to return the names and e-mail addresses of all users whose names begin with an A. These languages are designed to be as unambiguous as possible and are suitable for computers to process.
Structured languages are easy for computers to parse and understand because they are defined by a strict set of rules or grammar. There are standard forms of expressing such grammars, and algorithms that can parse properly formed statements to understand exactly what is meant.
When a statement doesn't match the prescribed grammar, a typical computer doesn't try to guess the meaning; it simply gives up. Such violations of grammatical rules are reported as syntax errors.
>>> say hello
SyntaxError: invalid syntax
QUIZ QUESTION
Here is a grammar, specified in a simple notation known as Backus-Naur Form or BNF [1]:
_S_ → 0 _S_ 0
_S_ → 1 _S_ 1
_S_ → 00
_S_ → 11
Which of the following sentences are valid according to this grammar?
[1] For a quick review, check out this brief segment on grammars from our Intro to Computer Science course.
https://classroom.udacity.com/courses/cs101/lessons/48299949/concepts/487192400923
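To make the notation concrete, here is a minimal Python sketch (an illustration, not part of the course materials) of a recursive recognizer for this particular grammar; the function name is an assumption:

def is_valid(s):
    """Return True if s can be derived from S in the grammar above."""
    if s in ("00", "11"):                  # base rules: S -> 00, S -> 11
        return True
    if len(s) >= 4 and s[0] == s[-1] and s[0] in "01":
        return is_valid(s[1:-1])           # recursive rules: S -> 0 S 0, S -> 1 S 1
    return False

print(is_valid("0110"))  # True:  S -> 0 S 0 -> 0 11 0
print(is_valid("010"))   # False: no derivation produces it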
Grammar
So in order to learn about programming, we need to learn a new language. This will be a way to describe what we want the computer to do in a much more precise way than we could in a natural language like English. And it's a way to describe programs that the Python interpreter can run.
One of the best ways to learn a programming language is to just try things. You can try that in the Python interpreter that's running in your browser. Let's, for example, try running
2 + 2 +
In English, someone could probably guess that the value of 2 + 2 + should be 4.
In Python, when we try running this, we get an error. And the reason we get an error is that this is not actually part of the Python language. The Python interpreter only knows how to evaluate code that's part of the Python language. If you try to evaluate something that's not part of the Python language, it will give you an error.
Errors look a bit scary, the way they print out. But there's nothing bad that can happen. It's perfectly okay to try running code. If it produces an error, that's one of the ways to learn about programming.
The error we got here is what's called a syntax error. That means that what we tried to evaluate is not actually part of the Python language.
Like English, Python has a grammar that defines what strings are in the language. In English, we can make lots of sentences that are not completely grammatical, and people still understand them, but there's some underlying grammar behind the language.
Those of you who are native English speakers might have learned rules like this in what was once called grammar school. Those of you who learned English as a second language probably learned rules like this when you were learning English.
So, English has a rule that says you can make a sentence, by combining a subject with a verb, followed by an object:
Almost every language has a rule sort of like this. The order of the subject and the verb and the object might be different, but there's a way to combine those three things to form a sentence.
The subject could be a noun. The object could be a noun. And then for each of these parts of speech, there are lots of words they could be.
So a verb could be the word eat. A verb could also be the word like, and there are lots of other words that the verb could be.
A noun could be the word Python, a noun could be the word cookies.
The actual English grammar is of course, much larger and more complex than this. But we can still think of it as having rules like this that allow us to form sentences from the parts of speech that we know, from the words that make those parts of speech.
The way we're writing grammars here is a notation called Backus-Naur Form, invented by John Backus.
John Backus was the lead designer of the Fortran programming language back in the 1950s at IBM. This was one of the first widely used programming languages, and the way they described the Fortran language was with lots of examples and text explaining what they meant. This is a shot from the actual manual for the first version of Fortran.
This worked okay; many programmers were able to understand it and guess correctly what it meant. But it was not nearly precise enough. When it came time to design a later language, called ALGOL, it became clear that this informal way of describing languages wasn't precise enough, and John Backus invented the notation that we're using here to describe languages.
Sentence -> Subject Verb Object
Subject -> Noun
Object -> Noun
Verb -> Eat
Verb -> Like
Noun -> I
Noun -> Python
Noun -> Cookies
Backus-Naur Form
The purpose of Backus-Naur Form is to be able to precisely describe exactly the language in a way that's very simple and very concise.
So each rule has a form like this, where on the left side there's a non-terminal. Sometimes they're written with brackets around them.
There's an arrow, and then on the right side there's a replacement. The replacement can be anything: a sequence of non-terminals (Sentence can be replaced with Subject followed by Verb followed by Object), a single non-terminal, or a terminal.
What's special about the terminals is that they never appear on the left side of a rule. Once we get a terminal, we're done; there's nothing else we can replace it with.
So all the rules have this form:
<Non-terminal> -> replacement
We can form a sentence by starting from some non-terminal, usually whichever one is written at the top left, in this case the one I called Sentence. Then, by following the rules, we keep replacing non-terminals with their replacements until we're left with only terminals.
Here's an example starting from Sentence, using the grammar above. We only have one rule to choose from where Sentence is on the left side, so we're going to replace Sentence with Subject Verb Object.
sentence
|
subject verb object
Now we have a lot of choices. We can pick any of the non-terminals we have left, find a rule where that non-terminal is on the left side, and do the replacement.
I'm going to start with the left one: Subject. We only have one replacement rule for Subject, so we can replace Subject with Noun.
The others stay as they are, so we still have Verb and we still have Object. Now we can keep going.
subject verb object
|
noun verb object
We can pick the first one again. It's still a non-terminal, so we can still do replacements. With Noun we've got three choices, and we can pick any one of them.
I'm going to pick the first one and replace Noun with the terminal I. Now we've got a terminal, so we're done with that replacement. Verb and Object stay the same.
noun verb object
|
I
As a separate step, we're going to find a rule that matches Verb. We have two choices. I'll pick the second one and replace Verb with like.
noun verb object
| |
I like
We still have Object. Object is a non-terminal, so we have to keep replacing it until we're done. We have one rule for Object: we can replace Object with Noun.
noun verb object
| | |
I like noun
Now we have three rules for Noun. I'm going to pick the second rule and replace Noun with Python. What I've done here is what's called a derivation.
I like noun
|
Python
A derivation just means starting from some non-terminal and following the rules to derive a sequence of terminals. We're done when we have only terminals left, and we have derived a sentence in the grammar. In this case we produced the sentence I like Python, but there are lots of other sentences we could have produced, starting from the same non-terminal, if we picked different rules to follow.
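Since a derivation is just repeated rule application, it is straightforward to sketch in code. Here is a minimal Python illustration (not course code; the dictionary encoding of the rules is an assumption) that performs a random derivation from the grammar above:

import random

# Each non-terminal maps to the right-hand sides of its rules.
GRAMMAR = {
    "Sentence": [["Subject", "Verb", "Object"]],
    "Subject":  [["Noun"]],
    "Object":   [["Noun"]],
    "Verb":     [["eat"], ["like"]],
    "Noun":     [["I"], ["Python"], ["cookies"]],
}

def derive(symbol="Sentence"):
    """Replace non-terminals until only terminals remain."""
    if symbol not in GRAMMAR:                    # terminals never appear on a left side
        return [symbol]
    words = []
    for sym in random.choice(GRAMMAR[symbol]):   # pick one rule for this non-terminal
        words.extend(derive(sym))
    return words

print(" ".join(derive()))  # e.g. "I like Python" or "cookies eat I"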
The languages we use to communicate with each other also have defined grammatical rules. And indeed, in some situations we use simple structured sentences, but for the most part human discourse is complex and unstructured.
Despite that, we seem to be really good at understanding each other and even ambiguities are welcome to a certain extent.
So, what can computers do to make sense of unstructured text? Here are some preliminary ideas:
PROCESS WORDS & PHRASES
- KEYWORDS
- PARTS OF SPEECH
- NAMED ENTITIES
- DATES & QUANTITIES
PARSE SENTENCES
- STATEMENTS
- QUESTIONS
- INSTRUCTIONS
ANALYZE DOCUMENTS
- FREQUENT & RARE WORDS
- TONE & SENTIMENT
- DOCUMENT CLUSTERING
You can imagine that, building on top of these ideas, computers can do a whole lot with unstructured text, even if they cannot understand it like we do.
QUIZ QUESTION
Let's see if you can identify parts of speech! Here is a sample English sentence:
She works at IBM.
For each word in the sentence, label it with the correct part of speech. Here a named entity is essentially a proper noun.
She -> pronoun
works -> verb
at -> preposition
IBM -> named entity
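If you'd like to check tags like these programmatically, NLTK, which this course uses later, ships a ready-made tagger. A small sketch, assuming the required NLTK data packages have already been downloaded:

import nltk
# One-time setup: nltk.download('punkt'); nltk.download('averaged_perceptron_tagger')

tokens = nltk.word_tokenize("She works at IBM.")
print(nltk.pos_tag(tokens))
# [('She', 'PRP'), ('works', 'VBZ'), ('at', 'IN'), ('IBM', 'NNP'), ('.', '.')]
# Penn Treebank tags: PRP = pronoun, VBZ = verb, IN = preposition, NNP = proper noun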
Let's implement a simple function that is often used in Natural Language Processing: Counting word frequencies.
Consider this passage of text:
As I was waiting, a man came out of a side room, and at a glance I was sure he must be Long John. His left leg was cut off close by the hip, and under the left shoulder he carried a crutch, which he managed with wonderful dexterity, hopping about upon it like a bird. He was very tall and strong, with a face as big as a ham—plain and pale, but intelligent and smiling. Indeed, he seemed in the most cheerful spirits, whistling as he moved about among the tables, with a merry word or a slap on the shoulder for the more favoured of his guests.
— Excerpt from Treasure Island, by Robert Louis Stevenson.
In the following coding exercise, we have provided code to load the text from a file, call the function count_words() to obtain word counts (which you need to implement), and print the 10 most common and least common unique words.
Complete the portions marked as TODO to count how many times each unique word occurs in the text.
input.txt
As I was waiting, a man came out of a side room, and at a glance I was sure he must be Long John. His left leg was cut off close by the hip, and under the left shoulder he carried a crutch, which he managed with wonderful dexterity, hopping about upon it like a bird. He was very tall and strong, with a face as big as a ham—plain and pale, but intelligent and smiling. Indeed, he seemed in the most cheerful spirits, whistling as he moved about among the tables, with a merry word or a slap on the shoulder for the more favoured of his guests.
count_words.py
"""Count words."""
def count_words(text):
"""Count how many times each unique word occurs in text."""
counts = dict() # dictionary of { <word>: <count> } pairs to return
# TODO: Convert to lowercase
# TODO: Split text into tokens (words), leaving out punctuation
# (Hint: Use regex to split on non-alphanumeric characters)
# TODO: Aggregate word counts using a dictionary
return counts
def test_run():
with open("input.txt", "r") as f:
text = f.read()
counts = count_words(text)
sorted_counts = sorted(counts.items(), key=lambda pair: pair[1], reverse=True)
print("10 most common words:\nWord\tCount")
for word, count in sorted_counts[:10]:
print("{}\t{}".format(word, count))
print("\n10 least common words:\nWord\tCount")
for word, count in sorted_counts[-10:]:
print("{}\t{}".format(word, count))
if __name__ == "__main__":
test_run()
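If you get stuck, here is one possible completion of the TODOs following the regex hint; treat it as a sketch, not the official solution:

import re

def count_words(text):
    """Count how many times each unique word occurs in text."""
    counts = dict()
    text = text.lower()                       # convert to lowercase
    words = re.split(r"[^a-zA-Z0-9]+", text)  # split on runs of non-alphanumeric characters
    for word in words:
        if word:                              # skip empty strings produced by the split
            counts[word] = counts.get(word, 0) + 1
    return counts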
So what is stopping computers from becoming as capable as humans at understanding natural language? Part of the problem lies in the variability and complexity of our sentences.
Consider this excerpt from a movie review:
- "I was lured to see this on the promise of a smart, witty slice of old-fashioned fun and intrigue. I was conned."
Although it starts with some potentially positive words, it turns out to be a strongly negative review. Sentences like this might be somewhat entertaining for us, but computers tend to make mistakes when trying to analyze them.
But there is a bigger challenge that makes NLP harder than you might think. Take a look at this sentence:
- "The sofa didn't fit through the door because it was too narrow."
What does it refer to? Clearly, it refers to the door.
Now consider a slight variation of this sentence:
- "The sofa didn't fit through the door because it was too wide."
What does it refer to in this case? Here, it's the sofa. Think about it.
To understand the proper meaning, or semantics, of the sentence, you implicitly applied your knowledge about the physical world: that wide things don't fit through narrow things. You may have experienced a similar situation before.
You can imagine that there are countless other scenarios in which some knowledge or context is indispensable for correctly understanding what is being said.
Contextual Dependence
Can you think of a similar sentence, in English or your own native language, where some contextual or background knowledge is needed to understand the intended meaning?
Natural language processing is one of the fastest-growing fields in the world, and NLP is making its way into a number of products and services that we use every day.
Let's begin with an overview of how to design an end-to-end NLP pipeline.
You start with raw text in whatever form it is available, process it, extract relevant features, and build models to accomplish various NLP tasks
Now that I think about it, that is kind of like refining crude oil.
Anyway, you'll learn how these different stages in the pipeline depend on each other. You'll also learn how to make design decisions and how to choose existing libraries and tools to perform each step.
NLP Pipeline
- Text Processing
- Feature Extraction
- Modeling
Each stage transforms text in some way and produces a result that the next stage needs.
For example, the goal of text processing is to take raw input text, clean it, normalize it, and convert it into a form that is suitable for feature extraction.
Similarly, the next stage needs to extract and produce feature representations that are appropriate for the type of model you're planning to use and the NLP task you're trying to accomplish.
When you're building such a pipeline, your workflow may not be perfectly linear.
Let's say, you spend some time implementing text processing functions, then make some simple feature extractors, and then design a baseline statistical model.
But then, maybe you are not happy with the results.
So you go back and rethink what features you need, and that in turn, can make you change your processing routines.
Keep in mind that this is a very simplified view of natural language processing.
Your application may require additional steps.
Why do we need to process text?
Websites are a common source of textual information.
For the purposes of natural language processing, you would typically want to get rid of all or most of the HTML tags and retain only plain text.
You can also remove or set aside any URLs or other items not relevant to your task.
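One common way to do this in Python is with the requests and BeautifulSoup libraries; a minimal sketch, assuming both packages are installed and using a placeholder URL:

import requests
from bs4 import BeautifulSoup

html = requests.get("https://example.com").text  # placeholder URL
soup = BeautifulSoup(html, "html.parser")
text = soup.get_text()                           # strip HTML tags, keep the text content
print(text[:200])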
The Web is probably the most common and fastest growing source of textual content. But you may also need to consume PDFs, Word documents or other file formats.
Or your raw input may even come from a speech recognition system, or from a book scan using OCR.
Some knowledge of the source medium can help you properly handle the input.
In the end, your goal is to extract plain text that is free of any source-specific markers or constructs that are not relevant to your task.
Once you have obtained plain text, some further processing may be necessary.
For instance, capitalization doesn't usually change the meaning of a word. We can convert all the words to the same case so that they're not treated differently.
Punctuation marks that we use to indicate pauses, etc. can also be removed.
Some common words in a language often help provide structure but don't add much meaning: for example, a, and, the, of, are, and so on. Sometimes it's best to remove them if that helps reduce the complexity of the procedures you want to apply later. A sketch of these normalization steps follows below.
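Here is a minimal Python sketch of these normalization steps, assuming NLTK's stop word list has been downloaded via nltk.download('stopwords'):

import re
from nltk.corpus import stopwords

def normalize(text):
    """Lowercase, remove punctuation, tokenize, and drop stop words."""
    text = text.lower()                         # case normalization
    text = re.sub(r"[^a-z0-9]", " ", text)      # replace punctuation with spaces
    words = text.split()                        # split into tokens
    stop = set(stopwords.words("english"))
    return [w for w in words if w not in stop]  # remove common stop words

print(normalize("The first time you see the results, it may look a bit odd."))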
We now have clean normalized text. Can we feed this into a statistical or machine learning model? Not quite. Let's see why.
Text data is represented on modern computers using an encoding such as ASCII or Unicode that maps every character to a number.
Computers store and transmit these values as binary, zeros and ones. These numbers also have an implicit ordering: 65 is less than 66, which is less than 67. But does that mean A is less than B, and B is less than C? No. In fact, that would be an incorrect assumption to make, and it might mislead our natural language processing algorithms.
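You can inspect these character codes directly in Python:

print([ord(c) for c in "ABC"])  # [65, 66, 67] -- the underlying code points
print(ord("A") < ord("B"))      # True numerically, but meaningless linguistically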
Moreover, individual characters don't carry much meaning at all. It is words that we should be concerned with, but computers don't have a standard representation for words.
Yes, internally they are just sequences of ASCII or Unicode values but they don't quite capture the meanings or relationships between words.
Compare this with how an image is represented in computer memory. Each pixel value contains the relative intensity of light at that spot in the image. For a color image, we keep one value per primary color; red, green, and blue. These values carry relevant information. Two pixels with similar values are perceptually similar. Therefore, it makes sense to directly use pixel values in a numerical model. Yes, some feature engineering may be necessary such as edge detection or filtering, but pixels are a good starting point.
So the question is, how do we come up with a similar representation for text data that we can use as features for modeling?
The answer again depends on what kind of model you're using and what task you're trying to accomplish.
If you want to use a graph-based model to extract insights, you may want to represent your words as symbolic nodes with relationships between them, like WordNet.
For statistical models, however, you need some sort of numerical representation. Even then, you have to think about the end goal.
If you're trying to perform a document-level task, such as spam detection or sentiment analysis, you may want to use a per-document representation such as bag-of-words or doc2vec.
If you want to work with individual words and phrases, such as for text generation or machine translation, you'll need a word-level representation such as word2vec or GloVe.
There are many ways of representing textual information, and it's only through practice that you can learn what you need for each problem.
This includes designing a model, usually a statistical or a machine learning model, fitting its parameters to training data using an optimization procedure, and then using it to make predictions about unseen data.
The nice thing about working with numerical features is that it allows you to utilize pretty much any machine learning model. This includes support vector machines, decision trees, neural networks, or any custom model of your choice.
You could even combine multiple models to get better performance. How you utilize the model is up to you. You can deploy it as a web-based application, package it up into a handy mobile app, integrate it with other products, services, and so on. The possibilities are endless.
In this lesson, you'll learn how to read text data from different sources and prepare it for feature extraction.
You'll begin by cleaning it to remove irrelevant items, such as HTML tags. You will then normalize text by converting it into all lowercase, removing punctuations and extra spaces. Next, you will split the text into words or tokens and remove words that are too common, also known as stop words. Finally, you will learn how to identify different parts of speech, named entities, and convert words into canonical forms using stemming and lemmatization.
After going through all these processing steps, your text may look very different, but it captures the essence of what was being conveyed, in a form that is easier to work with.
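As a preview of that last step, here is a small NLTK sketch of stemming and lemmatization (the wordnet data package must be downloaded first):

from nltk.stem.porter import PorterStemmer
from nltk.stem.wordnet import WordNetLemmatizer
# One-time setup for the lemmatizer: nltk.download('wordnet')

print(PorterStemmer().stem("branching"))              # branch -- crude suffix stripping
print(WordNetLemmatizer().lemmatize("was", pos="v"))  # be -- dictionary-based root form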
Coding exercises that accompany this lesson can be accessed in two ways. The easiest way is to click Next to open the project workspace in the classroom.
You'll be using this workspace throughout this whole lesson, so we suggest that you open it in a different tab while following along with the lessons. We also recommend saving your work often in the workspace.
However, if you want to run the Jupyter server on your own system, you can download the files from this GitHub repo: https://github.com/udacity/AIND-NLP
1. Clone the repo on your local machine.
2. Follow the instructions provided in README.md to set up your Python environment.
3. (Optional) Download all nltk data packages (10+ GB), or get them later as needed.
4. Launch the notebook: jupyter notebook text_processing.ipynb
Then follow along as you go through the lesson. Feel free to pause and experiment with the tools and libraries you're learning!
Introduction
Udacity Workspaces with GPU support are available for some projects as an alternative to manually configuring your own remote server with GPU support. These workspaces provide a Jupyter notebook server directly in your browser. This lesson will briefly introduce the Workspaces interface.
Important Notes:
- Workspaces sessions are connections from your browser to a remote server. Each student has a limited number of GPU hours allocated on the servers (the allocation is significantly more than completing the projects is expected to take). There is currently no limit on the number of Workspace hours when GPU mode is disabled.
- Workspace data stored in the user's home folder is preserved between sessions (and can be reset as needed, e.g., to get project updates).
- Only 3 gigabytes of data can be stored in the home folder.
- Workspace sessions are preserved if your connection drops or your browser window is closed; simply return to the classroom and re-open the workspace page. However, workspace sessions are automatically terminated after a period of inactivity. This will prevent you from leaving a session connection open and burning through your time allocation. (See the section on active connections below.)
- The kernel state is preserved as long as the notebook session remains open, but it is not preserved if the session is closed. If you exit the notebook for more than half an hour and the session is closed, you will need to re-run any previously-run cells before continuing.
Overview
(Image: the default Workspaces interface)
When the workspace opens, you'll see the normal Jupyter file browser. From this interface you can open a notebook file, start a remote terminal session, enable the GPU, submit your project, reset the workspace data, and more. Clicking the three bars in the top left corner above the Jupyter logo will toggle hiding the classroom lessons sidebar.
NOTE: You can always return to the file browser page from anywhere else in the workspace by clicking the Jupyter logo in the top left corner.
Opening a notebook
(Image: view of the project notebook)
Clicking the name of a notebook (*.ipynb) file in the file list will open a standard Jupyter notebook view of the project. The notebook session will remain open as long as you are active, and will be automatically terminated after 30 minutes of inactivity.
You can exit a notebook by clicking on the Jupyter logo in the top left corner.
NOTE: Notebooks continue to run in the background unless they are stopped. IF GPU MODE IS ACTIVE, IT WILL REMAIN ACTIVE AFTER CLOSING OR STOPPING A NOTEBOOK. YOU CAN ONLY STOP GPU MODE WITH THE GPU TOGGLE BUTTON. (See next section.)
Enabling GPU Mode
(Image: the GPU toggle button)
GPU Workspaces can also be run without time restrictions when the GPU mode is disabled. The "Enable"/"Disable" button (circled in red in the image) can be used to toggle GPU mode. NOTE: Toggling GPU support may switch the physical server your session connects to, which can cause data loss UNLESS YOU CLICK THE SAVE BUTTON BEFORE TOGGLING GPU SUPPORT.
ALWAYS SAVE YOUR CHANGES BEFORE TOGGLING GPU SUPPORT.
Keeping Your Session Active
Workspaces automatically disconnect after 30 minutes of user inactivity—which means that workspaces can disconnect during long-running tasks (like training neural networks). We have provided a utility that can keep your workspace sessions active for these tasks. However, keep the following guidelines in mind:
-
Do not try to permanently hold the workspace session active when you do not have a process running (e.g., do not try to hold the session open in the background)—the limits are in place to preserve your GPU time allocation; there is no guarantee that you'll receive additional time if you exceed the limit.
-
Make sure that you save the results of the long running task to disk as soon as the task ends (e.g., checkpoint your model parameters for deep learning networks); otherwise the workspace will disconnect 30 minutes after the active process ends, and the results will be lost.
The workspace_utils.py module (available here) includes an iterator wrapper called keep_awake and a context manager called active_session that can be used to maintain an active session during long-running processes. The two functions are equivalent, so use whichever fits better in your code. NOTE: The file may be incorrectly downloaded as workspace-utils.py (note the dash instead of an underscore in the filename). Make sure to correct the filename before uploading to your workspace; Python cannot import from file names that include hyphens.
Example using keep_awake:

from workspace_utils import keep_awake

for i in keep_awake(range(5)):
    pass  # anything inside this loop keeps the workspace active; do iteration with lots of work here
Example using active_session:

from workspace_utils import active_session

with active_session():
    pass  # do long-running work here
Submitting a Project
(Image: the Submit Project button)
Some workspaces are able to directly submit projects on your behalf (i.e., you do not need to manually submit the project in the classroom). To submit your project, simply click the "Submit Project" button (circled in red in the above image).
If you do not see the "Submit Project" button, then project submission is not enabled for that workspace. You will need to manually download your project files and submit them in the classroom.
NOTE: YOU MUST ENSURE THAT YOUR SUBMISSION INCLUDES ALL REQUIRED FILES BEFORE SUBMITTING -- INCLUDING ANY FILE CONVERSIONS (e.g., from ipynb to HTML)
Opening a Terminal
(Image: the "New" menu button)
Jupyter workspaces support several views, including the file browser and notebook view already covered, as well as shell terminals. To open a terminal shell, click the "New" menu button at the top right of the file browser view and select "Terminal".
Terminals
(Image: the Jupyter terminal shell interface)
Terminals provide a full Bash shell that you can use to install or update software packages, fetch updates from github repositories, or run any other terminal commands. As with the notebook view, you can return to the file browser view by clicking on the Jupyter logo at the top left corner of the window.
NOTE: Your data & changes are persistent across workspace sessions. Any changes you make will need to be repeated if you later reset your workspace data.
Resetting Data
(Image: the Menu button)
The "Menu" button in the bottom left corner provides support for resetting your Workspaces. The "Refresh Workspace" button will refresh your session, which has no effect on the changes you've made in the workspace.
The "Reset Data" button discards all changes and restores a clean copy of the workspace. Clicking the button will open a dialog that requires you to type "Reset data" in a confirmation dialog. ALL OF YOUR DATA WILL BE LOST.
Resetting should only be required if Udacity makes changes to the project and you can't get them via git pull, or if you destroy the contents of the workspace. If you do need to reset your data, you are strongly encouraged to download a copy of your work from the file interface before clicking Reset Data.
Best Practices
Follow the best practices outlined below to avoid common issues with Workspaces.
- Keep your home folder small
Your home folder (including subfolders) must be less than 2GB or you may lose data when your session terminates. You may use directories outside of the home folder for more space, but only the contents of the home folder are persisted between sessions and submitted with your project.
NOTE: Your home folder (including subfolders) must be less than 25 megabytes to submit as a project. If the site becomes unresponsive when you try to submit your project, it is likely that your home folder is too large. You can check the size of your home folder by opening a terminal and running the command du -h . | tail -1
You can use ls to list the files in your terminal and rm to remove unwanted files. (Search for both commands online to find example usage.)
- What's the "home folder"?
"Home folder" refers to the directory where users files are stored (compared to locations where system files are stored, for example). (Ref. Wikipedia: home directory) In Workspaces, the home folder is
/home/workspace
. Any files in this folder or any subfolder are part of your home folder contents, which means they're saved between sessions and transferred automatically when you switch between CPU/GPU mode.
The folder /tmp
is not in the home folder; files in any folder outside your home folder are not persisted between sessions or transferred between CPU/GPU mode. You can create a folder outside the home folder using the command mkdir
from a terminal. For example you could create a temporary folder to store data using mkdir -p /data
to create a folder at the root directory. You will need to recreate the folder and recreate any data inside every time you start a new Workspace session.
- Keeping your connection alive during long processes Workspaces automatically disconnect when the connection is inactive for about 30 minutes, which includes inactivity while deep learning models are training. You can use the workspace_utils.py module here to keep your connection alive during training. The module provides a context manager and an iterator wrapper—see example use below.
NOTE: The script sometimes raises a connection error if the request is opened too frequently; just restart the jupyter kernel & run the cells again to reset the error.
NOTE: These scripts will keep your connection alive while the training process is running, but the workspace will still disconnect 30 minutes after the last notebook cell finishes. Modify the notebook cells to save your work at the end of the last cell or else you'll lose all progress when the workspace terminates.
Example using context manager:

from workspace_utils import active_session

with active_session():
    pass  # do long-running work here

Example using iterator wrapper:

from workspace_utils import keep_awake

for i in keep_awake(range(5)):
    pass  # do iteration with lots of work here
- Manage your GPU time
It is important to avoid wasting GPU time in Workspace projects that have GPU acceleration enabled. The benefits of GPU acceleration are most useful when evaluating deep learning models—especially during training. In most cases, you can build and test your model (including data pre-processing, defining model architecture, etc.) in CPU mode, then activate GPU mode to accelerate training.
- Handling "Out of Memory" errors
This issue isn't specific to Workspaces; rather, it is an apparent issue between PyTorch & Jupyter, where Jupyter reports "out of memory" after a cell crashes. Jupyter holds references to active objects as long as the kernel is running, including objects created before an error is raised. This can cause Jupyter to keep large objects in memory long after they are no longer required. The only known solution so far is to reset the kernel and run the notebook cells again.
The processing stage begins with reading text data.
Depending on your application, that can be from one of several sources.
The simplest source is a plain text file on your local machine. We can read it in using Python's built-in file input mechanism.
Text data may also be included as part of a larger database or table.
Here, we have a CSV file containing information about some news articles. We can read this in using pandas very easily. Pandas includes several useful string manipulation methods that can be applied to an entire column at once.
For instance, converting all values to lowercase.
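A minimal pandas sketch; the file name and column name here are hypothetical:

import pandas as pd

df = pd.read_csv("news.csv")           # hypothetical CSV of news articles
df["title"] = df["title"].str.lower()  # vectorized string method over a whole column
print(df["title"].head())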
Sometimes, you may have to fetch data from an online resource, such as a web service or API.
In this example, we use the requests library in Python to obtain a quote of the day from a simple API, but you could also obtain tweets, reviews, comments, or whatever you would like to analyze. Most APIs return JSON or XML data, so you need to be aware of the structure in order to pull out the fields that you need.
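A sketch using requests; the endpoint and the JSON field name are hypothetical, so check the structure of the actual API's response:

import requests

response = requests.get("https://example.com/api/qod")  # hypothetical quote-of-the-day endpoint
response.raise_for_status()
data = response.json()  # most APIs return JSON (or XML)
print(data["quote"])    # field name depends on the API's response structure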
Many data sets you will encounter have likely been fetched and prepared by someone else using a similar procedure.