The ML Application Diagram file displays a diagram of a proposed solution for a real-time LDA topic model running in a production environment, for a use case where multiple files land in a common location. The process involves a set of AWS resources to extract the data, format and preprocess it (NLP), train a model, and deploy that model to production. For this use case I will mainly focus on the training part, but a suggested approach for a full real-world scenario is shown in the diagram.
The second part consists of a movie scraper that extracts descriptions and story lines from movies on IMDB and stores them locally. After scraping, topic modeling is applied to the data; the chosen technique is LDA. The cleaning and preprocessing pipeline developed during the analysis is wrapped in a function, which can be found in the notebook. The model objects are stored in the model directory within the notebooks folder.
The inference script is then in charge of loading the model and dictionary and producing the topic distribution for an unseen movie description.
movie_scraper is a small Scrapy project that scrapes the top 1000 movies from IMDB.
The scraper gets the title, year, genres, summary, and story line for each movie in the top 1000.
{
  "title": "Spider-Man: Far from Home",
  "year": "2019",
  "genre": ["Action", "Adventure", "Sci-Fi"],
  "story_line": "Peter Parker's world has changed a lot since ...",
  "description": "..."
}
To run the scraper, navigate to the movie_scraper folder:
$ cd movie_scraper
Then run the spider as:
$ scrapy crawl movies
The data is extracted and stored under the movie_scraper/data folder as movies.json (JSON format).
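Once the scraper has run, the file can be consumed directly with the standard library. A minimal sketch, using a hypothetical in-memory record that mirrors the schema shown above (in practice you would open movie_scraper/data/movies.json):

```python
import json
from collections import Counter

# Hypothetical sample record mirroring the scraped schema; stands in
# for the real movie_scraper/data/movies.json file.
sample = [{
    "title": "Spider-Man: Far from Home",
    "year": "2019",
    "genre": ["Action", "Adventure", "Sci-Fi"],
    "story_line": "Peter Parker's world has changed a lot since ...",
    "description": "...",
}]

with open("movies.json", "w") as f:
    json.dump(sample, f)

# Load the scraped records back
with open("movies.json") as f:
    movies = json.load(f)

# Example inspection: genre frequencies across the scraped set
genre_counts = Counter(g for m in movies for g in m["genre"])
print(len(movies), genre_counts.most_common(1))
```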
The model analysis is under the notebooks folder. There you can find how the data was cleaned and preprocessed, and the model chosen for future inference. The process followed:
- Read data
- Tokenize
- Remove stopwords
- Lemmatize the text field
- Keep only alphabetic characters
- Create the dictionary
- Analysis and model training
- Parameter tuning
- Persist the model and dictionary to disk
- Deploy the model as local files
The inference script prints out the topic distribution for a new, unseen movie description, along with the most common words for that topic. Run the inference script as follows:
# this is the movie description for the movie Saving Private Ryan
$ python inference_topics.py "{\"text\": \"Following the Normandy Landings, a group of U.S. soldiers go behind enemy lines to retrieve a paratrooper whose brothers have been killed in action.\"}"
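A minimal sketch of how inference_topics.py might handle that command-line payload. Only the JSON-argument parsing is concrete; the model and dictionary loading is indicated in comments, since it depends on the objects persisted under notebooks/model (the names used there are assumptions):

```python
import json
import sys

def parse_payload(raw):
    """Extract the movie description from the JSON CLI argument."""
    return json.loads(raw)["text"]

if __name__ == "__main__":
    raw = sys.argv[1] if len(sys.argv) > 1 else '{"text": "..."}'
    text = parse_payload(raw)
    # Hypothetical continuation, assuming gensim-style persisted objects:
    # dictionary = corpora.Dictionary.load("notebooks/model/...")
    # lda = models.LdaModel.load("notebooks/model/...")
    # bow = dictionary.doc2bow(preprocess(text))
    # print(lda.get_document_topics(bow))  # topic distribution
    print(text[:40])
```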