The ML Application Diagram file displays a diagram of a proposed solution for a real-time LDA topic model running in a production environment, for a use case where multiple files land in a common location. The process involves a set of AWS resources to extract the data, format and preprocess it (NLP), train a model, and deploy that model to production. For this use case I will mainly focus on the training part, but a suggested approach for a full real-world scenario is shown in the diagram.
The second part consists of a movie scraper that extracts descriptions and story lines from movies on IMDB and stores them locally. After scraping, topic modeling is applied to the data; the chosen technique is LDA. The cleaning and preprocessing pipeline developed during the analysis is wrapped in a function, which can be found in the notebook. The model objects are stored in the model directory within the notebooks folder.
The inference script is then in charge of loading the model and dictionary and producing the topic distribution for an unseen movie description.
movie_scraper is a small Scrapy project that scrapes the top 1000 movies from IMDB.
The scraper gets the title, year, genres, summary, and story line for each movie in the top 1000.
{
  "title": "Spider-Man: Far from Home",
  "year": "2019",
  "genre": ["Action", "Adventure", "Sci-Fi"],
  "story_line": "Peter Parker's world has changed a lot since ...",
  "description": "..."
}
To run the scraper, navigate to the movie_scraper folder:
$ cd movie_scraper
Then run the spider as:
$ scrapy crawl movies
The data is extracted and stored under the movie_scraper/data folder as movies.json (JSON format).
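Once the scraper has run, the file can be consumed directly with the standard library. A minimal sketch, using a hypothetical in-memory record that mirrors the schema shown above (in practice you would open movie_scraper/data/movies.json):

```python
import json
from collections import Counter

# Hypothetical sample record mirroring the scraped schema; stands in
# for the real movie_scraper/data/movies.json file.
sample = [{
    "title": "Spider-Man: Far from Home",
    "year": "2019",
    "genre": ["Action", "Adventure", "Sci-Fi"],
    "story_line": "Peter Parker's world has changed a lot since ...",
    "description": "...",
}]

with open("movies.json", "w") as f:
    json.dump(sample, f)

# Load the scraped records back
with open("movies.json") as f:
    movies = json.load(f)

# Example inspection: genre frequencies across the scraped set
genre_counts = Counter(g for m in movies for g in m["genre"])
print(len(movies), genre_counts.most_common(1))
```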
The model analysis is under the notebooks folder. There you can find how the data was cleaned and preprocessed, and the model chosen for future inference. The process followed:
- Read data
- Tokenize
- Remove stopwords
- Lemmatize the text field
- Keep only alphabetic characters
- Create the dictionary
- Analysis and model training
- Parameter tuning
- Persist the model and dictionary to disk
- Deploy the model as local files
The inference script prints out the topic distribution for a new, unseen movie description, along with the most common words for that topic. Run the inference script as follows:
# this is the movie description for the movie Saving Private Ryan
$ python inference_topics.py "{\"text\": \"Following the Normandy Landings, a group of U.S. soldiers go behind enemy lines to retrieve a paratrooper whose brothers have been killed in action.\"}"
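A minimal sketch of how inference_topics.py might handle that command-line payload. Only the JSON-argument parsing is concrete; the model and dictionary loading is indicated in comments, since it depends on the objects persisted under notebooks/model (the names used there are assumptions):

```python
import json
import sys

def parse_payload(raw):
    """Extract the movie description from the JSON CLI argument."""
    return json.loads(raw)["text"]

if __name__ == "__main__":
    raw = sys.argv[1] if len(sys.argv) > 1 else '{"text": "..."}'
    text = parse_payload(raw)
    # Hypothetical continuation, assuming gensim-style persisted objects:
    # dictionary = corpora.Dictionary.load("notebooks/model/...")
    # lda = models.LdaModel.load("notebooks/model/...")
    # bow = dictionary.doc2bow(preprocess(text))
    # print(lda.get_document_topics(bow))  # topic distribution
    print(text[:40])
```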