This is a work-in-progress repository for my capstone project at the Springboard Machine Learning bootcamp.
- Project description
- Process raw data
- Exploratory data analysis (EDA)
- Sentence scoring algorithm
- Flask API on a web server
The objective of this project is to develop a text summarization tool that creates a short version of a given document while retaining its most important information. This task is relevant for accessing textual information and producing digests of news, social media and reviews. It can also be applied as part of other AI tasks such as answering questions and providing recommendations.
Dataset: The CNN news highlights dataset, with 92,579 documents. Each document contains a news article and its associated highlights, i.e., a few bullet points giving a brief overview of the article.
The CNN dataset was downloaded from New York University, in the version made available by Kyunghyun Cho, which can be found here
A description of this project's development can be found on my portfolio website.
Basic processing of the original dataset file, separating articles from summaries.
Notebook: 01-process-raw-data.ipynb [launch notebook on Codelab]
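As a rough illustration of this step, the sketch below splits one raw story into its article text and its highlights. It assumes the highlights are introduced by `@highlight` marker lines, as in the commonly distributed CNN story files; the marker, the `split_story` helper and the sample text are assumptions for illustration and may not match what the notebook actually does.

```python
from typing import List, Tuple


def split_story(raw: str, marker: str = "@highlight") -> Tuple[str, List[str]]:
    """Split one raw story into (article_text, list_of_highlights).

    The marker is an assumption about the raw file layout.
    """
    # Everything before the first marker is treated as the article body.
    head, *tail = raw.split(marker)
    article = " ".join(head.split())  # collapse newlines and repeated spaces
    # Each remaining chunk is one highlight; skip empty chunks.
    highlights = [" ".join(part.split()) for part in tail if part.strip()]
    return article, highlights


raw_story = """The city council approved the new budget on Monday.

The vote was unanimous.

@highlight

Council approves budget

@highlight

Vote was unanimous
"""

article, highlights = split_story(raw_story)
print(article)
print(highlights)
```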
Analysis of the number of characters, words and sentences in both articles and summaries. Identification of malformed articles and their removal from the dataset.
Notebook: 02-exploratory-data-analysis.ipynb [launch notebook on Codelab]
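The kind of counts described above can be sketched with the standard library alone. The regular expressions used for word and sentence splitting are simplifications, and the `is_malformed` rule (an empty article, or an article shorter than its summary) is a hypothetical criterion chosen for illustration, not necessarily the one used in the notebook.

```python
import re


def text_stats(text):
    """Return character, word and sentence counts for a text."""
    # Naive sentence split: break after ., ! or ? followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    words = re.findall(r"\w+", text)
    return {"chars": len(text), "words": len(words), "sentences": len(sentences)}


def is_malformed(article, summary):
    """Hypothetical check: flag empty articles or articles shorter than their summary."""
    n_article = text_stats(article)["words"]
    n_summary = text_stats(summary)["words"]
    return n_article == 0 or n_article < n_summary


print(text_stats("Hello world. How are you?"))
```

Applying `text_stats` to every article and summary gives the distributions to plot, and `is_malformed` can then be used to filter rows before modelling.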
The sentence scoring algorithm is mostly based on Alfrick Opidi's article on FloydHub, titled "A Gentle Introduction to Text Summarization in Machine Learning".
Notebook: 03-sentence-scoring-algorithm.ipynb [launch notebook on Codelab]
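A minimal sketch of frequency-based sentence scoring, in the spirit of the approach referenced above, might look like the following. It is not the notebook's implementation: it uses a tiny hard-coded stop-word list instead of a full one, skips stemming, lemmatization and n-grams, and the `summarize` function name is an assumption. Only the `threshold_factor` parameter name is taken from the API described in this README.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list; a real one would be much larger.
STOP_WORDS = {"a", "an", "the", "is", "are", "was", "of", "to", "in", "and", "it", "on", "for"}


def tokenize(text):
    """Lowercase the text and keep only non-stop words."""
    return [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOP_WORDS]


def summarize(text, threshold_factor=1.0):
    """Keep sentences whose score is at least threshold_factor times the average score."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s.strip()]
    freqs = Counter(tokenize(text))  # word frequencies over the whole document
    scores = []
    for sentence in sentences:
        words = tokenize(sentence)
        # Sentence score: mean document frequency of its words (0 if no words remain).
        scores.append(sum(freqs[w] for w in words) / len(words) if words else 0.0)
    if not scores:
        return ""
    threshold = threshold_factor * sum(scores) / len(scores)
    return " ".join(s for s, score in zip(sentences, scores) if score >= threshold)


text = "Cats sleep a lot. Cats chase mice and cats climb trees. Dogs bark."
print(summarize(text, threshold_factor=1.0))
```

Raising `threshold_factor` makes the summary shorter, because fewer sentences score above the scaled average; a value that is too high returns an empty summary.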
Format:
curl -X POST --data-binary @\<filename\> -d 'tokenizer=\<stem | lemma\>&n_gram=\<1-gram | 2-gram | 3-gram\>&threshold_factor=\<float\>' https://summarizer-lopasso.herokuapp.com/predict
The response is a JSON object in the following format:
{"prediction" : "The generated summary"}
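For reference, the same request can be sketched in Python with the standard library. The sketch mirrors how curl merges the `--data-binary` and `-d` pieces into a single body joined by `&`. The helper names (`build_request`, `parse_response`, `summarize_remote`) and the default parameter values are assumptions for illustration, and the sketch does not check whether the Heroku app is currently reachable.

```python
import json
import urllib.request
from urllib.parse import urlencode

API_URL = "https://summarizer-lopasso.herokuapp.com/predict"


def build_request(text, tokenizer="stem", n_gram="1-gram", threshold_factor=1.0, url=API_URL):
    """Build a POST request whose body is '<text>&<params>', as the curl command does."""
    params = urlencode(
        {"tokenizer": tokenizer, "n_gram": n_gram, "threshold_factor": threshold_factor}
    )
    body = text.encode("utf-8") + b"&" + params.encode("ascii")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={"Content-Type": "application/x-www-form-urlencoded"},
    )


def parse_response(payload):
    """Extract the summary from the JSON response described above."""
    return json.loads(payload)["prediction"]


def summarize_remote(text, **kwargs):
    """Send the request and return the generated summary (requires network access)."""
    with urllib.request.urlopen(build_request(text, **kwargs)) as response:
        return parse_response(response.read())


request = build_request("Some text")
print(request.data)
```

Calling `summarize_remote(open("article.txt").read(), tokenizer="lemma", n_gram="2-gram", threshold_factor=1.2)` would send the file contents with the chosen parameters; `article.txt` is a placeholder file name.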
Access the app on Heroku using the link. The app has a self-explanatory page, where the inputs are the text to be summarized and the algorithm parameters. The generated summary appears in the field at the bottom of the page when the "Submit" button is pressed.