Github project with all code needed for the final degree project "News similarity with Natural Language Processing". It tries to stablish a distance between the content of politics news articles, so that a network of similarities can be built.
This project has been organised in four Python libraries, detailed here:
- newsparser: defines classes Feed and Entry to extract entries from a RSS feed, saving all the necessary metadata and the article text in the central database.
- newsfilter: defines classes and methods to filter those entries that should not be considered in the system, such as entries with a broken link, that had no title, that had no meaningful content, etc.
- newstagger: defines a Flask HTTP server and its pages to allow an easy creation of a tagged dataset for the creation of the system.
- newsbreaker: defines functions and classes that inherit from the Entry class in newsparser and allow for an easy access to its content, its counters of words, its what/who/where vectors and some methods to compute the distance between two of them.
All Python code is Python 3.
All data used for this project will be stored in a zip file in the v1 release, on this Github page, except for the Wikipedia articles database, which the code can download automatically and would have taken a lot of space.
For more details on this system, check the project report.
To check the project visualisations online: