.gitignore
README.md
collect_parse_hurriyet.py
collect_parse_yenisafak.py
wayback_collect_parse_sol.py

Turkish Newspaper Parsers

There are wonderful libraries for collecting and parsing newspaper articles, such as https://github.com/fhamborg/news-please and https://github.com/codelucas/newspaper, but I found that parsing certain Turkish newspapers with them can be problematic. Date extraction in particular fails most of the time. Even when a date is extracted, problems arise whenever the newspaper publishes dates in dd/mm/yyyy format. Extracting the main text can also fail when the article is short.
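The day-first date problem can be illustrated with a minimal sketch. It is not taken from the scripts in this repository: the function name `parse_day_first_date` and the list of accepted formats are assumptions for illustration, and a real newspaper page may need additional formats.

```python
from datetime import date, datetime
from typing import Optional

# Hypothetical list of day-first formats; extend it to match what a
# given newspaper actually prints.
DAY_FIRST_FORMATS = ("%d/%m/%Y", "%d.%m.%Y", "%d-%m-%Y")


def parse_day_first_date(text: str) -> Optional[date]:
    """Parse a date string written day first, e.g. 03/04/2019 for 3 April 2019.

    Returns None when no known format matches, so the caller can decide
    how to handle an article whose date could not be read.
    """
    cleaned = text.strip()
    for fmt in DAY_FIRST_FORMATS:
        try:
            return datetime.strptime(cleaned, fmt).date()
        except ValueError:
            # This format did not match; try the next one.
            continue
    return None
```

Stating the format explicitly avoids the month-first guess that a generic date parser may make for ambiguous strings such as 03/04/2019.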

That is why I wrote custom parsers that fix specific problems where these libraries do not work. Each script handles one newspaper: it collects article URLs from the newspaper's daily archive or its search function, parses the articles, and saves the results in a spreadsheet.
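The final step, saving parsed articles to a spreadsheet, could look roughly like the sketch below. It assumes CSV as the spreadsheet format and a hypothetical set of columns (`url`, `date`, `title`, `text`); the actual scripts may use a different format and different fields.

```python
import csv
from typing import Iterable, Mapping, TextIO

# Hypothetical column layout; not taken from the repository's scripts.
FIELDS = ("url", "date", "title", "text")


def write_articles(articles: Iterable[Mapping[str, str]], out: TextIO) -> int:
    """Write parsed articles as CSV rows and return how many were written.

    Missing fields are written as empty cells; keys outside FIELDS are dropped.
    """
    writer = csv.DictWriter(out, fieldnames=FIELDS)
    writer.writeheader()
    count = 0
    for article in articles:
        writer.writerow({field: article.get(field, "") for field in FIELDS})
        count += 1
    return count
```

When writing to a file, opening it with `open(path, "w", newline="", encoding="utf-8-sig")` is one option that tends to help spreadsheet programs display Turkish characters correctly.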

Sometimes neither news-please nor newspaper3k can be used at all, because newspapers remove old articles from their servers. In those cases, the articles can still be reached through copies archived by the Wayback Machine. I use the Wayback CDX server API to search for a newspaper's archived URLs within a defined time range.
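A CDX lookup of this kind can be sketched as follows. This is an illustration under stated assumptions, not the code in `wayback_collect_parse_sol.py`: the function names are hypothetical, `example.com` stands in for a newspaper domain, and the chosen query options (prefix matching, HTTP 200 only, one capture per URL) are one reasonable configuration among several.

```python
import json
from typing import List
from urllib.parse import urlencode
from urllib.request import urlopen

CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def build_cdx_query(domain: str, start: str, end: str) -> str:
    """Build a CDX query URL for captures under `domain`.

    `start` and `end` are timestamps such as "20150101" (yyyymmdd).
    """
    params = [
        ("url", domain),
        ("matchType", "prefix"),        # every URL that starts with `domain`
        ("from", start),
        ("to", end),
        ("output", "json"),
        ("fl", "timestamp,original"),   # only the fields needed below
        ("filter", "statuscode:200"),   # skip redirects and errors
        ("collapse", "urlkey"),         # keep one capture per distinct URL
    ]
    return CDX_ENDPOINT + "?" + urlencode(params)


def fetch_cdx_rows(query_url: str) -> List[List[str]]:
    """Download the JSON result; the first row holds the field names."""
    with urlopen(query_url, timeout=60) as response:
        body = response.read().decode("utf-8")
    return json.loads(body) if body.strip() else []


def rows_to_snapshot_urls(rows: List[List[str]]) -> List[str]:
    """Turn CDX rows into Wayback snapshot URLs, skipping the header row."""
    return [
        f"https://web.archive.org/web/{timestamp}/{original}"
        for timestamp, original in rows[1:]
    ]
```

A typical call chain would be `rows_to_snapshot_urls(fetch_cdx_rows(build_cdx_query("example.com", "20150101", "20151231")))`; the resulting snapshot URLs can then be passed to a parser in place of the newspaper's own URLs.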

serkant/TurkishNewspaperParsers
