Commit
reamdeupdate
serkant committed Oct 18, 2021
1 parent c0d2d90 commit 6f3313e
Showing 1 changed file with 1 addition and 1 deletion.
2 changes: 1 addition & 1 deletion README.md
@@ -1,7 +1,7 @@
Turkish Newspaper Parsers
===

-Although there are wonderful libraries to collect and parse newspaper articles, such as https://github.com/fhamborg/news-please and https://github.com/codelucas/newspaper, I realized that parsing certain Turkish newspapers can be problematic. In particular, extracting dates is most of the time problematic. Even when the dates are extracted, there are issues whenever the newspaper publishes dates in dd/mm/yyyy format. I also realized that parsing main texts could fail in `news-please` and `newspaper3k` whenever they are relatively short.
+Although there are wonderful libraries to collect and parse newspaper articles, such as https://github.com/fhamborg/news-please and https://github.com/codelucas/newspaper, I realized that parsing certain Turkish newspapers can be problematic. In particular, extracting dates is most of the time problematic. Even when the dates are extracted, there are issues whenever the newspaper publishes dates in dd/mm/yyyy format. I also realized that parsing main texts could fail in `news-please` and `newspaper3k` whenever the articles are relatively short.


That is why I wrote custom parsers to fix specific problems whenever these libraries do not work. Each script deals with a particular newspaper: it collects URLs either using the newspaper's daily archive or search function, parses them, and saves them in a spreadsheet.
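
The dd/mm/yyyy issue mentioned in the README can be illustrated with a minimal sketch. This is not code from the repository: the function name `parse_day_first` and the list of accepted separators are assumptions made for illustration, and only the standard library is used. The idea is to parse with an explicit day-first format instead of letting a generic parser guess the field order.

```python
from datetime import datetime
from typing import Optional

# Hypothetical helper (not part of the repository): the accepted
# formats below are an assumption about how dates might be written.
DAY_FIRST_FORMATS = ("%d/%m/%Y", "%d.%m.%Y", "%d-%m-%Y")


def parse_day_first(raw: str) -> Optional[datetime]:
    """Parse a date string written day-first, e.g. 18/10/2021.

    Returns None when the string matches none of the known formats,
    so the caller can decide how to handle an unparsed date.
    """
    cleaned = raw.strip()
    for fmt in DAY_FIRST_FORMATS:
        try:
            return datetime.strptime(cleaned, fmt)
        except ValueError:
            # Not this format; try the next one.
            continue
    return None
```

With an explicit format, `parse_day_first("05/10/2021")` is read as 5 October 2021 rather than 10 May 2021, which is the ambiguity a month-first parser would introduce.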