This script was written as part of my Master's thesis. It works, but it is still a work in progress. I'll add better documentation and tests if time allows and if anyone is interested in the crawler itself, so please let me know if it's useful to you: this directly affects whether it will be maintained.
Downloads Українська правда articles published in a given range of dates, with all translations of each article.
Creates a unified list of tags, with all translations of each tag as well.
It has the following parts, all of which also work as standalone commands:
- The main command downloads the dataset, as documented below. It uses:
  - `up_get_uris`, which crawls the website and gets the list of URIs of articles to crawl from the sitemap.
  - `up_craw_uris`, which downloads the articles from the CSV list built by `up_get_uris`.
- `up_convert`, which converts the native JSON directory structure format to CSV.
The last two years of articles are available in CSV format on the HF Hub: shamotskyi/ukr_pravda_2y.
The script generates .json files that contain additional information, such as the raw HTML of the articles. These were omitted from the CSV version above; contact me if you are interested.
- TODO Install the package ...
- Run `python3 -m up_crawler -h` to see the available options.
```
> python3 -m up_crawler -ds 'four weeks ago' -de 'three weeks ago' -o /tmp/your/output/folder
[15:48:12] INFO Running with params Namespace(date_start='four
weeks ago', date_end='three weeks ago',
timeout=5, pdb=False, loglevel=None)
INFO Getting URLs of articles published between
2023-11-12 ('four weeks ago') and 2023-11-19
('three weeks ago')
[15:48:13] INFO Getting
INFO Got 1305 article URLs!
INFO Saved df to /tmp/your/output/folder/uris.csv
INFO Creating tag mapping from UP's website...
[15:48:16] INFO Created tag mapping with 1388 tags!
INFO Saving tags mapping to
INFO Reading /tmp/your/output/folder/uris.csv
INFO Found 542 articles (1305 incl. translations) over 7
articles: 1%|▏ | 7/1305 [00:22<1:16:00, 3.51s/it]
```
In the output directory, a directory is created for each article.
Each article has one to three files named like eng_aHR0cHM6Ly93d3cucHJhdmRhLmNvbS51YS9lbmcvbmV3cy8yMDIzLzExLzEzLzc0Mjg0NjQv.json: the prefix (`eng` here) is the language, and the rest is a base64 representation of the URI of the page.
This directory structure can be converted to a CSV representation by running `up_convert`.
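The naming scheme described above can be illustrated with a small sketch. This is not code from the crawler: `encode_name` and `decode_name` are hypothetical helpers, and the use of the URL-safe base64 alphabet is an assumption (for the file names shown in this README, the standard and URL-safe alphabets give identical results).

```python
import base64


def encode_name(lang: str, uri: str) -> str:
    """Build a file name like '<lang>_<base64 of URI>.json' (hypothetical helper)."""
    # Assumption: URL-safe base64, so the result never contains '/'.
    encoded = base64.urlsafe_b64encode(uri.encode("utf-8")).decode("ascii")
    return f"{lang}_{encoded}.json"


def decode_name(file_name: str) -> tuple[str, str]:
    """Split a file name into (language, URI) (hypothetical helper)."""
    stem = file_name.removesuffix(".json")
    # The language code is everything before the first underscore.
    lang, _, encoded = stem.partition("_")
    uri = base64.urlsafe_b64decode(encoded).decode("utf-8")
    return lang, uri


name = "eng_aHR0cHM6Ly93d3cucHJhdmRhLmNvbS51YS9lbmcvbmV3cy8yMDIzLzExLzEzLzc0Mjg0NjQv.json"
lang, uri = decode_name(name)
print(lang, uri)
# eng https://www.pravda.com.ua/eng/news/2023/11/13/7428464/
```

Decoding the name this way is a quick check of which page a given file came from without opening the JSON.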
```
> tree /tmp/your/output/folder
├── 7428464
│ ├── eng_aHR0cHM6Ly93d3cucHJhdmRhLmNvbS51YS9lbmcvbmV3cy8yMDIzLzExLzEzLzc0Mjg0NjQv.json
│ ├── rus_aHR0cHM6Ly93d3cucHJhdmRhLmNvbS51YS9ydXMvbmV3cy8yMDIzLzExLzEzLzc0Mjg0NjQv.json
│ └── uk_aHR0cHM6Ly93d3cucHJhdmRhLmNvbS51YS9uZXdzLzIwMjMvMTEvMTMvNzQyODQ2NC8=.json
├── 7428472
│ ├── eng_aHR0cHM6Ly93d3cucHJhdmRhLmNvbS51YS9lbmcvbmV3cy8yMDIzLzExLzEzLzc0Mjg0NzIv.json
│ ├── rus_aHR0cHM6Ly93d3cucHJhdmRhLmNvbS51YS9ydXMvbmV3cy8yMDIzLzExLzEzLzc0Mjg0NzIv.json
│ └── uk_aHR0cHM6Ly93d3cucHJhdmRhLmNvbS51YS9uZXdzLzIwMjMvMTEvMTMvNzQyODQ3Mi8=.json
├── 7428483
│ ├── eng_aHR0cHM6Ly93d3cucHJhdmRhLmNvbS51YS9lbmcvbmV3cy8yMDIzLzExLzEzLzc0Mjg0ODMv.json
│ ├── rus_aHR0cHM6Ly93d3cucHJhdmRhLmNvbS51YS9ydXMvbmV3cy8yMDIzLzExLzEzLzc0Mjg0ODMv.json
│ └── uk_aHR0cHM6Ly93d3cucHJhdmRhLmNvbS51YS9uZXdzLzIwMjMvMTEvMTMvNzQyODQ4My8=.json
├── 7428484
│ ├── rus_aHR0cHM6Ly93d3cucHJhdmRhLmNvbS51YS9ydXMvbmV3cy8yMDIzLzExLzEzLzc0Mjg0ODQv.json
│ └── uk_aHR0cHM6Ly93d3cucHJhdmRhLmNvbS51YS9uZXdzLzIwMjMvMTEvMTMvNzQyODQ4NC8=.json
├── 7428485
│ ├── eng_aHR0cHM6Ly93d3cucHJhdmRhLmNvbS51YS9lbmcvbmV3cy8yMDIzLzExLzEzLzc0Mjg0ODUv.json
│ ├── rus_aHR0cHM6Ly93d3cucHJhdmRhLmNvbS51YS9ydXMvbmV3cy8yMDIzLzExLzEzLzc0Mjg0ODUv.json
│ └── uk_aHR0cHM6Ly93d3cucHJhdmRhLmNvbS51YS9uZXdzLzIwMjMvMTEvMTMvNzQyODQ4NS8=.json
├── 7428486
│ └── eng_aHR0cHM6Ly93d3cucHJhdmRhLmNvbS51YS9lbmcvbmV3cy8yMDIzLzExLzEzLzc0Mjg0ODYv.json
├── tags_mapping.json
└── uris.csv
```
`tags_mapping.json` contains all tags used in all available translations.
`uris.csv` lists all articles and translations published in the given range of dates, i.e. the ones that are to be downloaded.
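To get a quick overview of a finished crawl, the directory layout shown above can be walked with the standard library. The sketch below is not part of the package: `list_translations` is a hypothetical helper that relies only on what the tree shows, namely one directory per article and file names that start with the language code.

```python
from collections import defaultdict
from pathlib import Path


def list_translations(output_dir: str) -> dict[str, list[str]]:
    """Map each article directory name to the sorted language codes found in it."""
    found: dict[str, list[str]] = defaultdict(list)
    # '*/*.json' matches only files inside article directories, so the
    # top-level tags_mapping.json and uris.csv are skipped.
    for json_path in Path(output_dir).glob("*/*.json"):
        # Assumption: the language code is everything before the first '_'.
        lang = json_path.name.partition("_")[0]
        found[json_path.parent.name].append(lang)
    return {article: sorted(langs) for article, langs in sorted(found.items())}
```

Calling `list_translations("/tmp/your/output/folder")` on the example above would show, for instance, that article 7428484 has no English version and 7428486 exists only in English.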
- Downloads only articles older than about 15 days, since newer articles aren't available through UP's archive sitemaps.
  - Supporting newer articles would be trivial to implement, but I just don't have the resources for it; pull requests are welcome.
- Older articles that use a different article structure (for example, with missing authors) break the crawler.
  - Again, this is trivial to fix if needed.
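
Until newer articles are supported, one workaround for the first limitation is to cap the end date before starting a crawl. The sketch below is only an illustration and not part of the crawler: `clamp_end_date` and `ARCHIVE_DELAY_DAYS` are hypothetical names, and the 15-day figure is the approximate value mentioned above, not an exact property of the site.

```python
from datetime import date, timedelta
from typing import Optional

# Assumption: approximate delay before articles appear in the archive sitemaps.
ARCHIVE_DELAY_DAYS = 15


def clamp_end_date(date_end: date, today: Optional[date] = None) -> date:
    """Return date_end, moved back if it is too recent to be in the archive."""
    today = today or date.today()
    latest_available = today - timedelta(days=ARCHIVE_DELAY_DAYS)
    return min(date_end, latest_available)


# A range ending "today" is cut back by 15 days; an older end date is kept.
print(clamp_end_date(date(2023, 12, 10), today=date(2023, 12, 10)))  # 2023-11-25
print(clamp_end_date(date(2023, 11, 19), today=date(2023, 12, 10)))  # 2023-11-19
```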