forked from zhukovyuri/VIINA

VIINA: Violent Incident Information from News Articles on the 2022 Russian Invasion of Ukraine

VIINA / Violent Incident Information from News Articles

2022 Russian Invasion of Ukraine

All events (events)

Control (territorial control)

VIINA/ВІЙНА/ВОЙНА/WAR is a near-real-time, multi-source event data system for the 2022 Russian Invasion of Ukraine. These data are based on news reports from Ukrainian and Russian media, which were geocoded and classified into standard conflict event categories through machine learning.

These data are GIS-ready, with temporal precision down to the minute. Each observation is accompanied by full source information, text and URLs.

In addition to raw events, VIINA also includes data on territorial control, at the level of individual populated places.

VIINA will be updated regularly, and is freely available for use by students, journalists, policymakers, and everyday researchers.

The most recent versions of these data are available as comma-delimited text (csv) files here:

Previous versions are available here:

where "YYYYMMDDHHMMSS" is a time stamp (e.g. 202202240001 is "00:01, February 24, 2022").

Please cite these data as:

  • Zhukov, Yuri (2022). "VIINA: Violent Incident Information from News Articles on the 2022 Russian Invasion of Ukraine." Ann Arbor: University of Michigan, Center for Political Studies. (https://github.com/zhukovyuri/VIINA, accessed [DATE]).

Corrections, feedback welcome:

Yuri M. Zhukov. Associate Professor of Political Science, University of Michigan. Research Associate Professor, Center for Political Studies, Institute for Social Research. zhukov-at-umich-dot-edu. sites.lsa.umich.edu/zhukov.

Interactive Map

To select and read individual event reports by location and time, please take a look at the dashboard developed by Robert McGrath and Eric McGlinchey at the Schar School of Policy and Government at George Mason University:

Dashboard

Many thanks to Rob and Eric for getting this dashboard up and running!

Data Sources

VIINA draws on news reports from the following Ukrainian and Russian news providers:

To be added soon:

  • Event reports from OSINT social media feeds.

This set of sources may expand/change as the war unfolds -- due to interruptions to journalistic activity from military operations, cyber attacks, and state censorship, as well as the availability of new data from other information providers.

Using an automated web scraping routine (which runs every 6 hours), VIINA extracts the text of news reports published by each source and their associated metadata (publication time and date, web urls). Using natural language processing, the system extracts and geocodes location names mentioned in each news item. A recurrent neural network then classifies each event report into several pre-defined categories.

Geocoding

Events were geo-located by place names mentioned in the text of each news report, using APIs from Yandex and OpenStreetMap. All unique geocoded locations were manually inspected for false positive and false negative matches.

Geocoding precision ranges from street-level (GEO_PRECISION="STREET") to province-level (GEO_PRECISION="ADM1").

Below is a map of all geocoded event reports since the start of Russia's military operations on February 24, 2022. Underneath the map is a timeline, showing the number of event reports published per hour, across all data sources.

[Figure: map and hourly timeline of all geocoded event reports]

Below are a map and timeline, showing the subset of war-related geocoded event reports (i.e. t_mil_b == 1, see below for details).

[Figure: map and hourly timeline of war-related geocoded event reports]
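As a minimal sketch of how the war-related subset could be selected from the events csv, the snippet below filters rows on t_mil_b and, optionally, on GEO_PRECISION. The sample rows and the helper name war_events are hypothetical and only illustrate the column names described in this README; this is not the code used to produce the maps.

```python
import csv
import io

# Hypothetical sample rows; only the column names come from the README.
SAMPLE = """event_id,GEO_PRECISION,t_mil_b,text
1,STREET,1,report a
2,ADM1,1,report b
3,STREET,0,report c
"""


def war_events(rows, precisions=None):
    """Keep rows with t_mil_b == 1, optionally restricted to some GEO_PRECISION values."""
    kept = []
    for row in rows:
        # The indicator may be stored as "1" or "1.0", so parse it as a float first.
        if int(float(row["t_mil_b"])) != 1:
            continue
        if precisions is not None and row["GEO_PRECISION"] not in precisions:
            continue
        kept.append(row)
    return kept


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
all_war = war_events(rows)                            # every war-related report
street_war = war_events(rows, precisions={"STREET"})  # street-level matches only
print(len(all_war), len(street_war))
```

To use it on real data, replace io.StringIO(SAMPLE) with an open file handle for a downloaded events csv.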

Event classification

To generate predicted event categories, VIINA uses a recurrent neural network (RNN) model with long short-term memory (LSTM) (Hochreiter and Schmidhuber, 1997; Chang and Masterson, 2020). LSTMs are well-suited for learning problems related to sequential data, such as sequences of words of variable length, where the vocabulary is potentially large, and where the long-term context and dependencies between inputs are potentially informative for classification (i.e. where word order and context matter, and the bag-of-words assumption is problematic).

The current version of the data uses a training set of 2000+ randomly-selected hand-coded texts. This training set will be updated/expanded periodically as more and different types of events are added to the text corpus.

Estimation was done in Python with the Keras library.

The data currently include the following event categories:

  • t_mil: Event is about war/military operations
  • t_nmil: Event is not about war/military operations (e.g. human interest story)
  • t_loc: Event report includes reference to specific location
  • t_san: Event report mentions economic sanctions imposed on Russia
  • a_rus: Event initiated by Russian or Russian-aligned armed forces
  • a_ukr: Event initiated by Ukrainian or Ukrainian-aligned armed forces
  • a_civ: Event initiated by civilians
  • a_other: Event initiated by a third party (e.g. U.S., EU, Red Cross)
  • t_aad: Anti-air defense, Buk, shoulder-fired missiles (Igla, Strela, Stinger)
  • t_airstrike: Air strike, strategic bombing, helicopter strike
  • t_armor: Tank battle or assault
  • t_arrest: Arrest by security services or detention of prisoners of war
  • t_artillery: Shelling by field artillery, howitzer, mortar, or rockets like Grad/BM-21, Uragan/BM-27, other Multiple Launch Rocket System (MLRS)
  • t_control: Establishment/claim of territorial control over population center
  • t_firefight: Any exchange of gunfire with handguns, semi-automatic rifles, automatic rifles, machine guns, rocket-propelled grenades (RPGs)
  • t_ied: Improvised explosive device, roadside bomb, landmine, car bomb, explosion
  • t_raid: Assault/attack by paratroopers or special forces, usually followed by a retreat
  • t_occupy: Occupation of territory or building
  • t_property: Destruction of property or infrastructure
  • t_cyber: Cyber operations, including DDOS attacks, website defacement
  • t_hospital: Attacks on hospitals and humanitarian convoys
  • t_milcas: Event report mentions military casualties
  • t_civcas: Event report mentions civilian casualties

This set of categories will expand in the future, as more and different types of events are added to the text corpus.

There are two versions of each variable included in the dataset:

  1. Predicted probabilities (ending with "_pred"): predicted probability that event belongs to each category, from the LSTM model
  2. Binary indicators (ending with "_b"): dummy variables, coded 1 or 0

Cutoffs for dichotomizing the predicted probabilities were selected by minimizing Type I and Type II errors against the training set. For each variable, the algorithm considers every potential cutoff ranging from 0 to 1, compares the resulting binary values to training set labels, calculates rates of false positives and false negatives, and selects the cutoff that minimizes the sum of these rates. These cutoffs are different for each variable, and are enumerated in the table below.
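The cutoff search described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the project's actual code: the function name select_cutoff, the 1000-point grid, the strict "greater than" rule for coding a 1, and the toy data are all hypothetical.

```python
def select_cutoff(probs, labels, n_grid=1000):
    """Return (cutoff, score) minimizing false positive rate + false negative rate.

    Assumptions: a prediction counts as 1 when prob > cutoff, and candidate
    cutoffs lie on an evenly spaced grid of n_grid points from 0 to 1.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one positive and one negative label")

    best_cutoff, best_score = 0.0, float("inf")
    for i in range(n_grid):
        cutoff = i / (n_grid - 1)
        false_pos = sum(1 for p, y in zip(probs, labels) if p > cutoff and y == 0)
        false_neg = sum(1 for p, y in zip(probs, labels) if p <= cutoff and y == 1)
        score = false_pos / negatives + false_neg / positives
        if score < best_score:  # strict: ties keep the lowest cutoff
            best_cutoff, best_score = cutoff, score
    return best_cutoff, best_score


# Toy training set: predicted probabilities and hand-coded labels.
toy_probs = [0.05, 0.2, 0.4, 0.6, 0.8, 0.95]
toy_labels = [0, 0, 0, 1, 1, 1]
cutoff, score = select_cutoff(toy_probs, toy_labels)
binary = [1 if p > cutoff else 0 for p in toy_probs]
```

When several cutoffs give the same error sum, this sketch keeps the lowest one; how ties are actually resolved is not stated here.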

Below are in-sample prediction accuracy statistics for each variable (auc: area under the ROC curve, fitted values against training set labels), along with the number of events with probabilities greater than .10 (n_p10) and greater than .90 (n_p90). Also included are recommended cutoffs for dichotomizing each variable (cutoff_01).

| variable | auc | n_p10 | n_p90 | cutoff_01 |
|---|---|---|---|---|
| a_rus_pred | 0.9654805 | 13547 | 10958 | 0.9989986 |
| a_ukr_pred | 0.9790863 | 15217 | 8141 | 0.9899900 |
| a_civ_pred | 0.9526951 | 9297 | 177 | 0.3082623 |
| a_other_pred | 0.9555902 | 5332 | 4822 | 0.4984981 |
| t_aad_pred | 0.9680766 | 1527 | 1517 | 0.0380380 |
| t_airstrike_pred | 0.9000190 | 2662 | 2620 | 0.9969970 |
| t_armor_pred | 0.9027617 | 935 | 861 | 0.3143143 |
| t_arrest_pred | 0.8851692 | 2473 | 2233 | 0.0020020 |
| t_artillery_pred | 0.9825025 | 6815 | 6678 | 0.0010010 |
| t_civcas_pred | 0.9635935 | 4104 | 4003 | 0.1021021 |
| t_control_pred | 0.9520249 | 15111 | 730 | 0.2702702 |
| t_cyber_pred | 0.9616443 | 4853 | 4717 | 0.9989990 |
| t_firefight_pred | 0.8045639 | 764 | 723 | 0.0150150 |
| t_hospital_pred | 0.9440955 | 537 | 501 | 0.0010010 |
| t_ied_pred | 0.9315872 | 987 | 883 | 0.1091091 |
| t_killing_pred | 0.4966533 | 749 | 93 | 0.8512725 |
| t_loc_pred | 0.9421611 | 41178 | 39936 | 0.9209209 |
| t_mil_pred | 0.9815032 | 60329 | 42744 | 0.5575579 |
| t_milcas_pred | 0.9370085 | 3654 | 3469 | 0.9959960 |
| t_occupy_pred | 0.6511453 | 136 | 114 | 0.0100086 |
| t_property_pred | 0.7840882 | 4093 | 3654 | 0.9699700 |
| t_raid_pred | 0.8075744 | 661 | 620 | 0.0130130 |
| t_san_pred | 0.9706459 | 12483 | 11928 | 0.9389389 |

This table is updated daily and is available in csv format here:

Note that these statistics are subject to change, as new events are added to the corpus and as the training set expands.
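For readers who want to reproduce statistics of this kind on their own labeled sample, here is a minimal sketch of the three quantities: AUC, and the counts of predictions above .10 and .90. The function names and toy data are hypothetical, and the pairwise AUC formula is a standard textbook definition rather than the routine used to build the table.

```python
def auc(probs, labels):
    """Area under the ROC curve, computed as the share of (positive, negative)
    pairs in which the positive case has the higher predicted probability.
    Ties count as half a win."""
    pos = [p for p, y in zip(probs, labels) if y == 1]
    neg = [p for p, y in zip(probs, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def count_above(probs, threshold):
    """Number of predictions strictly greater than the threshold."""
    return sum(1 for p in probs if p > threshold)


# Toy predictions and hand-coded labels (hypothetical values).
toy_probs = [0.05, 0.3, 0.95, 0.92, 0.5, 0.08]
toy_labels = [0, 0, 1, 1, 0, 1]

toy_auc = auc(toy_probs, toy_labels)
n_p10 = count_above(toy_probs, 0.10)
n_p90 = count_above(toy_probs, 0.90)
```

The pairwise loop is quadratic in sample size, which is fine for a training set of a few thousand texts but slow for a full corpus; a rank-based formula would scale better.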

Below are illustrative word clouds for several categories of events. The font size is proportional to word frequencies in news wire headlines predicted as being most likely to belong to each topic category (99th percentile of predicted probability). The clouds are for out-of-sample predictions on the full set of news stories in the corpus.

Events about war/military operations (t_mil_*)

[Word cloud: t_mil_*]

A quick guide to what some of the words mean:

  • "окупанти" (okupanty) means "occupiers" (in Ukrainian)
  • "зсу" (zsu) is the acronym for Armed Forces of Ukraine (in Ukrainian)

Russian-initiated events (a_rus_*)

[Word cloud: a_rus_*]

  • "окупанти" (okupanty) means "occupiers"
  • "ворог" (voroh) means "enemy"
  • "війска" (viyska) means "forces"

Ukrainian-initiated events (a_ukr_*)

[Word cloud: a_ukr_*]

  • "зсу" (zsu) is the acronym for Armed Forces of Ukraine (in Ukrainian)
  • "всу" (vsu) is the acronym for Armed Forces of Ukraine (in Russian)
  • "заявили днр" (zayavili dnr) means "DNR has claimed" (in Russian)

Sanctions (t_san_*)

[Word cloud: t_san_*]

  • "санкції" (sanktsiyi) means "sanctions"
  • "сша" (ssha) means USA
  • there are also terms here for sanctions related to SWIFT, Visa, MasterCard

Anti-air defense (t_aad_*)

[Word cloud: t_aad_*]

  • "збили" (zbyly) means "shot down"
  • "літак" (litak) means "aircraft"

Air strikes (t_airstrike_*)

[Word cloud: t_airstrike_*]

  • "повітряна тривога" (povitryana tryvoha) means "air raid alert"

Arrests or detentions of POWs (t_arrest_*)

[Word cloud: t_arrest_*]

  • "затримали" (zatrymaly) means "arrested" or "detained"
  • "взяли в полон" (vzyaly v polon) means "taken prisoner"

Tank battles (t_armor_*)

[Word cloud: t_armor_*]

  • "танки" (tanki) means "tanks"
  • "окупантів" (okupantiv) means "of occupiers"

Territorial control (t_control_*)

[Word cloud: t_control_*]

  • "голова ода" (holova oda) means "head of regional administration" (such officials sometimes make announcements about territorial control)
  • "місто" (misto) means "city"
  • "контролем" (kontrolem) means "[under] control"

Firefights (t_firefight_*)

[Word cloud: t_firefight_*]

  • "бої" (boyi) means "fighting" (in Ukrainian)
  • "бои" (boi) means "fighting" (in Russian)

Artillery shelling and rocket strikes (t_artillery_*)

[Word cloud: t_artillery_*]

  • "обстріл" (obstril) means "shelling"
  • "ракети" (rakety) means "rockets"
  • "заявили днр" (zayavili dnr) means "DNR has claimed" (i.e. allegations of shelling by UA forces in Donbas)

Raid (t_raid_*)

[Word cloud: t_raid_*]

  • "наступ" (nastup) means "advance/offensive"
  • "діверсантів" (diversantiv) means "of saboteurs/diversionary units"
  • "висадився десант" (vysadyvsya desant) means "paratroopers landed"

Destruction of property or infrastructure (t_property_*)

[Word cloud: t_property_*]

  • "будинків" (budynkiv) means "houses"
  • "з під завалів" (z pid zavaliv) means "from under the rubble"

Cyber operations (t_cyber_*)

[Word cloud: t_cyber_*]

  • "хакери" (khakery) means "hackers"
  • "зламали сайт" (zlamaly sayt) means "hacked the website"

Attacks on hospitals (t_hospital_*)

[Word cloud: t_hospital_*]

  • "лікарні" (likarni) means "hospitals"
  • "гуманітарні коридори" (humanitarni korydory) means "humanitarian corridors"
  • "працюють" (pratsyuyut') means "are working"

Military casualties (t_milcas_*)

[Word cloud: t_milcas_*]

  • "понад" (ponad) means "more than"
  • "втрати" (vtraty) means "losses"
  • "окупантів" (okupantiv) means "of occupiers"
  • "загинули" (zahynuly) means "died"

Civilian casualties (t_civcas_*)

[Word cloud: t_civcas_*]

  • "поранені" (poraneni) means "wounded"
  • "людей" (lyudey) means "people"
  • "дітей" (ditey) means "children"
  • "цивільних" (tsyvil'nykh) means "civilian"

Codebook

  • event_id: Unique event ID
  • report_id: Unique ID for report that contains the event
  • location: Index of unique locations mentioned in each event
  • tempid: Temporary numeric ID
  • source: Data source short name
  • date: Date of event report (YYYYMMDD)
  • time: Time of event report (HH:MM)
  • url: URL web address of event report
  • text: Text of event report headline/description
  • lang: Language of report (ua is Ukrainian, ru is Russian)
  • address: Address of geocoded location
  • longitude: Longitude coordinate of event location
  • latitude: Latitude coordinate of event location
  • GEO_PRECISION: geographic precision of geocoded location
  • GEO_API: Geocoding API used to locate event
  • t_[event type]: Predicted probability that report describes event of each type (from LSTM model, see above)
  • a_[actor]: Predicted probability that report describes event initiated by each actor (from LSTM model, see above)
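As an illustration of how the date and time fields combine, the sketch below parses them into a single timestamp and counts reports per hour, similar in spirit to the timelines shown earlier. The helper name report_hour and the sample rows are hypothetical. The codebook does not state a time zone, so the parsed values are left timezone-naive.

```python
from collections import Counter
from datetime import datetime


def report_hour(date_str, time_str):
    """Parse date (YYYYMMDD) and time (HH:MM), then truncate to the hour."""
    stamp = datetime.strptime(f"{date_str} {time_str}", "%Y%m%d %H:%M")
    return stamp.replace(minute=0)


# Hypothetical rows carrying only the two codebook fields used here.
rows = [
    {"date": "20220224", "time": "05:07"},
    {"date": "20220224", "time": "05:55"},
    {"date": "20220224", "time": "06:10"},
]

# Number of event reports published in each hour.
per_hour = Counter(report_hour(r["date"], r["time"]) for r in rows)
for hour, n in sorted(per_hour.items()):
    print(hour.isoformat(), n)
```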

Territorial control


VIINA data on territorial control are built on a manually curated subset of the above event reports, indicating whether each district administrative center (райцентр) or other major city is presently under the control of Ukrainian forces, Russian forces, or is being actively contested between the two. Control status for all other (smaller) populated places is interpolated using the status of the geographically nearest administrative center.
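The nearest-center interpolation can be sketched as below. This is only an illustration: the source says "geographically nearest" without naming a distance measure, so great-circle (haversine) distance is an assumption, and the place names, coordinates, and statuses are hypothetical toy values rather than real control data.

```python
import math


def haversine_km(lon1, lat1, lon2, lat2):
    """Great-circle distance in kilometers between two lon/lat points."""
    radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(a))


def interpolate_control(places, centers):
    """Give each place the control status of its nearest administrative center.

    places:  iterable of (name, longitude, latitude)
    centers: list of (name, longitude, latitude, status)
    """
    result = {}
    for name, lon, lat in places:
        nearest = min(centers, key=lambda c: haversine_km(lon, lat, c[1], c[2]))
        result[name] = nearest[3]
    return result


# Hypothetical hand-coded centers and smaller populated places.
centers = [
    ("center_a", 30.0, 50.0, "UA"),
    ("center_b", 35.0, 50.0, "CONTESTED"),
    ("center_c", 35.0, 47.0, "RU"),
]
places = [
    ("village_1", 30.2, 50.1),
    ("village_2", 34.8, 49.9),
    ("village_3", 35.1, 47.2),
]
status = interpolate_control(places, centers)
```

A brute-force search over every center is adequate for a sketch; with tens of thousands of populated places, a spatial index would be the more practical choice.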

The full set of Ukrainian populated places (N = 33,156) includes all locations in the GeoNames gazetteer whose feature_code begins with PPL.

Each territorial control dataset includes the following fields:

  • geonameid: Numeric ID of populated place
  • name: Name of populated place
  • asciiname: Name of populated place, ASCII values
  • alternatenames: Alternative spellings of place name
  • longitude: Longitude coordinate of populated place
  • latitude: Latitude coordinate of populated place
  • feature_code: Type of populated place (see full list here)
  • ctr_[YYYYMMDDHHMMSS]: Control status, with timestamp (UA/RU/CONTESTED)

Note that the timestamp reflects the time at which the relevant data were collected (typically every six hours or so), which naturally lags behind the reality on the ground.
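Because each snapshot is stored as its own ctr_ column, a per-place history requires parsing the column names. The sketch below does that for one row; the helper names and the sample row are hypothetical. It accepts both 14-digit (YYYYMMDDHHMMSS) and 12-digit (YYYYMMDDHHMM) stamps, which is an assumption made because the timestamp example earlier in this README has 12 digits.

```python
from datetime import datetime

# Assumed timestamp layouts, keyed by number of digits.
FORMATS = {14: "%Y%m%d%H%M%S", 12: "%Y%m%d%H%M"}


def parse_ctr_column(column):
    """Turn a column name such as 'ctr_20220301060000' into a datetime."""
    stamp = column[len("ctr_"):]
    fmt = FORMATS.get(len(stamp))
    if fmt is None:
        raise ValueError(f"unexpected timestamp length in {column!r}")
    return datetime.strptime(stamp, fmt)


def control_history(row):
    """Return [(collected_at, status), ...] for one place, oldest first."""
    history = [
        (parse_ctr_column(key), value)
        for key, value in row.items()
        if key.startswith("ctr_")
    ]
    return sorted(history, key=lambda item: item[0])


# Hypothetical row for a single populated place.
row = {
    "geonameid": "1",
    "name": "place",
    "ctr_20220301060000": "UA",
    "ctr_20220301000000": "UA",
    "ctr_20220301120000": "CONTESTED",
}
history = control_history(row)

# First snapshot at which the recorded status differs from the initial one.
first_change = next((t for t, s in history if s != history[0][1]), None)
```

Since the timestamps mark collection time, first_change is an upper bound on when the status actually changed.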

Territorial control data are presently missing for the following dates: 2022/02/24-2022/02/26, 2022/03/06-2022/03/07. Data for these dates will be retroactively added in future updates.
