VIINA/ВІЙНА/ВОЙНА/WAR is a near-real-time, multi-source event data system for the 2022 Russian Invasion of Ukraine. These data are based on news reports from Ukrainian and Russian media, which were geocoded and classified into standard conflict event categories through machine learning.
These data are GIS-ready, with temporal precision down to the minute. Each observation is accompanied by full source information, text and URLs.
In addition to raw events, VIINA also includes data on territorial control, at the level of individual populated places.
VIINA will be updated regularly, and is freely available for use by students, journalists, policymakers, and everyday researchers.
The most recent versions of these data are available as comma-delimited text (csv) files here:
Previous versions are available here:
- Data/PreviousVersions/events_[YYYYMMDDHHMMSS].csv
- Data/PreviousVersions/control_[YYYYMMDDHHMMSS].csv
where "YYYYMMDDHHMMSS" is a time stamp (e.g. 20220224000100 is "00:01:00, February 24, 2022").
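As a minimal sketch of how the time stamp in an archived file name could be parsed, the snippet below assumes the 14-digit YYYYMMDDHHMMSS pattern described above. The helper name `parse_version` and the example file name are hypothetical and used only for illustration.

```python
import re
from datetime import datetime

# Matches archived names such as "events_20220224000100.csv".
# Assumption: the stamp is always 14 digits (YYYYMMDDHHMMSS).
STAMP_RE = re.compile(r"^(events|control)_(\d{14})\.csv$")

def parse_version(filename):
    """Return (dataset, timestamp) for an archived file name."""
    match = STAMP_RE.match(filename)
    if match is None:
        raise ValueError(f"unrecognized file name: {filename}")
    dataset, stamp = match.groups()
    return dataset, datetime.strptime(stamp, "%Y%m%d%H%M%S")

# Hypothetical file name, not necessarily present in the repository.
dataset, when = parse_version("events_20220224000100.csv")
print(dataset, when.isoformat())  # events 2022-02-24T00:01:00
```

Sorting archived files by the parsed timestamp, rather than by name, makes it straightforward to pick the snapshot closest to a date of interest.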
Please cite these data as:
- Zhukov, Yuri (2022). "VIINA: Violent Incident Information from News Articles on the 2022 Russian Invasion of Ukraine." Ann Arbor: University of Michigan, Center for Political Studies. (https://github.com/zhukovyuri/VIINA, accessed [DATE]).
Corrections and feedback are welcome:
Yuri M. Zhukov. Associate Professor of Political Science, University of Michigan. Research Associate Professor, Center for Political Studies, Institute for Social Research. zhukov-at-umich-dot-edu. sites.lsa.umich.edu/zhukov.
To select and read individual event reports by location and time, please take a look at the dashboard developed by Robert McGrath and Eric McGlinchey at the Schar School of Policy and Government at George Mason University:
Many thanks to Rob and Eric for getting this dashboard up and running!
VIINA draws on news reports from the following Ukrainian and Russian news providers:
- 24 Канал ("24tvua"): Ukrainian 24 hour news network
- Forbes Ukraine ("forbesua"): Ukrainian edition of Forbes magazine
- Інтерфакс-Україна ("interfaxua"): Ukrainian affiliate of Russia's Interfax news wire service
- Комсомольская Правда ("kp"): Russian newspaper
- ЛІГА.net ("liga"): Ukrainian internet news service
- Мілітарний ("militarnyy"): Ukrainian defense news portal
- Медиазона ("mz"): Russian news portal
- НВ ("nv"): Ukrainian magazine and internet news portal
- Независимая Газета ("ng"): Russian newspaper
- НТВ ("ntv"): Russian television news
- Українська правда ("pravdaua"): Ukrainian newspaper
- РИА Новости ("ria"): Russian news wire service
- УНІАН ("unian"): Ukrainian news wire service
To be added soon:
- Event reports from OSINT social media feeds.
This set of sources may expand/change as the war unfolds -- due to interruptions to journalistic activity from military operations, cyber attacks, and state censorship, as well as the availability of new data from other information providers.
Using an automated web scraping routine (which runs every 6 hours), VIINA extracts the text of news reports published by each source, along with their associated metadata (publication date and time, URLs). Using natural language processing, the system extracts and geocodes location names mentioned in each news item. A recurrent neural network then classifies each event report into several pre-defined categories.
Events were geo-located by place names mentioned in the text of each news report, using APIs from Yandex and OpenStreetMap. All unique geocoded locations were manually inspected for false positive and false negative matches.
Geocoding precision ranges from street-level (GEO_PRECISION="STREET") to province-level (GEO_PRECISION="ADM1").
Below is a map of all geocoded event reports since the start of Russia's military operations on February 24, 2022. Underneath the map is a timeline, showing the number of event reports published per hour, across all data sources.
Below are a map and timeline, showing the subset of war-related geocoded event reports (i.e. t_mil_b == 1, see below for details).
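The war-related subset can be reproduced with a simple filter on the binary indicator. The sketch below uses only the standard library and a tiny made-up sample; it assumes the flag is stored as "1" or "1.0" in the csv, and the helper names `is_flagged` and `war_related` are hypothetical. Dropping province-level geocodes is an optional extra step, not something the map above necessarily does.

```python
import csv
import io

# Made-up sample rows using column names from this README.
SAMPLE = """event_id,GEO_PRECISION,t_mil_b
1,STREET,1
2,ADM1,1
3,STREET,0
"""

def is_flagged(value):
    """Treat '1' or '1.0' as a positive indicator; blanks count as 0."""
    try:
        return float(value) == 1.0
    except ValueError:
        return False

def war_related(rows, exclude_precision=()):
    """Keep rows with t_mil_b == 1, optionally skipping coarse geocodes."""
    return [
        row for row in rows
        if is_flagged(row["t_mil_b"])
        and row["GEO_PRECISION"] not in exclude_precision
    ]

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
print([r["event_id"] for r in war_related(rows)])                              # ['1', '2']
print([r["event_id"] for r in war_related(rows, exclude_precision=("ADM1",))])  # ['1']
```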
To generate predicted event categories, VIINA uses a recurrent neural network (RNN) model with long short-term memory (LSTM) (Hochreiter and Schmidhuber, 1997; Chang and Masterson, 2020). LSTMs are well-suited for learning problems involving sequential data, such as sequences of words of varying length, where the vocabulary is potentially large, and where the long-term context and dependencies between inputs are potentially informative for classification (i.e. where word order and context matter, and the bag-of-words assumption is problematic).
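To illustrate why the bag-of-words assumption is problematic here, the toy example below compares two made-up English headlines that contain exactly the same words but describe opposite initiators. It is not part of the VIINA pipeline; it only shows that word counts alone cannot separate the two, while the word sequence can.

```python
from collections import Counter

# Two made-up headlines: same vocabulary, opposite initiating actor.
headline_a = "russian forces shelled ukrainian positions"
headline_b = "ukrainian forces shelled russian positions"

tokens_a = headline_a.split()
tokens_b = headline_b.split()

# A bag-of-words representation keeps only word counts.
same_bag = Counter(tokens_a) == Counter(tokens_b)

# A sequence model sees the ordered tokens.
same_sequence = tokens_a == tokens_b

print(same_bag, same_sequence)  # True False
```

A classifier for actor variables such as a_rus and a_ukr needs the ordering information that the second comparison preserves.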
The current version of the data uses a training set of 2000+ randomly-selected hand-coded texts. This training set will be updated/expanded periodically as more and different types of events are added to the text corpus.
Estimation was done in Python with the Keras library.
The data currently include the following event categories:
- t_mil: Event is about war/military operations
- t_nmil: Event is not about war/military operations (e.g. human interest story)
- t_loc: Event report includes reference to specific location
- t_san: Event report mentions economic sanctions imposed on Russia
- a_rus: Event initiated by Russian or Russian-aligned armed forces
- a_ukr: Event initiated by Ukrainian or Ukrainian-aligned armed forces
- a_civ: Event initiated by civilians
- a_other: Event initiated by a third party (e.g. U.S., EU, Red Cross)
- t_aad: Anti-air defense, Buk, shoulder-fired missiles (Igla, Strela, Stinger)
- t_airstrike: Air strike, strategic bombing, helicopter strike
- t_armor: Tank battle or assault
- t_arrest: Arrest by security services or detention of prisoners of war
- t_artillery: Shelling by field artillery, howitzer, mortar, or rockets like Grad/BM-21, Uragan/BM-27, other Multiple Launch Rocket System (MLRS)
- t_control: Establishment/claim of territorial control over population center
- t_firefight: Any exchange of gunfire with handguns, semi-automatic rifles, automatic rifles, machine guns, rocket-propelled grenades (RPGs)
- t_ied: Improvised explosive device, roadside bomb, landmine, car bomb, explosion
- t_raid: Assault/attack by paratroopers or special forces, usually followed by a retreat
- t_occupy: Occupation of territory or building
- t_property: Destruction of property or infrastructure
- t_cyber: Cyber operations, including DDOS attacks, website defacement
- t_hospitals: Attacks on hospitals and humanitarian convoys
- t_milcas: Event report mentions military casualties
- t_civcas: Event report mentions civilian casualties
This set of categories will expand in the future, as more and different types of events are added to the text corpus.
There are two versions of each variable included in the dataset:
- Predicted probabilities (ending with "_pred"): predicted probability that event belongs to each category, from the LSTM model
- Binary indicators (ending with "_b"): dummy variables, coded 1 or 0
Cutoffs for dichotomizing the predicted probabilities were selected by minimizing Type I and Type II errors against the training set. For each variable, the algorithm considers every potential cutoff ranging from 0 to 1, compares the resulting binary values to training set labels, calculates rates of false positives and false negatives, and selects the cutoff that minimizes the sum of these rates. These cutoffs are different for each variable, and are enumerated in the table below.
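The cutoff search described above can be sketched in a few lines. This is an illustration under stated assumptions, not the project's actual code: the grid size of 1000 evenly spaced cutoffs, the rule that a probability strictly greater than the cutoff is coded 1, and the tie-breaking in favor of the lowest cutoff are all assumptions, as are the function names and the sample data.

```python
def select_cutoff(probs, labels, n_grid=1000):
    """Return (cutoff, error) minimizing false positive rate + false negative rate."""
    positives = sum(labels)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("labels must contain both classes")
    best_cutoff, best_error = 0.0, float("inf")
    for i in range(n_grid):
        cutoff = i / (n_grid - 1)  # evenly spaced over [0, 1]
        fp = sum(1 for p, y in zip(probs, labels) if p > cutoff and y == 0)
        fn = sum(1 for p, y in zip(probs, labels) if p <= cutoff and y == 1)
        error = fp / negatives + fn / positives
        if error < best_error:  # strict "<" keeps the lowest cutoff on ties
            best_cutoff, best_error = cutoff, error
    return best_cutoff, best_error

def dichotomize(probs, cutoff):
    """Convert predicted probabilities ("_pred") to 0/1 indicators ("_b")."""
    return [1 if p > cutoff else 0 for p in probs]

# Made-up predicted probabilities and hand-coded labels.
probs = [0.05, 0.20, 0.35, 0.60, 0.80, 0.95]
labels = [0, 0, 1, 0, 1, 1]

cutoff, error = select_cutoff(probs, labels)
print(round(cutoff, 4), round(error, 4))
print(dichotomize(probs, cutoff))
```

Because the criterion weights false positives and false negatives equally, rare categories can end up with very low cutoffs and common ones with very high cutoffs, which is consistent with the wide range of values in the table below.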
Below are in-sample prediction accuracy statistics for each variable (auc: area under the ROC curve, fitted values against training set labels), along with the number of events with probabilities greater than .10 (n_p10) and greater than .90 (n_p90). Also included are recommended cutoffs for dichotomizing each variable (cutoff_01).
variable | auc | n_p10 | n_p90 | cutoff_01 |
---|---|---|---|---|
a_rus_pred | 0.9654805 | 13547 | 10958 | 0.9989986 |
a_ukr_pred | 0.9790863 | 15217 | 8141 | 0.9899900 |
a_civ_pred | 0.9526951 | 9297 | 177 | 0.3082623 |
a_other_pred | 0.9555902 | 5332 | 4822 | 0.4984981 |
t_aad_pred | 0.9680766 | 1527 | 1517 | 0.0380380 |
t_airstrike_pred | 0.9000190 | 2662 | 2620 | 0.9969970 |
t_armor_pred | 0.9027617 | 935 | 861 | 0.3143143 |
t_arrest_pred | 0.8851692 | 2473 | 2233 | 0.0020020 |
t_artillery_pred | 0.9825025 | 6815 | 6678 | 0.0010010 |
t_civcas_pred | 0.9635935 | 4104 | 4003 | 0.1021021 |
t_control_pred | 0.9520249 | 15111 | 730 | 0.2702702 |
t_cyber_pred | 0.9616443 | 4853 | 4717 | 0.9989990 |
t_firefight_pred | 0.8045639 | 764 | 723 | 0.0150150 |
t_hospital_pred | 0.9440955 | 537 | 501 | 0.0010010 |
t_ied_pred | 0.9315872 | 987 | 883 | 0.1091091 |
t_killing_pred | 0.4966533 | 749 | 93 | 0.8512725 |
t_loc_pred | 0.9421611 | 41178 | 39936 | 0.9209209 |
t_mil_pred | 0.9815032 | 60329 | 42744 | 0.5575579 |
t_milcas_pred | 0.9370085 | 3654 | 3469 | 0.9959960 |
t_occupy_pred | 0.6511453 | 136 | 114 | 0.0100086 |
t_property_pred | 0.7840882 | 4093 | 3654 | 0.9699700 |
t_raid_pred | 0.8075744 | 661 | 620 | 0.0130130 |
t_san_pred | 0.9706459 | 12483 | 11928 | 0.9389389 |
This table is updated daily and is available in csv format here:
Note that these statistics are subject to change, as new events are added to the corpus and as the training set expands.
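For readers who want to recompute statistics like those in the table, the sketch below shows one standard way to obtain an AUC (the share of positive/negative pairs ranked correctly, with ties counted as one half) and the n_p10 and n_p90 counts. The function names and sample data are made up, and whether the project uses strict or non-strict inequalities for the counts is an assumption.

```python
def auc(probs, labels):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    pos = [p for p, y in zip(probs, labels) if y == 1]
    neg = [p for p, y in zip(probs, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("labels must contain both classes")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a correct ranking
    return wins / (len(pos) * len(neg))

def count_above(probs, threshold):
    """Number of predicted probabilities strictly greater than threshold."""
    return sum(1 for p in probs if p > threshold)

# Made-up predicted probabilities and hand-coded labels.
probs = [0.05, 0.20, 0.35, 0.60, 0.80, 0.95]
labels = [0, 0, 1, 0, 1, 1]

print(round(auc(probs, labels), 4))  # 0.8889
print(count_above(probs, 0.10))      # 5
print(count_above(probs, 0.90))      # 1
```

The pairwise loop is quadratic in the number of observations, which is fine for a training set of a few thousand texts; a rank-based formula would be preferable for much larger samples.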
Below are illustrative word clouds for several categories of events. The font size is proportional to word frequencies in news wire headlines predicted as being most likely to belong to each topic category (99th percentile of predicted probability). The clouds are for out-of-sample predictions on the full set of news stories in the corpus.
A quick guide to what some of the words mean:
- "окупанти" (okupanty) means "occupiers" (in Ukrainian)
- "зсу" (zsu) is the acronym for Armed Forces of Ukraine (in Ukrainian)
- "ворог" (voroh) means "enemy"
- "війська" (viys'ka) means "forces"
- "всу" (vsu) is the acronym for Armed Forces of Ukraine (in Russian)
- "заявили днр" (zayavili dnr) means "DNR has claimed" (in Russian)
- "санкції" (sanktsiyi) means "sanctions"
- "сша" (ssha) means USA
- there are also terms here for sanctions related to SWIFT, Visa, MasterCard
- "збили" (zbyly) means "shot down"
- "літак" (litak) means "aircraft"
- "повітряна тривога" (povitryana tryvoha) means "air raid alert"
- "затримали" (zatrymaly) means "arrested" or "detained"
- "взяли в полон" (vzyaly v polon) means "taken prisoner"
- "танки" (tanki) means "tanks"
- "окупантів" (okupantiv) means "of occupiers"
- "голова ода" (holova oda) means "head of regional administration" (such officials sometimes make announcements about territorial control)
- "місто" (misto) means "city"
- "контролем" (kontrolem) means "[under] control"
- "бої" (boyi) means "fighting" (in Ukrainian)
- "бои" (boi) means "fighting" (in Russian)
- "обстріл" (obstril) means "shelling"
- "ракети" (rakety) means "rockets"
- "заявили днр" (zayavili dnr) means "DNR has claimed" (i.e. allegations of shelling by UA forces in Donbas)
- "наступ" (nastup) means "advance/offensive"
- "діверсантів" (diversantiv) means "of saboteurs/diversionary units"
- "висадився десант" (vysadyvsya desant) means "paratroopers landed"
- "будинків" (budynkiv) means "houses"
- "з під завалів" (z pid zavaliv) means "from under the rubble"
- "хакери" (khakery) means "hackers"
- "зламали сайт" (zlamaly sayt) means "hacked the website"
- "лікарні" (likarni) means "hospitals"
- "гуманітарні коридори" (humanitarni korydory) means "humanitarian corridors"
- "працюють" (pratsyuyut') means "are working"
- "понад" (ponad) means "more than"
- "втрати" (vtraty) means "losses"
- "загинули" (zahynuly) means "died"
- "поранені" (poraneni) means "wounded"
- "людей" (lyudey) means "people"
- "дітей" (ditey) means "children"
- "цивільних" (tsyvil'nykh) means "civilian"
- event_id: Unique event ID
- report_id: Unique ID for report that contains the event
- location: Index of unique locations mentioned in each event
- tempid: Temporary numeric ID
- source: Data source short name
- date: Date of event report (YYYYMMDD)
- time: Time of event report (HH:MM)
- url: URL web address of event report
- text: Text of event report headline/description
- lang: Language of report (ua is Ukrainian, ru is Russian)
- address: Address of geocoded location
- longitude: Longitude coordinate of event location
- latitude: Latitude coordinate of event location
- GEO_PRECISION: geographic precision of geocoded location
- GEO_API: Geocoding API used to locate event
- t_[event type]: Predicted probability that report describes event of each type (from LSTM model, see above)
- a_[actor]: Predicted probability that report describes event initiated by each actor (from LSTM model, see above)
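Since date and time are stored in separate fields, a common first step is to combine them into a single timestamp. The sketch below assumes the formats listed above (YYYYMMDD and HH:MM); the helper name `report_datetime` and the sample row are made up, and the result is a naive datetime because the time zone of the reports is not stated here.

```python
from datetime import datetime

def report_datetime(row):
    """Combine the 'date' (YYYYMMDD) and 'time' (HH:MM) fields of one event row."""
    return datetime.strptime(f'{row["date"]} {row["time"]}', "%Y%m%d %H:%M")

# Made-up row with only the two fields needed here.
row = {"date": "20220224", "time": "05:07"}
print(report_datetime(row).isoformat())  # 2022-02-24T05:07:00
```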
VIINA data on territorial control are built on a manually curated subset of the above event reports, indicating whether each district administrative center (райцентр) or other major city is presently under the control of Ukrainian forces, under the control of Russian forces, or actively contested between the two. Control status for all other (smaller) populated places is interpolated using the status of the geographically nearest administrative center.
The full set of Ukrainian populated places (N = 33,156) includes all locations in the GeoNames gazetteer with a feature_code beginning with "PPL".
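The nearest-center interpolation described above can be sketched as follows. This is an illustration only: the use of great-circle (haversine) distance is an assumption, since the text says only "geographically nearest", and the function names, place names, coordinates, and statuses are all made up.

```python
import math

def haversine_km(lon1, lat1, lon2, lat2):
    """Great-circle distance in kilometers between two lon/lat points."""
    radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(a))

def interpolate_control(place, centers):
    """Assign a place the control status of its nearest curated center."""
    nearest = min(
        centers,
        key=lambda c: haversine_km(
            place["longitude"], place["latitude"], c["longitude"], c["latitude"]
        ),
    )
    return nearest["status"]

# Hypothetical curated centers and a hypothetical small settlement.
centers = [
    {"name": "Center A", "longitude": 30.0, "latitude": 50.0, "status": "UA"},
    {"name": "Center B", "longitude": 33.0, "latitude": 47.0, "status": "CONTESTED"},
]
village = {"name": "Village X", "longitude": 32.5, "latitude": 47.2}

print(interpolate_control(village, centers))  # CONTESTED
```

A linear scan over all centers is adequate for tens of thousands of places; a spatial index would only matter at much larger scales.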
Each territorial control dataset includes the following fields:
- geonameid: Numeric ID of populated place
- name: Name of populated place
- asciiname: Name of populated place, ASCII values
- alternatenames: Alternative spellings of place name
- longitude: Longitude coordinate of populated place
- latitude: Latitude coordinate of populated place
- feature_code: Type of populated place (see full list here)
- ctr_[YYYYMMDDHHMMSS]: Control status, with timestamp (UA/RU/CONTESTED)
Note that the timestamp reflects the time at which the relevant data were collected (typically every six hours or so), which naturally lags behind the reality on the ground.
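Because the control status is stored in columns named by collection time, looking up the status "as of" a given moment means finding the latest column stamped at or before that moment. The sketch below assumes a row may carry one or more ctr_[YYYYMMDDHHMMSS] columns; the helper names and the sample row are made up.

```python
import re
from datetime import datetime

CTR_RE = re.compile(r"^ctr_(\d{14})$")

def control_columns(fieldnames):
    """Map each ctr_[YYYYMMDDHHMMSS] column name to its parsed timestamp."""
    columns = {}
    for name in fieldnames:
        match = CTR_RE.match(name)
        if match:
            columns[name] = datetime.strptime(match.group(1), "%Y%m%d%H%M%S")
    return columns

def status_as_of(row, when):
    """Return the most recent control status collected at or before `when`."""
    eligible = [(ts, name) for name, ts in control_columns(row.keys()).items() if ts <= when]
    if not eligible:
        return None  # no snapshot exists yet at that time
    _, latest_column = max(eligible)
    return row[latest_column]

# Made-up row with two hypothetical snapshots.
row = {
    "geonameid": "0",
    "name": "Example",
    "ctr_20220301060000": "UA",
    "ctr_20220301120000": "CONTESTED",
}
print(status_as_of(row, datetime(2022, 3, 1, 9, 0)))   # UA
print(status_as_of(row, datetime(2022, 3, 1, 13, 0)))  # CONTESTED
```

Since the timestamp marks data collection rather than the change on the ground, the returned status should be read as "no later than" information.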
Territorial control data are presently missing for the following dates: 2022/02/24-2022/02/26, 2022/03/06-2022/03/07. Data for these dates will be retroactively added in future updates.