VIINA / Violent Incident Information from News Articles

2022 Russian Invasion of Ukraine

(events)

(territorial control)

VIINA/ВІЙНА/ВОЙНА/WAR is a near-real-time, multi-source event data system for the 2022 Russian Invasion of Ukraine. These data are based on news reports from Ukrainian and Russian media, which were geocoded and classified into standard conflict event categories through machine learning.

These data are GIS-ready, with temporal precision down to the minute. Each observation is accompanied by full source information, text and URLs.

In addition to raw events, VIINA also includes data on territorial control, at the level of individual populated places.

VIINA will be updated regularly, and is freely available for use by students, journalists, policymakers, and everyday researchers.

The most recent versions of these data are available as comma-delimited text (csv) files here:

Previous versions are available here:

where "YYYYMMDDHHMMSS" is a time stamp (e.g. 202202240001 is "00:01, February 24, 2022").
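
As a minimal sketch, a time stamp of this kind can be parsed with Python's standard library. The helper name `parse_viina_stamp` is hypothetical and not part of VIINA. Because the pattern above has 14 digits while the example has 12 (no seconds), the sketch assumes both lengths may occur and accepts either.

```python
from datetime import datetime


def parse_viina_stamp(stamp: str) -> datetime:
    """Parse a VIINA-style time stamp (hypothetical helper, not part of VIINA).

    Assumption: stamps are either 12 digits (YYYYMMDDHHMM, as in the example
    above) or 14 digits (YYYYMMDDHHMMSS, as in the pattern above).
    """
    formats = {12: "%Y%m%d%H%M", 14: "%Y%m%d%H%M%S"}
    if not stamp.isdigit() or len(stamp) not in formats:
        raise ValueError(f"unexpected time stamp: {stamp!r}")
    return datetime.strptime(stamp, formats[len(stamp)])


example = parse_viina_stamp("202202240001")
print(example.isoformat())  # 2022-02-24T00:01:00
```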

Please cite these data as:

Zhukov, Yuri (2022). "VIINA: Violent Incident Information from News Articles on the 2022 Russian Invasion of Ukraine." Ann Arbor: University of Michigan, Center for Political Studies. (https://github.com/zhukovyuri/VIINA, accessed [DATE]).

Corrections, feedback welcome:

Yuri M. Zhukov. Associate Professor of Political Science, University of Michigan. Research Associate Professor, Center for Political Studies, Institute for Social Research. zhukov-at-umich-dot-edu. sites.lsa.umich.edu/zhukov.

Interactive Map

To select and read individual event reports by location and time, please take a look at the dashboard developed by Robert McGrath and Eric McGlinchey at the Schar School of Policy and Government at George Mason University:

go.gmu.edu/ukraine

Many thanks to Rob and Eric for getting this dashboard up and running!

Data Sources

VIINA draws on news reports from the following Ukrainian and Russian news providers:

24 Канал ("24tvua"): Ukrainian 24 hour news network
Forbes Ukraine ("forbesua"): Ukrainian edition of Forbes magazine
Інтерфакс-Україна ("interfaxua"): Ukrainian affiliate of Russia's Interfax news wire service
Комсомольская Правда ("kp"): Russian newspaper
ЛІГА.net ("liga"): Ukrainian internet news service
Мілітарний ("militarnyy"): Ukrainian defense news portal
Медиазона ("mz"): Russian news portal
НВ ("nv"): Ukrainian magazine and internet news portal
Независимая Газета ("ng"): Russian newspaper
НТВ ("ntv"): Russian television news
Українська правда ("pravdaua"): Ukrainian newspaper
РИА Новости ("ria"): Russian news wire service
УНІАН ("unian"): Ukrainian news wire service

To be added soon:

Event reports from OSINT social media feeds.

This set of sources may expand/change as the war unfolds -- due to interruptions to journalistic activity from military operations, cyber attacks, and state censorship, as well as the availability of new data from other information providers.

Using an automated web scraping routine (which runs every 6 hours), VIINA extracts the text of news reports published by each source and their associated metadata (publication time and date, web urls). Using natural language processing, the system extracts and geocodes location names mentioned in each news item. A recurrent neural network then classifies each event report into several pre-defined categories.

Geocoding

Events were geo-located by place names mentioned in the text of each news report, using APIs from Yandex and OpenStreetMap. All unique geocoded locations were manually inspected for false positive and false negative matches.

Geocoding precision ranges from street-level (GEO_PRECISION="STREET") to province-level (GEO_PRECISION="ADM1").

Below is a map of all geocoded event reports since the start of Russia's military operations on February 24, 2022. Underneath the map is a timeline, showing the number of event reports published per hour, across all data sources.

Below are a map and timeline, showing the subset of war-related geocoded event reports (i.e. t_mil_b == 1, see below for details).

Event classification

To generate predicted event categories, VIINA uses a recurrent neural network (RNN) model with long short-term memory (LSTM) (Hochreiter and Schmidhuber, 1997; Chang and Masterson, 2020). LSTMs are well-suited for learning problems related to sequential data, such as sequences of words of differential length, where the vocabulary is potentially large, and where the long-term context and dependencies between inputs are potentially informative for classification (i.e. where word order and context matters, and the bag-of-words assumption is problematic).

The current version of the data uses a training set of 2000+ randomly-selected hand-coded texts. This training set will be updated/expanded periodically as more and different types of events are added to the text corpus.

Estimation was done in Python with the Keras library.

The data currently include the following event categories:

t_mil: Event is about war/military operations
t_nmil: Event is not about war/military operations (e.g. human interest story)
t_loc: Event report includes reference to specific location
t_san: Event report mentions economic sanctions imposed on Russia
a_rus: Event initiated by Russian or Russian-aligned armed forces
a_ukr: Event initiated by Ukrainian or Ukrainian-aligned armed forces
a_civ: Event initiated by civilians
a_other: Event initiated by a third party (e.g. U.S., EU, Red Cross)
t_aad: Anti-air defense, Buk, shoulder-fired missiles (Igla, Strela, Stinger)
t_airstrike: Air strike, strategic bombing, helicopter strike
t_armor: Tank battle or assault
t_arrest: Arrest by security services or detention of prisoners of war
t_artillery: Shelling by field artillery, howitzer, mortar, or rockets like Grad/BM-21, Uragan/BM-27, other Multiple Launch Rocket System (MLRS)
t_control: Establishment/claim of territorial control over population center
t_firefight: Any exchange of gunfire with handguns, semi-automatic rifles, automatic rifles, machine guns, rocket-propelled grenades (RPGs)
t_ied: Improvised explosive device, roadside bomb, landmine, car bomb, explosion
t_raid: Assault/attack by paratroopers or special forces, usually followed by a retreat
t_occupy: Occupation of territory or building
t_property: Destruction of property or infrastructure
t_cyber: Cyber operations, including DDOS attacks, website defacement
t_hospital: Attacks on hospitals and humanitarian convoys
t_milcas: Event report mentions military casualties
t_civcas: Event report mentions civilian casualties

This set of categories will expand in the future, as more and different types of events are added to the text corpus.

There are two versions of each variable included in the dataset:

Predicted probabilities (ending with "_pred"): predicted probability that event belongs to each category, from the LSTM model
Binary indicators (ending with "_b"): dummy variables, coded 1 or 0

Cutoffs for dichotomizing the predicted probabilities were selected by minimizing Type I and Type II errors against the training set. For each variable, the algorithm considers every potential cutoff ranging from 0 to 1, compares the resulting binary values to training set labels, calculates rates of false positives and false negatives, and selects the cutoff that minimizes the sum of these rates. These cutoffs are different for each variable, and are enumerated in the table below.
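
The cutoff search described above can be sketched in plain Python. This is an illustration under stated assumptions, not the code used to build VIINA: the function name `select_cutoff`, the 1000-point grid, the use of `>=` when dichotomizing, and the tie-breaking rule (keep the first minimum) are all assumptions, and the probabilities and labels are toy values.

```python
def select_cutoff(probs, labels, n_grid=1000):
    """Return (cutoff, error) minimizing false positive rate + false negative rate.

    Hypothetical helper. Assumptions: an evenly spaced grid on [0, 1],
    a probability >= cutoff is coded 1, and ties keep the lowest cutoff.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("labels must contain both classes")

    best_cutoff, best_error = None, float("inf")
    for i in range(n_grid):
        cutoff = i / (n_grid - 1)
        # Count disagreements between dichotomized predictions and labels.
        false_pos = sum(1 for p, y in zip(probs, labels) if p >= cutoff and y == 0)
        false_neg = sum(1 for p, y in zip(probs, labels) if p < cutoff and y == 1)
        error = false_pos / negatives + false_neg / positives
        if error < best_error:
            best_cutoff, best_error = cutoff, error
    return best_cutoff, best_error


# Toy predicted probabilities and hand-coded labels (made up for illustration).
toy_probs = [0.05, 0.20, 0.35, 0.60, 0.80, 0.95]
toy_labels = [0, 0, 0, 1, 1, 1]
cutoff, error = select_cutoff(toy_probs, toy_labels)
print(cutoff, error)
```

In this toy case the classes are perfectly separable, so any cutoff between 0.35 and 0.60 yields zero error and the sketch returns the lowest such grid point.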

Below are in-sample prediction accuracy statistics for each variable (auc: area under the ROC curve, fitted values against training set labels), along with the number of events with probabilities greater than .10 (n_p10) and greater than .90 (n_p90). Also included are recommended cutoffs for dichotomizing each variable (cutoff_01).

variable	auc	n_p10	n_p90	cutoff_01
a_rus_pred	0.9654805	13547	10958	0.9989986
a_ukr_pred	0.9790863	15217	8141	0.9899900
a_civ_pred	0.9526951	9297	177	0.3082623
a_other_pred	0.9555902	5332	4822	0.4984981
t_aad_pred	0.9680766	1527	1517	0.0380380
t_airstrike_pred	0.9000190	2662	2620	0.9969970
t_armor_pred	0.9027617	935	861	0.3143143
t_arrest_pred	0.8851692	2473	2233	0.0020020
t_artillery_pred	0.9825025	6815	6678	0.0010010
t_civcas_pred	0.9635935	4104	4003	0.1021021
t_control_pred	0.9520249	15111	730	0.2702702
t_cyber_pred	0.9616443	4853	4717	0.9989990
t_firefight_pred	0.8045639	764	723	0.0150150
t_hospital_pred	0.9440955	537	501	0.0010010
t_ied_pred	0.9315872	987	883	0.1091091
t_killing_pred	0.4966533	749	93	0.8512725
t_loc_pred	0.9421611	41178	39936	0.9209209
t_mil_pred	0.9815032	60329	42744	0.5575579
t_milcas_pred	0.9370085	3654	3469	0.9959960
t_occupy_pred	0.6511453	136	114	0.0100086
t_property_pred	0.7840882	4093	3654	0.9699700
t_raid_pred	0.8075744	661	620	0.0130130
t_san_pred	0.9706459	12483	11928	0.9389389

This table is updated daily and is available in csv format here:

auc_latest.csv

Note that these statistics are subject to change, as new events are added to the corpus and as the training set expands.
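
For readers who want to apply the recommended cutoffs themselves, the following is a minimal sketch. It assumes that auc_latest.csv has the same column names as the table above and that a probability at or above the cutoff is coded 1; the function name `dichotomize` and the event probabilities are made up for illustration. The two cutoff rows are copied from the table above.

```python
import csv
import io

# Two rows copied from the table above; the header is assumed to match
# the column names used in auc_latest.csv.
cutoff_csv = """variable,auc,n_p10,n_p90,cutoff_01
t_mil_pred,0.9815032,60329,42744,0.5575579
t_san_pred,0.9706459,12483,11928,0.9389389
"""

cutoffs = {
    row["variable"]: float(row["cutoff_01"])
    for row in csv.DictReader(io.StringIO(cutoff_csv))
}


def dichotomize(event, cutoffs):
    """Turn "_pred" probabilities into "_b" indicators (hypothetical helper).

    Assumption: a probability >= cutoff is coded 1, otherwise 0.
    """
    binary = {}
    for variable, cutoff in cutoffs.items():
        name = variable[: -len("_pred")] + "_b"
        binary[name] = int(float(event[variable]) >= cutoff)
    return binary


# Made-up probabilities for a single event report.
toy_event = {"t_mil_pred": "0.83", "t_san_pred": "0.40"}
print(dichotomize(toy_event, cutoffs))  # {'t_mil_b': 1, 't_san_b': 0}
```

The same pattern allows a stricter or looser threshold to be substituted for any variable when the recommended cutoff does not suit a particular application.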

Below are illustrative word clouds for several categories of events. The font size is proportional to word frequencies in news wire headlines predicted as being most likely to belong to each topic category (99th percentile of predicted probability). The clouds are for out-of-sample predictions on the full set of news stories in the corpus.

Events about war/military operations (t_mil_*)

A quick guide to what some of the words mean:

"окупанти" (okupanty) means "occupiers" (in Ukrainian)
"зсу" (zsu) is the acronym for Armed Forces of Ukraine (in Ukrainian)

Russian-initiated events (a_rus_*)

"окупанти" (okupanty) means "occupiers"
"ворог" (voroh) means "enemy"
"війска" (viyska) means "forces"

Ukrainian-initiated events (a_ukr_*)

"зсу" (zsu) is the acronym for Armed Forces of Ukraine (in Ukrainian)
"всу" (vsu) is the acronym for Armed Forces of Ukraine (in Russian)
"заявили днр" (zayavili dnr) means "DNR has claimed" (in Russian)

Sanctions (t_san_*)

"санкції" (sanktsiyi) means "sanctions"
"сша" (ssha) means USA
there are also terms here for sanctions related to SWIFT, Visa, MasterCard

Anti-air defense (t_aad_*)

"збили" (zbyly) means "shot down"
"літак" (litak) means "aircraft"

Air strikes (t_airstrike_*)

"повітряна тривога" (povitryana tryvoha) means "air raid alert"

Arrests or detentions of POWs (t_arrest_*)

"затримали" (zatrymaly) means "arrested" or "detained"
"взяли в полон" (vzyaly v polon) means "taken prisoner"

Tank battles (t_armor_*)

"танки" (tanki) means "tanks"
"окупантів" (okupantiv) means "of occupiers"

Territorial control (t_control_*)

"голова ода" (holova oda) means "head of regional administration" (such officials sometimes make announcements about territorial control)
"місто" (misto) means "city"
"контролем" (kontrolem) means "[under] control"

Firefights (t_firefight_*)

"бої" (boyi) means "fighting" (in Ukrainian)
"бои" (boi) means "fighting" (in Russian)

Artillery shelling and rocket strikes (t_artillery_*)

"обстріл" (obstril) means "shelling"
"ракети" (rakety) means "rockets"
"заявили днр" (zayavili dnr) means "DNR has claimed" (i.e. allegations of shelling by UA forces in Donbas)

Raid (t_raid_*)

"наступ" (nastup) means "advance/offensive"
"діверсантів" (diversantiv) means "of saboteurs/diversionary units"
"висадився десант" (vysadyvsya desant) means "paratroopers landed"

Destruction of property or infrastructure (t_property_*)

"будинків" (budynkiv) means "houses"
"з під завалів" (z pid zavaliv) means "from under the rubble"

Cyber operations (t_cyber_*)

"хакери" (khakery) means "hackers"
"зламали сайт" (zlamaly sayt) means "hacked the website"

Attacks on hospitals (t_hospital_*)

"лікарні" (likarni) means "hospitals"
"гуманітарні коридори" (humanitarni korydory) means "humanitarian corridors"
"працюють" (pratsyuyut') means "are working"

Military casualties (t_milcas_*)

"понад" (ponad) means "more than"
"втрати" (vtraty) means "losses"
"окупантів" (okupantiv) means "of occupiers"
"загинули" (zahynuly) means "died"

Civilian casualties (t_civcas_*)

"поранені" (poraneni) means "wounded"
"людей" (lyudey) means "people"
"дітей" (ditey) means "children"
"цивільних" (tsyvil'nykh) means "civilian"

Codebook

event_id: Unique event ID
report_id: Unique ID for report that contains the event
location: Index of unique locations mentioned in each event
tempid: Temporary numeric ID
source: Data source short name
date: Date of event report (YYYYMMDD)
time: Time of event report (HH:MM)
url: URL web address of event report
text: Text of event report headline/description
lang: Language of report (ua is Ukrainian, ru is Russian)
address: Address of geocoded location
longitude: Longitude coordinate of event location
latitude: Latitude coordinate of event location
GEO_PRECISION: geographic precision of geocoded location
GEO_API: Geocoding API used to locate event
t_[event type]: Predicted probability that report describes event of each type (from LSTM model, see above)
a_[actor]: Predicted probability that report describes event initiated by each actor (from LSTM model, see above)
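
As an illustration of how these fields can be used, the sketch below selects war-related event reports (t_mil_b equal to 1) and optionally restricts them by geocoding precision, using only Python's standard library. The rows are invented for the example, the function name `war_events` is hypothetical, and the sketch assumes that the binary indicators are stored in the file as the strings "1" and "0".

```python
import csv
import io

# Invented rows that follow the codebook's field names; not real VIINA records.
toy_csv = """event_id,source,date,time,GEO_PRECISION,t_mil_b
1,unian,20220224,05:10,STREET,1
2,ria,20220224,06:30,ADM1,1
3,nv,20220225,09:00,STREET,0
"""


def war_events(csv_text, precision=None):
    """Return rows with t_mil_b == 1, optionally filtered by GEO_PRECISION.

    Hypothetical helper. csv.DictReader yields strings, so the indicator
    is compared against "1" rather than the integer 1.
    """
    selected = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        if row["t_mil_b"] != "1":
            continue
        if precision is not None and row["GEO_PRECISION"] != precision:
            continue
        selected.append(row)
    return selected


print([row["event_id"] for row in war_events(toy_csv)])            # ['1', '2']
print([row["event_id"] for row in war_events(toy_csv, "STREET")])  # ['1']
```

To read an actual data file, the `io.StringIO` wrapper would be replaced with an opened file handle (with `encoding="utf-8"`, since the text fields contain Cyrillic).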

Territorial control

VIINA data on territorial control are built on a manually curated subset of the above event reports, indicating whether each district administrative center (райцентр) or other major city is presently under the control of Ukrainian forces, Russian forces, or is being actively contested between the two. Control status for all other (smaller) populated places is interpolated using the status of the geographically nearest administrative center.
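
The nearest-center interpolation can be sketched as follows. This is an illustration, not VIINA's implementation: the text above does not specify the distance metric, so great-circle (haversine) distance is an assumption, and the function names, place names, coordinates, and statuses are all made up.

```python
from math import asin, cos, radians, sin, sqrt


def haversine_km(lon1, lat1, lon2, lat2):
    """Great-circle distance in kilometers (assumed metric, see lead-in)."""
    lon1, lat1, lon2, lat2 = map(radians, (lon1, lat1, lon2, lat2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(a))


def interpolate_control(places, centers):
    """Assign each place the status of its nearest center (hypothetical helper).

    places:  iterable of (name, longitude, latitude)
    centers: list of (name, longitude, latitude, status)
    """
    result = {}
    for name, lon, lat in places:
        nearest = min(centers, key=lambda c: haversine_km(lon, lat, c[1], c[2]))
        result[name] = nearest[3]
    return result


# Made-up centers with hand-coded statuses, and made-up smaller places.
toy_centers = [
    ("Center A", 30.0, 50.0, "UA"),
    ("Center B", 36.0, 48.0, "RU"),
    ("Center C", 33.0, 49.0, "CONTESTED"),
]
toy_places = [
    ("Village 1", 30.2, 50.1),
    ("Village 2", 35.8, 48.1),
    ("Village 3", 33.1, 48.9),
]
print(interpolate_control(toy_places, toy_centers))
# {'Village 1': 'UA', 'Village 2': 'RU', 'Village 3': 'CONTESTED'}
```

A linear scan over centers is adequate for a sketch; for tens of thousands of populated places a spatial index would be the more practical choice.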

The full set of Ukrainian populated places (N = 33,156) includes all locations in the GeoNames gazetteer with feature codes beginning with PPL.

Each territorial control dataset includes the following fields:

geonameid: Numeric ID of populated place
name: Name of populated place
asciiname: Name of populated place, ASCII values
alternatenames: Alternative spellings of place name
longitude: Longitude coordinate of populated place
latitude: Latitude coordinate of populated place
feature_code: Type of populated place (see full list here)
ctr_[YYYYMMDDHHMMSS]: Control status, with timestamp (UA/RU/CONTESTED)

Note that the timestamp reflects the time at which the relevant data were collected (typically every six hours or so), which naturally lags behind the reality on the ground.
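
Because each collection time is stored as its own ctr_ column, it is often convenient to reshape the data into one row per place and time stamp. The sketch below does this with the standard library. The function name `control_history` and the sample rows are invented, and the sketch assumes column names follow the 14-digit ctr_[YYYYMMDDHHMMSS] pattern exactly.

```python
import csv
import io
from datetime import datetime

# Invented sample in the layout described above; not real VIINA records.
sample = """geonameid,name,ctr_20220301120000,ctr_20220301180000
1,Place A,UA,CONTESTED
2,Place B,RU,RU
"""


def control_history(csv_text):
    """Reshape wide ctr_ columns into (geonameid, timestamp, status) tuples.

    Hypothetical helper. Assumes every column starting with "ctr_" is
    followed by a 14-digit YYYYMMDDHHMMSS collection time.
    """
    rows = []
    for record in csv.DictReader(io.StringIO(csv_text)):
        for column, status in record.items():
            if column.startswith("ctr_"):
                stamp = datetime.strptime(column[len("ctr_"):], "%Y%m%d%H%M%S")
                rows.append((record["geonameid"], stamp, status))
    return rows


for geonameid, stamp, status in control_history(sample):
    print(geonameid, stamp.isoformat(), status)
```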

Territorial control data are presently missing for the following dates: 2022/02/24-2022/02/26, 2022/03/06-2022/03/07. Data for these dates will be retroactively added in future updates.
