Embed2Detect

Embed2Detect is an event detection mechanism developed for social media data. Please refer to the paper "Embed2Detect: Temporally Clustered Embedded Words for Event Detection in Social Media" for more details about this approach.
If you use this system, please consider citing the paper; the reference details are given below.

About

Python 3.7 implementation of Embed2Detect.
The required packages are listed in requirements.txt.

Event detection

Run main.py with the following parameters:

data_file_path - path to input file
from_time, to_time - start and end of the time period for event detection, formatted as %Y_%m_%d_%H_%M_%S (e.g. 2019_10_20_15_28_00). The data in the input file is filtered to this time period.
window_legth - length of each time window, in minutes
alpha - value for parameter alpha
beta - value for parameter beta
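
The time parameters above can be illustrated with a minimal sketch that parses from_time and to_time in the documented format and splits the period into windows of a given length in minutes. The helper split_into_windows is hypothetical and written only for illustration; it is not part of Embed2Detect, and the way main.py builds its windows internally may differ.

```python
from datetime import datetime, timedelta

# Format documented above for from_time / to_time.
PERIOD_FORMAT = "%Y_%m_%d_%H_%M_%S"


def split_into_windows(from_time, to_time, window_minutes):
    """Return (start, end) datetime pairs covering [from_time, to_time).

    Hypothetical helper for illustration; not taken from the project.
    """
    start = datetime.strptime(from_time, PERIOD_FORMAT)
    end = datetime.strptime(to_time, PERIOD_FORMAT)
    step = timedelta(minutes=window_minutes)
    windows = []
    while start < end:
        # The last window is cut short if the period is not a multiple of the step.
        windows.append((start, min(start + step, end)))
        start += step
    return windows


windows = split_into_windows("2019_10_20_15_28_00", "2019_10_20_15_34_00", 2)
for window_start, window_end in windows:
    print(window_start.strftime(PERIOD_FORMAT), "->", window_end.strftime(PERIOD_FORMAT))
```

A six-minute period with a two-minute window gives three windows in this sketch.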

Input file format

The input is a .tsv file formatted as follows:

It should contain one post (e.g. tweet) per line
It should contain 3 compulsory columns with the headers id, timestamp and text (any other column is ignored)
The timestamp should be formatted as %Y-%m-%d %H:%M:%S (e.g. 2019-10-20 15:25:00)
Empty cells should be indicated using '_na_'
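
The format rules above can be checked before running event detection. The following is a minimal sketch using only the Python standard library; validate_tsv is a hypothetical helper, not a function provided by Embed2Detect, and it assumes the file uses no quoting.

```python
import csv
import io
from datetime import datetime

TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M:%S"
REQUIRED_COLUMNS = ("id", "timestamp", "text")
EMPTY_CELL = "_na_"


def validate_tsv(handle):
    """Return a list of problems found in a tab-separated input file."""
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        return ["missing columns: " + ", ".join(missing)]
    problems = []
    # The header is line 1, so the first post is on line 2.
    for line_no, row in enumerate(reader, start=2):
        for column in REQUIRED_COLUMNS:
            if not row[column]:
                problems.append(f"line {line_no}: empty {column}, expected {EMPTY_CELL!r}")
        stamp = row["timestamp"]
        if stamp and stamp != EMPTY_CELL:
            try:
                datetime.strptime(stamp, TIMESTAMP_FORMAT)
            except ValueError:
                problems.append(f"line {line_no}: bad timestamp {stamp!r}")
    return problems


# In-memory sample; a real file would be opened with open(path, newline="").
sample = (
    "id\ttimestamp\ttext\n"
    "1\t2019-10-20 15:25:00\tgoal scored in the first half\n"
    "2\t2019/10/20 15:26\t_na_\n"
)
problems = validate_tsv(io.StringIO(sample))
print(problems)
```

The second post in the sample is reported because its timestamp does not follow the required format.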

Output format

When event detection completes, a folder named after the input file is saved in the results_folder_path set in project_config. This folder contains .txt files, one for each event window, in which the event words are saved one word per line.
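
The output described above can be read back with a few lines of Python. This is a minimal sketch: load_event_words is a hypothetical helper, and the file name window_1.txt in the demonstration is made up, since the naming of the .txt files is not specified here.

```python
import tempfile
from pathlib import Path


def load_event_words(results_dir):
    """Map each .txt file name (without extension) to its list of event words."""
    events = {}
    for path in sorted(Path(results_dir).glob("*.txt")):
        with path.open(encoding="utf-8") as handle:
            # One word per line; blank lines are skipped.
            events[path.stem] = [line.strip() for line in handle if line.strip()]
    return events


# A temporary folder stands in for a real results folder.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "window_1.txt").write_text("goal\npenalty\n", encoding="utf-8")
    events = load_event_words(tmp)

print(events)
```

In practice, results_dir would be the folder created under results_folder_path for the given input file.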

Data cleaning

Data cleaning techniques can be customised to suit the target data set. The default flow, which was developed for a Twitter data set, is available in the method preprocessing_flow in data_preprocessor.py.
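
As an illustration of the kind of customisation meant here, the sketch below shows a standalone cleaning step for short social media posts. The function clean_post and its rules (dropping URLs, lowercasing, collapsing whitespace) are assumptions for illustration only; they are not the contents of preprocessing_flow, whose signature and steps are not described in this document.

```python
import re

URL_PATTERN = re.compile(r"https?://\S+")
WHITESPACE_PATTERN = re.compile(r"\s+")


def clean_post(text):
    """Hypothetical cleaning step: drop URLs, lowercase, collapse whitespace."""
    text = URL_PATTERN.sub(" ", text)
    text = text.lower()
    return WHITESPACE_PATTERN.sub(" ", text).strip()


cleaned = clean_post("Breaking:  GOAL for the home side https://example.com/clip")
print(cleaned)
```

A different data set, such as forum posts or news comments, would likely call for a different set of rules.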

Project configurations

General configuration details of the project, including word embedding configs, performance configs and file path configs, are available in project_config.py.

preprocess - preprocessing methods to use (provide as a list of method names)
currently supported preprocessing methods - 'rm-punct': remove punctuation, 'rm-stop_words': remove stop words
default - ['rm-punct', 'rm-stop_words']
aggregation_method - aggregation method to use
currently supported aggregation methods - 'max': maximum calculation, 'avg': average calculation
default - 'max'
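
The two options above can be pictured with a small sketch. The option names and values come from the list above; the functions apply_preprocessing and aggregate, the tiny stop word set, and the sample values are hypothetical, and the way Embed2Detect applies these options internally may differ.

```python
import string

# Option values as documented above; in the project they are set in project_config.py.
preprocess = ['rm-punct', 'rm-stop_words']
aggregation_method = 'max'

# Tiny stand-in stop word list, for illustration only.
STOP_WORDS = {"the", "a", "is", "in"}


def apply_preprocessing(tokens, methods):
    """Apply the selected preprocessing methods to a list of tokens."""
    if 'rm-punct' in methods:
        # Strip punctuation from token edges and drop tokens left empty.
        tokens = [t.strip(string.punctuation) for t in tokens]
        tokens = [t for t in tokens if t]
    if 'rm-stop_words' in methods:
        tokens = [t for t in tokens if t.lower() not in STOP_WORDS]
    return tokens


def aggregate(values, method):
    """Combine a list of numbers using the selected aggregation method."""
    if method == 'max':
        return max(values)
    if method == 'avg':
        return sum(values) / len(values)
    raise ValueError(f"unsupported aggregation method: {method}")


tokens = apply_preprocessing("The goal , is in !".split(), preprocess)
score = aggregate([0.25, 0.5, 0.75], aggregation_method)
print(tokens, score)
```

Switching aggregation_method to 'avg' would return the mean of the same values instead of the largest one.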

Reference

@article{hettiarachchi2021embed2detect,
  title={{E}mbed2{D}etect: temporally clustered embedded words for event detection in social media},
  author={Hettiarachchi, Hansi and Adedoyin-Olowe, Mariam and Bhogal, Jagdev and Gaber, Mohamed Medhat},
  journal={Machine Learning},
  volume={111},
  pages={49--87},
  year={2022},
  publisher={Springer},
  doi = {10.1007/s10994-021-05988-7},
  url = "https://doi.org/10.1007/s10994-021-05988-7",
}

