Dataset and project ideas for NTDS'19

A word on data preparation

For all datasets, some degree of pre-processing is needed to obtain usable data. This is inherent to the work of a data scientist and will mostly involve the following steps (a minimal Python sketch follows the list).

  1. Data acquisition. Raw data may need to be (i) collected via a web API, (ii) collected by scraping a website, or (iii) already collected and packaged (for example in a CSV file).
  2. Data importation. Raw data needs to be imported into a pandas DataFrame, where rows will be the nodes of the network and columns will be features (or characteristics, or attributes) of those nodes.
  3. Data cleaning. This step varies a lot from dataset to dataset. In the context of the course, it will mostly consist of reducing the size of the network so that it can be processed in a reasonable amount of time. The less packaged the raw data, the more cleaning will be necessary.
  4. Network creation. The network structure may either be (i) given to you as an adjacency matrix, (ii) given to you as an edge list, or (iii) inferred from raw data.
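
As a minimal sketch of steps 2-4, the snippet below loads a packaged CSV into pandas, cleans it, and infers a network from feature similarity. The file name, the subset size, and the kernel bandwidth are placeholders, not part of any proposed dataset.

```python
# Minimal pipeline sketch (steps 2-4); 'data.csv' is a placeholder.
import numpy as np
import pandas as pd
from scipy.spatial.distance import pdist, squareform

# 2. Importation: one row per node, one column per feature.
nodes = pd.read_csv('data.csv')

# 3. Cleaning: drop incomplete rows and keep a tractable subset.
nodes = nodes.dropna().head(1000)

# 4. Network creation: Gaussian kernel on pairwise distances
#    between feature vectors, giving a dense weighted adjacency.
features = nodes.select_dtypes(include=[np.number]).to_numpy()
distances = squareform(pdist(features))
adjacency = np.exp(-(distances / distances.mean()) ** 2)
np.fill_diagonal(adjacency, 0)  # no self-loops
```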

While those steps are necessary for every dataset, the amount of pre-processing will vary. If a dataset requires raw data to be collected or the network to be created, that is indicated in the dataset description. If you are a beginner and do not feel confident, choose a dataset that requires little pre-processing. If you feel confident, be adventurous! Keep in mind that the amount of work generally trades off against flexibility: well-packaged data usually serves a single task and may restrict your creativity.

Selecting a dataset

We encourage teams to choose a dataset and a problem that inspire them. While we provide a list of datasets, we encourage motivated teams to look for their own data. That is especially relevant for PhD students who want to apply the knowledge acquired in this course to their research problems.

Where to find (network) datasets?

There are plenty of sources to search for data. Here are some links:

What are the requirements for the dataset?

You need to choose a dataset that can be represented by a network. The network structure can either be given or can be constructed from features. To give you an idea of network representations, go through the list of proposed datasets. Please discuss with the TAs during lab sessions for guidance.

How to determine the task to be addressed?

Keep in mind that creating a network and stating some statistical facts about it is necessary but not sufficient. You should determine a task, usually based on the inference of latent information. You can address it with network science tools, such as community detection, or with machine learning tools, such as classification or semi-supervised learning. Check beforehand whether you can justify your results with a ground truth or, alternatively, with a qualitative or visual assessment. The tasks that can be addressed depend on the information provided by the dataset, such as features, labels, and signals. Some datasets, like the ones on Kaggle, were created to address one such task.

Proposed datasets

Look also at the list of past NTDS projects.

Free Music Archive

By Michaël

The Free Music Archive (FMA) is a free and open library directed by WFMU, the longest-running freeform radio station in the United States. Inspired by Creative Commons and the open-source software movement, the FMA provides a platform for curators, artists, and listeners to harness the potential of music sharing. The website provides a large catalog of artists and tracks, hand-picked by established audio curators. Each track is legally free to download as artists decided to release their works under permissive licenses.

The goal of this project is to analyze the content created by an open community. Through the lens of networks, it is possible to examine how music tracks relate to each other. Does genre structure the music? Do we observe groups of similar artists? Can we build a playlist that links two tracks through a smooth transition? To answer those questions, we will build a similarity graph between music tracks.

The packaged data consists of music tracks with associated metadata. Metadata exists at the level of tracks (e.g., title, creation date), albums (e.g., track number), and artists (e.g., artist name). A full-length audio recording is available for each track. Additionally, a fixed-length feature vector (audio descriptors computed from the waveform) describes the track. Relations between tracks can be studied by constructing a similarity graph. The similarity graph can be computed from audio features and metadata (at the level of tracks, albums, or artists). A subset of 2,000 tracks can be built by choosing tracks that belong to the Hip-Hop and Rock genres of the "FMA small" dataset.
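
As an illustration, a sparse k-nearest-neighbor similarity graph can be built from the pre-extracted features, for example with scikit-learn. The file name below is a placeholder, and the number of neighbors is a design choice.

```python
# Hypothetical sketch: k-NN graph over the 518 audio features.
import numpy as np
from sklearn.neighbors import kneighbors_graph

X = np.load('fma_features.npy')  # placeholder: (n_tracks, 518) array
# Connect each track to its 10 nearest neighbors (edge weight = distance).
adjacency = kneighbors_graph(X, n_neighbors=10, mode='distance')
# Symmetrize, as k-NN relations are not mutual by construction; a
# Gaussian kernel could then turn distances into similarities.
adjacency = 0.5 * (adjacency + adjacency.T)
```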

It is also feasible to work with other graphs, such as relations between artists or listeners. The latter requires scraping the website.

          Description                                   Amount
nodes     audio tracks                                  25,000
edges     similarity between tracks                     N/A
features  audio features pre-extracted from waveform    518
labels    musical genre (e.g., Rock, Pop)               16
  • Data acquisition: already collected and packaged
  • Requires down-sampling: up to the students
  • Network creation: needs to be built from features

Resources:

Projects from NTDS'18:

US Senators

By Michaël

The goal of this project is to understand the community structure and voting pattern of US senators. Will we observe a divide between democrats and republicans? Are there sub-communities or moderate members? Can we predict what a senator will vote knowing what some senators voted? To answer those questions, we will build a similarity graph between senators.

For each senator, we know (i) their political party (i.e., republican, democrat, independent), (ii) their voting activity (i.e., what they voted on a list of issues), and (iii) the groups they belong to (e.g., the bills they sponsored, the committees they are members of). We propose to first build the similarity graph from the votes, i.e., to measure the distance between voting vectors. Going further, similarity might be measured from the other features, such as bill sponsoring or committee membership.
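
As a sketch of the vote-based construction, one can compute cosine similarities between voting vectors and sparsify by thresholding. The file name, the vote encoding, and the threshold below are assumptions.

```python
# Hypothetical sketch: senator similarity from votes, where `votes`
# is a (senators x issues) matrix with +1 (yea), -1 (nay), 0 (absent).
import numpy as np

votes = np.load('votes.npy')  # placeholder file
norms = np.linalg.norm(votes, axis=1, keepdims=True)
norms[norms == 0] = 1  # guard against senators with no recorded vote
cosine = (votes / norms) @ (votes / norms).T
adjacency = np.where(cosine > 0.5, cosine, 0)  # sparsify by threshold
np.fill_diagonal(adjacency, 0)
```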

          Description                                               Amount
nodes     US senators                                               ~100
edges     similarity between senators                               N/A
features  votes, bill sponsoring, committee membership, statements  N/A
labels    political party: democrat, republican, independent        3
  • Data acquisition: needs to be collected from a web API
  • Requires down-sampling: no
  • Network creation: needs to be built from features

Resources:

Projects from NTDS'18:

Wikipedia

By Benjamin and Nicolas

Wikipedia is an enormous source of information visited daily by millions of users. We can understand human behavior by looking at how it is built and accessed. In this project you will investigate the structure of Wikipedia and learn more about our use, as humans, of the largest encyclopedia ever. Using the Wikipedia API, a network can be built where pages are nodes and hyperlinks are edges. The students exploring this dataset can focus on one (or several) particular subset(s) of pages. (The entire Wikipedia is too large for an NTDS project and cannot be retrieved from the API.)

In addition, numerous attributes of the pages can be retrieved through the API. For example, similarly to the Cora dataset, one can build from the text of a page a feature vector related to the keywords it contains. The category of a page can serve as a label. Furthermore, the number of visits per page per day can be obtained from the Pageview API, which gives students the possibility of analyzing the popularity of articles over time. For example, one can check whether the popularity of a page influences the popularity of neighboring pages, or whether the popularity of a group of pages is related to a particular event. It is also possible to study and compare pages, networks, and features in different languages.

This dataset is rich, be creative! A reduced dataset, extracted from the API, will be collected by the students. We suggest building a graph with 100 to 1,000 nodes.
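
For instance, hyperlinks can be collected through the public MediaWiki API, as in the sketch below. Pagination via the `plcontinue` token is omitted, so at most one batch of links is returned per page.

```python
# Collect the pages linked from a given Wikipedia article.
import requests

API = 'https://en.wikipedia.org/w/api.php'

def get_links(title):
    params = {'action': 'query', 'titles': title, 'prop': 'links',
              'pllimit': 'max', 'format': 'json'}
    pages = requests.get(API, params=params).json()['query']['pages']
    return [link['title'] for page in pages.values()
            for link in page.get('links', [])]

# Edges from a seed page; iterate over collected pages to grow the graph.
edges = [('Graph theory', target) for target in get_links('Graph theory')]
```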

          Description                                     Amount
nodes     Wikipedia pages                                 100 to 1,000
edges     hyperlinks                                      N/A
features  number of visits per day, keywords in the text  >10
labels    category                                        3-5
  • Data acquisition: to be collected from the Wikipedia API
  • Requires down-sampling: yes, during collection
  • Network creation: given

Resources:

Projects from NTDS'18:

Researchers on Twitter

By Ersi

The goal of this project is to analyze the interactions of a sub-network of Twitter. The Twitter accounts in this project are Computer Science researchers or other accounts "related" to Computer Science. The dataset consists of the activity of 52,678 Twitter users. Of those, 170 are the Twitter screen names of 98 Computer Science conferences. These 170 accounts were set as seeds, and further Twitter nodes were collected if they (i) follow a seed, (ii) are followed by a seed, or (iii) have retweeted one of a seed's tweets. Out of the 52,678 Twitter users, 9,191 are verified to be researchers (they have been matched to a DBLP author profile). 22 features are given for each Twitter user, including, for instance, a boolean feature indicating whether the bio description contains words that are typically used to describe researchers. You can find a detailed description of the data in this paper. Keep in mind that the labels for this project are noisy, as they are the result of a classification.

          Description                                                  Amount
nodes     Twitter accounts                                             52,678
edges     similarity (or connection) between Twitter accounts          N/A
features  numerical and boolean features describing Twitter accounts   22
labels    researcher or non-researcher (noisy label)                   2
  • Data acquisition: already collected and packaged (load tsv files). Some data cleaning will be needed.
  • Requires down-sampling: up to the students
  • Network creation: If you choose to create a similarity graph between the Twitter users, it needs to be built from features. If you choose to build a graph from the connections between users, you must collect information using the Twitter API (with tweepy), as sketched below. As this can be time consuming, we recommend building a feature graph first and exploring the connections graph later.
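
A sketch of collecting the connections graph, assuming tweepy 3.x and valid API credentials; method names differ in newer tweepy releases, and the seed handle is only illustrative.

```python
import tweepy

auth = tweepy.OAuthHandler('CONSUMER_KEY', 'CONSUMER_SECRET')
auth.set_access_token('ACCESS_TOKEN', 'ACCESS_SECRET')
api = tweepy.API(auth, wait_on_rate_limit=True)

# Followers of one seed; loop over the accounts of interest to
# accumulate directed "follows" edges.
follower_ids = api.followers_ids(screen_name='icmlconf')
edges = [(follower, 'icmlconf') for follower in follower_ids]
```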

Resources:

Scientific Co-authorship

By Ersi

This project aims to explore the co-authorship behaviour of scientific authors. The dataset is a co-authorship graph built from the DBLP digital library. Each vertex represents an author who has published at least one paper in one of the major conferences and journals of the Data Mining and Database communities between January 1990 and February 2011. Each edge links two authors who have co-authored at least one paper (no matter the conference or journal). The vertex properties are the number of publications in each of the 29 selected conferences or journals, plus 9 topological properties (degree centrality, closeness centrality, betweenness centrality, eigenvector centrality, PageRank, clustering coefficient, size of the maximal quasi-clique, number of quasi-cliques, and size of the community).
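
Since the graph is given, a minimal sketch is to load the edge list with networkx and extract a manageable subgraph. The file name is a placeholder for the files you will receive.

```python
import networkx as nx

graph = nx.read_edgelist('dblp_coauthorship.txt')
print(graph.number_of_nodes(), graph.number_of_edges())

# One down-sampling option: keep the largest connected component.
largest = max(nx.connected_components(graph), key=len)
subgraph = graph.subgraph(largest).copy()
```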

          Description                                                      Amount
nodes     scientific authors                                               42,252
edges     authors are linked if they wrote a paper together                210,320
features  number of publications in conferences and topological features   38
labels    N/A                                                              N/A
  • Data acquisition: already collected and packaged (load txt files)
  • Requires down-sampling: up to the students
  • Network creation: The connections between the nodes are already provided.

Remark 1: There are no labels given for the nodes in this dataset. You can choose as labels two subsets of the conferences (e.g., KDD and AAAI).

Remark 2: The creator of this dataset has agreed to provide the data for this project. However, it will be given to you through a private channel and you should not redistribute it.

Spammers on Social Networks

By Eda

The dataset contains 5.6 million users, described by 4 features: "user id", "gender", "age group", and "spammer label". There are 7 different types of relations between the users, indicating the action between them (such as profile view, message, poke, etc.), which may lead to 7 different directed graphs. In total, there are 858 million links between the users. The original task associated with this dataset is to identify the spammers based on their features and links in the network.

Since this dataset is very large, it requires subsampling even while loading the data, and cleaning the network accordingly. We warn the students that the size of the dataset adds difficulties (for example, you cannot create a 5.6-million-by-5.6-million adjacency matrix; see the sketch below). Be sure you can build a smaller, meaningful subset if you choose this project.
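
A sketch of one way to do this: stream the edge list in chunks with pandas, keep a subset, and store the result as a sparse matrix. The file name, column layout, and filtering rule are assumptions.

```python
import numpy as np
import pandas as pd
import scipy.sparse as sp

cols = ['src', 'dst', 'rel_type', 'timestamp']  # assumed layout
kept = []
for chunk in pd.read_csv('relations.tsv', sep='\t', names=cols,
                         chunksize=10_000_000):
    # Keep a single relation type and a subset of user ids.
    mask = ((chunk['rel_type'] == 1)
            & (chunk['src'] < 100_000) & (chunk['dst'] < 100_000))
    kept.append(chunk[mask])
edges = pd.concat(kept)

# Sparse adjacency of the subsampled directed graph.
n = 100_000
adjacency = sp.coo_matrix((np.ones(len(edges)),
                           (edges['src'], edges['dst'])), shape=(n, n))
```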

          Description                                                     Amount
nodes     users of tagged.com                                             5,607,447
edges     action between users: includes a timestamp and one of 7 types   858,247,099
features  sex, timePassedValidation, ageGroup                             3
labels    spammer or not spammer                                          2
  • Data acquisition: already collected and packaged
  • Requires down-sampling: yes
  • Network creation: network is given as a list of edges

Resources:

Projects from NTDS'18:

Terrorist Attacks and Relations

By Eda

The first dataset consists of 1,293 terrorist attacks (nodes), each of which is assigned one of 6 labels indicating the type of the attack. Each attack is described by a 0/1-valued vector of attributes whose entries indicate the absence or presence of a feature. There are a total of 106 distinct features. The files in the dataset can be used to create two distinct graphs. In the first, edges connect co-located attacks. In the second, edges connect co-located attacks performed by the same terrorist organization.

The second dataset is designed for the classification of the relationships between the terrorists. The dataset contains 851 relations (i.e., the nodes of the graph). Each node is assigned at least one label (multiple labels are possible) among 4 labels ("Colleague", "Congregate", "Contact", "Family") and is described by a 0/1-valued feature vector indicating the absence or presence of attributes, of which there are 1,224 in total. There are 8,592 edges in the graph, which connect the nodes involving the same terrorist group. As the goal is the classification of links, we will here build the line graph of the social network between terrorists: instead of having terrorists as nodes and relationships between them as edges, relationships will be nodes and terrorists will be edges.
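
The line-graph construction itself is readily available in networkx, sketched here on a toy social network (the names are made up).

```python
import networkx as nx

# Toy social network: terrorists as nodes, relationships as edges.
social = nx.Graph()
social.add_edges_from([('A', 'B'), ('B', 'C'), ('A', 'C')])

# In the line graph, each relationship becomes a node, and two
# relationships are linked when they share a terrorist.
relations = nx.line_graph(social)
print(relations.nodes())  # three relationship nodes: the edges above
```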

We warn the students that the 106 features given in this dataset are undocumented, so they cannot be used to interpret the data or to do anything meaningful. The students who choose this project will have to augment the dataset by collecting data from other sources of information about terrorism.

          Description                                                                    Amount
nodes     relationships                                                                  851
edges     terrorists                                                                     8,592
features  0-1 vector of attribute values                                                 1,224
labels    type of relationship (non-exclusive): colleague, congregate, contact, family   4

          Description                                                      Amount
nodes     terrorist attacks                                                1,293
edges     connect co-located attacks                                       3,172
features  0-1 vector of attribute values                                   106
labels    kind of attack: arson, bombing, kidnapping, weapon, nbcr, other   6
  • Data acquisition: already collected and packaged
  • Requires down-sampling: no
  • Network creation: network is given as a list of edges

Resources:

Projects from NTDS'18:

IMDb Films and Crew

By Rodrigo, taken over by Nikolaos

The IMDb datasets contain information such as crew, rating, and genre for every entertainment product in its database. The Kaggle dataset linked above is a smaller but similar dataset, and could be used instead of the IMDb one, which is much larger. The goal of this project is to analyze this database in graph form and attempt to recover missing information from data on cast/crew co-appearance in movies. The analysis requires a network to be created, for which two paths are possible, according to which instances one wishes to consider as the nodes of the network.

The first approach could be to construct a social network of cast/crew members, where the edges are weighted according to co-appearance. For example, actor_1 becomes strongly connected to actor_2 if they have appeared in many movies together. The edges of the graph could be weighted by the number of entertainment products in which the two corresponding people participated together, as sketched below. We can take as a signal on this constructed graph the aggregate ratings of the movies each person has participated in.
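
A sketch of counting co-appearances via a sparse incidence matrix; the table and its column names are hypothetical.

```python
import numpy as np
import pandas as pd
import scipy.sparse as sp

credits = pd.read_csv('credits.csv')  # one row per (person, movie)
people = credits['person_id'].astype('category').cat.codes.to_numpy()
movies = credits['movie_id'].astype('category').cat.codes.to_numpy()

# Binary incidence matrix B (people x movies); B @ B.T counts, for
# each pair of people, the movies they appeared in together.
B = sp.coo_matrix((np.ones(len(credits)), (people, movies))).tocsr()
weights = (B @ B.T).tolil()
weights.setdiag(0)  # drop self-loops
```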

          Description                          Amount
nodes     cast/crew                            millions (IMDb), ~8,500 (Kaggle)
edges     co-appearance in movies/TV/etc.      O(10) per node
features  ratings of movies taken part in      O(10) per node
labels    movie genre                          unknown (IMDb), 3 (Kaggle)

A second approach could be to create a movie network, in which movies are strongly connected if they share many crew/cast members (or some other similarity measure combining this with genres, running times, release years, etc.). There are more options for the signal the students could consider on this graph, as they could use either the movie ratings or the genre labels.

          Description                                            Amount
nodes     movies                                                 millions (IMDb), ~5,000 (Kaggle)
edges     count of common cast/crew + other feature similarity   O(10) per node
features  average rating                                         1
labels    movie genre                                            unknown (IMDb), 3 (Kaggle)

For the extra work, there is plenty of additional information. For instance, the students could try to predict the revenue of movies, potentially by including extra metadata. Note, however, that the number of instances in the original dataset is of the order of millions, so a smaller subset should be used.

  • Data acquisition: already collected and packaged
  • Requires down-sampling: yes if using the original datasets from IMDb, no if using the subsampled dataset from Kaggle
  • Network creation: needs to be built from features

Resources:

Projects from NTDS'18:

Flight Routes

By Rodrigo, taken over by Guillermo

This OpenFlights/Airline Route Mapper route database contains 67,663 routes (edges) between 3,321 airports (nodes) on 548 airlines spanning the globe.

The construction of the graph is facilitated by the source and destination airports of each flight, which essentially give an edge list for the airport graph. Students could complement the graph construction by providing weights on the edges, proportional to the number of flights connecting the corresponding pair of airports. The graph can be visualized embedded on the globe using the supplementary data at https://openflights.org/data.html, which contains, among other things, the latitude and longitude of each airport. A potential goal of the extra work could then be to compare the embedding produced by the Laplacian eigenmaps algorithm seen in the course with the "natural" embedding given by the terrestrial coordinates. The most sensitive part of the project (which could be examined in the extra work) would be to find a label signal on the graph that could be recovered via a label propagation algorithm. In principle, the average number of stops of flights leaving each airport could fulfill this purpose, but it remains a matter of study whether a subsampled version of this information can be recovered from the topological information of the graph alone.
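
A sketch of the weighted construction; the column indices follow the OpenFlights routes.dat documentation, so verify them against your copy of the file.

```python
import pandas as pd
import networkx as nx

routes = pd.read_csv('routes.dat', header=None)
# Columns 2 and 4 hold the source and destination airport codes.
edges = routes[[2, 4]].rename(columns={2: 'src', 4: 'dst'})

# Edge weight = number of routes connecting two airports.
weights = edges.groupby(['src', 'dst']).size().reset_index(name='w')
graph = nx.from_pandas_edgelist(weights, 'src', 'dst', edge_attr='w')

# Spectral (Laplacian eigenmaps style) embedding, to be compared with
# the latitude/longitude embedding from the airports data.
embedding = nx.spectral_layout(graph, weight='w')
```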

          Description                                   Amount
nodes     airports                                      3,321
edges     count of flights connecting airports          67,663
features  average number of stops of outbound flights   1
labels    N/A                                           N/A
  • Data acquisition: already collected and packaged
  • Requires down-sampling: no
  • Network creation: network is essentially given (list of edges)

Resources:

Projects from NTDS'18:

Recipes 1M

By Nicolas

This database contains circa 1 million cooking recipes retrieved from several websites, along with one or more images for each recipe (13 million images available). Its original use is to train models to perform "im2recipe", i.e., to get recipe instructions from an image. For every recipe in this dataset, you can retrieve the full text of the instructions and the ingredients, contained in the file available under the "layers" link (after you unpack the file, in the "layer1.json" file). For each ingredient line inside a recipe, the name of the ingredient has been extracted ("ingredient detection" link). Ingredient name detection from text is a non-trivial task, so expect the results to be noisy. Quantity detection has not been performed, but might be an interesting feature to study (requires extra work). If you intend to work with images, they require more than 100 GB of download and a matching storage space.

You can, for instance, construct an ingredient graph, linking ingredients that appear together in a recipe, as sketched below.
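
A sketch of the co-occurrence construction; the JSON field names are assumptions about the layer1.json layout, and the full file is large, so consider one of the supplied subsets first.

```python
import json
from itertools import combinations
from collections import Counter

with open('layer1.json') as f:
    recipes = json.load(f)

counts = Counter()
for recipe in recipes:
    ingredients = {ing['text'] for ing in recipe['ingredients']}
    counts.update(combinations(sorted(ingredients), 2))

# Weighted edges: (ingredient_a, ingredient_b, co-occurrence count).
edges = [(a, b, w) for (a, b), w in counts.items()]
```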

          Description                        Amount
nodes     ingredients                        18,253
edges     connect co-appearing ingredients   O(10) per node
features  recipes                            1
labels    N/A                                N/A

Another approach could be to create a graph of recipes having several ingredients in common. Given the size of the dataset, it might be a good idea to start with a smaller subset, for instance by filtering the recipes that contain a particular ingredient (e.g., 'beef', 'vanilla', etc.). To help the students get started, two smaller subsets (with circa 10,000 and 23,000 recipes) are supplied.

          Description                               Amount
nodes     recipes                                   1M
edges     connect recipes with common ingredients   depends how the graph is built
features  images / ingredient quantities            1-10
labels    N/A                                       N/A
  • Data acquisition: already collected and packaged
  • Requires down-sampling: yes
  • Network creation: to be created from co-appearing ingredients or from recipes

Resources:

Genetics

By Benjamin

This dataset contains genes, protein expressions, and phenotypes of a "family" of mice. It is a subset of the open dataset available at http://www.genenetwork.org. The goal is to explore the data using graphs and discover connections between genes, protein expressions, and phenotypes. This data has been collected by different laboratories all over the world over several decades. EPFL's LISP is part of the group. More details about the project and dataset can be found in this EPFL mediacom article.

This dataset is a collection of matrices. Each column is a single BXD strain (a mouse) whose genome is a unique combination of the C57BL/6J and DBA/2J parental strains. They have been saved as CSV files (with a txt extension). The genotype_BXD.txt file is a binary matrix that describes the contribution of each of the parental strains for a list of selected genes (rows). Each row of the genotype data indicates whether a certain position in the genome is inherited from the C57BL/6J or DBA/2J parent. Phenotype data contained in Phenotype.txt are also matrices indicating the values of each phenotype for each strain. In addition, multiomic molecular phenotypes from different mouse organs are represented as one matrix per organ (e.g., brain, bone, muscle, liver, etc.), in the expression_data folder.

It is important to note that protein or phenotype information is not available for all the mice. Depending on the research team gathering the data and the protocols, only a subset of mice have been tested for each phenotype, or different organs have been analyzed. Hence, combining several matrices will result in missing entries. These missing entries, along with the variety of the data, are a real challenge for a data science approach, but they mimic the real-life situation encountered with human medical records.
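
One way to accommodate the missing entries when building a mouse graph is pandas' pairwise-complete correlation, sketched below; the file orientation and thresholds are assumptions.

```python
import numpy as np
import pandas as pd

# Assumed orientation: rows are phenotypes, columns are BXD strains.
pheno = pd.read_csv('Phenotype.txt')

# Correlation between strains over the phenotypes each pair shares;
# min_periods guards against estimates from too few shared values.
similarity = pheno.corr(min_periods=10)
adjacency = similarity.where(similarity > 0.7, 0.0)
np.fill_diagonal(adjacency.values, 0)  # no self-loops
```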

The dataset for the NTDS project can be found here. The students are expected to:

  • use the different matrices of data to build one or more graphs (for example a graph of mice),
  • explore these graphs,
  • associate values to the nodes of the graphs using the other matrices,
  • apply graph signal processing approaches,
  • discover new relations between the genome, protein expressions in tissues and phenotypes (optional but that would be great!).

          Description                                                    Amount
nodes     mice                                                           100-200
edges     similar genes, protein expressions, or phenotypes              O(10) per node
features  genes, protein expressions in tissues, or phenotypes           1000s
labels    depends: a particular gene, phenotype, or protein expression   N/A

Resources:

Movielens 100k

By Clément

Movielens is a personalized movie recommendation system. Several datasets have been built from this database, the smallest being Movielens 100k. It contains 100,000 ratings from 1,000 users on 1,700 movies. Various information is available about the users (e.g., age, gender, occupation, zip code) and the movies (e.g., release date, genre). Additional features can be retrieved since movie titles are available. Two graphs can be built out of this dataset, and they can be connected using the ratings.

The main purpose of this data is to build a recommender system, which can be formulated as a semi-supervised learning problem: given a user, can you predict the ratings they will give to a new movie? Graph neural networks can be used for this purpose, but other graph-based approaches can be explored as well.
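
As a starting point, the ratings can be loaded into a user-by-movie matrix whose missing entries are the values to predict; the u.data layout (tab-separated user, item, rating, timestamp) follows the dataset's README.

```python
import pandas as pd

names = ['user_id', 'movie_id', 'rating', 'timestamp']
ratings = pd.read_csv('u.data', sep='\t', names=names)

# User-by-movie rating matrix; NaN marks the ratings to predict.
R = ratings.pivot(index='user_id', columns='movie_id', values='rating')
print(R.shape)  # about (1000 users, 1700 movies)
```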

Users graph
          Description                         Amount
nodes     users                               1,000
edges     similar features                    depends how the graph is built
features  age, gender, occupation, zip code   4
labels    ratings of the movies               100k

Movies graph
          Description                         Amount
nodes     movies                              1,700
edges     similar features                    depends how the graph is built
features  name, release date, genre           2 + 19 genres
labels    ratings given by users              100k
  • Data acquisition: already collected and packaged
  • Requires down-sampling: no
  • Network creation: needs to be built from features

Resources:

Past projects

Projects from NTDS'18:

Projects from NTDS'17: