Learn more at stingergraph.com.
STINGER is a package designed to support streaming graph analytics, using in-memory parallel computation to accelerate analysis. It is composed of the core data structure, the STINGER server, algorithms, and an RPC server that can be used to run queries and serve visualizations. The directory structure of the components is as follows:
doc/ - Documentation
doxygen/ - Doxygen generated documentation
external/ - External dependencies packaged with STINGER
flask/ - Python Flask Relay Server for interacting with the JSON RPC server and STINGER server
html/ - Basic web pages that communicate with the JSON RPC server
lib/ - The STINGER library and dependencies
stinger_alg/ - Algorithm kernels that work on the STINGER data structure
stinger_core/ - The Core STINGER data structure
stinger_net/ - Libraries for communicating over unix sockets and/or TCP/IP using Protobufs
stinger_utils/ - Auxiliary functions over the data structure
src/ - STINGER ecosystem binaries
server/ - The STINGER server
clients/ - Clients that connect to the STINGER server
algorithms/ - Streaming Algorithm binaries
streams/ - Binaries for streaming in new data
tools/ - Auxiliary tools
py/stinger - Python bindings to STINGER
standalone/ - Standalone binaries that use the STINGER core data structure
templates/json - Common templates for stream ingest
tests/ - Tests for the STINGER data structure and algorithms
SOURCEME.sh - File to be executed from out-of-source build directory to link the html/ web pages with the STINGER server
STINGER is built using CMake. From the root of STINGER, first create a build directory:
mkdir build && cd build
. ../SOURCEME.sh
Then call CMake from that build directory to automatically configure the build and to create a Makefile:
ccmake ..
Change Release to Debug or RelWithDebInfo for a debugging build during development. Finally, call make to build all libraries and executable targets (or call make followed by the name of an executable or library to build only that target):
make -j8
Note: the -j flag runs the build in parallel. Typically you should match its argument to the number of cores on your system.
All binary targets will be built and placed in build/bin. They are named according to the folder from which they were built (so src/bin/server produces build/bin/stinger_server, src/bin/clients/tools/json_rpc_server produces build/bin/stinger_json_rpc_server, etc.). If you ran SOURCEME.sh from the build directory as instructed above, the build/bin directory is appended to your path.
As indicated by the directory structure, there are three primary types of executable targets: clients, server, and standalone, and subtypes in the case of clients.
The STINGER server maintains a STINGER graph in memory and can maintain multiple connections with client streams, algorithms, and monitors.
Client streams can send edges consisting of source, destination, weight, time, and type where some fields are optional and others can optionally be text strings.
Client algorithms will receive these batches of updates in a
synchronous manner, as well as shared-memory read-only access to the complete graph. The server provides the capability
for client algorithms to request a shared memory space to store results and communicate with other algorithms.
Client algorithms declare dependencies when they connect and receive the mapped data in the returned structure.
The server guarantees that all of an algorithm's dependencies will finish processing before that algorithm is executed.
Client algorithms are required to provide a description string that indicates what data is stored and the type of the data.
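The dependency guarantee described above amounts to running algorithms in a topological order of their declared dependencies. The following is a minimal sketch of that idea only, not the server's actual scheduler; the algorithm names and the `execution_order` helper are hypothetical.

```python
from graphlib import TopologicalSorter  # Python 3.9+

# Hypothetical algorithm names; each maps to the algorithms it depends on.
dependencies = {
    "static_components": [],
    "pagerank": [],
    "community_summary": ["static_components", "pagerank"],
}

def execution_order(deps):
    """Return an order in which every algorithm runs after its dependencies."""
    return list(TopologicalSorter(deps).static_order())

order = execution_order(dependencies)
print(order)
```

Any order returned places `community_summary` after both of its dependencies, which is the property the server guarantees for each batch.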
Client tools are intended to be read-only, but are notified of all running algorithms and shared data. An example of this kind of client is the JSON-RPC server (src/bin/clients/tools/json_rpc_server). This server provides access to shared algorithm data via JSON-RPC calls over HTTP. Additionally, some primitive operations are provided to support selecting the top vertices as scored by a particular algorithm or obtaining the shortest paths between two vertices, for example.
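As a rough illustration of what a JSON-RPC call over HTTP looks like from a client, here is a minimal sketch using only the Python standard library. The request envelope follows the JSON-RPC 2.0 convention; the URL path and the method name in the commented-out usage line are assumptions for illustration and should be checked against the json_rpc_server source before use.

```python
import json
import urllib.request

def build_request(method, params=None, request_id=1):
    """Build a JSON-RPC 2.0 request body as UTF-8 bytes."""
    payload = {"jsonrpc": "2.0", "method": method, "id": request_id}
    if params is not None:
        payload["params"] = params
    return json.dumps(payload).encode("utf-8")

def call(url, method, params=None, timeout=5):
    """POST a JSON-RPC request and return the decoded JSON response."""
    req = urllib.request.Request(
        url,
        data=build_request(method, params),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))

# Requires a running stinger_json_rpc_server.
# The "/jsonrpc" path and the "get_algorithms" method name are assumed, not confirmed:
# print(call("http://localhost:8088/jsonrpc", "get_algorithms"))
```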
Standalone executables are generally self-contained and use the STINGER libraries for the graph structure and supporting functions. Most of the existing standalone executables demonstrate a single streaming or static algorithm on a synthetic R-MAT graph and edge stream.
To run an example using the server and five terminals:
term1:build$ env STINGER_MAX_MEMSIZE=1G ./bin/stinger_server
term2:build$ ./bin/stinger_json_rpc_server
term3:build$ ./bin/stinger_static_components
term4:build$ ./bin/stinger_pagerank
term5:build$ ./bin/stinger_rmat_edge_generator -n 100000 -x 10000
This will start a stream of R-MAT edges over 100,000 vertices in batches of 10,000 edges. A connected component labeling and PageRank scoring will be maintained. The JSON-RPC server will host interactive web pages at http://localhost:8088/full.html that are powered by the live streaming analysis. The total memory usage of the dynamic graph is limited to 1 GiB.
Given a stream of Tweets in Twitter's default format (a stream of JSON objects, one per line), it is fairly easy to pipe the user mentions / retweets graph into STINGER using the json_stream. The json_stream is a templated JSON stream parser designed to consume one object per line like the Twitter stream and to produce edges from this stream based on a template.
The templates can use the following variables (where one of the two source and one of the two destination variables must be used):
$source_str - The source vertex name.
$source - The source of the edge as a number (must be able to parse as an integer
less than the maximum vertex ID in the STINGER server).
$source_type - A string representing the type of the source vertex.
$source_weight - A number to be added to the weight of the source vertex (vertex weights
start at zero).
$destination_str - The destination vertex name
$destination - The destination of the edge as a number (must be able to parse as an
integer less than the maximum vertex ID in the STINGER server).
$destination_type - A string representing the type of the destination vertex
$destination_weight - A number to be added to the weight of the destination vertex (vertex
weights start at zero).
$type_str - The edge type as a string
$weight - The weight of the edge (must be able to parse as an integer).
$time - The time of the edge (must be able to parse as an integer).
$time_ttr - Must be a string of either the form "Mon Sep 24 03:35:21 +0000 2012" or
"Sun, 28 Oct 2012 17:32:08 +0000". These will be converted internally
into integers of the form YYYYMMDDHHMMSS. Note that this does not currently support
setting a constant value.
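To make the timestamp conversion concrete, the sketch below parses the two accepted string forms and produces an integer of the form YYYYMMDDHHMMSS. It is an illustration in Python, not the parser's own code; the helper name `to_stinger_time` is hypothetical, and the sketch assumes the date fields are used as written, with no time zone adjustment.

```python
from datetime import datetime

# The two string forms accepted by the template variable described above.
FORMATS = (
    "%a %b %d %H:%M:%S %z %Y",   # "Mon Sep 24 03:35:21 +0000 2012"
    "%a, %d %b %Y %H:%M:%S %z",  # "Sun, 28 Oct 2012 17:32:08 +0000"
)

def to_stinger_time(text):
    """Convert a timestamp string to an integer of the form YYYYMMDDHHMMSS."""
    for fmt in FORMATS:
        try:
            parsed = datetime.strptime(text, fmt)
        except ValueError:
            continue  # try the next accepted form
        return int(parsed.strftime("%Y%m%d%H%M%S"))
    raise ValueError(f"unrecognized timestamp: {text!r}")

print(to_stinger_time("Mon Sep 24 03:35:21 +0000 2012"))   # 20120924033521
print(to_stinger_time("Sun, 28 Oct 2012 17:32:08 +0000"))  # 20121028173208
```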
For example, the simplest template for Twitter mentions and retweets would be (we'll call this template.json):
{
"user": {
"screen_name": "$source_str1"
},
"entities": {
"user_mentions": [
{
"screen_name": "$destination_str1"
}
]
},
"this_doesnt_matter": "$source_type=user",
"same_here": "$destination_type=user",
"and_here": "$type=mention"
}
To parse a Twitter stream into STINGER using this template:
cat twitter_sample.json | ./bin/stinger_json_stream template.json
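To show how a template of this shape can map one JSON object to edges, here is a small Python sketch that walks the template and a tweet in parallel and collects the variables it finds. This is an illustration of the idea, not the json_stream implementation; the helper names are hypothetical, and emitting one edge per mentioned user is an assumption.

```python
from itertools import product

# The template.json shown above, written as a Python dict.
TEMPLATE = {
    "user": {"screen_name": "$source_str1"},
    "entities": {"user_mentions": [{"screen_name": "$destination_str1"}]},
    "this_doesnt_matter": "$source_type=user",
    "same_here": "$destination_type=user",
    "and_here": "$type=mention",
}

def extract_fields(template, record, out=None):
    """Collect template variables and the values found at the same path in record."""
    if out is None:
        out = {}
    if isinstance(template, dict):
        for key, sub in template.items():
            if isinstance(sub, str) and sub.startswith("$") and "=" in sub:
                # Constant assignment such as "$source_type=user"; the key is ignored.
                name, value = sub[1:].split("=", 1)
                out[name] = [value]
            elif isinstance(record, dict) and key in record:
                extract_fields(sub, record[key], out)
    elif isinstance(template, list):
        # A one-element template list is applied to every element of the record list.
        if template and isinstance(record, list):
            for item in record:
                extract_fields(template[0], item, out)
    elif isinstance(template, str) and template.startswith("$"):
        out.setdefault(template[1:], []).append(record)
    return out

def mention_edges(fields):
    """Pair every source name with every destination name."""
    return list(product(fields.get("source_str1", []),
                        fields.get("destination_str1", [])))

tweet = {
    "user": {"screen_name": "alice"},
    "entities": {"user_mentions": [{"screen_name": "bob"}, {"screen_name": "carol"}]},
}
fields = extract_fields(TEMPLATE, tweet)
print(mention_edges(fields))  # [('alice', 'bob'), ('alice', 'carol')]
```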
You can replace the 'cat twitter_sample.json' command with one of the curl commands from the Twitter developer API page to inject a live Twitter stream directly (go to dev.twitter.com to get your own OAuth credentials):
curl --request 'POST' 'https://stream.twitter.com/1.1/statuses/sample.json' --header \
'Authorization: OAuth oauth_consumer_key="KEYKEYKEY", oauth_nonce="NONCENONCENONCE", oauth_signature="SIGSIGSIG", oauth_signature_method="HMAC-SHA1", oauth_timestamp="ts", oauth_token="TOKENTOKENTOKEN", oauth_version="1.0"' \
--verbose | ./bin/stinger_json_stream template.json
The csv_stream parser follows a similar templated format to the json parser, so parsing edges out of a file might look like:
id,email_a,config_a,email_b,config_b,unix_time,length
na,$source_str1,na,$destination_str1,na,$time1,$weight1, $source_type1=email, $destination_type1=email
This file would create edges between email addresses using the length field as the weight and the Unix timestamp field as the time. To use this template, pipe the file or stream into the parser and pass the template as a parameter like so:
cat emails.csv | ./bin/stinger_csv_stream template.csv
Please be aware that the CSV parser and the underlying code to parse CSV files do not currently trim whitespace and do not treat quoted strings of any kind as quoted.
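The practical effect of these two limitations can be seen with a plain comma split, which is used here as a stand-in for the parser's behavior (an assumption, not the parser's code). The `normalize` helper is a hypothetical pre-processing step you could run before piping data into the parser.

```python
import csv
import io

def naive_split(line):
    """Split on every comma: no whitespace trimming, no quote handling."""
    return line.rstrip("\n").split(",")

def normalize(line):
    """Parse a line with full CSV rules, trim each field, and re-join with commas.

    Fields that themselves contain a comma cannot be represented without
    quoting, so they are rejected.
    """
    row = next(csv.reader(io.StringIO(line), skipinitialspace=True))
    fields = [field.strip() for field in row]
    if any("," in field for field in fields):
        raise ValueError(f"field contains a comma: {line!r}")
    return ",".join(fields)

print(naive_split("a@x.org, b@x.org"))     # ['a@x.org', ' b@x.org']
print(naive_split('"Doe, Jane",b@x.org'))  # ['"Doe', ' Jane"', 'b@x.org']
print(normalize("a@x.org, b@x.org"))       # a@x.org,b@x.org
```

The second field in the first example keeps its leading space, and the quoted name in the second example is split in two, so input should be cleaned before it reaches the stream.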
To create a toy R-MAT graph (256K vertices and 2M undirected edges) and run the insert-remove benchmark:
term1:build$ stinger_rmat_graph_generator -s 18 -e 8 -n 100000
term1:build$ stinger_insert_remove_benchmark -n 1 -b 100000 g.18.8.bin a.18.8.100000.bin
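The stated graph size follows from the generator arguments if -s is read as the log2 of the vertex count and -e as the average number of edges per vertex. That reading is inferred from the numbers above rather than taken from the tool's documentation, so treat this arithmetic as a sketch.

```python
def rmat_size(scale, edge_factor):
    """Return (vertices, edges) for an R-MAT graph with 2**scale vertices."""
    vertices = 1 << scale
    edges = edge_factor * vertices
    return vertices, edges

vertices, edges = rmat_size(18, 8)
print(vertices, edges)  # 262144 2097152
```

That is 256K vertices (256 * 1024) and 2M edges (2 * 1024 * 1024), matching the description.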
STINGER allocates and manages its own memory. When STINGER starts, it allocates one large block of memory (enough to hold its maximum size), and then manages its own memory allocation from that pool. The server version of STINGER does this in shared memory so that multiple processes can see the graph. Unfortunately, the error handling for memory allocations is not particularly user-friendly at the moment. Changing the way that this works is on the issues list (see https://github.com/robmccoll/stinger/issues/8).
- "Bus error" when running the server: The size of STINGER that the server is trying to allocate is too large for your memory. First try to increase the size of your /dev/shm (see below). If this does not work, reduce the size of your STINGER and recompile.
- "XXX: eb pool exhausted" or "STINGER has run out of internal storage space" when running the server, standalone executables, or anything else using stinger_core: you have run out of internal edge storage. Increase the size of STINGER and recompile.
Compile-time definitions can be adjusted in ccmake to increase or decrease memory consumption. Note that adjusting these values will require you to rebuild STINGER.
- STINGER_MAX_LVERTICES: maximum number of vertices to support, often written as a power of two: 1L<<22
- STINGER_EDGEBLOCKSIZE: number of edges per block, default: 14
- STINGER_NUMETYPES: maximum number of edge types to support, default: 5
- STINGER_NUMVTYPES: maximum number of vertex types to support, default: 128
STINGER_EDGEBLOCKSIZE determines how many edges are in each edge block (there are 4 * STINGER_MAX_LVERTICES edge blocks, so 4 * STINGER_MAX_LVERTICES * STINGER_EDGEBLOCKSIZE is the maximum number of directed edges).
The definitions above are listed in order of how much of an effect they have on the size of STINGER.
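As a worked example of the edge capacity formula, the sketch below evaluates 4 * STINGER_MAX_LVERTICES * STINGER_EDGEBLOCKSIZE for the example vertex count 1L<<22 and the default block size of 14. The function name is hypothetical, and the vertex count is only the example value mentioned above, not necessarily your configured value.

```python
def max_directed_edges(max_lvertices, edge_block_size):
    """Upper bound on directed edges: 4 * vertices edge blocks, each holding edge_block_size edges."""
    edge_blocks = 4 * max_lvertices
    return edge_blocks * edge_block_size

print(max_directed_edges(1 << 22, 14))  # 234881024
```

Doubling STINGER_MAX_LVERTICES doubles this bound, which is why it appears first in the list.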
/dev/shm allows for memory-mapped files in the shared memory space. By default the size of this filesystem is set to be 1/2 the total memory in the system. STINGER by default attempts to allocate 3/4 of the total memory (unless the STINGER_MAX_MEMSIZE environment variable is used). This will commonly cause a 'Bus Error' the first time the STINGER server is run on the system. There are two solutions to this problem:
- Use STINGER_MAX_MEMSIZE and provide a value no greater than 1/2 the total system memory.
- Resize the /dev/shm filesystem to match the total system memory size.
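For the first solution, a value for STINGER_MAX_MEMSIZE can be derived from the total system memory. The sketch below rounds half of total memory down to whole gibibytes and formats it like the 1G value used earlier; the helper names are hypothetical, and `total_memory_bytes` relies on sysconf names that are available on Linux.

```python
import os

GIB = 1 << 30

def total_memory_bytes():
    """Total physical memory in bytes (Linux sysconf names)."""
    return os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")

def suggest_max_memsize(total_bytes):
    """Return a value of the form '<n>G' no greater than half of total memory."""
    gib = (total_bytes // 2) // GIB
    if gib < 1:
        raise ValueError("half of total memory is below 1 GiB")
    return f"{gib}G"

print(suggest_max_memsize(32 * GIB))  # 16G
# On a Linux machine: print(suggest_max_memsize(total_memory_bytes()))
```

The result can then be passed as in the earlier example: env STINGER_MAX_MEMSIZE=16G ./bin/stinger_server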
To apply the second solution, edit /etc/fstab to include the following line:
tmpfs /run/shm tmpfs defaults,noexec,nosuid,size=32G 0 0
or, if your system mounts shared memory at /dev/shm instead:
tmpfs /dev/shm tmpfs defaults,noexec,nosuid,size=32G 0 0
Replace 32G with the size of your system's main memory.
Once you have added this line, either restart the machine or run the following command (using the mount point from the line you added):
sudo mount -o remount /run/shm
First check the STINGER GitHub page to verify that the build is passing (icon immediately under the title). If it is not passing, the issue resides within the current version itself. Please check out a previous revision; the build should be fixed shortly, as failing builds send notifications to the authors.
Build problems after pulling updates are frequently the result of changes to the Protocol Buffer formats used in STINGER. These are currently unavoidable as an unfortunate side effect of how we distribute PB and tie it into CMake. To fix, remove your build directory and build STINGER from scratch.
Additionally, this version of the STINGER tool suite is tested almost exclusively on Linux machines running later versions of Ubuntu and Fedora. While we would like to have multi-platform compatibility with Mac (via "real" GCC) and Windows (via GCC on cygwin), these are lower priority for our team - unless a project sponsor requires it :-)