The Web Conference 2021 (WWW '21), April 19--23, 2021, Ljubljana, Slovenia
Lin Zhao, Sourav Sen Gupta, Arijit Khan, Robby Luo
Nanyang Technological University, Singapore
With over 42 billion USD market capitalization (October 2020), Ethereum is the largest public blockchain that supports smart contracts. Recent works have modeled transactions, tokens, and other interactions in the Ethereum blockchain as static graphs to provide new observations and insights by conducting relevant graph analysis. Surprisingly, there is much less study on the evolution and temporal properties of these networks. In this paper, we investigate the evolutionary nature of Ethereum interaction networks from a temporal graphs perspective. We study the growth rate and growth model of four Ethereum blockchain networks, and the active lifespan and update rate of high-degree vertices. We detect anomalies based on temporal changes in global network properties, and forecast the survival of network communities in succeeding months by leveraging relevant graph features and machine learning models.
Due to size limitations, instead of uploading the dataset, we describe the extraction method used to obtain the data. We also provide, in each folder, a sample arc list and the corresponding address hash table, split by year and by month (the latter only for ContractNet).
- Apply for and log in to a Google Cloud Platform account.
- Create a bucket to store your files.
- Go to BigQuery and find the dataset 'ethereum_blockchain'.
- Select the table with the desired timestamp and choose 'Export to GCS'.
- Then select the GCS location (the bucket created in step 2).
- If CSV is preferred: //file*.csv (e.g., tmpbucket/blocks/blocks*.csv).
- The * numbers the files, since exporting a table splits the data into multiple files.
Replace .csv with .txt or .json as per your preference. Install gsutil via pip (`pip install gsutil`).

To download an entire folder: `gsutil -m cp -r gs://bucketname/folder-name local-location`

To download multiple files: `gsutil -m cp -r gs://bucketname/folder-name/filename* local-location`
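As an alternative to the web-console steps above, the same export can be done programmatically. Below is a minimal sketch using the google-cloud-bigquery Python client; the bucket name and table choice are placeholders, and an authenticated GCP environment is assumed.

```python
# Minimal sketch: export a BigQuery table to GCS programmatically.
# Assumes `pip install google-cloud-bigquery` and authenticated credentials;
# "tmpbucket" is a placeholder bucket name.
from google.cloud import bigquery

client = bigquery.Client()

# The public Ethereum dataset referenced above.
table = client.get_table("bigquery-public-data.ethereum_blockchain.transactions")

# The * wildcard numbers the output files, since large tables are split.
destination_uri = "gs://tmpbucket/transactions/transactions*.csv"

extract_job = client.extract_table(table, destination_uri)
extract_job.result()  # block until the export job completes
```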
Kaggle can be used to preview the data table columns.
Please refer to the GitHub page for more details.
We extract all relevant data from the dataset hosted on Google Cloud till 2019-12-31 23:59:45 UTC, which amounts to all blocks from genesis (#0) up to #9193265. The entire blockchain data is stored in seven different tables, out of which we extract data from the contracts, token transfers, traces, and transactions tables for our temporal analysis.
- The traces table stores the executions of all recorded messages and transactions (successful ones) in the Ethereum blockchain. It is the most comprehensive table for analysis.
- The transactions table contains all transaction details, such as the source and target addresses and the amount of ether transferred.
- The contracts table contains all Contract Accounts, their bytecode, and other properties such as block_timestamp, block_number, and token types (e.g., ERC721, ERC20).
- The token transfers table focuses on all transfers of tokens from one 20-byte address to another 20-byte address on the blockchain.
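As an illustration of working with the exported data, the following sketch loads one exported transactions CSV into a raw (source, target) arc list with pandas. The file name is a placeholder; the column names follow the public BigQuery schema.

```python
# Minimal sketch: turn an exported transactions CSV into a raw arc list.
import pandas as pd

df = pd.read_csv("transactions000000000000.csv",
                 usecols=["from_address", "to_address"])

# Contract-creation transactions have a null to_address; drop them here.
arcs = df.dropna(subset=["to_address"])
arcs.to_csv("transactions_arclist.csv", index=False)
```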
All the scripts are written in Python 3.7. To run a script, launch a Python tool such as Anaconda, or run `python xx.py` directly.
The folder contains four sub-folders for the transactionNet, traceNet, tokenNet, and contractNet arc lists and account extraction.
For transactionNet, traceNet, and tokenNet
- Annual graph: the raw data obtained from Google BigQuery is on an annual basis. Scripts named "tracexx.py", "tokenxx.py", and "transactionxx.py" process the annual raw data and build the annual arc list and corresponding hash table (see the sketch after this list).
- Result: due to the file size limitation on GitHub, only the Year 2015 annual arc list and hash table are uploaded as a reference.
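A minimal sketch (not the repository scripts themselves) of how an annual arc list and address hash table can be built: each address is mapped to a compact integer index in order of first appearance, arcs are written as index pairs, and the address-to-index mapping is saved as the hash table. File names are placeholders.

```python
# Minimal sketch: build an annual arc list and address hash table.
import csv

hash_table = {}  # address -> integer index

def index_of(address):
    """Assign indices in order of first appearance."""
    if address not in hash_table:
        hash_table[address] = len(hash_table)
    return hash_table[address]

with open("transactions_arclist.csv") as fin, \
     open("transactionNet_2015_edgelist.txt", "w") as fout:
    for row in csv.DictReader(fin):
        fout.write(f"{index_of(row['from_address'])} {index_of(row['to_address'])}\n")

with open("transactionNet_2015_hashtable.csv", "w", newline="") as f:
    csv.writer(f).writerows(hash_table.items())
```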
For contractNet
- Annual graph: the raw data obtained from Google BigQuery is on an annual basis. Scripts named "xx_Annual_xx.py" process the annual raw data and build the annual arc list and corresponding hash table.
- Monthly graph: scripts named "xx_Monthly_xx.py" not only build the arc list and hash table but also partition the arc list by month, matching the timestamps in the raw data (see the sketch after this list).
- Result: due to the file size limitation on GitHub, only the ContractNet Year 2015 annual arc list and hash table are uploaded as a reference, in the folders "contractNet_address_hash" and "contractNet_edgelist_example".
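A minimal sketch of the monthly partition idea: each arc is routed to a per-month file according to its block timestamp. The input file, its timestamp format, and the output names are assumptions for illustration.

```python
# Minimal sketch: partition an arc list into monthly files by timestamp.
import csv

writers = {}  # "YYYY-MM" -> open file handle

with open("contractNet_arclist_with_time.csv") as fin:
    for row in csv.DictReader(fin):
        month = row["block_timestamp"][:7]  # e.g. "2015-08"
        if month not in writers:
            writers[month] = open(f"contractNet_{month}.txt", "w")
        writers[month].write(f"{row['from_address']} {row['to_address']}\n")

for f in writers.values():
    f.close()
```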
- Find the number of vertices, arcs, and self-loops of each network: an example analyzing the number of vertices, arcs, and self-loops of contractNet_2019 for Figures 2 and 3 (a counting sketch follows this list).
- Find common accounts across consecutive years: an example analyzing contractNet for Figure 2.
- Find common accounts across consecutive years: an example analyzing contractNet for Figure 3.
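A minimal counting sketch with python-igraph; the edge-list file name is a placeholder.

```python
# Minimal sketch: count vertices, arcs, and self-loops of a network.
import igraph as ig

g = ig.Graph.Read_Edgelist("contractNet_2019_edgelist.txt", directed=True)

print("vertices:  ", g.vcount())
print("arcs:      ", g.ecount())
print("self-loops:", sum(g.is_loop()))  # is_loop() returns one flag per arc
```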
- Analyze network reciprocity, assortativity, connected components, and k-core properties.
- Analyze network path length, radius, and diameter.
- Analyze network triangles, transitivity, and average clustering coefficient.
- Analyze network weakly and strongly connected components.
These are examples of extracting the network properties used in Sections 4, 5, and 6; a sketch follows.
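A minimal sketch of these global properties with python-igraph; the exact property set in the repository scripts may differ, and the file name is a placeholder.

```python
# Minimal sketch: global network properties on a directed graph.
import igraph as ig

g = ig.Graph.Read_Edgelist("contractNet_2019_edgelist.txt", directed=True)

print("reciprocity:   ", g.reciprocity())
print("assortativity: ", g.assortativity_degree(directed=True))
print("transitivity:  ", g.transitivity_undirected())
print("avg clustering:", g.transitivity_avglocal_undirected())
print("diameter:      ", g.diameter(directed=True))
print("avg path len:  ", g.average_path_length(directed=True))
print("max coreness:  ", max(g.coreness()))
print("WCCs:", len(g.components(mode="weak")),
      "SCCs:", len(g.components(mode="strong")))
```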
- Extract degree number for each vertex: an example calculating the degree/in-degree/out-degree of each vertex in the network. The input is the network edge list; the output is a CSV with each account and the corresponding results.
- Find tokenNet top-10 degree accounts: reads in the vertex degree distribution file (from the previous step) and lists the top 10 values per year, for Tables 6 and 7. A degree-extraction sketch follows this list.
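A minimal degree-extraction sketch combining both steps; file names are placeholders.

```python
# Minimal sketch: per-vertex degrees to CSV, plus the top-10 listing.
import igraph as ig
import pandas as pd

g = ig.Graph.Read_Edgelist("tokenNet_2019_edgelist.txt", directed=True)

df = pd.DataFrame({
    "account":   range(g.vcount()),  # hash-table index of each vertex
    "degree":    g.degree(),
    "indegree":  g.indegree(),
    "outdegree": g.outdegree(),
})
df.to_csv("tokenNet_2019_degree.csv", index=False)

# Top-10 accounts by total degree (as in Tables 6 and 7, per year).
print(df.nlargest(10, "degree"))
```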
- Community detection
There are three steps in community detection:
Step 1: Identify communities using the Multi-level algorithm
find_contract2019_community_multilevel_realEdgeIndex_3mon.py
Note: the python-igraph library outputs community arc lists using vertex indices instead of the real node values. To perform the matching in the next step, the real values (the annual-basis indices) must be attached to each node, as in the sketch below.
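A minimal sketch of this step, assuming python-igraph and placeholder file names; community_multilevel() works on undirected graphs, so the directed network is collapsed first.

```python
# Minimal sketch: Multi-level communities with real node values attached.
import igraph as ig

g = ig.Graph.Read_Edgelist("contractNet_2019_3mon_edgelist.txt", directed=True)

ug = g.as_undirected()
ug.vs["real_index"] = list(range(ug.vcount()))  # annual-basis indices

clusters = ug.community_multilevel()
for cid, sub in enumerate(clusters.subgraphs()):
    # Each subgraph keeps real_index, so its arcs can be written with
    # real node values instead of igraph's local vertex indices.
    print(cid, sub.vs["real_index"])
```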
Step 2: Match communities between the 3-month dataset and the 1-month dataset
Find_continuous_community1_grow_die_compareREALindex.py
This script uses the VF2 algorithm for subgraph-isomorphism matching. The matching considers not only the graph shape but also the node values to be matched (see the sketch below).
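A minimal illustration of the matching idea with python-igraph's VF2 routine: a node_compat_fn restricts matches to vertices whose real indices agree, so both shape and node values must match. The toy graphs below are stand-ins for a 3-month community and a 1-month community.

```python
# Minimal sketch: VF2 subgraph-isomorphism matching on node values.
import igraph as ig

g_3mon = ig.Graph(edges=[(0, 1), (1, 2), (2, 0), (2, 3)])
g_3mon.vs["real_index"] = [10, 11, 12, 13]

g_1mon = ig.Graph(edges=[(0, 1), (1, 2), (2, 0)])
g_1mon.vs["real_index"] = [10, 11, 12]

def same_real_index(g1, g2, v1, v2):
    # Vertices are compatible only when their real node values agree.
    return g1.vs[v1]["real_index"] == g2.vs[v2]["real_index"]

# True if g_1mon occurs inside g_3mon with matching node values.
print(g_3mon.subisomorphic_vf2(g_1mon, node_compat_fn=same_real_index))
```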
Step 3: Extract properties of each community
extract_contract2016_properties.py
The script extracts the local and global properties of each community to serve as training/testing data; a sketch follows.
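A minimal sketch of per-community feature extraction; the feature set shown is illustrative, not the exact one used in the paper, and file names are placeholders.

```python
# Minimal sketch: extract local properties of each community to CSV.
import csv
import igraph as ig

g = ig.Graph.Read_Edgelist("contractNet_2016_edgelist.txt", directed=True)
clusters = g.as_undirected().community_multilevel()

fields = ["n_vertices", "n_arcs", "density", "transitivity"]
with open("contract2016_community_properties.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=fields)
    writer.writeheader()
    for sub in clusters.subgraphs():
        writer.writerow({
            "n_vertices":   sub.vcount(),
            "n_arcs":       sub.ecount(),
            "density":      sub.density(),
            "transitivity": sub.transitivity_undirected(),
        })
```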
- Community Prediction
Individual
Scripts logistic_regression.py and random_forest.py are used for prediction in each time period. The scripts are generalized; they only require the class 1 and class 0 training features and labels as input. Each script includes a random selection function to balance class 1 and class 0, which needs to be adjusted based on the input data (a sketch follows below).
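A minimal sketch of the per-period prediction with random down-sampling for class balance, using scikit-learn; the feature arrays are toy placeholders for the class 1 and class 0 community features.

```python
# Minimal sketch: balanced training of the two classifiers.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X1 = np.random.rand(120, 4)  # class 1 ("survives") features, toy data
X0 = np.random.rand(300, 4)  # class 0 ("dies") features, toy data

# Randomly down-sample class 0 so both classes contribute equally.
idx = np.random.choice(len(X0), size=len(X1), replace=False)
X = np.vstack([X1, X0[idx]])
y = np.concatenate([np.ones(len(X1)), np.zeros(len(X1))])

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

for model in (LogisticRegression(max_iter=1000), RandomForestClassifier()):
    model.fit(X_tr, y_tr)
    print(type(model).__name__, "accuracy:", model.score(X_te, y_te))
```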
Overall
Scripts logistic_regression_combine_allMonth.py and random_forest_combine_allMonth.py are for complete-year prediction, so the training data are combined prior to being input into the scripts. The scripts are therefore almost the same as the individual ones.
- Evgeny Medvedev and the D5 team, "Ethereum ETL," https://github.com/blockchain-etl/ethereum-etl, 2018.
- Ethereum Blockchain, https://www.kaggle.com/bigquery/ethereum-blockchain, 2020.