#Edge Activations
Consider the graph below where edge weights indicate distances.
Lets use a Vietoris Rips complex: denoted as
Min scale = 0 (we could start from 0.5 or any other value as well)
Max scale = 100 (we could go longer, but it would not change anything)
Multiple questions:
- Question 1: How do we define a distance based on edge weights? In the previous slide, the edge weights represented distances, eliminating the need for explicit distance definition.
- Question 2: What happens if the graph is unweighted?
- Graph resistance distance: In a graph, the resistance between two vertices measures the level of difficulty for electric current to flow between them.
- Shortest path distance (works with weighted edges as well): a measure of the minimum number of edges along the shortest path between two vertices in a graph.
- Ricci curvature (works with weighted edges as well): the concept of Ricci curvature is a generalization of Ricci curvature from Riemannian geometry. It provides a way to quantify the geometric properties of the network based on its connectivity and edge weights.
Question 3: What happens if the graph is directed?
- Use distance measures introduced in the first question, but define distances based on incoming or outgoing edges only
#Node Activation
Consider the graph below where node features indicate ages.
We will use sub or super level filtration.
- Sublevel: filtration from minimum to maximum scales
- Superlevel: filtration from maximum to minimum scales
Min scale = 10 (we could start from 15 or any other value as well)
Max scale = 80 (we could go longer, but it would not change anything)
Multiple Questions:
- Question 1: How do we define node activation values? May come from domain: in molecular networks, atomic weights of molecules.
#Simplex vs Clique
Math: Mathematically, a k-dimensional simplex is defined as the convex hull of (k + 1) affinely independent points in k-dimensional space.
CS: In graph theory, a k-clique refers to a subset of k vertices in a graph where every vertex is directly connected to every other vertex in the subset.
#K-dimensional holes
A k-dimensional hole requires vertices with at least degree k+1 (Vietoris-Rips here).
| | |:---:| | *Node f will never contribute to a k>0 dimensional hole (See our Neurips’22 CoralTDA article)* |#Power of PH
Persistent homology is powerful because it allows us to track the birth and death of dimensional features at various scales.
Insight from Yulia R. Gel of UTD Math: This PH information is similar to what we get in graph motif analysis, but with finer granularity.
- Graph motif analysis is a method used in graph theory and network analysis to identify and analyze recurring patterns or subgraphs within a larger graph.
- A motif is defined as a small subgraph that appears more frequently in a graph than would be expected by chance alone.
Account Based blockchains have two types of "transactions"
- A transfer of the used cryptocurrency, such as Ether on Ethereum.
- Internal transactions, which involve a transfer of smart contract-based tokens
#Ether transactions
Ether transactions involve one input address and one output address. An address spends coins from a balance, keeps the change. Each transaction of an address has an order (called nonce). The nonce is the number of transactions sent to the network by the address. A later transaction needs to wait for earlier transactions to be mined.
#Internal transactions
A transaction can transfer both currencies and tokens. An internal transaction can create multiple edges, although this is rare on Ethereum
#Trading Tokens - a timeline
#Token Networks
| ![image](https://github.com/SelimC06/TDA-on-Graphs/assets/88677758/9bdbcc20-e0b5-45d6-b2c6-68eca148f1f0){width=65%} | |:---:| | *The largest connected component in Storj network on 13-1-2018* |We model account based blockchains as directed, weighted, multi-graphs
The network of a single token is usually sparse and devoid of community structure. Daily networks may contain many disconnected components.
#Topological Data Analysis of Blockchain – Ethereum Case
Let 𝐺=(𝑉,𝐸,𝜔) be a weighted graph, with the node set 𝑉 and edge set 𝐸 and
To account for dissimilarity between two disconnected nodes, we introduce the weight ̃$𝜔:𝑉×𝑉→𝑅^+$
$$𝜔_{uv}=\left{
\begin{array}{ll}
\omega_{uv} & (u,v)\in \mathcal{E}\
\infty & (u,v)\not\in \mathcal{E}
\end{array}
\right.$$
In the context of a weighted network, we define
where
#Topological Data Analysis of Ethereum Networks
In this context, we introduce a novel notion of Betti functions which relate these counts to the scale parameter viewed as continuum. The Betti-𝒑 function
The Betti functions can be regarded as a functional summary statistic of the network’s topological structure.
Due to the functional dependency among Betti numbers at different scales, it is important to view ${\mathfrak{B}p(\epsilon_k)}{k=1}^n$ as a function as opposed to a vector in
Consider Betti functions
We use a notion of rolling band depth:
#Predictive Models
Problem Definition: Given the transaction network of an Ethereum token and time series of the token price in fiat currency, predict whether the token price will change more than 𝛿 in the next ℎ days. Identify the maximum horizon value ℎ such that the prediction accuracy is at least 𝜌.
| ![image](https://github.com/SelimC06/TDA-on-Graphs/assets/88677758/50db0b47-4743-493e-83e1-280650223ad2){width=60%} | ![image](https://github.com/SelimC06/TDA-on-Graphs/assets/88677758/95a97228-dd93-420c-92ab-b5f8f3ef0c2b) | |:---:|:---:| | A histogram of absolute price returns of 31 tokens $𝑅_𝑡=(𝑃𝑟𝑖𝑐𝑒_𝑡−𝑃𝑟𝑖𝑐𝑒_{𝑡−1})/(𝑃𝑟𝑖𝑐𝑒_{𝑡−1})$. | Number of (price) anomalous tokens in time. |Predictive Models - A Fusion of Network Analysis and Time Series
We predict price anomalies in 31 token networks, where a total of 9042 days are predicted as anomalous (Anomaly:true). To predict whether the Ethereum token price will change more than 𝛿 within the next ℎ days horizon, we combine the graph topological features with traditional network summaries. We start by defining the price return of a token on day 𝑡 as
Then, we label a day 𝑡 as anomalous if there is a significant change in token’s price. More specifically, if |$R_𝑡$|≥𝛿 where 𝛿 is a user-defined threshold (i.e., magnitude of a price shock), then 𝑡 is considered as an anomalous day. We build one predictive model for each token and examine performance for different prediction horizons ℎ>0.
Our Random Forest models use 2/3 of a token’s lifetime for training, and the remaining 1/3 for test.
| ![image](https://github.com/SelimC06/TDA-on-Graphs/assets/88677758/00c60556-8f8a-4c8c-b34c-fb5801416993) | ![image](https://github.com/SelimC06/TDA-on-Graphs/assets/88677758/ccd87297-2621-488e-9eb3-8dbcf638346e){width=50%} | |:---:|:---:| ||A Venn diagram for the number of predicted anomalies in all token networks (for h = 1). Intersecting regions indicate agreement on predictions|Edge: Number of edges in the daily token network.
For up to three-day horizons, all models have accuracy >0.7. The figure shows that compared to other models, the M4 (full) model has the least deteriorating performance as horizon increases from 1 to 7. The accuracy results offer evidence that Betti models are more conservative in making anomalous day predictions, and their accuracy is better than the baseline model M1.
|![image](https://github.com/SelimC06/TDA-on-Graphs/assets/88677758/6a515bcb-eb85-4a71-8677-a7902f0acf09){width=65%}| |:---:| |$Accuracy: \frac{𝑇𝑃+𝑇𝑁}{𝑇𝑃+𝐹𝑃+𝑇𝑁+𝑇𝑃}$| | ![image](https://github.com/SelimC06/TDA-on-Graphs/assets/88677758/0b6fd434-cca2-4415-8870-b11e66cbcc99) | |:---:| | Abay, N.C., Akcora, C.G., Gel, Y.R., Kantarcioglu, M., Islambekov, U.D., Tian, Y. and Thuraisingham, B., 2019, November. **Chainnet: Learning on blockchain graphs with topological features**. In 2019 IEEE international conference on data mining (ICDM) (pp. 946-951). IEEE.|#Transaction output (TXO) based blockchains
Next, if address b wants to spend its received 2B, it needs to show proof of funds:
“Use the 2B I received from Block 1, transaction 1 and to pay 1.5B to c and 0.3B to d”.
#Blockchain Graph – Substructure mining
Rather than individual edges or nodes, we use a substructure as the building block in our Bitcoin analysis. We use the term chainlet to refer to such substructures.
| | |:---:| |**Forecasting Bitcoin Price with Graph Chainlets**, Akcora et al. PaKDD 2018.|Definition [K-Chainlets]:
Let k-chainlet
If a Gk occurs more/less frequently than expected by chance, it is called
a Blockchain k-chainlet. A k-chainlet signature
#Blockchain Chainlets
Chainlets have distinct shapes that reflect their role in the network. We aggregate these roles to analyze network dynamics.
| | |:---:| | Three distinct types of first order chainlets! |#Graph features – occurrence and amount
|![image](https://github.com/SelimC06/TDA-on-Graphs/assets/88677758/6709942a-0b7a-43b0-9bb3-ecf314cbef18) | | |:---:|:---:|Occurrence: How many transactions were created with a given type of chainlet?
Amount: How many bitcoins were transferred with a given type of chainlet?
| | |:---:|Let’s take one-to-two chainlets, i.e., c+{1→2}
Occurrence: 2 chainlets occur:
Amount: 7.8Ƀ+6.6Ƀ=14.4 bitcoins are transferred by them.
We can record one occurrence matrix, and one amount matrix per 24 hour window.
Rows and columns indicate number of inputs and outputs, respectively.
Information about one-to-two chainlets, i.e., c_{1→2}, are given in the matrix cell [1,2]
#Topological features
Topological Data Analysis (TDA) provides methods to systematically study the topological and geometric structure underlying data. These structures are commonly analyzed via the multi-scale-based framework of persistent homology. The primary idea is to assess which topological features remain persistent over a larger set of scales and hence are likely to play a significant role in its functionality.
#Persistent homology for blockchains
We propose a novel approach that computes the Betti sequences on a network of N × N nodes where N is the size of the chainlet matrix A. For each of the
In constructing the new network, we use and hence retain the amount information from the Blockchain network. This way, we combine distance (computed from transferred coins) with edge connectedness while restricting the network size.
We describe the main steps as follows:
Given a heterogeneous Blockchain network with transferred bitcoins on edges,
- All the transferred amounts are converted from Satoshis to bitcoins and log transformed: a’ = log(1 + a/108).
- For each chainlet of a given time period, we compute the sample q-quantiles for the associated log-transformed amounts:
a k-th q-quantile, k = 0, 1,…, q, is the amount Q(k) such that
$$∑_{𝑖=1}^τ 1_{𝑦_𝑖} <𝑄(k)≈\frac{τk}{𝑞} and ∑_{𝑖=1}^τ 1{𝑦_𝑖} >𝑄(k)≈ \frac{τ(q−k)}{𝑞}$$ , where τ is the total number of transactions. - The (dis)similarity metric dij between chainlet nodes i and j is defined as the quantile-based distance
$$d_{ij} =\sqrt{∑_(𝑘=0)𝑞[𝑄𝑖(𝑘)−𝑄_𝑗 (𝑘)]^2}$$ - We construct a sequence of scales
$ɛ_1$ <$ɛ_2$ < . . . <$ɛ_S$ covering a range of distances during the entire 365-day period. - For each
$ɛ_k$ , we build the corresponding VR complex whose 0-simplices are single chainlets and 1- simplices are pairs of chainlets with distance ≤$ɛ_k$ .
As a result, we obtain the filtration of VR complexes
#Experiment
Problem Statement: Let
We downloaded and parsed the entire Bitcoin transaction graph from 2009 January to 2018 December. Using a time interval of 24 hours, we extracted daily transactions on the network and created the Bitcoin graph. Our Bitcoin price (USD) data is downloaded from blockchain.com which aggregates prices from worldwide online exchanges.
In addition to FL and Betti related features: past price, transaction count, mean degree of addresses, number of new addresses, mean and total coin amount transferred in transactions and address network average clustering coefficient.
We used ARIMAx, Random Forest, XGBT, Gaussian Process based Regression, and Elastic Net.
We use a time window-based approach in price prediction.
![image](https://github.com/SelimC06/TDA-on-Graphs/assets/88677758/99c08714-d965-446b-8abe-f77553c1bf44)#Baseline Experiments
We assess model performance with root mean squared error in the predicted price.
| ![image](https://github.com/SelimC06/TDA-on-Graphs/assets/88677758/e15b669f-8d3d-4d15-8cd0-c27f6d7ec465) | ![image](https://github.com/SelimC06/TDA-on-Graphs/assets/88677758/19127d6e-ab87-44fe-86c6-dfc55a0bbce3)|![image](https://github.com/SelimC06/TDA-on-Graphs/assets/88677758/fc0b0cd8-a8e7-4123-9881-065c7bc4a630)| |:---:|:---:|:---:| |Window = 3|Window = 5|Window = 7|The simplest baseline for ChainNet can be constructed by training models on past price and past total transaction count in a sliding window prediction scheme.
#Experiments
We report the percentage predictive gain, or decrease in RMSE for a specific machine learning model m w.r.t. its baseline model
Figure: Gain over the best model is given for three windows, and multiple horizons.
Next Part