
[Grant] An Empirical Analysis of Monero's Ring Signature Resilience to Artificially Intelligent Attacks #15

Closed
ACK-J opened this issue Feb 19, 2022 · 7 comments


@ACK-J

ACK-J commented Feb 19, 2022

Candidate Background

I am currently a graduate student completing my masters in cyber-security and have taken on this project. Additionally, I am a penetration-tester for a well-respected security firm. In the past, I have worked in a research lab for over three years, dedicated to identifying deep-learning attacks against privacy technologies, and recently had a paper published where I was the lead author (attached in my email to MAGIC). I first learned about Monero around 2019, and over the last year and a half, I have worked to familiarize myself with how each component works. I pride myself on being a strong privacy advocate and have started open source projects in the past https://github.com/ACK-J/Port_Authority to prevent invasive data collection practices. Lastly, when I have time, I run my blog where I post research, CTF writeups, data science topics, and more https://g666gle.me.

Project Description

With large, well-resourced entities recently taking aim at Monero, I aim to assess the attack surface of ring signatures with respect to machine learning and deep learning attacks. More specifically, I seek to determine which features of the blockchain can be leveraged to predict the true spend of an arbitrary transaction while disregarding outside information. This research will collect a large dataset of intentionally de-anonymized Monero transactions. The transactional data will be cleaned, filtered, and enriched with additional information present on the blockchain. Lastly, I will train multiple machine/deep learning models against the dataset to determine whether a ring signature's true spend can be predicted with accuracy greater than random guessing. My research is uniquely timed ahead of the upcoming hard fork (v15), so I would also like to assess the privacy protections gained against the aforementioned attacks after the ring size increases from 11 to 16.

Abstract

Contrary to public perception, the most difficult part of data science is collecting a large, balanced dataset that accurately generalizes to the population. Monero research within this field has been stale or non-existent for half a decade, and there is still no public dataset available. Borggren et al. commented on this challenge, stating, "The difficulty in procuring labels for supervised learning remains a central challenge, despite the accessibility of blockchain activities themselves. In Monero, this is especially so." [1] With ring signatures largely seen as the crux of Monero's strong privacy claims, we must continuously invest in researchers to evaluate and innovate new solutions and avoid ignoring any potential attack surface. This research will collect the first large-scale, balanced, labeled, and open-sourced dataset of test-net Monero transactions. Publishing a dataset large enough to train deep neural networks will hopefully encourage scholars and independent researchers to continue scrutinizing ring signatures' security and privacy claims, ultimately hardening Monero against AI attacks. After the collection is completed, I will train multiple machine/deep learning models on the dataset to conduct my security assessment. Since I believe in the peer-review process, an end goal would be to publish the dataset on Kaggle with a reward. This would invite data scientists around the world who may not be familiar with the Monero project to conduct their own experiments in identifying flaws within the protocol. Additionally, this dataset could be used for more than just security assessments and would be useful in areas such as statistical analysis and education. Once the process is fully automated, a new dataset for each subsequent hard fork of Monero could be created on the test-net, published, and peer-reviewed before the fork reaches main-net. As the adversaries of Monero grow day by day, we must stay vigilant and assume their attacks will incorporate sophisticated artificially intelligent models.

Feature Set Ideas

  1. Tx hash (Unique identifier)
  2. RingCT amount (private)
  3. Sender address (private)
  4. Receiver address (private)
  5. Wallet's minimum block height (private)
  6. Wallet public/private spend key (private) (This may be a bad idea)
  7. Wallet public/private view key (private)
  8. Tx Key (private)
  9. Direction (”in” or “out”) (private)
  10. Wallet balance when transaction made (private)
  11. Output number spent from wallet (private)
  12. Is a subaddress? (boolean) (private)
  13. Number of ins
  14. Number of outs
  15. Ring size
  16. isRingCT? (boolean)
  17. RingCT Type
  18. Tx version
  19. Block number
  20. Txid
  21. Tx size in Kb
  22. Tx fee per_kb
  23. Tx fee
  24. Tx timestamp
  25. Total Block fee
  26. Block timestamp (epoch)
  27. Time since last Block (Not particularly useful for ML, but might be useful for other researchers)
  28. For each output
    1. Amount (private)
    2. Stealth address
    3. Output public key
  29. For each input
    1. Average mean epoch time of the ring signature
    2. key image
    3. Number of spends in each IQR
    4. Time Difference from the newest ring and oldest ring
    5. Time Difference between 11th - 10th ring, 10th - 9th, 9th - 8th ....
    6. Time Difference between block timestamp and newest ring time
    7. Time Difference between block timestamp and oldest ring time
    8. Time Difference between block timestamp and mean ring time ( maybe median )
    9. Each Ring member
      1. Num Ins
      2. Num Outs
      3. is black-balled? (boolean)
      4. Block number member is taken from
      5. Decoy_Output_Ring_Member_Frequency on chain
      6. Decoy_Output_Ring_Member_Sizes
      7. Data_Collection_Time
      8. Decoy_Info[[Tx_Hash, Block_Number, Timestamp, Key_Image, True_Ring_Position/Ring_Size],] (private)
      9. Timestamps

Milestones and Budget

[Image: milestones and budget breakdown]

My rate is $50/hour, which is on the low side for data science consulting. I'm looking to contribute 20 hours of work each week, and I estimate my research will take 22 consecutive weeks. This totals 440 man-hours of work, costing $22,000; however, since the maximum MAGIC budget is $20,000, I would be willing to accept a reduced amount. Additionally, I would prefer payouts in Monero on a monthly schedule (after my updates have been posted). In total, there will be five monthly payouts of $4,000, adjusted for the current price of Monero.

References

[1] N. Borggren, H.-y. Kim, L. Yao, and G. Koplik, “Simulated blockchains for machine learning traceability and transaction values in the monero network,” arXiv preprint arXiv:2001.03937, 2020.

@Rucknium
Member

The MAGIC Monero Fund committee has voted to support this research project for 12,000 USD equivalent. Good luck!

@ACK-J
Author

ACK-J commented Apr 1, 2022

March Update

This is the first of five updates I will deliver, touching upon the work I have completed in the last month on my MAGIC grant. To briefly reiterate, the goals of this research are twofold: first, to collect a deanonymized dataset of Monero stagenet transactions that is complete, incorporating all information at my disposal; second, to engineer features and train machine learning classifiers and deep neural networks to identify heuristics that leak information about the true spend of an arbitrary ring signature.

Non-Technical tl;dr

The pipeline to collect a Monero dataset, filter and clean the transactions, and train machine learning models on the results is complete. I've spent a lot of time experimenting to identify real user spending habits on the Monero main-net to incorporate into the automation. These experiments ensure that once the actual data collection process begins, the results will mimic real users as closely as possible. The start date for the large-scale data collection has been pushed back from the beginning of March to April 8th.

Technical Updates

Goals to complete by end of March

  • Finish experiment and determine appropriate "cap" for transaction delays sampled from Monero gamma distribution.
  • Conduct an experiment to identify the distribution of transaction fees on Monero's main-net.
  • Stop preliminary data collection (900 wallets not using gamma dist)
  • Finish writing data export script from cli wallet
  • Finish writing data enrichment script using onion monero blockchain explorer
  • Restart data collection to span 10 servers and sample gamma distribution for transaction delays (1500 wallets)
  • Start training preliminary data against machine learning models such as random forests to determine feature importance.
  • Run data export script on the preliminary dataset to assess the skewness (if any) of the transactions collected with no delays.
  • Finish literature review

To start, at the beginning of March, I worked with @moneroexamples to make Xmr2csv export the true ring position information for your own transactions instead of only decoys found on chain (Here). I sincerely appreciate their quick response and work implementing the feature. This was crucial for my research to continue as the added export functionality is where the labels of the dataset (true ring position) are collected, and supervised learning would not be possible without it. Additionally, I conducted multiple experiments to ensure the resulting dataset mimicked real user behavior as closely as possible. The two areas of focus were choosing a transaction fee that simulates rational user behavior and delaying user transactions in such a way as to produce an even distribution of ring positions to avoid the need for drastic undersampling. The former required collecting a transaction fee distribution crawled from the last 100,000 txs on the Monero blockchain, starting at block 2566273.

The ratio between the fee paid and the transaction size (kB) is plotted below in a histogram where the Y-axis is the log probability. More information about the experiment can be found (Here).

[Figure: histogram of the fee-paid-to-transaction-size ratio, log-probability Y-axis]
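For readers who want to reproduce this kind of plot, a minimal sketch follows; the synthetic fee values stand in for the real crawled ratios, and the bin count is my own choice:

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic stand-in for the crawled fee/size ratios of the last 100,000 txs
rng = np.random.default_rng(0)
fee_per_kb = rng.lognormal(mean=-4.0, sigma=1.0, size=100_000)

plt.hist(fee_per_kb, bins=100, density=True, log=True)  # log-probability Y-axis
plt.xlabel("Fee per kB (XMR)")
plt.ylabel("Log probability")
plt.show()
```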

I learned quickly from reviewing the results of a few transactions that my initial approach of having the automated wallets send payments as fast as possible created a scenario where the vast majority of true spends were in the most recent ring position. A dataset with such a large right-skew would need to be undersampled, drastically reducing its size. The distribution used to represent Monero user spending patterns (in seconds) is shown below in the left-most figure. More technically, it is a gamma distribution with a shape of 19.28 and a rate of 1.61. After setting up 200 wallets to transact with a delay sampled from this distribution, capped at 12 hours, the results largely resembled the skewness of the middle figure below. The same experiment was carried out again, except with a uniform delay between 20 minutes and 12 hours. The results were noticeably less skewed, as shown below in the right-most figure. My current assumption is that to distribute true ring positions more evenly, either a uniform delay between 20 minutes and 24 hours or an inverse of the gamma distribution should be used. However, these experiments are ongoing and are the main reason I did not start the data collection process in March.

Due to the inconclusive experiments regarding the correct transaction delay distribution, I have not been able to start the data collection process for the stagenet dataset. While this is unfortunate, I need to ensure the delay distribution adequately represents real users and creates ring signatures where the true spend is distributed roughly evenly among all 11 positions. Getting this right now ultimately prevents having to recollect the data four months down the line.

Left: the gamma distribution used to select how many seconds to delay a transaction. Middle: the true ring position of transactions delayed by exactly 20 minutes. Right: the true ring position of transactions delayed between 20 minutes and 12 hours.
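As a rough illustration of the delay sampling described above, here is a minimal sketch. It assumes, per Möser et al.'s convention, that the gamma distribution is sampled on a log scale (delay = e^x), and it simplifies the cap logic to resampling:

```python
import numpy as np

SHAPE, RATE = 19.28, 1.61  # gamma parameters from Moser et al.

def sample_delay(rng: np.random.Generator,
                 cap_seconds: float = 12 * 60 * 60) -> float:
    """Draw a spend delay in seconds, resampling until it falls under the cap."""
    while True:
        # Sample on a log scale: delay = e^x with x ~ Gamma(shape, scale=1/rate)
        delay = float(np.exp(rng.gamma(SHAPE, 1.0 / RATE)))
        if delay <= cap_seconds:
            return delay

rng = np.random.default_rng(42)
delays = [sample_delay(rng) for _ in range(1_000)]
```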

A substantial update is the completion of the data collection pipeline, written as three scripts that collect, combine, filter, and enrich the information exported from each wallet. The images below detail the current workflow of the data collection process. Notably, collect.sh and create_dataset.py both incorporate multiprocessing to speed up any computationally expensive step.

Left: run.sh creates, configures, and automates transactions between adjacent wallets. Middle: collect.sh opens each wallet file and exports transaction data into six files. Right: create_dataset.py combines the six files for each wallet and enriches the data before adding it to the final dataset.

Since the initial proposal, the feature set has grown considerably, resulting in the following features collected for every transaction. This list is intentionally exhaustive and contains some data not relevant to machine or deep learning applications; it instead provides context, making the dataset useful for multiple applications such as education or statistical analysis.

Left: the top-level data and metadata combined for a single transaction. Middle: expanding the input features collected. Right: expanding the output features collected.

After MAGIC accepted my research proposal, I was given access to a compute cluster dedicated to Monero research and owned by @gingeropolous. This allowed me to quickly process a preliminary dataset of 700,000 testnet transactions collected earlier this year. I am still in the process of exporting the data to be used within machine learning models. Note that the preliminary dataset was not consistently collected and contains many errors. However, it provided a suitable stress test for the data collection pipeline. Using a small sample of about 1,000 transactions, I successfully trained a rudimentary random forest machine learning model and have seen good results.

Lastly, I have worked on additions to the formal white paper and created graphics to help condense complicated information. One of the graphics I am most proud of is shown below; it accentuates how ring signature decoys are chosen in a non-uniform fashion from the vast set of previous outputs seen on-chain.

image

Revised Goals for April

  • Start data collection of stagenet dataset
  • Start engineering a neural network with preliminary data
  • Write white paper sections (Methodology)
  • Monero v15 hard fork estimated ( Increase Ring size to 16 )
  • Brainstorming feature engineering with @isthmus

@ACK-J
Author

ACK-J commented May 2, 2022

April Update

This is the second update delivered for my MAGIC grant, detailing the work I have completed from April 1st to April 30th.

Non-Technical tl;dr

Major efficiency improvements have been made to the data collection pipeline, allowing more wallets to run simultaneously on the same hardware. The data collection process has officially started across six servers with a total of 7,400 stagenet wallets and is expected to finish on July 1st. I have trained two machine learning models on a preliminary dataset, which is highly biased, and achieved accuracies upwards of three times random guessing when predicting the true member of a ring signature. It is currently unclear whether these results will be reaffirmed or disputed once the new, realistic dataset is collected and tested. Other progress has been made on model research and the whitepaper, and future efforts are shifting towards feature engineering.

Technical Updates

Goals for April

  • Start data collection of stagenet dataset
  • Start engineering a neural network with preliminary data
  • Write white paper sections (Methodology)
  • Monero v15 planning ( Increase Ring size to 16 )
  • Brainstorming feature engineering with @isthmus

April started with a heavy focus on finalizing a method for simulating real user behavior during the automated collection process. The target to deploy the wallets on April 8th was achieved, and the collection ran smoothly. However, after a few days and multiple discussions with other researchers, it was decided that the current deployment of wallets was not ideal.

There were two issues:

  1. The collection software was very RAM intensive and couldn’t be run on a headless server.
  2. The delay to simulate real users still wasn’t realistic with a 24-hour cap.

To solve this, I rewrote the data collection script to use tmux for wallet deployment, along with a few other optimizations, fixing the former issue. This improvement drastically reduced resource usage, enabling almost twice as many wallets to be deployed with the same amount of computing resources. After reviewing all the experiment results, it was clear that bounding the delays between subsequent transactions to a 24-hour "cap" could not simulate realistic behavior. The final decision was to use the standard gamma distribution proposed by Möser et al., where the delay selection would be unrestricted but would simply be resampled if the chosen delay surpassed the hardcoded data collection stop time of July 1st. The image below shows my GitHub commits and lines changed in the data collection repository between April 1st and April 30th.

After the necessary changes were made, the script was deployed across six servers, totaling 7400 stagenet wallets simulating real user transactions simultaneously. A hard deadline of July 1st was chosen to stop the current collection due to the Monero v15 hard-fork. This network update is scheduled for July 16th, 2022 and is expected to be active on stagenet a week prior. One part of this research proposal was to collect a dataset of transactions after the v15 hard-fork to compare how increasing the ring size affects machine learning classifier accuracies. Unfortunately, the timing of the hard-fork is towards the tail end of my MAGIC grant. It is currently unclear if it will be possible to collect a completely new dataset, process it, and conduct a thorough analysis of the results in under a month.

Next, I implemented two rudimentary machine learning models and trained them on a preliminary dataset. I want to clarify that the preliminary dataset has many flaws, and most of the transactions were collected with an (unrealistic) 20-minute uniform delay. That being said, after training on an undersampled version of the dataset (removing the possibility of a guess-newest heuristic), I observed accuracies upwards of 3x random guessing when classifying the true spend of an arbitrary ring with no outside information.
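For context, undersampling here means trimming every true-ring-position class down to the size of the rarest one. A minimal pandas sketch of the idea (the label column name is hypothetical):

```python
import pandas as pd

def undersample(df: pd.DataFrame,
                label_col: str = "True_Ring_Position") -> pd.DataFrame:
    """Downsample every class to the size of the rarest one so that no
    ring position is over-represented in training."""
    n = df[label_col].value_counts().min()
    return (df.groupby(label_col, group_keys=False)
              .apply(lambda g: g.sample(n, random_state=0))
              .reset_index(drop=True))
```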

Left: Random Forest and Gradient Boosted Classifier accuracies on the preliminary dataset. Right: feature importance for the Gradient Boosted Classifier.

A random forest and a gradient-boosted classifier were fit to the dataset and, after hyperparameter tuning, achieved out-of-sample testing accuracies of 23.41% and 27.40%, respectively. The weighted F1 scores were calculated to represent the weighted average, across classes, of the harmonic mean of the model's precision and recall. Notably, the gradient-boosted classifier outperformed the less sophisticated random forest across the board, as shown in the left-most table. The gradient-boosted classifier generalized the true spend at each ring position better than the random forest but was more susceptible to overfitting.

The features and associated importances extracted from the gradient-boosted classifier are shown in the right-most table. Notably, the time delta from the newest ring member to the block provides the most insight, 13.07%, to the model. This correlates well with the 69% accuracy of a zeroth-class prediction and could indicate that true spends in this position are more distinguishable than those in subsequent positions. Additionally, the third most important feature was the time delta between the last two members of the ring signature, which supports the data indicating that the model predicts the last ring member with a precision over five times random guessing.
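To make the workflow concrete, here is a self-contained sklearn sketch of fitting a gradient-boosted classifier and reading out the weighted F1 score and feature importances; the random data is only a stand-in for the real engineered feature matrix:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real engineered feature matrix
rng = np.random.default_rng(0)
X = rng.normal(size=(1_000, 8))       # 8 placeholder features
y = rng.integers(0, 11, size=1_000)   # 11 possible true-spend positions

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)
preds = clf.predict(X_test)

print("weighted F1:", f1_score(y_test, preds, average="weighted"))
print("feature importances:", clf.feature_importances_)
```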

Left: Random Forest confusion matrix. Right: Gradient Boosted Classifier confusion matrix.

A multi-class confusion matrix for each model was calculated and displayed as a heatmap, shown in the two images above. Each box of the matrix indicates the precision of the model. Across the true-positive diagonal in the left-most image, there appear to be three points where the correct predictions were concentrated: classes zero, four, and ten were predicted with the highest accuracies. Notably, the confidence of a class-zero prediction almost reached 80%, while the last class achieved 56% accuracy. This indicates that the model could use the temporal information more effectively in cases where the true spend was at either end of the ring. The gradient-boosted classifier in the right-most image displays a more distinct true-positive diagonal, indicating that it was better suited to predict all classes compared to the random forest. However, the zeroth class was predicted with 69% precision, a reduction of 10% compared to the random forest. Notably, the gradient-boosted classifier predicts each class with a precision greater than the 9% expected from random guessing.
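The heatmaps themselves can be produced along these lines, continuing from the sketch above and column-normalizing so the diagonal reads as per-class precision:

```python
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import confusion_matrix

# Column-normalize so each diagonal cell reads as that class's precision
cm = confusion_matrix(y_test, preds, normalize="pred")
sns.heatmap(cm, annot=True, fmt=".2f", cmap="Blues")
plt.xlabel("Predicted ring position")
plt.ylabel("True ring position")
plt.show()
```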

Next, I made progress in researching which deep learning model would be best for training on the dataset. A standard multilayer perceptron (MLP) intuitively seems like an appropriate first step as other models, such as a 1D convolutional neural network, would emphasize the positioning of the features within the input vector too much. Lastly, I have continued to work on the whitepaper and made large strides in the background, related works, and methodology sections.

Looking towards the future

Now that the data collection has started and is mostly hands-off, I need to emphasize the importance of feature-engineering within data science. This is the process of identifying information within your dataset and representing it in the most useful way to a machine/deep learning model. This is a very time-consuming process to get right but is necessary to achieve maximum accuracy. Isthmus and I have been working to find time to brainstorm this process. I expect that the majority of May and June will consist of training different AI models and feature engineering.

Revised Goals for May

  • Train and evaluate a deep learning model
  • Monitor the servers for bugs/crashes
  • Continue updating whitepaper
  • Brainstorming feature engineering with @isthmus

@ACK-J
Author

ACK-J commented Jun 2, 2022

May Update

This is the third update delivered for my MAGIC grant, detailing the work I have completed from May 1st to May 31st.

Non-Technical tl;dr

Over the last month, I finished designing and constructing the deep neural network; after training on the preliminary dataset, it predicted the true spend of a ring signature with 23% accuracy. Additionally, many new features have been introduced to the dataset, hopefully increasing future model performance. With the new features implemented, I started reprocessing the preliminary dataset along with a small Mainnet dataset, in order to measure how well Testnet/Stagenet accuracies carry over to real-world Mainnet transactions.

Technical Updates

Goals for May

  • Train and evaluate a deep learning model
  • Monitor the servers for bugs/crashes
  • Continue updating whitepaper
  • Brainstorming feature engineering with Isthmus

This month started with designing a neural network to train on the preliminary dataset and assess ring signature resiliency to deep learning attacks. After much deliberation, a multilayer perceptron was chosen, with seven dense layers, ReLU activations, three dropout layers of varying degrees, and a final softmax layer. The network's last dense layer has eleven neurons, one for each possible position of the true spend in a ring signature. Next, I trained the neural network on the preliminary Testnet dataset and hyperparameter-tuned it to achieve the best out-of-sample accuracy with the least overfitting possible. The network achieved 23% accuracy, on par with, but slightly below, the other machine learning models previously tested.
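A Keras sketch of an architecture matching this description is shown below; the layer widths and dropout rates are my own assumptions, since the update only specifies the layer types and the eleven-way softmax output:

```python
import tensorflow as tf
from tensorflow.keras import layers, models

NUM_FEATURES = 32  # assumption: actual width depends on the engineered feature set
RING_SIZE = 11     # one output neuron per possible true-spend position

model = models.Sequential([
    layers.Input(shape=(NUM_FEATURES,)),
    layers.Dense(512, activation="relu"),
    layers.Dropout(0.4),
    layers.Dense(256, activation="relu"),
    layers.Dense(128, activation="relu"),
    layers.Dropout(0.3),
    layers.Dense(64, activation="relu"),
    layers.Dense(32, activation="relu"),
    layers.Dropout(0.2),
    layers.Dense(16, activation="relu"),
    layers.Dense(RING_SIZE, activation="softmax"),  # predicted true-spend position
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```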

Left: neural network construction and parameters. Right: visualization of the network's layers.

I met with Isthmus to discuss opportunities to improve feature engineering and the data collection process. The preliminary testnet dataset mainly comprises temporal features, such as time deltas between ring signature members. Each ring signature input is now also traced back to the transaction where it originated, and the numbers of inputs and outputs, block deltas, and time deltas are collected.

New Features

  • Previous_Tx_Num_Outputs
  • Previous_Tx_Num_Inputs
  • Previous_Tx_Time_Deltas
  • Previous_Tx_Block_Num_Delta
  • Previous_Tx_TxExtra_Len
  • Tx_Extra_Length

I worked extensively on an additional set of features that proved less trivial. The idea is to count the number of times an output has been used on-chain from its creation until the time of the investigated ring signature. For example, if someone sends my wallet 1 XMR on January 1st and I investigate a February 1st transaction that uses my output within a ring, this feature would look at every transaction between January 1st and February 1st to see how many other transactions included my output as a potential decoy. Additionally, the time deltas between these occurrences would be calculated and saved as a feature. This clearly creates a rather large computational problem, with an outlandish time complexity of O(N^5). While users have no control over when or whether their transaction gets included as a decoy, it will be interesting to see if certain scenarios hint towards the true spend.

Work in Progress Features

  • Previous_Tx_Decoy_Occurrences
  • Previous_Tx_Decoy_Times

The proof-of-concept code I wrote to implement these features took hours to process a single transaction, whereas processing originally took only seconds, maybe minutes. To address this, I implemented a custom caching solution that stores transactions and blocks in a hashmap that can be queried in O(1) time. Initial results, shown below, indicated that the cache was two to three orders of magnitude faster than querying the blockchain, which was promising. However, while testing against the full preliminary dataset, I discovered that the cache was not being hit often enough, and the overhead of passing shared memory between independent processes did not scale, ultimately making the processing time even worse.
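The caching idea can be illustrated with a small sketch; the daemon URL is an assumption (get_block is a standard monerod JSON-RPC method). Note that functools.lru_cache, like the hashmap described above, lives inside a single process, which is exactly why the approach failed to pay off under multiprocessing:

```python
from functools import lru_cache
import requests

DAEMON_RPC = "http://127.0.0.1:38081/json_rpc"  # assumed local stagenet daemon

@lru_cache(maxsize=None)  # hashmap cache: repeat lookups return in O(1)
def get_block(height: int) -> dict:
    """Fetch a block by height, hitting the daemon only on a cache miss."""
    resp = requests.post(DAEMON_RPC, json={
        "jsonrpc": "2.0", "id": "0",
        "method": "get_block", "params": {"height": height},
    })
    resp.raise_for_status()
    return resp.json()["result"]
```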

Currently, the code is commented out, as a plausible solution has not yet been reached. However, Neptune has a project HERE that pre-processes all the ring relationships into a Postgres database. I am working with them to stand up a Stagenet instance to use in my research, which would allow me to truly query the information needed in O(1) time.

Many small improvements regarding efficiency have been made throughout the code base, increasing reliability. I monitored the seven servers running almost ten thousand Monero wallets, collecting the final Stagenet dataset. A few of these machines went offline at different times due to power outages, crashes, or other issues. I also performed a sanity check to ensure that the labels matched up to the correct data points after all the processing was done.

Next, I reprocessed the preliminary dataset of Testnet transactions to include the newly engineered features and to test all of the efficiency improvements made over the last few months. The dataset contains over 760,000 transactions comprising over 1.3 million ring signatures. Thanks to @Gingeropolous's Monero research server, I maxed out 64 processing cores and finished the data processing in under two days!


Lastly, I have been thinking about how to quantify the quality of the dataset I've produced and have concluded that creating a small, private validation dataset of real Mainnet transactions would be the best way. I have put a lot of time and effort into simulating real-user behavior, and what better way to test if it works than by comparing it against ~50 real Monero transactions I've made on Mainnet. This validation set is quite small and would not statistically represent the population. Still, I believe it would be a good indicator to show if what the machine/deep learning algorithms have learned applies to real transactions or not.

Looking towards the future

Many features have been added, and the preliminary dataset is almost finished being processed. Next month, I need to ensure that the code is ready to go as soon as the Stagenet dataset is finished collecting on July 1st and remove all inefficiencies or bottlenecks.

Revised Goals for June

  • Finish recollecting preliminary testnet dataset with the new features
  • Retrain all preliminary models against the new dataset
  • Collect a small mainnet dataset for validation
  • Work with Neptune to ready the Postgres database

@ACK-J
Author

ACK-J commented Jul 2, 2022

June Update

This is the fourth update delivered for my MAGIC grant, detailing the work I have completed from June 1st to June 30th.

Non-Technical tl;dr

The stagenet dataset, collected over the last two months, was completed on July 1st and is ready to be aggregated and processed. The previous month consisted mostly of bug fixes and code optimizations, which left the code base fully prepped to process the newly collected dataset. I also finished collecting a small mainnet validation dataset and used the pre-trained machine learning models to predict on it, assessing real-world efficacy. While the model achieved 34% accuracy predicting the true spend of a ring signature on the preliminary testnet dataset, this did not translate to the mainnet dataset, where accuracy was only 11%. This flaw is detailed thoroughly in my technical analysis and is ultimately due to the highly flawed collection methods used for the preliminary dataset, which have been fixed in the newly collected dataset. In the next month, I will finish processing the new dataset, train the machine/deep learning models on it, and write a thorough analysis of the results.

Technical Updates

Goals for June

  • Finish recollecting the preliminary testnet dataset with the new features
  • Retrain all preliminary models against the new dataset
  • Collect a small mainnet dataset for validation
  • Work with Neptune to ready the Postgres database

Bug Fixes

This month was unexpectedly one of my most coding-heavy for this project, due to the amount of debugging needed to finish recollecting the February testnet dataset with the newly added features. The first issue was a memory leak causing the server to max out all 256 GB of RAM. I rewrote the offending function in a much more memory-efficient way, which fixed the problem and required only 150 GB of RAM. The second issue was locating a few different bottlenecks within the codebase, caused by inefficient use of pandas DataFrames. To discover these bottlenecks, I ran my code through a line profiling tool and was able to correct the errors. Many more code optimizations were made, including eliminating the need for shared memory, changing DataFrames to Series objects, and partitioning the dataset. The impact of these optimizations was a slight decay in my mental sanity but an increase in overall performance from 1 iteration every 7 seconds to 200 iterations per second during the undersampling process. This reduced the processing time from weeks to hours.
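As an illustration of the kind of pandas fix involved (not the actual project code), replacing per-row DataFrame access with vectorized column arithmetic is often worth orders of magnitude:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": np.random.rand(1_000_000),
                   "b": np.random.rand(1_000_000)})

# Slow: .iterrows() materializes a Series object for every single row
slow_total = sum(row["a"] * row["b"] for _, row in df.iterrows())

# Fast: vectorized column arithmetic stays inside NumPy
fast_total = (df["a"] * df["b"]).sum()
```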

Below is an image of all the processing needed to be done on the data before it is in a state suitable for machine learning.

Next, I faced a race condition caused by the multiprocessing, which would sometimes add extra samples to the undersampled dataset. This bug alone took me a week to trace to its origin; after multiple failed attempts, the best option was to simply disable multiprocessing. Using a single process was only feasible thanks to the many efficiency improvements previously implemented. Next, due to the amount of processing done on the data from start to finish, I wanted to implement a "sanity check" to ensure that the labels correctly correspond to their records. This sanity check led me to discover two small bugs in the code that incorrectly added some ring signatures to the dataset. Once all the aforementioned bugs were fixed, the preliminary dataset passed the sanity check with 100% integrity.

Mainnet Dataset

Next, I moved on to collecting the mainnet dataset of the ~50 transactions previously mentioned in the May update. This process surprisingly did not have any errors, as shown below, and the mainnet validation dataset consisted of 76 total ring signatures. @Rucknium performed some power calculations on the validation set and determined that it would be statistically significant.

I refactored the machine learning code to make it easier to read and more modular. Below are the results from using the model trained on the preliminary testnet dataset to predict on the mainnet dataset. The gradient-boosted classifier achieved 34% accuracy on the out-of-sample data and 22% on the mainnet validation set. At first, I was ecstatic with those results; however, overall accuracies don't tell the whole story.

Dataset of 155881 samples loaded from disk.
Dataset split into training and testing.
Dataset of 76 samples loaded from disk.

In Sample Accuracy: 0.38180812163202466
Test Accuracy: 0.34246399589440935
Mainnet Validation Accuracy: 0.2236842105263158

Accuracy: 0.34

Micro Precision: 0.34
Micro Recall: 0.34
Micro F1-score: 0.34

Macro Precision: 0.35
Macro Recall: 0.34
Macro F1-score: 0.34

Weighted Precision: 0.35
Weighted Recall: 0.34
Weighted F1-score: 0.34

Classification Report

              precision    recall  f1-score   support

           1       0.39      0.68      0.50      2835
           2       0.30      0.37      0.33      2850
           3       0.28      0.28      0.28      2811
           4       0.25      0.24      0.24      2812
           5       0.27      0.24      0.25      2881
           6       0.25      0.24      0.25      2752
           7       0.27      0.24      0.25      2862
           8       0.29      0.27      0.28      2862
           9       0.35      0.29      0.32      2801
          10       0.39      0.32      0.35      2821
          11       0.79      0.59      0.67      2890

    accuracy                           0.34     31177
   macro avg       0.35      0.34      0.34     31177
weighted avg       0.35      0.34      0.34     31177

The leftmost heatmap confusion matrix below uses the preliminary testnet dataset, which gives an easy way to visually represent misclassifications made by the model. The diagonal represents correct predictions, while anything off the diagonal indicates a misclassification. The rightmost heatmap confusion matrix below results from using the previously trained model to predict on mainnet samples. Compared to the testnet results, there is no defined diagonal, and the results are concentrated around the newest ring member.

Left: confusion matrix from the preliminary testnet training data. Right: confusion matrix of mainnet predictions.

What do these results mean?

These results confirm @Rucknium's prediction that the preliminary machine/deep learning models were exploiting a blatant difference between the user spending distribution and the decoy selection distribution proposed by Möser et al. This data affirms the criticality of a carefully procured decoy selection algorithm such as OSPEAD, as a discrepancy could lead to an adversary gaining 20%-30% prediction accuracy instead of the assumed 9%, all without external information.

The metrics below give a weighted F1 score (the harmonic mean of the model's recall and precision, weighted by class support) of 11%. My theory is that the model did not merely learn a guess-newest heuristic, because the training data was undersampled so that no single class occurred more often than the others. Instead, I believe this results from the preliminary dataset's highly flawed collection method, with the unrealistic transaction pattern of samples spaced only about 20 minutes apart. When the model was faced with real mainnet ring signatures spanning days, weeks, or months, it was unable to transfer the patterns it had learned to the incoming samples.

Accuracy: 0.22

Micro Precision: 0.22
Micro Recall: 0.22
Micro F1-score: 0.22

Macro Precision: 0.04
Macro Recall: 0.10
Macro F1-score: 0.05

Weighted Precision: 0.08
Weighted Recall: 0.22
Weighted F1-score: 0.11

Classification Report

              precision    recall  f1-score   support

           1       0.00      0.00      0.00         1
           2       0.00      0.00      0.00         5
           3       0.00      0.00      0.00         4
           4       0.00      0.00      0.00         6
           5       0.00      0.00      0.00         8
           6       0.00      0.00      0.00         5
           7       0.00      0.00      0.00         4
           8       0.00      0.00      0.00        11
           9       0.00      0.00      0.00         7
          10       0.20      0.12      0.15         8
          11       0.27      0.94      0.42        17

    accuracy                           0.22        76
   macro avg       0.04      0.10      0.05        76
weighted avg       0.08      0.22      0.11        76

To further support my theory: the large, short-lived uptick in testnet transactions per day, shown below, resulted from the data collection script failing to sleep between subsequent transactions. For the stagenet dataset, each transaction instead samples a number of seconds to sleep from the gamma distribution. This is shown in the right-most image below, with a more gradual 2,500 transactions per day instead of 60,000. However, there are still sudden spikes within the stagenet dataset due to intermittently restarting the data collection script after a power outage. While the stagenet dataset is not perfect, I believe it represents a much more realistic sample of transactions and thus should perform drastically better than the preliminary dataset.

Left: testnet transactions per day. Right: stagenet transactions per day.

Feature Engineering

@neptuneresearch has been a huge help in setting up the ring membership PostgreSQL database. This will allow me to incorporate the two "work in progress" features previously engineered. My naive implementation had a time complexity of O(N^5), while Neptune's SQL query will be practically O(1). This will reduce the lookup time per transaction from days down to milliseconds.

WITH txinfo AS (
    -- Pull every ring member referenced by the investigated transaction,
    -- along with the height at which each member (output) was created.
    SELECT 
        X.tx_block_height AS height_B,           -- height of the investigated tx
        X.tx_vin_index AS input_pos,             -- which input of the tx
        X.ringmember_block_height AS height_A,   -- height where the output was created
        X.ringmember_txo_key AS stealth_addr_A,  -- the output's stealth address
        X.tx_vin_ringmember_index AS input_mem_idx
    FROM tx_ringmember_list X
    WHERE X.tx_hash = %s
)
-- Find every ring that reused the same output between its creation
-- (height_A) and the investigated transaction (height_B).
SELECT DISTINCT  
    height_A,
    Y.ringtx_block_height AS height_B,
    input_pos,
    input_mem_idx
FROM ringmember_tx_list Y
JOIN txinfo T ON T.stealth_addr_A = Y.txo_key
WHERE Y.ringtx_block_height >= height_A AND Y.ringtx_block_height <= height_B
ORDER BY input_pos, input_mem_idx, height_B ASC

Miscellaneous

Originally, I exported the dataset by serializing the DataFrame with pickle. However, the resulting object was 40 GB and needed to be loaded into RAM before it could be used; this would not be practical for many data scientists' workstations and might deter researchers from conducting their own analysis of the dataset. I wrote a custom function to export the original data as a CSV and used pandas' native functionality to export the undersampled DataFrame as a CSV. This format allows the data to be read line-by-line with no RAM constraints.
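With the CSV export, downstream consumers can stream the data instead of loading it whole; a minimal sketch (filename and chunk size assumed):

```python
import pandas as pd

# Stream the exported dataset in fixed-size chunks so the full 40+ GB
# never has to sit in RAM at once.
for chunk in pd.read_csv("stagenet_dataset.csv", chunksize=100_000):
    print(chunk.shape)  # placeholder for any per-chunk analysis
```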

Looking towards the future

July is the home stretch for this research project! Now that the dataset is finished being collected, it needs to be aggregated and analyzed using the different machine/deep learning algorithms I've implemented. Each model will predict on the validation mainnet dataset to determine the true accuracy, and I will write up a thorough analysis of the results.

Revised Goals for July

  • Aggregate the wallets off of the ~10 servers into a central location
  • Process the new stagenet dataset
  • Train the models on the dataset
  • Analyze the results
  • Add results to the whitepaper

@ACK-J ACK-J changed the title [Grant] Security Audit of Ring Signature Resiliency Against Machine/Deep Learning Attacks [Grant] An Empirical Analysis of Monero's Ring Signature Resilience to Artificially Intelligent Attacks Aug 13, 2022
@mj-xmr

mj-xmr commented Aug 23, 2022

Fantastic work. Thanks!

@ACK-J
Author

ACK-J commented Aug 31, 2022

July-August Update

This is the fifth and final update delivered for my MAGIC grant, detailing the work I have completed from July 1st to August 31st. MAGIC permitted a one-month extension so I would have enough time to edit and revise my final report. I want to thank the MAGIC committee for their support over the last six months, helping me complete my research!

Non-Technical tl;dr

This security audit of Monero's ring signature resilience to AI attacks proposes three different models, the best of which achieved 13.3% accuracy predicting the true spend of mainnet transactions. This revealed that blockchain metadata can provide a marginal improvement of 4.3% over random guessing. To hopefully inspire future work, the code to collect, process, and train ML/DL models, as well as the datasets used, is freely available on the GitHub repo. The final report, containing all technical details, is also published on the GitHub repo and can be accessed HERE.

Technical Updates

Goals for July-August

  • Aggregate the wallets off of the 7 servers into a central location
  • Process the new stagenet dataset
  • Train the models on the dataset
  • Analyze the results
  • Add results to the whitepaper

I started this research project in December 2021, and now, 9 months and almost 7,000 lines of code later, it has come to fruition.

Stagenet Dataset Collection and Processing

To start July, I aggregated the 165 GB of Monero wallet files from the 7 servers used for data collection and made a few redundant backups to ensure the data wouldn't be lost. Also, with the help of @neptuneresearch, I added two new engineered features to the Stagenet dataset:

  • On Chain Decoy Block Deltas
  • Number of On Chain Decoys

These features are best explained by the following graphic, where the occurrences of an output are aggregated between the output's origin and the investigated use of the output. These features could help correlate anomalous inclusions of the output in ring signatures, especially if a drastic drop in the number of transactions processed by the network occurs right after the transaction was sent.

[Figure: on-chain occurrences of an output, aggregated between its origin and its investigated use]

Next, the 9,342 Monero wallet files entered the second stage of the data collection pipeline, where the transactional metadata was exported into CSV files. These CSV files were used by the third stage of the pipeline to construct the final dataset.

Stagenet Dataset Stats:

  • 184,980 total transactions
  • 248,723 ring signatures
  • Undersampling of 14,750 ring signatures used per class
  • 162,250 total undersampled ring signatures

Model Training and Paper

Once the stagenet and mainnet datasets were processed, I hyperparameter-tuned each model using an exhaustive search over specified parameter values, averaging the scores from 5-fold cross-validation.
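In sklearn terms, that procedure corresponds to a GridSearchCV over the classifier. A sketch follows, with a hypothetical parameter grid (the update does not list the exact values searched) and synthetic stand-in data:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the processed stagenet features/labels
rng = np.random.default_rng(0)
X, y = rng.normal(size=(500, 8)), rng.integers(0, 11, size=500)

# Hypothetical grid; the actual parameters searched are not listed above.
param_grid = {
    "n_estimators": [100, 300],
    "learning_rate": [0.01, 0.1],
    "max_depth": [3, 5],
}
search = GridSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_grid,
    cv=5,        # score = average over 5-fold cross-validation
    n_jobs=-1,   # run the exhaustive search in parallel
)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```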

Wallet2.cpp Bug Discovered

@Rucknium brought to my attention a GitHub issue identifying an RPC limitation triggered when the mempool exceeds 100 transactions. The identification of the bug lines up exactly with when I accidentally caused a flood of transactions on the staging network (June 11th) by restarting the wallets in my data collection after a power outage. This bug fix was merged and is now included in the hard fork v0.18.0.0 Fluorine Fermi.
