[Grant] An Empirical Analysis of Monero's Ring Signature Resilience to Artificially Intelligent Attacks #15
The MAGIC Monero Fund committee has voted to support this research project for 12,000 USD equivalent. Good luck!
March Update

This is the first of five updates I will deliver, covering the work I have completed in the last month on my MAGIC grant. To briefly reiterate, the goals of this research are twofold: first, to collect a deanonymized dataset of Monero stagenet transactions that is complete, incorporating all information at my disposal; and second, to engineer features and train different machine learning classifiers and deep neural networks to identify heuristics that leak information about the true spend of an arbitrary ring signature.

Non-Technical tl;dr

The pipeline to collect a Monero dataset, filter and clean the transactions, and train certain machine learning models on the results is complete. I've spent a lot of time experimenting to identify real user spending habits on the Monero mainnet to incorporate into the automation. These experiments ensure that once the actual data collection process begins, the results will mimic real users as closely as possible and produce the best results. The start date for the large-scale data collection has been pushed back from the beginning of March to April 8th.

Technical Updates

Goals to complete by end of March
To start, at the beginning of March, I worked with the ratio between the fee paid and the transaction size (kB), which is plotted below in a histogram where the Y-axis is the log probability. More information about the experiment can be found here.

I learned quickly from reviewing the results of a few transactions that my initial approach of having the automated wallets send payments as fast as possible created a scenario where the vast majority of true spends were in the most recent ring position. A dataset with such a large right skew would need to be undersampled, drastically reducing its size.

The distribution used to represent Monero user spending patterns (in seconds) is shown below in the left-most figure; more technically, it is a gamma distribution (a minimal sketch of sampling such delays appears at the end of this section). Due to the inconclusive experiments regarding the correct transaction delay distribution, I have not been able to start the data collection process for the stagenet dataset. While this is unfortunate, I need to ensure the delay distribution adequately represents real users and produces ring signatures where the true spend is distributed roughly evenly across all 11 positions. Taking this delay now ultimately prevented having to recollect the data four months down the line.

A substantial update is the completion of the data collection pipeline, written as three scripts that collect, combine, filter, and enrich the information exported from each wallet. The images below detail the entire current workflow of the data collection process.

Notably, since the initial proposal, the feature set has improved considerably, resulting in the following features collected for every transaction. This list is intentionally exhaustive and contains some data not relevant to machine or deep learning applications; instead, it provides context that makes the dataset useful for multiple purposes such as education or statistical analysis.
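As a rough illustration of the delay-sampling idea described above, here is a minimal sketch of pacing automated payments with gamma-distributed waits. The shape and scale values are placeholders, not the parameters actually fitted for this project:

```python
import time
import numpy as np

rng = np.random.default_rng()

# Illustrative parameters only -- the shape/scale actually fitted for this
# project are not reproduced here (see the referenced figure).
GAMMA_SHAPE = 2.0
GAMMA_SCALE = 900.0  # seconds

def wait_like_a_real_user() -> float:
    """Sleep for a gamma-distributed number of seconds before the next
    automated wallet-to-wallet payment, instead of sending as fast as
    possible (which concentrates true spends in the newest ring slot)."""
    delay = float(rng.gamma(GAMMA_SHAPE, GAMMA_SCALE))
    time.sleep(delay)
    return delay
```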
After MAGIC accepted my research proposal, I was given access to a compute cluster dedicated to Monero research.

Lastly, I have worked on additions to the formal white paper and created graphics to help condense complicated information. One of the graphics I am most proud of is shown below; it accentuates how ring signature decoys are chosen in a non-uniform fashion from the vast set of previous outputs seen on-chain.

Revised Goals for April
April Update

This is the second update delivered for my MAGIC grant, detailing the work I have completed from April 1st to April 30th.

Non-Technical tl;dr

Major efficiency improvements have been made to the data collection pipeline, allowing more wallets to run simultaneously on the same amount of hardware. The data collection process has officially started across six servers with a total of 7,400 stagenet wallets and is expected to finish on July 1st. I have trained two machine learning models on a preliminary dataset, which is highly biased, and achieved accuracies upwards of three times random guessing when predicting the true member of a ring signature. It is currently unclear whether these results will be reaffirmed or disputed once the new, realistic dataset has been collected and tested. Other progress has been made on model research and the whitepaper, and future efforts are shifting towards feature engineering.

Technical Updates

Goals for April
April started with a heavy focus on finalizing a method for simulating real user behavior during the automated collection process. The target was to deploy the wallets on April 8th, but there were two issues:
To solve this, I rewrote the data collection script. After the necessary changes were made, the script was deployed across the six servers.

Next, I implemented two rudimentary machine learning models and trained them on a preliminary dataset. I want to clarify that the preliminary dataset has many flaws, and most of the transactions were collected with an (unrealistic) 20-minute uniform delay. That being said, after training on an undersampled version of the dataset, which removes the possibility of a guess-newest heuristic, I observed accuracies upwards of three times random guessing.
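For reference, a minimal sketch of the kind of per-class undersampling described above, assuming a pandas DataFrame with a hypothetical `true_spend_position` label column (the real dataset's schema may differ):

```python
import pandas as pd

def undersample_by_ring_position(df: pd.DataFrame,
                                 label_col: str = "true_spend_position",
                                 seed: int = 0) -> pd.DataFrame:
    """Downsample every class to the size of the rarest class so that no
    ring position (e.g. the newest member) dominates the training data."""
    smallest = df[label_col].value_counts().min()
    return (df.groupby(label_col, group_keys=False)
              .apply(lambda g: g.sample(n=smallest, random_state=seed))
              .reset_index(drop=True))
```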
A random forest and a gradient-boosted classifier were fit to the dataset and, after hyperparameter tuning, achieved out-of-sample testing accuracies well above random guessing. The features and their associated importances extracted from the gradient-boosted classifier are shown in the right-most table. Notably, time deltas from the newest ring member to the block provide the highest insight.
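A minimal sketch of this kind of experiment with scikit-learn; the synthetic data and feature names below are stand-ins, not the project's real dataset or tuned hyperparameters:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 11 classes, one per possible true-spend position.
X, y = make_classification(n_samples=5000, n_features=20, n_informative=10,
                           n_classes=11, random_state=0)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

rf = RandomForestClassifier(n_estimators=300, n_jobs=-1, random_state=0)
gbc = GradientBoostingClassifier(random_state=0)

for name, model in [("random forest", rf), ("gradient boosting", gbc)]:
    model.fit(X_train, y_train)
    print(f"{name}: out-of-sample accuracy = {model.score(X_test, y_test):.3f}")

# Rank features by the importance the gradient-boosted model assigns them
# (in the real dataset, newest-member time deltas ranked highest).
for feat, imp in sorted(zip(feature_names, gbc.feature_importances_),
                        key=lambda t: t[1], reverse=True)[:10]:
    print(f"{feat:12s} {imp:.4f}")
```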
A multi-class confusion matrix for each model was calculated and displayed as a heatmap, shown in the two images above. Each box of the matrix indicates the precision of the model. Across the true-positive diagonal in the left-most image, there appear to be three points where the correct predictions were concentrated: classes zero, four, and ten were predicted with the highest accuracies. Notably, the confidence of a class-zero prediction was especially high.

Next, I made progress in researching which deep learning model would be best for training on the dataset. A standard multilayer perceptron (MLP) intuitively seems like an appropriate first step, as other models, such as a 1D convolutional neural network, would place too much emphasis on the positioning of the features within the input vector.

Lastly, I have continued to work on the whitepaper and made large strides in the background, related works, and methodology sections.

Looking towards the future

Now that the data collection has started and is mostly hands-off, I need to emphasize the importance of feature engineering within data science. This is the process of identifying information within your dataset and representing it in the most useful way to a machine/deep learning model. It is a very time-consuming process to get right but is necessary to achieve maximum accuracy. Isthmus and I have been working to find time to brainstorm this process. I expect that the majority of May and June will consist of training different AI models and feature engineering.

Revised Goals for May
May Update

This is the third update delivered for my MAGIC grant, detailing the work I have completed from May 1st to May 31st.

Non-Technical tl;dr

Over the last month, I finished designing and constructing the deep neural network; after training on the preliminary dataset, it predicted the true spend of a ring signature with 23% accuracy. Additionally, many new features have been introduced to the dataset, which will hopefully increase future model performance. With the new features implemented, I started reprocessing the preliminary dataset along with a small mainnet dataset. This was done to acquire metrics on how closely testnet/stagenet accuracies carry over to real-world mainnet transactions.

Technical Updates

Goals for May
This month started with designing a neural network to train on the preliminary dataset and assess ring signature resiliency to deep learning attacks. After much deliberation, a multilayer perceptron was chosen with seven dense layers, ReLU activations, three dropout layers of varying degrees, and a final softmax layer. The network's last dense layer includes eleven neurons to represent the possible positions of the true spend in a ring signature.

Next, I trained the neural network on the preliminary testnet dataset and hyperparameter-tuned it to achieve the best out-of-sample accuracy with the least overfitting possible. The network achieved roughly 23% out-of-sample accuracy.
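A minimal Keras sketch of an architecture matching that description; the layer widths and dropout rates are my own illustrative guesses rather than the tuned values:

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_mlp(n_features: int) -> tf.keras.Model:
    """MLP with seven dense layers, ReLU activations, three dropout layers,
    and an 11-way softmax output (one neuron per true-spend position)."""
    model = models.Sequential([
        layers.Input(shape=(n_features,)),
        layers.Dense(512, activation="relu"),
        layers.Dropout(0.4),
        layers.Dense(256, activation="relu"),
        layers.Dense(256, activation="relu"),
        layers.Dropout(0.3),
        layers.Dense(128, activation="relu"),
        layers.Dense(64, activation="relu"),
        layers.Dropout(0.2),
        layers.Dense(32, activation="relu"),
        layers.Dense(11, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```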
I met with Isthmus to discuss opportunities to improve feature engineering and the data collection process. The preliminary testnet dataset I used mainly comprises temporal features, such as time deltas between ring signature members. Each ring signature input is now traced back to where it originated, and the number of inputs, outputs, and blocks, as well as the elapsed time, are collected.

New Features
I worked extensively on an additional set of features that proved to be less trivial. The idea is to count the number of times an output has been used on-chain from its creation until the time of the investigated ring signature. For example, if someone sends my wallet 1 XMR on January 1st and I investigate a February 1st transaction that uses my output within a ring, this feature would look at every transaction between January 1st and February 1st to see how many other transactions included that output as a potential decoy. Additionally, the time deltas between these occurrences are calculated and saved as a feature. This clearly creates a rather large computational problem, with an outlandish time complexity of roughly O(N^5) in my naive implementation.

Work in Progress Features
The proof-of-concept code I wrote to implement these features took hours to process a single transaction, whereas originally a transaction took only seconds, maybe minutes. To address this, I implemented a custom caching solution that stores transactions and blocks in a hashmap, which can be queried in constant time (a minimal sketch of this idea appears at the end of this update). Currently, the code is commented out, as a workable solution has not yet been reached. However, Neptune has a project HERE that pre-processes all the ring relationships into a Postgres database beforehand. I am working with them to stand up a Stagenet instance to use in my research, which would allow me to query the needed information in effectively constant time.

Many small improvements regarding efficiency have been made throughout the code base, increasing reliability. I monitored the seven servers running almost ten thousand Monero wallets, collecting the final Stagenet dataset. A few of these machines went offline at different times due to power outages, crashes, or other issues. I also performed a sanity check to ensure that the labels matched up to the correct data points after all the processing was done.

Next, I reprocessed the preliminary dataset of testnet transactions to include the newly engineered features and to test all of the efficiency improvements made over the last few months. The dataset contained over 760,000 transactions comprising over 1.3 million ring signatures. Thanks to @Gingeropolous's Monero research server, I was able to max out 64 processing cores and finish the data processing in under two days!

Lastly, I have been thinking about how to quantify the quality of the dataset I've produced and have concluded that creating a small, private validation dataset of real mainnet transactions would be the best way. I have put a lot of time and effort into simulating real-user behavior, and what better way to test whether it works than by comparing it against ~50 real Monero transactions I've made on mainnet. This validation set is quite small and does not statistically represent the population. Still, I believe it is a good indicator of whether what the machine/deep learning algorithms have learned applies to real transactions.

Looking towards the future

Many features have been added, and the preliminary dataset is almost finished being processed. Next month, I need to ensure that the code is ready to go as soon as the Stagenet dataset finishes collecting on July 1st, and remove all inefficiencies and bottlenecks.

Revised Goals for June
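Returning to the caching approach mentioned earlier in this update, here is a minimal sketch of the idea: an in-memory hashmap in front of the daemon lookups. The fetch callables are hypothetical stand-ins for the real RPC wrappers, not the project's actual interface:

```python
from typing import Any, Callable, Dict

class CachedRPC:
    """In-memory hashmap cache in front of slow daemon lookups. The
    fetch_tx / fetch_block callables are hypothetical stand-ins for the
    real monerod RPC wrappers used by the pipeline."""

    def __init__(self,
                 fetch_tx: Callable[[str], Any],
                 fetch_block: Callable[[int], Any]) -> None:
        self._tx_cache: Dict[str, Any] = {}
        self._block_cache: Dict[int, Any] = {}
        self._fetch_tx = fetch_tx
        self._fetch_block = fetch_block

    def transaction(self, tx_hash: str) -> Any:
        # O(1) dictionary hit after the first lookup of a given hash.
        if tx_hash not in self._tx_cache:
            self._tx_cache[tx_hash] = self._fetch_tx(tx_hash)
        return self._tx_cache[tx_hash]

    def block(self, height: int) -> Any:
        if height not in self._block_cache:
            self._block_cache[height] = self._fetch_block(height)
        return self._block_cache[height]
```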
June Update

This is the fourth update delivered for my MAGIC grant, detailing the work I have completed from June 1st to June 30th.

Non-Technical tl;dr

The stagenet dataset, collected over the last two months, was completed on July 1st and is ready to be aggregated and processed. This previous month consisted mostly of bug fixes and code optimizations, which left the code base fully prepared to process the newly collected dataset. I also finished collecting a small mainnet validation dataset and used the pre-trained machine learning models to predict on it, assessing real-world efficacy. While the preliminary testnet dataset achieved 34% accuracy in predicting the true spend of a ring signature, this did not translate to the mainnet dataset, where accuracy was only 11%. This flaw is detailed thoroughly in my technical analysis and is ultimately due to the highly flawed collection methods used for the preliminary dataset, which have been fixed in the newly collected dataset. In the next month, I will finish processing the new dataset, train the machine/deep learning models on it, and write a thorough analysis of the results.

Technical Updates

Goals for June
Bug Fixes

This month was unexpectedly one of my most coding-heavy for this project due to the amount of debugging needed to finish reprocessing the February testnet dataset with the newly added features. The first issue was a memory leak causing the server to max out all 256 GB of RAM. I rewrote the offending function in a much more memory-efficient way, which fixed the problem and required only 150 GB of RAM. The second issue was locating a few different bottlenecks within the codebase, caused by inefficient use of pandas dataframes. To discover these bottlenecks, I ran my code through a line analysis tool and was able to correct the errors. Many more code optimizations were made, including eliminating the need for shared memory, changing DataFrames to Series objects, partitioning the dataset, and so on. The impact of these optimizations was a slight decay in my mental sanity but an increase in overall performance from 1 iteration every 7 seconds to 200 iterations per second during the undersampling process. This reduced the processing time from weeks to hours. Below is an image of all the processing that needs to be done on the data before it is in a state suitable for machine learning.

Next, I faced a race condition caused by the multiprocessing, which would sometimes add extra samples to the undersampled dataset. This bug alone took a week to track down; after multiple failed attempts, the best option was to simply disable multiprocessing. Using a single process was only feasible because of the many efficiency improvements previously implemented. Next, due to the amount of processing done on the data from start to finish, I wanted to implement a "sanity check" to ensure that the labels correctly corresponded to their records. This sanity check led me to discover two small bugs in the code that incorrectly added some ring signatures to the dataset. Once all the aforementioned bugs were fixed, the preliminary dataset passed the sanity check with 100% integrity.

Mainnet Dataset

Next, I moved on to collecting the mainnet dataset of the ~50 transactions previously mentioned in the May update. This process surprisingly did not produce any errors, as shown below, and the mainnet validation dataset consists of 76 total ring signatures. @Rucknium performed some power calculations on the validation set and determined that it would be statistically significant. I also refactored the machine learning code to make it easier to read and more modular. Below are the results of using the model trained on the preliminary testnet dataset to predict on the mainnet dataset. The gradient-boosted classifier achieved 34% accuracy on the out-of-sample data and 22% on the mainnet validation set. At first, I was ecstatic with those results; however, overall accuracies don't tell the whole story.

```
Dataset of 155881 samples loaded from disk.
Dataset split into training and testing.
Dataset of 76 samples loaded from disk.
In Sample Accuracy: 0.38180812163202466
Test Accuracy: 0.34246399589440935
Mainnet Validation Accuracy: 0.2236842105263158
Accuracy: 0.34
Micro Precision: 0.34
Micro Recall: 0.34
Micro F1-score: 0.34
Macro Precision: 0.35
Macro Recall: 0.34
Macro F1-score: 0.34
Weighted Precision: 0.35
Weighted Recall: 0.34
Weighted F1-score: 0.34
Classification Report
             precision   recall  f1-score  support
1 0.39 0.68 0.50 2835
2 0.30 0.37 0.33 2850
3 0.28 0.28 0.28 2811
4 0.25 0.24 0.24 2812
5 0.27 0.24 0.25 2881
6 0.25 0.24 0.25 2752
7 0.27 0.24 0.25 2862
8 0.29 0.27 0.28 2862
9 0.35 0.29 0.32 2801
10 0.39 0.32 0.35 2821
11 0.79 0.59 0.67 2890
accuracy 0.34 31177
macro avg 0.35 0.34 0.34 31177
weighted avg 0.35 0.34 0.34 31177
```

The leftmost heatmap confusion matrix below, for the preliminary testnet dataset, gives an easy way to visually represent the misclassifications made by the model. The diagonal represents correct predictions, while anything off the diagonal indicates a misclassification. The rightmost heatmap confusion matrix below results from using the previously trained model to predict on mainnet samples. Compared to the testnet results, there is no defined line on the diagonal, and the predictions are more concentrated around the newest ring member.
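For reference, a minimal sketch of how such heatmaps can be produced (assuming scikit-learn, seaborn, and matplotlib; `y_true`/`y_pred` stand for the held-out labels and the model's predictions):

```python
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import confusion_matrix

def plot_confusion_heatmap(y_true, y_pred, title: str) -> None:
    """Row-normalised 11x11 confusion matrix: each row shows how the
    samples of one true ring position were distributed over predictions."""
    cm = confusion_matrix(y_true, y_pred, normalize="true")
    sns.heatmap(cm, annot=True, fmt=".2f", cmap="viridis",
                xticklabels=range(1, 12), yticklabels=range(1, 12))
    plt.xlabel("Predicted ring position")
    plt.ylabel("True ring position")
    plt.title(title)
    plt.show()
```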
What do these results mean?

These results confirm @Rucknium's prediction that the preliminary machine/deep learning models were exploiting a blatant difference between the user spending distribution and the decoy selection distribution proposed by Möser et al. This affirms the criticality of a carefully procured decoy selection algorithm such as OSPEAD, as a discrepancy could let an adversary gain 20%-30% prediction accuracy instead of the assumed 9%, all without external information.

The metrics below give a weighted F1-score (the harmonic mean of the model's precision and recall) of 11%. My theory is that the model did not simply learn a guess-newest heuristic; the training data was undersampled such that no single class occurred with higher probability in the testing dataset. Instead, I believe this results from the preliminary dataset's highly flawed collection method, with an unrealistic transaction pattern in which samples were spaced only about 20 minutes apart. When the model was faced with real mainnet ring signatures spanning days, weeks, or months, it was unable to correlate the patterns it had learned with the incoming samples.

```
Accuracy: 0.22
Micro Precision: 0.22
Micro Recall: 0.22
Micro F1-score: 0.22
Macro Precision: 0.04
Macro Recall: 0.10
Macro F1-score: 0.05
Weighted Precision: 0.08
Weighted Recall: 0.22
Weighted F1-score: 0.11
Classification Report
             precision   recall  f1-score  support
1 0.00 0.00 0.00 1
2 0.00 0.00 0.00 5
3 0.00 0.00 0.00 4
4 0.00 0.00 0.00 6
5 0.00 0.00 0.00 8
6 0.00 0.00 0.00 5
7 0.00 0.00 0.00 4
8 0.00 0.00 0.00 11
9 0.00 0.00 0.00 7
10 0.20 0.12 0.15 8
11 0.27 0.94 0.42 17
accuracy 0.22 76
macro avg 0.04 0.10 0.05 76
weighted avg 0.08 0.22 0.11 76
```

To further support my theory: the large uptick in testnet transactions per day over a short period, shown below, resulted from the data collection script failing to sleep between subsequent transactions. The stagenet collection was modified so that each transaction samples a number of seconds to sleep from the gamma distribution. This is shown in the rightmost image below, with a more gradual 2,500 transactions per day instead of 60,000. There are still instances of sudden spikes within the stagenet dataset, caused by intermittently restarting the data collection script after a power outage. While the stagenet dataset is not perfect, I believe it represents a much more realistic sample of transactions and thus should perform drastically better than the preliminary dataset.
Feature Engineering

@neptuneresearch has been a huge help in setting up the ring membership PostgreSQL database. This will allow me to incorporate the two "work in progress" features previously engineered. My naive implementation had a time complexity of O(N^5), while Neptune's SQL query will be practically O(1). This will reduce the lookup time per transaction from days down to milliseconds.

```sql
WITH txinfo AS (
SELECT
X.tx_block_height AS height_B,
X.tx_vin_index AS input_pos,
X.ringmember_block_height AS height_A,
X.ringmember_txo_key AS stealth_addr_A,
X.tx_vin_ringmember_index AS input_mem_idx
FROM tx_ringmember_list X
WHERE X.tx_hash = %s
)
SELECT DISTINCT
height_A,
Y.ringtx_block_height AS height_B,
input_pos,
input_mem_idx
FROM ringmember_tx_list Y
JOIN txinfo T ON T.stealth_addr_A = Y.txo_key
WHERE Y.ringtx_block_height >= height_A AND Y.ringtx_block_height <= height_B
ORDER BY input_pos, input_mem_idx, height_B ASC
```

Miscellaneous

I was originally exporting the dataset by using pickle to serialize the dataframe. However, this object was 40 GB and needed to be loaded into RAM before it could be used; this would not be practical on many data scientists' workstations and might deter researchers from conducting their own analysis of the dataset. I wrote a custom function to export the original data as a CSV and used pandas' native functionality to export the undersampled dataframe as a CSV. This format allows the data to be read line-by-line with no RAM constraints (a minimal sketch of reading the export in chunks appears at the end of this update).

Looking towards the future

July is the home stretch for this research project! Now that the dataset has finished being collected, it needs to be aggregated and analyzed using the different machine/deep learning algorithms I've implemented. Each model will predict on the mainnet validation dataset to determine the true accuracy, and I will write up a thorough analysis of the results.

Revised Goals for July
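As referenced in the Miscellaneous section above, here is a minimal sketch of streaming the exported CSV in fixed-size chunks; the file name and chunk size are illustrative, not the actual export paths:

```python
import pandas as pd

def iterate_dataset(path: str = "monero_dataset.csv", chunk_rows: int = 100_000):
    """Stream the exported CSV in fixed-size chunks so the full multi-GB
    frame never has to sit in RAM at once."""
    for chunk in pd.read_csv(path, chunksize=chunk_rows):
        # Each chunk is an ordinary DataFrame; filter, aggregate, or feed it
        # to a model here without loading the rest of the file.
        yield chunk

if __name__ == "__main__":
    total_rows = sum(len(chunk) for chunk in iterate_dataset())
    print(f"{total_rows:,} rows processed")
```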
Fantastic work. Thanks!
July-August Update

This is the fifth and final update delivered for my MAGIC grant, detailing the work I have completed from July 1st to August 31st. MAGIC permitted a one-month extension so I would have enough time to edit and revise my final report. I want to thank the MAGIC committee for their support over the last six months in helping me complete my research!

Non-Technical tl;dr

This security audit of Monero's ring signature resilience to AI attacks proposes three different models, the best of which achieved 13.3% accuracy in predicting the true spend of mainnet transactions. This revealed that blockchain metadata can provide a marginal improvement of 4.3 percentage points over random guessing. To hopefully inspire future work, the code to collect, process, and train ML/DL models, as well as the datasets used, is freely available in the GitHub repo. The final report, containing all technical details, is also published in the GitHub repo and can be accessed HERE.

Technical Updates

Goals for July-August
I started this research project in December 2021, and now, 9 months and almost 7,000 lines of code later, it comes to fruition.

Stagenet Dataset Collection and Processing

To start July, I aggregated the 165 GB of Monero wallet files from the 7 servers used for the data collection and made a few redundant backups to ensure the data wouldn't be lost. Also, with the help of @neptuneresearch, I added two new engineered features to the Stagenet dataset:

- the number of times each ring member output appears in other ring signatures between its creation and the investigated transaction, and
- the time deltas between those on-chain occurrences.
These features can best be explained using the following graphic, where the occurrences of an output are aggregated between the output's origin and the investigated use of the output. These features could help correlate anomalous inclusions of the output in ring signatures, especially if a drastic drop in the number of transactions processed by the network occurs right after the transaction was sent.

Stagenet Dataset Stats:
Model Training and Paper

Once the Stagenet and mainnet datasets were processed, I hyperparameter-tuned each model using an exhaustive search over specified parameters, scored by the average of 5-fold cross-validation (a minimal sketch of this kind of search appears at the end of this update).

Wallet2.cpp Bug Discovered

@Rucknium brought to my attention that a GitHub issue identified an RPC limitation triggered when the mempool exceeds 100 transactions. The identification of the bug lines up exactly with when I accidentally caused a flood of transactions on the staging network by restarting wallets for my data collection after a power outage (June 11th). The bug fix was merged and is now included in the hardfork v0.18.0.0 Fluorine Fermi.
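As referenced in the Model Training and Paper section above, a minimal sketch of an exhaustive grid search scored by 5-fold cross-validation; the stand-in data and parameter grid are illustrative, not the grid actually searched for the final models:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data: 11 classes, one per possible true-spend position.
X, y = make_classification(n_samples=3000, n_features=20, n_informative=10,
                           n_classes=11, random_state=0)

param_grid = {
    "n_estimators": [100, 300],
    "learning_rate": [0.05, 0.1],
    "max_depth": [3, 5],
}

# Exhaustive search over the grid, scoring each combination by the mean
# accuracy across 5-fold cross-validation.
search = GridSearchCV(GradientBoostingClassifier(random_state=0),
                      param_grid, cv=5, scoring="accuracy", n_jobs=-1)
search.fit(X, y)
print(search.best_params_, f"cv accuracy = {search.best_score_:.3f}")
```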
Candidate Background
I am currently a graduate student completing my master's in cybersecurity and have taken on this project. Additionally, I am a penetration tester for a well-respected security firm. In the past, I worked for over three years in a research lab dedicated to identifying deep-learning attacks against privacy technologies, and I recently had a paper published as lead author (attached in my email to MAGIC). I first learned about Monero around 2019, and over the last year and a half, I have worked to familiarize myself with how each component works. I pride myself on being a strong privacy advocate and have started open-source projects in the past, such as https://github.com/ACK-J/Port_Authority, to prevent invasive data collection practices. Lastly, when I have time, I run a blog where I post research, CTF writeups, data science topics, and more: https://g666gle.me.
Project Description
With large, well-resourced entities recently taking aim at Monero, I look to assess the attack surface of ring signatures with respect to machine and deep learning attacks. More specifically, I seek to determine which features of the blockchain can be leveraged to predict the true spend of an arbitrary transaction while disregarding outside information. This research will collect a large dataset of intentionally de-anonymized Monero transactions. The transactional data will be cleaned, filtered, and enriched with additional information present on the blockchain. Lastly, I will train multiple machine/deep learning models on the dataset to determine whether a ring signature's true spend can be predicted with accuracy greater than random guessing. My research is uniquely positioned at the time of an upcoming hard fork (v15), so I would also like to assess the privacy protections gained against the aforementioned attacks after the ring size increases from 11 to 16.
Abstract
Contrary to public perception, the most difficult part of data science is collecting a large and balanced dataset that accurately generalizes to the population. Monero research within this field has been stale or non-existent for half a decade, and there is still no public dataset available. Borggren et al. commented on this challenge, stating, "The difficulty in procuring labels for supervised learning remains a central challenge, despite the accessibility of blockchain activities themselves. In Monero, this is especially so." [1] With ring signatures largely seen as the crux of Monero's strong privacy claims, we must continuously invest in researchers to evaluate and innovate new solutions and avoid ignorance of any potential attack surface. This research will collect the first large-scale, balanced, labeled, and open-sourced dataset of testnet Monero transactions. Publishing a dataset large enough to train deep neural networks will hopefully encourage scholars and independent researchers to continue scrutinizing ring signatures' security and privacy claims, ultimately hardening Monero against AI attacks. After the collection is completed, I will train multiple machine/deep learning models on the dataset to conduct my security assessment. Since I believe in the peer-review process, an end goal is to publish the dataset on Kaggle with a reward. This would invite data scientists around the world, who may not be familiar with the Monero project, to conduct their own experiments in identifying flaws within the protocol. Additionally, this dataset could be used for more than just security assessments and would be useful in areas such as statistical analysis and education. Once the process is fully automated, a new dataset could be created on testnet for each subsequent hard fork of Monero, published, and peer-reviewed before the fork reaches mainnet. As the adversaries of Monero grow day by day, we must stay vigilant and assume their attacks will incorporate sophisticated artificially intelligent models.
Feature Set Ideas
Milestones and Budget
My rate is $50/hour, which is on the low side for data science consulting. I'm looking to contribute 20 hours of work each week, and I estimate my research will take 22 consecutive weeks. This totals 440 hours of work, costing $22,000; however, since MAGIC's maximum budget is $20,000, I am willing to accept a reduced amount. Additionally, I would prefer payouts in Monero on a monthly schedule (after my updates have been posted). In total, there will be five monthly payouts of $4,000, adjusted for the current price of Monero.
References
[1] N. Borggren, H.-y. Kim, L. Yao, and G. Koplik, “Simulated blockchains for machine learning traceability and transaction values in the monero network,” arXiv preprint arXiv:2001.03937, 2020.