main code is in src/accounts-db/
the main files include:
- db.zig: includes the main database struct AccountsDB
- accounts_file.zig: includes the main struct for reading + validating account files
- index.zig: all index related structs (account ref, simd hashmap, …)
- snapshots.zig: fields + data to deserialize snapshot metadata
- bank.zig: minimal logic for bank (still being built out)
- genesis_config.zig: genesis config fields
- sysvars.zig: system variables definitions and addresses (clock, slot_history, …)
the --help output of accounts-db related flags:
-s, --snapshot-dir <snapshot_dir> path to snapshot directory (where snapshots are downloaded and/or unpacked to/from) - default: test-data/
-t, --n-threads-snapshot-load <n_threads_snapshot_load> number of threads to load snapshots: - default: ncpus
-u, --n-threads-snapshot-unpack <n_threads_snapshot_unpack> number of threads to unpack snapshots - default: ncpus * 2
-d, --disk-index-path <disk_index_path> path to disk index - default: no disk index, index will use ram
-f, --force-unpack-snapshot force unpack snapshot even if it exists
--min-snapshot-download-speed <min_snapshot_download_speed_mb> minimum download speed of full snapshots in megabytes per second - default: 20MB/s
--force-new-snapshot-download force download of new snapshot (usually to get a more up-to-date snapshot)
-t, --trusted_validator <Trusted Validator> public key of a validator whose snapshot hash is trusted to be downloaded
Additional context on specific cli flags is given throughout these docs.
zig-out/bin/sig snapshot-download \
# where to save snapshot
-s test-data/tmp \
# gossip peers to join network from
--entrypoint 34.83.231.102:8001 \
--entrypoint 145.40.67.83:8001 \
# pubkeys of validators whose snapshot hashes you trust
--trusted_validator x19btgySsrjuo25CJCj7oE7DREwezDhnx7pZkj2v69N \
# minimum MB/s speed when downloading snapshot
--min-snapshot-download-speed 50
check out the full accounts-db deep-dive blog post here: https://blog.syndica.io/sig-engineering-part-3-solanas-accountsdb/
snapshots contain the full state of the blockchain (including all accounts) at a specific slot. They are requested/downloaded from existing validators in the network and are used to bootstrap new validators (instead of starting from Genesis).
the typical snapshot layout is as follows:
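(a rough sketch reconstructed from the file names described in the snapshot generation section further below; {SLOT}, {BASE_SLOT}, {HASH} and {FILE_ID} are placeholders)

{snapshot-dir}/
    snapshot-{SLOT}-{HASH}.tar.zstd                           (downloaded full snapshot archive)
    incremental-snapshot-{BASE_SLOT}-{SLOT}-{HASH}.tar.zstd   (downloaded incremental archive)
    version
    accounts/
        {SLOT}.{FILE_ID}          (account files)
    snapshots/
        status_cache
        {SLOT}/{SLOT}             (snapshot manifest)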
when starting up, we use SnapshotFiles.find with the snapshot directory string to find the existing snapshot metadata file with the highest slot. if no such file exists, a new snapshot is downloaded.
A snapshot contains multiple account files; each file holds all the account data for a specific slot and is organized as a flat list of accounts stored as bytes.
on startup, the validator does the following:
- snapshots are downloaded from peers
- the snapshot is decompressed to multiple account files
- each account file is mmap'd into memory and validated
- the account index is generated by reading each account file (creating an index which maps pubkeys to the location of the corresponding account)
- the accounts-db state is validated to ensure no data corruption occurred
note: if --force-unpack-snapshot is used, the snapshot archive is always unpacked, even if an accounts/ directory already exists. otherwise, if an accounts/ directory exists, it will attempt to load using the account files located in that directory.
var accounts_db = try AccountsDB.init(
allocator, // any allocator used for all operations
logger, // used for outputting progress/info/debug details
.{} // custom configuration (defaults are reasonable)
);
defer accounts_db.deinit();
for more usage examples, check out the tests by searching for test "accountsdb in the codebase.
we'll cover how to load accounts db from a snapshot later in the docs.
the two major components in the db are:
- an account_file map which maps a file_id to the mmap'd contents of that file
- the account index which maps a pubkey to a file_id and the offset where the account's bytes begin
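conceptually, the shapes look roughly like the following sketch (simplified; the real types in db.zig and index.zig carry more state and locking, and the field and type names here are illustrative):

const std = @import("std");

const FileId = u32;
const Pubkey = [32]u8;

// sketch: where an account's bytes live - which file, and at what offset
const AccountRef = struct {
    pubkey: Pubkey,
    slot: u64,
    file_id: FileId,
    offset: usize,
};

// sketch: file_id -> the mmap'd contents of that account file
const FileMap = std.AutoHashMap(FileId, []const u8);

// sketch: one shard of the index, mapping a pubkey to its reference
const IndexShard = std.AutoHashMap(Pubkey, AccountRef);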
To make the file_map thread-safe we had to modify a few things.
To better understand this, there are three main thread-safety scenarios we care about:
- adding new account files (flushing)
- reading account files (snapshot generation, account queries)
- removing account files (shrinking and purging)
the two main fields are:
- file_map_rw: guards the file map itself; it is locked whenever the map is read or modified (looking up, adding, or removing account files).
- file_map_fd_rw: a read-lock on this mux should be held whenever an account file is being held. A write-lock on this mux should be held whenever we are closing a file.
Adding an account file should never invalidate the account files observed by another thread. The file map should be write-locked so any map resizing (if there's not enough space) doesn't invalidate other threads' values.
All reading threads must first acquire a read (shared) lock on file_map_fd_rw before acquiring a lock on the file map and reading an account file - this ensures account files will not be closed while being read.
After doing so, file_map_rw may be unlocked (without releasing file_map_fd_rw), allowing other threads to modify the file_map while preventing any files from being closed until all reading threads have finished their work.
A thread which wants to delete/close an account file must first acquire a write (exclusive) lock on file_map_fd_rw, before acquiring a write-lock on the file map to access the account_file and close/delete/remove it.
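as a simplified illustration of this lock ordering (using std.Thread.RwLock as a stand-in for the mux types in the real code; the function bodies are sketches, not the real implementation):

const std = @import("std");

var file_map_fd_rw: std.Thread.RwLock = .{};
var file_map_rw: std.Thread.RwLock = .{};

fn readFromAccountFile() void {
    // 1) shared lock: account files cannot be closed while we hold this
    file_map_fd_rw.lockShared();
    defer file_map_fd_rw.unlockShared();

    // 2) lock the file map only long enough to look up the file
    file_map_rw.lockShared();
    // ... find the account file in the file map ...
    file_map_rw.unlockShared();

    // 3) safe to read from the (still open) account file here
}

fn closeAndDeleteAccountFile() void {
    // exclusive lock: guarantees no thread is currently reading any account file
    file_map_fd_rw.lock();
    defer file_map_fd_rw.unlock();

    file_map_rw.lock();
    defer file_map_rw.unlock();
    // ... remove the entry from the file map and close/delete the file ...
}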
NOTE: Holding a write-lock on file_map_fd_rw is very expensive, so we only acquire a write-lock inside deleteAccountFiles, which has a minimal amount of logic.
NOTE: no method modifies/mutates account files after they have been flushed. They are 'shrunk' by deleting the file and creating a smaller one, or purged by deletion alone. This allows us to avoid a per-account-file lock.
The account index shards pubkeys across multiple shards, where each pubkey is associated with a specific shard based on the pubkey's first N bits. This allows for parallel read/write access to the database (locking only a single shard for each lookup instead of the entire struct).
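as an illustration of the idea (not the actual implementation - the helper name and constants here are made up), selecting a shard from a pubkey's leading bits could look like:

const std = @import("std");

/// sketch only: with a power-of-two number of shards, select a shard from
/// the leading bits of the pubkey.
fn shardIndex(pubkey: [32]u8, n_shards: u64) u64 {
    std.debug.assert(std.math.isPowerOfTwo(n_shards) and n_shards > 1);
    const n_bits = std.math.log2_int(u64, n_shards);

    // interpret the first 8 bytes of the pubkey as a big-endian integer
    var prefix: u64 = 0;
    for (pubkey[0..8]) |byte| prefix = (prefix << 8) | byte;

    // keep only the leading n_bits bits -> a value in [0, n_shards)
    const shift: u6 = @intCast(64 - @as(u32, n_bits));
    return prefix >> shift;
}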
due to the large number of accounts on Solana, storing all account references in RAM would be very expensive - which is why we also support storing the account index (more specifically, the references ArrayList) on disk using a backing file.
to achieve fast read/write speeds, we implemented our own hashmap based on Google's SwissMap design; we saw roughly a 2x improvement on getOrPut (write) calls compared to the std hashmap (see the benchmarks at the end of these docs).
to support disk-based account references, we created a general purpose disk allocator which creates memory from mmap-ing files stored on disk.
// files are created using `data/test-data/bin_{i}` format where `i` is
// incremented by one for each new allocation.
var dma_dir = try std.fs.cwd().makeOpenPath("data/test-data", .{});
defer dma_dir.close();
var dma_state: DiskMemoryAllocator = .{};
const dma = dma_state.allocator();
Unlike a simpler page allocator, it stores certain metadata after the user-facing buffer which tracks the associated file, and the true mmap'd size to allow for resizes.
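for example, the allocator can back an ordinary ArrayList so the account references live in the backing file rather than in heap RAM - a sketch continuing the snippet above (AccountRef stands in for the reference type from index.zig):

// sketch: back a references list with the disk allocator from above, so the
// list's buffer lives in an mmap'd file instead of heap memory
var disk_refs = std.ArrayList(AccountRef).init(dma);
defer disk_refs.deinit();
try disk_refs.ensureTotalCapacity(1_000_000);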
we also run background threads in the runManagerLoop method, which does the following:
- flush the cache to account files in flushSlot
- clean account files in cleanAccountFiles
- shrink account files in shrinkAccountFiles
- delete account files in deleteAccountFiles
- periodically create full snapshots and incremental snapshots
for an overview of how these methods work, check out the blog post's details on background threads.
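a heavily simplified sketch of the loop's shape (the function stubs below are stand-ins for the methods named above; the real loop has more bookkeeping, configuration, and an atomic exit flag):

const std = @import("std");

// stand-ins for the real methods - sketch only
fn flushSlot() void {}
fn cleanAccountFiles() void {}
fn shrinkAccountFiles() void {}
fn deleteAccountFiles() void {}

// each iteration flushes, cleans, shrinks, and (when enough files have
// accumulated) deletes account files; snapshot generation also hangs off
// this loop in the real code.
fn runManagerLoopSketch(exit: *const bool) void {
    while (!exit.*) {
        flushSlot();
        cleanAccountFiles();
        shrinkAccountFiles();
        deleteAccountFiles();
        std.time.sleep(100 * std.time.ns_per_ms);
    }
}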
since acquiring a write-lock on file_map_fd_rw is very expensive (it ensures no account files can have read access), we make sure it's only write-locked during deletion in deleteAccountFiles, which contains a minimal amount of logic.
we also limit how often the method is called by requiring a minimum number of account files to delete per call (defined by DELETE_ACCOUNT_FILES_MIN).
we create both full snapshots and incremental snapshots every N roots (defined in ManagerLoopConfig).
- full snapshots: makeFullSnapshotGenerationPackage
- incremental snapshots: makeIncrementalSnapshotGenerationPackage
the general usage is to create a snapshot package which implements a write method that can be used to write a tar archive of the snapshot (using the method writeSnapshotTarWithFields). the package collects all the account files which should be included in the snapshot and also computes the accounts-hash and total number of lamports to populate the manifest with.
in the loop, we create the package and then write the tar archive into a zstd compression library (zstd.writerCtx) which itself pipes into a file on disk.
After the writing is complete, the internal accounts-db state is updated using commitFullSnapshotInfo and commitIncrementalSnapshotInfo, which track the new snapshot and either delete or ignore older snapshots (which aren't needed anymore).
all the code can be found in src/accountsdb/download.zig: downloadSnapshotsFromGossip
first, there are two types of snapshots: full snapshots and incremental snapshots
- full snapshots include all the accounts on the network at some specific slot.
- incremental snapshots are smaller and only contain the accounts which changed from a full snapshot.
for example, if the network is on slot 100, the full snapshot could contain all accounts at slot 75, and a matching incremental snapshot could contain all accounts that changed between slot 75 and slot 100.
to download a snapshot, gossip is started up to find other nodes in the network and collect gossip data - we look for peers who
- have a matching shred version (i.e., the same network version/hard-forks)
- have a valid rpc socket (i.e., one we can download from)
- have a snapshot hash available
the snapshot hash structure is a gossip datatype which contains
- the largest available full snapshot (both the slot and the hash)
- and a list of incremental snapshots (also slot and hash)
when downloading,
- we prioritize snapshots with larger slots
- and if we have a list of 'trusted' validators, we only download snapshots whose hashes match the trusted validators' hashes
then for each of these valid peers, we construct the url of the snapshot:
- full: snapshot-{slot}-{hash}.tar.zstd
- incremental: incremental-snapshot-{base_slot}-{slot}-{hash}.tar.zstd
and then start the download - we periodically check the download speed and make sure it's fast enough, otherwise we try another peer
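as a small illustration, building these archive names is plain string formatting (a sketch - these helpers are not the ones used in download.zig; `hash` is assumed to be the base58-encoded snapshot hash from gossip):

const std = @import("std");

fn fullSnapshotName(allocator: std.mem.Allocator, slot: u64, hash: []const u8) ![]u8 {
    return std.fmt.allocPrint(allocator, "snapshot-{d}-{s}.tar.zstd", .{ slot, hash });
}

fn incrementalSnapshotName(
    allocator: std.mem.Allocator,
    base_slot: u64,
    slot: u64,
    hash: []const u8,
) ![]u8 {
    return std.fmt.allocPrint(
        allocator,
        "incremental-snapshot-{d}-{d}-{s}.tar.zstd",
        .{ base_slot, slot, hash },
    );
}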
snapshots are downloaded as .tar.zstd archives, and we decompress them using parallelUnpackZstdTarBall.
we use C bindings to a zstd library to create a decompressed stream, whose output we feed to an untar step that writes the archive's files to disk. the unarchiving happens in parallel using n-threads-snapshot-unpack. since there is a large amount of I/O, the default value is 2x the number of CPUs on the machine.
loading from a snapshot begins in accounts_db.loadFromSnapshot and is a very expensive operation.
the steps include:
- read and load all the account files based on the snapshot manifest's file map
- validate + index every account in each file (in parallel)
- combine the results across threads (also in parallel)
to achieve this in parallel, we split processing of the account files across multiple threads (part 1 of the diagram below) - this means each thread:
- reads and mmaps each of its assigned account files
- creates and populates an ArrayList(AccountRef) with every account it parses from those account files
- populates its own sharded index by sharding the pubkeys and populating the hashmap with the *AccountRefs
the result is N threads (--n-threads-snapshot-load decides the value of N), each with their own account index, which we now need to combine. to combine the indexes, we merge the index shards in parallel across threads.
for example, one thread will merge shards[0..10], another will merge shards[10..20], ... etc for all the shards across all the threads.
this approach generates the index with zero locks.
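for illustration, the per-thread shard ranges could be computed like this (a sketch, not the actual helper used in the code):

// sketch: split n_shards into n_threads contiguous ranges (the last thread
// picks up any remainder), e.g. shards[0..10], shards[10..20], ...
fn shardRangeForThread(n_shards: usize, n_threads: usize, thread_id: usize) struct { usize, usize } {
    const chunk = n_shards / n_threads;
    const start = thread_id * chunk;
    const end = if (thread_id == n_threads - 1) n_shards else start + chunk;
    return .{ start, end };
}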
when loading and verifying account files in loadAndVerifyAccountsFiles, we also stream the accounts out to geyser (more docs in src/geyser/readme.md).
for each account file, we track the associated accounts and pubkeys in the GeyserTmpStorage during indexing, and then push them to the pipe and reset the storage.
note: this will likely change with future improvements to the solana protocol account hashing
the goal of validating snapshots is to generate a merkle tree over all the accounts in the db and compare the root hash with the hash in the metadata. the entrypoint is validateLoadFromSnapshot.
we take the following approach:
- account hashes are collected in parallel across shards using getHashesFromIndexMultiThread - similar to how the index is generated
- each thread ends up with a slice of hashes; the root hash is computed over these nested slices using NestedHashTree
note: pubkeys are also sorted so results are consistent
after validating accounts-db data, we also validate a few key structs:
- GenesisConfig: this data is validated against the bank in Bank.validateBankFields(bank.bank_fields, &genesis_config);
- Bank: contains bank_fields which is in the snapshot metadata (not used right now)
- StatusCache / SlotHistory Sysvar: additional validation is performed in status_cache.validate
note: at the time of writing, this functionality is in its infancy.
The core logic for generating a snapshot lives in accounts_db.db.writeSnapshotTarWithFields; the principal entrypoint is AccountsDB.writeSnapshotTar.
The procedure consists of writing the version file, the status cache file (snapshots/status_cache), the snapshot manifest (snapshots/{SLOT}/{SLOT}), and the account files (accounts/{SLOT}.{FILE_ID}). This is all written to a stream in the TAR archive format.
The snapshot manifest file content is comprised of the bincoded (bincode-encoded) data structure SnapshotFields
, which is an aggregate of:
- implicit state: data derived from the current state of AccountsDB, like the file map for all the accounts which exist at that snapshot, or which have changed relative to a full snapshot in an incremental one
- configuration state: data that is used to communicate details about the snapshot, like the full slot to which an incremental snapshot is relative.
For full snapshots, we write all account files present in AccountsDB which are rooted - as in, less than or equal to the latest rooted slot.
BenchArgs contains all the configuration of a benchmark (comments describe each parameter) - found at the bottom of db.zig.
writing accounts uses putAccountSlice (which takes a slice of accounts) and putAccountFile (which takes an account file).
reading accounts uses accounts_db.getAccount(pubkey).
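a rough usage sketch (the parameters shown are illustrative, not the real signatures - see db.zig for the actual method definitions):

// sketch only - parameter lists here are illustrative
try accounts_db.putAccountSlice(accounts, slot);    // write a slice of accounts
try accounts_db.putAccountFile(account_file, slot); // write a whole account file
const account = try accounts_db.getAccount(pubkey); // read a single account back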
- found at the bottom of index.zig
- run using zig build -Doptimize=ReleaseSafe benchmark -- swissmap
Benchmark Iterations Min(ns) Max(ns) Variance Mean(ns)
---------------------------------------------------------------------------------------
WRITE: 814.917us (2.00x faster than std)
READ: 2.706ms (0.78x faster than std)
swissmapBenchmark(100k accounts) 1 814917 814917 0 814917
WRITE: 7.715ms (1.46x faster than std)
READ: 23.055ms (0.77x faster than std)
swissmapBenchmark(500k accounts) 1 7715875 7715875 0 7715875
WRITE: 17.163ms (1.44x faster than std)
READ: 50.975ms (0.70x faster than std)
swissmapBenchmark(1m accounts) 1 17163500 17163500 0 17163500