
accounts-db docs

main code is in src/accountsdb/

the main files include:

  • db.zig: includes the main database struct AccountsDB
  • accounts_file.zig: includes the main struct for reading + validating account files
  • index.zig: all index related structs (account ref, simd hashmap, …)
  • snapshots.zig: fields + data to deserialize snapshot metadata
  • bank.zig: minimal logic for bank (still being built out)
  • genesis_config.zig: genesis config fields
  • sysvars.zig: system variables definitions and addresses (clock, slot_history, …)

cli options

--help output of accounts-db related flags:

-s, --snapshot-dir <snapshot_dir>                                    path to snapshot directory (where snapshots are downloaded and/or unpacked to/from) - default: test-data/

-t, --n-threads-snapshot-load <n_threads_snapshot_load>              number of threads to load snapshots: - default: ncpus

-u, --n-threads-snapshot-unpack <n_threads_snapshot_unpack>          number of threads to unpack snapshots - default: ncpus * 2

-d, --disk-index-path <disk_index_path>                              path to disk index - default: no disk index, index will use ram

-f, --force-unpack-snapshot                                          force unpack snapshot even if it exists

--min-snapshot-download-speed <min_snapshot_download_speed_mb>   minimum download speed of full snapshots in megabytes per second - default: 20MB/s

--force-new-snapshot-download                                    force download of new snapshot (usually to get a more up-to-date snapshot)

-t, --trusted_validator <Trusted Validator>                          public key of a validator whose snapshot hash is trusted to be downloaded

Additional context on specific cli flags is given throughout these docs.

download a snapshot through cli

# -s: where to save the snapshot
# --entrypoint: gossip peers used to join the network
# --trusted_validator: pubkey of a validator whose snapshot hashes you trust
# --min-snapshot-download-speed: minimum download speed in MB/s
zig-out/bin/sig snapshot-download \
    -s test-data/tmp \
    --entrypoint 34.83.231.102:8001 \
    --entrypoint 145.40.67.83:8001 \
    --trusted_validator x19btgySsrjuo25CJCj7oE7DREwezDhnx7pZkj2v69N \
    --min-snapshot-download-speed 50

background

check out the full accounts-db deep-dive blog post here: https://blog.syndica.io/sig-engineering-part-3-solanas-accountsdb/

snapshots

snapshots contain the full state of the blockchain (including all accounts) at a specific slot. They are requested/downloaded from existing validators in the network and are used to bootstrap new validators (instead of starting from Genesis).

the typical snapshot layout is as follows:
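(an illustrative layout, reconstructed from the archive contents described later in these docs - exact names depend on the snapshot's slot and hash)

test-data/
├── snapshot-{FULL_SLOT}-{HASH}.tar.zst                      # downloaded full snapshot archive
├── incremental-snapshot-{FULL_SLOT}-{SLOT}-{HASH}.tar.zst   # downloaded incremental snapshot archive
├── version                                                  # snapshot version file
├── snapshots/
│   ├── status_cache
│   └── {SLOT}/{SLOT}                                        # bincoded snapshot manifest
└── accounts/
    └── {SLOT}.{FILE_ID}                                     # unpacked account files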

when starting up, we use SnapshotFiles.find with the snapshot directory path to find the existing metadata files with the highest slot. if none exist, a new snapshot is downloaded.

account files

A snapshot contains multiple account files, each of which contains all the account data for a specific slot. Each file is organized as a contiguous list of serialized accounts.
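as a rough sketch, each entry in an account file looks approximately like the following (this mirrors the agave append-vec layout; field names and exact ordering here are an approximation, see accounts_file.zig for the real definitions):

const AccountInFile = struct {
    // storage metadata
    write_version: u64,
    data_len: u64,
    pubkey: [32]u8,
    // account fields
    lamports: u64,
    rent_epoch: u64,
    owner: [32]u8,
    executable: bool,
    // hash of the account's state
    hash: [32]u8,
    // ...followed by `data_len` bytes of account data, padded to 8-byte alignment
};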

startup

on startup, the validator does the following:

  • snapshots are downloaded from peers
  • the snapshot is decompressed into multiple account files
  • each account file is mmap'd into memory and validated
  • the account index is generated by reading each account file (creating an index which maps pubkeys to the location of the corresponding account)
  • the accounts-db state is validated to ensure no data corruption occurred

note: if --force-unpack-snapshot is used, a snapshot is always downloaded. otherwise, if an accounts/ directory exists, we attempt to load using the accounts located in that directory.

creating an AccountsDB instance

var accounts_db = try AccountsDB.init(
    allocator, // any allocator used for all operations
    logger, // used for outputting progress/info/debug details
    .{}    // custom configuration (defaults are reasonable)
);
defer accounts_db.deinit();

for more usage examples, check out the tests by searching for test "accountsdb in the codebase.

we'll cover how to load accounts db from a snapshot later in the docs.

architecture

the two major components of the db are:

  • an account_file map which maps a file_id to the mmap'd contents of that file
  • the account index which maps a pubkey to a file_id and the offset where the account's bytes begin (see the sketch below)
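as a mental model, an index entry looks roughly like the following (see index.zig for the real AccountRef; the types here are simplified placeholders):

const Pubkey = [32]u8;
const Slot = u64;
const FileId = u32;

const AccountRef = struct {
    pubkey: Pubkey,
    slot: Slot,
    location: union(enum) {
        // account lives in an mmap'd account file
        file: struct { file_id: FileId, offset: usize },
        // account is still in the (unrooted) write cache
        cache: struct { index: usize },
    },
};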

account file_map

To make the file_map thread-safe we had to modify a few things.

To better understand this, there are three main thread-safety scenarios we care about:

  • adding new account files (flushing)
  • reading account files (snapshot generation, account queries)
  • removing account files (shrinking and purging)

the two main fields are:

  • file_map_rw: a read-write mux around the file_map itself; a lock on it must be held whenever the map is read or modified (e.g. when inserting new account files or looking up a file by its file_id)
  • file_map_fd_rw: a read-lock on this mux should be held whenever an account file is being held. A write-lock on this mux should be held whenever we are closing a file.

Adding new account files (flushing):

Adding an account file should never invalidate the account files observed by another thread. The file_map should be write-locked so any map resizing (if there's not enough space) doesn't invalidate other threads' values.

Reading account files (snapshot generation, account queries):

All reading threads must first acquire a read (shared) lock on the file_map_fd_rw, before acquiring a lock on the file map, and reading an account file - to ensure account files will not be closed while being read.

After doing so, the file_map_rw may be unlocked, without releasing the file_map_fd_rw, allowing other threads to modify the file_map, whilst preventing any files being closed until all reading threads have finished their work.
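a rough sketch of this read path, using std.Thread.RwLock in place of sig's RwMux wrappers (names and types are simplified; only the lock ordering matters here):

const std = @import("std");

var file_map_fd_rw: std.Thread.RwLock = .{};
var file_map_rw: std.Thread.RwLock = .{};
var file_map: std.AutoHashMapUnmanaged(u32, []const u8) = .{};

fn readAccountFile(file_id: u32) void {
    // 1) hold a shared lock for the whole read so no thread can close the file
    file_map_fd_rw.lockShared();
    defer file_map_fd_rw.unlockShared();

    // 2) lock the map only long enough to find the file, then release it so
    //    flushing threads can still insert new files while we keep reading
    file_map_rw.lockShared();
    const maybe_contents = file_map.get(file_id);
    file_map_rw.unlockShared();

    if (maybe_contents) |contents| {
        // ... read accounts out of `contents`; the file cannot be closed
        // until file_map_fd_rw is released above ...
        _ = contents;
    }
}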

Removing account files (deleting):

A thread which wants to delete/close an account file must first acquire a write (exclusive) lock on file_map_fd_rw, before acquiring a write-lock on the file map to access the account_file and close/delete/remove it.

NOTE: Holding a write lock on file_map_fd_rw is very expensive, so we only acquire a write-lock inside deleteAccountFiles which has a minimal amount of logic.

NOTE: no method modifies/mutates account files after they have been flushed. They are 'shrunk' by deleting the file and creating a smaller one, or purged by deletion alone. This allows us to avoid a per-account-file lock.

account index

The account index shards pubkeys across multiple shards where each pubkey is associated with a specific shard based on the pubkey’s first N bits. This allows for parallel read/write access to the database (locking only a single shard for each lookup vs the entire struct).
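for illustration, mapping a pubkey to a shard from its leading bits could look like this (the actual shard count and bit extraction live in index.zig):

const std = @import("std");

fn shardIndex(pubkey: [32]u8, n_shards: usize) usize {
    // scale the top 16 bits of the pubkey onto [0, n_shards)
    const prefix: usize = std.mem.readInt(u16, pubkey[0..2], .big);
    return (prefix * n_shards) >> 16;
}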

due to the large number of accounts on solana, storing all account references in ram would be very expensive - which is why we also support storing the account index (more specifically, the references ArrayList) on disk using a backing file.

high-perf hashmap: SwissMap struct

to achieve fast read/write speeds, we implemented our own hashmap based on Google's SwissMap design, and saw up to a 2x improvement on getOrPut calls (see the benchmarks at the end of these docs).
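the core idea is to probe a group of 1-byte control tags with a single SIMD comparison instead of checking slots one at a time - an illustrative group probe (not sig's actual implementation):

fn findInGroup(control: @Vector(16, u8), tag: u8) ?usize {
    // compare all 16 control bytes against the key's 1-byte tag at once
    const matches = control == @as(@Vector(16, u8), @splat(tag));
    const mask: u16 = @bitCast(matches);
    if (mask == 0) return null; // no candidate slot in this group
    return @as(usize, @ctz(mask)); // index of the first matching slot
}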

disk-based allocator: DiskMemoryAllocator

to support disk-based account references, we created a general-purpose disk allocator which allocates memory by mmap-ing files stored on disk.

// files are created using `data/test-data/bin_{i}` format where `i` is 
// incremented by one for each new allocation.
var dma_dir = try std.fs.cwd().makeOpenPath("data/test-data", .{});
defer dma_dir.close();
var dma_state: DiskMemoryAllocator = .{};
const dma = dma_state.allocator();

Unlike a simpler page allocator, it stores metadata after the user-facing buffer which tracks the associated file and the true mmap'd size, allowing for resizes.
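since it exposes a standard std.mem.Allocator, it can be used like any other allocator (illustrative):

// a 1 MiB buffer backed by an mmap'd file in data/test-data/
const buf = try dma.alloc(u8, 1 << 20);
defer dma.free(buf);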

background-threads

we also run background work in the runManagerLoop method, which does the following:

  1. flushes the cache to account files in flushSlot
  2. cleans account files in cleanAccountFiles
  3. shrinks account files in shrinkAccountFiles
  4. deletes account files in deleteAccountFiles
  5. periodically creates full and incremental snapshots

for an overview of how these methods work, check out the blog post's details on background threads.

shrink + delete and thread-safety

since acquiring a write-lock on file_map_fd_rw is very expensive (it requires that no account files have read access), we ensure it's only write-locked during deletion in deleteAccountFiles, which contains a minimal amount of logic.

we also limit how often the method is called by requiring a minimum number of account files to delete per call (defined by DELETE_ACCOUNT_FILES_MIN).

snapshot creation

we create both full snapshots and incremental snapshots every N roots (defined in ManagerLoopConfig).

  • full snapshots: makeFullSnapshotGenerationPackage
  • incremental snapshots: makeIncrementalSnapshotGenerationPackage

the general usage is to create a snapshot package which implements a write method that can be used to write a tar-archive of the snapshot (using the method writeSnapshotTarWithFields). the package collects all the account files which should be included in the snapshot and also computes the accounts-hash and total number of lamports to populate the manifest with.

in the loop, we create the package and then write the tar-archive into a zstd compression library (zstd.writerCtx) which itself pipes into a file on disk.

After the writing is complete, the internal accounts-db state is updated using commitFullSnapshotInfo and commitIncrementalSnapshotInfo, which track the newly created snapshot and either delete or ignore older snapshots (which aren't needed anymore).

methods

downloading a snapshot

all the code can be found in src/accountsdb/download.zig : downloadSnapshotsFromGossip

first, there are two types of snapshots: full snapshots and incremental snapshots

  • full snapshots include all the accounts on the network at some specific slot.
  • incremental snapshots are smaller and only contain the accounts which changed from a full snapshot.

for example, if the network is on slot 100, the full snapshot could contain all accounts at slot 75, and a matching incremental snapshot could contain all accounts that changed between slot 75 and slot 100.

to download a snapshot, gossip is started up to find other nodes in the network and collect gossip data - we look for peers who

  • have a matching shred version (ie, the network version/hard-forks)
  • have a valid rpc socket (ie, can download from)
  • have a snapshot hash available

the snapshot hash structure is a gossip datatype which contains

  • the largest full snapshot (both the slot and the hash)
  • and a list of incremental snapshots (also slot and hash)

when downloading,

  • we prioritize snapshots with larger slots
  • and if we have a list of 'trusted' validators, we only download snapshots whose hashes match the trusted validators' hashes

https://github.com/Syndica/sig/blob/fd10bad14cd32f99b7f698118305960a4d26da49/src/gossip/data.zig#L908

then for each of these valid peers, we construct the url of the snapshot:

  • full: snapshot-{slot}-{hash}.tar.zst
  • incremental: incremental-snapshot-{base_slot}-{slot}-{hash}.tar.zst

and then start the download - we periodically check the download speed and make sure it's fast enough; if it isn't, we try another peer
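as a sketch, constructing the full-snapshot url for a peer might look like this (variable names are illustrative; it assumes the peer serves archives over its rpc socket):

const url = try std.fmt.allocPrint(
    allocator,
    "http://{s}/snapshot-{d}-{s}.tar.zst",
    .{ rpc_socket_addr, full_slot, hash_base58 },
);
defer allocator.free(url);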

decompressing a snapshot

snapshots are downloaded as zstd-compressed tar archives (.tar.zst) and we decompress them using parallelUnpackZstdTarBall

we use C bindings to a zstd library to create a decompressed stream, and feed the results into an untar step which writes the files to disk. the unarchiving happens in parallel using --n-threads-snapshot-unpack. since there is a large amount of I/O, the default value is 2x the number of CPUs on the machine.

loading from a snapshot

loading from a snapshot begins in accounts_db.loadFromSnapshot and is a very expensive operation.

the steps include:

  • read and load all the account files based on the snapshot manifest's file map
  • validate + index every account in each file (in parallel)
  • combine the results across threads (also in parallel)

to achieve this in parallel, we split processing of the account files across multiple threads - this means each thread:

  • reads and mmaps each of its assigned account files
  • creates and populates its own ArrayList(AccountRef) with every account it parses from those account files
  • populates its own sharded index by sharding the pubkeys and inserting the *AccountRefs into the shard hashmaps

the result is N threads (where N is set by --n-threads-snapshot-load), each with their own account index, which we now need to combine. to combine the indexes, we merge index shards in parallel across threads.

for example, one thread will merge shards[0..10] another will merge shards[10..20], ... etc for all the shards across all the threads.
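the range split itself is simple - something along these lines (illustrative, not the exact code):

fn shardRange(thread_i: usize, n_threads: usize, n_shards: usize) [2]usize {
    const per_thread = n_shards / n_threads;
    const start = thread_i * per_thread;
    // the last thread picks up any remainder
    const end = if (thread_i == n_threads - 1) n_shards else start + per_thread;
    return .{ start, end };
}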

this approach generates the index with zero locks

geyser during load

when loading and verifying account files in loadAndVerifyAccountsFiles, we also stream the accounts out to geyser (more docs in src/geyser/readme.md).

for each account file, we track the associated accounts and pubkeys in the GeyserTmpStorage during indexing, then push them to the pipe and reset the storage.

validating a snapshot

note: this will likely change with future improvements to the solana protocol account hashing

the goal of validating snapshots is to generate a merkle tree over all the accounts in the db and compare the root hash against the hash in the metadata. the entrypoint is validateLoadFromSnapshot.

we take the following approach:

  • account hashes are collected in parallel across shards using getHashesFromIndexMultiThread - similar to how the index is generated
  • each thread ends up with a slice of hashes; the root hash is computed over these nested slices using the NestedHashTree

note: pubkeys are also sorted so results are consistent

validating other metadata

after validating accounts-db data, we also validate a few key structs:

  • GenesisConfig : this data is validated against the bank in Bank.validateBankFields(bank.bank_fields, &genesis_config);
  • Bank : contains bank_fields which is in the snapshot metadata (not used right now)
  • StatusCache / SlotHistory Sysvar : additional validation performed in status_cache.validate

generating a snapshot

note: at the time of writing, this functionality is in its infancy.

The core logic for generating a snapshot lives in accounts_db.db.writeSnapshotTarWithFields; the principal entrypoint is AccountsDB.writeSnapshotTar. The procedure consists of writing the version file, the status cache (snapshots/status_cache) file, the snapshot manifest (snapshots/{SLOT}/{SLOT}), and the account files (accounts/{SLOT}.{FILE_ID}). This is all written to a stream in the TAR archive format.

The snapshot manifest file content is comprised of the bincoded (bincode-encoded) data structure SnapshotFields, which is an aggregate of:

  • implicit state: data derived from the current state of AccountsDB, like the file map for all the accounts which exist at that snapshot, or which have changed relative to a full snapshot in an incremental one
  • configuration state: data that is used to communicate details about the snapshot, like the full slot to which an incremental snapshot is relative.

For full snapshots, we write all account files present in AccountsDB which are rooted - as in, less than or equal to the latest rooted slot.

read/write benchmarks

BenchArgs contains all the configuration of a benchmark (comments describe each parameter)

  • found at the bottom of db.zig

writing accounts uses putAccountSlice, which takes a slice of accounts, and putAccountFile, which takes an account file. reading accounts uses accounts_db.getAccount(pubkey).
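a rough usage sketch (signatures simplified; see the benchmark code at the bottom of db.zig for the real calls):

// write a slice of accounts at a given slot, then read one back by pubkey
try accounts_db.putAccountSlice(&accounts, &pubkeys, slot);
const account = try accounts_db.getAccount(&pubkey);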

swissmap benchmarks

  • found at the bottom of index.zig
  • run using zig build -Doptimize=ReleaseSafe benchmark -- swissmap
Benchmark                        Iterations    Min(ns)    Max(ns)   Variance   Mean(ns)
---------------------------------------------------------------------------------------
        WRITE: 814.917us (2.00x faster than std)
        READ: 2.706ms (0.78x faster than std)
swissmapBenchmark(100k accounts)          1     814917     814917          0     814917
        WRITE: 7.715ms (1.46x faster than std)
        READ: 23.055ms (0.77x faster than std)
swissmapBenchmark(500k accounts)          1    7715875    7715875          0    7715875
        WRITE: 17.163ms (1.44x faster than std)
        READ: 50.975ms (0.70x faster than std)
swissmapBenchmark(1m accounts)            1   17163500   17163500          0   17163500