Skip to content

Commit

Permalink
Clarification and bug fixes
Browse files Browse the repository at this point in the history
- Document PROGPOW_PERIOD
- Document keccak variant used
- Clarify/fix endianess of keccak result
- Clarify c_dag is the first 16kb of g_dag
- Clarify result hash is 256 bits
- Fix OpenCL to use 256 bit result hash
  • Loading branch information
ifdefelse committed Oct 17, 2018
1 parent aacb30c commit f69b33d
Show file tree
Hide file tree
Showing 5 changed files with 95 additions and 56 deletions.
1 change: 1 addition & 0 deletions .gitignore
Original file line number Diff line number Diff line change
Expand Up @@ -74,3 +74,4 @@ node_modules/

# vscode
.vscode/
/.vs
79 changes: 53 additions & 26 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -84,16 +84,19 @@ Ethash requires external memory due to the large size of the DAG. However that

## ProgPoW Algorithm Walkthrough

The DAG is generated exactly as in ethash. The only difference is that an additional PROGPOW_SIZE_CACHE worth of data is generated that will live in the L1 cache instead of the framebuffer.
The DAG is generated exactly as in ethash.

ProgPoW can be tuned using the following parameters. The proposed settings have been tuned for a range of existing, commodity GPUs:

* `PROGPOW_LANES:` The number of parallel lanes that coordinate to calculate a single hash instance; default is `32.`
* `PROGPOW_REGS:` The register file usage size; default is `16.`
* `PROGPOW_CACHE_BYTES:` The size of the cache; default is `16 x 1024.`
* `PROGPOW_CNT_MEM:` The number of frame buffer accesses, defined as the outer loop of the algorithm; default is `64` (same as Ethash).
* `PROGPOW_CNT_CACHE:` The number of cache accesses per loop; default is `8.`
* `PROGPOW_CNT_MATH:` The number of math operations per loop; default is `8.`
* `PROGPOW_PERIOD`: Number of blocks before changing the random program; default is `50`.
* `PROGPOW_LANES`: The number of parallel lanes that coordinate to calculate a single hash instance; default is `32`.
* `PROGPOW_REGS`: The register file usage size; default is `16`.
* `PROGPOW_CACHE_BYTES`: The size of the cache; default is `16 x 1024`.
* `PROGPOW_CNT_MEM`: The number of frame buffer accesses, defined as the outer loop of the algorithm; default is `64` (same as Ethash).
* `PROGPOW_CNT_CACHE`: The number of cache accesses per loop; default is `8`.
* `PROGPOW_CNT_MATH`: The number of math operations per loop; default is `8`.

The random program changes every `PROGPOW_PERIOD` (default `50`) blocks to ensure the hardware executing the algorithm is fully programmable. If the program only changed every DAG epoch (roughly 5 days) a miner could have time to develop a hand-optimized version of the random sequence, giving them an undue advantage.

ProgPoW uses **FNV1a** for merging data. The existing Ethash uses FNV1 for merging, but FNV1a provides better distribution properties.

Expand Down Expand Up @@ -146,34 +149,59 @@ void fill_mix(
}
```

The main search algorithm uses the Keccak sponge function (a width of 800 bits, with a bitrate of 448, and a capacity of 352) to generate a seed, expands the seed, does a sequence of loads and random math on the mix data, and then compresses the result into a final Keccak permutation (with the same parameters as the first) for target comparison.
Like ethash Keccak is used to seed the sequence per-nonce and to produce the final result. The keccak-f800 variant is used as the 32-bit word size matches the native word size for GPUs. The implementation is a variant of SHAKE with width=800, bitrate=576, capacity=224, output=256, and no padding. The result of keccak is treated as a 256-bit big-endian number - that is result byte 0 is the MSB of the value.

```cpp
hash32_t keccak_f800_progpow(hash32_t header, uint64_t seed, hash32_t result)
{
uint32_t st[25];

for (int i = 0; i < 25; i++)
st[i] = 0;
for (int i = 0; i < 8; i++)
st[i] = header.uint32s[i];
st[8] = seed;
st[9] = seed >> 32;
for (int i = 0; i < 8; i++)
st[10+i] = result.uint32s[i];

for (int r = 0; r < 22; r++)
keccak_f800_round(st, r);

hash32_t ret;
for (int i=0; i<8; i++)
ret.uint32s[i] = st[i];
}
```
The main search algorithm generates a seed, expands random data from the seed, does a sequence of loads and random math on the mix data, then compresses the result, and then does a final Keccak permutation for target comparison.
```cpp
bool progpow_search(
const uint64_t prog_seed,
const uint64_t prog_seed, // value is (block_number/PROGPOW_PERIOD)
const uint64_t nonce,
const hash32_t header,
const uint64_t target,
const uint64_t *g_dag, // gigabyte DAG located in framebuffer
const uint64_t *c_dag // kilobyte DAG located in l1 cache
const hash32_t target, // miner can use a uint64_t target, doesn't need the full 256 bit target
const uint64_t *dag // gigabyte DAG located in framebuffer - the first portion gets cached
)
{
uint32_t mix[PROGPOW_LANES][PROGPOW_REGS];
uint32_t result[8];
hash32_t result;
for (int i = 0; i < 8; i++)
result[i] = 0;
result.uint32s[i] = 0;
// keccak(header..nonce)
uint64_t seed = keccak_f800(header, nonce, result);
hash32_t seed_256 = keccak_f800_progpow(header, nonce, result);
// endian swap so byte 0 of the hash is the MSB of the value
uint64_t seed = bswap(seed_256[0]) << 32 | bswap(seed_256[1]);
// initialize mix for all lanes
for (int l = 0; l < PROGPOW_LANES; l++)
fill_mix(seed, l, mix);
fill_mix(seed, l, mix[l]);
// execute the randomly generated inner loop
for (int i = 0; i < PROGPOW_CNT_MEM; i++)
progPowLoop(prog_seed, i, mix, g_dag, c_dag);
progPowLoop(prog_seed, i, mix, dag);
// Reduce mix data to a single per-lane result
uint32_t lane_hash[PROGPOW_LANES];
Expand All @@ -183,14 +211,14 @@ bool progpow_search(
for (int i = 0; i < PROGPOW_REGS; i++)
fnv1a(lane_hash[l], mix[l][i]);
}
// Reduce all lanes to a single 128-bit result
// Reduce all lanes to a single 256-bit result
for (int i = 0; i < 8; i++)
result[i] = 0x811c9dc5;
result.uint32s[i] = 0x811c9dc5;
for (int l = 0; l < PROGPOW_LANES; l++)
fnv1a(result[l%8], lane_hash[l])
fnv1a(result.uint32s[l%8], lane_hash[l])
// keccak(header .. keccak(header..nonce) .. result);
return (keccak_f800(header, seed, result) <= target);
return (keccak_f800_progpow(header, seed, result) <= target);
}
```

Expand Down Expand Up @@ -276,8 +304,7 @@ void progPowLoop(
const uint64_t prog_seed,
const uint32_t loop,
uint32_t mix[PROGPOW_LANES][PROGPOW_REGS],
const uint64_t *g_dag,
const uint32_t *c_dag)
const uint64_t *dag)
{
// All lanes share a base address for the global load
// Global offset uses mix[0] to guarantee it depends on the load result
Expand All @@ -286,7 +313,7 @@ void progPowLoop(
for (int l = 0; l < PROGPOW_LANES; l++)
{
// global load to sequential locations
uint64_t data64 = g_dag[offset_g + l];
uint64_t data64 = dag[offset_g + l];
// initialize the seed and mix destination sequence
int mix_seq[PROGPOW_REGS];
Expand All @@ -300,9 +327,9 @@ void progPowLoop(
if (i < PROGPOW_CNT_CACHE)
{
// Cached memory access
// lanes access random location
// lanes access random 32-bit locations within the first portion of the DAG
offset = mix[l][mix_src()] % PROGPOW_CACHE_WORDS;
data32 = c_dag[offset];
data32 = (uint32_t*)dag[offset];
merge(mix[l][mix_dst()], data32, rnd());
}
if (i < PROGPOW_CNT_MATH)
Expand Down
38 changes: 20 additions & 18 deletions libethash-cl/CLMiner_kernel.cl
Original file line number Diff line number Diff line change
Expand Up @@ -77,9 +77,10 @@ void keccak_f800_round(uint32_t st[25], const int r)
st[0] ^= keccakf_rndc[r];
}

// Implementation of the Keccak sponge construction (with padding omitted)
// The width is 800, with a bitrate of 448, and a capacity of 352.
uint64_t keccak_f800(__constant hash32_t const* g_header, uint64_t seed, uint32_t result[4])
// Keccak - implemented as a variant of SHAKE
// The width is 800, with a bitrate of 576, a capacity of 224, and no padding
// Only need 64 bits of output for mining
uint64_t keccak_f800(__constant hash32_t const* g_header, uint64_t seed, hash32_t result)
{
uint32_t st[25];

Expand All @@ -89,16 +90,18 @@ uint64_t keccak_f800(__constant hash32_t const* g_header, uint64_t seed, uint32_
st[i] = g_header->uint32s[i];
st[8] = seed;
st[9] = seed >> 32;
for (int i = 0; i < 4; i++)
st[10+i] = result[i];
for (int i = 0; i < 8; i++)
st[10+i] = result.uint32s[i];

for (int r = 0; r < 21; r++) {
keccak_f800_round(st, r);
}
// last round can be simplified due to partial output
keccak_f800_round(st, 21);

return (uint64_t)st[1] << 32 | st[0];
// Byte swap so byte 0 of hash is MSB of result
uint64_t res = (uint64_t)st[1] << 32 | st[0];
return as_ulong(as_uchar8(res).s76543210);
}

#define fnv1a(h, d) (h = (h ^ d) * 0x1000193)
Expand Down Expand Up @@ -162,17 +165,17 @@ __kernel void ethash_search(
const uint32_t lane_id = lid & (PROGPOW_LANES - 1);
const uint32_t group_id = lid / PROGPOW_LANES;

// Load random data into the cache
// TODO: should be a new blob of data, not existing DAG data
// Load the first portion of the DAG into the cache
for (uint32_t word = lid*2; word < PROGPOW_CACHE_WORDS; word += GROUP_SIZE*2)
{
uint64_t data = g_dag[word];
c_dag[word + 0] = data;
c_dag[word + 1] = data >> 32;
}

uint32_t result[4];
result[0] = result[1] = result[2] = result[3] = 0;
hash32_t result;
for (int i = 0; i < 8; i++)
result.uint32s[i] = 0;
// keccak(header..nonce)
uint64_t seed = keccak_f800(g_header, start_nonce + gid, result);

Expand Down Expand Up @@ -203,22 +206,21 @@ __kernel void ethash_search(
for (int i = 0; i < PROGPOW_REGS; i++)
fnv1a(mix_hash, mix[i]);

// Reduce all lanes to a single 128-bit result
uint32_t result_hash[4];
for (int i = 0; i < 4; i++)
result_hash[i] = 0x811c9dc5;
// Reduce all lanes to a single 256-bit result
hash32_t result_hash;
for (int i = 0; i < 8; i++)
result_hash.uint32s[i] = 0x811c9dc5;
share[group_id].uint32s[lane_id] = mix_hash;
barrier(CLK_LOCAL_MEM_FENCE);
#pragma unroll
for (int i = 0; i < PROGPOW_LANES; i++)
fnv1a(result_hash[i%4], share[group_id].uint32s[i]);
fnv1a(result_hash.uint32s[i%8], share[group_id].uint32s[i]);
if (h == lane_id)
for (int i = 0; i < 4; i++)
result[i] = result_hash[i];
result = result_hash;
}

// keccak(header .. keccak(header..nonce) .. result);
if (keccak_f800(g_header, seed, result) <= target)
if (keccak_f800(g_header, seed, result) < target)
{
uint slot = atomic_inc(&g_output[0]) + 1;
if(slot < MAX_OUTPUTS)
Expand Down
18 changes: 12 additions & 6 deletions libethash-cuda/CUDAMiner_kernel.cu
Original file line number Diff line number Diff line change
Expand Up @@ -72,8 +72,14 @@ __device__ __forceinline__ void keccak_f800_round(uint32_t st[25], const int r)
st[0] ^= keccakf_rndc[r];
}

// Implementation of the Keccak sponge construction (with padding omitted)
// The width is 800, with a bitrate of 576, and a capacity of 224.
__device__ __forceinline__ uint32_t cuda_swab32(const uint32_t x)
{
return __byte_perm(x, x, 0x0123);
}

// Keccak - implemented as a variant of SHAKE
// The width is 800, with a bitrate of 576, a capacity of 224, and no padding
// Only need 64 bits of output for mining
__device__ __noinline__ uint64_t keccak_f800(hash32_t header, uint64_t seed, hash32_t result)
{
uint32_t st[25];
Expand All @@ -93,7 +99,8 @@ __device__ __noinline__ uint64_t keccak_f800(hash32_t header, uint64_t seed, has
// last round can be simplified due to partial output
keccak_f800_round(st, 21);

return (uint64_t)st[0] << 32 | st[1];
// Byte swap so byte 0 of hash is MSB of result
return (uint64_t)cuda_swab32(st[0]) << 32 | cuda_swab32(st[1]);
}

#define fnv1a(h, d) (h = (uint32_t(h) ^ uint32_t(d)) * uint32_t(0x1000193))
Expand Down Expand Up @@ -145,8 +152,7 @@ progpow_search(

const uint32_t lane_id = threadIdx.x & (PROGPOW_LANES - 1);

// Load random data into the cache
// TODO: should be a new blob of data, not existing DAG data
// Load the first portion of the DAG into the cache
for (uint32_t word = threadIdx.x*2; word < PROGPOW_CACHE_WORDS; word += blockDim.x*2)
{
uint64_t data = g_dag[word];
Expand Down Expand Up @@ -199,7 +205,7 @@ progpow_search(
}

// keccak(header .. keccak(header..nonce) .. result);
if (keccak_f800(header, seed, result) > target)
if (keccak_f800(header, seed, result) >= target)
return;

uint32_t index = atomicInc((uint32_t *)&g_output->count, 0xffffffff);
Expand Down
15 changes: 9 additions & 6 deletions libprogpow/ProgPow.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -125,7 +125,7 @@ std::string ProgPow::getKern(uint64_t prog_seed, kernel_t kern)
// lanes access random locations
std::string src = mix_src();
std::string dest = mix_dst();
uint32_t r = rnd();
uint32_t r = rnd();
ret << "// cache load\n";
ret << "offset = " << src << " % PROGPOW_CACHE_WORDS;\n";
ret << "data32 = c_dag[offset];\n";
Expand All @@ -138,18 +138,21 @@ std::string ProgPow::getKern(uint64_t prog_seed, kernel_t kern)
// reduced to a single result
std::string src1 = mix_src();
std::string src2 = mix_src();
uint32_t r1 = rnd();
uint32_t r2 = rnd();
std::string dest = mix_dst();
uint32_t r1 = rnd();
std::string dest = mix_dst();
uint32_t r2 = rnd();
ret << "// random math\n";
ret << math("data32", src1, src2, r1);
ret << merge(dest, "data32", r2);
}
}
// Consume the global load data at the very end of the loop, to allow fully latency hiding
ret << "// consume global load data\n";
ret << merge("mix[0]", "data64", rnd());
ret << merge(mix_dst(), "(data64>>32)", rnd());
uint32_t r1 = rnd();
std::string dest = mix_dst();
uint32_t r2 = rnd();
ret << merge("mix[0]", "data64", r1);
ret << merge(dest, "(data64>>32)", r2);
ret << "}\n";
ret << "\n";

Expand Down

0 comments on commit f69b33d

Please sign in to comment.