Chunk-Based Code Merkleization
相关视频
正文
Abstract
Code merkleization, along with binarification of the trie and gas cost bump of state accessing opcodes, are considered as the main levers for decreasing block witness sizes in stateless or partial-stateless Eth1x roadmaps. Here we specify a fixed-sized chunk approach to code merkleization and outline how the transition of existing contracts to this model would look like.
Motivation
Bytecode is currently the second contributor to block witness size, after the proof hashes. Transitioning the trie from hexary to binary reduces the hash section of the witness by 3x, thereby making code the first contributor. By breaking contract code into chunks and committing to those chunks in a merkle tree, stateless clients would only need the chunks that were touched during a given transaction to execute it.
Specification
What follows is structured to have two sections:
- How a given contract code is split into chunks and then merkleized
- How to merkleize all existing contract codes during a hardfork
Constants and Definitions
Constants
CHUNK_SIZE
: 32 (bytes)VERSION_KEY
:max(u256)
VERSION
: 0EMPTY_CODE_ROOT
:0xc5d2460186f7233c927e7db2dcc703c0e500b653ca82273b7bfad8045d85a470
(==keccak256('')
)HF_TIMESTAMP
: to be definedITERATOR_STRIDE
:TBC
but based on verkle numbers, first devnets should use50_000
Definitions
BE(x, N)
: castsx
to an unsigned integer ofN
bytes and returns its big-endian representation
Code merkleization
For an account record A
with code C
, two extra optional fields are added: A.codeSize
and A.codeRoot
. A.codeHash
is retained to serve EXTCODEHASH
.
If C
is empty, i.e. in the case of an EoA, A.codeRoot
and A.codeSize
are omitted from the account's RLP. This is intended to limit the size overhead of this change.
If C
is not empty:
A.codeSize = len(code)
A.codeRoot
contains the root ofcodeTrie
, a trie with the following leaves:- Key:
VERSION_KEY
, value:BE(VERSION, 1)
- A list of code
chunks = [(FIO_0, code_0), ..., (FIO_n, code_n)]
which are derived fromC
as follows:code_0 || ... || code_n == C
.length(code_i) == CHUNK_SIZE
where0 <= i < n
.length(code_n) <= CHUNK_SIZE
.FIO_i
is the offset of the first instruction within the chunk. It should only be greater than zero if the last instruction incode_i-1
is a multi-byte instruction (likePUSHN
) crossing the chunk boundary. It is set toCHUNK_SIZE
in the case where all bytes of a chunk are data.
- Key:
The i
th element of chunks
is stored in codeTrie
with:
- Key:
BE(i, 4)
- Value:
BE(FIO_i, 1) || code_i
, where||
stands for byte-wise concatenation
Contract creation gas cost
As of now there is a charge of 200 gas per byte of the code stored in state by contract creation operations, be it initiated via CREATE
, CREATE2
, or an external transaction. This per byte cost is to be increased from 200
to 500
to account for the chunking and merkleization costs. This number is inherited from EIP-4762 and is picked for forward-compatibility.
Updating existing code (transition process)
The transition process involves reading all contracts in the state and applying the above procedure to them. A process simliar to EIP-7612 is to be used, as the total code size at the time of this EIP edit is >10GB and can not be processed in a single block.
Note that:
- Because multiple accounts can share the same code hash, the whole account tree needs to be iterated over to convert each account.
- Nonetheless, the conversion process should take a few hours, instead of days for a full tree change.
At HF_TIMESTAMP
the EIP gets activated:
- any contract creation and 7702 delegation updates must use the new format.
- for 7702 accounts, the code size is set to
23
, andCODESIZE
/CODECOPY
are forwarded to the delegated account.EXTCODESIZE
/EXTCODEHASH
/EXTCODECOPY
behave the same as they did before the activation of this EIP. - accounts are NOT converted in case of a balance or nonce update
- for 7702 accounts, the code size is set to
- at the end of block processing, and after all transactions have been executed, the client iterates over
ITERATOR_STRIDE
accounts in the tree, and for each account:- if the account has empty code, or is a contract using the new format, leave that account untouched,
- if the account is a "legacy" contract, convert it and write it back to the tree
The value for ITERATOR_STRIDE
has been chosen to be safe while ensuring the transition process does not last too long. At the current state size and block time, this represents about 10000 blocks, which is about one and a half day.
Rationale
Hexary vs binary trie
The trie format is chosen to be the same as that of the account trie. If a tree conversion happens at a later stage, the chunk tree will have to be converted as well, e.g. the way it is in EIP-6800 or EIP-7864.
Chunk size
The current recommended chunk size of 32 bytes has been selected based on a few observations. Smaller chunks are more efficient (i.e. have higher chunk utilization, but incur a larger hash overhead (i.e. number of hashes as part of the proof) due to a higher trie depth. Larger chunks are less efficient, but incur less hash overhead. We plan to run a larger experiment comparing various chunk sizes to arrive at a final recommendation.
First instruction offset
The firstInstructionOffset
fields allows safe jumpdest analysis when a client doesn't have all the chunks, e.g. a stateless clients receiving block witnesses.
Note: there could be an edge case when computing FIO for the chunks where the data bytes at the end of a bytecode (last chunk) resemble a multi-byte instruction. This case can be safely ignored.
Gas cost of code-accessing opcodes
Details of how the code is stored, is left to the client implementers. However, to reflect the removal of the max code limit and the fact that larger codes will be more expensive to load, reading a non-accessed chunk will incur a 200 warming cost, and chunks written beyond the 24kb limit will cost 500 gas instead of 200.
Different chunking logic
We have considered an alternative option to package chunks, where each chunk is prepended with its chunkLength
and would only contain complete opcodes (i.e. any multi-byte opcode not fitting the CHUNK_SIZE
would be deferred to the next chunk).
This approach has downsides compared to the one specified:
- Requires a larger
CHUNK_SIZE
-- at least 33 bytes to accommodate thePUSH32
instruction. - It is more wasteful. For example,
DUP1 PUSH32 <32-byte payload>
would be encoded as two chunks, the first chunk contains onlyDUP1
, and the second contains only thePUSH32
instruction with its payload. - Calculating the number of chunks is not trivial and would have to be stored explicitly in the metadata.
Additionally we have reviewed many other options (basic block based, Solidity subroutines (requires determining the control flow), EIP-2315 subroutines). This EIP however only focuses on the chunk-based option.
Backwards Compatibility
From the perspective of contracts, the design aims to be transparent, with the exception of changes in gas costs.
Outside of the interface presented for contracts, this proposal introduces major changes to the way contract code is stored, and needs a substantial modification of the Ethereum state. Therefore it can only be introduced via a hard fork.
Test Cases
TBD
Show the codeRoot
for the following cases:
code=''
code='PUSH1(0) DUP1 REVERT'
code='PUSH32(-1)'
(data passing through a chunk boundary)
Security Considerations
TBA
Copyright
Copyright and related rights waived via CC0.