Have you ever asked yourself how blockchain data is stored? How does every miner or validator keep a copy of the blockchain, and how do we ensure those copies are consistent? As a distributed system, blockchain miners need to be able to do this rapidly; and they do it with Merkle Trees. Let’s dive into what they are, and exactly how they’re used crypto today. Enjoy.
*Thank you to Aniket for proofreading.
The Need for Efficient Data Storage
With every block added to a blockchain, its size grows overall. Storing each block as raw data would quickly become burdensome, and comparing two copies of one blockchain would require going through that large set of data. Merkle Trees exist to make blockchain storage and verification easier, without compromising on security. Their basic building blocks are hash functions. Understanding hash functions will make Merkle Trees easy.
Hash Functions
Hash functions can take any input, but they always produce an output in a specific format. For example, your wallet address is actually the output of a hash, with that format being 0x-and-then-64-characters. Beyond that, each input changes the output randomly. Finally, you can’t find out what an input was purely by looking at an output. Hashing functions are one way algorithms, that essentially create a “digital fingerprint” for any data.
Hash functions are commonly used in traditional applications. For example, companies don’t store your password, they store a hash of it instead. When you log into an application, your input is hashed and then compared to the output stored in the application’s database.
They’re also used to safely transfer files. I can hash a file and send it to you, and when you receive the file you can do that too. If both of our hashes are the same, then we know that no-one has intercepted the file in between, and the file you received is exactly the one that I sent.
Hash functions make verifying data extremely efficient. They can be stacked on top of each other to create a data structure called a Merkle Tree. Verifying a Merkle Tree is far more efficient than verifying the entire dataset. On Bitcoin, for example, a Merkle Tree is created for every block, containing all the transactions inside.
Merkle Tree
Above is a Merkle Tree. Each transaction in a block is hashed, starting at the bottom. Each hash is then paired with another, and hashed again. Ultimately, one hash is left, called the “Root Hash.” This structure of hashes itself is a Merkle tree.
When blockchains are stored on miner’s devices, this is the format they’re in. Each block has a data component, the Merkle Tree, as well as a block header. The block header contains general information about the block, like it’s number, and also the Root Hash from the previous block.
So on Bitcoin, miners don’t store a copy of every transaction. Instead, miners store Merkle Trees of every block. When a new block is created, it contains the Root Hash of the previous block. This creates a chain between blocks (i.e. a block… chain).
Ethereum’s Data Storage
In Ethereum’s case, data storage is more elaborate. Ethereum doesn’t just process transactions, but also complex behaviors with smart contracts. To ensure everything is accurately tracked, Ethereum’s blocks contain 3 Merkle Trees, compared to Bitcoin’s single one.
The first Merkle Tree is based on all the transactions in a block, similar to Bitcoin. This is called the Transaction Tree. Next, the Receipts Tree contains every account on Ethereum, and what their current balance is at the end of that block. Finally, the State Tree contains all of the smart contract data on Ethereum.
These 3 Trees combine into a greater structure Ethereum calls a Patricia Merkle Tree. This is the same as a Merkle Tree, except each value in the Tree is assigned a “Key.” These Keys can be used to trace lines down the Merkle Tree, to isolate a specific transaction or event very easily.
Ethereum validators then store the Patricia Merkle Tree for every block, alongside each block’s header. This makes it impossible to change something that happened in a past block, since it would change a block’s Root Hash, and therefore every block after that.
But if only hashes of the past are stored, that begs a question. When blocks are added, how do miners and validators verify that the transactions inside are legitimate?
Data Availability
When validators add blocks to the Ethereum blockchain, they must broadcast all the transaction data for that block to the other validators on the network. This is called making the data available, or Data Availability.
When validators receive a block of data, they execute all the transaction data inside and compute the results. Every single validator does this independently, and they all arrive at the same Root Hash for that block, checking each other’s work. Then, validators proceed to preparing the next block to be added to the blockchain.
The fact that every validator must execute every transaction indicates something very important: a blockchain can only handle as many transactions as its validators can execute. For the blockchain to process more transactions, in other words, each validator would be forced to execute more transactions.
This is the Data Availability Problem, and is at the heart of the Blockchain Trilemma. This is the concept that says decentralization, security, and scalability are extremely difficult to solve for at the same time. Increasing block size to increase scalability, for example, means that validators must all process more transactions, which requires improved hardware, which negatively impacts decentralization.
Ethereum’s Future Solutions
Ethereum has solutions to the Data Availability Problem that will increase scalability without hurting decentralization or security. Most notably, these include Data Availability Sampling, a solution coming with the Danksharding upgrade (read more about that here).
The Data Availability Problem is also somewhat more forgiving on Rollups. On Rollups, miners don’t produce blocks; Sequencers do. The responsibility of everyone else is to verify that data within a certain amount of time. As a result, the limit on how many transactions can be processed by Rollups is instead based on how much data can be verified as available within a given period of time (like a week).
In Conclusion
Thank you for reading, I hope you enjoyed this summary on Merkle Trees, Data Availability, and blockchain data storage!
If you have any questions, please comment them on this post! If you’d like me to cover any topics, feel free to suggest them in the comments also.
Thank You & Additional Reading!
Thanks a lot for reading! Here are some more resources if you'd like to dive deeper.
The Data Availability Problem, presentation by Vitalik Buterin
Merkle Tree, by Brilliant
Modified Merkle Patricia Trie — How Ethereum saves a state, by Kiyun Kim
A Note on Data Availability and Erasure Coding by Vitalik Buterin
Sign up if you haven’t already for more simple write-ups on blockchain concepts!
Share a Summary
Thanks again, please consider sharing this write-up below!
Stay kind. Stay curious.