Module 2
Distributed Database System
A distributed database is a database that is not confined to a single
system: it is spread across different sites, i.e., over multiple computers or
a network of computers. A distributed database system is located on
various sites that do not share physical components. This may be required
when a particular database needs to be accessed by various users globally.
It must be managed so that, to its users, it appears to be one single
database.
Types:
1. Homogeneous Database:
In a homogeneous distributed database, every site stores the database
identically. The operating system, database management system, and data
structures used are the same at all sites. Hence, such systems are easy to
manage.
2. Heterogeneous Database:
In a heterogeneous distributed database, different sites can use different
schemas and software, which can lead to problems in query processing and
transactions. A particular site might even be completely unaware of the
other sites. Different computers may use different operating systems and
different database applications; they may even use different data models
for the database. Hence, translations are required for the sites to
communicate.
Distributed Data Storage :
There are two ways in which data can be stored on different sites. These
are:
1. Replication –
In this approach, an entire relation is stored redundantly at two or more
sites. If the entire database is available at all sites, it is a fully redundant
database. Hence, in replication, the system maintains copies of the data.
This is advantageous as it increases the availability of data at different
sites, and query requests can be processed in parallel.
However, it has certain disadvantages as well. The data needs to be kept
consistent: any change made at one site must be recorded at every site
where that relation is stored, or else inconsistency results, and this is a lot
of overhead. Concurrency control also becomes far more complex, as
concurrent access now needs to be checked across a number of sites.
2. Fragmentation –
In this approach, the relations are fragmented (i.e., divided into smaller
parts) and each fragment is stored at the site where it is required. It must
be ensured that the fragments are defined such that the original relation
can be reconstructed from them (i.e., there is no loss of data).
Fragmentation is advantageous because it does not create copies of the
data, so consistency is not a problem.
Fragmentation of relations can be done in two ways:
Horizontal fragmentation – Splitting by rows –
The relation is fragmented into groups of tuples so that each tuple is
assigned to at least one fragment.
Vertical fragmentation – Splitting by columns –
The schema of the relation is divided into smaller schemas. Each
fragment must contain a common candidate key so as to ensure a
lossless join.
In certain cases, an approach that is hybrid of fragmentation and
replication is used.
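The two fragmentation schemes described above can be sketched with an in-memory relation. This is only an illustrative sketch: the Employee table, the department-based horizontal split, and the column split are all invented for the example.

```python
# A toy Employee relation, represented as a list of rows (dicts).
employee = [
    {"emp_id": 1, "name": "Asha",  "dept": "Sales", "salary": 50000},
    {"emp_id": 2, "name": "Ravi",  "dept": "HR",    "salary": 45000},
    {"emp_id": 3, "name": "Meena", "dept": "Sales", "salary": 52000},
]

# Horizontal fragmentation: split by rows (here, one fragment per site,
# partitioned by department).
site_sales = [t for t in employee if t["dept"] == "Sales"]
site_hr    = [t for t in employee if t["dept"] == "HR"]

# Reconstruction is the union of the row fragments -- no data is lost.
reconstructed = site_sales + site_hr
assert sorted(reconstructed, key=lambda t: t["emp_id"]) == employee

# Vertical fragmentation: split by columns; every fragment keeps the
# candidate key emp_id so that a lossless join is possible.
frag_personal = [{"emp_id": t["emp_id"], "name": t["name"]} for t in employee]
frag_payroll  = [{"emp_id": t["emp_id"], "dept": t["dept"],
                  "salary": t["salary"]} for t in employee]

# Reconstruction is a join on the shared candidate key.
joined = [{**p, **next(q for q in frag_payroll if q["emp_id"] == p["emp_id"])}
          for p in frag_personal]
assert sorted(joined, key=lambda t: t["emp_id"]) == employee
print("both fragmentations reconstruct the original relation")
```

Note how the vertical fragments each carry `emp_id`: dropping the key from either fragment would make the join lossy, which is exactly the condition the text warns about.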
Applications of Distributed Database:
It is used in corporate management information systems.
It is used in multimedia applications.
It is used in military control systems, hotel chains, etc.
It is also used in manufacturing control systems.
A distributed database system is a type of database management system
that stores data across multiple computers or sites that are connected by a
network. In a distributed database system, each site has its own database,
and the databases are connected to each other to form a single, integrated
system.
The main advantage of a distributed database system is that it can
provide higher availability and reliability than a centralized database
system. Because the data is stored across multiple sites, the system can
continue to function even if one or more sites fail. In addition, a
distributed database system can provide better performance by
distributing the data and processing load across multiple sites.
There are several different architectures for distributed database
systems, including:
Client-server architecture: In this architecture, clients connect to a
central server, which manages the distributed database system. The server
is responsible for coordinating transactions, managing data storage, and
providing access control.
Peer-to-peer architecture: In this architecture, each site in the
distributed database system is connected to all other sites. Each site is
responsible for managing its own data and coordinating transactions with
other sites.
Federated architecture: In this architecture, each site in the distributed
database system maintains its own independent database, but the
databases are integrated through a middleware layer that provides a
common interface for accessing and querying the data.
Distributed database systems can be used in a variety of applications,
including e-commerce, financial services, and telecommunications.
However, designing and managing a distributed database system can be
complex and requires careful consideration of factors such as data
distribution, replication, and consistency.
Advantages of Distributed Database System :
1) Data processing is fast, as several sites participate in processing a
request.
2) The reliability and availability of the system are high.
3) It has a reduced operating cost.
4) It is easier to expand the system by adding more sites.
5) It has improved sharing ability and local autonomy.
Disadvantages of Distributed Database System :
1) The system becomes complex to manage and control.
2) Security issues must be carefully managed.
3) The system requires deadlock handling during transaction processing;
otherwise, the entire system may end up in an inconsistent state.
4) Some standardization is needed for the processing of a distributed
database system.
Two Generals' Problem
The Two Generals' Problem, a thought experiment in computer
science, highlights the difficulty of achieving consensus in a distributed
system with unreliable communication, a challenge that blockchain
technology addresses through consensus mechanisms like Proof-of-Work
(PoW).
Here's a breakdown:
The Problem:
Two generals need to coordinate an attack, but their communication
channel (messengers) is unreliable, meaning messages might be lost or
intercepted.
The Challenge:
Even if both generals send messages, there's no guarantee the other
received them, or that the sender knows the other received the
acknowledgement.
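A small Monte Carlo sketch makes the challenge concrete. The loss probability, the message contents, and the single acknowledgement round are assumptions made for illustration; the point is that the general who sent the last message can never know whether it arrived.

```python
import random

random.seed(42)

LOSS = 0.3  # assumed probability that a messenger is intercepted

def send(msg):
    """Unreliable channel: returns the message, or None if it is lost."""
    return msg if random.random() > LOSS else None

def one_round():
    # General A sends the attack order; B acknowledges only if it arrives.
    order = send("attack")
    ack = send("ack") if order else None
    # B attacks iff it received the order; A attacks iff it received the ack.
    b_attacks = order is not None
    a_attacks = ack is not None
    return a_attacks == b_attacks  # did the generals act in unison?

trials = 10_000
coordinated = sum(one_round() for _ in range(trials))
print(f"coordinated in {coordinated}/{trials} rounds")
# Adding more acknowledgement rounds shrinks, but never eliminates, the
# window in which the final message is lost and the generals disagree.
```

However many acknowledgement rounds are added, the sender of the last message is always uncertain, which is why no finite protocol solves the problem over an unreliable channel.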
Blockchain's Solution:
Blockchain uses consensus mechanisms like Proof-of-Work (PoW) to
address this challenge by ensuring that all nodes in the network agree on
the state of the blockchain, even in the presence of unreliable
communication or malicious actors.
Proof-of-Work (PoW):
PoW is a mechanism where nodes compete to solve a complex
mathematical problem, and the first to solve it gets to add a new block to
the chain, thereby verifying the transactions within that block.
Byzantine Fault Tolerance (BFT):
The Byzantine Generals Problem, a generalization of the Two Generals'
Problem, deals with scenarios where some nodes (generals) might be
malicious or faulty. BFT algorithms, like those used in blockchain, aim
to ensure that even with a certain number of faulty nodes, the system
can still reach consensus.
Practical Application:
Blockchain's reliance on consensus mechanisms like PoW allows it to
function as a trustless system, where transactions are verified and
recorded without relying on a central authority.
In essence, the Two Generals' Problem illustrates the inherent challenges
of distributed systems, while blockchain technology provides solutions
like PoW and BFT to overcome these challenges and achieve consensus
in a decentralized manner.
Byzantine Generals Problem in Blockchain
What is Byzantine General’s Problem?
The Byzantine Generals Problem was introduced in 1982 by Leslie
Lamport, Robert Shostak, and Marshall Pease. It is an impossibility
result: the authors proved that, under certain conditions (for example,
when a third or more of the participants are faulty), no protocol can
guarantee agreement, and this helps us understand the importance of
blockchain. It is essentially a game-theory problem that describes the
extent to which decentralized parties experience difficulty in reaching
consensus without any trusted central party.
The Byzantine army is divided into many battalions in this classic
problem called the Byzantine General’s problem, with each division
led by a general.
The generals connect via messenger in order to agree to a joint plan
of action in which all battalions coordinate and attack from all sides
in order to achieve success.
It is probable that traitors will try to sabotage their plan by
intercepting or changing the messages.
As a result, the purpose of this challenge is for all of the faithful
commanders to reach an agreement without the imposters tampering
with their plans.
Money and Byzantine General’s Problem
Money is a commodity whose value should be the same throughout a
society: everyone should agree on the value of a given amount of money,
despite all their differences. In early times, precious metals and rare
goods were chosen as money because their value was seen equally
throughout society. In some cases, however, such as precious metals, the
purity of the metal could not be known for certain, or checking it was
such a tedious task that it was inefficient for daily transactions. It was
therefore decided to replace gold with a highly trusted central party,
chosen by the people of the society, to establish and maintain the system
of money. With time it was realized that those central parties, however
qualified, were still not completely trustworthy, since it was simple for
them to manipulate the data.
Centralized systems do not address the Byzantine Generals problem,
which requires that truth be verified in an explicitly transparent way,
yet centralized systems give no transparency, increasing the
likelihood of data corruption.
They forgo transparency in order to attain efficiency easily and
prefer to avoid dealing with the issue entirely.
The fundamental issue of centralized systems, however, is that they
are open to corruption by the central authority, which implies that the
data can be manipulated by anyone who has control of the database
itself because the centralized system concentrates all power on one
central decision maker.
Therefore, Bitcoin was invented to make the system of money
decentralized using blockchain to make money verifiable, counterfeit-
resistant, trustless, and separate from a central agency.
How Bitcoin Solves the Byzantine General’s Problem?
In the Byzantine Generals Problem, the untampered agreement that all
the loyal generals need to agree to is the blockchain. Blockchain is a
public, distributed ledger that contains the records of all transactions. If
all users of the Bitcoin network, known as nodes, could agree on which
transactions occurred and in what order, they could verify the ownership
and create a functioning, trustless money system without the need for a
centralized authority. Due to its decentralized nature, blockchain relies
heavily on a consensus technique to validate transactions. It is a peer-to-
peer network that offers its users transparency as well as trust. Its
distributed ledger is what sets it apart from other systems. Blockchain
technology can be applied to any system that requires proper
verification.
Proof Of Work: The network would have to be provable, counterfeit-
resistant, and trust-free in order to solve the Byzantine General’s
Problem. Bitcoin overcame the Byzantine General’s Problem by
employing a Proof-of-Work technique to create a clear, objective
ruleset for the blockchain. Proof of Work (PoW) is the method of
adding fresh blocks of transactions to the blockchain of a
cryptocurrency. In this scenario, the work consists of finding a hash (a
long string of characters) that satisfies the difficulty target set for the
current block.
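A minimal sketch of this hashing puzzle, assuming SHA-256 and a leading-zeros condition as a simplified stand-in for Bitcoin's actual difficulty target:

```python
import hashlib

def mine(block_data: str, difficulty: int = 4):
    """Find a nonce whose SHA-256 digest of (data + nonce) starts with
    `difficulty` hex zeros -- a toy version of Bitcoin's target check."""
    target = "0" * difficulty
    nonce = 0
    while True:
        digest = hashlib.sha256(f"{block_data}{nonce}".encode()).hexdigest()
        if digest.startswith(target):
            return nonce, digest
        nonce += 1

nonce, digest = mine("block: Alice pays Bob 1 BTC")
print(nonce, digest)

# Finding the nonce takes many hash attempts; verifying it takes one.
check = hashlib.sha256(f"block: Alice pays Bob 1 BTC{nonce}".encode()).hexdigest()
assert check == digest and digest.startswith("0000")
```

The asymmetry in the last comment is the key property: work is expensive to produce but cheap for every node to verify, which is what makes the rule objective.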
1. Counterfeit Resistant: Proof-of-Work requires network participants
to present proof of their work in the form of a valid hash in order for
their block, i.e. piece of information, to be regarded as valid. Proof-
of-Work requires miners to expend significant amounts of energy
and money in order to generate blocks, encouraging them to
broadcast accurate information and so protecting the network. Proof-
of-Work is one of the only ways for a decentralized network to agree
on a single source of truth, which is essential for a monetary system.
There can be no disagreement about, or tampering with, the
information on the blockchain network because the rules are objective:
the ruleset defining which transactions are valid and which are invalid,
as well as the system for choosing who can mint new bitcoin, are both
objective.
2. Provable: Once a block is uploaded to the blockchain, it is
incredibly difficult to erase, rendering Bitcoin’s history immutable.
As a result, participants of the blockchain network may always agree
on the state of the blockchain and all transactions inside it. Each node
independently verifies whether blocks satisfy the Proof-of-Work
criterion and whether transactions satisfy additional requirements.
3. Trust-free: If any network member attempts to broadcast misleading
information, all network nodes immediately detect it as objectively
invalid and ignore it. Because each node on the Bitcoin network can
verify every information on the network, there is no need to trust
other network members, making Bitcoin a trustless system.
Byzantine Fault Tolerance (BFT)
Byzantine Fault Tolerance was developed to address the Byzantine
General’s Problem. The Byzantine General’s Problem, a logical thought
experiment in which multiple generals must coordinate an attack on a
city, is where the idea for BFT originated.
Byzantine Fault Tolerance is one of the core characteristics of
trustworthy blockchain rules and features.
A system is said to have BFT when it continues to operate properly as
long as two-thirds of the network can agree, i.e., reach a consensus.
Blockchain networks’ most popular consensus protocols, such as
proof-of-work, proof-of-stake, and proof-of-authority, all have some
BFT characteristics.
In order to create a decentralized network, the BFT is essential.
The consensus method determines the precise network structure. For
instance, a BFT network has a leader as well as peers that can and cannot
validate. To maintain the ordering of blockchain smart-contract
transactions and the consistency of the global state through local
transaction replay, consensus messages must pass between the relevant
peers.
More inventive approaches to designing BFT systems will be found and
put into practice as more individuals and companies investigate
distributed and decentralized systems. Systems that use BFT are also
employed in sectors outside of blockchains, such as nuclear power,
space exploration, and aviation.
Byzantine General’s Problem in a Distributed System
In order to address this issue, honest nodes (such as computers or other
physical devices) must be able to establish an agreement in the presence
of dishonest nodes.
In the Byzantine agreement issue, an arbitrary processor initializes a
single value that must be agreed upon, and all nonfaulty processes
must agree on that value. Every processor has its own beginning
value in the consensus issue, and all nonfaulty processors must agree
on a single common value.
The Byzantine army’s position can be seen in computer networks.
The divisions can be viewed as computer nodes in the network, and
the commanders as programs running a ledger that records
transactions and events in the order that they occur. The ledgers are
the same for all systems, and if any of them is changed, the other
ledgers are updated as well if the changes are shown to be true, so all
distributed ledgers should be in agreement.
Byzantine General’s Problem Example
A basic Byzantine fault is a digital signal that is stuck at “1/2,” i.e. a
voltage that is anywhere between the voltages for a valid logical “0” and
a valid logical “1.” Because these voltages are in the region of a gate’s
transfer function’s maximum gain, little quantities of noise on the gate’s
input become enormous amounts of noise on the gate’s output. This is
due to the fact that “digital circuits are simply analog circuitry driven to
extremes.”
This problem is solvable because, with a dominating input, even a
Byzantine input has no output impact.
An excellent composite example is the well-known 3-input majority
logic “voter.”
If one of the inputs is “1/2” and the other two are both 0 or both 1,
the result is 0 or 1 (due to masking within the voter).
When one of the inputs is “1/2” and the other two are different
values, the output can be 0, “1/2,” or 1, depending on the precise gain
and threshold voltages of the voter gates and the properties of the
“1/2” signal.
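The voter behavior described above can be modeled digitally. This is a deliberately crude model: the HALF marker and the masking rule are simplifications, since the real metastable behavior depends on analog gains and threshold voltages, as the text notes.

```python
# Model the metastable "1/2" level as a sentinel value HALF.
HALF = "1/2"

def majority3(a, b, c):
    """Toy 3-input majority voter with a possibly-Byzantine input."""
    inputs = [a, b, c]
    # If two well-defined inputs agree, they mask any Byzantine third input.
    for v in (0, 1):
        if inputs.count(v) >= 2:
            return v
    # Otherwise the Byzantine input can propagate: output is indeterminate.
    return HALF

assert majority3(0, 0, HALF) == 0      # masked by the agreeing pair
assert majority3(1, 1, HALF) == 1      # masked by the agreeing pair
assert majority3(0, 1, HALF) == HALF   # "1/2" can appear at the output
```

The three assertions mirror the two cases in the text: a dominating (agreeing) pair masks the fault, while a split pair lets the Byzantine value through.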
Hadoop – HDFS (Hadoop Distributed File
System)
Before we head over to learn about HDFS (Hadoop Distributed File System),
we should know what a file system actually is. A file system is a kind of
data structure or method that an operating system uses to manage files on
disk: it allows the user to maintain and retrieve data from the local disk.
Examples of Windows file systems are NTFS (New Technology File
System) and FAT32 (File Allocation Table 32). FAT32 is used in some older
versions of Windows but can be utilized on all versions of Windows XP.
Similarly, Linux uses file systems such as ext3 and ext4.
What is DFS?
DFS stands for distributed file system; it is the concept of storing a file
across multiple nodes in a distributed manner. A DFS provides the
abstraction of a single large system whose storage equals the sum of the
storage of the nodes in the cluster.
Let’s understand this with an example. Suppose you have a DFS
comprising 4 machines of 10TB each. You can then store, say, 30TB across
this DFS, since it presents a combined machine of size 40TB. The 30TB of
data is distributed among these nodes in the form of blocks.
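The example above can be sketched as a simple block placement. The node count and capacities come from the example; the 1TB block granularity and the round-robin policy are assumptions made for the sketch.

```python
# Distributing a 30TB file as fixed-size blocks over a 4-node cluster
# of 10TB each (figures taken from the example above).
NODE_COUNT = 4
NODE_CAPACITY_TB = 10
FILE_SIZE_TB = 30
BLOCK_SIZE_TB = 1  # assumed block granularity for this sketch

# 30TB fits in the combined 40TB machine the DFS abstraction presents.
assert FILE_SIZE_TB <= NODE_COUNT * NODE_CAPACITY_TB

# Round-robin placement of blocks across nodes.
placement = {n: [] for n in range(NODE_COUNT)}
for block in range(FILE_SIZE_TB // BLOCK_SIZE_TB):
    placement[block % NODE_COUNT].append(block)

for node, blocks in placement.items():
    print(f"node {node}: {len(blocks)} blocks")

# Every node stays within its 10TB capacity.
assert all(len(b) * BLOCK_SIZE_TB <= NODE_CAPACITY_TB
           for b in placement.values())
```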
Why We Need DFS?
You might be thinking that if we can store a 30TB file on a single
system, why do we need a DFS? One reason is that the disk capacity of a
single system can only grow up to a point. Even if you manage to keep the
data on one system, you then face a processing problem: processing large
datasets on a single machine is not efficient.
Let’s understand this with an example. Suppose you have a file of size 40TB
to process. On a single machine it might take, say, 4 hours to process
completely. With a DFS, as you can see in the image below, the 40TB file is
distributed among 4 nodes in a cluster, each node storing 10TB of the file.
Since all of these nodes work simultaneously, processing takes only about
1 hour, which is why we need a DFS.
Local File System Processing:
Distributed File System Processing:
Overview – HDFS
Now that you are familiar with the term file system, let’s begin with
HDFS. HDFS (Hadoop Distributed File System) is used as the storage layer
of a Hadoop cluster. It is mainly designed to work on commodity hardware
(inexpensive devices), following a distributed file system design. HDFS is
designed around the idea of storing data in large blocks rather than many
small blocks. HDFS provides fault tolerance and high availability to the
storage layer and the other devices present in the Hadoop cluster.
HDFS is capable of handling large data with high volume, velocity, and
variety, which makes Hadoop work more efficiently and reliably with easy
access to all its components. HDFS stores data in blocks, where each block
is 128MB by default; this is configurable, meaning you can change it
according to your requirements in the hdfs-site.xml file in your Hadoop
directory.
Some Important Features of HDFS(Hadoop Distributed File
System)
It’s easy to access the files stored in HDFS.
HDFS also provides high availability and fault tolerance.
It provides scalability to scale nodes up or down as per our
requirement.
Data is stored in a distributed manner, i.e., various DataNodes are
responsible for storing the data.
HDFS provides replication, so there is no fear of data loss.
HDFS provides high reliability, as it can store data in the range of
petabytes.
HDFS has built-in servers in the NameNode and DataNode that help
to easily retrieve cluster information.
It provides high throughput.
HDFS Storage Daemons
As we all know, Hadoop works on the MapReduce algorithm, which has a
master-slave architecture; HDFS likewise has a NameNode and DataNodes
that work in a similar pattern.
1. NameNode(Master)
2. DataNode(Slave)
1. NameNode: The NameNode works as the master in a Hadoop cluster and
guides the DataNodes (slaves). The NameNode is mainly used for storing
metadata, i.e., data about the data. The metadata can be the transaction
logs that keep track of user activity in the Hadoop cluster. The metadata
can also include the name of a file, its size, and information about the
location (block number, block IDs) on the DataNodes, which the NameNode
stores in order to find the closest DataNode for faster communication. The
NameNode instructs the DataNodes with operations like delete, create,
replicate, etc.
Since the NameNode works as the master, it should have high RAM and
processing power in order to maintain and guide all the slaves in the
Hadoop cluster. The NameNode receives heartbeat signals and block reports
from all the slaves, i.e., the DataNodes.
2. DataNode: DataNodes work as slaves. DataNodes are mainly used for
storing the data in a Hadoop cluster; the number of DataNodes can range
from 1 to 500 or even more. The more DataNodes the Hadoop cluster has,
the more data can be stored, so it is advised that each DataNode have a
high storage capacity to hold a large number of file blocks. A DataNode
performs operations like creation and deletion according to the
instructions provided by the NameNode.
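The division of labor above can be sketched in a few lines. This is a hypothetical model, not HDFS code: the class, file names, and round-robin replica placement are invented (real HDFS placement is rack-aware), but it shows that the NameNode holds only metadata while DataNodes hold the bytes.

```python
from collections import defaultdict

class NameNode:
    """Toy NameNode: maps files to block IDs and block IDs to DataNodes."""
    def __init__(self, replication=3):
        self.replication = replication
        self.file_blocks = {}                      # filename -> [block ids]
        self.block_locations = defaultdict(list)   # block id -> [datanodes]

    def add_file(self, name, block_ids, datanodes):
        self.file_blocks[name] = block_ids
        for i, b in enumerate(block_ids):
            # Place each replica on a different DataNode (round-robin).
            for r in range(self.replication):
                self.block_locations[b].append(
                    datanodes[(i + r) % len(datanodes)])

    def locate(self, name):
        # A client asks the NameNode where the blocks live, then reads
        # the actual bytes directly from the DataNodes.
        return {b: self.block_locations[b] for b in self.file_blocks[name]}

nn = NameNode()
nn.add_file("/logs/app.log", ["blk_1", "blk_2"], ["dn1", "dn2", "dn3", "dn4"])
print(nn.locate("/logs/app.log"))
```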
Objectives and Assumptions Of HDFS
1. System Failure: Since a Hadoop cluster consists of many nodes built
from commodity hardware, node failure is possible, so a fundamental goal
of HDFS is to detect such failures and recover from them.
2. Maintaining Large Datasets: As HDFS handles files ranging in size from
gigabytes to petabytes, it must be capable of dealing with these very large
datasets on a single cluster.
3. Moving Data is Costlier than Moving the Computation: If a
computational operation is performed near the location where the data
resides, it is considerably faster; the overall throughput of the system
increases and network congestion is minimized, which is why HDFS makes
this assumption.
4. Portability Across Various Platforms: HDFS possesses portability, which
allows it to switch across diverse hardware and software platforms.
5. Simple Coherency Model: HDFS uses a write-once-read-many access
model for files. A file, once written and closed, should not be changed;
data can only be appended. This assumption helps minimize data-coherency
issues. MapReduce fits perfectly with such a file model.
6. Scalability: HDFS is designed to be scalable as the data storage
requirements increase over time. It can easily scale up or down by adding or
removing nodes to the cluster. This helps to ensure that the system can
handle large amounts of data without compromising performance.
7. Security: HDFS provides several security mechanisms to protect data
stored on the cluster. It supports authentication and authorization
mechanisms to control access to data, encryption of data in transit and at
rest, and data integrity checks to detect any tampering or corruption.
8. Data Locality: HDFS aims to move the computation to where the data
resides rather than moving the data to the computation. This approach
minimizes network traffic and enhances performance by processing data on
local nodes.
9. Cost-Effective: HDFS can run on low-cost commodity hardware, which
makes it a cost-effective solution for large-scale data processing.
Additionally, the ability to scale up or down as required means that
organizations can start small and expand over time, reducing upfront costs.
10. Support for Various File Formats: HDFS is designed to support a
wide range of file formats, including structured, semi-structured, and
unstructured data. This makes it easier to store and process different types of
data using a single system, simplifying data management and reducing
costs.
Distributed Hash Tables with Kademlia
Distributed Hash Tables with Kademlia revolutionize decentralized systems
by efficiently storing and retrieving data across a network of nodes. This
article explores Kademlia's key principles, decentralized routing, and fault-
tolerant architecture, which are pivotal in modern peer-to-peer networks.
What are Distributed Hash Tables (DHTs)
Distributed Hash Tables (DHTs) are decentralized distributed systems that
provide a key-value store across a network of participating nodes. They are
designed to efficiently distribute and look up data across a large number of
nodes without relying on centralized coordination or control.
Below are some key characteristics and components of DHTs:
Decentralization: DHTs operate in a decentralized manner where
each node in the network is responsible for storing and retrieving a
portion of the data based on a consistent hashing algorithm.
Key-Value Storage: They provide a distributed key-value storage
abstraction where data items are associated with unique keys. Keys
are typically hashed to determine which node in the DHT network is
responsible for storing and managing each key-value pair.
Routing: Nodes in a DHT maintain routing tables that help them
efficiently route messages to the appropriate node responsible for a
given key. This routing is typically achieved using techniques like
consistent hashing or variants thereof.
Scalability: DHTs are designed to scale efficiently with the number
of nodes and the amount of data in the system. Adding or removing
nodes does not significantly disrupt the overall functionality of the
DHT, and data can be evenly distributed across the network.
Consistency: DHTs typically provide eventual consistency
guarantees, where updates to the DHT eventually propagate across
the network and all replicas converge to a consistent state over time.
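The key-to-node mapping described above can be sketched with a toy consistent-hash ring. The node names and the use of SHA-1 are assumptions for the sketch; real DHTs add replication and routing on top of this basic idea.

```python
import hashlib
from bisect import bisect_right

def h(s: str) -> int:
    """Hash node names and keys into the same identifier space."""
    return int(hashlib.sha1(s.encode()).hexdigest(), 16)

nodes = ["node-a", "node-b", "node-c", "node-d"]
# The ring: node hash points in sorted order.
ring = sorted((h(n), n) for n in nodes)

def responsible_node(key: str) -> str:
    """Assign a key to the first node clockwise from the key's hash."""
    points = [p for p, _ in ring]
    i = bisect_right(points, h(key)) % len(ring)
    return ring[i][1]

for key in ["alice", "bob", "carol"]:
    print(key, "->", responsible_node(key))
```

Because placement depends only on hashes, any node can compute which peer owns a key without central coordination, and adding or removing a node remaps only the keys adjacent to it on the ring.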
Importance of DHTs in Distributed Systems
Below are the importance of DHTs in Distributed Systems:
Decentralization: They avoid concentrating data on a few nodes,
thereby increasing the reliability of the network.
Scalability: DHTs give systems the ability to integrate new nodes
with little or no overhead, while the load adjusts dynamically.
Efficient Data Retrieval: They offer efficient access to data by key
and are therefore useful for large distributed computing applications.
Fault Tolerance: DHTs can handle node failures and network
partitions while keeping data available and consistent.
Flexibility: They are useful for operations such as file sharing,
content delivery networks (CDNs), decentralized cryptocurrencies,
and other uses in distributed computing.
What is Kademlia?
Kademlia is one of the commonly used algorithms employed in the
Distributed Hash Table (DHT) in peer-to-peer networks. It is mostly used as
a solution for indexing and searching data and is highly scalable as well as
fault-tolerant.
Kademlia organizes nodes into a tree, called the Kademlia tree. Every
node is assigned a unique ID, generated by applying a
cryptographic hash function to the node's IP address.
Each node keeps information about the other nodes that are nearest to
it in the identifier space, so incoming data requests can
be routed efficiently.
This structure gives Kademlia fast lookup times while remaining
resistant to node failure, making it ideal for peer-to-peer applications
such as file sharing and distributed storage.
Kademlia Protocol explanation
The Kademlia protocol is a robust, decentralized system for distributed hash
tables (DHTs). It optimizes key-value storage and retrieval across large
networks by using efficient routing algorithms and ensuring fault tolerance.
Node ID and Key Space: Nodes and data items each carry an ID in
the form of a 160-bit key produced by a hash function such as
SHA-1.
Routing Table: Each node keeps a routing table that stores contacts
for other nodes, grouped into buckets according to the XOR distance
between those nodes' IDs and the current node's ID.
Peer Discovery: Nodes enter the network by making calls to other
nodes with which they have some prior knowledge and then building
their routing tables in relation to the responses.
Store and Retrieve: Nodes can hold data items and retrieve them by
querying the network for the information associated with a key.
Iterative Lookup: To find the nearest node or the closest data item, a
node repeatedly queries nodes that are near the target key, updating
its routing tables each time a reply is received.
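The XOR metric at the heart of these lookups can be sketched directly. The peer names and the choice of k = 3 are illustrative; the metric properties checked at the end are the real reason Kademlia's routing works.

```python
import hashlib

def node_id(name: str) -> int:
    """160-bit ID from SHA-1, as in the protocol description above."""
    return int(hashlib.sha1(name.encode()).hexdigest(), 16)

def xor_distance(a: int, b: int) -> int:
    """Kademlia's notion of closeness: distance(a, b) = a XOR b."""
    return a ^ b

peers = [node_id(f"peer-{i}") for i in range(8)]
key = node_id("some-file.txt")

# A lookup returns the k nodes closest to the target key (k = 3 here).
k_closest = sorted(peers, key=lambda p: xor_distance(p, key))[:3]
for p in k_closest:
    print(f"{p:040x}")

# XOR is a true metric: symmetric, zero iff equal, and it satisfies the
# triangle inequality (XOR is addition without carries, so a^c <= a^b + b^c).
a, b, c = peers[:3]
assert xor_distance(a, b) == xor_distance(b, a)
assert xor_distance(a, a) == 0
assert xor_distance(a, c) <= xor_distance(a, b) + xor_distance(b, c)
```

Bucket indices in the routing table correspond to the position of the highest differing bit, so each hop in an iterative lookup roughly halves the remaining distance to the target key.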
Implementation of Kademlia
The implementation of Kademlia involves translating its decentralized
routing and storage principles into practical software.
Node ID Generation: The node generates what is called a node ID,
usually produced from a hash function such as SHA-1 from the node
IP address or any other identification.
Routing Table Initialization: Each node initializes a routing table,
commonly organized as k-buckets; every bucket holds contact
information (IP address and node ID) for other nodes.
Joining the Network: When a node joins the network, it contacts
nodes it already knows (bootstrap nodes) to populate its routing
table, and it announces its own node ID and contact information to
other nodes.
Routing and Data Storage: To store data, a node hashes the data key
and searches its routing table for the K nodes nearest to that key. It
then replicates the data at those nodes, effectively decentralizing the
data and providing fail-safes.
Data Retrieval: Nodes use an iterative lookup process to retrieve
data: a node starts by querying the nodes close to the data key,
eventually reaching the required datum or determining that it is not
available.
Real-World Applications
Below are some real-world examples of Distributed Hash Tables with
Kademlia:
Peer-to-Peer File Sharing: Using DHTs, applications such as
BitTorrent employ decentralized tracking as well as distribution of
the files among the peers.
Decentralized Storage Networks: Applications like IPFS
(InterPlanetary File System) use DHT to annotate content and retrieve
this content from the nodes distributed all around the network.
Blockchain and Cryptocurrencies: Some of the blockchain
applications utilize DHTs in order to manage a decentralized process
of peer discovery and network self-organization while improving
scalability.
Content Delivery Networks (CDNs): DHTs enable the efficient
caching of content to other nodes based on specific geographical
locations, thereby enhancing the efficiency and reliability of content
distribution.
Decentralized Messaging and Communication: Applications in this
class, such as secure messaging platforms and VoIP services, use
DHTs to discover peers and forward messages without the use of
central servers.
IoT Networks: DHTs can organize the storage and retrieval of data
within a decentralized IoT network, enabling scalable and stable
retrieval of data from connected devices.
Open Source Implementations of Kademlia
Below are some open source implementations of Kademlia:
libp2p (Go): libp2p by Protocol Labs contains a Kademlia DHT that
can be seen in IPFS and Filecoin. It is intended for distributed
networking and P2P services.
Kademlia (Python): Several Python implementations of Kademlia
exist, of which kademlia-dht and pykad offer fundamental DHT
services and are employed in small-scale and educational settings.
Kademlia (Java): TomP2P and kad-socks are implementations of
Kademlia in Java suitable for building distributed systems and
peer-to-peer applications in the Java domain.
Rust-Kademlia (Rust): Rust-Kademlia is a Kademlia DHT that is
implemented in Rust for high performance in applications based on
Rust that need to store and access information in a decentralized
manner.
J-Kademlia (Java): J-Kademlia is another Java-based Kademlia
implementation that can be used as a solid base for constructing
decentralized applications and services.
Challenges of Distributed Hash Tables with Kademlia
Implementing and operating Distributed Hash Tables (DHTs) with
Kademlia comes with several challenges, including:
Routing Efficiency: Maintaining efficient routing tables across
potentially large networks while ensuring nodes can locate data
swiftly.
Scalability: Ensuring the system scales seamlessly as nodes are
added or removed, without compromising performance.
Fault Tolerance: Handling node failures or departures gracefully
without losing data or impacting system availability.
Data Integrity: Ensuring data consistency and reliability across
distributed nodes, especially under concurrent updates or network
partitions.
Security: Protecting against malicious attacks or unauthorized access
to data within the decentralized network.
Use Cases Best Suited for Kademlia
Peer-to-Peer File Sharing: Thanks to Kademlia's efficient data
lookup and storage, it is well suited to decentralized file-sharing
networks such as BitTorrent, where users discover which peers to
download file pieces from.
Decentralized Storage Networks: Applications such as IPFS utilize
Kademlia for managing distributed content addressing; data can be
found and pulled from different nodes rather than relying on a single
central server.
Blockchain and Cryptocurrencies: Kademlia is used by several
blockchain networks for the discovery and synchronization of peers,
by which the nodes can know each other and exchange information
freely without a controlling center.
Content Delivery Networks (CDNs): Kademlia can disseminate
content across the nodes of a network, so content can be cached and
accessed with minimal delay.
Decentralized Messaging and Communication: Using Kademlia,
instant messaging and VoIP services can connect directly with other
peers, eliminating the middleman and enhancing anonymity for
secure communication.
Introduction to ASIC Resistance
Cryptocurrency technology continues to evolve, with ASIC resistance
being a crucial aspect for many digital currencies. ASIC (Application-
Specific Integrated Circuit) resistance is a feature designed to
prevent the domination of mining activities by ASIC miners, which
are highly specialized hardware. This guide delves into the
importance of ASIC resistance in the crypto world, its implications,
and the future it holds.
What is ASIC Resistance?
ASIC resistance is a feature of certain cryptocurrencies that
prevents the mining of coins using highly specialized hardware. This
is intended to keep mining accessible and decentralized, allowing
more individuals to participate using common hardware like CPUs
and GPUs.
Key Benefits of ASIC Resistance:
Decentralization: Prevents mining centralization, a significant
concern in networks like Bitcoin.
Accessibility: Allows average users with standard computers
to participate in mining.
Security: Diversifies the mining landscape, potentially
increasing network security.
The Role of Algorithms in ASIC
Resistance
The algorithm a cryptocurrency uses can determine its ASIC
resistance. Some algorithms are designed to be memory-hard,
requiring more memory than what is feasible for an ASIC to
effectively mine.
Popular ASIC Resistant Algorithms:
Ethash: Used by Ethereum before its transition to proof of stake; designed to be memory-intensive.
RandomX: Monero's choice, focuses on being CPU-friendly.
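The memory-hard idea behind these algorithms can be illustrated with scrypt, a memory-hard key-derivation function available in Python's standard library (it needs OpenSSL support). This is a sketch of the principle only, not the actual Ethash or RandomX algorithms; the input bytes and parameter choices are made up.

```python
import hashlib, os

# scrypt is memory-hard: the n parameter forces roughly 128 * r * n bytes
# of working memory, which is expensive to bake into a specialized ASIC.
salt = os.urandom(16)

# A "light" setting needs about 16 KiB of memory...
light = hashlib.scrypt(b"block-header", salt=salt, n=2**4, r=8, p=1, dklen=32)

# ...while this setting needs about 16 MiB for the same input,
# making parallel hardware implementations far more costly.
heavy = hashlib.scrypt(b"block-header", salt=salt, n=2**14, r=8, p=1, dklen=32)

print(light.hex())
print(heavy.hex())
```

Raising `n` changes the output and, more importantly, the memory footprint required to compute it, which is exactly the cost dimension that commodity CPUs and GPUs handle well but ASICs do not.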
The Debate: ASIC Resistance vs.
ASIC Adoption
The debate between maintaining ASIC resistance and embracing
ASIC mining is complex. ASICs offer efficiency and power but risk
centralizing the mining process.
Pros and Cons:
ASIC Resistance Pros:
o Greater decentralization.
o Lower entry barriers for miners.
ASIC Resistance Cons:
o Potentially less network security compared to ASIC-
dominated networks.
o Higher vulnerability to certain types of attacks.
ASIC Adoption Pros:
o Increased mining efficiency and power.
o Potentially higher network security.
ASIC Adoption Cons:
o Risk of mining centralization.
o Higher entry barriers for new miners.
What is Turing-Completeness in Ethereum?
Turing completeness is a concept from computer science that describes a
system’s ability to perform any computation that can be expressed
algorithmically, given enough time and resources. In the context of
Ethereum, a Turing-complete system means that its programming language,
Solidity, can execute any computational task or algorithm. This capability
allows developers to create complex smart contracts and decentralized
applications (dApps) with a wide range of functionalities. Essentially,
Ethereum’s Turing-completeness provides flexibility and power, enabling a
diverse array of innovations on its blockchain.
What is Turing-Completeness?
Turing-completeness is a concept from theoretical computer science that
refers to a system’s capability to perform any computation that can be
described by an algorithm. The term is named after the mathematician and
computer scientist Alan Turing, who formalized the idea in the 1930s.
Turing-Completeness in Blockchain Technology
Turing-completeness in blockchain technology refers to the ability of a
blockchain’s smart contract platform to support complex computations and
logic, similar to how a general-purpose computer can execute any algorithm.
This capability significantly expands what can be achieved on the
blockchain beyond simple transactions.
1. Memory and Storage: The ability to store and manipulate data as
needed for complex computations.
2. Conditional Statements: Support for conditional logic (e.g., if-else
statements) to make decisions based on various inputs.
3. Loops and Iteration: The capability to execute repetitive tasks or
loops, which are essential for many algorithms and processes.
Ethereum and Turing-Completeness
Ethereum is a pioneering blockchain platform that leverages the concept of
Turing-completeness to support a wide range of decentralized applications
(dApps) and smart contracts.
1. Memory and Storage: Ethereum’s architecture supports dynamic
memory and storage capabilities. Smart contracts can manage state
and store data, allowing for complex interactions and persistent
records.
2. Conditional Logic: Ethereum supports conditional statements (e.g.,
if-else) in smart contracts, which lets contracts make decisions based
on input conditions.
3. Loops and Iterations: Contracts can use loops to perform repetitive
tasks or processes, essential for complex computations and
operations.
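The three capabilities above can be mimicked in ordinary Python as a rough analogy for what a smart contract does on-chain. This is a hypothetical toy, not Solidity and not Ethereum's execution model; the `ToyToken` class and its method names are invented for illustration.

```python
class ToyToken:
    """A toy 'contract' showing the three Turing-completeness ingredients:
    persistent state, conditional logic, and loops."""

    def __init__(self):
        self.balances = {}  # state: storage that persists across calls

    def mint(self, account: str, amount: int):
        self.balances[account] = self.balances.get(account, 0) + amount

    def transfer(self, sender: str, receiver: str, amount: int) -> bool:
        if self.balances.get(sender, 0) < amount:  # conditional logic
            return False
        self.balances[sender] -= amount
        self.balances[receiver] = self.balances.get(receiver, 0) + amount
        return True

    def airdrop(self, accounts: list, amount: int):
        for account in accounts:  # loop / iteration
            self.mint(account, amount)

token = ToyToken()
token.airdrop(["alice", "bob"], 100)
print(token.transfer("alice", "bob", 40))  # True
print(token.balances)
```

On Ethereum the equivalent logic would be written in Solidity and every loop iteration would cost gas, which is why unbounded loops are a common source of the high-gas-cost problems discussed below.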
Advantages of Turing-Completeness in Ethereum
1. Enhanced Flexibility: Developers can create smart contracts with
intricate logic and conditions, enabling a broad spectrum of
functionalities from simple token transfers to complex financial
instruments.
2. Support for Diverse Applications: Turing-completeness enables the
creation of sophisticated DeFi applications, such as decentralized
exchanges, lending platforms, and synthetic assets, which rely on
complex interactions and calculations.
3. Innovation and Development Opportunities: Turing-complete
contracts can interact with other contracts and dApps, fostering a rich
ecosystem of interconnected services and enabling complex use cases
that leverage multiple components.
4. Automation and Efficiency: Automation through smart contracts
can streamline processes and reduce operational costs by removing
intermediaries and reducing administrative overhead.
5. Enhanced Developer Empowerment: The Turing-complete nature
of Ethereum has led to the development of a robust ecosystem of
tools, libraries, and frameworks that support various aspects of smart
contract and dApp development.
Challenges of Turing-Completeness in Ethereum
Turing-completeness in Ethereum opens up a vast range of possibilities for
decentralized applications and smart contracts but also introduces several
challenges that need to be managed carefully.
1. High Gas Costs: Executing complex smart contracts often requires
more computational resources, leading to higher gas fees. This can
make transactions expensive, especially when dealing with intricate
contracts or during periods of high network demand.
2. Network Congestion: Complex contracts can contribute to network
congestion by consuming significant resources, which affects overall
transaction speed and network performance.
3. Difficult Development: Writing, testing, and debugging Turing-
complete smart contracts is inherently more complex compared to
simpler systems. Ensuring that contracts behave as expected requires
rigorous testing and can be resource-intensive.
4. Ongoing Maintenance: Managing and updating smart contracts can
be challenging. Changes to the code need to be made carefully to
avoid introducing new issues, and once deployed, smart contracts are
often immutable, making fixes more complicated.
5. Limited Throughput: The computational demands of Turing-
complete contracts can limit the Ethereum network’s transaction
throughput. As more complex contracts and dApps are deployed, they
can strain the network’s capacity and impact scalability.
Comparison with Other Blockchain Platforms
| Features | Ethereum | Bitcoin | Binance Smart Chain (BSC) | Polkadot | Solana |
| --- | --- | --- | --- | --- | --- |
| Turing-Completeness | Yes | No | Yes | Yes | Yes |
| Smart Contract Language | Solidity, Vyper | Limited scripting capabilities | Solidity | Ink! (for Rust), Substrate | Rust, Move |
| Development Environment | Remix, Hardhat, Truffle | Script-based, limited | Hardhat, Remix | Substrate Framework | Anchor, Solana CLI |
| Transaction Speed | ~13-15 transactions per second | ~7 transactions per second | ~55-60 transactions per second | ~1000 transactions per second | ~65,000 transactions per second |
| Gas Fees | Variable | N/A | Generally lower than Ethereum | Generally lower | Low |
| Consensus Mechanism | Proof of Stake (PoS) | Proof of Work (PoW) | Proof of Staked Authority (PoSA) | Nominated Proof of Stake (NPoS) | Proof of History (PoH) + Proof of Stake (PoS) |
| Privacy Features | Transparent, on-chain | Transparent, on-chain | Transparent, on-chain | Transparent, on-chain | Transparent, on-chain |
| Use Cases | DeFi, DAOs, NFTs | Digital gold, store of value | DeFi, Binance ecosystem | Interoperability, custom blockchains | High-performance dApps, DeFi |
Impact on Development and Innovation
1. Enabling Complex Applications: Turing-completeness allows
developers to create highly sophisticated smart contracts that can
handle complex business logic, automate multi-step processes, and
interact with other smart contracts.
2. New Business Models: Turing-completeness facilitates the
development of new business models and decentralized services.
Examples include automated market makers (AMMs), decentralized
autonomous organizations (DAOs), and non-fungible tokens (NFTs).
3. Interoperable dApps: Turing-complete smart contracts can interact
with each other, enabling the creation of complex decentralized
applications (dApps) that combine functionalities and services. This
creates a rich ecosystem where different components can work
together seamlessly.
4. Scalability Solutions: The challenges associated with Turing-
completeness, such as high gas fees and network congestion, have led
to advancements in scalability solutions like Ethereum 2.0 and Layer-
2 solutions (e.g., Rollups). These innovations aim to improve the
efficiency and capacity of the Ethereum network.
5. Impact on Other Sectors: Complex smart contracts can be used to
create transparent and efficient supply chain solutions, improving
traceability and reducing fraud. Turing-complete contracts enable
innovative governance models, such as DAOs and decentralized
voting systems, allowing for more democratic and transparent
decision-making processes.
Conclusion
In conclusion, Turing-completeness in Ethereum has revolutionized the
blockchain space by enabling the creation of complex and highly
customizable smart contracts. This capability has driven innovation in areas
like decentralized finance (DeFi), decentralized autonomous organizations
(DAOs), and various other applications. While it introduces challenges such
as security risks and high gas fees, it also fosters a diverse and dynamic
ecosystem. Overall, Turing-completeness empowers developers to build
sophisticated solutions, supports a wide range of use cases, and continuously
fuels advancements in blockchain technology.
HASH Function
A Hash Function (H) takes a variable-length block of data
and returns a hash value of a fixed size. A good hash
function has a property that when it is applied to a large
number of inputs, the outputs will be evenly distributed and
appear random. Generally, the primary purpose of a hash
function is to maintain data integrity: any change to any bit
or bits of the data will, with high probability, result in a
change to the hash code.
The type of hash function that is needed for security
purposes is called a cryptographic hash function.
A cryptographic hash function (or cryptographic hash
algorithm) is an algorithm for which it is computationally
infeasible (no attack is significantly more efficient than
brute force) to find either:
A data object that maps to a pre-specified hash result
(the one-way property)
Two data objects that map to the same hash result (the
collision-free property)
Because of these properties, a hash function is often used to
check whether data has changed.
Block diagram of a cryptographic hash function: h = H(M)
Working on Hashing Algorithms in
Cryptography
Now that we have a basic idea of what a hash function is in
cryptography, let's break down the internal mechanics.
The first act of the hashing algorithm is to divide the large
input data into blocks of equal size. Further, the algorithm
applies the hashing process to the data blocks one by one.
Though one block is hashed separately, all the blocks are
related to each other. The output hash value for the first
data block is taken as an input value and is summed up with
the second data block. Similarly, the hashed output of the
second block is summed up with the third block, and the
summed-up input value is again hashed. And this process
goes on and on until you get the final hash output, which is
the summed-up value of all the blocks that were involved.
Therefore, tampering with the data of any block will change
its hash value. As its hash value goes into the feeding of
blocks following it, all the hash values are changed. This is
how even the smallest change in the input data is
detectable, as it changes the entire hash value.
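The block-chaining process described above can be sketched as follows. This is a simplified illustration of the principle (feeding each block's digest into the next), not any specific standard algorithm, and the block contents are made up.

```python
import hashlib

def chained_hash(blocks: list) -> str:
    # Each block is hashed together with the previous digest, so the
    # final value depends on every block that came before it.
    digest = b""
    for block in blocks:
        digest = hashlib.sha256(digest + block).digest()
    return digest.hex()

original = [b"block-1", b"block-2", b"block-3"]
tampered = [b"block-1", b"block-X", b"block-3"]

# Tampering with any one block changes the final hash.
print(chained_hash(original) != chained_hash(tampered))  # prints True
```

Because the second block's digest feeds into the third, altering `block-2` changes every digest after it, which is exactly why the smallest change is detectable in the final value.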
Alice is a vendor whose business supplies stationery to Bob's
office on credit. She sends Bob an invoice with an inventory
list, billing amount, and her bank account details a month
later. She applies her digital signature to the document and
hashes it before sending it to Bob. However, Todd, who's a
hacker, intercepts the document while it's in transit and
replaces Alice's bank account details with his.
When Bob receives the letter, his computer calculates the
hash value of the document and finds that it's different from
the original hash value. Bob's computer immediately raises a
flag, warning him that something is fishy with the document
and he shouldn't trust it.
Without the hashed document, Bob would easily have
trusted the content of the document because he was
acquainted with Alice and the transaction details in the
document were genuine. However, since the hash values did
not match, Bob was aware of the change. Now, he contacts
Alice by phone and shares with her the information in the
document he received. Alice confirms that her bank account
is different than what is written in the document.
That's how a hashing function saves Alice and Bob from
financial fraud. Now imagine this scenario with your own
business and how hashing could protect it.
Primary Terminologies
Preimage: Suppose we have a hash value h = H(x). We say that
x is a preimage of h; that is, x is a data block whose hash,
under the function H, is h. Because H is a many-to-one
mapping, there will always be multiple preimages for any
given hash value h.
Collision: If x ≠ y and H(x) = H(y), we have a collision.
Since hash functions are meant to identify data uniquely,
collisions are clearly undesirable.
Popular Hash Functions
Hash functions play an important role in computing,
providing versatile capabilities like: Quick retrieval of data,
Secure protection of information (cryptography), Ensuring
data remains unaltered (integrity verification). Some
commonly used hash functions are
Message Digest 5 (MD5)
MD5 is a specific message digest algorithm, a type of
cryptographic hash function. It takes an input of any length
(a message) and produces a fixed-length (128-bit) hash
value, which acts like a unique fingerprint for the message.
MD5 was widely used from the early 1990s onwards for
various purposes, including:
File Check: Making sure a file got from the web was not
changed while transferring. MD5 was used to make a
code for the first file and compare it to the code of the
received file.
Password Storage: MD5 was sometimes used to store
passwords on servers. However, it was never
recommended to store passwords directly in plain text.
Instead, the password was hashed using MD5, and the
hash value was stored. This meant that even if a
security breach occurred, the actual passwords
wouldn't be compromised.
Secure Hash Algorithm (SHA)
SHA stands for Secure Hash Algorithm. It is a family of
cryptographic hash functions published by NIST for keeping
data safe. These algorithms convert any size of input into a
fixed-size code, called a hash value or message digest.
There are different SHA types, each with varied lengths and
security features:
SHA-1: The first SHA code, making a 160-bit hash. It's
now unsafe because of flaws and is no longer used.
SHA-2: A family of improved SHA algorithms with
various output lengths:
o SHA-224 (224 bits)
o SHA-256 (256 bits - most common)
o SHA-384 (384 bits)
o SHA-512 (512 bits)
SHA-3: A completely redesigned hash function
introduced after weaknesses were found in SHA-1. It
offers an alternative construction but isn't as widely used yet.
SHAs have a number of applications in digital security:
Data Integrity: Checking whether data has changed; even a
small change produces a different hash value.
E-Signatures: Verifying documents. The sender signs the
hash of the data with a private key; the receiver checks the
signature using the sender's public key and a recomputed hash.
Password Protection: Passwords are hashed before being
saved. If there is a breach, only the hashes are compromised,
not the passwords themselves.
Software Check: Verifying that a downloaded file is unchanged.
Often, the hash is published by the distributor so users can
check the file's authenticity.
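The data-integrity property rests on the avalanche effect: a one-character change flips, on average, about half of the output bits. A short sketch with SHA-256 (the messages are invented examples):

```python
import hashlib

def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

a = sha256_hex(b"transfer $100 to Alice")
b = sha256_hex(b"transfer $900 to Alice")  # a single character changed

# Count how many of the 256 output bits differ between the two digests.
diff_bits = bin(int(a, 16) ^ int(b, 16)).count("1")
print(diff_bits)  # typically close to 128, i.e. about half the bits
```

This is why a hash comparison catches even the subtlest tampering: there is no "small" change in the output, only a completely different digest.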
Applications
Message Authentication
Message Authentication is the process or service used for
making sure that a message is authentic. It also means
assuring that the received data is the same as the one sent
—that is, not tampered with to delete, insert, or replay. In
most cases, authentication will also ensure that the alleged
sender is who he claims he is.
More precisely, the hash function is referred to as a
message digest when hash functions are applied for
verifying a message.
Digital Signatures
In the case of a digital signature, the hash value of a
message is encrypted using the private key of the person
sending it, so that no one else can change what they have
said without the tampering being easily detected: anyone
holding the sender's public key can decrypt the signature
and compare it against a freshly computed hash of the
received message.
A hash code is used to provide a digital signature as:
a. The hash code is encrypted with public-key encryption
using the sender's private key. This provides authentication;
it also provides a digital signature, because only the
sender could have produced the encrypted hash code. In fact,
this is what the digital signature technique is all about.
b. If you want both confidentiality and a digital signature,
you can encrypt the message plus the private-key-encrypted
hash code using a symmetric secret key. This is
a common technique.
Create a one-way password file
One of the most common uses of hash functions is the
creation of a one-way password file: a scheme where the
operating system holds the hash value of a user's password
rather than the actual password itself. In other words, it's
the hash value that the operating system stores. That's why
a hacker can't recover the real password from that file.
real password from that file. When you type a password into
your computer, the system uses the hash value to check if
you typed the right password. Many systems use this
method to protect passwords.
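A hedged sketch of such a password file using PBKDF2 from Python's standard library. The function names, the 16-byte salt, and the 100,000-iteration count are illustrative choices, not a prescription; real systems should follow current guidance on iteration counts and algorithms.

```python
import hashlib, os

def hash_password(password: str, salt: bytes = b""):
    # Store (salt, hash) instead of the password itself.
    # PBKDF2 adds a tunable work factor on top of plain hashing.
    salt = salt or os.urandom(16)
    digest = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 100_000)
    return salt, digest

def verify_password(password: str, salt: bytes, stored: bytes) -> bool:
    # Recompute the hash of the typed password and compare.
    return hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 100_000) == stored

salt, stored = hash_password("hunter2")
print(verify_password("hunter2", salt, stored))  # prints True
print(verify_password("wrong", salt, stored))    # prints False
```

The per-user random salt is what defeats the rainbow-table attacks mentioned later in this section: identical passwords no longer produce identical stored hashes.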
Intrusion detection and virus
Hash functions can also be used to detect intrusions and
viruses. For each file on your system, store H(F) and keep
the hash values safe (for example, on a protected CD-R).
You can later check whether a file has changed by recomputing
H(F): an intruder would have to modify F without altering
H(F), which a good hash function makes infeasible.
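The scheme above can be sketched as a baseline-and-verify routine. This is a toy in-memory model; a real integrity checker works on actual files on disk, and the paths and file contents here are made up.

```python
import hashlib

def snapshot(files: dict) -> dict:
    # Record H(F) for every file; keep this table somewhere
    # write-protected (the "protected CD-R" of the text).
    return {name: hashlib.sha256(data).hexdigest() for name, data in files.items()}

def changed_files(files: dict, baseline: dict) -> list:
    # Any file whose recomputed hash no longer matches the baseline
    # has been modified since the snapshot was taken.
    return [name for name, data in files.items()
            if hashlib.sha256(data).hexdigest() != baseline.get(name)]

system = {"/bin/login": b"original binary", "/etc/passwd": b"root:x:0:0"}
baseline = snapshot(system)

system["/bin/login"] = b"trojaned binary"  # simulated intrusion
print(changed_files(system, baseline))     # prints ['/bin/login']
```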
Pseudorandom number generator (PRNG)
You can use a cryptographic hash function to create a PRF or
a PRNG. One of the most common uses for a hash based PRF
is to generate symmetric keys.
Security Requirements for
Cryptographic Hash Functions
| Requirement | Description |
| --- | --- |
| Variable input size | H can be applied to a block of data of any size. |
| Fixed output size | H produces a fixed-length output. |
| Efficiency | H(x) is relatively easy to compute for any given x, making both hardware and software implementations practical. |
| Preimage resistant (one-way property) | For any given hash value h, it is computationally infeasible to find y such that H(y) = h. |
| Second preimage resistant (weak collision resistant) | For any given block x, it is computationally infeasible to find y ≠ x with H(y) = H(x). |
| Collision resistant (strong collision resistant) | It is computationally infeasible to find any pair (x, y) such that H(x) = H(y). |
| Pseudorandomness | Output of H meets standard tests for pseudorandomness. |
The first three characteristics are necessary for the
practical use of a hash function.
The fourth characteristic is preimage resistance. It is
easy to generate a hash code for a message but almost
impossible to generate a message for a given code.
This property is important when the authentication
technique uses a secret value SAB, where the secret
value is not itself sent. If the hash function is not
one-way, the secret value can easily be discovered by an
attacker. For example, if the attacker is able to observe
or intercept the transmission, they can get the message M.
They can also get the hash code MDM = H(SAB||M). They can
then invert the hash function to obtain SAB||M = H⁻¹(MDM).
Since they now have both M and SAB||M, it is trivial to
recover SAB.
The fifth property – second preimage resistant –
ensures that you can’t find a different message with
the exact same hash value. This property stops forgery
when you’re using encrypted hash code. If this
property wasn’t true, attackers would be able to do the
following: First, they’d observe or intercept the
message plus the encrypted hash code. Second, they
would get the unencrypted hash from the message.
Third, they would generate an alternate message with
the identical hash code.
A weak hash function satisfies the first five properties.
If the sixth property (collision resistant) is also met,
then the hash function is called a strong one. Strong
hash functions protect against an attack where one
side creates a message for the other side to sign.
The last requirement – pseudorandomness – has not
traditionally been listed as a requirement for
cryptographic hash functions, but is rather implicit.
Because cryptographic hash functions are often used to
derive keys and generate pseudorandom numbers, and
because in message integrity applications the three
resistance properties depend on the hash function's
output appearing random, it is logical to verify that a
given hash function actually produces pseudorandom
output.
Drawback
Just like other technologies and processes, the hash
functions in cryptography aren't perfect either. There are a
few key issues that are worth mentioning.
There have been incidents in the past in which popular
algorithms like MD5 and SHA-1 returned the same hash
value for different data. Hence, their collision
resistance was compromised.
There's a technology called "rainbow tables" that
hackers use to try to crack unsalted hash values. That's
why salting before hashing is so important to secure
password storage.
There are some software services and hardware tools—
known as "hash cracking rigs"—that are used by
hackers, security researchers, and sometimes even
government entities to crack the hashed passwords.
Some kinds of brute force attacks can crack the hashed
data.
Conclusion
Hashing is a very handy cryptographic tool for information
technology when it comes to verification: checking digital
signatures, file integrity, or data, password integrity, and
many more. Cryptographic hash functions are not perfect,
but they are a pretty good checksum and authentication
mechanism. It is one of the methods of storing passwords
securely when a salting technique is in place, in a manner
that is just impractical for cybercriminals to even try to
invert it to something they can use.
Zero Knowledge Proof
Zero Knowledge Proof (ZKP) is a cryptographic protocol
originally proposed by MIT researchers Shafi Goldwasser,
Silvio Micali, and Charles Rackoff.
Zero-knowledge protocols are probabilistic assessments,
which means they don't prove something with as much
certainty as simply revealing the entire information would.
Instead, they provide small, unlinkable pieces of information
that together show the validity of the assertion to be
overwhelmingly probable.
Currently, a website takes the user's password as input
and then compares its hash to the stored hash. Similarly, a
bank requires your credit score to grant you a loan,
leaving your privacy and the risk of information leaks at
the mercy of the host servers. If ZKP can be utilized, the
client's password remains unknown to the verifier and the
login can still be authenticated. Before ZKP, we always
questioned the legitimacy of the prover or the soundness of
the proof system; ZKP also questions the morality of the
verifier. What if the verifier tries to leak the information?
Example-1: A Colour-blind Friend and Two Balls :
There are two friends, Sachin and Sanchita, of whom
Sanchita is colour-blind. Sachin has two balls and he needs
to prove that the balls are of different colours. Sanchita
switches the balls randomly behind her back and shows them to
Sachin, who has to tell whether the balls were switched or
not. If the balls are of the same colour and Sachin has given
false information, the probability of him answering correctly is
50%. When the activity is repeated several times, the
probability of Sachin giving the correct answer with the false
information is significantly low. Here Sachin is the “prover”
and Sanchita is the “verifier”. Colour is the absolute
information or the algorithm to be executed, and it is proved
of its soundness without revealing the information that is the
colour to the verifier.
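This repeated game can be simulated in a few lines of Python. It is a toy sketch: the function names and the default of 20 rounds are arbitrary choices, made only to show how the cheating probability collapses with repetition.

```python
import random

def run_protocol(balls_differ: bool, rounds: int = 20) -> bool:
    """Sanchita (verifier) switches the balls at random; Sachin (prover)
    must say each round whether they were switched."""
    for _ in range(rounds):
        switched = random.choice([True, False])
        if balls_differ:
            answer = switched  # an honest prover really sees the difference
        else:
            answer = random.choice([True, False])  # a cheater can only guess
        if answer != switched:
            return False  # one wrong answer and the proof fails
    return True

print(run_protocol(balls_differ=True))   # honest prover always passes
print(run_protocol(balls_differ=False))  # cheater passes with chance 2^-20
```

After n rounds, a prover whose balls are actually identical survives with probability 1/2^n, which is the "significantly low" probability the text describes.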
Example-2: Finding Waldo :
Finding Waldo is a game where you have to find a person
called Waldo from a snapshot of a huge crowd taken from
above. Sachin has an algorithm to find Waldo but he doesn’t
want to reveal it to Sanchita. Sanchita wants to buy the
algorithm but would need to check if the algorithm is
working. Sachin cuts a small hole in a piece of cardboard and
places it over Waldo, so that only Waldo is visible through the
hole. Sachin is the “prover” and Sanchita is the
“verifier”. The algorithm is proved with zero knowledge
about it.
Properties of Zero Knowledge Proof :
Zero-Knowledge –
If the statement is true, the verifier learns nothing
beyond the fact that the statement is true. Here the
statement can be an absolute value or an algorithm.
Completeness –
If the statement is true then an honest verifier can be
convinced eventually.
Soundness –
If the prover is dishonest, they can’t convince the
verifier of the soundness of the proof.
Types of Zero Knowledge Proof :
1. Interactive Zero Knowledge Proof –
It requires the verifier to constantly ask a series of
questions about the “knowledge” the prover possesses.
The example of finding Waldo above is interactive, since
the “prover” performed a series of actions to prove the
soundness of the knowledge to the verifier.
2. Non-Interactive Zero Knowledge Proof –
For the “interactive” solution to work, both the verifier
and the prover need to be online at the same time, making
it difficult to scale to real-world applications.
Non-interactive zero-knowledge proofs do not require an
interactive process, avoiding the possibility of
collusion. Instead, a hash function is used to pick the
challenge on the verifier's behalf. In 1986, Fiat and
Shamir invented the Fiat-Shamir heuristic, which
successfully converts an interactive zero-knowledge proof
into a non-interactive one.
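As a hedged sketch of the Fiat-Shamir idea, here is a toy Schnorr-style proof of knowledge of a discrete logarithm, made non-interactive by letting a hash of the commitment play the role of the verifier's challenge. The tiny parameters (p = 23, q = 11, g = 2) are for illustration only and offer no real security.

```python
import hashlib, random

# p = 23 is a safe prime (p = 2q + 1 with q = 11); g = 2 generates
# the subgroup of order q. The prover knows x with y = g^x mod p.
p, q, g = 23, 11, 2

def fiat_shamir_challenge(*values: int) -> int:
    # The hash function stands in for the verifier's random question.
    data = ",".join(map(str, values)).encode()
    return int.from_bytes(hashlib.sha256(data).digest(), "big") % q

def prove(x: int):
    y = pow(g, x, p)                     # public value for secret x
    r = random.randrange(q)
    t = pow(g, r, p)                     # commitment
    c = fiat_shamir_challenge(g, y, t)   # non-interactive challenge
    s = (r + c * x) % q                  # response
    return y, t, s

def verify(y: int, t: int, s: int) -> bool:
    c = fiat_shamir_challenge(g, y, t)
    # g^s should equal t * y^c if the prover really knew x.
    return pow(g, s, p) == (t * pow(y, c, p)) % p

y, t, s = prove(x=7)
print(verify(y, t, s))  # prints True, without the verifier learning x
```

Because the challenge is derived from the commitment by hashing, the prover cannot choose it after the fact, so the transcript (y, t, s) alone convinces any offline verifier.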