A Case Study On Different Applications and Security Issues in Distributed Systems
Abhinav Agarwal
19ucs254
Abstract
This case study will look into the background of the different applications of
distributed systems and the security around them.
In the later part of the case study, we will look into web caching and the various
methods used to achieve it: the pull-, push-, and lease-based approaches, along
with their respective merits and demerits.
Then, we will look into the solution proposed by Haobo Yu and team in [7] to
achieve scalability in web caching while ensuring consistency among proxies, which
is a major drawback of the other approaches.
Background
A standard web application consists of a web browser client and a web server. It
can also contain a proxy server, which acts as an intermediary between the client
and the server. This proxy server can have multiple use cases, ranging from load
balancing to caching. There is also a database server, along with a file server, with
which the web server communicates to fetch the data requested by the user.
Web Caching
Web caching is used when we want performance while scaling the system. In this
approach, we replicate some functionality or data on multiple nodes, and requests
for that functionality or data are served from these nodes instead of the main
server. This helps in reducing the load on the main server. Web caching works best
when employed closest to the clients; this is where technologies like edge
computing and CDNs (Content Delivery Networks) come into play, computing
results at the “edge” of the network.
The problem with this method is that we observe consistency issues: although the
data has been updated at the main server, it has not yet been updated at these
nodes, so stale data gets served to the users. To solve this problem, we can use
either a pull-based or a push-based approach. In the first, these nodes pull data
from the central server at regular intervals; in the second, the main server pushes
the new data onto these nodes whenever it receives it.
Distributed File Systems
A DFS is yet another application of distributed systems, where your files reside on
some remote servers and can be accessed using Remote Procedure Calls (RPCs).
Sun’s Network File System (NFS) is a widely used DFS. It uses a virtual file system
layer to handle local and remote files uniformly. NFS uses the mount protocol to
access remote files: the mount protocol establishes a local name for the remote
files, users access remote files using these local names, and the OS takes care of
the mapping. NFS also allows client-side caching, where cached data can be stale
for up to 30 seconds. NFS implements security using user ID and group ID
authentication only [2].
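To make the RPC-based access concrete, here is a minimal sketch in Python using
the standard library’s XML-RPC modules. This is not NFS’s actual protocol; the
server, path, and file contents are hypothetical, chosen only to show how a client
reads a remote file through a local proxy object.

```python
import threading
from xmlrpc.client import ServerProxy
from xmlrpc.server import SimpleXMLRPCServer

# Toy "file server" state; the path and contents are hypothetical.
FILES = {"/export/readme.txt": "hello from the remote file server"}

def read_file(path):
    # Server-side handler invoked remotely by the client.
    return FILES[path]

server = SimpleXMLRPCServer(("localhost", 8000), logRequests=False)
server.register_function(read_file)
threading.Thread(target=server.serve_forever, daemon=True).start()

# The client accesses the remote file through a local proxy object;
# the RPC layer hides the network communication, much as a DFS does.
client = ServerProxy("http://localhost:8000/")
print(client.read_file("/export/readme.txt"))
```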
CODA was developed to make file system disconnection transparent, which is
especially needed for mobile clients. In CODA, each file belongs to exactly one
volume, and each volume may be replicated across several servers. CODA works on
the principle of read-one, write-all, where write conflicts are resolved manually by
the user, much like in Git [3].
Let’s talk about xFS a little. It is basically a serverless file system designed for
high-speed LAN environments. It distributes data storage across disks using
software RAID and log-based network striping. It also eliminates central server
caching using cooperative caching. Since xFS uses RAID, the overhead of parity
management hurts performance for small writes; RAID hardware is also very
expensive.
Some other file systems include LFS, a log-structured file system, which provides
fast writes, simple recovery, and flexible file locations; another is the Hadoop
Distributed File System (HDFS), which is optimized for large data sets accessed
using Hadoop [6].
Security in Distributed Systems
There have been a lot of developments in answering the question of how to
authenticate users. A lot of the answers are in the direction of encryption, but even
if we make this possible using public-private key cryptography, it is only as “secure”
as the public key distribution, and for that, algorithms like Diffie-Hellman key
exchange have been introduced.
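As an illustration, here is a textbook-sized Diffie-Hellman exchange in Python. The
parameters are toy values (real deployments use standardized groups of 2048 bits
or more), but the sketch shows how two parties derive a shared secret without
ever transmitting it.

```python
import secrets

# Toy public parameters (textbook-sized); real deployments use
# standardized groups of 2048 bits or more.
p, g = 23, 5

# Each party picks a private exponent and publishes g^x mod p.
a = secrets.randbelow(p - 2) + 1   # Alice's private key
b = secrets.randbelow(p - 2) + 1   # Bob's private key
A = pow(g, a, p)                   # Alice's public value, sent to Bob
B = pow(g, b, p)                   # Bob's public value, sent to Alice

# Both sides derive the same shared secret without ever sending it.
shared_alice = pow(B, a, p)
shared_bob = pow(A, b, p)
assert shared_alice == shared_bob
print("shared secret:", shared_alice)
```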
To protect against intruders, one can use a firewall, which is a network component
sitting between the inside and the outside of a network; it drops packets on the
basis of source and destination addresses.
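A firewall’s filtering logic can be sketched as a first-match rule table; the addresses
and rules below are hypothetical, chosen only to illustrate the idea.

```python
# Hypothetical rule table: (source prefix, destination prefix, action).
# The final catch-all rule drops anything not explicitly allowed.
RULES = [
    ("10.0.0.", "192.168.1.", "ALLOW"),
    ("", "", "DROP"),
]

def filter_packet(src, dst):
    # First matching rule wins, as in typical packet filters.
    for src_prefix, dst_prefix, action in RULES:
        if src.startswith(src_prefix) and dst.startswith(dst_prefix):
            return action

print(filter_packet("10.0.0.7", "192.168.1.5"))     # ALLOW (inside traffic)
print(filter_packet("203.0.113.9", "192.168.1.5"))  # DROP (outside source)
```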
To provide encryption and authentication between a web server and client, SSL
(Secure Sockets Layer) was developed by Netscape. To begin an SSL session, the
server’s public key is needed: the server presents a certificate signed with the CA’s
private key, and the browser verifies it using the CA’s public keys, which are stored
in the browser [4].
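Below is a small sketch using Python’s ssl module (which speaks TLS, SSL’s modern
successor); the default context verifies the server’s certificate chain against the CA
certificates trusted by the local system, which is exactly the CA-based check
described above. The hostname is just an example.

```python
import socket
import ssl

# The default context loads the system's trusted CA certificates and
# verifies the server's certificate chain and hostname.
context = ssl.create_default_context()

with socket.create_connection(("www.cloudflare.com", 443)) as sock:
    with context.wrap_socket(sock, server_hostname="www.cloudflare.com") as tls:
        print("negotiated:", tls.version())
        print("server subject:", tls.getpeercert()["subject"])
```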
Case Evaluation - Web Caching
Web caching is traditionally done using three methods: pull-based caching,
push-based caching, and a hybrid approach.
Pull-Based Caching
This approach is based on the concept of time-to-live (TTL). When a request
arrives at the cache after the TTL has expired, the cache pulls the latest data from
the server. If the TTL is fixed, then cache staleness is bounded by this TTL.
However, setting a very small value for the TTL negates the benefits of web
caching.
The proxy can also dynamically determine the refresh interval (TTL) based on past
observations; this is known as intelligent polling. For example: increase the interval
if the object has not changed in the two previous polls, and decrease it if it has.
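A minimal sketch of this heuristic follows; the doubling/halving factors and bounds
are assumptions made for illustration, not values from any particular system.

```python
import time

class PullCache:
    """Toy pull-based cache with 'intelligent polling' of the TTL."""

    def __init__(self, fetch, ttl=10.0):
        self.fetch = fetch        # callable that pulls the object from the origin
        self.ttl = ttl            # current refresh interval; adapted over time
        self.value = None
        self.expires = 0.0
        self.unchanged_polls = 0

    def get(self):
        if time.time() >= self.expires:          # TTL expired: pull fresh data
            new_value = self.fetch()
            if new_value == self.value:
                self.unchanged_polls += 1
                if self.unchanged_polls >= 2:    # unchanged twice: poll less often
                    self.ttl *= 2
            else:
                self.unchanged_polls = 0
                self.ttl = max(1.0, self.ttl / 2)  # changed: poll more often
            self.value = new_value
            self.expires = time.time() + self.ttl
        return self.value                        # within TTL: serve cached copy

cache = PullCache(fetch=lambda: "page-v1")
print(cache.get())  # first call misses and pulls from the origin
print(cache.get())  # served from the cache until the TTL expires
```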
Generally, the pull-based approach is not preferred for dynamic content due to the
high overhead of pulling unchanged data. There can also be consistency issues: if
the data changes very frequently, the user can see stale data because the new data
has not yet been pulled to the node closest to them. For static content, on the
other hand, it is the best approach.
Push-Based Caching
In this type of caching, the server keeps track of which proxies cache a particular
page, and then, whenever that page changes, it notifies the proxies and floods the
network with the updated data. While this approach eliminates staleness, it incurs
the cost of requiring the server to keep track of all proxies. Also, flooding the
entire network has its own overhead; thus, this approach does not scale.
Ensuring consistency is a very big issue when working with dynamic content. If we
make our dynamic content static and store it in the cache to be served to the user,
then under the push-based approach, even for a very small change we have to
flush the cached copy, regenerate the content, and store it in the cache again. Also,
this approach is not resilient to server crashes.
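The bookkeeping burden described above can be seen in a small sketch: the origin
must hold a reference to every proxy and push each update to all of them. The
class and page names below are hypothetical.

```python
class Proxy:
    """Toy proxy cache that passively receives pushed updates."""

    def __init__(self):
        self.cache = {}

    def store(self, url, content):
        self.cache[url] = content

class OriginServer:
    """Toy push-based origin: tracks every proxy and pushes every update."""

    def __init__(self):
        self.proxies = []   # per-proxy state the server must maintain
        self.pages = {}

    def register(self, proxy):
        self.proxies.append(proxy)

    def update(self, url, content):
        self.pages[url] = content
        for proxy in self.proxies:     # push the new content to all proxies;
            proxy.store(url, content)  # this is the flooding overhead

server = OriginServer()
p1, p2 = Proxy(), Proxy()
server.register(p1)
server.register(p2)
server.update("/index.html", "v2")
print(p1.cache["/index.html"], p2.cache["/index.html"])  # both fresh at once
```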
Hybrid Approach - Leases
A lease is a duration for which the server agrees to notify the proxy of
modifications. A lease is issued to a proxy on its first request, and the server sends
notifications until the lease expires; once it expires, the proxy has to renew it. In
the limit, a lease of zero duration degenerates to pure pull-based polling, while an
infinite lease degenerates to the pure push-based approach.
There are different policies for choosing the lease duration. One is an age-based
lease, where the larger the expected lifetime of the object, the longer the lease.
Another is renewal-frequency based, where proxies at which an object is popular
get longer leases. A third is server-load based, where shorter leases are issued
during heavy load.
The efficiency of the whole system depends upon the lease duration; short leases,
in particular, incur the overhead of frequent renewals.
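The three policies can be combined into a single duration function, sketched
below; the base duration, weights, and cap are made-up values purely for
illustration, not taken from any of the cited work.

```python
def lease_duration(expected_lifetime, renewals, server_load,
                   base=60.0, cap=600.0):
    """Toy combination of the three lease-duration policies (seconds)."""
    d = base
    d += 0.1 * expected_lifetime   # age-based: long-lived object -> longer lease
    d += 5.0 * renewals            # renewal-frequency: popular at proxy -> longer
    d /= 1.0 + server_load         # server-load: heavy load -> shorter lease
    return min(d, cap)

# A long-lived, popular object on a lightly loaded server gets a long lease;
# the same object under heavy load gets a much shorter one.
print(lease_duration(expected_lifetime=3600, renewals=4, server_load=0.0))
print(lease_duration(expected_lifetime=3600, renewals=4, server_load=4.0))
```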
Proposed Solution & Implementation
A scalable web cache consistency architecture has been proposed in [7], which
utilizes invalidations, hierarchy, and leases.
Each group in the hierarchy is associated with a set of caches, and caches send
heartbeats to each other that are equivalent to cache-to-cache leases. Each cache
maintains a server table in order to locate where a web server sits in the
hierarchy. A client request is forwarded to the first cache in the hierarchy that
holds a valid copy of the requested page.
The caching hierarchy is maintained in the form of groups, where each cache joins
the group owned by its parent. Thus, a parent need not know who its children are,
and children can choose their parent freely as long as cycles are prevented.
Hierarchy establishment and maintenance are discussed further in [8].
The hierarchy is kept alive with the help of heartbeats: each group owner sends
periodic heartbeats to its associated group. If each lease has length T and t is the
time between subsequent heartbeats, then T/t = 5 in their case. This ensures that
if some heartbeats are lost, they will not cause many problems, since several more
arrive before the lease expires.
With each heartbeat, we piggyback knowledge of invalidated pages. We only need
to include invalid pages that have been requested after they were last rendered
invalid. Each heartbeat thus carries the set of pages that have been rendered
invalid at the parent cache, and this knowledge is propagated to its child caches.
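A rough sketch of this mechanism follows; the class layout and field names are my
own, not from [7], but the idea is the same: each heartbeat carries the set of
invalidated pages and is re-propagated to the child caches.

```python
class Cache:
    """Toy hierarchical cache node; pages maps url -> (content, is_valid)."""

    def __init__(self, name):
        self.name = name
        self.pages = {}
        self.children = []

    def on_heartbeat(self, invalid_pages):
        # Mark the piggybacked pages invalid locally ...
        for url in invalid_pages:
            if url in self.pages:
                content, _ = self.pages[url]
                self.pages[url] = (content, False)
        # ... then propagate the same knowledge down to the child caches.
        for child in self.children:
            child.on_heartbeat(invalid_pages)

parent = Cache("parent")
child = Cache("child")
parent.children.append(child)
child.pages["/news.html"] = ("old copy", True)

parent.on_heartbeat(["/news.html"])  # heartbeat piggybacks the invalidation
print(child.pages["/news.html"])     # ('old copy', False): marked stale
```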
Along with traveling down, heartbeats also travel up, from the server to the
top-level cache. The cache to which a web server is attached is called its primary
cache. Each server sends a JOIN request up the hierarchy, and every cache, on
receiving this request, makes an entry in its server routing table. The thing to note
is that the top-level cache knows all the servers attached to the hierarchy. Servers
communicate with their primary cache with the help of heartbeats. A cache can
also send a LEAVE signal to its parent and children if it does not receive a
heartbeat within T seconds.
A client can attach to any cache in the hierarchy; let’s call it the client’s primary
cache. When a client requests a page, it sends the request to its primary cache.
This cache checks whether it contains the requested page; if not, the request is
forwarded to the next cache. When the request is fulfilled, either by the
originating server or some intermediate cache, the response takes the reverse
path, updating all the caches on the way and serving the user at the end.
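Here is a minimal sketch of that request path under the same kind of toy model
(the names are again my own, not from [7]): the request walks up the hierarchy
until a valid copy is found, falling through to the originating server, and the
response fills every cache on the way back.

```python
class Cache:
    """Minimal cache node; pages maps url -> (content, is_valid)."""

    def __init__(self, parent=None):
        self.parent = parent
        self.pages = {}

def lookup(cache, url, origin):
    path, node, content = [], cache, None
    while node is not None:
        page = node.pages.get(url)
        if page is not None and page[1]:
            content = page[0]          # first cache holding a valid copy
            break
        path.append(node)
        node = node.parent
    if content is None:
        content = origin[url]          # fulfilled by the originating server
    for visited in path:               # response takes the reverse path,
        visited.pages[url] = (content, True)  # updating caches on the way
    return content

root = Cache()
primary = Cache(parent=root)           # the client's primary cache
print(lookup(primary, "/index.html", {"/index.html": "v1"}))
print(primary.pages, root.pages)       # both caches now hold a valid copy
```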
Conclusion
In this term paper, we first discussed the various applications of distributed
systems and the role of security in them. We saw that a lot of advancements have
been made in the fields of web applications, caching, and distributed file systems.
Then we saw the various approaches used to perform web caching for static and
dynamic content, and the merits and demerits of each approach.
Finally, we saw a new kind of approach to web caching that is both scalable and
keeps caches consistent. The approach combines the lessons of the pull-, push-,
and lease-based approaches. The authors’ performance evaluation suggests that
when the heartbeat rate is higher than the write rate, this approach is very
effective in keeping pages fresh. When pages are write-dominated, the approach
still ensures freshness, because if a page is invalid, the request is served from the
cache in the hierarchy that contains the correct copy. And when pages are
read-dominated, the invalidation approach offers significant reductions in server
hit counts and client response time.
References
1. Course on Distributed Systems, 2022-23
2. https://www.ibm.com/docs/en/aix/7.1?topic=management-network-file-system
3. http://www.coda.cs.cmu.edu/
4. https://www.cloudflare.com/learning/ssl/how-does-ssl-work/
5. Course on Blockchain Foundation and Smart Contract, 2021-22
6. Distributed Systems: Principles and Paradigms, Tanenbaum and van Steen
7. Haobo Yu, Lee Breslau, and Scott Shenker. A Scalable Web Cache Consistency
Architecture. In Proceedings of ACM SIGCOMM, 1999.
8. Rosenstein, A., Li, J., and Tong, S. Y. MASH: The multicasting archive server
hierarchy. SIGCOMM Computer Communication Review 27, 3 (July 1997).