4 Distributed Databases, NOSQL Systems, and BigData-1

The document discusses distributed databases, NoSQL systems, and big data, highlighting the concepts, advantages, and disadvantages of distributed database systems (DDBS). It covers data fragmentation, replication, and allocation techniques, as well as the architecture and components of DDBS, emphasizing their ability to improve data accessibility and organizational efficiency. Additionally, it addresses challenges such as complexity, cost, and security associated with DDBS implementation.

Uploaded by

neupanepratik1
Copyright
© © All Rights Reserved
DISTRIBUTED DATABASES, NOSQL SYSTEMS AND BIGDATA

After a comprehensive study of this chapter, you will be able to understand:
• Distributed Database Concepts and Advantages
• Data Fragmentation, Replication and Allocation Techniques for Distributed Database Design
• Types of Distributed Systems
• Distributed Database Architecture
• Introduction to NOSQL Systems
• The CAP Theorem
• Document-based, Key-value Stores, Column-based and Graph-based Systems
• Big Data
• MapReduce
• Hadoop

DISTRIBUTED DATABASE CONCEPT AND ADVANTAGES

A major motivation behind the development of database systems is the desire to integrate the operational data of an organization and to provide controlled access to the data. Although integration and controlled access may imply centralization, this is not the intention. In fact, the development of computer networks promotes a decentralized mode of work. This decentralized approach mirrors the organizational structure of many companies, which are logically distributed into divisions, departments, projects, and so on, and physically distributed into offices, plants, and factories, where each unit maintains its own operational data. The shareability of the data and the efficiency of data access should be improved by the development of a distributed database system that reflects this organizational structure, makes the data in all units accessible, and stores data proximate to the location where it is most frequently used.

Distributed DBMSs should help resolve the islands of information problem. Databases are sometimes regarded as electronic islands that are distinct and generally inaccessible places, like remote islands. This may be a result of geographical separation, incompatible computer architectures, incompatible communication protocols, and so on. Integrating the databases into a logical whole may prevent this way of thinking.

A distributed database (DDB) is a collection of multiple, logically interrelated databases distributed over a computer network.
A distributed database management system (DDBMS) is the software that manages the DDB and provides an access mechanism that makes this distribution transparent to the users. The distributed database (DDB) and the distributed database management system (DDBMS) together are called a distributed database system (DDBS).

[Figure 4.1: Distributed Database System. Database technology (integration) combined with computer networks (distribution) gives a distributed database system; integration ≠ centralization.]
[Figure 4.2: Centralized Environment]
[Figure 4.3: Distributed Environment]

Characteristics of Distributed Database System
The Distributed Database System has the following characteristics:
• A collection of logically related shared data;
• The data is split into a number of fragments;
• Fragments may be replicated;
• Fragments/replicas are allocated to sites;
• The sites are linked by a communications network;
• The data at each site is under the control of a DBMS;
• The DBMS at each site can handle local applications autonomously;
• Each DBMS participates in at least one global application.

Components of Distributed Database System
The different components of DDBS are as follows:
• Computer workstations or remote devices (sites or nodes) that form the network system. The distributed database system must be independent of the computer system hardware.
• Network hardware and software components that reside in each workstation or device. The network components allow all sites to interact and exchange data. Because the components (computers, operating systems, network hardware, and so on) are likely to be supplied by different vendors, it is best to ensure that distributed database functions can be run on multiple platforms.
• Communications media that carry the data from one node to another. The DDBMS must be communications media-independent; that is, it must be able to support several types of communications media.
• The transaction processor (TP), which is the software component found in each computer or device that requests data. The transaction processor receives and processes the application's data requests (remote and local). The TP is also known as the application processor (AP) or the transaction manager (TM).
• The data processor (DP), which is the software component residing on each computer or device that stores and retrieves data located at the site. The DP is also known as the data manager (DM). A data processor may even be a centralized DBMS.

Advantages of DDBS
The advantages of DDBS are as follows:
1. Reflects organizational structure: Many organizations are naturally distributed over several locations.
2. Improved shareability and local autonomy: The geographical distribution of an organization can be reflected in the distribution of the data; users at one site can access data stored at other sites. Data can be placed at the site close to the users who normally use that data. In this way, users have local control of the data and they can consequently establish and enforce local policies regarding the use of this data. A global DBA is responsible for the entire system. Generally, part of this responsibility is devolved to the local level, so that the local DBA can manage the local DBMS.
3. Improved availability: In a centralized DBMS, a computer failure terminates the operations of the DBMS. However, a failure at one site of a DDBMS, or a failure of a communication link making some sites inaccessible, does not make the entire system inoperable. Distributed DBMSs are designed to continue to function despite such failures. If a single node fails, the system may be able to reroute the failed node's requests to another site.
4. Improved reliability: Because data may be replicated so that it exists at more than one site, the failure of a node or a communication link does not necessarily make the data inaccessible.
5. Improved performance: As the data is located near the site of "greatest demand," and given the inherent parallelism of distributed DBMSs, speed of database access may be better than that achievable from a remote centralized database. Furthermore, since each site handles only a part of the entire database, there may not be the same contention for CPU and I/O services as characterized by a centralized DBMS.
6. Economics: The potential cost saving occurs where databases are geographically remote and the applications require access to distributed data. In such cases, owing to the relative expense of data being transmitted across the network as opposed to the cost of local access, it may be much more economical to partition the application and perform the processing locally at each site.
7. Modular growth: In a distributed environment, it is much easier to handle expansion. New sites can be added to the network without affecting the operations of other sites. This flexibility allows an organization to expand relatively easily.
8. Integration: At the start of this section, we noted that integration was a key advantage of the DBMS approach, not centralization. The integration of legacy systems is one particular example that demonstrates how some organizations are forced to rely on distributed data processing to allow their legacy systems to coexist with their more modern systems. At the same time, no one package can provide all the functionality that an organization requires nowadays. Thus, it is important for organizations to be able to integrate software components from different vendors to meet their specific requirements.
9. Remaining competitive: There are a number of relatively recent developments that rely heavily on distributed database technology, such as e-business, computer-supported collaborative work, and workflow management. Many enterprises have had to reorganize their businesses and use distributed database technology to remain competitive.

Disadvantages of DDBS
The disadvantages of DDBS are as follows:
1. Complexity: A distributed DBMS that hides the distributed nature from the user and provides an acceptable level of performance, reliability, and availability is inherently more complex than a centralized DBMS. The fact that data can be replicated also adds an extra level of complexity to the distributed DBMS. If the software does not handle data replication adequately, there will be degradation in availability, reliability, and performance compared with the centralized system, and the advantages we cited earlier will become disadvantages.
2. Cost: Increased complexity means that we can expect the procurement and maintenance costs for a DDBMS to be higher than those for a centralized DBMS. Furthermore, a distributed DBMS requires additional hardware to establish a network between sites. There are ongoing communication costs incurred with the use of this network. There are also additional labor costs to manage and maintain the local DBMSs and the underlying network.
3. Security: In a centralized system, access to the data can be easily controlled. However, in a distributed DBMS not only does access to replicated data have to be controlled in multiple locations, but the network itself has to be made secure. In the past, networks were regarded as an insecure communication medium.
Although this is still partially true, significant developments have been made to make networks more secure.
4. Integrity control more difficult: Database integrity refers to the validity and consistency of stored data. Integrity is usually expressed in terms of constraints, which are consistency rules that the database is not permitted to violate. Enforcing integrity constraints generally requires access to a large amount of data that defines the constraint but that is not involved in the actual update operation itself. In a distributed DBMS, the communication and processing costs that are required to enforce integrity constraints may be prohibitive.
5. Lack of standards: Although distributed DBMSs depend on effective communication, we are only now starting to see the appearance of standard communication and data access protocols. This lack of standards has significantly limited the potential of distributed DBMSs. There are also no tools or methodologies to help users convert a centralized DBMS into a distributed DBMS.
6. Lack of experience: General-purpose distributed DBMSs have not been widely accepted, although many of the protocols and problems are well understood. Consequently, we do not yet have the same level of experience in industry as we have with centralized DBMSs. For a prospective adopter of this technology, this may be a significant deterrent.
7. Database design more complex: Besides the normal difficulties of designing a centralized database, the design of a distributed database has to take account of fragmentation of data, allocation of fragments to specific sites, and data replication.

DISTRIBUTED DATABASE DESIGN

In this section, we discuss techniques that are used to break up the database into logical units, called fragments, which may be assigned for storage at the various nodes. We also discuss the use of data replication, which permits certain data to be stored in more than one site to increase availability and reliability, and the process of allocating fragments (or replicas of fragments) for storage at the various nodes. These techniques are used during the process of distributed database design. The information concerning data fragmentation, allocation, and replication is stored in a global directory that is accessed by the DDBS applications as needed.
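As a rough illustration of the global directory mentioned in this section, the sketch below models it as an in-memory lookup table from fragment names to the sites that hold them. Every name in it (FragmentEntry, GLOBAL_DIRECTORY, the fragment names and the site names) is hypothetical and chosen only for illustration; a real DDBMS keeps this catalog in its own internal format.

```python
from dataclasses import dataclass, field


@dataclass
class FragmentEntry:
    """One directory record: what the fragment is and where copies live (hypothetical layout)."""
    table: str                                   # base table the fragment was cut from
    kind: str                                    # "horizontal", "vertical" or "hybrid"
    definition: str                              # predicate or column list defining the fragment
    sites: list = field(default_factory=list)    # sites holding a replica of the fragment


# Hypothetical directory contents, for illustration only.
GLOBAL_DIRECTORY = {
    "Student_dept1": FragmentEntry("Student", "horizontal", "Dept_id = 1", ["site_A", "site_B"]),
    "Std_address": FragmentEntry("Student", "vertical", "Stu_id, Stu_address", ["site_C"]),
}


def sites_for(fragment_name):
    """Return the sites that store a replica of the named fragment."""
    return list(GLOBAL_DIRECTORY[fragment_name].sites)


def fragments_of(table):
    """Return the names of all fragments derived from the given base table."""
    return sorted(name for name, entry in GLOBAL_DIRECTORY.items() if entry.table == table)
```

An application that needs a fragment would first ask the directory where its copies are stored (for example, sites_for("Student_dept1")) and then contact one of those sites.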
Data Fragmentation
Fragmentation is the task of dividing a table into a set of smaller tables. The subsets of the table are called fragments. These fragments may be stored at different locations. Moreover, fragmentation increases parallelism and provides better disaster recovery. Fragmentation can be of three types:
• Vertical Fragmentation
• Horizontal Fragmentation
• Hybrid Fragmentation

Fragmentation should be done in such a way that the original table can be reconstructed from the fragments whenever required. This requirement is called "reconstructiveness".

Advantages of Fragmentation
• Since data is stored close to the site of usage, the efficiency of the database system is increased.
• Local query optimization techniques are sufficient for most queries, since data is locally available.
• Since irrelevant data is not available at the sites, security and privacy of the database system can be maintained.

Disadvantages of Fragmentation
• When data from different fragments are required, access times may be very high.
• In case of recursive fragmentations, the job of reconstruction will need expensive techniques.
• Lack of back-up copies of data at different sites may render the database ineffective in case of failure of a site.

Vertical Fragmentation
In vertical fragmentation, the fields or columns of a table are grouped into fragments. In order to maintain reconstructiveness, each fragment should contain the primary key field(s) of the table. Vertical fragmentation can be used to enforce privacy of data.
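The reconstructiveness rule for vertical fragments can be checked with a small sketch. It uses Python's built-in sqlite3 module and a hypothetical three-row Student table; the fragment names Frag_address and Frag_profile and the sample rows are assumptions made for illustration. Because both fragments keep the primary key Stu_id, joining them on that key gives back the original rows.

```python
import sqlite3

# In-memory database with a small, hypothetical Student table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Student ("
    "Stu_id INTEGER PRIMARY KEY, Stu_name TEXT, Stu_address TEXT, Dept_id INTEGER)"
)
rows = [(10, "Maya", "Palpa", 1), (11, "Abin", "Ktm", 2), (12, "Arav", "Ktm", 1)]
conn.executemany("INSERT INTO Student VALUES (?, ?, ?, ?)", rows)

# Two vertical fragments; each one carries the primary key Stu_id.
conn.execute("CREATE TABLE Frag_address AS SELECT Stu_id, Stu_address FROM Student")
conn.execute("CREATE TABLE Frag_profile AS SELECT Stu_id, Stu_name, Dept_id FROM Student")

# Reconstruction: join the fragments on the shared primary key.
reconstructed = conn.execute(
    "SELECT p.Stu_id, p.Stu_name, a.Stu_address, p.Dept_id "
    "FROM Frag_profile AS p JOIN Frag_address AS a ON a.Stu_id = p.Stu_id "
    "ORDER BY p.Stu_id"
).fetchall()
original = conn.execute("SELECT * FROM Student ORDER BY Stu_id").fetchall()
```

If either fragment dropped Stu_id, there would be no reliable column to join on, which is why the primary key has to be repeated in every vertical fragment.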
[Figure 4.4: Vertical fragmentation. A table is split by columns into Fragment 1, Fragment 2, Fragment 3, ..., Fragment n.]

Example 4.1: Let us consider a Student table that keeps the records of all registered students:

Student
Stu_id   Stu_name   Stu_address   Dept_id
10       Maya       Palpa         1
11       Abin       Ktm           2
12       Arav       Ktm           1
13       Ashna      Palpa         3
14       Anju       Pokhara       4
15       Manish     Banepa        2
16       Pinky      Ktm           1
Figure 4.5: Student table before fragmentation

Now, the address details are maintained in the admin section. In this case, the designer will fragment the database as follows:

    CREATE TABLE Std_address AS
    SELECT Stu_id, Stu_address FROM Student;

By executing the above query, we get the following result:

Std_address
Stu_id   Stu_address
10       Palpa
11       Ktm
12       Ktm
13       Palpa
14       Pokhara
15       Banepa
16       Ktm
Figure 4.6: Std_address

Horizontal Fragmentation
Horizontal fragmentation groups the tuples of a table in accordance with the values of one or more fields. Horizontal fragmentation should also conform to the rule of reconstructiveness. Each horizontal fragment must have all columns of the original base table.

[Figure 4.7: Horizontal fragmentation. A table is split by rows into Fragment 1, Fragment 2, Fragment 3, ..., Fragment n.]

Example 4.2: In the Student schema shown in Figure 4.5, if the details of all students of department 1 need to be maintained at the respective faculty, then the designer will horizontally fragment the database as follows:

    CREATE TABLE Department AS
    SELECT * FROM Student WHERE Dept_id = 1;

By executing the above query, we get the following result:

Department
Stu_id   Stu_name   Stu_address   Dept_id
10       Maya       Palpa         1
12       Arav       Ktm           1
16       Pinky      Ktm           1
Figure 4.8: Department Table (Horizontal Fragment of Student Table)

Hybrid Fragmentation
In hybrid fragmentation, a combination of horizontal and vertical fragmentation techniques is used. This is the most flexible fragmentation technique, since it generates fragments with minimal extraneous information. However, reconstruction of the original table is often an expensive task. Hybrid fragmentation can be done in two alternative ways:
1. At first, generate a set of horizontal fragments; then generate vertical fragments from one or more of the horizontal fragments.
2. At first, generate a set of vertical fragments; then generate horizontal fragments from one or more of the vertical fragments.

[Figure 4.9: Hybrid fragmentation. A table is first split into Fragment 1, Fragment 2, ..., Fragment n, and those fragments are split again.]

    CREATE TABLE Hybrid AS
    SELECT Stu_id, Stu_name FROM Student WHERE Stu_id = 12;

By executing the above query, we get the following result:

Hybrid
Stu_id   Stu_name
12       Arav
Figure 4.10: Hybrid Table (hybrid fragment of Student Table)

Data Replication
Data replication is the process of generating and reproducing multiple copies of data at one or more sites. It is an important mechanism, because it enables organizations to provide users with access to current data where and when they need it. It is intended to increase the fault tolerance of the system, such that if one database fails another can continue to serve queries or update requests. Replication is sometimes described using the publishing industry metaphor of publishers, distributors, and subscribers.
• Publisher: A DBMS that makes data available to other locations through replication. The publisher can have one or more publications (made up of one or more articles), each defining a logically related set of objects and data to replicate.
• Distributor: A DBMS that stores replication data and metadata about the publication and in some cases acts as a queue for data moving from the publisher to the subscribers. A DBMS can act as both the publisher and the distributor.
• Subscriber: A DBMS that receives replicated data. A subscriber can receive data from multiple publishers and publications. Depending on the type of replication chosen, the subscriber can also pass data changes back to the publisher or republish the data to other subscribers.

Replication Purpose
The purposes of replication are as follows:
1. System availability: A distributed database system may remove single points of failure by replicating data, so that data items are accessible from multiple sites. Consequently, even when some sites are down, data may be accessible from other sites.
2. Performance: One of the major contributors to response time is the communication overhead. Replication enables us to locate the data closer to their access points, thereby localizing most of the access, which contributes to a reduction in response time.
3. Scalability: As systems grow geographically and in terms of the number of sites (consequently, in terms of the number of access requests), replication allows for a way to support this growth with acceptable response times.
4. Application requirements: Finally, replication may be dictated by the applications, which may wish to maintain multiple data copies as part of their operational specifications.

Challenges in Replication
1. Placement of replicas: The major challenge in replication is where to put the replicas. There are three places to put replicas.
   • Permanent replicas: Permanent replicas consist of a cluster of servers that may be geographically dispersed.
   • Server-initiated replicas: Server-initiated replicas include placing replicas in the hosting servers and server caches.
   • Client-initiated replicas: Client-initiated replicas include web browser caches.
2. Propagation of updates among replicas: The next challenge is how to propagate updates made in one replica to all the other replicas as efficiently and as fast as possible.
   • Push-based propagation: The replica in which the update occurs pushes the update to all other replicas.
   • Pull-based propagation: A replica requests another replica to send the newest data it has.
3. Lack of consistency: If a copy is modified, the copy becomes inconsistent with the rest of the copies. It takes some time for all the copies to become consistent.
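The difference between push-based and pull-based propagation, and the temporary inconsistency noted in the last point, can be sketched as follows. This is a simplified model under stated assumptions: a single writing replica, whole-copy transfer, and a plain version counter. The Replica class and its method names are hypothetical and do not come from any particular DBMS.

```python
class Replica:
    """A toy replica holding a key-value copy of the data and a version counter."""

    def __init__(self, name):
        self.name = name
        self.data = {}
        self.version = 0
        self.peers = []          # other replicas this one pushes to

    def write_and_push(self, key, value):
        """Push-based: apply the update locally, then send it to every peer."""
        self.write_local(key, value)
        for peer in self.peers:
            peer.receive(self.version, dict(self.data))

    def write_local(self, key, value):
        """Apply an update locally without telling anyone (peers become stale)."""
        self.version += 1
        self.data[key] = value

    def receive(self, version, data):
        """Accept a copy only if it is newer than what this replica already has."""
        if version > self.version:
            self.version = version
            self.data = data

    def pull_from(self, other):
        """Pull-based: ask another replica for the newest data it has."""
        self.receive(other.version, dict(other.data))


a, b, c = Replica("A"), Replica("B"), Replica("C")
a.peers = [b, c]

a.write_and_push("x", 1)          # push: B and C are updated immediately
a.write_local("x", 2)             # no push: B and C are now stale
stale_before_pull = b.data["x"]   # B still sees the old value
b.pull_from(a)                    # pull: B catches up, C stays stale until it pulls
```

Push keeps replicas fresher at the cost of sending every update to every copy, whereas pull lets a replica decide when it needs fresh data and tolerates stale reads in between.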
Advantages of Data Replication
• Reliability: In case of failure of any site, the database system continues to work, since a copy is available at another site (or sites).
• Reduction in Network Load: Since local copies of data are available, query processing can be done with reduced network usage, particularly during prime hours. Data updating can be done at non-prime hours.
• Quicker Response: Availability of local copies of data ensures quick query processing and consequently quick response time.
• Simpler Transactions: Transactions require a smaller number of joins of tables located at different sites and minimal coordination across the network. Thus, they become simpler in nature.

Disadvantages of Data Replication
• Increased Storage Requirements: Maintaining multiple copies of data is associated with increased storage costs. The storage space required is a multiple of the storage required for a centralized system.
• Increased Cost and Complexity of Data Updating: Each time a data item is updated, the update needs to be reflected in all the copies of the data at the different sites. This requires complex synchronization techniques and protocols.
• Undesirable Application-Database Coupling: If complex update mechanisms are not used, removing data inconsistency requires complex co-ordination at the application level. This results in undesirable application-database coupling.

Data Allocation
Each fragment, or each copy of a fragment, must be assigned to a particular site in the distributed system so that the distribution is "optimal". This process is called data distribution (or data allocation). The choice of sites and the degree of replication depend on the performance and availability goals of the system and on the types and frequencies of transactions submitted at each site.

Example 4.4: If high availability is required, transactions can be submitted at any site, and most transactions are retrieval only, a fully replicated database is a good choice.
However, if certain transactions that access particular parts of the database are mostly submitted at a particular site, the corresponding set of fragments can be allocated at that site only. Data that is accessed at multiple sites can be replicated at those sites. If many updates are performed, it may be useful to limit replication. Finding an optimal or even a good solution to distributed data allocation is a complex optimization problem.

There are four alternative strategies regarding the placement of data: centralized, fragmented, complete replication, and selective replication.

Centralized
This strategy consists of a single database and DBMS stored at one site with users distributed across the network. Locality of reference is at its lowest, as all sites except the central site have to use the network for all data accesses. This also means that communication costs are high. Reliability and availability are low, as a failure of the central site results in the loss of the entire database system.

Fragmented (or Partitioned)
This strategy partitions the database into disjoint fragments, with each fragment assigned to one site. If data items are located at the site where they are used most frequently, locality of reference is high. As there is no replication, storage costs are low; similarly, reliability and availability are low, although they are higher than in the centralized case, as the failure of a site results in the loss of only that site's data. Performance should be good and communication costs low if the distribution is designed properly.

Complete Replication
This strategy consists of maintaining a complete copy of the database at each site. Therefore, locality of reference, reliability and availability, and performance are maximized. However, storage costs and communication costs for updates are the most expensive. To overcome some of these problems, snapshots are sometimes used.
A snapshot is a copy of the data at a given time. The copies are updated periodically (for example, hourly or weekly), so they may not always be up to date. Snapshots are also sometimes used to implement views in a distributed database, to improve the time it takes to perform a database operation on a view.

Selective Replication
This strategy is a combination of fragmentation, replication, and centralization. Some data items are fragmented to achieve high locality of reference, and others that are used at many sites and are not frequently updated are replicated; otherwise, the data items are centralized. The objective of this strategy is to have all the advantages of the other approaches but none of the disadvantages. This is the most commonly used strategy, because of its flexibility.

TYPES OF DISTRIBUTED DATABASE SYSTEMS

Distributed databases can be broadly classified into homogeneous and heterogeneous distributed database environments.

Homogeneous DDBS
In a homogeneous system, all sites use the same DBMS product. Homogeneous systems are much easier to design and manage. This approach provides incremental growth, making the addition of a new site to the DDBMS easy, and allows increased performance by exploiting the parallel processing capability of multiple sites.

Example 4.5: Consider that we have three departments using Oracle 19c as the DBMS. If some changes are made in one department, then the other departments are updated as well.

[Figure 4.11: Homogeneous distributed system]

Types of Homogeneous Distributed Database System
There are two types of homogeneous distributed database system:
• Autonomous: Each database is independent and functions on its own. The databases are integrated by a controlling application and use message passing to share data updates.
• Non-autonomous: Data is distributed across the homogeneous nodes, and a central or master DBMS co-ordinates data updates across the sites.
Heterogeneous DDBS
In a heterogeneous system, sites may run different DBMS products, which need not be based on the same underlying data model, and so the system may be composed of relational, network, hierarchical, and object-oriented DBMSs. Heterogeneous systems usually result when individual sites have implemented their own databases and integration is considered at a later stage. In a heterogeneous system, translations are required to allow communication between different DBMSs. To provide DBMS transparency, users must be able to make requests in the language of the DBMS at their local site. The system then has the task of locating the data and performing any necessary translation. Data may be required from another site that may have:
• Different hardware
• Different DBMS products
• Different hardware and different DBMS products

Example 4.6: In the following diagram, different DBMS software products are accessible to each other using generic connectivity (ODBC and JDBC).

[Figure 4.12: Heterogeneous distributed system]

If the hardware is different but the DBMS products are the same, the translation is straightforward, involving the change of codes and word lengths. If the DBMS products are different, the translation is complicated, involving the mapping of data structures in one data model to the equivalent data structures in another data model.

Example 4.7: Relations in the relational data model are mapped to records and sets in the network model. It is also necessary to translate the query language used (for example, SQL SELECT statements are mapped to the network FIND and GET statements).

If both the hardware and software are different, then both these types of translation are required. This makes the processing extremely complex.
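To make the data-structure mapping of Example 4.7 more concrete, the sketch below turns two relations linked by a foreign key into network-model style set occurrences, each with one owner record and its member records. It only illustrates the idea and is not how any particular gateway product works; the function name, the department names and the sample rows are hypothetical.

```python
def relations_to_network_sets(owner_rows, member_rows, member_fk_index):
    """Build one set occurrence per owner row, keyed by the owner's first column.

    owner_rows      -- tuples of the owner relation; column 0 is its key
    member_rows     -- tuples of the member relation
    member_fk_index -- position of the foreign key column in each member tuple
    """
    occurrences = {row[0]: {"owner": row, "members": []} for row in owner_rows}
    for member in member_rows:
        occurrences[member[member_fk_index]]["members"].append(member)
    return occurrences


# Hypothetical relational data: Department(Dept_id, Dept_name), Student(Stu_id, Stu_name, Dept_id).
departments = [(1, "Dept 1"), (2, "Dept 2")]
students = [(10, "Maya", 1), (11, "Abin", 2), (12, "Arav", 1)]

# Each department record becomes the owner of a set of its student records.
dept_student_sets = relations_to_network_sets(departments, students, member_fk_index=2)
```

In the relational model the link between a student and a department is a value (Dept_id); in the network model the same link becomes membership of a record in a set owned by the department record, which is why queries also have to be rewritten when crossing models.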
Types of Heterogeneous Distributed Database System
• Federated: The heterogeneous database systems are independent in nature and are integrated together so that they function as a single database system.
• Un-federated: The database systems employ a central coordinating module through which the databases are accessed.

DISTRIBUTED DATABASE ARCHITECTURES

DDBMS architectures are generally developed depending on three parameters:
• Distribution: It states the physical distribution of data across the different sites, i.e., whether the components of the system are located on the same machine or not.
• Autonomy: It indicates the distribution of control of the database system and the degree to which each constituent DBMS can operate independently. Autonomy is a function of a number of factors, such as whether the component systems (i.e., individual DBMSs) exchange information, whether they can independently execute transactions, and whether one is allowed to modify them.
• Heterogeneity: It refers to the uniformity or dissimilarity of the data models, system components, and databases.

General Architecture of Pure Distributed Databases
Here, we discuss both the logical and component architectural models of a DDBS. In Figure 4.13, which describes the generic schema architecture of a DDBS, the enterprise is presented with a consistent, unified view showing the logical structures of underlying data across all nodes.

[Figure 4.13: Schema architecture of distributed database (stored data at Site 1 through Site n)]

This architecture generally has four levels of schemas:
• External View or Schema (EV or ES): Depicts the user view of data.
• Global Conceptual Schema (GCS): Depicts the global logical view of data, which provides network transparency.
• Local Conceptual Schema (LCS): Depicts the logical data organization at each site.
• Local Internal Schema (LIS): Depicts the physical data organization at each site.

Federated Database Schema Architecture
A typical five-level schema architecture to support global applications in the FDBS environment is shown in Figure 4.14. In this architecture, the local schema is the conceptual schema (full database definition) of a component database, and the component schema is derived by translating the local schema into a canonical data model or common data model (CDM) for the FDBS. Schema translation from the local schema to the component schema is accompanied by generating mappings to transform commands on a component schema into commands on the corresponding local schema. The export schema represents the subset of a component schema that is available to the FDBS. The federated schema is the global schema or view, which is the result of integrating all the shareable export schemas. The external schemas define the schema for a user group or an application, as in the three-level schema architecture.

Overview of Three-Tier Client/Server Architecture
Full-scale DDBMSs have not been developed to support all the types of functionality that we have discussed so far. Instead, distributed database applications are being developed in the context of the client/server architectures. It is now more common to use a three-tier architecture rather than a two-tier architecture, particularly in Web applications. This architecture is illustrated in Figure 4.15.

[Figure 4.15 (three-tier client/server architecture): Client, the user interface or presentation tier (Web browser, HTML, JavaScript, Visual Basic, ...), connected through the HTTP protocol to the application tier, which is connected through ODBC, JDBC, SQL/CLI, SQLJ to the database server.]

In the three-tier client/server architecture, the following three layers exist:
• Presentation layer (client): This provides the user interface and interacts with the user. The programs at this layer present Web interfaces or forms to the client in order to interface with the application. Web browsers are often utilized.
  The languages and specifications used at this layer include HTML, XHTML, CSS, Flash, MathML, Scalable Vector Graphics (SVG), Java, JavaScript, Adobe Flex, and others. This layer handles user input, output, and navigation by accepting user commands and displaying the needed information, usually in the form of static or dynamic Web pages. The latter are employed when the interaction involves database access. When a Web interface is used, this layer typically communicates with the application layer via the HTTP protocol.
• Application layer (business logic): This layer programs the application logic. For example, queries can be formulated based on user input from the client, or query results can be formatted and sent to the client for presentation. Additional application functionality can be handled at this layer, such as security checks, identity verification, and other functions. The application layer can interact with one or more databases or data sources as needed by connecting to the database using ODBC, JDBC, SQL/CLI, or other database access techniques.
• Database server: This layer handles query and update requests from the application layer, processes the requests, and sends the results. Usually, SQL is used to access the database if it is relational or object-relational, and stored database procedures may also be invoked. Query results (and queries) may be formatted into XML when transmitted between the application server and the database server.

INTRODUCTION TO NOSQL SYSTEMS

A NoSQL database, which stands for "non SQL" or "non-relational", is a database that provides a mechanism for data storage and retrieval. It avoids joins and is easy to scale. The major purpose of using a NoSQL database is for distributed data stores with humongous data storage needs. NoSQL is used for big data and real-time web apps.
For example, companies like Twitter, Facebook and Google collect terabytes of user data every single day. A traditional RDBMS uses SQL syntax to store and retrieve data for further insights. A NoSQL database system, instead, encompasses a wide range of database technologies that can store structured, semi-structured, unstructured and polymorphic data.

Why NoSQL?

The concept of NoSQL databases became popular with Internet giants like Google, Facebook, Amazon, etc., who deal with huge volumes of data. The system response time becomes slow when you use an RDBMS for massive volumes of data. To resolve this problem, we could "scale up" our systems by upgrading our existing hardware. This process is expensive. The alternative is to distribute the database load over multiple hosts whenever the load increases. This method is known as "scaling out." A NoSQL database is non-relational, so it scales out better than relational databases, and NoSQL databases are designed with web applications in mind.

Figure 4.16: Scale-up (vertical scaling): more RAM, more CPU, more HDD

Figure 4.16: Scale-out (horizontal scaling)

Differences between RDBMS and NoSQL

An RDBMS is a relational database, while a NoSQL system is non-relational and usually distributed. NoSQL stores do not define relations between their data sets, and they provide a proper way to work with unstructured data. An RDBMS is scalable vertically, and NoSQL is scalable horizontally. Hence, in an RDBMS, capacity is normally added by upgrading the server. Maintenance of an RDBMS is expensive, as manpower is needed to manage the servers added to the database. NoSQL is mostly automatic and does some repairs on its own. The effort for data distribution and administration is lower in NoSQL.
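
The scale-out idea described above can be illustrated with a minimal sketch. It assumes a simple hash-modulo placement scheme; the host names and the `shard_for` helper are hypothetical and are not taken from any particular NoSQL product.

```python
import hashlib

def shard_for(key, hosts):
    """Pick the host that stores `key` using a stable hash (hypothetical helper)."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return hosts[int(digest, 16) % len(hosts)]

# Hypothetical host names; adding a host to this list is "scaling out".
hosts = ["db-host-0", "db-host-1", "db-host-2"]

# Each user record is routed to exactly one host, so the load is spread out.
placement = {user: shard_for(user, hosts) for user in ["indra", "chokraj", "karen", "bill"]}
```

With plain modulo hashing, changing the number of hosts moves many keys to a different host, which is why production systems often prefer schemes such as consistent hashing.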
Example: Let's take data stored in an RDBMS as:

User
Uid | Fname | Lname
1 | Indra | Chaudhary
2 | Chokraj | Dawadi

Skill
User id | Skill name
1 | Big Data
1 | Cloud
2 | Calculus

Experience
User id | Role | Company
1 | Full time faculty | CAB College
2 | Principal | New Summit College
2 | Visiting faculty member | KMC

Reading this profile would require the application to read 4 rows from three tables.

Lname | Skill name | Role | Company
Chaudhary | Big Data | Full time faculty | CAB College
Chaudhary | Cloud | Full time faculty | CAB College
Dawadi | Calculus | Principal | New Summit College
Dawadi | Calculus | Visiting faculty member | KMC

In NoSQL we can express the data of the above three tables in the form of JSON as below:

    {
      "User": [
        { "Uid": 1, "First name": "Indra", "Last name": "Chaudhary" },
        { "Uid": 2, "First name": "Chokraj", "Last name": "Dawadi" }
      ],
      "Skill": [ "Big Data", "Cloud", "Calculus" ],
      "Experience": [
        { "Role": "Full time faculty", "Company": "CAB College" },
        { "Role": "Principal", "Company": "New Summit College" },
        { "Role": "Visiting faculty", "Company": "KMC" }
      ]
    }

Major differences between them are tabulated below:

RDBMS | NoSQL
Users know RDBMS well, as it is old and many organizations use it for data in a proper format. | It is relatively new, and NoSQL experts are fewer, as this kind of database is evolving day by day.
User interface tools to access data are available in the market, so users can try out schemas on the RDBMS infrastructure. This helps users interact with the data and understand it better. | User interface tools to access and manipulate data in NoSQL are few, so users do not have many options to interact with the data.
Scalability and performance face some issues if the data is huge. Servers may not run properly with the available load, which leads to performance issues. | It works well with high loads. Scalability is very good in NoSQL, which makes the performance of the database better when compared with RDBMS. A huge amount of data can be handled easily.
Multiple tables can be joined easily, and this does not cause any latency in the working of the database. The primary key helps in this case. | Multiple tables cannot be joined easily, as a join is not an easy task for the database and does not work well with its performance.
The availability of the database depends on the server performance, and it is mostly available whenever the database is opened. The data provided is consistent and does not confuse users. | Though the databases are readily available, the consistency provided in some databases is weaker, in exchange for better performance.
Data analysis and querying can be done easily, even when the queries are complex. Slicing and dicing can be done with the available data to analyze it properly. | Data analysis is also done in NoSQL, and it works well for real-time data analytics. Reports are not produced in the database, but if an application has to be built, NoSQL is a solution.
Documents cannot be stored directly, because data in the database should be structured and in a proper format to create identifiers. | Documents can be stored, as the data is unstructured and not in a rows-and-columns format.
Partitioning the data across servers is difficult. Keys specified in the schema of the database are needed to identify the data. | Partitions can be created in the database easily, and key-value pairs are not needed to identify the data in the source. Software as a service can be integrated with NoSQL.
Example: MySQL, Oracle, SQL Server, etc. | Example: IBM Domino, Oracle NoSQL, Apache HBase, etc.

Types of NoSQL Databases

Over time, different varieties of NoSQL databases have been created to support specific needs and use cases. These fall into four main categories:

Document Databases

Document databases store data in documents, like JSON (JavaScript Object Notation) objects. Each document has a set of field and value pairs. The values might be of many sorts, such as text, integers, Booleans, arrays, or objects, and their structures are usually aligned with the objects that developers interact with within code.

Document databases are useful for a broad number of use cases and may be utilized as a general-purpose database due to their variety of field value types. They can expand out horizontally to accommodate enormous data volumes.

Key-Value Databases

Key-value databases are a simpler form of database that has keys and values for each item. Learning how to query for a certain key-value pair is usually straightforward, because a value can only be accessed by referencing its key. Key-value databases are ideal for situations in which you need to store a significant quantity of data but don't need to access it using complicated queries. Caching and saving user preferences are two common use cases. Popular key-value databases include Redis and DynamoDB.

Wide-Column Stores

Wide-column NoSQL databases store data in tables with rows and columns similar to an RDBMS, but the names and formats of columns can vary from row to row across the table. Wide-column databases group columns of related data together. A query can retrieve related data in a single operation because only the columns associated with the query are retrieved. In an RDBMS, the data would be in different rows stored in different places on disk, requiring multiple disk operations for retrieval.
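
To make the three store types described so far more concrete, the following sketch keeps the same user record in three shapes. It uses plain Python dictionaries as stand-ins, not the API of any real database, and the names `kv_store`, `documents`, `wide_rows` and the helper functions are hypothetical.

```python
import json

# Key-value store: the value is an opaque blob reachable only through its key.
kv_store = {"user:1": json.dumps({"fname": "Indra", "lname": "Chaudhary"})}

# Document store: field/value pairs whose values may be arrays or nested objects.
documents = {
    1: {
        "fname": "Indra",
        "lname": "Chaudhary",
        "skills": ["Big Data", "Cloud"],
        "experience": [{"role": "Full time faculty", "company": "CAB College"}],
    },
}

# Wide-column store: every row may carry a different set of columns.
wide_rows = {
    1: {"fname": "Indra", "lname": "Chaudhary", "skill:0": "Big Data", "skill:1": "Cloud"},
    2: {"fname": "Chokraj", "lname": "Dawadi", "skill:0": "Calculus"},
}

def get_by_key(key):
    """Key-value access: look up by key, then decode the stored blob."""
    return json.loads(kv_store[key])

def find_by_skill(skill):
    """Document access: query on a field inside the document."""
    return [doc_id for doc_id, doc in documents.items() if skill in doc["skills"]]

def read_columns(row_id, columns):
    """Wide-column access: fetch only the requested columns that the row has."""
    row = wide_rows[row_id]
    return {name: row[name] for name in columns if name in row}
```

The key-value form can only be read back by its key, the document form can be searched by its fields, and the wide-column form lets row 2 simply omit the `skill:1` column that row 1 has.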
Graph Databases

Data is stored in nodes and edges in graph databases. Nodes store information about people, locations, and objects, whereas edges hold information about the relationships between nodes. A graph database uses graph structures to store, map, and query relationships. They provide index-free adjacency, so that adjacent elements are linked together without using an index.

Advantages of NoSQL

NoSQL databases offer enterprises important advantages over traditional RDBMS, including:

Scalability

NoSQL databases use a horizontal scale-out methodology that makes it easy to add or reduce capacity quickly and non-disruptively with commodity hardware. This eliminates the tremendous cost and complexity of manual sharding that is necessary when attempting to scale an RDBMS.

Performance

By simply adding commodity resources, enterprises can increase performance with NoSQL databases. This enables organizations to continue to deliver reliably fast user experiences with a predictable return on investment for adding resources, again without the overhead associated with manual sharding.

High Availability

NoSQL databases are generally designed to ensure high availability and avoid the complexity that comes with a typical RDBMS architecture that relies on primary and secondary nodes. Some "distributed" NoSQL databases use a masterless architecture that automatically distributes data equally among multiple resources, so that the application remains available for both read and write operations even when one node fails.

Global Availability

By automatically replicating data across multiple servers, data centers, or cloud resources, distributed NoSQL databases can minimize latency and ensure a consistent application experience wherever users are located. An added benefit is a significantly reduced database management burden from manual RDBMS configuration, freeing operations teams to focus on other business priorities.
Flexible Data Modeling

NoSQL offers the ability to implement flexible and fluid data models. Application developers can leverage the data types and query options that are the most natural fit to the specific application use case rather than those that fit the database schema. The result is a simpler interaction between the application and the database and faster, more agile development.

Less Management

Despite huge advancements in the DBMS domain over the years, relational databases rely heavily on database administrators, also known as DBAs. NoSQL databases, on the other hand, are typically built from the ground up to eliminate needless management, with automated data distribution and simpler data models, resulting in lower administration and performance-tuning demands.

THE CAP THEOREM

CAP stands for Consistency, Availability, and Partition tolerance. It is very important for understanding the limitations of NoSQL databases: a distributed NoSQL database cannot guarantee consistency and high availability together while a network partition is in progress. This was first expressed by Eric Brewer in the CAP theorem. The CAP theorem, or Brewer's theorem, states that we can achieve at most two out of three guarantees for a database: Consistency, Availability and Partition Tolerance.

Consistency

Consistency is all about data consistency, or in other words, making sure that within a distributed environment, every node of the database has exactly the same information at any given time. Imagine having two nodes with purchase orders from your e-commerce site. If there is a consistency problem between them and they are not in sync, then the moment you query the data, the results might show you missing transactions.

Consistency is definitely an important characteristic for a database. However, not all of them can provide it. So what do they do instead? They go for something called "eventual consistency", meaning that while at a given point the cluster may not be consistent, it will eventually be so. This avoids the kind of problems described above in the long run, but calculations based on data read in the meantime may be out of date.

Availability

Availability stands for "high availability", or in other words, the ability of the database to always be available, no matter what happens. This is not the same as "fault tolerance" and should not be confused with it. A highly available database is usually one that has replicas in multiple geographical zones, so that if there is a big network outage, it will still be accessible through one of its other replicas. For example, a system that is only installed and working on one of our servers can't be highly available, because the moment that server fails, we lose our database.

Partition Tolerance

Partitioning stands for "partition tolerance", or in other words, having the ability to support broken links within the cluster the database is distributed in. Think about a graph representing your database cluster. You have multiple nodes sharing data and working wonderfully, and suddenly there is a problem and a section of that cluster fails. If the database is "partition tolerant", it will still work despite the sudden lack of some of its nodes.

The CAP theorem categorizes systems into three categories: CP systems, AP systems, and CA systems.

Figure 4.17: Visual representation of the CAP diagram

CP (Consistent and Partition Tolerant) database: A CP database delivers consistency and partition tolerance at the expense of availability. When a partition occurs between any two nodes, the system has to shut down the non-consistent node (i.e., make it unavailable) until the partition is resolved.

A partition refers to a communication break between nodes within a distributed system. That is, if a node cannot receive any messages from another node in the system, there is a partition between the two nodes. The partition could have been caused by a network failure, a server crash, or any other reason.

AP (Available and Partition Tolerant) database: An AP database delivers availability and partition tolerance at the expense of consistency. When a partition occurs, all nodes remain available, but those at the wrong end of a partition might return an older version of data than others. When the partition is resolved, AP databases typically resync the nodes to repair all inconsistencies in the system.

CA (Consistent and Available) database: A CA database delivers consistency and availability in the absence of any network partition. Often single-node DB servers are categorized as CA systems. Single-node DB servers do not need to deal with partition tolerance and are thus considered CA systems.

In any networked shared-data system or distributed system, partition tolerance is a must. Network partitions and dropped messages are a fact of life and must be handled appropriately. Consequently, system designers must choose between consistency and availability.

Figure 4.18: Classification of different databases based on the CAP theorem. Consistency: all clients see the same data at the same time. Availability: the system continues to operate even in the presence of node failures. Partition tolerance: the system continues to operate in spite of network failures. Databases shown include Microsoft SQL Server, Apache HBase, MongoDB, and Cassandra.

Disadvantages of NoSQL

The disadvantages of NoSQL are described below:

• No standardization rules.
• Limited query capabilities.
• RDBMS databases and tools are comparatively mature.
• It often does not offer traditional database capabilities, such as consistency when multiple transactions are performed simultaneously.
• When the volume of data increases, it is difficult to maintain unique values as keys.
• It doesn't work as well with relational data.
• The learning curve is steep for new developers.
• Many options are open source, which makes them less popular with enterprises.
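
The CP and AP behaviours from the CAP theorem discussion can be contrasted with a toy model. This is only a sketch under strong assumptions: two replicas, all writes arriving at the primary, and a naive "copy the primary" resync. The `ToyCluster` class and the `Unavailable` exception are hypothetical names, not part of any database.

```python
class Unavailable(Exception):
    """Raised when a CP cluster refuses to answer during a partition."""

class ToyCluster:
    def __init__(self, mode):
        if mode not in ("CP", "AP"):
            raise ValueError("mode must be 'CP' or 'AP'")
        self.mode = mode
        self.primary = {}      # replica that keeps receiving writes
        self.secondary = {}    # replica that gets cut off by the partition
        self.partitioned = False

    def write(self, key, value):
        # In this sketch every write reaches the primary.
        self.primary[key] = value
        if not self.partitioned:
            self.secondary[key] = value  # replication only works without a partition

    def read_secondary(self, key):
        if self.partitioned and self.mode == "CP":
            # CP: the possibly inconsistent node is made unavailable.
            raise Unavailable("secondary is offline until the partition is resolved")
        # AP: answer anyway, even if the value may be stale.
        return self.secondary.get(key)

    def heal(self):
        # Naive resync: the primary's state simply overwrites the secondary's.
        self.partitioned = False
        self.secondary = dict(self.primary)

cp, ap = ToyCluster("CP"), ToyCluster("AP")
for cluster in (cp, ap):
    cluster.write("order:1", "paid")
    cluster.partitioned = True
    cluster.write("order:1", "shipped")   # not replicated during the partition

stale = ap.read_secondary("order:1")      # AP answers with the older version
try:
    cp.read_secondary("order:1")
    cp_answered = True
except Unavailable:
    cp_answered = False                   # CP gives up availability instead

ap.heal()
repaired = ap.read_secondary("order:1")   # consistent again after the resync
```

Real AP systems must also reconcile conflicting writes made on both sides of a partition, which this sketch deliberately leaves out.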
DOCUMENT-BASED, KEY-VALUE STORES, COLUMN-BASED, AND GRAPH-BASED SYSTEMS

Key-Value Database

Key-value databases are the simplest type of NoSQL database. Thanks to their simplicity, they are also the most scalable, allowing horizontal scaling of large amounts of data.

These NoSQL databases have a dictionary data structure that consists of a set of objects that represent fields of data. Each object is assigned a unique key. To retrieve data stored in a particular object, you need to use a specific key. In turn, you get the value (i.e., data) assigned to the key. This value can be a number, a string, or even another set of key-value pairs.

Figure 4.19: Key-value database

Unlike traditional relational databases, key-value databases do not require a predefined structure. They offer more flexibility when storing data and have faster performance. Without having to rely on placeholders, key-value databases are a lighter solution, as they require fewer resources. Such functionalities are suitable for large databases that deal with simple data. Therefore, they are commonly used for caching, storing and managing user sessions, ad servicing, and recommendations.

Document Database

A document database is a type of NoSQL database that consists of sets of key-value pairs stored into a document. These documents are basic units of data which you can also group into collections (databases) based on their functionality.

Figure 4.20: Document database

Being a NoSQL database, you can easily store data without implementing a schema. You can transfer the object model directly into a document using several different formats. The most commonly used are JSON, BSON, and XML.
Here is an example of a simple document in JSON format that consists of three key-value pairs:

    {
      "ID": "001",
      "Name": "John",
      "Grade": "Senior"
    }

What's more, you can also use nested queries in such formats, providing easier data distribution across multiple disks and enhanced performance. For instance, we can add a nested value string to the document above:

    {
      "ID": "001",
      ...
    }

Due to their structure, document databases are optimal for use cases that require flexibility and fast, continual development. For example, you can use them for managing user profiles, which differ according to the information provided. Their schema-less structure allows you to have different attributes and values. Examples of NoSQL document databases include MongoDB, CouchDB, Elasticsearch, and others.

Wide-Column Database

Wide-column stores are another type of NoSQL database. In them, data is stored and grouped into separately stored columns instead of rows. Such databases organize information into columns that function similarly to tables in relational databases.

Row-oriented
ID | Name | Grade | GPA
001 | John | Senior | 4.00
002 | Karen | Freshman | 3.67
003 | Bill | Junior | 3.33

Column-oriented
Name | ID | Grade | ID | GPA | ID
John | 001 | Senior | 001 | 4.00 | 001
Karen | 002 | Freshman | 002 | 3.67 | 002
Bill | 003 | Junior | 003 | 3.33 | 003

However, unlike traditional databases, wide-column databases are highly flexible. They have no predefined keys nor column names. Their schema-free characteristic allows variation of column names even within the same table, as well as adding columns in real time.

The most significant benefit of having column-oriented databases is that you can store large amounts of data within a single column. This feature allows you to reduce disk resources and the time it takes to retrieve information from it. They are also excellent in situations when you have to spread data across multiple servers.
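
The difference between the row-oriented and column-oriented layouts can be sketched with in-memory Python structures. This only illustrates the idea and says nothing about how a real wide-column store lays out data on disk; the `to_columns` helper is a hypothetical name.

```python
# Row-oriented layout: one record per student, all fields kept together.
rows = [
    {"ID": "001", "Name": "John", "Grade": "Senior", "GPA": 4.00},
    {"ID": "002", "Name": "Karen", "Grade": "Freshman", "GPA": 3.67},
    {"ID": "003", "Name": "Bill", "Grade": "Junior", "GPA": 3.33},
]

def to_columns(records):
    """Regroup row records into one {ID: value} mapping per column."""
    columns = {}
    for record in records:
        for name, value in record.items():
            if name != "ID":
                columns.setdefault(name, {})[record["ID"]] = value
    return columns

# Column-oriented layout: each column is stored separately, keyed by ID.
columns = to_columns(rows)

# An aggregate over GPA touches only the GPA column, not names or grades.
average_gpa = sum(columns["GPA"].values()) / len(columns["GPA"])
```

In the column layout, a query that needs a single attribute for every student reads one mapping, while the row layout would have to walk through every full record.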
Examples of popular wide-column databases include Apache Cassandra, HBase, and Azure Cosmos DB.

Graph Database

Graph databases use flexible graphical representation to manage data. These graphs consist of two elements:

• Nodes (for storing data entities)
• Edges (for storing the relationship between entities)

These relationships between entities allow data in the store to be linked together directly and, in many cases, retrieved with one operation. Nodes and edges have defined properties, and by using these properties, you can query data easily.

Figure 4.21: Graph database

Since this type of data storage is quite specific, it is not a commonly used NoSQL solution. However, there are certain use cases in which having graphical representations is the best solution. For instance, social networks often use graphs to store information about how their users are linked. OrientDB, RedisGraph, and Neo4j are just some examples of graph databases you should consider using.

BIG DATA

Big Data is a collection of data that is huge in volume, yet growing exponentially with time. It is data of such large size and complexity that none of the traditional data management tools can store it or process it efficiently. Big data is also data, but with a huge size.

Big data can be categorized as unstructured or structured. Structured data consists of information already managed by the organization in databases and spreadsheets; it is frequently numeric in nature. Unstructured data is information that is unorganized and does not fall into a predetermined model or format. It includes data gathered from social media sources, which help institutions gather information on customer needs.

Big data can be collected from publicly shared comments on social networks and websites, voluntarily gathered from personal electronics and apps, through questionnaires, product purchases, and electronic check-ins.
The presence of sensors and other inputs in smart devices allows for data to be gathered across a broad spectrum of situations and circumstances.

Big data is most often stored in computer databases and is analyzed using software specifically designed to handle large, complex data sets. Many software-as-a-service (SaaS) companies specialize in managing this type of complex data.

Figure 4.22: Uses of Big Data

Big Data Architecture

Big data architecture refers to the logical and physical structure that dictates how high volumes of data are ingested, processed, stored, managed, and accessed.

Most big data architectures include some or all of the following components:

Figure 4.23: Big data architecture (data sources, data storage, batch processing, real-time message ingestion, stream processing, analytical data store, analytics and reporting, orchestration)

1. Data sources: All big data solutions start with one or more data sources. Examples include:
   • Application data stores, such as relational databases.
   • Static files produced by applications, such as web server log files.
   • Real-time data sources, such as IoT devices.

2. Data storage: Data for batch processing operations is typically stored in a distributed file store that can hold high volumes of large files in various formats. This kind of store is often called a data lake. Options for implementing this storage include Azure Data Lake Store or blob containers in Azure Storage.

3. Batch processing: Because the data sets are so large, often a big data solution must process data files using long-running batch jobs to filter, aggregate, and otherwise prepare the data for analysis. Usually these jobs involve reading source files, processing them, and writing the output to new files. Options include running U-SQL jobs in Azure Data Lake Analytics, using Hive, Pig, or custom Map/Reduce jobs in an HDInsight Hadoop cluster, or using Java, Scala, or Python programs in an HDInsight Spark cluster.

4. Real-time message ingestion: If the solution includes real-time sources, the architecture must include a way to capture and store real-time messages for stream processing. This might be a simple data store, where incoming messages are dropped into a folder for processing. However, many solutions need a message ingestion store to act as a buffer for messages, and to support scale-out processing, reliable delivery, and other message queuing semantics. Options include Azure Event Hubs, Azure IoT Hubs, and Kafka.

5. Stream processing: After capturing real-time messages, the solution must process them by filtering, aggregating, and otherwise preparing the data for analysis. The processed stream data is then written to an output sink. Azure Stream Analytics provides a managed stream processing service based on perpetually running SQL queries that operate on unbounded streams. You can also use open source Apache streaming technologies like Storm and Spark Streaming in an HDInsight cluster.

6. Analytical data store: Many big data solutions prepare data for analysis and then serve the processed data in a structured format that can be queried using analytical tools. The analytical data store used to serve these queries can be a Kimball-style relational data warehouse, as seen in most traditional business intelligence (BI) solutions. Alternatively, the data could be presented through a low-latency NoSQL technology such as HBase, or an interactive Hive database that provides a metadata abstraction over data files in the distributed data store. Azure Synapse Analytics provides a managed service for large-scale, cloud-based data warehousing. HDInsight supports Interactive Hive, HBase, and Spark SQL, which can also be used to serve data for analysis.

7. Analysis and reporting: The goal of most big data solutions is to provide insights into the data through analysis and reporting. To empower users to analyze the data, the architecture may include a data modeling layer, such as a multidimensional OLAP cube or tabular data model in Azure Analysis Services. It might also support self-service BI, using the modeling and visualization technologies in Microsoft Power BI or Microsoft Excel. Analysis and reporting can also take the form of interactive data exploration by data scientists or data analysts. For these scenarios, many Azure services support analytical notebooks, such as Jupyter, enabling these users to leverage their existing skills with Python or R. For large-scale data exploration, you can use Microsoft R Server, either standalone or with Spark.

8. Orchestration: Most big data solutions consist of repeated data processing operations, encapsulated in workflows, that transform source data, move data between multiple sources and sinks, load the processed data into an analytical data store, or push the results straight to a report or dashboard. To automate these workflows, you can use an orchestration technology such as Azure Data Factory or Apache Oozie and Sqoop.

Characteristics of Big Data

Three characteristics define Big Data: volume, variety, and velocity. They have created the need for a new class of capabilities to augment the way things are done today, to provide a better line of sight and control over our existing knowledge domains and the ability to act on them.

Figure 4.24: Three V big data model (volume: terabytes of data, billions of records; velocity: real-time; variety: structured, unstructured, and semi-structured data)

1. Volume

The sheer volume of data being stored today is exploding. In the year 2000, 800,000 petabytes (PB) of data were stored in the world. A lot of the data being created today isn't analyzed at all, and that is another problem to be considered. This number is expected to reach 35 zettabytes (ZB) by 2020. Twitter alone generates more than 7 terabytes (TB) of data every day, Facebook 10 TB, and some enterprises generate terabytes of data every hour of every day of the year. It is no longer unheard of for individual enterprises to have storage clusters holding petabytes of data.

The name Big Data itself is related to an enormous size. Big Data is a vast volume of data generated from many sources daily, such as business processes, machines, social media platforms, networks, human interactions, and many more. Facebook can generate approximately a billion messages, record 4.5 billion clicks of the "Like" button, and receive more than 360 million new posts each day. Big data technologies can handle such large amounts of data.

2. Variety

Big Data can be structured, unstructured, or semi-structured, and is collected from different sources. In the past, data was collected only from databases and spreadsheets, but these days data comes in an array of forms, such as PDFs, emails, audio, social media posts, photos, videos, etc.

3. Velocity

Velocity plays an important role compared to the others. Velocity refers to the speed at which data is created in real time. It covers the speed of incoming data sets, the rate of change, and activity bursts. A primary aspect of Big Data is to provide the demanded data rapidly. Big data velocity deals with the speed at which data flows from sources like application logs, business processes, networks, social media sites, sensors, mobile devices, etc.

Advantages and Disadvantages of Big Data

The advantages of Big Data analytics listed below show how retrieving information and collecting data is useful.

1. Voluminous Collection
A large amount of market data can be generated using Big Data analytics, and various graphical and mathematical representations can be made for easy analysis. This massive information is further helpful for deriving market-based conclusions and predicting consumer behavior. However, new technology needs to ramp up in this field, as traditional software cannot process big data.

2. Future Insights
With the predictions and statistical data obtained, a business can control its prospects. The growth and future problems of a business can be handled well using data analysis. Using these datasets, a company can plan its launches and also create products and services. Scientists foresee the same benefits of Big Data analytics in the healthcare industry and societal affairs.

3. Cost Efficiency
It is essential to understand the changing trends of the market, and to do that, market analysis needs to be done. At a particular time, a specific trend is on the rise while others remain constant or decrease. A business depends on actual demand, and if it can predict that demand, it can have control over production. This can save costs incurred in storing raw material and finished products.

4. Research Takes Less Time
New software can easily analyze and interpret data sets, which helps make decisions and saves a lot of time. Also, new data can be generated automatically in bulk with updated information and trends. This can help businesses to stay stable in the long run.

5. Fraud Detection and Prevention
Big Data is capable of stopping fraudulent transactions, as in the case of banking services. Fraudsters are getting smarter, and people need to know not to share personal information; automated software can detect fraudulent accounts and cards. Based on recurring patterns and the spending behavior of consumers, it is also possible to track what is usually missed manually.
Disadvantages of Big Data

Since all the information collected requires a lot of effort and resources, storing it before it can be examined needs a vast space. Although the analysis of enormous information seems possible, some significant disadvantages of Big Data come to light in terms of space, cost, and user security.

1. Unstructured Data
The data collected can be arranged or present in the form of random information. More variations in data can create difficulty in processing results and generating solutions. If the information is broken or unstructured, many users can get neglected while deriving future outcomes or analyzing present scenarios.

2. Security Concerns
For highly secured data or confidential information, highly secured networks are needed for its transfer and storage. Furthermore, with increasingly complex political situations between nations, leaked data can be used as an advantage by adversaries, so keeping it secure is essential and requires building such a network.

3. Expensive
The process of data generation and its analysis is costly, without the surety of favorable results. As in the space sector, it is mainly the top businesses and the wealthiest companies and individuals that can carry out research in this field. The cost of setting up super-computers is one of the leading disadvantages of Big Data analytics. Even if the information is arranged for and resides on the cloud, the cost incurred for its maintenance is significant.

4. Skilled Professionals
The professionals needed to carry out research and run complex software are highly paid and hard to find. There is a scarcity of skilled analysts despite the increasing demand for the data analyst job. To remain in the market, it is necessary to keep yourself updated with further information.

5. Hardware and Storage
The servers and hardware needed to store and run high-quality software are very costly and hard to build. Also, the information is available in bulk with constant changes, and processing it requires faster software and applications. One cannot forget, either, the uncertainty involved with getting accurate results.

MAPREDUCE

Map-Reduce is a programming model designed for processing large volumes of data in parallel by dividing the work into a set of independent tasks. Map-Reduce programs are written in a particular style influenced by functional programming constructs, specifically idioms for processing lists of data. This section explains the nature of this programming model and how it can be used to write programs which run in the Hadoop environment.

MapReduce is a Hadoop framework used for writing applications that can process vast amounts of data on large clusters. It can also be called a programming model in which we can process large datasets across computer clusters. It allows data to be stored in a distributed form, and it simplifies the processing of enormous volumes of data and large-scale computing.

There are two primary tasks in MapReduce: map and reduce. We perform the former task before the latter. In the map job, we split the input dataset into chunks. Map tasks process these chunks in parallel. The outputs of the map tasks are used as inputs for the reduce tasks. Reducers process the intermediate data from the maps into smaller tuples, leading to the final output of the framework.

The MapReduce framework takes care of the scheduling and monitoring of tasks. Failed tasks are re-executed by the framework.
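
The map and reduce tasks described above can be sketched in plain Python with the classic word-count example. This is a single-process illustration of the programming model only, not Hadoop's API; the function names `map_chunk`, `shuffle`, `reduce_group` and `word_count` are hypothetical.

```python
from collections import defaultdict

def map_chunk(chunk):
    """Map task: emit a (word, 1) pair for every word in one input chunk."""
    return [(word.lower(), 1) for word in chunk.split()]

def shuffle(mapped_outputs):
    """Group all intermediate values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for pairs in mapped_outputs:
        for key, value in pairs:
            groups[key].append(value)
    return groups

def reduce_group(key, values):
    """Reduce task: collapse the values for one key into a smaller tuple."""
    return key, sum(values)

def word_count(chunks):
    # On a cluster the map tasks would run in parallel; here they run one by one.
    mapped = [map_chunk(chunk) for chunk in chunks]
    grouped = shuffle(mapped)
    return dict(reduce_group(key, values) for key, values in grouped.items())

# The input dataset split into two chunks.
chunks = ["big data big clusters", "data on large clusters"]
counts = word_count(chunks)
```

Because each map call depends only on its own chunk and each reduce call only on one key, the framework is free to distribute these calls across machines and to rerun any call that fails.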
This framework can be used easily, even by programmers with little expertise in distributed processing. MapReduce jobs can be implemented using various languages and tools, such as Java, Hive, Pig, Scala, and Python.

How MapReduce in Hadoop works

An overview of the MapReduce architecture and MapReduce's phases will help us understand how MapReduce in Hadoop works.

MapReduce architecture
