
Acuity Educare

NGT: SEM V, UNIT 1


Q1: What is big data?


Ans:
 Big data is a term used to describe data that has massive volume, comes in a variety of
structures, and is generated at high velocity. This kind of data poses challenges to the
traditional RDBMS systems used for storing and processing data and is paving the way
for newer approaches to processing and storing data. In short, big data has high
volume, is generated at high velocity, and comes in multiple varieties.
 Big data challenges include capturing data, data storage, data analysis,
search, sharing, transfer, visualization, querying, updating, information privacy and data
source.
 There are a number of concepts associated with big data. Originally there were three:
volume, variety, and velocity. Other concepts later attributed to big data
are veracity (i.e., how much noise is in the data) and value.
 Lately, the term "big data" tends to refer to the use of predictive analytics, user
behaviour analytics, or certain other advanced data analytics methods that extract value
from data, and seldom to a particular size of data set.
 Data sets grow rapidly, in part because they are increasingly gathered by cheap and
numerous information-sensing Internet of Things devices such as mobile devices, aerial
sensors (remote sensing), software logs, cameras, microphones, radio-frequency
identification (RFID) readers and wireless sensor networks.
 The world's technological per-capita capacity to store information has roughly doubled
every 40 months since the 1980s; as of 2012, every day 2.5 exabytes (2.5×10^18 bytes) of data
are generated.
 Based on an IDC report prediction, the global data volume will grow exponentially from
4.4 zettabytes to 44 zettabytes between 2013 and 2020.
 By 2025, IDC predicts there will be 163 zettabytes of data.
 One question for large enterprises is determining who should own big-data initiatives
that affect the entire organization.

Q2: Explain Facts about big data


Ans:
 Various research teams around the world have done analysis on the amount of data
being generated. For example, IDC’s analysis revealed that the amount of digital data
generated in a single year (2007) is larger than the world’s total capacity to store it,
which means there is no way in which to store all of the data that is being generated.
Also, the rate at which data is getting generated will soon outgrow the rate at which data
storage capacity is expanding.
 The study makes the case that the business and economic possibilities of big data and
its wider implications are important issues that business leaders and policy makers must
tackle.

The Size of Big Data Varies Across Sectors


 The growth of big data is a phenomenon that is observed in every sector. MGI estimates
that enterprises around the world used more than 7 exabytes of incremental disk drive
data storage capacity in 2010; what’s interesting is that nearly 80 percent of that total
seemed to duplicate data that was stored elsewhere.
 MGI also estimated that, by 2009, nearly all sectors in the US economy had at least an
average of 200 terabytes of stored data per company and that many sectors had more
than 1 petabyte in mean stored data per company. Some sectors exhibited far higher
levels of data intensity than others; in this case, data intensity refers to the average
amount of data getting accumulated across companies/firms of that sector, implying that
they have more potential to capture value from big data. Financial services sectors,
including banking, investment, and securities services, are highly transaction-oriented;
they are also required by regulations to store data. The analysis shows that they have
the most digital data stored per firm on average. Communications and media firms,
utilities, and government also have significant digital data stored per enterprise or
organization, which appears to reflect the fact that such entities have a high volume of
operations and multimedia data. Discrete and process manufacturing have the highest
aggregate data stored in bytes. However, these sectors rank much lower in intensity
terms, since they are fragmented into a large number of firms.

The Big Data Type Varies Across Sectors


 The MGI research also shows that the type of data stored also varies by sector. For
instance, retail and wholesale, administrative parts of government, and financial services
all generate significant amounts of text and numerical data including customer data,
transaction information, and mathematical modeling and simulations. Sectors such as
manufacturing, health care, media and communications are responsible for higher
percentages of multimedia data. And image data in the form of X-rays, CT, and other
scans dominate data storage volumes in health care. In terms of geographic spread of
big data, North America and Europe have 70% of the global total currently. Thanks to
cloud computing, data generated in one region can be stored in another country’s
datacenter. As a result, countries with significant cloud and hosting provider offerings
tend to have high storage of data.

Q3: What are the sources of big data?


Ans:
 Machine data consists of information generated from industrial equipment, real-time data
from sensors that track parts and monitor machinery (often also called the Internet of
Things), and even web logs that track user behavior online.
 At arcplan client CERN, the largest particle physics research center in the world, the
Large Hadron Collider (LHC) generates 40 terabytes of data every second during
experiments.
 Regarding transactional data, large retailers and even B2B companies can generate
multitudes of data on a regular basis considering that their transactions consist of one
or many items, product IDs, prices, payment information, manufacturer and distributor
data, and much more.
 Major retailers like Amazon.com, which posted $10B in sales in Q3 2011, and restaurants
like US pizza chain Domino’s, which serves over 1 million customers per day, are
generating petabytes of transactional big data. The thing to note is that big data can
resemble traditional structured data or unstructured, high-frequency information.
 These are some of the many sources of big data:
 Sensors/meters and activity records from electronic devices: This kind of
information is produced in real time, and the number and periodicity of observations will
vary. Sometimes observations depend on a lapse of time, sometimes on the occurrence of an
event (for example, a car passing through the field of view of a camera), and sometimes on
manual manipulation (which, strictly speaking, is itself the occurrence of an event). The
quality of this kind of source depends mostly on the capacity of the sensor to take accurate
measurements in the way that is expected.
 Social interactions: This is data produced by human interactions through a network,
such as the Internet; the most common example is the data produced in social networks. This
kind of data has both qualitative and quantitative aspects that are of interest to measure.
Quantitative aspects are easier to measure than qualitative ones: the former involve counting
observations grouped by geographical or temporal characteristics, while the quality of the
latter mostly relies on the accuracy of the algorithms used to extract meaning from content
that is commonly found as unstructured text written in natural language. Examples of analysis
made from this data are sentiment analysis, trending-topic analysis, etc.
 Business transactions: Data produced as a result of business activities can be recorded
in structured or unstructured databases. When it is recorded in structured databases, the most
common problems in analyzing that information and producing statistical indicators are the
sheer volume of information and the pace at which it is produced; thousands of records can be
produced per second when big companies such as supermarket chains record their sales. This
kind of data is not always produced in formats that can be stored directly in relational
databases. An electronic invoice is an example: it has more or less a structure, but to put
the data it contains into a relational database we need to apply some process to distribute
that data across different tables (in order to normalize the data in accordance with
relational database theory), and it may not be in plain text (it could be a picture, a PDF,
an Excel record, etc.). One problem is that this process takes time while, as noted above,
the data may be produced too fast, so we need different strategies to use the data:
processing it as it is without putting it into a relational database, discarding some
observations (by what criteria?), using parallel processing, and so on. The quality of
information produced from business transactions is tightly related to the capacity to get
representative observations and to process them.
 Electronic files: These refer to unstructured documents, statically or dynamically
produced, which are stored or published as electronic files: web pages, videos, audio, PDF
files, etc. They can have contents of special interest, but those contents are difficult to
extract; different techniques can be used, such as text mining, pattern recognition, and so
on. The quality of our measurements will mostly rely on the capacity to extract and
correctly interpret all the representative information from those documents.
 Broadcasts: This mainly refers to video and audio produced in real time. Getting
statistical data from the contents of this kind of electronic data is, for now, complex
and demands considerable computational and communications power. Once the problems of
converting "digital-analog" contents into "digital-data" contents are solved, we face
processing complications similar to those found with social interactions.


Q4: What are the 3 Vs of big data?


Ans:
1. VOLUME
 Within the Social Media space for example, Volume refers to the amount of data
generated through websites, portals and online applications. Especially for B2C
companies, Volume encompasses the available data that are out there and need to be
assessed for relevance. Consider the following: Facebook has 2 billion users, YouTube 1
billion users, Twitter 350 million users, and Instagram 700 million users. Every day, these
users contribute billions of images, posts, videos, tweets, etc. You can now imagine
the insanely large amount, or Volume, of data that is generated every minute and every
hour.

2. VELOCITY
 With Velocity we refer to the speed with which data are being generated. Staying with
our social media example, every day 900 million photos are uploaded on Facebook, 500
million tweets are posted on Twitter, 0.4 million hours of video are uploaded on Youtube
and 3.5 billion searches are performed on Google. This is like a nuclear data explosion.
Big data technologies help a company to absorb this explosion, accept the incoming flow of
data, and at the same time process it fast so that it does not create bottlenecks.

3. VARIETY
 Variety in Big Data refers to all the structured and unstructured data that has the
possibility of getting generated either by humans or by machines. The most commonly
added data are structured: texts, tweets, pictures, and videos. However, unstructured data
like emails, voicemails, hand-written text, ECG readings, audio recordings, etc., are also
important elements under Variety. Variety is all about the ability to classify the incoming
data into various categories.

Q5: Explain Usage of Big Data


Ans:
 Big data is a completely new source of data; it’s data that is generated when you post
on a blog, like a product, or travel. Previously, such minutely available information was
not captured. Now it is, and organizations that embrace such data can pursue innovations,
improve their agility, and increase their profitability. Big data can create value for any
organization in a variety of ways.
 As listed in the MGI report, this can be broadly categorized into five ways of usage of big
data.
A. Visibility: Accessibility to data in a timely fashion to relevant stakeholders generates
a tremendous amount of value. Let’s understand this with an example. Consider a
manufacturing company that has R&D, engineering, and manufacturing departments
dispersed geographically. If the data is accessible across all these departments and can
be readily integrated, it can not only reduce the search and processing time but will also
help in improving the product quality according to the present needs.
B. Discover and Analyze Information: Most of the value of big data comes when
the data collected from outside sources can be merged with the organization's internal
data. Organizations are capturing detailed data on inventories, employees, and customers.
Using all of this data, they can discover and analyze new information and patterns; as a
result, this information and knowledge can be used to improve processes and
performance.
C. Segmentation and Customizations: Big data enables organizations to create tailor-
made products and services to meet specific segment needs. This can also be used in the
social sector to accurately segment populations and target benefit schemes for specific
needs. Segmentation of customers based on various parameters can aid in targeted
marketing campaigns and tailoring of products to suit the needs of customers.
D. Aiding Decision Making: Big data can substantially minimize risks, improve decision
making, and uncover valuable insights. Automated fraud alert systems in credit card
processing and automatic fine-tuning of inventory are examples of systems that aid or
automate decision-making based on big data analytics.
E. Innovation: Big data enables innovation of new ideas in the form of products and
services. It enables innovation in the existing ones in order to reach out to large segments
of people. Using data gathered for actual products, the manufacturers can not only
innovate to create the next generation product but they can also innovate sales offerings.
As an example, real-time data from machines and vehicles can be analyzed to provide
insight into maintenance schedules; wear and tear on machines can be monitored to make
more resilient machines; fuel consumption monitoring can lead to higher efficiency
engines. Real-time traffic information is already making life easier for commuters by
providing them options to take alternate routes.

Q6: What are big data challenges?


Ans:
Big data also poses some challenges:
A. Policies and Procedures: As more and more data is gathered, digitized, and
moved around the globe, the policy and compliance issues become increasingly
important. Data privacy, security, intellectual property, and protection are of
immense importance to organizations. Compliance with various statutory and
legal requirements poses a challenge in data handling. Issues around ownership
and liabilities around data are important legal aspects that need to be dealt with
in cases of big data. Moreover, many big data projects leverage the scalability
features of public cloud computing providers. This poses a challenge for
compliance. Policy questions on who owns the data, what is defined as fair use
of data, and who is responsible for accuracy and confidentiality of data also need
to be answered.
B. Access to Data: Accessing data for consumption is a challenge for big data
projects. Some of the data may be available to third parties, and gaining access
can be a legal, contractual challenge. Data about a product or service is available
on Facebook, Twitter feeds, reviews, and blogs, so how does the product owner
access this data from various sources owned by various providers? Likewise,
contractual clauses and economic incentives for accessing big data need to be
tied in to enable the availability of data by the consumer.
C. Technology and Techniques: New tools and technologies built specifically to
address the needs of big data must be leveraged, rather than trying to address
the aforementioned issues through legacy systems. The inadequacy of legacy
systems to deal with big data on the one hand, and the lack of experienced resources
in the newer technologies on the other, is a challenge that any big data project has to manage.

Q7: Explain legacy systems and big data


Ans:
 Structure of Big Data: Legacy systems are designed to work with structured data
where tables with columns are defined. The format of the data held in the columns is
also known. However, big data is data with many structures. It’s basically unstructured
data such as images, videos, logs, etc. Since big data can be unstructured, legacy
systems created to perform fast queries and analysis through techniques like indexing
based on particular data types held in various columns cannot be used to hold or process
big data.
 Data Storage: Legacy systems use big servers and NAS and SAN systems to store
the data. As the data increases, the server size and the backend storage size have to be
increased. Traditional legacy systems typically work in a scaleup model where more and
more compute, memory, and storage needs to be added to a server to meet the
increased data needs. Hence the processing time increases exponentially, which defeats
the other important requirement of big data, which is velocity.
 Data Processing: The algorithms in legacy systems are designed to work with structured
data such as strings and integers. They are also limited by the size of data. Thus, legacy
systems are not capable of handling the processing of unstructured data, huge volumes
of such data, and the speed at which the processing needs to be performed. As a result,
to capture value from big data, we need to deploy newer technologies in the field of
storing, computing, and retrieving, and we need new techniques for analyzing the data.

Q8: Explain Big Data Technologies


Ans:
 Big data technologies are important in providing more accurate analysis, which may lead
to more concrete decision-making resulting in greater operational efficiencies, cost
reductions, and reduced risks for the business. To harness the power of big data, you
would require an infrastructure that can manage and process huge volumes of structured
and unstructured data in real time and can protect data privacy and security.
 There are various technologies in the market from different vendors including Amazon,
IBM, Microsoft, etc., to handle big data. While looking into the technologies that handle
big data, we examine the following two classes of technology:

Operational Big Data


 This includes systems like MongoDB that provide operational capabilities for real-time,
interactive workloads where data is primarily captured and stored.
 NoSQL Big Data systems are designed to take advantage of new cloud computing
architectures that have emerged over the past decade to allow massive computations to
be run inexpensively and efficiently. This makes operational big data workloads much
easier to manage, cheaper, and faster to implement.
 Some NoSQL systems can provide insights into patterns and trends based on real-time
data with minimal coding and without the need for data scientists and additional
infrastructure.


Analytical Big Data


 This includes systems like Massively Parallel Processing (MPP) database systems and
MapReduce that provide analytical capabilities for retrospective and complex analysis
that may touch most or all of the data.
 MapReduce provides a new method of analyzing data that is complementary to the
capabilities provided by SQL, and a system based on MapReduce can be scaled up
from a single server to thousands of high- and low-end machines.
 These two classes of technology are complementary and frequently deployed together.
Operational vs. Analytical

                  Operational           Analytical
Latency           1 ms - 100 ms         1 min - 100 min
Concurrency       1,000 - 100,000       1 - 10
Access Pattern    Writes and Reads      Reads
Queries           Selective             Unselective
Data Scope        Operational           Retrospective
End User          Customer              Data Scientist
Technology        NoSQL                 MapReduce, MPP Database
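
As a toy illustration of the MapReduce row in the table above, here is a plain-Python sketch
(not any particular product's API): records are mapped to key-value pairs, the pairs are
grouped by key, and a reduce step aggregates each group. Real systems run the same two phases
in parallel across many low-end machines.

from collections import defaultdict

logs = ["error disk full", "info job done", "error timeout", "info job done"]

# Map phase: emit a (key, value) pair for each record, keyed on the log level.
mapped = [(line.split()[0], 1) for line in logs]

# Shuffle phase: group the emitted values by key.
groups = defaultdict(list)
for key, value in mapped:
    groups[key].append(value)

# Reduce phase: aggregate each group independently (which is what makes it parallelizable).
counts = {key: sum(values) for key, values in groups.items()}
print(counts)   # {'error': 2, 'info': 2}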
 You have seen what big data is. In this section we briefly look at which technologies can
handle this humongous source of data. The technologies in question need to efficiently
accept and process different types of data. The recent technology advancements that enable
organizations to make the most of their big data are the following:
1. New storage and processing technologies designed specifically for large unstructured data
2. Parallel processing
3. Clustering
4. Large grid environments
5. High connectivity and high throughput
6. Cloud computing and scale-out architectures

There are a growing number of technologies that make use of these technological
advancements. In the following sections, we discuss MongoDB, one of the technologies that
can be used to store and process big data.

Q9: What is SQL?


Ans:
The idea of RDBMS was born from E. F. Codd's 1970 paper titled "A Relational Model of
Data for Large Shared Data Banks." The language used to query RDBMS systems is SQL
(Structured Query Language).
RDBMS systems are well suited for structured data held in columns and rows, which can be
queried using SQL. The RDBMS systems are based on the concept of ACID transactions. ACID
stands for Atomic, Consistent, Isolated, and Durable, where
• Atomic: implies either all changes of a transaction are applied completely or not applied
at all.
• Consistent: means the data is in a consistent state after the transaction is applied. This
means that after a transaction is committed, queries fetching a particular piece of data will
see the same result.
• Isolated: means the transactions that are applied to the same set of data are independent
of each other. Thus, one transaction will not interfere with another transaction.
• Durable: means the changes are permanent in the system and will not be lost in case of
any failures.
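
As a small illustration of atomicity, here is a sketch using Python's built-in sqlite3 module
(the accounts table and the amounts are made up for this example; any RDBMS behaves the same
way): either both updates of a transfer are committed, or neither is.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES ('A', 100), ('B', 0)")
conn.commit()

def transfer(conn, amount, fail=False):
    try:
        conn.execute("UPDATE accounts SET balance = balance - ? WHERE name = 'A'", (amount,))
        if fail:
            raise RuntimeError("simulated crash between the two updates")
        conn.execute("UPDATE accounts SET balance = balance + ? WHERE name = 'B'", (amount,))
        conn.commit()      # both updates become visible together
    except Exception:
        conn.rollback()    # atomicity: the partial debit of 'A' is undone

transfer(conn, 50, fail=True)
print(dict(conn.execute("SELECT name, balance FROM accounts")))
# Prints {'A': 100, 'B': 0}: the failed transaction left the data untouched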

Challenges of RDBMS
 RDBMS assumes a well-defined structure of data and assumes that the data is largely uniform.
 It needs the schema of your application and its properties (columns, types, etc.) to be defined up-front before
building the application. This does not match well with the agile development approaches for highly dynamic
applications.
 As the data starts to grow larger, you have to scale your database vertically, i.e. adding more capacity to the
existing servers.

Q10: What is NoSQL?


Ans:
 NoSQL is a term used to refer to non-relational databases. Thus, it encompasses the
majority of data stores that are not based on conventional RDBMS principles
and are used for handling large data sets on an Internet scale.
 Big data, as discussed earlier, is posing challenges to the traditional
ways of storing and processing data, such as the RDBMS systems. As a result, we see
the rise of NoSQL databases, which are designed to process this huge amount and
variety of data within time and cost constraints.
 Thus NoSQL databases evolved from the need to handle big data; traditional RDBMS
technologies could not provide adequate solutions.
 Some examples of big data use cases that are a good fit for NoSQL databases are the
following:

• Social Network Graph: Who is connected to whom? Whose post should be visible on the
user's wall or homepage on a social network site?
• Search and Retrieve: Search all relevant pages with a particular keyword ranked by the
number of times a keyword appears on a page.
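
As a rough sketch of the search-and-retrieve use case above (plain Python, with made-up page
contents; real systems use inverted indexes), pages can be ranked by how many times a keyword
appears on each page.

pages = {
    "page1": "nosql databases scale out and nosql handles big data",
    "page2": "rdbms uses sql and acid transactions",
    "page3": "nosql nosql nosql is an umbrella term",
}

def rank_by_keyword(pages, keyword):
    # Count keyword occurrences per page and keep only pages that mention it.
    scores = {name: text.lower().split().count(keyword.lower())
              for name, text in pages.items()}
    return sorted(((name, score) for name, score in scores.items() if score > 0),
                  key=lambda item: item[1], reverse=True)

print(rank_by_keyword(pages, "NoSQL"))   # [('page3', 3), ('page1', 2)]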
 Definition: NoSQL doesn't have a formal definition. It represents a form of
persistence/data storage mechanism that is fundamentally different from RDBMS. But
if pushed to define NoSQL, here it is: NoSQL is an umbrella term for data stores that
don’t follow the RDBMS principles.

A Brief History of NoSQL:


 In 1998, Carlo Strozzi coined the term NoSQL. He used this term to identify his
database because the database didn’t have a SQL interface. The term resurfaced in
early 2009 when Eric Evans (a Rackspace employee) used this term in an event on
open source distributed databases to refer to distributed databases that were non-
relational and did not follow the ACID features of relational databases.


Q11: Explain CAP Theorem (Brewer's Theorem)


Ans:
Eric Brewer outlined the CAP theorem in 2000. This is an important concept that needs to be
well understood by developers and architects dealing with distributed databases. The theorem
states that when designing an application in a distributed environment there are three basic
requirements that exist, namely consistency, availability, and partition tolerance.
• Consistency: means that the data remains consistent after any operation is performed
that changes the data, and that all users or clients accessing the application see the same
updated data.
• Availability: means that the system is always available.
• Partition tolerance: means that the system will continue to function even if it is
partitioned into groups of servers that are not able to communicate with one another.
The CAP theorem states that at any point in time a distributed system can fulfil only two of
the above three guarantees (see Figure 2-1).

Figure 2-1. CAP Theorem


The BASE
Eric Brewer coined the BASE acronym. BASE can be explained as
• Basically available: means the system will be available in terms of the CAP theorem.
• Soft state: indicates that even if no input is provided to the system, the state will change
over time. This is in accordance with eventual consistency.
• Eventual consistency: means the system will attain consistency in the long run, provided
no input is sent to the system during that time.
Hence BASE is in contrast with RDBMS ACID transactions.
You have seen that NoSQL databases are eventually consistent but the eventual consistency
implementation may vary across different NoSQL databases.
NRW is the notation used to describe how the eventual consistency model is implemented
across NoSQL databases, where
• N: is the number of data copies that the database maintains.
• R: is the number of copies that an application needs to read before returning a read
request's output.
• W: is the number of data copies that need to be written to before a write operation is
marked as completed successfully.
Using these notation configurations, the databases implement the model of eventual
consistency.
Consistency can be implemented at both read and write operation levels.
Write Operations
 N=W implies that the write operation will update all data copies before returning the
control to the client and marking the write operation as successful. This is similar to
how the traditional RDBMS databases work when implementing synchronous
replication. This setting will slow down the write performance.
 If write performance is a concern, which means you want the writes to be happening
fast, you can set W=1, R=N. This implies that the write will just update any one copy
and mark the write as successful, but whenever the user issues a read request, it will
read all the copies to return the result. If any of the copies is not up to date, it will be
updated first, and only then will the read be successful. This
implementation will slow down the read performance.
 Hence most NoSQL implementations use N>W>1. This implies that more than one
node needs to be updated successfully; however, not all nodes need to be updated at
the same time.

Read Operations
 If R is set to 1, the read operation will read any data copy, which can be outdated. If
R>1, more than one copy is read, and the most recent value among them is returned. However,
this can slow down the read operation.
 Using N<W+R always ensures that a read operation retrieves the latest value. This is
because the sum of written copies and read copies is always greater than the actual
number of copies, ensuring that at least one read copy has the latest version.
This is known as quorum assembly.
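
The trade-offs above can be checked with a tiny sketch (illustrative Python only, not tied to
any database): the quorum condition R + W > N, i.e. N < R + W, is what guarantees that a read
overlaps the latest write.

def describe(n, r, w):
    quorum = r + w > n   # at least one read copy is guaranteed to hold the latest write
    return "N=%d R=%d W=%d: latest value guaranteed on read = %s" % (n, r, w, quorum)

print(describe(3, 1, 3))   # N=W: slow synchronous-style writes, any single copy is fresh
print(describe(3, 3, 1))   # W=1, R=N: fast writes, but every read consults all copies
print(describe(3, 2, 2))   # common quorum setting: balanced read/write cost
print(describe(3, 1, 1))   # fastest, but reads may return stale data (eventual consistency)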

Table 2-1. ACID vs. BASE

ACID            BASE
Atomicity       Basically Available
Consistency     Soft State
Isolation       Eventually Consistent
Durability

Q12: Explain NoSQL Advantages and Disadvantages


Ans:
Advantages of NoSQL
Let’s talk about the advantages of NoSQL databases.
• High scalability: The RDBMS scale-up approach fails when transaction rates and fast-
response requirements increase. In contrast, the new generation of NoSQL databases
is designed to scale out (i.e., to expand horizontally using low-end commodity servers).
• Manageability and administration: NoSQL databases are designed mostly to work with
automated repairs, distributed data, and simpler data models, leading to lower manageability
and administration overhead.
• Low cost: NoSQL databases are typically designed to work with a cluster of cheap
commodity servers, enabling users to store and process more data at a low cost.
• Flexible data models: NoSQL databases have a very flexible data model, enabling them to
work with any type of data; they don't comply with the rigid RDBMS data models. As a result,
any application changes that involve updating the database schema can be easily
implemented.
Disadvantages of NoSQL
In addition to the above-mentioned advantages, there are many impediments that you need
to be aware of before you start developing applications using these platforms.
• Maturity: Most NoSQL databases are pre-production versions with key features that are
still to be implemented. Thus, when deciding on a NoSQL database, you should analyze the
product properly to ensure the features are fully implemented and not still on the to-do list.
• Support: Support is one limitation that you need to consider. Most NoSQL databases are
from start-ups that were open sourced. As a result, support is very minimal compared to that
of enterprise software companies and may not have global reach or support resources.
• Limited query capabilities: Since NoSQL databases are generally developed to meet the
scaling requirements of web-scale applications, they provide limited querying capabilities.
A simple querying requirement may involve significant programming expertise.
• Administration: Although NoSQL is designed to provide a no-admin solution, it still
requires skill and effort to install and maintain the solution.
• Expertise: Since NoSQL is an evolving area, expertise in the technology is limited in the
developer and administrator community.
Although NoSQL is becoming an important part of the database landscape, you need to be
aware of the limitations and advantages of the products to make the correct choice of the
NoSQL database platform.

Q13: Explain SQL vs. NoSQL Databases


Ans:
 Now you know the details regarding NoSQL databases. Although NoSQL is increasingly
getting adopted as a database solution, it's not here to replace SQL or RDBMS
databases. In this section, you will look at the differences between SQL and NoSQL
databases.
 Let’s do a quick recap of the RDBMS system. RDBMS systems have prevailed for about
30 years, and even now they are the default choice of the solution architect for data
storage for an application. If we list a few of the good points of the RDBMS system,
first and foremost is the use of SQL, which is a rich declarative query language
used for data processing and is well understood by users. In addition, the RDBMS
system offers ACID support for transactions, which is a must in many sectors, such as
banking applications.
 However, the biggest drawbacks of the RDBMS system are its difficulty in handling
schema changes and scaling issues as data increases. As data increases, the
read/write performance degrades. You face scaling issues with RDBMS systems
because they are mostly designed to scale up and not scale out.
 In contrast to the SQL RDBMS databases, NoSQL promotes the data stores, which
break away from the RDBMS paradigm.

Let’s talk abouttechnical scenarios and how they compare in RDBMS vs. NoSQL :
• Schema flexibility: This is a must for easy future enhancements and integration with
external applications (outbound or inbound). RDBMS are quite inflexible in their design.
Adding a column is an absolute no-no, especially if the table has some data. The reasons
range from default values and indexes to performance implications. More often than not, you
end up creating new tables, and increasing the complexity by introducing relationships across
tables.
• Complex queries: The traditional designing of the tables leads to developers writing
complex JOIN queries, which are not only difficult to implement and maintain but also take
substantial database resources to execute.
• Data update: Updating data across tables is probably one of the more complex scenarios,
especially if they are a part of the transaction. Note that keeping the transaction open for a
long duration hampers the performance. You also have to plan for propagating the updates to
multiple nodes across the system. And if the system does not support multiple masters or
writing to multiple nodes simultaneously, there is a risk of node failure and the entire
application moving to read-only mode.
• Scalability: Often the only scalability that may be required is for read operations.
However, several factors impact this speed as operations grow. Some of the key questions to
ask are:
• What is the time taken to synchronize the data across physical database instances?
• What is the time taken to synchronize the data across datacenters?
• What is the bandwidth requirement to synchronize data?
• Is the data exchanged optimized?
• What is the latency when any update is synchronized across servers? Typically, the records
will be locked during an update.
NoSQL-based solutions provide answers to most of the challenges listed above.
Let’s now see what NoSQL has to offer against each technical question mentioned above.
• Schema flexibility: Column-oriented databases store data as columns as opposed to rows
in RDBMS. This allows the flexibility of adding one or more columns as required, on the fly.
Similarly, document stores that allow storing semi-structured data are also good options.
• Complex queries: NoSQL databases do not have support for relationships or foreign keys.
There are no complex queries. There are no JOIN statements.
Is that a drawback? How does one query across tables?
It is a functional drawback, definitely. To query across tables, multiple queries must be
executed. A database is a shared resource, used across application servers, and must be
released from use as quickly as possible. The options involve a combination of simplifying the
queries to be executed, caching data, and performing complex operations in the application
tier. A lot of databases provide built-in entity-level caching. This means that when a record
is accessed, it may be automatically cached transparently by the database. The cache may be
an in-memory distributed cache for performance and scale.
• Data update: Data updating and synchronization across physical instances are difficult
engineering problems to solve. Synchronization across nodes within a datacenter has a
different set of requirements compared to synchronizing across multiple datacenters. One
would want the latency within a couple of milliseconds or tens of milliseconds at the best.
NoSQL solutions offer great synchronization options.
MongoDB, for example, allows concurrent updates across nodes, synchronization with
conflict resolution, and eventual consistency across the datacenters within an acceptable
time that would run into a few milliseconds. As such, MongoDB has no concept of isolation. Note
that now because the complexity of managing the transaction may be moved out of the
database, the application will have to do some hard work.
A plethora of databases offer multiversion concurrency control (MCC) to achieve transactional
consistency.
Well, as Dan Pritchett (www.addsimplicity.com/), Technical Fellow at eBay, puts it, eBay.com
does not use transactions. Note that PayPal does use transactions.
• Scalability: NoSQL solutions provide greater scalability for obvious reasons. A lot of the
complexity that is required for transaction-oriented RDBMS does not exist in non-ACID-
compliant NoSQL databases. Interestingly, since NoSQL does not provide cross-table
references and there are no JOIN queries possible, and because you can’t write a single query
to collate data across multiple tables, one simple and logical solution is to—at times—
duplicate the data across tables. In some scenarios, embedding the information within the
primary entity—especially in one-to-one mapping cases—may be a great idea.
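
As a rough sketch of the embedding idea mentioned above (assuming a locally running MongoDB
server and the pymongo driver; the database, collection, and field names are made up for
illustration), a one-to-one address is stored inside the primary entity, so no JOIN is needed:

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/")   # assumes a local mongod is running
db = client["shop"]

# Embed the one-to-one address directly in the customer document.
db.customers.insert_one({
    "name": "ABC",
    "phone": ["1111111", "222222"],
    "address": {"city": "Mumbai", "pin": "400080"},
})

customer = db.customers.find_one({"name": "ABC"})
print(customer["address"]["city"])   # a single read returns the whole aggregate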

Q14: Explain MongoDB Design Philosophy


Ans:
i)Speed, Scalability, and Agility
 The design team’s goal when designing MongoDB was to create a database that was
fast, massively scalable, and easy to use. To achieve speed and horizontal scalability in
a partitioned database, as explained in the CAP theorem, the consistency and
transactional support have to be compromised.
 Thus, per this theorem, MongoDB provides high availability, scalability, and partitioning
at the cost of consistency and transactional support. In practical terms, this means that
instead of tables and rows, MongoDB uses documents to make it flexible, scalable, and
fast.

ii) Non-Relational Approach


 Traditional RDBMS platforms provide scalability using a scale-up approach, which
requires a faster server to increase performance. The following issues in RDBMS
systems explain why MongoDB and other NoSQL databases are designed the way
they are:
 In order to scale out, the RDBMS database needs to link the data available in two or
more systems in order to report back the result. This is difficult to achieve in RDBMS
systems since they are designed to work when all the data is available for computation
together. Thus the data has to be available for processing at a single location.
 In case of multiple Active-Active servers, when both are getting updated from multiple
sources there is a challenge in determining which update is correct.
 When an application tries to read data from the second server, and the information has
been updated on the first server but has yet to be synchronized with the second
server, the information returned may be stale.
 The MongoDB team decided to take a non-relational approach to solving these
problems. As mentioned, MongoDB stores its data in BSON documents where all the
related data is placed together, which means everything is in one place. The queries in
MongoDB are based on keys in the document, so the documents can be spread across
multiple servers. Querying each server means it will check its own set of documents
and return the result. This enables linear scalability and improved performance.
 MongoDB has a primary-secondary replication where the primary accepts the write
requests. If the write performance needs to be improved, then sharding can be used;
this splits the data across multiple machines and enables these multiple machines to
update different parts of the datasets. Sharding is automatic in MongoDB; as more
machines are added, data is distributed automatically.

iii) JSON-Based Document Store


 MongoDB uses a JSON-based (JavaScript Object Notation) document store for the
data. JSON/BSON offers a schema-less model, which provides flexibility in terms of
database design. Unlike in RDBMSs, changes can be done to the schema seamlessly.


 This design also makes for high performance by grouping relevant data together
internally and making it easily searchable.
 A JSON document contains the actual data and is comparable to a row in SQL.
However, in contrast to RDBMS rows, documents can have dynamic schema. This
means documents within a collection can have different fields or structure, or common
fields can have different types of data.

A document contains data in the form of key-value pairs. Let's understand this with an example:
{
  "Name": "ABC",
  "Phone": ["1111111", "222222"],
  "Fax": null
}
 As mentioned, keys and values come in pairs. The value of a key in a document can be
left blank. In the above example, the document has three keys, namely "Name,"
"Phone," and "Fax." The "Fax" key has no value.
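
As a brief sketch of this dynamic schema (again assuming pymongo and a local MongoDB server;
the names are hypothetical), documents with different fields can be stored in the same
collection without any schema change:

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/")
people = client["demo"]["people"]

# Two documents with different fields live in the same collection.
people.insert_one({"Name": "ABC", "Phone": ["1111111", "222222"], "Fax": None})
people.insert_one({"Name": "XYZ", "Email": "xyz@example.com", "Age": 30})

for doc in people.find({}, {"_id": 0}):   # project away the generated _id for readability
    print(doc)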

iv) Performance vs. Features


 In order to make MongoDB high performance and fast, certain features commonly
available in RDBMS systems are not available in MongoDB. MongoDB is a document-
oriented DBMS where data is stored as documents. It does not support JOINs, and it
does not have fully generalized transactions. However, it does provide support for
secondary indexes, it enables users to query using query documents, and it provides
support for atomic updates at a per document level. It provides a replica set, which is a
form of master-slave replication with automated failover, and it has built-in horizontal
scaling.

v)Running the Database Anywhere


 One of the main design decisions was the ability to run the database from anywhere,
which means it should be able to run on servers, VMs, or even on the cloud using the
pay-for-what-you-use service. The language used for implementing MongoDB is C++,
which enables MongoDB to achieve this goal. The 10gen site provides binaries for
different OS platforms, enabling MongoDB to run on almost any type of machine.

Q15: Explain SQL Comparison


Ans:
The following are the ways in which MongoDB is different from SQL.
 MongoDB uses documents for storing its data, which offer a flexible schema
(documents in same collection can have different fields). This enables the users to
store nested or multi-value fields such as arrays, hashes, etc. In contrast, RDBMS
systems offer a fixed schema where a column’s value should have a similar data type.
Also, it’s not possible to store arrays or nested values in a cell.
 MongoDB doesn't provide support for JOIN operations, as in SQL. However, it enables
the user to store all relevant data together in a single document, which largely avoids the
need for JOINs, and it has workarounds to overcome the remaining cases. We will discuss
this in more detail later.


 MongoDB doesn’t provide support for transactions in the same way as SQL. However, it
guarantees atomicity at the document level. Also, it uses an isolation operator to
isolate write operations that affect multiple documents, but it does not provide “all-or-
nothing” atomicity for multi-document write operations.
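
A short sketch of document-level atomicity (assuming pymongo and a hypothetical orders
collection): a single update_one call that modifies several fields of one document is applied
atomically, whereas the same change spread over several documents would not be all-or-nothing.

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/")
orders = client["demo"]["orders"]
orders.insert_one({"order_id": 1, "status": "new", "items_shipped": 0})

# Both field changes target one document, so readers never see one applied without the other.
orders.update_one(
    {"order_id": 1},
    {"$set": {"status": "shipped"}, "$inc": {"items_shipped": 1}},
)

print(orders.find_one({"order_id": 1}, {"_id": 0}))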
