
UNIT I

INTRODUCTION TO BIG DATA


Digital Data - Types of Digital Data; Characteristics of Big Data - Challenges with Big Data - Big Data Analytics; Terminologies Used in Big Data Environments - Big Data Analytics Project Life Cycle - Example Applications for Big Data; Top Analytics Tools - Big Data Technology Landscape - NOSQL and Hadoop
What is Big Data?
According to Gartner, the definition of Big Data is:
“Big data is high-volume, high-velocity, and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making.”
This definition clearly answers the question “What is Big Data?” Big Data refers to complex and large data sets that have to be processed and analyzed to uncover valuable information that can benefit businesses and organizations. However, there are certain basic tenets of Big Data that make it even simpler to explain what Big Data is:

 It refers to massive volumes of data that cannot be processed or analysed using conventional data processing techniques.

 The term is an all-comprehensive one, including the data itself, the data frameworks, along with the tools and techniques used to process and analyse the data.
The History of Big Data
Although the concept of big data itself is relatively new, the origins of large data sets go back
to the 1960s and '70s when the world of data was just getting started with the first data centers
and the development of the relational database. Around 2005, people began to realize just how
much data users generated through Facebook, YouTube, and other online services. Hadoop (an
open-source framework created specifically to store and analyze big data sets) was developed
that same year. NoSQL also began to gain popularity during this time.
The development of open-source frameworks, such as Hadoop (and more recently, Spark) was
essential for the growth of big data because they make big data easier to work with and cheaper
to store. In the years since then, the volume of big data has skyrocketed. Users are still
generating huge amounts of data—but it’s not just humans who are doing it.
With the advent of the Internet of Things (IoT), more objects and devices are connected to the
internet, gathering data on customer usage patterns and product performance. The emergence
of machine learning has produced still more data.
While big data has come far, its usefulness is only just beginning. Cloud computing has
expanded big data possibilities even further. The cloud offers truly elastic scalability, where
developers can simply spin up ad hoc clusters to test a subset of data.
BIG DATA ANALYTICS
Benefits of Big Data and Data Analytics

 Big data makes it possible to gain more complete answers because you have more information.
 More complete answers mean more confidence in the data, which means a completely different approach to tackling problems.
Types of Big Data
Now that we are on track with what is big data, let’s have a look at the types of big data:
a) Structured
Structured data is one of the types of big data. By structured data, we mean data that can be processed, stored, and retrieved in a fixed format. It refers to highly organized information that can be readily and seamlessly stored in, and accessed from, a database by simple search engine algorithms. For instance, the employee table in a company database is structured: the employee details, their job positions, their salaries, etc. are present in an organized manner.
b) Unstructured
Unstructured data refers to the data that lacks any specific form or structure whatsoever. This
makes it very difficult and time-consuming to process and analyze unstructured data. Email is
an example of unstructured data. Structured and unstructured are two important types of big
data.
c) Semi-structured
Semi-structured data is the third type of big data. It pertains to data containing both of the formats mentioned above, that is, structured and unstructured data. To be precise, it refers to data that, although it has not been classified under a particular repository (database), contains vital information or tags that segregate individual elements within the data. This brings us to the end of the types of data; a short illustrative sketch follows.
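As a minimal, illustrative sketch (plain Python with made-up employee records, not part of the original text), the snippet below contrasts the three types of data:

import json

# Structured: fixed format, every record has the same fields (like a database table row)
employees = [
    {"emp_id": 101, "name": "Asha", "position": "Analyst", "salary": 52000},
    {"emp_id": 102, "name": "Ravi", "position": "Engineer", "salary": 61000},
]

# Semi-structured: JSON records carry tags that segregate elements,
# but the fields are not fixed in advance (the second record has an extra field)
semi_structured = [
    '{"emp_id": 103, "name": "Meena", "skills": ["Python", "SQL"]}',
    '{"emp_id": 104, "name": "John", "skills": ["Spark"], "remote": true}',
]
records = [json.loads(r) for r in semi_structured]

# Unstructured: free text (e.g. the body of an email) with no schema at all
email_body = "Hi team, please review the quarterly sales figures before Friday."

print(employees[0]["salary"])      # structured: retrieved by a fixed field
print(records[1].get("remote"))    # semi-structured: field may or may not exist
print(len(email_body.split()))     # unstructured: only crude processing without further analysis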
Characteristics of Big Data
Back in 2001, Gartner analyst Doug Laney listed the 3 ‘V’s of Big Data: Variety, Velocity, and Volume. These three characteristics are, by themselves, enough to describe what big data is. Let’s look at each of them in depth:
a) Variety
Variety of Big Data refers to structured, unstructured, and semi-structured data that is gathered
from multiple sources. While in the past, data could only be collected from spreadsheets and
databases, today data comes in an array of forms such as emails, PDFs, photos, videos, audio files, social media posts, and so much more. Variety is one of the important characteristics of big data.
b) Velocity
Velocity essentially refers to the speed at which data is being created in real time. In a broader perspective, it comprises the rate of change, the linking of incoming data sets arriving at varying speeds, and bursts of activity.
c) Volume
Volume is one of the characteristics of big data. We already know that Big Data indicates huge
‘volumes’ of data that is being generated on a daily basis from various sources like social media
platforms, business processes, machines, networks, human interactions, etc. Such large amounts of data are stored in data warehouses. This brings us to the end of the characteristics of big data.
The challenges in Big Data are the real implementation hurdles. They require immediate attention and need to be handled, because if they are not, the technology may fail, leading to unpleasant results. Big data challenges include storing and analyzing extremely large and fast-growing data.
Big Challenges with Big Data
This section explores some of the most pressing challenges associated with Big Data and offers potential solutions for overcoming them:
 Data Volume: Managing and Storing Massive Amounts of Data
 Data Variety: Handling Diverse Data Types
 Data Velocity: Processing Data in Real-Time
 Data Veracity: Ensuring Data Quality and Accuracy
 Data Security and Privacy: Protecting Sensitive Information
 Data Integration: Combining Data from Multiple Sources
 Data Analytics: Extracting Valuable Insights
 Data Governance: Establishing Policies and Standards
Data Volume: Managing and Storing Massive Amounts of Data


 Challenge: The most apparent challenge with Big Data is the sheer volume of data being
generated. Organizations are now dealing with petabytes or even exabytes of data,
making traditional storage solutions inadequate. This vast amount of data requires
advanced storage infrastructure, which can be costly and complex to maintain.
 Solution: Adopting scalable cloud storage solutions, such as Amazon S3, Google Cloud
Storage, or Microsoft Azure, can help manage large volumes of data. These platforms
offer flexible storage options that can grow with your data needs. Additionally,
implementing data compression and deduplication techniques can reduce storage costs
and optimize the use of available storage space.
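As a rough illustration of the compression and deduplication idea mentioned above, the following sketch (standard-library Python only, with an in-memory dict standing in for an object store such as Amazon S3) stores identical blocks only once and compresses them before storage:

import gzip
import hashlib

def store_block(block: bytes, store: dict) -> str:
    """Deduplicate by content hash, then compress before 'storing' (here: an in-memory dict)."""
    digest = hashlib.sha256(block).hexdigest()
    if digest not in store:                 # identical blocks are stored only once
        store[digest] = gzip.compress(block)
    return digest

store = {}
blocks = [b"sensor reading 42" * 1000, b"sensor reading 42" * 1000, b"sensor reading 99" * 1000]
keys = [store_block(b, store) for b in blocks]

raw_size = sum(len(b) for b in blocks)
stored_size = sum(len(v) for v in store.values())
print(f"raw: {raw_size} bytes, stored after dedup + gzip: {stored_size} bytes")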
Data Variety: Handling Diverse Data Types
 Challenge: Big Data encompasses a wide variety of data types, including structured data
(e.g., databases), semi-structured data (e.g., XML, JSON), and unstructured data (e.g.,
text, images, videos). The diversity of data types can make it difficult to integrate,
analyze, and extract meaningful insights.
 Solution: To address the challenge of data variety, organizations can employ data
integration platforms and tools like Apache Nifi, Talend, or Informatica. These tools help
in consolidating disparate data sources into a unified data model. Moreover, adopting
schema-on-read approaches, as opposed to traditional schema-on-write, allows for more
flexibility in handling diverse data types.
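A minimal sketch of the schema-on-read idea, in plain Python with hypothetical JSON records: the raw data is landed as-is, and a "schema" (the set of fields a particular analysis cares about) is applied only when the data is read:

import json

# Raw records landed as-is (schema-on-read: no fixed schema enforced at write time)
raw_lines = [
    '{"type": "email", "from": "a@example.com", "subject": "Invoice"}',
    '{"type": "photo", "file": "img_001.jpg", "resolution": [1920, 1080]}',
    '{"type": "post", "user": "ravi", "text": "Great product!", "likes": 42}',
]

def read_with_schema(lines, wanted_fields):
    """Apply a schema only at read time: keep just the fields this analysis cares about."""
    for line in lines:
        record = json.loads(line)
        yield {field: record.get(field) for field in wanted_fields}

# A 'schema' over the same raw data, chosen at read time
for row in read_with_schema(raw_lines, ["type", "user", "likes"]):
    print(row)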
Data Velocity: Processing Data in Real-Time
 Challenge: The speed at which data is generated and needs to be processed is another
significant challenge. For instance, IoT devices, social media platforms, and financial
markets produce data streams that require real-time or near-real-time processing. Delays
in processing can lead to missed opportunities and inefficiencies.
 Solution: To handle high-velocity data, organizations can implement real-time data
processing frameworks such as Apache Kafka, Apache Flink, or Apache Storm. These
frameworks are designed to handle high-throughput, low-latency data processing,
enabling businesses to react to events as they happen. Additionally, leveraging edge
computing can help process data closer to its source, reducing latency and improving real-
time decision-making.
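As a hedged sketch of real-time ingestion, the snippet below publishes click events using the third-party kafka-python package; the broker address and the "clickstream" topic are illustrative assumptions, not part of the original text:

import json
from kafka import KafkaProducer  # third-party 'kafka-python' package (assumed installed)

# Assumes a Kafka broker is reachable at localhost:9092 and a "clickstream" topic exists
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda event: json.dumps(event).encode("utf-8"),
)

# Each user click is published as soon as it happens, instead of being batched for later
event = {"user_id": 7, "page": "/checkout", "ts": "2024-01-01T10:00:00Z"}
producer.send("clickstream", value=event)
producer.flush()   # block until the event has actually been handed to the broker

A stream-processing framework (Flink, Storm, Spark Streaming) would then consume such a topic and compute results continuously.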
Data Veracity: Ensuring Data Quality and Accuracy
 Challenge: With Big Data, ensuring the quality, accuracy, and reliability of data—
referred to as data veracity—becomes increasingly difficult. Inaccurate or low-quality
data can lead to misleading insights and poor decision-making. Data veracity issues can
arise from various sources, including data entry errors, inconsistencies, and incomplete
data.
 Solution: Implementing robust data governance frameworks is crucial for maintaining
data veracity. This includes establishing data quality standards, performing regular data
audits, and employing data cleansing techniques. Tools like Trifacta, Talend Data
Quality, and Apache Griffin can help automate and streamline data quality management
processes.
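A small, illustrative data-quality check using pandas (the orders table and its rules are hypothetical) shows how simple validation rules can flag duplicates, nulls, and out-of-range values before analysis:

import pandas as pd

# Hypothetical raw feed with typical veracity problems: duplicates, nulls, out-of-range values
orders = pd.DataFrame({
    "order_id": [1, 2, 2, 3, 4],
    "amount":   [250.0, -40.0, -40.0, None, 125.5],
    "country":  ["IN", "IN", "IN", "US", None],
})

report = {
    "duplicate_rows":   int(orders.duplicated().sum()),
    "null_amounts":     int(orders["amount"].isna().sum()),
    "negative_amounts": int((orders["amount"] < 0).sum()),
}
print("data quality report:", report)

# Simple cleansing rules: drop duplicates, then drop rows failing basic validity checks
clean = (
    orders.drop_duplicates()
          .dropna(subset=["amount", "country"])
          .query("amount >= 0")
)
print(clean)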
Data Security and Privacy: Protecting Sensitive Information
 Challenge: As organizations collect and store more data, they face increasing risks
related to data security and privacy. High-profile data breaches and growing concerns
over data privacy regulations, such as GDPR and CCPA, highlight the importance of
safeguarding sensitive information.
 Solution: To mitigate security and privacy risks, organizations must adopt
comprehensive data protection strategies. This includes implementing encryption, access
controls, and regular security audits. Additionally, organizations should stay informed
about evolving data privacy regulations and ensure compliance by adopting privacy-by-
design principles in their data management processes.
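As a minimal sketch of encrypting sensitive fields at rest, the example below uses the third-party cryptography package's Fernet recipe; in practice the key would live in a key-management service rather than being generated in application code:

from cryptography.fernet import Fernet  # third-party 'cryptography' package (assumed installed)

# Generate a symmetric key; in a real system this would come from a key-management service
key = Fernet.generate_key()
cipher = Fernet(key)

# Encrypt a sensitive record before it is written to storage
record = b'{"customer_id": 501, "card_last4": "4242"}'
token = cipher.encrypt(record)

# Only holders of the key can recover the plaintext
assert cipher.decrypt(token) == record
print("stored ciphertext:", token[:40], "...")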
Data Integration: Combining Data from Multiple Sources
 Challenge: Integrating data from various sources, especially when dealing with legacy
systems, can be a daunting task. Data silos, where data is stored in separate systems
without easy access, further complicate the integration process, leading to inefficiencies
and incomplete analysis.
 Solution: Data integration platforms like Apache Camel, MuleSoft, and IBM DataStage
can help streamline the process of integrating data from multiple sources. Adopting a
microservices architecture can also facilitate easier data integration by breaking down
monolithic applications into smaller, more manageable services that can be integrated
more easily.
Data Analytics: Extracting Valuable Insights
 Challenge: The ultimate goal of Big Data is to derive actionable insights, but the
complexity of analyzing large, diverse datasets can be overwhelming. Traditional
analytical tools may struggle to scale, and the lack of skilled data scientists can further
hinder the ability to extract meaningful insights.
 Solution: Organizations should invest in advanced analytics platforms like Apache
Spark, Hadoop, or Google BigQuery, which are designed to handle large-scale data
processing and analysis. Additionally, fostering a culture of data literacy and providing
training for employees can help bridge the skills gap and empower teams to effectively
analyze Big Data.
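A short PySpark sketch of a large-scale aggregation (the sales.csv file and its region/amount columns are hypothetical) illustrates the kind of distributed analysis these platforms are designed for:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Start (or reuse) a Spark session; on a real cluster the master URL and config would differ
spark = SparkSession.builder.appName("sales-analytics").getOrCreate()

# Hypothetical sales file; in practice this could be a path on HDFS or cloud storage
sales = spark.read.csv("sales.csv", header=True, inferSchema=True)

# A typical aggregation: revenue per region, computed in parallel across the cluster
revenue = (
    sales.groupBy("region")
         .agg(F.sum("amount").alias("total_revenue"))
         .orderBy(F.desc("total_revenue"))
)
revenue.show()

spark.stop()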
Data Governance: Establishing Policies and Standards
 Challenge: As data becomes a critical asset, establishing effective data governance
becomes essential. However, many organizations struggle with creating and enforcing
policies and standards for data management, leading to issues with data consistency,
quality, and compliance.
 Solution: Implementing a formal data governance framework is key to overcoming this
challenge. This framework should define roles and responsibilities, establish data
stewardship programs, and enforce data management policies. Tools like Collibra,
Alation, and Informatica’s data governance suite can assist in creating and maintaining a
robust data governance strategy.

Big Data Analytics Terminologies Used in Big Data Environments


Big Data Analytics Project Life Cycle
The Big Data Analytics life cycle is divided into nine phases, namely:
1. Business Case/Problem Definition
2. Data Identification
3. Data Acquisition and filtration
4. Data Extraction
5. Data Munging (Validation and Cleaning)
6. Data Aggregation & Representation (Storage)
7. Exploratory Data Analysis
8. Data Visualization (Preparation for Modeling and Assessment)
9. Utilization of Analysis Results
Let us discuss each phase :
 Phase I Business Problem Definition –
In this stage, the team learns about the business domain, which presents the motivation and goals for carrying out the analysis. The problem is identified, and assumptions are made about how much potential gain the company will make after carrying out the analysis. Important activities in this step include framing the business problem as an analytics challenge that can be addressed in subsequent phases. This phase also helps decision-makers understand the business resources that will need to be utilized, thereby determining the underlying budget required to carry out the project.
Moreover, it can be determined whether the problem identified is a Big Data problem or not, based on the business requirements in the business case. To qualify as a Big Data problem, the business case should be directly related to one (or more) of the characteristics of volume, velocity, or variety.

 Phase II Data Identification –


Once the business case is identified, now it’s time to find the appropriate datasets to work
with. In this stage, analysis is done to see what other companies have done for a similar
case.
Depending on the business case and the scope of analysis of the project being addressed,
the sources of datasets can be either external or internal to the company. Internal datasets can include data collected from internal sources, such as feedback forms or existing software. External datasets, on the other hand, include datasets obtained from third-party providers.

 Phase III Data Acquisition and filtration –


Once the sources of data are identified, it is time to gather the data from those sources. This kind of data is mostly unstructured. It is then subjected to filtration, such as removal of corrupt or irrelevant data that is outside the scope of the analysis objective. Here, corrupt data means data that has missing records or incompatible data types.
After filtration, a copy of the filtered data is stored and compressed, as it can be of use in
the future, for some other analysis.

 Phase IV Data Extraction –


The data is now filtered, but some entries might still be incompatible with the analysis. To rectify this issue, a separate phase, known as the data extraction phase, is created. In this phase, the data that does not match the underlying scope of the analysis is extracted and transformed into a form that does.

 Phase V Data Munging –


As mentioned in Phase III, the data is collected from various sources, which results in the data being unstructured. The data might also contain values that violate constraints or are otherwise unsuitable, which can lead to false results. Hence there is a need to clean and validate the data.
This includes removing invalid data and establishing complex validation rules. There are many ways to validate and clean the data. For example, a dataset might contain a few rows with null entries. If a similar dataset is present, those entries are copied from that dataset; otherwise, those rows are dropped.
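A minimal pandas sketch of the rule described above (with hypothetical student marks): null entries are filled from a similar dataset when a matching record exists, otherwise the row is dropped:

import pandas as pd

# Main dataset with a few null entries (hypothetical student marks)
marks = pd.DataFrame({
    "roll_no": [1, 2, 3, 4],
    "marks":   [78.0, None, 91.0, None],
})

# A similar dataset that happens to contain some of the missing values
backup = pd.DataFrame({
    "roll_no": [2, 5],
    "marks":   [66.0, 80.0],
})

# Copy missing entries from the similar dataset if present, else drop the row
filled = (
    marks.set_index("roll_no")["marks"]
         .fillna(backup.set_index("roll_no")["marks"])   # fill nulls by matching roll_no
         .dropna()                                        # roll_no 4 has no backup value, so drop it
         .reset_index()
)
print(filled)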

 Phase VI Data Aggregation & Representation –


The data is now cleansed and validated against certain rules set by the enterprise. But the data might be spread across multiple datasets, and it is not advisable to work with multiple datasets. Hence, the datasets are joined together. For example, if there are two datasets, namely a Student Academic section and a Student Personal Details section, both can be joined together via a common field, i.e. the roll number.
This phase calls for intensive operations since the amount of data can be very large. Automation can be brought in so that these steps are executed without any human intervention.
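Continuing the example from the text, a pandas sketch of joining the Student Academic and Student Personal Details datasets on the common roll number field might look like this (the data is made up for illustration):

import pandas as pd

# Hypothetical datasets from the example in the text
academic = pd.DataFrame({
    "roll_no": [1, 2, 3],
    "grade":   ["A", "B", "A"],
})
personal = pd.DataFrame({
    "roll_no": [1, 2, 3],
    "name":    ["Asha", "Ravi", "Meena"],
    "city":    ["Chennai", "Pune", "Delhi"],
})

# Join the two datasets on the common field (roll number) to get one unified dataset
students = academic.merge(personal, on="roll_no", how="inner")
print(students)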

 Phase VII Exploratory Data Analysis –


Here comes the actual step: the analysis task. Depending on the nature of the big data problem, analysis is carried out. Data analysis can be classified as confirmatory analysis and exploratory analysis. In confirmatory analysis, an assumption about the cause of a phenomenon is made beforehand; this assumption is called the hypothesis. The data is then analyzed to prove or disprove the hypothesis. This kind of analysis provides definitive answers to specific questions and confirms whether an assumption was true or not.
In exploratory analysis, the data is explored to obtain information about why a phenomenon occurred. This type of analysis answers the “why” of a phenomenon. It does not provide definitive answers; instead, it enables the discovery of patterns.
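A brief exploratory-analysis sketch in pandas (hypothetical student data) shows typical steps: summary statistics, group comparisons, and a correlation that hints at why a phenomenon (such as low marks) might occur:

import pandas as pd

# Hypothetical student data to explore
students = pd.DataFrame({
    "roll_no":    [1, 2, 3, 4, 5, 6],
    "attendance": [95, 60, 88, 70, 98, 55],
    "marks":      [89, 52, 80, 65, 92, 48],
    "city":       ["Chennai", "Pune", "Delhi", "Pune", "Chennai", "Delhi"],
})

print(students[["attendance", "marks"]].describe())        # summary statistics
print(students.groupby("city")["marks"].mean())            # compare groups
print(students["attendance"].corr(students["marks"]))      # exploratory: attendance vs marks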

 Phase VIII Data Visualization –


Now we have the answers to some questions, using the information from the data in the datasets. But these answers are still in a form that cannot be presented to business users. Some form of representation is required to obtain value or draw conclusions from the analysis. Hence, various tools are used to visualize the data in graphic form, which can easily be interpreted by business users.
Visualization is said to influence the interpretation of the results. Moreover, it allows the users to discover answers to questions that are yet to be formulated.
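As a minimal visualization sketch, matplotlib can turn a (hypothetical) analysis output into a chart that business users can interpret at a glance:

import matplotlib.pyplot as plt

# Hypothetical output of the analysis phase: average marks per city
cities = ["Chennai", "Delhi", "Pune"]
avg_marks = [90.5, 64.0, 58.5]

# Present the result in a form business users can interpret at a glance
plt.bar(cities, avg_marks)
plt.title("Average marks per city")
plt.xlabel("City")
plt.ylabel("Average marks")
plt.tight_layout()
plt.savefig("avg_marks_per_city.png")   # or plt.show() for interactive use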

 Phase IX Utilization of analysis results –


The analysis is done and the results are visualized; now it is time for the business users to make decisions utilizing the results. The results can be used for optimization, to refine the business process. They can also be used as input to systems to enhance performance.
The block diagram of the life cycle is given below :
It is evident from the block diagram that Phase VII, i.e. Exploratory Data Analysis, is repeated successively until it is performed satisfactorily, with emphasis on error correction. Moreover, one can move back from Phase VIII to Phase VII if a satisfactory result is not achieved. In this manner, it is ensured that the data is analyzed properly.

Applications of Big Data

The term Big Data refers to large amounts of complex and unprocessed data. Nowadays, companies use Big Data to make business more informative and to take business decisions by enabling data scientists, analytical modelers, and other professionals to analyse large volumes of transactional data. Big data is the valuable and powerful fuel that drives the large IT industries of the 21st century, and it is a fast-spreading technology used in almost every business sector. In this section, we will discuss applications of Big Data.

Travel and Tourism

Travel and tourism are major users of Big Data. It enables us to forecast travel facility requirements at multiple locations, improve business through dynamic pricing, and much more.

Financial and banking sector

The financial and banking sectors use big data technology extensively. Big data analytics helps banks analyze customer behaviour on the basis of investment patterns, shopping trends, motivation to invest, and inputs obtained from personal or financial backgrounds.

Healthcare

Big data has started making a massive difference in the healthcare sector. With the help of predictive analytics, medical professionals and healthcare personnel can provide personalized healthcare services to individual patients.

Telecommunication and media

Telecommunications and the multimedia sector are major users of Big Data. Zettabytes of data are generated every day, and handling such large-scale data requires big data technologies.

Government and Military

The government and military also use big data technology at high rates. Consider the figures the government keeps on record; in the military, a fighter plane may need to process petabytes of data.

Government agencies use Big Data to run many agencies, manage utilities, deal with traffic jams, and tackle the effects of crimes like hacking and online fraud.

Aadhaar Card: The government has records of 1.21 billion citizens. This vast data is analyzed and stored to find things such as the number of youth in the country, and schemes are built to target the maximum population. Such data cannot be stored in a traditional database, so it is stored and analyzed using Big Data Analytics tools.

E-commerce

E-commerce is also an application of Big Data. Maintaining relationships with customers is essential for the e-commerce industry. E-commerce websites have many marketing ideas to retail merchandise to customers, manage transactions, and implement better strategies and innovative ideas to improve business with Big Data.

o Amazon: Amazon is a tremendous e-commerce website dealing with lots of traffic daily. But when there is a pre-announced sale on Amazon, traffic increases rapidly and may crash the website. So, to handle this type of traffic and data, it uses Big Data. Big Data helps in organizing and analyzing the data for further use.

Social Media

Social media is the largest data generator. Statistics show that around 500+ terabytes of fresh data are generated from social media daily, particularly on Facebook. The data mainly contains videos, photos, message exchanges, etc. A single activity on a social media site generates a lot of data, which is stored and processed when required. Because the data is stored in terabytes (TB), it takes a lot of time to process; Big Data technologies are the solution to this problem.

Top Analytics Tools


* R is a language for statistical computing and graphics. It is also used for big data analysis and provides a wide variety of statistical tests.
Features:
 Graphical facilities for data analysis and display, either on-screen or on hardcopy
* Apache Spark is a powerful open source big data analytics tool. It offers over 80 high-level
operators that make it easy to build parallel apps. It is used at a wide range of organizations
to process large datasets.
Features:
 It helps to run an application in a Hadoop cluster, up to 100 times faster in memory and ten times faster on disk

* Plotly is an analytics tool that lets users create charts and dashboards to share online.
Features:
 It lets you create eye-catching and informative graphics
 It provides fine-grained information on data provenance
 It offers unlimited public file hosting through its free community plan
* Lumify is a big data fusion, analysis, and visualization platform. It helps users to discover
connections and explore relationships in their data via a suite of analytic options.
Features:
 It provides a variety of options for analyzing the links between entities on the graph
 It supports analysis of textual content, images, and videos

* IBM SPSS Modeler is a predictive big data analytics platform. It offers predictive models
and
delivers to individuals, groups, systems and the enterprise. It has a range of advanced
algorithms and analysis techniques.
Features:
 It provides an intuitive interface that is easy for everyone to learn
 It offers on-premises, cloud, and hybrid deployment options

* MongoDB is a NoSQL, document-oriented database written in C, C++, and JavaScript. It is free to use and is an open-source tool that supports multiple operating systems, including Windows Vista (and later versions), OS X (10.7 and later versions), Linux, Solaris, and FreeBSD.
Features: Aggregation, ad hoc queries, use of the BSON format, sharding, indexing, replication, server-side execution of JavaScript, schemaless design, capped collections, MongoDB Management Service (MMS), load balancing, and file storage.

Big Data Technology Landscape


Big Data deals with data sets that are too large or too complex to be handled by traditional data processing application software.
Some Popular Big Data Technologies:
Here, we will discuss an overview of some popular big data technologies, focusing mainly on what each technology is used for.
1. Apache Cassandra: It is one of the NoSQL databases; it is highly scalable, has high availability, and supports replication across multiple data centers. In Cassandra, fault tolerance is one of the big factors: failed nodes can easily be replaced without any downtime.
2. Apache Hadoop: Hadoop is one of the most widely used big data technologies. It handles large-scale data and large file systems using the Hadoop Distributed File System (HDFS) and provides parallel processing through the MapReduce framework. Hadoop is a scalable system that provides a solution capable of handling very large capacities. As a real use case, NextBio uses Hadoop MapReduce and HBase to process multi-terabyte data sets of the human genome.
3. Apache Hive: It is used for data summarization and ad hoc querying, which means querying and analyzing Big Data easily. It is built on top of Hadoop to provide data summarization, ad hoc queries, and analysis of large datasets using a SQL-like language called HiveQL. It is not a relational database and is not designed for real-time queries. Its main characteristics are: designed for OLAP, a SQL-type language called HiveQL, and it is fast, scalable, and extensible.
4. Apache Flume: It is a distributed and reliable system that is used to collect, aggregate,
and move large amounts of log data from many data sources toward a centralized data store.
5. Apache Spark: Spark was introduced by the Apache Software Foundation with the main objective of speeding up Hadoop's computational processing. Apache Spark can work independently because it has its own cluster management; it is not an updated or modified version of Hadoop, and combining Spark with Hadoop is just one way of deploying it. The main idea of implementing Spark with Hadoop concerns two things, storage and processing; because Spark has its own cluster management for computation, it typically uses Hadoop for storage purposes only. Spark includes interactive queries and stream processing, and in-memory cluster computing is one of its key features.
6. Apache Kafka: It is a distributed publish-subscribe messaging system; more specifically, it provides a robust queue that allows you to handle a high volume of data and pass messages from one point to another, i.e. from a sender to a receiver. Message consumption can be performed in both offline and online modes. To prevent data loss, Kafka messages are replicated within the cluster. For real-time streaming data analysis, it integrates with Apache Storm and Spark, and it is built on top of the ZooKeeper synchronization service.
7. MongoDB: It is a cross-platform database that works on the concepts of collections and documents. It has document-oriented storage, which means data is stored in JSON-like documents. Any attribute can be indexed. Its features include high availability, replication, rich queries, auto-sharding, and fast in-place updates.
8. ElasticSearch: It is a real-time distributed, open-source full-text search and analytics engine. It is highly scalable and can handle structured and unstructured data up to petabytes. It can be used as a replacement for document-based stores such as MongoDB and RavenDB. To improve search performance, it uses denormalization. It is used as an enterprise search engine by big organizations, for example Wikipedia and GitHub.

NOSQL and Hadoop


NoSQL is a type of database management system (DBMS) that is designed to handle and store
large volumes of unstructured and semi-structured data. Unlike traditional relational databases
that use tables with pre-defined schemas to store data, NoSQL databases use flexible data
models that can adapt to changes in data structures and are capable of scaling horizontally to
handle growing amounts of data.

The term NoSQL originally referred to “non-SQL” or “non-relational” databases, but the term
has since evolved to mean “not only SQL,” as NoSQL databases have expanded to include a
wide range of different database architectures and data models.

NoSQL databases are generally classified into four main categories:


Document databases: These databases store data as semi-structured documents, such as JSON
or XML, and can be queried using document-oriented query languages.
Key-value stores: These databases store data as key-value pairs, and are optimized for simple
and fast read/write operations.
Column-family stores: These databases store data as column families, which are sets of
columns that are treated as a single entity. They are optimized for fast and efficient querying
of large amounts of data.
Graph databases: These databases store data as nodes and edges, and are designed to handle
complex relationships between data.
Types of NoSQL database: The types of NoSQL databases and the database systems that fall in each category are:
1. Graph Databases: Examples – Amazon Neptune, Neo4j
2. Key value store: Examples – Memcached, Redis, Coherence
3. Column: Examples – HBase, Bigtable, Accumulo
4. Document-based: Examples – MongoDB, CouchDB, Cloudant
NoSQL, originally referring to non-SQL or non-relational, is a database that provides a
mechanism for storage and retrieval of data. This data is modeled in means other than
the tabular relations used in relational databases. Such databases came into existence in
the late 1960s, but did not obtain the NoSQL moniker until a surge of popularity in the
early twenty-first century. NoSQL databases are used in real-time web applications and
big data, and their use is increasing over time.
 NoSQL systems are also sometimes called Not only SQL to emphasize the fact that
they may support SQL-like query languages. A NoSQL database includes simplicity of
design, simpler horizontal scaling to clusters of machines, and finer control over
availability. The data structures used by NoSQL databases are different from those used
by default in relational databases which makes some operations faster in NoSQL. The
suitability of a given NoSQL database depends on the problem it should solve.
 NoSQL databases, also known as “not only SQL” databases, are a new type of database
management system that has gained popularity in recent years. Unlike traditional
relational databases, NoSQL databases are designed to handle large amounts of
unstructured or semi-structured data, and they can accommodate dynamic changes to
the data model. This makes NoSQL databases a good fit for modern web applications,
real-time analytics, and big data processing.
 Data structures used by NoSQL databases are sometimes also viewed as more flexible
than relational database tables. Many NoSQL stores compromise consistency in favor
of availability, speed, and partition tolerance. Barriers to the greater adoption of
NoSQL stores include the use of low-level query languages, lack of standardized
interfaces, and huge previous investments in existing relational databases.
 Most NoSQL stores lack true ACID(Atomicity, Consistency, Isolation, Durability)
transactions but a few databases, such as MarkLogic, Aerospike, FairCom c-treeACE,
Google Spanner (though technically a NewSQL database), Symas LMDB, and
OrientDB have made them central to their designs.
 Most NoSQL databases offer a concept of eventual consistency in which database
changes are propagated to all nodes so queries for data might not return updated data
immediately or might result in reading data that is not accurate which is a problem
known as stale reads. Also, some NoSQL systems may exhibit lost writes and other
forms of data loss. Some NoSQL systems provide concepts such as write-ahead logging
to avoid data loss.
 One simple example of a NoSQL database is a document database. In a document
database, data is stored in documents rather than tables. Each document can contain a
different set of fields, making it easy to accommodate changing data requirements.
 For example, take a database that holds data regarding employees. In
a relational database, this information might be stored in tables, with one table for
employee information and another table for department information. In a document
database, each employee would be stored as a separate document, with all of their
information contained within the document.
 NoSQL databases are a relatively new type of database management system that
has gained popularity in recent years due to their scalability and flexibility. They are
designed to handle large amounts of unstructured or semi-structured data and can
handle dynamic changes to the data model. This makes NoSQL databases a good fit for
modern web applications, real-time analytics, and big data processing.
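As a hedged sketch of the employee example above, the snippet below uses the third-party pymongo driver against an assumed local MongoDB server; note that the two employee documents do not need to share the same fields:

from pymongo import MongoClient  # third-party 'pymongo' package (assumed installed)

# Assumes a MongoDB server is running locally on the default port
client = MongoClient("mongodb://localhost:27017")
db = client["company"]

# Each employee is a self-contained document; documents need not share the same fields
db.employees.insert_one({
    "emp_id": 101, "name": "Asha", "department": {"name": "Finance", "floor": 3},
})
db.employees.insert_one({
    "emp_id": 102, "name": "Ravi", "skills": ["Python", "Spark"],   # different shape, no migration needed
})

# Query by any attribute without a predefined schema
print(db.employees.find_one({"emp_id": 102}))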
Key Features of NoSQL:
1. Dynamic schema: NoSQL databases do not have a fixed schema and can accommodate
changing data structures without the need for migrations or schema alterations.
2. Horizontal scalability: NoSQL databases are designed to scale out by adding more
nodes to a database cluster, making them well-suited for handling large amounts of data
and high levels of traffic.
3. Document-based: Some NoSQL databases, such as MongoDB, use a document-based
data model, where data is stored in a schema-less semi-structured format, such as JSON
or BSON.
4. Key-value-based: Other NoSQL databases, such as Redis, use a key-value data model,
where data is stored as a collection of key-value pairs.
5. Column-based: Some NoSQL databases, such as Cassandra, use a column-based data
model, where data is organized into columns instead of rows.
6. Distributed and high availability: NoSQL databases are often designed to be highly
available and to automatically handle node failures and data replication across multiple
nodes in a database cluster.
7. Flexibility: NoSQL databases allow developers to store and retrieve data in a flexible
and dynamic manner, with support for multiple data types and changing data structures.
8. Performance: NoSQL databases are optimized for high performance and can handle a
high volume of reads and writes, making them suitable for big data and real-time
applications.
Advantages of NoSQL: There are many advantages of working with NoSQL databases
such as MongoDB and Cassandra. The main advantages are high scalability and high
availability.
1. High scalability: NoSQL databases use sharding for horizontal scaling. Partitioning of
data and placing it on multiple machines in such a way that the order of the data is
preserved is sharding. Vertical scaling means adding more resources to the existing
machine whereas horizontal scaling means adding more machines to handle the data.
Vertical scaling is not that easy to implement but horizontal scaling is easy to
implement. Examples of horizontal scaling databases are MongoDB, Cassandra, etc.
NoSQL can handle huge amounts of data because of this scalability; as the data grows, NoSQL scales itself to handle that data in an efficient manner.
2. Flexibility: NoSQL databases are designed to handle unstructured or semi-structured
data, which means that they can accommodate dynamic changes to the data model. This
makes NoSQL databases a good fit for applications that need to handle changing data
requirements.
3. High availability: The auto-replication feature in NoSQL databases makes them highly available because, in case of any failure, data replicates itself to the last consistent state.
Disadvantages of NoSQL: NoSQL has the following disadvantages.
1. Lack of standardization: There are many different types of NoSQL databases, each
with its own unique strengths and weaknesses. This lack of standardization can make it
difficult to choose the right database for a specific application
2. Lack of ACID compliance: NoSQL databases are not fully ACID-compliant, which
means that they do not guarantee the consistency, integrity, and durability of data. This
can be a drawback for applications that require strong data consistency guarantees.
3. Narrow focus: NoSQL databases have a very narrow focus: they are mainly designed for storage and provide very little other functionality. Relational databases remain a better choice for transaction management than NoSQL.
When should NoSQL be used:
1. When a huge amount of data needs to be stored and retrieved.
2. The relationship between the data you store is not that important
3. The data changes over time and is not structured.
4. Support of Constraints and Joins is not required at the database level
5. The data is growing continuously and you need to scale the database regularly to handle
the data.
Hadoop
Hadoop is an open source software programming framework for storing a large amount
of data and performing the computation. Its framework is based on Java programming
with some native code in C and shell scripts.
Hadoop is an open-source software framework that is used for storing and processing
large amounts of data in a distributed computing environment. It is designed to handle
big data and is based on the MapReduce programming model, which allows for the
parallel processing of large datasets.
Hadoop has two main components:
 HDFS (Hadoop Distributed File System): This is the storage component of Hadoop,
which allows for the storage of large amounts of data across multiple machines. It is
designed to work with commodity hardware, which makes it cost-effective.
 YARN (Yet Another Resource Negotiator): This is the resource management
component of Hadoop, which manages the allocation of resources (such as CPU and
memory) for processing the data stored in HDFS.
 Hadoop also includes several additional modules that provide additional functionality,
such as Hive (a SQL-like query language), Pig (a high-level platform for creating
MapReduce programs), and HBase (a non-relational, distributed database).
 Hadoop is commonly used in big data scenarios such as data warehousing, business
intelligence, and machine learning. It’s also used for data processing, data analysis, and
data mining. It enables the distributed processing of large data sets across clusters of
computers using a simple programming model.
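As an illustrative sketch of the MapReduce programming model, the two small Python scripts below implement the classic word count and could be submitted through Hadoop Streaming (the exact streaming-jar path depends on the installation):

mapper.py:

#!/usr/bin/env python3
# Emits (word, 1) pairs; Hadoop Streaming feeds input lines on standard input.
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print(f"{word}\t1")

reducer.py:

#!/usr/bin/env python3
# Input arrives sorted by key, so all counts for one word are contiguous.
import sys

current_word, count = None, 0
for line in sys.stdin:
    word, value = line.rstrip("\n").split("\t")
    if word == current_word:
        count += int(value)
    else:
        if current_word is not None:
            print(f"{current_word}\t{count}")
        current_word, count = word, int(value)
if current_word is not None:
    print(f"{current_word}\t{count}")

These scripts can be tested locally with a shell pipeline such as: cat input.txt | python3 mapper.py | sort | python3 reducer.py, before running them on a cluster.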
Features of Hadoop:
1. It is fault tolerant.
2. It is highly available.
3. Its programming model is easy.
4. It has huge, flexible storage.
5. It is low cost.
Hadoop has several key features that make it well-suited for big data processing:
 Distributed Storage: Hadoop stores large data sets across multiple machines, allowing
for the storage and processing of extremely large amounts of data.
 Scalability: Hadoop can scale from a single server to thousands of machines, making it
easy to add more capacity as needed.
 Fault-Tolerance: Hadoop is designed to be highly fault-tolerant, meaning it can
continue to operate even in the presence of hardware failures.
 Data locality: Hadoop provides a data locality feature, where data is stored on the
same node where it will be processed. This helps to reduce network traffic and
improve performance.
 High Availability: Hadoop provides High Availability feature, which helps to make
sure that the data is always available and is not lost.
 Flexible Data Processing: Hadoop’s MapReduce programming model allows for the
processing of data in a distributed fashion, making it easy to implement a wide
variety of data processing tasks.
 Data Integrity: Hadoop provides built-in checksum feature, which helps to ensure
that the data stored is consistent and correct.
 Data Replication: Hadoop provides data replication feature, which helps to replicate
the data across the cluster for fault tolerance.
 Data Compression: Hadoop provides built-in data compression feature, which helps
to reduce the storage space and improve the performance.
 YARN: A resource management platform that allows multiple data processing
engines like real-time streaming, batch processing, and interactive SQL, to run and
process data stored in HDFS.
