Unit 1
Data that is very large in size is called Big Data. We normally work with data in the range of megabytes (Word documents, Excel sheets) or at most gigabytes (movies, code repositories), but data on the scale of petabytes, i.e. 10^15 bytes, is called Big Data. It is stated that almost 90% of today's data has been generated in the past three years.
o Social networking sites: Facebook, Google, LinkedIn, and similar sites generate huge amounts of data every day, as they have billions of users worldwide.
o E-commerce sites: Sites such as Amazon, Flipkart, and Alibaba generate huge volumes of logs from which users' buying trends can be traced.
o Weather stations: Weather stations and satellites produce very large volumes of data, which are stored and processed to forecast the weather.
o Telecom companies: Telecom giants such as Airtel and Vodafone study user trends and publish their plans accordingly, and to do this they store the data of their millions of users.
o Share market: Stock exchanges across the world generate huge amounts of data through their daily transactions.
Issues
Huge amounts of unstructured data need to be stored, processed, and analyzed.
Solution
Storage: To store this huge amount of data, Hadoop uses HDFS (Hadoop Distributed File System), which uses commodity hardware to form clusters and stores data in a distributed fashion. It works on the write once, read many times principle.
There are five V's of Big Data that explain its characteristics.
Volume
The name Big Data itself refers to enormous size. Big Data is a vast volume of data generated daily from many sources, such as business processes, machines, social media platforms, networks, and human interactions. Facebook, for example, generates approximately a billion messages, records around 4.5 billion clicks of the "Like" button, and receives more than 350 million new posts every day. Big data technologies are designed to handle such large amounts of data.
Variety
Big Data can be structured, unstructured, or semi-structured, and it is collected from many different sources. In the past, data was collected mainly from databases and spreadsheets, but today it arrives in a wide array of forms: PDFs, emails, audio, social media posts, photos, videos, and more.
c. Unstructured Data: Unstructured files such as log files, audio files, and image files fall into this category. Some organizations have a great deal of such data available, but they do not know how to derive value from it because the data is raw.
d. Quasi-structured Data: This format contains textual data with inconsistent formats that can be structured only with some effort, time, and tools.
Example: Web server logs, i.e., log files created and maintained by a server that contain a list of its activities.
Veracity
Veracity refers to how reliable the data is. Since data can be filtered or transformed in many ways, veracity is about being able to handle and manage data accurately and efficiently, which is also essential for business development.
Value
Value is an essential characteristic of big data. What matters is not simply the data that we process or store, but the valuable and reliable data that we store, process, and analyze.
Velocity
Velocity plays an important role compared to the other characteristics. Velocity refers to the speed at which data is created in real time. It covers the rate at which incoming data sets arrive, their rate of change, and bursts of activity. A primary aspect of Big Data is the ability to supply the data that is demanded rapidly.
Big data velocity deals with the speed at which data flows in from sources such as application logs, business processes, networks, social media sites, sensors, mobile devices, and so on.
Types of Big Data
Users generate about 2.5 quintillion bytes of data every day. Predictions by Statista suggested that by the end of 2021, 74 zettabytes (74 trillion GB) of data would be generated on the internet. Managing such a voluminous and perennial outpouring of data is increasingly difficult. Big data was introduced to manage such huge and complex data: it concerns the extraction of meaningful information from large and complex data sets that cannot be processed or analyzed by traditional methods.
All data cannot be stored in the same way. The methods for data storage can be evaluated accurately only after the type of data has been identified. A cloud service such as Microsoft Azure is a one-stop destination for storing all kinds of data: blobs, queues, files, tables, disks, and application data. However, even within the cloud, there are specialized services to deal with specific sub-categories of data.
For example, Azure services such as Azure SQL and Azure Cosmos DB help in handling and managing widely varied kinds of data.
Application data is data that is created, read, updated, deleted, or processed by applications. It may be generated via web apps, Android apps, iOS apps, or any other applications. Because of the wide diversity in the kinds of data being used, determining the storage approach is a little nuanced.
Structured Data
Structured data can be crudely defined as data that resides in a fixed field within a record.
It is the type of data most familiar from everyday life, for example a birthday or an address.
A certain schema binds it, so all the data has the same set of properties.
Structured data is also called relational data. It is split into multiple tables to enhance the integrity of the data by creating a single record to depict an entity. Relationships are enforced through table constraints.
The business value of structured data lies in how well an organization can utilize its existing systems and processes for analysis purposes.
A Structured Query Language (SQL) is used to bring the data together. Structured data is easy to enter, query, and analyze, because all of the data follows the same format. However, forcing a consistent structure also means that altering the structure is difficult, as each record has to be updated to adhere to the new structure. Examples of structured data include numbers, dates, strings, etc. The business data of an e-commerce website can be considered structured data, as can the following table of student records:
| Name   | Class | Section | Roll No | Grade |
|--------|-------|---------|---------|-------|
| Geek 1 | 11    | A       | 1       | A     |
| Geek 2 | 11    | A       | 2       | B     |
| Geek 3 | 11    | A       | 3       | A     |
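As a minimal sketch of how such structured data might be stored and queried, the following Python snippet uses the standard library's sqlite3 module; the table and column names simply mirror the example above and are illustrative, not taken from any particular system.

```python
import sqlite3

# In-memory database purely for illustration; a real system would use a persistent store.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# A fixed schema: every record has the same set of typed fields.
cur.execute("""
    CREATE TABLE students (
        roll_no INTEGER PRIMARY KEY,
        name    TEXT NOT NULL,
        class   TEXT NOT NULL,
        section TEXT NOT NULL,
        grade   TEXT NOT NULL
    )
""")

rows = [
    (1, "Geek 1", "11", "A", "A"),
    (2, "Geek 2", "11", "A", "B"),
    (3, "Geek 3", "11", "A", "A"),
]
cur.executemany("INSERT INTO students VALUES (?, ?, ?, ?, ?)", rows)

# Because the structure is known in advance, querying and analysis are straightforward.
for roll_no, name in cur.execute("SELECT roll_no, name FROM students WHERE grade = 'A'"):
    print(roll_no, name)

conn.close()
```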
Semi-Structured Data
Semi-structured data is not bound by any rigid schema for data storage and
handling. The data is not in the relational format and is not neatly organized
into rows and columns like that in a spreadsheet. However, there are some
features like key-value pairs that help in discerning the different entities
from each other.
Since semi-structured data doesn’t need a structured query language, it is
commonly called NoSQL data.
A data serialization language is used to exchange semi-structured data
across systems that may even have varied underlying infrastructure.
Semi-structured content is often used to store metadata about a business
process but it can also include files containing machine instructions for
computer programs.
This type of information typically comes from external sources such as
social media platforms or other web-based data feeds.
Semi-structured data is often created in plain text, so different text-editing tools can be used to draw valuable insights from it. Because of this simple format, data serialization readers can be implemented even on hardware with limited processing resources and bandwidth.
Data Serialization Languages
Software developers use serialization languages to write memory-based data to files for transit, storage, and parsing. The sender and the receiver do not need to know anything about each other's systems: as long as the same serialization language is used, the data can be understood comfortably by both. There are three predominantly used serialization languages.
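As a minimal sketch of the idea, the following snippet serializes a semi-structured record with JSON, one commonly used serialization format, via Python's standard json module; the record fields are invented purely for illustration.

```python
import json

# A semi-structured record: key-value pairs, with no rigid schema binding every record.
order = {
    "order_id": "A-1001",
    "customer": {"name": "Geek 1", "email": "geek1@example.com"},
    "items": [
        {"sku": "BOOK-42", "qty": 2},
        {"sku": "PEN-7", "qty": 10, "gift_wrap": True},  # an extra field only this item has
    ],
}

# Serialize to plain text so any other system that understands JSON can read it.
payload = json.dumps(order)

# The receiving system reconstructs the same structure without sharing a schema up front.
received = json.loads(payload)
print(received["customer"]["name"], len(received["items"]))
```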
The table below contrasts traditional data with big data.

| Traditional Data | Big Data |
|---|---|
| It is usually a small amount of data that can be collected and analyzed easily using traditional methods. | It is usually a large amount of data that cannot be processed and analyzed easily using traditional methods. |
| It usually comes from internal systems. | It comes from various sources such as mobile devices, social media, etc. |
| Analysis of traditional data can be done with primary statistical methods. | Analysis of big data needs advanced analytics methods such as machine learning, data mining, etc. |
| Traditional methods to analyze data are slow and gradual. | Methods to analyze big data are fast and instant. |
| It is limited in its value and insights. | It provides valuable insights and patterns for good decision-making. |
| It is used for simple and small business processes. | It is used for complex and big business processes. |
| It does not provide in-depth insights. | It provides in-depth insights. |
| It is easier to secure and protect because of its small size and simplicity. | It is harder to secure and protect because of its size and complexity. |
| It requires less time and money to store. | It requires more time and money to store. |
| It is less efficient than big data. | It is more efficient than traditional data. |
1. Data Warehousing:
In the 1990s, data warehousing emerged as a solution to store and analyze large
volumes of structured data.
2. Hadoop:
Hadoop was introduced in 2006 by Doug Cutting and Mike Cafarella. It is an open-source framework that provides distributed storage and large-scale data processing.
3. NoSQL Databases:
In 2009, NoSQL databases were introduced, which provide a flexible way to store and
retrieve unstructured data.
4. Cloud Computing:
Cloud computing technology helps companies store their important data in remote data centers, saving infrastructure and maintenance costs.
5. Machine Learning:
Machine learning algorithms work on large data sets, and analysis is performed on huge amounts of data to extract meaningful insights. This has contributed to the development of artificial intelligence (AI) applications.
6. Data Streaming:
Data Streaming technology has emerged as a solution to process large volumes of
data in real time.
7. Edge Computing:
Edge computing is a distributed computing paradigm that allows data processing to be done at the edge of the network, closer to the source of the data.
Overall, big data technology has come a long way since the early days of data
warehousing. The introduction of Hadoop, NoSQL databases, cloud computing,
machine learning, data streaming, and edge computing has revolutionized how we
store, process, and analyze large volumes of data. As technology evolves, we can
expect Big Data to play a very important role in various industries.
Scalability:
The main impact of Big Data on DBMS has been the need for scalability. Big data
requires a DBMS to handle large volumes of data. Traditional DBMSs were not
designed to handle the amount of data that Big Data generates. As a result, DBMSs
must be able to scale horizontally and vertically to meet the growing demand for
data storage and processing.
Distributed Architectures:
Distributed architectures help organizations manage vast amounts of data by clustering it across different nodes. This provides better fault tolerance, availability, and scalability.
o In shared-nothing architectures, each node in the cluster is independent and has its
own storage and processing power.
o In shared-disk architectures, all nodes share the same storage, and each node has its
own processing power.
Both types of architecture have their advantages and drawbacks, and the choice of architecture depends on the needs of the application.
NoSQL Databases:
The growth of Big Data has led to the emergence of NoSQL databases. NoSQL databases provide a flexible way to store and retrieve unstructured data. A NoSQL database does not have a fixed structure or schema in the way other DBMSs do, which makes it ideal for handling Big Data, which often has a variable schema. NoSQL databases can be categorized into four types: document-oriented, key-value, column-family, and graph (a small illustration follows). Each type of database has its advantages and disadvantages, and the choice of database depends on the specific requirements of the application.
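As a purely illustrative sketch of how the four NoSQL families shape the same information, the Python literals below express the made-up fact "Geek 1 follows Geek 2" in each style; none of the names are tied to any particular product.

```python
# Document-oriented: a self-contained, JSON-like document per entity.
document = {"_id": "user:1", "name": "Geek 1", "follows": ["user:2"]}

# Key-value: opaque values looked up by a single key.
key_value = {"user:1:name": "Geek 1", "user:1:follows": "user:2"}

# Column-family: rows hold sparse groups (families) of columns.
column_family = {"user:1": {"profile": {"name": "Geek 1"}, "edges": {"follows": "user:2"}}}

# Graph: explicit nodes and typed edges between them.
graph = {"nodes": ["user:1", "user:2"], "edges": [("user:1", "FOLLOWS", "user:2")]}

print(document, key_value, column_family, graph, sep="\n")
```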
Real-time Processing:
Big data requires DBMSs to provide real-time processing of data. Real-time
Processing allows applications to process data as it is generated. This requires
DBMSs to support in-memory data processing and streaming data processing. In-
memory data processing allows applications to store data in memory instead of on
disk, which provides faster access to the data. Streaming data processing allows
applications to process data as it is generated, which provides real-time insights into
the data.
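As a minimal, illustrative sketch of the streaming idea in pure Python, with a generator standing in for a live event source; no particular streaming engine or in-memory database is implied.

```python
import time
from collections import Counter

def event_stream():
    """Stand-in for a live source such as application logs or sensor readings."""
    for page in ["home", "cart", "home", "checkout", "home"]:
        yield {"page": page, "ts": time.time()}

# Keep running counts in memory and react to each event as it arrives,
# instead of waiting for a periodic batch job over data stored on disk.
counts = Counter()
for event in event_stream():
    counts[event["page"]] += 1
    print("so far:", dict(counts))
```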
Advanced Analytics:
DBMSs must be able to handle advanced analytics such as data mining, machine
learning, and artificial intelligence. This requires DBMSs to provide support for these
types of algorithms and tools.
Conclusion:
In conclusion, Big Data has driven significant changes in the DBMS landscape. DBMSs
must now be able to handle large volumes of data, provide real-time processing, and
support advanced analytics. The rise of Distributed Architectures and NoSQL
databases has provided new opportunities for managing big data. We can expect
further evolution in DBMSs as Big Data grows in importance. Organizations that manage Big Data well will be able to grow their businesses more effectively and make better decisions.
4. **Veracity**: Ensuring the quality, accuracy, and reliability of data is crucial for
meaningful analysis and decision-making. Big data often involves dealing with
noisy, incomplete, or inconsistent data, which can introduce errors and biases into
the analysis.
5. **Value**: Extracting actionable insights and value from big data requires
sophisticated analysis techniques and tools. Identifying relevant patterns, trends, and
correlations amidst the vast amount of data can be challenging, and there's often a
need for domain expertise to interpret the results accurately.
6. **Privacy and Security**: Big data often contains sensitive and personal
information, raising concerns about privacy and security. Safeguarding data against
unauthorized access, breaches, and misuse is critical, especially with the increasing
adoption of cloud-based and distributed computing environments.
Among the technologies currently generating the most excitement, big data technologies are widely associated with many other rapidly growing technologies such as deep learning, machine learning, artificial intelligence (AI), and the Internet of Things (IoT). In combination with these technologies, big data technologies focus on analyzing and handling large amounts of real-time data as well as batch data.
Some specific examples of Operational Big Data Technologies are listed below:
o Online ticket booking system, e.g., buses, trains, flights, and movies, etc.
o Online trading or shopping from e-commerce websites like Amazon, Flipkart,
Walmart, etc.
o Online data on social media sites, such as Facebook, Instagram, Whatsapp, etc.
o The employees' data or executives' particulars in multinational companies.
Some common examples of Analytical Big Data Technologies are listed below:
o Data Storage
o Data Mining
o Data Analytics
o Data Visualization
Data Storage
Let us first discuss leading Big Data Technologies that come under Data Storage:
o Hadoop: When it comes to handling big data, Hadoop is one of the leading technologies that come into play. This technology is based on the MapReduce architecture and is mainly used to process data in batches. The Hadoop framework was introduced to store and process data in a distributed processing environment, running in parallel across commodity hardware with a simple programming model (a toy sketch of that model follows this entry).
Apart from this, Hadoop is also well suited to storing and analyzing data from various machines at high speed and low cost. That is why Hadoop is known as one of the core components of big data technologies. The Apache Software Foundation released it in December 2011. Hadoop is written in the Java programming language.
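To make the MapReduce idea concrete, here is a toy, single-machine word count in Python. It only mimics the map, shuffle, and reduce phases and does not use Hadoop's actual APIs; in a real cluster each phase runs in parallel across many machines.

```python
from collections import defaultdict

documents = ["big data needs big storage", "hadoop stores big data"]

# Map phase: each input record is turned into (key, value) pairs independently.
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle phase: pairs are grouped by key (in Hadoop this happens across the cluster).
groups = defaultdict(list)
for word, count in mapped:
    groups[word].append(count)

# Reduce phase: each key's values are combined into the final result.
word_counts = {word: sum(counts) for word, counts in groups.items()}
print(word_counts)  # e.g. {'big': 3, 'data': 2, ...}
```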
o MongoDB: MongoDB is another important big data technology for storage. Relational and RDBMS properties do not apply to MongoDB because it is a NoSQL database. It is not like traditional RDBMS databases that use structured query languages; instead, MongoDB stores schema-flexible documents.
The structure of data storage in MongoDB is also different from that of traditional RDBMS databases, which enables MongoDB to hold massive amounts of data. It is based on a simple cross-platform, document-oriented design. The database in MongoDB uses JSON-like documents with flexible schemas. This ultimately supports the operational data storage needs seen in many financial organizations. As a result, MongoDB is replacing traditional mainframes and offering the flexibility to handle a wide range of high-volume data types in distributed architectures (a minimal usage sketch follows this entry).
MongoDB Inc. introduced MongoDB in February 2009. It is written in a combination of C++, Python, JavaScript, and Go.
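A minimal sketch of storing and querying such documents with the pymongo driver; it assumes a MongoDB server is running locally on the default port, and the database, collection, and field names are made up for illustration.

```python
from pymongo import MongoClient

# Connect to a locally running MongoDB instance (an assumption of this sketch).
client = MongoClient("mongodb://localhost:27017")
db = client["demo_bank"]       # hypothetical database name
accounts = db["accounts"]      # hypothetical collection name

# Documents in the same collection do not need identical fields.
accounts.insert_one({"account_id": 1, "owner": "Geek 1", "balance": 1200.50})
accounts.insert_one({"account_id": 2, "owner": "Geek 2", "balance": 80.00,
                     "flags": ["overdraft_protection"]})

# Query with a filter document instead of SQL.
for doc in accounts.find({"balance": {"$gt": 100}}):
    print(doc["owner"], doc["balance"])

client.close()
```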
o RainStor: RainStor is a popular database management system designed to manage and analyze organizations' Big Data requirements. It uses deduplication strategies that help manage the storage and handling of vast amounts of data for reference.
RainStor was designed in 2004 by the RainStor software company. It operates much like SQL. Companies such as Barclays and Credit Suisse use RainStor for their big data needs.
o Hunk: Hunk is mainly helpful when data needs to be accessed in remote Hadoop clusters using virtual indexes. It lets us use the Splunk Search Processing Language to analyze data. Hunk also allows us to report on and visualize vast amounts of data from Hadoop and NoSQL data sources.
Hunk was introduced in 2013 by Splunk Inc. It is based on the Java programming language.
o Cassandra: Cassandra is one of the leading big data technologies among the top NoSQL databases. It is open-source, distributed, and offers extensive column storage. It is freely available and provides high availability without a single point of failure, which ultimately helps it handle data efficiently on large clusters of commodity hardware. Cassandra's essential features include fault tolerance, scalability, MapReduce support, a distributed nature, eventual and tunable consistency, its own query language, and multi-datacenter replication.
Cassandra was developed in 2008 for the Facebook inbox search feature and later became an Apache Software Foundation project. It is based on the Java programming language.
Data Mining
Let us now discuss leading Big Data Technologies that come under Data Mining:
Data Analytics
Now, let us discuss leading Big Data Technologies that come under Data Analytics:
Data Visualization
Let us discuss leading Big Data Technologies that come under Data Visualization:
o Tableau: Tableau is one of the fastest and most powerful data visualization tools, used by leading business intelligence teams. It helps in analyzing data at very high speed. Tableau helps create visualizations and insights in the form of dashboards and worksheets.
Tableau is developed and maintained by Tableau Software, which went public in May 2013. It is written in multiple languages, such as Python, C, C++, and Java. Comparable tools in this space include Cognos, Qlik, and Oracle Hyperion.
o Plotly: As the name suggests, Plotly is best suited for plotting, i.e., creating graphs and related components quickly and efficiently. It provides rich libraries and APIs for MATLAB, Python, Julia, REST, Arduino, R, Node.js, and more, and supports interactive, styled graphs in Jupyter notebooks and PyCharm.
Plotly was introduced in 2012 by the Plotly company. It is based on JavaScript. Paladins and Bitbank are some of the companies making good use of Plotly.
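A minimal sketch with Plotly's Python library (plotly.graph_objects), assuming the plotly package is installed; the data values are invented purely for illustration.

```python
import plotly.graph_objects as go

# Invented daily post counts, just to have something to plot.
days = ["Mon", "Tue", "Wed", "Thu", "Fri"]
posts_millions = [310, 325, 340, 355, 350]

# Build an interactive bar chart and label its axes.
fig = go.Figure(data=go.Bar(x=days, y=posts_millions))
fig.update_layout(title="New posts per day (millions)",
                  xaxis_title="Day", yaxis_title="Posts (millions)")
fig.show()  # opens an interactive chart in the browser or notebook
```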
These are emerging technologies, but the list is not exhaustive, because the big data ecosystem is constantly evolving and new technologies appear at a very fast pace to meet the demands and requirements of the IT industry.
8. **Sentiment Analysis**: Data analytics is used to analyze text data from social
media, customer reviews, and other sources to understand public sentiment,
opinions, and trends. It helps organizations monitor brand reputation, assess
customer feedback, and respond to emerging issues effectively.
9. **Smart Cities and IoT**: Data analytics is applied to analyze data from various
IoT sensors and devices deployed in smart cities to optimize urban planning,
transportation systems, energy consumption, and public services. It enables cities
to improve sustainability, safety, and quality of life for residents.
10. **Financial Analytics**: In the financial sector, data analytics is used for
portfolio management, risk assessment, algorithmic trading, fraud detection, and
compliance monitoring. It helps financial institutions make data-driven investment
decisions, mitigate risks, and ensure regulatory compliance.
These are just a few examples of how data analytics is being used across different
industries and domains to drive insights, innovation, and value creation. As
technology advances and data becomes increasingly abundant, the potential
applications of data analytics are expected to continue expanding.
Expected Properties of a Big Data System
A big data system has various desired properties, most of which relate to how it copes with complexity as it scales. According to these properties, a big data system should perform well and be efficient and reasonable to operate. Let us explore these properties step by step.
1. Robustness and fault tolerance – Given the obstacles encountered in distributed systems, it is quite arduous to build a system that "does the right thing". Systems are required to behave correctly despite machines going down randomly, the complex semantics of consistency in distributed databases, redundancy, concurrency, and more. These obstacles make it complicated to reason about how the system behaves. The robustness of a big data system is the way to overcome these obstacles.
It is also imperative for the system to be human-fault tolerant, an often-disregarded property that cannot be overlooked. In a production system it is inevitable that an operator will sometimes make a mistake, such as deploying an incorrect program that corrupts the database. If recomputation and immutability are built into the core of a big data system, the system will be distinctly robust against human fault by providing a relevant and fairly simple mechanism for recovery (see the sketch after this list).
2. Debuggability – A big data system must provide the information needed to debug it when something goes wrong. The key is being able to trace, for each value in the system, what produced that value. Debuggability is achieved in the Lambda Architecture through the functional nature of the batch layer and by preferring recomputation algorithms when possible.
3. Scalability – Scalability is the ability to maintain performance in the face of growing data and load by adding resources to the system. The Lambda Architecture is horizontally scalable across all layers of the system stack: scaling is achieved by adding more machines.
4. Generalization – A general system can support a wide range of applications. Because the Lambda Architecture is based on functions of all the data, a wide variety of applications can run on it, including social networking applications and many others.
5. Ad hoc queries – The ability to perform ad hoc queries on the data is important. Nearly every large dataset contains unanticipated value, and being able to mine the data continually provides opportunities for new applications and business optimization.
6. Extensibility – An extensible system allows functionality to be added at minimal cost. Sometimes a new feature, or a change to an existing feature, requires migrating pre-existing data into a new format. Making such large-scale data migrations easy is part of building an extensible system.
7. Low-latency reads and updates – Numerous applications require reads with low latency, typically between a few milliseconds and a few hundred milliseconds. Update latency, in contrast, varies between applications: some need updates to be propagated with low latency, while others can function with a few hours of latency. A big data system must be able to provide low-latency reads, and low-latency updates, when the application requires them.
8. Minimal maintenance – Maintenance is like a tax on developers: it is the operational work needed to keep the system running smoothly, including anticipating when to add machines in order to scale, keeping processes up and running, and debugging them. Choosing components with as little implementation complexity as possible plays a significant role in minimal maintenance. A developer always prefers to rely on components with simple, well-understood mechanisms; notably, a distributed database tends to have complicated internals.
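The following toy Python sketch illustrates the recomputation-and-immutability idea referenced in property 1: the master dataset is append-only, and views are always recomputed from it, so a bad record introduced by human error is fixed by correcting the data and recomputing rather than by patching mutable state. All names and records are invented for illustration.

```python
# Append-only master dataset: facts are never updated in place, only added.
master_dataset = [
    {"user": "geek1", "event": "signup"},
    {"user": "geek1", "event": "purchase", "amount": 30},
    {"user": "geek2", "event": "purchase", "amount": -999},  # bad record from a faulty deploy
]

def recompute_view(events):
    """Batch-layer style view: computed as a pure function of all the data."""
    totals = {}
    for e in events:
        if e["event"] == "purchase":
            totals[e["user"]] = totals.get(e["user"], 0) + e["amount"]
    return totals

print(recompute_view(master_dataset))  # wrong, because of the bad record

# Recovery: drop or compensate the bad facts and simply recompute the view.
corrected = [e for e in master_dataset if e.get("amount", 0) >= 0]
print(recompute_view(corrected))       # the view is rebuilt from the good data
```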