Unit 1
Data that is very large in size is called Big Data. We normally work with data in the range of megabytes (Word documents, Excel sheets) or at most gigabytes (movies, code repositories), but data on the scale of petabytes, i.e. 10^15 bytes, is called Big Data. It is stated that almost 90% of today's data has been generated in the past three years.
o Social networking sites: Facebook, Google, LinkedIn, and similar sites generate huge amounts of data every day, as they have billions of users worldwide.
o E-commerce sites: Sites such as Amazon, Flipkart, and Alibaba generate huge volumes of logs from which users' buying trends can be traced.
o Weather stations: Weather stations and satellites produce very large volumes of data, which are stored and processed to forecast the weather.
o Telecom companies: Telecom giants such as Airtel and Vodafone study user trends and publish their plans accordingly, and to do this they store the data of their millions of users.
o Share market: Stock exchanges across the world generate huge amounts of data through their daily transactions.
Issues
Huge amounts of unstructured data need to be stored, processed, and analyzed.
Solution
Storage: To store this huge amount of data, Hadoop uses HDFS (Hadoop Distributed File System), which uses commodity hardware to form clusters and stores data in a distributed fashion. It works on the write once, read many times principle.
There are five V's of Big Data that explain its characteristics.
Volume
The name Big Data itself refers to enormous size. Big Data is a vast volume of data generated daily from many sources, such as business processes, machines, social media platforms, networks, and human interactions. Facebook, for example, generates approximately a billion messages, records around 4.5 billion clicks of the "Like" button, and receives more than 350 million new posts every day. Big data technologies are designed to handle such large amounts of data.
Variety
Big Data can be structured, unstructured, or semi-structured, and it is collected from many different sources. In the past, data was collected mainly from databases and spreadsheets, but today it arrives in a wide array of forms: PDFs, emails, audio, social media posts, photos, videos, and more.
c. Unstructured Data: Unstructured files such as log files, audio files, and image files fall into this category. Some organizations have a great deal of such data available, but they do not know how to derive value from it because the data is raw.
d. Quasi-structured Data: This format contains textual data with inconsistent formats that can be structured only with some effort, time, and tools.
Example: Web server logs, i.e., log files created and maintained by a server that contain a list of its activities.
Veracity
Veracity refers to how reliable the data is. Since data can be filtered or transformed in many ways, veracity is about being able to handle and manage data accurately and efficiently, which is also essential for business development.
Value
Value is an essential characteristic of big data. What matters is not simply the data that we process or store, but the valuable and reliable data that we store, process, and analyze.
Velocity
Velocity plays an important role compared to the other characteristics. Velocity refers to the speed at which data is created in real time. It covers the rate at which incoming data sets arrive, their rate of change, and bursts of activity. A primary aspect of Big Data is the ability to supply the data that is demanded rapidly.
Big data velocity deals with the speed at which data flows in from sources such as application logs, business processes, networks, social media sites, sensors, mobile devices, and so on.
Types of Big Data
Users generate about 2.5 quintillion bytes of data every day. Predictions by Statista suggested that by the end of 2021, 74 zettabytes (74 trillion GB) of data would be generated on the internet. Managing such a voluminous and perennial outpouring of data is increasingly difficult. Big data was introduced to manage such huge and complex data: it concerns the extraction of meaningful information from large and complex data sets that cannot be processed or analyzed by traditional methods.
All data cannot be stored in the same way. The methods for data storage can be evaluated accurately only after the type of data has been identified. A cloud service such as Microsoft Azure is a one-stop destination for storing all kinds of data: blobs, queues, files, tables, disks, and application data. However, even within the cloud, there are specialized services to deal with specific sub-categories of data.
For example, Azure services such as Azure SQL and Azure Cosmos DB help in handling and managing widely varied kinds of data.
Application data is data that is created, read, updated, deleted, or processed by applications. It may be generated via web apps, Android apps, iOS apps, or any other applications. Because of the wide diversity in the kinds of data being used, determining the storage approach is a little nuanced.
Structured Data
Structured data can be crudely defined as data that resides in a fixed field within a record.
It is the type of data most familiar from everyday life, for example a birthday or an address.
A certain schema binds it, so all the data has the same set of properties.
Structured data is also called relational data. It is split into multiple tables to enhance the integrity of the data by creating a single record to depict an entity. Relationships are enforced through table constraints.
The business value of structured data lies in how well an organization can utilize its existing systems and processes for analysis purposes.
A Structured Query Language (SQL) is used to bring the data together. Structured data is easy to enter, query, and analyze, because all of the data follows the same format. However, forcing a consistent structure also means that altering the structure is difficult, as each record has to be updated to adhere to the new structure. Examples of structured data include numbers, dates, strings, etc. The business data of an e-commerce website can be considered structured data, as can the following table of student records:
| Name   | Class | Section | Roll No | Grade |
|--------|-------|---------|---------|-------|
| Geek 1 | 11    | A       | 1       | A     |
| Geek 2 | 11    | A       | 2       | B     |
| Geek 3 | 11    | A       | 3       | A     |
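As a minimal sketch of how such structured data might be stored and queried, the following Python snippet uses the standard library's sqlite3 module; the table and column names simply mirror the example above and are illustrative, not taken from any particular system.

```python
import sqlite3

# In-memory database purely for illustration; a real system would use a persistent store.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# A fixed schema: every record has the same set of typed fields.
cur.execute("""
    CREATE TABLE students (
        roll_no INTEGER PRIMARY KEY,
        name    TEXT NOT NULL,
        class   TEXT NOT NULL,
        section TEXT NOT NULL,
        grade   TEXT NOT NULL
    )
""")

rows = [
    (1, "Geek 1", "11", "A", "A"),
    (2, "Geek 2", "11", "A", "B"),
    (3, "Geek 3", "11", "A", "A"),
]
cur.executemany("INSERT INTO students VALUES (?, ?, ?, ?, ?)", rows)

# Because the structure is known in advance, querying and analysis are straightforward.
for roll_no, name in cur.execute("SELECT roll_no, name FROM students WHERE grade = 'A'"):
    print(roll_no, name)

conn.close()
```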
Semi-Structured Data
Semi-structured data is not bound by any rigid schema for data storage and
handling. The data is not in the relational format and is not neatly organized
into rows and columns like that in a spreadsheet. However, there are some
features like key-value pairs that help in discerning the different entities
from each other.
Since semi-structured data doesn’t need a structured query language, it is
commonly called NoSQL data.
A data serialization language is used to exchange semi-structured data
across systems that may even have varied underlying infrastructure.
Semi-structured content is often used to store metadata about a business
process but it can also include files containing machine instructions for
computer programs.
This type of information typically comes from external sources such as
social media platforms or other web-based data feeds.
Semi-structured data is often created in plain text, so different text-editing tools can be used to draw valuable insights from it. Because of this simple format, data serialization readers can be implemented even on hardware with limited processing resources and bandwidth.
Data Serialization Languages
Software developers use serialization languages to write memory-based data to files for transit, storage, and parsing. The sender and the receiver do not need to know anything about each other's systems: as long as the same serialization language is used, the data can be understood comfortably by both. There are three predominantly used serialization languages.
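As a minimal sketch of the idea, the following snippet serializes a semi-structured record with JSON, one commonly used serialization format, via Python's standard json module; the record fields are invented purely for illustration.

```python
import json

# A semi-structured record: key-value pairs, with no rigid schema binding every record.
order = {
    "order_id": "A-1001",
    "customer": {"name": "Geek 1", "email": "geek1@example.com"},
    "items": [
        {"sku": "BOOK-42", "qty": 2},
        {"sku": "PEN-7", "qty": 10, "gift_wrap": True},  # an extra field only this item has
    ],
}

# Serialize to plain text so any other system that understands JSON can read it.
payload = json.dumps(order)

# The receiving system reconstructs the same structure without sharing a schema up front.
received = json.loads(payload)
print(received["customer"]["name"], len(received["items"]))
```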
The table below contrasts traditional data with big data.

| Traditional Data | Big Data |
|---|---|
| It is usually a small amount of data that can be collected and analyzed easily using traditional methods. | It is usually a large amount of data that cannot be processed and analyzed easily using traditional methods. |
| It usually comes from internal systems. | It comes from various sources such as mobile devices, social media, etc. |
| Analysis of traditional data can be done with primary statistical methods. | Analysis of big data needs advanced analytics methods such as machine learning, data mining, etc. |
| Traditional methods to analyze data are slow and gradual. | Methods to analyze big data are fast and instant. |
| It is limited in its value and insights. | It provides valuable insights and patterns for good decision-making. |
| It is used for simple and small business processes. | It is used for complex and big business processes. |
| It does not provide in-depth insights. | It provides in-depth insights. |
| It is easier to secure and protect because of its small size and simplicity. | It is harder to secure and protect because of its size and complexity. |
| It requires less time and money to store. | It requires more time and money to store. |
| It is less efficient than big data. | It is more efficient than traditional data. |
1. Data Warehousing:
In the 1990s, data warehousing emerged as a solution to store and analyze large
volumes of structured data.
2. Hadoop:
Hadoop was introduced in 2006 by Doug Cutting and Mike Cafarella. It is an open-source framework that provides distributed storage and large-scale data processing.
3. NoSQL Databases:
In 2009, NoSQL databases were introduced, which provide a flexible way to store and
retrieve unstructured data.
4. Cloud Computing:
Cloud computing technology helps companies store their important data in remote data centers, saving infrastructure and maintenance costs.
5. Machine Learning:
Machine learning algorithms work on large data sets, and analysis is performed on huge amounts of data to extract meaningful insights. This has contributed to the development of artificial intelligence (AI) applications.
6. Data Streaming:
Data Streaming technology has emerged as a solution to process large volumes of
data in real time.
7. Edge Computing:
Edge computing is a distributed computing paradigm that allows data processing to be done at the edge of the network, closer to the source of the data.
Overall, big data technology has come a long way since the early days of data
warehousing. The introduction of Hadoop, NoSQL databases, cloud computing,
machine learning, data streaming, and edge computing has revolutionized how we
store, process, and analyze large volumes of data. As technology evolves, we can
expect Big Data to play a very important role in various industries.
Scalability:
The main impact of Big Data on DBMS has been the need for scalability. Big data
requires a DBMS to handle large volumes of data. Traditional DBMSs were not
designed to handle the amount of data that Big Data generates. As a result, DBMSs
must be able to scale horizontally and vertically to meet the growing demand for
data storage and processing.
Distributed Architectures:
Distributed architectures help organizations manage vast amounts of data by clustering it across different nodes. This provides better fault tolerance, availability, and scalability.
o In shared-nothing architectures, each node in the cluster is independent and has its
own storage and processing power.
o In shared-disk architectures, all nodes share the same storage, and each node has its
own processing power.
Both types of architecture have their advantages and drawbacks, and the choice of architecture depends on the needs of the application.
NoSQL Databases:
The growth of Big Data has led to the emergence of NoSQL databases. NoSQL databases provide a flexible way to store and retrieve unstructured data. A NoSQL database does not have a fixed structure or schema in the way other DBMSs do, which makes it ideal for handling Big Data, which often has a variable schema. NoSQL databases can be categorized into four types: document-oriented, key-value, column-family, and graph (a small illustration follows). Each type of database has its advantages and disadvantages, and the choice of database depends on the specific requirements of the application.
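As a purely illustrative sketch of how the four NoSQL families shape the same information, the Python literals below express the made-up fact "Geek 1 follows Geek 2" in each style; none of the names are tied to any particular product.

```python
# Document-oriented: a self-contained, JSON-like document per entity.
document = {"_id": "user:1", "name": "Geek 1", "follows": ["user:2"]}

# Key-value: opaque values looked up by a single key.
key_value = {"user:1:name": "Geek 1", "user:1:follows": "user:2"}

# Column-family: rows hold sparse groups (families) of columns.
column_family = {"user:1": {"profile": {"name": "Geek 1"}, "edges": {"follows": "user:2"}}}

# Graph: explicit nodes and typed edges between them.
graph = {"nodes": ["user:1", "user:2"], "edges": [("user:1", "FOLLOWS", "user:2")]}

print(document, key_value, column_family, graph, sep="\n")
```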
Real-time Processing:
Big data requires DBMSs to provide real-time processing of data. Real-time
Processing allows applications to process data as it is generated. This requires
DBMSs to support in-memory data processing and streaming data processing. In-
memory data processing allows applications to store data in memory instead of on
disk, which provides faster access to the data. Streaming data processing allows
applications to process data as it is generated, which provides real-time insights into
the data.
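As a minimal, illustrative sketch of the streaming idea in pure Python, with a generator standing in for a live event source; no particular streaming engine or in-memory database is implied.

```python
import time
from collections import Counter

def event_stream():
    """Stand-in for a live source such as application logs or sensor readings."""
    for page in ["home", "cart", "home", "checkout", "home"]:
        yield {"page": page, "ts": time.time()}

# Keep running counts in memory and react to each event as it arrives,
# instead of waiting for a periodic batch job over data stored on disk.
counts = Counter()
for event in event_stream():
    counts[event["page"]] += 1
    print("so far:", dict(counts))
```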
Advanced Analytics:
DBMSs must be able to handle advanced analytics such as data mining, machine
learning, and artificial intelligence. This requires DBMSs to provide support for these
types of algorithms and tools.
Conclusion:
In conclusion, Big Data has driven significant changes in the DBMS landscape. DBMSs
must now be able to handle large volumes of data, provide real-time processing, and
support advanced analytics. The rise of Distributed Architectures and NoSQL
databases has provided new opportunities for managing big data. We can expect
further evolution in DBMSs as Big Data grows in importance. Organizations that manage Big Data well will be able to grow their businesses more effectively and make better decisions.
4. **Veracity**: Ensuring the quality, accuracy, and reliability of data is crucial for
meaningful analysis and decision-making. Big data often involves dealing with
noisy, incomplete, or inconsistent data, which can introduce errors and biases into
the analysis.
5. **Value**: Extracting actionable insights and value from big data requires
sophisticated analysis techniques and tools. Identifying relevant patterns, trends, and
correlations amidst the vast amount of data can be challenging, and there's often a
need for domain expertise to interpret the results accurately.
6. **Privacy and Security**: Big data often contains sensitive and personal
information, raising concerns about privacy and security. Safeguarding data against
unauthorized access, breaches, and misuse is critical, especially with the increasing
adoption of cloud-based and distributed computing environments.
Among the technologies currently generating the most excitement, big data technologies are widely associated with many other rapidly growing technologies such as deep learning, machine learning, artificial intelligence (AI), and the Internet of Things (IoT). In combination with these technologies, big data technologies focus on analyzing and handling large amounts of real-time data as well as batch data.
Some specific examples of Operational Big Data Technologies are listed below:
o Online ticket booking system, e.g., buses, trains, flights, and movies, etc.
o Online trading or shopping from e-commerce websites like Amazon, Flipkart,
Walmart, etc.
o Online data on social media sites, such as Facebook, Instagram, Whatsapp, etc.
o The employees' data or executives' particulars in multinational companies.
Some common examples of Analytical Big Data Technologies are listed below:
o Data Storage
o Data Mining
o Data Analytics
o Data Visualization
Data Storage
Let us first discuss leading Big Data Technologies that come under Data Storage:
o Hadoop: When it comes to handling big data, Hadoop is one of the leading technologies that come into play. This technology is based on the MapReduce architecture and is mainly used to process data in batches. The Hadoop framework was introduced to store and process data in a distributed processing environment, running in parallel across commodity hardware with a simple programming model (a toy sketch of that model follows this entry).
Apart from this, Hadoop is also well suited to storing and analyzing data from various machines at high speed and low cost. That is why Hadoop is known as one of the core components of big data technologies. The Apache Software Foundation released it in December 2011. Hadoop is written in the Java programming language.
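To make the MapReduce idea concrete, here is a toy, single-machine word count in Python. It only mimics the map, shuffle, and reduce phases and does not use Hadoop's actual APIs; in a real cluster each phase runs in parallel across many machines.

```python
from collections import defaultdict

documents = ["big data needs big storage", "hadoop stores big data"]

# Map phase: each input record is turned into (key, value) pairs independently.
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle phase: pairs are grouped by key (in Hadoop this happens across the cluster).
groups = defaultdict(list)
for word, count in mapped:
    groups[word].append(count)

# Reduce phase: each key's values are combined into the final result.
word_counts = {word: sum(counts) for word, counts in groups.items()}
print(word_counts)  # e.g. {'big': 3, 'data': 2, ...}
```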
o MongoDB: MongoDB is another important big data technology for storage. Relational and RDBMS properties do not apply to MongoDB because it is a NoSQL database. It is not like traditional RDBMS databases that use structured query languages; instead, MongoDB stores schema-flexible documents.
The structure of data storage in MongoDB is also different from that of traditional RDBMS databases, which enables MongoDB to hold massive amounts of data. It is based on a simple cross-platform, document-oriented design. The database in MongoDB uses JSON-like documents with flexible schemas. This ultimately supports the operational data storage needs seen in many financial organizations. As a result, MongoDB is replacing traditional mainframes and offering the flexibility to handle a wide range of high-volume data types in distributed architectures (a minimal usage sketch follows this entry).
MongoDB Inc. introduced MongoDB in February 2009. It is written in a combination of C++, Python, JavaScript, and Go.
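A minimal sketch of storing and querying such documents with the pymongo driver; it assumes a MongoDB server is running locally on the default port, and the database, collection, and field names are made up for illustration.

```python
from pymongo import MongoClient

# Connect to a locally running MongoDB instance (an assumption of this sketch).
client = MongoClient("mongodb://localhost:27017")
db = client["demo_bank"]       # hypothetical database name
accounts = db["accounts"]      # hypothetical collection name

# Documents in the same collection do not need identical fields.
accounts.insert_one({"account_id": 1, "owner": "Geek 1", "balance": 1200.50})
accounts.insert_one({"account_id": 2, "owner": "Geek 2", "balance": 80.00,
                     "flags": ["overdraft_protection"]})

# Query with a filter document instead of SQL.
for doc in accounts.find({"balance": {"$gt": 100}}):
    print(doc["owner"], doc["balance"])

client.close()
```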
o RainStor: RainStor is a popular database management system designed to manage and analyze organizations' Big Data requirements. It uses deduplication strategies that help manage the storage and handling of vast amounts of data for reference.
RainStor was designed in 2004 by the RainStor software company. It operates much like SQL. Companies such as Barclays and Credit Suisse use RainStor for their big data needs.
o Hunk: Hunk is mainly helpful when data needs to be accessed in remote Hadoop clusters using virtual indexes. It lets us use the Splunk Search Processing Language to analyze data. Hunk also allows us to report on and visualize vast amounts of data from Hadoop and NoSQL data sources.
Hunk was introduced in 2013 by Splunk Inc. It is based on the Java programming language.
o Cassandra: Cassandra is one of the leading big data technologies among the top NoSQL databases. It is open-source, distributed, and offers extensive column storage. It is freely available and provides high availability without a single point of failure, which ultimately helps it handle data efficiently on large clusters of commodity hardware. Cassandra's essential features include fault tolerance, scalability, MapReduce support, a distributed nature, eventual and tunable consistency, its own query language, and multi-datacenter replication.
Cassandra was developed in 2008 for the Facebook inbox search feature and later became an Apache Software Foundation project. It is based on the Java programming language.
Data Mining
Let us now discuss leading Big Data Technologies that come under Data Mining:
Data Analytics
Now, let us discuss leading Big Data Technologies that come under Data Analytics:
Data Visualization
Let us discuss leading Big Data Technologies that come under Data Visualization:
o Tableau: Tableau is one of the fastest and most powerful data visualization tools, used by leading business intelligence teams. It helps in analyzing data at very high speed. Tableau helps create visualizations and insights in the form of dashboards and worksheets.
Tableau is developed and maintained by Tableau Software, which went public in May 2013. It is written in multiple languages, such as Python, C, C++, and Java. Comparable tools in this space include Cognos, Qlik, and Oracle Hyperion.
o Plotly: As the name suggests, Plotly is best suited for plotting, i.e., creating graphs and related components quickly and efficiently. It provides rich libraries and APIs for MATLAB, Python, Julia, REST, Arduino, R, Node.js, and more, and supports interactive, styled graphs in Jupyter notebooks and PyCharm.
Plotly was introduced in 2012 by the Plotly company. It is based on JavaScript. Paladins and Bitbank are some of the companies making good use of Plotly.
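A minimal sketch with Plotly's Python library (plotly.graph_objects), assuming the plotly package is installed; the data values are invented purely for illustration.

```python
import plotly.graph_objects as go

# Invented daily post counts, just to have something to plot.
days = ["Mon", "Tue", "Wed", "Thu", "Fri"]
posts_millions = [310, 325, 340, 355, 350]

# Build an interactive bar chart and label its axes.
fig = go.Figure(data=go.Bar(x=days, y=posts_millions))
fig.update_layout(title="New posts per day (millions)",
                  xaxis_title="Day", yaxis_title="Posts (millions)")
fig.show()  # opens an interactive chart in the browser or notebook
```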
These are emerging technologies, but the list is not exhaustive, because the big data ecosystem is constantly evolving and new technologies appear at a very fast pace to meet the demands and requirements of the IT industry.
8. **Sentiment Analysis**: Data analytics is used to analyze text data from social
media, customer reviews, and other sources to understand public sentiment,
opinions, and trends. It helps organizations monitor brand reputation, assess
customer feedback, and respond to emerging issues effectively.
9. **Smart Cities and IoT**: Data analytics is applied to analyze data from various
IoT sensors and devices deployed in smart cities to optimize urban planning,
transportation systems, energy consumption, and public services. It enables cities
to improve sustainability, safety, and quality of life for residents.
10. **Financial Analytics**: In the financial sector, data analytics is used for
portfolio management, risk assessment, algorithmic trading, fraud detection, and
compliance monitoring. It helps financial institutions make data-driven investment
decisions, mitigate risks, and ensure regulatory compliance.
These are just a few examples of how data analytics is being used across different
industries and domains to drive insights, innovation, and value creation. As
technology advances and data becomes increasingly abundant, the potential
applications of data analytics are expected to continue expanding.
Expected Properties of a Big Data System
A big data system has various desired properties, most of which relate to how it copes with complexity as it scales. According to these properties, a big data system should perform well and be efficient and reasonable to operate. Let us explore these properties step by step.
1. Robustness and fault tolerance – Given the obstacles encountered in distributed systems, it is quite arduous to build a system that "does the right thing". Systems are required to behave correctly despite machines going down randomly, the complex semantics of consistency in distributed databases, redundancy, concurrency, and more. These obstacles make it complicated to reason about how the system behaves. The robustness of a big data system is the way to overcome these obstacles.
It is also imperative for the system to be human-fault tolerant, an often-disregarded property that cannot be overlooked. In a production system it is inevitable that an operator will sometimes make a mistake, such as deploying an incorrect program that corrupts the database. If recomputation and immutability are built into the core of a big data system, the system will be distinctly robust against human fault by providing a relevant and fairly simple mechanism for recovery (see the sketch after this list).
2. Debuggability – A big data system must provide the information needed to debug it when something goes wrong. The key is being able to trace, for each value in the system, what produced that value. Debuggability is achieved in the Lambda Architecture through the functional nature of the batch layer and by preferring recomputation algorithms when possible.
3. Scalability – Scalability is the ability to maintain performance in the face of growing data and load by adding resources to the system. The Lambda Architecture is horizontally scalable across all layers of the system stack: scaling is achieved by adding more machines.
4. Generalization – A general system can support a wide range of applications. Because the Lambda Architecture is based on functions of all the data, a wide variety of applications can run on it, including social networking applications and many others.
5. Ad hoc queries – The ability to perform ad hoc queries on the data is important. Nearly every large dataset contains unanticipated value, and being able to mine the data continually provides opportunities for new applications and business optimization.
6. Extensibility – An extensible system allows functionality to be added at minimal cost. Sometimes a new feature, or a change to an existing feature, requires migrating pre-existing data into a new format. Making such large-scale data migrations easy is part of building an extensible system.
7. Low-latency reads and updates – Numerous applications require reads with low latency, typically between a few milliseconds and a few hundred milliseconds. Update latency, in contrast, varies between applications: some need updates to be propagated with low latency, while others can function with a few hours of latency. A big data system must be able to provide low-latency reads, and low-latency updates, when the application requires them.
8. Minimal maintenance – Maintenance is like a tax on developers: it is the operational work needed to keep the system running smoothly, including anticipating when to add machines in order to scale, keeping processes up and running, and debugging them. Choosing components with as little implementation complexity as possible plays a significant role in minimal maintenance. A developer always prefers to rely on components with simple, well-understood mechanisms; notably, a distributed database tends to have complicated internals.
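The following toy Python sketch illustrates the recomputation-and-immutability idea referenced in property 1: the master dataset is append-only, and views are always recomputed from it, so a bad record introduced by human error is fixed by correcting the data and recomputing rather than by patching mutable state. All names and records are invented for illustration.

```python
# Append-only master dataset: facts are never updated in place, only added.
master_dataset = [
    {"user": "geek1", "event": "signup"},
    {"user": "geek1", "event": "purchase", "amount": 30},
    {"user": "geek2", "event": "purchase", "amount": -999},  # bad record from a faulty deploy
]

def recompute_view(events):
    """Batch-layer style view: computed as a pure function of all the data."""
    totals = {}
    for e in events:
        if e["event"] == "purchase":
            totals[e["user"]] = totals.get(e["user"], 0) + e["amount"]
    return totals

print(recompute_view(master_dataset))  # wrong, because of the bad record

# Recovery: drop or compensate the bad facts and simply recompute the view.
corrected = [e for e in master_dataset if e.get("amount", 0) >= 0]
print(recompute_view(corrected))       # the view is rebuilt from the good data
```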