UNIT 1: Big Data Introduction

This unit discusses the evolution and significance of Big Data, highlighting its characteristics: volume, velocity, variety, veracity and value. It outlines the challenges associated with Big Data, including data quality, storage, analytics and security, surveys the technologies and applications that leverage Big Data for insights across industries, and describes the types of Big Data (structured, semi-structured and unstructured) along with examples and the technologies used to process and analyze such data.

Story of Big Data

• In the old days, people travelled from one village to another on a horse-drawn cart. As time passed, villages became towns and people spread out, so the distances between towns grew and travelling with luggage became a problem. One suggestion was to groom and feed the horse better so it could pull harder. That solution is not bad, but can a horse ever become an elephant? Probably not. Another suggestion was that instead of one horse pulling the cart, four horses should pull the same cart. That is the better idea: now people can travel long distances in less time and even carry more luggage.
• The same concept applies to Big Data. Until recently we were fine storing data on our own servers, because the volume of data was fairly limited and the time needed to process it was acceptable. In today's technological world, however, data is growing very fast and is relied on constantly; at the speed it is growing, it is becoming impossible to store and process it on any single server. As with the cart, the answer is not a bigger horse but more horses.
What is Big Data?

• Big Data is a term used for collections of data sets so large and complex that they are difficult to store and process using available database management tools or traditional data processing applications. The challenge includes capturing, curating, storing, searching, sharing, transferring, analyzing and visualizing this data.
Big Data Characteristics

• VOLUME
• VELOCITY
• VARIETY
• VERACITY
• VALUE
VOLUME
• Volume refers to the ‘amount of data’, which is growing day by day at
a very fast pace. The size of data generated by humans, machines and
their interactions on social media itself is massive.
VELOCITY
• Velocity is defined as the pace at which different sources generate data every day. This flow of data is massive and continuous. For example, Facebook reported 1.03 billion daily active users on mobile, an increase of 22% year-over-year.
VARIETY

• Because many different sources contribute to Big Data, the types of data they generate differ. Data can be structured, semi-structured or unstructured.
VERACITY

• Veracity refers to data in doubt, i.e. the uncertainty of the available data due to data inconsistency and incompleteness.
VALUE

• After discussing Volume, Velocity, Variety and Veracity, there is one more V to take into account when looking at Big Data: Value. It is all well and good to have access to big data, but unless we can turn it into value it is useless.
Types of Big Data

• Structured
• Semi-Structured
• Unstructured
Structured

• Data that can be stored and processed in a fixed format is called Structured Data. Data stored in a relational database management system (RDBMS) is one example of structured data. Structured data is easy to process because it has a fixed schema. Structured Query Language (SQL) is often used to manage such data.
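
As a minimal illustration (not from the slides), the sketch below uses Python's built-in sqlite3 module to show how a fixed schema lets SQL store and query structured data; the table and rows are hypothetical examples.

```python
# Structured data: a fixed schema in a relational database, queried with SQL.
import sqlite3

conn = sqlite3.connect(":memory:")   # throwaway in-memory database
conn.execute("CREATE TABLE employees (id INTEGER PRIMARY KEY, name TEXT, dept TEXT)")
conn.executemany(
    "INSERT INTO employees (name, dept) VALUES (?, ?)",
    [("Asha", "Sales"), ("Ravi", "Engineering")],
)

# The fixed schema makes querying straightforward.
for (name,) in conn.execute("SELECT name FROM employees WHERE dept = ?", ("Sales",)):
    print(name)   # -> Asha
```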
Semi-Structured

• Semi-Structured Data is data that does not have the formal structure of a data model (i.e. a table definition in a relational DBMS), but nevertheless has some organizational properties, such as tags and other markers that separate semantic elements, which makes it easier to analyze. XML files and JSON documents are examples of semi-structured data.
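
As a small illustration (not from the slides), the sketch below parses a hypothetical JSON document with Python's built-in json module; the keys act as the tags and markers mentioned above, but no fixed schema is enforced.

```python
# Semi-structured data: self-describing keys, but no enforced schema.
import json

doc = '{"name": "Asha", "skills": ["SQL", "Python"], "address": {"city": "Pune"}}'
record = json.loads(doc)

print(record["name"])               # keys (tags) separate semantic elements
print(record["address"]["city"])    # nesting is allowed, unlike a flat table row
print(record.get("phone", "n/a"))   # fields may simply be absent -> "n/a"
```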
Unstructured
• Data that has an unknown form, cannot be stored in an RDBMS, and cannot be analyzed unless it is transformed into a structured format is called unstructured data. Text files and multimedia content such as images, audio and video are examples of unstructured data. Unstructured data is growing more quickly than the other types; experts say that around 80 percent of the data in an organization is unstructured.
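
As a toy illustration (not from the slides) of the transformation mentioned above, the sketch below imposes a simple (word, count) structure on a piece of free text before analyzing it; the input is hypothetical.

```python
# Unstructured text has no schema, so we impose one before analysis.
from collections import Counter

raw_text = "big data is big and data is everywhere"   # hypothetical input

counts = Counter(raw_text.split())   # free text -> structured (word, count) pairs
for word, count in counts.most_common(3):
    print(word, count)               # e.g. big 2 / data 2 / is 2
```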
Examples of Big Data
• Walmart handles more than 1 million customer transactions every hour.
• Facebook stores, accesses, and analyzes 30+ petabytes of user-generated data.
• More than 230 million tweets are created every day.
• More than 5 billion people are calling, texting, tweeting and browsing on mobile phones worldwide.
• YouTube users upload 48 hours of new video every minute of the day.
• Amazon processes 15 million customer clickstream records per day to recommend products.
• 294 billion emails are sent every day; email services analyze this data to detect spam.
• Modern cars have close to 100 sensors monitoring fuel level, tire pressure, etc., so each vehicle generates a large amount of sensor data.
Applications of Big Data
• Smarter Healthcare: Making use of the petabytes of patients' data, an organization can extract meaningful information and build applications that predict a patient's deteriorating condition in advance.
• Telecom: The telecom sector collects information, analyzes it and provides solutions to different problems. Using Big Data applications, telecom companies have been able to significantly reduce data packet loss, which occurs when networks are overloaded, and thus provide a seamless connection to their customers.
• Retail: Retail has some of the tightest margins and is one of the greatest beneficiaries of big data. The beauty of using big data in retail is understanding consumer behavior. Amazon's recommendation engine provides suggestions based on the browsing history of the consumer.
• Traffic control: Traffic congestion is a major challenge for many cities globally. Effective use of data and sensors will be key to managing traffic better as cities become increasingly densely populated.
• Manufacturing: Analyzing big data in the manufacturing industry can reduce component defects, improve product quality, increase efficiency, and save time and money.
• Search Quality: Every time we extract information from Google, we simultaneously generate data for it. Google stores this data and uses it to improve its search quality.
Traditional versus Big Data
Big Challenges with Big Data

• The challenges in Big Data are the real implementation hurdles. They require immediate attention and must be handled, because if they are not, the technology may fail, which can lead to unpleasant results. Big Data challenges include storing and analyzing extremely large and fast-growing data.
• Data Quality – The problem here is the fourth V, i.e. Veracity. The data is often messy, inconsistent and incomplete. Dirty data costs companies in the United States an estimated $600 billion every year.
• Discovery – Finding insights in Big Data is like finding a needle in a haystack. Analyzing petabytes of data with extremely powerful algorithms to find patterns and insights is very difficult.
• Storage – The more data an organization has, the more complex the problem of managing it becomes. The question that arises here is "Where do we store it?". We need a storage system that can easily scale up or down on demand.
• Analytics – In the case of Big Data, most of the time we are unaware of the kind of data we are dealing with, so analyzing that data is even more difficult.
• Security – Since the data is huge in size, keeping it secure is another challenge. This includes user authentication, restricting access on a per-user basis, recording data access histories, proper use of data encryption, etc.
• Lack of Talent – There are many Big Data projects in major organizations, but assembling a sophisticated team of developers, data scientists and analysts who also have sufficient domain knowledge is still a challenge.
Some other Big Data challenges are:
Sharing and Accessing Data:
• Perhaps the most frequent challenge in big data efforts is the inaccessibility of data sets from external sources.
• Sharing data can cause substantial challenges, including the need for inter- and intra-institutional legal documents.
• Accessing data from public repositories leads to multiple difficulties.
• Data must be available in an accurate, complete and timely manner, because if the data in a company's information system is to be used to make accurate decisions in time, it has to be available in that form.
Privacy and Security:

• This is another critical challenge with Big Data. It has sensitive, conceptual, technical as well as legal significance.
• Most organizations are unable to maintain regular checks because of the large amounts of data being generated. However, security checks and monitoring should be performed in real time, because that is when they are most beneficial.
Analytical Challenges:

• There are some huge analytical challenges in big data, which raise questions such as: how do we deal with a problem if the data volume gets too large?
• How do we find the important data points?
• How do we use the data to its best advantage?


Technical challenges:
• Quality of data

• Fault tolerance

• Scalability
Big Data Technologies
• Big Data technology can be defined as software utilities designed to analyze, process and extract information from extremely complex and large data sets that traditional data processing software could never deal with.

• We need Big Data processing technologies to analyze these huge amounts of real-time data and come up with conclusions and predictions that reduce future risks.
Types of Big Data Technologies:
Big Data Technology is mainly classified into two types:
• Operational Big Data Technologies
• Analytical Big Data Technologies
Operational Big Data Technologies
• Online ticket bookings, which include rail tickets, flight tickets, movie tickets, etc.
• Online shopping on sites such as Amazon, Flipkart, Walmart, Snapdeal and many more.
• Data from social media sites such as Facebook, Instagram, WhatsApp and others.
• The employee details of any multinational company.
Analytical Big Data Technologies
• Stock market analysis.
• Carrying out space missions, where every single bit of information is crucial.
• Weather forecasting.
• Medical fields, where a particular patient's health status can be monitored.
• Let us have a look at the top Big Data technologies used in the IT industry.
Top Big Data Technologies

• Data Storage
• Data Mining
• Data Analytics
• Data Visualization
Data Storage
Hadoop Framework
• The Hadoop framework was designed to store and process data in a distributed data processing environment on commodity hardware, using a simple programming model. It can store and analyze data present on different machines at high speed and low cost.

• Developed by: Apache Software Foundation in the year 2011 (10 December)
• Written in: Java
• Current stable version: Hadoop 3.1.1
MongoDB
• NoSQL document databases such as MongoDB offer a direct alternative to the rigid schemas used in relational databases. This allows MongoDB to offer flexibility while handling a wide variety of data types at large volumes and across distributed architectures.

• Developed by: MongoDB Inc. in the year 2009 (11 February)
• Written in: C++, Go, JavaScript, Python
• Current stable version: MongoDB 4.0.10
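
A minimal sketch (not from the slides), assuming a MongoDB server on the default local port and the pymongo driver; the database, collection and documents are hypothetical. Note how the two documents need not share a schema.

```python
# Schema-flexible storage: documents in one collection can differ in shape.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")   # assumed local server
events = client["demo_db"]["events"]                # hypothetical names

events.insert_one({"user": "asha", "action": "login"})
events.insert_one({"user": "ravi", "action": "buy", "amount": 499})  # extra field is fine

print(events.find_one({"user": "ravi"}))
```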
Rainstor
• RainStor is a software company that developed a database management system of the same name, designed to manage and analyze Big Data for large enterprises. It uses deduplication techniques to organize the process of storing large amounts of data for reference.

• Developed by: RainStor Software company in the year 2004
• Works like: SQL
• Current stable version: RainStor 5.5
Hunk
• Hunk lets you access data in remote Hadoop clusters through virtual indexes and lets you use the Splunk Search Processing Language to analyze your data. With Hunk, you can report on and visualize large amounts of data from your Hadoop and NoSQL data sources.

• Developed by: Splunk Inc. in the year 2013
• Written in: Java
• Current stable version: Splunk Hunk 6.2
Data Mining
Presto
• Presto is an open-source distributed SQL query engine for running interactive analytic queries against data sources of all sizes, ranging from gigabytes to petabytes. Presto allows querying data in Hive, Cassandra, relational databases and proprietary data stores.

• Developed by: Facebook (open-sourced in 2013)
• Written in: Java
• Current stable version: Presto 0.22
Rapid Miner
• RapidMiner is a centralized solution featuring a very powerful and robust graphical user interface that enables users to create, deliver and maintain predictive analytics. It allows the creation of very advanced workflows and offers scripting support in several languages.

• Developed by: RapidMiner in the year 2001
• Written in: Java
• Current stable version: RapidMiner 9.2
Elasticsearch
• Elasticsearch is a search engine based on the Lucene library. It provides a distributed, multitenant-capable, full-text search engine with an HTTP web interface and schema-free JSON documents.

• Developed by: Elastic NV in the year 2012
• Written in: Java
• Current stable version: Elasticsearch 7.1
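
A minimal sketch (not from the slides) that exercises the HTTP interface mentioned above using only the Python standard library; it assumes a local Elasticsearch node on the default port 9200, and the index name and document are hypothetical.

```python
# Index a schema-free JSON document, then run a full-text match query.
import json
import urllib.request

BASE = "http://localhost:9200/articles"   # assumed node, hypothetical index

def call(method, path, body):
    req = urllib.request.Request(BASE + path,
                                 data=json.dumps(body).encode(),
                                 method=method,
                                 headers={"Content-Type": "application/json"})
    return json.load(urllib.request.urlopen(req))

call("POST", "/_doc", {"title": "Big Data basics", "views": 120})
result = call("POST", "/_search", {"query": {"match": {"title": "big"}}})
print(result["hits"]["total"])
```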
Data Analytics
Kafka
• Apache Kafka is a distributed streaming platform. A streaming platform has three key capabilities:
• Publish and subscribe to streams of records
• Store streams of records durably and reliably
• Process streams of records as they occur

• Developed by: Apache Software Foundation in the year 2011
• Written in: Scala, Java
• Current stable version: Apache Kafka 2.2.0
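
A minimal publish/subscribe sketch (not from the slides), assuming a Kafka broker on localhost:9092 and the third-party kafka-python package; the topic name and message are hypothetical.

```python
# Publish one record to a topic, then read it back from the beginning.
from kafka import KafkaProducer, KafkaConsumer

producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("page-views", b'{"user": "asha", "page": "/home"}')  # publish
producer.flush()                                                   # wait for delivery

consumer = KafkaConsumer("page-views",
                         bootstrap_servers="localhost:9092",
                         auto_offset_reset="earliest")             # read from the start
for message in consumer:
    print(message.value)
    break                                                          # one record is enough here
```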
Splunk
• Splunk captures, indexes, and correlates real-time data in a searchable repository from which it can generate graphs, reports, alerts, dashboards, and data visualizations. It is also used for application management, security and compliance, as well as business and web analytics.

• Developed by: Splunk Inc. in the year 2014 (6 May)
• Written in: AJAX, C++, Python, XML
• Current stable version: Splunk 7.3
R-Language
• R is a programming language and free software environment for statistical computing and graphics. The R language is widely used among statisticians and data miners for developing statistical software, and especially for data analysis.

• Developed by: the R Foundation; R 1.0.0 released 29 February 2000
• Written in: C, Fortran, R
• Current stable version: R 3.6.0
Blockchain
• Blockchain is used for essential functions such as payment, escrow, and title; it can also reduce fraud, increase financial privacy, speed up transactions, and internationalize markets.

• Blockchain can be used to achieve the following in a business network environment:
• Shared Ledger: an append-only, distributed system of records shared across the business network (a toy sketch of the linked-block idea follows below).
• Smart Contract: business terms are embedded in the transaction database and executed with transactions.
• Privacy: ensuring appropriate visibility; transactions are secure, authenticated and verifiable.
• Consensus: all parties in the business network agree to network-verified transactions.
• Developed by: the Bitcoin community (first widely used implementation)
• Written in: JavaScript, C++, Python
• Current stable version: Blockchain 4.0
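
As a toy illustration (not from the slides) of the shared-ledger idea, the sketch below chains blocks by storing each block's hash in the next one, so tampering with any record breaks every later link; the payloads are hypothetical.

```python
# A toy hash-chained ledger: editing an old block invalidates the chain.
import hashlib
import json

def block_hash(block):
    body = {"prev": block["prev"], "payload": block["payload"]}
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()

def make_block(prev_hash, payload):
    block = {"prev": prev_hash, "payload": payload}
    block["hash"] = block_hash(block)
    return block

genesis = make_block("0" * 64, {"event": "ledger created"})
b1 = make_block(genesis["hash"], {"from": "asha", "to": "ravi", "amount": 10})

print(block_hash(genesis) == b1["prev"])   # True: the link verifies

genesis["payload"]["event"] = "forged"     # tamper with an old record...
print(block_hash(genesis) == b1["prev"])   # False: the chain is broken
```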
Data Visualization
Tableau
• Tableau is a powerful and fast-growing data visualization tool used in the business intelligence industry. Data analysis is very fast with Tableau, and the visualizations it creates take the form of dashboards and worksheets.

• Developed by: Tableau Software (17 May 2013)
• Written in: Java, C++, Python, C
• Current stable version: Tableau 8.2
Plotly
• Plotly is mainly used to make creating graphs faster and more efficient. It provides API libraries for Python, R, MATLAB, Node.js, Julia, and Arduino, plus a REST API. Plotly can also be used to style interactive graphs within Jupyter notebooks.

• Developed by: Plotly in the year 2012
• Written in: JavaScript
• Current stable version: Plotly 1.47.4
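
A minimal sketch (not from the slides) using Plotly's Python API, assuming the plotly package is installed; the data points are hypothetical.

```python
# Build a small interactive line chart and open it in a browser/notebook.
import plotly.graph_objects as go

fig = go.Figure(data=go.Scatter(x=[1, 2, 3, 4],
                                y=[10, 14, 9, 20],
                                mode="lines+markers"))
fig.update_layout(title="Daily events (hypothetical data)")
fig.show()
```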
Big Data: Infrastructure

• Hadoop is essentially an open-source framework for processing, storing and analyzing data. The fundamental principle behind Hadoop is that rather than tackling one monolithic block of data all in one go, it is more efficient to break the data up and distribute it into many parts, allowing different parts to be processed and analyzed concurrently.
• When hearing Hadoop discussed, it is easy to think of it as one vast entity; this is a myth. In reality, Hadoop is a whole ecosystem of different products, largely presided over by the Apache Software Foundation. Some key components include:
• HDFS- The default storage layer
• MapReduce- Executes a wide range of analytic functions by analyzing datasets in parallel before 'reducing' the results. The "Map" job distributes a query to different nodes, and the "Reduce" job gathers the results and resolves them into a single value (see the sketch after this list).
• YARN- Responsible for cluster management and scheduling user
applications
• Spark- Used on top of HDFS; promises speeds up to 100 times faster than the two-step MapReduce function in certain applications. It allows data to be loaded in memory and queried repeatedly, making it particularly apt for machine learning algorithms.
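
A toy sketch (not from the slides) of the map/reduce idea in plain Python: the map step emits (word, 1) pairs from each chunk independently (the part Hadoop runs in parallel across nodes), and the reduce step gathers the pairs and resolves each group to a single total; the text chunks stand in for HDFS blocks.

```python
# Word count, the canonical MapReduce example, in single-process Python.
from collections import defaultdict

chunks = ["big data is big", "data is everywhere"]   # stand-ins for HDFS blocks

# Map: each chunk is handled independently, so this could run on many nodes.
mapped = [(word, 1) for chunk in chunks for word in chunk.split()]

# Shuffle + Reduce: group pairs by key, then resolve each group to one value.
totals = defaultdict(int)
for word, count in mapped:
    totals[word] += count

print(dict(totals))   # {'big': 2, 'data': 2, 'is': 2, 'everywhere': 1}
```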
Use of Data Analytics
• Descriptive
• Diagnostic
• Predictive
• Prescriptive
