
Bda M1

Uploaded by

Omkar Masaye
Copyright
© © All Rights Reserved

BIG DATA ANALYTICS

Dr Brinthakumari S
Module No - 1
Introduction to Big Data and Hadoop

1.1 Introduction to Big Data - Big Data characteristics and Types of Big Data, Traditional vs. Big Data business approach
Introduction to Big Data
What is Data?
Data refers to the information collected, stored, and processed by
computers, systems, or organizations. It can take many forms,
including:

- Numbers and statistics
- Text and words
- Images and videos
- Audio files and sounds
- Sensor readings and measurements
How large is your data?

● What is the maximum file size you have dealt with so far?
● What movies, files, or streaming video have you used?
● What is the maximum download speed you get when retrieving data stored in distant locations?
● How fast is your computation?
● How much time does it take just to transfer the data from you, process it, and get the result?
How large is your data?

● Normally we work on data whose size is in MB (Word documents, Excel sheets) or at most GB (movies, code), but data at the petabyte scale, i.e. 10^15 bytes, is called Big Data.
How large is your data?

● Every month, per person → 40 Exabytes x 5,000,000,000 = 2000,000,000,…….

■ It is stated that almost 90% of today's data has been generated in the past 3 years.
Growth of data

Data Explosion:
2.5 quintillion bytes (2.5 x 10^9 GB) of data are generated on a regular basis
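The unit arithmetic behind these figures can be checked with a short sketch. It assumes decimal (SI) units, where each step is a factor of 1000; the helper name convert is only illustrative.

```python
# Decimal (SI) byte units: each step up is a factor of 1000.
UNITS = {
    "KB": 10**3,
    "MB": 10**6,
    "GB": 10**9,
    "TB": 10**12,
    "PB": 10**15,
    "EB": 10**18,
}


def convert(num_bytes, unit):
    """Express a byte count in the given decimal unit."""
    return num_bytes / UNITS[unit]


quintillion = 10**18
data_bytes = 2.5 * quintillion          # 2.5 quintillion bytes

in_gb = convert(data_bytes, "GB")       # the same amount in gigabytes
in_eb = convert(data_bytes, "EB")       # the same amount in exabytes
one_pb_in_gb = convert(10**15, "GB")    # one petabyte in gigabytes

print(f"{in_gb:.1e} GB, {in_eb} EB, 1 PB = {one_pb_in_gb:.0e} GB")
```

So 2.5 quintillion bytes is 2.5 x 10^9 GB, or 2.5 exabytes, and a single petabyte is a million gigabytes.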
What is Big Data?

From 1970 to date
Data
Now data is Big Data!
● Data that is very large in size is called Big Data
● No single standard definition!
● ‘Big-data’ is similar to ‘Small-data’, but bigger
● Big Data is a term used for a collection of data sets that are so large and complex that they are difficult to store and process using available database management tools or traditional data processing applications.
What is Big Data?
Big Data is data that cannot be managed using traditional databases.
Sources of Big Data
● Social networking sites:
○ Facebook, Google, and LinkedIn all generate a huge amount of data on a day-to-day basis, as they have billions of users worldwide.
E.g., every action on social media generates data:
○ Registration with personal details
○ Uploading any post (photo, audio, video, text)
○ Reactions (like, tag, comment)
Sources of Big Data
● E-commerce sites:
○ Sites like Amazon, Flipkart, and Alibaba generate a huge amount of logs from which users' buying trends can be traced.
○ E.g., search history, purchase history, personal details
● Weather stations:
○ Weather stations and satellites produce very large amounts of data, which are stored and processed to forecast the weather.
● Telecom companies:
○ Telecom giants like Airtel and Vodafone study user trends and publish their plans accordingly; for this they store the data of their millions of users.
● Share market:
○ Stock exchanges across the world generate a huge amount of data through their daily transactions.
How do you classify any data as Big Data?

● This is possible with the concept of the 5 Vs
● The 5 Vs of Big Data are:
○ Volume
○ Velocity
○ Variety
○ Veracity
○ Value
If the data meets some or all of these criteria, it can be classified as Big Data.
1. Volume: Is the data scale massive, exceeding traditional processing
capabilities (e.g., terabytes, petabytes, or more)?
2. Velocity: Is the data generated at a high speed, such as real-time
streaming data from sensors, social media, or applications?
3. Variety: Does the data consist of diverse formats, such as structured,
semi-structured, and unstructured data (e.g., text, images, audio,
video)?
4. Veracity: Is the data accurate, reliable, and trustworthy, or does it
require processing to ensure quality?
5. Value: Does the data have potential value and relevance for analysis,
insights, and decision-making?
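The five questions above can be read as a checklist. The sketch below is only an illustration of that idea: the DatasetProfile fields, the threshold values, and the min_hits rule are all assumptions made for the example, since the slides say only "some or all" of the criteria and give no numeric cut-offs.

```python
from dataclasses import dataclass


@dataclass
class DatasetProfile:
    size_bytes: int            # Volume
    events_per_second: float   # Velocity
    formats: set               # Variety, e.g. {"csv", "json", "video"}
    needs_cleaning: bool       # Veracity: quality work is required
    has_business_use: bool     # Value


# Illustrative thresholds only (assumptions, not standard values).
VOLUME_THRESHOLD = 10**12      # 1 TB
VELOCITY_THRESHOLD = 1_000     # events per second
VARIETY_THRESHOLD = 2          # more than two distinct formats


def five_v_report(p):
    """Answer each of the five V questions for one dataset."""
    return {
        "volume": p.size_bytes >= VOLUME_THRESHOLD,
        "velocity": p.events_per_second >= VELOCITY_THRESHOLD,
        "variety": len(p.formats) > VARIETY_THRESHOLD,
        "veracity": p.needs_cleaning,
        "value": p.has_business_use,
    }


def looks_like_big_data(p, min_hits=3):
    """'Some or all' criteria: min_hits is a hypothetical cut-off."""
    return sum(five_v_report(p).values()) >= min_hits


clickstream = DatasetProfile(5 * 10**15, 20_000, {"json", "log", "image"}, True, True)
spreadsheet = DatasetProfile(2 * 10**6, 0.1, {"xlsx"}, False, True)

print(looks_like_big_data(clickstream))
print(looks_like_big_data(spreadsheet))
```

With these assumed thresholds the petabyte-scale clickstream meets all five criteria, while the small spreadsheet meets only Value.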
Big Data Characteristics
Volume
● The name Big Data itself is related to an enormous size.
● Big Data is a vast volume of data generated daily from many sources, such as business processes, machines, social media platforms, networks, human interactions, and many more.
● Facebook, for example, generates approximately a billion messages, records about 4.5 billion clicks of the "Like" button, and receives more than 350 million new posts each day.
● Big Data technologies can handle large amounts of data.
Velocity:

● Velocity plays an important role compared to the other characteristics. Velocity refers to the speed at which data is created in real time.
● It covers the speed of incoming data sets, their rate of change, and bursts of activity.
● A primary aim of Big Data is to provide the data that is in demand rapidly.
● Big Data velocity deals with the speed at which data flows from sources like application logs, business processes, networks, social media sites, sensors, mobile devices, etc.
Variety:

● Big Data can be structured, unstructured, or semi-structured, and it is collected from different sources.
● In the past, data was collected only from databases and spreadsheets, but these days data comes in an array of forms such as PDFs, emails, audio, social media posts, photos, videos, etc.
● The data is categorized as below:
○ Structured data
○ Semi-structured data
○ Unstructured data
Structured data:

● Structured data has a defined schema, along with all the required columns.
● It is in a tabular form. Structured data is stored in a relational database management system.
Semi-structured:
● In semi-structured data, the schema is not strictly defined, e.g., JSON, XML, CSV, TSV, and email.
● By contrast, OLTP (Online Transaction Processing) systems are built to work with structured data, which is stored in relations, i.e., tables.
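To make the difference concrete, here is a minimal sketch using Python's standard json module. The records are made up for illustration: each one carries its own keys, so a table can only be built after the fact, with gaps where a field is missing.

```python
import json

# Hypothetical records: the same kind of entity, but the fields vary per record.
raw = """
[
  {"id": 1, "name": "Asha", "email": "asha@example.com"},
  {"id": 2, "name": "Ravi", "phones": ["111", "222"]},
  {"id": 3, "name": "Meera", "address": {"city": "Mumbai"}}
]
"""

records = json.loads(raw)

# Collect every key that appears anywhere: no schema was fixed up front.
columns = sorted({key for rec in records for key in rec})

# Force the records into a table shape; missing fields become None.
table = [[rec.get(col) for col in columns] for rec in records]

print(columns)
for row in table:
    print(row)
```

The keys act as the "tags and markers" of semi-structured data, and nested values such as the address object or the phone list have no direct equivalent in a single flat table row.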
Unstructured Data:
● Unstructured data includes all unstructured files, such as log files, audio files, and image files.
● Some organizations have a lot of data available but do not know how to derive value from it, since the data is raw.
Quasi-structured Data:
● Quasi-structured data is textual data with inconsistent formats that can be formatted with some effort, time, and tools.
Example: web server logs, i.e., log files created and maintained by a server that contain a list of activities.
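As a sketch of the "effort and tools" involved, the snippet below pulls fields out of web server log lines with a regular expression. The sample lines are invented and assume the widely used Common Log Format; real logs may follow a different layout.

```python
import re

# Pattern for the Common Log Format (an assumption about the log layout).
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)


def parse_line(line):
    """Return a dict of fields, or None if the line does not fit the pattern."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    record["status"] = int(record["status"])
    # A "-" size means no body was sent; treat it as zero bytes.
    record["size"] = 0 if record["size"] == "-" else int(record["size"])
    return record


# Hypothetical log lines, including one that is malformed.
lines = [
    '127.0.0.1 - - [10/Oct/2023:13:55:36 +0000] "GET /index.html HTTP/1.1" 200 2326',
    '192.168.1.7 - alice [10/Oct/2023:13:56:01 +0000] "POST /login HTTP/1.1" 302 -',
    'corrupted entry without the usual fields',
]

parsed = [parse_line(line) for line in lines]
for record in parsed:
    print(record)
```

Lines that do not fit the pattern come back as None, which is typical of quasi-structured data: most entries can be parsed, but not all of them.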
Veracity:
● Veracity means how reliable the data is. There are many ways to filter or translate the data.
● Veracity is also about being able to handle and manage data efficiently. Big Data is also essential in business development.
For example, Facebook posts with hashtags.
Value:
● Value is an essential characteristic of Big Data. It is not just any data that we process or store.
● It is valuable and reliable data that we store, process, and analyze.
Examples of Big Data
Healthcare System (Smart Hospital)
Examples of Big Data - Social Media
Solution
Storage: To store this huge amount of data, Hadoop uses HDFS (Hadoop Distributed File System), which uses commodity hardware to form clusters and stores data in a distributed fashion. It works on the write-once, read-many-times principle.

Processing: The MapReduce paradigm is applied to data distributed over the network to compute the required output.

Analyze: Pig and Hive can be used to analyze the data.

Cost: Hadoop is open source, so licensing cost is no longer an issue.
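The MapReduce idea mentioned under Processing can be sketched in plain Python with a word count. This is a single-process illustration of the map, shuffle, and reduce steps, not Hadoop's API; the two strings in chunks stand in for blocks of a file stored on different nodes.

```python
from collections import defaultdict


def map_phase(chunk):
    """Map: emit a (word, 1) pair for every word in one chunk of input."""
    for word in chunk.lower().split():
        yield (word, 1)


def shuffle(pairs):
    """Shuffle: group all values that share the same key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the grouped values for one key into a single result."""
    return key, sum(values)


# Stand-ins for blocks of one file held on different nodes of a cluster.
chunks = ["big data is big", "hadoop stores big data"]

mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
grouped = shuffle(mapped)
counts = dict(reduce_phase(key, values) for key, values in grouped.items())

print(counts)
```

Each chunk can be mapped independently, which is what lets a cluster run the map step in parallel on the nodes that already hold the data.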


Types of Big Data

Big Data could be of three types:

● Structured
● Semi-Structured
● Unstructured
Structured Data

Any data that can be processed, is easily accessible, and can be stored in a fixed format is called structured data. In Big Data, structured data is the easiest to work with because it has highly coordinated measurements that are defined by setting parameters. Structured types of Big Data are:

Overview:
● Highly organized and easily searchable in databases.
● Follows a predefined schema (e.g., rows and columns in a table).
● Typically stored in relational databases (SQL).

Examples:
● Customer information databases (names, addresses, phone numbers).
● Financial data (transactions, account balances).
● Inventory management systems.
● Metadata (data about data).

Merits:
● Easy to analyze and query.
● High consistency and accuracy.
● Efficient storage and retrieval.
● Strong data integrity and validation.

Limitations:
● Limited flexibility (must adhere to a strict schema).
● Scalability issues with very large datasets.
● Less suitable for complex data types.
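A small sketch with Python's built-in sqlite3 module shows both sides of a predefined schema: querying is easy, and rows that break the schema are rejected. The customers table and its contents are invented for the example.

```python
import sqlite3

# In-memory relational database; the schema is declared before any data exists.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE customers (
        id      INTEGER PRIMARY KEY,
        name    TEXT NOT NULL,
        city    TEXT NOT NULL,
        balance REAL NOT NULL CHECK (balance >= 0)
    )
    """
)

# Hypothetical rows that all fit the schema.
rows = [
    (1, "Asha", "Mumbai", 1200.0),
    (2, "Ravi", "Pune", 300.5),
    (3, "Meera", "Mumbai", 50.0),
]
conn.executemany("INSERT INTO customers VALUES (?, ?, ?, ?)", rows)

# Easy to query: filter and sort with plain SQL.
mumbai = conn.execute(
    "SELECT name, balance FROM customers WHERE city = ? ORDER BY balance DESC",
    ("Mumbai",),
).fetchall()
print(mumbai)

# Strict schema: a row with missing required columns is rejected.
try:
    conn.execute("INSERT INTO customers (id, name) VALUES (4, 'NoCity')")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True
print("row without city rejected:", rejected)
```

The rejected insert illustrates the "limited flexibility" point: data must fit the declared columns and constraints before it can be stored.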
Semi-structured Data

● In Big Data, semi-structured data is a combination of both unstructured and structured types of data.
● This form of data has some features of structured data but also contains information that does not adhere to any formal data model or relational database structure.
● Some semi-structured data examples include XML and JSON.

Overview:
● Contains both structured and unstructured elements.
● Lacks a fixed schema but includes tags and markers to separate data elements.
● Often stored in formats like XML, JSON, or NoSQL databases.

Examples:
● JSON files for web APIs.
● XML documents for data interchange.
● Email messages (headers are structured, body can be unstructured).
● HTML pages.

Merits:
● More flexible than structured data.
● Easier to parse and analyze than unstructured data.
● Can handle a wide variety of data types.
● Better suited for hierarchical data.

Limitations:
● More complex to manage than structured data.
● Parsing can be resource-intensive.
● Inconsistent data quality.
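The email example above can be illustrated with Python's standard email package. The message text is made up; the point is that the headers parse into named fields while the body remains free text.

```python
from email import message_from_string

# Hypothetical message: structured headers, then a blank line, then free text.
raw_email = (
    "From: alice@example.com\n"
    "To: bob@example.com\n"
    "Subject: Quarterly report\n"
    "\n"
    "Hi Bob, the numbers look good. Call me when you are free.\n"
)

msg = message_from_string(raw_email)

# Structured part: headers can be looked up by name, like columns.
headers = {name: msg[name] for name in ("From", "To", "Subject")}

# Unstructured part: the body is just text with no predefined fields.
body = msg.get_payload()

print(headers)
print(body)
```

Anything further, such as finding out what the sender is asking for, requires text analysis of the body rather than a simple field lookup.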
Quasi-Structured Data:

Overview:
● Loosely structured data that does not fit neatly into traditional database schemas.
● Contains some organizational properties but lacks a fixed structure.
● Often encountered in large-scale data systems and logs.

Examples:
● Log files (system logs, application logs).
● Clickstream data from web analytics.
● Sensor data streams.

Merits:
● Can provide valuable insights with proper analysis.
● Flexible data format suitable for big data systems.
● Facilitates real-time data processing.
● Capable of capturing a wide range of data types.

Limitations:
● Data extraction and transformation can be challenging.
● Higher storage and processing costs.
● Requires specialized tools for analysis.
Unstructured Data

● Unstructured data in Big Data is data whose format consists of a multitude of unstructured files (images, audio, logs, and video).
● This form of data is classified as intricate data because of its unfamiliar structure and relatively huge size. A stark example of unstructured data is the output returned by ‘Google Search’ or ‘Yahoo Search.’

Overview:
● Data that does not conform to a predefined schema.
● Includes text, multimedia, and other non-tabular data types.
● Stored in data lakes, NoSQL databases, and other flexible storage solutions.

Examples:
● Text documents (Word files, PDFs).
● Multimedia files (images, videos, audio).
● Social media posts.
● Web pages.

Merits:
● Capable of storing vast amounts of diverse data.
● High flexibility in data storage.
● Suitable for complex data types like multimedia.
● Facilitates advanced analytics and machine learning applications.

Limitations:
● Difficult to search and analyze without preprocessing.
● Requires large storage capacities.
● Inconsistent data quality and reliability.
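As one example of the preprocessing that makes free text searchable, the sketch below builds a simple inverted index (word to document ids). The documents, the tokenizer, and the function names are assumptions chosen for illustration; real systems use far more elaborate text processing.

```python
import re
from collections import defaultdict

# Hypothetical free-text documents with no fields or schema.
documents = {
    "post1": "Big Data needs Hadoop!",
    "post2": "Hadoop stores data on commodity hardware.",
    "post3": "Photos and videos are unstructured.",
}


def tokenize(text):
    """Lowercase the text and keep only runs of letters and digits."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Preprocessing step: map every word to the documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index


index = build_index(documents)


def search(term):
    """Look a word up in the index instead of scanning every document."""
    return sorted(index.get(term.lower(), set()))


print(search("hadoop"))
print(search("Data"))
print(search("sql"))
```

Once the index exists, a lookup no longer has to scan the raw text, which is the kind of structure that has to be added to unstructured data before it can be queried efficiently.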
Comparison Table: Structured vs Unstructured vs Semi-Structured Data

Feature          | Structured Data                 | Semi-Structured Data            | Unstructured Data
Schema           | Fixed schema (rows and columns) | Flexible schema (tags, markers) | No fixed schema
Storage          | Relational databases (SQL)      | NoSQL databases, XML, JSON      | Data lakes, NoSQL databases
Searchability    | High                            | Moderate                        | Low
Flexibility      | Low                             | High                            | Very high
Ease of Analysis | Easy                            | Moderate                        | Difficult
Data Types       | Numeric, categorical            | Hierarchical, mixed types       | Text, multimedia, complex types
Scalability      | Moderate                        | High                            | Very high
Common Use Cases | Financial systems, inventory    | Web APIs, email, HTML           | Social media, documents, media
Big Data and Traditional System
Dec 18 [5M]
