Unit 1.1 - Introduction To Big Data Analytics

Big data

Uploaded by

Srimathi mohan

4CAE2BD / 4ITE2BD

Big Data Analytics


Unit 1.1
Introduction to Big Data Analytics
Why is data important?

Data
• Raw facts

Information
• Processed data

Insight / Knowledge
• Deep understanding

Insights are crucial in making business decisions


Digital Data
• Digital data may be
• Internal to the enterprise
• External to the enterprise
• Digital data sources may be
• Homogeneous
• E.g.: Only text
• Heterogeneous
• E.g.: Text + Audio
Classification of Digital Data

Digital Data
Structured

Semi-structured

Unstructured
1. Structured Digital Data
• Data is called structured when it conforms to a pre-defined
schema or structure
• E.g.: RDBMS
• Ease of working with Structured Data
• Add/Mod/Del
• Security
• Indexing
• Scalability
• Transaction processing
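The properties listed above can be illustrated with a small relational example. This is a minimal sketch using Python's built-in sqlite3 module; the employee table, its columns and the sample rows are hypothetical and chosen only for illustration.

```python
import sqlite3

# In-memory database; the "employee" table and its columns are hypothetical.
conn = sqlite3.connect(":memory:")

# Pre-defined schema: every row must conform to these columns and constraints.
conn.execute(
    "CREATE TABLE employee ("
    "id INTEGER PRIMARY KEY, name TEXT NOT NULL, salary REAL NOT NULL)"
)

# Indexing: speeds up look-ups by name.
conn.execute("CREATE INDEX idx_employee_name ON employee (name)")

# Transaction processing: the 'with' block commits on success and
# rolls back if an exception is raised inside it.
with conn:
    conn.execute("INSERT INTO employee VALUES (1, 'Asha', 50000)")   # Add
    conn.execute("INSERT INTO employee VALUES (2, 'Ravi', 42000)")   # Add
    conn.execute("UPDATE employee SET salary = 55000 WHERE id = 1")  # Mod
    conn.execute("DELETE FROM employee WHERE id = 2")                # Del

# A row that violates the schema (name is NULL) is rejected.
try:
    with conn:
        conn.execute("INSERT INTO employee VALUES (3, NULL, 1000)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

rows = conn.execute("SELECT id, name, salary FROM employee").fetchall()
print(rows, rejected)
```

Because the schema is fixed in advance, the database can enforce constraints and build indexes without inspecting each record individually.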
2. Semi-structured Digital Data
• Also referred to as Self-describing Structure
• Data stored using Markup tags
• E.g.:
• HTML – HyperText Markup Language
• XML – eXtensible Markup Language
• JSON – JavaScript Object Notation
• There is no separation between the data and the schema
• Entities belonging to the same class need not have the same
attributes
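The last two points can be seen in a short JSON example. This is a minimal sketch; the two customer records and their field names are hypothetical.

```python
import json

# Two hypothetical "customer" records; field names are illustrative only.
raw = """
[
  {"name": "Asha", "email": "asha@example.com"},
  {"name": "Ravi", "phone": "555-0100", "city": "Chennai"}
]
"""

records = json.loads(raw)

# Each record carries its own field names, so the "schema" travels with
# the data and can differ from one record to the next.
attribute_sets = [sorted(record.keys()) for record in records]

# Only the attributes shared by both records of the same class.
common = set(records[0]) & set(records[1])

print(attribute_sets)
print(common)
```

Both records describe the same kind of entity, yet only the name attribute is common to them.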
3. Unstructured Digital Data
• Data that does not conform to any pre-defined data model
• E.g.:
• Text messages
• Log files on a server
• Email
• Web pages
• Images
• Audio
• Video
• Free-form text
• Social media posts
• Chats
• Documents
How much data is structured?

• Structured Data: 20%
• Unstructured Data: 80%
Introduction to Big Data
• Big Data refers to the massive datasets that are collected from
a variety of data sources for business needs to reveal
new insights for optimized decision-making.
• Big Data Analytics is the result of the growth of 3 major
computational aspects:
• Social Networking: Facebook, Twitter, Instagram, Pinterest…
• Mobile Computing: Using hand-held devices like smartphones, tablets
• Cloud Computing: Highly available data storage and computational facility
Big Data Characteristics (5 Vs)
1. Volume
2. Velocity
3. Variety
4. Veracity
5. Value
1. Volume
• The “big” in big data is a relative term
• E.g.: What is big for a small company may be quite small for a government
• Data sources and the scale of data they produce:
• Real-time data: Petabytes (~10^15 bytes)
• Web: Terabytes (~10^12 bytes)
• CRM: Gigabytes (~10^9 bytes)
• ERP: Megabytes (~10^6 bytes)

• Normally, Volume = length x breadth x height
• Big Data Volume = time x depth x analysis
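To make the byte scales on this slide concrete, the sketch below converts a raw byte count into the largest fitting decimal unit. The function name human_readable and the sample values are assumptions for illustration only.

```python
# Decimal units as used on the slide, largest first.
UNITS = [("PB", 10**15), ("TB", 10**12), ("GB", 10**9), ("MB", 10**6)]


def human_readable(num_bytes: int) -> str:
    """Express a byte count in the largest unit that fits (hypothetical helper)."""
    for name, size in UNITS:
        if num_bytes >= size:
            return f"{num_bytes / size:.1f} {name}"
    return f"{num_bytes} bytes"


print(human_readable(2_500_000_000_000))  # a value in the terabyte range
print(human_readable(500))                # below one megabyte
```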
2. Velocity
• The speed and direction in which real-time data is captured and used
• E.g.:
• Clickstreams and ad impressions capture millions of events per second
• Stock trading algorithms reflect market changes within microseconds
• Sensors generate massive log data in real-time
• Online gaming systems support millions of concurrent users, each producing multiple inputs per second
• Speed of uploading data to social media platforms
3. Variety
• Big Data is not limited to a single data type. It includes:
• Text
• Geospatial data
• 3D data
• Audio
• Video
• Unstructured text like log files, social media text
4. Veracity
• Genuineness of every piece of data has to be verified
5. Value
• For the enterprise, there should be a value attached to each piece of data stored and used
Types of Big Data
• Structured Data
• Traditional databases
• Data Warehouse (Data Mining algorithms are used to derive patterns)
• Semi-structured Data
• XML and RDF (graph)
• Data Streams – ordered sequence of instances that are scanned only once
• E.g.: Telephone conversations, ATM transactions, network traffic, web searches, sensor data
• Unstructured Data
• Emails
• Audio
• Video
• Logs
• Blogs and Forums
• Social media sites
• Clickstreams
• Sensor data
• Mobile App data
• Statistical data
• …
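The data streams item above describes an ordered sequence of instances that is scanned only once. A minimal sketch of what single-pass processing can look like follows; the generator sensor_readings and its values are hypothetical, and the running mean is just one example of a statistic that can be maintained without storing the stream.

```python
from typing import Iterable, Iterator, Tuple


def running_stats(stream: Iterable[float]) -> Iterator[Tuple[int, float, float]]:
    """Yield (count, mean, maximum) after each item, reading the stream once."""
    count = 0
    mean = 0.0
    maximum = float("-inf")
    for value in stream:
        count += 1
        # Incremental mean update: no earlier values need to be kept.
        mean += (value - mean) / count
        maximum = max(maximum, value)
        yield count, mean, maximum


def sensor_readings() -> Iterator[float]:
    """Hypothetical stream of sensor values; each one is seen only once."""
    yield from [20.0, 22.0, 21.0, 25.0]


final = None
for final in running_stats(sensor_readings()):
    pass

print(final)
```

Only three numbers are held in memory at any time, regardless of how long the stream runs.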
Traditional vs. Big Data Approach
Aspect | Traditional Data Analytics | Big Data Analytics
Volume | Can handle only low volumes | Can handle much larger volumes
Velocity | Can handle only low-velocity data | Can handle high-velocity data
Variety | Uses structured and semi-structured data (data is modelled and then stored) | Can handle unstructured data (data is not modelled before it is stored, so no prior decisions need to be made while storing)
Infrastructure for Big Data
• Where is processing hosted?
• Distributed Servers / Cloud
• Where is data stored?
• Distributed Storage on the Cloud
• What is the Programming Model?
• Distributed Processing (MapReduce)
• How is data stored and indexed?
• In high-performance, schema-free databases
• What operations are performed on data?
• Analytical processing
• Semantic processing
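The MapReduce programming model named above can be sketched on a single machine as a word count. This is an illustrative simulation only, not a distributed implementation; the function names map_phase, shuffle and reduce_phase and the sample documents are assumptions.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values of one key into a single result."""
    return key, sum(values)


# Hypothetical input; in a real cluster each document could sit on a different node.
documents = ["big data analytics", "big data systems", "data streams"]

mapped = chain.from_iterable(map_phase(doc) for doc in documents)
word_counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())

print(word_counts)
```

Each map call and each reduce call depends only on its own input, which is what allows a framework to spread them across distributed servers.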
Use of Data Analytics
• To gain insights about hidden patterns and
unknown correlations
• To aid in better decision-making in the short-term
(operational), medium-term (tactical) and long-
term (strategic)
• To help in effective marketing
• To improve customer satisfaction
• To increase revenues
Big Data Challenges
• Right data is not captured
• Costs escalate too fast
• Building data-related business cases
• Requires non-traditional thinking
• Finding the right talent to work with new technologies
• Data access and connectivity
• Security and Privacy concerns about use of data
• Technology is changing very fast
• Requires working across functions like IT, engineering, finance and
procurement, while ownership of data is fragmented across the
organization
Desired Properties of a Big Data System
1. Robust and Fault-tolerant
• Behave correctly when computers go down
2. Low-latency
• Reads and updates should be quick
3. Scalable
• Maintain performance in the face of increasing data load
4. Versatile
• Support a wide range of applications
5. Extensible
• Allow functionality to be added with minimal development cost
• Allow migrations of old data quickly and easily
6. Debuggable
• Provide the information necessary to debug the system when things go wrong
Some Applications of Big Data Systems
• Insurance companies
• To understand likelihood of fraud in claim processing
• Manufacturers and distributors
• To understand supply chain issues earlier so that decisions on logistics can be
taken to avoid additional costs associated with material delays, overstocking
and stock-out conditions
• Service industry (hotels, telecom companies, retailers, restaurants, …)
• To get better clarity on customer needs to build a strong customer base and
loyalty
• Public services (traffic, ambulance, transport, …)
• To optimize their delivery mechanisms
• Smart City
• To make cities more liveable by using data related to sensors, crime, emergency
services, real-estate, energy, financial transactions, scientific data, …
• Clickstream Analytics
• Data includes the pages loaded by the website visitor, time spent on each page,
links clicked, frequency of visit, and the page from which the visitor exits.
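Two of the clickstream measures listed above, time spent on each page and the exit page, can be derived from raw page-view events. The sketch below assumes a hypothetical event format of (visitor id, timestamp in seconds, page); the sample events are made up for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical page-view events: (visitor_id, timestamp_seconds, page).
events = [
    ("v1", 0, "/home"),
    ("v1", 30, "/products"),
    ("v1", 90, "/checkout"),
    ("v2", 10, "/home"),
    ("v2", 25, "/products"),
]

# Group each visitor's page views in time order.
by_visitor = defaultdict(list)
for visitor, ts, page in sorted(events, key=lambda e: (e[0], e[1])):
    by_visitor[visitor].append((ts, page))

time_on_page = Counter()  # total seconds spent per page
exit_pages = Counter()    # how often each page was the last one viewed

for visits in by_visitor.values():
    # Time on a page = gap until the same visitor's next page view.
    for (ts, page), (next_ts, _next_page) in zip(visits, visits[1:]):
        time_on_page[page] += next_ts - ts
    # The last page has no following event, so its duration is unknown here.
    exit_pages[visits[-1][1]] += 1

print(dict(time_on_page))
print(dict(exit_pages))
```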
