Big Data Analytics (CS443) IV B.Tech (IT) 2018-19 I Semester
Textbook: Big Data and Analytics, Seema Acharya and Subhashini Chellappan
Chapter 1
Learning Objectives and Learning Outcomes
Agenda
Semi-structured
Sources of semi-structured data
Unstructured
Sources of unstructured data
Issues with terminology
Dealing with unstructured data
Classification of Digital Data
Structured data
Semi-structured data
Unstructured data
Approximate Percentage Distribution of Digital Data
Structured Data
This is data that is in an organized form (e.g., in rows and columns) and can be easily used by a computer program.
Sources of structured data:
Databases such as Oracle, DB2, Teradata, MySQL, PostgreSQL, etc.
OLTP systems
Ease with Structured Data
Input / Update / Delete (DML)
Security
Transaction processing (ACID)
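A minimal sketch of this ease of use, in Python with the built-in sqlite3 module; the student table and its rows are made up for illustration. DML statements insert, update, and delete rows, and the transaction either commits as a whole or rolls back (the A in ACID).

```python
# Hypothetical 'student' table: DML plus transaction processing with sqlite3.
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.execute("CREATE TABLE student (id INTEGER PRIMARY KEY, name TEXT)")

try:
    with conn:  # commits on success, rolls back on error (atomicity)
        conn.execute("INSERT INTO student (id, name) VALUES (?, ?)", (1, "Asha"))  # Input
        conn.execute("UPDATE student SET name = ? WHERE id = ?", ("Asha R.", 1))   # Update
        conn.execute("DELETE FROM student WHERE id = ?", (99,))                    # Delete
except sqlite3.Error as exc:
    print("transaction rolled back:", exc)

print(conn.execute("SELECT * FROM student").fetchall())  # [(1, 'Asha R.')]
```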
Semi-structured Data
This is data that does not conform to a rigid data model but nevertheless has some structure, e.g., tags or markers that separate semantic elements.
Sources of Semi-structured Data
JSON is used to transmit data between a server and a web application. JSON was popularized by web services built on Representational State Transfer (REST), an architectural style for creating scalable web services.
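As a sketch of what such a JSON payload looks like in practice (the field names are made up), the snippet below parses and re-serializes a record with Python's standard json module; note how each value carries its label and how new attributes can be added without a predefined schema.

```python
# Hypothetical student record transmitted as JSON by a REST service.
import json

payload = '{"id": 101, "name": "Asha", "courses": ["CS443", "CS401"]}'

record = json.loads(payload)            # parse JSON text into a Python dict
print(record["name"])                   # every value carries its label
record["email"] = "asha@example.com"    # attributes can be added on the fly
print(json.dumps(record, indent=2))     # serialize back to JSON text
```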
Characteristics of semi-structured data:
Inconsistent structure
Self-describing (label/value pairs)
Schema information is often blended with the data values
Data objects may have different attributes that are not known beforehand
Unstructured Data
This is data that does not conform to a data model or is not in a form that can be used easily by a computer program.
Sources of Unstructured Data
Web pages
Images
Free-form text
Audio
Video
Body of email
Text messages
Chats
Social media data
Word documents
Issues with terminology – Unstructured Data
Dealing with Unstructured Data
Data mining
Unstructured Information Management Architecture (UIMA)
Dealing with Unstructured Data
Data mining: knowledge discovery in databases. Popular mining algorithms are association rule mining, regression analysis, and collaborative filtering.
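To make the idea concrete, here is a tiny sketch, with made-up transactions, of the support and confidence measures that association rule mining is built on:

```python
# Made-up market-basket transactions for illustrating one association rule.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

def support(itemset):
    """Fraction of transactions containing every item in the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

# Rule {bread} -> {milk}: how often the rule holds when its premise holds.
confidence = support({"bread", "milk"}) / support({"bread"})
print(f"support = {support({'bread', 'milk'}):.2f}, confidence = {confidence:.2f}")
```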
Noisy text analytics: the process of extracting structured or semi-structured information from noisy unstructured data such as chats, blogs, wikis, and emails; the noise includes spelling mistakes, abbreviations, fillers such as "uh" and "hm", and non-standard words.
Manual tagging with metadata: tagging the data manually with adequate metadata to provide the requisite semantics for understanding unstructured data.
Parts-of-speech tagging: POST is the process of reading text and tagging each word in a sentence as belonging to a particular part of speech, such as noun, verb, or adjective.
UIMA: an open-source platform from IBM used for real-time content analytics.
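As one illustration of POST, the sketch below uses the NLTK library (assuming it is installed and its tokenizer and tagger models have been downloaded); it is one common way to tag parts of speech, not the only one.

```python
# Part-of-speech tagging with NLTK (assumes the models are downloadable).
import nltk

nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

tokens = nltk.word_tokenize("Big data needs new tools")
print(nltk.pos_tag(tokens))
# e.g. [('Big', 'JJ'), ('data', 'NNS'), ('needs', 'VBZ'), ('new', 'JJ'), ('tools', 'NNS')]
```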
Answer a few quick questions …
Place the following in a suitable basket:
i. Email  ii. MS Access  iii. Images  iv. Database  v. Chat conversations  vi. Relations/Tables  vii. Facebook  viii. Videos  ix. MS Excel  x. XML
Match the following
Column A: NLP; Text analytics; UIMA; Noisy text mining; Data mining; IBM
Column B: Content analytics; Text messages; Chats; Unstructured data; Comprehend human or natural language input; UIMA
Answer key: 5, 4, 1, 2, 6, 3, 7
Answer Me
Which category (structured, semi-structured, or unstructured) will you place
a Web Page in?
List the various types of digital data.
Structured, semi-structured, and unstructured.
Why is an email placed in the unstructured category?
Because it contains hyperlinks, attachments, videos, images, free-flowing text, etc.
What category will you place CCTV footage into? Unstructured.
You have just got a book issued from the library. What are the details about the book that can be placed in an RDBMS table?
Ans: title, author, publisher, year, number of pages, type of book, price, ISBN, with CD or not.
Which category would you place consumer complaints and feedback in? Unstructured.
Which category (structured, semi-structured, or unstructured) will you place a web page in? Unstructured.
Which category (structured, semi-structured, or unstructured) will you place a PowerPoint presentation in? Unstructured.
Which category (structured, semi-structured, or unstructured) will you place a Word document in? Unstructured.
[Table: evolution of big data; columns: Data Generation Origin, Definition, Information Management Proficiency, Examples]
Evolution of Big Data:
1. 1970s and before: mainframes (the data was primitive and structured)
2. 1980s and 1990s: relational databases (data-intensive applications)
3. 2000s and beyond: the WWW and the IoT have led to an onslaught of structured, unstructured, and multimedia data
The Evolution of Big Data
[Figure: stages from data generation and storage, to data utilization, to data driven; data evolving from complex and unstructured to structured, unstructured, and multimedia data]
What’s Big Data?
No single definition; here is one from Wikipedia:
Big data is the term for a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications.
It is about the three Vs, as in Gartner's definition:
"Big data" is high-volume, high-velocity and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making.
What’s Big Data?
"Big data is high-volume, high-velocity and high-variety information assets" speaks of voluminous data that may have great variety (structured, semi-structured, unstructured) and that requires speed in storage, preparation, processing, and analysis.
"Enhanced insight and decision making" speaks of deriving deeper, richer, and more meaningful insights and then using those insights to make faster and better decisions, gaining business value and a competitive edge.
Data -> Information -> Actionable intelligence -> Better decisions -> Enhanced business value
Challenges with Big Data
The challenges with big data:
1. Data today is growing at an exponential rate. The key questions are: will all this data be useful for analysis, and how do we separate the knowledge from the noise?
2. How to host big data solutions outside the organization's walls (e.g., in the cloud).
3. The period of retention of big data.
4. Dearth of skilled professionals.
5. Challenges with respect to capture, curation, storage, search, sharing, transfer, analysis, privacy violations, and visualization.
6. Shortage of data visualization experts.
What is Big Data
Big data is data that is big in volume, velocity, and variety.
Volume: Bits -> Bytes -> KB -> MB -> GB -> TB -> PB -> Exabytes -> Zettabytes -> Yottabytes
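A quick arithmetic sketch of this ladder: each step is a factor of 1024 (2^10), so the units can be generated programmatically.

```python
# Each unit on the volume ladder is 1024 (2^10) times the previous one.
units = ["bytes", "KB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB"]
for power, unit in enumerate(units):
    print(f"1 {unit} = 2^{power * 10} bytes")
```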
Where does this data get generated?
1. Typical internal data sources: data storage, archives
2. External data sources: public web (Wikipedia, weather, regulatory, compliance data, etc.)
3. Both internal and external sources
What is Big Data
Velocity: Batch -> Periodic -> Near real-time -> Real-time processing
Variety: Structured, semi-structured, and unstructured
Other Vs:
1. Veracity and validity
2. Volatility
3. Variability
TRADITIONAL BUSINESS INTELLIGENCE VS BIG DATA
Traditional BI is about structured data, and the data is taken to the processing functions (move data to code).
Big data, by contrast, is about variety: structured, semi-structured, and unstructured data, and here the processing functions are taken to the data (move code to data).
Big Data Analytics
Typical data warehouse environment
Data sources include CRM, ERP, legacy, and third-party applications. The data is integrated, cleaned, transformed, and standardized through the process of Extraction, Transformation, and Loading (ETL). The resulting data warehouse enables decision making through ad hoc queries, reporting/dashboarding, OLAP, and modeling.
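A minimal ETL sketch in Python; the CSV layout, table, and column names are hypothetical. Rows are extracted from a source export, transformed/standardized, and loaded into a warehouse table.

```python
# Hypothetical sales extract: Extract -> Transform -> Load into a table.
import csv, io, sqlite3

raw = "id,amount\n1, 250 \n2,980\n"   # stand-in for a CRM/ERP export

warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE sales (id INTEGER, amount REAL)")

for row in csv.DictReader(io.StringIO(raw)):                       # Extract
    record = (int(row["id"]), float(row["amount"].strip()))        # Transform
    warehouse.execute("INSERT INTO sales VALUES (?, ?)", record)   # Load
warehouse.commit()

print(warehouse.execute("SELECT * FROM sales").fetchall())  # [(1, 250.0), (2, 980.0)]
```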
Typical Hadoop environment
Data from sources such as the operational data store is placed in the Hadoop Distributed File System (HDFS).
What is Big Data Analytics
Big data analytics is the process of examining big data to uncover patterns, unearth trends, and find unknown correlations and other useful information in order to make faster and better decisions.
What is Big Data Analytics
Big data analytics is:
•Technology-enabled analytics: quite a few data analytics and visualization tools are available
•Richer, deeper insights into customers, partners, and the business
•Competitive advantage
Big Data Analytics is:
•A collaboration of three communities: IT, business users, and data scientists
•Working with data sets whose volume and variety exceed the current storage and processing capabilities and infrastructure of the enterprise
•Moving code to data for greater speed and efficiency
•Better, faster decisions in real time
What is Big Data Analytics
A few top analytics tools are: MS Excel, SAS, IBM SPSS Modeler, R analytics, Statistica, World Programming System (WPS), and Weka. Of these, R analytics and Weka are open source.
What is Big Data Analytics
Classification of analytics: there are basically two schools of thought:
1. Those that classify analytics into basic, operationalized, advanced, and monetized.
2. Those that classify analytics into analytics 1.0, analytics 2.0, and analytics 3.0.
First school of thought:
Basic analytics: primarily slicing and dicing of the data to help with basic business insights; reporting on historical data, basic visualization, etc.
Operationalized analytics: analytics becomes operationalized once it gets woven into the enterprise's business processes.
Advanced analytics: largely about forecasting the future by way of predictive and prescriptive modeling.
Monetized analytics: analytics in use to derive direct business revenue.
Analytics 1.0 (era: 1950s to 2009): descriptive statistics (report on events, occurrences, etc. of the past).
Analytics 2.0 (era: 2005 to 2012): descriptive statistics + predictive statistics (use data from the past to make predictions for the future).
Analytics 3.0 (era: 2012 to present): descriptive statistics + predictive statistics + prescriptive statistics (use data from the past to make prophecies for the future and, at the same time, make recommendations to leverage the situation to one's advantage).
Key questions asked:
Analytics 1.0: What happened? Why did it happen?
Analytics 2.0: What will happen? Why will it happen?
Analytics 3.0: What will happen? When will it happen? Why will it happen? What action should be taken to take advantage of what will happen?
Data sources and storage:
Analytics 1.0: data from legacy systems, ERP, CRM, and third-party applications; small and structured data sources, stored in enterprise data warehouses or data marts.
Analytics 2.0: big data is taken up seriously; data is mainly unstructured and arrives at a much higher pace. This fast flow of big-volume data had to be stored and processed rapidly, often on massively parallel servers running Hadoop.
Analytics 3.0: a blend of big data and data from legacy systems, ERP, CRM, and third-party applications; big data and traditional analytics blended to yield insights and offerings with speed and impact.
Data sourcing and technology:
Analytics 1.0: data was internally sourced; relational databases.
Analytics 2.0: data was often externally sourced; database appliances, Hadoop clusters, SQL-to-Hadoop environments, etc.
Analytics 3.0: data is both internally and externally sourced; in-memory analytics, in-database processing, agile analytical methods, machine learning techniques, etc.
Top Challenges facing Big Data
Scale
Security
Schema
Continuous availability
Consistency
Partition tolerance
Data quality
Top Challenges facing Big Data
Scale: storage (RDBMS or NoSQL) is the major concern that needs to be addressed
Security: poor security mechanisms
Schema: no rigid schema; a dynamic schema is required
Continuous availability: how to provide 24x7 support
Consistency
Partition tolerance
Data quality
Techniques used in Big data environments:
Massively Parallel Processing:
Massively parallel processing (MPP) refers to the coordinated processing of programs by a number of processors working in parallel. The processors each have their own OS and dedicated memory, and they work on different parts of the same program. The MPP processors communicate with one another through some form of messaging interface.
MPP differs from symmetric multiprocessing (SMP), in which the processors share the same OS and the same memory. SMP is also referred to as tightly coupled multiprocessing.
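A toy sketch of the MPP idea using Python's multiprocessing module: each worker process has its own memory and works on its own slice of the job, communicating results only through message passing (a Queue standing in for the messaging interface).

```python
# Four worker processes, each with its own memory, sum different slices
# of the same data set and report back via message passing.
from multiprocessing import Process, Queue

def worker(chunk, out):
    out.put(sum(chunk))  # work on one part of the program's data

if __name__ == "__main__":
    data = list(range(1_000_000))
    out = Queue()
    procs = [Process(target=worker, args=(data[i::4], out)) for i in range(4)]
    for p in procs:
        p.start()
    total = sum(out.get() for _ in procs)  # collect the partial results
    for p in procs:
        p.join()
    print(total == sum(data))  # True
```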
Techniques used in Big data environments:
Shared nothing Architecture:
The three most common types of architecture for multiprocessor systems are:
Shared memory
Shared disk
Shared nothing
In a shared memory architecture, a common central memory is shared by multiple processors. In a shared disk architecture, multiple processors share a common collection of disks while having their own private memory. In a shared nothing architecture, neither memory nor disk is shared among the processors.
Techniques used in Big data environments:
Advantages of shared nothing architecture:
•Fault isolation: a fault in one node is contained to that node, since nodes share neither memory nor disk.
•Scalability: capacity grows by adding nodes, each bringing its own processor, memory, and disk, with no shared resource to contend for.
CAP Theorem:
The CAP theorem is also called Brewer's theorem. It states that in a distributed computing environment, it is impossible to simultaneously provide all three of the following guarantees; at most two can be provided:
Consistency
Availability
Partition tolerance
Consistency implies that every read fetches the last write.
Availability implies that reads and writes always succeed. In other words, each non-failing node returns a response in a reasonable amount of time.
Partition tolerance implies that the system continues to function even when a network partition occurs.
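A toy sketch of the trade-off (the class and its behavior are illustrative, not a real system): under a network partition, a replica must either answer from possibly stale local data, favoring availability, or refuse the request, favoring consistency.

```python
# Illustrative replica showing the consistency-vs-availability choice.
class Replica:
    def __init__(self):
        self.data = {}
        self.partitioned = False  # can this node reach its peers?

    def read(self, key, prefer="availability"):
        if self.partitioned and prefer == "consistency":
            raise RuntimeError("unavailable: cannot confirm the last write")
        return self.data.get(key)  # may be stale during a partition

node = Replica()
node.data["x"] = 1
node.partitioned = True
print(node.read("x"))                     # favors availability: returns 1
try:
    node.read("x", prefer="consistency")  # favors consistency: refuses
except RuntimeError as exc:
    print(exc)
```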
BASE
Definition: what does Basically Available, Soft State, Eventual Consistency (BASE) mean?
Basically Available, Soft State, Eventual Consistency (BASE) is a data system design philosophy that prizes availability over consistency of operations. BASE was developed as an alternative for producing more scalable and affordable data architectures, giving expanding enterprises/IT clients more options than simply acquiring more hardware to expand data operations.
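A toy sketch of BASE behavior (the replica structure is made up): a write is accepted immediately on one replica, the replicas disagree for a while (soft state), and a background synchronization step eventually makes them consistent.

```python
# Two replicas of the same record; writes propagate asynchronously.
replicas = [{"x": 1}, {"x": 1}]

replicas[0]["x"] = 2                        # write accepted at once (basically available)
print(replicas[0]["x"], replicas[1]["x"])   # 2 1 -> soft state, temporarily inconsistent

def anti_entropy(reps):
    """Stand-in for background replication between nodes."""
    latest = reps[0].copy()
    for r in reps:
        r.update(latest)

anti_entropy(replicas)
print(replicas[0]["x"], replicas[1]["x"])   # 2 2 -> eventually consistent
```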