Big Data Analytics (CS443) IV B.Tech (IT) 2018-19 I Semester
Textbook: Big Data and Analytics, Seema Acharya and Subhashini Chellappan
Chapter 1
Learning Objectives and Learning Outcomes
Agenda
Semi-structured
Sources of semi-structured data
Unstructured
Sources of unstructured data
Issues with terminology
Dealing with unstructured data
Classification of Digital Data
Structured data
Semi-structured data
Unstructured data
Approximate Percentage Distribution of Digital Data
Structured Data
This is data that is in an organized form (e.g., in rows and columns) and can be easily used by a computer program.
Sources of structured data:
Databases such as Oracle, DB2, Teradata, MySQL, PostgreSQL, etc.
OLTP systems
Ease with Structured Data
Input / Update / Delete (DML)
Security
Transaction processing (ACID)
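A minimal sketch of this ease of use, in Python with the built-in sqlite3 module; the student table and its rows are made up for illustration. DML statements insert, update, and delete rows, and the transaction either commits as a whole or rolls back (the A in ACID).

```python
# Hypothetical 'student' table: DML plus transaction processing with sqlite3.
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.execute("CREATE TABLE student (id INTEGER PRIMARY KEY, name TEXT)")

try:
    with conn:  # commits on success, rolls back on error (atomicity)
        conn.execute("INSERT INTO student (id, name) VALUES (?, ?)", (1, "Asha"))  # Input
        conn.execute("UPDATE student SET name = ? WHERE id = ?", ("Asha R.", 1))   # Update
        conn.execute("DELETE FROM student WHERE id = ?", (99,))                    # Delete
except sqlite3.Error as exc:
    print("transaction rolled back:", exc)

print(conn.execute("SELECT * FROM student").fetchall())  # [(1, 'Asha R.')]
```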
Semi-structured Data
This is data that does not conform to a rigid data model but nevertheless has some structure, e.g., tags or markers that separate semantic elements.
Sources of Semi-structured Data
JSON is used to transmit data between a server and a web application. JSON was popularized by web services built on Representational State Transfer (REST), an architectural style for creating scalable web services.
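As a sketch of what such a JSON payload looks like in practice (the field names are made up), the snippet below parses and re-serializes a record with Python's standard json module; note how each value carries its label and how new attributes can be added without a predefined schema.

```python
# Hypothetical student record transmitted as JSON by a REST service.
import json

payload = '{"id": 101, "name": "Asha", "courses": ["CS443", "CS401"]}'

record = json.loads(payload)            # parse JSON text into a Python dict
print(record["name"])                   # every value carries its label
record["email"] = "asha@example.com"    # attributes can be added on the fly
print(json.dumps(record, indent=2))     # serialize back to JSON text
```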
Characteristics of semi-structured data:
Inconsistent structure
Self-describing (label/value pairs)
Schema information is often blended with the data values
Data objects may have different attributes that are not known beforehand
Unstructured Data
This is data that does not conform to a data model or is not in a form that can be used easily by a computer program.
Sources of Unstructured Data
Web pages
Images
Free-form text
Audio
Video
Body of email
Text messages
Chats
Social media data
Word documents
Issues with terminology – Unstructured Data
Dealing with Unstructured Data
Data mining
Unstructured Information Management Architecture (UIMA)
Dealing with Unstructured Data
Data mining: knowledge discovery in databases. Popular mining algorithms are association rule mining, regression analysis, and collaborative filtering.
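To make the idea concrete, here is a tiny sketch, with made-up transactions, of the support and confidence measures that association rule mining is built on:

```python
# Made-up market-basket transactions for illustrating one association rule.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

def support(itemset):
    """Fraction of transactions containing every item in the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

# Rule {bread} -> {milk}: how often the rule holds when its premise holds.
confidence = support({"bread", "milk"}) / support({"bread"})
print(f"support = {support({'bread', 'milk'}):.2f}, confidence = {confidence:.2f}")
```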
Noisy text analytics: the process of extracting structured or semi-structured information from noisy unstructured data such as chats, blogs, wikis, and emails; the noise includes spelling mistakes, abbreviations, fillers such as "uh" and "hm", and non-standard words.
Manual tagging with metadata: tagging the data manually with adequate metadata to provide the requisite semantics for understanding unstructured data.
Parts-of-speech tagging: POST is the process of reading text and tagging each word in a sentence as belonging to a particular part of speech, such as noun, verb, or adjective.
UIMA: an open-source platform from IBM used for real-time content analytics.
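As one illustration of POST, the sketch below uses the NLTK library (assuming it is installed and its tokenizer and tagger models have been downloaded); it is one common way to tag parts of speech, not the only one.

```python
# Part-of-speech tagging with NLTK (assumes the models are downloadable).
import nltk

nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

tokens = nltk.word_tokenize("Big data needs new tools")
print(nltk.pos_tag(tokens))
# e.g. [('Big', 'JJ'), ('data', 'NNS'), ('needs', 'VBZ'), ('new', 'JJ'), ('tools', 'NNS')]
```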
Answer a few quick questions …
Place the following in a suitable basket:
i. Email  ii. MS Access  iii. Images  iv. Database  v. Chat conversations  vi. Relations/Tables  vii. Facebook  viii. Videos  ix. MS Excel  x. XML
Match the following
Column A: NLP; Text analytics; UIMA; Noisy text mining; Data mining; IBM
Column B: Content analytics; Text messages; Chats; Unstructured data; Comprehend human or natural language input; UIMA
Answer key: 5, 4, 1, 2, 6, 3, 7
Answer Me
Which category (structured, semi-structured, or unstructured) will you place
a Web Page in?
List the various types of digital data.
Structured, semi-structured, and unstructured.
Why is an email placed in the unstructured category?
Because it contains hyperlinks, attachments, videos, images, free-flowing text, etc.
What category will you place CCTV footage into? Unstructured.
You have just got a book issued from the library. What are the details about the book that can be placed in an RDBMS table?
Ans: title, author, publisher, year, number of pages, type of book, price, ISBN, with CD or not.
Which category would you place consumer complaints and feedback in? Unstructured.
Which category (structured, semi-structured, or unstructured) will you place a web page in? Unstructured.
Which category (structured, semi-structured, or unstructured) will you place a PowerPoint presentation in? Unstructured.
Which category (structured, semi-structured, or unstructured) will you place a Word document in? Unstructured.
[Table: evolution of big data; columns: Data Generation Origin, Definition, Information Management Proficiency, Examples]
Evolution of Big Data:
1. 1970s and before: mainframes (the data was primitive and structured)
2. 1980s and 1990s: relational databases (data-intensive applications)
3. 2000s and beyond: the WWW and the IoT have led to an onslaught of structured, unstructured, and multimedia data
The Evolution of Big Data
[Figure: stages from data generation and storage, to data utilization, to data driven; data evolving from complex and unstructured to structured, unstructured, and multimedia data]
What’s Big Data?
No single definition; here is one from Wikipedia:
Big data is the term for a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications.
It is about the three Vs, as in Gartner's definition:
"Big data" is high-volume, high-velocity and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making.
What’s Big Data?
"Big data is high-volume, high-velocity and high-variety information assets" speaks of voluminous data that may have great variety (structured, semi-structured, unstructured) and that requires speed in storage, preparation, processing, and analysis.
"Enhanced insight and decision making" speaks of deriving deeper, richer, and more meaningful insights and then using those insights to make faster and better decisions, gaining business value and a competitive edge.
Data -> Information -> Actionable intelligence -> Better decisions -> Enhanced business value
Challenges with Big Data
The challenges with big data:
1. Data today is growing at an exponential rate. The key questions are: will all this data be useful for analysis, and how do we separate the knowledge from the noise?
2. How to host big data solutions outside the organization's walls (e.g., in the cloud).
3. The period of retention of big data.
4. Dearth of skilled professionals.
5. Challenges with respect to capture, curation, storage, search, sharing, transfer, analysis, privacy violations, and visualization.
6. Shortage of data visualization experts.
What is Big Data
Big data is data that is big in volume, velocity, and variety.
Volume: Bits -> Bytes -> KB -> MB -> GB -> TB -> PB -> Exabytes -> Zettabytes -> Yottabytes
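A quick arithmetic sketch of this ladder: each step is a factor of 1024 (2^10), so the units can be generated programmatically.

```python
# Each unit on the volume ladder is 1024 (2^10) times the previous one.
units = ["bytes", "KB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB"]
for power, unit in enumerate(units):
    print(f"1 {unit} = 2^{power * 10} bytes")
```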
Where does this data get generated?
1. Typical internal data sources: data storage, archives
2. External data sources: public web (Wikipedia, weather, regulatory, compliance data, etc.)
3. Both internal and external sources
What is Big Data
Velocity: Batch -> Periodic -> Near real-time -> Real-time processing
Variety: Structured, semi-structured, and unstructured
Other Vs:
1. Veracity and validity
2. Volatility
3. Variability
TRADITIONAL BUSINESS INTELLIGENCE VS BIG DATA
Traditional BI is about structured data, and the data is taken to the processing functions (move data to code).
Big data, by contrast, is about variety: structured, semi-structured, and unstructured data, and here the processing functions are taken to the data (move code to data).
Big Data Analytics
Typical data warehouse environment
Data sources include CRM, ERP, legacy, and third-party applications. The data is integrated, cleaned, transformed, and standardized through the process of Extraction, Transformation, and Loading (ETL). The resulting data warehouse enables decision making through ad hoc queries, reporting/dashboarding, OLAP, and modeling.
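A minimal ETL sketch in Python; the CSV layout, table, and column names are hypothetical. Rows are extracted from a source export, transformed/standardized, and loaded into a warehouse table.

```python
# Hypothetical sales extract: Extract -> Transform -> Load into a table.
import csv, io, sqlite3

raw = "id,amount\n1, 250 \n2,980\n"   # stand-in for a CRM/ERP export

warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE sales (id INTEGER, amount REAL)")

for row in csv.DictReader(io.StringIO(raw)):                       # Extract
    record = (int(row["id"]), float(row["amount"].strip()))        # Transform
    warehouse.execute("INSERT INTO sales VALUES (?, ?)", record)   # Load
warehouse.commit()

print(warehouse.execute("SELECT * FROM sales").fetchall())  # [(1, 250.0), (2, 980.0)]
```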
Typical Hadoop environment
Data from sources such as the operational data store is placed in the Hadoop Distributed File System (HDFS).
What is Big Data Analytics
Big data analytics is the process of examining big data to uncover patterns, unearth trends, and find unknown correlations and other useful information in order to make faster and better decisions.
What is Big Data Analytics
Big data analytics is:
•Technology-enabled analytics: quite a few data analytics and visualization tools are available
•Richer, deeper insights into customers, partners, and the business
•Competitive advantage
Big Data Analytics is:
•A collaboration of three communities: IT, business users, and data scientists
•Working with data sets whose volume and variety exceed the current storage and processing capabilities and infrastructure of the enterprise
•Moving code to data for greater speed and efficiency
•Better, faster decisions in real time
What is Big Data Analytics
A few top analytics tools are: MS Excel, SAS, IBM SPSS Modeler, R analytics, Statistica, World Programming System (WPS), and Weka. Of these, R analytics and Weka are open source.
What is Big Data Analytics
Classification of analytics: there are basically two schools of thought:
1. Those that classify analytics into basic, operationalized, advanced, and monetized.
2. Those that classify analytics into analytics 1.0, analytics 2.0, and analytics 3.0.
First school of thought:
Basic analytics: primarily slicing and dicing of the data to help with basic business insights; reporting on historical data, basic visualization, etc.
Operationalized analytics: analytics becomes operationalized once it gets woven into the enterprise's business processes.
Advanced analytics: largely about forecasting the future by way of predictive and prescriptive modeling.
Monetized analytics: analytics in use to derive direct business revenue.
Analytics 1.0 (era: 1950s to 2009): descriptive statistics (report on events, occurrences, etc. of the past).
Analytics 2.0 (era: 2005 to 2012): descriptive statistics + predictive statistics (use data from the past to make predictions for the future).
Analytics 3.0 (era: 2012 to present): descriptive statistics + predictive statistics + prescriptive statistics (use data from the past to make prophecies for the future and, at the same time, make recommendations to leverage the situation to one's advantage).
Key questions asked:
Analytics 1.0: What happened? Why did it happen?
Analytics 2.0: What will happen? Why will it happen?
Analytics 3.0: What will happen? When will it happen? Why will it happen? What action should be taken to take advantage of what will happen?
Data sources and storage:
Analytics 1.0: data from legacy systems, ERP, CRM, and third-party applications; small and structured data sources, stored in enterprise data warehouses or data marts.
Analytics 2.0: big data is taken up seriously; data is mainly unstructured and arrives at a much higher pace. This fast flow of big-volume data had to be stored and processed rapidly, often on massively parallel servers running Hadoop.
Analytics 3.0: a blend of big data and data from legacy systems, ERP, CRM, and third-party applications; big data and traditional analytics blended to yield insights and offerings with speed and impact.
Data sourcing and technology:
Analytics 1.0: data was internally sourced; relational databases.
Analytics 2.0: data was often externally sourced; database appliances, Hadoop clusters, SQL-to-Hadoop environments, etc.
Analytics 3.0: data is both internally and externally sourced; in-memory analytics, in-database processing, agile analytical methods, machine learning techniques, etc.
Top Challenges facing Big Data
Scale
Security
Schema
Continuous availability
Consistency
Partition tolerance
Data quality
Top Challenges facing Big Data
Scale: storage (RDBMS or NoSQL) is the major concern that needs to be addressed
Security: poor security mechanisms
Schema: no rigid schema; a dynamic schema is required
Continuous availability: how to provide 24x7 support
Consistency
Partition tolerance
Data quality
Techniques used in Big data environments:
Massively Parallel Processing:
Massively parallel processing (MPP) refers to the coordinated processing of programs by a number of processors working in parallel. The processors each have their own OS and dedicated memory, and they work on different parts of the same program. The MPP processors communicate with one another through some form of messaging interface.
MPP differs from symmetric multiprocessing (SMP), in which the processors share the same OS and the same memory. SMP is also referred to as tightly coupled multiprocessing.
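A toy sketch of the MPP idea using Python's multiprocessing module: each worker process has its own memory and works on its own slice of the job, communicating results only through message passing (a Queue standing in for the messaging interface).

```python
# Four worker processes, each with its own memory, sum different slices
# of the same data set and report back via message passing.
from multiprocessing import Process, Queue

def worker(chunk, out):
    out.put(sum(chunk))  # work on one part of the program's data

if __name__ == "__main__":
    data = list(range(1_000_000))
    out = Queue()
    procs = [Process(target=worker, args=(data[i::4], out)) for i in range(4)]
    for p in procs:
        p.start()
    total = sum(out.get() for _ in procs)  # collect the partial results
    for p in procs:
        p.join()
    print(total == sum(data))  # True
```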
Techniques used in Big data environments:
Shared nothing Architecture:
The three most common types of architecture for multiprocessor systems are:
Shared memory
Shared disk
Shared nothing
In a shared memory architecture, a common central memory is shared by multiple processors. In a shared disk architecture, multiple processors share a common collection of disks while having their own private memory. In a shared nothing architecture, neither memory nor disk is shared among the processors.
Techniques used in Big data environments:
Advantages of shared nothing architecture:
•Fault isolation: a fault in one node is contained to that node, since nodes share neither memory nor disk.
•Scalability: capacity grows by adding nodes, each bringing its own processor, memory, and disk, with no shared resource to contend for.
CAP Theorem:
The CAP theorem is also called Brewer's theorem. It states that in a distributed computing environment, it is impossible to simultaneously provide all three of the following guarantees; at most two can be provided:
Consistency
Availability
Partition tolerance
Consistency implies that every read fetches the last write.
Availability implies that reads and writes always succeed. In other words, each non-failing node returns a response in a reasonable amount of time.
Partition tolerance implies that the system continues to function even when a network partition occurs.
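A toy sketch of the trade-off (the class and its behavior are illustrative, not a real system): under a network partition, a replica must either answer from possibly stale local data, favoring availability, or refuse the request, favoring consistency.

```python
# Illustrative replica showing the consistency-vs-availability choice.
class Replica:
    def __init__(self):
        self.data = {}
        self.partitioned = False  # can this node reach its peers?

    def read(self, key, prefer="availability"):
        if self.partitioned and prefer == "consistency":
            raise RuntimeError("unavailable: cannot confirm the last write")
        return self.data.get(key)  # may be stale during a partition

node = Replica()
node.data["x"] = 1
node.partitioned = True
print(node.read("x"))                     # favors availability: returns 1
try:
    node.read("x", prefer="consistency")  # favors consistency: refuses
except RuntimeError as exc:
    print(exc)
```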
BASE
Definition: what does Basically Available, Soft State, Eventual Consistency (BASE) mean?
Basically Available, Soft State, Eventual Consistency (BASE) is a data system design philosophy that prizes availability over consistency of operations. BASE was developed as an alternative for producing more scalable and affordable data architectures, giving expanding enterprises/IT clients more options than simply acquiring more hardware to expand data operations.
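A toy sketch of BASE behavior (the replica structure is made up): a write is accepted immediately on one replica, the replicas disagree for a while (soft state), and a background synchronization step eventually makes them consistent.

```python
# Two replicas of the same record; writes propagate asynchronously.
replicas = [{"x": 1}, {"x": 1}]

replicas[0]["x"] = 2                        # write accepted at once (basically available)
print(replicas[0]["x"], replicas[1]["x"])   # 2 1 -> soft state, temporarily inconsistent

def anti_entropy(reps):
    """Stand-in for background replication between nodes."""
    latest = reps[0].copy()
    for r in reps:
        r.update(latest)

anti_entropy(replicas)
print(replicas[0]["x"], replicas[1]["x"])   # 2 2 -> eventually consistent
```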