Data Science Class 2

Structured data follows a consistent data model and structure, while unstructured data does not conform to any model. Semi-structured data has some structure, but its metadata is not sufficient to fully describe it. Common types of structured data include databases and spreadsheets, while XML, JSON, and HTML are examples of semi-structured data. Most of an organization's data is unstructured, such as text, images, and videos, and requires techniques like data mining, natural language processing, and text analytics to analyze.


DATA SCIENCE (IT258M)

Types of data
• Digital data is classified into the following categories:

• Structured data
• Semi-structured data
• Unstructured data
Structured Data
• It conforms to a dedicated data model.

• It has a well-defined structure, follows a consistent order, and is designed so that it can be easily accessed and used by a person or a computer.

• Structured data is usually stored in well-defined rows and columns in databases.

• Examples: DBMS, RDBMS


• Sources of structured data

• Databases: Oracle (Oracle Corp.), DB2 (IBM), SQL Server (Microsoft), Greenplum (EMC), Teradata (Teradata), MySQL, PostgreSQL

• Spreadsheets: MS Excel, Google Sheets

• On-Line Transaction Processing (OLTP) systems


• Ease of Working with Structured Data

• Insert/update/delete: Data Manipulation Language (DML) operations provide the required ease of data input, storage, access, processing, and analysis (see the sketch after this list).
• Security: Encryption and tokenization solutions are available to safeguard information throughout its lifecycle.
• Indexing: An index is a data structure that speeds up data retrieval operations (primarily the SELECT DML statement) at the cost of additional writes and storage space.
• Scalability: The storage and processing capabilities of a traditional RDBMS can be easily scaled up by increasing the horsepower of the database server.
• Transaction processing: RDBMS supports the Atomicity, Consistency, Isolation, and Durability (ACID) properties of transactions.
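
A minimal sketch of DML and indexing using Python's built-in sqlite3 module (the table, columns, and values are illustrative, not from the slides):

import sqlite3

# In-memory database with a well-defined schema (structured data)
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE employees (id INTEGER PRIMARY KEY, name TEXT, salary REAL)")

# DML: insert, update, delete
cur.execute("INSERT INTO employees (name, salary) VALUES (?, ?)", ("Alice", 50000.0))
cur.execute("INSERT INTO employees (name, salary) VALUES (?, ?)", ("Bob", 45000.0))
cur.execute("UPDATE employees SET salary = ? WHERE name = ?", (52000.0, "Alice"))
cur.execute("DELETE FROM employees WHERE name = ?", ("Bob",))

# Indexing: speeds up SELECT lookups on name at the cost of extra writes/storage
cur.execute("CREATE INDEX idx_employees_name ON employees (name)")
print(cur.execute("SELECT * FROM employees WHERE name = ?", ("Alice",)).fetchall())
conn.close()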
Semi-Structured Data
• The data does not fully conform to a data model but has some structure.

• Examples: XML, markup languages like HTML, etc. Metadata for this data is available but is not sufficient.

• Semi-structured data is also referred to as having a self-describing structure.
Features

• It does not conform to the data models that one typically associates with relational databases or any other form of data tables.
• It uses tags to segregate semantic elements.
• Tags are also used to enforce hierarchies of records and fields within data.
• There is no separation between the data and the schema. The amount of structure used is dictated by the purpose at hand.
• In semi-structured data, entities belonging to the same class and grouped together need not have the same set of attributes.
Sources of Semi-structured data
• XML: eXtensible Markup Language (XML) was hugely popularized by web services developed on the Simple Object Access Protocol (SOAP) principles.

• JSON: JavaScript Object Notation (JSON) is used to transmit data between a server and a web application using the REST architecture.

• MongoDB and Couchbase (originally known as Membase) store data natively in a JSON-like format.
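
A small illustration of self-describing, semi-structured data using Python's standard json module (the records are invented for the example): the keys act as tags carrying the semantics, and two records of the same class need not share the same attributes.

import json

# Two "person" records: same class, different attribute sets (semi-structured)
raw = '''
[
  {"name": "Alice", "email": "alice@example.com"},
  {"name": "Bob", "phones": ["555-0100", "555-0101"], "city": "Bengaluru"}
]
'''
people = json.loads(raw)
for person in people:
    # The schema travels with the data: each record describes itself
    print(person["name"], "->", sorted(person.keys()))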
Unstructured Data
• Unstructured data does not conform to a data model, or it is not in a form which can be used easily by a computer program.

• It neither has a structure nor follows the formal structural rules of data models.

• It does not even have a consistent format, and it is found to vary all the time.

• About 80–90% of an organization's data is in this format.

• Sources of Unstructured Data
  – Web pages, images, free-form text, audio, video, body of email, text messages, chats, social media data, Word documents

Dealing with Unstructured Data

• Data Mining
• Natural Language Processing (NLP)
• Text Analytics
• Noisy Text Analytics
• Data Mining:

• First, we deal with large data sets.

• Second, we use methods at the intersection of artificial intelligence, machine learning, statistics, and database systems to unearth consistent patterns in large data sets and/or systematic relationships between variables.

• A few popular data mining algorithms are as follows (a small regression sketch follows this list):

  • Association rule mining
  • Regression analysis
  • Collaborative filtering
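
As a taste of one of these algorithms, here is a minimal ordinary least-squares linear regression in plain Python (the data points are invented for illustration):

# Fit y = a + b*x by ordinary least squares
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 8.0, 9.8]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Slope b = covariance(x, y) / variance(x); intercept a = mean_y - b*mean_x
b = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
     / sum((x - mean_x) ** 2 for x in xs))
a = mean_y - b * mean_x
print(f"fitted model: y = {a:.2f} + {b:.2f} * x")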
• Natural Language Processing (NLP):
• It is related to the area of human-computer interaction.

• It is about enabling computers to understand human or natural language input.

• Text Analytics or Text Mining:

• Text mining is the process of gleaning high-quality and meaningful information (through the devising of patterns and trends by means of statistical pattern learning) from text.

• It includes tasks such as text categorization, text clustering, sentiment analysis, concept/entity extraction, etc. (a toy categorization sketch follows).
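
A toy keyword-based text categorization sketch in Python (the categories and keyword lists are invented for illustration; real text mining would learn such patterns statistically):

# Naive keyword-count categorizer: assign the category whose keywords
# overlap most with the words of the document
CATEGORIES = {
    "sports": {"match", "score", "team", "goal"},
    "finance": {"stock", "market", "price", "shares"},
}

def categorize(text):
    words = set(text.lower().split())
    # Pick the category with the largest keyword overlap
    return max(CATEGORIES, key=lambda c: len(CATEGORIES[c] & words))

print(categorize("The team celebrated after the final goal of the match"))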
• Noisy Text Analytics:

• It is the process of extracting structured or semi-structured information from noisy unstructured data such as chats, blogs, wikis, emails, message boards, text messages, etc.
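
For instance, a minimal sketch of pulling structured fields (email addresses and dates) out of a noisy chat message with Python's re module (the message is invented):

import re

chat = "hey ping me at ravi.k@example.com b4 2024-03-15 thx!!"

# Extract structured fields from noisy free-form text
emails = re.findall(r"[\w.+-]+@[\w-]+\.[\w.]+", chat)
dates = re.findall(r"\d{4}-\d{2}-\d{2}", chat)
print({"emails": emails, "dates": dates})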
Qualitative vs Quantitative Data

Qualitative Data:
• Deals with descriptions.
• Data can be observed but not measured.
• Examples: colors, textures, smells, tastes, appearance, beauty, etc.
• Qualitative → Quality

Quantitative Data:
• Deals with numbers.
• Data which can be measured.
• Examples: length, height, area, volume, weight, speed, time, temperature, humidity, sound levels, cost, members, ages, etc.
• Quantitative → Quantity
• Characteristics of Data

• Composition: The composition of data deals with the structure of data, that is, the sources of data, the granularity, the types, and the nature of data as to whether it is static or real-time streaming.
• Condition: The condition of data deals with the state of
data, that is, “Can one use this data as is for analysis?” or
“Does it require cleansing for further enhancement and
enrichment?”
• Context: The context of data deals with “Where has this
data been generated?” “Why was this data generated?” “How
sensitive is this data?” “What are the events associated with
this data?” and so on
Evolution of Big Data

Common Eras of Evolution

• The 1970s and before was the era of mainframes. The data was essentially primitive and structured.

• Relational databases evolved in the 1980s and 1990s. This was the era of data-intensive applications.

• 2000 and beyond: The World Wide Web (WWW) and the Internet of Things (IoT) have led to an onslaught of structured, unstructured, and multimedia data.
• Characteristics of Big Data

• Big data characteristics are the terms that describe the remarkable potential of big data.

• In the early stages of big data's development, only 3 V's (Volume, Variety, Velocity) were considered potential characteristics.

• But ever-growing technology and tools, and the variety of sources from which information is received, have expanded these 3 V's into 5 V's, and the list is still evolving.
• The 5 V’s are

• Volume
• Variety
• Velocity
• Veracity
• Value
• Volume:

• Volume refers to the unimaginable amounts of information generated every second.

• This information comes from a variety of sources like social media, cell phones, sensors, financial records, the stock market, etc.
• Variety:
• Variety refers to the many types of data that are available.

• A reason for the rapid growth of data volume is that the data comes from different sources in various formats.

• Big data extends beyond structured data to include unstructured data of all varieties: text, sensor data, audio, video, click streams, log files, and more.

• The variety of data is categorized as follows:
  – Structured: RDBMS
  – Semi-structured: XML, HTML, RDF, JSON
  – Unstructured: text, audio, video, logs, images
• Velocity:

• Velocity essentially refers to the speed at which data is being created in real time.

• It is the fast rate at which data is received and (perhaps) acted on.

• In other words, it is the speed at which data is generated and processed to meet the demands and challenges that lie in the path of growth and development.
• Veracity:

• Data veracity, in general, is how accurate or truthful a data set may be.

• More specifically, when it comes to the accuracy of big data, it is not just the quality of the data itself but how trustworthy the data's source, type, and processing are.
• Value:

• Value is the major issue that we need to concentrate on.

• It is not just the amount of data that we store or process.

• It is actually the amount of valuable, reliable, and trustworthy data that needs to be stored, processed, and analyzed to find insights.

• Mine the data, i.e., turn raw data into useful data. Value represents the benefits of data to your business, such as finding insights and results that were not possible earlier.
STATISTICS

• Descriptive Statistics
– Frequencies & percentages
– Means & standard deviations
• Inferential Statistics
– Correlation
– T-tests
– Chi-square
– Logistic Regression
Descriptive Statistics

Descriptive statistics can be used to summarize and describe a single variable (UNIvariate). A short computational sketch of both techniques follows this list.

• Frequencies (counts) & Percentages
  – Use with categorical (nominal) data
    • Levels, types, groupings, yes/no, Drug A vs. Drug B

• Means & Standard Deviations
  – Use with continuous (interval/ratio) data
    • Height, weight, cholesterol, scores on a test
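
A quick sketch of both ideas with Python's standard library (the sample data is invented):

from collections import Counter
from statistics import mean, stdev

# Categorical (nominal) data: frequencies and percentages
groups = ["Drug A", "Drug B", "Drug A", "Drug A", "Drug B"]
counts = Counter(groups)
for group, count in counts.items():
    print(f"{group}: n={count} ({100 * count / len(groups):.0f}%)")

# Continuous (interval/ratio) data: mean and standard deviation
heights_cm = [162.0, 175.5, 158.2, 181.3, 169.9]
print(f"mean={mean(heights_cm):.1f} cm, sd={stdev(heights_cm):.1f} cm")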
Frequencies & Percentages
Look at the different ways we can display frequencies and percentages for this data:

[Figure: the same data displayed as a pie chart, a frequency-distribution table, and a bar chart. Tables and bar charts work well with more than 20 observations.]
Distributions
The distribution of scores or values can also be displayed using box-and-whiskers plots and histograms.
Continuous → Categorical

It is possible to take continuous data (such as hemoglobin levels) and turn it into categorical data by grouping values together. Then we can calculate frequencies and percentages for each group, as in the sketch below.
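
A minimal binning sketch in Python (the hemoglobin values and cut-offs are invented for illustration, not clinical guidance):

from collections import Counter

# Invented example values (g/dL) and illustrative cut-offs
hemoglobin = [10.2, 13.5, 11.8, 15.1, 9.4, 12.9, 14.2]

def to_category(value):
    if value < 11.0:
        return "low"
    elif value <= 14.0:
        return "normal"
    return "high"

# Continuous -> categorical, then frequencies and percentages per group
counts = Counter(to_category(v) for v in hemoglobin)
for group, count in sorted(counts.items()):
    print(f"{group}: n={count} ({100 * count / len(hemoglobin):.0f}%)")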
Continuous → Categorical

[Figure: distribution of Glasgow Coma Scale scores. Even though this is continuous data, it is treated as "nominal" because it is broken down into groups or categories.]

Tip: It is usually better to collect continuous data and then break it down into categories for data analysis, as opposed to collecting data that fits into preconceived categories.
Ordinal Level Data
Ordinal data is a categorical, statistical data type where the
variables have natural, ordered categories and the distances
between the categories are not known.

Frequencies and percentages can be computed for ordinal data.
– Examples: Likert scales (Strongly Disagree to Strongly Agree); High School / Some College / College Graduate / Graduate School

[Figure: bar chart of Likert-scale responses ranging from Strongly Agree to Strongly Disagree.]
Interval/Ratio Data
• Ratio data has a defined zero point.
• Interval data lacks an absolute zero point, which makes direct comparisons of magnitude impossible (e.g., saying A is twice as large as B).

We can compute frequencies and percentages for interval- and ratio-level data as well.
– Examples: age, temperature, height, weight, many clinical serum levels

[Figure: distribution of Injury Severity Score in a population of patients.]
Interval/Ratio Distributions
The distribution of interval/ratio data often forms a "bell-shaped" curve.
– Many phenomena in life are normally distributed (age, height, weight, IQ).
Interval & Ratio Data
Measures of central tendency and measures of dispersion are often computed with interval/ratio data (see the sketch after this list).

• Measures of Central Tendency (aka the "middle point")
  – Mean, median, mode
  – If your frequency distribution shows outliers, you might want to use the median instead of the mean

• Measures of Dispersion (aka how "spread out" the data are)
  – Variance, standard deviation, standard error of the mean
  – Describe how "spread out" a distribution of scores is
  – High values for variance and standard deviation may mean that scores are "all over the place" and do not necessarily fall close to the mean

In research, means are usually presented along with standard deviations or standard errors.
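
A compact sketch of these measures with Python's statistics module (the sample scores are invented):

from math import sqrt
from statistics import mean, median, mode, stdev, variance

scores = [72, 85, 78, 90, 85, 64, 88]

# Measures of central tendency
print("mean:", round(mean(scores), 1))
print("median:", median(scores))  # prefer the median when outliers are present
print("mode:", mode(scores))

# Measures of dispersion
print("variance:", round(variance(scores), 1))
sd = stdev(scores)
print("sd:", round(sd, 1))
print("sem:", round(sd / sqrt(len(scores)), 1))  # standard error of the mean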
