Data Science Class 2

Structured data follows a consistent data model and structure, while unstructured data does not conform to any model. Semi-structured data has some structure, but its metadata is not sufficient to fully describe it. Common types of structured data include databases and spreadsheets, while XML, JSON, and HTML are examples of semi-structured data. Most of an organization's data is unstructured, such as text, images, and videos, and requires techniques like data mining, natural language processing, and text analytics to analyze.


DATA SCIENCE (IT258M)

Types of data
• Digital data is classified into the following categories:

• Structured data
• Semi-structured data
• Unstructured data
Structured Data
• It conforms to a dedicated data model.

• It has a well-defined structure, follows a consistent order, and is designed so that it can be easily accessed and used by a person or a computer.

• Structured data is usually stored in well-defined rows and columns in databases.

• Examples: DBMS, RDBMS


• Sources of structured data

• Databases: Oracle (Oracle Corp.), DB2 (IBM), SQL Server (Microsoft), Greenplum (EMC), Teradata (Teradata), MySQL, PostgreSQL

• Spreadsheets: MS Excel, Google Sheets

• On-Line Transaction Processing (OLTP) systems


• Ease of Working with Structured Data

• Insert/update/delete: Data Manipulation Language (DML) operations provide the required ease of data input, storage, access, processing, and analysis (see the sketch after this list).
• Security: Encryption and tokenization solutions are available to safeguard information throughout its lifecycle.
• Indexing: An index is a data structure that speeds up data retrieval operations (primarily the SELECT DML statement) at the cost of additional writes and storage space.
• Scalability: The storage and processing capabilities of a traditional RDBMS can be easily scaled up by increasing the horsepower of the database server.
• Transaction processing: RDBMS supports the Atomicity, Consistency, Isolation, and Durability (ACID) properties of transactions.
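
A minimal sketch of DML and indexing using Python's built-in sqlite3 module (the table, columns, and values are illustrative, not from the slides):

import sqlite3

# In-memory database with a well-defined schema (structured data)
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE employees (id INTEGER PRIMARY KEY, name TEXT, salary REAL)")

# DML: insert, update, delete
cur.execute("INSERT INTO employees (name, salary) VALUES (?, ?)", ("Alice", 50000.0))
cur.execute("INSERT INTO employees (name, salary) VALUES (?, ?)", ("Bob", 45000.0))
cur.execute("UPDATE employees SET salary = ? WHERE name = ?", (52000.0, "Alice"))
cur.execute("DELETE FROM employees WHERE name = ?", ("Bob",))

# Indexing: speeds up SELECT lookups on name at the cost of extra writes/storage
cur.execute("CREATE INDEX idx_employees_name ON employees (name)")
print(cur.execute("SELECT * FROM employees WHERE name = ?", ("Alice",)).fetchall())
conn.close()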
Semi-Structured Data
• The data does not fully conform to a data model but has some structure.

• Examples: XML, markup languages like HTML, etc. Metadata for this data is available but is not sufficient.

• Semi-structured data is also referred to as having a self-describing structure.
Features

• It does not conform to the data models that one typically associates with relational databases or any other form of data tables.
• It uses tags to segregate semantic elements.
• Tags are also used to enforce hierarchies of records and fields within data.
• There is no separation between the data and the schema. The amount of structure used is dictated by the purpose at hand.
• In semi-structured data, entities belonging to the same class and grouped together need not have the same set of attributes.
Sources of Semi-structured data
• XML: eXtensible Markup Language (XML) was hugely popularized by web services developed on the Simple Object Access Protocol (SOAP) principles.

• JSON: JavaScript Object Notation (JSON) is used to transmit data between a server and a web application using the REST architecture.

• MongoDB and Couchbase (originally known as Membase) store data natively in a JSON-like format.
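
A small illustration of self-describing, semi-structured data using Python's standard json module (the records are invented for the example): the keys act as tags carrying the semantics, and two records of the same class need not share the same attributes.

import json

# Two "person" records: same class, different attribute sets (semi-structured)
raw = '''
[
  {"name": "Alice", "email": "alice@example.com"},
  {"name": "Bob", "phones": ["555-0100", "555-0101"], "city": "Bengaluru"}
]
'''
people = json.loads(raw)
for person in people:
    # The schema travels with the data: each record describes itself
    print(person["name"], "->", sorted(person.keys()))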
Unstructured Data
• Unstructured data does not conform to a data model, or it is not in a form which can be used easily by a computer program.

• It neither has a structure nor follows the formal structural rules of data models.

• It does not even have a consistent format, and it is found to vary all the time.

• About 80–90% of an organization's data is in this format.

• Sources of Unstructured Data
  – Web pages, images, free-form text, audio, video, body of email, text messages, chats, social media data, Word documents

Dealing with Unstructured Data

• Data Mining
• Natural Language Processing (NLP)
• Text Analytics
• Noisy Text Analytics
• Data Mining:

• First, we deal with large data sets.

• Second, we use methods at the intersection of artificial intelligence, machine learning, statistics, and database systems to unearth consistent patterns in large data sets and/or systematic relationships between variables.

• A few popular data mining algorithms are as follows (a small regression sketch follows this list):

  • Association rule mining
  • Regression analysis
  • Collaborative filtering
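
As a taste of one of these algorithms, here is a minimal ordinary least-squares linear regression in plain Python (the data points are invented for illustration):

# Fit y = a + b*x by ordinary least squares
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 8.0, 9.8]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Slope b = covariance(x, y) / variance(x); intercept a = mean_y - b*mean_x
b = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
     / sum((x - mean_x) ** 2 for x in xs))
a = mean_y - b * mean_x
print(f"fitted model: y = {a:.2f} + {b:.2f} * x")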
• Natural Language Processing (NLP):
• It is related to the area of human-computer interaction.

• It is about enabling computers to understand human or natural language input.

• Text Analytics or Text Mining:

• Text mining is the process of gleaning high-quality and meaningful information (through the devising of patterns and trends by means of statistical pattern learning) from text.

• It includes tasks such as text categorization, text clustering, sentiment analysis, concept/entity extraction, etc. (a toy categorization sketch follows).
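
A toy keyword-based text categorization sketch in Python (the categories and keyword lists are invented for illustration; real text mining would learn such patterns statistically):

# Naive keyword-count categorizer: assign the category whose keywords
# overlap most with the words of the document
CATEGORIES = {
    "sports": {"match", "score", "team", "goal"},
    "finance": {"stock", "market", "price", "shares"},
}

def categorize(text):
    words = set(text.lower().split())
    # Pick the category with the largest keyword overlap
    return max(CATEGORIES, key=lambda c: len(CATEGORIES[c] & words))

print(categorize("The team celebrated after the final goal of the match"))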
• Noisy Text Analytics:

• It is the process of extracting structured or semi-structured information from noisy unstructured data such as chats, blogs, wikis, emails, message boards, text messages, etc.
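
For instance, a minimal sketch of pulling structured fields (email addresses and dates) out of a noisy chat message with Python's re module (the message is invented):

import re

chat = "hey ping me at ravi.k@example.com b4 2024-03-15 thx!!"

# Extract structured fields from noisy free-form text
emails = re.findall(r"[\w.+-]+@[\w-]+\.[\w.]+", chat)
dates = re.findall(r"\d{4}-\d{2}-\d{2}", chat)
print({"emails": emails, "dates": dates})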
Qualitative vs Quantitative Data

Qualitative Data:
• Deals with descriptions.
• Data can be observed but not measured.
• Examples: colors, textures, smells, tastes, appearance, beauty, etc.
• Qualitative → Quality

Quantitative Data:
• Deals with numbers.
• Data which can be measured.
• Examples: length, height, area, volume, weight, speed, time, temperature, humidity, sound levels, cost, members, ages, etc.
• Quantitative → Quantity
• Characteristics of Data

• Composition: The composition of data deals with the structure of data, that is, the sources of data, the granularity, the types, and the nature of data as to whether it is static or real-time streaming.
• Condition: The condition of data deals with the state of
data, that is, “Can one use this data as is for analysis?” or
“Does it require cleansing for further enhancement and
enrichment?”
• Context: The context of data deals with “Where has this
data been generated?” “Why was this data generated?” “How
sensitive is this data?” “What are the events associated with
this data?” and so on
Evolution of Big Data

Common Eras of Evolution

• The 1970s and before was the era of mainframes. The data was essentially primitive and structured.

• Relational databases evolved in the 1980s and 1990s. This was the era of data-intensive applications.

• 2000 and beyond: The World Wide Web (WWW) and the Internet of Things (IoT) have led to an onslaught of structured, unstructured, and multimedia data.
• Characteristics of Big Data

• Big data characteristics are the terms that describe the remarkable potential of big data.

• In the early stages of big data's development, only 3 V's (Volume, Variety, Velocity) were considered potential characteristics.

• But ever-growing technology and tools, and the variety of sources from which information is received, have expanded these 3 V's into 5 V's, and the list is still evolving.
• The 5 V’s are

• Volume
• Variety
• Velocity
• Veracity
• Value
• Volume:

• Volume refers to the unimaginable amounts of information generated every second.

• This information comes from a variety of sources like social media, cell phones, sensors, financial records, the stock market, etc.
• Variety:
• Variety refers to the many types of data that are available.

• A reason for the rapid growth of data volume is that the data comes from different sources in various formats.

• Big data extends beyond structured data to include unstructured data of all varieties: text, sensor data, audio, video, click streams, log files, and more.

• The variety of data is categorized as follows:
  – Structured: RDBMS
  – Semi-structured: XML, HTML, RDF, JSON
  – Unstructured: text, audio, video, logs, images
• Velocity:

• Velocity essentially refers to the speed at which data is being created in real time.

• It is the fast rate at which data is received and (perhaps) acted on.

• In other words, it is the speed at which data is generated and processed to meet the demands and challenges that lie in the path of growth and development.
• Veracity:

• Data veracity, in general, is how accurate or truthful a data set may be.

• More specifically, when it comes to the accuracy of big data, it is not just the quality of the data itself but how trustworthy the data's source, type, and processing are.
• Value:

• Value is the major issue that we need to concentrate on.

• It is not just the amount of data that we store or process.

• It is actually the amount of valuable, reliable, and trustworthy data that needs to be stored, processed, and analyzed to find insights.

• Mine the data, i.e., turn raw data into useful data. Value represents the benefits of data to your business, such as finding insights and results that were not possible earlier.
STATISTICS

• Descriptive Statistics
– Frequencies & percentages
– Means & standard deviations
• Inferential Statistics
– Correlation
– T-tests
– Chi-square
– Logistic Regression
Descriptive Statistics

Descriptive statistics can be used to summarize and describe a single variable (UNIvariate). A short computational sketch of both techniques follows this list.

• Frequencies (counts) & Percentages
  – Use with categorical (nominal) data
    • Levels, types, groupings, yes/no, Drug A vs. Drug B

• Means & Standard Deviations
  – Use with continuous (interval/ratio) data
    • Height, weight, cholesterol, scores on a test
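
A quick sketch of both ideas with Python's standard library (the sample data is invented):

from collections import Counter
from statistics import mean, stdev

# Categorical (nominal) data: frequencies and percentages
groups = ["Drug A", "Drug B", "Drug A", "Drug A", "Drug B"]
counts = Counter(groups)
for group, count in counts.items():
    print(f"{group}: n={count} ({100 * count / len(groups):.0f}%)")

# Continuous (interval/ratio) data: mean and standard deviation
heights_cm = [162.0, 175.5, 158.2, 181.3, 169.9]
print(f"mean={mean(heights_cm):.1f} cm, sd={stdev(heights_cm):.1f} cm")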
Frequencies & Percentages
Look at the different ways we can display frequencies and percentages for this data:

[Figure: the same data displayed as a pie chart, a frequency-distribution table, and a bar chart. Tables and bar charts work well with more than 20 observations.]
Distributions
The distribution of scores or values can also be displayed using box-and-whiskers plots and histograms.
Continuous → Categorical

It is possible to take continuous data (such as hemoglobin levels) and turn it into categorical data by grouping values together. Then we can calculate frequencies and percentages for each group, as in the sketch below.
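
A minimal binning sketch in Python (the hemoglobin values and cut-offs are invented for illustration, not clinical guidance):

from collections import Counter

# Invented example values (g/dL) and illustrative cut-offs
hemoglobin = [10.2, 13.5, 11.8, 15.1, 9.4, 12.9, 14.2]

def to_category(value):
    if value < 11.0:
        return "low"
    elif value <= 14.0:
        return "normal"
    return "high"

# Continuous -> categorical, then frequencies and percentages per group
counts = Counter(to_category(v) for v in hemoglobin)
for group, count in sorted(counts.items()):
    print(f"{group}: n={count} ({100 * count / len(hemoglobin):.0f}%)")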
Continuous → Categorical

[Figure: distribution of Glasgow Coma Scale scores. Even though this is continuous data, it is treated as "nominal" because it is broken down into groups or categories.]

Tip: It is usually better to collect continuous data and then break it down into categories for data analysis, as opposed to collecting data that fits into preconceived categories.
Ordinal Level Data
Ordinal data is a categorical, statistical data type where the
variables have natural, ordered categories and the distances
between the categories are not known.

Frequencies and percentages can be computed for ordinal data.
– Examples: Likert scales (Strongly Disagree to Strongly Agree); High School / Some College / College Graduate / Graduate School

[Figure: bar chart of Likert-scale responses ranging from Strongly Agree to Strongly Disagree.]
Interval/Ratio Data
• Ratio data has a defined zero point.
• Interval data lacks an absolute zero point, which makes direct comparisons of magnitude impossible (e.g., saying A is twice as large as B).

We can compute frequencies and percentages for interval- and ratio-level data as well.
– Examples: age, temperature, height, weight, many clinical serum levels

[Figure: distribution of Injury Severity Score in a population of patients.]
Interval/Ratio Distributions
The distribution of interval/ratio data often forms a "bell-shaped" curve.
– Many phenomena in life are normally distributed (age, height, weight, IQ).
Interval & Ratio Data
Measures of central tendency and measures of dispersion are often computed with interval/ratio data (see the sketch after this list).

• Measures of Central Tendency (aka the "middle point")
  – Mean, median, mode
  – If your frequency distribution shows outliers, you might want to use the median instead of the mean

• Measures of Dispersion (aka how "spread out" the data are)
  – Variance, standard deviation, standard error of the mean
  – Describe how "spread out" a distribution of scores is
  – High values for variance and standard deviation may mean that scores are "all over the place" and do not necessarily fall close to the mean

In research, means are usually presented along with standard deviations or standard errors.
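
A compact sketch of these measures with Python's statistics module (the sample scores are invented):

from math import sqrt
from statistics import mean, median, mode, stdev, variance

scores = [72, 85, 78, 90, 85, 64, 88]

# Measures of central tendency
print("mean:", round(mean(scores), 1))
print("median:", median(scores))  # prefer the median when outliers are present
print("mode:", mode(scores))

# Measures of dispersion
print("variance:", round(variance(scores), 1))
sd = stdev(scores)
print("sd:", round(sd, 1))
print("sem:", round(sd / sqrt(len(scores)), 1))  # standard error of the mean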
