
UNIT-I

Data Management
Design Data Architecture and Manage the Data for Analysis
• Data architecture is composed of the models, policies, rules
or standards that govern which data is to be collected,
and how it is stored, arranged, integrated, and put to use in
data systems and in organizations.
• Data is usually one of several architecture domains that
form the pillars of an enterprise architecture or solution
architecture.
• Various constraints and influences will have an
effect on data architecture design. These
include
– Enterprise requirements
– Technology drivers
– Economics
– Business policies
– Data processing needs.
• Enterprise requirements
 These will generally include such elements as economical
and effective system expansion, acceptable performance
levels (especially system access speed), transaction
reliability, and transparent data management.
 In addition, the conversion of raw data such as transaction
records and image files into more useful information
forms through such features as data warehouses is also a
common organizational requirement, since this enables
managerial decision making and other organizational
processes.
 One of the architecture techniques is the split between
managing transaction data and (master) reference data.
Another one is splitting data capture systems from data
retrieval systems (as done in a data warehouse).
• Technology drivers
These are usually suggested by the completed data
architecture and database architecture designs.
In addition, some technology drivers will be
derived from existing organizational integration
frameworks and standards, organizational
economics, and existing site resources (e.g.
previously purchased software licensing).
• Economics
 These are also important factors that must be considered
during the data architecture phase. It is possible that some
solutions, while optimal in principle, may not be potential
candidates due to their cost.
 External factors such as the business cycle, interest
rates, market conditions, and legal considerations
could all have an effect on decisions relevant to data
architecture.
• Business policies
 Business policies that also drive data architecture
design include internal organizational policies,
rules of regulatory bodies, professional standards, and
applicable governmental laws that can vary by
applicable agency.
 These policies and rules will help describe the manner
in which the enterprise wishes to process its data.
• Data processing needs
 These include accurate and reproducible
transactions performed in high volumes, data
warehousing for the support of management
information systems (and potential data mining),
repetitive periodic reporting, ad-hoc reporting,
and support of various organizational initiatives
as required (e.g., annual budgets, new product
development).
The General Approach is based on designing the
Architecture at three Levels of Specification :-
 The Physical Level
 The Logical Level
 The Implementation Level
• Physical level: This is the lowest level of data
abstraction. It describes how data is actually
stored in the database. You can get the complex data
structure details at this level.
• Logical level: This is the middle level of the 3-level
data abstraction architecture. It describes what
data is stored in the database.
• Implementation level: This is the highest level of data
abstraction. It describes the user's
interaction with the database system.
Example:
• Let’s say we are storing customer information in a customer table.
At the physical level these records can be described as blocks of
storage (bytes, gigabytes, terabytes etc.) in memory. These details
are often hidden from the programmers.
• At the logical level these records can be described as fields and
attributes along with their data types; the relationships among
them can be logically defined. The programmers generally work at
this level because they are aware of such things about database
systems.
• At the implementation level, users simply interact with the system
through a GUI and enter details on the screen; they are not aware
of how or where the data is stored, since such details are hidden
from them.
Understand various sources of the Data

Data can be generated from two types of sources,
namely Primary and Secondary.

The sources of generating primary data are:
 Observation Method
 Survey Method
 Experimental Method
• Experimental Method
There are a number of experimental designs that
are used in carrying out an experiment.
However, market researchers have used four
experimental designs most frequently. They are:
 CRD - Completely Randomized Design
 RBD - Randomized Block Design
 LSD - Latin Square Design
 FD - Factorial Designs
• CRD - Completely Randomized Design:
– Studying the effects of one primary factor without
the need to take other nuisance variables into
account. The experiment compares the values of a
response variable based on the different levels of
that primary factor.
– For completely randomized designs, the levels of
the primary factor are randomly assigned to the
experimental units.
• RBD - Randomized Block Design –
– The term Randomized Block Design has originated from
agricultural research. In this design several treatments of variables
are applied to different blocks of land to ascertain their effect on
the yield of the crop.
– Blocks are formed in such a manner that each block contains as
many plots as there are treatments, so that one plot from each
block is selected at random for each treatment. The production of
each plot is measured after the treatment is given.
– These data are then interpreted and inferences are drawn by using
the Analysis of Variance technique so as to know the effect of
various treatments like different doses of fertilizer, different types
of irrigation, etc.
• LSD - Latin Square Design –
– The Latin square is an experimental design which has a
balanced two-way classification scheme, say for example a 4 x 4
arrangement. In this scheme each letter from A to D occurs only
once in each row and also only once in each column. It may be
noted that the balanced arrangement is not disturbed if any row
is interchanged with another.
– The balanced arrangement achieved in a Latin square is its main
strength. In this design, the comparisons among treatments will
be free from both differences between rows and columns. Thus
the magnitude of error will be smaller than in any other design.
• FD - Factorial Designs –
– This design allows the experimenter to test two or
more variables simultaneously. It also measures
interaction effects of the variables and analyzes
the impacts of each of the variables.
– In a true experiment, randomization is essential so
that the experimenter can infer cause and effect
without any bias.
Sources of Secondary Data
While primary data can be collected through
questionnaires, depth interviews, focus group interviews,
case studies, experimentation and observation,
the secondary data can be obtained through:
 Internal Sources - These are within the organization
 External Sources - These are outside the organization
Internal Sources of Data
• If available, internal secondary data may be obtained with less
time, effort and money than the external secondary data. In
addition, they may also be more pertinent to the situation at hand
since they are from within the organization.
• The internal sources include
– Accounting resources- These give a great deal of information which can be
used by the marketing researcher. They give information about internal
factors.
– Sales Force Reports- These give information about the sales of a product. The
information provided comes from outside the organization (i.e., from the field).
– Internal Experts- These are the people heading the various
departments. They can give an idea of how a particular thing is working.
– Miscellaneous Reports- These are the pieces of information obtained from
operational reports. If the data available within the organization
are unsuitable or inadequate, the marketer should extend the search to
external secondary data sources.
External Sources of Data
External Sources are sources which are outside the company in a larger
environment. Collection of external data is more difficult because the data have
much greater variety and the sources are much more numerous.

External data can be divided into the following classes.

Government Publications- Government sources provide an extremely rich pool of
data for researchers. In addition, many of these data are available free of cost
on internet websites. There are a number of government agencies generating data.
They are:
₋Registrar General of India- It is an office which generates demographic data. It
includes details of gender, age, occupation etc.
₋Central Statistical Organization- This organization publishes the national accounts
statistics. It contains estimates of national income for several years, growth rate, and
rate of major economic activities. Annual survey of Industries is also published by
the CSO. It gives information about the total number of workers employed,
production units, material used and value added by the manufacturer.
₋Director General of Commercial Intelligence- This office operates from Kolkata. It gives
information about foreign trade i.e. import and export. These figures are provided
region-wise and country-wise.
⁻ Ministry of Commerce and Industries- This ministry through the office of
economic advisor provides information on wholesale price index. These indices
may be related to a number of sectors like food, fuel, power, food grains etc. It
also generates All India Consumer Price Index numbers for industrial workers,
urban non-manual employees and agricultural labourers.
⁻ Planning Commission- It provides the basic statistics of Indian Economy.
⁻ Reserve Bank of India- This provides information on Banking Savings and
investment. RBI also prepares currency and finance reports.
⁻ Labour Bureau- It provides information on skilled, unskilled, white collared
jobs etc.
⁻ National Sample Survey- This is done by the Ministry of Planning and it
provides social, economic, demographic, industrial and agricultural statistics.
⁻ Department of Economic Affairs- It conducts economic survey and it also
generates information on income, consumption, expenditure, investment,
savings and foreign trade.
⁻ State Statistical Abstract- This gives information on various types of activities
related to the state like - commercial activities, education, occupation etc.
Non Government Publications- These include publications of various industrial and trade
associations, such as
• The Indian Cotton Mill Association
• Various chambers of commerce
• The Bombay Stock Exchange (it publishes a directory containing financial accounts, key
profitability and other relevant matter)
• Various Associations of Press Media.
• Export Promotion Council.
• Confederation of Indian Industries ( CII )
• Small Industries Development Board of India
• Different Mills like - Woollen mills, Textile mills etc

The only disadvantage of the above sources is that the data may be biased, since these
bodies are likely to gloss over their own negative points.

Syndicate Services- These services are provided by certain organizations which collect and
tabulate the marketing information on a regular basis for a number of clients who are the
subscribers to these services. So the services are designed in such a way that the
information suits the subscriber. These services are useful in areas such as television
viewing and the movement of consumer goods. These syndicate services provide data from
both households as well as institutions.
In collecting data from households they use three approaches:
• Survey- They conduct surveys regarding lifestyle, sociographics and general topics.
• Mail Diary Panel- It may be related to 2 fields - Purchase and Media.
• Electronic Scanner Services- These are used to generate data on volume.

They collect data for institutions from
 Wholesalers
 Retailers, and
 Industrial Firms

Various syndicate services are Operations Research Group (ORG) and The Indian
Marketing Research Bureau (IMRB).

Importance of Syndicate Services

Syndicate services are becoming popular since the constraints of decision
making are changing and we need more specific decision making in the light
of a changing environment. Also, syndicate services are able to provide
information to the industries at a low unit cost.

Disadvantages of Syndicate Services:
The information provided is not exclusive. A number of research agencies
provide customized services which suit the requirements of each
individual organization.

International Organizations- These include:

The International Labour Organization (ILO)- It publishes data on the
total and active population, employment, unemployment, wages and
consumer prices.

The Organization for Economic Co-operation and Development (OECD)-
It publishes data on foreign trade, industry, food, transport, and
science and technology.

The International Monetary Fund (IMF)- It publishes reports on
national and international foreign exchange regulations.
Other sources:
• Sensor data: With the advancement of IoT devices, the
sensors of these devices collect data which can be used
for sensor data analytics to track the performance and
usage of products.
• Satellite data: Satellites collect large volumes of images and data,
running into terabytes on a daily basis through their onboard
cameras, which can be processed to extract useful information.
• Web traffic: Due to fast and cheap internet facilities, data in many
formats uploaded by users on different platforms can be
collected, with their permission, for data analytics. Search
engines also provide data on the keywords and queries that are
searched most often.
Data Quality
• Poor data quality negatively affects many data processing
efforts
“The most important point is that poor data quality is an
unfolding disaster.
– Poor data quality costs the typical company at least ten
percent (10%) of revenue; twenty percent (20%) is
probably a better estimate.”
• Data mining example: a classification model for detecting
people who are loan risks is built using poor data
– Some credit-worthy candidates are denied loans
– More loans are given to individuals that default
Data Quality …
• What kinds of data quality problems?
• How can we detect problems with the data?
• What can we do about these problems?

• Examples of data quality problems:


– Noise and outliers
– Missing values
– Duplicate data
– Wrong data
Noise
• For objects, noise is an extraneous object
• For attributes, noise refers to modification of original values
– Examples: distortion of a person’s voice when talking on a poor phone and
“snow” on television screen
– We talk about signal to noise ratio.
The left image shows two clean sine waves (high signal-to-noise ratio); the right
image shows the two waves combined with noise (low signal-to-noise ratio).

Figure: Two Sine Waves (left); Two Sine Waves + Noise (right)

Origins of noise
• outliers -- values seemingly out of the normal range
of data
• duplicate records -- good database design should
minimize this (use DISTINCT on SQL retrievals)
• incorrect attribute values -- again good db design and
integrity constraints should minimize this
• numeric-only attributes: deal with rogue strings or characters
where numbers should be.
• how to locate and treat outliers (values seemingly out
of the normal range)
• null handling for attributes (nulls=missing values)
Outliers
• Outliers are data objects with characteristics
that are considerably different than most of the
other data objects in the data set
– Case 1: Outliers are
noise that interferes
with data analysis
– Case 2: Outliers are
the goal of our analysis
• Credit card fraud
• Intrusion detection
Missing Values
• Reasons for missing values
– Information is not collected
(e.g., people decline to give their age and weight)
– Attributes may not be applicable to all cases
(e.g., annual income is not applicable to children)

• Handling missing values


– Eliminate data objects or variables
– Estimate missing values
• Example: time series of temperature
• Example: census results
– Ignore the missing value during analysis
Missing Data Handling
Many causes: malfunctioning equipment, changes in experimental design,
collation of different data sources, measurement not possible. People may
wish to not supply information. Information is not applicable (children don't
have annual income)
• Discard records with missing values
• For ordinal/continuous data, replace missing values with attribute means
• Substitute with a value from a similar instance
• Ignore missing values, i.e., just proceed and let the tools deal with them
• Treat missing values as equals (all share the same missing value code)
• Treat missing values as unequal values

BUT... missing (null) values may have significance in themselves (e.g. a missing
test in a medical examination; a missing death date means still alive!)
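For illustration, a minimal sketch of some of these options using pandas follows; the DataFrame and its column names are made up for the example.

# Minimal sketch of common missing-value strategies using pandas.
# The DataFrame below is invented; column names are illustrative only.
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "age":    [25, np.nan, 47, 51, np.nan],
    "income": [50000, 62000, np.nan, 58000, 61000],
})

dropped = df.dropna()                                  # discard records with missing values
mean_filled = df.fillna(df.mean(numeric_only=True))    # replace with attribute means
flagged = df.assign(age_missing=df["age"].isna())      # keep missingness itself as information

print(dropped, mean_filled, flagged, sep="\n\n")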
Missing Data Handling …
• Missing completely at random (MCAR)
– Missingness of a value is independent of attributes
– Fill in values based on the attribute
– Analysis may be unbiased overall
• Missing at Random (MAR)
– Missingness is related to other variables
– Fill in values based on other values
– Almost always produces a bias in the analysis
• Missing Not at Random (MNAR)
– Missingness is related to unobserved measurements
– Informative or non-ignorable missingness

It is not possible to distinguish these situations from the data alone.

Duplicate Data
• Data set may include data objects that are duplicates, or
almost duplicates of one another
– Major issue when merging data from heterogeneous sources

• Examples:
– Same person with multiple email addresses

• Data cleaning
– Process of dealing with duplicate data issues

• When should duplicate data not be removed?
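As a small illustration of data cleaning for duplicates, here is a sketch with pandas; the records are invented, and in practice near-duplicates often need fuzzier matching than shown here.

# Removing exact duplicates with pandas; the records are invented for illustration.
import pandas as pd

people = pd.DataFrame({
    "name":  ["A. Kumar", "A. Kumar", "B. Rao"],
    "email": ["ak@mail.com", "ak@mail.com", "brao@mail.com"],
})

deduped = people.drop_duplicates()                # drop rows that are exact copies
by_name = people.drop_duplicates(subset="name")   # treat rows with the same name as one person
print(deduped)
print(by_name)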


DATA PREPROCESSING
Data preprocessing is a data mining technique which is used to
transform the raw data into a useful and efficient format.
• Aggregation
• Sampling
• Dimensionality Reduction
• Feature subset selection
• Feature creation
• Discretization and Binarization
• Attribute Transformation
Aggregation
• Combining two or more attributes (or objects) into a
single attribute (or object)

• Purpose
– Data reduction
• Reduce the number of attributes or objects
– Change of scale
• Cities aggregated into regions, states, countries, etc.
• Days aggregated into weeks, months, or years
– More “stable” data
• Aggregated data tends to have less variability
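For example, a hedged sketch of aggregating daily records into monthly totals with pandas; the data and column names are assumptions made for the illustration.

# Aggregating daily sales into monthly totals; data and column names are illustrative.
import pandas as pd

daily = pd.DataFrame({
    "date":  pd.date_range("2023-01-01", periods=90, freq="D"),
    "city":  ["Hyderabad"] * 90,
    "sales": range(90),
})

# Change of scale: days aggregated into months, many objects reduced to a few.
monthly = (daily
           .groupby([daily["city"], daily["date"].dt.to_period("M")])["sales"]
           .sum())
print(monthly)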
Example: Precipitation in Australia
• This example is based on precipitation in Australia from the
period 1982 to 1993.
The next slide shows
– A histogram for the standard deviation of average monthly
precipitation for 3,030 0.5° by 0.5° grid cells in Australia, and
– A histogram for the standard deviation of the average yearly
precipitation for the same locations.
• The average yearly precipitation has less variability than the
average monthly precipitation.
• All precipitation measurements (and their standard
deviations) are in centimeters.
Example: Precipitation in Australia …
The average yearly precipitation has less variability than the
average monthly precipitation

Figure: histograms of the standard deviation of average monthly precipitation (left)
and of the standard deviation of average yearly precipitation (right).
Sampling
• Sampling is the main technique employed for data
reduction.
– It is often used for both the preliminary investigation of
the data and the final data analysis.

• Statisticians often sample because obtaining the
entire set of data of interest is too expensive or
time consuming.

• Sampling is typically used in data mining because
processing the entire set of data of interest is too
expensive or time consuming.
Sampling …
• The key principle for effective sampling is the
following:

– Using a sample will work almost as well as using
the entire data set, if the sample is representative.

– A sample is representative if it has approximately
the same properties (of interest) as the original
set of data.
Sample Size

Figure: the same data set shown with 8000 points, 2000 points, and 500 points.
Types of Sampling
• Simple Random Sampling
– There is an equal probability of selecting any particular
item
– Sampling without replacement
• As each item is selected, it is removed from the population
– Sampling with replacement
• Objects are not removed from the population as they are
selected for the sample.
• In sampling with replacement, the same object can be picked
up more than once
• Stratified sampling
– Split the data into several partitions; then draw random
samples from each partition
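A minimal sketch of these sampling types with pandas; the DataFrame and its "group" column are assumptions made for the example.

# Simple random and stratified sampling with pandas; df and the "group" column are assumed.
import pandas as pd

df = pd.DataFrame({"group": list("AABBBCCCCC"), "value": range(10)})

without_repl = df.sample(n=4, replace=False, random_state=0)  # sampling without replacement
with_repl    = df.sample(n=4, replace=True,  random_state=0)  # same row may appear twice

# Stratified sampling: draw a fixed fraction from each partition (stratum).
stratified = (df.groupby("group", group_keys=False)
                .apply(lambda g: g.sample(frac=0.5, random_state=0)))
print(without_repl, with_repl, stratified, sep="\n\n")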
Sample Size
• What sample size is necessary to get at least one
object from each of 10 equal-sized groups?
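One way to explore this question empirically is a small simulation: for each candidate sample size, estimate the probability that a random sample contains at least one member of every group. The sketch below assumes 10 equal-sized groups and sampling with replacement.

# Simulation: probability that a random sample covers all 10 equal-sized groups.
import random

def coverage_probability(sample_size, n_groups=10, trials=10_000):
    hits = 0
    for _ in range(trials):
        drawn = {random.randrange(n_groups) for _ in range(sample_size)}
        if len(drawn) == n_groups:
            hits += 1
    return hits / trials

for s in (10, 20, 30, 40, 50, 60):
    print(s, round(coverage_probability(s), 3))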
Curse of Dimensionality
• As the number of features or dimensions grows, the amount of
data we need to generalize accurately grows exponentially.

• Let’s take the example below. Fig. a shows 10 data points in one
dimension, i.e. there is only one feature in the data set. It can be
easily represented on a line with only 10 values, x = 1, 2, 3, ..., 10.
• But if we add one more feature, the same data will be represented in
2 dimensions, causing the dimension space to increase to 10*10 = 100.
And again, if we add a 3rd feature, the dimension space will increase to
10*10*10 = 1000. As the number of dimensions grows, the dimension
space increases exponentially.
• This exponential growth in data causes high
sparsity in the data set and unnecessarily
increases storage space and processing time
for the particular modelling algorithm.
• Think of an image recognition problem with high
resolution images: 1280 × 720 = 921,600
pixels, i.e. 921,600 dimensions. That is
why it is called the Curse of Dimensionality.
Dimensionality Reduction
• Dimensionality reduction is the process of reducing the number
of random variables under consideration, by obtaining a set of
principal variables.
• Purpose:
– Avoid curse of dimensionality
– Reduce amount of time and memory required by data mining
algorithms
– Allow data to be more easily visualized
– May help to eliminate irrelevant features or reduce noise

• Techniques
– Principal Components Analysis (PCA)
– Singular Value Decomposition
– Others: supervised and non-linear techniques
There are two components of dimensionality reduction:
• Feature selection: In this, we try to find a subset of the
original set of variables, or features, to get a smaller
subset which can be used to model the problem. It usually
involves three ways:
– Filter
– Wrapper
– Embedded
• Feature extraction: This reduces the data in a high
dimensional space to a lower dimensional space, i.e. a space
with a smaller number of dimensions.
• Filter methods:
– information gain
– chi-square test
– fisher score
– correlation coefficient
– variance threshold
• Wrapper methods:
– recursive feature elimination
– sequential feature selection algorithms
– genetic algorithms
• Embedded methods:
– L1 (LASSO) regularization
– decision tree
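A hedged sketch of one method from each family using scikit-learn; the synthetic classification data is only for illustration.

# One filter, one wrapper, and one embedded feature-selection method (scikit-learn).
# The synthetic classification data is for illustration only.
from sklearn.datasets import make_classification
from sklearn.feature_selection import VarianceThreshold, RFE, SelectFromModel
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=10, n_informative=4, random_state=0)

# Filter: drop near-constant features (no model involved).
X_filter = VarianceThreshold(threshold=0.1).fit_transform(X)

# Wrapper: recursive feature elimination around a model.
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=4).fit(X, y)

# Embedded: L1 (LASSO-style) regularization zeroes out weak features during training.
lasso = SelectFromModel(LogisticRegression(penalty="l1", solver="liblinear", C=0.1)).fit(X, y)

print(X_filter.shape, rfe.support_, lasso.get_support())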
Dimensionality Reduction: PCA
• It works on a condition that while the data in a higher dimensional space
is mapped to data in a lower dimension space, the variance of the data in
the lower dimensional space should be maximum.
• Goal is to find a projection that captures the largest amount of variation
in data

Principal component analysis (PCA) reduces the number of
dimensions in large datasets to principal components that retain
most of the original information. It does this by transforming
potentially correlated variables into a smaller set of variables,
called principal components.

Figure: data points in the (x1, x2) plane with the principal direction e shown.
It involves the following steps:
• Construct the covariance matrix of the data.
• Compute the eigenvectors of this matrix.
• Eigenvectors corresponding to the largest eigenvalues
are used to reconstruct a large fraction of the variance of
the original data.
Hence, we are left with a lesser number of eigenvectors,
and there might have been some data loss in the process.
But, the most important variances should be retained by
the remaining eigenvectors.
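These steps can be sketched directly with NumPy on a toy data set; in practice a library implementation (e.g. sklearn.decomposition.PCA) would normally be used.

# PCA from its definition: covariance matrix -> eigenvectors -> projection.
# Toy 2-D data; keep only the first principal component.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2)) @ np.array([[3.0, 0.0], [1.0, 0.5]])  # correlated data

X_centered = X - X.mean(axis=0)
cov = np.cov(X_centered, rowvar=False)          # construct the covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)          # eigenvalues/eigenvectors (ascending order)

order = np.argsort(eigvals)[::-1]               # largest eigenvalues first
top_component = eigvecs[:, order[:1]]           # eigenvector with the largest eigenvalue

X_reduced = X_centered @ top_component          # project onto the principal component
explained = eigvals[order[0]] / eigvals.sum()
print(X_reduced.shape, round(float(explained), 3))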
Feature Subset Selection
Another way to reduce dimensionality of data
• Redundant features
– Duplicate much or all of the information contained in one or
more other attributes
– Example: purchase price of a product and the amount of
sales tax paid
• Irrelevant features
– Contain no information that is useful for the data mining
task at hand
– Example: students' ID is often irrelevant to the task of
predicting students' GPA
Feature Creation
• Create new attributes that can capture the
important information in a data set much more
efficiently than the original attributes

• Three general methodologies:


– Feature extraction
• Example: extracting edges from images
– Feature construction
• Example: dividing mass by volume to get density
– Mapping data to new space
• Example: Fourier and wavelet analysis
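A brief sketch of feature construction, using the mass/volume-to-density example above; the pandas columns and values are invented for illustration.

# Feature construction: derive density from mass and volume (illustrative columns).
import pandas as pd

objects = pd.DataFrame({
    "mass_kg":   [2.0, 7.8, 0.9],
    "volume_m3": [0.0025, 0.0010, 0.0009],
})

# The constructed attribute captures the relevant information more directly
# than mass and volume do separately.
objects["density_kg_m3"] = objects["mass_kg"] / objects["volume_m3"]
print(objects)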
Advantages of Dimensionality Reduction
• It helps in data compression, and hence reduced storage space.
• It reduces computation time.
• It also helps remove redundant features, if any.

Disadvantages of Dimensionality Reduction


• It may lead to some amount of data loss.
• PCA tends to find linear correlations between variables, which is
sometimes undesirable.
• PCA fails in cases where mean and covariance are not enough to
define datasets.
• We may not know how many principal components to keep; in
practice, some rules of thumb are applied.
Discretization
• Discretization is the process of converting a
continuous attribute into an ordinal attribute
– A potentially infinite number of values are
mapped into a small number of categories
– Discretization is commonly used in classification
– Many classification algorithms work best if both the
independent and dependent variables have only a few
values
– We give an illustration of the usefulness of
discretization using the Iris data set
Iris Sample Data Set

• Iris Plant data set.
– Can be obtained from the UCI Machine Learning Repository
http://www.ics.uci.edu/~mlearn/MLRepository.html
– From the statistician R. A. Fisher
– Three flower types (classes):
• Setosa
• Versicolour
• Virginica
– Four (non-class) attributes
• Sepal width and length
• Petal width and length
Discretization: Iris Example

Petal width low or petal length low implies Setosa.


Petal width medium or petal length medium implies Versicolour.
Petal width high or petal length high implies Virginica.
Discretization: Iris Example …
• How can we tell what the best discretization is?
– Unsupervised discretization: find breaks in the data values
• Example: a histogram of petal length (counts vs. petal length),
in which the breaks between groups of values are visible.

– Supervised discretization: Use class labels to find breaks


Discretization Without Using Class Labels

Data consists of four groups of points and two outliers. Data is one-
dimensional, but a random y component is added to reduce overlap.
Discretization Without Using Class Labels

Equal interval width approach used to obtain 4 values.


Discretization Without Using Class Labels

Equal frequency approach used to obtain 4 values.


Discretization Without Using Class Labels

K-means approach to obtain 4 values.
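The three unsupervised approaches above can be sketched as follows; the one-dimensional data is synthetic, and scikit-learn's KBinsDiscretizer is just one possible implementation.

# Equal-width, equal-frequency, and k-means discretization into 4 bins.
# The one-dimensional data is synthetic, for illustration only.
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(loc, 0.5, 50) for loc in (0, 4, 8, 12)]).reshape(-1, 1)

for strategy in ("uniform", "quantile", "kmeans"):   # equal width / equal frequency / k-means
    disc = KBinsDiscretizer(n_bins=4, encode="ordinal", strategy=strategy)
    labels = disc.fit_transform(x).ravel()
    print(strategy, np.bincount(labels.astype(int)))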


Binarization
• Binarization maps a continuous or categorical
attribute into one or more binary variables

• Typically used for association analysis

• Often convert a continuous attribute to a categorical


attribute and then convert a categorical attribute to a
set of binary attributes
– Association analysis needs asymmetric binary attributes
– Examples: eye color and height measured as
{low, medium, high}
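A minimal sketch of binarizing a categorical attribute and a discretized continuous attribute with pandas; the values and bin boundaries are illustrative assumptions.

# Binarization: turn a categorical attribute into a set of asymmetric binary attributes.
import pandas as pd

data = pd.DataFrame({
    "eye_color": ["brown", "blue", "green", "brown"],
    "height_cm": [150, 172, 181, 165],
})

# Continuous -> categorical ({low, medium, high}), then categorical -> binary columns.
data["height_cat"] = pd.cut(data["height_cm"], bins=[0, 160, 175, 250],
                            labels=["low", "medium", "high"])
binary = pd.get_dummies(data[["eye_color", "height_cat"]])
print(binary)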
Attribute Transformation
• An attribute transform is a function that maps the
entire set of values of a given attribute to a new set
of replacement values such that each old value can
be identified with one of the new values
– Simple functions: x^k, log(x), e^x, |x|
– Normalization
• Refers to various techniques to adjust to differences
among attributes in terms of frequency of occurrence,
mean, variance, range
• Take out unwanted, common signal, e.g., seasonality
– In statistics, standardization refers to subtracting off
the means and dividing by the standard deviation
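For instance, here is a sketch of a simple log transform and of standardization (z-score) with NumPy; the values are made up.

# Attribute transformation: a simple function (log) and statistical standardization.
import numpy as np

income = np.array([20_000, 35_000, 52_000, 110_000, 640_000], dtype=float)

log_income = np.log(income)                         # simple function x -> log(x)
z_scores = (income - income.mean()) / income.std()  # subtract mean, divide by std dev

print(np.round(log_income, 2))
print(np.round(z_scores, 2))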
Example: Sample Time Series of Plant Growth

Net Primary Production (NPP) is a measure of plant growth used by ecosystem scientists.

Figure: monthly NPP time series for Minneapolis.

Correlations between time series:
              Minneapolis   Atlanta   Sao Paolo
Minneapolis      1.0000      0.7591    -0.7581
Atlanta          0.7591      1.0000    -0.5739
Sao Paolo       -0.7581     -0.5739     1.0000
Seasonality Accounts for Much Correlation

Normalized using monthly Z score: subtract off the monthly mean and divide by the
monthly standard deviation.

Figure: the Minneapolis NPP time series after monthly Z-score normalization.

Correlations between time series (after normalization):
              Minneapolis   Atlanta   Sao Paolo
Minneapolis      1.0000      0.0492     0.0906
Atlanta          0.0492      1.0000    -0.0154
Sao Paolo        0.0906     -0.0154     1.0000
