Data Analytics Unit-I
Data Management: Design Data Architecture and manage the data for analysis, understand
various sources of Data like Sensors/Signals/GPS etc., Data Management, Data Quality
(noise, outliers, missing values, duplicate data) and Data Preprocessing & Processing.
Data Management:
Data Management is an administrative process that includes acquiring, validating, storing,
protecting and processing required data to ensure the accessibility, reliability and timeliness of the
data for its users.
Design Data Architecture and manage the Data for analysis
Data architecture describes how data
o is collected,
o how it is stored,
o arranged,
o integrated,
o and put to use
in data systems and in organizations.
• Data is usually one of several architecture domains that form the pillars of an enterprise
architecture or solution architecture.
Various constraints and influences that will have an effect on data architecture design are
• enterprise requirements
• technology drivers
• economics
• business policies
• Data processing needs.
Enterprise requirements
Technology drivers
Economics
• These are also important factors that must be considered during the data architecture
phase.
• It is possible that some solutions, while optimal in principle, may not be potential
candidates due to their cost.
• External factors such as the following can also influence design decisions:
o business cycle,
o interest rates,
o market conditions,
o legal considerations.
Business policies
• These include
o accurate and reproducible transactions performed in high volumes,
o data warehousing for the support of management information systems (and
potential data mining),
o repetitive periodic reporting,
o ad hoc reporting,
o support of various organizational initiatives as required (e.g. annual budgets, new
product development).
General Approach
The General Approach is based on designing the Architecture at three Levels of Specification: the conceptual level, the logical level and the physical level.
Various Sources of Data
• Sensor Data: Sensor data is the output of a device that detects and responds to some
type of input from the physical environment. The output may be used to provide
information or input to another system or to guide a process.
• The Global Positioning System (GPS) was developed to allow accurate
determination of geographical locations by military and civil users. It is based on the use
of satellites in Earth orbit that transmit information which allows the distance between
the satellites and the user to be measured.
• Social networking sites: Facebook, Google, LinkedIn - all these sites generate huge
amounts of data on a day-to-day basis as they have billions of users worldwide.
• E-commerce sites: Sites like Amazon, Flipkart and Alibaba generate huge amounts of logs
from which users' buying trends can be traced.
• Weather Stations: All the weather stations and satellites give very huge data which are
stored and manipulated to forecast weather.
• Telecom companies: Telecom giants like Airtel and Vodafone study user trends and
accordingly publish their plans, and for this they store the data of their millions of users.
• Share Market: Stock exchanges across the world generate huge amounts of data through
their daily transactions.
Observation method
▪ The observation method involves human or mechanical observation of what people
actually do or what events take place during a buying or consumption situation.
▪ “Information is collected by observing the process at work.”
▪ The following are a few situations:
o Service Stations-
▪ Pose as a customer,
▪ go to a service station and observe.
o To evaluate the effectiveness of the display of Dunlop Pillow Cushions-
o In a departmental store, the observer notes:-
▪ How many pass by;
▪ How many stop to look at the display;
▪ How many decide to buy.
o Super Market-
▪ Which is the best location on the shelf? Hidden cameras are used.
o To determine the typical sales arrangement and find out the sales enthusiasm
shown by various salesmen-
▪ Normally this is done by an investigator using a concealed tape-recorder.
▪ Advantages of Observation Method
o If the researcher observes and records events, it is not necessary to rely on the
willingness and ability of respondents to report accurately.
o The biasing effect of interviewers is either eliminated or reduced. Data
collected by observation are, thus, more objective and generally more
accurate.
▪ Disadvantages of Observation Method
o The most limiting factor in the use of the observation method is the inability to
observe such things as attitudes, motivations, customers'/consumers' state
of mind, their buying motives and their images.
o It also takes time for the investigator to wait for a particular action to take
place.
Survey Method
There are mainly 4 methods by which we can collect data through the Survey Method
• Telephonic Interview
• Personal Interview
• Mail Interview
• Electronic Interview
Telephonic Interview
• It is the best method for quickly gathering needed information.
• Responses are collected from the respondents by the researcher on telephone.
• Advantages of Telephonic Interview
o It is very fast method of data collection.
o It has the advantage over the “Mail Questionnaire” of permitting the interviewer
to talk to one or more persons and to clarify his questions if they are not
understood.
o The response rate of telephone interviewing seems to be a little better than that of
mail questionnaires.
o The quality of information is better.
o It is a less costly method and there are fewer administration problems.
• Disadvantages of Telephonic Interview
o It cannot handle interviews which need props.
o It cannot handle unstructured interviews.
o It cannot be used for questions which require long descriptive answers.
o Respondents cannot be observed.
o People are reluctant to disclose personal information on the telephone.
o People who do not have a telephone facility cannot be approached.
Personal Interviewing
• It is the most versatile of all the methods. It is used when props are required
along with the verbal response; non-verbal responses can also be observed.
Mail Survey
• Questionnaires are sent to the respondents; they fill it up and send it back.
• Advantages of Mail Survey
o It can reach all types of people.
o Response rate can be improved by offering certain incentives.
• Disadvantages of Mail Survey
o It cannot be used for unstructured study.
o It is costly.
o It requires established mailing list.
o It is time consuming.
o There is problem in case of complex questions.
Electronic Interview
• Electronic interviewing is a process of recognizing and noting people, objects, and
occurrences rather than asking for information.
• For example, when you go to a store, you notice which products people like to use.
• The Universal Product Code (UPC) is also a method of observing what people are
buying.
• Advantages of Electronic Interview
o There is no relying on willingness or ability of respondent.
o The data is more accurate and objective.
• Disadvantages of Electronic Interview
o Attitudes cannot be observed.
o Those events which are of long duration cannot be observed.
o There is observer bias. It is not purely objective.
o If the respondents know that they are being observed, their response can be
biased.
o It is a costly method.
An example of a 4 x 4 Latin Square arrangement of treatments A, B, C and D:
A B C D
B C D A
C D A B
D A B C
• The balance arrangement achieved in a Latin Square is its main strength.
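Such a layout can also be generated programmatically. The following is a minimal sketch (in Python, not part of the original notes) that builds a cyclic Latin Square of any order, in which every treatment appears exactly once in each row and each column.

# Minimal sketch (illustrative only): build a cyclic n x n Latin Square
# in which each treatment (A, B, C, ...) occurs once per row and once per column.
def latin_square(n):
    letters = [chr(ord('A') + i) for i in range(n)]
    return [[letters[(row + col) % n] for col in range(n)] for row in range(n)]

for row in latin_square(4):
    print(' '.join(row))
# Output:
# A B C D
# B C D A
# C D A B
# D A B C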
FD - Factorial Designs
• This design allows the experimenter to test two or more variables simultaneously.
• It also measures interaction effects of the variables and analyzes the impacts of each of
the variables.
• In a true experiment, randomization is essential so that the experimenter can infer cause
and effect without any bias.
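As an illustration of the idea (a hedged sketch with made-up factor names and levels, not an example from the notes), the treatment cells of a 2 x 3 factorial design can be enumerated and then randomized as follows.

# Illustrative sketch: enumerate and randomize the treatment combinations
# of a 2 x 3 factorial design (factor names and levels are assumed).
from itertools import product
import random

price = ['low', 'high']                  # factor 1: two levels
packaging = ['red', 'blue', 'green']     # factor 2: three levels

cells = list(product(price, packaging))  # 2 x 3 = 6 treatment combinations
random.shuffle(cells)                    # randomization removes assignment bias

for cell in cells:
    print(cell)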
▪ Sales Force Report- It gives information about the sale of a product. The information
provided is from outside the organization.
▪ Internal Experts- These are the people who head the various departments.
▪ Miscellaneous Reports- These are the pieces of information obtained from operational
reports. If the data available within the organization are unsuitable or inadequate, the
marketer should extend the search to external secondary data sources.
DATA QUALITY
Data Quality is a Perception or an assessment of data's fitness to serve its purpose in a given
context.
✓ Improved data quality leads to better decision-making across an organization. The more
high-quality data you have, the more confidence you can have in your decisions. Good data
decreases risk and can result in consistent improvements in results.
Consistency: This is about the single version of truth. Consistency means data
throughout the enterprise should be in sync with each other.
Completeness: It is the extent to which the expected attributes of data are provided.
Timeliness: Getting the right data to the right person at the right time is important for business.
Metadata: Data about data.
• Data mining applications are often applied to data that was collected for another purpose,
or for future, but unspecified applications.
• For that reason data mining cannot usually take advantage of the significant benefits of
"addressing quality issues at the source."
• In contrast, much of statistics deals with the design of experiments or surveys that
achieve a pre-specified level of data quality.
• Because preventing data quality problems is typically not an option, data mining focuses
on
1. Detection and correction (called data cleaning) of data quality problems
2. Use of algorithms that can tolerate poor data quality.
Outliers
• Outliers are either
1. data objects that have characteristics that are different from most of the other data
objects in the data set, or
2. Values of an attribute that are unusual with respect to the typical values for that
attribute.
• Outliers can be legitimate data objects or values.
• Unlike noise, outliers may sometimes be of interest.
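A common way to flag such unusual values is Tukey's interquartile-range (IQR) rule. The sketch below (toy data, not from the notes) marks values lying more than 1.5 x IQR outside the quartiles as potential outliers.

# Minimal sketch: flag potential outliers with the 1.5 x IQR rule (toy data).
import numpy as np

values = np.array([23.0, 25.0, 24.5, 26.0, 22.0, 95.0, 24.0])
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = values[(values < lower) | (values > upper)]
print(outliers)   # [95.] -- may be a legitimate value or an error; it needs investigation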
Missing Values
• It is not unusual for an object to be missing one or more attribute values.
• In some cases, the information was not collected; e.g., some people decline to give their
age or weight.
• In other cases, some attributes are not applicable to all objects; e.g., often, forms have
conditional parts that are filled out only when a person answers a previous question in a
certain way, but for simplicity, all fields are stored.
• Missing values should be taken into account during the data analysis.
• Strategies for dealing with missing data, each of which may be appropriate in certain
circumstances:
Eliminate Data Objects or Attributes
o A simple and effective strategy is to eliminate objects with missing values.
o However, even a partially specified data object contains some information, and if
many objects have missing values, then a reliable analysis can be difficult or
impossible.
o However, if a data set has only a few objects that have missing values, then it may
be convenient to omit them.
o A related strategy is to eliminate attributes that have missing values.
o This should be done with caution, however, since the eliminated attributes may be
the ones that are critical to the analysis.
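As a concrete illustration of this strategy, the following sketch (using pandas, with assumed column names) drops data objects (rows) that contain missing values and drops an attribute (column) that is mostly missing.

# Minimal sketch: eliminate data objects or attributes with missing values.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'age':    [25, np.nan, 40, 31],
    'weight': [70, 65, np.nan, 80],
    'income': [np.nan, np.nan, np.nan, 52000],   # attribute that is mostly missing
})

rows_kept = df.dropna()                  # eliminate objects (rows) with any missing value
cols_kept = df.drop(columns=['income'])  # eliminate an attribute with too many missing values
print(rows_kept)
print(cols_kept)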
Inconsistent Values
• Data can contain inconsistent values.
• Consider an address field, where both a zip code and city are listed, but the specified zip
code area is not contained in that city.
• It may be that the individual entering this information transposed two digits, or perhaps a
digit was misread when the information was scanned from a handwritten form.
• It is important to detect and, if possible, correct such problems.
• Some types of inconsistencies are easy to detect. For instance, a person's height should
not be negative.
• In some cases, it can be necessary to consult an external source of information.
• For example, when an insurance company processes claims for reimbursement, it checks
the names and addresses on the reimbursement forms against a database of its customers.
• Once an inconsistency has been detected, it is sometimes possible to correct the data.
• A product code may have "check" digits, or it may be possible to double-check a product
code against a list of known product codes, and then correct the code if it is incorrect, but
close to a known code.
• The correction of an inconsistency requires additional or redundant information.
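For example, the product-code correction described above can be approximated with a close-match search against the list of known codes. The sketch below is only illustrative; the codes and the similarity cutoff are assumptions.

# Illustrative sketch: correct an entered code that is close to a known code.
import difflib

known_codes = ['AB-1001', 'AB-1002', 'CD-2001', 'CD-2002']
entered_code = 'AB-1O01'   # letter 'O' typed instead of digit '0'

match = difflib.get_close_matches(entered_code, known_codes, n=1, cutoff=0.8)
corrected = match[0] if match else entered_code   # keep the original if nothing is close
print(corrected)   # AB-1001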
o Relevance
▪ The available data must contain the information necessary for the
application.
▪ Consider the task of building a model that predicts the accident rate for
drivers.
▪ If information about the age and gender of the driver is omitted, then it is
likely that the model will have limited accuracy unless this information is
indirectly available through other attributes.
▪ Making sure that the objects in a data set are relevant is also challenging.
o Sampling bias
▪ Sampling bias occurs when a sample does not contain the different types of objects in
proportion to their actual occurrence in the population.
The main Data Preprocessing techniques are Data Cleaning/Cleansing, Data Integration, Data Transformation, and Data
Reduction.
1. Data Cleaning/Cleansing
Data can be noisy, having incorrect attribute values, owing to the following: the data
collection instruments used may be faulty; human or computer errors may have occurred at data
entry; and errors can occur in data transmission.
“Dirty” data can cause confusion for the mining procedure. Although most mining routines
have some procedures for dealing with incomplete or noisy data, these are not always robust.
Therefore, a useful Data Preprocessing step is to run the data through some Data
Cleaning/Cleansing routines.
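A basic cleaning routine of this kind might look like the following sketch (column names and values are assumed): it removes duplicate objects, treats impossible values as missing, and fills the remaining missing values.

# Minimal sketch of a Data Cleaning routine (assumed data).
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'customer_id': [1, 1, 2, 3],
    'height_cm':   [172.0, 172.0, -5.0, np.nan],   # -5 is an impossible value
})

df = df.drop_duplicates()                            # remove duplicate data objects
df.loc[df['height_cm'] < 0, 'height_cm'] = np.nan    # treat impossible values as missing
df['height_cm'] = df['height_cm'].fillna(df['height_cm'].median())
print(df)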
2. Data Integration
Data Integration is a data analysis task which combines data from multiple
sources into a coherent data store, as in data warehousing. These sources may include
multiple databases, data cubes, or flat files. The main issue to be considered in Data Integration is
schema integration, which is tricky.
How can real-world entities from multiple data sources be ‘matched up’? This is referred to as
the entity identification problem. For example, how can a data analyst be sure that customer_id
in one database and cust_number in another refer to the same entity? The answer is
metadata. Databases and data warehouses typically have metadata. Simply, metadata is data
about data.
Metadata is used to help avoid errors in schema integration. Another important issue is
redundancy. An attribute may be redundant if it is derived from another table.
Inconsistencies in attribute or dimension naming can also cause redundancies in the
resulting data set.
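As a small illustration (table and column names are assumed), once metadata confirms that customer_id and cust_number refer to the same entity, the two sources can be merged on that key and the redundant attribute dropped.

# Minimal sketch: integrate two sources whose key attributes are named differently.
import pandas as pd

orders = pd.DataFrame({'customer_id': [101, 102], 'amount': [250, 400]})
crm    = pd.DataFrame({'cust_number': [101, 102], 'city': ['Hyderabad', 'Karimnagar']})

integrated = orders.merge(crm, left_on='customer_id', right_on='cust_number')
integrated = integrated.drop(columns=['cust_number'])   # remove the redundant attribute
print(integrated)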
3. Data Transformation
Data are transformed into forms appropriate for mining. Data Transformation involves the
following:
1. In Normalisation, the attribute data are scaled to fall within a small specified
range, such as -1.0 to 1.0, or 0 to 1.0 (a small sketch follows this list).
2. Smoothing works to remove the noise from the data. Such techniques include
binning, clustering, and regression.
3. In Aggregation, summary or aggregation operations are applied to the data. For
example, daily sales data may be aggregated so as to compute monthly and annual
total amounts. This step is typically used in constructing a data cube for analysis of
the data at multiple granularities.
4. In Generalisation of the Data, low-level or primitive/raw data are replaced by higher-
level concepts through the use of concept hierarchies. For example, categorical
attributes are generalised to higher-level concepts, like street into city or country.
Similarly, the values for numeric attributes may be mapped to higher-level concepts,
like age into young, middle-aged, or senior.
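The sketch below (assumed data) illustrates the first transformation in the list, min-max normalisation, which scales an attribute so that its values fall in the range 0 to 1.

# Minimal sketch: min-max normalisation of an attribute into the range [0, 1].
import numpy as np

age = np.array([18, 25, 40, 60, 72], dtype=float)
age_scaled = (age - age.min()) / (age.max() - age.min())
print(age_scaled)   # values now lie between 0 and 1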
4. Data Reduction
Complex data analysis and mining on huge amounts of data may take a very long time,
making such analysis impractical or infeasible. Data Reduction techniques help in
analysing a reduced representation of the data set without compromising the integrity of
the original data, while still producing quality knowledge. Strategies for data reduction
include the following:
1. In Data Cube Aggregation, aggregation operations are applied to the data in the
construction of a data cube (see the sketch after this list).
2. In Dimension Reduction, irrelevant, weakly relevant, or redundant attributes or
dimensions may be detected and removed.
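The sketch below (column names are assumed) illustrates the first strategy: daily sales data aggregated into monthly totals, as would happen when constructing a data cube.

# Minimal sketch: roll daily sales up to monthly totals (data cube aggregation).
import pandas as pd

daily = pd.DataFrame({
    'date':  pd.to_datetime(['2024-01-05', '2024-01-20', '2024-02-03', '2024-02-28']),
    'sales': [1200, 800, 950, 400],
})

monthly = daily.groupby(daily['date'].dt.to_period('M'))['sales'].sum()
print(monthly)   # 2024-01 -> 2000, 2024-02 -> 1350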
Data Preprocessing
• Steps that should be applied to make the data more suitable for data mining.
• Consists of a number of different strategies and techniques that are interrelated in
complex ways.
Goal:
• To improve the data mining analysis with respect to time, cost, and quality.
Aggregation
• Quantitative attributes are typically aggregated by taking a sum or an average.
• A qualitative attribute can either be omitted or summarized.
Disadvantage of aggregation
• Potential loss of interesting details.
Sampling
Sampling Approaches
• Random sampling.
• Progressive or Adaptive Sampling
Random sampling
• Sampling without replacement: as each item is selected, it is removed from the set
of all objects that together constitute the population.
• Sampling with replacement: objects are not removed from the population as they
are selected for the sample. Same object can be picked more than once.
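The difference between the two schemes is easy to see in code. The sketch below (toy population) draws one sample without replacement and one with replacement using NumPy.

# Minimal sketch: random sampling with and without replacement (toy population).
import numpy as np

population = np.arange(100)                  # 100 data objects
rng = np.random.default_rng(seed=42)

without_repl = rng.choice(population, size=10, replace=False)  # each object at most once
with_repl    = rng.choice(population, size=10, replace=True)   # an object may repeat
print(without_repl)
print(with_repl)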
Dimensionality reduction
• Data mining algorithms work better if the dimensionality - the number of
attributes in the data - is lower.
• Eliminate irrelevant features and reduce noise.
• Lead to a more understandable model due to fewer attributes.
• Allow the data to be more easily visualized.
• Amount of time and memory required by the data mining algorithm is reduced.
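One standard technique for this is Principal Component Analysis (PCA); the sketch below (random toy data, using scikit-learn) reduces four attributes to two components. This is only one possible approach, not the only one.

# Minimal sketch: reduce 4 attributes to 2 principal components with PCA.
import numpy as np
from sklearn.decomposition import PCA

X = np.random.default_rng(0).normal(size=(50, 4))   # 50 objects, 4 attributes
reduced = PCA(n_components=2).fit_transform(X)
print(reduced.shape)   # (50, 2)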
Binarization
• Transform both continuous and discrete attributes into one or more binary
attributes.
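A common form of binarization is one-hot encoding, sketched below with an assumed categorical attribute: each category becomes its own 0/1 attribute.

# Minimal sketch: one-hot encode a discrete attribute into binary attributes.
import pandas as pd

df = pd.DataFrame({'size': ['small', 'large', 'medium', 'small']})
binary = pd.get_dummies(df['size'], prefix='size')   # size_large, size_medium, size_small
print(binary)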
Variable transformation
• A transformation that is applied to all the values of a variable.
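A typical example is a log transformation applied to every value of a skewed attribute, as in the sketch below (assumed income values).

# Minimal sketch: a log transformation applied to all values of an attribute.
import numpy as np

income = np.array([20_000, 35_000, 50_000, 1_200_000], dtype=float)
log_income = np.log10(income)   # compresses the very large value
print(log_income)               # roughly [4.30, 4.54, 4.70, 6.08]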