Unit 2


Syllabus

Introduction to analytics:
• Four types of analytics to improve decision making:
• Descriptive analytics
• Diagnostic analytics
• Predictive analytics
• Prescriptive analytics
Data analytics capabilities
Descriptive analytics
• Answers: what happened?
• Mines raw data from multiple sources.
• Gives valuable insight into the past.
• Examples: a health-care provider analyzing patient records, a BI consultancy analyzing a category of products, a retailer finding the average sales of a month.
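As a minimal sketch of the retailer example above (the data frame and column names are hypothetical), average monthly sales can be computed in R:

# Hypothetical sales records: one row per transaction
sales <- data.frame(
  month  = c("Jan", "Jan", "Feb", "Feb", "Feb", "Mar"),
  amount = c(120, 80, 200, 150, 50, 300)
)

# Descriptive analytics answers "what happened":
# here, the average sale amount per month
aggregate(amount ~ month, data = sales, FUN = mean)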
Diagnostic analytics
• Answers: why something happened.
• Finds dependencies and identifies patterns.
• Gives deep insight into the problem.
• Examples: a retailer comparing sales by subcategory, a health-care provider comparing patients' responses.
Predictive analytics
• Answers: what is likely to happen.
• Uses the results of descriptive and diagnostic analytics.
• A tool for forecasting.
• Example: predictive analytics allows a leading FMCG company to predict what it could expect after changing its brand positioning.
Prescriptive analytics
• Answers: what action to take.
• Helps eliminate future problems.
• Points to promising trends.
• Example: a multinational company was able to identify opportunities for repeat purchases based on customer analytics and sales history.
Places where analytics is used:
Reporting vs Analytics:

• Reporting is presenting the results of data analysis.
• Analytics is the process or system involved in the analysis of data to obtain a desired output.
Introduction to tools and Environment:

Analytics is nowadays used in all fields, ranging from medical science to aerospace to government activities. Data science and analytics are used by manufacturing companies as well as real-estate firms to develop their business and solve various issues with the help of historical databases.

• Tools are the software that can be used for analytics, like SAS (Statistical Analysis System) or R.
• Techniques are the procedures to be followed to reach a solution.
• Various steps involved in analytics:
1. Access
2. Manage
3. Analyze
4. Report
Various analytics techniques are:
1. Data Preparation
2. Reporting, Dashboards & Visualization
3. Segmentation
4. Forecasting
5. Descriptive Modeling
6. Predictive Modeling
7. Optimization
Application of Modeling in Business:
• A statistical model embodies a set of assumptions concerning the generation of the observed data, and of similar data from a larger population.
• A model represents an idealized form of the data-generating process.
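As an illustrative sketch (not from the source), the following R code assumes a simple linear data-generating process and fits a statistical model that embodies that assumption:

set.seed(1)

# Assumed data-generating process: y = 2 + 3x + noise
x <- runif(100)
y <- 2 + 3 * x + rnorm(100, sd = 0.5)

# The fitted model is an idealized form of that process
fit <- lm(y ~ x)
coef(fit)  # estimates should be close to the true values 2 and 3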

Signal processing is an enabling technology that encompasses the fundamental theory, applications, algorithms, and implementations of processing or transferring information contained in many different physical, symbolic, or abstract formats broadly designated as signals.

• In manufacturing, statistical models are used to define warranty policies, solve various conveyor-related issues, perform statistical process control, etc.
Databases can be categorized as either:
• Relational
• Non-relational

The main difference between these is how they store their information.
Relational database

• Stores information in tables.
• Maintains structured data.
• Uses Structured Query Language (SQL).
• The database contains tables consisting of columns and rows. When new data is added, new records (rows) are inserted into existing tables or new tables are added. Relationships can then be made between two or more tables.
• Relational databases are used when the data they contain doesn't change very often, and when accuracy is crucial.
• Used mostly in financial applications.
• For example, a shop could store details of their customers' names and addresses in one table and details of their orders in another.
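A minimal sketch of the shop example, using R data frames as stand-in tables (the column names and values are hypothetical); merge() plays the role of an SQL join on the shared key:

# "Customers" table
customers <- data.frame(
  customer_id = c(1, 2),
  name        = c("Asha", "Ravi"),
  address     = c("12 Main St", "4 Hill Rd")
)

# "Orders" table, related to customers via customer_id
orders <- data.frame(
  order_id    = c(101, 102, 103),
  customer_id = c(1, 1, 2),
  total       = c(250, 99, 480)
)

# Equivalent of: SELECT ... FROM customers JOIN orders ON customer_id
merge(customers, orders, by = "customer_id")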
Non-relational database

• Often called NoSQL databases.
• Store their data in a non-tabular form.
• Non-relational databases are based on data structures like documents. A document can be highly detailed while containing a range of different types of information in different formats.
• Non-relational databases are much more flexible than relational databases.
• Non-relational databases are often used when large quantities of complex and diverse data need to be organized.
• For example, a large store might have a database in which each customer has their own document containing all of their information, from name and address to order history and credit card information.
• Non-relational databases can perform faster since a query doesn't have to view several tables in order to deliver an answer.
• Non-relational databases are ideal for storing data that may be changed frequently or for applications that handle many different kinds of data.
• They support rapidly developing applications requiring a dynamic database able to change quickly and to accommodate large amounts of complex, unstructured data.
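As a rough sketch (not from the source) of the document idea, a nested R list can stand in for a customer document; unlike a table row, each document can carry its own set of fields:

# One customer "document" holding varied, nested information
doc <- list(
  name          = "Asha",
  address       = "12 Main St",
  order_history = list(
    list(order_id = 101, total = 250),
    list(order_id = 102, total = 99)
  )
)

# A second document need not share the same schema
doc2 <- list(name = "Ravi", loyalty_tier = "gold")

doc$order_history[[1]]$total  # access a nested field: 250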
The five differences of SQL vs NoSQL:

• SQL databases are relational, NoSQL are non-relational.
• SQL databases use SQL and have a predefined schema. NoSQL databases have dynamic schemas for unstructured data.
• SQL databases are vertically scalable, NoSQL databases are horizontally scalable.
• SQL databases are table based, while NoSQL databases are document, key-value, graph or wide-column stores.
• SQL databases are better for multi-row transactions, NoSQL are better for unstructured data like documents or JSON.
Difference between SQL database and NoSQL Database
Types of data variables
Data can be categorized as:
• Categorical
• Numeric

Numeric data can be further divided into:
• Discrete
• Continuous

Categorical data can be divided into 3 categories:
• Nominal
• Binary
• Ordinal

Based on usage, data is divided into 2 categories:
• Quantitative
• Qualitative

Ex: manufacturing industries also have their data divided into the groups discussed above. For instance, production quantity is a discrete quantity while production rate is continuous data. Similarly, quality parameters can be given ratings, which are ordinal data.
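A short sketch (with made-up values) of how these variable types are typically represented in R: factors for nominal data, ordered factors for ordinal data, and integer/double vectors for discrete and continuous data:

# Nominal: categories with no order
colour <- factor(c("red", "blue", "red"))

# Ordinal: categories with a meaningful order (quality ratings)
rating <- factor(c("low", "high", "medium"),
                 levels = c("low", "medium", "high"), ordered = TRUE)

# Discrete: production quantity (integer counts)
quantity <- c(10L, 12L, 9L)

# Continuous: production rate (real-valued)
rate <- c(3.7, 4.05, 3.96)

rating[1] < rating[2]  # ordinal comparison works: TRUE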
Attribute
An attribute is a data field that represents a characteristic or feature of a data object.
For a customer object, attributes can be customer ID, address, etc.
The set of attributes used to describe a given object is known as an attribute vector or feature vector.

Types of attributes:

We differentiate between the different types of attributes and then preprocess the data accordingly. It is the first step in data preprocessing.

Attribute types:

1. Qualitative --- categorical (Nominal (N), Ordinal (O), Binary (B))
2. Quantitative --- numerical (Discrete, Continuous)

Qualitative Attributes
Nominal Attributes – related to names
The values of a nominal attribute are names of things, some kind of symbols.
Values of nominal attributes represent some category or state and hence are referred to as categorical attributes.
There is no order (rank, position) among values of a nominal attribute.
Ex: hair colour, occupation.
Binary Attributes: binary data has only 2 values/states.
For example: yes or no, affected or unaffected, true or false.

i) Symmetric: both values are equally important (e.g., gender).

ii) Asymmetric: both values are not equally important (e.g., the result of a medical test).
Ordinal Attributes: ordinal attributes contain values that have a meaningful sequence or ranking (order) between them, but the magnitude between values is not actually known; the order of values shows what is important but does not indicate how important it is.
Ex: quality ratings such as low, medium, high.
Quantitative Attributes
Numeric: a measurable quantity, represented in integer or real values.
Numerical attributes are of 2 types: interval and ratio.

i) An interval-scaled attribute has values whose differences are interpretable, but the attribute does not have a true reference point, or what we can call a zero point.
Data can be added and subtracted at interval scale but cannot be multiplied or divided.
Consider the example of temperature in degrees Centigrade: if the temperature of one day is twice that of another day, we cannot say that one day is twice as hot as the other.

ii) A ratio-scaled attribute is a numeric attribute with a fixed zero point. If a measurement is ratio-scaled, we can speak of a value as being a multiple (or ratio) of another value. The values are ordered, and we can also compute the difference between values; the mean, median, mode, quantile range, and five-number summary can be given.
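A brief sketch in R (with made-up values) of the summaries listed above for a ratio-scaled attribute:

weights <- c(55, 61, 64, 70, 72, 80, 95)  # ratio-scaled: a true zero exists

mean(weights)     # arithmetic mean
median(weights)   # middle value
IQR(weights)      # inter-quartile (quantile) range
fivenum(weights)  # five-number summary: min, Q1, median, Q3, max

# Ratio statements are meaningful here: 80 kg is twice 40 kg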
Data modelling techniques:
Data modelling is the process through which data is stored structurally, in a defined format, in a database.
Data modelling is important because it enables organizations to make data-driven decisions and meet varied business goals.
• The entire process of data modelling is not as easy as it seems.
• We need to have a deep understanding of the structure of an organization and then propose a solution that aligns with its end goals and suffices in achieving the desired objectives.
Types of Data Models
Data modeling can be achieved in various ways.
Commonly used data modeling methods are:

• Hierarchical model
• Relational model
• Network model
• Object-oriented model
• Entity relationship model
Hierarchical model:

This data model makes use of hierarchy to structure the data in a tree-like format. However, retrieving and accessing data is difficult in a hierarchical database. It is rarely used now.
Relational model:
• An alternative to the hierarchical model.
• Here data is represented in the form of tables.
• It reduces complexity.
• Provides a clear overview of the data.
Network model
• The network model is inspired by the hierarchical model.
• However, unlike the hierarchical model, this model makes it easier to convey complex relationships, as each record can be linked with multiple parent records.
Object-oriented model
• Consists of a collection of objects, each with its own features and methods.
• It is also called the post-relational database model.
Entity-relationship model:
• Also known as the ER model, it represents entities and their relationships in a graphical format.
• An entity could be anything: a concept, a piece of data, or an object.
Importance of Data Modeling
• A clear representation of data makes it easier to analyze the data properly. It provides a quick overview of the data, which can then be used by developers in varied applications.
• Data modeling represents the data properly in a model. It reduces the chances of data redundancy and omission. This helps in clear analysis and processing.
• Data modeling improves data quality and enables the concerned stakeholders to make data-driven decisions.
Missing Imputations
An object may be missing one or more attribute values.
Reasons:
• Information was not collected. For example, some people decline to give their phone numbers or age details.
• Some attributes are not applicable to all objects.

Regardless, missing values should be taken into account during the data analysis.
Strategies for dealing with missing data:

• Eliminate data objects or attributes.
• Estimate missing values, e.g., using the average attribute value of the nearest neighbours.
• Ignore the missing values during analysis.
• Fill in the missing value manually.
• Use a global constant to fill in the missing value.
• Use a measure of central tendency for the attribute (see the sketch after this list).
• Use the attribute mean or median for all samples belonging to the same class as the given tuple.
• Use the most probable value to fill in the missing value.
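As a minimal sketch of the central-tendency strategy (the vector is hypothetical):

x <- c(4, 8, NA, 6, NA, 10)

# Replace each missing value with the mean of the observed values
x[is.na(x)] <- mean(x, na.rm = TRUE)
x  # returns 4 8 7 6 7 10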
• In R, missing values are represented by the symbol NA (not available).
• Undefined values (e.g., the result of 0/0) are represented by the symbol NaN (not a number).
• Testing for missing values:

y <- c(1,2,3,NA)
is.na(y)
# returns the vector FALSE FALSE FALSE TRUE

• Arithmetic functions on missing values yield missing values.
• For example:

x <- c(1,2,NA,3)
mean(x)
# returns NA
Handling missing values / imputing missing values

Remove missing values (one method)

To remove missing values from our dataset we use the na.omit() function.
For example, we can create a new dataset without the missing data as below:

newdata <- na.omit(mydata)

Or, we can pass na.rm=TRUE as an argument to the function. Using the vector x from the example above, we get the desired result:

x <- c(1,2,NA,3)
mean(x, na.rm=TRUE)
# returns 2
MICE package -> Multiple Imputation by Chained Equations

MICE uses PMM to impute missing values in a dataset.

PMM -> Predictive Mean Matching

• PMM is a semi-parametric imputation approach.
• It is similar to the regression method except that, for each missing value, it fills in a value drawn randomly from among the observed donor values of the observations whose regression-predicted values are closest to the regression-predicted value for the missing value under the simulated regression model.
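A minimal usage sketch, assuming the mice package is installed; nhanes is a small example dataset with missing values that ships with the package:

library(mice)

# Impute with predictive mean matching; m = 5 completed datasets
imp <- mice(nhanes, method = "pmm", m = 5, seed = 500)

# Extract the first completed (imputed) dataset
completed <- complete(imp, 1)
head(completed)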
