Data Science Training On Statistical Techniques For Analytics
DATA MINING
Statistics/
AI
Database
systems
APPLICATIONS
Banking
Customer relationship management
Targeted marketing
Fraud detection
Manufacturing and production
This is the information about the domain we are mining, such as concept
hierarchies, used to organize attributes onto various levels of abstraction.
DATA SOURCE
Application-Orientation (Operational Database): Loans, Credit Card, Trust, Savings
Subject-Orientation (Data Warehouse): Customer, Vendor, Product, Activity
Data source | Industry | Usage | Technology | Volumes
Customer File | All | Track customer details | Legacy application, flat files, mainframes | Small-medium
Account Balance | Finance | Control account activities | Legacy applications, hierarchical databases, mainframe | Large
Point-of-Sale data | Retail | Generate bills, manage stock | ERP, Client/Server, relational databases | Very Large
Call Record | Telecommunications | Billing | Legacy application, hierarchical database, mainframe | Very Large
Production Record | Manufacturing | Control production | ERP, relational databases, AS/400 | Medium
Data Marts
Data Warehouse
DATA VISUALIZATION:
SEEING THE DATA
VISUAL PRESENTATION
For any kind of high dimensional data set,
displaying predictive relationships is a challenge.
The picture on the previous slide uses 3-D
graphics to portray the weather balloon data
We learn very little from just examining the
numbers.
Shading is used to represent relative degrees of
thunderstorm activity, with the darkest regions
indicating the heaviest activity.
A BIT OF HISTORY
An early effort used sequences of two-dimensional graphs to add depth.
Current virtual reality programs allow the user
to step through a data set. Try going to a
realtor's website and taking a tour of a house up
for sale.
[Figure legend: High Activity / Low Activity]
2003, Prentice-Hall
AN ENLIVENED RISK
ANALYSIS REPORT
GEOGRAPHICAL INFORMATION
SYSTEMS
A GIS is a special purpose database that contains a
spatial coordinate system. A comprehensive
GIS requires:
1.
2.
3.
4.
Companies that use data mining for target marketing walk a tightrope
between personalization and privacy.
Further, technology appears to create new ways to acquire information
faster than the legal system can handle the ethical and property
issues.
Nonetheless, many view information as a natural resource that should
be managed as such.
STATISTICS
Numerical representations of our data
Can be:
Descriptive
STATISTICS
Descriptive objectives/
research questions:
Descriptive
statistics
Inferential
Statistics
DESCRIPTIVE STATISTICS
Can be applied to any measurements
(quantitative or qualitative)
Offers a summary/ overview/ description of data.
Does not explain or interpret.
DESCRIPTIVE STATISTICS
Number
Frequency Count
Percentage
Deciles and quartiles
Measures of Central
Tendency (Mean,
Median, Mode)
Variability
Graphs
Normal Curve
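The descriptive measures listed above can be sketched in base R. This is only an illustration: the scores vector is made-up data, and the variable names are assumptions rather than anything from the slides.

```r
# Hypothetical sample data (not from the original slides)
scores <- c(55, 60, 60, 62, 70, 71, 75, 80, 85, 92)

n    <- length(scores)            # number of observations
freq <- table(scores)             # frequency count of each distinct value
pct  <- prop.table(freq) * 100    # percentage of each distinct value

# Quartiles and deciles via quantile()
quartiles <- quantile(scores, c(0.25, 0.50, 0.75))
deciles   <- quantile(scores, seq(0.1, 0.9, by = 0.1))

# Measures of central tendency
avg <- mean(scores)
med <- median(scores)
# Base R has no built-in statistical mode, so take the most frequent value
mode_value <- as.numeric(names(freq)[which.max(freq)])

# Variability
spread <- sd(scores)
value_range <- range(scores)
```

None of these calls explains or interprets the data; they only summarize it, which is the point of descriptive statistics.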
SHAPES OF DISTRIBUTION
Normal Curve (aka Bell Curve)
Repeated sampling of a population tends to produce
sample means that follow an approximately normal
distribution: values cluster around a central tendency.
In a symmetrical distribution, median, mode and
mean all fall at the same point
INFERENTIAL STATISTICS
Hypothesis Testing
LEVELS OF SIGNIFICANCE
p = probability
< = less than (> = more than)
ERROR TYPE
Type I error: reject the null hypothesis when it is really true
Type II error: fail to reject the null hypothesis when it is really false
PROBABILITY
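By using inferential statistics to make decisions,
we can state the probability of a Type I error that
we are willing to accept (the significance level)
and report the p value we obtained.
The p value is the probability of seeing data at least
as extreme as ours if the null hypothesis were true.
By reporting it, we alert readers to the risk that we
were incorrect when we decided to reject the null hypothesis.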
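As a rough sketch of how a p value is obtained and compared with a significance level, the example below runs a one-sample t test in R. The data are simulated, and the hypothesized mean of 100 and the level of 0.05 are assumptions chosen only for illustration.

```r
set.seed(1)  # make the simulated data reproducible

# Hypothetical measurements (simulated, not from the slides)
sample_values <- rnorm(30, mean = 105, sd = 10)

# Null hypothesis: the population mean is 100
result <- t.test(sample_values, mu = 100)

alpha <- 0.05                        # assumed significance level
p_value <- result$p.value            # p value reported by the test
reject_null <- p_value < alpha       # TRUE means we reject the null hypothesis

print(p_value)
print(reject_null)
```

If reject_null is TRUE, the decision carries a risk of a Type I error; if it is FALSE, the decision carries a risk of a Type II error.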
PARAMETRIC AND
NONPARAMETRIC STATISTICS
Parametric
SELECTING AN APPROPRIATE
STATISTICAL TEST
Data mining functionalities are used to specify the kind of patterns to be found
in data mining tasks.
Decisions in data mining:
Kinds of databases to be mined
Kinds of knowledge to be discovered
Kinds of techniques utilized
Kinds of applications adapted
Deviation detection
Regression
Classification
Collaborative Filtering
Medicine:
Molecular/Pharmaceutical:
drugs
Scientific data analysis:
identify
clusters
Web
identify new
find
Pre-processing: cleaning
Transformation:
Classification
(Supervised learning)
CLASSIFICATION
[Diagram: Previous customers (Age, Salary, Profession, Location, Customer type) -> Classifier -> Decision rules (Salary > 5 L, Prof. = Exec); New applicant's data -> Good/bad]
CLASSIFICATION METHODS
Goal: Predict class Ci = f(x1, x2, .. Xn)
Regression: (linear or any other polynomial)
a*x1 + b*x2 + c = Ci.
Nearest neighbour
Decision tree classifier: divide decision space into
piecewise constant regions.
Probabilistic/generative models
Neural networks: partition by non-linear boundaries
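To make one of the methods above concrete, here is a minimal nearest-neighbour sketch in base R. The training table, its values, and the function name classify_nn are all hypothetical; the example only illustrates predicting a class Ci from attributes x1 and x2.

```r
# Hypothetical previous customers (values are made up for illustration)
train <- data.frame(
  age    = c(25, 45, 35, 50, 23),
  salary = c(3, 8, 6, 9, 2),          # assumed to be in lakhs
  type   = c("bad", "good", "good", "good", "bad"),
  stringsAsFactors = FALSE
)

# 1-nearest-neighbour: give a new applicant the class of the closest
# previous customer, after scaling each attribute by its standard deviation
classify_nn <- function(new_age, new_salary) {
  d_age    <- (train$age - new_age) / sd(train$age)
  d_salary <- (train$salary - new_salary) / sd(train$salary)
  distance <- sqrt(d_age^2 + d_salary^2)
  train$type[which.min(distance)]
}

applicant_a <- classify_nn(48, 8.5)
applicant_b <- classify_nn(24, 2.5)
```

Scaling matters here because age and salary are on different scales; without it the attribute with the larger numbers would dominate the distance.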
Clustering or
Unsupervised Learning
CLUSTERING
Unsupervised
each
APPLICATIONS
Customer
marketing
Group/cluster
Collaborative
group
Text
filtering:
clustering
Compression
DISTANCE FUNCTIONS
Numeric
distance (# dissimilarity)
Jaccard coefficients: #similarity in 1s/(# of 1s)
Data-dependent measures: similarity of A and
B depends on co-occurrence with C.
Combined
weighted
normalized distance:
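The distance functions named above can be sketched in a few lines of R. The vectors, attribute ranges, and weights below are assumptions made up for the example; the Jaccard coefficient is computed as the number of positions where both vectors are 1 divided by the number of positions where at least one is 1.

```r
# Binary attributes (hypothetical)
a <- c(1, 0, 1, 1, 0)
b <- c(1, 1, 0, 1, 0)

# Jaccard coefficient: shared 1s / positions with a 1 in either vector
jaccard <- sum(a == 1 & b == 1) / sum(a == 1 | b == 1)

# Numeric attributes (hypothetical): Euclidean distance
p <- c(2, 4)
q <- c(5, 8)
euclid <- sqrt(sum((p - q)^2))

# Combined weighted, normalized distance: divide each attribute difference
# by an assumed attribute range, then combine using assumed weights
ranges  <- c(10, 20)
weights <- c(0.5, 0.5)
weighted <- sum(weights * abs(p - q) / ranges)
```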
DATA TYPES
Qualitative (or categorical) data consist of
values that can be separated into different
categories that are distinguished by some
nonnumeric characteristic.
Quantitative data consist of values representing
counts or measurements.
LEVELS OF MEASUREMENT
Nominal
Ordinal
Interval
Ratio
Nominal and ordinal are qualitative (categorical)
levels of measurement.
Interval and ratio are quantitative levels of
measurement.
TYPES OF QUALITATIVE
MEASUREMENTS
Nominal level of measurement: classifies data
into names, labels or categories in which no order
or ranking can be imposed. Example: the
number of courses offered in each of the different
colleges.
Ordinal level of measurement: classifies data
into categories that can be ordered or ranked, but
precise differences between the ranks do not
exist. Generally it does not make sense to do
calculations with data at the ordinal level.
Example: letter grades of A, B, C, D, and F.
TYPES OF QUANTITATIVE
MEASUREMENTS
APPLICATION AREAS
Industry | Application
Finance | Credit Card Analysis
Insurance | Claims, Fraud Analysis
Telecommunication | Call record analysis
Transport | Logistics management
Consumer goods | Promotion analysis
Data Service providers | Value added data
Utilities | Power usage analysis
WHY NOW?
Data is being produced
Data is being warehoused
The computing power is available
The computing power is affordable
The competitive pressures are strong
Commercial products are available
Warehousing provides
the Enterprise with a memory
USAGE SCENARIOS
Data
warehouse mining:
assimilate
Mining
log data
Continuous mining: example in process
control
Stages in mining:
MINING MARKET
Around
Enterprise Miner
IBM SPSS & AMOS
R
Met-Lab
Rapid Miner
Tanagra
Weka
Clementine,
All
Heavy
Tedious
Ideal
Clustering
[Pilot software]
segment customers to define hierarchy on that dimension
Time
Query
Sarawagi
[VLDB2000]
INTRODUCTION
TO
R LANGUAGE
WHAT IS R
R comes with a standard set of packages. Others are available for download
and installation. Once installed, they have to be loaded into every session
in which they are used.
There are about eight packages supplied with the R distribution, and many
more are available through the CRAN family of Internet sites, covering a very
wide range of modern statistics.
To download and install other packages, visit the
http://cran.r-project.org/web/packages/ site. There you can find a list of
the packages currently available in R's CRAN package repository. At the time
of writing, the CRAN package repository had 4366 available packages.
BENEFITS OF R
LIMITATION OF R
HOW TO INSTALL R
On the next page, you are given a choice of R setups for different operating
systems. Choose the option to download R according to your OS. Windows users
click on Download R for Windows.
One more page opens, giving general information and a link named
install R for the first time. Click on it and you will see
Download R 2.15.3 for Windows (47 megabytes, 32/64 bit), which downloads the R
installer (.exe) to your PC/laptop.
Double click on the R icon on the desktop to open an R Console window.
If your current directory is C:\ and you want to work on files that
are in D:\practise, use the setwd() command as follows:
> setwd("D:\\practise")
STARTING WITH R
In the R console window, the > sign shown in red at the start of a line is
called the prompt. When a command is too long to fit on a line, a + sign is
used as the continuation prompt.
In R, the <- sign is used to assign a value to a variable. The = sign has also
been valid for assignment since R version 1.4.0, but <- remains the conventional choice.
For example, str can be a variable that stores a string such as "Hello,
I am R Console Window". To show the output, we just type the variable name and
press Enter. Using the <- operator, we can assign numbers, text, vectors,
data frames, lists, etc.
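A minimal sketch of assignment with <-, using the string mentioned above. The variable name marks and its values are illustrative assumptions.

```r
# Assign a string to the variable str using <-
str <- "Hello, I am R Console Window"
str                      # typing the name prints the stored value

# The same operator assigns other kinds of objects, e.g. a numeric vector
marks <- c(50, 69, 41)
marks

# = also works for assignment, but <- is the conventional choice
total = sum(marks)
total
```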
Basic Commands of R
To print some text without storing it in a variable, data frame or vector, we
use the print() command.
Example:
> print("Hello, I am programmer")
Output:
[1] "Hello, I am programmer"
[1] 50 69 41 89 75 69 25 85 75 39 81 65
R is organized into packages. R has a large number of packages for performing statistical
and graphical techniques.
In R, to obtain a list of all packages available at a given mirror site, use the available.packages() command.
Installing Packages
1. Using the install.packages() command.
Example: > install.packages("tm")
2. Using the Packages menu: click on Packages -> Install Package(s). This opens a
window listing the available packages; select the appropriate package and click
OK.
Removing Packages: to remove an installed package, use the remove.packages() command.
Example: > remove.packages("tm")
Updating Packages: R is open source software, so contributors regularly
improve it and release new or updated packages and libraries. To update all
installed packages, use the update.packages() command. For each out-of-date
package that is found, you will be prompted to confirm the update. In this
case, type "y" and press Enter to continue, and the update is downloaded
automatically.
Example: > update.packages()
IMPORTING DATA IN R
When we start working with R, we need data on which to run statistical and
graphical techniques. To import data into R, we use different commands for
different types of files. To import files written by other statistical
software, we must install and load the foreign package. So, first install and
load the foreign package in the R console window.
Generally, the foreign package provides functions for reading and
writing data stored by statistical packages such as Minitab, S, SAS, SPSS,
Stata, Systat, and dBase files.
Price, Qty
1, Rice, 12000, 20kg
2, Sugar, 15000, 35kg
3, Tea, 18000, 30kg
4, Daal, 25000, 23kg
5, wheat, 56000, 25kg
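Comma-separated rows like the ones above can be read with read.csv(), which is part of base R and needs no extra package. In this sketch the data are supplied inline through the text argument so that it runs without a file; the column names no, item, price and qty are assumptions, since the original header is incomplete.

```r
# Inline copy of the sample rows; column names are assumed
csv_text <- "no,item,price,qty
1,Rice,12000,20kg
2,Sugar,15000,35kg
3,Tea,18000,30kg
4,Daal,25000,23kg
5,wheat,56000,25kg"

# read.csv() accepts a file path, or inline text via the text argument
items <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# qty is read as text because of the "kg" suffix; strip it to get numbers
items$qty_kg <- as.numeric(sub("kg", "", items$qty))

str(items)
```

With a real file, the first argument would be the path, for example a file saved in the working directory.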
Excel File: To import an Excel file into R, we must install and load the
gdata package (for the .xls file format).
Example Code:
> library(gdata)
> data <- read.xls("D:\\data.xls")
For the .xlsx file format, install and load the xlsx package.
Example Code:
> library(xlsx)
> library(xlsxjars)
> data1 <- read.xlsx("D:\\data1.xlsx", 1)
Where the first argument is the path or URL of the xlsx file, and the second,
numeric argument is the sheet index within the file.
SPSS File:
Example Code:
> library(foreign)
> data1 <- read.spss("D:\\datafile.spss", use.value.labels=TRUE,
max.value.labels = 10)
VECTOR IN R
Example:
>length(numvec)
[1] 5
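A short sketch of creating and using a vector such as numvec. The values are assumptions chosen so that length(numvec) returns 5, matching the output above.

```r
# c() combines values into a vector; these values are illustrative
numvec <- c(10, 20, 30, 40, 50)

vec_length <- length(numvec)   # number of elements
second     <- numvec[2]        # indexing starts at 1
doubled    <- numvec * 2       # arithmetic is applied element by element
large      <- numvec[numvec > 25]  # logical indexing keeps matching elements

# Vectors hold a single type; mixing numbers and text gives a character vector
mixed <- c(1, "a")
class(mixed)
```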
MATRIX IN R
Example Code:
> datamtrx <- matrix(c(1,2,3, 11,12,13), nrow = 2, ncol = 3, byrow = TRUE, dimnames =
list(c("row1", "row2"), c("C.1", "C.2", "C.3")))
> datamtrx
     C.1 C.2 C.3
row1   1   2   3
row2  11  12  13
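     [,1] [,2] [,3] [,4]
[1,]    1    6   11   16
[2,]    2    7   12   17
[3,]    3    8   13   18
[4,]    4    9   14   19
[5,]    5   10   15   20

     [,1] [,2] [,3] [,4]
[1,]    1    2    3    4
[2,]    5    6    7    8
[3,]    9   10   11   12
[4,]   13   14   15   16
[5,]   17   18   19   20

     [,1] [,2] [,3] [,4] [,5]
[1,]    1    5    9   13   17
[2,]    2    6   10   14   18
[3,]    3    7   11   15   19
[4,]    4    8   12   16   20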
700
760
820
880
Set to zero the elements of m1 that are less than 15, but note that m1[m1<15] is a vector:
> m1[m1<15]=0
> m1
     [,1] [,2] [,3] [,4] [,5]
[1,]    0    0    0    0   17
[2,]    0    0    0    0   18
[3,]    0    0    0   15   19
[4,]    0    0    0   16   20
How to merge matrices m1 and m2: use the rbind() function, which stacks two
matrices vertically. Both matrices must have the same number of columns.
> m1=matrix(1:20,ncol=5)
> m2<-matrix(1:20,nrow=4,byrow=T)
> rbind(m1,m2)
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    5    9   13   17
[2,]    2    6   10   14   18
[3,]    3    7   11   15   19
[4,]    4    8   12   16   20
[5,]    1    2    3    4    5
[6,]    6    7    8    9   10
[7,]   11   12   13   14   15
[8,]   16   17   18   19   20
How to calculate the sum of the rows or columns: use the apply() function. Set
1 as the second argument to calculate the row sums of a matrix, and set 2 to
calculate the column sums.
> apply(m2,1,sum)
[1] 15 40 65 90
> apply(m2,2,sum)
[1] 34 38 42 46 50
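The apply() calls above can be checked against the dedicated rowSums() and colSums() functions in base R. This sketch assumes m2 is the 4 x 5 matrix filled by row, which is what the printed sums correspond to.

```r
# 4 x 5 matrix filled row by row (assumed definition of m2)
m2 <- matrix(1:20, nrow = 4, byrow = TRUE)

row_totals <- apply(m2, 1, sum)   # second argument 1: apply over rows
col_totals <- apply(m2, 2, sum)   # second argument 2: apply over columns

# rowSums() and colSums() give the same totals
same_rows <- all(row_totals == rowSums(m2))
same_cols <- all(col_totals == colSums(m2))

# apply() accepts any function, not only sum
col_means <- apply(m2, 2, mean)
```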
FACTOR IN R
Syntax:
factor(x = character(), levels, labels = levels, exclude = NA, ordered =
is.ordered(x))
Where, x = a vector of data, usually taking a small number of distinct values.
levels: an optional vector of the values that x might have taken.
labels = an optional vector of labels for the levels.
exclude = a vector of values to be excluded when forming the set of levels.
This should be of the same type as x.
ordered = logical flag to determine if the levels should be regarded as ordered
Example Code:
> state=c("Tarang","Veer","shashi","sarang","Anand")
> statef=factor(state)
> levels(statef)
[1] "Anand" "sarang" "shashi" "Tarang" "Veer"
> incomes=c(60,59,40,42,23)
> tapply(incomes,statef,mean)
 Anand sarang shashi Tarang   Veer
    23     42     40     60     59
DATA FRAME IN R
The data you store in the columns of a data frame can be of various types,
i.e., one column might be a numerical variable, another might be a factor,
and a third might be a character variable. All columns have to be the same
length, meaning that they contain the same number of data items.
Syntax:
  no     name Income
1  1     Ajay  10000
2  2 Jayendra  20000
3  3      Raj  30000
To know the names of the columns in the data frame object, we use the names() command.
>names(df)
[1] "no"     "name"   "Income"
All components in a data frame can be extracted as vectors with the corresponding
name:
>attach(df)
>Income
[1] 10000 20000 30000
To remove a data frame from the R search path, we use the detach() function.
>detach()
      name Income gender
1     Ajay  10000
2 Jayendra  20000
3      Raj  30000
To sort the rows of a data frame, use the order() function on one of its columns.
> df
  no     name Income   address
1  1     Ajay  10000    Rajkot
2  2 Jayendra  20000 Ahmedabad
3  3      Raj  30000     Surat
> sort<-df[order(df[,"address"]), ]
> sort
  no     name Income   address
2  2 Jayendra  20000 Ahmedabad
1  1     Ajay  10000    Rajkot
3  3      Raj  30000     Surat
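The data frame steps in this section can be pulled together in one runnable sketch. The values are taken from the outputs shown above; the name sorted_df is an assumption used here to avoid reusing the name of the built-in sort() function.

```r
# Build the data frame column by column; all columns have the same length
df <- data.frame(
  no      = 1:3,
  name    = c("Ajay", "Jayendra", "Raj"),
  Income  = c(10000, 20000, 30000),
  address = c("Rajkot", "Ahmedabad", "Surat"),
  stringsAsFactors = FALSE
)

col_names <- names(df)      # column names
incomes   <- df$Income      # extract one column as a vector, without attach()

# order() returns the row positions that sort the address column
sorted_df <- df[order(df$address), ]
sorted_df
```

Using df$Income avoids attach() and detach(), so nothing is left on the search path afterwards.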
LIST IN R
Lists are used to bind vectors of different lengths and types, which is
impossible with matrices. Another advantage of lists is that you may assign
names to their components.
Example Code:
> n = c(2, 3, 5)
> s = c("aa", "bb", "cc", "dd", "ee")
> b = c(TRUE, FALSE, TRUE, FALSE, FALSE)
> x = list(n, s, b, 3)
>x
Output:
[[1]]
[1] 2 3 5
[[2]]
[1] "aa" "bb" "cc" "dd" "ee"
[[3]]
[1] TRUE FALSE TRUE FALSE FALSE
[[4]]
[1] 3
If we want to retrieve a particular list slice, we use the following code:
> x[2]
[[1]]
[1] "aa" "bb" "cc" "dd" "ee"
If you want to modify an element within a list slice, you can do so as
follows:
> x[[2]][1] = "ta"
> x[[2]]
[1] "ta" "bb" "cc" "dd" "ee"
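List components can also be given names, which makes them easier to retrieve than by position. The list below and its component names are hypothetical, chosen only to illustrate the idea.

```r
# A list with named components of different types and lengths
person <- list(name = "Ajay", scores = c(2, 3, 5), passed = TRUE)

# $ and [[ ]] return the component itself
person_name  <- person$name
first_score  <- person[["scores"]][1]

# Single brackets return a list slice, not the component
slice_class <- class(person["scores"])

# Components can be modified in place
person$scores[1] <- 9
updated_scores <- person$scores
```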
> log2(32)
[1] 5
> sqrt(2)
[1] 1.414214
[Figure: plot of sin(seq(0, 2 * pi, length = 100)) against Index, with the y-axis running from -1.0 to 1.0 and the x-axis from 0 to 100]
STATISTICS IN R
Introduction of Packages:
Descriptive statistics
Regression analysis (relational statistics such as linear, generalized linear
and nonlinear regression models)
Time series analysis
Clustering analysis (classification statistics), etc.
STATISTICAL FUNCTION IN R
If you are a beginner, use this command to list all the statistical
functions available in R's stats package.
>help(package=stats)
This command gives you a list of functions in alphabetical order, so that you can
find the one you need for your task.
Example Code:
> x<-c(155, 160, 171, 182, 162, 153, 190, 167, 168, 165, 191)
> median(x)
[1] 167
> x<-c(155, 160, 171, 182, 162, 153, 190, 167, 168, 165, 191, 175)
> median(x)
[1] 167.5
> x<-c(155, 160, 171, 182, 162, 153, 190, 167, 168, 165, NA, 191)
> median(x)
[1] NA
> median(x, na.rm=T)
[1] 167
Standard deviation: sd() computes the standard deviation of the values in x.
Syntax: sd(x, na.rm = FALSE)
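A brief sketch of sd() using the same height values as the median example above, including the effect of a missing value.

```r
# Same values as the median example, with one missing value (NA)
x <- c(155, 160, 171, 182, 162, 153, 190, 167, 168, 165, NA, 191)

sd_with_na <- sd(x)                 # NA, because a value is missing
sd_clean   <- sd(x, na.rm = TRUE)   # drop NA before computing

# The standard deviation is the square root of the variance
var_clean <- var(x, na.rm = TRUE)
matches   <- isTRUE(all.equal(sd_clean, sqrt(var_clean)))
```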
Arithmetic Operations:
+,
Matrix Arithmetic.
*
Assignment
To
D <- read.table(path, sep=",", header=TRUE)
Subsetting data.
Use a logical operator to do this, such as ==.
Example:
D[D$Gender == "M",]
This will return the rows of D where Gender is "M".
Remember R is case sensitive!
This code does nothing to the original dataset.
D.M <- D[D$Gender == "M",] gives a dataset with the
appropriate rows.
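The subsetting described above can be tried on a small in-memory data frame. The rows below are hypothetical; only the column name Gender and the names D and D.M come from the slides.

```r
# Hypothetical dataset standing in for the file read with read.table()
D <- data.frame(
  Name   = c("Asha", "Ravi", "Meena", "Karan"),
  Gender = c("F", "M", "F", "M"),
  Age    = c(23, 31, 27, 45),
  stringsAsFactors = FALSE
)

# Rows where Gender is "M"; the comma keeps all columns
D.M <- D[D$Gender == "M", ]

# Conditions can be combined with & (and) or | (or)
D.older <- D[D$Gender == "M" & D$Age > 40, ]

# R is case sensitive: "m" matches nothing here
D.none <- D[D$Gender == "m", ]

# The original dataset is unchanged
original_rows <- nrow(D)
```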
Case Study-1 (Part A & B)
GRAPHICS
IN
R LANGUAGE
THE AGENDA
Introduction to graphics in R language
Line Charts( code & Output)
Bar Charts ( code & Output)
Histograms ( code & Output)
Pie Charts ( code & Output)
Dot charts ( code & Output)
Colorful Dot charts ( code & Output)
How to save graph
INTRODUCTION TO GRAPHICS IN R
LANGUAGE
First, we start with how to plot points on a graph. We first load the
graphics package and then use the plot() function to put points on the graph.
Example:
> library(graphics)
> sample<-c(20,52,62,14,65,42,48,36,26)
> plot(sample)
TO CONTINUE THIS EXAMPLE, ADD A TITLE TO THE GRAPH AND CONNECT ALL THE
POINTS IN THE GRAPH.
EXAMPLE:
> plot(sample, type="o", col="blue")
> title(main="This is demo of line chart", col.main="red", font.main=4)
BAR CHARTS
Simple Bar chart: To create a simple bar chart in R, we use the barplot() function.
Example code:
> sample<-c(20,52,62,14,65,42,48,36,26)
> barplot(sample)
Where the main parameter gives the main title of the graph, and the
xlab and ylab parameters give the labels of the x- and y-axes of the
graph.
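A sketch of a bar chart that uses the main, xlab and ylab parameters described above. The title, axis labels and colour are assumptions, and the vector is named sample_values here rather than sample to avoid clashing with the built-in sample() function.

```r
# Same values as the simple bar chart example
sample_values <- c(20, 52, 62, 14, 65, 42, 48, 36, 26)

# barplot() draws the chart and returns the bar midpoints on the x-axis
bar_mids <- barplot(
  sample_values,
  main = "Demo of bar chart",   # main title (assumed text)
  xlab = "Observation",         # x-axis label (assumed text)
  ylab = "Value",               # y-axis label (assumed text)
  col  = "lightblue"            # bar colour (assumed)
)
```

When run non-interactively, R writes the plot to its default graphics device instead of opening a window.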
Where data1 is a data object in which we store the contents of a
tab-delimited file named test, read using the read.table() function.
HISTOGRAM
simple histogram:
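A simple histogram can be drawn with the hist() function from the graphics package. This is a minimal sketch: it reuses the values from the earlier plotting examples, and the title, label and colour are assumptions.

```r
# Same values as the earlier plotting examples
sample_values <- c(20, 52, 62, 14, 65, 42, 48, 36, 26)

# hist() groups the values into bins and draws one bar per bin;
# it also returns the bin breaks and counts
h <- hist(
  sample_values,
  main = "Demo of histogram",   # main title (assumed text)
  xlab = "Value",               # x-axis label (assumed text)
  col  = "lightgreen"           # bar colour (assumed)
)

bin_breaks <- h$breaks
bin_counts <- h$counts
```

Unlike a bar chart, which draws one bar per value, a histogram shows how many values fall into each interval.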
PIE CHART
DOT CHART
Example Code:
> product <- read.table("D:\\test.txt", header=T, sep="\t")
> dotchart(t(product))
Here, the t() function is new. The t() function returns the
transpose of its argument x, which is a matrix or data frame;
here it is named product.
Question?
Thank you!