Data Science With Python Class Room Notes - Quality Thought
EDITION
About Institute?
About Course?
Course 3: I want to apply statistics and Python to Machine Learning models for
prediction and classification of data in various industry segments for intelligent
business.
Course 7: Start learning Neural Networks using TensorFlow and Keras for image
classification and data extraction from images (OCR).
Course 9: Sensing real-world data and transforming it into intelligent actions using
IoT.
Nowadays there are many misconceptions around the words machine learning, deep
learning and artificial intelligence (AI). Most people think all these things are the same:
whenever they hear the word AI, they directly relate it to machine learning, or vice
versa. These fields are related to each other, but they are not the same. Let's see how.
Machine Learning:
Before talking about machine learning, let's talk about another concept called
data mining. Data mining is a technique of examining a large pre-existing database and
extracting new information from it. Machine learning does much the same thing; in fact,
machine learning is a type of data mining technique.
Here is a basic definition of machine learning:
"Machine Learning is a technique of parsing data, learning from that data and then applying
what has been learned to make an informed decision."
Nowadays many big companies use machine learning to give their users a better
experience. For example, Amazon uses machine learning to give better product
recommendations to its customers based on their preferences, and Netflix uses machine
learning to give its users better suggestions for the TV series, movies or shows that they
would like to watch.
Deep Learning:
Deep learning is actually a subset of machine learning. It technically is machine learning
and functions in the same way, but it has different capabilities.
The main difference between deep learning and machine learning is that machine learning models
improve progressively but still need some guidance. If a machine learning
model returns an inaccurate prediction, the programmer needs to fix that problem explicitly,
but in the case of deep learning, the model corrects itself. An automatic car-driving system is a
good example of deep learning.
Let's take an example to understand both machine learning and deep learning.
Suppose we have a flashlight and we teach a machine learning model that whenever
someone says "dark" the flashlight should turn on. The machine learning model will analyse
different phrases said by people and search for the word "dark"; when the word appears,
the flashlight turns on. But what if someone says "I am not able to see anything, the light is very
dim"? Here the user wants the flashlight on, but the sentence does not contain the word
"dark", so the flashlight will not turn on. That is where deep learning differs from machine
learning. A deep learning model would turn on the flashlight, because a deep learning model is
able to learn from its own method of computing.
Artificial intelligence:
Now if we talk about AI, it is a different thing from machine learning and
deep learning; in fact, deep learning and machine learning are both subsets of AI. There is
no fixed definition for AI; you will find a different definition everywhere, but here is one:
AI means to actually replicate a human brain: the way a human brain thinks, works and
functions. The truth is that we have not been able to build a proper AI yet, but we are very close;
one example is Sophia, one of the most advanced AI models present today. The
reason we have not been able to build proper AI yet is that we still don't know many aspects of the
human brain, such as why we dream.
Why do people relate machine learning and deep learning with artificial intelligence?
Machine learning and deep learning are ways of achieving AI, which means that by the use
of machine learning and deep learning we may be able to achieve AI in the future, but they are not AI themselves.
Machine Learning Introduction
Machine Learning is a field that grew out of Artificial Intelligence (AI). Applying AI, we
wanted to build better and more intelligent machines. But except for simple tasks such as finding
the shortest path between points A and B, we were unable to program more complex and
constantly evolving challenges. There was a realization that the only way to achieve
this was to let the machine learn for itself, much as a child learns from its own experience. So
machine learning was developed as a new capability for computers, and now machine
learning is present in so many segments of technology that we don't even realize it while using
it.
Finding patterns in data is something human brains can do, but when the data is
very massive the time taken to analyse it grows quickly. This is where Machine Learning
comes into action, helping people work with large data in minimum time.
If big data and cloud computing are gaining importance for their contributions, machine
learning as a technology helps analyze those big chunks of data, easing the task of data scientists
in an automated process and gaining equal importance and recognition.
The techniques we use for data mining have been around for many years, but they were
not effective earlier because we did not have the computing power to run the algorithms at scale.
If you run these learning algorithms, including deep learning, with access to better data and more
compute, the output can lead to dramatic breakthroughs; this is what modern machine learning delivers.
Supervised Learning
A majority of practical machine learning uses supervised learning.
In supervised learning, the system tries to learn from the previous examples that are
given. (On the other hand, in unsupervised learning, the system attempts to find the patterns
directly from the example given.)
Speaking mathematically, supervised learning is where you have both input variables (X)
and output variables (Y) and use an algorithm to learn the mapping function from the
input to the output.
Example: predicting a house's sale price from its size and location, where the known historical prices act as the labelled outputs.
Supervised learning problems can be further divided into two parts, namely classification, and
regression.
Classification: A classification problem is when the output variable is a category or a group, such
as “black” or “white” or “spam” and “no spam”.
Regression: A regression problem is when the output variable is a real value, such as a price in
rupees or a height in centimetres.
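The two kinds of supervised problems can be made concrete with a minimal, hedged sketch using scikit-learn (assuming it is installed); the tiny data sets below are invented purely for illustration:

from sklearn.linear_model import LinearRegression, LogisticRegression

# Regression: the output is a real value (e.g. a price).
X = [[1], [2], [3], [4]]          # input variable
y = [10.0, 20.5, 29.8, 40.2]      # output variable (real values)
reg = LinearRegression().fit(X, y)
print(reg.predict([[5]]))         # predicted real value for a new input

# Classification: the output is a category (e.g. "spam" vs "no spam").
X2 = [[0.1], [0.4], [0.6], [0.9]]
y2 = ["no spam", "no spam", "spam", "spam"]
clf = LogisticRegression().fit(X2, y2)
print(clf.predict([[0.7]]))       # predicted class label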
Unsupervised Learning
In unsupervised learning, the algorithms are left to themselves to discover interesting
structures in the data.
Mathematically, unsupervised learning is when you only have input data (X) and no
corresponding output variables.
This is called unsupervised learning because unlike supervised learning above, there are
no given correct answers and the machine itself finds the answers.
Unsupervised learning problems can be further divided into association and clustering
problems.
Association: An association rule learning problem is where you want to discover rules that
describe large portions of your data, such as “people that buy X also tend to buy Y”.
Clustering: A clustering problem is where you want to discover the inherent groupings in the
data, such as grouping customers by purchasing behavior.
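As an illustration of clustering, here is a minimal sketch (assuming scikit-learn is installed) that groups made-up customer spending figures into two clusters; only the inputs are given, with no output labels:

import numpy as np
from sklearn.cluster import KMeans

# Invented data: yearly spend on groceries and electronics for six customers.
spend = np.array([[200, 20], [220, 30], [210, 25],
                  [40, 300], [50, 320], [45, 310]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(spend)
print(kmeans.labels_)   # the cluster assignments discovered by the algorithm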
Reinforcement Learning
A computer program will interact with a dynamic environment in which it must perform a particular
goal (such as playing a game with an opponent or driving a car). The program is provided feedback in
terms of rewards and punishments as it navigates its problem space.
Using this algorithm, the machine is trained to make specific decisions. It works this way: the
machine is exposed to an environment where it continuously trains itself using a trial and error
method.
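The trial-and-error idea can be sketched with a toy example. The code below is not a full reinforcement learning system; it is an assumed, simplified setup in which an epsilon-greedy strategy tries two actions, receives rewards, and gradually prefers the action that pays off more:

import random

# Toy environment: two actions with different (hidden) average rewards.
def reward(action):
    return random.gauss(1.0 if action == 0 else 2.0, 0.5)

estimates = [0.0, 0.0]   # estimated value of each action
counts = [0, 0]
epsilon = 0.1            # exploration rate

for step in range(1000):
    # Explore occasionally, otherwise exploit the best-known action.
    if random.random() < epsilon:
        action = random.randrange(2)
    else:
        action = 0 if estimates[0] > estimates[1] else 1
    r = reward(action)
    counts[action] += 1
    # Incremental average of the observed rewards (the learning step).
    estimates[action] += (r - estimates[action]) / counts[action]

print(estimates)   # the second action should end up with the higher estimated value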
Here is the minimum level of mathematics that is needed for Machine Learning Engineers / Data
Scientists:
Probability Theory and Statistics (probability rules and axioms, Bayes' theorem, random
variables, variance and expectation, conditional and joint distributions, standard
distributions).
Regression Models
In this Machine Learning tutorial, we are going to get a basic understanding of what
exactly Regression Models are in Machine Learning.
Regression Models are among the most popular statistical models and are generally
used to estimate the relationship between variables.
There are different types of Regression Models; the ones below are the most
commonly used for Regression Analysis.
1. Simple Linear Regression
2. Multiple Linear Regression
3. Polynomial Regression
4. Logistic Regression
5. Ridge Regression
6. Lasso Regression
7. ElasticNet Regression
8. Support Vector Regression
These models will be discussed elaborately in the upcoming topics in this Machine
Learning tutorial.
Regression Analysis involves creating machine learning models which predict a
numeric value. Prediction relies on a solid machine learning model which estimates
the relationship between a dependent variable and an independent variable.
In a typical regression plot, the independent or predictor variable is on the X-axis,
whereas the dependent or response variable is on the Y-axis. The inclined line is none
other than the regression line, and the plotted data points appear as dots around it.
Dependent Variable: By the name itself we can clearly understand that this variable will vary
depending on other variables or other factors.
The dependent variable is also called the response variable (outcome).
Example:
Consider a student's score in an examination, which could vary based on several factors.
Independent Variable: This is the variable which is not influenced by other variables; rather, we
can say this variable stands alone and has the quality of influencing others.
The independent variable is also called the predictor variable (input).
Example:
Consider the same example of a student's score in an examination. Generally, the score will
depend on various factors like hours of study, attendance etc. So, the time spent by the student
preparing for the examination can be considered an independent variable.
The quantity that can't be controlled directly, i.e. the dependent variable, is what needs to be predicted or estimated.
Model:
A model is a transformation engine that helps us express the dependent variable as a
function of the independent variables.
Parameters: Parameters are the ingredients added to the model that are estimated so it can produce the output.
Concept
Linear regression models provide a simple approach towards supervised learning. They
are simple yet effective.
Linear implies the following: arranged in or extending along a straight or nearly straight
line. Linear suggests that the relationship between dependent and independent variable can
be expressed in a straight line.
Recall the geometry lesson from high school. What is the equation of a line?
y = mx + c
Linear regression models are not perfect. They try to approximate the relationship between
the dependent and independent variables with a straight line. Approximation leads to errors. Some
errors can be reduced. Some errors are inherent in the nature of the problem; these errors
cannot be eliminated. They are called irreducible error: the noise term in the true
relationship that cannot fundamentally be reduced by any model.
β0 and β1 are two unknown constants that represent the intercept and slope. They are the
parameters.
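In equation form, and matching the line y = mx + c above, the simple linear regression model can be written as y = β0 + β1x + ε, where ε is the irreducible error term.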
Formulation
Let us go through an example to explain the terms and workings of a Linear regression
model.
Fernando is a Data Scientist. He wants to buy a car. He wants to estimate or predict the
car price that he will have to pay. He has a friend at a car dealership company. He asks for
prices for various other cars along with a few characteristics of the car. His friend provides him
with some information.
First, Fernando wants to evaluate whether he can indeed predict car price based on engine size. The
first set of analysis seeks answers to the following questions:
Is the price of the car related to engine size?
How strong is the relationship?
Is the relationship linear?
Can we predict/estimate the car price based on engine size?
Fernando does a correlation analysis. Correlation is a measure of how strongly two variables
are related. It is measured by a metric called the correlation coefficient. Its value is between -1
and 1.
If the correlation coefficient is a large (> 0.7) positive number, it implies that as one variable
increases, the other variable increases as well. A large negative number indicates that as one variable
increases, the other variable decreases.
He plots the relationship between price and engine size.
A straight line can fit => a decent prediction of price can be made using engine size.
Fernando now wants to build a linear regression model that will estimate the price of the
car based on engine size. Superimposing the line equation onto the car price problem, Fernando
formulates the following equation for price prediction.
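Reconstructed from the description above, the formulation is: price = β0 + β1 × engine size, where β0 and β1 are the parameters to be estimated from the training data.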
Model
Recall the earlier discussion on how the data needs to be split into training
and testing sets. The training data is used to learn from the data and create the model. The
testing data is used to evaluate the model performance.
Fernando splits the data into training and test sets: 75% of the data is used for training and the
remainder for testing. He builds a linear regression model using a statistical package.
The model produces a linear equation that expresses the price of the car as a
function of engine size.
β1 is estimated as 156.9
Interpretation
The model provides the equation for predicting the average car price given a specific
engine size. This equation means the following:
a one unit increase in engine size will increase the average price of the car by 156.9 units.
Evaluation
The model is built. The robustness of the model needs to be evaluated. How can we be sure that
the model will be able to predict the price satisfactorily? This evaluation is done in two parts: first,
a test to establish the robustness of the model; second, a test to evaluate the accuracy of the
model.
Fernando first evaluates the model on the training data and examines the regression output statistics.
There are a lot of statistics in there. Let us focus on the key ones. Recall the
discussion on hypothesis testing: the robustness of the model is evaluated using hypothesis
testing.
β1: The value of β1 determines the relationship between price and engine size. If β1 = 0 then
there is no relationship. In this case, β1 is positive. It implies that there is some relationship
between price and engine size.
t-stat: The t-stat value is how many standard deviations the coefficient estimate (β1) is away
from zero. The further it is from zero, the stronger the relationship between price and engine size and
the more significant the coefficient. In this case, the t-stat is 21.09, which is far enough from zero.
p-value: The p-value is a probability value. It indicates the chance of seeing the given t-statistic
under the assumption that the null hypothesis is true. If the p-value is small, e.g. < 0.0001, it implies
that the probability that this result arose by chance, with no real relation, is very low. In this case, the p-
value is small. It means that the relationship between price and engine size is not by chance.
With these metrics, we can safely reject the null hypothesis and accept the alternative
hypothesis. There is a robust relationship between price and engine size.
The relationship is established. How about accuracy? How accurate is the model? To get a feel
for the accuracy of the model, a metric named R-squared or coefficient of determination is
important.
R-squared or Coefficient of determination: To understand this metric, let us break it down into
its components.
Error (e) is the difference between the actual y and the predicted y. The predicted y is
denoted as ŷ. This error is evaluated for each observation. These errors are also called
residuals.
Then all the residual values are squared and added. This term is called the Residual Sum of
Squares (RSS). The lower the RSS, the better it is.
There is another part of the equation of R-squared. To get the other part, first the mean
value of the actual target is computed, i.e. the average price of the cars is
estimated. Then the differences between the mean value and the actual values are
calculated. These differences are then squared and added. This is the Total Sum of Squares
(TSS).
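Putting these pieces together in the same notation: RSS = Σ(y − ŷ)², TSS = Σ(y − ȳ)² where ȳ is the mean of the actual values, and R-squared = 1 − RSS / TSS.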
In the example above, RSS is computed based on the predicted price for three cars. The RSS value is
41,450,201.63. The mean value of the actual price is 11,021. TSS is calculated as 44,444,546. R-
squared is computed as 6.737%. For these three specific data points, the model is only able to
explain 6.73% of the variation. Not good enough!
However, for Fernando's model it is a different story. The R-squared for the training set is 0.7503,
i.e. 75.03%. It means that the model can explain more than 75% of the variation.
Conclusion
Voila! Fernando has a good model now. It performs satisfactorily on the training data. However,
25% of the variation remains unexplained, so there is room for improvement. How about adding more
independent variables for predicting the price? When more than one independent variable is
used to predict a dependent variable, a multiple regression model is created, i.e. a model with more
than one predictor.
Multiple linear regression (MLR) is a statistical technique that uses several explanatory variables
to predict the outcome of a response variable. The goal of multiple linear regression (MLR) is to
model the relationship between the explanatory and response variables.
A simple linear regression is a function that allows an analyst or statistician to make predictions
about one variable based on the information that is known about another variable.
Simple linear regression can only be used when one has two continuous variables: an independent
variable and a dependent variable.
The independent variable is the parameter that is used to calculate the dependent variable or
outcome.
For example, an analyst may want to know how the movement of the market affects the price
of Exxon Mobil (XOM). In this case, the linear equation will have the value of the S&P 500 index as the
independent variable or predictor, and the price of XOM as the dependent variable.
In reality, there are multiple factors that predict the outcome of an event. The price movement
of Exxon Mobil, for example, depends on more than just the performance of the overall market.
Other predictors such as the price of oil, interest rates, and the price movement of oil futures can
affect the price of XOM and stock prices of other oil companies.
To understand a relationship in which more than two variables are present, a multiple linear
regression is used.
The least squares estimates B0, B1, B2, …, Bp are usually computed by statistical software. Any
number of explanatory variables can be included in the regression model, each with its own
coefficient.
The multiple regression model allows an analyst to predict an outcome based on information
provided on multiple explanatory variables. Still, the model is not always perfectly accurate as
each data point can differ slightly from the outcome predicted by the model.
The residual value, E, which is the difference between the actual outcome and the predicted
outcome, is included in the model to account for such slight variations.
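In equation form, the multiple linear regression model described here is y = B0 + B1x1 + B2x2 + … + Bpxp + E, where x1 … xp are the explanatory variables and E is the residual term.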
Multiple linear regression rests on the following assumptions:
There is a linear relationship between the dependent variable and the independent
variables.
The independent variables are not too highly correlated with each other.
The yi observations are selected independently and randomly from the population.
Residuals should be normally distributed with a mean of 0 and a constant variance σ².
The coefficient of determination, R-squared or R², is a statistical metric that is used to measure
how much of the variation in the outcome can be explained by the variation in the independent
variables.
R² always increases as more predictors are added to the MLR model, even though the added predictors
may not be related to the outcome variable. Therefore, R² by itself cannot be used to identify
which predictors should be included in a model and which should be excluded.
R2 can only be between 0 and 1, where 0 indicates that the outcome cannot be predicted by
any of the independent variables and 1 indicates that the outcome can be predicted without
error from the independent variables.
Assume we run our XOM price regression model through a statistics package. The output
(not reproduced in these notes) would be read as follows:
an analyst would interpret this output to mean that, if other variables are held constant, the price of
XOM will increase by 7.8% if the price of oil in the markets increases by 1%. The model also shows
that the price of XOM will decrease by 1.5% following a 1% rise in interest rates. R² indicates that
86.5% of the variation in the stock price of Exxon Mobil can be explained by changes in the
interest rate, oil price, oil futures, and S&P 500 index.
Polynomial Regression
This function fits a polynomial regression model to powers of a single predictor by the method
of linear least squares. Interpolation and calculation of areas under the curve are also given.
If a polynomial model is appropriate for your study then you may use this function to fit a k
order/degree polynomial to your data:
ŷ = b0 + b1x + b2x² + … + bkx^k
where ŷ is the predicted outcome value for the polynomial model, with regression
coefficients b1 to bk for each degree and Y intercept b0.
The model is simply a general linear regression model with k predictors raised to the power of
i where i=1 to k.
A third order (k=3) polynomial forms a cubic expression and a fourth order
(k=4) polynomial forms a quartic expression.
Some points to keep in mind when fitting polynomials:
The fitted model is more reliable when it is built on a large number of observations.
Do not extrapolate beyond the limits of the observed values.
Choose values for the predictor (x) that are not too large, as they will cause overflow with
higher degree polynomials; scale x down if necessary.
Do not draw false confidence from low P values; use these to support your model only if the
plot looks reasonable.
More complex expressions involving polynomials of more than one predictor can be
achieved by using the general linear regression function.
For more detail from the regression, such as analysis of residuals, use the general linear
regression function. To achieve a polynomial fit using general linear regression you must first
create new workbook columns that contain the predictor (x) variable raised to powers up to
the order of polynomial that you want. For example, a second order fit requires input data of
Y, x and x².
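As a hedged sketch of this "create powers of x and run a linear regression" recipe, here is the second order case in Python with scikit-learn instead of the workbook tool the text refers to (the data values are invented for illustration):

import numpy as np
from sklearn.linear_model import LinearRegression

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.3, 9.2, 16.4, 25.1])

# Second order fit: build columns x and x**2, then run an ordinary linear regression.
X = np.column_stack([x, x**2])
model = LinearRegression().fit(X, y)
print(model.intercept_, model.coef_)   # b0, then b1 and b2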
Subjective goodness of fit may be assessed by plotting the data and the fitted curve. An
analysis of variance is given via the analysis option; this reflects the overall fit of the model. Try
to use as few degrees as possible for a model that achieves significance at each degree.
The plot function supplies a basic plot of the fitted curve and a plot with confidence bands
and prediction bands. You can save the fitted Y values with their standard errors, confidence
intervals and prediction intervals to a workbook.
Example
Here we use an example from the physical sciences to emphasize the point that polynomial
regression is mostly applicable to studies where environments are highly controlled and
observations are made to a specified level of tolerance. The data below are the electricity
consumptions in kilowatt-hours per month from ten houses and the areas in square feet of
those houses:
Area (square feet)    Consumption (kWh per month)
1470                  1264
1600                  1493
1710                  1571
1840                  1711
1980                  1804
2230                  1840
2400                  1956
2930                  1954
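A hedged sketch of fitting a second order polynomial to the table above with NumPy (the original analysis was done in a statistics package, and the column interpretation of the recovered data is assumed):

import numpy as np

area = np.array([1470, 1600, 1710, 1840, 1980, 2230, 2400, 2930], dtype=float)
kwh = np.array([1264, 1493, 1571, 1711, 1804, 1840, 1956, 1954], dtype=float)

# Fit kwh = b0 + b1*area + b2*area**2 by least squares.
b2, b1, b0 = np.polyfit(area, kwh, deg=2)   # polyfit returns the highest degree first
print(b0, b1, b2)

# Fitted consumption across the observed range (do not extrapolate beyond it).
fitted = np.polyval([b2, b1, b0], area)
print(fitted)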
The polynomial regression output (not reproduced in these notes) shows that the overall regression and both degree coefficients are highly significant.
Plots: the fitted curve was examined with confidence bands and prediction bands (figures omitted here).
In the above curve, the right hand end shows a very sharp decline. If you were to extrapolate
beyond the data you have observed, you might conclude that very large houses have
very low electricity consumption. This is obviously wrong. Polynomials are frequently illogical
for some parts of a fitted curve. You must blend common sense, art and mathematics when
fitting these models!
Generalization
The above terms are related to learning theory and the theory of generalization, which
includes the expectation that 'out of sample' performance tracks 'in sample'
performance. This is the first building block of the theory of generalization: if we reduce the error on the
'in sample' data, it is likely that the error on the 'out of sample'
data will also be reduced and be approximately the same. The second building block
of generalization theory is that learning algorithms can practically reduce the error on the
'in sample' data and bring it as close to zero as possible. The latter might lead to a problem
called overfitting, whereby we memorize data instead of learning from it. A learning model
that is overfitting the 'in sample' data is less likely to generalize well on 'out of sample' data.
Machine learning learns a mapping from inputs to outputs. For example:
Mapping from emails to whether they are spam or not for email spam classification.
Mapping from house details to house sale price for house sale price regression.
Mapping from photograph to text to describe the photo in photo caption generation.
We can summarize this mapping that machine learning algorithms learn as a function (f) that
predicts the output (y) given the input (X), or restated:
y = f(X)
Our goal in fitting the machine learning algorithms is to get the best possible f() for our
purposes.
We are training the model to make predictions in the future, given inputs for cases where we
do not have the outputs, i.e. where the outputs are unknown. This requires that the algorithm
learn in general how to take observations from the domain and make a prediction, not just
learn the specifics of the training data.
This is called generalization.
A machine learning algorithm must generalize from training data to the entire domain of all
unseen observations in the domain so that it can make accurate predictions when you use
the model.
This is really hard.
This approach of generalization requires that the data that we use to train the model (X) is a
good and reliable sample of the observations in the mapping we want the algorithm to
learn. The higher the quality and the more representative, the easier it will be for the model
to learn the unknown and underlying “true” mapping that exists from inputs to outputs.
We don't memorize specific roads when we learn to drive; we learn to drive in general so
that we can drive on any road or set of conditions.
We don't memorize specific computer programs when learning to code; we learn general
ways to solve problems with code for any business case that might come up.
We don't memorize the specific word order in natural language; we learn general meanings
for words and put them together in new sequences as needed.
It is the speed and scale with which these automated generalization machines operate that
is so exciting in the field of machine learning.
The machine learning model is the result of the automated generalization procedure called
the machine learning algorithm.
The model could be said to be a generalization of the mapping from training inputs to
training outputs.
There may be many ways to map inputs to outputs for a specific problem and we can
navigate these ways by testing different algorithms, different algorithm configurations,
different training data, and so on.
We cannot know which approach will result in the most skillful model beforehand, therefore
we must test a suite of approaches, configurations, and framings of the problem to discover
what works and what the limits of learning are on the problem before selecting a final model
to use.
The skill of the model at making predictions determines the quality of the generalization and
can help as a guide during the model selection process.
Out of the millions of possible mappings, we prefer simpler mappings over complex
mappings. Put another way, we prefer the simplest possible hypothesis that explains the
data.
The simpler model is often (but not always) easier to understand and maintain and is more
robust. In practice, you may want to choose the best performing simplest model.
The ability to automatically learn by generalization is powerful, but is not suitable for all
problems.
Some problems require a precise solution, such as arithmetic on a bank account balance.
Some problems can be solved by generalization, but simpler solutions exist, such as
calculating the square root of positive numbers.
Some problems look like they could be solved by generalization but there exists no structured
underlying relationship to generalize from the data, or such a function is too complex, such
as predicting security prices.
Key to the effective use of machine learning is learning where it can and cannot (or should
not) be used.
Sometimes this is obvious, but often it is not. Again, you must use experience and
experimentation to help tease out whether a problem is a good fit for being solved by
generalization.
Regularization:
Suppose you are building a regression model, say to predict wages, and you keep making it
more interesting and more complex. You measure its accuracy using
a loss metric L(X, Y), where X is your design matrix and Y is the vector of observations (also called targets;
here, the wages).
You find that your results are quite good but not as perfect as you wish.
So you add more variables: location, profession of parents, social background, number of
children, weight, number of books, preferred colour, best meal, last holiday destination and so
on and so forth.
Your model may do well on the training data, but it is probably overfitting, i.e. it will probably have poor prediction
and generalization power: it sticks too closely to the data, and the model has probably learned
the background noise while being fit. This is, of course, not acceptable.
So how do you solve this?
You penalize your loss function by adding a multiple of an L1 (LASSO) or an L2 (Ridge) norm of
your weights vector w (the vector of learned parameters in your linear regression):
L(X,Y) + λN(w)
This will help you avoid overfitting and will, at the same time, perform feature selection for
certain regularization norms (the L1 norm in the LASSO does this job).
Finally, you might ask: OK, I have everything now, but how can I tune the regularization term λ?
One possible answer is to use cross-validation: you divide your training data into subsets, train your
model on some of them for a fixed value of λ, test it on the remaining subset, and repeat this procedure while
varying λ. Then you select the λ that minimizes your validation loss.
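A hedged sketch of this λ search with scikit-learn's cross-validation helpers (scikit-learn calls the regularization strength alpha rather than λ, and the data here is synthetic, for illustration only):

import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.array([1.0, 0.5, 0.0, 0.0, 2.0]) + rng.normal(scale=0.3, size=100)

# Try several candidate values of the regularization term and pick the best by cross-validation.
search = GridSearchCV(Ridge(), {"alpha": [0.01, 0.1, 1.0, 10.0, 100.0]}, cv=5)
search.fit(X, y)
print(search.best_params_)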
Recall that supervised learning approximates an unknown target function:
Y = f(X)
This characterization covers the range of classification and prediction problems and the
machine learning algorithms that can be used to address them.
An important consideration in learning the target function from the training data is how
well the model generalizes to new data. Generalization is important because the data we
collect is only a sample, it is incomplete and noisy.
In machine learning we describe the learning of the target function from training data as
inductive learning.
Induction refers to learning general concepts from specific examples which is exactly the
problem that supervised machine learning problems aim to solve. This is different from
deduction that is the other way around and seeks to learn specific concepts from general
rules.
Generalization refers to how well the concepts learned by a machine learning model
apply to specific examples not seen by the model when it was learning.
The goal of a good machine learning model is to generalize well from the training data to
any data from the problem domain. This allows us to make predictions in the future on data
the model has never seen.
There is a terminology used in machine learning when we talk about how well a machine
learning model learns and generalizes to new data, namely overfitting and underfitting.
Overfitting and underfitting are the two biggest causes for poor performance of machine
learning algorithms.
Statistical Fit
Statistics often describe the goodness of fit which refers to measures used to estimate how
well the approximation of the function matches the target function.
Some of these methods are useful in machine learning (e.g. calculating the residual errors),
but some of these techniques assume we know the form of the target function we are
approximating, which is not the case in machine learning.
If we knew the form of the target function, we would use it directly to make predictions,
rather than trying to learn an approximation from samples of noisy training data.
Overfitting refers to a model that models the training data too well.
Overfitting happens when a model learns the detail and noise in the training data to the
extent that it negatively impacts the performance of the model on new data. This means
that the noise or random fluctuations in the training data is picked up and learned as
concepts by the model. The problem is that these concepts do not apply to new data and
negatively impact the models ability to generalize.
Overfitting is more likely with nonparametric and nonlinear models that have more flexibility
when learning a target function. As such, many nonparametric machine learning
algorithms also include parameters or techniques to limit and constrain how much detail
the model learns.
For example, decision trees are a nonparametric machine learning algorithm that is very
flexible and is subject to overfitting training data. This problem can be addressed by
pruning a tree after it has learned in order to remove some of the detail it has picked up.
Underfitting refers to a model that can neither model the training data nor generalize to
new data.
An underfit machine learning model is not a suitable model and will be obvious as it will
have poor performance on the training data.
Underfitting is often not discussed as it is easy to detect given a good performance metric.
The remedy is to move on and try alternate machine learning algorithms. Nevertheless, it
does provide a good contrast to the problem of overfitting.
Ideally, you want to select a model at the sweet spot between underfitting and overfitting.
Over time, as the algorithm learns, the error for the model on the training data goes down
and so does the error on the test dataset. If we train for too long, the performance on the
training dataset may continue to decrease because the model is overfitting and learning
the irrelevant detail and noise in the training dataset. At the same time the error for the test
set starts to rise again as the model's ability to generalize decreases.
The sweet spot is the point just before the error on the test dataset starts to increase where
the model has good skill on both the training dataset and the unseen test dataset.
You can perform this experiment with your favorite machine learning algorithms. This is
often not a useful technique in practice, because by choosing the stopping point for training
using the skill on the test dataset, the test set is no longer "unseen" or a
standalone objective measure. Some knowledge (a lot of useful knowledge) about that
data has leaked into the training procedure.
There are two additional techniques you can use to help find the sweet spot in practice:
resampling methods and a validation dataset.
Both overfitting and underfitting can lead to poor model performance. But by far the most
common problem in applied machine learning is overfitting.
Overfitting is such a problem because the evaluation of machine learning algorithms on
training data is different from the evaluation we actually care the most about, namely how
well the algorithm performs on unseen data.
There are two important techniques that you can use when evaluating machine learning
algorithms to limit overfitting: use a resampling technique to estimate model accuracy, and
hold back a separate validation dataset.
The most popular resampling technique is k-fold cross validation. It allows you to train and
test your model k-times on different subsets of training data and build up an estimate of the
performance of a machine learning model on unseen data.
A validation dataset is simply a subset of your training data that you hold back from your
machine learning algorithms until the very end of your project. After you have selected
and tuned your machine learning algorithms on your training dataset you can evaluate
the learned models on the validation dataset to get a final objective idea of how the
models might perform on unseen data.
Using cross validation is a gold standard in applied machine learning for estimating model
accuracy on unseen data. If you have the data, using a validation dataset is also an
excellent practice.
Overfitting: Good performance on the training data, poor generalization to other data.
Underfitting: Poor performance on the training data and poor generalization to other data.
Cross Validation
k-Fold Cross-Validation
The procedure has a single parameter called k that refers to the number of groups that a
given data sample is to be split into. As such, the procedure is often called k-fold cross-
validation. When a specific value for k is chosen, it may be used in place of k in the
reference to the model, such as k=10 becoming 10-fold cross-validation.
Importantly, each observation in the data sample is assigned to an individual group and
stays in that group for the duration of the procedure. This means that each sample is given
the opportunity to be used in the hold out set 1 time and used to train the model k-1 times.
Note: This approach involves randomly dividing the set of observations into k groups, or
folds, of approximately equal size. The first fold is treated as a validation set, and the
method is fit on the remaining k − 1 folds.
It is also important that any preparation of the data prior to fitting the model occur on the
CV-assigned training dataset within the loop rather than on the broader data set. This also
applies to any tuning of hyperparameters. A failure to perform these operations within the
loop may result in data leakage and an optimistic estimate of the model skill.
Note: Despite the best efforts of statistical methodologists, users frequently invalidate their
results by inadvertently peeking at the test data.
The results of a k-fold cross-validation run are often summarized with the mean of the
model skill scores. It is also good practice to include a measure of the variance of the skill
scores, such as the standard deviation or standard error.
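A hedged sketch of summarizing k-fold scores with their mean and standard deviation in scikit-learn (the model and the synthetic data are placeholders invented for illustration):

from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=100, n_features=5, noise=10.0, random_state=0)
scores = cross_val_score(LinearRegression(), X, y, cv=10)   # 10-fold cross-validation
print("mean R^2: %.3f, std: %.3f" % (scores.mean(), scores.std()))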
Configuration of k
A poorly chosen value for k may result in a misrepresentative idea of the skill of the model,
such as a score with a high variance (that may change a lot based on the data used to fit
the model), or a high bias (such as an overestimate of the skill of the model).
Three common tactics for choosing a value for k are as follows:
Representative: The value for k is chosen such that each train/test group of data samples is
large enough to be statistically representative of the broader dataset.
k=10: The value for k is fixed to 10, a value that has been found through experimentation to
generally result in a model skill estimate with low bias and a modest variance.
k=n: The value for k is fixed to n, where n is the size of the dataset, to give each test sample
an opportunity to be used in the hold out dataset. This approach is called leave-one-out
cross-validation.
Note: The choice of k is usually 5 or 10, but there is no formal rule. As k gets larger, the
difference in size between the training set and the resampling subsets gets smaller. As this
difference decreases, the bias of the technique becomes smaller
A value of k=10 is very common in the field of applied machine learning, and is
recommended if you are struggling to choose a value for your dataset.
If a value for k is chosen that does not evenly split the data sample, then one group will
contain a remainder of the examples. It is preferable to split the data sample into k groups
with the same number of samples, such that the sample of model skill scores are all
equivalent.
The first step is to pick a value for k in order to determine the number of folds used to split
the data. Here, we will use a value of k=3. That means we will shuffle the data and then
split the data into 3 groups. Because we have 6 observations, each group will have an
equal number of 2 observations.
For example:
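One possible shuffle of the six observations could give folds like these (purely illustrative):
Fold 1: observations 5 and 2
Fold 2: observations 1 and 3
Fold 3: observations 4 and 6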
We can then make use of the sample, such as to evaluate the skill of a machine learning
algorithm.
Three models are trained and evaluated with each fold given a chance to be the held out
test set.
For example:
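Model 1: trained on Fold 1 + Fold 2, tested on Fold 3
Model 2: trained on Fold 2 + Fold 3, tested on Fold 1
Model 3: trained on Fold 1 + Fold 3, tested on Fold 2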
The models are then discarded after they are evaluated as they have served their
purpose.
The skill scores are collected for each model and summarized for use.
Cross-Validation API
The split() function can then be called on the class where the data sample is provided as
an argument. Called repeatedly, the split will return each group of train and test sets.
Specifically, arrays are returned containing the indexes into the original data sample of
observations to use for train and test sets on each iteration.
For example, we can enumerate the splits of the indices for a data sample using a
created KFold instance as follows (the setup lines are shown first, then the loop).
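To make the snippet below self-contained, here is an assumed setup; the six values are placeholders, and only the count of six observations comes from the worked example above:

from numpy import array
from sklearn.model_selection import KFold

# an assumed data sample of 6 observations (placeholder values)
data = array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6])
# prepare a 3-fold cross-validation splitter with shuffling
kfold = KFold(n_splits=3, shuffle=True, random_state=1)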
# enumerate splits
for train, test in kfold.split(data):
    print('train: %s, test: %s' % (train, test))
We can tie all of this together with our small dataset used in the worked example of the
prior section.
Running the example prints the specific observations chosen for each train and test set.
The indices are used directly on the original data array to retrieve the observation values.
Nevertheless, the KFold class can be used directly in order to split up a dataset prior to
modeling such that all models will use the same data splits. This is especially helpful if you
are working with very large data samples. The use of the same splits across algorithms can
have benefits for statistical tests that you may wish to perform on the data later.
Variations on Cross-Validation
Train/Test Split: Taken to one extreme, k may be set to 2 so that a single train/test split is
created to evaluate the model.
LOOCV: Taken to another extreme, k may be set to the total number of observations in the
dataset such that each observation is given a chance to be held out of the training dataset.
This is called leave-one-out cross-validation, or LOOCV for short.
Stratified: The splitting of data into folds may be governed by criteria such as ensuring that
each fold has the same proportion of observations with a given categorical value, such as
the class outcome value. This is called stratified cross-validation.
Repeated: This is where the k-fold cross-validation procedure is repeated n times, where
importantly, the data sample is shuffled prior to each repetition, which results in a different
split of the sample.
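scikit-learn exposes classes for these variations; a hedged sketch of how each splitter could be constructed (the parameter values shown are just examples):

from sklearn.model_selection import (KFold, LeaveOneOut, StratifiedKFold,
                                     RepeatedKFold, train_test_split)

kfold = KFold(n_splits=10, shuffle=True, random_state=1)          # plain k-fold
loocv = LeaveOneOut()                                             # leave-one-out
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)   # preserves class proportions
rkf = RepeatedKFold(n_splits=5, n_repeats=3, random_state=1)      # repeated with reshuffling
# train_test_split(X, y, test_size=0.25) gives the single train/test split variation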
Ridge Regression
One of the major aspects of training your machine learning model is avoiding overfitting. An
overfit model will have low accuracy on new data. This happens because the model is trying too
hard to capture the noise in the training dataset. By noise we mean the data points that don't
really represent the true properties of your data, but rather random chance. Learning such data points
makes your model more flexible, at the risk of overfitting.
The concept of balancing bias and variance is helpful in understanding the phenomenon of
overfitting.
One of the ways of avoiding overfitting is using cross-validation, which helps in estimating the error
on the test set and in deciding what parameters work best for your model.
Regularization
This is a form of regression that constrains/regularizes or shrinks the coefficient estimates towards
zero. In other words, this technique discourages learning a more complex or flexible model, so
as to avoid the risk of overfitting.
A simple relation for linear regression looks like this. Here Y represents the learned relation and β
represents the coefficient estimates for the different variables or predictors (X).
The fitting procedure involves a loss function, known as the residual sum of squares or RSS. The
coefficients are chosen such that they minimize this loss function.
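Reconstructing the two formulas referred to here in the notes' notation: the linear relation is Y ≈ β0 + β1X1 + β2X2 + … + βpXp, and the loss function is RSS = Σi (yi − β0 − β1xi1 − … − βpxip)², summed over all observations i.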
Now, the fit will adjust the coefficients based on your training data. If there is noise in the training
data, then the estimated coefficients won't generalize well to future data. This is where
regularization comes in and shrinks or regularizes these learned estimates towards zero.
Ridge Regression
In ridge regression, the RSS is modified by adding a shrinkage
quantity, and the coefficients are estimated by minimizing this new function. Here, λ is the tuning
parameter that decides how much we want to penalize the flexibility of our model. The increase
in flexibility of a model is represented by an increase in its coefficients, and if we want to minimize
the penalized function, then these coefficients need to be small. This is how the Ridge regression
technique prevents coefficients from rising too high. Also, notice that we shrink the estimated
association of each variable with the response, except the intercept β0. This intercept is a
measure of the mean value of the response when all the predictors are zero.
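In formula form (a reconstruction of the expression the text describes), ridge regression minimizes RSS + λ Σj βj², where the sum of squared coefficients runs over β1 … βp and excludes the intercept β0.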
When λ = 0, the penalty term has no effect, and the estimates produced by ridge regression will
be equal to least squares. However, as λ→∞, the impact of the shrinkage penalty grows, and the
ridge regression coefficient estimates approach zero. As can be seen, selecting a good
value of λ is critical, and cross-validation comes in handy for this purpose. The shrinkage penalty
used by this method is based on the L2 norm, so it is also known as L2 regularization.
The coefficients produced by the standard least squares method are scale equivariant,
i.e. if we multiply an input by c then the corresponding coefficient is scaled by a factor of
1/c. Therefore, regardless of how the predictor is scaled, the product of predictor and
coefficient (Xjβj) remains the same. However, this is not the case with ridge regression, and
therefore we need to standardize the predictors, i.e. bring the predictors to the same scale, before
performing ridge regression. The formula used to do this is given below.
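A reconstruction of the standardization formula referred to here: each predictor is divided by its standard deviation, x̃ij = xij / sqrt((1/n) Σi (xij − x̄j)²), so that every standardized predictor is on the same scale.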
Ridge Regression: In ridge regression, the cost function is altered by adding a penalty
equivalent to the square of the magnitude of the coefficients.
This is equivalent to minimizing the RSS of equation 1.2 under the constraint given below.
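Reconstructed constraint form: minimize RSS subject to Σj βj² ≤ s, for some size budget s that corresponds to the chosen λ.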
# Note: this example assumes an older scikit-learn; load_boston was removed in scikit-learn 1.2.
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.datasets import load_boston
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import train_test_split

boston = load_boston()
boston_df = pd.DataFrame(boston.data, columns=boston.feature_names)
# print(boston_df.info())
# add another column that contains the house prices, which in scikit-learn datasets are
# considered as the target
boston_df['Price'] = boston.target
# print(boston_df.head(3))
newX = boston_df.drop('Price', axis=1)
print(newX[0:3])  # check
newY = boston_df['Price']
# print(type(newY))  # pandas core frame
X_train, X_test, y_train, y_test = train_test_split(newX, newY, test_size=0.3, random_state=3)
print(len(X_test), len(y_test))
lr = LinearRegression()
lr.fit(X_train, y_train)
rr = Ridge(alpha=0.01)  # the higher the alpha value, the more the coefficients are restricted;
# with a low alpha the coefficients are barely restricted and ridge resembles linear regression
rr.fit(X_train, y_train)
rr100 = Ridge(alpha=100)  # comparison with a high alpha value
rr100.fit(X_train, y_train)
train_score = lr.score(X_train, y_train)
test_score = lr.score(X_test, y_test)
Ridge_train_score = rr.score(X_train, y_train)
Ridge_test_score = rr.score(X_test, y_test)
Ridge_train_score100 = rr100.score(X_train, y_train)
Ridge_test_score100 = rr100.score(X_test, y_test)
print("linear regression train score:", train_score)
print("linear regression test score:", test_score)
print("ridge regression train score low alpha:", Ridge_train_score)
print("ridge regression test score low alpha:", Ridge_test_score)
print("ridge regression train score high alpha:", Ridge_train_score100)
print("ridge regression test score high alpha:", Ridge_test_score100)
plt.plot(rr.coef_, alpha=0.7, linestyle='none', marker='*', markersize=5, color='red',
         label=r'Ridge; $\alpha = 0.01$', zorder=7)  # zorder for ordering the markers
plt.plot(rr100.coef_, alpha=0.5, linestyle='none', marker='d', markersize=6, color='blue',
         label=r'Ridge; $\alpha = 100$')  # alpha here is for transparency
plt.plot(lr.coef_, alpha=0.4, linestyle='none', marker='o', markersize=7, color='green',
         label='Linear Regression')
plt.xlabel('Coefficient Index', fontsize=16)
plt.ylabel('Coefficient Magnitude', fontsize=16)
plt.legend(fontsize=13, loc=4)
plt.show()
Figure 1: Ridge regression for different values of alpha is plotted to show linear regression as a
limiting case of ridge regression.
Let's understand the figure above. On the X-axis we plot the coefficient index; for the Boston data
there are 13 features (in Python the 0th index refers to the 1st feature). For a low value of alpha (0.01),
when the coefficients are barely restricted, the coefficient magnitudes are almost the same as for linear
regression. For a higher value of alpha (100), we see that for coefficient indices 3, 4 and 5 the
magnitudes are considerably smaller than in the linear regression case. This is an example
of shrinking coefficient magnitudes using Ridge regression.
Lasso Regression
The cost function for Lasso (least absolute shrinkage and selection operator) regression can be
written as follows.
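Reconstructed in the same notation as the ridge objective above: Lasso minimizes RSS + λ Σj |βj|, or equivalently minimizes RSS subject to Σj |βj| ≤ s.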
Supplement 2: Lasso regression coefficients; subject to a similar constraint as Ridge, shown before.
Just like the Ridge regression cost function, for lambda = 0 the equation above reduces to equation
1.2. The only difference is that instead of taking the square of the coefficients, their magnitudes are taken
into account. This type of regularization (L1) can lead to zero coefficients, i.e. some of the
features are completely neglected for the evaluation of the output. So Lasso regression not only
helps in reducing over-fitting but can also help us in feature selection. Just like Ridge regression, the
regularization parameter (lambda) can be controlled, and we will see the effect below using the
cancer data set in sklearn. The reason I am using the cancer data instead of the Boston house data
used before is that the cancer data-set has 30 features compared to only 13 in the Boston
house data, so feature selection using Lasso regression can be depicted well by changing the
regularization parameter.
Figure 2: Lasso regression and feature selection dependence on the regularization parameter
value.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# The difference between lasso and ridge regression is that in lasso some of the coefficients
# can become exactly zero, i.e. some of the features are completely neglected.
from sklearn.linear_model import Lasso
from sklearn.linear_model import LinearRegression
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

cancer = load_breast_cancer()
# print(cancer.keys())
cancer_df = pd.DataFrame(cancer.data, columns=cancer.feature_names)
# print(cancer_df.head(3))
X = cancer.data
Y = cancer.target
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.3, random_state=31)
lasso = Lasso()
lasso.fit(X_train, y_train)
train_score = lasso.score(X_train, y_train)
test_score = lasso.score(X_test, y_test)
coeff_used = np.sum(lasso.coef_ != 0)
print("training score:", train_score)
print("test score: ", test_score)
print("number of features used: ", coeff_used)
lasso001 = Lasso(alpha=0.01, max_iter=1000000)
lasso001.fit(X_train, y_train)
train_score001 = lasso001.score(X_train, y_train)
test_score001 = lasso001.score(X_test, y_test)
coeff_used001 = np.sum(lasso001.coef_ != 0)
print("training score for alpha=0.01:", train_score001)
print("test score for alpha =0.01: ", test_score001)
print("number of features used: for alpha =0.01:", coeff_used001)
lasso00001 = Lasso(alpha=0.0001, max_iter=1000000)
lasso00001.fit(X_train, y_train)
train_score00001 = lasso00001.score(X_train, y_train)
test_score00001 = lasso00001.score(X_test, y_test)
coeff_used00001 = np.sum(lasso00001.coef_ != 0)
print("training score for alpha=0.0001:", train_score00001)
print("test score for alpha =0.0001: ", test_score00001)
print("number of features used: for alpha =0.0001:", coeff_used00001)
lr = LinearRegression()
lr.fit(X_train, y_train)
lr_train_score = lr.score(X_train, y_train)
lr_test_score = lr.score(X_test, y_test)
print("LR training score:", lr_train_score)
print("LR test score: ", lr_test_score)
plt.subplot(1, 2, 1)
plt.plot(lasso.coef_, alpha=0.7, linestyle='none', marker='*', markersize=5, color='red',
         label=r'Lasso; $\alpha = 1$', zorder=7)  # alpha keyword here is for transparency
plt.plot(lasso001.coef_, alpha=0.5, linestyle='none', marker='d', markersize=6, color='blue',
         label=r'Lasso; $\alpha = 0.01$')
plt.xlabel('Coefficient Index', fontsize=16)
plt.ylabel('Coefficient Magnitude', fontsize=16)
plt.legend(fontsize=13, loc=4)
plt.subplot(1, 2, 2)
plt.plot(lasso.coef_, alpha=0.7, linestyle='none', marker='*', markersize=5, color='red',
         label=r'Lasso; $\alpha = 1$', zorder=7)
plt.plot(lasso001.coef_, alpha=0.5, linestyle='none', marker='d', markersize=6, color='blue',
         label=r'Lasso; $\alpha = 0.01$')
plt.plot(lasso00001.coef_, alpha=0.8, linestyle='none', marker='v', markersize=6, color='black',
         label=r'Lasso; $\alpha = 0.0001$')
plt.plot(lr.coef_, alpha=0.7, linestyle='none', marker='o', markersize=5, color='green',
         label='Linear Regression', zorder=2)
plt.xlabel('Coefficient Index', fontsize=16)
plt.ylabel('Coefficient Magnitude', fontsize=16)
plt.legend(fontsize=13, loc=4)
plt.tight_layout()
plt.show()
#output
training score: 0.5600974529893081
test score: 0.5832244618818156
number of features used: 4
training score for alpha=0.01: 0.7037865778498829
test score for alpha =0.01: 0.664183157772623
number of features used: for alpha =0.01: 10
training score for alpha=0.0001: 0.7754092006936697
test score for alpha =0.0001: 0.7318608210757904
number of features used: for alpha =0.0001: 22
LR training score: 0.7842206194055068
LR test score: 0.7329325010888681
With this, out of the 30 features in the cancer data-set, only 4 features are used (those with a non-zero
value of the coefficient).
Both the training and test scores (with only 4 features) are low; we conclude that the model is
under-fitting the cancer data-set.
We reduce this under-fitting by reducing alpha and increasing the number of iterations. With
alpha = 0.01 there are 10 non-zero features, and the training and test scores increase.
A comparison of coefficient magnitudes for the two different values of alpha is shown in the left
panel of figure 2. For alpha = 1, we can see that most of the coefficients are zero or nearly zero,
which is not the case for alpha = 0.01.
Reducing alpha further to 0.0001 gives 22 non-zero features, and the training and test scores become
similar to the basic linear regression case.
In the right panel of the figure, for alpha = 0.0001, the coefficients for Lasso regression and linear
regression show a close resemblance.
So far we have gone through the basics of Ridge and Lasso regression and seen some examples
to understand the applications. Now, I will try to explain why Lasso regression can result in
feature selection whereas Ridge regression only reduces the coefficients close to zero, but not to zero.
An illustrative figure below will help us understand better; we will assume a hypothetical
data-set with only two features. Using the constraints on the coefficients of Ridge and Lasso
regression (as shown above in supplements 1 and 2), we can plot the figure below.
Figure 3: Why LASSO can reduce the dimension of the feature space: an example in a 2D feature space.
For a two-dimensional feature space, the constraint regions (see supplements 1 and 2) are
plotted for Lasso and Ridge regression in cyan and green. The elliptical contours are
the cost function of linear regression (eq. 1.2). If we relax the conditions on the
coefficients, the constraint regions get bigger, and eventually they will contain the centre
of the ellipse; this is the case where Ridge and Lasso regression resemble the linear regression results.
Otherwise, both methods determine the coefficients by finding the first point where the elliptical
contours hit the constraint region. The diamond (Lasso) has corners on the axes, unlike the
disk, and whenever the elliptical region hits such a corner, one of the features completely
vanishes! For a higher dimensional feature space there can be many solutions on the axes with
Lasso regression, and thus we get only the important features selected.
Finally, to end this discussion, let's summarize what we have learnt so far:
1. The cost functions of Ridge and Lasso regression and the importance of the regularization term.
2. Went through some examples using simple data-sets to understand linear regression as a
limiting case for both Lasso and Ridge regression.
3. Understood why Lasso regression can lead to feature selection whereas Ridge can only
shrink coefficients close to zero.
ElasticNet Regression
Elastic Net produces a regression model that is penalized with both the L1-norm and L2-norm.
The consequence of this is to effectively shrink coefficients (like in ridge regression) and to set
some coefficients to zero (as in LASSO).
library(tidyverse)
library(caret)
library(glmnet)
We'll use the Boston data set (from the MASS package) for predicting the median house value (medv) in Boston suburbs, based on multiple predictor variables.
We‟ll randomly split the data into training set (80% for building a predictive model) and test set
(20% for evaluating the model). Make sure to set seed for reproducibility.
# Load the data
data("Boston", package = "MASS")
# Split the data into training and test set
set.seed(123)
training.samples <- Boston$medv %>%
  createDataPartition(p = 0.8, list = FALSE)
train.data <- Boston[training.samples, ]
test.data <- Boston[-training.samples, ]
# Predictor variables
x <- model.matrix(medv~., train.data)[,-1]
# Outcome variable
y <- train.data$medv
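The R snippet above only prepares the data; as a rough Python counterpart, here is a minimal scikit-learn sketch of fitting an elastic net. The diabetes data set and the hyper-parameter values (alpha, l1_ratio) are illustrative stand-ins, not part of the original example.
# Minimal Elastic Net sketch in Python (illustrative data and parameters)
from sklearn.datasets import load_diabetes
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import train_test_split
X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)
# alpha sets the overall penalty strength; l1_ratio mixes L1 (lasso) and L2 (ridge)
enet = ElasticNet(alpha=0.1, l1_ratio=0.5)
enet.fit(X_train, y_train)
print("test R^2:", enet.score(X_test, y_test))
print("non-zero coefficients:", (enet.coef_ != 0).sum())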
Classification Models
In most machine learning problems, however, we want to predict whether our output variable belongs to a particular category. For example:
2. Bank of America wants to know if a customer will pay back a loan on time, prepay, or not pay at all (On Time, Prepaid or Default).
3. Doctors want to know whether a patient will develop Coronary Heart Disease within the next 10 years or not (Yes or No).
4. An Optical Character Reader wants to read English characters and determine which character was read (A to Z or 0 to 9).
All of these are classification problems which fall under the area of Supervised Learning.
Problem 1 and Problem 3 in RED are Binary classification problems since we are classifying
the output into 2 classes in both the cases as Yes or No.
Problem 2 and Problem 4 in BLUE are Multi Class Classification problems since we want to classify the output into more than two classes in both cases.
Binary Classification
a. Logistic Regression
b. Decision Trees
c. Random Forests
d. Neural Networks
Multi class classification problems are popularly tackled using following techniques.
c. Neural Networks
d. A popular technique is to split a multi-class classification problem into multiple binary classification problems and then model each of the sub-problems separately.
Logistic Regression
Logistic regression is another technique borrowed by machine learning from the field of
statistics.
It is the go-to method for binary classification problems (problems with two class values). In
this post you will discover the logistic regression algorithm for machine learning.
Logistic Function
Logistic regression is named for the function used at the core of the method, the logistic
function.
The logistic function, also called the sigmoid function, was developed by statisticians to describe properties of population growth in ecology: rising quickly and maxing out at the carrying capacity of the environment. It's an S-shaped curve that can take any real-valued number and map it into a value between 0 and 1, but never exactly at those limits.
1 / (1 + e^-value)
Where e is the base of the natural logarithms (Euler‟s number or the EXP() function in your
spreadsheet) and value is the actual numerical value that you want to transform. Below is a
plot of the numbers between -5 and 5 transformed into the range 0 and 1 using the logistic
function.
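Since the plot itself is not reproduced here, a few lines of Python (assuming numpy and matplotlib are available) can regenerate it:
import numpy as np
import matplotlib.pyplot as plt
values = np.linspace(-5, 5, 100)
logistic = 1 / (1 + np.exp(-values))   # 1 / (1 + e^-value)
plt.plot(values, logistic)
plt.xlabel('value')
plt.ylabel('logistic(value)')
plt.title('Logistic (sigmoid) function')
plt.show()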
Now that we know what the logistic function is, let‟s see how it is used in logistic regression.
Logistic regression uses an equation as the representation, very much like linear regression.
Input values (x) are combined linearly using weights or coefficient values (referred to as the
Greek capital letter Beta) to predict an output value (y). A key difference from linear regression is that the output value being modeled is a binary value (0 or 1) rather than a numeric value. The logistic regression equation is:
y = e^(b0 + b1 * x) / (1 + e^(b0 + b1 * x))
Where y is the predicted output, b0 is the bias or intercept term and b1 is the coefficient for the single input value (x). Each column in your input data has an associated b coefficient (a constant real value) that must be learned from your training data.
The actual representation of the model that you would store in memory or in a file is the set of coefficients in the equation (the beta values, or b's).
Logistic regression models the probability of the default class (e.g. the first class).
For example, if we are modeling people‟s sex as male or female from their height, then the
first class could be male and the logistic regression model could be written as the probability
of male given a person‟s height, or more formally:
P(sex=male|height)
Written another way, we are modeling the probability that an input (X) belongs to the
default class (Y=1), we can write this formally as:
P(X) = P(Y=1|X)
Note that the probability prediction must be transformed into a binary value (0 or 1) in order to actually make a crisp class prediction. More on this later when we talk about making predictions.
Logistic regression is a linear method, but the predictions are transformed using the logistic
function. The impact of this is that we can no longer understand the predictions as a linear
combination of the inputs as we can with linear regression, for example, continuing on from
above, the model can be stated as:
p(X) = e^(b0 + b1 * X) / (1 + e^(b0 + b1 * X))
I don't want to dive into the math too much, but we can turn the above equation around as follows (remember we can remove the e from one side by adding a natural logarithm (ln) to the other):
ln(p(X) / (1 - p(X))) = b0 + b1 * X
This is useful because we can see that the calculation of the output on the right is linear
again (just like linear regression), and the input on the left is a log of the probability of the
default class.
This ratio on the left is called the odds of the default class (it‟s historical that we use odds, for
example, odds are used in horse racing rather than probabilities). Odds are calculated as a
ratio of the probability of the event divided by the probability of not the event, e.g. 0.8/(1-
0.8) which has the odds of 4. So we could instead write:
ln(odds) = b0 + b1 * X
Because the odds are log-transformed, we call this left-hand side the log-odds or the logit. It is possible to use other types of functions for the transform (which is out of scope here), but in general the transform that relates the linear regression equation to the probabilities is called the link function, e.g. the logit or probit link function.
We can move the exponent back to the right and write it as:
odds = e^(b0 + b1 * X)
All of this helps us understand that indeed the model is still a linear combination of the
inputs, but that this linear combination relates to the log-odds of the default class.
The coefficients (Beta values b) of the logistic regression algorithm must be estimated from
your training data. This is done using maximum-likelihood estimation.
The best coefficients would result in a model that would predict a value very close to 1 (e.g.
male) for the default class and a value very close to 0 (e.g. female) for the other class. The
intuition for maximum-likelihood for logistic regression is that a search procedure seeks
values for the coefficients (Beta values) that minimize the error in the probabilities predicted
by the model to those in the data (e.g. probability of 1 if the data is the primary class).
We are not going to go into the math of maximum likelihood. It is enough to say that a
minimization algorithm is used to optimize the best values for the coefficients for your training
data. This is often implemented in practice using efficient numerical optimization algorithm
(like the Quasi-newton method).
When you are learning logistic regression, you can implement it yourself from scratch using the much simpler gradient descent algorithm.
Making predictions with a logistic regression model is as simple as plugging in numbers into
the logistic regression equation and calculating a result.
Let's make this concrete with a specific example. Let's say we have a model that can predict whether a person is male or female based on their height (completely fictitious). Given a height of 150 cm, is the person male or female? We have learned the coefficients b0 = -100 and b1 = 0.6. Using the equation above we can calculate the probability of male given a height of 150 cm, or more formally P(male|height=150). We will use EXP() for e, because that is what you can use if you type this example into your spreadsheet:
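The spreadsheet calculation can also be checked with a few lines of Python, using the fictitious coefficients above:
import math
b0, b1, height = -100, 0.6, 150
# y = e^(b0 + b1*x) / (1 + e^(b0 + b1*x))
odds = math.exp(b0 + b1 * height)        # e^(-100 + 0.6*150) = e^-10
probability_male = odds / (1 + odds)
print(round(probability_male, 7))        # ~0.0000454, essentially 0, so we predict female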
Now that we know how to make predictions using logistic regression, let‟s look at how we
can prepare our data to get the most from the technique.
The assumptions made by logistic regression about the distribution and relationships in your
data are much the same as the assumptions made in linear regression.
Much study has gone into defining these assumptions and precise probabilistic and
statistical language is used. My advice is to use these as guidelines or rules of thumb and
experiment with different data preparation schemes.
Ultimately in predictive modeling machine learning projects you are laser focused on
making accurate predictions rather than interpreting the results. As such, you can break
some assumptions as long as the model is robust and performs well.
Binary Output Variable: This might be obvious as we have already mentioned it, but logistic
regression is intended for binary (two-class) classification problems. It will predict the
probability of an instance belonging to the default class, which can be snapped into a 0 or
1 classification.
Remove Noise: Logistic regression assumes no error in the output variable (y), consider
removing outliers and possibly misclassified instances from your training data.
Remove Correlated Inputs: Like linear regression, the model can overfit if you have multiple
highly-correlated inputs. Consider calculating the pairwise correlations between all inputs
and removing highly correlated inputs.
Fail to Converge: It is possible for the maximum-likelihood estimation process that learns the coefficients to fail to converge. This can happen if there are many highly correlated inputs in your data or the data is very sparse (e.g. lots of zeros in your input data).
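To make this concrete, here is a minimal scikit-learn sketch of fitting a logistic regression classifier; the breast-cancer data set is used purely as a stand-in binary-classification problem and max_iter is raised only so the solver converges on this un-scaled data.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = LogisticRegression(max_iter=5000)
clf.fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))
print("P(class=1) for the first test row:", clf.predict_proba(X_test[:1])[0, 1])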
Knn Algorithm
K nearest neighbors is a simple algorithm that stores all available cases and classifies new
cases based on a similarity measure (e.g., distance functions). KNN has been used in
statistical estimation and pattern recognition already in the beginning of 1970‟s as a non-
parametric technique.
Algorithm
A case is classified by a majority vote of its neighbors, with the case being assigned to the class most common amongst its K nearest neighbors measured by a distance function. If K = 1, then the case is simply assigned to the class of its nearest neighbor.
It should also be noted that all three distance measures (Euclidean, Manhattan and Minkowski) are only valid for continuous variables. In the case of categorical variables, the Hamming distance must be used. This also brings up the issue of standardizing the numerical variables between 0 and 1 when there is a mixture of numerical and categorical variables in the dataset.
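The distance formulas themselves are not reproduced above, so here is a small plain-Python/NumPy sketch of the Euclidean and Manhattan distances (for continuous variables) and the Hamming distance (for categorical variables); the example vectors are purely illustrative.
import numpy as np
def euclidean(a, b):
    return np.sqrt(np.sum((np.asarray(a, dtype=float) - np.asarray(b, dtype=float)) ** 2))
def manhattan(a, b):
    return np.sum(np.abs(np.asarray(a, dtype=float) - np.asarray(b, dtype=float)))
def hamming(a, b):
    # number of positions at which the categorical values differ
    return sum(x != y for x, y in zip(a, b))
print(euclidean([3, 4], [0, 0]))             # 5.0
print(manhattan([3, 4], [0, 0]))             # 7.0
print(hamming(["red", "M"], ["red", "F"]))   # 1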
Choosing the optimal value for K is best done by first inspecting the data. In general, a
large K value is more precise as it reduces the overall noise but there is no guarantee.
Cross-validation is another way to retrospectively determine a good K value by using an
independent dataset to validate the K value. Historically, the optimal K for most datasets has been between 3 and 10, which produces much better results than 1NN.
Example:
Consider the following data concerning credit default. Age and Loan are two numerical
variables (predictors) and Default is the target.
We can now use the training set to classify an unknown case (Age=48 and
Loan=$142,000) using Euclidean distance. If K=1 then the nearest neighbor is the last case
in the training set with Default=Y.
With K=3, there are two Default=Y and one Default=N out of three closest neighbors. The
prediction for the unknown case is again Default=Y.
Standardized Distance
One major drawback in calculating distance measures directly from the training set is in
the case where variables have different measurement scales or there is a mixture of
numerical and categorical variables. For example, if one variable is based on annual
income in dollars, and the other is based on age in years then income will have a much
higher influence on the distance calculated. One solution is to standardize the training set
as shown below.
Using the standardized distance on the same training set, the unknown case returned a
different neighbor which is not a good sign of robustness.
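The standardization between 0 and 1 mentioned above is typically the min-max rescaling X_s = (X - min) / (max - min); a small sketch, with illustrative age values rather than the actual table from the example:
import numpy as np
def min_max_standardize(column, new_value=None):
    col = np.asarray(column, dtype=float)
    lo, hi = col.min(), col.max()
    scaled = (col - lo) / (hi - lo)
    if new_value is None:
        return scaled
    return scaled, (new_value - lo) / (hi - lo)
ages = [25, 35, 45, 20, 35, 52, 23, 40, 60, 48]          # illustrative values only
scaled_ages, scaled_unknown = min_max_standardize(ages, new_value=48)
print(scaled_unknown)   # 0.7 -- Age is now on a 0-1 scale, comparable with a rescaled Loan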
KNN can be used for both classification and regression predictive problems. However, it is more
widely used in classification problems in the industry. To evaluate any technique we generally
look at 3 important aspects:
1. Ease of interpreting the output
2. Calculation time
3. Predictive power
KNN fares well across all of these considerations. It is commonly used for its ease of interpretation and low calculation time.
Let‟s take a simple case to understand this algorithm. Following is a spread of red circles (RC)
and green squares (GS) :
You intend to find out the class of the blue star (BS) . BS can either be RC or GS and nothing else.
The "K" in the KNN algorithm is the number of nearest neighbors we wish to take a vote from. Let's say K = 3. Hence, we will now make a circle with BS as the center, just big enough to enclose only three data points on the plane. Refer to the following diagram for more details:
The three closest points to BS are all RC. Hence, with a good confidence level we can say that BS should belong to the class RC. Here the choice became very obvious, as all three votes from the closest neighbors went to RC. The choice of the parameter K is very crucial in this algorithm. Next, we will understand which factors should be considered to arrive at the best K.
First let us try to understand what exactly does K influence in the algorithm. If we see the last
example, given that all the 6 training observation remain constant, with a given K value we can
make boundaries of each class. These boundaries will segregate RC from GS. The same way,
let‟s try to see the effect of value “K” on the class boundaries. Following are the different
boundaries separating the two classes with different values of K.
If you watch carefully, you can see that the boundary becomes smoother with increasing value
of K. With K increasing to infinity it finally becomes all blue or all red depending on the total
majority. The training error rate and the validation error rate are two parameters we need to
access on different K-value. Following is the curve for the training error rate with varying value of
K:
As you can see, the error rate at K=1 is always zero for the training sample. This is because the closest point to any training data point is itself, so the prediction is always accurate with K=1. If the validation error curve had been similar, our choice of K would have been 1. Following is the validation error curve with varying values of K:
This makes the story clearer. At K=1, we were overfitting the boundaries. Hence, the error rate initially decreases and reaches a minimum. After the minimum point, it then increases with increasing K. To get the optimal value of K, you can segregate the training and validation sets from the initial dataset, then plot the validation error curve to get the optimal value of K. This value of K should be used for all predictions.
# Importing libraries
import pandas as pd
import numpy as np
import math
import operator
#### Start of STEP 1
# Importing data
data = pd.read_csv("iris.csv")
#### End of STEP 1
data.head()
# Defining a function which calculates euclidean distance between two data points
def euclideanDistance(data1, data2, length):
    distance = 0
    for x in range(length):
        distance += np.square(data1[x] - data2[x])
    return np.sqrt(distance)
# Defining our KNN model
def knn(trainingSet, testInstance, k):
    distances = {}
    length = testInstance.shape[1]
    #### Start of STEP 3.1
    # Calculating the euclidean distance between each row of training data and the test instance
    for x in range(len(trainingSet)):
        dist = euclideanDistance(testInstance.iloc[0].values, trainingSet.iloc[x, :length].values, length)
        distances[x] = dist
    #### End of STEP 3.1
    #### Start of STEP 3.2
    # Sorting the training rows on the basis of distance
    sorted_d = sorted(distances.items(), key=operator.itemgetter(1))
    #### End of STEP 3.2
    neighbors = []
    #### Start of STEP 3.3
    # Extracting the top k neighbors
    for x in range(k):
        neighbors.append(sorted_d[x][0])
    #### End of STEP 3.3
    classVotes = {}
    #### Start of STEP 3.4
    # Counting the most frequent class among the neighbors
    for x in range(len(neighbors)):
        response = trainingSet.iloc[neighbors[x], -1]
        if response in classVotes:
            classVotes[response] += 1
        else:
            classVotes[response] = 1
    #### End of STEP 3.4
    #### Start of STEP 3.5
    sortedVotes = sorted(classVotes.items(), key=operator.itemgetter(1), reverse=True)
    return (sortedVotes[0][0], neighbors)
    #### End of STEP 3.5
# Creating a dummy testset
testSet = [[7.2, 3.6, 5.1, 2.5]]
test = pd.DataFrame(testSet)
#### Start of STEP 2
# Setting number of neighbors = 1
k=1
#### End of STEP 2
# Running KNN model
result,neigh = knn(data, test, k)
# Predicted class
print(result)
-> Iris-virginica
# Nearest neighbor
print(neigh)
-> [141]
Now we will try to alter the k values, and see how the prediction changes.
Support Vector Machine (SVM)
Support vectors are simply the co-ordinates of individual observations. A Support Vector Machine is a frontier which best segregates the two classes (a hyper-plane or line).
You can look at definition of support vectors and a few examples of its working here.
Above, we got accustomed to the process of segregating the two classes with a hyper-plane.
Now the burning question is “How can we identify the right hyper-plane?”. Don‟t worry, it‟s not
as hard as you think!
Let‟s understand:
Identify the right hyper-plane (Scenario-1): Here, we have three hyper-planes (A, B and C).
Now, identify the right hyper-plane to classify star and circle.
You need to remember a thumb rule to identify the right hyper-plane: “Select the hyper-
plane which segregates the two classes better”. In this scenario, hyper-plane “B”
has excellently performed this job.
Identify the right hyper-plane (Scenario-2): Here, we have three hyper-planes (A, B and C)
and all are segregating the classes well. Now, how can we identify the right hyper-plane?
Here, maximizing the distance between the nearest data points (of either class) and the hyper-plane will help us decide the right hyper-plane. This distance is called the Margin. Let's look at the snapshot below:
Above, you can see that the margin for hyper-plane C is high as compared to both A and B. Hence, we name the right hyper-plane as C. Another compelling reason for selecting the hyper-plane with the higher margin is robustness: if we select a hyper-plane having a low margin, then there is a high chance of mis-classification.
Identify the right hyper-plane (Scenario-3):Hint: Use the rules as discussed in previous
section to identify the right hyper-plane
Some of you may have selected the hyper-plane B as it has higher margin compared to A. But,
here is the catch, SVM selects the hyper-plane which classifies the classes accurately prior
to maximizing margin. Here, hyper-plane B has a classification error and A has classified all
correctly. Therefore, the right hyper-plane is A.
Can we classify two classes (Scenario-4)? Below, I am unable to segregate the two classes using a straight line, as one of the stars lies in the territory of the other (circle) class as an outlier.
As I have already mentioned, one star at other end is like an outlier for star class. SVM has a
feature to ignore outliers and find the hyper-plane that has maximum margin. Hence, we
can say, SVM is robust to outliers.
Find the hyper-plane to segregate two classes (Scenario-5): In the scenario below, we can't have a linear hyper-plane between the two classes, so how does SVM classify these two classes? Till now, we have only looked at linear hyper-planes.
SVM can solve this problem. Easily! It solves it by introducing an additional feature. Here, we will add a new feature z = x^2 + y^2 and plot the data points on axes x and z. Should we need to add this feature manually to have a hyper-plane? No, SVM has a technique called the kernel trick. These are functions which take a low-dimensional input space and transform it to a higher-dimensional space, i.e. they convert a non-separable problem into a separable problem. Simply put, the kernel does some extremely complex data transformations, then finds out the process to separate the data based on the labels or outputs you've defined. When we look at the hyper-plane in the original input space it looks like a circle.
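To see this concretely, here is a small illustrative sketch (the data is synthetic, not taken from the figures above): plotting x against the new feature z = x^2 + y^2 makes the inner class separable from the outer ring by a horizontal line.
import numpy as np
import matplotlib.pyplot as plt
rng = np.random.default_rng(0)
inner = rng.normal(0, 0.5, size=(50, 2))                       # stars near the origin
angle = rng.uniform(0, 2 * np.pi, 50)
outer = np.c_[3 * np.cos(angle), 3 * np.sin(angle)] + rng.normal(0, 0.3, size=(50, 2))   # circles on a ring
X = np.vstack([inner, outer])
z = X[:, 0] ** 2 + X[:, 1] ** 2                                # the new feature z = x^2 + y^2
plt.scatter(X[:50, 0], z[:50], marker='*', label='stars')
plt.scatter(X[50:, 0], z[50:], marker='o', label='circles')
plt.xlabel('x'); plt.ylabel('z = x^2 + y^2'); plt.legend()
plt.show()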
Now, let‟s look at the methods to apply SVM algorithm in a data science challenge.
In Python, scikit-learn is a widely used library for implementing machine learning algorithms, SVM
is also available in scikit-learn library and follow the same structure (Import library, object
creation, fitting model and prediction). Let‟s look at the below code:
#Import Library
from sklearn import svm
#Assumed you have, X (predictor) and Y (target) for the training data set and x_test (predictor) of the test_dataset
# Create SVM classification object
model = svm.SVC(kernel='linear', C=1, gamma=1)
# There are various options associated with it, like changing the kernel, gamma and C value.
# We will discuss more about it in the next section. Train the model using the training sets and check the score
model.fit(X, y)
model.score(X, y)
#Predict Output
predicted= model.predict(x_test)
The e1071 package in R is used to create Support Vector Machines with ease. It has helper
functions as well as code for the Naive Bayes Classifier. The creation of a support vector
machine in R and Python follow similar approaches, let‟s take a look now at the following code:
#Import Library
require(e1071) #Contains the SVM
# there are various options associated with SVM training, like changing the kernel, gamma and C value
# create model
model <-
svm(Target~Predictor1+Predictor2+Predictor3,data=Train,kernel='linear',gamma=0.2,cost=100)
#Predict Output
preds <- predict(model,Test)
table(preds)
Tuning parameters value for machine learning algorithms effectively improves the model
performance. Let‟s look at the list of parameters available with SVM.
kernel: We have already discussed about it. Here, we have various options available with kernel
like, “linear”, “rbf”,”poly” and others (default value is “rbf”). Here “rbf” and “poly” are useful for
non-linear hyper-plane. Let‟s look at the example, where we‟ve used linear kernel on two
feature of iris data set to classify their class.
import numpy as np
import matplotlib.pyplot as plt
from sklearn import svm, datasets
# import some data to play with
iris = datasets.load_iris()
X = iris.data[:, :2] # we only take the first two features. We could
# avoid this ugly slicing by using a two-dim dataset
y = iris.target
# we create an instance of SVM and fit out data. We do not scale our
# data since we want to plot the support vectors
C = 1.0 # SVM regularization parameter
svc = svm.SVC(kernel='linear', C=C).fit(X, y)  # gamma is not used by the linear kernel, so it is omitted
# create a mesh to plot in
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
h = (x_max - x_min) / 100  # step size in the mesh
xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
np.arange(y_min, y_max, h))
plt.subplot(1, 1, 1)
Z = svc.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)
plt.contourf(xx, yy, Z, cmap=plt.cm.Paired, alpha=0.8)
plt.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.Paired)
plt.xlabel('Sepal length')
plt.ylabel('Sepal width')
plt.title('SVC with linear kernel')
plt.show()
Change the kernel type to rbf in the line below and look at the impact.
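For reference, that line would become something like the following (the gamma value here is only illustrative):
svc = svm.SVC(kernel='rbf', C=1, gamma=0.7).fit(X, y)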
I would suggest you go for a linear kernel if you have a large number of features (>1000), because it is more likely that the data is linearly separable in a high-dimensional space. You can also use RBF, but do not forget to cross-validate its parameters to avoid over-fitting.
gamma: Kernel coefficient for 'rbf', 'poly' and 'sigmoid'. The higher the value of gamma, the more the model will try to exactly fit the training data set, which hurts generalization and causes over-fitting. Example: let's see the difference if we have different gamma values like 0, 10 or 100.
C: Penalty parameter C of the error term. It also controls the trade-off between smooth decision
boundary and classifying the training points correctly.
We should always look at the cross validation score to have effective combination of these
parameters and avoid over-fitting.
In R, SVMs can be tuned in a similar fashion as they are in Python. Mentioned below are the
respective parameters for e1071 package:
Pros:
o It works really well with clear margin of separation
o It is effective in high dimensional spaces.
o It is effective in cases where number of dimensions is greater than the number of
samples.
o It uses a subset of training points in the decision function (called support vectors), so it
is also memory efficient.
Cons:
o It doesn't perform well when we have a large data set, because the required training time is higher.
o It also doesn't perform very well when the data set has more noise, i.e. the target classes are overlapping.
o SVM doesn't directly provide probability estimates; these are calculated using an expensive five-fold cross-validation (see the related SVC method of the Python scikit-learn library).
Decision Trees
Example: Let's say we have a sample of 30 students with three variables: Gender (Boy/Girl), Class (IX/X) and Height (5 to 6 ft). 15 out of these 30 play cricket in their leisure time. Now, I want to create a model to predict who will play cricket during the leisure period. In this problem, we need to segregate students who play cricket in their leisure time based on the most significant input variable among all three.
This is where a decision tree helps. It will segregate the students based on all values of the three variables and identify the variable which creates the best homogeneous sets of students (sets which are heterogeneous to each other). In the snapshot below, you can see that the variable Gender is able to identify the most homogeneous sets compared to the other two variables.
As mentioned above, a decision tree identifies the most significant variable and the value of that variable which gives the most homogeneous sets of population. Now the question which arises is: how does it identify the variable and the split? To do this, decision trees use various algorithms, which we shall discuss in the following section.
The type of decision tree is based on the type of target variable we have. It can be of two types:
1. Categorical Variable Decision Tree: A decision tree which has a categorical target variable is called a categorical variable decision tree. Example: in the above scenario of the student problem, the target variable was "Student will play cricket or not", i.e. YES or NO.
2. Continuous Variable Decision Tree: A decision tree which has a continuous target variable is called a continuous variable decision tree.
Example:- Let‟s say we have a problem to predict whether a customer will pay his renewal
premium with an insurance company (yes/ no). Here we know that income of customer is
a significant variable but insurance company does not have income details for all customers.
Now, as we know this is an important variable, then we can build a decision tree to predict
customer income based on occupation, product and various other variables. In this case, we
are predicting values for continuous variable.
1. Root Node: It represents entire population or sample and this further gets divided into two
or more homogeneous sets.
2. Splitting: It is a process of dividing a node into two or more sub-nodes.
3. Decision Node: When a sub-node splits into further sub-nodes, then it is called decision
node.
4. Leaf / Terminal Node: Nodes that do not split are called Leaf or Terminal nodes.
5. Pruning: When we remove sub-nodes of a decision node, this process is called pruning. You can say it is the opposite process of splitting.
6. Branch / Sub-Tree: A sub-section of the entire tree is called a branch or sub-tree.
7. Parent and Child Node: A node which is divided into sub-nodes is called the parent node of those sub-nodes, whereas the sub-nodes are the children of the parent node.
These are the terms commonly used for decision trees. As we know that every algorithm has
advantages and disadvantages, below are the important factors which one should know.
Advantages
1. Easy to Understand: Decision tree output is very easy to understand even for people from
non-analytical background. It does not require any statistical knowledge to read and
interpret them. Its graphical representation is very intuitive and users can easily relate their
hypothesis.
2. Useful in Data exploration: Decision tree is one of the fastest way to identify most significant
variables and relation between two or more variables. With the help of decision trees, we
can create new variables / features that has better power to predict target variable. You
can refer article (Trick to enhance power of regression model) for one such trick. It can
also be used in data exploration stage. For example, we are working on a problem where
we have information available in hundreds of variables, there decision tree will help to
identify most significant variable.
3. Less data cleaning required: It requires less data cleaning compared to some other
modeling techniques. It is not influenced by outliers and missing values to a fair degree.
4. Data type is not a constraint: It can handle both numerical and categorical variables.
5. Non Parametric Method: Decision tree is considered to be a non-parametric method. This
means that decision trees have no assumptions about the space distribution and the
classifier structure.
Disadvantages
1. Over-fitting: Over-fitting is one of the most practical difficulties for decision tree models. This problem gets solved by setting constraints on the model parameters and by pruning (discussed in detail below).
2. Not fit for continuous variables: While working with continuous numerical variables, a decision tree loses information when it categorizes the variables into different categories.
We all know that the terminal nodes (or leaves) lie at the bottom of the decision tree. This means that decision trees are typically drawn upside down, such that leaves are at the bottom and roots are at the top (shown below).
Both the trees work almost similar to each other, let‟s look at the primary differences &
similarity between classification and regression trees:
1. Regression trees are used when dependent variable is continuous. Classification trees are
used when dependent variable is categorical.
2. In case of regression tree, the value obtained by terminal nodes in the training data is the
mean response of observation falling in that region. Thus, if an unseen data observation
falls in that region, we‟ll make its prediction with mean value.
3. In case of classification tree, the value (class) obtained by terminal node in the training
data is the mode of observations falling in that region. Thus, if an unseen data observation
falls in that region, we‟ll make its prediction with mode value.
4. Both types of tree divide the predictor space (independent variables) into distinct and non-overlapping regions. For the sake of simplicity, you can think of these regions as high-dimensional boxes.
5. Both the trees follow a top-down greedy approach known as recursive binary splitting. We
call it as „top-down‟ because it begins from the top of tree when all the observations are
available in a single region and successively splits the predictor space into two new
branches down the tree. It is known as „greedy‟ because, the algorithm cares (looks for
best variable available) about only the current split, and not about future splits which will
66
6. This splitting process is continued until a user-defined stopping criterion is reached. For example, we can tell the algorithm to stop once the number of observations per node becomes less than 50.
7. In both cases, the splitting process results in fully grown trees until the stopping criterion is reached. But the fully grown tree is likely to overfit the data, leading to poor accuracy on unseen data. This brings in 'pruning', which is one of the techniques used to tackle overfitting. We'll learn more about it in a following section.
The decision of making strategic splits heavily affects a tree‟s accuracy. The decision criteria is
different for classification and regression trees.
Decision trees use multiple algorithms to decide to split a node in two or more sub-nodes. The
creation of sub-nodes increases the homogeneity of resultant sub-nodes. In other words, we can
say that purity of the node increases with respect to the target variable. Decision tree splits the
nodes on all available variables and then selects the split which results in most homogeneous
sub-nodes.
The algorithm selection is also based on type of target variables. Let‟s look at the four most
commonly used algorithms in decision tree:
Gini Index
The Gini index says that if we select two items from a population at random, then they must be of the same class, and the probability of this is 1 if the population is pure.
1. Calculate Gini for sub-nodes, using the formula: sum of squares of the probabilities of success and failure (p^2 + q^2).
2. Calculate Gini for the split using the weighted Gini score of each node of that split.
Example: – Referring to example used above, where we want to segregate the students based
on target variable ( playing cricket or not ). In the snapshot below, we split the population using
two input variables Gender and Class. Now, I want to identify which split is producing more
homogeneous sub-nodes using Gini index.
Split on Gender:
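The worked table is not reproduced here, but the weighted Gini scores can be recomputed in a few lines using the class counts quoted in the entropy example further below (Female: 2 play / 8 do not; Male: 13 / 7; Class IX: 6 / 8; Class X: 9 / 7):
def gini(p_success, p_failure):
    # sum of squares of the probabilities of success and failure
    return p_success ** 2 + p_failure ** 2
def weighted_gini(nodes, total):
    # nodes: list of (node_size, plays, does_not_play)
    return sum((n / total) * gini(p / n, q / n) for n, p, q in nodes)
print("Gini for split on Gender:", weighted_gini([(10, 2, 8), (20, 13, 7)], 30))   # ~0.59
print("Gini for split on Class :", weighted_gini([(14, 6, 8), (16, 9, 7)], 30))    # ~0.51
The higher weighted Gini for Gender indicates that the Gender split produces the more homogeneous sub-nodes, which is the conclusion drawn in the text.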
Chi-Square
It is an algorithm to find out the statistical significance of the differences between sub-nodes and the parent node. We measure it by the sum of squares of the standardized differences between the observed and expected frequencies of the target variable.
1. Calculate the Chi-square for an individual node by calculating the deviation for both Success and Failure.
2. Calculate the Chi-square of the split using the sum of the Chi-square values of Success and Failure for each node of the split.
Example: Let's work with the same example that we used above to calculate the Gini index.
Split on Gender:
1. First we are populating for node Female, Populate the actual value for “Play
Cricket” and “Not Play Cricket”, here these are 2 and 8 respectively.
2. Calculate expected value for “Play Cricket” and “Not Play Cricket”, here it would be 5 for
both because parent node has probability of 50% and we have applied same probability
on Female count(10).
3. Calculate deviations by using formula, Actual – Expected. It is for “Play Cricket” (2 – 5 = -3)
and for “Not play cricket” ( 8 – 5 = 3).
4. Calculate the Chi-square of the node for "Play Cricket" and "Not Play Cricket" using the formula ((Actual – Expected)^2 / Expected)^(1/2). You can refer to the table below for the calculation.
5. Follow similar steps for calculating Chi-square value for Male node.
6. Now add all Chi-square values to calculate Chi-square for split Gender.
Split on Class:
Perform similar steps of calculation for split on Class and you will come up with below table.
Above, you can see that Chi-square also identifies that the Gender split is more significant compared to Class.
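A short sketch reproduces these Chi-square steps; the expected counts follow from the 50% parent-node probability stated in step 2.
import math
def node_chi(actual_play, actual_not, expected_play, expected_not):
    # sqrt((Actual - Expected)^2 / Expected) for each outcome, summed for the node
    chi_play = math.sqrt((actual_play - expected_play) ** 2 / expected_play)
    chi_not = math.sqrt((actual_not - expected_not) ** 2 / expected_not)
    return chi_play + chi_not
chi_gender = node_chi(2, 8, 5, 5) + node_chi(13, 7, 10, 10)   # Female then Male
chi_class = node_chi(6, 8, 7, 7) + node_chi(9, 7, 8, 8)       # Class IX then Class X
print(round(chi_gender, 2), round(chi_class, 2))              # ~4.58 vs ~1.46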
Information Gain:
Look at the image below and think which node can be described easily. I am sure, your answer
is C because it requires less information as all values are similar. On the other hand, B requires
more information to describe it and A requires the maximum information. In other words, we can
say that C is a Pure node, B is less Impure and A is more impure.
Now, we can build a conclusion that less impure node requires less information to describe it.
And, more impure node requires more information. Information theory is a measure to define this
degree of disorganization in a system known as Entropy. If the sample is completely
homogeneous, then the entropy is zero and if the sample is an equally divided (50% – 50%), it has
entropy of one.
Entropy can be calculated as Entropy = -p log2(p) - q log2(q), where p and q are the probabilities of success and failure respectively in that node. Entropy is also used with categorical target variables. It chooses the split which has the lowest entropy compared to the parent node and other splits. The lesser the entropy, the better it is.
1. Entropy for parent node = -(15/30) log2 (15/30) – (15/30) log2 (15/30) = 1. Here 1 shows that it is an impure node.
2. Entropy for Female node = -(2/10) log2 (2/10) – (8/10) log2 (8/10) = 0.72 and for male
node, -(13/20) log2 (13/20) – (7/20) log2 (7/20) = 0.93
3. Entropy for split Gender = Weighted entropy of sub-nodes = (10/30)*0.72 + (20/30)*0.93
= 0.86
4. Entropy for Class IX node, -(6/14) log2 (6/14) – (8/14) log2 (8/14) = 0.99 and for Class
X node, -(9/16) log2 (9/16) – (7/16) log2 (7/16) = 0.99.
5. Entropy for split Class = (14/30)*0.99 + (16/30)*0.99 = 0.99
Above, you can see that the entropy for the split on Gender is the lowest among all, so the tree will split on Gender. We can derive the information gain from entropy as 1 − Entropy.
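A few lines of Python reproduce steps 1 to 5 above:
import math
def entropy(p, q):
    # -p*log2(p) - q*log2(q), with 0*log2(0) taken as 0
    return -sum(x * math.log2(x) for x in (p, q) if x > 0)
e_parent = entropy(15/30, 15/30)                                           # 1.0
e_gender = (10/30) * entropy(2/10, 8/10) + (20/30) * entropy(13/20, 7/20)  # ~0.86
e_class = (14/30) * entropy(6/14, 8/14) + (16/30) * entropy(9/16, 7/16)    # ~0.99
print(round(e_parent, 2), round(e_gender, 2), round(e_class, 2))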
Reduction in Variance
Till now, we have discussed the algorithms for categorical target variable. Reduction in variance
is an algorithm used for continuous target variables (regression problems). This algorithm uses the
standard formula of variance to choose the best split. The split with lower variance is selected as
the criteria to split the population:
1. Variance for Root node, here mean value is (15*1 + 15*0)/30 = 0.5 and we have 15 one
and 15 zero. Now variance would be ((1-0.5)^2+(1-0.5)^2+….15 times+(0-0.5)^2+(0-
0.5)^2+…15 times) / 30, this can be written as (15*(1-0.5)^2+15*(0-0.5)^2) / 30 = 0.25
2. Mean of Female node = (2*1+8*0)/10=0.2 and Variance = (2*(1-0.2)^2+8*(0-0.2)^2) / 10 =
0.16
3. Mean of Male Node = (13*1+7*0)/20=0.65 and Variance = (13*(1-0.65)^2+7*(0-0.65)^2) /
20 = 0.23
4. Variance for Split Gender = Weighted Variance of Sub-nodes = (10/30)*0.16 + (20/30) *0.23
= 0.21
5. Mean of Class IX node = (6*1+8*0)/14=0.43 and Variance = (6*(1-0.43)^2+8*(0-0.43)^2) /
14= 0.24
6. Mean of Class X node = (9*1+7*0)/16=0.56 and Variance = (9*(1-0.56)^2+7*(0-0.56)^2) / 16
= 0.25
7. Variance for Split Class = (14/30)*0.24 + (16/30)*0.25 = 0.25
Above, you can see that the Gender split has a lower variance compared to the parent node, so the split would take place on the Gender variable.
Until here, we learnt about the basics of decision trees and the decision making process
involved to choose the best splits in building a tree model. As I said, decision tree can be
applied both on regression and classification problems. Let‟s understand these aspects in detail.
4. What are the key parameters of tree modeling and how can we avoid over-fitting in decision
trees?
Overfitting is one of the key challenges faced while modeling decision trees. If no limit is set on a decision tree, it will give you 100% accuracy on the training set, because in the worst case it will end up making one leaf for each observation. Thus, preventing overfitting is pivotal while modeling a decision tree, and it can be done in 2 ways:
1. Setting constraints on tree size
2. Tree pruning
This can be done by using various parameters which are used to define a tree. First, lets look at
the general structure of a decision tree:
The parameters used for defining a tree are further explained below. The parameters described
below are irrespective of tool. It is important to understand the role of parameters used in tree
modeling. These parameters are available in R & Python.
Tree Pruning
Let's analyze these choices. With the former choice, you'll immediately overtake the car ahead, reach behind the truck and start moving at 30 km/h, looking for an opportunity to move back right. All the cars originally behind you move ahead in the meanwhile. This would be the optimum choice if your objective is to maximize the distance covered in, say, the next 10 seconds. With the latter choice, you sail through at the same speed, cross the truck and then overtake, depending on the situation ahead. Greedy you!
This is exactly the difference between a normal decision tree and pruning. A decision tree with constraints won't see the truck ahead and will adopt a greedy approach by taking a left. On the other hand, if we use pruning, we in effect look a few steps ahead and make a choice.
So we know pruning is better. But how to implement it in decision tree? The idea is simple.
“If I can use logistic regression for classification problems and linear regression for regression
problems, why is there a need to use trees”? Many of us have this question. And, this is a valid
one too.
Actually, you can use any algorithm. It is dependent on the type of problem you are solving.
Let‟s look at some key factors which will help you to decide which algorithm to use:
For R users and Python users, decision tree is quite easy to implement. Let‟s quickly look at the set
of codes which can get you started with this algorithm. For ease of use, I‟ve shared standard
codes where you‟ll need to replace your data set name and variables to get started.
For R users, there are multiple packages available to implement decision tree such as ctree,
rpart, tree etc.
> library(rpart)
> x <- cbind(x_train,y_train)
# grow tree
> fit <- rpart(y_train ~ ., data = x,method="class")
> summary(fit)
#Predict Output
> predicted= predict(fit,x_test)
In the code above, x is the training data (the predictors bound together with the y_train target) and y_train is the dependent variable. For Python users, the scikit-learn equivalent is:
#Import Library
#Import other necessary libraries like pandas, numpy...
from sklearn import tree
#Assumed you have, X (predictor) and Y (target) for the training data set and x_test (predictor) of the test_dataset
# Create tree object
model = tree.DecisionTreeClassifier(criterion='gini') # for classification; the criterion can be 'gini' or 'entropy' (information gain), the default is 'gini'
# model = tree.DecisionTreeRegressor() for regression
# Train the model using the training sets and check score
model.fit(X, y)
model.score(X, y)
#Predict Output
predicted= model.predict(x_test)
The literal meaning of the word 'ensemble' is group. Ensemble methods involve a group of predictive models to achieve better accuracy and model stability. Ensemble methods are known to impart a supreme boost to tree-based models.
Like every other model, a tree-based model also suffers from the plague of bias and variance. Bias means 'how much, on average, the predicted values differ from the actual values.' Variance means 'how different the predictions of the model will be at the same point if different samples are taken from the same population.'
You build a small tree and you will get a model with low variance and high bias. How do you
manage to balance the trade off between bias and variance ?
Normally, as you increase the complexity of your model, you will see a reduction in prediction
error due to lower bias in the model. As you continue to make your model more complex, you
end up over-fitting your model and your model will start suffering from high variance.
A champion model should maintain a balance between these two types of errors. This is known
as the trade-off management of bias-variance errors. Ensemble learning is one way to execute
this trade off analysis.
Some of the commonly used ensemble methods include: Bagging, Boosting and Stacking. In this
tutorial, we‟ll focus on Bagging and Boosting in detail.
Bagging is a technique used to reduce the variance of our predictions by combining the
result of multiple classifiers modeled on different sub-samples of the same data set. The following
figure will make it clearer:
There are various implementations of bagging models. Random forest is one of them and we‟ll
discuss it next.
Random Forest is considered to be a panacea of all data science problems. On a funny note,
when you can‟t think of any algorithm (irrespective of situation), use random forest!
Random Forest is a versatile machine learning method capable of performing both regression
and classification tasks. It also undertakes dimensional reduction methods, treats missing values,
outlier values and other essential steps of data exploration, and does a fairly good job. It is a
type of ensemble learning method, where a group of weak models combine to form a powerful
model.
In Random Forest, we grow multiple trees as opposed to a single tree in CART model (see
comparison between CART and Random Forest here, part1 and part2). To classify a new object
based on attributes, each tree gives a classification and we say the tree “votes” for that class.
The forest chooses the classification having the most votes (over all the trees in the forest) and in
case of regression, it takes the average of outputs by different trees.
It works in the following manner. Each tree is planted & grown as follows:
1. Assume number of cases in the training set is N. Then, sample of these N cases is taken at
random but with replacement. This sample will be the training set for growing the tree.
2. If there are M input variables, a number m < M is specified such that at each node, m variables are selected at random out of the M. The best split on these m variables is used to split the node.
3. Each tree is grown to the largest extent possible; there is no pruning.
4. Predict new data by aggregating the predictions of the ntree trees (i.e., majority votes for classification, average for regression).
To understand more in detail about this algorithm using a case study, please read this article
“Introduction to Random forest – Simplified“.
This algorithm can solve both type of problems i.e. classification and regression and does a
decent estimation at both fronts.
One of the benefits of Random Forest which excites me most is its power to handle large data sets with higher dimensionality. It can handle thousands of input variables and identify the most significant variables, so it is considered one of the dimensionality reduction methods. Further, the model outputs the importance of each variable, which can be a very handy feature (on some random data set).
It has an effective method for estimating missing data and maintains accuracy when a
large proportion of the data are missing.
It has methods for balancing errors in data sets where classes are imbalanced.
The capabilities of the above can be extended to unlabeled data, leading to
unsupervised clustering, data views and outlier detection.
Random Forest involves sampling of the input data with replacement called as bootstrap
sampling. Here one third of the data is not used for training and can be used to testing.
These are called the out of bag samples. Error estimated on these out of bag samples is
known as out of bag error. Study of error estimates by Out of bag, gives evidence to show
that the out-of-bag estimate is as accurate as using a test set of the same size as the
training set. Therefore, using the out-of-bag error estimate removes the need for a set aside
test set.
It surely does a good job at classification, but not as good a job for regression problems, as it does not give precise continuous predictions. In the case of regression, it doesn't predict beyond the range of the training data, and it may over-fit data sets that are particularly noisy.
Random Forest can feel like a black box approach for statistical modelers – you have very
little control on what the model does. You can at best – try different parameters and
random seeds!
Random forests have commonly known implementations in R packages and Python scikit-learn.
Let‟s look at the code of loading random forest model in R and Python below:
Python
#Import Library
from sklearn.ensemble import RandomForestClassifier # use RandomForestRegressor for regression problems
#Assumed you have, X (predictor) and Y (target) for the training data set and x_test (predictor) of the test_dataset
# Create Random Forest object
model= RandomForestClassifier(n_estimators=1000)
# Train the model using the training sets and check score
model.fit(X, y)
#Predict Output
predicted= model.predict(x_test)
R Code
> library(randomForest)
> x <- cbind(x_train, y_train)
# grow forest
> fit <- randomForest(y_train ~ ., data = x, ntree = 500)
> summary(fit)
#Predict Output
> predicted= predict(fit,x_test)
Definition: The term „Boosting‟ refers to a family of algorithms which converts weak learner to
strong learners.
Let‟s understand this definition in detail by solving a problem of spam email identification:
How would you classify an email as SPAM or not? Like everyone else, our initial approach would
be to identify „spam‟ and „not spam‟ emails using following criteria. If:
1. Email has only one image file (promotional image), It‟s a SPAM
2. Email has only link(s), It‟s a SPAM
3. Email body consist of sentence like “You won a prize money of $ xxxxxx”, It‟s a SPAM
4. Email from our official domain “Analyticsvidhya.com” , Not a SPAM
5. Email from known source, Not a SPAM
Above, we‟ve defined multiple rules to classify an email into „spam‟ or „not spam‟. But, do you
think these rules individually are strong enough to successfully classify an email? No.
Individually, these rules are not powerful enough to classify an email into 'spam' or 'not spam'; therefore, these rules are called weak learners. To convert weak learners into a strong learner, we combine the predictions of the weak learners using methods such as averaging (or weighted averaging) the predictions, or taking the prediction with the higher vote.
Now we know that, boosting combines weak learner a.k.a. base learner to form a strong rule. An
immediate question which should pop in your mind is, „How boosting identify weak rules?„
To find weak rule, we apply base learning (ML) algorithms with a different distribution. Each time
base learning algorithm is applied, it generates a new weak prediction rule. This is an iterative
process. After many iterations, the boosting algorithm combines these weak rules into a single
strong prediction rule.
Here's another question which might haunt you: 'How do we choose a different distribution for each round?'
For choosing the right distribution, here are the following steps:
Step 1: The base learner takes all the distributions and assign equal weight or attention to each
observation.
Step 2: If there is any prediction error caused by first base learning algorithm, then we pay higher
attention to observations having prediction error. Then, we apply the next base learning
algorithm.
Step 3: Iterate Step 2 till the limit of base learning algorithm is reached or higher accuracy is
achieved.
Finally, it combines the outputs from weak learner and creates a strong learner which eventually
improves the prediction power of the model. Boosting pays higher focus on examples which are
mis-classified or have higher errors by preceding weak rules.
There are many boosting algorithms which impart additional boost to model‟s accuracy. In this
tutorial, we‟ll learn about the two most commonly used algorithms i.e. Gradient Boosting (GBM)
and XGboost.
I've always admired the boosting capabilities of the xgboost algorithm. At times I've found that it provides better results compared to a GBM implementation, but at times you might find that the gains are just marginal. When I explored more about its performance and the science behind its high accuracy, I discovered many advantages of XGBoost over GBM:
1. Regularization:
o A standard GBM implementation has no regularization, while XGBoost does, which helps to reduce overfitting.
o In fact, XGBoost is also known as a 'regularized boosting' technique.
2. Parallel Processing:
o XGBoost implements parallel processing and is blazingly faster as compared to GBM.
o But hang on, we know that boosting is sequential process so how can it be
parallelized? We know that each tree can be built only after the previous one,
so what stops us from making a tree using all cores? I hope you get where I‟m
coming from. Check this link out to explore further.
o XGBoost also supports implementation on Hadoop.
3. High Flexibility
o XGBoost allow users to define custom optimization objectives and evaluation criteria.
o This adds a whole new dimension to the model and there is no limit to what we can
do.
4. Handling Missing Values
o XGBoost has an in-built routine to handle missing values.
o User is required to supply a different value than other observations and pass that as a
parameter. XGBoost tries different things as it encounters a missing value on each
node and learns which path to take for missing values in future.
5. Tree Pruning:
o A GBM would stop splitting a node when it encounters a negative loss in the split. Thus
it is more of a greedy algorithm.
o XGBoost, on the other hand, makes splits up to the max_depth specified and then starts pruning the tree backwards, removing splits beyond which there is no positive gain.
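As a practical footnote, here is a minimal XGBoost sketch using its scikit-learn wrapper; it assumes the separate xgboost package is installed, and the data set and parameter values are only illustrative starting points.
from xgboost import XGBClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = XGBClassifier(n_estimators=100, learning_rate=0.1, max_depth=3)
model.fit(X_train, y_train)
print("test accuracy:", model.score(X_test, y_test))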
Before we start working, let‟s quickly understand the important parameters and the working of
this algorithm. This will be helpful for both R and Python users. Below is the overall pseudo-code
of GBM algorithm for 2 classes:
Let's consider the important GBM parameters used to improve model performance in Python:
1. learning_rate
o This determines the impact of each tree on the final outcome (step 2.4). GBM works
by starting with an initial estimate which is updated using the output of each tree. The
learning parameter controls the magnitude of this change in the estimates.
o Lower values are generally preferred as they make the model robust to the specific
characteristics of tree and thus allowing it to generalize well.
o Lower values would require higher number of trees to model all the relations and will
be computationally expensive.
2. n_estimators
o The number of sequential trees to be modeled (step 2)
o Though GBM is fairly robust at higher number of trees but it can still overfit at a point.
Hence, this should be tuned using CV for a particular learning rate.
3. subsample
o The fraction of observations to be selected for each tree. Selection is done by
random sampling.
o Values slightly less than 1 make the model robust by reducing the variance.
o Typical values ~0.8 generally work fine but can be fine-tuned further.
Apart from these, there are certain miscellaneous parameters which affect overall functionality:
1. loss
o It refers to the loss function to be minimized in each split.
o It can have various values for classification and regression case. Generally the
default values work fine. Other values should be chosen only if you understand their
impact on the model.
2. init
o This affects initialization of the output.
o This can be used if we have made another model whose outcome is to be used as
the initial estimates for GBM.
3. random_state
o The random number seed so that same random numbers are generated every time.
o This is important for parameter tuning. If we don‟t fix the random number, then we‟ll
have different outcomes for subsequent runs on the same parameters and it
becomes difficult to compare models.
o It can potentially result in overfitting to a particular random sample selected. We can
try running models for different random samples, which is computationally expensive
and generally not used.
4. verbose
o The type of output to be printed when the model fits. The different values can be:
0: no output generated (default)
1: output generated for trees in certain intervals
>1: output generated for all trees
5. warm_start
o This parameter has an interesting application and can help a lot if used judicially.
o Using this, we can fit additional trees on previous fits of a model. It can save a lot of
time and you should explore this option for advanced applications
6. presort
o Select whether to presort data for faster splits.
o It makes the selection automatically by default but it can be changed if needed.
I know its a long list of parameters but I have simplified it for you in an excel file which you can
download from this GitHub repository.
For R users, using the caret package, there are four main tuning parameters:
1. n.trees – the number of iterations, i.e. the number of trees that will be grown
2. interaction.depth – It determines the complexity of the tree i.e. total number of splits it has
to perform on a tree (starting from a single node)
3. shrinkage – It refers to the learning rate. This is similar to learning_rate in python (shown
above).
4. n.minobsinnode – It refers to minimum number of training samples required in a node to
perform splitting
I‟ve shared the standard codes in R and Python. At your end, you‟ll be required to change the
value of dependent variable and data set name used in the codes below. Considering the ease
of implementing GBM in R, one can easily perform tasks like cross-validation and grid search with this package.
GBM in R
> library(caret)
GBM in Python
#import libraries
from sklearn.ensemble import GradientBoostingClassifier #For Classification
from sklearn.ensemble import GradientBoostingRegressor #For Regression
#use GBM function
clf = GradientBoostingClassifier(n_estimators=100, learning_rate=1.0, max_depth=1)
clf.fit(X_train, y_train)
Clustering
Clustering is the task of dividing the population or data points into a number of groups such that
data points in the same groups are more similar to other data points in the same group than
those in other groups. In simple words, the aim is to segregate groups with similar traits and assign
them into clusters.
Let's understand this with an example. Suppose you are the head of a rental store and wish to
understand the preferences of your customers to scale up your business. Is it possible for you to look
at the details of each customer and devise a unique business strategy for each one of them?
Definitely not. But what you can do is cluster all of your customers into, say, 10 groups based
on their purchasing habits and use a separate strategy for the customers in each of these 10 groups.
And this is what we call clustering.
Now that we understand what clustering is, let's take a look at the types of clustering.
Types of Clustering
Hard Clustering: In hard clustering, each data point either belongs to a cluster completely
or not. For example, in the above example each customer is put into one group out of the
10 groups.
Soft Clustering: In soft clustering, instead of putting each data point into a separate cluster,
a probability or likelihood of that data point being in those clusters is assigned. For example,
in the above scenario each customer is assigned a probability of being in any one of the 10
clusters of the retail store.
Since the task of clustering is subjective, the means that can be used for achieving this goal are
plenty. Every methodology follows a different set of rules for defining the „similarity‟ among data
points. In fact, there are more than 100 clustering algorithms known. But few of the algorithms
are used popularly, let‟s look at them in detail:
Connectivity models: As the name suggests, these models are based on the notion that
the data points closer in data space exhibit more similarity to each other than the data
points lying farther away. These models can follow two approaches. In the first approach,
they start with classifying all data points into separate clusters & then aggregating them as
the distance decreases. In the second approach, all data points are classified as a single
cluster and then partitioned as the distance increases. Also, the choice of distance
function is subjective. These models are very easy to interpret but lack scalability for
handling big datasets. Examples of these models are hierarchical clustering algorithm and
its variants.
Centroid models: These are iterative clustering algorithms in which the notion of similarity is
derived by the closeness of a data point to the centroid of the clusters. K-Means clustering
algorithm is a popular algorithm that falls into this category. In these models, the no. of
clusters required at the end has to be specified beforehand, which makes it important
to have prior knowledge of the dataset. These models run iteratively to find a local
optimum.
Distribution models: These clustering models are based on the notion of how probable is it
that all data points in the cluster belong to the same distribution (For example: Normal,
Gaussian). These models often suffer from overfitting. A popular example of these models is
Expectation-maximization algorithm which uses multivariate normal distributions.
Density Models: These models search the data space for areas of varied density of data
points in the data space. They isolate various regions of different density and assign the data
points within these regions to the same cluster. Popular examples of density models are
DBSCAN and OPTICS.
Now I will be taking you through two of the most popular clustering algorithms in detail – K
Means clustering and Hierarchical clustering. Let‟s begin.
K Means Clustering
K means is an iterative clustering algorithm that converges to a local optimum of the within-cluster
variance. The algorithm works in the following steps:
1. Specify the desired number of clusters K: Let us choose k=2 for these 5 data points in 2-D
space.
2. Randomly assign each data point to a cluster : Let‟s assign three points in cluster 1 shown
using red color and two points in cluster 2 shown using grey color.
3. Compute cluster centroids : The centroid of data points in the red cluster is shown using red
cross and those in grey cluster using grey cross.
4. Re-assign each point to the closest cluster centroid: Note that only the data point at the
bottom was assigned to the red cluster even though it's closer to the centroid of the grey cluster.
Thus, we re-assign that data point to the grey cluster.
5. Re-compute cluster centroids : Now, re-computing the centroids for both the clusters.
6. Repeat steps 4 and 5 until no improvements are possible: We repeat the 4th and 5th steps
until the algorithm converges to a local optimum. When there is no further switching of
data points between the two clusters for two successive repeats, the algorithm terminates,
unless a maximum number of iterations is explicitly specified.
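For reference, a minimal Python sketch of these steps using scikit-learn's KMeans (the five 2-D points below are illustrative placeholders, not the ones from the figure):
import numpy as np
from sklearn.cluster import KMeans
# Five 2-D data points (illustrative values only)
X = np.array([[1.0, 1.0], [1.5, 2.0], [3.0, 4.0], [5.0, 7.0], [3.5, 5.0]])
# n_clusters=2 mirrors the walkthrough above; fit_predict() runs steps 2-6 internally
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)
print("cluster labels:", labels)
print("centroids:", kmeans.cluster_centers_)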
Hierarchical Clustering
Hierarchical clustering, as the name suggests, is an algorithm that builds a hierarchy of clusters. This
algorithm starts with all the data points assigned to a cluster of their own. Then two nearest
clusters are merged into the same cluster. In the end, this algorithm terminates when there is only
a single cluster left.
The results of hierarchical clustering can be shown using a dendrogram. The dendrogram can be
interpreted as:
At the bottom, we start with 25 data points, each assigned to separate clusters. Two closest
clusters are then merged till we have just one cluster at the top. The height in the dendrogram at
which two clusters are merged represents the distance between two clusters in the data space.
The decision of the no. of clusters that can best depict different groups can be chosen by
observing the dendrogram. The best choice of the no. of clusters is the no. of vertical lines in the
dendrogram cut by a horizontal line that can transverse the maximum distance vertically
without intersecting a cluster.
In the above example, the best choice of no. of clusters will be 4 as the red horizontal line in the
dendrogram below covers maximum vertical distance AB.
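As a rough Python counterpart (the 25 random 2-D points are placeholders for the data behind the dendrogram described above), SciPy's hierarchical clustering utilities can build and cut such a tree:
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster
rng = np.random.default_rng(0)
X = rng.normal(size=(25, 2))                      # 25 points, each starting as its own cluster
Z = linkage(X, method="ward")                     # bottom-up (agglomerative) merging
dendrogram(Z)                                     # merge height = distance between clusters
plt.show()
labels = fcluster(Z, t=4, criterion="maxclust")   # cut the dendrogram into 4 clusters
print(labels)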
Two important things that you should know about hierarchical clustering are:
This algorithm has been implemented above using bottom up approach. It is also possible
to follow top-down approach starting with all data points assigned in the same cluster and
recursively performing splits till each data point is assigned a separate cluster.
The decision of merging two clusters is taken on the basis of closeness of these clusters.
There are multiple metrics for deciding the closeness of two clusters :
o Euclidean distance: ||a − b||₂ = √(Σᵢ (aᵢ − bᵢ)²)
o Squared Euclidean distance: ||a − b||₂² = Σᵢ (aᵢ − bᵢ)²
o Manhattan distance: ||a − b||₁ = Σᵢ |aᵢ − bᵢ|
Hierarchical clustering can't handle big data well but K Means clustering can. This is
because the time complexity of K Means is linear, i.e. O(n), while that of hierarchical
clustering is quadratic, i.e. O(n²).
In K Means clustering, since we start with random choice of clusters, the results produced
by running the algorithm multiple times might differ. While results are reproducible in
Hierarchical clustering.
K Means is found to work well when the shape of the clusters is hyper spherical (like circle in
2D, sphere in 3D).
K Means clustering requires prior knowledge of K, i.e. the no. of clusters you want to divide your
data into. But you can stop at whatever number of clusters you find appropriate in
hierarchical clustering by interpreting the dendrogram.
Applications of Clustering
Clustering has a large no. of applications spread across various domains. Some of the most
popular applications of clustering are:
Recommendation engines
Market segmentation
Social network analysis
Search result grouping
Medical imaging
Image segmentation
Anomaly detection
Clustering is an unsupervised machine learning approach, but can it be used to improve the
accuracy of supervised machine learning algorithms as well by clustering the data points into
similar groups and using these cluster labels as independent variables in the supervised machine
learning algorithm? Let‟s find out.
Let's check out the impact of clustering on the accuracy of our model for a classification
problem, using 3000 observations with 100 predictors of stock data to predict whether the
stock will go up or down, using R. This dataset contains 100 independent variables from X1 to
X100 representing the profile of a stock, and one outcome variable Y with two levels: 1 for a rise in
the stock price and -1 for a drop in the stock price.
library('randomForest')
library('Metrics')
So, the accuracy we get is 0.45. Now let's create five clusters based on the values of the independent
variables using k-means clustering and reapply random forest.
Whoo! In the above example, even though the final accuracy is still poor, clustering has given
our model a significant boost, from an accuracy of 0.45 to slightly above 0.53.
This shows that clustering can indeed be helpful for supervised machine learning tasks.
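The R code for that step isn't reproduced here, but a hedged Python sketch of the same idea (with randomly generated placeholder data standing in for the 3000 x 100 stock dataset, so no real accuracy boost should be expected from this exact run) looks like this:
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
rng = np.random.default_rng(1)
X = rng.normal(size=(3000, 100))          # placeholder for the X1..X100 stock features
y = rng.choice([-1, 1], size=3000)        # placeholder for the up/down outcome Y
baseline = cross_val_score(RandomForestClassifier(random_state=0), X, y, cv=5).mean()
clusters = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(X)
X_plus = np.column_stack([X, clusters])   # append the cluster label as an extra predictor
boosted = cross_val_score(RandomForestClassifier(random_state=0), X_plus, y, cv=5).mean()
print(baseline, boosted)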
We live in a fast changing digital world. In today‟s age customers expect the sellers to tell what
they might want to buy. I personally end up using Amazon‟s recommendations almost in all my
visits to their site.
If you can tell the customers what they might want to buy – it not only improves your sales, but
also the customer experience and ultimately life time value.
On the other hand, if you are unable to predict the next purchase, the customer might not
come back to your store.
In this article, we will learn one such algorithm which enables us to predict the items bought
together frequently. Once we know this, we can use it to our advantage in multiple ways.
When you go to a store, would you not want the aisles to be ordered in such a manner that
reduces your efforts to buy things?
For example, I would want the toothbrush, the paste, the mouthwash & other dental products
on a single aisle – because when I buy, I tend to buy them together. This is done by a way in
which we find associations between items.
In order to understand the concept better, let‟s take a simple dataset (let‟s name it as Coffee
dataset) consisting of a few hypothetical transactions. We will try to understand this in simple
English.
Coffee dataset:
For this dataset, we can write the following association rules: (Rules are just for illustrations and
understanding of the concept. They might not represent the actuals).
Rule 3: If Milk and Sugar are purchased, Then Coffee powder is also purchased in 60% of the
transactions.
Generally, association rules are written in “IF-THEN” format. We can also use the term
“Antecedent” for IF (LHS) and “Consequent” for THEN (RHS).
Therefore now we will search for a suitable right hand side or Consequent. If someone buys
Coffee with Milk, we will represent it as {Coffee} => {Milk} where Coffee becomes the LHS and
Milk the RHS.
When we use these to explore more k-item sets, we might find that {Coffee,Milk} => {Tea}. That
means the people who buy Coffee and Milk have a possibility of buying Tea as well.
Let us see how the item sets are actually built using the Apriori.
Milk – 300
Coffee – 200
Tea – 200
Sugar – 150
{Tea, Sugar} – 80
Apriori envisions an iterative approach where it uses k-Item sets to search for (k+1)-Item sets. The
first 1-Item sets are found by gathering the count of each item in the set. Then the 1-Item sets are
used to find 2-Item sets and so on until no more k-Item sets can be explored; when all our items
land up in one final observation as visible in our last row of the table above. One exploration
takes one scan of the complete dataset.
The first part of any analysis is to bring in the dataset. We will be using an inbuilt dataset
“Groceries” from the „arules‟ package to simplify our analysis.
All stores and retailers store their information of transactions in a specific type of dataset called
the “Transaction” type dataset.
The 'pacman' package is a helper for loading and installing packages. We will be using
pacman to load the arules package.
If your system already has those packages, it will load them; if not, it will install and then load them.
Example:
pacman::p_load(PACKAGE_NAME)
pacman::p_load(arules, arulesViz)
OR
library(arules)
library(arulesViz)
data("Groceries")
Before we begin applying the “Apriori” algorithm on our dataset, we need to make sure that it is
of the type “Transactions”.
str(Groceries)
The structure of our transaction type dataset shows us that it is internally divided into three slots:
Data, itemInfo and itemsetInfo.
The slot “Data” contains the dimensions, dimension names and other numerical values of
number of products sold by every transaction made.
These are the first 12 rows of the itemInfo list within the Groceries dataset. It gives specific names
to our items under the column "labels". The "level2" column groups each item into an
easier-to-understand category, while "level1" gives the broadest generalisation (e.g. Meat).
The slot itemInfo contains a Data Frame that has three vectors which categorizes the food items
in the first vector “Labels”.
The second & third vectors divide the food broadly into levels like “baby food”,”bags” etc.
The third slot itemsetInfo will be generated by us and will store all associations.
This is what the internal visual of any transaction dataset looks like and there is a dataframe
containing products bought in each transaction in our first inspection. Then, we can group those
products by TransactionID like we did in our second inspection to see how many times each is
sold before we begin with associativity analysis.
The above datasets are just for a clearer visualisation on how to make a Transaction Dataset
and can be reproduced using the following code:
c("a","b","c"),
c("a","b"),
97
c("a","b","d"),
Page
c("b","e"),
c("b","c","e"),
c("a","d","e"),
c("a","c"),
c("a","b","d"),
c("c","e"),
c("a","b","d","e"),
c("a",'b','e','c')
inspect(data)
inspect(tl)
Let us check the most frequently purchased products using the summary function.
summary(Groceries)
The summary statistics show us the top 5 items sold in our transaction set as “Whole Milk”,”Other
Vegetables”,”Rolls/Buns”,”Soda” and “Yogurt”. (Further explained in Section 3)
To parse to Transaction type, make sure your dataset has similar slots and then use
the as() function in R.
We can set minimum confidence (minConf) to anywhere between 0.75 and 0.85 for varied
results.
I have used support and confidence in my parameter list. Let me try to explain it:
Support: Support is the basic probability of an event to occur. If we have an event to buy
product A, Support(A) is the number of transactions which includes A divided by total number of
transactions.
Confidence: The confidence of a rule is a conditional probability; for a rule A => B it is the
chance of B being purchased given that A has already been purchased.
Lift: This is the ratio of confidence to expected confidence: the probability of all of the items in a
rule occurring together (otherwise known as the support) divided by the product of the
probabilities of the items on the left and right side occurring as if there were no association
between them.
The lift value tells us how much better a rule is at predicting something than randomly guessing.
The higher the lift, the stronger the association.
inspect(rules[1:10])
As we can see, these are the top 10 rules derived from our Groceries dataset by running the
above code.
The first rule shows that if we buy Liquor and Red Wine, we are very likely to buy bottled beer. We
can rank the rules based on top 10 from either lift, support or confidence.
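For readers who prefer Python, a rough equivalent of this support/confidence/lift workflow is available in the mlxtend package. This is a swapped-in alternative to the arules route used here, and the tiny basket list below is purely illustrative:
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules
transactions = [
    ["milk", "bread", "butter"],
    ["milk", "bread"],
    ["bread", "butter"],
    ["milk", "butter"],
]
te = TransactionEncoder()                          # one-hot encode the baskets
onehot = te.fit(transactions).transform(transactions)
df = pd.DataFrame(onehot, columns=te.columns_)
frequent = apriori(df, min_support=0.4, use_colnames=True)
rules = association_rules(frequent, metric="confidence", min_threshold=0.6)
print(rules[["antecedents", "consequents", "support", "confidence", "lift"]])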
Let‟s plot all our rules in certain visualisations first to see what goes with what item in our shop.
Let us first identify which products were sold how frequently in our dataset.
These histograms depict how many times an item has occurred in our dataset as
compared to the others.
The relative frequency plot accounts for the fact that “Whole Milk” and “Other Vegetables”
constitute around half of the transaction dataset; half the sales of the store are of these items.
arules::itemFrequencyPlot(Groceries,topN=20,col=brewer.pal(8,'Pastel2'),main='Relative Item
Frequency Plot',type="relative",ylab="Item Frequency (Relative)")
This would mean that a lot of people are buying milk and vegetables!
What other objects can we place around the more frequently purchased objects to enhance
those sales too?
For example, to boost sales of eggs I can place it beside my milk and vegetables.
Moving forward in the visualisation, we can use a graph to highlight the support and lifts of
various items in our repository, but mostly to see which product is associated with which one in
the sales environment.
plot(rules[1:20], method = "graph")
The size of the graph nodes is based on support levels and the colour on lift ratios. The incoming lines
show the antecedents (the LHS), and the RHS is represented by the names of the items.
The above graph shows us that most of our transactions were consolidated around “Whole Milk”.
We also see that all liquor and wine are very strongly associated so we must place these
together.
Another association we see from this graph is that the people who buy tropical fruits and herbs
also buy rolls and buns. We should place these in an aisle together.
The next plot offers us a parallel coordinate system of visualisation. It helps us clearly see
which products, taken along with which others, result in what kinds of sales.
As mentioned above, the RHS is the Consequent or the item we propose the customer will buy;
the positions are in the LHS where 2 is the most recent addition to our basket and 1 is the item we
previously had.
The topmost rule shows us that when I have whole milk and soups in my shopping cart, I am
highly likely to buy other vegetables to go along with those as well.
plot(rules[1:20], method = "paracoord")
plot(rules[1:20], method = "matrix")
These plots show us each and every rule visualised into a form of a scatterplot. The confidence
levels are plotted on the Y axis and Support levels on the X axis for each rule. We can hover over
them in our interactive plot to see the rule.
arulesViz::plotly_arules(rules)
The plot uses the arulesViz package and plotly to generate an interactive plot. We can hover
over each rule and see the Support, Confidence and Lift.
As the interactive plot suggests, one rule that has a confidence of 1 is the one above. It has an
exceptionally high lift as well, at 5.17.
Reinforcement Learning
Let us start with a simple analogy. If you have a pet at home, you may have used this technique:
a clicker (or whistle) is a way to let your pet know some treat is just about to get served!
This is essentially “reinforcing” your pet to practice good behavior. You click the “clicker” and
follow up with a treat. And with time, your pet gets accustomed to this sound and responds
every time he/she hears the click sound. With this technique, you can train your pet to do
“good” deeds when required.
To apply this to an artificial agent, you have a kind of feedback loop to reinforce your agent.
It rewards the agent when the action performed is right and punishes it in case it was wrong. Basically,
what you have in your kitty is:
an internal state, which is maintained by the agent to learn about the environment
a reward function, which is used to train your agent how to behave
an environment, which is a scenario the agent has to face
an action, which is done by the agent in the environment
and last but not the least, an agent which does all the deeds!
Now, I am sure you must be thinking how the experiment conducted on animals can be
relevant to people practicing machine learning. This is what I thought when I came across
reinforcement learning first.
A lot of beginners tend to think that there are only 2 types of problems in machine learning –
Supervised machine learning and Unsupervised machine learning. I don‟t know where this
notion comes from, but the world of machine learning is much more than the 2 types of
problems mentioned above. Reinforcement learning is one such class of problems.
Let‟s look at some real-life applications of reinforcement learning. Generally, we know the start
state and the end state of an agent, but there could be multiple paths to reach the end state –
reinforcement learning finds an application in these scenarios. This essentially means that
driverless cars, self-navigating vacuum cleaners and the scheduling of elevators are all applications of
reinforcement learning.
Before we look into what a platform is, let's try to understand a reinforcement learning
environment.
A reinforcement learning environment is what an agent can observe and act upon. The horizon
of an agent is much bigger, but it is the task of the agent to perform actions on the environment
which can help it maximize its reward. As per “A brief introduction to reinforcement learning” by
Murphy (1998),
The environment is modeled as a stochastic finite state machine with inputs (actions sent from
the agent) and outputs (observations and rewards sent to the agent).
This is a typical game of mario. Remember how you played this game. Now consider that you
are the “agent” who is playing the game.
Now you have “access” to a land of opportunities, but you don‟t know what will happen when
you do something, say smash a brick. You can see a limited amount of “environment”, and until
you traverse around the world you can‟t see everything. So you move around the world, trying
to perceive what entails ahead of you, and at the same time try to increase your chances to
attain your goal.
This whole "story" is not created by itself. You have to "render" it first. And that is the main task of
the platform, viz. to create everything required for a complete experience – the environment,
the agent and the rewards.
i) Deepmind Lab
Things I liked
It still lacks variety in terms of a gaming environment, which should get built out over time.
Also, at the moment it supports only Linux, though it has been tested on different OSes. Bazel
(which is a dependency for Deepmind Lab) is experimental for Windows, so Windows
support for Deepmind Lab is still not guaranteed.
Set of states, S
Set of actions, A
Reward function, R
Policy, π
Value, V
We have to take an action (A) to transition from our start state to our end state (S). In return, we
get rewards (R) for each action we take. Our actions can lead to a positive reward or a
negative reward.
The set of actions we take defines our policy (π) and the rewards we get in return define our
value (V). Our task here is to maximize our rewards by choosing the correct policy, i.e. to
maximize the total cumulative reward.
This is a representation of a shortest path problem. The task is to go from place A to place F, with
as low cost as possible. The numbers at each edge between two places represent the cost
taken to traverse the distance. The negative costs are actually some earnings on the way. We
define Value as the total cumulative reward obtained when following a policy.
Here, you can take a greedy approach and take the best possible next step, which is going from {A ->
D} out of the subset {A -> (B, C, D, E)}. Similarly, now you are at place D and want to go to place
F; you can choose from {D -> (B, C, F)}. We see that {D -> F} has the lowest cost and hence we
take that path.
So here, our policy was to take {A -> D -> F} and our Value is -120.
Congratulations! You have just implemented a reinforcement learning algorithm. This algorithm is
known as epsilon greedy, which is literally a greedy approach to solving the problem. Now if you
(the salesman) want to go from place A to place F again, you would always choose the same
policy.
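A minimal Python sketch of this greedy walk is given below. The edge structure follows the description above (A connects to B, C, D and E; D connects to B, C and F), but the individual edge costs are hypothetical placeholders, chosen only so that A -> D -> F sums to the -120 mentioned in the text:
# Hypothetical edge costs; only the graph structure comes from the example above.
costs = {
    "A": {"B": 50, "C": 40, "D": -30, "E": 60},
    "D": {"B": 20, "C": 10, "F": -90},
}
def greedy_path(start, goal, costs):
    # Pure exploitation: always take the cheapest outgoing edge.
    path, total, node = [start], 0, start
    while node != goal:
        nxt = min(costs[node], key=costs[node].get)   # cheapest next step
        total += costs[node][nxt]
        path.append(nxt)
        node = nxt
    return path, total
print(greedy_path("A", "F", costs))   # (['A', 'D', 'F'], -120) with these placeholder costs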
Can you guess which category does our policy belong to i.e. (pure exploration vs pure
exploitation)?
Notice that the policy we took is not an optimal policy. We would have to “explore” a little bit to
find the optimal policy. The approach which we took here is policy based learning, and our task
is to find the optimal policy among all the possible policies. There are different ways to solve this
problem, I‟ll briefly list down the major categories
We will be using the Deep Q-learning algorithm. Q-learning is a value-based learning algorithm; in
Deep Q-learning, a neural network is used as the function approximator for the Q-value function.
This algorithm was used by Google DeepMind to beat human players at Atari games.
When I was a kid, I remember that I would pick a stick and try to balance it on one hand. Me
and my friends used to have this competition where whoever balances it for more time would
get a “reward”, a chocolate!
Assuming you have pip installed, you need to install the required libraries (keras-rl, gym and keras) and then import them:
import numpy as np
import gym
from keras.models import Sequential
from keras.layers import Dense, Activation, Flatten
from keras.optimizers import Adam
from rl.agents.dqn import DQNAgent
from rl.policy import EpsGreedyQPolicy
from rl.memory import SequentialMemory
Then set the relevant variables
ENV_NAME = 'CartPole-v0'
# Get the environment and extract the number of actions available in the Cartpole problem
env = gym.make(ENV_NAME)
np.random.seed(123)
env.seed(123)
nb_actions = env.action_space.n
Next, we build a very simple single hidden layer neural network model.
model = Sequential()
model.add(Flatten(input_shape=(1,) + env.observation_space.shape))
model.add(Dense(16))
model.add(Activation('relu'))
model.add(Dense(nb_actions))
model.add(Activation('linear'))
print(model.summary())
Next, we configure and compile our agent. We set our policy as Epsilon Greedy and we also set
our memory as Sequential Memory because we want to store the result of actions we performed
and the rewards we get for each action.
policy = EpsGreedyQPolicy()
memory = SequentialMemory(limit=50000, window_length=1)
dqn = DQNAgent(model=model, nb_actions=nb_actions, memory=memory,
nb_steps_warmup=10,
target_model_update=1e-2, policy=policy)
dqn.compile(Adam(lr=1e-3), metrics=['mae'])
# Okay, now it's time to learn something! We visualize the training here for show, but this slows
down training quite a lot.
dqn.fit(env, nb_steps=5000, visualize=True, verbose=2)
Now we test our reinforcement learning model
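Following the same keras-rl example, the trained agent can be evaluated with its test() method (the number of episodes here is arbitrary):
# Run a few evaluation episodes with the learned policy
dqn.test(env, nb_episodes=5, visualize=True)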
Now that you have seen a basic implementation of reinforcement learning, let us start moving
towards a few more problems, increasing the complexity a little bit every time.
For those, who don‟t know the game – it was invented in 1883 and consists of 3 rods along with a
number of sequentially-sized disks (3 in the figure above) starting at the leftmost rod. The
objective is to move all the disks from the leftmost rod to the rightmost rod with the least number
of moves. (You can read more on wikipedia)
Starting state – All 3 disks in leftmost rod (in order 1, 2 and 3 from top to bottom)
End State – All 3 disks in rightmost rod (in order 1, 2 and 3 from top to bottom)
Numerical Reward:
Since we want to solve the problem in least number of steps, we can attach a reward of -1 to
each step.
Policy:
Now, without going into any technical details, we can map the possible transitions between the above
states. For example, (123)** -> (23)1* with reward -1. It can also go to (23)*1.
If you can now see the parallel, each of these 27 states mentioned above can be represented as a graph
similar to that of the shortest path problem above, and we can find the most optimal solutions by
experimenting with various states and paths.
While I can solve this for you as well, I would want you to do this by yourself. Follow the same line
of thought I used above and you should be good.
Start by defining the Starting state and the end state. Next, define all possible states and their
transitions along with reward and policy. Finally, you should be able to create a solution for
solving a Rubik's Cube using the same approach.
As you would realize, the complexity of this Rubik's Cube is many folds higher than the Towers
of Hanoi. You can also understand how the possible number of options has increased in
number. Now, think of the number of states and options in a game of Chess and then in Go! Google
DeepMind recently created a deep reinforcement learning algorithm which defeated Lee
Sedol!
With the recent success in Deep Learning, now the focus is slowly shifting to applying deep
learning to solve reinforcement learning problems. The news recently has been flooded with the
defeat of Lee Sedol by a deep reinforcement learning algorithm developed by Google
DeepMind. Similar breakthroughs are being seen in video games, where the algorithms
developed are achieving human-level accuracy and beyond. Research is in full swing, with both
industrial and academic groups working together to accomplish the goal of building
better self-learning robots.
INTRODUCTION
Every day we come across a lot of information in the form of facts, numerical
figures, tables, graph, etc. These are provided by newspapers, televisions, magazines,
blogs and other means of communication. These may relate to cricket batting or
bowling averages, profits of a company, temperatures of cities, expenditures in various
sectors of a five year budget plan, polling results, and so on. The numerical facts or
figures collected with a definite purpose are called Data (the plural form of the Latin word
Datum).
Our world is becoming more and more information oriented. Every part of our
lives utilizes data in one form or another. So, it becomes essential for us to know how to
extract meaningful information from such data, which is studied in a branch of
mathematics called statistics.
The word statistics appears to have been derived from the Latin word Status (a
political state), meaning the collection of data on different aspects of the life of the people,
useful to the State.
Later, scientists sought to answer questions using rigorous methods and careful
observations. The observations which are collected from field notes, surveys, and
experiments form the main pillar or backbone of a statistical investigation and are
called data. Statistics is the study of how best to
collect,
organize,
analyze,
interpret and
present the data.
POPULATION
Don't get confused when we hear the word population: we typically think of all
the people living in a town, state, or country. In statistics, a population is an entire group
about which some information is required to be ascertained. A statistical population
need not consist only of people. We can have populations of heights, weights, BMIs,
hemoglobin levels, events, or outcomes. In selecting a population for study, the research
question or purpose of the study will suggest a suitable definition of the population to be studied.
SAMPLE
A sample is any part of the fully defined population. A syringe full of blood drawn from a
patient‟s vein is a sample of all of his blood in circulation at the moment. Similarly, 100
patients of schizophrenia in a clinical study is a sample of the population of
schizophrenics, provided the sample is properly chosen and the inclusion and exclusion
criteria are well defined.
To make accurate inferences, the sample has to be representative in which each and
every member of the population has an equal and mutually exclusive chance of being
selected.
Clinical and demographic characteristics define the target population, the large
set of people in the world to which the results of the study will be generalized
(e.g. all schizophrenics).
The study population is the subset of the target population available for study
(e.g. schizophrenics in the researcher's town).
The study sample is the sample chosen from the study population.
METHODS OF SAMPLING
Simple random sampling
The usual method of selecting a simple random sample from a listing of individuals is to
assign a number to each individual and then select certain numbers by reference to
random number tables which are published in standard statistical textbooks. Random
number can also be generated by statistical software such as EPI INFO developed by
WHO and CDC Atlanta.
Systematic sampling
A simple method of random sampling is to select a systematic sample in which
every nth person is selected from a list . A systematic sample can be drawn from a
queue of people or from patients ordered according to the time of their attendance at
a clinic. To fulfill the statistical criteria for a random sample, a systematic sample should
be drawn from subjects who are randomly ordered. The starting point for selection
should be randomly chosen.
Multistage sampling
Sometimes, a strictly random sample may be difficult to obtain and it may be more
feasible to draw the required number of subjects in a series of stages. For example,
suppose we wish to estimate the number of CAT scan examinations made of all
patients entering a hospital in a given month in the state of Maharashtra. It would be
quite tedious to devise a scheme which would allow the total population of patients to
be directly sampled. However, it would be easier
To list the districts of the state of Maharashtra and randomly draw a sample of
these districts.
Within this sample of districts, all the hospitals would then be listed by name, and
a random sample of these can be drawn.
Within each of these hospitals, a sample of the patients entering in the given
month could be chosen randomly for observation and recording.
Thus, by stages, we draw the required sample. If indicated, we can introduce
some element of stratification at some stage (urban/rural, gender, age).
TYPES
The two main types of statistics are descriptive statistics and inferential statistics.
As we know that the steps to study a survey or an experiment are to collect,
organize, analyze, interpret and present the data. Now the steps are divided into two
groups where the initial steps like collecting, organizing and presenting belong to
Descriptive statistics and the remaining two steps like analyzing and
interpreting(drawing the conclusion) the data belong to Inferential statistics.
Since the name is Descriptive, it involves describing the numbers
obtained in the experiment or survey and preparing the data for analysis by finding the
measures of central tendency, spread, shape, regression, etc., depending upon the
type of data. Descriptive statistics are limited in that they only allow you to
make summaries about the people or objects that you have actually measured. For
example, if we have a new drug which cures a particular virus and it worked on a set of
patients, we cannot claim that it would work on other set of patients only based on
descriptive statistics. This is where inferential statistics comes in.
As per wiki, univariate analysis describes the distribution of a single variable, such as its central tendency and its dispersion.
Inferential statistics makes propositions using data drawn from the population with some
form of sampling. Given a hypothesis about a population, for which we wish to draw
inferences, statistical inference consists of (first) selecting a model of the process that
generates the data and (second) deducing propositions from the
model. The conclusion of an inferential analysis is known as a statistical proposition. Some
common forms of statistical proposition are:
a point estimate (a particular value that best approximates the parameter)
an interval estimate, e.g. a confidence interval (or set estimate), i.e. an interval constructed so
that, under repeated sampling, it contains the parameter value with the stated probability (the
confidence level)
a credible interval (a set of values containing the parameter with a stated posterior probability)
rejection of a hypothesis
clustering of data points
Both descriptive statistics and inferential statistics go hand in hand and cannot exist
without one another.
We can simply further divide Descriptive and Inferential statistics as shown
VARIABLES
Qualitative variables can be values that are names or labels. Let's have a look at this table:
Scale: Nominal (Qualitative)
Properties: there will be a difference, however the order does not matter. Operations: =, ≠
Examples: Maruthi = 1, Hyundai = 2, Volkswagen = 3, Audi = 4, BMW = 5, Other = 6
Scale: Ordinal (Qualitative)
Properties: there will be a difference and the direction of the difference is indicated (less than or more than). Operations: =, ≠, <, >
Examples: Food taste: Agree = 1, Strongly agree = 2, Disagree = 3, Strongly disagree = 4, Don't know = 5
Quantitative variables are numerically measurable quantities. For example, when we say
population of a city, we are thinking about the number of people in the city which is a
measurable attribute (quantity)of the city. Therefore, population is a quantitative
variable.
Suppose the fire department mandates that all fire fighters must weigh between
75 and 90 kilos. The weight of a fire fighter would be an example of a continuous
variable; since a fire fighter's weight could take on any value between 75.0 and
90.0 kilos.
Suppose we repeatedly flip a coin and count the number of heads. The number of heads can
take any integer value between 0 and +∞, but we cannot get 2.3 heads. So, the
number of heads must be a discrete variable.
Independent (explanatory/predictor) variable,
Dependent (response) variable.
DATA
Datum is the singular form of the noun, whereas Data is the plural form, which has been in
common use since the 20th century. Data can be categorized as either numeric or nonnumeric. Specific
terms are used as follows:
When the data are classified according to some qualitative terms like
honesty, beauty, employment, intelligence, occupation, sex, literacy, etc, the
classification is termed as qualitative or descriptive or with respect to attributes.
Qualitative data are often termed as categorical data.
Univariate data: The data identified on the basis of single characteristic is called
as single attribute/univariate data. The number of students in a class room based
on the characteristic gender. Here, we consider the characteristic as boys and
girls.
Multivariate data: The data identified on the basis of more than one
characteristic is called multivariate data. For example, the number of students in a class
room based on the characteristics height and gender. Here, we consider the
characteristics boys and girls and also tall and short. If the data is classified into
two or more classes with respect to a given attribute, it is said to be a manifold
classification. For example, for the attribute intelligence the various classes may
be, genius, very intelligent, average intelligent, below average and dull.
Ratio – Ratio variables are numbers with a true starting point (zero). Ratio
responses have order and spacing, and multiplication makes sense too.
Example: height, weight.
Nominal: classification, membership; operations =, ≠; grouping; central tendency: Mode
Interval: difference, affinity; operations +, −; yardstick; central tendency: Mean, Deviation
Ratio: magnitude, amount; operations ×, /; ratio; central tendency: Geometric mean, Coefficient of variation
Depending on the variable type, data can also be placed in the same categorization, as shown.
DATA VISUALIZATION
The technique used to convert a set of data into visual insight is known as data
visualization. The main aim of data visualization is to give the data a meaningful
representation. To create an instant understanding from multi-variable data, it can be
displayed as 2d or 3d format images with techniques such as colorization, 3D imaging,
animation and spatial annotation.
The primary objective of any statistical story should be to inform its audience and
be newsworthy. It must use the statistics available to provide substance and stimulate
interest. It should seek to delve through the large pool of data and only surface those
details which will be useful and pertinent to the needs of the user. Once this data has
been uncovered the next step must be to ensure that the presentation of the story is in
a format that is understandable and easy to use. All statistical stories have a target
audience and it is critical that their needs are considered.
There are many different tools for visualizing statistical data. They may be open source or
commercial. Some of the most commonly heard-of tools, mostly open source, are:
Excel
You can actually do some pretty complex things with Excel, from table of cells to
scatter plots. As an entry-level tool, it can be a good way of quickly exploring data,
or creating visualizations for internal use, but it has limited default set of colors, lines
and styles. Excel is part of the Microsoft Office suite, so if you don't have access to it,
Google's spreadsheets can do many of the same things.
R
Tableau
We can create and share data visualizations in real time with Tableau. Tableau Public is a
popular data visualization tool which is completely free and is packed with graphs, charts,
maps and more. Users can easily drag and drop data into the system to update it
in real time, thereby collaborating with other team members for quick project
turnaround.
Qualitative variables:
1. The Bar Chart (or Bar Graph) is one of the most common ways of displaying
categorical/qualitative data. A bar graph mainly holds 2 variables, the response
(dependent variable) and the predictor (independent variable), which can be
arranged on the horizontal and vertical axes of a graph. The relationship of the
predictor and response variables is shown by a mark of some sort (usually a
rectangular box) from one variable's value to the other's.
For example: A survey of 145 people asked them "Which is the nicest fruit?":
Fruit:        Apple  Orange  Banana  Pomegranate  Grapes  Mango
No. of votes: 30     25      15      5            20      35
Let's see how we can use the Excel tool to represent this data as a bar chart.
Open Microsoft Excel and enter the table. Select the table, which gets
highlighted.
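The same chart can also be drawn in Python with matplotlib; here is a minimal sketch using the survey numbers above:
import matplotlib.pyplot as plt
fruits = ["Apple", "Orange", "Banana", "Pomegranate", "Grapes", "Mango"]
votes = [30, 25, 15, 5, 20, 35]
plt.bar(fruits, votes)                 # one bar per category
plt.xlabel("Fruit")
plt.ylabel("No. of votes")
plt.title("Which is the nicest fruit? (145 people surveyed)")
plt.show()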
2. A histogram is similar to a bar chart but is used for continuous data. Usually, there is
no space between adjacent columns. The columns are positioned over a label
that represents a continuous variable. The column label can either be a single
value or a range of values. The height of each column corresponds to the size of its
group, and the area covered by each bar is proportional to the frequency of the data.
4. A line graph can be, for example, a picture of what happened by/to something
(a variable) during a specific time period (also a variable). Usually a line graph is
plotted from a table which shows the relationship between the two variables in
the form of pairs. Just as in 2D graphs, each of the pairs corresponds to a specific point on
the graph, and the points are connected to one another, forming a line.
For example:
Sales of ice creams over a week in the month of august are:
5. Scatter Plot is used to show the relationship between 2 numeric variables. A scatter
plot matrix is a collection of pair-wise scatter plots of numeric variables.
For example, we have height and weight data for the students of a class as follows:
Height (cms): 180, 178, 170, 150, 145, 165, 162, 158
Weight (kgs): 72, 71, 69, 65, 50, 60, 58, 48
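A matching Python sketch with matplotlib, using the height/weight values above:
import matplotlib.pyplot as plt
height_cm = [180, 178, 170, 150, 145, 165, 162, 158]
weight_kg = [72, 71, 69, 65, 50, 60, 58, 48]
plt.scatter(height_cm, weight_kg)      # one point per student
plt.xlabel("Height (cms)")
plt.ylabel("Weight (kgs)")
plt.title("Height vs weight of students")
plt.show()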
The statistical measure which helps to identify an entire distribution by a single value
is known as Central tendency. It is often used as an accurate description of the data. It
is the single value that is most typical/representative of the collected data. The three
commonly used measures of central tendency are mean, median and mode.
Variability, also referred to as dispersion or spread, measures whether the data values are
tightly clustered or spread out and how much they differ from the mean value. The
spread of the values can be measured for numeric variables (quantitative data)
arranged in ascending order. Measures of spread such as the variance and standard
deviation describe how closely the data values lie around the mean. If the data set has a large
spread, the mean is not as representative of the data as when the spread of the data is small.
This is because when there are large differences among individual scores it indicates a larger
spread.
The two numerical measures of shape, skewness and kurtosis, can be used to test for
normality. The data set is not normally distributed when these values are not close to
zero.
Let‟s get into detailed versions.
MEAN
We are aware of the term average from our schooling. In statistics, when
dealing with a set of quantitative data, we term it the mean. It is computed by adding all
the values in the data set and dividing by the number of observations in it.
The mean uses each and every value in the dataset and hence is a good
representative of the data. Repeated samples drawn from the same population tend
to have similar means. Calculating the mean reduces random errors and helps to derive
a more accurate result. Therefore the mean is considered the best measure of
central tendency that resists fluctuation between different samples. The limitation of the
mean is that it is sensitive to extreme values/outliers, especially when the sample size is
small.
Example: scores for eight observations, a01 to a08:
a01 – 75, a02 – 70, a03 – 72, a04 – 83, a05 – 80, a06 – 86, a07 – 77, a08 – 88
Applying the formula, we have mean = (75 + 70 + 72 + 83 + 80 + 86 + 77 + 88) / 8 = 631 / 8 = 78.875
MEDIAN
The median of a numerical data set is the middle-most value when the data
is arranged in ascending or descending order. It is the halfway point in a data set, also
known as the positional average. We know that in triangles a median is a line that divides the
opposite side into equal lengths, splitting the area exactly 50-50. Similarly, 50% of the
values of the distribution are less than the median (to the left of the median) and 50% are greater
than the median (to the right of the median). Hence the median is considered a measure of
central tendency. The median is less affected by outliers or extreme values. The median can
also be used as a measure of position (quartiles and deciles) and can be termed the second
quartile or 50th percentile. It is, however, affected by sampling fluctuations. It is used to
summarize ordinal or highly skewed interval or ratio scores.
For example, if the two middle values of a sorted data set are 26 and 28, then
Median = (26 + 28) / 2 = 27
In Excel, we have the formula MEDIAN(start value : end value).
MODE
Along with mean and median, Mode is one of the central tendencies of data
distribution. It does not involve much of tedious computations and can be found by
easy observation of occurrences of data values. When data is tightly clustered around
one or two values, Mode is the most meaningful average. Mode is the value that
occurs most often. Because of the highest number of occurrence in the data set, Mode
can be considered as the most typical value in the data set and hence a measure of
central tendency.Mode can be easily quoted by sheer counting or plotting the
frequencies of each item. Mode is not affected by extreme values. The advantage of
mode over the other two measures of central tendencies is it can be found for non
numerical data sets.
The highest point of the relative frequency histogram corresponds to the mode of the
data set. Exactly one mean and one median always exist for any data set, but not
necessarily one mode, because several different values could occur with the highest frequency. It
may even happen that every value occurs with the same frequency; in such a case the
concept of the mode does not make much sense.
A data set with only one greatest-frequency value is termed unimodal, while data
sets which have the same greatest frequency for two values are called bimodal. When
the data set has more than two values occurring with the same greatest frequency,
the distribution is described as multimodal and each of such values is a mode.
When all the data values have the same frequency (each occurring only once), the
data set is said to have no mode.
The class with highest frequency for grouped data with given frequencies is known
as the modal class.
The mode for small data sets can be determined by plotting the occurrences on a
number line.
Limitation of mode is it is not well defined, not based on all items and is affected by
sampling fluctuations.
Example: The following are points scored by a Basket ball team during the games
played in a season.
94, 94, 95, 96, 97, 97, 97, 98, 98, 98, 98, 99, 100, 101, 102, 89, 90, 92, 92, 92, 93, 97, 97, 99,
99, 100, 103, 108.
The picture below shows the points marked for each occurrence of the score.
It can be seen that the point 97 has occurred 5 times in the season which is the highest.
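A quick Python check of all three measures for the basketball scores above, using the standard-library statistics module:
import statistics
points = [94, 94, 95, 96, 97, 97, 97, 98, 98, 98, 98, 99, 100, 101, 102,
          89, 90, 92, 92, 92, 93, 97, 97, 99, 99, 100, 103, 108]
print(statistics.mean(points))    # arithmetic mean of the scores
print(statistics.median(points))  # middle value of the sorted scores
print(statistics.mode(points))    # most frequent score -> 97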
RANGE AND INTER-QUARTILE RANGE
Range is defined as the difference between the maximum value and minimum
value in a data set. The minimum and maximum values are useful to know, and helpful
in identifying outliers, but the range is extremely sensitive to outliers and not very useful
as a general measure of dispersion in the data. We know that measures of central
tendencies at one point have the issue with the outliers so, to overcome this we can
look at the range of the data after dropping values from each end.
Range = Maximum value – Minimum value
The inter-quartile range (IQR) is the difference between the third and first quartiles (IQR = Q3 − Q1),
i.e. the range covered by the middle 50% of the data. The IQR is used to plot box plots, which are
graphical representations of a distribution; the length of the box in a box plot is the IQR. For a
symmetric distribution, the median equals the midhinge (the average of the first and third quartiles),
and half of the IQR equals the median absolute deviation (MAD).
Half of the IQR is termed the quartile deviation or semi-interquartile range.
VARIANCE
The average of the squared differences from the mean is known as the variance.
The smaller the variance, the closer the data points are to the mean and to each other. A higher
variance indicates that the data points are very spread out from the mean and
from each other.
The variance of a sample is s² = Σᵢ₌₁ⁿ (xᵢ − x̄)² / (n − 1)
where xᵢ is the ith unit, starting from the first observation to the last,
x̄ is the sample mean, and
n is the number of units in the sample.
To calculate it manually: the sum of the actual deviations from the mean, which have both positive
and negative values, is zero. We therefore use the sum of squared deviations instead of the actual
differences. Since the square of a deviation is always positive, the variance is positive for all data
distributions.
STANDARD DEVIATION
When the values are pretty tightly bunched together and the bell-shaped curve is
steep, the standard deviation is small.
When the values are spread apart and the bell curve is relatively flat, that tells you have
a relatively large standard deviation.
One standard deviation away from the mean in either direction on the horizontal axis (-1SD to
+1SD) accounts for about 68% of the people in the group. Two standard deviations away from
the mean (-2SD to +2SD) account for roughly 95% of the people. Three standard deviations (all
the shaded areas) account for 99.7% of the people.
In Microsoft Excel, STDEV(A1:Z99) for sample and STDEVP(A1:Z99) if you want to use the
"biased" or "n" method for population.
SKEWNESS
When a histogram is constructed for skewed data, it is easy to identify the skewness by the shape of
the distribution.
A distribution is positively skewed when the tail on the right side of the histogram is
longer than the left side. Most of the values tend to cluster toward the left side of the x-axis,
with increasingly fewer values at the right side of the x-axis.
A distribution is said to be negatively skewed when the tail on the left side of the histogram is longer
than the right side.
Mode skewness coefficient (first skewness coefficient) = (mean − mode) / standard deviation
Median skewness coefficient (second skewness coefficient) = 3 × (mean − median) / standard deviation
To calculate "Skewness" (the amount of skew), we use the SKEW() function in Excel.
KURTOSIS
Kurtosis is a measure of the thickness of the tails of a variable's distribution. The outliers
in the given data have more effect on this measure. Moreover, it does not have any
unit. The kurtosis of a distribution can be classified as leptokurtic, mesokurtic and platykurtic.
Leptokurtic distributions are variable distributions with heavier tails and
have positive kurtosis (kurtosis > 0). As the name tells us, lepto means slender. Examples of
leptokurtic distributions are Student's t-distribution, the Rayleigh distribution, the Laplace
distribution, the exponential distribution, the Poisson distribution and the logistic distribution.
Platykurtic distributions have thinner, lighter tails and thus have negative
kurtosis (kurtosis < 0). As the name tells us, platy means broad. Examples of platykurtic
distributions are the continuous or discrete uniform distributions, the raised cosine
distribution and the Bernoulli distribution.
Mesokurtic distributions (such as the normal distribution) have a kurtosis of
zero. Most often, kurtosis is measured against the normal distribution; for example,
the binomial distribution is mesokurtic for some values of p.
The normal curve is called Mesokurtic curve. If the curve of a distribution is more
peaked than a normal or mesokurtic curve then it is referred to as a Leptokurtic curve. If
a curve is less peaked than a normal curve, it is called as a platykurtic curve.
Formula: kurtosis = [Σᵢ₌₁ⁿ (xᵢ − x̄)⁴ / n] / [Σᵢ₌₁ⁿ (xᵢ − x̄)² / n]² − 3, where the −3 makes the kurtosis of the normal distribution equal to zero.
The sample kurtosis is a useful measure of whether there is a problem with outliers in a
data set. A larger kurtosis indicates a more serious outlier problem, and may lead the
researcher to choose alternative statistical methods.
CORRELATION
When an investigator has collected two series of observations and wishes to see
whether there is a relationship between them, one should first construct a scatter
diagram. If one set of observations consists of experimental results and the other consists
of a time scale or observed classification of some kind, it is usual to put the experimental
results on the vertical axis.
A correlation report also shows a second result for each test, which is the statistical
significance.
In Excel we have the function CORREL(array1, array2).
For example, =CORREL(A2:A6, B2:B6)
COVARIANCE
The covariance of two variables x and y in a data set is a measure of the directional
relationship between them. A positive covariance indicates a positive linear
relationship between the variables which move together. A negative covariance
indicates that the variables move inversely. Covariance is similar to variance except
that we have two variables x and y.
Sample covariance: sₓᵧ = Σᵢ₌₁ⁿ (xᵢ − x̄)(yᵢ − ȳ) / (n − 1)
Population covariance: σₓᵧ = Σᵢ₌₁ᴺ (xᵢ − μₓ)(yᵢ − μᵧ) / N
For a scatter plot of two variables, the covariance measures how close the scatter is to a line.
Positive covariance corresponds to upward-sloping scatter plots and negative covariance
corresponds to downward-sloping scatter plots. Covariance is scale dependent and has units.
Covariance captures only linear dependence, so a purely nonlinear relationship can have zero
covariance. Independence implies zero covariance. The value for a perfect linear relationship
depends on the data, because covariance values are not standardized.
There is often confusion in understanding the terms covariance and correlation. Here are some
of the differences:
Covariance can involve the relationship of two variables or data sets, whereas
correlation can involve the relationship of several variables.
Correlation values range from +1 to -1, but covariance values can exceed this scale.
The correlation coefficient tells how strongly two variables are related on a standardized
scale, whereas the covariance calculation tells you how much two variables tend to change
together.
In Excel, we have the function COVAR (array 1, array 2) where arrays are the 2 different
data sets.
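In Python, NumPy gives both measures directly (re-using the height/weight values from the scatter plot example):
import numpy as np
x = np.array([180, 178, 170, 150, 145, 165, 162, 158])   # heights (cms)
y = np.array([72, 71, 69, 65, 50, 60, 58, 48])            # weights (kgs)
print(np.cov(x, y)[0, 1])       # sample covariance (scale dependent, has units)
print(np.corrcoef(x, y)[0, 1])  # correlation coefficient, always between -1 and +1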
PROBABILITY
The probability theory is very helpful for making predictions. In research investigation,
estimates and predictions form an important part. Using statistical methods, we
estimate for the further analysis. The role of probability in modern science is simply a
substitute for certainty. Probability can be defined in terms of a random process giving
rise to an outcome. Rolling a die or flipping a coin is a random process which gives rise
to an outcome.
The probability of occurrence of an event A is
P(A) = (Number of favourable outcomes) / (Total number of equally likely outcomes)
If two events, A and B, are mutually exclusive, the probability that A or B will occur is:
P(A or B) = P(A) + P(B)
If two events, A and B, are non-mutually exclusive, the probability that A or B will occur is:
P(A or B) = P(A) + P(B) − P(A and B)
Example: A single random card is chosen from a deck of 52 playing cards. What is the probability
of choosing a Queen or a club?
P(Queen or club) = P(Queen) + P(club) − P(Queen of clubs)
= 4/52 + 13/52 − 1/52
= 16/52
= 4/13
If two events A and B are independent, the probability of both occurring is:
P(A and B) = P(A) · P(B)
Example: A coin is tossed and a single 6-sided die is rolled. Find the probability of landing on the
head side of the coin and rolling a 5 on the die.
P(head) = 1/2
P(5) = 1/6
P(head and 5) = 1/2 · 1/6 = 1/12
For conditional probability,
P(B|A) = P(A and B) / P(A)
Example: A math teacher gave her class two tests. 23% of the class passed both tests and 46% of
the class passed the first test. What percentage of those who passed the first test also passed the
second test?
P(second | first) = P(first and second) / P(first) = 0.23 / 0.46 = 0.5, i.e. 50%
If two events, A and B, are dependent, the probability of both occurring is:
P(A and B) = P(A) · P(B|A)
Example: Mr. Parth needs two students to help him with a science demonstration for his class of 16
girls and 12 boys. He randomly chooses one student who comes to the front of the
room. He then chooses a second student from the remaining class. What is the
probability that both chosen students are girls?
P(both girls) = 16/28 · 15/27
= 240/756
= 60/189 ≈ 0.317
RANDOM VARIABLES
A random variable is the value of the variable which represents the outcome of a
statistical experiment within sample space (range of values). It is usually represented
by X. The two types of random variables are discrete random variable and continuous
random variable.
Discrete random variable is a variable which can take countable number of distinct
values.
For example, while tossing two coins, let us consider the random variable (X) to be the number of heads observed.
Here, the possible outcomes are {HH, HT, TH, TT}, which means the number of heads in a single outcome may be 2 heads, 1 head, or no heads at all. Hence, the possible values taken by the random variable are 0, 1, and 2 (discrete).
Continuous random variable is a variable which can take infinite number of values in an
interval. It is usually represented by the area covered under the curve. Examples are
height and weight of the subjects, maximum and minimum temperatures of a particular
place.
Let Y be the random variable for the average height of a random group consisting of 25 people. The resulting outcome is a continuous figure, since the height may be 5 ft or 5.15 ft or 5.002 ft; clearly, there is an infinite number of possible values for height.
Let X and Y be two random variables and C a constant. Then CX, X + Y, and X - Y are also random variables; any arithmetic combination of random variables results in another random variable.
Let Z be the random variable which is the number on the top face of a die when it is rolled once. The possible values for Z are 1, 2, 3, 4, 5, and 6, and P(1) = P(2) = P(3) = P(4) = P(5) = P(6) = 1/6, as they are all equally likely to be the value of Z. Note that the sum of all the probabilities is 1.
Even when the outcomes of a random event are not equally likely to happen, we can still find the probabilities of the random variable. Let Y be the random variable which is the number of heads we get from tossing two coins; then Y could be 0, 1, or 2. The two coins can land in 4 different ways: TT (no heads), HT (one head), TH (one head), HH (2 heads). So P(Y = 0) = 1/4, P(Y = 1) = 2/4, and P(Y = 2) = 1/4.
PROBABILITY DISTRIBUTION
Probabilities for continuous distributions are measured over ranges of values rather than single points; here we calculate the probability that a value falls within a given interval.
In discrete distributions, the sum of all probabilities must equal one. Similarly, in
continuous distributions, the entire area in a probability plot under the distribution curve
must be equal to 1. The proportion of the area under the curve that falls within a range
of values along the X-axis represents the probability.
Each probability distribution has parameters that define its shape. Most distributions
have between 1-3 parameters. Based on these parameters, the shape of the
distribution and all of its probabilities can be established, such as the central tendency
and the variability.
The mathematical representation of a continuous probability function is f(x), with the following properties:
The probability that x is between two points a and b is p[ a ≤ x ≤ b ] = ∫_a^b f(x) dx.
It is non-negative for all real x.
The total area under the curve is 1: ∫_{−∞}^{∞} f(x) dx = 1.
BERNOULLI DISTRIBUTION
A Bernoulli trial is one of the simplest experiments with exactly two possible outcomes,
success and failure. For example
Coin tosses: whether a toss comes up heads or tails.
Births: whether a baby born on a given day is a boy or a girl.
Rolling dice: whether a roll of two dice results in a double six.
A Bernoulli random variable takes the value 1 (success) with probability p and 0 (failure) with probability 1 − p. The mode of a Bernoulli distribution (the value with the highest probability of occurring) is
Mode = 0 if (1 − p) > p; 0 and 1 if (1 − p) = p; 1 if (1 − p) < p
The Bernoulli distribution can also be defined as the Binomial distribution with n = 1.
BINOMIAL DISTRIBUTION
The binomial distribution gives the probability of exactly k successes in n independent Bernoulli trials, each with success probability p: P(X = k) = C(n, k) · p^k · (1 − p)^(n − k). Its mode is
Mode = ⌊(n + 1)p⌋ if (n + 1)p is 0 or a non-integer; (n + 1)p and (n + 1)p − 1 if (n + 1)p ∈ {1, 2, ..., n}; n if (n + 1)p = n + 1
For example: if a coin is tossed 10 times, what is the probability of getting exactly 6 heads?
In Excel's BINOMDIST(number_s, trials, probability_s, cumulative) terms: number_s = 6, trials = 10, probability_s = 1/2 = 0.5 (the probability of a head or a tail in a single toss of a coin is 1/2), and cumulative = FALSE.
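A Python equivalent of the Excel call above, assuming SciPy is available; scipy.stats.binom gives both the exact and the cumulative probability:

from scipy.stats import binom

# P(exactly 6 heads in 10 tosses of a fair coin)
print(round(binom.pmf(k=6, n=10, p=0.5), 4))   # ~0.2051

# Cumulative form, analogous to cumulative = TRUE: P(X <= 6)
print(round(binom.cdf(6, 10, 0.5), 4))         # ~0.8281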
GEOMETRIC DISTRIBUTION
Consider a sequence of Bernoulli trials (failure and success). The geometric distribution is used to find the number of failures before the first success. For a geometric distribution with probability of success p, the probability that exactly x failures occur before the first success is
P(X = x) = (1 − p)^x · p
The geometric distribution is the only discrete distribution with the memoryless property. The successive probabilities in this distribution form a geometric series, hence the name of the distribution.
Mean, μ = E(X) = 1/p
Variance, σ² = Var(X) = (1 − p) / p²
In Excel, we use the function NEGBINOMDIST(number_f, number_s, probability_s), where number_f is the number of failures, number_s is the threshold number of successes (1 for the geometric case), and probability_s is the probability of a success.
For example, a die is rolled until a 1 shows up. Using the function with p = 1/6, the resulting geometric distribution can be tabulated or plotted.
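As a short Python sketch (assuming SciPy), the same failures-before-first-success probabilities can be obtained from scipy.stats.nbinom with the number of successes fixed at 1, matching NEGBINOMDIST's parameterisation:

from scipy.stats import nbinom

p = 1 / 6   # probability of rolling a 1 on a fair die
# P(exactly x failures before the first success), x = 0, 1, 2, ...
for x in range(5):
    print(x, round(nbinom.pmf(x, 1, p), 4))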
HYPERGEOMETRIC DISTRIBUTION
In statistics, a distribution where selections are made from two groups without replacing members of the groups is known as the hypergeometric distribution; it is the probability distribution of a hypergeometric random variable. If N is the population size, k is the number of successes in the population, and n is the number of items drawn without replacement, then
Mean = (n × k) / N
Variance = n × k × (N − k) × (N − n) / (N² × (N − 1))
For example, there are 52 cards in a deck. Find the probability of getting exactly 1 red card when two cards are chosen without replacement.
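A short Python sketch of this card example, assuming SciPy is available (scipy.stats.hypergeom takes the population size, the number of "success" items in the population, and the number drawn):

from scipy.stats import hypergeom

M = 52   # total cards in the deck
k = 26   # red cards ("successes") in the deck
n = 2    # cards drawn without replacement

# P(exactly 1 red card among the 2 drawn)
print(round(hypergeom.pmf(1, M, k, n), 4))   # ~0.5098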
POISSON DISTRIBUTION
The Poisson distribution gives the probability of observing exactly x events in a fixed interval when events occur independently at a constant mean rate μ:
P(X = x) = (μ^x · e^(−μ)) / x!
For example, the mean number of calls to a fire station on a weekday is 8. What is the probability that there would be 11 calls on a given weekday?
P = (8^11 · e^(−8)) / 11! = 0.072
In Excel, the function POISSON(x, mean, cumulative) is used, where cumulative is TRUE if we want the probability that the number of random events will be between 0 and x inclusive, and FALSE if we want the probability that the number of events will be exactly x.
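The same calculation in Python, assuming SciPy is available; poisson.pmf plays the role of cumulative = FALSE and poisson.cdf the role of cumulative = TRUE:

from scipy.stats import poisson

# Mean number of calls on a weekday is 8; P(exactly 11 calls)?
print(round(poisson.pmf(11, mu=8), 3))   # ~0.072

# Cumulative form: P(0 to 11 calls inclusive)
print(round(poisson.cdf(11, mu=8), 3))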
EXPONENTIAL DISTRIBUTION
The exponential distribution is a continuous memoryless distribution that describes the time between events in a Poisson process; it is the continuous analogue of the geometric distribution. The exponential distribution models the time between successive events over a continuous time interval, whereas the Poisson distribution deals with the number of events that happen over a fixed period of time. The exponential distribution is often used for testing product reliability, which deals with the amount of time a product lasts. Other examples are the amount of time (beginning now) until an earthquake occurs, the length (in minutes) of long-distance business telephone calls, and the amount of time (in months) a car battery lasts.
The underlying Poisson process assumes that:
The events occur in disjoint (non-overlapping) intervals.
Two or more events cannot occur simultaneously.
Events occur at a constant rate.
A continuous random variable X is said to have an exponential distribution with parameter λ > 0 if its PDF is given by
fX(x) = λ · e^(−λx) for x > 0, and 0 for x < 0
where e is the natural number, λ is the rate parameter (1/λ is the mean time between events), and x is the random variable (for example, a waiting time).
The formula for the cumulative distribution function of the exponential distribution is:
F(x) = 1 − e^(−λx), where x ≥ 0 and λ > 0
Mean = 1/λ
Variance = 1/λ²
Suppose you are testing new software, and a bug causes errors randomly at a constant
rate of three times per hour. The probability that the first bug will occur within the first ten
minutes is?
Let the constant rate (intensity) be λ = 3 per hour and t = 1/6 hour (10 minutes).
P(X < 1/6) = ∫_0^(1/6) λ e^(−λt) dt = 1 − e^(−3 · 1/6) = 0.393
The probability that the first bug will occur in the next 10 minutes is 0.393.
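A small Python sketch of this bug-arrival example (assuming SciPy); note that scipy.stats.expon is parameterised by scale = 1/λ:

import math
from scipy.stats import expon

lam = 3      # bugs per hour
t = 1 / 6    # 10 minutes expressed in hours

# P(first bug within 10 minutes)
print(round(expon.cdf(t, scale=1 / lam), 3))   # ~0.393

# Same result from the closed form 1 - e^(-lambda * t)
print(round(1 - math.exp(-lam * t), 3))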
SAMPLING
Sampling is particularly useful with data sets that are too large to efficiently analyze. For example, in big data analytics applications or surveys, identifying and analyzing a representative sample is more efficient and cost-effective than surveying the entire population.
The size of the required data sample and the possibility of introducing a sampling error are important considerations. A small sample may sometimes reveal the most important information about a data set, while a larger sample increases the likelihood of accurately representing the data as a whole, even though the increased size may impede ease of manipulation and interpretation.
Probability sampling uses randomization to select sample members from the data set, to ensure that there is no correlation between points.
Non-probability sampling uses non-random techniques (i.e., the judgment of the researcher). It can be difficult to calculate the odds of any particular item, person or thing being included in the sample.
Simple random sampling: Data is selected at random from the whole population, usually using software.
Stratified sampling: Subsets of the data set or population are grouped based on a common factor, and samples are randomly collected from each subgroup.
Cluster sampling: The larger data set is divided into subsets (clusters) based on a specific factor, then a random sample of clusters is analyzed.
Consecutive sampling: Data is collected until the predetermined sample size is met.
Quota sampling: A selection ensuring equal representation within the sample for all subgroups in the data set or population.
Errors happen when you take a sample from the population rather than using the entire population. Sampling error is the difference between the statistic you measure on the sample and the parameter you would find for the entire population.
If you were to survey the entire population, there would be no error. Sampling error can be reduced but not eliminated; it is accepted as a tradeoff for not measuring the entire population. As the sample size gets larger, the margin of error becomes smaller. There is a notable exception: with cluster sampling, a larger sample may increase the error because of the similarities between cluster members.
RANDOM SAMPLING
Random sampling is a technique where each item in the population has an equal chance and likelihood of being selected in the sample. The selection of items depends completely on chance, hence the name "method of chances". The main attribute is that every sample has the same probability of being chosen.
1. A list of all the members of the population is prepared initially and then each
member is numbered starting from 1.
2. From this population, random samples are chosen in two ways:
Random number tables
Random number generator software (preferred more as the sample numbers
can be generated randomly without human interference)
To minimize any biases in the process of simple random sampling, there are two
approaches:
Method of lottery
One of the oldest methods, and a mechanical example of random sampling. Here, each member of the population is numbered systematically. Each number is written on a separate piece of paper, the papers are mixed in a box, and numbers are then drawn out of the box in a random manner.
Advantages of random sampling
No restriction on the sample size, even when the population size is large.
The quality of the data collected through this sampling method depends on the size of the sample, i.e., the larger the sample, the better the quality of the data.
Disadvantages of random sampling
Costly method of sampling as it needs the complete list of all potential data to be
available beforehand.
Not suitable for face-to-face interviews in larger geographical areas due to cost
and time constraints.
The larger population means a larger sample frame which is difficult to manage.
In Excel, dragging the fill handle copies the random-number function into the selected cells.
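As a Python alternative to the spreadsheet approach, the standard-library random module can draw a simple random sample from a numbered population (the population size and sample size below are made up for illustration):

import random

# Hypothetical numbered population of 500 members
population = list(range(1, 501))

random.seed(42)                          # fixed seed for a reproducible illustration
sample = random.sample(population, 20)   # 20 members, drawn without repeats
print(sample)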
SYSTEMATIC SAMPLING
For example, a local NGO seeking to form a systematic sample of 500 volunteers from a population of 5000 can select every 10th person (5000/500 = 10) in the population; 10 is the sampling interval.
Systematic sampling consumes the least time, as it only requires selecting the sample size and identifying a starting point, and then continuing at regular intervals to form the sample.
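A minimal Python sketch of the NGO example, where every 10th member is taken after a random starting point:

import random

population = list(range(1, 5001))           # 5000 volunteers, numbered 1..5000
sample_size = 500
interval = len(population) // sample_size   # sampling interval = 10

start = random.randint(0, interval - 1)     # random starting point within the first interval
systematic_sample = population[start::interval]
print(len(systematic_sample), systematic_sample[:5])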
Advantages:
Simple and convenient to create, conduct, and analyze samples by the
researchers.
Beneficial in case of diverse population because of the even distribution of
sample.
Disadvantages:
Difficult when the population size cannot be estimated.
Data becomes skewed if sample is taken from a group which already has a
pattern.
Sometimes a hidden arrangement (order/pattern) in the population may not be obvious or visible, resulting in sampling bias.
STRATIFIED SAMPLING
A probability sampling technique in which the researcher divides the entire population into different subgroups (strata), then randomly selects the final items proportionally from the different strata.
In proportionate stratified sampling, the sample size of each stratum is proportionate to its population size when viewed against the entire population, i.e., each stratum has the same sampling fraction.
For example, we have 3 strata with 100, 200 and 300 population sizes respectively. Let the sampling fraction be 1/2. Then we should randomly sample 50, 100 and 150 subjects from each stratum respectively.
Stratum A B C
Population Size 100 200 300
Sampling Fraction ½ ½ ½
Final Sample Size 50 100 150
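A rough Python sketch of proportionate stratified sampling with the same strata sizes and a sampling fraction of 1/2, assuming pandas is available (the "stratum" and "value" columns are made up for illustration):

import pandas as pd

# Hypothetical population with strata A (100 rows), B (200 rows), C (300 rows)
df = pd.DataFrame({
    "stratum": ["A"] * 100 + ["B"] * 200 + ["C"] * 300,
    "value": range(600),
})

# Proportionate stratified sample: the same fraction (1/2) from every stratum
sample = df.groupby("stratum", group_keys=False).sample(frac=0.5, random_state=1)
print(sample["stratum"].value_counts())   # A: 50, B: 100, C: 150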
Disproportionate Stratified Random Sampling
With disproportionate sampling, the different strata have different sampling fractions.
Advantages:
Highlighting a specific subgroup within the population.
Disadvantages:
Suppose for a survey we need 50 students who are either juniors or seniors in a high school. Based on gender, the breakdown is: juniors, 126 and 94; seniors, 77 and 85. The number of senior girls to include in the 50-person sample is then simply the proportion of senior girls in the school multiplied by 50.
CLUSTERED SAMPLING
A sampling technique which divides the main population into various clusters which
consist of multiple sample parameters like demographics, habits, background or any
other attribute. Cluster sampling allows the researchers to collect data by bifurcating
the data into small, more effective groups instead of selecting the entire population of
data.
There are two ways to classify cluster sampling. The first way is based on the number of
stages followed to obtain the cluster sample and the second way is the representation
of the groups in the entire cluster.
Cluster sampling can be classified as single-stage, two-stage, and multiple stages (in
most of the cases).
Single Stage Cluster Sampling: Here, sampling will be done just once. For example, An
NGO wants to create a sample of girls across 5 neighboring towns to provide
education. To form a sample, The NGO can randomly select towns (clusters). Then
they can extend help to the girls deprived of education in those towns directly.
Two-Stage Cluster Sampling: A sample created using two-stages is always better than
using a single stage because more filtered elements can be selected which can lead
to improved results from the sample. In two-stage cluster sampling, by
implementing systematic or simple random sampling only a handful of members are
selected from each cluster instead of selecting all the elements of a cluster. For
example, a business owner wants to explore the statistical performance of her plants, which are spread across various parts of the state. Based on the number of plants, the number of employees per plant and the work done at each plant, single-stage sampling would be time-consuming and costly. The owner therefore forms clusters of employees belonging to the different plants and then divides them by the size or operating status of the plant.
Advantages:
Sampling of groups divided geographically requires less work, time and cost.
Ability to choose larger samples which will increase accessibility to various
clusters.
Due to the large samples in each cluster, loss of accuracy in the information per individual can be compensated for.
Since cluster sampling facilitates information from various areas and groups, it
can be easily implemented in practical situations in comparison to other
sampling methods.
Disadvantages:
Even though both strata and clusters are non-overlapping subsets of the population, they differ in several ways.
All strata are represented in the sample, but only a subset of clusters is in the sample.
With stratified sampling, the best survey results occur when elements within strata are internally homogeneous, whereas with cluster sampling, the best survey results occur when elements within clusters are internally heterogeneous.
QUOTA SAMPLING
A non-probability sampling technique where the assembled sample has the same proportions of individuals as the entire population with respect to known characteristics. The main reason for choosing quota samples is that it allows the researchers to sample a subgroup that is of great interest to the study. In a study that considers gender, socioeconomic status and religion as the basis of the subgroups, the final sample may have a skewed representation of age, race, educational attainment, marital status and so on.
Quota sampling can be classified as: controlled and uncontrolled.
Controlled quota sampling involves certain restrictions in order to limit sample choice of
researcher.
Uncontrolled quota sampling, on the other hand, does not have any restrictions.
The researcher then evaluates the proportion in which the subgroups exist. This
proportion has to be maintained within the sample selected using this sampling
method. If 58% of the people who are interested in purchasing Bluetooth
headphones are between the age group of 25-35 years, your subgroups also
should have the same percentages of people.
Selecting the sample size while maintaining the proportion evaluated in the
previous step. If the population size is 500, the researcher can select a sample
size of 50 elements.
Advantages:
Time effective, since the primary data collection can be done in a shorter time.
Cost-effective.
Does not depend on the presence of sampling frames.
CONVENIENCE SAMPLING
In pilot studies, the researcher prefers a convenience sample because it allows him to obtain basic data and trends regarding the study and to avoid the complications of a randomized sample. This sampling technique is useful for documenting that a particular quality or phenomenon occurs within a given sample.
When using convenience sampling, it is necessary to describe how the sample differs from an ideal random sample. A description of the individuals who might be left out during the selection process, or of the individuals who are overrepresented in the sample, might also be necessary.
In business studies, this method is used to collect initial primary data regarding specific issues such as the perception of the image of a particular brand or the opinions of prospective customers about a new design of a product.
Advantages:
Disadvantages:
Risky due to selection bias and influences beyond the control of the researcher. The best way of reducing this bias is to use convenience sampling along with probability sampling, since probability sampling brings a measurement parameter with it that keeps the bias in check.
High level of sampling error: since the samples are selected conveniently, they do not necessarily reflect the true attributes or characteristics of the target population. In addition, the investigator's own selection biases can tamper with the results, leading to a higher level of sampling error.
SNOWBALL SAMPLING
Snowball sampling is a non-probability technique in which existing subjects recruit further subjects from among their acquaintances. It is often used when the population is somehow marginalized, like homeless or formerly incarcerated individuals or those who are involved in illegal activities. It is also common to use this technique with people whose membership in a particular group is difficult to find or not widely known, such as closeted gay, bisexual or transgender individuals, homeless people, drug addicts, members of an elite golf club, etc. Some people may not want to be found. For example, if a study requires investigating cheating on exams, shoplifting, drug use, or any other "unacceptable" social behaviour, participants may be reluctant to come forward because of possible ramifications. However, existing study participants are likely to know other people in the same situation as themselves and can inform them about the benefits of the study and reassure them of confidentiality.
1. Identify potential subjects from the sampling frame in the population. Usually only one or two subjects can be found at this stage.
2. Ask those subjects to recruit other people, and then ask those people to recruit in turn, until there are no more subjects left or the sample size becomes unmanageable.
Advantages:
This process allows the researcher to reach populations that are difficult to
sample compared to other sampling methods.
The process is cheap, simple and cost-efficient.
This technique needs a smaller workforce, as the subjects themselves are directly involved in recruiting, compared to other sampling techniques.
Disadvantages:
Oversampling a particular network of peers may lead to biasing.
There is no guarantee about the representation of samples. It is not possible to
determine the actual pattern of distribution of population.
Determining the sampling error and making statistical inferences from the sample to the population is not possible, due to the absence of random selection of samples.
BIAS IN SAMPLING
A sampling method is called biased if the survey sample does not accurately represent
the population. Sampling bias is sometimes called ascertainment bias or systematic bias, and it refers both to the sample and to the method of sampling. Bias can be either intentional or not; sometimes even a poor measurement process can lead to bias. When measuring a nonlinear functional of the probabilities from a limited number of experimental samples, a bias may occur even when these samples are picked randomly from the underlying population, so that there is no sampling bias; this is an estimation bias rather than a sampling bias.
Example:
Telephone sampling is common in marketing surveys. In a survey, a random sample is
chosen from the sampling frame having a list of telephone numbers of people in the
particular area. This method does involve taking a simple random sample, but it will miss:
1. People who do not have a phone;
2. People who only have a cell phone with an area code not in the region being surveyed;
3. People who do not wish to be surveyed, including those who monitor calls on an answering machine;
4. People who do not answer calls from telephone surveyors.
The survey thus systematically excludes certain types of consumers in the area.
Here are a few sources of sampling bias:
Convenience samples:
The statistics professor David A. Freedman stated, "Statistical inference with convenience samples is a risky business." In cases where it may not be possible or practical to choose a random sample, a convenience sample might be used. A convenience sample is sometimes treated as a random sample, but it is often biased. An undercoverage problem arises with convenience samples: undercoverage occurs when some members of the population are not adequately represented in the sample. In the telephone example above, people who do not have a phone are undercovered.
Extrapolation:
Drawing a conclusion about something beyond the range of the data is called extrapolation. If a biased sample systematically excludes certain parts of the population under consideration, the inferences only apply to the subpopulation that has actually been sampled. Extrapolation error also occurs if, for example, an inference based on a sample of senior citizens is applied to adults in general.
Self-Selection Bias
A self-selection bias results when a non-random component occurs after the potential subject has enlisted in the experiment. Consider the hypothetical experiment in which subjects were asked about the details of their sex lives, and assume that the subjects did not know what the experiment was about until they showed up. Many of the subjects would leave the experiment, resulting in a biased sample. Many television or website polls are likewise prone to self-selection bias.
SAMPLING DISTRIBUTION
A sampling distribution is the probability distribution of a statistic (such as the mean) computed from repeated samples of the same size drawn from a population. The population is assumed to be normally distributed; if the sample size is large enough, the sampling distribution will also be approximately normal, and it is determined by its mean and standard deviation.
For example, a random sample of 20 people is selected from the population of women in Hyderabad between the ages of 22 and 35 years, and the mean height of the sample is computed. It might be lesser or greater than, but will not exactly equal, the population mean. The most common measure of how much sample means differ from each other is the standard deviation of the sampling distribution of the mean (the standard error of the mean). The standard error of the mean would be small if all the sample means were very close to the population mean, and large if the sample means varied considerably.
The sampling distribution is often approximated by a normal distribution, as implied by the central limit theorem; this holds when the sample size is sufficiently large.
The mean of the sampling distribution is in fact equal to the mean of the population. However, the standard deviation of the sampling distribution differs from that of the population.
If the population is large enough, this is given by σx = σ / sqrt(n).
In general, the standard error of the sampling distribution is σx = [ σ / sqrt(n) ] * sqrt[ ( N - n ) / ( N - 1 ) ].
In the standard error formula, the factor sqrt[ ( N - n ) / ( N - 1 ) ] is called the finite population correction, or fpc. When the population size is very large relative to the sample size, fpc ≈ 1, and the standard error formula can be approximated to
σx = σ / sqrt(n)
It is safe to use this approximate formula when the sample size is no bigger than 1/20 of the population size.
For the sampling distribution of a proportion, the mean is equal to the probability of success in the population (P), and the standard deviation is determined by the standard deviation of the population (σ), the population size, and the sample size.
The standard error of the sampling distribution is σp = [ σ / sqrt(n) ] * sqrt[ ( N - n ) / ( N - 1 ) ] = sqrt[ P * Q / n ] * sqrt[ ( N - n ) / ( N - 1 ) ], where Q = 1 - P.
When the population size is very large relative to the sample size, fpc ≈ 1, so the standard error formula can be approximated to σp = sqrt[ P * Q / n ].
It is safe to use this approximate formula when the sample size is no bigger than 1/20 of the population size.
To make it easier:
Use the normal distribution if the population standard deviation is known or if the sample size is large.
Use the t-distribution if the population standard deviation is unknown or if the sample size is small.
CENTRAL LIMIT THEOREM
Statement: Given a sufficiently large sample size selected from a population with finite variance, the mean of all samples from the same population will be approximately equal to the mean of the population, and the sample means will form an approximately normal distribution. When we draw repeated samples from a given population, the shape of the distribution of the means converges to the normal distribution irrespective of the shape of the population distribution. As the sample size increases, the sampling distribution of the mean can be approximated by a normal distribution with mean µ and standard deviation σ/√n.
Example
The resulting frequency distributions, each based on 500 sample means, are shown. For n = 4, 4 scores were sampled from a uniform distribution 500 times and the mean was computed each time; similarly with means of 7 scores for n = 7 and 10 scores for n = 10. As n increases, the spread of the distributions decreases and the distributions become concentrated around the mean.
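A small Python simulation in the spirit of this example (assuming NumPy): it draws 500 sample means from a uniform distribution for n = 4, 7 and 10 and shows the spread of the means shrinking as n grows:

import numpy as np

rng = np.random.default_rng(0)

for n in (4, 7, 10):
    # 500 sample means, each computed from n scores drawn from a uniform distribution
    means = rng.uniform(0, 1, size=(500, n)).mean(axis=1)
    print(n, round(means.mean(), 3), round(means.std(ddof=1), 3))
# The standard deviation of the means shrinks roughly as 1/sqrt(n)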
For the central limit theorem to apply, the values must be drawn independently from the same distribution, which must have finite mean and variance, and they should not be correlated. The rate of convergence depends on the skewness of the distribution: sums from an exponential distribution converge for smaller sample sizes, while sums from a lognormal distribution require larger sample sizes.
A normal distribution can be standardized using z = (X − μ) / σ, where X is a score, μ the mean, and σ the standard deviation of the original normal distribution. The standard normal distribution is often known as the z distribution. A z score tells us how far a score is from the mean in terms of the number of standard deviations. Percentile rank is the area (probability) to the left of the value, obtained by adding the percentages from the left of the curve; the mean is the 50th percentile (50%).
68% of the observations fall between −1σ and 1σ (within 1 standard deviation of the mean),
95% fall between −2σ and 2σ (within 2 standard deviations of the mean), and
99.7% fall between −3σ and 3σ (within 3 standard deviations of the mean).
Z-SCORES
A z-score/ standard score indicates how many standard deviations an element is far
from the mean. A z-score can be calculated using the formula:
z = (X - μ) / σ
Interpretation of z-scores: a z-score of 0 means the element equals the mean; a positive z-score means it lies above the mean; a negative z-score means it lies below the mean. A z-score table (found online and saved for later calculations) converts z-scores into areas under the standard normal curve.
Example 1: Find the probability of IQ values between 75 and 130, assuming a normal distribution with mean = 100 and standard deviation = 15.
An IQ of 75 corresponds to a z score of −1.67 ((75 − 100)/15) and an IQ of 130 corresponds to a z score of 2.00. The table value for −1.67 is .4525; for 2.00 it is .4772. The probability of an IQ between 75 and 130 is therefore .4525 + .4772 = .9297.
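The same probability can be obtained in Python without a printed z table, assuming SciPy is available:

from scipy.stats import norm

mu, sigma = 100, 15
z_low = (75 - mu) / sigma      # -1.67
z_high = (130 - mu) / sigma    #  2.00

# P(75 < IQ < 130) as the area between the two z-scores
print(round(norm.cdf(z_high) - norm.cdf(z_low), 4))   # ~0.93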
Example 2: If the mean and standard deviation of a normal distribution are known, the percentile rank of a person obtaining a specific score can be calculated. Suppose a test has a mean of 80 and a standard deviation of 5, and a person scores 70. Using the formula, z = (70 − 80)/5 = −2. From the table, only 2.3% of the population will be less than or equal to a score 2σ below the mean, so the percentile rank is 2.3%. The shaded area (2.3% of the total area) below 70 equals the proportion of the scores below 70.
In the same way, the percentile rank of a person receiving a score of 90 on the test is 97.7%. Since z = (90 − 80)/5 = 2, it can be determined from the table that a z score of 2 is equivalent to the 97.7th percentile: the proportion of people scoring below 90 is thus .977.
Example 3: What score on the Introductory Psychology test would it have taken to be in
the 75th percentile? (Mean is 80 and a standard deviation is 5)
We now have to find the score X, which can be done by reversing the steps followed in Example 2. First, determine the z score associated with a cumulative probability of 0.75 using a z table; the value of z is 0.674. Then
X = μ + zσ = 80 + (0.674)(5) = 83.37
CONFIDENCE INTERVALS
Suppose that a 90% confidence interval states that the population mean is greater than 100 and less than 200. How would you interpret this statement?
Some people think this means there is a 90% chance that the population mean falls
between 100 and 200. This is incorrect. Like any population parameter, the population
mean is a constant, not a random variable. It does not change. The probability that a
constant falls within any given range is always 0.00 or 1.00.
The confidence level describes the uncertainty associated with a sampling method.
Suppose we used the same sampling method to select different samples and to
compute a different interval estimate for each sample. Some interval estimates would
include the true population parameter and some would not. A 90% confidence level
means that we would expect 90% of the interval estimates to include the population
parameter; a 95% confidence level means that 95% of the intervals would include the
parameter; and so on.
A confidence interval consists of three parts:
Confidence level
Statistic
Margin of error
Given these inputs, the range of the confidence interval is defined by the sample statistic ± margin of error, and the uncertainty associated with the confidence interval is specified by the confidence level.
Often, the margin of error is not given; you must calculate it. To construct a confidence interval, work through the following four steps.
Identify a sample statistic. Choose the statistic (e.g, sample mean, sample
proportion) that you will use to estimate a population parameter.
Select a confidence level. As we noted in the previous section, the confidence
level describes the uncertainty of a sampling method. Often, researchers choose
90%, 95%, or 99% confidence levels; but any percentage can be used.
Find the margin of error. If you are working on a homework problem or a test question, the margin of error may be given. Often, however, you will need to compute the margin of error from the standard error and a critical value, as in the worked problems below.
Specify the confidence interval. The range of the confidence interval is defined by the sample statistic ± margin of error, and the uncertainty is denoted by the confidence level.
The sample problem in the next section applies the above four steps to construct a 95%
confidence interval for a mean score. The next few lessons discuss this topic in greater
detail.
Problem 1
Suppose we want to estimate the average weight of an adult male in Dekalb County,
Georgia. We draw a random sample of 1,000 men from a population of 1,000,000 men
and weigh them. We find that the average man in our sample weighs 180 pounds, and the standard deviation of the sample is 30 pounds. What is the 95% confidence interval?
Solution
The correct answer is (A). To specify the confidence interval, we work through the four
steps below.
Identify a sample statistic. Since we are trying to estimate the mean weight in the
population, we choose the mean weight in our sample (180) as the sample
statistic.
Select a confidence level. In this case, the confidence level is defined for us in
the problem. We are working with a 95% confidence level.
Find the margin of error. Previously, we described how to compute the margin of
error. The key steps are shown below.
Find standard error. The standard error (SE) of the mean is:
SE = s / sqrt( n )
Find critical value. The critical value is a factor used to compute the margin of error. To express the critical value as a t score (t*), follow these steps.
o Compute alpha (α): α = 1 - (confidence level / 100) = 1 - 0.95 = 0.05
o Find the critical probability (p*): p* = 1 - α/2 = 1 - 0.05/2 = 0.975
o Find the degrees of freedom: df = n - 1 = 1000 - 1 = 999
o The critical value is the t statistic having 999 degrees of freedom and a cumulative probability of 0.975, which is approximately 1.96.
Compute the margin of error (ME): ME = critical value × standard error = 1.96 × ( 30 / sqrt(1000) ) ≈ 1.86
Specify the confidence interval. The range of the confidence interval is defined by the sample statistic ± margin of error, and the uncertainty is denoted by the confidence level. Therefore, this 95% confidence interval is 180 ± 1.86 (about 178.1 to 181.9 pounds).
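A minimal Python sketch of the same four-step calculation using the summary statistics from the problem (assuming SciPy is available):

import math
from scipy import stats

n, xbar, s = 1000, 180, 30
se = s / math.sqrt(n)                    # standard error ~0.949

t_star = stats.t.ppf(0.975, df=n - 1)    # critical value for 95% confidence, ~1.96
margin = t_star * se                     # ~1.86

print(round(xbar - margin, 2), round(xbar + margin, 2))   # ~178.14 to 181.86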
This lesson describes how to construct a confidence interval for a sample proportion, p,
when the sample size is large.
Estimation Requirements
The approach described in this lesson is valid whenever the following conditions are met:
The sampling method is simple random sampling.
The sample includes at least 10 successes and 10 failures.
The population size is at least 20 times as big as the sample size.
Note the implications of the second condition. If the population proportion were close
to 0.5, the sample size required to produce at least 10 successes and at least 10 failures
would probably be close to 20. But if the population proportion were extreme (i.e.,
close to 0 or 1), a much larger sample would probably be needed to produce at least 10 successes and 10 failures.
For example, imagine that the probability of success were 0.1, and the sample were
selected using simple random sampling. In this situation, a sample size close to 100
might be needed to get 10 successes.
Suppose k possible samples of size n can be selected from the population. The
standard deviation of the sampling distribution is the "average" deviation
between the k sample proportions and the true population proportion, P. The
standard deviation of the sample proportion σp is:
σp = sqrt[ P * ( 1 - P ) / n ] * sqrt[ ( N - n ) / ( N - 1 ) ]
where P is the population proportion, n is the sample size, and N is the population
size. When the population size is much larger (at least 20 times larger) than the
sample size, the standard deviation can be approximated by:
σp = sqrt[ P * ( 1 - P ) / n ]
When the true population proportion P is not known, the standard deviation of the sampling distribution cannot be calculated. Under these circumstances, use the standard error. The standard error (SE) can be calculated from the equation below:
SEp = sqrt[ p * ( 1 - p ) / n ] * sqrt[ ( N - n ) / ( N - 1 ) ]
where p is the sample proportion, n is the sample size, and N is the population size. When the population size is at least 20 times larger than the sample size, the standard error can be approximated by:
SEp = sqrt[ p * ( 1 - p ) / n ]
Alert
The Advanced Placement Statistics Examination only covers the "approximate" formulas
for the standard deviation and standard error.
σp = sqrt[ P * ( 1 - P ) / n ]
SEp = sqrt[ p * ( 1 - p ) / n ]
Identify a sample statistic. In this case, the sample statistic is the sample
proportion. We use the sample proportion to estimate the population proportion.
Select a confidence level. The confidence level describes the uncertainty of a
sampling method. Often, researchers choose 90%, 95%, or 99% confidence
levels; but any percentage can be used.
Find the margin of error. Previously, we showed how to compute the margin of
error.
Specify the confidence interval. The range of the confidence interval is defined by the sample statistic ± margin of error, and the uncertainty is denoted by the confidence level.
In the next section, we work through a problem that shows how to use this approach to
construct a confidence interval for a proportion.
Problem 1
A newspaper selected a simple random sample of 1,600 readers from their list of 100,000 subscribers. They asked whether the paper should increase its
coverage of local news. Forty percent of the sample wanted more local news. What is
the 99% confidence interval for the proportion of readers who would like more
coverage of local news?
Solution
The answer is (D). The approach that we used to solve this problem is valid when the
following conditions are met.
The sampling method must be simple random sampling. This condition is satisfied;
the problem statement says that we used simple random sampling.
The sample should include at least 10 successes and 10 failures. Suppose we
classify a "more local news" response as a success, and any other response as a
failure. Then, we have 0.40 * 1600 = 640 successes, and 0.60 * 1600 = 960 failures -
plenty of successes and failures.
If the population size is much larger than the sample size, we can use an
"approximate" formula for the standard deviation or the standard error. This
condition is satisfied, so we will use one of the simpler "approximate" formulas.
Since the above requirements are satisfied, we can use the following four-step approach to construct a confidence interval.
Identify a sample statistic. We use the sample proportion (0.40) to estimate the population proportion.
Select a confidence level. In this case, the confidence level is 99%.
Find the margin of error.
Find standard error. Because the true population proportion is unknown, we cannot compute the standard deviation of the sampling distribution; instead, we compute the standard error. And since the population is more than 20 times larger than the sample, we can use the following formula to compute the standard error (SE) of the proportion:
SE = sqrt [ p(1 - p) / n ] = sqrt [ (0.4)(0.6) / 1600 ] = 0.012
Find critical value. The critical value is a factor used to compute the
margin of error. Because the sampling distribution is approximately normal
and the sample size is large, we can express the critical value as a z-
score by following these steps.
o Compute alpha (α): α = 1 - (99/100) = 0.01
o Find the critical probability (p*): p* = 1 - α/2 = 1 - 0.01/2 = 0.995
o The degrees of freedom are df = n - 1 = 1600 - 1 = 1599; with a sample this large, the critical value is the z-score having a cumulative probability of 0.995, which is 2.58.
o Compute the margin of error (ME): ME = critical value × standard error = 2.58 × 0.012 ≈ 0.03
Specify the confidence interval. The range of the confidence interval is defined by the sample statistic ± margin of error, and the uncertainty is denoted by the confidence level.
Therefore, the 99% confidence interval is 0.37 to 0.43. That is, the 99% confidence interval is the range defined by 0.4 ± 0.03.
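The corresponding Python sketch for this proportion interval (assuming SciPy is available):

import math
from scipy import stats

n, p = 1600, 0.40
se = math.sqrt(p * (1 - p) / n)      # ~0.0122

z_star = stats.norm.ppf(0.995)       # 99% CI -> alpha/2 = 0.005, z* ~2.58
margin = z_star * se                 # ~0.03

print(round(p - margin, 3), round(p + margin, 3))   # ~0.368 to 0.432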
CHI-SQUARE TEST FOR INDEPENDENCE
This lesson explains how to conduct a chi-square test for independence. The test is applied when you have two categorical variables from a single population. It is used to determine whether there is a significant association between the two variables.
The test procedure described in this lesson is appropriate when the following conditions are met:
The sampling method is simple random sampling.
Both variables under study are categorical.
The expected frequency count is at least 5 in each cell of the contingency table.
This approach consists of four steps: (1) state the hypotheses, (2) formulate an analysis
plan, (3) analyze sample data, and (4) interpret results.
Suppose that Variable A has r levels, and Variable B has c levels. The null
hypothesis states that knowing the level of Variable A does not help you predict the
level of Variable B. That is, the variables are independent.
The alternative hypothesis is that knowing the level of Variable A can help you predict
the level of Variable B.
Note: Support for the alternative hypothesis suggests that the variables are related; but
the relationship is not necessarily causal, in the sense that one variable "causes" the
other.
The analysis plan describes how to use sample data to accept or reject the null hypothesis. The plan should specify the following elements.
Significance level. Often, researchers choose a significance level of 0.01, 0.05, or 0.10.
Test method. Use the chi-square test for independence to determine whether there is a significant association between the two variables.
Using sample data, find the degrees of freedom, expected frequencies, test statistic,
and the P-value associated with the test statistic. The approach described in this
section is illustrated in the sample problem at the end of this lesson.
Degrees of freedom. DF = (r - 1) * (c - 1)
where r is the number of levels for one categorical variable, and c is the number of levels for the other categorical variable.
Expected frequencies. The expected frequency count for each cell is
Er,c = (nr * nc) / n
where Er,c is the expected frequency count for level r of Variable A and level c of Variable B, nr is the total number of sample observations at level r of Variable A, nc is the total number of sample observations at level c of Variable B, and n is the total sample size.
Test statistic. The test statistic is a chi-square random variable (Χ2) defined by the following equation:
Χ2 = Σ [ (Or,c - Er,c)2 / Er,c ]
where Or,c is the observed frequency count at level r of Variable A and level c of Variable B, and Er,c is the expected frequency count at level r of Variable A and level c of Variable B.
Interpret Results
If the sample findings are unlikely, given the null hypothesis, the researcher rejects the
null hypothesis. Typically, this involves comparing the P-value to the significance level,
and rejecting the null hypothesis when the P-value is less than the significance level.
Problem
A public opinion poll surveyed a simple random sample of 1000 voters. Respondents
were classified by gender (male or female) and by voting preference (Republican,
Democrat, or Independent). Results are shown in the contingency table below.
Is there a gender gap? Do the men's voting preferences differ significantly from the
women's preferences? Use a 0.05 level of significance.
Solution
The solution to this problem takes four steps: (1) state the hypotheses, (2) formulate an
analysis plan, (3) analyze sample data, and (4) interpret results. We work through those
steps below:
State the hypotheses. The first step is to state the null hypothesis and an alternative hypothesis.
H0: Gender and voting preference are independent.
Ha: Gender and voting preference are not independent.
Formulate an analysis plan. For this analysis, the significance level is 0.05. Using
sample data, we will conduct a chi-square test for independence.
Analyze sample data. Applying the chi-square test for independence to sample
data, we compute the degrees of freedom, the expected frequency counts,
and the chi-square test statistic. Based on the chi-square statistic and
the degrees of freedom, we determine the P-value.
DF = (r - 1) * (c - 1) = (2 - 1) * (3 - 1) = 2
We use the Chi-Square Distribution Calculator to find P(Χ2 > 16.2) = 0.0003.
Interpret results. Since the P-value (0.0003) is less than the significance level (0.05), we reject the null hypothesis. Thus, we conclude that there is a relationship between gender and voting preference.
Note: If you use this approach on an exam, you may also want to mention why this
approach is appropriate. Specifically, the approach is appropriate because the
sampling method was simple random sampling, the variables under study were
categorical, and the expected frequency count was at least 5 in each cell of the
contingency table.
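The contingency table itself is not reproduced above, so the counts in the Python sketch below are hypothetical, chosen only so that they yield the chi-square statistic of 16.2 and P-value of about 0.0003 quoted in the solution; scipy.stats.chi2_contingency is one way to run the test:

from scipy.stats import chi2_contingency

# Hypothetical 2x3 table: rows = gender, columns = Republican, Democrat, Independent
observed = [
    [200, 150, 50],   # male
    [250, 300, 50],   # female
]

chi2, p_value, dof, expected = chi2_contingency(observed)
print(round(chi2, 1), round(p_value, 4), dof)   # ~16.2, ~0.0003, 2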
Statistical Hypotheses
The best way to determine whether a statistical hypothesis is true would be to examine the entire population. Since that is often impractical, researchers typically examine a random sample from the population. If sample data are not consistent with the statistical hypothesis, the hypothesis is rejected. There are two types of statistical hypotheses:
Null hypothesis. The null hypothesis, denoted by Ho, is usually the hypothesis that
sample observations result purely from chance.
Alternative hypothesis. The alternative hypothesis, denoted by H1 or Ha, is the
hypothesis that sample observations are influenced by some non-random cause.
For example, suppose we wanted to determine whether a coin was fair and balanced.
A null hypothesis might be that half the flips would result in Heads and half, in Tails. The
alternative hypothesis might be that the number of Heads and Tails would be very
different. Symbolically, these hypotheses would be expressed as
Ho: P = 0.5
Ha: P ≠ 0.5
Suppose we flipped the coin 50 times, resulting in 40 Heads and 10 Tails. Given this result,
we would be inclined to reject the null hypothesis. We would conclude, based on the
evidence, that the coin was probably not fair and balanced.
Some researchers say that a hypothesis test can have one of two outcomes: you
accept the null hypothesis or you reject the null hypothesis. Many statisticians, however,
take issue with the notion of "accepting the null hypothesis." Instead, they say: you
reject the null hypothesis or you fail to reject the null hypothesis.
Why the distinction between "acceptance" and "failure to reject?" Acceptance implies
that the null hypothesis is true. Failure to reject implies that the data are not sufficiently
persuasive for us to prefer the alternative hypothesis over the null hypothesis.
Hypothesis Tests
Statisticians follow a formal process, consisting of four steps, to determine whether to reject a null hypothesis based on sample data.
State the hypotheses. This involves stating the null and alternative hypotheses. The hypotheses are stated in such a way that they are mutually exclusive. That is, if one is true, the other must be false.
Formulate an analysis plan. The analysis plan describes how to use sample data
to evaluate the null hypothesis. The evaluation often focuses around a single test
statistic.
Analyze sample data. Find the value of the test statistic (mean score, proportion,
t statistic, z-score, etc.) described in the analysis plan.
Interpret results. Apply the decision rule described in the analysis plan. If the
value of the test statistic is unlikely, based on the null hypothesis, reject the null
hypothesis.
Decision Errors
Type I error. A Type I error occurs when the researcher rejects a null hypothesis
when it is true. The probability of committing a Type I error is called
the significance level. This probability is also called alpha, and is often denoted
by α.
Type II error. A Type II error occurs when the researcher fails to reject a null
hypothesis that is false. The probability of committing a Type II error is called Beta,
and is often denoted by β. The probability of not committing a Type II error is
called the Power of the test.
Decision Rules
The analysis plan includes decision rules for rejecting the null hypothesis. In practice, statisticians describe these decision rules in two ways - with reference to a P-value or with reference to a region of acceptance.
P-value. The P-value measures the strength of the evidence against the null hypothesis; the null hypothesis is rejected when the P-value is less than the significance level.
Region of acceptance. The region of acceptance is a range of values of the test statistic; if the test statistic falls within the region of acceptance, the null hypothesis is not rejected. The set of values outside the region of acceptance is called the region of rejection. If the test statistic falls within the region of rejection, the null hypothesis is rejected. In such cases, we say that the hypothesis has been rejected at the α level of significance.
These approaches are equivalent. Some statistics texts use the P-value approach;
others use the region of acceptance approach. On this website, we tend to use the
region of acceptance approach.
A test of a statistical hypothesis, where the region of rejection is on only one side of
the sampling distribution, is called a one-tailed test. For example, suppose the null
hypothesis states that the mean is less than or equal to 10. The alternative hypothesis
would be that the mean is greater than 10. The region of rejection would consist of a
range of numbers located on the right side of sampling distribution; that is, a set of
numbers greater than 10.
A test of a statistical hypothesis, where the region of rejection is on both sides of the
sampling distribution, is called a two-tailed test. For example, suppose the null
hypothesis states that the mean is equal to 10. The alternative hypothesis would be that
the mean is less than 10 or greater than 10. The region of rejection would consist of a
range of numbers located on both sides of sampling distribution; that is, the region of
rejection would consist partly of numbers that were less than 10 and partly of numbers
that were greater than 10.
This lesson explains how to conduct a hypothesis test of a mean, when the following conditions are met:
The sampling method is simple random sampling.
The population is approximately normally distributed.
This approach consists of four steps: (1) state the hypotheses, (2) formulate an analysis
plan, (3) analyze sample data, and (4) interpret results.
Every hypothesis test requires the analyst to state a null hypothesis and an alternative
hypothesis. The hypotheses are stated in such a way that they are mutually exclusive.
That is, if one is true, the other must be false; and vice versa.
The table below shows three sets of hypotheses. Each makes a statement about how
the population mean μ is related to a specified value M. (In the table, the symbol ≠
means " not equal to ".)
The first set of hypotheses (Set 1) is an example of a two-tailed test, since an extreme
value on either side of the sampling distribution would cause a researcher to reject the
null hypothesis. The other two sets of hypotheses (Sets 2 and 3) are one-tailed tests,
since an extreme value on only one side of the sampling distribution would cause a
researcher to reject the null hypothesis.
The analysis plan describes how to use sample data to accept or reject the null
hypothesis. It should specify the following elements.
Significance level. Often, researchers choose significance levels of 0.05 or 0.01, as in the problems below.
Test method. Use the one-sample t-test to determine whether the hypothesized
mean differs significantly from the observed sample mean.
Using sample data, conduct a one-sample t-test. This involves finding the standard
error, degrees of freedom, test statistic, and the P-value associated with the test
statistic.
Standard error. Compute the standard error (SE) of the sampling distribution.
SE = s * sqrt{ ( 1/n ) * [ ( N - n ) / ( N - 1 ) ] }
where s is the standard deviation of the sample, N is the population size, and n is
the sample size. When the population size is much larger (at least 20 times larger)
than the sample size, the standard error can be approximated by:
SE = s / sqrt( n )
Degrees of freedom. The degrees of freedom (DF) is equal to the sample size (n)
minus one. Thus, DF = n - 1.
Test statistic. The test statistic is a t statistic (t) defined by the following equation.
t = (x - μ) / SE
where x is the sample mean, μ is the hypothesized population mean in the null
hypothesis, and SE is the standard error.
Interpret Results
If the sample findings are unlikely, given the null hypothesis, the researcher rejects the
null hypothesis. Typically, this involves comparing the P-value to the significance level,
and rejecting the null hypothesis when the P-value is less than the significance level.
In this section, two sample problems illustrate how to conduct a hypothesis test of a
mean score. The first problem involves a two-tailed test; the second problem, a one-
tailed test.
An inventor has developed a new, energy-efficient lawn mower engine. He claims that
the engine will run continuously for 5 hours (300 minutes) on a single gallon of regular
gasoline. From his stock of 2000 engines, the inventor selects a simple random sample of
50 engines for testing. The engines run for an average of 295 minutes, with a standard
deviation of 20 minutes. Test the null hypothesis that the mean run time is 300 minutes
against the alternative hypothesis that the mean run time is not 300 minutes. Use a 0.05
level of significance. (Assume that run times for the population of engines are normally
distributed.)
Solution: The solution to this problem takes four steps: (1) state the hypotheses, (2)
formulate an analysis plan, (3) analyze sample data, and (4) interpret results. We work
through those steps below:
State the hypotheses. The first step is to state the null hypothesis and an alternative hypothesis.
Null hypothesis: H0: μ = 300
Alternative hypothesis: Ha: μ ≠ 300
Note that these hypotheses constitute a two-tailed test. The null hypothesis will
be rejected if the sample mean is too big or if it is too small.
Formulate an analysis plan. For this analysis, the significance level is 0.05. The test method is a one-sample t-test.
Analyze sample data. Using sample data, we compute the standard error (SE), degrees of freedom (DF), and the t statistic (t).
SE = s / sqrt(n) = 20 / sqrt(50) = 2.83
DF = n - 1 = 50 - 1 = 49
t = (x - μ) / SE = (295 - 300) / 2.83 = -1.77
where s is the standard deviation of the sample, x is the sample mean, μ is the hypothesized population mean, and n is the sample size.
Since we have a two-tailed test, the P-value is the probability that the t statistic
having 49 degrees of freedom is less than -1.77 or greater than 1.77.
We use the t Distribution Calculator to find P(t < -1.77) = 0.04, and P(t > 1.77) =
0.04. Thus, the P-value = 0.04 + 0.04 = 0.08.
Interpret results. Since the P-value (0.08) is greater than the significance level
(0.05), we cannot reject the null hypothesis.
Note: If you use this approach on an exam, you may also want to mention why this
approach is appropriate. Specifically, the approach is appropriate because the
sampling method was simple random sampling, the population was normally
distributed, and the sample size was small relative to the population size (less than 5%).
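A short Python sketch of the same two-tailed test from the summary statistics (assuming SciPy is available):

import math
from scipy import stats

n, xbar, mu0, s = 50, 295, 300, 20
se = s / math.sqrt(n)          # ~2.83
t = (xbar - mu0) / se          # ~-1.77

# Two-tailed P-value; t is negative, so double the lower-tail probability
p_value = 2 * stats.t.cdf(t, df=n - 1)
print(round(t, 2), round(p_value, 2))   # -1.77, ~0.08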
Bon Air Elementary School has 1000 students. The principal of the school thinks that the
average IQ of students at Bon Air is at least 110. To prove her point, she administers an
IQ test to 20 randomly selected students. Among the sampled students, the average IQ
is 108 with a standard deviation of 10. Based on these results, should the principal
accept or reject her original hypothesis? Assume a significance level of 0.01. (Assume
that test scores in the population of students are normally distributed.)
Solution: The solution to this problem takes four steps: (1) state the hypotheses, (2)
formulate an analysis plan, (3) analyze sample data, and (4) interpret results. We work
through those steps below:
State the hypotheses. The first step is to state the null hypothesis and an alternative hypothesis.
Null hypothesis: H0: μ ≥ 110
Alternative hypothesis: Ha: μ < 110
Note that these hypotheses constitute a one-tailed test. The null hypothesis will
be rejected if the sample mean is too small.
Formulate an analysis plan. For this analysis, the significance level is 0.01. The test method is a one-sample t-test.
Analyze sample data. Using sample data, we compute the standard error (SE), degrees of freedom (DF), and the t statistic (t).
SE = s / sqrt(n) = 10 / sqrt(20) = 2.236
DF = n - 1 = 20 - 1 = 19
t = (x - μ) / SE = (108 - 110) / 2.236 = -0.894
where s is the standard deviation of the sample, x is the sample mean, μ is the hypothesized population mean, and n is the sample size.
Here is the logic of the analysis: Given the alternative hypothesis (μ < 110), we
want to know whether the observed sample mean is small enough to cause us to
reject the null hypothesis.
The observed sample mean produced a t statistic of -0.894. We use
the t Distribution Calculator to find P(t < -0.894) = 0.19. This means we would
expect to find a sample mean of 108 or smaller in 19 percent of our samples, if
the true population IQ were 110. Thus the P-value in this analysis is 0.19.
Interpret results. Since the P-value (0.19) is greater than the significance level
(0.01), we cannot reject the null hypothesis.
Note: If you use this approach on an exam, you may also want to mention why this
approach is appropriate. Specifically, the approach is appropriate because the
sampling method was simple random sampling, the population was normally
distributed, and the sample size was small relative to the population size (less than 5%).
ANALYSIS OF VARIANCE (ANOVA)
Analysis of variance (ANOVA) is an analysis tool used in statistics that splits the aggregate variability found inside a data set into two parts: systematic factors and random factors. The systematic factors have a statistical influence on the given data set, but the random factors do not. Analysts use the analysis of variance test to determine the influence that independent variables have on the dependent variable in a regression study.
The analysis of variance test is the initial step in analyzing factors that affect a given
data set. Once the analysis of variance test is finished, an analyst performs additional
testing on the methodical factors that measurably contribute to the data set's
inconsistency. The analyst utilizes the analysis of the variance test results in an f-test to
generate additional data that aligns with the proposed regression models.
The test allows comparison of more than two groups at the same time to determine whether a relationship exists between them. It analyzes multiple groups to determine the variability between and within samples. For example, a researcher might test
students from multiple colleges to see if students from one of the colleges consistently
outperform the others. Also, an R&D researcher might test two different processes of
creating a product to see if one process is better than the other in terms of cost
efficiency.
The type of ANOVA run depends on a number of factors. It is applied when the data are experimental. Analysis of variance can also be employed when there is no access to statistical software, in which case ANOVA is computed by hand. It is simple to use and best suited for small samples. With many experimental designs, the sample sizes have to be the same for the various factor-level combinations.
Analysis of variance is helpful for testing three or more variables. It is similar to multiple two-sample t-tests; however, it results in fewer Type I errors and is appropriate for a range of issues. ANOVA groups differences by comparing the means of each group,
and includes spreading out the variance into diverse sources. It is employed with
subjects, test groups, between groups and within groups.
Types of ANOVA
There are two types of analysis of variance: one-way (or unidirectional) and two-way.
One-way or two-way refers to the number of independent variables in your Analysis of
Variance test. A one-way ANOVA evaluates the impact of a sole factor on a sole
response variable. It determines whether all the samples are the same. The one-way
ANOVA is used to determine whether there are any statistically significant differences
between the means of three or more independent (unrelated) groups.
A two-way ANOVA is an extension of the one-way ANOVA. With a one-way ANOVA, you have one independent variable affecting a dependent variable. With a two-way ANOVA, there are two independent variables. For example, a two-way ANOVA allows a company to
compare worker productivity based on two independent variables, say salary and skill
set. It is utilized to observe the interaction between the two factors. It tests the effect of
two factors at the same time.
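Although these notes do not show code for ANOVA, the one-way test can be run in Python with scipy; a minimal sketch is below (the college scores are made-up illustrative data, not from the notes):
from scipy import stats

# Hypothetical exam scores from students of three colleges (made-up data)
college_a = [85, 86, 88, 75, 78, 94, 98, 79, 71, 80]
college_b = [91, 92, 93, 85, 87, 84, 82, 88, 95, 96]
college_c = [79, 78, 88, 94, 92, 85, 83, 85, 82, 81]

# One-way ANOVA: tests whether the three group means are equal
f_stat, p_value = stats.f_oneway(college_a, college_b, college_c)
print("F statistic:", round(f_stat, 3))
print("p-value:", round(p_value, 3))
# If the p-value is below the chosen significance level (say 0.05),
# we reject the null hypothesis that all group means are equal.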
F Distribution
The F distribution is the probability distribution associated with the f statistic. In this lesson,
we show how to compute an f statistic and how to find probabilities associated with
specific f statistic values.
The f Statistic
The f statistic, also known as an f value, is a random variable that has an F distribution.
(We discuss the F distribution in the next section.) The f statistic is the ratio of the two
scaled sample variances:
f = [ s1^2 / σ1^2 ] / [ s2^2 / σ2^2 ]
where s1 and s2 are the sample standard deviations and σ1 and σ2 are the population
standard deviations. Equivalently, in terms of the chi-square statistics Χ1^2 (with v1
degrees of freedom) and Χ2^2 (with v2 degrees of freedom):
f = [ Χ1^2 / v1 ] / [ Χ2^2 / v2 ] = [ Χ1^2 * v2 ] / [ Χ2^2 * v1 ]
The F Distribution
The curve of the F distribution depends on the degrees of freedom, v1 and v2. When
describing an F distribution, the number of degrees of freedom associated with the
standard deviation in the numerator of the f statistic is always stated first. Thus, f(5, 9)
would refer to an F distribution with v1 = 5 and v2 = 9 degrees of freedom; whereas f(9, 5)
would refer to an F distribution with v1 = 9 and v2 = 5 degrees of freedom. Note that the
curve represented by f(5, 9) would differ from the curve represented by f(9, 5).
Every f statistic can be associated with a unique cumulative probability. This cumulative
probability represents the likelihood that the f statistic is less than or equal to a specified
value.
Of course, to find the value of fα, we would need to know the degrees of
freedom, v1 and v2. Notationally, the degrees of freedom appear in parentheses as
follows: fα(v1, v2). Thus, f0.05(5, 7) refers to the value of the f statistic having a cumulative
probability of 0.95, v1 = 5 degrees of freedom, and v2 = 7 degrees of freedom.
The easiest way to find the value of a particular f statistic is to use the F Distribution
Calculator.
Problem 1
Suppose you randomly select 7 women from a population of women, and 12 men from
a population of men. The table below shows the standard deviation in each sample
and in each population. Compute the f statistic.
          Population standard deviation    Sample standard deviation
Women                 30                              35
Men                   50                              45
Solution A: The f statistic can be computed from the population and sample standard
deviations, using the following equation:
f = [ s12/σ12 ] / [ s22/σ22 ]
As you can see from the equation, there are actually two ways to compute an f statistic
from these data. If the women's data appears in the numerator, we can calculate an f
statistic as follows:
f = ( 35^2 / 30^2 ) / ( 45^2 / 50^2 ) = 1.361 / 0.81 = 1.68
For this calculation, the numerator degrees of freedom v1 are 7 - 1 or 6; and the
denominator degrees of freedom v2 are 12 - 1 or 11.
On the other hand, if the men's data appears in the numerator, we can calculate an f
statistic as follows:
f = ( 45^2 / 50^2 ) / ( 35^2 / 30^2 ) = 0.81 / 1.361 = 0.595
For this calculation, the numerator degrees of freedom v1 are 12 - 1 or 11; and the
denominator degrees of freedom v2 are 7 - 1 or 6.
When you are trying to find the cumulative probability associated with an f statistic, you
need to know v1 and v2. This point is illustrated in the next example.
Problem 2
Find the cumulative probability associated with each of the f statistics from Example 1,
above.
Solution: To solve this problem, we need to find the degrees of freedom for each
sample. Then, we will use the F Distribution Calculator to find the probabilities.
Therefore, when the women's data appear in the numerator, the numerator degrees of
freedom v1 is equal to 6; and the denominator degrees of freedom v2 is equal to 11.
And, based on the computations shown in the previous example, the f statistic is equal
to 1.68. We plug these values into the F Distribution Calculator and find that the
cumulative probability is 0.78.
On the other hand, when the men's data appear in the numerator, the numerator
degrees of freedom v1is equal to 11; and the denominator degrees of freedom v2 is
equal to 6. And, based on the computations shown in the previous example, the f
statistic is equal to 0.595. We plug these values into the F Distribution Calculator and find
that the cumulative probability is 0.22.
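Instead of the F Distribution Calculator, the same cumulative probabilities can be reproduced in Python with scipy.stats (a small sketch, using the f statistics and degrees of freedom computed above):
from scipy import stats

# Women's data in the numerator: f = 1.68 with v1 = 6, v2 = 11
print(round(stats.f.cdf(1.68, 6, 11), 2))   # approximately 0.78

# Men's data in the numerator: f = 0.595 with v1 = 11, v2 = 6
print(round(stats.f.cdf(0.595, 11, 6), 2))  # approximately 0.22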
Introduction
Python – Installation
In this tutorial you will understand the installation of Python on Windows and Ubuntu.
Before that, here are the official links describing how to set up Python:
For Linux and Unix systems: https://docs.python.org/3/using/unix.html
For Windows systems: https://docs.python.org/3/using/windows.html
Python comes pre-installed on most Linux distributions. You just need to run the
python3 command to start programming in the terminal.
If you don't have Python, get the source from
https://www.python.org/downloads/source/ and build it with:
./configure
make
make install
Python installers are available for 32-bit and 64-bit versions of Windows. Just
download the installer and install the software. Once the installation completes, open the
command prompt and run the python command to test it.
Python – Syntax
In this tutorial, you will learn the python syntax and an example about how to write a
basic but popular Hello World print program.
Note: In this tutorial, we are using Python version 3.5, so all the examples reflect the
results as executed in Python 3.5.
Example:
>>> x="Hello World"
>>> print(x)
Hello World
>>>
In the above Python terminal, x is a variable to which we have assigned the value Hello World
in double quotation marks, as it is a string value. Then we used the print function to print the
variable x, which is given in parentheses. Do remember that the variable is placed in
parentheses inside the print function.
Example:
>>> x=2
>>> print(x)
2
>>>
Variable:
A variable is a location in memory to store values. A variable may hold
different types of values like numbers, strings etc. In Python, there is no need to declare a
datatype for a variable; it is understood from the value that is assigned to the variable.
All identifiers must start with a letter or an underscore; they cannot start with a digit.
>>>x = 10 # x is an Integer
Data Types:
1. Numbers
2. String
3. List
4. Tuple
5. Dictionary
6. Boolean
Python – Numbers
1. Integer
2. Float
3. Complex
>>>x=10
>>>x
10
>>>type(x)
<class 'int'>
>>>y=10.1
>>>y
10.1
>>>type(y)
<class 'float'>
We can perform the various calculations with these numbers. Let us see few examples
below.
>>>5+5
10
>>>5*5
25
>>>5-4
1
Python – Strings
In this tutorial, we will work on the Python Strings where we can learn about the
manipulation of Strings, using String Operators and string methods and Functions. First,
let us understand that how do we declare the strings in python programming language.
We can declare and print the strings by placing them in single Quotes ('..'), Double
Quotes (".."), and using the print function too. Python Strings are Immutable (An Object
with a fixed value).
>>> 'hello world'
'hello world'
>>> 'let\'s start' # escape the apostrophe inside single quotes
"let's start"
>>> "let's start" # or enclose the string in double quotes
"let's start"
>>>print("let's start") # we have enclosed the string in double quotation inside the print function
let's start
Using three double quotes at the start and end of the string allows us to print the data including
spaces and newlines.
>>>print("""let's
...start
...Now""")
let's
start
Now
String Concatenation:
Multiple Strings can be concatenated using (+) symbol. Let us see the example of
concatenating the strings.
Example:
>>> x="hello"
>>> y="world"
>>>x+y
'helloworld'
String Repetition:
String repetition can be performed by using the (*) symbol. Let us see the example of
repetition of strings.
Example:
>>> 3*"hello"
'hellohellohello'
Strings are indexed with each character in a memory location when assigned to a
variable. The index starts from zero '0' at the first character and runs to the end of the
string, whereas reverse indexing starts with '-1' from right to left until the starting
character. Let us try a few examples of retrieving the characters from the word PYTHON in
either way.
Example:
>>> x="PYTHON"
# Forward indexes:  P=0, Y=1, T=2, H=3, O=4, N=5
# Reverse indexes:  P=-6, Y=-5, T=-4, H=-3, O=-2, N=-1
>>>x[3]
'H'
>>>x[-5]
'Y'
>>>x[0:4] # first four characters
'PYTH'
>>>x[:-4] # everything before the fourth character from the right, that position excluded
'PY'
>>>x[:]
'PYTHON'
>>>x[0:]
'PYTHON'
rindex(sub[, start[, end]])           Returns the highest index but raises an error when the substring is not found
rjust(width[, fillchar])              Returns the string right-justified
startswith(prefix[, start[, end]])    Checks if the string starts with the specified string
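A small illustration of the three methods listed above (the string "hello world" is our own example):
>>> s="hello world"
>>> s.rindex("o")
7
>>> s.rjust(15, "*")
'****hello world'
>>> s.startswith("hello")
True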
Python – Sequences
Sequence in Python can be defined with a generic term as an ordered set which can
be classified as two sequence types. They are mutable and immutable. There are
different types of sequences in python. They are Lists, Tuples, Ranges.
Lists: Lists will come under mutable type in which data elements can be changed.
Tuples: Tuples are also like Lists which comes under immutable type which cannot be
changed.
Ranges: Ranges is mostly used for looping operations and this will come under
immutable type.
s+t : concatenation
len(s) : length of s
Python – Lists
Python Lists holds the data of any datatype like an array of elements and these are
mutable means the possibility of changing the content or data in it. List can be created
by giving the values that are separated by commas and enclosed in square brackets.
Let us see different types of value assignments to a list.
Example:
List1=[10,20,30,40,50];
List2=['A','B','C','D'];
List3=[10.1,11.2,12.3];
List4=['html','java','oracle'];
List5=['html',10.1,10,'A'];
As we know the way strings can be accessed, same way Lists can be accessed. Below
is example of indexing in python for your understanding again.
Example:
List1=[10,20,30,40,50];
Now let us take a list which holds different datatypes and will access the elements in
that list.
Example:
>>> list5=['html',10.1,10,'A'];
>>> list5[0]
'html'
>>> list5[1:2];
[10.1]
>>>list5[-2:-1];
[10]
>>>list5[:-1];
['html', 10.1, 10]
>>>list5[:-2];
['html', 10.1]
>>>list5[1:-2];
[10.1]
>>>list5[1:-1];
[10.1, 10]
>>> list5[-1];
'A'
>>> list5[3:];
['A']
Example:
>>> list5=['html',10.1,10,'A'];
>>>len(list5)
4
>>> 10 in list5
True
>>> 'A' in list5
True
>>>num=[10,20,30,40];
>>> sum(num)
100
>>> max(num)
40
>>> min(num)
10
Example:
>>>score=[10,20,30,80,50]
>>> score
[10, 20, 30, 80, 50]
>>>score[3]=40
>>> score
[10, 20, 30, 40, 50]
List Comprehension:
Syntax:
[x for x in iterable]
Example:
>>>var=[x for x in range(5)]
>>>var
[0, 1, 2, 3, 4]
>>>var=[x+1 for x in range(5)]
>>>var
[1, 2, 3, 4, 5]
>>>var=[x for x in range(5) if x%3==0]
>>>var
[0, 3]
Example:
>>> var1=[10,20]
>>> var2=[30,40]
>>> var3=var1+var2
>>> var3
[10, 20, 30, 40]
Example:
>>> var1*2
[10, 20, 10, 20]
>>> var1*3
[10, 20, 10, 20, 10, 20]
>>> var1*4
[10, 20, 10, 20, 10, 20, 10, 20]
Example:
>>> var1.append(30)
>>> var1
[10, 20, 30]
>>> var1.append(40)
>>> var1
[10, 20, 30, 40]
>>> var2
[30, 40]
Python – Tuples
Tuples are generally used to store heterogeneous data and are immutable. A tuple
looks like a list, but lists are mutable. To create a tuple, we separate the values with
commas and enclose them in parentheses.
Example:
>>> tup1=()
>>> tup1
()
>>> tup1=(10) # a single value in parentheses is not a tuple; use (10,) for that
>>> tup1
10
>>> tup1=(10,20,30);
>>> tup1
(10, 20, 30)
>>> tup1=tuple([1,1,2,2,3,3])
>>> tup1
(1, 1, 2, 2, 3, 3)
>>> tup1=("tuple")
>>> tup1
'tuple'
>>> tup1=(10,20,30);
>>> max(tup1)
30
>>> min(tup1)
10
>>>len(tup1)
3
>>> 20 in tup1
True
>>> 40 in tup1
False
Slicing in Tuples:
>>> tup1[0:4]
(10, 20, 30)
>>> tup1[0:1]
(10,)
>>> tup1[0:2]
(10, 20)
Python – Dictionary
Dictionaries are created or indexed by key-value pairs, in which the keys are of immutable
types. Tuples can be used as keys, but lists cannot, because lists are mutable.
Generally, the key-value pairs stored in a dictionary are accessed with the key, and
we can delete a key-value pair too. Let us see some examples.
Example:
>>> score={'maths':80,'physics':70,'chemistry':85}
>>> score
{'maths': 80, 'physics': 70, 'chemistry': 85}
>>> score['maths']
80
>>> del score['maths'] # delete a key-value pair
>>> score['maths']
KeyError: 'maths'
>>> score
{'physics': 70, 'chemistry': 85}
>>>score.keys()
dict_keys(['physics', 'chemistry'])
>>>keys=score.keys()
>>> keys
dict_keys(['physics', 'chemistry'])
>>> list(keys)
['physics', 'chemistry']
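Besides keys(), dictionaries provide a few other frequently used methods; a small sketch continuing with the score dictionary above:
>>> score={'physics':70,'chemistry':85}
>>> score.values()
dict_values([70, 85])
>>> score.items()
dict_items([('physics', 70), ('chemistry', 85)])
>>> score.get('biology', 0) # get() returns a default value instead of raising KeyError
0
>>> score.update({'maths': 90}) # add or change a key
>>> score['maths']
90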
Python – Ranges
Range is a kind of immutable data type in Python. Range is mostly used in for loops
for a number of iterations. range is a constructor which takes integer arguments. Below is
the syntax.
Syntax:
class range(stop)
class range(start, stop[, step])
start: the value of the start parameter. If the value is omitted, it defaults to zero.
Examples:
>>>list(range(5))
[0, 1, 2, 3, 4]
>>>list(range(10,20))
[10, 11, 12, 13, 14, 15, 16, 17, 18, 19]
>>>list(range(10,20,5))
[10, 15]
>>>list(range(10,20,2))
[10, 12, 14, 16, 18]
>>>list(range(0,0.1)) # raises TypeError: 'float' object cannot be interpreted as an integer
>>>list(range(0,2))
[0, 1]
>>>list(range(0,1))
[0]
>>>list(range(0,10,5))
[0, 5]
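As mentioned above, range is most often used to drive a for loop; a small sketch:
>>> for i in range(1, 4):
...     print(i*10)
...
10
20
30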
Python – Sets
In this tutorial, we will learn about sets in python. A set is a datatype which holds an
unordered collection with immutable and no duplicate elements. By the name,
Set can be used for various mathematical operations. Mathematical operation may be
union, intersection or difference, etc. Let us see the example of using the Set below.
Example:
>>> set1={'html','c','java','python','sql'}
>>> print(set1) # a set is unordered, so the order of the elements may vary
{'html', 'c', 'java', 'python', 'sql'}
>>> set1={'html','c','java','python','sql','java'}
>>> print(set1) # duplicate elements are stored only once
{'html', 'c', 'java', 'python', 'sql'}
>>> set1={'html','java','python','sql','java'}
>>> print(set1)
{'html', 'java', 'python', 'sql'}
>>> 'c' in set1 # membership tests
False
>>> 'java' in set1
True
>>> set1={'html','java','python','sql','java'}
>>> set2={'html','oracle','ruby'}
>>> set1-set2 # difference (the output order may vary)
{'java', 'python', 'sql'}
>>> set1|set2 # union
{'html', 'java', 'python', 'sql', 'oracle', 'ruby'}
>>> set1&set2 # intersection
{'html'}
>>> set1^set2 # symmetric difference
{'java', 'python', 'sql', 'oracle', 'ruby'}
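Sets also have methods for adding and removing elements and for subset tests; a small sketch with the same kind of data:
>>> set1={'html','java','python','sql'}
>>> set1.add('c')
>>> 'c' in set1
True
>>> set1.remove('c')
>>> {'html','java'}.issubset(set1)
True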
Python - Operators
Operators    Description
+            Addition
-            Subtraction
*            Multiplication
/            Float division
//           Integer division
%            Provides the remainder after division (modulus)
**           Performs exponentiation (raise to power)
>>> 10+10
20
>>>20+30
50
>>>50+50
100
>>>20-10
10
>>>50-40
10
>>>100-30
70
>>>5*2
10
>>>10*2
20
>>>20*2
40
4. Float Division: This will divide and provide the result in floating value and the symbol
used (/)
>>>5/2
2.5
>>>10/2
5.0
5. Integer Division: This will divide and truncate the decimal and provide the Integer
value and the symbol used (//)
>>>5//2
2
>>>7//2
3
6. Exponentiation Operator: This will help us to calculate a to the power b and the symbol used is (**)
>>>10**3
1000
7. Modulus Operator: This will provide the remainder after the calculation and the symbol used is (%)
>>>10%3
1
What if we want to work with multiple operators at a time? Here comes operator
precedence in Python. The table below lists operators from lowest to highest precedence:
Operator                                            Description
or                                                  Boolean OR
in, not in, is, is not, <, <=, >, >=, !=, ==        Comparisons, including membership tests and identity tests
|                                                   Bitwise OR
^                                                   Bitwise XOR
**                                                  Exponentiation
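A few quick checks of precedence in the interpreter (our own examples):
>>> 2+3*4 # * binds tighter than +
14
>>> (2+3)*4 # parentheses change the order
20
>>> 2**3**2 # ** is right-associative: 2**(3**2)
512
>>> 10-4-3 # - is left-associative: (10-4)-3
3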
In this tutorial, we will discuss the "if" conditional statement. Let us understand how
the "if" statement works. An "if" statement has up to three parts: "if", "elif" and "else",
where "elif" and "else" are optional. When the "if" condition is satisfied, the block under
it is executed; otherwise the program falls through to the "elif" or "else" blocks.
Syntax:
if Condition:
program of if
elif Condition:
program of elif
else:
program of else
Example 1:
>>> x=-1
>>> if x<0:
...     print("single")
... else:
...     print("Single")
...
single
Example 2:
>>> x=1
>>> if x<0:
...     print("single")
... else:
...     print("Single")
...
Single
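The syntax above also allows an elif branch, which the examples do not show; a minimal sketch:
>>> x=0
>>> if x<0:
...     print("negative")
... elif x==0:
...     print("zero")
... else:
...     print("positive")
...
zero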
In this tutorial, we will learn about for loop. In Python, for loop is used to iterate over the
sequence of elements (the sequence may be list, tuple or strings.. etc). Below is the
syntax.
Syntax:
for variable in sequence:
    statements
In the below example, we have given the list of strings as courses and the for loop
created to iterate through all the strings to print the course and the length of the course
name.
Example:
>>> courses=['html','c','java','css']
>>> for i in courses:
...     print(i, len(i))
...
html 4
c 1
java 4
css 3
>>>
In the below example, for loop iterates through the list of numbers. In the immediate
step, if statement filters only the numbers less than 50, else it will display "no values" for
rest of the iterations.
Example:
>>> x=[10,20,30,40,50,60]
>>> x
[10, 20, 30, 40, 50, 60]
>>> for i in x:
...     if i<50:
...         print(i)
...     else:
...         print("no values")
...
10
20
30
40
no values
no values
In this tutorial, we will learn about while loop. In python, while loop is used to iterate until
the condition is satisfied. If the condition given is not satisfied in the first iteration itself,
the block of code inside the loop will not get executed.
In the below example, we have assigned the value of x as zero and started the while
loop to run until the value of x reaches 10, printing the values.
Example:
>>> x=0
>>> while x<10:
...     x=x+1
...     print(x)
...
1
2
3
4
5
6
7
8
9
10
Just change the values in the above example; below is the output.
>>> x=100
>>> while x<110:
...     x=x+1
...     print(x)
...
101
102
103
104
105
106
107
108
109
110
Python – Break
In this tutorial, we will learn about the Break statement in Python. Break statement is
used to terminate the loop program at a point.
Let us understand the below example, which does not have a "break" statement and will go
through all the iterations till the value of i becomes 109.
Example:
>>>for i in range(100,110):
...     for num in range(100,i):
...         print(i,num)
...
101 100
102 100
102 101
103 100
103 101
103 102
104 100
104 101
104 102
104 103
105 100
105 101
105 102
105 103
105 104
106 100
106 101
106 102
106 103
106 104
106 105
107 100
107 101
107 102
107 103
107 104
107 105
107 106
108 100
108 101
108 102
108 103
108 104
108 105
108 106
108 107
109 100
109 101
109 102
109 103
109 104
109 105
109 106
109 107
109 108
Now, let us break the loop when the value of i becomes 105. Below is the code and
output for clarification.
Example:
>>>for i in range(100,110):
...     for num in range(100,i):
...         print(i,num)
...     if i==105:
...         break
...
101 100
102 100
102 101
103 100
103 101
103 102
104 100
104 101
104 102
104 103
105 100
105 101
105 102
105 103
105 104
Python – Continue
In this tutorial, we will learn about the Continue statement in Python. Continue
Statement is used to take the control to top of the loop for next iteration leaving the rest
of the statements in the loop without execution.
>>>for i in range(10):
...     continue
...     print(i)
...
Nothing is printed here, because continue sends control back to the top of the loop before
print(i) can run.
Below is another example where the printing of even numbers is skipped:
>>>for i in range(100,110):
...     if i%2==0:
...         continue
...     else:
...         print(i)
...
101
103
105
107
109
Python – Pass
Pass statement is used when there is a situation where a statement is required for syntax
in the code, but which should not to be executed. So that, When the program executes
that portion of code will not be executed.
In the below example, we can observe that the pass statement in if condition was not
executed.
Example:
>>>for i in range(100,104):
...     for num in range(100,i):
...         print(i,num)
...         if num==102:
...             pass
...
101 100
102 100
102 101
103 100
103 101
103 102
In the below example, the pass statement is omitted, which results in an error because the
if clause has an empty body.
Example:
>>>for i in range(100,104):
...     for num in range(100,i):
...         print(i,num)
...         if num==102:
...
IndentationError: expected an indented block
In python, datetime is a module which provides different classes to work with dates and
times.
1. date
2. time
3. datetime
4. timedelta
5. tzinfo
6. timezone
Date Object:
Date object depicts the date as date(year, month, day) in an ideal Gregorian
calendar. Syntax of the Date class is represented as below.
Syntax:
class datetime.date(year, month, day)
All the arguments are integers. Every argument has its own range of values as below.
1. YEAR: MINYEAR (1) - MAXYEAR (9999)
2. MONTH: 1 - 12
3. DAY: 1 - number of days in the given month and year
Now let us work with date object and its methods which serves different requirements
with an example. The below example shows the current date in different formats.
Example:
>>> from datetime import date
>>>today=date.today()
>>> today
datetime.date(2017, 11, 7)
>>> x=today
>>> x
datetime.date(2017, 11, 7)
>>> d=date(2017,11,7)
>>> x=d.timetuple()
# Print the date values from the tuple by year, month, day ..etc
>>> for i in x:
... print(i)
...
2017
11
7
0
0
0
1
311
-1
>>>d.isoformat()
'2017-11-07'
>>>d.strftime("%d/%m/%u")
'07/11/2'
>>>d.strftime("%A%d.%B %Y")
'Tuesday07.November 2017'
Time object:
A time object which gives the information about time of any particular day subject to
the requirements. The syntax of the time object constructor is given below.
Syntax:
class datetime.time(hour=0, minute=0, second=0, microsecond=0, tzinfo=None, *, fold=0)
1. HOUR: 0 to < 24
2. MINUTE: 0 to < 60
3. SECOND: 0 to < 60
4. MICROSECOND: 0 to < 1000000
5. fold in [0, 1]
Example:
>>> from datetime import time
>>> t=time(12,12,12)
>>> t
datetime.time(12, 12, 12)
>>>t.isoformat()
'12:12:12'
Datetime Object:
Datetime object is a combination of both date and time information which can
provide the functions from date object and time object.
Syntax:
class datetime.datetime(year, month, day, hour=0, minute=0, second=0, microsecond=0, tzinfo=None, *, fold=0)
All the arguments are integers. Each argument has its own range of values as below.
1. YEAR: MINYEAR (1) - MAXYEAR (9999)
2. MONTH: 1 - 12
3. DAY: 1 - number of days in the given month and year
4. HOUR: 0 to < 24
5. MINUTE: 0 to < 60
6. SECOND: 0 to < 60
7. MICROSECOND: 0 to < 1000000
8. fold in [0, 1]
Now let us work with datetime object and its methods which serves different
requirements with an example.
Example:
>>> from datetime import datetime, date, time
# date
>>> d=date(2017,11,7)
# time
>>> t=time(10,10)
>>>datetime.combine(d,t)
datetime.datetime(2017, 11, 7, 10, 10)
>>>datetime.now() # the exact value depends on when you run it
datetime.datetime(2017, 11, 7, 17, 22, 58, 626832)
>>>datetime.utcnow() # the same moment expressed in UTC
>>>dt=datetime.now()
>>>dt
datetime.datetime(2017, 11, 7, 17, 22, 58, 626832)
>>>tt=dt.timetuple()
>>> for i in tt:
... print(i)
...
2017 # year
11 # month
7 # day
17 # hour
22 # minute
58 # second
1 # weekday ( 0 = Monday)
311 # day of the year
-1 # tm_isdst (daylight saving time flag, -1 = unknown)
>>> 'The {1} is {0:%d}, the {2} is {0:%B}, the {3} is {0:%I:%M%p}.'.format(dt, "day", "month", "time")
'The day is 07, the month is November, the time is 05:22PM.'
>>>dt.isoformat()
'2017-11-07T17:22:58.626832'
Python – Functions
1. User-defined functions
2. Pre-defined functions
User-Defined functions:
In Python, a user-defined function is a block of code which is reusable. Once it is
defined or written, it can be used multiple times and in other applications too.
Syntax:
def function_name( args ):
statement 1
statement 2
return
Below is an example of a user-defined function that prints the prime numbers below a
given value.
>>> def prim(n):
...     for x in range(2,n):
...         for i in range(2,x):
...             if x%i==0:
...                 break
...         else:
...             print(x)
...
>>>prim(10)
2
3
5
7
>>>prim(20)
2
3
5
7
11
13
17
19
>>>prim(50)
2
3
5
7
11
13
17
19
23
29
31
37
41
43
47
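User-defined functions can also take default arguments and return a value; a minimal sketch:
>>> def add(a, b=10):
...     """Return the sum of a and b; b defaults to 10."""
...     return a+b
...
>>> add(5)
15
>>> add(5, 20)
25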
Pre-defined Functions:
Pre-defined functions are already existing functions which cannot be changed. But still
we can make our own custom functions using those pre-defined functions.
abs()
all()
any()
ascii()
bin()
bool()
bytearray()
bytes()
callable()
chr()
classmethod()
compile()
complex()
delattr()
dict()
dir()
divmod()
enumerate()
eval()
exec()
filter()
float()
format()
frozenset()
getattr()
globals()
hasattr()
hash()
help()
hex()
id()
input()
int()
isinstance()
issubclass()
iter()
len()
list()
locals()
map()
max()
memoryview()
min()
next()
object()
oct()
open()
ord()
pow()
print()
property()
range()
repr()
reversed()
round()
set()
setattr()
slice()
sorted()
staticmethod()
str()
sum()
super()
tuple()
type()
vars()
zip()
__import__()
In this tutorial, we will learn about modules in Python. Modules in Python are nothing other
than files with a .py extension which contain various statements and functions. In order to
import a module, we use the "import" command.
There are many modules in Python; re, csv, math, datetime, urllib and zlib (all used later in
these notes) are a few examples of pre-existing modules that come with Python.
We can also write our own modules. Let us see how to create a module which helps us
to process the prime numbers under any given value.
First create a file called "prime.py" and write the below code into the file.
def prim(n):
for x in range(2,n):
for i in range(2,x):
if x%i==0:
break
else:
print(x)
Now connect to python3 and import the module called "prime" using the keyword
import. Then, call the function by passing the integer value as an argument to list the
prime numbers for the given value.
>>>prime.prim(10)
2
3
5
7
>>>prime.prim(20)
2
3
5
7
11
13
17
19
>>>
dir(module_name) will list all the types of variables, modules and functions used in the given
module.
>>>dir(prime)
Packages:
Packages are namespaces which contain many modules and sub-packages.
Every package is nothing other than a directory that contains a file called "__init__.py". This
file marks the directory as a package.
For example, we have created the prime.py module in the above example. Now let us
create the package for the same.
1. Create a directory called "primenum" and keep the above module in that directory.
2. Create an empty "__init__.py" file inside that directory so that Python treats it as a package.
In this tutorial, we will learn about how to read the data from a file in Python. In
Python, we will use the method called "open" to open the file and the "read" method to
read the contents of the file. open() takes two main arguments:
1. Filename - the name of the file to open.
2. Mode - refers to the mode of opening the file, which may be write mode or read mode.
Modes:
r - read, w - write, a - append, r+ - read and write
Read Methods:
read() - This method reads the entire data in the file. If you pass an argument such as
read(1), it will read the first character and return the same.
First let us create the data in a file called "techsal.csv". Below is how the data looks.
Now let us import the module called "csv", open the file and read the data:
>>>import csv
>>> file=open("techsal.csv","r")
>>>print(file.read())
Let us read the first character of the data by passing the numeric argument:
>>> file=open("techsal.csv","r")
>>>print(file.read(1))
Let us read the first line of the file by using the readline() method:
>>> file=open("techsal.csv","r")
>>>print(file.readline())
>>> data=open("techsal.csv","r")
>>> out=data.read()
>>> print(out)
Once we complete the work, we need to close the file. Otherwise, it just wastes
memory. Below is the way to close the file. Once you close the file, you cannot read
the data; to read it again, you need to open it again.
>>>data.close()
>>>data.closed
True
>>>
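A more idiomatic way to make sure the file is always closed is the with statement, which closes the file automatically at the end of the block (using the same techsal.csv file as above):
>>> with open("techsal.csv","r") as data:
...     out=data.read()
...
>>> data.closed # the file is closed automatically on leaving the block
True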
In this tutorial, we will learn how to write into a file in Python. Writing the data into a file
can be done by using the "write()" method. The write() method takes string as an
argument. Let us see an example of writing the data into the file called "techsal.csv".
First creating the data in a file called "techsal.csv". Below is the data how it looks.
>>> data=open('techsal.csv','r+')
>>> out=data.read()
>>> print(out)
>>>data.close()
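The write() call itself is not preserved above; a hedged sketch that appends one line to the same file (the row written here is only an illustration):
>>> data=open("techsal.csv","a") # "a" opens the file for appending
>>> data.write("python,85000\n") # write() takes a string and returns the number of characters written
13
>>> data.close()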
In this tutorial, we will understand Classes and Objects in Python. A class is
defined with its own syntax, similar to how functions are defined. Below is the syntax of a
class.
Syntax:
class ClassName:
<statement-1>
...
<statement-N>
The statements inside the class are function definitions and also contain other required
statements. When a class is created, that creates a local namespace where all data
variables and functions are defined.
>>> class MyFirstClass:
...     """A simple example class"""
...     data=127
...     def f(self):
...         return 'hello world'
...
>>>print(MyFirstClass.data)
Output:
127
>>>print(MyFirstClass.f)
Output:
<function MyFirstClass.f at 0x...>
>>>print(MyFirstClass.__doc__)
Output:
A simple example class
Object:
Now let us see how to create an object. Creation of an object is an instance of the
class. Below is how we creating an Object of the class MyFirstClass.
>>> x = MyFirstClass()
In the above line of code, we have created an Object for the class "MyFirstClass" and
its name is "x".
Just try to access the object name and it gives you information about the object.
>>> x
<__main__.MyFirstClass object at 0x...>
>>>print(x.f)
<bound method MyFirstClass.f of <__main__.MyFirstClass object at 0x...>>
Below is how you can access the attributes like data variables and functions inside the
class using the object name, which return some value.
>>>x.f()
'hello world'
>>>x.data
127
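Classes usually also define an __init__ method so that each object can carry its own data. A minimal sketch (the Employee class and its values are our own illustration):
>>> class Employee:
...     def __init__(self, name, salary):
...         self.name = name # instance attributes, one copy per object
...         self.salary = salary
...     def info(self):
...         return self.name + " earns " + str(self.salary)
...
>>> e = Employee("John", 50000)
>>> e.info()
'John earns 50000'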
Python – Exceptions
In this tutorial, we will learn about handling exceptions in Python. It is quite common
that a program written in any programming language may hit an error during
execution for various reasons.
Reasons may be syntactical errors, or conditional or operational errors caused by the
filesystem or by the lack of resources required to execute the program.
So, we need to handle those kinds of exceptions or errors by using different clauses while
we do programming.
In Python, we can handle an exception by using the raise statement or using
the try and except clauses.
Syntax:
try:
raise statement
except x:
statement
1. Pre-defined exceptions:
These are the exceptions which exist within the Python programming language
as built-in exceptions. Some of them are arithmetic errors like ZeroDivisionError,
OverflowError and FloatingPointError.
2. User-defined exceptions:
These exceptions are created by programmer which are derived from Exception class.
>>> while True:
...     try:
...         x=int(input("Please enter a number: "))
...         break
...     except ValueError:
...         print("Sorry !! the given number is not valid number. Try again...")
...
Output: the prompt repeats until a valid integer is entered.
>>> try:
...     print(100/0)
... except ZeroDivisionError:
...     print("You cannot divide by zero")
...
Output:
You cannot divide by zero
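The notes mention user-defined exceptions but do not show one; a minimal sketch (the class and function names are our own illustration):
>>> class NegativeSalaryError(Exception):
...     """Raised when a salary value is negative."""
...     pass
...
>>> def set_salary(value):
...     if value < 0:
...         raise NegativeSalaryError("salary cannot be negative")
...     return value
...
>>> try:
...     set_salary(-100)
... except NegativeSalaryError as e:
...     print("Error:", e)
...
Error: salary cannot be negative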
Regular expressions can be used for searching a word, a character or digits in the
given data using several patterns. These are also called REs or regex patterns. We just
need to import the module "re" to work with regular expressions.
Let us see the below example which uses the find all method to get the search result of
salary from the data given as input to techdata variable.
Example:
import re
>>>techdata = '''
... '''
>>>salary = re.findall(r'\d{1,10}',techdata)
>>> print(salary)
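Since the contents of techdata are not preserved above, here is a self-contained sketch with made-up data showing the same findall pattern:
>>> import re
>>> techdata = '''
... john,developer,45000
... smith,tester,38000
... paul,manager,90000
... '''
>>> re.findall(r'\d{1,10}', techdata) # all runs of 1 to 10 digits
['45000', '38000', '90000']
>>> re.findall(r'[a-z]+,manager', techdata)
['paul,manager']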
Python – Mathematics
In python, we have the module called "math" which provides the access to
mathematical functions which are defined in C programming.
1. Number-theoretic and representation functions
2. Power and logarithmic functions
3. Trigonometric functions
4. Angular conversion
5. Hyperbolic functions
6. Special functions
7. Constants
>>> from math import ceil, factorial, floor, gcd, fsum, trunc
>>>ceil(10.3)
11
>>>ceil(9.9)
10
>>>factorial(3)
6
>>>factorial(10)
3628800
>>>floor(10.3)
10
>>>floor(10.9)
10
>>>gcd(5,10)
5
>>>gcd(3,7)
1
>>>fsum([5,4,5,1])
15.0
>>>trunc(9.4)
9
>>>trunc(10.5)
10
>>>
>>> import math
>>>math.exp(2)
7.38905609893065
>>>math.log(2,10)
0.30102999566398114
>>>math.log(2,4)
0.5
>>> math.log2(4)
2.0
>>> math.log10(2)
0.3010299956639812
>>>math.pow(2,3)
8.0
>>>math.sqrt(64)
8.0
>>> from math import sin, cos, tan, degrees, radians
>>>sin(30)
-0.9880316240928618
>>>cos(90)
-0.4480736161291701
>>>tan(0)
0.0
>>>degrees(10)
572.9577951308232
>>>radians(572)
9.983283321407566
In this tutorial, we will learn about how to access the internet using the python. In
python, we will have a module called "urllib" that provides various Objects and
functions to access the internet. We can perform many activities using this "urllib"
module like accessing the webpage data of any website, sending an email... etc.
Let us try fetching the 100 bytes of code behind the google.com page. Below is the
example.
Example:
>>> import urllib.request
>>> f = urllib.request.urlopen("http://google.com/")
>>>print(f.read(100).decode('utf-8'))
Output:
Let us try fetching the 500 bytes of code behind the google.com page.
>>>print(f.read(500).decode('utf-8'))
Output:
In this tutorial, we will learn about the data compression in Python programming
language. In python, the data can be archived, compressed using the modules like
zlib, gzip, bz2,lzma,zipfile and tarfile. To use the respective module, you need to import
the module first. Let us look at below example.
Example:
>>> import zlib
>>>len(s)
41
>>>
>>> t = zlib.compress(s)
>>>len(t)
39
>>>
>>>zlib.decompress(t)
>>>
>>>zlib.crc32(s)
2172471860
>>>
Numpy
1. NumPy - Introduction
2. NumPy – Installation
5. NumPy – Arrays
9. NumPy – Broadcasting
NumPy – Introduction
Numpy is one of the libraries available for the Python programming language. This
library or module provides pre-compiled numerical and mathematical functions.
Numpy is designed for multidimensional arrays and for scientific
computing, and it is memory efficient.
NumPy – Installation
In this tutorial, we will understand that how to do the installation of Numpy on
both linux and windows platforms.
It is best to use the pre-built packages to install Numpy. Otherwise, you can
install a Python distribution like Anaconda, Python(x,y) or Pyzo, which installs all the
necessary packages that may be needed.
To install Numpy on the Linux platform, we need to have Python already installed.
On most Linux platforms, Python comes by default. If not, you can use the
yum utility to install Python and any other needed packages with the below
command on RedHat or CentOS:
$ yum install python-numpy
In this tutorial, we will learn how to import and use Numpy. You have to use the
keyword "import" to import the numpy module.
Below is command to import the Numpy Module.
>>> import numpy
It is better to give an alias name and use the same alias name for every call
to numpy; otherwise you have to write numpy.X every time. With the alias
name np, you can write np.X instead of numpy.X.
There is another method of importing the entire Numpy package in a single call:
>>> from numpy import *
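For example, after importing with an alias, every Numpy function is reached through np:
>>> import numpy as np
>>> np.array([1, 2, 3])
array([1, 2, 3])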
float64 is a double precision float: sign bit, 11 bits exponent, 52 bits mantissa.
We will use the "dtype" attribute to identify the datatype of an array. Below is the command.
>>> x=np.array([1,2,3,4,5])
>>>x
array([1, 2, 3, 4, 5])
>>>x.dtype
dtype('int32')
Below is the command. Please observe that we have created the array with a floating
datatype.
>>> x=np.array([1,2,3,4,5],dtype=float)
>>>x
array([1., 2., 3., 4., 5.])
>>>x.dtype
dtype('float64')
>>> y=np.array([.1,.2,.3,.4,.5])
>>>y
array([0.1, 0.2, 0.3, 0.4, 0.5])
>>>y.dtype
dtype('float64')
>>>eq=np.array([True, False])
>>>eq
array([ True, False])
>>>eq.dtype
dtype('bool')
>>>str
dtype='<U5')
>>>str.dtype
dtype('<U5')
>>> j=2
>>> solve
>>>solve.dtype
dtype('complex128')
>>>solve.real
NumPy – Arrays
>>> a=np.array([1,2,3,4],float)
>>>a
array([1., 2., 3., 4.])
>>> a=np.array([1,2,3,4],int)
>>>a
array([1, 2, 3, 4])
>>>type(a)
<class 'numpy.ndarray'>
# ndim property gives the number of dimensions of the array
>>>a.ndim
1
# shape property will be used to find out the size of each array dimension
>>>a.shape
(4,)
# len function will be used to get the length of first array axis in the dimension
>>>len(a)
4
Arrays can be multi-dimensional (much like matrices in mathematics); let us see an example below.
>>> b=np.array([[1,2,3,4],[5,6,7,8]],int)
>>>b
array([[1, 2, 3, 4],
[5, 6, 7, 8]])
>>>type(b)
<class 'numpy.ndarray'>
>>>b.ndim
2
>>>b.shape
(2, 4)
>>>len(b)
2
>>> c=np.array([[1,2,3,4],[5,6,7,8],[9,8,7,6]],int)
>>>type(c)
<class 'numpy.ndarray'>
>>>c
array([[1, 2, 3, 4],
[5, 6, 7, 8],
[9, 8, 7, 6]])
>>>c.ndim
2
>>>c.shape
(3, 4)
>>>len(c)
3
>>>c[1,1]
6
>>>c[2,3]
6
>>>c[2,0]
9
>>>c.reshape(4,3)
array([[1, 2, 3],
[4, 5, 6],
[7, 8, 9],
[8, 7, 6]])
The arange function which almost like a Range function in Python. The arange function
will return an array as a result.
>>> a=np.arange(5)
>>>a
array([0, 1, 2, 3, 4])
>>>a[0]
0
>>>a[1]
1
>>>a[4]
4
In the below example, first argument is start number ,second is ending number, third is
nth position number. It means that it has to display the numbers for every 5th step
starting from one to 20.
Example:
>>> b=np.arange(1,20,5)
>>>b
array([ 1,  6, 11, 16])
If you want to divide it by number of points, linspace function can be used. This will
reach the end number by the number of points you give as the last argument.
Example:
First argument - 0
Second argument - 1
Third argument - 5
>>> b=np.linspace(0,1,5)
>>>b
array([0.  , 0.25, 0.5 , 0.75, 1.  ])
>>>
>>> b=np.linspace(0,1,6)
>>>b
array([0. , 0.2, 0.4, 0.6, 0.8, 1. ])
>>> b=np.linspace(0,1,7)
>>>b
array([0.        , 0.16666667, 0.33333333, 0.5       , 0.66666667,
       0.83333333, 1.        ])
The zeros and ones functions create arrays filled with zeros or ones:
>>> a=np.zeros([2,3])
>>>a
array([[0., 0., 0.],
       [0., 0., 0.]])
>>>a[1,1]
0.0
>>>np.ones(5)
array([1., 1., 1., 1., 1.])
>>>np.zeros(5)
array([0., 0., 0., 0., 0.])
>>> a=np.eye(3) # identity matrix
>>>a
array([[1., 0., 0.],
       [0., 1., 0.],
       [0., 0., 1.]])
Example:
>>> a=np.arange(5)
>>>a
array([0, 1, 2, 3, 4])
>>>a[0]
0
>>>a[1]
1
>>>a[2]
2
>>>a[5] # index 5 is out of bounds for an array of size 5
IndexError: index 5 is out of bounds for axis 0 with size 5
>>>a[-1]
4
>>>a[-2]
3
>>>a[-3]
2
>>>a[-4]
1
>>>a[-5]
0
>>>
For a two-dimensional array, the dimension corresponding to rows can be accessed using a[0]
(the first row of all elements), and the column axis is accessed with a second index, as shown below.
>>> a=np.diag(np.arange(3))
>>>a
array([[0, 0, 0],
[0, 1, 0],
[0, 0, 2]])
>>>a[0]
array([0, 0, 0])
>>>a[1]
array([0, 1, 0])
>>>a[2]
array([0, 0, 2])
# To access the column axis, we need to mention the specified index number to access
the value.
>>>a[1,1]
1
>>>a[1,2]
0
# below is the error, just because we tried to access the 4th position in the row,
# which does not exist
>>>a[1,3]
IndexError: index 3 is out of bounds for axis 1 with size 3
>>>a[1,0]
0
>>>a[0,0]
0
>>>
Accessing the column values from the matrix.we can use ellipsis(…) to get the column
or row values in particular. If we place the ellipsis in the row position, it will get you the all
the values of particular column.
>>>a[...,1]
array([0, 1, 0])
>>>a[...,0]
array([0, 0, 0])
>>>a[...,2]
array([0, 0, 2])
Accessing the row values from diagonal matrix. If we place the ellipsis in the column
value position, it will get you the all the values of particular row mentioned.
>>>a[1]
array([0, 1, 0])
>>>a[1,...]
array([0, 1, 0])
>>>
slicing:
>>> a=np.arange(20)
>>>a
array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
       17, 18, 19])
>>>a[5:20:5]
array([ 5, 10, 15])
>>>a[:4]
array([0, 1, 2, 3])
>>>a[:10]
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
>>>a[1:5]
array([1, 2, 3, 4])
>>>a[5:]
array([ 5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19])
>>>a[::5]
array([ 0,  5, 10, 15])
>>>a[0]
0
>>>a[1:5]
array([1, 2, 3, 4])
Note:
Advanced indexing always returns a copy of the data, not a view as with basic slicing.
There are two kinds of advanced indexing:
1. Integer array indexing
2. Boolean indexing
Integer array indexing:
Example:
>>> x=np.array([[1,2],[3,4],[5,6]])
>>>x
array([[1, 2],
       [3, 4],
       [5, 6]])
Let us try to select specific elements like [0,1,2] which is a row index and column index
[0,1,0] each element for the corresponding row.
>>>x[[0,1,2],[0,1,0]]
array([1, 4, 5])
>>>x[0]
array([1, 2])
Let us select the 0 as row index and 1 as column index which gives as array value of 2
>>>x[[0],[1]]
array([2])
In the same way, selecting [0],[2] will give us an error, as there is no column with index 2.
>>>x[[0],[2]]
IndexError: index 2 is out of bounds for axis 1 with size 2
You can do the add operation which returns the value of particular index after
performing the addition.
>>>x
array([[1, 2],
[3, 4],
[5, 6]])
>>>x[[0],[1]]+1
array([3])
Below operation will change the values in the array and returns the new copy of an
array.
>>>x
array([[1, 2],
[3, 4],
[5, 6]])
>>>x[[0],[1]]+=1
>>>x
array([[1, 3],
[3, 4],
[5, 6]])
Boolean Indexing:
Boolean Indexing will be used when the result is going to be the outcome of boolean
operations.
Example:
>>> x=np.array([[0,1,2],
[3,4,5],
[6,7,8],
[9,10,11]])
>>>x
array([[ 0, 1, 2],
[ 3, 4, 5],
[ 6, 7, 8],
[ 9, 10, 11]])
>>>x[x ==0]
array([0])
>>>x[x%2==0]
array([ 0,  2,  4,  6,  8, 10])
>>>
NumPy – Broadcasting
Operations on Numpy arrays are generally happened on element wise. It means the
arrays of same size works better for operations.
But it is also possible to do the operations on those arrays which are different in size.
How to do this?
We do not have to do anything ourselves: Numpy can transform arrays of different sizes into the
same size. That kind of conversion is known as broadcasting in Numpy.
Let us observe the below example:
>>> a=np.array([[0],[10],[20],[30]])
>>>a
array([[ 0],
[10],
[20],
[30]])
Creating an array of size 3 column * 1 row
>>> b=np.array([0,1,2])
>>>b
array([0, 1, 2])
Adding the two of them will give us the resultant array after broadcasting:
>>>a+b
array([[ 0,  1,  2],
       [10, 11, 12],
       [20, 21, 22],
       [30, 31, 32]])
>>>a-b
array([[ 0, -1, -2],
       [10,  9,  8],
       [20, 19, 18],
       [30, 29, 28]])
>>>a*b
array([[ 0,  0,  0],
       [ 0, 10, 20],
       [ 0, 20, 40],
       [ 0, 30, 60]])
NumPy – Iterating Over Array
In this tutorial, we will learn about array iteration in Numpy.
In python, we have used iteration through lists. In the same way, we can iterate over
arrays in Numpy.
Example:
>>> a=np.array([1,2,3,4,5,6,7,8,9,10],int)
>>>a
array([ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
>>>for i in a:
print (i)
1
2
3
4
5
6
7
8
9
10
Here the iteration will go over the first axis so that each loop returns you the subset of
the array.
Example:
# 2-D Array
>>> a=np.array([[1,2],[3,4],[5,6],[7,8],[9,10]],int)
>>>a
array([[ 1, 2],
[ 3, 4],
[ 5, 6],
[ 7, 8],
[ 9, 10]])
>>>for i in a:
print (i)
[1 2]
[3 4]
[5 6]
[7 8]
[ 9 10]
>>>for (i,j) in a:
print (i+j)
3
7
11
15
19
# A 2-D array with three elements per row
>>> a=np.array([[1,2,3],[4,5,6],[7,8,9]],int)
>>>a
array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])
>>>for i in a:
print (i)
[1 2 3]
[4 5 6]
[7 8 9]
>>>for (x,y,z) in a:
print (x+y+z)
6
15
24
>>>for (x,y,z) in a:
print (x*y*z)
6
120
504
We can transpose an array using the ndarray.T Operation which will be same if
self.ndim is less than 2
Let us work on an example.
Example:
>>> a=np.array([[1,2,3],[4,5,6]])
>>>a
array([[1, 2, 3],
[4, 5, 6]])
>>>a.ravel()
array([1, 2, 3, 4, 5, 6])
>>>a.T
array([[1, 4],
[2, 5],
[3, 6]])
>>>a.T.ravel()
array([1, 4, 2, 5, 3, 6])
>>>a.T.ravel().reshape((2,3))
array([[1, 4, 2],
[5, 3, 6]])
>>> a=np.arange(4*3*2)
>>>a
array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17,
       18, 19, 20, 21, 22, 23])
>>>a.reshape(4,3,2)
array([[[ 0, 1],
[ 2, 3],
[ 4, 5]],
[[ 6, 7],
[ 8, 9],
[10, 11]],
[[12, 13],
[14, 15],
[16, 17]],
[[18, 19],
[20, 21],
[22, 23]]])
>>>a.shape
(24,)
>>> b=a.reshape(4,3,2)
>>>b
array([[[ 0, 1],
[ 2, 3],
[ 4, 5]],
[[ 6, 7],
[ 8, 9],
[10, 11]],
[[12, 13],
[14, 15],
[16, 17]],
[[18, 19],
[20, 21],
[22, 23]]])
>>>b.shape
(4, 3, 2)
add: This will concatenate two arrays of strings element-wise.
Example:
>>> x=np.array(("iam a numpy"))
>>> y=np.array(("program"))
>>>np.char.add(x,y)
array('iam a numpyprogram', dtype='<U18')
Example:
>>> x=np.char.multiply("numpy",5)
>>>x
array('numpynumpynumpynumpynumpy', dtype='<U25')
>>>print(x)
numpynumpynumpynumpynumpy
capitalize: This will return a copy of string with first character of each element
capitalized.
Example:
>>> x=np.char.capitalize("numpy")
89
>>>x
Page
array('Numpy', dtype='<U5')
>>>print(x)
Numpy
Example:
>>>print(x)
Example:
>>> x=np.char.lower("NUMPY")
>>>print(x)
numpy
Example:
>>> x=np.char.upper("numpy")
>>>print(x)
NUMPY
Example:
>>> x=np.char.equal("iam","numpy")
>>>print(x)
False
Example:
>>> x=np.char.not_equal("iam","numpy")
>>>print(x)
True
count: This will return an array with the number of non-overlapping occurrences of the
substring in the given range.
Example:
>>> x=np.array(['bet','abet','alphabet'])
>>>x
array(['bet', 'abet', 'alphabet'], dtype='<U8')
>>>np.char.count(x,'bet')
array([1, 1, 1])
>>>np.char.count(x,'abet')
array([0, 1, 1])
>>>np.char.count(x,'alphabet')
array([0, 0, 1])
isnumeric: This will return true if there is only numeric characters in the element.
Example:
>>>np.char.isnumeric('bet')
array(False, dtype=bool)
rfind: This will return the highest index in the string where substring is found.
Example:
>>> x=np.array(['bet','abet','alphabet'])
>>>x
>>>np.char.rfind(x,'abet')
array([-1, 0, 4])
>>>np.char.rfind(x,'bet')
array([0, 1, 5])
>>>np.char.rfind(x,'alphabet')
array([-1, -1,  0])
Trigonometric Operations:
Example:
>>>np.sin(np.pi/2)
1.0
>>>np.cos(np.pi/2)
6.123233995736766e-17
>>>np.tan(np.pi/2)
16331239353195370.0
Rounding Operations:
trunc: This will return the truncated value of the input, element-wise.
Example:
>>>print(x)
>>>print(np.floor(x))
>>>print(np.ceil(x))
>>>print(np.trunc(x))
93
sum: This will return the sum of array elements over the given axis.
diff: This will return the nth discrete difference along the given axis.
Example:
>>> x=np.sum([[1,2],[3,4]])
>>>print(x)
10
>>> y=np.sum([[1,2],[3,4]], axis=0)
>>> print(y)
[4 6]
>>> z=np.sum([[1,2],[3,4]], axis=1)
>>>print(z)
[3 7]
>>> x=np.diff([[1,2],[3,4]])
>>>print(x)
[[1]
 [1]]
>>> y=np.diff([[1,2],[3,4]], axis=0)
>>> print(y)
[[2 2]]
>>> z=np.diff([[1,2],[3,4]], axis=1)
>>>print(z)
[[1]
 [1]]
Logarithmic Operations:
Example:
>>> x=np.log([1])
>>>print(x)
[ 0.]
>>> y=np.log2([2,4,8])
>>> print(y)
[ 1. 2. 3.]
Power: This will return the result of first array elements raised to powers from second
array, element-wise.
Example:
>>> x=np.add(10,20)
>>>print(x)
30
>>> x=np.multiply(10,20)
>>>print(x)
200
>>> x=np.divide(10,20)
>>>print(x)
0.5
>>> x=np.power(10,2)
>>>print(x)
100
>>> x=np.remainder(10,2)
>>>print(x)
0
>>> x=np.remainder(9,2)
>>>print(x)
1
Order statistics:
amin: This will return the minimum of an array or the minimum along an axis.
amax: This will return the maximum of an array or the maximum along an axis.
Example:
>>> x=np.arange(4).reshape((2,2))
>>>x
array([[0, 1],
[2, 3]])
>>>np.amin(x)
0
>>>np.amin(x, axis=0)
array([0, 1])
>>>np.amax(x)
3
>>>np.amax(x, axis=0)
array([2, 3])
>>>np.amax(x, axis=1)
array([1, 3])
array([1, 3])
Median: This will return the median along the specified axis.
Average: This will return the weighted average along the specified axis.
Mean: This will return the arithmetic mean along the specified axis.
std: This will return the standard deviation along the specified axis.
var: This will return the variance along the specified axis.
Example:
>>> import numpy as np
>>> x=np.array([[1,2,3],[4,5,6]])
>>>x
array([[1, 2, 3],
       [4, 5, 6]])
>>>np.median(x)
3.5
>>>np.median(x,axis=0)
array([2.5, 3.5, 4.5])
>>>np.average(x)
3.5
>>>np.mean(x)
3.5
>>>np.std(x)
1.707825127659933
>>>np.var(x)
2.9166666666666665
Histograms:
Histogram: This will return the computed histogram of a set of data. This function mainly
works with bins and set of data given as input. Numpy histogram function will give the
computed result as the occurances of input data which fall in each of the particular
range of bins. That determines the range of area of each bar when plotted using
matplotlib.
Example:
>>> import numpy as np
>>>np.histogram([10,15,16,24,25,45,36,45], bins=[0,10,20,30,40,50])
(array([0, 3, 2, 1, 2]), array([ 0, 10, 20, 30, 40, 50]))
>>> import matplotlib.pyplot as plt
>>>plt.hist([10,15,16,24,25,45,36,45], bins=[0,10,20,30,40,50])
(array([ 0., 3., 2., 1., 2.]), array([ 0, 10, 20, 30, 40, 50]), <a list of 5 Patch objects>)
>>>plt.show()
argmax: This will return the indices of the maximum values along an axis.
argmin: This will return the indices of the minimum values along an axis.
count_nonzero: This will return count of number of non-zero values in the array.
Example:
>>> x=np.array([[1,4],[3,2]])
>>>np.sort(x)
array([[1, 4],
[2, 3]])
>>>np.sort(x,axis=None)
array([1, 2, 3, 4])
>>>np.sort(x,axis=0)
array([[1, 2],
[3, 4]])
>>>np.argmax(x)
1
>>>x
array([[1, 4],
       [3, 2]])
>>>np.argmax(x, axis=0)
array([1, 0])
>>>np.argmax(x, axis=1)
array([1, 0])
>>>np.argmin(x)
0
>>>np.argmin(x, axis=0)
array([0, 1])
>>>np.count_nonzero(np.eye(4))
4
>>>np.count_nonzero([[0,1,7,0,0],[3,0,0,2,19]])
5
Example:
>>> x=np.matrix('1,2,3,4')
>>>x
matrix([[1, 2, 3, 4]])
>>>print(x)
[[1 2 3 4]]
Example:
>>> x=np.array([[1,2],[3,4],[4,5]])
>>>x
array([[1, 2],
[3, 4],
[4, 5]])
>>> y=np.asmatrix(x)
>>>y
matrix([[1, 2],
[3, 4],
[4, 5]])
>>>x[0,0]=5
>>>y
matrix([[5, 2],
[3, 4],
[4, 5]])
Example:
>>>mb.empty((2,2))
[ 3.44900029e-307, 1.78250172e-312]])
>>>mb.empty((2,2),int)
matrix([[0, 1],
[2, 3]])
>>>mb.empty((2,2),int)
matrix([[1, 2],
[3, 4]])
>>>mb.empty((2,2),int)
matrix([[0, 1],
[2, 3]])
zeros: This will return a matrix of given shape and type, filled with zeros.
102
Page
Example:
>>>mb.zeros((3,3))
matrix([[0., 0., 0.],
        [0., 0., 0.],
        [0., 0., 0.]])
>>>mb.zeros((3,3),int)
matrix([[0, 0, 0],
[0, 0, 0],
[0, 0, 0]])
ones: This will return a matrix of given shape and type, filled with ones.
Example:
>>>mb.ones((3,3))
matrix([[1., 1., 1.],
        [1., 1., 1.],
        [1., 1., 1.]])
>>>mb.ones((3,3),int)
matrix([[1, 1, 1],
[1, 1, 1],
[1, 1, 1]])
eye: This will return a matrix with ones on diagonal and zeros elsewhere.
Example:
>>>mb.eye(3)
matrix([[1., 0., 0.],
        [0., 1., 0.],
        [0., 0., 1.]])
>>>mb.eye(3,dtype=int)
matrix([[1, 0, 0],
[0, 1, 0],
[0, 0, 1]])
Example:
>>>mb.identity(3,dtype=int)
matrix([[1, 0, 0],
[0, 1, 0],
[0, 0, 1]])
>>>mb.identity(4,dtype=int)
matrix([[1, 0, 0, 0],
[0, 1, 0, 0],
[0, 0, 1, 0],
[0, 0, 0, 1]])
rand: This will return a matrix of random values with given shape.
Example:
>>>mb.rand(2,3)
>>>mb.rand((2,3),int)
dot: This will return the dot product of two arrays.
Example:
>>>np.dot(5,4)
20
>>> a=[[1,2],[3,4]]
>>> b=[[1,1],[1,1]]
>>>np.dot(a,b)
array([[3, 3],
[7, 7]])
vdot: This will return the dot product of two vectors.
Example:
>>>np.vdot(5,4)
20
>>> a=[[1,2],[3,4]]
>>> b=[[1,1],[1,1]]
>>>np.vdot(a,b)
10
Example:
>>> a=[[1,2,3],[0,1,1]]
>>> b=[[1,2,3],[0,0,1]]
>>>np.inner(a,b)
array([[14, 3],
[ 5, 1]])
Example:
>>> a=[[1,2],[3,4]]
>>> b=[[1,1],[1,1]]
>>>np.outer(a,b)
array([[1, 1, 1, 1],
[2, 2, 2, 2],
[3, 3, 3, 3],
[4, 4, 4, 4]])
Example 1:
>>> a=[[1,2],[3,4]]
>>> b=[[1,1],[1,1]]
>>>np.matmul(a,b)
array([[3, 3],
[7, 7]])
Example 2:
>>> a=[[1,2],[3,4]]
>>> b=[1,1]
>>>np.matmul(a,b)
array([3, 7])
>>>np.matmul(b,a)
array([4, 6])
tensordot: This will return the computed tensor dot product along specified axes for
arrays >= 1-D.
Example (the arrays a and b are illustrative; their shapes were chosen so that contracting
axis 1 of each gives the (1, 3, 3, 1) result shown):
>>> import numpy as np
>>> a=np.arange(6).reshape(1,2,3)
>>> b=np.arange(6).reshape(3,2,1)
>>>np.tensordot(a,b,axes=((1),(1))).shape
(1, 3, 3, 1)
What is Pandas?
Pandas is an open-source library used for data manipulation and analysis in
Python. The Pandas library is built on top of Numpy, meaning Pandas needs Numpy to
operate. Pandas provides an easy way to create, manipulate and wrangle data.
Pandas is also an elegant solution for time series data.
In a nutshell, Pandas is a useful library in data analysis. It can be used to perform data
manipulation and analysis. Pandas provides powerful and easy-to-use data structures,
as well as the means to quickly perform operations on these structures.
To install Pandas from a Jupyter notebook with conda:
import sys
!conda install --yes --prefix {sys.prefix} pandas
What is a data frame?
A data frame is a two-dimensional array with labeled axes (rows and columns). A data
frame is a standard way to store data.
The data frame is well known to statisticians and other data practitioners. A data frame is
tabular data, with rows to store the information and columns to name the information.
For instance, price can be the name of a column and 2, 3, 4 the price values.
What is a Series?
A series is a one-dimensional data structure. It can have any data structure like integer,
float, and string. It is useful when you want to perform computation or return a one-
dimensional array. A series, by definition, cannot have multiple columns. For the latter
case, please use the data frame structure.
You can add an index with the index argument. It helps to name the rows. The length of the
index should be equal to the size of the column.
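A minimal sketch of a Series with a named index:
import pandas as pd

# the index labels the rows; its length must match the data
s = pd.Series([1.0, 2.0, 3.0], index=['a', 'b', 'c'])
print(s)
# a    1.0
# b    2.0
# c    3.0
# dtype: float64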
Below, you create a Pandas series with a missing value for the third rows. Note, missing
values in Python are noted "NaN." You can use numpy to create missing value: np.nan
artificially
pd.Series([1,2,np.nan])
Output
0 1.0
1 2.0
2 NaN
dtype: float64
You can convert a numpy array to a pandas data frame with pd.DataFrame(). The
opposite is also possible. To convert a pandas data frame to an array, you can use
np.array().
## Numpy to pandas
import numpy as np
import pandas as pd
h = [[1,2],[3,4]]
df_h = pd.DataFrame(h)
print('Data Frame:', df_h)
## Pandas to numpy
df_h_n = np.array(df_h)
print('Numpy array:', df_h_n)
Data Frame: 0 1
0 1 2
1 3 4
Numpy array: [[1 2]
[3 4]]
You can also create a data frame from a dictionary of lists:
dic = {'Name': ['John', 'Smith'], 'Age': [30, 40]}
pd.DataFrame(data=dic)
Age Name
0 30 John
1 40 Smith
Range Data
Pandas has a convenient API to create a range of dates:
pd.date_range(date, period, frequency)
The first parameter is the starting date
The second parameter is the number of periods (optional if the end date is
specified)
The last parameter is the frequency: day: 'D,' month: 'M' and year: 'Y.'
## Create date
# Days
dates_d = pd.date_range('20300101', periods=6, freq='D')
print('Day:', dates_d)
Output
Day: DatetimeIndex(['2030-01-01', '2030-01-02', '2030-01-03', '2030-01-04',
               '2030-01-05', '2030-01-06'],
              dtype='datetime64[ns]', freq='D')
# Months
dates_m = pd.date_range('20300101', periods=6, freq='M')
print('Month:', dates_m)
Output
Month: DatetimeIndex(['2030-01-31', '2030-02-28', '2030-03-31', '2030-04-30',
               '2030-05-31', '2030-06-30'],
              dtype='datetime64[ns]', freq='M')
Inspecting data
You can check the head or tail of the dataset with head(), or tail() preceded by the
name of the panda's data frame
Step 1) Create a random sequence with numpy. The sequence has 4 columns and 6
rows.
random = np.random.randn(6,4)
Step 2) Use dates_m as an index for the data frame. It means each row will be given a "name"
or an index, corresponding to a date.
Step 3) Finally, you give a name to the 4 columns with the argument columns.
df = pd.DataFrame(random, index=dates_m, columns=list('ABCD'))
Step 4) Check the first rows with head() and the last rows with tail():
df.head(3)
A B C D
df.tail(3)
A B C D
Step 5) An excellent practice to get a clue about the data is to use describe(). It
provides the count, mean, std, min, max and percentiles of the dataset.
df.describe()
A B C D
Slice data
The last point of this tutorial is about how to slice a pandas data frame.
You can use the column name to extract data in a particular column.
## Slice
### Using name
df['A']
2030-01-31 -0.168655
2030-02-28 0.689585
2030-03-31 0.767534
2030-04-30 0.557299
2030-05-31 -1.547836
2030-06-30 0.511551
Freq: M, Name: A, dtype: float64
To select multiple columns, you need to use two pairs of brackets, [[..,..]].
The first pair of brackets means you want to select columns, the second pair
tells what columns you want to return.
df[['A', 'B']]
A B
The loc function is used to select columns by names. As usual, the values before the
comma stand for the rows and the values after refer to the columns. You need to use the
brackets to select more than one column.
## Multi col
df.loc[:,['A','B']]
A B
There is another method to select multiple rows and columns in Pandas. You can use
iloc[]. This method uses the index instead of the columns name. The code below returns
the same data frame as above
df.iloc[:, :2]
A B
Drop a column
df.drop(columns=['A', 'C'])
B D
Concatenation
You can concatenate two DataFrame in Pandas. You can use pd.concat()
First of all, you need to create two DataFrames. So far so good, you are already familiar
with data frame creation.
import numpy as np
import pandas as pd
df1 = pd.DataFrame({'name': ['John', 'Smith','Paul'],
'Age': ['25', '30', '50']},
index=[0, 1, 2])
df2 = pd.DataFrame({'name': ['Adam', 'Smith' ],
'Age': ['26', '11']},
index=[3, 4])
df_concat = pd.concat([df1,df2])
df_concat
Age name
0 25 John
1 30 Smith
2 50 Paul
3 26 Adam
4 11 Smith
Drop_duplicates
df_concat.drop_duplicates('name')
Age name
0 25 John
1 30 Smith
2 50 Paul
3 26 Adam
Sort values
df_concat.sort_values('Age')
Age name
4 11 Smith
0 25 John
3 26 Adam
1 30 Smith
2 50 Paul
You can use rename to rename a column in Pandas. In the dictionary you pass, the key is the
current column name and the value is the new column name, as sketched below.
0 25 John
1 30 Smith
2 50 Paul
3 26 Adam
4 11 Smith
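The rename call itself does not appear in the notes; a hedged sketch using the df_concat frame built above (the new column name first_name is only an illustration):
df_renamed = df_concat.rename(columns={'name': 'first_name'})
print(df_renamed.columns.tolist())
# the 'name' column is now called 'first_name'; 'Age' is unchanged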
Import CSV
During the TensorFlow tutorial, you will use the adult dataset. It is often used for
classification tasks. It is available at this URL: https://archive.ics.uci.edu/ml/machine-
learning-databases/adult/adult.data The data is stored in a CSV format. This dataset
includes eight categorical variables:
workclass
education
marital
occupation
relationship
race
sex
native_country
This dataset also includes six continuous variables:
age
fnlwgt
education_num
capital_gain
capital_loss
hours_week
To import a CSV dataset, you can use the object pd.read_csv(). The basic arguments
are:
Syntax:
pandas.read_csv(filepath_or_buffer, sep=',', names=None, index_col=None, skipinitialspace=False)
## Import csv
import pandas as pd
## Define path data
COLUMNS = ['age','workclass', 'fnlwgt', 'education', 'education_num', 'marital',
'occupation', 'relationship', 'race', 'sex', 'capital_gain', 'capital_loss',
'hours_week', 'native_country', 'label']
PATH = "https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data"
df_train = pd.read_csv(PATH,
skipinitialspace=True,
names = COLUMNS,
index_col=False)
df_train.shape
Output:(32561, 15)
Groupby
An easy way to see the data is to use the groupby method. This method can help you
to summarize the data by group. Below is a list of methods available with groupby:
count: count
min: min
max: max
mean: mean
median: median
standard deviation: std
etc
Inside groupby(), you can give the column you want to apply the method on.
Let's have a look at a single grouping with the adult dataset. You will get the mean of
all the continuous variables by type of revenue, i.e., above 50k or below 50k
df_train.groupby(['label']).mean()
age fnlwgt education_num capital_gain capital_loss hours_week
label
df_train.groupby(['label'])['age'].min()
label
<=50K 17
>50K 19
Name: age, dtype: int64
You can also group by multiple columns. For instance, you can get the maximum
capital gain according to the household type and marital status.
df_train.groupby(['label', 'marital'])['capital_gain'].max()
label marital
<=50K Divorced 34095
Married-AF-spouse 2653
Married-civ-spouse 41310
Married-spouse-absent 6849
Never-married 34095
Separated 7443
Widowed 6849
>50K Divorced 99999
Married-AF-spouse 7298
Married-civ-spouse 99999
Married-spouse-absent 99999
Never-married 99999
Separated 99999
Widowed 99999
Name: capital_gain, dtype: int64
You can create a plot following groupby. One way to do it is to use a plot after the
grouping.
To create a clearer plot, you can use unstack() after mean() so that the multilevel index is
spread into columns, with the values split by revenue lower than 50k and above
50k. In this case, the plot will have two groups instead of 14 (2*7).
If you use Jupyter Notebook, make sure to add %matplotlib inline, otherwise no plot
will be displayed.
%matplotlib inline
df_plot = df_train.groupby(['label', 'marital'])['capital_gain'].mean().unstack()
df_plot
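The notes stop before showing the plot call itself; a minimal sketch using the df_plot frame built above:
import matplotlib.pyplot as plt

df_plot.plot(kind='bar')  # one bar per marital status, one color per revenue group
plt.show()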
Summary
Below is a summary of the most useful methods for data science with Pandas:
describe: describe
Why Seaborn?
Seaborn offers a variety of functionality which makes it useful and easier than other
frameworks. Some of these functionalities are:
A function to plot statistical time series data with flexible estimation and
representation of uncertainty around the estimate
Functions for visualizing univariate and bivariate distributions or for comparing
them between subsets of data
Functions that visualize matrices of data and use clustering algorithms to discover
structure in those matrices
High-level abstractions for structuring grids of plots that let you easily build
complex visualizations
Several built-in themes for styling matplotlib graphics
Tools for choosing color palettes to make beautiful plots that reveal patterns in
your data
Tools that fit and visualize linear regression models for different kinds of
independent and dependent variables
Install Seaborn
Seaborn assumes you have a running Python 2.7 or above platform with NumPy (1.8.2
and above), SciPy (0.13.3 and above) and pandas packages installed on the device.
Once we have these Python packages installed, we can proceed with the installation.
For pip installation, run the following command in the terminal:
pip install seaborn
Once you are done with the installation, you can use seaborn easily in your Python
code by importing it:
import seaborn
Controlling figure aesthetics
%matplotlib inline
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
np.random.seed(sum(map(ord, "aesthetics")))
sns.set()
sinplot()
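The sinplot() helper called above is never defined in these notes; a definition consistent with the seaborn documentation's aesthetics tutorial is sketched below (it simply draws a few offset sine curves to show off the current style):
def sinplot(flip=1):
    x = np.linspace(0, 14, 100)
    for i in range(1, 7):
        plt.plot(x, np.sin(x + i * .5) * (7 - i) * flip)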
Seaborn provides five preset themes: whitegrid, darkgrid, white, dark, and ticks, each
suited to different applications and personal preferences.
darkgrid is the default one. The whitegrid theme is similar but better suited to plots with
heavy data elements. To switch to whitegrid:
sns.set_style("whitegrid")
data = np.random.normal(size=(20, 6)) + np.arange(6) / 2
sns.boxplot(data=data)
The output will be:
For many plots, the grid is less necessary. Remove it by adding this code snippet:
sns.set_style("dark")
sinplot()
sns.set_style("white")
sinplot()
This time, the background looks like:
Sometimes you might want to give a little extra structure to the plots, which is where
ticks come in handy:
sns.set_style("ticks")
sinplot()
The plot looks like:
Both the white and ticks styles can benefit from removing the top and right axes spines,
which are not needed. You can do that with despine():
sinplot()
sns.despine()
Some plots benefit from offsetting the spines away from the data. When the ticks don‘t
cover the whole range of the axis, the trim parameter will limit the range of the surviving
spines:
f, ax = plt.subplots()
sns.violinplot(data=data)
sns.despine(offset=10, trim=True)
You can also control which spines are removed with additional arguments to despine:
sns.set_style("whitegrid")
sns.boxplot(data=data, palette="deep")
sns.despine(left=True)
The plot looks like:
axes_style() comes in handy when you need to set the figure style temporarily:
with sns.axes_style("darkgrid"):
plt.subplot(211)
sinplot()
plt.subplot(212)
sinplot(-1)
Note: Only the parameters that are part of the style definition through this method can
be overridden. For other purposes, you should use set() as it takes all the parameters.
In case you want to see what parameters are included, just call the function without
any arguments, an object is returned:
sns.axes_style()
{'axes.axisbelow': True,
'axes.edgecolor': '.8',
'axes.facecolor': 'white',
'axes.grid': True,
'axes.labelcolor': '.15',
'axes.linewidth': 1.0,
'figure.facecolor': 'white',
'font.family': [u'sans-serif'],
'font.sans-serif': [u'Arial',
u'DejaVu Sans',
u'Liberation Sans',
u'Bitstream Vera Sans',
u'sans-serif'],
'grid.color': '.8',
'grid.linestyle': u'-',
'image.cmap': u'rocket',
'legend.frameon': False,
'legend.numpoints': 1,
'legend.scatterpoints': 1,
'lines.solid_capstyle': u'round',
'text.color': '.15',
'xtick.color': '.15',
'xtick.direction': u'out',
'xtick.major.size': 0.0,
'xtick.minor.size': 0.0,
'ytick.color': '.15',
'ytick.direction': u'out',
'ytick.major.size': 0.0,
'ytick.minor.size': 0.0}
You can then set different versions of these parameters:
Let‘s try to manipulate scale of the plot. We can reset the default parameters by calling
set():
sns.set()
The four preset contexts are – paper, notebook, talk and poster. The notebook style is
the default, and was used in the plots above:
sns.set_context("paper")
sinplot()
The plot looks like:
sns.set_context("talk")
sinplot()
What is Matplotlib
Installing Matplotlib
To install Matplotlib on your local machine, open the Python command prompt and
type the following commands. The simplest route is to install the Anaconda distribution:
it installs Python, Jupyter Notebook and other important Python libraries including
Matplotlib, NumPy, Pandas and scikit-learn. Anaconda supports Windows, MacOS
and Linux. To quickly get started with Matplotlib without installing anything on
your local machine, check out Google Colab. It provides the Jupyter Notebooks
hosted on the cloud for free which are associated with your Google Drive
account and it comes with all the important packages pre-installed. You can
also run your code on GPU which helps in faster computation though we don‘t
need GPU computation for this tutorial.
General Concepts
Figure: It is a whole figure which may contain one or more than one axes (plots).
You can think of a Figure as a canvas which contains plots.
Axes: It is what we generally think of as a plot. A Figure can contain many Axes. It
contains two or three (in the case of 3D) Axis objects. Each Axes has a title, an x-
label and a y-label.
Axis: They are the number line like objects and take care of generating the
graph limits.
Artist: Everything which one can see on the figure is an artist, like Text objects,
Line2D objects and collection objects. Most Artists are tied to Axes.
import matplotlib.pyplot as plt
import numpy as np
We pass two arrays as our input arguments to Pyplot‘s plot() method and use
show() method to invoke the required plot. Here note that the first array appears
on the x-axis and second array appears on the y-axis of the plot. Now that our
first plot is ready, let us add the title, and name x-axis and y-axis using methods
title(), xlabel() and ylabel() respectively.
We can also specify the size of the figure using method figure() and passing the
values as a tuple of the length of rows and columns to the argument figsize
With every X and Y argument, you can also pass an optional third argument in
the form of a string which indicates the colour and line type of the plot. The
default format is b- which means a solid blue line. In the figure below we use go
which means green circles. Likewise, we can make many such combinations to
format our plot.
We can also plot multiple sets of data by passing in multiple sets of arguments of
X and Y axis in the plot() method as shown.
We can use the subplot() method to add more than one plot in one figure. In the
image below, we used this method to separate the two graphs which we plotted on
the same axes in the previous example. The subplot() method takes three
arguments: nrows, ncols and index. They indicate the number of rows,
number of columns and the index number of the sub-plot. For instance, in our
example, we want to create two sub-plots in one figure such that they come in
one row and two columns, and hence we pass the arguments (1,2,1) and (1,2,2)
to the subplot() method. Note that we have separately used the title() method
for both the subplots. We use suptitle() method to make a centralized title for the
figure.
If we want our sub-plots in two rows and single column, we can pass arguments
(2,1,1) and (2,1,2)
The above way of creating subplots becomes a bit tedious when we want many
subplots in our figure. A more convenient way is to use the subplots() method. Notice
the extra 's' in the name. This method takes two arguments, nrows and ncols, as the
number of rows and number of columns respectively. It creates two objects, figure
and axes, which we store in the variables fig and ax; these can be used to change the
figure-level and axes-level attributes respectively. Note that these variable names are
chosen arbitrarily.
1) Bar Graphs
Bar graphs are one of the most common types of graphs and are used to show
data associated with categorical variables. Pyplot provides a method bar()
to make bar graphs, which takes as arguments the categorical variables, their values
and a color (if you want to specify one).
To make horizontal bar graphs use the method barh(). We can also pass an
argument xerr or yerr (in the case of the above vertical bar graphs) to
depict the variance in our data as follows:
To create horizontally stacked bar graphs we use the bar() method twice and
pass the arguments where we mention the index and width of our bar graphs in
order to horizontally stack them together. Also, notice the use of two other
methods legend() which is used to show the legend of the graph and xticks() to
label our x-axis based on the position of our bars.
Similarly, to vertically stack the bar graphs together, we can use an argument
bottom and mention the bar graph which we want to stack below as its value.
2) Pie Charts
One more basic type of chart is a Pie chart which can be made using the
method pie() We can also pass in arguments to customize our Pie chart to show
shadow, explode a part of it, tilt it at an angle as follows:
3) Histogram
Histograms are a very common type of plot when we are looking at data like
height and weight, stock prices, waiting time for a customer, etc., which are
continuous in nature. Histogram data is plotted within a range against its
frequency, and histograms form the basis for various distributions like the normal
distribution, t-distribution, etc. In the following example, we generate random
continuous data of 1000 entries and plot it against its frequency with the data divided
into 10 equal strata. We have used NumPy's random.randn() method which
generates data with the properties of a standard normal distribution, i.e. mean =
0 and standard deviation = 1, and hence the histogram looks like a normal
distribution curve.
Scatter plots are widely used graphs, especially they come in handy in visualizing
a problem of regression. In the following example, we feed in arbitrarily created
data of height and weight and plot them against each other. We used xlim()
and ylim() methods to set the limits of X-axis and Y-axis respectively.
The above scatter can also be visualized in three dimensions. To use this
functionality, we first import the module mplot3d as follows:
We can also create 3-D graphs of other types like line graph, surface, wireframes,
contours, etc. The above example in the form of a simple line graph is as follows:
Here instead of scatter3D() we use method plot3D()
Summary
This chapter covered the basic Matplotlib plot types and the methods used to create
them.
The core of R is an interpreted computer language which allows branching and looping
as well as modular programming using functions. R allows integration with the
procedures written in the C, C++, .Net, Python or FORTRAN languages for efficiency.
R is freely available under the GNU General Public License, and pre-compiled binary
versions are provided for various operating systems like Linux, Windows and Mac.
R is free software distributed under a GNU-style copyleft licence, and an official part of
the GNU project called GNU S.
Evolution of R
R was initially written by Ross Ihaka and Robert Gentleman at the Department of
Statistics of the University of Auckland in Auckland, New Zealand. R made its first
appearance in 1993.
Features of R
R is one of the world's most widely used statistics programming languages. It is a top
choice of data scientists and is supported by a vibrant and talented community of
contributors. R is taught in universities and deployed in mission-critical business
applications. This chapter will teach you R programming along with suitable examples.
R - ENVIRONMENT SETUP
If you are still willing to set up your environment for R, you can follow the steps given
below.
Windows Installation
You can download the Windows installer version of R from R-3.2.2 for
Windows (32/64 bit) and save it in a local directory.
As it is a Windows installer (.exe) with a name "R-version-win.exe", you can just double-
click and run the installer, accepting the default settings. If your Windows is the 32-bit
version, it installs the 32-bit version. But if your Windows is 64-bit, then it installs both the
32-bit and 64-bit versions.
After installation you can locate the icon to run the Program in a directory structure
"R\R3.2.2\bin\i386\Rgui.exe" under the Windows Program Files. Clicking this icon brings
up the R-GUI which is the R console to do R Programming.
Linux Installation
The instructions to install R on Linux vary from flavor to flavor. These steps are mentioned
under each type of Linux version in the mentioned link. However, if you are in a hurry,
then you can use the yum command to install R as follows −
$ yum install R
The above command will install the core functionality of R programming along with
standard packages. If you still need an additional package, you can launch the R
prompt as follows −
$R
R version 3.2.0 (2015-04-16) -- "Full of Ingredients"
Copyright (C) 2015 The R Foundation for Statistical Computing
Platform: x86_64-redhat-linux-gnu (64-bit)
Now you can use install command at R prompt to install the required package. For
example, the following command will install plotrix package which is required for 3D
charts.
> install.packages("plotrix")
R - BASIC SYNTAX
R Command Prompt
Once you have R environment setup, then it’s easy to start your R command prompt by
just typing the following command at your command prompt −
$R
This will launch R interpreter and you will get a prompt > where you can start typing your
program as follows −
Here first statement defines a string variable myString, where we assign a string "Hello,
World!" and then next statement print is being used to print the value stored in variable
myString.
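A minimal sketch of the two statements described above:
myString <- "Hello, World!"
print(myString)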
R Script File
Usually, you will do your programming by writing your programs in script files and then
you execute those scripts at your command prompt with the help of the R interpreter
called Rscript. So let's start by writing the following code in a text file called test.R as
under −
myString <- "Hello, World!"
print(myString)
Save the above code in a file test.R and execute it at Linux command prompt as given
below. Even if you are using Windows or other system, syntax will remain same.
$ Rscript test.R
Comments
Comments are like helping text in your R program and they are ignored by the
interpreter while executing your actual program. A single comment is written using # at
the beginning of the statement, for example:
# My first comment in R
R does not support multi-line comments but you can perform a trick which is something
as follows −
if(FALSE){
"This is a demo for multi-line comments and it should be put inside either a
single OR double quote"
}
Though the above "comment" will be evaluated by the R interpreter, it will not interfere
with your actual program. You should put such comments inside either single or double
quotes.
R - DATA TYPES
Generally, while doing programming in any programming language, you need to use
various variables to store various information. Variables are nothing but reserved
memory locations to store values. This means that, when you create a variable you
reserve some space in memory.
You may like to store information of various data types like character, wide character,
integer, floating point, double floating point, Boolean etc. Based on the data type of a
variable, the operating system allocates memory and decides what can be stored in
the reserved memory.
In contrast to other programming languages like C and Java, in R the variables are not
declared as some data type. The variables are assigned with R-Objects and the data
type of the R-object becomes the data type of the variable. There are many types of R-
objects. The frequently used ones are −
Vectors
Lists
Matrices
Arrays
Factors
Data Frames
The simplest of these objects is the vector object and there are six data types of these
atomic vectors, also termed as six classes of vectors. The other R-Objects are built upon
the atomic vectors.
Logical (TRUE, FALSE):
v <- TRUE
print(class(v))
[1] "logical"
Numeric (12.3, 5, 999):
v <- 23.5
print(class(v))
[1] "numeric"
Integer (2L, 34L, 0L):
v <- 2L
print(class(v))
[1] "integer"
Complex (3+2i):
v <- 2+5i
print(class(v))
[1] "complex"
Character ('a', "good", "TRUE", '23.4'):
v <- "TRUE"
print(class(v))
[1] "character"
Raw ("Hello" is stored as 48 65 6c 6c 6f):
v <- charToRaw("Hello")
print(class(v))
[1] "raw"
In R programming, the very basic data types are the R-objects called vectors which
hold elements of different classes as shown above. Please note in R the number of
classes is not confined to only the above six types. For example, we can use many
atomic vectors and create an array whose class will become array.
Vectors
When you want to create vector with more than one element, you should
use c function which means to combine the elements into a vector.
# Create a vector.
apple <- c('red','green',"yellow")
print(apple)
Lists
A list is an R-object which can contain many different types of elements inside it like
vectors, functions and even another list inside it.
# Create a list.
list1 <- list(c(2,5,3),21.3,sin)
[[1]]
[1] 2 5 3
[[2]]
[1] 21.3
[[3]]
function (x) .Primitive("sin")
Matrices
# Create a matrix.
M = matrix( c('a','a','b','c','b','a'), nrow =2, ncol =3, byrow = TRUE)
print(M)
Arrays
While matrices are confined to two dimensions, arrays can be of any number of
dimensions. The array function takes a dim attribute which creates the required number
of dimension. In the below example we create an array with two elements which are
3x3 matrices each.
# Create an array.
a <- array(c('green','yellow'),dim = c(3,3,2))
print(a)
,,1
,,2
Factors
Factors are the r-objects which are created using a vector. It stores the vector along
with the distinct values of the elements in the vector as labels. The labels are always
character irrespective of whether it is numeric or character or Boolean etc. in the input
vector. They are useful in statistical modeling.
Factors are created using the factor function. The nlevels functions gives the count of
levels.
# Create a vector.
apple_colors <- c('green','green','yellow','red','red','red','green')
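The factor call itself is not reproduced in these notes; a minimal sketch completing the example might be:
# Apply the factor function and count the distinct levels.
factor_apple <- factor(apple_colors)
print(factor_apple)
print(nlevels(factor_apple))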
Data Frames
Data frames are tabular data objects. Unlike a matrix, in a data frame each column can
contain different modes of data. The first column can be numeric while the second
column can be character and the third column can be logical. It is a list of vectors of
equal length.
A variable provides us with named storage that our programs can manipulate. A
variable in R can store an atomic vector, group of atomic vectors or a combination of
many Robjects. A valid variable name consists of letters, numbers and the dot or
underline characters. The variable name starts with a letter or the dot not followed by a
number.
var_name%            invalid: contains the character %, which is not allowed
.var_name, var.name  valid: a name can start with a dot, but the dot should not be followed by a number
.2var_name           invalid: the starting dot is followed by a number
_var_name            invalid: starts with _, which is not allowed
Variable Assignment
The variables can be assigned values using leftward, rightward and equal to operator.
The values of the variables can be printed using print or cat function. The cat function
combines multiple items into a continuous print output.
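The assignments themselves are not reproduced here; a plausible reconstruction consistent with the output below, showing the equals, leftward and rightward operators, is:
# Assignment using the equal operator.
var.1 = c(0,1,2,3)
# Assignment using the leftward operator.
var.2 <- c("learn","R")
# Assignment using the rightward operator.
c(TRUE,1) -> var.3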
print(var.1)
cat ("var.1 is ",var.1,"\n")
cat ("var.2 is ",var.2,"\n")
cat ("var.3 is ",var.3,"\n")
[1] 0 1 2 3
var.1 is 0 1 2 3
var.2 is learn R
var.3 is 1 1
Note − The vector c(TRUE,1) has a mix of logical and numeric classes, so the logical
class is coerced to numeric, making TRUE equal to 1.
In R, a variable itself is not declared as any data type; rather it gets the data type of the
R-object assigned to it. So R is called a dynamically typed language, which means
that we can change a variable's data type again and again when using it in a
program.
var_x <-"Hello"
cat("The class of var_x is ",class(var_x),"\n")
var_x <-34.5
cat(" Now the class of var_x is ",class(var_x),"\n")
var_x <-27L
cat(" Next the class of var_x becomes ",class(var_x),"\n")
Finding Variables
To know all the variables currently available in the workspace we use the ls function.
The ls function can also use patterns to match the variable names.
print(ls())
The variables starting with a dot (.) are hidden; they can be listed using the
"all.names = TRUE" argument to the ls function.
print(ls(all.name = TRUE))
Deleting Variables
Variables can be deleted by using the rm function. Below we delete the variable var.3.
On printing the value of the variable error is thrown.
rm(var.3)
print(var.3)
[1] "var.3"
Error in print(var.3) : object 'var.3' not found
All the variables can be deleted by using the rm and ls function together.
rm(list = ls())
print(ls())
character(0)
R - OPERATORS
Types of Operators
Arithmetic Operators
Relational Operators
Logical Operators
Assignment Operators
Miscellaneous Operators
Arithmetic Operators
Following table shows the arithmetic operators supported by R language. The operators
act on each element of the vector.
v <- c(2,5.5,6)
t <- c(8,3,4)
+   adds the two vectors: print(v+t)
-   subtracts the second vector from the first: print(v-t)
%/% gives the quotient of dividing the first vector by the second: print(v%/%t)
[1] 0 1 1
^   raises the first vector to the exponent of the second vector: print(v^t)
Relational Operators
Following table shows the relational operators supported by R language. Each element
of the first vector is compared with the corresponding element of the second vector.
The result of comparison is a Boolean value.
With v <- c(2,5.5,6,9) and t <- c(8,2.5,14,9):
>  checks if each element of the first vector is greater than the corresponding element of the second vector: print(v > t)
<  checks if each element of the first vector is less than the corresponding element of the second vector: print(v < t)
== checks if each element of the first vector is equal to the corresponding element of the second vector: print(v == t)
<= checks if each element of the first vector is less than or equal to the corresponding element of the second vector: print(v <= t)
>= checks if each element of the first vector is greater than or equal to the corresponding element of the second vector: print(v >= t)
!= checks if each element of the first vector is unequal to the corresponding element of the second vector: print(v != t)
Logical Operators
Each element of the first vector is compared with the corresponding element of the
second vector. The result of comparison is a Boolean value.
&  Element-wise Logical AND operator. It combines each element of the first vector with the corresponding element of the second vector and gives TRUE if both elements are TRUE:
v <- c(3,1,TRUE,2+3i)
t <- c(4,1,FALSE,2+3i)
print(v&t)
[1] TRUE TRUE FALSE TRUE
|  Element-wise Logical OR operator. It combines each element of the first vector with the corresponding element of the second vector and gives TRUE if one of the elements is TRUE:
v <- c(3,0,TRUE,2+2i)
t <- c(4,0,FALSE,2+3i)
print(v|t)
!  Logical NOT operator. It takes each element of the vector and gives the opposite logical value:
v <- c(3,0,TRUE,2+2i)
print(!v)
The logical operators && and || consider only the first element of the vectors and give a vector of a single element as output.
&& Logical AND operator. Takes the first element of both vectors and gives TRUE only if both are TRUE:
v <- c(3,0,TRUE,2+2i)
t <- c(1,3,TRUE,2+3i)
print(v&&t)
[1] TRUE
|| Logical OR operator. Takes the first element of both vectors and gives TRUE if one of them is TRUE:
v <- c(0,0,TRUE,2+2i)
t <- c(0,3,TRUE,2+3i)
print(v||t)
[1] FALSE
Assignment Operators
These operators assign values to vectors: the leftward operators <-, = and <<- assign to
the variable on the left, and the rightward operators -> and ->> assign to the variable
on the right. For example:
v1 <- c(3,1,TRUE,2+3i)
v2 <<- c(3,1,TRUE,2+3i)
v3 = c(3,1,TRUE,2+3i)
print(v1)
print(v2)
print(v3)
it produces the following result −
[1] 3+0i 1+0i 1+0i 2+3i
[1] 3+0i 1+0i 1+0i 2+3i
[1] 3+0i 1+0i 1+0i 2+3i
Miscellaneous Operators
These operators are used for specific purposes and not for general mathematical or
logical computation.
:    Colon operator. It creates a series of numbers in sequence for a vector:
v <- 2:8
print(v)
[1] 2 3 4 5 6 7 8
%in% This operator is used to identify if an element belongs to a vector; it produces
[1] TRUE
for an element that is in the vector and
[1] FALSE
for one that is not.
%*%  This operator is used to multiply a matrix with its transpose:
M = matrix( c(2,6,5,1,10,4), nrow =2,ncol =3,byrow = TRUE)
t = M %*% t(M)
print(t)
     [,1] [,2]
[1,]   65   82
[2,]   82  117
R - DECISION MAKING
Decision making structures require the programmer to specify one or more conditions to
be evaluated or tested by the program.
Following is the general form of a typical decision making structure found in most of the
programming languages −
R provides the following types of decision making statements:
1 if statement
2 if...else statement
3 switch statement, which allows a variable to be tested for equality against a list of
values.
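A minimal sketch of an if...else statement (the values are illustrative):
x <- 30L
if (x %% 2 == 0) {
   print("x is an even number")
} else {
   print("x is an odd number")
}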
There may be a situation when you need to execute a block of code several times. In
general, statements are executed sequentially: the first statement in a function is
executed first, followed by the second, and so on.
Programming languages provide various control structures that allow for more
complicated execution paths.
1 repeat loop − like a while statement, except that it tests the condition at the end of
the loop body. R also provides while and for loops.
Loop control statements change execution from its normal sequence. When execution
leaves a scope, all automatic objects that were created in that scope are destroyed.
R supports the following loop control statements: the break statement, which
terminates the loop, and the next statement, which skips to the next iteration.
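A minimal sketch of a loop using the break statement (the values are illustrative):
i <- 1
while (TRUE) {
   print(i)
   i <- i + 1
   if (i > 5) break   # leave the loop once i exceeds 5
}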
R - FUNCTIONS
The function in turn performs its task and returns control to the interpreter as well as any
result which may be stored in other objects.
Function Definition
Function Components
R has many in-built functions which can be directly called in the program without
defining them first. We can also create and use our own functions referred as user
defined functions.
Built-in Function
Simple examples of in-built functions are seq(), mean(), max(), sum(x) and paste(...),
etc. They are directly called by user-written programs. You can refer to the most widely
used R functions.
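The calls producing the output below are not reproduced here; a plausible reconstruction is:
# Create a sequence of numbers from 32 to 44.
print(seq(32, 44))
# Find the mean of the numbers from 25 to 82.
print(mean(25:82))
# Find the sum of the numbers from 41 to 68.
print(sum(41:68))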
[1] 32 33 34 35 36 37 38 39 40 41 42 43 44
[1] 53.5
[1] 1526
User-defined Function
We can create user-defined functions in R. They are specific to what a user wants and
once created they can be used like the built-in functions. Below is an example of how a
function is created and used.
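The definition is not reproduced in these notes; a plausible sketch consistent with the squares printed below is:
# Create a function that prints the squares of numbers in a sequence.
new.function <- function(a) {
   for (i in 1:a) {
      b <- i^2
      print(b)
   }
}
# Call the function, supplying 6 as an argument.
new.function(6)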
Calling a Function
[1] 1
[1] 4
[1] 9
[1] 16
[1] 25
[1] 36
[1] 1
[1] 4
[1] 9
[1] 16
[1] 25
Calling a Function with Argument Values (by position and by name)
The arguments to a function call can be supplied in the same sequence as defined in
the function or they can be supplied in a different sequence but assigned to the names
of the arguments.
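A plausible reconstruction of the calls whose output ([1] 26 and [1] 58) is shown below:
# Create a function with three arguments.
new.function <- function(a, b, c) {
   result <- a * b + c
   print(result)
}
# Call the function by position of the arguments: 5*3 + 11 = 26.
new.function(5, 3, 11)
# Call the function by names of the arguments: 11*5 + 3 = 58.
new.function(a = 11, b = 5, c = 3)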
[1] 26
[1] 58
We can define the value of the arguments in the function definition and call the
function without supplying any argument to get the default result. But we can also call
such functions by supplying new values of the argument and get non default result.
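A plausible reconstruction of the function with default argument values whose output ([1] 18 and [1] 45) is shown below:
new.function <- function(a = 3, b = 6) {
   result <- a * b
   print(result)
}
new.function()       # uses the defaults: 3*6 = 18
new.function(9, 5)   # supplies new values: 9*5 = 45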
[1] 18
[1] 45
Arguments to functions are evaluated lazily, which means they are evaluated only
when needed by the function body.
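A plausible reconstruction of the example whose output is shown below; the error for the missing argument b appears only when b is finally needed:
new.function <- function(a, b) {
   print(a^2)
   print(a)
   print(b)
}
# a^2 and a are printed before evaluation fails on the missing b.
new.function(6)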
[1] 36
[1] 6
Error in print(b) : argument "b" is missing, with no default
R - STRINGS
Any value written within a pair of single quote or double quotes in R is treated as a
string. Internally R stores every string within double quotes, even when you create them
with single quote.
The quotes at the beginning and end of a string should be both double quotes
or both single quote. They cannot be mixed.
Double quotes can be inserted into a string starting and ending with single
quote.
Single quote can be inserted into a string starting and ending with double
quotes.
Double quotes cannot be inserted into a string starting and ending with double
quotes.
Single quote cannot be inserted into a string starting and ending with single
quote.
e <-'Mixed quotes"
30
print(e)
Page
String Manipulation
Many strings in R are combined using the paste function. It can take any number of
arguments to be combined together.
Syntax
paste(..., sep = " ", collapse = NULL)
Example
a <-"Hello"
b <-'How'
c <-"are you? "
print(paste(a,b,c))
Numbers and strings can be formatted to a specific style using format function.
Syntax
format(x, digits, nsmall, scientific, width, justify = c("left", "right", "centre", "none"))
Example
print(result)
[1] "23.1234568"
[1] "6.000000e+00" "1.314521e+01"
[1] "23.47000"
[1] "6"
[1] " 13.7"
[1] "Hello "
[1] " Hello "
Syntax
nchar(x)
Example
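A plausible reconstruction of the call whose output ([1] 30) is shown below:
# Count the number of characters in a string.
result <- nchar("Count the number of characters")
print(result)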
[1] 30
Syntax
toupper(x)
tolower(x)
Example
Syntax
substring(x,first,last)
Example
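A plausible reconstruction of the call whose output ([1] "act") is shown below:
# Extract characters 5 to 7 of the string.
result <- substring("Extract", 5, 7)
print(result)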
[1] "act"
R - VECTORS
Vectors are the most basic R data objects and there are six types of atomic vectors.
Vector Creation
Even when you write just one value in R, it becomes a vector of length 1 and belongs to
one of the above vector types.
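A plausible reconstruction of the single-element vectors whose output is shown below:
# Atomic vectors of type character, double, integer, logical, complex and raw.
print("abc")
print(12.5)
print(63L)
print(TRUE)
print(2+3i)
print(charToRaw('hello'))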
[1] "abc"
[1] 12.5
[1] 63
[1] TRUE
[1] 2+3i
[1] 68 65 6c 6c 6f
# Creating a sequence from 5 to 13 using the colon operator.
v <- 5:13
print(v)
[1] 5 6 7 8 9 10 11 12 13
# Creating a sequence from 6.6 to 12.6.
v <- 6.6:12.6
print(v)
[1] 6.6 7.6 8.6 9.6 10.6 11.6 12.6
# If the final element specified does not belong to the sequence then it is discarded.
v <- 3.8:11.4
print(v)
[1] 3.8 4.8 5.8 6.8 7.8 8.8 9.8 10.8
# Creating a sequence of decimal values with increments of 0.4 using seq().
v <- seq(5, 9, by = 0.4)
print(v)
[1] 5.0 5.4 5.8 6.2 6.6 7.0 7.4 7.8 8.2 8.6 9.0
The non-character values are coerced to character type if one of the elements is a
character.
s <- c('apple','red',5,TRUE)
print(s)
Elements of a vector are accessed using indexing. The [ ] brackets are used for
indexing. Indexing starts with position 1. Giving a negative value in the index drops that
element from the result. TRUE, FALSE or 0 and 1 can also be used for indexing.
[1] "Sun"
Vector Manipulation
Vector arithmetic
Two vectors of same length can be added, subtracted, multiplied or divided giving the
result as a vector output.
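The two vectors are not defined in these notes; a plausible reconstruction consistent with the outputs shown below is:
# Create two vectors of the same length.
v1 <- c(3, 8, 4, 5, 0, 11)
v2 <- c(4, 11, 0, 8, 1, 2)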
# Vector addition.
add.result <- v1+v2
print(add.result)
# Vector subtraction.
sub.result <- v1-v2
print(sub.result)
# Vector multiplication.
multi.result <- v1*v2
print(multi.result)
# Vector division.
divi.result <- v1/v2
print(divi.result)
[1] 7 19 4 13 1 13
[1] -1 -3 4 -3 -1 9
[1] 12 88 0 40 0 22
If we apply arithmetic operations to two vectors of unequal length, then the elements
of the shorter vector are recycled to complete the operations.
v1 <- c(3,8,4,5,0,11)
v2 <- c(4,11)
# V2 becomes c(4,11,4,11,4,11)
[1] 7 19 8 16 4 22
[1] -1 -3 0 -6 -4 0
v <- c(3,8,4,5,0,11,-9,304)
# Sort the elements of the vector.
sort.result <- sort(v)
print(sort.result)
# Sort the elements in the reverse order.
revsort.result <- sort(v, decreasing = TRUE)
print(revsort.result)
# Sorting character vectors, in ascending and in reverse alphabetical order.
v <- c("Red","Blue","yellow","violet")
sort.result <- sort(v)
print(sort.result)
revsort.result <- sort(v, decreasing = TRUE)
print(revsort.result)
[1] -9 0 3 4 5 8 11 304
[1] 304 11 8 5 4 3 0 -9
[1] "Blue" "Red" "violet" "yellow"
[1] "yellow" "violet" "Red" "Blue"
R - LISTS
Lists are the R objects which contain elements of different types like − numbers, strings,
vectors and another list inside it. A list can also contain a matrix or a function as its
elements. List is created using list function.
Creating a List
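The creation code is not reproduced here; a plausible reconstruction consistent with the output below is:
# Create a list containing strings, a vector, a logical value and numeric values.
list_data <- list("Red", "Green", c(21, 32, 11), TRUE, 51.23, 119.1)
print(list_data)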
[[1]]
[1] "Red"
[[2]]
[1] "Green"
[[3]]
[1] 21 32 11
[[4]]
[1] TRUE
[[5]]
[1] 51.23
[[6]]
[1] 119.1
The list elements can be given names and they can be accessed using these names.
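A plausible reconstruction of the named list whose output is shown below:
# Create a list containing a vector, a matrix and an inner list, then name the elements.
list_data <- list(c("Jan","Feb","Mar"),
   matrix(c(3,9,5,1,-2,8), nrow = 2),
   list("green", 12.3))
names(list_data) <- c("1st_Quarter", "A_Matrix", "A_Inner_list")
print(list_data)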
$`1st_Quarter`
[1] "Jan" "Feb" "Mar"
$A_Matrix
[,1] [,2] [,3]
[1,] 3 5 -2
[2,] 9 1 8
$A_Inner_list
$A_Inner_list[[1]]
[1] "green"
$A_Inner_list[[2]]
[1] 12.3
Elements of the list can be accessed by the index of the element in the list. In case of
named lists they can also be accessed using the names.
# Access the third element. As it is also a list, all its elements will be printed.
print(list_data[3])
$`1st_Quarter`
$A_Inner_list
$A_Inner_list[[1]]
[1] "green"
$A_Inner_list[[2]]
[1] 12.3
We can add, delete and update list elements as shown below. We can add and
delete elements only at the end of a list. But we can update any element.
print(list_data[3])
[[1]]
[1] "New element"
$<NA>
NULL
Merging Lists
You can merge many lists into one list by placing all the lists inside one list function.
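A plausible reconstruction of the merge whose output is shown below:
# Create two lists and merge them.
list1 <- list(1, 2, 3)
list2 <- list("Sun", "Mon", "Tue")
merged.list <- c(list1, list2)
print(merged.list)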
[[1]]
[1] 1
[[2]]
[1] 2
[[3]]
[1] 3
[[4]]
[1] "Sun"
[[5]]
[1] "Mon"
[[6]]
[1] "Tue"
A list can be converted to a vector so that the elements of the vector can be used for
further manipulation. All the arithmetic operations on vectors can be applied after the
list is converted into vectors. To do this conversion, we use the unlist function. It takes the
list as input and produces a vector.
# Create lists.
list1 <- list(1:5)
print(list1)
list2 <-list(10:14)
print(list2)
print(v1)
print(v2)
[[1]]
[1] 1 2 3 4 5
[[1]]
[1] 10 11 12 13 14
[1] 1 2 3 4 5
[1] 10 11 12 13 14
[1] 11 13 15 17 19
R - MATRICES
Matrices are the R objects in which the elements are arranged in a two-dimensional
rectangular layout. They contain elements of the same atomic types. Though we can
create a matrix containing only characters or only logical values, they are not of much
use. We use matrices containing numeric elements to be used in mathematical
calculations.
Syntax
matrix(data, nrow, ncol, byrow, dimnames)
data is the input vector which becomes the data elements of the matrix.
nrow is the number of rows to be created.
ncol is the number of columns to be created.
byrow is a logical clue. If TRUE then the input vector elements are arranged by
row.
dimnames is the names assigned to the rows and columns.
Example
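A minimal sketch of creating two matrices (the values are illustrative; the print(N) line below refers to the second one):
# Elements are arranged sequentially by row.
M <- matrix(c(3:14), nrow = 4, byrow = TRUE)
print(M)
# Elements are arranged sequentially by column.
N <- matrix(c(3:14), nrow = 4, byrow = FALSE)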
print(N)
Elements of a matrix can be accessed by using the column and row index of the
element. We consider a matrix P, defined below, to find the specific elements.
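A plausible definition of P consistent with the accesses shown below ([1] 5 is P[1,3], [1] 13 is P[4,2], and the two labelled printouts are the row P[2,] and the column P[,3]):
rownames <- c("row1", "row2", "row3", "row4")
colnames <- c("col1", "col2", "col3")
P <- matrix(c(3:14), nrow = 4, byrow = TRUE, dimnames = list(rownames, colnames))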
print(P[1,3])
[1] 5
[1] 13
col1 col2 col3
6 7 8
row1 row2 row3 row4
5 8 11 14
Matrix Computations
Various mathematical operations are performed on the matrices using the R operators.
The result of the operation is also a matrix.
The dimensions (number of rows and columns) should be the same for the matrices
involved in the operation.
cat("Result of multiplication","\n")
print(result)
Arrays are the R data objects which can store data in more than two dimensions. For
example, if we create an array of dimension (2, 3, 4), then it creates 4 rectangular
matrices each with 2 rows and 3 columns. Arrays can store only one data type.
An array is created using the array function. It takes vectors as input and uses the
values in the dim parameter to create an array.
Example
The following example creates an array of two 3x3 matrices each with 3 rows and 3
columns.
,,1
,,2
We can give names to the rows, columns and matrices in the array by using
the dimnames parameter.
print(result)
, , Matrix1
, , Matrix2
# Print the element in the 1st row and 3rd column of the 1st matrix.
print(result[1,3,1])
We can do calculations across the elements in an array using the apply function.
Syntax
apply(x, margin, fun)
x is an array.
margin specifies the dimension(s) of the array over which the function is applied (for
example, 1 for rows).
fun is the function to be applied across the elements of the array.
Example
We use the apply function below to calculate the sum of the elements in the rows of an
array across all the matrices.
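The array itself is not created in these notes; a plausible reconstruction consistent with the result [1] 56 68 60 below is:
# Create an array of two 3x3 matrices from two vectors.
vector1 <- c(5, 9, 3)
vector2 <- c(10, 11, 12, 13, 14, 15)
new.array <- array(c(vector1, vector2), dim = c(3, 3, 2))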
print(new.array)
# Use apply to calculate the sum of the rows across all the matrices.
result <- apply(new.array, c(1), sum)
print(result)
,,1
,,2
[1] 56 68 60
R - FACTORS
Factors are the data objects which are used to categorize the data and store it as
levels. They can store both strings and integers. They are useful in columns which
have a limited number of unique values, like "Male"/"Female" and TRUE/FALSE, and
they are useful in data analysis for statistical modeling.
Factors are created using the factor function by taking a vector as input.
Example
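A plausible reconstruction of the input vector and factor whose output is shown below:
# Create a vector of directions, then apply the factor function.
data <- c("East","West","East","North","North","East","West","West","West","East","North")
factor_data <- factor(data)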
print(data)
print(is.factor(data))
print(factor_data)
print(is.factor(factor_data))
[1] "East" "West" "East" "North" "North" "East" "West" "West" "West" "East" "North"
[1] FALSE
[1] East West East North North East West West West East North
Levels: East North West
[1] TRUE
On creating any data frame with a column of text data, R treats the text column as
categorical data and creates factors on it.
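A plausible reconstruction of the data frame whose output is shown below (note that on R 4.0 and later, stringsAsFactors = TRUE is needed to reproduce the factor behaviour):
# Create the vectors and combine them into a data frame; the text column becomes a factor.
height <- c(132, 151, 162, 139, 166, 147, 122)
weight <- c(48, 49, 66, 53, 67, 52, 40)
gender <- c("male","male","female","female","male","female","male")
input_data <- data.frame(height, weight, gender, stringsAsFactors = TRUE)
print(input_data)
print(is.factor(input_data$gender))
print(input_data$gender)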
1 132 48 male
2 151 49 male
3 162 66 female
4 139 53 female
5 166 67 male
6 147 52 female
7 122 40 male
[1] TRUE
[1] male male female female male female male
Levels: female male
The order of the levels in a factor can be changed by applying the factor function
again with new order of the levels.
[1] East West East North North East West West West East North
Levels: East North West
[1] East West East North North East West West West East North
Levels: East West North
We can generate factor levels by using the gl function. It takes two integers as input,
which indicate how many levels there are and how many times each level is repeated.
Syntax
gl(n, k, labels)
Example
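A minimal sketch of gl() with illustrative labels:
# Three levels, each repeated four times.
v <- gl(3, 4, labels = c("Tampa", "Seattle", "Boston"))
print(v)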
The data stored in a data frame can be of numeric, factor or character type.
# Create the data frame.
emp.data <- data.frame(
   emp_id = c(1:5),
   emp_name = c("Rick","Dan","Michelle","Ryan","Gary"),
   salary = c(623.3,515.2,611.0,729.0,843.25),
   start_date = as.Date(c("2012-01-01","2013-09-23","2014-11-15","2014-05-11",
      "2015-03-27")),
   stringsAsFactors = FALSE
)
# Print the data frame.
print(emp.data)
The structure of the data frame can be seen by using the str function; str(emp.data)
lists each column, for example:
$ emp_id : int 1 2 3 4 5
# Create the data frame emp.data as shown above.
# Print the summary.
print(summary(emp.data))
# Create the data frame emp.data as shown above.
# Extract Specific columns.
result <- data.frame(emp.data$emp_name,emp.data$salary)
print(result)
emp.data.emp_name emp.data.salary
1 Rick 623.30
2 Dan 515.20
3 Michelle 611.00
4 Ryan 729.00
5 Gary 843.25
# Create the data frame emp.data as shown above.
# Extract first two rows.
result <- emp.data[1:2,]
print(result)
Extract 3rd and 5th row with 2nd and 4th column
# Create the data frame emp.data as shown above.
# Extract 3rd and 5th row with 2nd and 4th column.
result <- emp.data[c(3,5),c(2,4)]
print(result)
emp_name start_date
3 Michelle 2014-11-15
5 Gary 2015-03-27
Add Column
# Create the data frame emp.data as shown above.
# Add the "dept" column.
emp.data$dept <- c("IT","Operations","IT","HR","Finance")
v <- emp.data
print(v)
Add Row
To add more rows permanently to an existing data frame, we need to bring in the new
rows in the same structure as the existing data frame and use the rbind function.
In the example below we create a data frame with new rows and merge it with the
existing data frame emp.data.
# Create the first data frame (emp.data, including the dept column) as shown above.
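A plausible sketch of the second data frame and the rbind call; the new rows reuse the employee values that appear in the CSV example later in these notes:
# Create the second data frame with the same structure.
emp.newdata <- data.frame(
   emp_id = c(6:8),
   emp_name = c("Nina","Simon","Guru"),
   salary = c(578.0, 632.8, 722.5),
   start_date = as.Date(c("2013-05-21","2013-07-30","2014-06-17")),
   dept = c("IT","Operations","Finance"),
   stringsAsFactors = FALSE
)
# Bind the two data frames row-wise.
emp.finaldata <- rbind(emp.data, emp.newdata)
print(emp.finaldata)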
R - PACKAGES
R packages are a collection of R functions, compiled code and sample data. They are
stored under a directory called "library" in the R environment. By default, R installs a set
of packages during installation. More packages are added later, when they are
needed for some specific purpose. When we start the R console, only the default
packages are available by default. Other packages which are already installed have
to be loaded explicitly to be used by the R program that is going to use them.
Below is a list of commands to be used to check, verify and use the R packages.
.libPaths()
When we execute the above code, it produces the following result. It may vary
depending on the local settings of your pc.
library()
When we execute the above code, it produces the following result. It may vary
depending on the local settings of your pc.
search()
When we execute the above code, it produces the following result. It may vary
depending on the local settings of your pc.
There are two ways to add new R packages. One is installing directly from the CRAN
directory and another is downloading the package to your local system and installing it
manually.
The following command gets the packages directly from CRAN webpage and installs
the package in the R environment. You may be prompted to choose a nearest mirror.
Choose the one appropriate to your location.
install.packages("Package Name")
Go to the link R Packages to download the package needed. Save the package as
a .zip file in a suitable location in the local system.
Now you can run the following command to install this package in the R environment.
Before a package can be used in the code, it must be loaded to the current R
environment. You also need to load a package that is already installed previously but
not available in the current environment.
R - DATA RESHAPING
Data Reshaping in R is about changing the way data is organized into rows and
columns. Most of the time data processing in R is done by taking the input data as a
data frame. It is easy to extract data from the rows and columns of a data frame but
there are situations when we need the data frame in a format that is different from
format in which we received it. R has many functions to split, merge and change the
rows to columns and vice-versa in a data frame.
We can join multiple vectors to create a data frame using the cbind function. Also we
can merge two data frames using the rbind function.
# Create two vectors and combine them into a data frame with cbind (illustrative values).
city <- c("Tampa","Seattle","Hartford","Denver")
state <- c("FL","WA","CT","CO")
addresses <- cbind(city, state)
# Print a header.
cat("# # # # The First data frame\n")
print(addresses)
# Create a second data frame and append its rows with rbind.
new.address <- data.frame(city = c("Lowry","Charlotte"), state = c("CO","FL"))
all.addresses <- rbind(addresses, new.address)
# Print a header.
cat("# # # The combined data frame\n")
print(all.addresses)
We can merge two data frames by using the merge function. The data frames must
have same column names on which the merging happens.
In the example below, we consider the data sets about diabetes in Pima Indian
women available in the library named "MASS". We merge the two data sets based on
the values of blood pressure ("bp") and body mass index ("bmi"). On choosing these
two columns for merging, the records where the values of these two variables match in
both data sets are combined together to form a single data frame.
library(MASS)
merged.Pima<- merge(x =Pima.te, y =Pima.tr,
by.x = c("bp","bmi"),
by.y = c("bp","bmi")
)
print(merged.Pima)
nrow(merged.Pima)
bp bmi npreg.x glu.x skin.x ped.x age.x type.x npreg.y glu.y skin.y ped.y
1 60 33.8 1 117 23 0.466 27 No 2 125 20 0.088
2 64 29.7 2 75 24 0.370 33 No 2 100 23 0.368
3 64 31.2 5 189 33 0.583 29 Yes 3 158 13 0.295
4 64 33.2 4 117 27 0.230 24 No 1 96 27 0.289
5 66 38.1 3 115 39 0.150 28 No 1 114 36 0.289
6 68 38.5 2 100 25 0.324 26 No 7 129 49 0.439
7 70 27.4 1 116 28 0.204 21 No 0 124 20 0.254
8 70 33.1 4 91 32 0.446 22 No 9 123 44 0.374
9 70 35.4 9 124 33 0.282 34 No 6 134 23 0.542
10 72 25.6 1 157 21 0.123 24 No 4 99 17 0.294
11 72 37.7 5 95 33 0.370 27 No 6 103 32 0.324
12 74 25.9 9 134 33 0.460 81 No 8 126 38 0.162
13 74 25.9 1 95 21 0.673 36 No 8 126 38 0.162
14 78 27.6 5 88 30 0.258 37 No 6 125 31 0.565
15 78 27.6 10 122 31 0.512 45 No 6 125 31 0.565
16 78 39.4 2 112 50 0.175 24 No 4 112 40 0.236
17 88 34.5 1 117 24 0.403 40 Yes 4 127 11 0.598
age.y type.y
1 31 No
2 21 No
3 24 No
4 21 No
5 21 No
6 43 Yes
7 36 Yes
8 40 No
9 29 Yes
10 28 No
11 55 No
12 39 No
13 39 No
14 49 Yes
15 49 Yes
16 38 No
17 28 No
[1] 17
One of the most interesting aspects of R programming is about changing the shape of
the data in multiple steps to get a desired shape. The functions used to do this are
called melt and cast.
We consider the dataset called ships present in the library called "MASS".
library(MASS)
print(ships)
   type year period service incidents
18    C   60     75     552         1
19    C   65     60     781         0
............
............
Now we melt the data to organize it, converting all columns other than type and year
into multiple rows.
104 C 75 incidents 1
105 D 60 incidents 0
106 D 60 incidents 0
...........
...........
We can cast the molten data into a new form where the aggregate of each type of
ship for each year is created. It is done using the cast function.
R - CSV FILES
In R, we can read data from files stored outside the R environment. We can also write
data into files which will be stored and accessed by the operating system. R can read
and write into various file formats like csv, excel, xml etc.
In this chapter we will learn to read data from a csv file and then write data into a csv
file. The file should be present in the current working directory so that R can read it. Of
course we can also set our own directory and read files from there.
You can check which directory the R workspace is pointing to using the getwd function.
You can also set a new working directory using the setwd function.
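A minimal sketch producing output of the form shown below (the paths are those from the example output):
# Get and print the current working directory.
print(getwd())
# Set the current working directory, then confirm it.
setwd("/web/com")
print(getwd())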
[1] "/web/com/1441086124_2016"
[1] "/web/com"
This result depends on your OS and your current directory where you are working.
The csv file is a text file in which the values in the columns are separated by a comma.
Let's consider the following data present in the file named input.csv.
You can create this file using Windows Notepad by copying and pasting this data. Save
the file as input.csv using the Save As option, choosing the file type "All files (*.*)" in
Notepad.
id,name,salary,start_date,dept
1,Rick,623.3,2012-01-01,IT
2,Dan,515.2,2013-09-23,Operations
3,Michelle,611,2014-11-15,IT
4,Ryan,729,2014-05-11,HR
5,Gary,843.25,2015-03-27,Finance
6,Nina,578,2013-05-21,IT
7,Simon,632.8,2013-07-30,Operations
8,Guru,722.5,2014-06-17,Finance
Following is a simple example of read.csv function to read a CSV file available in your
current working directory −
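A minimal sketch of the call described above:
# Read the CSV file from the current working directory into a data frame.
data <- read.csv("input.csv")
print(data)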
By default the read.csv function gives the output as a data frame. This can be easily
checked as follows. Also we can check the number of columns and rows.
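A plausible reconstruction of the checks whose output ([1] TRUE, [1] 5, [1] 8) is shown below:
print(is.data.frame(data))   # read.csv returns a data frame
print(ncol(data))            # number of columns
print(nrow(data))            # number of rows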
[1] TRUE
[1] 5
[1] 8
Once we read data in a data frame, we can apply all the functions applicable to data
frames. For example, the maximum salary, obtained with max(data$salary), is:
[1] 843.25
We can fetch rows meeting specific filter criteria similar to a SQL where clause.
R can create a csv file from an existing data frame. The write.csv function is used to
create the csv file. This file gets created in the working directory.
By default an extra column X, holding the row names, is written to the file. This can be
dropped using additional parameters while writing the file.
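A minimal sketch; the variable retval and the subset() filter are assumptions, chosen to match the SQL-like filtering described above:
# Keep only the rows whose dept is "IT", then write them without the row-name column.
retval <- subset(data, dept == "IT")
write.csv(retval, "output.csv", row.names = FALSE)
newdata <- read.csv("output.csv")
print(newdata)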
R - EXCEL FILE
Microsoft Excel is the most widely used spreadsheet program which stores data in the
.xls or .xlsx format. R can read directly from these files using some excel specific
packages. Few such packages are - XLConnect, xlsx, gdata etc. We will be using xlsx
package. R can also write into excel file using this package.
You can use the following command in the R console to install the "xlsx" package. It
may ask to install some additional packages on which this package is dependent.
Follow the same command with required package name to install the additional
packages.
install.packages("xlsx")
Use the following command to verify and load the "xlsx" package.
[1] TRUE
Open Microsoft excel. Copy and paste the following data in the work sheet named as
sheet1.
Also copy and paste the following data to another worksheet and rename this
worksheet to "city".
name city
Rick Seattle
Dan Tampa
Michelle Chicago
Ryan Seattle
Gary Houston
Nina Boston
Simon Mumbai
Guru Dallas
Save the Excel file as "input.xlsx". You should save it in the current working directory of
the R workspace.
The input.xlsx is read by using the read.xlsx function as shown below. The result is stored
as a data frame in the R environment.
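A minimal sketch of the call described above, using the xlsx package and the first worksheet:
library("xlsx")
# Read the first worksheet of input.xlsx into a data frame.
data <- read.xlsx("input.xlsx", sheetIndex = 1)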
print(data)
R - BINARY FILES
A binary file is a file that contains information stored only in the form of bits and
bytes (0's and 1's). It is not human-readable, as the bytes in it translate to
characters and symbols which contain many other non-printable characters.
Attempting to read a binary file using any text editor will show characters like Ø and ð.
The binary file has to be read by specific programs to be useable. For example, the
binary file of a Microsoft Word program can be read to a human readable form only by
the Word program. This indicates that, besides the human-readable text, there is a
lot more information, like formatting of characters and page numbers etc., which is
also stored along with alphanumeric characters. Finally, a binary file is a continuous
sequence of bytes; the line break we see in a text file is a character joining one line to
the next.
R has two functions, writeBin and readBin, to create and read binary files.
Syntax
writeBin(object, con)
readBin(con, what, n )
Example
We consider the R inbuilt data "mtcars". First we create a csv file from it and convert it to
a binary file and store it as a OS file. Next we read this binary file created into R.
We read the data frame "mtcars" as a csv file and then write it as a binary file to the OS.
# Read the "mtcars" data frame as a csv file and store only the columns
"cyl","am"and"gear".
write.table(mtcars, file ="mtcars.csv",row.names = FALSE, na ="",
col.names = TRUE, sep =",")
# Create a connection object to write the binary file using mode "wb".
write.filename = file("/web/com/binmtcars.dat","wb")
# Write the column names of the data frame to the connection object.
writeBin(colnames(new.mtcars), write.filename)
# Write the records in each of the column to the file.
writeBin(c(new.mtcars$cyl,new.mtcars$am,new.mtcars$gear), write.filename)
# Close the file for writing so that it can be read by other program.
close(write.filename)
The binary file created above stores all the data as continuous bytes. So we will read it
by choosing appropriate values of column names as well as the column values.
# Create a connection object to read the file in binary mode using "rb".
read.filename <- file("/web/com/binmtcars.dat","rb")
# Next read the column values. n = 18 as we have 3 column names and 15 values.
read.filename <- file("/web/com/binmtcars.dat","rb")
bindata <- readBin(read.filename, integer(), n =18)
# Read the values from the 4th byte to the 8th byte, which represent "cyl".
cyldata = bindata[4:8]
print(cyldata)
# Read the values from the 9th byte to the 13th byte, which represent "am".
amdata = bindata[9:13]
print(amdata)
# Read the values from the 14th byte to the 18th byte, which represent "gear".
geardata = bindata[14:18]
print(geardata)
When we execute the above code, it produces the following result and chart −
[13] 0 4 4 4 3 3
[1] 6 6 4 6 8
[1] 1 1 1 0 0
[1] 4 4 4 3 3
cyl am gear
[1,] 6 1 4
[2,] 6 1 4
[3,] 4 1 4
[4,] 6 0 3
[5,] 8 0 3
As we can see, we got the original data back by reading the binary file in R.
R - XML FILES
XML is a file format which shares both the file format and the data on the World Wide
Web, intranets, and elsewhere using standard ASCII text. It stands for Extensible Markup
Language (XML). Similar to HTML it contains markup tags. But unlike HTML, where the
markup tags describe the structure of the page, in XML the markup tags describe the
meaning of the data contained in the file.
You can read a xml file in R using the "XML" package. This package can be installed
using following command.
install.packages("XML")
Input Data
Create an XML file by copying the below data into a text editor like Notepad. Save the
file with a .xml extension, choosing the file type "All files (*.*)".
<RECORDS>
<EMPLOYEE>
<ID>1</ID>
<NAME>Rick</NAME>
<SALARY>623.3</SALARY>
<STARTDATE>1/1/2012</STARTDATE>
<DEPT>IT</DEPT>
</EMPLOYEE>
<EMPLOYEE>
<ID>2</ID>
<NAME>Dan</NAME>
<SALARY>515.2</SALARY>
<STARTDATE>9/23/2013</STARTDATE>
<DEPT>Operations</DEPT>
</EMPLOYEE>
<EMPLOYEE>
<ID>3</ID>
<NAME>Michelle</NAME>
<SALARY>611</SALARY>
<STARTDATE>11/15/2014</STARTDATE>
<DEPT>IT</DEPT>
</EMPLOYEE>
<EMPLOYEE>
<ID>4</ID>
<NAME>Ryan</NAME>
<SALARY>729</SALARY>
<STARTDATE>5/11/2014</STARTDATE>
<DEPT>HR</DEPT>
</EMPLOYEE>
<EMPLOYEE>
<ID>5</ID>
<NAME>Gary</NAME>
<SALARY>843.25</SALARY>
<STARTDATE>3/27/2015</STARTDATE>
<DEPT>Finance</DEPT>
</EMPLOYEE>
<EMPLOYEE>
<ID>6</ID>
<NAME>Nina</NAME>
<SALARY>578</SALARY>
<STARTDATE>5/21/2013</STARTDATE>
<DEPT>IT</DEPT>
</EMPLOYEE>
<EMPLOYEE>
<ID>7</ID>
<NAME>Simon</NAME>
<SALARY>632.8</SALARY>
<STARTDATE>7/30/2013</STARTDATE>
<DEPT>Operations</DEPT>
</EMPLOYEE>
<EMPLOYEE>
<ID>8</ID>
<NAME>Guru</NAME>
<SALARY>722.5</SALARY>
<STARTDATE>6/17/2014</STARTDATE>
<DEPT>Finance</DEPT>
</EMPLOYEE>
</RECORDS>
The xml file is read by R using the function xmlParse. It is stored as a list in R.
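A minimal sketch of the call described above, assuming the file was saved as "input.xml":
# Load the package and parse the XML file.
library("XML")
result <- xmlParse(file = "input.xml")
print(result)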
1
Rick
623.3
1/1/2012
IT
2
Dan
515.2
9/23/2013
Operations
3
Michelle
611
11/15/2014
IT
4
Ryan
729
5/11/2014
HR
5
Gary
843.25
3/27/2015
Finance
6
Nina
578
5/21/2013
IT
7
Simon
632.8
7/30/2013
Operations
8
Guru
722.5
6/17/2014
Finance
print(rootsize)
output
[1] 8
Let's look at the first record of the parsed file. It will give us an idea of the various
elements present in the top level node.
$EMPLOYEE
1
Rick
623.3
1/1/2012
IT
attr(,"class")
[1] "XMLInternalNodeList" "XMLNodeList"
1
IT
Michelle
To handle the data effectively in large files we read the data in the xml file as a data
frame. Then process the data frame for data analysis.
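A minimal sketch of that conversion, using the XML package's xmlToDataFrame function:
# Convert the XML data directly into a data frame.
xmldataframe <- xmlToDataFrame("input.xml")
print(xmldataframe)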
As the data is now available as a dataframe we can use data frame related function
to read and manipulate the file.
R - JSON FILES
JSON file stores data as text in human-readable format. Json stands for JavaScript
Object Notation. R can read JSON files using the rjson package.
In the R console, you can issue the following command to install the rjson package.
install.packages("rjson")
Input Data
Create a JSON file by copying the below data into a text editor like Notepad. Save the
file with a .json extension, choosing the file type "All files (*.*)".
{
"ID":["1","2","3","4","5","6","7","8"],
"Name":["Rick","Dan","Michelle","Ryan","Gary","Nina","Simon","Guru"],
"Salary":["623.3","515.2","611","729","843.25","578","632.8","722.5"],
"StartDate":["1/1/2012","9/23/2013","11/15/2014","5/11/2014","3/27/2015","5/21/2013",
"7/30/2013","6/17/2014"],
"Dept":["IT","Operations","IT","HR","Finance","IT","Operations","Finance"]
}
The JSON file is read by R using the function fromJSON(). It is stored as a list in R.
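A minimal sketch of the call described above, assuming the file was saved as "input.json":
# Load the package and read the JSON file into a list.
library("rjson")
result <- fromJSON(file = "input.json")
print(result)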
$ID
[1] "1" "2" "3" "4" "5" "6" "7" "8"
$Name
[1] "Rick" "Dan" "Michelle" "Ryan" "Gary" "Nina" "Simon" "Guru"
$Salary
[1] "623.3" "515.2" "611" "729" "843.25" "578" "632.8" "722.5"
$StartDate
[1] "1/1/2012" "9/23/2013" "11/15/2014" "5/11/2014" "3/27/2015" "5/21/2013"
"7/30/2013" "6/17/2014"
$Dept
[1] "IT" "Operations" "IT" "HR" "Finance" "IT"
"Operations" "Finance"
We can convert the extracted data above to an R data frame for further analysis using
the as.data.frame function, for example json_data_frame <- as.data.frame(result):
print(json_data_frame)
R - WEB DATA
Many websites provide data for consumption by their users. For example, the World Health
Organization (WHO) provides reports on health and medical information in the form
of CSV, txt and XML files. Using R programs, we can programmatically extract specific
data from such websites. Some packages in R which are used to scrape data from the
web are "RCurl", "XML" and "stringr". They are used to connect to the URLs, identify the
required links for the files and download them to the local environment.
Install R Packages
The following packages are required for processing the URLs and the links to the files. If
they are not available in your R environment, you can install them using the following
commands.
install.packages("RCurl")
install.packages("XML")
install.packages("stringr")
install.packages("plyr")
Input Data
We will visit the URL weather data and download the CSV files using R for the year 2015.
Example
We will use the function getHTMLLinks to gather the URLs of the files. Then we will use the
function download.file to save the files to the local system. As we will be applying the
same code again and again for multiple files, we will create a function to be called
multiple times. The filenames are passed as parameters in form of a R list object to this
function.
# Identify only the links which point to the JCMB 2015 files.
filenames <- links[str_detect(links,"JCMB_2015")]
# Create a function to download the files by passing the URL and filename list.
downloadcsv <-function(mainurl,filename){
filedetails <- str_c(mainurl,filename)
download.file(filedetails,filename)
}
# Now apply the l_ply function and save the files into the current R working directory.
l_ply(filenames, downloadcsv, mainurl = "http://www.geos.ed.ac.uk/~weather/jcmb_ws/")
After running the above code, you can locate the following files in the current R
working directory.
The data in relational database systems is stored in a normalized format. So, to carry
out statistical computing we would need very advanced and complex SQL queries. But R
can connect easily to many relational databases like MySQL, Oracle, SQL Server etc. and
fetch records from them as a data frame. Once the data is available in the R
environment, it becomes a normal R data set and can be manipulated or analyzed
using all the powerful packages and functions.
In this tutorial we will be using MySql as our reference database for connecting to R.
RMySQL Package
R has a package named "RMySQL" which provides native connectivity between R and
the MySQL database. You can install this package in the R environment using the
following command.
install.packages("RMySQL")
Connecting R to MySql
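The connection call is not reproduced at this point; a plausible sketch, reusing the credentials, database and host that appear later in this chapter, is:
# Load the package and create a connection object to the MySQL database.
library("RMySQL")
mysqlconnection = dbConnect(MySQL(), user = 'root', password = '', dbname = 'sakila',
   host = 'localhost')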
dbListTables(mysqlconnection)
We can query the database tables in MySql using the function dbSendQuery. The query
gets executed in MySql and the result set is returned using the R fetch function. Finally it
is stored as a data frame in R.
# Store the result in a R data frame object. n = 5 is used to fetch first 5 rows.
data.frame = fetch(result, n =5)
print(data.frame)
We can update the rows in a Mysql table by passing the update query to the
dbSendQuery function.
After executing the above code we can see the table updated in the MySql
Environment.
dbSendQuery(mysqlconnection,
"insert into mtcars(row_names, mpg, cyl, disp, hp, drat, wt, qsec, vs, am, gear, carb)
values('New Mazda RX4 Wag', 21, 6, 168.5, 110, 3.9, 2.875, 17.02, 0, 1, 4, 4)"
)
After executing the above code we can see the row inserted into the table in the
MySql Environment.
We can create tables in MySQL using the function dbWriteTable. It overwrites the table
if it already exists, and it takes a data frame as input.
# Create the connection object to the database where we want to create the table.
mysqlconnection = dbConnect(MySQL(), user ='root', password ='', dbname ='sakila',
host ='localhost')
After executing the above code we can see the table created in the MySql
Environment.
We can drop the tables in MySql database passing the drop table statement into the
dbSendQuery in the same way we used it for querying data from tables.
After executing the above code we can see the table is dropped in the MySql
Environment.
R - PIE CHARTS
R programming language has numerous libraries to create charts and graphs. A pie
chart is a representation of values as slices of a circle with different colors. The slices are
labeled and the numbers corresponding to each slice are also represented in the chart.
In R the pie chart is created using the pie function which takes positive numbers as a
vector input. The additional parameters are used to control labels, color, title etc.
Syntax
pie(x, labels, radius, main, col, clockwise)
Example
A very simple pie-chart is created using just the input vector and labels. The below script
will create and save the pie chart in the current R working directory.
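A minimal sketch of such a script (the values, labels and file name are illustrative):
# Create the data for the chart.
x <- c(21, 62, 10, 53)
labels <- c("London", "New York", "Singapore", "Mumbai")
# Give the chart file a name, plot the chart and save the file.
png(file = "city.png")
pie(x, labels)
dev.off()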
We can expand the features of the chart by adding more parameters to the function.
We will use parameter main to add a title to the chart and another parameter
is col, which will make use of the rainbow colour palette while drawing the chart. The
length of the palette should be the same as the number of values we have for the
chart. Hence we use length(x).
Example
The below script will create and save the pie chart in the current R working directory.
We can add slice percentages and a chart legend by creating additional chart
variables; the legend colours are set with fill = rainbow(length(x)), as in the sketch below.
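A minimal sketch using the same illustrative data; the slice percentages are computed into
a separate variable and passed as the labels.
x <- c(21, 62, 10, 53)
labels <- c("London", "New York", "Singapore", "Mumbai")
# Compute the percentage of each slice, rounded to one decimal place.
piepercent <- round(100 * x / sum(x), 1)
png(file = "city_percentage_legends.png")
pie(x, labels = piepercent, main = "City pie chart", col = rainbow(length(x)))
# Add a legend mapping the colours back to the city names.
legend("topright", labels, cex = 0.8, fill = rainbow(length(x)))
dev.off()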
3D Pie Chart
A pie chart with 3 dimensions can be drawn using additional packages. The
package plotrix has a function called pie3D that is used for this.
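A minimal sketch, assuming the plotrix package is installed; the data and the explode value
are illustrative.
library(plotrix)
x <- c(21, 62, 10, 53)
labels <- c("London", "New York", "Singapore", "Mumbai")
png(file = "3d_pie_chart.png")
# explode pulls the slices slightly apart for a clearer 3D effect.
pie3D(x, labels = labels, explode = 0.1, main = "Pie chart of cities")
dev.off()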
R - BAR CHARTS
A bar chart represents data in rectangular bars with the length of the bar proportional to
the value of the variable. R uses the function barplot to create bar charts. R can draw
both vertical and horizontal bars in the bar chart. In a bar chart each of the bars can be
given a different color.
Syntax
barplot(H,xlab,ylab,main, names.arg,col)
Example
A simple bar chart is created using just the input vector and the name of each bar.
The below script will create and save the bar chart in the current R working directory.
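A minimal sketch; the bar values, month names and file name are illustrative.
# Illustrative bar heights and the name of each bar.
H <- c(7, 12, 28, 3, 41)
M <- c("Mar", "Apr", "May", "Jun", "Jul")
png(file = "barchart.png")
barplot(H, names.arg = M)
dev.off()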
The features of the bar chart can be expanded by adding more parameters.
The main parameter is used to add a title. The col parameter is used to add colors to the
bars. The names.arg parameter is a vector having the same number of values as the
input vector, and it describes the meaning of each bar.
Example
The below script will create and save the bar chart in the current R working directory.
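A sketch of such a chart, using the same illustrative monthly values:
H <- c(7, 12, 28, 3, 41)
M <- c("Mar", "Apr", "May", "Jun", "Jul")
png(file = "barchart_months_revenue.png")
# Add axis labels, a title, a bar colour and a border colour.
barplot(H, names.arg = M, xlab = "Month", ylab = "Revenue",
        col = "blue", main = "Revenue chart", border = "red")
dev.off()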
We can create a bar chart with groups of bars and stacks in each bar by using a matrix
as input values. More than two variables are represented as a matrix, which is used to
create the grouped bar chart and the stacked bar chart, as in the sketch below.
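A minimal sketch of a stacked bar chart; the regions, months and revenue figures are
illustrative.
colors <- c("green", "orange", "brown")
months <- c("Mar", "Apr", "May", "Jun", "Jul")
regions <- c("East", "West", "North")
# One row per region, one column per month.
Values <- matrix(c(2, 9, 3, 11, 9, 4, 8, 7, 3, 12, 5, 2, 8, 10, 11),
                 nrow = 3, ncol = 5, byrow = TRUE)
png(file = "barchart_stacked.png")
barplot(Values, main = "Total revenue", names.arg = months,
        xlab = "Month", ylab = "Revenue", col = colors)
# Add a legend mapping the colours to the regions.
legend("topleft", regions, cex = 1.3, fill = colors)
dev.off()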
R - BOXPLOTS
Boxplots are a measure of how well distributed the data in a data set is. A boxplot divides
the data set into three quartiles. This graph represents the minimum, maximum, median, first
quartile and third quartile of the data set. It is also useful for comparing the distribution of
data across data sets by drawing boxplots for each of them.
Syntax
boxplot(x, data, notch, varwidth, names, main)
x is a vector or a formula.
data is the data frame.
notch is a logical value. Set as TRUE to draw a notch.
varwidth is a logical value. Set as true to draw width of the box proportionate to
the sample size.
names are the group labels which will be printed under each boxplot.
main is used to give a title to the graph.
Example
We use the data set "mtcars" available in the R environment to create a basic boxplot.
Let's look at the columns "mpg" and "cyl" in mtcars.
mpg cyl
Mazda RX4 21.0 6
Mazda RX4 Wag 21.0 6
Datsun 710 22.8 4
Hornet 4 Drive 21.4 6
Hornet Sportabout 18.7 8
Valiant 18.1 6
The below script will create a boxplot graph for the relation between mpg (miles per
gallon) and cyl (number of cylinders).
We can draw a boxplot with a notch to find out how the medians of different data groups
match with each other.
The below script will create a boxplot graph with a notch for each of the data groups.
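A minimal sketch; the colours and group names used here are illustrative.
png(file = "boxplot_with_notch.png")
boxplot(mpg ~ cyl, data = mtcars, xlab = "Number of Cylinders",
        ylab = "Miles Per Gallon", main = "Mileage Data",
        notch = TRUE, varwidth = TRUE,
        col = c("green", "yellow", "purple"),
        names = c("High", "Medium", "Low"))
dev.off()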
R - HISTOGRAMS
A histogram is similar to a bar chart, but the difference is that it groups the values into
continuous ranges. The height of each bar in a histogram represents the number of values
present in that range.
R creates a histogram using the hist() function. This function takes a vector as an input
and uses some more parameters to plot histograms.
Syntax
hist(v,main,xlab,xlim,ylim,breaks,col,border)
Example
A simple histogram is created using input vector, label, col and border parameters.
The script given below will create and save the histogram in the current R working
directory.
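A minimal sketch; the data values, labels and colours are illustrative.
# Illustrative data vector.
v <- c(9, 13, 21, 8, 36, 22, 12, 41, 31, 33, 19)
png(file = "histogram.png")
hist(v, xlab = "Weight", col = "yellow", border = "blue")
dev.off()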
To specify the range of values allowed on the X axis and the Y axis, we can use the xlim
and ylim parameters; the number of bars can be controlled with the breaks parameter, as
in the sketch below.
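A minimal sketch using the same illustrative vector; the axis ranges and the number of
breaks are illustrative choices.
v <- c(9, 13, 21, 8, 36, 22, 12, 41, 31, 33, 19)
png(file = "histogram_lim_breaks.png")
# Restrict the axes with xlim/ylim and suggest the number of bars with breaks.
hist(v, xlab = "Weight", col = "green", border = "red",
     xlim = c(0, 40), ylim = c(0, 5), breaks = 5)
dev.off()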
R - LINE GRAPHS
A line chart is a graph that connects a series of points by drawing line segments
between them. These points are ordered by one of their coordinates (usually the
x-coordinate). Line charts are usually used for identifying trends in data.
Syntax
plot(v,type,col,xlab,ylab)
v is a vector containing the numeric values.
type takes the value "p" to draw only the points, "l" to draw only the lines and "o"
to draw both points and lines.
xlab is the label for x axis.
ylab is the label for y axis.
main is the Title of the chart.
col is used to give colors to both the points and lines.
Example
A simple line chart is created using the input vector and the type parameter as "o". The
below script will create and save a line chart in the current R working directory.
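A minimal sketch; the data values and the file name are illustrative.
# Illustrative data vector.
v <- c(7, 12, 28, 3, 41)
png(file = "line_chart.png")
# type = "o" draws both the points and the connecting lines.
plot(v, type = "o")
dev.off()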
The features of the line chart can be expanded by using additional parameters. We
add color to the points and lines, give a title to the chart and add labels to the axes.
Example
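A minimal sketch of such a chart, using the same illustrative vector; the labels and title are
illustrative.
v <- c(7, 12, 28, 3, 41)
png(file = "line_chart_label_colored.png")
plot(v, type = "o", col = "red", xlab = "Month", ylab = "Rain fall",
     main = "Rain fall chart")
dev.off()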
More than one line can be drawn on the same chart by using the lines() function.
After the first line is plotted, the lines() function can use an additional vector as input to
draw the second line on the chart.
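A minimal sketch; the second vector is illustrative.
v <- c(7, 12, 28, 3, 41)
t <- c(14, 7, 6, 19, 3)
png(file = "line_chart_2_lines.png")
plot(v, type = "o", col = "red", xlab = "Month", ylab = "Rain fall",
     main = "Rain fall chart")
# Draw the second series on the same chart.
lines(t, type = "o", col = "blue")
dev.off()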
R - SCATTERPLOTS
Scatterplots show many points plotted in the Cartesian plane. Each point represents the
values of two variables. One variable is plotted on the horizontal axis and the other on the
vertical axis.
Syntax
plot(x, y, main, xlab, ylab, xlim, ylim, axes)
Example
We use the data set "mtcars" available in the R environment to create a basic
scatterplot. Let's use the columns "wt" and "mpg" in mtcars.
wt mpg
Mazda RX4 2.620 21.0
Mazda RX4 Wag 2.875 21.0
Datsun 710 2.320 22.8
Hornet 4 Drive 3.215 21.4
Hornet Sportabout 3.440 18.7
Valiant 3.460 18.1
The below script will create a scatterplot graph for the relation between wt (weight) and
mpg (miles per gallon).
xlab ="Weight",
ylab ="Milage",
xlim = c(2.5,5),
ylim = c(15,30),
main ="Weight vs Milage"
)
# Save the file.
dev.off()
Scatterplot Matrices
When we have more than two variables and we want to find the correlation between
one variable and the remaining ones, we use a scatterplot matrix. We use the pairs function
to create matrices of scatterplots.
Syntax
pairs(formula, data)
Example
Each variable is paired up with each of the remaining variables. A scatterplot is plotted
for each pair.
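A minimal sketch; the choice of columns from mtcars and the file name are illustrative.
png(file = "scatterplot_matrices.png")
# Pair up mpg, displacement, rear-axle ratio and weight.
pairs(~mpg + disp + drat + wt, data = mtcars,
      main = "Scatterplot Matrix")
dev.off()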