DS Notes
1. Obtain data that you hope will help answer the question.
2. Explore the data to understand it.
3. Clean and prepare the data for analysis.
4. Perform analysis, model building, testing, etc.
(The analysis is the step most people think of as data science, but
it’s just one step! Notice how much more there is that surrounds
it.)
5. Draw conclusions from your work.
6. Report those conclusions to the relevant stakeholders.
Visualization:
Visualization techniques turn large amounts of data into easy-to-understand, digestible visuals.
Machine Learning:
Machine Learning explores the building and study of algorithms that
learn to make predictions about unforeseen/future data.
Deep Learning:
Deep Learning is a newer area of machine learning research based on multi-layer neural networks, in which the algorithm learns the representations needed for analysis from the data itself.
Data Science Process:
1. Discovery:
Discovery step involves acquiring data from all the identified internal &
external sources, which helps you answer the business question.
2. Preparation:
Data can have many inconsistencies, such as missing values, blank columns, and incorrect data formats, which need to be cleaned. You need to process, explore, and condition data before modelling. The cleaner your data, the better your predictions.
3. Model Planning:
In this stage, you determine the method and technique for drawing relationships between the input variables. Planning for a model is performed using various statistical formulas and visualization tools. SQL Analysis Services, R, and SAS/ACCESS are some of the tools used for this purpose.
4. Model Building:
In this step, the actual model building process starts. Here, the data scientist splits the dataset into training and testing sets. Techniques like association, classification, and clustering are applied to the training data set. The model, once prepared, is tested against the "testing" dataset.
5. Operationalize:
In this stage, you deliver the final baselined model with reports, code, and technical documents. The model is deployed into a real-time production environment after thorough testing.
6. Communicate Results:
In this stage, the key findings are communicated to all stakeholders. This
helps you decide if the project results are a success or a failure based on
the inputs from the model.
Data Scientist:
Role: A Data Scientist is a professional who manages enormous amounts of data to come up with compelling business insights, using various tools, techniques, methodologies, algorithms, etc.
Data Engineer:
Role: A data engineer works with large amounts of data, developing, constructing, testing, and maintaining architectures such as large-scale processing systems and databases.
Data Analyst:
Role: A data analyst is responsible for mining vast amounts of data, looking for relationships, patterns, and trends. He or she then delivers compelling reporting and visualization that help the business take the most viable decisions.
Data Administrator:
Role: A data admin ensures that the database is accessible to all relevant users, that it performs correctly, and that it is kept safe from hacking.
Business Analyst:
Role: This professional needs to improve business processes. He/she is
an intermediary between the business executive team and the IT
department.
Internet Search:
Google Search uses data science technology to return results for a specific query within a fraction of a second.
Recommendation Systems:
Data science is used to create recommendation systems, for example, "suggested friends" on Facebook or "suggested videos" on YouTube.
Image & Speech Recognition:
Speech recognition systems like Siri, Google Assistant, and Alexa run on data science techniques. Moreover, Facebook recognizes your friends when you upload a photo with them, with the help of data science.
Gaming world:
EA Sports, Sony, and Nintendo use data science technology to enhance your gaming experience. Games are now developed using machine learning techniques, and they can update themselves as you move to higher levels.
Analytics
Analytics generates insights from data through simple presentation, manipulation, calculation, or visualization of data. In the context of data science, it is also sometimes referred to as exploratory data analysis. It often serves to familiarize oneself with the subject matter and to obtain initial hints for further analysis. To this end, analytics is often used to formulate appropriate questions for a data science project.
Statistics
Statistics provides a methodological approach to answer questions raised
by the analysts with a certain level of confidence.
Descriptive Statistics
Descriptive statistics is the part of statistics used to describe data. It summarizes the attributes of a sample in such a way that a pattern can be drawn from the group, enabling researchers to present data in a more meaningful way so that easy interpretations can be made. Descriptive statistics uses two kinds of tools to organize and describe data:
1. Measures of central tendency (mean, median, mode)
2. Measures of dispersion (range, variance, standard deviation)
Inferential Statistics
Inferential statistics is a branch of statistics that is used to make inferences
about the population by analyzing a sample. When the population data is very
large it becomes difficult to use it. In such cases, certain samples are taken
that are representative of the entire population. Inferential statistics draws
conclusions regarding the population using these samples. Sampling
strategies such as simple random sampling, cluster sampling, stratified
sampling, and systematic sampling, need to be used in order to choose
correct samples from the population. The main methodologies used in inferential
statistics are estimation (point and interval estimation) and hypothesis testing, both covered later in these notes.
Machine Learning
Artificial intelligence refers to the broad idea that machines can perform tasks normally requiring human intelligence, such as visual perception, speech recognition, decision-making and translation between languages. In the context of data science, machine learning can be considered a sub-field of artificial intelligence that is concerned with decision making. In fact, in its most essential form, machine learning is decision making at scale. Machine learning is the field of study of computer algorithms that allow computer programs to identify and extract patterns from data. A common purpose of machine learning algorithms is therefore to generalize and learn from data in order to perform certain tasks.
Supervised learning
The algorithm trains on a data set that comes with an answer key (labels), which it uses to measure its accuracy as it learns. Supervised learning techniques in statistical modeling include:
Regression model: A predictive model designed to analyze the relationship between independent and dependent variables. The most common regression models are linear, logistic, and polynomial; the simplest linear form is y = mx + c. These models are used for determining relationships between variables, forecasting, and modeling.
Classification model: An algorithm that analyzes and classifies a large and complex set of data points. Common models include decision trees, Naive Bayes, nearest neighbors, random forests, and neural network models.
Unsupervised learning
The algorithm is given data without an answer key (no labels) and must discover structure, such as groups or clusters, on its own.
In the decision tree algorithm, we solve the problem using a tree representation in which each node represents a feature, each branch represents a decision, and each leaf represents an outcome.
In a decision tree, we start from the root of the tree and compare the value of the root attribute with the record's attribute. On the basis of this comparison, we follow the corresponding branch and move to the next node. We continue comparing values until we reach a leaf node with the predicted class value. (A sketch follows below.)
If we are given a data set of items with certain features and values, and we need to categorize those items into groups, such problems can be solved using the k-means clustering algorithm, as in the sketch below.
Populations and samples
In statistics, population is the entire set of items from which you draw
data for a statistical study. It can be a group of individuals, a set of
items, etc. It makes up the data pool for a study.
Generally, population refers to the people who live in a particular area
at a specific time. But in statistics, population refers to data on your
study of interest. It can be a group of individuals, objects, events,
organizations, etc. You use populations to draw conclusions.
An example of a population would be the entire student body at a
school. It would contain all the students who study in that school at the
time of data collection. Depending on the problem statement, data from
each of these students is collected. An example is the students who
speak Hindi among the students of a school.
For the above situation, it is easy to collect data. The population is small
and willing to provide data and can be contacted. The data collected will
be complete and reliable.
If you had to collect the same data from a larger population, say the
entire country of India, it would be impossible to draw reliable
conclusions because of geographical and accessibility constraints, not to
mention time and resource constraints. A lot of data would be missing or
might be unreliable. Furthermore, due to accessibility issues,
marginalized tribes or villages might not provide data at all, making the
data biased towards certain regions or groups.
What is a Sample?
A sample is a smaller subset of the population that is selected for study and used to represent the whole.
Population vs. Sample examples:
1. All residents of a country would constitute the Population; all residents who live above the poverty line would be the Sample.
2. All residents above the poverty line in a country would be the Population; all residents who are millionaires would make up the Sample.
3. All employees in an office would be the Population; out of all the employees, all managers in the office would be the Sample.
You collect data from a population when your research question needs
an extensive amount of data or information about every member of the
population is available. You use population data when the data pool is
small and cooperative to give all the required information. For larger
populations, you use Sampling to represent parts of the population
from which it is hard to collect data.
How to Collect Data From a Sample?
Samples are used when the population is large, scattered, or if it's hard
to collect data on individual instances within it. You can then use a
small sample of the population to make overall hypotheses.
I) Estimation
a) Point Estimation
b) Interval Estimation
II) Hypothesis testing
Point Estimation
A point estimate uses a single sample statistic (for example, the sample mean x̄) to estimate a population parameter. An interval estimate instead gives a range, the confidence interval:
CI = x̄ ± Z × (σ / √n)
Confidence level: normally 90%, 95%, 99%, etc.
Z0.90 = 1.645
Z0.95 = 1.96
Z0.98 = 2.326
Z0.99 = 2.576
Example:
A company claims that its average sales for this quarter are 1000 units. This is an example of a simple hypothesis.
Suppose the company claims that the sales are in the range of 900 to 1000 units. Then this is a case of a composite hypothesis.
If the sample statistic falls in the rejection region, the null hypothesis is rejected and the alternate hypothesis is accepted.
Example:
According to H1, the mean can be either greater than or less than 50 (H1: μ ≠ 50). This is an example of a two-tailed test.
Type 1 Error: A Type-I error occurs when the sample results lead us to reject the null hypothesis despite it being true.
Type 2 Error: A Type-II error occurs when the null hypothesis is not rejected even though it is false.
Example:
A Type I error would be the teacher failing the student [rejecting H0] although the student scored the passing marks [H0 was true].
A Type II error would be the case where the teacher passes the student [does not reject H0] although the student did not score the passing marks [H1 is true].
Probability Distributions
A probability distribution is a way to represent the possible values a variable may take and their respective probabilities.
Discrete Distributions
As its name suggests, a Discrete Distribution is a distribution where an observation can take only a finite number of values. For example, rolling a die can only result in the values 1 to 6, and the gender of a species takes only a fixed set of values. It is fairly common to have discrete variables in a real-world data set, be it gender, age group, or the number of visitors to a place at a particular time. There are many other discrete distributions, but we will focus on the most common and important of them.
Continuous distribution
A continuous distribution is one where the variable can take any value within a range (for example, height or temperature); the normal, uniform, and log-normal distributions discussed below are continuous.
Bernoulli Distribution
Bernoulli Distribution can safely be considered the simplest of the discrete distributions. Consider the example of flipping an unbiased coin: you either get a Head or a Tail. If we care only about one of them (Head or Tail), the outcome will only be 0 (failure) or 1 (success). As it is an unbiased coin, the probability assigned to each outcome is 0.5. Remember, the outcome is always binary: True/False, Head/Tail, Success/Failure, etc.
The probability mass function or PMF of the Bernoulli Distribution is given as follows. Let's consider a random variable X with only one parameter p:
P[X = 1] = p
P[X = 0] = 1 − p
Where,
X = 1 indicates the event has occurred
X = 0 indicates the event didn't occur.
The mean is given by E[X] = p, and the variance is given by Var(X) = p(1 − p).
Binomial Distribution
Binomial Distribution is simply an extension of the Bernoulli distribution: if we repeat Bernoulli trials n times, we get a Binomial distribution. If we want to model the number of successes in n trials, we use the Binomial Distribution. As each unit of a Binomial is a Bernoulli trial, the outcome is always binary, and the observations are independent of each other.
The binomial distribution is a discrete distribution. It is used to represent the probability of x successes in n trials, given a success probability p in each trial:
P(X = x) = C(n, x) · p^x · (1 − p)^(n − x)
If the distribution satisfies the conditions below, then it is called a binomial distribution:
1. There should be a fixed number of trials.
2. Each trial should have only two possible outcomes.
3. The trials should be independent.
4. The probability of success (and of failure) should remain the same in every trial.
The following are a few properties of the binomial distribution one should remember:
1. Expected value = mean = np
2. Variance = npq
where q = 1 − p and p is the probability of success.
As the binomial distribution is a Bernoulli trial repeated n times, the mean and variance are given by mean = np and variance = np(1 − p).
Poisson Distribution
The Poisson distribution is a discrete distribution that models the number of events occurring in a fixed interval of time or space, given a known average rate λ. Its PMF is P(X = k) = e^(−λ) λ^k / k!, and both its mean and variance equal λ.
Normal distribution:
The normal distribution is a continuous, bell-shaped distribution that is symmetric about its mean. By the empirical rule, about 68% of the data points of a variable X fall within 1 standard deviation of the mean (about 95% within 2, and about 99.7% within 3).
Uniform Distribution:
A distribution is said to be a uniform distribution if all the outcomes of the event have equal probabilities. The uniform distribution is also called the rectangular distribution. Because every outcome is equally likely, the expected value of a uniform distribution tells us little about any individual outcome.
The size distribution of rain droplets can be plotted using the log-normal distribution.
Model fitting / fitting a Model
Model fitting is a measure of how well a machine learning model generalizes to data similar to that on which it was trained. A good model fit refers to a model that accurately approximates the output when provided with unseen inputs.
Fitting refers to adjusting the parameters in the model to improve
accuracy. The process involves running an algorithm on data for which
the target variable (“labeled” data) is known to produce a machine
learning model. Then, the model’s outcomes are compared to the real,
observed values of the target variable to determine the accuracy.
The next step involves adjusting the algorithm's parameters in order to reduce the level of error and make the model more accurate at capturing the relationship between the features and the target variable. This process is repeated several times until the model finds the optimal parameters to make predictions with substantial accuracy, as the sketch below illustrates.
History of R Programming
The history of R goes back about 20-30 years. R was developed by Ross Ihaka and Robert Gentleman at the University of Auckland, New Zealand, and it is currently developed by the R Development Core Team. The language's name is taken from the first names of the two developers. The project was first considered in 1992; the initial version was released in 1995, and a stable beta version followed in 2000.
Features of R programming
R is a domain-specific programming language aimed at data analysis. It has some unique features which make it very powerful, arguably the most important being its notion of vectors. Vectors allow us to perform a complex operation on a set of values in a single command.
ENVIRONMENT SETUP
R Command Prompt
Once you have the R environment set up, it's easy to start the R command prompt by just typing the following command at your operating system's command prompt −
$R
This will launch the R interpreter and you will get a prompt > where you can start typing your program as follows −
> myString <- "Hello, World!"
> print ( myString)
[1] "Hello, World!"
Here the first statement defines a string variable myString, to which we assign the string "Hello, World!"; the next statement uses print() to display the value stored in myString.
R Script File
myString <- "Hello, World!"
print(myString)
Save the above code in a file test.R and execute it at the Linux command prompt as given below. Even if you are using Windows or another system, the syntax will remain the same.
$ Rscript test.R
When we run the above program, it produces the following result.
[1] "Hello, World!"
Vectors
When you want to create a vector with more than one element, you should use the c() function, which combines the elements into a vector.
# Create a vector.
apple <- c('red','green',"yellow")
print(apple)
When we execute the above code, it produces the following result −
[1] "red"    "green"  "yellow"
Matrices
A matrix is a two-dimensional rectangular data set. It can be created
using a vector input to the matrix function.
# Create a matrix.
M = matrix( c('a','a','b','c','b','a'), nrow = 2, ncol = 3, byrow = TRUE)
print(M)
When we execute the above code, it produces the following result −
[,1] [,2] [,3]
[1,] "a" "a" "b"
[2,] "c" "b" "a"
Arrays
While matrices are confined to two dimensions, arrays can be of any number of dimensions. The array function takes a dim attribute which creates the required number of dimensions. In the below example we create an array with two elements, each of which is a 3x3 matrix.
# Create an array.
a <- array(c('green','yellow'),dim = c(3,3,2))
print(a)
When we execute the above code, it produces the following result −
, , 1

     [,1]     [,2]     [,3]
[1,] "green"  "yellow" "green"
[2,] "yellow" "green"  "yellow"
[3,] "green"  "yellow" "green"

, , 2

     [,1]     [,2]     [,3]
[1,] "yellow" "green"  "yellow"
[2,] "green"  "yellow" "green"
[3,] "yellow" "green"  "yellow"
Data Frames
Data frames are tabular data objects. Unlike a matrix, in a data frame each column can contain a different mode of data: the first column can be numeric while the second is character and the third logical. A data frame is a list of vectors of equal length.
Data Frames are created using the data.frame() function.
# Create the data frame.
BMI <- data.frame(
gender = c("Male", "Male","Female"),
height = c(152, 171.5, 165),
weight = c(81,93, 78),
Age = c(42,38,26)
)
print(BMI)
When we execute the above code, it produces the following result −
gender height weight Age
1 Male 152.0 81 42
2 Male 171.5 93 38
3 Female 165.0 78 26
UNIT-II
An attribute is a data field, defining a characteristic of a data object. The nouns attribute, dimension, feature, and variable are used interchangeably in the literature. The term dimension is generally used in data warehousing, machine learning literature tends to use the term feature, while statisticians prefer the term variable.
Data mining and database experts generally use the term attribute. Attributes describing a customer object can include, for example, customer ID, name, and address. Observed values for a given attribute are referred to as observations.
A set of attributes that defines a given object is known as an attribute vector (or feature vector). The distribution of data involving one attribute (or variable) is called univariate; a bivariate distribution involves two attributes, and so on.
Example of attribute
In this example, RollNo, Name, and Result are attributes of the object
named as a student.
RollNo Name Result
1 Ali Pass
2 Akram Fail
We need to differentiate between different types of attributes during data preprocessing. Firstly, we need to differentiate between qualitative and quantitative attributes.
1. Qualitative Attributes such as Nominal, Ordinal, and Binary
Attributes.
2. Quantitative Attributes such as Discrete and Continuous
Attributes.
Types Of attributes
Binary
Nominal
Ordinal Attributes
Nominal Attributes
Nominal data is in alphabetical form, not integer form. Nominal attributes are qualitative attributes. Nominal relates to names of things or symbols; a nominal attribute represents a category, code, or state, and is also called a categorical attribute.
Binary Attributes
Binary data has only two values/states. For example, "HIV detected" can only be Yes or No. Binary attributes are qualitative attributes.
0 means the value is absent
1 means the value is present
There are two kinds:
1. Symmetric binary: both states are equally important.
2. Asymmetric binary: the two states are not equally important.
Ordinal Attributes
Ordinal attributes have values with a meaningful order or ranking among them, but the magnitude between successive values is not known. Example: grades A, B, C, D, F.
Numeric Attributes:
A numeric attribute is quantitative; that is, it is a measurable quantity represented by integer or real values. Numeric attributes can be of two types: interval-scaled and ratio-scaled.
Let’s discuss one by one.
1. Interval – Scaled Attributes:
Interval-scaled attributes are measured on a scale of equal-size units. The values of interval-scaled attributes have order and can be positive, 0, or negative. Thus, in addition to providing a ranking of values, such attributes allow us to compare and quantify the difference between values.
Example –
A temperature attribute is interval-scaled. We have different temperature values for every new day, where each day is an entity. By ordering the values, we obtain an arrangement of entities with respect to temperature. In addition, we can quantify the difference between values; for example, a temperature of 20 degrees C is five degrees higher than a temperature of 15 degrees C.
2. Ratio – Scaled Attributes:
Ratio-scaled attributes are numeric attributes with an inherent zero point, so one value can meaningfully be described as a multiple of another (for example, weight, height, or temperature in Kelvin).
Discrete Attributes
Discrete data have a finite value. It can be in numerical form and can
also be in a categorical form. Discrete Attributes are Quantitative
Attributes.
Examples: zip codes, profession, or the set of words in a document.
Note: Binary attributes are a special case of discrete attributes – Binary
attributes where only non-zero values are important are called
asymmetric binary attributes.
PROPERTIES OF ATTRIBUTES:
Although the mean is the single most useful quantity for describing a data set, it is not always the best way of measuring the center of the data. A major problem with the mean is its sensitivity to extreme (outlier) values; even a small number of extreme values can corrupt the mean.
Range
In statistics, the range is the smallest of all the measures of dispersion. It is the difference between the two extreme observations of the distribution, i.e., between the maximum and the minimum observation.
It is defined by
Range = Xmax – Xmin
Where Xmax is the largest observation and Xmin is the smallest
observation of the variable values.
Quartiles Definition
Quartiles divide the entire set into four equal parts. So, there are three
quartiles, first, second and third represented by Q1, Q2 and Q3,
respectively. Q2 is nothing but the median, since it indicates the position
of the item in the list and thus, is a positional average. To find quartiles
of a group of data, we have to arrange the data in ascending order.
Quartiles Formula
Q3, the upper quartile, is the median of the upper half of the data set, whereas Q1, the lower quartile, is the median of the lower half. Q2 is the overall median. Suppose we have n items in a data set. Then the quartiles are given by;
Q1 = [(n+1)/4]th item
Q2 = [(n+1)/2]th item
Q3 = [3(n+1)/4]th item
Variance
In layman's terms, the variance is a measure of how far a set of data points is spread out from the mean or average value. It is denoted as 'σ2' and computed as σ2 = Σ(xi − μ)2 / N.
Standard Deviation
The spread of statistical data is measured by the standard deviation. It measures the deviation of the data from its mean or average position. The degree of dispersion is computed by estimating the deviation of the data points. It is denoted by the symbol 'σ' and is the square root of the variance.
In R, a sequence of elements which share the same data type is known as a vector. A vector supports logical, integer, double, character, complex, or raw data types. The elements contained in a vector are known as the components of the vector. We can check the type of a vector with the help of the typeof() function.
The length is an important property of a vector. A vector's length is simply the number of elements in the vector, and it is calculated with the help of the length() function.
Vectors are classified into two parts, i.e., atomic vectors and lists. They have three common properties, i.e., type (typeof()), length (length()), and attributes (attributes()).
There is only one difference between atomic vectors and lists. In an atomic vector, all the
elements are of the same type, but in the list, the elements are of different data types. In
this section, we will discuss only the atomic vectors. We will discuss lists briefly in the next
topic.
How to create a vector in R?
In R, we use the c() function to create a vector. This function returns a one-dimensional array, or simply a vector. The c() function is a generic function which combines its arguments. All arguments are coerced to a common data type, which is the type of the returned value.
There are various other ways to create a vector in R, which are as follows:
We can create a vector with the help of the colon operator. There is the following syntax to
use colon operator:
z<-x:y
Example:
a<-4:-10
a
Output
[1] 4 3 2 1 0 -1 -2 -3 -4 -5 -6 -7 -8 -9 -10
In R, we can create a vector with the help of the seq() function. The seq() function creates a sequence of elements as a vector. It is used in two ways, i.e., by setting the step size with the 'by' parameter or by specifying the length of the vector with the 'length.out' parameter.
Example:
seq_vec<-seq(1,4,by=0.5)
seq_vec
class(seq_vec)
Output
[1] 1.0 1.5 2.0 2.5 3.0 3.5 4.0
[1] "numeric"
Example:
seq_vec<-seq(1,4,length.out=6)
seq_vec
class(seq_vec)
Output
[1] 1.0 1.6 2.2 2.8 3.4 4.0
[1] "numeric"
Atomic vectors in R
In R, there are four types of atomic vectors. Atomic vectors play an important role in Data
Science. Atomic vectors are created with the help of c() function. These atomic vectors are
as follows:
Numeric vector
The decimal values are known as numeric data types in R. If we assign a decimal value to
any variable d, then this d variable will become a numeric type. A vector which contains
numeric elements is known as a numeric vector.
Example:
d<-45.5
num_vec<-c(10.1, 10.2, 33.2)
d
num_vec
class(d)
class(num_vec)
Output
[1] 45.5
[1] 10.1 10.2 33.2
[1] "numeric"
[1] "numeric"
Integer vector
A non-fractional numeric value is known as integer data, represented by "Int"; in R, integers are stored as 32-bit (4-byte) values. There are two ways to assign an integer value to a variable, i.e., by using the as.integer() function or by appending L to the value.
Example:
d<-as.integer(5)
e<-5L
int_vec<-c(1,2,3,4,5)
int_vec<-as.integer(int_vec)
int_vec1<-c(1L,2L,3L,4L,5L)
class(d)
class(e)
class(int_vec)
class(int_vec1)
Output
[1] "integer"
[1] "integer"
[1] "integer"
[1] "integer"
Character vector
A character is held as a one-byte integer in memory. In R, there are two different ways to create a character data type value, i.e., using the as.character() function or typing a string between double quotes("") or single quotes('').
Example:
d<-'shubham'
e<-"Arpita"
f<-"65"
f<-as.character(f)
d
e
f
char_vec<-c(1,2,3,4,5)
char_vec<-as.character(char_vec)
char_vec1<-c("shubham","arpita","nishka","vaishali")
char_vec
char_vec1
class(d)
class(e)
class(f)
class(char_vec)
class(char_vec1)
Output
[1] "shubham"
[1] "Arpita"
[1] "65"
[1] "1" "2" "3" "4" "5"
[1] "shubham" "arpita" "nishka" "vaishali"
[1] "character"
[1] "character"
[1] "character"
[1] "character"
[1] "character"
Logical vector
The logical data types have only two values i.e., True or False. These values are based on
which condition is satisfied. A vector which contains Boolean values is known as the logical
vector.
Example:
d<-as.integer(5)
e<-as.integer(6)
f<-as.integer(7)
g<-d>e
h<-e<f
log_vec<-c(d<e, d<f, e<d,e<f,f<d,f<e)
g
h
log_vec
class(g)
class(h)
class(log_vec)
Output
[1] FALSE
[1] TRUE
[1] TRUE TRUE FALSE TRUE FALSE FALSE
[1] "logical"
[1] "logical"
[1] "logical"
Naming a vector
This example explains how to create a vector with names in the R programming language.
my_values <- 1:5 # Create vector of values
my_values # Print vector of values
# [1] 1 2 3 4 5
…and a vector containing the corresponding names to our numbers:
my_names <- letters[1:5] # Create vector of names
my_names # Print vector of names
# [1] "a" "b" "c" "d" "e"
Note that the length of the vector of numbers and the length of the vector of names needs to
be the same.
2) Arithmetic operations
We can perform all the arithmetic operations on vectors. The arithmetic operations are performed member-by-member on vectors. We can add, subtract, multiply, or divide two vectors. Let's see an example to understand how arithmetic operations are performed on vectors.
Example:
a<-c(1,3,5,7)
b<-c(2,4,6,8)
a+b
a-b
a*b
a/b
a%%b
Output
[1] 3 7 11 15
[1] -1 -1 -1 -1
[1] 2 12 30 56
[1] 0.5000000 0.7500000 0.8333333 0.8750000
[1] 1 3 5 7
[1] "TensorFlow" "PyTorch"
Start
"TensorFlow"
Next Top
← PrevNext →
Vector Arithmetics
Arithmetic operations of vectors are performed member-by-member, i.e., memberwise.
For example, suppose we have two vectors a and b.
> a = c(1, 3, 5, 7)
> b = c(1, 2, 4, 8)
Then, if we multiply a by 5, we would get a vector with each of its members multiplied by 5.
> 5 * a
[1] 5 15 25 35
And if we add a and b together, the sum would be a vector whose members are the sum of the corresponding members
from a and b.
> a + b
[1] 2 5 9 15
Similarly for subtraction, multiplication and division, we get new vectors via memberwise operations.
> a - b
[1] 0 1 1 -1
> a * b
[1] 1 6 20 56
> a / b
[1] 1.000 1.500 1.250 0.875
Recycling Rule
If two vectors are of unequal length, the shorter one will be recycled in order to match the longer vector. For example, the
following vectors u and v have different lengths, and their sum is computed by recycling values of the shorter vector u.
> u = c(10, 20, 30)
> v = c(1, 2, 3, 4, 5, 6, 7, 8, 9)
> u + v
[1] 11 22 33 14 25 36 17 28 39
Vector subsetting
In the R programming language, subsetting allows the user to access elements from an object. It takes out a portion of the object based on the condition provided. There are 4 ways of subsetting in R programming. Each of the methods depends on the usability of the user and the type of object.
Using the ‘[ ]’ operator, elements of vectors and observations from data frames can be accessed.
To neglect some indexes, ‘-‘ is used to access all other indexes of vector or data frame.
Program
# Create vector
x <- 1:15
# Print vector
cat("Original vector: ", x, "\n")
# Subsetting vector
cat("First 5 values of vector: ", x[1:5], "\n")
[[ ]] operator is used for subsetting of list-objects. This operator is the same as [ ] operator but the
only difference is that [[ ]] selects only one element whereas [ ] operator can select more than 1
element in a single command.
# Create list
ls <- list(a = 1, b = 2, c = 10, d = 20)
# Print list
cat("Original List: \n")
print(ls)
$a
[1] 1

$b
[1] 2
$c
[1] 10
$d
[1] 20
$ operator can be used for lists and data frames in R. Unlike [ ] operator, it selects only a single
observation at a time. It can be used to access an element in named list or a column in data frame.
$ operator is only applicable for recursive objects or list-like objects.
# Create list
ls <- list(a = 1, b = 2, c = "Hello", d = "GFG")
# Print list
cat("Original list:\n")
print(ls)
# Access a single component by name with the $ operator
cat("Using $ operator:\n")
print(ls$d)
$a
[1] 1

$b
[1] 2
$c
[1] "Hello"
$d
[1] "GFG"
Using $ operator:
[1] "GFG"
subset() function in R programming is used to create a subset of vectors, matrices, or data frames
based on the conditions provided in the parameters.
Syntax: subset(x, subset, select)
Parameters:
x: indicates the object
subset: indicates the logical expression on the basis of which subsetting has to be done
select: indicates columns to select
R Matrix
In R, a two-dimensional rectangular data set is known as a matrix. A matrix is created with
the help of the vector input to the matrix function. On R matrices, we can perform addition,
subtraction, multiplication, and division operation.
In an R matrix, elements are arranged in a fixed number of rows and columns, and the elements are typically real numbers. We create a matrix with the matrix() function, and all the elements of a matrix must share a common basic type.
Example
matrix1<-matrix(c(11, 13, 15, 12, 14, 16),nrow =2, ncol =3, byrow = TRUE)
matrix1
Output
     [,1] [,2] [,3]
[1,]   11   13   15
[2,]   12   14   16
data
The first argument in matrix function is data. It is the input vector which is the data elements
of the matrix.
Nrow:The second argument is the number of rows which we want to create in the matrix.
Ncol:The third argument is the number of columns which we want to create in the matrix.
Byrow: The byrow parameter is a logical value. If it is true, then the input vector elements are arranged by row.
dim_name:The dim_name parameter is the name assigned to the rows and columns.
R Arrays
In R, arrays are the data objects which allow us to store data in more than two dimensions. An array is created with the help of the array() function, which takes one or more vectors as input and uses the values given in the dim parameter to shape the array.
For example, if we create an array of dimension (2, 3, 4), then it will create 4 rectangular matrices, each with 2 rows and 3 columns.
R Array Syntax
There is the following syntax of R arrays:
array(data, dim = c(row_size, column_size, matrices), dimnames = NULL)
data
The data is the first argument in the array() function. It is an input vector which is given to
the array.
matrices
This parameter defines the number of matrices, i.e., the third dimension of the array.
row_size
This parameter defines the number of row elements which an array can store.
column_size
This parameter defines the number of columns elements which an array can store.
dim_names
This parameter is used to change the default names of rows and columns.
How to create?
In R, array creation is quite simple. We can easily create an array using vectors and the array() function; the data is stored in the form of matrices. There are only two steps: create the vectors, then pass them to the array() function. Let's see an example to understand how we can implement an array with the help of vectors and the array() function.
Example
vector1 <- c(1, 3, 5)
vector2 <- c(10, 11, 12, 13, 14, 15)
result <- array(c(vector1, vector2), dim = c(3, 3, 2))
print(result)
Output
, , 1
[,1] [,2] [,3]
[1,] 1 10 13
[2,] 3 11 14
[3,] 5 12 15
, , 2
[,1] [,2] [,3]
[1,] 1 10 13
[2,] 3 11 14
[3,] 5 12 15
R factors
A factor is a data structure used for fields that take only a predefined, finite number of values. These are variables which take a limited number of different values. Factors are the data objects used to categorize data and store it on multiple levels. They can store both integer and string values, and are useful for columns that have a limited number of unique values.
Factors have labels which are associated with the unique integers stored internally. A factor contains a predefined set of values known as levels, and by default R always sorts levels in alphabetical order.
Attributes of a factor
There are the following attributes of a factor in R
a. X
It is the input vector which is to be transformed into a factor.
b. levels
It is an input vector that represents a set of unique values which are taken by x.
c. labels
It is a character vector which corresponds to the number of labels.
d. Exclude
It is used to specify the values which we want to be excluded.
e. ordered
It is a logical attribute which determines if the levels are ordered.
f. nmax
It is used to specify the upper bound for the maximum number of levels.
R provides factor() function to convert the vector into factor. There is the following syntax of
factor() function
factor_data<- factor(vector)
Example
data <- factor(c("Shubham","Nishka","Arpita","Nishka","Shubham","Sumit","Nishka","Shubham","Sumit","Arpita","Sumit"))
print(data)
print(is.factor(data))
Output
[1] Shubham Nishka Arpita Nishka Shubham Sumit Nishka Shubham Sumit
[10] Arpita Sumit
Levels: Arpita Nishka Shubham Sumit
[1] TRUE
Example
The output below first prints the factor, then accesses a single element (e.g., data[2]), and then prints a modified version of the factor:
Output
[1] Shubham Nishka Arpita Nishka Shubham Sumit Nishka Shubham Sumit
[10] Arpita Sumit
Levels: Arpita Nishka Shubham Sumit
[1] Nishka
Levels: Arpita Nishka Shubham Sumit
[1] Shubham Nishka Arpita Shubham Sumit Nishka Shubham Sumit Arpita
[10] Sumit
Levels: Arpita Nishka Shubham Sumit
Syntax
gl(n, k, labels)
Following is the description of the parameters used −
n is an integer giving the number of levels.
k is an integer giving the number of replications.
labels is a vector of labels for the resulting factor levels.
Example
v <- gl(3, 4, labels = c("Tampa", "Seattle","Boston"))
print(v)
When we execute the above code, it produces the following result −
Tampa Tampa Tampa Tampa Seattle Seattle Seattle Seattle Boston Boston Boston Boston
Levels: Tampa Seattle Boston
R Data Frame
A data frame is a two-dimensional array-like structure or a table in which each column contains values of one variable, and each row contains one set of values from each column. A data frame is a special case of the list, in which each component has equal length.
A data frame is used to store data tables; the vectors which make up a data frame, present in the form of a list, are of equal length.
In a simple way, it is a list of equal length vectors. A matrix can contain one type of data, but
a data frame can contain different data types such as numeric, character, factor, etc.
Example
# Create the data frame.
emp.data <- data.frame(
  employee_id = c(1:5),
  employee_name = c("Shubham","Arpita","Nishka","Gunjan","Sumit"),
  sal = c(623.30, 915.20, 611.00, 729.00, 843.25),
  starting_date = as.Date(c("2012-01-01","2013-09-23","2014-11-15","2014-05-11",
  "2015-03-27")),
  stringsAsFactors = FALSE
)
print(emp.data)
Output
  employee_id employee_name    sal starting_date
1           1       Shubham 623.30    2012-01-01
2           2        Arpita 915.20    2013-09-23
3           3        Nishka 611.00    2014-11-15
4           4        Gunjan 729.00    2014-05-11
5           5         Sumit 843.25    2015-03-27
# Extract specific columns.
result <- data.frame(emp.data$employee_name, emp.data$sal)
print(result)
Add Column
Just add the column vector using a new column name.
# Create the data frame.
emp.data <- data.frame(
  emp_id = c(1:5),
  emp_name = c("Rick","Dan","Michelle","Ryan","Gary"),
  salary = c(623.3, 515.2, 611.0, 729.0, 843.25)
)
# Add a new column (illustrative values).
emp.data$dept <- c("IT","Operations","IT","HR","Finance")
print(emp.data)
Add Row
To add more rows permanently to an existing data frame, we need to bring in the new rows in the same
structure as the existing data frame and use the rbind() function.
In the example below we create a data frame with new rows and merge it with the existing data frame to
create the final data frame.
# Create the first data frame.
emp.data <- data.frame(
  emp_id = c(1:5),
  emp_name = c("Rick","Dan","Michelle","Ryan","Gary"),
  salary = c(623.3, 515.2, 611.0, 729.0, 843.25)
)
# Create a second data frame in the same structure (illustrative values)
# and merge it with rbind().
emp.newdata <- data.frame(
  emp_id = c(6:8),
  emp_name = c("Rasmi","Pranab","Tusar"),
  salary = c(578.0, 722.5, 632.8)
)
emp.finaldata <- rbind(emp.data, emp.newdata)
print(emp.finaldata)
Sorting a Data Frame
Sorting is done by indexing with the order() function:
dataframe[order(dataframe$column, decreasing = TRUE/FALSE), ]
where
dataframe is the input data frame;
the column name is the column by which the data frame is sorted;
the decreasing parameter specifies the sorting order: if TRUE, the data frame is sorted in decreasing order, otherwise in increasing order;
return type of order(): the index positions of the elements.
Output:
rollno subjects
1 1 java
2 5 python
3 4 php
4 2 sql
5 3 c
[1] "sort the data in decreasing order based on subjects "
rollno subjects
4 2 sql
2 5 python
3 4 php
1 1 java
5 3 c
[1] "sort the data in decreasing order based on rollno "
rollno subjects
2 5 python
3 4 php
5 3 c
4 2 sql
1 1 java
R Lists
In R, lists are the second type of vector. Lists are the objects of R which contain elements of
different types such as number, vectors, string and another list inside it. It can also contain a
function or a matrix as its elements. A list is a data structure which has components of
mixed data types. We can say, a list is a generic vector which contains other objects.
Example
vec <- c(3, 4, 5, 6)
char_vec <- c("shubham", "nishka", "gunjan", "sumit")
logic_vec <- c(TRUE, FALSE, FALSE, TRUE)
list_data <- list(vec, char_vec, logic_vec)
print(list_data)
Output:
[[1]]
[1] 3 4 5 6
[[2]]
[1] "shubham" "nishka" "gunjan" "sumit"
[[3]]
[1] TRUE FALSE FALSE TRUE
Lists creation
The process of creating a list is similar to that of a vector. Just as a vector is created with the c() function, the list() function is used to create a list in R. A list avoids the vector's restriction to a single data type: we can add elements of different data types to a list.
syntax
list()
list_1<-list(1,2,3)
list_2<-list("Shubham","Arpita","Vaishali")
list_3<-list(c(1,2,3))
list_4<-list(TRUE,FALSE,TRUE)
list_1
list_2
list_3
list_4
Output:
[[1]]
[1] 1
[[2]]
[1] 2
[[3]]
[1] 3
[[1]]
[1] "Shubham"
[[2]]
[1] "Arpita"
[[3]]
[1] "Vaishali"
[[1]]
[1] 1 2 3
[[1]]
[1] TRUE
[[2]]
[1] FALSE
[[3]]
[1] TRUE
list_data<-list("Shubham","Arpita",c(1,2,3,4,5),TRUE,FALSE,22.5,12L)
print(list_data)
In the above example, the list function will create a list with character, logical, numeric, and vector elements. It will give the following output:
Output:
[[1]]
[1] "Shubham"
[[2]]
[1] "Arpita"
[[3]]
[1] 1 2 3 4 5
[[4]]
[1] TRUE
[[5]]
[1] FALSE
[[6]]
[1] 22.5
[[7]]
[1] 12
1. Creating a list.
2. Assign a name to the list elements with the help of names() function.
3. Print the list data.
Let see an example to understand how we can give the names to the list elements.
Example
# Create a list containing a vector, a matrix and a list, then name its components.
list_data <- list(c("Shubham","Nishka","Gunjan"),
                  matrix(c(40,80,60,70,90,80), nrow = 2),
                  list("BCA","MCA","B. tech."))
names(list_data) <- c("Students","Marks","Course")
print(list_data)
Output:
$Students
[1] "Shubham" "Nishka" "Gunjan"
$Marks
[,1] [,2] [,3]
[1,] 40 60 90
[2,] 80 70 80
$Course
$Course[[1]]
[1] "BCA"
$Course[[2]]
[1] "MCA"
$Course[[3]]
[1] "B. tech."
Output:
$Student
[1] "Shubham" "Arpita" "Nishka"
$Marks
[,1] [,2] [,3]
[1,] 40 60 90
[2,] 80 70 80
$Course
$Course[[1]]
[1] "BCA"
$Course[[2]]
[1] "MCA"
$Course[[3]]
[1] "B. tech."
Example
# Creating a list containing a vector, a matrix and a list.
list_data <- list(c("Shubham","Arpita","Nishka"), matrix(c(40,80,60,70,90,80), nrow = 2),
list("BCA","MCA","B.tech"))
print(list_data)
Output:
[[1]]
[1] "Shubham" "Arpita"  "Nishka"

[[2]]
     [,1] [,2] [,3]
[1,]   40   60   90
[2,]   80   70   80

[[3]]
[[3]][[1]]
[1] "BCA"

[[3]][[2]]
[1] "MCA"

[[3]][[3]]
[1] "B.tech"
The unlist() function takes a list as a parameter and changes it into a vector. Let's see an example to understand how the unlist() function is used in R.
Example
# Creating lists.
list1 <- list(1:5)
print(list1)
list2 <- list(10:14)
print(list2)
# Converting the lists to vectors.
v1 <- unlist(list1)
v2 <- unlist(list2)
print(v1)
print(v2)
# Adding the resulting vectors.
result <- v1 + v2
print(result)
Output:
[[1]]
[1] 1 2 3 4 5
[[1]]
[1] 10 11 12 13 14
[1] 1 2 3 4 5
[1] 10 11 12 13 14
[1] 11 13 15 17 19
Merging Lists
R allows us to merge one or more lists into one list. Merging is done with the help of the list()
function also. To merge the lists, we have to pass all the lists into list function as a
parameter, and it returns a list which contains all the elements which are present in the lists.
Let see an example to understand how the merging process is done.
Example
list1 <- list(2, 4, 6, 8, 10)
list2 <- list(1, 3, 5, 7, 9)
merged_list <- list(list1, list2)
print(merged_list)
Output:
[[1]]
[[1]][[1]]
[1] 2
[[1]][[2]]
[1] 4
[[1]][[3]]
[1] 6
[[1]][[4]]
[1] 8
[[1]][[5]]
[1] 10
[[2]]
[[2]][[1]]
[1] 1
[[2]][[2]]
[1] 3
[[2]][[3]]
[1] 5
[[2]][[4]]
[1] 7
[[2]][[5]]
[1] 9
UNIT-IV
Operators in R
In computer programming, an operator is a symbol which represents an
action. An operator is a symbol which tells the compiler to perform
specific logical or mathematical manipulations. R programming is very rich
in built-in operators.
1. Arithmetic Operators
2. Relational Operators
3. Logical Operators
4. Assignment Operators
5. Miscellaneous Operators
Arithmetic Operators
Arithmetic operators are the symbols used to represent arithmetic operations. These operators act on each element of a vector. R supports the following arithmetic operators: +, -, *, /, %% (modulus), %/% (integer division), and ^ (exponent).
Relational Operators
A relational operator is a symbol which defines some kind of relation
between two entities. These include numerical equalities and inequalities. A
relational operator compares each element of the first vector with the
corresponding element of the second vector. The result of the comparison
will be a Boolean value. There are the following relational operators which
are supported by R:
Logical Operators
The logical operators allow a program to make a decision on the basis of
multiple conditions. In the program, each operand is considered as a
condition which can be evaluated to a false or true value. The value of the
conditions is used to determine the overall value of the op1 operator op2.
Logical operators are applicable to those vectors whose type is logical,
numeric, or complex.
The logical operator compares each element of the first vector with the
corresponding element of the second vector.
1. & This operator is known as the element-wise Logical AND operator. It compares each element of both vectors and returns TRUE where both elements are TRUE.
a <- c(3, 0, TRUE, 2+2i)
b <- c(2, 4, TRUE, 2+3i)
print(a&b)
[1] TRUE FALSE TRUE TRUE
2. | This operator is the element-wise Logical OR; it returns TRUE where at least one of the two elements is TRUE. With the same a and b, print(a|b) gives [1] TRUE TRUE TRUE TRUE.
3. ! This operator is the Logical NOT; it negates each element.
print(!a)
[1] FALSE TRUE FALSE FALSE
4. && This operator takes the first element of both vectors and gives TRUE as a result only if both are TRUE.
print(a&&b)
[1] TRUE
Assignment Operators
An assignment operator is used to assign a new value to a variable. In R, these operators are used to assign values to vectors. The assignment operators are <-, <<- and = (leftward), and -> and ->> (rightward).
Miscellaneous Operators
Miscellaneous operators are used for special, specific purposes rather than for general mathematical or logical computation. R supports miscellaneous operators such as : (creates a sequence), %in% (checks whether an element belongs to a vector, returning TRUE or FALSE), and %*% (matrix multiplication). For example, %*% can be used to multiply a matrix with its transpose:
M = matrix(c(1,2,3,4,5,6), nrow = 2, ncol = 3, byrow = TRUE)
T = M %*% t(M)
print(T)
     [,1] [,2]
[1,]   14   32
[2,]   32   77
CONDITIONAL STATEMENTS
1) IF STATEMENT
2) IF-ELSE STATEMENT
3) ELSE-IF STATEMENT
4) SWITCH STATEMENT
R if Statement
The if statement consists of a Boolean expression followed by one or more statements. The if statement is the simplest decision-making statement; it helps us take a decision on the basis of a condition. The block of code inside the if statement will be executed only when the Boolean expression evaluates to true. If the expression evaluates to false, the code after the if block will run instead.
if(boolean_expression) {
// If the boolean expression is true, then statement(s) will be executed.
}
Flow Chart
Let see some examples to understand how if statements work and perform a
certain task in R.
Example 1
x <- 24L
y <- "shubham"
if(is.integer(x))
{
   print("x is an Integer")
}
Output:
[1] "x is an Integer"
If-else statement
In the if statement, the inner code is executed when the condition is true.
The code which is outside the if block will be executed when the if condition
is false.
R programming treats any non-zero and non-null values as true, and if the
value is either zero or null, then it treats them as false.
if(boolean_expression) {
// statement(s) will be executed if the boolean expression is true.
} else {
// statement(s) will be executed if the boolean expression is false.
}
Flow Chart
Example 1
# local variable definition
a<- 100
#checking boolean condition
if(a<20){
# if the condition is true then print the following
cat("a is less than 20\n")
}else{
# if the condition is false then print the following
cat("a is not less than 20\n")
}
cat("The value of a is", a)
Output:
a is not less than 20
The value of a is 100
R else if statement
This statement is also known as the nested if-else statement. The if statement is followed by an optional else if ... else statement. This construct is used to test various conditions in a single if ... else if statement. There are some key points to keep in mind when using the if ... else if ... else statement:
1. if statement can have either zero or one else statement and it must come
after any else if's statement.
2. if statement can have many else if's statement and they come before the
else statement.
3. Once an else if statement succeeds, none of the remaining else
if's or else's will be tested.
if(boolean_expression 1) {
// This block executes when the boolean expression 1 is true.
} else if( boolean_expression 2) {
// This block executes when the boolean expression 2 is true.
} else if( boolean_expression 3) {
// This block executes when the boolean expression 3 is true.
} else {
// This block executes when none of the above condition is true.
}
Flow Chart
Example 1
age <- readline(prompt="Enter age: ")
age <- as.integer(age)
if(age < 18) {
   print("You are child")
} else if(age > 30) {
   print("You are old guy")
} else {
   print("You are adult")
}
Output (for an input of, e.g., 25):
Enter age: 25
[1] "You are adult"
R Switch Statement
A switch statement is a selection control mechanism that allows the value of
an expression to change the control flow of program execution via map and
search.
There are basically two ways in which one of the cases is selected:
1) Based on Index
If the cases are values like a character vector, and the expression evaluates to a number, then the expression's result is used as an index to select the case.
Flow Chart
Example 1
x <- switch(
3,
"Shubham",
"Nishka",
"Gunjan",
"Sumit"
)
print(x)
Output:
[1] "Gunjan"
ITERATIVE PROGRAMMING IN R
A) FOR LOOP
B) WHILE LOOP
C) LOOPING OVER A LIST
PROGRAM:
week <- c('Sunday',
          'Monday',
          'Tuesday',
          'Wednesday',
          'Thursday',
          'Friday',
          'Saturday')
for (day in week)
{
   print(day)
}
In the above program, initially, all the days(strings) of the week are assigned to
the vector week. Then for loop is used to iterate over each string in a week. In
each iteration, each day of the week is displayed.
WHILE LOOP
While loop is used when the exact number of iterations of loop is not known
beforehand. It executes the same code again and again until a stop condition is
met. While loop checks for the condition to be true or false n+1 times rather
than n times. This is because the while loop checks for the condition before
entering the body of the loop.
Example (reconstructed to match the output below):
v <- c("Hello", "while loop")
cnt <- 2
while (cnt < 7) {
   print(v)
   cnt = cnt + 1
}
When the above code is compiled and executed, it produces the following result −
[1] "Hello" "while loop"
[1] "Hello" "while loop"
[1] "Hello" "while loop"
[1] "Hello" "while loop"
[1] "Hello" "while loop"
LOOPING OVER LIST:
Looping over a list is just as easy and convenient as looping over a vector. There are three ways of looping over a list, sketched below.
output
[1] "Mavericks"
[1] "G" "F" "C"
[1] 3
output
[1] "Mavericks"
[1] "G"
[1] "F"
[1] "C"
[1] 3
output
[1] "Mavericks"
[1] "G"
[1] 3
FUNCTIONS IN R
A function is a set of statements organized together to perform a specific task. R has a large
number of in-built functions and the user can create their own functions.
In R, a function is an object so the R interpreter is able to pass control to the function, along with
arguments that may be necessary for the function to accomplish the actions.
The function in turn performs its task and returns control to the interpreter as well as any result
which may be stored in other objects.
Function Definition
An R function is created by using the keyword function. The basic syntax of an R function definition is as follows −
function_name <- function(arg_1, arg_2, ...) {
   # Function body
}
Function Components
The different parts of a function are −
Function Name − This is the actual name of the function. It is stored in R environment as
an object with this name.
Arguments − An argument is a placeholder. When a function is invoked, you pass a
value to the argument. Arguments are optional; that is, a function may contain no
arguments. Also arguments can have default values.
Function Body − The function body contains a collection of statements that defines what
the function does.
Return Value − The return value of a function is the last expression in the function body
to be evaluated.
R has many in-built functions which can be directly called in the program without defining them first. We can also create and use our own functions, referred to as user-defined functions.
Built-in Function
Simple examples of in-built functions are seq(), mean(), max(), sum(x) and paste(...) etc. They
are directly called by user written programs. You can refer most widely used R functions.
# Create a sequence of numbers from 32 to 44.
print(seq(32,44))
User-defined Function
We can create user-defined functions in R. They are specific to what a user wants, and once created they can be used like the built-in functions. Below is an example of how a function is created and used, ending with the call new.function(10).
Recursive functions call themselves. They break down the problem into smaller
components. The function() calls itself within the original function() on each of the smaller
components. After this, the results will be put together to solve the original problem.
Key Features of R Recursion
The use of recursion, often, makes the code shorter and it also looks clean.
It is a simple solution for a few cases.
It expresses in a function that calls itself.
Applications of Recursion in R
Recursive functions are used in many efficient programming techniques
like dynamic programming or divide and conquer algorithms.
Output
[1] 36
[1] 15
LOADING R PACKAGES
The most common method of installing and loading packages is using the install.packages() and library() functions respectively. A brief look at these functions –
install.packages() is used to install a required package in the R programming language.
Syntax:
install.packages("package_name")
library() is used to load a specific package in the R programming language.
Syntax:
library(package_name)
In the case where multiple packages have to be installed and loaded, these commands have to be specified repetitively, making the approach inefficient.
install.packages("ggplot2")
install.packages("dpylr")
install.packages("readxl")
library(ggplot2)
library(dpylr)
library(readxl)
The most efficient way to install R packages is by installing multiple packages at a time. For this we again use the install.packages() function, but this time we pass the packages to be installed as a character vector, with each package separated by a comma (,).
Syntax:
install.packages(c("package 1", "package 2", ..., "package n"))
Example:
install.packages(c("ggplot2", "dplyr", "readxl"))
Note that the packages must be passed as a vector with c(); in a call like install.packages("ggplot2", "dplyr"), the second argument would be interpreted as the library path, not as another package. Similarly, library() loads one package at a time. The pacman package can install and load multiple packages in one call:
pacman::p_load(dplyr, ggplot2, readxl)
Math Functions
R provides various mathematical functions to perform calculations. These functions are very helpful for finding absolute values, square roots, and much more. Commonly used ones include abs(), sqrt(), ceiling(), floor(), round(), exp(), log(), cos(), sin() and tan(), as sketched below.
Scope of a variable
The location where we can find a variable and also access it if required is called the scope
of a variable. There are mainly two types of variable scopes:
Global Variables: Global variables are those variables that exist throughout the
execution of a program. It can be changed and accessed from any part of the program.
Local Variables: Local variables are those variables that exist only within a certain
part of a program like a function and are released when the function call ends.
Global Variable
As the name suggests, Global Variables can be accessed from any part of the program.
They are available throughout the lifetime of a program.
They are declared anywhere in the program outside all of the functions or blocks.
Declaring global variables: Global variables are usually declared outside of all of the
functions and blocks. They can be accessed from any portion of the program.
# R program to illustrate
# usage of global variables
# global variable
global = 5
display = function()
{
print(global)
}
display()
# changing value of global variable
global = 10
display()
Output:
[1] 5
[1] 10
In the above code, the variable ‘global‘ is declared at the top of the program outside all of
the functions so it is a global variable and can be accessed or updated from anywhere in
the program.
Local Variable
Variables defined within a function or block are said to be local to those functions.
Local variables do not exist outside the block in which they are declared, i.e. they can
not be accessed or used outside that block.
Declaring local variables: Local variables are declared inside a block.
Example:
# usage of local variables
func = function()
{
age = 18
}
print(age)
Output:
Error in print(age) : object 'age' not found
The above program displays an error saying “object ‘age’ not found”. The variable age
was declared within the function “func()” so it is local to that function and not visible to the
portion of the program outside this function.
To correct the above error we have to display the value of variable age from the function
“func()” only.
Example:
func = function()
{
   age = 18
   print(age)
}
func()
Output:
[1] 18
Lexical Scoping
In lexical scoping, the scope of a variable is determined by the textual structure of a program. Most programming languages we use today are lexically scoped, including R and Python. Lexical scoping means that the location of a function's definition determines which variables the function has access to. Another name for lexical scoping is static scoping.
Under lexical scoping, a free variable inside a function is looked up in the environment where the function was defined (e.g., the global environment), not in the environment of the caller; to change such a variable from inside a function, it must be assigned globally (e.g., with <<-).
# R program to depict scoping
# (minimal reconstruction: the definitions of b and c are not shown in the notes)
# Assign a value to a
a <- 10
# b assigns a local a and then calls c
b <- function() {
   a <- 20
   c()
}
# c uses a as a free variable
c <- function() {
   print(a)
}
# Call to function b, which in turn calls c
b()
Output:
10
Under lexical scoping, c() looks up a in the environment where c was defined (the global environment), so it prints 10.
Dynamic Scoping
In dynamic scoping, a free variable takes the most recent value assigned to it along the calling chain, i.e., it is looked up in the environment of the caller.
Considering the same example as above: under dynamic scoping, c() would look up a in its caller b(), where a is 20.
# Assign a value to a
a <- 10
# Call to function b, which sets a local a to 20 and then calls c
b()
Output:
20
Lexical vs. Dynamic scoping:
Lexical: the variable refers to the top-level (defining) environment. Dynamic: the
variable is associated with the most recent (calling) environment.
Lexical: it is easy to find the scope by reading the code. Dynamic: the programmer has
to anticipate all possible calling contexts.
Lexical: it is a property of the program text and unrelated to the run-time stack.
Dynamic: it depends on the run-time stack rather than the program text.
UNIT- V
Data reduction techniques ensure the integrity of data while reducing its volume: they
obtain a reduced representation of the dataset that is much smaller in volume while
maintaining the integrity of the original data. By reducing the data, the efficiency of
the data mining process improves, and the analysis produces the same (or almost the
same) results as before the reduction.
Data reduction aims to represent the data more compactly. When the data size is smaller,
it is simpler to apply sophisticated and computationally expensive algorithms. The
reduction may be in terms of the number of rows (records) or the number of columns
(dimensions).
1. Dimensionality Reduction
Whenever we encounter weakly relevant data, we keep only the attributes required for our
analysis. Dimensionality reduction eliminates attributes from the data set under
consideration, thereby reducing the volume of the original data; it reduces data size by
eliminating outdated or redundant features. Three methods of dimensionality reduction
follow.
i. Wavelet Transform: In the wavelet transform, a data vector A is transformed into a
numerically different data vector A' such that both A and A' are of the same length.
The transform is useful for data reduction because the transformed data can be
truncated: a compressed approximation is obtained by retaining only a small fraction of
the strongest wavelet coefficients. The wavelet transform can be applied to data cubes,
sparse data, or skewed data.
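For illustration only, here is a minimal one-level Haar wavelet sketch in base R (real
applications would use a dedicated wavelet package); it shows how truncating weak
coefficients compresses the vector while still allowing an approximate reconstruction:
# One-level Haar transform of an even-length data vector A
x <- c(2, 4, 6, 8, 10, 12, 14, 16)
odd  <- x[seq(1, length(x), by = 2)]
even <- x[seq(2, length(x), by = 2)]
s <- (odd + even) / sqrt(2)  # approximation (smooth) coefficients
d <- (odd - even) / sqrt(2)  # detail coefficients
A_prime <- c(s, d)           # transformed vector A', same length as A
d[abs(d) < 2] <- 0           # truncate the weakest coefficients
# approximate reconstruction from the truncated coefficients
x_hat <- as.vector(rbind((s + d) / sqrt(2), (s - d) / sqrt(2)))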
ii. Principal Component Analysis: Suppose the data set to be analyzed contains tuples
with n attributes. Principal component analysis searches for k n-dimensional orthogonal
vectors (the principal components, with k <= n) that can best represent the data.
In this way, the original data can be cast onto a much smaller space, and dimensionality
reduction is achieved. Principal component analysis can be applied to sparse and skewed
data.
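A minimal sketch with base R's prcomp() on the built-in iris data (4 attributes
projected onto k = 2 principal components):
pca <- prcomp(iris[, 1:4], center = TRUE, scale. = TRUE)
summary(pca)            # proportion of variance captured per component
reduced <- pca$x[, 1:2] # the data cast onto a much smaller (2-D) space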
iii. Attribute Subset Selection: A large data set has many attributes, some of which are
irrelevant to data mining and some of which are redundant. Attribute subset selection
reduces the data volume and dimensionality by eliminating such redundant and irrelevant
attributes, while ensuring that we still get a good subset of the original attributes:
the resulting probability distribution of the data is as close as possible to the
original distribution obtained using all the attributes.
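One common way to search for a good attribute subset is greedy stepwise selection; a
minimal sketch with base R's step() on the built-in mtcars data, where AIC decides which
attributes to keep:
full <- lm(mpg ~ ., data = mtcars)           # model with all attributes
reduced_fit <- step(full, direction = "backward", trace = 0)
names(coef(reduced_fit))                     # the attributes retained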
2. Numerosity Reduction
The numerosity reduction reduces the original data volume and represents it in a much smaller form.
This technique includes two types: parametric and non-parametric numerosity reduction.
i. Parametric: A parametric technique assumes a model into which the data fits; only the
model parameters need to be stored instead of the actual data. Regression and log-linear
models are examples.
Regression and Log-Linear: Linear regression models a relationship between two
attributes by fitting a linear equation to the data set. Suppose we need to model a
linear function between two attributes:
y = wx + b
Here, y is the response attribute, and x is the predictor attribute. If we discuss in terms of data mining,
attribute x and attribute y are the numeric database attributes, whereas w and b are regression
coefficients.
Multiple linear regression models the response variable y as a linear function of two or
more predictor variables.
The log-linear model discovers relationships between two or more discrete attributes in
the database. Suppose we have a set of tuples presented in n-dimensional space; the
log-linear model can then be used to study the probability of each tuple in the
multidimensional space.
Regression and log-linear methods can be used for sparse data and skewed data.
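A minimal sketch in base R: the 50 rows of the built-in cars data are summarized by just
the two regression coefficients w and b:
fit <- lm(dist ~ speed, data = cars)
coef(fit) # (Intercept) = b, speed = w, so y = wx + b approximates the data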
ii. Non-Parametric: A non-parametric numerosity reduction technique does not assume any
model. It yields a more uniform reduction irrespective of the data size, but it may not
achieve as high a degree of reduction as a parametric technique. The main non-parametric
data reduction techniques are histograms, clustering, sampling, and data cube
aggregation (data compression is treated separately below).
Histogram: A histogram is a graph that represents a frequency distribution, i.e., how
often each value appears in the data. A histogram uses binning to represent the data
distribution of an attribute: the value range is partitioned into disjoint subsets
called bins or buckets.
Suppose we have the following data from the AllElectronics data set, containing the
prices of regularly sold items:
1, 1, 5, 5, 5, 5, 5, 8, 8, 10, 10, 10, 10, 12, 14, 14, 14, 15, 15, 15, 15, 15, 15, 18, 18, 18, 18, 18, 18, 18, 18,
20, 20, 20, 20, 20, 20, 20, 21, 21, 21, 21, 25, 25, 25, 25, 25, 28, 28, 30, 30, 30.
An equal-width histogram of these prices shows the frequency of the price distribution.
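The same histogram can be reproduced in base R with hist():
prices <- c(1, 1, 5, 5, 5, 5, 5, 8, 8, 10, 10, 10, 10, 12, 14, 14, 14,
            15, 15, 15, 15, 15, 15, 18, 18, 18, 18, 18, 18, 18, 18,
            20, 20, 20, 20, 20, 20, 20, 21, 21, 21, 21,
            25, 25, 25, 25, 25, 28, 28, 30, 30, 30)
hist(prices, breaks = seq(0, 30, by = 5), main = "Equal-width histogram")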
Clustering: Clustering techniques group similar objects from the data so that objects
within a cluster are similar to each other but dissimilar to objects in other clusters.
How similar objects within a cluster are can be measured with a distance function: the
more similar two objects are, the closer together they appear in the cluster.
The quality of a cluster is often judged by its diameter, i.e., the maximum distance
between any two objects in the cluster.
The cluster representation then replaces the original data. This technique is most
effective when the data can be partitioned into distinct clusters.
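A minimal sketch of clustering as data reduction with base R's kmeans(): the 150 iris
records are replaced by 3 cluster centroids that act as representatives:
set.seed(42)
km <- kmeans(iris[, 1:4], centers = 3)
km$centers # 3 representative points stand in for the original data
km$size    # number of objects in each cluster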
o Sampling: One of the methods used for data reduction is sampling, as it can reduce
a large data set to a much smaller data sample. The main methods of sampling a large
data set D containing N tuples are listed below; a base-R sketch follows the list.
a. Simple random sample without replacement (SRSWOR) of size s: s of the N
tuples are drawn from D (s < N), and a tuple already drawn cannot be drawn again.
b. Simple random sample with replacement (SRSWR) of size s: like SRSWOR,
except that a drawn tuple is recorded and placed back in D, so it may be drawn again.
c. Cluster sample: The tuples in data set D are grouped into M mutually
disjoint clusters. Data reduction can then be applied by taking a simple random
sample (SRSWOR) of s of these clusters, where s < M.
d. Stratified sample: The large data set D is partitioned into mutually disjoint
sets called strata, and a simple random sample is taken from each stratum to get
stratified data. This method is effective for skewed data.
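A minimal base-R sketch of these sampling methods on the built-in iris data (playing the
role of data set D with N = 150 tuples):
set.seed(1)
D <- iris
srswor <- D[sample(nrow(D), 15), ]                 # a. SRSWOR of size s = 15
srswr  <- D[sample(nrow(D), 15, replace = TRUE), ] # b. SRSWR of size s = 15
# d. stratified sample: a small SRS from each Species stratum
strat <- do.call(rbind, lapply(split(D, D$Species),
                               function(st) st[sample(nrow(st), 5), ]))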
3. Data Cube Aggregation
This technique aggregates data into a simpler form. Data cube aggregation is a
multidimensional aggregation that uses aggregation at various levels of a data cube to
represent the original data set, thus achieving data reduction.
For example, suppose you have the AllElectronics sales per quarter for the years 2018 to
2022. If you want the annual sales per year, you just have to aggregate the quarterly
sales for each year. In this way, aggregation provides the required data, which is much
smaller in size, and we thereby achieve data reduction without losing any information.
The data cube aggregation is a multidimensional aggregation that eases multidimensional
analysis: the data cube presents precomputed and summarized data, which makes data
mining fast to access.
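A minimal sketch of this aggregation in base R, on a hypothetical quarterly sales table
(the data frame and its column names are illustrative assumptions):
set.seed(7)
sales <- data.frame(year    = rep(2018:2022, each = 4),  # hypothetical data
                    quarter = rep(1:4, times = 5),
                    amount  = round(runif(20, 1e4, 5e4)))
aggregate(amount ~ year, data = sales, FUN = sum)        # annual sale per year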
4. Data Compression
Data compression modifies, encodes, or converts the structure of data in a way that
consumes less space: it builds a compact representation of information by removing
redundancy, typically representing the data in binary form. Compression from which the
data can be restored exactly is called lossless compression; compression from which the
original form cannot be fully restored is called lossy compression. Dimensionality
reduction and numerosity reduction methods can also be viewed as forms of data
compression.
This technique reduces the size of files using different encoding mechanisms, such as
Huffman encoding and run-length encoding. Based on their compression techniques, we can
divide the methods into two types.
i. Lossless Compression: Encoding techniques such as run-length encoding allow a simple
yet effective reduction of data size. Lossless data compression uses algorithms that
restore the precise original data from the compressed data.
ii. Lossy Compression: In lossy compression, the decompressed data may differ from the
original data but remain useful enough to retrieve information from. For example, the
JPEG image format uses lossy compression, yet we can still recover an image whose
meaning is equivalent to the original. Methods such as the discrete wavelet transform
and PCA (principal component analysis) are examples of lossy compression.
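A minimal lossless-compression sketch using base R's built-in run-length encoding:
x <- c(1, 1, 1, 2, 2, 3, 3, 3, 3)
enc <- rle(x)              # stored as lengths (3, 2, 4) and values (1, 2, 3)
all(inverse.rle(enc) == x) # TRUE: the original data are restored exactly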
5. Discretization Operation
The data discretization technique divides continuous attributes into data with
intervals: we replace the many raw values of an attribute with labels of small
intervals. As a result, mining results are presented in a concise and easily
understandable way.
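A minimal discretization sketch with base R's cut(), replacing continuous ages with
interval labels:
age <- c(5, 17, 23, 34, 41, 58, 66, 72)
cut(age, breaks = c(0, 18, 40, 65, Inf),
    labels = c("child", "youth", "middle-aged", "senior"))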
The main benefit of data reduction is simple: the more data you can fit into a terabyte
of disk space, the less capacity you need to purchase. Data reduction greatly increases
the efficiency of a storage system and directly reduces your total spending on capacity.
DATA VISUALIZATION
Pixel oriented visualization techniques:
A simple way to visualize the value of a dimension is to use a pixel where the color of
the pixel reflects the dimension’s value.
For a data set of m dimensions pixel oriented techniques create m windows on the
screen, one for each dimension.
The m dimension values of a record are mapped to m pixels at the corresponding
position in the windows.
The color of each pixel reflects the corresponding value.
Inside a window, the data values are arranged in some global order shared by all
windows.
E.g.: AllElectronics maintains a customer information table consisting of 4 dimensions:
income, credit_limit, transaction_volume, and age. We can analyze the correlation
between income and the other attributes by visualization.
We sort all customers by income in ascending order and use this order to lay out the
customer data in the 4 visualization windows, one per dimension.
The pixel colors are chosen so that the smaller the value, the lighter the shading.
Using pixel-based visualization, we can easily observe that credit_limit increases as
income increases, that customers whose income is in the middle range are more likely to
purchase more from AllElectronics, and that there is no clear correlation between income
and age.
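A minimal pixel-oriented sketch in base R on hypothetical customer data (the data frame
and its four dimensions are illustrative assumptions): records are sorted by income,
each dimension gets its own window, and lighter pixels correspond to smaller values:
set.seed(1)
cust <- data.frame(income = runif(400), credit_limit = runif(400),  # hypothetical data
                   transaction_volume = runif(400), age = runif(400))
cust <- cust[order(cust$income), ]  # global order shared by all windows
op <- par(mfrow = c(1, 4))          # one window per dimension
for (nm in names(cust))
  image(matrix(cust[[nm]], nrow = 20), axes = FALSE, main = nm,
        col = gray.colors(64, start = 0.95, end = 0.05))
par(op)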
Scatter-Plot Matrices
For k dimensions, a scatter-plot matrix contains k^2 plots. Of these, k are X-X plots
(the diagonal), and every X-Y plot of distinct dimensions X and Y appears in two
orientations (X vs Y and Y vs X).
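A scatter-plot matrix can be drawn in base R with pairs(); for the 4 numeric iris
dimensions this produces a 4 x 4 grid of plots, with each X-Y pair shown in both
orientations:
pairs(iris[, 1:4], col = as.integer(iris$Species))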
Parallel Coordinates
The scatter-plot matrix becomes less effective as the dimensionality increases. Another
technique, called parallel coordinates, can handle higher dimensionality:
n equidistant axes, parallel to one of the screen axes, correspond to the n dimensions
(attributes).
Each axis is scaled to the [minimum, maximum] range of the corresponding attribute.
Every data item corresponds to a polygonal line that intersects each axis at the point
corresponding to the value of that attribute.
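A minimal parallel-coordinates sketch using parcoord() from the MASS package (shipped
with R); each iris record becomes one polygonal line across the 4 scaled axes:
library(MASS)
parcoord(iris[, 1:4], col = as.integer(iris$Species))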
Chernoff Faces
A way to display variables on a two-dimensional surface: e.g., let x be eyebrow slant,
y be eye size, z be nose length, etc. The figure shows faces produced using 10
characteristics (head eccentricity, eye size, eye spacing, eye eccentricity, pupil size,
eyebrow slant, nose size, mouth shape, mouth size, and mouth opening); each
characteristic is assigned one of 10 possible values.
Stick Figures
In stick-figure visualization, dimensions such as gender and education are indicated by
the angle and/or length of the figure's limbs, and dense regions of the display show up
as a texture pattern.
General techniques
Tile bars: use small icons to represent the relevant feature vectors in document
retrieval.
Hierarchical Visualization
For a large data set of high dimensionality, it would be difficult to visualize all
dimensions at the same time. Hierarchical visualization techniques partition the
dimensions into subsets (i.e., subspaces), and the subspaces are visualized in a
hierarchical manner.
E.g.: To visualize a 6-D data set with dimensions F, X1, X2, X3, X4, X5, where we want
to observe how F changes w.r.t. the other dimensions, we can fix X3, X4, X5 at selected
values and visualize the changes of F w.r.t. X1 and X2.
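A related base-R sketch of this fix-and-visualize idea: coplot() holds slices of one
dimension fixed and shows how two others vary within each slice, here on the built-in
quakes data:
coplot(lat ~ long | depth, data = quakes)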
Most visualization techniques were designed mainly for numeric data. Recently, more and
more non-numeric data, such as text and social networks, have become available.
Many people on the Web tag various objects such as pictures, blog entries, and product
reviews. A tag cloud is a visualization of the statistics of user-generated tags: tags
are often listed alphabetically or in a user-preferred order, and the importance of a
tag is indicated by its font size or color.
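A minimal tag-cloud sketch, assuming the CRAN package wordcloud is installed
(install.packages("wordcloud")); tag importance is shown by font size:
library(wordcloud)
tags <- c("data", "science", "R", "mining", "visualization", "cloud")  # sample tags
freq <- c(50, 35, 30, 20, 15, 10)                                      # sample counts
wordcloud(words = tags, freq = freq, min.freq = 1)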