
DATA SCIENCE

Data science is the deep study of massive amounts of data. It involves extracting meaningful insights from raw, structured, and unstructured data using the scientific method, different technologies, and algorithms.

It is a multidisciplinary field that uses tools and techniques to manipulate data so that you can find something new and meaningful.

Data science uses powerful hardware, programming systems, and efficient algorithms to solve data-related problems. It is the future of artificial intelligence.

In short, we can say that data science is all about:

o Asking the correct questions and analyzing the raw data.


o Modeling the data using various complex and efficient
algorithms.
o Visualizing the data to get a better perspective.
o Understanding the data to make better decisions and
finding the final result.
1.2. What do data scientists do?

Turning data into actionable value usually involves answering questions


using data. Here’s a typical workflow for how that plays out in practice.

1. Obtain data that you hope will help answer the question.
2. Explore the data to understand it.
3. Clean and prepare the data for analysis.
4. Perform analysis, model building, testing, etc.
(The analysis is the step most people think of as data science, but
it’s just one step! Notice how much more there is that surrounds
it.)
5. Draw conclusions from your work.
6. Report those conclusions to the relevant stakeholders.

Data Science Components


Statistics:
Statistics is the most critical unit of data science. It is the science of collecting and analyzing numerical data in large quantities to get useful insights.

Visualization:
Visualization techniques help you present huge amounts of data as easy-to-understand, digestible visuals.

Machine Learning:
Machine learning explores the building and study of algorithms that learn to make predictions about unforeseen/future data.

Deep Learning:
Deep learning is a newer area of machine learning research in which the algorithm selects the analysis model to follow.
Data Science Process:
1. Discovery:
Discovery step involves acquiring data from all the identified internal &
external sources, which helps you answer the business question.

The data can be:

 Logs from webservers


 Data gathered from social media
 Census datasets
 Data streamed from online sources using APIs

2. Preparation:
Data can have many inconsistencies, like missing values, blank columns, or an incorrect data format, all of which need to be cleaned. You need to process, explore, and condition data before modelling. The cleaner your data, the better your predictions.

3. Model Planning:
In this stage, you need to determine the methods and techniques to draw the relationships between input variables. Planning for a model is performed by using different statistical formulas and visualization tools. SQL Analysis Services, R, and SAS/ACCESS are some of the tools used for this purpose.

4. Model Building:
In this step, the actual model building process starts. Here, the data scientist splits the dataset into training and testing sets. Techniques like association, classification, and clustering are applied to the training data set. The model, once prepared, is tested against the "testing" dataset.
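As a minimal sketch of this split, assuming R (introduced later in these notes) and R's built-in iris dataset, a 70/30 train/test partition can be drawn as follows:

set.seed(42)                                   # make the random split reproducible
n <- nrow(iris)                                # iris is a built-in R dataset
train_idx <- sample(n, size = round(0.7 * n))  # pick 70% of row indices at random
train <- iris[train_idx, ]                     # training set: used to fit the model
test  <- iris[-train_idx, ]                    # testing set: used to evaluate the model

Techniques such as classification or clustering would then be fitted on train and evaluated against test.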

5. Operationalize:
In this stage, you deliver the final baselined model with reports, code, and technical documents. The model is deployed into a real-time production environment after thorough testing.

6. Communicate Results
In this stage, the key findings are communicated to all stakeholders. This
helps you decide if the project results are a success or a failure based on
the inputs from the model.

Data Science Jobs Roles


The most prominent data science job titles are:
 Data Scientist
 Data Engineer
 Data Analyst
 Statistician
 Data Architect
 Data Admin
 Business Analyst
 Data/Analytics Manager

Let’s learn what each role entails in detail:

Data Scientist:
Role: A Data Scientist is a professional who manages enormous
amounts of data to come up with compelling business visions by using
various tools, techniques, methodologies, algorithms, etc.

Languages: R, SAS, Python, SQL, Hive, Matlab, Pig, Spark

Data Engineer:
Role: A data engineer works with large amounts of data. He or she develops, constructs, tests, and maintains architectures such as large-scale processing systems and databases.

Languages: SQL, Hive, R, SAS, Matlab, Python, Java, Ruby, C++, and Perl

Data Analyst:
Role: A data analyst is responsible for mining vast amounts of data. They look for relationships, patterns, and trends in the data, and then deliver compelling reporting and visualization so that the business can take the most viable decisions.

Languages: R, Python, HTML, JS, C, C++, SQL


Statistician:
Role: The statistician collects, analyses, and understands qualitative and
quantitative data using statistical theories and methods.

Languages: SQL, R, Matlab, Tableau, Python, Perl, Spark, and Hive

Data Administrator:
Role: A data administrator ensures that the database is accessible to all relevant users, that it performs correctly, and that it is kept safe from hacking.

Languages: Ruby on Rails, SQL, Java, C#, and Python

Business Analyst:
Role: A business analyst works to improve business processes and acts as an intermediary between the business executive team and the IT department.

Languages: SQL, Tableau, Power BI, and Python

Applications of Data Science


Some application of Data Science are:

Internet Search:
Google Search uses data science technology to return specific results within a fraction of a second.

Recommendation Systems:
Recommendation systems such as "suggested friends" on Facebook or "suggested videos" on YouTube are built with the help of data science.

Image & Speech Recognition:
Speech recognition systems like Siri, Google Assistant, and Alexa run on data science techniques. Moreover, Facebook recognizes your friends when you upload a photo with them, with the help of data science.

Gaming world:
EA Sports, Sony, Nintendo are using Data science technology. This
enhances your gaming experience. Games are now developed using
Machine Learning techniques, and they can update themselves when you
move to higher levels.

Online Price Comparison:


PriceRunner, Junglee, and Shopzilla work on data science mechanisms: data is fetched from the relevant websites using APIs.

Challenges of Data Science Technology

 A high variety of information and data is required for accurate analysis
 The available data science talent pool is not adequate
 Management does not provide financial support for a data science team
 Unavailability of, or difficult access to, data
 Business decision-makers do not effectively use data science results
 Explaining data science to others is difficult
 Privacy issues
 Lack of significant domain experts
 Very small organizations cannot support a data science team
BIG DATA AND DATA SCIENCE HYPE / Differentiate BIG DATA AND DATA SCIENCE

 Data Science is an area of study. Big Data is a technique to collect, maintain, and process huge amounts of information.
 Data Science is about the collection, processing, analysis, and use of data in various operations; it is more conceptual. Big Data is about extracting vital and valuable information from a huge amount of data.
 Data Science is a field of study, just like Computer Science, Applied Statistics, or Applied Mathematics. Big Data is a technique for tracking and discovering trends in complex data sets.
 The goal of Data Science is to build data-dominant products for a venture. The goal of Big Data is to make data more vital and usable, i.e., to extract only the important information from huge data within existing traditional aspects.
 Tools mainly used in Data Science include SAS, R, Python, etc. Tools mostly used in Big Data include Hadoop, Spark, Flink, etc.
 Data Science is a superset of Big Data, as it consists of scraping, cleaning, visualization, statistics, and many more techniques. Big Data is a subset of Data Science, as data mining activities sit within the Data Science pipeline.
 Data Science is mainly used for scientific purposes. Big Data is mainly used for business purposes and customer satisfaction.
 Data Science broadly focuses on the science of the data. Big Data is more involved with the processes of handling voluminous data.

THE DATA SCIENCE LANDSCAPE / CURRENT LANDSCAPE PERSPECTIVES

Data science is part of the computer sciences [1]. It comprises the disciplines of i) analytics, ii) statistics, and iii) machine learning.

The Data Science Landscape

Analytics
Analytics generates insights from data using simple presentation,
manipulation, calculation or visualization of data. In the context of data
science, it is also sometimes referred to as exploratory data analytics. It
often serves the purpose to familiarize oneself with the subject matter
and to obtain some initial hints for further analysis. To this end, analytics
is often used to formulate appropriate questions for a data science
project.

The limitation of analytics is that it does not necessarily provide any


conclusive evidence for a cause-and-effect relationship. Also, the
analytics process is typically a manual and time-consuming process
conducted by a human with limited opportunity for automation. In
today’s business world, many corporations do not go beyond descriptive
analytics, even though more sophisticated analytical disciplines can offer
much greater value, such as those laid out in the analytic value escalator.

Statistics
Statistics provides a methodological approach to answer questions raised
by the analysts with a certain level of confidence.
Descriptive Statistics
Descriptive statistics are a part of statistics that can be used to describe data.
It is used to summarize the attributes of a sample in such a way that a pattern
can be drawn from the group. It enables researchers to present data in a more
meaningful way such that easy interpretations can be made. Descriptive
statistics uses two tools to organize and describe data. These are given as
follows:

 Measures of Central Tendency - These help to describe the


central position of the data by using measures such
as mean, median, and mode.
 Measures of Dispersion - These measures help to see how
spread out the data is in a distribution with respect to a
central point. Range, standard deviation, variance, quartiles,
and absolute deviation are the measures of dispersion.
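A quick sketch of both groups of measures in R, using base functions only (the sample marks below are made up for illustration):

marks <- c(55, 60, 62, 65, 70, 70, 85)  # illustrative sample data
mean(marks)              # central tendency: arithmetic mean
median(marks)            # central tendency: middle value
quantile(marks)          # quartiles (Q0 = min ... Q4 = max)
max(marks) - min(marks)  # dispersion: range
var(marks)               # dispersion: variance
sd(marks)                # dispersion: standard deviation
# note: base R has no built-in function for the statistical mode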

Inferential Statistics
Inferential statistics is a branch of statistics that is used to make inferences
about the population by analyzing a sample. When the population data is very
large it becomes difficult to use it. In such cases, certain samples are taken
that are representative of the entire population. Inferential statistics draws
conclusions regarding the population using these samples. Sampling
strategies such as simple random sampling, cluster sampling, stratified
sampling, and systematic sampling, need to be used in order to choose
correct samples from the population. Some methodologies used in inferential
statistics are as follows:

 Hypothesis Testing - This technique involves the use of


hypothesis tests such as the z test, f test, t test, etc. to
make inferences about the population data. It requires
setting up the null hypothesis, alternative hypothesis, and
testing the decision criteria.
 Regression Analysis - Such a technique is used to check the
relationship between dependent and independent variables.
The most commonly used type of regression is linear
regression.

Machine Learning

Artificial intelligence refers to the broad idea that machines can perform tasks normally requiring human intelligence, such as visual perception, speech recognition, decision-making, and translation between languages. In the context of data science, machine learning can be considered a sub-field of artificial intelligence that is concerned with decision making. In fact, in its most essential form, machine learning is decision making at scale. Machine learning is the field of study of computer algorithms that allow computer programs to identify and extract patterns from data. A common purpose of machine learning algorithms is therefore to generalize and learn from data in order to perform certain tasks.

Supervised Machine Learning:


Supervised learning is a machine learning method in which
models are trained using labeled data. In supervised learning,
models need to find the mapping function to map the input
variable (X) with the output variable (Y).
Supervised learning needs supervision to train the model, just as a student learns in the presence of a teacher. Supervised learning can be used for two types of problems: Classification and Regression.


Unsupervised Machine Learning:


Unsupervised learning is another machine learning method, in which patterns are inferred from unlabeled input data. The goal of unsupervised learning is to find the structure and patterns in the input data. Unsupervised learning does not need any supervision; instead, it finds patterns in the data on its own.


Unsupervised learning can be used for two types of


problems: Clustering and Association.

Example: To understand unsupervised learning, consider a dataset of images of fruits. Unlike supervised learning, we will not provide any supervision to the model. We will just provide the input dataset to the model and allow the model to find the patterns in the data. With the help of a suitable algorithm, the model will train itself and divide the fruits into different groups according to the most similar features between them.
Statistical modeling
Data gathering is the foundation of statistical modeling. The data may
come from the cloud, spreadsheets, databases, or other sources. There are
two categories of statistical modeling methods used in data analysis.
These are:

Supervised learning

Supervised learning uses a learning data set with an answer key, which the algorithm uses to determine its accuracy as it trains on the data. Supervised learning techniques in statistical modeling include:
 Regression model: A predictive model designed to analyze the relationship between independent and dependent variables. The most common regression models are logistic, polynomial, and linear. These models are used to determine relationships between variables, for forecasting, and for modeling.
 Classification model: An algorithm analyzes and classifies a large and complex set of data points. Common models include decision trees, Naive Bayes, nearest neighbors, random forests, and neural network models.
Unsupervised learning

In the unsupervised learning model, the algorithm is given unlabeled data


and attempts to extract features and determine patterns
independently. Clustering algorithms and association rules are examples
of unsupervised learning. Here are two examples:
 K-means clustering: The algorithm combines a specified number of data points into specific groupings based on similarities.
 Association rules: The algorithm discovers relationships between variables in large databases, for example, products that are frequently bought together.
Linear Regression Algorithm: Linear regression is the most popular machine learning algorithm based on supervised learning. This algorithm works on regression, which is a method of modeling target values based on independent variables. It takes the form of a linear equation relating the set of inputs to the predicted output. This algorithm is mostly used in forecasting and predictions. Since it models a linear relationship between the input and output variables, it is called linear regression.

The equation below describes the relationship between the x and y variables:

y = mx + c

where:
y = dependent variable
x = independent variable
m = slope
c = intercept
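In R, a linear regression of this form can be fitted with the built-in lm() function; the built-in cars dataset (stopping distance versus speed) serves as a small sketch:

model <- lm(dist ~ speed, data = cars)  # fit dist = m*speed + c on built-in data
coef(model)                             # c (intercept) and m (slope)
predict(model, data.frame(speed = 21))  # forecast the distance for a new input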
Decision Tree (Classification): Decision Tree algorithm is another
machine learning algorithm, which belongs to the supervised learning
algorithm. This is one of the most popular machine learning algorithms.
It can be used for both classification and regression problems.

In the decision tree algorithm, we can solve the problem, by using tree
representation in which, each node represents a feature, each branch
represents a decision, and each leaf represents the outcome.

A typical example is a decision tree for a job-offer problem.

In the decision tree, we start from the root of the tree and compare the values of the root attribute with the record's attribute. On the basis of this comparison, we follow the branch corresponding to that value and then move to the next node. We continue comparing these values until we reach a leaf node with the predicted class value.
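As a brief sketch, assuming the rpart package is installed (the job-offer data itself is not shown in these notes, so the built-in iris data stands in):

library(rpart)                                            # CART decision trees (assumed installed)
fit <- rpart(Species ~ ., data = iris, method = "class")  # each node splits on a feature
print(fit)                                                # inspect the branches (decisions)
predict(fit, iris[1, ], type = "class")                   # the leaf reached gives the predicted class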

K-Means Clustering: K-means clustering is one of the most popular


algorithms of machine learning, which belongs to the unsupervised
learning algorithm. It solves the clustering problem.

If we are given a data set of items, with certain features and values, and
we need to categorize those set of items into groups, so such type of
problems can be solved using k-means clustering algorithm.

The k-means clustering algorithm aims at minimizing an objective function known as the squared error function, which is given as:

J(V) = Σ (j = 1 to C) Σ (i = 1 to cj) ( ||xi - vj|| )²

where:
J(V) = the objective (squared error) function
||xi - vj|| = the Euclidean distance between data point xi and cluster centre vj
cj = the number of data points in the j-th cluster
C = the number of clusters
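Base R's kmeans() function minimizes exactly this within-cluster squared error; a short sketch on the numeric columns of the built-in iris data:

set.seed(1)                             # k-means starts from random centres
km <- kmeans(iris[, 1:4], centers = 3)  # C = 3 clusters
km$centers                              # the cluster centres vj
km$tot.withinss                         # the minimised objective J(V)
table(km$cluster)                       # number of points cj in each cluster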

Populations and samples
In statistics, a population is the entire set of items from which you draw data for a statistical study. It can be a group of individuals, a set of items, etc. It makes up the data pool for a study.
Generally, population refers to the people who live in a particular area
at a specific time. But in statistics, population refers to data on your
study of interest. It can be a group of individuals, objects, events,
organizations, etc. You use populations to draw conclusions.
An example of a population would be the entire student body at a
school. It would contain all the students who study in that school at the
time of data collection. Depending on the problem statement, data from
each of these students is collected. An example is the students who
speak Hindi among the students of a school.

For the above situation, it is easy to collect data. The population is small
and willing to provide data and can be contacted. The data collected will
be complete and reliable.

If you had to collect the same data from a larger population, say the
entire country of India, it would be impossible to draw reliable
conclusions because of geographical and accessibility constraints, not to
mention time and resource constraints. A lot of data would be missing or
might be unreliable. Furthermore, due to accessibility issues,
marginalized tribes or villages might not provide data at all, making the
data biased towards certain regions or groups.

What is a Sample?

A sample is defined as a smaller and more manageable representation of a larger group: a subset of a larger population that contains the characteristics of that population. A sample is used in statistical testing when the population size is too large for all members or observations to be included in the test.

The sample is an unbiased subset of the population that best represents


the whole data.

To overcome the restraints of a population, you can sometimes collect


data from a subset of your population and then consider it as the general
norm. You collect the subset information from the groups who have
taken part in the study, making the data reliable. The results obtained for
different groups who took part in the study can be extrapolated to
generalize for the population.

The process of collecting data from a small subsection of the population


and then using it to generalize over the entire set is called Sampling.

Samples are used when :

 The population is too large to collect data.

 The data collected is not reliable.

 The population is hypothetical and is unlimited in size. Take the


example of a study that documents the results of a new medical
procedure. It is unknown how the procedure will affect people across
the globe, so a test group is used to find out how people react to it.

Population vs. Sample examples:

 Population: all residents of a country. Sample: all residents who live above the poverty line.
 Population: all residents above the poverty line in a country. Sample: all residents who are millionaires.
 Population: all employees in an office. Sample: out of all the employees, all managers in the office.

How to Collect Data From a Population?

You collect data from a population when your research question needs
an extensive amount of data or information about every member of the
population is available. You use population data when the data pool is
small and cooperative to give all the required information. For larger
populations, you use Sampling to represent parts of the population
from which it is hard to collect data.
How to Collect Data From a Sample?

Samples are used when the population is large, scattered, or if it's hard
to collect data on individual instances within it. You can then use a
small sample of the population to make overall hypotheses.

Samples should be randomly selected and should represent the entire


population and every class within it. To ensure this, statistical methods
such as probability sampling, are used to collect random samples from
every class within the population. This will reduce sampling bias and
increase validity.

Consider the polls conducted during election season to gauge the


public support for various political parties all over the nation. It is
impossible to ask millions of voters who their preferred candidate is,
so they collect the opinions of a few hundred or thousand people
from different sectors of the voting population.
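A toy sketch of such a poll in R, drawing a simple random sample from a hypothetical frame of voter IDs:

voters <- 1:100000                   # hypothetical sampling frame of voter IDs
set.seed(7)
poll <- sample(voters, size = 1000)  # simple random sample, without replacement
length(poll)                         # 1000 respondents stand in for the population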
STATISTICAL INFERENCE
Statistical inference is the process of using data analysis to infer
properties of an underlying distribution of probability. Statistical inference
is the process of analyzing the result and making conclusions from data
subject to random variation. It is also called inferential statistics.

Statistical inference is a method of making decisions about the


parameters of a population, based on random sampling.

Using data analysis and statistics to make conclusions about


a population is called statistical inference.

The main types of statistical inference are:

I) Estimation
a) Point Estimation
b) Interval Estimation
II) Hypothesis testing

Point Estimation

Point estimators are functions that are used to find an


approximate value of a population parameter from random
samples of the population. They use the sample data of a
population to calculate a point estimate or a statistic that
serves as the best estimate of an unknown parameter of a
population.

A point estimator of a population is a function of the sample information that produces a single number, called a point estimate.

Point estimate formulae: the sample mean x̄ = (Σ xi) / n is the point estimate of the population mean μ, and the sample proportion p̂ = x / n is the point estimate of the population proportion p.
Interval Estimation

Interval estimation is the use of sample data to calculate an interval of


possible (or probable) values of an unknown population parameter, in
contrast to point estimation, which is a single number.

Formula (confidence interval for the mean): x̄ ± Z × (σ / √n)

Confidence level: normally 90%, 95%, 99%, etc.

Z0.90 = 1.645

Z0.95 = 1.96

Z0.98 = 2.326

Z0.99 = 2.576

Example:
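As a worked illustration with assumed numbers: suppose a sample of n = 100 observations has mean x̄ = 50 and the population standard deviation is σ = 5. Then σ/√n = 5/10 = 0.5, and the 95% interval estimate is

x̄ ± Z0.95 × (σ/√n) = 50 ± 1.96 × 0.5 = 50 ± 0.98

i.e. we are 95% confident that the population mean lies between 49.02 and 50.98.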

II) Hypothesis Testing

Hypothesis Testing is a type of statistical analysis in which you put your


assumptions about a population parameter to the test. It is used to
estimate the relationship between 2 statistical variables.

Let's discuss a few examples of statistical hypotheses from real life:

 A teacher assumes that 60% of his college's students come from


lower-middle-class families.

 A doctor believes that 3D (Diet, Dose, and Discipline) is 90% effective


for diabetic patients.
Types of hypothesis testing in statistics.
Null Hypothesis and Alternate Hypothesis
The Null Hypothesis is the assumption that the event will not occur. A
null hypothesis has no bearing on the study's outcome unless it is
rejected.

H0 is the symbol for it, and it is pronounced H-naught.

The Alternate Hypothesis is the logical opposite of the null hypothesis.


The acceptance of the alternative hypothesis follows the rejection of the
null hypothesis. H1 is the symbol for it.

Let's understand this with an example.

A sanitizer manufacturer claims that its product kills 95 percent of germs


on average. To put this company's claim to the test, create a null and
alternate hypothesis.

H0 (Null Hypothesis): Average = 95%.

Alternative Hypothesis (H1): The average is less than 95%.

Another straightforward example to understand this concept is


determining whether or not a coin is fair and balanced.

The null hypothesis states that the probability of getting heads is equal to the probability of getting tails.
In contrast, the alternate hypothesis states that the probabilities of heads and tails would be very different.

Simple and Composite Hypothesis Testing

Depending on the population distribution, you can classify the statistical


hypothesis into two types.

Simple Hypothesis: A simple hypothesis specifies an exact value for the


parameter.

Composite Hypothesis: A composite hypothesis specifies a range of


values.

Example:

A company is claiming that their average sales for this quarter are 1000
units. This is an example of a simple hypothesis.

Suppose the company claims that the sales are in the range of 900 to
1000 units. Then this is a case of a composite hypothesis.

One-Tailed and Two-Tailed Hypothesis Testing

In a one-tailed test, the critical distribution area is one-sided, meaning


the test sample is either greater or lesser than a specific value.
In a two-tailed test, the test sample is checked for being greater or less than a range of values, implying that the critical distribution area is two-sided.

If the sample falls within the critical region, the null hypothesis will be rejected and the alternate hypothesis accepted.

Example:

Suppose H0: mean = 50 and H1: mean not equal to 50

According to the H1, the mean can be greater than or less than 50. This
is an example of a Two-tailed test.

Similarly, if H0: mean >= 50, then H1: mean < 50.

Here the alternative considers only values less than 50, so it is called a one-tailed test.
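In R, both variants can be sketched with the built-in t.test() function (the sample values below are made up):

x <- c(48, 52, 51, 49, 50, 53, 47, 51)         # assumed sample data
t.test(x, mu = 50, alternative = "two.sided")  # two-tailed: H1: mean != 50
t.test(x, mu = 50, alternative = "less")       # one-tailed: H1: mean < 50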

Type 1 and Type 2 Error

A hypothesis test can result in two types of errors.

Type 1 Error: A Type-I error occurs when the sample results lead to rejecting the null hypothesis even though it is true.

Type 2 Error: A Type-II error occurs when the null hypothesis is not rejected even though it is false, the opposite of a Type-I error.
Example:

Suppose a teacher evaluates the examination paper to decide whether a


student passes or fails.

H0: Student has passed

H1: Student has failed

Type I error will be the teacher failing the student [rejects H0] although
the student scored the passing marks [H0 was true].

Type II error will be the case where the teacher passes the student [does not reject H0] although the student did not score the passing marks [H1 is true].
Probability Distributions
The probability distribution is a way to represent possible values a
variable may take and their respective probability.

Discrete Distributions
As its name suggests, a discrete distribution is a distribution where the observation can take only a finite number of values. For example, the roll of a die can only result in the values 1 to 6, and the gender of a species takes one of a fixed set of values. It is fairly common to have discrete variables in a real-world data set, be it gender, age group, or the number of visitors to a place at a particular time. There are many other discrete distributions, but we will focus on the most common and important ones.

Continuous distribution

A continuous distribution describes the probabilities of a continuous


random variable's possible values. A continuous random variable has an
infinite and uncountable set of possible values (known as the range). The
mapping of time can be considered as an example of the continuous
probability distribution. It can be from 1 second to 1 billion seconds, and
so on.
Discrete Distributions

Bernoulli Distribution
Bernoulli Distribution can safely be assumed to be the simplest of the discrete distributions. Consider the example of flipping an unbiased coin. You either get a head or a tail. If we treat one of them as the outcome of interest (caring only about Head/Tail), the outcome will only be 0 (failure) or 1 (success). As it is an unbiased coin, the probability assigned to each outcome is 0.5. Remember, the outcome is always binary: True/False, Head/Tail, Success/Failure, etc.

Let's consider a random variable X with only one parameter p, which represents the probability of occurrence of the event. The probability mass function (PMF) of the Bernoulli distribution is given as:

P[X = 1] = p
P[X = 0] = 1 - p

where X = 1 indicates the event has occurred and X = 0 indicates the event did not occur. The mean is given by p and the variance by p(1 - p).

In a plot of this PMF with p = 0.4, the probability of failure (0) on the x-axis is 0.6 and the probability of success (1) is 0.4.
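A Bernoulli trial can be simulated in R as a binomial with size = 1; a brief sketch with p = 0.4 as in the description above:

set.seed(3)
trials <- rbinom(n = 10, size = 1, prob = 0.4)  # 10 Bernoulli trials with p = 0.4
trials                                          # each outcome is 0 (failure) or 1 (success)
mean(trials)  # sample estimate of p; theoretical mean = p, variance = p*(1 - p)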

Binomial Distribution
Binomial Distribution is simply an extension of Bernoulli distribution. If
we repeat Bernoulli trials for n times, we will get a Binomial
distribution. If we want to model the number of successes in n trials, we
use Binomial Distribution. As each unit of Binomial is a Bernoulli trial,
the outcome is always binary. The observations are independent of each
other.
 Binomial distribution is a discrete distribution.
 Binomial distribution is used to represent the probability of x successes in n trials, given a success probability p in each trial.
 If the distribution satisfies the conditions below, it is called a binomial distribution:
1. There should be a fixed number of trials.
2. Each trial should have only two possible outcomes.
3. The trials should be independent of each other.
4. The probability of success (and failure) should remain the same in each trial.
 A few properties of the binomial distribution to remember:
1. Expected value = mean = np
2. Variance = npq

The probability mass function is given by:

P(X = x) = C(n, x) × p^x × (1 - p)^(n - x),  for x = 0, 1, ..., n

where C(n, x) = n! / (x! (n - x)!) and p is the probability of success. As the binomial distribution is a Bernoulli trial taken n times, the mean and variance are given by mean = np and variance = np(1 - p) = npq.

 For a coin tossed N times, the binomial distribution can be used to model the probability of the number of successes (say, heads). For example, for a coin tossed 10 times, the binomial distribution can model the probability of each number of heads (0 to 10), and its shape changes with different values of n and p.
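R's dbinom() function computes this PMF directly; for the 10-toss coin example:

dbinom(5, size = 10, prob = 0.5)     # P(exactly 5 heads in 10 fair tosses), about 0.246
dbinom(0:10, size = 10, prob = 0.5)  # the whole PMF for 0..10 heads
# mean = n*p = 5 and variance = n*p*q = 2.5 for n = 10, p = q = 0.5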

Poisson Distribution

Poisson Distribution describes the probability of a given number of


events occurring in a fixed interval, for example, the number of unique
pageviews on an article on a given day or the number of customers
visiting a florist shop at a particular time. It is not just limited to time
intervals, and we can also extend its use to the area, length and volume
intervals. For example, total rainfalls in a particular area.
The probability mass function for the Poisson distribution is given by:

P(X = k) = (λ^k × e^(-λ)) / k!,  for k = 0, 1, 2, ...

Here, lambda (λ) is the shape parameter that describes the average number of events in the interval. Lambda is also both the mean and the variance of the distribution.

e (Euler's constant) ≈ 2.718
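In R the Poisson PMF is available as dpois(); for example, with an average rate of λ = 2 events per interval:

dpois(3, lambda = 2)  # P(exactly 3 events) when the average is 2, about 0.180
ppois(4, lambda = 2)  # P(at most 4 events), about 0.947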

Normal distribution:

 Normal distribution is the most important distribution because it fits many natural phenomena, for instance height, blood pressure, and IQ scores.

 Normal distribution is also called the Gaussian distribution.

 Let's consider a random variable X that belongs to a normal distribution with mean μ and standard deviation σ. If we plot the histogram or PDF (probability density function) of the random variable, it looks like a bell curve.

 The following three important properties of the normal distribution are also called the empirical rule:

1. The probability that the variable falls within 1 standard deviation of the mean, i.e. in the range (μ - σ, μ + σ), is 68%. This means 68% of the data points of X fall within 1 standard deviation of the mean.

2. The probability that the variable falls within 2 standard deviations of the mean is 95%.

3. The probability that the variable falls within 3 standard deviations of the mean is 99.7%.
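These percentages can be checked in R with the standard normal CDF pnorm():

pnorm(1) - pnorm(-1)  # P(μ - σ < X < μ + σ), about 0.6827
pnorm(2) - pnorm(-2)  # within 2 standard deviations, about 0.9545
pnorm(3) - pnorm(-3)  # within 3 standard deviations, about 0.9973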

Uniform Distribution:
 A distribution is said to be uniform if all the outcomes of the event have equal probabilities.
 The uniform distribution is also called the rectangular distribution.
 The expected value of a uniform distribution provides no particularly useful information.
 Since each outcome is equally likely, both the mean and the variance are hard to interpret.
 It has no predictive power.

Lognormal distribution: A continuous distribution in which the logarithm of a variable has a normal distribution. In other words, the lognormal distribution is a probability distribution with a normally distributed logarithm: a random variable is log-normally distributed if its logarithm is normally distributed. Examples include:

 Survival time of bacteria in disinfectants
 The weight and blood pressure of humans
 Size distributions of rainfall droplets
 The volume of gas in a petroleum reserve

If X is such a random variable, then Y = ln(X) follows a normal distribution, where ln denotes the natural logarithm of the X values. The size distribution of rain droplets, for example, can be plotted using a lognormal distribution.
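A brief simulation sketch in R: draw log-normal values and confirm that their logarithm is approximately normal:

set.seed(9)
x <- rlnorm(1000, meanlog = 0, sdlog = 1)  # log-normally distributed sample
y <- log(x)                                # the log of a log-normal variable is normal
mean(y); sd(y)                             # close to meanlog = 0 and sdlog = 1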
Model Fitting / Fitting a Model
Model fitting is the measure of how well a machine learning model
generalizes data similar to that with which it was trained. A good model
fit refers to a model that accurately approximates the output when it is
provided with unseen inputs.
Fitting refers to adjusting the parameters in the model to improve
accuracy. The process involves running an algorithm on data for which
the target variable (“labeled” data) is known to produce a machine
learning model. Then, the model’s outcomes are compared to the real,
observed values of the target variable to determine the accuracy.
The next step involves adjusting the algorithm’s standard parameters in
order to reduce the level of error and make the model more accurate
when determining the relationship between the features and the target
variable. This process is repeated several times until the model finds the
optimal parameters to make predictions with substantial accuracy.

Overfitting and Underfitting


Overfitting negatively impacts the performance of the model on new
data. It occurs when a model learns the details and noise in the training
data too efficiently. When random fluctuations or the noise in the
training data are picked up and learned as concepts by the model, the
model “overfits”. It will perform well on the training set, but very poorly
on the test set. This negatively impacts the model’s ability to generalize
and make accurate predictions for new data.
Underfitting happens when the machine learning model cannot
sufficiently model the training data nor generalize new data. An underfit
machine learning model is not a suitable model; this will be obvious as it
will have a poor performance on the training data.
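A small R sketch of overfitting, using made-up linear data plus noise: the overly flexible polynomial achieves a lower training error only by chasing the noise, so it generalizes worse.

set.seed(5)
x <- seq(0, 1, length.out = 20)
y <- 2 * x + rnorm(20, sd = 0.2)  # truly linear data plus random noise
fit_good <- lm(y ~ x)             # model matched to the real pattern
fit_over <- lm(y ~ poly(x, 15))   # overly flexible model memorises the noise
sum(resid(fit_good)^2)            # training error of the simple model
sum(resid(fit_over)^2)            # smaller training error, but worse on new data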
Introduction to R programming

R is an open-source programming language that is widely used as statistical software and as a data analysis tool. R generally comes with a command-line interface and is available across widely used platforms like Windows, Linux, and macOS. It remains a cutting-edge tool for statistical computing.
It was designed by Ross Ihaka and Robert Gentleman at the University of Auckland, New Zealand, and is currently developed by the R Development Core Team. The R programming language is an implementation of the S programming language, combined with lexical scoping semantics inspired by Scheme. The project was conceived in 1992, with an initial version released in 1995 and a stable beta version in 2000.

Why R Programming Language?


 R programming is used as a leading tool for machine learning, statistics, and data analysis. Objects, functions, and packages can easily be created in R.
 It's a platform-independent language. This means it can be applied on all operating systems.
 It's an open-source, free language. That means anyone can install it in any organization without purchasing a license.
 R is not only a statistics package; it also allows us to integrate with other languages (C, C++). Thus, you can easily interact with many data sources and statistical packages.
 The R programming language has a vast community of users, and it's growing day by day.
 R is currently one of the most requested programming languages in the data science job market, which makes it a hot skill nowadays.

History of R Programming
The history of R goes back about 20-30 years. R was developed by Ross Ihaka and Robert Gentleman at the University of Auckland, New Zealand, and the R Development Core Team currently develops it. The language's name comes from the first letters of both developers' first names. The project was conceived in 1992; the initial version was released in 1995, and a stable beta version followed in 2000.
Features of R programming
R is a domain-specific programming language aimed at data analysis. It has some unique features which make it very powerful, arguably the most important being its notation for vectors. Vectors allow us to perform a complex operation on a set of values in a single command. R has the following features:

1. It is a simple and effective programming language which has been


well developed.
2. It is data analysis software.
3. It is a well-designed, easy, and effective language which supports user-defined functions, looping, conditionals, and various I/O facilities.
4. It has a consistent and integrated set of tools which are used for data analysis.
5. R contains a suite of operators for different types of calculations on arrays, lists, and vectors.
6. It provides effective data handling and storage facility.
7. It is an open-source, powerful, and highly extensible software.
8. It provides highly extensible graphical techniques.
9. It allows us to perform multiple calculations using vectors.
10. R is an interpreted language.

ENVIRONMENT SETUP

Local Environment Setup


If you want to set up your own environment for R, you can follow the steps given below.
Windows Installation
You can download the Windows installer version of R from R-3.2.2 for
Windows (32/64 bit) and save it in a local directory.
It is a Windows installer (.exe) with a name like "R-version-win.exe". You can just double-click and run the installer, accepting the default settings. If your Windows is a 32-bit version, it installs the 32-bit version. But if your Windows is 64-bit, then it installs both the 32-bit and 64-bit versions.
After installation, you can locate the icon to run the program in a directory structure "R\R-3.2.2\bin\i386\Rgui.exe" under Windows Program Files. Clicking this icon brings up the R GUI, which is the R console for doing R programming.
Linux Installation
R is available as a binary for many versions of Linux at the location R
Binaries.
The instructions for installing R vary from one Linux flavor to another. These steps are mentioned under each type of Linux version at the mentioned link.
However, if you are in a hurry, then you can use yum command to
install R as follows −
$ yum install R

R Command Prompt
Once you have R environment setup, then it’s easy to start your R
command prompt by just typing the following command at your
command prompt −
$R
This will launch R interpreter and you will get a prompt > where you
can start typing your program as follows −
> myString <- "Hello, World!"
> print ( myString)
[1] "Hello, World!"
Here, the first statement defines a string variable myString and assigns it the string "Hello, World!"; the next statement uses print() to print the value stored in the variable myString.

R Script File

Usually, you will do your programming by writing your programs in script files, and then you execute those scripts at your command prompt with the help of the R interpreter called Rscript. So let's start with writing the following code in a text file called test.R:
# My first program in R Programming
myString <- "Hello, World!"

print ( myString)

Save the above code in the file test.R and execute it at the Linux command prompt as given below. Even if you are using Windows or another system, the syntax will remain the same.
$ Rscript test.R
When we run the above program, it produces the following result.
[1] "Hello, World!"

Data types in R programming


There are many types of R-objects. The frequently used ones are −
 Vectors
 Lists
 Matrices
 Arrays
 Factors
 Data Frames
The simplest of these objects is the vector object and there are six data
types of these atomic vectors, also termed as six classes of vectors. The
other R-Objects are built upon the atomic vectors.

Data Type / Example / Verify:

Logical. Example: TRUE, FALSE. Verify:
v <- TRUE
print(class(v))
it produces the following result −
[1] "logical"

Numeric. Example: 12.3, 5, 999. Verify:
v <- 23.5
print(class(v))
it produces the following result −
[1] "numeric"

Integer. Example: 2L, 34L, 0L. Verify:
v <- 2L
print(class(v))
it produces the following result −
[1] "integer"

Complex. Example: 3 + 2i. Verify:
v <- 2+5i
print(class(v))
it produces the following result −
[1] "complex"

Character. Example: 'a', "good", "TRUE", '23.4'. Verify:
v <- "TRUE"
print(class(v))
it produces the following result −
[1] "character"

Raw. Example: "Hello" is stored as 48 65 6c 6c 6f. Verify:
v <- charToRaw("Hello")
print(class(v))
it produces the following result −
[1] "raw"

In R programming, the very basic data types are the R-objects


called vectors which hold elements of different classes as shown above.
Please note in R the number of classes is not confined to only the above
six types. For example, we can use many atomic vectors and create an
array whose class will become array.

Vectors
When you want to create a vector with more than one element, you should use the c() function, which combines the elements into a vector.
# Create a vector.
apple <- c('red','green',"yellow")
print(apple)

# Get the class of the vector.


print(class(apple))
When we execute the above code, it produces the following result −
[1] "red" "green" "yellow"
[1] "character"
Lists
A list is an R-object which can contain many different types of elements
inside it like vectors, functions and even another list inside it.
# Create a list.
list1 <- list(c(2,5,3),21.3,sin)

# Print the list.


print(list1)
When we execute the above code, it produces the following result −
[[1]]
[1] 2 5 3

[[2]]
[1] 21.3

[[3]]
function (x) .Primitive("sin")

Matrices
A matrix is a two-dimensional rectangular data set. It can be created
using a vector input to the matrix function.
# Create a matrix.
M = matrix( c('a','a','b','c','b','a'), nrow = 2, ncol = 3, byrow = TRUE)
print(M)
When we execute the above code, it produces the following result −
[,1] [,2] [,3]
[1,] "a" "a" "b"
[2,] "c" "b" "a"

Arrays
While matrices are confined to two dimensions, arrays can be of any number of dimensions. The array function takes a dim attribute which creates the required number of dimensions. In the example below we create an array with two elements, each of which is a 3x3 matrix.
# Create an array.
a <- array(c('green','yellow'),dim = c(3,3,2))
print(a)
When we execute the above code, it produces the following result −
,,1

[,1] [,2] [,3]


[1,] "green" "yellow" "green"
[2,] "yellow" "green" "yellow"
[3,] "green" "yellow" "green"

,,2

[,1] [,2] [,3]


[1,] "yellow" "green" "yellow"
[2,] "green" "yellow" "green"
[3,] "yellow" "green" "yellow"
Factors
Factors are the R-objects which are created using a vector. A factor stores the vector along with the distinct values of the elements in the vector as labels. The labels are always character, irrespective of whether the input vector is numeric, character, or Boolean. They are useful in statistical modeling.
Factors are created using the factor() function. The nlevels() function gives the count of levels.
# Create a vector.
apple_colors <- c('green','green','yellow','red','red','red','green')

# Create a factor object.


factor_apple <- factor(apple_colors)

# Print the factor.


print(factor_apple)
print(nlevels(factor_apple))
When we execute the above code, it produces the following result −
[1] green green yellow red red red green
Levels: green red yellow
[1] 3

Data Frames
Data frames are tabular data objects. Unlike a matrix, in a data frame each column can contain different modes of data. The first column can be numeric while the second column can be character and the third column can be logical. A data frame is a list of vectors of equal length.
Data frames are created using the data.frame() function.
# Create the data frame.
BMI <- data.frame(
gender = c("Male", "Male","Female"),
height = c(152, 171.5, 165),
weight = c(81,93, 78),
Age = c(42,38,26)
)
print(BMI)
When we execute the above code, it produces the following result −
gender height weight Age
1 Male 152.0 81 42
2 Male 171.5 93 38
3 Female 165.0 78 26

UNIT -II
An attribute is a data field representing a characteristic of a data object. The nouns attribute, dimension, feature, and variable are used interchangeably in the literature. Dimension is generally used in data warehousing. Machine learning literature tends to use the term feature, while statisticians prefer the term variable.
Data mining and database experts generally use the term attribute. Attributes describing a customer object can include, for instance, customer ID, name, and address. Observed values for a given attribute are referred to as observations.
The set of attributes used to describe a given object is known as an attribute vector (or feature vector). The distribution of data involving one attribute (or variable) is called univariate. A bivariate distribution involves two attributes, and so on.

Example of attribute
In this example, RollNo, Name, and Result are attributes of the object named student.
RollNo Name Result
1 Ali Pass
2 Akram Fail
We need to differentiate between different types of attributes during data preprocessing. Firstly, we need to differentiate between qualitative and quantitative attributes.
1. Qualitative Attributes such as Nominal, Ordinal, and Binary
Attributes.
2. Quantitative Attributes such as Discrete and Continuous
Attributes.

Types Of attributes

 Binary
 Nominal
 Ordinal Attributes

Nominal Attributes
Nominal data is in alphabetical form, not integer form. Nominal attributes are qualitative attributes.
Nominal relates to names of things or symbols; a nominal attribute represents a category, code, or state, and is also called a categorical attribute.

Examples of Nominal attributes

In this example, states and colors are the attributes, and New, Pending, Working, Complete, Finish and Black, Brown, White, and Red are the values.

Attribute: Value
Categorical data: Lecturer, Assistant Professor, Professor
States: New, Pending, Working, Complete, Finish
Colors: Black, Brown, White, Red

Binary Attributes
Binary data have only two values/states. For example, here HIV detected
can be only Yes or No.
Binary Attributes are Qualitative Attributes.
0 means the value is absent.
1 means the value is present.

Examples of Binary Attributes


Attribute Value

HIV detected Yes, No

Result Pass, Fail


The binary attribute is of two types;

1. Symmetric binary
2. Asymmetric binary

Examples of Symmetric data


Both values are equally important. For example, if we have open admission to our university, then it does not matter whether you are male or female.
Example:
Attribute Value

Gender Male, Female


Examples of Asymmetric data
Both values are not equally important. For example, HIV detected is more important than HIV not detected. If a patient has HIV and we ignore it, it can lead to death; but if a person does not have HIV and we ignore the result, there is no special issue or risk.
Example
Attribute Value

HIV detected Yes, No

Result Pass, Fail

Ordinal Attributes

All values have a meaningful order. For example, Grade A means the highest marks, B means marks lower than A, C means marks lower than grades A and B, and so on. Ordinal attributes are qualitative attributes.
Examples of Ordinal Attributes
Attribute Value

Grade A, B, C, D, F

BPS- Basic pay scale 16, 17, 18

 Numeric Attributes:
A numeric attribute is quantitative; that is, it is a measurable quantity represented by integer or real values.
Numeric attributes can be of two types: interval-scaled and ratio-scaled.
Let's discuss them one by one.
Let’s discuss one by one.
1. Interval-Scaled Attributes:
Interval-scaled attributes are measured on a scale of equal-size units. The values of interval-scaled attributes have order and can be positive, 0, or negative. Thus, in addition to providing a ranking of values, such attributes allow us to compare and quantify the difference between values.
Example –
A temperature attribute is interval-scaled. We have different temperature values for every new day, where each day is an entity. By ordering the values, we obtain a ranking of entities with respect to temperature. In addition, we can quantify the difference between values; for example, a temperature of 20 degrees C is five degrees higher than a temperature of 15 degrees C.

2. Ratio-Scaled Attributes:
A ratio-scaled attribute is a category of numeric attribute with an inherent (true) zero point. In addition, the values are ordered, and we can compute the difference between values, as well as the mean, median, and mode.
Example –
The Kelvin (K) temperature scale has what is considered a true zero point.

Discrete Attributes
Discrete data have a finite number of values. They can be in numerical form and can also be in categorical form. Discrete attributes are quantitative attributes. Examples: zip codes, professions, or the set of words in a collection of documents.
Note: Binary attributes are a special case of discrete attributes. Binary attributes where only non-zero values are important are called asymmetric binary attributes.
Examples of Discrete Data
Attribute Value

Profession Teacher, Businessman, Peon, etc.

Postal Code 42200, 42300 etc


Continuous Attributes
Continuous data technically have an infinite number of possible values between any two points. Continuous data is of float type; there can be many numbers between 1 and 2. These attributes are quantitative attributes.
Continuous attributes are typically represented as floating-point variables.
Example of Continuous Attribute
Attribute Value

Height 5.4…, 6.5….. etc

Weight 50.09…. etc

MEASURING THE CENTER AND DISPERSION OF DATA:
Although the mean is the single most useful quantity for describing a data set, it is not always the best way of measuring the center of the data. A major problem with the mean is its sensitivity to extreme (outlier) values; even a small number of extreme values can corrupt the mean.
Range
In Statistics, the range is the smallest of all the measures of dispersion. It
is the difference between the two extreme conclusions of the
distribution. In other words, the range is the difference between the
maximum and the minimum observation of the distribution.
It is defined by
Range = Xmax – Xmin
Where Xmax is the largest observation and Xmin is the smallest
observation of the variable values.

Interquartile Range Definition


The interquartile range defines the difference between the third and the
first quartile. Quartiles are the partitioned values that divide the whole
series into 4 equal parts. So, there are 3 quartiles. First Quartile is
denoted by Q1 known as the lower quartile, the second Quartile is
denoted by Q2 and the third Quartile is denoted by Q3 known as the
upper quartile. Therefore, the interquartile range is equal to the upper
quartile minus lower quartile.

Interquartile Range Formula


The difference between the upper and lower quartile is known as the
interquartile range. The formula for the interquartile range is given
below
Interquartile range = Upper Quartile – Lower Quartile = Q3 – Q1

Quartiles Definition
Quartiles divide the entire set into four equal parts. So, there are three
quartiles, first, second and third represented by Q1, Q2 and Q3,
respectively. Q2 is nothing but the median, since it indicates the position
of the item in the list and thus, is a positional average. To find quartiles
of a group of data, we have to arrange the data in ascending order.

Quartiles Formula
Suppose, Q3 is the upper quartile is the median of the upper half of the
data set. Whereas, Q1 is the lower quartile and median of the lower half
of the data set. Q2 is the median. Consider, we have n number of items
in a data set. Then the quartiles are given by;
Q1 = [(n+1)/4]th item
Q2 = [(n+1)/2]th item
Q3 = [3(n+1)/4]th item
Variance
According to layman’s words, the variance is a measure of how far a set
of data are dispersed out from their mean or average value. It is denoted
as ‘σ2’.

Standard Deviation
The spread of statistical data is measured by the standard deviation.
Distribution measures the deviation of data from its mean or average
position. The degree of dispersion is computed by the method of
estimating the deviation of data points. It is denoted by the symbol, ‘σ’.
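All of these measures are one-liners in base R (the data below is made up for illustration):

x <- c(3, 7, 8, 5, 12, 14, 21, 13, 18)  # illustrative data
max(x) - min(x)  # range = Xmax - Xmin
quantile(x)      # quartiles Q1, Q2 (median), Q3
IQR(x)           # interquartile range = Q3 - Q1
var(x)           # variance (σ²)
sd(x)            # standard deviation (σ)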

FIVE NUMBER SUMMARY: Minimum, Q1,Median,Q3,Maximum


OUTLIERS: An outlier is an extremely high or extremely low data point relative to the nearest data point and the rest of the neighboring co-existing values in the data graph or dataset you're working with. A common rule flags values below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR as outliers.

BOX PLOT ANALYSIS WITH AN EXAMPLE USING THE FIVE-NUMBER SUMMARY AND OUTLIER RANGE:
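As a worked sketch with made-up data: for x = 1, 3, 4, 6, 7, 8, 9, 11, 30, the five-number summary is (1, 4, 7, 9, 30); with IQR = 9 - 4 = 5, the outlier fences are 4 - 1.5 × 5 = -3.5 and 9 + 1.5 × 5 = 16.5, so 30 is an outlier. The same analysis in R:

x <- c(1, 3, 4, 6, 7, 8, 9, 11, 30)  # made-up data with one extreme value
fivenum(x)                           # five-number summary: 1 4 7 9 30
q <- quantile(x, c(0.25, 0.75))      # Q1 and Q3
fence_low  <- q[1] - 1.5 * IQR(x)    # lower outlier fence
fence_high <- q[2] + 1.5 * IQR(x)    # upper outlier fence
x[x < fence_low | x > fence_high]    # 30 is flagged as an outlier
boxplot(x)                           # whiskers stop at the fences; 30 drawn as a point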
Graphical Displays of Basic Statistical
Description of Data

 Histogram: The most basic graph is a histogram, which is a bar plot in which each bar represents the frequency (count) or proportion (count/total count) of cases for a range of values. Histograms are one of the simplest ways to quickly learn a lot about your data, including central tendency, spread, modality, shape, and outliers.

 Boxplots: Another very useful univariate graphical technique is the boxplot. Boxplots are excellent at presenting information about central tendency and show robust measures of location and spread, as well as providing information about symmetry and outliers, although they can be misleading about aspects like multimodality. One of the best uses of boxplots is in the form of side-by-side boxplots.
 Quantile-normal plots: The final univariate graphical EDA technique is the most intricate. It is called the quantile-normal (QN) plot or, more generally, the quantile-quantile (QQ) plot. It is used to see how well a particular sample follows a particular theoretical distribution. It allows detection of non-normality and diagnosis of skewness and kurtosis.

 Scatterplot: For two quantitative variables, the essential graphical EDA technique is the scatterplot, which has one variable on the x-axis and one on the y-axis, and a point for every case in your dataset.

 Heat map: It is a graphical representation of data where values are depicted by color.

 Multivariate chart: It is a graphical representation of the relationships between factors and a response.

 Bubble chart: It is a data visualization that displays multiple circles (bubbles) in a two-dimensional plot.
Line Graphs
A line graph is used to show how the value of a particular variable
changes with time. We plot this graph by connecting the points at
different values of the variable. It can be useful for analyzing trends
in the data and predicting further trends.
Bar Graphs
A bar graph is a type of graphical representation of the data in which
bars of uniform width are drawn with equal spacing between them on
one axis (x-axis usually), depicting the variable. The values of the
variables are represented by the height of the bars.
Line Plot
It is a plot that displays data as points or check marks above a number
line, showing the frequency of each value.

Stem and Leaf Plot


This is a type of plot in which each value is split into a “stem” (the leading digit or digits) and a “leaf” (in most cases, the last digit).
For example: the number 42 is split into stem (4) and leaf (2).
For the data 20, 35, 40, 42, 50 the display is:
Stem | Leaf
  2  | 0
  3  | 5
  4  | 0 2
  5  | 0
Box and Whisker Plot
These plots divide the data into four parts to show their summary. They
are more concerned about the spread, average, and median of the data.
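As a brief illustration in R, most of these displays take one function call each; the data vector below is hypothetical:

x <- c(2, 3, 3, 4, 5, 5, 5, 6, 7, 9)   # made-up sample data
hist(x)              # histogram
boxplot(x)           # box and whisker plot
plot(x, x^2)         # scatterplot of two quantitative variables
plot(x, type = "l")  # line graph connecting the points
stem(x)              # stem and leaf display printed in the console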
UNIT-3
R Vector
A vector is a basic data structure which plays an important role in R programming.

In R, a sequence of elements which share the same data type is known as vector. A vector
supports logical, integer, double, character, complex, or raw data type. The elements which
are contained in vector known as components of the vector. We can check the type of
vector with the help of the typeof() function.

The length is an important property of a vector. A vector length is basically the number of
elements in the vector, and it is calculated with the help of the length() function.
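For example, a quick check of these two properties on a small numeric vector:

x <- c(10.5, 20.1, 30.7)
typeof(x)
length(x)

Output

[1] "double"
[1] 3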

Vectors are classified into two parts, i.e., atomic vectors and lists. They have three common
properties, i.e., type (checked with typeof()), length (checked with length()), and attributes (checked with attributes()).

There is only one difference between atomic vectors and lists. In an atomic vector, all the
elements are of the same type, but in the list, the elements are of different data types. In
this section, we will discuss only the atomic vectors. We will discuss lists briefly in the next
topic.
How to create a vector in R?
In R, we use c() function to create a vector. This function returns a one-dimensional array or
simply vector. The c() function is a generic function which combines its argument. All
arguments are restricted with a common data type which is the type of the returned value.
There are various other ways to create a vector in R, which are as follows:

1) Using the colon(:) operator

We can create a vector with the help of the colon operator. There is the following syntax to
use colon operator:

z<-x:y

This operator creates a vector with elements from x to y and assigns it to z.

Example:

a<-4:-10
a

Output

[1] 4 3 2 1 0 -1 -2 -3 -4 -5 -6 -7 -8 -9 -10

2) Using the seq() function

In R, we can create a vector with the help of the seq() function. The seq() function creates
a sequence of elements as a vector. It is used in two ways, i.e., by setting the
step size with the 'by' parameter or by specifying the length of the vector with the 'length.out'
parameter.

Example:

seq_vec<-seq(1,4,by=0.5)
seq_vec
class(seq_vec)

Output

[1] 1.0 1.5 2.0 2.5 3.0 3.5 4.0


[1] "numeric"

Example:

seq_vec<-seq(1,4,length.out=6)
seq_vec
class(seq_vec)

Output

[1] 1.0 1.6 2.2 2.8 3.4 4.0


[1] "numeric"

Atomic vectors in R
In R, there are four types of atomic vectors. Atomic vectors play an important role in Data
Science. Atomic vectors are created with the help of c() function. These atomic vectors are
as follows:

Numeric vector

The decimal values are known as numeric data types in R. If we assign a decimal value to
any variable d, then this d variable will become a numeric type. A vector which contains
numeric elements is known as a numeric vector.

Example:

d<-45.5
num_vec<-c(10.1, 10.2, 33.2)
d
num_vec
class(d)
class(num_vec)

Output

[1] 45.5
[1] 10.1 10.2 33.2
[1] "numeric"
[1] "numeric"

Integer vector

A non-fractional numeric value is known as integer data. In R, an integer is stored as a 32-bit
(4-byte) value and is written as a literal with the suffix L. There are two ways to assign an integer
value to a variable, i.e., by using the as.integer() function or by appending L to the value.

A vector which contains integer elements is known as an integer vector.

Example:

d<-as.integer(5)
e<-5L
int_vec<-c(1,2,3,4,5)
int_vec<-as.integer(int_vec)
int_vec1<-c(1L,2L,3L,4L,5L)
class(d)
class(e)
class(int_vec)
class(int_vec1)

Output

[1] "integer"
[1] "integer"
[1] "integer"
[1] "integer"

Character vector

Character data holds string values. In R, there are two different ways to
create a character data type value, i.e., using the as.character() function or by typing the string
between double quotes ("") or single quotes ('').

A vector which contains character elements is known as a character vector.


Example:

d<-'shubham'
e<-"Arpita"
f<-"65"
f<-as.character(f)
d
e
f
char_vec<-c(1,2,3,4,5)
char_vec<-as.character(char_vec)
char_vec1<-c("shubham","arpita","nishka","vaishali")
char_vec
char_vec1
class(d)
class(e)
class(f)
class(char_vec)
class(char_vec1)

Output

[1] "shubham"
[1] "Arpita"
[1] "65"
[1] "1" "2" "3" "4" "5"
[1] "shubham" "arpita" "nishka" "vaishali"
[1] "character"
[1] "character"
[1] "character"
[1] "character"
[1] "character"

Logical vector

The logical data types have only two values i.e., True or False. These values are based on
which condition is satisfied. A vector which contains Boolean values is known as the logical
vector.

Example:

d<-as.integer(5)
e<-as.integer(6)
f<-as.integer(7)
g<-d>e
h<-e<f
log_vec<-c(d<e, d<f, e<d, e<f, f<d, f<e)
g
h
log_vec
class(g)
class(h)
class(log_vec)

Output

[1] FALSE
[1] TRUE
[1] TRUE TRUE FALSE TRUE FALSE FALSE
[1] "logical"
[1] "logical"
[1] "logical"

Naming a vector
This example explains how to create a vector with names in the R programming language.
my_values <- 1:5 # Create vector of values
my_values # Print vector of values
# [1] 1 2 3 4 5
…and a vector containing the corresponding names to our numbers:
my_names <- letters[1:5] # Create vector of names
my_names # Print vector of names
# [1] "a" "b" "c" "d" "e"
Note that the length of the vector of numbers and the length of the vector of names needs to
be the same.
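To attach these names to the values, we assign the names vector through the names() function (standard R):

names(my_values) <- my_names # Assign the names to the vector elements
my_values # Print the named vector
# a b c d e
# 1 2 3 4 5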

Arithmetic operations on vectors

We can perform all the arithmetic operation on vectors. The arithmetic operations are
performed member-by-member on vectors. We can add, subtract, multiply, or divide two
vectors. Let see an example to understand how arithmetic operations are performed on
vectors.

Example:

a<-c(1,3,5,7)
b<-c(2,4,6,8)
a+b
a-b
a*b
a/b
a%%b

Output
[1] 3 7 11 15
[1] -1 -1 -1 -1
[1] 2 12 30 56
[1] 0.5000000 0.7500000 0.8333333 0.8750000
[1] 1 3 5 7
[1] "TensorFlow" "PyTorch"
Start
"TensorFlow"

Next Top

← PrevNext →
Vector Arithmetics
Arithmetic operations of vectors are performed member-by-member, i.e., memberwise.
For example, suppose we have two vectors a and b.
> a = c(1, 3, 5, 7)
> b = c(1, 2, 4, 8)

Then, if we multiply a by 5, we would get a vector with each of its members multiplied by 5.
>5*a
[1] 5 15 25 35

And if we add a and b together, the sum would be a vector whose members are the sum of the corresponding members
from a and b.
>a+b
[1] 2 5 9 15

Similarly for subtraction, multiplication and division, we get new vectors via memberwise operations.

>a-b
[1] 0 1 1 -1

>a*b
[1] 1 6 20 56

>a/b
[1] 1.000 1.500 1.250 0.875

Recycling Rule
If two vectors are of unequal length, the shorter one will be recycled in order to match the longer vector. For example, the
following vectors u and v have different lengths, and their sum is computed by recycling values of the shorter vector u.
> u = c(10, 20, 30)
> v = c(1, 2, 3, 4, 5, 6, 7, 8, 9)
>u+v
[1] 11 22 33 14 25 36 17 28 39
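If the longer vector's length is not a multiple of the shorter one's, recycling still takes place, but R issues a warning. For example:

> u = c(10, 20, 30)
> w = c(1, 2, 3, 4, 5)
> u + w
[1] 11 22 33 14 25
Warning message:
In u + w : longer object length is not a multiple of shorter object length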

Vector subsetting
In R Programming Language , subsetting allows the user to access elements from an object. It
takes out a portion from the object based on the condition provided. There are 4 ways of subsetting
in R programming. Each of the methods depends on the usability of the user and the type of
object.

Method 1: Subsetting in R Using [ ] Operator

Using the ‘[ ]’ operator, elements of vectors and observations from data frames can be accessed.
To neglect some indexes, ‘-‘ is used to access all other indexes of vector or data frame.
Program

# Create vector
x <- 1:15

# Print vector
cat("Original vector: ", x, "\n")

# Subsetting vector
cat("First 5 values of vector: ", x[1:5], "\n")

cat("Without values present at index 1, 2 and 3: ",


x[-c(1, 2, 3)], "\n")
Output:
Original vector: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
First 5 values of vector: 1 2 3 4 5
Without values present at index 1, 2 and 3: 4 5 6 7 8 9 10 11
12 13 14 15

Method 2: Subsetting in R Using [[ ]] Operator

[[ ]] operator is used for subsetting of list-objects. This operator is the same as [ ] operator but the
only difference is that [[ ]] selects only one element whereas [ ] operator can select more than 1
element in a single command.

# Create list
ls <- list(a = 1, b = 2, c = 10, d = 20)

# Print list
cat("Original List: \n")
print(ls)

# Select first element of list


cat("First element of list: ", ls[[1]], "\n")
Output:
Original List:
$a
[1] 1

$b
[1] 2

$c
[1] 10

$d
[1] 20

First element of list: 1

Method 3: Subsetting in R Using $ Operator

$ operator can be used for lists and data frames in R. Unlike [ ] operator, it selects only a single
observation at a time. It can be used to access an element in named list or a column in data frame.
$ operator is only applicable for recursive objects or list-like objects.

# Create list
ls <- list(a = 1, b = 2, c = "Hello", d = "GFG")

# Print list
cat("Original list:\n")
print(ls)

# Print "GFG" using $ operator


cat("Using $ operator:\n")
print(ls$d)
Output:
Original list:
$a
[1] 1

$b
[1] 2
$c
[1] "Hello"

$d
[1] "GFG"

Using $ operator:
[1] "GFG"

Method 4: Subsetting in R Using subset() Function

subset() function in R programming is used to create a subset of vectors, matrices, or data frames
based on the conditions provided in the parameters.
Syntax: subset(x, subset, select)
Parameters:
 x: indicates the object
 subset: indicates the logical expression on the basis of which subsetting has to be done
 select: indicates columns to select
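These parameters can be seen in a short sketch; the data frame here is made up for illustration:

# Create a small data frame
df <- data.frame(id = 1:5, score = c(55, 90, 72, 88, 64))

# Keep the rows where score > 70 and select only the id column
subset(df, subset = score > 70, select = id)

Output:
  id
2  2
3  3
4  4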

R Matrix
In R, a two-dimensional rectangular data set is known as a matrix. A matrix is created with
the help of the vector input to the matrix function. On R matrices, we can perform addition,
subtraction, multiplication, and division operation.

In the R matrix, elements are arranged in a fixed number of rows and columns. The matrix
elements are usually numeric values. In R, we use the matrix() function to create a matrix,
and all the elements of a matrix must share a common basic type.

Example

matrix1<-matrix(c(11, 13, 15, 12, 14, 16),nrow =2, ncol =3, byrow = TRUE)
matrix1

Output

[,1] [,2] [,3]


[1,] 11 13 15
[2,] 12 14 16
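Because all elements share one basic type, arithmetic between matrices of the same dimensions works element by element; a minimal sketch:

m1 <- matrix(1:4, nrow = 2)
m2 <- matrix(5:8, nrow = 2)
print(m1 + m2)   # element-wise addition
print(m1 * m2)   # element-wise multiplication

Output

     [,1] [,2]
[1,]    6   10
[2,]    8   12
     [,1] [,2]
[1,]    5   21
[2,]   12   32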
How to create a matrix in R?
Like vector and list, R provides a function which creates a matrix. R provides the matrix()
function to create a matrix. This function plays an important role in data analysis. There is
the following syntax of the matrix in R:

matrix(data, nrow, ncol, byrow, dimnames)

data

The first argument in matrix function is data. It is the input vector which is the data elements
of the matrix.

nrow: The second argument is the number of rows which we want to create in the matrix.

ncol: The third argument is the number of columns which we want to create in the matrix.

byrow: The byrow parameter is a logical flag. If its value is TRUE, then the input vector
elements are arranged by row.

dimnames: The dimnames parameter is a list of names assigned to the rows and columns.

Example

#Arranging elements sequentially by row.


P <- matrix(c(5:16), nrow = 4, byrow = TRUE)
print(P)

# Arranging elements sequentially by column.


Q <- matrix(c(3:14), nrow = 4, byrow = FALSE)
print(Q)

# Defining the column and row names.


row_names = c("row1", "row2", "row3", "row4")
col_names = c("col1", "col2", "col3")

R <- matrix(c(3:14), nrow = 4, byrow = TRUE, dimnames = list(row_names, col_names))
print(R)

Output

[,1] [,2] [,3]


[1,] 5 6 7
[2,] 8 9 10
[3,] 11 12 13
[4,] 14 15 16
[,1] [,2] [,3]
[1,] 3 7 11
[2,] 4 8 12
[3,] 5 9 13
[4,] 6 10 14

col1 col2 col3


row1 3 4 5
row2 6 7 8
row3 9 10 11
row4 12 13 14
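Individual elements can then be accessed with [row, column] indexing, either by position or, when dimnames are set, by name; for the matrix R created above:

print(R[1, 2])
print(R["row1", "col2"])

Output

[1] 4
[1] 4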

R Arrays
In R, arrays are the data objects which allow us to store data in more than two dimensions. In
R, an array is created with the help of the array() function. This array() function takes a
vector as an input and to create an array it uses vectors values in the dim parameter.

For example, if we create an array of dimensions (2, 3, 4), then it will create 4
rectangular matrices, each with 2 rows and 3 columns.

R Array Syntax
There is the following syntax of R arrays:

array_name <- array(data, dim = c(row_size, column_size, matrices), dimnames)

data

The data is the first argument in the array() function. It is an input vector which is given to
the array.

matrices

In R, the array consists of multi-dimensional matrices.

row_size
This parameter defines the number of row elements which an array can store.

column_size

This parameter defines the number of columns elements which an array can store.

dimnames

This parameter is used to change the default names of rows and columns.

How to create?
In R, array creation is quite simple. We can easily create an array using vectors and the array()
function. In an array, data is stored in the form of matrices. There are only two steps to create
an array, which are as follows

1. In the first step, we will create two vectors of different lengths.


2. Once our vectors are created, we take these vectors as inputs to the array.

Let see an example to understand how we can implement an array with the help of the
vectors and array() function.

Example

#Creating two vectors of different lengths


vec1 <-c(1,3,5)
vec2 <-c(10,11,12,13,14,15)

#Taking these vectors as input to the array


res <- array(c(vec1,vec2),dim=c(3,3,2))
print(res)

Output

, , 1
[,1] [,2] [,3]
[1,] 1 10 13
[2,] 3 11 14
[3,] 5 12 15

, , 2
[,1] [,2] [,3]
[1,] 1 10 13
[2,] 3 11 14
[3,] 5 12 15
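Elements of an array are accessed with [row, column, matrix] indexing; for the array res created above:

# Access the element in row 1, column 2 of the first matrix
print(res[1, 2, 1])

Output

[1] 10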

R factors
The factor is a data structure which is used for fields which take only predefined finite
number of values. These are the variable which takes a limited number of different values.
These are the data objects which are used to categorize the data and to store it on multiple
levels. It can store both integers and strings values, and are useful in the column that has a
limited number of unique values.

Factors have labels which are associated with the unique integers stored in them. A factor
contains a predefined set of values known as levels, and by default R always sorts the levels
in alphabetical order.

Attributes of a factor
There are the following attributes of a factor in R

a. X
It is the input vector which is to be transformed into a factor.
b. levels
It is an input vector that represents a set of unique values which are taken by x.
c. labels
It is a character vector which corresponds to the number of labels.
d. Exclude
It is used to specify the value which we want to be excluded.
e. ordered
It is a logical attribute which determines if the levels are ordered.
f. nmax
It is used to specify the upper bound for the maximum number of level.

How to create a factor?


In R, it is quite simple to create a factor. A factor is created in two steps

1. In the first step, we create a vector.


2. The next step is to convert the vector into a factor.

R provides factor() function to convert the vector into factor. There is the following syntax of
factor() function

factor_data<- factor(vector)

Let's see an example to understand how factor function is used.

Example

# Creating a vector as input.


data <- c("Shubham","Nishka","Arpita","Nishka","Shubham","Sumit","Nishka","Shubham","Sumit","Arpita","Sumit")

print(data)
print(is.factor(data))

# Applying the factor function.


factor_data<- factor(data)

print(factor_data)
print(is.factor(factor_data))

Output

[1] "Shubham" "Nishka" "Arpita" "Nishka" "Shubham" "Sumit" "Nishka"


[8] "Shubham" "Sumit" "Arpita" "Sumit"
[1] FALSE
[1] Shubham Nishka Arpita Nishka Shubham Sumit Nishka Shubham Sumit
[10] Arpita Sumit
Levels: Arpita Nishka Shubham Sumit
[1] TRUE
Accessing components of factor
Like vectors, we can access the components of factors. The process of accessing components
of factor is much more similar to the vectors. We can access the element with the help of the
indexing method or using logical vectors.

Example

# Creating a vector as input.


data <- c("Shubham","Nishka","Arpita","Nishka","Shubham","Sumit","Nishka","Shubham","S
umit","Arpita","Sumit")

# Applying the factor function.


factor_data<- factor(data)

#Printing all elements of factor


print(factor_data)

#Accessing 4th element of factor
print(factor_data[4])

#Accessing 5th and 7th element
print(factor_data[c(5,7)])

#Accessing all elements except 4th one
print(factor_data[-4])

#Accessing elements using logical vector
print(factor_data[c(TRUE,FALSE,FALSE,FALSE,TRUE,TRUE,TRUE,FALSE,FALSE,FALSE,TRUE)])

Output

[1] Shubham Nishka Arpita Nishka Shubham Sumit Nishka Shubham Sumit
[10] Arpita Sumit
Levels: Arpita Nishka Shubham Sumit

[1] Nishka
Levels: Arpita Nishka Shubham Sumit

[1] Shubham Nishka


Levels: Arpita Nishka Shubham Sumit

[1] Shubham Nishka Arpita Shubham Sumit Nishka Shubham Sumit Arpita
[10] Sumit
Levels: Arpita Nishka Shubham Sumit

[1] Shubham Shubham Sumit Nishka Sumit


Levels: Arpita Nishka Shubham Sumit

Changing the Order of Levels


The order of the levels in a factor can be changed by applying the factor function again with new order of
the levels.
data <- c("East","West","East","North","North","East","West",
"West","West","East","North")
# Create the factors
factor_data <- factor(data)
print(factor_data)

# Apply the factor function with required order of the level.


new_order_data <- factor(factor_data,levels = c("East","West","North"))
print(new_order_data)
When we execute the above code, it produces the following result −
[1] East West East North North East West West West East North
Levels: East North West
[1] East West East North North East West West West East North
Levels: East West North

Generating Factor Levels


We can generate factor levels by using the gl() function. It takes two integers as input, which indicate
how many levels to create and how many times each level is repeated.

Syntax
gl(n, k, labels)
Following is the description of the parameters used −
 n is a integer giving the number of levels.
 k is a integer giving the number of replications.
 labels is a vector of labels for the resulting factor levels.

Example
v <- gl(3, 4, labels = c("Tampa", "Seattle","Boston"))
print(v)
When we execute the above code, it produces the following result −
Tampa Tampa Tampa Tampa Seattle Seattle Seattle Seattle Boston Boston Boston Boston
Levels: Tampa Seattle Boston
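A few helper functions are useful for inspecting a factor; applied to the factor v created above:

print(levels(v))    # the set of levels
print(nlevels(v))   # the number of levels
print(summary(v))   # count of observations at each level

Output

[1] "Tampa"   "Seattle" "Boston"
[1] 3
  Tampa Seattle  Boston
      4       4       4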

R Data Frame
A data frame is a two-dimensional array-like structure or a table in which a column contains
values of one variable, and rows contains one set of values from each column. A data frame
is a special case of the list in which each component has equal length.

A data frame is used to store data table and the vectors which are present in the form of a
list in a data frame, are of equal length.

In a simple way, it is a list of equal length vectors. A matrix can contain one type of data, but
a data frame can contain different data types such as numeric, character, factor, etc.

There are following characteristics of a data frame.

o The columns name should be non-empty.


o The rows name should be unique.
o The data which is stored in a data frame can be a factor, numeric, or character type.
o Each column contains the same number of data items.

How to create Data Frame


In R, the data frames are created with the help of the data.frame() function. This function
contains the vectors of any type such as numeric, character, or integer. In below example,
we create a data frame that contains employee id (integer vector), employee
name(character vector), salary(numeric vector), and starting date(Date vector).

Example

# Creating the data frame.


emp.data<- data.frame( employee_id = c (1:5),
employee_name = c("Shubham","Arpita","Nishka","Gunjan","Sumit"),
sal = c(623.3,915.2,611.0,729.0,843.25),

starting_date = as.Date(c("2012-01-01", "2013-09-23", "2014-11-15", "2014-05-11",


"2015-03-27")),
stringsAsFactors = FALSE
)
# Printing the data frame.
print(emp.data)

Output

  employee_id employee_name    sal starting_date
1           1       Shubham 623.30    2012-01-01
2           2        Arpita 915.20    2013-09-23
3           3        Nishka 611.00    2014-11-15
4           4        Gunjan 729.00    2014-05-11
5           5         Sumit 843.25    2015-03-27

Extract Data from Data Frame


Extract specific column from a data frame using column name.
# Create the data frame.
emp.data <- data.frame(
emp_id = c (1:5),
emp_name = c("Rick","Dan","Michelle","Ryan","Gary"),
salary = c(623.3,515.2,611.0,729.0,843.25),

start_date = as.Date(c("2012-01-01","2013-09-23","2014-11-15","2014-05-11",
"2015-03-27")),
stringsAsFactors = FALSE
)
# Extract Specific columns.
result <- data.frame(emp.data$emp_name,emp.data$salary)
print(result)

When we execute the above code, it produces the following result −


emp.data.emp_name emp.data.salary
1 Rick 623.30
2 Dan 515.20
3 Michelle 611.00
4 Ryan 729.00
5 Gary 843.25

Expand/Extending Data Frame


A data frame can be expanded by adding columns and rows.

Add Column
Just add the column vector using a new column name.
# Create the data frame.
emp.data <- data.frame(
emp_id = c (1:5),
emp_name = c("Rick","Dan","Michelle","Ryan","Gary"),
salary = c(623.3,515.2,611.0,729.0,843.25),

start_date = as.Date(c("2012-01-01", "2013-09-23", "2014-11-15", "2014-05-11",


"2015-03-27")),
stringsAsFactors = FALSE
)

# Add the "dept" coulmn.


emp.data$dept <- c("IT","Operations","IT","HR","Finance")
v <- emp.data
print(v)

When we execute the above code, it produces the following result −


emp_id emp_name salary start_date dept
1 1 Rick 623.30 2012-01-01 IT
2 2 Dan 515.20 2013-09-23 Operations
3 3 Michelle 611.00 2014-11-15 IT
4 4 Ryan 729.00 2014-05-11 HR
5 5 Gary 843.25 2015-03-27 Finance

Add Row
To add more rows permanently to an existing data frame, we need to bring in the new rows in the same
structure as the existing data frame and use the rbind() function.
In the example below we create a data frame with new rows and merge it with the existing data frame to
create the final data frame.
# Create the first data frame.
emp.data <- data.frame(
emp_id = c (1:5),
emp_name = c("Rick","Dan","Michelle","Ryan","Gary"),
salary = c(623.3,515.2,611.0,729.0,843.25),

start_date = as.Date(c("2012-01-01", "2013-09-23", "2014-11-15", "2014-05-11",


"2015-03-27")),
dept = c("IT","Operations","IT","HR","Finance"),
stringsAsFactors = FALSE
)

# Create the second data frame


emp.newdata <- data.frame(
emp_id = c (6:8),
emp_name = c("Rasmi","Pranab","Tusar"),
salary = c(578.0,722.5,632.8),
start_date = as.Date(c("2013-05-21","2013-07-30","2014-06-17")),
dept = c("IT","Operations","Fianance"),
stringsAsFactors = FALSE
)

# Bind the two data frames.


emp.finaldata <- rbind(emp.data,emp.newdata)
print(emp.finaldata)

When we execute the above code, it produces the following result −


emp_id emp_name salary start_date dept
1 1 Rick 623.30 2012-01-01 IT
2 2 Dan 515.20 2013-09-23 Operations
3 3 Michelle 611.00 2014-11-15 IT
4 4 Ryan 729.00 2014-05-11 HR
5 5 Gary 843.25 2015-03-27 Finance
6 6 Rasmi 578.00 2013-05-21 IT
7 7 Pranab 722.50 2013-07-30 Operations
8 8 Tusar 632.80 2014-06-17 Finance

Sorting Data frames

Using order() function


This function is used to sort the dataframe based on the particular column in the dataframe

Syntax: order(dataframe$column_name, decreasing = TRUE)

where
 dataframe is the input dataframe
 column_name is the column based on which the dataframe is sorted
 decreasing specifies the sorting order: if it is TRUE, the dataframe is sorted in descending order; otherwise, in increasing order
Return type: index positions of the elements in sorted order

# create dataframe with roll no and


# subjects columns
data = data.frame(rollno = c(1, 5, 4, 2, 3),
subjects = c("java", "python", "php", "sql", "c"))
print(data)
print("sort the data in decreasing order based on subjects")
print(data[order(data$subjects, decreasing = TRUE), ] )
print("sort the data in decreasing order based on rollno ")
print(data[order(data$rollno, decreasing = TRUE), ] )

Output:
rollno subjects
1 1 java
2 5 python
3 4 php
4 2 sql
5 3 c
[1] "sort the data in decreasing order based on subjects "
rollno subjects
4 2 sql
2 5 python
3 4 php
1 1 java
5 3 c
[1] "sort the data in decreasing order based on rollno "
rollno subjects
2 5 python
3 4 php
5 3 c
4 2 sql
1 1 java
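For an ascending sort, the decreasing parameter can simply be omitted, since it defaults to FALSE; order() also accepts several columns, the later ones breaking ties in the earlier ones:

print(data[order(data$rollno), ])

Output:
  rollno subjects
1      1     java
4      2      sql
5      3        c
3      4      php
2      5   python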

R Lists
In R, lists are the second type of vector. Lists are the objects of R which contain elements of
different types such as number, vectors, string and another list inside it. It can also contain a
function or a matrix as its elements. A list is a data structure which has components of
mixed data types. We can say, a list is a generic vector which contains other objects.
Example

vec <- c(3,4,5,6)
char_vec<-c("shubham","nishka","gunjan","sumit")
logic_vec<-c(TRUE,FALSE,FALSE,TRUE)
out_list<-list(vec,char_vec,logic_vec)
out_list

Output:

[[1]]
[1] 3 4 5 6
[[2]]
[1] "shubham" "nishka" "gunjan" "sumit"
[[3]]
[1] TRUE FALSE FALSE TRUE

Lists creation
The process of creating a list is the same as for a vector. In R, a vector is created with the
help of the c() function. Like the c() function, there is another function, i.e., list(), which is used to
create a list in R. A list avoids the limitation of a vector that all elements must share one data
type: we can add elements of different data types to a list.

syntax

list()

Example 1: Creating list with same data type

list_1<-list(1,2,3)
list_2<-list("Shubham","Arpita","Vaishali")
list_3<-list(c(1,2,3))
list_4<-list(TRUE,FALSE,TRUE)
list_1
list_2
list_3
list_4

Output:

[[1]]
[1] 1
[[2]]
[1] 2
[[3]]
[1] 3
[[1]]
[1] "Shubham"
[[2]]
[1] "Arpita"
[[3]]
[1] "Vaishali"

[[1]]
[1] 1 2 3

[[1]]
[1] TRUE
[[2]]
[1] FALSE
[[3]]
[1] TRUE

Example 2: Creating the list with different data type

list_data<-list("Shubham","Arpita",c(1,2,3,4,5),TRUE,FALSE,22.5,12L)
print(list_data)

In the above example, the list function will create a list with character, logical, numeric, and
vector element. It will give the following output

Output:

[[1]]
[1] "Shubham"
[[2]]
[1] "Arpita"
[[3]]
[1] 1 2 3 4 5
[[4]]
[1] TRUE
[[5]]
[1] FALSE
[[6]]
[1] 22.5
[[7]]
[1] 12

Giving a name to list elements


R provides a very easy way for accessing elements, i.e., by giving the name to each element
of a list. By assigning names to the elements, we can access the element easily. There are
only three steps to print the list data corresponding to the name:

1. Creating a list.
2. Assign a name to the list elements with the help of names() function.
3. Print the list data.

Let see an example to understand how we can give the names to the list elements.

Example

# Creating a list containing a vector, a matrix and a list.


list_data <- list(c("Shubham","Nishka","Gunjan"), matrix(c(40,80,60,70,90,80), nrow = 2),
list("BCA","MCA","B.tech"))

# Giving names to the elements in the list.


names(list_data) <- c("Students", "Marks", "Course")

# Show the list.


print(list_data)

Output:

$Students
[1] "Shubham" "Nishka" "Gunjan"

$Marks
[,1] [,2] [,3]
[1,] 40 60 90
[2,] 80 70 80

$Course
$Course[[1]]
[1] "BCA"

$Course[[2]]
[1] "MCA"

$Course[[3]]
[1] "B. tech."

Accessing List Elements


R provides two ways through which we can access the elements of a list. The first one is the
indexing method, performed in the same way as for a vector. In the second one, we can access
the elements of a list with the help of names. This is possible only with a named list; we
cannot access the elements by name if the list is unnamed.
Example : Accessing elements using names

# Creating a list containing a vector, a matrix and a list.


list_data <- list(c("Shubham","Arpita","Nishka"), matrix(c(40,80,60,70,90,80), nrow = 2),list("BCA","MCA","B.tech"))
# Giving names to the elements in the list.
names(list_data) <- c("Student", "Marks", "Course")
# Accessing the first element of the list.
print(list_data["Student"])
print(list_data$Marks)
print(list_data)

Output:

$Student
[1] "Shubham" "Arpita" "Nishka"

[,1] [,2] [,3]


[1,] 40 60 90
[2,] 80 70 80

$Student
[1] "Shubham" "Arpita" "Nishka"

$Marks
[,1] [,2] [,3]
[1,] 40 60 90
[2,] 80 70 80

$Course
$Course[[1]]
[1] "BCA"
$Course[[2]]
[1] "MCA"
$Course[[3]]
[1] "B. tech."

Manipulation of list elements


R allows us to add, delete, or update elements in the list. We can update an element of a list
from anywhere, but elements can be added or deleted only at the end of the list. To remove an
element from a specified index, we assign it a NULL value. We can update an element of
a list by overriding it with the new value. Let us see an example to understand how we can
add, delete, or update the elements in the list.

Example
# Creating a list containing a vector, a matrix and a list.
list_data <- list(c("Shubham","Arpita","Nishka"), matrix(c(40,80,60,70,90,80), nrow = 2),
list("BCA","MCA","B.tech"))

# Giving names to the elements in the list.


names(list_data) <- c("Student", "Marks", "Course")

# Adding element at the end of the list.


list_data[4] <- "Moradabad"
print(list_data[4])

# Removing the last element.


list_data[4] <- NULL

# Printing the 4th Element.


print(list_data[4])

# Updating the 3rd Element.


list_data[3] <- "Masters of computer applications"
print(list_data[3])

Output:

[[1]]
[1] "Moradabad"

$<NA>
NULL

$Course
[1] "Masters of computer applications"

Converting list to vector


There is a drawback with lists, i.e., we cannot perform all the arithmetic operations on list
elements. To remove this drawback, R provides the unlist() function. This function converts the
list into a vector. In some cases, it is required to convert a list into a vector so that we can
use the elements of the vector for further manipulation.

The unlist() function takes the list as a parameter and changes it into a vector. Let us see an
example to understand how the unlist() function is used in R.
Example

# Creating lists.
list1 <- list(1:5)
print(list1)

list2 <-list(10:14)
print(list2)

# Converting the lists to vectors.


v1 <- unlist(list1)
v2 <- unlist(list2)

print(v1)
print(v2)

# Adding the vectors.


result <- v1+v2
print(result)

Output:

[[1]]
[1] 1 2 3 4 5

[[1]]
[1] 10 11 12 13 14

[1] 1 2 3 4 5
[1] 10 11 12 13 14
[1] 11 13 15 17 19

Merging Lists
R allows us to merge one or more lists into one list. Merging can be done with the help of the list()
function. To merge the lists, we pass all the lists to the list() function as
parameters, and it returns a list whose components are the original lists themselves (a list of
lists), as the output below shows.

Example

# Creating two lists.


Even_list <- list(2,4,6,8,10)
Odd_list <- list(1,3,5,7,9)
# Merging the two lists.
merged.list <- list(Even_list,Odd_list)

# Printing the merged list.


print(merged.list)

Output:
[[1]]
[[1]][[1]]
[1] 2

[[1]][[2]]
[1] 4

[[1]][[3]]
[1] 6

[[1]][[4]]
[1] 8

[[1]][[5]]
[1] 10

[[2]]
[[2]][[1]]
[1] 1

[[2]][[2]]
[1] 3

[[2]][[3]]
[1] 5

[[2]][[4]]
[1] 7

[[2]][[5]]
[1] 9
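If a single flat list of all ten numbers is wanted instead of a list of two lists, the c() function can be used to concatenate the components:

flat.list <- c(Even_list, Odd_list)
print(length(flat.list))

Output:
[1] 10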

UNIT-IV

Operators in R
In computer programming, an operator is a symbol which represents an
action. An operator is a symbol which tells the compiler to perform
specific logical or mathematical manipulations. R programming is very rich
in built-in operators.

In R programming, there are different types of operators, and each operator
performs a different task. For data manipulation, there are also some advanced
operators, such as the model formula and list indexing operators.

There are the following types of operators used in R:

1. Arithmetic Operators
2. Relational Operators
3. Logical Operators
4. Assignment Operators
5. Miscellaneous Operators

Arithmetic Operators
Arithmetic operators are the symbols which are used to represent arithmetic
math operations. The operators act on each and every element of the vector.
There are various arithmetic operators which are supported by R

1. + : Adds two vectors element by element.
   a <- c(2, 3.3, 4)
   b <- c(11, 5, 3)
   print(a+b)
   Output: [1] 13.0  8.3  7.0

2. - : Subtracts the second vector from the first, element by element.
   a <- c(2, 3.3, 4)
   b <- c(11, 5, 3)
   print(a-b)
   Output: [1] -9.0 -1.7  1.0

3. * : Multiplies two vectors element by element.
   a <- c(2, 3.3, 4)
   b <- c(11, 5, 3)
   print(a*b)
   Output: [1] 22.0 16.5 12.0

4. / : Divides the first vector by the second, element by element.
   a <- c(2, 3.3, 4)
   b <- c(11, 5, 3)
   print(a/b)
   Output: [1] 0.1818182 0.6600000 1.3333333

5. %% : Gives the remainder of the division of the first vector by the second.
   a <- c(2, 3.3, 4)
   b <- c(11, 5, 3)
   print(a%%b)
   Output: [1] 2.0 3.3 1.0

6. %/% : Gives the quotient of the division of the first vector by the second.
   a <- c(2, 3.3, 4)
   b <- c(11, 5, 3)
   print(a%/%b)
   Output: [1] 0 0 1

7. ^ : Raises the first vector to the exponent given by the second vector.
   a <- c(2, 3.3, 4)
   b <- c(11, 5, 3)
   print(a^b)
   Output: [1] 2048.0000  391.3539   64.0000

Relational Operators
A relational operator is a symbol which defines some kind of relation
between two entities. These include numerical equalities and inequalities. A
relational operator compares each element of the first vector with the
corresponding element of the second vector. The result of the comparison
will be a Boolean value. There are the following relational operators which
are supported by R:

1. > : Compares element by element and returns TRUE where the element of the first vector is greater than the corresponding element of the second vector.
   a <- c(1, 3, 5)
   b <- c(2, 4, 6)
   print(a>b)
   Output: [1] FALSE FALSE FALSE

2. < : Returns TRUE where the element of the first vector is less than the corresponding element of the second vector.
   a <- c(1, 9, 5)
   b <- c(2, 4, 6)
   print(a<b)
   Output: [1]  TRUE FALSE  TRUE

3. <= : Returns TRUE where the element of the first vector is less than or equal to the corresponding element of the second vector.
   a <- c(1, 3, 5)
   b <- c(2, 3, 6)
   print(a<=b)
   Output: [1] TRUE TRUE TRUE

4. >= : Returns TRUE where the element of the first vector is greater than or equal to the corresponding element of the second vector.
   a <- c(1, 3, 5)
   b <- c(2, 3, 6)
   print(a>=b)
   Output: [1] FALSE  TRUE FALSE

5. == : Returns TRUE where the element of the first vector is equal to the corresponding element of the second vector.
   a <- c(1, 3, 5)
   b <- c(2, 3, 6)
   print(a==b)
   Output: [1] FALSE  TRUE FALSE

6. != : Returns TRUE where the element of the first vector is not equal to the corresponding element of the second vector.
   a <- c(1, 3, 5)
   b <- c(2, 3, 6)
   print(a!=b)
   Output: [1]  TRUE FALSE  TRUE

Logical Operators
The logical operators allow a program to make a decision on the basis of
multiple conditions. In the program, each operand is considered as a
condition which can be evaluated to a false or true value. The value of the
conditions is used to determine the overall value of the expression op1 operator op2.
Logical operators are applicable to those vectors whose type is logical,
numeric, or complex.

The logical operator compares each element of the first vector with the
corresponding element of the second vector.

There are the following types of operators which are supported by R:

1. & : Element-wise Logical AND. Compares each element of both vectors and returns TRUE where both corresponding elements evaluate to TRUE (non-zero).
   a <- c(3, 0, TRUE, 2+2i)
   b <- c(2, 4, TRUE, 2+3i)
   print(a&b)
   Output: [1]  TRUE FALSE  TRUE  TRUE

2. | : Element-wise Logical OR. Returns TRUE where at least one of the corresponding elements evaluates to TRUE.
   a <- c(3, 0, TRUE, 2+2i)
   b <- c(2, 4, TRUE, 2+3i)
   print(a|b)
   Output: [1] TRUE TRUE TRUE TRUE

3. ! : Logical NOT. Gives the opposite logical value of each element of the vector.
   a <- c(3, 0, TRUE, 2+2i)
   print(!a)
   Output: [1] FALSE  TRUE FALSE FALSE

4. && : Logical AND on single values; returns TRUE only if both operands are TRUE. (Historically it used the first element of each vector; in R 4.3 and later it requires length-one operands, so pass the first elements explicitly.)
   a <- c(3, 0, TRUE, 2+2i)
   b <- c(2, 4, TRUE, 2+3i)
   print(a[1] && b[1])
   Output: [1] TRUE

5. || : Logical OR on single values; returns TRUE if at least one operand is TRUE.
   a <- c(3, 0, TRUE, 2+2i)
   b <- c(2, 4, TRUE, 2+3i)
   print(a[1] || b[1])
   Output: [1] TRUE

Assignment Operators
An assignment operator is used to assign a new value to a variable. In R,
these operators are used to assign values to vectors. There are the following
types of assignment operators which are supported by R:

1. <- or = or <<- : Left assignment operators.
   a <- c(3, 0, TRUE, 2+2i)
   b <<- c(2, 4, TRUE, 2+3i)
   d = c(1, 2, TRUE, 2+3i)
   print(a)
   print(b)
   print(d)
   Output:
   [1] 3+0i 0+0i 1+0i 2+2i
   [1] 2+0i 4+0i 1+0i 2+3i
   [1] 1+0i 2+0i 1+0i 2+3i

2. -> or ->> : Right assignment operators.
   c(3, 0, TRUE, 2+2i) -> a
   c(2, 4, TRUE, 2+3i) ->> b
   print(a)
   print(b)
   Output:
   [1] 3+0i 0+0i 1+0i 2+2i
   [1] 2+0i 4+0i 1+0i 2+3i

Miscellaneous Operators
Miscellaneous operators are used for a special and specific purpose. These
operators are not used for general mathematical or logical computation.
There are the following miscellaneous operators which are supported in R

1. : (colon) : Creates a series of numbers in sequence for a vector.
   v <- 1:8
   print(v)
   Output: [1] 1 2 3 4 5 6 7 8

2. %in% : Identifies whether an element belongs to a vector.
   a1 <- 8
   a2 <- 12
   d <- 1:10
   print(a1 %in% d)
   print(a2 %in% d)
   Output:
   [1] TRUE
   [1] FALSE

3. %*% : Performs matrix multiplication, for example multiplying a matrix with its transpose.
   M = matrix(c(1,2,3,4,5,6), nrow=2, ncol=3, byrow=TRUE)
   result = M %*% t(M)
   print(result)
   Output:
        [,1] [,2]
   [1,]   14   32
   [2,]   32   77

CONDITIONAL STATEMENTS
1) IF STATEMENT
2) IF-ELSE STATEMENT
3) ELSE-IF STATEMENT
4) SWITCH STATEMENT

R if Statement
The if statement consists of the Boolean expressions followed by one or
more statements. The if statement is the simplest decision-making
statement which helps us to take a decision on the basis of the condition.

The if statement is a conditional programming statement which performs its
task and displays the information only if the condition is proved true.

The block of code inside the if statement will be executed only when the
boolean expression evaluates to true. If the expression evaluates to false,
then the code mentioned after the if block will run.

The syntax of if statement in R is as follows:

if(boolean_expression) {
// If the boolean expression is true, then statement(s) will be executed.
}

Flow Chart
Let see some examples to understand how if statements work and perform a
certain task in R.

Example 1
x <-24L
y <- "shubham"
if(is.integer(x))
{
print("x is an Integer")
}

Output:

[1] "x is an Integer"
If-else statement
In the if statement, the inner code is executed when the condition is true.
The code which is outside the if block will be executed when the if condition
is false.

There is another type of decision-making statement known as the if-else
statement. An if-else statement is the if statement followed by an else
statement. In an if-else statement, the else block will be executed when the
boolean expression is false. In simple words, if the Boolean expression has
a true value, then the if block gets executed; otherwise, the else block will
get executed.

R programming treats any non-zero and non-null values as true, and if the
value is either zero or null, then it treats them as false.

The basic syntax of If-else statement is as follows:

if(boolean_expression) {
// statement(s) will be executed if the boolean expression is true.
} else {
// statement(s) will be executed if the boolean expression is false.
}

Flow Chart
Example 1
# local variable definition
a<- 100
#checking boolean condition
if(a<20){
# if the condition is true then print the following
cat("a is less than 20\n")
}else{
# if the condition is false then print the following
cat("a is not less than 20\n")
}
cat("The value of a is", a)

Output:

a is not less than 20
The value of a is 100
R else if statement
This statement is also known as nested if-else statement. The if statement is
followed by an optional else if..... else statement. This statement is used to
test various condition in a single if......else if statement. There are some key
points which are necessary to keep in mind when we are using the if.....else
if.....else statement. These points are as follows:

1. if statement can have either zero or one else statement and it must come
after any else if's statement.
2. if statement can have many else if's statement and they come before the
else statement.
3. Once an else if statement succeeds, none of the remaining else
if's or else's will be tested.

The basic syntax of the if...else if...else statement is as follows:

if(boolean_expression 1) {
// This block executes when the boolean expression 1 is true.
} else if( boolean_expression 2) {
// This block executes when the boolean expression 2 is true.
} else if( boolean_expression 3) {
// This block executes when the boolean expression 3 is true.
} else {
// This block executes when none of the above condition is true.
}

Flow Chart
Example 1
age <- readline(prompt="Enter age: ")
age <- as.integer(age)
if(age<18)
print("You are child")
else if(age>30)
print("You are old guy")
else
print("You are adult")

Output (if the user enters, say, 25):

[1] "You are adult"

R Switch Statement
A switch statement is a selection control mechanism that allows the value of
an expression to change the control flow of program execution via map and
search.

The switch statement is used in place of long if statements which compare a


variable with several integral values. It is a multi-way branch statement
which provides an easy way to dispatch execution for different parts of code.
This code is based on the value of the expression.

This statement allows a variable to be tested for equality against a list of


values. A switch statement is a little bit complicated. To understand it, we
have some key points which are as follows:
o If the expression type is a character string, the string is matched against the names of the
listed cases.
o If there is more than one match, the first matching element is used.
o An unnamed case, if present, acts as the default when no named case matches.
o If no case matches and no unnamed (default) case exists, the switch returns NULL invisibly.

There are basically two ways in which one of the cases is selected:

1) Based on Index
If the cases are values like a character vector, and the expression is
evaluated to a number than the expression's result is used as an index to
select the case.

2) Based on Matching Value


When the cases have both case value and output value like
["case_1"="value1"], then the expression value is matched against case
values. If there is a match with the case, the corresponding value is the
output.

The basic syntax of the switch statement is as follows:

switch(expression, case1, case2, case3....)

Flow Chart
Example 1
x <- switch(
3,
"Shubham",
"Nishka",
"Gunjan",
"Sumit"
)
print(x)

Output:

[1] "Gunjan"
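A second sketch, this time selecting based on matching value, with an unnamed element acting as the default case:

y <- switch(
  "B",
  "A" = "Excellent",
  "B" = "Good",
  "Unknown grade"   # unnamed default case
)
print(y)

Output:
[1] "Good"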
ITERATIVE PROGRAMMING IN R

1) FOR LOOP
2) WHILE LOOP
3) LOOPING OVER A LIST

For loop in R Programming


In R, a for loop is a way to repeat a sequence of instructions under certain
conditions. It allows us to automate parts of our code which need repetition.
In simple words, a for loop is a repetition control structure. It allows us to
efficiently write the loop that needs to execute a certain number of time.

In R, a for loop is defined as :

1. Instead of initializing and declaring a loop counter variable, we declare


a variable which is of the same type as the base type of the vector,
matrix, etc., followed by a colon, which is then followed by the array or
matrix name.
2. In the loop body, use the loop variable rather than using the indexed
array element.

There is a following syntax of for loop in R:

for (value in vector)


{
statements
}

Flowchart of For loop in R:


# R program to illustrate of for loop

# assigning strings to the vector

PROGRAM:
week <- c('Sunday',
'Monday',
'Tuesday',
'Wednesday',
'Thursday',
'Friday',
'Saturday')
for (day in week)
{

# displaying each string in the vector


print(day)
}
Output:
[1] "Sunday"
[1] "Monday"
[1] "Tuesday"
[1] "Wednesday"
[1] "Thursday"
[1] "Friday"
[1] "Saturday"

In the above program, initially, all the days(strings) of the week are assigned to
the vector week. Then for loop is used to iterate over each string in a week. In
each iteration, each day of the week is displayed.

WHILE LOOP
While loop is used when the exact number of iterations of loop is not known
beforehand. It executes the same code again and again until a stop condition is
met. While loop checks for the condition to be true or false n+1 times rather
than n times. This is because the while loop checks for the condition before
entering the body of the loop.

R- While loop Syntax:


while (test_expression)
{
statement
update_expression
}

How does a While loop execute?


 Control falls into the while loop.
 The flow jumps to Condition
 Condition is tested.
 If Condition yields true, the flow goes into the Body.
 If Condition yields false, the flow goes outside the loop
 The statements inside the body of the loop get executed.
 Updation takes place.
 Control flows back to Step 2.
 The while loop has ended and the flow has gone outside.
R – while loop Flowchart:

WHILE LOOP PROGRAM

v <- c("Hello","while loop")


cnt <- 2

while (cnt < 7) {


print(v)
cnt = cnt + 1
}

When the above code is compiled and executed, it produces the following result −
[1] "Hello" "while loop"
[1] "Hello" "while loop"
[1] "Hello" "while loop"
[1] "Hello" "while loop"
[1] "Hello" "while loop"
LOOPING OVER LIST:
Looping over a list is just as easy and convenient as looping over a
vector. The examples below use the following list:

team_info <- list("Mavericks", c("G", "F", "C"), 3)

There are 3 ways of looping over a list:

1) Loop Through List & Display All Sub-Elements on Same Line

#print each sub-element on same line
for (i in team_info) {
print(i)
}

output

[1] "Mavericks"
[1] "G" "F" "C"
[1] 3

2: Loop Through List & Display All Sub-Elements on Different Lines

#print each sub-element on different lines


for (i in team_info) {
for(j in i)
{print(j)}
}

output

[1] "Mavericks"
[1] "G"
[1] "F"
[1] "C"
[1] 3

3: Loop Through List & Only Display Specific Values

#only display first value in each element of list


for(i in 1:length(team_info)) {
print(team_info[[i]][1])
}

output

[1] "Mavericks"
[1] "G"
[1] 3

FUNCTIONS IN R
A function is a set of statements organized together to perform a specific task. R has a large
number of in-built functions and the user can create their own functions.
In R, a function is an object so the R interpreter is able to pass control to the function, along with
arguments that may be necessary for the function to accomplish the actions.
The function in turn performs its task and returns control to the interpreter as well as any result
which may be stored in other objects.

Function Definition
An R function is created by using the keyword function. The basic syntax of an R function
definition is as follows −

function_name <- function(arg_1, arg_2, ...)


{
Function body
}

Function Components
The different parts of a function are −
 Function Name − This is the actual name of the function. It is stored in R environment as
an object with this name.
 Arguments − An argument is a placeholder. When a function is invoked, you pass a
value to the argument. Arguments are optional; that is, a function may contain no
arguments. Also arguments can have default values.
 Function Body − The function body contains a collection of statements that defines what
the function does.
 Return Value − The return value of a function is the last expression in the function body
to be evaluated.
R has many in-built functions which can be directly called in the program without defining them
first. We can also create and use our own functions referred as user defined functions.

Built-in Function
Simple examples of in-built functions are seq(), mean(), max(), sum(x) and paste(...) etc. They
are directly called by user written programs. You can refer most widely used R functions.
# Create a sequence of numbers from 32 to 44.
print(seq(32,44))

# Find mean of numbers from 25 to 82.


print(mean(25:82))
# Find sum of numbers from 41 to 68.
print(sum(41:68))

When we execute the above code, it produces the following result −


[1] 32 33 34 35 36 37 38 39 40 41 42 43 44
[1] 53.5
[1] 1526

User-defined Function
We can create user-defined functions in R. They are specific to what a user wants and once
created they can be used like the built-in functions. Below is an example of how a function is
created and used.

# Create a function to print squares of numbers in sequence.


new.function <- function(a) {
for(i in 1:a) {
b <- i^2
print(b)
}
}
Calling a Function
# Create a function to print squares of numbers in sequence.
new.function <- function(a) {
for(i in 1:a) {
b <- i^2
print(b)
}
}

# Call the function new.function supplying 6 as an argument.


new.function(6)

When we execute the above code, it produces the following result −


[1] 1
[1] 4
[1] 9
[1] 16
[1] 25
[1] 36

Function calling with an argument


We can easily call a function by passing an appropriate argument in the function.
Let see an example to see how a function is called.

# Creating a function to print squares of numbers in sequence.


new.function <- function(a) {
   for(i in 1:a) {
      b <- i^2
      print(b)
   }
}

# Calling the function new.function supplying 10 as an argument.

new.function(10)
Output:

[1] 1
[1] 4
[1] 9
[1] 16
[1] 25
[1] 36
[1] 49
[1] 64
[1] 81
[1] 100

Function calling with no argument


In R, we can call a function without an argument in the following
way

# Creating a function to print squares of numbers in sequence.


new.function <- function() {
for(i in 1:5) {
a <- i^2
print(a)
}
}

# Calling the function new.function with no argument.


new.function()
Output:

[1] 1
[1] 4
[1] 9
[1] 16
[1] 25

Function calling with Argument Values


We can supply the arguments to a function call in the same sequence as defined in
the function or can supply in a different sequence but assigned them to the names
of the arguments.

# Creating a function with arguments.


new.function <- function(x,y,z) {
result <- x * y + z
print(result)
}

# Calling the function by position of arguments.


new.function(11,13,9)

# Calling the function by names of the arguments.


new.function(x = 2, y = 5, z = 3)
Output:

[1] 152
[1] 13
Function calling with default arguments
To get the default result, we assign the value to the arguments in the function
definition, and then we call the function without supplying argument. If we pass any
argument in the function call, then it will get replaced with the default value of the
argument in the function definition.

# Creating a function with arguments.


new.function <- function(x = 11, y = 24)
{
result <- x * y
print(result)
}

# Calling the function without giving any argument.


new.function()

# Calling the function with giving new values of the argument.


new.function(4,6)
Output:

[1] 264
[1] 24
Recursion
Recursion, in the simplest terms, is a type of looping technique. It exploits the basic
working of functions in R. Recursion is when the function calls itself. This forms a loop,
where every time the function is called, it calls itself again and again and this technique is
known as recursion.

Recursive functions call themselves. They break down the problem into smaller
components. The function() calls itself within the original function() on each of the smaller
components. After this, the results will be put together to solve the original problem.
Key Features of R Recursion
 The use of recursion, often, makes the code shorter and it also looks clean.
 It is a simple solution for a few cases.
 It expresses in a function that calls itself.

Example: Factorial using Recursion in R

rec_fac <- function(x){
   if(x==0 || x==1)
   {
      return(1)
   }
   else
   {
      return(x*rec_fac(x-1))
   }
}
rec_fac(5)

Output: [1] 120, i.e., 5! = 120
Sum of Series Using Recursion
Recursion in R is most useful for finding the sum of self-repeating
series. In this example, we will find the sum of squares of a given series
of numbers.
SUM = 1² + 2² + … + N²

sum_series <- function(vec){
   if(length(vec)<=1){
      return(vec^2)
   }
   else
   {
      return(vec[1]^2 + sum_series(vec[-1]))
   }
}
series <- c(1:10)
sum_series(series)

Output: [1] 385

Applications of Recursion in R
Recursive functions are used in many efficient programming techniques
like dynamic programming or divide and conquer algorithms.

In dynamic programming, for both top-down as well as bottom-up


approaches, recursion is vital for performance.
In divide and conquer algorithms, we divide a problem into smaller sub-
problems that are easier to solve. The output is then built back up to
the top. Recursion has a similar process, which is why it is used to
implement such algorithms.
In its essence, recursion is the process of breaking down a problem into
many smaller problems, these smaller problems are further broken
down until the problem left is trivial. The solution is then built back up
piece by piece.
NESTED FUNCTIONS
A nested function or the enclosing function is a function that is
defined within another function. In simpler words, a nested function is a
function in another function.

There are two ways to create a nested function in the R programming


language:

1. Calling a function within another function we created.


2. Writing a function within another function.

Calling a function within another function


For this process, we have to create the function with two or more
required parameters. After that, we can call the function that we
created whenever we want to.

# creating a function with two parameters


myFunction <- function(x, y) {
# passing a command to the function
a <- x * y
return(a)
}

# creating a nested function


myFunction(myFunction(2,2), myFunction(3,3))

Output
[1] 36

 Line 2: We created a function with two parameter
values, x and y.

values, x and y.
 Line 4: The function we created tells x to multiply y and
assign the output to a variable a.
 Line 5: We return the value of a.
 Line 9: We make the first input myFunction(2,2) represent the
primary function’s x parameter. Likewise, we make the
second input myFunction(3,3) represent the y parameter of
the main function.

Hence, we get the following expected output: (2 * 2) * (3 * 3) = 36

Writing a function within another function


In this process, we can’t directly call the function because there is an
inner function defined inside an outer function. Therefore, we will call
the external function to call the function inside

# creating an outer function


outerFunction <- function(x)
{

# creating an inner function


innerFunction <- function(y)
{
# passing a command to the function
a <- x * y
return(a)
}
return (innerFunction)
}
# To call the outer function
output <- outerFunction(3)
output(5)

Output
[1] 15
Code explanation
• Line 2: We create an outer function, outerFunction, and pass a parameter value x to the function.
• Line 4: We create an inner function, innerFunction, and pass a parameter value y to the function.
• Line 6: We pass a command to the inner function, which multiplies x by y and assigns the result to a variable a.
• Line 7: We return the value of a.
• Line 9: We return the innerFunction.
• Line 12: We call the outer function as outerFunction(3) and store the inner function it returns in the variable output.
• Line 13: We call output(5), so 5 becomes the value of the y parameter. Therefore, the output is 3 * 5 = 15.
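Because the inner function remembers the x from the call that created it (a closure), each call to outerFunction() produces an independent multiplier. A small sketch reusing the definition above:

double <- outerFunction(2) # inner function with x fixed to 2
triple <- outerFunction(3) # inner function with x fixed to 3
double(10) # [1] 20
triple(10) # [1] 30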
LOADING R PACKAGES
The most common way of installing and loading packages is to use the install.packages() and library() functions, respectively. Here is a brief overview of these functions:
• install.packages() is used to install a required package in the R programming language.
Syntax:
install.packages("package_name")
• library() is used to load a specific package in the R programming language.
Syntax:
library(package_name)
When multiple packages have to be installed and loaded, these commands have to be specified repetitively, one per package, which makes the approach inefficient:

install.packages("ggplot2")
install.packages("dplyr")
install.packages("readxl")

library(ggplot2)
library(dplyr)
library(readxl)
A more efficient way is to install multiple packages at a time. For this we again use the install.packages() function, but pass the packages to be installed as a character vector, with each package name separated by a comma (,).
Syntax:
install.packages(c("package 1", "package 2", ..., "package n"))
Example:
install.packages(c("ggplot2", "dplyr", "readxl"))
Note that install.packages("package1", "package2", ...) without c() does not install several packages: the second argument of install.packages() is the library path (lib), so the names must be combined into a single vector.
Similarly, packages can be loaded efficiently in one of the following ways.

Method 1: Using library() repeatedly
Note that library() loads a single package per call (a second argument would be interpreted as its help argument, not another package), so to load several packages we call it once per package, or loop over a vector of names as sketched below.
Example:
library("ggplot2")
library("dplyr")
Method 2: Using pacman
For efficient package loading, we can install another package called pacman. To load multiple packages using pacman we use its function p_load().
Syntax:
pacman::p_load(package 1, ..., package n)
Example:
pacman::p_load(dplyr, ggplot2, readxl)
Math Functions
R provides various mathematical functions to perform calculations. These functions are very helpful for finding the absolute value, the square root, and much more. The following functions are commonly used:

1. abs(x): returns the absolute value of x.
   x <- -4
   print(abs(x))
   Output: [1] 4

2. sqrt(x): returns the square root of x.
   x <- 4
   print(sqrt(x))
   Output: [1] 2

3. ceiling(x): returns the smallest integer larger than or equal to x.
   x <- 4.5
   print(ceiling(x))
   Output: [1] 5

4. floor(x): returns the largest integer smaller than or equal to x.
   x <- 2.5
   print(floor(x))
   Output: [1] 2

5. trunc(x): returns the truncated (integer) part of x.
   x <- c(1.2, 2.5, 8.1)
   print(trunc(x))
   Output: [1] 1 2 8

6. round(x, digits = n): rounds x to n decimal places.
   x <- -4.545
   print(round(x, digits = 2))
   Output: [1] -4.54

7. cos(x), sin(x), tan(x): return the cosine, sine, and tangent of x (in radians).
   x <- 4
   print(cos(x))
   print(sin(x))
   print(tan(x))
   Output: [1] -0.6536436
           [1] -0.7568025
           [1] 1.157821

8. log(x): returns the natural logarithm of x.
   x <- 4
   print(log(x))
   Output: [1] 1.386294

9. log10(x): returns the common (base-10) logarithm of x.
   x <- 4
   print(log10(x))
   Output: [1] 0.60206

10. exp(x): returns e raised to the power x.
    x <- 4
    print(exp(x))
    Output: [1] 54.59815
R also provides statistical functions that operate on numeric objects:

1. mean(x): returns the mean of the object x.
   a <- c(0:10, 40)
   xm <- mean(a)
   print(xm)
   Output: [1] 7.916667

2. sd(x): returns the standard deviation of an object.
   a <- c(0:10, 40)
   xm <- sd(a)
   print(xm)
   Output: [1] 10.58694

3. median(x): returns the median.
   a <- c(0:10, 40)
   xm <- median(a)
   print(xm)
   Output: [1] 5.5

4. range(x): returns the range (minimum and maximum).
   a <- c(0:10, 40)
   xm <- range(a)
   print(xm)
   Output: [1] 0 40

5. sum(x): returns the sum.
   a <- c(0:10, 40)
   xm <- sum(a)
   print(xm)
   Output: [1] 95

6. min(x): returns the minimum value.
   a <- c(0:10, 40)
   xm <- min(a)
   print(xm)
   Output: [1] 0

7. max(x): returns the maximum value.
   a <- c(0:10, 40)
   xm <- max(a)
   print(xm)
   Output: [1] 40
Scope of a Variable
The location where we can find a variable and access it if required is called the scope of a variable. There are mainly two types of variable scope:
• Global variables: variables that exist throughout the execution of a program. They can be changed and accessed from any part of the program.
• Local variables: variables that exist only within a certain part of a program, like a function, and are released when the function call ends.

Global Variable
As the name suggests, global variables can be accessed from any part of the program.
• They are available throughout the lifetime of a program.
• They are declared anywhere in the program, outside all of the functions or blocks.
• Declaring global variables: global variables are usually declared outside of all of the functions and blocks. They can be accessed from any portion of the program.
# R program to illustrate
# usage of global variables
# global variable
global = 5
display = function()
{
print(global)
}
display()
# changing value of global variable
global = 10
display()
Output:
[1] 5
[1] 10
In the above code, the variable global is declared at the top of the program, outside all of the functions, so it is a global variable and can be accessed or updated from anywhere in the program.
Local Variable
Variables defined within a function or block are said to be local to that function or block.
• Local variables do not exist outside the block in which they are declared, i.e., they cannot be accessed or used outside that block.
• Declaring local variables: local variables are declared inside a block.
Example:
# usage of local variables

func = function()
{
age = 18
}

print(age)

Output:
Error in print(age) : object 'age' not found
The above program displays an error saying "object 'age' not found". The variable age was declared within the function func(), so it is local to that function and not visible to the portion of the program outside this function.
To correct the above error, we have to display the value of the variable age from within func() itself.
Example:

# usage of local variables
func = function()
{
age = 18
print(age)
}
func()

Output:
[1] 18
Lexical Scoping
In lexical scoping, the scope of a variable is determined by the textual structure of the program. Most programming languages we use today, such as R and Python, are lexically scoped. Lexical scoping means that the location of a function's definition determines which variables the function has access to; another name for lexical scoping is static scoping. A lexically scoped function sees global values but not the local variables of its caller; to make a value visible to it, the value must be assigned globally, as the examples below show.
# R program to depict lexical scoping

# Assign a value to a
a <- 10

# Defining functions b and c
b <- function()
{
print(a)
}
c <- function()
{
global <- 20
b()
print(global)
}

# Call to function c
c()

Output:
[1] 10
[1] 20
Note that b() prints the global a (10): under lexical scoping it cannot see variables local to its caller c().
Dynamic Scoping
In dynamic scoping, a variable takes the most recent value assigned to it along the active call chain. R itself is not dynamically scoped, but we can imitate the behavior in the same example by overwriting the global a with the superassignment operator <<-:

# Assign a value to a
a <- 10

# Defining functions b and c
b <- function()
{
print(a)
}
c <- function()
{
a <<- 20 # global variable assignment
b()
}

# Call to function c
c()
Output:
[1] 20
R programming follows only lexical scoping.

Difference Between Lexical and Dynamic Scoping

Lexical | Dynamic
A variable refers to its top-level (definition-time) environment. | A variable is associated with the most recent environment.
The scope is easy to find by reading the code. | The programmer has to anticipate all possible execution contexts.
It depends on how the code is written. | It depends on how the code is executed.
The structure of the program determines which variable is referred to. | The runtime state of the program stack determines the variable.
It is a property of the program text, unrelated to the runtime stack. | It depends on the runtime stack rather than the program text.
It provides less flexibility. | It provides more flexibility.
Access to nonlocal variables under lexical scoping is fast. | Access to nonlocal variables under dynamic scoping takes more time.
Local variables can be protected from access by subprograms. | There is no way to protect local variables from access by subprograms.
UNIT-V

Data Reduction in Data Mining

Data mining is applied to selected data in a very large database. When data analysis and mining are done on a huge amount of data, processing takes a very long time, making the exercise impractical and infeasible.
Data reduction is a process that reduces the volume of the original data and represents it in a much smaller volume while maintaining the integrity of the original data. By reducing the data, the efficiency of the data mining process is improved while producing the same analytical results.
Data reduction does not affect the result obtained from data mining: the result obtained before and after data reduction is the same or almost the same. Data reduction aims to represent the data more compactly. When the data size is smaller, it is simpler to apply sophisticated and computationally expensive algorithms. The reduction of the data may be in terms of the number of rows (records) or of the number of columns (dimensions).
Techniques of Data Reduction

The following techniques or methods of data reduction are used in data mining:

1. Dimensionality Reduction

Dimensionality reduction keeps only the attributes required for our analysis when we encounter weakly important data: it eliminates attributes from the data set under consideration, thereby reducing the volume of the original data, and it reduces data size by removing outdated or redundant features. Here are three methods of dimensionality reduction.
i. Wavelet Transform: In the wavelet transform, a data vector A is transformed into a numerically different data vector A' such that both A and A' are of the same length. It is useful for reducing data because the wavelet-transformed data can be truncated: compressed data is obtained by retaining only the smallest fragment of the strongest wavelet coefficients. Wavelet transforms can be applied to data cubes, sparse data, or skewed data.
ii. Principal Component Analysis: Suppose the data set to be analyzed has tuples with n attributes. Principal component analysis searches for k n-dimensional orthogonal vectors (k ≤ n), the principal components, that can best represent the data. In this way, the original data is projected onto a much smaller space and dimensionality reduction is achieved. Principal component analysis can be applied to sparse and skewed data.
iii. Attribute Subset Selection: A large data set has many attributes, some of which are irrelevant to data mining and some of which are redundant. Attribute subset selection reduces the data volume and dimensionality by eliminating such redundant and irrelevant attributes. It ensures that we still get a good subset of the original attributes even after removing the unwanted ones: the resulting probability distribution of the data is as close as possible to the original distribution obtained using all the attributes.
2. Numerosity Reduction

Numerosity reduction reduces the original data volume and represents it in a much smaller form. This technique includes two types: parametric and non-parametric numerosity reduction.

i. Parametric: Parametric numerosity reduction stores only the parameters of a model fitted to the data instead of the original data. One method of parametric numerosity reduction is the regression and log-linear method.
Regression and Log-Linear: Linear regression models the relationship between two attributes by fitting a linear equation to the data set. Suppose we need to model a linear function between two attributes:
y = wx + b
Here, y is the response attribute and x is the predictor attribute. In data mining terms, attributes x and y are numeric database attributes, whereas w and b are the regression coefficients. Multiple linear regression lets the response variable y be modeled as a linear function of two or more predictor variables.
A log-linear model discovers the relationship between two or more discrete attributes in the database. Suppose we have a set of tuples presented in an n-dimensional space; the log-linear model is then used to study the probability of each tuple in that multidimensional space.
Regression and log-linear methods can both be used for sparse data and skewed data.
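As a rough sketch of the parametric idea (with made-up data, not from the notes): after fitting a linear model, only the two coefficients w and b need to be stored, and y values can be re-estimated from them.

# hypothetical data: y is roughly linear in x
x <- 1:20
y <- 3 * x + 5 + rnorm(20)
fit <- lm(y ~ x) # fit y = wx + b
coef(fit) # the only values we need to keep: b (intercept) and w (slope)
predict(fit, data.frame(x = 25)) # estimate y for a new x from the parameters alone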
ii. Non-Parametric: A non-parametric numerosity reduction technique does not assume any model. Non-parametric techniques give a more uniform reduction irrespective of data size, but they may not achieve as high a volume of reduction as the parametric ones. The main non-parametric data reduction techniques are histograms, clustering, sampling, data cube aggregation, and data compression.
Histogram: A histogram is a graph that represents a frequency distribution, i.e., it describes how often a value appears in the data. A histogram uses the binning method to represent the data distribution of an attribute: the values are partitioned into disjoint subsets called bins or buckets.
Suppose we have the AllElectronics data set, which contains prices for regularly sold items:
1, 1, 5, 5, 5, 5, 5, 8, 8, 10, 10, 10, 10, 12, 14, 14, 14, 15, 15, 15, 15, 15, 15, 18, 18, 18, 18, 18, 18, 18, 18, 20, 20, 20, 20, 20, 20, 20, 21, 21, 21, 21, 25, 25, 25, 25, 25, 28, 28, 30, 30, 30.
An equal-width histogram of this data shows the frequency of the price distribution (figure omitted).
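A minimal R sketch of the equal-width histogram described above, using the AllElectronics prices; the bin width of 5 is our own choice:

prices <- c(1, 1, 5, 5, 5, 5, 5, 8, 8, 10, 10, 10, 10, 12, 14, 14, 14,
15, 15, 15, 15, 15, 15, 18, 18, 18, 18, 18, 18, 18, 18,
20, 20, 20, 20, 20, 20, 20, 21, 21, 21, 21, 25, 25, 25,
25, 25, 28, 28, 30, 30, 30)
# equal-width bins: 0-5, 5-10, ..., 25-30
hist(prices, breaks = seq(0, 30, by = 5),
main = "AllElectronics item prices", xlab = "price")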
Clustering: Clustering techniques group similar objects from the data so that the objects in a cluster are similar to each other but dissimilar to objects in other clusters. How similar the objects inside a cluster are can be calculated with a distance function: the more similar two objects in a cluster are, the closer they appear in the cluster. The quality of a cluster depends on its diameter, i.e., the maximum distance between any two objects in the cluster. The cluster representations then replace the original data. This technique is more effective if the data can be classified into distinct clusters.
Sampling: One of the methods used for data reduction is sampling, as it can reduce a large data set to a much smaller data sample. Below we discuss the different ways in which we can sample a large data set D containing N tuples (a short R sketch follows the list):

a. Simple random sample without replacement (SRSWOR) of size s: s tuples (s < N) are drawn from the N tuples of D such that the probability of drawing any tuple of D is 1/N, i.e., all tuples have an equal probability of being sampled, and no tuple can be drawn twice.

b. Simple random sample with replacement (SRSWR) of size s: similar to SRSWOR, but each tuple drawn from D is recorded and then replaced into D so that it can be drawn again.

c. Cluster sample: the tuples in D are grouped into M mutually disjoint subsets (clusters). Data reduction is applied by taking an SRSWOR of s clusters, where s < M.

d. Stratified sample: the large data set D is partitioned into mutually disjoint sets called strata. A simple random sample is taken from each stratum to obtain the stratified sample. This method is effective for skewed data.
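A small sketch of SRSWOR and SRSWR in R using the built-in sample() function; the data frame D and the sizes N and s are made up for illustration:

N <- 1000
s <- 50
D <- data.frame(id = 1:N, value = rnorm(N)) # hypothetical data set of N tuples
srswor <- D[sample(N, s, replace = FALSE), ] # each tuple drawn at most once
srswr <- D[sample(N, s, replace = TRUE), ] # a tuple may be drawn repeatedly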
3. Data Cube Aggregation

This technique is used to aggregate data into a simpler form. Data cube aggregation is a multidimensional aggregation that uses aggregation at various levels of a data cube to represent the original data set, thus achieving data reduction.

For example, suppose you have the AllElectronics sales per quarter for the years 2018 to 2022. If you want the annual sales per year, you just have to aggregate the sales per quarter for each year. In this way, aggregation provides you with the required data, which is much smaller in size, and we achieve data reduction without losing any information.

Data cube aggregation eases multidimensional analysis: the data cube presents precomputed and summarized data, which gives data mining fast access.
4. Data Compression

Data compression modifies, encodes, or converts the structure of the data in a way that consumes less space. It builds a compact representation of information by removing redundancy and representing data in binary form. Data that can be restored exactly from its compressed form is said to use lossless compression; in contrast, when it is not possible to restore the original form from the compressed form, the compression is lossy. Dimensionality and numerosity reduction methods are also used for data compression.

This technique reduces the size of files using different encoding mechanisms, such as Huffman encoding and run-length encoding. Based on the compression technique, we can divide it into two types:
i. Lossless Compression: encoding techniques such as run-length encoding allow a simple and minimal data size reduction. Lossless data compression uses algorithms to restore the precise original data from the compressed data.
ii. Lossy Compression: in lossy data compression, the decompressed data may differ from the original data but is still useful enough to retrieve information from. For example, the JPEG image format is a lossy compression, but we can find meaning equivalent to the original image. Methods such as the discrete wavelet transform and PCA (principal component analysis) are examples of this compression.
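Run-length encoding is easy to demonstrate in R, which ships with rle() and inverse.rle(); this small sketch shows that the encoding is lossless:

x <- c("a", "a", "a", "b", "b", "a", "a")
enc <- rle(x) # lengths: 3 2 2, values: "a" "b" "a"
identical(inverse.rle(enc), x) # TRUE: the original data is restored exactly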
5. Discretization Operation

The data discretization technique is used to divide attributes of a continuous nature into data with intervals: we replace many constant values of an attribute with labels of small intervals, so that mining results are shown in a concise and easily understandable way (see the sketch after this list).

i. Top-down discretization: if we first consider one or a couple of points (so-called breakpoints or split points) to divide the whole set of attributes and repeat this method down to the end, the process is known as top-down discretization, also known as splitting.
ii. Bottom-up discretization: if we first consider all the constant values as split points and discard some through a combination of neighboring values in the interval, the process is called bottom-up discretization, also known as merging.
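A hedged sketch of top-down-style discretization in R using the built-in cut() function; the ages, split points, and interval labels are made up for illustration:

ages <- c(3, 15, 22, 27, 31, 44, 58, 63, 70)
# replace continuous values with interval labels (split points: 18, 40, 60)
bins <- cut(ages, breaks = c(0, 18, 40, 60, 100),
labels = c("youth", "young adult", "middle aged", "senior"))
table(bins) # how many values fall in each interval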
Benefits of Data Reduction

The main benefit of data reduction is simple: the more data you can fit into a terabyte of disk space, the less capacity you will need to purchase. Other benefits of data reduction include:

o Data reduction can save energy.
o Data reduction can reduce your physical storage costs.
o Data reduction can decrease your data center footprint.

Data reduction greatly increases the efficiency of a storage system and directly impacts your total spending on capacity.
DATA VISUALIZATION
Pixel-oriented visualization techniques:

• A simple way to visualize the value of a dimension is to use a pixel whose color reflects the dimension's value.
• For a data set of m dimensions, pixel-oriented techniques create m windows on the screen, one for each dimension.
• The m dimension values of a record are mapped to m pixels at the corresponding positions in the windows.
• The color of each pixel reflects the corresponding value.
• Inside a window, the data values are arranged in some global order shared by all windows.
• E.g.: AllElectronics maintains a customer information table, which consists of 4 dimensions: income, credit_limit, transaction_volume, and age. We analyze the correlation between income and the other attributes by visualization.
• We sort all customers by income in ascending order and use this order to lay out the customer data in the 4 visualization windows (figure omitted).
• The pixel colors are chosen so that the smaller the value, the lighter the shading.
• Using pixel-based visualization, we can easily observe that credit_limit increases as income increases, that customers whose income is in the middle range are more likely to purchase more from AllElectronics, and that there is no clear correlation between income and age.
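A rough sketch of the pixel-oriented idea in R using image(), with simulated customer data; the attribute values and the grid layout are our own assumptions:

set.seed(1)
n <- 200
income <- sort(runif(n, 2e4, 2e5)) # global order: ascending income
credit_limit <- income * 0.5 + rnorm(n, sd = 5e3) # correlated with income
age <- runif(n, 18, 80) # unrelated to income
vals <- list(income = income, credit_limit = credit_limit, age = age)
par(mfrow = c(1, 3)) # one window per dimension
for (nm in names(vals)) {
# fill a small pixel grid in the shared (income) order; color encodes the value
image(matrix(vals[[nm]], nrow = 20), axes = FALSE, main = nm)
}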
Geometric Projection Visualization Techniques

• A drawback of pixel-oriented visualization techniques is that they cannot help us much in understanding the distribution of data in a multidimensional space.
• Geometric projection techniques help users find interesting projections of multidimensional data sets.
• A scatter plot displays 2-D data points using Cartesian coordinates. A third dimension can be added using different colors or shapes to represent different data points, e.g. where x and y are two spatial attributes and the third dimension is represented by different shapes.
• Through such a visualization (figure omitted), we can see that points of types "y" and "x" tend to be collocated.
Scatterplot Matrices

The scatter-plot matrix is an extension of the scatter plot. For k-dimensional data, a minimum of (k² - k)/2 pairwise 2-D scatterplots is required, and there can be a maximum of k² 2-D plots. In the full k × k matrix (figure omitted), k are X-X plots on the diagonal, and every X-Y plot (where X and Y are distinct dimensions) appears in two orientations (X vs. Y and Y vs. X).
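In R, a scatter-plot matrix can be drawn with the built-in pairs() function; this sketch uses the standard iris data set (k = 4 numeric dimensions, so a 4 x 4 matrix of panels):

# point color encodes the species, acting as a third dimension in each panel
pairs(iris[, 1:4], col = as.numeric(iris$Species))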
Parallel Coordinates

The scatter-plot matrix becomes less effective as the dimensionality increases. Another technique, called parallel coordinates, can handle higher dimensionality. It draws n equidistant axes, parallel to one of the screen axes, corresponding to the n attributes (dimensions). The axes are scaled to the [minimum, maximum] range of the corresponding attribute. Every data item corresponds to a polygonal line that intersects each axis at the point corresponding to the value of that attribute (see the R sketch below).
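A parallel-coordinates sketch in R using parcoord() from the recommended MASS package, again on the iris data:

library(MASS)
# each flower becomes a polygonal line across the four attribute axes
parcoord(iris[, 1:4], col = as.numeric(iris$Species))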
Icon-Based Visualization Techniques
Icon-based techniques visualize data values as features of icons. Typical visualization methods:
Chernoff faces
Stick figures
General techniques:
Shape coding: use shape to represent certain information encodings.
Color icons: use color icons to encode more information.
Tile bars: use small icons to represent the relevant feature vectors in document retrieval.
Chernoff Faces
A way to display variables on a two-dimensional surface, e.g., let x be eyebrow slant, y be eye size, z be nose length, etc. The figure (omitted) shows faces produced using 10 characteristics: head eccentricity, eye size, eye spacing, eye eccentricity, pupil size, eyebrow slant, nose size, mouth shape, mouth size, and mouth opening, each assigned one of 10 possible values.
Stick Figures
A census data figure can show age, income, gender, and education with a 5-piece stick figure (1 body and 4 limbs with different angles/lengths). Age and income are indicated by the position of the figure; gender and education are indicated by angle/length. The visualization can show a texture pattern.
Hierarchical Visualization
For a large data set of high dimensionality, it would be difficult to visualize all dimensions at the same time. Hierarchical visualization techniques partition all dimensions into subsets (i.e., subspaces), and the subspaces are visualized in a hierarchical manner.
"Worlds-within-Worlds," also known as n-Vision, is a representative hierarchical visualization method. To visualize a 6-D data set with dimensions F, X1, X2, X3, X4, X5, where we want to observe how F changes with respect to the other dimensions, we can fix X3, X4, X5 at selected values and visualize the changes in F with respect to X1 and X2.
Visualizing Complex Data and Relations

Most visualization techniques were designed mainly for numeric data. Recently, more and more non-numeric data, such as text and social networks, have become available. Many people on the Web tag various objects such as pictures, blog entries, and product reviews. A tag cloud is a visualization of the statistics of user-generated tags. Often, in a tag cloud, tags are listed alphabetically or in a user-preferred order, and the importance of a tag is indicated by font size or color.

You might also like