
Unit 1 Introduction

The document provides an overview of data types, including qualitative and quantitative data, and their categorization methods. It discusses the significance of Big Data, its characteristics known as the '4Vs' (Volume, Velocity, Variety, Veracity), and the importance of data analytics for informed decision-making and competitive advantage. Additionally, it highlights the evolution of data generation and consumption in the modern landscape.

Uploaded by

anshpatel184
Copyright
© © All Rights Reserved

1. Data Definitions and Analysis Techniques

Introduction to Data
□ Data refers to raw facts, observations, or information that can be collected,
recorded, and analyzed. Data can take various forms, including numbers,
text, images, audio, and more.

□ Types of Data:
□ Quantitative Data (Numeric): Measurable and represented by numbers
(e.g., height, temperature).
□ Qualitative Data (Categorical): Descriptive and represented by
categories or labels (e.g., colors, types).

Types of Data

Qualitative or Categorical Data
Qualitative data, also known as categorical data, describes data that fits into categories. Qualitative data are not numerical. Categorical information involves categorical variables that describe features such as a person's gender, home town etc. Categorical measures are defined in terms of natural language specifications, not in terms of numbers.
Sometimes categorical data can hold numerical values, but those values do not have a mathematical meaning. Examples of categorical data are birthdate, favourite sport and school postcode. Here, the birthdate and school postcode are written with digits, but arithmetic on them (for example, averaging postcodes) is not meaningful.
Nominal Data
□ Nominal data is a type of qualitative data that groups variables into categories. These categories are purely descriptive, have no quantitative or numeric value, and cannot be placed into any kind of meaningful order or hierarchy.
□ It labels the variables without providing a numerical value. Nominal data is also called the nominal scale. It cannot be ordered or measured. Examples of nominal data are letters, symbols, words, gender etc.
□ Nominal data are examined using the grouping method. In this method, the data are grouped into categories, and then the frequency or the percentage of each category is calculated. These data are often represented visually using pie charts.

Ordinal Data
□ Ordinal data is a type of data that follows a natural order. The significant feature of ordinal data is that the difference between the data values is not determined. This kind of variable is mostly found in surveys, finance, economics, questionnaires, and so on.
□ Ordinal data is commonly represented using a bar chart. These data are investigated and interpreted through many visualisation tools. The information may also be expressed using tables in which each row shows a distinct category.
Quantitative or Numerical Data
Quantitative data, also known as numerical data, represents a numerical value (i.e., how much, how often, how many). Numerical data gives information about the quantities of a specific thing. Some examples of numerical data are height, length, size, weight, and so on. Quantitative data can be classified into two types: discrete data and continuous data.
Discrete Data
Discrete data can take only distinct, separate values. The possible values can be listed or counted, and they cannot be subdivided meaningfully. Here, things are counted in whole numbers.
Example: Number of students in the class
Continuous Data
Continuous data is data that is measured rather than counted. It can take any of an infinite number of possible values within a given range.
Example: Temperature range
Data Categorization Based on Measurement:
Definition: Data categorization involves classifying data into different groups or categories
based on certain characteristics. This process helps in organizing and making sense of
diverse data types.
Methods of Data Categorization:
Nominal Categorization: Categorizing data into distinct categories without any inherent
order or ranking. Examples include colors or types of fruits.
Ordinal Categorization: Categorizing data where there is a meaningful order or
ranking. Examples include education levels or survey ratings.
Interval Categorization: Categorizing data with equal intervals between consecutive points,
but the absence of a true zero point. Temperature measured in Celsius is an example.
Ratio Categorization: Categorizing data with equal intervals between consecutive points and
a true zero point. Examples include height, weight, and income.
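
The scales above map onto R objects fairly directly. The following is a minimal sketch, assuming made-up values: nominal data as an unordered factor, ordinal data as an ordered factor, and ratio data as plain numbers. The variable names (fruit, rating, height_cm) are illustrative only.

```r
# Nominal: categories with no inherent order (illustrative values).
fruit <- factor(c("apple", "banana", "apple", "cherry"))

# Ordinal: categories with a meaningful order.
rating <- factor(c("low", "high", "medium", "low"),
                 levels = c("low", "medium", "high"),
                 ordered = TRUE)

# Frequency table: the grouping method used for categorical data.
fruit_counts <- table(fruit)
print(fruit_counts)

# Order comparisons are only defined for ordered factors.
print(rating[2] > rating[1])

# Ratio data (true zero point): ratios between values are meaningful.
height_cm <- c(150, 165, 180)
print(height_cm[3] / height_cm[1])
```

Comparing two values of the unordered factor with > would give NA with a warning, which mirrors the rule that nominal categories cannot be ranked.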

[Figure: Types of Scale]

Purpose of Data Categorization:
□ Facilitates data organization and management.
□ Aids in the analysis and interpretation of data.
□ Enables efficient communication of information.
□ Understanding data, elements, variables, and data categorization
provides the foundation for statistical analysis and data-driven
decision-making. The appropriate categorization method depends
on the nature of the data and the goals of the analysis.

Terms

Elements:
Definition: Elements are the individual entities or units within a dataset. Each
element represents a single observation or data point.
Examples:
In a dataset of student exam scores, each student's score is an element.
In a survey about favorite colors, each respondent's choice represents an element.
Variables:
Definition: Variables are characteristics or attributes that can take different values. They
are the properties of the elements being measured or observed.
Types of Variables:
Independent Variable: The variable manipulated or controlled in an experiment.
Dependent Variable: The variable being measured or observed, affected by the
independent variable.

Classification of Digital Data
Digital data can be classified based on various characteristics, including
its format, structure, and nature. Here are some common classifications
of digital data:
□ Format Based Classification

□ Structure Based Classification

□ Nature Based Classification

□ Domain Specific Classification

Format-Based Classification:

a. Text Data:
Consists of alphanumeric characters and is typically human-readable. Examples include documents, emails, and
web pages.
b. Numeric Data:
Comprises numerical values and is often used for quantitative analysis. Examples include spreadsheets, databases,
and numerical datasets.
c. Audio Data:
Represents sound and is stored in digital audio formats. Examples include MP3 files, WAV files, and
streaming audio.
d. Image Data:
Consists of visual information and is stored in digital image formats. Examples include JPEG images, PNG
images, and GIFs.
e. Video Data:
Represents moving images and is stored in digital video formats. Examples include MP4 videos, AVI files,
and streaming video.
f. Multimedia Data:
Combines multiple types of data, such as text, audio, images, and video.
Examples include multimedia presentations, interactive content, and multimedia websites.

Structure-Based Classification:

a. Unstructured Data:
Lacks a predefined data model and is often text-heavy.
Examples include emails, social media posts, and text documents.
b. Semi-Structured Data:
Has some level of structure but does not fit neatly into traditional relational
databases.
Examples include XML files, JSON data, and certain types of log files.
c. Structured Data:
Organized in a predefined manner, often in rows and columns.
Examples include relational databases, spreadsheets, and CSV files.

Nature-Based Classification:

a. Discrete Data:
Consists of separate, distinct values with no intermediate values.
Examples include whole numbers, categories, and binary data.
b. Continuous Data:
Represents a range of values and can take any value within that range.
Examples include real numbers, temperature measurements, and time.
c. Binary Data:
Consists of bits (0s and 1s) and is fundamental to all digital data.
Examples include machine code, executable files, and binary images.

Domain-Specific Classification:

a. Scientific Data:
Data generated from scientific experiments and observations.
Examples include sensor readings, scientific simulations, and experimental results.
b. Business Data:
Data related to business processes and operations.
Examples include sales data, customer records, and financial transactions.
c. Geospatial Data:
Data associated with geographical locations.
Examples include maps, GPS data, and satellite imagery.
d. Health Data:
Data related to healthcare and medical information.
Examples include electronic health records, medical imaging data, and patient
demographics.

Big Data: Introduction
□ Big Data may well be the Next Big Thing in the IT world.

□ Big data burst upon the scene in the first decade of the 21st century.

□ The first organizations to embrace it were online and startup firms. Firms like Google, eBay, LinkedIn, and Facebook were built around big data from the beginning.

□ Like many new information technologies, big data can bring about
dramatic cost reductions, substantial improvements in the time
required to perform a computing task, or new product and service
offerings.

What is BIG DATA?
□ 'Big Data' is similar to 'small data', but bigger in size.
□ Because the data is bigger, it requires different approaches:
□ Techniques, tools and architecture
□ It aims to solve new problems, or old problems in a better way.
□ Big Data generates value from the storage and processing of very large quantities of digital information that cannot be …
Continued…

□ The basic idea behind the phrase 'Big Data' is that everything we do is
increasingly leaving a digital trace (or data), which we (and others) can use
and analyse.
□ Big Data therefore refers to our ability to make use of the ever-
increasing volumes of data.
□ Big Data is completely transforming the way we do business and is affecting most other parts of our lives.
□ Big Data refers to extremely large and complex datasets that cannot be
easily processed, managed, or analyzed using traditional data
processing tools.
□ The term "Big Data" encompasses not only the volume of data but also
its velocity, variety, and, increasingly, veracity and value.

From the dawn of civilization
until 2003, humankind generated
five exabytes of data. Now we
produce five exabytes every two
days…and the pace is
accelerating.
Eric Schmidt,
Executive Chairman, Google

BIG DATA Everywhere

□ Lots of data is being collected and warehoused:
□ Web data, e-commerce
□ Purchases at department/grocery stores
□ Bank/credit card transactions
□ Social networks

What Comes Under Big Data?
□ Black Box Data: The black box is a component of helicopters, airplanes, jets, etc. It captures the voices of the flight crew, recordings of microphones and earphones, and the performance information of the aircraft.
□ Social Media Data: Social media such as Facebook and Twitter hold information and the views posted by millions of people across the globe.
□ Stock Exchange Data: Stock exchange data holds information about the 'buy' and 'sell' decisions that customers make on shares of different companies.
□ Power Grid Data: Power grid data holds information about the power consumed by a particular node with respect to a base station.
□ Transport Data: Transport data includes the model, capacity, distance and availability of a vehicle.
□ Search Engine Data: Search engines retrieve lots of data from different databases.

How Can You Avoid Big Data?
□ Pay cash for everything!
□ Never go online!
□ Don’t use a telephone!
□ Don’t fill any prescriptions!

The key characteristics of Big Data are often referred to as the "4Vs":
Volume: Big Data involves a massive amount of data. Traditional databases and
processing systems may struggle to handle the sheer volume, which can range
from terabytes to petabytes and beyond.
Velocity: Velocity refers to the speed at which data is generated, collected, and
processed. With the advent of real-time data sources such as sensors, social
media, and online transactions, data is often generated at high speeds.
Variety: Big Data comes in various formats and types, including structured,
semi-structured, and unstructured data. This diversity includes text, images,
videos, log files, social media posts, sensor data, and more.
Veracity: Veracity refers to the consistency, accuracy, quality and reliability of the data. Low veracity shows up as bias, noise, and abnormality in the data. Big Data sources may have inconsistencies, errors, or incomplete information. Managing and ensuring data quality is a challenge in Big Data analytics.

The Four V's at a glance:
Volume (Data at Rest): Terabytes to exabytes of existing data to process.
Velocity (Data in Motion): Streaming data, milliseconds to seconds to respond.
Variety (Data in Many Forms): Structured, unstructured, text, multimedia.
Veracity (Data in Doubt): Uncertainty due to data inconsistency and incompleteness, ambiguities, latency, deception, model approximations.

Why Four V's?
□ Whether data is structured or unstructured, it’s only as valuable
as the business outcomes it makes possible.
□ However, the data itself isn’t the only factor responsible for
those outcomes.
□ How you measure that data, from a business point of view, helps
you tie the value of the data to its potential and supports
decisions that lead to positive business results.
□ To get there, you need a big data analytics platform.

Continued

□ Once you have a platform that can measure along the four V’s
—volume, velocity, variety, and veracity—you can then extend
the outcomes of the data to impact customer acquisition,
retention, upsell, cross-sell and other revenue generating
indicators.
□ You can also look at this information as a competitive strategy that
brings corresponding improvements in operational efficiency and
helps you leverage data across the enterprise for other initiatives.

The Model Has Changed…

□ The Model of Generating/Consuming Data has Changed

Old Model: Few companies are generating data, all others are consuming
data

New Model: all of us are generating data, and all of us are consuming
data

What's Driving Big Data
- Optimizations and predictive analytics
- Complex statistical analysis
- All types of data, and many sources
- Very large datasets
- More of a real-time

- Ad-hoc querying and reporting
- Data mining techniques
- Structured data, typical sources
- Small to mid-size datasets

Data Analytics
❖ Data Science and Data Analytics are two of the most popular terms today.
❖ Data is collected in raw form and processed according to the requirements of a company.
❖ This data is used for decision making.
❖ This process helps businesses to grow in the market.
❖ But the main question arises: what is this process called? Data Analytics is the answer here.
❖ Data Analysts and Data Scientists are the ones who perform this process.

What is Data Analytics?
❖ Data or information is in raw format.
❖ The increase in the size of the data has led to a rise in the need for carrying out inspection, data cleaning, transformation and data modeling to gain insights from the data, in order to derive conclusions for a better decision-making process.
❖ This process is known as data analysis.
❖ Analysis is an interactive process of a person tackling a problem, finding the data required to get an answer, analyzing that data, and interpreting the results in order to provide a recommendation for action.
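
The inspection, cleaning and transformation steps can be sketched in a few lines of R. This is only an illustration under assumed data: the data frame raw, its columns and its values are invented for the example and do not come from the slides.

```r
# Hypothetical raw data with a missing value and inconsistent text case.
raw <- data.frame(
  region = c("north", "South", "NORTH", "south", "north"),
  sales  = c(120, 95, NA, 110, 130),
  stringsAsFactors = FALSE
)

# Inspection: look at the structure and count missing values.
str(raw)
n_missing <- sum(is.na(raw$sales))
print(n_missing)

# Cleaning: drop rows with missing sales and normalise the region labels.
clean <- raw[!is.na(raw$sales), ]
clean$region <- tolower(clean$region)

# Transformation and summary: mean sales per region.
summary_by_region <- aggregate(sales ~ region, data = clean, FUN = mean)
print(summary_by_region)
```

Dropping incomplete rows is just one possible cleaning choice; depending on the goal, missing values might instead be imputed.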
Why Data Analytics? The rise of big data is a significant factor
Informed Decision-Making: Data analytics helps organizations make informed and
data-driven decisions. By analyzing large sets of data, businesses can identify
trends, patterns, and correlations, enabling better decision-making processes.

Improved Efficiency and Productivity: Analyzing data allows organizations to identify areas of inefficiency and optimize processes. This can lead to increased productivity, reduced costs, and improved overall efficiency.

Competitive Advantage: Organizations that effectively leverage data analytics gain a competitive advantage. By understanding customer behavior, market trends, and competitors, businesses can adapt and respond quickly to changes in their environment.

Customer Insights: Data analytics provides valuable insights into customer behavior, preferences, and feedback. This information helps businesses tailor their products and services to meet customer needs, leading to improved customer satisfaction and loyalty.
Continued…
Risk Management: Analyzing data helps identify potential risks and threats
to an organization. This proactive approach allows businesses to develop
strategies for risk mitigation and crisis management.
Innovation: Data analytics can uncover new opportunities and areas for
innovation. By identifying emerging trends and market demands,
organizations can stay ahead of the curve and innovate their products or
services.
Fraud Detection and Security: In sectors such as finance and cybersecurity,
data analytics is crucial for detecting fraudulent activities and enhancing
overall security. Analyzing patterns and anomalies in data helps identify
potential security breaches and fraudulent transactions.

Big Data Vs BI
□ Big Data collectively refers to the act of generating, capturing and usually processing enormous amounts of data on a continuing basis.

□ Business Intelligence collectively refers to software and systems that import data streams of any size and use them to generate informational displays that point towards specific decisions.

Data Analyst vs Data Scientist

□ Data Scientist: "What is going to happen next?"
□ Using statistics and mathematics
□ Data Analyst: "What has happened so far?" Analysis is the process of breaking a complex topic into smaller parts to gain a better understanding of it.

Need for Data Analytics / Data Science

History
R is a programming language and software environment for statistical analysis, graphics representation and reporting.
R was created by Ross Ihaka and Robert Gentleman at the University of Auckland, New Zealand.
R is freely available under the GNU General Public License.
R is provided for various operating systems like Linux, Windows and Mac.
The language was named R after the first letter of the first names of its two authors (Robert Gentleman and Ross Ihaka).
❖ The name R is also a play on the name of the Bell Labs language S.
❖ R was initially written at the Department of Statistics of the University of Auckland in Auckland, New Zealand.
❖ R made its first appearance in 1993.
❖ Since mid-1997 there has been a core group (the "R Core Team") who can modify the R source code archive.

Why Learn R Programming Language
❖ With R, you can perform statistical analysis, data analysis as well as
machine learning.
❖ We can create objects, functions and packages in it.
❖ R is platform-independent and can be used across multiple operating systems.
❖ R is free owing to its open-source GNU licensing and can be installed by
anyone.
❖ R consists of a robust collection of graphical libraries like ggplot2, plotly and many more.
❖ R is widely used by various industries like health, finance, banking, manufacturing and many more.
❖ There are about 2 million job openings for R programmers worldwide.
❖ Companies hire R programmers for many roles like data analysts, business analysts, data visualization experts, and business intelligence …
Features of R
❑ As stated earlier, R is a programming language and software environment for statistical analysis, graphics representation and reporting. The following are the important features of R:
❑ R is a well-developed, simple and effective programming language which includes conditionals, loops, user-defined recursive functions and input and output facilities.
❑ R has an effective data handling and storage facility.
❑ R provides a suite of operators for calculations on arrays, lists, vectors and matrices.
❑ R provides a large, coherent and integrated collection of tools for data analysis.
How R is better than Other Technologies
There are certain aspects of R programming which make it attractive in comparison with other technologies:
• Graphical Libraries – Libraries like ggplot2 and plotly make it easy to produce appealing, well-defined plots.
• Availability / Cost – R is completely free.
• Advancement in Tool – R supports various advanced tools and features that allow you to build robust statistical models.
• Job Scenario – With the immense growth in Data Science and the rise in demand, R has become one of the most in-demand programming languages today.
• Customer Service Support and Community – With R, you can enjoy strong community support.
• Portability – R is highly portable. Many different programming languages and software frameworks can easily be combined with the R environment for the best results.
Sourcing of R Script
RStudio
• RStudio is an Integrated Development Environment for R.
• It facilitates extensive code editing and development, as well as various other features.
Features of RStudio
• RStudio provides various tools and features that allow you to boost your code productivity.
• It can also be accessed over the web and is cross-platform in nature.
• It facilitates automatic checking of updates.
• It provides support for recovery in case of file loss.
• With RStudio, you can manage the data more …
Components of RStudio
• Source – In the top left corner of the screen is the text editor that allows you to work with source scripts. You can enter multiple lines in this source.
• Console – This is present in the bottom left corner of the main window of RStudio. It facilitates interactive scripting in R.
• Workspace and History – In the top right corner are the R workspace and the history window. These give you the list of all the variables and let you view the list of past commands that were executed by R.

Why Do We Need Analytics using R?

Banking:
□ A large amount of customer data is generated every day in banks. While dealing with millions of customers on a regular basis, it becomes hard to track their mortgages.
□ Solution: R builds a custom model that maintains the loans provided to every individual customer, which helps us to decide the amount to be paid by the customer over time.
Insurance:
□ Insurance extensively depends on forecasting. It is difficult to decide which policy to accept or reject.
□ Solution: By using the continuous credit report as input, we can create a model in R that will not only assess risk appetite but also make a predictive forecast.

Healthcare:
□ Every year millions of people are admitted to hospital, and billions are spent annually just on the admission process.
□ Solution: Given the patient history and medical history, a predictive model can be built to identify who is at risk of hospitalization and to what extent the medical equipment should be scaled.

More Applications of R Programming
❖ Finance and banking sectors use R for detecting fraud, reducing customer churn rate and making future decisions.
❖ Bioinformatics uses R to analyze strands of genetic sequences, to perform drug discovery, and in computational neuroscience.
❖ Social media analysis uses R to discover potential customers in online advertising.
❖ Companies also use social media information to analyze customer sentiments for making their products better.
❖ E-commerce companies make use of R to analyze the purchases made by customers as well as their feedback.
❖ Manufacturing companies use R to analyze customer feedback.
❖ They also use it to predict future demand to adjust their manufacturing speeds and maximize profits.
Companies Using R
Some of the companies that are using R programming are as follows:
• Facebook
• Google
• LinkedIn
• IBM
• Twitter
• Uber
• Airbnb
• Ford Motor Company
• Microsoft
Who uses R?

□ The Consumer Financial Protection Bureau uses R for data analysis.
□ Statisticians at John Deere use R for time series modeling and geospatial analysis in a reliable and reproducible way.
□ Bank of America uses R for reporting.
□ R is part of the technology stack behind Foursquare's famed recommendation engine.
□ ANZ, the fourth largest bank in Australia, uses R for credit risk analysis.
□ Google uses R to predict economic activity.
□ Mozilla, the foundation responsible for the Firefox web browser, uses R to visualize Web activity.

1.1 Features of R
□ R allows branching and looping as well as modular programming using functions.
□ R allows integration with different programming languages like C, C++, .Net, Python etc.
□ R has an extensive community of contributors.
□ R has an effective data handling and storage facility for numeric and textual data.
□ R provides a collection of operators for calculations on arrays, lists, factors, vectors, data frames and matrices.
□ R provides a large and integrated collection of tools for data analysis and statistical functions.
□ R provides graphical facilities for data analysis and can show results both in soft and hard copies.
□ R is an integrated suite of software facilities for data manipulation, calculation and graphical display.
1.3 Getting Started
RStudio has four main window sections:
□ Top-Left Section: To write and save R code (Script section)
□ Bottom-Left Section: To execute R code and do calculations. The nature and values of all variables and objects appear here (Console section)
□ Top-Right Section: To manage datasets and variables (Data section)
□ Bottom-Right Section: To display plots and installed packages, and to seek help on R functions (Plot, Packages and Help section)

1.4 Variables in R
1. Naming Variables: A variable in R can store any R object, including an atomic vector, list, matrix, array, factor or data frame. A valid variable name consists of letters, numbers and the dot or underscore characters, and starts with a letter or with a dot that is not followed by a number.
2. Assigning Values to Variables: In R, an assignment to a variable can be done in three ways, using the =, <- and -> operators.
3. Finding Variables: To list all the variables currently available in the workspace, use the ls() function.
4. Removing Variables: A variable can be deleted by passing its name to the rm() function.
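
The four points above can be tried out with a short sketch. The variable names (x, y, z, total.sum_1) are arbitrary and chosen only to illustrate the naming rules.

```r
# Three equivalent ways to assign a value.
x = 10
y <- 20
30 -> z

# Variable names may contain letters, numbers, dots and underscores.
total.sum_1 <- x + y + z
print(total.sum_1)

# List the variables currently in the workspace.
print(ls())

# Remove a variable, then check whether it still exists.
rm(z)
print(exists("z"))
```

Of the three assignment forms, <- is the one most commonly seen in R code; = and -> behave the same way in simple statements like these.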

1.5 Input in R
□ 1.5.1 Input of Data from Terminal: The scan() function is used to take data from the user at the terminal.
□ 1.5.2 Input of Data through R Objects: There are many types of R-objects, including Vectors, Lists, Matrices, Arrays, Factors and Data Frames.
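
A minimal sketch of both input routes follows. Called with no arguments, scan() waits for values typed at the console; to keep the example non-interactive, it uses scan()'s text argument instead, and the input strings are made up.

```r
# scan() with no arguments reads values typed at the console until a blank
# line is entered. Here the 'text' argument supplies the input instead.
nums <- scan(text = "4 8 15 16 23 42")
print(nums)

# Read character data instead of numbers.
words <- scan(text = "red green blue", what = character())
print(words)

# Input through an R object: build a vector directly in code.
v <- c(4, 8, 15)
print(sum(v))
```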

1.6 Output in R

1. print() Function: print() displays one object at a time; it cannot combine two or more strings, variables, or a string and a variable in a single call.
2. cat() Function: The cat() function is an alternative to print() that lets you combine multiple items into a continuous output.
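
The difference between the two functions is easiest to see side by side. In this sketch the variable names and the value of version_major are illustrative only.

```r
name <- "R"
version_major <- 4   # illustrative value, not tied to any installation

# print() shows one object at a time.
print(name)

# cat() joins several items into one line of output.
cat("Language:", name, "major version:", version_major, "\n")

# To print a combined string, build it first with paste().
msg <- paste("Language:", name)
print(msg)
```

Note that cat() does not add a newline on its own, which is why "\n" is passed as the last item.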

1.7 Inbuilt Functions in R
□ 1.7.1 Mathematical Functions: R can also be used as a calculator, with many mathematical functions available. Ex: sqrt, abs, floor, ceiling etc.
□ 1.7.2 Trigonometric Functions: R lets the user compute results using different trigonometric functions. Ex: sin, cos, tan etc.
□ 1.7.3 Logarithmic Functions: R can compute the log of a number with a specified base. Ex: log with base 10 and the natural log.
□ 1.7.4 Date and Time Functions: Dates and times have special classes in R that allow for numerical and statistical calculations.
□ 1.7.5 Sequence Function: A sequence is a set of related numbers, events, dates etc. that follow each other in a particular order. R has a number of facilities for generating commonly used sequences of numbers.
□ 1.7.6 Repeat Function: The rep function is used to replicate the values in a vector. It is a very powerful feature in R which helps the user to create a set of values in an easy manner.
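
Each family of functions listed above can be tried with a one-line call. The input values in this sketch are arbitrary.

```r
# Mathematical functions
print(sqrt(16))
print(abs(-3.5))
print(floor(2.7))
print(ceiling(2.1))

# Trigonometric functions take angles in radians.
print(sin(pi / 2))

# Logarithms: base 10, natural, and an arbitrary base.
print(log10(1000))
print(log(exp(1)))
print(log(8, base = 2))

# Dates support arithmetic.
d <- as.Date("2024-01-31")
print(d + 1)

# Sequences and repetition
s <- seq(1, 10, by = 3)
r <- rep(c("a", "b"), times = 2)
print(s)
print(r)
```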
1.7.7 Strings
□ 1. Creating a String: A string in R is written within a pair of single or double quotes.
□ 2. Concatenating Strings: The paste() function concatenates several strings together. It creates a new string by joining the given strings end to end.
□ 3. Formatting of Strings: Strings can be formatted to a specific style according to the requirement of the user using the format() function.
□ 4. Counting Number of Characters: The nchar() function is used to count the number of characters, including spaces, in a string.
□ 5. Change Case: The toupper() and tolower() functions are used to change the case of the characters of a string.
□ 6. Extracting Parts of a String: The substring() or substr() function extracts parts of a string depending on the index position within the string.
□ 7. Searching Matches: The grep() function is used for searching for matches.
□ 8. Changing String to Expression: The eval() function evaluates an expression only, not a string, so a string must first be converted to an expression with parse().
□ 9. Split the Elements of a Vector: The strsplit() function is used to split the elements of a character vector into substrings according to the matches to the substring split within them.
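
Several of these string functions are shown in the sketch below. The example strings are made up, and only base R functions are used.

```r
s <- 'Data Analytics'   # single or double quotes both work

# Formatting: pad a string to a fixed width.
padded <- format("R", width = 5)

# Counting characters and changing case.
n     <- nchar(s)
upper <- toupper(s)
lower <- tolower(s)

# Extracting by position (characters 1 to 4).
first_word <- substr(s, 1, 4)

# Searching: grep() returns the indices of the matching elements.
hits <- grep("an", c("banana", "cherry", "mango"))

# A string must be parsed into an expression before eval() can run it.
value <- eval(parse(text = "2 + 3"))

print(padded); print(n); print(upper); print(lower)
print(first_word); print(hits); print(value)
```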
1.8 Packages in R
□ 1.8.1 Standard Packages: R packages are a collection of R functions, compiled code and sample data. They are stored under a directory called "library" in the R environment.
□ 1.8.2 Contributed Packages: There are thousands of contributed
packages for R, written by many authors. Some of these packages
implement specialized statistical methods, others give access to
data or hardware, and others are designed to complement
textbooks.
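
As a rough sketch of how both kinds of package are used: a standard package such as stats ships with R and only needs loading, while a contributed package is installed once and then loaded. The install step is left commented out because it needs network access, and ggplot2 is used purely as an example.

```r
# Directories where installed packages live (the "library").
print(.libPaths())

# Load a standard package that ships with R.
library(stats)
loaded <- "package:stats" %in% search()
print(loaded)

# A contributed package is installed once from CRAN and then loaded.
# Commented out because it requires network access:
# install.packages("ggplot2")
# library(ggplot2)
```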

R Installation

https://cran.r-project.org/bin/windows/base/

R Console Window

R Command Prompt

Once you have the R environment set up, it is easy to start the R command prompt by typing R at the command line.
This launches the R interpreter and gives you a prompt > where you can start typing your program, as follows.

> myString <- "Hello, World!"
> print(myString)
[1] "Hello, World!"


RStudio (IDE)

https://rstudio.com/products/rstudio/download/#download

R - Data Types
❖ In contrast to other programming languages like C and Java, variables in R are not declared as some data type.
❖ The variables are assigned R-objects, and the data type of the R-object becomes the data type of the variable.
❖ There are many types of R-objects. The frequently used ones are:
✓ Vectors
✓ Lists
✓ Matrices
✓ Arrays
✓ Factors
✓ Data Frames
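
A minimal sketch creating one object of each kind, with made-up contents, and showing that a variable simply takes on the type of whatever is assigned to it:

```r
v  <- c(1.5, 2.5, 3.5)                          # vector
l  <- list(1, "a", TRUE)                        # list: mixed types allowed
m  <- matrix(1:6, nrow = 2)                     # matrix: 2 rows x 3 columns
a  <- array(1:24, dim = c(2, 3, 4))             # array: three dimensions
f  <- factor(c("low", "high", "low"))           # factor: categorical values
df <- data.frame(id = 1:2, name = c("x", "y"))  # data frame: table of columns

# The variable takes its type from the object assigned to it.
x <- 10L
print(class(x))
x <- "ten"
print(class(x))

# Show the class of each object.
for (obj in list(v, l, m, a, f, df)) print(class(obj))
```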
R Functions
✓ A function is a set of statements to perform a specific task.
✓ R has a large number of in-built functions.
✓ The user can create their own functions.
Built-in Function
✓ Simple examples of in-built functions are seq(), mean(), max(), sum(x) and paste(...) etc.
✓ They are directly called by user-written programs.
# Create a sequence of numbers from 32 to 44.
print(seq(32,44))
# Find mean of numbers from 25 to 82.
print(mean(25:82))
# Find sum of numbers from 41 to 68.
print(sum(41:68))
User-defined Function
❖ They are specific to what a user wants and, once created, they can be used like the built-in functions.
# Create a function to print squares of numbers in sequence.
new.function <- function(a) {
  for (i in 1:a) {
    b <- i^2
    print(b)
  }
}
Calling a Function
# Call the function new.function supplying 6 as an argument.
new.function(6)
Produces the following result:
[1] 1
[1] 4
[1] 9
[1] 16
[1] 25
[1] 36
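A function can also return its result instead of printing it, which lets the caller reuse the values. The sketch below is an illustrative variant of the squares example; the name squares and the default argument of 6 are assumptions, not part of the original slides.

```r
# Return the squares of 1..n as a vector; n defaults to 6 (an arbitrary choice).
squares <- function(n = 6) {
  (seq_len(n))^2
}

result <- squares()      # uses the default argument
print(result)            # [1]  1  4  9 16 25 36

total <- sum(squares(3)) # 1 + 4 + 9
print(total)             # [1] 14
```

seq_len(n) is used instead of 1:n because it yields an empty sequence when n is 0, whereas 1:0 counts down from 1 to 0.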
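R String Manipulation Functions
1. grep()
It is used for pattern matching: it searches for matches to a pattern within the elements of a character vector. With value=TRUE it returns the matching elements, and with value=FALSE it returns their positions. Replacement is done with sub() and gsub().
grep("b+", c("abc", "bda", "ccaa", "abd"), perl=TRUE, value=TRUE)
[1] "abc" "bda" "abd"
grep("b+", c("abc", "bda", "ccaa", "abd"), perl=TRUE, value=FALSE)
[1] 1 2 4
grep("chid+", c("chidambaram", "Villupuram", "Srimushnam", "chidambaram"), perl=TRUE, value=FALSE)
[1] 1 4
grep("அ+", c("அṅпw", "øwøøw", "அddw"), perl=TRUE, value=FALSE)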
2. nchar()
This function counts the characters in a string.
> str <- "Big Data at DataFlair"
> nchar(str)
[1] 21
3. paste()
Concatenates any number of strings.
> #Author DataFlair
> paste("Hadoop",
[1] Matthew scored 72.30 percent
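The following sketch shows paste() with its default separator, with a custom separator, and the related paste0(). The last call uses sprintf() as one way to build a line such as "Matthew scored 72.30 percent"; whether the slides used sprintf() for that line is an assumption.

```r
# paste() joins its arguments with a single space by default.
words <- paste("Big", "Data", "Analytics")
print(words)     # [1] "Big Data Analytics"

# The sep argument changes the separator.
joined <- paste("Big", "Data", sep = "-")
print(joined)    # [1] "Big-Data"

# paste0() joins without any separator.
nosep <- paste0("Data", "Flair")
print(nosep)     # [1] "DataFlair"

# Assumption: a formatted line like the one on the slide can be built with sprintf().
line <- sprintf("%s scored %.2f percent", "Matthew", 72.3)
print(line)      # [1] "Matthew scored 72.30 percent"
```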
5. strsplit()
> #Author DataFlair
> str = "Splitting sentence into words"
> strsplit(str, " ")
> strsplit(str, "")

Output
[1] "Splitting" "sentence" "into" "words"

[1] "S" "p" "l" "i" "t" "t" "i" "n" "g" " " "s" "e" "n" "t" "e" "n" "c" "e" " " "i" "n" "t" "o" " " "w" "o" "r" "d" "s"
Vector
Vectors are the most basic R data objects and there are six types of atomic vectors. They are logical, integer, double, complex, character and raw.

Multiple Elements Vector - Using colon operator with numeric data

# Creating a sequence from 5 to 13.
v <- 5:13
print(v)
[1] 5 6 7 8 9 10 11 12 13
# Creating a sequence from 6.6 to 12.6.
v <- 6.6:12.6
print(v)
[1] 6.6 7.6 8.6 9.6 10.6 11.6 12.6
# If the final element specified does not belong to the sequence then it is discarded.
v <- 3.8:11.4
print(v)
[1] 3.8 4.8 5.8 6.8 7.8 8.8 9.8 10.8
Using the seq() function

# Create vector with elements from 5 to 9 incrementing by 0.4.
print(seq(5, 9, by = 0.4))
[1] 5.0 5.4 5.8 6.2 6.6 7.0 7.4 7.8 8.2 8.6 9.0

Using the c() function

The non-character values are coerced to character type if one of the elements is a character.

# The logical and numeric values are converted to characters.
s <- c('apple','red',5,TRUE)
print(s)
[1] "apple" "red" "5" "TRUE"

Accessing Vector Elements
❖ The [ ] brackets are used for indexing. Indexing starts with position 1.
❖ Giving a negative value in the index drops that element from the result.
❖ TRUE and FALSE can also be used for indexing. The numbers 0 and 1 are treated as positions, not as logical values: 0 selects nothing and 1 selects the first element.
# Accessing vector elements using position.
t <- c("Sun","Mon","Tue","Wed","Thurs","Fri","Sat")
u <- t[c(2,3,6)]
print(u)
[1] "Mon" "Tue" "Fri"
# Accessing vector elements using logical indexing.
v <- t[c(TRUE,FALSE,FALSE,FALSE,FALSE,TRUE,FALSE)]
print(v)
[1] "Sun" "Fri"
# Accessing vector elements using negative indexing.
x <- t[c(-2,-5)]
print(x)
[1] "Sun" "Tue" "Wed" "Fri" "Sat"
# Accessing vector elements using 0/1 indexing.
y <- t[c(0,0,0,0,0,0,1)]
print(y)
[1] "Sun"
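Indexing with 0 and 1 is easy to confuse with logical indexing, so the sketch below contrasts the two on the same vector. The variable names are illustrative.

```r
days <- c("Sun", "Mon", "Tue", "Wed", "Thurs", "Fri", "Sat")

# Numeric index: each 0 is ignored and the 1 selects position 1.
by_position <- days[c(0, 0, 0, 0, 0, 0, 1)]
print(by_position)    # [1] "Sun"

# Logical index: TRUE in the last slot selects the last element.
by_logical <- days[c(FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE)]
print(by_logical)     # [1] "Sat"

# Converting the 0/1 vector with as.logical() gives the logical behaviour.
by_converted <- days[as.logical(c(0, 0, 0, 0, 0, 0, 1))]
print(by_converted)   # [1] "Sat"
```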
R - Lists
❖ Lists are the R objects which contain elements of different types, such as numbers, strings, vectors and even another list.
❖ A list is created using the list() function.
Creating a List
# Create a list containing strings, numbers, a vector and a logical value.
list_data <- list("Red", "Green", c(21,32,11), TRUE, 51.23, 119.1)
print(list_data)
[[1]]
[1] "Red"

[[2]]
[1] "Green"

[[3]]
[1] 21 32 11

[[4]]
[1] TRUE

[[5]]
[1] 51.23

[[6]]
[1] 119.1
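List elements can be named and then retrieved in several ways. This is a minimal sketch; the element names colour, scores and passed are assumptions chosen for illustration.

```r
# A list whose elements have names.
list_data <- list(colour = "Red", scores = c(21, 32, 11), passed = TRUE)

# Single brackets return a list containing the selected element.
sub_list <- list_data["scores"]
print(class(sub_list))     # [1] "list"

# Double brackets return the element itself.
scores <- list_data[["scores"]]
print(scores)              # [1] 21 32 11

# The $ operator is shorthand for access by name.
print(list_data$passed)    # [1] TRUE

# Elements can be replaced in place.
list_data[["colour"]] <- "Green"
print(list_data$colour)    # [1] "Green"
```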
R - Matrices

❖ Matrices are the R objects in which the elements are arranged in a two-dimensional rectangular layout.

Syntax

matrix(data, nrow, ncol, byrow, dimnames)

❖ data is the input vector which becomes the data elements of the matrix.
❖ nrow is the number of rows to be created.
❖ ncol is the number of columns to be created.
❖ byrow is a logical value. If TRUE, the input vector elements are arranged by row.
❖ dimnames is the names assigned to the rows and columns.

Matrix Example
# Elements are arranged sequentially by row.
M <- matrix(c(3:14), nrow = 4, byrow = TRUE)
print(M)

# Elements are arranged sequentially by column.
N <- matrix(c(3:14), nrow = 4, byrow = FALSE)
print(N)

# Define the column and row names.
rownames = c("row1", "row2", "row3", "row4")
colnames = c("col1", "col2", "col3")

P <- matrix(c(3:14), nrow = 4, byrow = TRUE, dimnames = list(rownames, colnames))
print(P)
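The effect of byrow is easiest to see by comparing the first row and first column of two matrices built from the same vector. A small sketch:

```r
# Same data, filled row by row and column by column.
M <- matrix(3:14, nrow = 4, byrow = TRUE)
N <- matrix(3:14, nrow = 4, byrow = FALSE)

print(dim(M))    # [1] 4 3   (ncol is inferred from the data length)

# Filled by row: the first row holds the first three values.
print(M[1, ])    # [1] 3 4 5

# Filled by column: the first column holds the first four values.
print(N[, 1])    # [1] 3 4 5 6
print(N[1, ])    # [1]  3  7 11
```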
Accessing Elements of a Matrix
Elements of a matrix can be accessed by using the row and column index.
# Define the column and row names.
rownames = c("row1", "row2", "row3", "row4")
colnames = c("col1", "col2", "col3")

# Create the matrix.
P <- matrix(c(3:14), nrow = 4, byrow = TRUE, dimnames = list(rownames, colnames))

# Access the element at 3rd column and 1st row.
print(P[1,3])
[1] 5

# Access the element at 2nd column and 4th row.
print(P[4,2])
[1] 13

# Access only the 2nd row.
print(P[2,])
col1 col2 col3
   6    7    8
# Example for Matrix functions
A <- matrix(c(1, 2, 3, 4, 5, 6, 7, 8, 9),3,3)
B <- matrix(c(11, 12, 13, 14, 15, 16, 17, 18, 19),3,3)
C <- matrix(c(21, 22, 23, 24, 25, 26, 27, 28, 29),3,3)
# Check whether the variable A is a matrix or not
is.matrix(A)
# Multiplication by a Scalar
s<-3
s1<-A*s
s1
# Matrix Addition
Matadd<-A+B
Matadd
# Matrix Multiplication (A*B would multiply element by element)
Matmul<-A%*%B
Matmul
# Matrix Transpose
tm<-t(A)
A
tm
# Computing Column & Row Sums
sum(A)
colSums(A)
rowSums(A)
# Computing Column & Row Means
mean(A)
colMeans(A)
rowMeans(A)
# Accessing the matrix element
A
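R has two different multiplication operators for matrices: * works element by element, while %*% computes the matrix product. The sketch below uses small 2 x 2 matrices (illustrative values, not those from the slide) so the results can be checked by hand.

```r
A <- matrix(1:4, nrow = 2)   # rows: (1, 3) and (2, 4)
B <- matrix(5:8, nrow = 2)   # rows: (5, 7) and (6, 8)

# Element-wise product: each entry of A times the matching entry of B.
elementwise <- A * B
print(elementwise)           # rows: (5, 21) and (12, 32)

# Matrix product: rows of A combined with columns of B.
product <- A %*% B
print(product)               # rows: (23, 31) and (34, 46)
```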
R - Data Frames
❑ A data frame is a table, or a two-dimensional array-like structure.
❑ Each column contains values of one variable.
Following are the characteristics of a data frame:

❑ The column names should be non-empty.

❑ The row names should be unique.

❑ The data stored in a data frame can be of numeric, factor or character type.

❑ Each column should contain the same number of data items.
# Create the data frame.
emp.data <- data.frame(
  emp_id = c(1:5),
  emp_name = c("Rick","Dan","Michelle","Ryan","Gary"),
  salary = c(623.3,515.2,611.0,729.0,843.25),
  start_date = as.Date(c("2012-01-01", "2013-09-23", "2014-11-15", "2014-05-11", "2015-03-27")),
  stringsAsFactors = FALSE
)
# Print the data frame.
print(emp.data)
Summary of Data in Data Frame
The statistical summary and nature of the data can be obtained by applying the summary() function.
# Create the data frame.
emp.data <- data.frame(
  emp_id = c(1:5),
  emp_name = c("Rick","Dan","Michelle","Ryan","Gary"),
  salary = c(623.3,515.2,611.0,729.0,843.25),
  start_date = as.Date(c("2012-01-01", "2013-09-23", "2014-11-15", "2014-05-11", "2015-03-27")),
  stringsAsFactors = FALSE
)
# Print the summary.
summary(emp.data)
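Besides summary(), a data frame can be inspected with str() and filtered by column values. The sketch below uses a shortened three-row version of the employee table; the variable names emp and high and the threshold of 600 are illustrative assumptions.

```r
# A smaller version of the employee data frame.
emp <- data.frame(
  emp_id = 1:3,
  emp_name = c("Rick", "Dan", "Michelle"),
  salary = c(623.3, 515.2, 611.0),
  stringsAsFactors = FALSE
)

# Show the structure: one line per column with its type.
str(emp)

# Extract a single column as a vector.
print(emp$salary)

# Keep the rows with salary above 600 and only two of the columns.
high <- emp[emp$salary > 600, c("emp_name", "salary")]
print(high)

# Column-wise statistics work on the extracted vector.
print(mean(emp$salary))
```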

