Lect 1 Data Science
buys a product.
Dr.Rabab Sabry 2
• The term “data science” has become a buzzword and is now used to represent many areas, such as data analytics, data mining, text mining, data visualization, predictive modeling, and so on.
Statistics is important in data science because it can help analysts or data scientists analyze and understand
data. Descriptive statistics assists in summarizing the data, inferential statistics tests the relationship between
two data sets or samples, and regression analysis explores the relationships between multiple variables. Data
visualizations can explore the data with charts, graphs, and dashboards. Regressions and machine learning
algorithms can be used in predictive analytics to train a model and predict a variable.
Machine learning algorithms and regression (statistical learning) algorithms are both used in this way to
predict a variable.
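As a rough illustration of the statistical techniques described above, the following sketch uses mtcars, a small data set that ships with R. The data set is not part of this lecture; it is assumed here only for demonstration, and the variable names (mpg_summary, mpg_test, fit, new_car) are hypothetical. The sketch summarizes a variable, tests the difference between two groups, fits a regression model, and uses that model to predict a value.

```r
# Illustrative sketch only: mtcars is a built-in R data set, assumed here as an example.
data(mtcars)

# Descriptive statistics: summarize fuel consumption (miles per gallon)
mpg_summary <- summary(mtcars$mpg)
print(mpg_summary)

# Inferential statistics: compare mpg between automatic (am = 0) and manual (am = 1) cars
mpg_test <- t.test(mpg ~ am, data = mtcars)
print(mpg_test$p.value)

# Regression analysis: model mpg as a function of weight (wt) and horsepower (hp)
fit <- lm(mpg ~ wt + hp, data = mtcars)
print(coef(fit))

# Predictive analytics: use the fitted model to predict mpg for a hypothetical car
new_car <- data.frame(wt = 3.0, hp = 110)
predicted_mpg <- predict(fit, newdata = new_car)
print(predicted_mpg)
```

The same pattern (summarize, test, fit, predict) applies to other data sets; only the column names in the formulas would change.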
Domain expertise is knowledge of the field the data set comes from. If the data set is business data,
then the domain expertise should be in business.
Adding product design and engineering knowledge takes us into the fields of the Internet of Things (IoT) and smart cities,
because data science and predictive analytics can be applied to sensor data.
What Is R?
• R is a programming language for statistical computing and is supported by the R Foundation for Statistical Computing.
Why R?
When learning data science, many people struggle with choosing which programming languages and data
science software to learn. There are many programming languages available for data science, like R, Python, SAS,
Java, and more. There are also many data science software packages to learn, such as SPSS Statistics, SPSS
Modeler, SAS Enterprise Miner, Tableau, RapidMiner, Weka, GATE, and more.
I recommend learning R for statistics because it was developed for statistics in the first place. Python is a
general-purpose programming language, so you can develop full applications and software with it.
Hence, if you want to develop a data product or data application, Python can be a better choice. R
is very strong in statistics, so it is ideal for data exploration or data understanding using
descriptive statistics, inferential statistics, regression analysis, and data visualizations. R is also ideal for
modeling because you can use statistical learning methods like regression for predictive analytics.
R also has packages for data mining, text mining, and machine learning, such as rattle, caret, and tm.
R can also interface with big data systems like Apache Spark using sparklyr.
R is also heavily used in many of the companies that hire data scientists. Google and
Facebook have data scientists who use R. R is also used in companies like Bank of America,
Ford, Uber, Trulia, and more.
Other attractive features of R are:
• R is free and open source.
• It runs on all major platforms: Windows, macOS, and UNIX/Linux.
• Scripts and data objects can be shared seamlessly across platforms.
• There is a large, growing, and active community of R users and, as a result, there are numerous resources for learning and asking questions.
• It is easy for others to contribute add-ons, which enables developers to share software implementations of new data science methodologies. This gives R users early access to the latest methods and to tools developed for a wide variety of disciplines, including ecology, molecular biology, social sciences, and geography, to name just a few examples.
Installation of R
• You can download the R programming command line application from www.r-project.org
The Integrated Development Environment
An IDE is a software application that helps programmers develop software more easily and more productively. An IDE is
made up of a code editor, compiler, and debugger tools. Code editors usually offer syntax highlighting and intelligent code
completion.
RStudio is the most popular IDE for the R programming language. RStudio helps you write R programming code more
easily and more productively.
To download and install RStudio, visit www.rstudio.com/
Download the latest version. For this book, you will download the 64-bit Windows version. After downloading the RStudio
installer or setup file, double-click the file to install the RStudio IDE.
After clicking OK and choosing the R version, you must restart the RStudio IDE.
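After the restart, one way to confirm which R version the session is actually using is to inspect the built-in R.version information from the console. This is only a small sketch; the variable names r_major and r_minor are hypothetical, and the values printed depend on your installation.

```r
# Print the full version string of the R installation behind this session
print(R.version.string)

# The major and minor version components are stored as character strings
r_major <- R.version$major
r_minor <- R.version$minor
print(paste(r_major, r_minor, sep = "."))
```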
Basic Syntax
The R console offers a fast and easy way to do statistical calculations and some data visualizations. The R
console also works like a calculator, so you can always use it to evaluate math expressions.
To do math calculations, just type an expression at the prompt, like
> 1 + 1
[1] 2
> 1 - 3
[1] -2
> 1 * 5
[1] 5
> 1 / 6
[1] 0.1666667
> tan(2)
[1] -2.18504
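Beyond the basic operators shown above, the console also handles powers, remainders, and built-in math functions. The following sketch stores each result in a variable (the names are hypothetical) so the values can be inspected afterwards.

```r
# A few more calculator-style expressions; variable names are illustrative only.
power_result    <- 2^10          # exponentiation: 1024
remainder       <- 7 %% 3        # remainder of 7 divided by 3: 1
int_quotient    <- 7 %/% 3       # integer division: 2
root            <- sqrt(16)      # square root: 4
grouped         <- (1 + 2) * 3   # parentheses control evaluation order: 9
rounded         <- round(1 / 6, 2)  # round to two decimal places: 0.17

print(c(power_result, remainder, int_quotient, root, grouped, rounded))
```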
Code in R script
To run the R script, highlight the code in the code editor and click Run, as shown in the figure.
Running R Script
To view the results of the R script, look in the R console of RStudio, as shown in the figure.
You can also see that in the Environment tab, there are two variables, as shown in the figure.
Adding Comments to the Code
You can add comments to the code. Comments are text that will not be run by the R console. You add a comment
by putting # in front of the text. Comments let you describe your code so that anyone can read it more easily.
# Create variable A with value 1
A <- 1
# Create variable B with value 2
B <- 2
# Calculate A divided by B
A / B
# Calculate A times B
A * B
# Calculate A plus B
A + B
# Calculate A minus B
A - B
# Calculate A to the power of 2
A^2
# Calculate B to the power of 2
B^2
You can rerun the code and you should get the result shown in the figure.
Variables
Let’s look into the code and scripts you used previously. You actually created two variables, A and B, and assigned
values to them.
A <- 1
B <- 2
In this code, A and B are both variables. The <- operator means assign: A <- 1 assigns the value 1 to variable A, and
B <- 2 assigns the value 2 to variable B. Both 1 and 2 are of numeric type.
If you want to assign text or character values, put the text in quotation marks, like
A <- "Hello World"
print(A)
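The following sketch extends the assignment example above. It shows that a variable can be reassigned and that R is case sensitive, so A and a are different variables (and print must be written in lowercase). The names greeting and a are hypothetical and used only for illustration.

```r
# Assign a numeric value, then reassign using the current value
A <- 1
A <- A + 4          # A now holds 5

# Assign a character value and print it
greeting <- "Hello World"
print(greeting)

# R is case sensitive: 'a' is a different variable from 'A'
a <- 10
total <- A + a      # 5 + 10
print(total)
```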
Data Types
Data types are the kinds of information or data a variable is
holding. A data type can be, for example, numeric or character.
For example,
A <- "abc"
B <- 1.2
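To check which data type a variable holds, you can use the built-in class() function. The sketch below reuses A and B from the example above; C and D are hypothetical additions that show converting text to a number and a third common type, logical.

```r
A <- "abc"
B <- 1.2

# class() reports the data type of the value a variable holds
type_of_A <- class(A)    # "character"
type_of_B <- class(B)    # "numeric"
print(type_of_A)
print(type_of_B)

# Text that looks like a number can be converted with as.numeric()
C <- as.numeric("3.5")
print(class(C))          # "numeric"

# TRUE and FALSE are values of the logical type
D <- TRUE
print(class(D))          # "logical"
```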