
Lect 1 Data Science

Data science involves using statistics, computer science, machine learning, and domain expertise to extract knowledge and insights from data. It usually results in the development of a data product to solve a problem. For example, companies like Amazon and Lazada use data science to develop recommendation systems that identify customer shopping patterns and recommend other products when a purchase is made. The field of data science combines multiple disciplines to gain understanding from data.

Uploaded by

mariamahmoud870

Introduction to Data Science

Dr. Rabab Sabry


What Is Data Science?

Data science is a multidisciplinary field that combines statistics, computer science, machine learning, and domain expertise to get knowledge and insights from data. Data science usually ends up developing a data product. A data product turns a company's data into a product that solves a problem.

For example, a data product can be the product recommendation system used by Amazon and Lazada. These companies have a lot of data based on shoppers' purchases. Using this data, Amazon and Lazada can identify the shopping patterns of shoppers and create a recommendation system, or data product, that recommends other products whenever a shopper buys a product.

• The term "data science" has become a buzzword and is now used to represent many areas, such as data analytics, data mining, text mining, data visualization, and predictive modeling.

Statistics is important in data science because it can help analysts or data scientists analyze and understand
data. Descriptive statistics assists in summarizing the data, inferential statistics tests the relationship between
two data sets or samples, and regression analysis explores the relationships between multiple variables. Data
visualizations can explore the data with charts, graphs, and dashboards. Regressions and machine learning
algorithms can be used in predictive analytics to train a model and predict a variable.
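
The kinds of analysis listed above can be sketched in a few lines of R. This is only an illustration, not part of the original slides: it assumes the built-in mtcars data set that ships with R, and the chosen variables (mpg, am, wt, hp) and the hypothetical car in new_car are picked purely as examples.

```r
# Built-in example data: 32 cars with fuel economy, weight, horsepower, etc.
data(mtcars)

# Descriptive statistics: summarize a single variable
mean_mpg <- mean(mtcars$mpg)
summary(mtcars$mpg)

# Inferential statistics: test whether mean mpg differs between
# automatic (am = 0) and manual (am = 1) cars
t_result <- t.test(mpg ~ am, data = mtcars)
t_result$p.value

# Regression analysis: relate mpg to weight and horsepower
fit <- lm(mpg ~ wt + hp, data = mtcars)
coef(fit)

# Predictive analytics: use the fitted model to predict mpg
# for a hypothetical car (values chosen only for illustration)
new_car <- data.frame(wt = 3.0, hp = 120)
predicted_mpg <- predict(fit, newdata = new_car)
predicted_mpg
```

A chart such as plot(mtcars$wt, mtcars$mpg) would cover the data visualization step in the same session.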

Machine learning algorithms and regression (statistical learning) algorithms are used in this way to predict a variable.

Domain expertise is knowledge of the field the data set comes from. If the data set is business data, then the domain expertise should be in business.

Adding in product design and engineering knowledge takes us into the fields of the Internet of Things (IoT) and smart cities, because data science and predictive analytics can be used on sensor data.

What Is R?
• R is a programming language for statistical computing and is supported by the R Foundation for Statistical Computing.

Why R?
When learning data science, many people struggle with choosing which programming languages and data
science software to learn. There are many programming languages available for data science, like R, Python, SAS,
Java, and more. There are also many data science software packages to learn, such as SPSS Statistics, SPSS
Modeler, SAS Enterprise Miner, Tableau, RapidMiner, Weka, GATE, and more.

I recommend learning R for statistics because it was developed for statistics in the first place. Python is a
general-purpose programming language, so you can develop full applications and software with it.
Hence, if you want to develop a data product or data application, Python can be a better choice. R
is very strong in statistics, so it is ideal for data exploration or data understanding using
descriptive statistics, inferential statistics, regression analysis, and data visualizations. R is also ideal for
modeling because you can use statistical learning methods like regression for predictive analytics. R also has
packages for data mining, text mining, and machine learning, like rattle, caret, and tm. R can also interface
with big data systems like Apache Spark using sparklyr.

R is also heavily used in many of the companies that hire data scientists. Google and
Facebook have data scientists who use R. R is also used in companies like Bank of America,
Ford, Uber, Trulia, and more.

Other attractive features of R are:
• R is free and open source.
• It runs on all major platforms: Windows, macOS, and UNIX/Linux.
• Scripts and data objects can be shared
seamlessly across platforms.
• There is a large, growing, and active
community of R users and, as a result, there are
numerous resources for learning and asking
questions.
• It is easy for others to contribute add-ons
which enables developers to share software
implementations of new data science
methodologies. This gives R users early access to
the latest methods and to tools which are
developed for a wide variety of disciplines,
including ecology, molecular biology, social
sciences, and geography, just to name a few
examples.
Installation of R
• You can download the R command-line application from www.r-project.org

The Integrated Development Environment
An IDE is a software application that helps programmers develop software more easily and more productively. An IDE is
typically made up of a code editor, a compiler or interpreter, and debugging tools. Code editors usually offer syntax highlighting and intelligent code
completion.

RStudio is the most popular IDE for the R programming language. RStudio helps you write R code more
easily and more productively.

To download and install RStudio, visit www.rstudio.com/

Download the latest version. For this book, you will download the 64-bit Windows version. After downloading the RStudio
installer or setup file, double-click the file to install the RStudio IDE.

Before running the script, you need to select which version of the R command-line application to use.
Click Tools ➤ Global Options.

After choosing the R version and clicking OK, you must restart the RStudio IDE.
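
Once RStudio has restarted, one quick way to confirm which R version the session is running is to ask R itself in the console. The snippet below is a small sketch that is not part of the original slides; it relies only on the built-in R.version object, and the output depends on the version installed on your machine.

```r
# Full version string of the R installation used by this session
R.version.string

# Major and minor version numbers, stored as character strings
R.version$major
R.version$minor
```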

Basic Syntax

The R console offers a fast and easy way to do statistical calculations and some data visualizations. The R
console also works like a calculator, so you can always use it to evaluate math expressions.
To do a calculation, just type the expression after the > prompt, like
> 1 + 1
[1] 2
> 1 - 3
[1] -2
> 1 * 5
[1] 5
> 1 / 6
[1] 0.1666667
> tan(2)
[1] -2.18504
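
Beyond the four basic operations, the console supports a few more arithmetic operators and functions that are handy when using R as a calculator. The lines below are a small sketch going slightly beyond the slides; every operator and function shown is part of base R.

```r
2^3            # exponentiation: 2 to the power of 3
7 %/% 2        # integer division: how many whole times 2 fits into 7
7 %% 2         # remainder after dividing 7 by 2
sqrt(16)       # square root
(1 + 2) * 3    # parentheses control the order of evaluation
round(1/6, 2)  # round a result to 2 decimal places
```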

Code in R script

To run the R script, highlight the code in the code editor and click Run, as shown in the figure.

Running R Script

To view the results of the R script, look in the R console of RStudio, as shown in the figure.

You can also see that the Environment tab now lists two variables, as shown in the figure.

Adding Comments to the Code
You can add comments to the code. Comments are text that R will not run. You add a comment
by putting # in front of the text. Comments let you describe your code so that anyone can read it more easily.
# Create variable A with value 1
A <- 1
# Create variable B with value 2
B <- 2
# Calculate A divided by B
A / B
# Calculate A times B
A * B
# Calculate A plus B
A + B
# Calculate A minus B
A - B
# Calculate A to the power of 2
A^2
# Calculate B to the power of 2
B^2
You can rerun the code and you should get the result shown in the figure.
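
Comments do not have to sit on their own line. As a brief sketch (the variable names price, quantity, and total are hypothetical and not from the slides), a comment can also follow code on the same line, and putting # in front of a line of code stops R from running it:

```r
price <- 20      # an inline comment can follow code on the same line
quantity <- 3

# total <- price * quantity * 2   (this line is commented out, so R skips it)
total <- price * quantity

total
```

Commenting a line out like this is a common way to disable code temporarily without deleting it.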

Variables
Let’s look into the code and scripts you used previously. You actually created two variables, A and B, and assigned some
values to the two variables.

A <- 1
B <- 2

In this code, A and B are both variables, and <- is the assignment operator. A <- 1 assigns the value 1 to variable A,
and B <- 2 assigns the value 2 to variable B. Both 1 and 2 are of numeric type.

If you want to assign text or character values, wrap the text in quotation marks, like
A <- "Hello World"
print(A)
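
A few more things about variables are worth trying in the console. The sketch below is illustrative only, and the names greeting, count, and Count are hypothetical. It shows that a variable can be reassigned, that names are case-sensitive (which is also why the function must be written print, not Print), and that ls() lists the variables shown in the Environment tab.

```r
# Assign a character value and print it
greeting <- "Hello World"
print(greeting)

# A variable can be reassigned, including in terms of its old value
count <- 1
count <- count + 1

# Names are case-sensitive: Count is a different variable from count
Count <- 10

# List the variables currently defined in the workspace
ls()

# Check whether a variable with a given name exists
exists("greeting")
```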

Data Types
A data type is the kind of information or data a variable is
holding. A data type can be, for example, numeric or character.
For example,
A <- "abc"
B <- 1.2
Here A holds a character value and B holds a numeric value.
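
You can ask R which type a variable holds. The following sketch reuses A and B from the example and uses only base R functions; the logical variable C and the conversion at the end go slightly beyond what the slides cover and are included only as illustration.

```r
A <- "abc"
B <- 1.2

# class() reports the data type of the value a variable holds
class(A)        # character
class(B)        # numeric

# is.*() functions test for one specific type
is.character(A)
is.numeric(B)

# Logical (TRUE/FALSE) is another basic type (not covered in the slides)
C <- TRUE
class(C)        # logical

# A character value that looks like a number can be converted with as.numeric()
as.numeric("3.5") + 1
```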