1 Python vs R an Introduction

This document compares Python and R, two prominent programming languages in Data Science, focusing on their goals, user communities, ecosystems, applications, learning curves, and execution speeds. It emphasizes the importance of these languages in data analysis and machine learning, highlighting their continuous improvement due to competition. The paper also discusses the fundamentals of computer programming, types of programming languages, and the characteristics of Python and R.


Optimall series on Python vs. R for Data Scientists (English)


Vol. 001, No. 001

Python vs. R for Data Scientists: An introduction


Gradi L. Kamingu∗
June 2, 2023

Abstract
This paper compares the most commonly used programming languages in Data Sci-
ence, including Python and R, explaining the comparison criteria such as their goals,
user communities, ecosystems, applications, learning curves, execution speeds and their
importance in the development of web or machine learning applications. It begins by
presenting the notions of computer programming, in particular programming languages,
then presents these languages while introducing their development environments.

Keywords: Python language, R language, Data Science, Multiparadigm programming.


∗ Researcher at the Optimization and Machine Learning Laboratory (OPTIMALL), DR Congo.
E-mail: g.kamingu@unikin.ac.cd.
1 Introduction
Data plays an important role in the operation of businesses, and understanding this data is essential
for decision-making at all levels. Data Science is used to take advantage of the intelligence hidden in
this mass of data. Data Science can be described as a science, a research paradigm, a research
method, a discipline, a workflow or even a profession (Mike and Hazzan (2023)), which uses statistics,
scientific computing, scientific methods, processes, algorithms and systems to extract or extrapolate
knowledge and insights from data, whether structured or unstructured (Dhar (2013)). It is a concept
that unifies statistics, data analysis, computer science and their associated methods to understand
and analyze real phenomena with data (Hayashi (1998)). Thus, a Data Scientist needs to be both a
good software developer and a good data analyst (Jakobowicz (2018)).

A Data Scientist’s toolkit therefore contains infrastructure tools that collect, store, and prepare data,
regardless of its format (structured, semi-structured, or unstructured) or source, followed by analysis
and visualization tools that make the data intelligible and actionable. Some of these tools are analysis
software, while others are programming languages, the most widely used of which are Python and R.
Both are essential languages for a Data Scientist. Moreover, the competition between the two
languages leads to constant improvement of their data processing features (Jakobowicz (2018)).

The remainder of the paper is structured as follows. Section 2 presents the concept of computer
programming, Section 3 presents the Python language, and Section 4 presents the R language. Finally,
Section 5 discusses the elements of comparison between these two languages before concluding.

2 Notions of computer programming


2.1 Computer program
For the instructions to be understood by the computer, they must be implemented in a programming
language. This produces a computer program or simply program. The set of activities that allow the
writing of computer programs is called computer programming or more simply programming. The
component of the computer that processes and executes program instructions is the processor. The
instructions composing a program in a readable form as written in a programming language are called
source code.

Definition 2.1 (Programming language). A programming language is a system of symbols for writing
computer programs.

A programming language includes the alphabet and syntactic rules, to which we associate semantic
rules (Gabbrielli and Martini (2010)).

The alphabet is a set of symbols called letters or lexemes. These can be the letters A to Z (and a
to z), the digits 0 to 9, and symbols of mathematical or logical operations. For most programming
languages, this alphabet is based on common standards such as ASCII or Unicode.

From the letters, we can build the vocabulary representing all the instructions. By abuse of
language, these are sometimes called keywords.

Syntactic rules define how lexemes combine correctly to form statements or more complex terms of
the programming language (Louden and Lambert (2004)).

Semantic rules define the meaning of each of the combined elements that can be constructed in
the programming language.
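
As a minimal sketch of these three levels (alphabet and lexemes, syntactic rules, semantic rules), the following Python fragment inspects a one-line program with the standard `tokenize` and `ast` modules. The sample statement `x = 1 + 2` and the variable names are illustrative assumptions, not taken from the paper.

```python
import ast
import io
import tokenize

source = "x = 1 + 2\n"  # illustrative one-line program

# Lexical level: split the source text into lexemes (tokens).
KEPT = {tokenize.NAME, tokenize.NUMBER, tokenize.OP}
lexemes = [
    tok.string
    for tok in tokenize.generate_tokens(io.StringIO(source).readline)
    if tok.type in KEPT
]
print(lexemes)

# Syntactic level: the parser checks that the lexemes form a valid statement.
tree = ast.parse(source)
statement_kind = type(tree.body[0]).__name__
print(statement_kind)

# Semantic level: executing the statement gives it a meaning (x is bound to 3).
namespace = {}
exec(source, namespace)
print(namespace["x"])
```

The same text is thus seen successively as a sequence of symbols, as a well-formed statement, and as an action on the state of the program.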

2.2 Common programming errors


Programming errors can be of two types, namely: syntax errors and semantic errors.

2.2.1 Syntax errors


A syntax error occurs when one or more instructions do not follow the grammar rules of a programming
language. By analogy with natural languages, it’s like saying "I is in the hospital" instead of saying
"I’m in the hospital". As long as syntax errors are detected during translation, the program cannot be
translated into machine language; operation therefore stops (a "crash") and an error message is
displayed.

It should be noted that the statements and symbols used have no meaning in themselves. They
must be combined in the correct way. Therefore, you have to be very careful to scrupulously respect
the syntax of the language. Moreover, in some programming languages, lowercase letters are not the
same as uppercase letters; such languages are said to be case sensitive.

Programming languages like C, C++, C#, Java, Python, PHP, etc. are case sensitive, while
programming languages like Fortran, BASIC, Pascal, Ada, Common Lisp, HTML are case insensitive.
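
The following short Python sketch illustrates both points: a statement that violates the grammar is rejected at translation time, and names that differ only by case are distinct. The helper `has_valid_syntax` and the sample strings are hypothetical names introduced for this illustration.

```python
def has_valid_syntax(source):
    """Return True if Python can translate `source`, False on a syntax error."""
    try:
        compile(source, "<example>", "exec")
        return True
    except SyntaxError:
        return False


good = "print('I am in the hospital')"
bad = "print('I am in the hospital'"  # missing closing parenthesis

print(has_valid_syntax(good))
print(has_valid_syntax(bad))

# Case sensitivity: `total` and `Total` are two different names.
total = 3
Total = 5
print(total == Total)


def call_with_wrong_case():
    """Calling `Print` instead of `print` is valid syntax but fails at run time."""
    try:
        Print("hello")  # deliberate misspelling of the built-in print
    except NameError:
        return "NameError"
    return "ok"


print(call_with_wrong_case())
```

Note that the wrongly capitalized call is not a syntax error in Python: the statement is well formed, and the problem only appears when the unknown name is looked up during execution.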

2.2.2 Semantic errors


Semantic errors occur when the logic of the program is wrong. Generally, the program runs without
any problem: there is no error message, but we do not get what we expected.

In reality, the program does exactly what it is told to do. The problem is that what it was told
to do is not what we want it to do, which makes these kinds of mistakes very dangerous. By analogy
with natural languages, it is like saying “I have a weapon” while meaning “I have a tree”. Note
that the sentence is syntactically correct, but does not convey the speaker’s idea.
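
A minimal sketch of a semantic error in Python is given below, assuming the intention is to compute the mean of two numbers. Both functions (hypothetical names) are syntactically correct and run without any error message, but only the second one does what was intended.

```python
def average_buggy(a, b):
    # Intended: the mean of a and b. Division binds tighter than addition,
    # so this actually computes a + (b / 2).
    return a + b / 2


def average_fixed(a, b):
    # Parentheses make the intended order of operations explicit.
    return (a + b) / 2


print(average_buggy(4, 6))
print(average_fixed(4, 6))
```

Because no error is reported, such mistakes are usually found only by comparing results against values known in advance.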

2.3 Classification of programming languages


Several classification criteria can be used to classify programming languages, namely:

• Classification according to the level of abstraction with respect to the processor’s instruction
set;

• Classification according to intended area of use;

• Classification according to the programming paradigm.

2.3.1 Classification according to the level of abstraction with respect to the processor’s instruction set
Depending on the level of abstraction with respect to the processor’s instruction set, we distinguish
between low-level programming languages and high-level programming languages.

A. Low-level programming languages
Definition 2.2. A low-level programming language is a programming language that provides little or
no abstraction from the machine’s processor instruction set.

Thus, its commands or functions are structurally similar to processor instructions. Low-level lan-
guages are sometimes described as being "close to hardware". Programs written in low-level languages
tend to be relatively non-portable, as they allow programs to be written with consideration for the
particular characteristics of the computer that is supposed to run the program.

These are languages for managing registers, memory addresses and machine instructions.

Thus, there are two types of low-level languages: machine language and assembly language.

A.1. Machine language


Definition 2.3. Machine language, also called machine code, is a low-level programming language
consisting of a series of bits ("0" and "1") that can be interpreted directly by the computer processor.

It is the native language of the processor. Each instruction causes the processor to perform a very
specific task, such as a load, store, jump, or arithmetic logic unit (ALU) operation on one or more
units of data in the processor’s registers or memory.

A.2. Assembly language


Definition 2.4. Assembly language, or assembler language, is a low-level programming language
that represents the combinations of machine language bits by mnemonic codes that are easy to
remember and human readable. This language is composed of instructions and data to be processed in
binary form.

B. High-level programming languages


Definition 2.5. A high-level programming language is a programming language with strong abstraction
from the machine’s processor instruction set.

It is therefore a programming language that allows programs to be written using common words
from natural languages (very often English) and common mathematical symbols. Moreover, a high-level
programming language is generally machine-independent: the same program can be used as is on several
types of computers, although programs can also be designed for a particular operating system.

Plankalkül (German pronunciation: ["pla:nkalky:l]) is a high-level programming language designed
for a computer by the civil engineer Konrad Zuse between 1942 and 1945. However, it was not implemented
at that time, and its original contributions were largely isolated from other developments due to World
War II. The first widely used high-level language was Fortran (designed in 1954), which is therefore
often considered the first high-level programming language; it was later followed by Lisp (in 1958),
Algol (in 1958), and Cobol (in 1959).

2.3.2 Classification according to intended area of use


Depending on the intended domain of use, a distinction is made between domain-specific programming
languages and general-purpose programming languages.

A. Domain-specific programming languages
Definition 2.6. A domain-specific programming language is a programming language specialized in a
particular application domain.

Example 2.1. Languages like HTML, Cobol, R, and MATLAB are domain-specific programming
languages. In fact,

1. HTML (from "HyperText Markup Language") is a programming language intended to represent
web pages;

2. Cobol (from “COmmon Business Oriented Language”) is a common language for programming
business applications. Today, it is mainly used in the banking, insurance and major
administration sectors;

3. R is a programming language for statistical applications and Data Science;

4. MATLAB (from "MATrix LABoratory") is a programming language intended to allow matrix
manipulations, the plotting of functions and data, the implementation of algorithms, and the
creation of user interfaces.

B. General-Purpose Programming Languages


Definition 2.7. A general-purpose programming language is a programming language for creating
software in a wide variety of application domains.

Example 2.2. Languages like C, C++, C#, BASIC, Kotlin, Delphi, Java, Python, etc. are general-
purpose programming languages.

2.3.3 Classification according to programming paradigm


Taking into account the programming paradigm or approach, there are several types of programming
languages, grouped into two main families, namely imperative programming languages and declarative
programming languages.

A. Imperative Programming Languages


Definition 2.8. An imperative programming language is a programming language that describes
operations in sequences of instructions executed by the computer to modify the state of the program.

These types of programming languages are subdivided into three major subgroups:

• The procedural programming languages, which are imperative programming languages that
are based on the concepts of procedure (i.e. a subroutine that does not return any value) and
function (i.e. a subroutine returning one and only one value).

Example 2.3. C, Fortran, BASIC, Ada, Cobol, PL/SQL are procedural programming languages.

• The structured programming languages, which are imperative programming languages that
aim to improve the clarity, quality, and development time of a computer program by making
extensive use of control structures and subroutines while eliminating "goto" instructions or
at least limiting their use to unusual and serious cases. Structured programming is possible in any
procedural programming language, but some, like Fortran IV, did not lend themselves very well
to it. Among the most structuring programming languages, we find PL/I, Pascal and, later, for
very large projects, Ada.

• The object-oriented programming languages, which are languages based on the definition
and assembly of software components called objects. Objects that share the same structure and
behavior are described by a class.

Example 2.4. C#, C++, Objective-C, Java, VB.NET and Eiffel are object-oriented programming
languages.

B. Declarative programming languages


Definition 2.9. A declarative programming language is a programming language that consists of declar-
ing the data of the problem and asking the program to solve it.

These types of programming languages are subdivided into four major subgroups:

• The functional programming languages, which are declarative programming languages where
actions are based on mathematical functions.

Example 2.5. Lisp, Scheme, Scala, Erlang, Anubis and PureScript are functional programming
languages.

• The logic programming languages, which are declarative programming languages consisting of
expressing problems and actions in the form of predicates (using first-order logic).

Example 2.6. Prolog, CLIPS, Gödel and Fril are logic programming languages.

• The functional logic programming languages, which are declarative programming languages that
combine functional programming and logic programming paradigms.

Example 2.7. Curry, Mercury and ALF are functional logic programming languages.

• The descriptive programming languages, which are declarative programming languages for de-
scribing data structures. These types of language are specialized in the enrichment of textual
information. They use tags, which are syntactic units used to delimit a sequence of characters
or mark a specific position within a stream of characters.

Example 2.8. LaTeX, HTML and XML are descriptive programming languages.

Currently, there are programming languages that combine several programming paradigms. These
languages are called multi-paradigm programming languages.

Example 2.9. Python (structured, functional and object-oriented programming), R (structured,
procedural, functional and object-oriented programming), Ciao (logic, functional, procedural and
object-oriented programming) and Oz (object-oriented, structured, procedural, logic, concurrent and
functional programming) are multi-paradigm programming languages.
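
To make the notion of a multi-paradigm language concrete, here is a small sketch in Python that solves the same task (the sum of the squares of a list of numbers) in three styles. The task, the function names and the class name are assumptions chosen for illustration only.

```python
from functools import reduce

data = [1, 2, 3]


# Procedural / structured style: an explicit loop that updates a variable.
def sum_of_squares_procedural(values):
    total = 0
    for v in values:
        total += v * v
    return total


# Functional style: composition of functions, no variable is reassigned.
def sum_of_squares_functional(values):
    return reduce(lambda acc, v: acc + v, map(lambda v: v * v, values), 0)


# Object-oriented style: data and behavior are grouped in a class.
class SquareSummer:
    def __init__(self, values):
        self.values = list(values)

    def total(self):
        return sum(v * v for v in self.values)


print(sum_of_squares_procedural(data))
print(sum_of_squares_functional(data))
print(SquareSummer(data).total())
```

All three versions return the same value; the paradigm only changes how the computation is expressed.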

2.4 Translators
A program written in a programming language not native to the processor can be executed on several
machines with small modifications. Before execution, however, this program must first be translated
into a series of instructions written in machine language; the special program that performs this
translation is called a translator.

Among the translators, we distinguish assemblers, compilers and interpreters.

2.4.1 Assembler
Definition 2.10. An assembler is a program that converts assembly language mnemonic codes into a
binary program.

Here are some examples of assemblers:

Example 2.10. Merlin, MAC/65, Lisa, ORCA/M, RMAC, VASM, A86/A386, IBM ALP.
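
The principle of an assembler can be sketched in a few lines of Python. The instruction set below (four mnemonics, 4-bit opcodes, 4-bit operands) is entirely hypothetical and does not correspond to any real processor or to any of the assemblers listed above; it only shows how mnemonic codes are mapped to bit patterns.

```python
# Hypothetical 4-bit opcodes for a toy machine (not a real instruction set).
OPCODES = {"LOAD": "0001", "ADD": "0010", "STORE": "0011", "HALT": "1111"}


def assemble(lines):
    """Translate 'MNEMONIC [operand]' lines into 8-bit machine words."""
    words = []
    for line in lines:
        parts = line.split()
        mnemonic = parts[0].upper()
        operand = int(parts[1]) if len(parts) > 1 else 0
        if mnemonic not in OPCODES:
            raise ValueError(f"unknown mnemonic: {mnemonic}")
        if not 0 <= operand <= 15:
            raise ValueError(f"operand out of range: {operand}")
        # Opcode bits followed by the operand written on 4 bits.
        words.append(OPCODES[mnemonic] + format(operand, "04b"))
    return words


program = ["LOAD 2", "ADD 3", "STORE 4", "HALT"]
machine_code = assemble(program)
for text, word in zip(program, machine_code):
    print(f"{text:<8} -> {word}")
```

Real assemblers also handle labels, addressing modes and data directives, which this sketch deliberately leaves out.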

2.4.2 Interpreter
Definition 2.11. An interpreter is a program that translates and executes a program line by line,
as soon as each instruction is entered.

A programming language is said to be interpreted when its analysis and translation operations are
performed by an interpreter.

Example 2.11. BASIC, MATLAB, Perl, Prolog and PHP are interpreted programming languages.

2.4.3 Compiler
Definition 2.12. A compiler is a program that transforms an entire program written in a high-level
language (source code) into a binary program.

A programming language is said to be compiled when its analysis and translation operations are
performed by a compiler.

Example 2.12. LaTeX, C#, C, C++, Fortran, Cobol are compiled programming languages.
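
The practical difference between the two translation strategies can be illustrated with a simplified Python sketch. It is not a description of how real interpreters or compilers are built: the built-in functions `compile` and `exec` are only used here to mimic "translate one line at a time" versus "translate the whole program first". The three-line program, whose last line contains a deliberate syntax error, and the function names are assumptions.

```python
source_lines = ["x = 2", "y = x * 10", "z = = y"]  # last line is invalid


def run_line_by_line(lines):
    """Interpreter-like behavior: translate and run each line in turn."""
    namespace = {}
    executed = 0
    for line in lines:
        try:
            exec(line, namespace)
        except SyntaxError:
            break  # stop at the first line that cannot be translated
        executed += 1
    return executed, namespace


def run_whole_program(lines):
    """Compiler-like behavior: translate everything before running anything."""
    namespace = {}
    try:
        code = compile("\n".join(lines), "<program>", "exec")
    except SyntaxError:
        return 0, namespace  # nothing runs if translation fails
    exec(code, namespace)
    return len(lines), namespace


count_interp, ns_interp = run_line_by_line(source_lines)
count_comp, ns_comp = run_whole_program(source_lines)
print(count_interp, ns_interp.get("y"))
print(count_comp, ns_comp.get("y"))
```

With line-by-line translation, the first two instructions have already taken effect when the error is met; with whole-program translation, the error is reported before any instruction is executed.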

Now that we have set the context regarding the programming languages, we can move on to the
presentation of our two programming languages.

3 Presentation of the Python language


Python is a portable, free, high-level programming language with strong dynamic typing, automatic
memory management, and an exception handling system. Python has been developed since 1989 by
Guido van Rossum and many volunteer contributors (Swinnen (2012)). Its first version was published
on February 20, 1991 and, at the time of writing, its latest version was released on April 5, 2023.

Python is an interpreted programming language, that is to say that the instructions sent to it are
"translated" into machine language as they are read.

3.1 Python language characteristics


Among the characteristics that have appealed to Python users, we can cite the following:

• Python is free, that is, it can be used without restriction in commercial projects.

• Python is multi-paradigm, because it supports object-oriented, structured imperative, procedural
and functional programming (it has list comprehensions, dictionaries and sets).

• Python is multi-platform, as it is available on platforms ranging from smartphones to mainframes.
Moreover, it is available on several operating systems such as Unix variants (GNU/Linux
distributions such as Debian, Android, NetBSD, FreeBSD, etc.), but also on several proprietary
operating systems such as iOS, MacOS, BeOS, NeXTStep, MS-DOS and different versions of Windows.

• Python is (optionally) multi-threaded.

• Python is a dynamically typed language: Python programmers do not need to declare variable
types when writing code, because Python determines them at runtime.

• Python is orthogonal (a small number of concepts is enough to generate very rich constructions),
reflective (it supports metaprogramming, for example the ability for an object to add or remove
its own attributes or methods, or even to change class during execution) and introspective (a
large number of development tools, such as the debugger or the profiler, are implemented in
Python itself).

• Python is extensible, that is, it allows easy interfacing with existing C libraries. Also, it can
interact with other data processing software, database management systems and geographic
information systems.

• Python is continuously evolving, as it is supported by a community of enthusiastic and responsible
users, most of whom are supporters of free software. Alongside the main interpreter, written
in C and maintained by the creator of the language, a second interpreter, written in Java, is
under development.
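
Several of the characteristics listed above can be observed directly in a few lines of Python. The sketch below illustrates dynamic typing, the built-in comprehensions, and reflection; the class names `Point` and `Point3D` and all variable names are hypothetical and serve only as an illustration.

```python
# Dynamic typing: the same name can refer to values of different types,
# and the type is determined at runtime.
value = 42
type_before = type(value).__name__
value = "forty-two"
type_after = type(value).__name__

# List, dictionary and set comprehensions.
squares = [n * n for n in range(5)]
lengths = {word: len(word) for word in ("python", "r")}
remainders = {n % 3 for n in range(10)}


# Reflection: attributes, methods and even the class can change at runtime.
class Point:
    def __init__(self, x, y):
        self.x = x
        self.y = y


class Point3D(Point):
    pass


p = Point(1, -2)
p.label = "sample"  # add an attribute to one object
Point.norm1 = lambda self: abs(self.x) + abs(self.y)  # add a method to the class
p.__class__ = Point3D  # change the class of an existing object

print(type_before, type_after)
print(squares, lengths, sorted(remainders))
print(p.label, p.norm1(), type(p).__name__)
```

Changing the class of a live object is shown here only because the text mentions it; in everyday code it is rarely needed.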

3.2 Getting started using the Python language


Several solutions exist to write programs in Python. The interpreter can be launched directly from
the command line (in a Linux "shell", or in a DOS window under Windows). Alternatively, one can use
Windows Terminal, or a specialized working environment such as an integrated development
environment (abbreviated as IDE) (Swinnen (2012)).

Here are some IDEs used to program in Python: Visual Studio Code, Atom, Vim, PyDev, Jupyter
Notebook, JupyterLab, Spyder, PyCharm, etc.

The last four of these IDEs are accessible from the Anaconda distribution, downloadable from the
link https://www.anaconda.com/download.

Anaconda is a free distribution of the Python and R programming languages for scientific applications.
This distribution gives access to several other applications, such as IDEs and libraries.
By default, the following applications are available:

1. JupyterLab;

2. Jupyter Notebook;

3. Spyder;

4. Glueviz;

5. Orange;

6. RStudio;

7. Visual Studio Code;

8. etc.

After installing Anaconda, just open it to get the Anaconda Navigator interface.

Figure 1: Anaconda Logo.

Unless otherwise specified, in this paper and throughout this series, we will use the Jupyter
Notebook IDE. Jupyter Notebook (formerly known as the IPython Notebook) is an interactive IDE in
which one can combine code execution, rich text, multimedia, mathematical equations and graphics.
Using this IDE, the programmer can create and share documents. It is supported by several
operating systems, including Linux, Windows and MacOS.

Here are the steps to run a Python script on Jupyter Notebook:

1. Once the Anaconda Navigator interface is launched, you must find the Jupyter Notebook ap-
plication. To do this, click on the “Launch” icon (See Figure 2) in the Jupyter Notebook area.

Figure 2: Presentation of the Anaconda Navigator.

2. Then, create a new Notebook (file) by clicking on "New" then "Python 3" (See Figure 3).

3. A newly created Notebook file is presented, by default, with "Code" cells allowing you to write
lines of code in Python language and then execute them. The word “In [ ]” is written in the
left margin in this type of cell (See Figure 4).

4. To use your Notebook file, you must execute the Notebook cells in order, one after the other.
To do this, select the cell, a frame around the cell appears. Click on the “Run” command or
use the keyboard shortcut “Shift + Enter”.

It is very important to execute code cells in program order. The numbers in square brackets may
not be consecutive, especially if the same cell is executed several times in a row, but these numbers
must be increasing as one progresses through the Notebook file.

Figure 3: Opening Python 3.

Figure 4: Presentation of the code editor.

Once executed, the result of the execution (“Out”) is displayed in the cell under the code (this
can be an error message if the code is wrong). Nothing may be displayed if the code does not require it.

It is possible to run the same cell several times in a row when you want to modify and test the
code it contains. In this case, it is not necessary to rerun the previous cells, nor to delete the previous
output of the cell concerned, because it will be replaced automatically upon the new execution of the
cell.

If we have several cells, we run them one by one or all at the same time by clicking on “Cell” then
“Run All” (See Figure 5).

Figure 5: Running multiple cells at once.

Finally, it is possible to restart the Notebook file from the first cell after one or more executions.

To do this, simply click on the command "interrupt the kernel", then on "restart the kernel (with
dialog)" or "restart the kernel, then re-run the whole notebook (with dialog)".

Note that "interrupt the kernel" cancels the execution of all cells in the Notebook file (even if the
numbers in square brackets do not clear). It is therefore necessary to restart the execution of the cells
from the beginning of the file.
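
Why the execution order of cells matters can be illustrated outside Jupyter with a simplified simulation. In the sketch below, a plain Python dictionary stands in for the kernel's memory and each string stands in for a code cell; this is an assumption made for illustration and not the way Jupyter is actually implemented. The cell contents and the helper names are hypothetical.

```python
cells = [
    "prices = [10, 20, 30]",        # cell 0: defines the data
    "total = sum(prices)",          # cell 1: needs `prices`
    "report = f'total = {total}'",  # cell 2: needs `total`
]


def run_cells(cell_sources, order):
    """Run the given cells in the given order inside one shared namespace."""
    kernel = {}  # a fresh dictionary plays the role of a restarted kernel
    for index in order:
        exec(cell_sources[index], kernel)
    return kernel


in_order = run_cells(cells, [0, 1, 2])
print(in_order["report"])


def run_out_of_order():
    """Running cell 1 before cell 0 fails: `prices` does not exist yet."""
    try:
        run_cells(cells, [1, 0, 2])
    except NameError as error:
        return type(error).__name__
    return "ok"


print(run_out_of_order())
```

Each call to `run_cells` starts from an empty namespace, which mirrors what happens after restarting the kernel: every definition made by earlier executions is lost and the cells must be run again from the first one.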

4 Presentation of the R language


The R language is an implementation of the S programming language, with the addition of
Scheme-inspired lexical scoping and garbage collection. The S language itself was developed by
John Chambers, a Harvard-trained statistician, with the help of his colleagues at Bell
Laboratories in the years 1975-1976.

However, the R project originated in 1993 (Ihaka and Gentleman (1996)) as a research project of
Ross Ihaka and Robert Gentleman at the University of Auckland (New Zealand) (Tippmann (2015)).

R version 1.0.0, the first official version of the R language, was released on February 29, 2000.

4.1 R language characteristics


R shares almost all the characteristics of Python, i.e.:
• R is free, that is, it can be used without restriction in commercial projects.
• R is multi-paradigm, because it supports object-oriented, structured imperative, procedural
and functional programming (it has first-class functions, vectorized operations, set operations, etc.).
• R is multi-platform, as it is available on platforms like smartphones and mainframes. Moreover,
it is available on several operating systems such as GNU/Linux, Windows, NetBSD, FreeBSD
and MacOS.
• R includes a large number of statistical calculation and graphical representation functionalities
as standard.
• R is a dynamically typed language because, like Python, it does not require you to declare
variable types when writing code.
• R is reflective and introspective.
• R is extensible. It can therefore interact with database management systems (such as
PostgreSQL, via the PL/R language, and MySQL), with geographic information systems (such as
GRASS), and with tools that allow results to be exported to LaTeX or OpenDocument.
• Thanks to its panoply of packages, the R language is now a powerful tool to help in several types
of applications, including Multivariate Statistics (Data Analysis), Econometrics, Biometrics,
Epidemiology, Modeling, Image Processing, Graph Analysis, Numerical Analysis, Data Mining,
Big Data, and so many others. In addition, it allows high quality graphics.
• R is also continuously evolving.

4.2 Getting started using the R language


The R interpreter, also called R base, is above all a command line application (Goulet (2016),
Goulet (2023)). R base can be downloaded from the link https://cran.r-project.org/bin/. At this link,
there are folders each containing a version of R base for Linux, Windows and MacOS(X) systems.

Once R is installed on the computer, all you need is to launch the corresponding executable to start
the program. The command prompt (by default the symbol ">") then appears, indicating that R is ready
to execute commands (Paradis (2005)), as in Figure 6. Under Windows, the graphical user
interface supplied with R base is quite rudimentary: it merely facilitates certain operations such as
the installation of external packages, and offers no additional features for editing R code. The
graphical user interface of R under MacOS is the most elaborate (Goulet (2023)). It is therefore
advisable to use other graphical user interfaces on top of R base:
• RGUI, the graphical user interface installed by default on Windows. This interface is widely
used by Windows users.
• JGR (read "Jaguar", short for Java GUI for R) is an environment for using R on Java.
• Rattle (short for R analytical tool to learn easily) is a popular graphical user interface for Data
Mining using R. It presents statistical and visual summaries of data, transforms data so that it
can be readily modeled, builds unsupervised and supervised models from the data, presents model
performance graphically, and evaluates new datasets.

Figure 6: The presentation of the R base.

• Rcmdr (short for R commander) is an R GUI that is implemented in the form of an R
package, the Rcmdr package, available for free on CRAN (the R package archive).

• RKWard is a graphical user interface originally running under KDE. It currently works on
Linux, Windows and Mac. RKWard allows you to edit tables, import R objects or lists in csv
format, then perform statistical analyses. These analyses are plug-ins, which include a graphical
user interface with options to choose with the mouse, but also a command line allowing advanced
users to enter exactly the analyses desired.

• Sciviews R GUI is a graphical user interface providing a series of open source software to
complement R, for statistical computing in a reproducible workflow.

• RStudio is a graphical user interface that makes R easier to use. It includes a code editor,
debugging and visualization tools. This interface also has the advantage of being multi-platform,
since it works on Windows, OS X and GNU/Linux.

Note that the Anaconda distribution also gives access to RStudio, but that version is often not up
to date. Another solution is to use RStudio Cloud, the online version of RStudio, accessible
from the link https://login.rstudio.cloud/login?redirect=%2F, where you must log in (or
register if you do not yet have an account).

The interface of RStudio (including its online version) is shown in Figure 7.

RStudio (Desktop version) can be downloaded from the link
https://posit.co/download/rstudio-desktop/. This page also offers the possibility to install
R base, which is a prerequisite for the installation of RStudio. The RStudio Cloud version does not
require the installation of R base.

Figure 7: Presentation of RStudio.

The left panel is the console (1). Here the code to send to R can be entered directly. It is through
this console that the code is sent for execution.

The upper right part (2) is the workspace and we see objects (such as datasets and variables) there.

The lower right part (3) serves several purposes. This is where the various charts appear, where
one performs file management (including importing files from one's computer), where the packages
are installed and where the help information is displayed. Tabs can be used to switch between these
screens as needed.

The code can be executed directly from the console (1), but it is always advisable to use an R Script,
for the following reasons. First, an R Script allows entire R programs to be written and run from a
single file. This can be very useful for automating repetitive tasks or for creating complex data
analyses. In addition, R scripts can take command line arguments, which can be very useful for
automating tasks or for running analyses on large data sets. Finally, R scripts can also produce
outputs in different formats, such as HTML, PDF or CSV.

So, here are the steps to run an R script on RStudio:

1. Open RStudio.

2. Open a new script by going to the “File” menu, then clicking on “New File”, and finally on “R
Script” (See Figure 8).
The same result is obtained by pressing the "+" symbol (just above the File menu), then
clicking on "R Script" (See Figure 9).

3. Make sure that the R console is open in the lower right window of the RStudio interface, then
select the code you want to run by clicking on it with your mouse (See the red frame in
Figure 10).

Figure 8: Launch R Script without keyboard shortcut.

Figure 9: Launch R Script without keyboard shortcut.

4. Click on the "Run" button or use the shortcut "Ctrl+Enter" on Windows or "Cmd+Enter" on
Mac to run the selected code.
The results of the execution will be displayed in the RStudio console (See the blue frame in
Figure 10).

Figure 10: Running R code in RStudio.

To run the entire script at once, click on the "Source" button or use the keyboard shortcut
"Ctrl+Shift+S" on Windows or "Cmd+Shift+S" on Mac.

5 Comparison between Python language and R language


In this section, we compare the Python and R programming languages topic by topic. Tables 1, 2
and 3 contain the various elements of comparison, most of which are taken from Grogan (2018).

Table 1: Comparison between Python and R.

Criterion: Goal
  Python: Python is a general-purpose language, intended primarily for software development and deployment.
  R: R is a language for Statistics and Data Analysis.

Criterion: Users
  Python: Python is used more by data scientists who have a background in application development.
  R: R is used more by data scientists who have a background in Mathematics (and Statistics).

Criterion: Community
  Python: The python.org community (specifically https://www.python.org). In addition, the newsletter Python Weekly provides important information (news, articles, new releases, jobs and more).
  R: The R-bloggers community, accessible at https://www.r-bloggers.com/, which has more than 750 contributors who regularly provide useful information on the R language.

Criterion: Learning Curve
  Python: Python is considered one of the closest programming languages to English, thanks to its intuitive and easy-to-read syntax. It is therefore considered a good language for beginner programmers. Its learning curve is linear and smooth.
  R: With R, beginners can perform Data Analysis tasks in minutes, but the complexity of advanced features makes it more difficult to develop expertise. Its learning curve is not straightforward, although its syntax is quite intuitive for those familiar with Statistics.

Criterion: Speed
  Python: Python has much more flexibility in interacting with other programming languages and is generally faster than R.
  R: R is relatively slower than Python or other programming languages, especially if the code is poorly written. However, there are workarounds for this, such as the alternative R implementations FastR, pqR, and Renjin.

Criterion: Machine Learning
  Python: Python has advanced Machine Learning libraries, such as Scikit-learn and TensorFlow.
  R: R also has advanced Machine Learning libraries, such as caret and mlr.
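To make the "Learning Curve" criterion concrete, here is a minimal sketch of a small descriptive analysis in Python using only the standard library. The data values and the function name `summarize` are hypothetical and serve only to show how closely the syntax reads to plain English.

```python
from statistics import mean, median, stdev

# Hypothetical sample: heights in centimetres (illustrative values only).
heights_cm = [162.0, 175.5, 168.2, 181.0, 170.3]


def summarize(values):
    """Return a few descriptive statistics for a list of numbers."""
    return {
        "n": len(values),
        "mean": mean(values),
        "median": median(values),
        "stdev": stdev(values),  # sample standard deviation
    }


summary = summarize(heights_cm)
for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```

In R, comparable summaries are available through built-in functions such as `mean()`, `median()` and `sd()`.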

Table 2: Comparison between Python and R. (continued 1)

Criterion: Ecosystems
  Python: Python has a robust and extensive package ecosystem. Most Python packages (over 300,000) are hosted in the Python Package Index (PyPI). Here are some Python packages:
    • NumPy, which provides a large collection of functions for scientific computing.
    • Pandas, which is great for data manipulation.
    • Matplotlib, the standard library for data visualization. Others include Seaborn, Plotly, Pygal and Bokeh.
    • Scikit-learn, a Python library that provides many Machine Learning algorithms.
    The packages that Python has for data visualization make it possible to create professional-quality graphs, although these are less advanced than those of R.
  R: R has a robust and extensive package ecosystem. R packages (almost 19,000) are normally stored in the Comprehensive R Archive Network (CRAN). Here are some R packages:
    • matrixStats, which provides highly optimized functions for calculating common summaries over rows and columns of matrices.
    • tidyverse, which provides tools for manipulating data.
    • ggplot2, which is the perfect library for visualizing data. Plotly in R as well as caret, igraph and highcharter can also be used.
    • caret, one of the most important libraries for Machine Learning in R.
    Indeed, R has more advanced and specific data visualization features than Python, which allows it to create attractive graphs with notations and formulas. R's graphics are considered more advanced than Python's and are often compared to those of statistical software.
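As a small illustration of the Python package ecosystem described in Table 2, the sketch below checks which of the packages named there are installed in the current environment. It relies only on the standard library module `importlib.metadata`; the list `PACKAGES` and the function name `installed_versions` are illustrative assumptions, and the names in the list are the PyPI distribution names.

```python
from importlib.metadata import PackageNotFoundError, version

# PyPI distribution names of the Python packages mentioned in Table 2.
PACKAGES = ["numpy", "pandas", "matplotlib", "scikit-learn"]


def installed_versions(names):
    """Map each package name to its installed version, or None if it is absent."""
    found = {}
    for name in names:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


report = installed_versions(PACKAGES)
for name, installed in report.items():
    print(f"{name}: {installed or 'not installed'}")
```

A missing package would typically be installed from PyPI with `pip install <name>`, in the same way that `install.packages("<name>")` fetches an R package from CRAN.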

Table 3: Comparison between Python and R. (continued 2)

Criterion: Web Applications
  Python: Python is often considered more suitable than R for building web applications. Python has many popular web frameworks, such as Django and Flask, which make building web applications easier.
  R: R, meanwhile, can also be used to build web applications using packages such as Shiny.

Criterion: Applications
  Python:
    • Mozilla uses Python programming to explore its large code base. Mozilla releases several open source packages built using Python.
    • Dropbox, which now has nearly 150 million registered users, is written entirely in Python.
    • Walt Disney uses Python to strengthen its creative processes.
    • Some other outstanding products written in Python are Cocos2d, Mercurial, BitTorrent and Reddit.
  R:
    • Ford uses open source tools like R programming and Hadoop for data-driven decision support and statistical analysis.
    • The insurance giant Lloyd's uses the R language to create animated diagrams that provide analytical reports to investors.
    • Google uses R programming to analyze the effectiveness of online advertising campaigns, predict economic activities, and measure the return on investment of advertising campaigns.
    • Facebook uses the R language to parse status updates and create the social network graph.
    • Zillow uses R programming to estimate housing prices.
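The "Web Applications" criterion can be sketched without installing any framework. The example below is not Flask or Django code: it is a minimal application written against WSGI, the standard Python interface that both frameworks support, using only the standard library module `wsgiref`. The names `app` and `call_app` and the response texts are illustrative assumptions.

```python
from wsgiref.util import setup_testing_defaults


def app(environ, start_response):
    """A tiny WSGI application: one known page, everything else is a 404."""
    path = environ.get("PATH_INFO", "/")
    if path == "/":
        status, body = "200 OK", "Hello from a Python web application"
    else:
        status, body = "404 Not Found", "Page not found"
    payload = body.encode("utf-8")
    headers = [
        ("Content-Type", "text/plain; charset=utf-8"),
        ("Content-Length", str(len(payload))),
    ]
    start_response(status, headers)
    return [payload]


def call_app(path):
    """Call the application directly, without a network, and return (status, text)."""
    environ = {}
    setup_testing_defaults(environ)  # fills in the required WSGI keys
    environ["PATH_INFO"] = path
    captured = {}

    def start_response(status, headers):
        captured["status"] = status

    body = b"".join(app(environ, start_response))
    return captured["status"], body.decode("utf-8")


print(call_app("/"))
```

To serve the application locally, one could pass `app` to `wsgiref.simple_server.make_server("", 8000, app)` and call `serve_forever()` on the result. Frameworks such as Flask and Django add routing, templates and database access on top of this interface.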

6 Conclusion
The choice of one programming language over the other is motivated by several considerations. First,
it may be a matter of personal preference, or of which language the learner finds easier to master
from the start. Mathematicians and statisticians generally tend to turn to R, while computer scientists
and software engineers prefer to use Python. Those who have no prior notions of coding are advised
to start with Python, or to learn both languages simultaneously while taking Python as a reference;
this is the approach we will use throughout this series of papers.
Second, the choice may be motivated by the tasks to be performed. If the work is mostly about
crunching numbers, visualizing data, and doing ad hoc statistical analysis, R might be a good choice.
Indeed, R's ecosystem can be far superior to Python's when it comes to advanced statistical
techniques. However, if the mission is to collect data from websites, files, or other data sources,
Python is by far the better choice.

In addition, packages such as RPy2 and rPython allow the R and Python programming languages to
be integrated. RPy2 is a Python package that allows you to use R from Python, that is, to call R
functions from Python; it is often used for data analysis and visualization, as R is a popular
programming language for these tasks. rPython, on the other hand, is an R package that integrates
Python into R, that is, it allows calling Python functions from R; it is often used for optimization
and simulation, as Python is a popular programming language for these tasks. We will discuss these
packages in a later paper.
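As a minimal sketch of calling R from Python, the example below uses the RPy2 package (imported as `rpy2`) to compute a mean with R's own `mean` function. It assumes that both R and `rpy2` are installed; when they are not, the hypothetical helper `r_mean` returns `None` instead of failing. The helper name and the sample values are illustrative only.

```python
def r_mean(values):
    """Compute the mean of `values` with R's mean(), or return None without rpy2."""
    try:
        import rpy2.robjects as ro  # requires a working R installation
    except Exception:
        # rpy2 (or R itself) is not available in this environment.
        return None
    r_vector = ro.FloatVector(values)   # convert the Python list to an R numeric vector
    r_mean_function = ro.r["mean"]      # look up the R function named "mean"
    return float(r_mean_function(r_vector)[0])


result = r_mean([1.0, 2.0, 3.0])
if result is None:
    print("rpy2 is not available in this environment")
else:
    print(f"mean computed by R: {result}")
```

The opposite direction, calling Python from R, is done on the R side and is therefore not shown here.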

In summary, it is recommended to master both programming languages, since they are complementary
and each of them can be used to solve most Data Science problems using the available packages.
By mastering both, a Data Scientist can draw on the strengths of each to solve specific problems.
However, success in this area also depends on the methodology, the skills of the Data Scientist and
the available resources, which are important factors regardless of the choice of programming language.

Additionally, many companies use a combination of R and Python for their Data Analytics and
Machine Learning projects. A Data Scientist who masters both languages is therefore more versatile
and more useful to these companies.

References
Dhar, V. (2013). Data science and prediction. Communications of the Association for Computing
Machinery, 56(12):64–73.

Gabbrielli, M. and Martini, S. (2010). Programming Languages: Principles and Paradigms. Springer,
Berlin.

Goulet, V. (2016). Introduction à la programmation R. CRAN.

Goulet, V. (2023). Programmer avec R. CRAN.

Grogan, M. (2018). Python vs. R for Data Science. O’Reilly, Boston, 1st edition.

Hayashi, C. (1998). What is Data Science? Fundamental Concepts and a Heuristic Example. In
Hayashi, C., Yajima, K., Bock, H.-H., Ohsumi, N., Tanaka, Y., and Baba, Y., editors, Data Science,
Classification, and Related Methods, pages 40–51. Springer Japan.

Ihaka, R. and Gentleman, R. (1996). R: A language for data analysis and graphics. Journal of
Computational and Graphical Statistics, 5(3):299–314.

Jakobowicz, E. (2018). Python pour le Data Scientist. Des bases du langage au machine learning.
Dunod, Paris.

Louden, K. C. and Lambert, K. A. (2004). Programming Languages: Principles and Practices. Cengage
Learning, Farmington Hills.

Mike, K. and Hazzan, O. (2023). What is Data Science? Communications of the Association for
Computing Machinery, 66(2):12–13.

Paradis, E. (2005). R pour les débutants. CRAN.

Swinnen, G. (2012). Apprendre à programmer avec Python 3. Eyrolles, Paris.

Tippmann, S. (2015). Programming tools: Adventures with R. Nature, 517:109–110.
