1 Python vs R an Introduction
Abstract
This paper compares the two most commonly used programming languages in Data Science,
Python and R, according to criteria such as their goals, user communities, ecosystems,
applications, learning curves, execution speeds and their importance in the development of
web or machine learning applications. It begins by presenting the notions of computer
programming, in particular programming languages, then presents the two languages while
introducing their development environments.
∗
Researcher at the Optimization and Machine Learning Laboratory (OPTIMALL), DR Congo.
E-mail: g.kamingu@unikin.ac.cd.
1 Introduction
Data plays an important role in the operation of businesses, and understanding this data is important
for decision-making at all levels. To take advantage of the intelligence hidden in this mass of data,
Data Science is used. Data Science can be described as a science, a research paradigm, a research
method, a discipline, a workflow or even a profession (Mike and Hazzan (2023)), which uses statistics,
scientific computing, scientific methods, processes, algorithms and systems to extract or extrapolate
knowledge and insights from data, whether structured or unstructured (Dhar (2013)). It is a concept
that unifies statistics, data analysis, computer science and their associated methods to understand
and analyze real phenomena with data (Hayashi (1998)). Thus, a Data Scientist needs to be both a
good software developer and a good data analyst (Jakobowicz (2018)).
A Data Scientist's toolkit therefore contains infrastructure tools that collect, store, and prepare data,
regardless of its format (structured, semi-structured, or unstructured) or source. Then there are
analysis and visualization tools that make the data intelligible and actionable. Among these tools, some
are software packages for analysis, while others are programming languages, the most used of which are
Python and R. Both are essential languages for a Data Scientist. Moreover, the competition
between the two languages leads to a constant improvement of their functionalities for data processing
(Jakobowicz (2018)).
The remainder of the paper is structured as follows. Section 2 presents the concept of computer
programming, Section 3 the Python language, and Section 4 the R language. Finally, Section 5
discusses the elements of comparison of these two languages before we conclude.
Definition 2.1 (Programming language). A programming language is a system of symbols for writing
computer programs.
A programming language includes the alphabet and syntactic rules, to which we associate semantic
rules (Gabbrielli and Martini (2010)).
The alphabet is a set of symbols called letters or lexemes. These letters can be letters from A to
Z (from a to z), digits (from 0 to 9), symbols of mathematical or logical operations. This alphabet is
based on common standards like ASCII or Unicode for most programming languages.
From the letters, we can build the vocabulary representing all the instructions. Loosely speaking,
these are sometimes called keywords.
Syntactic rules define how lexemes combine correctly to form statements or what we will call more
complex programming language terms (Louden and Lambert (2004)).
Semantic rules define the meaning of each of the combined elements that can be constructed in
the programming language.
It should be noted that the statements and symbols used have no meaning in themselves. They
must be combined in the correct way. Therefore, the syntax of the language must be respected
scrupulously. In addition, in some programming languages lowercase letters are not the
same as uppercase letters; such languages are said to be case sensitive.
Programming languages like C, C++, C#, Java, Python, PHP, etc. are case sensitive, while
languages like Fortran, BASIC, Pascal, Ada, Common Lisp and HTML are case insensitive.
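As a small illustration of case sensitivity, the following Python sketch defines two variables whose names differ only in letter case. The variable names are chosen for this example only and do not come from the paper.

```python
# Python treats identifiers that differ only in letter case as distinct names.
total = 10
Total = 20

print(total, Total)  # two independent variables

# A spelling that was never defined raises NameError,
# even though "total" and "Total" exist.
try:
    print(TOTAL)
except NameError as error:
    message = str(error)
    print("NameError:", message)
```

In a case-insensitive language, the three spellings would refer to the same variable.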
In reality, the program does exactly what it is told to do. The problem is that what it was told
to do is not what we want it to do, which makes these kinds of mistakes very dangerous. By analogy
with natural languages, it is like saying “I have a weapon” while thinking that we have a “tree”. Note
that the sentence is syntactically correct, but does not convey the speaker’s idea.
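The kind of mistake described above can be sketched in Python. Both functions below are syntactically valid, but only the second one computes what the programmer intended (the mean of two numbers). The function names are hypothetical and serve only this illustration.

```python
def mean_wrong(a, b):
    # Valid syntax, wrong meaning: division binds tighter than addition,
    # so this computes a + (b / 2) rather than the mean.
    return a + b / 2


def mean_right(a, b):
    # Parentheses make the intended order of operations explicit.
    return (a + b) / 2


print(mean_wrong(4, 6))  # 7.0
print(mean_right(4, 6))  # 5.0
```

The interpreter reports no error for `mean_wrong`; the program simply does exactly what it was told to do.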
• Classification according to the level of abstraction with respect to the processor’s instruction
set;
2.3.1 Classification according to the level of abstraction with respect to the processor’s
instruction set
Depending on the level of abstraction with respect to the processor’s instruction set, we distinguish
between low-level programming languages and high-level programming languages.
A. Low-level programming languages
Definition 2.2. A low-level programming language is a programming language that provides little or
no abstraction from the machine’s processor instruction set.
Thus, its commands or functions are structurally similar to processor instructions. Low-level lan-
guages are sometimes described as being "close to hardware". Programs written in low-level languages
tend to be relatively non-portable, as they allow programs to be written with consideration for the
particular characteristics of the computer that is supposed to run the program.
These are languages for managing registers, memory addresses, machine instructions.
Thus, there are two types of low-level languages: machine language and assembly language.
It is the native language of the processor. Each instruction causes the processor to perform a very
specific task, such as a load, store, jump, or arithmetic logic unit (ALU) operation on one or more
units of data in the processor’s registers or memory.
It is therefore a programming language that allows programs to be written using common words
from natural languages (very often English) and common mathematical symbols. Moreover, a pro-
gramming language is generally machine-independent: the same program can be used as is on several
types of computers – although programs can also be designed for a particular operating system.
A. Domain-specific programming languages
Definition 2.6. A domain-specific programming language is a programming language specialized in a
particular application domain.
Example 2.1. Languages like HTML, Cobol, R, and MATLAB are domain-specific programming lan-
guages. In fact,
2. Cobol (from “COmmon Business Oriented Language”) is a common language for programming
business applications. Today, it is mainly used in the banking, insurance and major administra-
tion sectors;
Example 2.2. Languages like C, C++, C#, BASIC, Kotlin, Delphi, Java, Python, etc. are general-
purpose programming languages.
These types of programming languages are subdivided into three major subgroups:
• The procedural programming languages, which are imperative programming languages that
are based on the concepts of procedure (i.e. a subroutine that does not return any value) and
function (i.e. a subroutine returning one and only one value).
Example 2.3. C, Fortran, BASIC, Ada, Cobol, PL/SQL are procedural programming languages.
• The structured programming languages, which are imperative programming languages that
aim to improve the clarity, quality, and development time of a computer program by making
extensive use of control structures and subroutines while eliminating the "goto" instruction or
at least limiting its use to unusual and serious cases. Structured programming is possible in any
procedural programming language, but some, like Fortran IV, did not lend themselves very well
to it. Among the languages that most encourage structured programming, we find PL/I, Pascal
and, later for very large projects, Ada.
• The object-oriented programming languages, which are languages consisting of the definition
and assembly of software components called objects. A federation of objects forms a class.
Example 2.4. C#, C++, Objective-C, Java, VB.NET and Eiffel are object-oriented programming
languages.
These types of programming languages are subdivided into four major subgroups:
• The functional programming languages, which are declarative programming languages where
actions are based on mathematical functions.
Example 2.5. Lisp, Scheme, Scala, Erlang, Anubis and PureScript are functional programming
languages.
• The logic programming languages, which are declarative programming languages consisting of
expressing problems and actions in the form of predicates (using first-order logic).
Example 2.6. Prolog, CLIPS, Gödel and Fril are logic programming languages.
• The functional logic programming languages, which are declarative programming languages that
combine functional programming and logic programming paradigms.
Example 2.7. Curry, Mercury and ALF are functional logic programming languages.
• The descriptive programming languages, which are declarative programming languages for de-
scribing data structures. These types of language are specialized in the enrichment of textual
information. They use tags, which are syntactic units used to delimit a sequence of characters
or mark a specific position within a stream of characters.
Example 2.8. LaTeX, HTML and XML are descriptive programming languages.
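To make the role of tags concrete, here is a minimal sketch that uses Python's standard `html.parser` module to list the tags delimiting a short HTML fragment. The class name `TagCollector` and the sample fragment are assumptions made for this illustration.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Record opening tags, closing tags and the text they delimit."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        # Ignore whitespace-only text between tags.
        if data.strip():
            self.events.append(("text", data.strip()))


collector = TagCollector()
collector.feed("<p>Data <b>Science</b></p>")
collector.close()

for event in collector.events:
    print(event)
```

Each pair of start and end events marks the beginning and end of a sequence of characters, which is the delimiting role of tags described above.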
Currently, there are programming languages that combine several programming paradigms. These
languages are called multi-paradigm programming languages.
Example 2.9. Python (structured, functional and object-oriented programming), R (structured,
procedural, functional and object-oriented programming), Ciao (logic, functional, procedural and
object-oriented programming) and Oz (object-oriented, structured, procedural, logic, concurrent
and functional programming) are multi-paradigm programming languages.
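As an illustration of what multi-paradigm means in practice, the sketch below computes the same quantity (a sum of squares) in Python in three styles: structured, functional and object-oriented. All names are hypothetical and chosen for this example.

```python
from functools import reduce

values = [1, 2, 3, 4]


# Structured style: an explicit loop updating a local variable.
def sum_of_squares_loop(numbers):
    total = 0
    for n in numbers:
        total += n * n
    return total


# Functional style: the result is built by folding a function over the data.
def sum_of_squares_functional(numbers):
    return reduce(lambda acc, n: acc + n * n, numbers, 0)


# Object-oriented style: state and behaviour are bundled in an object.
class SquareAccumulator:
    def __init__(self):
        self.total = 0

    def add(self, n):
        self.total += n * n
        return self


accumulator = SquareAccumulator()
for value in values:
    accumulator.add(value)

print(sum_of_squares_loop(values))        # 30
print(sum_of_squares_functional(values))  # 30
print(accumulator.total)                  # 30
```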
2.4 Translators
A program written in a programming language not native to the processor can be executed on several
machines with small modifications. To be executed, such a program is first translated into a series of
instructions written in machine language; the special program that performs this
translation is called a translator.
2.4.1 Assembler
Definition 2.10. An assembler is a program that converts assembly language mnemonic codes into a
binary program.
Example 2.10. Merlin, MAC/65, Lisa, ORCA/M, RMAC, VASM, A86/A386, IBM ALP.
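The idea behind an assembler can be sketched with a toy example. The instruction set below is entirely hypothetical (an 8-bit word made of a 4-bit opcode and a 4-bit operand) and does not correspond to any real processor or to any of the assemblers listed above.

```python
# Hypothetical opcode table: mnemonic -> 4-bit operation code.
OPCODES = {"LOAD": 0b0001, "ADD": 0b0010, "STORE": 0b0011, "JUMP": 0b0100}


def assemble(source):
    """Translate mnemonic lines such as 'ADD 3' into 8-bit binary strings."""
    machine_code = []
    for line in source.strip().splitlines():
        line = line.split(";")[0].strip()  # drop comments after ';'
        if not line:
            continue
        mnemonic, operand = line.split()
        # High 4 bits hold the opcode, low 4 bits hold the operand.
        word = (OPCODES[mnemonic.upper()] << 4) | (int(operand) & 0xF)
        machine_code.append(format(word, "08b"))
    return machine_code


program = """
LOAD 5    ; load the value at address 5
ADD 3     ; add the value at address 3
STORE 9   ; store the result at address 9
"""

print(assemble(program))  # ['00010101', '00100011', '00111001']
```

A real assembler also handles labels, several operand formats and an object file layout, which this sketch leaves out.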
2.4.2 Interpreter
Definition 2.11. An interpreter is a program that translates and executes a program line by line,
as soon as each line is entered.
A programming language is said to be interpreted when its analysis and translation operations are
performed by an interpreter.
Example 2.11. BASIC, MATLAB, Perl, Prolog and PHP are interpreted programming languages.
2.4.3 Compiler
Definition 2.12. A compiler is a program that transforms an entire program written in a high-level
language (source code) into a binary program.
A programming language is said to be compiled when its analysis and translation operations are
performed by a compiler.
Example 2.12. LaTeX, C#, C, C++, Fortran, Cobol are compiled programming languages.
Now that we have set the context regarding the programming languages, we can move on to the
presentation of our two programming languages.
Python is an interpreted programming language, that is to say that the instructions sent to it are
"translated" into machine language as they are read.
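As a hedged illustration of this translation step, and assuming the reference implementation CPython (an implementation detail the paper does not discuss), the sketch below uses the built-in `compile` and `exec` functions and the standard `dis` module. In CPython the source is first translated into bytecode, which a virtual machine then executes.

```python
import dis

source = "result = 2 + 3"

# Translation step: source text -> code object (bytecode in CPython).
code_object = compile(source, "<example>", "exec")

# Execution step: run the translated code in a fresh namespace.
namespace = {}
exec(code_object, namespace)
print(namespace["result"])  # 5

# Display the translated instructions; the exact listing varies
# between Python versions.
dis.dis(code_object)
```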
• Python is free, that is, it can be used without restriction in commercial projects.
• Python is (optionally) multi-threaded.
• Python is a dynamically typed language: programmers do not need to declare variable
types when writing code, because Python determines them at runtime.
• Python is orthogonal (a small number of concepts is enough to generate very rich constructions),
reflective (it supports metaprogramming, for example the ability for an object to add or
remove its own attributes or methods, or even to change class during execution) and introspective
(a large number of development tools, such as the debugger or the profiler, are implemented in
Python itself).
• Python is extensible, that is, it allows easy interfacing with existing C libraries. Also, it can
interact with other data processing software, database management systems and geographic
information systems.
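Two of the properties listed above, dynamic typing and reflection, can be sketched in a few lines. The class and attribute names (`Sensor`, `CalibratedSensor`, `unit`) are hypothetical and serve only this illustration.

```python
# Dynamic typing: the same name can be bound to values of different types,
# and the type is determined at runtime.
value = 42
first_type = type(value).__name__
value = "forty-two"
second_type = type(value).__name__
print(first_type, second_type)  # int str


class Sensor:
    def __init__(self, name):
        self.name = name


class CalibratedSensor(Sensor):
    def describe(self):
        return f"{self.name} (calibrated)"


sensor = Sensor("thermometer")

# Reflection: attributes can be added and removed while the program runs.
sensor.unit = "celsius"
had_unit = hasattr(sensor, "unit")
del sensor.unit
print(had_unit, hasattr(sensor, "unit"))  # True False

# An object can even change class during execution.
sensor.__class__ = CalibratedSensor
print(sensor.describe())  # thermometer (calibrated)

# Introspection: inspect the attributes an object currently holds.
print(sorted(vars(sensor)))  # ['name']
```

Changing `__class__` is shown only because the text mentions it; it is rarely advisable in everyday code.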
Here are some IDEs used to program in Python: Visual Studio Code, Atom, Vim, PyDev, Jupyter
Notebook, JupyterLab, Spyder, PyCharm, etc.
The last four IDEs are accessible from the Anaconda distribution, downloadable from the
link https://www.anaconda.com/download.
Anaconda is a free distribution of the Python and R programming languages for scientific applications.
This distribution gives access to several other applications, which can be IDEs, libraries,
etc. By default, we have the following applications:
1. JupyterLab;
2. Jupyter Notebook;
3. Spyder;
4. Glueviz;
5. Orange;
6. RStudio;
8. etc.
After installing Anaconda, just open it to get the Anaconda Navigator interface.
Figure 1: Anaconda Logo.
Unless otherwise specified, in this paper and throughout this series, we will use the Jupyter
Notebook IDE. Jupyter Notebook (formerly known as the IPython Notebook) is an interactive IDE in
which one can combine code execution, rich text, multimedia, mathematical equations and
graphics. Using this IDE, the programmer can create and share documents. It is supported by several
operating systems, including Linux, Windows and macOS.
1. Once the Anaconda Navigator interface is launched, you must find the Jupyter Notebook ap-
plication. To do this, click on the “Launch” icon (See Figure 2) in the Jupyter Notebook area.
2. Then, create a new Notebook (file) by clicking on "New" then "Python 3" (See Figure 3).
3. A newly created Notebook file is presented, by default, with "Code" cells allowing you to write
lines of code in Python language and then execute them. The word “In [ ]” is written in the
left margin in this type of cell (See Figure 4).
4. To use your Notebook file, you must execute the Notebook cells in order, one after the other.
To do this, select the cell (a frame appears around it), then click on the “Run” command or
use the keyboard shortcut “Shift + Enter”.
It is very important to execute code cells in program order. The numbers in square brackets may
not be consecutive, especially if the same cell is executed several times in a row, but
these numbers must be increasing as one progresses through the Notebook file.
Figure 3: Opening Python 3.
Once executed, the result of the execution (“Out”) is displayed in the cell under the code (this
can be an error message if the code is wrong). Nothing is displayed if the code does not produce output.
It is possible to run the same cell several times in a row when you want to modify and test the
code it contains. In this case, it is not necessary to rerun the previous cells, nor to delete the previous
output of the cell concerned, because it is replaced automatically when the cell is executed
again.
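Why execution order matters can be sketched outside of Jupyter. The snippet below is a simplified stand-in, not Jupyter's actual implementation: each "cell" is a string of Python code and the "kernel" is a shared dictionary that keeps the variables between executions. The cell contents are hypothetical.

```python
# Three hypothetical notebook cells, keyed by their position in the file.
cells = {
    1: "data = [3, 1, 2]",
    2: "data = sorted(data)",
    3: "summary = (data[0], data[-1])",
}

# The shared namespace plays the role of the kernel's memory.
kernel = {}


def run(cell_number, namespace):
    exec(cells[cell_number], namespace)


# Running the cells in order works: each cell sees the previous results.
for number in (1, 2, 3):
    run(number, kernel)
print(kernel["summary"])  # (1, 3)

# After a restart the memory is empty, so running cell 3 first fails.
fresh_kernel = {}
out_of_order_failed = False
try:
    run(3, fresh_kernel)
except NameError as error:
    out_of_order_failed = True
    print("NameError:", error)
```

The second part mirrors what happens after restarting the kernel: earlier cells must be executed again before later ones can use their variables.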
If we have several cells, we can run them one by one or all at once by clicking on “Cell” then
“Run All” (See Figure 5).
Finally, it is possible to restart the Notebook file from the first cell after one or more executions.
To do this, simply click on the commands "interrupt the kernel" then "restart the kernel (with
dialog)" or "restart the kernel, then re-run the whole notebook (with dialog)".
Note that "interrupt the kernel" cancels the execution of all cells in the Notebook file (even if the
numbers in square brackets do not clear). It is therefore necessary to restart the execution of the cells
from the beginning of the file.
However, the R project originated in 1993 (Ihaka and Gentleman (1996)) as a research project of
Ross Ihaka and Robert Gentleman at the University of Auckland (New Zealand) (Tippmann (2015)).
R version 1.0.0, the first official version of the R language, was released on February 29, 2000.
Once R is installed on the computer, all you need is the corresponding executable to start the
program. The command prompt (by default the symbol ">") then appears, indicating that R is ready
to execute commands (Paradis (2005)), as shown in Figure 6. Under Windows, the graphical user
interface supplied with base R is quite rudimentary: it facilitates certain operations, such as the
installation of external packages, but offers no additional features for editing R code. The
graphical user interface of R under macOS is the most elaborate (Goulet (2023)). It is therefore
advisable to use other graphical user interfaces on top of base R.
• RGUI, the graphical user interface installed by default on Windows. This interface is widely
used by Windows users.
• JGR (pronounced "Jaguar", short for Java GUI for R) is an environment for using R on Java.
• Rattle (short for R Analytical Tool To Learn Easily) is a popular graphical user interface for Data
Mining using R. It presents statistical and visual summaries of data, transforms data so that it can
be easily modeled, builds unsupervised and supervised models from the data, presents model
performance graphically, and evaluates new datasets.
Figure 6: The presentation of the R base.
• Rcmdr (short for R Commander) is an R GUI implemented in the form of an R
package, the Rcmdr package, available for free on CRAN (the R package archive).
• RKWard is a graphical user interface originally running under KDE. It currently works on
Linux, Windows and Mac. RKWard allows you to edit tables, import R objects or lists in CSV
format, then perform statistical analyses. These analyses are plug-ins, which include a graphical
user interface with options to choose with the mouse, but also a command line allowing advanced
users to enter exactly the analyses desired.
• SciViews R GUI, a graphical user interface providing a series of open source software to
complement R, for statistical computing in a reproducible workflow.
• RStudio is a graphical user interface that makes R easier to use. It includes a code editor and
debugging and visualization tools. This interface also has the advantage of being multi-platform,
since it works on Windows, OS X and GNU/Linux.
Note that the Anaconda distribution also gives access to RStudio, but that version is often not up to
date. Another solution is to use RStudio Cloud, the online version of RStudio,
accessible from the link https://login.rstudio.cloud/login?redirect=%2F, where you must log in (or
register if no account has been created yet).
The interface of RStudio (including its online version) is shown in Figure 7.
Figure 7: Presentation of RStudio.
The left panel is the console (1). Here the code to send to R can be entered directly. It is through
this console that code is sent for execution.
The upper right part (2) is the workspace, where objects (such as datasets and variables) are listed.
The lower right part (3) serves several purposes. This is where the various charts appear, where
one performs file management (including importing files from one's computer),
where the packages are installed and where the help information is displayed. Tabs can be used to
switch between these screens as needed.
The code can be executed directly from the console (1), but it is always advisable to use R Script
for the following reasons. First, R Script allows running entire R scripts, which means that writing and
running entire R programs can be done in a single file. This can be very useful for automating
repetitive tasks or for creating complex data analyses. In addition, R Script allows specifying command
line arguments, which can be very useful for automating tasks or for running analyses on large data
sets. Finally, R Script also allows producing script outputs in different formats, such as HTML, PDF
or CSV.
1. Open RStudio.
2. Open a new script by going to the “File” menu, then clicking on “New File” and finally on “R
Script” (See Figure 8).
The same result is obtained by clicking the "+" symbol in the toolbar near the File menu, then
clicking on "R Script" (See Figure 9).
3. Make sure that the R console is open in the lower right window of the RStudio interface, then
select the code you want to run by clicking on it with your mouse (See the red frame in
Figure 10).
Figure 8: Launch R Script without keyboard shortcut.
4. Click on the "Run" button or use the shortcut "Ctrl+Enter" on Windows or "Cmd+Enter" on
Mac to run the selected code.
The results of the execution will be displayed in the RStudio console (See the blue frame in
Figure 10).
To run the entire script at once, click on the "Run" button or use the keyboard shortcut "Ctrl+Shift+S"
on Windows or “Cmd+Shift+S” on Mac.
Table 1: Comparison between Python and R.
Criterion: Goal
  Python: Python is a general-purpose language, intended primarily for software development and
  deployment.
  R: R is a language for Statistics and Data Analysis.
Criterion: Users
  Python: Python is more used by data scientists who have a background in application development.
  R: R is more used by data scientists who have a background in Mathematics (and Statistics).
Criterion: Community
  Python: The python.org community (specifically https://www.python.org). In addition, the
  newsletter Python Weekly provides important information (news, articles, new releases, jobs and
  more).
  R: The R-bloggers community, accessible at the link https://www.r-bloggers.com/, which has more
  than 750 contributors who regularly provide useful information on the R language.
Criterion: Learning Curve
  Python: Python is considered one of the closest programming languages to English, thanks to its
  intuitive and easy to read syntax. It is therefore considered a good language for beginner
  programmers. Its learning curve is linear and fluid.
  R: With R, beginners can perform Data Analysis tasks in minutes, but the complexity of advanced
  features makes it more difficult to develop expertise. Its learning curve is not straightforward,
  although its syntax is quite intuitive for those familiar with Statistics.
Criterion: Speed
  Python: Python has much more flexibility in interacting with other programming languages and is
  faster compared to R.
  R: R is relatively slower than Python or other programming languages, especially if the code is
  poorly written. However, there are workarounds for this, such as the alternative R
  implementations FastR, pqR, and Renjin.
Criterion: Machine Learning
  Python: Python has more advanced Machine Learning libraries, such as Scikit-learn and TensorFlow.
  R: R also has advanced Machine Learning libraries, such as Caret and MLR.
Table 2: Comparison between Python and R. (continued 1)
Criterion: Ecosystems
  Python: Python has robust and extensive package ecosystems. Most packages in Python (over
  300,000) are hosted in the Python Package Index (PyPI). Here are some Python packages:
  R: R has robust and extensive package ecosystems. R packages (almost 19,000) are normally stored
  in the Comprehensive R Archive Network (CRAN). Here are some R packages:
Table 3: Comparison between Python and R. (continued 2)
Criterion: Web Applications
  Python: Python is often considered more suitable than R. Python has many popular web frameworks
  such as Django and Flask, which make building web applications easier.
  R: R, meanwhile, can also be used to build web applications using packages such as Shiny.
Criterion: Applications
  Python:
  • Mozilla uses Python programming to explore its large code base. Mozilla releases several open
  source packages built using Python.
  • Dropbox, which now has nearly 150 million registered users, is written entirely in Python.
  R:
  • Ford uses open source tools like R programming and Hadoop for data-driven decision support and
  statistical analysis.
  • Famous insurance giant Lloyd’s uses the R language to create animated diagrams that provide
  analytical reports to investors.
6 Conclusion
The choice of one programming language over the other is motivated by several reasons. First, it may
be a personal preference or what the learner would find easier to master from the start.
Mathematicians and statisticians generally tend to turn to R, while computer scientists and software
engineers prefer to use Python. Those who have no prior notions of coding are advised
to start with the Python language, or to learn both simultaneously
while taking Python as a reference; this is the approach we will use throughout this series of papers.
Second, the choice may be motivated by the tasks to be performed. If the business is mostly about
crunching numbers, visualizing data, and doing ad-hoc statistical analysis, R might be a good choice.
Indeed, R’s ecosystem can be far superior to Python’s when it comes to advanced statistical tech-
niques. However, if the mission is to collect data from websites, files, or other data sources, Python is
by far the best choice.
In addition, there are packages such as RPy2 and rPython which allow integration of the R and
Python programming languages. We will discuss these packages in a later paper. RPy2 is
a Python package that allows you to use R from Python, that is, to call R functions from
Python; it is often used for data analysis and visualization, as R is a popular programming language for
these tasks. rPython, on the other hand, is an R package that allows integrating Python
into R, i.e. calling Python functions from R; it is often used for optimization and simulation, as Python is
a popular programming language for these tasks.
In summary, it is recommended to master both programming languages, since they are complemen-
tary and each of them can be used to solve most Data Science problems using the available packages.
However, success in this area also depends on the methodology, the skills of the Data Scientist and
the available resources, which are important factors regardless of the choice of programming language.
Moreover, by mastering both languages, a Data Scientist can draw on the strengths of each
to solve specific problems.
Additionally, many companies use a combination of R and Python for their Data Analytics and
Machine Learning projects. By mastering both languages, a data scientist can be more versatile and
useful to these companies.
References
Dhar, V. (2013). Data science and prediction. Communications of the Association for Computing
Machinery, 56(12):64–73.
Gabbrielli, M. and Martini, S. (2010). Programming Languages: Principles and Paradigms. Springer,
Berlin.
Grogan, M. (2018). Python vs. R for Data Science. O’Reilly, Boston, 1st edition.
Hayashi, C. (1998). What is Data Science? Fundamental Concepts and a Heuristic Example. In
Hayashi, C., Yajima, K., Bock, H.-H., Ohsumi, N., Tanaka, Y., and Baba, Y., editors, Data Science,
Classification, and Related Methods, pages 40–51. Springer Japan.
Ihaka, R. and Gentleman, R. (1996). R: A language for data analysis and graphics. Journal of
Computational and Graphical Statistics, 5(3):299–314.
Jakobowicz, E. (2018). Python pour le Data Scientist. Des bases du langage au machine learning.
Dunod, Paris.
Louden, K. C. and Lambert, K. A. (2004). Programming Languages: Principles and Practices. Cengage
Learning, Farmington Hills.
Mike, K. and Hazzan, O. (2023). What is Data Science? Communications of the Association for
Computing Machinery, 66(2):12–13.