
edX - Data Science Tools

IBM - DS0105EN

Categories of Data Science Tools

Data Management is the process of persisting and retrieving data.

Data Integration and Transformation, often referred to as Extract, Transform, and Load, or “ETL,” is the process of retrieving data from remote data management systems.
Transforming data and loading it into a local data management system is also part of Data Integration and Transformation.

Data Visualization is part of an initial data exploration process, as well as being part of
a final deliverable.

Model Building is the process of creating a machine learning or deep learning model
using an appropriate algorithm with a lot of data.

Model deployment makes such a machine learning or deep learning model available
to third-party applications.

Model monitoring and assessment ensures continuous performance quality checks on the deployed models.

Code asset management uses versioning and other collaborative features to facilitate
teamwork.

Data asset management brings the same versioning and collaborative components to
data and also supports replication, backup, and access right management.

Development environments, commonly known as Integrated Development Environments, or “IDEs,” are tools that help the data scientist to implement, execute, test, and deploy their work.

Execution environments are tools where data preprocessing, model training, and
deployment take place.

Fully integrated, visual tooling is also available that covers all the previous tooling components, either partially or completely.

Open Source Tools for Data Science

Data Management - The most widely used open source data management tools are relational databases such as MySQL and PostgreSQL; NoSQL databases such as MongoDB, Apache CouchDB, and Apache Cassandra; and file-based tools such as the Hadoop File System or cloud file systems like Ceph.
Finally, Elasticsearch is mainly used for storing text data and creating a search index for fast document retrieval.

Data Integration and Transformation - Here are the most widely used open source
data integration and transformation tools:
Apache Airflow; Kubeflow, which enables you to execute data science pipelines on top of Kubernetes; Apache Kafka; Apache Nifi, which delivers a very nice visual editor; Apache SparkSQL (which enables you to use ANSI SQL and scales up to compute clusters of thousands of nodes); and Node-RED, which also provides a visual editor.

Data Visualization - Hue can create visualizations from SQL queries. Kibana, a data
exploration and visualization web application, is limited to Elasticsearch (the data
provider).
Finally, Apache Superset is a data exploration and visualization web application.

Model deployment - Apache PredictionIO currently only supports Apache Spark ML models for deployment, but support for all sorts of other libraries is on the roadmap.
Seldon is an interesting product since it supports nearly every framework, including TensorFlow, Apache SparkML, R, and scikit-learn. Seldon can run on top of Kubernetes and Red Hat OpenShift. Another way to deploy SparkML models is by using MLeap.
Finally, TensorFlow can serve any of its models using TensorFlow Serving.

Model monitoring and assessment - ModelDB is a machine learning model metadata database where information about the models is stored and can be queried.
A generic, multi-purpose tool called Prometheus is also widely used for machine learning model monitoring, although it’s not specifically made for this purpose.
Model performance is not exclusively measured through accuracy. Model bias against
protected groups like gender or race is also important.
The IBM AI Fairness 360 open source toolkit does exactly this: it detects and mitigates bias in machine learning models.
Machine learning models, especially neural-network-based deep learning models, can
be subject to adversarial attacks, where an attacker tries to fool the model with
manipulated data or by manipulating the model itself.
The IBM Adversarial Robustness 360 Toolbox can be used to detect vulnerability to
adversarial attacks and help make the model more robust.
Machine learning models are often considered to be black boxes that apply some mysterious “magic.” The IBM AI Explainability 360 Toolkit makes the machine learning

process more understandable by finding similar examples within a dataset that can be
presented to a user for manual comparison.
The IBM AI Explainability 360 Toolkit can also illustrate training for a simpler machine
learning model by explaining how different input variables affect the final decision of the
model.

Code asset management - Git is now the standard. Multiple services have emerged to
support Git, with the most prominent being GitHub, which provides hosting for software
development version management.
The runner-up is definitely GitLab, which has the advantage of being a fully open
source platform that you can host and manage yourself.
Another choice is Bitbucket.

Data asset management - Apache Atlas is a tool that supports this task.
Another interesting project, ODPi Egeria, is managed through the Linux Foundation
and is an open ecosystem.
It offers a set of open APIs, types, and interchange protocols that metadata repositories
use to share and exchange data.
Finally, Kylo is an open source data lake management software platform that provides
extensive support for a wide range of data asset management tasks.

Development environments - One of the most popular current development environments that data scientists are using is “Jupyter.” Jupyter first emerged as a tool for interactive Python programming; it now supports more than a hundred different programming languages.
A key property of Jupyter Notebooks is the ability to unify documentation, code, output
from the code, shell commands, and visualizations into a single document.
Although Apache Zeppelin has been fully reimplemented, it’s inspired by Jupyter Notebooks and provides a similar experience.
One key differentiator is the integrated plotting capability. In Jupyter Notebooks, you are required to use external libraries; in Apache Zeppelin, plotting doesn’t require coding.
You can also extend these capabilities by using additional libraries. RStudio is one of the oldest development environments for statistics and data science, having been introduced in 2011. It primarily runs R and all associated R libraries.
However, Python development is also possible, and R is tightly integrated into this tool to provide an optimal user experience. RStudio unifies programming, execution, debugging, remote data access, data exploration, and visualization into a single tool.

Execution environments - Sometimes your data doesn’t fit into a single computer’s
storage or main memory capacity. That’s where cluster execution environments come
in. The well-known cluster-computing framework Apache Spark is among the most
active Apache projects and is used across all industries.
The key property of Apache Spark is linear scalability. This means, if you double the
number of servers in a cluster, you’ll also roughly double its performance.

Fully integrated - Let’s look at open source tools for data scientists that are fully
integrated and visual.
With these tools, no programming knowledge is necessary.
KNIME has a visual user interface with drag-and-drop capabilities. It also has built-in
visualization capabilities.

KNIME can be extended by programming in R and Python, and has connectors to Apache Spark. Another example in this group of tools is Orange. It’s less flexible than KNIME, but easier to use.

Commercial Tools for Data Science

Data Management - In data management, most of an enterprise’s relevant data is stored in an Oracle Database, Microsoft SQL Server, or IBM Db2.

Data Integration and Transformation - Informatica PowerCenter and IBM InfoSphere DataStage are the leaders, followed by products from SAP, Oracle, SAS, Talend, and Microsoft.
Finally, Watson Studio Desktop includes a component called Data Refinery, which
enables the defining and execution of data integration processes in a spreadsheet
style.

Data Visualization - The most prominent commercial examples are Tableau, Microsoft Power BI, and IBM Cognos Analytics.
Another type of visualization targets data scientists rather than regular users.
A sample problem might be “How can different columns in a table relate to each
other?”
This type of functionality is contained in Watson Studio Desktop.

Model Building - If you want to build a machine learning model using a commercial
tool, you should consider using a data mining product.
The most prominent of these types of products are: SPSS Modeler and SAS Enterprise
Miner. In addition, a version of SPSS Modeler is also available in Watson Studio
Desktop, based on the cloud version of the tool.

Model deployment - In commercial software, model deployment is tightly integrated with the model building process.

Model monitoring and assessment - Model monitoring is a new discipline and there
are currently no relevant commercial tools available.
As a result, open source is the first choice.

Code asset management - The same is true for code asset management.
Open source with Git and GitHub is the effective standard.

Data asset management - Data asset management, often called data governance or
data lineage, is a crucial part of enterprise grade data science.
Data must be versioned and annotated using metadata.
Vendors, including Informatica (with Enterprise Data Governance) and IBM, provide tools for these specific tasks. The IBM InfoSphere Information Governance Catalog covers functions like a data dictionary, which facilitates discovery of data assets.

Each data asset is assigned to a data steward -- the data owner.
The data owner is responsible for that data asset and can be contacted.
Data lineage is also covered; this enables a user to track back through the transformation steps followed in creating the data assets.
The data lineage also includes a reference to the actual source data.
Rules and policies can be added to reflect complex regulatory and business requirements for data privacy and retention.

Development environments - Watson Studio is a fully integrated development environment for data scientists.

Fully integrated - Watson Studio, together with Watson OpenScale, is a fully integrated tool covering the full data science life cycle and all the tasks we’ve
discussed previously.
Another example of a fully integrated commercial tool is H2O Driverless AI, which
covers the complete data science life cycle.

Cloud Based Tools for Data Science

Since cloud products are a newer species, they follow the trend of having multiple tasks integrated into a single tool.
This especially holds true for the tasks marked green in the diagram.

Data Management - In data management, with some exceptions, there are SaaS versions of existing open source and commercial tools.
Remember, SaaS stands for Software as a Service.
It means that the cloud provider operates the tool for you in the cloud, for example by backing up your data and configuration and installing updates.
As mentioned, there is proprietary tooling which is only available as a cloud product; sometimes it's only available from a single cloud provider.
One example of such a service is Amazon Web Services DynamoDB, a NoSQL database that allows storage and retrieval of data in a key-value or document store format.
The most prominent document data structure is JSON.
Another flavor of such a service is Cloudant, which is a database-as-a-service offering, but under the hood it is based on the open source Apache CouchDB.
It has the advantage that complex operational tasks like updating, backup, restore, and scaling are done by the cloud provider under the hood.
The offering is compatible with CouchDB, so the application can be migrated to another CouchDB server without changing the application.

IBM offers Db2 as a service as well.
This is an example of a commercial database made available as a software as a
service offering in the cloud, taking operational tasks away from the user.

Data Integration and Transformation - When it comes to commercial data integration tools, we talk not only about Extract, Transform, and Load (ETL) tools, but also about Extract, Load, and Transform (ELT) tools.
This means the transformation steps are not done by a data integration team but are pushed towards the domain of the data scientist or data engineer.
Two widely used commercial data integration tools are Informatica Cloud Data Integration and IBM's Data Refinery. Data Refinery enables transformation of large amounts of raw data into consumable, quality information in a spreadsheet-like user interface.
Data Refinery is part of IBM Watson Studio.

Data Visualization - The market for cloud data visualization tools is huge and every
major cloud vendor has one.
An example of a smaller company's cloud-based data visualization tool is Datameer.
IBM offers its famous Cognos business intelligence suite as a cloud solution as well.
IBM Data Refinery also offers data exploration and visualization functionality in Watson Studio.

Model Building - Can be done using a service such as Watson Machine Learning. Watson Machine Learning can train and build models using various open source libraries.
Google has a similar service on their cloud called AI Platform Training.
Nearly every cloud provider has a solution for this task.

Model deployment - Model deployment in commercial software is usually tightly integrated with the model building process.

Model monitoring and assessment - Amazon SageMaker Model Monitor is an example of a cloud tool that continuously monitors deployed machine learning and deep learning models.
Again, every major cloud provider has similar tooling. This is also the case for Watson OpenScale; OpenScale and Watson Studio unify the landscape.

Fully integrated and Platform - Since these tools introduce a component where large-scale execution of data science workflows happens in compute clusters, we've changed the title here and added the word “platform.”
These clusters are composed of multiple server machines, transparently for the user in the background. Watson Studio, together with Watson OpenScale, covers the complete development life cycle for all data science, machine learning, and AI tasks.
Another example is Microsoft Azure Machine Learning. This is also a fully cloud-hosted offering supporting the complete development life cycle of all data science, machine learning, and AI tasks.
Finally, another example is H2O Driverless AI.

Watson OpenScale and Watson Studio unify the landscape.
Everything marked in green can be done using Watson Studio and Watson OpenScale.

Libraries for Data Science

Libraries are a collection of functions and methods that enable you to perform a wide
variety of actions without writing the code yourself.
We will focus on Python libraries:
1) Scientific Computing Libraries in Python;
2) Visualization Libraries in Python;
3) High-Level Machine Learning and Deep Learning Libraries – “High-level” simply
means you don’t have to worry about details, although this makes it difficult to
study or improve;
4) Deep Learning Libraries in Python;

Libraries usually contain built-in modules providing different functionalities that you can
use directly; these are sometimes called “frameworks.”
There are also extensive libraries, offering a broad range of facilities.

Scientific Computing Libraries in Python - Pandas offers data structures and tools for effective data cleaning, manipulation, and analysis.
It provides tools to work with different types of data.
The primary instrument of Pandas is a two-dimensional table consisting of columns and
rows.
This table is called a “DataFrame” and is designed to provide easy indexing so you can
work with your data.
NumPy libraries are based on arrays, enabling you to apply mathematical functions to
these arrays. Pandas is actually built on top of NumPy.
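A minimal sketch of how these two libraries fit together; the column names and values below are made up purely for illustration.

import numpy as np
import pandas as pd

# A DataFrame is a two-dimensional table of rows and columns.
df = pd.DataFrame({
    "product": ["apples", "bananas", "cherries"],
    "units_sold": [120, 98, 43],
    "unit_price": [0.5, 0.25, 3.0],
})

# Easy indexing: select columns and filter rows by a condition.
revenue = df["units_sold"] * df["unit_price"]
print(df[df["units_sold"] > 50])

# Pandas is built on top of NumPy, so NumPy functions apply directly.
print(np.mean(revenue))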

Visualization Libraries in Python - Data visualization methods are a great way to communicate with others and show the meaningful results of an analysis.
These libraries enable you to create graphs, charts and maps.

The Matplotlib package is the most well-known library for data visualization, and it’s
excellent for making graphs and plots. The graphs are also highly customizable.
Another high-level visualization library, Seaborn, is based on matplotlib. Seaborn
makes it easy to generate plots like heat maps, time series, and violin plots.
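As a rough illustration, the sketch below draws one Matplotlib plot and one Seaborn heat map; the data is random and only for demonstration.

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# A simple, customizable Matplotlib line plot.
x = np.linspace(0, 10, 100)
plt.plot(x, np.sin(x))
plt.title("A basic Matplotlib plot")
plt.show()

# A Seaborn heat map generated from a random 5x5 matrix.
sns.heatmap(np.random.rand(5, 5), annot=True)
plt.title("A basic Seaborn heat map")
plt.show()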

Machine Learning and Deep Learning Libraries in Python - For machine learning, the Scikit-learn library contains tools for statistical modeling, including regression,
classification, clustering and others.
It is built on NumPy, SciPy, and matplotlib, and it’s relatively simple to get started.
For this high-level approach, you define the model and specify the parameter types you
would like to use.
For deep learning, Keras enables you to build the standard deep learning model.
Like Scikit-learn, the high-level interface enables you to build models quickly and
simply.
It can function using graphics processing units (GPU), but for many deep learning
cases a lower-level environment is required.
TensorFlow is a low-level framework used in large scale production of deep learning
models.
It’s designed for production but can be unwieldy for experimentation.
PyTorch is used for experimentation, making it simple for researchers to test their ideas.
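A minimal scikit-learn sketch of the high-level “define the model, specify the parameters” approach mentioned above, using one of the library’s bundled example datasets:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# Load a small bundled dataset and split it into training and test sets.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Define the model and specify its parameters, then train and evaluate it.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
print("Test accuracy:", model.score(X_test, y_test))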

Apache Spark is a general-purpose cluster-computing framework that enables you to process data using compute clusters.
This means that you process data in parallel, using multiple computers simultaneously.
The Spark library has similar functionality to Pandas, NumPy, and Scikit-learn.
Apache Spark data processing jobs can use Python, R, Scala, or SQL.
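A small PySpark sketch, assuming the pyspark package is installed; run locally it uses a single machine, but the same code scales out to a cluster.

from pyspark.sql import SparkSession

# Start a local Spark session; on a cluster, only the deployment configuration changes.
spark = SparkSession.builder.appName("spark-example").getOrCreate()

# Create a DataFrame and query it with Spark SQL.
df = spark.createDataFrame(
    [("apples", 120), ("bananas", 98), ("cherries", 43)],
    ["product", "units_sold"],
)
df.createOrReplaceTempView("sales")
spark.sql("SELECT product, units_sold FROM sales WHERE units_sold > 50").show()

spark.stop()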

Application Programming Interfaces (API)

An API lets two pieces of software talk to each other. For example, you have your program, some data, and other software components. You use the API to communicate with the other software components. You don’t have to know how the API works; you just need to know its inputs and outputs. Remember, the API only refers to the interface, or the part of the library that you see. The “library” refers to the whole thing.
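For instance, when you call a Pandas method you only deal with its inputs and outputs; how the library computes the result internally is hidden from you. The small table below is made up for illustration.

import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

# The API call: you pass in the DataFrame you already have and get back a
# correlation matrix. The internals are irrelevant to the caller.
print(df.corr())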

REST APIs - Are another popular type of API. They enable you to communicate using the internet, taking advantage of storage, greater data access, artificial intelligence algorithms, and many other resources. REST stands for “Representational State Transfer.” In REST APIs, your program is called the “client.” The API communicates with a web service that you call through the internet.

Here are some common API-related terms. You or your code can be thought of as a client. The web service is referred to as a resource. The client finds the service through an endpoint. The client sends a request to the resource, and the resource sends a response to the client.
The request is usually communicated through an HTTP message. The HTTP message
usually contains a JSON file, which contains instructions for the operation that we
would like the service to perform. This operation is transmitted to the web service over
the internet.
The service performs the operation. Similarly, the web service returns a response
through an HTTP message, where the information is usually returned using a JSON
file. This information is transmitted back to the client.
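A sketch of that request/response cycle using Python’s requests library; the endpoint URL and JSON fields below are hypothetical, not a real service.

import requests

# Hypothetical endpoint of a web service (placeholder URL).
endpoint = "https://api.example.com/v1/sentiment"

# The client sends an HTTP request whose body is a JSON payload with instructions.
payload = {"text": "Data science tools are great."}
response = requests.post(endpoint, json=payload, timeout=10)

# The service performs the operation and returns an HTTP response, usually containing JSON.
print(response.status_code)
print(response.json())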

Data Sets - Powering Data Science

A data set is a structured collection of data. Data embodies information that might be
represented as text, numbers, or media such as images, audio, or video files.
A data set that is structured as tabular data comprises a collection of rows, which in
turn comprise columns that store the information.
Hierarchical or network data structures are typically used to represent relationships
between data.
Hierarchical data is organized in a tree-like structure, whereas network data might be
stored as a graph.
For example, the connections between people on a social networking website are often
represented in the form of a graph.
Traditionally, most data sets were considered to be private because they contain
proprietary or confidential information such as customer data, pricing data, or other
commercially sensitive information.
These data sets are typically not shared publicly. Over time, more and more public and
private entities such as scientific institutions, governments, organizations and even
companies have started to make data sets available to the public as “open data,"
providing a wealth of information for free.
Open data has played a significant role in the growth of data science, machine
learning, and artificial intelligence and has provided a way for practitioners to hone their
skills on a wide variety of data sets.
There are many open data sources on the internet.
You can find a comprehensive list of open data portals from around the world on the

Open Knowledge Foundation’s datacatalogs.org website.
The United Nations, the European Union, and many other governmental and
intergovernmental organizations maintain data repositories providing access to a wide
range of information.
On Kaggle, which is a popular data science online community, you can find and
contribute data sets that might be of general interest.
Last but not least, Google provides a search engine for data sets that might help you
find the ones that have particular value for you.

In the absence of a license for open data distribution, many data sets were shared in the past under open source software licenses.
These licenses were not designed to cover the specific considerations related to the
distribution and use of data sets.
To address the issue, the Linux Foundation created the Community Data License
Agreement, or CDLA.
Two licenses were initially created for sharing data: CDLA-Sharing and CDLA-
Permissive.
The CDLA-Sharing license grants you permission to use and modify the data.
The license stipulates that if you publish your modified version of the data you must do
so under the same license terms as the original data.
The CDLA-Permissive license also grants you permission to use and modify the data.
However, you are not required to share changes to the data.

Data Asset Exchange

Despite the growth of open data sets that are available to the public, it can still be
difficult to discover data sets that are both high quality and have clearly defined license
and usage terms.
To help solve this challenge, IBM created the Data Asset eXchange, or “DAX.”
DAX provides a trusted source for finding open data sets that are ready to use in enterprise applications.
These data sets cover a wide variety of domains, including images, video, text, and audio.
Because DAX provides a high level of curation for data set quality, as well as licensing
and usage terms, DAX data sets are typically easier to adopt, whether in research or
commercial projects.

Machine Learning Models

Machine learning uses algorithms – also known as “models” – to identify patterns in the data.
The process by which the model learns these patterns from data is called “model training.” Once a model is trained, it can then be used to make predictions.
When the model is presented with new data, it tries to make predictions or decisions
based on the patterns it has learned from past data.
Machine learning models can be divided into three basic classes: supervised learning,
unsupervised learning, and reinforcement learning.

Supervised learning - Is one of the most commonly used types of machine learning models.
In supervised learning, a human provides input data and the correct outputs.
The model tries to identify relationships and dependencies between the input data and
the correct output. Generally speaking, supervised learning is used to solve regression
and classification problems.
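As a small illustration with made-up numbers, a supervised regression model learns the mapping from inputs to known correct outputs and then predicts outputs for new inputs:

from sklearn.linear_model import LinearRegression

# Input data (hours studied) and the correct outputs (exam scores), both provided by a human.
X = [[1], [2], [3], [4], [5]]
y = [52, 58, 65, 71, 77]

# The model learns the relationship between the inputs and the correct outputs...
model = LinearRegression().fit(X, y)

# ...and then predicts the output for new, unseen input data.
print(model.predict([[6]]))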

Unsupervised learning - The data is not labelled by a human. The models must
analyze the data and try to identify patterns and structure within the data based only on
the characteristics of the data itself.
Clustering and anomaly detection are two examples of this learning style. Clustering
models are used to divide each record of a data set into one of a small number of
similar groups.
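A minimal clustering sketch with scikit-learn, again with made-up points, that assigns unlabeled records to a small number of similar groups:

from sklearn.cluster import KMeans

# Unlabeled data: each record is described only by its two characteristics.
X = [[1, 2], [1, 4], [1, 0],
     [10, 2], [10, 4], [10, 0]]

# The model groups the records into clusters based only on the data itself.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)           # the cluster assigned to each record
print(kmeans.cluster_centers_)  # the center of each cluster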

Reinforcement learning - Is loosely based on the way human beings and other
organisms learn.
Think about a mouse in a maze. If the mouse gets to the end of the maze, it gets a piece of cheese. This is the “reward” for completing a task.
The mouse learns – through trial and error – how to get through the maze to get as
much cheese as it can.
In a similar way, a reinforcement learning model learns the best set of actions to take,
given its current environment, in order to get the most reward over time.
This type of learning has recently been very successful in beating the best human players in games such as Go, chess, and popular strategy video games.

Deep learning is a specialized type of machine learning.
It refers to a general set of models and techniques that try to loosely emulate the way the human brain solves a wide range of problems.
It is commonly used to analyze natural language, both spoken and text, as well as
images, audio, and video, to forecast time series data and much more.
Deep learning has had a lot of recent success in these and other areas and is therefore
becoming an increasingly popular and important tool for data science.
Deep learning typically requires very large data sets of labeled data to train a model, is
compute-intensive, and usually requires special purpose hardware to achieve
acceptable training times.
You can build a custom deep learning model from scratch or use pre-trained models
from public model repositories.
Deep learning models are implemented using popular frameworks such as
TensorFlow, PyTorch, and Keras.

Deep learning frameworks typically provide a Python API, and many support other programming languages.
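A minimal sketch of defining a standard deep learning model with the Keras API; the layer sizes below are arbitrary and only for illustration.

from tensorflow import keras
from tensorflow.keras import layers

# A small fully connected network for 10-class classification of 784-dimensional inputs.
model = keras.Sequential([
    layers.Input(shape=(784,)),
    layers.Dense(128, activation="relu"),
    layers.Dense(10, activation="softmax"),
])

# Compile with an optimizer, a loss, and a metric; training would then use model.fit(...).
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()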

Assume you want to enable an application to identify objects in images by training a deep learning model.
First, you collect and prepare data that will be used to train a model.
Data preparation can be a time-consuming and labor-intensive process.
In order to train a model to detect objects in images, you need to label the raw training
data by, for example, drawing bounding boxes around objects and labeling them.
Next, you build a model from scratch or select an existing model that might be well
suited for the task from a public or private resource.
You then train the model on your prepared data.
During training, your model learns from the labeled data how to identify objects that are
depicted in an image.
Once training has completed, you analyze the training results and repeat the process until the trained model’s performance meets your requirements.
When the trained model performs as desired, you deploy it to make it available to your
applications.

The Model Asset Exchange

To reduce time to value, consider taking advantage of pre-trained models for certain
types of problems.
These pre-trained models can be ready to use right away, or they might take less time
to train. The Model Asset eXchange is a free open source repository for ready-to-use
and customizable deep learning microservices.
These microservices are configured to use pre-trained or custom-trainable state-of-the-
art deep learning models to solve common business problems.
These models have been reviewed, tested, and can be quickly deployed in local and
cloud environments.
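As an illustration of how ready-to-use a pre-trained model can be, the sketch below loads a generic image classifier that ships with Keras; this is not one of the MAX microservices themselves, and "photo.jpg" is a placeholder for any local image file.

import numpy as np
from tensorflow.keras.applications.mobilenet_v2 import (
    MobileNetV2, preprocess_input, decode_predictions)
from tensorflow.keras.preprocessing import image

# Load a model pre-trained on ImageNet; no training step is required.
model = MobileNetV2(weights="imagenet")

# Load and preprocess a local image ("photo.jpg" is a placeholder path).
img = image.load_img("photo.jpg", target_size=(224, 224))
x = preprocess_input(np.expand_dims(image.img_to_array(img), axis=0))

# Predict and print the top three labels with their probabilities.
print(decode_predictions(model.predict(x), top=3)[0])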

Let’s take a look at the components of a typical model-serving microservice. Each microservice includes the following components:
A pre-trained deep learning model.
Code that pre-processes the input before it is analyzed by the model and code that
post-processes the model output.
A standardized public API that makes the service’s functionality available to
applications.

The MAX model-serving microservices are built and distributed as open-source Docker
images. Docker is a container platform that makes it easy to build applications and to
deploy them in a development, test, or production environment.
The Docker image source is published on GitHub and can be downloaded, customized
as needed, and used in personal or commercial environments.
You can deploy and run these images in a test or production environment using
Kubernetes, an open-source system for automating deployment, scaling, and
management of containerized applications in private, hybrid, or public clouds.
A popular enterprise-grade Kubernetes platform is Red Hat OpenShift, which is
available on IBM Cloud, Google Cloud Platform, Amazon Web Services, and Microsoft
Azure.
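A sketch of calling such a microservice from Python, assuming a MAX model-serving container is running locally and publishing port 5000; the endpoint path and form field below follow the usual MAX convention, but the specific model’s API documentation should be checked.

import requests

# Assumes the container was started with something like:
#   docker run -it -p 5000:5000 <max-model-image>
# where <max-model-image> is a placeholder for the model's Docker image.
url = "http://localhost:5000/model/predict"

# "photo.jpg" is a placeholder for a local image to analyze.
with open("photo.jpg", "rb") as f:
    response = requests.post(url, files={"image": f})

# The standardized API returns the model's predictions as JSON.
print(response.json())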

Lab: Explore Data Sets and Models

Exercise 1: Explore deep learning datasets

The Data Asset Exchange is a curated collection of open datasets from IBM Research
and 3rd parties that you can use to train models.

1. Open https://developer.ibm.com/ in your web browser.


2. From the main menu select “Open Source at IBM” > “Data Asset eXchange”.
The DAX homepage is displayed.

Exercise 2 - Explore deep learning models

The Model Asset Exchange is a curated repository of open source deep learning
models for a variety of domains, such as text, image, audio, and video processing.

1. Open https://developer.ibm.com/ in your web browser.


2. From the main menu, select Open Source at IBM > Model Asset eXchange.

The MAX home page is displayed. In this introductory lab exercise, we are going to
focus on a few MAX key features.

3. Select the Object Detector model from the list of options available. This model
recognizes the objects present in an image.

4. Now, click on Try in CodePen. CodePen is a social development environment.
At its heart, it allows you to write code in the browser, and see the results of it
as you build.
5. Upload an image.
6. Click on the Extract Prediction icon.

Introduction to R and RStudio

R is a statistical programming language. It is a powerful tool for data processing and manipulation, statistical inference, data analysis, and machine learning algorithms.
R has many functions that support importing data from different sources: for example, flat files, databases, the web, and statistical software like SPSS and STATA.
R is a preferred language for some Data Scientists because it is easy to use the
functions within R. It is also known for producing great visualizations and readily
available packages to handle data analysis without the need to install any libraries.
To use R, we need an environment to help run the code. One of the most popular environments for developing and running R language source code and programs is RStudio.
RStudio includes:
1. A syntax-highlighting editor that supports direct code execution and where you
keep a record of your work;
2. A console for typing R commands;
3. A workspace and history tab that shows the list of R objects you created during
your R session and shows the history of all previous commands;
4. A Plots, Files, Packages, and Help tab that shows the files in your working directory, the history of plots you have created (and allows exporting plots to PDF or image files), the external R packages available on your local computer, and help on R resources, RStudio support, packages, and more.

Here are some popular Libraries in the Data Science community:
 dplyr for manipulating data,
 stringr for manipulating strings,
 ggplot2 for visualizing data,
 caret for machine learning.

R is a great tool for data visualization and has many different packages.
Some of the popular and top data visualization tools include:
 ggplot2, which is used for data visualizations such as histograms, bar charts, scatterplots, etc.
 Plotly, an R package that can be used to create web-based data visualizations that can be displayed or saved as individual HTML files.
 Lattice, a data visualization tool that is used to implement complex, multi-variable data sets.
 Lattice is a high-level data visualization library; it can handle many of the typical graphics without needing many customizations.
 Leaflet, which is very useful for creating interactive plots.

Depending on your needs and your data science project, most of these libraries and packages can come in handy.
To install these packages in your R environment, use the install.packages() command with the package name:

install.packages("package name")

Lab: RStudio - The Basics

# load the iris dataset
library(datasets)
data(iris)
View(iris)

# How many different species are present in the data set?
unique(iris$Species)
(iris is the data set and Species is the column name)

Lab: Basic plots in RStudio

library(datasets)
# Load Data
data(mtcars)
# View first 5 rows
head(mtcars, 5)
#Get information about the variables. This will print the information at the bottom right
panel, on the Help tab
?mtcars

#load ggplot package
library(ggplot2)
# create a scatterplot of displacement (disp) and miles per gallon (mpg)
ggplot(aes(x = disp, y = mpg), data = mtcars) + geom_point()
# Add a title
ggplot(aes(x = disp, y = mpg), data = mtcars) + geom_point() +
  ggtitle("displacement vs miles per gallon")
# Change axis names
ggplot(aes(x = disp, y = mpg), data = mtcars) + geom_point() +
  ggtitle("displacement vs miles per gallon") +
  labs(x = "Displacement", y = "Miles per Gallon")

#make vs a factor
mtcars$vs <- as.factor(mtcars$vs)
# create boxplot of the distribution for v-shaped and straight Engine
ggplot(aes(x=vs, y=mpg), data = mtcars) + geom_boxplot()

# Add color to the boxplots to help differentiate
ggplot(aes(x=vs, y=mpg, fill = vs), data = mtcars) + geom_boxplot(alpha=0.3) +
theme(legend.position="none")

# Create the histogram of weight
ggplot(aes(x = wt), data = mtcars) + geom_histogram(binwidth = 0.5)

Introduction to Jupyter Notebooks

A Jupyter notebook is a browser-based application that allows you to create and share documents that contain code, equations, visualizations, narrative text, links, and much more. It can be likened to a scientist’s lab notebook, where a scientist records all the steps of their experiments and the results so that they can be reproduced in the future.
When you run the code, it generates the outputs, including plots and tables, within the
notebook file. And you can then export the notebook to a PDF or HTML file that can
then be shared with anyone. Jupyter Notebooks originated as “IPython,” originally developed for the Python programming language. As it came to support additional languages, it was renamed Jupyter, which stands for:
 Julia
 Python
 R
JupyterLab extends the functionalities of Jupyter notebooks by enabling you to work
with multiple notebooks, text editors, terminals, and custom components in a flexible,
integrated, and extensible manner.

Jupyter Kernels

A notebook kernel is a computational engine that executes the code contained in a notebook file.

Jupyter Architecture

Jupyter implements a two-process model, with a kernel and a client.
The client is the interface offering the user the ability to send code to a kernel.
The kernel executes the code and returns the result to the client for display.
The client is the browser when using a Jupyter notebook.
A Jupyter notebook represents your code, metadata, contents, and outputs using a JSON structure, saved with a .ipynb extension. When you save a notebook, it is sent from your browser to the notebook server, which saves the notebook file on disk as a JSON file with a .ipynb extension.
The notebook server is responsible for saving and loading notebooks.
The kernel is sent the cells of code when the user runs them.

Jupyter also has an architecture for converting notebooks to other formats. It uses a tool called nbconvert.
For example, converting a notebook file into an HTML file goes through the following steps: a preprocessor modifies the notebook, an exporter converts it to the new file format, and a postprocessor works on the exported file.
After conversion, when you open the URL of the converted file, the notebook is fetched, converted to HTML, and displayed to you as an HTML page.
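As a small illustration, the same conversion can be driven from Python through the nbconvert library; "notebook.ipynb" below is a placeholder filename.

from nbconvert import HTMLExporter

# The exporter runs the preprocessors, converts the notebook, and applies postprocessing.
exporter = HTMLExporter()
body, resources = exporter.from_filename("notebook.ipynb")  # placeholder notebook file

# Write the generated HTML to a file.
with open("notebook.html", "w", encoding="utf-8") as f:
    f.write(body)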

Lab - Jupyter Notebook - The Basics

 Use the keyboard shortcuts: [a] - Insert a Cell Above; [b] - Insert a Cell Below.
 Execute the code, by either clicking the Play button in the menu above the
notebook, or by pressing Shift+Enter
 You cannot create Markdown cells without first creating cells and converting
them from Code to Markdown. To render the Markdown text, make sure the cell
is selected (by clicking within it), and press Play in the menu, or Shift+Enter. To
edit your Markdown cell, double-click anywhere within the cell. Note you can
use the keyboard shortcut: [m] - Convert Cell to Markdown

Overview of Git/GitHub

Git and GitHub are popular environments among developers and data scientists for performing version control of source code files and projects, and for collaborating with others.
A version control system allows you to keep track of changes to your documents.
This makes it easy for you to recover older versions of your document if you make a
mistake, and it makes collaboration with others much easier.
Version control systems are widely used for things involving code, but you can also
version control images, documents, and any number of file types.
There are a few basic terms that you will need to know before you can get started:
 The SSH protocol is a method for secure remote login from one computer to
another.
 A repository contains your project folders that are set up for version control.
 A fork is a copy of a repository.
 A pull request is the way you request that someone reviews and approves your
changes before they become final.
 A working directory contains the files and subdirectories on your computer that
are associated with a Git repository.
There are a few basic Git commands that you will always use:
 When starting out with a new repository, you only need to create it once: either locally with "git init" (and then push it to GitHub), or by cloning an existing repository with "git clone".
 "git add" moves changes from the working directory to the staging area.
 "git status" allows you to see the state of your working directory and the staged
snapshot of your changes.

 "git commit" takes your staged snapshot of changes and commits them to the
project.
 "git reset" undoes changes that you’ve made to the files in your working
directory.
 "git log" enables you to browse previous changes to a project.
 "git branch" lets you create an isolated environment within your repository to
make changes.
 "git checkout" lets you see and change existing branches.
 "git merge" lets you put everything back together again.
