2 - Data Science Tools
IBM - DS0105EN
Data Visualization is part of an initial data exploration process, as well as being part of
a final deliverable.
Model Building is the process of creating a machine learning or deep learning model
using an appropriate algorithm with a lot of data.
Model deployment makes such a machine learning or deep learning model available
to third-party applications.
Code asset management uses versioning and other collaborative features to facilitate
teamwork.
Data asset management brings the same versioning and collaborative components to
data and also supports replication, backup, and access right management.
Execution environments are tools where data preprocessing, model training, and
deployment take place.
Fully integrated, visual tooling is available that covers all the previous tooling
components, either partially or completely.
Data Management - The most widely used open source data management tools are
relational databases such as MySQL and PostgreSQL; NoSQL databases such as
MongoDB, Apache CouchDB, and Apache Cassandra; and file-based tools such as the
Hadoop File System or cloud file systems like Ceph.
Finally, Elasticsearch is mainly used for storing text data and creating a search index for
fast document retrieval.
Data Integration and Transformation - Here are the most widely used open source
data integration and transformation tools:
Apache Airflow; KubeFlow, which enables you to execute data science pipelines on
top of Kubernetes; Apache Kafka; Apache NiFi, which delivers a very nice visual editor;
Apache SparkSQL, which enables you to use ANSI SQL and scales up to compute
clusters of thousands of nodes; and Node-RED, which also provides a visual editor.
Data Visualization - Hue can create visualizations from SQL queries. Kibana, a data
exploration and visualization web application, is limited to Elasticsearch (the data
provider).
Finally, Apache Superset is a data exploration and visualization web application.
The IBM AI Explainability 360 Toolkit makes the machine learning decision process
more understandable by finding similar examples within a dataset that can be
presented to a user for manual comparison.
The same toolkit can also illustrate training for a simpler machine learning model by
explaining how different input variables affect the final decision of the model.
Code asset management - Git is now the standard. Multiple services have emerged to
support Git, with the most prominent being GitHub, which provides hosting for software
development version management.
The runner-up is definitely GitLab, which has the advantage of being a fully open
source platform that you can host and manage yourself.
Another choice is Bitbucket.
Data asset management - Apache Atlas is a tool that supports this task.
Another interesting project, ODPi Egeria, is managed through the Linux Foundation
and is an open ecosystem.
It offers a set of open APIs, types, and interchange protocols that metadata repositories
use to share and exchange data.
Finally, Kylo is an open source data lake management software platform that provides
extensive support for a wide range of data asset management tasks.
Execution environments - Sometimes your data doesn’t fit into a single computer’s
storage or main memory capacity. That’s where cluster execution environments come
in. The well-known cluster-computing framework Apache Spark is among the most
active Apache projects and is used across all industries.
The key property of Apache Spark is linear scalability. This means, if you double the
number of servers in a cluster, you’ll also roughly double its performance.
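As a minimal sketch of what working with Spark from Python looks like (assuming the pyspark package and a Spark installation are available; the data below is synthetic and "local[*]" is just a stand-in for a real cluster manager):

# The same code runs on a laptop or on a large cluster; only the master changes.
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("scalability-sketch") \
    .master("local[*]") \
    .getOrCreate()

# Create a distributed DataFrame and run an aggregation on it.
df = spark.createDataFrame(
    [(i, i % 10) for i in range(1_000_000)],
    schema=["value", "bucket"],
)
df.groupBy("bucket").count().show()
spark.stop()

On a real cluster, .master() would point at YARN, Kubernetes, or a standalone Spark master instead of the local machine.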
Fully integrated - Let’s look at open source tools for data scientists that are fully
integrated and visual.
With these tools, no programming knowledge is necessary.
KNIME has a visual user interface with drag-and-drop capabilities. It also has built-in
visualization capabilities.
KNIME can be extended by programming in R and Python, and it has connectors to
Apache Spark. Another example in this group of tools is Orange. It's less flexible than
KNIME, but easier to use.
Model Building - If you want to build a machine learning model using a commercial
tool, you should consider using a data mining product.
The most prominent of these types of products are: SPSS Modeler and SAS Enterprise
Miner. In addition, a version of SPSS Modeler is also available in Watson Studio
Desktop, based on the cloud version of the tool.
Model monitoring and assessment - Model monitoring is a new discipline and there
are currently no relevant commercial tools available.
As a result, open source is the first choice.
Code asset management - The same is true for code asset management.
Open source with Git and GitHub is the effective standard.
Data asset management - Data asset management, often called data governance or
data lineage, is a crucial part of enterprise-grade data science.
Data must be versioned and annotated using metadata.
Vendors, including Informatica and IBM, provide tools for these specific tasks. The IBM
InfoSphere Information Governance Catalog covers functions like a data dictionary,
which facilitates discovery of data assets.
The data lineage also includes a reference to the actual source data.
Rules and policies can be added to reflect complex regulatory and business
requirements for data privacy and retention.
Watson Studio is a fully integrated development environment for data scientists.
Fully integrated - Watson Studio, together with Watson Open Scale, is a fully
integrated tool covering the full data science life cycle and all the tasks we’ve
discussed previously.
Another example of a fully integrated commercial tool is H2O Driverless AI, which
covers the complete data science life cycle.
Since cloud products are a newer species of tools, they follow the trend of having
multiple tasks integrated into a single tool.
This especially holds true for the tasks marked green in the diagram.
Data Management - In data management, with some exceptions, there are SaaS
versions of the existing open source and commercial tools.
Remember, SaaS stands for "software as a service."
It means that the cloud provider operates the tool for you in the cloud.
For example, the cloud provider operates the product by backing up your data and
configuration and installing updates.
As mentioned, there is proprietary tooling which is only available as a cloud product.
Sometimes it's only available from a single cloud provider.
One example of such a service is Amazon Web Services DynamoDB, a NoSQL
database that allows storage and retrieval of data in a key-value or document store
format.
The most prominent document data structure is JSON.
Another flavor of such a service is Cloudant, a database-as-a-service offering that
under the hood is based on the open source Apache CouchDB.
It has the advantage that complex operational tasks like updates, backup, restore,
and scaling are done by the cloud provider under the hood.
Because this offering is compatible with CouchDB, an application can be migrated to
another CouchDB server without changes.
IBM offers Db2 as a service as well.
This is an example of a commercial database made available as a software as a
service offering in the cloud, taking operational tasks away from the user.
Data Visualization - The market for cloud data visualization tools is huge and every
major cloud vendor has one.
An example of a smaller company's cloud-based data visualization tool is Datameer.
IBM offers its famous Cognos business intelligence suite as a cloud solution as well.
IBM Data Refinery also offers data exploration and visualization functionality in Watson
Studio.
Model Building - Model building can be done using a service such as Watson Machine
Learning, which can train and build models using various open source libraries.
Google has a similar service on its cloud called AI Platform Training.
Nearly every cloud provider has a solution for this task.
Fully integrated and Platform - Since these tools add a component where large-scale
execution of data science workflows happens in compute clusters, we've changed the
title here and added the word "platform."
These clusters are composed of multiple server machines, transparently for the user in
the background. Watson Studio, together with Watson OpenScale, covers the complete
development life cycle for all data science, machine learning, and AI tasks.
Another example is Microsoft Azure Machine Learning, which is also a fully cloud-
hosted offering supporting the complete development life cycle of all data science,
machine learning, and AI tasks.
Finally, another example is H2O Driverless AI.
Watson OpenScale and Watson Studio unify the landscape.
Everything marked in green can be done using Watson Studio and Watson OpenScale.
Libraries are a collection of functions and methods that enable you to perform a wide
variety of actions without writing the code yourself.
We will focus on Python libraries:
1) Scientific Computing Libraries in Python;
2) Visualization Libraries in Python;
3) High-Level Machine Learning and Deep Learning Libraries – “High-level” simply
means you don’t have to worry about details, although this makes it difficult to
study or improve;
4) Deep Learning Libraries in Python;
Libraries usually contain built-in modules providing different functionalities that you can
use directly; these are sometimes called “frameworks.”
There are also extensive libraries, offering a broad range of facilities.
Scientific Computing Libraries in Python - Pandas offers data structures and tools
for effective data cleaning, manipulation, and analysis.
It provides tools to work with different types of data.
The primary instrument of Pandas is a two-dimensional table consisting of columns and
rows.
This table is called a “DataFrame” and is designed to provide easy indexing so you can
work with your data.
The NumPy library is based on arrays, enabling you to apply mathematical functions to
those arrays. Pandas is actually built on top of NumPy.
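A minimal sketch of both libraries, using made-up data rather than anything from the course:

import numpy as np
import pandas as pd

# A DataFrame: a two-dimensional table with labeled columns and an index.
df = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 6.3],
    "species": ["setosa", "setosa", "virginica"],
})
print(df.head())                  # quick inspection
print(df["sepal_length"].mean())  # column-wise analysis

# NumPy arrays let you apply mathematical functions element-wise.
a = np.array([1.0, 2.0, 3.0])
print(np.sqrt(a))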
Visualization Libraries in Python - The Matplotlib package is the most well-known
library for data visualization, and it's excellent for making graphs and plots. The graphs
are also highly customizable.
Another high-level visualization library, Seaborn, is based on matplotlib. Seaborn
makes it easy to generate plots like heat maps, time series, and violin plots.
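A small sketch of both libraries, assuming matplotlib and seaborn are installed; the data is random and purely illustrative:

import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

# A basic, highly customizable Matplotlib line plot.
x = np.linspace(0, 10, 100)
plt.plot(x, np.sin(x), label="sin(x)")
plt.title("A simple Matplotlib plot")
plt.legend()
plt.show()

# Seaborn builds on Matplotlib and makes plots like heat maps one-liners.
sns.heatmap(np.random.rand(5, 5), annot=True)
plt.show()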
Machine Learning and Deep Learning Libraries in Python - For machine learning,
the Scikit-learn library contains tools for statistical modeling, including regression,
classification, clustering, and others.
It is built on NumPy, SciPy, and matplotlib, and it’s relatively simple to get started.
For this high-level approach, you define the model and specify the parameter types you
would like to use.
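A minimal sketch of that workflow with synthetic data (the model and parameters below are only examples, not a course recommendation):

import numpy as np
from sklearn.linear_model import Ridge

# Synthetic regression data: 100 samples, 3 features.
X = np.random.rand(100, 3)
y = X @ np.array([1.5, -2.0, 0.5]) + 0.1 * np.random.randn(100)

model = Ridge(alpha=1.0)   # define the model and specify its parameters
model.fit(X, y)            # estimate the coefficients from the data
print(model.coef_)
print(model.predict(X[:2]))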
For deep learning, Keras enables you to build standard deep learning models.
Like Scikit-learn, the high-level interface enables you to build models quickly and
simply.
It can function using graphics processing units (GPU), but for many deep learning
cases a lower-level environment is required.
TensorFlow is a low-level framework used in large scale production of deep learning
models.
It’s designed for production but can be unwieldy for experimentation.
PyTorch is used for experimentation, making it simple for researchers to test their ideas.
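As an illustration of the high-level Keras interface mentioned above, here is a minimal sketch (assuming TensorFlow is installed; the architecture and the random data are placeholders):

import numpy as np
from tensorflow import keras

# A small classifier defined with the high-level Sequential API.
model = keras.Sequential([
    keras.layers.Dense(16, activation="relu", input_shape=(4,)),
    keras.layers.Dense(3, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# Random data purely for illustration.
X = np.random.rand(100, 4)
y = np.random.randint(0, 3, size=100)
model.fit(X, y, epochs=3, verbose=0)
print(model.predict(X[:2]))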
An API lets two pieces of software talk to each other. For example, you have your
program, you have some data, and you have other software components. You use the
API to communicate with the other software components. You don't have to know how
the API works; you just need to know its inputs and outputs. Remember, the API only
refers to the interface, or the part of the library that you see. The "library" refers to the
whole thing.
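For example, when you use the Pandas API you only deal with inputs and outputs; the implementation behind the call stays hidden (a tiny sketch):

import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3]})
# Known input (a column), known output (its mean); how the mean is
# computed internally is not something you need to see.
print(df["a"].mean())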
REST APIs - Are another popular type of API. They enable you to communicate using
the internet, taking advantage of storage, greater data access, artificial intelligence
algorithms, and many other resources. REST stands for "Representational State
Transfer." In REST APIs, your program is called the "client." The API communicates
with a web service that you call through the internet.
Here are some common API-related terms. You or your code can be thought of as a
client. The web service is referred to as a resource. The client finds the service through
an endpoint. The client sends a request to the resource, and the resource sends a
response back to the client.
The request is usually communicated through an HTTP message. The HTTP message
usually contains a JSON file, which contains instructions for the operation that we
would like the service to perform. This operation is transmitted to the web service over
the internet.
The service performs the operation. Similarly, the web service returns a response
through an HTTP message, where the information is usually returned using a JSON
file. This information is transmitted back to the client.
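A hedged sketch of that request/response cycle using the Python requests library; the URL and JSON fields below are hypothetical placeholders, not a real service:

import requests

endpoint = "https://api.example.com/v1/predict"   # hypothetical endpoint
payload = {"text": "What is the weather today?"}  # instructions as JSON

# The client sends an HTTP request carrying the JSON payload...
response = requests.post(endpoint, json=payload, timeout=10)

# ...and the web service returns an HTTP response, usually with a JSON body.
print(response.status_code)
print(response.json())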
A data set is a structured collection of data. Data embodies information that might be
represented as text, numbers, or media such as images, audio, or video files.
A data set that is structured as tabular data comprises a collection of rows, which in
turn comprise columns that store the information.
Hierarchical or network data structures are typically used to represent relationships
between data.
Hierarchical data is organized in a tree-like structure, whereas network data might be
stored as a graph.
For example, the connections between people on a social networking website are often
represented in the form of a graph.
Traditionally, most data sets were considered to be private because they contain
proprietary or confidential information such as customer data, pricing data, or other
commercially sensitive information.
These data sets are typically not shared publicly. Over time, more and more public and
private entities such as scientific institutions, governments, organizations and even
companies have started to make data sets available to the public as “open data,"
providing a wealth of information for free.
Open data has played a significant role in the growth of data science, machine
learning, and artificial intelligence and has provided a way for practitioners to hone their
skills on a wide variety of data sets.
There are many open data sources on the internet.
You can find a comprehensive list of open data portals from around the world on the
Open Knowledge Foundation’s datacatalogs.org website.
The United Nations, the European Union, and many other governmental and
intergovernmental organizations maintain data repositories providing access to a wide
range of information.
On Kaggle, which is a popular data science online community, you can find and
contribute data sets that might be of general interest.
Last but not least, Google provides a search engine for data sets that might help you
find the ones that have particular value for you.
In the absence of a license for open data distribution, many data sets were shared in
the past under open source software licenses.
These licenses were not designed to cover the specific considerations related to the
distribution and use of data sets.
To address the issue, the Linux Foundation created the Community Data License
Agreement, or CDLA.
Two licenses were initially created for sharing data: CDLA-Sharing and CDLA-
Permissive.
The CDLA-Sharing license grants you permission to use and modify the data.
The license stipulates that if you publish your modified version of the data you must do
so under the same license terms as the original data.
The CDLA-Permissive license also grants you permission to use and modify the data.
However, you are not required to share changes to the data.
Despite the growth of open data sets that are available to the public, it can still be
difficult to discover data sets that are both high quality and have clearly defined license
and usage terms.
To help solve this challenge, IBM created the Data Asset eXchange, or DAX.
DAX provides a trusted source for finding open data sets that are ready to use in
enterprise applications.
These data sets cover a wide variety of domains, including images, video, text, and
audio.
Because DAX provides a high level of curation for data set quality, as well as licensing
and usage terms, DAX data sets are typically easier to adopt, whether in research or
commercial projects.
Machine Learning Models
Machine learning uses algorithms, also known as "models," to identify patterns in the
data.
The process by which the model learns these patterns from data is called "model
training." Once a model is trained, it can then be used to make predictions.
When the model is presented with new data, it tries to make predictions or decisions
based on the patterns it has learned from past data.
Machine learning models can be divided into three basic classes: supervised learning,
unsupervised learning, and reinforcement learning.
Supervised learning - Is one of the most commonly used types of machine learning
models.
In supervised learning, a human provides input data and the correct outputs.
The model tries to identify relationships and dependencies between the input data and
the correct output. Generally speaking, supervised learning is used to solve regression
and classification problems.
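A small supervised-learning sketch using scikit-learn's built-in Iris data set, where the correct outputs (class labels) are provided alongside the inputs:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)   # inputs plus the correct outputs
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

clf = DecisionTreeClassifier(max_depth=3)
clf.fit(X_train, y_train)           # learn from the labeled examples
print("accuracy:", clf.score(X_test, y_test))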
Unsupervised learning - The data is not labelled by a human. The models must
analyze the data and try to identify patterns and structure within the data based only on
the characteristics of the data itself.
Clustering and anomaly detection are two examples of this learning style. Clustering
models are used to divide each record of a data set into one of a small number of
similar groups.
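A matching unsupervised sketch: the same data, but with the labels deliberately ignored, so the model has to find structure on its own:

from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

X, _ = load_iris(return_X_y=True)   # features only; no labels are used
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)      # cluster assignment for each record
print(labels[:10])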
Reinforcement learning - Is loosely based on the way human beings and other
organisms learn.
Think about a mouse in a maze. If the mouse gets to the end of the maze it gets a
piece of cheese.This is the “reward” for completing a task.
The mouse learns – through trial and error – how to get through the maze to get as
much cheese as it can.
In a similar way, a reinforcement learning model learns the best set of actions to take,
given its current environment, in order to get the most reward over time.
This type of learning has recently been very successful in beating the best human
players in games such as Go, chess, and popular strategy video games.
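A toy sketch of the idea in code, using tabular Q-learning on a made-up one-dimensional "maze" (this example is illustrative only and is not part of the course material):

import random

# Five cells in a row; the agent starts in cell 0 and the "cheese"
# (reward of 1) is in cell 4. Actions: move left (-1) or right (+1).
n_states, actions = 5, [-1, +1]
Q = [[0.0, 0.0] for _ in range(n_states)]
alpha, gamma, epsilon = 0.5, 0.9, 0.2   # learning rate, discount, exploration

for episode in range(200):
    state = 0
    while state != n_states - 1:
        if random.random() < epsilon:
            a = random.randrange(2)             # explore
        else:
            a = Q[state].index(max(Q[state]))   # exploit what was learned
        next_state = min(max(state + actions[a], 0), n_states - 1)
        reward = 1.0 if next_state == n_states - 1 else 0.0
        # Q-learning update: nudge the estimate toward reward + discounted future value.
        Q[state][a] += alpha * (reward + gamma * max(Q[next_state]) - Q[state][a])
        state = next_state

print(Q)  # "move right" ends up with the higher value in every non-terminal cell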
Deep learning frameworks typically provide a Python API, and many support other
programming languages as well.
To reduce time to value, consider taking advantage of pre-trained models for certain
types of problems.
These pre-trained models can be ready to use right away, or they might take less time
to train. The Model Asset eXchange is a free open source repository for ready-to-use
and customizable deep learning microservices.
These microservices are configured to use pre-trained or custom-trainable state-of-the-
art deep learning models to solve common business problems.
These models have been reviewed, tested, and can be quickly deployed in local and
cloud environments.
The MAX model-serving microservices are built and distributed as open-source Docker
images. Docker is a container platform that makes it easy to build applications and to
deploy them in a development, test, or production environment.
The Docker image source is published on GitHub and can be downloaded, customized
as needed, and used in personal or commercial environments.
You can deploy and run these images in a test or production environment using
Kubernetes, an open-source system for automating deployment, scaling, and
management of containerized applications in private, hybrid, or public clouds.
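As a hedged sketch, once one of these model-serving images is running locally, a client can call it over HTTP; the port, endpoint path, and form field below follow the usual MAX convention but should be checked against the specific model's documentation:

import requests

# Assumes a MAX image-model microservice is running on localhost:5000
# and the file dog.jpg exists; both are illustrative assumptions.
with open("dog.jpg", "rb") as f:
    response = requests.post(
        "http://localhost:5000/model/predict",
        files={"image": ("dog.jpg", f, "image/jpeg")},
        timeout=30,
    )
print(response.json())  # typically a JSON structure with predictions and scores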
A popular enterprise-grade Kubernetes platform is Red Hat OpenShift, which is
available on IBM Cloud, Google Cloud Platform, Amazon Web Services, and Microsoft
Azure.
The Data Asset Exchange is a curated collection of open datasets from IBM Research
and 3rd parties that you can use to train models.
Exercise 2 - Explore deep learning models
The Model Asset Exchange is a curated repository of open source deep learning
models for a variety of domains, such as text, image, audio, and video processing.
The MAX home page is displayed. In this introductory lab exercise, we are going to
focus on a few MAX key features.
3. Select the Object Detector model from the list of options available. This model
recognizes the objects present in an image.
4. Now, click on Try in CodePen. CodePen is a social development environment.
At its heart, it allows you to write code in the browser, and see the results of it
as you build.
5. Upload an image.
6. Click on the icon Extract Prediction as shown below:
Here are some popular R libraries in the data science community:
dplyr for manipulating data,
stringr for manipulating strings,
ggplot2 for visualizing data,
caret for machine learning.
R is a great tool for data visualization and has many different packages.
Some of the most popular data visualization tools include:
ggplot2, which is used for data visualizations such as histograms, bar charts,
scatterplots, and so on.
Plotly, an R package that can be used to create web-based data visualizations that
can be displayed or saved as individual HTML files.
Lattice, a data visualization tool that is used to visualize complex, multi-variable
data sets.
Lattice is a high-level data visualization library; it can handle many of the typical
graphics without needing many customizations.
Leaflet, which is very useful for creating interactive plots.
Depending on your need and data science project, most of these libraries and
packages can come in handy.
To install these packages in your R environment, use the install.packages() function
with the package name:
install.packages("package name")
# How many different species are present in the data set?
unique(iris$Species)
(iris is the data set and Species is the column name)
library(datasets)
# Load Data
data(mtcars)
# View first 5 rows
head(mtcars, 5)
#Get information about the variables. This will print the information at the bottom right
panel, on the Help tab
?mtcars
#load ggplot package
library(ggplot2)
# create a scatterplot of displacement (disp) and miles per gallon (mpg)
ggplot(aes(x=disp, y=mpg), data=mtcars) + geom_point()
# Add a title
ggplot(aes(x=disp, y=mpg), data=mtcars) + geom_point() + ggtitle("displacement vs miles per gallon")
# Change axis names
ggplot(aes(x=disp, y=mpg), data=mtcars) + geom_point() + ggtitle("displacement vs miles per gallon") + labs(x = "Displacement", y = "Miles per Gallon")
#make vs a factor
mtcars$vs <- as.factor(mtcars$vs)
# create boxplot of the distribution for v-shaped and straight Engine
ggplot(aes(x=vs, y=mpg), data = mtcars) + geom_boxplot()
# Add color to the boxplots to help differentiate
ggplot(aes(x=vs, y=mpg, fill = vs), data = mtcars) + geom_boxplot(alpha=0.3) +
theme(legend.position="none")
A Jupyter notebook is a browser-based application that allows you to create and share
documents that contain code, equations, visualizations, narrative text, links, and much
more. It can be likened to a scientist's lab notebook, where a scientist records all the
steps of their experiments along with the results so they can be reproduced in the
future.
When you run the code, it generates the outputs, including plots and tables, within the
notebook file. You can then export the notebook to a PDF or HTML file that can be
shared with anyone. Jupyter Notebooks originated as "IPython," which was originally
developed for the Python programming language. As it came to support additional
languages, it was renamed Jupyter, which stands for:
Julia
Python
R
Jupyter Lab extends the functionalities of Jupyter notebooks by enabling you to work
with multiple notebooks, text editors, terminals, and custom components in a flexible,
integrated, and extensible manner.
Jupyter Kernels
Jupyter Architecture
Jupyter also has an architecture for converting notebooks to other formats. It uses a
tool called nbconvert.
For example, to convert a notebook file into an HTML file, the process goes through
the following steps: the notebook is modified by a preprocessor, an exporter converts
the notebook to the new file format, and a postprocessor works on the file produced by
the exporter.
After conversion, when you request the URL of the HTML file, the notebook is fetched,
converted to HTML, and displayed to you as an HTML page.
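A small sketch of the same conversion driven from Python with nbconvert's exporter API (the file names are placeholders; the jupyter nbconvert command line does the equivalent):

from nbconvert import HTMLExporter

exporter = HTMLExporter()                                    # pick the target format
body, resources = exporter.from_filename("notebook.ipynb")   # preprocess + export

with open("notebook.html", "w", encoding="utf-8") as f:
    f.write(body)                                            # write the converted HTML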
Lab - Jupyter Notebook - The Basics
Use the keyboard shortcuts: [a] - Insert a Cell Above; [b] - Insert a Cell Below.
Execute the code by either clicking the Play button in the menu above the
notebook, or by pressing Shift+Enter.
You cannot create Markdown cells without first creating cells and converting
them from Code to Markdown. To render the Markdown text, make sure the cell
is selected (by clicking within it), and press Play in the menu, or Shift+Enter. To
edit your Markdown cell, double-click anywhere within the cell. Note you can
use the keyboard shortcut: [m] - Convert Cell to Markdown
Overview of Git/GitHub
Git and GitHub are popular environments among developers and data scientists
for performing version control of source code files and projects and for collaborating
with others.
A version control system allows you to keep track of changes to your documents.
This makes it easy for you to recover older versions of your document if you make a
mistake, and it makes collaboration with others much easier.
Version control systems are widely used for things involving code, but you can also
version control images, documents, and any number of file types.
There are a few basic terms that you will need to know before you can get started:
The SSH protocol is a method for secure remote login from one computer to
another.
A repository contains your project folders that are set up for version control.
A fork is a copy of a repository.
A pull request is the way you request that someone reviews and approves your
changes before they become final.
A working directory contains the files and subdirectories on your computer that
are associated with a Git repository.
There are a few basic Git commands that you will always use:
When starting out with a new repository, you only need to create it once: either
locally with "git init" and then push it to GitHub, or by cloning an existing
repository with "git clone".
"git add" moves changes from the working directory to the staging area.
"git status" allows you to see the state of your working directory and the staged
snapshot of your changes.
"git commit" takes your staged snapshot of changes and commits them to the
project.
"git reset" undoes changes that you’ve made to the files in your working
directory.
"git log" enables you to browse previous changes to a project.
"git branch" lets you create an isolated environment within your repository to
make changes.
"git checkout" lets you see and change existing branches.
"git merge" lets you put everything back together again.