Tools For Data Science
In this module, you will learn about the different types and categories of tools that data scientists use
and popular examples of each. You will also become familiar with Open Source, Cloud-based, and
Commercial options for data science tools.
Learning Objectives
Describe the components of a Data Scientist's toolkit and list various tool categories.
List examples of Open Source, Commercial, and Cloud-based tools in various categories.
This module introduces the criteria for determining which language you should learn. You will
learn about the benefits of Python, R, SQL, and other common languages such as Java, Scala, C++,
JavaScript, and Julia, and explore how these languages are used in Data Science. You will also
review sites that offer more information about the languages.
Learning Objectives
Identify the criteria and roles for determining the language to learn.
This module will give you in-depth knowledge of the different libraries, APIs, dataset sources, and
models used by data scientists.
Learning Objectives
List examples of the various libraries: scientific, visualization, machine learning, and deep
learning.
List the tasks that a data scientist needs to perform to build a model.
This module introduces the Jupyter Notebook and JupyterLab. You will learn how to work with different
kernels and the basic Jupyter architecture. In addition, you will identify the tools in an Anaconda Jupyter
environment. Finally, the module provides an overview of cloud-based Jupyter environments and their
data science features.
Learning Objectives
This module starts with an introduction to R and RStudio and ends with GitHub usage. You will
learn about the different R visualization packages and how to create visual charts using the plot function.
Later in the module, you will develop the essential conceptual and hands-on skills to work with Git
and GitHub. You will start with an overview of Git and GitHub, then create a GitHub account and a project
repository, add files, and commit your changes using the web interface. Next, you will become
familiar with Git workflows involving branches, pull requests (PRs), and merges. You will also complete a
project at the end to apply and demonstrate your newly acquired skills.
Learning Objectives
Explain version control and describe the Git and GitHub environment.
Describe the purpose of source repositories and explain how GitHub satisfies the needs of a
source repository.
In this module, you will work on a final project to demonstrate some of the skills learned in the course.
You will also be tested on your knowledge of the various components and tools in a Data Scientist's
toolkit covered in the previous modules.
Learning Objectives
This is an optional module for those interested in learning about and working with data science tools
from IBM such as Watson Studio.
Learning Objectives
Find common resources in Watson Studio and IBM Cloud Pak for Data.
Use different types of Jupyter Notebook templates and kernels on IBM Watson Studio.
Describe how to connect a Watson Studio account and publish a notebook in GitHub.
Module 1 Summary
Congratulations! You have completed this module. At this point in the course, you know:
o Data Integration and Transformation - streamline data pipelines and automate data processing tasks
o Code Asset Management - store and manage code, track changes, and enable collaborative development
o Data Asset Management - organize and manage data, provide access control, and back up assets
The data science ecosystem consists of many open source and commercial options and includes
traditional desktop applications and server-based tools, as well as cloud-based services that can be
accessed through web browsers and mobile interfaces.
Data Management Tools: include Relational Databases, NoSQL Databases, and Big Data platforms.
MySQL and PostgreSQL are examples of Open Source Relational Database Management Systems
(RDBMS), while IBM Db2 and SQL Server are examples of commercial RDBMSes that are also
available as Cloud services.
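As a concrete illustration, here is a minimal sketch of querying a relational database from Python
using the standard-library sqlite3 module; the orders table and its rows are invented for the example:

    import sqlite3

    # Create an in-memory database and a small hypothetical table.
    conn = sqlite3.connect(":memory:")
    cur = conn.cursor()
    cur.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, region TEXT, amount REAL)")
    cur.executemany("INSERT INTO orders (region, amount) VALUES (?, ?)",
                    [("EMEA", 120.0), ("APAC", 75.5), ("EMEA", 42.0)])

    # Aggregate with SQL, the common language across RDBMSes.
    cur.execute("SELECT region, SUM(amount) FROM orders GROUP BY region")
    print(cur.fetchall())  # e.g. [('APAC', 75.5), ('EMEA', 162.0)]
    conn.close()

The same SQL would run, with minor dialect differences, on MySQL, PostgreSQL, Db2, or SQL Server.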
Apache Hadoop and Apache Spark are used for Big Data analytics.
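For a taste of the kind of distributed analysis Spark supports, here is a minimal PySpark sketch;
it assumes pyspark is installed, and "sales.csv" with its region column is a made-up input:

    from pyspark.sql import SparkSession

    # Start (or reuse) a local Spark session.
    spark = SparkSession.builder.appName("example").getOrCreate()

    # Read a hypothetical CSV file and count rows per region.
    df = spark.read.csv("sales.csv", header=True, inferSchema=True)
    df.groupBy("region").count().show()

    spark.stop()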
Data Integration and Transformation Tools: include Apache Airflow and Apache Kafka.
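As an illustration, here is a hedged sketch of a minimal Apache Airflow pipeline definition (a DAG);
it assumes Airflow 2.4 or later (older versions use schedule_interval instead of schedule), and the
DAG and task names are invented:

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.python import PythonOperator

    def extract():
        # Placeholder for a real extraction step in a data pipeline.
        print("extracting data...")

    # A one-task DAG that runs only when triggered manually.
    with DAG(dag_id="example_pipeline",
             start_date=datetime(2024, 1, 1),
             schedule=None,
             catchup=False):
        PythonOperator(task_id="extract", python_callable=extract)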
Data Visualization Tools: include commercial offerings such as Cognos Analytics, Tableau, and Power BI,
which can be used to build dynamic and interactive dashboards.
Code Asset Management Tools: Git is an essential code asset management tool, and GitHub is a popular
web-based platform for storing and managing source code. Its features, including version control, issue
tracking, and project management, make it an ideal tool for collaborative software development.
Development Environments: Popular development environments for Data Science include Jupyter
Notebooks and RStudio.
Jupyter Notebooks provide an interactive environment for creating and sharing code,
descriptive text, data visualizations, and other computational artifacts in a web browser-based
interface.
RStudio is an integrated development environment (IDE) designed specifically for working with
the R programming language, which is a popular tool for statistical computing and data analysis.
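The following is the kind of cell you might run in a Jupyter Notebook with a Python kernel; it
assumes pandas and matplotlib are installed, and the numbers are invented:

    import pandas as pd
    import matplotlib.pyplot as plt

    # Build a tiny hypothetical dataset and chart it inline.
    df = pd.DataFrame({"year": [2021, 2022, 2023], "users": [120, 340, 560]})
    df.plot(x="year", y="users", kind="bar", title="Users per year")
    plt.show()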
WEEK 1.1
Code asset management provides a unified view in which you manage an inventory of assets. When you
develop a model, you may need to update it, fix bugs, or improve its features incrementally. All of this
requires version control, which developers use to track and manage changes to a software project's code.
When working on a model, you use a centralized repository where everyone can upload, edit, and manage
the code files simultaneously. Collaboration allows multiple people to share and update the same project
together. GitHub is a good example of a code asset management platform: it is web-based and provides
sharing, collaboration, and access control features.

As a data scientist, you want to store and properly organize all your images, videos, text, and other data
in a central location, and you also want to control who can access, edit, and manage your data. Data asset
management, also called digital asset management (DAM), is the organization and management of
important data collected from different sources. DAM takes place on a DAM platform, which enables
version control and collaboration. DAM platforms also support replication, backup, and access-rights
management for the stored data.

Development environments, also called integrated development environments or "IDEs", provide a
workspace and tools for developing, implementing, running, testing, and deploying source code. IDEs
such as IBM Watson Studio offer testing and simulation tools that emulate the real world, so you can see
how your code will behave after you deploy it. An execution environment has libraries for compiling
source code and system resources that run and verify the code. Cloud-based execution environments are
not tied to any specific hardware or software and offer tools such as IBM Watson Studio for data
preprocessing, model training, and deployment. Finally, fully integrated visual tools such as IBM Watson
Studio and IBM Cognos Dashboard Embedded cover all of the preceding tool components and can be
used to develop deep learning and machine learning models.

In this video, you learned that the categories of data science tasks are: data management, data
integration and transformation, data visualization, model building, model deployment, and model
monitoring and evaluation. Data science tasks are supported by data asset management, code asset
management, execution environments, and development environments.