
TOOLS FOR DATA SCIENCE

Module 1: Overview of Data Science Tools

In this module, you will learn about the different types and categories of tools that data scientists use
and popular examples of each. You will also become familiar with Open Source, Cloud-based, and
Commercial options for data science tools.

Learning Objectives

 Describe the components of a Data Scientist's toolkit and list various tool categories

 List examples of Open Source, Commercial, and Cloud-based tools in various categories

Module 2: Languages of Data Science

This module introduces the criteria for determining which language you should learn. You will learn
the benefits of Python, R, SQL, and other common languages such as Java, Scala, C++, JavaScript,
and Julia, and explore how you can use these languages in Data Science. You will also look at some
sites that offer more information about the languages.

Learning Objectives

 Identify the criteria and roles for determining the language to learn.

 Identify the users and benefits of Python.

 Identify the users and uses of the R language.

 Define SQL elements and list their benefits.

 Review languages such as Java, Scala, C++, JavaScript, and Julia.

 List the global communities for connecting with other users.

Module 3: Packages, APIs, Data Sets and Models

This module will give you in-depth knowledge of the different libraries, APIs, dataset sources and models
used by data scientists.

Learning Objectives

 List examples of the various libraries: scientific, visualization, machine learning, and deep
learning.

 Define REST APIs and describe how they issue requests and receive responses (see the sketch after this list).

 Describe data sets and sources of data.

 Explore open data sets on the Data Asset eXchange.

 Describe how to use a learning model to solve a problem.

 List the tasks that a data scientist needs to perform to build a model.

 Explore ML models in the Model Asset eXchange.
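
As a minimal illustration of the REST request/response pattern covered in this module, here is a sketch in Python using the requests library; the URL, query parameter, and response structure are hypothetical placeholders, not part of the course material.

    # Minimal REST API sketch using the requests library.
    # The URL and response structure are hypothetical placeholders.
    import requests

    # Send a GET request to a (hypothetical) REST endpoint.
    response = requests.get("https://api.example.com/v1/datasets", params={"limit": 5})

    # The server replies with a status code and, typically, a JSON body.
    if response.status_code == 200:
        for item in response.json():
            print(item)
    else:
        print("Request failed with status", response.status_code)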


Module 4: Jupyter Notebooks and JupyterLab

This module introduces the Jupyter Notebook and JupyterLab. You will learn how to work with different
kernels and the basic Jupyter architecture. In addition, you will identify the tools in an Anaconda Jupyter
environment. Finally, the module overviews cloud-based Jupyter environments and their data science
features.

Learning Objectives

 Describe how to use the notebooks in JupyterLab.

 Describe how to work in a notebook session.

 Describe the basic Jupyter architecture.

 Describe how to work with kernels.

 Identify tools in Anaconda Jupyter environments.

 Describe cloud-based Jupyter environments and their data science features.

Module 5: RStudio and GitHub

This module starts with an introduction to R and RStudio and ends with GitHub usage. You will learn
about the different R visualization packages and how to create visual charts using the plot function.

Further in the module, you will develop the essential conceptual and hands-on skills to work with Git
and GitHub. You will start with an overview of Git and GitHub, creating a GitHub account and a project
repository, adding files, and committing your changes using the web interface. Next, you will become
familiar with Git workflows involving branches, pull requests (PRs), and merges. You will also complete a
project at the end to apply and demonstrate your newly acquired skills.

Learning Objectives

 Describe R capabilities and the RStudio environment.

 Use the built-in R plot function.

 Explain version control and describe the Git and GitHub environment.

 Describe the purpose of source repositories and explain how GitHub satisfies the needs of a
source repository.

 Create a GitHub account and a project repository.

 Demonstrate how to edit and upload files in GitHub.

 Explain the purpose of branches and how to merge changes.

Module 6: Final Project and Assessment

In this module, you will work on a final project to demonstrate some of the skills learned in the course.
You will also be tested on your knowledge of various components and tools in a Data Scientist's toolkit
learned in the previous modules.

Learning Objectives

 Create a Jupyter Notebook with markdown and code cells (see the sketch after this list)

 List examples of languages, libraries and tools used in Data Science

 Share your Jupyter Notebook publicly on GitHub

 Evaluate notebooks submitted by your peers using the provided rubric

 Demonstrate proficiency in Data Science toolkit knowledge
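
As a minimal illustration of building a notebook with markdown and code cells programmatically, here is a sketch in Python using the nbformat library; the cell contents and file name are hypothetical examples.

    # Build a minimal Jupyter Notebook with one markdown cell and one code cell.
    # The cell contents and file name are hypothetical examples.
    import nbformat

    nb = nbformat.v4.new_notebook()
    nb.cells = [
        nbformat.v4.new_markdown_cell("# My Data Science Notebook"),
        nbformat.v4.new_code_cell("print('Hello, Data Science!')"),
    ]

    # Write the notebook to disk; it can then be opened in JupyterLab
    # or shared publicly on GitHub.
    nbformat.write(nb, "my_notebook.ipynb")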

Module 7: IBM Watson Studio

This is an optional module for learners interested in working with data science tools from IBM, such
as Watson Studio.

Learning Objectives

 Find common resources in Watson Studio and IBM Cloud Pak for Data.

 Create an IBM Cloud account, service, and project in Watson Studio.

 Create and share a Jupyter Notebook.

 Use different types of Jupyter Notebook templates and kernels on IBM Watson Studio.

 Describe how to connect a Watson Studio account and publish a notebook in GitHub.

Module 1 Summary

Congratulations! You have completed this module. At this point in the course, you know:

 The Data Science Task Categories include:

o Data Management - storage, management and retrieval of data

o Data Integration and Transformation - streamline data pipelines and automate data
processing tasks

o Data Visualization - provide graphical representation of data and assist with
communicating insights

o Modelling - enable Building, Deployment, Monitoring and Assessment of Data and
Machine Learning models

 Data Science Tasks support the following:

o Code Asset Management - store & manage code, track changes and allow collaborative
development

o Data Asset Management - organize and manage data, provide access control, and
backup assets

o Development Environments - develop, test and deploy code

o Execution Environments - provide computational resources and run the code

The data science ecosystem consists of many open source and commercial options, including both
traditional desktop applications and server-based tools, as well as cloud-based services that can be
accessed using web browsers and mobile interfaces.

Data Management Tools: include Relational Databases, NoSQL Databases, and Big Data platforms:

 MySQL and PostgreSQL are examples of Open Source Relational Database Management Systems
(RDBMS), while IBM Db2 and SQL Server are examples of commercial RDBMSes that are also
available as Cloud services (see the Python sketch after this list).

 MongoDB and Apache Cassandra are examples of NoSQL databases.

 Apache Hadoop and Apache Spark are used for Big Data analytics.
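
As a minimal illustration of querying a relational database from Python, here is a sketch using the built-in sqlite3 module as a lightweight stand-in for the RDBMSes named above; the table and rows are hypothetical examples.

    # Minimal relational database sketch using Python's built-in sqlite3 module.
    # sqlite3 stands in for a full RDBMS; the table and rows are hypothetical.
    import sqlite3

    conn = sqlite3.connect(":memory:")  # throwaway in-memory database for the demo
    cur = conn.cursor()

    # Create a table and insert a few rows.
    cur.execute("CREATE TABLE tools (name TEXT, category TEXT)")
    cur.executemany(
        "INSERT INTO tools VALUES (?, ?)",
        [("MySQL", "RDBMS"), ("MongoDB", "NoSQL"), ("Apache Spark", "Big Data")],
    )

    # Query the table with standard SQL.
    for name, category in cur.execute("SELECT name, category FROM tools ORDER BY name"):
        print(name, "-", category)

    conn.close()

The same SQL would run largely unchanged against MySQL, PostgreSQL, Db2, or SQL Server; only the connection library would differ.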

Data Integration and Transformation Tools: include Apache Airflow and Apache Kafka.
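
As a minimal illustration of how a data pipeline is expressed in Apache Airflow, here is a sketch in Python; it assumes Airflow 2.x is installed, and the DAG ID, task names, and callables are hypothetical examples.

    # Minimal Apache Airflow DAG sketch (assumes Airflow 2.x is installed).
    # The DAG ID, task names, and callables are hypothetical examples.
    from datetime import datetime

    from airflow import DAG
    from airflow.operators.python import PythonOperator

    def extract():
        print("extracting data...")

    def transform():
        print("transforming data...")

    with DAG(
        dag_id="example_pipeline",
        start_date=datetime(2024, 1, 1),
        schedule_interval="@daily",  # run once per day
        catchup=False,
    ) as dag:
        extract_task = PythonOperator(task_id="extract", python_callable=extract)
        transform_task = PythonOperator(task_id="transform", python_callable=transform)

        extract_task >> transform_task  # extract runs before transform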

Data Visualization Tools: include commercial offerings such as Cognos Analytics, Tableau, and Power BI,
which can be used for building dynamic and interactive dashboards.

Code Asset Management Tools: Git is an essential code asset management tool, and GitHub is a popular
web-based platform for storing and managing source code. Its features, including version control, issue
tracking, and project management, make it an ideal tool for collaborative software development.

Development Environments: Popular development environments for Data Science include Jupyter
Notebooks and RStudio.

 Jupyter Notebooks provide an interactive environment for creating and sharing code,
descriptive text, data visualizations, and other computational artifacts in a web browser-based
interface.

 RStudio is an integrated development environment (IDE) designed specifically for working with
the R programming language, which is a popular tool for statistical computing and data analysis.
WEEK 1.1

Code asset management provides a unified view in which you manage an inventory of assets. When you
develop a model, you may need to update it, fix bugs, or improve the code's features incrementally. All
of this requires version control. Developers use version control to track and manage changes to a
software project's code. When working on a model, you set up a centralized repository where everyone
can upload, edit, and manage the code files simultaneously. Collaboration lets multiple people share
and update the same project together. GitHub is a good example of a code asset management platform:
it is web-based and provides sharing, collaboration, and access control features.

As a data scientist, you want to store and properly organize all your images, videos, text, and other data
in a central location. You also want to control who can access, edit, and manage your data. Data asset
management, also called digital asset management (DAM), is the organization and management of
important data collected from different sources. DAM is carried out on a DAM platform that enables
version control and collaboration. DAM platforms also support replication, backup, and access rights
management for the stored data.

Development environments, also called integrated development environments or "IDEs," provide a
workspace and tools for developing, implementing, running, testing, and deploying source code. IDEs
such as IBM Watson Studio provide testing and simulation tools that emulate the real world, so you can
see how your code will behave after you deploy it. An execution environment has libraries for compiling
the source code and system resources that run and verify the code. Cloud-based execution environments
are not tied to any specific hardware or software and offer tools such as IBM Watson Studio for data
preprocessing, model training, and deployment. Finally, fully integrated visual tools such as IBM Watson
Studio and IBM Cognos Dashboard Embedded cover all of the preceding tool components and can be
used to develop deep learning and machine learning models.

In this video, you learned that the data science task categories are: data management, data integration
and transformation, data visualization, model building, model deployment, and model monitoring and
evaluation. Data science tasks are supported by data asset management, code asset management,
execution environments, and development environments.
