
02-Tools For Data Science

This document provides an overview of key data science tools and concepts. It discusses data science task categories including data management, integration, visualization, and modeling. It also covers code/data asset management, development environments, and execution environments. Popular tools are mentioned for each category. The document also discusses popular programming languages for data science like Python, R, SQL, and others. Finally, it covers open and proprietary datasets as well as different types of dataset licenses.

Uploaded by

abdessalemdjoudi


Overview of Data Science Tools

The Data Science Task Categories include:

• Data Management - storage, management and retrieval of data

• Data Integration and Transformation - streamline data pipelines and automate data
processing tasks

• Data Visualization - provide graphical representation of data and assist with
communicating insights

• Modelling - enable building, deployment, monitoring and assessment of data and
machine learning models

Data Science tasks are supported by the following:

• Code Asset Management - store and manage code, track changes and allow collaborative
development

• Data Asset Management - organize and manage data, provide access control, and back up
assets

• Development Environments - develop, test and deploy code

• Execution Environments - provide computational resources and run the code

The data science ecosystem consists of many open source and commercial options. It
includes traditional desktop applications, server-based tools, and cloud-based
services that can be accessed through web browsers and mobile interfaces.

Data Management Tools: include Relational Databases, NoSQL Databases, and Big Data
platforms:

• MySQL and PostgreSQL are examples of Open Source Relational Database Management
Systems (RDBMS), and IBM Db2 and SQL Server are examples of commercial
RDBMSes, which are also available as Cloud services.
• MongoDB and Apache Cassandra are examples of NoSQL databases.
• Apache Hadoop and Apache Spark are used for Big Data analytics.

Data Integration and Transformation Tools: include Apache Airflow and Apache Kafka.

Data Visualization Tools: include commercial offerings such as Cognos Analytics, Tableau,
and Power BI, which can be used for building dynamic and interactive dashboards.

Code Asset Management Tools: Git is an essential code asset management tool. GitHub is a
popular web-based platform for storing and managing source code. Its features make it an
ideal tool for collaborative software development, including version control, issue tracking,
and project management.

Development Environments: Popular development environments for Data Science include
Jupyter Notebooks and RStudio.

• Jupyter Notebooks provides an interactive environment for creating and sharing code,
descriptive text, data visualizations, and other computational artifacts in a web-browser
based interface.
• RStudio is an integrated development environment (IDE) designed specifically for
working with the R programming language, which is a popular tool for statistical
computing and data analysis.

Languages of Data Science

• You should select a language to learn depending on your needs, the problems you are
trying to solve, and whom you are solving them for.
• The popular languages are Python, R, SQL, Scala, Java, C++, and Julia.
• For data science, you can use Python's scientific computing libraries like Pandas, NumPy,
SciPy, and Matplotlib.
• Python can also be used for Natural Language Processing (NLP) using the Natural
Language Toolkit (NLTK).
• Python is open source, and R is free software.
• R's array-oriented syntax makes it easier to translate from math to code for
learners with no or minimal programming background.
• SQL is different from other software development languages because it is a
non-procedural language.
• SQL was designed for managing data in relational databases.
• If you learn SQL and use it with one database, you can apply your SQL knowledge with
many other databases easily.
• Data science tools built with Java include Weka, Java-ML, Apache MLlib, and
Deeplearning4j.
• For data science, a popular program built with Scala is Apache Spark, which includes Shark,
MLlib, GraphX, and Spark Streaming.
• Programs built for Data Science with JavaScript include TensorFlow.js and R-js.
• One great application of Julia for Data Science is JuliaDB.
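
The point above about SQL being non-procedural can be illustrated with a minimal sketch. It uses Python's built-in sqlite3 module and a hypothetical `analysts` table with invented rows (neither comes from the original notes). The SQL query only describes the result wanted, while the Python loop below it spells out each step to reach the same result.

```python
import sqlite3

# Hypothetical rows, invented for illustration only.
rows = [("Ana", "Python", 4), ("Ben", "R", 2), ("Cy", "Python", 6), ("Di", "SQL", 3)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE analysts (name TEXT, language TEXT, years INTEGER)")
conn.executemany("INSERT INTO analysts VALUES (?, ?, ?)", rows)

# Non-procedural: state what result is wanted; the database decides how to compute it.
query = """
    SELECT language, COUNT(*) AS people, AVG(years) AS avg_years
    FROM analysts
    GROUP BY language
    ORDER BY language
"""
sql_result = conn.execute(query).fetchall()
conn.close()

# Procedural equivalent: every grouping and averaging step is written out by hand.
totals = {}
for _, language, years in rows:
    count, total = totals.get(language, (0, 0))
    totals[language] = (count + 1, total + years)
py_result = [(lang, count, total / count) for lang, (count, total) in sorted(totals.items())]

print(sql_result)
print(py_result)
```

Because the query is plain standard SQL, the same statement should carry over to other relational databases with little or no change, which is the portability the notes mention.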

Datasets
Open datasets and sources
In this data-driven world, some datasets are freely available for anyone to access, use,
modify, and share. These are called open datasets.
Open datasets include a public license and are very useful for your journey as a Data
Scientist. Some of the most informative open dataset sources are listed below.
Government Data:
• https://www.data.gov/
• https://www.census.gov/data.html
• https://data.gov.uk/
• https://www.opendatanetwork.com/
• https://data.un.org/
Financial Data Sources:
• https://data.worldbank.org/
• https://www.globalfinancialdata.com/
• https://comtrade.un.org/
• https://www.nber.org/
• https://fred.stlouisfed.org/
Crime Data:
• https://www.fbi.gov/services/cjis/ucr
• https://www.icpsr.umich.edu/icpsrweb/content/NACJD/index.html
• https://www.drugabuse.gov/related-topics/trends-statistics
• https://www.unodc.org/unodc/en/data-and-analysis/
Health Data:
• https://www.who.int/gho/database/en/
• https://www.fda.gov/Food/default.htm
• https://seer.cancer.gov/faststats/selections.php?series=cancer
• https://www.opensciencedatacloud.org/
• https://pds.nasa.gov/
• https://earthdata.nasa.gov/
• https://www.sgim.org/communities/research/dataset-compendium/public-datasets-topic-grid
Academic and Business Data:
• https://scholar.google.com/
• https://nces.ed.gov/
• https://www.glassdoor.com/research/
• https://www.yelp.com/dataset
Other General Data:
• https://www.kaggle.com/datasets
• https://www.reddit.com/r/datasets/
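
Many sources like those listed above offer downloads as CSV files. As a rough sketch of a first look at such a file, the snippet below counts rows, counts missing values, and averages one numeric column using only the Python standard library. The inline sample and its column names are hypothetical stand-ins for a downloaded file, and `summarize` is an illustrative helper, not part of any library.

```python
import csv
import io
import statistics

# Hypothetical sample standing in for a downloaded file; real column names will differ.
SAMPLE_CSV = """region,year,population
North,2020,1200
South,2020,950
North,2021,1260
South,2021,
"""

def summarize(csv_text, column):
    """Return (row_count, missing_count, mean) for one numeric column."""
    reader = csv.DictReader(io.StringIO(csv_text))
    values, missing, total = [], 0, 0
    for row in reader:
        total += 1
        cell = (row.get(column) or "").strip()
        if cell:
            values.append(float(cell))
        else:
            missing += 1  # empty cell: count it instead of guessing a value
    mean = statistics.mean(values) if values else None
    return total, missing, mean

row_count, missing_count, mean_population = summarize(SAMPLE_CSV, "population")
print(row_count, missing_count, mean_population)
```

For a real file, `io.StringIO(csv_text)` would be replaced by an opened file handle. Libraries such as Pandas offer the same kind of summary with less code.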

Proprietary datasets and sources

Proprietary datasets contain data primarily owned and controlled by specific individuals
or organizations. This data is limited in distribution because it is sold with a licensing
agreement.
Unlike public data, some data from private sources cannot be easily disclosed.
National security data and geological, geophysical, and biological data are examples of
proprietary data. Copyright laws or patents usually bind this type of data. Proprietary
datasets that mainly contain sensitive information are less widely available than open
datasets.

Some standard proprietary dataset sources are listed below.

Health Care:
https://www.sgim.org/communities/research/dataset-compendium/proprietary-datasets
Financial Market data:
https://datarade.ai/data-categories/proprietary-market-data
Google Cloud based datasets:
https://cloud.google.com/datasets

Dataset licenses
When you select a dataset, it is necessary to look into the license. A license explains
whether you can use that dataset or not; or explains if you have to accept certain
guidelines to use that dataset. The different license types are listed below.

1. PUBLIC DOMAIN MARK - PUBLIC DOMAIN

When a dataset has a Public Domain license, all the rights to use, access, modify and
share the dataset are open to everyone. Here there is technically no license.
2. OPEN DATA COMMONS PUBLIC DOMAIN DEDICATION AND LICENSE – PDDL
The Open Data Commons PDDL has the same features as the Public Domain license; the
difference is that the PDDL uses a licensing mechanism to give the rights to the
dataset.
3. CREATIVE COMMONS ATTRIBUTION 4.0 INTERNATIONAL - CC-BY
This license allows users to share and modify a dataset, but only if they give credit to the
creator(s) of the dataset.
4. COMMUNITY DATA LICENSE AGREEMENT – CDLA PERMISSIVE-2.0
Like most open-source licenses, this license allows users to use, modify, adapt, and share
the dataset, but only if a disclaimer of warranties and liability is also included.
5. OPEN DATA COMMONS ATTRIBUTION LICENSE - ODC-BY
This license allows users to share and adapt a dataset, but only if they give credit to the
creator(s) of the dataset.
6. CREATIVE COMMONS ATTRIBUTION-SHAREALIKE 4.0 INTERNATIONAL - CC-BY-SA
This license allows users to use, share, and adapt a dataset, but only if they give credit to
the dataset and show any changes or transformations they made to the dataset. Users
might not want to use this license because they have to share the work they did on the
dataset.
7. COMMUNITY DATA LICENSE AGREEMENT – CDLA-SHARING-1.0
This license uses the principle of ‘copyleft’: users can use, modify, and adapt a dataset,
but only if they don’t add license restrictions on the new work(s) they create with the
dataset.
8. OPEN DATA COMMONS OPEN DATABASE LICENSE - ODC-ODBL
This license allows users to use, share, and adapt a dataset but only if they give credit to
the dataset and show any changes or transformations they make to the dataset. Users
might not want to use this license because they have to share the work they did on the
dataset.
9. CREATIVE COMMONS ATTRIBUTION-NONCOMMERCIAL 4.0 INTERNATIONAL - CC BY-NC
This license is a restrictive license. Users can share and adapt a dataset, provided they give
credit to its creator(s) and ensure that the dataset is not used for any commercial
purpose.
10. CREATIVE COMMONS ATTRIBUTION-NO DERIVATIVES 4.0 INTERNATIONAL - CC BY-ND
This license is also a restrictive license. Users can share a dataset if they give credit to its
creator(s). This license does not allow additions, transformations, or changes to the
dataset.
11. CREATIVE COMMONS ATTRIBUTION-NONCOMMERCIAL-SHAREALIKE 4.0 INTERNATIONAL - CC BY-NC-SA
This license allows users to share a dataset only if they give credit to its creator(s). Users
can share additions, transformations, or changes to the dataset under the same license, but
they cannot use the dataset for commercial purposes.
12. CREATIVE COMMONS ATTRIBUTION-NONCOMMERCIAL-NODERIVATIVES 4.0 INTERNATIONAL - CC BY-NC-ND
This license allows users to share a dataset only if they give credit to its creator(s). Users
are not allowed to modify the dataset and are not allowed to use it for commercial
purposes.
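
The license descriptions above come down to a few yes/no conditions: whether credit is required, whether derived work must be shared under the same terms, whether commercial use is allowed, and whether modification is allowed. The sketch below encodes a subset of the listed licenses that way. The `LicenseTerms` class, the `allowed` and `obligations` helpers, and the flag values are an illustrative simplification of the descriptions in these notes, so the actual license text should be consulted before relying on them.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class LicenseTerms:
    attribution: bool     # must credit the creator(s)
    share_alike: bool     # derived work must be shared under the same terms
    commercial_use: bool  # commercial use is permitted
    derivatives: bool     # modifying the dataset is permitted

# Simplified reading of the list above (assumption: the CDLA licenses are left out).
LICENSES = {
    "PDDL":        LicenseTerms(False, False, True, True),
    "CC-BY":       LicenseTerms(True, False, True, True),
    "ODC-BY":      LicenseTerms(True, False, True, True),
    "CC-BY-SA":    LicenseTerms(True, True, True, True),
    "ODC-ODbL":    LicenseTerms(True, True, True, True),
    "CC-BY-NC":    LicenseTerms(True, False, False, True),
    "CC-BY-ND":    LicenseTerms(True, False, True, False),
    "CC-BY-NC-SA": LicenseTerms(True, True, False, True),
    "CC-BY-NC-ND": LicenseTerms(True, False, False, False),
}

def allowed(license_id, *, commercial, modify):
    """Return True if the planned use fits the encoded terms."""
    terms = LICENSES[license_id]
    if commercial and not terms.commercial_use:
        return False
    if modify and not terms.derivatives:
        return False
    return True

def obligations(license_id):
    """List what a user must do when using a dataset under this license."""
    terms = LICENSES[license_id]
    duties = []
    if terms.attribution:
        duties.append("give credit to the creator(s)")
    if terms.share_alike:
        duties.append("share derived work under the same license")
    return duties

print(allowed("CC-BY-NC", commercial=True, modify=False))
print(obligations("CC-BY-SA"))
```

A table like this makes it easy to filter candidate datasets by intended use before reading each license in full.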

• Libraries usually contain built-in modules that provide different functionalities.

• You can use data visualization methods to communicate with others and display meaningful
results of an analysis.

• For machine learning, the Scikit-learn library contains tools for statistical modeling, including
regression, classification, clustering, and so on.

• Large-scale production of deep-learning models uses TensorFlow, a low-level framework.

• Apache Spark is a general-purpose cluster-computing framework that allows you to process
data using compute clusters.

• An application programming interface (API) allows communication between two pieces of
software.

• The API is the part of the library you see, while the library contains all the components of the
program.

• REST APIs allow you to communicate through the internet and take advantage of resources
like storage, data, artificially intelligent algorithms, and much more.

• Open data is fundamental to Data Science.

• The Community Data License Agreement makes it easier to share open data.

• The IBM Data Asset eXchange (DAX) site contains high-quality open data sets.

• DAX open data sets include tutorial notebooks that provide basic and advanced
walk-throughs for developers.

• DAX notebooks open in Watson Studio.

• Machine learning (ML) uses algorithms – also known as “models” – to identify patterns in the
data.

• Types of ML are Supervised, Unsupervised, and Reinforcement.

• Supervised learning comprises two types of models, regression and classification.

• Deep learning refers to a general set of models and techniques that loosely emulate the way
the human brain solves a wide range of problems.

• The Model Asset eXchange (MAX) is a free, open-source repository for ready-to-use and
customizable deep-learning microservices.

• MAX model-serving microservices are built and distributed on GitHub as open-source
Docker images.

• You can use Red Hat OpenShift, a Kubernetes platform, to automate deployment, scaling, and
management of microservices.

• The site ml-exchange.org has multiple predefined models.
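
The two kinds of supervised models named in the summary, regression and classification, differ in what they predict: a number versus a category. The sketch below shows one simple instance of each in plain Python: a least-squares line for regression and a nearest-centroid rule for classification. The toy data and function names are invented for illustration, and this is not how Scikit-learn is implemented; it only mirrors the idea of learning from labeled examples.

```python
# Hypothetical toy data, invented for illustration.
hours = [1.0, 2.0, 3.0, 4.0, 5.0]
scores = [52.0, 54.0, 56.0, 58.0, 60.0]  # numeric target: a regression problem
passed = [0, 0, 0, 1, 1]                  # categorical target: a classification problem

def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def fit_centroids(xs, labels):
    """Average the feature values of each class."""
    groups = {}
    for x, label in zip(xs, labels):
        groups.setdefault(label, []).append(x)
    return {label: sum(vals) / len(vals) for label, vals in groups.items()}

def classify(x, centroids):
    """Predict the class whose centroid is closest to x."""
    return min(centroids, key=lambda label: abs(x - centroids[label]))

# Regression: learn a numeric relationship, then predict a number.
slope, intercept = fit_line(hours, scores)
predicted_score = slope * 6.0 + intercept

# Classification: learn class centroids, then predict a label.
centroids = fit_centroids(hours, passed)
predicted_label = classify(4.8, centroids)

print(predicted_score, predicted_label)
```

In both cases the model is fitted on examples with known answers, which is what makes the learning supervised.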
