
02-Tools For Data Science

This document provides an overview of key data science tools and concepts. It discusses data science task categories including data management, integration, visualization, and modeling. It also covers code/data asset management, development environments, and execution environments. Popular tools are mentioned for each category. The document also discusses popular programming languages for data science like Python, R, SQL, and others. Finally, it covers open and proprietary datasets as well as different types of dataset licenses.

Uploaded by

abdessalemdjoudi
Copyright
© All Rights Reserved

Overview of Data Science Tools

The data science task categories include:

• Data Management - storage, management, and retrieval of data

• Data Integration and Transformation - streamline data pipelines and automate data
processing tasks

• Data Visualization - provide graphical representations of data and help communicate
insights

• Modelling - enable building, deployment, monitoring, and assessment of data and
machine learning models

Data science tasks are supported by the following:

• Code Asset Management - store and manage code, track changes, and allow collaborative
development

• Data Asset Management - organize and manage data, provide access control, and back up
assets

• Development Environments - develop, test, and deploy code

• Execution Environments - provide computational resources and run the code

The data science ecosystem consists of many open source and commercial options. It
includes traditional desktop applications and server-based tools, as well as cloud-based
services that can be accessed through web browsers and mobile interfaces.

Data Management Tools: include Relational Databases, NoSQL Databases, and Big Data
platforms:

• MySQL and PostgreSQL are examples of open source Relational Database Management
Systems (RDBMS). IBM Db2 and SQL Server are examples of commercial RDBMSes,
which are also available as cloud services.
• MongoDB and Apache Cassandra are examples of NoSQL databases.
• Apache Hadoop and Apache Spark are used for Big Data analytics.

Data Integration and Transformation Tools: include Apache Airflow and Apache Kafka.
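
Data integration and transformation tools automate pipelines that typically extract data, transform it, and load it into a store. The following is a minimal sketch of that shape in plain Python. It does not use the Airflow or Kafka APIs, and the sample data, function names, and the `warehouse` list are all hypothetical stand-ins chosen for illustration.

```python
import csv
import io

# Hypothetical raw input; a real pipeline would read from a file,
# a database, or a message queue instead of an inline string.
RAW_CSV = """city,temp_f
Paris,68
Cairo,95
Oslo,
"""


def extract(text):
    """Parse CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Drop rows with a missing reading and convert Fahrenheit to Celsius."""
    cleaned = []
    for row in rows:
        if not row["temp_f"]:
            continue  # skip incomplete records
        celsius = (float(row["temp_f"]) - 32) * 5 / 9
        cleaned.append({"city": row["city"], "temp_c": round(celsius, 1)})
    return cleaned


def load(rows, target):
    """Append transformed rows to a list that stands in for a data store."""
    target.extend(rows)
    return len(rows)


warehouse = []
loaded = load(transform(extract(RAW_CSV)), warehouse)
print(loaded, warehouse)
```

A workflow tool would schedule and monitor steps like these rather than run them in a single script; the sketch only shows how the stages chain together.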

Data Visualization Tools: include commercial offerings such as Cognos Analytics, Tableau,
and Power BI, which can be used for building dynamic and interactive dashboards.

Code Asset Management Tools: Git is an essential code asset management tool. GitHub is a
popular web-based platform for storing and managing source code. Its features, including
version control, issue tracking, and project management, make it an ideal tool for
collaborative software development.

Development Environments: Popular development environments for Data Science include
Jupyter Notebooks and RStudio.

• Jupyter Notebooks provide an interactive environment for creating and sharing code,
descriptive text, data visualizations, and other computational artifacts in a
web-browser-based interface.
• RStudio is an integrated development environment (IDE) designed specifically for
working with the R programming language, a popular tool for statistical
computing and data analysis.

Languages of Data Science

• You should select a language to learn depending on your needs, the problems you are
trying to solve, and whom you are solving them for.
• The popular languages are Python, R, SQL, Scala, Java, C++, and Julia.
• For data science, you can use Python's scientific computing libraries like Pandas, NumPy,
SciPy, and Matplotlib.
• Python can also be used for Natural Language Processing (NLP) using the Natural
Language Toolkit (NLTK).
• Python is open source, and R is free software.
• R's array-oriented syntax makes it easier to translate from math to code for
learners with little or no programming background.
• SQL is different from other software development languages because it is a
non-procedural language.
• SQL was designed for managing data in relational databases.
• If you learn SQL and use it with one database, you can easily apply your SQL knowledge
to many other databases.
• Data science tools built with Java include Weka, Java-ML, Apache MLlib, and
Deeplearning4j.
• For data science, a popular program built with Scala is Apache Spark, which includes
Shark, MLlib, GraphX, and Spark Streaming.
• Programs built for Data Science with JavaScript include TensorFlow.js and R-js.
• One great application of Julia for Data Science is JuliaDB.
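
To illustrate what "non-procedural" means for SQL, here is a small sketch using Python's built-in `sqlite3` module. The query states which result is wanted (the average temperature per city, highest first) and leaves the steps for computing it to the database engine. The `measurements` table and its values are made up for this example.

```python
import sqlite3

# In-memory database, so the example leaves no files behind.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE measurements (city TEXT, temp_c REAL)")
conn.executemany(
    "INSERT INTO measurements VALUES (?, ?)",
    [("Paris", 20.0), ("Cairo", 35.0), ("Paris", 24.0), ("Oslo", 11.0)],
)

# The query describes the desired result; it contains no loops or
# explicit instructions for how to group or sort the rows.
query = """
    SELECT city, AVG(temp_c) AS avg_temp
    FROM measurements
    GROUP BY city
    ORDER BY avg_temp DESC
"""
rows = conn.execute(query).fetchall()
conn.close()

for city, avg_temp in rows:
    print(city, avg_temp)
```

Because the query uses only standard SQL constructs, the same statement should run with little or no change on other relational databases, which is the portability the list above refers to.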

Datasets
Open datasets and sources
In this data-driven world, some datasets are freely available for anyone to access, use,
modify, and share. These are called open datasets.
Open datasets include a public license and are very useful for your journey as a Data
Scientist. Some of the most informative open dataset sources are listed below.
Government Data:
• https://www.data.gov/
• https://www.census.gov/data.html
• https://data.gov.uk/
• https://www.opendatanetwork.com/
• https://data.un.org/
Financial Data Sources:
• https://data.worldbank.org/
• https://www.globalfinancialdata.com/
• https://comtrade.un.org/
• https://www.nber.org/
• https://fred.stlouisfed.org/
Crime Data:
• https://www.fbi.gov/services/cjis/ucr
• https://www.icpsr.umich.edu/icpsrweb/content/NACJD/index.html
• https://www.drugabuse.gov/related-topics/trends-statistics
• https://www.unodc.org/unodc/en/data-and-analysis/
Health Data:
• https://www.who.int/gho/database/en/
• https://www.fda.gov/Food/default.htm
• https://seer.cancer.gov/faststats/selections.php?series=cancer
• https://www.opensciencedatacloud.org/
• https://pds.nasa.gov/
• https://earthdata.nasa.gov/
• https://www.sgim.org/communities/research/dataset-compendium/public-datasets-topic-grid
Academic and Business Data:
• https://scholar.google.com/
• https://nces.ed.gov/
• https://www.glassdoor.com/research/
• https://www.yelp.com/dataset
Other General Data:
• https://www.kaggle.com/datasets
• https://www.reddit.com/r/datasets/

Proprietary datasets and sources

Proprietary datasets contain data primarily owned and controlled by specific individuals
or organizations. This data is limited in distribution because it is sold with a licensing
agreement.
Unlike public data, some data from private sources cannot be easily disclosed.
National security data and geological, geophysical, and biological data are examples of
proprietary data. Copyright laws or patents usually bind this type of data. Proprietary
datasets that mainly contain sensitive information are less widely available than open
datasets.

Some standard proprietary dataset sources are listed below.

Health Care:
https://www.sgim.org/communities/research/dataset-compendium/proprietary-datasets
Financial Market data:
https://datarade.ai/data-categories/proprietary-market-data
Google Cloud based datasets:
https://cloud.google.com/datasets

Dataset licenses
When you select a dataset, it is necessary to look into the license. A license explains
whether you can use that dataset and which guidelines you have to accept in order
to use it. The different license types are listed below.

1. PUBLIC DOMAIN MARK - PUBLIC DOMAIN
When a dataset has a Public Domain license, all the rights to use, access, modify, and
share the dataset are open to everyone. Technically, there is no license here.
2. OPEN DATA COMMONS PUBLIC DOMAIN DEDICATION AND LICENSE - PDDL
The Open Data Commons license has the same features as the Public Domain license; the
difference is that the PDDL license uses a licensing mechanism to give the rights to the
dataset.
3. CREATIVE COMMONS ATTRIBUTION 4.0 INTERNATIONAL - CC-BY
This license allows users to share and modify a dataset, but only if they give credit to the
creator(s) of the dataset.
4. COMMUNITY DATA LICENSE AGREEMENT – CDLA PERMISSIVE-2.0
Like most open-source licenses, this license allows users to use, modify, adapt, and share
the dataset, but only if a disclaimer of warranties and liability is also included.
5. OPEN DATA COMMONS ATTRIBUTION LICENSE - ODC-BY
This license allows users to share and adapt a dataset, but only if they give credit to the
creator(s) of the dataset.
6. CREATIVE COMMONS ATTRIBUTION-SHAREALIKE 4.0 INTERNATIONAL - CC-BY-SA
This license allows users to use, share, and adapt a dataset, but only if they give credit to
the dataset, show any changes or transformations they made to the dataset, and share
the adapted dataset under the same license. Users might not want to use this license
because they have to share the work they did on the dataset.
7. COMMUNITY DATA LICENSE AGREEMENT – CDLA-SHARING-1.0
This license uses the principle of ‘copyleft’: users can use, modify, and adapt a dataset,
but only if they don’t add license restrictions on the new work(s) they create with the
dataset.
8. OPEN DATA COMMONS OPEN DATABASE LICENSE - ODC-ODBL
This license allows users to use, share, and adapt a dataset, but only if they give credit to
the dataset, show any changes or transformations they make to the dataset, and offer
any adapted version under the same license. Users might not want to use this license
because they have to share the work they did on the dataset.
9. CREATIVE COMMONS ATTRIBUTION-NONCOMMERCIAL 4.0 INTERNATIONAL - CC BY-NC
This license is a restrictive license. Users can share and adapt a dataset, provided they give
credit to its creator(s) and ensure that the dataset is not used for any commercial
purpose.
10. CREATIVE COMMONS ATTRIBUTION-NO DERIVATIVES 4.0 INTERNATIONAL - CC BY-ND
This license is also a restrictive license. Users can share a dataset if they give credit to its
creator(s). This license does not allow additions, transformations, or changes to the
dataset.
11. CREATIVE COMMONS ATTRIBUTION-NONCOMMERCIAL-SHAREALIKE 4.0 INTERNATIONAL - CC BY-NC-SA
This license allows users to share a dataset only if they give credit to its creator(s). Users
can share additions, transformations, or changes to the dataset under the same license,
but they cannot use the dataset for commercial purposes.
12. CREATIVE COMMONS ATTRIBUTION-NONCOMMERCIAL-NODERIVATIVES 4.0 INTERNATIONAL - CC BY-NC-ND
This license allows users to share a dataset only if they give credit to its creator(s). Users
are not allowed to modify the dataset and are not allowed to use it for commercial
purposes.
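
The conditions in the list above can be summarized as a small lookup table. The sketch below is a simplified encoding of the summaries in this document, not of the license texts themselves. The `LicenseTerms` class, the `permits` and `obligations` helpers, and the four flags are hypothetical names introduced for illustration, and conditions that do not fit the four flags (such as the warranty disclaimer for CDLA-Permissive-2.0) are left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LicenseTerms:
    """Simplified conditions, based only on the summaries in this document."""
    attribution: bool = False      # must give credit to the creator(s)
    share_alike: bool = False      # adaptations stay under the same terms
    non_commercial: bool = False   # commercial use is not allowed
    no_derivatives: bool = False   # modifications are not allowed


LICENSES = {
    "Public Domain": LicenseTerms(),
    "PDDL": LicenseTerms(),
    "CC-BY": LicenseTerms(attribution=True),
    "CDLA-Permissive-2.0": LicenseTerms(),  # warranty disclaimer not modeled
    "ODC-BY": LicenseTerms(attribution=True),
    "CC-BY-SA": LicenseTerms(attribution=True, share_alike=True),
    "CDLA-Sharing-1.0": LicenseTerms(share_alike=True),
    "ODC-ODbL": LicenseTerms(attribution=True, share_alike=True),
    "CC BY-NC": LicenseTerms(attribution=True, non_commercial=True),
    "CC BY-ND": LicenseTerms(attribution=True, no_derivatives=True),
    "CC BY-NC-SA": LicenseTerms(
        attribution=True, share_alike=True, non_commercial=True
    ),
    "CC BY-NC-ND": LicenseTerms(
        attribution=True, non_commercial=True, no_derivatives=True
    ),
}


def permits(license_id, commercial_use, modify):
    """Return True if the intended use is compatible with the flags above."""
    terms = LICENSES[license_id]
    if commercial_use and terms.non_commercial:
        return False
    if modify and terms.no_derivatives:
        return False
    return True


def obligations(license_id):
    """List the duties a user takes on when using the dataset."""
    terms = LICENSES[license_id]
    duties = []
    if terms.attribution:
        duties.append("give credit")
    if terms.share_alike:
        duties.append("share adaptations under the same terms")
    return duties


print(permits("CC BY-NC", commercial_use=True, modify=False))
print(obligations("CC-BY-SA"))
```

A table like this can help when screening candidate datasets, but the actual license text should always be read before a dataset is used.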

• Libraries usually contain built-in modules that provide different functionalities.

• You can use data visualization methods to communicate with others and display meaningful
results of an analysis.

• For machine learning, the Scikit-learn library contains tools for statistical modeling, including
regression, classification, clustering, and so on.

• Large-scale production of deep-learning models uses TensorFlow, a low-level framework.

• Apache Spark is a general-purpose cluster-computing framework that allows you to process
data using compute clusters.

• An application programming interface (API) allows communication between two pieces of
software.

• An API is the part of a library that you see, while the library contains all the components
of the program.

• REST APIs allow you to communicate through the internet and take advantage of resources
like storage, data, artificially intelligent algorithms, and much more.

• Open data is fundamental to Data Science.

• The Community Data License Agreement makes it easier to share open data.

• The IBM Data Asset eXchange (DAX) site contains high-quality open data sets.

• DAX open data sets include tutorial notebooks that provide basic and advanced
walk-throughs for developers.

• DAX notebooks open in Watson Studio.

• Machine learning (ML) uses algorithms, also known as “models”, to identify patterns in the
data.

• Types of ML are Supervised, Unsupervised, and Reinforcement.

• Supervised learning comprises two types of models, regression and classification.

• Deep learning refers to a general set of models and techniques that loosely emulate the way
the human brain solves a wide range of problems.

• The Model Asset eXchange (MAX) is a free, open-source repository for ready-to-use and
customizable deep-learning microservices.

• MAX model-serving microservices are built and distributed on GitHub as open-source
Docker images.

• You can use Red Hat OpenShift, a Kubernetes platform, to automate deployment, scaling, and
management of microservices.

• Ml-exchange.org has multiple predefined models.
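
The summary above notes that supervised learning comprises regression and classification. The sketch below contrasts the two on made-up toy data using only the Python standard library: a least-squares line fit predicts a number, and a nearest-centroid rule predicts a label. The function names and data are hypothetical, and this is not the Scikit-learn API; a real project would normally use a library such as Scikit-learn instead.

```python
def fit_line(xs, ys):
    """Regression: least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept


def fit_centroids(samples):
    """Classification: compute the mean feature value for each label."""
    totals = {}
    for value, label in samples:
        total, count = totals.get(label, (0.0, 0))
        totals[label] = (total + value, count + 1)
    return {label: total / count for label, (total, count) in totals.items()}


def classify(centroids, value):
    """Predict the label whose centroid is closest to the given value."""
    return min(centroids, key=lambda label: abs(centroids[label] - value))


# Regression: labeled examples pair an input with a numeric target.
slope, intercept = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
predicted_number = slope * 5 + intercept

# Classification: labeled examples pair an input with a category.
centroids = fit_centroids(
    [(1.0, "small"), (2.0, "small"), (8.0, "large"), (9.0, "large")]
)
predicted_label = classify(centroids, 3.0)

print(predicted_number, predicted_label)
```

In both cases the model is learned from examples that already carry the correct answer, which is what makes the learning supervised.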
