02-Tools For Data Science
Data Integration and Transformation - streamline data pipelines and automate data
processing tasks
Code Asset Management - store & manage code, track changes and allow collaborative
development
Data Asset Management - organize and manage data, provide access control, and backup
assets
The data science ecosystem consists of many open source and commercial options. It
includes traditional desktop applications, server-based tools, and cloud-based
services that can be accessed through web browsers and mobile interfaces.
Data Management Tools: include Relational Databases, NoSQL Databases, and Big Data
platforms:
MySQL and PostgreSQL are examples of open source Relational Database Management
Systems (RDBMS). IBM Db2 and SQL Server are examples of commercial
RDBMSes, which are also available as cloud services.
MongoDB and Apache Cassandra are examples of NoSQL databases.
Apache Hadoop and Apache Spark are used for Big Data analytics.
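To make the relational model concrete, here is a minimal sketch of creating a table and querying it with SQL. It uses SQLite through Python's built-in sqlite3 module purely as a stand-in, because it needs no server; SQLite is not one of the tools named above, and the table name and columns are hypothetical.

```python
import sqlite3

# In-memory database: nothing is written to disk.
conn = sqlite3.connect(":memory:")

# Hypothetical table for illustration only.
conn.execute("CREATE TABLE tools (name TEXT PRIMARY KEY, category TEXT)")
conn.executemany(
    "INSERT INTO tools (name, category) VALUES (?, ?)",
    [
        ("MySQL", "relational"),
        ("PostgreSQL", "relational"),
        ("MongoDB", "nosql"),
        ("Apache Cassandra", "nosql"),
    ],
)

# A declarative SQL query: filter rows by category and sort by name.
rows = conn.execute(
    "SELECT name FROM tools WHERE category = ? ORDER BY name",
    ("relational",),
).fetchall()
relational_names = [name for (name,) in rows]
print(relational_names)

conn.close()
```

The same SELECT statement would look much the same against MySQL or PostgreSQL; mainly the connection setup differs.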
Data Integration and Transformation Tools: include Apache Airflow and Apache Kafka.
Data Visualization Tools: include commercial offerings such as Cognos Analytics, Tableau,
and Power BI, which can be used for building dynamic and interactive dashboards.
Code Asset Management Tools: Git is an essential code asset management tool. GitHub is a
popular web-based platform for storing and managing source code. Its features, including
version control, issue tracking, and project management, make it an ideal tool for
collaborative software development.
Development Environments: Popular development environments for Data Science include
Jupyter Notebooks and RStudio.
Jupyter Notebooks provides an interactive environment for creating and sharing code,
descriptive text, data visualizations, and other computational artifacts in a web-browser
based interface.
RStudio is an integrated development environment (IDE) designed specifically for
working with the R programming language, which is a popular tool for statistical
computing and data analysis.
Datasets
Open datasets and sources
In this data-driven world, some datasets are freely available for anyone to access, use,
modify, and share. These are called open datasets.
Open datasets include a public license and are very useful for your journey as a Data
Scientist. Some of the most informative open dataset sources are listed below.
Government Data:
https://www.data.gov/
https://www.census.gov/data.html
https://data.gov.uk/
https://www.opendatanetwork.com/
https://data.un.org/
Financial Data Sources:
https://data.worldbank.org/
https://www.globalfinancialdata.com/
https://comtrade.un.org/
https://www.nber.org/
https://fred.stlouisfed.org/
Crime Data:
https://www.fbi.gov/services/cjis/ucr
https://www.icpsr.umich.edu/icpsrweb/content/NACJD/index.html
https://www.drugabuse.gov/related-topics/trends-statistics
https://www.unodc.org/unodc/en/data-and-analysis/
Health Data:
https://www.who.int/gho/database/en/
https://www.fda.gov/Food/default.htm
https://seer.cancer.gov/faststats/selections.php?series=cancer
https://www.opensciencedatacloud.org/
https://pds.nasa.gov/
https://earthdata.nasa.gov/
https://www.sgim.org/communities/research/dataset-compendium/public-datasets-topic-grid
Academic and Business Data:
https://scholar.google.com/
https://nces.ed.gov/
https://www.glassdoor.com/research/
https://www.yelp.com/dataset
Other General Data:
https://www.kaggle.com/datasets
https://www.reddit.com/r/datasets/
Health Care:
https://www.sgim.org/communities/research/dataset-compendium/proprietary-datasets
Financial Market data:
https://datarade.ai/data-categories/proprietary-market-data
Google Cloud based datasets:
https://cloud.google.com/datasets
Dataset licenses
When you select a dataset, check its license first. A license states whether you may
use the dataset at all and which conditions you must accept in order to use it.
The different license types are listed below.
You can use data visualization methods to communicate with others and display meaningful
results of an analysis.
For machine learning, the Scikit-learn library contains tools for statistical modeling, including
regression, classification, clustering, and so on.
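As an illustration of what "regression" means here, the sketch below fits a straight line to a few points using ordinary least squares in plain Python. This is not Scikit-learn code; it is a standard-library stand-in so the example runs without extra installs. In Scikit-learn the same task is typically done with the LinearRegression estimator and its fit and predict methods. The function names and data are made up for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and of the x*y co-deviations.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(slope, intercept, x):
    """Apply the fitted line to a new input."""
    return slope * x + intercept


# Made-up training data that lies exactly on y = 2x + 1.
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]

slope, intercept = fit_line(xs, ys)
prediction = predict(slope, intercept, 6)
print(slope, intercept, prediction)
```

Fitting learns the parameters (slope and intercept) from known data; prediction applies them to inputs the model has not seen.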
An API is the part of a library that you see and call, while the library itself contains
all the components of the program.
REST APIs allow you to communicate through the internet and take advantage of resources
like storage, data, artificially intelligent algorithms, and much more.
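A REST request and response can be sketched end to end with the Python standard library. So that the example runs offline, it starts a small local server and then calls it over HTTP. The /datasets/<id> route, the DATASETS contents, and the function names are all hypothetical; a real service defines its own URLs and payloads.

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import ProxyHandler, build_opener

# Hypothetical resource store exposed by the toy service.
DATASETS = {"1": {"id": 1, "name": "example-dataset"}}


class DatasetHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Route: GET /datasets/<id> returns the resource as JSON.
        parts = self.path.strip("/").split("/")
        if len(parts) == 2 and parts[0] == "datasets" and parts[1] in DATASETS:
            status, payload = 200, DATASETS[parts[1]]
        else:
            status, payload = 404, {"error": "not found"}
        body = json.dumps(payload).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # Keep the example output quiet.


def fetch_dataset(base_url, dataset_id):
    """Client side: send a GET request and decode the JSON response."""
    opener = build_opener(ProxyHandler({}))  # Skip proxies; the server is local.
    with opener.open(f"{base_url}/datasets/{dataset_id}") as response:
        return json.loads(response.read().decode("utf-8"))


# Port 0 asks the operating system for any free port.
server = HTTPServer(("127.0.0.1", 0), DatasetHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
try:
    base_url = f"http://127.0.0.1:{server.server_port}"
    result = fetch_dataset(base_url, 1)
    print(result)
finally:
    server.shutdown()
    server.server_close()
```

The client only needs the URL and the HTTP method. Against a real REST API, fetch_dataset would point at the provider's base URL and usually send an API key or token as well.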
The IBM Data Asset eXchange (DAX) site contains high-quality open data sets.
DAX open data sets include tutorial notebooks that provide basic and advanced walk-
throughs for developers.
Machine learning (ML) uses algorithms – also known as “models” – to identify patterns in the
data.
Deep learning refers to a general set of models and techniques that loosely emulate the way
the human brain solves a wide range of problems.
The Model Asset eXchange is a free, open-source repository for ready-to-use and
customizable deep-learning microservices.
You can use Red Hat OpenShift, a Kubernetes platform, to automate deployment, scaling, and
management of microservices.