What Is Data Science - IBM

Data science combines various disciplines like statistics, programming, analytics, and storytelling to extract insights from large amounts of data. It involves preparing, analyzing, and communicating data to reveal patterns and enable informed decisions. Data scientists require skills in math, science, programming, and business communication. They use tools like R and Python to analyze data and create visualizations. Cloud computing makes large-scale data science more accessible for organizations. Data science is used across industries to improve processes, personalization, and decision-making through data-driven insights.

Uploaded by

waqar ahmad
Copyright
© All Rights Reserved

By: IBM Cloud Education

15 May 2020

Analytics
Data science

What is data science?

Data science combines the scientific method, math and statistics, specialized
programming, advanced analytics, AI, and even storytelling to uncover and explain
the business insights buried in data.

Data science is a multidisciplinary approach to extracting actionable insights from
the large and ever-increasing volumes of data collected and created by today’s
organizations. Data science encompasses preparing data for analysis and processing,
performing advanced data analysis, and presenting the results to reveal patterns and
enable stakeholders to draw informed conclusions.

Data preparation can involve cleansing, aggregating, and manipulating the data so
that it is ready for specific types of processing. Analysis requires the development
and use of algorithms, analytics, and AI models. It’s driven by software that combs
through data to find patterns and then transforms those patterns into predictions
that support business decision-making. The accuracy of these predictions must be
validated through scientifically designed tests and experiments. And the results
should be shared through the skillful use of data visualization tools that make it
possible for anyone to see the patterns and understand trends.
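As a rough illustration of what cleansing and aggregating can look like in practice, here is a minimal Python sketch that uses only the standard library. The records, field names, and the `cleanse` and `aggregate` helpers are hypothetical and invented for this example; real projects would more likely work with a library such as pandas or with a data integration tool.

```python
from collections import defaultdict

# Hypothetical raw records; field names and values are invented for illustration.
RAW_ORDERS = [
    {"region": " North ", "amount": "120.50"},
    {"region": "north", "amount": "80"},
    {"region": "South", "amount": "n/a"},    # amount cannot be parsed
    {"region": "south", "amount": "200.25"},
    {"region": "", "amount": "15"},          # region is missing
]


def cleanse(records):
    """Normalise region names and drop rows that cannot be used."""
    clean = []
    for row in records:
        region = row["region"].strip().lower()
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # skip rows whose amount is not a number
        if region:  # skip rows with no region
            clean.append({"region": region, "amount": amount})
    return clean


def aggregate(records):
    """Sum the amounts for each region."""
    totals = defaultdict(float)
    for row in records:
        totals[row["region"]] += row["amount"]
    return dict(totals)


clean_orders = cleanse(RAW_ORDERS)
totals_by_region = aggregate(clean_orders)
print(totals_by_region)
```

Dropping unusable rows is only one possible policy. Depending on the analysis, it may be better to flag or impute such values rather than discard them.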

As a result, data scientists (as data science practitioners are called) require
computer science and pure science skills beyond those of a typical data analyst. A
data scientist must be able to do the following:

– Apply mathematics, statistics, and the scientific method

– Use a wide range of tools and techniques for evaluating and preparing data—
everything from SQL to data mining to data integration methods

– Extract insights from data using predictive analytics and artificial intelligence
(AI), including machine learning and deep learning models

– Write applications that automate data processing and calculations

– Tell and illustrate stories that clearly convey the meaning of results to
decision-makers and stakeholders at every level of technical knowledge and
understanding

– Explain how these results can be used to solve business problems

This combination of skills is rare, and it’s no surprise that data scientists are currently
in high demand. According to an IBM survey (PDF, 3.9 MB), the number of job
openings in the field continues to grow at over 5% per year, with over 60,000
forecast for 2020.

The data science lifecycle


The data science lifecycle—also called the data science pipeline—includes anywhere
from five to sixteen (depending on whom you ask) overlapping, continuing processes.
The processes common to just about everyone’s definition of the lifecycle include the
following:

– Capture: This is the gathering of raw structured and unstructured data from all
relevant sources via just about any method—from manual entry and web
scraping to capturing data from systems and devices in real time.

– Prepare and maintain: This involves putting the raw data into a consistent
format for analytics or machine learning or deep learning models. This can
include everything from cleansing, deduplicating, and reformatting the data, to
using ETL (extract, transform, load) or other data integration technologies to
combine the data into a data warehouse, data lake, or other unified store for
analysis.

– Preprocess or process: Here, data scientists examine biases, patterns, ranges,
and distributions of values within the data to determine the data’s suitability for
use with predictive analytics, machine learning, and/or deep learning algorithms
(or other analytical methods).

– Analyze: This is where the discovery happens. Data scientists apply statistical
analysis, predictive analytics, regression, machine learning and deep learning
algorithms, and more to extract insights from the prepared data.

– Communicate: Finally, the insights are presented as reports, charts, and other
data visualizations that make the insights, and their impact on the business,
easier for decision-makers to understand. A data science programming
language such as R or Python (see below) includes components for generating
visualizations; alternatively, data scientists can use dedicated visualization
tools.

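The later stages of the lifecycle can be sketched in a few lines of Python. The example below is a simplified illustration rather than a prescribed workflow: the observations are invented, and the helper names (`profile`, `fit_line`, `mean_absolute_error`) are hypothetical. It profiles a variable (preprocess), fits a least-squares line (analyze), checks the fit against held-out data, and prints a short text report (communicate).

```python
import statistics

# Hypothetical observations of (x, y); the values are invented for illustration.
OBSERVATIONS = [
    (1.0, 3.1), (2.0, 4.9), (3.0, 7.2), (4.0, 8.8),  # used to fit the model
    (5.0, 11.1), (6.0, 12.9),                        # held out for validation
]


def profile(values):
    """Preprocess: summarise the range and spread of one variable."""
    return {
        "min": min(values),
        "max": max(values),
        "mean": statistics.mean(values),
        "stdev": statistics.stdev(values),
    }


def fit_line(points):
    """Analyze: ordinary least-squares fit of y = slope * x + intercept."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def mean_absolute_error(points, slope, intercept):
    """Validate: average absolute gap between predictions and actual values."""
    return statistics.mean([abs(y - (slope * x + intercept)) for x, y in points])


train, holdout = OBSERVATIONS[:4], OBSERVATIONS[4:]
summary = profile([y for _, y in train])
slope, intercept = fit_line(train)
mae = mean_absolute_error(holdout, slope, intercept)

# Communicate: a plain-text report stands in for a chart or dashboard.
print(f"y ranges from {summary['min']} to {summary['max']}")
print(f"fitted line: y = {slope:.2f} * x + {intercept:.2f}")
print(f"mean absolute error on held-out data: {mae:.2f}")
```

Holding back part of the data, as done here, is one simple way to check whether a model's predictions generalize beyond the data it was fitted on.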
Data science tools

Data scientists must be able to build and run code in order to create models. The
most popular programming languages among data scientists are open source tools
that include or support pre-built statistical, machine learning and graphics
capabilities. These languages include:

– R: An open source programming language and environment for developing
statistical computing and graphics, R is the most popular programming
language among data scientists. R provides a broad variety of libraries and tools
for cleansing and prepping data, creating visualizations, and training and
evaluating machine learning and deep learning algorithms. It’s also widely used
among data science scholars and researchers.

– Python: Python is a general-purpose, object-oriented, high-level programming
language that emphasizes code readability through its distinctive use of
significant white space. Several Python libraries support data science tasks,
including NumPy for handling large multidimensional arrays, pandas for data
manipulation and analysis, and Matplotlib for building data visualizations.

For a deep dive into the differences between these approaches, check out "Python
vs. R: What's the Difference?"

Data scientists need to be proficient in the use of big data processing platforms, such
as Apache Spark and Apache Hadoop. They also need to be skilled with a wide range
of data visualization tools, including the simple graphics tools included with business
presentation and spreadsheet applications, built-for-purpose commercial
visualization tools like Tableau and Microsoft Power BI, and open source tools like
D3.js (a JavaScript library for creating interactive data visualizations) and
RAWGraphs.

Data science and cloud computing


Cloud computing is bringing many data science benefits within reach of even small
and midsized organizations.

Data science’s foundation is the manipulation and analysis of extremely large data
sets; the cloud provides access to storage infrastructures capable of handling large
amounts of data with ease. Data science also involves running machine learning
algorithms that demand massive processing power; the cloud makes available the
high-performance compute that’s necessary for the task. To purchase equivalent on-
site hardware would be far too expensive for many enterprises and research teams,
but the cloud makes access affordable with per-use or subscription-based pricing.

Cloud infrastructures can be accessed from anywhere in the world, making it
possible for multiple groups of data scientists to share access to the data sets they’re
working with in the cloud—even if they’re located in different countries.

Open source technologies are widely used in data science tool sets. When they’re
hosted in the cloud, teams don’t need to install, configure, maintain, or update them
locally. Several cloud providers also offer prepackaged tool kits that enable data
scientists to build models without coding, further democratizing access to the
innovations and insights that this discipline is making available.

Data science use cases


There’s no limit to the number or kind of enterprises that could potentially benefit
from the opportunities data science is creating. Nearly any business process can be
made more efficient through data-driven optimization, and nearly every type of
customer experience (CX) can be improved with better targeting and personalization.

Here are a few representative use cases for data science and AI:

– An international bank created a mobile app offering on-the-spot decisions to
loan applicants using machine learning-powered credit risk models and a hybrid
cloud computing architecture that is both powerful and secure.

– An electronics firm is developing ultra-powerful 3D-printed sensors that will
guide tomorrow’s driverless vehicles. The solution relies on data science and
analytics tools to enhance its real-time object detection capabilities.

– A robotic process automation (RPA) solution provider developed a cognitive
business process mining solution that reduces incident handling times by between
15% and 95% for its client companies. The solution is trained to understand
the content and sentiment of customer emails, directing service teams to
prioritize those that are most relevant and urgent.

– A digital media technology company created an audience analytics platform
that enables its clients to see what’s engaging TV audiences as they’re offered a
growing range of digital channels. The solution employs deep analytics and
machine learning to gather real-time insights into viewer behavior.

– An urban police department created statistical incident analysis tools to help
officers understand when and where to deploy resources in order to prevent
crime. The data-driven solution creates reports and dashboards to augment
situational awareness for field officers.

– A smart healthcare company developed a solution enabling seniors to live
independently for longer. Combining sensors, machine learning, analytics, and
cloud-based processing, the system monitors for unusual behavior and alerts
relatives and caregivers, while conforming to the strict security standards that
are mandatory in the healthcare industry.

Data science and IBM Cloud


IBM Cloud offers a highly secure public cloud infrastructure with a full-stack platform
that includes more than 170 products and services, many of which were designed to
support data science and AI.

IBM’s data science and AI lifecycle product portfolio is built upon our longstanding
commitment to open source technologies and includes a range of capabilities that
enable enterprises to unlock the value of their data in new ways.

AutoAI, a powerful new automated development capability in IBM Watson Studio,
speeds the data preparation, model development, and feature engineering stages of
the data science lifecycle. This allows data scientists to be more efficient and helps
them make better-informed decisions about which models will perform best for real-
world use cases. AutoAI simplifies enterprise data science across any cloud
environment.

The IBM Cloud Pak for Data platform provides a fully integrated and extensible data
and information architecture built on the Red Hat OpenShift Container Platform
that runs on any cloud. With IBM Cloud Pak for Data, enterprises can more easily
collect, organize, and analyze data, making it possible to infuse insights from AI
throughout the entire organization.

Want to learn more about building and running data science models on IBM Cloud?
Get started at no charge by signing up for an IBM Cloud account today.

IBM named a Leader


IBM is named a Leader in the 2021 Gartner Magic Quadrant for Data Science and Machine
Learning Platforms.
