
DATA DRIVEN ARTIFICIAL INTELLIGENT SYSTEMS
Session no: 18&19
Topic: Introduction to Data Science
WHAT IS DATA SCIENCE?

• Data science is the deep study of massive amounts of data. It involves extracting meaningful insights from raw, structured, and unstructured data using scientific methods, different technologies, and algorithms.
• It is a multidisciplinary field that uses tools and techniques to manipulate data so that we can find something new and meaningful.
• Data science uses powerful hardware, programming systems, and efficient algorithms to solve data-related problems. It is the future of artificial intelligence.
• In short, we can say that data science is all about:
• Asking the correct questions and analyzing the raw data.
• Modeling the data using various complex and efficient algorithms.
• Visualizing the data to get a better perspective.
• Understanding the data in order to make better decisions and arrive at the final result.

2
• Data science is a multidisciplinary approach to extracting actionable insights from the
large and ever-increasing volumes of data collected and created by organizations.
• Data science enables businesses to process huge amounts of structured and unstructured
big data to detect patterns.
WHY DATA SCIENCE?

• With the right tools, technologies, and algorithms, we can use data and convert it into a distinct business advantage.
• Data science helps detect fraud using advanced machine learning algorithms.
• It helps prevent significant monetary losses.
• It allows us to build intelligent capabilities into machines.
• It enables better and faster decision-making.
• It helps recommend the right product to the right customer to enhance your business.

4
DATA SCIENCE COMPONENTS

5
DATA SCIENCE COMPONENTS…

• 1. Statistics: Statistics is one of the most important components of data science. Statistics is a way to collect and analyze large amounts of numerical data and find meaningful insights in it (a small sketch covering statistics and visualization follows this list).
• 2. Domain Expertise: Domain expertise binds data science together. Domain expertise means specialized knowledge or skill in a particular area. In data science, there are various areas for which we need domain experts.
• 3. Data engineering: Data engineering is the part of data science that involves acquiring, storing, retrieving, and transforming data. Data engineering also includes adding metadata (data about data) to the data.
• 4. Visualization: Data visualization means representing data in a visual context so that people can easily understand its significance. Data visualization makes it easy to grasp huge amounts of data through visuals.
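As a small, hedged illustration of the statistics and visualization components, the sketch below computes summary statistics for an invented dataset and draws a quick histogram; pandas and matplotlib are assumed to be installed, and the column names and values are made up for the example.

```python
# A minimal sketch of the statistics and visualization components,
# using an invented dataset; pandas and matplotlib are assumed installed.
import pandas as pd
import matplotlib.pyplot as plt

# Small, made-up structured dataset
sales = pd.DataFrame({
    "region": ["North", "South", "East", "West", "North", "South"],
    "revenue": [120.5, 98.0, 143.2, 110.7, 130.1, 101.4],
})

# Statistics: summary of the numerical column
print(sales["revenue"].describe())                 # count, mean, std, min, quartiles, max
print(sales.groupby("region")["revenue"].mean())   # mean revenue per region

# Visualization: a quick histogram of revenue
sales["revenue"].plot(kind="hist", bins=5, title="Revenue distribution")
plt.xlabel("revenue")
plt.show()
```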

6
DATA SCIENCE COMPONENTS…

• 5. Advanced computing: Advanced computing involves designing, writing, debugging, and maintaining the source code of computer programs.
• 6. Mathematics: Mathematics is a critical part of data science. Mathematics involves the study of quantity, structure, space, and change. For a data scientist, a good knowledge of mathematics is essential.
• 7. Machine learning: Machine learning is the backbone of data science. Machine learning is all about training a machine so that it can act like a human brain. In data science, we use various machine learning algorithms to solve problems.
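A minimal sketch of the machine learning idea above: a model is trained on example data and then makes predictions on unseen input. It uses scikit-learn's bundled iris dataset; the library and the choice of a decision tree are assumptions made for illustration, not part of the slides.

```python
# A minimal sketch of the machine learning component: training a model
# so it can make predictions on new data. scikit-learn is assumed installed.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)    # features and labels

model = DecisionTreeClassifier(max_depth=3)
model.fit(X, y)                      # "training" the machine on examples

# Predict the class of a new, unseen flower measurement
print(model.predict([[5.1, 3.5, 1.4, 0.2]]))
```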

7
DATA SCIENCE COMPONENTS…

8
TOOLS FOR DATA SCIENCE

Following are some tools required for data science:

• Data Analysis tools: R, Python, Statistics, SAS, Jupyter, RStudio, MATLAB, Excel, RapidMiner.
• Data Warehousing: ETL, SQL, Hadoop, Informatica/Talend, AWS Redshift.
• Data Visualization tools: R, Jupyter, Tableau, Cognos.
• Machine learning tools: Spark, Mahout, Azure ML Studio.

9
DATA SCIENCE TOOLS

• To build and run code in order to create models, the most popular programming languages are open-source tools that include or support pre-built statistical, machine learning, and graphics capabilities. These languages include:
• R: An open-source programming language and environment for statistical computing and graphics.
• Python: A general-purpose, object-oriented, high-level programming language that emphasizes code readability through its distinctive use of significant whitespace.
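Only as a hedged illustration of the readability point: in Python, indentation itself defines the blocks, so the structure of the code is visible at a glance. The function below is invented for the example.

```python
# Indentation defines the blocks, so the structure is the documentation.
def mean(values):
    total = 0
    for v in values:
        total += v
    return total / len(values)

print(mean([4, 8, 15, 16, 23, 42]))  # 18.0
```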
CONT.
• SQL Analysis Services: Used to perform in-database analytics using common data mining functions and basic predictive models.
• SAS/ACCESS: Can be used to access data from Hadoop and is used for creating repeatable and reusable model flow diagrams.

SAS: Statistical Analysis System


DATA SCIENCE APPLICATIONS

• Identifying and predicting disease


• Personalized healthcare recommendations
• Optimizing shipping routes in real-time
• Getting the most value out of soccer rosters
• Finding the next slew of world-class athletes
• Stamping out tax fraud
• Automating digital ad placement
• Algorithms that help you find love
• Predicting incarceration rates
BIG DATA

• Big data is a collection of massive and complex data sets with large data volume.
• It includes huge quantities of data, data management capabilities, social media analytics, and real-time data.
• Big data is about data volume, with large data sets measured in terms of terabytes or petabytes.
• The practice of examining big data has come to be known as big data analytics.
• Big data analytics is the process of examining large amounts of data.
5 VS IN BIG DATA

• Doug Laney introduced the concept of the 3 Vs of Big Data, viz. Volume, Variety, and Velocity.
• Volume: refers to the amount of data that is being collected (the data could be structured or unstructured).
• Velocity: refers to the rate at which data is coming in.
• Variety: refers to the different kinds of data (data types, formats, etc.) that are coming in for analysis.
CONT.
• Over the last few years, two additional Vs of data have also emerged, i.e., value and veracity.
• Value refers to the usefulness of the collected data.
• Veracity refers to the quality of data that is coming in from different sources.
DATA ANALYTICS

• Data analytics is the science of analyzing raw data to draw conclusions about that information.
• The techniques and processes of data analytics have been automated into mechanical processes and algorithms that work over raw data for human consumption.
• Data analytics helps a business optimize its performance.
DATA SCIENCE AND DATA ANALYTICS (TWO SIDES OF THE SAME COIN)
• Data science is an umbrella term that encompasses data analytics, data mining, machine
learning, and several other related disciplines.
• Data Science and Data Analytics utilize data in different ways.
• Data Science and Data Analytics deal with Big Data, each taking a unique approach.
• Data analytics is mainly concerned with Statistics, Mathematics, and Statistical Analysis.
CONT.
• Data Science focuses on finding meaningful correlations between large datasets.
• Data Analytics is designed to uncover the specifics of extracted insights.

• Note: Data Analytics is a branch of Data Science that focuses on more specific answers
to the questions that Data Science brings forth.
KEY POINTS
• Data science and data analytics are both ways of understanding big data, and both often involve analyzing massive databases using R and Python.
• SAS/ACCESS engines are tightly integrated and used by all SAS solutions for third-party data integration; supported integration standards include ODBC, JDBC, Spark SQL (on SAS Viya), and OLE DB.
• Internet users generate about 2.5 quintillion bytes of data every day. By 2020, every person on Earth will be generating about 146,880 MB of data every day, and by 2025, that will be 165 zettabytes every year.
DATA SCIENCE LIFECYCLE

• The data science lifecycle is also called the data science pipeline. The following steps are involved in the data science life cycle.
• Step 1: Define Problem Statement: Creating a well-defined problem statement is the first and a critical step in data science.
• Step 2: Data Collection: Collect the data that can help solve the problem through a systematic approach.
• Step 3: Data Quality Check and Remediation: Ensure that the data used for analysis and interpretation is of good quality.
CONT.
• Step 4: Exploratory Data Analysis: Before you model the steps to arrive at a solution, it is important to analyse the data.
• Step 5: Data Modelling: Modelling means formulating every step and gathering the techniques required to achieve the solution.
• Step 6: Data Communication: This is the final step, where you present the results of your analysis to the stakeholders. You explain how you came to a specific conclusion and report your critical findings.
DATA SCIENCE LIFE CYCLE

22
DATA SCIENCE LIFE CYCLE…

1. Discovery: The first phase is discovery, which involves asking the right questions. When we start any data
science project, we need to determine the basic requirements, priorities, and project budget.
In this phase, we need to determine all the requirements of the project, such as the number of people,
technology, time, data, and the end goal, and then we can frame the business problem at a first-hypothesis level.
2. Data preparation: In this phase, we need to perform the following tasks:
• Data cleaning
• Data reduction
• Data integration
• Data transformation
After performing all the above tasks, we can easily use this data for our further processes (a minimal sketch follows).
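A minimal sketch of the data preparation tasks listed above, assuming pandas is available; the tables, column names, and conversion rate are invented. It shows cleaning (dropping duplicates, filling missing values), integration (merging two sources), and a simple transformation.

```python
# A minimal data-preparation sketch with invented data; pandas assumed installed.
import pandas as pd

orders = pd.DataFrame({
    "order_id": [1, 2, 2, 3],
    "customer_id": [10, 11, 11, 12],
    "amount": [250.0, None, None, 99.0],
})
customers = pd.DataFrame({
    "customer_id": [10, 11, 12],
    "country": ["IN", "US", "DE"],
})

# Data cleaning: remove duplicate rows and fill missing amounts
orders = orders.drop_duplicates()
orders["amount"] = orders["amount"].fillna(orders["amount"].mean())

# Data integration: combine the two sources on a common key
combined = orders.merge(customers, on="customer_id", how="left")

# Data transformation: derive a new column for further processing
combined["amount_usd"] = combined["amount"] * 0.012  # illustrative conversion rate

print(combined)
```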

23
DATA SCIENCE LIFE CYCLE…

3. Model Planning: In this phase, we need to determine the various methods and
techniques to establish the relations between input variables.
We will apply exploratory data analysis (EDA), using various statistical formulas and
visualization tools, to understand the relations between variables and to see what the data
can tell us (a small sketch follows the tool list below).
Common tools used for model planning are:
• SQL Analysis Services
• R
• SAS
• Python
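The sketch below shows the kind of exploratory data analysis described above: computing pairwise correlations between variables and visually checking one relationship. The iris dataset bundled with scikit-learn is used purely for convenience; pandas and matplotlib are assumed installed.

```python
# A small EDA sketch for model planning: inspect relations between variables.
# scikit-learn, pandas and matplotlib are assumed installed.
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris

iris = load_iris(as_frame=True)
df = iris.frame  # features plus the target column

# Pairwise correlations between the numeric variables
print(df.corr().round(2))

# Visual check of one relationship
df.plot.scatter(x="petal length (cm)", y="petal width (cm)",
                title="Petal length vs petal width")
plt.show()
```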

24
DATA SCIENCE LIFE CYCLE…

4. Model-building: In this phase, the process of model building starts.
We will create datasets for training and testing purposes.
We will apply different techniques such as association, classification, and
clustering to build the model (a minimal sketch follows the tool list below).
Following are some common model-building tools:
• SAS Enterprise Miner
• WEKA
• SPSS Modeler
• MATLAB
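As a minimal model-building sketch, the snippet below creates training and testing datasets, fits a classifier, evaluates it on the held-out data, and also runs a k-means clustering on the same observations. scikit-learn is an assumed stand-in here rather than one of the tools listed above, and the parameters are purely illustrative.

```python
# A model-building sketch: create training/testing datasets and apply
# classification and clustering. scikit-learn is assumed installed.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Classification: fit on the training set, evaluate on the held-out test set
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)
print("test accuracy:", accuracy_score(y_test, clf.predict(X_test)))

# Clustering: group the same observations without using the labels
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print("cluster sizes:", [int((km.labels_ == k).sum()) for k in range(3)])
```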

25
DATA SCIENCE LIFE CYCLE…

5. Operationalize: In this phase, we will deliver the final reports of the project, along with
briefings, code, and technical documents.
This phase provides a clear overview of complete project performance and other
components on a small scale before the full deployment.

6. Communicate results: In this phase, we will check whether we have reached the goal that
we set in the initial phase.
We will communicate the findings and final result to the business team.

26
DATA SCIENCE LIFE CYCLE

27
TYPES OF DATA

• In data science and big data, different types of data are used, and each of them tends to
require different tools and techniques.
• The main categories of data are:
• Structured
• Unstructured
• Natural language
• Machine-generated
• Graph-based
• Audio, video, and images
• Streaming

28
STRUCTURED DATA

• Structured data is data that depends on a data model and resides in a fixed field within a
record
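A tiny, hedged illustration of fixed fields within records, using an invented table: every row has the same named, typed columns, which is what makes structured data easy to store and query. pandas is assumed installed.

```python
# Structured data: every record has the same fixed fields (columns).
import pandas as pd

employees = pd.DataFrame(
    [{"id": 1, "name": "Asha", "dept": "Sales", "salary": 52000},
     {"id": 2, "name": "Ravi", "dept": "IT",    "salary": 61000}],
)
print(employees.dtypes)                      # each field has a well-defined type
print(employees[employees["dept"] == "IT"])  # easy to filter on a fixed field
```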

29
UNSTRUCTURED DATA

• Unstructured data is data that isn’t easy to fit into a data model because the content is
context-specific or varying.

30
MACHINE-GENERATED DATA

• Machine-generated data is information that’s automatically created by a computer, process, application, or other machine without human intervention.
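A short sketch of handling machine-generated data: an invented web-server log line is parsed into fields with a regular expression. The log format and field names are assumptions made for the example.

```python
# Machine-generated data: an (invented) web-server log line parsed into fields.
import re

log_line = '203.0.113.7 - - [12/Mar/2024:10:15:32 +0000] "GET /index.html HTTP/1.1" 200 5123'

pattern = (r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
           r'"(?P<request>[^"]+)" (?P<status>\d+) (?P<size>\d+)')
match = re.match(pattern, log_line)
if match:
    record = match.groupdict()
    print(record["ip"], record["status"], record["size"])
```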

31
GRAPH-BASED DATA

• Graph-based data is a natural way to represent social networks, and its structure allows you to calculate specific metrics such as the influence of a person and the shortest path between two people.
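A small sketch of the graph idea above, assuming the networkx library is installed: a toy friendship network on which the shortest path between two people and a simple influence measure (degree centrality) are computed. The names are invented.

```python
# Graph-based data: a toy social network; networkx is assumed installed.
import networkx as nx

g = nx.Graph()
g.add_edges_from([
    ("Ann", "Bob"), ("Bob", "Cara"), ("Cara", "Dev"),
    ("Ann", "Eli"), ("Eli", "Dev"),
])

# Shortest path between two people
print(nx.shortest_path(g, "Ann", "Dev"))   # e.g. ['Ann', 'Eli', 'Dev']

# A simple "influence" metric: how connected each person is
print(nx.degree_centrality(g))
```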

32
AUDIO, IMAGE AND VIDEO

• Audio, image, and video are data types that pose specific challenges to a data scientist.
• Tasks that are trivial for humans, such as recognizing objects in pictures, turn out to be challenging for computers.
• MLBAM (Major League Baseball Advanced Media) announced in 2014 that it would increase video capture to approximately 7 TB per game for the purpose of live, in-game analytics.
• High-speed cameras at stadiums capture ball and athlete movements to calculate, in real time, for example, the path taken by a defender relative to two baselines.
• Recently, a company called DeepMind succeeded in creating an algorithm that is capable of learning how to play video games.
• This algorithm takes the video screen as input and learns to interpret everything via a complex process of deep learning.
• It is a remarkable feat that prompted Google to buy the company for its own artificial intelligence (AI) development plans.
• The learning algorithm takes in data as it is produced by the computer game; it is streaming data.

33
STREAMING DATA

• While streaming data can take almost any of the previous forms, it has an extra property.
• The data flows into the system when an event happens, instead of being loaded into a data
store in a batch.
• Although this isn’t really a different type of data, we treat it here as such because you
need to adapt your processes to deal with this type of information.
• Examples are “What’s trending” on Twitter, live sporting or music events, and the
stock market.
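A minimal sketch of the streaming property described above: events are processed one at a time as they arrive, keeping only a running aggregate rather than loading a batch into a data store. The event source is an invented generator standing in for a real feed such as a market-data stream.

```python
# Streaming data: process events as they arrive instead of loading a batch.
# The event source below is an invented stand-in for a real feed.
import random
import itertools

def price_feed():
    """Yield an endless stream of (symbol, price) events."""
    while True:
        yield ("ACME", round(100 + random.gauss(0, 1), 2))

count, total = 0, 0.0
for symbol, price in itertools.islice(price_feed(), 10):  # take 10 events
    count += 1
    total += price
    print(f"event {count}: {symbol} {price}  running average {total / count:.2f}")
```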

34
DATA ANALYTICS

• Data analytics is the science of analyzing raw data to draw conclusions about that
information.

• The techniques and processes of data analytics have been automated into mechanical
processes and algorithms that work over raw data for human consumption.

• Data analytics helps a business optimize its performance.

35
DATA ANALYTICS LIFE CYCLE

36
PHASE 1— DISCOVERY

• In Phase-1, the team learns the business domain, including relevant history such as
whether the organization or business unit has attempted similar projects in the past from
which they can learn.
• The team assesses the resources available to support the project in terms of people,
technology, time, and data.
• Important activities in this phase include framing the business problem as an analytics
challenge that can be addressed in subsequent phases and formulating initial hypotheses
(IHs) to test and begin learning the data.

37
PHASE 2 — DATA PREPARATION

• Phase 2—Data preparation: requires the presence of an analytic sandbox, in which the
team can work with data and perform analytics for the duration of the project.
• The team needs to execute extract, load, and transform (ELT) or extract, transform, and
load (ETL) to get data into the sandbox (see the sketch after this list).
• ELT and ETL are sometimes abbreviated together as ETLT.
• Data should be transformed in the ETLT process so that the team can work with it and analyze it.
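The sketch below illustrates one minimal ETL step into a local sandbox (a SQLite file), assuming pandas is available; the source data, table name, and transformation are invented for the example.

```python
# A minimal ETL sketch: extract from a source, transform, load into a sandbox.
# pandas is assumed installed; names and the transformation are illustrative.
import sqlite3
import pandas as pd

# Extract: here an in-memory DataFrame stands in for a real source system
raw = pd.DataFrame({
    "order_id": [1, 2, 3],
    "amount": ["250", "99", "175"],   # arrives as text
})

# Transform: fix types and derive a column the analysts need
raw["amount"] = raw["amount"].astype(float)
raw["is_large"] = raw["amount"] > 150

# Load: write into the analytic sandbox (a local SQLite database)
with sqlite3.connect("sandbox.db") as conn:
    raw.to_sql("orders_clean", conn, if_exists="replace", index=False)
    print(pd.read_sql("SELECT * FROM orders_clean", conn))
```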

38
PHASE 3—MODEL PLANNING

• Phase 3—Model planning: where the team determines the methods, techniques, and
workflow it intends to follow for the subsequent model building phase.
• The team explores the data to learn about the relationships between variables and
subsequently selects key variables and the most suitable models.

39
PHASE 4 — MODEL BUILDING

• Phase 4—Model building: the team develops datasets for testing, training, and
production purposes.
• In addition, in this phase the team builds and executes models based on the work done in
the model planning phase.
• The team also considers whether its existing tools will be sufficient for running the models,
or whether it will need a more robust environment for executing models and workflows.
40
Phase 4 — Model building

41
PHASE 5 — COMMUNICATE RESULTS AND
FINDINGS
• Phase 5—Communicate results: the team, in collaboration with major stakeholders,
determines if the results of the project are a success or a failure based on the criteria
developed in Phase 1.
• The team should identify key findings, quantify the business value, and develop a
narrative to summarize and convey findings to stakeholders.

42
Phase 5 — Communicate Results and Findings

43
PHASE 6 — OPERATIONALIZE

• In this phase, the team delivers final reports, briefings, code, and technical documents.
• In addition, the team may run a pilot project to implement the models in a production
environment.
• In the final phase, the team communicates the benefits of the project more broadly and
sets up a pilot project to deploy the work in a controlled way before broadening it to a
full enterprise or ecosystem of users.
• In the model building phase, the team scored the model in the analytics sandbox.

44
Phase 6 — Operationalize

45
Assessment Questions

1. Describe the roles of Data Science.

2. Draw the data science life cycle diagram and explain. Write down steps involved in
data science life cycle.

3. List any FOUR applications of Data Science and explain any ONE application in detail.

4. Illustrate the structured and unstructured data types.

5. Describe Data Analytics Life Cycle with neat diagram.


THANK YOU
