
21CS2213RA

AI for Data Science

Session - 19

Contents: Data Science - An Introduction

Session Objective
• An ability to understand Data Science.

• An ability to understand the real-life applications and uses of Data Science.
Data Science

• Data science combines the scientific method, math and statistics, specialized programming, advanced analytics, AI, and even storytelling to uncover and explain the business insights buried in data.

• Data science is a multidisciplinary approach to extracting actionable insights from the large and ever-increasing volumes of data collected and created by organizations.

• Data science is all about using data to solve problems.


Cont.

Data science is:

 preparing data for analysis and processing.

 performing advanced data analysis.

 presenting the results to reveal patterns and enable stakeholders to draw informed conclusions.
Cont.

Data science enables businesses to process huge amounts of structured and unstructured big data to detect patterns.
Data science lifecycle

• The data science lifecycle is also called the data science pipeline. The following steps are involved in the data science life cycle.

 Step 1: Define Problem Statement: Creating a well-defined problem statement is the first and most critical step in data science.

 Step 2: Data Collection: Collect the data that can help solve the problem through a systematic approach.

 Step 3: Data Quality Check and Remediation: Ensure that the data used for analysis and interpretation is of good quality.
Cont.

 Step 4: Exploratory Data Analysis: Before you model the steps to arrive at a solution, it is important to analyse the data.

 Step 5: Data Modelling: Modelling means formulating every step and gathering the techniques required to achieve the solution.

 Step 6: Data Communication: This is the final step, where you present the results of your analysis to the stakeholders. You explain how you arrived at a specific conclusion and your critical findings (a minimal code sketch of these steps follows below).
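To make the lifecycle concrete, here is a minimal, hedged Python sketch of Steps 1 to 6. The file name sales.csv, its target column, and the choice of a linear regression model are illustrative assumptions only, not part of these slides.

# Minimal sketch of the data science lifecycle (Steps 1-6).
# Assumes a hypothetical "sales.csv" file with a numeric "target" column.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Step 1: Define the problem statement, e.g.
# "Can we predict next month's sales (target) from the other columns?"

# Step 2: Data collection.
df = pd.read_csv("sales.csv")

# Step 3: Data quality check and remediation.
df = df.drop_duplicates()
df = df.dropna()                      # simple remediation: drop rows with missing values

# Step 4: Exploratory data analysis.
print(df.describe())                  # ranges, distributions, obvious outliers
print(df.corr(numeric_only=True))     # correlations between columns

# Step 5: Data modelling.
X = df.drop(columns=["target"])
y = df["target"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LinearRegression().fit(X_train, y_train)

# Step 6: Data communication.
print(f"R^2 on held-out data: {r2_score(y_test, model.predict(X_test)):.2f}")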
Cont.

Data Science life cycle


Cont. (Given by IBM)

• The data science lifecycle includes anywhere from five to sixteen steps.

• The processes common to just about everyone's definition of the lifecycle include the following:

 Capture: This is the gathering of raw structured and unstructured data from all relevant sources via just about any method.

 Prepare and maintain: This involves putting the raw data into a consistent format for analytics, machine learning, or deep learning models.
Cont.

 Preprocess or process: Examine biases, patterns, ranges, and distributions of values within the data to determine the data's suitability for use with predictive analytics, machine learning, and/or deep learning algorithms.

 Analyze: This is where you perform statistical analysis, predictive analytics, regression, machine learning and deep learning algorithms, and more to extract insights from the prepared data.
Cont.

 Communicate: Finally, the insights are presented as reports, charts, and other data visualizations that make the insights—and their impact on the business—easier for decision-makers to understand.
Types of data

• We always need to look at what types of data are involved:

 Known Data

 Unknown Data

 Others’ decisions

 Your decisions
Data Science Tools

• To build and run code in order to create models, the most popular programming
languages are open-source tools that include or support pre-built statistical, machine
learning and graphics capabilities. These languages include:

 R: An open-source programming language and environment for statistical computing and graphics.

 Python: A general-purpose, object-oriented, high-level programming language that emphasizes code readability through its distinctive, generous use of white space (see the short example below).
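As a quick, hedged illustration of why Python is popular here, the snippet below uses the widely available pandas and matplotlib libraries for pre-built statistics and graphics; the file name data.csv and the age column are hypothetical placeholders, not from these slides.

# Illustrative only: summary statistics and a simple chart in Python.
# "data.csv" and its "age" column are hypothetical placeholders.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("data.csv")      # load a tabular dataset
print(df.describe())              # pre-built statistical summary

df["age"].hist(bins=20)           # pre-built graphics capability
plt.title("Distribution of age")
plt.xlabel("age")
plt.ylabel("count")
plt.show()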
Cont.

 SQL Analysis Services: Used to perform in-database analytics using common data mining functions and basic predictive models.

 SAS/ACCESS: Can be used to access data from Hadoop and is used for creating repeatable and reusable model flow diagrams.
SAS: Statistical Analysis System
Data Science Applications

 Identifying and predicting disease

 Personalized healthcare recommendations

 Optimizing shipping routes in real-time

 Getting the most value out of soccer rosters

 Finding the next slew of world-class athletes

 Stamping out tax fraud

 Automating digital ad placement

 Algorithms that help you find love

 Predicting incarceration rates


Big Data

• Big data is a collection of massive and complex data sets with very large data volumes.

• It includes huge quantities of data, data management capabilities, social media analytics and real-time data.

• Big data is about data volume, with large data sets measured in terms of terabytes or petabytes.

• The practice of examining big data has given rise to big data analytics.

• Big data analytics is the process of examining large amounts of data.


5 Vs in Big Data

• Doug Laney introduced the concept of the 3 Vs of Big Data, viz. Volume, Variety, and Velocity.

Volume: refers to the amount of data that is being collected (the data could be structured or unstructured).

Velocity: refers to the rate at which data is coming in.

Variety: refers to the different kinds of data (data types, formats, etc.) that are coming in for analysis.
Cont.

Over the last few years, 2 additional Vs of data have also emerged, i.e. value and veracity.

Value refers to the usefulness of the collected data.

Veracity refers to the quality of data that is coming in from different sources.
Types of Data Science
Data Analytics

• Data analytics is the science of analyzing raw data to draw conclusions about that information.

• The techniques and processes of data analytics have been automated into mechanical processes and algorithms that work over raw data for human consumption.

• Data analytics helps a business optimize its performance (a small illustrative example follows below).
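As a small, hedged illustration of turning raw data into an actionable conclusion, the snippet below aggregates a hypothetical sales table; the file store_sales.csv and its columns (region, revenue) are placeholders, not from these slides.

# Illustrative only: a tiny analytics step from raw data to a conclusion.
# "store_sales.csv" and its columns ("region", "revenue") are hypothetical.
import pandas as pd

sales = pd.read_csv("store_sales.csv")

# Aggregate raw rows into a per-region summary.
by_region = sales.groupby("region")["revenue"].sum().sort_values(ascending=False)
print(by_region)

# A conclusion a stakeholder can act on: best- and worst-performing regions.
print(f"Top region: {by_region.index[0]}, lowest region: {by_region.index[-1]}")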


Data Science and Data Analytics (Two sides of the same coin)

• Data science is an umbrella term that encompasses data analytics, data mining, machine learning, and several other related disciplines.
 Data Science and Data Analytics utilize data in different ways.

 Data Science and Data Analytics deal with Big Data, each
taking a unique approach.
 Data analytics is mainly concerned with Statistics,
Mathematics, and Statistical Analysis.
Cont.

 Data Science focuses on finding meaningful correlations between large datasets.

 Data Analytics is designed to uncover the specifics of extracted insights.

Note: Data Analytics is a branch of Data Science that focuses on more specific answers to the questions that Data Science brings forth.
Key Points

• Data science and data analytics are both ways of understanding big data, and both often involve analyzing massive databases using R and Python.

• SAS/ACCESS engines are tightly integrated and used by all SAS solutions for third-
party data integration, supported integration standards include ODBC, JDBC, Spark
SQL (on SAS Viya) and OLE DB.

• Internet users generate about 2.5 quintillion bytes of data every day. By 2020, every person on Earth was projected to generate about 146,880 MB of data every day, and by 2025 the total will be about 165 zettabytes every year.
Lab/Skilling

Case Study: Diabetes Prevention

What if we could predict the occurrence of diabetes and take appropriate measures
beforehand to prevent it?
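One possible way to approach this case study is sketched below, under stated assumptions: it uses a local file diabetes.csv (for example, the publicly available Pima Indians diabetes data) with a binary Outcome column and a simple logistic regression model; the file, columns, and model choice are illustrative, not prescribed by the slides.

# Hedged sketch for the diabetes-prevention case study.
# Assumes a local "diabetes.csv" (e.g. the Pima Indians dataset) with a
# binary "Outcome" column (1 = diabetic, 0 = not diabetic).
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

df = pd.read_csv("diabetes.csv")

X = df.drop(columns=["Outcome"])
y = df["Outcome"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Scale features, then fit a simple, interpretable classifier.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

# Report how well we predict diabetes on unseen patients.
print(classification_report(y_test, model.predict(X_test)))

In practice, patients predicted to be at high risk could then be flagged for preventive screening or lifestyle interventions.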
Conclusion

• We should be careful and not directly link data analytics and data science to artificial
intelligence and machine learning.

• There are different types of data to consider when we face a complex problem with
lots of data.

• Apache Spark, Tableau, Snowflake, Google's machine learning stack (TensorFlow), NLP training, and deep learning experience are also part of the data science toolkit.
Placement Related/Industry Oriented

• Data preparation and analysis are the most important data science skills, but data
preparation alone typically consumes 60 to 70 percent of a data scientist’s time.

• By 2020, there will be around 40 zettabytes of data; that's 40 trillion gigabytes.
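For clarity, the unit conversion behind this figure: 1 ZB = 10^21 bytes = 10^12 GB, so 40 ZB = 40 × 10^12 GB, i.e. 40 trillion gigabytes.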

• The amount of data that exists grows exponentially.

• At any given time, about 90 percent of all existing data was generated within the most recent two years, according to sources like IBM and SINTEF.

• This means there is a huge amount of work in data science.


References

• https://www.ibm.com/cloud/learn/data-science-introduction
• https://www.edureka.co/blog/what-is-data-science/
• https://towardsdatascience.com/intro-to-data-science-531079c38b22
• https://www.omnisci.com/learn/data-science
• https://www.edureka.co/blog/data-science-applications/
Next Class Topic

In the next class, I will cover the following topics:

 Data pre-processing
 Feature extraction techniques
Thank you
