AIDS C04-Session-19
Session 19
Session Objective
• An ability to understand Data Science.
• The data science lifecycle, also called the data science pipeline, consists of the following steps:
Step 4: Exploratory Data Analysis: Before you build a model to arrive
at a solution, it is important to analyse the data (a minimal sketch follows
after these steps).
Step 6: Data Communication: This is the final step, where you present the
results of your analysis to the stakeholders and explain how you arrived at
your conclusions and critical findings.
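A minimal EDA sketch for Step 4 in Python (pandas), assuming the data has already been collected into a hypothetical data.csv; the file name and columns are placeholders, not part of the session material.
```python
# Exploratory Data Analysis sketch: inspect shape, types, summary statistics,
# missing values and correlations before any modelling.
import pandas as pd

df = pd.read_csv("data.csv")        # hypothetical input file

print(df.shape)                     # number of rows and columns
print(df.dtypes)                    # data type of each column
print(df.describe())                # mean, std, min, max, quartiles
print(df.isnull().sum())            # missing values per column
print(df.corr(numeric_only=True))   # correlations between numeric columns
```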
Cont.
• The data science lifecycle includes anywhere from five to sixteen steps.
Prepare and maintain: This involves putting the raw data into a
consistent format for analytics, machine learning, or deep learning
models.
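A minimal preparation sketch in Python (pandas), assuming a hypothetical raw_sales.csv with inconsistent column names, mixed date formats and duplicate rows; all file and column names here are illustrative.
```python
# Put raw data into a consistent format for downstream analytics/ML models.
import pandas as pd

raw = pd.read_csv("raw_sales.csv")                      # hypothetical raw extract

raw.columns = [c.strip().lower().replace(" ", "_")      # normalise column names
               for c in raw.columns]
raw = raw.drop_duplicates()                             # remove duplicate records
raw["order_date"] = pd.to_datetime(raw["order_date"],   # unify date formats
                                   errors="coerce")
raw["amount"] = pd.to_numeric(raw["amount"],            # force numeric type
                              errors="coerce")
raw = raw.dropna(subset=["order_date", "amount"])       # drop unparseable rows

raw.to_csv("sales_clean.csv", index=False)              # consistent output for models
```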
Cont.
[Quadrant diagram: Known Data vs. Unknown Data on one axis; Others' decisions vs. Your decisions on the other]
Data Science Tools
• To build and run code in order to create models, the most popular programming
languages are open-source tools that include or support pre-built statistical, machine
learning and graphics capabilities. These languages include R and Python.
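For instance, a short Python sketch using pre-built statistical, machine learning and graphics libraries (NumPy, scikit-learn, matplotlib); the data is synthetic and purely illustrative.
```python
# Fit a simple model and plot it using Python's pre-built libraries.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 100).reshape(-1, 1)     # synthetic feature
y = 3.0 * x.ravel() + rng.normal(0, 2, 100)    # noisy linear target

model = LinearRegression().fit(x, y)           # pre-built ML capability
print("slope:", model.coef_[0], "intercept:", model.intercept_)

xs = np.sort(x, axis=0)
plt.scatter(x, y, s=10)                        # pre-built graphics capability
plt.plot(xs, model.predict(xs), color="red")
plt.title("Linear fit with scikit-learn")
plt.show()
```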
• Big data is a collection of massive and complex data sets.
• Big data is about data volume: large data sets measured in terms of
terabytes or petabytes.
• The process of examining big data to extract useful insights is known as Big Data analytics.
• Doug Laney introduced the concept of the 3 Vs of Big Data, viz. Volume, Variety, and
Velocity.
Volume: refers to the amount of data being collected (the data could be
structured or unstructured).
Variety: refers to the different kinds of data (data types, formats, etc.) that
come in for analysis.
Velocity: refers to the speed at which data is generated and processed.
Cont.
Data Science and Data Analytics deal with Big Data, each
taking a unique approach.
Data analytics is mainly concerned with Statistics,
Mathematics, and Statistical Analysis.
Cont.
• Data science and data analytics are both ways of understanding big data, and
both often involve analyzing massive databases using R and Python.
• SAS/ACCESS engines are tightly integrated and used by all SAS solutions for third-
party data integration; supported integration standards include ODBC, JDBC, Spark
SQL (on SAS Viya) and OLE DB.
• Internet users generate about 2.5 quintillion bytes of data every day. By 2020, every
person on Earth was generating about 146,880 MB (roughly 147 GB) of data every day,
and by 2025 the world is projected to create about 165 zettabytes of data every year.
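A quick back-of-the-envelope check of the per-person figure, assuming the commonly quoted 2020 estimate of roughly 1.7 MB of data created per person per second:
```python
# Rough unit check for the per-person data-creation figure.
mb_per_second = 1.7                      # commonly quoted 2020 estimate (assumption)
seconds_per_day = 24 * 60 * 60           # 86,400 seconds in a day
mb_per_day = mb_per_second * seconds_per_day
print(mb_per_day)                        # 146880.0 MB, i.e. about 147 GB per person per day
```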
Lab/Skilling
What if we could predict the occurrence of diabetes and take appropriate measures
beforehand to prevent it?
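A minimal sketch of such a prediction model in Python, assuming the widely used Pima Indians Diabetes data is available locally as diabetes.csv with a binary Outcome column; the file name and columns are assumptions, not part of the lab handout.
```python
# Predict the occurrence of diabetes with a simple classifier (illustrative only).
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

df = pd.read_csv("diabetes.csv")                 # assumed Pima-style dataset
X = df.drop(columns=["Outcome"])                 # clinical features
y = df["Outcome"]                                # 1 = diabetic, 0 = non-diabetic

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

model = LogisticRegression(max_iter=1000)        # simple baseline model
model.fit(X_train, y_train)

print("test accuracy:", accuracy_score(y_test, model.predict(X_test)))
```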
Conclusion
• We should be careful and not directly link data analytics and data science to artificial
intelligence and machine learning.
• There are different types of data to consider when we face a complex problem with
lots of data.
• Apache Spark, Tableau, Snowflake, Google's machine learning stack (TensorFlow),
NLP training and deep learning experience are all part of the data
science toolkit.
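As a small illustration of one of these tools, a minimal Apache Spark (PySpark) sketch, assuming Spark is installed and a hypothetical events.csv with an event_type column is available.
```python
# Count and summarise records with Apache Spark's DataFrame API.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("toolkit-demo").getOrCreate()

df = spark.read.csv("events.csv", header=True, inferSchema=True)  # hypothetical file
print(df.count())                                # total number of records
df.groupBy("event_type").count().show()          # hypothetical event_type column
spark.stop()
```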
Placement Related/Industry Oriented
• Data preparation and analysis are the most important data science skills, but data
preparation alone typically consumes 60 to 70 percent of a data scientist’s time.
• At any given time, about 90 percent of all existing data has been generated in the most
recent two years, according to sources such as IBM and SINTEF.
• https://www.ibm.com/cloud/learn/data-science-introduction
• https://www.edureka.co/blog/what-is-data-science/
• https://towardsdatascience.com/intro-to-data-science-531079c38b22
• https://www.omnisci.com/learn/data-science
• https://www.edureka.co/blog/data-science-applications/
Next Class Topic
Data pre-processing
Feature extraction techniques
Thank you