Data Science Lecture No 02
AI 7th, SEN 5th
10/22/2024 1
Data Science
Lecture Contents
❑Data Science
❑Understanding Data Science
❑Exploratory Data Analysis
Data Science
❑Data Science
▪ Data science is the application of computational and statistical techniques to
address or gain insight into some problem in the real world
▪ Data science = statistics +
data processing +
machine learning +
scientific inquiry +
visualization +
business analytics +
big data + …
CRISP process
CRoss-Industry Standard Process for Data Mining
(CRISP-DM)
Data Science Process Step
Understanding data science
❑Data requirements:
▪ An organization can draw data from many sources. It is important to understand which types of data the
organization needs to collect, curate, and store.
▪ It is also necessary to categorize the data (numerical or categorical) and to decide on the format for storage
and dissemination.
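The numerical versus categorical split can be sketched in a few lines. This is a minimal illustration using only the Python standard library. The records, the field names, and the `classify_columns` helper are hypothetical and not part of the lecture material.

```python
# Hypothetical records; the field names and values are illustrative only.
records = [
    {"customer_id": 1, "age": 34, "city": "Lahore", "monthly_spend": 120.5},
    {"customer_id": 2, "age": 27, "city": "Karachi", "monthly_spend": 80.0},
    {"customer_id": 3, "age": 45, "city": "Lahore", "monthly_spend": 200.25},
]


def classify_columns(rows):
    """Label each column 'numerical' or 'categorical' from its observed values."""
    kinds = {}
    for column in rows[0]:
        values = [row[column] for row in rows if row[column] is not None]
        # bool is a subclass of int in Python, so exclude it explicitly.
        is_numeric = all(
            isinstance(v, (int, float)) and not isinstance(v, bool) for v in values
        )
        kinds[column] = "numerical" if is_numeric else "categorical"
    return kinds


column_kinds = classify_columns(records)
print(column_kinds)
```

A type-based check like this is only a first pass. An identifier such as `customer_id` is stored as a number but behaves like a category, so the final labeling still needs human judgment.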
❑ Data collection:
▪ Data collected from several sources must be stored in the correct format and transferred to the right information
technology personnel within the company. Data can be collected from many objects and events using
different types of sensors and storage tools.
❑Data processing:
▪ Preprocessing is the curation of the dataset before the actual analysis. Common tasks include
correctly exporting the dataset, placing the data in the right tables, structuring it, and saving it
in the correct format.
❑Data cleaning:
▪ Preprocessed data is still not ready for detailed analysis. It must first be correctly transformed and then checked for
incompleteness, duplicates, errors, and missing values. These tasks are performed in the data cleaning stage, which
involves matching the correct records, finding inaccuracies in the dataset, assessing the overall
data quality, removing duplicate items, and filling in missing values.
▪ How can we identify these anomalies in a dataset?
▪ One example of data cleaning is applying outlier detection methods to quantitative data.
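The cleaning tasks above (removing duplicates, filling missing values, detecting outliers) can be sketched as follows. This is one possible approach, not the only one. The readings are made up, median imputation is an assumed choice, and the 1.5 × IQR rule stands in for whichever outlier detection method is preferred.

```python
import statistics

# Hypothetical sensor readings as (record_id, value); None marks a missing value.
raw = [(1, 10.2), (2, 9.8), (2, 9.8), (3, None), (4, 10.5),
       (5, 55.0), (6, 10.1), (7, 9.9), (8, 10.4)]

# 1. Remove exact duplicates while preserving the original order.
deduplicated = list(dict.fromkeys(raw))

# 2. Fill missing values with the median of the observed values (assumed strategy).
observed = [value for _, value in deduplicated if value is not None]
median = statistics.median(observed)
filled = [(rid, median if value is None else value) for rid, value in deduplicated]

# 3. Flag outliers with the 1.5 * IQR rule.
values = [value for _, value in filled]
q1, _, q3 = statistics.quantiles(values, n=4)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [(rid, value) for rid, value in filled if not lower <= value <= upper]
print(outliers)
```

On this sample, only the reading of 55.0 falls outside the IQR fences. Whether a flagged value is an error or a genuine extreme observation has to be decided from domain knowledge.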
❑ EDA:
▪ Exploratory data analysis (EDA) is the stage where we actually start to understand the message contained in the data.
❑Communication:
▪ This stage deals with disseminating the results to end stakeholders so that they can use them for business
intelligence. One of the most notable steps in this stage is data visualization.
▪ Visualization relays information through techniques such as tables, charts (for example, bar charts),
and summary diagrams to show the analyzed results.
Prior Knowledge
Gaining information on:
Data Preparation / Data exploration
Data Exploration
Data quality
Handling missing values
Data type conversion
Transformation
Outliers
Feature selection
Sampling
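Three of the preparation steps listed above (data type conversion, transformation, and sampling) can be illustrated with a short standard-library sketch. The rows and field names are hypothetical. Min-max scaling is just one example of a transformation, and the seed is fixed only to make the sample reproducible.

```python
import random

# Hypothetical raw rows as they might arrive from a text export: every field is a string.
raw_rows = [
    {"age": "34", "income": "52000"},
    {"age": "27", "income": "31000"},
    {"age": "45", "income": "88000"},
    {"age": "52", "income": "61000"},
    {"age": "39", "income": "47000"},
]

# Data type conversion: parse the text fields into numbers.
rows = [{"age": int(r["age"]), "income": float(r["income"])} for r in raw_rows]

# Transformation: min-max scale income into the range [0, 1].
incomes = [r["income"] for r in rows]
low, high = min(incomes), max(incomes)
for r in rows:
    r["income_scaled"] = (r["income"] - low) / (high - low)

# Sampling: draw a reproducible random sample without replacement.
rng = random.Random(42)  # fixed seed chosen only for reproducibility
sample = rng.sample(rows, k=3)
print(sample)
```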
Introduction to Exploratory Data Analysis (EDA)
Key aspects of EDA
❑Correlation Analysis
▪ Checking the relationships between variables to understand how they might affect each other. This
includes computing correlation coefficients and creating correlation matrices.
❑ Summary Statistics
▪ Calculating key statistics that provide insight into data trends and nuances
❑Testing Assumptions
▪ Many statistical tests and models assume the data meet certain conditions (such as normality).
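Summary statistics and a correlation matrix, two of the aspects listed above, can be computed as in the sketch below. The dataset and variable names are invented for illustration. The `summarize` and `pearson` helpers are hypothetical names, written out with the standard library so the calculation is visible.

```python
import math
import statistics

# Hypothetical dataset: three numeric variables observed on the same six students.
data = {
    "hours_studied": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "exam_score": [52.0, 55.0, 61.0, 68.0, 70.0, 78.0],
    "absences": [9.0, 7.0, 6.0, 4.0, 3.0, 1.0],
}


def summarize(values):
    """Return a few key summary statistics for one variable."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),  # sample standard deviation
        "min": min(values),
        "max": max(values),
    }


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    mean_x, mean_y = statistics.mean(x), statistics.mean(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


summary = {name: summarize(values) for name, values in data.items()}
correlation_matrix = {a: {b: pearson(data[a], data[b]) for b in data} for a in data}

for name, row in correlation_matrix.items():
    print(name, {other: round(r, 3) for other, r in row.items()})
```

In this made-up data, study hours correlate strongly and positively with exam score and strongly and negatively with absences. Correlation alone does not show that one variable causes the other.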
Why Is Exploratory Data Analysis Important?
Exploratory Data Analysis (EDA) is a critical step in the data analysis process, especially in data
science and statistical modeling. Key reasons include:
▪ Understanding Data Structures
▪ Identifying Patterns and Relationships
▪ Detecting Anomalies and Outliers
▪ Testing Assumptions
▪ Informing Feature Selection and Engineering
▪ Optimizing Model Design
▪ Facilitating Data Cleaning
▪ Enhancing Communication
EDA Importance
❑Understanding Data Structures
o EDA helps in getting familiar with the dataset, understanding the number of features, the type of data in each
feature, and the distribution of data points. This understanding is crucial for selecting appropriate analysis or
prediction techniques.
❑ Enhancing Communication
o Visual and statistical summaries from EDA can make it easier to communicate findings and convince others of the
validity of your conclusions, particularly when explaining data-driven insights to stakeholders without technical
backgrounds.
Traditional Vs Machine Learning Model
https://www.ranker.com/crowdranked-list/best-jobs-in-the-world
https://www.panelplace.com/blogs/top-8-coolest-jobs-world
Data Science process
Thank You !