Unit 1: DATA PROCESSING AND
STATISTICS
Basics of data and its processing: record keeping, statistics and data science,
measurement scales, properties of data, visualization, cleaning the data,
symbolic data analysis. Statistics: basic statistical measures, variance and
standard deviation, visualizing statistical measures, calculating percentiles,
quartiles and box plots. Missing data handling methods: finding missing values,
dealing with missing values. Outliers: what are outliers, using Z-scores to find
outliers, the modified Z-score, using IQR to detect outliers.
Statistics & Data Science
Data science involves the collection, organization, analysis and visualization of large amounts
of data.

Statisticians, meanwhile, use mathematical models to quantify relationships between variables
and outcomes, and make predictions based on those relationships.

Statisticians typically do not use computer science, algorithms or machine learning to the
same degree as data scientists.
Data Science vs Statistics

Definition
•Data Science: an interdisciplinary branch of computer science used to gain valuable
 information from large data sets using statistics, computers and technology.
•Statistics: a mathematical science for analysing existing data pertaining to specific
 problems, applying statistical tools to this data, and presenting the results for
 decision-making.

Concept
•Data Science: the primary goal is to identify underlying trends and patterns in data
 for decision-making; it works well on both quantitative and qualitative data.
•Statistics: the primary goal is to determine cause-and-effect relationships in the
 analysed data; it is a purely mathematical approach and works only on quantitative data.

Key Steps / Key Terms
•Data Science key steps include data mining, data pre-processing, Exploratory Data
 Analysis (EDA), and model building and optimization.
•Statistics key terms include mean, median, mode, standard deviation (σ) and variance (σ²).

Important Techniques
•Data Science: regression, classification.
•Statistics: probability distribution, acceptance sampling and statistical quality control.

Application Areas
•Data Science can be applied in specialized areas like computer vision, natural language
 processing, disaster management, recommender systems and search engines.
•Statistics can be applied in areas where random variations are observed in sampled data,
 such as medicine, information technology, economics, engineering, finance, marketing,
 accounting and business.
Properties of Data
The following are the properties of data:
1) amenability of use,
2) clarity,
3) accuracy, and
4) essence.
Amenability of use: From the dictionary meaning of data, we learn that
data are facts used in deciding something. In short, data are meant to
serve as a base for arriving at definitive conclusions; data that are not
amenable to use are not required.
Clarity: Data should display the clarity that is essential for
communicating the essence of the matter. Without clarity, the meaning
intended to be communicated will remain hidden.
Accuracy: Data should be real, complete and accurate; accuracy is thus
an essential property of data. Since data offer a basis for deciding
something, they must be accurate if valid conclusions are to be drawn.
Essence: In the social sciences, large quantities of data are collected
that cannot, and need not, be presented in raw form. They have to be
compressed and refined; data so refined can present the essence, or
derived qualitative value, of the matter. In the sciences, data consist
of observations made during scientific experiments, all of which are
measured quantities. Data, thus, are always the essence of the matter.
Missing Data Handling Methods
Real-world data often contain many missing values, caused by data corruption or by
a failure to record the data. Handling missing data is an important part of
preprocessing a dataset, as many machine learning algorithms do not support missing
values.
1. Deleting rows with missing values

2. Imputing missing values for continuous variables

3. Imputing missing values for categorical variables

4. Other imputation methods

5. Using algorithms that support missing values

6. Predicting missing values

7. Imputation using a deep learning library (Datawig)


Delete Rows with Missing Values:

Missing values can be handled by deleting the rows or columns that contain null
values. If a column has more than half of its rows null, the entire column can be
dropped. Rows that have null values in one or more columns can also be dropped.
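As a minimal sketch (the small DataFrame below is made up), both strategies can be expressed with pandas' 'dropna':

```python
import numpy as np
import pandas as pd

# Made-up dataset with missing values
df = pd.DataFrame({
    "Age": [25.0, np.nan, 30.0, 45.0],
    "Income": [50000.0, 60000.0, np.nan, 80000.0],
    "Notes": [np.nan, np.nan, np.nan, "ok"],
})

# Drop every row that has at least one null value
rows_dropped = df.dropna()

# Drop any column where more than half of the rows are null:
# keep a column only if it has at least len(df)//2 + 1 non-null values
cols_dropped = df.dropna(axis=1, thresh=len(df) // 2 + 1)
```

Here 'rows_dropped' keeps only the one fully populated row, and 'cols_dropped' drops the 'Notes' column, which is null in three of its four rows.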
Replacing with an arbitrary value
If you can make an educated guess about the missing value, you can replace it with
an arbitrary value using the following code. For example, the code below replaces
the missing values of the 'Dependents' column with '0'.

IN:

#Replace the missing values with '0' using the 'fillna' method

train_df['Dependents'] = train_df['Dependents'].fillna(0)

train_df['Dependents'].isnull().sum()

OUT:

0
Replacing with the mean
Replacing with the mode
Replacing with the median
Replacing with the previous value – forward fill
Replacing with the next value – backward fill
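Each of these fill strategies can be sketched in a few lines of pandas (the Series below is made up):

```python
import numpy as np
import pandas as pd

s = pd.Series([10.0, np.nan, 30.0, np.nan, 50.0])

filled_mean = s.fillna(s.mean())      # replace with the mean (30.0 here)
filled_median = s.fillna(s.median())  # replace with the median
filled_mode = s.fillna(s.mode()[0])   # replace with the most frequent value
filled_ffill = s.ffill()              # forward fill: carry the previous value forward
filled_bfill = s.bfill()              # backward fill: pull the next value backward
```

Forward fill propagates the last observed value into each gap, while backward fill uses the next observed value; the mean/median/mode variants fill every gap with a single summary statistic.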
How to Impute Missing Values for Categorical Features?
There are two ways to impute missing values for categorical features as follows:

Impute the Most Frequent Value: We can use 'SimpleImputer' in this case; as this
is a non-numeric column, we can't use the mean or median, but we can use the most
frequent value or a constant.

Impute the Value "Missing": We can impute the value "missing," which treats it as
a separate category.
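Both strategies can be sketched with plain pandas 'fillna' (scikit-learn's 'SimpleImputer' offers the same behaviour via its 'most_frequent' and 'constant' strategies); the 'Gender' column below is a made-up example:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Gender": ["Male", "Female", np.nan, "Female"]})

# Strategy 1: impute the most frequent value (the mode of the column)
most_frequent = df["Gender"].fillna(df["Gender"].mode()[0])

# Strategy 2: impute the literal value "Missing" as a category of its own
as_missing = df["Gender"].fillna("Missing")
```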
Outliers
An outlier is an observation that lies an abnormal distance from other values in a
random sample from a population.

Outlier detection is the process of identifying (and often removing) data points in
a dataset that differ markedly from the rest of the data points.
Types of outlier detection

There are two main types of outlier detection: descriptive and prescriptive.

Descriptive outlier detection simply describes the outliers, while prescriptive
outlier detection determines what action, if any, needs to be taken based on the
outlier.
Identifying Outliers using Z-Score
A Z-score is a measure of how many standard deviations a data point lies from the
mean. Data points whose absolute Z-score exceeds a threshold (commonly 2 or 3) are
considered outliers.

Definition of Z-scores: A Z-score is calculated by subtracting the mean of the
dataset from a data point and dividing the result by the standard deviation of
the dataset. The resulting value measures how many standard deviations the data
point is from the mean.
For example, suppose we have a dataset of test scores for a group of students. The
mean score is 75 and the standard deviation is 5. If a student scored 85 on the
test, their Z-score is:

Z-score = (85 - 75) / 5 = 2
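Continuing this idea, a short sketch with Python's statistics module (the scores below are made up, and the cutoff of 2 is one common choice):

```python
import statistics

# Made-up test scores; 85 echoes the worked example above
scores = [70, 72, 75, 85, 74, 76, 73, 75]

mean = statistics.mean(scores)     # 75 for this data
stdev = statistics.pstdev(scores)  # population standard deviation

# Flag any point whose absolute z-score exceeds the threshold
z_scores = [(x - mean) / stdev for x in scores]
outliers = [x for x, z in zip(scores, z_scores) if abs(z) > 2]
```

For this data only the score of 85 exceeds the cutoff and is flagged.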


Modified Z-score
However, Z-scores can be distorted by unusually large or small data
values, which is why a more robust way to detect outliers is to use
a modified Z-score, calculated as:

Modified z-score = 0.6745(xi – x̃) / MAD


where:
•xi: A single data value
•x̃: The median of the dataset
•MAD: The median absolute deviation of the dataset
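A minimal sketch of the modified Z-score (the data values and the commonly used cutoff of 3.5 are illustrative):

```python
import statistics

data = [4, 5, 5, 6, 6, 7, 7, 8, 30]  # made-up values; 30 looks suspect

med = statistics.median(data)                          # x-tilde in the formula
mad = statistics.median([abs(x - med) for x in data])  # median absolute deviation

# Modified z-score = 0.6745 * (xi - median) / MAD
mod_z = [0.6745 * (x - med) / mad for x in data]
outliers = [x for x, z in zip(data, mod_z) if abs(z) > 3.5]
```

Because the median and MAD are barely moved by the single extreme value, 30 stands out clearly, whereas the same point would inflate an ordinary mean and standard deviation.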
Identifying Outliers using IQR (Interquartile Range): The IQR is the range between the first
quartile (Q1) and the third quartile (Q3) of the data.

Outliers are often identified as values outside the range [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR].
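A short sketch using Python's statistics.quantiles (the data are made up; note that quartile conventions differ slightly between libraries, so Q1 and Q3 may vary a little depending on the tool):

```python
import statistics

data = [12, 14, 14, 15, 16, 18, 19, 21, 22, 50]  # made-up sample

q1, q2, q3 = statistics.quantiles(data, n=4)  # the three quartile cut points
iqr = q3 - q1

# Fences at Q1 - 1.5*IQR and Q3 + 1.5*IQR
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
outliers = [x for x in data if x < lower or x > upper]
```

For this sample the fences fall at roughly 3.1 and 32.1, so only the value 50 is flagged.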
