Unit 1: DATA PROCESSING AND
STATISTICS
Basics of data and its processing: record keeping, statistics and data science,
measurement scales, properties of data, visualization, cleaning the data,
symbolic data analysis. Statistics: basic statistical measures, variance and
standard deviation, visualizing statistical measures, calculating percentiles,
quartiles and box plots. Missing data handling methods: finding missing values,
dealing with missing values. Outliers: what are outliers, using Z-scores to find
outliers, the modified Z-score, using IQR to detect outliers.
Statistics & Data Science
Data science involves the collection, organization, analysis and visualization of large amounts
of data.

Statisticians, meanwhile, use mathematical models to quantify relationships between variables
and outcomes, and make predictions based on those relationships.

Statisticians typically do not use computer science, algorithms or machine learning to the
same degree as data scientists.
Data Science vs Statistics

Definition
•Data Science: an interdisciplinary branch of computer science used to gain valuable
 information from large data sets using statistics, computers and technology.
•Statistics: a mathematical science for analysing existing data pertaining to specific
 problems, applying statistical tools to this data, and presenting the results for
 decision-making.

Concept
•Data Science: the primary goal is to identify underlying trends and patterns in data
 for decision-making; it works well on both quantitative and qualitative data.
•Statistics: the primary goal is to determine cause-and-effect relationships in the
 analysed data; it is a purely mathematical approach and works only on quantitative data.

Key Steps / Key Terms
•Data Science key steps include data mining, data pre-processing, Exploratory Data
 Analysis (EDA), and model building and optimization.
•Statistics key terms include mean, median, mode, standard deviation (σ) and variance (σ²).

Important Techniques
•Data Science: regression, classification.
•Statistics: probability distribution, acceptance sampling and statistical quality control.

Application Areas
•Data Science can be applied in specialized areas like computer vision, natural language
 processing, disaster management, recommender systems and search engines.
•Statistics can be applied in areas where random variations are observed in sampled data,
 such as medicine, information technology, economics, engineering, finance, marketing,
 accounting and business.
Properties of Data
The following are the properties of data:
1) amenability of use,
2) clarity,
3) accuracy, and
4) essence.
Amenability of use: From the dictionary meaning of data, we learn that
data are facts used in deciding something. In short, data are meant to
serve as a base for arriving at definitive conclusions; data that are not
amenable to use are not required.
Clarity: Data should display the clarity that is essential for
communicating the essence of the matter. Without clarity, the meaning
intended to be communicated will remain hidden.
Accuracy: Data should be real, complete and accurate; accuracy is thus
an essential property of data. Since data offer a basis for deciding
something, they must be accurate if valid conclusions are to be drawn.
Essence: In the social sciences, large quantities of data are collected
that cannot, and need not, be presented in raw form. They have to be
compressed and refined; data so refined can present the essence, or
derived qualitative value, of the matter. In the sciences, data consist
of observations made during scientific experiments, all of which are
measured quantities. Data, thus, are always the essence of the matter.
Missing Data Handling Methods
Real-world data often contain many missing values, caused by data corruption or by
a failure to record the data. Handling missing data is an important part of
preprocessing a dataset, as many machine learning algorithms do not support missing
values.
1. Deleting rows with missing values

2. Imputing missing values for continuous variables

3. Imputing missing values for categorical variables

4. Other imputation methods

5. Using algorithms that support missing values

6. Predicting missing values

7. Imputation using a deep learning library (Datawig)


Delete Rows with Missing Values:

Missing values can be handled by deleting the rows or columns that contain null
values. If a column has more than half of its rows null, the entire column can be
dropped. Rows that have null values in one or more columns can also be dropped.
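As a minimal sketch (the small DataFrame below is made up), both strategies can be expressed with pandas' 'dropna':

```python
import numpy as np
import pandas as pd

# Made-up dataset with missing values
df = pd.DataFrame({
    "Age": [25.0, np.nan, 30.0, 45.0],
    "Income": [50000.0, 60000.0, np.nan, 80000.0],
    "Notes": [np.nan, np.nan, np.nan, "ok"],
})

# Drop every row that has at least one null value
rows_dropped = df.dropna()

# Drop any column where more than half of the rows are null:
# keep a column only if it has at least len(df)//2 + 1 non-null values
cols_dropped = df.dropna(axis=1, thresh=len(df) // 2 + 1)
```

Here 'rows_dropped' keeps only the one fully populated row, and 'cols_dropped' drops the 'Notes' column, which is null in three of its four rows.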
Replacing with an arbitrary value
If you can make an educated guess about the missing value, you can replace it with
an arbitrary value using the following code. For example, the code below replaces
the missing values of the 'Dependents' column with '0'.

IN:

#Replace the missing values with '0' using the 'fillna' method

train_df['Dependents'] = train_df['Dependents'].fillna(0)

train_df['Dependents'].isnull().sum()

OUT:

0
Replacing with the mean
Replacing with the mode
Replacing with the median
Replacing with the previous value – forward fill
Replacing with the next value – backward fill
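Each of these fill strategies can be sketched in a few lines of pandas (the Series below is made up):

```python
import numpy as np
import pandas as pd

s = pd.Series([10.0, np.nan, 30.0, np.nan, 50.0])

filled_mean = s.fillna(s.mean())      # replace with the mean (30.0 here)
filled_median = s.fillna(s.median())  # replace with the median
filled_mode = s.fillna(s.mode()[0])   # replace with the most frequent value
filled_ffill = s.ffill()              # forward fill: carry the previous value forward
filled_bfill = s.bfill()              # backward fill: pull the next value backward
```

Forward fill propagates the last observed value into each gap, while backward fill uses the next observed value; the mean/median/mode variants fill every gap with a single summary statistic.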
How to Impute Missing Values for Categorical Features?
There are two ways to impute missing values for categorical features as follows:

Impute the Most Frequent Value: We can use 'SimpleImputer' in this case; as this
is a non-numeric column, we can't use the mean or median, but we can use the most
frequent value or a constant.

Impute the Value "Missing": We can impute the value "missing," which treats it as
a separate category.
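Both strategies can be sketched with plain pandas 'fillna' (scikit-learn's 'SimpleImputer' offers the same behaviour via its 'most_frequent' and 'constant' strategies); the 'Gender' column below is a made-up example:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Gender": ["Male", "Female", np.nan, "Female"]})

# Strategy 1: impute the most frequent value (the mode of the column)
most_frequent = df["Gender"].fillna(df["Gender"].mode()[0])

# Strategy 2: impute the literal value "Missing" as a category of its own
as_missing = df["Gender"].fillna("Missing")
```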
Outliers
An outlier is an observation that lies an abnormal distance from other values in a
random sample from a population.

Outlier detection is the process of identifying (and often removing) data points in
a dataset that differ markedly from the rest of the data points.
Types of outlier detection

There are two main types of outlier detection: descriptive and prescriptive.

Descriptive outlier detection simply describes the outliers, while prescriptive
outlier detection determines what action, if any, needs to be taken based on the
outlier.
Identifying Outliers using Z-Score
A Z-score is a measure of how many standard deviations a data point lies from the
mean. Data points whose absolute Z-score exceeds a threshold (commonly 2 or 3) are
considered outliers.

Definition of Z-scores: A Z-score is calculated by subtracting the mean of the
dataset from a data point and dividing the result by the standard deviation of
the dataset. The resulting value measures how many standard deviations the data
point is from the mean.
For example, suppose we have a dataset of test scores for a group of students. The
mean score is 75 and the standard deviation is 5. If a student scored 85 on the
test, their Z-score is:

Z-score = (85 - 75) / 5 = 2
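Continuing this idea, a short sketch with Python's statistics module (the scores below are made up, and the cutoff of 2 is one common choice):

```python
import statistics

# Made-up test scores; 85 echoes the worked example above
scores = [70, 72, 75, 85, 74, 76, 73, 75]

mean = statistics.mean(scores)     # 75 for this data
stdev = statistics.pstdev(scores)  # population standard deviation

# Flag any point whose absolute z-score exceeds the threshold
z_scores = [(x - mean) / stdev for x in scores]
outliers = [x for x, z in zip(scores, z_scores) if abs(z) > 2]
```

For this data only the score of 85 exceeds the cutoff and is flagged.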


Modified Z-score
However, Z-scores can be distorted by unusually large or small data
values, which is why a more robust way to detect outliers is to use
a modified Z-score, calculated as:

Modified z-score = 0.6745(xi – x̃) / MAD


where:
•xi: A single data value
•x̃: The median of the dataset
•MAD: The median absolute deviation of the dataset
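A minimal sketch of the modified Z-score (the data values and the commonly used cutoff of 3.5 are illustrative):

```python
import statistics

data = [4, 5, 5, 6, 6, 7, 7, 8, 30]  # made-up values; 30 looks suspect

med = statistics.median(data)                          # x-tilde in the formula
mad = statistics.median([abs(x - med) for x in data])  # median absolute deviation

# Modified z-score = 0.6745 * (xi - median) / MAD
mod_z = [0.6745 * (x - med) / mad for x in data]
outliers = [x for x, z in zip(data, mod_z) if abs(z) > 3.5]
```

Because the median and MAD are barely moved by the single extreme value, 30 stands out clearly, whereas the same point would inflate an ordinary mean and standard deviation.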
Identifying Outliers using IQR (Interquartile Range): The IQR is the range between the first
quartile (Q1) and the third quartile (Q3) of the data.

Outliers are often identified as values outside the range [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR].
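A short sketch using Python's statistics.quantiles (the data are made up; note that quartile conventions differ slightly between libraries, so Q1 and Q3 may vary a little depending on the tool):

```python
import statistics

data = [12, 14, 14, 15, 16, 18, 19, 21, 22, 50]  # made-up sample

q1, q2, q3 = statistics.quantiles(data, n=4)  # the three quartile cut points
iqr = q3 - q1

# Fences at Q1 - 1.5*IQR and Q3 + 1.5*IQR
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
outliers = [x for x in data if x < lower or x > upper]
```

For this sample the fences fall at roughly 3.1 and 32.1, so only the value 50 is flagged.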
