Unit 1
STATISTICS
Basics of Data and its processing - Record Keeping, Statistics and data science,
measurement scales, properties of data, Visualization, cleaning the data,
Symbolic data analysis. Statistics - Basic Statistical Measures, Variance and
Standard Deviation, Visualizing Statistical Measures, Calculating Percentiles,
Quartiles and Box Plots. Missing data handling methods - Finding missing values,
dealing with missing values. Outliers - What are Outliers, Using Z-scores to Find
Outliers, Modified Z-score, Using IQR to Detect Outliers.
Statistics & Data Science
Data science involves the collection, organization, analysis and visualization of large amounts
of data.
Statisticians do not use computer science, algorithms or machine learning to the same degree
as computer scientists.
Data Science vs. Statistics
Definition
Data Science: An interdisciplinary branch of computer science used to gain valuable
information from large amounts of data using statistics, computers and technology.
Statistics: A mathematical science for analysing existing data pertaining to specific
problems, applying statistical tools to this data, and presenting the results for
decision-making.
Missing values can be handled by deleting the rows or columns that contain null
values. If more than half of the rows in a column are null, the entire column
can be dropped. Rows that have a null value in one or more columns
can also be dropped.
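The deletion rules above can be sketched with pandas as follows. This is a minimal illustration on a made-up DataFrame: the column names and values are assumptions and do not come from the course dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; column names and values are illustrative only.
df = pd.DataFrame({
    "Gender": ["Male", "Female", "Male", "Female"],
    "Dependents": [0, np.nan, 2, 1],
    "Notes": [np.nan, np.nan, np.nan, "ok"],
})

# Fraction of null values in each column.
null_fraction = df.isnull().mean()

# Keep only the columns in which at most half of the rows are null.
df_cols = df.loc[:, null_fraction <= 0.5]

# Drop every remaining row that has a null in one or more columns.
df_rows = df_cols.dropna(axis=0, how="any")

print(df_rows)
```

Dropping is simple but discards information, so it is usually worth checking how many rows and columns are lost before committing to it.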
Replacing with an arbitrary value
If you can make an educated guess about the missing value, you can replace it with
an arbitrary value. For example, the following code replaces
the missing values of the ‘Dependents’ column with 0 and then checks that no nulls remain.
IN:
train_df['Dependents'] = train_df['Dependents'].fillna(0)
train_df['Dependents'].isnull().sum()
OUT:
0
Replacing with the mean
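A minimal sketch of mean imputation with pandas, assuming a hypothetical numeric column named `LoanAmount` (the name and values are made up for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical numeric column with two missing entries.
df = pd.DataFrame({"LoanAmount": [100.0, np.nan, 140.0, 120.0, np.nan]})

# Series.mean() skips NaN by default, so the mean is computed
# from the observed values only and then used as the fill value.
df["LoanAmount"] = df["LoanAmount"].fillna(df["LoanAmount"].mean())

print(df["LoanAmount"].tolist())
```

The mean is sensitive to extreme values, so this tends to suit columns without strong skew.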
Replacing with the mode
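A minimal sketch of mode imputation with pandas. The toy values for the ‘Dependents’ column are an assumption made for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical column in which 0 is the most frequent observed value.
df = pd.DataFrame({"Dependents": [0, 1, np.nan, 0, 2, np.nan]})

# mode() returns a Series because several values can tie,
# so the first entry is taken as the fill value.
most_common = df["Dependents"].mode()[0]
df["Dependents"] = df["Dependents"].fillna(most_common)

print(df["Dependents"].tolist())
```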
Replacing with the median
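A minimal sketch of median imputation with pandas, assuming a hypothetical `Income` column that contains one extreme value:

```python
import numpy as np
import pandas as pd

# Hypothetical skewed column: 500.0 is far from the other values.
df = pd.DataFrame({"Income": [30.0, 35.0, np.nan, 40.0, 500.0]})

# The median of the observed values is less affected by the
# extreme value than the mean would be.
df["Income"] = df["Income"].fillna(df["Income"].median())

print(df["Income"].tolist())
```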
Replacing with the previous value – forward fill
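A minimal sketch of forward fill with pandas on a made-up ordered series. Each missing entry takes the last observed value before it:

```python
import numpy as np
import pandas as pd

# Hypothetical ordered readings with gaps.
s = pd.Series([10.0, np.nan, np.nan, 13.0, np.nan])

# Propagate the last valid observation forward into each gap.
filled_forward = s.ffill()

print(filled_forward.tolist())
```

A missing value at the very start of the series has no previous value and therefore stays missing.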
Replacing with the next value – backward fill
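A minimal sketch of backward fill with pandas on a made-up ordered series. Each missing entry takes the next observed value after it:

```python
import numpy as np
import pandas as pd

# Hypothetical ordered readings with gaps.
s = pd.Series([10.0, np.nan, np.nan, 13.0, np.nan])

# Propagate the next valid observation backward into each gap.
filled_backward = s.bfill()

print(filled_backward.tolist())
```

The last entry has no later value to copy from, so it remains missing after the fill.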
How to Impute Missing Values for Categorical Features?
There are two ways to impute missing values for categorical features:
Impute the most frequent value: We can use ‘SimpleImputer’ for this.
Because the column is non-numeric, the mean and median cannot be used, but
the most frequent value or a constant can.
Impute the value “missing”: We can fill the gaps with the value “missing”, which
treats missingness as a separate category.
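Both options can be sketched with scikit-learn's `SimpleImputer`. The `Gender` column and its values are assumptions made for illustration, and the result columns `Gender_mode` and `Gender_missing` are hypothetical names.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical categorical column with two missing entries.
df = pd.DataFrame(
    {"Gender": ["Male", "Female", np.nan, "Male", np.nan]}, dtype=object
)

# Option 1: fill with the most frequent category ("Male" here).
most_frequent = SimpleImputer(strategy="most_frequent")
df["Gender_mode"] = most_frequent.fit_transform(df[["Gender"]]).ravel()

# Option 2: fill with the constant "missing", a category of its own.
constant = SimpleImputer(strategy="constant", fill_value="missing")
df["Gender_missing"] = constant.fit_transform(df[["Gender"]]).ravel()

print(df)
```

The second option keeps the information that a value was absent, which can itself be predictive.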
Outliers
An outlier is an observation that lies an abnormal distance from other values in a
random sample from a population.
Outlier detection is a process used to identify, and where appropriate remove, data
points that differ from the rest of the data points in the dataset.
OR
Data points whose Z-score exceeds a chosen threshold in absolute value are considered outliers.
For example, suppose the mean score is 75 and the standard deviation is 5. If a student scored 85, the Z-score is (85 - 75) / 5 = 2.
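A minimal sketch of Z-score based detection with NumPy. It first computes the Z-score for the student example (mean 75, standard deviation 5, score 85) and then flags outliers in a made-up list of scores. The threshold of 2.5 and the sample data are assumptions, and the helper name `z_score_outliers` is hypothetical.

```python
import numpy as np

# Z-score for the student example: (score - mean) / standard deviation.
z_student = (85 - 75) / 5
print(z_student)

def z_score_outliers(values, threshold):
    """Return the values whose absolute Z-score exceeds the threshold."""
    values = np.asarray(values, dtype=float)
    # Population standard deviation (ddof=0) is used here.
    z = (values - values.mean()) / values.std()
    return values[np.abs(z) > threshold]

# Hypothetical scores; 120 is far from the rest.
scores = [75, 70, 80, 72, 78, 74, 76, 71, 79, 120]

# The threshold is an assumed choice, not a fixed rule.
outliers = z_score_outliers(scores, threshold=2.5)
print(outliers)
```

In small samples an extreme value inflates the mean and standard deviation used to score it, so its Z-score stays bounded. A lower threshold or a more robust method may be needed there.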
Outliers are often identified as values outside the range [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR], where Q1 and Q3 are the first and third quartiles and IQR = Q3 - Q1 is the interquartile range.
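The IQR rule can be sketched with NumPy as follows. The sample values are made up for illustration, and the quartiles use NumPy's default linear interpolation, which is one of several common conventions.

```python
import numpy as np

# Hypothetical data; 45 is far from the rest.
data = np.array([12, 14, 15, 15, 16, 17, 18, 19, 20, 45], dtype=float)

# First and third quartiles (default linear interpolation).
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1

# Fences at 1.5 * IQR below Q1 and above Q3.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Anything outside the fences is flagged as an outlier.
outliers = data[(data < lower) | (data > upper)]
print(lower, upper, outliers)
```

Because quartiles are barely affected by extreme values, this rule is less sensitive to the outliers themselves than a mean and standard deviation based rule.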