Expt 2

The document outlines an experiment to create an 'Academic performance' dataset and perform data wrangling operations in Python, including handling missing values, detecting outliers, and applying data transformations. It emphasizes the importance of data wrangling for effective data analysis and decision-making. Additionally, it describes various methods for outlier detection, including statistical, proximity, and projection methods.

Uploaded by

chincholkar.sam

Experiment No.

Create an “Academic performance” dataset of students and perform the following operations
using Python.

1. Scan all variables for missing values and inconsistencies. If there are missing values and/or
inconsistencies, use any of the suitable techniques to deal with them.
2. Scan all numeric variables for outliers. If there are outliers, use any of the suitable techniques
to deal with them.
3. Apply data transformations on at least one of the variables. The purpose of this
transformation should be one of the following reasons: to change the scale for better
understanding of the variable, to convert a non-linear relation into a linear one, or to decrease
the skewness and convert the distribution into a normal distribution.
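
Operation 3 can be illustrated with a minimal sketch. The variable name study_hours and its values are assumptions made up for this example, and the log and min-max transformations are only two of several suitable choices.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed variable: weekly study hours for ten students.
hours = pd.Series([1, 2, 2, 3, 3, 4, 5, 8, 15, 40],
                  name="study_hours", dtype=float)

# log1p computes log(1 + x), so a value of zero would stay defined.
# Compressing the long right tail reduces the skewness.
log_hours = np.log1p(hours)

# Min-max scaling changes the scale to [0, 1] but not the shape.
scaled_hours = (hours - hours.min()) / (hours.max() - hours.min())

skew_before = hours.skew()
skew_after = log_hours.skew()
print(f"skewness before: {skew_before:.2f}, after log1p: {skew_after:.2f}")
```

A skewness closer to zero after the transformation indicates a more symmetric distribution; it does not by itself guarantee that the result is normal.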

Data Wrangling in Python

Data wrangling is the process of gathering and transforming raw data into another
format so that it can be understood, accessed, and analyzed in less time and used for better
decision-making. Data wrangling is also known as data munging.
Importance Of Data Wrangling
Data wrangling is a very important step. The following example illustrates why:
A book-selling website wants to show the top-selling books of different domains according to user
preference. For example, when a new user searches for motivational books, the website wants to show
the motivational books that sell the most or have a high rating. Doing so requires the raw sales
and rating data to be cleaned and organized first.

Data wrangling in Python deals with the following functionalities:

1. Data exploration: In this process, the data is studied, analyzed, and understood by
visualizing representations of the data.
2. Dealing with missing values: Most large datasets contain missing values, represented
as NaN. They need to be taken care of, either by replacing them with the mean or the mode
(the most frequent value) of the column, or simply by dropping the rows that contain
a NaN value.
3. Reshaping data: In this process, data is manipulated according to the requirements,
where new data can be added or pre-existing data can be modified.
4. Filtering data: Sometimes datasets contain unwanted rows or columns, which
need to be removed or filtered out.
5. Other: After processing the raw dataset with the above functionalities, we get an
efficient dataset that meets our requirements and can be used for purposes such as
data analysis, machine learning, data visualization, and model training.
As a first step on a raw dataset, data exploration consists of assigning the data and then
viewing it in a tabular format.
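
The functionalities above can be sketched with pandas as follows. This is only an illustration: the column names and values are hypothetical, and mean and mode imputation are one possible choice among the techniques listed.

```python
import numpy as np
import pandas as pd

# Hypothetical "Academic performance" records; names and values are made up.
data = {
    "student": ["Asha", "Ben", "Chen", "Dia", "Eli", "Fay"],
    "math_score": [78, 85, np.nan, 92, 67, 88],
    "reading_score": [82, np.nan, 75, 95, 70, 90],
    "gender": ["F", "M", "M", "f", "M", None],
}
df = pd.DataFrame(data)

# Data exploration: view the table and count missing values per column.
print(df.head())
missing_before = df.isna().sum()
print(missing_before)

# Inconsistencies: "f" and "F" mean the same thing, so normalise the case.
df["gender"] = df["gender"].str.upper()

# Missing values: mean for numeric columns, mode for the categorical column.
for col in ["math_score", "reading_score"]:
    df[col] = df[col].fillna(df[col].mean())
df["gender"] = df["gender"].fillna(df["gender"].mode()[0])

# Filtering: keep only the rows that satisfy a condition.
high_scorers = df[df["math_score"] >= 80]
print(high_scorers)
```

Dropping incomplete rows with df.dropna() is the alternative mentioned in the list; it is simpler but discards the other values in those rows.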
Outlier detection is usually performed in the exploratory data analysis stage of a data
science project, and how we decide to deal with outliers determines how well or how poorly
the model performs for the business problem at hand. The model, and hence the entire
workflow, is greatly affected by the presence of outliers.

Outlier Detection Methods

1. Statistical Methods

Simply starting with a visual analysis of the data using box plots (box-and-whisker plots),
scatter plots, etc., can help in finding the extreme values in the data. Assuming a normal
distribution, calculate the z-score, which is the number of standard deviations (σ) a data point
lies from the sample's mean. The Empirical Rule says that 68% of the data falls within one
standard deviation, 95% within two standard deviations, and 99.7% within three standard
deviations of the mean, so we can identify data points that are more than three standard
deviations from the mean as outliers. Another way is to use the interquartile range
(IQR = Q3 - Q1) as a criterion and treat as outliers the values that lie more than 1.5 times
the IQR below the first quartile or above the third quartile.
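
Both rules can be sketched with NumPy. The scores below are hypothetical, and the cut-offs (3 standard deviations, 1.5 times the IQR) are the conventional values mentioned above rather than requirements.

```python
import numpy as np

# Hypothetical exam scores with one suspicious entry (250).
scores = np.array([62, 65, 68, 70, 71, 72, 73, 74, 75, 76,
                   77, 78, 79, 80, 81, 82, 84, 86, 88, 250], dtype=float)

# z-score rule: number of standard deviations each value lies from the mean.
z = (scores - scores.mean()) / scores.std()
z_outliers = scores[np.abs(z) > 3]

# IQR rule: values beyond 1.5 * IQR below Q1 or above Q3.
q1, q3 = np.percentile(scores, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
iqr_outliers = scores[(scores < lower) | (scores > upper)]

print("z-score outliers:", z_outliers)
print("IQR outliers:", iqr_outliers)
```

In very small samples the z-score rule can miss an outlier, because the outlier itself inflates the standard deviation; the IQR rule is less sensitive to this.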

2. Proximity Methods

Proximity-based methods often deploy clustering techniques to identify the clusters in the data
and find the centroid of each cluster. They assume that an object is an outlier if the nearest
neighbors of the object are far away in feature space; that is, the proximity of the object to its
neighbors significantly deviates from the proximity of most of the other objects to their neighbors
in the same data set. The usual approach is as follows: fix a threshold, evaluate the
distance of each data point from its cluster centroid, remove the data points whose distance
exceeds the threshold, and then go ahead with the modeling.

Proximity-based methods are classified into two types. Distance-based methods judge a data
point based on the distance(s) to its neighbors. Density-based methods determine the degree of
outlierness of each data instance based on its local density. DBSCAN is an example of a
density-based technique; k-means and hierarchical clustering are also used for outlier
detection, but they are not density-based.
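
A distance-based check can be sketched with NumPy alone. The function name knn_distance_scores, the sample points, and the threshold of three times the median score are all assumptions chosen for illustration, not a standard rule.

```python
import numpy as np

# Hypothetical (attendance %, exam score) pairs; the last one is isolated.
points = np.array([[90, 85], [88, 82], [92, 88], [85, 80],
                   [91, 86], [87, 84], [89, 83], [30, 95]], dtype=float)

def knn_distance_scores(X, k=3):
    """Mean Euclidean distance from each row of X to its k nearest neighbors."""
    diffs = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diffs ** 2).sum(axis=2))
    # After sorting each row, column 0 is the zero distance of a point to itself.
    nearest = np.sort(dist, axis=1)[:, 1:k + 1]
    return nearest.mean(axis=1)

knn_scores = knn_distance_scores(points, k=3)

# Illustrative threshold: flag points whose score is far above the typical one.
threshold = 3 * np.median(knn_scores)
outlier_mask = knn_scores > threshold
print(points[outlier_mask])
```

The same idea applies to the centroid-based approach: replace the neighbor distances with each point's distance to its cluster centroid and compare against a fixed threshold.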

3. Projection Methods

Projection methods use techniques such as principal component analysis (PCA) to model the data
in a lower-dimensional subspace using linear correlations. After that, the distance of each data
point to the plane that fits the subspace is calculated. This distance can then be used to find
the outliers. Projection methods are simple and easy to apply and can highlight extraneous values.

The PCA-based method approaches a problem by analyzing the available features to determine
what constitutes a "normal" class. The method then applies distance metrics to identify cases
that represent anomalies.
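
The projection idea can be sketched with NumPy by computing the principal directions through a singular value decomposition. The two-subject scores are hypothetical, and keeping a single component is an assumption that suits this small two-dimensional example.

```python
import numpy as np

# Hypothetical scores in two strongly correlated subjects; the last row
# breaks the pattern.
X = np.array([[60, 62], [65, 66], [70, 71], [75, 74],
              [80, 81], [85, 86], [90, 89], [70, 95]], dtype=float)

# Center the data; the rows of Vt are then the principal directions.
Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)

# Keep only the first principal component (a one-dimensional subspace).
components = Vt[:1]
projected = Xc @ components.T @ components

# Distance of each point from the subspace (the reconstruction error).
errors = np.linalg.norm(Xc - projected, axis=1)
suspect = int(np.argmax(errors))
print("largest error at row", suspect, "->", X[suspect])
```

Points with a large reconstruction error do not follow the linear pattern captured by the retained components, which is what marks them as candidates for outliers.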
Conclusion: Hence we have studied how to perform the above operations using
Python on a created dataset (e.g. data.csv / Dictionary).
