REPORT - Assignment 1

This document summarizes a student project analyzing a dataset of 12,000 car entries. The students used Python and pandas to load and explore the dataset. They investigated central tendencies and dispersion, identified and removed duplicate and null values, and detected and removed outliers. Visualizations were created to better understand patterns in the data. The overall goal was to clean the data and gain insights through exploratory analysis using various Python libraries.

Uploaded by

hardik solanki

Data Visualization and Exploratory Data Analysis

Katleen Ezekeil Orata (c0848019), Hardik Solanki (0852302), Mohammad Imran Uddin (c0800487)
Artificial Intelligence & Machine Learning Program
Lambton College
Toronto, Canada

Abstract—This electronic document is a “live” template and already defines the components of your paper [title, text, heads, etc.] in its style sheet. *CRITICAL: Do Not Use Symbols, Special Characters, Footnotes, or Math in Paper Title or Abstract. (Abstract)

Keywords—data preprocessing, exploratory data analysis, central tendency, dispersion, outlier, visualization

I. INTRODUCTION
The students were tasked with investigating a dataset (data.csv) that contains 12,000 observations with 16 attributes and performing an exploratory data analysis. The purpose of the assignment is for the students to explore Python libraries that can be applied to the analysis, handle typical data errors, and illustrate patterns and insights from the data.

II. DATASET
The dataset contains 12,000 entries and 16 columns. Each column describes a feature of the cars, such as make, model, engine fuel type, fuel type, popularity, etc.

III. DATA LOADING AND OVERVIEW
To read and explore the dataset, we imported the Python pandas library and used its methods to load the dataset and perform an initial inspection.

The .head() and .tail() methods were used to inspect the head and the tail, that is, the first and last rows of the dataset. This gave the students a quick look at the data, helping them form hypotheses and indicating the type of analysis they could conduct. In the output of .head() and .tail(), the students also spotted some duplicated rows, as seen in Figure 2.

Figure 3: Result using the .info() method

The .info() and .describe() methods were also used during the initial stages of the analysis. They gave the students insight not only into the number of observations and the total number of features but also into the data types and the counts of non-null and null values.
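The loading and inspection steps described above can be sketched as follows. This is a minimal illustration rather than the students' actual code: the inline CSV text and its column names are hypothetical stand-ins for data.csv, used only so that the example runs on its own.

```python
import io

import pandas as pd

# Hypothetical stand-in for data.csv so the sketch runs on its own;
# column names and values are illustrative, not the real dataset.
CSV_TEXT = """make,model,engine_fuel_type,popularity,price
Audi,A4,premium unleaded,3105,34000
BMW,X5,diesel,3916,
Ford,Focus,regular unleaded,5657,18000
Ford,Focus,regular unleaded,5657,18000
Kia,Rio,,1720,14000
Audi,A4,premium unleaded,3105,34000
"""

# With the real file this would be: df = pd.read_csv("data.csv")
df = pd.read_csv(io.StringIO(CSV_TEXT))

first_rows = df.head(3)  # first three rows (the default is five)
last_rows = df.tail(3)   # last three rows (the default is five)
print(first_rows)
print(last_rows)

df.info()                      # prints dtypes and non-null counts per column
summary = df.describe()        # summary statistics for the numeric columns
null_counts = df.isna().sum()  # number of missing values per column
print(summary)
print(null_counts)
```

Calling .isna().sum() is an addition to the methods named in the text; it gives the null counts directly instead of inferring them from the non-null counts that .info() prints.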

Figure 1: The head or the first five rows of the dataset
Figure 2: The tail or the last five rows of the dataset

IV. HANDLING DUPLICATE AND NULL VALUES

Many real-world datasets contain incomplete and erroneous information, which lowers their quality. One form of inaccurate data is duplication: a row is considered a duplicate when all of its values match all of the values in another row. To manage this, the team used the pandas duplicated() method to count the duplicate rows in the dataset and then used the drop_duplicates() method to remove 801 duplicate rows.

Figure 3: The head or the first five rows of the dataset
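The duplicate-handling step can be illustrated with a short sketch. The toy rows and column names below are hypothetical; only duplicated() and drop_duplicates() come from the report, and the default behaviour shown here (keeping the first occurrence) is an assumption about how the students called them.

```python
import pandas as pd

# Hypothetical toy rows; the real dataset is data.csv with 16 columns.
df = pd.DataFrame(
    {
        "make": ["Audi", "Ford", "Ford", "Kia", "Audi"],
        "model": ["A4", "Focus", "Focus", "Rio", "A4"],
        "popularity": [3105, 5657, 5657, 1720, 3105],
    }
)

# duplicated() flags every row that repeats an earlier row in all columns.
duplicate_mask = df.duplicated()
n_duplicates = int(duplicate_mask.sum())
print(f"duplicate rows: {n_duplicates}")

# drop_duplicates() keeps the first occurrence of each row by default.
deduped = df.drop_duplicates().reset_index(drop=True)
print(deduped)
```

Counting before dropping makes it possible to check that the number of removed rows matches the number of flagged rows.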
Another form of poor-quality data is the existence of null or missing values. Before handling the NaN values, the team decided to review the distribution and dispersion of the dataset, which is discussed in Section V of the paper. The insight from this analysis will be used to decide which imputation technique is best suited to the dataset.

V. MEASURE OF CENTRAL TENDENCY
Descriptive statistics are one of the foundations of advanced analytics and data science. They are measurements that summarize a set of data and can be further subdivided into measures of central tendency and measures of dispersion.

The students used pandas' built-in functions to compute measures of central tendency and variability. The measures include mean, median, mode, standard deviation, variance, and skewness.

VI. OUTLIER DETECTION AND REMOVAL

The term "outlier" refers to a data point or observation that deviates significantly from the rest of the data set. Outliers can distort statistical results by having a large impact on statistics such as the mean and other measures of central tendency. In addition, they can mislead machine learning model training, leading to longer training times, less accurate models, and ultimately subpar outcomes.

To detect and handle outliers, the students used the interquartile range (IQR).
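The descriptive measures and the interquartile-range rule discussed above can be sketched together on a single column. The price values are hypothetical, and the 1.5 multiplier is the conventional choice rather than something the report states; treat both as assumptions.

```python
import pandas as pd

# Hypothetical price column; values are illustrative only.
prices = pd.Series(
    [14000, 18000, 18000, 21000, 25000, 34000, 250000], name="price"
)

# Central tendency and dispersion with pandas built-ins.
stats = {
    "mean": prices.mean(),
    "median": prices.median(),
    "mode": prices.mode().iloc[0],  # mode() returns a Series; take the first mode
    "std": prices.std(),            # sample standard deviation (ddof=1)
    "var": prices.var(),            # sample variance (ddof=1)
    "skew": prices.skew(),
}
print(stats)

# IQR rule with the common 1.5 multiplier (an assumption; the report
# does not state which multiplier the students used).
q1 = prices.quantile(0.25)
q3 = prices.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Values outside [lower, upper] are treated as outliers and removed.
is_outlier = (prices < lower) | (prices > upper)
cleaned = prices[~is_outlier]
print(f"bounds: [{lower}, {upper}], removed {int(is_outlier.sum())} value(s)")
```

In this toy column the single extreme value pulls the mean well above the median, which is the kind of distortion the outlier section describes.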

A. Data Loading and Overview


VII. DATA VISUALIZATION
