Data Preprocessing

Uploaded by

asiyashaik7867

Data Formatting
1. Data is usually collected from different places by different people, and it may be
stored in different formats.
2. Data formatting means bringing data into a common standard of expression
that allows users to make meaningful comparisons.
3. As a part of dataset cleaning, data formatting ensures that data is consistent
and easily understandable.
For example, people may use different expressions to represent New York City,
such as N.Y., Ny, NY, and New York.
Sometimes this unclean data is a good thing to see.
For example, if you're looking at the different ways people tend to write New York,
then this is exactly the data that you want.
Or if you're looking for ways to spot fraud, perhaps writing N.Y. is more likely to
predict an anomaly than if someone wrote out New York in full.
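The New York example above can be sketched in pandas as follows. This is a minimal illustration, not a prescribed method: the column names `city` and `city_raw` and the sample values are assumptions made for the example.

```python
import pandas as pd

# Hypothetical sample data; the column name "city" is an assumption.
df = pd.DataFrame({"city": ["N.Y.", "Ny", "NY", "New York"]})

# Keep the raw spelling, since it can itself be a useful signal
# (for example when studying how people write the name, or spotting fraud).
df["city_raw"] = df["city"]

# Map each known variant to one canonical label.
variants = {"N.Y.": "New York", "Ny": "New York", "NY": "New York"}
df["city"] = df["city"].replace(variants)

print(df["city"].value_counts())
```

Keeping the original column alongside the cleaned one means the formatting step does not destroy information that a later analysis may need.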
Normalization
1. Normalization enables a fairer comparison between the different features,
making sure they have the same impact.
2. It is also important for computational reasons.
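The text does not say which normalization method is meant. As a sketch, two common choices are min-max scaling and z-score standardization; the `age` and `income` columns below are made-up sample data.

```python
import pandas as pd

# Hypothetical features on very different scales.
df = pd.DataFrame({
    "age": [20, 30, 40],
    "income": [20000, 60000, 100000],
})

# Min-max scaling: every column ends up in the range [0, 1].
min_max = (df - df.min()) / (df.max() - df.min())

# Z-score standardization: every column gets mean 0 and standard deviation 1
# (pandas uses the sample standard deviation, ddof=1, by default).
z_score = (df - df.mean()) / df.std()

print(min_max)
print(z_score)
```

After either transformation, `income` no longer dominates `age` simply because its raw numbers are larger.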
Box plots:
1. Box plots are a great way to visualize numeric data, since they show the
distribution of the data at a glance.
2. The main features that the box plot shows are the median, which marks
where the middle data point is; the upper quartile, which marks the 75th
percentile; and the lower quartile, which marks the 25th percentile.
3. The data between the upper and lower quartiles represents the
inter-quartile range.
4. Next, you have the lower and upper extremes.
5. These are calculated as 1.5 times the inter-quartile range above the 75th
percentile and 1.5 times the inter-quartile range below the 25th percentile.
6. Finally, box plots also display outliers as individual dots that occur
outside the upper and lower extremes.
7. With box plots, you can easily spot outliers and also see the distribution
and skewness of the data.
8. Box plots make it easy to compare groups.
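The quantities described above can be computed directly. The sketch below uses NumPy on made-up values; the variable names are illustrative only.

```python
import numpy as np

# Hypothetical sample with one obviously large value.
values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 50])

# Lower quartile, median, and upper quartile (linear interpolation by default).
q1, median, q3 = np.percentile(values, [25, 50, 75])

# Inter-quartile range and the 1.5 * IQR extremes.
iqr = q3 - q1
lower_extreme = q1 - 1.5 * iqr
upper_extreme = q3 + 1.5 * iqr

# Points outside the extremes are the ones a box plot draws as individual dots.
outliers = values[(values < lower_extreme) | (values > upper_extreme)]

print(q1, median, q3, iqr)
print(lower_extreme, upper_extreme, outliers)
```

Note that plotting libraries may draw the whiskers slightly differently. Matplotlib, for instance, ends each whisker at the most extreme data point that still lies within these limits.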
Pivot Table and Heatmap
1. A pivot table has one variable displayed along the columns and the other variable
displayed along the rows. With just one line of code, using the Pandas pivot
method, we can pivot the body style variable so that it is displayed along the columns
and the drive wheels are displayed along the rows.
2. The price data now becomes a rectangular grid, which is easier to visualize. This is
similar to what is usually done in Excel spreadsheets.
3. Another way to represent the pivot table is with a heatmap plot. A heatmap takes a
rectangular grid of data and assigns a color intensity based on the data value at the
grid points.
4. It is a great way to plot the target variable over multiple variables and, through this,
get visual clues about the relationship between these variables and the target.
5. In this example, we use pyplot's pcolor method to plot the heatmap and convert the
previous pivot table into a graphical form.
6. We specified the red-blue color scheme.
7. In the output plot, each type of body style is numbered along the x-axis, and each
type of drive wheels is numbered along the y-axis.
8. The average prices are plotted with varying colors based on their values according to
the color bar.
9. We see that the top section of the heatmap seems to have higher prices than the bottom
section.
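The pivot and heatmap steps above can be sketched as follows. The slide's actual dataset and code are not shown, so the sample rows, the column names `drive-wheels`, `body-style` and `price`, and the helper `plot_heatmap` are all assumptions made for illustration.

```python
import pandas as pd

# Hypothetical car data; the real dataset is not included in the slides.
df = pd.DataFrame({
    "drive-wheels": ["fwd", "fwd", "rwd", "rwd", "fwd", "rwd"],
    "body-style": ["sedan", "hatchback", "sedan", "hatchback", "sedan", "sedan"],
    "price": [10000, 8000, 20000, 15000, 12000, 24000],
})

# Average price for each (drive-wheels, body-style) combination.
grouped = df.groupby(["drive-wheels", "body-style"], as_index=False)["price"].mean()

# Pivot: drive wheels along the rows, body style along the columns.
pivot = grouped.pivot(index="drive-wheels", columns="body-style", values="price")
print(pivot)


def plot_heatmap(table):
    """Draw the pivot table as a heatmap with a red-blue color scheme."""
    import matplotlib.pyplot as plt  # imported here so the pivot step runs without it

    fig, ax = plt.subplots()
    mesh = ax.pcolor(table.to_numpy(), cmap="RdBu")
    fig.colorbar(mesh, ax=ax)  # color bar maps colors back to average prices
    ax.set_xlabel("body-style (column position)")
    ax.set_ylabel("drive-wheels (row position)")
    return fig
```

Calling `plot_heatmap(pivot)` followed by `matplotlib.pyplot.show()` displays the figure. `pcolor` draws the first row of the grid at the bottom of the plot, which matters when reading "top" and "bottom" sections of the heatmap.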
