Data Science Unit 2-11-08 2023
There are five key plots that are used for data visualization.
Matplotlib
Backend layer
The backend layer is the bottom layer of the architecture; it consists of the
implementations of the various functions that are necessary for plotting.
There are three essential classes in the backend layer:
FigureCanvas (the surface on which the figure is drawn),
Renderer (the class that takes care of drawing on the surface), and
Event (which handles mouse and keyboard events).
Artist Layer
The artist layer is the second layer in the architecture.
It is responsible for the various plotting elements, such as the axes, and
coordinates how the renderer is used on the figure canvas.
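As a hedged illustration (not from the original slides), the backend and artist layers can be used directly, without the scripting layer; this sketch assumes the Agg backend:

# Minimal sketch: drawing through the backend and artist layers directly
from matplotlib.figure import Figure
from matplotlib.backends.backend_agg import FigureCanvasAgg

fig = Figure()                           # top-level Artist container
canvas = FigureCanvasAgg(fig)            # FigureCanvas: the drawing surface
ax = fig.add_subplot(1, 1, 1)            # Axes: the Artist that manages plotting
ax.plot([1, 2, 3], [2, 4, 6])
ax.set_title("Drawn via the artist layer")
canvas.print_figure("artist_layer.png")  # the Renderer draws onto the surface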
Scripting layer
The scripting layer is the topmost layer on which most of our code will
run.
The methods in the scripting layer almost automatically take care of the
other layers; all we need to care about is the current state (figure & axes).
Example:
We will be plotting two lists containing the X, Y coordinates for the plot.
Pyplot:
Pyplot is a Matplotlib module that provides a MATLAB-like interface.
Pyplot provides functions that interact with the figure, i.e., create a figure, decorate
the plot with labels, and create a plotting area in a figure.
Syntax:
matplotlib.pyplot.plot(*args, scalex=True, scaley=True, data=None, **kwargs)
SAMPLE CODE
import matplotlib.pyplot as plt

# initializing the data
x = [10, 20, 30, 40]
y = [20, 30, 40, 50]

# plotting the data
plt.plot(x, y)

# Adding the title
plt.title("Simple Plot")

# Adding the labels
plt.xlabel("x-axis")
plt.ylabel("y-axis")
plt.show()
In the above example, the elements of x and y provide the coordinates for the x-axis and
y-axis, and a straight line is plotted against those coordinates.
E.g. Matplotlib
• Matplotlib makes easy things easy and hard things possible.
• You can generate plots, histograms, power spectra, bar
charts, error charts, scatter plots, etc., with just a few lines
of code.
Bar Charts
A bar plot or bar chart is a graph that represents categories of data
with rectangular bars whose lengths or heights are proportional to the
values they represent.
Bar plots can be plotted horizontally or vertically.
A bar chart describes comparisons between discrete categories.
It can be created using the bar() method.
A legend (often in the upper right corner) is used to describe the elements for a particular
area of a graph.
Matplotlib has a function called legend() which is used to place a legend on the plot, as in the sketch below.
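A minimal sketch of a vertical bar chart with a legend; the categories and values are hypothetical:

import matplotlib.pyplot as plt

categories = ["A", "B", "C", "D"]
values = [10, 24, 36, 18]

plt.bar(categories, values, label="Sales")  # vertical bars; plt.barh() draws horizontal ones
plt.legend(loc="upper right")               # place the legend in the upper right corner
plt.title("Simple Bar Chart")
plt.xlabel("Category")
plt.ylabel("Value")
plt.show()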
Bar Charts
Common types: horizontal bar chart, vertical bar chart, grouped bar chart, and stacked bar chart.
Bar Charts
Advantages:
Summarize large data sets
Performance tracking
Accessible to all audiences
Disadvantages:
Too simple
Too easily manipulated
Line Charts
A line plot or line graph is a graph that represents a category
of data with lines that are proportional to the values they
represent.
Line plots can be plotted horizontally or vertically.
A line chart describes comparisons between discrete
categories.
It can be created using the plot() method.
Types of Line Graphs
SYNTAX:
plt.plot(x, y)
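A minimal sketch of a line chart using plot(); the data is hypothetical:

import matplotlib.pyplot as plt

x = [1, 2, 3, 4, 5]
y = [2, 4, 1, 5, 3]

plt.plot(x, y, marker="o")  # line plot with point markers
plt.title("Simple Line Chart")
plt.xlabel("x-axis")
plt.ylabel("y-axis")
plt.show()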
Scatter Plots
Scatter plots are graphs that represent the relationship between two
variables in a data set.
They represent data points on a two-dimensional plane or on a Cartesian
system.
The independent variable or attribute is plotted on the X-axis, while the
dependent variable is plotted on the Y-axis.
These plots are often called scatter graphs or scatter diagrams.
Scatter plots are used to observe the relationship between variables and use
dots to represent the relationship between them.
The scatter() method in the matplotlib library is used to draw a scatter plot, as in the sketch below.
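A minimal sketch using scatter(); the (x, y) pairs are hypothetical:

import matplotlib.pyplot as plt

x = [5, 7, 8, 7, 2, 17, 2, 9]           # independent variable
y = [99, 86, 87, 88, 100, 86, 103, 87]  # dependent variable

plt.scatter(x, y)  # one dot per (x, y) pair
plt.title("Simple Scatter Plot")
plt.xlabel("X")
plt.ylabel("Y")
plt.show()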
Scatter Plot
A scatter plot is also called a scatter chart, scattergram, or XY graph. The scatter
diagram graphs numerical data pairs, with one variable on each axis, to show their relationship.
Scatter plots are useful:
When there are multiple values of the dependent variable for a unique value of an independent variable.
In determining the relationship between variables in some scenarios, such as identifying potential
root causes of problems, or checking whether two products that appear to be related share
the same cause, and so on.
Scatter Plot
Types of correlation
The scatter plot explains the correlation between two attributes or variables. It
represents how closely the two variables are connected.
There can be three such situations to see the relation between the two
variables –
Positive Correlation
Negative Correlation
No Correlation
Positive Correlation
When the points in the graph are rising, moving from left to right, then the scatter plot
shows a positive correlation.
It means the values of one variable are increasing with respect to another. Positive
correlation can further be classified into three categories according to its strength.
Negative Correlation
When the points in the scatter graph fall while moving left to right, it is called a
negative correlation.
It means the values of one variable are decreasing with respect to another.
No Correlation
When the points are scattered all over the graph and it is difficult to conclude
whether the values are increasing or decreasing, there is no correlation
between the variables.
Scatter Plot
The line drawn in a scatter plot that lies nearest to almost all the points in the plot is
known as the “line of best fit” or “trend line”. A sketch of computing one follows below.
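One common way to compute a trend line is a degree-1 least-squares fit with NumPy's polyfit; this sketch uses hypothetical, positively correlated data:

import numpy as np
import matplotlib.pyplot as plt

x = np.array([1, 2, 3, 4, 5, 6, 7, 8])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1, 6.9, 8.2, 8.8])

slope, intercept = np.polyfit(x, y, 1)  # degree-1 fit = line of best fit
plt.scatter(x, y)                       # the raw points
plt.plot(x, slope * x + intercept, label="trend line")
plt.legend()
plt.show()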
Working with Data: Exploring Data
What is Data Exploration?
Data exploration refers to the initial step in data analysis. Data analysts
use data visualization and statistical techniques to describe dataset
characteristics, such as size, quantity, and accuracy, to understand
the nature of the data better.
Data exploration techniques include both manual analysis
and automated data exploration software solutions that
visually explore and identify relationships between different
data variables, the structure of the dataset, the presence of
outliers, and the distribution of data values. This reveals patterns
and points of interest, enabling data analysts to gain greater
insight into the raw data.
Data is often gathered in large, unstructured volumes from
various sources.
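As a hedged illustration, a first exploratory pass over a tabular dataset with pandas might look like this; the file name dataset.csv is hypothetical:

import pandas as pd

df = pd.read_csv("dataset.csv")  # hypothetical file; any tabular data works

print(df.shape)         # size: rows x columns
print(df.head())        # first few records
print(df.dtypes)        # column types
print(df.describe())    # summary statistics for numeric columns
print(df.isna().sum())  # missing values per column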
Why is Data Exploration Important?
Humans process visual data better than numerical data.
Therefore, it is extremely challenging for data scientists and
data analysts to assign meaning to thousands of rows and
columns of data points and to communicate that meaning without
any visual components.
Data visualizations use shapes, dimensions, colors, lines,
points, and angles.
Data Munging
Data munging is the general technique of transforming data from an unusable or erroneous form into a useful
form.
Basically, the procedure of cleansing the data manually is known as data munging.
Data munging is the practice of preparing data sets for reporting and analysis.
Data munging is a fundamental step in data science, and there are various tools and libraries available for it.
Some popular libraries for data munging include Pandas, NumPy, and scikit-learn in Python, and dplyr in R; a pandas sketch follows below.
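A minimal pandas sketch of typical munging steps (deduplication, missing values, type fixes); the messy data is hypothetical:

import pandas as pd

df = pd.DataFrame({
    "name": ["Ann", "Bob", "Bob", None],
    "age": ["25", "31", "31", "40"],  # numbers stored as strings
})

df = df.drop_duplicates()          # remove repeated rows
df = df.dropna(subset=["name"])    # drop rows missing a name
df["age"] = df["age"].astype(int)  # fix the column type
print(df)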
Data Munging
In R programming, the following are associated with the data munging process:
apply() family
aggregate()
dplyr package
plyr package
The most basic function in R's apply() collection is the apply() function.
Apart from that, there exist lapply(), sapply(), and tapply().
The entire apply() collection can be considered a substitute for a loop.
Data Munging in R
In R, the aggregate() function is used to combine or aggregate the input data frame by applying a
function to each column of a sub-data frame.
The plyr package is used for splitting, applying, and combining data.
plyr is a set of tools that can be used for splitting up huge or big data into
homogeneous pieces, applying a function to each and every piece, and finally combining all the
resultant values.
The dplyr package can be considered a grammar of data manipulation, providing a
consistent set of verbs that help solve the most common data manipulation challenges; a pandas analogue of this split-apply-combine pattern is sketched below.
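Since this unit's code samples are in Python, here is a hedged pandas analogue of the split-apply-combine pattern that aggregate() and plyr implement; the data frame is hypothetical:

import pandas as pd

df = pd.DataFrame({
    "group": ["a", "a", "b", "b"],
    "value": [1, 2, 3, 4],
})

# Split by "group", apply mean() to each piece, combine the results
print(df.groupby("group")["value"].mean())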
Data Munging
For example:
R provides a library called dplyr which consists of many built-in methods to manipulate the data. To use
the data manipulation functions, first import the dplyr package using the library(dplyr) line of code. Its methods include:
filter(),
distinct(),
arrange(),
select(),
rename().
Data Manipulation
filter() method : The filter() function is used to produce the subset of the data that satisfies the condition specified
in the filter() method.
distinct() method : The distinct() method removes duplicate rows from a data frame, either entirely or based on the
specified columns.
arrange() method : In R, the arrange() method is used to order the rows based on a specified column.
select() method : The select() method is used to extract the required columns as a table by specifying the
required column names in the select() method.
rename() method : The rename() function is used to change the column names.
Rough pandas analogues of these five verbs are sketched below.
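Again using Python for the code sample, a hedged sketch of rough pandas equivalents; the data frame is hypothetical:

import pandas as pd

df = pd.DataFrame({
    "name": ["Ann", "Bob", "Cal", "Cal"],
    "score": [82, 91, 77, 77],
})

subset = df[df["score"] > 80]       # filter(): rows meeting a condition
unique_rows = df.drop_duplicates()  # distinct(): remove duplicate rows
ordered = df.sort_values("score")   # arrange(): order rows by a column
cols = df[["name"]]                 # select(): extract required columns
renamed = df.rename(columns={"score": "marks"})  # rename(): change column names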
Data Scaling
Scaling is a technique to standardize the independent features present in the data to a fixed range.
It is performed during data pre-processing to handle highly varying magnitudes, values, or units.
Data scaling brings data points that are far from each other closer together.
Standardization, or Z-score normalization, refers to making data points centered on the mean of all
the data points, with a unit standard deviation; a sketch of both techniques follows below.
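A minimal scikit-learn sketch of min-max scaling to a fixed range and z-score standardization; the feature column is hypothetical:

import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[1.0], [10.0], [100.0], [1000.0]])  # highly varying magnitudes

print(MinMaxScaler().fit_transform(X))    # scales values into the [0, 1] range
print(StandardScaler().fit_transform(X))  # z-score: zero mean, unit standard deviation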
Dimensionality Reduction
Dimensionality reduction is the process of reducing the number of features (or dimensions) in a
dataset while retaining as much meaningful information as possible.
This can be done for a variety of reasons, such as to reduce the complexity of a model, to
improve the performance of a learning algorithm, or to make it easier to visualize the data.
Benefits of applying Dimensionality Reduction
Some benefits of applying dimensionality reduction technique to the given dataset are given below:
By reducing the dimensions of the features, the space required to store the dataset also gets
reduced.
Less computation and training time is required with reduced feature dimensions.
Reduced dimensions of features of the dataset help in visualizing the data quickly.
It removes the redundant features (if present) by taking care of multicollinearity.
Disadvantages of Dimensionality Reduction
Feature Selection
1. Filters Methods:
In this method, the dataset is filtered, and a subset that contains only the
relevant features is taken. Some common techniques of the filters method are:
Correlation
Chi-Square Test
ANOVA
Information Gain, etc.
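A minimal sketch of a filter method, scoring features with the chi-square test in scikit-learn; the dataset and k=2 are illustrative assumptions:

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

X, y = load_iris(return_X_y=True)
X_new = SelectKBest(chi2, k=2).fit_transform(X, y)  # keep the 2 best-scoring features
print(X.shape, "->", X_new.shape)  # (150, 4) -> (150, 2)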
2. Wrappers Methods
The wrapper method has the same goal as the filter method, but it uses a machine learning
model for its evaluation. In this method, some features are fed to the ML model and the
performance is evaluated. The performance decides whether to add or remove those features to
increase the accuracy of the model. This method is more accurate than the filtering method
but more complex to work with. Some common techniques of wrapper methods are:
Forward Selection
Backward Selection
Bi-directional Elimination
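A hedged sketch of forward selection using scikit-learn's SequentialFeatureSelector; the estimator, dataset, and number of features are illustrative assumptions:

from sklearn.datasets import load_iris
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
selector = SequentialFeatureSelector(
    KNeighborsClassifier(), n_features_to_select=2, direction="forward"
)
X_new = selector.fit_transform(X, y)  # features chosen by model performance
print(X.shape, "->", X_new.shape)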
3. Embedded Methods:
Embedded methods check the different training iterations of the machine
learning model and evaluate the importance of each feature. Some
common techniques of Embedded methods are:
LASSO
Elastic Net
Ridge Regression, etc.
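A minimal sketch of LASSO as an embedded method: the L1 penalty drives unimportant coefficients to exactly zero, so the surviving coefficients identify the selected features. The dataset and alpha are illustrative assumptions:

import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Lasso

X, y = load_diabetes(return_X_y=True)
model = Lasso(alpha=1.0).fit(X, y)

print(np.flatnonzero(model.coef_))  # indices of features with non-zero coefficients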
Feature Extraction:
Feature extraction is the process of transforming the space containing
many dimensions into space with fewer dimensions. This approach is
useful when we want to keep the whole information but use fewer
resources while processing the information.
Some common feature extraction techniques are:
Principal Component Analysis
Linear Discriminant Analysis
Kernel PCA
Quadratic Discriminant Analysis
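A minimal sketch of feature extraction with Principal Component Analysis in scikit-learn; the dataset and number of components are illustrative assumptions:

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)
pca = PCA(n_components=2)             # project 4 features onto 2 components
X_reduced = pca.fit_transform(X)
print(X_reduced.shape)                # (150, 2)
print(pca.explained_variance_ratio_)  # variance retained by each component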
Thank you