
UNIT II
PROGRAMMING TOOLS FOR DATA SCIENCE

SYLLABUS
 Matplotlib
 Bar Charts
 Line Charts
 Scatterplots
 Working with Data: Exploring Data
 Cleaning and Munging
 Manipulating Data
 Rescaling
 Dimensionality Reduction
Matplotlib
(Multi-platform data visualization library)

 Matplotlib is a multi-platform data visualization library built on NumPy
arrays and designed to work with the broader SciPy stack.
 It was introduced by John Hunter in 2002.
 One of the greatest benefits of visualization is that it gives us visual
access to huge amounts of data in an easily digestible form.
Matplotlib
What is data visualization?
 Data visualization is the graphical representation of information and data.
 By using visual elements like charts, graphs, and maps, data visualization tools provide
an accessible way to see and understand trends, outliers, and patterns in data.
Matplotlib
DATA VISUALIZATION

There are five key plots that are used for data visualization.
Matplotlib

 Matplotlib is an easy-to-use and powerful visualization library in Python.


 It is built on NumPy arrays and designed to work with the broader SciPy stack and
consists of several plots like line, bar, scatter, histogram, etc.
Matplotlib Architecture
There are three different layers in the architecture of matplotlib, which are the
following:
 Backend Layer
 Artist layer
 Scripting layer
Matplotlib

Backend layer
 The backend layer is the bottom layer of the figure, which consists of the
implementation of the various functions that are necessary for plotting.
 There are three essential classes in the backend layer:
 FigureCanvas (the surface on which the figure is drawn),
 Renderer (the class that takes care of drawing on the surface), and
 Event (handles mouse and keyboard events).
Matplotlib

Artist Layer
 The artist layer is the second layer in the architecture.
 It is responsible for the various plotting elements, such as the Axes,
which coordinate how the renderer draws on the figure canvas.
Scripting layer
 The scripting layer (pyplot) is the topmost layer, on which most of our code
will run.
 The methods in the scripting layer almost automatically take care of the
other layers, and all we need to care about is the current state (figure & axes).
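To make the three layers concrete, here is a minimal sketch that uses each layer's classes directly instead of pyplot (the Agg backend is just one choice of backend):

import matplotlib
from matplotlib.figure import Figure
from matplotlib.backends.backend_agg import FigureCanvasAgg

fig = Figure()                  # artist layer: the top-level container
canvas = FigureCanvasAgg(fig)   # backend layer: the drawing surface + renderer
ax = fig.add_subplot()          # artist layer: the Axes that owns the plot
ax.plot([1, 2, 3], [1, 4, 9])
fig.savefig("layers_demo.png")  # asks the backend's renderer to draw the figure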
Matplotlib

Example:
 We will be plotting two lists containing the X, Y coordinates for the plot.
Pyplot :
 Pyplot is a Matplotlib module that provides a MATLAB-like interface.
 Pyplot provides functions that interact with the figure, i.e., it creates a figure, decorates
the plot with labels, and creates a plotting area in a figure.
Syntax:
 matplotlib.pyplot.plot(*args, scalex=True, scaley=True, data=None, **kwargs)
Matplotlib

SAMPLE CODE

import matplotlib.pyplot as plt

# initializing the data
x = [10, 20, 30, 40]
y = [20, 30, 40, 50]

# plotting the data
plt.plot(x, y)

# Adding the title
plt.title("Simple Plot")

# Adding the labels
plt.ylabel("y-axis")
plt.xlabel("x-axis")

plt.show()
Matplotlib

In the above example, the elements of x and y provide the coordinates for the x-axis
and the y-axis, and a straight line is plotted against those coordinates.
E.g. Matplotlib
• Matplotlib makes easy things easy.
• You can generate plots, histograms, power spectra, bar
charts, error charts, scatterplots, etc., with just a few lines
of code.
Bar Charts

A bar plot or bar chart is a graph that represents categories of data
with rectangular bars whose lengths or heights are proportional to the
values they represent.
 Bar plots can be drawn horizontally or vertically.
 A bar chart describes comparisons between discrete categories.
 It can be created using the bar() method.
 A legend (typically in the upper right corner) is used to describe the
elements for a particular area of a graph.
 Matplotlib has a function called legend() which is used to place a legend on
the plot.
Bar Charts

Syntax: plt.bar(x, height, width, bottom, align)

 x: the sequence of horizontal coordinates of the bars.
 height: the height(s) of the bars.
 width: optional; the width(s) of the bars, with default value 0.8.
 bottom: optional; the y coordinate(s) of the bar bases, with default value 0.
 align: optional; the alignment of the bars relative to the x coordinates.
Bar Charts

Types of Bar Charts


1. Horizontal bar chart
2. Column or Vertical Bar Charts
3. Stacked Bar Chart
4. Grouped Bar Chart
Types of Bar Charts

(Figure: examples of a horizontal bar chart, a vertical bar chart, a grouped bar chart, and a stacked bar chart; a sketch of all four follows.)
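As a minimal sketch (with made-up category data), all four types can be produced with bar(), barh(), offset x positions for grouping, and the bottom parameter for stacking:

import numpy as np
import matplotlib.pyplot as plt

# made-up category data for illustration
labels = ["A", "B", "C"]
g1 = np.array([3, 5, 2])
g2 = np.array([4, 1, 6])
x = np.arange(len(labels))

fig, axes = plt.subplots(1, 4, figsize=(12, 3))
axes[0].bar(labels, g1)              # vertical (column) bar chart
axes[1].barh(labels, g1)             # horizontal bar chart
axes[2].bar(x - 0.2, g1, width=0.4)  # grouped: bars placed side by side
axes[2].bar(x + 0.2, g2, width=0.4)
axes[2].set_xticks(x)
axes[2].set_xticklabels(labels)
axes[3].bar(labels, g1)              # stacked: second series sits on top
axes[3].bar(labels, g2, bottom=g1)
plt.tight_layout()
plt.show()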
Bar Charts

import matplotlib.pyplot as plt

# data to display on plots
x = [3, 1, 3, 12, 2, 4, 4]
y = [3, 2, 1, 4, 5, 6, 7]

# This will plot a simple bar chart
plt.bar(x, y)

# Title to the plot
plt.title("Bar Chart")

# Adding the legends
plt.legend(["bar"])
plt.show()
Bar Charts

 Advantages:
 Summarize large data sets
 Performance tracking
 Accessible to all audiences
 Disadvantages:
 Too simple
 Too easily manipulated
Line Charts
A line plot or line graph is a graph that represents data with line
segments whose positions are proportional to the values they
represent.
 Line plots can be drawn horizontally or vertically.
 A line chart is typically used to show trends and comparisons across
categories or over time.
 It can be created using the plot() method.
Types of Line Graphs

 Simple Line Graph
 Multiple Line Graph
 Compound Line Graph
Line Charts
Different Parts of a Line Graph
 Title
 Scale
 Labels
 Lines
 Data values
Line Charts

SYNTAX:
plt.plot(x, y)

 x, y: these parameters are the horizontal and
vertical coordinates of the data points.
Line Charts
# importing the required libraries
import matplotlib.pyplot as plt
import numpy as np

# define data values
x = np.array([1, 2, 3, 4])  # X-axis points
y = x * 2                   # Y-axis points

plt.plot(x, y)  # Plot the chart
plt.show()      # display
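Building on this, a minimal sketch of a multiple line graph simply calls plot() more than once on the same axes:

import numpy as np
import matplotlib.pyplot as plt

x = np.array([1, 2, 3, 4])
plt.plot(x, x * 2, label="y = 2x")    # first line
plt.plot(x, x ** 2, label="y = x^2")  # second line on the same axes
plt.legend()
plt.show()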
Scatter Plot

 Scatter plots are graphs that represent the relationship between two
variables in a dataset.
 A scatter plot represents data points on a two-dimensional plane or on a
Cartesian system.
 The independent variable or attribute is plotted on the X-axis, while the
dependent variable is plotted on the Y-axis.
 These plots are often called scatter graphs or scatter diagrams.
 Scatter plots are used to observe the relationship between variables, using
dots to represent it.
 The scatter() method in the matplotlib library is used to draw a scatter plot.
Scatter Plot

 A scatter plot is also called a scatter chart, scattergram, or XY graph. The scatter
diagram graphs numerical data pairs, with one variable on each axis, to show their relationship.

 Scatter plots are used in either of the following situations:

 When we have paired numerical data

 When there are multiple values of the dependent variable for a unique value of an independent variable

 When determining the relationship between variables in some scenarios, such as identifying potential
root causes of problems, or checking whether two products that appear to be related share the same
cause, and so on.
Scatter Plot

Scatter Plot Uses and Examples

 Scatter plots can convey a large volume of data at a glance.

They are beneficial in the following situations:

 When a large set of data points is given

 When each point comprises a pair of values

 When the given data is in numeric form

Scatter Plot Correlation

Types of correlation
 A scatter plot shows the correlation between two attributes or variables. It
represents how closely the two variables are connected.
 There are three possible situations for the relation between the two
variables:
 Positive Correlation
 Negative Correlation
 No Correlation
Positive Correlation

 When the points in the graph are rising, moving from left to right, then the scatter plot
shows a positive correlation.

 It means the values of one variable are increasing with respect to another. Now
positive correlation can further be classified into three categories:

 Perfect Positive – the points form a perfectly straight line

 High Positive – the points lie close to one another

 Low Positive – the points are widely scattered


(Figure: examples of positive correlation)
Negative Correlation

 When the points in the scatter graph fall while moving left to right, then it is called a
negative correlation.

 It means the values of one variable are decreasing with respect to another.

These are also of three types:

 Perfect Negative – the points form a perfectly straight line

 High Negative – the points lie close to one another

 Low Negative – the points are widely scattered


(Figure: examples of negative correlation)
No Correlation

 When the points are scattered all over the graph and it is difficult to conclude
whether the values are increasing or decreasing, then there is no correlation
between the variables.
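As a minimal sketch (with randomly generated data), the three situations look like this:

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
x = rng.normal(size=100)
noise = rng.normal(scale=0.5, size=100)

fig, axes = plt.subplots(1, 3, figsize=(9, 3))
axes[0].scatter(x, x + noise)             # positive: rises moving left to right
axes[1].scatter(x, -x + noise)            # negative: falls moving left to right
axes[2].scatter(x, rng.normal(size=100))  # no correlation: random scatter
for ax, title in zip(axes, ["Positive", "Negative", "No correlation"]):
    ax.set_title(title)
plt.tight_layout()
plt.show()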
Scatter Plot

import matplotlib.pyplot as plt

# data to display on plots
x = [3, 1, 3, 12, 2, 4, 4]
y = [3, 2, 1, 4, 5, 6, 7]

# This will plot a simple scatter chart
plt.scatter(x, y)

# Adding legend to the plot (a list of labels, one per plotted series)
plt.legend(["A"])

# Title to the plot
plt.title("Scatter chart")
plt.show()
Scatter Plot

The line drawn in a scatter plot that passes closest to almost all the points is
known as the "line of best fit" or "trend line"; a sketch of fitting one follows.
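A minimal sketch, fitting a degree-1 polynomial with np.polyfit (one common way to compute a trend line, not the only one):

import numpy as np
import matplotlib.pyplot as plt

x = np.array([3, 1, 3, 12, 2, 4, 4])
y = np.array([3, 2, 1, 4, 5, 6, 7])

plt.scatter(x, y)
slope, intercept = np.polyfit(x, y, 1)  # least-squares line of best fit
xs = np.sort(x)
plt.plot(xs, slope * xs + intercept, color="red", label="trend line")
plt.legend()
plt.show()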
Working with Data: Exploring Data
What is Data Exploration?
 Data exploration refers to the initial step in data analysis. Data analysts
use data visualization and statistical techniques to describe dataset
characteristics, such as size, quantity, and accuracy, to understand
the nature of the data better.
 Data exploration techniques include both manual analysis
and automated data exploration software solutions that
visually explore and identify relationships between different
data variables, the structure of the dataset, the presence of
outliers, and the distribution of data values to reveal patterns
and points of interest, enabling data analysts to gain greater
insight into the raw data.
 Data is often gathered in large, unstructured volumes from
various sources.
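In Python, a first pass at exploration is often just a few pandas calls; a minimal sketch (the file name is hypothetical):

import pandas as pd

df = pd.read_csv("sales.csv")  # hypothetical dataset

print(df.shape)       # size: (rows, columns)
df.info()             # column types and non-null counts
print(df.describe())  # summary statistics for numeric columns
print(df.head())      # first few rows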
Why is Data Exploration Important?
 Humans process visual data better than numerical data.
 Therefore it is extremely challenging for data scientists and
data analysts to assign meaning to thousands of rows and
columns of data points, and to communicate that meaning without
any visual components.
 The building blocks of data visualization are shapes, dimensions,
colors, lines, points, and angles.
Data Munging

 Data munging is the general technique of transforming data from an unusable or erroneous form into a useful form.

 Basically, the procedure of cleansing the data manually is known as data munging.

 Data munging is the practice of preparing data sets for reporting and analysis.

 Data munging is a fundamental step in data science, and there are various tools and libraries available in
programming languages like Python and R that aid in this process.

 Some popular libraries for data munging include Pandas, NumPy, and scikit-learn in Python, and dplyr in R.
Data Munging

Stage 1: Data Discovery

 Everything begins with a defined goal, and the data
analysis journey isn't an exception.
 Data discovery is the first stage of data munging, where
data analysts define the data's purpose and how to achieve it
through data analytics.
 The goal is to identify the potential uses and requirements
of data.
Data Munging

Stage 2: Data Structuring


 Once the requirements are identified
and outlined, the next stage is
structuring raw data to make it
machine-readable.
 Structured data has a well-defined
schema and follows a consistent
layout.
 Think of data neatly organized in rows
and columns available in spreadsheets
and relational databases.
Data Munging

Stage 3: Data Cleansing

 Once the data is organized into a standardized
format, the next step is data cleansing.
 This stage addresses a range of data quality
issues, ranging from missing values to duplicate
datasets.
 The process involves detecting and correcting
this erroneous data to avoid information gaps.
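A minimal pandas sketch of typical cleansing steps (the file and column names are hypothetical):

import pandas as pd

df = pd.read_csv("raw_data.csv")  # hypothetical input

df = df.drop_duplicates()                         # remove duplicate rows
df["age"] = df["age"].fillna(df["age"].median())  # fill missing numeric values
df = df.dropna(subset=["customer_id"])            # drop rows missing a required key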
Data Munging
Stage 4: Data Enrichment
 Data enrichment is the process of filling in
missing details by referring to other data
sources.
 It’s a process that involves appending one or
multiple data sets from different sources to
generate a holistic view of information.
 For example, the raw data might contain partial
customer addresses.
 Data enrichment lets you fill in all address
fields by looking up the missing values
elsewhere, such as in the database or a postal
records lookup.
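A minimal sketch of enrichment via a join against a lookup table (the frames are made up):

import pandas as pd

# made-up frames: orders with only a zip code, and a postal lookup table
orders = pd.DataFrame({"zip": ["10001", "94105"], "amount": [20, 35]})
postal = pd.DataFrame({"zip": ["10001", "94105"],
                       "city": ["New York", "San Francisco"]})

enriched = orders.merge(postal, on="zip", how="left")  # append the city field
print(enriched)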
Data Munging
Stage 5: Data Validation
 Finally, it’s time to ensure that all data
values are logically consistent.
 Validating the accuracy, completeness, and
reliability of data is imperative to the data
munging process.
 Data validation also involves some deeper
checks, such as ensuring that all values are
compatible with the specified data type.
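A minimal sketch of such checks in pandas (the frame and rules are made up):

import pandas as pd

df = pd.DataFrame({"age": [25, 41, 33], "email": ["a@x.com", "b@y.com", None]})

assert pd.api.types.is_numeric_dtype(df["age"])  # values match the declared type
print((df["age"] >= 0).all())                    # logical consistency: no negative ages
print(df["email"].notna().all())                 # completeness: required field present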
Data Munging in R

“R” is an open source software package directed
at analyzing and visualizing data, but with the
power of the language, and available packages,
it also provides a powerful means of
slicing/dicing the data to get it into a form for
analysis.
Data Munging in R

 In R Programming the following ways are oriented with data munging process:
 apply() Family
 aggregate()
 dplyr package
 plyr package
 In the apply() collection of R, the most basic function is apply().
 Apart from that, there exist lapply(), sapply(), and tapply().
 The entire apply() collection can be considered a substitute for loops.
Data Munging in R

 In R, the aggregate() function is used to combine or aggregate the input data frame by applying a
function to each column of a sub-data frame.

 The plyr package is used for splitting, applying, and combining data.

 plyr is a set of tools for splitting up huge or big data to create
homogeneous pieces, applying a function to each piece, and finally combining all the
resultant values.

 The dplyr package can be considered a grammar of data manipulation, providing a
consistent set of verbs that help solve the most common data manipulation challenges.
Data Munging

Issues with Data Munging


Data munging processes sometimes present issues such as:
 Resource overheads
 Data loss
 Flexibility
 Process errors
Data Munging

Benefits of Data Munging


 Data Quality Improvement
 Enhanced Analysis
 Dealing with Missing Data
 Standardization
 Feature Engineering
 Data Integration
 Reduced Processing Time
 Improved Visualization
 Increased Reproducibility
Data Manipulation
What is Data Manipulation?
 Data manipulation is the process of changing information to make it more
organized and readable.

Data manipulation provides an organization with many advantages, including:


 Consistent data: It can be structured, read, and better understood by providing data in
a consistent format.
 Projecting data: It is paramount for organizations to be able to use historical data to
project the future and to provide more in-depth analysis, especially when it comes to
finances.
 Overall, converting, updating, deleting, and incorporating data into a database means
you can do more with the data.
Data Manipulation

For example:

 R provides a library called dplyr which consists of many built-in methods to manipulate the data. So to use
the data manipulation functions, we first need to import the dplyr package with the library(dplyr) line of code.

 Some of these manipulation functions are:


 filter(),

 distinct(),

 arrange(),

 select(),

 rename().
Data Manipulation

 filter(): The filter() function is used to produce the subset of the data that satisfies the condition specified
in the filter() method.

 distinct(): The distinct() method removes duplicate rows from a data frame, either entirely or based on the
specified columns.

 arrange() method : In R, the arrange() method is used to order the rows based on a specified column.

 select() method : The select() method is used to extract the required columns as a table by specifying the
required column names in select() method.

 rename() method : The rename() function is used to change the column names.
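For readers working in Python, pandas has rough analogs of these dplyr verbs; a minimal sketch (the frame is made up):

import pandas as pd

df = pd.DataFrame({"name": ["Ann", "Bob", "Ann"], "score": [90, 75, 90]})

print(df.query("score > 80"))                 # filter(): rows meeting a condition
print(df.drop_duplicates())                   # distinct(): remove duplicate rows
print(df.sort_values("score"))                # arrange(): order rows by a column
print(df[["name"]])                           # select(): keep required columns
print(df.rename(columns={"score": "marks"}))  # rename(): change column names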
Data Scaling

 Scaling is a technique to standardize the independent features present in the data in a fixed range.

 It is performed during the data pre-processing to handle highly varying magnitudes or values or units.

 Data scaling brings data points that are far from each other closer together, in order to
increase algorithm effectiveness and speed up machine learning processing.

 Two popular data scaling methods are normalization and standardization.

 Standardization or Z-score normalization refers to centering the data points of a
feature on the mean of all data points, with a unit standard deviation.


Data Scaling

 Normalization is the process of adjusting all measured values from different
scales into one scale.

 Rescaling data is multiplying each member of a data set by a constant term k;
that is to say, transforming each number x to f(x), where

f(x) = kx, with k and x both real numbers.

 Rescaling will change the spread of your data as well as the position of your data
points.
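A minimal sketch of the two popular methods using scikit-learn (the data is made up):

import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[1.0], [5.0], [10.0], [20.0]])  # made-up single-feature data

print(MinMaxScaler().fit_transform(X))    # normalization: rescale to [0, 1]
print(StandardScaler().fit_transform(X))  # standardization: zero mean, unit std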
Dimensionality Reduction
What is Dimensionality Reduction?
The number of input features, variables, or columns present in a given dataset is known as
dimensionality, and the process to reduce these features is called dimensionality reduction.

 Dimensionality reduction is the process of reducing the number of features (or dimensions) in a
dataset while retaining as much of the meaningful information as possible.

 This can be done for a variety of reasons, such as to reduce the complexity of a model, to
improve the performance of a learning algorithm, or to make it easier to visualize the data.
Benefits of applying Dimensionality Reduction

 Some benefits of applying a dimensionality reduction technique to a given dataset are given below:

 By reducing the dimensions of the features, the space required to store the dataset also gets
reduced.
 Less computation and training time is required for reduced dimensions of features.
 Reduced dimensions of features of the dataset help in visualizing the data quickly.
 It removes the redundant features (if present) by taking care of multicollinearity.
Disadvantages of dimensionality Reduction

 There are also some disadvantages of applying
dimensionality reduction, which are given below:

 Some data may be lost due to dimensionality reduction.


 In the PCA dimensionality reduction technique, the number of principal
components to retain is sometimes unknown.
Approaches of Dimension Reduction
 There are two ways to apply the dimension reduction technique, which are given below:
 Feature Selection
 Feature selection is the process of selecting the subset of the relevant features and leaving out the
irrelevant features present in a dataset to build a model of high accuracy. In other words, it is a
way of selecting the optimal features from the input dataset.
Three methods are used for the feature selection:
1. Filter Methods

 In this method, the dataset is filtered, and a subset that contains only the
relevant features is taken. Some common techniques of the filter method are
(a sketch using the chi-square test follows the list):

 Correlation
 Chi-Square Test
 ANOVA
 Information Gain, etc.
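A minimal scikit-learn sketch of a filter method, scoring features with the chi-square test:

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

X, y = load_iris(return_X_y=True)
X_new = SelectKBest(chi2, k=2).fit_transform(X, y)  # keep the 2 best-scoring features
print(X_new.shape)  # (150, 2)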
2. Wrapper Methods

 The wrapper method has the same goal as the filter method, but it uses a machine learning
model for its evaluation. In this method, some features are fed to the ML model and the
performance is evaluated. The performance decides whether to add or remove those features to
increase the accuracy of the model. This method is more accurate than the filter method
but more complex to run. Some common techniques of wrapper methods are
(a sketch of forward selection follows the list):

 Forward Selection
 Backward Selection
 Bi-directional Elimination
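A minimal sketch of forward selection using scikit-learn's SequentialFeatureSelector (the estimator choice is arbitrary):

from sklearn.datasets import load_iris
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
selector = SequentialFeatureSelector(
    LogisticRegression(max_iter=1000),
    n_features_to_select=2,
    direction="forward",  # add features one at a time
)
selector.fit(X, y)
print(selector.get_support())  # boolean mask of the selected features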
3. Embedded Methods:
Embedded methods check the different training iterations of the machine
learning model and evaluate the importance of each feature (a sketch using
LASSO follows the list below). Some common techniques of embedded methods are:

 LASSO
 Elastic Net
 Ridge Regression, etc.
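A minimal sketch of an embedded method: LASSO's L1 penalty drives some coefficients to exactly zero, implicitly discarding those features.

from sklearn.datasets import load_diabetes
from sklearn.linear_model import Lasso

X, y = load_diabetes(return_X_y=True)
model = Lasso(alpha=1.0).fit(X, y)
print(model.coef_)  # zero coefficients mark discarded features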
Feature Extraction:
 Feature extraction is the process of transforming the space containing
many dimensions into a space with fewer dimensions. This approach is
useful when we want to keep the whole information but use fewer
resources while processing the information; a sketch using PCA follows the list below.
 Some common feature extraction techniques are:
 Principal Component Analysis
 Linear Discriminant Analysis
 Kernel PCA
 Quadratic Discriminant Analysis
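A minimal scikit-learn sketch of the most common technique, Principal Component Analysis:

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)
pca = PCA(n_components=2)         # project 4 features onto 2 components
X_reduced = pca.fit_transform(X)
print(X_reduced.shape)            # (150, 2)
print(pca.explained_variance_ratio_)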
Thank you
