INTRODUCTION TO DATA SCIENCE (IDS)
Prepared by N.Pandu Ranga Reddy
UNIT-5
Graphical Representation of Data (CHARTS & GRAPHS)
Graphical Representation of Data: Graphical representation is where numbers and
facts become pictures and diagrams.
Instead of reading through long lists of numbers, we use charts, graphs, and other
visuals to understand information better.
In this unit on data visualization, we will learn about the different kinds of graphs
and charts that help us see the patterns and stories hidden in data.
What is Graphical Representation?
Graphical representation is a way of presenting data in pictorial form. It helps
a reader understand a large set of data easily, because it shows the patterns in
the data visually.
There are two ways of representing data:
Tables
Pictorial representation through graphs
They say, “A picture is worth a thousand words”. It is often better to represent
data in a graphical format. Surveys and practical evidence also suggest that
people retain and understand information better when it is presented in visual
form.
Types of Graphical Representations
Comparison between different items is best shown with graphs, because a graph makes
it easier to compare the key facts about each item. Let’s look at the different
types of graphical representation briefly:
1. Line Graphs
A line graph is used to show how the value of a particular variable changes with
time. We plot this graph by connecting the points at different values of the
variable. It can be useful for analyzing the trends in the data and predicting
further trends.
A line graph, also known as a line chart or line plot, is a tool used for data visualization.
In a line graph the data points are connected with straight-line segments, and each data
point is shown as a dot or other marker.
A line graph can also display how two or more variables change over time, with one line
drawn for each variable.
Parts of Line Graph
Parts of the line graph include the following:
Title: The name of the graph, stating what it shows.
Axes: The line graph has two axes, the X-axis and the Y-axis.
Labels: The names given to the x-axis and y-axis.
Line: The line segments that connect the data points.
Point: The marker drawn at each data value.
Example: Draw a line graph for the given data.
No. of Days    1    2    3    4
Absentees      5   10   15   10
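The example above can be drawn with Matplotlib. This is a minimal sketch, not a prescribed method: the title text and the output file name are assumptions chosen for illustration, while the data values come from the table.

```python
import matplotlib.pyplot as plt

# Data from the example table
days = [1, 2, 3, 4]
absentees = [5, 10, 15, 10]

fig, ax = plt.subplots()
# marker="o" draws a point at each data value; the line joins the points
ax.plot(days, absentees, marker="o")
ax.set_title("Absentees per day")   # Title (assumed wording)
ax.set_xlabel("No. of Days")        # Label of the x-axis
ax.set_ylabel("Absentees")          # Label of the y-axis
ax.set_xticks(days)                 # one tick per day
fig.savefig("line_graph.png")       # or plt.show() in an interactive session
```

Each part listed under "Parts of Line Graph" maps to one call: the title, the two axis labels, the line, and the point markers.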
Multiple Line Graph
A multiple line graph shows two or more lines in a single graph. The lines may belong
to the same category or to different categories, which makes it easy to compare them.
A double line graph (two lines) is one kind of multiple line graph.
An example of a multiple line graph is shown below:
PIE CHARTS:
A pie chart is a popular and visually intuitive tool for representing data, making
complex information easier to understand at a glance.
This circular graph divides data into slices, each representing a proportion of the whole,
which allows a clear comparison of the different categories.
It looks like a pie cut into slices, and each slice shows one part of the information.
Pie charts are ideal for displaying percentage data or showing how individual parts
contribute to a total.
A pie chart uses a circle to represent the data: the whole circle represents the entire
data set, and the slices represent its parts.
Pie charts are also known as circle graphs or pie diagrams.
Pie Chart Examples
In a class of 200 students, a survey was done to collect each student’s favorite sport.
The pie chart of the data is given below:
Since the pie chart gives the percentage for each sport and the total number of students
is known, we can easily recover the number of students for each sport.
Cricket = 17/100 × 200 = 34 students
Football = 25/100 × 200 = 50 students
Badminton = 12/100 × 200 = 24 students
Hockey = 5/100 × 200 = 10 students
Other = 41/100 × 200 = 82 students
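The same arithmetic can be checked in a few lines of Python. This is only a sketch: the variable names are made up for illustration, and the percentages are the ones used in the calculation above.

```python
total_students = 200

# Percentage share of each sport, as read from the pie chart
percent_by_sport = {
    "Cricket": 17,
    "Football": 25,
    "Badminton": 12,
    "Hockey": 5,
    "Other": 41,
}

# Number of students = percentage / 100 * total
# (integer arithmetic is exact here because every product is a multiple of 100)
students_by_sport = {
    sport: percent * total_students // 100
    for sport, percent in percent_by_sport.items()
}

for sport, count in students_by_sport.items():
    print(sport, "=", count, "students")
```

The counts add up to 200, which is a quick way to confirm that the percentages cover the whole pie.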
Pie Chart Formula
The whole pie always represents 100% of the data. It is divided into sectors, and each
sector corresponds to a certain portion of the total. The angles of all the sectors
add up to 360 degrees.
Converting the data into degrees on a pie chart:
(Given Data / Total Value of Data) × 360°
Calculating the percentage of each sector from its angle in a pie chart:
(Sector Angle / 360°) × 100
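As a worked illustration, the conversion between data values, sector angles, and percentages can be sketched as follows. The function names are hypothetical, and the counts are those of the favorite-sports example.

```python
def to_degrees(value, total):
    """Sector angle for one category: (value / total) * 360."""
    return value / total * 360


def to_percent(angle):
    """Percentage of the pie covered by a sector of the given angle."""
    return angle / 360 * 100


# Student counts from the favorite-sports example (total 200)
counts = {"Cricket": 34, "Football": 50, "Badminton": 24, "Hockey": 10, "Other": 82}
total = sum(counts.values())

angles = {sport: to_degrees(n, total) for sport, n in counts.items()}
for sport, angle in angles.items():
    print(f"{sport}: {angle:.1f} degrees, {to_percent(angle):.0f}%")
```

Football, with a quarter of the students, gets a 90 degree sector, and the five angles together make up the full 360 degrees.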
Chart Legend
Plot/Chart legends give meaning to a visualization, assigning labels to
the various plot elements.
Legends are found in maps, where they describe the pictorial language or symbology of
the map.
Legends are used in line graphs to explain the function or the values
underlying the different lines of the graph.
• Matplotlib has native support for legends. A legend can be placed inside or
outside the chart, and its position can be moved. The legend() method adds the
legend to the plot.
• To place the legend inside, simply call legend():
In the above graph, each series is drawn in a specific color, and the legend provides
the color-based labels “blue” and “green” for clarity.
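A minimal sketch of the legend() call described above is given here. The two data series are hypothetical values invented for illustration; only the labels "blue" and "green" come from the text.

```python
import matplotlib.pyplot as plt

x = [1, 2, 3, 4]
series_blue = [1, 4, 9, 16]    # hypothetical values
series_green = [2, 3, 5, 8]    # hypothetical values

fig, ax = plt.subplots()
# The label given to each line is the text shown in the legend
ax.plot(x, series_blue, color="blue", label="blue")
ax.plot(x, series_green, color="green", label="green")

# With no arguments the legend is placed inside the axes
legend = ax.legend()

# To place it outside instead, anchor it just beyond the right edge:
# ax.legend(loc="upper left", bbox_to_anchor=(1.02, 1))

fig.savefig("legend_example.png")   # or plt.show() in an interactive session
```

The loc and bbox_to_anchor parameters control where the legend box sits relative to the axes.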
2. Bar Graphs/Charts
A bar graph is a type of graphical representation of data in which bars of
uniform width are drawn with equal spacing between them on one axis (usually
the x-axis), which shows the variable. The values of the variable are represented
by the height of the bars.
The bars may be drawn horizontally or vertically, and the length of each bar
represents the value of the data shown on the value axis.
Bar graphs are usually used to display categorical data, i.e., data that fit into
categories.
Reading a Bar Graph and comparing two sets of data
To read a bar graph, we ask ourselves questions about the displayed graph. Let’s
understand this through a basic example: a bar graph of the different types of fruit
liked by people.
What do the X-axis and Y-axis of the graph represent?
The X-axis represents the different types of fruit, such as apple and guava, while the
Y-axis represents the number of people.
Overall, what kind of information is the bar graph displaying?
The bar graph displays the number of people who like each type of fruit.
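A bar graph of this kind can be sketched with Matplotlib as below. The notes do not give the survey numbers, so the fruit list and the counts here are hypothetical and serve only to show the shape of the code.

```python
import matplotlib.pyplot as plt

# Hypothetical survey results (not taken from the notes)
fruits = ["Apple", "Guava", "Mango", "Banana"]
people = [30, 15, 45, 25]

fig, ax = plt.subplots()
bars = ax.bar(fruits, people)          # one bar of uniform width per category
ax.set_xlabel("Type of fruit")         # X-axis: the categories
ax.set_ylabel("Number of people")      # Y-axis: the values
ax.set_title("Fruits liked by people")
fig.savefig("bar_graph.png")           # or plt.show() in an interactive session

# Reading the graph: the tallest bar marks the most liked fruit
favourite = fruits[people.index(max(people))]
print("Most liked fruit:", favourite)
```

Using ax.barh instead of ax.bar would draw the same data as horizontal bars.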
3. Histograms
A histogram is similar to a bar graph, but it is based on the frequency of numerical
values rather than on the values themselves. The data is organized into intervals, and
the bars represent the frequency of the values in each interval. That is, it counts how
many values of the data lie in a particular range.
A histogram displays frequencies of quantitative data that has been sorted into
intervals.
What is a Histogram?
Histograms are graphical representations of data distributions. They consist of bars, each
representing the frequency or count of observations falling within specific intervals, known
as bins.
We can also say a histogram is a variation of a bar chart in which data values are grouped
together and put into different classes. This grouping enables you to see how frequently
data in each class occur in the dataset.
Example:
Suppose you’re analyzing the distribution of scores on a standardized test. You have data
for 2000 students, and you want to visualize how many students scored within different
score ranges. For this you can create a histogram using the following data.
Score Range    Frequency
0-25           150
26-50          300
51-75          600
76-100         750
101-125        150
126-150         50
The histogram shows a single peak with the frequencies falling away on both sides, so
the data is roughly bell-shaped. The largest group of students (750 out of 2000) scored
between 76 and 100.
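The table above can be drawn as a histogram as sketched below. Because the data is already grouped, the sketch draws the bars directly from the frequencies. Treating each class as 25 points wide with edges at 0, 25, 50 and so on is an assumption made for plotting.

```python
import matplotlib.pyplot as plt

# Frequencies from the table
labels = ["0-25", "26-50", "51-75", "76-100", "101-125", "126-150"]
frequencies = [150, 300, 600, 750, 150, 50]

# Assumed bin edges: each class is treated as 25 points wide
left_edges = [0, 25, 50, 75, 100, 125]

total = sum(frequencies)
modal_class = labels[frequencies.index(max(frequencies))]

fig, ax = plt.subplots()
# align="edge" starts each bar at its left edge, so neighbouring bars touch
ax.bar(left_edges, frequencies, width=25, align="edge", edgecolor="black")
ax.set_xticks(left_edges + [150])
ax.set_xlabel("Score")
ax.set_ylabel("Number of students")
ax.set_title("Distribution of test scores")
fig.savefig("histogram.png")   # or plt.show() in an interactive session

print(total, modal_class)      # prints: 2000 76-100
```

With raw, ungrouped scores one would normally pass the values to ax.hist and let it count the frequencies per bin.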
SCATTER PLOT
A scatter plot is a graphical technique that is used to represent data.
A scatter plot, also called a scatter graph or scatter chart, uses dots to describe two
different numeric variables.
The position of each dot on the horizontal and vertical axes indicates the values for an
individual data point.
A scatter plot is used to plot the relationship between two variables on a
two-dimensional graph, known in mathematics as the Cartesian plane.
It is generally used to plot the relationship between one independent variable and one
dependent variable, where the independent variable is plotted on the x-axis and the
dependent variable on the y-axis, so that you can visualize the effect of the
independent variable on the dependent variable. These plots are known as scatter plot
graphs or scatter diagrams.
A scatter plot is known by several other names, such as scatter chart, scattergram,
and XY graph.
A scatter plot is used to visualize paired data, such that each variable gets its own
axis: generally the independent one gets the x-axis and the dependent one gets the y-axis.
So a scatter plot is useful when we have to find out the relationship between two sets
of data, or when we suspect that there may be some relationship between two variables
and that this relationship may be the root cause of some problem.
Let's understand the process through an example. In the following table, a data set of two
variables is given.
Matches Played    2    5    7    1   12   15   18
Goals Scored      1    4    5    2    7   12   11
In this data set there are two variables: the first is the number of matches played by a
certain player and the second is the number of goals scored by that player. Suppose we aim
to find out the relationship between the number of matches played and the number of goals
scored. For now, let us set aside our intuitive expectation that the number of goals
scored rises with the number of matches played, and assume that we have only the given
data set and must extract the relationship between the two variables from it.
Scatter Plot
A scatter plot is a diagram where each value in the data set is represented by a dot.
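The example data can be plotted as sketched below. The sketch also computes the Pearson correlation coefficient as one common way to put a number on the relationship. That coefficient is an addition for illustration and is not part of the notes.

```python
import math

import matplotlib.pyplot as plt

# Data from the example table
matches = [2, 5, 7, 1, 12, 15, 18]   # independent variable (x-axis)
goals = [1, 4, 5, 2, 7, 12, 11]      # dependent variable (y-axis)


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


r = pearson_r(matches, goals)

fig, ax = plt.subplots()
ax.scatter(matches, goals)           # one dot per (matches, goals) pair
ax.set_xlabel("Matches Played")
ax.set_ylabel("Goals Scored")
ax.set_title("Goals scored against matches played")
fig.savefig("scatter_plot.png")      # or plt.show() in an interactive session

print("Correlation coefficient:", r)
```

A coefficient close to +1 means the dots lie near a rising straight line, which is what the plot of this data shows.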
BOX PLOT/Box and Whisker Plot
A box plot (also known as a box-and-whisker plot, whisker plot, or box-and-whisker
diagram) is another way to display quantitative data.
A box plot is a type of chart that depicts a group of numerical data through
its quartiles.
A box plot is a graphical representation of the distribution of a dataset. It
displays key summary statistics such as the median, quartiles, and
potential outliers in a concise and visual manner.
Using a box plot you can summarize a distribution, identify potential outliers,
and compare different datasets in a compact and visual manner.
The box can be displayed either vertically or horizontally, depending on the labeling
of the axes.
The box does not need to be symmetrical, because the data it represents might not be
symmetrical.
Elements of Box Plot
A box plot gives a five-number summary of a set of data, which consists of:
Minimum – It is the minimum value in the dataset excluding the outliers.
First Quartile (Q1) – 25% of the data lies below the First (lower) Quartile.
Median (Q2) – It is the mid-point of the dataset. Half of the values lie below it and half
above.
Third Quartile (Q3) – 75% of the data lies below the Third (Upper) Quartile.
Maximum – It is the maximum value in the dataset excluding the outliers.
Note: The box plot shown in the above diagram is a perfect plot with no skewness. The
plots can have skewness and the median might not be at the center of the box.
The area inside the box (the middle 50% of the data) is known as the Interquartile
Range (IQR). The IQR is calculated as:
IQR = Q3 - Q1
Outliers are the data points that lie below the lower limit or above the upper limit.
The limits are calculated as:
Lower Limit = Q1 - 1.5*IQR
Upper Limit = Q3 + 1.5*IQR
Values outside these limits are considered outliers. The minimum and maximum of the
box plot are the smallest and largest values that lie within the limits.
How to create a box plot?
Let us take a sample data to understand how to create a box plot.
Here are the runs scored by a cricket team in a league of 12 matches – 100, 120, 110,
150, 110, 140, 130, 170, 120, 220, 140, 110.
To draw a box plot for the given data first we need to arrange the data in ascending order
and then find the minimum, first quartile, median, third quartile and the maximum.
Ascending Order
100, 110, 110, 110, 120, 120, 130, 140, 140, 150, 170, 220
Median (Q2) = (120+130)/2 = 125, since there is an even number of values (12) and the
median is the mean of the two middle values.
To find the First Quartile we take the first six values and find their median.
Q1 = (110+110)/2 = 110
For the Third Quartile, we take the last six values and find their median.
Q3 = (140+150)/2 = 145
Note: If the total number of values is odd, we exclude the median when calculating
Q1 and Q3. Here the number of values is even, so the two central values (120 and 130)
are included, one in each half. Now, we need to calculate the Interquartile Range.
IQR = Q3-Q1 = 145-110 = 35
We can now calculate the Upper and Lower Limits to find the minimum and maximum
values and also the outliers if any.
Lower Limit = Q1-1.5*IQR = 110-1.5*35 = 57.5
Upper Limit = Q3+1.5*IQR = 145+1.5*35 = 197.5
So, the minimum and maximum values within the range [57.5, 197.5] for our data are:
Minimum = 100
Maximum = 170
The outliers, which lie outside this range, are:
Outliers = 220
Now we have all the information, so we can draw the box plot, which is shown below.
We can see from the diagram that the median is not exactly at the center of the box and
one whisker is longer than the other. We also have one outlier.
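The hand calculation above can be reproduced with the short sketch below, which follows the same median-of-halves method. Library routines such as numpy.percentile interpolate by default and may therefore give slightly different quartiles. The variable names are chosen for illustration.

```python
from statistics import median

runs = [100, 120, 110, 150, 110, 140, 130, 170, 120, 220, 140, 110]

data = sorted(runs)
n = len(data)
half = n // 2

# Split into lower and upper halves; with an odd count the median is excluded
lower_half = data[:half]
upper_half = data[half:] if n % 2 == 0 else data[half + 1:]

q2 = median(data)
q1 = median(lower_half)
q3 = median(upper_half)
iqr = q3 - q1

lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr

# Whisker ends: smallest and largest values that lie within the limits
inside = [v for v in data if lower_limit <= v <= upper_limit]
minimum, maximum = inside[0], inside[-1]
outliers = [v for v in data if v < lower_limit or v > upper_limit]

print("Q1, Q2, Q3:", q1, q2, q3)
print("IQR:", iqr, "limits:", lower_limit, upper_limit)
print("min, max:", minimum, maximum, "outliers:", outliers)
```

The results match the worked example: Q1 = 110, median = 125, Q3 = 145, IQR = 35, limits 57.5 and 197.5, whiskers at 100 and 170, and 220 as the only outlier.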
Use-Cases of Box Plot
Box plots provide a visual summary of the data from which we can quickly identify the
central value of the data, how dispersed the data is, and whether the data is skewed.
The median gives the central value of the data (it is not the arithmetic mean).
Box plots show the skewness of the data:
a) If the median is at the center of the box and the whiskers are about the
same length on both ends, then the data is symmetric, as a normal distribution
would be.
b) If the median lies closer to the First Quartile and the whisker at the lower
end is shorter (as in the above example), then the data has a Positive Skew
(Right Skew).
c) If the median lies closer to the Third Quartile and the whisker at the
upper end is shorter, then the data has a Negative Skew (Left Skew).
The dispersion or spread of the data can be seen from the minimum and maximum
values, which are found at the ends of the whiskers.
The box plot also shows the outliers, which are the points that are numerically
distant from the rest of the data.
How to compare box plots?
As noted earlier, box plots make it very easy to compare the characteristics of data
between categories. Let us look at how we can compare different box plots and derive
statistical conclusions from them.
Let us take the below two plots as an example:
Compare the Medians: If the median line of a box plot lies outside the box of the
other box plot with which it is being compared, then there is likely to be
a difference between the two groups. Here the median line of Plot B lies outside the
box of Plot A.
Compare the Dispersion or Spread of data: The Interquartile Range (length of the
box) gives us an idea of how dispersed the data is. Here Plot A has a longer box
than Plot B, which means that the data in Plot A is more dispersed than the data in
Plot B. The length of the whiskers also gives an idea of the overall spread of the data.
The extreme values (minimum and maximum) give the range of the data. The larger the
range, the more scattered the data. Here Plot A has a larger range than Plot B.
Compare the Outliers: The outliers show the unusual data values which are
distant from the rest of the data. A larger number of outliers means that predictions
will be more uncertain. We can be more confident when predicting values for a plot
which has few or no outliers.
Compare the Skewness: Skewness gives us the direction and the magnitude of the lack
of symmetry. We have discussed above how to identify skewness. Here Plot A is
Positive or Right Skewed and Plot B is Negative or Left Skewed.
Example box plot
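Two box plots can be drawn side by side for comparison as sketched below. The two groups are hypothetical data invented for illustration and are not the data behind the plots in the notes. The quartiles come from numpy.percentile, which interpolates, so they can differ slightly from the median-of-halves method used in the worked example.

```python
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical data for two groups (not taken from the notes)
group_a = [10, 12, 14, 15, 18, 22, 30, 38]
group_b = [40, 44, 46, 47, 48, 49, 50, 50]

fig, ax = plt.subplots()
result = ax.boxplot([group_a, group_b])   # one box per group, at x = 1 and x = 2
ax.set_xticklabels(["Group A", "Group B"])
ax.set_ylabel("Value")
fig.savefig("box_plots.png")              # or plt.show() in an interactive session

# Numerical comparison of medians and spread
q1_a, med_a, q3_a = np.percentile(group_a, [25, 50, 75])
q1_b, med_b, q3_b = np.percentile(group_b, [25, 50, 75])
iqr_a = q3_a - q1_a
iqr_b = q3_b - q1_b

print("Median of B outside box of A:", not (q1_a <= med_b <= q3_a))
print("IQR of A:", iqr_a, "IQR of B:", iqr_b)
```

In this made-up data the median of Group B lies above the box of Group A, and Group A has the longer box, so it is the more dispersed of the two.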
Linear Regression in Machine Learning/Data science
Linear regression is one of the easiest and most popular Machine Learning
algorithms. It is a statistical method that is used for predictive analysis. Linear
regression makes predictions for continuous (real-valued) numeric variables such
as sales, salary, age, product price, etc.
The linear regression algorithm models a linear relationship between a dependent
variable (y) and one or more independent variables (x), hence the name linear
regression. Because the relationship is linear, the model shows how the value of
the dependent variable changes according to the value of the independent variable.
The linear regression model provides a sloped straight line representing the
relationship between the variables. Consider the below image:
Mathematically, we can represent a linear regression as:
y = a0 + a1x + ε
Here,
y = Dependent variable (target variable)
x = Independent variable (predictor variable)
a0 = Intercept of the line (gives an additional degree of freedom)
a1 = Linear regression coefficient (scale factor applied to each input value)
ε = Random error
The values of the x and y variables form the training dataset for the linear
regression model.
Types of Linear Regression
Linear regression can be further divided into two types:
o Simple Linear Regression:
If a single independent variable is used to predict the value of a numerical
dependent variable, then the algorithm is called Simple Linear Regression.
o Multiple Linear Regression:
If more than one independent variable is used to predict the value of a
numerical dependent variable, then the algorithm is called Multiple Linear
Regression.
Linear Regression Line
A straight line showing the relationship between the dependent and independent
variables is called a regression line. A regression line can show two types of
relationship:
o Positive Linear Relationship:
If the dependent variable increases on the Y-axis as the independent variable
increases on the X-axis, then the relationship is called a positive linear
relationship.
o Negative Linear Relationship:
If the dependent variable decreases on the Y-axis as the independent variable
increases on the X-axis, then the relationship is called a negative linear
relationship.
Finding the best fit line:
When working with linear regression, our main goal is to find the best fit line,
which means that the error between the predicted values and the actual values
should be minimized. The best fit line will have the least error.
Different values for the weights or coefficients of the line (a0, a1) give different
regression lines, so we need to calculate the best values for a0 and a1 to find
the best fit line. To do this we use a cost function.
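The notes do not say how the best values of a0 and a1 are computed. One standard approach for simple linear regression is the closed-form least-squares solution, sketched below as an illustration rather than as the method intended by the notes. The data is the matches and goals table from the scatter plot section, reused here as an assumption.

```python
# Data reused from the scatter plot example
matches = [2, 5, 7, 1, 12, 15, 18]   # x: independent variable
goals = [1, 4, 5, 2, 7, 12, 11]      # y: dependent variable


def fit_line(xs, ys):
    """Least-squares estimates of the intercept a0 and slope a1."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    a1 = sxy / sxx                   # slope
    a0 = mean_y - a1 * mean_x        # intercept: line passes through the means
    return a0, a1


a0, a1 = fit_line(matches, goals)
print("a0 =", a0, "a1 =", a1)

# Prediction for a player with 10 matches
predicted_goals = a0 + a1 * 10
print("Predicted goals for 10 matches:", predicted_goals)
```

The slope is positive, which corresponds to the positive linear relationship described above.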
Cost function
o The cost function is used to estimate the values of the coefficients (a0, a1)
that give the best fit line.
o The cost function is used to optimize the regression coefficients or weights.
It measures how well a linear regression model is performing.
o We can use the cost function to find the accuracy of the mapping function,
which maps the input variable to the output variable. This mapping function
is also known as the hypothesis function.
o For linear regression, we use the Mean Squared Error (MSE) cost
function, which is the average of the squared errors between the
predicted values and the actual values. For the above linear equation,
MSE can be calculated as:
MSE = (1/N) Σ_{i=1..N} (Yi − (a1xi + a0))^2
Where,
N = Total number of observations
Yi = Actual value
(a1xi + a0) = Predicted value
o Residuals: The distance between an actual value and the predicted value is
called a residual. If the observed points are far from the regression line, the
residuals will be large, and so the cost function will be high. If the scatter
points are close to the regression line, the residuals will be small, and hence
the cost function will be low.
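The residuals and the MSE cost can be computed as in the sketch below. The three data points and the two candidate lines are hypothetical values chosen so that the arithmetic is easy to follow by hand.

```python
def residuals(xs, ys, a0, a1):
    """Actual value minus predicted value (a1*x + a0) for every point."""
    return [y - (a1 * x + a0) for x, y in zip(xs, ys)]


def mse(xs, ys, a0, a1):
    """Mean Squared Error: average of the squared residuals."""
    res = residuals(xs, ys, a0, a1)
    return sum(r ** 2 for r in res) / len(res)


# Hypothetical data lying exactly on the line y = 2x
xs = [1, 2, 3]
ys = [2, 4, 6]

# Candidate line 1: y = 0 + 2x passes through every point
cost_good = mse(xs, ys, a0=0, a1=2)

# Candidate line 2: y = 0 + 1x leaves residuals 1, 2 and 3
cost_bad = mse(xs, ys, a0=0, a1=1)

print("Residuals for line 2:", residuals(xs, ys, 0, 1))
print("MSE of line 1:", cost_good)
print("MSE of line 2:", cost_bad)
```

The first line has zero cost because every residual is zero, while the second has cost (1 + 4 + 9) / 3, so the cost function correctly prefers the first line.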