dISCRIPTIVE 6707
INTRODUCTION TO STATISTICS
Statistics
Statistics is a branch of mathematics that deals with the collection, organization, analysis,
interpretation and presentation of data, and with drawing conclusions from the data.
Population
The complete group of people, objects, or observations that we are interested in studying or
analyzing is called the population.
Sample
Variable
A variable is a characteristic of interest that can take on different values or outcomes.
Variables are the building blocks of statistical analysis, and they can be used to describe,
analyze, and summarize data.
Measures of Data
Categorical data vs Numerical data
Characteristics   Categorical data                              Numerical data
Also known as     Qualitative data                              Quantitative data
Nature            Non-numerical; identified by name and label   Identified by numbers and arithmetic processes
Types of data     Nominal and ordinal data                      Continuous and discrete data
Examples          Name, Gender, Phone number                    Measurements such as Height, Weight, Temperature
MODULE-2
DATA VISUALISATION
Data visualisation:
Taking a large amount of data and simplifying it into pictures or graphs
that make it easy to understand is called data visualisation.
Categorical data: Data that can be categorized or grouped is called categorical data.
Ordinal data: Categorical data whose categories have a meaningful order or rank is called
ordinal data.
Ex: Food rating, temperature of water (for example cold, warm, hot), etc.
Nominal data: Data used to name or label a characteristic of an observation, with no inherent order.
Ex: Gender, Names
Difference between ordinal data and nominal data
Ordinal data Vs nominal data
Characteristics Ordinal data Nominal data
Numerical data
Continuous data: Numerical data that can take any of an infinite number of values
within a given range.
Ex: 1. Height
2. Temperature
3. Weight
Discrete data: Discrete data is a type of quantitative data made up of countable,
non-divisible values.
(or)
Discrete data is a numerical type of data that includes whole, concrete numbers with specific and
fixed data values determined by counting.
Ex: 1. Movie tickets sold on a single day
2. Number of students in a class
GRAPHS TO REPRESENT THE CATEGORICAL DATA
Bar graph: A bar graph is one of the most commonly used graphs for categorical data. It represents
the data using bars, where the length of each bar corresponds to the value it represents.
Example: Consider the data A,A,B,C,A,D,A,B,D,C.
Category   Frequency   Relative Frequency
A          4           4/10=0.4
B          2           2/10=0.2
C          2           2/10=0.2
D          2           2/10=0.2
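The frequency and relative frequency columns of a table like this can be computed directly from the raw observations. The following is a minimal sketch using only the standard library; the variable names are illustrative assumptions and not part of the original exercise.

```python
from collections import Counter

# Raw categorical observations from the example
data = ['A', 'A', 'B', 'C', 'A', 'D', 'A', 'B', 'D', 'C']

# Frequency: how many times each category occurs
frequency = Counter(data)
total = sum(frequency.values())

# Relative frequency: frequency divided by the total number of observations
relative_frequency = {cat: count / total for cat, count in frequency.items()}

# Print one row per category, in alphabetical order
for cat in sorted(frequency):
    print(cat, frequency[cat], relative_frequency[cat])
```

The relative frequencies always add up to 1, which is a quick check that no observation was missed.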
[Figure: bar chart ("Chart Title") of Frequency and Relative Frequency for categories A, B, C, D]
Python code for the Bar Graph
import matplotlib.pyplot as plt
from collections import Counter
data = ['A', 'A', 'B', 'C', 'A', 'D', 'A', 'B', 'D', 'C']
counter = Counter(data)
plt.bar(counter.keys(), counter.values())
plt.title('Frequency of Each Item')
plt.xlabel('Item')
plt.ylabel('Frequency')
plt.show()
PIE CHART: A pie chart is a circle divided into slices, where the size of each slice is
proportional to the relative frequency of a category.
Example: Consider the data
A,A,B,C,A,D,A,B,D,C. Construct the frequency distribution table and draw the pie chart.
Sol: For the given values
Category Frequency
A 4
B 2
C 2
D 2
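Python code for the pie chart
import matplotlib.pyplot as plt
data = ['A', 'A', 'B', 'C', 'A', 'D', 'A', 'B', 'D', 'C']
# Count how often each category occurs
frequency = {}
for value in data:
    if value in frequency:
        frequency[value] += 1
    else:
        frequency[value] = 1
# Plotting step reconstructed (assumed); the original listing stops after the counting loop
plt.pie(list(frequency.values()), labels=list(frequency.keys()), autopct='%1.1f%%')
plt.title('Pie Chart')
plt.show()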
Line Graph :
A line graph, also known as a line chart or line plot, is a type of graph used
to display data points connected by straight line segments. It is commonly used
to show trends, patterns, and relationships between continuous data points.
Example: Consider the data
A,A,B,C,A,D,A,B,D,C. Construct the frequency distribution table and draw the line graph.
Category   Frequency   Relative Frequency
A          4           4/10=0.4
B          2           2/10=0.2
C          2           2/10=0.2
D          2           2/10=0.2
[Figure: line graph ("Chart Title") of Frequency and Relative Frequency for categories A, B, C, D]
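Python code for the line graph
import matplotlib.pyplot as plt
categories = ['A', 'B', 'C', 'D']
frequencies = [4, 2, 2, 2]
plt.plot(categories, frequencies, marker='o')  # assumed plotting call; missing from the listing
plt.xlabel('Category')
plt.ylabel('Frequency')
plt.show()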
Output of the code:
Conclusion: The line graph has been drawn using the Python code.
HISTOGRAM
A histogram is a graphical representation of the distribution of numerical data. It
is a type of bar chart that shows the frequency of values within specific
ranges, or bins.
Histograms are used to:
1. Visualize the distribution of data
2. Identify patterns, such as skewness or outliers
3. Understand the central tendency and variability of the data
4. Compare data distributions
Example: Create a histogram and give the Python code for the data
3,3,3,6,6,10,10,6,6,12,12,12,7,7,5,15,7,7,15,15,11,12,15,14,12,11,7,8,8,9,5,5,4,3,
2,3,2,8,11
Sol:
For the given data, the frequency table is:
Value Frequency
2 2
3 5
4 1
5 3
6 4
7 5
8 3
9 1
10 2
11 3
12 5
14 1
15 4
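Python code for the histogram
import matplotlib.pyplot as plt
data = [3,3,3,6,6,10,10,6,6,12,12,12,7,7,5,15,7,7,15,15,11,12,15,14,12,11,7,8,8,9,5,5,4,3,2,3,2,8,11]
plt.hist(data, bins=range(2, 17), edgecolor='black')  # assumed call: one bin per integer value
plt.xlabel('Value')
plt.ylabel('Frequency')
plt.show()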
Output of the code:
0 28 34 30 0 36 28 26
11 34 19 20 19 19 40 35
31 8 20 30 16 29 19 29
38 27 30 36 17 22 24 30
16 40 33 25 27 15 37 35
0 40 30 36 18 17 14 21
19 23 34 38 17 39 16 16
10 21 19 19 19 21 0 0
12 30 27 0 18 16 36 35
36 14 34 29 0 33 15 21
0 19 28 40 0 33 20 18
36 18 37 27 0 0 12 27
16 40 37 40 17 14 21 35
35 22 39 12 36 40 22 27
19 35 36 38 19 14 28 25
38 14 21 17 34 40 39 37
24 24 38 40 24 14 20 16
18 30 21 40 17 38 30 14
25 15 19 21 0 14 36 28
25 38 15 30 12 30 20 36
35 29 32 26 37 24 31 29
27 25 16 26 30 14 20 18
36 29 38 24 0 37 15 29
37 32 28 25 22 26 18 20
28 40 37 14 21 16 40 40
14 32 22 26 35 22 16 20
28 40 34 40 40 36 25 28
21 39 21 25 19 38 24 19
38 35 30 36 26 40 24 35
24 19 28 32 37 12 9 15
21 25 33 39 0 36 30 33
16 25 10 28 33 38 22 31
23 40 28 38 38 33 29 35
19 10 36 14 21 26 19 27
28 25 11 11 20 39 13 19
37 31 15 35 0 19 16 23
20 11 15 35 20 31 13 18
22 11 12 25 0 32 19 37
9 21 33 27 0 24 20 31
35 40 24 33 21 21 27 27
23 27 17 14 22 0 30 35
25 25 20 19 16 25 38 39
30 14 38 13 19 16 32
23 35 19 27 19 24
32 26 22 18 23
35 29 11 0
28 35 0
20 23
21
CSD MID MARKS:
Comparison of mid marks of CSD (mid 1 and mid 2)
[Figure: histograms "Mid 1 of CSD" and "Mid 2 of CSD"; x-axis: marks in bins 0 to 14, 14 to 21, 21 to 30, 30 to 40; y-axis: Frequency]
Comparison of mid marks of CSC:
[Figure: histograms "CSC Mid-1" and "CSC Mid-2", and a "Comparision" bar chart of Mid-1 against Mid-2; x-axis: bins 0 to 14, 14 to 21, 21 to 30, 30 to 40; y-axis: Frequency]
CSM mid marks:
[Figure: histograms "CSM Mid-1" (bins 14, 21, 30, More) and "CSM Mid-2" (bins 0 to 14, 14 to 21, 21 to 30, 30 to 40), and a "CSM Mid Marks" bar chart of Mid-1 against Mid-2; y-axis: Frequency]
CSO mid marks:
[Figure: histograms "CSO Mid-1" and "CSO Mid-2", and a bar chart of mid1 against mid2; x-axis: marks in bins 0 to 14, 14 to 21, 21 to 30, 30 to 40; y-axis: Frequency]
Conclusion: From the bar graphs we can conclude that the highest marks were
obtained by the CSM branch students.
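The binning behind histograms of this kind can be sketched as follows. This is only an illustration: it assumes upper-inclusive bin limits (14, 21, 30, and everything above 30 in the last bin), which is how spreadsheet histogram tools usually count, and the function name `bin_counts` and the sample marks are hypothetical choices rather than the method actually used for the charts.

```python
from bisect import bisect_left

# Upper limits of the first three bins; marks above 30 fall in the last bin.
# Upper-inclusive limits are an assumption about how the charts were binned.
upper_limits = [14, 21, 30]
labels = ['0 to 14', '14 to 21', '21 to 30', '30 to 40']

def bin_counts(marks):
    """Count how many marks fall into each of the four bins."""
    counts = [0] * len(labels)
    for mark in marks:
        # bisect_left returns the index of the first limit that is >= mark
        counts[bisect_left(upper_limits, mark)] += 1
    return counts

# Illustrative sample only: the first row of the marks listing
sample_marks = [0, 28, 34, 30, 0, 36, 28, 26]
counts = bin_counts(sample_marks)

for label, count in zip(labels, counts):
    print(label, count)
```

Running the same function on the Mid-1 and Mid-2 columns of one branch would give the two frequency lists needed for a side-by-side comparison.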
Q2) STUDENT DATA
1) Create a branch-wise frequency table and
represent the data by a bar graph.
2) Create a gender-wise frequency table and represent the
data by a pie chart.
3) Create a frequency table by category and create the
frequency polygon.
1) Branch-wise frequency table and bar graph
Branch Count of Branch
CSC 69
CSD 66
CSE 130
CSM 71
CSO 67
ECE 66
EEE 66
MECH 46
Grand Total 581
2) Gender-wise frequency table, with the data represented
by a pie chart
BC-C 2
BC-D 115
SC 70
ST 27
Q3. California Data. For the given data
1) Create frequency distribution table for gender and create a pie chart.
2) Create frequency distribution for state wise sale and create a bar
graph.
3) Create a frequency distribution for age with class length of 5 units
and find the mean, median and the mode and hence create a
histogram.
4) State the percentage of people who have taken a mortgage.
5) Which source is better to get the information about the data as per
the data
1) Frequency distribution table for gender and create a pie chart.
Row Labels    Count of Gender   R.F
F             70                0.358974
M             108               0.553846
N/A           17                0.087179
Grand Total   195               1
2)Create frequency distribution for state wise sale and create a bar
graph.
Arizona 11
California 119
Colorado 11
Kansas 1
Nevada 17
Oregon 11
Utah 6
Virginia 4
Wyoming 1
Grand Total 181
3)Create frequency distribution for age with class length of 5 units
and find the mean, median and the mode and hence create a
histogram.
Bins frequency CF Mi F*Mi
18-23 4 4 20.5 82
23-28 8 12 25.5 204
28-33 21 33 30.5 640.5
33-38 20 53 35.5 710
38-43 29 82 40.5 1174.5
43-48 26 108 45.5 1183
48-53 16 124 50.5 808
53-58 19 143 55.5 1054.5
58-63 12 155 60.5 726
63-68 13 168 65.5 851.5
68-73 9 177 70.5 634.5
73-78 1 178 75.5 75.5
Total 178 8144
Mean 45.75281 46.15169
Median 44.34615 45
Mode 45.5 48
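The grouped mean and median in the first column of the summary can be reproduced from the bins and frequencies alone. The sketch below is written under stated assumptions: it uses the usual grouped-data formulas, the variable names are illustrative, and the interpolation formula chosen for the mode is one common method among several. With that formula the modal class is 38-43 and the mode comes out near 41.75, so it need not match a mode obtained in another way.

```python
# Class intervals (lower, upper) of width 5 and their frequencies from the age table
classes = [(lo, lo + 5) for lo in range(18, 78, 5)]
freqs = [4, 8, 21, 20, 29, 26, 16, 19, 12, 13, 9, 1]
n = sum(freqs)   # total number of observations
h = 5            # class length

# Mean = sum(f * Mi) / sum(f), where Mi is the class midpoint
midpoints = [(lo + hi) / 2 for lo, hi in classes]
mean = sum(f * m for f, m in zip(freqs, midpoints)) / n

# Median = L + ((n/2 - CF before the median class) / f of the median class) * h
cumulative = 0
for (lo, hi), f in zip(classes, freqs):
    if cumulative + f >= n / 2:
        median = lo + (n / 2 - cumulative) / f * h
        break
    cumulative += f

# Mode = L + (f1 - f0) / (2*f1 - f0 - f2) * h for the class with the highest frequency
i = freqs.index(max(freqs))
f1 = freqs[i]
f0 = freqs[i - 1] if i > 0 else 0
f2 = freqs[i + 1] if i < len(freqs) - 1 else 0
mode = classes[i][0] + (f1 - f0) / (2 * f1 - f0 - f2) * h

print(round(mean, 5), round(median, 5), round(mode, 2))
```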
By observing the pie chart we can say that 69% of people haven’t taken a mortgage.
Q4. Adventure works Customer lookup data. For the
given data
1) Create a frequency distribution table as per Marital Status and
determine the % of people who are single.
2) Determine % of people who have completed Partial College and
represent using data visualization. Which visualization chart is
useful and why?
3) Determine number of people who own a house? Represent using
data visualization. Which visualization chart is useful and why?
4) Find the number of people who are doing clerical jobs represent
using data visualization. Which data visualization is
useful and why?
5) Determine number of people who do not wish to mention their
gender, represent using data visualization. Which visualization
chart is useful and why?
6) How many families have no children. Use visualization.
1)Create a frequency distribution table as per Marital status and
determine the % of people who are single.
Marital Status Count of Marital Status
M 9817
S 8331
Grand Total 18148
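The percentage of single customers follows from the two counts in the table. A minimal sketch, with illustrative variable names, is given below; it works out to roughly 46% single.

```python
# Counts from the marital status frequency table (M = married, S = single)
counts = {'M': 9817, 'S': 8331}
total = sum(counts.values())

# Percentage share of each marital status
percentages = {status: 100 * count / total for status, count in counts.items()}

print(round(percentages['S'], 1))
```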
Conclusion: 27% of customers have an education level of Partial
College.
4)Find the number of people who are doing clerical jobs represent using
data visualization. Which data visualization is useful and why?
Type of Job Count
Clerical 2859
Management 3011
Manual 2353
Professional 5424
Skilled Manual 4501
Total 18148
Conclusion: By observing the above bar graph, we can say that 2859
customers have clerical jobs.
SCATTER PLOT
Definition: A scatter plot, also known as a scatter graph or scatter chart, is a
type of mathematical graph that uses Cartesian coordinates to display the
relationship between two variables.
• The x-axis (horizontal axis) represents the independent variable
• The y-axis (vertical axis) represents the dependent variable
Example:
X   Y
2   1
4   2
6   3
1   4
3   5
5   6
Python program: -
import matplotlib.pyplot as plt
x = [2, 4, 6, 1, 3, 5]
y = [1, 2, 3, 4, 5, 6]
plt.scatter(x, y)
plt.xlabel('X-axis')
plt.ylabel('Y-axis')
plt.title('Scatter Plot')
plt.show()
OUTPUT OF THE CODE
MODULE-3
DESCRIPTIVE STATISTICS
Descriptive statistics: Descriptive statistics is a branch of statistics that deals
with summarizing and describing the main features of a dataset.
Descriptive Statistics:
➢ It deals with organizing, summarizing and analysing the data.
➢ If the purpose of the analysis is to explore the information for its own
intrinsic interest, then the study is descriptive.
➢ It is performed on either a sample or a population.
Inferential statistics:
➢ It is the part of statistics that deals with drawing conclusions from
the data.
➢ If the information is obtained from a sample of a population and the
purpose of the study is to use this information to draw conclusions
about the population, then the study is inferential.
Measures of central tendency:
Measures of central tendency are statistical measures that
describe the middle or centre of a distribution of values.
Measures of central tendency consist of the
1) Mean
2) Median
3) Mode
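The three measures listed above can be computed with Python's built-in `statistics` module. This is a minimal sketch; the data set is an illustrative choice.

```python
import statistics

# Illustrative data set
data = [2, 12, 5, 7, 6, 7, 3]

mean = statistics.mean(data)      # arithmetic average of the values
median = statistics.median(data)  # middle value of the sorted data
mode = statistics.mode(data)      # most frequently occurring value

print(mean, median, mode)
```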
1. Range: The difference between the largest and smallest values.
2. Variance: The average of the squared differences from the mean.
3. Standard Deviation: The square root of the variance.
4. Skewness: A measure of the asymmetry of a distribution.
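The four measures above can be computed by hand in a few lines. The sketch below is illustrative only: the data set is an arbitrary example, the variance divides by n (the population form, matching "the average of the squared differences"), and skewness is taken as the third central moment divided by the cube of the standard deviation, which is one common definition assumed here.

```python
# Illustrative data set
data = [2, 12, 5, 7, 6, 7, 3]
n = len(data)
mean = sum(data) / n

# Range: largest value minus smallest value
data_range = max(data) - min(data)

# Variance: average of the squared differences from the mean (population form)
variance = sum((x - mean) ** 2 for x in data) / n

# Standard deviation: square root of the variance
std_dev = variance ** 0.5

# Skewness: third central moment divided by sd^3 (assumed moment-based definition)
skewness = (sum((x - mean) ** 3 for x in data) / n) / std_dev ** 3

print(data_range, round(variance, 4), round(std_dev, 4), round(skewness, 4))
```

A positive skewness, as obtained here, indicates a longer tail on the right side of the distribution.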
MEAN: The mean of a data set is the sum of the observations divided by
the total number of observations.
Mean of the raw data = $\frac{\sum x_i}{n}$
Mean of continuous grouped data = $\frac{\sum f m_i}{\sum f}$, where $m_i$ is the midpoint of each class
Mean of a discrete frequency table = $\frac{\sum x f}{\sum f}$
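The three forms of the mean can be sketched in code as follows. The raw data and the discrete frequency table are small illustrative inputs, and the grouped classes and their frequencies are hypothetical values chosen only to show the formula.

```python
# Raw data: mean = sum(x_i) / n
raw = [2, 12, 5, 7, 6, 7, 3]
mean_raw = sum(raw) / len(raw)

# Discrete frequency table: mean = sum(x * f) / sum(f)
values = [2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 14, 15]
freqs = [2, 5, 1, 3, 4, 5, 3, 1, 2, 3, 5, 1, 4]
mean_discrete = sum(x * f for x, f in zip(values, freqs)) / sum(freqs)

# Continuous grouped data: mean = sum(f * m_i) / sum(f), m_i = class midpoint
classes = [(0, 10), (10, 20), (20, 30)]   # hypothetical class intervals
class_freqs = [3, 5, 2]                   # hypothetical frequencies
midpoints = [(lo + hi) / 2 for lo, hi in classes]
mean_grouped = sum(f * m for f, m in zip(class_freqs, midpoints)) / sum(class_freqs)

print(mean_raw, round(mean_discrete, 3), mean_grouped)
```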
Example:
1. Find the mean of 2, 12, 5, 7, 6, 7, 3.
Mean = $\frac{\sum x_i}{n}$
= (2+12+5+7+6+7+3)/7
= 42/7
= 6
Hence the mean of the given data is 6.
2. If the mean of 52, 57, x, 60, 54, 59, 55 is 56, then find the value of x.
Sol: Given data: 52, 57, x, 60, 54, 59, 55
Given mean = 56
Mean = (52+57+x+60+54+59+55)/7
56 = (337+x)/7
337 + x = 56*7 = 392
x = 392 - 337
x = 55
DESCRIPTIVE STATISTICS
Descriptive Statistics:
• Skewness
Mean:
The mean of the dataset is the sum of the
observations divided by the total number of observations.
➢ Raw Data: