
DESCRIPTIVE 6707

Uploaded by

Kiran Mohan
Copyright
© © All Rights Reserved

Module -1

INTRODUCTION TO STATISTICS
Statistics
Statistics is a branch of mathematics that deals with the collection, organization, analysis,
interpretation, and presentation of data, and with drawing conclusions from the data.

Population
The complete group of people, objects, or observations that we are interested in studying or
analyzing is called the population.

Sample

The subset of the population that we actually study is known as a sample.


Types of data:
* Primary data: Data collected first-hand by the investigator himself or herself.
* Secondary data: Data collected from external sources, that is, originally collected by a
third person.

Variable
A variable is a characteristic of interest that can take on different values or outcomes.
Variables are the building blocks of statistical analysis, and they can be used to describe,
analyze, and summarize data.

Measures of Data

Categorical data vs. numerical data

Characteristic   Categorical data                 Numerical data
Also known as    Qualitative data                 Quantitative data
Nature           Non-numerical; identified by     Identified by numbers; supports
                 name and label                   arithmetic operations
Types of data    Nominal and ordinal data         Continuous and discrete data
Examples         Name, gender, phone number       Measurements such as height,
                                                  weight, temperature

MODULE-2
DATA VISUALISATION
Data visualisation:
Data visualisation means taking a large amount of data and simplifying it into
pictures or graphs that make it easy to understand.
Categorical data: Data that can be categorized or grouped is called
categorical data.
Ordinal data: Categorical data whose categories have a meaningful order or ranking is called
ordinal data.
Ex: Food rating, temperature of water (for example cold, warm, hot), etc.

Nominal data: Data used to name or identify a characteristic of an observation, with no inherent order.
Ex: Gender, names
Difference between ordinal data and nominal data

Characteristic    Ordinal data                        Nominal data
Definition        Represents categories with a        Represents categories with no
                  specific order or ranking           inherent order or ranking
Order of values   Values have a meaningful sequence   Values do not have a meaningful sequence
Examples          Ranking of students                 Gender

Numerical data
Continuous data: Numerical data that can take any of infinitely many values within
a given range.
Ex: 1. Height
2. Temperature
3. Weight
Discrete data: Discrete data is a type of quantitative data made up of non-divisible values
that you can count.
(or)
Discrete data is a numerical type of data that includes whole, concrete numbers with specific and
fixed values determined by counting.
Ex: 1. Movie tickets sold on a single day.
2. Number of students in a class
GRAPHS TO REPRESENT THE CATEGORICAL DATA

Bar graph: The bar graph is one of the most commonly used charts. It represents the data using
bars, where the length of each bar corresponds to the value it represents.

Frequency: Frequency is the number of times a value or category occurs in the data.

Example: Consider the data A,A,B,C,A,D,A,B,D,C. Construct the frequency distribution table and
calculate the relative frequency.
Sol: Given data A,A,B,C,A,D,A,B,D,C
Arrange the data in tabular form:

Category   Frequency   Relative Frequency
A          4           4/10 = 0.4
B          2           2/10 = 0.2
C          2           2/10 = 0.2
D          2           2/10 = 0.2
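
The table above can also be produced programmatically. The following is a minimal sketch, not part of the original notes; the variable names (frequency, relative_frequency) are chosen here for illustration. It counts each category and divides by the total number of observations.

```python
from collections import Counter

# Data from the worked example above
data = ['A', 'A', 'B', 'C', 'A', 'D', 'A', 'B', 'D', 'C']

# Frequency: how many times each category occurs
frequency = Counter(data)
total = len(data)

# Relative frequency = frequency / total number of observations
relative_frequency = {cat: count / total for cat, count in frequency.items()}

print("Category  Frequency  Relative Frequency")
for cat in sorted(frequency):
    print(f"{cat:<9} {frequency[cat]:<10} {relative_frequency[cat]:.1f}")
```

The relative frequencies always add up to 1, which is a quick check that no observation was missed.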

[Figure: bar chart of Frequency and Relative Frequency for categories A, B, C, D]

Python code for the Bar Graph
import matplotlib.pyplot as plt
from collections import Counter
data = ['A', 'A', 'B', 'C', 'A', 'D', 'A', 'B', 'D', 'C']
counter = Counter(data)
plt.bar(counter.keys(), counter.values())
plt.title('Frequency of Each Item')
plt.xlabel('Item')
plt.ylabel('Frequency')
plt.show()

PIE CHART: A pie chart is a circle divided into slices whose sizes are proportional to the
relative frequencies of the categories in categorical data.

[Figure: example pie chart with slices 1st Qtr, 2nd Qtr, 3rd Qtr, 4th Qtr]

Example: Consider the data A,A,B,C,A,D,A,B,D,C. Construct the frequency distribution table and
draw the pie chart.
Given the values:

Category   Frequency
A          4
B          2
C          2
D          2

Python code for the pie chart
import matplotlib.pyplot as plt

data = ['A', 'A', 'B', 'C', 'A', 'D', 'A', 'B', 'D', 'C']

# Build the frequency distribution by counting each value
frequency = {}
for value in data:
    if value in frequency:
        frequency[value] += 1
    else:
        frequency[value] = 1

print("Frequency Distribution Table:")
print(frequency)

# Slice labels and sizes come from the frequency table
labels = list(frequency.keys())
sizes = list(frequency.values())
plt.pie(sizes, labels=labels, autopct='%1.1f%%')
plt.title('Pie Chart')
plt.show()

Line Graph :
A line graph, also known as a line chart or line plot, is a type of graph used
to display data points connected by straight line segments. It is commonly used
to show trends, patterns, and relationships between continuous data points.
Example: Consider the data A,A,B,C,A,D,A,B,D,C. Construct the frequency distribution table and
draw the line graph.

Category   Frequency   Relative Frequency
A          4           4/10 = 0.4
B          2           2/10 = 0.2
C          2           2/10 = 0.2
D          2           2/10 = 0.2

[Figure: line chart of Frequency and Relative Frequency for categories A, B, C, D]

Python code for the line graph
import matplotlib.pyplot as plt
categories = ['A', 'B', 'C', 'D']
frequencies = [4, 2, 2, 2]
plt.plot(categories, frequencies, marker='o')
plt.title('Frequency of Each Item')
plt.xlabel('Category')
plt.ylabel('Frequency')
plt.show()

Output of the code: [Figure: line graph of frequency by category]

Conclusion: The line graph has been drawn and the Python code for it has been written.

HISTOGRAM
A histogram is a graphical representation of the distribution of numerical data. It
is a type of bar chart that shows the frequency of values within specific
ranges, or bins.
Histograms are used to:
1. Visualize the distribution of data
2. Identify patterns, such as skewness or outliers
3. Understand the central tendency and variability of the data
4. Compare data distributions
Example: Create a histogram and give the Python code for the data
3,3,3,6,6,10,10,6,6,12,12,12,7,7,5,15,7,7,15,15,11,12,15,14,12,11,7,8,8,9,5,5,4,3,2,3,2,8,11
Sol: The frequency of each value in the given data is:

Value Frequency
2 2
3 5
4 1
5 3
6 4
7 5
8 3
9 1
10 2
11 3
12 5
14 1
15 4
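
The value/frequency table above can be checked by counting the data directly. This is a small sketch added for illustration (the names value_counts and frequency_table are not from the original notes); it uses Counter from the standard library.

```python
from collections import Counter

# Data from the histogram example above
data = [3, 3, 3, 6, 6, 10, 10, 6, 6, 12, 12, 12, 7, 7, 5, 15, 7, 7, 15, 15,
        11, 12, 15, 14, 12, 11, 7, 8, 8, 9, 5, 5, 4, 3, 2, 3, 2, 8, 11]

# Count the occurrences of each value and sort by value
value_counts = Counter(data)
frequency_table = sorted(value_counts.items())

print("Value  Frequency")
for value, count in frequency_table:
    print(f"{value:>5}  {count:>9}")
print("Total observations:", len(data))
```

Values that never occur (here 13) simply do not appear in the table.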

Python code for the Histogram:


import matplotlib.pyplot as plt
data = [3, 3, 3, 6, 6, 10, 10, 6, 6, 12, 12, 12, 7, 7, 5, 15, 7, 7, 15, 15, 11, 12, 15,
14, 12, 11, 7, 8, 8, 9, 5, 5, 4, 3, 2, 3, 2, 8, 11]
plt.hist(data, bins=range(2, 17), edgecolor='black', align='left')
plt.title('Histogram of Given Data')

plt.xlabel('Value')
plt.ylabel('Frequency')
plt.show()
Output of the code: [Figure: histogram of the given data]

Comparison of mid marks:

CSC          CSM          CSD          CSO
Mid-1 Mid-2  Mid-1 Mid-2  Mid-1 Mid-2  Mid-1 Mid-2
14 30 0 28 35 38 17 21
14 14 33 22 22 31 37 37
37 39 22 25 0 38 18 15
31 31 34 39 36 40 0 23
32 39 29 36 35 35 32 23
19 15 38 38 0 33 22 25
20 25 34 40 0 26 20 37
19 23 0 39 0 24 19 21
37 12 16 30 25 28 36 39
19 22 38 40 21 15 0 26
34 39 22 23 35 36 35 38
17 31 0 36 34 40 17 27
36 36 29 39 33 37 26 33
0 23 18 28 28 29 23 26
34 40 34 34 17 21 28 32
34 30 25 19 35 39 40 38
19 39 13 15 27 31 15 28
29 37 17 19 0 36 0 8
0 17 39 39 18 14 22 22
10 32 20 28 37 37 27 33
25 29 34 30 35 35 29 30
29 19 34 33 37 23 28 33

0 28 34 30 0 36 28 26
11 34 19 20 19 19 40 35
31 8 20 30 16 29 19 29
38 27 30 36 17 22 24 30
16 40 33 25 27 15 37 35
0 40 30 36 18 17 14 21
19 23 34 38 17 39 16 16
10 21 19 19 19 21 0 0
12 30 27 0 18 16 36 35
36 14 34 29 0 33 15 21
0 19 28 40 0 33 20 18
36 18 37 27 0 0 12 27
16 40 37 40 17 14 21 35
35 22 39 12 36 40 22 27
19 35 36 38 19 14 28 25
38 14 21 17 34 40 39 37
24 24 38 40 24 14 20 16
18 30 21 40 17 38 30 14
25 15 19 21 0 14 36 28
25 38 15 30 12 30 20 36
35 29 32 26 37 24 31 29
27 25 16 26 30 14 20 18
36 29 38 24 0 37 15 29
37 32 28 25 22 26 18 20
28 40 37 14 21 16 40 40
14 32 22 26 35 22 16 20
28 40 34 40 40 36 25 28
21 39 21 25 19 38 24 19
38 35 30 36 26 40 24 35
24 19 28 32 37 12 9 15
21 25 33 39 0 36 30 33
16 25 10 28 33 38 22 31
23 40 28 38 38 33 29 35
19 10 36 14 21 26 19 27
28 25 11 11 20 39 13 19
37 31 15 35 0 19 16 23
20 11 15 35 20 31 13 18
22 11 12 25 0 32 19 37
9 21 33 27 0 24 20 31
35 40 24 33 21 21 27 27
23 27 17 14 22 0 30 35
25 25 20 19 16 25 38 39
30 14 38 13 19 16 32
23 35 19 27 19 24
32 26 22 18 23
35 29 11 0
28 35 0
20 23
21

CSD MID MARKS:
Comparison of mid marks of CSD

[Figure: "CSD Mid Marks", grouped bar chart of Mid-1 and Mid-2 frequencies for the mark ranges 0 to 14, 14 to 21, 21 to 30, 30 to 40]
[Figure: "Mid 1 of CSD", frequency against marks for the same ranges]
[Figure: "Mid 2 of CSD", frequency against marks for the same ranges]

Comparison of mid marks of CSC:

[Figure: "CSC Mid-1", frequency per bin (0 to 14, 14 to 21, 21 to 30, 30 to 40); bar labels legible in the source: 20, 16, 17, 13, bin order uncertain]
[Figure: "CSC Mid-2", frequency per bin (same ranges); bar labels legible in the source: 28, 23, 9, 10, bin order uncertain]
[Figure: "Comparison", grouped bar chart of Mid-1 and Mid-2 frequencies for bins 1 to 4]

CSM mid marks:

[Figure: "CSM Mid-1", frequency per bin (14, 21, 30, More); bar labels legible in the source: 25, 17, 15, 7, bin order uncertain]
[Figure: "CSM Mid-2", frequency per bin (0 to 14, 14 to 21, 21 to 30, 30 to 40); bar labels legible in the source: 26, 29, 10, 6, bin order uncertain]
[Figure: "CSM Mid Marks", grouped bar chart of Mid-1 and Mid-2 frequencies for bins 1 to 4]

CSO mid marks:

[Figure: "CSO Mid-1", frequency against marks (0 to 14, 14 to 21, 21 to 30, 30 to 40); bar labels legible in the source: 24, 21, 13, 10, bin order uncertain]
[Figure: "CSO Mid-2", frequency against marks (same ranges); bar labels legible in the source: 24, 25, 15, 5, bin order uncertain]
[Figure: "CSO MID MARKS", grouped bar chart of mid1 and mid2 frequencies for the same ranges]

Conclusion: From the bar graphs we can conclude that the highest marks were
obtained by the CSM branch students.

Q2) STUDENT DATA
1) Create a frequency table branch-wise and represent the data by a bar graph.
2) Create a frequency table gender-wise and represent the data by a pie chart.
3) Create a frequency table category-wise and also create the frequency polygon.

1) Frequency table according to branch wise and bar graph
Branch Count of Branch
CSC 69
CSD 66
CSE 130
CSM 71
CSO 67
ECE 66
EEE 66
MECH 46
Grand Total 581

2) Frequency table gender-wise, with the data represented by a pie chart

Gender        Count of Gender
F             199
M             382
Grand Total   581

3) Frequency table category-wise and the frequency polygon.

Category   Frequency
OC         173
BC-A       37
BC-B       132
BC-C       2
BC-D       115
SC         70
ST         27

Q3. California Data. For the given data
1) Create a frequency distribution table for gender and create a pie chart.
2) Create a frequency distribution for state-wise sales and create a bar graph.
3) Create a frequency distribution for age with a class length of 5 units,
find the mean, median and mode, and hence create a histogram.
4) State the percentage of people who have taken a mortgage.
5) Which source is better for getting information, as per the data?

1) Frequency distribution table for gender and create a pie chart.

Row Labels    Count of Gender   R.F
F             70                0.358974
M             108               0.553846
N/A           17                0.087179
Grand Total   195               1

2) Create a frequency distribution for state-wise sales and create a bar graph.

Row Labels    Count of State
Arizona       11
California    119
Colorado      11
Kansas        1
Nevada        17
Oregon        11
Utah          6
Virginia      4
Wyoming       1
Grand Total   181

3) Create a frequency distribution for age with a class length of 5 units,
find the mean, median and mode, and hence create a histogram.

Bins     Frequency   CF     Mi      F*Mi
18-23    4           4      20.5    82
23-28    8           12     25.5    204
28-33    21          33     30.5    640.5
33-38    20          53     35.5    710
38-43    29          82     40.5    1174.5
43-48    26          108    45.5    1183
48-53    16          124    50.5    808
53-58    19          143    55.5    1054.5
58-63    12          155    60.5    726
63-68    13          168    65.5    851.5
68-73    9           177    70.5    634.5
73-78    1           178    75.5    75.5
Total    178                        8144
Mean     45.75281    46.15169
Median   44.34615    45
Mode     45.5        48
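
The grouped mean and median can be recomputed from the class intervals and frequencies in the table. The sketch below is an illustration added to these notes, not the author's original working. It uses the standard grouped-data formulas: mean = sum(f * Mi) / sum(f), median = L + ((n/2 - CF) / f) * h, and mode = L + ((f1 - f0) / (2*f1 - f0 - f2)) * h. It reproduces the first Mean and Median values in the table (45.75281 and 44.34615). The interpolation formula for the mode gives about 41.75, which differs from the 45.5 listed in the table; that value may have been obtained by a different method.

```python
# Class boundaries (lower, upper) and frequencies from the age table above
classes = [(18, 23), (23, 28), (28, 33), (33, 38), (38, 43), (43, 48),
           (48, 53), (53, 58), (58, 63), (63, 68), (68, 73), (73, 78)]
freqs = [4, 8, 21, 20, 29, 26, 16, 19, 12, 13, 9, 1]

n = sum(freqs)
midpoints = [(low + high) / 2 for low, high in classes]

# Grouped mean: sum(f * Mi) / sum(f)
grouped_mean = sum(f * m for f, m in zip(freqs, midpoints)) / n

# Grouped median: find the class where the cumulative frequency reaches n/2,
# then interpolate inside that class
cumulative = 0
grouped_median = None
for (low, high), f in zip(classes, freqs):
    if cumulative + f >= n / 2:
        grouped_median = low + ((n / 2 - cumulative) / f) * (high - low)
        break
    cumulative += f

# Grouped mode: interpolate inside the class with the highest frequency
k = freqs.index(max(freqs))
f1 = freqs[k]
f0 = freqs[k - 1] if k > 0 else 0
f2 = freqs[k + 1] if k < len(freqs) - 1 else 0
mode_low, mode_high = classes[k]
grouped_mode = mode_low + ((f1 - f0) / (2 * f1 - f0 - f2)) * (mode_high - mode_low)

print(f"Mean   = {grouped_mean:.5f}")
print(f"Median = {grouped_median:.5f}")
print(f"Mode   = {grouped_mode:.2f}")
```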

4) State the percentage of people who have taken a mortgage.

Row Labels    Count of Mortgage   R.F        Percent
No            134                 0.687179   68.71795
Yes           61                  0.312821   31.28205
Grand Total   195                 1          100

By observing the pie chart we can say that about 31% of people have taken a mortgage and about 69% have not.

5) Which source is better for getting information, as per the data?

Row Labels    Count of Gender
Agency        59
Client        17
Website       119
Grand Total   195

Q4. Adventure works Customer lookup data. For the
given data
1) Create a frequency distribution table as per Marital Status and
determine the % of people who are single.
2) Determine % of people who have completed Partial College and
represent using data visualization. Which visualization chart is
useful and why?
3) Determine number of people who own a house? Represent using
data visualization. Which visualization chart is useful and why?
4) Find the number of people who are doing clerical jobs represent
using data visualization. Which data visualization is
useful and why?
5) Determine number of people who do not wish to mention their
gender, represent using data visualization. Which visualization
chart is useful and why?
6) How many families have no children. Use visualization.

1)Create a frequency distribution table as per Marital status and
determine the % of people who are single.
Marital Status   Count of Marital Status
M                9817
S                8331
Grand Total      18148
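
The percentage asked for in the question follows directly from the table. A minimal sketch, assuming that S denotes single and M denotes married (the notes do not spell out the codes):

```python
# Counts from the marital status table above
marital_counts = {"M": 9817, "S": 8331}

total = sum(marital_counts.values())

# Percentage of each status = 100 * count / total
percentages = {status: 100 * count / total
               for status, count in marital_counts.items()}

for status, pct in percentages.items():
    print(f"{status}: {pct:.1f}%")
```

Under that assumption, roughly 45.9% of the customers are single.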

2) Determine the % of people who have completed Partial College and
represent using data visualization. Which visualization chart is useful and
why?
Education Level       Count of Education Level
Bachelors             5261
Graduate Degree       3125
High School           3241
Partial College       4966
Partial High School   1555
Grand Total           18148

Conclusion: 27% of customers have education level of Partial
College.

3) Determine the number of people who own a house. Represent using data
visualization. Which visualization chart is useful and why?
Homeowner     Count of Homeowner
N             5888
Y             12260
Grand Total   18148

Conclusion: The number of people who own a house is 12,260.

4) Find the number of people who are doing clerical jobs and represent using
data visualization. Which data visualization is useful and why?
Type of Job      Count
Clerical         2859
Management       3011
Manual           2353
Professional     5424
Skilled Manual   4501
Total            18148

Conclusion: By observing the above bar graph, we can say that 2,859
customers have clerical jobs.

5) Determine the number of people who do not wish to mention their gender and
represent using data visualization. Which visualization chart is useful
and why?
Gender   Count
F        8892
M        9126
N/A      130
Total    18148

6) How many families have no children? Use visualization.

No. of Children   Frequency
0                 5080
1                 3552
2                 3703
3                 2153
4                 2259
5                 1401

Conclusion: There are 5,080 families with no children.

SCATTER PLOT
Definition: A scatter plot, also known as a scatter graph or scatter chart, is a
type of mathematical graph that uses Cartesian coordinates to display the
relationship between two variables.
• The x-axis (horizontal axis) represents the independent variable
• The y-axis (vertical axis) represents the dependent variable

Example: Consider the given table

X   Y
2   1
4   2
6   3
1   4
3   5

Procedure to create a scatter plot using Excel:

1) Prepare your data and make sure you have the two variables
whose relationship you want to study.
2) Highlight that data.
3) Go to the Insert tab and, in the Charts group, click the Scatter chart icon.
4) For better visualization, choose a type such as Scatter with only
Markers.
5) Add titles and labels, format the chart, and save it.

Python program: -
import matplotlib.pyplot as plt
x = [2, 4, 6, 1, 3, 5]
y = [1, 2, 3, 4, 5, 6]
plt.scatter(x, y)
plt.xlabel('X-axis')
plt.ylabel('Y-axis')
plt.title('Scatter Plot')
plt.show()

Output of the code: [Figure: scatter plot of y against x]

MODULE -3
DESCRIPTIVE STATISTICS
Descriptive statistics: Descriptive statistics is a branch of statistics that deals
with summarizing and describing the main features of a dataset.

Descriptive statistics:
➢ It deals with organizing, summarizing, and analysing the data.
➢ If the purpose of the analysis is to explore the information for its own
intrinsic interest, then the study is descriptive.
➢ It is performed on either a sample or a population.
Inferential statistics:
➢ It is the part of statistics that deals with drawing conclusions from
the data.
➢ If the information is obtained from a sample of a population and the
purpose of the study is to use this information to draw conclusions
about the population, then the study is inferential.

Measures of central tendency:
Measures of central tendency are statistical measures that
describe the middle or centre of a distribution of values.
The measures of central tendency are:
1) Mean
2) Median
3) Mode

➢ MEAN: The arithmetic average of all values in the distribution.


➢ MEDIAN: The middle value in the distribution when the values are
arranged in order.
➢ MODE: The most frequently occurring value in the distribution.
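
The three definitions above map directly onto Python's standard statistics module. The following is a short illustrative sketch using the seven values 2, 12, 5, 7, 6, 7, 3 from the mean example in this module; the variable names are chosen here for illustration.

```python
import statistics

# Values from the worked mean example in this module
data = [2, 12, 5, 7, 6, 7, 3]

# Mean: arithmetic average of all values
mean_value = statistics.mean(data)

# Median: middle value after sorting (2, 3, 5, 6, 7, 7, 12)
median_value = statistics.median(data)

# Mode: most frequently occurring value
mode_value = statistics.mode(data)

print("Mean:", mean_value)
print("Median:", median_value)
print("Mode:", mode_value)
```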
Measures of variability: Measures of variability, also known as measures of
spread or dispersion, describe how widely the values in a distribution are spread.
The measures of variability are:
➢ Range
➢ Standard Deviation
➢ Variance
➢ Skewness

1. Range: The difference between the largest and smallest values.
2. Variance: The average of the squared differences from the mean.
3. Standard Deviation: The square root of the variance.
4. Skewness: A measure of the asymmetry of a distribution.
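
The four measures can be computed step by step from their definitions. This sketch is an illustration added to the notes, with two assumptions: the variance divides by n (the population form, matching "the average of the squared differences"), and skewness is taken as the third central moment divided by the cube of the standard deviation, which is one common definition among several. The data are the values 2, 12, 5, 7, 6, 7, 3 used in the mean example of this module.

```python
# Values from the worked mean example in this module
data = [2, 12, 5, 7, 6, 7, 3]

n = len(data)
mean = sum(data) / n

# Range: largest value minus smallest value
data_range = max(data) - min(data)

# Variance (population form): average of squared differences from the mean
variance = sum((x - mean) ** 2 for x in data) / n

# Standard deviation: square root of the variance
std_dev = variance ** 0.5

# Skewness (moment coefficient): third central moment / std_dev cubed
skewness = (sum((x - mean) ** 3 for x in data) / n) / std_dev ** 3

print("Range:", data_range)
print("Variance:", round(variance, 4))
print("Standard deviation:", round(std_dev, 4))
print("Skewness:", round(skewness, 4))
```

A positive skewness, as here, indicates a longer tail on the right side of the distribution (the value 12 pulls it that way).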
MEAN: The mean of a data set is the sum of the observations divided by
the total number of observations.

Mean of raw data: $\bar{x} = \frac{\sum x_i}{n}$

Mean of continuous grouped data: $\bar{x} = \frac{\sum f m_i}{\sum f}$, where $m_i$ is the midpoint of each class

Mean of a discrete frequency table: $\bar{x} = \frac{\sum x f}{\sum f}$

Example:
1. Find the mean of 2, 12, 5, 7, 6, 7, 3.
Mean = $\frac{\sum x_i}{n}$
= (2 + 12 + 5 + 7 + 6 + 7 + 3) / 7
= 42 / 7
= 6
Hence the mean of the given data is 6.
2. If the mean of 52, 57, x, 60, 54, 59, 55 is 56, find the value of x.
Sol: Given data 52, 57, x, 60, 54, 59, 55
Given mean = 56
Mean = (52 + 57 + x + 60 + 54 + 59 + 55) / 7
56 = (337 + x) / 7
337 + x = 56 * 7
x = 392 - 337
x = 55

DESCRIPTIVE STATISTICS
Descriptive Statistics:

Descriptive statistics summarize and describe the main
features of a dataset. This includes measures like mean,
median, mode, standard deviation, and range.
It is performed on either the sample or the population.
Measures of Central Tendency: These are the measures that indicate where the
data is actually concentrated. They consist of
• Mean
• Median
• Mode
• Quartile
Measures of Variability (Dispersion): These are the measures that indicate the
spread of the data.
• Range
• Standard Deviation
• Variance
• Skewness

Mean:
The mean of the dataset is given by the sum of the
observations divided by the total number of observations.
➢ Raw Data:

➢ Raw data (Discrete):
