B.SC (CS With AI) Unit - 1
Unit-1
Sampling
How do we study a population?
A population may be studied using one of two approaches: taking a census, or
selecting a sample.
It is important to note that whether a census or a sample is used, both provide
information that can be used to draw conclusions about the whole population.
Population: The population is the total set of observations that can be made.
Eg: All the students in a college.
Sampling unit (element): One unit from a population. Eg: If the population
is defined to be 100 trees on a lot, then the sampling unit is a single tree.
Notation:
Measure                     Statistic (sample)    Parameter (population)
Mean                        $\bar{x}$             $\mu$
Standard Deviation (S.D)    $s$                   $\sigma$
Variance                    $s^{2}$               $\sigma^{2}$
Size                        $n$                   $N$
Advantages                    Disadvantages
Very accurate                 Chances for bias
Very reliable                 Problems of accuracy
Takes less time               Untrained manpower
Low cost of sampling          Absence of the informants
Scope of sampling is high     Chances of committing errors in sampling
TEN Mark Question and Answers
Explain Probability Sampling or Random sampling or Method of sampling.
1. Simple Random Sampling (SRS): Simple Random Sampling is a
probability sampling method in which each unit in the population has an equal
chance of being included in the sample.
There are two types of Simple Random Sampling - Simple Random Sampling
Without Replacement (SRSWOR) and Simple Random Sampling With
Replacement (SRSWR).
Suppose you are going to buy apples from a fruit shop. You select five
apples one by one from a basket without replacing the selected ones.
This type of sampling, in which all units have an equal chance of being included
in the sample and a selected unit is not returned to the population, is called
simple random sampling without replacement. If each selected unit is replaced
before the next draw, it is called simple random sampling with replacement.
If a population consists of N units and a sample of n units is to be taken, the
possible number of samples in SRSWOR is $^{N}C_{n}$ and in SRSWR is $N^{n}$.
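The two counting rules and the two ways of drawing can be sketched in Python. This is only an illustration: the basket of 20 apples, the seed and the function names are assumptions made for the example, not part of the notes.

```python
import math
import random

def count_srswor(N, n):
    """Number of possible samples without replacement: N choose n."""
    return math.comb(N, n)

def count_srswr(N, n):
    """Number of possible samples with replacement: N to the power n."""
    return N ** n

# Hypothetical basket of 20 apples; a fixed seed keeps the run repeatable.
rng = random.Random(42)
basket = [f"apple_{i}" for i in range(1, 21)]

# SRSWOR: a selected apple cannot be selected again.
without_replacement = rng.sample(basket, 5)

# SRSWR: every draw is made from the full basket, so repeats are possible.
with_replacement = rng.choices(basket, k=5)

print(count_srswor(20, 5), count_srswr(20, 5))
print(without_replacement)
print(with_replacement)
```

`random.sample` never repeats a unit, while `random.choices` may, which is exactly the difference between the two schemes.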
Randomization can be carried out using a number of techniques, such as:
(a) Tossing a coin. (b) Throwing a die. (c) Lottery method. (d) Blindfold
method. (e) Using a table of random numbers such as ‘Tippett’s Table’.
Lottery Method
For example, suppose we have to select five students out of 50 to visit an old
age home. We assign numbers from 1 to 50 to the students. 50 identical slips are
made for these students. These slips are folded, put in a box and shuffled
thoroughly. Then five slips are drawn. Suppose the numbers drawn are 34, 6,
48, 37 and 20. Then the students bearing these numbers are selected for visiting
the home.
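The lottery method can be imitated in a few lines of Python. This is a minimal sketch: the function name `lottery_draw` and the seed are assumptions for illustration, and the numbers drawn will generally differ from those in the example above.

```python
import random

def lottery_draw(population_size, sample_size, seed=None):
    """Number the units 1..population_size, shuffle the slips, draw the first few."""
    rng = random.Random(seed)
    slips = list(range(1, population_size + 1))  # one slip per student
    rng.shuffle(slips)                           # mix the slips thoroughly
    return slips[:sample_size]                   # draw without replacement

selected = lottery_draw(population_size=50, sample_size=5, seed=7)
print(selected)
```

Because the slips are not put back, no student can be selected twice, so this is simple random sampling without replacement.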
Advantages
(a) Easy to understand. (b) Easy to analyse and interpret results.
(c) Easy to detect errors. (d) Usually representative of the population.
Disadvantages
(a) Selection on a strictly random basis is not always possible.
(b) The investigator has little control over which units are selected.
(c) Random sampling does not suit heterogeneous groups.
3. Stratified Sampling: The universe or entire population is divided into a
number of strata (groups), hence the name stratified sampling. Once the whole
universe is divided into groups, a certain number of items is taken from
each group at random.
Eg: All the students of a college may be divided into groups of boys and girls.
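A minimal Python sketch of stratified sampling follows. The roster of 60 boys and 40 girls, the 10% sampling fraction and the function name are hypothetical; taking the same fraction from every stratum (proportional allocation) is one common choice and is assumed here, since the notes only say that a certain number of items is drawn from each group.

```python
import random

def stratified_sample(strata, fraction, seed=None):
    """Draw a simple random sample of the given fraction from every stratum."""
    rng = random.Random(seed)
    sample = {}
    for name, units in strata.items():
        k = max(1, round(len(units) * fraction))  # at least one unit per stratum
        sample[name] = rng.sample(units, k)       # SRSWOR inside the stratum
    return sample

# Hypothetical college roster split into two strata.
strata = {
    "boys": [f"B{i}" for i in range(1, 61)],
    "girls": [f"G{i}" for i in range(1, 41)],
}
chosen = stratified_sample(strata, fraction=0.1, seed=1)
print({name: len(units) for name, units in chosen.items()})
```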
Advantages
(a) Easy to understand. (b) Easy to analyse and interpret results.
(c) Usually representative of the population.
(d) Allows subgroup comparisons.
(e) Results represent the population without weighting.
Disadvantages
(a) Requires subgroup identification of each population element.
(b) May be costly and difficult to prepare lists of elements in each subgroup.
(c) Requires knowledge of the proportion of each subgroup in the population.
(d) It is a costly and time-consuming method.
Consider a population which consists of males and females who are smokers
or non-smokers.
[Diagram: the population divided into Male and Female strata]
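Example: Simple Random Sampling
Population (as implied by the samples listed below): 2, 3, 6, 8, 11, so N = 5,
with samples of size n = 2.
Using SRSWOR: $^{5}C_{2} = 10$ samples.
Samples: Using SRSWR
(2,2), (2,3), (2,6), (2,8), (2,11), (3,2), (3,3), (3,6), (3,8), (3,11), (6,2), (6,3),
(6,6), (6,8), (6,11), (8,2), (8,3), (8,6), (8,8), (8,11), (11,2), (11,3), (11,6),
(11,8), (11,11)
$5^{2} = 25$ samples.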
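The samples in this example can be listed mechanically. The sketch below assumes the population is the five values 2, 3, 6, 8, 11 that appear in the listed pairs, and uses the standard library's `itertools` to enumerate samples of size 2.

```python
from itertools import combinations, product

population = [2, 3, 6, 8, 11]  # values appearing in the listed samples
n = 2

# SRSWOR: unordered selections in which no unit is repeated.
srswor_samples = list(combinations(population, n))

# SRSWR: ordered selections in which a unit may be drawn again.
srswr_samples = list(product(population, repeat=n))

print(len(srswor_samples), srswor_samples)
print(len(srswr_samples), srswr_samples)
```

The SRSWR list contains pairs such as (2,2) and both (2,3) and (3,2), which is why it is larger than the SRSWOR list.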
Definition
Quantitative Variable: A variable which can assume numerical values.
Example: Age, Weight, Height, Income, Expenditure
Discrete Variable: A variable which can assume only a finite (or countable)
number of possible values. Eg: Number of pages in a book, Number of apples in a basket.
Continuous Variable: A variable which can assume an infinite number of
possible values within a range. Eg: Students' heights, ages, weights, Income of a family
Types of data
Primary and Secondary Data
Primary Data
Data collected directly from the source is called primary data
(also called original or first-hand data).
Eg: 1. List of absentees in a class.
2. Ram has collected the statistics marks from the students in person.
Secondary data
Secondary data consists of second hand information which has already been
collected. Example: Population census data, annual rain fall, budget records.
Difference between Primary and Secondary data.
Primary data                        Secondary data
Original data                       Not original data
First-hand information              Second-hand information
Costs more money                    Costs less money
Takes more time                     Takes less time
Becomes secondary data after use    Cannot be converted into primary data
Investigation through Local Reports
In this method, data are collected through local agents or correspondents. They collect
information in their own fashion, according to their own likes and dislikes.
Advantages and Disadvantages of Information from Correspondents method
Advantages                                          Disadvantages
Speedy information is possible                      Data may not be original
Extensive information can be had                    Uniformity cannot be maintained
It is useful where information is needed regularly  The information may be biased
Mailed Questionnaire
In this method of investigation, the investigator sends a questionnaire to the
respondent.
Advantages and Disadvantages of Mailed questionnaire method
Advantages                            Disadvantages
Least expensive                       Long response time
Only method to reach remote areas     Cannot be used by illiterates
Informants are not influenced         Doubts cannot be cleared regarding questions
Telephonic interviews:
Data are collected by the interviewer through an interview over the telephone.
Advantages and Disadvantages of Telephonic Interview method
Advantages                        Disadvantages
Relatively low cost               Limited use
Relatively high response rate     Reaction cannot be watched
Less influence on informants      Respondents can be influenced
Note:
Investigator: The person who conducts statistical investigation.
Enumerator: The person who helps the investigator in the collection of data.
Respondent: The person or institution that provides information to the
investigator or enumerator is called the Respondent or Informant.
Explain the methods (sources) of collecting Secondary Data
Published data are available in various sources, including:
Libraries
A common place to look for secondary data is a library. Here, data can be
obtained from magazines, journals and newspapers.
Government agencies
Government data can be obtained from publications issued by local, state,
national and international governments. Such data include laws, regulations,
statistics and consumer information.
Internet
Secondary data can be obtained from search engines such as Yahoo, Google,
MSN.com, etc., on the internet.
• Government publications.
• Office records in panchayats, municipalities etc.
• Survey reports of various research organizations.
• Survey reports in Journals, Newspapers and other publications.
• Websites.
Unpublished
• Official records and files of government and private offices.
• Studies made by research institutions.
• Diaries.
• Letters.
Classification of Data
Tabulation
Tabulation: A table is a systematic arrangement of statistical data in rows and
columns.
[Figure: layout of a table showing the body, the total and the foot-note]
The parts of a table are:
1. Number and Title, indicating the serial number and the subject matter
of the table.
2. Stub, i.e., the headings of the rows.
3. Caption, i.e., the headings of the columns.
4. Body, i.e., the figures entered in the table.
5. Foot-note, i.e., the source from which the data have been obtained.
Tally Marks
Tally marks represent counts in the form of vertical lines. We put one
vertical line (|) for each of the first four counts. A diagonal line (\) is drawn
across these four lines for the fifth count, and the next group of five then
starts afresh.
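Tallying can be sketched in Python as follows. This is an illustration only: the function name `tally` is an assumption, and the diagonal stroke is shown as a trailing backslash because plain text cannot draw it across the four lines. The data are the ages of 6 children used in these notes.

```python
from collections import Counter

def tally(count):
    """Return tally marks: four vertical lines plus a diagonal per group of five."""
    full_groups, remainder = divmod(count, 5)
    groups = ["||||\\"] * full_groups      # each complete group of five
    if remainder:
        groups.append("|" * remainder)     # leftover counts, one line each
    return " ".join(groups)

ages = [4, 3, 2, 5, 6, 3]  # ages of 6 children, as listed in these notes
for value, frequency in sorted(Counter(ages).items()):
    print(value, tally(frequency), frequency)
```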
Types of Data
2. Discrete Data
3. Ages of 6 children:
4, 3, 2, 5, 6, 3
Discrete Data
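Class: Each interval into which the data are grouped is called a class.
Class Limits: The lower limit (L.L) and the upper limit (U.L) of a class.
Class Interval: Difference between the upper limit and the lower limit.
Mid Value or Mid Point = $\dfrac{\text{upper limit} + \text{lower limit}}{2}$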
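Exclusive Type (Closed Interval): There is no gap between the upper limit of
one class and the lower limit of the next class.
Weight    No. of Students
10-20     15
20-30     20
30-40     35
40-50     19
50-60     5
(In the class 10-20, the lower limit L.L is 10 and the upper limit U.L is 20.)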
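The class interval and mid value of each class in the weight table can be computed directly from the class limits. A small Python sketch, with function names chosen only for this example:

```python
# Class limits (lower, upper) and frequencies from the weight table.
classes = [(10, 20), (20, 30), (30, 40), (40, 50), (50, 60)]
frequencies = [15, 20, 35, 19, 5]

def class_interval(lower, upper):
    """Width of a class: upper limit minus lower limit."""
    return upper - lower

def mid_value(lower, upper):
    """Mid point of a class: (upper limit + lower limit) / 2."""
    return (upper + lower) / 2

widths = [class_interval(lo, hi) for lo, hi in classes]
mid_values = [mid_value(lo, hi) for lo, hi in classes]

for (lo, hi), f, w, m in zip(classes, frequencies, widths, mid_values):
    print(f"{lo}-{hi}: frequency={f}, class interval={w}, mid value={m}")
```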
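Inclusive Type (Open Interval): There is a gap between the upper limit of one
class and the lower limit of the next class.
Marks (C.I)    No. of Students    Marks (C.I)    No. of Students
10-19          6                  9.5-19.5       6
20-29          10                 19.5-29.5      10
30-39          15                 29.5-39.5      15
40-49          5                  39.5-49.5      5
The right-hand table shows the same classes converted to the exclusive form:
0.5 (half the gap between classes) is subtracted from each lower limit and
added to each upper limit.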
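The conversion from inclusive to exclusive classes seen in the marks table can be sketched in Python. The function name `to_exclusive` is an assumption for illustration, and the sketch assumes the gap between consecutive classes is the same throughout.

```python
def to_exclusive(classes):
    """Convert inclusive classes to exclusive ones by closing the gap evenly."""
    # Gap between the upper limit of the first class and the lower limit of the next.
    gap = classes[1][0] - classes[0][1]
    half_gap = gap / 2  # correction factor, 0.5 for the marks table
    return [(lower - half_gap, upper + half_gap) for lower, upper in classes]

inclusive = [(10, 19), (20, 29), (30, 39), (40, 49)]
exclusive = to_exclusive(inclusive)
print(exclusive)
```

After the conversion the upper limit of each class equals the lower limit of the next, so the frequencies stay attached to the same classes.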
Explain Frequency Distribution.
Frequency distribution organizes data into categories (ranges) and shows
how many times each category occurs.
Discrete Data: Discrete data can take only specific values.
Eg: The students' marks are given below:
Marks              5    10    15    20
No. of Students    3    12    8     2
Continuous Data: Continuous data can take any value within a range.
Eg: The heights of students are given below:
Height             3-4    4-5    5-6    6-7
No. of Students    5      10     6      1
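A frequency distribution can be built from raw observations by counting. In the Python sketch below, the raw marks are constructed to match the marks table, while the six heights are hypothetical values invented only to show the grouping step; the function name is also an assumption.

```python
from collections import Counter

# Raw marks consistent with the table: 3 fives, 12 tens, 8 fifteens, 2 twenties.
marks = [5] * 3 + [10] * 12 + [15] * 8 + [20] * 2
discrete_freq = Counter(marks)  # discrete data: count each distinct value

def grouped_frequency(values, classes):
    """Continuous data: count values in each exclusive class lower <= v < upper."""
    return {(lo, hi): sum(lo <= v < hi for v in values) for lo, hi in classes}

heights = [3.5, 4.2, 4.8, 5.1, 5.9, 6.3]  # hypothetical heights, not from the notes
height_freq = grouped_frequency(heights, [(3, 4), (4, 5), (5, 6), (6, 7)])

print(sorted(discrete_freq.items()))
print(height_freq)
```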
Bivariate Frequency Distribution
If only one characteristic of the sampling units is measured for the study, it is
called Univariate Data. If two characteristics are measured simultaneously
from each unit, it is known as Bivariate Data. Similarly, data containing
measurements of more than two characteristics of each unit is called
Multivariate Data.
For example, if only the height of the students is measured for the study, it is
Univariate Data. Usually we represent the variables by x, y, z, etc. If both the
height (x) and the weight (y) of each student are measured, it is Bivariate Data,
written as ordered pairs (x, y):
(52, 45), (51, 62), (57, 58), (62, 70), (68, 73)
Height (x):    52    51    57    62    68
Weight (y):    45    62    58    70    73
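The height and weight rows can be paired unit by unit in Python. This short sketch uses the five observations given above; computing the sample means is an extra illustrative step, not something the notes ask for.

```python
from statistics import mean

heights = [52, 51, 57, 62, 68]  # x
weights = [45, 62, 58, 70, 73]  # y

# Bivariate data: one (x, y) pair per sampling unit.
pairs = list(zip(heights, weights))

# Each variable taken on its own is univariate data.
mean_height = mean(heights)
mean_weight = mean(weights)

print(pairs)
print(mean_height, mean_weight)
```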
Contingency Table
Definition:
A contingency table is a simple table used to show how two different categories
or groups are related. It helps us see how often different combinations of these
categories occur.
Example: Let’s say we have a group of students and we want to see how their
choice of favorite fruit (Apple or Banana) relates to their choice of favorite
color (Red or Blue).
           Red    Blue    Total
Apple      10     5       15
Banana     8      12      20
Total      18     17      35
In this table:
- The rows show the favorite fruit (Apple or Banana).
- The columns show the favorite color (Red or Blue).
- Each cell shows the number of students who like a particular fruit and color
combination.
What It Shows:
10 students like Apple and Red.
5 students like Apple and Blue.
8 students like Banana and Red.
12 students like Banana and Blue.
The totals at the bottom and right side show the overall counts for each
category.
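The cell counts and totals of this contingency table can be reproduced in Python. In the sketch below, the list of observations is rebuilt from the four counts stated above (10, 5, 8 and 12); the variable names are assumptions for illustration.

```python
from collections import Counter

# One (fruit, colour) record per student, rebuilt from the stated counts.
observations = (
    [("Apple", "Red")] * 10
    + [("Apple", "Blue")] * 5
    + [("Banana", "Red")] * 8
    + [("Banana", "Blue")] * 12
)

cells = Counter(observations)                                  # each fruit-colour cell
row_totals = Counter(fruit for fruit, _ in observations)       # totals on the right side
column_totals = Counter(colour for _, colour in observations)  # totals at the bottom
grand_total = len(observations)

print(dict(cells))
print(dict(row_totals), dict(column_totals), grand_total)
```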