
CS3352 Foundations of Data Science

Two Marks Questions with Answers

Q.1 What is data science?

Ans. :

• Data science is an interdisciplinary field that seeks to extract knowledge or
insights from various forms of data.

• At its core, data science aims to discover and extract actionable knowledge
from data that can be used to make sound business decisions and predictions.

• Data science uses advanced analytical theory and various methods, such as
time series analysis, for predicting the future.

Q.2 Define structured data.

Ans. : Structured data is arranged in a rows-and-columns format, which helps
applications retrieve and process the data easily. A database management system is
used for storing structured data. The term structured data refers to data that is
identifiable because it is organized in a structure.

Q.3 What is a data set?

Ans. : A data set is a collection of related records or information. The information
may be on some entity or some subject area.

Q.4 What is unstructured data ?

Ans. : Unstructured data is data that does not follow a specified format. Rows and
columns are not used for unstructured data, so it is difficult to retrieve the
required information. Unstructured data has no identifiable structure.

Q.5 What is machine - generated data ?

Ans. : Machine-generated data is information that is created without human
interaction as a result of a computer process or application activity. This means
that data entered manually by an end-user is not considered to be machine-generated.
Q.6 Define streaming data.

Ans. : Streaming data is data that is generated continuously by thousands of data
sources, which typically send in the data records simultaneously and in small
sizes (on the order of kilobytes).

Q.7 List the stages of data science process.

Ans.: Stages of data science process are as follows:

1. Discovery or Setting the research goal

2. Retrieving data

3. Data preparation

4. Data exploration

5. Data modeling

6. Presentation and automation

Q.8 What are the advantages of data repositories?

Ans.: Advantages are as follows:

i. Data is preserved and archived.

ii. Data isolation allows for easier and faster data reporting.

iii. Database administrators have an easier time tracking problems.

iv. There is value to storing and analyzing data.

Q.9 What is data cleaning?

Ans. : Data cleaning means removing inconsistent data or noise from a collection
of interrelated data and collecting the necessary information.

Q.10 What is outlier detection?


Ans. : Outlier detection is the process of detecting and subsequently excluding
outliers from a given set of data. The easiest way to find outliers is to use a plot
or a table with the minimum and maximum values.

Q.11 Explain exploratory data analysis.

Ans. : Exploratory Data Analysis (EDA) is a general approach to exploring
datasets by means of simple summary statistics and graphic visualizations in
order to gain a deeper understanding of data. EDA is used by data scientists to
analyze and investigate data sets and summarize their main characteristics, often
employing data visualization methods.

Q.12 Define data mining.

Ans. : Data mining refers to extracting or mining knowledge from large
amounts of data. It is a process of discovering interesting patterns or knowledge
from a large amount of data stored either in databases, data warehouses or other
information repositories.

Q.13 What are the three challenges to data mining regarding data mining
methodology?

Ans. Challenges to data mining regarding data mining methodology include the
following:

1. Mining different kinds of knowledge in databases,

2. Interactive mining of knowledge at multiple levels of abstraction,

3. Incorporation of background knowledge.

Q.14 What is predictive mining?

Ans. : Predictive mining tasks perform inference on the current data in order to
make predictions. Predictive analysis answers queries about the future, using
historical data as the chief basis for decisions.

Q.15 What is data cleaning?

Ans. : Data cleaning means removing inconsistent data or noise from a collection
of interrelated data and collecting the necessary information.
Q.16 List the five primitives for specifying a data mining task.

Ans. :

1. The set of task-relevant data to be mined

2. The kind of knowledge to be mined

3. The background knowledge to be used in the discovery process

4. The interestingness measures and thresholds for pattern evaluation

5. The expected representation for visualizing the discovered pattern.

Q.17 List the stages of data science process.

Ans. Data science process consists of six stages:

1. Discovery or Setting the research goal

2. Retrieving data

3. Data preparation

4. Data exploration

5. Data modeling

6. Presentation and automation

Q.18 What is data repository?

Ans. : A data repository is also known as a data library or data archive. This is a
general term that refers to a data set isolated to be mined for data reporting and
analysis. A data repository is a large database infrastructure, that is, several databases
that collect, manage and store data sets for data analysis, sharing and reporting.

Q.19 List the data cleaning tasks.

Ans. : Data cleaning tasks are as follows:

1. Data acquisition and metadata

2. Fill in missing values

3. Unified date format

4. Converting nominal to numeric

5. Identify outliers and smooth out noisy data


6. Correct inconsistent data

Q.20 What is Euclidean distance ?

Ans. : Euclidean distance is used to measure the similarity between observations.
It is calculated as the square root of the sum of squared differences between
corresponding values of the two observations.
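The calculation can be sketched in Python; the two sample observations below are illustrative, not from the text:

```python
import math

def euclidean(p, q):
    """Square root of the sum of squared differences between coordinates."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

# Two hypothetical observations with three numeric attributes each
d = euclidean((1, 2, 3), (4, 6, 3))  # sqrt(9 + 16 + 0) = 5.0
```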

UNIT II : Describing Data

Syllabus

Types of Data - Types of Variables - Describing Data with Tables and Graphs -
Describing Data with Averages - Describing Variability - Normal Distributions
and Standard (z) Scores.

Types of Data

• Data is a collection of facts and figures which relay something specific, but
which are not organized in any way. It can be numbers, words, measurements,
observations or even just descriptions of things. We can say data is the raw
material in the production of information.

• A data set is a collection of related records or information. The information may
be on some entity or some subject area.

• A data set is a collection of data objects and their attributes. An attribute
captures a basic characteristic of an object.

• Each row of a data set is called a record. Each data set also has multiple
attributes, each of which gives information on a specific characteristic.

Qualitative and Quantitative Data

• Data can broadly be divided into following two types: Qualitative data and
quantitative data.
Qualitative data:

• Qualitative data provides information about the quality of an object or
information which cannot be measured. Qualitative data cannot be expressed as
a number. Data that represent nominal scales such as gender, economic status or
religious preference are usually considered to be qualitative data.

• Qualitative data is data concerned with descriptions, which can be observed
but cannot be computed. Qualitative data is also called categorical data.
Qualitative data can be further subdivided into two types as follows:

1. Nominal data

2. Ordinal data

Quantitative data:

• Quantitative data is data that focuses on numbers and mathematical
calculations and can be calculated and computed.

• Quantitative data are anything that can be expressed as a number or quantified.
Examples of quantitative data are scores on achievement tests, number of hours
of study or weight of a subject. These data may be represented by ordinal,
interval or ratio scales and lend themselves to most statistical manipulation.

• There are two types of quantitative data: Interval data and ratio data.

Difference between Qualitative and Quantitative Data


Advantages and Disadvantages of Qualitative Data

1. Advantages:

• It helps in-depth analysis

• Qualitative data helps the market researchers to understand the mindset of
their customers.

• Avoid pre-judgments

2. Disadvantages:

• Time consuming

• Not easy to generalize

• Difficult to make systematic comparisons

Advantages and Disadvantages of Quantitative Data


1. Advantages:

• Easier to summarize and make comparisons.

• It is often easier to obtain large sample sizes

• It is less time consuming since it is based on statistical analysis.

2. Disadvantages:

• The cost is relatively high.

• There is no accurate generalization of the data the researcher received.

Ranked Data

• Ranked data is a variable in which the value of the data is captured from an
ordered set, which is recorded in order of magnitude. Ranked data is also
called ordinal data.

• Ordinal represents the "order." Ordinal data is a kind of qualitative or
categorical data. It can be grouped, named and also ranked.

• Characteristics of the Ranked data:

a) The ordinal data shows the relative ranking of the variables

b) It identifies and describes the magnitude of a variable

c) Along with the information provided by the nominal scale, ordinal scales
give the rankings of those variables

d) The interval properties are not known

e) The surveyors can quickly analyze the degree of agreement concerning the
identified order of variables

• Examples:

a) University ranking : 1st, 9th, 87th...


b) Socioeconomic status: poor, middle class, rich.

c) Level of agreement: yes, maybe, no.

d) Time of day: dawn, morning, noon, afternoon, evening, night

Scale of Measurement

• Scales of measurement are also called levels of measurement. Each level of
measurement scale has specific properties that determine the various uses of
statistical analysis.

• There are four different scales of measurement, and data can be defined as
belonging to one of the four scales. The four types of scales are: Nominal,
ordinal, interval and ratio.

Nominal

• Nominal data is the first level of measurement scale, in which the numbers
serve as "tags" or "labels" to classify or identify the objects.

• Nominal data usually deals with non-numeric variables or numbers that do
not have any value. While developing statistical models, nominal data are
usually transformed before building the model.

• It is also known as a categorical variable.

Characteristics of nominal data:

1. A nominal data variable is classified into two or more categories. In this
measurement mechanism, the answer should fall into either of the classes.

2. It is qualitative. The numbers are used here to identify the objects.

3. The numbers don't define the object characteristics. The only permissible
aspect of numbers in the nominal scale is "counting".

• Example:

1. Gender: Male, female, other.


2. Hair Color: Brown, black, blonde, red, other.

Interval

• Interval data corresponds to a variable in which the value is chosen from an
interval set.

• It is defined as a quantitative measurement scale in which the difference
between two variables is meaningful. In other words, the variables are
measured in an exact manner, not in a relative way in which the presence of
zero is arbitrary.

• Characteristics of interval data:

a) The interval data is quantitative as it can quantify the difference between the
values.

b) It allows calculating the mean and median of the variables.

c) To understand the difference between the variables, you can subtract the
values between the variables.

d) The interval scale is the preferred scale in statistics as it helps to assign
numerical values to arbitrary assessments such as feelings, calendar types, etc.

• Examples:

1. Celsius temperature

2. Fahrenheit temperature

3. Time on a clock with hands.

Ratio

• Any variable for which the ratios can be computed and are meaningful is
called ratio data.

• It is a type of variable measurement scale. It allows researchers to compare the
differences or intervals. The ratio scale has a unique feature: it possesses the
character of an origin or zero point.
• Characteristics of ratio data:

a) Ratio scale has a feature of absolute zero.

b) It doesn't have negative numbers, because of its zero-point feature.

c) It affords unique opportunities for statistical analysis. The variables can be
orderly added, subtracted, multiplied and divided. Mean, median and mode can
be calculated using the ratio scale.

d) Ratio data has unique and useful properties. One such feature is that it allows
unit conversions like kilogram - calories, gram - calories, etc.

• Examples: Age, weight, height, ruler measurements, number of children.

Example 2.1.1: Indicate whether each of the following terms is qualitative,
ranked or quantitative:

(a) ethnic group

(b) academic major

(c) age

(d) family size

(e) net worth (in Rupees)

(f) temperature

(g) sexual preference

(h) second-place finish

(i) IQ score

(j) gender

Solution :

(a) ethnic group → Qualitative

(b) academic major → Qualitative

(c) age → Quantitative

(d) family size → Quantitative

(e) net worth (in Rupees) → Quantitative

(f) temperature → Quantitative

(g) sexual preference → Qualitative

(h) second-place finish → Ranked

(i) IQ score → Quantitative

(j) gender → Qualitative

Types of Variables

• Variable is a characteristic or property that can take on different values.

Discrete and Continuous Variables

Discrete variables:

• Quantitative variables can be further distinguished in terms of whether they
are discrete or continuous.

• The word discrete means countable. For example, the number of students in a
class is countable or discrete. The value could be 2, 24, 34 or 135 students, but
it cannot be a fraction such as 23/32 or 12.23 students.

• The number of pages in a book is a discrete variable. Discrete data can only
take on certain individual values.

Continuous variables:

• Continuous variables are a variable which can take all values within a given
interval or range. A continuous variable consists of numbers whose values, at
least in theory, have no restrictions.
• Examples of continuous variables are blood pressure, weight, height and income.

• Continuous data can take on any value in a certain range. Length of a file is a
continuous variable.

Difference between Discrete variables and Continuous variables

Approximate Numbers

• An approximate number is defined as a number approximated to the exact
number; there is always a difference between the exact and approximate
numbers.

• For example, 2, 4, 9 are exact numbers as they do not need any approximation.

• But √2, π, √3 are approximate numbers, as they cannot be expressed exactly
by a finite number of digits. They can be written as 1.414, 3.1416, 1.7320, etc.,
which are only approximations to the true values.
• Whenever values are rounded off, as is always the case with actual values for
continuous variables, the resulting numbers are approximate, never exact.

• An approximate number is one that has uncertainty. A number can be
approximate for one of two reasons:

a) The number can be the result of a measurement.

b) Certain numbers simply cannot be written exactly in decimal form. Many
fractions and all irrational numbers fall into this category.

Independent and Dependent Variables

• The two main variables in an experiment are the independent and dependent
variable. An experiment is a study in which the investigator decides who
receives the special treatment.

1. Independent variables

• An independent variable is the variable that is changed or controlled in a
scientific experiment to test the effects on the dependent variable.

• An independent variable is a variable that represents a quantity that is being
manipulated in an experiment.

• The independent variable is the one that the researcher intentionally changes
or controls.

• In an experiment, an independent variable is the treatment manipulated by the
investigator. In mathematical equations, independent variables are mostly
denoted by 'x'.

• Independent variables are also termed "explanatory variables,"
"manipulated variables," or "controlled variables." In a graph, the independent
variable is usually plotted on the X-axis.

2. Dependent variables

• A dependent variable is the variable being tested and measured in a scientific
experiment.
• The dependent variable is 'dependent' on the independent variable. As the
experimenter changes the independent variable, the effect on the dependent
variable is observed and recorded.

• The dependent variable is the factor that the research measures. It changes in
response to the independent variable or depends upon it.

• A dependent variable represents a quantity whose value depends on how the
independent variable is manipulated.

• In mathematical equations, dependent variables are mostly denoted by 'y'.

• Dependent variables are also termed the "measured variable," the "responding
variable," or the "explained variable". In a graph, dependent variables are
usually plotted on the Y-axis.

• When a variable is believed to have been influenced by the independent
variable, it is called a dependent variable. In an experimental setting, the
dependent variable is measured, counted or recorded by the investigator.

• Example: Suppose we want to know whether or not eating breakfast affects
student test scores. The factor under the experimenter's control is the presence
or absence of breakfast, so we know it is the independent variable. The
experiment measures the test scores of students who ate breakfast versus those
who did not. Theoretically, the test results depend on breakfast, so the test
results are the dependent variable. Note that test scores are the dependent
variable even if it turns out there is no relationship between scores and breakfast.

Observational Study

• An observational study focuses on detecting relationships between variables
not manipulated by the investigator. An observational study is used to answer a
research question based purely on what the researcher observes. There is no
interference or manipulation of the research subjects and no control and
treatment groups.

• These studies are often qualitative in nature and can be used for both
exploratory and explanatory research purposes. While quantitative observational
studies exist, they are less common.
• Observational studies are generally used in hard science, medical and social
science fields. This is often due to ethical or practical concerns that prevent the
researcher from conducting a traditional experiment. However, the lack of
control and treatment groups means that forming inferences is difficult and
there is a risk of confounding variables impacting user analysis.

Confounding Variable

• Confounding variables are those that affect other variables in a way that
produces spurious or distorted associations between two variables. They
confound the "true" relationship between two variables. Confounding refers to
differences in outcomes that occur because of differences in the baseline risks of
the comparison groups.

• For example, if we have an association between two variables (X and Y) and
that association is due entirely to the fact that both X and Y are affected by a
third variable (Z), then we would say that the association between X and Y is
spurious and that it is a result of the effect of a confounding variable (Z).

• A difference between groups might be due not to the independent variable but
to a confounding variable.

• For a variable to be confounding:

a) It must be connected with the independent variable of interest, and

b) It must be connected to the outcome or dependent variable directly.

• Consider, for example, research with the objective of showing that alcohol
drinkers have more heart disease than non-drinkers; the association can be
influenced by another factor. For instance, alcohol drinkers might consume
cigarettes more than non-drinkers, so cigarette smoking acts as a confounding
variable in studying the association between drinking alcohol and heart disease.

• For example, suppose a researcher collects data on ice cream sales and shark
attacks and finds that the two variables are highly correlated. Does this mean
that increased ice cream sales cause more shark attacks? That's unlikely. The
more likely cause is the confounding variable temperature. When it is warmer
outside, more people buy ice cream and more people go in the ocean.

Describing Data with Tables

Frequency Distributions for Quantitative Data


• Frequency distribution is a representation, either in a graphical or tabular
format, that displays the number of observations within a given interval. The
interval size depends on the data being analyzed and the goals of the analyst.

• In order to find the frequency distribution of quantitative data, we can use a
table that gives information about "the number of smartphones owned per
family."

• For such quantitative data, it is quite straightforward to make a frequency
distribution table. People own either 1, 2, 3, 4 or 5 smartphones. Then, all we
need to do is to find the frequency of 1, 2, 3, 4 and 5. Arranging this information
in table format gives a frequency table for quantitative data.

• When observations are sorted into classes of single values, the result is
referred to as a frequency distribution for ungrouped data. It is the
representation of ungrouped data and is typically used when we have a smaller
data set.

• A frequency distribution is a means to organize a large amount of data. It takes
data from a population based on certain characteristics and organizes the data in
a way that is comprehensible to an individual who wants to make assumptions
about a given population.

• The types of frequency distribution are grouped frequency distribution,
ungrouped frequency distribution, cumulative frequency distribution, relative
frequency distribution and relative cumulative frequency distribution.

1. Grouped data:

• Grouped data refers to the data which is bundled together in different classes
or categories.

• Data are grouped when the variable stretches over a wide range and there are a
large number of observations and it is not possible to arrange the data in any
order, as it consumes a lot of time. Hence, it is pertinent to convert frequency
into a class group called a class interval.

• Suppose we conduct a survey in which we ask 15 families how many pets they
have in their home. The results are as follows:

1, 1, 1, 1, 2, 2, 2, 3, 3, 4, 5, 5, 6, 7, 8

• Often we use grouped frequency distributions, in which we create groups of
values and then summarize how many observations from a dataset fall into
those groups. Here's an example of a grouped frequency distribution for our
survey data:
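• The survey counts above can be reproduced with a short Python sketch. The grouped classes of width 2 are an assumption for illustration, since the original grouped table is not reproduced here:

```python
from collections import Counter

pets = [1, 1, 1, 1, 2, 2, 2, 3, 3, 4, 5, 5, 6, 7, 8]

# Ungrouped frequency distribution: one class per distinct value
ungrouped = Counter(pets)

# Grouped frequency distribution with illustrative classes of width 2
classes = [(1, 2), (3, 4), (5, 6), (7, 8)]
grouped = {c: sum(1 for x in pets if c[0] <= x <= c[1]) for c in classes}
```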

Guidelines for Constructing a Frequency Distribution


1. All classes should be of the same width.
2. Classes should be set up so that they do not overlap and so that each piece of
data belongs to exactly one class.

3. List all classes, even those with zero frequencies.

4. There should be between 5 and 20 classes.

5. The classes are continuous.

• The real limits are located at the midpoint of the gap between adjacent tabled
boundaries; that is, one-half of one unit of measurement below the lower tabled
boundary and one-half of one unit of measurement above the upper tabled
boundary.

• Table 2.3.4 gives a frequency distribution of the IQ test scores for 75 adults.

• IQ score is a quantitative variable and, according to the table, eight of the
individuals have an IQ score between 80 and 94, fourteen have scores between
95 and 109, twenty-four have scores between 110 and 124, sixteen have scores
between 125 and 139 and thirteen have scores between 140 and 154.

• The frequency distribution given in the table is composed of five classes. The
classes are: 80-94, 95-109, 110-124, 125-139 and 140-154. Each class has a
lower class limit and an upper class limit. The lower class limits for this
distribution are 80, 95, 110, 125 and 140. The upper class limits are 94, 109,
124, 139 and 154.

• If the lower class limit for the second class, 95, is added to the upper class
limit for the first class, 94, and the sum divided by 2, the upper boundary for
the first class and the lower boundary for the second class is determined. Table
2.3.5 gives all the boundaries for Table 2.3.4.
• If the lower class limit is added to the upper class limit for any class and the
sum divided by 2, the class mark for that class is obtained. The class mark for a
class is the midpoint of the class and is sometimes called the class midpoint
rather than the class mark.
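• The boundary and class-mark rules above can be checked with a short Python sketch using the IQ class limits from the text:

```python
# Class limits from the IQ example (Table 2.3.4)
limits = [(80, 94), (95, 109), (110, 124), (125, 139), (140, 154)]

# Real boundaries lie half a unit of measurement beyond the tabled limits
boundaries = [(lo - 0.5, hi + 0.5) for lo, hi in limits]

# Class mark (midpoint): (lower limit + upper limit) / 2
marks = [(lo + hi) / 2 for lo, hi in limits]
```

Note that the upper boundary of each class equals the lower boundary of the next, so the classes are continuous, as the guidelines require.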

Example 2.3.1: Following table gives the frequency distribution for the
cholesterol values of 45 patients in a cardiac rehabilitation study. Give the
lower and upper class limits and boundaries as well as the class marks for
each class.

• Solution: Below table gives the limits, boundaries and marks for the classes.
Example 2.3.2: The IQ scores for a group of 35 school dropouts are as
follows:

a) Construct a frequency distribution for grouped data.

b) Specify the real limits for the lowest class interval in this frequency
distribution.

• Solution: Calculating the class width

(123-69)/ 10=54/10=5.4≈ 5

a) Frequency distribution for grouped data


b) Real limits for the lowest class interval in this frequency distribution = 64.5-
69.5.
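• The class-width and real-limit calculations in this example can be sketched in Python; the lowest tabled interval 65-69 is inferred from the stated real limits of 64.5-69.5:

```python
# Class width from the extreme scores (123 and 69) and a target of 10 classes
high, low, target_classes = 123, 69, 10
width = round((high - low) / target_classes)  # 54 / 10 = 5.4, rounded to 5

# Real limits extend one-half unit of measurement beyond the tabled limits,
# so the tabled interval 65-69 has real limits 64.5-69.5
real_limits = (65 - 0.5, 69 + 0.5)
```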

Example 2.3.3: Given below are the weekly pocket expenses (in Rupees) of
a group of 25 students selected at random.

37, 41, 39, 34, 41, 26, 46, 31, 48, 32, 44, 39, 35, 39, 37, 49, 27, 37, 33, 38, 49,
45, 44, 37, 36

Construct a grouped frequency distribution table with class intervals of


equal widths, starting from 25-30, 30-35 and so on. Also, find the range of
weekly pocket expenses.

Solution:
• In the given data, the smallest value is 26 and the largest value is 49. So, the
range of the weekly pocket expenses = 49-26=23.
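• A Python sketch reproducing this table and range, treating each interval's lower bound as inclusive and its upper bound as exclusive, as the 25-30, 30-35 labels suggest:

```python
expenses = [37, 41, 39, 34, 41, 26, 46, 31, 48, 32, 44, 39, 35, 39, 37,
            49, 27, 37, 33, 38, 49, 45, 44, 37, 36]

# Class intervals 25-30, 30-35, ...; lower bound inclusive, upper exclusive
intervals = [(25, 30), (30, 35), (35, 40), (40, 45), (45, 50)]
freq = {iv: sum(1 for x in expenses if iv[0] <= x < iv[1]) for iv in intervals}

# Range = largest value - smallest value
data_range = max(expenses) - min(expenses)  # 49 - 26 = 23
```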

Outliers
• In statistics, an outlier is an observation point that is distant from other
observations.

• An outlier is a value that escapes normality and can cause anomalies in the
results obtained through algorithms and analytical systems. Therefore, outliers
always need some degree of attention.

• Understanding the outliers is critical in analyzing data for at least two aspects:

a) The outliers may negatively bias the entire result of an analysis;

b) The behavior of outliers may be precisely what is being sought.

• The simplest way to find outliers in data is to look directly at the data table,
the dataset, as data scientists call it. The case of the following table clearly
exemplifies a typing error, that is, input of the data.

• The age field for the individual Antony Smith certainly does not represent an
age of 470 years. Looking at the table it is possible to identify the outlier, but it
is difficult to say which would be the correct age. There are several possibilities
that could be the right age, such as 47, 70 or even 40 years.
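• A minimal Python sketch of this table-inspection approach; the ages and the 0-120 plausibility window are illustrative assumptions, with 470 standing in for the typing error:

```python
ages = [47, 32, 28, 470, 51, 39]  # 470 is the hypothetical typing error

# Simplest check from the text: inspect the minimum and maximum values
lo, hi = min(ages), max(ages)

# An assumed plausibility window for human ages flags the outlier
outliers = [a for a in ages if not 0 <= a <= 120]
```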

Relative and Cumulative Frequency Distribution


• Relative frequency distributions show the frequency of each class as a part
or fraction of the total frequency for the entire distribution. Frequency
distributions can show either the actual number of observations falling in each
range or the percentage of observations. In the latter instance, the distribution is
called a relative frequency distribution.
• To convert a frequency distribution into a relative frequency distribution,
divide the frequency for each class by the total frequency for the entire
distribution.

• A relative frequency distribution lists the data values along with the percent
of all observations belonging to each group. These relative frequencies are
calculated by dividing the frequencies for each group by the total number of
observations.
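• The conversion can be sketched in Python; the frequency counts below are hypothetical:

```python
freq = {1: 20, 2: 50, 3: 80, 4: 40, 5: 10}  # hypothetical family-size counts

# Divide each class frequency by the total frequency
total = sum(freq.values())  # 200 families
relative = {k: v / total for k, v in freq.items()}
```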

• Example: Suppose we take a sample of 200 Indian families and record the
number of people living in each. We obtain the following:

Cumulative frequency:

• A cumulative frequency distribution can be useful for ordered data (e.g. data
arranged in intervals, measurement data, etc.). Instead of reporting frequencies,
the recorded values are the sum of all frequencies for values less than and
including the current value.

• Example: Suppose we take a sample of 200 Indian families and record the
number of people living in each. We obtain the following:
• To convert a frequency distribution into a cumulative frequency distribution,
add to the frequency of each class the sum of the frequencies of all classes
ranked below it.
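• The running-sum conversion can be sketched in Python; the frequency counts below are hypothetical:

```python
freq = {1: 20, 2: 50, 3: 80, 4: 40, 5: 10}  # hypothetical counts, ordered classes

# Add to each class's frequency the frequencies of all classes ranked below it
cumulative, running = {}, 0
for k in sorted(freq):
    running += freq[k]
    cumulative[k] = running
```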

Frequency Distributions for Qualitative (Nominal) Data


• If, in a set of observations, any single observation is a word, numerical code or
letter, then the data are qualitative. Frequency distributions for qualitative data
are easy to construct.

• It is possible to convert frequency distributions for qualitative variables into
relative frequency distributions.

• If measurement is ordinal, because observations can be ordered from least to
most, cumulative frequencies can be used.

Graphs for Quantitative Data

1. Histogram

• A histogram is a special kind of bar graph that applies to quantitative data
(discrete or continuous). The horizontal axis represents the range of data values.
The bar height represents the frequency of data values falling within the interval
formed by the width of the bar. The bars are also pushed together with no
spaces between them.

• It is a diagram consisting of rectangles whose area is proportional to the
frequency of a variable and whose width is equal to the class interval.

• Here the data values only take on integer values, but we still split the range of
values into intervals. In this case, the intervals are [1,2), [2,3), [3,4), etc. Notice
that this graph is also close to being bell-shaped. A symmetric, bell-shaped
distribution is called a normal distribution.

• Fig. 2.4.1 shows a histogram.

• Notice that all the rectangles are adjacent and they have no gaps between them
unlike a bar graph.

• The histogram above is called a frequency histogram. If we had used the
relative frequency to make the histogram, we would call the graph a relative
frequency histogram.

• If we had used the percentage to make the histogram, we would call the graph
a percentage histogram.
• A relative frequency histogram is the same as a regular histogram, except
instead of the bar height representing frequency, it now represents the relative
frequency (so the y-axis runs from 0 to 1, which is 0% to 100%).
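• The bar heights of a frequency histogram and of a relative frequency histogram can be computed with a short Python sketch; the data values are illustrative:

```python
values = [1, 1, 2, 2, 2, 3, 3, 3, 3, 4, 4, 5]  # hypothetical integer data

# Intervals [1,2), [2,3), ... as in the text; bar height = frequency
bins = [(b, b + 1) for b in range(1, 6)]
heights = [sum(1 for v in values if lo <= v < hi) for lo, hi in bins]

# Relative-frequency heights run from 0 to 1 (0% to 100%)
rel_heights = [h / len(values) for h in heights]
```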

2. Frequency polygon

• Frequency polygons are a graphical device for understanding the shapes of


distributions. They serve the same purpose as histograms, but are especially
helpful for comparing sets of data. Frequency polygons are also a good choice
for displaying cumulative frequency distributions.

• We can say that frequency polygon depicts the shapes and trends of data. It
can be drawn with or without a histogram.

• Suppose we are given frequency and bins of the ages from another survey as
shown in Table 2.4.1.

• The midpoints will be used for the position on the horizontal axis and the
frequency for the vertical axis. From Table 2.4.1 we can then create the
frequency polygon as shown in Fig. 2.4.2.
• A line indicates that there is a continuous movement. A frequency polygon
should therefore be used for scale variables that are binned, but sometimes a
frequency polygon is also used for ordinal variables.

• Frequency polygons are useful for comparing distributions. This is achieved by overlaying the frequency polygons drawn for different data sets.
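The construction described above — class midpoints on the horizontal axis, frequencies on the vertical axis — can be sketched as follows (the bin edges and frequencies are invented, in the spirit of Table 2.4.1):

```python
def polygon_points(bin_edges, freqs):
    """Pair each class midpoint with its frequency; joining these (x, y)
    points with straight lines gives the frequency polygon."""
    mids = [(lo + hi) / 2 for lo, hi in zip(bin_edges, bin_edges[1:])]
    step = bin_edges[1] - bin_edges[0]
    # Anchor the polygon to the axis with zero-frequency points on each side.
    xs = [mids[0] - step] + mids + [mids[-1] + step]
    ys = [0] + list(freqs) + [0]
    return list(zip(xs, ys))

# Hypothetical age bins and their frequencies.
edges = [0, 10, 20, 30, 40, 50, 60]
freqs = [4, 8, 12, 9, 5, 2]
print(polygon_points(edges, freqs))
# [(-5.0, 0), (5.0, 4), (15.0, 8), (25.0, 12), (35.0, 9), (45.0, 5), (55.0, 2), (65.0, 0)]
```

Passing several data sets through the same function and plotting the resulting point lists on one axis is how overlaid frequency polygons are compared.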

Example 2.4.1: The frequency polygon of a frequency distribution is shown below.

Answer the following about the distribution from the frequency polygon.

(i) What is the frequency of the class interval whose class mark is 15?

(ii) What is the class interval whose class mark is 45?

(iii) Construct a frequency table for the distribution.

• Solution:

(i) Frequency of the class interval whose class mark is 15 → 8

(ii) Class interval whose class mark is 45 → 40 - 50

(iii) As the class marks of the consecutive non-overlapping class intervals are 5, 15, 25, 35, 45 and 55, the class intervals are 0 - 10, 10 - 20, 20 - 30, 30 - 40, 40 - 50 and 50 - 60. Therefore, the frequency table is constructed as below.
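The step used in part (iii) — recovering class intervals from equally spaced class marks — can be expressed as a small helper (a sketch; the class marks are those of the example):

```python
def intervals_from_marks(marks):
    """Recover class intervals from equally spaced class marks: each
    interval runs from mark - h/2 to mark + h/2, where h is the spacing."""
    h = marks[1] - marks[0]
    return [(m - h / 2, m + h / 2) for m in marks]

print(intervals_from_marks([5, 15, 25, 35, 45, 55]))
# [(0.0, 10.0), (10.0, 20.0), (20.0, 30.0), (30.0, 40.0), (40.0, 50.0), (50.0, 60.0)]
```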
3. Stem and Leaf diagram:

• Stem and leaf diagrams allow us to display raw data visually. Each raw score is divided into a stem and a leaf. The leaf is typically the last digit of the raw value; the stem is the remaining digits.

• Data points are split into a leaf (usually the ones digit) and a stem (the other
digits)

• To generate a stem and leaf diagram, first create a vertical column that
contains all of the stems. Then list each leaf next to the corresponding stem. In
these diagrams, all of the scores are represented in the diagram without the loss
of any information.

• A stem-and-leaf plot retains the original data. The leaves are usually the last
digit in each data value and the stems are the remaining digits.

• Create a stem-and-leaf plot of the following test scores from a group of college
freshmen.

• Stem and Leaf Diagram :
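The splitting rule above (leaf = last digit, stem = remaining digits) is easy to mechanise for two-digit scores. The test scores below are hypothetical stand-ins, since the original data set is not reproduced here:

```python
from collections import defaultdict

def stem_and_leaf(scores):
    """Group scores by stem (all digits but the last); leaves are last digits."""
    table = defaultdict(list)
    for s in sorted(scores):
        table[s // 10].append(s % 10)
    return dict(table)

# Hypothetical test scores for a group of college freshmen.
scores = [67, 72, 85, 93, 73, 77, 85, 91, 68, 74]
for stem, leaves in sorted(stem_and_leaf(scores).items()):
    print(stem, "|", " ".join(str(leaf) for leaf in leaves))
# 6 | 7 8
# 7 | 2 3 4 7
# 8 | 5 5
# 9 | 1 3
```

Note that no information is lost: every original score can be rebuilt by gluing each stem back onto its leaves.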


Graphs for Qualitative (Nominal) Data

• There are a couple of graphs that are appropriate for qualitative data that has
no natural ordering.

1. Bar graphs

• Bar Graphs are like histograms, but the horizontal axis has the name of each
category and there are spaces between the bars.

• Usually, the bars are ordered with the categories in alphabetical order. One
variant of a bar graph is called a Pareto Chart. These are bar graphs with the
categories ordered by frequency, from largest to smallest.

• Fig. 2.5.1 shows a bar graph.

• Bars of a bar graph can be represented both vertically and horizontally.


• In bar graph, bars are used to represent the amount of data in each category;
one axis displays the categories of qualitative data and the other axis displays
the frequencies.
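A Pareto chart's bar order — largest count first — can be computed with the standard library; the survey responses here are invented for illustration:

```python
from collections import Counter

def pareto_order(categories):
    """Tally each category and order the bars from most to least
    frequent, as in a Pareto chart."""
    return Counter(categories).most_common()

# Hypothetical survey of preferred transport modes.
responses = ["bus", "car", "bike", "car", "bus", "car", "walk", "bus", "car"]
print(pareto_order(responses))   # [('car', 4), ('bus', 3), ('bike', 1), ('walk', 1)]
```

Sorting the same tallies alphabetically by category instead would give the ordinary bar-graph ordering described above.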

Misleading Graph

• It is a well-known fact that statistics can be misleading. They are often used to prove a point and can easily be twisted in favour of that point.

• Good graphs are extremely powerful tools for displaying large quantities of
complex data; they help turn the realms of information available today into
knowledge. But, unfortunately, some graphs deceive or mislead.

• This may happen because the designer chooses to give readers the impression
of better performance or results than is actually the situation. In other cases, the
person who prepares the graph may want to be accurate and honest, but may
mislead the reader by a poor choice of a graph form or poor graph construction.

• The following things are important to consider when looking at a graph:

1. Title

2. Labels on both axes of a line or bar chart and on all sections of a pie chart

3. Source of the data

4. Key to a pictograph

5. Uniform size of a symbol in a pictograph

6. Scale: Does it start with zero? If not, is there a break shown?

7. Scale: Are the numbers equally spaced?

• A graph can be altered by changing the scale of the graph. For example, data
in the two graphs of Fig. 2.6.1 are identical, but scaling of the Y-axis changes
the impression of the magnitude of differences.
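The scaling effect can be quantified: the ratio of the bar heights as drawn depends on where the y-axis starts. A sketch with made-up sales figures:

```python
def apparent_ratio(a, b, baseline=0.0):
    """Ratio of drawn bar heights when the y-axis starts at `baseline`
    rather than zero; truncating the axis inflates small differences."""
    return (a - baseline) / (b - baseline)

# Hypothetical yearly sales figures that differ by only 5%.
sales_a, sales_b = 105, 100
print(apparent_ratio(sales_a, sales_b))               # 1.05 (honest axis from 0)
print(apparent_ratio(sales_a, sales_b, baseline=95))  # 2.0  (axis truncated at 95)
```

With the axis truncated at 95, one bar is drawn twice as tall as the other even though the underlying values differ by only 5%.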
Example 2.6.1: Construct a frequency distribution for the number of
different residences occupied by graduating seniors during their college
career, namely: 1, 4, 2, 3, 3, 1, 6, 7, 4, 3, 3, 9, 2, 4, 2, 2, 3, 2, 3, 4, 4, 2, 3, 3, 5.
What is the shape of this distribution?

Solution:

• Frequency distribution (number of residences : frequency): 1 : 2, 2 : 6, 3 : 8, 4 : 5, 5 : 1, 6 : 1, 7 : 1, 9 : 1.

• Shape: the frequencies peak at 3 and taper off in a long tail toward the larger values (5, 6, 7 and 9), so the distribution is positively skewed rather than symmetric (bell-shaped).

Histogram of given data:
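The tally for this example can be checked with a few lines of Python (standard library only), using the residence counts given in the question:

```python
from collections import Counter

residences = [1, 4, 2, 3, 3, 1, 6, 7, 4, 3, 3, 9,
              2, 4, 2, 2, 3, 2, 3, 4, 4, 2, 3, 3, 5]
freq = Counter(residences)
for value in sorted(freq):
    print(value, freq[value])
# 1 2
# 2 6
# 3 8
# 4 5
# 5 1
# 6 1
# 7 1
# 9 1
```

The long tail of low frequencies stretching out to 9 is what makes this distribution positively skewed.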
