CHAPTER – 1
INTRODUCTION
When you conduct research about a group of people, it’s rarely possible to collect data from
every person in that group. Instead, you select a sample. The sample is the group of individuals
who will participate in the research.
Data processing in research is the collection and translation of a data set into valuable, usable
information. Through this process, a researcher, data engineer or data scientist takes raw data
and converts it into a more readable format, such as a graph, report or chart, either manually or
through an automated tool. The researcher will then use this information to gain insights, solve
problems, make improvements and ultimately generate better results.
Data Processing: This refers to the steps involved in transforming raw data into usable
and meaningful information. It includes collecting data, cleaning it to remove errors or
inconsistencies, organizing it systematically, and analyzing it to draw conclusions.
Proper data processing ensures the reliability and accuracy of research results.
These sampling and data processing methods are fundamental in research for efficient data
handling and for making valid generalizations about a population.
CHAPTER – 2
SAMPLING METHODS
First, you need to understand the difference between a population and a sample, and identify
the target population of your research.
The population is the entire group that you want to draw conclusions about.
The sample is the specific group of individuals that you will collect data from.
The population can be defined in terms of geographical location, age, income, or many other
characteristics.
It can be very broad or quite narrow: maybe you want to make inferences about the whole
adult population of your country; maybe your research focuses on customers of a certain
company, patients with a specific health condition, or students in a single school.
It is important to carefully define your target population according to the purpose and
practicalities of your project.
If the population is very large, demographically mixed, and geographically dispersed, it might
be difficult to gain access to a representative sample. A lack of a representative sample affects
the validity of your results, and can lead to several research biases, particularly sampling bias.
Sampling frame
The sampling frame is the actual list of individuals that the sample will be drawn from. Ideally,
it should include the entire target population (and nobody who is not part of that population).
Example: Sampling frame. You are doing research on working conditions at a social media
marketing company. Your population is all 1000 employees of the company. Your sampling
frame is the company’s HR database, which lists the names and contact details of every
employee.
Sample size
The number of individuals you should include in your sample depends on various factors,
including the size and variability of the population and your research design. There are
different sample size calculators and formulas depending on what you want to achieve
with statistical analysis.
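For instance, one commonly used starting point for estimating a proportion is Cochran's formula, n0 = z^2 * p * (1 - p) / e^2, optionally adjusted for a small, finite population. The Python sketch below is a minimal illustration only; the confidence level, margin of error, and assumed proportion are example values, not recommendations for any particular study.

import math

def cochran_sample_size(z=1.96, margin_of_error=0.05, proportion=0.5):
    # Cochran's formula for a proportion: n0 = z^2 * p * (1 - p) / e^2.
    # z = 1.96 corresponds to a 95% confidence level.
    return math.ceil((z ** 2) * proportion * (1 - proportion) / margin_of_error ** 2)

def finite_population_correction(n0, population_size):
    # Adjust the initial estimate for a small, finite population.
    return math.ceil(n0 / (1 + (n0 - 1) / population_size))

n0 = cochran_sample_size()                                  # about 385 respondents
print("Infinite-population estimate:", n0)
print("Adjusted for 1000 employees:", finite_population_correction(n0, 1000))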
Probability sampling methods
Probability sampling means that every member of the population has a chance of being
selected. It is mainly used in quantitative research. If you want to produce results that are
representative of the whole population, probability sampling techniques are the most valid
choice.
1. Simple random sampling
In a simple random sample, every member of the population has an equal chance of being
selected. Your sampling frame should include the whole population.
To conduct this type of sampling, you can use tools like random number generators or other
techniques that are based entirely on chance.
Example: Simple random sampling. You want to select a simple random sample of 100
employees of a social media marketing company. You assign a number to every employee in
the company database from 1 to 1000, and use a random number generator to select 100
numbers.
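A minimal Python sketch of this example, assuming the sampling frame can be represented as a simple list of employee numbers from 1 to 1000:

import random

# Hypothetical sampling frame: employee numbers 1 to 1000 from the HR database.
sampling_frame = list(range(1, 1001))

random.seed(42)  # fixed seed only so the example is reproducible
# random.sample draws without replacement, so every employee has an equal
# chance of selection and nobody is picked twice.
simple_random_sample = random.sample(sampling_frame, k=100)

print(sorted(simple_random_sample)[:10])  # first few selected employee numbers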
2. Systematic sampling
Systematic sampling is similar to simple random sampling, but it is usually slightly easier to
conduct. Every member of the population is listed with a number, but instead of randomly
generating numbers, individuals are chosen at regular intervals.
Example: Systematic sampling. All employees of the company are listed in alphabetical order.
From the first 10 numbers, you randomly select a starting point: number 6. From number 6
onwards, every 10th person on the list is selected (6, 16, 26, 36, and so on), and you end up
with a sample of 100 people.
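A minimal Python sketch of this example, again assuming the alphabetical staff list can be represented as employee numbers 1 to 1000:

import random

sampling_frame = list(range(1, 1001))   # hypothetical alphabetical employee list
interval = len(sampling_frame) // 100   # sampling interval: 1000 / 100 = 10

random.seed(42)
start = random.randint(1, interval)     # random starting point within the first interval
systematic_sample = sampling_frame[start - 1::interval]

print(len(systematic_sample), systematic_sample[:5])  # 100 members chosen at a fixed interval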
If you use this technique, it is important to make sure that there is no hidden pattern in the list
that might skew the sample. For example, if the HR database groups employees by team, and
team members are listed in order of seniority, there is a risk that your interval might skip over
people in junior roles, resulting in a sample that is skewed towards senior employees.
3. Stratified sampling
Stratified sampling involves dividing the population into subpopulations that may differ in
important ways. It allows you to draw more precise conclusions by ensuring that every subgroup
is properly represented in the sample.
To use this sampling method, you divide the population into subgroups (called strata) based on
the relevant characteristic (e.g., gender identity, age range, income bracket, job role).
Based on the overall proportions of the population, you calculate how many people should be
sampled from each subgroup. Then you use random or systematic sampling to select a sample
from each subgroup.
Example: Stratified sampling. The company has 800 female employees and 200 male
employees. You want to ensure that the sample reflects the gender balance of the company, so
you sort the population into two strata based on gender. Then you use random sampling on
each group, selecting 80 women and 20 men, which gives you a representative sample of 100
people.
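A Python sketch of this proportional allocation, assuming a hypothetical frame in which each employee record carries a gender label:

import random

# Hypothetical frame reflecting the 800 female / 200 male split in the example.
frame = [(f"employee_{i}", "female" if i <= 800 else "male") for i in range(1, 1001)]

def stratified_sample(frame, strata_key, total_size):
    # Group the frame into strata, allocate the sample proportionally (rounded),
    # then draw a simple random sample within each stratum.
    strata = {}
    for unit in frame:
        strata.setdefault(strata_key(unit), []).append(unit)
    sample = []
    for members in strata.values():
        n = round(total_size * len(members) / len(frame))
        sample.extend(random.sample(members, n))
    return sample

random.seed(42)
sample = stratified_sample(frame, strata_key=lambda unit: unit[1], total_size=100)
print(sum(1 for _, g in sample if g == "female"), "women and",
      sum(1 for _, g in sample if g == "male"), "men")  # 80 women and 20 men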
4. Cluster sampling
Cluster sampling also involves dividing the population into subgroups, but each subgroup
should have similar characteristics to the whole sample. Instead of sampling individuals from
each subgroup, you randomly select entire subgroups.
If it is practically possible, you might include every individual from each sampled cluster. If
the clusters themselves are large, you can also sample individuals from within each cluster
using one of the techniques above. This is called multistage sampling.
This method is good for dealing with large and dispersed populations, but there is more risk of
error in the sample, as there could be substantial differences between clusters. It’s difficult to
guarantee that the sampled clusters are really representative of the whole population.
Example: Cluster sampling. The company has offices in 10 cities across the country (all with
roughly the same number of employees in similar roles). You don’t have the capacity to travel
to every office to collect your data, so you use random sampling to select 3 offices – these are
your clusters.
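A Python sketch of this example, assuming ten hypothetical offices of 100 employees each; it shows both one-stage cluster sampling and the multistage variant described above:

import random

# Hypothetical clusters: 10 offices, each with its own staff list.
offices = {f"office_{i}": [f"office_{i}_employee_{j}" for j in range(1, 101)]
           for i in range(1, 11)}

random.seed(42)
sampled_offices = random.sample(list(offices), k=3)  # stage 1: select whole clusters

# One-stage cluster sampling: include every individual in the selected offices.
one_stage = [person for office in sampled_offices for person in offices[office]]

# Multistage sampling: randomly sample individuals within each selected office
# (20 per office is an arbitrary illustrative choice).
multistage = [person for office in sampled_offices
              for person in random.sample(offices[office], k=20)]

print(sampled_offices, len(one_stage), len(multistage))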
Non-probability sampling methods
In a non-probability sample, individuals are selected based on non-random criteria, and not
every individual has a chance of being included.
This type of sample is easier and cheaper to access, but it has a higher risk of sampling bias.
That means the inferences you can make about the population are weaker than with probability
samples, and your conclusions may be more limited. If you use a non-probability sample, you
should still aim to make it as representative of the population as possible.
Non-probability sampling techniques are often used in exploratory and qualitative research. In
these types of research, the aim is not to test a hypothesis about a broad population, but to
develop an initial understanding of a small or under-researched population.
1. Convenience sampling
A convenience sample simply includes the individuals who happen to be most accessible to the
researcher.
This is an easy and inexpensive way to gather initial data, but there is no way to tell if the
sample is representative of the population, so it can’t produce generalizable results.
Convenience samples are at risk for both sampling bias and selection bias.
Example: Convenience sampling. You are researching opinions about student support services
in your university, so after each of your classes, you ask your fellow students to complete
a survey on the topic. This is a convenient way to gather data, but as you only surveyed students
taking the same classes as you at the same level, the sample is not representative of all the
students at your university.
2. Voluntary response sampling
Similar to a convenience sample, a voluntary response sample is based mainly on ease of
access: instead of the researcher selecting participants, people volunteer themselves, for
example by responding to a public online survey.
Voluntary response samples are always at least somewhat biased, as some people will
inherently be more likely to volunteer than others, leading to self-selection bias.
Example: Voluntary response sampling. You send out the survey to all students at your university
and a lot of students decide to complete it. This can certainly give you some insight into the
topic, but the people who responded are more likely to be those who have strong opinions about
the student support services, so you can’t be sure that their opinions are representative of all
students.
3. Purposive sampling
This type of sampling, also known as judgement sampling, involves the researcher using their
expertise to select a sample that is most useful to the purposes of the research.
It is often used in qualitative research, where the researcher wants to gain detailed knowledge
about a specific phenomenon rather than make statistical inferences, or where the population
is very small and specific. An effective purposive sample must have clear criteria and rationale
for inclusion. Always make sure to describe your inclusion and exclusion criteria and beware
of observer bias affecting your arguments.
Example: Purposive sampling. You want to know more about the opinions and experiences of
disabled students at your university, so you purposefully select a number of students with
different support needs in order to gather a varied range of data on their experiences with
student services.
4. Snowball sampling
If the population is hard to access, snowball sampling can be used to recruit participants via
other participants. The number of people you have access to “snowballs” as you get in contact
with more people. The downside here is also representativeness, as you have no way of
knowing how representative your sample is due to the reliance on participants recruiting others.
This can lead to sampling bias.
5. Quota sampling
You first divide the population into mutually exclusive subgroups (called strata) and then
recruit sample units until you reach your quota. These units share specific characteristics,
determined by you prior to forming your strata. The aim of quota sampling is to control what
or who makes up your sample.
Example: Quota sampling. You want to gauge consumer interest in a new produce delivery
service in Boston, focused on dietary preferences. You divide the population into meat eaters,
vegetarians, and vegans, drawing a sample of 1000 people. Since the company wants to cater
to all consumers, you set a quota of 200 people for each dietary group. In this way, all dietary
preferences are equally represented in your research, and you can easily compare these groups.
You continue recruiting until you reach the quota of 200 participants for each subgroup.
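A Python sketch of this recruitment logic; the recruit_participant function below is a hypothetical stand-in for contacting one more respondent and asking their dietary preference, and recruitment simply continues until every subgroup reaches its quota of 200:

import random

QUOTA = 200
quotas = {"meat eater": [], "vegetarian": [], "vegan": []}

def recruit_participant():
    # Hypothetical stand-in for reaching one more respondent.
    return {"id": random.randint(1, 10_000_000), "diet": random.choice(list(quotas))}

random.seed(42)
# Keep recruiting until every dietary subgroup has reached its quota; once a
# subgroup is full, additional matching respondents are not enrolled.
while any(len(group) < QUOTA for group in quotas.values()):
    person = recruit_participant()
    if len(quotas[person["diet"]]) < QUOTA:
        quotas[person["diet"]].append(person)

print({diet: len(group) for diet, group in quotas.items()})  # 200 in each subgroup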
CHAPTER – 3
DATA PROCESSING
The data processing cycle includes several steps. Though each stage has a specific order, the
entire process repeats cyclically.
1. Collection
Data collection is the process of extracting data from available sources, such as data
warehouses and data lakes. Raw data can come in several forms, from user behavior to
monetary figures to profit statements to web cookies. The type of raw data that you collect will
have a significant impact on the output you later produce. Researchers must look to accurate,
trustworthy and comprehensive sources for valid, usable findings.
2. Preparation
Through data preparation, you will polish, organize, filter and examine raw data for errors. The
data preparation stage is meant to eliminate incorrect, redundant or incomplete data and convert
it into a suitable form for further processing and analysis. The goal of the preparation stage is
to achieve the highest quality data possible.
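A minimal Python sketch of this stage, using pandas on a small hypothetical survey export that contains the typical problems preparation is meant to remove (duplicate records, missing values, and inconsistent labels):

import pandas as pd

# Hypothetical raw export with duplicates, a missing answer and messy labels.
raw = pd.DataFrame({
    "respondent_id": [1, 2, 2, 3, 4],
    "department":    ["Sales", "sales ", "sales ", None, "Marketing"],
    "satisfaction":  [4, 5, 5, None, 3],
})

prepared = (
    raw.drop_duplicates(subset="respondent_id")        # remove redundant records
       .dropna(subset=["satisfaction"])                # drop incomplete responses
       .assign(department=lambda d: d["department"].str.strip().str.title())
)

print(prepared)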
3. Input
The input stage is the first stage where raw data begins to resemble usable information. Once
the data is clean, you’ll enter it into a corresponding destination, such as a data warehouse or
customer relationship management (CRM) software, and translate it into a compatible language
for these systems. You can enter this data using numerous input sources, including keyboards,
scanners or digitizers.
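A minimal Python sketch of this stage, writing hypothetical cleaned records into a local SQLite database as a stand-in for a data warehouse or CRM system:

import sqlite3

# Hypothetical cleaned records ready to be loaded into a destination system.
records = [(1, "Sales", 4), (2, "Sales", 5), (4, "Marketing", 3)]

connection = sqlite3.connect("survey.db")
connection.execute(
    "CREATE TABLE IF NOT EXISTS responses "
    "(respondent_id INTEGER, department TEXT, satisfaction INTEGER)"
)
connection.executemany("INSERT INTO responses VALUES (?, ?, ?)", records)
connection.commit()

print(connection.execute("SELECT COUNT(*) FROM responses").fetchone()[0], "rows stored")
connection.close()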
4. Processing
Next, you’ll begin to process the data stored in your computer during the data input stage. You
can conduct data processing using machine learning and artificial intelligence algorithms to
generate the desired output, but the processing will vary based on your data sources and intended
output use. You can use the data from the processing stage in a variety of ways, from creating
medical diagnoses to determining customer needs to drawing connections between advertising
patterns.
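The processing step itself depends entirely on the data and the intended output; as a much simpler stand-in for machine learning pipelines, the Python sketch below aggregates hypothetical cleaned survey responses into a per-department summary:

import pandas as pd

# Hypothetical cleaned responses produced by the earlier stages.
prepared = pd.DataFrame({
    "department":   ["Sales", "Sales", "Marketing", "Marketing", "Engineering"],
    "satisfaction": [4, 5, 3, 3, 4],
})

# Aggregate the raw records into a summary the output stage can present.
summary = prepared.groupby("department")["satisfaction"].agg(["mean", "count"])
print(summary)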
5. Output
Through this stage, data becomes usable and can be interpreted by non-data scientists. This
translated data is readable and often presented in images, graphs, text, audio and videos. Once
interpreted, company members can self-serve the data for their analytics projects.
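A minimal Python sketch of this stage, turning hypothetical processed figures into a chart image with matplotlib:

import matplotlib.pyplot as plt

# Hypothetical processed results to be presented to non-specialists.
departments = ["Sales", "Marketing", "Engineering"]
average_satisfaction = [4.5, 3.0, 4.1]

plt.bar(departments, average_satisfaction)
plt.ylabel("Average satisfaction (1-5)")
plt.title("Employee satisfaction by department")
plt.savefig("satisfaction_by_department.png")  # saved as an image for the report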
6. Storage
After processing the data successfully, all remaining information should be stored for later use.
When companies properly store their data, they remain compliant with data protection
legislation and promote a faster, easier means of accessing information when they need to.
They can also use this data as input in the following processing cycle.
You can choose from three primary methods of data processing based on your needs:
Manual data processing: Through this method, users process data manually, meaning they
carry out every step without using electronics or automation software. Though this method is
the least expensive and requires minimal resources, it can be time-consuming and has a higher
risk of producing errors.
Mechanical data processing: Mechanical processing involves the use of machines and
devices to filter data, such as calculators, printing presses or typewriters. This method is
suitable for simple data processing endeavors and produces fewer errors but is more complex
than other techniques.
Electronic data processing: Researchers process data using modern data processing software
and technologies, where they feed an instruction set to the program to analyze the data and
produce an output. Though this method is the most expensive, it is also the fastest and most
reliable for generating accurate output.
CHAPTER – 4
Conclusion
Data processing and sampling methods are critical pillars of research, ensuring the
transformation of raw information into accurate, actionable insights. Data processing involves
a systematic cycle of steps, including collection, preparation, input, processing, output, and
storage. Each stage refines the data, eliminating errors and redundancies to produce reliable
results. The choice of processing methods—manual, mechanical, or electronic—plays a
significant role in determining the efficiency and accuracy of the research. Electronic data
processing, with its speed and precision, is particularly advantageous for handling complex
datasets in modern research scenarios.
The synergy between effective data processing and robust sampling methods allows for high-
quality, dependable research outcomes. Together, they support the generation of insights, the
resolution of challenges, and informed decision-making. In fields ranging from business
analytics to scientific studies, these methodologies enhance the reliability, scalability, and
impact of research efforts, making them indispensable in today’s data-driven world.