DATA PROCESSING Lecture Notes
Compiled by
Professor Olukunmi 'Lanre OLAITAN
Introduction
What is Data?
Data is a raw, unorganized fact that needs to be processed to become meaningful. Data can appear
simple and random until it is organized. Generally, data comprises facts, observations,
perceptions, numbers, characters, symbols, images, etc.
Data is an individual unit of raw material that does not carry any specific meaning on its own.
Information is a group of data that collectively carries a logical meaning.
Data is the name given to basic facts and entities like names and numbers. Five examples of data
include:
weights
prices and costs
numbers of items sold
employee names
product names.
Data is always interpreted, by a human or machine, to derive meaning; on its own, data is
meaningless. Data contains numbers, statements, and characters in raw form.
Information, by contrast, is a set of data that has been processed in a meaningful way according
to the given requirement. Information is processed, structured, or presented in a given context to
make it meaningful and useful.
It is processed data that possesses context, relevance, and purpose; producing it involves the
manipulation of raw data.
Information assigns meaning and improves the reliability of the data. It reduces uncertainty, and
when data is transformed into information, the useless details are removed.
KEY DIFFERENCE
1. Data is a raw and unorganized fact that is required to be processed to make it meaningful
whereas Information is a set of data that is processed in a meaningful way according to
the given requirement.
2. Data does not have any specific purpose whereas Information carries a meaning that has
been assigned by interpreting data.
3. Data alone has no significance while Information is significant by itself.
4. Data never depends on Information while Information is dependent on Data.
5. Data is measured in bits and bytes; information, on the other hand, is measured in
meaningful units like time, quantity, etc.
6. Data can be structured as tabular data, graphs, or data trees, whereas information consists
of language, ideas, and thoughts based on the given data.
Parameter: Usefulness
Data: The data collected by the researcher may or may not be useful.
Information: Information is useful and valuable, as it is readily available to the researcher for use.

Parameter: Dependency
Data: Data is never designed to the specific need of the user.
Information: Information is always specific to the requirements and expectations, because all the irrelevant facts and figures are removed during the transformation process.
Example: the individual test scores 55, 72, and 81 are data; the statement "the class average
score is 69.3" is information derived from them.
Data is classified into four major categories:
Nominal data
Ordinal data
Discrete data
Continuous data
Further, these types can be grouped into qualitative (categorical) data, covering nominal and
ordinal data, and quantitative (numerical) data, covering discrete and continuous data.
Sometimes categorical data can hold numerical values (quantitative values), but those values do
not have a mathematical sense. Examples of categorical data are birthdate, favourite sport, and
school postcode. Here, the birthdate and school postcode hold quantitative values, but those
values carry no numerical meaning.
Nominal Data
Nominal data is a type of qualitative data that labels variables without providing any numerical
value. Nominal data is also called the nominal scale; it cannot be ordered or measured, although
nominal values can sometimes appear numeric. Examples of nominal data are letters, symbols,
words, gender, etc.
Nominal data are examined using the grouping method: the data are grouped into categories, and
then the frequency or percentage of each category is calculated. These data are visually
represented using pie charts.
Ordinal Data
Ordinal data is a type of data that follows a natural order. The significant feature of ordinal
data is that the differences between the data values are not determined; only the order is
meaningful. This type of variable is mostly found in surveys, finance, economics, questionnaires,
and so on.
Ordinal data is commonly represented using a bar chart. These data are investigated and
interpreted through many visualisation tools. The information may be expressed using tables, in
which each row shows a distinct category.
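As a sketch of how a natural order can be encoded, the following Python snippet (the satisfaction categories are hypothetical survey answers) represents ordinal data with an explicit ordering, so order-based comparisons become meaningful:

```python
# Sketch: representing ordinal data with an explicit order using pandas.
# The survey categories below are hypothetical examples.
import pandas as pd

satisfaction = pd.Categorical(
    ["good", "poor", "excellent", "good", "fair"],
    categories=["poor", "fair", "good", "excellent"],  # natural order
    ordered=True,
)

s = pd.Series(satisfaction)
print(s.min(), s.max())                # order-based comparisons are meaningful
print(s.value_counts().sort_index())   # one row per distinct category, in order
```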
Discrete Data
Discrete data can take only distinct, separate values. A discrete variable contains only a finite
number of possible values, and those values cannot be subdivided meaningfully; here, things are
counted in whole numbers.
Continuous Data
Continuous data is data that can be measured. It has an infinite number of possible values that
can be selected within a given range.
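A small illustrative sketch of the contrast (all values made up): discrete data are counted in whole numbers, while continuous data are measured and can take any value within a range:

```python
# Sketch contrasting discrete (countable) and continuous (measurable) data.
# Values are illustrative.
items_sold_per_day = [12, 9, 15, 11]        # discrete: whole-number counts
temperatures_celsius = [21.4, 22.75, 20.9]  # continuous: any value in a range

# Discrete data can be totalled by counting; continuous data is summarised
# by measurement, e.g. an average.
print(sum(items_sold_per_day))                                # total count
print(sum(temperatures_celsius) / len(temperatures_celsius))  # mean measurement
```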
Data Processing
Data processing occurs when data is collected and translated into usable information. It is
usually performed by a data scientist or a team of data scientists, and it is important for data
processing to be done correctly so as not to negatively affect the end product, or data output.
Data in its raw form is not useful to any organization. Data processing is the method of collecting
raw data and translating it into usable information. It is usually performed in a step-by-step
process by a team of data scientists and data engineers in an organization. The raw data is
collected, filtered, sorted, processed, analyzed, stored, and then presented in a readable format.
Data processing is essential for organizations to create better business strategies and increase
their competitive edge. By converting the data into readable formats like graphs, charts, and
documents, employees throughout the organization can understand and use the data.
Now that we’ve established what we mean by data processing, let’s examine the data processing
cycle.
The data processing cycle consists of a series of steps where raw data (input) is fed into a system
to produce actionable insights (output). Each step is taken in a specific order, but the entire
process is repeated in a cyclic manner: the output of one data processing cycle can be stored and
fed as the input for the next cycle.
Generally, there are six main steps in the data processing cycle:
Step 1: Collection
The collection of raw data is the first step of the data processing cycle. The type of raw data
collected has a huge impact on the output produced. Hence, raw data should be gathered from
defined and accurate sources so that the subsequent findings are valid and usable. Raw data can
include monetary figures, website cookies, profit/loss statements of a company, user behavior,
etc.
Step 2: Preparation
Data preparation or data cleaning is the process of sorting and filtering the raw data to remove
unnecessary and inaccurate data. Raw data is checked for errors, duplication, miscalculations or
missing data, and transformed into a suitable form for further analysis and processing. This is
done to ensure that only the highest quality data is fed into the processing unit.
The purpose of this step is to remove bad data (redundant, incomplete, or incorrect data) and
begin assembling high-quality information that can be used in the best possible way
for business intelligence.
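As a rough illustration of this preparation step, the sketch below (assuming pandas; the employee records are invented) removes duplicates, drops rows missing a key field, and fills in a missing value:

```python
# Minimal data-preparation sketch with pandas: removing duplicates and
# handling missing values. The records below are invented examples.
import pandas as pd

raw = pd.DataFrame(
    {
        "employee": ["Ada", "Ada", "Bayo", None],
        "salary": [52000, 52000, None, 47000],
    }
)

clean = (
    raw.drop_duplicates()             # remove duplicated records
       .dropna(subset=["employee"])   # drop rows missing a key field
)
clean["salary"] = clean["salary"].fillna(clean["salary"].median())  # impute

print(clean)
```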
Step 3: Input
In this step, the raw data is converted into machine readable form and fed into the processing
unit. This can be in the form of data entry through a keyboard, scanner or any other input source.
Step 4: Data Processing
In this step, the raw data is subjected to various data processing methods, using machine learning
and artificial intelligence algorithms, to generate a desirable output. This step may vary slightly
from process to process depending on the source of the data being processed (data lakes, online
databases, connected devices, etc.) and the intended use of the output.
Step 5: Output
The data is finally transmitted and displayed to the user in a readable form like graphs, tables,
vector files, audio, video, documents, etc. This output can be stored and further processed in the
next data processing cycle.
Step 6: Storage
The last step of the data processing cycle is storage, where data and metadata are stored for
further use. This allows for quick access and retrieval of information whenever needed, and also
allows it to be used as input in the next data processing cycle directly.
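The whole cycle can be sketched as a toy Python pipeline; every function name here is a hypothetical stand-in for a real system component, not a prescribed implementation:

```python
# A toy end-to-end sketch of the six-step cycle: collect, prepare, input,
# process, output, store. Every function here is a hypothetical stand-in.
import json

def collect():                      # Step 1: gather raw data
    return ["5", "7", "bad", "3"]

def prepare(raw):                   # Step 2: filter out bad records
    return [r for r in raw if r.isdigit()]

def to_input(cleaned):              # Step 3: convert to machine-readable form
    return [int(r) for r in cleaned]

def process(values):                # Step 4: apply the processing logic
    return {"total": sum(values), "count": len(values)}

def output(result):                 # Step 5: present in readable form
    print(f"{result['count']} records, total = {result['total']}")

def store(result, path="result.json"):  # Step 6: persist for the next cycle
    with open(path, "w") as f:
        json.dump(result, f)

result = process(to_input(prepare(collect())))
output(result)
store(result)
```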
There are different types of data processing based on the source of data and the steps taken by
the processing unit to generate an output. There is no one-size-fits-all method that can be used
for processing raw data.
Type: Batch Processing
Uses: Data is collected and processed in batches; used for large amounts of data. E.g., a payroll system.
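A minimal sketch of the batch idea (payroll-style, with made-up records): instead of handling records one at a time, they are accumulated and processed in fixed-size batches:

```python
# Sketch of batch processing: records are accumulated and handled in
# fixed-size batches rather than one at a time. Data is made up.
records = [("emp%d" % i, 100 + i) for i in range(10)]  # (employee, hours)

BATCH_SIZE = 4
for start in range(0, len(records), BATCH_SIZE):
    batch = records[start:start + BATCH_SIZE]
    total_hours = sum(hours for _, hours in batch)
    print(f"processed batch of {len(batch)} records, {total_hours} hours")
```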
There are three main data processing methods - manual, mechanical and electronic.
Manual Data Processing
This data processing method is handled manually. The entire process of data collection, filtering,
sorting, calculation, and other logical operations are all done with human intervention and
without the use of any other electronic device or automation software. It is a low-cost method
and requires little to no tools, but produces high errors, high labor costs, and lots of time and
tedium.
Mechanical Data Processing
Data is processed mechanically through the use of devices and machines. These can include
simple devices such as calculators, typewriters, printing presses, etc. Simple data processing
operations can be achieved with this method. It produces far fewer errors than manual data
processing, but the growth of data has made this method more complex and difficult.
Electronic Data Processing
Data is processed with modern technologies using data processing software and programs. A set
of instructions is given to the software to process the data and yield output. This method is the
most expensive, but it provides the fastest processing speeds with the highest reliability and
accuracy of output.
Data processing occurs in our daily lives, whether we are aware of it or not. Here are some
real-life examples of data processing:
A stock trading application that converts millions of stock data points into a simple graph (see the sketch after this list)
An e-commerce company uses the search history of customers to recommend similar products
A digital marketing company uses demographic data of people to strategize location-specific
campaigns
A self-driving car uses real-time data from sensors to detect if there are pedestrians and other
cars on the road
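For the stock example, here is a sketch of the core reduction (the prices are invented): many raw price points are condensed into a simple moving-average trend that a charting library could then draw:

```python
# Sketch of the stock example above: reducing many raw price points to a
# simple moving-average trend that could be plotted. Prices are made up.
prices = [101.2, 102.8, 101.9, 103.4, 104.0, 103.1, 105.2, 104.8]

WINDOW = 3
moving_avg = [
    sum(prices[i - WINDOW + 1 : i + 1]) / WINDOW
    for i in range(WINDOW - 1, len(prices))
]
print(moving_avg)  # the smoothed series a charting library would draw
```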
Analytics, the process of finding, interpreting, and communicating meaningful patterns in data, is
the next logical step after data processing. Whereas data processing changes data from one form
to another, analytics takes those newly processed forms and makes sense of them.
But no matter which of these processes data scientists are using, the sheer volume of data and the
analysis of its processed forms require greater storage and access capabilities, which leads us to
the next section!
The future of data processing can best be summed up in one short phrase: cloud computing.
While the six steps of data processing remain immutable, cloud technology has provided
spectacular advances in data processing technology, giving data analysts and scientists the
fastest, most advanced, most cost-effective, and most efficient data processing methods today.
The cloud lets companies blend their platforms into one centralized system that’s easy to work
with and adapt. Cloud technology allows seamless integration of new upgrades and updates to
legacy systems while offering organizations immense scalability.
Cloud platforms are also affordable and serve as a great equalizer between large organizations
and smaller companies.
So, the same IT innovations that created big data and its associated challenges have also
provided the solution. The cloud can handle the huge workloads that are characteristic of big data
operations.
Data processing, in essence, is the collection and manipulation of collected data for a required
use. It is a technique normally performed by a computer; the process includes retrieving,
transforming, or classifying information.
However, the processing of data largely depends on the following −
The volume of data that needs to be processed
The complexity of the data processing operations
The capacity and inbuilt technology of the respective computer system
The available technical skills
Time constraints
Depending on these factors, several data processing techniques are used, including:
On-line processing
Time-sharing processing
Distributed processing
Distributed Processing
This is a specialized data processing technique in which various computers (which may be located
remotely) remain interconnected with a single host computer, forming a network of computers.
All these computer systems are interconnected through a high-speed communication network,
which facilitates communication between the computers. The central computer system
maintains the master database and monitors the network accordingly.
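True distributed processing spans multiple networked machines; as a single-machine analogy only, the sketch below uses Python worker processes to mimic a coordinator splitting work across interconnected workers and combining their results:

```python
# Single-machine sketch of the distributed idea: a coordinator splits work
# across interconnected workers and combines the results. Real distributed
# processing would run the workers on remote, networked computers.
from concurrent.futures import ProcessPoolExecutor

def process_chunk(chunk):
    return sum(chunk)  # stand-in for the real per-node workload

if __name__ == "__main__":
    data = list(range(1_000_000))
    chunks = [data[i::4] for i in range(4)]        # partition the data
    with ProcessPoolExecutor(max_workers=4) as pool:
        partials = list(pool.map(process_chunk, chunks))
    print(sum(partials))                           # coordinator combines results
```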
Business Intelligence (BI) Tools
In their out-of-the-box versions, BI tools offer a relatively small number of available data
source connections, and these connections are required for conducting analyses. BI tools also
have limited options for preparing data for further analyses. Therefore, it is common to use both
BI tools and ETL/ESB solutions.
Statistical Analysis Solutions
The downside of statistical analysis solutions is their high purchase and maintenance costs.
These costs are related to the fact that this kind of tool is often divided into different modules,
each of which generates additional expenses.
Programming Languages
Using different programming languages is still a common approach. One perk is the option to
create advanced machine-learning models. But programming techniques are not very flexible
compared to other methods, especially when there is a need to introduce changes related to,
e.g., dynamically transforming business conditions.
This method also has downsides unrelated to the data analysis itself. It requires qualified data
processing specialists who are skilled in programming languages and possess a vast knowledge of
business processes. This is needed to correctly interpret analysis results and create new
scenarios. Maintaining such a skilled team can be a big challenge.
SQL
SQL consoles that handle queries in the SQL programming language are useful for many
analytical scenarios and for achieving precise feedback.
However, queries will only bring satisfying results if the data are structured in the right way,
with the relations between them maintained. Growing databases and the need to manage data
source accesses may also make administration challenging.
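Here is a small sketch of such a relational query, using Python's built-in sqlite3 module (the schema and rows are illustrative): the join only yields meaningful results because the relation between the tables is maintained in the structure:

```python
# Sketch of an SQL query over related tables, using Python's built-in
# sqlite3. The schema and rows are illustrative.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER,
                         amount REAL,
                         FOREIGN KEY (customer_id) REFERENCES customers(id));
    INSERT INTO customers VALUES (1, 'Ada'), (2, 'Bayo');
    INSERT INTO orders VALUES (1, 1, 30.0), (2, 1, 45.5), (3, 2, 12.0);
""")

# Aggregate order totals per customer via the maintained relation.
for name, total in con.execute("""
    SELECT c.name, SUM(o.amount)
    FROM customers c JOIN orders o ON o.customer_id = c.id
    GROUP BY c.name
"""):
    print(name, total)
```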
Data Integration Platforms
The main task of data integration platforms (ETL/ESB solutions) is creating connections between
systems or databases, sending notifications, verifying data accuracy and completeness, and
transforming data while maintaining its crucial attributes and schemas. This maximizes the
usefulness of the data in future analyses.
A significant advantage of data integration platforms is their no-code/low-code model. They can
be used by business owners who are not qualified data processing specialists. The features of
these tools can be expanded with additional scripts in the Python or R programming languages.
After acquiring the necessary competencies, users can successfully expand their solution
environment, limiting so-called vendor lock-in, i.e., dependency on the software provider.
With integration platforms, you can process tabular, vector, and raster data, as well as databases
and data warehouses. Moreover, you can process data from network services such as WMS or
WFS, from different APIs, and from IoT sensors.
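As a minimal integration-style sketch (assuming pandas; the sensor fields are hypothetical stand-ins for real sources), two sources, a table and API-like records, are combined into one unified dataset:

```python
# Minimal integration-style sketch: combining tabular data with records from
# an API-like source into one unified dataset. Sources and fields are
# hypothetical stand-ins for what a platform would connect to.
import pandas as pd

tabular = pd.DataFrame({"sensor_id": [1, 2], "location": ["north", "south"]})
api_records = [  # e.g. parsed JSON from an IoT endpoint
    {"sensor_id": 1, "reading": 20.5},
    {"sensor_id": 2, "reading": 19.8},
]

unified = tabular.merge(pd.DataFrame(api_records), on="sensor_id")
print(unified)
```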
With integration platforms, you can also automate your designed processes. This saves you
time and money. Moreover, the skills of employees who work with data can be used in other
areas.
When deciding on ETL tools or an integration platform, you should analyze your data processing
goals to avoid unnecessary costs. These are complex solutions that offer nearly infinite
possibilities, which might go to waste if it turns out your organization only needs much
simpler tools.
Benefits of Data Processing
As mentioned before, collecting data without processing and analyzing it makes it useless.
Prepared in the right way, data can give you measurable business benefits.
Better business decisions. Cleaned data are easier to analyze, making it more straightforward
to notice patterns that could be overlooked in the original, unprocessed dataset. You can be
sure you are making the right decisions if you make them based on verified, organized data.
Limited operational costs. Correct data processing guarantees that your data are high-quality
and can be successfully used in business processes. After data processing, it may turn out that
some data need corrections; you can use this knowledge to avoid including them in your analyses,
since they would only bring incorrect results. This saves the time and effort you would have to
spend searching for errors and repeating analyses. Moreover, it helps you eliminate the risk of
making wrong decisions based on invalid analyses.
Improved data storage, distribution, and reporting. Data are more accessible when saved in
a format preferred by their users. Data saved in a unified format can still be used in many
systems and for different purposes, without needing to be transformed over and over again.