Chapter 17 Data Analysis and Visualisation

Uploaded by

Salko Hamzi
Copyright
© © All Rights Reserved

17 Data analysis and visualisation

In this chapter you will learn:


● about transforming, cleansing and combining data
● about displaying the results of analysis.

Before starting this chapter, you should:


● be familiar with:
○ spreadsheets
○ relational databases
○ database management and file concepts
○ Chapters 8 (especially 8.4) and 10 (especially 10.1 and 10.4) of the AS
syllabus and in Cambridge International AS Level Information
Technology
● be able to analyse, interpret and display data to communicate information
visually.

17.1 Data analysis and visualisation

17.1.1 Transforming and cleaning data to extract meaningful information

Data transformation is the changing of data from one format, structure or value into
another. The original source of the data may not be in the condition that is required for
processing or, for example, creating reports. Data transformation can be carried out by
specialist software or with scripting languages such as SQL. Data can be transformed
in several ways:
● Constructive data transformation adds, copies and replicates data, for example
data can be combined with other data, such as customer information from sales
databases with that from marketing databases.
● Destructive data transformation deletes fields or records, for example it can
simplify data for analysis such as anonymising information by removing names
and changing specific ages into age ranges to ensure that individuals are not
identified during data mining.
● Aesthetic data transformation changes data to a standard form, for example
using the same standard format for dates or names.
● Structural data transformation reorganises data by renaming or combining
entities in a database.
● Normalisation of data for use in relational databases is also a type of data
transformation.
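
The destructive and aesthetic transformations listed above can be sketched in a few lines
of Python. This is only an illustration under stated assumptions: the records, the field
names (name, age, joined) and the ten-year age bands are invented for the example, and
slash-separated dates are assumed to be day/month/year.

```python
from datetime import datetime

# Hypothetical customer records; the field names are assumptions for illustration.
customers = [
    {"name": "Ana Petrov", "age": 34, "joined": "03/02/2021"},
    {"name": "Li Wei", "age": 58, "joined": "2020-11-17"},
]

def age_range(age, width=10):
    """Replace an exact age with a band such as '30-39'."""
    lower = (age // width) * width
    return f"{lower}-{lower + width - 1}"

def standard_date(text):
    """Aesthetic transformation: rewrite a date as YYYY-MM-DD.

    Only the two input formats used above are recognised; day/month/year is
    assumed for the slash-separated form.
    """
    for fmt in ("%d/%m/%Y", "%Y-%m-%d"):
        try:
            return datetime.strptime(text, fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    raise ValueError(f"Unrecognised date format: {text}")

def anonymise(record):
    """Destructive transformation: drop the name and generalise the age."""
    return {
        "age_range": age_range(record["age"]),
        "joined": standard_date(record["joined"]),
    }

transformed = [anonymise(r) for r in customers]
print(transformed)
```

The original list is left unchanged, so the transformation can be repeated or adjusted
without losing the source data.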

Cleaning (or cleansing) data is the process of removing incorrect, incomplete, irrelevant,
improperly formatted, replicated or corrupt data from a data set. If data is not
cleansed, the faulty data can slow down processing, cause inaccuracies in calculations
and produce inaccurate, unexpected or biased results.

At the end of data cleansing, the data should be:


● of high quality and valid, for example be what is needed for the analysis that
will use it
● accurate, for example be the true values, such as a person's name spelled
correctly every time
● consistent, for example be the same where the data is recorded twice, such as a
person's name or address
● complete, for example have all the necessary data
● uniform, such as using the same units of measurement.

Data cleansing involves auditing the data set using specialist software or commercial
database management software to check for and remove anomalies and contradictions.
Any anomalies that the software cannot reconcile are reported and have to be removed
manually. Data cleansing can be expensive because it can take up a large amount of
computing resources, and is time-consuming to check and carry out manually.
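
A minimal sketch of an automated cleansing pass is shown below. The records, field names
and the choice of centimetres as the uniform unit are assumptions made for the example;
rows that cannot be reconciled are collected separately so that they can be checked
manually, as described above.

```python
# Hypothetical raw readings; the field names and units are assumptions.
raw = [
    {"id": "1", "name": " alice ", "height": "170", "unit": "cm"},
    {"id": "1", "name": " alice ", "height": "170", "unit": "cm"},   # replicated
    {"id": "2", "name": "BOB", "height": "1.82", "unit": "m"},       # different unit
    {"id": "3", "name": "Carla", "height": "", "unit": "cm"},        # incomplete
    {"id": "4", "name": "Dev", "height": "abc", "unit": "cm"},       # corrupt
]

def cleanse(records):
    """Return (clean, rejected); rejected rows are left for manual checking."""
    clean, rejected, seen = [], [], set()
    for row in records:
        try:
            height = float(row["height"])
        except ValueError:
            rejected.append(row)          # incomplete or corrupt value
            continue
        if row["unit"] == "m":
            height *= 100                 # make units uniform (centimetres)
        tidy = {
            "id": int(row["id"]),
            "name": row["name"].strip().title(),   # consistent formatting
            "height_cm": round(height, 1),
        }
        key = tuple(tidy.values())
        if key in seen:                   # drop replicated records
            continue
        seen.add(key)
        clean.append(tidy)
    return clean, rejected

clean, rejected = cleanse(raw)
print(len(clean), "clean rows;", len(rejected), "rows need manual checking")
```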

17.1.2 Getting data from different sources

Combining data from different sources is called data integration. Where the data
comes from is called a 'data source'. When collecting data, primary data sources are
the original data, for example from interviews, from observations, on data capture
forms or readings from sensors. Secondary data sources are those that have derived
data from the original, primary sources. In computing and IT, a data source usually
refers to the location where the required data is to be found, which could be a text file,
a spreadsheet, a database file, XML or other stored data. Data sources are used to
combine data when merging, searching or analysing spreadsheets or databases, or
when creating mail-merged documents.
Websites responding to user requests (for example when searching for information,
during online shopping or banking) combine data from different sources, such as
customer, stock and financial databases.

Reports can be created in word-processing software using data that is dynamically
linked to spreadsheet or database files. Embedded charts or tables can be used for
illustration. Linking allows information in a report to update automatically if the
source file is updated because only the location of the source file appears in the
document. Embedding a section, or object, from a spreadsheet does not allow for
automatic updating because the data becomes part of the new document and is no
longer part of the source file.

Combining data in spreadsheets


The simplest way to combine the data from two or more spreadsheets is to copy and
paste the data from the original into the others. This method is useful only if the data
being copied is unlikely to change. If the data does change, the copied data will not
change automatically so it will have to be updated manually.

If there are several sources of data that are required to be linked, for example when
summarising or using data from several spreadsheets in another, the original data in the
source spreadsheets can be linked to the summary spreadsheet. In this case, when the
source data is amended, provided that the summary spreadsheet is still linked and can
still locate the original data and automatic calculation is enabled, the data in the linked
spreadsheet will update. The links can be broken so that the data is no longer
dependent on the source data, but then it will also no longer update automatically.
Links can be created using formulas or by hyperlinks. The link information contains
references to the spreadsheet name and to the cell where the source data is stored. In
the case of using a different, separate spreadsheet file for the source, the filename is
also included.
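
The difference between copied and linked data can be modelled in a short Python sketch.
The workbook here is just a dictionary, and the sheet names, cell references and the
recalculate() helper are hypothetical; real spreadsheet applications store and refresh
links in their own way.

```python
# Hypothetical workbook modelled as {sheet name: {cell reference: value}}.
workbook = {"Sales": {"B2": 1200}, "Summary": {}}

# Copy and paste: the value itself is stored, so it is a snapshot.
workbook["Summary"]["A1"] = workbook["Sales"]["B2"]

# Link: only the location of the source (sheet name and cell) is stored.
links = {("Summary", "A2"): ("Sales", "B2")}

def recalculate(book, links):
    """Refresh every linked cell from its source, like automatic calculation."""
    for (sheet, cell), (src_sheet, src_cell) in links.items():
        book[sheet][cell] = book[src_sheet][src_cell]

recalculate(workbook, links)
workbook["Sales"]["B2"] = 1500      # the source data is amended
recalculate(workbook, links)

print(workbook["Summary"])          # the copied cell keeps the old value
```

After the source changes, the linked cell follows it while the pasted copy has to be
updated by hand.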

In Microsoft Excel®, cells can be linked to cells in different workbooks (files) by
opening both workbooks, or all if there are more than two to be linked, and typing the
= (equals) sign in the cell where the data is to go, selecting the cell in the other
workbook from which the link is to come and then pressing Enter, as shown in the
image.
In Excel, the data in the linked workbook cell usually updates when the data in the
source workbook is changed. A similar method is used in LibreOffice® Calc® but the
data may not automatically update.

Spreadsheet applications also include tools to import and combine data from different
sources, and both Microsoft Excel and LibreOffice Calc offer several ways to do this.
Excel has numerous options for importing from files such as other Excel workbooks
(files), CSV and XML as shown here.

Calc uses the Open menu option as shown here:


When a file has been chosen for import, both spreadsheet applications provide options
for transforming the data into their own format before loading it into the cells. In
Microsoft Excel it looks like this:

Importing data from files in LibreOffice Calc looks like this:


Combining data in databases

Combining data from several different databases allows complex queries, or searches,
to be carried out across vast amounts of data. Businesses can query their databases
(for example customer, stock, order and financial) for the purposes of marketing
research or creating reports for stakeholders. Data mining involves combining,
querying and analysing data from large data sets from many databases.

Relational databases store data in different tables within the database. Many tables can
be linked so that data can be extracted. There are several ways to join tables in
Microsoft Access® or in LibreOffice Base®. Structured query language (SQL) can
also be used to select and extract data across tables in relational databases.
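
As a sketch of selecting data across linked tables with SQL, the example below uses
Python's built-in sqlite3 module rather than Microsoft Access or LibreOffice Base. The
table names, field names and rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY,
                         customer_id INTEGER REFERENCES customers(customer_id),
                         item TEXT);
    INSERT INTO customers VALUES (1, 'Ana'), (2, 'Ben');
    INSERT INTO orders VALUES (10, 1, 'Printer'), (11, 1, 'SSD'), (12, 2, 'HDD');
""")

# The join links the two tables on the field they share, customer_id.
rows = conn.execute("""
    SELECT customers.name, orders.item
    FROM customers
    INNER JOIN orders ON orders.customer_id = customers.customer_id
    ORDER BY orders.order_id
""").fetchall()
print(rows)
conn.close()
```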

Comparing and consolidating data from two data sources


In spreadsheets, formulas can be created using the functions provided by the
spreadsheet application to compare and consolidate data from different sources.
Comparisons between data (for example equal to, greater than, less than) can be
made in spreadsheets. Complex comparisons and decisions based on the results of the
comparison can be used in, for example, nested If and various types of Lookup
statements. In-built functions, such as Concatenate, can be combined into complex
formulas to bring together the contents, or various parts of the contents, of cells from
the same worksheet, different worksheets or different workbooks.
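
The three kinds of spreadsheet formula mentioned above (nested If, Lookup and
Concatenate) have simple equivalents in Python. The sketch below is an analogy only; the
price list, thresholds and function names are invented for the example.

```python
# Hypothetical price list and order values; names are assumptions.
price_list = {"P100": 89.99, "P200": 149.50}       # acts as the lookup table

def discount_band(quantity):
    """Equivalent of a nested IF: =IF(q>=100,"High",IF(q>=10,"Medium","None"))."""
    if quantity >= 100:
        return "High"
    elif quantity >= 10:
        return "Medium"
    return "None"

def lookup_price(product_id):
    """Equivalent of an exact-match lookup; None stands in for 'not found'."""
    return price_list.get(product_id)

def label(product_id, quantity):
    """Equivalent of Concatenate: join cell contents into one string."""
    return product_id + " x " + str(quantity)

print(discount_band(25), lookup_price("P200"), label("P200", 25))
```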

When working with data stored in database tables, comparing data to find out whether
or not the data in the fields match can be very useful. For example, for marketing and
further sales opportunities, database tables of customers and their purchases and of
salespeople can be compared to find out which customers have been sold which items
by which salespeople.

To compare two database tables to find matching data, queries can be used. A query is
created either to make a new join between fields, which must be the same type, or by
using an existing join between the tables, to extract only the data that matches. Fields
that hold similar data but are of different types, for example a field that has a number
data type but is to be compared with a text data type, cannot be joined. However, a
query can be created to use a field in one table as the criterion against which a field, or
fields, in other tables can be compared. A select query is created to include the tables,
the fields to be displayed are included and the field to be used as the criterion is
selected. The type of comparison is chosen, for example equal to. Complex queries
can be set up across several tables.
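
A select query that uses one field as the criterion for another can be sketched with
Python's built-in sqlite3 module. The tables, fields and rows are hypothetical, and the
explicit CAST is an assumption about how the number and text fields would be made
comparable; Microsoft Access and LibreOffice Base provide their own conversion functions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE purchases (customer_ref TEXT, item TEXT, salesperson TEXT);
    INSERT INTO customers VALUES (1, 'Ana'), (2, 'Ben'), (3, 'Caro');
    INSERT INTO purchases VALUES ('1', 'Printer', 'Sam'), ('2', 'SSD', 'Tia'),
                                 ('9', 'HDD', 'Sam');
""")

# Select query: the criterion compares the two fields after converting the
# text field to a number, and "equal to" is the type of comparison.
matches = conn.execute("""
    SELECT customers.name, purchases.item, purchases.salesperson
    FROM customers, purchases
    WHERE customers.customer_id = CAST(purchases.customer_ref AS INTEGER)
    ORDER BY customers.customer_id
""").fetchall()

# Customers with no matching purchase record.
unmatched = conn.execute("""
    SELECT name FROM customers
    WHERE customer_id NOT IN
          (SELECT CAST(customer_ref AS INTEGER) FROM purchases)
""").fetchall()
conn.close()

print(matches)
print(unmatched)
```
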

Data consolidation takes data from different sources, cleanses it and combines it. This
enables the processes of transformation, analysis and reporting to be carried out more
quickly and easily. In businesses and organisations, data is consolidated in a data
warehouse. A data warehouse is a large database that takes and stores data from other
databases and upon which analysis is carried out. A specialised program, or script, is
used to extract and load (EL) data from the source and place it in the data warehouse.
This is done by running queries to extract the required data, creating tables to receive
the data and loading the data into the data warehouse. Specialised extract, load and
transform (ELT) tools are also available commercially. These can be expensive, but for
smaller businesses and individuals, Microsoft Access and LibreOffice Base have the
ability to consolidate data.

Microsoft Access can work with several different file formats and has a powerful
query and SQL system that can consolidate data in many ways according to the
requirements of users.
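
The extract, cleanse and load steps of data consolidation can be sketched as a small
script. This is a simplified illustration, not a real warehouse loader: the two sources,
the sales table and the rule of writing product names in upper case are all assumptions,
and an in-memory SQLite database stands in for the data warehouse.

```python
import csv
import io
import sqlite3

# Two hypothetical sources: a CSV export and rows taken from another database.
shop_csv = "product,units\nPrinter,3\nSSD,5\n"
online_rows = [("printer ", 2), ("HDD", 4), ("SSD", None)]   # None = missing

def extract():
    """Extract: read every source into (product, units, source) tuples."""
    for row in csv.DictReader(io.StringIO(shop_csv)):
        yield row["product"], row["units"], "shop"
    for product, units in online_rows:
        yield product, units, "online"

def cleanse(rows):
    """Cleanse: skip incomplete rows and make names and types consistent."""
    for product, units, source in rows:
        if units is None or units == "":
            continue
        yield product.strip().upper(), int(units), source

# Load: create a table to receive the data and insert the cleansed rows.
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE sales (product TEXT, units INTEGER, source TEXT)")
warehouse.executemany("INSERT INTO sales VALUES (?, ?, ?)", cleanse(extract()))

totals = warehouse.execute(
    "SELECT product, SUM(units) FROM sales GROUP BY product ORDER BY product"
).fetchall()
print(totals)
```

Once the data is in one table, a single query can analyse sales from both sources
together.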

Splitting data into discrete fields


It may be necessary to split data in one field into several new fields. For example,
importing customer details from a poorly designed spreadsheet or database may have
all the address details in a single field. To enable searching, for example of just the
street or city, the data would need to be split. Ideally, this would be done before the
data is imported as part of the transformation process but, if not, database queries can
be set up to extract the data from the field and place it in new fields. Microsoft Access
provides the Visual Basic for Applications (VBA) function Split(), which can be used,
for example in a user-defined function called from a query, to extract data from a
string in a field for use in other fields or reports. The Split() function has the syntax:

Split(string,delimiter,limit, compare)

Where string is the contents of the field; delimiter (or separator) is the character at
which the string is split, taken to be a space if nothing else is stated; limit is the
number of substrings to return, set to -1 by default, which means that all of them are
returned for use in other fields; and compare is an optional setting that controls how
the delimiter is matched, for example whether the comparison is case-sensitive.

In LibreOffice Base, the syntax of Split() is:

Split(string,delimiter,number)

The first two parameters are the same as for Microsoft Access but number is how
many strings are to be returned.
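
Splitting a single address field into discrete fields can be tried out with Python's
built-in str.split() method, which is similar in spirit to the Split() functions
described above but is a different function with its own rules. The address and the
field names are invented for the example.

```python
# Hypothetical address stored in a single field, with commas as the delimiter.
address = "12 High Street,Leeds,LS1 4AB"

# Split on the comma delimiter; every part is returned.
street, city, postcode = address.split(",")

# Python's maxsplit counts splits, not returned strings, so maxsplit=1
# yields two parts: the street and the unsplit remainder.
first, rest = address.split(",", 1)

record = {"street": street, "city": city, "postcode": postcode}
print(record)
```
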

Merging and combining data into required fields

While merging tables with fields of data is sometimes required, it is not usual to
permanently combine data from several fields into one field. It is usual database
practice to keep data at the lowest possible level of detail, or as atomic as possible,
which is why relational databases often have many tables with joined fields.

However, merging and combining fields at runtime (that is, only as and when required,
without altering the underlying data structure) can be used to give an overview of the
data, for example in reports. In Microsoft Access, two fields can be combined in a
query using the & character, which concatenates the chosen fields; a space can be
included as a separator, for example [FirstName] & " " & [LastName] (hypothetical field
names). In both Microsoft Access and LibreOffice, SQL scripts can be used to select and
combine data in different fields.
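
Combining fields at runtime with SQL can be sketched using Python's built-in sqlite3
module. SQLite uses the || operator for concatenation where Microsoft Access uses &, and
the separating space is written explicitly; the staff table and its fields are
hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE staff (first_name TEXT, last_name TEXT);
    INSERT INTO staff VALUES ('Ana', 'Petrov'), ('Li', 'Wei');
""")

# The stored fields stay separate; full_name exists only in the query result.
full_names = [row[0] for row in conn.execute(
    "SELECT first_name || ' ' || last_name AS full_name "
    "FROM staff ORDER BY last_name"
)]

# The table structure is unchanged: it still has only the two original fields.
columns = [col[1] for col in conn.execute("PRAGMA table_info(staff)")]
conn.close()

print(full_names, columns)
```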

17.1.3 Displaying data to communicate information


Data from spreadsheets and databases is useful only if it can be understood. Raw data
is meaningless, but when data is processed, put into context and understood, it
becomes information. How data is displayed to a user can help with the
communication of the information.

Graphs and charts can be created from data. Graphs and charts present data visually,
which makes it easier to read and understand. Graphs and charts enable viewers to
visualise trends and relationships between data with minimal mathematical or
analytical skills. There should be a clear, meaningful title and the axes should be
clearly labelled. The chart or graph should be uncluttered and with no unnecessary
data or information, the units of measurement should be clearly visible and
appropriate, and the axes should not be distorted or disproportionate. Where
appropriate, the data source should be included.

Pivot table reports

Pivot tables are used in spreadsheets to summarise large amounts of data. They can
also be used to look for patterns and trends and can include links to external data
sources.

It is important to ensure that the data you wish to analyse with a pivot table is
correctly formatted and suitable, meaning it has been transformed and cleansed ready
for use. For example, each column in the spreadsheet should have a header (or title),
there should be no blank columns or rows and there should be no rows with totals.
Table 17.1 shows a simple set of data about the sales of computing items that can be
used to create a pivot table in Microsoft Excel.

The data is entered into a Microsoft Excel spreadsheet and prepared for use as shown
below.
When the table is prepared, a pivot table report can be created from the Insert, Pivot
table menu options. The pivot table report is placed in a new sheet in the spreadsheet
workbook, as shown below.

The data to be included in the pivot table report is chosen from the dialogue box that
appears when the table is created. The PivotTable Fields dialogue box can be opened
at any time by a right-click on any cell in the pivot table and choosing the Show field
list option, as shown here:
The headers can be altered in the pivot table from the default to customise the table as
required. There are options to search, sort and filter the table contents, as shown here:

Extracting and viewing data from the contents of the original sheet is made much
simpler by using the pivot table report. For example, to find out and display the
total sales of items, open the PivotTable Fields dialogue box and drag the field
Sum of Total Sales to the filter box, as shown here:
By choosing the ProductIDs for the items in the drop-down list, the sales for these can
be displayed, as shown here:

In LibreOffice Calc, pivot tables are created from the Data, Pivot table menu options.
The layout of the data fields is carried out by dragging the column titles as required, as
shown here:

The same end result can be achieved in Calc as in Excel. The drop-down menus also
offer the same sorting, filtering and searching of data, as shown here:
Pivot tables allow the data to be explored to create many different reports so it is
worth spending time exploring their use with your own data.

If the data in the source sheet is altered, then the pivot table must be manually
refreshed, from the menu options, with the new data.
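
What a pivot table does (grouping rows and totalling a value for each group) and why it
needs refreshing can be sketched in Python. The rows below are hypothetical stand-ins for
a sales table; the Region field in particular is an assumption added for the example.

```python
from collections import defaultdict

# Hypothetical sales rows; ProductID, Type and Total Sales echo the text,
# Region is an invented extra field.
sales = [
    {"ProductID": "P1", "Type": "Printer", "Region": "North", "Total Sales": 300},
    {"ProductID": "P2", "Type": "SSD", "Region": "North", "Total Sales": 120},
    {"ProductID": "P3", "Type": "HDD", "Region": "South", "Total Sales": 80},
    {"ProductID": "P1", "Type": "Printer", "Region": "South", "Total Sales": 450},
    {"ProductID": "P2", "Type": "SSD", "Region": "North", "Total Sales": 60},
]

def pivot(rows, row_field, col_field, value_field):
    """Sum value_field for every combination of row_field and col_field."""
    table = defaultdict(lambda: defaultdict(int))
    for r in rows:
        table[r[row_field]][r[col_field]] += r[value_field]
    return {key: dict(cols) for key, cols in table.items()}

summary = pivot(sales, "Type", "Region", "Total Sales")

# The summary is a snapshot: changing the source does not change it ...
sales.append({"ProductID": "P3", "Type": "HDD", "Region": "South", "Total Sales": 20})
before_refresh = summary["HDD"]["South"]

# ... until it is rebuilt, which corresponds to a manual refresh.
summary = pivot(sales, "Type", "Region", "Total Sales")
print(before_refresh, summary["HDD"]["South"])
```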

Pivot charts

Pivot charts are used to display the data in a pivot table. Pivot charts display the data
in much the same way as charts made from data in ordinary worksheets. A pivot chart
can be edited to change elements like the titles, legends or colour, but the data itself
cannot be edited in the chart. This is because the data depends on that in the sheet on
which the pivot table is based. In Microsoft Excel, pivot charts are created from the
Insert, Pivot chart menu options, as shown here:
Similarly, in LibreOffice Calc, select any cell in the pivot table, and from the Insert
menu choose to insert a Chart, as shown here:

The format and layout of the charts can be customised from the menus and dialogue
boxes. As for pivot tables, any changes to the original data source are not reflected in
the chart until it is manually refreshed. However, in the same way as for pivot tables,
the content of pivot charts can be filtered.

17.1.4 Dashboards

Instead of generating a new report from the data each time, different visualisations of
data can be presented in one view and examined together in a single visual
representation called a dashboard. To create a dashboard, data must first be imported,
or entered, into a spreadsheet application, for example Microsoft Excel, and if
necessary transformed so that its structure is suitable, such as no missing rows or
columns, each row representing a unique record and the data correctly formatted as a
spreadsheet table. The data can be manipulated, summarised and analysed using
formulas, filters, charts, pivot tables and pivot charts to extract the required
information for the user. The resulting views are grouped together to form the
dashboard. Dashboards usually look best if the gridlines are not showing. Excel's
toolbars and ribbons can be hidden too if necessary.
Dashboards should be kept simple and easy to understand and navigate. Shapes and
colours can be used but should not be crowded. Very bright, contrasting colours
should be avoided unless these are to draw attention to particular details. Users can
interact with the dashboard to manipulate the display using filters available in the
dropdown menus on the charts.

Creating an interactive dashboard in Microsoft Excel


A dashboard showing summaries and analyses of the data in Table 17.1 (in Section
17.1.3 'Displaying data to communicate information') can contain shapes, tables and
pivot charts. This is done by creating a new worksheet and giving it an appropriate
name, for example Dashboard. It is often preferable to move this sheet to be the first
tab in the list at the bottom of the workbook.

Using shapes to display information from the data


A shape, inserted from the Insert tab on the ribbon (Illustrations, Shapes), can
be used for a simple display of a value. Formulas that perform calculations cannot
be placed in a shape, so put the formula in a cell, preferably on a different sheet,
and make the shape refer to that cell. Insert a shape, select it, type the equals sign
(=) in the formula bar to start a formula, go to the other sheet, click on the cell with
the calculation and press Enter. The value of the calculation will appear in the shape.
For example, a
formula to calculate the total sales of printers from the data table could use
=SUMIF() in cell K3 on Sheet1, so the formula in the shape would be
=Sheet1!K3. The shape can be formatted and have labels, in text boxes, as required:

Grouping the shape with its label boxes enables easy repositioning as required. The
value in the shape will automatically update if the source data changes so the
dashboard will show the latest data. The shape shows the format of the value in the
source cell, for example currency shown as $ will show as $ in the shape. Shapes can
be linked to the data source so that, for example, a click takes the user to the original
data.

Using tables to display information from the data

A table on the dashboard can be used to display information using references to the
data either from the data source or from a pivot table made from the source data. As
for shapes, functions such as SUMIF() can be used to extract meaningful data but,
unlike shapes, formulas using functions can be placed directly in the tables. Also, the
=GETPIVOTDATA(data_field, pivot_table, [field1, item1], [field2, item2], ...) function
can be used to extract data directly from a pivot table: data_field is the name of the
pivot table field that holds the data to be retrieved, pivot_table is a reference to a
cell in the pivot table, and each field and item pair describes the data to be retrieved.
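
The logic behind SUMIF() and GETPIVOTDATA() can be illustrated in Python. The functions
below are analogies written for this example, not the Excel functions themselves, and the
rows are hypothetical.

```python
# Hypothetical data rows; field names echo those mentioned in the text.
rows = [
    {"Type": "Printer", "Total Sales": 300},
    {"Type": "SSD", "Total Sales": 120},
    {"Type": "Printer", "Total Sales": 450},
]

def sumif(rows, criteria_field, criterion, sum_field):
    """Add up sum_field for the rows whose criteria_field equals criterion."""
    return sum(r[sum_field] for r in rows if r[criteria_field] == criterion)

# A pivot-style summary keyed by Type, built once from the source rows.
pivot = {}
for r in rows:
    pivot[r["Type"]] = pivot.get(r["Type"], 0) + r["Total Sales"]

def get_pivot_data(pivot, item):
    """Read an already summarised value, in the spirit of GETPIVOTDATA."""
    return pivot[item]

printer_total = sumif(rows, "Type", "Printer", "Total Sales")
print(printer_total, get_pivot_data(pivot, "Printer"))
```

The first approach recalculates from the source rows each time, while the second reads
from a summary that has to be rebuilt when the source changes.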

Using pivot tables and charts to extract data for display


The pivot charts from the other sheets are copied and pasted onto the dashboard sheet
and arranged as required. Suitable legends and titles can be added and formatted using
the tools in Excel. The border and colours can be changed using the 'format shape'
options.

A simple dashboard using shapes, a table and a pivot chart created from the data in
Table 17.1 is shown here:

The chart can be manipulated by using the filters that appear on the bottom left of
each of the pivot charts, as shown here.
When the filters are altered on the right-hand chart, the chart will change to reflect the
new filters. For example, selecting only HDD and SSD in the Type filter changes the
right-hand chart in the dashboard to this when OK is clicked.

The user can view and analyse the data using the filters from the dashboard without
seeing, or even knowing about, the data table and pivot charts. Care needs to be taken
when using a pivot table for more than one display on a dashboard because changes in
the filters on one display may cause unexpected, unwanted or unsightly changes to
other displays on the dashboard.

Data displayed in pivot tables and pivot charts does not automatically update when the
source data is changed so a dashboard using them will not do so either. The tables and
charts used for a dashboard can be manually refreshed from the 'PivotChart Analyze'
tab where there are options to refresh one or all of the charts. This tab only appears
when a pivot table or chart is selected. There is also an option to have the pivot tables
and charts updated from the source data whenever the file is opened.
