Chapter 17 Data Analysis and Visualisation
Data transformation is the changing of data from one format, structure or value into
another. The original source of the data may not be in the condition that is required for
processing or, for example, creating reports. Data transformation can be carried out by
specialist software or with scripting languages such as SQL. Data can be transformed
in several ways:
● Constructive data transformation adds, copies and replicates data, for example
data can be combined with other data, such as customer information from sales
databases with that from marketing databases.
● Destructive data transformation deletes fields or records, for example it can
simplify data for analysis such as anonymising information by removing names
and changing specific ages into age ranges to ensure that individuals are not
identified during data mining.
● Aesthetic data transformation changes data to a standard form, for example
using the same standard format for dates or names.
● Structural data transformation reorganises data by renaming or combining
entities in a database.
● Normalisation of data for use in relational databases is also a type of data
transformation.
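As an illustration of destructive and aesthetic transformation, the following Python sketch removes names, changes specific ages into age ranges and rewrites dates in one standard format. The records, field names and the two accepted date formats are invented for this example and do not come from any real data set.

```python
from datetime import datetime

# Hypothetical customer records; names and values are invented for illustration.
records = [
    {"name": "A. Khan", "age": 34, "joined": "03/02/2021"},
    {"name": "B. Silva", "age": 58, "joined": "2020-11-17"},
]

def age_range(age, width=10):
    """Replace a specific age with a range such as '30-39'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"

def standard_date(text):
    """Aesthetic transformation: rewrite a date as YYYY-MM-DD.

    Assumes dates arrive either as DD/MM/YYYY or already as YYYY-MM-DD.
    """
    for fmt in ("%d/%m/%Y", "%Y-%m-%d"):
        try:
            return datetime.strptime(text, fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    raise ValueError(f"Unrecognised date format: {text}")

def anonymise(record):
    """Destructive transformation: drop the name and generalise the age."""
    return {
        "age_range": age_range(record["age"]),
        "joined": standard_date(record["joined"]),
    }

transformed = [anonymise(r) for r in records]
```

After running this, each item in transformed holds only an age range and a standardised date, so the individuals can no longer be identified by name or exact age.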
Data cleansing involves auditing the data set using specialist software or commercial
database management software to check for and remove anomalies and contradictions.
Any anomalies that the software cannot reconcile are reported and have to be removed
manually. Data cleansing can be expensive because it can take up a large amount of
computing resources, and is time-consuming to check and carry out manually.
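A simple audit of this kind can be sketched in Python. In this example, exact duplicates are removed automatically, while rows that break a rule are reported for manual checking. The order rows and the two rules (a price must be present and a quantity must be greater than zero) are assumptions made for illustration only.

```python
# Hypothetical order rows; the validation rules below are assumptions.
rows = [
    {"order_id": 1, "quantity": 2, "unit_price": 9.50},
    {"order_id": 1, "quantity": 2, "unit_price": 9.50},   # exact duplicate
    {"order_id": 2, "quantity": -4, "unit_price": 3.00},  # impossible quantity
    {"order_id": 3, "quantity": 1, "unit_price": None},   # missing price
]

def cleanse(rows):
    """Return (clean_rows, reported_rows).

    Exact duplicates are removed automatically; rows that break a rule
    are reported for manual review rather than silently changed.
    """
    seen = set()
    clean, reported = [], []
    for row in rows:
        key = tuple(sorted(row.items()))
        if key in seen:
            continue  # drop the repeated copy
        seen.add(key)
        if row["unit_price"] is None or row["quantity"] <= 0:
            reported.append(row)
        else:
            clean.append(row)
    return clean, reported

clean_rows, reported_rows = cleanse(rows)
```

Here clean_rows keeps one copy of order 1, and reported_rows lists orders 2 and 3 for a person to investigate.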
Combining data from different sources is called data integration. Where the data
comes from is called a 'data source'. When collecting data, primary data sources are
the original data, for example from interviews, from observations, on data capture
forms or readings from sensors. Secondary data sources are those that have derived
data from the original, primary sources. In computing and IT, a data source usually
refers to the location where the required data is to be found, which could be a text file,
a spreadsheet, a database file, XML or other stored data. Data sources are used to
combine data when merging, searching or analysing spreadsheets or databases, or
when creating mail-merged documents.
Websites responding to user requests (for example when searching for information,
during online shopping or banking) combine data from different sources, such as
customer, stock and financial databases.
If there are several sources of data that are required to be linked, for example when
summarising or using data from several spreadsheets in another, the original data in the
source spreadsheets can be linked to the summary spreadsheet. In this case, when the
source data is amended, provided that the summary spreadsheet is still linked and can
still locate the original data and automatic calculation is enabled, the data in the linked
spreadsheet will update. The links can be broken so that the data is no longer
dependent on the source data, but then it will also no longer update automatically.
Links can be created using formulas or by hyperlinks. The link information contains
references to the spreadsheet name and to the cell where the source data is stored. In
the case of using a different, separate spreadsheet file for the source, the filename is
also included.
Spreadsheet applications such as Microsoft Excel and LibreOffice Calc include several
tools to import and combine data from different sources. Excel has numerous options for
importing from files such as other Excel workbooks (files), CSV and XML as shown
here.
Relational databases store data in different tables within the database. Many tables can
be linked so that data can be extracted. There are several ways to join tables in
Microsoft Access® or in LibreOffice Base®. Structured query language (SQL) can
also be used to select and extract data across tables in relational databases.
When working with data stored in database tables, comparing data to find out whether
or not the data in the fields match can be very useful. For example, for marketing and
further sales opportunities, database tables of customers and their purchases and of
salespeople can be compared to find out which customers have been sold which items
by which salespeople.
To compare two database tables to find matching data, queries can be used. A query is
created either by making a new join between fields, which must be of the same type, or by
using an existing join between the tables, to extract only the data that matches. Fields
that hold similar data but are of different types, for example a field that has a number
data type but is to be compared with a text data type, cannot be joined. However, a
query can be created to use a field in one table as the criterion against which a field, or
fields, in other tables can be compared. A select query is created to include the tables,
the fields to be displayed are included and the field to be used as the criterion is
selected. The type of comparison is chosen, for example 'equal to'. Complex queries
can be set up across several tables.
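The comparison of customers, purchases and salespeople described above can be sketched with SQL run from Python's built-in sqlite3 module. SQLite is used here only because it needs no installation; the table names, field names and rows are invented, and the join syntax in Microsoft Access or LibreOffice Base may differ slightly.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Customers   (CustomerID INTEGER PRIMARY KEY, Name TEXT);
    CREATE TABLE Salespeople (StaffID INTEGER PRIMARY KEY, Name TEXT);
    CREATE TABLE Purchases   (CustomerID INTEGER, StaffID INTEGER, Item TEXT);

    INSERT INTO Customers   VALUES (1, 'Ana'), (2, 'Ben'), (3, 'Chen');
    INSERT INTO Salespeople VALUES (10, 'Dee'), (11, 'Eli');
    INSERT INTO Purchases   VALUES (1, 10, 'SSD'), (2, 11, 'Monitor'), (1, 11, 'Mouse');
""")

# Join on fields of the same type so that only matching data is extracted.
matches = conn.execute("""
    SELECT c.Name, p.Item, s.Name
    FROM Purchases AS p
    INNER JOIN Customers   AS c ON c.CustomerID = p.CustomerID
    INNER JOIN Salespeople AS s ON s.StaffID    = p.StaffID
    ORDER BY c.Name, p.Item
""").fetchall()
```

Each row in matches names a customer, the item sold and the salesperson who sold it. The customer Chen does not appear because there is no matching purchase.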
Data consolidation takes data from different sources, cleanses it and combines it. This
enables the processes of transformation, analysis and reporting to be carried out more
quickly and easily. In businesses and organisations, data is consolidated in a data
warehouse. A data warehouse is a large database that takes and stores data from other
databases and upon which analysis is carried out. A specialised program, or script, is
used to extract and load (EL) data from the source and place it in the data warehouse.
This is done by running queries to extract the required data, creating tables to receive
the data and loading the data into the data warehouse. Specialised EL tools (ELT) are
also available commercially. These can be expensive, but for smaller businesses and
individuals, Microsoft Access and LibreOffice Base have the ability to consolidate
data.
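The extract and load steps can be sketched as a short Python script. This is a minimal illustration using in-memory SQLite databases, not a description of any commercial tool; the two sources, the AllSales table and all of the rows are hypothetical.

```python
import sqlite3

# Two hypothetical source databases and one warehouse, all held in memory.
shop = sqlite3.connect(":memory:")
web = sqlite3.connect(":memory:")
warehouse = sqlite3.connect(":memory:")

for source, rows in ((shop, [("SSD", 3), ("HDD", 1)]), (web, [("SSD", 5)])):
    source.execute("CREATE TABLE Sales (Item TEXT, Quantity INTEGER)")
    source.executemany("INSERT INTO Sales VALUES (?, ?)", rows)

# Create a table in the warehouse to receive the data.
warehouse.execute("CREATE TABLE AllSales (Source TEXT, Item TEXT, Quantity INTEGER)")

def extract_and_load(source, label):
    """Run a query on the source and load the rows into the warehouse."""
    extracted = source.execute("SELECT Item, Quantity FROM Sales").fetchall()
    warehouse.executemany(
        "INSERT INTO AllSales VALUES (?, ?, ?)",
        [(label, item, qty) for item, qty in extracted],
    )

extract_and_load(shop, "shop")
extract_and_load(web, "web")

# Analysis is then carried out on the consolidated data.
totals = warehouse.execute(
    "SELECT Item, SUM(Quantity) FROM AllSales GROUP BY Item ORDER BY Item"
).fetchall()
```

Once both sources are loaded, a single query on the warehouse gives the total quantity of each item across the shop and web sales.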
Microsoft Access can work with several different file formats and has a powerful
query and SQL system that can consolidate data in many ways according to the
requirements of users.
In Microsoft Access, the Split function divides the contents of a field into separate
strings:
Split(string, delimiter, limit, compare)
where string is the contents of the field; delimiter (or separator) marks where each
split should be made and is taken to be a space if nothing else is stated; limit is the
number of strings to be returned and is set to -1 by default, which means that all of
them are returned for use in other fields; and compare is an optional setting that
determines how the delimiter is compared with the string contents (for example,
whether upper and lower case are treated as the same).
In LibreOffice Basic, the equivalent function is:
Split(string, delimiter, number)
The first two parameters are the same as for Microsoft Access but number is how
many strings are to be returned.
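For comparison, the same idea can be tried out in Python, whose built-in string method split() works in a similar way. This is a different language from the Split functions described above, and one difference is worth noting: Python's maxsplit counts the number of splits made, not the number of strings returned. The field contents below are invented.

```python
full_name = "Maria Elena Santos"  # hypothetical field contents

# Split on the default separator (whitespace) and return every part.
all_parts = full_name.split()

# Limit the result to two strings: maxsplit counts the splits made,
# so maxsplit=1 gives at most two parts.
first, rest = full_name.split(" ", maxsplit=1)

# Use a different delimiter, for example a comma in "Surname,Forename".
surname, forename = "Santos,Maria".split(",")
```

The parts returned can then be stored in separate fields, such as a forename field and a surname field.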
Merging and combining data into required fields
While merging tables with fields of data is sometimes required, it is not usual to
permanently combine data from several fields into one field. It is usual database
practice to keep data at the lowest possible level of detail, or as atomic as possible,
which is why relational databases often have many tables with joined fields.
However, merging and combining fields at runtime; that is, only as and when required
and not altering the underlying data structure, can be used to give an overview of the
data, for example in reports. In Microsoft Access, two fields can be combined in a
query using the & operator, which concatenates the chosen fields. A space can be
placed between the fields by including it in the expression, for example
[FirstName] & " " & [LastName] (the field names here are illustrative). In both
Microsoft Access and LibreOffice, SQL scripts can be used to select and combine data
in different fields.
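Combining fields at runtime with SQL can be sketched as follows, using Python's built-in sqlite3 module. The Customers table and its rows are invented, and the || concatenation operator shown is the one SQLite uses; other database systems may use a different operator or function.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Customers (FirstName TEXT, LastName TEXT)")
conn.executemany(
    "INSERT INTO Customers VALUES (?, ?)",
    [("Ana", "Lopez"), ("Ben", "Osei")],
)

# || concatenates text in SQLite; the space is added explicitly.
full_names = [row[0] for row in conn.execute(
    "SELECT FirstName || ' ' || LastName AS FullName FROM Customers ORDER BY LastName"
)]

# The underlying table still stores the two fields separately.
columns = [col[1] for col in conn.execute("PRAGMA table_info(Customers)")]
```

The combined name exists only in the query result, so the data in the table stays atomic.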
Graphs and charts can be created from data. Graphs and charts present data visually,
which makes it easier to read and understand. Graphs and charts enable viewers to
visualise trends and relationships between data with minimal mathematical or
analytical skills. There should be a clear, meaningful title and the axes should be
clearly labelled. The chart or graph should be uncluttered and with no unnecessary
data or information, the units of measurement should be clearly visible and
appropriate, and the axes should not be distorted or disproportionate. Where
appropriate, the data source should be included.
Pivot table reports
Pivot tables are used in spreadsheets to summarise large amounts of data. They can
also be used to look for patterns and trends and can include links to external data
sources.
It is important to ensure that the data you wish to analyse with a pivot table is
correctly formatted and suitable, meaning it has been transformed and cleansed ready
for use. For example, each column in the spreadsheet should have a header (or title),
there should be no blank columns or rows and there should be no rows with totals.
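The checks above, and the kind of summary a pivot table produces, can be sketched in Python. The rows are invented and are not the values from Table 17.1; the function names is_ready and pivot_sum are hypothetical helpers written for this example.

```python
from collections import defaultdict

# Invented rows for illustration only.
header = ["ProductID", "Type", "Total Sales"]
rows = [
    ["P001", "HDD", 120.0],
    ["P002", "SSD", 300.0],
    ["P003", "HDD", 80.0],
    ["P004", "Monitor", 450.0],
]

def is_ready(header, rows):
    """Check every column has a header and no row is blank or short."""
    if any(not title for title in header):
        return False
    return all(
        len(r) == len(header) and any(c not in ("", None) for c in r)
        for r in rows
    )

def pivot_sum(header, rows, group_by, value):
    """Summarise like a simple pivot table: sum `value` for each `group_by`."""
    g, v = header.index(group_by), header.index(value)
    totals = defaultdict(float)
    for r in rows:
        totals[r[g]] += r[v]
    return dict(totals)

summary = pivot_sum(header, rows, "Type", "Total Sales") if is_ready(header, rows) else {}
```

The summary groups the rows by Type and adds up Total Sales for each group, which is the same kind of result a pivot table gives when a field is summed.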
Table 17.1 shows a simple set of data about the sales of computing items that can be
used to create a pivot table in Microsoft Excel.
The data is entered into a Microsoft Excel spreadsheet and prepared for use as shown
below.
When the table is prepared, a pivot table report can be created from the Insert, Pivot
table menu options. The pivot table report is placed in a new sheet in the spreadsheet
workbook, as shown below.
The data to be included in the pivot table report is chosen from the dialogue box that
appears when the table is created. The PivotTable Fields dialogue box can be opened
at any time by a right-click on any cell in the pivot table and choosing the Show field
list option, as shown here:
The headers can be altered in the pivot table from the default to customise the table as
required. There are options to search, sort and filter the table contents, as shown here:
Extracting and viewing data from the contents of the original sheet is made much
simpler by using the pivot table to report. For example, to find out and display the
total sales of items, open the PivotTable Fields dialogue box and drag the field
Sum of Total Sales to the filter box, as shown here:
By choosing the ProductIDs for the items in the drop-down list, the sales for these can
be displayed, as shown here:
In LibreOffice Calc, pivot tables are created from the Data, Pivot table menu options.
The layout of the data fields is carried out by dragging the column titles as required, as
shown here:
The same end result can be achieved in Calc as in Excel. The drop-down menus also
offer the same sorting, filtering and searching of data, as shown here:
Pivot tables allow the data to be explored to create many different reports so it is
worth spending time exploring their use with your own data.
If the data in the source sheet is altered, then the pivot table must be manually
refreshed, from the menu options, with the new data.
Pivot charts
Pivot charts are used to display the data in a pivot table. Pivot charts display the data
in much the same way as charts made from data in ordinary worksheets. A pivot chart
can be edited to change elements like the titles, legends or colour, but the data itself
cannot be edited in the chart. This is because the data depends on that in the sheet on
which the pivot table is based. In Microsoft Excel, pivot charts are created from the
Insert, Pivot chart menu options, as shown here:
Similarly, in LibreOffice Calc, select any cell in the pivot table, and from the Insert
menu choose to insert a Chart, as shown here:
The format and layout of the charts can be customised from the menus and dialogue
boxes. As for pivot tables, any changes to the original data source are not reflected in
the chart until it is manually refreshed. However, in the same way as for pivot tables,
the content of pivot charts can be filtered.
17.1.4 Dashboards
Instead of generating a new report from the data each time, different visualisations of
data can be presented in one view and examined together in a single visual
representation called a dashboard. To create a dashboard, data must first be imported,
or entered, into a spreadsheet application, for example Microsoft Excel, and if
necessary transformed so that its structure is suitable, such as no missing rows or
columns, each row representing a unique record and the data correctly formatted as a
spreadsheet table. The data can be manipulated, summarised and analysed using
formulas, filters, charts, pivot tables and pivot charts to extract the required
information for the user. The resulting views are grouped together to form the
dashboard. Dashboards usually look best if the gridlines are not showing. Excel's
toolbars and ribbons can be hidden too if necessary.
Dashboards should be kept simple and easy to understand and navigate. Shapes and
colours can be used but should not be crowded. Very bright, contrasting colours
should be avoided unless these are to draw attention to particular details. Users can
interact with the dashboard to manipulate the display using filters available in the
drop-down menus on the charts.
Grouping the shape with its label boxes enables easy repositioning as required. The
value in the shape will automatically update if the source data changes so the
dashboard will show the latest data. The shape shows the format of the value in the
source cell, for example currency shown as $ will show as $ in the shape. Shapes can
be linked to the data source so that, for example, a click takes the user to the original
data.
Using tables to display information from the data
A table on the dashboard can be used to display information using references to the
data either from the data source or from a pivot table made from the source data. As
for shapes, functions such as SUMIF() can be used to extract meaningful data but,
unlike shapes, formulas using functions can be placed directly in the tables. Also, the
=GETPIVOTDATA(data_field, pivot_table, [field1, item1, field2, item2], ...) function
(the data_field is the field in the table where the data is to come from, pivot_table is
the pivot table that holds the data to be retrieved and the field and items describe the
data to be retrieved) can be used to extract data directly from a pivot table.
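The logic behind these two functions can be illustrated in Python. This is only a rough analogy, not how Excel implements them: sumif adds up the values that meet a criterion, and get_pivot_data looks up an item in an already summarised table. Both function names and the sales rows are hypothetical.

```python
# Invented sales rows: (Type, Total Sales).
sales = [("HDD", 120.0), ("SSD", 300.0), ("HDD", 80.0)]

def sumif(rows, criterion):
    """Add up the values whose label equals the criterion, like SUMIF()."""
    return sum(value for label, value in rows if label == criterion)

# A pivot-style summary keyed by item.
pivot = {label: sumif(sales, label) for label, _ in sales}

def get_pivot_data(pivot, item):
    """Retrieve one summarised value, roughly the role of GETPIVOTDATA."""
    return pivot[item]

hdd_total = sumif(sales, "HDD")
ssd_total = get_pivot_data(pivot, "SSD")
```

The first approach works directly on the source data, while the second reads from the summary, just as a dashboard table can refer either to the data source or to a pivot table.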
A simple dashboard using shapes, a table and a pivot chart created from the data in
Table 17.1 is shown here:
The chart can be manipulated by using the filters that appear on the bottom left of
each of the pivot charts, as shown here.
When the filters are altered on the right-hand chart, the chart will change to reflect the
new filters. For example, selecting only HDD and SSD in the Type filter changes the
right-hand chart in the dashboard to this when OK is clicked.
The user can view and analyse the data using the filters from the dashboard without
seeing, or even knowing about, the data table and pivot charts. Care needs to be taken
when using a pivot table for more than one display on a dashboard because changes in
the filters on one display may cause unexpected, unwanted or unsightly changes to
other displays on the dashboard.
Data displayed in pivot tables and pivot charts does not automatically update when the
source data is changed so a dashboard using them will not do so either. The tables and
charts used for a dashboard can be manually refreshed from the 'PivotChart Analyze'
tab where there are options to refresh one or all of the charts. This tab only appears
when a pivot table or chart is selected. There is also an option to have the pivot tables
and charts updated from the source data whenever the file is opened.