BA-Unit 2
2
Data Preparation,
Summarisation and
Visualisation Using
Spreadsheet
Ms. Asha Yadav
Assistant Professor
Department of Computer Science
School of Open Learning
University of Delhi
STRUCTURE
2.1 Learning Objectives
2.2 Data Preparation
2.3 Data Cleaning
2.4 Data Summarization
2.5 Data Sorting
2.6 Filtering Data
2.7 Conditional Formatting
2.8 Text to Column
2.9 Find and Remove Duplicates
2.10 Removing Duplicate Values
2.11 Data Validation
2.12 Identifying Outliers in Data
2.13 Covariance
2.14 Correlation Matrix
2.15 Moving Average
2.16 Finding Missing Values
2.17 Data Summarization
2.18 Data Visualization
PAGE 19
Department of Distance & Continuing Education, Campus of Open Learning,
School of Open Learning, University of Delhi
Data Transformation: This refers to the conversion of data into a format
or structure fit for analysis. It typically involves normalizing or
scaling numeric values, encoding categorical variables, and aggregating
or disaggregating data.
Data Integration: This is the combining of data from different sources
into one dataset. It may involve merging tables, joining datasets, or
resolving conflicts between the sources.
Data Reduction: This is a process for reducing either the size or the
complexity of the dataset, and it involves feature selection, dimensionality
reduction, and sampling, among others.
Data Formatting: This ensures consistency in format, for example through
standardized date formats and variable naming conventions.
Data Splitting: This is the division of data into subsets, usually
training, validation, and test sets. These sets let a model builder train
models on the data, tune their hyperparameters, and finally estimate
their performance.
Good data preparation is essential for generating valid and accurate
insights; if the data quality is low, meaningful conclusions cannot be
drawn.
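As an illustration of the data splitting step, the following is a minimal Python sketch. The function name split_dataset, the 70/15/15 proportions, and the fixed seed are assumptions chosen for the example; they are not prescribed by this unit.

```python
import random


def split_dataset(rows, train_frac=0.7, val_frac=0.15, seed=42):
    """Shuffle rows and split them into training, validation and test lists.

    The fractions and the seed are illustrative assumptions.
    Whatever is left after the training and validation parts becomes the test set.
    """
    shuffled = list(rows)                  # copy so the caller's data is untouched
    random.Random(seed).shuffle(shuffled)  # reproducible shuffle
    n_train = round(len(shuffled) * train_frac)
    n_val = round(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test


train, val, test = split_dataset(range(100))
print(len(train), len(val), len(test))
```

Shuffling before splitting avoids placing, say, only the oldest records in the training set; for time-ordered data a chronological split may be more appropriate.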
ensure that the data is logically consistent, such as ensuring all
transactions have corresponding dates.
Removing Irrelevant Data: This can be done by filtering, that is, by
removing data that is not relevant to the analysis or that does not
contribute useful information. This can include unnecessary columns,
outdated records, or noise in the data.
Formatting and Structuring Data: This is done by ensuring that data
is in the correct format, such as consistent date formats or proper text
casing. It also involves restructuring the data to meet the needs of the
analysis, such as pivoting tables or separating combined fields into
distinct columns.
IN-TEXT QUESTIONS
1. What is the primary goal of data cleaning in a spreadsheet?
(a) To improve the appearance of the spreadsheet
(b) To remove inconsistencies and errors in the data
(c) To format data for printing
(d) To reduce the size of the spreadsheet
2. In data cleaning, what does “imputation” refer to?
(a) Removing unnecessary columns
(b) Filling in missing data with estimated values
(c) Filtering out irrelevant data
(d) Detecting outliers
When filtering data on numeric values, you can also select a comparison,
such as Between, to see only the values that lie in a given range.
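The effect of a Between filter can be sketched in Python as follows. The function name filter_between is hypothetical, and the sketch assumes that both ends of the range are included.

```python
def filter_between(values, low, high):
    """Keep only the values that lie in the range low..high.

    Both bounds are treated as inclusive (an assumption of this sketch).
    """
    return [v for v in values if low <= v <= high]


marks = [12, 45, 67, 89, 30, 50]
print(filter_between(marks, 30, 67))
```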
2.7 Conditional Formatting
Conditional Formatting allows users to fill cells with a particular color
depending on a condition. This enhances data visualization and
interpretation, and helps in identifying patterns in data. Let us see how
conditional formatting can be done in MS Excel.
Example: Highlight cells that have a value greater than 350.
Step 1: Select the range of cells to which conditional formatting has to
be applied.
Step 2: On the Home tab, in the Styles group, click Conditional
Formatting.
Step 3: Click Highlight Cells Rules > Greater Than...
Step 4: Enter the desired value and select the formatting style.
Step 5: Click OK.
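The logic behind the Greater Than rule in the example can be sketched in Python. This is only an illustration of the condition being tested, not of how Excel applies the formatting; the function name cells_to_highlight and the sample values are assumptions.

```python
def cells_to_highlight(values, threshold=350):
    """Return True for each value that a 'Greater Than threshold' rule would highlight."""
    return [v > threshold for v in values]


sales = [100, 350, 351, 500]
for value, flag in zip(sales, cells_to_highlight(sales)):
    # A cell equal to the threshold is not highlighted, because the rule is strictly "greater than".
    print(value, "highlight" if flag else "no formatting")
```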
2.8 Text to Column
The Text to Columns feature is used to separate the data in a single
column into multiple columns. This enhances readability of the data. For
example, if a column contains first name, last name and profession in a
single cell, this information can be separated into different columns so
that each column holds atomic values. Note that this separation is
possible only if the values in a cell are separated by the same
delimiter. The delimiter can be a comma, semicolon, space, or another
character. Let us see how we can split data in MS Excel.
Step 1: Select the cell or column that contains the text to be split.
Step 2: Select Data > Text to Columns.
Step 3: In the Convert Text to Columns Wizard displayed on the screen,
select Delimited > Next.
Step 4: Select the Delimiters for your data.
Step 5: Select Next.
Step 6: Preview the split and select Finish.
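The same idea of splitting on a delimiter can be sketched in Python. The function name text_to_columns and the sample names are hypothetical, and the sketch assumes every cell uses the same delimiter.

```python
def text_to_columns(cells, delimiter=","):
    """Split each cell's text on the delimiter, giving one list of column values per row."""
    return [cell.split(delimiter) for cell in cells]


column = ["Ravi,Kumar,Engineer", "Meena,Singh,Doctor"]
for row in text_to_columns(column):
    print(row)
```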
IN-TEXT QUESTIONS
3. What does Conditional Formatting allow you to do in a spreadsheet?
(a) Apply formulas automatically
(b) Highlight cells based on certain criteria
(c) Change data values based on formatting
(d) Sort data based on custom rules
4. To highlight only duplicate values in a range of data using
Conditional Formatting, which rule would you apply?
(a) Text that contains
(b) Top/Bottom Rules
(c) Highlight Cell Rules > Duplicate Values
(d) New Rule > Use a Formula
of the data. Data validation helps users control the input to ensure
accuracy and consistency.
While validating data, specific criteria for accepting data in cell(s) are
set. This restricts users from entering invalid data. Thus, validating data
not only enhances the accuracy, reliability and integrity of data but also
cuts the time spent manually checking and correcting data entries. In
Excel, this can be done using the steps given below:
Step 1: Select the cells for data validation.
Step 2: On the Data tab, click Data Validation to open the Data
Validation dialog box.
Step 3: In the Data Validation dialog box, under the Settings tab, define
the validation criteria:
Allow: Select the type of data. This data can be Whole Number, Decimal,
List (only values from a predefined list are allowed), Date, Time, Text Length
(only text of a certain length is allowed). The last option is Custom which
is used for more complex criteria and can be specified using a formula.
Data: Specify the condition (e.g., between, not between, equal to, not
equal to, etc.).
Minimum/Maximum: Enter the acceptable range or limits based on the
above selection. For example, to allow values between 100 and 1000,
select “Whole Number,” “between,” and then set the minimum to 100
and the maximum to 1000.
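The rule in the example above (Whole Number, between, 100 and 1000) can be expressed as a small Python check. This is a sketch of the criteria only; the function name is_valid_whole_number is hypothetical, and the bounds are assumed to be inclusive.

```python
def is_valid_whole_number(value, minimum=100, maximum=1000):
    """Return True if value is a whole number within minimum..maximum (inclusive)."""
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(value, bool) or not isinstance(value, int):
        return False
    return minimum <= value <= maximum


for entry in [100, 999, 1001, 250.5, "abc"]:
    print(entry, "accepted" if is_valid_whole_number(entry) else "rejected")
```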
Optionally, you can configure an Input Message that will appear when
the cell is selected. For this, click the Input Message tab in the dialog
box. Give a brief title for the input message box and enter the guidance
text that will appear when someone selects the cell. The guidance text
tells the user what type of data to enter.
Another optional feature in MS Excel is the ability to customize the Error
Alert. To do this, under the Error Alert tab, specify what should happen
if the user enters invalid data:
Show Error Alert after Invalid Data is entered: Check this to enable
error alerts.
Style: Choose from Stop, Warning, or Information to indicate the severity
of the alert.
Analyze Data Values: After sorting the values, look for large
discrepancies that may be outliers. Such values can be deleted straight
away, but a better option is to remove only statistical anomalies.
Identify Data Quartiles: To find the outliers in the data, calculate
quartiles using Excel’s QUARTILE function by typing =QUARTILE( in an
empty cell. After the left parenthesis, specify the first and last cells
of your data range separated by a colon, followed by a comma and the
quartile you want. For example, the formula =QUARTILE(A5:A50, 1) returns
the first quartile of the values in cells A5 to A50 (the 25th percentile,
or the value below which 25% of data points fall when arranged in
increasing order), and =QUARTILE(B2:B200, 3) returns the third quartile
of the values in cells B2 to B200.
Define the Interquartile Range (IQR): The IQR represents the spread of
the middle 50% of the data, so it is not affected by outlier values. It
is calculated by subtracting the first quartile from the third quartile.
Calculate the Upper and Lower Bounds: Defining the upper and lower
bounds of the data allows identification of values that are higher than
expected (above the upper bound) or lower than expected (below the lower
bound).
Calculate the upper bound of data by multiplying IQR by 1.5 and adding it
to the third quartile. The formula can be given as, “= Q3 + (1.5 * IQR).”
Similarly, to find the lower bound of data, multiply the IQR by 1.5 and
subtract it from the first quartile value. The formula can be given
as, “= Q1 - (1.5 * IQR).”
Remove the Outliers: After defining the upper and lower bounds of data,
review the data to identify values that are higher than the upper bound
or lower than the lower bound. These values are statistical outliers. So,
delete them for more accurate analysis or visualization reports.
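The quartile, IQR and bounds steps described above can be sketched together in Python. The sketch uses the standard library function statistics.quantiles with method="inclusive", which is assumed here to correspond to the interpolation used by Excel's QUARTILE; the function name iqr_outliers and the sample data are hypothetical.

```python
import statistics


def iqr_outliers(values):
    """Return (lower_bound, upper_bound, outliers) using the 1.5 * IQR rule."""
    # quantiles(..., n=4) returns the three cut points Q1, Q2 (median) and Q3.
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower = q1 - 1.5 * iqr
    upper = q3 + 1.5 * iqr
    outliers = [v for v in values if v < lower or v > upper]
    return lower, upper, outliers


data = [10, 12, 13, 14, 15, 16, 18, 100]
lower, upper, outliers = iqr_outliers(data)
print("Bounds:", lower, upper)
print("Outliers:", outliers)
```

Returning the bounds along with the outliers makes it possible to review borderline values before deciding whether to delete them.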
2.13 Covariance
Covariance is a statistical measure of the joint variability of two
random variables, given two sets of data. To calculate the population
covariance in Excel, use the COVARIANCE.P function. The syntax is
=COVARIANCE.P(array1, array2), where
Array1 is a range or array of numeric values.
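For reference, population covariance (the average product of the deviations of each pair of values from their respective means) can be computed by hand as in the following Python sketch. The function name population_covariance and the sample arrays are assumptions made for illustration.

```python
def population_covariance(xs, ys):
    """Population covariance: mean of (x - mean_x) * (y - mean_y) over all pairs."""
    if len(xs) != len(ys) or not xs:
        raise ValueError("Both arrays must be non-empty and of equal length")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n


print(population_covariance([1, 2, 3, 4], [2, 4, 6, 8]))
```

A positive result indicates that the two variables tend to increase together, while a negative result indicates that one tends to decrease as the other increases.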
business, moving average of sales for the last 3 months is calculated to
understand the market trends. To forecast weather, the moving average
of three-month temperatures is calculated.
We can compute different types of moving average: simple (or arithmetic),
exponential, variable, triangular, and weighted. In this section, let us
see how to calculate a simple moving average. In Excel, a simple moving
average can be calculated by using formulas or the trendline option.
A simple moving average can be calculated using the AVERAGE function.
Given a list of average monthly temperatures in column B, the moving
average for the first 3 months can be calculated as =AVERAGE(B2:B4) or
=SUM(B2:B4)/3. To find subsequent averages, copy the formula down to the
following rows.
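Copying the AVERAGE formula down the column amounts to averaging a sliding window of values. A minimal Python sketch of this, with a hypothetical function name and made-up temperature values, is given below.

```python
def simple_moving_average(values, window=3):
    """Return the average of every run of `window` consecutive values."""
    if window <= 0:
        raise ValueError("window must be a positive integer")
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]


temperatures = [10, 20, 30, 40, 50]  # illustrative monthly averages
print(simple_moving_average(temperatures))
```

A list of n values yields n - window + 1 averages, which is why the first two months have no 3-month moving average.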
IN-TEXT QUESTIONS
5. What is the main purpose of data validation in spreadsheets?
(a) To perform mathematical calculations on data
(b) To ensure that data entered meets specific criteria
(c) To visualize data using charts
(d) To automatically sort data
6. What is an outlier in a dataset?
(a) A value that is similar to other values
(b) A value that falls within the Interquartile Range (IQR)
(c) A value significantly different from other values in the
dataset
(d) A missing or blank value
7. Which statistical method can be used to detect outliers using
quartiles?
(a) Standard deviation
(b) Z-score
(c) Interquartile Range (IQR)
(d) Median
2.16 Finding Missing Values
Excel does not have a dedicated function to list missing values.
Nevertheless, finding missing values is important for the following reasons:
Data Integrity: It ensures that the dataset is complete.
Data Reconciliation: It facilitates the reconciliation process (mostly
used in finance).
Quality Assurance: It helps to identify anomalies or data entry errors.
Efficient Analysis: It supports accurate data analysis by spotting and
addressing gaps.
Listing Missing Values in Excel
To identify and list missing values in Excel, you can use the following
functions:
IF, ISNUMBER and MATCH Functions:
IF: Returns one value if a condition is true and another if it’s false.
ISNUMBER: Checks if a value is a number.
MATCH: Searches for a value in a range and returns its relative
position.
Example: If column A should contain every value from 1 to 100, then the
values missing from this data can be identified by using the formula
=IF(ISNUMBER(MATCH(ROW(A1), A:A, 0)), "", ROW(A1))
Note that the syntax of the MATCH function is
MATCH(lookup_value, lookup_array, [match_type])
where
lookup_value is the value to be matched in the lookup_array.
lookup_array is the range of cells being searched.
match_type is optional. It can have the value -1, 0, or 1. The default
value is 1. The argument specifies how Excel matches lookup_value with
values in lookup_array.
Now, drag the formula from B1 down to B100. Column B will then display
the missing values in the list.
The Filter feature can also be applied to column B to display only the
missing numbers by excluding blank cells.
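The check performed by the formula above (for each number from 1 to 100, is it present in the list?) can be sketched in Python as follows. The function name find_missing and the sample data are hypothetical.

```python
def find_missing(values, start=1, end=100):
    """Return the whole numbers in start..end that do not appear in values."""
    present = set(values)  # set membership plays the role of MATCH with match_type 0
    return [n for n in range(start, end + 1) if n not in present]


# Illustrative data: the numbers 1 to 100 with 7, 42 and 99 left out.
column_a = [n for n in range(1, 101) if n not in (7, 42, 99)]
print(find_missing(column_a))
```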
Microsoft Excel provides different types of charts to visualize data in the
spreadsheet. To draw a chart, you need to follow the steps given below:
Step 1: Organize the data in rows and columns within the Excel sheet.
Every row and column should be labelled clearly to identify the data to
be visualized.
Step 2: Select the data by clicking and dragging the mouse to highlight
the data to be visualized. In this selection, include the row and column
headers (as shown in the figure).
Step 3: Choose a chart type by clicking on the “Insert” tab. In the “Charts”
section, select the required chart option (Column, Line, Pie, Bar, Area,
Scatter, etc.) by clicking on the dropdown arrow below the chart type.
Step 4: Insert the chart. Once the desired chart is selected, it is
automatically created and inserted in the worksheet. Now, it can be clicked
and dragged to change its position or resized by using the sizing handles
at the corners.
Step 5: Customize the chart. For this, click on the chart to select it. You
will now see two additional tabs: “Design” and “Format”. Use these tabs to
customize the chart’s appearance, style and layout. Important information
like chart title, axis labels, legend, data labels, etc. can be added to
enhance visualization and data interpretation.
Step 6: Edit the data (optional). In case you wish to make changes to
the data, simply edit it in the worksheet. Excel will automatically update
the chart to reflect the changes.
Line Chart: The line chart plots data points and then connects these
points by lines. These lines show trends or change in values over time.
Line charts are widely used for continuous data like stock prices or
temperature measurements.
Pie Chart: A pie chart plots data as slices of a circle. The size of each
slice is proportional to the value it represents. That is, it represents
the proportion of each category within a whole.
If you want 3 pivot charts on the interactive dashboard then you must
have 3 pivot tables. So, you can simply duplicate the pivot table sheet
in the Excel workbook.
Step 3: Create Charts using the Pivot Table. For example:
The first chart would represent every product’s monthly sales. For this
chart, we need three fields: Sales, Product, and Month. In the Pivot
table sheet, drag and drop Month into the Rows area, Product into
the Columns area, and Sales into the Values area.
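The summary produced by this layout (months as rows, products as columns, sales totalled in each cell) can be sketched in Python. The function name pivot_sales and the sample records are hypothetical, and the sketch assumes that the values are aggregated by summing.

```python
from collections import defaultdict


def pivot_sales(records):
    """Total the sales for each (month, product) pair.

    records is an iterable of (month, product, sales) tuples.
    Returns a nested dict: {month: {product: total_sales}}.
    """
    table = defaultdict(lambda: defaultdict(float))
    for month, product, sales in records:
        table[month][product] += sales
    # Convert to plain dicts so the result prints and compares cleanly.
    return {month: dict(products) for month, products in table.items()}


records = [
    ("Jan", "Pen", 100),
    ("Jan", "Pen", 50),
    ("Jan", "Book", 200),
    ("Feb", "Pen", 75),
]
for month, totals in pivot_sales(records).items():
    print(month, totals)
```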
Data Table to insert a table representing all values in the data table.
IN-TEXT QUESTIONS
8. Which of the following is the most suitable chart type for
displaying the proportion of different categories in a dataset?
(a) Line Chart
(b) Scatter Plot
(c) Pie Chart
(d) Histogram
9. Which of the following operations can you perform using a
pivot table?
(a) Filter data based on specific criteria
(b) Create complex formulas
(c) Sort data in a specific column
(d) All of the above
10. Which type of chart is commonly used in pivot charts to show
data changes over time?
(a) Bar Chart
(b) Pie Chart
(c) Line Chart
(d) Scatter Plot
11. What is a common benefit of using a dashboard for data analysis?
(a) It provides detailed data without summarization
(b) It allows for real-time monitoring of key metrics
(c) It removes the need for data visualization
(d) It only displays raw data without analysis
2.23 Summary
The aim of data preparation is to ensure good quality and consistency of
data for specific tasks. During data preparation, we need to detect
outliers, that is, values that fall outside the expected range. These
unexpected values could be due to errors or may require special attention.
We need to decide whether to remove, correct, or leave outliers in the
dataset: sometimes outliers are valid and should be kept, but in other
cases they may need correction or exclusion.
Moreover, to validate the accuracy of the data, check it against reliable
sources or business rules. Also ensure that the data is logically
consistent, for example that all transactions have corresponding dates.
Data summarization transforms a large dataset into a smaller, presentable
form for reporting, analysis, and further examination. It involves
extracting central insights and patterns from data without losing vital
information. Pivot tables are an important feature of MS Excel that allow
users to quickly summarize large amounts of data, analyze numerical data
in detail, and answer unanticipated questions about the data.
Correspondingly, a Pivot Chart is a dynamic visualization tool that helps
users summarize and analyze large datasets. Pivot charts make trends and
patterns easy to identify.