ENGG1003 Lab06 DataScience v2
2
Tasks
3
CSV file
4
Download a CSV
6
Import the CSV
7
Save As .xlsx
Save the file in Excel Workbook format for further processing.
Windows: Ribbon File > Save As > Browse > Save as type: Excel Workbook (*.xlsx)
The list of options is long; you may need to scroll to the top of the list to find Excel Workbook (*.xlsx).
macOS: File Menu > Save As… > File Format: Excel Workbook (.xlsx)
See Appendix B on p. 37 for screenshots.
8
Academic Honesty and
Declaration Statement
You must read the University Guideline on Academic Honesty (https://www.cuhk.edu.hk/policy/academichonesty/), and place the following declaration statement (with your information filled in) in cell A1 of a worksheet named [Declaration] in your submitted Excel file.
You should NOT share your file with others, regardless of your intention.
You should NOT obtain others' work by any means.
Submitting the wrong file by “accident” will NOT be accepted as an excuse, so please double-check which file you have submitted.
10
Tidy data
12
Transform Data
13
Transform Data
Click cell D2 and type: 1/1/2023
We are going to enter every calendar date of 2023 in this column.
Select cell D2, then drag the little green square at the bottom-right corner of cell D2 (circled in the screenshot) all the way down to cell D366.
That should auto-fill cells D3 to D366 with consecutive calendar dates, i.e., from 1st Jan to 31st Dec.
As a result, cell D366 will contain 31/12/2023.
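For readers who want to check the auto-fill result outside Excel, here is a minimal Python sketch of the same idea. The helper name consecutive_dates is hypothetical and not part of the lab; it only confirms that 365 consecutive days starting on 1st Jan 2023 end on 31st Dec 2023, which is why the fill stops at D366.

```python
from datetime import date, timedelta

# Hypothetical helper: builds the same run of dates the fill handle produces.
def consecutive_dates(start, count):
    return [start + timedelta(days=i) for i in range(count)]

# D2..D366 is 365 cells, one per day of 2023 (2023 is not a leap year).
dates_2023 = consecutive_dates(date(2023, 1, 1), 365)
print(dates_2023[0], dates_2023[-1])
```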
14
Transform Data
15
Transform Data
Explanation:
The VLOOKUP() function searches for the date contained in column D
within the range A2:B244.
If a match is found, then the date is a trading date. In this case, the
VLOOKUP() function returns the corresponding close value in the second
column (i.e., Column B).
If there is no match, then the date is not a trading date. In this case, the
VLOOKUP() function returns an error. The IFERROR() function handles
the error and returns NA() instead.
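The lookup logic described above can be sketched in Python as follows. This is only an analogy, not the lab's Excel formula: the dictionary trading_closes stands in for the range A2:B244, its two entries are made-up values for illustration, and the function name lookup_close is hypothetical.

```python
import math
from datetime import date

# Hypothetical stand-in for A2:B244 (trading date -> close value).
# The numbers are made up for illustration only.
trading_closes = {
    date(2023, 1, 3): 20000.0,
    date(2023, 1, 4): 20500.0,
}

def lookup_close(day, table):
    """Return the close for `day`, or NaN when `day` is not a trading date."""
    try:
        return table[day]   # exact match found, like VLOOKUP returning column B
    except KeyError:
        return math.nan     # no match, like IFERROR() substituting NA()

print(lookup_close(date(2023, 1, 3), trading_closes))
print(lookup_close(date(2023, 1, 1), trading_closes))
```

Returning a "missing" marker rather than 0 mirrors the choice of NA() in the explanation above: non-trading days are recorded as having no value, not as a close value of zero.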
17
Transform Data
We have finished all the data transformation.
We are now ready to do the visualization!
18
Visualize data
19
Visualize Data
Let’s plot a chart to visualize the data
Select cells E1:F366
Scroll to top of the worksheet > Ribbon Insert
> Scatter (X Y) > Scatter with Straight Lines
Click on the chart > Ribbon Chart Design > Add Chart Element > Gridlines > Primary Minor Vertical
Do the same to add the gridline Primary Minor Horizontal
Click on the chart > Ribbon Chart Design > Add Chart Element > Axis Titles > Primary Vertical
Do the same to add the axis title Primary Horizontal
Click and rename the chart title to Daily Close Value of HSI in
2023
Click and rename the axis titles
X-axis title should be Date Number
Y-axis title should be Close Value of HSI
20
Visualize Data
The numbers on the X-axis are now “0 50 100 150 … 400”
Let’s change them to “0 10 20 … 370”
Here are the steps:
Right-click any number (e.g., 0 50 100 150 … 400) on the X-axis
In the right-click menu, choose “Format Axis…”
The “Format Axis” pane will appear on the right-hand side (its position depends on your Excel version)
Set Bounds – Minimum to 0.0
Set Bounds – Maximum to 370.0
Set Units – Major to 10.0
Set Units – Minor to 5.0
22
Modeling
23
Create a Model
Model the data with a trendline
Right-click on the data line in the chart > Add Trendline…
On the right pane, select Moving average > set Period to 4
Select Fill & Line (the paint bucket icon) > set Color to Red
A dotted line will appear on the chart
24
Create a Model
26
Predict the Missing Values
Using the trendline, what is the estimated HSI close value for date number 25?
Enlarge the chart or zoom in on the worksheet for better viewing.
Type your estimate in cell K1.
If the estimated value is not exactly on a gridline, you may choose the closest gridline as your answer.
27
One more Trendline
28
One more Trendline
Forecasting
29
Predict the future value
The dataset only contains data for 2023.
The maximum date number is 365, i.e., 31st Dec 2023.
According to the Hang Seng Indexes Company Limited, the actual HSI close value on 5th Jan 2024 was 16535.33.
What is the absolute error of your prediction?
In cell M1, use the formula =ABS(L1-16535.33) to output the absolute error.
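The absolute error is simply the size of the gap between the prediction and the actual value, ignoring its sign. A minimal Python sketch of the same calculation follows; the function name absolute_error and the sample prediction 17000.0 are made up for illustration, while 16535.33 is the actual value quoted above.

```python
ACTUAL_CLOSE = 16535.33  # actual HSI close on 5th Jan 2024, as quoted in the text

def absolute_error(predicted, actual=ACTUAL_CLOSE):
    """Same idea as =ABS(L1-16535.33): distance from the actual value."""
    return abs(predicted - actual)

# 17000.0 is a made-up prediction, used only to show the calculation.
print(round(absolute_error(17000.0), 2))
```

Over-predicting and under-predicting by the same amount give the same absolute error, which is why ABS() is used.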
30
[Bonus Part]
In cells O1 to Q7, type the cell contents as
shown in the figure below:
31
[Bonus Part]
In cells Q2 to Q7, use FREQUENCY() to count how many of the 2023 HSI close values fall into each range.
Remember to press Ctrl+Shift+Enter (for both Windows and macOS)
Try the formula =FREQUENCY(F2:F366, O2:O6)
However, the results are all #N/A.
Reason: FREQUENCY() returns #N/A if the data range contains any #N/A values.
You are not allowed to use the original data in Column B in this bonus part.
32
[Bonus Part]
FREQUENCY() cannot handle #N/A values.
To fix the problem, the first argument of FREQUENCY() cannot simply be <cell_range>.
Two possible solutions are:
Solution 1: IF(ISNA(<cell_range>), "", <cell_range>)
ISNA() checks whether the cell value is #N/A and returns either TRUE or FALSE.
If the cell value is #N/A, IF() returns "".
If the cell value is not #N/A, IF() returns the cell value unchanged.
Solution 2: IFNA(<cell_range>, "")
IFNA() checks whether the cell value is #N/A and, if so, replaces it with "".
Modify <cell_range> yourself (remember not to include the < >).
The pair of double-quotation marks "" means an empty text string, which FREQUENCY() can handle.
Now the problem is fixed without modifying the original cell values in <cell_range>.
Solutions that do not make use of the combinations of FREQUENCY()
and IFNA(), IF(), and/or ISNA() will be treated as INCORRECT
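To see why filtering out the missing values first makes the count work, here is a rough Python sketch of the same idea. It is an illustration only and does not replace the required Excel solution. The names frequency and drop_missing are hypothetical, the sample values are made up, and the upper bounds are assumed to be in ascending order. NaN plays the role of #N/A here.

```python
import math
from bisect import bisect_left

def frequency(values, bins):
    """Count values per range: one count per upper bound in `bins`
    (assumed ascending), plus one extra for values above the last bound."""
    counts = [0] * (len(bins) + 1)
    for v in values:
        # bisect_left finds the first bound >= v, i.e. the range whose
        # upper limit v does not exceed; larger values land in the last slot.
        counts[bisect_left(bins, v)] += 1
    return counts

def drop_missing(values):
    """Skip NaN entries before counting, similar in spirit to IFNA(<cell_range>, "")."""
    return [v for v in values if not math.isnan(v)]

# Made-up sample data: four close values and two missing (non-trading) days.
closes = [16500.0, math.nan, 18000.0, 20000.0, math.nan, 21500.0]
upper_bounds = [17000, 19000, 21000]
print(frequency(drop_missing(closes), upper_bounds))
```

Note that three upper bounds give four counts, just as the five bounds in O2:O6 fill the six cells Q2 to Q7.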
33
Blackboard Submission
(DO NOT pick the CSV file that is just raw data!)
NO need to “Create Submission” or “Add Comments”
Download your submitted file and open it to make sure
it is really the latest Excel file you worked on
Click Submit
34
Known Possible Issues
35
Appendix A: Opening a CSV file
in Microsoft Excel on macOS
Control-click (or right-click) the CSV file you have just downloaded
In the pop-up menu, choose Open With > Microsoft Excel
36
Appendix B: Saving the file as a
.xlsx file in macOS
37
Appendix C: What is moving
average?
A moving average is also known as a rolling average, running average, moving mean, or rolling mean.
The user must specify a parameter “Period”, say p.
It is calculated by taking the average of a window of p consecutive data points and then shifting that window forward by one data point at a time.
Example:
Assume p = 3
The first value of the trendline will be the average of the first 3 data points.
The second value of the trendline will be the average of the second, third, and fourth
points.
The third value of the trendline will be the average of the third, fourth, and fifth points.
and so on
Pros: It can smooth out random fluctuations in the data and highlight the overall trend.
Cons: It may oversimplify complex data structures.
You can use the moving average trendline when you want to analyze trends over time but
have volatile or noisy data.
It is commonly used in financial analysis, weather data, and other time-series data.
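The p = 3 example above can be reproduced with a short Python sketch. The function name moving_average and the sample data are made up for illustration; the sketch only shows the window arithmetic and is not a claim about how Excel draws the trendline internally.

```python
def moving_average(points, period):
    """Average each window of `period` consecutive points, sliding by one point."""
    if period < 1:
        raise ValueError("period must be at least 1")
    return [
        sum(points[i:i + period]) / period
        for i in range(len(points) - period + 1)
    ]

# Made-up sample data, used only to show the windows for p = 3.
sample = [1, 2, 3, 4, 5, 6]
print(moving_average(sample, 3))  # [2.0, 3.0, 4.0, 5.0]
```

The first output value is the average of points 1 to 3, the second of points 2 to 4, and so on, so six data points with p = 3 give four averaged values.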
38