
ENGG1003 Lab06 DataScience v2

The document outlines a lab exercise for ENGG1003/1004 focusing on the Hang Seng Index (HSI) and its data analysis using Excel. It includes tasks such as downloading a CSV file, tidying and transforming data, visualizing it through charts, and creating predictive models. Additionally, it emphasizes academic honesty and provides instructions for submitting the completed work on Blackboard.

Uploaded by

phyphy57415741

ENGG1003/1004

Digital Literacy and


Computational Thinking—P/R

Lab 06 Data Science


2024-25 Term 2
Hang Seng Index

 Hang Seng Index (HSI) is an indicator to
reflect the overall stock market
performance in Hong Kong.
 HSI is compiled and maintained by Hang
Seng Indexes Company Limited.
 In this lab, we are going to investigate
HSI by Data Science!

2
Tasks

 Download and import a CSV into Excel
 Convert file to .xlsx and tidy the data
 Transform the data
 Visualize the data
 Create a model
 Predict some values
 [Bonus Part]
 Upload and submit your saved file to Blackboard

3
CSV file

4
Download a CSV

 We will focus on the HSI in 2023 (from 1
Jan to 31 Dec).
 Download the CSV from Blackboard and
save it properly.
 file name: HSI.csv
 Remember the folder where you saved
the file (usually the “Downloads” folder).

 Credit: The dataset is also available on Yahoo!Finance


5
Recap: Save Your Work Properly

 Set up your own filing system:
 E.g., create a folder named ENGG1003/1004 for
your work in this course
 Keep on your own computer, e.g., Documents
 Keep on portable storage such as USB drive
 Keep on cloud storage such as OneDrive (CUHK
O365)

 Folder structure is hierarchical, i.e., tree-like
with branches called sub-folders

6
Import the CSV

 The file HSI.csv contains the Open,
High, Low, Close, Adj Close, and
Volume of HSI of every trading date
in 2023:
 Open the CSV in Excel
 Windows users: Just double-click on the
csv file you have just downloaded
 macOS users:
 If you use Safari, the csv contents may
be shown directly in the browser →
press ⌘+S (Command-S) to save the
file
 Do NOT open the file directly by
clicking on the saved csv file, as that
may open the file in another
spreadsheet application called
Numbers → See Appendix A at p.36 for
instructions
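For readers who want to see what the import step does outside Excel, here is a minimal Python sketch that parses CSV text into rows. The column names follow the list on this slide; the `Date` column name, the ISO date format, and every number in `SAMPLE_CSV` are assumptions made for illustration, not values from the real HSI.csv.

```python
import csv
import io

# Hypothetical sample using the column layout described on the slide.
# All numbers are made-up placeholders, NOT real index values.
SAMPLE_CSV = """Date,Open,High,Low,Close,Adj Close,Volume
2023-01-03,100.0,110.0,90.0,105.0,105.0,1000
2023-01-04,105.0,120.0,100.0,115.0,115.0,2000
"""


def read_rows(text):
    """Parse CSV text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text)))


rows = read_rows(SAMPLE_CSV)
print(rows[0]["Date"], rows[0]["Close"])
```

To read the downloaded file instead, the same `csv.DictReader` could be given a file opened with `open("HSI.csv", newline="")`.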

7
Save As .xlsx
 Save the file as Excel workbook format for further processing
 Windows: Ribbon File > Save As > Browse > Save as type: Excel
Workbook (*.xlsx)
 The list of options is long; you may need to scroll to the top of the list to find
Excel Workbook (*.xlsx)


 macOS: File Menu > Save As… > File Format: Excel Workbook (.xlsx)
 See Appendix B at p.37 for screenshots
8
Academic Honesty and
Declaration Statement
 You must read the University Guideline on Academic Honesty (
https://www.cuhk.edu.hk/policy/academichonesty/), and place the following declaration
statement (with your information filled in) in cell A1 of a worksheet named [Declaration] in
your submitted Excel file
 You should NOT share your file with others, regardless of your intention
 You should NOT obtain others’ work by any means
 Submitting the wrong file by “accident” will NOT be accepted as an excuse, so please
double-check which file you have submitted

I declare that the lab work here submitted is original
except for source material explicitly acknowledged,
and that the same or closely related material has not been
previously submitted for another course.
I also acknowledge that I am aware of University policy and
regulations on honesty in academic work, and of the disciplinary
guidelines and procedures applicable to breaches of such
policy and regulations, as contained in the website.

University Guideline on Academic Honesty:
https://www.cuhk.edu.hk/policy/academichonesty/

Student Name : <your name>
Student ID : <your student ID>
Class/Section : <your class/section>
Date : <date>

9
Tidy data

10
Tidy data

 We only want to focus on
the Date and the Close
value in this lab.
 Let's delete the
unnecessary columns.
 Select columns B, C, D, F,
and G, then right-click* the
shaded columns > Delete
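The same tidying step can be sketched in Python: keep only the date and the close value and drop the other columns. The function name `keep_date_and_close` and the placeholder row values are hypothetical, and the `Date` key is an assumed header name.

```python
# One made-up row per trading date; the numbers are placeholders, not real data.
raw_rows = [
    {"Date": "2023-01-03", "Open": "100.0", "High": "110.0", "Low": "90.0",
     "Close": "105.0", "Adj Close": "105.0", "Volume": "1000"},
    {"Date": "2023-01-04", "Open": "105.0", "High": "120.0", "Low": "100.0",
     "Close": "115.0", "Adj Close": "115.0", "Volume": "2000"},
]


def keep_date_and_close(rows):
    """Return new rows that contain only the Date and the numeric Close."""
    return [{"Date": r["Date"], "Close": float(r["Close"])} for r in rows]


tidy_rows = keep_date_and_close(raw_rows)
print(tidy_rows)
```

After this step each row has two fields, which corresponds to columns A and B in the tidied worksheet.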

* How to right-click on Mac

11


Transform data

12
Transform Data

 Let’s create three more columns to
transform the data
 Click cell D1, type: Calendar
 Click cell E1, type: Date#
 Click cell F1, type: Close

13
Transform Data
 Click cell D2, type: 1/1/2023
 We are going to enter every calendar date
in this column.
 Select cell D2, then drag the little
green square at the bottom-right corner
of cell D2 (circled in the screenshot) all
the way down to cell D366
 That should auto-fill cells D3 to
D366 with consecutive calendar
dates, i.e., from 1st Jan to 31st Dec.
 As a result, cell D366 will contain
31/12/2023.

14
Transform Data

 Click cell E2, type: 1
 Click cell E3, type: 2
 Similarly, we are going to enter
consecutive numbers in this column.
 Select cells E2 and E3, then drag the little
green square at the bottom-right corner of
cell E3 (circled in the screenshot) all the
way down to cell E366
 That should auto-fill cells E4 to
E366 with consecutive numbers, i.e.,
from 1 to 365.
 As a result, cell E366 will contain
365.
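As a cross-check on the two auto-fill steps, the Calendar and Date# columns can be generated in a few lines of Python. This is only a sketch; the function name `calendar_with_numbers` is hypothetical.

```python
from datetime import date, timedelta


def calendar_with_numbers(year):
    """Return (calendar date, date number) pairs for every day of the year."""
    pairs = []
    day = date(year, 1, 1)
    number = 1
    while day.year == year:
        pairs.append((day, number))
        day += timedelta(days=1)
        number += 1
    return pairs


calendar_2023 = calendar_with_numbers(2023)
print(calendar_2023[0])   # first row: 1 Jan 2023 with date number 1
print(calendar_2023[-1])  # last row: 31 Dec 2023 with date number 365
```

Because 2023 is not a leap year, the list has 365 entries, matching rows 2 to 366 of the worksheet.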

15
Transform Data

 If the calendar date has a
close value, we are going to
put the value in Column F.
 Otherwise, we are going to
put =NA() in Column F. (i.e.,
it is not a trading date).
 How can we do it?
 Hint: Use =IFERROR() and
=VLOOKUP()
 Answer on the next page……

******** What is IFERROR() function? ********


16
Transform Data
 In F2, Type the formula
=IFERROR(VLOOKUP(D2, $A$2:$B$244, 2, FALSE), NA())
 Double-click the bottom-right corner of F2 to apply this
formula to all rows below.

 Explanation:
 The VLOOKUP() function searches for the date in column D (e.g., D2)
in the first column of the range A2:B244.
 If a match is found, then the date is a trading date. In this case, the
VLOOKUP() function returns the corresponding close value in the second
column (i.e., Column B).
 If there is no match, then the date is not a trading date. In this case, the
VLOOKUP() function returns an error. The IFERROR() function handles
the error and returns NA() instead.
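The lookup-with-fallback idea can also be sketched in Python, where a dictionary lookup with a default plays a role similar to IFERROR(VLOOKUP(...), NA()). The name `close_or_na` is hypothetical, `math.nan` stands in for #N/A, and the close values are made-up placeholders.

```python
import math
from datetime import date, timedelta

# Trading date -> close value. Placeholder numbers, NOT real index values.
closes = {
    date(2023, 1, 3): 105.0,
    date(2023, 1, 4): 115.0,
}


def close_or_na(day, table):
    """Return the close value for a trading date, or NaN if there is none."""
    return table.get(day, math.nan)


# Fill "column F" for the first five calendar dates of 2023.
first_days = [date(2023, 1, 1) + timedelta(days=i) for i in range(5)]
column_f = [close_or_na(day, closes) for day in first_days]
print(column_f)
```

Dates that are missing from the table come back as NaN, just as non-trading dates show #N/A in Column F.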

17
Transform Data
 We have finished all the data transformation.
 We are now ready to do the visualization!

18
Visualize data

19
Visualize Data
 Let’s plot a chart to visualize the data
 Select cells E1:F366
 Scroll to top of the worksheet > Ribbon Insert
> Scatter (X Y) > Scatter with Straight Lines
 Click on the chart > Ribbon Chart Design > Add Chart Elements >
Gridlines > Primary Minor Vertical
 Do the same to add gridline Primary Minor Horizontal
 Click on the chart > Ribbon Chart Design > Add Chart Elements >
Axis Titles > Primary Vertical
 Do the same to add axis title Primary Horizontal
 Click and rename the chart title to Daily Close Value of HSI in
2023
 Click and rename the axis titles
 X-axis title should be Date Number
 Y-axis title should be Close Value of HSI
20
Visualize Data
 The numbers on the X-axis are now “0 50 100 150 … 400”
 Let’s change them to “0 10 20 … 370”
 Here are the steps:
 Right-click any number (e.g., 0 50 100 150 … 400) on the X-axis
 In the right-click menu, choose “Format Axis…”
 The “Format Axis” menu will appear on R.H.S. (depends on your Excel version)
 Set Bounds – Minimum to 0.0
 Set Bounds – Maximum to 370.0
 Set Units – Major to 10.0
 Set Units – Minor to 5.0

 Do the same for Y-Axis:
 Set Bounds – Minimum to 15000.0
 Set Bounds – Maximum to 23000.0
 Set Units – Major to 1000.0
 Set Units – Minor to 200.0

 See the screenshots on the next page


21
Visualize Data

 The data line is broken because some data are unavailable.
 Not all calendar dates are trading dates.
 Place your chart to the right of the data.

22
Modeling

23
Create a Model
 Model the data with a trendline
 Right-click on the data line in the chart
> Add Trendline…
 On the right pane, select Moving average > set Period
as 4
 Select Fill & Line (the painting icon) > set Color as Red
 A dotted line will appear on the chart

24
Create a Model

 What is the moving average?
 See Appendix C on p.38
25
Predict the Missing Values
 The data line in the chart is broken
because of missing values in non-trading
dates.
 We can use the trendline to predict the
missing data.

26
Predict the Missing Values
 Using the trendline, what is the estimated HSI
for date number 25?
 Enlarge the chart or zoom in on the worksheet for better
viewing
 Type your estimation in cell K1.
 If the estimated value is not exactly on the gridline,
you may choose the closest gridline as your answer.

27
One more Trendline

 Again, model the data with a
trendline
 Right-click on the data line in the
chart
> Add Trendline…
 On the right pane, select
Polynomial > set Order as 5
 You may choose any color.
 Under Forecast > Forward, set 5.0 periods
 A curve will appear on the chart
 Change it to green this time.
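To give a feel for what a polynomial trendline with a forward forecast does, here is a small least-squares sketch in plain Python. It is an illustration under assumptions, not a description of Excel's internal method: the names `polyfit` and `polyval` are hypothetical helpers, the example uses a quadratic on six made-up points rather than order 5 on the HSI data, and this simple normal-equations approach would need rescaled x values to stay accurate for high orders over 1 to 365.

```python
def polyfit(xs, ys, degree):
    """Least-squares polynomial coefficients, lowest power first."""
    n = degree + 1
    # Normal equations A c = b, with A[i][j] = sum(x^(i+j)), b[i] = sum(y*x^i).
    a = [[sum(x ** (i + j) for x in xs) for j in range(n)] for i in range(n)]
    b = [sum(y * x ** i for x, y in zip(xs, ys)) for i in range(n)]
    # Gaussian elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        b[col], b[pivot] = b[pivot], b[col]
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for c in range(col, n):
                a[r][c] -= factor * a[col][c]
            b[r] -= factor * b[col]
    # Back substitution.
    coeffs = [0.0] * n
    for i in range(n - 1, -1, -1):
        known = sum(a[i][j] * coeffs[j] for j in range(i + 1, n))
        coeffs[i] = (b[i] - known) / a[i][i]
    return coeffs


def polyval(coeffs, x):
    """Evaluate the fitted polynomial at x."""
    return sum(c * x ** k for k, c in enumerate(coeffs))


# Made-up points that lie exactly on y = 2x^2 - 3x + 1.
xs = [0, 1, 2, 3, 4, 5]
ys = [2 * x * x - 3 * x + 1 for x in xs]
coeffs = polyfit(xs, ys, degree=2)
forecast = polyval(coeffs, 6)  # one step beyond the last data point
print(coeffs, forecast)
```

Evaluating the fitted curve beyond the last x value is what the Forward setting does on the chart: the curve is simply extended past the data.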

28
One more Trendline

Forecasting

29
Predict the future value
 The dataset only contains data in 2023.
 The maximum date number is 365, i.e., 31st Dec 2023.

 The polynomial trendline forecasts the HSI up to Date Number
370, i.e., 5th Jan 2024.
 What is the predicted HSI close value on 5th Jan 2024?
 Write down the predicted value in cell L1.
 If the predicted value is not exactly on the gridline, you may choose the
closest gridline as your answer.

 According to the actual data from the Hang Seng Indexes Company
Limited, the actual HSI close value on 5th Jan 2024 is 16535.33.
 What is the absolute error of your prediction?
 In cell M1, use the formula =ABS(L1 - 16535.33) to output the absolute
error.
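The absolute error is just the size of the gap between the prediction and the actual value, ignoring the sign. A minimal Python sketch follows; the function name `absolute_error` and the example prediction of 17000.0 are hypothetical, while 16535.33 is the actual close quoted on this slide.

```python
ACTUAL_CLOSE = 16535.33  # actual HSI close on 5th Jan 2024, as quoted on the slide


def absolute_error(predicted, actual=ACTUAL_CLOSE):
    """Return |predicted - actual|, the same quantity as =ABS(L1 - 16535.33)."""
    return abs(predicted - actual)


# 17000.0 is a made-up reading from the chart, used only as an example.
example_error = absolute_error(17000.0)
print(round(example_error, 2))
```

A prediction that is too low by the same amount gives the same error, because the sign of the difference is discarded.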

30
[Bonus Part]
 In cells O1 to Q7, type the cell contents as
shown in the figure below:

31
[Bonus Part]
 In cells Q2 to Q7, use FREQUENCY() to count how many of the
2023 HSI close values fall into each range.
 Remember to press Ctrl+Shift+Enter (for both Windows and macOS)
 Try the formula =FREQUENCY(F2:F366, O2:O6)
 However, the results are all #N/A.
 Reason: FREQUENCY() will give #N/A results if there are some #N/A
values in the original data.
 You are not allowed to use the original data in Column B in this bonus
part.

32
[Bonus Part]
 FREQUENCY() cannot handle #N/A values.
 To fix the problem, the first argument of FREQUENCY()
cannot simply be <cell_range>.
 Two possible solutions are:
 IF(ISNA(<cell_range>), "", <cell_range>)
 ISNA() checks whether the cell value is #N/A; it returns either TRUE or FALSE.
 If the cell value is #N/A, IF() returns "".
 If the cell value is not #N/A, IF() returns the cell value unchanged.
 IFNA(<cell_range>, "")
 IFNA() checks whether the cell value is #N/A and, if so, replaces it with "".
 Modify <cell_range> yourself (remember not to include the < >).
 The double quotation marks "" mean an empty text string, which FREQUENCY() can handle.

 Now the problem is fixed without modifying the original cell values in
<cell_range>
 Solutions that do not make use of the combinations of FREQUENCY()
and IFNA(), IF(), and/or ISNA() will be treated as INCORRECT
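The counting rule behind this bonus part can be sketched in Python. This is an illustration under assumptions, not a replacement for the required Excel formula: the name `frequency` is a hypothetical helper, the bins and data values are made up, `math.nan` stands in for #N/A, and the bins are assumed to be in ascending order. As with five bins in O2:O6 giving six results in Q2:Q7, the function returns one more count than there are bins.

```python
import math
from bisect import bisect_left


def frequency(data, bins):
    """Count values per bin; returns len(bins) + 1 counts.

    counts[0] holds values <= bins[0], counts[i] holds values in
    (bins[i-1], bins[i]], and the last count holds values above bins[-1].
    Missing entries (None, text such as "", or NaN) are skipped.
    """
    counts = [0] * (len(bins) + 1)
    for value in data:
        if value is None or isinstance(value, str):
            continue
        if isinstance(value, float) and math.isnan(value):
            continue
        counts[bisect_left(bins, value)] += 1
    return counts


# Made-up values: NaN and "" mimic non-trading dates.
sample = [15000.0, math.nan, 16500.0, "", 18000.0, 21000.0]
counts = frequency(sample, [16000, 17000, 18000])
print(counts)
```

Skipping the missing entries before counting mirrors the effect of replacing #N/A with "" inside the FREQUENCY() argument.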
33
Blackboard Submission

 Log in to the Blackboard course ENGG1003/1004
 Go to Lab 06 Data Science
 Under Upload Files, Browse Local Files > select your
saved file HSI.xlsx
(DO NOT pick the CSV file that is just raw data!)
 NO need to “Create Submission” or “Add Comments”
 Download your submitted file and open it to make sure
it is really the latest Excel file you worked on
 Click Submit

34
Known Possible Issues

 Windows with Simplified Chinese language and region settings does not
treat commas as separators, so Excel cannot open the CSV properly.
 Excel > Data ribbon > “Text to columns” to manually tell Excel to use commas to
separate the columns; OR change Windows settings.
 Plotting a chart with “Scatter with Straight Lines” shows no broken part on
missing data.
 Avoid clicking “Recommended Charts”; pick Scatter (X Y) instead.
 Seeing formula text =NA() instead of #N/A value.
 Copy and paste #N/A value from some other cells.
 Some cells display strange characters or symbols.
 This may be caused by a symbol font such as Wingdings. Try changing
those cells to another font.

35
Appendix A: Opening a csv file
in Microsoft Excel on macOS
 Control-click (or right-click) the csv file you have just
downloaded
 In the pop-up menu, choose Open With -> Microsoft
Excel

36
Appendix B: Saving the file as a
.xlsx file in macOS

37
Appendix C: What is moving
average?
 It is also known as rolling average, running average, moving mean, or rolling mean.
 A parameter “Period”, say p, should be assigned by the user.
 It is calculated by taking the average of an interval with size p and then shifting that interval
forward by one data point at a time.
 Example:
 Assume p = 3
 The first value of the trendline will be the average of the first 3 data points.
 The second value of the trendline will be the average of the second, third, and fourth
points.
 The third value of the trendline will be the average of the third, fourth, and fifth points.
 and so on
 Pros: It can smooth out the random fluctuations in data and highlight the overall trends.
 Cons: It may oversimplify complex data structure.
 You can use the moving average trendline when you want to analyze trends over time but
have volatile or noisy data.
 It is commonly used in financial analysis, weather data, and other time-series data.
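The sliding-window calculation described above can be written in a few lines of Python. This is a minimal sketch of the definition given in this appendix; the function name `moving_average` is hypothetical and the input numbers are made up.

```python
def moving_average(values, period):
    """Average each window of `period` consecutive values, sliding by one."""
    if period < 1:
        raise ValueError("period must be at least 1")
    return [
        sum(values[i:i + period]) / period
        for i in range(len(values) - period + 1)
    ]


# With p = 3: (1+2+3)/3, then (2+3+4)/3, then (3+4+5)/3.
smoothed = moving_average([1, 2, 3, 4, 5], 3)
print(smoothed)  # [2.0, 3.0, 4.0]
```

Five data points with p = 3 give three averages, since each window needs three points before it can produce a value.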
38
