UFCFLR-15-M Data Management Fundamentals 2021

The document provides details about an assignment to model, clean, normalize, and query air quality data from Bristol, UK. The data ranges from 2004 to 2022 and contains over 1.47 million rows from 18 monitoring stations recording hourly measurements of pollutants like nitrogen oxides, particulate matter, and carbon monoxide. Students must cleanse and normalize the data, import it into a MySQL database, perform SQL queries, and map the data to a NoSQL database to demonstrate their skills in working with large real-world datasets.

11/10/2022, 23:15 UFCFLR-15-M Data Management Fundamentals 2021

MODULAR PROGRAMME

COURSEWORK ASSESSMENT SPECIFICATION

Module Details

Module Code: UFCFLR-15-M
Run: 21JAN/1
Module Title: Data Management Fundamentals

Module Leader Module Coordinator Module Tutors

Prakash Chatterjee P Chatterjee

Component and Element Number: B: CW1
Weighting (% of the Module's assessment): 50%

Element Description: Model, cleanse, normalize, map, & query big data
Total Assignment time: 36 hours

Dates

Date Issued to Students: 15 Feb 2021
Date to be Returned to Students: 4 working weeks after hand-in.

Submission Place: Blackboard
Submission Date: 18 Jul 2022
Submission Time: 2.00 pm

Deliverables

A ZIP file submitted to BB called dmf-assign.zip containing all required code & reports.

https://fetstudy.uwe.ac.uk/~p-chatterjee/2021-22/modules/dmf-jan/assignment/resit/ 1/9

Module Leader Signature

UFCFLR-15-M Data Management Fundamentals

Assignment Specification 2021-22 - Jan/21-1

Learning Goals & Outcomes

Learn to model, cleanse, normalize, shard, map, query and analyze substantial real-world data
(230mb+);
Understand the data cleansing, normalization and sharding processes by writing PYTHON scripts
to process and convert the data to first (cleansed) CSV and then (normalized) SQL;
Design and implement a relational (MySQL) database and then write a PYTHON script to pipe
(import) the cleansed data into the appropriate tables ensuring all integrity constraints are met.
Construct and implement a set of SQL queries to extract data using various filters and constraints.
Map (forward engineer) the data to a NoSQL database of your choice (MongoDB, BaseX,
CouchBase, ArangoDB etc.)
Write a short, reflective report on the learning outcomes you have achieved.
Get exposure to and learn the use of a range of data oriented technologies (databases, python &
sql.)
Learn and use the MARKDOWN markup syntax.

Context: Measuring Air Quality

Levels of various airborne pollutants such as Nitrogen Monoxide (NO), Nitrogen Dioxide (NO2) and
particulate matter (also called particle pollution) are all major contributors to the measure of overall
air quality.

For instance, NO2 is measured using micrograms in each cubic metre of air (㎍/m³). A microgram (㎍)
is one millionth of a gram. A concentration of 1 ㎍/m³ means that one cubic metre of air contains one
microgram of pollutant.

To protect our health, the UK Government sets two air quality objectives for NO2 in their Air Quality
Strategy:

1. The hourly objective, which is the concentration of NO2 in the air, averaged over a period of one
hour.

2. The annual objective, which is the concentration of NO2 in the air, averaged over a period of a
year.


The following table shows the colour encoding and the levels for Objective 1 above, the mean hourly
ratio, adopted in the UK.

Index   1     2       3        4         5         6         7        8        9        10
Band    Low   Low     Low      Moderate  Moderate  Moderate  High     High     High     Very High
㎍/m³   0-67  68-134  135-200  201-267   268-334   335-400   401-467  468-534  535-600  601 or more
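
As an illustration of how the bands above could be applied in code, the following minimal Python sketch maps an hourly mean NO2 concentration to its index and band. The function name no2_index is hypothetical, and assigning fractional values that fall between two bands (for example 67.5) to the lower band is an assumption, since the table only lists whole numbers.

```python
from bisect import bisect_right

# Lower bound (in µg/m³) of each index 1..10, taken from the table above.
LOWER_BOUNDS = [0, 68, 135, 201, 268, 335, 401, 468, 535, 601]
BANDS = ["Low"] * 3 + ["Moderate"] * 3 + ["High"] * 3 + ["Very High"]


def no2_index(concentration):
    """Return (index, band) for an hourly mean NO2 concentration in µg/m³.

    Assumption: a fractional value between two bands (e.g. 67.5) is
    assigned to the lower band.
    """
    if concentration < 0:
        raise ValueError("concentration must be non-negative")
    # Count how many lower bounds are <= the value; that count is the index.
    index = bisect_right(LOWER_BOUNDS, concentration)
    return index, BANDS[index - 1]
```

For example, no2_index(200) falls in index 3 (Low), while no2_index(201) falls in index 4 (Moderate).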

Further details of colour encodings and health warnings can be found at the DEFRA Site.

The Input Data

The following ZIP file provides data ranging from 2004 to 10 February 2022 (five days ago) taken from
18 monitoring stations in and around Bristol.
Delete any previous versions of this file (if you have downloaded it).

Monitors come and go and may suffer down times, so the data isn't complete for all stations at all
times.

Download & save the data file: bristol-air-quality-data.zip (18 Mb)

Shown here are the first 8 lines of the file (cropped):

Note the following:

There are 18 stations (monitors):


188 => 'AURN Bristol Centre',
203 => 'Brislington Depot',
206 => 'Rupert Street',
209 => 'IKEA M32',
213 => 'Old Market',
215 => 'Parson Street School',
228 => 'Temple Meads Station',
270 => 'Wells Road',
271 => 'Trailer Portway P&R',
375 => 'Newfoundland Road Police Station',
395 => "Shiner's Garage",
452 => 'AURN St Pauls',
447 => 'Bath Road',
459 => 'Cheltenham Road \ Station Road',
463 => 'Fishponds Road',
481 => 'CREATE Centre Roof',
500 => 'Temple Way',
501 => 'Colston Avenue'
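
The station listing above translates directly into a Python lookup table, which is one possible basis for checking a SiteID against its Location (as Task 1b later requires). The sketch below is illustrative only: the helper name is_mismatch is hypothetical, and it assumes that SiteID arrives as a string (as it would from the csv module) and that the Location text in the data file matches the names above exactly.

```python
# SiteID -> station name, copied from the listing above.
STATIONS = {
    188: "AURN Bristol Centre",
    203: "Brislington Depot",
    206: "Rupert Street",
    209: "IKEA M32",
    213: "Old Market",
    215: "Parson Street School",
    228: "Temple Meads Station",
    270: "Wells Road",
    271: "Trailer Portway P&R",
    375: "Newfoundland Road Police Station",
    395: "Shiner's Garage",
    452: "AURN St Pauls",
    447: "Bath Road",
    459: "Cheltenham Road \\ Station Road",  # the backslash must be escaped
    463: "Fishponds Road",
    481: "CREATE Centre Roof",
    500: "Temple Way",
    501: "Colston Avenue",
}


def is_mismatch(site_id, location):
    """Return True if site_id is missing or unknown, or location disagrees.

    Assumption: site_id is a string such as "203" (or None / "" if missing).
    """
    try:
        expected = STATIONS[int(site_id)]
    except (KeyError, TypeError, ValueError):
        return True  # missing, non-numeric or unknown SiteID
    return expected != (location or "").strip()
```

A call such as is_mismatch("203", "Rupert Street") reports a mismatch, because station 203 is Brislington Depot.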

Each line represents one reading from a specific detector. Detectors take one reading every hour. If
you examine the file using a programming editor (Notepad++ can handle the job), you can see that the
first row gives headers and there are another 1474177 (1.47 million+) rows (lines). There are 23 data
items (columns) per line.
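
The row and column counts quoted above can also be checked programmatically. The following is a minimal sketch using only the standard library; the function name summarize_csv is hypothetical, and the field delimiter is an assumption (a comma by default), so it should be confirmed by inspecting the first line of the actual file.

```python
import csv


def summarize_csv(path, delimiter=","):
    """Return (column_count, data_row_count) for a CSV file.

    The file is streamed row by row, so it is never loaded into memory
    in full. The delimiter is an assumption; check the real file first.
    """
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle, delimiter=delimiter)
        header = next(reader)            # first row holds the column names
        rows = sum(1 for _ in reader)    # count the remaining data rows
    return len(header), rows
```

If the delimiter is right, the result for the supplied file should agree with the 23 columns and 1474177 data rows stated above.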

The schema is given below:

measure          desc                                                                    unit
Date Time        Date and time of measurement                                            datetime
NOx              Concentration of oxides of nitrogen                                     ㎍/m³
NO2              Concentration of nitrogen dioxide                                       ㎍/m³
NO               Concentration of nitric oxide                                           ㎍/m³
SiteID           Site ID for the station                                                 integer
PM10             Concentration of particulate matter <10 micron diameter                 ㎍/m³
NVPM10           Concentration of non-volatile particulate matter <10 micron diameter    ㎍/m³
VPM10            Concentration of volatile particulate matter <10 micron diameter        ㎍/m³
NVPM2.5          Concentration of non-volatile particulate matter <2.5 micron diameter   ㎍/m³
PM2.5            Concentration of particulate matter <2.5 micron diameter                ㎍/m³
VPM2.5           Concentration of volatile particulate matter <2.5 micron diameter       ㎍/m³
CO               Concentration of carbon monoxide                                        ㎎/m³
O3               Concentration of ozone                                                  ㎍/m³
SO2              Concentration of sulphur dioxide                                        ㎍/m³
Temperature      Air temperature                                                         °C
RH               Relative Humidity                                                       %
Air Pressure     Air Pressure                                                            mbar
Location         Text description of location                                            text
geo_point_2d     Latitude and longitude                                                  geo point
DateStart        The date monitoring started                                             datetime
DateEnd          The date monitoring ended                                               datetime
Current          Is the monitor currently operating                                      text
Instrument Type  Classification of the instrument                                        text

Task 1: Crop, Cleanse and Refactor the Data (16 marks)

Design & write appropriate PYTHON scripts to carry out the following.

a. Crop the file to delete any records before 00:00 1 Jan 2010 (1262304000).
b. Filter for and remove any dud records where there is no value for SiteID or there is a mismatch
between SiteID and Location.
(This script should print to the console the line number and mismatch field values for each dud
record.)

Submission files: Two Python scripts: crop.py & clean.py that generate cropped & cleaned
CSV files. The generated output files must be named crop.csv & clean.csv respectively.
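
As a rough illustration of the cropping step in part (a), the sketch below streams a CSV file and keeps only rows dated on or after 00:00 1 Jan 2010. It is not a model answer. It assumes that the "Date Time" column holds ISO 8601 text that datetime.fromisoformat can parse, that values without a timezone are UTC, and that the delimiter is a comma; the function names are hypothetical.

```python
import csv
from datetime import datetime, timezone

# 00:00 1 Jan 2010 UTC, which corresponds to the epoch value 1262304000.
CUTOFF = datetime(2010, 1, 1, tzinfo=timezone.utc)
assert int(CUTOFF.timestamp()) == 1262304000


def parse_timestamp(text):
    """Parse ISO 8601 text; naive values are assumed to be UTC."""
    stamp = datetime.fromisoformat(text)
    if stamp.tzinfo is None:
        stamp = stamp.replace(tzinfo=timezone.utc)
    return stamp


def crop(src, dst, date_column="Date Time", delimiter=","):
    """Copy rows dated on or after CUTOFF from src to dst; return rows kept."""
    kept = 0
    with open(src, newline="", encoding="utf-8") as fin, \
         open(dst, "w", newline="", encoding="utf-8") as fout:
        reader = csv.DictReader(fin, delimiter=delimiter)
        writer = csv.DictWriter(fout, fieldnames=reader.fieldnames,
                                delimiter=delimiter)
        writer.writeheader()
        for row in reader:
            # An empty or malformed date raises ValueError here.
            if parse_timestamp(row[date_column]) >= CUTOFF:
                writer.writerow(row)
                kept += 1
    return kept
```

Because the file is processed one row at a time, memory use stays flat regardless of file size. How rows with an empty or malformed date should be treated is left open here and would need an explicit decision.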

Task 2: Create and implement a relational database (MySQL). (12 marks)

a. Use MySQL Workbench or any other tool to create an ER model in third normal form to hold
the given data.
b. Use the forward engineer feature of MySQL Workbench to generate the SQL schema and
implement the database (pollution-db).
(If this does not work for you, e.g. MySQL Workbench configuration issues, you can use
phpMyAdmin within XAMPP to create the tables by hand. You can then use the export feature to
extract the SQL, edit it in a text editor and then save the file as pollution.sql.)

Submission files: An ER diagram pollution-er.png and a SQL file pollution.sql holding
table definitions.
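
To make the idea of normalization concrete, the sketch below separates station attributes (which depend only on SiteID) from hourly readings and links the two with a foreign key. It is one possible decomposition, not the required design. So that it runs without a server, it uses Python's built-in sqlite3 module as a stand-in for MySQL; the table and column names are hypothetical, only a few measurement columns are shown, and treating every station attribute as constant per SiteID is an assumption that should be verified against the data.

```python
import sqlite3

# Illustrative two-table layout (SQLite dialect, standing in for MySQL).
SCHEMA = """
CREATE TABLE station (
    site_id         INTEGER PRIMARY KEY,
    location        TEXT NOT NULL,
    geo_point_2d    TEXT,
    date_start      TEXT,
    date_end        TEXT,
    current         TEXT,
    instrument_type TEXT
);
CREATE TABLE reading (
    reading_id INTEGER PRIMARY KEY,
    site_id    INTEGER NOT NULL REFERENCES station (site_id),
    date_time  TEXT NOT NULL,
    nox        REAL,
    no2        REAL,
    pm10       REAL,
    UNIQUE (site_id, date_time)   -- one reading per station per hour
);
"""


def create_schema(conn):
    """Create the two tables and switch on foreign key enforcement."""
    conn.execute("PRAGMA foreign_keys = ON")  # off by default in SQLite
    conn.executescript(SCHEMA)
```

With this layout a reading that refers to an unknown site_id is rejected by the database itself, which is the kind of integrity constraint the learning goals refer to.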

Task 3: Write python scripts to populate the database & generate SQL. (20 marks)

a. Design, write, & test a PYTHON script (populate.py) that takes the cleaned CSV file as input and
creates a new database instance (pollution-db2) and populates it.
b. Create a PYTHON script (insert-100.py) that generates a SQL file (insert-100.sql) that holds the
first 100 inserts to the main data table.

Submission files: Two Python scripts: populate.py & insert-100.py .
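
One way to approach part (b) is to render each row as an INSERT statement and stop after the first 100. The sketch below shows only that rendering step, under stated assumptions: rows are dictionaries whose keys are already valid column names, the table name reading is hypothetical, and empty strings are written as NULL. Backslashes are doubled as well as single quotes because MySQL treats the backslash as an escape character in string literals by default, which matters for a name such as 'Cheltenham Road \ Station Road'.

```python
from itertools import islice


def sql_literal(value):
    """Render a Python value as a SQL literal; empty values become NULL."""
    if value is None or value == "":
        return "NULL"
    if isinstance(value, (int, float)):
        return str(value)
    # Escape backslashes first, then double any single quotes.
    text = str(value).replace("\\", "\\\\").replace("'", "''")
    return "'" + text + "'"


def insert_statements(rows, table="reading", limit=100):
    """Yield one INSERT statement per row, for at most `limit` rows."""
    for row in islice(rows, limit):
        columns = ", ".join(f"`{name}`" for name in row)
        values = ", ".join(sql_literal(value) for value in row.values())
        yield f"INSERT INTO `{table}` ({columns}) VALUES ({values});"
```

Writing the output file is then a matter of joining the yielded strings with newlines. When talking to the database directly, as populate.py does, parameterised queries from the database driver are generally a safer choice than hand-built literals.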

Task 4: Design, write and run SQL queries. (12 marks)

Write and implement (test run) the following three SQL queries:

a. Return the date/time, station name and the highest recorded value of oxides of nitrogen (NOx) found
in the dataset for the year 2019.
b. Return the mean values of PM2.5 (particulate matter <2.5 micron diameter) & VPM2.5 (volatile
particulate matter <2.5 micron diameter) by each station for the year 2019 for readings taken on
or near 08:00 hours (peak traffic intensity).
c. Extend the previous query to show these values for all stations in the years 2010 to 2019.

Submission file: Code listing of the three SQL queries query-a.sql , query-b.sql &
query-c.sql
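
To show the general shape that query (a) might take, the sketch below runs a join with a date filter against a tiny in-memory database. It is illustrative only: Python's sqlite3 module stands in for MySQL, the station and reading tables and their columns are hypothetical, and the four sample readings are made up for the demonstration rather than taken from the dataset.

```python
import sqlite3

# Highest NOx reading in 2019 together with its station name.
QUERY_A = """
SELECT r.date_time, s.location, r.nox
FROM reading AS r
JOIN station AS s ON s.site_id = r.site_id
WHERE r.date_time >= '2019-01-01 00:00:00'
  AND r.date_time <  '2020-01-01 00:00:00'
ORDER BY r.nox DESC
LIMIT 1
"""


def demo():
    """Build a throwaway database with made-up rows and run QUERY_A."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE station (site_id INTEGER PRIMARY KEY, location TEXT);
        CREATE TABLE reading (site_id INTEGER, date_time TEXT, nox REAL);
        INSERT INTO station VALUES (203, 'Brislington Depot'),
                                   (206, 'Rupert Street');
        INSERT INTO reading VALUES
            (203, '2018-12-31 23:00:00', 900.0),  -- outside 2019, ignored
            (203, '2019-01-10 08:00:00', 120.5),
            (206, '2019-03-05 08:00:00', 310.0),
            (206, '2019-07-01 08:00:00', NULL);   -- missing value
    """)
    return conn.execute(QUERY_A).fetchone()
```

The half-open date range avoids calling a function on every row, which leaves an index on the date column usable. LIMIT 1 returns a single row even if several readings share the maximum, so ties would need separate handling.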

Task 5: Model, implement and query a selected NoSQL database. (30 marks)

Model the data for a specific monitor (station) in a NoSQL data model (key-value, XML or graph),
implement it in the selected database type/product, and pipe or import the data.

You can select from any of the seven databases listed below but if you want, you can select one not
currently on the list (after confirmation from the tutor).

Submission file: A report in Markdown format (<1200 words) named nosql.md describing the
data models used & relevant implementation details.
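
As one possible starting point for the mapping, the sketch below turns a flat CSV row into a nested, JSON-serializable document of the kind a document store such as MongoDB accepts. The document layout and the function name to_document are hypothetical. Two choices are assumptions rather than requirements: empty measurements are omitted instead of stored as nulls, and dots in measure names (PM2.5) are replaced with underscores, because dot notation is used to address nested fields in MongoDB queries.

```python
import json

# Measurement columns from the schema table.
MEASURES = ("NOx", "NO2", "NO", "PM10", "NVPM10", "VPM10", "NVPM2.5",
            "PM2.5", "VPM2.5", "CO", "O3", "SO2", "Temperature", "RH",
            "Air Pressure")


def to_document(row):
    """Map one flat CSV row (a dict of strings) to a nested document."""
    measures = {}
    for name in MEASURES:
        value = row.get(name)
        if value not in (None, ""):          # skip missing measurements
            measures[name.replace(".", "_")] = float(value)
    return {
        "site_id": int(row["SiteID"]),
        "location": row["Location"],
        "date_time": row["Date Time"],
        "measures": measures,
    }
```

Calling json.dumps(to_document(row)) then yields text that can be written to a file, one document per line, for a bulk import tool.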

Task 6: Reflective Report. (10 marks)

A short report in Markdown format (<1000 words) reflecting on the assignment, the problems
encountered and the solutions found.

In addition you should discuss and outline some of the Python tools and libraries that could be used to
visualize this data. What maps / charts with which content?

You should also briefly outline the Learning Outcomes you have managed to achieve in undertaking
this Assignment.

Submission file: A report in Markdown format named report.md .

** It is a necessary requirement to adhere to the naming convention given for all submission
files. This enables automated testing and substantial marks will be forfeited if this requirement
is not met. **

As a reminder, a final checklist of submission files will be published a week before the hand-in
date (12 May).

Note: report.md & Report.md are not the same file!! (despite what Windows OS may assume!)

Assessment Criteria and Marks Allocation

Task 1: Crop, Cleanse and Refactor the Data (16%)

all scripts are well designed, structured and commented;
scripts make use of dataframes, chunking and other techniques as appropriate;
cropped and cleansed data is correctly formatted and complete.

Task 2: Create and Implement a Normalized Database. (12%)

a normalised ER diagram showing all entities, keys, attributes & relationships;
an implemented database structure with all required tables, fields and keys;

Task 3: Write a Python script to generate the required SQL. (20%)

the scripts are well designed, structured and commented;
the script makes use of objects and/or functions as required;
the script generates valid SQL matching the database schema;
the database is populated with the base data.

Task 4: Design, Write and Run SQL Queries. (12%)

queries are valid and return the required results;

Task 5: Model, implement and query a selected NoSQL database. (30%)

database is one chosen from the list provided (unless explicitly agreed with the tutor);
an adequate data model is developed and realized (implemented);
sample data is imported into the selected database type/product;
evidence of example query implementation and result output;

Task 6: Reflective Report. (10%)

a clear and concise report describing the problems, solutions and possible visualizations;
some reflection on the Learning Outcomes achieved.

Tutor support

This coursework is seen as providing a learning experience in the tools & technologies used on this
module. Support will be provided in workshops and via email.

Tutor help can be requested for any aspect of the coursework such as the overall design, Python coding
problems or data structuring. Please ask for assistance after a bit of effort with the problem rather
than get stuck.

Assessment Offences

This assignment should be your own work. Allowing others to do the work for you, or sharing
significant portions of code with others will be considered an assessment offence and may lead to your
mark being reduced to 0. Part of the marking process will include similarity checks and we may ask
you to explain your code in detail to verify that it is your own. Please refer to the assessment offences
policy document for more information.

References

Air Pollution - Wikipedia
UK Government Air Quality Strategy
Markdown Tutorial

url: http://fetstudy.uwe.ac.uk/~p-chatterjee/2021-22/modules/dmf-jan/assignment/resit/
