UFCFLR-15-M Data Management Fundamentals 2021
MODULAR PROGRAMME
Module Details
B: CW1 50%
Dates
18 Jul 2022
2.00 pm
Deliverables
A ZIP file submitted to BB called dmf-assign.zip containing all required code & reports.
https://fetstudy.uwe.ac.uk/~p-chatterjee/2021-22/modules/dmf-jan/assignment/resit/ 1/9
11/10/2022, 23:15 UFCFLR-15-M Data Management Fundamentals 2021
Learn to model, cleanse, normalize, shard, map, query and analyze substantial real-world data
(230 MB+);
Understand the data cleansing, normalization and sharding processes by writing PYTHON scripts
to process and convert the data to first (cleansed) CSV and then (normalized) SQL;
Design and implement a relational (MySQL) database and then write a PYTHON script to pipe
(import) the cleansed data into the appropriate tables ensuring all integrity constraints are met.
Construct and implement a set of SQL queries to extract data using various filters and constraints.
Map (forward engineer) the data to a NoSQL database of your choice (MongoDB, BaseX,
CouchBase, ArangoDB etc.)
Write a short, reflective report on the learning outcomes you have achieved.
Gain exposure to, and learn to use, a range of data-oriented technologies (databases, Python &
SQL).
Learn and use the MARKDOWN markup syntax.
Levels of various airborne pollutants such as Nitrogen Monoxide (NO), Nitrogen Dioxide (NO2) and
particulate matter (also called particle pollution) are all major contributors to the measure of overall
air quality.
For instance, NO2 is measured in micrograms per cubic metre of air (㎍/m³). A microgram (㎍)
is one millionth of a gram. A concentration of 1 ㎍/m³ means that one cubic metre of air contains one
microgram of pollutant.
To protect our health, the UK Government sets two air quality objectives for NO2 in its Air Quality
Strategy:
1. The hourly objective, which is the concentration of NO2 in the air, averaged over a period of one
hour.
2. The annual objective, which is the concentration of NO2 in the air, averaged over a period of a
year.
The following table shows the colour encoding and the levels for Objective 1 above, the mean hourly
ratio, adopted in the UK.
Index   1      2       3        4         5         6         7        8        9        10
Band    Low    Low     Low      Moderate  Moderate  Moderate  High     High     High     Very High
㎍/m³   0-67   68-134  135-200  201-267   268-334   335-400   401-467  468-534  535-600  601 or more
Further details of colour encodings and health warnings can be found at the DEFRA Site.
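As an illustration of how the table above can be applied in code, the following is a minimal Python sketch that maps an hourly mean NO2 value to its index and band. The function name no2_index and the rounding of readings to whole numbers are assumptions made for this example; they are not part of the brief or of DEFRA's guidance.

```python
from bisect import bisect_right

# Lower bound (in ug/m3) of each index 1..10, copied from the table above.
LOWER_BOUNDS = [0, 68, 135, 201, 268, 335, 401, 468, 535, 601]
BANDS = ["Low"] * 3 + ["Moderate"] * 3 + ["High"] * 3 + ["Very High"]


def no2_index(concentration):
    """Return (index, band) for an hourly mean NO2 concentration in ug/m3."""
    if concentration < 0:
        raise ValueError("concentration must be non-negative")
    # Assumption: readings are rounded to whole numbers before the lookup,
    # because the table leaves gaps between bands (e.g. between 67 and 68).
    value = round(concentration)
    # Number of lower bounds <= value is the 1-based index in the table.
    index = bisect_right(LOWER_BOUNDS, value)
    return index, BANDS[index - 1]


print(no2_index(150))  # (3, 'Low')
print(no2_index(601))  # (10, 'Very High')
```

Keeping the lower bounds in a sorted list means a change to the banding only requires editing the data, not the lookup logic.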
The following ZIP file provides data ranging from 2004 to 10 February 2022 (five days ago) taken from
18 monitoring stations in and around Bristol.
Delete any previous versions of this file (if you have downloaded it).
Monitors come and go and may suffer down times, so the data isn't complete for all stations at all
times.
Each line represents one reading from a specific detector. Detectors take one reading every hour. If
you examine the file using a programming editor (Notepad++ can handle the job), you can see that the
first row gives headers and there are another 1474177 (1.47 million+) rows (lines). There are 23 data
items (columns) per line.
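The row and column counts quoted above can also be checked from Python without loading the whole file into memory. The sketch below is one possible way to do it; the semicolon delimiter and the sample column names are assumptions, so inspect the real header row before relying on them.

```python
import csv
import io


def summarise_csv(handle, delimiter=";"):
    """Return (header, data_row_count, column_count) for an open CSV file."""
    reader = csv.reader(handle, delimiter=delimiter)
    header = next(reader)
    # Count rows one at a time so that a 230 MB+ file is never held in memory.
    row_count = sum(1 for _ in reader)
    return header, row_count, len(header)


# Tiny in-memory stand-in for the real file (made-up columns and values).
sample = io.StringIO(
    "Date Time;NOx;SiteID;Location\n"
    "2019-01-01T08:00:00+00:00;41.5;1;Station A\n"
    "2019-01-01T09:00:00+00:00;39.0;1;Station A\n"
)
header, rows, columns = summarise_csv(sample)
print(rows, columns)  # 2 4
```

For the real data, pass a handle opened with open(path, newline="") in place of the in-memory sample.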
RH               Relative Humidity                  %
geo_point_2d     Latitude and longitude             geo point
Instrument Type  Classification of the instrument   text
Design & write appropriate PYTHON scripts to carry out the following.
a. Crop the file to delete any records before 00:00 1 Jan 2010 (1262304000).
b. Filter for and remove any dud records where there is no value for SiteID or there is a mismatch
between SiteID and Location.
(This script should print to the console the line number and mismatch field values for each dud
record.)
Submission files: Two Python scripts: crop.py & clean.py that generate cropped & cleaned
CSV files. The generated output files must be named crop.csv & clean.csv respectively.
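The cropping and cleaning steps described above could be sketched roughly as follows. This is an illustrative outline only, not a model answer: the column name "Date Time", the ISO 8601 timestamp format with a UTC offset, the semicolon delimiter and the STATIONS lookup are all assumptions, and the real SiteID to Location mapping has to be derived from the dataset itself.

```python
import csv
import io
from datetime import datetime, timezone

CUTOFF = datetime(2010, 1, 1, tzinfo=timezone.utc)
assert int(CUTOFF.timestamp()) == 1262304000  # the epoch value quoted in (a)

# Hypothetical SiteID -> Location lookup, for illustration only.
STATIONS = {"1": "Station A", "2": "Station B"}


def crop(rows):
    """Yield only the rows dated on or after the cutoff."""
    for row in rows:
        # Assumes timezone-aware ISO 8601 timestamps in a "Date Time" column.
        when = datetime.fromisoformat(row["Date Time"])
        if when >= CUTOFF:
            yield row


def clean(rows, first_line=2):
    """Yield valid rows; print the line number and fields of each dud row."""
    # Line 1 of the input is the header, so data rows start at line 2.
    for line_no, row in enumerate(rows, start=first_line):
        site, location = row["SiteID"], row["Location"]
        if not site or STATIONS.get(site) != location:
            print(f"line {line_no}: SiteID={site!r} Location={location!r}")
            continue
        yield row


SAMPLE = """Date Time;SiteID;Location
2009-12-31T23:00:00+00:00;1;Station A
2010-01-01T00:00:00+00:00;1;Station A
2015-06-01T08:00:00+00:00;;Station B
2015-06-01T09:00:00+00:00;2;Station A
2015-06-01T10:00:00+00:00;2;Station B
"""
cropped = list(crop(csv.DictReader(io.StringIO(SAMPLE), delimiter=";")))
cleaned = list(clean(cropped))
print(len(cropped), len(cleaned))  # 4 2
```

Both functions are generators, so they can be chained over the full file and written out with csv.DictWriter one row at a time.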
a. Use MySQL Workbench or any other tool to create an ER model in third normal form to hold
the given data.
b. Use the forward engineer feature of MySQL Workbench to generate the SQL schema and
implement the database (pollution-db).
(If this does not work for you, e.g. because of MySQL Workbench configuration issues, you can use
PHPMyAdmin within XAMPP to create the tables by hand. You can then use the export feature to
extract the SQL, edit it in a text editor and then save the file as pollution.sql.)
Task 3: Write python scripts to populate the database & generate SQL. (20 marks)
a. Design, write, & test a PYTHON script (populate.py) that takes the cleaned CSV file as input and
creates a new database instance (pollution-db2) and populates it.
b. Create a PYTHON script (insert-100.py) that generates a SQL file (insert-100.sql) that holds the
first 100 inserts to the main data table.
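One way to approach part (b) is to turn CSV rows into INSERT statements as plain text, which needs no database connection. The sketch below assumes a hypothetical table named reading whose column names match the CSV headers, and a semicolon delimiter; your own schema will differ.

```python
import csv
import io
from itertools import islice


def sql_literal(value):
    """Render a CSV field as a SQL literal: empty string -> NULL, else quoted."""
    if value == "":
        return "NULL"
    # Only single quotes are escaped here; backslashes would need extra care.
    return "'" + value.replace("'", "''") + "'"


def insert_statements(rows, table, columns, limit=100):
    """Yield at most `limit` INSERT statements, one per input row."""
    column_list = ", ".join(f"`{c}`" for c in columns)
    for row in islice(rows, limit):
        values = ", ".join(sql_literal(row[c]) for c in columns)
        yield f"INSERT INTO `{table}` ({column_list}) VALUES ({values});"


# Made-up sample data; table and column names are hypothetical.
sample = io.StringIO(
    "datetime;nox;site_id\n"
    "2019-01-01 08:00:00;41.5;1\n"
    "2019-01-01 09:00:00;;1\n"
)
rows = csv.DictReader(sample, delimiter=";")
statements = list(insert_statements(rows, "reading", ["datetime", "nox", "site_id"]))
print("\n".join(statements))
```

Writing the joined statements to a file would give the generated SQL script. When inserting through a database driver instead (as in populate.py), parameterised queries are generally safer than building SQL strings by hand.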
Write and implement (test run) the following three SQL queries:
a. Return the date/time, station name and the highest recorded value of nitrogen oxide (NOx) found
in the dataset for the year 2019.
b. Return the mean values of PM2.5 (particulate matter <2.5 micron diameter) & VPM2.5 (volatile
particulate matter <2.5 micron diameter) by each station for the year 2019 for readings taken on
or near 08:00 hours (peak traffic intensity).
c. Extend the previous query to show these values for all stations in the years 2010 to 2019.
Submission file: Code listings of the three SQL queries: query-a.sql, query-b.sql &
query-c.sql
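To show the general shape of a query like (a), the sketch below test-runs one against a tiny in-memory SQLite database from Python. SQLite is used here only as a stand-in so the example is self-contained; the assignment itself targets MySQL. The two-table schema (station, reading), the column names and the sample values are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE station (site_id INTEGER PRIMARY KEY, location TEXT NOT NULL);
CREATE TABLE reading (
    datetime TEXT NOT NULL,
    site_id  INTEGER NOT NULL REFERENCES station(site_id),
    nox      REAL
);
INSERT INTO station VALUES (1, 'Station A'), (2, 'Station B');
INSERT INTO reading VALUES
    ('2018-12-31 23:00:00', 1, 900.0),
    ('2019-03-01 08:00:00', 1, 120.5),
    ('2019-07-15 17:00:00', 2, 310.0),
    ('2020-01-01 00:00:00', 2, 999.0);
""")

# A half-open date range avoids dialect-specific date functions.
QUERY_A = """
SELECT r.datetime, s.location, r.nox
FROM reading AS r
JOIN station AS s ON s.site_id = r.site_id
WHERE r.datetime >= '2019-01-01 00:00:00'
  AND r.datetime <  '2020-01-01 00:00:00'
ORDER BY r.nox DESC
LIMIT 1;
"""
result = conn.execute(QUERY_A).fetchone()
print(result)  # ('2019-07-15 17:00:00', 'Station B', 310.0)
```

The higher readings dated 2018 and 2020 are deliberately included in the sample to confirm that the year filter excludes them.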
Task 5: Model, implement and query a selected NoSQL database. (30 marks)
Model the data for a specific monitor (station) as a NoSQL data model (key-value, XML or graph),
implement it in the selected database type/product, and pipe or import the data.
You can select from any of the seven databases listed below but if you want, you can select one not
currently on the list (after confirmation from the tutor).
Submission file: A report in Markdown format (<1200 words) named nosql.md describing the
data models used & relevant implementation details.
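As one possible starting point for the modelling step, the sketch below reshapes flat reading rows into a nested, JSON-style document per station, the kind of structure a document store could import. The field names (_id, location, readings) and the input column names are assumptions for illustration; other NoSQL families (key-value, XML, graph) would call for a different shape.

```python
import json


def station_documents(rows):
    """Group flat reading rows into one nested document per station."""
    docs = {}
    for row in rows:
        doc = docs.setdefault(row["SiteID"], {
            "_id": row["SiteID"],
            "location": row["Location"],
            "readings": [],
        })
        doc["readings"].append({
            "datetime": row["Date Time"],
            # Missing measurements become null rather than an empty string.
            "nox": float(row["NOx"]) if row["NOx"] else None,
        })
    return list(docs.values())


# Made-up sample rows with hypothetical column names.
rows = [
    {"SiteID": "1", "Location": "Station A", "Date Time": "2019-01-01T08:00:00", "NOx": "41.5"},
    {"SiteID": "1", "Location": "Station A", "Date Time": "2019-01-01T09:00:00", "NOx": ""},
    {"SiteID": "2", "Location": "Station B", "Date Time": "2019-01-01T08:00:00", "NOx": "12.0"},
]
docs = station_documents(rows)
print(json.dumps(docs[0], indent=2))
```

Embedding every hourly reading in a single station document keeps reads simple but can grow large over many years (MongoDB, for example, caps a document at 16 MB), so one document per reading or per day may be worth considering and discussing in the report.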
A short report in Markdown format (<1000 words) reflecting on the assignment, the problems
encountered and the solutions found.
In addition you should discuss and outline some of the Python tools and libraries that could be used to
visualize this data. What maps / charts with which content?
You should also briefly outline the Learning Outcomes you have managed to achieve in undertaking
this Assignment.
** It is a necessary requirement to adhere to the naming convention given for all submission
files. This enables automated testing and substantial marks will be forfeited if this requirement
is not met. **
As a reminder, a final checklist of submission files will be published a week before the hand-in
date (12 May).
Note: report.md & Report.md are not the same file!! (despite what Windows OS may assume!)
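Because file names are compared exactly, it may help to check the ZIP before submitting. The following is a small sketch of such a check; the EXPECTED list contains only names mentioned in this brief and is not the final published checklist, and the helper name missing_files is made up for this example.

```python
import io
import zipfile

# Names mentioned in this brief; NOT the final published checklist.
EXPECTED = [
    "crop.py", "clean.py", "crop.csv", "clean.csv", "pollution.sql",
    "populate.py", "insert-100.py", "insert-100.sql",
    "query-a.sql", "query-b.sql", "query-c.sql", "nosql.md",
]


def missing_files(archive, expected=EXPECTED):
    """Return the expected names absent from the ZIP (case-sensitive match)."""
    with zipfile.ZipFile(archive) as zf:
        # Compare base names so files inside a top-level folder still count.
        present = {name.rsplit("/", 1)[-1] for name in zf.namelist()}
    return [name for name in expected if name not in present]


# Demonstration with an in-memory ZIP holding one wrongly-cased file name.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("crop.py", "")
    zf.writestr("Clean.py", "")  # wrong case on purpose
missing = missing_files(buffer, expected=["crop.py", "clean.py"])
print(missing)  # ['clean.py']
```

To check a real archive, pass its path (for example dmf-assign.zip) instead of the in-memory buffer.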
database is one chosen from the list provided (unless explicitly agreed with the tutor);
an adequate data model is developed and realized (implemented);
sample data is imported into the selected database type/product;
evidence of example query implementation and result output;
a clear and concise report describing the problems, solutions and possible visualizations;
some reflection on the Learning Outcomes achieved.
Tutor support
This coursework is seen as providing a learning experience in the tools & technologies used on this
module. Support will be provided in workshops and via email.
Tutor help can be requested for any aspect of the coursework such as the overall design, Python coding
problems or data structuring. Please ask for assistance after a bit of effort with the problem rather
than getting stuck.
Assessment Offences
This assignment should be your own work. Allowing others to do the work for you, or sharing
significant portions of code with others will be considered an assessment offence and may lead to your
mark being reduced to 0. Part of the marking process will include similarity checks and we may ask
you to explain your code in detail to verify that it is your own. Please refer to the assessment offences
policy document for more information.
References
url: http://fetstudy.uwe.ac.uk/~p-chatterjee/2021-22/modules/dmf-jan/assignment/resit/