ITECH2302 MainAssessment Report
Purpose:
The assignment helps you grasp the fundamental concepts of big data management, the related
techniques, and the practical software and tools required for developing big data
projects.
Requirements: You are required to identify a suitable dataset, provide an analysis of the data,
and recommend suitable Big Data Management strategies. This will be written up as a
professional report.
Details
You will use the analytical tools taught on this course (including Jupyter notebooks, pySpark,
Tableau) to explore, analyse and visualise a dataset of your choosing. An important part of
this work is preparing a good-quality report, written in an appropriate style, which details your
choices, analysis, and recommendations/conclusions.
The dataset should be chosen from the following repository:
Tasks
Data choice. Choose any dataset from the repository that has at least five attributes, and for
which the default task is classification. Transform this dataset into an appropriate one to load
into your chosen analytics software.
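As a minimal sketch of this transformation step, assume the chosen dataset arrives as a comma-separated file with no header row and with "?" marking missing values. That layout is an assumption, as are the attribute names and sample rows below; adapt them to the documentation of your own dataset.

```python
import io

import pandas as pd

# Hypothetical raw file contents: no header row, "?" marks a missing value.
raw_text = """5.1,3.5,1.4,0.2,yes
4.9,?,1.4,0.2,no
6.2,2.9,4.3,1.3,yes
"""

# Hypothetical attribute names; the last column is the class attribute.
columns = ["attr_a", "attr_b", "attr_c", "attr_d", "label"]

# Attach the names and convert "?" into proper missing values (NaN).
df = pd.read_csv(io.StringIO(raw_text), header=None, names=columns, na_values="?")

# Serialise to a CSV with a header row, which most analytics tools load directly.
clean_csv = df.to_csv(index=False)
print(clean_csv)
```

With a real file, replace `io.StringIO(raw_text)` with the file path, and pass a path to `to_csv` to write the cleaned file to disk.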
Background information. Write a description of the dataset and project. Provide an overview
of what the dataset is about, including from where and how it has been gathered, and for what
purpose.
Data description. Describe how many instances the dataset contains, how many
attributes there are, their names, and which one is the class attribute.
Include in your description details of any missing values, and any other relevant
characteristics. Use appropriate pandas functions to initially analyse the data, for instance
descriptive statistics of each attribute, including description of the range of possible values of
the attributes, and visualise these in a graphical format.
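The pandas calls below are one possible starting point for this description. The toy DataFrame and its column names are hypothetical stand-ins for your own dataset.

```python
import pandas as pd

# Hypothetical toy data standing in for the chosen dataset.
df = pd.DataFrame({
    "age": [25, 32, 47, None, 51],
    "income": [30000.0, 42000.0, 58000.0, 39000.0, None],
    "label": ["no", "no", "yes", "no", "yes"],
})

n_instances, n_attributes = df.shape        # number of rows and columns
missing_per_attribute = df.isna().sum()     # missing values in each attribute
numeric_summary = df.describe()             # count, mean, std, min, quartiles, max
class_counts = df["label"].value_counts()   # distribution of the class attribute

print(n_instances, n_attributes)
print(df.dtypes)
print(missing_per_attribute)
print(numeric_summary)
print(class_counts)
```

The `min` and `max` rows of `describe()` give the observed range of each numeric attribute, and `value_counts()` lists the values taken by a nominal attribute.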
Initial analysis. You will need to make decisions about which features to include in your
dataframe, and how to deal with missing values (if they exist). You might need to preprocess the
dataset attributes. Useful techniques include removing certain attributes, exploring different
ways of discretizing continuous attributes, and replacing missing values. Discretizing is the
conversion of numeric attributes into "nominal" ones by binning numeric values into intervals.
If you replaced missing values, explain the strategy you used to select the replacement
values.
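A minimal sketch of these three preprocessing steps follows. The data, the choice of the median as the replacement value, and the bin edges are all illustrative assumptions, not a prescribed method.

```python
import pandas as pd

# Hypothetical toy data: one missing age and an identifier column.
df = pd.DataFrame({
    "age": [22.0, 35.0, None, 58.0, 41.0, 67.0],
    "record_id": [1, 2, 3, 4, 5, 6],
    "label": ["no", "no", "yes", "yes", "no", "yes"],
})

# Remove an attribute that carries no information about the class.
df = df.drop(columns=["record_id"])

# Replace missing values with the median of the observed values.
age_median = df["age"].median()
df["age"] = df["age"].fillna(age_median)

# Discretize the numeric attribute into nominal intervals:
# (0, 30], (30, 50], (50, 120]. The edges are an illustrative choice.
df["age_group"] = pd.cut(
    df["age"],
    bins=[0, 30, 50, 120],
    labels=["young", "middle", "older"],
)

print(df)
```

Other replacement strategies (the mean, the most frequent value, or dropping incomplete rows) may suit your data better; whichever you choose, state it and justify it in the report.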
GroupBy analysis. Implement various aggregate functions that will provide interesting
insights into the data. Use the GroupBy function in pandas to analyse the data.
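For instance, grouping by the class attribute and applying several aggregate functions at once might look like the sketch below. The column names and values are hypothetical.

```python
import pandas as pd

# Hypothetical toy data standing in for the chosen dataset.
df = pd.DataFrame({
    "label": ["yes", "no", "yes", "no", "yes"],
    "age": [30, 45, 50, 25, 40],
    "income": [40000, 52000, 61000, 28000, 45000],
})

# One row per class value, one column per named aggregate.
summary = df.groupby("label").agg(
    count=("age", "size"),
    mean_age=("age", "mean"),
    max_income=("income", "max"),
)

print(summary)
```

Grouping by a discretized attribute, or by two attributes together (`df.groupby(["a", "b"])`), often reveals patterns that a single overall statistic hides.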
Data visualisation. Choose any data visualisation techniques that will provide helpful insights
into the data. This could include plotting chosen variables against each other and displaying
them in a line chart, or binning them and using a (stacked) histogram, etc. Use whichever you
prefer from matplotlib (matplotlib.pyplot.hist), pandas (pandas.DataFrame.plot), seaborn
(seaborn.histplot) and/or Tableau.
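As one possible illustration, the sketch below draws a stacked histogram of a numeric attribute split by class using matplotlib. The data, bin edges, and labels are hypothetical, and the image is written to an in-memory buffer only so that the example runs without a display; in a Jupyter notebook the figure would normally be shown inline.

```python
import io

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical toy data standing in for the chosen dataset.
df = pd.DataFrame({
    "age": [22, 35, 38, 29, 45, 52, 58, 41],
    "label": ["no", "no", "no", "no", "yes", "yes", "yes", "yes"],
})

classes = ["no", "yes"]
# One array of ages per class value.
groups = [df.loc[df["label"] == c, "age"].to_numpy() for c in classes]

fig, ax = plt.subplots()
counts, bin_edges, _ = ax.hist(
    groups, bins=[20, 30, 40, 50, 60], stacked=True, label=classes
)
ax.set_xlabel("age")
ax.set_ylabel("number of instances")
ax.set_title("Age distribution by class")
ax.legend(title="label")

# Render to PNG bytes; use fig.savefig("figure.png") to write a file instead.
buffer = io.BytesIO()
fig.savefig(buffer, format="png")
plt.close(fig)
```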
Data mining. Compare and contrast at least two different data mining algorithms on your
data, for instance: SVM, neural networks, k-nearest neighbour, Apriori association rules,
decision tree induction etc. For each experiment you run, describe the data you used for the
experiments, that is, did you use the entire dataset or just a subset of it. You must include
screenshots and results from the techniques you employ.
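A minimal sketch of such a comparison is shown below using scikit-learn, which is an assumption here since the task does not name a library. The data is synthetic, generated only so that the example is self-contained, and the hyperparameters are arbitrary illustrative values.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the chosen dataset: 300 instances, 5 attributes.
X, y = make_classification(
    n_samples=300, n_features=5, n_informative=3, n_redundant=0, random_state=0
)

# Hold out 30% of the instances so both algorithms are scored on unseen data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

models = {
    "decision tree": DecisionTreeClassifier(max_depth=4, random_state=0),
    "k-nearest neighbour": KNeighborsClassifier(n_neighbors=5),
}

# Train each algorithm on the same split and record its test accuracy.
results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    results[name] = accuracy_score(y_test, model.predict(X_test))
    print(f"{name}: accuracy = {results[name]:.3f}")
```

Using the same train/test split for every algorithm keeps the comparison fair. Accuracy is only one measure; a confusion matrix or cross-validation may give a fuller picture for your report.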
Discussion of findings. Explain your results and include the usefulness of the approaches
for the purpose of the analysis. Include any assumptions that you may have made about the
analysis. In this discussion you should explain what each algorithm provides to the overall
analysis task. Summarize your main findings.
Big Data Management. The data you have used will have been very small in comparison with
what might be considered “big data” in this course. In this section you are to draw conclusions
about how the acquisition, storage, and subsequent analysis of the data would be different if
this were truly a “big data” dataset. You are to make reference to the concepts learned about
the “V’s” of big data (velocity, volume, etc.), data warehouses, OLAP, business intelligence,
Hadoop/Spark and so on. Explain how this dataset might have links to data that could be
considered too difficult or too complex to handle in a traditional SQL database or with
traditional statistical analysis, and would therefore require Big Data storage and Big Data
Analytics.
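To make the volume point concrete, the sketch below aggregates a file in fixed-size chunks instead of loading it all into memory. It uses plain pandas on a tiny in-memory file with hypothetical column names, so it is only an analogy: Hadoop and Spark apply the same compute-partial-results-then-combine pattern, but distribute the chunks across many machines.

```python
import io

import pandas as pd

# Hypothetical CSV standing in for a file too large to load at once.
csv_text = "label,value\n" + "\n".join(
    f"{'yes' if i % 3 == 0 else 'no'},{i}" for i in range(10)
)

totals = {}
# Read four rows at a time; only one chunk is held in memory at any moment.
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=4):
    # Partial aggregate for this chunk (the "map" side of the pattern).
    partial = chunk.groupby("label")["value"].sum()
    # Merge the partial result into the running totals (the "reduce" side).
    for label, value in partial.items():
        totals[label] = totals.get(label, 0) + int(value)

print(totals)
```

This works because a sum can be combined from partial sums. Statistics that cannot be combined this simply, such as an exact median, are one reason big data analytics often needs different algorithms from traditional statistical analysis.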
Report writing. Present your work in the form of a big data management report.
Submission
The assignment is to be submitted via the Assignment submission box in Moodle. This can be found in
the Assessments section of the course Moodle shell. Submit your report file as either an MS
Word file or a PDF. If you are using macOS, please submit as a PDF.
Your report will include the following in the order provided below:
Separately you are to upload your analytics files (e.g. Jupyter notebooks [ipynb], python files
[py] etc).
Your references should use the APA referencing style; information is available here:
https://federation.edu.au/library/student-resources/help-with-referencing
https://federation.edu.au/library/student-resources/fedcite
Identify all sources of information used. You are reminded to read the “Plagiarism” section of
the course description.
Feedback and marks will be provided in Moodle. Marks will also be available in FDL Marks.
Plagiarism
Plagiarism is the presentation of the expressed thought or work of another person as though it is one's own without
properly acknowledging that person. You must not allow other students to copy your work and must take care to
safeguard against this happening. More information about the plagiarism policy and procedure for the university can
be found at http://federation.edu.au/students/learning-and-study/online-help-with/plagiarism
Please refer to the Course Description for information regarding late assignments, extensions, and special
consideration. A reminder: all academic regulations can be accessed via the university’s website, see:
http://federation.edu.au/staff/governance/legal/feduni-legislation
Marking Criteria/Rubric
Tasks                                           Marks          Awarded    Comments
1 - Data choice                                 5
    i. Data correctly transformed into a format that can be loaded into analytics software.
3 - Data description                            5 + 10 = 15
    i. General details of dataset
    ii. Detailed description of five attributes
5 - GroupBy analysis                            10
    • Use of pandas to analyse the data
6 - Data visualisation                          10
    • Use of visualisation techniques to investigate the data
5 - Data mining                                 2 x 5 = 10
    • Two different data mining algorithms used
    • Description of techniques with screenshots and discussion of results
7 - Presentation of report                      5
    • Report is well-written and presented professionally, containing all required sections.