
Data Mining and Machine Learning COMP3027J

ASSIGNMENT 2
Weight: 60%
• Code + Report + Interview: 45%
• Leaderboard Result: 15%

Introduction:
Active learning is a semi-supervised machine learning approach that aims to reduce the amount of
labelled data needed to train a model. Unlike traditional methods that passively learn from a given
dataset, active learning algorithms selectively query an oracle to label new data points. The process
is iterative, with the model querying labels for the most informative samples. The goal is to achieve
high accuracy while minimizing labelling effort and cost; deciding how to sample the data that
carries the most information is a core problem in data mining and machine learning. The assignment
is to conduct a multi-class classification task using Active Learning on a natural scenes dataset.
(The classification task itself is simple; it imitates a scenario in which labels are hard to obtain.)

The natural scenes dataset in this assignment contains 17,034 images in total, divided into a test
set of 3,000 images and a remainder of 14,034. You need to split the remainder into a training set
and a validation set yourselves. The training set is regarded as an unlabelled data pool, from which
you extract the initial labelled data for training the initial model and from which you sample further
data. In this assignment, however, you are constrained to 300 initial labelled images, and the total
amount of sampled data allowed is 900. (Please read the code for the three interfaces
ActiveLearningDataset, Net, and Strategy carefully.) The test set is divided into a public leaderboard
set and a private final set. The public leaderboard is open for submission, and each team can submit
results 5 times per day. The final score is based on the private test set, so the final ranking may
differ from the public one.

Before diving into the assignment, it is crucial to have a basic understanding of active learning and
its commonly used sampling methods. Conducting a brief literature review will provide a
foundation for your project.
Here is a literature survey highly recommended for you to read:
● Settles, B. (2009). Active learning literature survey.
And here is a GitHub repository that will be very helpful to you:
● https://github.com/ej0cl6/deep-active-learning

Further clarification:
You can decide how to sample, how many rounds of sampling to conduct, and the amount of data
to sample in each round. However, the maximum amount of data you may draw from the unlabelled
data pool of 14,034 images is 1,200 (300 for training the initial model and 900 for sampling), which
means 12,834 images will not be used. Please note that a visualization of the training and sampling
process must be shown in the report, as in the figure below:


The following situations will be judged as plagiarism:

• Using open-source models (including pre-trained models or model structures) in the pre-processing
or training process.
• Tampering with training results.
• Submitting a prediction result that is not reproducible.

Submission:
1. Prediction Result
Kaggle offers a no-setup, customizable Jupyter Notebooks environment with access to free GPUs
and a huge repository of community-published data and code. In this assignment, we will use a
Kaggle in-class competition to record your prediction results. You need to upload your result (.csv
file) to the Kaggle competition (please refer to "sample_submission.csv" for the result format). The
final prediction ranking will be displayed on the Kaggle competition leaderboard.
- Competition Name: COMP3027J Assignment 2 - BDIC2024
- URL: https://www.kaggle.com/t/c137f9cd43734b38bc4cb4ed6b95b1ce
- Group name: Group_XX (e.g. Group_01)
- The leaderboard is calculated with approximately 40% of the test data. The final results will be
based on the other 60%, so the final standings may be different.
The evaluation metric of the leaderboard is prediction accuracy, but other metrics like Macro-F are
also valid for comparison in the assignment report. The submission file should follow the format of
"sample_submission.csv", with an image id and a category prediction.
- Public Test Period:
To avoid using the wrong format in the final submission, we provide a public leaderboard test with
a format similar to the final test set. Prediction results can be uploaded to Kaggle for evaluation
and ranking, to check whether the model predicts correctly and whether the format of the result
file is correct.
The public test phase will start on April 8th (Week 7, Monday) and end on May 13th (Week 12,
Monday). Training data can be obtained from the Kaggle Dataset. After this phase is over, we will
clear the leaderboard and no longer accept submissions.
- Deadline: Monday, May 13th, 23:55 - Beijing Time (Submit on Kaggle)
Note: To ensure a timely review of submissions, we do not accept any overdue submissions.

2. Code + Report
The code should be well-organised and executable. Here is an example of code submission:
https://www.kaggle.com/code/jiechen00/code-example
Your pdf report should clearly detail how you carried out the experiment to address this challenge.
1. Your report should be written in Overleaf or Word, using the provided template:
https://www.overleaf.com/latex/templates/acm-journals-primary-article-template/cpkjqttwbshg
2. It should be a human-readable document (e.g. do not include code)
3. The final report is expected to be 4-8 pages including references.
4. You should provide student numbers instead of institutions in the provided
template.
5. Use clear headings for each section.
6. Include tables and figures where appropriate: give them captions, describe your figures, and
analyse the results in your tables in the text.
7. The final report filename should be “Comp3027J_GroupX.pdf” (e.g. Comp3027J_Group01).
8. The pdf file of your report must be submitted as a separate file, i.e. it cannot be compressed
into one file with your code or data, for the purpose of originality checking.
In your report, you are recommended to discuss the following essential topics, though you are not
limited to them:
1. Classification Algorithm Review: Research various machine learning algorithms and deep
learning techniques that have been successfully applied to image classification tasks.
2. Pre-processing Techniques: Explore the importance of pre-processing techniques in image
classification.
3. Methodology
a. Any machine learning algorithm can be used (not limited to the algorithm we have
learned).
b. Creativity is encouraged, especially trying to design a new sampling method.
c. Be careful: a sophisticated approach with little description and explanation will
receive little credit.
4. Evaluation Metrics: Familiarize yourself with different evaluation metrics used in image
classification, such as accuracy, precision, recall, F1-score, and Macro-F score. Understand
their significance and how they can be used to assess the performance of your classification
models:
a. Compare your solution with benchmarks in literature.
b. Evaluation metrics for your task. We use accuracy for the leaderboard test, but we
also encourage the usage of other metrics like F1-score and Macro-F.
c. Analysing your results etc.
d. Which model finally is chosen to generate predictions submitted to Kaggle.
5. Visual Analytics: Explore what can be visualized in active learning, including the model
training process and the sampling process, and the significance of these visualizations.
Here are some suggestions of what to visualize (not limited to the list below):
a. Data visualization
b. Visualization of decision boundary
c. Visualization of loss landscape
6. More Thinking: Investigate uncertainty sampling and diversity sampling in Active Learning.
Discuss the ideas behind these two approaches, as well as the advantages and
disadvantages of each.
- Deadline: Tuesday, May 14th, 23:55 - Beijing Time (Submit on BrightSpace)

3. Interview: Time to be notified


The interview portion of the assignment is designed to assess your understanding of the
methodologies implemented in both Assignment 1 and Assignment 2, as well as your
ability to communicate your ideas clearly and concisely. During the interview, you may be
asked questions about various aspects of your projects, including:
1. Problem Formulation: Explain the problem statement and the objective of your
projects in Assignment 1 and Assignment 2.
2. Literature Review: Discuss the approaches you reviewed, their strengths and weaknesses,
and their applicability to your projects.
3. Data Pre-processing: Discuss any challenges encountered during the pre-processing phase
and how you addressed them.
4. Model Selection and Implementation: Discuss the reasons for selecting these
models and any considerations made during the implementation process.
5. Model Evaluation and Validation: Explain your validation strategy and discuss the
results obtained. Share any insights or conclusions you drew from these results.
6. Design of Sampling Methods (if you proposed one): Discuss your thoughts on designing a
sampling method and what additional information is carried by the data that can be sampled.
7. Findings of Visualizations (if you tried): Discuss what you found from your visualization
results and the significance of these findings.
To prepare for the interview, make sure to understand the rationale behind your decisions
and be prepared to justify them during the interview.

• Grading
Leaderboard: 15%
Literature: 5%
Methodology: 15%
Evaluation: 5%
Code + Reproducibility: 5%
Interview: 15%
