158736 - Advanced Machine Learning
Assignment 2
Introduction
In this assignment, you will develop a simple text classification system using LLMs. You will
be using the dataset given in this GitHub repository. The corresponding research paper can
be found here (note that you need to access the paper through the Massey network; the
paper has also been added to Stream for your convenience).
Read the paper to get an understanding of the dataset. You do not have to understand the
full context of the paper, as we are not going to replicate their method. However, you should
notice that this is a fine example of a text classification system based on traditional machine
learning techniques, where greater emphasis is placed on input feature engineering and on
producing a pipelined system.
The dataset contains 5000 questions that have been asked by (potential) travellers. Each
question has been classified into a coarse-grain class and a fine-grain class. For this
assignment, we will only consider the coarse-grain classes as follows:
TTD    things to do
TGU    travel guide
ACM    accommodation
TRS    transport
WTH    weather
FOD    food
ENT    entertainment
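For convenience, the class codes listed above can be kept in a single lookup table in the notebook. The following is a minimal sketch only; the names `COARSE_LABELS` and `describe_label` are illustrative and are not part of the dataset or the paper.

```python
# Coarse class codes and descriptions, copied from the list above.
# The dictionary and function names are hypothetical helpers.
COARSE_LABELS = {
    "TTD": "things to do",
    "TGU": "travel guide",
    "ACM": "accommodation",
    "TRS": "transport",
    "WTH": "weather",
    "FOD": "food",
    "ENT": "entertainment",
}


def describe_label(code: str) -> str:
    """Return the description for a coarse class code (case-insensitive)."""
    return COARSE_LABELS[code.strip().upper()]


print(describe_label("TRS"))
```

A lookup like this may be handy later when building prompts or checking that a model's output is one of the seven valid codes.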
General Instructions
As in Assignment 1, your code should run in a Colab notebook. Make sure anyone can
access your notebook, and add clear instructions and comments to the code.
In order to answer the questions, you need to refer to external sources such as research
papers. All sources you refer to must be cited in the text, and a corresponding bibliography
should be added at the bottom of the document. You may use a bibliography style of your
choice (e.g. APA, IEEE), but make sure your referencing is consistent throughout the
document.
Data Preparation
Download the data CSV file and remove the fine-grain column. Use 4000 samples for training,
700 for testing and 300 for validation.
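One possible way to drop the column and produce the 4000/700/300 split is sketched below using only the Python standard library. The column names (`question`, `coarse`, `fine`), the fixed seed and the synthetic demo rows are assumptions made for illustration; check the actual header of the downloaded file before reusing any of this.

```python
import random


def drop_column(rows, column):
    """Return copies of the row dicts without the given column."""
    return [{k: v for k, v in row.items() if k != column} for row in rows]


def split_rows(rows, n_train=4000, n_test=700, n_val=300, seed=42):
    """Shuffle once with a fixed seed, then slice into train/test/validation."""
    total = n_train + n_test + n_val
    if len(rows) < total:
        raise ValueError(f"need at least {total} rows, got {len(rows)}")
    shuffled = list(rows)  # copy so the caller's list is not reordered
    random.Random(seed).shuffle(shuffled)
    train = shuffled[:n_train]
    test = shuffled[n_train:n_train + n_test]
    val = shuffled[n_train + n_test:total]
    return train, test, val


# Demo on synthetic rows; the column names here are hypothetical.
rows = [{"question": f"question {i}", "coarse": "TRS", "fine": "x"} for i in range(5000)]
rows = drop_column(rows, "fine")
train, test, val = split_rows(rows)
print(len(train), len(test), len(val))
```

Shuffling before slicing avoids inheriting any ordering present in the original file, and the fixed seed keeps the split reproducible between runs. The same steps could equally be written with pandas.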
Convert the dataset into the instruction format. Upload this dataset to Google Drive as CSV
files. Make sure you use the following folder structure:
/content/gdrive/MyDrive/Massey-158736/assignment-2/
It is extremely important that you have this structure, so that we can easily run the code.
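The conversion and saving steps might look roughly like the sketch below. The brief does not define the exact instruction format, so the `instruction`/`input`/`output` columns, the prompt wording, the input column names and the file name `train.csv` are all assumptions; only the Drive folder path comes from the instructions above. The demo writes to a temporary directory so that it runs outside Colab.

```python
import csv
import os
import tempfile

# Folder required by the assignment (available once Drive is mounted in Colab).
DRIVE_DIR = "/content/gdrive/MyDrive/Massey-158736/assignment-2/"

# Hypothetical prompt text; adapt it to the instruction format you choose.
INSTRUCTION = (
    "Classify the travel question into one of the following classes: "
    "TTD, TGU, ACM, TRS, WTH, FOD, ENT."
)
FIELDS = ["instruction", "input", "output"]


def to_instruction_rows(rows, question_col="question", label_col="coarse"):
    """Map (question, label) rows to instruction/input/output records."""
    return [
        {"instruction": INSTRUCTION, "input": r[question_col], "output": r[label_col]}
        for r in rows
    ]


def write_csv(rows, out_dir, filename):
    """Write instruction records to out_dir/filename and return the path."""
    os.makedirs(out_dir, exist_ok=True)
    path = os.path.join(out_dir, filename)
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(rows)
    return path


# Demo with one made-up row, written to a temporary directory.
demo_rows = [{"question": "Is there a bus from the airport to the city?", "coarse": "TRS"}]
demo_dir = tempfile.mkdtemp()
demo_path = write_csv(to_instruction_rows(demo_rows), demo_dir, "train.csv")
print(demo_path)
```

In Colab, after mounting Drive at `/content/gdrive`, passing `DRIVE_DIR` as `out_dir` would place the files in the required folder.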
Once the data has been uploaded, your directory should look similar to the image below: