Assignments and weighting

Join the discussions and do peer reviews 5%

You are expected to ask questions and join discussions in the lectures and student presentations.

Download this file for the peer review: 2024-peer-review.xlsx

During the student presentations, you need to give each speaker a score out of 10 (1 for extremely poor and 10 for perfect). Submit an updated version of this file at the end of each session. Your evaluation of the presentations should be based on two main criteria: whether the content is informative and whether the communication is clear. Your peer review scores will be used to adjust my marking scores.

Give a presentation in one of the lectures 15%

NOTE: The topics are not limited to the ones listed below. You are encouraged to choose new topics or recent research areas, for example, Sora, Gemini, ChatGPT/chatbots, large language models, text generation, automatic essay scoring, speech recognition, text understanding, caption generation, or tools such as spaCy (used in industry), StanfordNLP (research), Beautiful Soup, etc.

If you choose a topic that is not covered in lectures, you will need to give an introduction, history, state-of-the-art technology, leading researchers or typical systems, future research directions, etc. If you choose a topic listed below that will be covered in lectures, you will need to find more recent research, e.g. a recent paper or a new system/technology.

The main goal is to learn communication skills and to practise giving professional presentations. You must sign up for a presentation time from the following:

  • week 2, Thursday, topics related to text classification or clustering, such as new algorithms, deep learning models (CNN, RNN) or their applications

  • week 4, Thursday, topics related to text representation such as word2vec, word embedding, word rank, new measures for word similarity
  • week 6, Thursday, topics related to opinion mining, information extraction

  • week 8, Thursday, topics related to recommender systems, such as the system used by TikTok, Netflix, Amazon, YouTube, etc.

  • week 10, Thursday, topics related to information retrieval, query expansion, personalised search, such as new search engines, new web services.

  • week 12, Thursday, other topics including machine translation, natural language processing

A presentation is about 6 minutes plus 2 minutes question time. Your main task is to look for a good topic which is related to our course and introduce the most recent research in this area. It can be based on a paper, a system, a project, or web documents etc.

You are required to submit your PPT presentation slides before your presentation using our online submission system. Make sure you can use them in the lecture room.

Complete a Project

Baseline code due Friday 22 March, 5pm; full code due Friday 19 April, 5pm; report due Friday 3 May, 5pm.

The goal of the project is to learn the main text mining techniques and practice research skills including programming and academic writing.

The project requires you to build a baseline text classification system, apply it to a new dataset, make modifications and compare the modified versions with the baseline system. You can use any tools or packages that are available. Python libraries are preferred.

You are expected to spend about 30 hours in total on this project. Please manage your time carefully.

Step 1: Build a baseline text classification system

  • Choose a dataset.
  • Choose a baseline classification algorithm.
  • Test the algorithm on the dataset.
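
The three items above can be sketched end to end on a toy problem. This is only an illustration under stated assumptions: it swaps in a small multinomial Naive Bayes classifier written in plain Python (not one of the CNN or LSTM models suggested for the project), and the toy sentences and the names `tokenize`, `train_nb`, `predict_nb` and `accuracy` are made up for this sketch.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it on whitespace."""
    return text.lower().split()


def train_nb(docs, labels):
    """Count documents per class and word occurrences per class."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in zip(docs, labels):
        tokens = tokenize(text)
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab


def predict_nb(model, text):
    """Return the class with the highest log posterior (add-one smoothing)."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_counts.items():
        score = math.log(n_docs / total_docs)
        total_words = sum(word_counts[label].values())
        for token in tokenize(text):
            if token not in vocab:
                continue  # ignore words never seen in training
            count = word_counts[label][token]
            score += math.log((count + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def accuracy(model, docs, labels):
    """Fraction of documents whose predicted label matches the true label."""
    correct = sum(predict_nb(model, d) == y for d, y in zip(docs, labels))
    return correct / len(labels)


# Choose a dataset (a made-up toy corpus stands in for a real one).
train_docs = [
    "great film with a wonderful cast",
    "wonderful story and great acting",
    "boring plot and terrible acting",
    "terrible film with a boring script",
]
train_labels = ["pos", "pos", "neg", "neg"]
test_docs = ["a great and wonderful script", "boring and terrible cast"]
test_labels = ["pos", "neg"]

# Choose a baseline classification algorithm and train it.
model = train_nb(train_docs, train_labels)

# Test the algorithm on held-out documents.
test_accuracy = accuracy(model, test_docs, test_labels)
print(f"test accuracy: {test_accuracy:.2f}")
```

For the project itself, the same train/test structure applies, with a real dataset and a stronger model in place of the toy corpus and the Naive Bayes stand-in.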

Many datasets can be found online, e.g. the UCI datasets. Some datasets are built into Keras, for example imdb (25,000 movie reviews, 2 classes) and reuters (11,228 newswires, 46 classes). Another popular dataset, 20 Newsgroups, can be loaded from sklearn. If you use a built-in dataset, read its documentation to learn the details; you do not need to download the original data and may use the built-in pre-processed version directly.

Some state-of-the-art classification algorithms are CNN and LSTM (RNN). You may use Keras and TensorFlow, and the easiest way to start is to follow one of the online tutorials, e.g.

CNN tutorial using movie reviews
A short CNN tutorial with a video
A longer tutorial on CNN and word embedding
LSTM (RNN) tutorial

Step 2: Apply the baseline system to a new dataset

The recommended dataset, AG News, is available for download. If you have your own dataset or another dataset you want to work on, please email me.

You will need to clean the data and convert it into the right format. Then test your baseline classifier on the new dataset.
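
As a rough sketch of the cleaning and conversion step, the code below assumes the CSV layout of the commonly distributed AG News files: no header row, and three columns holding a class index (1 to 4), a title and a description. Check this against the copy you download. The function names `clean_text` and `load_ag_news` and the two sample rows are made up for illustration.

```python
import csv
import html
import io
import re


def clean_text(text):
    """Unescape HTML entities, drop tags, and normalise whitespace."""
    text = html.unescape(text)
    text = re.sub(r"<[^>]+>", " ", text)  # remove leftover HTML tags
    text = text.replace("\\", " ")        # stray backslashes in the raw text
    text = re.sub(r"\s+", " ", text)      # collapse runs of whitespace
    return text.strip().lower()


def load_ag_news(csv_file):
    """Return (texts, labels) with zero-based integer labels.

    Assumes rows of the form: class index, title, description.
    """
    texts, labels = [], []
    for row in csv.reader(csv_file):
        if len(row) != 3:
            continue  # skip malformed rows
        class_index, title, description = row
        texts.append(clean_text(f"{title} {description}"))
        labels.append(int(class_index) - 1)
    return texts, labels


# Two made-up rows standing in for the real file.
sample = io.StringIO(
    '"3","Oil prices rise","Crude futures climbed &amp; stocks fell."\n'
    '"2","Cup final <b>tonight</b>","Fans   gather for the match."\n'
)
texts, labels = load_ag_news(sample)
print(texts)
print(labels)
```

With a real file, an open file handle such as `open(path, newline="", encoding="utf-8")` would be passed in place of the `io.StringIO` sample. The resulting lists then need to be converted to whatever input format your baseline expects.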

Step 3: Modify the baseline system and try to improve the accuracy. This should be done on the new dataset.

Here are some ideas:
  • Apply existing text representations or develop new ones, for example: 1) use pretrained word embeddings such as GloVe; 2) use BERT (token embeddings, the CLS embedding, or the embedding from some output layer); 3) use pretrained large language models.
  • Apply another existing classification algorithm or combine some of the existing algorithms. If you have made significant changes to an existing algorithm, then you can claim that you have developed a new classification algorithm.
  • Develop ways to tune or learn the parameters. Deep learning models contain many parameters, and tuning them manually is time-consuming and error-prone. Investigate ways to reduce the training or evaluation time.
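
One simple way to make tuning less manual is an exhaustive grid search. The sketch below is hypothetical: `train_and_evaluate` is a stand-in that returns a made-up validation score so that the example runs instantly, and the parameter names and values are illustrative only. In a real project the function would train your model with the given settings and return its accuracy on a validation split.

```python
import itertools


def train_and_evaluate(learning_rate, dropout, embedding_dim):
    """Stand-in for real training: returns a made-up validation score.

    Replace the body with code that trains your model using these
    settings and returns its accuracy on a validation split.
    """
    return (
        0.9
        - abs(learning_rate - 0.001) * 50
        - abs(dropout - 0.3) * 0.2
        - abs(embedding_dim - 100) * 0.0005
    )


def grid_search(param_grid):
    """Try every combination in param_grid and keep the best one."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in itertools.product(*(param_grid[n] for n in names)):
        params = dict(zip(names, values))
        score = train_and_evaluate(**params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


param_grid = {
    "learning_rate": [0.01, 0.001, 0.0001],
    "dropout": [0.1, 0.3, 0.5],
    "embedding_dim": [50, 100, 300],
}
best_params, best_score = grid_search(param_grid)
print(best_params, round(best_score, 3))
```

The number of runs grows multiplicatively with each parameter (27 runs for this grid), so sampling a fixed number of random combinations, or tuning on a smaller subset of the training data, are common ways to cut the total time.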

You are required to submit the following using our online submission system:

  • Project baseline code, due in week 4, 22 March, Friday 5pm. 5%. It should include a README file to explain how to run your program on lab computers, and briefly explain what you have done.
  • Project full code, due in week 6, 19 April, Friday 5pm. 10%. It should include a README file to explain how to run your program on lab computers, and briefly explain what you have done.

Please note that all code is primarily marked in person, based on your project demo. You will need to run your program, show your results and briefly explain your system in the demo. If your program takes too long to run, you will need to take screenshots of your results or save your intermediate results to files.

  • Project report, due in week 8, 3 May, Friday 5pm. 15%.

You are required to submit two reports: the first written by you, and the second written by ChatGPT or another AI tool. The marking is primarily based on the first version of your report, which should include the following:

  • List 1: a brief description of your baseline system.
  • List 2: how you applied it to the new dataset (e.g. any data cleaning and conversion steps, and the format of the input files).
  • Table 1: the comparison results between the two datasets.
  • List 3: the details of what you did in step 3, including a description and justification of each modification you made to the representation or algorithm. You may also include a list of things you tried that did not work well, and explain why.
  • Table 2: the comparison results between the baseline and your modified version of the algorithm on the new dataset.
  • A paragraph comparing your two versions of the report (1 page without AI and 2 pages with AI), with your own evaluation of the version written by AI.
Readers should be able to understand what you have done without looking at your program code.
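
For the comparison between the baseline and the modified system, a small script can turn saved predictions into table rows. The sketch below is an illustration only: the labels and predictions are made up, and macro-averaged F1 is shown as one possible measure to report alongside accuracy, not a requirement of the assignment.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Average the per-class F1 scores, weighting every class equally."""
    scores = []
    for c in sorted(set(y_true)):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Made-up true labels and predictions from two hypothetical systems.
y_true = [0, 0, 1, 1, 2, 2]
predictions = {
    "baseline": [0, 1, 1, 1, 2, 0],
    "modified": [0, 0, 1, 1, 2, 0],
}

rows = [
    (name, accuracy(y_true, y_pred), macro_f1(y_true, y_pred))
    for name, y_pred in predictions.items()
]

# Print a plain-text table with one row per system.
print(f"{'System':<10}{'Accuracy':>10}{'Macro-F1':>10}")
for name, acc, f1 in rows:
    print(f"{name:<10}{acc:>10.3f}{f1:>10.3f}")
```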

The page limits are one page for version 1 and two pages for version 2. Do not use small fonts.

The project report is worth 15% of the final grade. Marking will be based on innovation, technical quality/difficulty and writing. Here are some guidelines:
  • A range: all three steps are completed, preferably a new representation or a new classification algorithm is developed, and the performance is improved.
  • B range: all three steps are attempted, and comparison results are obtained.
  • C range: the baseline is completed and either step 2 or step 3 is attempted.

Write a paper review 15%

Due in week 10, 17 May, Friday 5pm.

Reading papers and writing short summaries is one of the basic learning skills. The goal of this assignment is to learn new technologies from reading papers and practice critical thinking and writing for future research.

You may review any paper of your own choice, but make sure that its content is closely related to the course. This paper should be different from the paper you used for your presentation. It is preferred that the paper is related to your project. You are required to submit the paper review together with a copy of the paper using our online submission system.

The paper review should include

  • A brief justification of why you chose the paper. You may comment on its content, source, citations, authors, novelty, technical quality, presentation, etc. The review should also include a summary with at least two parts:
  • A summary of the main contributions of the paper and its main technologies.
  • A discussion of why it is relevant to you. For example, you may point out its limitations, possible ways to address them, and any suggestions you have for improving the paper. Alternatively, you may discuss how this paper can be applied in some applications or future research, or how it can be used to improve the system in your project, etc.

You are required to submit two versions, side by side on one page, each with a word limit of 250:

The first version should be produced using ChatGPT. At the end, briefly describe the questions/queries you used, e.g. whether you supplied the title or the full text. The second version should be written by you. Please include a final paragraph comparing your version with the first version.

Test 35%

This will be scheduled in the assessment period. The questions will be similar to those in previous years' exams. This year the test will be closed book, in person, on paper and under exam conditions.

Late submissions

Each student has 3 "late days", which may be used for any submissions during the course. No penalty is applied for these late days. You do not need to apply for them: any late days you have left will be applied automatically to assignments that you submit late. Please note that late days cannot be used for your presentation or project demo, but if you have exceptional circumstances, email the lecturer to reschedule.

The late days are intended to cover minor illnesses or other personal reasons for being late. You should ask for extensions in the case of more significant or longer lasting problems (and you may need documentation).

Late submissions will be penalised 20% of the full assignment mark per day. This applies only if you have run out of late days and do not have a prior arrangement with the lecturer.
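
The arithmetic can be sketched as follows. This is an illustration, not the official calculation: the function name `late_penalty` is made up, and it assumes that lateness is counted in whole days and that the penalty is capped at the full assignment mark.

```python
def late_penalty(days_late, late_days_left, penalty_per_day=0.20):
    """Return (penalty as a fraction of the full mark, late days left).

    Free late days are used up first; each remaining day costs
    penalty_per_day of the full assignment mark.
    """
    used = min(days_late, late_days_left)
    penalised_days = days_late - used
    # Capping the penalty at 100% is an assumption of this sketch.
    penalty = min(1.0, penalised_days * penalty_per_day)
    return penalty, late_days_left - used


# A student with 3 late days hands in one assignment 2 days late,
# then a later assignment 3 days late.
penalty_1, left = late_penalty(days_late=2, late_days_left=3)
penalty_2, left = late_penalty(days_late=3, late_days_left=left)
print(penalty_1, penalty_2, left)
```

In this example the first submission uses two of the free late days and is not penalised. The second uses the last free day, and the remaining two days are penalised at 20% each, which is 40% of that assignment's full mark.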

-- xgao - 19 Feb 2021