Assignments and weighting

Join the discussions and do peer reviews 3%

You are expected to ask questions and join discussions in the lectures, student presentations and the project demo session.

During the student presentations and the project demo session, peer review sheets will be handed out and you need to give each speaker a score out of 10 (1 for extremely poor and 10 for perfect). The peer review is anonymous, and the sheets will be collected at the end of each session. The evaluation of the presentations should be based on two main criteria: whether the content is informative and whether the communication is clear. The evaluation of the project demo should be based on your understanding of the novelty, workload, and difficulty of the project.

Give a presentation in one of the lectures 7%

The main goal is to learn communication skills and practise giving professional presentations. You must sign up for a presentation time from the following:

  • week 2, Thursday, 14 March, topics related to text classification or clustering, such as new algorithms, deep learning models (CNN, RNN) or their applications

  • week 4, Thursday, 28 March, topics related to text representation such as word2vec, word embedding, word rank, new measures for word similarity
  • week 6, Thursday, 11 April, topics related to opinion mining, information extraction

  • week 8, Thursday, 9 May, topics related to recommender systems, such as the systems used by Netflix, Amazon, YouTube, etc.

  • week 10, Thursday, 23 May, topics related to information retrieval, query expansion, personalised search, such as new search engines, new web services

  • week 12, Thursday, 6 June, other topics including machine translation, natural language processing, etc.

A presentation is about 6-8 minutes including the question time. Your main task is to look for a good topic which is related to our course and introduce the most recent research in this area. It can be based on a paper, a system, a project, or web documents etc.

You are required to submit your PPT presentation slides before your presentation using our online submission system. Make sure you can use them in the lecture room.

Complete a Project

Baseline code due Friday 29 March, 5pm; full code due Monday 29 April, 9am; report due Friday 10 May, 5pm.

The goal of the project is to learn the main text mining techniques and practice research skills including programming and academic writing.

The project requires you to build a baseline text classification system, apply it to a new dataset, make modifications and compare them with the baseline system. You can use any tools or packages that are available. Python libraries are preferred.

You are expected to spend a total of about 40 hours on this project. Please manage your time carefully.

Step 1: Build a baseline text classification system

  • Choose a dataset.
  • Choose a baseline classification algorithm.
  • Test the algorithm on the dataset.

Some datasets are built in if you are using Keras, for example, imdb (movie reviews, 2 classes) and reuters (news, 46 classes). The documentation and sources of these datasets and many others are available online, e.g. 20 Newsgroups, Reuters, WebKB.

The state-of-the-art classification algorithms are CNN and LSTM (RNN). You may use Keras and TensorFlow, and the easiest way to start is to follow one of the online tutorials, e.g.

CNN tutorial using movie reviews
A short CNN tutorial with a video
A longer tutorial on CNN and word embedding
LSTM (RNN) tutorial
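
The three items in Step 1 (choose a dataset, choose an algorithm, test it) can be sketched without any deep learning library. The example below is only an illustration: it swaps in a simple multinomial Naive Bayes classifier instead of the CNN or LSTM models mentioned above, and the tiny training and test sets are invented for the example, not taken from any of the datasets listed.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split on whitespace; a real system would clean more carefully."""
    return text.lower().split()


class NaiveBayes:
    """Multinomial Naive Bayes with add-one smoothing (illustrative baseline)."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        self.vocab = set()
        for text, label in zip(texts, labels):
            tokens = tokenize(text)
            self.word_counts[label].update(tokens)
            self.vocab.update(tokens)
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            score = math.log(n_docs / total_docs)  # log prior
            total_words = sum(self.word_counts[label].values())
            for token in tokenize(text):
                if token not in self.vocab:
                    continue  # ignore words never seen in training
                count = self.word_counts[label][token]
                score += math.log((count + 1) / (total_words + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


def accuracy(model, texts, labels):
    """Fraction of documents whose predicted label matches the gold label."""
    correct = sum(model.predict(t) == y for t, y in zip(texts, labels))
    return correct / len(labels)


# Toy data, invented for illustration only.
TRAIN = [
    ("a wonderful and moving film", "pos"),
    ("great acting and a great story", "pos"),
    ("boring plot and terrible acting", "neg"),
    ("a dull and terrible film", "neg"),
]
TEST = [("a great and moving story", "pos"), ("terrible and boring", "neg")]

model = NaiveBayes().fit([t for t, _ in TRAIN], [y for _, y in TRAIN])
baseline_accuracy = accuracy(model, [t for t, _ in TEST], [y for _, y in TEST])
print(f"baseline accuracy: {baseline_accuracy:.2f}")
```

A baseline like this mainly serves as a reference point: whatever model you choose, keeping the train/test split and the accuracy function fixed makes later comparisons meaningful.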

Step 2: Apply the baseline system to a new dataset

You will need to clean the data and convert it into the right format. You are not allowed to use any built-in dataset for this step. The dataset should have more than two classes/categories, and each class should have multiple files.
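
As a rough illustration of the cleaning and conversion step, the sketch below assumes a hypothetical folder layout with one subfolder per class and one `.txt` file per document; your dataset may be organised differently. The function names (`clean_text`, `load_dataset`, `check_requirements`) and the demo folder contents are made up for this example.

```python
import re
import tempfile
from pathlib import Path


def clean_text(raw):
    """Strip HTML-like tags, collapse whitespace, and lowercase."""
    no_tags = re.sub(r"<[^>]+>", " ", raw)
    return re.sub(r"\s+", " ", no_tags).strip().lower()


def load_dataset(root):
    """Return (texts, labels) from a root folder with one subfolder per class."""
    texts, labels = [], []
    for class_dir in sorted(Path(root).iterdir()):
        if not class_dir.is_dir():
            continue
        for path in sorted(class_dir.glob("*.txt")):
            raw = path.read_text(encoding="utf-8", errors="replace")
            texts.append(clean_text(raw))
            labels.append(class_dir.name)
    return texts, labels


def check_requirements(labels, min_classes=3, min_files=2):
    """Check the Step 2 constraints: more than two classes, multiple files each."""
    counts = {}
    for label in labels:
        counts[label] = counts.get(label, 0) + 1
    return len(counts) >= min_classes and all(n >= min_files for n in counts.values())


# Demo on a throwaway folder (class names and file contents are invented).
root = Path(tempfile.mkdtemp())
demo = {
    "sport": ["<p>Goal  scored!</p>", "Match report"],
    "politics": ["Election news", "Budget <b>debate</b>"],
    "tech": ["New phone", "Chip shortage"],
}
for label, docs in demo.items():
    (root / label).mkdir()
    for i, doc in enumerate(docs):
        (root / label / f"{i}.txt").write_text(doc, encoding="utf-8")

texts, labels = load_dataset(root)
print(len(texts), sorted(set(labels)), check_requirements(labels))
```

Returning plain lists of texts and labels keeps the loader independent of the classifier, so the same baseline code can be pointed at either the original or the new dataset.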

You may use a very new dataset described at https://pan.webis.de/clef19/pan19-web/author-profiling.html (you will need to email pan@webis.de or me to ask for a password to unzip it). Please note that this dataset can only be used for research.

Step 3: Modify the baseline system and try to improve the accuracy. This can be done either on the original dataset or the new dataset.

Here are some ideas:
  • Apply another existing classification algorithm or combine some of the existing algorithms.
  • Apply existing or develop new text representations. You may consider 1) locally trained word embeddings versus global pre-trained word embeddings (many tutorials compare their performance), 2) bag of words such as TF-IDF, 3) word embeddings for words with multiple senses, 4) view-based representations, which use subsets of related words, such as words about time, location, genre, etc., 5) phrase-based representations, such as a suffix tree, which uses a structure similar to the trie structure in COMP261, 6) concept-based representations, which consider synonyms and the distance between concepts.
  • Use different features, such as document-level features (e.g. number of words in the document), part-of-speech features (e.g. number of nouns), etc.

  • Develop ways to tune or learn the parameters. The deep learning models contain many parameters, and tuning them manually is time-consuming and error-prone. Investigate ways to reduce the training or evaluation time.
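
To make one of the representation ideas above concrete, here is a minimal sketch of a TF-IDF bag-of-words representation in plain Python. It uses the textbook weighting tf × log(N / df); library implementations often use smoothed variants, so their numbers will differ. The function name `tfidf_vectors` and the three example sentences are invented for illustration.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """Return one {term: weight} dict per document using tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    vectors = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in term_freq.items()
        })
    return vectors


# Toy corpus, invented for illustration only.
docs = ["the cat sat", "the dog barked", "the cat purred"]
vectors = tfidf_vectors(docs)
for doc, vec in zip(docs, vectors):
    print(doc, {term: round(weight, 3) for term, weight in vec.items()})
```

Terms that appear in every document (here "the") receive a weight of zero, while terms unique to one document receive the highest weight, which is the property that makes TF-IDF a useful alternative input to the baseline classifier.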

If you have made significant changes to an existing algorithm, then you can claim that you have developed a new classification algorithm.

You are required to submit the following using our online submission system:

  • Project baseline code, due in week 4, Friday 29 March, 5pm. 3%. It should include a README file to explain how to run your program on lab computers.
  • Project full code, due in week 7, Monday 29 April, 9am. 5%. It should include a README file to explain how to run your program on lab computers.

Please note that all code is primarily marked based on your project demo in a lecture. You must make sure your code can run on your laptop and your laptop can connect to the display system in the lecture room. You will need to run your program and show your results in the demo, and briefly explain your system. If your program takes too long to run, you will need to take screenshots of your results, or save your intermediate results to files.
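
One possible way to save intermediate results to files is sketched below, assuming the results fit in a JSON-serialisable dictionary. The helper `load_or_compute`, the file name `results.json`, and the stand-in `slow_experiment` function are all hypothetical; a real training run would replace the stand-in.

```python
import json
import os
import tempfile


def load_or_compute(path, compute):
    """Return cached results from `path`, or run `compute()` and cache them."""
    if os.path.exists(path):
        with open(path, encoding="utf-8") as handle:
            return json.load(handle)
    results = compute()
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(results, handle, indent=2)
    return results


calls = []  # records how many times the slow step actually ran


def slow_experiment():
    """Stand-in for a long training run; returns settings only, no real scores."""
    calls.append("ran")
    return {"model": "baseline", "epochs": 3}


cache_path = os.path.join(tempfile.mkdtemp(), "results.json")
first = load_or_compute(cache_path, slow_experiment)   # runs the experiment
second = load_or_compute(cache_path, slow_experiment)  # reads the cached file
print(first == second, len(calls))
```

With this pattern, the slow step runs once before the demo and the demo itself only reads the saved file.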

  • Project report, due in week 8, Friday 10 May, 5pm. 15%. The project report should include a brief description of your baseline system, instructions on how to apply it to the new dataset (e.g. any data cleaning and conversion steps and the data format of input files), and a description and justification of your modifications to the system. It should also include the comparison results between the different datasets, and the comparison results between the baseline and your modified version of the algorithm. Readers should be able to understand what you have done without looking at your program code. The page limit of the report is 3 pages.
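
The comparison between the baseline and the modified system can be computed along the following lines. This is a sketch only: accuracy and macro-averaged F1 are one reasonable choice of metrics, not a requirement of the assignment, and the gold labels and predictions below are invented placeholders, not real results.

```python
def accuracy(gold, predicted):
    """Fraction of predictions that match the gold labels."""
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)


def macro_f1(gold, predicted):
    """Unweighted mean of the per-class F1 scores."""
    scores = []
    for label in sorted(set(gold)):
        tp = sum(g == label and p == label for g, p in zip(gold, predicted))
        fp = sum(g != label and p == label for g, p in zip(gold, predicted))
        fn = sum(g == label and p != label for g, p in zip(gold, predicted))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Placeholder labels and predictions, invented for illustration only.
gold = ["a", "a", "b", "b", "c", "c"]
systems = {
    "baseline": ["a", "b", "b", "b", "c", "a"],
    "modified": ["a", "a", "b", "b", "c", "a"],
}

print(f"{'system':<10}{'accuracy':>10}{'macro-F1':>10}")
for name, predicted in systems.items():
    print(f"{name:<10}{accuracy(gold, predicted):>10.3f}{macro_f1(gold, predicted):>10.3f}")
```

Evaluating both systems on the same test split, with the same metric functions, is what makes the numbers in the report comparable.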

The project report is worth 15% of the final grade. The marking of the report will be based on innovation, technical quality/difficulty and writing. Here are some guidelines:
  • A range: All three steps are completed, preferably a new representation or a new classification algorithm is developed, and the performance is improved. Writing is excellent.
  • B range: All three steps are attempted, and comparison results are obtained. Writing is good.
  • C range: The baseline is completed and either step 2 or step 3 is attempted. Writing is clear.

Write a paper review 7%

Due week 10, Friday 24 May, 5pm.

Reading papers and writing short summaries is one of the basic learning skills. The goal of this assignment is to learn new technologies from reading papers and practice critical thinking and writing for future research.

You may review any paper of your own choice, but make sure that its content is closely related to the course. It is preferred that the paper is related to your project. You are required to submit the paper review together with a copy of the paper using our online submission system.

The paper review should include a brief justification of why the paper was chosen; you can comment on content, source, citations, authors, novelty, technical quality, presentation, etc. It should also include a summary with at least two parts. The first part summarises the main contributions of the paper and its main technologies. The second part discusses why it is relevant to you. For example, you may point out its limitations, possible ways to address the limitations, and any suggestions you may have to improve the paper; alternatively, you may discuss how this paper can be applied in some applications or future research, or how it can be used to improve the system in your project, etc.

The page limit is 1 page and the word limit is 400 words.

Late submissions

Each student will have 3 "late days" which you may choose to use for any submissions during the course. There will be no penalty applied for these late days. You do not need to apply for these: any late days you have left will be automatically applied to assignments that you submit late. Please note that these late days are not applicable to your presentation or project demo, but if you have any exceptional circumstances, email the lecturer to reschedule.

The late days are intended to cover minor illnesses or other personal reasons for being late. You should ask for extensions in the case of more significant or longer lasting problems (and you may need documentation).

Late submissions will be penalised by 20% of the full assignment mark per day. This only applies if you have run out of late days and you do not have a pre-arrangement with the lecturer.
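
As a worked example of the arithmetic, the sketch below applies the late-day rule and the 20% penalty to a single submission. It assumes lateness is counted in whole days and that the penalty cannot exceed the full assignment mark; neither detail is stated above, so treat both as assumptions. The function name `late_penalty` is made up for this example.

```python
def late_penalty(days_late, late_days_left, full_mark, rate=0.20):
    """Return (penalty, late_days_remaining) for one late submission.

    Assumptions: lateness is counted in whole days, and the penalty is
    capped at the full assignment mark.
    """
    free_days = min(days_late, late_days_left)   # covered by late days
    penalised_days = days_late - free_days       # days that attract a penalty
    penalty = min(full_mark, penalised_days * rate * full_mark)
    return penalty, late_days_left - free_days


# Paper review (worth 7%) submitted 2 days late with 1 late day left:
# one day is covered, one day is penalised at 20% of 7.
penalty, remaining = late_penalty(days_late=2, late_days_left=1, full_mark=7)
print(f"penalty: {penalty:.1f} of 7, late days remaining: {remaining}")
```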

-- xgao - 10 Feb 2019