Assignments and weighting
Join the discussions and do peer reviews 3%
You are expected to ask questions and join discussions in the lectures, student presentations and the project demo session. During the student presentations and the project demo session, peer review sheets will be handed out, and you need to give each speaker a score out of 10 (1 for extremely poor, 10 for perfect). The peer review is anonymous, and the sheets will be collected at the end of each session. Evaluate the presentations on two main criteria: whether the content is informative and whether the communication is clear. Evaluate the project demos on your understanding of the novelty, workload, and difficulty of the project.
Give a presentation in one of the lectures 7%
The main goal is to learn communication skills and practise giving professional presentations. You must sign up for one of the following presentation times:
- week 2, Thursday, 14 March, topics related to text classification or clustering, such as new algorithms, deep learning models (CNN, RNN) or their applications
- week 4, Thursday, 28 March, topics related to text representation such as word2vec, word embedding, word rank, new measures for word similarity
- week 6, Thursday, 11 April, topics related to opinion mining, information extraction
- week 8, Thursday, 9 May, topics related to recommender systems, such as the systems used by Netflix, Amazon, YouTube, etc.
- week 10, Thursday, 23 May, topics related to information retrieval, query expansion, personalised search, such as new search engines, new web services.
- week 12, Thursday, 6 June, other topics including machine translation, natural language processing
Complete a project
Baseline code due Friday 29 March, 5pm; full code due Monday 29 April, 9am; report due Friday 10 May, 5pm. The goal of the project is to learn the main text mining techniques and practise research skills, including programming and academic writing. The project requires you to build a baseline text classification system, apply it to a new dataset, make modifications and compare them with the baseline system. You can use any tools or packages that are available. Python libraries are preferred. You are expected to spend about 40 hours in total on this project. Please manage your time carefully.
Step 1: Build a baseline text classification system
- Choose a dataset.
- Choose a baseline classification algorithm.
- Test the algorithm on the dataset.
A short CNN tutorial with a video
A longer tutorial on CNN and word embedding
LSTM (RNN) tutorial
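As a rough illustration of what a Step 1 baseline can look like, here is a minimal sketch of a multinomial Naive Bayes classifier in plain Python. It is only one possible baseline, not a required design. The toy sentences, the class labels and all function and class names are made up for illustration; a real baseline would use a proper dataset with a held-out test split.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


class NaiveBayesBaseline:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)       # documents per class
        self.word_counts = defaultdict(Counter)   # word counts per class
        self.vocab = set()
        for text, label in zip(texts, labels):
            tokens = tokenize(text)
            self.word_counts[label].update(tokens)
            self.vocab.update(tokens)
        self.total_docs = len(labels)
        return self

    def predict(self, text):
        # Words never seen in training are ignored.
        tokens = [t for t in tokenize(text) if t in self.vocab]
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            score = math.log(n_docs / self.total_docs)  # log prior
            total_words = sum(self.word_counts[label].values())
            denom = total_words + len(self.vocab)
            for t in tokens:
                score += math.log((self.word_counts[label][t] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


def accuracy(model, texts, labels):
    """Fraction of texts whose predicted label matches the true label."""
    correct = sum(model.predict(t) == y for t, y in zip(texts, labels))
    return correct / len(labels)


# Toy data, invented for this sketch only.
train_texts = [
    "the match ended with a late goal",
    "the team won the league cup",
    "the election results were announced",
    "parliament passed the new budget",
]
train_labels = ["sport", "sport", "politics", "politics"]

model = NaiveBayesBaseline().fit(train_texts, train_labels)
print(model.predict("the team scored a goal"))  # sport
```

Testing on the training sentences, as the toy data invites, says little about real performance; report accuracy on documents the model has not seen.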
Step 2: Apply the baseline system to a new dataset
You will need to clean the data and convert it into the right format. You are not allowed to use any built-in dataset for this step. The dataset should have more than two classes/categories, and each class should have multiple files. You may use a very new dataset described here: https://pan.webis.de/clef19/pan19-web/author-profiling.html You will need to email firstname.lastname@example.org or me to ask for a password to unzip it. Please note that this dataset can only be used for research.
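The cleaning and conversion work in Step 2 depends on the dataset you pick, but the general shape is often similar. The sketch below assumes a hypothetical layout with one sub-folder per class, each holding plain-text `.txt` files; this layout, the cleaning rules and the function names are assumptions for illustration, not the format of any particular dataset.

```python
import re
from pathlib import Path


def clean_text(raw):
    """Strip HTML-like tags and URLs, collapse whitespace, and lowercase."""
    no_tags = re.sub(r"<[^>]+>", " ", raw)
    no_urls = re.sub(r"https?://\S+", " ", no_tags)
    return re.sub(r"\s+", " ", no_urls).strip().lower()


def load_dataset(root):
    """Return (texts, labels) from a folder with one sub-folder per class.

    Assumed layout (hypothetical): root/<class_name>/<document>.txt
    """
    texts, labels = [], []
    class_dirs = sorted(p for p in Path(root).iterdir() if p.is_dir())
    for class_dir in class_dirs:
        for path in sorted(class_dir.glob("*.txt")):
            raw = path.read_text(encoding="utf-8", errors="replace")
            texts.append(clean_text(raw))
            labels.append(class_dir.name)
    # Step 2 asks for more than two classes, so fail early otherwise.
    if len(set(labels)) <= 2:
        raise ValueError("Step 2 needs a dataset with more than two classes")
    return texts, labels
```

Calling `load_dataset("my_data")` (a made-up folder name) would return two parallel lists that most classifiers can consume directly. Describe whatever cleaning rules you actually apply in your report, since they affect the results.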
Step 3: Modify the baseline system and try to improve the accuracy
This can be done on either the original dataset or the new dataset. Here are some ideas:
- Apply another existing classification algorithm or combine some of the existing algorithms.
- Apply existing text representations or develop new ones. You may consider: 1) locally trained word embeddings versus global pre-trained word embeddings (many tutorials compare their performance); 2) bag-of-words representations such as TF-IDF; 3) word embeddings for words with multiple senses; 4) view-based representations, which use subsets of related words, such as words about time, location, genre, etc.; 5) phrase-based representations, such as a suffix tree, which uses a structure similar to the trie structure in COMP261; 6) concept-based representations, which consider synonyms and the distance between concepts.
- Use different features, such as document-level features (e.g. number of words in the document), part-of-speech features (e.g. number of nouns), etc.
- Develop ways to tune or learn the parameters. Deep learning models contain many parameters, and tuning them manually is time-consuming and error-prone. Investigate ways to reduce the training or evaluation time.
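Of the representation ideas above, TF-IDF is probably the quickest to prototype. The sketch below computes one common variant by hand: term frequency normalised by document length, multiplied by idf = log(N / df). Other variants (including those in popular libraries) smooth or normalise differently, so the exact numbers will not match theirs. The function name and the toy documents are made up for illustration.

```python
import math
from collections import Counter


def tfidf_vectors(tokenized_docs):
    """Return one {term: weight} dict per tokenized document.

    weight = (count in doc / doc length) * log(N / document frequency)
    """
    n_docs = len(tokenized_docs)
    doc_freq = Counter()
    for tokens in tokenized_docs:
        doc_freq.update(set(tokens))  # count each term once per document
    vectors = []
    for tokens in tokenized_docs:
        counts = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


# Toy documents, invented for this sketch only.
docs = [
    ["good", "film"],
    ["bad", "film"],
    ["good", "good", "plot"],
]
vectors = tfidf_vectors(docs)
for vec in vectors:
    print(vec)
```

A term that occurs in every document gets weight 0 under this variant, which is the intended effect: such terms do not help tell classes apart.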
- Project baseline code, due in week 4, Friday 29 March, 5pm. 3%. It should include a README file that explains how to run your program on the lab computers.
- Project full code, due in week 7, Monday 29 April, 9am. 5%. It should include a README file that explains how to run your program on the lab computers.
- Project report, due in week 8, Friday 10 May, 5pm. 15%. The project report should include a brief description of your baseline system, instructions on how to apply it to the new dataset (e.g. any data cleaning and transformation steps and the data format of the input files), and a description and justification of your modifications to the system. It should also include the comparison results between the different datasets, and between the baseline and your modified version of the algorithm. Readers should be able to understand what you have done without looking at your program code. The page limit for the report is 3 pages.
- A range: All three steps are completed, preferably with a new representation or a new classification algorithm developed, and the performance is improved. Writing is excellent.
- B range: All three steps are attempted and comparison results are obtained. Writing is good.
- C range: The baseline is completed and either step 2 or step 3 is attempted. Writing is clear.
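For the comparison results the report asks for (between datasets, and between the baseline and the modified system), it helps to compute the same metrics in the same way for every system. The sketch below shows one way to do this with accuracy and macro-averaged F1. The choice of metrics, the function names and the example labels are assumptions for illustration; the assignment itself only mentions accuracy.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over the classes in y_true."""
    scores = []
    for label in sorted(set(y_true)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when there are no true positives.
        scores.append(2 * tp / (2 * tp + fp + fn) if tp else 0.0)
    return sum(scores) / len(scores)


def comparison_table(y_true, predictions_by_system):
    """Format one row of metrics per system as plain text."""
    rows = [f"{'system':<12}{'accuracy':>10}{'macro-F1':>10}"]
    for name, y_pred in predictions_by_system.items():
        rows.append(
            f"{name:<12}{accuracy(y_true, y_pred):>10.3f}"
            f"{macro_f1(y_true, y_pred):>10.3f}"
        )
    return "\n".join(rows)


# Made-up labels and predictions, for illustration only.
y_true = ["a", "a", "b", "c"]
print(comparison_table(y_true, {
    "baseline": ["a", "b", "b", "c"],
    "modified": ["a", "a", "b", "c"],
}))
```

Macro-F1 is included because accuracy alone can look good on a dataset where one class dominates; with more than two classes, as Step 2 requires, the difference can matter.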