COMP424 (2019) - Big Data
Big Data refers to the large and often complex datasets generated in the modern world, from sources such as commercial customer records, internet transactions, and environmental monitoring. This course provides an introduction to the theory and practice of working with Big Data. Students enrolling in this course should be familiar with the basics of machine learning, data mining, statistical modelling, and programming.
Course learning objectives
Students who pass this course should be able to:
- Identify properties and challenges of very large data sets in order to determine appropriate analysis techniques to apply to a specific Big Data task.
- Identify challenges in high-dimensional data and choose appropriate dimensionality reduction methods, from a software library such as Weka, to solve high-dimensional problems.
- Analyse regression and clustering data to choose appropriate analysis methods with good parameter settings from a software library such as R to address regression and clustering problems and to generate data visualisations.
- Based on an understanding of Hadoop MapReduce and Apache Spark, implement relevant algorithmic analysis of Big Data problems using appropriate machine learning libraries.
Section 1 Introduction to Big Data
- What is Big Data?
- Where does Big Data come from?
- What can we do, and what should we do, with Big Data?
- Typical examples of Big Data analysis in the real world
Section 2 Machine learning for high-dimensional data
- Data Preprocessing and Introduction to Feature Manipulation
- Machine learning for high-dimensional data: dimensionality reduction and feature selection (and possibly missing data analysis); wrapper, filter and embedded dimensionality reduction methods
- The techniques covered will include sequential forward selection, sequential backward selection, and other machine learning methods such as decision trees, random forests, support vector machines, genetic programming (and possibly particle swarm optimisation).
- Regression: ridge regression, local regression, lasso; curse of dimensionality
- Generalized additive models; case study on intelligible models in healthcare applications.
- Clustering and resampling methods.
Section 4 Big Data Tools/Project
- Hadoop MapReduce
- Apache Spark
- Spark Machine Learning Libraries
Withdrawal from Course
Withdrawal dates and process:
Two lectures per week, one of 1 hour and one of 2 hours' duration, with tutorials.
Student feedback on University courses may be found at: www.cad.vuw.ac.nz/feedback/feedback_display.php
Dates (trimester, teaching & break dates)
- Teaching: 04 March 2019 - 09 June 2019
- Break: 15 April 2019 - 28 April 2019
- Study period: 10 June 2019 - 13 June 2019
- Exam period: 14 June 2019 - 29 June 2019
Set Texts and Recommended Readings
There are no required texts for this offering.
Mandatory Course Requirements
In addition to achieving an overall pass mark of at least 50%, students must:
- submit reasonable attempts for at least two out of the three assignments. (Justification: The practical skills that are obtained in the assignments are a critical part of the CLOs, and engagement with a minimum of two of the assignments is considered essential.)
If you believe that exceptional circumstances may prevent you from meeting the mandatory course requirements, contact the Course Coordinator for advice as soon as possible.
|Assessment Item|Due Date or Test Date|CLO(s)|Percentage|
|---|---|---|---|
|Assignment 1 (3 weeks) (Analysis and report)|Week 4/5|CLO: 1,2|20%|
|Assignment 2 (3 weeks) (Analysis and report)|Week 7|CLO: 3|20%|
|Assignment 3 (3 weeks) (Analysis and report)|Week 11|CLO: 4|20%|
|Final exam (2 hours)|Exam Period|CLO: 1,2,3,4|40%|
The penalty for assignments that are handed in late without prior arrangement is one grade reduction per day. Assignments that are more than one week late will not be marked.
Individual extensions will only be granted in exceptional personal circumstances, and should be negotiated with the course coordinator before the deadline whenever possible. Documentation (e.g. a medical certificate) may be required.
Submission & Return
All work should be submitted through the ECS submission system, accessible through the course web pages. Marks and comments will be returned through the ECS marking system.
In order to maintain satisfactory progress in COMP 424, you should plan to spend an average of at least 10 hours per week on this paper. A plausible and approximate breakdown for these hours would include:
- Lectures and tutorials: 3 hours
- Readings: 2-4 hours
- Assignments: 3-5 hours
Communication of Additional Information
Links to General Course Information
- Academic Integrity and Plagiarism: https://www.victoria.ac.nz/students/study/exams/integrity-plagiarism
- Academic Progress: https://www.victoria.ac.nz/students/study/progress/academic-progess (including restrictions and non-engagement)
- Dates and deadlines: https://www.victoria.ac.nz/students/study/dates
- Grades: https://www.victoria.ac.nz/students/study/progress/grades
- Special passes: Refer to the Assessment Handbook, at https://www.victoria.ac.nz/documents/policy/staff-policy/assessment-handbook.pdf
- Statutes and policies, e.g. Student Conduct Statute: https://www.victoria.ac.nz/about/governance/strategy
- Student support: https://www.victoria.ac.nz/students/support
- Students with disabilities: https://www.victoria.ac.nz/st_services/disability/
- Student Charter: https://www.victoria.ac.nz/learning-teaching/learning-partnerships/student-charter
- Terms and Conditions: https://www.victoria.ac.nz/study/apply-enrol/terms-conditions/student-contract
- Turnitin: http://www.cad.vuw.ac.nz/wiki/index.php/Turnitin
- University structure: https://www.victoria.ac.nz/about/governance/structure
- VUWSA: http://www.vuwsa.org.nz
Offering CRN: 31156
Prerequisites: One of (COMP 307, 309, STAT 393, 394); STAT 193 or ENGR 123 or approved background in Statistics;
Restrictions: COMP 473 (2016-2018)
Duration: 04 March 2019 - 30 June 2019
Starts: Trimester 1