AIML427 (2024) - Big Data

Prescription

Big Data refers to the large and often complex datasets generated in the modern world, from sources such as commercial customer records, internet transactions, and environmental monitoring. This course provides an introduction to the theory and practice of working with Big Data. Students enrolling in this course should be familiar with the basics of machine learning, data mining, statistical modelling, and programming.

Course learning objectives

Students who pass this course should be able to:

  1. Identify properties and challenges of very large data sets in order to determine appropriate analysis techniques to apply to a specific Big Data task.
  2. Explain the challenges in high-dimensional data and choose appropriate dimensionality reduction methods, from a software library such as KNIME, to solve high-dimensional problems.
  3. Analyse regression and clustering data to choose appropriate analysis methods with good parameter settings from a software library such as R to generate data visualisations and to address regression and clustering problems.
  4. Use their understanding of tools such as Hadoop MapReduce and Apache Spark to implement relevant algorithmic analysis of Big Data problems using appropriate machine learning libraries.

Course content

We’ve designed this course for in-person study, and to get the most out of it we strongly recommend you attend lectures on campus. Most assessment items, as well as tutorials/seminars/labs/workshops, will only be available in person. Any exceptions to in-person attendance for assessment will be considered on a case-by-case basis in exceptional circumstances, e.g., through disability services or with approval from the course coordinator.
 
If you started your programme of study remotely and can only study remotely, please contact the School so we can help and confirm what courses are available.  
 
=============================================
 
Section 1 Introduction to Big Data

  • What is Big Data?
  • Where does Big Data come from? 
  • What can we do, and what should we do, with Big Data?
  • Typical examples of Big Data analysis in the real world

Section 2 Machine learning for high-dimensional data

  • Data Preprocessing and Introduction to Feature Manipulation
  • Machine learning for high-dimensional data: dimensionality reduction and feature selection (and possibly missing data analysis)
  • Wrapper, filter and embedded dimensionality reduction methods
  • Techniques covered will include sequential forward selection, sequential backward selection, and other machine learning methods such as decision trees, random forests, support vector machines and genetic programming (and possibly particle swarm optimisation)

Section 3 Regression, Clustering and other Techniques in Big Data

  • Regression: ridge regression, local regression, lasso; the curse of dimensionality
  • Generalized additive models; case study on intelligible models in healthcare applications
  • Clustering and resampling methods

Section 4 Big Data Tools/Project

  • Hadoop MapReduce
  • Apache Spark
  • Spark Machine Learning Libraries

Withdrawal from Course

Withdrawal dates and process:
https://www.wgtn.ac.nz/students/study/course-additions-withdrawals

Lecturers

Dr Qi Chen (Coordinator)

  • qi.chen@vuw.ac.nz
  • CO 329 Cotton Building (All Blocks), Gate 7, Kelburn Parade, Kelburn

Dr Hoai-Bach Nguyen

Teaching Format

This course will be offered in-person.
Two lectures per week, with associated assignments. Additional content may be provided through video resources.

Dates (trimester, teaching & break dates)

  • Teaching: 26 February 2024 - 31 May 2024
  • Break: 01 April 2024 - 14 April 2024
  • Study period: 03 June 2024 - 06 June 2024
  • Exam period: 07 June 2024 - 22 June 2024

Class Times and Room Numbers

26 February 2024 - 31 March 2024

  • Monday 15:10 - 16:00 – 501, Murphy, Kelburn
  • Thursday 15:10 - 16:00 – 501, Murphy, Kelburn

15 April 2024 - 02 June 2024

  • Monday 15:10 - 16:00 – 501, Murphy, Kelburn
  • Thursday 15:10 - 16:00 – 501, Murphy, Kelburn

Required

There are no required texts for this offering.

Mandatory Course Requirements

There are no mandatory course requirements for this course.

If you believe that exceptional circumstances may prevent you from meeting the mandatory course requirements, contact the Course Coordinator for advice as soon as possible.

Assessment

Assessment Item            Due Date or Test Date               CLO(s)     Percentage
Assignment 1 (25 hours)    Week 5                              1, 2, 3    20%
Assignment 2 (25 hours)    Week 9                              1, 2, 3    25%
Test (50 minutes)          Week 11                             1, 2, 3    25%
Assignment 3 (25 hours)    Second week of assessment period    4          30%

Penalties

The penalty for assignments that are handed in late without prior arrangement is one grade reduction per day. Assignments that are more than one week late will not be marked.

Extensions

There will be three late days automatically available across the assessments in the course. These will be applied automatically in the assessment system. They are intended to cover common reasons for short extensions, such as overlapping deadlines, technical difficulties, or unforeseen changes in personal circumstances.
 
Individual extensions beyond the three late days will only be granted in exceptional personal circumstances, and should be negotiated with the course coordinator before the deadline whenever possible. Documentation (e.g., medical certificate) may be requested.

Submission & Return

All work should be submitted through the ECS submission system, accessible through the course web pages. Marks and comments will be returned through the ECS marking system.
 
The School normally has a goal of returning marks for all assessment items within two weeks of the submission deadline. This year, the course will aim to meet this goal, but we expect that sickness and self-isolation due to Covid will extend the time required to mark some assignments and tests.

Workload

In order to maintain satisfactory progress in AIML 427, you should plan to spend an average of at least 10 hours per week on this paper. A plausible and approximate breakdown for these hours would include:

  • Lectures and tutorials: 2
  • Readings: 2-4
  • Assignments: 3-5

However, since this is a multidisciplinary course, students with different backgrounds may need more or less time than this for different sections and assignments.

Teaching Plan

See: https://ecs.wgtn.ac.nz/Courses/AIML427_2024T1/LectureSchedule

Communication of Additional Information

All online material for this course can be accessed at https://ecs.wgtn.ac.nz/Courses/AIML427_2024T1/

Offering CRN: 33069

Points: 15
Prerequisites: one of (AIML 420, 421, COMP 307, 309, STAT 393, 394); one of (ENGR 123, STAT 193, MATH 177, QUAN 102) or approved background in Statistics
Restrictions: COMP 424, COMP 473 (2016-2018)
Duration: 26 February 2024 - 23 June 2024
Starts: Trimester 1
Campus: Kelburn