Binh Tran

PhD Student School of Engineering and Computer Science

Binh Tran profile picture

Thesis Info

Research Interests: Evolutionary Computation, Feature Selection, Feature Construction, Dimensionality Reduction, Classification, Machine Learning
Thesis Title: Evolutionary Computation for Feature Manipulation in Classification on High-Dimensional Data
Supervisor: Prof Mengjie Zhang and Dr Bing Xue

 

Research Interest

Evolutionary Computation, Feature Selection, Feature Construction, Dimensionality Reduction, Classification, Machine Learning.

Refereed Journal Papers

Binh Tran, Bing Xue and Mengjie Zhang. " A New Representation in PSO for Discretisation-Based Feature Selection", IEEE Transactions on Cybernetics, 2017 (DOI:10.1109/TCYB.2017.2714145).
@ARTICLE{Tran2016new,
AUTHOR = {Tran, Binh and Xue, Bing and Zhang, Mengjie},
journal={IEEE Transactions on Cybernetics},
title={A New Representation in PSO for Discretization-Based Feature Selection},
year={2017},
volume={PP},
number={99},
pages={1-14},
keywords={Bioinformatics;Cybernetics;Machine learning algorithms;Merging;Particle swarm optimization;Search problems;Classification;discretization;feature selection (FS);high-dimensional data;particle swarm optimization (PSO)},
doi={10.1109/TCYB.2017.2714145}}

In machine learning, discretization and feature selection (FS) are important techniques for preprocessing data to improve the performance of an algorithm on high-dimensional data. Since many FS methods require discrete data, a common practice is to apply discretization before FS. In addition, for the sake of efficiency, features are usually discretized individually (or univariate). This scheme works based on the assumption that each feature independently influences the task, which may not hold in cases where feature interactions exist. Therefore, univariate discretization may degrade the performance of the FS stage since information showing feature interactions may be lost during the discretization process. Initial results of our previous proposed method [evolve particle swarm optimization (EPSO)] showed that combining discretization and FS in a single stage using bare-bones particle swarm optimization (BBPSO) can lead to a better performance than applying them in two separate stages. In this paper, we propose a new method called potential particle swarm optimization (PPSO) which employs a new representation that can reduce the search space of the problem and a new fitness function to better evaluate candidate solutions to guide the search. The results on ten high-dimensional datasets show that PPSO select less than 5% of the number of features for all datasets. Compared with the two-stage approach which uses BBPSO for FS on the discretized data, PPSO achieves significantly higher accuracy on seven datasets. In addition, PPSO obtains better (or similar) classification performance than EPSO on eight datasets with a smaller number of selected features on six datasets. Furthermore, PPSO also outperforms the three compared (traditional) methods and performs similar to one method on most datasets in terms of both generalization ability and learning capacity.

Binh Tran, Bing Xue, Mengjie Zhang, and Su Nguyen. " Investigation on Particle Swarm Optimisation for Feature Selection on High-dimensional Data: Local Search and Selection Bias", Connection Science, 2016.

@ARTICLE{Tran2016investigation,
AUTHOR = {Tran, Binh and Xue, Bing and Zhang, Mengjie and Nguyen, Su},
TITLE = {Investigation on Particle Swarm Optimisation for Feature Selection on High-dimensional Data: Local Search and Selection Bias},
JOURNAL = {Connection Science},
VOLUME = {28},
NUMBER = {3},
PAGES = {270--294},
YEAR = {2016}
}

Feature selection is an essential step in classification tasks with a large number of features, such as in gene expression data. Recent research has shown that particle swarm optimisation (PSO) is a promising approach to feature selection. However, it also has potential limitation to get stuck into local optima, especially for gene selection problems with a huge search space. Therefore, we developed a PSO algorithm (PSO-LSRG) with a fast “local search” combined with a gbest resetting mechanism as a way to improve the performance of PSO for feature selection. Furthermore, since many existing PSO-based feature selection approaches on the gene expression data have feature selection bias, i.e. no unseen test data is used, 2 sets of experiments on 10 gene expression datasets were designed: with and without feature selection bias. As compared to standard PSO, PSO with gbest resetting only, and PSO with local search only, PSO-LSRG obtained a substantial dimensionality reduction and a significant improvement on the classification performance in both sets of experiments. PSO-LSRG outperforms the other three algorithms when feature selection bias exists. When there is no feature selection bias, PSO-LSRG selects the smallest number of features in all cases, but the classification performance is slightly worse in a few cases, which may be caused by the overfitting problem. This shows that feature selection bias should be avoided when designing a feature selection algorithm to ensure its generalisation ability on unseen data.

Binh Tran, Bing Xue and Mengjie Zhang. " Genetic Programming for Feature Construction and Selection in Classification on High-dimensional Data", Memetic Computing, 2015. (DOI:10.1007/s12293-015-0173-y).

@ARTICLE{Tran2015genetic,
AUTHOR = {Tran, Binh and Xue, Bing and Zhang, Mengjie},
TITLE = {Genetic programming for feature construction and selection in classification on high-dimensional data},
JOURNAL = {Memetic Computing},
VOLUME = {8},
NUMBER = {1},
PAGES = {3--15},
YEAR = {2015}
}

Classification on high-dimensional data with thousands to tens of thousands of dimensions is a challenging task due to the high dimensionality and the quality of the feature set. The problem can be addressed by using feature selection to choose only informative features or feature construction to create new high-level features. Genetic programming (GP) using a tree-based representation can be used for both feature construction and implicit feature selection. This work presents a comprehensive study to investigate the use of GP for feature construction and selection on high-dimensional classification problems. Different combinations of the constructed and/or selected features are tested and compared on seven high-dimensional gene expression problems, and different classification algorithms are used to evaluate their performance. The results show that the constructed and/or selected feature sets can significantly reduce the dimensionality and maintain or even increase the classification accuracy in most cases. The cases with overfitting occurred are analysed via the distribution of features. Further analysis is also performed to show why the constructed feature can achieve promising classification performance.

Refereed Conference Papers

Binh Tran, Stjepan Picek, and Bing Xue. " Automatic Feature Construction for Network Intrusion Detection". Proceedings of the 11th International Conference on Simulated Evolution and Learning (SEAL 2017). Lecture Notes in Computer Science. Shenzhen, China. November 10-13, 2017.

@INPROCEEDINGS{tran2017automatic,
author="Tran, Binh
and Picek, Stjepan
and Xue, Bing",
title="Automatic Feature Construction for Network Intrusion Detection",
booktitle="Simulated Evolution and Learning",
year="2017",
publisher="Springer International Publishing",
address="Cham",
pages="569--580"
}

The notion of cyberspace became impossible to separate from the notions of cyber threat and cyberattack. Since cyberattacks are getting easier to run, they are also becoming more serious threats from the economic damage perspective. Consequently, we are evident of a continuous adversarial relationship between the attackers trying to mount as powerful as possible attacks and defenders trying to stop the attackers in their goals. To defend against such attacks, defenders have at their disposal a plethora of techniques but they are often falling behind the attackers due to the fact that they need to protect the whole system while the attacker needs to find only a single weakness to exploit. In this paper, we consider one type of a cyberattack -- network intrusion -- and investigate how to use feature construction via genetic programming in order to improve the intrusion detection accuracy. The obtained results show that feature construction offers improvements in a number of tested scenarios and therefore should be considered as an important step in defense efforts. Such improvements are especially apparent in scenario with the highly unbalanced data, which also represents the most interesting case from the defensive perspective. 

Binh Tran Ngan, Mengjie Zhang, Bing Xue. " Class Dependent Multiple Feature Construction Using Genetic Programming for High-Dimensional Data". Proceedings of the 30th Australasian Joint Conference on Artificial Intelligence (AI2017) Lecture Notes in Computer Science. Vol. 10400. Springer. Melbourne, Australia, August 19-20th, 2017. pp. 182-194.
@INPROCEEDINGS{Tran2016pso,
AUTHOR = {Tran, Binh and Xue, Bing and Zhang, Mengjie},
TITLE = {Class Dependent Multiple Feature Construction Using Genetic Programming for High-Dimensional Data},
BOOKTITLE = {AI 2017: Advances in Artificial Intelligence},
YEAR = {2017},
PUBLISHER="Springer International Publishing",
PAGES = {182--194}
}

Genetic Programming (GP) has shown promise in feature construction where high-level features are formed by combining original features using predefined functions or operators. Multiple feature construction methods have been proposed for high-dimensional data with thousands of features. Results of these methods show that several constructed features can maintain or even improve the discriminating ability of the original feature set. However, some particular features may have better ability than other features to distinguish instances of one class from other classes. Therefore, it may be more difficult to construct a better discriminating feature when combing features that are relevant to different classes. In this study, we propose a new GP-based feature construction method called CDFC that constructs multiple features, each of which focuses on distinguishing one class from other classes. We propose a new representation for class-dependent feature construction and a new fitness function to better evaluate the constructed feature set. Results on eight datasets with varying difficulties showed that the features constructed by CDFC can improve the discriminating ability of thousands of original features in most cases. Results also showed that CFDC is more effective and efficient than the hybrid MGPFC method which was shown to have better performance than standard GP to feature construction.

Binh Tran Ngan, Mengjie Zhang, Bing Xue. "Using Feature Clustering for GP-Based Feature Construction on High-Dimensional Data". Proceedings of the 20th European Conference on Genetic Programming (EuroGP 2017). Lecture Notes in Computer Science. Vol. 10196. Amsterdam. 18-21 April 2017. pp.210--226.
@INPROCEEDINGS{Tran2017using,
AUTHOR = {Tran, Binh and Xue, Bing and Zhang, Mengjie},
TITLE = {Using Feature Clustering for GP-Based Feature Construction on High-Dimensional Data},
BOOKTITLE = {Genetic Programming},
YEAR = {2017},
PUBLISHER="Springer International Publishing",
PAGES = {210--226}
}

Feature construction is a pre-processing technique to create new features with better discriminating ability from the original features. Genetic programming (GP) has been shown to be a prominent technique for this task. However, applying GP to high-dimensional data is still challenging due to the large search space. Feature clustering groups similar features into clusters, which can be used for dimensionality reduction by choosing representative features from each cluster to form the feature subset. Feature clustering has been shown promising in feature selection; but has not been investigated in feature construction for classification. This paper presents the first work of utilising feature clustering in this area. We propose a cluster-based GP feature construction method called CGPFC which uses feature clustering to improve the performance of GP for feature construction on high-dimensional data. Results on eight high-dimensional datasets with varying difficulties show that the CGPFC constructed features perform better than the original full feature set and features constructed by the standard GP constructor based on the whole feature set.


Binh Tran Ngan, Mengjie Zhang, Bing Xue. "Multiple Feature Construction in High-Dimensional Data Using Genetic Programming". IEEE Symposium Series on Computational Intelligence (IEEE SSCI 2016). Athens, Greece, December 6-9, 2016 pp. 1-8. (DOI:10.1109/SSCI.2016.7850130).
@INPROCEEDINGS{Tran2016pso,
AUTHOR = {Tran, Binh and Xue, Bing and Zhang, Mengjie},
TITLE = {Multiple Feature Construction in High-Dimensional Data Using Genetic Programming},
BOOKTITLE = {IEEE Symposium Series on Computational Intelligence (SSCI)},
YEAR = {2016},
PAGES = {1--8},
PUBLISHER = {IEEE}
}


Feature construction and feature selection are common pre-processing techniques to obtain smaller but better discriminating feature sets than the original ones. These two techniques are essential in high-dimensional data with thousands or tens of thousands of features where there may exist many irrelevant and redundant features. Genetic programming (GP) is a powerful technique that has shown promising results in feature construction and feature selection. However, constructing multiple features for high-dimensional data is still challenging due to its large search space. In this paper, we propose a GP-based method that simultaneously performs multiple feature construction and feature selection to automatically transform high-dimensional datasets into much smaller ones. Experiment results on six datasets show that the size of the generated feature set is less than 4% of the original feature set size and it significantly improves the performance of K-Nearest Neighbour, Naive Bayes and Decision Tree algorithms on 15 out of 18 comparisons. Compared with the single feature construction method using GP, the proposed method has better performance on half cases and similar on the other half. Comparisons between the constructed features, the selected features and the combination of both constructed and selected features by the propose method reveal different preferences of the three learning algorithms on these feature sets. 


Binh Tran Ngan, Mengjie Zhang, Bing Xue. " A PSO Based Hybrid Feature Selection Algorithm for High-Dimensional Classification". Proceedings of 2016 IEEE World Congress on Computational Intelligence/ IEEE Congress on Evolutionary Computation (WCCI 2016 /CEC2016).

@INPROCEEDINGS{Tran2016pso,
AUTHOR = {Tran, Binh and Xue, Bing and Zhang, Mengjie},
TITLE = {A PSO Based Hybrid Feature Selection Algorithm for High-Dimensional Classification},
BOOKTITLE = {Proceedings of 2016 IEEE World Congress on Computational Intelligence/ IEEE Congress on Evolutionary Computation (WCCI 2016 /CEC2016)},
YEAR = {2016},
PAGES = {3801--3808},
PUBLISHER = {IEEE}
}

Recent research has shown that Particle Swarm Optimisation is a promising approach to feature selection. However, applying it on high-dimensional data with thousands to tens of thousands of features is still challenging because of the large search space. While filter approaches are time efficient and scalable for high-dimensional data, they usually obtain lower classification accuracy than wrapper approaches. On the other hand, wrapper methods require a longer running time than filter methods due to the learning algorithm involved in fitness evaluation. This paper proposes a new strategy of combining filter and wrapper approaches in a single evolutionary process in order to achieve smaller feature subsets with better classification performance in a shorter time. A new local search heuristic using symmetric uncertainty is proposed to refine the solutions found by PSO and a new hybrid fitness function is used to better evaluate candidate solutions. The proposed method is examined and compared with three recent PSO based methods on eight high-dimensional problems of varying difficulty. The results show that the new hybrid PSO is more effective and efficient than the other methods. 

Binh Tran Ngan, Mengjie Zhang, Bing Xue. " Bare-Bone Particle Swarm Optimisation for Simultaneously Discretising and Selecting Features For High-Dimensional Classification". Proceedings of the 19th European Conference on the Applications of Evolutionary Computation (EvoApplications 2016, EvoIASP 2016). Lecture Notes in Computer Science.

@INPROCEEDINGS{Tran2016bare,
AUTHOR = {Tran, Binh and Xue, Bing and Zhang, Mengjie},
TITLE = {Bare-Bone Particle Swarm Optimisation for Simultaneously Discretising and Selecting Features for High-Dimensional Classification},
BOOKTITLE = {Applications of Evolutionary Computation: 19th European Conference, EvoApplications 2016, Porto, Portugal, Proceedings, Part I},
YEAR = {2016},
PAGES = {701–718},
PUBLISHER = {Springer}
}

Feature selection and discretisation have shown their effectiveness for data preprocessing especially for high-dimensional data with many irrelevant features. While feature selection selects only relevant features, feature discretisation finds a discrete representation of data that contains enough information but ignoring some minor fluctuation. These techniques are usually applied in two stages, discretisation and then selection since many feature selection methods work only on discrete features. Most commonly used discretisation methods are univariate in which each feature is discretised independently; therefore, the feature selection stage may not work efficiently since information showing feature interaction is not considered in the discretisation process. In this study, we propose a new method called PSO-DFS using bare-bone particle swarm optimisation (BBPSO) for discretisation and feature selection in a single stage. The results on ten high-dimensional datasets show that PSO-DFS obtains a substantial dimensionality reduction for all datasets. The classification performance is significantly improved or at least maintained on nine out of ten datasets by using the transformed “small” data obtained from PSO-DFS. Compared to applying the two-stage approach which uses PSO for feature selection on the discretised data, PSO-DFS achieves better performance on six datasets, and similar performance on three datasets with a much smaller number of features selected.

Binh Tran, Bing Xue and Mengjie Zhang. " Improved PSO for Feature Selection on High-Dimensional Datasets". Proceedings of the 10th International Conference on Simulated Evolution and Learning (SEAL 2014). Lecture Notes in Computer Science. Vol. 8886. Dunedin, New Zealand. December 15-18, 2014. pp. 503-515.

@INPROCEEDINGS{Tran2014improved,
AUTHOR = {Binh Tran, Bing Xue and Mengjie Zhang},
TITLE = {Improved PSO for Feature Selection on High-Dimensional Datasets},
BOOKTITLE = {Proceedings of the 10th International Conference on Simulated Evolution and Learning},
YEAR = {2014},
PAGES = {503--515},
PUBLISHER = {Springer}
}

Classification on high-dimensional (i.e. thousands of dimensions) data typically requires feature selection (FS) as a pre-processing step to reduce the dimensionality. However, FS is a challenging task even on datasets with hundreds of features. This paper proposes a new particle swarm optimisation (PSO) based FS approach to classification problems with thousands or tens of thousands of features. The proposed algorithm is examined and compared with three other PSO based methods on five high-dimensional problems of varying difficulty. The results show that the proposed algorithm can successfully select a much smaller number of features and significantly increase the classification accuracy over using all features. The proposed algorithm outperforms the other three PSO methods in terms of both the classification performance and the number of features. Meanwhile, the proposed algorithm is computationally more efficient than the other three PSO methods because it selects a smaller number of features and employs a new fitness evaluation strategy.

Binh Tran, Bing Xue and Mengjie Zhang. " Overview of Particle Swarm Optimisation for Feature Selection in Classification". Proceedings of the 10th International Conference on Simulated Evolution and Learning (SEAL 2014). Lecture Notes in Computer Science. Vol. 8886. Dunedin, New Zealand. December 15-18, 2014. pp. 605-617.

@INPROCEEDINGS{Tran2014overview,
AUTHOR = {Binh Tran, Bing Xue and Mengjie Zhang},
TITLE = {Overview of Particle Swarm Optimisation for Feature Selection in Classification},
BOOKTITLE = {Proceedings of the 10th International Conference on Simulated Evolution and Learning},
YEAR = {2014},
PAGES = {605--617},
PUBLISHER = {Springer}
}

Feature selection is a process of selecting a subset of relevant features from a large number of original features to achieve similar or better classification performance and improve the computation efficiency. As an important data pre-processing technique, research into feature selection has been carried out over the past four decades. Determining an optimal feature subset is a complicated problem. Due to the limitations of conventional methods, evolutionary computation (EC) has been proposed to solve feature selection problems. Particle swarm optimisation (PSO) is an EC technique which recently has caught much interest from researchers in the field. This paper presents a review of PSO for feature selection in classification. After describing the background of feature selection and PSO, recent work involving PSO for feature selection is reviewed. Current issues and challenges are also presented for future research.