An effective positive-unlabeled learning method for detecting a large scale of malware variants

Khan, Mohammad Faham
University of Guelph

Malicious software (malware) can evolve quickly into many variants that evade existing detection techniques. Machine-learning-based techniques perform well at detecting malware variants, but in industry the volume of variants grows quickly and labelling data is labour-intensive. Companies therefore tend to label only a small portion of the malware samples and treat the remaining unlabeled samples as benign, which limits accuracy. To address this problem, this thesis proposes a cost-sensitive boosting method that trains a detection model on malicious and unlabeled executables to improve accuracy. Extensive experiments demonstrate that the proposed method, when applied to machine learning algorithms trained on positive and unlabeled datasets, improves the final results. It improves the reliability of the machine learning models and, during training, improves speed and convergence.
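
The cost-sensitive idea in the abstract can be illustrated with a small sketch. This is not the boosting method proposed in the thesis. It substitutes a plain one-dimensional logistic regression trained on synthetic data, and every name and value in it (make_pu_data, POSITIVE_COST, the Gaussian clusters, the 30% labelling rate) is a hypothetical choice made for illustration. It compares the naive approach, in which unlabeled samples count as benign at full cost, with a cost-sensitive variant in which errors on labeled malware cost more than errors on unlabeled samples.

```python
import math
import random

POSITIVE_COST = 5.0   # assumed cost of misclassifying a labeled malware sample
UNLABELED_COST = 1.0  # assumed cost of misclassifying an unlabeled sample


def sigmoid(z):
    # Numerically safe logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def make_pu_data(n_pos=200, n_neg=200, label_frac=0.3, seed=0):
    # Synthetic 1-D features: malware clusters near +2, benign near -2.
    rng = random.Random(seed)
    pos = [rng.gauss(2.0, 1.0) for _ in range(n_pos)]
    neg = [rng.gauss(-2.0, 1.0) for _ in range(n_neg)]
    n_lab = int(label_frac * n_pos)
    labeled_pos = pos[:n_lab]   # the small labeled malware set
    hidden_pos = pos[n_lab:]    # malware that nobody labeled
    unlabeled = hidden_pos + neg
    return labeled_pos, unlabeled, hidden_pos


def train_weighted_logreg(xs, ys, costs, lr=0.5, epochs=400):
    # Gradient descent on a cost-weighted log loss.
    w, b = 0.0, 0.0
    total = sum(costs)
    for _ in range(epochs):
        gw = gb = 0.0
        for x, y, c in zip(xs, ys, costs):
            err = sigmoid(w * x + b) - y
            gw += c * err * x
            gb += c * err
        w -= lr * gw / total
        b -= lr * gb / total
    return w, b


def recall(w, b, xs):
    # Fraction of the given samples classified as malware.
    return sum(1 for x in xs if sigmoid(w * x + b) >= 0.5) / len(xs)


def compare_recall():
    labeled_pos, unlabeled, hidden_pos = make_pu_data()
    xs = labeled_pos + unlabeled
    # Unlabeled samples are treated as benign (label 0) in both runs.
    ys = [1.0] * len(labeled_pos) + [0.0] * len(unlabeled)
    naive_costs = [1.0] * len(xs)
    cs_costs = ([POSITIVE_COST] * len(labeled_pos)
                + [UNLABELED_COST] * len(unlabeled))
    naive = train_weighted_logreg(xs, ys, naive_costs)
    cost_sensitive = train_weighted_logreg(xs, ys, cs_costs)
    # Recall is measured on the malware hidden inside the unlabeled set.
    return recall(*naive, hidden_pos), recall(*cost_sensitive, hidden_pos)


if __name__ == "__main__":
    naive_recall, cs_recall = compare_recall()
    print("naive recall on hidden malware:         ", naive_recall)
    print("cost-sensitive recall on hidden malware:", cs_recall)
```

In this toy setting the naive model sees most malware-like samples carrying a benign label, so its decision boundary drifts away from the malware cluster. Raising the cost of errors on labeled malware pulls the boundary back. The cost ratio is a tuning parameter here, and how the thesis chooses or adapts it within boosting is not stated in the abstract.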

Positive-unlabeled learning, machine learning, malware, co-occurrence matrix, decision boundary, Markov chain, boosting (machine learning), Markov model, convolutional neural network, linear regression, logistic regression, n-gram model, n-gap, cost sensitive, malware detection, unlabeled dataset, positive unlabeled malware dataset
J. Zhang, M. F. Khan, X. Lin and Z. Qin, "An Optimized Positive-Unlabeled Learning Method for Detecting a Large Scale of Malware Variants," 2019 IEEE Conference on Dependable and Secure Computing (DSC), Hangzhou, China, 2019, pp. 1-8, doi: 10.1109/DSC47296.2019.8937650.