ArgusLab Technical Report 2017-1

Android Malware Clustering through Malicious Payload Mining by Yuping Li, Jiyong Jang, Xin Hu, and Xinming Ou

Abstract: Clustering has been well studied for desktop malware analysis as an effective triage method. Conventional similarity-based clustering techniques, however, cannot be directly applied to Android malware analysis because Android application development relies heavily on third-party libraries and because Android applications are frequently repackaged. For example, two malicious Android apps from different malware families may share a high level of overall similarity if both include the same popular libraries or both are repackaged from the same original app.
In this paper, we propose novel malicious payload mining techniques to efficiently perform Android malware clustering. In particular, we design a robust method to precisely exclude legitimate library code from Android malware while retaining malicious code segments, even if the malicious code is injected under popular library names. We design and implement an Android malware clustering approach that iteratively mines malicious payloads and checks whether malware samples share the same version of a malicious payload. Our approach uses a traditional hierarchical clustering technique and an efficient fuzzy hashing fingerprint representation. We also develop three optimization techniques that significantly improve scalability, and our performance evaluation confirms that the approach can analyze a large number of malware families with little or no impact on accuracy. To evaluate the overall performance, we first use VirusTotal reports, clustering techniques, and manual effort to separate the collected malware samples into 260 sub-families; we then construct 10 testing datasets by shuffling the sub-families and randomly selecting 30 sub-families for each dataset. On these 10 testing datasets, the proposed clustering approach achieves an average precision of 0.984 and an average recall of 0.959.
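To make the reported precision and recall figures concrete, the following is a minimal sketch of one common way to score a clustering against reference labels. It is an assumption, not necessarily the definition used in the paper: precision credits each predicted cluster with its largest overlap with any reference sub-family, and recall credits each reference sub-family with its largest overlap with any predicted cluster. The function name `clustering_precision_recall` and the dictionary inputs are hypothetical.

```python
from collections import Counter


def clustering_precision_recall(predicted, reference):
    """Score a clustering against reference labels (illustrative sketch).

    predicted: dict mapping sample id -> predicted cluster label
    reference: dict mapping sample id -> reference sub-family label
    Assumes both dicts cover the same, non-empty set of samples.
    """
    samples = list(reference)
    n = len(samples)

    # Count how many samples fall into each (cluster, sub-family) pair.
    overlap = Counter((predicted[s], reference[s]) for s in samples)

    best_per_cluster = {}  # largest single-family overlap for each cluster
    best_per_family = {}   # largest single-cluster overlap for each family
    for (cluster, family), count in overlap.items():
        best_per_cluster[cluster] = max(best_per_cluster.get(cluster, 0), count)
        best_per_family[family] = max(best_per_family.get(family, 0), count)

    precision = sum(best_per_cluster.values()) / n
    recall = sum(best_per_family.values()) / n
    return precision, recall


if __name__ == "__main__":
    # Hypothetical toy labels: four samples, two reference sub-families.
    reference = {"a": "X", "b": "X", "c": "Y", "d": "Y"}
    predicted = {"a": 0, "b": 0, "c": 0, "d": 1}
    print(clustering_precision_recall(predicted, reference))
```

Under this definition, placing every sample in its own cluster yields perfect precision but low recall, while merging everything into one cluster does the opposite, which is why the two numbers are reported together.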
