ArgusLab Technical Report 2017-1
Android Malware Clustering through Malicious Payload Mining by Yuping Li, Jiyong Jang, Xin Hu, and Xinming Ou
Abstract: Clustering has been well studied for desktop malware analysis
as an effective triage method. Conventional similarity-based
clustering techniques, however, cannot be immediately applied to Android malware analysis due to the excessive use of
third-party libraries in Android application development and
Android application repackaging techniques. For example,
two Android malicious apps from different malware families
may share a high level of overall similarity if both apps include
the same popular libraries or both apps are repackaged based
on the same original app.
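
The effect described above can be illustrated with a minimal sketch. It assumes, purely for illustration, that each app is represented as a set of string features and that overall similarity is measured with the Jaccard index; the feature names and set sizes are hypothetical and are not taken from the report.

```python
def jaccard(a, b):
    """Jaccard similarity between two feature collections."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


# Hypothetical feature sets: one popular library shared by both apps,
# plus a small family-specific payload in each app.
library = {f"lib:method_{i}" for i in range(90)}
payload_x = {f"familyX:method_{i}" for i in range(10)}
payload_y = {f"familyY:method_{i}" for i in range(10)}

app_a = library | payload_x  # sample from family X
app_b = library | payload_y  # sample from family Y

overall_similarity = jaccard(app_a, app_b)
payload_similarity = jaccard(payload_x, payload_y)

print(f"overall similarity: {overall_similarity:.3f}")
print(f"payload similarity: {payload_similarity:.3f}")
```

In this toy setting the two apps look highly similar overall although their payloads share nothing, which is the situation that motivates comparing payloads rather than whole apps.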
In this paper, we propose novel malicious payload mining
techniques to efficiently perform Android malware clustering. In particular, we design a robust method to precisely
exclude legitimate library code from Android malware while
retaining malicious code segments, even if the malicious code
is injected under popular library names. We design and
implement an Android malware clustering approach through
iterative mining of malicious payload and checking whether
malware samples share the same version of malicious payload. Our approach utilizes a traditional hierarchical clustering
technique and an efficient fuzzy hashing fingerprint representation. We also develop three optimization techniques to
significantly improve scalability, and our performance
evaluation confirms that our approach can analyze malware families at large scale with little or no
impact on accuracy. To evaluate the overall performance, we
first leverage VirusTotal reports, clustering techniques, and
manual efforts to separate collected malware samples into
260 sub-families, and then construct 10 testing datasets by
shuffling the sub-families and randomly selecting 30 sub-families
for each dataset. When applied to these 10 testing datasets, the
proposed clustering approach achieves an average precision of 0.984
and recall of 0.959.
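
The payload mining idea summarized above can be illustrated with a small sketch. It assumes, purely for illustration, that each app is a set of string features hashed into a fixed-size bit-vector fingerprint, that library code is excluded by clearing the bits of a known library fingerprint, and that a shared payload is approximated by the bitwise intersection of two library-free fingerprints. The function names, the fingerprint size, and the feature strings are hypothetical and should not be read as the report's implementation.

```python
import hashlib

FINGERPRINT_BITS = 4096  # hypothetical fingerprint size


def fingerprint(features):
    """Hash string features into a bit-vector stored as a Python int."""
    bits = 0
    for feature in features:
        digest = hashlib.sha1(feature.encode("utf-8")).digest()
        position = int.from_bytes(digest[:4], "big") % FINGERPRINT_BITS
        bits |= 1 << position
    return bits


def exclude_library(app_fp, library_fp):
    """Clear every bit that is set in the library fingerprint."""
    return app_fp & ~library_fp


def shared_payload(fp_a, fp_b):
    """Keep only the bits that both fingerprints have in common."""
    return fp_a & fp_b


def popcount(bits):
    """Number of set bits in a fingerprint."""
    return bin(bits).count("1")


# Hypothetical apps: the same library and payload, different host code.
library = {f"lib/ads/method_{i}" for i in range(50)}
host_a = {f"game/method_{i}" for i in range(40)}
host_b = {f"wallpaper/method_{i}" for i in range(40)}
payload = {f"payload/method_{i}" for i in range(15)}

library_fp = fingerprint(library)
clean_a = exclude_library(fingerprint(library | host_a | payload), library_fp)
clean_b = exclude_library(fingerprint(library | host_b | payload), library_fp)

shared = shared_payload(clean_a, clean_b)
payload_fp = exclude_library(fingerprint(payload), library_fp)

print("bits in shared fingerprint:", popcount(shared))
print("bits in payload fingerprint:", popcount(payload_fp))
```

Because the fingerprint is lossy, the shared bits may include a few hash collisions from unrelated host code, and payload bits that collide with library bits are lost; the sketch only guarantees that the surviving payload bits are contained in the shared fingerprint.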