Clustering Microsoft Windows Executables based on TF-IDF and API Information

Jonghwa Park
Gyoosik Kim
Youngsup Hwang
Seong-je Cho

Keywords

Clustering, Windows Executable, TF-IDF, API, K-means, Random Forest

Abstract

The rate of illegal software use is 39% worldwide, and malware is common in illegally used software.
To protect against malware attacks, we use software filtering, which checks whether a program under
test is equivalent to an original one. This requires comparing the program against every legitimate
program on the market, so we reduce the number of comparisons by clustering the programs on the market.
Every market assigns programs to categories such as image viewer, video player, audio player, and
messenger, but it is not clear that these categories are the best fit for filtering malware. We propose
new categories that, in our experiments, are better suited to classification. Our categories are
generated automatically by the K-means clustering algorithm based on TF-IDF and API information.
Experimental results show that our clustering scheme classifies malware better than the existing
categories.
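
The clustering step described above can be illustrated with a small, self-contained sketch. It is not the authors' implementation: the program names and API lists are hypothetical, the smoothed IDF form (log(N/df) + 1) and the L2 normalization are assumptions, and the centroids are seeded from the first k vectors only to keep the run reproducible. Each program is treated as a "document" whose "terms" are the API names it uses.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Turn lists of API names into L2-normalized TF-IDF vectors."""
    vocab = sorted({api for doc in docs for api in doc})
    n_docs = len(docs)
    # Document frequency: in how many programs each API appears.
    df = Counter(api for doc in docs for api in set(doc))
    # Assumed IDF variant: log(N / df) + 1, so shared APIs keep a small weight.
    idf = {api: math.log(n_docs / df[api]) + 1.0 for api in vocab}
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vec = [tf[api] / len(doc) * idf[api] for api in vocab]
        norm = math.sqrt(sum(v * v for v in vec)) or 1.0
        vectors.append([v / norm for v in vec])
    return vocab, vectors

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(vectors, k, iters=20):
    """Plain K-means; centroids start at the first k vectors (an assumption)."""
    centroids = [list(v) for v in vectors[:k]]
    labels = None
    for _ in range(iters):
        # Assignment step: nearest centroid by squared Euclidean distance.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(v, centroids[c]))
            for v in vectors
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# Hypothetical API usage for four programs (illustrative only).
programs = {
    "viewer_a": ["CreateFileW", "ReadFile", "BitBlt", "StretchBlt"],
    "player_a": ["CreateFileW", "ReadFile", "waveOutOpen", "waveOutWrite"],
    "viewer_b": ["CreateFileW", "BitBlt", "StretchBlt", "GetDC"],
    "player_b": ["ReadFile", "waveOutOpen", "waveOutWrite", "waveOutClose"],
}

vocab, vectors = tfidf_vectors(list(programs.values()))
labels = kmeans(vectors, k=2)
for name, label in zip(programs, labels):
    print(name, "-> cluster", label)
```

In this toy data the two viewers share graphics APIs and the two players share audio APIs, so they fall into separate clusters. A program under test would then only need to be compared against the members of its nearest cluster rather than against every program on the market.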