Malware Detection Using Machine Learning
Essentially, this study sets out on an extensive investigation of the complex field of malware detection: a voyage that starts with a threat taxonomy, traverses the shortcomings of traditional approaches, and reveals the opportunities presented by a wide range of machine learning strategies. The introduction sets the stage for the story that follows as it attempts to create a flexible and robust framework to combat malware's constant growth in the digital sphere. With every step we take across this terrain, a new chapter in the history of cybersecurity is revealed.

II. LITERATURE REVIEW

This literature review offers an in-depth exploration of existing research and techniques in the field of malware detection. Malware, a ubiquitous threat to computer systems, is a broad category of harmful software intended to compromise or corrupt systems without authorization from the user. In addressing the ever-evolving issues of malware detection, this study highlights the shortcomings of conventional signature-based approaches in the face of malware's ever-increasing sophistication, which includes polymorphic and self-updating behaviours.

The literature explores a range of machine learning methods used to identify malware, illustrating the trend toward data-driven strategies. One example of the potential of ensemble methods in classification tasks is the higher performance of boosted decision trees trained on n-gram data compared with Naive Bayes and Support Vector Machines. Other methods investigate Hidden Markov Models and Profile Hidden Markov Models, along with association rules obtained from Windows API execution sequences, to identify malware variants. The literature also shows how neural networks and Self-Organizing Maps can be used to detect polymorphic malware and to find patterns in the behaviour of Windows executable files that indicate the presence of viruses. Taken as a whole, these findings highlight the need for more advanced and complex methods to combat the sophisticated nature of malware.

The literature also discusses the difficulties of handling large datasets of ".asm" and ".bytes" files, as well as data preprocessing and feature extraction. Although the sheer volume of these files presents major complications, they offer low-level insight into software activity. The effectiveness of machine learning models such as K-Nearest Neighbours (KNN) is improved by reducing dimensionality and noise through feature selection procedures such as correlation-based filtering and wrapper techniques.

In addition, logistic regression is presented as a binary classification technique that models the probability of malware being present by means of the logistic (Sigmoid) function. Based on pertinent qualities and characteristics, this approach has the potential to model the likelihood of a system becoming infected with malware. The experimental results, which show that the Random Model, KNN, and Logistic Regression are the most dependable of the studied algorithms for malware detection, are presented at the end. While the approaches do not completely reach the zero-false-positive target, they do show a significant boost in the overall detection rate, suggesting that they could be a useful addition to existing antivirus programs.

In summary, the literature study offers insightful information on the current state of machine learning-based malware detection. It emphasizes how crucial data-driven approaches are becoming in combating the dynamic nature of malware threats, as well as the difficulties posed by large-scale data analysis.

III. DATASETS

We used three datasets: a training dataset, a test dataset, and a "scale-up" dataset of up to 200 GB. The numbers of malware files and clean files in these datasets are shown in the first two columns of Table I. As stated above, our main goal is to achieve malware detection with only a few (if possible, zero) false positives; therefore, the number of clean files in the training dataset (and in the scale-up dataset) is much larger than the number of malware files.

From the whole feature set that we created for malware detection, 308 binary features were selected for the experiments presented in this paper. Files that generate identical values for the chosen feature set were counted only once. The last two columns of Table I show the total number of unique combinations of the 308 selected binary features in the training, test, and scale-up datasets, respectively. Note that the number of clean combinations (i.e. combinations of feature values for the clean files) in the three datasets is much smaller than the number of unique malware combinations.

TABLE I
NUMBER OF FILES AND UNIQUE COMBINATIONS OF FEATURE VALUES IN THE TRAINING, TEST, AND SCALE-UP (UP TO 200 GB) DATASETS

Database        Files                      Unique combinations
            malware      clean            malware      clean
Training       6955       695535             6922        315
Test          21740         6521              609        220
Scale-up  approx. 2M  approx. 80M            8817      12230

TABLE II
MALWARE DISTRIBUTION IN THE TRAINING AND TEST DATASETS.
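The de-duplication step used for Table I (counting each unique combination of binary feature values only once) can be sketched as follows; the data, feature count, and variable names here are illustrative stand-ins, not the paper's actual dataset:

```python
import numpy as np

# Hypothetical toy data: 6 files described by 4 binary features
# (the paper uses 308 features; this shape is illustrative only).
features = np.array([
    [0, 1, 0, 1],
    [0, 1, 0, 1],   # duplicate of the first file -> counted once
    [1, 0, 0, 0],
    [1, 0, 0, 0],
    [1, 1, 1, 0],
    [0, 1, 0, 1],   # another duplicate
])

# Each file is a row; files with identical feature vectors collapse
# into a single unique combination, as in Table I.
unique_rows = np.unique(features, axis=0)
print(len(features), "files ->", len(unique_rows), "unique combinations")
```

The same per-class counting (malware vs. clean) is obtained by applying np.unique separately to the rows of each class.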
from sklearn.neighbors import KNeighborsClassifier

# n_neighbors=5 and weights='uniform' are scikit-learn defaults
# (assumed here, since the original call is truncated)
knn_classifier = KNeighborsClassifier(n_neighbors=5, weights='uniform',
                                      algorithm='auto', leaf_size=30, p=2,
                                      metric='minkowski', metric_params=None,
                                      n_jobs=1)
knn_classifier.fit(X, y)
knn_classifier.predict(X)
knn_classifier.predict_proba(X)

Logistic regression is a popular approach for binary classification problems, such as malware detection in computer systems. It is useful for spotting dangerous software since it can model the likelihood that a system will become infected by malware.

The logistic (Sigmoid) function is used in logistic regression to determine the likelihood that a particular occurrence falls into the positive class. In its general form,

f(x) = L / (1 + e^(-k(x - x0)))

where
L = the maximum value of the curve
e = the natural logarithm base (Euler's number)
x0 = the x-value of the sigmoid midpoint
k = the steepness of the curve (the logistic growth rate)

For detecting malware, logistic regression models the likelihood that a given system or file contains malware. Based on the attributes and qualities of the system or file, the algorithm determines this likelihood, and by defining a threshold it categorizes the instance as malware (1) or non-malware (0).

The classifier is trained for one chosen label (in our case either malware or clean), so that in the end the files situated on one side of the learned linear separator have exactly that label (assuming that the two classes are separable). The files on the other side of the linear separator can have mixed labels.

Machine learning algorithms for classification and regression include K-Nearest Neighbours (KNN). It categorizes data points in a feature space according to how close they are to other data points. On its own, KNN is typically not a good option for malware detection in computer systems, as it is unable to capture intricate patterns and malware-specific traits. To detect malware properly, more sophisticated techniques that take a wider variety of traits and behaviours into account are needed.

Mathematical representation of KNN: for a given data item X, KNN determines the distance to its k nearest neighbours and classifies it into the category that is most prevalent among them. Distance metrics, such as Euclidean distance, form the foundation of this representation.

The K-Nearest Neighbours method for malware detection thus entails classifying a particular data point as either malware or non-malware based on the dominant class among its k nearest neighbours. Because this method relies solely on closeness in the feature space, it is not effective on its own for malware identification, since malware has complex and distinctive properties.

V. RESULTS

We performed cross-validation tests by running the three algorithms (the Random Model, KNN, and Logistic Regression) presented in Section IV on the training dataset described in Section III (6922 malware unique combinations and 315 clean unique combinations).
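A cross-validation run of this kind can be sketched with scikit-learn as follows; the synthetic data, fold count, and hyperparameters are assumptions for illustration, with DummyClassifier standing in for the Random Model:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.dummy import DummyClassifier  # stand-in for the Random Model

rng = np.random.default_rng(0)
# Synthetic stand-in for the binary feature combinations (not the real dataset)
X = rng.integers(0, 2, size=(300, 20))
y = rng.integers(0, 2, size=300)

models = {
    "Random Model": DummyClassifier(strategy="uniform", random_state=0),
    "KNN": KNeighborsClassifier(n_neighbors=5),
    "Logistic Regression": LogisticRegression(max_iter=1000),
}
for name, model in models.items():
    # 5-fold cross-validation; scores are per-fold accuracies
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: mean accuracy {scores.mean():.2f}")
```

On the real data, accuracy alone is not enough: the false-positive rate on the clean class must be tracked separately, given the zero-false-positive goal stated above.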
For the Random Model, the following function was used:
▪ The main function of the Random Model is in charge of producing random probabilities for each class. To generate random numbers, Python functions such as np.random.rand() can be used.
For the KNN classifier, the following functions were used:
▪ Distance metric function: KNN relies on a distance metric to measure the similarity between data points. Common distance functions include Euclidean, Manhattan, and Minkowski distance.
▪ Voting function: KNN classifies a data point according to the majority class among its k nearest neighbours using a voting procedure.
For Logistic Regression, the following function was used:
▪ The logistic function, also known as the sigmoid function, is used in logistic regression to represent the likelihood that a given occurrence falls into the positive class. The logistic function is defined as:

P(y = 1 | X) = 1 / (1 + e^(-(X · w + b)))

Algorithm 3: Logistic Regression

from sklearn.linear_model import SGDClassifier

# loss='log_loss' gives the logistic-regression form of SGD
# (the original listing used loss='hinge', which trains a linear SVM instead);
# max_iter and tol are set to current scikit-learn defaults
sgd_classifier = SGDClassifier(loss='log_loss', penalty='l2', alpha=0.0001,
                               l1_ratio=0.15, fit_intercept=True,
                               max_iter=1000, tol=1e-3, shuffle=True,
                               verbose=0, epsilon=0.1, n_jobs=1,
                               random_state=None, learning_rate='optimal',
                               eta0=0.0, power_t=0.5, class_weight=None,
                               warm_start=False, average=False)
sgd_classifier.fit(X, y, coef_init=None, intercept_init=None,
                   sample_weight=None)
sgd_classifier.predict(X)

Binary classification is handled by the machine learning algorithm logistic regression. A logistic (Sigmoid) function is used to model the likelihood that a data point falls into a specific class. The Stochastic Gradient Descent (SGD) form of Logistic Regression in Scikit-Learn is described in the listing above: it initializes the classifier and offers methods to fit the model and produce predictions.
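The thresholded sigmoid decision described by this equation can be sketched as follows; the weights, bias, and feature values are made up for illustration and are not values learned in the paper:

```python
import numpy as np

def sigmoid(z):
    # Logistic function: maps any real score to a probability in (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def predict_malware(X, w, b, threshold=0.5):
    """P(y=1|X) = 1 / (1 + e^(-(X.w + b))), then threshold to 0/1.
    w and b are illustrative parameters, not learned coefficients."""
    p = sigmoid(X @ w + b)
    return p, (p >= threshold).astype(int)

# Two hypothetical files described by three features each
X = np.array([[2.0, 1.0, 0.0],
              [-1.0, 0.5, 0.2]])
w = np.array([1.5, -0.5, 2.0])   # made-up weights
b = -1.0

probs, labels = predict_malware(X, w, b)
print(probs, labels)  # first file classified 1 (malware), second 0 (clean)
```

Raising the threshold above 0.5 trades detection rate for fewer false positives, which matters given the zero-false-positive goal stated earlier.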
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# result_x (feature matrix) and result_y (labels) are prepared earlier (not shown)
xtsne = TSNE(perplexity=50)
results = xtsne.fit_transform(result_x)  # fit_transform takes only the data matrix
vis_x = results[:, 0]
vis_y = results[:, 1]
plt.scatter(vis_x, vis_y, c=result_y, cmap=plt.cm.get_cmap("jet", 9))
plt.colorbar(ticks=range(9))
plt.clim(0.5, 9)
plt.show()

Fig 3.1 – Cross Validation, Logistic Regression
# Record the class label, size (in MB), and name of each known file;
# statinfo holds the os.stat() result for the current file
if file in filenames:
    i = filenames.index(file)
    class_bytes.append(class_y[i])
    sizebytes.append(statinfo.st_size / (1024 * 1024))  # convert bytes to MB
    fnames.append(file)
samples that are not picked up by conventional detection techniques are used for training.)

ACKNOWLEDGMENT

We would like to express our heartfelt gratitude to Mr. Adil Husain Rathar for his invaluable guidance and mentorship throughout this project and research endeavour. We truly appreciate his persistent support and the clear direction he gave us, which allowed us to complete our work effectively within the allotted time. We also want to express our sincere gratitude to every member of our group who worked hard to ensure the successful completion of this research project by lending their knowledge and skills. Their commitment, collaboration, and invaluable insights were crucial in helping us reach our study objectives and identify workable solutions. Finally, we would like to thank our university for giving us a suitable platform and access to a large library, both of which tremendously helped with our research. These resources were essential to our ability to carry out fruitful research and contribute to the scholarly community.

REFERENCES

[1] I. Santos, Y. K. Penya, J. Devesa, and P. G. Garcia, "N-grams-based file signatures for malware detection," 2009.
[2] K. Rieck, T. Holz, C. Willems, P. Düssel, and P. Laskov, "Learning and classification of malware behavior," in DIMVA '08: Proceedings of the 5th International Conference on Detection of Intrusions and Malware, and Vulnerability Assessment. Berlin, Heidelberg: Springer-Verlag, 2008, pp. 108–125.
E. Konstantinou, "Metamorphic virus: Analysis and detection," Technical Report RHUL-MA-2008-2, Search Security Award M.Sc. thesis, 93 pages, 2008.
J. Z. Kolter and M. A. Maloof, "Learning to detect and classify malicious executables in the wild," Journal of Machine Learning Research, vol. 7, pp. 2721–2744, December 2006, Special Issue on Machine Learning in Computer Security.
[3] Y. Ye, D. Wang, T. Li, and D. Ye, "IMDS: intelligent malware detection system," in KDD, P. Berkhin, R. Caruana, and X. Wu, Eds. ACM, 2007, pp. 1043–1047.
[4] M. Chandrasekaran, V. Vidyaraman, and S. J. Upadhyaya, "SpyCon: Emulating user activities to detect evasive spyware," in IPCCC. IEEE Computer Society, 2007, pp. 502–509.
[5] M. R. Chouchane, A. Walenstein, and A. Lakhotia, "Using Markov Chains to filter machine-morphed variants of malicious programs," in Malicious and Unwanted Software, 2008. MALWARE 2008. 3rd International Conference on, 2008, pp. 77–84.
[6] M. Stamp, S. Attaluri, and S. McGhee, "Profile hidden Markov models and metamorphic virus detection," Journal in Computer Virology, 2008.
[7] R. Santamarta, "Generic detection and classification of polymorphic malware using neural pattern recognition," 2006.
[8] I. Yoo, "Visualizing Windows executable viruses using self-organizing maps," in VizSEC/DMSEC '04: Proceedings of the 2004 ACM Workshop on Visualization and Data Mining for Computer Security. New York, NY, USA: ACM, 2004, pp. 82–89.
[9] F. Rosenblatt, "The perceptron: a probabilistic model for information storage and organization in the brain," pp. 89–114, 1988.
[10] T. Mitchell, Machine Learning. McGraw-Hill Education (ISE Editions), October 1997.