Malware Detection Using Machine Learning
Essentially, this study sets out on an extensive investigation of the complex field of malware detection: a voyage that starts with a threat taxonomy, traverses the shortcomings of traditional approaches, and reveals the opportunities presented by a wide range of machine learning strategies. The introduction sets the stage for the story that follows as it attempts to create a flexible and robust framework to combat malware's constant growth in the digital sphere. With every step we take across this terrain, a new chapter in the history of cybersecurity is revealed.

II. LITERATURE REVIEW

This literature review offers an in-depth exploration of existing research and techniques in the field of malware detection. Malware, a ubiquitous threat to computer systems, is a broad category of harmful software intended to compromise or corrupt systems without authorization from the user. In addressing the ever-evolving issues of malware detection, this study highlights the shortcomings of conventional signature-based approaches in the face of malware's ever-increasing sophistication, which includes polymorphic and self-updating behaviours.

The literature explores a range of machine learning methods used to identify malware, illustrating the trend toward data-driven strategies. One example of the potential of ensemble methods in classification tasks is the higher performance of boosted decision trees trained on n-gram data compared with Naive Bayes and Support Vector Machines. Other methods investigate Hidden Markov Models and Profile Hidden Markov Models, along with association rules obtained from Windows API execution sequences, to identify malware variants. The literature also shows how neural networks and Self-Organizing Maps can be used to detect polymorphic malware and to find patterns in the behaviour of Windows executable files that indicate the presence of viruses. Taken as a whole, these findings highlight the need for more advanced and complex methods to combat the sophisticated nature of malware.

The literature also discusses the difficulties of handling large datasets of ".asm" and ".bytes" files, as well as data preprocessing and feature extraction. Although the sheer volume of these files presents major complications, they offer low-level insight into software activity. The effectiveness of machine learning models such as K-Nearest Neighbours (KNN) is improved by reducing dimensionality and noise through feature selection procedures such as correlation-based filtering and wrapper techniques.

In addition, logistic regression is presented as a binary classification technique that models the probability of malware being present by means of the logistic (Sigmoid) function. Based on pertinent qualities and characteristics, this approach has the potential to model the likelihood of a system becoming infected with malware. The experimental results, which show that the Random Model, KNN, and Logistic Regression are the most dependable of the studied algorithms for malware detection, are presented at the end. While the approaches do not completely reach the zero-false-positive target, they do show a significant boost in the overall detection rate, suggesting that they could be a useful addition to existing antivirus programs.

In summary, the literature study offers insightful information on the current state of machine learning-based malware detection. It emphasizes how crucial data-driven approaches are becoming in combating the dynamic nature of malware threats, as well as the difficulties posed by large-scale data analysis.

III. DATASETS

We used three datasets: a training dataset, a test dataset, and a "scale-up" dataset of up to 200 GB. The numbers of malware files and clean files in these datasets are shown in the first two columns of Table I. As stated above, our main goal is to achieve malware detection with only a few (if possible, zero) false positives; therefore, the number of clean files in the training dataset (and in the scale-up dataset) is much larger than the number of malware files.

From the whole feature set that we created for malware detection, 308 binary features were selected for the experiments presented in this paper. Files that generate identical values for the chosen feature set were counted only once. The last two columns of Table I show the total number of unique combinations of the 308 selected binary features in the training, test, and scale-up datasets, respectively. Note that the number of clean combinations (i.e. combinations of feature values for the clean files) in the three datasets is much smaller than the number of unique malware combinations.

TABLE I
NUMBER OF FILES AND UNIQUE COMBINATIONS OF FEATURE VALUES IN THE TRAINING, TEST, AND SCALE-UP (UP TO 200 GB) DATASETS

Database        Files                      Unique combinations
            malware      clean            malware      clean
Training       6955       695535             6922        315
Test          21740         6521              609        220
Scale-up  approx. 2M  approx. 80M            8817      12230

TABLE II
MALWARE DISTRIBUTION IN THE TRAINING AND TEST DATASETS.
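The de-duplication step used for Table I (counting each unique combination of binary feature values only once) can be sketched as follows; the data, feature count, and variable names here are illustrative stand-ins, not the paper's actual dataset:

```python
import numpy as np

# Hypothetical toy data: 6 files described by 4 binary features
# (the paper uses 308 features; this shape is illustrative only).
features = np.array([
    [0, 1, 0, 1],
    [0, 1, 0, 1],   # duplicate of the first file -> counted once
    [1, 0, 0, 0],
    [1, 0, 0, 0],
    [1, 1, 1, 0],
    [0, 1, 0, 1],   # another duplicate
])

# Each file is a row; files with identical feature vectors collapse
# into a single unique combination, as in Table I.
unique_rows = np.unique(features, axis=0)
print(len(features), "files ->", len(unique_rows), "unique combinations")
```

The same per-class counting (malware vs. clean) is obtained by applying np.unique separately to the rows of each class.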
from sklearn.neighbors import KNeighborsClassifier

# n_neighbors=5 and weights='uniform' are scikit-learn defaults
# (assumed here, since the original call is truncated)
knn_classifier = KNeighborsClassifier(n_neighbors=5, weights='uniform',
                                      algorithm='auto', leaf_size=30, p=2,
                                      metric='minkowski', metric_params=None,
                                      n_jobs=1)
knn_classifier.fit(X, y)
knn_classifier.predict(X)
knn_classifier.predict_proba(X)

Logistic regression is a popular approach for binary classification problems, such as malware detection in computer systems. It is useful for spotting dangerous software since it can model the likelihood that a system will become infected by malware.

The logistic (Sigmoid) function is used in logistic regression to determine the likelihood that a particular occurrence falls into the positive class. In its general form,

f(x) = L / (1 + e^(-k(x - x0)))

where
L = the maximum value of the curve
e = the natural logarithm base (Euler's number)
x0 = the x-value of the sigmoid midpoint
k = the steepness of the curve (the logistic growth rate)

For detecting malware, logistic regression models the likelihood that a given system or file contains malware. Based on the attributes and qualities of the system or file, the algorithm determines this likelihood, and by defining a threshold it categorizes the instance as malware (1) or non-malware (0).

The classifier is trained for one chosen label (in our case either malware or clean), so that in the end the files situated on one side of the learned linear separator have exactly that label (assuming that the two classes are separable). The files on the other side of the linear separator can have mixed labels.

Machine learning algorithms for classification and regression include K-Nearest Neighbours (KNN). It categorizes data points in a feature space according to how close they are to other data points. On its own, KNN is typically not a good option for malware detection in computer systems, as it is unable to capture intricate patterns and malware-specific traits. To detect malware properly, more sophisticated techniques that take a wider variety of traits and behaviours into account are needed.

Mathematical representation of KNN: for a given data item X, KNN determines the distance to its k nearest neighbours and classifies it into the category that is most prevalent among them. Distance metrics, such as Euclidean distance, form the foundation of this representation.

The K-Nearest Neighbours method for malware detection thus entails classifying a particular data point as either malware or non-malware based on the dominant class among its k nearest neighbours. Because this method relies solely on closeness in the feature space, it is not effective on its own for malware identification, since malware has complex and distinctive properties.

V. RESULTS

We performed cross-validation tests by running the three algorithms (the Random Model, KNN, and Logistic Regression) presented in Section IV on the training dataset described in Section III (6922 malware unique combinations and 315 clean unique combinations).
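A cross-validation run of this kind can be sketched with scikit-learn as follows; the synthetic data, fold count, and hyperparameters are assumptions for illustration, with DummyClassifier standing in for the Random Model:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.dummy import DummyClassifier  # stand-in for the Random Model

rng = np.random.default_rng(0)
# Synthetic stand-in for the binary feature combinations (not the real dataset)
X = rng.integers(0, 2, size=(300, 20))
y = rng.integers(0, 2, size=300)

models = {
    "Random Model": DummyClassifier(strategy="uniform", random_state=0),
    "KNN": KNeighborsClassifier(n_neighbors=5),
    "Logistic Regression": LogisticRegression(max_iter=1000),
}
for name, model in models.items():
    # 5-fold cross-validation; scores are per-fold accuracies
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: mean accuracy {scores.mean():.2f}")
```

On the real data, accuracy alone is not enough: the false-positive rate on the clean class must be tracked separately, given the zero-false-positive goal stated above.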
For the Random Model, the following function was used:
▪ The main function of the Random Model is in charge of producing random probabilities for each class. To generate random numbers, Python functions such as np.random.rand() can be used.
For the KNN classifier, the following functions were used:
▪ Distance metric function: KNN relies on a distance metric to measure the similarity between data points. Common distance functions include Euclidean, Manhattan, and Minkowski distance.
▪ Voting function: KNN classifies a data point according to the majority class among its k nearest neighbours using a voting procedure.
For Logistic Regression, the following function was used:
▪ The logistic function, also known as the sigmoid function, is used in logistic regression to represent the likelihood that a given occurrence falls into the positive class. The logistic function is defined as:

P(y = 1 | X) = 1 / (1 + e^(-(X · w + b)))

Algorithm 3: Logistic Regression

from sklearn.linear_model import SGDClassifier

# loss='log_loss' gives the logistic-regression form of SGD
# (the original listing used loss='hinge', which trains a linear SVM instead);
# max_iter and tol are set to current scikit-learn defaults
sgd_classifier = SGDClassifier(loss='log_loss', penalty='l2', alpha=0.0001,
                               l1_ratio=0.15, fit_intercept=True,
                               max_iter=1000, tol=1e-3, shuffle=True,
                               verbose=0, epsilon=0.1, n_jobs=1,
                               random_state=None, learning_rate='optimal',
                               eta0=0.0, power_t=0.5, class_weight=None,
                               warm_start=False, average=False)
sgd_classifier.fit(X, y, coef_init=None, intercept_init=None,
                   sample_weight=None)
sgd_classifier.predict(X)

Binary classification is handled by the machine learning algorithm logistic regression. A logistic (Sigmoid) function is used to model the likelihood that a data point falls into a specific class. The Stochastic Gradient Descent (SGD) form of Logistic Regression in Scikit-Learn is described in the listing above: it initializes the classifier and offers methods to fit the model and produce predictions.
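The thresholded sigmoid decision described by this equation can be sketched as follows; the weights, bias, and feature values are made up for illustration and are not values learned in the paper:

```python
import numpy as np

def sigmoid(z):
    # Logistic function: maps any real score to a probability in (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def predict_malware(X, w, b, threshold=0.5):
    """P(y=1|X) = 1 / (1 + e^(-(X.w + b))), then threshold to 0/1.
    w and b are illustrative parameters, not learned coefficients."""
    p = sigmoid(X @ w + b)
    return p, (p >= threshold).astype(int)

# Two hypothetical files described by three features each
X = np.array([[2.0, 1.0, 0.0],
              [-1.0, 0.5, 0.2]])
w = np.array([1.5, -0.5, 2.0])   # made-up weights
b = -1.0

probs, labels = predict_malware(X, w, b)
print(probs, labels)  # first file classified 1 (malware), second 0 (clean)
```

Raising the threshold above 0.5 trades detection rate for fewer false positives, which matters given the zero-false-positive goal stated earlier.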
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# result_x (feature matrix) and result_y (labels) are prepared earlier (not shown)
xtsne = TSNE(perplexity=50)
results = xtsne.fit_transform(result_x)  # fit_transform takes only the data matrix
vis_x = results[:, 0]
vis_y = results[:, 1]
plt.scatter(vis_x, vis_y, c=result_y, cmap=plt.cm.get_cmap("jet", 9))
plt.colorbar(ticks=range(9))
plt.clim(0.5, 9)
plt.show()

Fig 3.1 – Cross Validation, Logistic Regression
# Record the class label, size (in MB), and name of each known file;
# statinfo holds the os.stat() result for the current file
if file in filenames:
    i = filenames.index(file)
    class_bytes.append(class_y[i])
    sizebytes.append(statinfo.st_size / (1024 * 1024))  # convert bytes to MB
    fnames.append(file)
samples that are not picked up by conventional detection techniques are used for training.)

ACKNOWLEDGMENT

We would like to express our heartfelt gratitude to Mr. Adil Husain Rathar for his invaluable guidance and mentorship throughout this project and research endeavour. We truly appreciate his persistent support and the clear direction he gave us, which allowed us to complete our work effectively within the allotted time. We also want to express our sincere gratitude to every member of our group who worked hard to ensure the successful completion of this research project by lending their knowledge and skills. Their commitment, collaboration, and invaluable insights were crucial in helping us reach our study objectives and identify workable solutions. Finally, we would like to thank our university for giving us a suitable platform and access to a large library, both of which tremendously helped with our research. These resources were essential to our ability to carry out fruitful research and contribute to the scholarly community.

REFERENCES

[1] I. Santos, Y. K. Penya, J. Devesa, and P. G. Garcia, "N-grams-based file signatures for malware detection," 2009.
[2] K. Rieck, T. Holz, C. Willems, P. Düssel, and P. Laskov, "Learning and classification of malware behavior," in DIMVA '08: Proceedings of the 5th International Conference on Detection of Intrusions and Malware, and Vulnerability Assessment. Berlin, Heidelberg: Springer-Verlag, 2008, pp. 108–125.
E. Konstantinou, "Metamorphic virus: Analysis and detection," Technical Report RHUL-MA-2008-2, Search Security Award M.Sc. thesis, 93 pages, 2008.
J. Z. Kolter and M. A. Maloof, "Learning to detect and classify malicious executables in the wild," Journal of Machine Learning Research, vol. 7, pp. 2721–2744, December 2006, Special Issue on Machine Learning in Computer Security.
[3] Y. Ye, D. Wang, T. Li, and D. Ye, "IMDS: intelligent malware detection system," in KDD, P. Berkhin, R. Caruana, and X. Wu, Eds. ACM, 2007, pp. 1043–1047.
[4] M. Chandrasekaran, V. Vidyaraman, and S. J. Upadhyaya, "SpyCon: Emulating user activities to detect evasive spyware," in IPCCC. IEEE Computer Society, 2007, pp. 502–509.
[5] M. R. Chouchane, A. Walenstein, and A. Lakhotia, "Using Markov Chains to filter machine-morphed variants of malicious programs," in Malicious and Unwanted Software, 2008. MALWARE 2008. 3rd International Conference on, 2008, pp. 77–84.
[6] M. Stamp, S. Attaluri, and S. McGhee, "Profile hidden Markov models and metamorphic virus detection," Journal in Computer Virology, 2008.
[7] R. Santamarta, "Generic detection and classification of polymorphic malware using neural pattern recognition," 2006.
[8] I. Yoo, "Visualizing Windows executable viruses using self-organizing maps," in VizSEC/DMSEC '04: Proceedings of the 2004 ACM Workshop on Visualization and Data Mining for Computer Security. New York, NY, USA: ACM, 2004, pp. 82–89.
[9] F. Rosenblatt, "The perceptron: a probabilistic model for information storage and organization in the brain," pp. 89–114, 1988.
[10] T. Mitchell, Machine Learning. McGraw-Hill Education (ISE Editions), October 1997.