Malware Detection Using Machine Learning

This study presents a flexible malware detection system utilizing various machine learning algorithms, including Random Model, KNN, and Logistic Regression, to effectively distinguish between malware and clean files while minimizing false positives. The research emphasizes the growing complexity of malware and the need for advanced detection techniques, supported by extensive experiments on medium-sized datasets. The findings suggest that the proposed framework is a valuable addition to existing cybersecurity measures, demonstrating significant improvements in detection rates.


International Journal of Scientific Research in Engineering and Management (IJSREM)

Volume: 08 Issue: 04 | April - 2024 SJIF Rating: 8.448 ISSN: 2582-3930

Malware Detection Using Machine Learning


Siddharth, Chandigarh University, Chandigarh, India, [email protected]
Dr. Bharti Sahu, Chandigarh University, Chandigarh, India, [email protected]

Abstract: In this work, we present a flexible system that uses various machine learning methods to distinguish efficiently between malware and clean files while deliberately reducing false positives. The framework is both flexible and robust, and it works with different machine learning algorithms. Our study begins with an exploration of basic principles using the Random Model, the K-Nearest Neighbours classifier (KNN), and Logistic Regression as foundational parts, emphasizing the differentiation between malware and benign files. Extensive experiments on medium-sized datasets that include malware and clean files verify the effectiveness of our methodology. The system then goes through a scaling-up process intended to ensure smooth operation with big datasets containing both malware and clean files. Our methodology is validated by analysing three algorithms: Random Model, KNN, and Logistic Regression, each of which adds unique advantages to the malware detection system. The evaluation, which is carried out on several datasets, aims to minimize false positives while striking a compromise between precision and recall. Finally, our flexible system, implemented and evaluated on many datasets, demonstrates its efficacy in distinguishing malware from clean files. The framework's flexibility and scalability make it a useful tool in the ever-evolving field of cybersecurity. The proposed algorithms emphasize the framework's potential as a supplementary tool to current cybersecurity measures while also adding to its reliability.

Keywords: ML, KNN, RM, LR

I. INTRODUCTION
Malware is defined as software designed to infiltrate or damage a computer system without the owner’s informed consent. It is a broad category that includes a wide range of malicious programs and applications, from standalone viruses to file infectors. Among these are families with distinct capabilities, such as Ramnit, Lollipop, Vundo, Simda, Tracur, Kelihos_ver1, Kelihos_ver3, Obfuscator.ACY, and Gatak. Malware has changed and mutated as our digital environment continues to develop. Its authors have added many polymorphic layers to evade the conventional, signature-based techniques used by antivirus solutions.

Security measures face a great challenge from the modern malware landscape, which updates itself frequently to get past antivirus software that uses static signatures in its detection [1]. The conflict between cyber attackers and defenders takes place in a dynamic environment with constantly changing rules of engagement. Dynamic file analysis illustrates the ever-changing challenges presented by malware and the related advances in detection techniques. In this case, virtual environment emulation is the setting in which detection technology and malware confront each other [2]. Furthermore, a thorough understanding of the field requires an investigation of conventional methods intended to detect metamorphic viruses. These methods serve as a basis of knowledge, illuminating the subtleties of identifying malicious code that modifies its form to avoid detection through traditional means [3]. In reaction to the growing complexity of malware, researchers are focusing on the potential of machine learning in the fight against this constantly changing threat environment. The literature study that follows provides a broad overview of various machine learning approaches to malware detection. Of these, boosted decision trees' ability to use n-gram data makes them a strong competitor, outperforming more conventional classifiers like Support Vector Machines and the Naive Bayes classifier [5]. The extraction of association rules from Windows API execution sequences adds even more depth to the toolkit of malware detection techniques and demonstrates the adaptability of machine learning. By using Hidden Markov Models (HMMs), one can apply a probabilistic method to determine if a given program file is a variant of a known object. In a related effort, Profile Hidden Markov Models, which are well known for their efficacy in the field of bioinformatics, are adapted to accomplish a comparable objective in the field of malware detection [8][9]. As a result, the literature presents a varied set of machine learning approaches, each contributing to malware detection in its own way. Section V presents the outcome of this work. There, 52 key characteristics taken from the .asm files serve as the cornerstone of a large-scale system capable of identifying malware in massive training datasets. The convergence of preprocessing and analysis signals that the framework is ready to be scaled up to very large training datasets.

© 2024, IJSREM | www.ijsrem.com DOI: 10.55041/IJSREM30244 | Page 1



In essence, this study undertakes an extensive investigation of the complex field of malware detection: it starts with a threat taxonomy, examines the shortcomings of traditional approaches, and presents the opportunities offered by a wide range of machine learning strategies. The introduction sets the stage for what follows, an attempt to create a flexible and robust framework to combat malware's constant growth in the digital sphere. Each step across this terrain opens a new chapter in cybersecurity.

II. LITERATURE REVIEW
This literature review offers an in-depth exploration of the existing research and techniques in the field of malware detection. Malware, a ubiquitous threat to computer systems, is a broad category of harmful software intended to compromise or corrupt systems without authorization from the user. In addressing the ever-evolving issues of malware detection, this study highlights the shortcomings of conventional signature-based approaches in the face of malware's ever-increasing sophistication, which includes polymorphic and self-updating behaviours.
The literature study explores a range of machine learning methods used to identify malware, illustrating the trend toward data-driven strategies. An example of the potential of ensemble methods in classification applications is the higher performance of boosted decision trees using n-gram data over Naive Bayes and Support Vector Machines. Other methods investigate the use of Hidden Markov Models and Profile Hidden Markov Models, along with association rules obtained from Windows API execution sequences, to identify malware variants. The literature also shows how neural networks and Self-Organizing Maps can be used to detect polymorphic malware and to detect patterns in the behaviour of Windows executable files that indicate the presence of viruses. Taken as a whole, these findings highlight the need for more advanced methods to combat the complex nature of malware. The literature study also discusses the difficulties in handling big datasets of ".asm" and ".bytes" files, as well as data preprocessing and feature extraction. Although the sheer volume of these files presents major complications, they do offer a low-level insight into software activity. The effectiveness of machine learning models, such as K-Nearest Neighbours (KNN), is improved by reducing dimensionality and noise through feature selection procedures, such as correlation-based filtering and wrapper techniques.
In addition, logistic regression is presented as a binary classification technique that models the probability of malware existence by using the logistic (Sigmoid) function. Based on pertinent qualities and characteristics, this approach can model the likelihood of a system becoming infected with malware. The experimental results, which show that the Random Model, KNN, and Logistic Regression are the most dependable of the studied algorithms for malware detection, are presented at the end of the literature review. While the approaches do not completely reach the zero false positive target, they do show a significant boost in the overall detection rate, suggesting that they could be a useful addition to existing antivirus programs.
In summary, the literature study offers insight into the current state of machine learning-based malware detection. It emphasizes how crucial data-driven approaches are becoming to combating the dynamic nature of malware threats and the difficulties posed by large-scale data analysis.

III. DATASETS
We used three datasets: a training dataset, a test dataset, and a “scale-up” dataset of up to 200GB. The number of malware files and clean files in these datasets is shown in the first two columns of Table I. As stated above, our main goal is to achieve malware detection with only a few (if possible 0) false positives; therefore the number of clean files in the training dataset (and in the scale-up dataset) is much larger than the number of malware files.
From the whole feature set that we created for malware detection, 308 binary features were selected for the experiments presented in this paper. Files that generate the same values for the chosen feature set were counted only once. The last two columns in Table I show the total number of unique combinations of the 308 selected binary features in the training, test, and scale-up datasets respectively. Note that the number of clean combinations (i.e. combinations of feature values for the clean files) in the three datasets is much smaller than the number of malware unique combinations.

TABLE I
NUMBER OF FILES AND UNIQUE COMBINATIONS OF FEATURE VALUES IN THE TRAINING, TEST, AND SCALE-UP DATASETS UP TO 200GB.

            Files                        Unique combinations
Database    malware       clean          malware    clean
Training    6955          695535         6922       315
Test        21740         6521           609        220
Scale-up    approx. 2M    approx. 80M    8817       12230

TABLE II
MALWARE DISTRIBUTION IN THE TRAINING AND TEST DATASETS.


                  Training Dataset                                  Test Dataset
Malware Type      Files    Unique combinations of feature values    Files
Ramnit            14.2%    5.19%                                    13.84%
Lollipop          22.8%    30.73%                                   22.95%
Kelihos_ver3      27.1%    40.15%                                   27.50%
Vundo             4.4%     12.15%                                   4.50%
Simda             0.4%     0.11%                                    0.09%
Tracur            6.9%     2.66%                                    6.89%
Kelihos_ver1      3.7%     3.17%                                    4.41%
Obfuscator.ACY    11.3%    4.66%                                    11.49%
Gatak             9.3%     6.10%                                    9.24%

The number of clean unique combinations is small because the majority of features were developed to highlight a particular aspect of malware files (either a geometrical form or a behaviour aspect).
The majority of the clean files in the training database are system files (from various operating system versions) and executable and library files from several well-known applications. In order to better train and test the system, we also employ clean files that are packed or that share form or geometrical characteristics with malware files (e.g., use the same packer). The training dataset includes malware files that were obtained from the Virus Heaven collection. The test dataset includes clean files from several operating systems and malware files from the Wild List collection (files other than those used in the first database). The malware collections in the training and test datasets include Ramnit, Lollipop, Kelihos_ver3, Vundo, Simda, Tracur, Kelihos_ver1, Obfuscator.ACY, and Gatak. The percentage of each malware type in the training and test datasets is shown in the first and third columns of Table II respectively. The second column in Table II shows the proportion of malware-specific unique combinations across all feature value combinations in the training dataset.
We divide the dataset into train, test, and cross-validation (cv) parts, and then examine the class distribution in each split to check that it is consistent across all splits. The train data consist of 6955 data points, the test data of 2174 data points, and 1739 data points were used for cross-validation.

IV. ALGORITHMS
The main goal of this section is to modify the Random Model so as to correctly detect malware files, while forcing the detection rate for one category [12]. In the sequel we will use the following data structures.
It is easy to create random probabilities for classification tasks using the "Random Model" method. For each class (in this case, there are nine classes), we generate random probability values so that the sum of these probabilities is 1. This is done to establish a baseline model for comparison in machine learning tasks. The procedures for producing random probabilities and computing the log loss on the cross-validation and test datasets are described in Algorithm 1.
We create a vector of probabilities in a random model, p = [p_1, p_2, ..., p_k], where k is the number of classes. The probabilities sum to 1, that is, $\sum_{i=1}^{k} p_i = 1$.
Without taking into account any attributes or patterns, a Random Model in the context of malware detection assigns random probabilities to the classes (malware or non-malware). Because it relies on random guesses and is unable to categorize malware accurately, it is unsuitable for use in real-world applications.
In the binary case, P(class = malware) = p, where $0 \le p \le 1$, and P(class = non-malware) = 1 - p. The Random Model is not a useful method for detecting malware in computer systems; it serves as a baseline or reference model against which more complex models are evaluated. To discriminate between malware and benign software in real-world malware detection, models should be trained on pertinent traits and behaviours of malicious software.

Algorithm 1: Random Model

import numpy as np
from sklearn.metrics import log_loss

# X_cv, X_test, y_cv and y_test are the cross-validation and test splits;
# plot_confusion_matrix is the plotting helper behind Figs 1.1 to 1.3.
test_data_len = X_test.shape[0]
cv_data_len = X_cv.shape[0]

# One row of 9 random class probabilities per sample, normalized to sum to 1.
cv_predicted_y = np.random.rand(cv_data_len, 9)
cv_predicted_y /= cv_predicted_y.sum(axis=1, keepdims=True)
log_loss_cv = log_loss(y_cv, cv_predicted_y)

test_predicted_y = np.random.rand(test_data_len, 9)
test_predicted_y /= test_predicted_y.sum(axis=1, keepdims=True)
log_loss_test = log_loss(y_test, test_predicted_y)

# Class labels start at 1, so shift the argmax index by one.
predicted_y = np.argmax(test_predicted_y, axis=1)
plot_confusion_matrix(y_test, predicted_y + 1)

Algorithm 2, henceforth called KNN, performs the training for one chosen label (in our case either malware or clean), so that in the end the files situated on one side of the learned linear separator have exactly that label (assuming that the two classes are separable). The files on the other side of the linear separator can have mixed labels.

Algorithm 2: K Nearest Neighbour Classifier

from sklearn.neighbors import KNeighborsClassifier

# 5 neighbours, uniform votes, Euclidean distance (Minkowski with p=2).
knn_classifier = KNeighborsClassifier(n_neighbors=5, weights='uniform', algorithm='auto', leaf_size=30, p=2, metric='minkowski', metric_params=None, n_jobs=1)
knn_classifier.fit(X, y)            # store the training samples
knn_classifier.predict(X)           # predicted class labels
knn_classifier.predict_proba(X)     # per-class probability estimates

K-Nearest Neighbours (KNN) is a machine learning algorithm for classification and regression. It categorizes data points in a feature space according to how close they are to other data points. KNN is typically not a good option for computer system malware detection. It is unable to detect intricate patterns and malware-specific traits. To properly detect malware, more sophisticated techniques that take a wider variety of traits and behaviours into account are needed.
Mathematical Representation of KNN: KNN determines the distance from a given data item X to its k nearest neighbours and classifies X into the category that is most prevalent among those neighbours. Distance metrics, such as Euclidean distance, are the foundation of the mathematical representation.
The K-Nearest Neighbours (KNN) method for malware detection entails classifying a particular data point as either malware or non-malware based on the dominant class among its k nearest neighbours. This method relies solely on closeness to neighbours in the feature space, which is not effective for malware identification because malware has complex and distinctive properties.

Algorithm 3: Logistic Regression

from sklearn.linear_model import SGDClassifier

# loss='log_loss' selects logistic regression (named 'log' before scikit-learn 1.1);
# loss='hinge' would train a linear SVM instead.
sgd_classifier = SGDClassifier(loss='log_loss', penalty='l2', alpha=0.0001, l1_ratio=0.15, fit_intercept=True, max_iter=1000, tol=None, shuffle=True, verbose=0, epsilon=0.1, n_jobs=1, random_state=None, learning_rate='optimal', power_t=0.5, class_weight=None, warm_start=False, average=False)
sgd_classifier.fit(X, y, coef_init=None, intercept_init=None, sample_weight=None)
sgd_classifier.predict(X)

Logistic regression is a machine learning algorithm for binary classification. A logistic (Sigmoid) function is used to model the likelihood that a data point will fall into a specific class. Algorithm 3 shows the Stochastic Gradient Descent (SGD) form of Logistic Regression in Scikit-Learn. It initializes the classifier, fits the model, and produces predictions. Logistic regression is a popular approach for binary classification problems such as malware detection in computer systems. It is useful for spotting dangerous software since it can model the likelihood that a system is infected by malware.
The logistic (Sigmoid) function is used in logistic regression to determine the likelihood that a particular occurrence falls into the positive class:

$$f(x) = \frac{L}{1 + e^{-k(x - x_0)}}$$

where,
L = the maximum value of the curve
e = the natural logarithm base (or Euler’s number)
x_0 = the x-value of the sigmoid midpoint
k = steepness of the curve or the logistic growth rate

For detecting malware, logistic regression models the likelihood that a given system or file has malware. Based on the attributes and qualities of the system or file, the algorithm determines this likelihood. The algorithm categorizes the instance as malware (1) or non-malware (0) by defining a threshold.

V. RESULTS
We performed cross-validation tests by running the three algorithms presented in Section IV (the Random Model, KNN, and Logistic Regression) on the training dataset described in Section III (6922 malware unique combinations and 315 clean unique combinations).
For the Random Model, the following function was used:
▪ The main function in the Random Model is in charge of producing random probabilities for each class. To generate random numbers, you can use Python functions like np.random.rand().
For the KNN classifier, the following functions were used:
▪ Distance Metric Function: KNN relies on a distance metric function to measure the similarity between data points. Common distance functions include Euclidean distance, Manhattan distance, and Minkowski distance.
▪ Voting Function: KNN classifies data points according to the majority class among their k nearest neighbours using a voting procedure.
For the Logistic Regression, the following functions were used:
▪ The logistic function, often known as the sigmoid function, is used in logistic regression to represent the likelihood that a given occurrence will fall into the positive class. The logistic function is defined as:


$$P(y = 1 \mid X) = \frac{1}{1 + e^{-(X \cdot w + b)}}$$

▪ Cost Function: The error between predicted probabilities and actual labels is measured using the logistic loss, also known as the log loss. During training, the logistic loss guides the adjustment of the model's parameters.
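The logistic function, the decision threshold, and the log loss described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the weights `w`, the bias `b`, the two toy feature vectors, and the 0.5 threshold are all assumed values chosen for the example.

```python
import math

def sigmoid(z):
    """Logistic function: maps any real score to a probability in (0, 1)."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)  # numerically safer branch for negative scores
    return e / (1.0 + e)

def predict_proba(x, w, b):
    """P(y = 1 | x) for one feature vector x, weights w and bias b."""
    z = sum(xi * wi for xi, wi in zip(x, w)) + b
    return sigmoid(z)

def log_loss(y_true, y_prob, eps=1e-15):
    """Mean binary cross-entropy between 0/1 labels and probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)

# Hypothetical weights and two toy binary feature vectors (illustrative only).
w, b = [2.0, -1.0, 0.5], -0.5
samples = [[1, 0, 1], [0, 1, 0]]
labels = [1, 0]  # 1 = malware, 0 = non-malware

probs = [predict_proba(x, w, b) for x in samples]
preds = [1 if p >= 0.5 else 0 for p in probs]  # assumed 0.5 threshold
loss = log_loss(labels, probs)
print(probs, preds, loss)
```

Raising the threshold above 0.5 trades detections for fewer false positives, which is the direction the zero false positive goal of this paper points to.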

Feature selection was considered in conjunction with the Random Model, KNN, and Logistic Regression. The Random Model itself does not include feature selection because it assigns probabilities at random without taking features into account.

Fig 1.1 – Confusion Matrix of the Random Model
Fig 1.2 – Precision Matrix of the Random Model
Fig 1.3 – Recall Matrix of the Random Model
Fig 2.2 – Confusion Matrix of the KNN classifier
Fig 2.3 – Precision Matrix of the KNN classifier
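The three kinds of matrices named in these figure captions can be reproduced in miniature for the random baseline. The sketch below is an assumption-laden illustration: the paper does not define the precision and recall matrices, so it assumes the precision matrix is the confusion matrix normalized by column (predicted class) and the recall matrix is the confusion matrix normalized by row (true class). The labels are randomly generated toy data, not the paper's datasets.

```python
import random

NUM_CLASSES = 9  # nine malware families, labelled 1..9

def random_probabilities(k, rng):
    """Return k random probabilities that sum to 1."""
    raw = [rng.random() for _ in range(k)]
    total = sum(raw)
    return [r / total for r in raw]

def confusion_matrix(y_true, y_pred, k):
    """C[i][j] = number of samples of class i+1 predicted as class j+1."""
    C = [[0] * k for _ in range(k)]
    for t, p in zip(y_true, y_pred):
        C[t - 1][p - 1] += 1
    return C

def normalize_rows(C):
    """Recall matrix (assumed definition): each row divided by its sum."""
    return [[c / sum(row) if sum(row) else 0.0 for c in row] for row in C]

def normalize_columns(C):
    """Precision matrix (assumed definition): each column divided by its sum."""
    col_sums = [sum(row[j] for row in C) for j in range(len(C))]
    return [[row[j] / col_sums[j] if col_sums[j] else 0.0
             for j in range(len(C))] for row in C]

rng = random.Random(0)
y_true = [rng.randint(1, NUM_CLASSES) for _ in range(500)]  # toy labels
y_pred = []
for _ in y_true:
    probs = random_probabilities(NUM_CLASSES, rng)
    y_pred.append(probs.index(max(probs)) + 1)  # argmax, shifted to 1..9

C = confusion_matrix(y_true, y_pred, NUM_CLASSES)
recall = normalize_rows(C)
precision = normalize_columns(C)
accuracy = sum(C[i][i] for i in range(NUM_CLASSES)) / len(y_true)
print(round(accuracy, 3))
```

For a random model the diagonal of both normalized matrices should hover around 1/9, which gives a floor against which the KNN and Logistic Regression matrices can be read.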


Feature selection can be used to enhance the performance of K-Nearest Neighbours (KNN). Reducing dimensionality and noise in the dataset through the selection of pertinent features helps improve classification and distance calculations. Filter approaches (such as correlation-based feature selection), wrapper methods (such as forward selection), and embedded methods (such as L1 regularization with Lasso) are frequently used for KNN feature selection.
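As a rough illustration of a filter approach feeding KNN, the sketch below ranks binary features by the absolute Pearson correlation with the label, keeps the top two, and then classifies a query by Euclidean distance and majority vote. The toy matrix `X`, the labels `y`, the query vector, and the choice of two features and k = 3 are all assumptions made for the example, not values from the paper.

```python
import math
from collections import Counter

def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if constant)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy) if vx > 0 and vy > 0 else 0.0

def select_top_features(X, y, k):
    """Filter method: keep the k features most correlated with the label."""
    scores = [abs(pearson([row[j] for row in X], y)) for j in range(len(X[0]))]
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:k])

def knn_predict(X_train, y_train, x, k=3):
    """Classify x by majority vote among its k nearest Euclidean neighbours."""
    dists = sorted((math.dist(row, x), label)
                   for row, label in zip(X_train, y_train))
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]

# Toy binary features: columns 0 and 1 follow the label, 2 and 3 are noise.
X = [[1, 1, 0, 1], [1, 1, 1, 0], [1, 0, 0, 0], [0, 0, 1, 1],
     [0, 0, 0, 1], [0, 1, 1, 0], [1, 1, 1, 1], [0, 0, 0, 0]]
y = [1, 1, 1, 0, 0, 0, 1, 0]  # 1 = malware, 0 = clean

keep = select_top_features(X, y, k=2)
X_sel = [[row[j] for j in keep] for row in X]
query = [1, 1, 0, 0]
pred = knn_predict(X_sel, y, [query[j] for j in keep], k=3)
print(keep, pred)
```

Dropping the two noise columns matters for KNN in particular, because every retained feature contributes equally to the distance.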

Fig 2.4 – Recall Matrix of the KNN classifier

Feature selection is frequently used in conjunction with logistic regression to improve model interpretability and lessen overfitting. The choice of features is important because logistic regression models the connection between features and the target variable. Recursive Feature Elimination (RFE), L1 regularization (Lasso), and mutual information-based algorithms are common approaches.
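Of the approaches listed, mutual information is the simplest to show directly for binary features. The following sketch scores each feature by its mutual information with the class label and ranks the features; the toy data are hypothetical and the function names are introduced here for illustration only.

```python
import math
from collections import Counter

def mutual_information(feature, labels):
    """Mutual information (in bits) between two discrete sequences."""
    n = len(labels)
    joint = Counter(zip(feature, labels))
    px = Counter(feature)
    py = Counter(labels)
    mi = 0.0
    for (x, y), count in joint.items():
        p_xy = count / n
        mi += p_xy * math.log2(p_xy / ((px[x] / n) * (py[y] / n)))
    return mi

def rank_features(X, y):
    """Return (feature index, score) pairs, most informative first."""
    scores = [(j, mutual_information([row[j] for row in X], y))
              for j in range(len(X[0]))]
    return sorted(scores, key=lambda item: item[1], reverse=True)

# Toy binary features: column 0 equals the label, column 1 mostly agrees
# with it, columns 2 and 3 are independent of it.
X = [[1, 1, 0, 1], [1, 1, 1, 0], [1, 0, 0, 0], [0, 0, 1, 1],
     [0, 0, 0, 1], [0, 1, 1, 0], [1, 1, 1, 1], [0, 0, 0, 0]]
y = [1, 1, 1, 0, 0, 0, 1, 0]  # 1 = malware, 0 = clean

ranking = rank_features(X, y)
for index, score in ranking:
    print(index, round(score, 4))
```

Features with a score near zero carry no information about the label on their own and are the natural candidates to drop before fitting the logistic regression.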
Fig 2.1 – Cross Validation, KNN Classifier

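Algorithm: Multivariate Analysis on Final Features

from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# Project the final feature matrix result_x to two dimensions with t-SNE.
xtsne = TSNE(perplexity=50)
results = xtsne.fit_transform(result_x)
vis_x = results[:, 0]
vis_y = results[:, 1]
# Colour each point by its class label, using one colour per class.
plt.scatter(vis_x, vis_y, c=result_y, cmap=plt.get_cmap("jet", 9))
plt.colorbar(ticks=range(9))
plt.clim(0.5, 9)
plt.show()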
Fig 3.1 – Cross Validation, Logistic Regression


Fig 3.2 – Confusion Matrix of the Logistic Regression
Fig 3.3 – Precision Matrix of the Logistic Regression
Fig 3.4 – Recall Matrix of the Logistic Regression
Fig 4 – Multivariate Analysis on Final Features

VI. WORKING WITH VERY LARGE DATASETS

All the results presented in this section are obtained on the large (“scale-up”) dataset that was described in Section III.
Data analysis and machine learning on big datasets of ".asm" and ".bytes" files is a difficult but essential task. For many applications, including malware detection, software classification, and cybersecurity research, these files contain extensive data and patterns.
The ".asm" files, which are often created from disassembled software, offer a low-level perspective of a program's processes. They are composed of assembly language code, and examining them reveals details on the operation, composition, and potential security risks of software. The ".bytes" files offer a different perspective on the same software by encoding its binary data. Often, this binary data contains important details about a program's features, structure, and even abnormalities.
The sheer amount of data, though, is where the real intricacy lies. Data preprocessing, feature extraction, and computational resource requirements are a few of the difficulties that large datasets of these files present. Additionally, when examined together, the ".asm" and ".bytes" files can provide a more thorough understanding of software behaviour.

Algorithm: Train and Test Split

from sklearn.model_selection import train_test_split

# Hold out 20% of the data as the test set, preserving class proportions.
X_train, X_test_merge, y_train, y_test_merge = train_test_split(result_x, result_y, stratify=result_y, test_size=0.20)
# Split the remaining data again to obtain the cross-validation set.
X_train_merge, X_cv_merge, y_train_merge, y_cv_merge = train_test_split(X_train, y_train, stratify=y_train, test_size=0.20)
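Section III states that the class distribution should be consistent across the train, cross-validation, and test parts. The sketch below shows, in plain Python, what a stratified split does and how that consistency can be checked. It is a simplified stand-in for the stratified `train_test_split` calls in the listing, written from scratch for illustration; the three-class toy labels and the helper names are assumptions.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, test_size, seed=0):
    """Return (train_idx, test_idx) with per-class proportions preserved."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train_idx, test_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

def class_distribution(labels, indices):
    """Fraction of each class among the selected indices."""
    counts = Counter(labels[i] for i in indices)
    return {label: counts[label] / len(indices) for label in sorted(counts)}

# Toy imbalanced labels for three hypothetical classes.
labels = [1] * 50 + [2] * 30 + [3] * 20

# First split off the test set, then split the remainder into train and cv.
rest_idx, test_idx = stratified_split(labels, test_size=0.20)
rest_labels = [labels[i] for i in rest_idx]
tr_pos, cv_pos = stratified_split(rest_labels, test_size=0.20)
train_idx = [rest_idx[i] for i in tr_pos]
cv_idx = [rest_idx[i] for i in cv_pos]

for name, idx in [("train", train_idx), ("cv", cv_idx), ("test", test_idx)]:
    print(name, len(idx), class_distribution(labels, idx))
```

If the three printed distributions differ noticeably, the split was not stratified and the per-class metrics on the smaller parts become hard to compare.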


To make sense of the data in this situation, researchers and data scientists use a variety of approaches such as feature engineering, deep learning, and natural language processing. The knowledge gained from these big datasets is invaluable for improving software security, spotting vulnerabilities, and understanding the complexities of program development and execution.

Algorithm: Each .asm file size

import os
import pandas as pd

# Y is assumed to be a pandas DataFrame of labels with the columns 'ID' and 'Class'.
files = os.listdir('asmFiles')
filenames = Y['ID'].tolist()
class_y = Y['Class'].tolist()
class_bytes = []
sizebytes = []
fnames = []

for file in files:
    statinfo = os.stat('asmFiles/' + file)
    file = file.split('.')[0]  # drop the extension to recover the file ID
    if file in filenames:
        i = filenames.index(file)
        class_bytes.append(class_y[i])
        sizebytes.append(statinfo.st_size / (1024.0 * 1024.0))  # size in MB
        fnames.append(file)

asm_size_byte = pd.DataFrame({'ID': fnames, 'size': sizebytes, 'Class': class_bytes})
print(asm_size_byte.head())

Fig – Distribution of malware in whole data set

The algorithm above gathers the file sizes of the ".asm" files located in a directory and combines them with the relevant class labels. (Fig – Boxplot of .asm files)
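The boxplot of file sizes per class summarizes each class by its minimum, quartiles, and maximum. As a small sketch of the numbers behind such a plot, the code below computes that five-number summary per class with the standard library. The `(class, size in MB)` records are made-up values for illustration and the helper name is hypothetical.

```python
import statistics
from collections import defaultdict

def boxplot_stats(values):
    """Five-number summary (min, Q1, median, Q3, max) used by a boxplot."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return min(values), q1, median, q3, max(values)

# Hypothetical (class label, file size in MB) records, for illustration only.
records = [(1, 2.0), (1, 4.0), (1, 6.0), (1, 8.0), (1, 10.0),
           (2, 1.0), (2, 1.5), (2, 2.0), (2, 2.5), (2, 3.0)]

# Group the sizes by class label.
sizes_by_class = defaultdict(list)
for label, size_mb in records:
    sizes_by_class[label].append(size_mb)

summary = {label: boxplot_stats(sizes)
           for label, sizes in sorted(sizes_by_class.items())}
for label, stats in summary.items():
    print(label, stats)
```

Classes whose size ranges barely overlap suggest that file size alone already carries some signal and may be worth keeping as a feature.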

Fig – Boxplot of .byte files

VII. CONCLUSION AND FUTURE WORK

Our primary goal was to develop a machine learning framework with a zero false positive rate that can detect as many malware samples as possible in a generic manner. Even though we still have a non-zero false positive rate, we came close to our target. Several deterministic exception mechanisms need to be developed before this framework can be included in a fiercely competitive commercial product. We believe that machine learning-based malware detection will complement the standard detection techniques of existing anti-virus providers rather than replace them. Since every commercial anti-virus program has speed and memory constraints, the Random Model, KNN, and Logistic Regression algorithms are the most dependable ones among those shown here. Given that most antivirus programs are able to detect viruses at a rate of over 90%, an increase of 3%–4% in the overall detection rate, as achieved by our methods, is noteworthy. (Note that malware


samples that are not picked up by conventional detection techniques are used for training.)

ACKNOWLEDGMENT

We would like to express our heartfelt gratitude to Mr. Adil Husain Rathar for his invaluable guidance and mentorship throughout this project and research endeavour. We truly appreciate his persistent support and the clear guidance he gave us, which allowed us to do our work effectively within the allotted time. We also want to express our sincere gratitude to every individual in our group who worked so hard to ensure the completion of this research project by lending their knowledge and skills. Their commitment, collaboration, and insights were crucial in helping us reach our study objectives and identify workable answers. Finally, we would like to express our gratitude to our university for giving us a suitable platform and access to a large library, both of which helped tremendously with our research. These tools were essential to our capacity to carry out fruitful research and support the scholarly community.

REFERENCES
[1] I. Santos, Y. K. Penya, J. Devesa, and P. G. Garcia, “N-grams-based file signatures for malware detection,” 2009.
[2] K. Rieck, T. Holz, C. Willems, P. Düssel, and P. Laskov, “Learning and classification of malware behavior,” in DIMVA ’08: Proceedings of the 5th international conference on Detection of Intrusions and Malware, and Vulnerability Assessment. Berlin, Heidelberg: Springer-Verlag, 2008, pp. 108–125.
[3] E. Konstantinou, “Metamorphic virus: Analysis and detection,” 2008, Technical Report RHUL-MA-2008-2, Search Security Award M.Sc. thesis, 93 pages.
[4] J. Z. Kolter and M. A. Maloof, “Learning to detect and classify malicious executables in the wild,” Journal of Machine Learning Research, vol. 7, pp. 2721–2744, December 2006, special issue on Machine Learning in Computer Security.
[5] Y. Ye, D. Wang, T. Li, and D. Ye, “IMDS: intelligent malware detection system,” in KDD, P. Berkhin, R. Caruana, and X. Wu, Eds. ACM, 2007, pp. 1043–1047.
[6] M. Chandrasekaran, V. Vidyaraman, and S. J. Upadhyaya, “Spycon: Emulating user activities to detect evasive spyware,” in IPCCC. IEEE Computer Society, 2007, pp. 502–509.
[7] M. R. Chouchane, A. Walenstein, and A. Lakhotia, “Using Markov Chains to filter machine-morphed variants of malicious programs,” in Malicious and Unwanted Software, 2008. MALWARE 2008. 3rd International Conference on, 2008, pp. 77–84.
[8] M. Stamp, S. Attaluri, and S. McGhee, “Profile hidden Markov models and metamorphic virus detection,” Journal in Computer Virology, 2008.
[9] R. Santamarta, “Generic detection and classification of polymorphic malware using neural pattern recognition,” 2006.
[10] I. Yoo, “Visualizing Windows executable viruses using self-organizing maps,” in VizSEC/DMSEC ’04: Proceedings of the 2004 ACM workshop on Visualization and data mining for computer security. New York, NY, USA: ACM, 2004, pp. 82–89.
[11] F. Rosenblatt, “The perceptron: a probabilistic model for information storage and organization in the brain,” pp. 89–114, 1988.
[12] T. Mitchell, Machine Learning. McGraw-Hill Education (ISE Editions), October 1997.
