
Malware-Classification - Jupyter Notebook

The document discusses using machine learning techniques to classify malware. It loads a dataset of over 130,000 files and preprocesses the data, dropping identifying columns like name and MD5. It splits the data into malicious and legitimate files to classify. The document then explores using an ExtraTreesClassifier, which fits multiple randomized decision trees to the data for malware detection.

malware-classification - Jupyter Notebook http://localhost:8888/notebooks/malware-classification.ipynb

A Machine Learning approach for Malware Detection


Importing all the required libraries

In [1]: from sklearn.model_selection import cross_validate
import joblib
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import os
import pickle
import pefile
import sklearn.ensemble as ek
from sklearn import tree, linear_model
from sklearn.feature_selection import SelectFromModel
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import confusion_matrix
from sklearn.pipeline import make_pipeline
from sklearn import preprocessing
from sklearn import svm
from sklearn.linear_model import LinearRegression
import matplotlib.patches as mpatches

In [2]: plt.style.use('seaborn')  # this style is named 'seaborn-v0_8' in Matplotlib 3.6 and later

Loading the initial dataset delimited by |

In [3]: dataset = pd.read_csv('data.csv', sep='|', low_memory=False)

In [4]: dataset.shape

Out[4]: (138047, 57)
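
As a minimal sketch of what the `sep='|'` argument does, the snippet below parses a tiny pipe-delimited sample held in memory. The three rows and the reduced column set are made up for illustration and are not taken from data.csv.

```python
import io

import pandas as pd

# Hypothetical three-row sample in the same pipe-delimited layout as data.csv.
sample = io.StringIO(
    "Name|md5|Machine|SizeOfOptionalHeader|legitimate\n"
    "a.exe|aaaa|332|224|1\n"
    "b.exe|bbbb|332|224|0\n"
    "c.exe|cccc|34404|240|0\n"
)

# sep='|' tells pandas to split fields on the pipe character instead of a comma.
df = pd.read_csv(sample, sep="|", low_memory=False)

print(df.shape)    # (rows, columns)
print(df.dtypes)   # text columns are read as object, numeric ones as int64
```

Without `sep='|'`, pandas would look for commas and read each line as a single column.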

In [5]: dataset.head(20)

Out[5]:
    Name                                 md5                               Machine  SizeOfOptionalHeader  Characteristics
0   memtest.exe                          631ea355665f28d4707448e442fbf5b8  332      224
1   ose.exe                              9d10f99a6712e28f8acd5641e3a7ea6b  332      224
2   setup.exe                            4d92f518527353c0db88a70fddcfd390  332      224
3   DW20.EXE                             a41e524f8d45f0074fd07805ff0c9b12  332      224
4   dwtrig20.exe                         c87e561258f2f8650cef999bf643a731  332      224
5   airappinstaller.exe                  e6e5a0ab3b1a27127c5c4a29b237d823  332      224
6   AcroBroker.exe                       dd7d901720f71e7e4f5fb13ec973d8e9  332      224
7   AcroRd32.exe                         540c61844ccd78c121c3ef48f3a34f0e  332      224
8   AcroRd32Info.exe                     9afe3c62668f55b8433cde602258236e  332      224
9   AcroTextExtractor.exe                ba621a96e44f6558c08cf25b40cb1bd4  332      224
10  AdobeCollabSync.exe                  bf0a35c0efcaf650550b9e346dfcbd33  332      224
11  Eula.exe                             1556a34d117a80bdc85a66d8ea4fbcf2  332      224
12  LogTransport2.exe                    c4005b63df77068bce158ac8ef7c522b  332      224
13  reader_sl.exe                        e595f220ed529885d8bc0ef42e455e4d  332      224
14  AcrobatUpdater.exe                   0e9dee95fdf47d6195da804a0deeda5b  332      224
15  AdobeARM.exe                         47c1de0a890613ffcff1d67648eedf90  332      224
16  armsvc.exe                           11a52cf7b265631deeb24c6149309eff  332      224
17  ReaderUpdater.exe                    5ed9b78b308d302c702d44f4505b3f46  332      224
18  Adobe AIR Application Installer.exe  2da20164a6912ca8a11bb3089d0f3453  332      224
19  Adobe AIR Updater.exe                397ef02798d24bf192997b5f7d8ed8ca  332      224

In [6]: dataset.describe()

Out[6]:
       Machine        SizeOfOptionalHeader  Characteristics  MajorLinkerVersion  MinorLinkerVersion
count  138047.000000  138047.000000         138047.000000    138047.000000       138047.000000
mean   4259.069274    225.845632            4444.145994      8.619774            3.819286
std    10880.347245   5.121399              8186.782524      4.088757            11.862675
min    332.000000     224.000000            2.000000         0.000000            0.000000
25%    332.000000     224.000000            258.000000       8.000000            0.000000
50%    332.000000     224.000000            258.000000       9.000000            0.000000
75%    332.000000     224.000000            8226.000000      10.000000           0.000000
max    34404.000000   352.000000            49551.000000     255.000000          255.000000

8 rows × 55 columns

Number of malicious files vs Legitimate files in the training set

In [7]: dataset.groupby(dataset['legitimate']).size()

Out[7]: legitimate
0 96724
1 41323
dtype: int64
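
The counts in Out[7] show that the two classes are not balanced. The short sketch below turns those two counts into shares. It assumes the usual reading of the `legitimate` column, where 0 marks a malicious file and 1 a legitimate one.

```python
import pandas as pd

# Class counts copied from Out[7]; 0 = malicious, 1 = legitimate (assumed encoding).
counts = pd.Series({0: 96724, 1: 41323}, name="legitimate")

share = counts / counts.sum()   # fraction of the dataset in each class
print(share.round(3))
```

Roughly 70% of the rows are malicious, so a classifier that always predicts class 0 would already reach about 70% accuracy. That is the baseline the accuracy scores later in the notebook should be compared against.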

Dropping columns like Name of the file, MD5 (message digest) and label

In [8]: dataset.columns

Out[8]: Index(['Name', 'md5', 'Machine', 'SizeOfOptionalHeader', 'Characteristics',
       'MajorLinkerVersion', 'MinorLinkerVersion', 'SizeOfCode',
       'SizeOfInitializedData', 'SizeOfUninitializedData',
       'AddressOfEntryPoint', 'BaseOfCode', 'BaseOfData', 'ImageBase',
       'SectionAlignment', 'FileAlignment', 'MajorOperatingSystemVersion',
       'MinorOperatingSystemVersion', 'MajorImageVersion', 'MinorImageVersion',
       'MajorSubsystemVersion', 'MinorSubsystemVersion', 'SizeOfImage',
       'SizeOfHeaders', 'CheckSum', 'Subsystem', 'DllCharacteristics',
       'SizeOfStackReserve', 'SizeOfStackCommit', 'SizeOfHeapReserve',
       'SizeOfHeapCommit', 'LoaderFlags', 'NumberOfRvaAndSizes', 'SectionsNb',
       'SectionsMeanEntropy', 'SectionsMinEntropy', 'SectionsMaxEntropy',
       'SectionsMeanRawsize', 'SectionsMinRawsize', 'SectionMaxRawsize',
       'SectionsMeanVirtualsize', 'SectionsMinVirtualsize',
       'SectionMaxVirtualsize', 'ImportsNbDLL', 'ImportsNb',
       'ImportsNbOrdinal', 'ExportNb', 'ResourcesNb', 'ResourcesMeanEntropy',
       'ResourcesMinEntropy', 'ResourcesMaxEntropy', 'ResourcesMeanSize',
       'ResourcesMinSize', 'ResourcesMaxSize', 'LoadConfigurationSize',

In [9]: X = dataset.drop(['Name','md5','legitimate'],axis=1).values  # feature matrix
y = dataset['legitimate'].values  # labels: 1 = legitimate, 0 = malicious

In [10]: temp = dataset.drop(['Name','md5'],axis=1)
temp.head()
print(len(temp))
temp = temp.sample(frac=1)  # shuffle the rows

138047

ExtraTreesClassifier

ExtraTreesClassifier fits a number of randomized decision trees (a.k.a. extra-trees) on various
sub-samples of the dataset and uses averaging to improve the predictive accuracy and control
over-fitting.

In [11]: extratrees = ek.ExtraTreesClassifier().fit(X,y)
model = SelectFromModel(extratrees, prefit=True)
X_new = model.transform(X)
nbfeatures = X_new.shape[1]

ExtraTreesClassifier helps select the features that are useful for classifying a file as either
malicious or legitimate.

In the run shown here, 12 features are selected (see Out[12]). The count can change between runs
because the trees are randomized and no random seed is fixed.
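
The selection step can be sketched on synthetic data as follows. This is an illustration under assumptions, not the notebook's run: `make_classification` stands in for the PE feature matrix, and `random_state=0` is added only so the result is repeatable.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import SelectFromModel

# Synthetic stand-in for the PE feature matrix (not the notebook's data).
X_demo, y_demo = make_classification(
    n_samples=500, n_features=20, n_informative=5, random_state=0
)

forest = ExtraTreesClassifier(n_estimators=100, random_state=0).fit(X_demo, y_demo)

# With a tree ensemble and no explicit threshold, SelectFromModel keeps the
# features whose importance is at least the mean importance.
selector = SelectFromModel(forest, prefit=True)
X_sel = selector.transform(X_demo)

mask = selector.get_support()   # boolean mask over the original columns
print(X_sel.shape[1], "features kept out of", X_demo.shape[1])
print("kept column indices:", np.flatnonzero(mask))
```

Because the cut-off is relative to the mean importance, the number of selected features depends on how the importances happen to be distributed in a given fit.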


In [12]: nbfeatures

Out[12]: 12

In [13]: X_new

Out[13]: array([[3.32000000e+02, 2.24000000e+02, 2.58000000e+02, ...,
        2.56884382e+00, 3.53793936e+00, 1.60000000e+01],
       [3.32000000e+02, 2.24000000e+02, 3.33000000e+03, ...,
        3.42074425e+00, 5.08017686e+00, 1.80000000e+01],
       [3.32000000e+02, 2.24000000e+02, 3.33000000e+03, ...,
        2.84644859e+00, 5.27181276e+00, 1.80000000e+01],
       ...,
       [3.32000000e+02, 2.24000000e+02, 2.58000000e+02, ...,
        2.61702640e+00, 7.99048737e+00, 1.40000000e+01],
       [3.32000000e+02, 2.24000000e+02, 3.31660000e+04, ...,
        2.06096405e+00, 4.73974433e+00, 0.00000000e+00],
       [3.32000000e+02, 2.24000000e+02, 2.58000000e+02, ...,
        1.98048202e+00, 6.11537436e+00, 0.00000000e+00]])

Cross Validation

The dataset is divided into random train and test subsets with train_test_split. This is a single
hold-out split rather than k-fold cross-validation. test_size = 0.2 is the proportion of the dataset
to include in the test split.

In [14]: from sklearn.model_selection import train_test_split

In [15]: X_train, X_test, y_train, y_test = train_test_split(X_new, y, test_size=0.2)
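
For comparison, here is a hedged sketch that puts a hold-out split next to 5-fold cross-validation with `cross_validate`, which the notebook imports in its first cell. The synthetic data, the `stratify` argument and the fixed seeds are assumptions made for this example only.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_validate, train_test_split
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = make_classification(n_samples=400, n_features=10, random_state=0)

# Hold-out split: one 80/20 partition, stratified so both parts keep the class ratio.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=0
)
print(X_tr.shape, X_te.shape)

# k-fold cross-validation: five train/test partitions, one accuracy score per fold.
cv = cross_validate(
    DecisionTreeClassifier(max_depth=10, random_state=0),
    X_demo, y_demo, cv=5, scoring="accuracy",
)
print(cv["test_score"], cv["test_score"].mean())
```

A single split gives one accuracy number that depends on which rows land in the test set. The spread of the five fold scores gives a rough idea of how much that number can move.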

In [16]: features = []
index = np.argsort(extratrees.feature_importances_)[::-1][:nbfeatures]  # most important first

The features identified by ExtraTreesClassifier

In [17]: Imp_features = pd.Series(dtype=float)

In [18]: for f in range(nbfeatures):
    print("%d. feature %s (%f)" % (f + 1, dataset.columns[2+index[f]], extratree
    Imp_features = Imp_features.append(pd.Series({dataset.columns[2+index
    features.append(dataset.columns[2+f])
#Imp_features = Imp_features.append(pd.Series({'Others' : 100-sum(Imp_features.v


1. feature DllCharacteristics (0.152541)
2. feature Characteristics (0.103062)

In [19]: features

Out[19]: ['Machine',
'SizeOfOptionalHeader',
'Characteristics',
'MajorLinkerVersion',
'MinorLinkerVersion',
'SizeOfCode',
'SizeOfInitializedData',
'SizeOfUninitializedData',
'AddressOfEntryPoint',
'BaseOfCode',
'BaseOfData',
'ImageBase']
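
The list in Out[19] is simply the first twelve feature columns in dataset order. That is what `dataset.columns[2+f]` yields, since `f` is the loop counter and not an importance rank. The sketch below contrasts the two ways of indexing on made-up values. The column subset, the importances and the name `nb` are hypothetical, and the offset of 2 skips `Name` and `md5` as in the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical columns and importances for five feature columns (values are made up).
columns = pd.Index(["Name", "md5", "Machine", "SizeOfCode",
                    "DllCharacteristics", "ImageBase", "CheckSum"])
importances = np.array([0.10, 0.05, 0.50, 0.15, 0.20])
nb = 3

order = np.argsort(importances)[::-1][:nb]   # indices of the nb most important features

# Name -> importance, in ranked order (dict insertion order is preserved).
ranked = pd.Series({columns[2 + i]: importances[i] for i in order})

by_position = [columns[2 + f] for f in range(nb)]   # what columns[2+f] yields
by_importance = list(ranked.index)                  # what columns[2+index[f]] yields

print(by_position)
print(by_importance)
```

If the saved feature list is meant to name the columns that SelectFromModel kept, the second form is the one that matches the ranking. Note also that the order of X_new follows the original column order, not the importance order.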

In [20]: plt.figure(figsize=(15,7))
plt.title('Comparison of different Feature Importances')
plt.bar(Imp_features.index,Imp_features.values)
plt.ylabel('Feature Importances (%)')
plt.xlabel('Feature Label')
plt.xticks(rotation=90)
plt.show()

In [21]: colormap = np.array(['#f00534', '#19f005'])
pop_a = mpatches.Patch(color='#f00534', label='Malignant')
pop_b = mpatches.Patch(color='#19f005', label='Benign')
for ind in Imp_features.index:
    plt.figure(figsize=(15,7))
    plt.scatter(range(len(temp)),temp[ind],c=colormap[temp['legitimate']],
    plt.legend(handles=[pop_a,pop_b])
    plt.xlabel('Example no.')
    plt.title("%s as a feature" %ind)
    plt.show()

Building the Machine Learning models

In [22]: model = { "DecisionTree":tree.DecisionTreeClassifier(max_depth=10),
    "RandomForest":ek.RandomForestClassifier(n_estimators=50),
    "Adaboost":ek.AdaBoostClassifier(n_estimators=50),
    "GradientBoosting":ek.GradientBoostingClassifier(n_estimators=50),
    "GNB":GaussianNB()
}

Each model is trained on X_train and evaluated on X_test. The model with the best accuracy
is ranked as the winner.

In [ ]: results = {}
res = pd.Series(dtype=float)
for algo in model:
    clf = model[algo]
    clf.fit(X_train,y_train)
    score = clf.score(X_test,y_test)  # mean accuracy on the test split
    print ("%s : %s " %(algo, score))
    res[algo] = 100*score  # accuracy as a percentage, used for the bar chart
    results[algo] = score

In [ ]: plt.figure(figsize=(15,8))
plt.title('Comparison of different Testing Model')
plt.bar(res.index,res.values)
plt.ylabel('% score')
plt.xlabel('Model Label')

In [ ]: winner = max(results, key=results.get)
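
The train, score and pick-the-best pattern can be tried end to end on synthetic data. The sketch below is only an illustration: the data comes from `make_classification`, only three of the five model types are included to keep it short, and the seeds are fixed for repeatability.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = make_classification(n_samples=600, n_features=12, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

candidates = {
    "DecisionTree": DecisionTreeClassifier(max_depth=10, random_state=0),
    "RandomForest": RandomForestClassifier(n_estimators=50, random_state=0),
    "GNB": GaussianNB(),
}

scores = {}
for name, clf in candidates.items():
    clf.fit(X_tr, y_tr)
    scores[name] = clf.score(X_te, y_te)   # mean accuracy on the held-out set

best = max(scores, key=scores.get)         # key with the highest accuracy
print(scores, "->", best)
```

`max(scores, key=scores.get)` iterates over the dictionary keys and compares them by their stored score, so it returns the model name rather than the score itself.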

Saving the model


In [ ]: joblib.dump(model[winner],'classifier.pkl')

In [ ]: joblib.dump(features, 'features.pkl')
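
A small round trip shows what the two files hold once written. This sketch uses a toy two-point classifier and a temporary directory, both of which are assumptions for illustration. The file names mirror the ones used above.

```python
import os
import tempfile

import joblib
from sklearn.tree import DecisionTreeClassifier

# Toy model and feature list standing in for model[winner] and features.
clf_demo = DecisionTreeClassifier(random_state=0).fit([[0, 0], [1, 1]], [0, 1])
feature_names = ["Machine", "SizeOfOptionalHeader"]

with tempfile.TemporaryDirectory() as tmp:
    model_path = os.path.join(tmp, "classifier.pkl")
    names_path = os.path.join(tmp, "features.pkl")

    joblib.dump(clf_demo, model_path)        # fitted estimator
    joblib.dump(feature_names, names_path)   # plain Python list

    restored_clf = joblib.load(model_path)
    restored_names = joblib.load(names_path)

print(restored_clf.predict([[1, 1]]), restored_names)
```

Saving the feature list next to the model matters because the classifier only sees a numeric array. The list is the only record of which column each position stands for.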

Calculating the false positive and false negative rates on the dataset

In [ ]: clf = model[winner]
res = clf.predict(X_new)
mt = confusion_matrix(y, res)  # rows: true class, columns: predicted class
print("False positive rate : %f %%" % ((mt[0][1] / float(sum(mt[0])))*100))
print('False negative rate : %f %%' % ((mt[1][0] / float(sum(mt[1])))*100))

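
The two rates can be checked by hand on a toy example. The labels below are made up and follow the dataset's encoding, where 0 is malicious and 1 is legitimate.

```python
from sklearn.metrics import confusion_matrix

# Toy labels: 0 = malicious, 1 = legitimate.
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 1, 1, 0, 0]

mt = confusion_matrix(y_true, y_pred)    # rows: true class, columns: predicted class

fp_rate = mt[0][1] / mt[0].sum() * 100   # true 0 predicted as 1
fn_rate = mt[1][0] / mt[1].sum() * 100   # true 1 predicted as 0

print(mt)
print("False positive rate : %f %%" % fp_rate)   # 1 of 4 malicious rows
print("False negative rate : %f %%" % fn_rate)   # 2 of 6 legitimate rows
```

With this encoding the "positive" class is `legitimate`, so the false positive rate counts malicious files that were predicted legitimate, which is the costlier mistake for a malware detector. The notebook computes both rates on the full X_new, which includes the rows the model was trained on.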
In [ ]: clf = joblib.load('classifier.pkl')

In [ ]: features = joblib.load('features.pkl')

In [ ]: features

In [ ]: clf

Testing with unseen file

To test the model on an unseen file, the characteristics of that file must first be extracted.
Python's pefile library (its pefile.PE class) is used to build the feature vector, and the already
trained model then predicts the class of the file.
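
The extraction step might look roughly like the sketch below. It is not the notebook's malware_test script, which is not shown in the source. The function names `extract_header_features` and `to_vector` are hypothetical, only six of the dataset's fields are covered, and mapping `SectionsNb` to the number of sections is an assumption based on the column name.

```python
def extract_header_features(path):
    """Read a few PE header fields with pefile (third-party package)."""
    import pefile  # imported lazily so to_vector() below works without it

    pe = pefile.PE(path)
    return {
        "Machine": pe.FILE_HEADER.Machine,
        "SizeOfOptionalHeader": pe.FILE_HEADER.SizeOfOptionalHeader,
        "Characteristics": pe.FILE_HEADER.Characteristics,
        "MajorLinkerVersion": pe.OPTIONAL_HEADER.MajorLinkerVersion,
        "DllCharacteristics": pe.OPTIONAL_HEADER.DllCharacteristics,
        "SectionsNb": len(pe.sections),  # assumed meaning of the SectionsNb column
    }


def to_vector(feature_dict, feature_names):
    """Order the extracted values to match the saved feature list."""
    return [feature_dict[name] for name in feature_names]


# Usage with hand-written values (no PE file is needed for this part).
row = to_vector(
    {"Machine": 332, "Characteristics": 258, "SectionsNb": 4},
    ["Characteristics", "Machine"],
)
print(row)
```

With a real file, the vector would be built as `to_vector(extract_header_features(path), features)` and passed to `clf.predict` inside a one-row list. That only works if every name in `features` has been extracted and the order matches the columns the model was trained on.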

Let's run the program to test the file - TestDummy.exe

To test for the malicious file, an application has been downloaded from malwr.com

In [ ]: %run malware_test.py "TestDummy.exe"

In [ ]: %run malware_test.py "vlc-3.0.8-win32.exe"

In [ ]: %run malware_test.py "GitHubDesktopSetup.exe"
