Malware-Classification - Jupyter Notebook

The notebook applies machine learning to malware detection. It loads a pipe-delimited dataset of 138,047 file records with 57 columns, drops identifying columns such as the file name and MD5 hash, and separates the 'legitimate' label from the features. It then uses an ExtraTreesClassifier, which fits multiple randomized decision trees, to select the most informative features before several classifiers are trained and compared.

Uploaded by

2K18/CO/008 ABHAY LODHI


malware-classification - Jupyter Notebook http://localhost:8888/notebooks/malware-classification.ipynb

A Machine Learning approach for Malware Detection

Importing all the required libraries

In [1]: from sklearn.model_selection import cross_validate
import joblib
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import os
import pickle
import pefile
import sklearn.ensemble as ek
from sklearn import tree, linear_model
from sklearn.feature_selection import SelectFromModel
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import confusion_matrix
from sklearn.pipeline import make_pipeline
from sklearn import preprocessing
from sklearn import svm
from sklearn.linear_model import LinearRegression
import matplotlib.patches as mpatches

# On Matplotlib 3.6 and later this style is named 'seaborn-v0_8'
In [2]: plt.style.use('seaborn')

Loading the initial dataset delimited by |

In [3]: dataset = pd.read_csv('data.csv', sep='|', low_memory=False)

In [4]: dataset.shape

Out[4]: (138047, 57)
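The loading step can be tried without the real file. The sketch below reads a two-row, in-memory sample laid out like data.csv; the reduced column set and the label values are assumptions made for illustration, not rows copied from the dataset.

```python
import io
import pandas as pd

# Hypothetical two-row sample in the same pipe-delimited layout as data.csv.
sample = io.StringIO(
    "Name|md5|Machine|SizeOfOptionalHeader|legitimate\n"
    "memtest.exe|631ea355665f28d4707448e442fbf5b8|332|224|1\n"
    "ose.exe|9d10f99a6712e28f8acd5641e3a7ea6b|332|224|1\n"
)

# sep='|' tells pandas that fields are separated by a pipe instead of a comma.
df = pd.read_csv(sample, sep="|", low_memory=False)
print(df.shape)
print(df.dtypes)
```

Without sep='|' each row would be read as a single column, so the shape is a quick check that the delimiter was applied.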

In [5]: dataset.head(20)

Out[5]:
    Name                                 md5                               Machine  SizeOfOptionalHeader  Characteristics
 0  memtest.exe                          631ea355665f28d4707448e442fbf5b8      332                   224
 1  ose.exe                              9d10f99a6712e28f8acd5641e3a7ea6b      332                   224
 2  setup.exe                            4d92f518527353c0db88a70fddcfd390      332                   224
 3  DW20.EXE                             a41e524f8d45f0074fd07805ff0c9b12      332                   224
 4  dwtrig20.exe                         c87e561258f2f8650cef999bf643a731      332                   224
 5  airappinstaller.exe                  e6e5a0ab3b1a27127c5c4a29b237d823      332                   224
 6  AcroBroker.exe                       dd7d901720f71e7e4f5fb13ec973d8e9      332                   224
 7  AcroRd32.exe                         540c61844ccd78c121c3ef48f3a34f0e      332                   224
 8  AcroRd32Info.exe                     9afe3c62668f55b8433cde602258236e      332                   224
 9  AcroTextExtractor.exe                ba621a96e44f6558c08cf25b40cb1bd4      332                   224
10  AdobeCollabSync.exe                  bf0a35c0efcaf650550b9e346dfcbd33      332                   224
11  Eula.exe                             1556a34d117a80bdc85a66d8ea4fbcf2      332                   224
12  LogTransport2.exe                    c4005b63df77068bce158ac8ef7c522b      332                   224
13  reader_sl.exe                        e595f220ed529885d8bc0ef42e455e4d      332                   224
14  AcrobatUpdater.exe                   0e9dee95fdf47d6195da804a0deeda5b      332                   224
15  AdobeARM.exe                         47c1de0a890613ffcff1d67648eedf90      332                   224
16  armsvc.exe                           11a52cf7b265631deeb24c6149309eff      332                   224
17  ReaderUpdater.exe                    5ed9b78b308d302c702d44f4505b3f46      332                   224
18  Adobe AIR Application Installer.exe  2da20164a6912ca8a11bb3089d0f3453      332                   224
19  Adobe AIR Updater.exe                397ef02798d24bf192997b5f7d8ed8ca      332                   224

In [6]: dataset.describe()

Out[6]:
             Machine  SizeOfOptionalHeader  Characteristics  MajorLinkerVersion  MinorLinkerVersion
count  138047.000000         138047.000000    138047.000000       138047.000000       138047.000000
mean     4259.069274            225.845632      4444.145994            8.619774            3.819286
std     10880.347245              5.121399      8186.782524            4.088757           11.862675
min       332.000000            224.000000         2.000000            0.000000            0.000000
25%       332.000000            224.000000       258.000000            8.000000            0.000000
50%       332.000000            224.000000       258.000000            9.000000            0.000000
75%       332.000000            224.000000      8226.000000           10.000000            0.000000
max     34404.000000            352.000000     49551.000000          255.000000          255.000000

8 rows × 55 columns

Number of malicious files vs Legitimate files in the training set

In [7]: dataset.groupby(dataset['legitimate']).size()

Out[7]: legitimate
0 96724
1 41323
dtype: int64
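The two classes are not balanced, which matters when accuracy scores are read later. A minimal sketch, using only the two counts printed in Out[7] and assuming that 0 marks malicious files and 1 marks legitimate ones (as the column name suggests), computes the class shares and the accuracy a trivial majority-class predictor would reach.

```python
import pandas as pd

# Counts copied from Out[7]; the meaning of 0 and 1 is assumed from the column name.
counts = pd.Series({0: 96724, 1: 41323}, name="legitimate")

# Fraction of the dataset in each class.
share = counts / counts.sum()
print(share.round(4))

# A model that always predicts the larger class would reach this accuracy.
majority_baseline = share.max()
print("majority baseline: %.4f" % majority_baseline)
```

Any classifier accuracy reported on this data is best compared against that baseline of roughly 70 percent.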

Dropping columns like Name of the file, MD5 (message digest) and label

In [8]: dataset.columns

Out[8]: Index(['Name', 'md5', 'Machine', 'SizeOfOptionalHeader', 'Characteristics',
       'MajorLinkerVersion', 'MinorLinkerVersion', 'SizeOfCode',
       'SizeOfInitializedData', 'SizeOfUninitializedData',
       'AddressOfEntryPoint', 'BaseOfCode', 'BaseOfData', 'ImageBase',
       'SectionAlignment', 'FileAlignment', 'MajorOperatingSystemVersion',
       'MinorOperatingSystemVersion', 'MajorImageVersion', 'MinorImageVersion',
       'MajorSubsystemVersion', 'MinorSubsystemVersion', 'SizeOfImage',
       'SizeOfHeaders', 'CheckSum', 'Subsystem', 'DllCharacteristics',
       'SizeOfStackReserve', 'SizeOfStackCommit', 'SizeOfHeapReserve',
       'SizeOfHeapCommit', 'LoaderFlags', 'NumberOfRvaAndSizes', 'SectionsNb',
       'SectionsMeanEntropy', 'SectionsMinEntropy', 'SectionsMaxEntropy',
       'SectionsMeanRawsize', 'SectionsMinRawsize', 'SectionMaxRawsize',
       'SectionsMeanVirtualsize', 'SectionsMinVirtualsize',
       'SectionMaxVirtualsize', 'ImportsNbDLL', 'ImportsNb',
       'ImportsNbOrdinal', 'ExportNb', 'ResourcesNb', 'ResourcesMeanEntropy',
       'ResourcesMinEntropy', 'ResourcesMaxEntropy', 'ResourcesMeanSize',
       'ResourcesMinSize', 'ResourcesMaxSize', 'LoadConfigurationSize',

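In [9]: # Feature matrix: every column except the identifiers and the label
X = dataset.drop(['Name','md5','legitimate'],axis=1).values
# Target vector: 1 = legitimate, 0 = malicious
y = dataset['legitimate'].values

In [10]: temp = dataset.drop(['Name','md5'],axis=1)
temp.head()
print(len(temp))
# Shuffle the rows (used later for the scatter plots)
temp = temp.sample(frac=1)
138047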
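The effect of these two cells can be seen on a small table. The sketch below uses a hypothetical four-row DataFrame with only two feature columns; the values and the random_state argument are assumptions added so that the result is reproducible.

```python
import pandas as pd

# Hypothetical miniature of the dataset: two identifier columns, two features, one label.
toy = pd.DataFrame({
    "Name": ["a.exe", "b.exe", "c.exe", "d.exe"],
    "md5": ["m1", "m2", "m3", "m4"],
    "Machine": [332, 332, 34404, 332],
    "SizeOfOptionalHeader": [224, 224, 240, 224],
    "legitimate": [1, 0, 1, 0],
})

# Features: drop identifiers and the label, keep a plain NumPy array.
X_toy = toy.drop(["Name", "md5", "legitimate"], axis=1).values
# Labels: the 'legitimate' column on its own.
y_toy = toy["legitimate"].values

# sample(frac=1) returns all rows in random order; the label column stays attached.
shuffled = toy.drop(["Name", "md5"], axis=1).sample(frac=1, random_state=0)
print(X_toy.shape, y_toy.shape)
print(shuffled)
```

The identifiers are removed because a file name or hash is unique per sample and carries no signal that generalizes to unseen files.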

ExtraTreesClassifier

ExtraTreesClassifier fits a number of randomized decision trees (a.k.a. extra-trees) on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control over-fitting.

In [11]: extratrees = ek.ExtraTreesClassifier().fit(X,y)
model = SelectFromModel(extratrees, prefit=True)
X_new = model.transform(X)
nbfeatures = X_new.shape[1]
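ExtraTreesClassifier helps in selecting the features that are useful for classifying a file as either Malicious or Legitimate.

The number of features retained is not fixed: the trees are randomized and no seed is set, so it varies between runs (14 in an earlier run, 12 in the output that follows).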
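The selection step can be reproduced on synthetic data. The sketch below is an illustration under stated assumptions: make_classification stands in for the real PE features, and n_estimators and random_state are added so the run is repeatable. It relies on the documented default of SelectFromModel for tree ensembles, which keeps the features whose importance is at least the mean importance.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import SelectFromModel

# Synthetic stand-in for the PE header features (assumption, not the real data).
X_demo, y_demo = make_classification(
    n_samples=500, n_features=20, n_informative=4, random_state=0
)

# Fit the randomized trees first, then hand the fitted model to the selector.
forest = ExtraTreesClassifier(n_estimators=50, random_state=0).fit(X_demo, y_demo)
selector = SelectFromModel(forest, prefit=True)
X_sel = selector.transform(X_demo)

# With the default threshold, a feature is kept when its importance
# is at least the mean importance over all features.
importances = forest.feature_importances_
mask = importances >= importances.mean()
print(X_sel.shape[1], "of", X_demo.shape[1], "features kept")
```

Fixing random_state is what makes the number of kept features stable from run to run; without it the count can change, as it does in the notebook.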


In [12]: nbfeatures

Out[12]: 12

In [13]: X_new

Out[13]: array([[3.32000000e+02, 2.24000000e+02, 2.58000000e+02, ...,
                 2.56884382e+00, 3.53793936e+00, 1.60000000e+01],
                [3.32000000e+02, 2.24000000e+02, 3.33000000e+03, ...,
                 3.42074425e+00, 5.08017686e+00, 1.80000000e+01],
                [3.32000000e+02, 2.24000000e+02, 3.33000000e+03, ...,
                 2.84644859e+00, 5.27181276e+00, 1.80000000e+01],
                ...,
                [3.32000000e+02, 2.24000000e+02, 2.58000000e+02, ...,
                 2.61702640e+00, 7.99048737e+00, 1.40000000e+01],
                [3.32000000e+02, 2.24000000e+02, 3.31660000e+04, ...,
                 2.06096405e+00, 4.73974433e+00, 0.00000000e+00],
                [3.32000000e+02, 2.24000000e+02, 2.58000000e+02, ...,
                 1.98048202e+00, 6.11537436e+00, 0.00000000e+00]])

Cross Validation

The dataset is divided into random train and test subsets with train_test_split. This is a single hold-out split rather than k-fold cross validation. test_size = 0.2 is the proportion of the dataset to include in the test split.

In [14]: from sklearn.model_selection import train_test_split

In [15]: X_train, X_test, y_train, y_test = train_test_split(X_new, y, test_size=0.2)
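A minimal sketch of the 80/20 split on made-up arrays follows. The random_state and stratify arguments are not in the notebook; they are added here as assumptions so the split is repeatable and keeps the class proportions of an imbalanced label.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 100 samples, 2 features, 70 of class 0 and 30 of class 1.
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.array([0] * 70 + [1] * 30)

# test_size=0.2 puts 20% of the rows in the test set.
# stratify=y_demo keeps the 70/30 class ratio in both subsets.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)
print(X_tr.shape, X_te.shape)   # (80, 2) (20, 2)
print(y_te.mean())
```

Stratifying is optional, but with roughly 70 percent of the files in one class it avoids a test set whose class mix differs from the training set by chance.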

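In [16]: features = []
# Indices of the nbfeatures most important features, highest importance first
index = np.argsort(extratrees.feature_importances_)[::-1][:nbfeatures]

The features identified by ExtraTreesClassifier

In [17]: Imp_features = pd.Series(dtype=float)

In [18]: for f in range(nbfeatures):
    print("%d. feature %s (%f)" % (f + 1, dataset.columns[2+index[f]], extratrees.feature_importances_[index[f]]))
    # The source line is cut off after "dataset.columns[2+index"; the percent value is assumed from the plot's '(%)' axis label.
    # pd.concat replaces Series.append, which was removed in pandas 2.0.
    Imp_features = pd.concat([Imp_features, pd.Series({dataset.columns[2+index[f]]: extratrees.feature_importances_[index[f]]*100})])
    # Note: this records the first nbfeatures columns (see Out[19]); dataset.columns[2+index[f]] would record the ranked features instead.
    features.append(dataset.columns[2+f])
#Imp_features = Imp_features.append(pd.Series({'Others' : 100-sum(Imp_features.v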


1. feature DllCharacteristics (0.152541)

2. feature Characteristics (0.103062)
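The ranking printed above comes from sorting the importances in descending order. The sketch below shows that step with NumPy on five hypothetical importance values (the names are real column names, the numbers are illustrative only), and also shows how to list the chosen names in original column order, which is the order of the columns that SelectFromModel returns.

```python
import numpy as np

# Hypothetical importances for five named features (illustrative values only).
names = np.array(["Machine", "Characteristics", "SizeOfCode",
                  "DllCharacteristics", "ImageBase"])
importances = np.array([0.05, 0.10, 0.02, 0.15, 0.08])

top_k = 3
# argsort sorts ascending; [::-1] reverses to descending; [:top_k] keeps the best.
order = np.argsort(importances)[::-1][:top_k]
for rank, i in enumerate(order, start=1):
    print("%d. feature %s (%f)" % (rank, names[i], importances[i]))

# Same features, listed in original column order instead of importance order.
keep = np.sort(order)
selected_in_column_order = names[keep].tolist()
print(selected_in_column_order)
```

Keeping the names in column order matters when the list is saved next to the model, because the feature vector for a new file has to use the same column order as the training matrix.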

In [19]: features

Out[19]: ['Machine',
'SizeOfOptionalHeader',
'Characteristics',
'MajorLinkerVersion',
'MinorLinkerVersion',
'SizeOfCode',
'SizeOfInitializedData',
'SizeOfUninitializedData',
'AddressOfEntryPoint',
'BaseOfCode',
'BaseOfData',
'ImageBase']

In [20]: plt.figure(figsize=(15,7))
plt.title('Comparison of different Feature Importances')
plt.bar(Imp_features.index,Imp_features.values)
plt.ylabel('Feature Importances (%)')
plt.xlabel('Feature Label')
plt.xticks(rotation=90)
plt.show()

In [21]: colormap = np.array(['#f00534', '#19f005'])
pop_a = mpatches.Patch(color='#f00534', label='Malignant')
pop_b = mpatches.Patch(color='#19f005', label='Benign')
# One scatter plot per selected feature, colored by label (0 = red, 1 = green)
for ind in Imp_features.index:
    plt.figure(figsize=(15,7))
    # Any further arguments to scatter are cut off in the source
    plt.scatter(range(len(temp)),temp[ind],c=colormap[temp['legitimate']])
    plt.legend(handles=[pop_a,pop_b])
    plt.xlabel('Example no.')
    plt.title("%s as a feature" %ind)
    plt.show()

Building the machine learning models

In [22]: model = { "DecisionTree":tree.DecisionTreeClassifier(max_depth=10),
    "RandomForest":ek.RandomForestClassifier(n_estimators=50),
    "Adaboost":ek.AdaBoostClassifier(n_estimators=50),
    "GradientBoosting":ek.GradientBoostingClassifier(n_estimators=50),
    "GNB":GaussianNB()
}

Each model is trained on X_train and tested on X_test. The model with the best accuracy is ranked as the winner.

In [ ]: results = {}
res = pd.Series(dtype=float)
for algo in model:
    clf = model[algo]
    clf.fit(X_train,y_train)
    score = clf.score(X_test,y_test)
    print ("%s : %s " %(algo, score))
    # pd.concat replaces Series.append, which was removed in pandas 2.0
    res = pd.concat([res, pd.Series({algo:100*score})])
    results[algo] = score

In [ ]: plt.figure(figsize=(15,8))
plt.title('Comparison of different Testing Models')
plt.bar(res.index,res.values)
plt.ylabel('% score')
plt.xlabel('Model Label')

In [ ]: # Name of the model with the highest test accuracy
winner = max(results, key=results.get)
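The train, score, and pick-the-winner pattern can be run end to end on synthetic data. This is a sketch under stated assumptions: the data comes from make_classification, only three of the five model types are used to keep it quick, and the hyperparameters and random_state values are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification problem (stand-in for the selected PE features).
X_demo, y_demo = make_classification(
    n_samples=600, n_features=12, n_informative=5, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

candidates = {
    "DecisionTree": DecisionTreeClassifier(max_depth=10, random_state=0),
    "RandomForest": RandomForestClassifier(n_estimators=20, random_state=0),
    "GNB": GaussianNB(),
}

# Fit each candidate on the training split and record its test accuracy.
scores = {}
for name, clf in candidates.items():
    scores[name] = clf.fit(X_tr, y_tr).score(X_te, y_te)
    print("%s : %.4f" % (name, scores[name]))

# max over the dict keys, ordered by their score, gives the best model's name.
best = max(scores, key=scores.get)
print("winner:", best)
```

Because the winner is chosen on the test split, its test accuracy is a slightly optimistic estimate; a separate validation split would avoid that.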

Saving the model


In [ ]: joblib.dump(model[winner],'classifier.pkl')

In [ ]: joblib.dump(features, 'features.pkl')

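Calculating the false positive and false negative rates on the dataset

In [ ]: clf = model[winner]
# X_new contains the training rows as well, so these rates are optimistic
res = clf.predict(X_new)
# mt[i][j] counts files with true class i predicted as class j (1 = legitimate)
mt = confusion_matrix(y, res)
print("False positive rate : %f %%" % ((mt[0][1] / float(sum(mt[0])))*100))
print('False negative rate : %f %%' % ((mt[1][0] / float(sum(mt[1])))*100))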
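The two rates can be checked by hand on a small example. The labels below are made up; the only library behaviour relied on is that confusion_matrix puts true classes on rows and predicted classes on columns.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up labels: four samples of class 0 and six of class 1.
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1])
y_pred = np.array([0, 0, 0, 1, 1, 1, 1, 1, 0, 0])

# cm[i][j] counts samples with true class i and predicted class j.
cm = confusion_matrix(y_true, y_pred)
print(cm)

# Share of true class 0 predicted as 1, and share of true class 1 predicted as 0.
fp_rate = cm[0][1] / cm[0].sum()
fn_rate = cm[1][0] / cm[1].sum()
print("False positive rate : %f %%" % (fp_rate * 100))
print("False negative rate : %f %%" % (fn_rate * 100))
```

With 'legitimate' as the label, class 1 is the legitimate class, so the "false positive" rate computed this way is the share of malicious files predicted as legitimate.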
In [ ]: clf = joblib.load('classifier.pkl')

In [ ]: features = joblib.load('features.pkl')

In [ ]: features

In [ ]: clf
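Saving and reloading can be verified as a round trip. The sketch below is an assumption-laden stand-in: it trains a small decision tree on synthetic data, uses hypothetical feature names, and writes into a temporary directory so that no real classifier.pkl or features.pkl is overwritten.

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Small synthetic model and a hypothetical feature-name list.
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)
clf_demo = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_demo, y_demo)
feature_names = ["f%d" % i for i in range(X_demo.shape[1])]

with tempfile.TemporaryDirectory() as tmp:
    clf_path = os.path.join(tmp, "classifier.pkl")
    feat_path = os.path.join(tmp, "features.pkl")

    # dump writes any picklable Python object; load reads it back.
    joblib.dump(clf_demo, clf_path)
    joblib.dump(feature_names, feat_path)
    clf_loaded = joblib.load(clf_path)
    features_loaded = joblib.load(feat_path)

# The reloaded model should predict exactly like the original.
same = bool((clf_loaded.predict(X_demo) == clf_demo.predict(X_demo)).all())
print(same, features_loaded)
```

Storing the feature list with the model keeps the column order available at prediction time. Pickle-based files should only be loaded from trusted sources, since loading them can execute arbitrary code.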

Testing with unseen file

To test the model on an unseen file, the characteristics of that file must first be extracted. Python's pefile library (its pefile.PE class) is used to build the feature vector, and the already trained model predicts the class of the file from it.

Let's run the program to test the file - TestDummy.exe

To test with a malicious file, an application has been downloaded from malwr.com
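Once the characteristics of a file have been extracted, they have to be arranged in the same order as the features the model was trained on. The sketch below shows only that assembly step; the build_vector helper, the shortened feature list, and the numeric values are hypothetical, and the PE parsing itself is not shown.

```python
import numpy as np

# Feature order saved at training time (a shortened, hypothetical list).
feature_order = ["Machine", "SizeOfOptionalHeader", "Characteristics"]

# Values that a PE parser would report for one file (hypothetical numbers).
extracted = {
    "Characteristics": 258,
    "Machine": 332,
    "SizeOfOptionalHeader": 224,
    "SizeOfCode": 361984,   # extracted but not used by the model
}

def build_vector(values, order):
    """Return a (1, n_features) array with columns in training order."""
    missing = [name for name in order if name not in values]
    if missing:
        raise KeyError("missing features: %s" % missing)
    return np.array([[values[name] for name in order]], dtype=float)

row = build_vector(extracted, feature_order)
print(row.shape)   # (1, 3)
print(row)
```

A row shaped like this can be passed to the predict method of a fitted scikit-learn classifier, which expects a 2-D array with one row per file.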

In [ ]: %run malware_test.py "TestDummy.exe"

In [ ]: %run malware_test.py "vlc-3.0.8-win32.exe"

In [ ]: %run malware_test.py "GitHubDesktopSetup.exe"
