Malware-Classification - Jupyter Notebook
1 of 7 7/24/2021, 1:47 PM
malware-classification - Jupyter Notebook http://localhost:8888/notebooks/malware-classification.ipynb
18 Adobe AIR Application Installer.exe 2da20164a6912ca8a11bb3089d0f3453 332 224
19 Adobe AIR Updater.exe 397ef02798d24bf192997b5f7d8ed8ca 332 224
8 rows × 55 columns
Out[7]: legitimate
0 96724
1 41323
dtype: int64
Drop the columns that are not model inputs: the file name (Name), the MD5 message digest (md5), and the label (legitimate)
Out[8]:
In [9]: X = dataset.drop(['Name','md5','legitimate'],axis=1).values
y = dataset['legitimate'].values
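As a minimal sketch of the feature/label separation above, the following uses a made-up three-row frame. Only the column names come from the notebook; the values are hypothetical.

```python
import pandas as pd

# Hypothetical frame reusing the notebook's column names; values are made up.
dataset = pd.DataFrame({
    "Name": ["a.exe", "b.exe", "c.dll"],
    "md5": ["m1", "m2", "m3"],
    "Machine": [332, 332, 34404],
    "SizeOfOptionalHeader": [224, 224, 240],
    "legitimate": [1, 0, 1],
})

# Features: every column except the identifiers and the label.
X = dataset.drop(["Name", "md5", "legitimate"], axis=1).values
# Target: the label column on its own.
y = dataset["legitimate"].values

print(X.shape, y.shape)
```

Name and md5 identify a file rather than describe it, so leaving them in X would only give the model something to memorize.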
In [10]: temp = dataset.drop(['Name','md5'],axis=1)
temp.head()
print(len(temp))
temp = temp.sample(frac=1)
138047
ExtraTreesClassifier
In [12]: nbfeatures
Out[12]: 12
In [13]: X_new
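The cells that fit the ExtraTreesClassifier and produce X_new are not visible in this printout. A rough sketch of one common way to do tree-based feature selection in scikit-learn is shown below. It assumes SelectFromModel is used for the reduction, which the source does not confirm, and it runs on synthetic data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import SelectFromModel

# Synthetic stand-in for the PE-header feature matrix (assumption).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=20, n_informative=5, random_state=0
)

# Fit the tree ensemble; it exposes one importance score per feature.
extratrees = ExtraTreesClassifier(n_estimators=50, random_state=0).fit(X_demo, y_demo)

# Keep the features whose importance reaches the default (mean) threshold.
selector = SelectFromModel(extratrees, prefit=True)
X_new = selector.transform(X_demo)
nbfeatures = X_new.shape[1]

print(nbfeatures)
```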
Cross Validation
The dataset is divided into random train and test subsets (a single hold-out split). test_size = 0.2
is the proportion of the dataset to include in the test split, so 20% of the samples are held out for testing.
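The split cell itself is missing from this printout. A minimal sketch with scikit-learn's train_test_split, on made-up arrays and with an arbitrary random_state, might look like this:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 100 samples, 2 features, alternating labels.
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.array([0, 1] * 50)

# Hold out 20% of the rows for testing; random_state fixes the shuffle.
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

print(len(X_train), len(X_test))
```

In the notebook the first argument would presumably be the reduced matrix X_new rather than X_demo.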
In [16]: features = []
index = np.argsort(extratrees.feature_importances_)[::-1][:nbfeatures]  # assumes numpy is imported as np
The features identified by ExtraTreesClassifier
In [19]: features
Out[19]: ['Machine',
'SizeOfOptionalHeader',
'Characteristics',
'MajorLinkerVersion',
'MinorLinkerVersion',
'SizeOfCode',
'SizeOfInitializedData',
'SizeOfUninitializedData',
'AddressOfEntryPoint',
'BaseOfCode',
'BaseOfData',
'ImageBase']
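To illustrate how a ranked list of feature names like the one above can be derived from importance scores, here is a small sketch. The importance values and the four-column subset are invented for the example.

```python
import numpy as np

# Invented importances for a handful of the notebook's column names.
column_names = ["Machine", "SizeOfCode", "ImageBase", "BaseOfData"]
importances = np.array([0.10, 0.40, 0.30, 0.20])
nbfeatures = 2

# argsort sorts ascending, [::-1] flips to descending, the slice keeps the top k.
index = np.argsort(importances)[::-1][:nbfeatures]
features = [column_names[i] for i in index]

print(features)
```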
In [20]: plt.figure(figsize=(15,7))
plt.title('Comparison of different Feature Importances')
plt.bar(Imp_features.index,Imp_features.values)
plt.ylabel('Feature Importances (%)')
plt.xlabel('Feature Label')
plt.xticks(rotation=90)
plt.show()
plt.xlabel('Example no.')
plt.title("%s as a feature" %ind)
plt.show()
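In [ ]: results = {}
res = pd.Series(dtype=float)
for algo in model:
    clf = model[algo]
    clf.fit(X_train, y_train)
    score = clf.score(X_test, y_test)
    print("%s : %s " % (algo, score))
    # Series.append was removed in pandas 2.0; assign by label instead
    res[algo] = 100 * score
    # keep the raw accuracy so the best model can be picked later
    results[algo] = score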
In [ ]: plt.figure(figsize=(15,8))
plt.title('Comparison of different Testing Models')
plt.bar(res.index,res.values)
plt.ylabel('% score')
plt.xlabel('Model Label')
In [ ]: winner = max(results, key=results.get)
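The model dictionary used by the comparison loop is not shown in this printout. The sketch below assumes two scikit-learn classifiers and synthetic data, purely to show the loop and the winner selection working together end to end.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the selected PE-header features (assumption).
X_demo, y_demo = make_classification(n_samples=400, n_features=12, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

# Hypothetical candidates; the notebook's actual dictionary is not visible.
model = {
    "DecisionTree": DecisionTreeClassifier(max_depth=5, random_state=0),
    "RandomForest": RandomForestClassifier(n_estimators=50, random_state=0),
}

results = {}
for algo, clf in model.items():
    clf.fit(X_train, y_train)
    results[algo] = clf.score(X_test, y_test)  # accuracy on the held-out split
    print("%s : %.4f" % (algo, results[algo]))

# The key whose stored score is highest.
winner = max(results, key=results.get)
print("winner:", winner)
```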
In [ ]: joblib.dump(model[winner],'classifier.pkl')
In [ ]: clf = model[winner]
res = clf.predict(X_new)
mt = confusion_matrix(y, res)
print("False positive rate : %f %%" % ((mt[0][1] / float(sum(mt[0])))*100))
print('False negative rate : %f %%' % ((mt[1][0] / float(sum(mt[1])))*100))
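As a worked example of the two rates, the sketch below uses ten made-up labels. scikit-learn's confusion_matrix puts true classes in rows and predicted classes in columns, so row 0 holds the samples whose true label is 0.

```python
from sklearn.metrics import confusion_matrix

# Made-up ground truth and predictions (1 = legitimate, as in the dataset).
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 1, 1, 0, 0]

mt = confusion_matrix(y_true, y_pred)

# Share of true-0 samples predicted as 1.
fp_rate = mt[0][1] / float(sum(mt[0])) * 100
# Share of true-1 samples predicted as 0.
fn_rate = mt[1][0] / float(sum(mt[1])) * 100

print("False positive rate : %f %%" % fp_rate)
print("False negative rate : %f %%" % fn_rate)
```

Since label 1 means legitimate in this dataset, a "false positive" here is a malicious file predicted as legitimate, which is usually the more costly error.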
In [ ]: clf = joblib.load('classifier.pkl')
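A small round-trip sketch of saving and reloading a classifier with joblib follows. It assumes a throwaway decision tree on synthetic data and writes to a temporary directory instead of the working directory.

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Throwaway model on synthetic data (assumption, not the notebook's winner).
X_demo, y_demo = make_classification(n_samples=100, n_features=5, random_state=0)
clf = DecisionTreeClassifier(random_state=0).fit(X_demo, y_demo)

# Persist the fitted model, then load it back from disk.
path = os.path.join(tempfile.mkdtemp(), "classifier.pkl")
joblib.dump(clf, path)
restored = joblib.load(path)

# The restored model should reproduce the original predictions.
same = bool((restored.predict(X_demo) == clf.predict(X_demo)).all())
print(same)
```

Pickle-based files should only be loaded from trusted sources, because loading can execute arbitrary code.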
In [ ]: features
In [ ]: clf
To test the model on an unseen file, the characteristics of that file must first be extracted.
Python's pefile library (its pefile.PE class) is used to parse the file and build the feature vector,
and the already trained model then predicts the class of the file from that vector.
To test a malicious file, an application was downloaded from malwr.com.
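The extraction script itself is not part of this printout. The following is only a sketch of how the twelve header fields listed earlier might be read with pefile and arranged in the order the classifier expects. The function names extract_header_fields and build_feature_vector are hypothetical, and pefile is a third-party package that must be installed separately.

```python
def extract_header_fields(path):
    """Read selected PE header fields from the file at `path` (hypothetical helper)."""
    import pefile  # third-party; imported lazily so the helper below works without it

    pe = pefile.PE(path)
    return {
        "Machine": pe.FILE_HEADER.Machine,
        "SizeOfOptionalHeader": pe.FILE_HEADER.SizeOfOptionalHeader,
        "Characteristics": pe.FILE_HEADER.Characteristics,
        "MajorLinkerVersion": pe.OPTIONAL_HEADER.MajorLinkerVersion,
        "MinorLinkerVersion": pe.OPTIONAL_HEADER.MinorLinkerVersion,
        "SizeOfCode": pe.OPTIONAL_HEADER.SizeOfCode,
        "SizeOfInitializedData": pe.OPTIONAL_HEADER.SizeOfInitializedData,
        "SizeOfUninitializedData": pe.OPTIONAL_HEADER.SizeOfUninitializedData,
        "AddressOfEntryPoint": pe.OPTIONAL_HEADER.AddressOfEntryPoint,
        "BaseOfCode": pe.OPTIONAL_HEADER.BaseOfCode,
        # 64-bit (PE32+) images have no BaseOfData field; fall back to 0.
        "BaseOfData": getattr(pe.OPTIONAL_HEADER, "BaseOfData", 0),
        "ImageBase": pe.OPTIONAL_HEADER.ImageBase,
    }


def build_feature_vector(fields, features):
    """Order the extracted values to match the feature list used in training."""
    return [fields[name] for name in features]


# Ordering check with made-up values (no PE file needed).
demo_vector = build_feature_vector({"Machine": 332, "SizeOfCode": 10}, ["SizeOfCode", "Machine"])
print(demo_vector)
```

With a real file, the vector would be passed to the loaded model, for example clf.predict([build_feature_vector(extract_header_fields("TestDummy.exe"), features)]).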
In [ ]: % malware_test "TestDummy.exe"
In [ ]: % malware_test "GitHubDesktopSetup.exe"