PythonMalware FirstReview


SIGNIFICANT PERMISSION IDENTIFICATION FOR ANDROID MALWARE DETECTION
SYNOPSIS
• The project introduces Significant Permission IDentification (SigPID), a malware detection system that uses a neural network and is based on permission usage analysis, to cope with the rapid increase in the number of Android malware.

• It develops three levels of pruning by mining the permission data to identify the most significant permissions that can be effective in distinguishing between benign and malicious apps.

• It applies Linear Regression, SVM, KNN and CNN classification on the new data set.
OBJECTIVES
• To identify the dangerous, benign and shutdown-enabled permission lists.

• To carry out feature reduction.

• To apply SVM/KNN classification so that the probability of benign/suspicious apps in new test data can be estimated.

• To reduce features (based on unique values in the permission list) before malware identification using CNN is carried out.
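
One possible reading of "reduce features based on unique values" is to drop every column that holds only a single unique value, since such a column cannot help separate benign from malicious records. The sketch below assumes that reading. The toy table, its column names and the `drop_constant_columns` helper are hypothetical and are not taken from the project.

```python
import pandas as pd


def drop_constant_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df without columns that have one unique value."""
    # nunique() counts the distinct non-null values in a column.
    keep = [col for col in df.columns if df[col].nunique() > 1]
    return df[keep].copy()


# Toy data (hypothetical): 'perm_b' is constant, so it carries no information.
toy = pd.DataFrame({
    "perm_a": [1, 0, 1, 0],
    "perm_b": [1, 1, 1, 1],
    "label": [1, 0, 1, 0],
})
reduced = drop_constant_columns(toy)
print(list(reduced.columns))
```

Other criteria (for example a minimum number of distinct values) would fit the same helper by changing the threshold in the list comprehension.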
EXISTING SYSTEM
• The existing system focuses on Linear Regression and SVM classification algorithms to detect malware apps.

• The dataset is taken from Kaggle. Preprocessing such as zero value, N/A value and Unicode character elimination is not carried out.

• Important features are not extracted for better classification.

• A confusion matrix is not prepared and no accuracy score is calculated.
DISADVANTAGES OF EXISTING SYSTEM
SVM classification is not considered, so the probability of benign/suspicious apps in new test data cannot be estimated.

Feature reduction before malware identification is not carried out.

Only data columns with numeric values are taken for SVM classification.
PROPOSED SYSTEM
• The proposed system focuses on the KNN classification algorithm as well as a neural network to detect malware apps.

• The dataset is taken from Kaggle and preprocessed, for example by removing Unicode characters.

• Important features are extracted for better classification.

• A confusion matrix is prepared and the accuracy score is calculated.
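
The preprocessing step can be sketched as follows. This is a minimal illustration, assuming that "Unicode removal" means stripping non-ASCII characters from text columns and that rows with N/A values are simply dropped. The `Name` and `Malware` column names and both helper functions are hypothetical; only `e_magic` appears in the project's column list.

```python
import pandas as pd


def strip_non_ascii(value):
    """Remove non-ASCII characters from strings; leave other values as-is."""
    if isinstance(value, str):
        return value.encode("ascii", errors="ignore").decode("ascii")
    return value


def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    """Drop rows with N/A values, then strip non-ASCII text in every column."""
    cleaned = df.dropna().copy()
    for col in cleaned.columns:
        cleaned[col] = cleaned[col].map(strip_non_ascii)
    return cleaned.reset_index(drop=True)


# Toy records (hypothetical): one accented name and one missing name.
raw = pd.DataFrame({
    "Name": ["app\u00e9.exe", "tool.exe", None],
    "e_magic": [23117, 23117, 23117],
    "Malware": [1, 0, 1],
})
clean = preprocess(raw)
print(clean)
```

Dropping rows is the simplest policy; filling missing values instead would keep more records at the cost of introducing estimated data.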
PROPOSED SYSTEM (CONTD)
Accuracy prediction is also carried out.

A Convolutional Neural Network based prediction model is built to assess algorithm efficiency.

15600 training records and 4000 test records are used for convolutional neural network training and testing.
ADVANTAGES OF PROPOSED SYSTEM
• KNN classification is considered, so the probability of benign/suspicious apps in new test data can be estimated.

• Feature reduction before malware identification is carried out.

• KNN performs well even when the dataset is large.

• A Convolutional Neural Network based prediction model is built to assess algorithm efficiency.
HARDWARE SPECIFICATION
Processor : Intel Core 2 Duo

Hard Disk Capacity : 500 GB

RAM : 4 GB DDR RAM

Monitor : 17 inch Color

Keyboard : 102 keys

Mouse : Optical Scroll


SOFTWARE SPECIFICATION

Operating System : Windows 10 Pro

Environment : IDLE/Colab

Language : Python 3.7


MODULES
DATA SET COLLECTION

DATA SET SUBSETTING BASED ON MALWARE TYPES

Linear Regression, Support Vector Machine/K-Nearest Neighbor CLASSIFICATION

Convolutional Neural Network based CLASSIFICATION
1. DATA SET COLLECTION

The dataset, which contains 79 columns (e_magic, e_cblp, e_cp, e_crlc, e_cparhdr, e_minalloc, e_maxalloc, e_ss, e_sp, e_csum, e_ip, e_cs, e_lfarlc, e_ovno, etc.), is saved as records in a single Excel workbook. This is the input for the project.
2. DATA SET SUBSETTING BASED ON MALWARE TYPES
The dataset, which contains 79 columns (e_magic, e_cblp, e_cp, e_crlc, e_cparhdr, e_minalloc, e_maxalloc, e_ss, e_sp, e_csum, e_ip, e_cs, e_lfarlc, e_ovno, etc.), is saved as records in a single Excel workbook.

This is the input for the project. From it, 15600 records (Malware 1 and 0 combined) are split off for training and 4000 records (Malware 1 and 0 combined) for testing, and both subsets are given to the neural network.
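
The subsetting step can be sketched as below. This is only an illustration under stated assumptions: the records are shuffled before being cut into disjoint subsets, the data is synthetic, and the sizes are scaled down from the project's 15600 and 4000. The `subset_train_test` helper and the `Malware` column name are hypothetical.

```python
import numpy as np
import pandas as pd


def subset_train_test(df, n_train, n_test, seed=0):
    """Shuffle df and return disjoint training and test subsets."""
    if n_train + n_test > len(df):
        raise ValueError("not enough records for the requested subset sizes")
    shuffled = df.sample(frac=1.0, random_state=seed).reset_index(drop=True)
    train = shuffled.iloc[:n_train]
    test = shuffled.iloc[n_train:n_train + n_test]
    return train, test


# Synthetic stand-in for the workbook: a label column plus two features.
rng = np.random.default_rng(0)
records = pd.DataFrame({
    "Malware": rng.integers(0, 2, size=200),
    "e_magic": rng.integers(0, 65536, size=200),
    "e_cblp": rng.integers(0, 65536, size=200),
})

# The project uses 15600 and 4000 records; smaller sizes keep the demo quick.
train_df, test_df = subset_train_test(records, n_train=156, n_test=40)
print(len(train_df), len(test_df))
```

Because both subsets are drawn from one shuffled table, each contains a mix of records labelled 1 and 0.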
3. LR/SVM/KNN CLASSIFICATION
In this module, 80% of the data in the given data set is taken as training data and 20% is taken as test data.

The text (categorical) columns are converted into numerical values.

The model is then trained on the training data and used to predict on the test data.

In the predictions, most of the apps are classified as Benign and fewer apps are classified as Suspicious.
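
The steps of this module can be sketched with scikit-learn. The example is a minimal illustration on synthetic data, not the project's code: the two numeric features, the class sizes, `n_neighbors=5` and the random seeds are all assumptions chosen for the demo.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import LabelEncoder

# Synthetic stand-in data: two numeric features and a text label.
rng = np.random.default_rng(42)
benign = rng.normal(loc=0.0, scale=1.0, size=(120, 2))
suspicious = rng.normal(loc=4.0, scale=1.0, size=(40, 2))
X = np.vstack([benign, suspicious])
labels = np.array(["Benign"] * 120 + ["Suspicious"] * 40)

# Convert the text (categorical) label into numerical values.
encoder = LabelEncoder()
y = encoder.fit_transform(labels)  # classes are sorted: Benign, Suspicious

# 80% of the data for training, 20% for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y)

# Train on the training data, then predict on the test data.
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)

# Per-class probability for every test record.
proba = knn.predict_proba(X_test)

print(confusion_matrix(y_test, y_pred))
print("accuracy:", accuracy_score(y_test, y_pred))
```

`predict_proba` is what gives the probability of a record being benign or suspicious; for KNN it is the share of the nearest neighbours that belong to each class.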
4. CNN CLASSIFICATION
Here the dataset is loaded first. The data is stored in the form of CSV (Comma Separated Values).
Each record contains 79 values for one virus definition.

Data Encoding: It converts the categorical column (the label in our case) into numerical values.

The variables required for model training are then defined. Once the model is created, it can be compiled using ‘model.compile’.

The model is trained for just five epochs, but the number of epochs can be increased.

After the training process is completed, predictions can be made on the test set. The accuracy value is displayed during the iterations.
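
The data encoding and input preparation for the CNN can be sketched with NumPy alone. The 7 x 5 grid mirrors the `rows, cols = 7, 5` setting in the sample code, which means each record must supply exactly 35 feature values after feature reduction. The `encode_labels` and `to_cnn_input` helpers and the label strings are hypothetical.

```python
import numpy as np

ROWS, COLS = 7, 5  # grid shape assumed for the CNN input


def encode_labels(labels):
    """Map each distinct label to an integer code (sorted order)."""
    classes = sorted(set(labels))
    lookup = {name: code for code, name in enumerate(classes)}
    return np.array([lookup[name] for name in labels]), classes


def to_cnn_input(features):
    """Reshape (n, ROWS*COLS) features to (n, ROWS, COLS, 1) float32."""
    features = np.asarray(features, dtype="float32")
    if features.ndim != 2 or features.shape[1] != ROWS * COLS:
        raise ValueError(f"expected {ROWS * COLS} features per record")
    return features.reshape(-1, ROWS, COLS, 1)


# Three toy records with 35 feature values each.
labels = ["malware", "benign", "malware"]
codes, classes = encode_labels(labels)
grid = to_cnn_input(np.arange(3 * 35).reshape(3, 35))
print(codes, classes, grid.shape)
```

The explicit shape check fails early when a record does not have 35 features, instead of letting `reshape` raise a less descriptive error.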
SYSTEM FLOW DIAGRAM
SIGNIFICANT PERMISSION IDENTIFICATION FOR ANDROID MALWARE DETECTION

Select Excel File

DATA SET PREPROCESS: Remove N/A values; Subset train and test records for CNN

CLASSIFICATION: LR/SVM/KNN Classification; Feature Reduction for CNN; CNN Classification in reduced feature set
LITERATURE SURVEY PAPERS
[1] M. Grace, Y. Zhou, Q. Zhang, S. Zou, and X. Jiang, “RiskRanker: Scalable and accurate zero-day android malware detection,” in Proc. 10th Int. Conf. Mobile Syst., Appl., Services, 2012, pp. 281–294.

[2] A. P. Felt, E. Chin, S. Hanna, D. Song, and D. Wagner, “Android permissions demystified,” in Proc. 18th ACM Conf. Comput. Commun. Security, 2011, pp. 627–638.

[3] W. Enck et al., “TaintDroid: An information-flow tracking system for realtime privacy monitoring on smartphones,” ACM Trans. Comput. Syst., vol. 32, no. 2, 2014, Art. no. 5.

[4] D. Arp, M. Spreitzenbarth, M. Hübner, H. Gascon, K. Rieck, and C. Siemens, “DREBIN: Effective and explainable detection of android malware in your pocket,” presented at Annu. Symp. Netw. Distrib. Syst. Security, 2014.

[5] C. Yang, Z. Xu, G. Gu, V. Yegneswaran, and P. Porras, “DroidMiner: Automated mining and characterization of fine-grained malicious behaviors in android applications,” in Proc. Eur. Symp. Res. Comput. Security, 2014, pp. 163–182.
CONCLUSION
The project focuses on SVM classification algorithms to detect malware apps.

The dataset is taken from Kaggle.

Preprocessing such as zero value, N/A value and Unicode character elimination is carried out.

Important features are extracted for better classification.


CONCLUSION (Contd)
In addition, K-NN and CNN based classification algorithms are also carried out to detect malware apps.

A confusion matrix is prepared and the accuracy score is calculated.

Accuracy prediction is also carried out.


SAMPLE CODING
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import (BatchNormalization, Conv2D, Dense,
                                     Dropout, Flatten, MaxPooling2D)

# Load the pre-split training and test records.
train_df = pd.read_csv('./dataset_malwares_train.csv')
test_df = pd.read_csv('./dataset_malwares_test.csv')
train_df.head()

# Column 0 holds the label; the remaining columns are the features.
train_data = np.array(train_df.iloc[:, 1:])
test_data = np.array(test_df.iloc[:, 1:])
train_labels = train_df.iloc[:, 0]
test_labels = test_df.iloc[:, 0]

# Reshape every record into a 7 x 5 single-channel grid for Conv2D.
# This requires exactly rows * cols = 35 feature columns per record.
rows, cols = 7, 5
train_data = train_data.reshape(train_data.shape[0], rows, cols, 1)
test_data = test_data.reshape(test_data.shape[0], rows, cols, 1)

# Convert to float32 and scale the values.
train_data = train_data.astype('float32')
test_data = test_data.astype('float32')
train_data /= 255.0
test_data /= 255.0

# Hold out 20% of the training records for validation.
train_x, val_x, train_y, val_y = train_test_split(train_data, train_labels, test_size=0.2)

batch_size = 32
epochs = 10
input_shape = (rows, cols, 1)


def baseline_model():
    """Two Conv2D blocks followed by a dense classifier head."""
    model = Sequential()
    model.add(BatchNormalization(input_shape=input_shape))
    model.add(Conv2D(32, (3, 3), padding='same', activation='relu'))
    model.add(MaxPooling2D(pool_size=(2, 2), strides=(2, 2)))
    model.add(Dropout(0.25))
    model.add(BatchNormalization())
    model.add(Conv2D(32, (3, 3), padding='same', activation='relu'))
    model.add(MaxPooling2D(pool_size=(2, 2)))
    model.add(Dropout(0.25))
    model.add(Flatten())
    model.add(Dense(128, activation='relu'))
    model.add(Dropout(0.5))
    # A single output unit needs sigmoid; softmax over one unit always outputs 1.
    model.add(Dense(1, activation='sigmoid'))
    return model


# Assumed settings: the slides name model.compile but do not give the loss or optimizer.
model = baseline_model()
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(train_x, train_y, batch_size=batch_size, epochs=epochs,
          validation_data=(val_x, val_y))
model.evaluate(test_data, test_labels)
SCREEN SHOTS
LINEAR REGRESSION
KNN ACCURACY SCORE
RECALL VALUE : 0.86
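
For reference, recall is the fraction of actual positive records that the classifier finds, and accuracy is the fraction of all records it labels correctly. Both follow directly from the four counts of a binary confusion matrix, as in the sketch below. The counts are invented for illustration only and are not the project's results; the `recall_and_accuracy` helper is hypothetical.

```python
def recall_and_accuracy(tp, fn, fp, tn):
    """Return (recall, accuracy) from binary confusion-matrix counts."""
    recall = tp / (tp + fn)  # share of actual positives that were found
    accuracy = (tp + tn) / (tp + fn + fp + tn)  # share of all records labelled correctly
    return recall, accuracy


# Invented counts for illustration only; they are not the project's results.
recall, accuracy = recall_and_accuracy(tp=45, fn=5, fp=10, tn=140)
print(f"recall={recall:.2f} accuracy={accuracy:.3f}")
```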
THANK YOU
