Document For Final Project
Project Report on
SECURING ANDROID DEVICES THROUGH MACHINE
LEARNING BASED ON MALWARE DETECTION
Submitted to
N.B.K.R. INSTITUTE OF SCIENCE AND TECHNOLOGY
(Autonomous)
Affiliated to JNTUA, Anantapuramu
in partial fulfillment of the requirements for the award of the Degree of
BACHELOR OF TECHNOLOGY
in
COMPUTER SCIENCE AND ENGINEERING
Submitted by
Batch No: A12
A. RAMESWAR (21KB5A0515)
J. VISHNU VARDHAN REDDY (21KB1A0561)
D. KUSUMANJALI (21KB1A0539)
K. ASRITHA (21KB1A0563)
BONAFIDE CERTIFICATE
This is to certify that the project work entitled “ SECURING ANDROID DEVICES
THROUGH MACHINE LEARNING BASED ON MALWARE DETECTION ” is a
bonafide work done by A. RAMESWAR (21KB5A0515), J. VISHNU VARDHAN REDDY
(21KB1A0561), D. KUSUMANJALI (21KB1A0539) , K. ASRITHA (21KB1A0563) in the
Department of Computer Science & Engineering, N.B.K.R. Institute of Science &
Technology, Vidyanagar and is submitted to JNTUA, Anantapuramu in the partial fulfillment
for the award of B.Tech degree in Computer Science & Engineering. This work has been
carried out under my supervision.
Examiner-1 Examiner-2
DECLARATION
A. RAMESWAR (21KB5A0515)
J. VISHNU VARDHAN REDDY (21KB1A0561)
D. KUSUMANJALI. (21KB1A0539)
K. ASRITHA (21KB1A0563)
ACKNOWLEDGEMENT
We are thankful to our guide Mr. K. Raveendra Chaithanya for his valuable
guidance and encouragement. His helping attitude and suggestions have helped us in the
successful completion of the project.
We would like to express our gratefulness and sincere thanks to Dr. A. Raja Sekhar
Reddy, Head of the Department of COMPUTER SCIENCE AND ENGINEERING, for
his kind help and encouragement during the course of our study and in the successful
completion of the project work.
We have great pleasure in expressing our hearty thanks to our beloved Director Dr.
V. Vijaya Kumar Reddy, for spending his valuable time with us to complete this project.
Successful completion of any project cannot be done without proper support and
encouragement. We sincerely thank the Management for providing all the necessary
facilities during the course of study.
We would like to thank our parents and friends, who have the greatest contributions
in all our achievements, for the great care and blessings in making us successful in all our
endeavors.
A. RAMESWAR (21KB5A0515)
J. VISHNU VARDHAN REDDY (21KB1A0561)
D. KUSUMANJALI. (21KB1A0539)
K. ASRITHA (21KB1A0563)
TABLE OF CONTENTS
Chapter No. Description Page No.
Abstract i
List of Figures ii
1 Introduction 1
2 Project Description 2
2.1 Problem Definition 2
2.2 Project Details 2
3 Computational Environment 3
3.1 Software Specification 5
3.2 Hardware Specification 5
3.3 Software Features 6
4 Feasibility Study 14
4.1 Technical Feasibility 14
4.2 Social Feasibility 15
4.3 Economical Feasibility 15
5 System Analysis 16
5.1 Existing System 16
5.1.1 Drawbacks of existing system 18
5.2 Proposed System 18
5.2.1 Advantages of proposed System 20
6 System Design 21
6.1 UML Diagrams 23
6.1.1 Class Diagram 25
6.1.2 Use case Diagram 26
6.1.3 Sequence Diagram 30
6.1.4 Activity Diagram 31
6.1.5 Deployment Diagram 34
7 System Implementation 35
7.1 Implementation Process 35
7.2 Modules 35
8 Testing 37
8.1 Unit Testing 38
8.2 Integration Testing 39
8.3 System Testing 40
8.4 Acceptance Testing 41
9 Sample Source Code 43
10 Screen Layouts 46
11 Conclusion and Future Scope 48
12 Bibliography 49
References 49
Websites 50
ABSTRACT
Android has become the most popular smartphone operating system. Its rapidly growing
adoption has resulted in a significant increase in the amount of malware compared
with earlier years.
Plenty of anti-malware programs exist that are designed to protect users’
sensitive data on mobile systems from such attacks. Here, we examine different kinds of Android
malware and the methods they use to attack devices, as well as the
antivirus programs that act against malware to protect Android systems.
We then discuss different deep learning based Android malware detection techniques,
such as MalDozer, Droid Detector, Droid Deep Learner, Deep Flow, Droid Delver and Droid Deep.
The ultimate aim of this study is to design and implement a deep learning-based model capable
of automatically and accurately identifying whether an Android application is malware-infected or not,
without the need for installation. Our approach seeks to enhance pre-installation security checks,
minimize false positives, and provide an efficient, scalable solution for Android malware detection.
Through this work, we hope to contribute towards the advancement of intelligent, adaptive
security mechanisms that can keep pace with the rapidly evolving landscape of mobile cybersecurity
threats.
LIST OF FIGURES
Securing Android Devices Through Machine Learning Based on Malware Detection
1. INTRODUCTION
Mobile applications have become an essential part of daily life, since countless services are
provided to us through mobile apps. Because apps are installed on most smart devices, they are
changing the way we communicate. Mobile devices carry sophisticated sensors such as cameras,
gyroscopes, microphones and GPS. These sensors open up an entirely new world of applications for
users and generate massive quantities of highly complex data.
Security solutions are therefore needed to defend users from malicious applications that exploit
the complexity of smart devices and their data. Android has grown rapidly and now powers a
wide range of smart devices. In the mobile computing industry it held the largest share, about 85% in
2017, owing to its open-source distribution.
The current defence against malware on Android platforms is a risk communication system that
notifies users of the required permissions before each application is installed. This system is rather
ineffective because it presents the permissions on their own: to distinguish malware from benign
applications, the user would need a great deal of technical knowledge.
Benign and malicious applications often require the same permissions, so they
cannot be distinguished by this permission-based system. In general, permission-based
methods were developed for risk assessment rather than for malware detection.
The Android operating system makes it harder for malware to be installed and executed,
because Android itself provides several security mechanisms, for example Android permissions and
Google’s Bouncer, to address increasingly widespread security threats. During installation, every
Android application must ask the user for permission to perform certain tasks on the device, such as
sending SMS messages.
Most users grant these permissions without considering what kinds of permissions are being
requested, so the Android permission system is significantly weakened. Accordingly, the permission
system by itself does not stop the spread of malicious apps, and stopping it is very challenging in practice.
2. PROJECT DESCRIPTION
The open-source nature of the Android operating system has attracted wide adoption of the
system by many types of developers. This has in turn fostered an exponential
proliferation of devices running Android across different sectors of the economy. Although this
development has brought great technological advancement and made business (e-commerce)
and social interaction easier, these devices have also become a strong medium for the uncontrolled
rise of cyberattacks and espionage against business infrastructure and the individual users of
mobile devices. Different cyberattack techniques exist, but attacks through malicious applications
have taken the lead over other attack methods such as social engineering. Android malware has evolved
in sophistication and intelligence to the point that it is highly resistant to existing detection systems,
especially those that are signature based. Machine learning techniques have become a more
competent choice for combating the sophistication and novelty of emerging
Android malware. Models created with machine learning methods work by first learning the
existing patterns of malware behaviour and then using this knowledge to identify
similar behaviour in unknown attacks. This paper provides a comprehensive review of machine
learning techniques and their application to Android malware detection in contemporary literature.
Research has shown that Android malware analysis can be done in three different ways. The
first method involves static [1] and dynamic [2] investigation of the application's code
in order to spot malicious components before the application is loaded onto any device. The
second method involves modifying the Android system to add modules that monitor
and intercept abnormal behaviour that may occur on the device [3,4,5]. The third approach
uses virtualization to separate domains, ranging from lightweight
isolation of an application on the device to running multiple instances of
Android OS on the same device [6,7]. However, recent studies have shown that machine learning
or “anomaly detection” approaches have emerged as a leading and more effective
approach for defeating Android malware [8, 9, 10, 11].
Unlike static analysis, which involves manual examination of the
AndroidManifest.xml file, source files and the Dalvik bytecode, and dynamic analysis, which
involves running an application in a controlled environment to study its behaviour, the machine
learning approach learns general rules and patterns from benign and malicious app
samples and then makes data-driven predictions or decisions, such as classification [12]. Machine
learning methodologies largely depend on static attributes extracted from an application [13]. The
static components of an Android application provide the baseline on which machine learning
approaches are anchored, and these static features are obtained through reverse
engineering. Machine learning techniques have been applied widely to the classification of
applications, focusing mainly on generic malware detection. Applying machine learning to
Android malware detection helps eliminate the difficulty of manually crafting and
updating detection patterns [8]. Machine learning is a procedure that analyzes data using software
techniques (algorithms) to create a model, as shown in Fig. 1, which is useful for finding patterns and
regularities in datasets [14]. It is a process of making machines learn from past experience (existing
data) in order to make decisions about future events or data instances. Feature vectors are
essential elements of machine learning and are usually built for the specific task the machine
learning algorithm is intended to accomplish. The basic idea behind machine learning is to estimate
the probability distribution of the data.
Machine learning is divided into three main categories: supervised learning
[16, 17, 18], unsupervised learning [18] and reinforcement learning
[19]. Three basic learning tasks are commonly discussed alongside these categories:
classification, clustering, and regression. Classification is used in supervised learning,
where the data sets are labelled into groups or classes; clustering is used in
unsupervised learning for unlabelled data sets; and regression, like classification, is a supervised
task, in which the expected result is a ranked, graded or estimated value rather than a class.
Reinforcement learning, by contrast, learns from reward signals obtained through interaction. A label
is the name of the class or group a data instance belongs to. In machine learning, data are represented
by a fixed number of features which can be categorical, nominal, or continuous [20]. This paper
gives a thorough review of the existing literature on Android malware detection
using machine learning techniques.
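To make the ideas of feature vectors, labels and supervised classification concrete, the following
minimal sketch encodes each app as a fixed-length binary vector and gives an unknown app the label of
its nearest training sample. The 1-nearest-neighbour rule is chosen only for brevity and is not the
model used in this project; the feature names and the two training samples are invented for illustration.

```python
# Hypothetical feature vocabulary; real systems use hundreds of features.
FEATURES = ["SEND_SMS", "READ_CONTACTS", "INTERNET", "CAMERA"]

def to_vector(app_features):
    """Encode a set of feature names as a fixed-length binary vector."""
    return [1 if name in app_features else 0 for name in FEATURES]

def hamming(a, b):
    """Number of positions in which two equal-length vectors differ."""
    return sum(x != y for x, y in zip(a, b))

def predict(train, sample):
    """Return the label of the training vector closest to `sample`."""
    _, label = min(train, key=lambda pair: hamming(pair[0], sample))
    return label

# Invented labelled samples (the label is the class an instance belongs to).
train = [
    (to_vector({"SEND_SMS", "READ_CONTACTS", "INTERNET"}), "malware"),
    (to_vector({"INTERNET", "CAMERA"}), "benign"),
]
unknown = to_vector({"SEND_SMS", "INTERNET"})
print(predict(train, unknown))
```

The unknown app differs from the first sample in one position and from the second in two, so it
receives the first sample's label.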
Authors in [27, 28] showed in their works that malware attack methods can be characterized
as follows:
• Information Extraction: The malware in this category compromises a device and then steals
personal information such as IMEI number, user’s personal information and many more.
• Automatic Calls and SMS: This group of malware increases a user’s phone bill by placing
automatic calls and sending SMS to some premium numbers.
• Root Exploits: This set of malware seeks to gain system root privileges in order to take control
of the system and modify the system’s configuration and other system information.
• Search Engine Optimizations: The malware here artificially searches for a term and
simulates clicks on targeted websites in order to increase the revenue of a search engine or increase
the traffic on a website.
• Dynamically Downloaded Code: This technique enables an installed benign application to
download malicious code and deploy it on the mobile device without the user being aware.
• Covert and Overt Communication Channels: This is a vulnerability found in a device that
facilitates information leaks between processes that are not supposed to share information.
This technique is seen as highly sophisticated.
3. COMPUTATIONAL ENVIRONMENT
• Ram : 512 MB
Anaconda is more than just a package manager; it's a comprehensive platform for data science
and machine learning workflows. Some key features and components of Anaconda include:
1. *Conda Package Manager*: Anaconda comes with Conda, a powerful package manager that
simplifies package installation, updates, and dependency management. Conda installs packages
from the Anaconda repository as well as from other channels such as conda-forge; packages from
PyPI can be added with pip.
2. *Anaconda Navigator*: A graphical user interface (GUI) that allows users to easily manage
environments, install packages, and launch applications. It provides a convenient way to navigate
through projects and environments.
4. *Spyder IDE*: Anaconda comes with Spyder, an Integrated Development Environment (IDE)
designed specifically for scientific computing and data analysis in Python. Spyder provides features
such as code editing, debugging, variable exploration, and integrated IPython consoles.
7. *Support for Multiple Platforms*: Anaconda is available for Windows, macOS, and Linux,
making it accessible to users across different operating systems.
Overall, Anaconda provides a convenient and powerful platform for data scientists, researchers, and
developers to work with Python and its ecosystem of libraries for data analysis, machine learning, and
scientific computing.
3.3.2 SPYDER
Spyder is an Integrated Development Environment (IDE) primarily used for scientific
computing and data analysis in Python. While it's commonly known for its frontend interface, which
provides features such as code editing, variable exploration, and debugging, Spyder also has a robust
backend that facilitates these functionalities. Here's some more information about Spyder's backend:
1. *Code Editor*: Spyder's backend includes a sophisticated code editor with features like
syntax highlighting, code completion, code folding, and automatic indentation. The backend manages
these functionalities to provide a smooth coding experience for users.
2. *Debugger*: Spyder's backend integrates a powerful debugger that allows users to step
through code, set breakpoints, inspect variables, and analyze program execution. The backend handles
communication with the Python interpreter to provide debugging capabilities within the IDE.
3. *Variable Explorer*: Spyder includes a Variable Explorer that allows users to interactively
explore and manipulate variables in their Python environment. The backend manages the
synchronization of variables between the Python interpreter and the Variable Explorer interface.
4. *Integrated IPython Console*: Spyder's backend integrates an IPython console within the
IDE, allowing users to execute Python code interactively and access the full power of the IPython
interpreter. The backend handles communication between the console and the Python interpreter
running in the background.
5. *Code Analysis Tools*: Spyder's backend includes tools for static code analysis, such as
linting, code style checking, and code formatting. These tools help users write clean, consistent, and
error-free code by providing real-time feedback and suggestions.
6. *Integration with External Tools*: Spyder's backend can integrate with external tools and
libraries for specialized tasks, such as version control systems (e.g., Git), data visualization libraries
(e.g., Matplotlib), and scientific computing packages (e.g., NumPy, SciPy).
Overall, Spyder's backend plays a crucial role in providing a seamless development experience
for users working on scientific computing and data analysis projects in Python. It handles various
tasks behind the scenes to ensure that users can write, debug, and analyze code efficiently within the
IDE.
SRS
DATA MINING
The actual data mining task is the automatic or semi-automatic analysis of large quantities of
data to extract previously unknown, interesting patterns such as groups of data records (cluster
analysis), unusual records (anomaly detection), and dependencies (association rule mining). This
usually involves using database techniques such as spatial indices. These patterns can then be seen as
a kind of summary of the input data, and may be used in further analysis or, for example, in machine
learning and predictive analytics.
For example, the data mining step might identify multiple groups in the data, which can then
be used to obtain more accurate prediction results by a decision support system. Neither the data
collection, data preparation, nor result interpretation and reporting is part of the data mining step, but
do belong to the overall KDD process as additional steps.
The related terms data dredging, data fishing, and data snooping refer to the use of data mining
methods to sample parts of a larger population data set that are (or may be) too small for reliable
statistical inferences to be made about the validity of any patterns discovered. These methods can,
however, be used in creating new hypotheses to test against the larger data populations.
Big Data concern large-volume, complex, growing data sets with multiple, autonomous
sources. With the fast development of networking, data storage, and the data collection capacity, Big
Data are now rapidly expanding in all science and engineering domains, including physical, biological
and biomedical sciences. This paper presents a HACE theorem that characterizes the features of the
Big Data revolution, and proposes a Big Data processing model, from the data mining perspective.
This data-driven model involves demand-driven aggregation of information sources, mining and
analysis, user interest modeling, and security and privacy considerations. We analyze the challenging
issues in the data-driven model and also in the Big Data revolution.
BIG DATA
Big data is a collection of data sets so large and complex that it becomes difficult to process
using on hand database management tools. The challenges include capture, curation, storage, search,
sharing, analysis, and visualization.
The trend to larger data sets is due to the additional information derivable from analysis of a
single large set of related data, as compared to separate smaller sets with the same total amount of
data, allowing correlations to be found to "spot business trends, determine quality of research, prevent
diseases, link legal citations, combat crime, and determine real-time roadway traffic conditions."
Put another way, big data is the realization of greater business intelligence by storing,
processing, and analyzing data that was previously ignored due to the limitations of traditional data
management technologies.
• Veracity: Trust and integrity are both a challenge and a necessity, and are as important for big
data as for traditional relational databases.
Fig.no.3.1
Some concepts
• NoSQL (Not Only SQL): Databases that “move beyond” relational data models (i.e., no tables,
limited or no use of SQL)
– Focus on retrieval of data and appending new data (not necessarily tables)
– Focus on key-value data stores that can be used to locate data objects
Hadoop
• Hadoop is a distributed file system and data processing engine that is designed to handle
extremely high volumes of data in any structure.
– The Hadoop distributed file system (HDFS), which supports data in structured
relational form, in unstructured form, and in any form in between
• Apache Avro: designed for communication between Hadoop nodes through data serialization
• Cassandra and HBase: non-relational databases that can be used with Hadoop
• Hive: a query language similar to SQL (HiveQL) but compatible with Hadoop
• Mahout: an AI tool designed for machine learning; that is, to assist with filtering data for
analysis and exploration
• Pig Latin: A data-flow language and execution framework for parallel computation
The Knowledge Discovery in Databases (KDD) process is commonly defined with the stages:
(1) Selection
(2) Pre-processing
(3) Transformation
(4) Data Mining
(5) Interpretation/Evaluation.
Fig.no.3.2
It exists, however, in many variations on this theme, such as the Cross Industry Standard
Process for Data Mining (CRISP-DM) which defines six phases:
(1) Business Understanding
(2) Data Understanding
(3) Data Preparation
(4) Modeling
(5) Evaluation
(6) Deployment
or a simplified process such as (1) pre-processing, (2) data mining, and (3)
results validation.
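The KDD stages can be read as a chain of functions, each consuming the output of the previous one.
The sketch below is illustrative only: the records, the field names and the trivial "mining" step
(counting how often each permission occurs) are assumptions made for the example and are not part
of the project.

```python
from collections import Counter

# Invented raw records; None marks a missing value.
raw = [
    {"app": "a.apk", "perms": "INTERNET,SEND_SMS", "size_kb": 120},
    {"app": "b.apk", "perms": None, "size_kb": 300},
    {"app": "c.apk", "perms": "INTERNET", "size_kb": 80},
]

def select(records):
    """(1) Selection: keep only the fields relevant to the task."""
    return [{"app": r["app"], "perms": r["perms"]} for r in records]

def preprocess(records):
    """(2) Pre-processing: drop records with missing values."""
    return [r for r in records if r["perms"] is not None]

def transform(records):
    """(3) Transformation: turn the comma-separated string into a set."""
    return [set(r["perms"].split(",")) for r in records]

def mine(perm_sets):
    """(4) Data mining: count how often each permission occurs."""
    return Counter(p for perms in perm_sets for p in perms)

def interpret(counts):
    """(5) Interpretation/evaluation: report the most common permission."""
    return counts.most_common(1)[0]

result = interpret(mine(transform(preprocess(select(raw)))))
print(result)
```

Keeping each stage as a separate function makes it easy to swap the mining step for a real model
without touching selection or cleaning.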
4. FEASIBILITY STUDY
The feasibility study is an essential phase in the system development life cycle, focusing on
analyzing the viability of the proposed project. During this phase, a general plan is outlined, and
preliminary cost estimates are provided. The objective is to ensure that the proposed system is
practical, achievable, and will not impose an unnecessary burden on the organization.
A thorough feasibility study requires a clear understanding of the major system requirements,
ensuring that the solution aligns with the organization's resources, objectives, and constraints. The
feasibility analysis primarily revolves around three key considerations:
TECHNICAL FEASIBILITY
SOCIAL FEASIBILITY
ECONOMICAL FEASIBILITY
Technical feasibility focuses on assessing the technical resources and capabilities required to
develop and implement the proposed system. It ensures that the system's technical requirements are
within the organization's current capabilities and infrastructure.
A technically feasible system should not impose excessive demands on the existing resources.
If the system requires significant upgrades, it may not be practical. Hence, our proposed system has
been designed with modest technical requirements to minimize the need for extensive modifications
or additional infrastructure.
The technologies employed in the system are mainstream, well-supported, and scalable,
ensuring ease of integration and maintenance. This guarantees that the system can be effectively
deployed with minimal disruption to the organization's operations.
Social feasibility evaluates the level of user acceptance and readiness to adapt to the new
system. The success of any system largely depends on how well it is received by its intended users.
This involves training users adequately so that they can operate the system efficiently without
hesitation or resistance. Efforts are made to ensure that users perceive the system as an enhancement
rather than a threat. Raising the confidence level of users is crucial — they should feel comfortable
providing feedback and suggesting improvements, fostering a sense of ownership and trust in the
system.
Economic feasibility examines the cost-effectiveness of the proposed system. It ensures that
the benefits derived from the system outweigh the costs involved in its development, deployment, and
maintenance.
Given that organizational budgets are often constrained, it is vital that the system remains
within financial limits. In this project, economic feasibility has been carefully considered, and most of
the technologies and tools utilized are open-source or freely available, significantly reducing costs.
Only necessary customized components were procured, keeping expenditures well within the
allocated budget. The overall result is a cost-efficient, high-performing system that delivers strong
value to the organization without imposing a financial strain.
5. SYSTEM ANALYSIS
Bouncer can scan an Android application only for a limited period of time, which allows a malicious
app to bypass it effortlessly by doing nothing malicious during the scan phase.
Second, the initial installer need not contain any malicious code when it is scanned by
Bouncer. In this case, the malicious app has a higher chance of avoiding detection by
Bouncer.
Benign and malicious applications often require the same permissions, so they
cannot be distinguished by a permission-based system. In general, permission-based
methods were developed for risk assessment rather than for malware detection.
The current methods for Android malware detection have significant limitations that leave devices
vulnerable to sophisticated attacks. One of the primary detection mechanisms is Google Bouncer,
which scans Android applications before they are made available on the Play Store. However, Bouncer
has several notable shortcomings:
• Limited Time Scanning: Bouncer analyzes applications only for a limited period. Malicious
apps can easily exploit this by remaining inactive during the scan, displaying no harmful
behavior and thereby evading detection.
• Installer-Based Evasion: During the initial scan phase, if the APK installer file does not
contain any obvious malicious code, the application can bypass Bouncer’s scrutiny. Malicious
functionality can be downloaded or activated later, once the app has been installed on a user's
device, making the initial scanning ineffective.
• Permission-Based Detection Limitations: Traditional systems often rely on analyzing
application permissions to assess risk. However, both benign and malicious apps frequently
request similar permissions, making it difficult to distinguish between them using permissions
alone. Furthermore, permission-based models are typically designed for risk assessment, not
active malware detection, reducing their effectiveness against more sophisticated threats.
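The permission-based limitation can be illustrated with a small sketch: when a benign and a malicious
app request the same permissions, a similarity measure over permission sets cannot separate them.
The two permission sets below are invented examples, and Jaccard similarity is used only as a
convenient measure, not as a method taken from the systems discussed here.

```python
def jaccard(a, b):
    """Jaccard similarity of two sets: size of intersection over size of union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

# Invented permission sets for a messaging app and an SMS trojan.
benign_messenger = {"INTERNET", "SEND_SMS", "READ_CONTACTS", "RECEIVE_SMS"}
sms_trojan = {"INTERNET", "SEND_SMS", "READ_CONTACTS", "RECEIVE_SMS"}

similarity = jaccard(benign_messenger, sms_trojan)
# The sets are identical, so permissions alone give no basis for a decision.
print(similarity)
```

This is why the later sections combine permissions with other features such as API calls.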
Even with these mechanisms in place, there is still considerable room for enhancing detection
accuracy, reducing false positives, and improving the adaptability of models to new, unseen malware
variants. Therefore, there is a growing need for more sophisticated and efficient malware detection
systems based on deep learning techniques, capable of operating proactively and reliably even against
evolving threats.
The user who submitted the app receives a report containing complete information from the integrity
check and both analyses. Since new types of applications are constantly emerging, two crawler
modules have been designed: one crawls benign apps from the Google Play Store, and
the other crawls malware from known malware sources.
Droid Deep Learner is an Android malware categorization and identification method. It
uses deep learning to address the current need for malware detection, and
requires a set of features for detection.
Features such as permissions, APIs, actions, intents, IP addresses and URLs are encoded in the
APK file. Building on a source decompilation tool, the authors construct a decoder that converts the
apps into a readable format.
A user can check whether an Android app is malware infected by using Droid Detector, which
is available online as open source, as shown in Fig. 2. First the user submits the .apk file to
the system; Droid Detector then checks its integrity and determines whether the Android application
is genuine, complete and valid.
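The text does not say what the integrity check on a submitted .apk file involves. As a rough
illustration only, the sketch below relies on the fact that an APK is a ZIP archive and assumes that
a usable file must pass the archive's CRC checks and contain AndroidManifest.xml and classes.dex.
The function names and the in-memory test archives are hypothetical.

```python
import io
import zipfile

REQUIRED_ENTRIES = {"AndroidManifest.xml", "classes.dex"}

def looks_like_valid_apk(data):
    """Basic integrity check on the raw bytes of an uploaded .apk file."""
    stream = io.BytesIO(data)
    if not zipfile.is_zipfile(stream):
        return False  # not a ZIP archive at all
    with zipfile.ZipFile(stream) as archive:
        if archive.testzip() is not None:
            return False  # at least one member failed its CRC check
        return REQUIRED_ENTRIES.issubset(archive.namelist())

def make_fake_apk(names):
    """Build a tiny in-memory ZIP for demonstration purposes only."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for name in names:
            archive.writestr(name, b"placeholder")
    return buffer.getvalue()

complete = make_fake_apk(["AndroidManifest.xml", "classes.dex"])
incomplete = make_fake_apk(["readme.txt"])
print(looks_like_valid_apk(complete))
print(looks_like_valid_apk(incomplete))
print(looks_like_valid_apk(b"not a zip"))
```

A check of this kind only rejects files that cannot be analysed; it says nothing about whether the
app is malicious.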
The general architecture of the proposed Droid Deep Learner method is illustrated in Fig. 3.
The goal of this system is to leverage both permission-based and API function call-based features to
detect Android malware. To achieve this, the system first examines Android applications by extracting
their manifest files (.xml) and source code files (.java), as these contain the essential information for
feature extraction.
The system begins by parsing the Android app’s manifest file to extract relevant permissions,
such as internet access, location data access, or any other sensitive permissions that might indicate
potential malicious behavior. Permissions are key indicators of what an app is allowed to do, and they
play a crucial role in the detection process. For instance, an app requesting permissions that are
not typically necessary for its functionality may raise a red flag.
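The permission-extraction step can be sketched as follows. The manifest inside an APK is stored in a
binary XML format, so this example assumes it has already been decoded to plain text by a
decompilation tool; the sample manifest and the function name are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Namespace used for the android: attribute prefix in manifest files.
ANDROID_NS = "{http://schemas.android.com/apk/res/android}"

def extract_permissions(manifest_xml):
    """Return the permission names declared via <uses-permission> tags."""
    root = ET.fromstring(manifest_xml)
    names = []
    for element in root.iter("uses-permission"):
        name = element.get(ANDROID_NS + "name")
        if name:
            names.append(name)
    return sorted(names)

# Invented, already-decoded manifest used purely as sample input.
SAMPLE_MANIFEST = """<manifest
    xmlns:android="http://schemas.android.com/apk/res/android"
    package="com.example.demo">
    <uses-permission android:name="android.permission.INTERNET"/>
    <uses-permission android:name="android.permission.SEND_SMS"/>
    <application android:label="Demo"/>
</manifest>"""

permissions = extract_permissions(SAMPLE_MANIFEST)
print(permissions)
```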
Additionally, the system analyzes the Java source files of the app to identify API function calls.
These calls provide insights into the app's interactions with the operating system and third-party
services, which can be highly indicative of malicious behavior. For example, API calls related to
sending SMS messages, accessing personal data, or making network connections could point to
potentially harmful activities.
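A simple way to picture the API-call step is a text scan of the decompiled Java source for a
watch-list of method names. This is a rough sketch that assumes plain pattern matching is
sufficient; a real system would parse the code or the bytecode. The watch-list and the sample source
are hypothetical.

```python
import re

# Hypothetical watch-list of sensitive Android API method names.
SENSITIVE_CALLS = ["sendTextMessage", "getDeviceId", "openConnection"]

def find_api_calls(java_source):
    """Return the watched method names that are invoked in the source text."""
    found = set()
    for name in SENSITIVE_CALLS:
        # Match ".name(" so that a bare mention of the name without a
        # method call is not counted.
        if re.search(r"\.\s*" + re.escape(name) + r"\s*\(", java_source):
            found.add(name)
    return sorted(found)

SAMPLE_JAVA = """
SmsManager sms = SmsManager.getDefault();
sms.sendTextMessage(number, null, text, null, null);
String id = telephonyManager.getDeviceId();
"""

calls = find_api_calls(SAMPLE_JAVA)
print(calls)
```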
Once both sets of features—permissions and API function calls—are extracted, they are
combined into a comprehensive feature set. This feature set serves as the input for the training and
testing phases of the Deep Belief Network (DBN)-based deep learning model. The DBN model, a type
of deep learning architecture, is particularly effective in learning from large amounts of unstructured
data, making it well-suited for malware detection in Android apps.
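One plausible way to combine the two feature sets is to concatenate binary indicators for each known
permission and each known API call into a single vector. The vocabularies below are tiny
hypothetical examples; the text does not specify the actual encoding fed to the DBN.

```python
# Hypothetical vocabularies; in practice they are built from the training set.
PERMISSION_VOCAB = ["android.permission.INTERNET", "android.permission.SEND_SMS"]
API_VOCAB = ["getDeviceId", "sendTextMessage"]

def build_feature_vector(permissions, api_calls):
    """Concatenate binary permission indicators and API-call indicators."""
    perm_part = [1 if p in permissions else 0 for p in PERMISSION_VOCAB]
    api_part = [1 if a in api_calls else 0 for a in API_VOCAB]
    return perm_part + api_part

vector = build_feature_vector(
    permissions={"android.permission.SEND_SMS"},
    api_calls={"sendTextMessage", "getDeviceId"},
)
print(vector)
```

Every app is mapped to a vector of the same length, which is what a fixed-size network input
requires.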
The deep learning model classifies the applications based on the extracted features, ultimately
distinguishing between benign and malicious apps. This classification process involves training the
model on a labeled dataset containing both benign and malicious apps. During testing, the model's
accuracy is evaluated by comparing the predicted classifications against the actual labels, ensuring that
the model generalizes well to unseen data.
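Comparing predictions with the actual labels is usually summarised with a few standard metrics. The
sketch below computes accuracy, precision, recall and the false-positive rate from two label lists;
the labels are invented and do not represent results of this project.

```python
def evaluate(actual, predicted, positive="malware"):
    """Compute accuracy, precision, recall and false-positive rate."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == positive and p == positive for a, p in pairs)
    tn = sum(a != positive and p != positive for a, p in pairs)
    fp = sum(a != positive and p == positive for a, p in pairs)
    fn = sum(a == positive and p != positive for a, p in pairs)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "false_positive_rate": fp / (fp + tn) if fp + tn else 0.0,
    }

# Invented labels for five test apps.
actual = ["malware", "malware", "benign", "benign", "benign"]
predicted = ["malware", "benign", "benign", "malware", "benign"]
metrics = evaluate(actual, predicted)
print(metrics)
```

The false-positive rate matters here because a benign app flagged as malware is exactly the kind of
error the project aims to minimise.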
To gather the necessary datasets, the system crawls malware samples from identified malware
sources and benign applications from the Google Play Store using specialized crawlers. By regularly
crawling both categories of apps, the system ensures that it remains up-to-date with the latest trends in
both malware and legitimate app development. This dynamic crawling method allows the system to
adapt to the rapidly evolving landscape of Android threats.
One of the standout features of the proposed system, Deep Flow, is its ability to maintain high
precision in detecting emerging and frequently evolving types of malware. By utilizing deep learning,
the system can continuously improve its ability to detect new malware variants that might otherwise
evade traditional detection methods. The ongoing training of the model with new samples allows it to
adapt and keep pace with the ever-changing Android ecosystem.
In summary, the proposed system combines the power of deep learning with detailed analysis
of app permissions and API function calls to provide an advanced solution for detecting malicious
Android apps. Its dynamic and evolving nature ensures it can adapt to new threats, offering both high
precision and adaptability in the detection process.
MalDozer has multiple hyper-parameters, such as the number of layers and the complexity of the model.
For deployment, its authors try to keep the neural network model as simple as possible. To
automatically discover patterns in the raw method calls, MalDozer relies on convolution layers.
The vector sequence is used as input to the neural network, i.e. an L×K shaped matrix. In the training
phase, based on the app vector sequence and its labels, MalDozer trains the neural network parameters
for:
(i) the malicious or benign label for the detection task, and
(ii) the malware family label for the attribution task.
In the deployment phase, the method sequence is extracted from the app and the embedding model is
used to produce the vector sequence. Finally, the vector sequence is used to detect whether the
Android app is malware infected or not.
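The L×K input described above can be illustrated with a minimal sketch. It assumes the API method calls have already been extracted as a list of names, and it uses random vectors as a stand-in for a trained embedding model; the function names `build_embedding` and `to_matrix` and the sample method names are hypothetical and are not taken from MalDozer.

```python
import random


def build_embedding(vocabulary, k, seed=0):
    """Give each known method name a K-dimensional vector.

    Assumption: random vectors stand in for a trained embedding model.
    """
    rng = random.Random(seed)
    return {name: [rng.gauss(0.0, 1.0) for _ in range(k)] for name in vocabulary}


def to_matrix(method_calls, embedding, length, k):
    """Map a method-call sequence to an L x K matrix (list of L rows of K floats).

    Sequences longer than L are truncated; shorter ones are zero-padded.
    Methods missing from the embedding become all-zero rows.
    """
    rows = [list(embedding.get(name, [0.0] * k)) for name in method_calls[:length]]
    while len(rows) < length:
        rows.append([0.0] * k)
    return rows


vocabulary = ["getDeviceId", "sendTextMessage", "openConnection"]
embedding = build_embedding(vocabulary, k=4)
calls = ["getDeviceId", "openConnection", "unknownCall"]
matrix = to_matrix(calls, embedding, length=5, k=4)
print(len(matrix), len(matrix[0]))  # 5 4
```

Fixing L in advance keeps every app the same shape, which is what a convolutional network expects as input.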
6. SYSTEM DESIGN
System design is the critical phase in the development process where the overall architecture,
components, modules, interfaces, and data flow of a system are defined in order to meet the specified
user requirements. It serves as the blueprint for constructing the system and is crucial for transforming
high-level requirements into a working solution. System design, in this context, refers to applying
systems theory to product development, ensuring that the system works efficiently and integrates
smoothly within the intended environment.
In the realm of Android malware detection, system design takes on a pivotal role in ensuring
the efficiency and accuracy of the detection model. It involves making key decisions on how the
various components of the malware detection system will interact, which tools and technologies will
be used, and how data will flow between different system modules. The system design is not just a set
of functional modules but also includes considerations of scalability, performance, security, and
maintainability.
The overall system design for Android malware detection revolves around multiple layers of
interaction and processing:
1. User Interface Layer: The system provides an intuitive web-based interface where users can
upload Android APK files for analysis. This layer is crucial for ensuring that end-users, such
as security analysts or app developers, can easily interact with the system. It also allows them
to receive the analysis results in an understandable format, such as a classification label
indicating whether the app is benign or malicious.
2. Feature Extraction Layer: This is the core processing layer where the system extracts key
features from the APK files. Features such as permissions (from the manifest file) and API
calls (from the source code) are gathered, processed, and organized. The efficiency of feature
extraction directly impacts the accuracy of the classification model, making it a crucial step in
the design process.
3. Data Preprocessing and Augmentation: Before the features are fed into the machine learning
model, they must undergo preprocessing to handle missing data, normalize values, or perform
feature scaling. Additionally, data augmentation techniques can be applied to enrich the dataset,
especially in the case of detecting rare or evolving types of malware.
4. Machine Learning and Deep Learning Model Layer: This layer is responsible for the actual
detection of malware. The system utilizes a Deep Belief Network (DBN)-based deep learning
model to classify apps as benign or malicious based on the extracted features. The model is
trained with a large dataset of labeled applications, including both benign and malware
samples. The system design must ensure that the model can be updated regularly with new data
to adapt to evolving malware types.
5. Malware Database and API: To enhance the system's capability to identify new threats, it
integrates with external databases or malware repositories. This allows the system to stay
updated with known malware samples and continuously improve its detection accuracy.
Additionally, the system may call upon external APIs to check for previously detected malware
signatures or perform reputation checks on certain behaviors exhibited by the app.
6. Security and Privacy Layer: Given that malware detection inherently deals with analyzing
potentially malicious apps, the system design must ensure secure handling of the uploaded
APK files. This involves ensuring that uploaded files are isolated from the system to avoid
accidental execution or exposure to other vulnerabilities. Additionally, user data and analysis
results should be kept confidential and secure from unauthorized access.
7. Output and Reporting Layer: Once the analysis is complete, the system generates detailed
reports outlining the classification results. These reports may include additional insights such
as the most suspicious permissions or API calls, which could help in further investigation or
debugging. The output can be provided in different formats, such as HTML or PDF, and may
include suggestions for remedial actions.
8. System Integration and Testing: System integration ensures that all modules work together
seamlessly. This involves testing the various components (feature extraction, machine learning
model, user interface) in a controlled environment before deployment. In addition, the design
must incorporate robust testing methods to evaluate the performance, security, and scalability
of the system. Regular testing is essential for identifying and resolving issues early, especially
when dealing with new and emerging threats.
9. Scalability and Adaptability: A critical aspect of the system design is ensuring that it can
scale to handle a large number of APK submissions while maintaining high accuracy and
performance. The system should be designed to accommodate future growth, both in terms of
user traffic and the number of supported malware signatures. Furthermore, adaptability is key
as new malware strains and techniques emerge rapidly. The design must include mechanisms
for regular model updates and new data integration.
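The feature extraction layer above can be illustrated with a small sketch that turns the permissions requested by an app into a fixed-length binary feature vector. It assumes the permission names have already been read from the manifest; the five-entry vocabulary and the function name `permission_vector` are illustrative choices, not the feature set used in this project.

```python
# Illustrative vocabulary: a real system would use a much larger, fixed list.
KNOWN_PERMISSIONS = [
    "android.permission.INTERNET",
    "android.permission.READ_SMS",
    "android.permission.SEND_SMS",
    "android.permission.READ_CONTACTS",
    "android.permission.ACCESS_FINE_LOCATION",
]


def permission_vector(requested, vocabulary=KNOWN_PERMISSIONS):
    """Return one 0/1 entry per vocabulary permission, in vocabulary order."""
    requested = set(requested)
    return [1 if permission in requested else 0 for permission in vocabulary]


app_permissions = [
    "android.permission.INTERNET",
    "android.permission.SEND_SMS",
]
print(permission_vector(app_permissions))  # [1, 0, 1, 0, 0]
```

Because every app maps to a vector of the same length and ordering, the vectors can be stacked into a matrix and passed to the classification layer.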
The system design process is iterative and involves continuously refining each layer based on
feedback from initial tests and ongoing monitoring. As the Android ecosystem evolves and new
malware techniques emerge, the system must be capable of adapting to these changes. This is why
designing a robust, flexible, and secure system is vital for ensuring the long-term success of Android
malware detection.
In conclusion, system design is the backbone of creating a reliable and effective malware detection
solution. It provides the necessary structure and framework to integrate various technologies and
methodologies, ensuring that the final system meets user requirements while remaining adaptable to
future challenges.
As the strategic value of software increases for many companies, the industry looks for techniques
to automate the production of software and to improve quality and reduce cost and time-to-market.
These techniques include component technology, visual programming, patterns and frameworks.
Businesses also seek techniques to manage the complexity of systems as they increase in scope and
scale. In particular, they recognize the need to solve recurring architectural problems, such as physical
distribution, concurrency, replication, security, load balancing and fault tolerance. Additionally, the
development for the World Wide Web, while making some things simpler, has exacerbated these
architectural problems. The Unified Modelling Language (UML) was designed to respond to these
needs. Put simply, system design is the process of defining the architecture, components, modules,
interfaces, and data for a system to satisfy specified requirements, and UML diagrams are a
convenient way to express that design.
Contents of UML
• Relationship Lines that model the relationships between entities in the system.
• Generalization -- a solid line with an arrow that points to a higher abstraction of the
present item.
• Association -- a solid line that represents that one entity uses another entity as part of
its behaviour.
• Dependency -- a dotted line with an arrowhead that shows one entity depends on the
behaviour of another entity.
1) Class Diagram
2) Use Case Diagram
3) Sequence Diagram
4) Activity Diagram
5) Deployment Diagram
UML class diagrams model static class relationships that represent the
fundamental architecture of the system. Note that these diagrams describe the
relationships between classes, not those between specific objects instantiated from those
classes. Thus the diagram applies to all the objects in the system.
A class diagram consists of the following features:
• Classes: These titled boxes represent the classes in the system and contain
information about the name of the class, fields, methods and access specifiers.
Abstract roles of the class in the system can also be indicated.
• Interfaces: These titled boxes represent interfaces in the system and contain
information about the name of the interface and its methods.
• Relationship Lines: These model the relationships between classes and
interfaces in the system.
• Dependency: A dotted line with an open arrowhead that shows one entity
depends on the behavior of another entity. Typical usages are to represent that
one class instantiates another or that it uses the other as an input parameter.
• Aggregation: Represented by an association line with a hollow diamond at the
tail end. An aggregation models the notion that one object uses another object
without "owning" it and thus is not responsible for its creation or destruction.
• Inheritance: A solid line with a hollow triangular arrowhead that points from a
sub-class to a super class or from a sub-interface to its super-interface.
• Implementation: A dotted line with a hollow triangular arrowhead that points
from a class to the interface that it implements.
• Composition: Represented by an association line with a solid diamond at the
tail end. A composition models the notion of one object "owning" another and
thus being responsible for the creation and destruction of another object.
A use case can be thought of as a collection of possible scenarios related to a particular goal; indeed,
the use case and goal are sometimes considered to be synonymous.
The main purpose of a use case diagram is to show what system functions are performed
for which actor.
[Figure: Use case diagram. Actor: User. Use cases: Import Dataset, Feature Extraction, Feature Matching, Malware Recognition.]
A rectangle is drawn around the use cases, called the system boundary box, to
indicate the scope of the system. Anything within the box represents functionality that is in
scope and anything outside the box is not.
Include
In one form of interaction, a given use case may include another. "Include is a
Directed Relationship between two use cases, implying that the behavior of the included
use case is inserted into the behavior of the including use case".
The first use case often depends on the outcome of the included use case. This
is useful for extracting truly common behaviours from multiple use cases into a single
description. The notation is a dashed arrow from the including to the included use case,
with the label "«include»". This usage resembles a macro expansion where the included
use case behavior is placed inline in the base use case behavior. There are no parameters
or return values. To specify the location in a flow of events in which the base use case
includes the behavior of another, you simply write include followed by the name of use
case you want to include, as in the following flow for track order.
Extend
In another form of interaction, a given use case (the extension) may extend
another. This relationship indicates that the behavior of the extension use case may be
inserted in the extended use case under some conditions. The notation is a dashed arrow
from the extension to the extended use case, with the label "«extend»".
Generalization
In the third form of relationship among use cases, a generalization/
specialization relationship exists. A given use case may have common behaviours,
requirements, constraints, and assumptions with a more general use case. In this case,
describe them once, and deal with it in the same way, describing any differences in the
specialized cases. The notation is a solid line ending in a hollow triangle drawn from
the specialized to the more general use case (following the standard generalization
notation).
Associations
Associations between actors and use cases are indicated in use case diagrams by
solid lines. An association exists whenever an actor is involved with an interaction
described by a use case. Associations are modelled as lines connecting use cases and
actors to one another, with an optional arrowhead on one end of the line. The arrowhead
is often used to indicate the direction of the initial invocation of the relationship or to
indicate the primary actor within the use case. The arrowheads imply control flow and
should not be confused with data flow.
• Identifying Actor
Sequence Diagram
Lifeline
A lifeline is drawn as a vertical dashed line with a rectangle at the top. The
rectangle names both the instance and the class. If the lifeline denotes an object, the
name is underlined.
Messages
Interactions are shown as messages. Messages are drawn as horizontal arrows, with
the message text written above the arrow. A solid arrow with a filled head is a
synchronous call. A solid arrow with a stick head is an asynchronous call. Dashed
arrows with stick heads represent return messages.
Objects
Objects can also call methods on themselves, which adds new activation boxes and
shows multiple levels of processing. When an object is destroyed or removed from
memory, an "X" is drawn at the end of its lifeline and the dashed line is not drawn
beneath it. This usually occurs as the result of a message.
A message sent from outside the diagram can be represented by a message that
originates from a filled-in circle. Within a UML based model, a Super step is a
collection of steps which result from outside stimuli.
• Set the stage for the interaction by identifying which objects play a role in interaction.
• Start with the message that initiates the interaction. Visualize the nesting of messages
or the points in time during actual computation.
• Specify time and space constraints, adorn each message with timing mark and attach
suitable time or space constraints.
• Specify the flow of control more formally, attach pre and post conditions to each
message.
[Figure: Sequence diagram. Recoverable labels: Import Data, Train Data Set, Feature Extraction, Feature Matching, Malware Recognition, Store in Database. Input: an unlabelled set of malware data.]
Collaborative Diagram:
[Figure: Collaboration diagram. Recoverable label: Store in Database.]
Activity Diagram
• Activities
• Association
• Conditions
• Constraints
The following are the basic notational elements that can be used to make up a diagram:
Initial state
An initial state represents a default vertex that is the source for a single transition
to the default state of a composite state. There can be at most one initial vertex in a
region. The outgoing transition from the initial vertex may have a behavior, but not a
trigger or guard. It is represented by a filled circle with an arrow pointing to the initial state.
Final state
A special kind of state signifying that the enclosing region is completed. If the
enclosing region is directly contained in a state machine and all other regions in the
state machine also are completed, then it means that the entire state machine is
completed. It is represented by a hollow circle containing a smaller filled circle.
Rounded rectangle
It denotes a state. The top of the rectangle contains the name of the state. It can contain
a horizontal line in the middle, below which the activities performed in that state are
listed.
Arrow
It denotes a transition. The name of the event (if any) causing the transition labels the
arrow body.
• Beginning at the operation’s initial state, specify the activities and actions.
7. SYSTEM IMPLEMENTATION
Implementation is the stage of the project when the theoretical design is turned
into a working system. It can therefore be considered the most critical stage in
achieving a successful new system and in giving the user confidence that the new
system will work and be effective.
7.2 MODULES
1. Android Security.
2. Malware Detection Technique.
Android Security
Malware seriously threatens Android security. An attacker can monitor a user's
information such as messages, contacts, bank mTANs, and locations. Here we survey different
Android malware detection techniques: MalDozer, Droid Detector, Droid Deep Learner and
Deep Flow.
We then discuss different deep learning based Android malware detection
techniques such as MalDozer, Droid Detector, Droid Deep Learner, Deep Flow, Droid Delver and
Droid Deep. We aim to implement a model based on deep learning that can automatically identify
whether an Android application is malware infected or not without installing it.
Malware is malicious code developed to harm a computer or network. The amount
of malware is growing so fast that computer security researchers must keep inventing
new methods to protect computers and networks. There are three main methods used for malware
detection: signature-based, behavior-based, and heuristic. Signature-based malware detection is the
most common method used by commercial antivirus products, but it can be used only in cases that are
completely known and documented. Behavior-based malware detection was introduced to cover the
deficiencies of the signature-based method.
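The limitation of the signature-based method can be seen in a small sketch. It assumes, for illustration only, that a signature is the SHA-256 hash of the whole file; commercial products use richer signatures, and the helper names and sample bytes below are hypothetical.

```python
import hashlib


def sha256_of(data: bytes) -> str:
    """Return the hex SHA-256 digest of the given bytes."""
    return hashlib.sha256(data).hexdigest()


def is_known_malware(data: bytes, known_signatures) -> bool:
    """Signature check: flag the file only if its hash is already in the database."""
    return sha256_of(data) in known_signatures


# Hypothetical sample standing in for a documented malware file.
known_sample = b"example malicious payload"
signatures = {sha256_of(known_sample)}

print(is_known_malware(known_sample, signatures))         # True
print(is_known_malware(known_sample + b"!", signatures))  # False
```

The second call shows why such a check covers only completely known samples: changing a single byte produces a different hash, so a trivially modified variant is no longer matched.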
8. TESTING
The purpose of testing is to uncover errors and ensure that the system functions as expected
under various conditions. Testing is a crucial phase in the software development lifecycle, as it helps
identify faults, weaknesses, or bugs in the system that could compromise its effectiveness, security, or
performance. In essence, testing serves as a means to evaluate whether the software meets its
requirements, adheres to user expectations, and behaves correctly under all anticipated conditions.
Testing is the process of systematically exercising software to ensure that it works as intended
and does not fail in an unacceptable manner. It is a vital step in the software development lifecycle,
enabling developers and stakeholders to identify and correct errors before the software is released to
users. The main goal of testing is to ensure that the system meets its functional and non-functional
requirements, while also providing confidence that it is reliable, secure, and performs as expected.
Testing is not a one-time activity but an ongoing process that takes place throughout the
development life cycle. It helps identify issues at various levels, from individual components to the
entire system, ensuring that the software delivers the expected outcomes without unexpected failures.
Through rigorous testing, developers can guarantee that the system operates as intended under normal
and extreme conditions, providing end users with a stable, secure, and high-performing solution.
• Verification and Validation: Ensuring that the software meets its specifications (verification)
and meets user needs and expectations (validation).
• Fault Detection: Identifying defects or vulnerabilities that could affect the functionality,
security, or usability of the system.
• Reliability: Ensuring that the system performs correctly and consistently, even under stressful
or unforeseen conditions.
• Performance Evaluation: Testing the system's ability to handle large workloads, maintain
speed, and operate efficiently over time.
• User Confidence: Providing evidence that the system is ready for deployment and that it is
robust enough to be trusted by users.
TYPES OF TESTS
Unit Testing
Unit testing involves the design of test cases that validate that the internal program logic is
functioning properly, and that program inputs produce valid outputs. All decision branches and
internal code flow should be validated. It is the testing of individual software units of the application,
and it is done after the completion of an individual unit and before integration. This is structural
testing that relies on knowledge of the unit's construction and is invasive. Unit tests perform basic
tests at component level and test a specific business process, application, and/or system configuration.
Unit tests ensure that each unique path of a business process performs accurately to the documented
specifications and contains clearly defined inputs and expected results.
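As an illustration of validating every decision branch, the sketch below unit tests a small helper that converts a predicted probability into a class label. The helper `to_label` and its 0.5 threshold are assumptions made for this example; they are not the test cases that were run for this project.

```python
import unittest


def to_label(probability, threshold=0.5):
    """Convert a predicted malware probability into a class label."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must lie in [0, 1]")
    return "malicious" if probability > threshold else "benign"


class ToLabelTest(unittest.TestCase):
    def test_above_threshold_is_malicious(self):
        self.assertEqual(to_label(0.9), "malicious")

    def test_at_or_below_threshold_is_benign(self):
        self.assertEqual(to_label(0.5), "benign")
        self.assertEqual(to_label(0.1), "benign")

    def test_out_of_range_raises(self):
        with self.assertRaises(ValueError):
            to_label(1.5)


# Run the test case explicitly so the script keeps control after the run.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(ToLabelTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

Each test method exercises one branch of the helper: the malicious branch, the benign branch (including the boundary value), and the input-validation branch.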
Integration Testing
Integration testing is event-driven, meaning it typically focuses on the behavior of the system in
response to user actions or system events. It involves checking the flow of data between modules and
the proper functioning of interconnected components. Since integration testing typically occurs after
unit testing, where each individual component is verified in isolation, this phase helps to validate that
the components will work together as expected in a real-world scenario.
• To verify that multiple components, which were unit tested individually, now work together
when integrated.
• To identify issues that may arise due to the interactions between modules, such as incorrect
data exchange, miscommunication between services, or failures when components interact in
different environments.
• To ensure that the system works as a whole, meeting the functional and non-functional
requirements.
1. Interaction Between Modules: Ensuring that modules communicate correctly, share data, and
provide expected outputs when integrated.
2. Consistency: Verifying that the combined components work consistently and follow a
predictable behavior pattern.
3. End-to-End Functionality: Testing workflows or use cases that span across multiple
components or modules, ensuring that complex functionality works as expected.
4. Error Handling: Ensuring that errors or failures in one module are correctly handled by other
modules, maintaining system stability.
Functional test
Functional tests provide systematic demonstrations that functions tested are available as specified by
the business and technical requirements, system documentation, and user manuals.
Functional testing is centered on the following items:
System Test
System testing ensures that the entire integrated software system meets requirements. It tests a
configuration to ensure known and predictable results. An example of system testing is the
configuration oriented system integration test. System testing is based on process descriptions and
flows, emphasizing pre-driven process links and integration points.
• Functional Testing: Verifying that the system performs all the functions it is expected to do,
based on the user requirements.
• Non-functional Testing: This includes testing for performance, security, usability, and
reliability.
• Configuration Testing: Ensuring that the system works correctly across different
environments and configurations (e.g., different operating systems, hardware setups).
• End-to-End Testing: Validating that the system works end-to-end, including the interaction
of various subsystems and external systems.
• Code Coverage: Ensuring that all the code paths are tested, including conditional statements, loops,
and branches.
• Path Testing: Verifying the different execution paths through the software.
• Unit-Level Application: White box techniques are often used in conjunction with unit testing to
validate the correctness of individual functions and methods.
• Focus on Inputs and Outputs: Testers provide inputs based on requirements or user stories
and check if the system produces the correct outputs.
• No Knowledge of Internal Code: Testers are not concerned with the internal code structure
or logic; they focus purely on the functionality.
• Specification-based Testing: Black box tests are written based on the software's
specifications, requirements, or use cases, ensuring that the software meets the defined
functional requirements.
Test objectives
• All field entries must work properly.
Features to be tested
• Verify that the entries are of the correct format
User Acceptance Testing is a critical phase of any project and requires significant participation by the
end user. It also ensures that the system meets the functional requirements.
Test Results: All the test cases mentioned above passed successfully. No defects encountered.
import time

import matplotlib.pyplot as plt
import pandas as pd
from sklearn.metrics import auc, confusion_matrix, roc_curve
from sklearn.model_selection import train_test_split

start_time = time.time()
# Load dataset
dataset = pd.read_csv("/content/Malware dataset.csv")
# The feature matrix X, the label vector y, and the trained Keras model
# (compiled with an accuracy metric) come from steps omitted in this listing.
# Train/test split: hold out 20% of the samples for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
# Evaluate on the held-out test set so the score reflects unseen data
y_proba = model.predict(X_test).ravel()  # predicted probability of the positive class
y_pred = y_proba > 0.5  # threshold the probabilities into class labels
cm = confusion_matrix(y_test, y_pred)
scores = model.evaluate(X_test, y_test)  # [loss, accuracy]
print("\nAccuracy: %.2f%%" % (scores[1] * 100))
# ROC curve
fpr, tpr, thresholds = roc_curve(y_test, y_proba)
roc_auc = auc(fpr, tpr)
plt.figure()
plt.plot([0, 1], [0, 1], 'k--')  # chance-level diagonal
plt.plot(fpr, tpr, label='ANN (AUC = {:.3f})'.format(roc_auc))
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.legend(loc='best')
plt.show()
# Save model architecture (JSON) and weights
model_json = model.to_json()
with open("SmartAM.json", "w") as json_file:
    json_file.write(model_json)
model.save_weights("SmartAM_weights.weights.h5")  # recent Keras versions expect the '.weights.h5' suffix
print("Model saved successfully.")
end_time = time.time()
print(f"Execution time: {round(end_time - start_time, 2)} seconds")
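To show how the confusion matrix computed above relates to the reported accuracy, the following sketch derives accuracy, precision, recall and F1 from the four cells of a binary confusion matrix. For binary labels, scikit-learn's `confusion_matrix` lays the cells out as [[TN, FP], [FN, TP]]. The counts used here are made-up illustrative numbers, not results from this project, and `binary_metrics` is a hypothetical helper.

```python
def binary_metrics(tn, fp, fn, tp):
    """Derive standard metrics from the cells of a binary confusion matrix."""
    total = tn + fp + fn + tp
    accuracy = (tp + tn) / total
    # Guard against division by zero when a class is never predicted or never present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Illustrative counts only (assumption): 200 test apps, half of them malware.
metrics = binary_metrics(tn=90, fp=10, fn=5, tp=95)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

For malware detection, recall indicates how many malicious apps were caught, while precision indicates how many flagged apps were truly malicious, so both are worth reporting alongside accuracy.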
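import os

from flask import flash, redirect, request
from werkzeug.utils import secure_filename

# app, algorithms, and classifier are defined in parts of the application not shown here.
# Create the upload directory on first run
if not os.path.exists(app.config['UPLOAD_FOLDER']):
    os.makedirs(app.config['UPLOAD_FOLDER'])
    print("Directory created")
else:
    print("Directory exists")


# The route path and function name below are assumptions; the original listing omits them.
@app.route("/", methods=["GET", "POST"])
def upload_file():
    if request.method == "POST":
        if 'file' not in request.files:
            flash('No file part')
            return redirect(request.url)
        file = request.files['file']
        if file.filename == '':
            flash('No selected file')
            return redirect(request.url)
        if file and file.filename.endswith('.apk'):
            # Sanitize the client-supplied name before using it in a path
            filename = secure_filename(file.filename)
            print(filename)
            filepath = os.path.join(app.config['UPLOAD_FOLDER'], filename)
            file.save(filepath)
            if request.form['algorithm'] == 'KNN':
                accuracy = algorithms['KNN']
                result, name, sdk, size = classifier.classify(filepath, 0)
            elif request.form['algorithm'] == 'Support Vector Classifier':
                accuracy = algorithms['Support Vector Classifier']
                result, name, sdk, size = classifier.classify(filepath, 1)
            # Rendering of the result page is not shown in this listing.


if __name__ == "__main__":
    app.run(debug=True)  # debug mode is intended for local development only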
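The upload handler above accepts a file based only on its `.apk` extension. A stricter check is sketched below as one possible addition, not as part of the implemented system. It relies on the fact that an APK is a ZIP archive containing an `AndroidManifest.xml` entry; the function names `looks_like_apk` and `make_zip` are hypothetical.

```python
import io
import zipfile


def looks_like_apk(data: bytes) -> bool:
    """Return True if the bytes form a ZIP archive that contains AndroidManifest.xml."""
    buffer = io.BytesIO(data)
    if not zipfile.is_zipfile(buffer):
        return False
    buffer.seek(0)
    with zipfile.ZipFile(buffer) as archive:
        return "AndroidManifest.xml" in archive.namelist()


def make_zip(names):
    """Build an in-memory ZIP with empty entries, used here only to exercise the check."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for name in names:
            archive.writestr(name, b"")
    return buffer.getvalue()


print(looks_like_apk(make_zip(["AndroidManifest.xml", "classes.dex"])))  # True
print(looks_like_apk(make_zip(["readme.txt"])))                          # False
print(looks_like_apk(b"this is plain text, not a zip archive at all"))   # False
```

The check only inspects the archive listing and never extracts or executes the uploaded content, which is in line with the isolation requirement of the Security and Privacy Layer.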
11.1 CONCLUSION
In this paper we have discussed different Android malware detection techniques that use
various deep learning methods. Because of the open nature of Android, countless malware samples are
hidden among the large number of benign apps in Android markets. This malware seriously threatens
Android security. An attacker can monitor a user's information such as messages, contacts, bank
mTANs, and locations.
We surveyed different Android malware detection techniques: MalDozer, Droid
Detector, Droid Deep Learner and Deep Flow. MalDozer uses a convolutional neural network for
malware detection. It relies on static analysis and uses API method calls as features to detect whether
an application is malware infected or not.
Droid Detector uses a Deep Belief Network for detection. It combines static and
dynamic analysis with features such as permissions, APIs, and dynamic behavior.
Droid Deep Learner also uses a Deep Belief Network for malware detection, together with
a static analysis method and features such as permissions and APIs.
Deep Flow also uses a Deep Belief Network with the static analysis method, and
uses API method calls as features for Android malware detection. However, all of these methods
work only after the application has been installed on a device or uploaded to their model. To overcome
this problem, we are trying to implement a deep learning model that can automatically identify whether
an application is malicious or not before installation.