IoT-based Green City Architecture Using Secured and Sustainable Android Services


Environmental Technology & Innovation 20 (2020) 101091

Contents lists available at ScienceDirect

Environmental Technology & Innovation


journal homepage: www.elsevier.com/locate/eti

IoT-based green city architecture using secured and sustainable android services

Farhan Ullah a,∗, Fadi Al-Turjman b, Anand Nayyar c,d
a School of Software, Northwestern Polytechnical University, Beilin District, Xi’an Shaanxi, 710072, P.R. China
b Artificial Intelligence Engineering Department, Research Centre for AI and IoT, Near East University, Nicosia, Mersin 10, Turkey
c Graduate School, Duy Tan University, Da Nang 550000, Viet Nam
d Faculty of Information Technology, Duy Tan University, Da Nang 550000, Viet Nam

Article info

Article history:
Received 14 May 2020
Received in revised form 15 July 2020
Accepted 4 August 2020
Available online 7 August 2020

Keywords:
Green city
Internet of Things
Mobile applications
Abstract syntax tree
Deep learning

Abstract

Green and smart cities deliver services to their residents using mobile applications that
make daily life more convenient. The privacy and security of these applications are
significant in providing sustainable services in a green city. Software cloning is
a severe threat that may breach the security and privacy of android applications.
A centrally controlled and automated screening system across multiple app stores
is needed to prevent the release of copyrighted or cloned copies of these apps. In this
paper, we propose an IoT-enabled green city architecture for clone detection in android
markets using a deep learning approach. First, the proposed system obtains an original
APK file together with potential candidate cloned APKs via the cloud network. For each
subject software, the system uses an APK Extractor tool to retrieve Dalvik Executable
(DEX) files. The Jdex decompiler is used to retrieve Java source files from the Dalvik
Executables. Second, the AST features are extracted using the ANother Tool for Language
Recognition (ANTLR) parser. Third, linear features are mined from these hierarchical
structures, and Term Frequency Inverse Document Frequency (TFIDF) is applied to
estimate the significance of each feature. Finally, the deep learning model is configured
to detect cloned apps and is fine-tuned for better accuracy.
The proposed approach is analyzed on five different cloned applications collected from
different android markets. The main objective of this system is to prevent the release
of pirated apps with various pirated labels in multiple app markets.
© 2020 Elsevier B.V. All rights reserved.

1. Introduction

The Internet of Things (IoT) (Butun et al., 2019; Mbarek et al., 2020; Paul et al., 2016) is a network of sensors,
actuators, and moving objects that provides real-time connectivity. IoT combined with Artificial Intelligence (AI) is
therefore a vital technical direction that opens the door to new perspectives, enabling smart homes, smart hospitals,
smart cities, intelligent vehicles, and intelligent wearables in green and smart cities. In a green city, all kinds of
user data are collected on mobile devices and must be kept secure. The Android-based smartphone is the most commonly used
mobile device and is the core of all smart technologies. However, an Android-based mobile device may not handle
confidential user data responsibly and may face privacy breaches caused by clone attacks. Android apps can be designed and
installed to enjoy the services of a green city through mobile services. These applications are mostly available on some Android stores,

∗ Corresponding author.
E-mail addresses: [email protected] (F. Ullah), [email protected] (F. Al-Turjman), [email protected] (A. Nayyar).

https://doi.org/10.1016/j.eti.2020.101091
2352-1864/© 2020 Elsevier B.V. All rights reserved.

Fig. 1. Cloned applications. (a-b-c): APK Pure (com.jiuzhangtech.hangman) (d): Google Market (br.com.passeionaweb).

which can be easily downloaded and installed. Thus, such applications can be easily attacked by software cloning to take
control of these smart services (Li et al., 2015; Sodhro et al., 2019; Ojala and Ferm, 2015; Rathore et al., 2016). Cell
phone sales have grown over the last few years. Android has a marketplace for sharing apps, and its mobile
phone sales recently reached 850,000 activations each day. Even though the Android OS offers the predominant cell phone
experience, a substantial portion of the customer experience relies on third-party apps. Android has several marketplaces
where customers can download third-party apps that provide easy access to social networking and games. There is a
need to protect individuals from scammers who want to reap the benefits of a genuine developer's hard work. Android apps
can be published through the authorized Android store or through other third-party platforms (Kang et al., 2014; Braun
et al., 2018).
It is essential to provide a trusted market environment in which programmers can produce consistent applications. A paid
app may be cracked and published, and a free app may be cloned and re-released, allowing the scammer to earn
advertising revenue. A scammer may modify an application’s current catalog, delete the developer’s client ID, and install
a new revenue-providing library. The open-source Android platform allows scammers to clone applications and
republish them to stores. In contrast with the Android markets, apps in Apple’s App Store must undergo an approval
process. Recently, the Android market introduced a service that scans new apps (footnote 1). Both programmers and
academic researchers have reported the cloning of android applications.
Fig. 1 shows the same app with different Graphical User Interfaces (GUIs). These apps are published in different app
stores. The cracker changed the GUI of the cloned application in order to publish it under different credentials. Therefore,
it is possible to clone the same application just by modifying the GUI. The finding suggests that the applications’ source
codes overlap with each other, implying that at least three of the apps are duplicates (Zhou et al., 2012b; Rathore et al.,
2017). Developers can release their apps in android markets such as SlideMe (footnote 2) and GoApk (footnote 3) at a cost
of just $25. Contrary to Apple’s App Store, cloning, modifying, and redistributing apps in Android markets is very easy.
It is, therefore, necessary to secure the copyrights and the revenue streams of entrepreneurs (Su et al., 2012; Zhou et al.,
2012a; Lorimer et al., 2018). Android applications are distributed as packages, which include images and XML files. An XML
manifest file declares many app attributes, including its title, version, code namespace, and permissions. Android programs
are originally written in Java source code. The Java code is compiled to byte code and afterwards translated to DEX. The
DEX byte code generally runs in the Dalvik virtual machine. DEX byte code can be converted back into Java byte
code using reverse engineering tools from third-party vendors.
This study presents a reliable methodology for identifying android app clones across various android stores.
To explore clones of android apps in several markets, we have developed an IoT-based clone detection
approach using AST features and a deep learning model. The AST features let us analyze
the abstract view of each program for an efficient clone detection system. The main contributions are as follows:

(1) We proposed a sustainable green and smart city architecture for secured mobile services.

1 http://googlemobile.blogspot.com/2012/02/android-and-security.html.
2 https://www.appbrain.com/stats/number-of-android-apps.
3 http://slideme.org.

(2) The AST hierarchical features are generated from android applications.
(3) A deep learning model is configured for efficient classification of cloned features.

The remainder of the paper is organized as follows: Section 2 reviews the relevant literature, Section 3 discusses the
proposed methodology, Section 4 presents the detailed experiments, and finally, Section 5 concludes the paper.

2. Literature review

We explain the pros and cons of several methods for identifying cloned code statically, and we conclude with
the suggested approach. For instance, a feature-based approach (Wang et al., 2015) examines an application and extracts
a range of features. The variability of the features in a package depends on its classes, methods, variables, and loops.
This strategy has a weakness: it may exclude valuable information about the software’s structure. Feature-based
methods are also highly susceptible to false positives. Structure-based methods first tokenize the code into streams
and afterwards compare the streams of two apps. They convert the software into a stream of tokens and disregard its
statements, function names, and white space. They recognize clones more robustly than feature-based
methods. Comparing streams of DEX byte code to identify exactly copied code is very fast. Unfortunately, byte
code streams cannot offer semantic-level knowledge of the code. Thus, this technique is vulnerable to technical changes. For
instance, Winnowing (Schleimer et al., 2003) attempts to catch modified clones based on k-grams. If
there are few differences between the two programs, the comparison finds many identical token streams of length k.
Clone identification based on the program dependence graph (PDG) (Joshi and Khanna, 2013; Ghafir et al., 2018) is
more rigorous than structure-based methods. If the cloned portions of the code behave just like their original
versions, they should have the same dependencies between input and output variables. In contrast to
structure-based methods, these dependencies do not change even after significant disguising modifications have
been applied to the plagiarized code.
Many researchers have recently applied the structure-based methodology to classify clones of Android applications.
DroidMOSS (Liu et al., 2006) first computes fuzzy hashes for each process in the APK directory
and afterwards combines them to compare the fuzzy hashes of APKs and discover similarities based on the specific
hashes. This approach can be vulnerable to evasion methods. ComDroid (Chin et al., 2011) identified bugs in inter-
application interaction through static analysis. Wang et al. (2015) presented a two-phase clone detection strategy to
classify suspected files. The first phase evaluates weighted static features focused on semantics, and
the second phase compares the detailed features used in the identified scripts. Furthermore, they used automated
clustering as a preprocessing step to group clone programs. Crussell et al. (2013) suggested the AnDarwin approach,
which works on two scenarios used among different developers to identify identical clones. The intended algorithm
forecasted a corpus that contained 88 malicious files. This method can be used to determine related data files within an app
cluster. Li et al. (2017) focused on third-party libraries with their proposed tool, LibD. It relies on hash algorithms
and is tested on a large dataset. This approach recognizes individual code dependencies in android
programs, and these features are then used to detect third-party modules. The experiments demonstrate that LibD can
manage third-party applications with multiple packages. Assessment of user interfaces has also been used to classify
clones in the Android market (Soh et al., 2015). This mechanism runs an app to obtain information from consumers
in a dynamic environment. Specific feedback is easy to acquire from users worldwide. In addition, it primarily targets
semantic knowledge and is not affected by obfuscation strategies that do not influence run-time behavior. The developed
method provided low false positive and false negative rates on public datasets. Nichols et al. (2019) adapted the
optimal Smith–Waterman sequence alignment algorithm to measure plagiarism between source codes. This approach
applies to any form of source code for which an ANTLR grammar is available. The source code is first parsed, and the
parse tree is then linearized. This linear structure is further preprocessed into a sequence that the algorithm
can process. Afterwards, the Smith–Waterman algorithm is applied to calculate the similarity score.
The studies listed above did not use semantic information effectively to identify clones. AST features
tend to be more stable with respect to code structure and are less resource- and time-consuming. In our case, every android
app contains hundreds of source code documents; thus, it is cost-intensive to mine the dependencies of each execution path
in the code. As a result, AST is more effective in identifying cloned components. Our proposed work focuses on android
clones that have been released in different android markets. The abstract structure of each clone is captured for further
processing. The deep learning model can effectively process the large corpus of AST features for clone
detection.

3. Proposed scheme: Sustainable green city architecture for secured mobile services

A smart and green city is a society with basic infrastructure to provide people with a decent standard of living, a healthy
and sustainable environment through smart enterprise solutions. This is a better way to enforce governance and offer
community services to the residents for improving infrastructure. Fig. 2 shows the IoT-based green city architecture

Fig. 2. IoT-based green and smart city architecture for social interaction using mobile devices.

for the social interaction of users. The green city services are offered through installed servers, and users can enjoy these
facilities using their mobile phones (Kaur et al., 2018; Rathore et al., 2018b). Users are free to move and can enjoy
the smart services using their Android-based IoT devices. A remote gateway is configured for a specific area and
provides seamless connectivity to its local departments, such as hospitals, schools, parks, and houses. Several gateways
are interconnected to offer different services to the whole green city. IoT-based devices may use android applications
to connect with specialized servers. A single android application serves an individual service such as smart health care
or smart education (Rathore et al., 2018a). Thus, one IoT device may be configured with many android applications, depending
on the services offered by the green city. These android applications can be attacked by software cloning to disrupt smart
services. The green city administration must, therefore, take wise steps to protect these android applications in order to
provide sustainable services to users. There are no appropriate checkpoints designed to test for cloned applications across
different Android device markets. A centralized Android device testing framework is required that can identify cloned
applications across multiple Android stores. These devices are attached via cloud services, which are also linked to the
Android store, as shown in Fig. 3.

3.1. Application extractor

The clone detection system starts by choosing the candidate Android applications to examine.
The ‘‘APK Extractor’’ is then called to retrieve DEX files from an Android app. First, the
app markets are linked via cloud technology. A scammer submits an APK for distribution in the app market. To
perform clone detection on Android apps, we have to obtain the .java documents in order to examine code replicas among
analogous code for the application. The Jdex decompiler is used to process the .dex files retrieved from the APKs. The .dex
files are executables for the Android OS that comprise compiled Java code. After retrieving the compiled documents via DEX,
the Jdex decompiler is once again used to turn the Java classes into the actual source code needed for AST features. After
extracting the code archives of each category and the corresponding cloned programs, we cleaned the source code of
noisy comments (Ullah et al., 2019; Harrand et al., 2019).
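As a rough illustration of the first extraction step, an APK is a ZIP archive, so its DEX files can be located and read with a standard ZIP library. The sketch below is not the APK Extractor tool used in this work; the function names `extract_dex_files` and `build_demo_apk` are hypothetical, the demo archive only stands in for a real APK, and decompiling the DEX bytes to Java source is left to an external decompiler.

```python
import io
import zipfile


def extract_dex_files(apk_bytes: bytes) -> dict[str, bytes]:
    """Return {archive name: raw bytes} for every .dex entry in an APK.

    An APK is a ZIP archive; compiled code is stored in entries such as
    classes.dex at the archive root.
    """
    with zipfile.ZipFile(io.BytesIO(apk_bytes)) as apk:
        return {
            name: apk.read(name)
            for name in apk.namelist()
            if name.endswith(".dex")
        }


def build_demo_apk() -> bytes:
    """Build a tiny in-memory archive that stands in for a real APK (demo only)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as apk:
        apk.writestr("AndroidManifest.xml", "<manifest/>")
        apk.writestr("classes.dex", b"dex\n035\x00")  # placeholder bytes, not valid DEX
        apk.writestr("res/layout/main.xml", "<LinearLayout/>")
    return buffer.getvalue()


if __name__ == "__main__":
    dex_files = extract_dex_files(build_demo_apk())
    print(sorted(dex_files))
```

In a real pipeline the returned bytes would be written to disk and handed to the decompiler; that step depends on the chosen tool and is not shown here.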

3.2. Abstract syntax tree

An AST is a tree representation of the syntactic structure of source code in which each node represents a
construct that occurs in the code. It can provide a summary of the substance of each method and statement
in the source code. It does not create a separate node for the parentheses used in the condition of an if statement;
instead, it creates a single node for that form of expression. It then builds sub-nodes for sub-statements, e.g., if–else
statements. The first node shows the if condition, and the sub-nodes represent the contents of the parent node. The source
code sequence of each statement can be shown. Thus, the AST can easily recognize code-reordering plagiarism attacks along
with other forms of plagiarism. The node contains the function or expression, and the edge contains the inner statement or

Fig. 3. Android-based clone detection system using deep learning and AST.

blocks of statements. This hierarchy shows the inner view of the code in a tree-like structure. Therefore, it is easy
to determine which statements are written within which function or conditional statement. Two applications with the
same hierarchy of statements may indicate that the programs are clones of each other. Hence, high-quality AST-based
features can help us prevent clone attacks in android markets. Other methods, such as DECKARD (Jiang
et al., 2007) and Wahler et al. (2004), use vectors and XML to store AST information. Although these tools work
well for finding similar patterns in similar source codes, they still process the AST in a hierarchically structured way.
Therefore, traversing each path of the AST across a large corpus of source code is time-consuming. We have therefore
converted the vast number of AST features into linear ones in order to identify similar patterns more effectively.
ANTLR (Parr, 2013; Pan and Kang, 2011) is an efficient parser generator for reading, processing, executing, and
translating structured source code. It is typically used to construct parsers, which can be used to traverse syntax trees.
ANTLR can produce lexers, parsers, tree parsers, and combined lexer-parsers. Parsers can automatically create parse trees or
abstract syntax trees that can be further handled by tree parsers. ANTLR gives a single coherent notation for defining
lexers, parsers, and tree parsers. By default, ANTLR reads a grammar and produces a recognizer for the language specified
by the grammar. Fig. 4 shows the ANTLR working components with AST.
The recognizer accepts a stream of characters as input and emits output. For example, a source code statement entered by
the user is broken into meaningful tokens by the lexer, e.g., 25 and 45 are numbers and + is a sum operator. Then, the
parser uses these tokens, together with their metadata, to recognize the structure of the statement. ANTLR generates
different files to build a parser for different types of source code, as shown below:

• ArrayInitParser(): It contains the parser class definition according to the grammar used.
• ArrayInitLexer(): It contains the lexer class definition.
• ArrayInit.tokens, ArrayInitLexer.tokens: These are internal ANTLR files. They support the token
dictionary with corresponding identifiers.

Fig. 4. ANTLR working components with AST.

• ArrayInitListener(): This interface is used for walking through the AST for further analysis.
• ArrayInitBaseListener(): It is a base listener class used to initialize the listener.
• ArrayInitVisitor(): This interface is used to walk through the AST using a visitor design pattern.
• ArrayInitBaseVisitor(): It is a base visitor class used to initialize the visitor.
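The lexer stage described above (25 and 45 are numbers, + is a sum) can be illustrated with a minimal hand-written tokenizer. This is only a sketch using Python's `re` module, not ANTLR-generated code; the token names `NUMBER`, `PLUS`, and `SKIP` and the function `tokenize` are assumptions chosen for illustration.

```python
import re

# Token rules for a toy arithmetic language (names are illustrative only).
TOKEN_SPEC = [
    ("NUMBER", r"\d+"),
    ("PLUS", r"\+"),
    ("SKIP", r"\s+"),
]
TOKEN_RE = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))


def tokenize(text: str) -> list[tuple[str, str]]:
    """Split `text` into (token type, lexeme) pairs, dropping white space."""
    tokens = []
    position = 0
    while position < len(text):
        match = TOKEN_RE.match(text, position)
        if match is None:
            raise ValueError(f"unexpected character {text[position]!r} at {position}")
        if match.lastgroup != "SKIP":
            tokens.append((match.lastgroup, match.group()))
        position = match.end()
    return tokens


if __name__ == "__main__":
    print(tokenize("25 + 45"))
```

A parser would then consume this token sequence to recognize the statement structure, which is the role the generated parser class plays in ANTLR.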

Fig. 5 shows the ASTs of cloned applications for adding two numbers. The two programs differ in variable names and
output functions, yet ANTLR generates the same AST features for both. These features are significant for the
classification of cloned scripts.
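The observation that renaming variables and output functions leaves the tree shape unchanged can be reproduced on a small scale. The sketch below uses Python's built-in `ast` module on Python snippets as a stand-in for ANTLR on Java sources, so it illustrates the idea rather than the paper's toolchain; the function `linearize` and the three sample programs are hypothetical.

```python
import ast


def linearize(source: str) -> list[str]:
    """Return AST node type names in depth-first, pre-order sequence."""

    def visit(node: ast.AST) -> list[str]:
        names = [type(node).__name__]
        for child in ast.iter_child_nodes(node):
            names.extend(visit(child))
        return names

    return visit(ast.parse(source))


# Two programs that add two numbers; only identifiers and the output call differ.
ORIGINAL = "a = 25\nb = 45\nprint(a + b)\n"
CLONE = "first = 25\nsecond = 45\nshow(first + second)\n"
# A structurally different program for contrast.
DIFFERENT = "a = 25\nif a:\n    print(a)\n"

if __name__ == "__main__":
    print(linearize(ORIGINAL) == linearize(CLONE))
    print(linearize(ORIGINAL) == linearize(DIFFERENT))
```

Because only node types are recorded, identifier names do not influence the linear sequence, which is the property the clone detector relies on.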

3.3. Linear features extractions and weighting

Comparing ASTs in a hierarchical structure is time- and resource-intensive, so we have transformed the ASTs into linear
features for efficient analysis. The main objective is to produce a collection of indexes that can achieve the best results
using information retrieval methods. Preprocessing methods, along with a bag-of-words model, are used to split the ASTs
into nodes without affecting the actual sequence. Preprocessing steps include stop word removal, stemming, minimum and
maximum frequency settings, etc. Stop words are terms that may not be relevant to the meaning of the source code.
They carry little information and may negatively affect the detection of plagiarism
in the source code. Examples include integer, char, string, class, and static. These words have no useful connection with
the code concept.
This yields a large set of feature frequencies in linear form. At this point, it is not clear which features are more
relevant for the clone detection system. We used the TFIDF model (Yokoi et al., 2018; Ke et al., 2018; Kinawy et al., 2018)
for local and global weighting, that is, to calculate the node weights locally as well as globally. TFIDF is a statistical
model that reflects how important a feature is to a single file in a group of files. It has two parts: TF for the
local weight and IDF for the global weight. Local and global here mean that the importance of each feature is computed
within a single file and across multiple files, respectively. Thus, we can magnify the significance of each feature and
remove the noisy data. TFIDF is used to weight each node and edge of the AST in order to analyze the significance of each
instance. After that, we can pass the most effective features to the deep learning classifier. Let T denote
an AST, s a subtree of T, and w_{s,T} the weight of s in T. The overall estimated weight of a node is the product of TF and
IDF, as defined in Eq. (1).

$$ w_{s,T} = \mathrm{TF}(s, T) \cdot \mathrm{IDF}(s, T) \tag{1} $$

Mathematically, TF(s, T) and IDF(s, T) are defined in Eqs. (2) and (3).

$$ \mathrm{TF}(s, T) = \frac{\mathrm{cnt}(s, T)}{n(T)} \tag{2} $$

where cnt(s, T) is the number of occurrences of the subtree or node s in T, such that s ∈ S_T, and n(T) is the number of
subtrees in the selected portion.

$$ \mathrm{IDF}(s, T) = \log \frac{N}{c(s)} \tag{3} $$

where c(s) is the number of abstract syntax trees that contain s, and N denotes the total number of abstract syntax
trees generated from the collection of source code documents. Nodes with similar TFIDF values can play a significant role
in source code similarity. On the other hand, a node that has a larger weight value in one document but a zero or lower
value in another has

Fig. 5. An example of abstract syntax for cloned applications.

a low impact on the effectiveness of the proposed system. These abrupt numbers can affect the overall classification
accuracy negatively. We are more interested in nodes with similar or close TFIDF values. Thus, the TFIDF model can play
a significant role in magnifying the AST nodes with their importance for the proposed system (Din et al., 2019).
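The weighting in Eqs. (1)–(3) can be sketched directly in a few lines. The example below treats each AST as a flat list of node labels and assumes the natural logarithm for IDF (the base is not stated in the text); the function `tfidf_weights` and the sample label lists are hypothetical and are not taken from the dataset.

```python
import math
from collections import Counter


def tfidf_weights(documents: list[list[str]]) -> list[dict[str, float]]:
    """Compute w = TF * IDF for every feature of every linearized AST.

    TF(s, T)  = cnt(s, T) / n(T)      (Eq. 2)
    IDF(s, T) = log(N / c(s))         (Eq. 3, natural log assumed)
    """
    total_docs = len(documents)
    # c(s): number of documents (ASTs) in which feature s occurs at least once.
    doc_frequency: Counter = Counter()
    for features in documents:
        doc_frequency.update(set(features))

    weights = []
    for features in documents:
        counts = Counter(features)
        size = len(features)
        weights.append({
            feature: (count / size) * math.log(total_docs / doc_frequency[feature])
            for feature, count in counts.items()
        })
    return weights


if __name__ == "__main__":
    asts = [
        ["If", "Call", "Call", "Return"],       # original
        ["If", "Call", "Call", "Return"],       # clone with the same structure
        ["While", "Assign", "Call", "Return"],  # unrelated file
    ]
    for row in tfidf_weights(asts):
        print(row)
```

Features that occur in every AST receive weight zero, while features shared only by the original and its clone receive identical non-zero weights in both, which mirrors the matching values the text describes.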

3.4. Deep learning model

The TensorFlow-based deep learning model is configured to detect cloned applications from the weighted AST
features. TensorFlow is an open-source library designed for deep learning tasks across heterogeneous applications.
Many studies have indicated that TensorFlow (Abadi et al., 2016a,b) is well suited to working
with machine learning and deep neural networks in data analysis. It has a strong visualization capability by

Table 1
Detailed analysis of the dataset.
Android app     Files   Cloned files   Lines before cleaning   Lines after cleaning   Edges   Nodes
ES File           400            250                    5750                   2044    2323    1584
Shadow socks      350            235                    4624                   1676    1925    1316
Hangman            46             43                    1042                    362     395     274
Hangman 2          84             57                    1456                    475     518     353
One Clnr.         137             65                    1535                    536     594     395
Total            1017            650                  14,407                   5093    5755    3922

which we can view the deep learning model's performance, including the loss-per-epoch graph, accuracy and error estimates,
and so on. It can be executed at a wide range of scales, from mobile devices to complex computing frameworks.
The sequential model is configured to identify identical clones across multiple android markets. We designed a total of
seven layers: one input layer, one output layer, and five hidden layers. The numbers of neurons in the layers are
80, 50, 40, 30, 20, 10, and 5, respectively, with a 20% dropout. The dropout layer is used to handle the overfitting problem.
The ‘fit’ method is used to train the model. It takes five parameters: training data (x_train), target data (y_train),
validation split, number of epochs, and batch size. The validation split divides the data randomly into training
and testing portions of 80% and 20%, respectively. During the training phase, we can monitor the validation loss, which
is the mean squared error of the model on the validation set. The number of epochs is the number of
times the model cycles through the data. The more epochs we run, the more the model improves,
up to a certain level. After that level, the model behaves the same for the remaining epochs. The Rectified Linear Unit
(ReLU) (Agostinelli et al., 2014; Din and Paul, 2019) activation function is applied in the input and hidden layers. The
output layer is configured with the softmax function. Adaptive Moment Estimation (Adam) (Wang et al., 2020) is applied for
optimization. It maintains running averages of the gradients and of the squared gradients over the training iterations
and uses them to update the network weights iteratively. It computes a separate adaptive learning
rate for every parameter in the model. The decaying averages of past gradients and past squared gradients are shown in
Eqs. (4) and (5).

$$ m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t \tag{4} $$
$$ v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2 \tag{5} $$

where m_t and v_t are the estimates of the first and second moments of the gradients, respectively, and g_t denotes the
gradient at step t. There are 750 parameters trained on layer 1, 15100 parameters on layer 2, 5050 on layer 3,
5049 on layer 4, and so on. A total of 25,949 parameters are trained in the experiments. The dropout
layer, the number of neurons and layers, and the learning error rates are tuned to address the
overfitting problem.
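To make Eqs. (4) and (5) concrete, the sketch below performs one Adam update for a single scalar parameter. The two moment updates follow the equations above; the bias correction, the final update rule, and the default hyperparameters (`lr`, `beta1`, `beta2`, `eps`) come from the standard formulation of Adam and are assumptions here, since the text does not state the values used in the experiments. The function name `adam_step` is hypothetical.

```python
import math


def adam_step(param, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """Apply one Adam update to a scalar parameter.

    `t` is the 1-based step counter; `m` and `v` are the running moment estimates.
    Returns the updated (param, m, v).
    """
    m = beta1 * m + (1 - beta1) * grad       # Eq. (4): decaying mean of gradients
    v = beta2 * v + (1 - beta2) * grad ** 2  # Eq. (5): decaying mean of squared gradients
    m_hat = m / (1 - beta1 ** t)             # bias correction (standard Adam, assumed)
    v_hat = v / (1 - beta2 ** t)
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v


if __name__ == "__main__":
    # Minimize f(x) = x^2, whose gradient is 2x, starting from x = 1.0.
    x, m, v = 1.0, 0.0, 0.0
    for step in range(1, 6):
        x, m, v = adam_step(x, 2 * x, m, v, step)
        print(step, x)
```

Dividing by the square root of the second-moment estimate gives each parameter its own effective step size, which is the per-parameter adaptive learning rate mentioned in the text.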

4. Results and discussions

4.1. Data analysis

The dataset is built from five cloned applications, i.e., ES File Downloader, Hangman, Hangman 2, One
Cleaner, and Shadow socks, along with their corresponding cloned apps gathered from the Apk Pure, Google Play, Mobile1
Market, App Brain, and Aptoide Android markets. These apps are originally downloaded in the .apk format, the
Android package file format used to distribute Android apps across various markets. The cloned applications are decompiled
to retrieve the Java source files. Then, we extract the cloned files and clean them of noisy data. After that, we extract
edges and nodes via AST from the cleaned cloned Java files. Detailed information about the dataset, including the total
number of files, the number of cloned files, the number of lines before and after cleaning, and the number of edges and
nodes for each Android application, is shown in Table 1.

4.2. Result analysis

The nodes and edges are then processed to compute the weight of each edge and node. This helps us extract the most
important features for effective classification. We applied TFIDF to extract weighted features from all five cloned
applications. A chunk of the TFIDF features for the Shadow socks application is shown in Table 2. The weighted node
features and weighted edge features from both APKs are shown on the left and right, respectively. The labels n1, n2, etc.
indicate AST nodes, and else, FALSE, final, etc. denote the edges. The values represent the weighting contribution of each
feature. Values that are the same or close to each other contribute more to classification than values that are far apart.
For instance, n1, n2, n6, n9, etc. have the same or nearly the same weighting values in both APKs, which indicates that
these features are extracted from cloned scripts of the Shadow socks application. In this way, we can magnify the sparse
AST features and filter the information that is most effective
for better classification.
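The idea that equal or close weights in the two APKs point to cloned code can be illustrated with a simple filter. The sketch below uses a few values from Table 2; the function `close_features` and the tolerance of 0.2 are hypothetical choices for illustration and are not the selection rule used by the deep learning model.

```python
def close_features(apk1: dict[str, float], apk2: dict[str, float],
                   tolerance: float = 0.2) -> list[str]:
    """Return the shared features whose weights differ by at most `tolerance`."""
    shared = apk1.keys() & apk2.keys()
    return sorted(
        name for name in shared
        if abs(apk1[name] - apk2[name]) <= tolerance
    )


# A few node and edge weights copied from Table 2 (Shadow socks).
APK1 = {"n1": 4.159, "n2": 2.197, "n4": 6.702, "else": 23.979, "field": 0.000}
APK2 = {"n1": 4.269, "n2": 2.197, "n4": 2.197, "else": 23.979, "field": 40.890}

if __name__ == "__main__":
    print(close_features(APK1, APK2))
```

Features such as n4 and field, whose weights diverge strongly between the two APKs, are filtered out, while the matching ones remain as candidate evidence of cloning.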

Table 2
A chunk of TFIDF features for shadow socks.
AST nodes with weights AST edges with weights
Nodes APK1 APK2 Edges APK1 APK2
n1 4.159 4.269 drawerview 4.159 4.159
n2 2.197 2.197 edges 0.693 0.693
n3 14.556 13.456 else 23.979 23.979
n4 6.702 2.197 enable 6.438 6.438
n5 3.206 4.228 end 0.693 0.693
n6 2.197 2.197 extends 0.693 0.693
n7 2.197 2.297 FALSE 4.159 4.159
n8 3.277 2.197 field 0.000 40.890
n9 2.197 2.197 final 30.779 30.779
n10 6.438 6.438 fliprtl 4.159 4.159
n11 3.286 2.197 float 30.779 30.779
n12 2.197 2.197 getapp 0.724 0.693
n13 4.265 4.159 getdecor 0.693 0.693
n14 6.438 6.438 getdrawable 5.201 4.159
n15 3.288 2.197 getdrawertog 0.693 0.693
n16 2.197 2.197 getitemid 0.693 0.693
n17 6.402 4.159 getlayout 0.693 0.693
n18 2.197 2.197 getposition 2.197 2.197
n19 14.556 15.667 getthemeup 27.334 27.334
n20 2.197 2.197 getwindow 0.693 0.693
n21 2.197 2.197 glyphoffset 6.438 6.438
n22 5.204 2.197 graphics 0.000 8.959

Next, the extracted features are given to a deep learning model to classify cloned applications. We configured the deep
learning model with the default settings and also with fine-tuned parameters for better accuracy. Here, fine-tuning
means configuring the deep learning model with different numbers of layers (including hidden and dropout layers),
different numbers of neurons in each layer, and different learning rates. The accuracy and loss curves for the Shadow
Socks application are shown in Fig. 6. Its four panels (a, b, c, d) show the accuracy and loss graphs before and after
fine-tuning. The horizontal axis shows the epochs, and the vertical axis shows the accuracy or the loss. We used 200
epochs for each curve. The blue and orange curves show the training and testing data points, respectively, for both
accuracy and loss. It can be seen that, with the default settings, the model performs poorly, reaching only 86% accuracy.
With fine-tuning, the model boosts the classification accuracy to 99%. The testing curve dips at epochs 75, 120, and
150, but soon afterwards it recovers and follows the training curve. Similarly, the loss curve shows much improvement
after fine-tuning, which demonstrates the effectiveness of the designed configurations. The loss is approximately 35%
with the default settings, but after fine-tuning the model it goes down to 2%. Initially, the loss is quite high, around
50%, but it soon falls to 30% at the 15th epoch. After that, it goes down to 10% at the 20th epoch but jumps again to
40% at the 50th epoch. This fluctuation continues until the 73rd epoch, after which both the training and testing curves
remain more or less constant.
Overfitting occurs when the model learns the detail and noise in the training data to such an extent that the output of
the deep learning model is adversely affected. We used validation metrics, namely validation loss and validation
accuracy, on the train and test data to detect overfitting in the deep learning model. The validation metrics typically
stop improving after some number of epochs, while the training metrics continue to improve as the model seeks the best
fit to the training data. To handle the overfitting problem of our deep learning model, we used three schemes. Firstly,
to maximize learning ability, we chose an appropriate network size and number of hidden layers. Secondly, we used
regularization, e.g. adding a cost for bigger weights to the loss function. Finally, dropout layers are introduced to
remove some features and avoid overfitting of the deep learning model.
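The three overfitting schemes above can be illustrated with a minimal, framework-free sketch. It is not the authors' implementation: the function names, the penalty weight `lam`, the `patience` value, and the reading of the third scheme as inverted dropout are all assumptions made for illustration.

```python
import random


def l2_penalized_loss(base_loss, weights, lam=0.01):
    """Add a cost for bigger weights to the loss (L2 regularization).
    `lam` is a hypothetical penalty strength, not a value from the paper."""
    return base_loss + lam * sum(w * w for w in weights)


def dropout(activations, rate, rng):
    """Inverted dropout: zero each unit with probability `rate` and
    rescale the surviving units so the expected activation is unchanged."""
    if not 0.0 <= rate < 1.0:
        raise ValueError("rate must be in [0, 1)")
    keep = 1.0 - rate
    return [a / keep if rng.random() >= rate else 0.0 for a in activations]


def best_epoch(val_losses, patience=5):
    """Return the index of the lowest validation loss seen before the loss
    fails to improve for `patience` consecutive epochs (early stopping)."""
    best, best_idx, waited = float("inf"), -1, 0
    for idx, loss in enumerate(val_losses):
        if loss < best:
            best, best_idx, waited = loss, idx, 0
        else:
            waited += 1
            if waited >= patience:
                break
    return best_idx
```

In such a setup, `best_epoch` would be fed the per-epoch validation loss; training loss that keeps falling after the returned epoch is the overfitting signal described in the text.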
We selected four performance metrics, namely Precision, Recall, F-measure, and Classification Accuracy (CA), to analyze
the performance of our approach. The designed experiment was applied to all five Android applications, and the resulting
performance values (in percent) are shown in Table 3. The first column lists the five Android apps, and the other
columns show the CA, Precision, Recall, and F-measure, respectively. All values are reported before and after fine-tuning
to show the effectiveness of the proposed approach more clearly. ES File and One Clnr stand for the ES File
Downloader and One Cleaner applications, respectively. Shadow Socks, ES File, Hangman, Hangman 2, and One Cleaner
have CA values of 86%, 78%, 75%, 76%, and 80% before fine-tuning and 99%, 98%, 98%, 96.2%, and 98% after fine-tuning,
respectively. After fine-tuning, Shadow Socks and Hangman 2 give the maximum and minimum CA, respectively. Similarly,
the Precision, Recall, and F-measure values are given before and after fine-tuning for the five Android applications.
To show the effectiveness of the proposed approach, we compared it with popular state-of-the-art methods, namely
Multilayer Perceptron (MLP), Convolutional Neural Network (CNN), and Random Forest (RF). The detailed comparisons for
the Shadow Socks and ES File Downloader applications are shown in Table 4. It can be seen that our proposed approach
outperforms the other methods on CA, Precision, Recall, and F-measure. MLP is the next best
algorithm for Shadow Socks with 85% accuracy. RF provides the lowest performance for Shadow Socks with 82%

Fig. 6. Accuracy and loss curves before/after fine-tuning for the Shadow Socks application. (For interpretation of the
references to color in this figure legend, the reader is referred to the web version of this article.)

Table 3
Performance analysis of the proposed approach.
Android Apps    CA (%)           Precision (%)    Recall (%)       F-measure (%)
                Before  After    Before  After    Before  After    Before  After
Shadow Socks    86      99       87      99       86      98       86      99
ES File         78      98       78      99       79      98       78      98
Hangman         75      98       74      97       75      98       75      97
Hangman 2       76      96       76      96       76      97       77      96
One Cleaner     80      98       78      97       80      97       80      98

accuracy. For the ES File Downloader application, MLP is again the best of the baseline models with 83.44% accuracy.
CNN gives the lowest classification performance for the ES File Downloader with an accuracy of 80%. The Precision, Recall,
and F-measure values are also reported to show the detailed performance comparisons of the proposed approach.
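For reference, the four reported metrics can be computed from confusion-matrix counts as in the sketch below. It assumes a binary clone / non-clone labelling and the standard textbook definitions; how the paper averages over classes is not stated here, and the function and argument names are illustrative.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute Precision, Recall, F-measure, and Classification Accuracy (CA)
    from true/false positive and negative counts (binary case assumed)."""
    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F-measure is the harmonic mean of precision and recall.
    if precision + recall:
        f_measure = 2 * precision * recall / (precision + recall)
    else:
        f_measure = 0.0
    accuracy = (tp + tn) / total if total else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
        "ca": accuracy,
    }
```

Multiplying each returned value by 100 gives figures on the same percentage scale as Tables 3 and 4.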
We used the AST-based feature extraction method to study the abstract view of source code. These types of features
are vulnerable to semantic-based plagiarism, such as data insertion or deletion, logic flow modification, etc. However,
our main focus is to detect clone applications with the same internal logic but different GUIs. The proposed method can
effectively detect these types of clones.

Table 4
Performance comparisons with other methods.
Android Apps    Methods         Precision   Recall   F-measure   CA
Shadow Socks    MLP             85.10       85.00    84.80       85.00
                CNN             83.30       83.30    83.20       83.33
                RF              82.90       82.90    82.60       82.86
                Our Approach    99.00       98.00    99.00       99.00
ES File         MLP             83.40       83.30    83.40       83.44
                CNN             79.90       80.00    79.80       80.00
                RF              82.10       82.10    82.00       82.14
                Our Approach    99.00       98.00    98.00       98.00

5. Conclusion

Green City services primarily run on mobile applications through IoT-based devices. Software cloning is the practice
by which different developers produce similar apps that can be easily accessed in any Android market. To sustain
these services, the installed applications should be protected from clone attacks. Software cloning is one of the
major risks affecting the Android technology sector, and it can severely disrupt the services offered by the green city.
Several pirated copies of an original application may be available, which could damage the actual capabilities of the
smart services. Google has also launched an app certification program to discourage piracy-related downloads, which
enables developers to implement registration policies for their software. Android marketplaces can be connected to IoT
systems to optimize network traffic and conduct real-time AI data assessment. Android markets are widely available to
IoT customers, but there is no security check that can verify cloned functionality. Subscribers can publish cloned
applications for commercial purposes on any Android market. This practice may cause significant financial loss for the
original authors. In this paper, we proposed an IoT-based deep learning model with AST features for efficient
classification of cloned applications. First, the proposed solution retrieves the original APK file along with the
candidate cloned APKs over the cloud network. We use APK Extractor to collect the Dalvik Executable (DEX) files for every
program. The Jdex decompiler is used to recover Java files from the Dalvik Executables. Second, the AST features are
derived using the ANTLR parser. Third, linear features are collected from these hierarchies, and TF-IDF is used to
calculate the importance of each feature. Finally, the deep learning model is designed to identify cloned code in
Android applications. Program dependence graph (PDG) features can capture semantic patterns, including program and data
dependencies. In future, we will try a hybrid approach of PDG features and graph embedding methods for accurate clone
classification.
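The TF-IDF weighting step summarized above can be sketched as follows. This uses the common raw-count times log(N/df) form as an assumption; the exact weighting variant used in the paper is not given in this section, and the token lists in the usage note are made up for illustration.

```python
import math
from collections import Counter


def tfidf(documents):
    """Weight each token of each document by term frequency times
    inverse document frequency. `documents` is a list of token lists,
    e.g. one list of linearized AST tokens per source file."""
    n_docs = len(documents)
    # Document frequency: in how many documents each token appears.
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))
    weighted = []
    for tokens in documents:
        counts = Counter(tokens)
        weighted.append({
            tok: count * math.log(n_docs / doc_freq[tok])
            for tok, count in counts.items()
        })
    return weighted
```

For example, `tfidf([["final", "float", "final"], ["final", "else"]])` gives `final` a weight of zero in both documents because it occurs in every document, while `float` and `else` each receive a positive weight.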

CRediT authorship contribution statement

Farhan Ullah: Proposed the research, ran the simulations, and wrote the whole manuscript. Fadi Al-Turjman: Reviewed and
analyzed the proposed research. Anand Nayyar: Gave suggestions and assisted in revisions.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have
appeared to influence the work reported in this paper.

References

Abadi, M., et al., 2016a. Tensorflow: A system for large-scale machine learning. In: 12th {USENIX} Symposium on Operating Systems Design and
Implementation ({OSDI} 16), pp. 265–283.
Abadi, M., et al., 2016b. Tensorflow: Large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467.
Agostinelli, F., Hoffman, M., Sadowski, P., Baldi, P., 2014. Learning activation functions to improve deep neural networks. arXiv preprint arXiv:
1412.6830.
Braun, T., Fung, B.C., Iqbal, F., Shah, B., 2018. Security and privacy challenges in smart cities. Sustain. Cities Soc. 39, 499–507.
Butun, I., Österberg, P., Song, H., 2019. Security of the internet of things: vulnerabilities, attacks and countermeasures. IEEE Commun. Surv. Tutor.
Chin, E., Felt, A.P., Greenwood, K., Wagner, D., 2011. Analyzing inter-application communication in android. In: Proceedings of the 9th International
Conference on Mobile Systems, Applications, and Services. ACM, pp. 239–252.
Crussell, J., Gibler, C., Chen, H., 2013. Scalable semantics-based detection of similar android applications. In: Proc. of ESORICS, vol. 13. Citeseer.
Din, S., Paul, A., 2019. Smart health monitoring and management system: Toward autonomous wearable sensing for internet of things using big data
analytics. Future Gener. Comput. Syst. 91, 611–619.
Din, S., Paul, A., Hong, W.-H., Seo, H., 2019. Constrained application for mobility management using embedded devices in the Internet of Things
based urban planning in smart cities. Sustainable Cities Soc. 44, 144–151.
Ghafir, I., et al., 2018. Security threats to critical infrastructure: the human factor. J. Supercomput. 1–17.
Harrand, N., Soto-Valero, C., Monperrus, M., Baudry, B., 2019. The strengths and behavioral quirks of java bytecode decompilers. In: 2019 19th
International Working Conference on Source Code Analysis and Manipulation (SCAM). IEEE, pp. 92–102.
Jiang, L., Misherghi, G., Su, Z., Glondu, S., 2007. Deckard: Scalable and accurate tree-based detection of code clones. In: Proceedings of the 29th
International Conference on Software Engineering. IEEE Computer Society, pp. 96–105.
12 F. Ullah, F. Al-Turjman and A. Nayyar / Environmental Technology & Innovation 20 (2020) 101091

Joshi, M., Khanna, K., 2013. Plagiarism detection over the web. Int. J. Comput. Appl. 68 (15), 17–20.
Kang, K., Pang, Z., Da Xu, L., Ma, L., Wang, C., 2014. An interactive trust model for application market of the internet of things. IEEE Trans. Ind. Inf.
10 (2), 1516–1526.
Kaur, G., Tomar, P., Singh, P., 2018. Design of cloud-based green iot architecture for smart cities. In: Internet of Things and Big Data Analytics Toward
Next-Generation Intelligence. Springer, pp. 315–333.
Ke, W., Jiang, J.-H., Rui-Yun, M., 2018. A code classification method based on TF-IDF. DEStech Trans. Econ. Bus. Manag. no. eced.
Kinawy, S., El-Diraby, T., Konomi, H., 2018. Customizing information delivery to project stakeholders in the smart city. Sustain. Cities Soc. 38, 286–300.
Li, Y., Dai, W., Ming, Z., Qiu, M., 2015. Privacy protection for preventing data over-collection in smart city. IEEE Trans. Comput. 65 (5), 1339–1350.
Li, M., et al., 2017. LibD: scalable and precise third-party library detection in android markets. In: 2017 IEEE/ACM 39th International Conference on
Software Engineering (ICSE). IEEE, pp. 335–346.
Liu, C., Chen, C., Han, J., Yu, P.S., 2006. GPLAG: detection of software plagiarism by program dependence graph analysis. In: Proceedings of the 12th
ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, pp. 872–881.
Lorimer, P.A., Diec, V.M.-F., Kantarci, B., 2018. COVERS-UP: Collaborative verification of smart user profiles for social sustainability of smart cities.
Sustain. Cities Soc. 38, 348–358.
Mbarek, B., Ge, M., Pitner, T., 2020. Enhanced network intrusion detection system protocol for internet of things. In: Proceedings of the 35th Annual
ACM Symposium on Applied Computing, pp. 1156–1163.
Nichols, L., Dewey, K., Emre, M., Chen, S., Hardekopf, B., 2019. Syntax-based improvements to plagiarism detectors and their evaluations. In:
Proceedings of the 2019 ACM Conference on Innovation and Technology in Computer Science Education. ACM, pp. 555–561.
Ojala, O., Ferm, T., 2015. Building green city with green choices in traffic. In: EChallenges E-2015 Conference. IEEE, pp. 1–8.
Pan, X., Kang, M.-n., 2011. Research on examination paper recognizing and importing system based on ANTLR. Electron. Des. Eng. 7.
Parr, T., 2013. The Definitive ANTLR 4 Reference. Pragmatic Bookshelf.
Paul, A., Ahmad, A., Rathore, M.M., Jabbar, S., 2016. Smartbuddy: defining human behaviors using big data analytics in social internet of things. IEEE
Wirel. Commun. 23 (5), 68–74.
Rathore, M.M., Ahmad, A., Paul, A., Rho, S., 2016. Urban planning and building smart cities based on the internet of things using big data analytics.
Comput. Netw. 101, 63–80.
Rathore, M.M., Paul, A., Ahmad, A., Chilamkurti, N., Hong, W.-H., Seo, H., 2018a. Real-time secure communication for smart city in high-speed big
data environment. Future Gener. Comput. Syst. 83, 638–652.
Rathore, M.M., Paul, A., Ahmad, A., Jeon, G., 2017. IoT-Based big data: from smart city towards next generation super city planning. Int. J. Semant.
Web Inf. Syst. (IJSWIS) 13 (1), 28–47.
Rathore, M.M., Paul, A., Hong, W.-H., Seo, H., Awan, I., Saeed, S., 2018b. Exploiting iot and big data analytics: Defining smart digital city using
real-time urban data. Sustain. Cities Soc. 40, 600–610.
Schleimer, S., Wilkerson, D.S., Aiken, A., 2003. Winnowing: local algorithms for document fingerprinting. In: Proceedings of the 2003 ACM SIGMOD
International Conference on Management of Data. ACM, pp. 76–85.
Sodhro, A.H., Pirbhulal, S., Luo, Z., de Albuquerque, V.H.C., 2019. Towards an optimal resource management for IoT based green and sustainable smart
cities. J. Cleaner Prod. 220, 1167–1179.
Soh, C., Tan, H.B.K., Arnatovich, Y.L., Wang, L., 2015. Detecting clones in android applications through analyzing user interfaces. In: Proceedings of
the 2015 IEEE 23rd International Conference on Program Comprehension. IEEE Press, pp. 163–173.
Su, X., Chuah, M., Tan, G., 2012. Smartphone dual defense protection framework: Detecting malicious applications in android markets. In: 2012 8th
International Conference on Mobile Ad-Hoc and Sensor Networks (MSN). IEEE, pp. 153–160.
Ullah, F., et al., 2019. Detection of clone scammers in android markets using IoT-based edge computing. Trans. Emerg. Telecommun. Technol. e379.
Wahler, V., Seipel, D., Wolff, J., Fischer, G., 2004. Clone detection in source code by frequent itemset techniques. In: Source Code Analysis and
Manipulation, Fourth IEEE International Workshop on. IEEE, pp. 128–135.
Wang, H., Guo, Y., Ma, Z., Chen, X., 2015. Wukong: A scalable and accurate two-phase approach to android app clone detection. In: Proceedings of
the 2015 International Symposium on Software Testing and Analysis, pp. 71–82.
Wang, W., Li, G., Ma, B., Xia, X., Jin, Z., 2020. Detecting code clones with graph neural network and flow-augmented abstract syntax tree. In: 2020
IEEE 27th International Conference on Software Analysis, Evolution and Reengineering (SANER). IEEE, pp. 261–271.
Yokoi, K., Choi, E., Yoshida, N., Inoue, K., 2018. Investigating vector-based detection of code clones using bigclonebench. In: 2018 25th Asia-Pacific
Software Engineering Conference (APSEC). IEEE, pp. 699–700.
Zhou, Y., Wang, Z., Zhou, W., Jiang, X., 2012a. Hey, you, get off of my market: detecting malicious apps in official and alternative android markets.
NDSS 25 (4), 50–52.
Zhou, W., Zhou, Y., Jiang, X., Ning, P., 2012b. Detecting repackaged smartphone applications in third-party android marketplaces. In: Proceedings of
the Second ACM Conference on Data and Application Security and Privacy. ACM, pp. 317–326.
