Machine Learning for Mobile Defense: Detecting SMS Malware and Riskware on Android


1st Anmol Rattan Singh
Chitkara University Institute of Engineering and Technology
Chitkara University, Punjab, India
[email protected]

2nd Gurjinder Singh
Chitkara University Institute of Engineering and Technology
Chitkara University, Punjab, India
[email protected]

3rd Nitin Saluja
Chitkara University Institute of Engineering and Technology
Chitkara University, Punjab, India
[email protected]

Abstract—Android smartphones are frequently targeted by SMS Malware and Riskware attacks that aim to steal private information such as user credentials, photos, videos, and banking details. Often, Android users unknowingly allow these attacks by clicking on unknown links or messages. The goal of this research paper is to boost user confidence and privacy on Android devices. The CICMalDroid 2020 dataset, which includes data on traffic patterns related to Riskware and SMS malware attacks, forms the basis of this research. The study introduces a methodology for the early detection of these threats by incorporating a machine learning-enabled chip at the hardware level. The dataset has been processed using various machine learning algorithms to assess the most effective ones for detecting Android malware. The findings demonstrate the superiority of the random forest algorithm on the CICMalDroid dataset, with this classifier achieving the highest accuracy, recall, and F1 score compared to others. Conversely, the Naive Bayes method is noted for its reliability and its ability to minimize false positives. These outcomes emphasize the importance of strengthening cybersecurity measures, regularly updating software, and the critical role of machine learning in enhancing the security of Android devices.

Index Terms—Attack; Defense; Mobile; SMS; Machine Learning; Malware; Riskware

I. INTRODUCTION

SMS Malware and Riskware attacks targeting Android smartphones have escalated, posing significant threats to user privacy and data security. A global operation uncovered thousands of devices compromised by text messages containing malicious URLs designed to trick recipients into clicking, thereby infecting their devices and risking data theft, financial fraud, or unauthorized access. Moreover, the prevalence of riskware has risen, with attackers using ostensibly benign apps to harvest sensitive information. Experts warn that numerous legitimate Android apps from well-known app stores may harbor concealed riskware, endangering login details, financial information, and location data. Android's widespread adoption globally makes it a lucrative target for cybercriminals. To combat these threats, experts recommend regularly updating Android devices, installing reliable antivirus software, and staying alert to suspicious websites and unauthorized app downloads. Educating users about the dangers of SMS Malware and Riskware is essential for securing Android devices in today's digital environment.

The malicious software in question operates under the guise of a standard SMS application, exploiting SMS functionalities such as credit or unit transfers offered by many carriers worldwide. This malware thrives by fully controlling the SMS capabilities, thanks to Android's “permission” subsystem, “broadcast receiver” subsystem, and its message-sending mechanism [1]. It can conceal transactions from telecom operators, draining funds from users' accounts and impacting multiple stakeholders. That such malware passes standard malware checks highlights the urgent need for advanced detection techniques. The cited article outlines several methods to mitigate these threats.

As Android continues to dominate the mobile market for smartphones, tablets, and other portable devices, its open-source nature and increasing popularity make it an attractive target for hackers. This research incorporates both static and dynamic analyses of Android applications to detect malware, scrutinizing app permissions and behaviors. The findings reveal that 80% of applications request risky permissions and 13% display potentially harmful behavior, indicating a need for stricter safety standards.

Despite ongoing efforts by academia and industry, challenges in Android malware detection and classification persist. Collecting labeled data for supervised learning is costly and difficult. The proposed solution in this study is a pseudo-label semi-supervised learning approach that utilizes deep neural networks trained on both labeled and unlabeled inputs. Dynamic analysis helps to create feature vectors from behavior profiles. The study introduces the CICMalDroid2020 dataset, which includes 17,341 Android adware, banking, SMS, riskware, and benign applications, offering the most comprehensive static and dynamic data publicly available [2]. The results demonstrate that the proposed model surpasses Label Propagation (LP) and other machine learning methods, achieving a high F1-score of 97.84% and a low false positive
rate of 2.76%.

This research manuscript comprises several sections, each addressing different aspects of the study: Section II provides an overview of recent advancements in protecting Android devices from malware threats; Section III details the methodologies used in the experimental study; Section IV discusses the results and insights gained from applying machine learning algorithms to detect intrusions on Android devices; and Section V concludes the study, summarizing the key findings and contributions along with the future scope of the same.

II. LITERATURE REVIEW

The number of gadgets and applications continues to rise, with smartphones and tablets driving increased app usage due to rapid technological advances. Despite passing initial virus checks, many apps containing difficult-to-detect malware are still trusted and available on the Play Store. Initial tests may miss unknown threats, potentially causing significant damage. This project introduces real-time Android malware detection solutions using dynamic algorithms and hybrid approaches, aimed at quickly identifying complex Android malware. The research differentiates between harmful and benign applications after filtering the dataset. The findings show that a hybrid Random Forest-Multilayer Perceptron network achieved 97.5% accuracy in just 22.945 seconds, making it a promising defense against cyberattacks by identifying harmful mobile app versions and protecting users' devices [3].

Despite efforts in both academic and corporate research, Android malware detection and classification remain unsolved challenges. Collecting labeled data is costly and difficult. The authors propose a pseudo-label semi-supervised learning method that trains deep neural networks on both labeled and unlabeled data. Dynamic analysis generates feature vectors from behavior profiles. The research introduces CICMalDroid2020, a dataset containing 17,341 Android adware, banking, SMS, riskware, and benign apps, offering extensive static and dynamic data [4]. The model outperformed Label Propagation (LP) and other machine learning algorithms, with a high F1-score of 97.84% and a low false positive rate of 2.76%.

Malware attacks have surged with the widespread use of Android devices. Attackers frequently develop new or updated malware. Due to the limitations of traditional detection methods in Android systems, machine learning has gained popularity in cybersecurity. The research introduces an evolutionary approach for detecting Android malware with a detection accuracy of 99.11%, making it a promising method for zero-day threat detection. Integrating machine learning techniques has significantly enhanced Android's security, with malware detection accuracy improving substantially [5].

Given Android's 72% market share, it has become a primary target for cybercriminals. Despite challenges, detecting malicious apps is a key focus of Android research. This paper presents a supervised learning approach for Android malware detection, which outperforms semi-supervised methods across several parameters. The study uses labeled datasets, including around 18,000 samples categorized as adware, banking, SMS, riskware, and benign apps. Well-known datasets like CICMalDroid2020, CICMalDroid2017, and CICAndMal2017 are employed to ensure accuracy [6]. The research contributes to the development of new Android malware detection algorithms, suggesting that further studies on Android security are essential.

Android malware presents a significant threat across sectors such as healthcare, finance, transportation, government, and e-commerce. This research presents two machine learning algorithms for dynamic Android malware analysis. One approach achieved over 96% accuracy in detecting and identifying Android malware, while the second method accurately classified malware families with over 99% accuracy. The research utilized a large dataset comprising 14 malware categories and 180 malware variants, enabling accurate and efficient dynamic analysis of Android malware [7].

Due to the rapid expansion of Android, malware attacks on mobile websites and applications have become a major concern. Traditional malware detection methods are often ineffective and time-consuming, requiring improved solutions [8]. The study introduces a lightweight convolutional neural network (CNN)-based detection technique, which converts Android malware components like classes.dex and AndroidManifest.xml into RGB images for analysis. This approach allows for quick and accurate detection of malware on platforms with limited processing capabilities, such as mobile devices.

As malware attacks on computers, the internet, and mobile devices increase, zero-day malware detection has become a top priority for security experts [9]. Attackers are increasingly targeting Android, the most widely used mobile operating system. Security experts employ various machine learning techniques to detect Android malware by distinguishing between malicious and legitimate APKs. The study uses 27 CICMalDroid2020 APK features from a dataset of 18,998 APKs. Machine learning classifiers like Random Forest achieved 98.6% accuracy in detecting Android malware. This highlights the effectiveness of machine learning in combating Android malware, as evidenced by the detection of 9,599,519 mobile malware cases in 2021 by Kaspersky [10].

Android users frequently engage in online banking and shopping, exposing their devices to malware threats. The authors use the CICMalDroid2020 dataset of 17,341 modern Android apps to test various machine learning classifiers for malware detection [11]. After preparing the data and applying classifiers, the results are evaluated for accuracy. The aim of this research is to identify the most effective classifier for the dataset, helping readers understand which machine learning algorithm performs best.

Many publicly available malware datasets either lack proper labels or only provide a single label per sample, making it difficult to capture the complex behavior of malware. This paper offers a multi-labeling technique to automatically identify the different behaviors of malware samples based on classifications from automated detection engines [12]. The
study compares the behavior of known malware with the labeling approach. After applying the multi-labeling method to four public Android malware datasets, the composition and representativeness of these datasets are analyzed and discussed [13], [14].

III. METHODOLOGY

The solution outlined in this paper significantly enhances the defense of Android-based devices, including laptops, cell phones, and tablets, against malicious software and potentially harmful data from the internet. This method ensures comprehensive security by continuously monitoring data collected online through sophisticated machine learning (ML) techniques. It can instantly detect and eliminate potential threats by analyzing both data signatures and behavioral patterns.

This proactive approach not only protects user privacy and sensitive information but also increases user awareness of potential threats, keeping them informed about the latest developments. The ML-based security system is designed to adapt to the latest cyber intrusion methods while avoiding false positives, providing dynamic protection that evolves with the changing landscape of internet risks.

By adopting this comprehensive security solution, Android users are better equipped to enjoy robust and dependable protection against the constant threats present in the digital environment, ensuring the functionality and integrity of their devices remain intact. Figure 1 presents a simplified flowchart of an Android malware detection system, detailing how APK files are processed to check for malware. A streamlined explanation follows:
1) Internet Download: APK files are fetched from the Internet.
2) Malware Detection: The Android Malware Detection Software analyzes the APKs to determine if they are safe or contain malware.
• If malware is detected in APK 1, the data is removed to protect the device.
• If no malware is found in APK 2, it is deemed safe for use.
This flowchart illustrates the process of securing an Android device by assessing and handling potential threats found in applications downloaded from the internet.

Fig. 1. Methodology

IV. RESULTS

This research paper utilizes various machine learning algorithms (logistic regression, random forest classifier, and naive Bayes classifier) to bolster the security and privacy of Android mobile phones by effectively detecting SMS Malware and Riskware. The aim is to enhance the confidentiality of personal information on Android devices. The study employs data from the CICMalDroid 2020 dataset, which provides a substantial base for analysis.

A. Model Used

The Random Forest Classifier emerged as the top performer, delivering the highest accuracy, recall, and F1-score, proving its effectiveness in detecting harmful software. Meanwhile, the Naive Bayes algorithm also displayed high precision in its outcomes. Despite the effectiveness of these machine learning models, the paper emphasizes the necessity of comprehensive cybersecurity measures and regular software updates to combat the continuously evolving cyber threats.

1) Precision: Precision measures the accuracy of the positive predictions. It is the ratio of correctly predicted positive observations to the total predicted positives. This metric is particularly important when the cost of a false positive is high.

$$\text{Precision} = \frac{TP}{TP + FP} \tag{1}$$

Precision is crucial in scenarios like spam detection, where it's preferable to let some spam emails through rather than wrongly classifying important emails as spam.

2) Recall (Sensitivity or True Positive Rate): Recall measures the model's ability to detect positive instances from the actual positives available during testing. It is the ratio of correctly predicted positive observations to all observations that are actually positive. Formula:

$$\text{Recall} = \frac{TP}{TP + FN} \tag{2}$$
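
As a minimal sketch of Eqs. (1) and (2), the Python snippet below counts true and false positives and negatives from a pair of label lists and derives precision and recall from them. The function names (`confusion_counts`, `precision`, `recall`) and the ten sample labels are hypothetical and chosen only for illustration; they are not taken from the paper's experiments or from the CICMalDroid 2020 dataset.

```python
from typing import Sequence, Tuple


def confusion_counts(y_true: Sequence[int], y_pred: Sequence[int],
                     positive: int = 1) -> Tuple[int, int, int, int]:
    """Return (TP, FP, TN, FN) for a binary task with the given positive label."""
    tp = fp = tn = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive and actual == positive:
            tp += 1          # malware correctly flagged
        elif predicted == positive:
            fp += 1          # benign app wrongly flagged
        elif actual == positive:
            fn += 1          # malware missed
        else:
            tn += 1          # benign app correctly passed
    return tp, fp, tn, fn


def precision(tp: int, fp: int) -> float:
    """Eq. (1): TP / (TP + FP); 0.0 if nothing was predicted positive."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp: int, fn: int) -> float:
    """Eq. (2): TP / (TP + FN); 0.0 if there are no actual positives."""
    return tp / (tp + fn) if (tp + fn) else 0.0


# Illustrative labels only (assumption): 1 = malware, 0 = benign.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]

tp, fp, tn, fn = confusion_counts(y_true, y_pred)
print(f"precision={precision(tp, fp):.2f} recall={recall(tp, fn):.2f}")
# prints "precision=0.60 recall=0.75"
```

In this toy example the detector catches three of the four malware samples (recall 0.75) but also flags two benign apps, so only three of its five alarms are correct (precision 0.60). The zero-denominator guard is a design choice of this sketch, not something the paper specifies.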
Fig. 2. Precision
Fig. 3. Recall Comparison
Fig. 4. F1-Score
Fig. 5. Accuracy

Recall is critical in medical scenarios where it's crucial to identify all possible positive cases, such as predicting if a patient has a disease. Missing a true positive can be life-threatening.

3) F1 Score: The F1 Score is the harmonic mean of Precision and Recall. Therefore, this score takes both false positives and false negatives into account. It is especially useful when the classes are imbalanced. Formula:

$$\text{F1 Score} = \frac{2 \times (\text{Precision} \times \text{Recall})}{\text{Precision} + \text{Recall}} \tag{3}$$

F1 Score is used when an equal balance of precision and recall is important, such as in document classification where it's important to precisely predict the category of the document as well as to cover all potential documents of a category.

4) Accuracy: Accuracy measures the overall correctness of the model, that is, the ratio of correct predictions (both true positives and true negatives) to the total number of cases examined. Formula:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{4}$$

Accuracy is a useful measure when the target classes are well balanced and the costs of different types of prediction errors are similar. However, it might be misleading in the presence of imbalanced classes, as it can reflect the underlying class distributions rather than the ability of the model to make accurate predictions. In the above equations:
• TP (True Positives): The number of correct positive predictions.
• FP (False Positives): The number of incorrect positive predictions.
• TN (True Negatives): The number of correct negative predictions.
• FN (False Negatives): The number of positives that were incorrectly predicted as negative.

V. CONCLUSION

In conclusion, this research employs a variety of machine learning algorithms (Logistic Regression, Random Forest Classifier, and Naive Bayes Classifier) to enhance the trust and privacy of Android mobile phones by detecting and mitigating SMS Malware and Riskware. The findings, detailed in this research paper, indicate that the Random Forest Classifier outperformed other models in terms of accuracy, recall, and F1-score, proving its efficacy in identifying malicious software and bolstering Android device security. Conversely, the Naive Bayes algorithm demonstrated high accuracy and
reduced false positives, underscoring the significant role of machine learning in enhancing mobile security. However, it is crucial to note that machine learning models need to be supplemented with robust cybersecurity measures and regular software updates due to the ever-evolving nature of cyber threats. Future research should aim to refine these models to counter the increasing threats specific to Android malware. Ongoing collaboration between the cybersecurity community and machine learning experts remains vital for developing more effective security mechanisms. Mobile operating system developers must continue investing in security technologies to enhance overall device protection comprehensively. Additionally, efforts to increase user education and awareness are essential to equip Android device users with the necessary skills to recognize and counteract potential threats.

REFERENCES

[1] K. Singh Dhindsa, S. Rani, and K. Singh Dhindsa Baba Banda Singh Bahadur, “Android Malware Detection in Official and Third Party Application Stores,” 2018. [Online]. Available: https://www.researchgate.net/publication/323573686
[2] K. Hamandi, A. Chehab, I. H. Elhajj, and A. Kayssi, “Android SMS Malware: Vulnerability and Mitigation,” in 2013 27th International Conference on Advanced Information Networking and Applications Workshops, 2013, pp. 1004–1009. doi: 10.1109/WAINA.2013.134.
[3] A. H. El Fiky, A. El Shenawy, and M. A. Madkour, “Android Malware Category and Family Detection and Identification using Machine Learning,” CoRR, vol. abs/2107.01927, 2021. [Online]. Available: https://arxiv.org/abs/2107.01927
[4] S. Mahdavifar, A. F. Abdul Kadir, R. Fatemi, D. Alhadidi, and A. A. Ghorbani, “Dynamic Android Malware Category Classification using Semi-Supervised Deep Learning,” in 2020 IEEE Intl Conf on Dependable, Autonomic and Secure Computing, Intl Conf on Pervasive Intelligence and Computing, Intl Conf on Cloud and Big Data Computing, Intl Conf on Cyber Science and Technology Congress (DASC/PiCom/CBDCom/CyberSciTech), 2020, pp. 515–522. doi: 10.1109/DASC-PICom-CBDComCyberSciTech49142.2020.00094.
[5] A. Gómez and A. Muñoz, “Deep Learning-Based Attack Detection and Classification in Android Devices,” Electronics (Basel), vol. 12, no. 15, 2023, doi: 10.3390/electronics12153253.
[6] M. Faisal Ahmed et al., “ShielDroid: A Hybrid Approach Integrating Machine and Deep Learning for Android Malware Detection,” in 2022 International Conference on Decision Aid Sciences and Applications (DASA), 2022, pp. 911–916. doi: 10.1109/DASA54658.2022.9764984.
[7] S. Mahdavifar, A. F. Abdul Kadir, R. Fatemi, D. Alhadidi, and A. A. Ghorbani, “Dynamic Android Malware Category Classification using Semi-Supervised Deep Learning,” in 2020 IEEE Intl Conf on Dependable, Autonomic and Secure Computing, Intl Conf on Pervasive Intelligence and Computing, Intl Conf on Cloud and Big Data Computing, Intl Conf on Cyber Science and Technology Congress (DASC/PiCom/CBDCom/CyberSciTech), 2020, pp. 515–522. doi: 10.1109/DASC-PICom-CBDComCyberSciTech49142.2020.00094.
[8] S. Roy, S. Bhanja, and A. Das, “AndyWar: an intelligent android malware detection using machine learning,” Innov Syst Softw Eng, 2023, doi: 10.1007/s11334-023-00530-5.
[9] W. Waheed and H. Alyasiri, “Evolving trees for detecting android malware using evolutionary learning,” International Journal of Nonlinear Analysis and Applications, vol. 14, no. 1, pp. 753–761, 2023, doi: 10.22075/ijnaa.2022.6874.
[10] S. Mahdavifar, D. Alhadidi, and Ali A. Ghorbani, “Effective and Efficient Hybrid Android Malware Classification Using PseudoLabel Stacked Auto-Encoder,” Journal of Network and Systems Management, vol. 30, no. 1, p. 22, 2021, doi: 10.1007/s10922-021-09634-4.
[11] P. García-Teodoro, J. A. Gómez-Hernández, and A. Abellán-Galera, “Multi-labeling of complex, multi-behavioral malware samples,” Comput Secur, vol. 121, p. 102845, 2022, doi: 10.1016/j.cose.2022.102845.
[12] I. Sharma and V. Pahuja, “Comparative Analysis of Open-Source Vulnerability Assessment Tools for Campus Area Network,” in 2023 International Conference on Emerging Smart Computing and Informatics (ESCI), 2023, pp. 1–6. doi: 10.1109/ESCI56872.2023.10100030.
[13] I. Sharma, “Evolution of Unmanned Aerial Vehicles (UAVs) with Machine Learning,” in 2021 International Conference on Advances in Technology, Management & Education (ICATME), 2021, pp. 25–30. doi: 10.1109/ICATME50232.2021.9732774.
[14] S. Mahdavifar, D. Alhadidi, and Ali A. Ghorbani, “Effective and Efficient Hybrid Android Malware Classification Using PseudoLabel Stacked Auto-Encoder,” Journal of Network and Systems Management, vol. 30, no. 1, p. 22, 2021, doi: 10.1007/s10922-021-09634-4.
