
Windows Malware Binaries in C/C++ GitHub Repositories:

Prevalence and Lessons Learned

William La Cholter, Matthew Elder and Antonius Stalick

Applied Physics Laboratory, Johns Hopkins University, U.S.A.

Keywords: Malware, GitHub, Open Source Software, Windows.

Abstract: Does malware lurking in GitHub pose a threat? GitHub is the most popular open source software website,
having 188 million repositories. GitHub hosts malware-related projects for research and educational purposes
and has also been used by malware to attack users. In this paper, we explore the prevalence of unencrypted,
uncompressed binary code malware in Microsoft Windows compatible C and C++ GitHub repositories and
characterize the threat. We mined 1,835 repositories for already-compiled malicious files and data suggesting
whether the repository is malware-related. We focused on these repositories because Windows is frequently
targeted by malware written in C or C++. These repositories are good resources for attackers and could target
Windows users. We extracted all Portable Executable (PE) files from all commits and queried the malware
resource VirusTotal for analysis from its 76 anti-virus engines. Of the 24,395 files, 4,335 are suspicious, with
at least one detection; 440 could be considered malicious, with at least seven detections. We identify topic tags
suggesting malware or offensive security content, to differentiate from seemingly benign repositories. 197 of
440 malicious executables were in 27 ostensibly benign repositories. This work illustrates risks in source code
repositories and lessons learned in relating GitHub and VirusTotal data.

1 INTRODUCTION

GitHub is the most popular open source software website, with over 188 million repositories (GitHub.com, 2020a). GitHub is known to host malware-related projects for research and educational purposes, described as allowable in their “GitHub Community Guidelines” (GitHub.com, 2020c), including source code examples of exploitation and generally nefarious functionality, such as keyboard logging. GitHub originally became popular as a service to host software source code repositories but has also become a popular hosting environment for non-source code information, such as raw data sets, including curated malware collections such as theZoo (ytisf, 2020). GitHub has also been used by malware for command and control, download infrastructure, or serving backdoored code (Avast Threat Intelligence Team, 2018), (Munoz, 2020). Given that malware resides on GitHub both legitimately and maliciously, we study whether malware lurking in GitHub repositories poses a threat to repository users and downstream consumers of these repositories.

Malware is a huge cybersecurity problem, with over 350,000 new malicious programs and potentially unwanted applications discovered every day (AV-Test, 2020). Malware developers target many platforms (e.g., desktop, mobile, servers, cloud, and deep learning systems), use many different programming languages (e.g., C, C++, Java, JavaScript, Assembly, Python, Ruby, C#, and Delphi), and produce many different forms of malware (e.g., Windows Portable Executable (PE), Linux Executable and Linkable Format (ELF), shell code injection, database injection, and raw malicious data). For this malware research, we focused on Windows Intel x86 binary files written in C and C++ because of their volume, reach, complexity, and potential for uniform analysis methods. It is therefore natural that our research started with ostensibly Windows C and C++ repositories.

In July 2019, we found 1,870 GitHub repositories using the search terms of “windows” and “c” or “cpp.” Of those, 1,862 have source code that could be built using a modern Windows C++ compiler, and 1,835 were still online when we checked again in December 2019. Some related web UI searches, such as for Microsoft Visual C++ project files (.vcxproj), yielded repositories outside of this initial set. Additionally, keywords mined from these repositories suggest more repositories of interest beyond our search terms. Expanding the data set is future work.
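As a rough illustration of the repository search, the sketch below only builds request bodies for the `search` field of GitHub's GraphQL API (the API named in Section 3.1); it does not send them. The exact query the authors ran is not given, so the qualifier string, the page size of 100, the one-request-per-language split, and the helper name `build_search_payload` are all assumptions.

```python
import json

# GraphQL document using GitHub's `search` connection for repositories.
SEARCH_QUERY = """
query($q: String!, $cursor: String) {
  search(query: $q, type: REPOSITORY, first: 100, after: $cursor) {
    repositoryCount
    pageInfo { hasNextPage endCursor }
    nodes { ... on Repository { nameWithOwner url } }
  }
}
"""

def build_search_payload(language, topic="windows", cursor=None):
    """Return the JSON body for one page of a repository search (hypothetical helper)."""
    qualifiers = f"topic:{topic} language:{language}"  # assumed search string
    return {"query": SEARCH_QUERY, "variables": {"q": qualifiers, "cursor": cursor}}

# One request body per language; results would then be merged and de-duplicated.
payloads = [build_search_payload(lang) for lang in ("c", "cpp")]
print(json.dumps(payloads[0]["variables"]))
```

Paging would continue by passing each response's `endCursor` back as `cursor` while `hasNextPage` is true.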

Cholter, W., Elder, M. and Stalick, A.
Windows Malware Binaries in C/C++ GitHub Repositories: Prevalence and Lessons Learned.
DOI: 10.5220/0010237904750484
In Proceedings of the 7th International Conference on Information Systems Security and Privacy (ICISSP 2021), pages 475-484
ISBN: 978-989-758-491-6
Copyright © 2021 by SCITEPRESS – Science and Technology Publications, Lda. All rights reserved
ICISSP 2021 - 7th International Conference on Information Systems Security and Privacy

To determine whether a file might be malicious, we searched the VirusTotal malware information service that aggregates the detection results of 76 anti-virus (AV) products (VirusTotal, 2020b). Any registered user can submit a sample to VirusTotal for analysis. The detection results and other file information are available to anyone for subsequent query, by submitting a cryptographic hash of the file. VirusTotal’s Application Programming Interface (API) includes rescan requests for results from the most up-to-date AV products and much threat intelligence data related to malware (VirusTotal, 2020a).

The contribution of this paper is a methodology for investigating the presence of malware over all the commits in the lifetime of a GitHub repository. While it is straightforward to clone a repository to a specific point in time (e.g., the current head state or some arbitrary branch in the past), our approach investigates all of the commits throughout the history of the repository to identify files for analysis. We use the well-established method of VirusTotal anti-virus engine results to assess maliciousness of a particular file type (Windows portable executable binaries), and we apply our methodology to a subset of GitHub repositories (Windows C and C++ repositories) in this preliminary investigation. However, this methodology could be applied to additional populations of GitHub repositories, identifying other file types of interest through the repository lifetimes, and using other malware analysis methods.

In this paper, we present our preliminary investigation into the presence of malware files in Windows C/C++ GitHub repositories. Section 2 provides background on GitHub and related work in VirusTotal malware research. We describe our approach to mine Windows binary files from GitHub and then query VirusTotal for malware detection results in Section 3. Section 4 presents our initial VirusTotal analysis results for the Windows files that we mined from our GitHub repositories of interest. Section 5 provides a discussion and more detailed analysis of our results. We present our conclusions and directions for future research in Section 6.

2 BACKGROUND AND RELATED WORK

GitHub is known to host malware, both legitimately (i.e., in compliance with GitHub’s terms of use) and illegitimately. GitHub prohibits content that “contains or installs any active malware or exploits, or uses our platform for exploit delivery” (GitHub.com, 2020b). An example of GitHub hosting malware in violation of this policy occurred in March 2018, when cybercriminals uploaded cryptocurrency mining malware to forked GitHub projects and used phishing ads to download and execute the malware (Avast Threat Intelligence Team, 2018). More recently, 26 open source projects were discovered to have backdoors inserted by the Octopus malware, which used the build process to spread to other NetBeans projects (Munoz, 2020). GitHub appears to allow executable malware in curated malware collections. A search for “malware samples” returns over 250 repositories. Although many repository descriptions suggest analysis tools or malware-related resources, some explicitly indicate that they include malware samples.

In terms of detecting malware or malicious repositories in GitHub, only recently have two efforts systematically studied this problem. Recent work by Rokon et al. developed a methodology for finding malware source code within GitHub projects and identified 7,504 malware source repositories (Rokon et al., 2020). While the findings from this work can be used to search for malware binaries in GitHub as well, our work seeks to find malicious binaries in GitHub repositories that are not necessarily purporting to contain malware. Zhang et al. developed a deep neural network approach to detect malicious GitHub repositories using content-based features from source code files, investigating a population of blockchain and cryptocurrency repositories (Zhang et al., 2020). They used VirusTotal as part of their evaluation process for comparison purposes, ultimately labeling 1,492 repositories as malicious out of their population of 3,729 repositories, but again this work was more focused on malicious source code in GitHub.

Many previous research efforts have used VirusTotal to support malware detection and analysis in the domains of malware binaries run in dynamic analysis sandboxes (Graziano et al., 2015), signed malware binaries (Kim et al., 2018), and mobile applications (Hurier et al., 2017), (Pendlebury et al., 2019), (Salem et al., 2019), (Suciu et al., 2018), (Wang et al., 2019). VirusTotal can also be used for analysis of malicious web addresses, i.e., Uniform Resource Locators (URLs), such as those used in phishing campaigns (Peng et al., 2019). These research efforts and others each utilize VirusTotal in different ways, either using various thresholds for the number of VirusTotal engines needed to consider a sample as malicious (e.g., 1, 5, or 10), thresholds based on percentage of engines (e.g., 50%), or results from a subset of engines based on high reputation or market share. In short, there is little consensus on how to definitively interpret VirusTotal results to determine whether a sample is malicious.
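To make the lack of consensus concrete, the following sketch applies the three families of labeling policy mentioned above (absolute count, percentage of engines, and a trusted subset) to one report. The engine names, the report contents, and the function names are hypothetical; this is not how any particular cited study implemented its threshold.

```python
def label_by_count(detections, threshold=1):
    """Malicious if at least `threshold` engines flag the sample."""
    return sum(detections.values()) >= threshold

def label_by_fraction(detections, fraction=0.5):
    """Malicious if at least `fraction` of the responding engines flag the sample."""
    return bool(detections) and sum(detections.values()) / len(detections) >= fraction

def label_by_trusted(detections, trusted):
    """Malicious if any engine in a chosen trusted subset flags the sample."""
    return any(detections.get(name, False) for name in trusted)

# Hypothetical report: engine name -> True if that engine flagged the file.
report = {"EngineA": True, "EngineB": False, "EngineC": True,
          "EngineD": False, "EngineE": False}

print(label_by_count(report, 1))                         # True
print(label_by_count(report, 5))                         # False
print(label_by_fraction(report, 0.5))                    # False (2 of 5 engines)
print(label_by_trusted(report, {"EngineB", "EngineD"}))  # False
```

The same sample is labeled malicious under one policy and benign under the other three, which is the interpretive problem the surveyed studies face.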


Recently, Zhu et al. published a study on the behavior of the anti-virus engines within VirusTotal, which included a survey of 115 academic papers that used VirusTotal (Zhu et al., 2020). The most common approach to using VirusTotal was to set the threshold at one malicious engine detection for labeling a sample as malware (50 out of 115 papers). However, one key finding of their research was that the engines within VirusTotal “flip” detection results over time, sometimes oscillating between malicious and benign labels for the same sample over short periods of time. The authors recommended setting the threshold somewhere between 2 and 39 for stability of engine labels. Zhu et al. also found that the detection results from certain engines are highly correlated, which affects how one should set a threshold, with the largest cluster containing six engines using a hierarchical clustering algorithm with a threshold of 0.001 (Zhu et al., 2020).

3 APPROACH

We used GitHub to find and clone repositories, with the intent of compiling the code for binary analysis and getting data and metadata that provide insights into the software development process. In the course of that work, we discovered the presence of suspicious files and a paucity of rigorous research on them. We cloned all of our repositories of interest on 9-July-2019. By picking a specific date, we eliminated the need to account for the variable of time in our analysis of GitHub data. Git repositories provide core ground truth through SHA-1 cryptographic hashes of files, commits (file versions, predecessors, and comments), and tags. GitHub provides ground truth of user-provided data and approval of commits by the repository maintainer. We performed as much mining as possible on local copies to avoid API limits.

3.1 Mining GitHub and Git

GitHub and Git present data management challenges: GitHub provides additional online context for the potentially offline Git commit activities, but it provides snapshot or event-driven data rather than historical information through its API. For example, to find candidate repositories, we used GitHub’s GraphQL API, querying for languages “c” and “cpp” and the “windows” topic and cloned them locally. However, those topic associations can change over time.

To be thorough in analyzing all commits throughout a repository’s history, it is necessary to scan all files (“blobs”) in Git’s local key-value store. It is very efficient, O(b) for b = |Blobs|, to sweep the database for all file content that ever was in the history of commits and tags. But it is impossible to establish where and when they are referenced without walking the commit tree and tag graphs, naively O(t · c) for t = |Blobs ∪ TreeItemLists| and c = |Commits ∪ Tags|.

We used pygit2, a wrapper of libgit2, which was anecdotally an order of magnitude faster than GitPython during our early prototyping. We used python-magic content type identification, which wraps libmagic. Because Git SHA-1 hashes are computed on file contents and additional metadata, we needed to compute pure cryptographic hashes for VirusTotal submission and used Python’s hashlib, written in C. By performing operations in-memory using underlying C, performance was strong and we did not change repository file system state. For cross-repository analysis and structured ad hoc data we used PostgreSQL relations and JSON columns.

3.2 Querying VirusTotal

VirusTotal supports queries by MD5, SHA-1, and SHA-256 cryptographic hashes. Although SHA-1 is generally deprecated because of collisions, it is fast and sufficient for file identification.

Figure 1 shows the flowchart of our VirusTotal query process. We started by querying VirusTotal using the file content’s SHA-1 hash. If VirusTotal has previously received and analyzed the file, it returns JSON results that include the last analysis from its AV engines, labeled “prior” analysis in our results. That analysis could have occurred years ago, depending on the file’s age, when it was first submitted, and when it was last analyzed. We saved those “prior” results to characterize the initial results and subsequent analysis. VirusTotal AV detections generally improve over time, as vendors improve algorithms and signatures, and as VirusTotal adds new engines. To establish results across contemporary engines, we requested reanalysis. We also uploaded all files that VirusTotal had not previously received and then queried those detection results. We downloaded “latest” results from 24-December-2019 to 7-January-2020.

Figure 1: VirusTotal Query Flowchart.

VirusTotal provides four core file-related AV request APIs for non-premium users: the most recent scan results of a file, the request to rescan a file, the results of the request to scan a file, and the results from a specific non-public request identifier. The commercial/premium API service also offers users the ability to query the list of non-public request identifiers, necessary to obtain results from arbitrary past requests.

3.3 Threats to Validity

VirusTotal introduces inherent variability of results that challenges reproducibility: the accuracy of any given AV engine scan; the variability of available engines in VirusTotal at any given time; the success of individual engines processing the sample in a VirusTotal-managed processing window; the results from specific engines over time; the opacity, consistency, and provenance of details in reports; and the ability to obtain the most recent results without obtaining a paid premium account. It is not controversial to say that a given AV engine scanning a given file at a given time may report false positive or false negative results. We do not consider that a threat to our experiment’s validity because of the well-understood caveats one may apply to an interpretation of AV results. In this research, the main threat is that data capture is not instantaneous and that the same file could garner different results at the beginning and end of a capture window.

We captured data in a two-week period, December 2019 to January 2020, to minimize the period of time in which a change could have occurred. We provide results for any Windows binary that has at least one AV engine detection of “malicious”, which indicates that the sample is “suspicious.” We also report results using a threshold of seven AV engine detections of “malicious”, based on recommendations and interpretation of the recent Zhu paper. Given the finding that certain engines’ detection results are highly correlated, and the largest cluster consisted of six engines, a threshold of seven ensures that at least two independent engines are indicating “malicious.”

4 RESULTS

In this section we present the VirusTotal detection results for the Windows binaries extracted from our 1,835 GitHub repositories of interest. We built a data set of 24,395 unique binary code files, mining all commits from all 1,835 GitHub repositories of interest. A file was included if its MIME type was “executable.” (One 171 MB file was excluded because we were unable to upload it to VirusTotal.) The first subsection presents the results for the data set as a whole, and the second subsection provides results based on repository characteristics.

4.1 VirusTotal Results

Table 1 shows the results of VirusTotal scans for new and previously uploaded binary files when setting the threshold to at least one malicious detection, indicating that a file is “suspicious.” Of the 24,395 files, 10,413 had been submitted previously, indicated by “Previously scanned by VT”; 1,353 of those had at least one malicious detection at the time of prior analysis in VirusTotal, labeled “# Prior Hits.” When we requested reanalysis for these files, 1,090 files had at least one malicious engine detection, showing that detections decreased overall on rescan. Of the 13,982 files “Previously unseen by VT” that we uploaded for analysis, 3,245 had at least one malicious detection.

Table 1: VirusTotal Detection Results - Suspicious Files, Previously Scanned and Unseen.

Binary Code Files           # Samples   # Prior Hits   # Latest Hits
Previously scanned by VT    10,413      1,353          1,090
Previously unseen by VT     13,982      N/A            3,245
Total                       24,395      1,353          4,335

Table 2: VirusTotal Detection Results - Malicious Files, Previously Scanned and Unseen.

Binary Code Files           # Samples   # Prior Hits   # Latest Hits
Previously scanned by VT    10,413      226            240
Previously unseen by VT     13,982      N/A            200
Total                       24,395      226            440

Table 2 shows the results of VirusTotal scans for new and previously uploaded binary files when at least seven engines provide a malicious detection, our threshold to determine that a file is “malicious.” Setting the detection threshold higher results in far fewer hits, of course: only 440 out of the 24,395
have at least seven AV engines indicating malicious detections. Of the 10,413 files previously scanned by VirusTotal, 226 previously exceeded our malicious detection threshold and 240 are currently deemed malicious in the latest results. Of the 13,982 files previously unseen by VirusTotal, 200 are deemed malicious in the latest results.

Both tables of VirusTotal detection results demonstrate the change in engine detections over time. To highlight these changes in more detail for the suspicious file results (i.e., those with at least one malicious detection), Table 3 shows that some previously benign-seeming files were considered suspicious, and vice versa, in the reports that we requested in the December 2019 to January 2020 timeframe. The overall decrease of 263 files (from 1,353 to 1,090) having at least one malicious detection is the net result of 289 files being detected as malicious that were not previously and 552 files previously being detected as malicious no longer having any AV engine detections.

Table 3: VirusTotal Detection Results - Suspicious Files, Previous Scan and Rescan Results.

Binary Code Files            # Samples   Detected   Not Detected
Previously submitted to VT   10,413      1,353      9,060
Resubmitted to VT            10,413      1,090      9,323

Table 4 shows the relative change in results for the suspicious files. The substantial re-characterization of files as having detections vs. not having detections coincides with a relatively small number of initial positive results, with 1 to 3 AV engines previously indicating malicious. On the other hand, files only later getting malicious detections have a much larger range of 1 to 69 detecting engines.

Table 4: VirusTotal Detection Change in Results - Suspicious Files.

Originally Benign                    Originally Suspicious
# samples                  9,060     # samples               1,353
# that became suspicious   289       # that became benign    552
% that became suspicious   3%        % that became benign    41%
# AV engines               1-69      # AV engines            1-3

Table 5 shows the breakdown of files within different categories of Windows executable binaries. The vast majority of binary code files are targeted to run on modern 32- or 64-bit Windows versions. There are also files targeting DOS and 16-bit Windows in the “Pre-Win32” category, which are ostensibly compatible with Windows. Finally, there are incompatible ELF and boot image files in the “Other” category (presumably misclassified by libmagic). As seen in the second column of Table 5, 4,280 Windows-compatible files were suspicious and 418 were malicious. Except for “Other” files, any standalone executable file poses an immediate risk to a repository user who runs it, while a dynamically linked library (DLL) on modern Windows poses a risk of incorporation into the repository’s build outputs or execution as a system service or code injected into a process on a build host. Table 5 shows that of the 4,280 suspicious files, 1,074 are DLLs and 3,206 are standalone executable files. For the 418 malicious files, 28 are DLLs and 390 are standalone executable files.

Table 6 presents the number of files in weighted bins by the number of engines indicating “malicious.” This shows the range of hits and the large proportion of samples with low hit counts.

The results above for all files represent the aggregate across all commits over the lifetime of the repository. For results at a single point in time, we also analyzed the files that were accessible from the head of the repository. A repository’s head commit (the files accessible after cloning and updates) represents a public view of the repository at the time of cloning and analysis. Across all 1,835 of our repositories of interest, there are 7,772 unique binary files in the head commits on 9-July-2019, of which 939 were suspicious with at least one AV detection in VirusTotal, and 204 were malicious with at least seven AV detections. 5,512 files were already analyzed by VirusTotal, while 2,260 had to be uploaded for analysis.

4.2 Repository-based Results

Of the 1,835 repositories queried, 593 repositories contain binary files. 314 have at least one suspicious binary file, which is a significant subset. 52 repositories have at least one malicious binary with seven or more VirusTotal AV engine detections.

We examined the concentration of suspicious binaries across repositories, presented in Table 7. Of the 314 repositories having suspicious files, a majority, 182 repositories, have one (1) or two (2) suspicious files. Across the population, the mean file count is 7.03 and standard deviation is 20.67. Similarly, Table 8 presents the distribution of malicious file counts across the 52 repositories with malicious binaries and shows that most only have one or two.
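Returning to the rescan changes reported for Tables 3 and 4 in Section 4.1, the net decrease and the rounded percentages can be reproduced from the stated counts. The variable names are ours; the numbers are copied from the text and tables.

```python
# Counts for the 10,413 files that VirusTotal had already scanned.
previously_scanned = 10413
prior_suspicious = 1353      # at least one detection in the prior analysis
became_benign = 552          # prior detections, none on rescan
became_suspicious = 289      # no prior detections, at least one on rescan

prior_benign = previously_scanned - prior_suspicious
latest_suspicious = prior_suspicious - became_benign + became_suspicious

print(prior_benign)                                    # 9060
print(latest_suspicious)                               # 1090
print(prior_suspicious - latest_suspicious)            # 263
print(round(100 * became_suspicious / prior_benign))   # 3
print(round(100 * became_benign / prior_suspicious))   # 41
```

The asymmetry is the notable part: about 41% of files that were originally suspicious lost every detection, while only about 3% of originally benign files gained one.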


Table 5: VirusTotal Detection Results - By File Type.

             All      Win 32/64   DLLs     EXEs     Pre-Win32   Other
Benign       20,060   19,385      12,331   7,054    431         244
Suspicious   4,335    4,280       1,074    3,206    14          41
Malicious    440      418         28       390      2           20
Total        24,395   23,665      13,405   10,260   445         285

Table 6: VirusTotal Hit Counts in Weighted Bins.

Hit Count   1       2     3     4     5     6    7-10   11-20   21-30   31-40   41-50   51-60   61+
# samples   2,491   722   298   161   135   88   100    92      92      71      59      20      6
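As a consistency check, the bins in Table 6 can be summed to recover the suspicious and malicious totals reported elsewhere in the paper. The sketch below copies the counts from the table; the helper name `lower_bound` is ours.

```python
# Hit-count bins and sample counts copied from Table 6.
bins = {"1": 2491, "2": 722, "3": 298, "4": 161, "5": 135, "6": 88,
        "7-10": 100, "11-20": 92, "21-30": 92, "31-40": 71,
        "41-50": 59, "51-60": 20, "61+": 6}

def lower_bound(label):
    """Smallest hit count covered by a bin label such as '7-10' or '61+'."""
    return int(label.rstrip("+").split("-")[0])

suspicious = sum(bins.values())  # every bin has at least one detection
malicious = sum(n for label, n in bins.items() if lower_bound(label) >= 7)

print(suspicious, malicious)                 # 4335 440
print(round(100 * bins["1"] / suspicious))   # 57
```

More than half of the suspicious files were flagged by a single engine, which is why the choice of threshold changes the headline count so sharply.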

Table 7: Suspicious File Count by Repository Count.

# suspicious files   1     2    3    4    5    6-10   11-617
# repos              131   54   20   24   16   22     47

Table 8: Malicious File Count by Repository Count.

# malicious files    1     2    3    4    5    6-10   11-617
# repos              20    10   5    3    2    1      8

Table 9: Top 10 Repositories by Files Having VirusTotal Detection - Suspicious Files.

Repository Name                                                # Detected   # Binaries   Mean Score
papyrussolution/OpenPapyrus                                    617          1,259        1.19
lhmouse/mcfgthread                                             507          1,175        1.45
ffftp/ffftp                                                    305          1,061        1.13
processhacker/processhacker                                    220          282          4.10
arjunae/myScite                                                205          1,762        1.99
RomaniukVadim/hack_scripts                                     198          313          21.45
arizvisa/windows-binary-tools                                  166          924          2.34
tomdaley92/kiwi-8                                              116          186          5.27
Twilight-Dream-Of-Magic/BackDoorProgram-EncryptOrDecryptFile   113          160          2.74
alexfru/SmallerC                                               109          300          20.79

Table 9 shows the top ten repositories by number of suspicious files and the mean score of those detections. The “# Binaries” column of Table 9 provides the number of overall binary files in these repositories for additional context, indicating how prevalent binary files are in each of these repositories and the ratio of suspicious binaries.

To assess the stated purpose of each GitHub repository, we extracted the user-provided repository tags and found 1,802 unique tags across the 1,835 repositories. We classified 70 tags as potentially related to malware or other offensive security topics. Each author identified candidate tags, and those receiving a majority of votes were selected. Our malware-related tags have overlap with the Malware Attribute Enumeration and Characterization (MAEC) structured language for malware information sharing (The Mitre Corporation, 2017), allowing for fuzzy matching and semantic equivalence. It is important to note that MAEC is a prescriptive taxonomy for documenting security incidents and not a computer security natural language topic model. There are other efforts towards defining cyber security ontologies (Syed et al., 2016), which could contribute to a characterization of malware-related purposes. This is an area of future exploration.

Of the 314 repositories that contain at least one suspicious binary, only 50 have at least one malware or offensive security-related tag. This leaves 259 repositories with suspicious/malicious binaries where users might not expect that risk. Of the 52 repositories that contain at least one malicious binary, 25 have at least one malware or offensive security-related tag. The 27 repositories not tagged as being related to malware or offensive security contain 197 malicious binaries, representing risk to unsuspecting repository users.
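The tag-based split described above amounts to a set intersection per repository. The sketch below shows the idea on toy data; the four tags, the repository names, and the function name are hypothetical, and the paper's actual list of 70 tags is not reproduced here.

```python
# Hypothetical subset of malware/offensive-security tags (not the paper's list).
OFFENSIVE_TAGS = {"malware", "keylogger", "exploit", "backdoor"}

def partition_by_tags(repo_tags):
    """Split repository names into (tagged, untagged) by overlap with OFFENSIVE_TAGS."""
    tagged = {name for name, tags in repo_tags.items() if OFFENSIVE_TAGS & set(tags)}
    return tagged, set(repo_tags) - tagged

# Toy data: repositories holding a malicious binary, with their topic tags.
repos = {
    "alice/sample-collection": ["malware", "windows"],
    "bob/text-editor": ["windows", "cpp"],
    "carol/ftp-client": ["windows", "c", "ftp"],
}

tagged, untagged = partition_by_tags(repos)
print(sorted(tagged))    # ['alice/sample-collection']
print(sorted(untagged))  # ['bob/text-editor', 'carol/ftp-client']

# The same subtraction on the reported counts: 52 repositories, 25 of them tagged.
print(52 - 25)           # 27
```

The untagged set is the one of interest here, since its users have no tag-level warning that malicious binaries are present.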


Table 10: Examples of Varying Scan Results over Time.

                            Example 1    Example 2    Example 3    Example 4
Our Scan Requests
  scan date                 12/24/2019   12/24/2019   12/24/2019   12/24/2019
  # engines                 74           73           75           75
  # malicious               0            43           12           2
Previous Scans
  last analysis date        12/10/2015   9/29/2019    11/26/2019   2/1/2017
  # engines                 52           71           71           58
  # malicious               0            43           11           0
Earlier Activity
  last modification date    1/8/2019     9/29/2019    12/4/2019    2/1/2017
  first submission date     9/20/2013    5/12/2016    5/7/2011     1/4/2017
Submitter or Author-Reported Data
  PE file “creation date”   9/11/2013    5/8/2016     5/7/2011     7/28/2014
  “first seen itw date”     9/11/2013    5/8/2016     11/20/2010   12/31/2097
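Table 10 includes a “first seen itw date” in the year 2097, which cannot be genuine. A simple plausibility check, sketched here with the Example 4 values copied from the table, flags reported dates that fall after the scan date. The function names and dictionary keys are ours, and this check is an illustration rather than part of the authors' pipeline.

```python
from datetime import datetime

def parse(date_text):
    """Parse the month/day/year format used in Table 10."""
    return datetime.strptime(date_text, "%m/%d/%Y")

def implausible_dates(reported, scan_date):
    """Return the names of reported dates that are later than the scan date."""
    scan = parse(scan_date)
    return sorted(name for name, value in reported.items() if parse(value) > scan)

# Example 4 from Table 10.
example4 = {
    "pe_creation": "7/28/2014",
    "first_seen_itw": "12/31/2097",
    "first_submission": "1/4/2017",
}

print(implausible_dates(example4, "12/24/2019"))  # ['first_seen_itw']
```

A date in the past passes this check without being trustworthy, so the check can only reject values, never confirm them.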

5 DISCUSSION

5.1 Risks Posed by Unhygienic Repositories

Without even considering the risk of malicious content, binary files in repositories should raise concerns. It is almost always a bad practice to store build outputs in any repository because they increase the repository size, are not amenable to editing or comparisons across versions, and may be accidentally updated when the repository is built, especially Windows PE files, which contain the build timestamp.

Including binaries, such as libraries, as build inputs or runtime dependencies violates the spirit of open source development. It may be unavoidable for a repository owner seeking to baseline specific build inputs while holding a software license that allows redistribution of binaries. In most cases, however, GitHub repository maintainers should provide pre-built software in GitHub release bundles, outside the Git repositories.

The virus research community has adopted safe handling procedures, including packaging malware in encrypted archives (Zeltser, 2020), and sharing samples only after vetting interested researchers. Repositories that violate these rules expose non-malware research environments. Indeed, when we cloned repositories from our Linux environment onto a Windows server, we set off over 100 alerts in our enterprise AV sensors, and that was only in the file system copies from the head branches. Many malicious binaries lie dormant and unscanned while they rest in Git’s custom storage formats, likely unsupported for scanning by AV engines.

Finally, build files such as Makefiles, .vcxproj files, and continuous integration orchestration files are essentially executable scripts, which pose the risk that building a project can compromise a system. Non-malware repository researchers would also benefit from safe handling, such as processing as much as possible on less-targeted OSs and with repositories that are bare or mirrors without local file copies.

5.2 Not All Windows Malware Is in PE Files

Malware comes in many forms. We looked for binary files, but these repositories may have malware in other formats, such as documents and scripts. It is worth noting that in scanning repository head commits, we identified 761 archive files (WinZip, 7-Zip, and RAR), 33 of which are or could be encrypted. Perhaps the 33 represent responsibly encrypted malware samples. There are other forms of malware that we could mine from GitHub repositories beyond Windows binaries, such as Linux malware, mobile malware, malicious scripts, and malicious PDF documents.

5.3 Git-related Observations

In the course of this research we used many interfaces to Git-related data. While not necessarily critical to this immediate work, our experience provides some insights for future researchers. Online APIs such as GitHub REST v3, GitHub GraphQL,
GH Archive (gharchive.com, 2020), and Google BigQuery (Google Cloud, 2020a) are powerful for high-level data, but for compute-intensive file analysis, local execution may be the only option. While Git is the primary source for commit history and files, its data model is optimized for efficiency and extensibility of end-user file-based operations. The researcher is left to develop a new data model to manage the federation of Git and online APIs.

GitHub provides a rich online community and source of data, but does not provide direct temporal control over results comparable to the cryptographically stable Git commit log, which admittedly is coming under attack because of SHA-1’s emerging weaknesses to hash collisions. So, while it may be straightforward to time-box commits up to a certain date, finding the GitHub topic associations at that date requires forethought to query all GitHub information, sifting through events from the beginning of the repository to that point in time (or in reverse from the present time), or queries using third-party services such as GH Archive and Google BigQuery. GitHub’s 5,000 REST requests or GraphQL 5,000 points per hour (GitHub.com, 2020d) and BigQuery’s 1 TB free per month API (Google Cloud, 2020b) quotas require considerable planning and data acquisition design, and therefore we attempted to maximize local analysis with Git. Moreover, a local checkout of Git provides groundtruth for what a developer would see from cloning the repository.

5.4 VirusTotal Observations

As previous research has shown (Zhu et al., 2020), (Pendlebury et al., 2019), (Peng et al., 2019), (Salem et al., 2019), VirusTotal engine data is subtle: results change based on when a query is run, and the non-premium API provides only the most recent results based on the time of the last requested scan, which could have been any arbitrary point in time in the past. It is possible that one or more engines within VirusTotal could provide a false positive detection for a file. VirusTotal’s AV engines change over time and the results from the engines can change based on AV engine implementation and signature updates. While it may be tempting to use VirusTotal as a form of oracle for malware detection, there is no universally accepted threshold for the number of AV engines in VirusTotal that “guarantees” a file is malicious.

There are at least three interesting points in the lifetime of a file analyzed by VirusTotal: (1) initial analysis at the time of first submission to VirusTotal; (2) “prior” analysis relative to the current experimentation time, which will be whenever the file last had an analysis requested; and (3) “latest” analysis, requested at the current experimentation time. Tables 1 and 2 previously presented the change in detection results between points (2) and (3). Table 10 illustrates with four example files that metrics based on these points in time can be inconsistent within a small time window, across larger windows, and fabricated.

For example, before our rescan request (“Our Scan Results” in Table 10), the results for one “benign” binary named “curl.exe” (Example 1) were originally created when scanned on 20-September-2013, updated with scan results from 52 engines on 10-December-2015, and modified on 8-January-2019. Other dates in a report, such as first seen in the wild (the year 2097 in Example 4 in Table 10), and of course, the PE header timestamp have no assurance because they are subject to spoofing by the submitter or binary author (sometimes the same individual! (Zetter, 2014)).

Across all of our rescan requests started on 24-December-2019, we received results from 46 to 76 engines, with a mean of 73.4 engines and standard deviation of 1.21.

The VirusTotal terms of service do not allow sharing full reports that would reveal AV vendor capabilities. Therefore, experiments relying on precise scan details are not reproducible and the data cannot be broadly shared. One researcher could affect an unrelated researcher’s work by requesting a rescan at a non-deterministic time during overall data capture, a significant risk with a public API rate limited to four per minute and with a potential for three requests for a single sample. Indeed, the footprints of our queries are all over the data. It is also possible that the footprints from the authors’ IT department can be observed in the data, as the authors were contacted by them in the course of cloning repositories to explore build experimentation.

It is possible to get all scan history for a sample, by purchasing the premium service, but those results indicate which scans were requested, not whether a given file might have been considered malicious at a particular point in time, if only someone had requested a scan at that time. For example, it is infeasible to perform a post-mortem of an attack by asking, “Could all of the files in an intrusion have been identified as malware on 1-June-2015?” Although VirusTotal adds a very different dimension of data to software repository research, it does not offer the temporal control required in many studies and experiments.
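As a rough illustration of why the public API rate limit shapes data acquisition design, the following sketch estimates a lower bound on the wall-clock time needed to query VirusTotal for the 24,395 PE files in this study. It assumes the limit of four requests per minute and between one and three requests per sample noted in Section 5.4. The function name `min_scan_days` and its parameters are hypothetical and belong to no VirusTotal client; the estimate ignores retries, network latency, and any other quotas.

```python
def min_scan_days(files: int, requests_per_file: int,
                  requests_per_minute: int = 4) -> float:
    """Lower bound, in days, for a rate-limited scan campaign.

    Hypothetical helper for planning only: it divides the total number
    of requests by the per-minute limit and converts minutes to days.
    """
    if files < 0 or requests_per_file < 1 or requests_per_minute < 1:
        raise ValueError("counts must be non-negative and rates positive")
    total_requests = files * requests_per_file
    minutes = total_requests / requests_per_minute
    return minutes / (60 * 24)


if __name__ == "__main__":
    # 24,395 is the number of PE files reported in this study.
    for per_file in (1, 3):
        days = min_scan_days(24_395, per_file)
        print(f"{per_file} request(s) per file: {days:.1f} days")
```

Under these assumptions the campaign needs roughly four days at one request per file and nearly thirteen days at three, long enough that results for early samples may change before the last samples are scanned.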


6 CONCLUSIONS AND FUTURE WORK

Does the malware lurking in GitHub pose a threat? Yes, we found 4,335 suspicious Windows binary files with at least one malicious AV detection in VirusTotal across 314 of 1,835 repositories studied. We found 440 malicious binaries with at least seven AV detections across 52 repositories. Just as some researchers found hidden API keys in repositories (Meli et al., 2019), we found hidden malicious content, not easily queried because of the number of files and repositories, the cost of querying online services, and changing malware scan results. Users and researchers should be careful when downloading open source repositories, because it is difficult to be sure that the content is safe, especially binary content. Repository owners should be vigilant given their role in the open source software supply chain. We have submitted the hashes and repository URLs to GitHub, out of an abundance of due care in exercising responsible disclosure.

This study mined a particular slice of GitHub for malicious Windows binaries; we could obviously expand the population of GitHub repositories, beyond those tagged as Windows and C or C++, and expand the types of malware investigated. The substantial observed swing in VirusTotal results over time motivates more study to identify the controlling variables and ultimately to achieve a better understanding of how to assess confidence in a particular scan.

GitHub is a convenient platform for hosting source code and other user-provided content. GitHub users hosting malware should, at a minimum, apply basic safety measures, such as storing malware in encrypted archives (Zeltser, 2020). More troubling, though, is that the mere presence of binary content in a source code repository suggests a violation of best practices; mining the repository history can provide insights into a project’s overall quality and maturity. The accidental presence of malicious binary content suggests a violation of trust; mining the contributors’ history might provide insights into the kinds of people unwittingly compromised. The intentional and surreptitious insertion of malicious binary content is an attack on trust; mining the entire repository history might help identify future targets and enable attribution of those willfully corrupting the open source software supply chain.

ACKNOWLEDGEMENTS

This work was funded by the Minerva Research Initiative and is sponsored by the Department of the Navy, Office of Naval Research under ONR award number N00014-18-1-2111. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the Office of Naval Research.

REFERENCES

AV-Test (2020). Malware statistics & trends report | av-test. https://www.av-test.org/en/statistics/malware/.
Avast Threat Intelligence Team (2018). Greedy cybercriminals host malware on github. https://blog.avast.com/greedy-cybercriminals-host-malware-on-github.
gharchive.com (2020). Gh archive. https://gharchive.org.
GitHub.com (2020a). Code search - github. https://github.com/search?q=&ref=simplesearch.
GitHub.com (2020b). Github acceptable use policies. https://help.github.com/en/github/site-policy/github-acceptable-use-policies.
GitHub.com (2020c). Github community guidelines. https://help.github.com/en/github/site-policy/github-community-guidelines.
GitHub.com (2020d). Graphql resource limitations | GitHub Developer Guide. https://developer.github.com/v4/guides/resource-limitations/.
Google Cloud (2020a). Bigquery: Cloud data warehouse - google cloud. https://cloud.google.com/bigquery.
Google Cloud (2020b). Estimating storage and query costs | BigQuery | Google Cloud. https://cloud.google.com/bigquery/docs/estimate-costs#estimating_query_costs_using_the_pricing_calculator.
Graziano, M., Canali, D., Bilge, L., Lanzi, A., and Balzarotti, D. (2015). Needles in a haystack: Mining information from public dynamic analysis sandboxes for malware intelligence. In 24th USENIX Security Symposium (USENIX Security 15), pages 1057–1072, Washington, D.C. USENIX Association.
Hurier, M., Suarez-Tangil, G., Dash, S. K., Bissyandé, T. F., Le Traon, Y., Klein, J., and Cavallaro, L. (2017). Euphony: Harmonious unification of cacophonous anti-virus vendor labels for android malware. In 2017 IEEE/ACM 14th International Conference on Mining Software Repositories (MSR), pages 425–435.
Kim, D., Kwon, B. J., Kozák, K., Gates, C., and Dumitraş, T. (2018). The broken shield: Measuring revocation effectiveness in the windows code-signing pki. In Proceedings of the 27th USENIX Conference on Security Symposium, SEC’18, page 851–868, USA. USENIX Association.
Meli, M., McNiece, M. R., and Reaves, B. (2019). How bad can it git? characterizing secret leakage in public github repositories. In NDSS.


Munoz, A. (2020). The octopus scanner malware: Attacking the open source supply chain. https://securitylab.github.com/research/octopus-scanner-malware-open-source-supply-chain.
Pendlebury, F., Pierazzi, F., Jordaney, R., Kinder, J., and Cavallaro, L. (2019). Tesseract: Eliminating experimental bias in malware classification across space and time. In Proceedings of the 28th USENIX Conference on Security Symposium, SEC’19, page 729–746, USA. USENIX Association.
Peng, P., Yang, L., Song, L., and Wang, G. (2019). Opening the blackbox of virustotal: Analyzing online phishing scan engines. In Proceedings of the Internet Measurement Conference, IMC ’19, page 478–485, New York, NY, USA. Association for Computing Machinery.
Rokon, M. O. F., Islam, R., Darki, A., Papalexakis, V. E., and Faloutsos, M. (2020). Sourcefinder: Finding malware source-code from publicly available repositories.
Salem, A., Banescu, S., and Pretschner, A. (2019). Don’t pick the cherry: An evaluation methodology for android malware detection methods. CoRR, abs/1903.10560.
Suciu, O., Mărginean, R., Kaya, Y., Daumé, H., and Dumitraş, T. (2018). When does machine learning fail? generalized transferability for evasion and poisoning attacks. In Proceedings of the 27th USENIX Conference on Security Symposium, SEC’18, page 1299–1316, USA. USENIX Association.
Syed, Z., Padia, A., Finin, T., Mathews, L., and Joshi, A. (2016). Uco: A unified cybersecurity ontology. https://www.aaai.org/ocs/index.php/WS/AAAIW16/paper/view/12574.
The Mitre Corporation (2017). Maec 5.0 specification – vocabularies. http://maecproject.github.io/releases/5.0/MAEC_Vocabularies_Specification.pdf.
VirusTotal (2020a). Getting started. https://developers.virustotal.com/reference.
VirusTotal (2020b). How it works – virustotal. https://support.virustotal.com/hc/en-us/articles/115002126889-How-it-works.
Wang, H., Si, J., Li, H., and Guo, Y. (2019). Rmvdroid: Towards a reliable android malware dataset with app metadata. In Proceedings of the 16th International Conference on Mining Software Repositories, MSR ’19, page 404–408. IEEE Press.
ytisf (2020). Github - ytisf/thezoo: A repository of live malwares for your own joy and pleasure. thezoo is a project created to make the possibility of malware analysis open and available to the public. https://github.com/ytisf/theZoo.
Zeltser, L. (2020). How to share malware samples with other researchers. https://zeltser.com/share-malware-with-researchers/.
Zetter, K. (2014). A google site meant to protect you is helping hackers attack you. https://www.wired.com/2014/09/how-hackers-use-virustotal/.
Zhang, Y., Fan, Y., Hou, S., Ye, Y., Xiao, X., Li, P., Shi, C., Zhao, L., and Xu, S. (2020). Cyber-guided deep neural network for malicious repository detection in github. In 2020 IEEE International Conference on Knowledge Graph (ICKG), pages 458–465.
Zhu, S., Shi, J., Yang, L., Qin, B., Zhang, Z., Song, L., and Wang, G. (2020). Measuring and modeling the label dynamics of online anti-malware engines. In 29th USENIX Security Symposium (USENIX Security 20), pages 2361–2378. USENIX Association.


Table 4: VirusTotal Detection Change in Results - Suspicious Files.

Table 5: VirusTotal Detection Results – By File Type.

Table 6: VirusTotal Hit Counts in Weighted Bins.

Table 7: Suspicious File Count by Repository Count.

Table 8: Malicious File Count by Repository Count.

Table 9: Top 10 Repositories by Files Having VirusTotal Detection - Suspicious Files.

Table 10: Examples of Varying Scan Results over Time.
