PyDriller
This project report presents the design and development of a PyDriller-based automated security
monitoring tool that identifies, classifies, and logs security-relevant commits in Git repositories. By
analyzing commit diffs, looking up MITRE CVE details, and applying OWASP classification, the system
automates the detection and reporting of vulnerabilities in software projects that rely heavily on
open-source components and collaborative development.
Keywords: PyDriller, Git security, automated monitoring, CVE, OWASP, commits, vulnerability
classification
Table of Contents
1 Introduction
1.1 Problem Statement
1.2 Background and Motivation
1.3 Aim
1.4 Scope
1.5 Completeness Criteria
1.6 Time Plan
1.7 Outline
2. Methodology
1.8 Requirements
1.9 Development Method and Workflow
1.10 Tools and Architecture
2 Implementation
2.1 Design
2.2 Testing
3 Results
4.1 DVWA (Damn Vulnerable Web Application)
4.2 cURL Repository
4.3 Dependency-Check Repository
4.4 Additional Highlights and Features
4.5 Summary of Key Observations
4.6 Automated Test Suite
4 Analysis and Discussion
Table of Figures
Terminology
API: Application Programming Interface
1 Introduction
1.1 Problem Statement
Modern software projects rely heavily on open-source and collaborative development,
which also exposes them to the risk of inadvertently introducing security vulnerabilities.
Existing commit analysis tools lack the automation required to identify, classify, and
prioritize security-related changes, and manual review is not feasible in large repositories
because of the volume of commits. Our project aims to bridge this gap by developing an automated
security monitoring tool that fetches commit diffs, classifies vulnerabilities according to the
MITRE CVE and OWASP frameworks, and generates actionable, standardized reports.
minimize security blind spots, and offer a proactive defense in the face of security
vulnerabilities in software development.
1.3 Aim
The primary objective of this project is to design an intelligent security monitoring system
capable of automatically detecting, classifying, and reporting security-related changes in
Git repositories. The system extracts commit diffs, identifies vulnerabilities based on
predefined security keywords, CVE references [2][5], and OWASP classifications, and
generates structured security reports in several formats, including CSV, JSON, and
Markdown. In addition, the tool supports continuous monitoring, allowing near-real-time tracking of
security commits and prompt feedback to developers and security teams.
1.4 Scope
The project scope covers both public and private Git repositories, including those hosted on GitHub.
The system detects commits containing security-related keywords or known CVE
identifiers [2], so that potential vulnerabilities can be classified according to defined security
standards. The extracted security data is formatted into structured reports, making it easier
for teams to analyze and react to security threats. The tool also integrates
an automated re-scanning feature that lets users monitor repositories at regular intervals
indefinitely, providing continuous security analysis and proactive threat detection.
The tool classifies vulnerabilities by severity level (Critical, High, Medium, Low) and maps them
to known CVE/OWASP categories (unmatched references may use placeholders
such as "No CVE found" or "Unknown Category").
Week Tasks
3 Incorporate multi-threading for performance enhancement and multi-
format report generation.
1.7 Outline
This report is organized as follows:
Analysis & Discussion: Examines the performance of the tool and potential areas for
optimization.
References & Appendix: Lists cited references and any supplementary material.
2. Methodology
1.8 Requirements
The project requirements are to automatically extract security-related commits, associate each commit
with a severity level and an OWASP category, retrieve CVE details (when available), generate CSV, JSON,
and Markdown reports, and provide a continuous monitoring option.
GitHub.
2 Implementation
2.1 Design
The architecture consists of distinct modules:
1. The commit extraction module, which uses the PyDriller library to scan Git commits.
5. The report generation module, which generates comprehensive security reports in CSV,
JSON, and Markdown formats.
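The commit extraction step can be sketched roughly as follows. This is a minimal illustration rather than the project's actual code: the keyword list, the function names is_security_relevant and extract_security_commits, and the 80-character preview length are assumptions, while the record fields mirror the JSON output shown in the Results section. It relies on PyDriller's Repository.traverse_commits() [4].

```python
import re
from datetime import datetime
from typing import Iterator, Optional

# Hypothetical keyword list; the tool's real list is not given in this report.
SECURITY_KEYWORDS = ("security", "vulnerability", "exploit", "injection", "overflow")
CVE_PATTERN = re.compile(r"CVE-\d{4}-\d{4,}", re.IGNORECASE)


def is_security_relevant(message: str) -> bool:
    """Return True if the message names a CVE or contains a security keyword."""
    lowered = message.lower()
    return bool(CVE_PATTERN.search(message)) or any(k in lowered for k in SECURITY_KEYWORDS)


def extract_security_commits(repo: str, since: Optional[datetime] = None) -> Iterator[dict]:
    """Yield one record per modified file of each security-relevant commit."""
    # Imported lazily so the helper above can be used without PyDriller installed.
    from pydriller import Repository  # third-party: pip install pydriller

    for commit in Repository(repo, since=since).traverse_commits():
        if not is_security_relevant(commit.msg):
            continue
        for mod in commit.modified_files:
            yield {
                "commit_hash": commit.hash,
                "author": commit.author.name,
                "date": str(commit.author_date),
                "filename": mod.filename,
                "diff_preview": (mod.diff or "")[:80],  # assumed preview length
            }
```

Filtering on the commit message before touching modified_files keeps the scan cheap, since diffs are only read for commits that already look security-related.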
Figure 2 extractor test files
2.2 Testing
The testing process consists of unit testing, which verifies individual functions
(fetch_cve_details, classify_owasp, etc.), as well as integration testing, where full pipeline
tests are executed using provided Python test scripts (test_diff_extractor.py,
test_full_pipeline.py, and test_patch_labeler.py) for real-world repository scenarios. Manual
verification is performed to ensure the accuracy of output reports and classifications.
3 Results
Below is an overview of the security analysis output that our PyDriller-based tool generated
for several sample repositories. These examples show how the tool identifies high-risk
commits, groups vulnerabilities by OWASP Top 10 category, assigns severity
levels, and saves detailed patch files with the code differences for each flagged commit. The
tool also outputs CSV, JSON, and Markdown summary reports and can optionally rescan
continuously at a user-defined interval.
1. High-Level Findings
Injection issues (OWASP A03:2021) dominate the flagged commits, reflecting DVWA's
deliberate vulnerabilities related to SQL/command injection.
Some commits were reported as Unknown Category where no direct mapping to OWASP or
known CVE references was provided, indicating potential malicious or unclassified
changes.
2. Severity Distribution
The patches/ directory contains a diff (*.diff) for each flagged commit, providing
developers with the line-by-line changes that added or modified security-related code.
4. Trend Analysis
Flagged commits are grouped by month, giving the number of security-impacting
commits per month. In DVWA, pronounced peaks appear in 2015-09
(14 commits), 2017-04 (5 commits), and 2021-10 (7 commits), with scattered activity in later
years (2024–2025).
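The monthly grouping described above can be sketched as a simple count per "YYYY-MM" bucket. The function name commits_per_month and the sample dates are illustrative assumptions, not the tool's actual code or data.

```python
from collections import Counter
from datetime import datetime


def commits_per_month(commit_dates):
    """Count flagged commits per 'YYYY-MM' bucket, sorted chronologically."""
    counts = Counter(d.strftime("%Y-%m") for d in commit_dates)
    return dict(sorted(counts.items()))


# Made-up dates for illustration only (not taken from the DVWA scan).
sample = [datetime(2015, 9, 1), datetime(2015, 9, 20), datetime(2017, 4, 3)]
print(commits_per_month(sample))  # {'2015-09': 2, '2017-04': 1}
```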
5. Date-Filtered Example
With the --since 2024-01-01 parameter, only 3 flagged commits since January
2024 were found in DVWA; all of them are classified as Unknown Category and involve a
style sheet (main.css) and new controllers (GenericController.php, HealthController.php,
UserController.php).
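One way the --since option could be parsed is sketched below with argparse. Only the --since flag and its YYYY-MM-DD format come from this report; the positional repo argument, the helper names, and the help texts are assumptions.

```python
import argparse
from datetime import datetime


def parse_date(value: str) -> datetime:
    """Parse a YYYY-MM-DD string, raising an argparse error on bad input."""
    try:
        return datetime.strptime(value, "%Y-%m-%d")
    except ValueError:
        raise argparse.ArgumentTypeError(f"expected YYYY-MM-DD, got {value!r}")


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Scan a repository for security commits")
    # Hypothetical positional argument; the tool's real CLI layout is not documented here.
    parser.add_argument("repo", help="path or URL of the Git repository")
    parser.add_argument("--since", type=parse_date, default=None,
                        help="only consider commits on or after this date")
    return parser


args = build_parser().parse_args(["DVWA", "--since", "2024-01-01"])
print(args.since)  # 2024-01-01 00:00:00
```

The parsed datetime can then be handed to PyDriller's Repository through its since parameter, so that older commits are never traversed.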
The console output for DVWA shows each commit's hash, severity, OWASP
category, and a preview of its diff. The tool also wrote a number of patch files, such as
patches/101597a_main.css.diff, for convenient reference and thorough code examination.
Figure 4 DVWA trend analysis
Our tool also scanned the cURL repository over a broad date range.
Below is a snippet of the JSON output, illustrating how each commit is annotated:
{
  "commit_hash": "c799f608f2a063b25af28042d76a1a841e30a0df",
  "author": "Viktor Szakats",
  "date": "2025-03-13 23:53:40+01:00",
  "severity": "Critical",
  "cve_details": "No CVE found",
  "owasp_category": "Unknown Category",
  "filename": "configure.ac",
  "diff_preview": "@@ -609,6 +609,7 @@ curl_cv_cygwin='no'..."
},
{
  "commit_hash": "8dfc93e573ca740544a2d79ebb0ed786592c65c3",
  "author": "Daniel Stenberg",
  "date": "2022-08-29 00:09:17+02:00",
  "severity": "Low",
  "cve_details": "CVE-2022-35252: No CVE details found.",
  "owasp_category": "Unknown Category",
  "filename": "cookie.c",
  "diff_preview": "@@ -441,6 +441,30 @@ static bool bad_domain(const char *domain)\n..."
}
1. High-Level Findings
The majority of flagged commits are labeled Unknown Category—the commit messages
contain partial or suspicious patterns but lack direct matches to our OWASP list.
Many commits still scored as Critical due to references in the commit message: e.g.
“security fix,” “patch,” or “exploit.”
Over 1,000 patch files were generated under patches/, capturing changes in .c, .h, .am, .ac,
CMakeLists.txt, etc.
3. Trend Analysis
A timeline from 2022 to 2025 reveals large bursts of flagged commits in early 2023–2025.
On average, 3–10 commits are flagged per month, with occasional spikes exceeding 10 in a single
month.
4.3 Dependency-Check Repository
1. High-Level Findings
- Most of the flagged commits are categorized under Unknown Category, indicating
that there are keywords or references that appear suspicious (e.g., "dependency,"
"suppression," "pom.xml" changes).
- Some commits are associated with resource files (e.g., dependencycheck-base-
suppression.xml, .jar versioning, and dbStatements_*.properties), indicating that
high-risk changes are focused on how the tool itself handles scanning rules,
vulnerability suppression, or custom database initialization.
2. Severity Distribution
- The vast majority of flagged commits (over 200) are rated Critical.
- The remaining commits are labeled High or lower (e.g., "Low" for certain
borderline references).
- A minority are mapped to OWASP categories such as "Security Misconfiguration"
(A05:2021).
3. Patch Files & Reports
- The tool created over 900 patch files under patches/ for code changes
in .java, .jar, .gemspec, .json, .md, .groovy, and more.
- A composite CSV, JSON, and Markdown report provides each commit's metadata
(hash, author, date), severity level, any CVE or OWASP classification, filenames
changed, and a short diff preview.
4. Trend Analysis
Figure 8 Dependency-Check analysis
- report.csv – Commit hash, author, date, severity, CVE, and filename reference table
at a glance.
- report.json – JSON format perfect for import into dashboards, further automation, or
specialized analytics.
- report.md – An at-a-glance security overview for security teams and management, easy
to read in standard Markdown format.
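Producing the three report files from one list of commit records could look roughly like the sketch below. The function name write_reports and the exact column set are assumptions; the columns are taken from the fields visible in the JSON snippet earlier in this chapter.

```python
import csv
import json
from pathlib import Path

# Assumed column set, based on the fields shown in the JSON output above.
FIELDS = ["commit_hash", "author", "date", "severity",
          "cve_details", "owasp_category", "filename"]


def write_reports(records, out_dir="."):
    """Write the same records as report.csv, report.json, and report.md."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    # CSV: one row per record; keys outside FIELDS (e.g. diff_preview) are skipped.
    with open(out / "report.csv", "w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=FIELDS, extrasaction="ignore")
        writer.writeheader()
        writer.writerows(records)

    # JSON: the full records, suitable for dashboards or further automation.
    (out / "report.json").write_text(json.dumps(records, indent=2), encoding="utf-8")

    # Markdown: a plain table with a header row and a separator row.
    lines = ["| " + " | ".join(FIELDS) + " |", "|" + " --- |" * len(FIELDS)]
    for rec in records:
        lines.append("| " + " | ".join(str(rec.get(f, "")) for f in FIELDS) + " |")
    (out / "report.md").write_text("\n".join(lines) + "\n", encoding="utf-8")
```

Writing all three formats from the same in-memory records keeps the reports consistent with one another.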
Figure 10 report
Patch Diffs
Each flagged commit's code change is automatically saved to
patches/<commit>_<filename>.diff. This lets teams see at a glance exactly which
lines introduced potential security issues.
Multiple OWASP Mappings
Commits are mapped, where applicable, to Injection (A03:2021), Identification &
Authentication Failures (A07:2021), or Security Misconfiguration (A05:2021), or left as
Unknown Category where no pattern matched exactly.
4.5 Summary of Key Observations
Injection Results: The great majority of flagged commits in DVWA are
injection-based, which is by design for DVWA. Similarly, within Dependency-Check,
many commits concern internal or questionable references to libraries as
well as configuration changes.
Authentication Weaknesses: DVWA consistently triggered A07:2021 because of changes to the
login flow, session management, or password handling.
Unknown Category: Both projects have commits that did not map neatly onto an existing
OWASP category or CVE. This reflects the tool's partial reliance on curated keywords and
leaves room for future improvement.
Figure 12 report
Patch Volume: Not only does the tool enumerate suspicious commits, but it also archives all
associated diffs, making in-depth code review or further manual triage easy.
1. test_classify_severity
Verifies that known vulnerability-related keywords (e.g., “buffer overflow,” “RCE,” “SQL
injection,” etc.) are assigned the correct severity level.
Figure 13 test_classify_severity
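A keyword-based severity classifier of the kind this test exercises might look like the sketch below. The keyword-to-severity table is an assumption made for illustration; the report names the four severity levels and a few example keywords but not the tool's actual mapping.

```python
import re

# Hypothetical keyword-to-severity table, checked from most to least severe.
SEVERITY_KEYWORDS = [
    ("Critical", ("rce", "remote code execution", "buffer overflow", "exploit")),
    ("High", ("sql injection", "xss", "privilege escalation")),
    ("Medium", ("csrf", "information disclosure")),
]


def classify_severity(message: str) -> str:
    """Return the first severity whose keyword appears as a whole word or phrase."""
    lowered = message.lower()
    for severity, keywords in SEVERITY_KEYWORDS:
        for keyword in keywords:
            # Word boundaries stop "rce" from matching inside "resource".
            if re.search(r"\b" + re.escape(keyword) + r"\b", lowered):
                return severity
    return "Low"
```

Matching on word boundaries rather than plain substrings avoids a common source of false positives with short keywords such as "RCE".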
2. test_classify_owasp
Ensures that messages with recognized keywords are mapped to the correct OWASP
category (e.g., “SQL injection” → A03:2021 - Injection).
Figure 14 test_classify_owasp
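The OWASP mapping checked by this test can be pictured as a lookup from keywords to category labels. In the sketch below, the keyword lists are assumptions; the category labels and the "Unknown Category" fallback follow the conventions used in this report.

```python
# Hypothetical keyword lists per OWASP Top 10 (2021) category.
OWASP_KEYWORDS = {
    "A03:2021 - Injection": ("sql injection", "command injection", "xss"),
    "A07:2021 - Identification and Authentication Failures": ("login", "password", "session"),
    "A05:2021 - Security Misconfiguration": ("misconfiguration", "default credentials"),
}


def classify_owasp(message: str) -> str:
    """Return the first category with a matching keyword, else the placeholder."""
    lowered = message.lower()
    for category, keywords in OWASP_KEYWORDS.items():
        if any(keyword in lowered for keyword in keywords):
            return category
    return "Unknown Category"
```

Because the first matching category wins, the order of the dictionary entries determines how a message that mentions several topics is labeled.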
3. test_extract_cve_ids_from_msg
Checks that CVE references (e.g., CVE-2023-1234) are found in commit messages.
Figure 15 test_extract_cve_ids_from_msg
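Finding CVE references in a commit message is essentially a regular-expression search. The sketch below assumes a function named extract_cve_ids_from_msg (inferred from the test name) that returns unique identifiers in order of first appearance; the tool's real signature and return type may differ.

```python
import re

# CVE IDs have the form CVE-<4-digit year>-<4 or more digits>.
CVE_RE = re.compile(r"\bCVE-\d{4}-\d{4,}\b", re.IGNORECASE)


def extract_cve_ids_from_msg(message: str) -> list:
    """Return unique CVE IDs in order of first appearance, upper-cased."""
    seen = []
    for match in CVE_RE.findall(message):
        cve_id = match.upper()
        if cve_id not in seen:
            seen.append(cve_id)
    return seen


print(extract_cve_ids_from_msg("Fix CVE-2023-1234 (also cve-2023-1234)"))  # ['CVE-2023-1234']
```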
4. test_extract_security_diffs_and_store
Figure 16 test_extract_security_diffs_and_store
5. test_label_patches_with_commit_hash
Confirms that patch files are labeled correctly (shortened commit hash, replaced slashes
with underscores) and the file’s diff text is written accurately.
Figure 17 test_full_pipeline
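The patch-labeling rule checked in test 5 (shortened commit hash, slashes replaced with underscores) can be sketched as follows. The helper names and the 7-character hash length are assumptions; the length is inferred from the example file patches/101597a_main.css.diff.

```python
from pathlib import Path


def patch_filename(commit_hash: str, file_path: str, short_len: int = 7) -> str:
    """Build '<short hash>_<path with slashes replaced by underscores>.diff'."""
    short = commit_hash[:short_len]
    safe = file_path.replace("/", "_").replace("\\", "_")
    return f"{short}_{safe}.diff"


def store_patch(commit_hash: str, file_path: str, diff_text: str,
                patch_dir: str = "patches") -> Path:
    """Write the diff text to the patch directory and return the file's path."""
    target = Path(patch_dir) / patch_filename(commit_hash, file_path)
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(diff_text, encoding="utf-8")
    return target
```

Flattening the path into the file name keeps every patch in a single directory while still showing which file it came from.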
6. test_full_pipeline
All six tests pass (see Figure 18), confirming the correctness of classification, patch
labeling, commit scanning, and full-pipeline integration.
Figure 19 passed 6 tests
Analysis of the severity distribution shows that 31 commits were marked as Critical and 20
were marked as High. These findings indicate that the tool can prioritize security
vulnerabilities by risk, allowing security teams to fix the most severe problems first.
Even with the high detection rate for security commits, a number of identified commits
lacked explicit CVE references. This exposes a weakness in relying solely on commit
messages to find known vulnerabilities, because developers do not consistently
reference CVEs explicitly in commit messages [2][5]. To overcome this shortcoming, the
tool would benefit from a more advanced CVE detection feature, including tighter
integration with external vulnerability databases, context-based commit analysis, and more
advanced pattern-matching techniques [2][5].
One of the most important strengths of the tool is the ability to produce patch files, which
allows security teams and developers to examine directly the exact code changes
introducing security vulnerabilities. This feature enhances remediation and security auditing
processes significantly by providing in-depth information regarding changed lines of code.
The tool was able to pull patch files for DVWA, particularly for SQL-dependent PHP files,
authentication scripts, and session management code. However, the effectiveness of patch
generation depends on the commit structure. In a case where a commit contains many
unrelated changes, security-related changes may be hard to isolate. Refining the patch
extraction algorithm to give special consideration to security-critical files and filtering out
irrelevant changes would make the output patches more useable.
Another important feature of the tool is its continuous monitoring mode, which enables
near-real-time security detection. When run with the
--continuous --interval 60 options, the tool rescans the repository for new
security-related commits every 60 seconds [3][4]. This feature is particularly useful for actively
maintained repositories, where constant code updates may introduce new security concerns.
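The continuous mode can be thought of as a polling loop that rescans at a fixed interval and reports only commits it has not seen before. The sketch below is an assumption about how such a loop might be structured: the monitor function, the scan_once callback, and the max_cycles and sleep parameters (added so the loop can be stopped and tested) are all hypothetical.

```python
import time


def monitor(scan_once, interval=60, max_cycles=None, sleep=time.sleep):
    """Call scan_once() every `interval` seconds, reporting only unseen hashes."""
    seen = set()
    cycle = 0
    while max_cycles is None or cycle < max_cycles:
        for commit_hash in scan_once():
            if commit_hash not in seen:
                seen.add(commit_hash)
                print(f"New security-related commit: {commit_hash}")
        cycle += 1
        # Wait before the next scan, but not after the final cycle.
        if max_cycles is None or cycle < max_cycles:
            sleep(interval)
    return seen
```

Here scan_once stands for any callable that returns the hashes of currently flagged commits; with max_cycles left as None the loop runs until interrupted.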
The ability to track security trends over time also provides useful insight into an
organization's overall security stance. For DVWA, the trends revealed some spikes in
security-related commits at particular times, such as 2015, 2017, and 2021. Such trends may
be associated with ongoing security improvements or increased development efforts,
indicating the usefulness of the tool as a security auditing tool for tracking security-related
changes over time [1].
Despite its advantages, the tool has some drawbacks that need to be addressed. One of
the major limitations is the large number of "Unknown
Category" classifications, which points to gaps in the OWASP mapping [1].
Incorporating more complete keyword sets, natural language processing (NLP)
techniques, or machine learning models into the classification algorithm would improve
accuracy. Another challenge is the tool's dependence on commit message quality, because
security-related commits do not always contain explicit descriptions of vulnerabilities. This
can be mitigated by using static code analysis to complement commit
message analysis and make detection more reliable. A further concern is handling large
commits with numerous files, since security-related modifications may be spread across the
codebase. Heuristics that prioritize files or functions handling
security components such as authentication, database queries, and session management
could make the tool more accurate [1][6].
To continue to enhance the tool, several improvements can be introduced. First, enhancing
OWASP mapping by introducing contextual phrase matching and keyword set enrichment
would reduce "Unknown Category" classifications [1]. Second, enhancing CVE detection
by integrating with external vulnerability databases, applying NLP for semantic commit
message analysis, and parsing commit diffs for CVE references would significantly enhance
detection accuracy [2][5]. Third, integrating the tool with security dashboards and SIEM
tools (e.g., Splunk, ELK Stack) would give more visibility and observability into security
trends. Additionally, machine learning-based classification models trained on
security-oriented commits could improve detection accuracy beyond simple
keyword-based detection. Another valuable addition would be to expand real-time alerting
capabilities to include notification via Slack, email, or webhooks, allowing security teams to
respond to security incidents proactively.
References
[1] OWASP Foundation, “OWASP Top 10 – 2021”, 2021. [Online]. Available: https://owasp.org/Top10/ [Accessed: 10
March 2025].
[2] National Vulnerability Database (NVD), “About CVE”, [Online]. Available: https://www.cve.org/about/overview
[Accessed: 10 March 2025].
[3] Accessible AI, “Extracting Git Data with PyDriller”, [Online]. Available: https://accessibleai.dev/post/extracting-git-
data-pydriller/ [Accessed: 10 March 2025].
[4] PyDriller Documentation, “PyDriller - A Python framework to analyze Git repositories”, [Online]. Available:
https://github.com/ishepard/pydriller [Accessed: 10 March 2025].
[5] National Institute of Standards and Technology (NIST), “National Vulnerability Database”, [Online]. Available:
https://nvd.nist.gov/ [Accessed: 10 March 2025].
[6] OWASP Foundation, “OWASP Application Security Verification Standard (ASVS)”, [Online]. Available:
https://owasp.org/www-project-application-security-verification-standard/ [Accessed: 10 March 2025].