Seminar Final

Uploaded by

Nima Dorji

ELE101 Cyber Growth Conversation

CASE STUDY

Source Code Vulnerability

Section A

Enrolment No:
12220017
12220047
12220059
Full Stack

Submission Date: 10th May, 2024

What is a seminar?

A seminar is a hands-on learning experience where attendees talk, exchange expertise, and delve
into particular subjects. Seminars are essential for sharing knowledge, encouraging teamwork, and
raising awareness of security dangers and best practices in the context of cybersecurity.

Seminar Types:

Webinars: These are virtual seminars held online. Experts present on a range of
cybersecurity subjects while participants attend remotely.

Seminars for Academics: Academic seminars, which take place in educational settings, provide a
forum for academics, students, and faculty members to exchange ideas about cybersecurity theory,
emerging trends, and research findings.

Seminars on Professional Development: Industry professionals, practitioners, and executives are
the target audience for these seminars. They emphasize case studies, real-world applications, and
practical skills.

Seminar Evolution: In the past, most seminars were held in physical spaces such as lecture
halls or conference rooms. Technological advances have since made globally accessible virtual
seminars, or webinars, possible. The move toward online formats makes it easier for experts in
different places to communicate and cooperate on projects.

Mini seminars are targeted, condensed events that address particular facets of cybersecurity. They
may last a few hours or a single day.

Mini seminars can cover a wide range of subjects, such as:

Phishing Attacks: Recognizing and combating deceptive communications that try to get private
information (credit card numbers, login credentials, etc.).

Removable Media Security: Handling the hazards posed by USB drives, SD cards, and other
portable storage devices.

Password-Based Authentication: Examining the best methods for generating secure, unique
passwords and putting two-factor authentication into place.

Mobile Device Security: Protecting private information on mobile devices such as wearables,
tablets, and smartphones.

Blockchain Security: Protecting cryptocurrency and decentralized networks.

Advantages of seminars

Knowledge Sharing: Participants get insights into the most recent dangers and solutions, hear from
experts, and discuss ideas.

Networking: Attending seminars offers chances to meet like-minded people, business executives,
and possible partners.

Skill Enhancement: Participants' skills are improved through hands-on workshops and
case studies.

Awareness: Seminars promote preventative actions and increase public knowledge of
cybersecurity threats.

A Guide to Effective Seminar Planning:

Select a Relevant Topic: Make sure your topic fits the requirements and interests of the
audience.

Invite Expert Speakers: Speakers with expertise can offer insightful commentary.

Publicize the Event: Reach possible participants by using email, social media, and other platforms.

Make Workshops Engaging: Include discussions, Q&A sessions, and practical exercises.

Collect Feedback: Gather input from participants and evaluate it to make future seminars better.

(Research Paper 1)

Topic: Vulnerability discovery based on source code patch commit mining: a systematic literature
review

https://www.researchgate.net/publication/377208399_Vulnerability_discovery_based_on_source_code_patch_commit_mining_a_systematic_literature_review

Introduction:

In software development, patch commits represent critical changes to codebases, particularly in
addressing security vulnerabilities. With the rising number of vulnerabilities in open-source
software, the prompt identification and remediation of these issues are paramount. However, not
all vulnerabilities are publicly disclosed, leading to potential risks. Automated patch commit
mining offers a solution to efficiently identify silently fixed vulnerabilities. Leveraging
advancements in machine learning and deep learning, researchers are exploring data-driven
approaches to enhance vulnerability discovery efforts. Despite progress, significant challenges
persist, and a comprehensive survey of recent advances is needed to guide future research
directions in patch commit analysis for vulnerability discovery.

Commit:

The paper outlines the structure of a typical commit, which includes a commit message and a source code
difference (diff) detailing code changes between versions. Patch commits serve various purposes,
including non-security improvements such as performance enhancements and feature additions, as
well as security patches aimed at fixing vulnerabilities. Examples are provided to illustrate both
types of patches, highlighting their respective components within a commit. The structure of a
commit message is detailed, comprising a subject line summarizing the changes, an optional body
providing further details, and a footer for referencing related issues. Additionally, it explains the
use of prefixes in multi-patch commits and the formatting conventions within the commit message
and diff.

Overall, the passage emphasizes the role of patch commits in software maintenance and
development, offering insights into their composition and significance within the Git repository
management system.
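
To make the commit message structure concrete, the sketch below splits a message into the subject line, optional body, and footer described above. It is only an illustration under stated assumptions: the subject is taken to be the first line, and footer lines are assumed to look like `Key: value` trailers (for example `Fixes: #123`). The function name `split_commit_message` and the sample message are hypothetical and do not come from the paper.

```python
import re

# Assumption: footer lines are "Key: value" trailers such as "Fixes: #123".
TRAILER = re.compile(r"^[A-Za-z][A-Za-z-]*: \S")


def split_commit_message(message):
    """Return (subject, body, footer) for a Git-style commit message."""
    lines = message.strip().splitlines()
    subject = lines[0].strip() if lines else ""
    rest = lines[1:]
    # Peel trailer lines off the end of the message to form the footer.
    footer = []
    while rest and TRAILER.match(rest[-1]):
        footer.insert(0, rest.pop())
    body = "\n".join(rest).strip()
    return subject, body, "\n".join(footer)


# Hypothetical commit message used only for illustration.
msg = """Fix out-of-bounds read in header parser

Validate the length field before copying into the buffer.

Fixes: #123
Signed-off-by: A Developer <dev@example.com>
"""
subject, body, footer = split_commit_message(msg)
```

A body that happens to end with a line such as `Note: ...` would be treated as a footer by this simple rule, so a real tool would need a stricter definition of what counts as a trailer.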

Fig1: An example of non-security patch commit

Fig2: A security patch commit for CVE-2022-0699

Datasets:

The article states that collecting datasets for code vulnerability research involves several
steps to ensure comprehensiveness and accuracy. Initially, researchers faced challenges due to the
scarcity and limited accessibility of well-labeled vulnerability datasets. However, recent efforts
have aimed to overcome these obstacles by making resources publicly available and developing
methods for dataset construction. One common method involves crawling the reference URLs
listed in Common Vulnerabilities and Exposures (CVE) entries and keeping those that point to
security-related commits. Patch URLs are generated by appending ".patch" to these addresses,
allowing for the retrieval of patch files. Additionally, non-security patches can be
collected from platforms like GitHub.
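
The URL step can be sketched in a few lines. The example assumes that the security-related references are GitHub commit URLs, which GitHub also serves as patch files when ".patch" is appended. The pattern `COMMIT_URL`, the function `patch_urls`, and the sample references are hypothetical; the paper's own filtering rules may differ.

```python
import re

# Assumption: security-related references are GitHub commit URLs of this form.
COMMIT_URL = re.compile(r"^https://github\.com/[^/]+/[^/]+/commit/[0-9a-f]{7,40}$")


def patch_urls(reference_urls):
    """Keep commit references and append '.patch' to each of them."""
    return [url + ".patch" for url in reference_urls if COMMIT_URL.match(url)]


# Hypothetical CVE reference list: one commit link and one advisory page.
refs = [
    "https://github.com/example/project/commit/0123456789abcdef0123456789abcdef01234567",
    "https://example.com/advisory/42",
]
urls = patch_urls(refs)
```

Only the commit link survives the filter; the advisory page is dropped because it does not point to a commit.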

To enhance the dataset, techniques such as nearest link search are employed to identify candidate
patches from code repositories based on similarities with verified security patches. Human
verification is then conducted to label certain candidates as security patches, enriching the dataset.
Furthermore, synthetic data generation methods like oversampling are utilized to augment the
dataset's size and diversity. Overall, this approach helps make datasets for code vulnerability
more comprehensive and more accurately labeled, providing valuable resources for training robust
machine learning models and advancing research in computer security.
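
The idea behind nearest link search can be illustrated with a simplified greedy matcher: each verified security patch is linked to the closest candidate that has not yet been used, and the linked candidates are then passed to human verification. This is a sketch under assumptions, not the published algorithm. The two-dimensional feature vectors, the Euclidean distance, and the names `nearest_link_search`, `verified`, and `unlabeled` are all hypothetical.

```python
import math


def nearest_link_search(verified, unlabeled):
    """Greedily link each verified security patch to its closest unused candidate.

    verified, unlabeled: dicts mapping a patch id to a numeric feature vector.
    Returns a dict {verified_id: candidate_id}.
    """
    # All (distance, verified id, candidate id) triples, closest first.
    pairs = sorted(
        (math.dist(v_vec, u_vec), v_id, u_id)
        for v_id, v_vec in verified.items()
        for u_id, u_vec in unlabeled.items()
    )
    links, used = {}, set()
    for _, v_id, u_id in pairs:
        if v_id not in links and u_id not in used:
            links[v_id] = u_id
            used.add(u_id)
    return links


# Hypothetical feature vectors, e.g. (lines added, lines removed).
verified = {"sec-1": [3.0, 1.0], "sec-2": [10.0, 4.0]}
unlabeled = {"c-1": [3.0, 2.0], "c-2": [9.0, 4.0], "c-3": [50.0, 0.0]}
candidates = nearest_link_search(verified, unlabeled)
```

Here `c-1` and `c-2` become candidates for manual review, while the distant `c-3` is left unlabeled.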

Empirical analysis on security patches:

Empirical investigations into security patches, such as [53, 54], have provided
valuable insights into the dynamics of vulnerability mitigation within software projects. Unlike
some other research endeavors, these studies primarily focus on describing observed phenomena
and offering quantitative analyses rather than solely discovering vulnerabilities. One notable study
examined over 4000 security patches across more than 3000 vulnerabilities in 682 open-source
software projects, revealing that patches generally prove effective in addressing identified
vulnerabilities. However, a recurring issue emerged concerning the timely installation of these
patches. Additionally, the research highlighted disparities in patch prioritization, with
vulnerabilities related to web-oriented issues being addressed more promptly than others. Another
investigation involving 3663 vulnerabilities across 1096 OSS projects underscored the persistence
of vulnerabilities within source code, with half remaining unresolved for over a year. These
findings collectively emphasize the importance of not only developing effective patches but also
ensuring their timely deployment to mitigate potential risks effectively.

Research on bug report:

The paper highlights the close relationship between bug reports and patch commits in modern
software maintenance practices, emphasizing the one-to-many relationship in which multiple
commits may be associated with a single bug report. It also suggests that bug reports and
patch commits can complement each other in uncovering silent vulnerabilities within software
systems. Several papers, such as references [55, 56], attempt to link patch commits with bug
reports. Because research in this interdisciplinary field is extensive, the paper only briefly
outlines potential research questions in bug report studies and reviews recent progress, focusing
on the most relevant works so that interested readers can explore them in depth. In this way,
despite the vast body of research, readers are directed towards the most pertinent and up-to-date
studies in this area.

Methods:

Feature Engineering: Conventional machine learning models often rely on feature engineering to
represent problems effectively. The study outlines various practical features proposed in prior
works, categorizing them into different groups based on their relevance to different research
questions.

Security Patch Localization: Methods such as PatchScout and VCMatch utilize statistical features
between commits and vulnerabilities for security patch localization. These features encompass
aspects such as lines of code, identity, location, and token-based features. Additionally, VCMatch
incorporates semantic features generated by BERT and employs classification models like
XGBoost, LightGBM, and CNN for prediction.

Vulnerable Commit Identification: VCCFinder employs statistical differences between
vulnerability-related and other commits to detect potential vulnerabilities introduced by incoming
commits. However, challenges in replicating and improving upon the original results have been
encountered.

Vulnerability Type Classification: Some research focuses on classifying security patches based on
the type of vulnerability they address, aiding in prioritizing security patch deployment.

Natural Language Processing (NLP) Methods: NLP techniques are employed for tasks such as
security patch detection. Conventional methods treat log messages and patch code as bags of
words, while more recent approaches utilize deep learning models like LSTM to improve accuracy.

These methods collectively contribute to the research's findings and insights into patch commit
mining for vulnerability discovery, highlighting both successes and remaining challenges in the
field.
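
As a rough illustration of the feature engineering and bag-of-words ideas above, the sketch below derives a few statistical features from a unified diff and word counts from a commit message. The chosen features, the function names `diff_features` and `bag_of_words`, and the sample diff are assumptions made for this example; tools such as PatchScout, VCMatch, and VCCFinder use their own, richer feature sets.

```python
import re
from collections import Counter


def diff_features(diff_text):
    """Count added lines, removed lines, hunks, and touched files in a unified diff."""
    added = removed = hunks = files = 0
    for line in diff_text.splitlines():
        if line.startswith("+++ "):
            files += 1          # one "+++" header per modified file
        elif line.startswith("--- "):
            continue            # old-file header, not a removed line
        elif line.startswith("@@"):
            hunks += 1
        elif line.startswith("+"):
            added += 1
        elif line.startswith("-"):
            removed += 1
    return {"added": added, "removed": removed, "hunks": hunks, "files": files}


def bag_of_words(message):
    """Lower-cased word counts for a commit message, ignoring word order."""
    return Counter(re.findall(r"[a-z_]+", message.lower()))


# Hypothetical patch that adds a bounds check before a copy.
diff = """--- a/parser.c
+++ b/parser.c
@@ -10,3 +10,5 @@
 int n = read_len(buf);
-memcpy(dst, buf, n);
+if (n > MAX_LEN)
+    return -1;
+memcpy(dst, buf, n);
"""
features = diff_features(diff)
words = bag_of_words("Fix buffer overflow in parser")
```

Numeric features like these could then be fed to a conventional classifier, which is the general pattern the surveyed methods follow.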

Results:

In the research, a method called VCCFinder was employed to automatically detect whether
incoming commits would introduce vulnerabilities. This approach involved extracting various
statistical differences between commits associated with vulnerabilities and those unrelated to
security issues. Metrics such as lines of code modified, complexity of changes, and frequency of
alterations in vulnerable areas were utilized to distinguish between vulnerability-contributing
commits and others. The results indicated promising potential for VCCFinder in identifying
commits likely to introduce security risks. However, challenges were encountered in replicating
and refining the original findings, underscoring the need for further validation and improvement
of the approach. Nonetheless, these insights highlight the efficacy of data-driven methods in
preemptively identifying vulnerabilities during software development, thereby facilitating
proactive risk mitigation strategies.

Conclusions:

In software development and maintenance, fixing problems with patches is common, especially
for security issues that attackers could exploit. This makes finding these patches important, and
researchers are interested in ways to spot security patches. Until now, however, there has been
no complete review of what has been done and what still needs work in this area. The article
therefore surveys the existing research and shares the authors' views on what is going well and
what challenges remain. The authors also share some basic experience to make the field easier
for newcomers to understand. Their hope is that pointing out the problems and opportunities will
encourage more people to join in and help make software safer.

(Research Paper 2)

Title: "Advancing Software Security: Exploring Source Code Vulnerability Research"

https://www.researchgate.net/publication/308960782_Exploring_Software_Security_Approaches_in_Software_Development_Lifecycle_A_Systematic_Mapping_Study

Abstract:

Source code vulnerability research is essential for enhancing software security and mitigating
cyber threats. This abstract provides a concise overview of the key aspects of source code
vulnerability research, including identification, classification, mitigation, and understanding of
vulnerabilities in software code. Methodologies employed in such research encompass empirical
studies, theoretical investigations, and the application of machine learning techniques. Results
often include the development of new detection methods, analysis of vulnerability patterns, and
evaluation of existing security tools and practices. Discussions explore implications for software
development practices, limitations of the research, and areas for further investigation. Future work
may involve exploring novel detection and mitigation techniques, addressing emerging threats,
and enhancing the resilience of software systems against cyber-attacks. Overall, source code
vulnerability research plays a critical role in advancing software security and protecting against
evolving cyber threats.

Introduction:

Source code vulnerability research covers various aspects of computer science and
cybersecurity. The field encompasses the identification, classification, exploitation, and
comprehension of vulnerabilities in software source code. The introduction to the paper covers
the most critical areas of research within this field and discusses the importance of secure
coding practices and efficient vulnerability management strategies.

Methodology:

Depending on the focus of the research, methods of studying code vary. One common method is
the empirical study, where real-world data or experiments are used to analyze vulnerabilities.
Alternatively, theoretical research may involve developing algorithms, models, or frameworks for
vulnerability detection and prevention. Methodologies can also involve machine learning
techniques for automated vulnerability identification, trained on databases of known
vulnerabilities. Another popular methodology is surveying developers' practices and attitudes
toward security.

Results and Discussion:

Studying and assessing vulnerabilities in source code can yield many kinds of findings. For
example, new detection methods may emerge, researchers may come across vulnerability patterns
or trends, existing security systems may turn out to be outdated, and the effects that
vulnerabilities have on software systems and users may become clearer. Discussions should focus
on the implications of the results for software development, the efficacy of potential remedies,
the limitations of the research, and further work to be done. For example, discussions may revolve
around the trade-offs between various vulnerability detection techniques, the difficulties involved
in embedding security within the entire software development lifecycle, and the need for
researchers to build better relationships with people outside their immediate field.

Future Work:

Future directions in source code vulnerability research may include exploring novel techniques for
vulnerability detection and mitigation, addressing emerging threats and attack vectors, improving
the usability and scalability of security tools, and enhancing the resilience of software systems
against cyber attacks. Researchers may also investigate the socio-technical aspects of software
security, such as the role of human factors in vulnerability management, the impact of
organizational culture on secure coding practices, and the effectiveness of security awareness
training programs. Additionally, there is a need for interdisciplinary research that bridges the gap
between technical solutions and broader societal issues related to cybersecurity and privacy.

Conclusion:

In conclusion, source code vulnerability research plays a crucial role in advancing our
understanding of software security and developing effective strategies for protecting against cyber
threats. By leveraging diverse methodologies and interdisciplinary approaches, researchers can
contribute to the development of more secure and resilient software systems. However, ongoing
efforts are needed to address the evolving nature of cyber threats and ensure that software
development practices keep pace with emerging challenges.

(Research Paper 3)

Title: Vulnerability Prediction from Source Code - A Machine Learning Approach Based on
Abstract Syntax Tree Representation

https://ieeexplore.ieee.org/abstract/document/9167194

Abstract: Software vulnerabilities pose significant threats to individuals and organizations alike,
demanding efficient and automated solutions for early detection. While traditional methods like
static and dynamic analysis have limitations, recent advancements in machine learning offer
promising avenues for intelligent vulnerability prediction. This paper presents a novel approach
based on representing source code as numerical vectors derived from its Abstract Syntax Tree
(AST). This representation preserves the structural and semantic relationships within the code,
enabling effective machine learning analysis. Our experiments on a real-world dataset demonstrate
the effectiveness of this method in predicting various vulnerability types, outperforming existing
approaches like Code2vec. We further discuss potential extensions of this work, including
vulnerability localization and explanation, and explore the application of transfer learning for
cross-language vulnerability prediction.

Introduction:

Software vulnerabilities represent critical weaknesses that can be exploited for malicious purposes,
leading to data breaches, financial losses, and reputational damage. The increasing complexity of
software systems necessitates automated and intelligent solutions for vulnerability detection.
While traditional methods like static and dynamic analysis have been widely used, they often suffer
from limitations. Static analysis struggles to capture runtime behavior and may generate false
positives, while dynamic analysis can only explore limited execution paths.

Machine learning (ML) presents a promising avenue for tackling these challenges. By learning
from existing vulnerabilities and code patterns, ML models can automatically identify potentially
vulnerable code segments with greater accuracy and efficiency. Recent research has explored
various ML-based approaches, including those leveraging software metrics [2], [23], text mining
techniques [25], and AST-based analysis [12]–[14], [20]. However, there remains a need for more
robust and comprehensive methods that effectively capture the complex semantics of source code.

This paper proposes a novel approach for vulnerability prediction based on representing source
code as numerical vectors derived from its Abstract Syntax Tree (AST). This representation
preserves the structural and semantic relationships within the code, providing valuable information
for ML models to learn from. Our contributions are as follows:

AST-based vulnerability prediction: We present a comprehensive methodology for vulnerability
prediction using AST representation and demonstrate its effectiveness on a real-world dataset.

Source code representation: We propose a novel method for converting ASTs into numerical
vectors while retaining essential structural and semantic information.

Performance evaluation: We conduct extensive experiments and comparative analysis with
existing state-of-the-art methods, demonstrating the superiority of our approach.

Future directions: We discuss potential extensions of this work, including vulnerability
localization and explanation, and explore the application of transfer learning for cross-language
vulnerability prediction.

Dataset:

Our experiments utilize the Draper VDISC Dataset [16], which contains a large collection of
function-level C source code fragments extracted from open-source projects. The dataset includes
various vulnerability types categorized according to the Common Weakness Enumeration (CWE)
system, as shown in Table 2 of the original document. We maintain the original
train-validation-test split and acknowledge the inherent imbalance in the dataset, reflecting the real-world
distribution of vulnerabilities.

Source Code Representation:

The core of our approach lies in representing source code as numerical vectors. We achieve this
by converting the AST of each function into a complete binary tree and then encoding the AST
nodes into numerical tuples. This process is detailed in Section IV of the reference document and
involves the following steps:

Function-level Partitioning: The source code is divided into individual functions for granular
analysis.

Tokenization: The function's code is tokenized into a sequence of meaningful units.

AST Generation: The AST of the function is generated using a parser.

Conversion to Complete Binary AST: The AST is converted into a complete binary tree for
consistent representation.

Encoding to Numerical Tuples: AST nodes are encoded into numerical tuples, capturing token
type and auxiliary information.

Array Representation: The encoded tuples are arranged into a one-dimensional array, preserving
the structural relationships of the AST.
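
The steps above can be sketched with Python's built-in `ast` module standing in for a C parser. Several parts are assumptions made for illustration and are not confirmed by the paper: the left-child/right-sibling conversion to a binary tree, the `(type_id, n_children)` tuple encoding, and the names `encode` and `ast_to_array`. A complete binary tree of depth d has 2^(d+1) - 1 slots, so depth 8 gives 511 slots. The 1533-element vector the report cites for depth 8 would be consistent with three numbers per slot, though that is an inference rather than something stated in this report.

```python
import ast

NULL = (0, 0)  # padding tuple for slots with no AST node


def encode(node, vocab):
    """Map an AST node to (type_id, n_children); type ids start at 1."""
    type_id = vocab.setdefault(type(node).__name__, len(vocab) + 1)
    return (type_id, len(list(ast.iter_child_nodes(node))))


def ast_to_array(source, depth, vocab):
    """Flatten the AST of `source` into a fixed-size one-dimensional list."""
    size = 2 ** (depth + 1) - 1          # slots in a complete binary tree
    slots = [NULL] * size

    def place(node, siblings, i):
        if i >= size:                    # deeper nodes are truncated
            return
        slots[i] = encode(node, vocab)
        children = list(ast.iter_child_nodes(node))
        if children:                     # left slot: first child
            place(children[0], children[1:], 2 * i + 1)
        if siblings:                     # right slot: next sibling
            place(siblings[0], siblings[1:], 2 * i + 2)

    place(ast.parse(source), [], 0)
    # Array layout: slot i has children at 2i+1 and 2i+2.
    return [value for slot in slots for value in slot]


vocab = {}
vector = ast_to_array("def f(x):\n    return x + 1\n", depth=3, vocab=vocab)
```

Because every function maps to a vector of the same length, the result can be fed directly to a model with a fixed input size. Most slots stay at the padding value, which matches the sparsity noted in the dimensionality reduction discussion.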

Vulnerability Prediction Model:

We formulate vulnerability prediction as a multi-label classification task. Two ML algorithms are
employed: Multi-Layer Perceptron (MLP) with Scikit-learn and Convolutional Neural Network
(CNN) with TensorFlow. Both models take the numerical vector representation of the source code
as input and predict the presence or absence of each vulnerability type.
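
In a multi-label setting, each function gets one 0/1 target per vulnerability type rather than a single class. The sketch below shows such a target encoding, using only the CWE categories named in this report; the dataset itself may define more. The names `CATEGORIES` and `multi_hot` are hypothetical.

```python
# Assumption: only the categories named in this report are listed here.
CATEGORIES = ["CWE-119", "CWE-469", "CWE-476", "CWE-other"]


def multi_hot(labels):
    """Encode a set of CWE labels as a 0/1 vector, one position per category."""
    unknown = set(labels) - set(CATEGORIES)
    if unknown:
        raise ValueError(f"unknown categories: {sorted(unknown)}")
    return [1 if category in labels else 0 for category in CATEGORIES]


# Three hypothetical functions: one weakness, two weaknesses, none.
targets = [
    multi_hot({"CWE-119"}),
    multi_hot({"CWE-119", "CWE-476"}),
    multi_hot(set()),
]
```

A function can therefore carry several vulnerability types at once, or none at all, which a single-label classifier could not express.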

Evaluation Metrics:

Given the imbalanced nature of the dataset, we use Precision-Recall (P-R) curves and F1 scores to
evaluate the models' performance. Additionally, for balanced subsets, we utilize Receiver
Operating Characteristic (ROC) curves and Area Under the Curve (AUC) for comparison with
existing literature.
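
A small worked example shows why precision, recall, and F1 are preferred over plain accuracy on imbalanced data. The label counts below are made up for illustration, and the function name `precision_recall_f1` is hypothetical.

```python
def precision_recall_f1(y_true, y_pred):
    """Compute precision, recall, and F1 for binary 0/1 labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


# Made-up data: 2 vulnerable functions out of 100.
y_true = [1, 1] + [0] * 98
always_negative = [0] * 100   # a model that never flags anything

accuracy = sum(t == p for t, p in zip(y_true, always_negative)) / len(y_true)
scores = precision_recall_f1(y_true, always_negative)
```

The model that never reports a vulnerability reaches 98% accuracy but scores zero on precision, recall, and F1, so the latter metrics expose a failure that accuracy hides.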

Results and Discussion

 Impact of AST Depth

We investigate the impact of using partial AST representations by varying the depth of the
complete binary tree. Experiments on the CWE-119 vulnerability type show that an AST depth of
8 achieves the best performance, balancing information richness with computational efficiency.
Deeper ASTs lead to the curse of dimensionality, while shallower ones lack sufficient information
for accurate prediction.

 Dimensionality Reduction

Due to the abundance of NULL nodes in the complete binary AST, many elements in the numerical
vector representation are zeros. Principal Component Analysis (PCA) reveals that a significant
portion of these features have negligible impact on the model. By reducing the dimensionality
from 1533 to 250 (for depth 8), we achieve a 68% reduction in training time without compromising
performance.

 Performance on Imbalanced Dataset

Our MLP model demonstrates promising results on the imbalanced dataset. Figure 9 from the
original document showcases the P-R curves for different vulnerability categories. CWE-476
achieves the best performance with an F1 score of 0.598, while CWE-469 poses the greatest
challenge due to its high imbalance ratio. The micro-average P-R curve yields an AUC of 0.377,
exceeding the baseline performance.

 Performance on Balanced Dataset

To facilitate comparison with existing approaches, we create balanced subsets for each
vulnerability category. Using a CNN model, we achieve AUC values ranging from 0.755
(CWE-other) to 0.882 (CWE-476), as depicted in Figure 10 of the original document. This indicates the
effectiveness of our method in detecting various vulnerability types under controlled conditions.

 Comparison with Code2vec

We compare our approach with the state-of-the-art Code2vec method [20]. As shown in Table 5
of the reference document, our method significantly outperforms Code2vec in terms of F1 score
across all vulnerability categories. This can be attributed to the preservation of structural and
semantic information in our AST-based representation, allowing the ML models to better capture
patterns indicative of vulnerabilities. Additionally, our method boasts faster training times
compared to Code2vec.

Future Work:

This research opens several avenues for future exploration:

Vulnerability Localization: We aim to develop techniques to pinpoint the specific location of
vulnerabilities within identified code fragments, improving the actionability of our predictions.

Vulnerability Explanation: We seek to incorporate explainable AI methods to provide insights into
why a particular code segment is classified as vulnerable, enhancing the transparency and
trustworthiness of our model.

Transfer Learning: We will investigate the application of transfer learning to adapt our model to
different programming languages, extending its applicability to a wider range of software systems.

Additional Features: We plan to explore incorporating additional features, such as control flow
graphs and data flow graphs, to further enrich the source code representation and potentially
improve prediction accuracy.

Conclusion:

This paper presents a novel and effective approach for vulnerability prediction from source code
using AST-based representation and machine learning. Our method demonstrates superior
performance compared to existing approaches, offering a promising solution for automated
vulnerability detection. Future work will focus on enhancing the interpretability and applicability
of our approach, contributing to the development of more secure and reliable software systems.

Single Report on All the Research Done (based on three research papers)

Title: Advancing Software Security: Exploring Source Code Vulnerability Research

Abstract:

Source code vulnerability research is crucial for enhancing software security and mitigating cyber
threats. This report synthesizes findings from the three studies summarized above. The first study
discussed here (Research Paper 3) proposes a novel approach based on Abstract Syntax Tree (AST)
representation for vulnerability prediction, leveraging machine learning techniques. The second
(Research Paper 1) focuses on empirical analysis and methodological advancements in vulnerability
discovery through patch commit mining. Lastly, the third (Research Paper 2) explores various aspects
of source code vulnerability research, including identification, classification, mitigation, and future directions.
Together, these studies contribute valuable insights into the multifaceted domain of software
security.

Introduction:

Source code vulnerability research encompasses a wide array of activities aimed at identifying,
classifying, and mitigating vulnerabilities in software code. The introduction highlights the
importance of secure coding practices and effective vulnerability management strategies in
safeguarding software systems against cyber threats. It outlines the scope of source code
vulnerability research, emphasizing its relevance in the context of evolving cyber threats and the
increasing complexity of software systems.

Methodology:

Methodologies employed in source code vulnerability research vary based on the research
objectives and focus areas. The first study utilizes machine learning techniques, particularly AST
representation, for vulnerability prediction. The second study employs empirical analysis and
patch commit mining to explore vulnerability discovery dynamics. In contrast, the third study
adopts a comprehensive approach, encompassing empirical studies, theoretical investigations, and
the application of machine learning techniques. These diverse methodologies contribute to a
comprehensive understanding of source code vulnerabilities and inform the development of
effective detection and mitigation strategies.

Results and Discussion:

Findings from the three studies offer valuable insights into different facets of source code
vulnerability research. The first study demonstrates the effectiveness of AST-based representation
and machine learning in vulnerability prediction, outperforming existing approaches. The second
study provides empirical evidence on the dynamics of vulnerability mitigation through patch
commit analysis, highlighting challenges and opportunities for improving patch deployment
practices. The third study explores various aspects of source code vulnerability research, including
vulnerability patterns, detection methods, and future directions. Discussions delve into the
implications of these findings for software development practices, research limitations, and
opportunities for further exploration.

Future Work:

Future directions in source code vulnerability research encompass a wide range of areas, including
the exploration of novel detection and mitigation techniques, addressing emerging threats,
improving tool usability, and investigating socio-technical aspects of software security.
Researchers may also focus on enhancing the resilience of software systems against cyber-attacks,
bridging the gap between technical solutions and broader societal issues related to cybersecurity
and privacy.

Conclusion:

Source code vulnerability research plays a critical role in advancing software security and
protecting against cyber threats. By leveraging diverse methodologies and interdisciplinary
approaches, researchers can contribute to the development of more secure and resilient software
systems. Ongoing efforts are needed to address the evolving nature of cyber threats and ensure that
software development practices keep pace with emerging challenges.

This comprehensive report integrates detailed information from the three research papers,
providing a thorough overview of source code vulnerability research and its implications for
software security.

Source code vulnerabilities carry serious implications for software systems, organizations, and
individuals alike. These vulnerabilities can result in:

 Data Breaches: Weaknesses in source code can be exploited by attackers to gain

unauthorized access to sensitive data, leading to breaches of confidentiality and leaks of
confidential information.
 System Compromise: Vulnerable source code can provide attackers with entry points to
compromise software systems, allowing them to gain unauthorized control, install
malware, steal credentials, or disrupt system operations.
 Financial Losses: Organizations may suffer financial losses due to source code
vulnerabilities, including costs associated with data breaches, system downtime, legal
penalties, regulatory fines, and damage to reputation.
 Privacy Violations: Source code vulnerabilities can compromise user privacy by exposing
personal or sensitive information to unauthorized parties, leading to risks such as identity
theft, fraud, or invasion of privacy.
 Intellectual Property Theft: Attackers may exploit source code vulnerabilities to access
proprietary code, trade secrets, or confidential information, resulting in the theft of
intellectual property and loss of competitive advantage.

Thank you!
