
CS202: SOFTWARE TOOLS &

TECHNIQUES FOR CSE


January Month Report

This PDF contains four lab reports documenting the lab activities conducted on
9, 16, 23, and 30 January 2025

ID : 24120036 NISHIT PRAJAPATI


Table of Contents
Introduction to Version Controlling, Git Workflows, and Actions ............................................................ 1
Overview ............................................................................................................................................................ 1
Introduction and Tools........................................................................................................................................ 1
Setup .................................................................................................................................................................. 2
Methodology and Execution .............................................................................................................................. 3
Result and Analysis ........................................................................................................................................... 11
Discussion ......................................................................................................................................................... 12
Conclusion ........................................................................................................................................................ 12
Links .................................................................................................................................................................. 12
Introduction to Mining Software Repositories for Bug Fixes in the Wild .............................................. 13
Overview .......................................................................................................................................................... 13
Introduction and Tools...................................................................................................................................... 13
Setup ................................................................................................................................................................ 14
Methodology and Execution ............................................................................................................................ 15
Result and Analysis ........................................................................................................................................... 20
Discussion ......................................................................................................................................................... 26
Conclusion ........................................................................................................................................................ 27
Links ................................................................................................................................................................. 27

Exploration of different diff algorithms on Open-Source Repositories ................................................. 28


Overview .......................................................................................................................................................... 28
Introduction and Tools...................................................................................................................................... 28
Setup ................................................................................................................................................................ 29
Methodology and Execution ............................................................................................................................ 30
Result and Analysis ........................................................................................................................................... 39
Discussion ......................................................................................................................................................... 40
Conclusion ........................................................................................................................................................ 41
Links ................................................................................................................................................................. 41

Exploring cyclomatic complexity (MCC) changes in Open-Source Repositories ............................................... 42


Overview .......................................................................................................................................................... 42
Introduction and Tools...................................................................................................................................... 42
Setup ................................................................................................................................................................ 43
Methodology and Execution ............................................................................................................................ 43
Result and Analysis ........................................................................................................................................... 49
Discussion ......................................................................................................................................................... 52
Conclusion ........................................................................................................................................................ 53
Links ................................................................................................................................................................. 53
Course: Software Tools & Techniques for CSE (CS202)

Date: 9-Jan-2025
Name: Nishit Prajapati
ID: 24120036
Lab 01 Report
Introduction to Version Controlling, Git Workflows, and Actions

Overview
In this lab assignment, I familiarized myself with version control systems such as Git and got
hands-on experience with Git. I also learned to work with remote repositories on
GitHub. I implemented basic Git operations such as:

• Initializing a repository on the local system as well as on GitHub


• Adding files and committing changes
• Setting up a pylint workflow and revising the code to meet Python
standards

Introduction and Tools


Suppose that during development we made an error and introduced a bug. Now we want to
go back in time, to a state before the error occurred, and recover the error-free code. In
short, we want to track changes in code. Tools that help us do this are referred to as
version control systems.
Git is one such version control system. It is used mainly for two reasons:
i) Track the history
ii) Collaborate with other individuals or groups
It is popular, free, open source, fast & scalable.

GitHub is a website which helps developers to store and manage their code using Git.

I had Git already installed on my system. Its version is:

This lab work was done on a system with following configuration:


Windows 11, i5 12th gen processor, 16 GB ram

Setup

Now, in Git Bash, I configured my information to use it across all my local repositories.

“git config --global user.name "Nishit Prajapati"” and “git config --global user.email
[email protected]” set the global username and email in Git.

Why --global? This flag applies the configuration to all Git repositories on my system.
If we want to configure only a specific repository, we use the same command without the
“--global” flag.
Note: We need to initialize a Git repository first before using the command without the
“--global” flag. There is no such requirement when configuring globally.

Now, we can verify the configuration using the “git config --list” command.

Methodology and Execution

We could continue our task in Git Bash, but it is simpler to communicate between GitHub and
the local system using an IDE. So, I am going to use Visual Studio Code.
Why? Because we can perform several tasks such as writing code, managing files, and running
Git commands in one place. Also, Visual Studio Code provides built-in Git tools.

Now, we can create and get repository to our local system in two ways:
• Create a repository on GitHub and clone it to the local system. (Simple & Easy)
• Create a repository on local system and add a remote repository.
We will choose the second method as specified in the Lab 01 assignment. First create a folder
where we want our local repository.
Steps:
1. Open the file explorer
2. Choose the location where you want to create folder. (I opted for folder in a Desktop)
3. Right click -> New -> New Folder (Name it according to your choice)

Now, initialize a Git repository.


Steps:
1. Open the folder we just created (see the steps above) in Visual Studio Code.
2. Press “Ctrl+Shift+`” to open a new terminal.
3. Enter the “git init” command in the terminal to create a .git folder. This turns the folder into a
Git repository.

We can create folders in many ways, like creating folder using “mkdir” command, but in this report, I have mentioned only
those steps which I executed to complete this Lab work.

Now, we have successfully initialized a local repository. For adding and committing files in that
repository, like a README.md file, follow the below steps:
1. On VS code, create a file with name ‘README.md’ and type your desired text in that
file. Alternatively, you can use command “echo "This is for Lab 01 Report" >
README.md”.
2. As we have now created the README.md file, it is an untracked file. To move it to the
staging area, enter command “git add README.md”.
3. Now, to record the changes in staging area in the repository’s history, use command
“git commit -m “Your desired commit message”.
4. To view the commit history, use command “git log”

Note: you can use command “git status” to know the current status of the repository.

We can now create a GitHub repository and add it as a remote. To do that follow these steps:
1. Log in to your GitHub account and create a new repository. While creating the
repository, do not initialize it with a README.md file as we are going to make that
repository a remote.
2. To link that repository with our local repository, use the “git remote add origin <URL of
the GitHub Repository>” command in the VS Code terminal.
3. To verify the remote use “git remote -v”.
Note: Before adding the remote, make sure in the VS code terminal you are in the same
directory where the local repository was created. Also, when copying the URL of your
GitHub repository, make sure you copy it after selecting HTTPS not SSH. More details
about this will be mentioned in the challenges section of this report.

Now, we have successfully linked our local repository with the GitHub repository and added it
as a remote. After doing this, we will push the committed changes to the remote repository.
To do that I used the command “git push origin main”, but it gave me the following error.

I didn’t know what to do, so I sought help from one of the course TAs. He suggested that I
generate an SSH key and paste the public SSH key into my GitHub account. After hearing his
suggestion, I went to Stack Overflow to learn the process for generating the SSH key. Here is
the link to that page.
Then, I generated the SSH key using the command “ssh-keygen -t ed25519 -C
"[email protected]"”. After entering this command in the VS Code terminal, it asked for
the file where I wanted to save the SSH key and for a passphrase. After completing
these formalities, my SSH key was generated. I copied the key, went to my GitHub account ->
Settings -> SSH and GPG keys -> New SSH key, gave the key a title, and pasted my
public SSH key there.
After this whole process, I again used the command “git push origin main”. And this time it
worked!

As you can see, the files created on the local system were successfully pushed to the remote
repository. Now we want to clone an existing repository. For that we will clone our recently
created repository in some other folder. So, do the following:
1. Create an empty folder on the local system
2. In the VS Code terminal or any terminal, change to the directory where the
empty folder was created. (Use cd <path to that folder>)
3. Now, in the same terminal use “git clone <URL of the repository>” command. This will
clone our previous repository to that empty folder.

Before

After

As you can see, we have successfully cloned our previous repository into the new empty folder
named ‘rough’.

Now if we make some changes in our remote repository and want those same changes on our
local system i.e., we need to pull changes from the remote repository. To do that follow these
steps:
1. Make some changes in your remote repository, like adding or updating files. I
updated the README.md file.
2. Now use “git status” command. You will see “Your branch is behind origin/main by x
commits” on the terminal/command line. It will suggest to pull those changes.
3. To pull those changes use this “git pull origin main” command.
Note: If you want to pull changes in some different branch, replace ‘main’ with the
‘branch name’.

Now for the final part, we are going to set up a pylint workflow. What is pylint? It is a tool
used to identify errors in Python code and to make sure that the code follows standards
such as 4-space indentation, snake_case naming for variables and functions, two blank
lines separating functions, and so on. At the end, it also gives the analyzed Python code a
score out of 10.
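As a small illustration of the conventions pylint checks for, here is a hypothetical snippet (not the code I pushed for this lab) that would satisfy the naming, indentation, and docstring checks mentioned above:

"""Example module following the conventions pylint checks for."""


def add_numbers(first_value, second_value):
    """Return the sum of two values (snake_case name, 4-space indentation)."""
    return first_value + second_value


def main():
    """Entry point, separated from the function above by two blank lines."""
    print(add_numbers(2, 3))


if __name__ == "__main__":
    main()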

We can setup pylint workflow via GitHub actions by following these steps:
1. Navigate to the repository on GitHub where you want to set up the pylint workflow.
2. Click on ‘Actions’ tab and type ‘pylint’ on workflows search bar.

3. After that click on ‘configure’.

4. On line 10, you will find a line listing different Python versions. Make sure
that it contains your Python version. If not, add it manually and remove the
unnecessary ones. You can check your Python version by typing “python --version” on the
command line.

Before

After

5. Then click on ‘commit changes ’.

6. Now, upload a python code in the repository.


7. Pylint will automatically start analyzing the uploaded Python code. After some time, it
shows a green tick if your code meets the Python standards, otherwise a cross.
8. You can also check the errors or issues under the ‘Analyzing the code with pylint’ section
in the Actions log and improve your code.

9. You can either make changes directly in file uploaded on GitHub or pull the changes to
local system and make changes in the IDE. After that you can push those changes.

10. Pylint will again analyze the code. Do it until you get the green tick.

To give pylint a codebase to analyse, I wrote a Python program (about 70 lines) that solves the
maximum subarray problem, using the divide-and-conquer method described in the book
'Introduction to Algorithms' by Thomas H. Cormen.
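For reference, a minimal sketch of that divide-and-conquer approach (not my exact 70-line submission, which also had to satisfy pylint's docstring and naming checks) looks like this:

def max_crossing_subarray(arr, low, mid, high):
    """Return (left_idx, right_idx, total) of the best subarray crossing mid."""
    left_sum, total, max_left = float('-inf'), 0, mid
    for i in range(mid, low - 1, -1):
        total += arr[i]
        if total > left_sum:
            left_sum, max_left = total, i
    right_sum, total, max_right = float('-inf'), 0, mid + 1
    for j in range(mid + 1, high + 1):
        total += arr[j]
        if total > right_sum:
            right_sum, max_right = total, j
    return max_left, max_right, left_sum + right_sum


def max_subarray(arr, low, high):
    """Return (left_idx, right_idx, total) of the maximum subarray in arr[low..high]."""
    if low == high:
        return low, high, arr[low]
    mid = (low + high) // 2
    left = max_subarray(arr, low, mid)
    right = max_subarray(arr, mid + 1, high)
    cross = max_crossing_subarray(arr, low, mid, high)
    return max(left, right, cross, key=lambda t: t[2])


if __name__ == "__main__":
    data = [13, -3, -25, 20, -3, -16, -23, 18, 20, -7, 12, -5, -22, 15, -4, 7]
    print(max_subarray(data, 0, len(data) - 1))  # expected output: (7, 10, 43)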

All the tasks mentioned in LAB 1 have been successfully completed! You can view the
repository through this link.

Results and Analysis


All of the output screenshots are already present in the Methodology and Execution section
of this report, so I see no point in repeating them here. Still, some of the results
from Lab 1 are as follows:
• After initializing the repository, I did a commit with the message "Initial commit with
README". I used ‘git log’ command to confirm if commit was successfully recorded in
the repository.
• After linking the local repository with the remote repository, the changes were
successfully pushed to GitHub with the command ‘git push -u origin main’. This added
README.md file in the repository on GitHub, and changes were reflected in the
commit history on GitHub.
• The GitHub Actions workflow ran successfully, and the Pylint marked the errors within
the Actions logs. After resolving the errors, it showed a green tick since all the quality
checks were passed by the code.

Through this lab work, I observed many things. For example, even after using the HTTPS URL
while linking the local repository to the remote repository, the ‘git push -u origin main’
command gave me the following error message:

The course TA suggested generating an SSH key and adding the public key to my GitHub
account, which actually helped in resolving the error. However, since I initially used the HTTPS
URL instead of the SSH (Secure Shell) URL, I wonder why it required me to generate an
SSH key in the first place. It could be because I have two GitHub accounts
(Nishitttrrrejected and NishitVSP).

During the process, I observed that minor merge conflicts occurred when pushing changes. They
occurred because I made changes in both the remote and the local repository at the same time. I
resolved the conflicts with the merge conflict editor provided by VS Code. It made my work
simple, as I only had to select the parts of the code I wanted. I also learned that the ‘git diff’
command can be used to inspect changes while resolving these merge conflicts.

Some of the key takeaways from this lab work include:


• Deeper knowledge of version control and of Git commands.
• Resolving merge conflicts.
• Automated code quality checks using workflows like pylint.

Discussion

I faced many challenges during this lab work:


• I faced merge conflicts when multiple changes were made to the same file, which
required manual resolution.
• Initially, linking the local repository to GitHub was tricky, particularly with the ‘git push
origin main’ command.
• While working with pylint, I had to correct several issues related to coding style and
potential errors. Fixing these required some trial and error.

In the process of overcoming all of the above challenges, I learned many lessons:
• I learned how to resolve merge conflicts using the merge editor provided by Visual
Studio Code. I also realized I could have resolved merge conflicts more efficiently by pulling
changes from the remote repository more frequently.
• After taking the TA's suggestion, I learned how to generate an SSH key and was able
to resolve that issue.
• I already knew about some of the Python standards, but through this exercise I came
to know about docstrings. I didn’t know it was mandatory to add a docstring to every
function.

In short, this exercise significantly improved my understanding of version control systems,


especially Git. I also gained insights into using GitHub for remote collaboration and using
automated tools like Pylint to ensure code quality.

Conclusion
In this lab work, I successfully configured Git for version control, connected to GitHub, and set
up a Pylint workflow using GitHub Actions. I learned how to commit changes, push them to a
remote repository, and handle merge conflicts. The automated quality check system helped
improve the overall code quality. This experience improved my understanding of Git, GitHub,
and automated testing, which I think is very important in today’s world.

Links
Repository link: NishitVSP/Any_Name: LAB 01 CS202
Book which I referred: 'Introduction to Algorithms' by Thomas H. Cormen
Website from where I learned about generating SSH key: git - How to generate ssh keys (for
github) - Stack Overflow

-x-x-x-x-x-x-x-x-x-x-x-x-x-x-x-x-x-x-x-x- End of Lab 01 Report -x-x-x-x-x-x-x-x-x-x-x-x-x-x-x-x-x-x-x-x-

Course: Software Tools & Techniques for CSE (CS202)

Date: 16-Jan-2025
Name: Nishit Prajapati
ID: 24120036

Lab 02 Report
Introduction to Mining Software Repositories for Bug Fixes in the Wild

Overview
In this lab assignment, we are going to explore mining software repositories to analyse bug
fixes in open-source software (OSS) ecosystems using the tool MineCPP. Notably, one of the
contributors to the development of MineCPP is our course instructor, Shouvick Mondal. I
worked with a multilingual repository from GitHub to extract insights on bug-fix patterns,
coding effort, similarity scores, and more. This assignment includes:

• Setting up the tool


• Selecting repositories
• Creating datasets, and analysing metrics.
Moreover, I evaluated developer commit messages against bug type descriptions generated
by this tool to check whether it "better explains" the bug type or not. The final outcomes
include observations of data, visualizations of metrics, and recommendations to improve OSS
development activities.

Introduction and Tools


Suppose we want to analyze a software repository and gain some insights from it. To fulfill this
requirement, there are several mining tools available. MineCPP (Minecraft++) is one such tool.
It analyses software repositories and extracts meaningful insights such as bug-fix pairs,
buggy commits, and much more. It also presents visualizations of coding effort and similarity
scores for the analyzed repository. MineCPP is a tool developed through collaboration, with one
of the contributors being our course instructor, Shouvick Mondal. It is an extension of the
previously developed tool Minecraft. The previous tool extracted 5 features: before
bug fix, after bug fix, location, bug type, and commit message. MineCPP extracts 17
features, which include all the features from Minecraft along with 12 others such as buggy
commit, fixed commit, and file path. All these features are presented in a CSV file, and some,
such as coding effort and similarity scores, are also represented visually.

In this lab activity, we are going to use this MineCPP tool.

Setup
The setup of this tool is a little tricky. We can install it on any operating system
just by using the command “pip install minecpp” on the command line, but it requires some
configuration: the system needs Python 3.8 or above and C++14. Even if
your system meets these requirements, you may face many errors while installing the tool. I
faced the following error:

It indicated that version 2.1.2 of torch could not be found. So, I tried to install that version
of torch on my system but failed. I referred to the official page to install version 2.1.2 of torch,
but its recommended commands didn’t work on my system.

It gave me back the same error!


So, as suggested in lab, I tried installing minecpp on a Virtual Machine provided by IIT
Gandhinagar for Software Engineering and Testing Group. Here is the link to the page from
where I installed this virtual machine. To download it follow the below steps:

• Install Oracle VM virtual box on your machine.
• Click on this ‘link’ and download the SET-IITGN-VM.ova file.
• Then in virtual box click File -> Import Appliance -> Select the above file
• Use the below credentials to login:
1. Login password: set-iitgn-vm
2. Root password: set-iitgn-vm
After installing the above-mentioned virtual machine, open its terminal and check if it has
python with version 3.8 or above. Then use command “pip install minecpp” .

Now, MineCPP is successfully installed!

Methodology and Execution


So, after installing the MineCPP tool I tried it on a repository by using “minecpp -u <URL of the
repository>” command on the terminal. But it gave me an error “version ‘GLIBCXX_3.4.29’ not
found”.

To resolve this error, I searched on Google and found a solution on Ask Ubuntu. It was clear
to me that I needed to install version 3.4.29 of GLIBCXX. Initially the virtual machine only had
GLIBCXX versions up to 3.4.28 (see image below). To install the required version, I ran certain
commands on the terminal, as mentioned on the Ask Ubuntu website. Those commands were:

1. “strings /usr/lib/x86_64-linux-gnu/libstdc++.so.6 | grep GLIBCXX” to check the
current version.

As you can see, the virtual machine only had GLIBCXX versions up to 3.4.28. So, we will
install version 3.4.29 of GLIBCXX using the commands below.

2. Use command “sudo add-apt-repository -y ppa:ubuntu-toolchain-r/test” followed by


“sudo apt-get update”.
3. After running the above commands, use command “sudo apt install gcc-10 g++-10
libstdc++6”.

So, after completing all of the above steps, the required versions of GLIBCXX were installed. We
can again use the command “strings /usr/lib/x86_64-linux-gnu/libstdc++.so.6 | grep GLIBCXX”
to check the available versions.

Now to select a repository for using this tool, I adopted a similar strategy as mentioned in the
Lecture 2 slide of this course. I used SEART GitHub Search Engine to perform this task. So, my
selection criteria were as follows:

• Language: C++ (search engine returned 99,281 results)
• Minimum stars: 30,000 and minimum forks: 18,000 (returned 34 results)
• 4000 < total commits < 5000 (returned 6 results)
• Repository selected: Caffe

So, why these criteria? To be honest, the choice of C++ as the language was random. The other
decisions were made for specific reasons: keeping the total commits between 4000 and 5000
ensured that we could mine the repository in a short amount of time. We also wanted to make
sure that the selected repository is not a toy project but a real-world project, which is why we
selected a repository with more than 30,000 stars. The number of stars
reflects the popularity of a repository, and real-world open-source projects are generally
quite popular. That holds in our case: Caffe is a fast, open framework for deep learning.
Thus, it can be considered a real-world project.
We have selected the repository. Now, we will analyze this repository using the tool MineCPP.
For that use command “minecpp -u <URL of the repository>” on the virtual machine’s
terminal. The tool will take some time to analyze the repository and will provide a
project_name.csv file which will contain the 17 features as discussed in the introduction of
this lab 02 report.
Note: It will take a significant amount of time to analyze, so please be patient and wait!

Start

Finish

So, as you can observe in the above screenshots, it took 5 hours and 41 minutes to analyze the
whole repository. It also created a file named “caffe.csv” which contains the output of the
analysis. We will talk about this file in more detail in the Results and Analysis section of this
report.
Now, to find the top 2 bug-fix pairs with a large impact, I created a metric ‘Impact’, which is
simply the sum of the difference between the Lizard features for the fixed and buggy code
and the difference between the BLEU and crystalBLEU scores. Why? Because the difference
between the Lizard features for the fixed and buggy code measures the change in code
complexity, while the difference between the BLEU and crystalBLEU scores assesses the
structural and semantic changes; together they give a more comprehensive view of the bug
fix's overall impact. Its code will be discussed in the Results and Analysis section.

Results and Analysis

The above screenshot shows one glimpse of the file created by MineCPP. As you can observe, it
created many duplicates for the same commit and bug type. It is possible that the same commit
contains many bugs along with their fixes. Also, parameters like coding effort differ for each
bug-fix pair. So, the file created by this tool contains 2772 bug-fix pair rows with many duplicates.
Here is the link to the file. Therefore, I removed the duplicates. Here is how I did it (an
equivalent pandas approach is sketched after these steps):
• Open the caffe.csv file and select any filled cell.
• Press ‘Ctrl+A’; this will select all the filled cells.
• Click Data -> Remove duplicates
• Select the 4 parameters ‘bug type’, ‘commit message’, ‘fixed commit’ and
‘buggy commit’. Duplicates were removed from the file on the basis of these 4 parameters.
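The same deduplication can also be done programmatically. A minimal pandas sketch, assuming the column headers match the names used above (they may need adjusting to the exact caffe.csv headers):

import pandas as pd

# Load the MineCPP output and drop rows that repeat the same four parameters.
df = pd.read_csv("caffe.csv")
deduped = df.drop_duplicates(
    subset=["bug type", "commit message", "fixed commit", "buggy commit"])
deduped.to_csv("caffe_deduped.csv", index=False)
print(len(df), "rows before,", len(deduped), "rows after removing duplicates")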
So, after removing the duplicates based on the above-mentioned parameters, I got 664 unique
bug-fix pairs. Here is the link to the reduced file. One of the main points to notice here is
that our repository contained a total of 4156 commits while the total number of generated
bug-fix pairs was 2772, which also suggests that many duplicates were created in this file.
After reducing the number of rows to 664, I randomly selected 15 rows and analyzed them. I
looked at the LLM-inferred bug type description and the developer-provided commit message.
Before presenting the results, it is important to define “better explains”.
Here, “better explains” refers to whether the LLM-inferred bug type provides a simpler,
clearer, more detailed, and contextually accurate explanation of the bug fix compared to the
developer’s commit message. This includes specifying the exact issue and describing
the fix in a way that enhances understanding beyond what the developer originally
wrote. Click here to access the Excel sheet which contains the 15 labeled bug-fix pairs.

1. Bug type: fix typo in softmax_layer.cpp
   Commit message: bugfix
   Label: TOOL
   Reason: The bug type explicitly mentions a typo fix, whereas the commit message is too vague.

2. Bug type: fix a typo in caffe/util/io.cpp
   Commit message: bugfix and made the C++ interface for creating leveldb
   Label: TOOL
   Reason: The bug type clearly states what was fixed (a typo), while the commit message focuses on LevelDB, which is unrelated, and just “bugfix” is too vague.

3. Bug type: fix coverage for CUDA_CHECK and CUDA_VSL_CHECK
   Commit message: caffe common cpp: fixed an embarassing bug
   Label: TOOL
   Reason: The bug type is specific about coverage fixes, while the commit message is too vague.

4. Bug type: fix regression in window_data_layer.cpp
   Commit message: some major bug fixes (includes some to-be-removed debugging code)
   Label: TOOL
   Reason: The bug type specifically mentions a regression, which is more informative than the generic commit message.

5. Bug type: fix bug in window_data_layer.cpp
   Commit message: loss in forward pass fix for window data layer
   Label: BOTH
   Reason: Both descriptions provide useful context for the bug fix.

6. Bug type: fix caffe_axpy and caffe_cblas_axpy
   Commit message: compile caffe without MKL (dependency replaced by boost::random, Eigen3)
   Label: NONE
   Reason: The bug type talks about fixing specific functions, while the commit message talks about dependency changes, making them unrelated.

7. Bug type: add boost/math/nextafter support to caffe
   Commit message: Fixed uniform distribution upper bound to be inclusive
   Label: NONE
   Reason: The bug type refers to adding support for a library, while the commit message refers to fixing a distribution bound, making them unrelated.

8. Bug type: fix typo in caffe_gpu_asum
   Commit message: Rename signbit in macros to sgnbit to avoid conflicts with std::signbit
   Label: DEV
   Reason: The bug type states a typo fix, but the commit message talks about renaming a macro; both are similar, but the commit message is more descriptive.

9. Bug type: fix caffe tests
   Commit message: Fixed CPPLint errors related to math funtions
   Label: DEV
   Reason: The bug type is general, but the commit message refers to CPPLint fixes, making it more descriptive.

10. Bug type: fix a typo in convert_imageset.cpp
    Commit message: passing too many args to tool binaries is an error
    Label: TOOL
    Reason: The bug type specifies a typo fix, while the commit message is unrelated to typos. The tool’s output is more informative.

11. Bug type: fix the return value of compute_image_mean.cpp
    Commit message: tools should have nonzero error exit codes
    Label: TOOL
    Reason: The bug type talks about fixing a return value, while the commit message refers to tool exit codes, making them quite similar. But the bug type is more precise and accurate.

12. Bug type: use static_cast instead of reinterpret_cast
    Commit message: fix casts (static for void*)
    Label: BOTH
    Reason: Both descriptions talk about improving type casting, making them equally useful.

13. Bug type: fix wrong checks in caffe/layers/pooling_layer.cpp
    Commit message: fixing pooling SetUp() to allow default values for stride and pad
    Label: DEV
    Reason: The commit message is clearer in explaining what was fixed (stride and pad defaults), while the bug type is vague.

14. Bug type: add checks for non-square filters in pooling layer
    Commit message: fixed style errors
    Label: TOOL
    Reason: The bug type mentions functional improvements (checking non-square filters), while the commit message talks about styling, making the tool’s output more useful.

15. Bug type: fix incorrect usage of 'memcpy' in cpp
    Commit message: Avoid using cudaMemcpy for memcpy when there is no GPU and CUDA driver
    Label: BOTH
    Reason: Both messages convey relevant context about the memcpy issue.

So, after labelling all 15 bug-fix pairs, we can see that 47% (7 out of 15) of the descriptions
provided by the tool "better explains" the bug fix compared to the developer’s commit
message. Around 20% (3 out of 15) of the descriptions were equally well explained by both
the tool and the developer. Additionally, 13% (2 out of 15) of the descriptions were not well
explained by either the developer or the tool. Still, 20% (3 out of 15) of the descriptions were
better explained by the developer’s commit message than by the tool.
Note: A sample of 15 is not enough to make judgement on the performance of this
tool!

Now, MineCPP also provides a GUI containing two graphs. One graph plots
“coding effort” vs “bug-fix pairs” and the other plots “similarity scores” vs “bug-fix
pairs”. We can get these graphs by clicking on “Dataset Quantitative Analysis” in the GUI
provided by the tool. After that, we can select the number of bug-fix pairs (i.e., the number
of rows) we want to analyse, and it will provide the two graphs corresponding to the
selected number of rows. Below are some of the graphs provided by the GUI:

(Graph annotations: spikes mark files that require further review or optimization; the file can be
located from the csv file generated by this tool. Flat regions are stable.)

Note: To analyse all these metrics, reference has been taken from the Lecture 2 slides.

1. Interpretation of "Coding Effort"


The coding effort metric represents the number of nodes traversed before reaching the buggy
node of an Abstract Syntax Tree (AST). It quantifies the complexity and effort required
by the author/developer to introduce bugs within the source code.
From the visualizations we can see significant variations in coding effort across bug-fix
pairs. Some bug-fix pairs show spikes in effort, indicating major code changes or
complex fixes. Also, certain areas in the repository exhibit consistently low effort,
suggesting minimal modifications or stable code sections. Now, to find the bug-fix
pair with the highest coding effort, we could pick the high spikes directly from the above
graphs, but as the number of bug-fix pairs increases, the frequency of high spikes also
increases, as discussed above. So, we will open the csv file generated by
this tool and then follow these steps:
1. Click on Data tab.
2. Click Sort.
3. Select the Coding effort column.
4. Select the sorting from largest to smallest

• The above steps sort the bug-fix pairs on the basis of coding effort. Now we can see
the bug-fix pairs with the highest coding effort (a pandas equivalent is sketched below).

• So, these two bug-fix pairs have the highest values of coding effort.
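The same ranking can be reproduced in pandas. A minimal sketch, assuming the coding effort column in caffe.csv is named 'coding effort' (the actual header may differ):

import pandas as pd

# Sort the MineCPP output by coding effort and keep the two largest values.
# 'coding effort' is an assumed column name; adjust it to the real caffe.csv header.
df = pd.read_csv("caffe.csv")
top_two = df.sort_values("coding effort", ascending=False).head(2)
print(top_two)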
Thus, files or modules (which we can identify from the csv file generated by this tool) with
consistently high coding effort, frequent modifications, and major bug-fix spikes require
further review or optimization.

2. Identifying Patterns and Trends


We can observe that the graph follows a trend: as the repository
develops, the coding effort increases. This could be due to growing complexity and
additional files. High spikes in coding effort highlight sections where significant
changes or feature enhancements were made. Also, certain areas exhibit frequent
occurrences of low coding effort, which may suggest that these modules are stable,
well-maintained, or required only minor bug fixes. Thus, the distribution of coding
effort reveals that certain files or modules are continuously updated while others
remain relatively static.

Now, to find the top 2 bug-fix pairs with a large impact on this metric, I wrote the following
code:
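(The code itself was included as a screenshot. A rough pandas sketch of the computation described above is shown below; the column names are hypothetical and would need to be adjusted to the actual caffe.csv headers.)

import pandas as pd

# Hypothetical column names; adjust to the real headers in caffe.csv.
df = pd.read_csv("caffe.csv")
impact = ((df["lizard_fixed"] - df["lizard_buggy"]).abs()
          + (df["BLEU_score"] - df["crystalBLEU_score"]).abs())
# Row indices of the two bug-fix pairs with the largest impact.
print(impact.nlargest(2).index.tolist())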

It returned the row numbers in the caffe.csv file that contain the bug-fix pairs with the
largest impact on the metric.

The top 2 bug fix pairs are:

3. Recommendations for Improvement
Thus, by using this tool we can identify files or modules with high coding effort and
try to optimize them. We can adopt object-oriented
programming concepts to simplify future modifications. We can also use
automated tools like pylint (discussed in Lab 01) to catch issues early.
Also, if certain sections require ongoing modifications, we can refactor them to enhance
long-term maintainability and continuously monitor trends through this tool to
track progress and adapt development strategies as needed.

Discussion
I faced many challenges during this lab work:
• Installation of MineCPP was quite tricky. Even after meeting and installing all the
requirements, I was not able to install MineCPP on Windows.
• When I attempted to install MineCPP on the virtual machine, I encountered an error
about “GLIBCXX_3.4.29” not being found.
• So, I had to install that specific version on the virtual
machine.
• MineCPP took 5 hours and 41 minutes to analyze the selected repository, which is a
significant amount of time.
• The initial dataset generated by this tool contained 2772 rows, but many were
duplicates. So, I had to manually remove the duplicate rows based on specific
parameters.
In the process of overcoming all these above challenges, I learned many lessons:

• I learned the importance of environment configuration. I found that ensuring the


correct versions of dependencies is quite important before installing tools like
MineCPP.
• I learned how to handle duplicate data and prepare it for analysis.
• I got hands-on experience with this tool and learned how its coding effort metric
helped identify high-impact bug fixes, indicating areas that required further
optimization. Observing trends in coding efforts can highlight stable versus
frequently modified code sections.

Conclusion
This lab showcased how effective MineCPP is in mining software repositories for analysing bug
fixes and coding efforts. Although there were challenges with installation and data duplication,
the tool offered valuable insights into bug-fix patterns and the stability of repositories. The
findings indicate that automated tools can aid in the analysis of software repositories. Future
enhancements could focus on improving MineCPP’s efficiency, simplifying the installation
process by having it install all requirements automatically on any system without human
intervention, and incorporating more validation mechanisms to boost the accuracy of bug
descriptions.

Links:
1. Lecture 2 slides.
2. SEART GitHub Search Engine.
3. Ask Ubuntu : referred code snippets from here to update GLIBCXX_3.4.28 ->
GLIBCXX_3.4.29.
4. Installed Oracle VM VirtualBox and downloaded the SET-IITGN-VM.ova file through this
‘link’.
5. Excel file which contains the labelled 15 rows.
6. File from which duplicate rows were removed.
7. Original file.

-x-x-x-x-x-x-x-x-x-x-x-x-x-x-x-x-x-x-x-x- End of Lab 02 Report -x-x-x-x-x-x-x-x-x-x-x-x-x-x-x-x-x-x-x-x-

Course: Software Tools & Techniques for CSE (CS202)

Date: 23-Jan-2025
Name: Nishit Prajapati
ID: 24120036
Lab 03 Report
Exploration of different diff algorithms on Open-Source Repositories

Overview
In this lab assignment, we are going to perform some tasks using a tool named ‘PyDriller’.
Using this tool, we are going to mine a software repository and fetch data about it,
such as the old file path, new file path, commit hash, parent commit hash, commit
message, diff using the Myers algorithm, and diff using the Histogram algorithm. After fetching
all this data, we are going to analyze it and plot the final dataset statistics.
After this lab work, we will be able to:

• Setup repository mining tools and apply them on real world projects.
• Analyse diff output due to variants of the diff algorithm applied in the wild.
• Analyse the impact of different diff algorithms on code versus non-code artifacts.

Introduction and Tools


PyDriller is an open-source Python library that makes Git repository mining easy. It
offers a simple interface to obtain commit history, code change analysis, and a great
deal of repository statistics.
It provides many built-in functions/fields which help us fetch various
kinds of data from the repository which we are going to mine for this lab.
Information about the functions and fields PyDriller provides is available
in its documentation, and the details about the fields mentioned below have been taken
from that documentation.
In this lab assignment, we are going to use these following fields provided by PyDriller:

• old_path: old path of the file (can be None if the file is added).
• new_path: new path of the file (can be None if the file is deleted).
• diff: diff of the file as Git presents it (e.g., starting with @@ xx,xx @@).

• hash: Hash code of the commit.
• parent[0]: Hash code of the parent commit (can be None if the commit has no parent,
i.e., it is the initial commit).
• msg: commit message of the commit.
Note: old_path, new_path and diff are fields of the modified file, whereas hash,
parent[0] and msg are fields of the commit in the repository.
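A minimal illustration of how these fields are accessed with PyDriller (assuming a repository already cloned to a local path named './repo'):

from pydriller import Repository

# Traverse the commits and print the fields listed above for each modified file.
for commit in Repository("./repo").traverse_commits():
    parent = commit.parents[0] if commit.parents else None
    for mod in commit.modified_files:
        print(mod.old_path, mod.new_path, commit.hash, parent, commit.msg)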

Setup
Before installing PyDriller on any system, make sure that it meets the following requirements:

• Python 3.4 or newer


• Git
Source: PyDriller Documentation

As my system already meets the above-mentioned requirements, we will install PyDriller


using “pip install pydriller” on the terminal provided by VS code.

Note: It is not mandatory to use above command on terminal within VS code only. You
can do this in any command line interface such as PowerShell or Git Bash. But, as we
are going to modify some code in this lab activity, it is recommended to do this in VS
code.

After installing the PyDriller tool, we are going to install the analysis framework. Download
the framework through this link.
This link will contain a zip folder named cs202_miner. Extract the folder and we will receive
the following files:

• analysis (SH source file)


• checkout.py (Python file)
• fetch_py_projects.py (Python file)
• getCommits.py (Python file)
• getCommitsInfo.py (Python file)
• main (SH source file)
• projects (csv file)

Open the folder which contains all these files in Visual Studio code.

Methodology and Execution


Now, to select a repository to mine, I followed the same strategy as mentioned in the Lab 02
report. I used SEART GitHub Search Engine to perform this task. So, my selection criteria were
as follows:

• Language: Python (search engine returned 3,19,513 results)
• Minimum stars: 30,000 (returned 134 results)
• Minimum forks: 18,000 (returned 14 results)
• Repository selected: Django

The choice of Python as the language was again random: in Lab activity 02 I opted for C++, so
this time I went for Python. When the search engine returned 14 repositories, I spotted
Django among them. I already knew a little bit about Django, so I opted for this repository.
Django is an open-source web framework which helps developers build web applications.
It is primarily used in backend development. That’s all I know about Django!

Now, we will copy the URL of Django from GitHub and modify the projects.csv file. We will
update the URL, project name and number of stars in this csv file.

Before

After

Before modifying the code in these files, let’s first understand what all these files do.
1. projects.csv: It is a csv file which contains the project name(Repository name), Git
Repository URL and stars that repository has.
2. main.sh : It is an SH source file/shell script and this file will be our starting point for
mining the repository.
It reads the projects.csv file, ignoring the first line, and then extracts the project name
and URL from csv file. After that, it clones the repository from the extracted URL and
calls analysis.sh shell script.
3. analysis.sh: It removes the temporary files and creates a <Repository name>_results
folder which will contain the commits_info.csv created by the getCommitsInfo.py file and the
<Repository name>.commits file created by the getCommits.py file.
As mentioned earlier, it removes the temporary files. What are temporary files?
Suppose we run the main shell script twice. On the first run, our program creates a csv file
and a commits file and stores them in the directory created by the analysis shell script. When
we run the main shell script a second time, it removes these two files.
This analysis shell script also runs the getCommitsInfo.py and getCommits.py file to
extract all details related to commits.
4. getCommits.py: It extracts the last n(we will choose the value of n, for this lab work n
= 500) non-merge commits from the main branch and prints the commit hash code.
The extraction is done in reverse order.
5. checkout.py: It checks out the repository to a specific commit hash using PyDriller’s
checkout method (a short sketch follows this list). The checkout method takes ‘sys.argv[2]’ as a
parameter: sys.argv[1] refers to the repository path and sys.argv[2] refers to the commit hash.
So, checkout.py initializes PyDriller’s repository object with the repository path and,
using the checkout method, switches this repository to the commit hash.
6. getCommitsInfo.py: This file is one of the most important for this lab assignment. We
are going to add logic to this file to perform the tasks assigned in this lab work. It will
create the csv file containing all the extracted data.

7. fetch_py_projects.py: It fetches the most starred Python repositories from GitHub
and stores them in a csv file.
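For illustration, a sketch of what checkout.py does, based on the description in item 5 above (recent PyDriller versions expose this through the Git class; older versions call it GitRepository):

import sys

from pydriller import Git

# sys.argv[1]: path to the cloned repository; sys.argv[2]: commit hash to check out.
repo = Git(sys.argv[1])
repo.checkout(sys.argv[2])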
Now we know what all these files do. However, in this lab assignment there is no use for the
fetch_py_projects.py file, so we will delete it. In this lab assignment we are
assigned to generate the dataset (commits_info.csv), so we also don’t need the getCommits.py
file, which generates the <Repository name>.commits file. Therefore, I removed that file as well.
Before modifying anything, make sure to do these things:

• When opening a new terminal in VS code, use Git Bash(it is recommended)


• Make sure you are in right path. You need to be inside the cs202_miner folder to run
the main shell script. To do that use command “cd <path to cs202_miner folder>”.
Now, we are assigned to make a csv file which will contain the following columns:

So, we will modify the code in the getCommitsInfo.py file. The screenshots below highlight the
modified parts of the code.

Before

After

As you can see in the highlighted part of the code, we made these changes:

• The value of last_n was changed from 100 to 500, as we need to analyze the last 500 non-merge
commits.
• Unnecessary import lines were removed (e.g., CodeChurn, HunksCount and
CommitsCount).
• Two loops were used to traverse the commits. In the first loop only one flag,
“skip_whitespaces=True”, was passed as a parameter to the Repository class. In the
second loop two flags, “skip_whitespaces=True” and “histogram_diff=True”, were
passed as parameters to the Repository class.
• Fields like old_path, new_path, hash, parents[0], msg and diff were used to extract the
corresponding data, which was appended to a list named rows. In the end, we write
these rows to the commits_info.csv file.
• Logic was added so that if the diffs produced by the Myers and Histogram algorithms are the
same, the Matches column stores the string “Yes”, otherwise “No”, in the
corresponding rows.
• Two arguments, “encoding='utf-8'” and “newline=''”, were passed to the
open function.
Now we will discuss why the above changes were made. It is also important to note that the diff can
be obtained using four algorithms: Myers, Histogram, Patience and Minimal. Histogram is
considered the best in certain scenarios because it shows the minimum number of changed lines
and gives better alignment of changes.
In getCommitsInfo.py, we traverse the commits with a loop twice. Why? Because PyDriller
uses Git’s diff, and Git’s diff, by default, uses the Myers algorithm. Therefore, we ran the loop
twice, so that in the first loop we got the diff using the Myers algorithm, and in the second loop,
since “histogram_diff=True” was passed as a parameter to the Repository class, we got the diff
using the Histogram algorithm. The flag “skip_whitespaces=True” was included so that
whitespace is ignored when lines are compared. In the second loop, before appending diff_hist to the rows,

we wanted to make sure that the modified file for which we were computing the diff using the
Histogram algorithm was the same modified file for which the diff had been computed using the
Myers algorithm. To do that, I added a comparison in the loop: only if the commit hash
(row[2]) is the same, and the old and new file paths (row[0] and row[1]) are the same, is row[6]
assigned the diff obtained using the Histogram algorithm; otherwise it remains empty.
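Since the modified code is shown above only as screenshots, here is a rough, simplified sketch of the two-pass idea just described. It is an assumption-laden reconstruction, not the exact getCommitsInfo.py code: the repository path, the column headers, and the omission of the last_n = 500 limit are all simplifications.

import csv

from pydriller import Repository

REPO = "./django"  # assumed path to the cloned repository
rows = []

# First pass: default (Myers) diff, ignoring whitespace-only changes.
for commit in Repository(REPO, skip_whitespaces=True).traverse_commits():
    if commit.merge:
        continue
    parent = commit.parents[0] if commit.parents else None
    for mod in commit.modified_files:
        rows.append([mod.old_path, mod.new_path, commit.hash, parent,
                     commit.msg, mod.diff, "", ""])

# Second pass: Histogram diff, attached only to the matching (commit, file) row.
for commit in Repository(REPO, skip_whitespaces=True,
                         histogram_diff=True).traverse_commits():
    if commit.merge:
        continue
    for mod in commit.modified_files:
        for row in rows:
            if row[2] == commit.hash and row[0] == mod.old_path and row[1] == mod.new_path:
                row[6] = mod.diff
                row[7] = "Yes" if row[5] == mod.diff else "No"

with open("commits_info.csv", "w", encoding="utf-8", newline="") as handle:
    writer = csv.writer(handle)
    writer.writerow(["old_file path", "new_file path", "commit SHA", "parent commit SHA",
                     "commit message", "diff_myers", "diff_hist", "Matches"])
    writer.writerows(rows)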
Also, two arguments, “encoding='utf-8'” and “newline=''”, were passed to the
open function. Without “encoding='utf-8'” I encountered a ‘UnicodeEncodeError’, so I referred
to this page and included this argument. Moreover, without “newline=''”, when I ran the main shell
script, it created a csv file containing blank rows. So “newline=''” was used to prevent
the creation of blank lines in the csv file.
It's not that I made all these modifications in one go without any errors. In fact, it took many
trials and errors to arrive at this version of the code. For instance, PyDriller’s
documentation mentions that there is an option to change the diff so that it uses the Histogram
algorithm, but it does not include any illustration or the code to do it. At first, I
didn’t know how I could change the diff’s configuration, but the course TA suggested that I look
for this in PyDriller’s source code rather than in the documentation. That was the key moment:
when I looked at the source code, I learned about the flags I just mentioned above. This
suggestion really helped me and simplified my process. Before this, I tried many things
suggested by Google, but they only complicated things even more.

The last thing we need to change is to remove the line in the analysis shell script that calls
getCommits.py, because we have already removed that file and we only need the csv file
created by the getCommitsInfo.py file.

This line executes the getCommitsInfo.py file.

Also, if we remove the checkout.py file and comment out some parts of analysis shell script,
our program would work perfectly fine. In this lab assignment, we only have two main tasks:

• Create and fill the commits_info.csv file using/executing the getCommitsInfo.py file.
• Read the created csv and generate the plots.

Now that we have made all the necessary modifications to the code, we will run the main shell
script with the command “./main.sh”.
This runs the main shell script. As the main shell script calls the analysis shell script, the analysis
shell script also gets executed. Furthermore, the analysis shell script calls getCommitsInfo.py, so it
also gets executed. getCommitsInfo.py creates the commits_info.csv file, which contains the
data extracted by mining the repository with PyDriller.
Below is the screenshot of the commits_info.csv file created by our program.

Now, we need to plot the final dataset(this csv file data) statistics using python code. To do
that we will use matplotlib and pandas library. If these libraries are not installed in your system
use command “pip install matplotlib” and “pip install pandas” on the terminal. This will install
these libraries.
I have created a final_dataset.py file which contains the code that will generate the plots.
First, let’s define what code artifacts and non-code artifacts are. Code artifacts
are files with extensions like ‘.py’, ‘.c’, ‘.cpp’, ‘.js’, etc.; in short, files which
contain code. Non-code artifacts are files with extensions like '.txt', '.md',
'.csv', '.json', '.yml', '.yaml', etc.
So, here is the algorithm I used to generate the plot:

• Load the dataset


• As Django is mostly written in Python, with a little use of HTML, I defined
lists containing the extensions of code artifacts and non-code artifacts. I use
these lists to categorize each file.
• Then I created a function which takes file_path as an argument and categorizes it.
This file_path parameter takes the values stored in the
‘new_file path’ column of the commits_info.csv file.

• Now, we create a ‘category’ column which stores either the “code” or the “non-code”
string based on the result of the above function.
• We also create two more data frames, ‘matches’ and ‘no_matches’, which store
the rows filtered on the “Matches” column of our commits_info.csv file.
• Now, we simply calculate the counts using pandas’ shape attribute.
• Finally, we plot the bar graph using the matplotlib library based on the above counts
(a sketch of this script follows).
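A condensed sketch of what final_dataset.py does, following the algorithm above. The extension lists and the 'other' fallback are simplifications of my actual script; the column names ‘new_file path’ and ‘Matches’ are the ones used in commits_info.csv.

import matplotlib.pyplot as plt
import pandas as pd

CODE_EXT = (".py", ".c", ".cpp", ".js", ".html")
NON_CODE_EXT = (".txt", ".md", ".csv", ".json", ".yml", ".yaml")


def categorize(file_path):
    """Label a file as 'code', 'non-code' or 'other' based on its extension."""
    path = str(file_path).lower()
    if path.endswith(CODE_EXT):
        return "code"
    if path.endswith(NON_CODE_EXT):
        return "non-code"
    return "other"


df = pd.read_csv("commits_info.csv")
df["category"] = df["new_file path"].apply(categorize)

# Count matching / non-matching diffs per artifact category and plot them as bars.
counts = df.groupby(["category", "Matches"]).size().unstack(fill_value=0)
counts.plot(kind="bar")
plt.ylabel("Number of modified files")
plt.title("Myers vs Histogram diff agreement by artifact type")
plt.tight_layout()
plt.show()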

Before executing the file which generates the plot, make sure it is in the same folder where the
commits_info.csv file is stored. Use the command “cd <path to the folder where the csv file is
located>”. After this you can click run directly or use the command “python3 <name of the
python file which generates the plot>”.
Also, you can move commits_info.csv out of the <Repository name>_results folder into the
cs202_miner folder, because if you run the main shell script again, the pre-cleanup code in the
analysis shell script will delete the csv file created in its previous execution.

We have successfully executed every task mentioned in this lab assignment! Now, we will
analyze the results in the next section.

Results and Analysis
The results from the dataset created by the main shell script are quite interesting. The csv file
it created contains 500 rows, as expected, but the labeling under the ‘Matches’ column,
based on our logic that a row is labelled ‘Yes’ if the diff using the Myers algorithm is the same
as the diff using the Histogram algorithm and ‘No’ otherwise, surprised me. Why? Because most
of the labels are ‘Yes’. I expected that the outputs of the Myers and Histogram algorithms
would differ, so most of the rows would be labeled ‘No’, but the actual results were the
opposite of what I expected. For this reason, I asked the course TA about the outcome.
He told me that it is quite possible, because the developers may have modified the code in such
a way that the diffs produced by the Myers and Histogram algorithms are the same, so I accepted
the result. But then I noticed something else. If you look at the plot our code generated:

You can see that, there are some rows which are labelled as ‘No’, i.e., the diff using myers and
histogram algorithm is different.
But now let’s look at the csv file:

As you can see in the csv file, the diffs generated using the Myers (left) and Histogram (right)
algorithms look the same to me, yet the row is still labeled ‘No’. Therefore, I looked at the
code again but didn’t find anything wrong in it. I used the flag ‘histogram_diff=True’, which means
that the .diff field contains the diff produced using the Histogram algorithm. And in
the first loop that extracts the commit objects, I did not include
this flag in the parameters of the Repository class, because by default the .diff field contains the
diff produced using the Myers algorithm. So it is confirmed that I stored the correct diff in the
correct column. Also, as you can see in the code, I used the flag ‘skip_whitespaces=True’ in both
loops, so whitespace is ignored for both types of diff while comparing. There is no point in
using the diff_parsed field; it is just a more readable format of the diff. I even checked that if I
use the diff_parsed field for both the Myers and Histogram diffs, most of the rows are still labeled
‘Yes’. Also, I cannot use the diff field for the Myers diff and the diff_parsed field for the Histogram
diff, as that would certainly label every row ‘No’.
In the end, I didn't find anything wrong in the code of getCommitsInfo.py, so I accepted the results; the course TA also told me this outcome is possible.
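For clarity, the relevant PyDriller configuration looks roughly like the sketch below (the repository URL is a placeholder); it uses only the histogram_diff and skip_whitespaces flags and the .diff field discussed above.

from pydriller import Repository

REPO = 'https://github.com/<owner>/<repo>'  # placeholder URL

# Pass 1: by default, .diff holds the diff produced by the Myers algorithm.
for commit in Repository(REPO, skip_whitespaces=True).traverse_commits():
    for mod in commit.modified_files:
        myers_diff = mod.diff

# Pass 2: with histogram_diff=True, .diff holds the Histogram diff instead.
for commit in Repository(REPO, histogram_diff=True, skip_whitespaces=True).traverse_commits():
    for mod in commit.modified_files:
        histogram_diff = mod.diff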
If we analyze the results, then of the total commits, most of the files committed were code artifacts. In our result, there were in total 312 code artifacts with the same diff under the Myers and Histogram algorithms and 8 code artifacts with different diffs. On the other hand, there were in total 173 non-code artifacts with the same diffs and 4 non-code artifacts with different diffs. You may notice that the total number of artifacts (312+8+173+4=497) is not equal to 500. This is because of the condition we imposed while appending the Histogram diff to the rows: if the condition is not met, both the diff_hist and Matches columns remain empty. There are 3 such rows where both the diff_hist and Matches columns are empty.
We have now successfully completed the analysis of our result!

Discussion
In this lab assignment, I faced a few challenges:

• Initially, it was not clear how to obtain diffs using the histogram algorithm. PyDriller's
documentation did not have a clear example; thus, I had to look through its source
code to determine the correct way to set it up.
• Due to the unexpected results in the generated csv file, I had to revisit the code many times. It also made me question whether I was doing anything wrong.
• Since it had been some time since I last used the Pandas and Matplotlib libraries (I used them in my 1st semester), I needed to brush up on their documentation and syntax to correctly process and visualize the data.
I learned many lessons while overcoming the above challenges:

• Value of understanding tool configurations: Rather than relying only on documentation, reading source code can reveal underlying configurations and features. I also learned that I didn't need to read the full source code: later, I found out that if I just hover my mouse cursor over the parameters of the Repository class, Visual Studio Code shows all the available options, including the histogram_diff flag.
• Value of data validation: When processing commit data, correct categorization and filtering techniques are essential for valuable analysis.
• Effective debugging practices: Unusual results should always be logically cross-checked, and at times seeking advice (e.g., from a TA) can save time.
• Revalidation of data processing and visualization skills: Relearning Pandas and Matplotlib reminded me how to process datasets effectively and produce insightful plots.
Through this lab assignment, I improved my problem-solving skills and gained a better appreciation for repository mining and analysis.

Conclusion
This lab exercise illustrated the usefulness of PyDriller as a code change analysis and software repository mining tool. Through the use of the Myers and Histogram diff algorithms, we were able to quantify their effects on different types of artifacts. We found that, in most cases, the diff produced using the Myers algorithm and the diff produced using the Histogram algorithm were the same.
Also, more information regarding the diff algorithm implementation should be available in PyDriller's documentation.
Overall, this lab exercise gave me valuable hands-on experience in repository mining, data processing, and visualization, thereby teaching me fundamental concepts of software analysis and version control.

Links:
Pydriller documentation: link

Analysis framework: link

SEART GitHub Search Engine

References:
[1] D. Spadini, M. Aniche and A. Bacchelli, “PyDriller: Python Framework for Mining Software
Repositories,” ACM (Association for Computing Machinery), New York, USA, 2018.

[2]

-x-x-x-x-x-x-x-x-x-x-x-x-x-x-x-x-x-x-x-x- End of Lab 03 Report -x-x-x-x-x-x-x-x-x-x-x-x-x-x-x-x-x-x-x-x-

Course: Software Tools & Techniques for CSE (CS202)

Date: 30-Jan-2025
Name: Nishit Prajapati
ID: 24120036
Lab 04 Report
Exploring cyclomatic complexity (MCC) changes in Open-Source
Repositories

Overview
In this lab assignment, we are going to perform some tasks using the tool 'PyDriller'. Using this tool, we are going to mine a software repository and fetch data about it such as the old file path, new file path, commit hash, parent commit hash, commit message, diff using the histogram algorithm, the old file's McCabe cyclomatic complexity, and the new file's McCabe cyclomatic complexity. After fetching all this data, we are going to analyze it and plot the CFG (Control Flow Graph) of the most frequently changed source code file. We also need to find the top 3 most frequently changed source code files and plot the changes in cyclomatic complexity values along the timeline of software evolution.
After this lab work, we will be able to:

• Fetch the cyclomatic complexity of all files in the selected repository using the PyDriller tool.
• Generate the Control Flow Graph of the source code.
• Visualize the changes in cyclomatic complexity.

Introduction and Tools


In this section, we are not going to introduce the PyDriller tool because it has already been
discussed in the previous report. Now, what is cyclomatic complexity? Cyclomatic complexity
is a software measure to determine program control flow complexity. It approximates the
number of linearly independent paths through a program, making it simpler to estimate the
maintainability and understandability of a program.
McCabe's cyclomatic complexity is derived from the program's control flow graph (CFG),
where:

• Nodes represent program parts.
• Edges represent the control flow.
• The equation to calculate cyclomatic complexity 𝑉(𝐺) is: 𝑉(𝐺) = 𝐸 − 𝑁 + 2𝑃

Where:
𝐸 = the number of edges in the graph.
𝑁 = the number of nodes in the graph.
𝑃 = the number of connected components (usually 1 for a single program).
Source of the above information: Lecture 4
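As a quick worked example (the exact node and edge counts depend on how the CFG is drawn): a function containing a single if/else has a CFG with 4 nodes (the condition, the two branches, and the join) and 4 edges, with P = 1, so 𝑉(𝐺) = 4 − 4 + 2(1) = 2, i.e., two linearly independent paths, one per branch.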

Setup
We don't have much to set up in this lab assignment, as we are just going to modify the code we used in Lab 3. Still, in the final part of this assignment we are going to use several libraries such as pandas, matplotlib, networkx, and Counter, so I installed all of them using the "pip install <library name>" command in the VS Code terminal.
I copied all the code used in Lab 3 and pasted it into the Lab 4 folder on my system.

Methodology and Execution


Now, for this lab, I am going to use the same repository as in Lab 3.
After copying all the code from Lab 3, I modified the getCommitsInfo.py file in the following way:

• Modified the columns to extract the corresponding data through PyDriller.

• Columns like "old_file path", "new_file path", "commit SHA", "parent commit SHA", "commit message", and "diff_hist" are filled using the fields which PyDriller provides. The execution for all of these is the same as described in the Lab 3 report, so it is not repeated here.
• Now, we had to fetch the cyclomatic complexity for both the old file and the new file, and PyDriller's 'complexity' field only calculates the cyclomatic complexity of the new file. So I created a temporary folder, stored the old file's source code in that folder, and used the lizard library to calculate the cyclomatic complexity of the old file stored there (a sketch of this workaround follows the list). I know we were instructed to use only PyDriller's complexity field to calculate the old file's cyclomatic complexity, but that field does not take the old file's source code as a parameter, i.e., we cannot calculate the old file's MCC using PyDriller's 'complexity' field. Therefore, I used lizard to calculate the old file's MCC, as it reads the old file's source code and then calculates the cyclomatic complexity. The cyclomatic complexity of the new file is still calculated using PyDriller's 'complexity' field only.
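Below is a minimal sketch of this workaround, assuming PyDriller's ModifiedFile object (source_code_before, old_path, and complexity are real PyDriller fields; the temporary-file handling and the choice to sum per-function complexities into a file-level value are my own assumptions).

import os
import tempfile
import lizard

def old_file_mcc(modified_file):
    # Write the pre-change source to a temporary file and analyze it with lizard.
    if modified_file.source_code_before is None:   # e.g. the file was added in this commit
        return None
    suffix = os.path.splitext(modified_file.old_path or '')[1] or '.py'
    with tempfile.NamedTemporaryFile('w', suffix=suffix, delete=False, encoding='utf-8') as tmp:
        tmp.write(modified_file.source_code_before)
        tmp_path = tmp.name
    try:
        analysis = lizard.analyze_file(tmp_path)
        # Assumption: take the file-level MCC as the sum of per-function complexities.
        return sum(f.cyclomatic_complexity for f in analysis.function_list)
    finally:
        os.remove(tmp_path)

# The new-file MCC still comes from PyDriller's own field: modified_file.complexity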

After all these modifications, the getCommitsInfo.py file will create a commits_info.csv file which contains our required data.
Now, just use the command "./main.sh" in Git Bash. Remember to be in the same folder/directory where the main shell script is located; use the command "cd <path to the folder where the main shell script is located>" on the command line before running the main shell script with './main.sh'.
After generating the csv file, I used pandas to read the csv file and extract the top 3 most frequently changed source code files.
To do this, I followed this logic (a sketch follows the list):

• Filtered the 'new_file path' column using the data frame read from the csv file.
• Used a Counter over all the file paths (those ending with .py) to count how many times a particular file path appears in the 'new_file path' column.
• Extracted the top 3 file paths with the maximum count. This count reflects how many times the source code in that file path was changed.
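A minimal sketch of this counting step, assuming the 'new_file path' column name from commits_info.csv (variable names are illustrative):

from collections import Counter
import pandas as pd

df = pd.read_csv('commits_info.csv')

# Keep only Python source files and count how often each path appears.
py_paths = df['new_file path'].dropna()
py_paths = py_paths[py_paths.str.endswith('.py')]
change_counts = Counter(py_paths)

# Top 3 most frequently changed source files.
for path, count in change_counts.most_common(3):
    print(f'{path}: changed {count} times')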

Now, we were instructed to create the CFG (control flow graph) for the most frequently changed Python file. This is where I got stuck, but I was ultimately able to create the CFGs of the most frequently changed file.
To generate the CFGs, the course plan mentioned using the py2cfg library. So, first I did the following:

• Used the file path of the most frequently changed file, obtained from the code above, and compared it with the new file path extracted using PyDriller's new_path field. When they matched, I copied the source code of that file onto my system so that I could use it to generate CFGs.

I would like to mention that, as you can see in the above screenshot at line 45, we compute 'top_file_change_count − version_count' because PyDriller extracts commits from newest to oldest due to the flag "order = 'reverse'"; this logic makes sure that each version of the source code is named correctly.
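The version-saving step therefore looks roughly like the sketch below; the repository URL, the output folder, and the hard-coded change count are illustrative assumptions, while order='reverse', new_path, and source_code are real PyDriller options/fields.

import os
from pydriller import Repository

REPO = 'https://github.com/<owner>/<repo>'   # placeholder URL
top_file = 'remote_inference_engine.py'       # most frequently changed file
top_file_change_count = 16                    # count obtained from the earlier step
os.makedirs('versions', exist_ok=True)

version_count = 0
for commit in Repository(REPO, order='reverse').traverse_commits():
    for mod in commit.modified_files:
        if mod.new_path and mod.new_path.endswith(top_file) and mod.source_code:
            # Commits arrive newest-to-oldest, so number the versions backwards.
            version = top_file_change_count - version_count
            with open(f'versions/version_{version}.py', 'w', encoding='utf-8') as f:
                f.write(mod.source_code)
            version_count += 1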
Now, I read the py2cfg library documentation and wrote the code to generate the CFGs.
But when I used this library to create the CFGs, it gave an 'AttributeError'. I tried many things to resolve this error but didn't find a solution. After reading through the error message, I got to know that the problem was in the py2cfg library. To solve this issue, I initially took a suggestion from DeepSeek AI, which suggested modifying the library's source code. But I knew modifying the library source code would make things even more complicated, so I didn't follow its suggestion.

Screenshot of chat generated by DeepSeek AI.

So, I asked our course instructor, Shouvick Mondal, about this. He suggested that I change the repository and try again. And this worked!

I took a different repository named 'Oumi' to perform the tasks mentioned in this lab assignment. Why Oumi? Because it has more than 5k stars. I performed all the tasks again from the beginning on this repository: I generated a commits_info.csv containing all the data extracted by PyDriller and fetched the top 3 most frequently changed source code files.
Now, I defined the timeline as the number of commits. Therefore, the parent commit is the 'past', the current commit is the 'present', and the next commit is considered the 'future'; the number of commits grows as the software evolves. If a commit changes the source code, we consider the 'source code before the change' as the older version and the 'source code after the change' as the newer version. So, as mentioned in the documentation of py2cfg, I wrote the code below (a sketch follows this list):
• I used the file paths of all the source code versions we saved on the system using the above code and generated the CFGs of all versions of the most frequently changed file.
• After this, I removed all the saved versions of these source files.
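A minimal sketch of the CFG-generation step is given below; the CFGBuilder().build_from_file(...) and build_visual(...) calls follow the py2cfg examples I used, and the versions/ folder naming carries over from the earlier sketch (both are assumptions, not my exact code).

import glob
import os
from py2cfg import CFGBuilder

for src_path in sorted(glob.glob('versions/version_*.py')):
    name = os.path.basename(src_path)[:-3]              # e.g. 'version_6'
    cfg = CFGBuilder().build_from_file(name, src_path)
    # Writes the CFG image (e.g. version_6.png) next to the source file.
    cfg.build_visual(src_path[:-3], format='png', show=False)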

This generates the CFGs of all versions of the most frequently changed source code file.
After this, I wrote the code which generates the plot of changes in cyclomatic complexity values along the timeline of software evolution.
To generate the above-mentioned plot, I wrote code which (a sketch follows the list):

• Reads the commits_info.csv file using pandas and plots the data available in the 'old_file MCC' and 'new_file MCC' columns using the matplotlib library.
• Also calculates some basic statistics using the pandas library.
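A minimal sketch of that plotting and statistics code, assuming the 'old_file MCC' and 'new_file MCC' column names (the output file name is illustrative):

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('commits_info.csv')
old_mcc = pd.to_numeric(df['old_file MCC'], errors='coerce')
new_mcc = pd.to_numeric(df['new_file MCC'], errors='coerce')

# Timeline = commit index (the row order of the csv).
plt.scatter(df.index, old_mcc, s=10, label='old complexity')
plt.scatter(df.index, new_mcc, s=10, label='new complexity')
plt.xlabel('Commit (timeline)')
plt.ylabel("McCabe's cyclomatic complexity")
plt.legend()
plt.savefig('mcc_timeline.png')

# Basic statistics with pandas.
print(new_mcc.describe())                                # mean, std, min, max, quartiles
print('median old:', old_mcc.median(), 'median new:', new_mcc.median())
print('commits with increased MCC:', (new_mcc > old_mcc).sum())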

Therefore, I have done every task mentioned in the Lab 4 assignment except generating the
CFG properly.

Results and Analysis


Now, we will look at the results of all the code we have written so far. The Python code in 'getCommitsInfo.py' generated a 'commits_info.csv' file. This csv file contains data such as the old file path, new file path, commit hash, parent commit hash, commit message, cyclomatic complexity of the old file, and cyclomatic complexity of the new file, all extracted using the PyDriller tool.

If we look at the csv file, we can observe that some of the cells under the columns 'old_file MCC' and 'new_file MCC' are empty. This could happen for the following reasons:

• The file doesn't exist in the commit history
• The file is deleted in the commit
• Or the file is a non-code artifact.
Now, we used the data collected in this file to find the top 3 most frequently changed source code files by counting every file path which appears in the column 'new_file path'. The Python code gave the following result:

As you can see, in the last 500 non-merge commits, remote_inference_engine.py, test_train_e2e.py, and test_remote_inference_engine.py are the most frequently changed source code files. remote_inference_engine.py has been modified 16 times, test_train_e2e.py has been modified 16 times, and test_remote_inference_engine.py has been modified 15 times.
After this, we extracted the source code of remote_inference_engine.py for all of its 16 versions using PyDriller and generated CFGs using py2cfg. Further below, I have manually resized and included all 16 CFGs.
Now, our last code generated an image:

The above plot illustrates the increase in cyclomatic complexity over multiple commits by comparing old and new complexity values for each commit. Both old and new complexity values fluctuate heavily, with some commits showing extreme peaks in complexity. Several commits have low complexity, while others have sudden peaks, in a few cases reaching values above 600. The new complexity (orange dots) tends to have higher peaks than the old complexity (blue dots), suggesting that complexity rose in a few cases, which is expected because complexity increases as the software evolves.

Below are all the CFGs of all the versions of the remote_inference_engine.py file:

[Figure: CFGs of remote_inference_engine.py, Version 1 through Version 16]

To view all the images in good quality, click here.


Now, if you look at the nodes (see the yellow part along the width of the images), their number increases throughout software evolution, i.e., from version 1 to version 2 to version 3 and so on. This increase in nodes means more control statements get added to the source code. Also, if you look at the images labelled 'Version 5' and 'Version 6', there is a significant increase in nodes as well as branches. This means most control statements were added to the source code in the commit which updated the file from version 5 to version 6.

I also calculated some simple statistics by reading the commits_info.csv file. The calculated mean, median, maximum, minimum, and standard deviation provide numerical insight into complexity changes over the commits. The mean and median of the new complexity are higher than those of the old complexity, which indicates that complexity has increased overall. The number of commits with increased complexity is greater than the number with decreased complexity, suggesting a general increase in complexity over time. This rise in cyclomatic complexity suggests increasingly complex code, which is quite normal during development.
Thus, I have done every task mentioned in this lab assignment.

Discussion
In this lab exercise, I encountered some difficulties:
First, it was difficult to calculate cyclomatic complexity for old files, since PyDriller only provides complexity for new files. To resolve this, I had to save the old file to a temporary directory and use the Lizard library for the calculation.
When creating Control Flow Graphs (CFGs) using py2cfg, I encountered an AttributeError which took a long time to debug. After consulting my teacher, I fixed the issue by using a different repository.
PyDriller's output sometimes contained missing values, which made it difficult to analyze the complexity changes correctly. I had to account for cases where files were deleted, moved, or were not code.

From these difficulties, I learned:
The importance of investigating alternative tools when the ones provided do not work well: using Lizard enabled me to work around PyDriller's limitation in calculating old file complexity.
The value of good debugging habits, such as consulting experts rather than altering library code, to save effort and time.
The importance of carefully examining data when analyzing how repositories change, in order to obtain useful results. I also saw how software complexity tends to increase over time, which was clear from the CFGs and the plotted complexity changes.
Through this lab, I enhanced my skills in repository mining, control flow analysis, and data visualization, and gained a better understanding of how software complexity evolves.

Conclusion
This lab activity shows the usefulness of PyDriller in software repository mining and in analyzing the evolution of code complexity. By retrieving the commit history and calculating McCabe's cyclomatic complexity, I was able to monitor changes across versions of modified source files. The results showed that control complexity grows with software evolution as a result of more branching and logic.
Problems encountered in CFG generation underscored the importance of using appropriate
tools for analysis. Moreover, analysis of real-world repositories underscored the importance
of data validation because missing or inconsistent values influence conclusions.
This lab was beneficial in providing practical experience in repository analysis, measurement
of complexity, and data-driven decision-making. It deepened my knowledge in software
maintainability and why code complexity grows as software evolves.

Links:
Lecture 4

Pydriller documentation: link

Analysis framework: link

Py2cfg documentation: link


Link to all CFGs

-x-x-x-x-x-x-x-x-x-x-x-x-x-x-x-x-x-x-x-x- End of Lab 04 Report -x-x-x-x-x-x-x-x-x-x-x-x-x-x-x-x-x-x-x-x-

