Lecture 3 (Mining Data)


Data Analytics in Software Engineering

(MSE669)

Dr. Assad Abbas

Associate Professor
Department of Computer Science
COMSATS University Islamabad, Islamabad Campus
Outline
n Mining data from software repositories
n Configuration management systems

June 23, 2024 2


Mining Data from Software Repositories
n One of the problems faced by the software engineering community is
scarcity of data for conducting empirical studies and performing analytics.
However, the software repositories can be mined to collect the data that
can be used for providing empirical results by validating various
techniques or methods.
n The evidence thus collected allows software researchers to
establish well-formed and generalized theories. The data obtained from
software repositories can be used to answer several questions, for
example:
5 Is design A better than design B?
5 Is process/method A better than process/method B?
5 What is the probability of occurrence of a defect or change in a module?
5 Is the effort estimation process accurate?
5 What is the time taken to correct a bug?
5 Is testing technique A better than testing technique B?



Mining Data from Software Repositories
n Data can be collected from various software sources:
5 Proprietary software
g Concern: Data is difficult to obtain, as companies are usually
unwilling to share the source code and information related
to the evolution of the software
5 Open-source software
g Concern: The software might be developed by
inexperienced students, so it is hard to establish
the accuracy and applicability of the data



Configuration Management Systems
n Configuration management systems are central to almost all
software projects developed by organizations.
n The aim of a configuration management system is to control
and manage changes that occur in all the artifacts produced
during the software development life cycle.
n The artifacts (also known as deliverables) produced during
the software development life cycle include:
5 software requirement specification
5 software design document
5 source code
5 user manuals, and so on



Configuration Management Systems
n A configuration management system also controls any
changes incurred in these artifacts.
n Typically, configuration management consists of three
activities:
5 configuration identification
5 configuration control
5 configuration accounting



Configuration Management Systems
n Configuration identification
5 Every software project artifact produced during the software
development life cycle is uniquely named
5 The following terminologies are related to configuration identification:
g Release: The first issue of a software artifact is called a release. This
usually provides most of the functionalities of a product but may contain
many bugs and thus is prone to issue fixing and enhancements.
g Versions: Significant changes incurred in the software project’s artifacts
are called versions. Each version tends to enhance the functionalities of a
product, or fix some critical bugs reported in the previous version. New
functionalities may or may not be added.
g Editions: Minor changes or revisions incurred in the software artifacts are
termed editions. As opposed to a version, an edition does not usually introduce
significant enhancements or fix critical issues reported in the
previous version. Rather, small fixes and patches are introduced.



Configuration Management Systems
n Configuration Control
5 Configuration control is a critical process within versioning or configuration
management.
5 This process incorporates the approval, control, and implementation of
changes to the software project artifact(s), or to the software project itself.
5 Its primary purpose is to ensure that every change incurred to any software
artifact is carried out with the knowledge and approval of the software project
management team.





Configuration Management Systems
n Configuration Accounting
5 Configuration accounting is the process that is responsible for
keeping track of each activity, including changes, and any action that
affects the configuration of a software product artifact, or the software
product itself.
5 A typical configuration status report includes:
g A list of software artifacts under versioning.
g The date on which the baseline of each version was established.
g Specifications that describe each artifact under versioning.
g History of changes incurred in the baseline.
g Open change requests for a given artifact.
g Deficiencies discovered by reviews and audits.
g The status of approved changes.



Importance of Mining Software Repositories
n The importance of mining software repositories has historically been overlooked.
n In recognition of the need for effective mining techniques, the mining
software repositories (MSR) field is now receiving attention from
software engineering practitioners.
n The MSR field focuses on analyzing and cross-linking the rich and
valuable data stored in the software repositories to discover
interesting and applicable information about various software
systems as well as projects.
n Benefits of mining the data from repositories are:
5 Apply analytics to enhance maintenance of the software system
5 Apply analytics for future development
5 Empirical validation of techniques and methods
5 Supporting software reuse
5 Proper allocation of testing and maintenance resources



[Figure: Data analysis procedure after mining software repositories]



Common Types of Software Repositories
n Historical Repositories
5 Historical repositories record varied information
regarding the evolution and progress of a software
project.
5 They also capture significant historical dependencies
prevalent between various artifacts of a project, such
as functions (in the source code), documentation
files, or configuration files
5 Developers can possibly employ the information
extracted from these historical repositories for various
purposes
g For example, two modules A and B of a program may
co-change, that is, a change to A is usually accompanied by a change to B.



Common Types of Software Repositories
n Historical Repositories
5 Historical repositories include:
g source control repositories,
g bug repositories,
g archived communications.



Common Types of Software Repositories
n Historical Repositories
5 Source Control Repositories
g Record and maintain the development trail of a project.
g Track every change incurred in any of the artifacts of a software
system, such as the source code, documentation manuals, and so
on.
g Maintain the metadata regarding each change, for instance, the
developer or project member who carried out the change, the
timestamp when the change was performed, and a short
description of the change.
g Some popular source code repositories:
u Git
u CVS
u Subversion (SVN)
u Perforce
u ClearCase
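
The change metadata described above (who, when, and a short description) can be sketched in code. The snippet below is a minimal illustration, not a definitive mining tool: it assumes the history was exported from Git with `git log --pretty=format:"%H|%an|%at|%s"`, and the sample log text, the `Change` class, and the `parse_log` function are hypothetical names introduced only for this example.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class Change:
    commit_id: str       # identifier of the change
    author: str          # developer who carried out the change
    timestamp: datetime  # when the change was performed
    description: str     # short description of the change


def parse_log(text: str) -> list[Change]:
    """Parse lines of the form '<hash>|<author>|<unix time>|<subject>'."""
    changes = []
    for line in text.strip().splitlines():
        # maxsplit=3 keeps any '|' inside the description intact
        commit_id, author, unix_time, description = line.split("|", 3)
        changes.append(Change(
            commit_id,
            author,
            datetime.fromtimestamp(int(unix_time), tz=timezone.utc),
            description,
        ))
    return changes


# Hypothetical sample, shaped like the assumed `git log` export
SAMPLE_LOG = """\
a1b2c3|Alice|1700000000|Fix null check in parser
d4e5f6|Bob|1700003600|Update user manual
"""

changes = parse_log(SAMPLE_LOG)
for c in changes:
    print(c.author, c.timestamp.isoformat(), c.description)
```

An author name that itself contains `|` would break this simple split; a real export would need a safer separator.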



Common Types of Software Repositories
n Historical Repositories
5 Bug Repositories
g Track and maintain the resolution history of defect/bug
reports, which provide valuable information regarding
the bugs that were reported by the users of a large
software project, as well as the developers of that
project. Bugzilla and Jira are commonly used bug
repositories.



Common Types of Software Repositories
n Historical Repositories
5 Archived Communication Repositories
g Archived communications record discussions regarding the
various aspects of a software project during its life cycle,
held through channels such as mailing lists, emails, instant
messages, and internet relay chats (IRCs).



Run-Time Repositories or Deployment Logs
n Run-time repositories, also known as deployment
logs, record information regarding the execution of a
single deployment, or different deployments of a
software system.
5 For example, run-time repositories may record the
error messages reported by a software application at
varied deployment sites
n Run-time repositories can possibly be employed to
determine the execution anomalies by discovering
dominant execution or usage patterns across various
deployments, and recording the deviations observed
from such patterns.
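
The idea of finding a dominant execution pattern and recording deviations from it can be sketched as follows. This is a simplified illustration under stated assumptions: each deployment log is reduced to a short sequence of event names, the "dominant pattern" is taken to be the most frequent sequence, and the site names, events, and `find_deviations` function are hypothetical.

```python
from collections import Counter


def find_deviations(logs: dict[str, list[str]]):
    """Return the most common event sequence and the sites that deviate from it."""
    patterns = {site: tuple(events) for site, events in logs.items()}
    # The dominant pattern is the sequence observed at the most deployments
    dominant, _ = Counter(patterns.values()).most_common(1)[0]
    deviating = sorted(site for site, p in patterns.items() if p != dominant)
    return dominant, deviating


# Hypothetical deployment logs, one event sequence per site
deployment_logs = {
    "site-a": ["start", "load_config", "serve"],
    "site-b": ["start", "load_config", "serve"],
    "site-c": ["start", "error:config_missing", "shutdown"],
    "site-d": ["start", "load_config", "serve"],
}

dominant, deviating = find_deviations(deployment_logs)
print("dominant pattern:", dominant)
print("deviating sites:", deviating)
```

Real logs are far noisier, so practical approaches usually compare frequencies or statistical profiles rather than exact sequences.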



Version Control Systems (VCS)
n VCS, also known as source control systems or simply versioning
systems, are systems that track, and record changes incurred to a
single artifact or a set of artifacts of a software system.
n Every change, no matter how big or small, is recorded over time so
that we may recall specific revisions or versions of the system
artifacts later.
n Terms associated with VCS
5 Revision numbers: VCSs typically use revision numbers to distinguish between
different versions of the software artifacts (internal versioning)
5 Release numbers: indicate different releases of the products (to the
customer)
5 Baseline or trunk: A baseline is the approved version or revision of a
software artifact from which changes can be made subsequently. It is
also called trunk or master.
5 Tag: Whenever a new version of a software product is released, a
symbolic name, called the tag, is assigned to the revision numbers of
current software artifacts.
5 Branch: Branching, in version control and software configuration
management, is the duplication of an object under version control
(such as a source code file or a directory tree).
5 Each object can thereafter be modified separately and in parallel so
that the objects become different. In this context the objects are called
branches. The users of the version control system can branch any
branch.



Version Control Systems (VCS)
n The major functionalities provided by a VCS include
the following:
5 Revert project artifacts back to a previously recorded
and maintained state
5 Revert the entire software project back to a
previously recorded state
5 Review any change made over time to any of the
project artifacts
5 Retrieve metadata about any change, such as the
developer or project member who last modified any
artifact that might be causing a problem, and more
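
The functionalities listed above can be illustrated with a toy, in-memory revision store. This is only a sketch assuming whole-project snapshots per revision; it is not how any particular VCS is implemented, and `ToyRepository`, `Revision`, and the sample artifacts are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class Revision:
    number: int
    author: str
    message: str
    snapshot: dict[str, str]  # artifact name -> content at this revision


@dataclass
class ToyRepository:
    revisions: list[Revision] = field(default_factory=list)
    working: dict[str, str] = field(default_factory=dict)

    def commit(self, author: str, message: str) -> int:
        """Record the current state of all artifacts as a new revision."""
        number = len(self.revisions) + 1
        self.revisions.append(Revision(number, author, message, dict(self.working)))
        return number

    def revert(self, number: int) -> None:
        """Restore the whole project to a previously recorded state."""
        self.working = dict(self.revisions[number - 1].snapshot)

    def last_modified_by(self, artifact: str):
        """Find the author of the latest revision that changed the artifact."""
        for i in range(len(self.revisions) - 1, -1, -1):
            current = self.revisions[i].snapshot.get(artifact)
            previous = self.revisions[i - 1].snapshot.get(artifact) if i > 0 else None
            if current != previous:
                return self.revisions[i].author
        return None


repo = ToyRepository()
repo.working["main.c"] = "v1"
repo.commit("alice", "initial import")
repo.working["main.c"] = "v2"
repo.working["README"] = "docs"
repo.commit("bob", "update main.c, add README")
repo.working["README"] = "better docs"
repo.commit("carol", "improve README")

print(repo.last_modified_by("main.c"))  # author of the last change to main.c
repo.revert(1)                          # back to the first recorded state
print(repo.working)
```

Storing full snapshots keeps the sketch short; real systems typically store changes more compactly.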



Version Control Systems (VCS)
n Classification of VCS
5 Local Version Control System
5 Centralized Version Control System
5 Distributed Version Control System



Version Control Systems (VCS)
n Local Version Control System
5 Employ a simple database that records
and maintains all the changes to
artifacts of the software project under
revision control
5 However, the user cannot collaborate
with other users on other systems, as
the database is local and is not
maintained centrally.
5 Each user has his/her own copy of the
different revisions of project artifacts,
and thus there are consistency and data
sharing problems.
5 Moreover, if one user loses the
versioning data, it cannot be
recovered unless backups are
made from time to time
Version Control Systems (VCS)
n Centralized Version Control
System (CVCS)
5 The main aim of CVCS is to allow
the user to easily collaborate with
different users on other systems.
5 These systems, such as CVS,
Perforce, and SVN, employ a
single centralized server that
records and maintains all the
versioned artifacts of a software
project under revision control, and
there are several clients or users
that check out (obtain) the project
artifacts from that central server



Version Control Systems (VCS)
n Distributed Version Control System
5 To overcome the limitations of CVCS,
distributed VCS (DVCS) were introduced.
5 As opposed to CVCS, a DVCS (such as
Bazaar, Darcs, Git, and Mercurial) ensures
that the clients or users do not just obtain or
check out the latest revision or snapshot of
the project artifacts, but clone, mirror, or
download the entire software project
repository to obtain the artifacts.
5 Thus, if any server of the DVCS fails or its
data is corrupted or lost, any of the
software project repositories stored on a
client machine can be uploaded to the
server as a backup to restore it.
5 Therefore, every checkout carried out by a
client is essentially a complete backup of
the entire software project data.



Bug Tracking System
n A bug tracking system is a software application built to keep track of the
defects, bugs, or issues that arise during the software development life
cycle.
n It is a type of issue tracking system.
n Bug tracking systems are commonly employed by a large number of open-source
software (OSS) projects, and most of them allow users to file various types
of defect reports directly.
n Crucial information regarding the bugs reported by the users and/or
developers is stored in a database
n The information about a bug typically includes the following:
5 The time when the bug was reported in the software system
5 Severity of the reported bug
5 Behavior of the source program/module in which the bug was encountered
5 Details on how to reproduce that bug
5 Information about the person who reported that bug
5 Developers who are possibly working to fix that bug, or will be assigned the
job to do so
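
The fields listed above map naturally onto a record type, and such records can answer questions like "what is the time taken to correct a bug?". The sketch below is illustrative only: the `BugReport` fields, the `resolved_at` attribute, and the sample data are assumptions and do not reflect the schema of any specific tracker.

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import mean
from typing import Optional


@dataclass
class BugReport:
    bug_id: int
    reported_at: datetime           # time when the bug was reported
    severity: str                   # severity of the reported bug
    module: str                     # module in which the bug was encountered
    steps_to_reproduce: str         # details on how to reproduce the bug
    reporter: str                   # person who reported the bug
    assignee: Optional[str] = None  # developer working on the fix, if any
    resolved_at: Optional[datetime] = None  # assumed extra field: resolution time


def mean_days_to_fix(reports: list[BugReport]) -> Optional[float]:
    """Average resolution time in days over resolved bugs only."""
    durations = [
        (r.resolved_at - r.reported_at).total_seconds() / 86400
        for r in reports
        if r.resolved_at is not None
    ]
    return mean(durations) if durations else None


# Hypothetical reports
reports = [
    BugReport(1, datetime(2024, 1, 1), "major", "parser", "open empty file",
              "user1", "alice", datetime(2024, 1, 3)),
    BugReport(2, datetime(2024, 1, 5), "minor", "ui", "resize window",
              "user2", "bob", datetime(2024, 1, 9)),
    BugReport(3, datetime(2024, 1, 8), "critical", "parser", "load large file",
              "user3"),
]

average = mean_days_to_fix(reports)
print(average)
```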





Static Source Code Analysis
n Several facts and information can be extracted from
the source code.
n The source code of a software system may be easily
obtained by “cloning” the source code repository
where that system is hosted.
n Cloning simply means copying the entire
software repository from a server (usually a remote
server) to the end user’s system.
n Source code is a very crucial artifact of any software
system, which can be used to reveal interesting
findings for that system through effective analysis.



Static Source Code Analysis
n Levels of Detail
5 Method level
5 Class level
5 File level
5 System level
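
The levels of detail above can be illustrated with a small static analysis of Python source using the standard `ast` module. This is a sketch under assumptions: the chosen metrics (lines per method, methods per class, counts per file) are examples picked for illustration, and `SOURCE` and `measure` are hypothetical names.

```python
import ast

# Hypothetical source file to analyze
SOURCE = '''
class Stack:
    def push(self, item):
        self.items.append(item)

    def pop(self):
        if not self.items:
            raise IndexError("empty")
        return self.items.pop()

def helper():
    return 1
'''


def measure(source: str) -> dict:
    """Collect simple size metrics at method, class, and file level."""
    tree = ast.parse(source)
    report = {"methods": {}, "classes": {}, "file": {}}
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef):
            # Method level: number of source lines spanned by the function
            report["methods"][node.name] = node.end_lineno - node.lineno + 1
        elif isinstance(node, ast.ClassDef):
            # Class level: number of methods defined directly in the class
            report["classes"][node.name] = sum(
                isinstance(child, ast.FunctionDef) for child in node.body
            )
    # File level: totals for this one file
    report["file"]["functions"] = len(report["methods"])
    report["file"]["classes"] = len(report["classes"])
    return report


report = measure(SOURCE)
print(report)
```

System-level figures would come from aggregating the file-level results over every file in the repository. Functions with the same name in different classes would collide in this simple dictionary.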



Software Historical Analysis
n Software historical repositories record several kinds of
information regarding the evolution and progress of a
software project.
n Historical repositories may be mined to extract useful
information for future trend analysis and other research
areas



Applications of Software Historical Analysis
n Understanding Dependencies in a System
5 Dependencies are the relationships or interconnections between
software components or modules that affect their functionality, quality,
and maintainability
5 Dependencies can introduce various challenges and risks for
software evolution, such as increased complexity, reduced modularity,
unexpected side effects, and increased defect proneness. Therefore,
it is important to identify, analyze, and resolve dependencies in a
software system to improve its design, quality, and performance.



Applications of Software Historical Analysis
n Understanding Dependencies in a System
5 Historical software analysis helps understand dependencies in a
software system by using various techniques and methods, such as:
g Mining historical repositories for relevant information, such as the origin,
developer, and motive of dependencies, the frequency and impact of
changes, the defect history and severity, and the communication and
collaboration patterns.
g Attaching historical sticky notes to code dependencies, which are
annotations that record the historical information of dependencies and
provide insights into their rationale and evolution.
g Explaining and analyzing the unexpected, circular, or hidden
dependencies that may exist in a software system and their causes and
consequences.
g Exposing the temporal and evolutionary dimensions of dependencies and
how they change over time and across versions.
g Predicting the location of future bugs or defects based on the historical
information of dependencies and the design flaws detected in the source
code.
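
As one concrete case, circular dependencies can be detected once the dependencies have been extracted into a graph. The sketch below uses a depth-first search; the module names and the `find_cycle` function are hypothetical, and it assumes the dependency graph has already been mined from the code or its history.

```python
from typing import Optional


def find_cycle(deps: dict[str, list[str]]) -> Optional[list[str]]:
    """Return one dependency cycle as a list of modules, or None if there is none."""
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the current path, finished
    state = {module: WHITE for module in deps}
    path: list[str] = []

    def visit(module: str) -> Optional[list[str]]:
        state[module] = GREY
        path.append(module)
        for dep in deps.get(module, []):
            if state.get(dep, WHITE) == GREY:
                # dep is already on the current path, so the path closes a cycle
                return path[path.index(dep):] + [dep]
            if state.get(dep, WHITE) == WHITE:
                cycle = visit(dep)
                if cycle:
                    return cycle
        path.pop()
        state[module] = BLACK
        return None

    for module in deps:
        if state[module] == WHITE:
            cycle = visit(module)
            if cycle:
                return cycle
    return None


# Hypothetical dependency graph: module -> modules it depends on
dependencies = {
    "ui": ["core"],
    "core": ["storage"],
    "storage": ["core", "util"],
    "util": [],
}

cycle = find_cycle(dependencies)
print(cycle)
```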



Applications of Software Historical Analysis
n Change Impact Analysis
5 Change impact analysis aims to predict
the impact of a change on the rest of the system
5 Predicting the impact can help with decision making,
program understanding, and cost estimation
5 Software repositories store information about bugs,
components, and developers
5 This information can be used to predict the affected
modules/classes and the suitable developers



Applications of Software Historical Analysis
n Change Propagation
5 Change propagation is the process of spreading
changes from one module to others in a software
system
5 Historical co-changes can be used to identify the
modules that are likely to co-change in the future
5 Change propagation is done after change impact
analysis
5 Historical dependencies are more effective than
traditional information for change propagation
5 Historical data can be used for change propagation
from code to documentation
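
Historical co-changes can be quantified with a simple confidence measure: of the commits that changed module A, what fraction also changed module B? The sketch below is one possible formulation borrowed from association rule mining, not necessarily the one used in the source text; the commit data and `co_change_confidence` are hypothetical.

```python
from collections import Counter
from itertools import combinations


def co_change_confidence(commits: list[list[str]]) -> dict[tuple[str, str], float]:
    """Map (a, b) to the fraction of commits touching a that also touched b."""
    changes = Counter()  # how often each file changed
    pairs = Counter()    # how often each pair of files changed together
    for files in commits:
        unique = sorted(set(files))
        changes.update(unique)
        pairs.update(combinations(unique, 2))
    confidence = {}
    for (a, b), together in pairs.items():
        confidence[(a, b)] = together / changes[a]
        confidence[(b, a)] = together / changes[b]
    return confidence


# Hypothetical commit history: the files changed in each commit
commits = [
    ["A.java", "B.java"],
    ["A.java", "B.java", "README"],
    ["A.java"],
    ["B.java", "README"],
]

confidence = co_change_confidence(commits)
for pair, value in sorted(confidence.items()):
    print(pair, round(value, 2))
```

A high confidence for (README, B.java) here suggests that a future change to README is likely to be accompanied by a change to B.java, which is the kind of hint change propagation can use.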



Applications of Software Historical Analysis
n Defect Proneness
5 Defect proneness is the probability that a software
component contains defects or bugs
5 Defect prediction results can help software
practitioners allocate resources and prioritize testing
5 Defect information is extracted from repositories and
used to create defect prediction models
5 Defects can be categorized by severity levels and
predicted using different ML methods
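
A defect prediction model can be sketched with a very small classifier. The example below uses k-nearest neighbours purely as a simple stand-in for the ML methods mentioned above; the two features (lines changed, complexity), the training data, and `predict_defective` are assumptions made for illustration.

```python
import math
from collections import Counter


def predict_defective(training, features, k=3):
    """Predict a label by majority vote among the k closest training modules."""
    nearest = sorted(training, key=lambda row: math.dist(row[0], features))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical history: ((lines changed, complexity), 1 if defective else 0)
training = [
    ((10, 2), 0),
    ((12, 3), 0),
    ((15, 2), 0),
    ((80, 9), 1),
    ((95, 12), 1),
    ((70, 10), 1),
]

risky = predict_defective(training, (85, 10))
safe = predict_defective(training, (11, 2))
print(risky, safe)
```

The features are left unscaled to keep the sketch short; in practice, metrics on very different scales would normally be normalized first.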



Applications of Software Historical Analysis
n User and Team Dynamics Understanding
5 Archived communications are used by users and team
members for various discussions related to large software
projects
5 These discussions contain historical information about the
internal operations and workings of the projects
5 Mining these discussions can help to analyze and understand
the dynamics of large software development teams
5 Bug repositories store bug reports for large software projects
5 Bug reports need to be prioritized and assigned to developers
5 Past bug reports can help to speed up the bug-prioritization
process and find the most suitable developers
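
Finding a suitable developer from past bug reports can be approximated by counting who fixed the most bugs in each module. This is a deliberately naive sketch; the history data and the `build_fix_history` and `suggest_developer` functions are hypothetical.

```python
from collections import Counter, defaultdict


def build_fix_history(resolved_bugs):
    """Count, per module, how many bugs each developer has fixed."""
    history = defaultdict(Counter)
    for module, developer in resolved_bugs:
        history[module][developer] += 1
    return history


def suggest_developer(history, module):
    """Suggest the developer with the most past fixes in the module, if any."""
    if module not in history:
        return None
    return history[module].most_common(1)[0][0]


# Hypothetical resolved bug reports: (module, developer who fixed the bug)
resolved = [
    ("parser", "alice"),
    ("parser", "alice"),
    ("parser", "bob"),
    ("ui", "carol"),
]

history = build_fix_history(resolved)
print(suggest_developer(history, "parser"))
print(suggest_developer(history, "network"))
```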



Applications of Software Historical Analysis
n Change Prediction
5 Change proneness is the probability of a software
component changing in the future
5 Change proneness prediction is important for
software maintenance and testing
5 Change proneness prediction can be based on OO
metrics and historical analysis
5 Historical repositories provide rich data for change
proneness prediction
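
A rough empirical estimate of change proneness is the fraction of past revisions in which a component changed. The sketch below assumes the history has been reduced to one set of changed components per revision; the component names and `change_proneness` are hypothetical.

```python
def change_proneness(revisions, component):
    """Fraction of recorded revisions in which the component was changed."""
    if not revisions:
        return 0.0
    changed = sum(1 for files in revisions if component in files)
    return changed / len(revisions)


# Hypothetical history: the components changed in each revision
revisions = [
    {"Order", "Cart"},
    {"Order"},
    {"Invoice"},
    {"Order", "Invoice"},
]

for name in ("Order", "Invoice", "Cart"):
    print(name, change_proneness(revisions, name))
```

Prediction models would typically combine such historical rates with OO metrics rather than rely on the rate alone.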



Applications of Software Historical Analysis
n Mining Textual Descriptions
5 Mining textual descriptions is the process of extracting
useful information from defect reports in software
repositories
5 Text mining techniques can be used to analyze defect
descriptions and find relevant attributes, such as severity,
features, price, ratings, etc.
5 Text mining can help with making strategic decisions, such
as finding effective defect detection techniques, establishing
relationships between defect attributes, and comparing
mobile apps from different perspectives
5 Text mining studies have been conducted on various data
sources, such as OSS defect-tracking systems and
the BlackBerry app store
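
A first step in mining defect descriptions is to tokenize them and count terms per severity level. The sketch below is a minimal illustration; the stop-word list, the sample reports, and the function names are assumptions, and real studies use considerably richer preprocessing.

```python
import re
from collections import Counter, defaultdict

# Small, hypothetical stop-word list
STOP_WORDS = {"the", "a", "on", "in", "when", "is", "to", "of", "and"}


def tokenize(text):
    """Lowercase the text and keep alphabetic words that are not stop words."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]


def terms_by_severity(reports):
    """Count term frequencies separately for each severity level."""
    terms = defaultdict(Counter)
    for severity, description in reports:
        terms[severity].update(tokenize(description))
    return terms


# Hypothetical defect reports: (severity, description)
defect_reports = [
    ("critical", "Application crash when saving the file"),
    ("critical", "Crash on startup"),
    ("minor", "Typo in the settings dialog"),
]

terms = terms_by_severity(defect_reports)
print(terms["critical"].most_common(1))
print(terms["minor"].most_common(3))
```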



Applications of Software Historical Analysis
n Social Network Analysis
5 Social network analysis is a technique to find and
measure relationships between people
5 Social network analysis can be used for MSR to discover
software development information, such as roles,
contributions, and associations
g For example, using logs to group developers and measure
their contributions at a module level
g Visualization tools can also support knowledge sharing
and collaboration across projects and companies by
analyzing data from sources such as SourceForge.
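
Grouping developers from logs can be sketched as building a collaboration graph in which two developers are linked if they changed the same module. The log entries and `collaboration_graph` below are hypothetical, and "changed the same module" is only one possible definition of a relationship.

```python
from collections import defaultdict
from itertools import combinations


def collaboration_graph(log):
    """Link every pair of developers who changed at least one common module."""
    by_module = defaultdict(set)
    for developer, module in log:
        by_module[module].add(developer)
    graph = defaultdict(set)
    for developers in by_module.values():
        for a, b in combinations(sorted(developers), 2):
            graph[a].add(b)
            graph[b].add(a)
    return graph


# Hypothetical change log: (developer, module changed)
change_log = [
    ("alice", "parser"),
    ("bob", "parser"),
    ("bob", "ui"),
    ("carol", "ui"),
    ("dave", "docs"),
]

graph = collaboration_graph(change_log)
# Degree: number of distinct collaborators per developer
degree = {dev: len(neighbours) for dev, neighbours in graph.items()}
print(degree)
```

Developers with no collaborators, such as the one who only touched `docs` here, do not appear in the graph at all.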



Applications of Software Historical Analysis
n Change Smells and Refactoring
5 Software historical data can be used for refactoring
code based on change smells
5 Change smells are signs of structural problems in
code that need reengineering or refactoring
5 Change smells can be identified by graph
visualization or logical couplings
5 Two examples of change smells are man-in-the-middle
and data containers
5 Different refactoring methods can be applied to
remove or improve change smells



Source
n Malhotra, Ruchika. Empirical Research in Software
Engineering: Concepts, Analysis, and Applications.
CRC Press, 2016.

