Python Meetup Talk 21/07/2009

This document discusses using data mining techniques on data stored in software repositories to improve software quality. The goal is to analyze support tools data to discover patterns and relationships that can reduce errors, provide documentation, and support IDEs. Data mining methods like clustering, classification, and association rules are applied to repository data on issues, revisions, authors and more. The results show high success predicting errors in software projects. A prototype was created using tools like Weka, Proximity, and APIs to access version control data.

Uploaded by

purbon
Copyright
© Attribution Non-Commercial (BY-NC)

Introduction

Data Mining
And the results are
A vision over the present and the future

Mining Software Repositories


Improving software

Pere Urbón Bayes

Data Management Group


Dept. Arquitectura de Computadors
Universitat Politècnica de Catalunya

July 2009

Pere Urbón Bayes Mining Software Repositories



Index

Introduction
Data Mining
The results
The future


The problem

Companies need highly available and reliable software.
Low-quality software harms both clients and producers.
Unfortunately, avoiding defects is a difficult task.
Project leaders need to keep an eye on many projects.
Software engineers tend not to document software in depth.
The complexity of software projects is growing every day.


The software development process


Support tools

Tools used to support software development:

Version control server.
Bug tracker server.
Project management server.
Life cycle management software.
...

These tools store a huge amount of information during the process. Why not use this information to improve our software?


Objective and Applications

Objectives:
Analyse the application of data mining technology to data stored in
support tools, with the aim of improving software quality.
Develop an experimental prototype tool.
Applications:
Reduce the error rate.
Provide a previously unexploited source of documentation.
Provide a new source of support tools for IDEs.


Data mining

Type of database analysis that attempts to discover useful patterns
or relationships in a group of data. The analysis uses advanced
statistical methods, such as cluster analysis, and sometimes
employs artificial intelligence or neural network techniques. A
major goal of data mining is to discover previously unknown
relationships among the data, especially when the data come from
different databases.


Methods

Types:
Traditional data mining (K-Means, C4.5, Bayesian networks).
Relational data mining (ILP, Markov logic networks,
relational Bayesian methods, dependency networks).
Categories:
Clusterers.
Classifiers.
Association rules.
Network models.
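
As a rough illustration of the "clusterer" category, the sketch below runs a tiny K-Means in plain Python over made-up per-file measurements (lines of code and revision count). The function name `k_means`, the sample data and the choice of initial centroids are assumptions made for this example; the prototype described in this talk relied on Weka, not on this code.

```python
import math

def k_means(points, centroids, iterations=10):
    """Tiny K-Means: assign each point to its nearest centroid, then recompute centroids."""
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # New centroid = mean of the cluster members; keep the old one if a cluster is empty.
        centroids = [
            tuple(sum(c) / len(cluster) for c in zip(*cluster)) if cluster else old
            for cluster, old in zip(clusters, centroids)
        ]
    return centroids, clusters

# Hypothetical (LOC, revisions) pairs, not data from the talk.
files = [(120, 3), (150, 4), (130, 2), (2400, 40), (2600, 55), (2500, 48)]
centroids, clusters = k_means(files, centroids=[files[0], files[3]])
print(clusters)
```

With this toy input the small, rarely changed files and the large, frequently changed files end up in separate clusters, which is the kind of grouping a clusterer would surface in repository data.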


Issue detection

LOC              DefectAppearence2Month  RevisionsAuthor
LineAddedIRLAdd  ReportedI2Month         Revision2Month
LineAddedIRLDel  Revision3Month          Releases
AlterType        DefectAppearence3Month  ReportedI1Month
AgeMonths        ReportedI3Month         ReportedIssues
RevisionAge      Revision5Month          ReportedI5Month
DefectReleases   DefectAppearence5Month
Revision1Month   DefectAppearance1Month

Question: Does this file contain an undetected error? The exact number of
errors can be predicted too.
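
The attribute names above suggest counts taken over time windows, such as revisions in the last N months. A minimal sketch of how attributes of that kind might be derived from one file's revision history follows. The meaning given here to `AgeMonths` and `RevisionNMonth`, the 30-day month and the sample dates are all assumptions for illustration, not the definitions used in the talk.

```python
from datetime import date, timedelta

def revisions_in_last_months(revision_dates, months, today):
    """Count revisions committed within the last `months` months (30-day months, an assumption)."""
    cutoff = today - timedelta(days=30 * months)
    return sum(1 for d in revision_dates if cutoff <= d <= today)

def build_attributes(revision_dates, today):
    """Assemble a small attribute dictionary for one file (assumed semantics)."""
    first = min(revision_dates)
    return {
        "AgeMonths": (today - first).days // 30,
        "Revision1Month": revisions_in_last_months(revision_dates, 1, today),
        "Revision2Month": revisions_in_last_months(revision_dates, 2, today),
        "Revision3Month": revisions_in_last_months(revision_dates, 3, today),
    }

# Hypothetical commit dates for a single file.
history = [date(2009, 1, 10), date(2009, 5, 2), date(2009, 6, 25), date(2009, 7, 15)]
attributes = build_attributes(history, today=date(2009, 7, 21))
print(attributes)
```

One such dictionary per file, together with a label saying whether a defect was later reported, would form the rows that a classifier is trained on.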


Other types of objectives

Predict bugs related to a software developer.
Predict bugs in software components.
These techniques could be applied to different topics:
Software understanding.
Software evolution.
Software visualization.
Change propagation.
Impact analysis.
Software complexity.
Fault prediction.


Error prediction

                      Eclipse Project   Firefox Project
Correctly classified  94.65%            94.822%
Kappa statistic       0.893             0.8883
Precision             0.9465            0.9482
Recall                0.945             0.949
AUC ROC               0.9682            0.9808

                      Eclipse-Firefox   Firefox-Eclipse
Correctly classified  82.0065%          87.975%
Kappa statistic       0.5976            0.7595
Precision             0.818             0.894
Recall                0.82              0.88
AUC ROC               0.805             0.83
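
The figures above are standard classifier evaluation measures. As a hedged sketch, the code below derives accuracy, Cohen's kappa, precision and recall from a binary confusion matrix. The counts are invented for illustration and are not the Eclipse or Firefox results; precision and recall are computed for the "defective" class only, which may differ from the averaging used to produce the slide's numbers.

```python
def evaluate(tp, fp, fn, tn):
    """Return accuracy, Cohen's kappa, precision and recall for a 2x2 confusion matrix."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    # Agreement expected by chance, from the row and column marginals.
    expected = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / total ** 2
    kappa = (accuracy - expected) / (1 - expected)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {"accuracy": accuracy, "kappa": kappa,
            "precision": precision, "recall": recall}

# Hypothetical counts: 90 defective files found, 10 missed,
# 5 false alarms, 95 clean files correctly left alone.
metrics = evaluate(tp=90, fp=5, fn=10, tn=95)
print(metrics)
```

Kappa corrects the raw accuracy for agreement expected by chance, which is why it is reported alongside the percentage of correctly classified instances.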


The final application


The Prototype

Software being used:

Programming: Java.
Database: MySQL and MonetDB.
Data mining: Weka 3.6 and Proximity 4.3.
XML: Apache Xerces 2.9.1.
SVN, CVS: svnkit 1.3.0 for SVN; the netbeans-cvs library and a
custom RCS file parser for CVS.
Presentation: Prefuse Visualization Toolkit and Weka
drawing facilities.


Could Python give us the same?

Machine learning:
Orange: As of 1.0 this library has many interesting and useful
methods: classification, regression and clustering. The most
similar to Weka.
PyML: Only has classifier facilities.
Shogun: Only for support vector machines.
RPy: An interface to R.
Databases:
The most important relational databases are available via
DB-API.
ZODB: Zope Object Database.
Metakit: An embedded database without a clearly defined paradigm.
Pygr: Python graph database framework for bioinformatics.
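
To make the DB-API point concrete, here is a small sketch using `sqlite3` from the standard library, which follows DB-API 2.0. The table layout and the rows are hypothetical stand-ins for data pulled from a version control server. Other DB-API drivers follow the same connect/cursor/execute pattern, although the parameter placeholder style (`?` here) varies between drivers.

```python
import sqlite3

# In-memory database, used only for this illustration.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE revision (file TEXT, author TEXT, revision INTEGER)")

# Hypothetical rows standing in for repository data.
rows = [
    ("core/parser.py", "alice", 101),
    ("core/parser.py", "bob", 107),
    ("ui/window.py", "alice", 103),
]
cur.executemany("INSERT INTO revision VALUES (?, ?, ?)", rows)
conn.commit()

# Revisions per file, a simple input for later mining steps.
cur.execute("SELECT file, COUNT(*) FROM revision GROUP BY file ORDER BY file")
revisions_per_file = cur.fetchall()
print(revisions_per_file)
conn.close()
```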


Could Python give us the same?

Presentation:
Graph drawing: NetworkX, with nice results. There are some
others, but they look incomplete.
GUI: PyQt, wxPython, PyGTK. It's a matter of taste XD!
SVN, CVS processing:
SVN: pysvn - Python interface to Subversion.
CVS: It seems nothing is available.
Git: PyGit - Pythonic git bindings targeted towards
porcelains.
XML processing can be done with the built-in support, using any
SAX or DOM parser.
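
As an example of the built-in XML support, the sketch below uses the standard library DOM parser (`xml.dom.minidom`) to read a log shaped like the output of `svn log --xml`. The log entries are invented and the helper `text_of` is a hypothetical name introduced for this example; a real run would read the XML produced by Subversion instead of an inline string.

```python
from xml.dom import minidom

# Sample shaped like the output of `svn log --xml`; the entries are invented.
SVN_LOG = """<?xml version="1.0"?>
<log>
  <logentry revision="42">
    <author>alice</author>
    <date>2009-07-01T10:15:00.000000Z</date>
    <msg>Fix null check in parser</msg>
  </logentry>
  <logentry revision="43">
    <author>bob</author>
    <date>2009-07-03T08:30:00.000000Z</date>
    <msg>Add export dialog</msg>
  </logentry>
</log>"""

def text_of(entry, tag):
    """Return the text of the first <tag> child of a log entry, or an empty string."""
    nodes = entry.getElementsByTagName(tag)
    if not nodes or nodes[0].firstChild is None:
        return ""
    return nodes[0].firstChild.data

document = minidom.parseString(SVN_LOG)
entries = [
    (int(e.getAttribute("revision")), text_of(e, "author"), text_of(e, "msg"))
    for e in document.getElementsByTagName("logentry")
]
print(entries)
```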


The future

Known issues:
Data preprocessing performance.
Database performance: is the relational model valid?
Dynamic procedure addition.
The to-do list:
Develop new procedures for different related topics, such as
software visualization, change support, etc.
Develop more mature software. Python could help in some
parts. This software must be easily extensible.
Improve the performance of the whole process.


The end

Questions?

Pere Urbón Bayes


Data Management Group
Dept. Arquitectura de Computadors
Universitat Politècnica de Catalunya