Jira Python
Jira Python
This Task consist of two parts. In the first part (Task 1.1), you will create a dataset with the issues from the JIRA
repository. In the second one (Task 1.2), you will pre-process and clean the dataset.
In [1]:
"""
Example of using JIRA Python API
"""
jira = JIRA('https://jira.atlassian.com')
issue = jira.issue('JRA-9')
print issue.fields.project.key # 'JRA'
print issue.fields.issuetype.name # 'Suggestion'
print issue.fields.reporter.displayName # 'Mike Cannon-Brookes [Atlassian]'
JRASERVER
Suggestion
Mike Cannon-Brookes
In [2]:
jira = JIRA('https://jira.spring.io')
print [issue.key for issue in jira.search_issues('project=XD order by created desc', maxRes
Issues
Using the JIRA Python API, connect to the Spring XD JIRA repository (https://jira.spring.io) and create a
dataset with the following info about the issues:
Key
Assignee
Creator
Reporter
Created
Components
Description
Summary
Fix Versions
Subtask
Issuetype
Priority
Resolution
Resolution date
Status
Status Description
Updated
Versions
Watches
Story Points - Hint: the name of the field in which the story points are stored is not enough descriptive
in this case. Thus, you might have a look at the data on each field and their distribution in order to
understand where the story points are stored. Keep in mind that Story Points usually follow the
fibanacci series (i.e. ½, 1, 2, 3, 5, 8, 13, 20, 40, 100)
Important! Since the JIRA API allow you to connect directly with an online repository, there are some
restrictions to avoid the overload of the servers. For example, JIRA API only allow for extracting bulks of issues
(n=1000). Thus, it is important to take this into account when you are making request to some server.
In [3]:
jira = JIRA('https://jira.spring.io')
In [4]:
import pandas as pd
issues = pd.DataFrame()
In [5]:
# first solution
# the easy (and lazy) way
In [6]:
print len(issues_in_proj_1)
print len(issues_in_proj_2)
print len(issues_in_proj_3)
print len(issues_in_proj_4)
1000
1000
1000
706
In [7]:
# Second solution
# Getting all the issues in only one block
jira = JIRA('https://jira.spring.io')
block_size = 1000
block_num = 0
allissues = []
while True:
start_idx = block_num*block_size
issues = jira.search_issues('project=XD', start_idx, block_size)
if len(issues) == 0:
# Retrieve issues until there are no more to come
break
block_num += 1
for issue in issues:
#log.info('%s: %s' % (issue.key, issue.fields.summary))
allissues.append(issue)
In [8]:
## Into pandas
import pandas as pd
issues = pd.DataFrame()
print len(issues)
issues.head()
3706
Out[9]:
Keerthi
[Stream 2017-05- My project 7 node cluster
2 None kanth []
Module] 19T21:28:43.000+0000 and in that 2 node a...
Nagaraj
See
Janne 2017-03- Mark
3 [] https://github.com/spring- [1.3
Valkealahti 21T16:54:43.000+0000 Pollack
projects/spring-...
In [10]:
Here, you have to extract data about the changes in the issues (number of fields to extract=6):
Issue Key. The key of the issue that have been updated.
Author: The key of the author who have made the change.
Date. The timestamp indicating when the change have been done.
Field. The updated field.
From. The field value before the update.
To. The field value after the update.
In [11]:
import pandas as pd
changelog = pd.DataFrame()
In [12]:
changelog.head()
Out[12]:
Fix stream
Mark 2017-03- XD-
3 summary jira None failover on None
Pollack 22T18:27:00.900+0000 3765
YARN
In [14]:
issues = pd.read_csv('issues-xd.csv')
changelog = pd.read_csv('changelog-xd.csv')
In [15]:
issues.columns
Out[15]:
In [16]:
df = issues[ ~issues['key'].isin(to_remove['key']) ]
Remove all the issues whose story points have been updated more than once, since they can
represent misleading information. According to most of the estimation techniques, Story points must
have been assigned once and never updated afterward.
In [17]:
# Filtering all the user stories that have been updated in the story points field
issues = df
df = issues[ ~issues['key'].isin(to_remove['key']) ]
Remove all the issues that have not been addressed. We consider that an issue is addressed when its
Status is set to Closed (or similar, e.g. Fixed, Completed) and its resolution field is set to Fixed (or
similar, e.g. Done, Completed).
In [18]:
print issues['status.name'].unique()
print issues['resolution'].unique()
In [19]:
issues = df
Remove all the issues whose Story Points or Description fields have been updated after they were
addressed. Note that fields such as Title and Description may be adjusted or updated at any given
time. However, once the issue is addressed updates rarely happen. Issue report that have been
updated after they were addressed; they are likely to be unstable
In [20]:
changelog.columns
Out[20]:
issues = df
to_remove = []
for ix, line in issues.iterrows():
date = pd.to_datetime(line['resolution.date'])
key = line['key']
if ( pd.notnull(date) ):
key_remove = changelog[ (changelog['key'] == key) &
((changelog['field'] == 'description') | (changelog['field']
(pd.to_datetime(changelog['date']) > date) ]['key']
to_remove.append(key_remove)
df = issues[ ~issues['key'].isin(to_remove) ]
C:\Anaconda2\envs\gl-env\lib\site-packages\pandas\core\dtypes\missing.py:28
9: RuntimeWarning: tp_compare didn't return -1 or -2 for exception
if left.shape != right.shape:
Remove all the issues whose story points are not in the Fibonacci series (i.e. ½, 1, 2, 3, 5, 8, 13, 20,
40, 100)
In [22]:
issues = df
df = issues[ issues['storypoints'].isin(FIBO_CARDS) ]
Remove all the issues whose informative fields are updated after the story points initialization, since
they are considered as unstable issues. We define informative fields: Issue Type, Description,
Summary, and Component/s.
In [23]:
changelog['field'].unique()
Out[23]:
array(['Fix Version', 'status', 'resolution', 'summary', 'Attachment',
'description', 'Version', 'Component', 'priority', 'issuetype',
'Rank', 'labels', 'Link', 'Pull Request URL', 'assignee', 'Sprint',
'Epic Link', 'Actual Story Points', 'Acceptance Criteria',
'Story Points', 'RemoteIssueLink', 'Parent', 'Comment',
'Epic Child', 'Epic Name', 'Parent Issue', 'Epic Colour',
'reporter', 'Reference URL', 'environment', 'Workflow', 'Key',
'project', 'Out of Scope', 'Epic Status', 'Time Spent', 'WorklogId',
'timespent', 'timeestimate', 'Project'], dtype=object)
In [24]:
issues = df
to_remove = []
for ix, line in issues.iterrows():
date = pd.to_datetime(line['created'])
key = line['key']
key_remove = changelog[ (changelog['key'] == key) &
( (changelog['field'] == 'description') | (changelog['field'] ==
(changelog['field'] == 'summary') | (changelog['field'] == 'Co
(pd.to_datetime(changelog['date']) > date) ]['key']
to_remove.append(key_remove)
df = issues[ ~issues['key'].isin(to_remove) ]
In [25]:
issues = df
import numpy as np
# issues = issues[~pd.isnull(issues['storypoints'])]
df = remove_outliers(df, 'storypoints')
issues = df
cachedStopWords = stopwords.words("english")
# extra cleaning
df = issues[ pd.notnull(issues['description']) ]
issues = df
C:\Anaconda2\envs\gl-env\lib\site-packages\ipykernel\__main__.py:14: Unicode
Warning: Unicode equal comparison failed to convert both arguments to Unicod
e - interpreting them as being unequal
In [27]:
issues[['description', 'summary']].head(2)
Out[27]:
description summary
0 The jobs appear Executions section Jobs spring... How I make job restartable spring xd
Remove the code snippets from the title and the summary of the issues. The code snippets can be
easily identify by looking for the tag <code></code>
In [28]:
import re
original = pd.read_csv('issues-xd.csv')
pd.options.display.max_colwidth = -1
print
#print after
print issues[ issues['key'] == 'XD-3751']['description']
Remove all the completed issues whose don't have any asignee.
In [30]:
df = issues[issues['assignee'].notnull()]
In [31]:
issues = df
Like in Task 1, this Task consist of two parts. In the first part (Task 2.1), you will create a dataset from the
repository whereas in the second one (Task 2.2) you will pre-process and clean the dataset.
You can create your 'Personal API Access Token' from the Settings menu in your GitHub account.
https://github.com/settings/tokens (https://github.com/settings/tokens)
We’ll opt to take advantage of a Python library so that we can avoid some of the tedious details involved in
making requests, parsing responses, and handling pagination. In this particular case, we’ll use PyGithub,
which can be installed with the somewhat predictable pip install PyGithub.
https://github.com/PyGithub/PyGithub (https://github.com/PyGithub/PyGithub)
In [32]:
"""
Using PyGithub to query a particular repository
"""
Contributors
In [33]:
Number of contributors: 44
Contributors' stats
Stats from contributors
http://pygithub.readthedocs.io/en/latest/github_objects/StatsContributor.html#statscontributor
(http://pygithub.readthedocs.io/en/latest/github_objects/StatsContributor.html#statscontributor)
In [36]:
stats_contributors = repo.get_stats_contributors()
stats_contributors_df = pd.DataFrame()
for sc in stats_contributors:
for w in sc.weeks:
d = {
"week" : w.w,
"additions" : w.a,
"deletions" : w.d,
"commits" : w.c,
"author" : sc.author.login,
"total" : sc.total
}
stats_contributors_df = stats_contributors_df.append(d, ignore_index=True)
In [37]:
Size: 11616
Out[37]:
Save to a file
In [38]:
# Saving to a file
In [39]:
# Aggregated stats
Out[39]:
author
Repositories
In [40]:
repositories = pd.DataFrame()
for i, c in enumerate(contributors):
print "Contributor: ", c.login
repos = c.get_repos()
print repos
for r in repos:
d = {
'contributor' : c.login,
'lang' : r.language,
'owner' : r.owner.login,
'repo' : r.name
}
repositories = repositories.append(d, ignore_index=True)
print "Processed ", i+1, " contributors."
print "Rate limit remaining", client.rate_limiting
In [41]:
# save to file
repositories.to_csv("contributors-xd.csv", index=False, encoding='utf-8')
Commits
In [42]:
commitsrepo = repo.get_commits()
In [43]:
import pandas as pd
commits = pd.DataFrame()
Save to a file
In [44]:
# save to file
commits.to_csv("commits-xd.csv", index=False, encoding='utf-8')
In [80]:
commits = pd.read_csv("commits-xd.csv")
In [81]:
commits.head()
Out[81]:
In [82]:
before 2097
after 2093
2 - Remove all the commits related to artifact release -- hint: the message begins with [artifactory-release]
In [83]:
commits = commits[~commits['message'].str.startswith('[artifactory-release]')]
before 2093
after 2064
3 - Remove commits from authors who have been commit only once
In [84]:
commits['author'].unique()
Out[84]:
array(['jvalkeal', 'markpollack', 'garyrussell', 'cppwfs', 'mbogoevici',
'ilayaperumalg', 'artembilan', 'dturanski', 'sabbyanandan',
'ghillert', 'trisberg', 'mminella', 'pperalta', 'aclement',
'BoykoAlex', 'ericbottard', 'LinkedList', 'sworisbreathing',
'htynkn', 'morfeo8marc', 'philwebb', 'fmarchand', 'smaldini',
'kdowbecki', 'twoseat', 'jbrisbin', 'thomasdarimont', 'agandhinit',
'markfisher', 'nebhale', 'spring-buildmaster', 'liujiong1982',
'sathiyas', 'wilkinsona', 'dsyer', 'fbiville', 'parikhkc',
'chrisjs', 'tekul', 'aeisenberg', 'jencompgeek', 'datianshi',
'kashyap-parikh', 'gregturn'], dtype=object)
In [85]:
In [86]:
stats.head()
Out[86]:
stats['author'].unique()
Out[90]:
In [91]:
authors_commits = stats.groupby(by='author')['commits'].sum()
In [92]:
authors_commits = authors_commits.reset_index()
In [93]:
authors_commits.head()
Out[93]:
author commits
0 BoykoAlex 2.0
1 LinkedList 1.0
2 aclement 19.0
3 aeisenberg 4.0
4 agandhinit 1.0
In [94]:
authors_commits[authors_commits['commits'] == 1]['author']
Out[94]:
1 LinkedList
4 agandhinit
6 chrisjs
17 htynkn
22 kashyap-parikh
31 parikhkc
32 philwebb
35 sathiyas
38 sworisbreathing
Name: author, dtype: object
In [95]:
cs = authors_commits[authors_commits['commits'] == 1]['author']
before 2064
Number of authors to delete from the stats: 9
after 2055
In [96]:
commits['author'].unique()
Out[96]:
array(['jvalkeal', 'markpollack', 'garyrussell', 'cppwfs', 'mbogoevici',
'ilayaperumalg', 'artembilan', 'dturanski', 'sabbyanandan',
'ghillert', 'trisberg', 'mminella', 'pperalta', 'aclement',
'BoykoAlex', 'ericbottard', 'morfeo8marc', 'fmarchand', 'smaldini',
'kdowbecki', 'twoseat', 'jbrisbin', 'thomasdarimont', 'markfisher',
'nebhale', 'spring-buildmaster', 'liujiong1982', 'wilkinsona',
'dsyer', 'fbiville', 'tekul', 'aeisenberg', 'jencompgeek',
'datianshi', 'gregturn'], dtype=object)
4 - Remove all the commits from changes that have been done to auxiliar files (i.e. gradle.properties, README,
docfiles)
In [97]:
to_remove = []
after 1967
In [98]:
commits.head()
Out[98]:
In [99]:
after 1967
Finally, save the cleaned dataset to a file.
In [100]:
# save to file
commits.to_csv("commits-xd-cleaned.csv", index=False, encoding='utf-8')