0% found this document useful (0 votes)

372 views24 pages

Extracting Data with JIRA Python API

The document discusses extracting issue data from the JIRA repository for the Spring XD project. It describes connecting to the JIRA API, retrieving issues in batches to avoid overloading servers, and saving the extracted issue details and changelog data to Pandas dataframes. The data includes issue fields like key, assignee, creator, components, description, and story points. The changelog extraction retrieves details of field changes like author, date, changed field, and previous/new values.

Uploaded by

sysrerun

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

372 views24 pages

Extracting Data with JIRA Python API

Uploaded by

sysrerun

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

Task 1.

Getting data from JIRA

In this part, you will extract the data you need to apply the data mining algorithms. The target is the Spring XD
project ([Link] which uses the JIRA project management tool
([Link] to manage their issues.

This Task consist of two parts. In the first part (Task 1.1), you will create a dataset with the issues from the JIRA
repository. In the second one (Task 1.2), you will pre-process and clean the dataset.

[INFO] Using the JIRA Python API

The JIRA Python API library ([Link] eases the use of the JIRA REST API from
Python and it has been used in production for years.

See the documentation for full details. [Link] ([Link]

In [1]:

"""
Example of using JIRA Python API
"""

from jira import JIRA

jira = JIRA('[Link]

issue = [Link]('JRA-9')
print [Link] # 'JRA'
print [Link] # 'Suggestion'
print [Link] # 'Mike Cannon-Brookes [Atlassian]'

JRASERVER
Suggestion
Mike Cannon-Brookes

In [2]:

# Getting the keys of the last 3 reported issues

jira = JIRA('[Link]
print [[Link] for issue in jira.search_issues('project=XD order by created desc', maxRes

[u'XD-3768', u'XD-3767', u'XD-3766']

Task 1.1 Data extraction from JIRA

Issues
Using the JIRA Python API, connect to the Spring XD JIRA repository ([Link] and create a
dataset with the following info about the issues:

Key
Assignee
Creator
Reporter
Created
Components
Description
Summary
Fix Versions
Subtask
Issuetype
Priority
Resolution
Resolution date
Status
Status Description
Updated
Versions
Watches
Story Points - Hint: the name of the field in which the story points are stored is not enough descriptive
in this case. Thus, you might have a look at the data on each field and their distribution in order to
understand where the story points are stored. Keep in mind that Story Points usually follow the
fibanacci series (i.e. ½, 1, 2, 3, 5, 8, 13, 20, 40, 100)

Important! Since the JIRA API allow you to connect directly with an online repository, there are some
restrictions to avoid the overload of the servers. For example, JIRA API only allow for extracting bulks of issues
(n=1000). Thus, it is important to take this into account when you are making request to some server.

See more details about this on [Link]

[Link] ([Link]
[Link])

In [3]:

# Connect with Spring

from jira import JIRA

jira = JIRA('[Link]

In [4]:

# it is a good idea to save the issues in a dataframe

import pandas as pd

issues = [Link]()
In [5]:

# first solution
# the easy (and lazy) way

issues_in_proj_1 = jira.search_issues('project=XD', startAt=0, maxResults=1000)

issues_in_proj_2 = jira.search_issues('project=XD', startAt=1000, maxResults=1000)
issues_in_proj_3 = jira.search_issues('project=XD', startAt=2000, maxResults=1000)
issues_in_proj_4 = jira.search_issues('project=XD', startAt=3000, maxResults=1000)

In [6]:

print len(issues_in_proj_1)
print len(issues_in_proj_2)
print len(issues_in_proj_3)
print len(issues_in_proj_4)

1000
1000
1000
706

In [7]:

# Second solution
# Getting all the issues in only one block

from jira import JIRA

jira = JIRA('[Link]

block_size = 1000
block_num = 0
allissues = []
while True:
start_idx = block_num*block_size
issues = jira.search_issues('project=XD', start_idx, block_size)
if len(issues) == 0:
# Retrieve issues until there are no more to come
break
block_num += 1
for issue in issues:
#[Link]('%s: %s' % ([Link], [Link]))
[Link](issue)

In [8]:

print 'Number of issues:', len(allissues)

Number of issues: 3706

In [9]:

## Into pandas

import pandas as pd

issues = [Link]()

for issue in allissues:

d = {
'key': [Link],
'assignee': [Link],
'creator' : [Link],
'reporter': [Link],
'created' : [Link],
'components': [Link],
'description': [Link],
'summary': [Link],
'fixVersions': [Link],
'subtask': [Link],
'issuetype': [Link],
'priority': [Link],
'resolution': [Link],
'[Link]': [Link],
'[Link]': [Link],
'[Link]': [Link],
'updated': [Link],
'versions': [Link],
'watches': [Link],
'storypoints': [Link].customfield_10142
}

issues = [Link](d, ignore_index=True)

print len(issues)
[Link]()

3706
Out[9]:

assignee components created creator description fixV

The jobs that appear

2017-07- abhineet
0 None [] under Executions section []
10T[Link].000+0000 kumar
...

Working with Spring-XD

2017-06- Manuel
1 None [] version []
26T[Link].000+0000 Jordan
[Link]\n...
assignee components created creator description fixV

Keerthi
[Stream 2017-05- My project 7 node cluster
2 None kanth []
Module] 19T[Link].000+0000 and in that 2 node a...
Nagaraj

See
Janne 2017-03- Mark
3 [] [Link] [1.3
Valkealahti 21T[Link].000+0000 Pollack
projects/spring-...

2017-03- Pavan I'm trying to run a Job on

4 None [Batch] []
06T[Link].000+0000 Srikar SpringXD and the jo...

Save your results to a file

In [10]:

issues.to_csv("[Link]", encoding='utf-8', header=True, index=False, line_terminator=

Extracting the changelog

There are two options to get the changelog.

1. Getting blocks of issues with additional information (e.g. the changelog)

2. Retrieve each issue individually with its additional information (e.g. the changelog)

If we opt for the first option (1)

issues = jira.search_issues('project=XD', start_idx, block_size, expand='changelo

g')

If we opt for the second option (2)

issue = [Link]('FOO-100', expand='changelog')

Here, you have to extract data about the changes in the issues (number of fields to extract=6):

Issue Key. The key of the issue that have been updated.
Author: The key of the author who have made the change.
Date. The timestamp indicating when the change have been done.
Field. The updated field.
From. The field value before the update.
To. The field value after the update.
In [11]:

import pandas as pd

changelog = [Link]()

for issue in allissues:

issue = [Link]([Link], expand='changelog')

achangelog = [Link]

changelog = [Link](d, ignore_index=True)

print "Number of records: ", len(changelog)

Number of records: 43222

In [12]:

[Link]()

Out[12]:

author date field fieldtype from fromString key to

Mark 2017-03- XD-

0 Fix Version jira None None 15491
Pollack 21T[Link].170+0000 3765

Mark 2017-03- XD-

1 status jira 10000 To Do 10001
Pollack 21T[Link].057+0000 3765

Mark 2017-03- XD-

2 resolution jira None None 8
Pollack 21T[Link].057+0000 3765

Fix stream
Mark 2017-03- XD-
3 summary jira None failover on None
Pollack 22T[Link].900+0000 3765
YARN

Krzysztof 2017-01- XD-

4 Attachment jira None None 23691
Noga 23T[Link].638+0000 3763

Save the dataset to a file

In [13]:

# backup! save to file

changelog.to_csv("[Link]", sep=',', encoding='utf-8', doublequote = True, header=

Task 1.2 Pre-processing and cleaning the data

In this part, you will have to apply the following criteria to clean the dataset. The criteria consist of 10 cleaning
steps.

In [14]:

issues = pd.read_csv('[Link]')
changelog = pd.read_csv('[Link]')

In [15]:

[Link]

Out[15]:

Index([u'assignee', u'components', u'created', u'creator', u'description',

u'fixVersions', u'issuetype', u'key', u'priority', u'reporter',
u'resolution', u'[Link]', u'[Link]',
u'[Link]', u'storypoints', u'subtask', u'summary', u'updated',
u'versions', u'watches'],
dtype='object')

Remove all the issues with null or zero values

In [16]:

to_remove = issues[ (issues['storypoints'] == 0.0) | ([Link](issues['storypoints']))]

df = issues[ ~issues['key'].isin(to_remove['key']) ]

print 'before/after =', len(issues), '/', len(df)

before/after = 3706 / 3541

Remove all the issues whose story points have been updated more than once, since they can
represent misleading information. According to most of the estimation techniques, Story points must
have been assigned once and never updated afterward.
In [17]:

# Filtering all the user stories that have been updated in the story points field
issues = df

to_remove = changelog[((changelog['field'] == 'Story Points') | ( changelog['field'] == 'Ac

df = issues[ ~issues['key'].isin(to_remove['key']) ]

print 'before/after =', len(issues), '/', len(df)

before/after = 3541 / 3064

Remove all the issues that have not been addressed. We consider that an issue is addressed when its
Status is set to Closed (or similar, e.g. Fixed, Completed) and its resolution field is set to Fixed (or
similar, e.g. Done, Completed).

In [18]:

print issues['[Link]'].unique()
print issues['resolution'].unique()

['To Do' 'Done' 'In PR' 'In Progress']

[nan 'Complete' 'Invalid' 'Works as Designed' 'Deferred' "Won't Fix"
'Cannot Reproduce' 'Incomplete' 'Duplicate' 'Fixed']

In [19]:

issues = df

df = issues[ ~issues['[Link]'].isin(['Done']) | ~issues['resolution'].isin(['Complete'

print 'before/after =', len(issues), '/', len(df)

before/after = 3064 / 1035

Remove all the issues whose Story Points or Description fields have been updated after they were
addressed. Note that fields such as Title and Description may be adjusted or updated at any given
time. However, once the issue is addressed updates rarely happen. Issue report that have been
updated after they were addressed; they are likely to be unstable

In [20]:

[Link]

Out[20]:

Index([u'author', u'date', u'field', u'fieldtype', u'from', u'fromString',

u'key', u'to', u'toString'],
dtype='object')
In [21]:

issues = df

to_remove = []
for ix, line in [Link]():
date = pd.to_datetime(line['[Link]'])
key = line['key']
if ( [Link](date) ):
key_remove = changelog[ (changelog['key'] == key) &
((changelog['field'] == 'description') | (changelog['field']
(pd.to_datetime(changelog['date']) > date) ]['key']
to_remove.append(key_remove)

df = issues[ ~issues['key'].isin(to_remove) ]

print 'before/after =', len(issues), '/', len(df)

C:\Anaconda2\envs\gl-env\lib\site-packages\pandas\core\dtypes\[Link]
9: RuntimeWarning: tp_compare didn't return -1 or -2 for exception
if [Link] != [Link]:

before/after = 1035 / 1035

Remove all the issues whose story points are not in the Fibonacci series (i.e. ½, 1, 2, 3, 5, 8, 13, 20,
40, 100)

In [22]:

issues = df

FIBO_CARDS=[0.5, 1, 2, 3, 5, 8, 13, 20, 40, 100]

df = issues[ issues['storypoints'].isin(FIBO_CARDS) ]

print 'before/after =', len(issues), '/', len(df)

before/after = 1035 / 943

Remove all the issues whose informative fields are updated after the story points initialization, since
they are considered as unstable issues. We define informative fields: Issue Type, Description,
Summary, and Component/s.

In [23]:

changelog['field'].unique()

Out[23]:
array(['Fix Version', 'status', 'resolution', 'summary', 'Attachment',
'description', 'Version', 'Component', 'priority', 'issuetype',
'Rank', 'labels', 'Link', 'Pull Request URL', 'assignee', 'Sprint',
'Epic Link', 'Actual Story Points', 'Acceptance Criteria',
'Story Points', 'RemoteIssueLink', 'Parent', 'Comment',
'Epic Child', 'Epic Name', 'Parent Issue', 'Epic Colour',
'reporter', 'Reference URL', 'environment', 'Workflow', 'Key',
'project', 'Out of Scope', 'Epic Status', 'Time Spent', 'WorklogId',
'timespent', 'timeestimate', 'Project'], dtype=object)
In [24]:

issues = df

to_remove = []
for ix, line in [Link]():
date = pd.to_datetime(line['created'])
key = line['key']
key_remove = changelog[ (changelog['key'] == key) &
( (changelog['field'] == 'description') | (changelog['field'] ==
(changelog['field'] == 'summary') | (changelog['field'] == 'Co
(pd.to_datetime(changelog['date']) > date) ]['key']
to_remove.append(key_remove)

df = issues[ ~issues['key'].isin(to_remove) ]

print 'before/after =', len(issues), '/', len(df)

before/after = 943 / 943

Remove outliers. Use Tukey’s fences.

In [25]:

issues = df

import numpy as np

def remove_outliers(df, column, k = 1.5):

first_quartile = [Link](df[column], 25)
third_quartile = [Link](df[column], 75)
iqr = third_quartile - first_quartile

minval = first_quartile - k*iqr

maxval = third_quartile + k*iqr

return df[(df[column] > minval) | (df[column] < maxval)].copy()

# issues = issues[~[Link](issues['storypoints'])]
df = remove_outliers(df, 'storypoints')

print 'before/after =', len(issues), '/', len(df)

before/after = 943 / 943

Remove the stop words from title and the summary.

In [26]:

issues = df

from [Link] import stopwords

cachedStopWords = [Link]("english")

# extra cleaning
df = issues[ [Link](issues['description']) ]

print 'before/after =', len(issues), '/', len(df)

issues = df

issues['description'] = issues['description'].apply( lambda x : ' '.join([word for word in

issues['summary'] = issues['summary'].apply( lambda x : ' '.join([word for word in [Link](

before/after = 943 / 869

C:\Anaconda2\envs\gl-env\lib\site-packages\ipykernel\__main__.py:14: Unicode
Warning: Unicode equal comparison failed to convert both arguments to Unicod
e - interpreting them as being unequal

In [27]:

issues[['description', 'summary']].head(2)

Out[27]:

description summary

0 The jobs appear Executions section Jobs spring... How I make job restartable spring xd

Working Spring-XD version [Link] admin config timezone command

1
Starti... work

Remove the code snippets from the title and the summary of the issues. The code snippets can be
easily identify by looking for the tag <code></code>

In [28]:

import re

original = pd.read_csv('[Link]')

issues = issues[ ~[Link](issues['description']) ]

for ix, line in [Link]():

m = [Link]('{code}(.*){code}', line['description'], flags=[Link])
if m:
[Link][ix, 'description'] = line['description'][:[Link](0)] + line['descriptio
# [Link][ix, 'code'] = line['description'][[Link](0):[Link](0)]
In [29]:

[Link].max_colwidth = -1

# the issue report XD-3751 has code in its description

print original[ original['key'] == 'XD-3751']['description']

print
#print after
print issues[ issues['key'] == 'XD-3751']['description']

17 In a case where reactor's ringbuffer is full and thus handling backpre

ssure by blocking `onNext`, shutdown phase where `onComplete` is send will c
ause a deadlock.\r\n\r\nThis is shown by a thread dump during a shutdown. Th
is will basically break further deployments for this stream in distributed m
ode while single node will show more errors during undeployment.\r\n\r\n{cod
e}\r\n"pool-7-thread-1" #58 prio=5 os_prio=0 tid=0x979fe800 nid=0x54de runna
ble [0x986ad000]\r\n [Link]: TIMED_WAITING (parking)\r\n\t
at [Link](Native Method)\r\n\tat [Link]
[Link]([Link])\r\n\tat [Link]
[Link]([Link])\r\n\t
at [Link](SingleProd
[Link])\r\n\tat [Link].
next([Link])\r\n\tat [Link]
[Link]([Link])\r\n\tat [Link].p
[Link]([Link])\r\n{code}
\r\n\r\n{code}\r\n"main-EventThread" #19 daemon prio=5 os_prio=0 tid=0x9b93a
400 nid=0x54b1 runnable [0x9aefe000]\r\n [Link]: TIMED_WAI
TING (parking)\r\n\tat [Link](Native Method)\r\n\tat [Link]
[Link]([Link])\r\n\tat react
[Link](SingleProducerSequ
[Link])\r\n\tat [Link]
[Link]([Link])\r\n\tat [Link].
[Link]([Link])\r\n\tat [Link]
[Link]([Link]
4)\r\n\tat [Link](RingBufferP
[Link])\r\n\tat [Link]
[Link]([Link])\r\n{code}\r\n\r\nI've b
een crafting workaround for this by trying to wait reactor stream/buffer to
get drained by gpdb and finally as last resort, forcing processor in reactor
to shutdown.
Name: description, dtype: object

17 In case reactor's ringbuffer full thus handling backpressure blocking

`onNext`, shutdown phase `onComplete` send cause deadlock. This shown thread
dump shutdown. This basically break deployments stream distributed mode sing
le node show errors undeployment. I've crafting workaround trying wait reac
tor stream/buffer get drained gpdb finally last resort, forcing processor re
actor shutdown.
Name: description, dtype: object

Remove all the completed issues whose don't have any asignee.
In [30]:

df = issues[issues['assignee'].notnull()]

print 'before/after =', len(issues), '/', len(df)

before/after = 869 / 222

Finally, save the cleaned dataset to a file.

In [31]:

issues = df

issues.to_csv("[Link]", sep=',', encoding='utf-8', doublequote = True, heade

Task 2 Getting data from Github

In this task, you will extract data from the same project used in Task 1 (the Spring-XD project), but now the
source will be the GitHub repository instead of the JIRA project management tool.

Like in Task 1, this Task consist of two parts. In the first part (Task 2.1), you will create a dataset from the
repository whereas in the second one (Task 2.2) you will pre-process and clean the dataset.

[INFO] Creating a GitHub API Connection

Since GitHub implements OAuth, getting API access involves creating an account followed by one of two
possibilities:

1. creating an application to use as the consumer of the API or

2. creating a Personal Access Token that will be linked directly to your account

You can create your 'Personal API Access Token' from the Settings menu in your GitHub account.
[Link] ([Link]

We’ll opt to take advantage of a Python library so that we can avoid some of the tedious details involved in
making requests, parsing responses, and handling pagination. In this particular case, we’ll use PyGithub,
which can be installed with the somewhat predictable pip install PyGithub.

[Link] ([Link]
In [32]:

"""
Using PyGithub to query a particular repository
"""

from github import Github

# XXX: Specify your own access token here

ACCESS_TOKEN = '08e610638148003f5098f191324375149ee14d72'

# Specify a username and repository of interest for that user.

USER = 'spring-projects'
REPO = 'spring-xd'

client = Github(ACCESS_TOKEN, per_page=100)

user = client.get_user(USER)
repo = user.get_repo(REPO)

print "Rate limit remaining", client.rate_limiting

#print "Rate limit reached. Reset time: ", [Link](client.rate_limiting_rese

Rate limit remaining (4998, 5000)

Task 2.1 Extracting the data

Using the PyGithub ([Link] API, connect to the Spring XD Github repository
and extract the following data:

Contributors: a list of users who contribute to the project

Repositories: a list of repositories in which the contributors are involved
Contributors' stats: Weekly stats from the contributors
Repository
Week
Number of additions
Number of deletions
Number of commits
Contributor
Commits:
Author
Date
Changes
Additions
Deletions
Filename
Comment_count
Message

Contributors
In [33]:

# Get a list of people who have contributed to the repo.

contributors = [ s for s in repo.get_contributors() ]

print "Number of contributors: ", len(contributors)

Number of contributors: 44

Contributors' stats
Stats from contributors
[Link]
([Link]

In [36]:

stats_contributors = repo.get_stats_contributors()

stats_contributors_df = [Link]()
for sc in stats_contributors:
for w in [Link]:
d = {
"week" : w.w,
"additions" : w.a,
"deletions" : w.d,
"commits" : w.c,
"author" : [Link],
"total" : [Link]
}
stats_contributors_df = stats_contributors_df.append(d, ignore_index=True)

In [37]:

print "Size: ", len(stats_contributors_df)

stats_contributors_df.head()

Size: 11616
Out[37]:

additions author commits deletions total week

0 0.0 LinkedList 0.0 0.0 1.0 2013-04-07

1 0.0 LinkedList 0.0 0.0 1.0 2013-04-14

2 0.0 LinkedList 0.0 0.0 1.0 2013-04-21

3 0.0 LinkedList 0.0 0.0 1.0 2013-04-28

4 0.0 LinkedList 0.0 0.0 1.0 2013-05-05

Save to a file
In [38]:

# Saving to a file

stats_contributors_df.to_csv("[Link]", index=False, encoding='utf-8')

In [39]:

# Aggregated stats

stats_contributors_df[['additions', 'author', 'commits', 'deletions']].groupby(by='author')

Out[39]:

additions commits deletions

author

BoykoAlex 383.0 2.0 84.0

LinkedList 5.0 1.0 5.0

aclement 13151.0 19.0 2109.0

aeisenberg 49221.0 4.0 98.0

agandhinit 244.0 1.0 69.0

Repositories
In [40]:

repositories = [Link]()
for i, c in enumerate(contributors):
print "Contributor: ", [Link]

repos = c.get_repos()

print repos

for r in repos:
d = {
'contributor' : [Link],
'lang' : [Link],
'owner' : [Link],
'repo' : [Link]
}
repositories = [Link](d, ignore_index=True)
print "Processed ", i+1, " contributors."
print "Rate limit remaining", client.rate_limiting

print "Repositories: ", len(repositories)

[Link]()
Rate limit remaining (4957, 5000)
Contributor: artembilan
<[Link] instance at 0x0000000046181C88>
Processed 27 contributors.
Rate limit remaining (4955, 5000)
Contributor: BoykoAlex
<[Link] instance at 0x00000000478C8A88>
Processed 28 contributors.
Rate limit remaining (4954, 5000)
Contributor: gregturn
<[Link] instance at 0x0000000046831048>
Processed 29 contributors.
Rate limit remaining (4952, 5000)
Contributor: kdowbecki
<[Link] instance at 0x0000000046E77448>
Processed 30 contributors.
Rate limit remaining (4951, 5000)
Contributor: morfeo8marc
<[Link] instance at 0x0000000046267408>
Processed 31 contributors.
( )
Save to a file

In [41]:

# save to file
repositories.to_csv("[Link]", index=False, encoding='utf-8')

Commits

In [42]:

commitsrepo = repo.get_commits()
In [43]:

import pandas as pd

commits = [Link]()

for i, commit in enumerate(commitsrepo):

try:
d = {
'author' : [Link] if [Link] is not None else commit.a
'changes' : [Link],
'additions' : [Link],
'deletions' : [Link],
'files' : [[Link] for f in [Link]],
'author_date' : [Link],
'commiter_date' : [Link],
'commiter' : [Link],
'message' : [Link],
'comments' : [comment for comment in commit.get_comments()]
}
commits = [Link](d, ignore_index=True)
except Exception, e: #[Link]:
print "Encountered an error fetching details of commit #", i, " Skipping."
print e

print "Number of commits: ", len(commits)

[Link]()
Encountered an error fetching details of commit # 1361 Skipping.
'NoneType' object has no attribute 'login'
Encountered an error fetching details of commit # 1362 Skipping.
'NoneType' object has no attribute 'login'
Encountered an error fetching details of commit # 1363 Skipping.
'NoneType' object has no attribute 'login'
Encountered an error fetching details of commit # 1364 Skipping.
'NoneType' object has no attribute 'login'
Encountered an error fetching details of commit # 1365 Skipping.
'NoneType' object has no attribute 'login'
Encountered an error fetching details of commit # 1369 Skipping.
'NoneType' object has no attribute 'login'
Encountered an error fetching details of commit # 1382 Skipping.
'NoneType' object has no attribute 'login'
Encountered an error fetching details of commit # 1385 Skipping.
'NoneType' object has no attribute 'login'
Encountered an error fetching details of commit # 1386 Skipping.
'NoneType' object has no attribute 'login'

Save to a file

In [44]:

# save to file
commits.to_csv("[Link]", index=False, encoding='utf-8')

Task 2.2 Pre-processing and cleaning the data

Apply the following criteria to clean the dataset:
Remove commits with null message
Remove all the commits related to artifact release -- hint: the message begins with [artifactory-release]
Remove commits from authors who have been commit only once
Remove all the commits from changes that have been done to auxiliar files (i.e. [Link],
README, docfiles)
Remove all the commits that are related to a given release

In [80]:

commits = pd.read_csv("[Link]")

In [81]:

[Link]()

Out[81]:

additions author author_date changes comments commiter commiter_date

spring- 2017-03-22 Spring 2017-03-22

0 1.0 2.0 []
buildmaster [Link] Buildmaster [Link]

spring- 2017-03-22 Spring 2017-03-22

1 1.0 2.0 []
buildmaster [Link] Buildmaster [Link]

2017-02-01 Janne 2017-02-01

2 5.0 jvalkeal 10.0 []
[Link] Valkealahti [Link]

2016-12-23 Mark 2016-12-23

3 1.0 markpollack 2.0 []
[Link] Pollack [Link]

2016-12-23 Mark 2016-12-23

4 4.0 markpollack 8.0 []
[Link] Pollack [Link]

1 - Remove commits with null message

In [82]:

print 'before' , len(commits)

commits = commits[ ~[Link](commits['message']) ]

print 'after' , len(commits)

before 2097
after 2093

2 - Remove all the commits related to artifact release -- hint: the message begins with [artifactory-release]
In [83]:

print 'before' , len(commits)

commits = commits[~commits['message'].[Link]('[artifactory-release]')]

print 'after' , len(commits)

before 2093
after 2064

3 - Remove commits from authors who have been commit only once

In [84]:

commits['author'].unique()

Out[84]:
array(['jvalkeal', 'markpollack', 'garyrussell', 'cppwfs', 'mbogoevici',
'ilayaperumalg', 'artembilan', 'dturanski', 'sabbyanandan',
'ghillert', 'trisberg', 'mminella', 'pperalta', 'aclement',
'BoykoAlex', 'ericbottard', 'LinkedList', 'sworisbreathing',
'htynkn', 'morfeo8marc', 'philwebb', 'fmarchand', 'smaldini',
'kdowbecki', 'twoseat', 'jbrisbin', 'thomasdarimont', 'agandhinit',
'markfisher', 'nebhale', 'spring-buildmaster', 'liujiong1982',
'sathiyas', 'wilkinsona', 'dsyer', 'fbiville', 'parikhkc',
'chrisjs', 'tekul', 'aeisenberg', 'jencompgeek', 'datianshi',
'kashyap-parikh', 'gregturn'], dtype=object)

In [85]:

# Reading from file

import pandas as pd
stats = pd.read_csv("[Link]")

In [86]:

[Link]()

Out[86]:

additions author commits deletions total week

0 0.0 LinkedList 0.0 0.0 1.0 2013-04-07

1 0.0 LinkedList 0.0 0.0 1.0 2013-04-14

2 0.0 LinkedList 0.0 0.0 1.0 2013-04-21

3 0.0 LinkedList 0.0 0.0 1.0 2013-04-28

4 0.0 LinkedList 0.0 0.0 1.0 2013-05-05

In [90]:

stats['author'].unique()

Out[90]:

array(['LinkedList', 'sworisbreathing', 'htynkn', 'philwebb', 'agandhinit',

'sathiyas', 'parikhkc', 'chrisjs', 'kashyap-parikh', 'artembilan',
'sabbyanandan', 'BoykoAlex', 'morfeo8marc', 'kdowbecki',
'thomasdarimont', 'nebhale', 'wilkinsona', 'fbiville', 'datianshi',
'gregturn', 'fmarchand', 'smaldini', 'twoseat', 'aeisenberg',
'liujiong1982', 'jbrisbin', 'pperalta', 'aclement', 'dsyer',
'jvalkeal', 'jencompgeek', 'mminella', 'spring-buildmaster',
'mbogoevici', 'cppwfs', 'dturanski', 'tekul', 'ghillert',
'ericbottard', 'markpollack', 'garyrussell', 'trisberg',
'markfisher', 'ilayaperumalg'], dtype=object)

In [91]:

authors_commits = [Link](by='author')['commits'].sum()

In [92]:

authors_commits = authors_commits.reset_index()

In [93]:

authors_commits.head()

Out[93]:

author commits

0 BoykoAlex 2.0

1 LinkedList 1.0

2 aclement 19.0

3 aeisenberg 4.0

4 agandhinit 1.0

In [94]:

authors_commits[authors_commits['commits'] == 1]['author']

Out[94]:

1 LinkedList
4 agandhinit
6 chrisjs
17 htynkn
22 kashyap-parikh
31 parikhkc
32 philwebb
35 sathiyas
38 sworisbreathing
Name: author, dtype: object
In [95]:

print 'before', len(commits)

cs = authors_commits[authors_commits['commits'] == 1]['author']

commits = commits[ ~commits['author'].isin(cs) ]

print 'after' , len(commits)

before 2064
Number of authors to delete from the stats: 9
after 2055

In [96]:

commits['author'].unique()

Out[96]:
array(['jvalkeal', 'markpollack', 'garyrussell', 'cppwfs', 'mbogoevici',
'ilayaperumalg', 'artembilan', 'dturanski', 'sabbyanandan',
'ghillert', 'trisberg', 'mminella', 'pperalta', 'aclement',
'BoykoAlex', 'ericbottard', 'morfeo8marc', 'fmarchand', 'smaldini',
'kdowbecki', 'twoseat', 'jbrisbin', 'thomasdarimont', 'markfisher',
'nebhale', 'spring-buildmaster', 'liujiong1982', 'wilkinsona',
'dsyer', 'fbiville', 'tekul', 'aeisenberg', 'jencompgeek',
'datianshi', 'gregturn'], dtype=object)

4 - Remove all the commits from changes that have been done to auxiliar files (i.e. [Link], README,
docfiles)

In [97]:

to_remove = []

for ix, line in [Link]():

if( ('[Link]' in line['files']) | ('README' in line['files']) ):
commits = [Link](index=ix)

print 'after' , len(commits)

after 1967
In [98]:

[Link]()

Out[98]:

additions author author_date changes comments commiter commiter_date

2017-02-01 Janne 2017-02-01

2 5.0 jvalkeal 10.0 []
[Link] Valkealahti [Link]

2016-12-23 Mark 2016-12-23

4 4.0 markpollack 8.0 []
[Link] Pollack [Link]

2016-05-03 Artem 2016-11-01

5 32.0 garyrussell 36.0 []
[Link] Bilan [Link]

2016-02-29 Artem 2016-03-01

6 123.0 garyrussell 134.0 []
[Link] Bilan [Link]

2016-02-23 Michael 2016-02-23

9 16.0 cppwfs 19.0 []
[Link] Minella [Link]

5 - Remove all the commits that are related to a given release

In [99]:

commits = commits[ ~commits['message'].[Link]('[Release') ]

print 'after' , len(commits)

after 1967
Finally, save the cleaned dataset to a file.

In [100]:

# save to file
commits.to_csv("[Link]", index=False, encoding='utf-8')

JIRA Admin Guide for IT Pros
No ratings yet
JIRA Admin Guide for IT Pros
33 pages
Create A New User User Management
No ratings yet
Create A New User User Management
4 pages
JIRA: Streamline Your Dev Team
No ratings yet
JIRA: Streamline Your Dev Team
2 pages
Top 20 JIRA Interview Questions (2023) - Javatpoint PDF
No ratings yet
Top 20 JIRA Interview Questions (2023) - Javatpoint PDF
31 pages
Acceleo User Guide
No ratings yet
Acceleo User Guide
56 pages
JiraEssentials SG APR272020
No ratings yet
JiraEssentials SG APR272020
220 pages
Understanding JIRA: Name and Usage
No ratings yet
Understanding JIRA: Name and Usage
1 page
Advanced Python Material
No ratings yet
Advanced Python Material
232 pages
Jira Mastery for Agile Project Managers
No ratings yet
Jira Mastery for Agile Project Managers
1 page
How To Use GitLab
No ratings yet
How To Use GitLab
8 pages
Jira Tables
No ratings yet
Jira Tables
6 pages
Configuring and Troubleshooting Permissions in Jira Slides and Transcript PDF
100% (1)
Configuring and Troubleshooting Permissions in Jira Slides and Transcript PDF
138 pages
Jira Interview Questions and Answers
No ratings yet
Jira Interview Questions and Answers
75 pages
Python Flask Cheat
No ratings yet
Python Flask Cheat
3 pages
JIRA Demos
No ratings yet
JIRA Demos
14 pages
Django/Python Framework
100% (5)
Django/Python Framework
57 pages
Git Commands for Developers
No ratings yet
Git Commands for Developers
2 pages
300+ TOP JIRA Interview Questions and Answers 2023
No ratings yet
300+ TOP JIRA Interview Questions and Answers 2023
14 pages
Jenkins & Jira in Project Management
No ratings yet
Jenkins & Jira in Project Management
19 pages
Beginner's Guide to JIRA Tool
No ratings yet
Beginner's Guide to JIRA Tool
5 pages
Jira Workflow Guide for Business Teams
100% (1)
Jira Workflow Guide for Business Teams
45 pages
Thingworx Devops
No ratings yet
Thingworx Devops
120 pages
ServiceNow-Jira Integration Guide
No ratings yet
ServiceNow-Jira Integration Guide
12 pages
Bkash Technical
No ratings yet
Bkash Technical
8 pages
Best Practices For Using Git
No ratings yet
Best Practices For Using Git
2 pages
Essential Git Commands and Workflow Guide
No ratings yet
Essential Git Commands and Workflow Guide
11 pages
Resume Krish
No ratings yet
Resume Krish
2 pages
Excel Automation with xlwings
No ratings yet
Excel Automation with xlwings
214 pages
Mastering JIRA - Sample Chapter
0% (1)
Mastering JIRA - Sample Chapter
28 pages
Django Training for Developers
No ratings yet
Django Training for Developers
45 pages
Atlassian ACP-100 Free Practice Exam & Test Training - ITExams - Com 3
No ratings yet
Atlassian ACP-100 Free Practice Exam & Test Training - ITExams - Com 3
1 page
Python 100 Days of Code Roadmap
100% (1)
Python 100 Days of Code Roadmap
13 pages
Django Rest Framework Json API
No ratings yet
Django Rest Framework Json API
21 pages
JIRA-Admin-Pt1-Lab Workbook
No ratings yet
JIRA-Admin-Pt1-Lab Workbook
184 pages
Overview of Bottle Python Framework
No ratings yet
Overview of Bottle Python Framework
18 pages
Pandas: Reference Sheet
No ratings yet
Pandas: Reference Sheet
9 pages
Python Bokeh Cheat Sheet
No ratings yet
Python Bokeh Cheat Sheet
1 page
Git Commands
No ratings yet
Git Commands
3 pages
My IaC AWS Multi-Account Provisioning BluePrint & Best Practices
No ratings yet
My IaC AWS Multi-Account Provisioning BluePrint & Best Practices
37 pages
GUI Design & MVC Principles
No ratings yet
GUI Design & MVC Principles
18 pages
JIRA-SFT Instance: Overview For Jira Project Leaders
100% (1)
JIRA-SFT Instance: Overview For Jira Project Leaders
17 pages
OnkarPramodKurle (3 0)
No ratings yet
OnkarPramodKurle (3 0)
7 pages
Essential Git Commands Cheat Sheet
50% (2)
Essential Git Commands Cheat Sheet
2 pages
Jboss Enterprise Application Platform: Troubleshooting Guide
No ratings yet
Jboss Enterprise Application Platform: Troubleshooting Guide
15 pages
Cheat Sheet: Types Modules
No ratings yet
Cheat Sheet: Types Modules
2 pages
Flask With Aws Cloudwatch
No ratings yet
Flask With Aws Cloudwatch
6 pages
IPython Dashboard Documentation 0.1.2
No ratings yet
IPython Dashboard Documentation 0.1.2
21 pages
Python Beyond Automate The Boring Stuff With Python - Real-World Automation & Mastery
No ratings yet
Python Beyond Automate The Boring Stuff With Python - Real-World Automation & Mastery
20 pages
Git Commands: Getting & Creating Projects
No ratings yet
Git Commands: Getting & Creating Projects
3 pages
Beginners Python Cheat Sheet PCC Classes PDF
100% (1)
Beginners Python Cheat Sheet PCC Classes PDF
2 pages
Python Programming Guide
No ratings yet
Python Programming Guide
43 pages
12 Comp Sci 1 Revision Notes Pythan Advanced Prog
No ratings yet
12 Comp Sci 1 Revision Notes Pythan Advanced Prog
5 pages
Full Stack Developer JAVA
No ratings yet
Full Stack Developer JAVA
14 pages
Jira API Guide for Developers
No ratings yet
Jira API Guide for Developers
6 pages
Unit-2 DH&V
No ratings yet
Unit-2 DH&V
188 pages
Cheat Sheet
No ratings yet
Cheat Sheet
12 pages
4 BNI Python Training
100% (1)
4 BNI Python Training
126 pages
Python Libraries: NumPy, Pandas, Matplotlib
No ratings yet
Python Libraries: NumPy, Pandas, Matplotlib
68 pages
RM-JIRA API Calls and JSON Schemas-220225-233711
No ratings yet
RM-JIRA API Calls and JSON Schemas-220225-233711
9 pages
MySQL Database Schema Overview
No ratings yet
MySQL Database Schema Overview
1 page
H2 Database Schema Overview
No ratings yet
H2 Database Schema Overview
1 page
7 Tips for Effective Requirements Management
No ratings yet
7 Tips for Effective Requirements Management
10 pages
Veteran-Centered VA Design Pilot
No ratings yet
Veteran-Centered VA Design Pilot
32 pages
365 Sex Positions PDF
50% (92)
365 Sex Positions PDF
387 pages
Essential Ebooks on Seduction and Sex
No ratings yet
Essential Ebooks on Seduction and Sex
8 pages
SeduceHer Kissing Secrets Ebook
77% (13)
SeduceHer Kissing Secrets Ebook
83 pages
Usecasemodeling Vpuml
No ratings yet
Usecasemodeling Vpuml
9 pages
Xinflying 400S User Manual-English
No ratings yet
Xinflying 400S User Manual-English
30 pages
Experiment No: 1 HDL Code To Realize All The Logic Gates: Name: A.Vineela Reddy Date: HT N0.: 16H61A04C6 Page No.
No ratings yet
Experiment No: 1 HDL Code To Realize All The Logic Gates: Name: A.Vineela Reddy Date: HT N0.: 16H61A04C6 Page No.
49 pages
Student Motivation in Vocabulary Learning
No ratings yet
Student Motivation in Vocabulary Learning
92 pages
1.3.3 Constraining A Sketch
No ratings yet
1.3.3 Constraining A Sketch
4 pages
RATIONAL KFC Training Document 2017 V1
No ratings yet
RATIONAL KFC Training Document 2017 V1
30 pages
GEHC Pattison Case Study
No ratings yet
GEHC Pattison Case Study
4 pages
The Domino Designer Quickstart Tutorial1 21 30
No ratings yet
The Domino Designer Quickstart Tutorial1 21 30
10 pages
XL4005 DC-DC Converter Guide
No ratings yet
XL4005 DC-DC Converter Guide
9 pages
UNIT 5 - Data Science - III BSC CS
No ratings yet
UNIT 5 - Data Science - III BSC CS
16 pages
Volume 39 Number 2
No ratings yet
Volume 39 Number 2
15 pages
Class II Syllabus 2024 25
No ratings yet
Class II Syllabus 2024 25
8 pages
Some Considerations About SOAP 1.1 and 1.2
No ratings yet
Some Considerations About SOAP 1.1 and 1.2
3 pages
IoT-Based Air Pollution Monitoring Report
No ratings yet
IoT-Based Air Pollution Monitoring Report
76 pages
Review Paper 01 - Siddesh Alavekar HFSC 823
No ratings yet
Review Paper 01 - Siddesh Alavekar HFSC 823
9 pages
Manual de Usuario MEB-9400
No ratings yet
Manual de Usuario MEB-9400
346 pages
2024 - 2
No ratings yet
2024 - 2
8 pages
Oracle Personalization: Procedure Example
100% (2)
Oracle Personalization: Procedure Example
4 pages
Multimedia Networks for Students
No ratings yet
Multimedia Networks for Students
66 pages
Thompson Industrial Communications Fifth Edition
100% (1)
Thompson Industrial Communications Fifth Edition
46 pages
Lap Trinh .NET DLLs in Vijeo Citect 2015
No ratings yet
Lap Trinh .NET DLLs in Vijeo Citect 2015
34 pages
Individual Assignment
No ratings yet
Individual Assignment
11 pages
Salesforce Admin Book
No ratings yet
Salesforce Admin Book
135 pages
p310-NPTEL-Power-System-Protection S.A Soman PDF
100% (2)
p310-NPTEL-Power-System-Protection S.A Soman PDF
324 pages
Theory of Computation - CS3452 - Question Bank and Important Questions
No ratings yet
Theory of Computation - CS3452 - Question Bank and Important Questions
17 pages
Marketing Imperatives for Business Growth
No ratings yet
Marketing Imperatives for Business Growth
4 pages
Microsoft Access 2000 Database Workshop
No ratings yet
Microsoft Access 2000 Database Workshop
8 pages
IEEE Access 2023 Honeywell
No ratings yet
IEEE Access 2023 Honeywell
5 pages
University of Malta: Junior College JUNE 2011
No ratings yet
University of Malta: Junior College JUNE 2011
4 pages
Proposal - Philsunrise Maritime Inc
No ratings yet
Proposal - Philsunrise Maritime Inc
2 pages
MICRAL V Portable Microcomputer System Dec78
No ratings yet
MICRAL V Portable Microcomputer System Dec78
41 pages

Extracting Data with JIRA Python API

Uploaded by

Extracting Data with JIRA Python API

Uploaded by

Task 1.

Getting data from JIRA

[INFO] Using the JIRA Python API

See the documentation for full details. [Link] ([Link]

from jira import JIRA

# Getting the keys of the last 3 reported issues

[u'XD-3768', u'XD-3767', u'XD-3766']

Task 1.1 Data extraction from JIRA

See more details about this on [Link]

# Connect with Spring

from jira import JIRA

# it is a good idea to save the issues in a dataframe

issues_in_proj_1 = jira.search_issues('project=XD', startAt=0, maxResults=1000)

from jira import JIRA

print 'Number of issues:', len(allissues)

Number of issues: 3706

for issue in allissues:

issues = [Link](d, ignore_index=True)

assignee components created creator description fixV

The jobs that appear

Working with Spring-XD

2017-03- Pavan I'm trying to run a Job on

Save your results to a file

issues.to_csv("[Link]", encoding='utf-8', header=True, index=False, line_terminator=

Extracting the changelog

1. Getting blocks of issues with additional information (e.g. the changelog)

If we opt for the first option (1)

issues = jira.search_issues('project=XD', start_idx, block_size, expand='changelo

If we opt for the second option (2)

issue = [Link]('FOO-100', expand='changelog')

for issue in allissues:

issue = [Link]([Link], expand='changelog')

for history in [Link]:

changelog = [Link](d, ignore_index=True)

print "Number of records: ", len(changelog)

Number of records: 43222

author date field fieldtype from fromString key to

Mark 2017-03- XD-

Mark 2017-03- XD-

Mark 2017-03- XD-

Krzysztof 2017-01- XD-

Save the dataset to a file

# backup! save to file

Task 1.2 Pre-processing and cleaning the data

Index([u'assignee', u'components', u'created', u'creator', u'description',

Remove all the issues with null or zero values

to_remove = issues[ (issues['storypoints'] == 0.0) | ([Link](issues['storypoints']))]

print 'before/after =', len(issues), '/', len(df)

before/after = 3706 / 3541

to_remove = changelog[((changelog['field'] == 'Story Points') | ( changelog['field'] == 'Ac

print 'before/after =', len(issues), '/', len(df)

before/after = 3541 / 3064

['To Do' 'Done' 'In PR' 'In Progress']

df = issues[ ~issues['[Link]'].isin(['Done']) | ~issues['resolution'].isin(['Complete'

print 'before/after =', len(issues), '/', len(df)

before/after = 3064 / 1035

Index([u'author', u'date', u'field', u'fieldtype', u'from', u'fromString',

print 'before/after =', len(issues), '/', len(df)

before/after = 1035 / 1035

FIBO_CARDS=[0.5, 1, 2, 3, 5, 8, 13, 20, 40, 100]

print 'before/after =', len(issues), '/', len(df)

before/after = 1035 / 943

print 'before/after =', len(issues), '/', len(df)

before/after = 943 / 943

Remove outliers. Use Tukey’s fences.

def remove_outliers(df, column, k = 1.5):

minval = first_quartile - k*iqr

return df[(df[column] > minval) | (df[column] < maxval)].copy()

print 'before/after =', len(issues), '/', len(df)

before/after = 943 / 943

Remove the stop words from title and the summary.

from [Link] import stopwords

print 'before/after =', len(issues), '/', len(df)

issues['description'] = issues['description'].apply( lambda x : ' '.join([word for word in

before/after = 943 / 869

Working Spring-XD version [Link] admin config timezone command