GitLab Infrastructure 20160621


GitLab Infrastructure

Status Report

We have HTTP queue time in monitoring, so we ran an experiment
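(Aside: a queue-time metric like this is usually derived from a timestamp the frontend or load balancer stamps on each request; the worker subtracts it from the time it actually starts handling the request. The sketch below shows the idea as Python/WSGI middleware; the header name and the print() sink are illustrative assumptions, since GitLab.com really runs a Ruby stack.)

```python
# Minimal sketch of how an "HTTP queue time" metric can be computed:
# the frontend stamps the request (e.g. "X-Request-Start: t=<unix time>"),
# and the worker measures how long the request sat in the queue before
# it got picked up. Header name and print() sink are assumptions.
import time


class QueueTimeMiddleware:
    def __init__(self, app, header="HTTP_X_REQUEST_START"):
        self.app = app
        self.header = header  # WSGI-normalized header key

    def __call__(self, environ, start_response):
        raw = environ.get(self.header, "")       # e.g. "t=1466496000.123"
        stamped = raw[2:] if raw.startswith("t=") else raw
        if stamped:
            queue_time = time.time() - float(stamped)
            # A real setup would ship this to the metrics backend;
            # printing keeps the sketch self-contained.
            print(f"http_queue_time_seconds={queue_time:.3f}")
        return self.app(environ, start_response)


# Usage: wrap any WSGI app, e.g. app = QueueTimeMiddleware(app)
```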


What if we add more memory to workers?

This had a good impact across the board - less load in general

Especially on API timings (authorized_keys lookup timings)

This is why git-ssh is going faster, but there is still a long way to go.

Some things did not go well with the change


Redis leaves connections behind - GitLab hit its max open connections -> outage
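(A minimal sketch of one mitigation, not necessarily the fix we shipped: cap the client's connection pool so leaked connections show up as client-side timeouts instead of pushing the Redis server toward its client limit. Uses the redis-py package; host and numbers are placeholder assumptions.)

```python
# Bound the Redis connection pool so a leak cannot exhaust the server.
# Requires `pip install redis`; host/port and limits are illustrative.
import redis

pool = redis.BlockingConnectionPool(
    host="localhost",
    port=6379,
    max_connections=50,  # hard cap per process
    timeout=5,           # seconds to wait for a free connection before raising
)
r = redis.Redis(connection_pool=pool)

# CLIENT LIST is handy for spotting idle, left-behind connections.
idle_clients = [c for c in r.client_list() if int(c["idle"]) > 300]
print(f"{len(idle_clients)} connections idle for more than 5 minutes")
```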

Deploys - RC3 blew up in production


At 1 AM on a Friday, my time.
So we built staging the next Monday:
staging.gitlab.com is way smaller and less powerful than GitLab.com, but it has all the data.
Thanks @Jeroen!
Done is better than perfect.

Deploys - RC4 blew up in staging that very same Monday


[Charts: Production and Staging]

Postgres is still dying on us, or was it?

Query counts monitoring allowed us to corner the problem and get it fixed in RC5.
Thanks @marat!
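(Roughly how query-count monitoring works: hook the database layer, count statements per request, and watch for code paths that suddenly issue far more queries than expected. Our implementation lives in Rails; the Python/SQLAlchemy sketch below only shows the shape of the idea.)

```python
# Count every SQL statement an engine executes by hooking the
# after_cursor_execute event. Requires `pip install sqlalchemy`.
from sqlalchemy import create_engine, event, text

engine = create_engine("sqlite:///:memory:")
query_count = 0


@event.listens_for(engine, "after_cursor_execute")
def count_query(conn, cursor, statement, parameters, context, executemany):
    global query_count
    query_count += 1


with engine.connect() as conn:
    conn.execute(text("SELECT 1"))
    conn.execute(text("SELECT 2"))

# In production the counter would be tagged per endpoint and graphed,
# which is what makes query regressions easy to corner.
print(f"queries issued: {query_count}")
```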

Monitoring - improvements in how methods are measured

We are actually showing where the time is going now.
Thanks @Yorick!
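(The underlying idea: wrap each instrumented method, time it, and tag the sample with the method name so dashboards can break response time down by method. The decorator below is an illustrative Python equivalent, not the actual Ruby instrumentation.)

```python
# Per-method timing: record wall time and attribute it to the method.
import functools
import time


def measure(fn):
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.monotonic()
        try:
            return fn(*args, **kwargs)
        finally:
            elapsed_ms = (time.monotonic() - start) * 1000
            # A real setup would send this to the metrics store with tags;
            # printing keeps the sketch self-contained.
            print(f"{fn.__module__}.{fn.__qualname__} took {elapsed_ms:.1f} ms")
    return wrapper


@measure
def slow_lookup():
    time.sleep(0.05)


slow_lookup()
```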

Performance - no progress besides the API

Storage

CephFS - dev.gitlab.org has been running on CephFS for the last month

Did you notice? No? That's good! :)


Pushing the Linux kernel to it takes 27 minutes (~1.5 GB)
Pushing the Linux kernel to GitLab.com takes anywhere from 1:30 hours to forever

Our measurements were wrong: CephFS gives 500/150 IOPS (see the sketch below)


But it scaled without a hiccup up to 98 worker nodes (clients).
We are testing behaviour when we add more nodes/OSDs, etc.
We have a plan to move to CephFS without downtime.
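(How a raw IOPS figure like 500/150 can be sanity-checked: hammer the mount with small random reads and count operations per second. The sketch below is deliberately crude - no O_DIRECT, so the page cache will flatter the result - and the path, file size, and duration are placeholder assumptions; a proper benchmark would use a dedicated tool.)

```python
# Crude random-read IOPS check against a file on the mount under test.
import os
import random
import time

PATH = "/tmp/iops_test.bin"  # point this at the CephFS mount being tested
SIZE = 64 * 1024 * 1024      # 64 MiB test file
BLOCK = 4096                 # 4 KiB reads
DURATION = 5                 # seconds

with open(PATH, "wb") as f:
    f.write(os.urandom(SIZE))

ops = 0
deadline = time.monotonic() + DURATION
with open(PATH, "rb") as f:
    while time.monotonic() < deadline:
        f.seek(random.randrange(0, SIZE - BLOCK))
        f.read(BLOCK)
        ops += 1

print(f"~{ops / DURATION:.0f} random 4 KiB read ops/sec")
os.remove(PATH)
```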

Storage - capacity
Git data - 28 TB out of 49 TB

Shared data - 3 TB out of 4 TB
(we can grow this one easily-ish)

Other news

What's coming soon

Multiple mount points/shards - Thanks @Alejandro!


2 new hires
Alex as a Production Engineer
Ahmad as a Performance Specialist

We are talking with the CI team to transfer knowledge into Infrastructure.


We are going to take over GitHost.io
We are starting to build infrastructure monitoring that can be shipped with GitLab
We are hiring!
That's all, folks!
