User Details
- User Since
- Jun 7 2021, 7:25 AM (160 w, 16 h)
- Availability
- Available
- LDAP User
- Jelto
- MediaWiki User
- JWodstrcil (WMF) [ Global Accounts ]
Today
I'll close the task as the major version upgrade is finished. If any new errors are introduced with the new version, feel free to reopen or link a subtask like the two above.
This was bot traffic: https://superset.wikimedia.org/superset/dashboard/p/l39rmY3r05p/. The alert recovered after 5 minutes. I'll take a deeper look if it happens again.
I'll close the task; the downtime was around 5 minutes. We have a task to reduce the downtime in the upgrade cookbook: T363564
Fri, Jun 28
All instances (except the WMCS Bookworm Runner) are now on GitLab version 17. I'll leave the task open over the weekend in case there are any errors to report.
Thu, Jun 27
I deleted packager02.packaging.eqiad1.wikimedia.cloud, which was the last buster instance, so I'll resolve the task: all buster instances have been replaced or deleted.
Wed, Jun 26
I created the bookworm host packager-etherpad01.packaging.eqiad1.wikimedia.cloud to replace packager02.packaging.eqiad1.wikimedia.cloud. I successfully rebuilt Etherpad version 1.9.7 (the current production version) on the new bookworm host.
Mon, Jun 24
As discussed in IRC, the release VMs should probably have a separate disk mounted at /srv (similar to the 150GB disk mounted at /srv/docker). However, I'm not sure if /srv/docker needs all that space. The Docker partition is using only 5%, with no significant change over the past month. Therefore, we could use this larger disk at /srv and mount a smaller one for Docker, or just use the 150GB disk for /srv (including /srv/docker).
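As a quick sanity check before deciding, the usage can be compared directly on one of the release VMs; the fstab line below is a hypothetical target layout, not the current one:

```
# Compare usage of /srv with the 150GB Docker disk (~5% used per above).
df -h /srv /srv/docker

# Hypothetical fstab entry if we repurpose the 150GB disk for all of /srv
# (with /srv/docker as a plain subdirectory):
#   /dev/sdb1  /srv  ext4  defaults  0  2
```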
I'll resolve this task optimistically because this was likely caused by a bigger database incident: T368098.
Fri, Jun 21
The custom nginx rule to block non-admin configuration changes using the web UI is in place now. So I'm resolving the task.
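For illustration, a minimal sketch of the shape of such a rule; the deployed rule is managed via Puppet, and the snippet path and URL matcher here are assumptions:

```
# Sketch only: block the web-UI configuration pages at the nginx level.
# The location block has to end up inside the GitLab server{} block
# (e.g. via an include); path and matcher are illustrative.
sudo tee /etc/nginx/snippets/gitlab-block-admin-settings.conf >/dev/null <<'EOF'
location ~ ^/admin/application_settings {
    return 403;
}
EOF
sudo nginx -t && sudo systemctl reload nginx
```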
Thu, Jun 20
gitlab1003 was upgraded successfully to 17.0.2: https://gitlab-replica-b.wikimedia.org. I'll proceed with the other replica on Monday.
I will upgrade gitlab1003 to version 17.0 later today. If everything goes as planned, I will update the other replica on Monday and the production system during the deployment window on Friday, June 28th.
Tue, Jun 18
I think all of the options above can be used. I updated the frequency of the scheduled pipeline from every 30 minutes to every 15 minutes. I also added a [DO NOT EDIT] note to the description of the Trusted Runners.
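For the record, the schedule change can also be made through the pipeline-schedules API instead of the UI; the project and schedule IDs below are placeholders:

```
# Set the schedule's cron expression to every 15 minutes via the API.
# GITLAB_TOKEN needs api scope; <project-id>/<schedule-id> are placeholders.
curl --request PUT \
  --header "PRIVATE-TOKEN: ${GITLAB_TOKEN}" \
  --data-urlencode 'cron=*/15 * * * *' \
  "https://gitlab.wikimedia.org/api/v4/projects/<project-id>/pipeline_schedules/<schedule-id>"
```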
Mon, Jun 17
As discussed in T367544, the VM used for building the Etherpad Debian package will be deleted soon. Therefore, it might be sensible to attempt one last version upgrade. This would give us additional time to evaluate new systems for building and deploying Etherpad (like Kubernetes, wmf-debci, or another bookworm build VM).
It might be possible to schedule another Etherpad upgrade (T362432) before the packager02.packaging.eqiad1.wikimedia.cloud host is deleted.
According to the Etherpad upgrade docs, this host is used to build the Etherpad Debian package; I have also used it for that in the past. The dedicated host is needed because "etherpad builds fetches npm modules during the build time".
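For context, the build itself is a standard Debian package build; a rough sketch (the checkout location is an assumption, and any pbuilder/sbuild wrapping from the upgrade docs is omitted):

```
# On the packager host: build the etherpad-lite Debian package. The build
# needs network access because npm modules are fetched at build time, which
# is why the dedicated host exists.
cd ~/etherpad-lite             # packaging checkout; location is an assumption
dpkg-buildpackage -us -uc -b   # unsigned binary-only build
```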
I updated the test instance and test gitlab-runners to GitLab version 17. Everything looks good and the migration was successful. @brennen we can test the account approval bot on the test instance now.
Fri, Jun 14
I'm going to upgrade the test instance to 17.0 next Monday, June 17th.
This should be fixed with the change above. We ignore user-generated script_failures in the alert now.
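Roughly, the filter excludes jobs whose failure_reason is script_failure (GitLab's label for failures in the user's own CI script). A sketch for testing such an expression; the Prometheus host and metric name are made up, not the exporter's actual ones:

```
# Query failed jobs while ignoring user-generated script failures.
# Host and metric name are placeholders.
curl -s 'http://prometheus.example:9090/api/v1/query' \
  --data-urlencode 'query=sum(gitlab_failed_jobs_total{failure_reason!="script_failure"})'
```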
Related to ongoing upgrade in T367382. I silenced the alert. This should resolve later today.
Thu, Jun 13
This was triggered by job failures caused by faulty CI jobs/scripts. The jobs came mostly from https://gitlab.wikimedia.org/repos/security/ci-cd-testing-gitlab-ci-security-templates, so this was user-generated.
This is resolved after GitLab is back. I'll try to catch the error in a new version of the exporter.
Wed, Jun 12
^ Another patch was needed to make sure the service IPs are probed and not the default host IPs (similar to the blackbox::http check).
Resolved after fixing the IPs that should be probed: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1042191
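A quick way to verify the same thing by hand is to force a probe against the service IP explicitly (sketch; the hostname is just an example):

```
# Probe the *service* IP instead of whatever the host's default interface
# would answer on.
SERVICE_IP=$(dig +short gitlab.wikimedia.org | head -1)
curl -sv -o /dev/null \
  --resolve "gitlab.wikimedia.org:443:${SERVICE_IP}" \
  https://gitlab.wikimedia.org/
```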
Tue, Jun 11
This seemed to be a Puppet dependency issue and not an SSH issue. I checked sshd on all instances: IPv6 worked on both replicas but not on production. I restarted the git sshd, and after that it listens on IPv6:
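Listener output trimmed above. Roughly how it was verified (the git sshd unit name is an assumption):

```
# After the restart, the git sshd should show listeners on both
# 0.0.0.0 and [::] (unit name and port are assumptions).
sudo systemctl restart ssh-gitlab
sudo ss -tlnp | grep -i sshd
```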
Mon, Jun 10
After digging through the docs, I did not find any way to restrict the edit page for runners in a project's CI menu, for example here in airflow-dags: https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/runners/1484/edit. All users with elevated permissions on a project can edit the CI settings for every runner that is assigned to the project.
The gitlab-ssh service is listening on IPv4 only. So the above ProbeDown alert fired for IPv6.
This is resolved after switching to monitoring IPv4 only. In T367021 I'll discuss further how to properly add IPv6 to the gitlab-ssh service.
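For later reference, a dual-stack bind for the git sshd is mostly a config change; a sketch, with the config path and addresses as placeholders pending the T367021 discussion:

```
# A dual-stack bind would look roughly like this in the git sshd config:
#   ListenAddress 0.0.0.0
#   ListenAddress ::
grep -i '^ListenAddress' /etc/ssh/sshd_config   # inspect the current binds
```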
Thu, Jun 6
One end-user edited the runner within their project to solve a building/dependency issue:
The alert is right: the protected checkbox was missing. I manually protected the runner again at https://gitlab.wikimedia.org/admin/runners/1484#/.
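The same fix can also be applied through the runners API instead of the admin UI (runner id 1484 from the link above; the token is a placeholder):

```
# Re-protect runner 1484 so it only runs jobs on protected branches/tags.
curl --request PUT \
  --header "PRIVATE-TOKEN: ${GITLAB_TOKEN}" \
  --form "access_level=ref_protected" \
  "https://gitlab.wikimedia.org/api/v4/runners/1484"
```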
Wed, Jun 5
I left a comment in https://gitlab.wikimedia.org/repos/phabricator/phabricator/-/merge_requests/52#note_86778. +1 for removing the custom rate limiting code. I'd try to move any existing extra logic to a single place (requestctl).
Mon, Jun 3
I executed the cleanup command on all instances, so I'll resolve the task.
May 31 2024
All instances are updated to the new Postgres version.
I'd schedule the cleanup (mentioned in the script's output) for sometime next week:
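Script output trimmed above. Assuming these are the omnibus-managed Postgres instances, the cleanup is typically just removing the old data directory that gitlab-ctl pg-upgrade keeps around for rollback; the version suffix is a placeholder:

```
# Remove the pre-upgrade PostgreSQL data directory once the new version has
# been stable for a while. <old-version> is a placeholder for the old suffix.
sudo rm -rf /var/opt/gitlab/postgresql/data.<old-version>
```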
May 30 2024
This is fixed now; the exporter is retrying:
May 29 2024
This was caused by the daily backup-restore. In v1.0.10 of the exporter I added better error handling: https://gitlab.wikimedia.org/repos/sre/gitlab-exporter/-/tags/v1.0.10
May 28 2024
I can upload a fix for that in the exporter tomorrow.