
Status History

Filter: Canary



June 2020

Elevated Web Error Rate Post-Deploy

June 29, 2020 21:06 UTC

Incident Status

Degraded Performance


Components

Website, API, Git Operations, CI/CD - Hosted runners on Linux, CI/CD - Hosted runners on Windows, Background Processing, Canary


Locations

Google Compute Engine




June 29, 2020 21:06 UTC
[Resolved] At 20:06 UTC, Google Cloud Platform issued the all clear regarding the network connectivity issues in the `us-east1-c` availability zone. At 20:30 UTC, we finished repairing the VMs that were impacted and have fully restored operations on GitLab.com. We apologize to any of our users who were negatively impacted, and thank you all for your patience.

June 29, 2020 16:59 UTC
[Monitoring] We've worked through the backlog of queued CI jobs, and all jobs should be processing normally again. We continue to monitor the status of Google Cloud Platform's incident.

June 29, 2020 16:39 UTC
[Monitoring] We've recovered to normal levels of latency, but the Web error rate remains slightly elevated. Google Cloud Platform has indicated that they expect a resolution for `us-east1-d` within 30 minutes, at which point we should have enough capacity to continue serving our customers without fear of interruption.

June 29, 2020 16:20 UTC
[Identified] We have no material update to provide at this time, but we're still closely monitoring our infrastructure and working out how to scale up our fleet to compensate.

June 29, 2020 16:05 UTC
[Identified] We're monitoring all of our infrastructure very closely and are still observing issues, primarily with web errors and CI job queuing. Google Cloud Platform has updated their incident to note that `us-east1-*` VM creation may fail, which may limit our ability to scale up our fleet to meet demand.

June 29, 2020 15:47 UTC
[Identified] Google Cloud Platform has acknowledged an issue with the `us-east1-c` availability zone (AZ). We're increasing capacity in other AZs to meet the demand of our users (see the sketch after this timeline). We're also observing an increase in the CI job queue; users' pipelines are impacted and will take longer than normal to complete.

June 29, 2020 15:31 UTC
[Investigating] We continue to investigate a capacity issue that's causing an increase in error rates, primarily with web responses. We suspect an issue with Google Cloud Platform's `us-east1-c` zone and C2 instance types, which have become unresponsive across our fleet. We've escalated to support engineers and are awaiting a reply. Meanwhile, we're planning to increase our capacity across other zones.

June 29, 2020 15:18 UTC
[Investigating] We started a deploy at 13:26 UTC and noticed an elevated error rate after our web servers became unresponsive post-deploy. We're tracking this in `gitlab.com/gitlab-com/gl-infra/production/-/issues/2347`.
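
For context on the cross-zone scale-up mentioned in the 15:47 UTC update, below is a minimal, hypothetical sketch of how a managed instance group in an unaffected zone might be resized with the `google-cloud-compute` Python client. The project, zone, group name, and capacity figures are placeholders, not GitLab's actual infrastructure or tooling.

```python
# Minimal sketch (placeholder names, not GitLab's tooling): add capacity to a
# managed instance group in a zone unaffected by the incident.
from google.cloud import compute_v1

PROJECT = "example-project"       # hypothetical GCP project ID
HEALTHY_ZONE = "us-east1-b"       # a zone not impacted by the incident
MIG_NAME = "web-fleet"            # hypothetical managed instance group
EXTRA_CAPACITY = 10               # additional VMs to absorb shifted traffic

client = compute_v1.InstanceGroupManagersClient()

# Look up the group's current target size, then request a larger one.
mig = client.get(
    project=PROJECT, zone=HEALTHY_ZONE, instance_group_manager=MIG_NAME
)
client.resize(
    project=PROJECT,
    zone=HEALTHY_ZONE,
    instance_group_manager=MIG_NAME,
    size=mig.target_size + EXTRA_CAPACITY,
)
print(f"Requested resize of {MIG_NAME}: "
      f"{mig.target_size} -> {mig.target_size + EXTRA_CAPACITY}")
```

Whether such a resize succeeds would, of course, depend on the very VM-creation issues GCP reported for `us-east1-*`, which is why capacity was added in zones outside the affected one.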




