PinDoo Outage Postmortem

Incident Summary: On the afternoon of December 11, 2023, PinDoo experienced a service disruption caused by degraded Vault performance and high CPU load on a Consul server. The issue later escalated and led to a significant hardware failure. The incident was resolved on December 12, 2023.

Timeline:

  1. Incident Onset:
    • Date and Time: December 11, 2023 – Afternoon
    • Description: The incident began with degraded Vault performance and elevated CPU load on a single Consul server. At this point customers were unaffected, and the Kuyomi engineering team began investigating immediately.
  2. Unhealthy Consul Cluster:
    • Description: Initial findings revealed an unhealthy Consul cluster, a critical dependency for several services, including Vault. Write latency in the cluster's underlying KV store was elevated: the 50th percentile reached 2 seconds, far above the typical 300 ms. Suspecting degraded hardware performance, the team began replacing one Consul cluster node.
  3. Collaborative Diagnosis:
    • Description: Given the complexity of the issue, other engineers joined Kuyomi and PinDoo's engineers to assist with diagnosis and remediation. The joint effort focused on finding the root cause of the Consul cluster's degraded performance.
  4. Continuous Hardware Issues:
    • Description: Even after new hardware was introduced, the Consul cluster's performance remained degraded. As a result of this prolonged degradation, the number of websites using PinDoo dropped to 50% of the normal count at 21:53 AEDT.
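
To make the latency figures in the timeline concrete, the following is a minimal sketch of how a median (50th percentile) write latency could be compared against the usual 300 ms baseline. The sample values and the function names `p50_ms` and `slowdown_factor` are hypothetical and chosen for illustration only; they are not taken from PinDoo's or Consul's tooling.

```python
import statistics

# Typical median write latency quoted in the timeline, in milliseconds.
BASELINE_P50_MS = 300.0


def p50_ms(samples_ms):
    """Return the median (50th percentile) of latency samples in milliseconds."""
    return statistics.median(samples_ms)


def slowdown_factor(samples_ms, baseline_ms=BASELINE_P50_MS):
    """Return how many times slower the observed median is than the baseline."""
    return p50_ms(samples_ms) / baseline_ms


# Hypothetical write-latency samples whose median is 2 seconds,
# matching the degraded figure reported during the incident.
degraded_samples_ms = [1800.0, 1950.0, 2000.0, 2100.0, 2400.0]

print(p50_ms(degraded_samples_ms))                     # 2000.0
print(round(slowdown_factor(degraded_samples_ms), 1))  # 6.7
```

With these assumed samples, a 2-second median is roughly 6.7 times the 300 ms baseline.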

Root Cause Analysis: The primary cause of the service disruption was identified as a combination of degraded hardware performance within the Consul cluster and the subsequent failure to recover despite the introduction of new hardware.

Mitigation Steps and Preventive Measures:

  1. Thorough Analysis:
    • PinDoo will conduct a detailed post-incident analysis to understand the intricacies of the hardware failure and identify preventive measures.
  2. Enhanced Monitoring and Alerting:
    • Improvements will be made to monitoring systems to provide early detection of hardware performance issues and timely alerts for proactive remediation.
  3. Collaborative Approach:
    • PinDoo recognizes the value of collaboration and will continue to engage with external experts, such as HashiCorp, to enhance incident response capabilities.
  4. User Communication:
    • Efforts will be intensified to keep users informed and updated during incidents, ensuring transparency and managing expectations.
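
As an illustration of the enhanced monitoring and alerting item above, here is a minimal sketch of an alert that fires only when median write latency stays above a threshold for several consecutive checks, which avoids paging on a single spike. The class name `SustainedLatencyAlert`, the 600 ms threshold (twice the typical 300 ms), and the three-check window are all assumptions made for this example, not PinDoo's actual alerting configuration.

```python
from collections import deque


class SustainedLatencyAlert:
    """Fire when the last `required_breaches` readings all exceed the threshold."""

    def __init__(self, threshold_ms, required_breaches):
        self.threshold_ms = threshold_ms
        # Keep only the most recent breach/no-breach results.
        self.recent = deque(maxlen=required_breaches)

    def observe(self, p50_latency_ms):
        """Record one median-latency reading and return True if the alert fires."""
        self.recent.append(p50_latency_ms > self.threshold_ms)
        window_full = len(self.recent) == self.recent.maxlen
        return window_full and all(self.recent)


# Assumed threshold: 600 ms, i.e. twice the typical 300 ms median.
alert = SustainedLatencyAlert(threshold_ms=600.0, required_breaches=3)

# Hypothetical readings: one healthy check followed by sustained degradation.
readings_ms = [310.0, 2000.0, 1900.0, 2100.0]
results = [alert.observe(r) for r in readings_ms]
print(results)  # [False, False, False, True]
```

The alert fires on the fourth reading, once three consecutive checks have exceeded the threshold. A shorter window detects problems sooner at the cost of more false alarms.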

Conclusion: PinDoo acknowledges the impact of this outage on its users and expresses sincere apologies for the inconvenience caused. The incident resolution has paved the way for comprehensive improvements in PinDoo’s infrastructure, monitoring, and collaborative strategies to fortify the platform against similar issues in the future.

Users still experiencing issues should contact help@pindoo.xyz for immediate assistance.

PinDoo remains committed to providing a robust and reliable platform and appreciates the understanding and patience of its user community.
