GitLab Outage Review and Summary

GitLab Outage Review

On January 31, while dealing with a PostgreSQL replication problem ("DB Replication lagged too far behind"), GitLab accidentally deleted production data: the plan was to wipe the data directory on the secondary (db2) so replication could be re-seeded, but the command was run on the production primary (db1) by mistake. GitLab then turned to its backups, only to discover that no up-to-date backups were actually available.

In the end, GitLab could only restore data from an LVM snapshot taken six hours earlier. Because the machine holding the snapshot performed poorly and copying the data was extremely slow, the recovery itself took about 18 hours.
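
The lag that triggered the whole episode ("DB Replication lagged too far behind") is exactly the kind of condition that is cheap to watch continuously. Below is a minimal sketch of such a check; the host name, credentials, alert threshold, and the use of psycopg2 are illustrative assumptions, not part of GitLab's actual setup.

```python
# Minimal sketch: estimate streaming-replication lag on a PostgreSQL standby.
# Assumptions (hypothetical, not from the original post): psycopg2 is installed
# and the standby accepts read-only connections with these credentials.
import psycopg2


def standby_lag_seconds(dsn):
    """Return approximate replay lag in seconds, or None if this is a primary."""
    with psycopg2.connect(dsn) as conn:
        with conn.cursor() as cur:
            # pg_last_xact_replay_timestamp() is the commit time of the last
            # replayed transaction; on an idle primary-standby pair this can
            # overstate lag, so treat the value as an approximation.
            cur.execute("""
                SELECT CASE WHEN pg_is_in_recovery()
                            THEN EXTRACT(EPOCH FROM now() - pg_last_xact_replay_timestamp())
                       END
            """)
            return cur.fetchone()[0]


if __name__ == "__main__":
    lag = standby_lag_seconds("host=db2.example.com dbname=postgres user=monitor")
    if lag is not None and lag > 60:  # hypothetical 60-second threshold
        print("WARNING: standby is %.0f seconds behind the primary" % lag)
```

Feeding a number like this into Prometheus (improvement #1095 below) is one way to turn "replication fell behind" from a surprise into an alert.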

Improvement Measures

After the incident, GitLab published a list of improvement measures, including:

  1. Update PS1 across all hosts to more clearly differentiate between hosts and environments (#1094)
  2. Prometheus monitoring for backups (#1095)
  3. Set PostgreSQL’s max_connections to a sane value (#1096)
  4. Investigate Point in time recovery & continuous archiving for PostgreSQL (#1097)
  5. Hourly LVM snapshots of the production databases (#1098)
  6. Azure disk snapshots of production databases (#1099)
  7. Move staging to the ARM environment (#1100)
  8. Recover production replica(s) (#1101)
  9. Automated testing of recovering PostgreSQL database backups (#1102), a minimal sketch of which follows this list
  10. Improve PostgreSQL replication documentation/runbooks (#1103)
  11. Investigate pgbarman for creating PostgreSQL backups (#1105)
  12. Investigate using WAL-E as a means of Database Backup and Realtime Replication (#494)
  13. Build Streaming Database Restore
  14. Assign an owner for data durability
  15. Bundle pgpool-II 3.6.1 (!1251)
  16. Connection pooling/load balancing for PostgreSQL (#259)
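
Item 9 is worth dwelling on: a backup only counts once a restore of it has actually been verified. The sketch below shows one hypothetical way to automate that check; the paths, database names, and the reliance on createdb/pg_restore/psql via subprocess are assumptions for illustration, not GitLab's actual tooling.

```python
# Minimal sketch of an automated restore test for a PostgreSQL backup.
# Assumptions (hypothetical): a custom-format dump exists at BACKUP_PATH, and
# createdb/pg_restore/dropdb/psql are on PATH with privileges on a scratch cluster.
import subprocess
import sys

BACKUP_PATH = "/var/backups/gitlab/latest.dump"   # hypothetical path
SCRATCH_DB = "restore_test"


def run(*cmd):
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)


def main():
    # Recreate a throwaway database and restore the latest dump into it.
    run("dropdb", "--if-exists", SCRATCH_DB)
    run("createdb", SCRATCH_DB)
    # pg_restore exits non-zero if the archive is unreadable or incomplete.
    run("pg_restore", "--no-owner", "--dbname", SCRATCH_DB, BACKUP_PATH)

    # Sanity check: the restored database must contain at least one user table.
    out = subprocess.run(
        ["psql", "-At", "-d", SCRATCH_DB, "-c",
         "SELECT count(*) FROM pg_stat_user_tables"],
        check=True, capture_output=True, text=True)
    if int(out.stdout.strip()) == 0:
        sys.exit("restore test failed: no tables found in restored database")
    print("restore test passed")


if __name__ == "__main__":
    main()
```

Run from cron or CI after every backup, a check like this would flag an empty or unrestorable dump long before it is needed in an emergency, which is precisely the gap GitLab discovered during the incident.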

Lessons Learned

References
