---
layout: post
title: GitLab Incident Review and Lessons Learned
date: 2017-03-03 22:27:37
tags: []
---
GitLab Incident Review
On January 31, while fixing a PostgreSQL replication problem (DB replication lagged too far behind), GitLab accidentally deleted production data: the plan was to wipe the stale data directory on the secondary, db2, but the command was mistakenly run on the primary, db1. GitLab then turned to its backups, only to find that there was no up-to-date backup to restore from:
- LVM snapshots were only taken every 24 hours; the most recent one was about 6 hours old
- The regular backups were broken because of a pg_dump client version problem (a sketch of guarding against this follows the list)
- Azure disk snapshots were not enabled for the database servers
- The database synchronization process deletes webhooks, so webhooks could only be restored from a backup
- The S3 backups were not working; the bucket was empty
- The backup procedures were poor and had no clear documentation
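The pg_dump item deserves a closer look: a pg_dump client that is older than the server can fail outright, and if nobody checks the output, the failure goes unnoticed until a restore is needed. Below is a minimal sketch of a guard against both problems; it assumes psycopg2 and a pg_dump binary on PATH, and the connection-string handling is simplified for illustration, so treat it as an outline rather than GitLab's actual tooling.

```python
import os
import re
import subprocess

import psycopg2  # any PostgreSQL driver would do; psycopg2 is an assumption here


def server_version(dsn: str) -> tuple:
    """Ask the server which major.minor version it runs, e.g. (9, 6)."""
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute("SHOW server_version")
        major, minor = re.search(r"(\d+)\.(\d+)", cur.fetchone()[0]).groups()
    return int(major), int(minor)


def client_version() -> tuple:
    """Parse the version of the pg_dump binary found on PATH."""
    out = subprocess.run(["pg_dump", "--version"],
                         capture_output=True, text=True, check=True).stdout
    major, minor = re.search(r"(\d+)\.(\d+)", out).groups()
    return int(major), int(minor)


def checked_dump(dsn: str, outfile: str) -> None:
    """Refuse to dump with an older client, and refuse to treat an empty file as a backup."""
    if client_version() < server_version(dsn):
        raise RuntimeError("pg_dump client is older than the server; upgrade before backing up")
    subprocess.run(["pg_dump", "--dbname", dsn, "--file", outfile], check=True)
    if os.path.getsize(outfile) == 0:
        raise RuntimeError(f"{outfile} is empty; the backup did not actually run")
```

Calling something like `checked_dump(dsn, "/backups/latest.dump")` from the nightly job turns "the bucket was empty" from a surprise at restore time into a failed job that someone can be paged for.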
In the end, GitLab could only restore the data from an LVM snapshot taken about 6 hours earlier. Because the backup machine had very poor performance and copying the data was extremely slow, the whole recovery took roughly 18 hours.
Improvement Measures
After the incident, GitLab published a list of improvement measures, including:
- Update PS1 across all hosts to more clearly differentiate between hosts and environments (#1094)
- Prometheus monitoring for backups (#1095) (see the sketch after this list)
- Set PostgreSQL’s max_connections to a sane value (#1096)
- Investigate Point in time recovery & continuous archiving for PostgreSQL (#1097)
- Hourly LVM snapshots of the production databases (#1098)
- Azure disk snapshots of production databases (#1099)
- Move staging to the ARM environment (#1100)
- Recover production replica(s) (#1101)
- Automated testing of recovering PostgreSQL database backups (#1102)
- Improve PostgreSQL replication documentation/runbooks (#1103)
- Investigate pgbarman for creating PostgreSQL backups (#1105)
- Investigate using WAL-E as a means of Database Backup and Realtime Replication (#494)
- Build Streaming Database Restore
- Assign an owner for data durability
- Bundle pgpool-II 3.6.1 (!1251)
- Connection pooling/load balancing for PostgreSQL (#259)
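To make item #1095 concrete, here is a rough sketch of what backup monitoring can look like with the prometheus_client library and a Pushgateway; the gateway address, job name, and metric names are assumptions for illustration, not GitLab's actual setup.

```python
import os
import time

from prometheus_client import CollectorRegistry, Gauge, push_to_gateway

PUSHGATEWAY = "pushgateway.example.internal:9091"  # hypothetical address


def report_backup(outfile: str) -> None:
    """Push the time and size of the last successful backup after every run."""
    registry = CollectorRegistry()
    last_success = Gauge("db_backup_last_success_unixtime",
                         "Unix time of the last successful database backup",
                         registry=registry)
    size_bytes = Gauge("db_backup_size_bytes",
                       "Size in bytes of the last successful database backup",
                       registry=registry)
    last_success.set(time.time())
    size_bytes.set(os.path.getsize(outfile))
    push_to_gateway(PUSHGATEWAY, job="db_backup", registry=registry)
```

With an alerting rule along the lines of `time() - db_backup_last_success_unixtime > 86400`, a silently failing pg_dump job or an empty S3 bucket shows up as an alert within a day instead of at restore time.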
Lessons
- PostgreSQL configuration and usage mistakes; see Dataloss at Gitlab
- The necessity of automation: people will always make mistakes, so problems should be solved with technology rather than process, i.e. design a more robust high-availability system instead of relying on access controls and manual operations
- Backup and disaster-recovery procedures need to be exercised regularly; otherwise, even with as many backup mechanisms as GitLab had, data can still be lost (a sketch of an automated restore drill follows)
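As a rough illustration of what a regular restore drill can look like, the sketch below restores the most recent dump into a throwaway database and fails loudly if the result looks empty. The database name, dump path, and sanity query are all made up for the example; the point is only that the drill runs on a schedule and that its failure is treated as an incident.

```python
import subprocess

import psycopg2  # used only for the sanity check; assumed available

SCRATCH_DB = "restore_drill"          # hypothetical throwaway database
LATEST_DUMP = "/backups/latest.dump"  # hypothetical path to the newest custom-format dump


def restore_drill() -> None:
    """Restore the newest backup into a scratch database and sanity-check the result."""
    subprocess.run(["dropdb", "--if-exists", SCRATCH_DB], check=True)
    subprocess.run(["createdb", SCRATCH_DB], check=True)
    # pg_restore handles custom-format dumps; a plain SQL dump would go through psql instead.
    subprocess.run(["pg_restore", "--dbname", SCRATCH_DB, LATEST_DUMP], check=True)

    with psycopg2.connect(dbname=SCRATCH_DB) as conn, conn.cursor() as cur:
        # The table name is a stand-in; pick something that must never be empty.
        cur.execute("SELECT count(*) FROM projects")
        if cur.fetchone()[0] == 0:
            raise RuntimeError("restore drill produced an empty database")


if __name__ == "__main__":
    restore_drill()  # run from cron or CI; a failure here should page someone
```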