- Managing maintenance windows
- Runbooks? SOP? (cparedes: might be worthwhile even though we want to automate SOPs away as much as possible - what should we check at 2 AM? What do folks typically do in this situation if automation fails?)
- Architecture and design (cparedes: also maybe talk about why we chose that design - what problems did we try to solve? Why is this a good solution?)
How to manage documentation
How to get help, keep sharp, learn new skills, and network within the systems community.
Sign up and participate. Ask your own questions, but also answer questions that look interesting to you. Doing so not only helps the community but also keeps you sharp, even on technologies you don’t work with on a daily basis.
- Web Operations, John Allspaw and Jesse Robbins
- The Art of Capacity Planning, John Allspaw
- Blueprints for High Availability, Evan Marcus and Hal Stern
- Resilience Engineering, Erik Hollnagel
- Human Error, James Reason
- To Engineer Is Human, Henry Petroski
- To Forgive Design, Henry Petroski