forked from google/opendocs
Initial checkin of Services Documentation templates.
This is a version of the Sysops `ops100` templates used for documenting services for oncallers, ticketeers (aka onduty or interrupts), helpdesk, and service teams.
Jamie Wilkinson committed Oct 5, 2021
1 parent 5f9b52b, commit 47596c8
Showing 7 changed files with 891 additions and 0 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,113 @@ | ||
# Templates for Services Documentation | ||
|
||
(The "ops100" docs.) | ||
|
||
The `templates` directory contains the following templates: | ||
|
||
- [index.md](templates/index.md) | ||
- [operations.md](templates/operations.md) | ||
- [build.md](templates/build.md) | ||
- [disaster\_recovery.md](templates/disaster_recovery.md) | ||
- [common\_tasks.md](templates/common_tasks.md) | ||
- [security.md](templates/security.md) | ||
|
||
The `index.md` file is a template for the landing page that should be in each | ||
service's directory. | ||
|
||
The `operations.md` file is intended as a quick overview and holding page for | ||
basic facts about the service as well as the basic troubleshooting guide for | ||
oncall (since they will need the basic facts as well). | ||
|
||
The `build.md` file contains instructions for building functioning instances of | ||
the service. Ideally you should provide step-by-step instructions that | ||
someone else could follow to recreate the server and service. The `build.md` | ||
template provides an outline of the issues you should remember to cover, | ||
but you may format this document as is most appropriate for your service. Some | ||
services may prefer to describe their *infrastructure as code* location and | ||
release and deployment automation instead. | ||
|
||
The `disaster_recovery.md` template provides instructions on how to rebuild and | ||
restore a service in the event of catastrophic failure. | ||
|
||
The `common_tasks.md` template provides instructions for helpdesk, onduty, | ||
security operations, and end users who support or use the service. It also | ||
provides escalation information. | ||
|
||
The `disaster.md` template asks a series of questions regarding the reliability | ||
of a service, and the impact caused by loss of that service. This document can | ||
be used in a proactive way during the design and implementation phase of a | ||
service or after the fact as a way of evaluating how the service will fare in | ||
various scenarios. (This may or may not be used in the future and is not | ||
currently linked from the template navigation.) | ||
|
||
The `security.md` template describes access control and authorisation mechanisms | ||
required by the service. | ||
|
||
## Motivation | ||
|
||
A common structure of documentation helps prompt service owners to document what | ||
others might need to know but the owner doesn't know that they don't know. In | ||
other words, it helps operations teams mature by turning tribal knowledge into | ||
useful documentation. | ||
|
||
Another benefit of a common structure is that it lowers the cognitive burden of | ||
a context switch for oncall and onduty (or interrupts) staff when moving between | ||
services while debugging dependency chains. It also lowers the ramp-up time for | ||
people transitioning between teams. | ||
|
||
The contents of these docs should be reviewed and changed regularly as the | ||
service evolves and matures. Encourage new team members to update the docs as | ||
they learn where they have become incorrect. However, try to keep highly | ||
dynamic details out of these documents, like six-month plans and feature | ||
roadmaps, especially if that information is already hosted elsewhere: | ||
hyperlinks are better than manually synchronising content. | ||
|
||
## Caution | ||
|
||
It may happen that while filling out the templates, one is motivated to describe | ||
what should be, rather than what is. While a | ||
[*Production Readiness Review*](https://sre.google/sre-book/evolving-sre-engagement-model/) | ||
may ask similar questions, this is not a PRR: make sure you document the system | ||
as built, as that is most useful for the people who are maintaining the service | ||
or responding to incidents. (But do file feature requests for any ideas you | ||
have to improve the reliability and operational maturity of the service as you | ||
think of them!) | ||
|
||
## Usage | ||
|
||
1. Copy all of the templates to the canonical location for your service | ||
documentation. | ||
|
||
This might be in a central location in a monorepo, subdivided by common | ||
service name (e.g. /docs/operations/ldap, /docs/operations/ntp) or in a | ||
common subdirectory name in each project repo. Crucially, everyone should be | ||
able to develop muscle memory for the location of the service docs, and find | ||
an identical structure there. | ||
|
||
- Add a link to your new index page to the central index of services. | ||
|
||
2. Replace all strings marked with **@** (e.g. `@SERVICE_NAME@`) with the correct | ||
values for your service. | ||
|
||
Embrace sed: | ||
|
||
``` | ||
sed -i -e 's/@SERVICE_NAME@/ntp/g' \ | ||
-e 's/@TICKET_URL@/some_url/g' \ | ||
-e 's/@DESIGN_URL@/www.designdoc.com/g' \ | ||
*.md | ||
``` | ||
3. Replace the blockquote instructions with the content they describe. | ||
- Try to use the headings provided whenever possible as consistency | ||
between documents lowers the burden of a context switch. | ||
- Don't feel obliged to fill in all headings if they are not relevant, but | ||
if the answer is unknown, leave blank and come back to it later. | ||
- You might find you want to add additional documents, which is OK; just | ||
update the navigation links. | ||
- Change the names of "Ops, Helpdesk, or Security" to match your | ||
organisation. |
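
The copy and replace steps above can be sketched as a pair of shell functions. This is an illustration under stated assumptions rather than part of the templates: the function names `fill_templates` and `unfilled_placeholders`, the destination path, and the example URLs are all hypothetical. It uses `|` as the sed delimiter so that replacement values may contain slashes (as full URLs do), and it writes through a temporary file instead of `sed -i`, whose syntax differs between GNU and BSD sed.

```shell
# fill_templates SRC DEST SERVICE_NAME TICKET_URL DESIGN_URL
# Hypothetical helper: copies the templates from SRC into DEST and fills in
# three of the @...@ placeholders. Values must not contain '|' or '&',
# because both are special inside a sed replacement using this delimiter.
fill_templates() {
    src=$1
    dest=$2
    mkdir -p "$dest"
    cp "$src"/*.md "$dest"/
    for f in "$dest"/*.md; do
        sed -e "s|@SERVICE_NAME@|$3|g" \
            -e "s|@TICKET_URL@|$4|g" \
            -e "s|@DESIGN_URL@|$5|g" \
            "$f" > "$f.tmp" && mv "$f.tmp" "$f"
    done
}

# unfilled_placeholders DIR
# Hypothetical helper: prints file:line:text for every @NAME@ string that is
# still present (for example @Date@, which fill_templates does not touch).
# /dev/null is passed as an extra file so grep always prints the file name;
# '|| true' hides grep's non-zero exit status when nothing is left to fill.
unfilled_placeholders() {
    grep -n -E '@[A-Za-z_]+@' "$1"/*.md /dev/null || true
}
```

A possible invocation, with made-up values: `fill_templates templates docs/operations/ntp ntp https://tickets.example.com/new https://docs.example.com/ntp-design`, followed by `unfilled_placeholders docs/operations/ntp` to see what still needs attention.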
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,176 @@ | ||
[Home](index.md) [Operations](operations.md) [Build](build.md) [DR](disaster_recovery.md) [Common Tasks](common_tasks.md) [Security](security.md) | ||
|
||
Last Update: @Date@ | ||
|
||
> Replace this note with your own customisation. | ||
**Note:** This document is meant for Ops, Helpdesk, or Security. If you are | ||
having problems with a service, please call Helpdesk or | ||
[file a ticket](@TICKET_URL@). | ||
|
||
@SERVICE_NAME@ Build Document | ||
================================= | ||
|
||
##### Quick links: | ||
|
||
- [Server locations](operations.md#servers_hardware) | ||
- [Outage Impact/SLA](index.md#outageimpact) | ||
|
||
------------------------------------------------------------------------ | ||
|
||
This document gives instructions on how to build a new instance of this | ||
service. Its intended audience is sysadmins who may not be familiar with | ||
the service. | ||
|
||
Build Prerequisites | ||
------------------- | ||
|
||
> What hardware, software, and networking infrastructure should be in | ||
> place before you attempt to install the service application? | ||
> Prerequisites include basic hardware setup, or virtual machine type, and the installation of | ||
> commonly-used supporting software packages, such as Apache. | ||
### Hardware Requirements {#hardware_requirements} | ||
|
||
> What server or other hardware is required for this service? How is it | ||
> obtained, are there spares available? Is this service on console and/or | ||
> remote power? If not, why not? | ||
> If the hardware is virtual, explain what footprint is required. If the service is cloud-native, what does a minimum footprint look like, and which "as a service" platform does it run on? | ||
#### Physical Location {#location} | ||
|
||
> Are there any constraints on what data center(s) or server room(s) this | ||
> service should be installed in? How do you determine what power circuit | ||
> it should be connected to? (This may differ if you are setting up a | ||
> replacement server for a dead box vs. an additional server in a pool.) | ||
> Provide a link to a list of [current servers and their | ||
> locations](operations.md#servers_hardware). | ||
> For cloud environments, are there any policy constraints on where the service can be deployed? | ||
### Software Requirements {#software_requirements} | ||
|
||
#### OS | ||
|
||
> What OS should be installed and what is the procedure? | ||
> Does a human install Windows 2003 from CD? Do you PXE-boot Debian from the network? Do you boot a VM image? | ||
#### Third Party Software | ||
|
||
> List all software dependencies. | ||
> Do we use OS packages or fetch them directly from the upstream maintainers? | ||
> Do we fork the source and maintain our own branch? | ||
> Are customizations beyond the default install needed for this service? | ||
|
||
#### Licenses and Keys | ||
|
||
> Are there any licenses or keys that need to be obtained in order to run the | ||
> service? If so, where do you get them? Are they stored by Security, or do you | ||
> fetch them from a cloud key store? | ||
### Networking Requirements {#network_requirements} | ||
|
||
#### Setting up file shares | ||
|
||
> Does the service require you to set up NFS shares? If so, provide the | ||
> details. | ||
#### Configuring the IP/Subnet/vlan/VIPs | ||
|
||
> How should the IP of this service be determined? Can it reuse the IP if this | ||
> is a replacement for a dead server? What if this is an additional server? Do | ||
> we have a planning spreadsheet? Does the IP get handed out by DHCP, or do you | ||
> need to ask the Czar of Naming? What subnet should this service be installed | ||
> in? What switch should it be connected to? Is it behind a VIP? If so, how is | ||
> this configured? | ||
#### Configuring Access Control (ACLs/Security Operations) | ||
|
||
> Are there network access issues? Do routes, ACLs, or firewall rules need to be | ||
> configured? | ||
For information on access controls and processes to grant them see the | ||
[security document](security.md). | ||
|
||
#### DNS | ||
|
||
> What DNS entries need to be configured? Where? | ||
### Global Replication {#global_replication} | ||
|
||
For information on how this service is replicated | ||
see the [operational document](operations.md#global_replication). | ||
|
||
Build Procedure {#build_procedure} | ||
--------------- | ||
|
||
> This section should contain a step-by-step procedure for installing the | ||
> service. Installation of hardware and commonly used supporting software | ||
> packages should be placed in the Build Prerequisites section, if that makes sense to do so. | ||
> Bonus points for linking to the "infrastructure as code" source tree and explaining the automated build. | ||
### Installing the Supporting Software {#supporting_software} | ||
|
||
> What packages need to be installed? Are there customizations beyond the | ||
> default install needed? Do they need to be checked in to p4? Are | ||
> licenses needed? How do you obtain them? | ||
> Bonus points for linking to the "infrastructure as code" source tree and explaining the automated build. | ||
### Connecting to Other Services {#connecting_to_other_services} | ||
|
||
> Which other services does this one need to connect to? What information | ||
> does it need to get from these services? | ||
### Starting the Service {#starting_the_service} | ||
|
||
For instructions on how to start the service, see the [operational document](operations.md#start_stop). | ||
|
||
### Testing the Service {#testing_the_service} | ||
|
||
For instructions on how to verify if the service is running, see the [operational document](operations.md#service_verify). | ||
|
||
### Setting Up Service Monitoring {#setting_up_monitoring} | ||
|
||
> Is there local monitoring that needs to be installed/configured/started? | ||
> Do we need to configure a monitoring service to collect or receive instrumentation? | ||
For a list of current service monitoring, see the [operational document](operations.md#monitoring). | ||
|
||
### Setting Up Backups {#setting_up_backups} | ||
|
||
> How is the service backed up? What needs to be done to set up the | ||
> backups? | ||
### Required Notifications or Other Issues {#required_notifications} | ||
|
||
> Any other build issues? Is notification required when new servers or | ||
> replacement servers are installed? Who should be notified and, for | ||
> non-urgent changes, how much notice should be given? Are there any | ||
> change management processes or email addresses that need to be notified with | ||
> information about this build? | ||
Adding a Server to the Service {#adding_a_server} | ||
------------------------------ | ||
|
||
> Provide step-by-step instructions for adding a server to the service. | ||
Roles {#roles} | ||
----------- | ||
|
||
> If using a configuration management tool like Slack or Puppet, provide the | ||
> roles and descriptions here. | ||
> | ||
> Role Subrole Description | ||
> | ||
> ------------------------------------------------------------------------------ | ||
> | ||
> | ||
In-House Build Instructions {#package_build_instructions} | ||
-------------------------- | ||
|
||
> If there is custom built software for this | ||
> service, document its build and release process here. |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,72 @@ | ||
[Home](index.md) [Operations](operations.md) [Build](build.md) [DR](disaster_recovery.md) [Common Tasks](common_tasks.md) [Security](security.md) | ||
|
||
Last Update: @Date@ | ||
|
||
> Replace this note with your own customisation. | ||
**Note:** This document is meant for Ops, Helpdesk, or Security. If you are | ||
having problems with a service, please call Helpdesk or | ||
[file a ticket](@TICKET_URL@). | ||
|
||
@SERVICE_NAME@ Common Tasks | ||
=============================== | ||
|
||
##### Quick links: | ||
|
||
- [Server locations](operations.md#servers_hardware) | ||
- [Outage Impact/SLA](index.md#outageimpact) | ||
|
||
|
||
------------------------------------------------------------------------ | ||
|
||
This document describes how to perform routine administrative tasks for this | ||
service. Its intended audience includes onduty, helpdeskers and security | ||
engineers --- people other than the primary owners who may be asked to perform | ||
administrative tasks for this service. | ||
|
||
Escalation | ||
---------- | ||
|
||
> For information on how to route tickets for common issues concerning | ||
> this service, link to the escalation wiki. You should create an entry in | ||
> the table for your service, and log common issues there. To create an | ||
> entry, just copy the test heading and edit table, and customize them | ||
> for your service. | ||
Helpdesk | ||
-------- | ||
|
||
> What tasks do helpdesk personnel need to perform? Please provide | ||
> step-by-step instructions for these tasks. Link to an FAQ, if | ||
> appropriate. | ||
Onduty | ||
------ | ||
|
||
> What tasks do onduty personnel need to perform to support this service? | ||
> Please provide step-by-step instructions for these tasks. Link to an | ||
> FAQ, if appropriate. | ||
Security | ||
-------- | ||
|
||
> What tasks do Security personnel need to perform to support this service? | ||
> Please provide step-by-step instructions for these tasks. Link to an FAQ, if | ||
> appropriate. | ||
End User {#end_user} | ||
-------- | ||
|
||
> What tasks do end users need to perform to use this service? Please | ||
> provide links to the relevant helpdesk documentation. If the helpdesk | ||
> documentation is not up-to-date or complete, file a docbug and help us | ||
> fix it. Link to an FAQ, if appropriate. | ||
Obtaining Approval {#approval} | ||
------------------ | ||
|
||
> There are certain system tasks that require approval by the service owner | ||
> or by Security. Please list them in the following approval matrix. | ||
Task Approval Needed | ||
------ ----------------- | ||
|