Initial checkin of Services Documentation templates.
This is a version of the Sysops `ops100` templates used for documenting
services for oncallers, ticketeers (aka onduty or interrupts), helpdesk, and
service teams.
Jamie Wilkinson committed Oct 5, 2021
1 parent 5f9b52b commit 47596c8
Showing 7 changed files with 891 additions and 0 deletions.
113 changes: 113 additions & 0 deletions service_docs/README.md
@@ -0,0 +1,113 @@
# Templates for Services Documentation

(The "ops100" docs.)

The `templates` directory contains the following templates:

- [index.md](templates/index.md)
- [operations.md](templates/operations.md)
- [build.md](templates/build.md)
- [disaster\_recovery.md](templates/disaster_recovery.md)
- [common\_tasks.md](templates/common_tasks.md)
- [security.md](templates/security.md)

The `index.md` file is a template for the landing page that should be in each
service's directory.

The `operations.md` file is intended as a quick overview and holding page for
basic facts about the service as well as the basic troubleshooting guide for
oncall (since they will need the basic facts as well).

The `build.md` file contains instructions for building functioning instances of
the service. Ideally you should provide step-by-step instructions that
someone else could follow to recreate the server and service. The `build.md`
template outlines the issues you should remember to cover,
but you may format this document as is most appropriate for your service. Some
services may prefer to describe their *infrastructure as code* location and
release+deployment automation instead.

The `disaster_recovery.md` template provides instructions on how to rebuild and
restore a service in the event of catastrophic failure.

The `common_tasks.md` template provides instructions for helpdesk, onduty,
security operations, and end users who support or use the service. It also
provides escalation information.

The `disaster.md` template asks a series of questions regarding the reliability
of a service, and the impact caused by loss of that service. This document can
be used in a proactive way during the design and implementation phase of a
service or after the fact as a way of evaluating how the service will fare in
various scenarios. (This may or may not be used in the future and is not
currently linked from the template navigation.)

The `security.md` template describes access control and authorisation mechanisms
required by the service.

## Motivation

A common structure of documentation helps prompt service owners to document what
others might need to know but the owner doesn't know that they don't know. In
other words, it helps operations teams mature by turning tribal knowledge into
useful documentation.

Another benefit of a common structure is that it lowers the cognitive burden of
a context switch for oncall and onduty (or interrupts) staff when moving between
services while debugging dependency chains. It also lowers the ramp-up time for
people transitioning between teams.

The contents of these docs should be reviewed and changed regularly as the
service evolves and matures. Encourage new team members to update the docs as
they learn where they've become incorrect. However, try to
keep highly dynamic details out of these documents, like 6 month plans and
feature roadmaps, especially if that information is already hosted elsewhere --
hyperlinks are better than manually synchronising content.

## Caution

It may happen that while filling out the templates, one is motivated to describe
what should be, rather than what is. While a
[*Production Readiness Review*](https://sre.google/sre-book/evolving-sre-engagement-model/)
may ask similar questions, this is not a PRR: make sure you document the system
as built, as that is most useful for the people who are maintaining the service
or responding to incidents. (But, do file feature requests for any ideas you
have to improve the reliability and operational maturity of the service as you
think of them!)

## Usage

1. Copy all of the templates to the canonical location for your service
documentation.

This might be in a central location in a monorepo, subdivided by common
service name (e.g. /docs/operations/ldap, /docs/operations/ntp) or in a
common subdirectory name in each project repo. Crucially, everyone should be
able to develop muscle memory for the location of the service docs, and find
an identical structure there.

- Add a link to your new index page to the central index of services.

2. Replace all strings marked with **@** (e.g. `@SERVICE_NAME@`) with the correct
values for your service.

Embrace sed:

```
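# GNU sed shown; BSD/macOS sed needs `-i ''` instead of a bare `-i`.
# `|` is used as the delimiter for URL values, which usually contain `/`.
sed -i -e 's/@SERVICE_NAME@/ntp/g' \
    -e 's|@TICKET_URL@|some_url|g' \
    -e 's|@DESIGN_URL@|www.designdoc.com|g' \
    *.md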
```
3. Replace the blockquote instructions with the content they describe.
- Try to use the headings provided whenever possible as consistency
between documents lowers the burden of a context switch.
- Don't feel obliged to fill in all headings if they are not relevant, but
if the answer is unknown, leave it blank and come back to it later.
- You might find you want to add additional documents, which is OK; just
update the navigation links.
- Change the names of "Ops, Helpdesk, or Security" to match your
organisation.
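
Steps 1 and 2 above can be combined into one pass. The following is a minimal
sketch, assuming a POSIX shell; the function names `instantiate_docs` and
`leftover_placeholders`, and any paths or URLs passed to them, are illustrative
and not part of the templates. It copies the templates while substituting the
placeholders (including the `@Date@` token in each "Last Update" line), then
lists any `@TOKEN@` strings that were missed.

```shell
#!/bin/sh
# Sketch only: function names and arguments are hypothetical.

# instantiate_docs TEMPLATE_DIR DEST_DIR SERVICE_NAME TICKET_URL DESIGN_URL
# Copies every template into DEST_DIR with the placeholders filled in.
instantiate_docs() {
    src=$1 dest=$2 name=$3 ticket_url=$4 design_url=$5
    mkdir -p "$dest" || return 1
    for f in "$src"/*.md; do
        # `|` is the delimiter so that URLs containing `/` are safe;
        # values containing `|`, `&` or `\` would still need escaping.
        # Writing to a new file avoids `sed -i`, whose syntax differs
        # between GNU and BSD sed.
        sed -e "s|@SERVICE_NAME@|$name|g" \
            -e "s|@TICKET_URL@|$ticket_url|g" \
            -e "s|@DESIGN_URL@|$design_url|g" \
            -e "s|@Date@|$(date +%Y-%m-%d)|g" \
            "$f" > "$dest/$(basename "$f")" || return 1
    done
}

# leftover_placeholders DIR
# Prints file:line:token for each remaining @TOKEN@; exits non-zero if
# none are left. /dev/null forces grep to print the file name even when
# DIR holds a single file.
leftover_placeholders() {
    grep -n -o '@[A-Za-z_][A-Za-z_]*@' "$1"/*.md /dev/null
}
```

For example, `instantiate_docs service_docs/templates docs/operations/ntp ntp
"$TICKET_URL" "$DESIGN_URL"` followed by `leftover_placeholders
docs/operations/ntp` would show which tokens still need a value, assuming those
two variables hold your organisation's URLs.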
176 changes: 176 additions & 0 deletions service_docs/templates/build.md
@@ -0,0 +1,176 @@
[Home](index.md) [Operations](operations.md) [Build](build.md) [DR](disaster_recovery.md) [Common Tasks](common_tasks.md) [Security](security.md)

Last Update: @Date@

> Replace this note with your own customisation.

**Note:** This document is meant for Ops, Helpdesk, or Security. If you are
having problems with a service, please call Helpdesk or
[file a ticket](@TICKET_URL@).

@SERVICE_NAME@ Build Document
=================================

##### Quick links:

- [Server locations](operations.md#servers_hardware)
- [Outage Impact/SLA](index.md#outageimpact)

------------------------------------------------------------------------

This document gives instructions on how to build a new instance of this
service. Its intended audience is sysadmins who may not be familiar with
the service.

Build Prerequisites
-------------------

> What hardware, software, and networking infrastructure should be in
> place before you attempt to install the service application?
> Prerequisites include basic hardware setup, or virtual machine type, and the installation of
> commonly-used supporting software packages, such as Apache.

### Hardware Requirements {#hardware_requirements}

> What server or other hardware is required for this service? How is it
> obtained, and are there spares available? Is this service on console and/or
> remote power? If not, why not?
> If the hardware is virtual, explain what footprint is required. If the service is cloud-native, what does a minimum footprint look like, and which "as a service" platform does it run on?

#### Physical Location {#location}

> Are there any constraints on what data center(s) or server room(s) this
> service should be installed in? How do you determine what power circuit
> it should be connected to? (This may differ if you are setting up a
> replacement server for a dead box vs. an additional server in a pool.)
> Provide a link to a list of [current servers and their
> locations](operations.md#servers_hardware).
> For cloud environments, are there any policy constraints on where the service can be deployed?
### Software Requirements {#software_requirements}

#### OS

> What OS should be installed and what is the procedure?
> Does a human install Windows 2003 from CD? Do you PXE-boot Debian from the network? Do you boot a VM image?
#### Third Party Software

> List all software dependencies.
> Do we use OS packages or fetch them directly from the upstream maintainers?
> Do we fork the source and maintain our own branch?
> Are customizations beyond the default install needed for this service?

#### Licenses and Keys

> Are there any licenses or keys that need to be obtained in order to run the
> service? If so, where do you get them? Are they stored by Security, or do you
> fetch them from a cloud key store?
### Networking Requirements {#network_requirements}

#### Setting up file shares

> Does the service require you to set up NFS shares? If so, provide the
> details.
#### Configuring the IP/Subnet/vlan/VIPs

> How should the IP of this service be determined? Can it reuse the IP if this
> is a replacement for a dead server? What if this is an additional server? Do
> we have a planning spreadsheet? Does the IP get handed out by DHCP, or do you
> need to ask the Czar of Naming? What subnet should this service be installed
> in? What switch should it be connected to? Is it behind a VIP? If so, how is
> this configured?

#### Configuring Access Control (ACLs/Security Operations)

> Are there network access issues: do routes, ACLs, or firewall rules need to be
> configured?

For information on access controls and processes to grant them, see the
[security document](security.md).

#### DNS

> What DNS entries need to be configured? Where?
### Global Replication {#global_replication}

For information on how this service is replicated
see the [operational document](operations.md#global_replication).

Build Procedure {#build_procedure}
---------------

> This section should contain a step-by-step procedure for installing the
> service. Installation of hardware and commonly used supporting software
> packages should be placed in the Build prerequisite section, if that makes sense to do so.
> Bonus points for linking to the "infrastructure as code" source tree and explaining the automated build.
### Installing the Supporting Software {#supporting_software}

> What packages need to be installed? Are there customizations beyond the
> default install needed? Do they need to be checked in to p4? Are
> licenses needed? How do you obtain them?
> Bonus points for linking to the "infrastructure as code" source tree and explaining the automated build.
### Connecting to Other Services {#connecting_to_other_services}

> Which other services does this one need to connect to? What information
> does it need to get from these services?
### Starting the Service {#starting_the_service}

For instructions on how to start the service, see the [operational document](operations.md#start_stop).

### Testing the Service {#testing_the_service}

For instructions on how to verify if the service is running, see the [operational document](operations.md#service_verify).

### Setting Up Service Monitoring {#setting_up_monitoring}

> Is there local monitoring that needs to be installed/configured/started?
> Do we need to configure a monitoring service to collect or receive instrumentation?

For a list of current service monitoring, see the [operational document](operations.md#monitoring).

### Setting Up Backups {#setting_up_backups}

> How is the service backed up? What needs to be done to set up the
> backups?
### Required Notifications or Other Issues {#required_notifications}

> Any other build issues? Is notification required when new servers or
> replacement servers are installed? Who should be notified and, for
> non-urgent changes, how much notice should be given? Are there any
> change management processes/email addresses that need to be notified with
> information about this build?

Adding a Server to the Service {#adding_a_server}
------------------------------

> Provide step-by-step instructions for adding a server to the service.
Roles {#roles}
-----------

> If using a configuration management tool like Slack or Puppet, provide the
> roles and descriptions here.
>
>   Role     Subrole     Description
>   -------- ----------- -------------
>

In-House Build Instructions {#package_build_instructions}
--------------------------

> If there is custom built software for this
> service, document its build and release process here.
72 changes: 72 additions & 0 deletions service_docs/templates/common_tasks.md
@@ -0,0 +1,72 @@
[Home](index.md) [Operations](operations.md) [Build](build.md) [DR](disaster_recovery.md) [Common Tasks](common_tasks.md) [Security](security.md)

Last Update: @Date@

> Replace this note with your own customisation.

**Note:** This document is meant for Ops, Helpdesk, or Security. If you are
having problems with a service, please call Helpdesk or
[file a ticket](@TICKET_URL@).

@SERVICE_NAME@ Common Tasks
===============================

##### Quick links:

- [Server locations](operations.md#servers_hardware)
- [Outage Impact/SLA](index.md#outageimpact)


------------------------------------------------------------------------

This document describes how to perform routine administrative tasks for this
service. Its intended audience includes onduty, helpdeskers and security
engineers --- people other than the primary owners who may be asked to perform
administrative tasks for this service.

Escalation
----------

> For information on how to route tickets for common issues concerning
> this service, link to the escalation wiki. You should create an entry in
> the table for your service, and log common issues there. To create an
> entry, just copy the test heading and edit table, and customize them
> for your service.

Helpdesk
--------

> What tasks do helpdesk personnel need to perform? Please provide
> step-by-step instructions for these tasks. Link to an FAQ, if
> appropriate.
Onduty
------

> What tasks do onduty personnel need to perform to support this service?
> Please provide step-by-step instructions for these tasks. Link to an
> FAQ, if appropriate.

Security
--------

> What tasks do Security personnel need to perform to support this service?
> Please provide step-by-step instructions for these tasks. Link to an FAQ, if
> appropriate.
End User {#end_user}
--------

> What tasks do end users need to perform to use this service? Please
> provide links to the relevant helpdesk documentation. If the helpdesk
> documentation is not up-to-date or complete, file a docbug and help us
> fix it. Link to an FAQ, if appropriate.
Obtaining Approval {#approval}
------------------

> There are certain system tasks that require approval by the service owner
> or by Security. Please list them in the following approval matrix.

Task   Approval Needed
------ -----------------
