forked from preslavmihaylov/booknotes
preslavmihaylov
committed
Mar 12, 2023
1 parent
47acd5a
commit 793d5fa
Showing
15 changed files
with
247 additions
and
0 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,247 @@ | ||
# Distributed Email Service | ||
In this chapter, we'll design a distributed email service similar to Gmail. | ||
|
||
In 2020, Gmail had 1.8 billion active users, while Outlook had 400 million users worldwide. | ||
|
||
# Step 1 - Understand the Problem and Establish Design Scope | ||
* C: How many users use the system? | ||
* I: 1bil users | ||
* C: I think the following features are important - auth, send/receive email, fetch email, filter emails, search email, anti-spam protection. | ||
* I: Good list. Don't worry about auth for now. | ||
* C: How do users connect with email servers? | ||
* I: Typically, email clients connect via SMTP, POP, IMAP, but we'll use HTTP for this problem. | ||
* C: Can emails have attachments? | ||
* I: Yes | ||
|
||
## Non-functional requirements | ||
* Reliability - we shouldn't lose data | ||
* Availability - We should use replication to prevent single points of failure. We should also tolerate partial system failures. | ||
* Scalability - As the userbase grows, our system should be able to handle the increased load. | ||
* Flexibility and extensibility - the system should be flexible and easy to extend with new features. This is one of the reasons we chose HTTP over SMTP/other mail protocols. | ||
|
||
## Back-of-the-envelope estimation | ||
* 1bil users | ||
* Assuming one person sends 10 emails per day -> ~100k emails per second. | ||
* Assuming one person receives 40 emails per day and each email has 50 KB of metadata on average -> 730 PB of storage per year. | ||
* Assuming 20% of emails have attachments and the average attachment size is 500 KB -> 1,460 PB per year. | ||
|
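The estimates above can be reproduced with a few lines of arithmetic. This is a rough sketch that assumes decimal units (1 KB = 1,000 bytes, 1 PB = 10^15 bytes) and a 365-day year; the variable names are illustrative only.

```python
SECONDS_PER_DAY = 24 * 60 * 60
USERS = 1_000_000_000
KB = 1_000       # assumption: decimal kilobytes
PB = 10**15      # assumption: decimal petabytes

# Sending rate: 10 emails per user per day.
sent_per_second = USERS * 10 / SECONDS_PER_DAY

# Metadata: 40 received emails per user per day, 50 KB each.
emails_per_year = USERS * 40 * 365
metadata_pb = emails_per_year * 50 * KB / PB

# Attachments: 20% of emails carry one, 500 KB on average.
attachments_pb = emails_per_year * 0.20 * 500 * KB / PB

print(round(sent_per_second), round(metadata_pb), round(attachments_pb))
```

The exact sending rate comes out slightly above 100k per second; the notes round it down to an order-of-magnitude figure.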
||
# Step 2 - Propose High-Level Design and Get Buy-In | ||
## Email knowledge 101 | ||
There are various protocols used for sending and receiving emails: | ||
* SMTP - standard protocol for sending emails from one server to another. | ||
* POP - standard protocol for receiving and downloading emails from a remote mail server to a local client. Once retrieved, emails are deleted from remote server. | ||
* IMAP - similar to POP, it is used for receiving and downloading emails from a remote server, but it keeps the emails on the server-side. | ||
* HTTPS - not technically an email protocol, but it can be used for web-based email clients. | ||
|
||
Apart from the mailing protocol, there are some DNS records we need to configure for our email server - the MX records: | ||
![dns-lookup](images/dns-lookup.png) | ||
|
||
Email attachments are sent base64-encoded, and most mail services impose a size limit of around 25 MB. | ||
This is configurable and varies from individual to corporate accounts. | ||
|
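To see why base64 matters for size limits, here is a small sketch using Python's standard `base64` module. It ignores the line breaks that MIME encoding adds in practice, so treat the ratio as a lower bound; the sample payload is arbitrary.

```python
import base64

# 3,072 bytes of arbitrary binary data standing in for an attachment.
raw = bytes(range(256)) * 12
encoded = base64.b64encode(raw)

# Base64 maps every 3 input bytes to 4 output characters,
# so the encoded form is about one third larger.
overhead = len(encoded) / len(raw)
print(len(raw), len(encoded), round(overhead, 2))

# Decoding restores the original bytes exactly.
assert base64.b64decode(encoded) == raw
```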
||
## Traditional mail servers | ||
Traditional mail servers work well when there are a limited number of users, connected to a single server. | ||
![traditional-mail-server](images/traditional-mail-server.png) | ||
* Alice logs into her Outlook email and presses "send". Email is sent to Outlook mail server. Communication is via SMTP. | ||
* Outlook server queries DNS to find MX record for gmail.com and transfers the email to their servers. Communication is via SMTP. | ||
* Bob fetches emails from his gmail server via IMAP/POP. | ||
|
||
In traditional mail servers, emails were stored on the local file system. Every email was a separate file. | ||
![local-dir-storage](images/local-dir-storage.png) | ||
|
||
As the scale grew, disk I/O became a bottleneck. This approach also doesn't satisfy our high-availability and reliability requirements: | ||
disks can be damaged and servers can go down. | ||
|
||
## Distributed mail servers | ||
Distributed mail servers are designed to support modern use-cases and solve modern scalability issues. | ||
|
||
These servers can still support IMAP/POP for native email clients and SMTP for mail exchange across servers. | ||
|
||
But for rich web-based mail clients, a RESTful API over HTTP is typically used. | ||
|
||
Example APIs: | ||
* `POST /v1/messages` - sends a message to recipients in To, Cc, Bcc headers. | ||
* `GET /v1/folders` - returns all folders of an email account | ||
Example response: | ||
``` | ||
[{id: string Unique folder identifier. | ||
name: string Name of the folder. | ||
According to RFC6154 [9], the default folders can be one of | ||
the following: All, Archive, Drafts, Flagged, Junk, Sent, | ||
and Trash. | ||
user_id: string Reference to the account owner | ||
}] | ||
``` | ||
* `GET /v1/folders/{:folder_id}/messages` - returns all messages under a folder, with pagination | ||
* `GET /v1/messages/{:message_id}` - get all information about a particular message | ||
Example response: | ||
``` | ||
{ | ||
user_id: string // Reference to the account owner. | ||
from: {name: string, email: string} // <name, email> pair of the sender. | ||
to: [{name: string, email: string}] // A list of <name, email> pairs | ||
subject: string // Subject of an email | ||
body: string // Message body | ||
is_read: boolean // Indicate if a message is read or not. | ||
} | ||
``` | ||
|
||
Here's the high-level design of the distributed mail server: | ||
![high-level-architecture](images/high-level-architecture.png) | ||
* Webmail - users use web browsers to send/receive emails | ||
* Web servers - public-facing request/response services used to manage login, signup, user profile, etc. | ||
* Real-time servers - used for pushing new email updates to clients in real time. We use WebSockets for real-time communication, but fall back to long polling for older browsers that don't support them. | ||
* Metadata db - stores email metadata such as subject, body, from, to, etc. | ||
* Attachment store - Object store (eg Amazon S3), suitable for storing large files. | ||
* Distributed cache - We can cache recent emails in Redis to improve UX. | ||
* Search store - distributed document store, used for supporting full-text searches. | ||
|
||
Here's what the email sending flow looks like: | ||
![email-sending-flow](images/email-sending-flow.png) | ||
* User writes an email and presses "send". Email is sent to load balancer. | ||
* Load balancer rate limits excessive mail sends and routes to one of the web servers. | ||
* Web servers do basic email validation (eg email size). If the recipient's domain is the same as the sender's, they short-circuit the outbound flow, but only after the spam check has run. | ||
* If basic validation passes, email is sent to message queue (attachment is referenced from object store) | ||
* If basic validation fails, email is sent to error queue | ||
* SMTP outgoing workers pull messages from outgoing queue, do spam/virus checks and route to destination mail server. | ||
* Email is stored in the "Sent Emails" folder | ||
|
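The validation and queueing steps above could be sketched roughly as follows. The `Email` type, the size limit, and the in-memory queues are hypothetical stand-ins for this illustration; a real deployment would use a durable message queue and far more thorough validation.

```python
from collections import deque
from dataclasses import dataclass, field

# Assumed limit, mirroring the 25 MB figure mentioned earlier.
MAX_EMAIL_BYTES = 25 * 1000 * 1000


@dataclass
class Email:
    sender: str
    recipients: list[str]
    body: str
    size_bytes: int = 0
    # Attachments live in the object store; the email only references them.
    attachment_keys: list[str] = field(default_factory=list)


outgoing_queue: deque = deque()  # stand-in for the outgoing message queue
error_queue: deque = deque()     # stand-in for the error queue


def validate(email: Email) -> list[str]:
    """Return a list of validation errors (empty when the email is valid)."""
    errors = []
    if not email.recipients:
        errors.append("no recipients")
    if any("@" not in r for r in email.recipients):
        errors.append("malformed recipient address")
    if email.size_bytes > MAX_EMAIL_BYTES:
        errors.append("email too large")
    return errors


def submit(email: Email) -> str:
    """Route a validated email to the outgoing queue, otherwise to the error queue."""
    errors = validate(email)
    if errors:
        error_queue.append((email, errors))
        return "error"
    outgoing_queue.append(email)
    return "queued"
```

For example, `submit(Email("alice@example.com", ["bob@example.com"], "hi", size_bytes=2))` returns `"queued"`, while an email with an empty recipient list ends up in `error_queue`.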
||
We also need to monitor the size of the outgoing message queue. If it grows too large, that might indicate a problem: | ||
* Recipient's mail server is unavailable. We can retry sending the email at a later time using exponential backoff. | ||
* There are not enough consumers to handle the load. We might have to scale the consumers. | ||
|
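Retrying with exponential backoff can be sketched as below. The function name, the default base delay, and the one-hour cap are assumptions made for this example. The jitter is a common addition (not mentioned in the notes) that keeps many workers from retrying at the same instant.

```python
import random


def backoff_delays(max_attempts, base_seconds=1.0, cap_seconds=3600.0,
                   jitter=True, rng=None):
    """Return the wait time (in seconds) before each retry attempt.

    The ceiling doubles on every attempt and is capped at cap_seconds.
    With jitter enabled, each delay is drawn uniformly from [0, ceiling].
    """
    rng = rng or random.Random()
    delays = []
    for attempt in range(max_attempts):
        ceiling = min(cap_seconds, base_seconds * 2 ** attempt)
        delays.append(rng.uniform(0, ceiling) if jitter else ceiling)
    return delays


print(backoff_delays(5, jitter=False))
```

A worker would sleep for the next delay after each failed delivery attempt and give up (or move the email to a dead-letter queue) once the list is exhausted.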
||
Here's the email receiving flow: | ||
![email-receiving-flow](images/email-receiving-flkow.png) | ||
* Incoming emails arrive at the SMTP load balancer. Mails are distributed to SMTP servers, where the mail acceptance policy is applied (eg invalid emails are discarded immediately). | ||
* If attachment of email is too large, we can put it in object store (s3). | ||
* Mail processing workers do preliminary checks, after which mails are forwarded to storage, cache, object store and real-time servers. | ||
* Offline users get their new emails once they come back online via HTTP API. | ||
|
||
# Step 3 - Design Deep Dive | ||
Let's now go deeper into some of the components. | ||
|
||
## Metadata database | ||
Here are some of the characteristics of email metadata: | ||
* headers are usually small and frequently accessed | ||
* Body size ranges from small to big, but is typically read once | ||
* Most mail operations are isolated to a single user - eg fetching email, marking as read, searching. | ||
* Data recency impacts data usage. Users typically read only recent emails | ||
* Data has high-reliability requirements. Data loss is unacceptable. | ||
|
||
At gmail/outlook scale, the database is typically custom made to reduce input/output operations per second (IOPS). | ||
|
||
Let's consider what database options we have: | ||
* Relational database - we can build indexes for headers and body, but these DBs are typically optimized for small chunks of data. | ||
* Distributed object store - this can be a good option for backup storage, but can't efficiently support searching/marking as read/etc. | ||
* NoSQL - Google BigTable is used by gmail, but it's not open-sourced. | ||
|
||
Based on the above analysis, very few existing solutions seem to fit our needs perfectly. | ||
In an interview setting, it's infeasible to design a new distributed database solution, but important to mention characteristics: | ||
* Single column can be a single-digit MB | ||
* Strong data consistency | ||
* Designed to reduce disk I/O | ||
* Highly available and fault tolerant | ||
* Should be easy to create incremental backups | ||
|
||
In order to partition the data, we can use the `user_id` as a partition key, so that one user's data is stored on a single shard. | ||
This prohibits us from sharing an email with multiple users, but this is not a requirement for this interview. | ||
|
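A minimal sketch of routing by `user_id` follows. The shard count and the hash-modulo scheme are assumptions for illustration; production systems usually prefer consistent hashing or a lookup table so that changing the number of shards doesn't move most users.

```python
import hashlib

NUM_SHARDS = 16  # hypothetical shard count


def shard_for(user_id: str, num_shards: int = NUM_SHARDS) -> int:
    """Map a user to a shard so that all of that user's data lands together."""
    # A stable hash is used on purpose: Python's built-in hash() is
    # randomized per process for strings.
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


print(shard_for("user-42"))
```

Because every folder and email row for a user carries the same partition key, typical mailbox queries touch a single shard.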
||
Let's define the tables: | ||
* Primary key consists of partition key (data distribution) and clustering key (sorting data) | ||
* Queries we need to support - get all folders for a user, display all emails for a folder, create/get/delete an email, fetch read/unread email, get conversation threads (bonus) | ||
|
||
Legend for tables to follow: | ||
![legend](images/legend.png) | ||
|
||
Here is the folders table: | ||
![folders-table](images/folders-table.png) | ||
|
||
emails table: | ||
![emails-table](images/emails-table.png) | ||
* email_id is a timeuuid, which allows sorting by the timestamp at which the email was created | ||
|
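Python's `uuid.uuid1()` produces the same kind of time-based (version 1) UUID, which makes the idea easy to try out. In this sketch the `node` and `clock_seq` values are arbitrary placeholders, and sorting uses the UUID's embedded timestamp, similar in spirit to how a timeuuid clustering column orders rows.

```python
import uuid


def new_email_id() -> uuid.UUID:
    # node and clock_seq are fixed, arbitrary values for this example only.
    return uuid.uuid1(node=0x010203040506, clock_seq=0)


ids = [new_email_id() for _ in range(5)]

# UUID.time is the 60-bit timestamp embedded in a version 1 UUID.
newest_first = sorted(ids, key=lambda u: u.time, reverse=True)
print(newest_first[0])
```

Sorting by the embedded timestamp, rather than by the raw string form, matters because the low-order time bits come first in the textual representation.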
||
Attachments are stored in a separate table, identified by filename: | ||
![attachments](images/attachments.png) | ||
|
||
Fetching read/unread emails is easy in a traditional relational database, but not in Cassandra, since filtering on columns outside the partition/clustering key is not allowed by default. | ||
One workaround is to fetch all emails in a folder and filter in-memory, but that doesn't work well for a big-enough application. | ||
|
||
What we can do is denormalize the emails table into read/unread emails tables: | ||
![read-unread-emails](images/read-unread-emails.png) | ||
|
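The effect of the denormalization can be illustrated with two in-memory maps. This is only a sketch: the dictionaries stand in for the two tables, the function names are made up, and it assumes that marking an email as read means removing it from the unread table and inserting it into the read table.

```python
from collections import defaultdict

# (user_id, folder_id) -> {email_id: email}; stand-ins for the two tables.
unread_emails = defaultdict(dict)
read_emails = defaultdict(dict)


def deliver(user_id, folder_id, email_id, email):
    """New emails start out in the unread table."""
    unread_emails[(user_id, folder_id)][email_id] = email


def mark_as_read(user_id, folder_id, email_id):
    """Move the email from the unread table to the read table."""
    key = (user_id, folder_id)
    email = unread_emails[key].pop(email_id)
    read_emails[key][email_id] = email


def list_unread(user_id, folder_id):
    """Listing unread emails no longer needs a filter on is_read."""
    return list(unread_emails[(user_id, folder_id)].values())
```

The price of this layout is that a status change becomes a delete plus an insert instead of a single-column update.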
||
In order to support conversation threads, we can include some headers, which mail clients interpret and use to reconstruct a conversation thread: | ||
``` | ||
{ | ||
"headers": { | ||
"Message-Id": "<[email protected]>", | ||
"In-Reply-To": "<[email protected]>", | ||
"References": ["<[email protected]>"] | ||
} | ||
} | ||
``` | ||
|
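As a rough illustration of how a client might use these headers, the sketch below groups messages into threads by following `In-Reply-To` links up to a root message. The message IDs are made-up examples, and real clients also consult `References` and cope with missing messages more carefully.

```python
from collections import defaultdict


def thread_root(message, by_id):
    """Follow In-Reply-To links until a message with no known parent is reached."""
    seen = {message["Message-Id"]}
    current = message
    while True:
        parent_id = current.get("In-Reply-To")
        # Stop at the root, at a parent we don't have, or on a cycle.
        if parent_id is None or parent_id not in by_id or parent_id in seen:
            return current["Message-Id"]
        seen.add(parent_id)
        current = by_id[parent_id]


def group_threads(messages):
    """Return {root Message-Id: [Message-Ids in that thread, in input order]}."""
    by_id = {m["Message-Id"]: m for m in messages}
    threads = defaultdict(list)
    for m in messages:
        threads[thread_root(m, by_id)].append(m["Message-Id"])
    return dict(threads)


messages = [
    {"Message-Id": "<a@example.com>"},
    {"Message-Id": "<b@example.com>", "In-Reply-To": "<a@example.com>"},
    {"Message-Id": "<c@example.com>", "In-Reply-To": "<b@example.com>"},
    {"Message-Id": "<d@example.com>"},
]
threads = group_threads(messages)
print(threads)
```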
||
Finally, we'll favor consistency over availability in our distributed database, since consistency is a hard requirement for this problem. | ||
|
||
Hence, in the event of a failover or network partition, sync/update actions will be briefly unavailable to impacted users. | ||
|
||
## Email deliverability | ||
It is easy to set up a server to send emails, but getting an email into a receiver's inbox is hard because of spam-protection algorithms. | ||
|
||
If we just set up a new mail server and start sending mails through it, our emails will probably end up in the spam folder. | ||
|
||
Here's what we can do to prevent that: | ||
* Dedicated IPs - use dedicated IPs for sending emails, otherwise, recipient servers will not trust you. | ||
* Classify emails - avoid sending marketing emails from the same servers as other mail, to prevent more important emails from being classified as spam | ||
* Warm up your IP address slowly to build a good reputation with big email providers. It takes 2 to 6 weeks to warm up a new IP | ||
* Ban spammers quickly to not deteriorate your reputation | ||
* Feedback processing - set up a feedback loop with ISPs to keep track of the complaint rate and ban spam accounts quickly. | ||
* Email authentication - use common techniques to combat phishing, such as Sender Policy Framework (SPF), DomainKeys Identified Mail (DKIM), etc. | ||
|
||
You don't need to remember all of this. Just know that building a good mail server requires a lot of domain knowledge. | ||
|
||
## Search | ||
Searching includes doing a full-text search based on email contents or more advanced queries based on from, to, subject, unread, etc filters. | ||
|
||
Email search has two notable characteristics: it is local to the user, and it has more writes than reads. The index must be updated on each email operation, while users rarely use the search tab. | ||
|
||
Let's compare google search with email search: | ||
| | Scope | Sorting | Accuracy | | ||
|---------------|----------------------|---------------------------------------|---------------------------------------------------| | ||
| Google search | The whole internet | Sort by relevance | Indexing takes some time, so not instant results. | | ||
| Email search | User’s own email box | Sort by attributes eg time, date, etc | Indexing should be quick and results accurate. | | ||
|
||
To achieve this search functionality, one option is to use an Elasticsearch cluster. We can use `user_id` as the partition key to group data under the same node: | ||
![elasticsearch](images/elasticsearch.png) | ||
|
||
Mutating operations are async via Kafka in order to decouple services from the reindexing flow. | ||
Search queries themselves are served synchronously. | ||
|
||
Elasticsearch is one of the most popular search-engine databases and supports full-text search for emails very well. | ||
|
||
Alternatively, we can attempt to develop our own custom search solution to meet our specific requirements. | ||
|
||
Designing such a system is out of scope. One of the core challenges when building it is to optimize it for write-heavy workloads. | ||
|
||
To achieve that, we can use Log-Structured Merge-Trees (LSM) to structure the index data on disk. Write path is optimized for sequential writes only. | ||
This technique is used in Cassandra, BigTable and RocksDB. | ||
|
||
Its core idea is to store data in-memory until a predefined threshold is reached, after which it is merged in the next layer (disk): | ||
![lsm-tree](images/lsm-tree.png) | ||
|
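The write path described above can be sketched with a toy in-memory structure. `TinyLSM` is a made-up name, sorted Python lists stand in for on-disk segment files, and compaction (merging older segments) is left out entirely.

```python
import bisect


class TinyLSM:
    def __init__(self, memtable_limit=3):
        self.memtable = {}   # recent writes, held in memory
        self.segments = []   # sorted (key, value) lists, oldest first
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self._flush()

    def _flush(self):
        # The whole memtable is written out in key order in one pass,
        # which is what keeps disk writes sequential in a real LSM tree.
        self.segments.append(sorted(self.memtable.items()))
        self.memtable = {}

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        # Newer segments shadow older ones, so search newest first.
        for segment in reversed(self.segments):
            i = bisect.bisect_left(segment, (key,))
            if i < len(segment) and segment[i][0] == key:
                return segment[i][1]
        return None
```

Reads may have to check several segments, which is the usual trade-off for cheap writes; real implementations mitigate it with compaction and Bloom filters.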
||
Main trade-offs between the two approaches: | ||
* Elasticsearch scales to some extent, whereas a custom search engine can be fine-tuned for the email use-case, allowing it to scale further. | ||
* Elasticsearch is a separate service we need to maintain, alongside the metadata store. A custom solution can be the datastore itself. | ||
* Elasticsearch is an off-the-shelf solution, whereas the custom search engine would require significant engineering effort to build. | ||
|
||
## Scalability and availability | ||
Since individual user operations don't collide with other users, most components can be independently scaled. | ||
|
||
To ensure high availability, we can also use a multi-DC setup with leader-follower failover in case of failures: | ||
![multi-dc-example](images/multi-dc-example.png) | ||
|
||
# Step 4 - Wrap up | ||
Additional talking points: | ||
* Fault tolerance - Many parts of the system could fail. It is worthwhile to discuss how we'd handle node failures. | ||
* Compliance - PII needs to be stored in a reasonable way, given Europe's GDPR laws. | ||
* Security - email encryption, phishing protection, safe browsing, etc. | ||
* Optimizations - eg preventing duplication of the same attachments, sent multiple times by different users. |
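One way to picture the attachment deduplication mentioned above is content-addressed storage, where the storage key is a hash of the file's bytes. The `DedupStore` class below is a hypothetical in-memory sketch, not a description of how any particular mail provider does it.

```python
import hashlib


class DedupStore:
    def __init__(self):
        self.blobs = {}      # content hash -> attachment bytes
        self.refcount = {}   # content hash -> number of emails referencing it

    def put(self, data: bytes) -> str:
        """Store an attachment once per distinct content and return its key."""
        key = hashlib.sha256(data).hexdigest()
        if key not in self.blobs:
            self.blobs[key] = data
        self.refcount[key] = self.refcount.get(key, 0) + 1
        return key

    def get(self, key: str) -> bytes:
        return self.blobs[key]
```

Two users sending the same file get back the same key, so only one copy is kept; the reference count indicates when a blob can be safely deleted.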
Binary file added
BIN
+150 KB
system-design/system-design-interview/chapter24/images/attachments.png
Binary file added
BIN
+88 KB
system-design/system-design-interview/chapter24/images/elasticsearch.png
Binary file added
BIN
+302 KB
system-design/system-design-interview/chapter24/images/email-receiving-flkow.png
Binary file added
BIN
+351 KB
system-design/system-design-interview/chapter24/images/email-sending-flow.png
Binary file added
BIN
+162 KB
system-design/system-design-interview/chapter24/images/emails-table.png
Binary file added
BIN
+90.7 KB
system-design/system-design-interview/chapter24/images/folders-table.png
Binary file added
BIN
+211 KB
system-design/system-design-interview/chapter24/images/high-level-architecture.png
Binary file added
BIN
+47.1 KB
system-design/system-design-interview/chapter24/images/local-dir-storage.png
Binary file added
BIN
+130 KB
system-design/system-design-interview/chapter24/images/multi-dc-example.png
Binary file added
BIN
+158 KB
system-design/system-design-interview/chapter24/images/read-unread-emails.png
Binary file added
BIN
+219 KB
system-design/system-design-interview/chapter24/images/traditional-mail-server.png