AlgoExpert: Systems Design Fundamentals - MapReduce
phgnam committed Nov 13, 2023
1 parent 3063821 commit b8d161c
Showing 3 changed files with 46 additions and 2 deletions.
@@ -50,8 +50,8 @@ I think that is very common, where people just feel inhibited to ask for help, a
Now, of course, if they repeatedly don't ask for help, then that's an issue.
One thing that I think is important to do when you're dealing with low performance is to...
At least here I'm talking about a big tech company environment, is to document everything and to document steps that were taken to improve the performance of the perceived low performer.
- And this can really help them, because if it turns out that the project, or their perceived low performance wasn't their fault, it's important to have that documented so that in six months down the line, you don't have a new person who comes in, and just sees, "Oh, well, this person failed to deliver that project six months ago in a certain amount of time.
- Therefore, they performed poorly." When in fact it was due to an external dependency so important to document that kind of stuff.
+ And this can really help them, because if it turns out that the project, or their perceived low performance wasn't their fault, it's important to have that documented so that in six months down the line, you don't have a new person who comes in, and just sees, "Oh, well, this person failed to deliver that project six months ago in a certain amount of time. Therefore, they performed poorly."
+ When in fact it was due to an external dependency so important to document that kind of stuff.
And also, if they were in fact performing poorly, and they took steps to improve on their performance and actually did improve.
That's a great thing to document and can actually help them a lot in their next performance review for the next promotion and so on and so forth.
It's always important to document.
@@ -0,0 +1,43 @@
# 23 - MapReduce
"MapReduce is a programming model for processing and generating big data sets with a parallel, distributed algorithm on a cluster."

Does Wikipedia's nebulous definition confuse you? Of course it does. In this video, we'll map out this complex topic and reduce it to clear, easily-understood concepts. See what we did there? ( ͡~ ͜ʖ ͡°)

## Prerequisites

### File System

An abstraction over a storage medium that defines how to manage data. While there exist many different types of file systems, most follow a hierarchical structure that consists of directories and files, like the **Unix file system**'s structure.

### Idempotent Operation

An operation that has the same ultimate outcome regardless of how many times it's performed. If an operation can be performed multiple times without changing its overall effect, it's idempotent. Operations performed through a **Pub/Sub** messaging system typically have to be idempotent, since Pub/Sub systems tend to allow the same messages to be consumed multiple times.

For example, increasing an integer value in a database is *not* an idempotent operation, since repeating this operation will not have the same effect as if it had been performed only once. Conversely, setting a value to "COMPLETE" *is* an idempotent operation, since repeating this operation will always yield the same result: the value will be "COMPLETE".
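
The contrast above can be sketched in a few lines of Python. This is only an illustration: the `row` dictionary stands in for a database record, and the `increment` and `mark_complete` helpers are hypothetical names, not part of any real database or Pub/Sub API.

```python
# Hypothetical in-memory stand-in for a database row (illustration only).
row = {"views": 0, "status": "PENDING"}


def increment(record, key):
    """Not idempotent: every repeat changes the stored value again."""
    record[key] = record.get(key, 0) + 1


def mark_complete(record, key):
    """Idempotent: repeating it leaves the record in the same state."""
    record[key] = "COMPLETE"


# Simulate the same message being delivered three times.
for _ in range(3):
    increment(row, "views")
    mark_complete(row, "status")

print(row)
```

After three deliveries, `views` has drifted to 3 even though the sender intended a single increment, while `status` is exactly what one delivery would have produced.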

## Key Terms

### MapReduce

A popular framework for processing very large datasets in a distributed setting efficiently, quickly, and in a fault-tolerant manner. A MapReduce job consists of three main steps:
- the Map step, which runs a map function on the various chunks of the dataset and transforms these chunks into intermediate key-value pairs.
- the Shuffle step, which reorganizes the intermediate key-value pairs such that pairs of the same key are routed to the same machine in the final step.
- the Reduce step, which runs a reduce function on the newly shuffled key-value pairs and transforms them into more meaningful data.

The canonical example of a MapReduce use case is counting the number of occurrences of words in a large text file.
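
As a rough sketch of that word-count example, the three steps can be simulated in a single Python process. Everything here is an assumption made for illustration: the function names (`map_chunk`, `shuffle`, `reduce_bucket`), the tiny in-memory `chunks`, and the use of Python's built-in `hash` to pick a reducer. A real MapReduce implementation would run these steps on separate machines and handle failures itself.

```python
from collections import defaultdict


def map_chunk(chunk):
    """Map step: emit a (word, 1) pair for every word in one chunk."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(pairs, num_reducers):
    """Shuffle step: send all pairs with the same key to the same bucket."""
    buckets = [defaultdict(list) for _ in range(num_reducers)]
    for key, value in pairs:
        buckets[hash(key) % num_reducers][key].append(value)
    return buckets


def reduce_bucket(bucket):
    """Reduce step: sum the values collected for each word."""
    return {key: sum(values) for key, values in bucket.items()}


# Each string stands in for one chunk of a much larger file.
chunks = ["the quick brown fox", "the lazy dog", "the fox"]

intermediate = [pair for chunk in chunks for pair in map_chunk(chunk)]
buckets = shuffle(intermediate, num_reducers=2)

word_counts = {}
for bucket in buckets:
    word_counts.update(reduce_bucket(bucket))

print(word_counts)
```

Because every occurrence of a given word lands in the same bucket, each reducer can compute a final count for its words without talking to the others.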

When dealing with a MapReduce library, engineers and/or systems administrators only need to worry about the map and reduce functions, as well as their inputs and outputs. All other concerns, including the parallelization of tasks and the fault-tolerance of the MapReduce job, are abstracted away and taken care of by the MapReduce implementation.

### Distributed File System

A Distributed File System is an abstraction over a (usually large) cluster of machines that allows them to act like one large file system. The two most popular implementations of a DFS are Google File System (GFS) and the Hadoop Distributed File System (HDFS).

Typically, DFSs take care of the classic availability and replication guarantees that can be tricky to obtain in a distributed-system setting. The overarching idea is that files are split into chunks of a certain size (4MB or 64MB, for instance), and those chunks are sharded across a large cluster of machines. A central control plane is in charge of deciding where each chunk resides, routing reads to the right nodes, and handling communication between machines.
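
To make the chunking idea concrete, here is a minimal sketch of splitting a file into fixed-size chunks and recording where each chunk lives. It is not how GFS or HDFS actually place data: the round-robin policy, the replica count, the node names, and the 4-byte chunk size (standing in for something like 64MB) are all assumptions chosen to keep the example small.

```python
def split_into_chunks(data, chunk_size):
    """Cut the file contents into fixed-size chunks; the last may be shorter."""
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]


def place_chunks(num_chunks, nodes, replicas):
    """Toy control plane: map each chunk index to the nodes that store it.

    Uses simple round-robin placement (an assumption for illustration).
    """
    placement = {}
    for index in range(num_chunks):
        placement[index] = [
            nodes[(index + offset) % len(nodes)] for offset in range(replicas)
        ]
    return placement


data = b"0123456789"                         # stand-in for a very large file
chunks = split_into_chunks(data, chunk_size=4)
nodes = ["node-a", "node-b", "node-c"]       # hypothetical machine names
placement = place_chunks(len(chunks), nodes, replicas=2)

# A read for chunk 1 would be routed to one of the nodes listed for it.
print(placement[1])
```

Storing each chunk on more than one node is what lets reads continue when a single machine goes down, which is the kind of replication guarantee described above.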

Different DFS implementations have slightly different APIs and semantics, but they achieve the same common goal: extremely large-scale persistent storage.

### Hadoop (This is a technology or product that you can use in your systems)

A popular, open-source framework that supports MapReduce jobs and many other kinds of data-processing pipelines. Its central component is HDFS (Hadoop Distributed File System), on top of which other technologies have been developed.

Learn more: [https://hadoop.apache.org](https://hadoop.apache.org)