[MINOR][STREAMING][DOCS] Minor changes on kinesis integration
## What changes were proposed in this pull request?

Some minor changes to the documentation page "Spark Streaming + Kinesis Integration".

Moved "streaming-kinesis-arch.png" before the bullet list, not in between the bullets.

## How was this patch tested?

Tested manually on my local machine.

Author: Xin Ren <[email protected]>

Closes apache#14097 from keypointt/kinesisDoc.
keypointt authored and tdas committed Jul 12, 2016
1 parent 9e2c763 commit 05d7151
Showing 1 changed file with 13 additions and 13 deletions.
26 changes: 13 additions & 13 deletions docs/streaming-kinesis-integration.md
@@ -111,7 +111,7 @@ A Kinesis stream can be set up at one of the valid Kinesis endpoints with 1 or m

- `[checkpoint interval]`: The interval (e.g., Duration(2000) = 2 seconds) at which the Kinesis Client Library saves its position in the stream. For starters, set it to the same as the batch interval of the streaming application.

- `[initial position]`: Can be either `InitialPositionInStream.TRIM_HORIZON` or `InitialPositionInStream.LATEST` (see Kinesis Checkpointing section and Amazon Kinesis API documentation for more details).
- `[initial position]`: Can be either `InitialPositionInStream.TRIM_HORIZON` or `InitialPositionInStream.LATEST` (see [`Kinesis Checkpointing`](#kinesis-checkpointing) section and [`Amazon Kinesis API documentation`](http://docs.aws.amazon.com/streams/latest/dev/developing-consumers-with-sdk.html) for more details).

- `[message handler]`: A function that takes a Kinesis `Record` and outputs generic `T`.
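
Putting the parameters above together, the following is a minimal Scala sketch of the resulting `KinesisUtils.createStream` call. The application, stream, endpoint, and region values are placeholders rather than values from this page, and the exact overloads should be checked against the API docs for the Spark version in use.

```scala
import com.amazonaws.services.kinesis.clientlibrary.lib.worker.InitialPositionInStream
import com.amazonaws.services.kinesis.model.Record
import java.nio.charset.StandardCharsets
import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kinesis.KinesisUtils

// Placeholder values -- substitute your own application name, stream name, endpoint, and region.
val appName = "myKinesisApp"       // also names the DynamoDB table used for checkpointing
val streamName = "myKinesisStream"
val endpointUrl = "https://kinesis.us-east-1.amazonaws.com"
val regionName = "us-east-1"

val batchInterval = Seconds(2)
val ssc = new StreamingContext(new SparkConf().setAppName(appName), batchInterval)

// [checkpoint interval] is set to the batch interval, as suggested above;
// [initial position] starts from the latest tip of the stream.
val kinesisStream = KinesisUtils.createStream(
  ssc, appName, streamName, endpointUrl, regionName,
  InitialPositionInStream.LATEST, batchInterval, StorageLevel.MEMORY_AND_DISK_2)

// With a [message handler], each Kinesis Record can be mapped to an arbitrary type T,
// e.g. keeping the partition key alongside the decoded payload.
val keyedStream = KinesisUtils.createStream(
  ssc, appName, streamName, endpointUrl, regionName,
  InitialPositionInStream.LATEST, batchInterval, StorageLevel.MEMORY_AND_DISK_2,
  (record: Record) => (record.getPartitionKey, StandardCharsets.UTF_8.decode(record.getData).toString))
```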

@@ -128,14 +128,6 @@ A Kinesis stream can be set up at one of the valid Kinesis endpoints with 1 or m
Alternatively, you can also download the JAR of the Maven artifact `spark-streaming-kinesis-asl-assembly` from the
[Maven repository](http://search.maven.org/#search|ga|1|a%3A%22spark-streaming-kinesis-asl-assembly_{{site.SCALA_BINARY_VERSION}}%22%20AND%20v%3A%22{{site.SPARK_VERSION_SHORT}}%22) and add it to `spark-submit` with `--jars`.

*Points to remember at runtime:*

- Kinesis data processing is ordered per partition and occurs at-least once per message.

- Multiple applications can read from the same Kinesis stream. Kinesis will maintain the application-specific shard and checkpoint info in DynamoDB.

- A single Kinesis stream shard is processed by one input DStream at a time.

<p style="text-align: center;">
<img src="img/streaming-kinesis-arch.png"
title="Spark Streaming Kinesis Architecture"
@@ -145,6 +137,14 @@ A Kinesis stream can be set up at one of the valid Kinesis endpoints with 1 or m
<!-- Images are downsized intentionally to improve quality on retina displays -->
</p>

*Points to remember at runtime:*

- Kinesis data processing is ordered per partition and occurs at-least once per message.

- Multiple applications can read from the same Kinesis stream. Kinesis will maintain the application-specific shard and checkpoint info in DynamoDB.

- A single Kinesis stream shard is processed by one input DStream at a time.

- A single Kinesis input DStream can read from multiple shards of a Kinesis stream by creating multiple KinesisRecordProcessor threads.

- Multiple input DStreams running in separate processes/instances can read from a Kinesis stream.
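
Continuing the sketch above (same placeholder `ssc`, names, and intervals), the shard-related points are commonly handled by creating one Kinesis DStream per shard and unioning them before processing; the shard count here is hypothetical.

```scala
// Hypothetical: the stream has 2 shards, so create 2 Kinesis DStreams and union them.
// Each DStream runs its own receiver; together they cover all shards of the stream.
val numShards = 2
val kinesisStreams = (0 until numShards).map { _ =>
  KinesisUtils.createStream(
    ssc, appName, streamName, endpointUrl, regionName,
    InitialPositionInStream.LATEST, batchInterval, StorageLevel.MEMORY_AND_DISK_2)
}
val unionStream = ssc.union(kinesisStreams)

// Simple word count over the unioned stream, assuming whitespace-separated UTF-8 payloads.
val words = unionStream.flatMap(bytes =>
  new String(bytes, StandardCharsets.UTF_8).split("\\s+"))
words.map(word => (word, 1)).reduceByKey(_ + _).print()

ssc.start()
ssc.awaitTermination()
```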
@@ -173,7 +173,7 @@ To run the example,

- Set up Kinesis stream (see earlier section) within AWS. Note the name of the Kinesis stream and the endpoint URL corresponding to the region where the stream was created.

- Set up the environment variables AWS_ACCESS_KEY_ID and AWS_SECRET_KEY with your AWS credentials.
- Set up the environment variables `AWS_ACCESS_KEY_ID` and `AWS_SECRET_KEY` with your AWS credentials.

- In the Spark root directory, run the example as

@@ -216,6 +216,6 @@ de-aggregate records during consumption.

- Checkpointing too frequently will cause excess load on the AWS checkpoint storage layer and may lead to AWS throttling. The provided example handles this throttling with a random-backoff-retry strategy.

- If no Kinesis checkpoint info exists when the input DStream starts, it will start either from the oldest record available (InitialPositionInStream.TRIM_HORIZON) or from the latest tip (InitialPositionInStream.LATEST). This is configurable.
- InitialPositionInStream.LATEST could lead to missed records if data is added to the stream while no input DStreams are running (and no checkpoint info is being stored).
- InitialPositionInStream.TRIM_HORIZON may lead to duplicate processing of records where the impact is dependent on checkpoint frequency and processing idempotency.
- If no Kinesis checkpoint info exists when the input DStream starts, it will start either from the oldest record available (`InitialPositionInStream.TRIM_HORIZON`) or from the latest tip (`InitialPositionInStream.LATEST`). This is configurable.
- `InitialPositionInStream.LATEST` could lead to missed records if data is added to the stream while no input DStreams are running (and no checkpoint info is being stored).
- `InitialPositionInStream.TRIM_HORIZON` may lead to duplicate processing of records where the impact is dependent on checkpoint frequency and processing idempotency.
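
As a small illustration of this trade-off (the flag below is hypothetical; the two constants are the ones discussed above):

```scala
// Hypothetical flag: decide which failure mode is more acceptable for this job.
val preferNoDataLoss = true

val initialPosition =
  if (preferNoDataLoss)
    InitialPositionInStream.TRIM_HORIZON // replay from the oldest available record; duplicates are possible
  else
    InitialPositionInStream.LATEST       // start at the tip; records written while no DStream was running are missed
```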
