GitHub - Striderk/Crime-Statistics-with-Spark-Streaming: data engineering with spark

How did changing values on the SparkSession property parameters affect the throughput and latency of the data?

It affected inputRowsPerSecond and processedRowsPerSecond; depending on the values chosen, both throughput and latency may increase.
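One way to compare runs is to average the rate fields that Structured Streaming reports for each micro-batch. The sketch below assumes the progress reports have already been collected as dictionaries (for example from `query.recentProgress` in PySpark); the helper name `summarize_progress` and the sample numbers are hypothetical and are not taken from this repository.

```python
from statistics import mean


def summarize_progress(reports):
    """Average the rate metrics over progress reports that contained data.

    Each report is a dict shaped like a Structured Streaming progress
    report, with "numInputRows", "inputRowsPerSecond" and
    "processedRowsPerSecond" keys.
    """
    # Skip empty micro-batches so they do not drag the averages down.
    rows = [r for r in reports if r.get("numInputRows", 0) > 0]
    if not rows:
        return None
    return {
        "inputRowsPerSecond": mean(r["inputRowsPerSecond"] for r in rows),
        "processedRowsPerSecond": mean(r["processedRowsPerSecond"] for r in rows),
    }


# Hypothetical sample data, for illustration only.
sample_reports = [
    {"numInputRows": 200, "inputRowsPerSecond": 50.0, "processedRowsPerSecond": 100.0},
    {"numInputRows": 0, "inputRowsPerSecond": 0.0, "processedRowsPerSecond": 0.0},
    {"numInputRows": 400, "inputRowsPerSecond": 150.0, "processedRowsPerSecond": 200.0},
]

summary = summarize_progress(sample_reports)
print(summary)
```

A higher processedRowsPerSecond relative to inputRowsPerSecond suggests the query is keeping up with the incoming data.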

What were the 2-3 most efficient SparkSession property key/value pairs? Through testing multiple variations on values, how can you tell these were the most optimal?

The most efficient SparkSession properties are: spark.sql.shuffle.partitions and spark.default.parallelism.

The highest value I observed was 202.8400928279368.
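A minimal sketch of how such a comparison could be set up is shown below. It assumes PySpark is installed when `build_session` is called; the function names, the app name, the `local[*]` master, and every number in `trial_results` are hypothetical placeholders, not the settings or measurements used in this repository.

```python
def build_session(shuffle_partitions, default_parallelism):
    """Create a SparkSession with the two properties under test."""
    # Imported lazily so the rest of this sketch runs without PySpark.
    from pyspark.sql import SparkSession

    return (
        SparkSession.builder
        .master("local[*]")  # assumption: local test run
        .appName("CrimeStatisticsTuning")  # hypothetical app name
        .config("spark.sql.shuffle.partitions", str(shuffle_partitions))
        .config("spark.default.parallelism", str(default_parallelism))
        .getOrCreate()
    )


def best_setting(results):
    """Return the (settings, rate) pair with the highest observed rate."""
    return max(results, key=lambda item: item[1])


# Hypothetical trial results: (settings, observed processedRowsPerSecond).
trial_results = [
    ({"spark.sql.shuffle.partitions": 2, "spark.default.parallelism": 2}, 90.0),
    ({"spark.sql.shuffle.partitions": 10, "spark.default.parallelism": 10}, 180.0),
    ({"spark.sql.shuffle.partitions": 200, "spark.default.parallelism": 100}, 120.0),
]

settings, rate = best_setting(trial_results)
print(settings, rate)
```

Running the same stream once per candidate setting and recording the observed rate for each run makes the choice of the "most optimal" values reproducible rather than anecdotal.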

screenshots
.gitignore
README.md
data_stream.py
kafka_server.py
police-department-calls-for-service.json.zip
producer_server.py
radio_code.json
requirements.txt
start.sh
