Rewrite/improve basic load test (cadence-workflow#4399)

gaecom · Aug 27, 2021 · 0b98055
1 parent cde0f41
commit 0b98055

Showing 9 changed files with 365 additions and 246 deletions.
diff --git a/bench/README.md b/bench/README.md
@@ -7,62 +7,94 @@ Setup
 -----------
 ### Cadence server
 
-Basic bench test don't require Advanced Visibility. 
+The bench suite runs against a Cadence server or cluster.
+
+Note that the Basic bench test is the only one that doesn't require Advanced Visibility.
 
-Other advanced bench tests requires Cadence server with Advanced Visibility. You can run it through:
+The other, advanced bench tests require a Cadence server with Advanced Visibility.
+
+For a local environment, you can run such a server through:
 - Docker: Instructions for running Cadence server through docker can be found in `docker/README.md`. Either `docker-compose-es-v7.yml` or `docker-compose-es.yml` can be used to start the server.
 - Build from source: Please check [CONTRIBUTING](/CONTRIBUTING.md) for how to build and run Cadence server from source. Please also make sure Kafka and ElasticSearch are running before starting the server with `./cadence-server --zone es start`. If ElasticSearch v7 is used, change the value for `--zone` flag to `es_v7`.
 
-### Search Attributes
-One of the bench tests (called `Cron`), which is responsible for running other tests as a cron job and tracking the results, requires an search attribute named `Passed`. 
+See more [documentation here](https://cadenceworkflow.io/docs/concepts/search-workflows/).
 
-For local development environment, this search attribute has already been added to the ES index template and the list of valid search attributes.
+### Bench Workers
+:warning: NOTE: Unlike canary, starting a bench worker does not automatically start a bench test. The next two sections cover how to configure the workers and start a test.
 
-However, if you already have a running ES cluster, you will need to add this search attribute to your ES cluster through the following steps:
+There are different ways to start the bench workers:
 
-1. Update ES cluster index template using the following Cadence CLI command
-   ```
-   cadence adm cluster asa --search_attr_key Passed --search_attr_type 4 
-   ```
-2. Add `Passed: 4` to the dynamic config value of valid search attributes (`frontend.validSearchAttributes`), so that Cadence server can recognize it.
-3. Validate it has been successfully added with
-   ```
-   cadence cluster get-search-attr
-   ```
+#### 1. Use docker image `ubercadence/cadence-bench:master`
 
-### Bench Workers
-For now there's no docker image for bench workers. The only way to run bench workers is:
-1. Build cadence bench binary:
+For now, this image has no release versions, in order to keep the release process simple. Always use the `master` tag for the image.
+
+Similar to the server/CLI images, the bench image is built and published automatically by GitHub on every commit to the `master` branch.
+
+You can use the [pre-built docker-compose file](../docker/docker-compose-bench.yml) to run against a local server.
+In the `docker/` directory, run:
+```
+docker-compose -f docker-compose-bench.yml up
+```
+You can modify [the bench worker config](../docker/config/bench/development.yaml) to run against a production server cluster.
+
+Alternatively, you can run it with Kubernetes; see this [example](https://github.com/longquanzheng/cadence-lab/blob/master/eks/bench-deployment.yaml).
+
+
+
+#### 2. Build and run the binary
+
+In the project root, build the cadence bench binary:
    ```
    make cadence-bench
    ```
-2. Start bench workers:
+
+Then start the bench worker:
    ```
    ./cadence-bench start
    ```
-   By default, it will load the configuration in `config/bench/development.yaml`. Please run `./cadence-bench -h` for details on how to change the configuration directory and file used.
-3. Note that, unlike canary, starting bench worker will not automatically start a bench test. Next two sections will cover how to start and configure it.
+By default, it loads [the configuration in `config/bench/development.yaml`](../config/bench/development.yaml).
+Run `./cadence-bench -h` for the start options, including how to change the configuration directory if needed.
 
 Worker Configurations
 ----------------------
 Bench workers configuration contains two parts:
 - **Bench**: this part controls the client side, including the bench service name, which domains bench workers are responsible for and how many taskLists each domain should use.
+```yaml 
+bench:
+  name: "cadence-bench" # bench name
+  domains: ["cadence-bench", "cadence-bench-sync", "cadence-bench-batch"] # workers are started on all of these domains (each domain is registered first if it does not exist)
+  numTaskLists: 3 # workers will listen on cadence-bench-tl-0, cadence-bench-tl-1, cadence-bench-tl-2
+``` 
+1. Bench workers only poll from task lists whose names start with `cadence-bench-tl-`. If `numTaskLists` is set to 2 in the configuration, workers will only listen to `cadence-bench-tl-0` and `cadence-bench-tl-1`. So make sure you use a valid task list name when starting the bench load.
+2. When bench workers start, they try to register a **local domain with archival feature disabled** for each domain name listed in the configuration, if it does not already exist. If you want to test the performance of global domains and/or the archival feature, please register the domains before starting the worker.
+
 - **Cadence**: this control how bench worker should talk to Cadence server, which includes the server's service name and address.
+```yaml
+cadence:
+  service: "cadence-frontend" # frontend service name
+  host: "127.0.0.1:7933" # frontend address
+```
+- **Metrics**: metrics configuration. As with the server metric emitter, only M3, Statsd, and Prometheus are supported.
+- **Log**: logging configuration, similar to the server logging configuration.
 
-Note:
-1.  When starting bench workers, it will try to register a **local domain with archival feature disabled** for each domain name listed in the configuration, if not already exists. If your want to test the performance of global domains and/or archival feature, please register the domains first before starting the worker.
-2.  Bench workers will only poll from task lists whose name start with `cadence-bench-tl-`. If in the configuration, `numTaskLists` is specified to be 2, then workers will only listen to `cadence-bench-tl-0` and `cadence-bench-tl-1`. So make sure you use a valid task list name when starting the bench load.
 
-Bench Loads
+Bench Load Types
 -----------
 This section briefly describes the purpose of each bench load and provides a sample command for running the load. Detailed descriptions for each test's configuration can be found in `bench/lib/config.go`
 
 Please note that all load configurations in `config/bench` is for only local development and illustration purpose, it does not reflect the actual capability of Cadence server.
 
 ### Basic
-This is the only bench test that don't require advanced visibility.
+:warning: NOTE: This is the only bench test that doesn't require the advanced visibility feature on the server. Make sure you set `useBasicVisibilityValidation` to true if you run with basic (DB) visibility.  
+Also, basic visibility validation requires that only one test load runs in the same domain. This is because basic visibility does not allow using workflowType and status filters at the same time.  
+
+As the name suggests, this load covers the basic load-testing case.
+You start a `launchWorkflow`, which executes some `launchActivities` to start `stressWorkflows`. The stressWorkflows then run activities sequentially or in parallel.
+Once all stressWorkflows are started, the launchWorkflow waits for the stressWorkflow timeout plus a buffer time (default 5 minutes) before checking the status of all test workflows.
 
-As the name suggests, this load tests the basic case of starting workflows and running activities in sequential/parallel. Once all test workflows are started, it will wait test workflow timeout + 5 mins before checking the status of all test workflows. If the failure rate is too high, or if there's any open workflows found, the test will fail.
+Two criteria must be met to pass the verification:
+1. No open workflows (an open workflow means the server may have lost some tasks and was unable to close the stressWorkflow)
+2. Failed/timed-out workflows <= threshold (totalLaunchCount * failureThreshold)
 
 The basic load can also be run in "panic" mode by setting `"panicStressWorkflow": true,` to test if server can handle large number of panic workflows (which can be caused by a bad worker deployment).
 
@@ -86,31 +118,35 @@ Progress:
   22, 2021-08-20T11:59:24-07:00, WorkflowExecutionCompleted
 
 Result:
-  Run Time: 526 seconds
+  Run Time: 26 seconds
   Status: COMPLETED
-  Output: "SuccessCount: 100, FailedCount: 0"
-```
-The test will return error if the test doesn't pass. There are two cases:
-* The stress workflow couldn't finish within the timeout
-* There are more failed worklfow than expected(configured by `failureThreshold`)
-
-### Cron
-`Cron` itself is not a test. It is responsible for running multiple other tests in parallel or sequential according a cron schedule. 
-
-Tests in `Cron` are divided to into multiple test suites. Tests in different test suites will be run in parallel, while tests within a test suite will be run in a random sequential order. Different test suites can also be run in different domains, which provides a way for testing the multi-tenant performance of Cadence server. 
+  Output: "TEST PASSED. Details report: timeoutCount: 0, failedCount: 0, openCount:0, launchCount: 100, maxThreshold:1"
 
-On the completion of each test, `Cron` will be signaled with the result of the test, which can be queried through:
-```
-cadence --do <domain> wf query --wid <workflowID of the Cron workflow> --qt test-results
 ```
-This command will show the result of all completed tests.
 
-When all tests complete, `Cron` will update the value of the `Passed` search attribute accordingly. `Passed` will be set to `true` only when all tests have passed, and `false` otherwise. Since the last event for cron workflow is always WorkflowContinuedAsNew, this search attribute can be used to tell whether one run of `Cron` is successful or not. You can see the search attribute value by adding `--psa` flag to workflow list commands when listing `Cron` runs.
+The output or error result shows whether the test passed, along with a detailed report.
 
-A sample cron configuration is in `config/bench/cron.json`, and it can be started with
-```
-cadence --do <domain> wf start --tl cadence-bench-tl-0 --wt cron-test-workflow --dt 30 --et 7200 --if config/bench/cron.json
-```
+Configuration of the basic load type. The config is passed as the launch workflow's input parameter using a JSON file.
+ 
+```yaml
+# configuration for launch workflow
+useBasicVisibilityValidation: use basic (DB-based) visibility to verify the stress workflows; default false, which requires advanced visibility on the server
+totalLaunchCount: total number of stressWorkflows started by the launchWorkflow
+waitTimeBufferInSeconds: buffer time, in addition to executionStartToCloseTimeoutInSeconds, to wait for stressWorkflows before verification; default 300 (5 minutes)
+routineCount: number of parallel launch activities started by the launchWorkflow to start the stressWorkflows
+failureThreshold: threshold of failed stressWorkflows for deciding whether the whole test suite failed
+maxLauncherActivityRetryCount: max retries of a launcher activity when starting stress workflows; default 5
+contextTimeoutInSeconds: RPC timeout inside activities (e.g. starting a stressWorkflow); default 3s
+
+# configuration for stress workflow
+executionStartToCloseTimeoutInSeconds: StartToCloseTimeout of the stressWorkflow; default 5m
+chainSequence: number of steps in the stressWorkflow
+concurrentCount: number of parallel activities (a dummy activity that only echoes data) in each step of the stressWorkflow
+payloadSizeBytes: payload size of the echo data in the dummy activity
+minCadenceSleepInSeconds: controls the sleep time between two steps in the stressWorkflow; actual sleep time = random(min,max); default 0
+maxCadenceSleepInSeconds: controls the sleep time between two steps in the stressWorkflow; actual sleep time = random(min,max); default 0
+panicStressWorkflow: if true, the stressWorkflow will always panic; default false
+``` 
 
 ### Cancellation
 The load tests the StartWorkflowExecution and CancelWorkflowExecution sync API, and validates the number of cancelled workflows and if there's any open workflow.
@@ -147,4 +183,30 @@ Typical usage is the same as the concurrent execution load above. Run it in para
 Sample configuration can be found in `config/bench/timer.json` and it can be started with
 ```
 cadence --do <domain> wf start --tl cadence-bench-tl-0 --wt timer-load-test-workflow --dt 30 --et 3600 --if config/bench/timer.json 
+```
+
+### Cron: Run all the workloads as a TestSuite
+
+:warning: NOTE: This requires a search attribute named `Passed` of boolean type. This search attribute should already have been added to the [ES schema](/schema/elasticsearch).
+Make sure the dynamic config also has [this search attribute (`frontend.validSearchAttributes`)](/config/dynamicconfig/development_es.yaml), so that the Cadence server can recognize it.
+* Validate that `Passed` has been successfully added to the dynamic config:
+   ```
+   cadence cluster get-search-attr
+   ```
+
+`Cron` itself is not a test. It is responsible for running all other tests in parallel or sequentially according to a cron schedule.
+
+Tests in `Cron` are divided into multiple test suites. Tests in different test suites will be run in parallel, while tests within a test suite will be run in a random sequential order. Different test suites can also be run in different domains, which provides a way for testing the multi-tenant performance of Cadence server.
+
+On the completion of each test, `Cron` will be signaled with the result of the test, which can be queried through:
+```
+cadence --do <domain> wf query --wid <workflowID of the Cron workflow> --qt test-results
+```
+This command will show the result of all completed tests.
+
+When all tests complete, `Cron` will update the value of the `Passed` search attribute accordingly. `Passed` will be set to `true` only when all tests have passed, and `false` otherwise. Since the last event for cron workflow is always WorkflowContinuedAsNew, this search attribute can be used to tell whether one run of `Cron` is successful or not. You can see the search attribute value by adding `--psa` flag to workflow list commands when listing `Cron` runs.
+
+A sample cron configuration is in `config/bench/cron.json`, and it can be started with
+```
+cadence --do <domain> wf start --tl cadence-bench-tl-0 --wt cron-test-workflow --dt 30 --et 7200 --if config/bench/cron.json
 ```
diff --git a/bench/lib/config.go b/bench/lib/config.go
@@ -89,17 +89,22 @@ type (
 
 	// BasicTestConfig contains the configuration for running the Basic test scenario
 	BasicTestConfig struct {
-		TotalLaunchCount                      int     `yaml:"totalLaunchCount"`
-		RoutineCount                          int     `yaml:"routineCount"`
-		ChainSequence                         int     `yaml:"chainSequence"`
-		ConcurrentCount                       int     `yaml:"concurrentCount"`
-		PayloadSizeBytes                      int     `yaml:"payloadSizeBytes"`
-		MinCadenceSleepInSeconds              int     `yaml:"minCadenceSleepInSeconds"`
-		MaxCadenceSleepInSeconds              int     `yaml:"maxCadenceSleepInSeconds"`
-		ExecutionStartToCloseTimeoutInSeconds int     `yaml:"executionStartToCloseTimeoutInSeconds"` // default 5m
-		ContextTimeoutInSeconds               int     `yaml:"contextTimeoutInSeconds"`               // default 3s
-		PanicStressWorkflow                   bool    `yaml:"panicStressWorkflow"`                   // default false
-		FailureThreshold                      float64 `yaml:"failureThreshold"`
+		// Launch workflow config
+		UseBasicVisibilityValidation  bool    `yaml:"useBasicVisibilityValidation"`  // use basic (DB-based) visibility to verify the stress workflows; default false, which requires advanced visibility on the server
+		TotalLaunchCount              int     `yaml:"totalLaunchCount"`              // total number of stressWorkflows started by the launchWorkflow
+		RoutineCount                  int     `yaml:"routineCount"`                  // number of parallel launch activities started by the launchWorkflow to start the stressWorkflows
+		FailureThreshold              float64 `yaml:"failureThreshold"`              // threshold of failed stressWorkflows for deciding whether the whole test suite failed
+		MaxLauncherActivityRetryCount int     `yaml:"maxLauncherActivityRetryCount"` // max retries of a launcher activity when starting stress workflows, default: 5
+		ContextTimeoutInSeconds       int     `yaml:"contextTimeoutInSeconds"`       // RPC timeout inside activities (e.g. starting a stressWorkflow), default 3s
+		WaitTimeBufferInSeconds       int     `yaml:"waitTimeBufferInSeconds"`       // buffer time, in addition to ExecutionStartToCloseTimeoutInSeconds, to wait for stressWorkflows before verification, default 300 (5 minutes)
+		// Stress workflow config
+		ExecutionStartToCloseTimeoutInSeconds int  `yaml:"executionStartToCloseTimeoutInSeconds"` // StartToCloseTimeout of the stressWorkflow, default 5m
+		ChainSequence                         int  `yaml:"chainSequence"`                         // number of steps in the stressWorkflow
+		ConcurrentCount                       int  `yaml:"concurrentCount"`                       // number of parallel activities (a dummy activity that only echoes data) in each step of the stressWorkflow
+		PayloadSizeBytes                      int  `yaml:"payloadSizeBytes"`                      // payload size of the echo data in the dummy activity
+		MinCadenceSleepInSeconds              int  `yaml:"minCadenceSleepInSeconds"`              // controls the sleep time between two steps in the stressWorkflow, actual sleep time = random(min,max), default: 0
+		MaxCadenceSleepInSeconds              int  `yaml:"maxCadenceSleepInSeconds"`              // controls the sleep time between two steps in the stressWorkflow, actual sleep time = random(min,max), default: 0
+		PanicStressWorkflow                   bool `yaml:"panicStressWorkflow"`                   // if true, the stressWorkflow will always panic, default false
 	}
 
 	// SignalTestConfig is the parameters for signalLoadTestWorkflow