Document flags/env variables useful for performance tuning (argoproj#2312)
tkasuz · Sep 14, 2019 · 047d06f
1 parent 300b9b5

Showing 2 changed files with 57 additions and 5 deletions.
diff --git a/cmd/argocd-application-controller/main.go b/cmd/argocd-application-controller/main.go
@@ -114,7 +114,7 @@ func newCommand() *cobra.Command {
 	command.Flags().IntVar(&glogLevel, "gloglevel", 0, "Set the glog logging level")
 	command.Flags().IntVar(&metricsPort, "metrics-port", common.DefaultPortArgoCDMetrics, "Start metrics server on given port")
 	command.Flags().IntVar(&selfHealTimeoutSeconds, "self-heal-timeout-seconds", 5, "Specifies timeout between application self heal attempts")
-	command.Flags().Int64Var(&kubectlParallelismLimit, "kubectl-parallelism-limit", 0, "Number of allowed concurrent kubectl fork/execs.")
+	command.Flags().Int64Var(&kubectlParallelismLimit, "kubectl-parallelism-limit", 20, "Number of allowed concurrent kubectl fork/execs. Any value less than 1 means no limit.")
 
 	cacheSrc = cache.AddCacheFlagsToCmd(&command)
 	return &command

diff --git a/docs/operator-manual/high_availability.md b/docs/operator-manual/high_availability.md
@@ -11,9 +11,61 @@ A set HA of manifests are provided for users who wish to run Argo CD in a highly
 
 ## Scaling Up
 
-You might scale up some Argo CD services in the following circumstances:
+### argocd-repo-server
 
-* The `argocd-repo-server` can scale up when there is too much contention on a single git repo (e.g. many apps defined in a single git repo).
-* The `argocd-server` can scale up to support more front-end load.
+**settings:**
 
-All other services should run with their pre-determined number of replicas. The `argocd-application-controller` must not be increased because multiple controllers will fight. The `argocd-dex-server` uses an in-memory database, and two or more instances would have inconsistent data. `argocd-redis` is pre-configured with the understanding of only three total redis servers/sentinels.
+The `argocd-repo-server` is responsible for cloning Git repositories, keeping them up to date, and generating manifests using the appropriate tool.
+
+* `argocd-repo-server` fork/execs the config management tool to generate manifests. The fork can fail due to lack of memory or a limit on the number of OS threads.
+The `--parallelismlimit` flag controls how many manifest generations run concurrently and helps avoid OOM kills.
+
+* One `argocd-repo-server` instance executes only one operation on a given Git repo at a time. Increase the `argocd-repo-server` replica count if you have many
+applications in the same repository.
+
+* `argocd-repo-server` clones repositories into `/tmp` (or the path specified in the `TMPDIR` env variable). The pod might run out of disk space if it has too many repositories
+or the repositories have a lot of files. To avoid this problem, mount a persistent volume.
+
+* `argocd-repo-server` runs `git ls-remote` to resolve ambiguous revisions such as `HEAD` or a branch or tag name. This operation happens frequently
+and might fail. To avoid failed syncs, use the `ARGOCD_GIT_ATTEMPTS_COUNT` environment variable to retry failed requests.
+
+**metrics:**
+
+* `argocd_git_request_total` - Number of git requests. The metric provides two tags: `repo` - Git repo URL; `request_type` - `ls-remote` or `fetch`.
+
+### argocd-application-controller
+
+**settings:**
+
+The `argocd-application-controller` uses `argocd-repo-server` to get generated manifests and the Kubernetes API server to get the actual cluster state.
+
+* The controller uses two separate queues to process application reconciliation (milliseconds) and app syncing (seconds). The number of queue processors for each queue is controlled by
+the `--status-processors` (20 by default) and `--operation-processors` (10 by default) flags. Increase the number of processors if your Argo CD instance manages many applications.
+For 1000 applications we use 50 for `--status-processors` and 25 for `--operation-processors`.
+
+* Manifest generation typically takes the most time during reconciliation. Its duration is limited to make sure the controller refresh queue does not overflow.
+App reconciliation fails with a `Context deadline exceeded` error if manifest generation takes too much time. As a workaround, increase the value of `--repo-server-timeout-seconds` and
+consider scaling up the `argocd-repo-server` deployment.
+
+* The controller fork/execs `kubectl` to push changes into the cluster and to convert resources from the preferred version into the user-specified version
+(e.g. Deployment `apps/v1` into `extensions/v1beta1`). As with config management tools, `kubectl` fork/exec might cause the pod to be OOM killed. Use the `--kubectl-parallelism-limit` flag to limit
+the number of allowed concurrent kubectl fork/execs.
+
+* The controller uses Kubernetes watch APIs to maintain a lightweight Kubernetes cluster cache. This avoids querying Kubernetes during app reconciliation and significantly improves
+performance. For performance reasons, the controller monitors and caches only the preferred version of a resource. During reconciliation, the controller might have to convert a cached resource from the
+preferred version into the version of the resource stored in Git. If `kubectl convert` fails because the conversion is not supported, the controller falls back to a Kubernetes API query, which slows down
+reconciliation. In this case, we advise users to use the preferred resource version in Git.
+
+**metrics:**
+
+* `argocd_app_reconcile` - Reports application reconciliation duration. Can be used to build a reconciliation duration heat map to get a high-level picture of reconciliation performance.
+* `argocd_app_k8s_request_total` - Number of k8s requests per application, i.e. the number of fallback Kubernetes API queries. Useful to identify which application has a resource with a
+non-preferred version and causes performance issues.
+
+### argocd-server
+
+The `argocd-server` is stateless and probably the least likely to cause issues. Consider increasing the number of replicas to 3 or more to ensure there is no downtime during upgrades.
+
+### argocd-dex-server, argocd-redis
+
+The `argocd-dex-server` uses an in-memory database, and two or more instances would have inconsistent data. `argocd-redis` is pre-configured with the understanding of only three total redis servers/sentinels.