| split.even-distribution.factor.lower-bound | Double | No | 0.05 |
| split.even-distribution.factor.upper-bound | Double | No | 100 |
| split.sample-sharding.threshold | Int | No | 1000 |
| split.inverse-sampling.rate | Int | No | 1000 |
| common-options | | No | - |

### driver [string]

@@ -98,6 +107,58 @@ improve performance by reducing the number database hits required to satisfy the

Additional connection configuration parameters. When properties and the URL contain the same parameter, the priority is determined by the specific implementation of the driver. For example, in MySQL, properties take precedence over the URL.

### table_path

The full path of the table. You can use this configuration instead of `query`.

Examples:

- mysql: "testdb.table1"
- oracle: "test_schema.table1"
- sqlserver: "testdb.test_schema.table1"
- postgresql: "testdb.test_schema.table1"

### table_list

The list of tables to read. You can use this configuration instead of `table_path`.

Example:

```hocon
table_list = [
  {
    # Read the whole table
    table_path = "testdb.table1"
  }
  {
    # Read this table with its own query
    table_path = "testdb.table2"
    query = "select * from testdb.table2 where id > 100"
  }
]
```

### where_condition

A common row filter condition applied to all tables/queries. It must start with `where`, for example `where id > 100`.

### split.size

The split size (number of rows) of a table. Captured tables are divided into multiple splits when they are read.

### split.even-distribution.factor.lower-bound

The lower bound of the chunk key distribution factor. This factor, calculated as (MAX(id) - MIN(id) + 1) / row count, is used to determine whether the table data is evenly distributed. If the distribution factor is greater than or equal to this lower bound, the table chunks are optimized for even distribution. Otherwise, the table is considered unevenly distributed, and the sampling-based sharding strategy is used if the estimated shard count exceeds the value specified by `split.sample-sharding.threshold`. The default value is 0.05.

### split.even-distribution.factor.upper-bound

The upper bound of the chunk key distribution factor. This factor, calculated as (MAX(id) - MIN(id) + 1) / row count, is used to determine whether the table data is evenly distributed. If the distribution factor is less than or equal to this upper bound, the table chunks are optimized for even distribution. Otherwise, the table is considered unevenly distributed, and the sampling-based sharding strategy is used if the estimated shard count exceeds the value specified by `split.sample-sharding.threshold`. The default value is 100.0.

### split.sample-sharding.threshold

The threshold of the estimated shard count that triggers the sample sharding strategy. When the distribution factor is outside the bounds specified by `split.even-distribution.factor.upper-bound` and `split.even-distribution.factor.lower-bound`, and the estimated shard count (calculated as approximate row count / split size) exceeds this threshold, the sample sharding strategy is used. This can help handle large datasets more efficiently. The default value is 1000 shards.

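The decision described by the three options above can be sketched in Python. This is only an illustration of the documented rules, not the connector's implementation: the function names, the returned labels, and the behavior when the table is uneven but below the threshold (which the text does not specify) are all assumptions.

```python
def distribution_factor(min_id: int, max_id: int, row_count: int) -> float:
    """Distribution factor as documented: (MAX(id) - MIN(id) + 1) / row count."""
    if row_count <= 0:
        raise ValueError("row_count must be positive")
    return (max_id - min_id + 1) / row_count


def choose_split_strategy(
    min_id: int,
    max_id: int,
    row_count: int,
    split_size: int,
    lower_bound: float = 0.05,             # split.even-distribution.factor.lower-bound
    upper_bound: float = 100.0,            # split.even-distribution.factor.upper-bound
    sample_sharding_threshold: int = 1000, # split.sample-sharding.threshold
) -> str:
    """Return a hypothetical label for the strategy the documented rules select."""
    factor = distribution_factor(min_id, max_id, row_count)
    if lower_bound <= factor <= upper_bound:
        # Factor is within both bounds: treat the table as evenly distributed.
        return "even"
    # Outside the bounds: the table is considered unevenly distributed.
    estimated_shards = row_count / split_size
    if estimated_shards > sample_sharding_threshold:
        return "sample-sharding"
    # Assumption: the documentation does not say what happens in this case.
    return "uneven-without-sampling"
```

For example, a table with ids 1 to 1000 and 1000 rows has a factor of 1.0 and is treated as evenly distributed, while a table whose ids span 1 to 10^10 with 10 million rows has a factor of 1000 and falls outside the default upper bound.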
### split.inverse-sampling.rate

The inverse of the sampling rate used in the sample sharding strategy. For example, a value of 1000 means that a 1/1000 sampling rate is applied during sampling. This option controls the granularity of the sampling and therefore affects the final number of shards. It is especially useful for very large datasets, where a lower sampling rate is preferred. The default value is 1000.

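To make the "inverse" rate concrete, here is a minimal Python sketch that keeps one key out of every `inverse_sampling_rate` keys. How the connector actually draws its samples is not described here, so the function name and the every-Nth-key approach are assumptions used only to show the effect of the rate.

```python
from typing import List, Sequence


def sample_keys(sorted_keys: Sequence[int], inverse_sampling_rate: int = 1000) -> List[int]:
    """Keep every N-th key, where N is the inverse sampling rate (a 1/N sampling rate)."""
    if inverse_sampling_rate <= 0:
        raise ValueError("inverse_sampling_rate must be positive")
    # Slicing with a step of N keeps the keys at positions 0, N, 2N, ...
    return list(sorted_keys[::inverse_sampling_rate])
```

With 10,000 keys and the default rate of 1000, this keeps 10 sampled keys. A larger value keeps fewer samples, which in turn gives the sharding step fewer candidate boundaries to work with.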
### common options

Source plugin common parameters. Please refer to [Source Common Options](common-options.md) for details.
@@ -165,6 +226,10 @@ source {
        query = "select * from type_bin"
        partition_column = "id"
        partition_num = 10
        # Read start boundary
        #partition_lower_bound = ...
        # Read end boundary
        #partition_upper_bound = ...
    }
}

@@ -173,6 +238,60 @@ sink {
}
```

Reading with `table_path`:

***Configuring `table_path` turns on automatic splitting. You can configure the `split.*` options to adjust the split strategy.***