Commit 33fa8ff

[Feature][Jdbc] Support read multiple tables (apache#5581)
1 parent a1d13b9 commit 33fa8ff

93 files changed

+3998
-818
lines changed

docs/en/concept/connector-v2-features.md

+4
@@ -49,6 +49,10 @@ In the **Parallelism Source Connector**, the source will be split into multiple
4949

5050
Users can configure the split rule.
5151

52+
### support multiple table read
53+
54+
Supports reading multiple tables in one SeaTunnel job.
55+
5256
## Sink Connector Features
5357

5458
Sink connectors have some common core features, and each sink connector supports them to varying degrees.

docs/en/connector-v2/source/Jdbc.md

+135-16
@@ -25,25 +25,34 @@ supports query SQL and can achieve projection effect.
2525

2626
- [x] [parallelism](../../concept/connector-v2-features.md)
2727
- [x] [support user-defined split](../../concept/connector-v2-features.md)
28+
- [x] [support multiple table read](../../concept/connector-v2-features.md)
2829

2930
## Options
3031

31-
| name | type | required | default value |
32-
|------------------------------|--------|----------|-----------------|
33-
| url | String | Yes | - |
34-
| driver | String | Yes | - |
35-
| user | String | No | - |
36-
| password | String | No | - |
37-
| query | String | Yes | - |
38-
| compatible_mode | String | No | - |
39-
| connection_check_timeout_sec | Int | No | 30 |
40-
| partition_column | String | No | - |
41-
| partition_upper_bound | Long | No | - |
42-
| partition_lower_bound | Long | No | - |
43-
| partition_num | Int | No | job parallelism |
44-
| fetch_size | Int | No | 0 |
45-
| properties | Map | No | - |
46-
| common-options | | No | - |
32+
| name | type | required | default value |
33+
|--------------------------------------------|--------|----------|-----------------|
34+
| url | String | Yes | - |
35+
| driver | String | Yes | - |
36+
| user | String | No | - |
37+
| password | String | No | - |
38+
| query | String | No | - |
39+
| compatible_mode | String | No | - |
40+
| connection_check_timeout_sec | Int | No | 30 |
41+
| partition_column | String | No | - |
42+
| partition_upper_bound | Long | No | - |
43+
| partition_lower_bound | Long | No | - |
44+
| partition_num | Int | No | job parallelism |
45+
| fetch_size | Int | No | 0 |
46+
| properties | Map | No | - |
47+
| table_path | String | No | - |
48+
| table_list | Array | No | - |
49+
| where_condition | String | No | - |
50+
| split.size | Int | No | 8096 |
51+
| split.even-distribution.factor.lower-bound | Double | No | 0.05 |
52+
| split.even-distribution.factor.upper-bound | Double | No | 100 |
53+
| split.sample-sharding.threshold | Int | No | 1000 |
54+
| split.inverse-sampling.rate | Int | No | 1000 |
55+
| common-options | | No | - |
4756

4857
### driver [string]
4958

@@ -98,6 +107,58 @@ improve performance by reducing the number database hits required to satisfy the
98107

99108
Additional connection configuration parameters. When properties and the URL have the same parameters, the priority is determined by the <br/>specific implementation of the driver. For example, in MySQL, properties take precedence over the URL.
100109

110+
### table_path
111+
112+
The full path of the table. You can use this configuration instead of `query`.
113+
114+
examples:
115+
- mysql: "testdb.table1"
116+
- oracle: "test_schema.table1"
117+
- sqlserver: "testdb.test_schema.table1"
118+
- postgresql: "testdb.test_schema.table1"
119+
120+
### table_list
121+
122+
The list of tables to be read. You can use this configuration instead of `table_path`.
123+
124+
example
125+
126+
```hocon
127+
table_list = [
128+
{
129+
table_path = "testdb.table1"
130+
}
131+
{
132+
table_path = "testdb.table2"
133+
query = "select * from testdb.table2 where id > 100"
134+
}
135+
]
136+
```
137+
138+
### where_condition
139+
140+
Common row filter conditions for all tables/queries. The value must start with `where`, for example `where id > 100`.
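The rule above (the value must start with `where`) can be illustrated with a small Python sketch. This is not SeaTunnel code: the function names `validate_where_condition` and `select_for_table_path` are hypothetical, and appending the condition to a plain `SELECT *` over a `table_path` is only an assumption about how a shared filter might be combined with a table read.

```python
def validate_where_condition(where_condition: str) -> str:
    """Return the trimmed condition, or raise if it does not start with `where`."""
    trimmed = where_condition.strip()
    # The documentation only states that the value must start with `where`;
    # treating the check as case-insensitive is an assumption of this sketch.
    first_word = trimmed.split(None, 1)[0].lower() if trimmed else ""
    if first_word != "where":
        raise ValueError("where_condition must start with `where`")
    return trimmed


def select_for_table_path(table_path: str, where_condition: str = "") -> str:
    """Hypothetical helper: build a SELECT for a table_path plus the shared filter."""
    sql = f"SELECT * FROM {table_path}"
    if where_condition:
        sql += " " + validate_where_condition(where_condition)
    return sql


if __name__ == "__main__":
    print(select_for_table_path("testdb.table1", "where id > 100"))
```

A value such as `id > 100` (without the leading `where`) is rejected by this sketch, which mirrors the constraint stated in the option description.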
141+
142+
### split.size
143+
144+
The split size (number of rows) of a table. Captured tables are divided into multiple splits when they are read.
145+
146+
### split.even-distribution.factor.lower-bound
147+
148+
The lower bound of the chunk key distribution factor. This factor is used to determine whether the table data is evenly distributed. The distribution factor is calculated as (MAX(id) - MIN(id) + 1) / row count. If it is greater than or equal to this lower bound, the table chunks are optimized for even distribution. Otherwise, the table is considered unevenly distributed, and the sampling-based sharding strategy is used if the estimated shard count exceeds the value specified by `split.sample-sharding.threshold`. The default value is 0.05.
149+
150+
### split.even-distribution.factor.upper-bound
151+
152+
The upper bound of the chunk key distribution factor. This factor is used to determine whether the table data is evenly distributed. The distribution factor is calculated as (MAX(id) - MIN(id) + 1) / row count. If it is less than or equal to this upper bound, the table chunks are optimized for even distribution. Otherwise, the table is considered unevenly distributed, and the sampling-based sharding strategy is used if the estimated shard count exceeds the value specified by `split.sample-sharding.threshold`. The default value is 100.0.
153+
154+
### split.sample-sharding.threshold
155+
156+
This configuration specifies the threshold of estimated shard count to trigger the sample sharding strategy. When the distribution factor is outside the bounds specified by `split.even-distribution.factor.upper-bound` and `split.even-distribution.factor.lower-bound`, and the estimated shard count (calculated as approximate row count / chunk size) exceeds this threshold, the sample sharding strategy will be used. This can help to handle large datasets more efficiently. The default value is 1000 shards.
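The decision described by the three options above can be sketched in Python as follows. This is a minimal illustration of the documented rules, not the connector's implementation: `SplitOptions`, `distribution_factor`, and `choose_strategy` are hypothetical names, the chunk size is assumed to be the `split.size` value, and the behavior for an uneven table below the threshold is not described in this document, so the sketch only reports it as unspecified.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SplitOptions:
    # Defaults copied from the options table in this document.
    split_size: int = 8096
    factor_lower_bound: float = 0.05
    factor_upper_bound: float = 100.0
    sample_sharding_threshold: int = 1000


def distribution_factor(min_key: int, max_key: int, row_count: int) -> float:
    """(MAX(id) - MIN(id) + 1) / row count, as given in the option descriptions."""
    if row_count <= 0:
        raise ValueError("row_count must be positive")
    return (max_key - min_key + 1) / row_count


def choose_strategy(min_key: int, max_key: int, row_count: int,
                    opts: SplitOptions = SplitOptions()) -> str:
    factor = distribution_factor(min_key, max_key, row_count)
    # Inside both bounds: the table is treated as evenly distributed.
    if opts.factor_lower_bound <= factor <= opts.factor_upper_bound:
        return "even"
    # Outside the bounds: sample sharding applies only above the threshold.
    # Using split_size as the chunk size is an assumption of this sketch.
    estimated_shards = row_count // opts.split_size
    if estimated_shards > opts.sample_sharding_threshold:
        return "sample"
    # Not covered by the option descriptions in this document.
    return "unspecified"


if __name__ == "__main__":
    print(choose_strategy(1, 1000, 1000))
    print(choose_strategy(1, 10**12, 10_000_000))
```

With the defaults, a dense key range of 1000 rows falls inside the bounds, while ten million rows spread over a key range of 10^12 have a factor far above 100 and enough estimated shards to cross the threshold.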
157+
158+
### split.inverse-sampling.rate
159+
160+
The inverse of the sampling rate used in the sample sharding strategy. For example, if this value is set to 1000, it means a 1/1000 sampling rate is applied during the sampling process. This option provides flexibility in controlling the granularity of the sampling, thus affecting the final number of shards. It's especially useful when dealing with very large datasets where a lower sampling rate is preferred. The default value is 1000.
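To make the "1/1000 sampling rate" concrete, here is a small Python sketch that keeps one key out of every N sorted keys and then picks shard boundaries from the sample. It is an assumption-laden illustration, not the connector's algorithm: `sample_keys` and `pick_boundaries` are hypothetical names, and systematic sampling over an in-memory sorted list stands in for whatever sampling query the connector actually issues.

```python
def sample_keys(sorted_keys, inverse_sampling_rate=1000):
    """Keep one key out of every `inverse_sampling_rate` keys (a 1/N sample)."""
    if inverse_sampling_rate < 1:
        raise ValueError("inverse_sampling_rate must be >= 1")
    return sorted_keys[::inverse_sampling_rate]


def pick_boundaries(sample, shard_count):
    """Pick shard_count - 1 interior boundaries, evenly spaced over the sample."""
    if shard_count < 1:
        raise ValueError("shard_count must be >= 1")
    if not sample or shard_count == 1:
        return []
    step = len(sample) / shard_count
    # int(i * step) is always < len(sample) because i < shard_count.
    return [sample[int(i * step)] for i in range(1, shard_count)]


if __name__ == "__main__":
    keys = list(range(10_000))          # stand-in for a sorted key column
    sample = sample_keys(keys, 1000)    # 1/1000 sampling rate
    print(len(sample), pick_boundaries(sample, 5))
```

A larger inverse rate yields a smaller sample, so fewer distinct boundaries are available; if `shard_count` exceeds the sample size, this sketch may return repeated boundaries.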
161+
101162
### common options
102163

103164
Source plugin common parameters, please refer to [Source Common Options](common-options.md) for details.
@@ -165,6 +226,10 @@ source {
165226
query = "select * from type_bin"
166227
partition_column = "id"
167228
partition_num = 10
229+
# Read start boundary
230+
#partition_lower_bound = ...
231+
# Read end boundary
232+
#partition_upper_bound = ...
168233
}
169234
}
170235
@@ -173,6 +238,60 @@ sink {
173238
}
174239
```
175240

241+
Read using `table_path`:
242+
243+
***Configuring `table_path` turns on auto split. You can configure `split.*` to adjust the split strategy.***
244+
245+
```hocon
246+
Jdbc {
247+
url = "jdbc:mysql://localhost/test?serverTimezone=GMT%2b8"
248+
driver = "com.mysql.cj.jdbc.Driver"
249+
connection_check_timeout_sec = 100
250+
user = "root"
251+
password = "123456"
252+
253+
# e.g. table_path = "testdb.table1", table_path = "test_schema.table1", table_path = "testdb.test_schema.table1"
254+
table_path = "testdb.table1"
255+
#split.size = 8096
256+
#split.even-distribution.factor.upper-bound = 100
257+
#split.even-distribution.factor.lower-bound = 0.05
258+
#split.sample-sharding.threshold = 1000
259+
#split.inverse-sampling.rate = 1000
260+
}
261+
```
262+
263+
Read multiple tables:
264+
265+
***Configuring `table_list` turns on auto split. You can configure `split.*` to adjust the split strategy.***
266+
267+
```hocon
268+
Jdbc {
269+
url = "jdbc:mysql://localhost/test?serverTimezone=GMT%2b8"
270+
driver = "com.mysql.cj.jdbc.Driver"
271+
connection_check_timeout_sec = 100
272+
user = "root"
273+
password = "123456"
274+
275+
table_list = [
276+
{
277+
# e.g. table_path = "testdb.table1", table_path = "test_schema.table1", table_path = "testdb.test_schema.table1"
278+
table_path = "testdb.table1"
279+
},
280+
{
281+
table_path = "testdb.table2"
282+
# Use query to filter rows & columns
283+
query = "select id, name from testdb.table2 where id > 100"
284+
}
285+
]
286+
#where_condition= "where id > 100"
287+
#split.size = 8096
288+
#split.even-distribution.factor.upper-bound = 100
289+
#split.even-distribution.factor.lower-bound = 0.05
290+
#split.sample-sharding.threshold = 1000
291+
#split.inverse-sampling.rate = 1000
292+
}
293+
```
294+
176295
## Changelog
177296

178297
### 2.2.0-beta 2022-09-26
