.\" generated with Ronn/v0.7.3
.\" http://github.com/rtomayko/ronn/tree/0.7.3
.
.TH "PGLOADER" "1" "February 2017" "ff" ""
.
.SH "NAME"
\fBpgloader\fR \- PostgreSQL data loader
.
.SH "SYNOPSIS"
.
.nf
pgloader [<options>] [<command\-file>]\.\.\.
pgloader [<options>] SOURCE TARGET
.
.fi
.
.SH "DESCRIPTION"
pgloader loads data from various sources into PostgreSQL\. It can transform the data it reads on the fly and submit raw SQL before and after the loading\. It uses the \fBCOPY\fR PostgreSQL protocol to stream the data into the server, and manages errors by filling a pair of \fIreject\.dat\fR and \fIreject\.log\fR files\.
.
.P
pgloader operates either using commands which are read from files:
.
.IP "" 4
.
.nf
pgloader commands\.load
.
.fi
.
.IP "" 0
.
.P
or by using arguments and options all provided on the command line:
.
.IP "" 4
.
.nf
pgloader SOURCE TARGET
.
.fi
.
.IP "" 0
.
.SH "ARGUMENTS"
The pgloader arguments can be as many load files as needed, or a couple of connection strings to a specific input file\.
.
.SS "SOURCE CONNECTION STRING"
The source connection string format is as follows:
.
.IP "" 4
.
.nf
format:///absolute/path/to/file\.ext
format://\./relative/path/to/file\.ext
.
.fi
.
.IP "" 0
.
.P
Where format might be one of \fBcsv\fR, \fBfixed\fR, \fBcopy\fR, \fBdbf\fR, \fBdb3\fR or \fBixf\fR\.
.
.IP "" 4
.
.nf
db://user:pass@host:port/dbname
.
.fi
.
.IP "" 0
.
.P
Where db might be one of \fBsqlite\fR, \fBmysql\fR or \fBmssql\fR\.
.
.P
When using a file based source format, pgloader also natively supports fetching the file from an HTTP location and decompressing an archive if needed\. In that case it\'s necessary to use the \fB\-\-type\fR option to specify the expected format of the file\. See the examples below\.
.
.P
Also note that some file formats require describing some implementation details such as columns to be read and delimiters and quoting when loading from csv\.
.
.P
For more complex loading scenarios, you will need to write a full\-fledged load command in the syntax described later in this document\.
.
.SS "TARGET CONNECTION STRING"
The target connection string format is described in detail later in this document, see Section Connection String\.
.
.SH "OPTIONS"
.
.SS "INQUIRY OPTIONS"
Use these options when you want to know more about how to use \fBpgloader\fR, as those options will cause \fBpgloader\fR not to load any data\.
.
.TP
\fB\-h\fR, \fB\-\-help\fR
Show command usage summary and exit\.
.
.TP
\fB\-V\fR, \fB\-\-version\fR
Show pgloader version string and exit\.
.
.TP
\fB\-E\fR, \fB\-\-list\-encodings\fR
List known encodings in this version of pgloader\.
.
.TP
\fB\-U\fR, \fB\-\-upgrade\-config\fR
Parse given files in the command line as \fBpgloader\.conf\fR files with the \fBINI\fR syntax that was in use in pgloader versions 2\.x, and output the new command syntax for pgloader on standard output\.
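.
.P
As an illustration of the inquiry options, the following sketch lists the known encodings and converts a legacy configuration file\. The \fBpgloader\.conf\fR and \fBcommands\.load\fR file names are placeholders, and redirecting the standard output into a file is only one possible way to keep the generated commands:
.
.IP "" 4
.
.nf
pgloader \-\-list\-encodings
pgloader \-\-upgrade\-config \./pgloader\.conf > commands\.load
.
.fi
.
.IP "" 0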
.
.SS "GENERAL OPTIONS"
Those options are meant to tweak \fBpgloader\fR behavior when loading data\.
.
.IP "\(bu" 4
\fB\-v\fR, \fB\-\-verbose\fR: Be verbose\.
.
.IP "\(bu" 4
\fB\-q\fR, \fB\-\-quiet\fR: Be quiet\.
.
.IP "\(bu" 4
\fB\-d\fR, \fB\-\-debug\fR: Show debug level information messages\.
.
.IP "\(bu" 4
\fB\-D\fR, \fB\-\-root\-dir\fR: Set the root working directory (default to "/tmp/pgloader")\.
.
.IP "\(bu" 4
\fB\-L\fR, \fB\-\-logfile\fR: Set the pgloader log file (default to "/tmp/pgloader\.log")\.
.
.IP "\(bu" 4
\fB\-\-log\-min\-messages\fR: Minimum level of verbosity needed for log message to make it to the logfile\. One of critical, log, error, warning, notice, info or debug\.
.
.IP "\(bu" 4
\fB\-\-client\-min\-messages\fR: Minimum level of verbosity needed for log message to make it to the console\. One of critical, log, error, warning, notice, info or debug\.
.
.IP "\(bu" 4
\fB\-S\fR, \fB\-\-summary\fR: A filename where to copy the summary output\. When relative, the filename is expanded into \fB*root\-dir*\fR\.
.
.IP
The format of the filename defaults to being \fIhuman readable\fR\. It is possible to have the output in machine friendly formats such as \fICSV\fR, \fICOPY\fR (PostgreSQL\'s own COPY format) or \fIJSON\fR by specifying a filename with the extension resp\. \fB\.csv\fR, \fB\.copy\fR or \fB\.json\fR\.
.
.IP "\(bu" 4
\fB\-l <file>\fR, \fB\-\-load\-lisp\-file <file>\fR: Specify a lisp \fIfile\fR to compile and load into the pgloader image before reading the commands, which allows defining extra transformation functions\. Those functions should be defined in the \fBpgloader\.transforms\fR package\. This option can appear more than once in the command line\.
.
.IP "\(bu" 4
\fB\-\-dry\-run\fR:
.
.IP
Allow testing a \fB\.load\fR file without actually trying to load any data\. It\'s useful to debug it until it\'s ok, in particular to fix connection strings\.
.
.IP "\(bu" 4
\fB\-\-on\-error\-stop\fR
.
.IP
Alter pgloader behavior: rather than trying to be smart about error handling and continuing to load the good data while setting the bad data aside, just stop as soon as PostgreSQL refuses anything sent to it\. Useful to debug data processing, transformation functions and specific type casting\.
.
.IP "\(bu" 4
\fB\-\-self\-upgrade <directory>\fR:
.
.IP
Specify a \fIdirectory\fR where to find pgloader sources so that one of the very first things it does is dynamically loading\-in (and compiling to machine code) another version of itself, usually a newer one like a very recent git checkout\.
.
.IP "" 0
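.
.P
The general options can be combined in a single invocation\. The following is a sketch only: the \fB\./test/csv\-districts\.load\fR command file is borrowed from the usage examples of this document, while the \fB/var/tmp/pgloader\fR directory and the \fBsummary\.json\fR file name are made up for illustration:
.
.IP "" 4
.
.nf
# check the command file and its connection strings, without loading any data
pgloader \-\-dry\-run \./test/csv\-districts\.load
.
# then load for real, keeping a JSON summary below the root directory
pgloader \-\-root\-dir /var/tmp/pgloader \e
         \-\-logfile /var/tmp/pgloader/pgloader\.log \e
         \-\-summary summary\.json \e
         \./test/csv\-districts\.load
.
.fi
.
.IP "" 0
.
.P
Because \fBsummary\.json\fR is a relative file name, it is expanded into the root directory, and its \fB\.json\fR extension selects the JSON output format\.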
.
.SS "COMMAND LINE ONLY OPERATIONS"
Those options are meant to be used when using \fBpgloader\fR from the command line only, rather than using a command file and the rich command clauses and parser\. In simple cases, it can be much easier to use the \fISOURCE\fR and \fITARGET\fR directly on the command line, then tweak the loading with those options:
.
.IP "\(bu" 4
\fB\-\-with "option"\fR:
.
.IP
Allows setting options from the command line\. You can use that option as many times as you want\. The option arguments must follow the \fIWITH\fR clause for the source type of the \fBSOURCE\fR specification, as described later in this document\.
.
.IP "\(bu" 4
\fB\-\-set "guc_name=\'value\'"\fR
.
.IP
Allows setting PostgreSQL configuration from the command line\. Note that the option parsing is the same as when used from the \fISET\fR command clause, in particular you must enclose the guc value with single\-quotes\.
.
.IP "\(bu" 4
\fB\-\-field "\.\.\."\fR
.
.IP
Allows setting a source field definition\. Fields are accumulated in the order given on the command line\. It\'s possible to either use a \fB\-\-field\fR option per field in the source file, or to separate field definitions by a comma, as you would do in the \fIHAVING FIELDS\fR clause\.
.
.IP "\(bu" 4
\fB\-\-cast "\.\.\."\fR
.
.IP
Allows setting a specific casting rule for loading the data\.
.
.IP "\(bu" 4
\fB\-\-type csv|fixed|db3|ixf|sqlite|mysql|mssql\fR
.
.IP
Allows forcing the source type, in case the \fISOURCE\fR parsing isn\'t satisfactory\.
.
.IP "\(bu" 4
\fB\-\-encoding <encoding>\fR
.
.IP
Set the encoding of the source file to load data from\.
.
.IP "\(bu" 4
\fB\-\-before <filename>\fR
.
.IP
Parse given filename for SQL queries and run them against the target database before loading the data from the source\. The queries are parsed by pgloader itself: they need to be terminated by a semi\-colon (;) and the file may include \fB\ei\fR or \fB\eir\fR commands to \fIinclude\fR another file\.
.
.IP "\(bu" 4
\fB\-\-after <filename>\fR
.
.IP
Parse given filename for SQL queries and run them against the target database after having loaded the data from the source\. The queries are parsed in the same way as with the \fB\-\-before\fR option, see above\.
.
.IP "" 0
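.
.P
Several of these switches are typically combined\. The sketch below extends the CSV usage example found in this document; the \fBcreate\-table\.sql\fR and \fBcreate\-indexes\.sql\fR file names, the \fBwork_mem\fR value and the chosen encoding name are assumptions made for illustration (use \fB\-\-list\-encodings\fR to check which encoding names your version knows):
.
.IP "" 4
.
.nf
pgloader \-\-type csv \e
         \-\-encoding iso\-8859\-1 \e
         \-\-field id \-\-field field \e
         \-\-with truncate \e
         \-\-with "fields terminated by \',\'" \e
         \-\-set "work_mem=\'64MB\'" \e
         \-\-before \./create\-table\.sql \e
         \-\-after \./create\-indexes\.sql \e
         \./test/data/matching\-1\.csv \e
         postgres:///pgloader?tablename=matching
.
.fi
.
.IP "" 0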
.
.SS "MORE DEBUG INFORMATION"
To get the maximum amount of debug information, you can use both the \fB\-\-verbose\fR and the \fB\-\-debug\fR switches at the same time, which is equivalent to saying \fB\-\-client\-min\-messages data\fR\. Then the log messages will show the data being processed, in the cases where the code has explicit support for it\.
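.
.P
For instance, going by the equivalence stated above, the two following invocations should be expected to produce the same console verbosity\. The command file is the one used in the usage examples of this document and only serves as a placeholder here:
.
.IP "" 4
.
.nf
pgloader \-\-verbose \-\-debug \./test/csv\-districts\.load
pgloader \-\-client\-min\-messages data \./test/csv\-districts\.load
.
.fi
.
.IP "" 0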
.
.SH "USAGE EXAMPLES"
Review the command line options and pgloader\'s version:
.
.IP "" 4
.
.nf
pgloader \-\-help
pgloader \-\-version
.
.fi
.
.IP "" 0
.
.SS "Loading from a complex command"
Use the command file as the pgloader command argument, pgloader will parse that file and execute the commands found in it:
.
.IP "" 4
.
.nf
pgloader \-\-verbose \./test/csv\-districts\.load
.
.fi
.
.IP "" 0
.
.SS "CSV"
Load data from a CSV file into a pre\-existing table in your database:
.
.IP "" 4
.
.nf
pgloader \-\-type csv \e
\-\-field id \-\-field field \e
\-\-with truncate \e
\-\-with "fields terminated by \',\'" \e
\./test/data/matching\-1\.csv \e
postgres:///pgloader?tablename=matching
.
.fi
.
.IP "" 0
.
.P
In that example the whole loading is driven from the command line, bypassing the need for writing a command in the pgloader command syntax entirely\. As there\'s no command though, the extra information needed must be provided on the command line using the \fB\-\-type\fR, \fB\-\-field\fR and \fB\-\-with\fR switches\.
.
.P
For documentation about the available syntaxes for the \fB\-\-field\fR and \fB\-\-with\fR switches, please refer to the CSV section later in the man page\.
.
.P
Note also that the PostgreSQL URI includes the target \fItablename\fR\.
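.
.P
For comparison, here is a sketch of what a roughly equivalent command file might look like, following the general command grammar described later in this document\. It is an illustration assembled from that grammar rather than a file shipped with pgloader, and the \fBmatching\.load\fR file name used to run it is hypothetical:
.
.IP "" 4
.
.nf
\-\- sketch only: mirrors the command line switches of the example above
LOAD CSV
     FROM \'\./test/data/matching\-1\.csv\'
          HAVING FIELDS (id, field)
     INTO postgresql:///pgloader?tablename=matching
     WITH truncate,
          fields terminated by \',\'
;
.
.fi
.
.IP "" 0
.
.P
Such a file would then be given to pgloader as its only argument, as in \fBpgloader matching\.load\fR\.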
.
.SS "Reading from STDIN"
File based pgloader sources can be loaded from the standard input, as in the following example:
.
.IP "" 4
.
.nf
pgloader \-\-type csv \e
\-\-field "usps,geoid,aland,awater,aland_sqmi,awater_sqmi,intptlat,intptlong" \e
\-\-with "skip header = 1" \e
\-\-with "fields terminated by \'\et\'" \e
\- \e
postgresql:///pgloader?districts_longlat \e
< test/data/2013_Gaz_113CDs_national\.txt
.
.fi
.
.IP "" 0
.
.P
The dash (\fB\-\fR) character as a source is used to mean \fIstandard input\fR, as usual in Unix command lines\. It\'s possible to stream compressed content to pgloader with this technique, using the Unix pipe:
.
.IP "" 4
.
.nf
gunzip \-c source\.gz | pgloader \-\-type csv \.\.\. \- pgsql:///target?foo
.
.fi
.
.IP "" 0
.
.SS "Loading from CSV available through HTTP"
The same command as just above can also be run if the CSV file happens to be found on a remote HTTP location:
.
.IP "" 4
.
.nf
pgloader \-\-type csv \e
\-\-field "usps,geoid,aland,awater,aland_sqmi,awater_sqmi,intptlat,intptlong" \e
\-\-with "skip header = 1" \e
\-\-with "fields terminated by \'\et\'" \e
http://pgsql\.tapoueh\.org/temp/2013_Gaz_113CDs_national\.txt \e
postgresql:///pgloader?districts_longlat
.
.fi
.
.IP "" 0
.
.P
Some more options have to be used in that case, as the file contains a one\-line header (most commonly that\'s column names, could be a copyright notice)\. Also, in that case, we specify all the fields right into a single \fB\-\-field\fR option argument\.
.
.P
Again, the PostgreSQL target connection string must contain the \fItablename\fR option and you have to ensure that the target table exists and may fit the data\. Here\'s the SQL command used in that example in case you want to try it yourself:
.
.IP "" 4
.
.nf
create table districts_longlat
(
usps text,
geoid text,
aland bigint,
awater bigint,
aland_sqmi double precision,
awater_sqmi double precision,
intptlat double precision,
intptlong double precision
);
.
.fi
.
.IP "" 0
.
.P
Also notice that the same command will work against an archived version of the same data, e\.g\. http://pgsql\.tapoueh\.org/temp/2013_Gaz_113CDs_national\.txt\.gz\.
.
.P
Finally, it\'s important to note that pgloader first fetches the content from the HTTP URL to a local file, then expands the archive when it\'s recognized to be one, and only then processes the locally expanded file\.
.
.P
In some cases, either because pgloader has no direct support for your archive format or maybe because expanding the archive is not feasible in your environment, you might want to \fIstream\fR the content straight from its remote location into PostgreSQL\. Here\'s how to do that, using the old battle tested Unix Pipes trick:
.
.IP "" 4
.
.nf
curl http://pgsql\.tapoueh\.org/temp/2013_Gaz_113CDs_national\.txt\.gz \e
| gunzip \-c \e
| pgloader \-\-type csv \e
\-\-field "usps,geoid,aland,awater,aland_sqmi,awater_sqmi,intptlat,intptlong" \e
\-\-with "skip header = 1" \e
\-\-with "fields terminated by \'\et\'" \e
\- \e
postgresql:///pgloader?districts_longlat
.
.fi
.
.IP "" 0
.
.P
Now the OS will take care of the streaming and buffering between the network and the commands and pgloader will take care of streaming the data down to PostgreSQL\.
.
.SS "Migrating from SQLite"
The following command will open the SQLite database, discover its tables definitions including indexes and foreign keys, migrate those definitions while \fIcasting\fR the data type specifications to their PostgreSQL equivalent and then migrate the data over:
.
.IP "" 4
.
.nf
createdb newdb
pgloader \./test/sqlite/sqlite\.db postgresql:///newdb
.
.fi
.
.IP "" 0
.
.SS "Migrating from MySQL"
Just create a database where to host the MySQL data and definitions and have pgloader do the migration for you in a single command line:
.
.IP "" 4
.
.nf
createdb pagila
pgloader mysql://user@localhost/sakila postgresql:///pagila
.
.fi
.
.IP "" 0
.
.SS "Fetching an archived DBF file from a HTTP remote location"
It\'s possible for pgloader to download a file from HTTP, unarchive it, and only then open it to discover the schema then load the data:
.
.IP "" 4
.
.nf
createdb foo
pgloader \-\-type dbf http://www\.insee\.fr/fr/methodes/nomenclatures/cog/telechargement/2013/dbf/historiq2013\.zip postgresql:///foo
.
.fi
.
.IP "" 0
.
.P
Here it\'s not possible for pgloader to guess the kind of data source it\'s being given, so it\'s necessary to use the \fB\-\-type\fR command line switch\.
.
.SH "BATCHES AND RETRY BEHAVIOUR"
To load data to PostgreSQL, pgloader uses the \fBCOPY\fR streaming protocol\. While this is the fastest way to load data, \fBCOPY\fR has an important drawback: as soon as PostgreSQL emits an error with any bit of data sent to it, whatever the problem is, the whole data set is rejected by PostgreSQL\.
.
.P
To work around that, pgloader cuts the data into \fIbatches\fR of 25000 rows each, so that when a problem occurs it\'s only impacting that many rows of data\. Each batch is kept in memory while the \fBCOPY\fR streaming happens, in order to be able to handle errors should some happen\.
.
.P
When PostgreSQL rejects the whole batch, pgloader logs the error message then isolates the bad row(s) from the accepted ones by retrying the batched rows in smaller batches\. To do that, pgloader parses the \fICONTEXT\fR error message from the failed COPY, as the message contains the line number where the error was found in the batch, as in the following example:
.
.IP "" 4
.
.nf
CONTEXT: COPY errors, line 3, column b: "2006\-13\-11"
.
.fi
.
.IP "" 0
.
.P
Using that information, pgloader will reload all rows in the batch before the erroneous one, log the erroneous one as rejected, then try loading the remainder of the batch in a single attempt, which may or may not contain other erroneous data\.
.
.P
At the end of a load containing rejected rows, you will find two files in the \fIroot\-dir\fR location, under a directory named the same as the target database of your setup\. The filenames are the target table, and their extensions are \fB\.dat\fR for the rejected data and \fB\.log\fR for the file containing the full PostgreSQL client side logs about the rejected data\.
.
.P
The \fB\.dat\fR file is formatted in the PostgreSQL text COPY format as documented in http://www\.postgresql\.org/docs/9\.2/static/sql\-copy\.html#AEN66609\.
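.
.P
To make the naming scheme concrete, assume the default \fBroot\-dir\fR of "/tmp/pgloader", a target database named \fBpgloader\fR and a target table named \fBdistricts_longlat\fR, as in the usage examples of this document\. Following the rules given above, a load with rejected rows would then be expected to leave these two files behind (a sketch derived from those rules, not captured output):
.
.IP "" 4
.
.nf
/tmp/pgloader/pgloader/districts_longlat\.dat
/tmp/pgloader/pgloader/districts_longlat\.log
.
.fi
.
.IP "" 0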
.
.SH "A NOTE ABOUT PERFORMANCE"
pgloader has been developed with performance in mind, to be able to cope with ever growing needs in loading large amounts of data into PostgreSQL\.
.
.P
The basic architecture it uses is the old Unix pipe model, where a thread is responsible for loading the data (reading a CSV file, querying MySQL, etc) and fills pre\-processed data into a queue\. Another thread feeds from the queue, applies some more \fItransformations\fR to the input data and streams the end result to PostgreSQL using the COPY protocol\.
.
.P
When given a file that the PostgreSQL \fBCOPY\fR command knows how to parse, and if the file contains no erroneous data, then pgloader will never be as fast as just using the PostgreSQL \fBCOPY\fR command\.
.
.P
Note that while the \fBCOPY\fR command is restricted to read either from its standard input or from a local file on the server\'s file system, the command line tool \fBpsql\fR implements a \fB\ecopy\fR command that knows how to stream a file local to the client over the network and into the PostgreSQL server, using the same protocol as pgloader uses\.
.
.SH "A NOTE ABOUT PARALLELISM"
pgloader uses several concurrent tasks to process the data being loaded:
.
.IP "\(bu" 4
a reader task reads the data in,
.
.IP "\(bu" 4
at least one transformer task is responsible for applying the needed transformations to given data so that it fits PostgreSQL expectations, those transformations include CSV like user\-defined \fIprojections\fR, database \fIcasting\fR (default and user given), and PostgreSQL specific \fIformatting\fR of the data for the COPY protocol and in unicode,
.
.IP "\(bu" 4
at least one writer task is responsible for sending the data down to PostgreSQL using the COPY protocol\.
.
.IP "" 0
.
.P
The idea behind having the transformer task do the \fIformatting\fR is so that in the event of bad rows being rejected by PostgreSQL the retry process doesn\'t have to do that step again\.
.
.P
At the moment, the numbers of transformer and writer tasks are forced to be the same, which allows for a very simple \fIqueueing\fR model to be implemented: the reader task fills in one queue per transformer task, which then pops from that queue and pushes to a writer queue per COPY task\.
.
.P
The parameter \fIworkers\fR controls how many worker threads are allowed to be active at any time (that\'s the parallelism level); and the parameter \fIconcurrency\fR controls how many tasks are started to handle the data (they may not all run at the same time, depending on the \fIworkers\fR setting)\.
.
.P
We allow up to \fIworkers\fR workers to be active at the same time in the context of a single table\. A single unit of work consists of several kinds of workers:
.
.IP "\(bu" 4
a reader getting raw data from the source,
.
.IP "\(bu" 4
N transformers preparing raw data for PostgreSQL COPY protocol,
.
.IP "\(bu" 4
N writers sending the data down to PostgreSQL\.
.
.IP "" 0
.
.P
The N here is set to the \fIconcurrency\fR parameter: with a \fIconcurrency\fR of 2, we start (+ 1 2 2) = 5 concurrent tasks, with a \fIconcurrency\fR of 4 we start (+ 1 4 4) = 9 concurrent tasks, of which only \fIworkers\fR may be active simultaneously\.
.
.P
So with \fBworkers = 4, concurrency = 2\fR, the parallel scheduler will maintain active only 4 of the 5 tasks that are started\.
.
.P
With \fBworkers = 8, concurrency = 1\fR, we then are able to work on several units of work at the same time\. In the database sources, a unit of work is a table, so those settings allow pgloader to be active on as many as 3 tables at any time in the load process\.
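.
.P
The arithmetic behind that last figure can be spelled out as follows, assuming that each table needs 1 reader, \fIconcurrency\fR transformers and \fIconcurrency\fR writers as described above\. The table names A, B and C are placeholders, and this is a worked illustration of the stated rules rather than a description of how the scheduler is actually implemented:
.
.IP "" 4
.
.nf
tasks per table = 1 + concurrency + concurrency
                = 1 + 1 + 1 = 3          (with concurrency = 1)
.
workers = 8:  table A runs 3 tasks       (3 workers busy)
              table B runs 3 tasks       (6 workers busy)
              table C runs 2 tasks       (8 workers busy, 1 task waiting)
.
so up to 3 tables are being worked on at the same time
.
.fi
.
.IP "" 0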
.
.P
The defaults are \fBworkers = 4, concurrency = 1\fR when loading from a database source, and \fBworkers = 8, concurrency = 2\fR when loading from something else (currently, a file)\. Those defaults are arbitrary and waiting for feedback from users, so please consider providing feedback if you play with the settings\.
.
.P
As the \fBCREATE INDEX\fR threads started by pgloader are only waiting until PostgreSQL is done with the real work, those threads are \fINOT\fR counted into the concurrency levels as detailed here\.
.
.P
By default, pgloader starts as many \fBCREATE INDEX\fR threads as the maximum number of indexes per table found in your source schema\. It is possible to set the \fBmax parallel create index\fR \fIWITH\fR option to another number in case there are just too many of them to create\.
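.
.P
As a sketch of how these settings might be given from the command line, the following invocation reuses the MySQL migration example of this document together with the \fB\-\-with\fR switch\. The values shown are arbitrary and only meant to illustrate the syntax, not to serve as tuning advice:
.
.IP "" 4
.
.nf
pgloader \-\-with "workers = 8" \e
         \-\-with "concurrency = 1" \e
         \-\-with "max parallel create index = 4" \e
         mysql://user@localhost/sakila \e
         postgresql:///pagila
.
.fi
.
.IP "" 0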
.
.SH "SOURCE FORMATS"
pgloader supports the following input formats:
.
.IP "\(bu" 4
csv, which includes also tsv and other common variants where you can change the \fIseparator\fR and the \fIquoting\fR rules and how to \fIescape\fR the \fIquotes\fR themselves;
.
.IP "\(bu" 4
fixed columns file, where pgloader is flexible enough to accommodate source files missing columns (\fIragged fixed length column files\fR do exist);
.
.IP "\(bu" 4
PostgreSQL COPY formatted files, following the COPY TEXT documentation of PostgreSQL, such as the reject files prepared by pgloader;
.
.IP "\(bu" 4
dbase files known as db3 or dbf file;
.
.IP "\(bu" 4
ixf formatted files, ixf being a binary storage format from IBM;
.
.IP "\(bu" 4
sqlite databases with fully automated discovery of the schema and advanced cast rules;
.
.IP "\(bu" 4
mysql databases with fully automated discovery of the schema and advanced cast rules;
.
.IP "\(bu" 4
MS SQL databases with fully automated discovery of the schema and advanced cast rules\.
.
.IP "" 0
.
.SH "PGLOADER COMMANDS SYNTAX"
pgloader implements a Domain Specific Language that allows setting up complex data loading scripts handling computed columns and on\-the\-fly sanitization of the input data\. For more complex data loading scenarios, you will be required to learn that DSL\'s syntax\. It\'s meant to look familiar to DBAs by being inspired by SQL where it makes sense, which is not that much after all\.
.
.P
The pgloader commands follow the same global grammar rules\. Each of them might support only a subset of the general options and provide specific options\.
.
.IP "" 4
.
.nf
LOAD <source\-type>
FROM <source\-url> [ HAVING FIELDS <source\-level\-options> ]
INTO <postgresql\-url> [ TARGET COLUMNS <columns\-and\-options> ]
[ WITH <load\-options> ]
[ SET <postgresql\-settings> ]
[ BEFORE LOAD [ DO <sql statements> | EXECUTE <sql file> ] \.\.\. ]
[ AFTER LOAD [ DO <sql statements> | EXECUTE <sql file> ] \.\.\. ]
;
.
.fi
.
.IP "" 0
.
.P
The main clauses are the \fBLOAD\fR, \fBFROM\fR, \fBINTO\fR and \fBWITH\fR clauses that each command implements\. Some commands then implement the \fBSET\fR clause, or some specific clauses such as the \fBCAST\fR clause\.
.
.SH "COMMON CLAUSES"
Some clauses are common to all commands:
.
.IP "\(bu" 4
\fIFROM\fR
.
.IP
The \fIFROM\fR clause specifies where to read the data from, and each command introduces its own variant of sources\. For instance, the \fICSV\fR source supports \fBinline\fR, \fBstdin\fR, a filename, a quoted filename, and a \fIFILENAME MATCHING\fR clause (see above); whereas the \fIMySQL\fR source only supports a MySQL database URI specification\.
.
.IP
In all cases, the \fIFROM\fR clause is able to read its value from an environment variable when using the form \fBGETENV \'varname\'\fR\.
.
.IP "\(bu" 4
\fIINTO\fR
.
.IP
The PostgreSQL connection URI must contain the name of the target table where to load the data into\. That table must have already been created in PostgreSQL, and the name might be schema qualified\.
.
.IP
The \fIINTO\fR target database connection URI can be parsed from the value of an environment variable when using the form \fBGETENV \'varname\'\fR\.
.
.IP
The \fIINTO\fR clause also supports an optional comma separated list of target columns, which are either the name of an input \fIfield\fR or the white space separated list of the target column name, its PostgreSQL data type and a \fIUSING\fR expression\.
.
.IP
The \fIUSING\fR expression can be any valid Common Lisp form and will be read with the current package set to \fBpgloader\.transforms\fR, so that you can use functions defined in that package, such as functions loaded dynamically with the \fB\-\-load\-lisp\-file\fR command line parameter\.
.
.IP
Each \fIUSING\fR expression is compiled at runtime to native code\.
.
.IP
This feature allows pgloader to load any number of fields in a CSV file into a possibly different number of columns in the database, using custom code for that projection\.
.
.IP "\(bu" 4
\fIWITH\fR
.
.IP
Set of options to apply to the command, using a global syntax of either:
.
.IP "\(bu" 4
\fIkey = value\fR
.
.IP "\(bu" 4
\fIuse option\fR
.
.IP "\(bu" 4
\fIdo not use option\fR
.
.IP "" 0
.
.IP
See each specific command for details\.
.
.IP
All data sources specific commands support the following options:
.
.IP "\(bu" 4
\fIbatch rows = R\fR
.
.IP "\(bu" 4
\fIbatch size = \.\.\. MB\fR
.
.IP "\(bu" 4
\fIbatch concurrency = \.\.\.\fR
.
.IP "" 0
.
.IP
See the section BATCH BEHAVIOUR OPTIONS for more details\.
.
.IP
In addition, the following settings are available:
.
.IP "\(bu" 4
\fIworkers = W\fR
.
.IP "\(bu" 4
\fIconcurrency = C\fR
.
.IP "\(bu" 4
\fImax parallel create index = I\fR
.
.IP "" 0
.
.IP
See section A NOTE ABOUT PARALLELISM for more details\.
.
.IP "\(bu" 4
\fISET\fR
.
.IP
This clause allows specifying session parameters to be set for all the sessions opened by pgloader\. It expects a comma separated list of entries, each made of a parameter name, the equal sign, then the single\-quoted value\.
.
.IP
The names and values of the parameters are not validated by pgloader, they are given as\-is to PostgreSQL\.
.
.IP "\(bu" 4
\fIBEFORE LOAD DO\fR
.
.IP
You can run SQL queries against the database before loading the data from the \fBCSV\fR file\. Most common SQL queries are \fBCREATE TABLE IF NOT EXISTS\fR so that the data can be loaded\.
.
.IP
Each command must be \fIdollar\-quoted\fR: it must begin and end with a double dollar sign, \fB$$\fR\. Dollar\-quoted queries are then comma separated\. No extra punctuation is expected after the last SQL query\.
.
.IP "\(bu" 4
\fIBEFORE LOAD EXECUTE\fR
.
.IP
Same behaviour as in the \fIBEFORE LOAD DO\fR clause\. Allows you to read the SQL queries from a SQL file\. Implements support for PostgreSQL dollar\-quoting and the \fB\ei\fR and \fB\eir\fR include facilities as in \fBpsql\fR batch mode (where they are the same thing)\.
.
.IP "\(bu" 4
\fIAFTER LOAD DO\fR
.
.IP
Same format as \fIBEFORE LOAD DO\fR, the dollar\-quoted queries found in that section are executed once the load is done\. That\'s the right time to create indexes and constraints, or re\-enable triggers\.
.
.IP "\(bu" 4
\fIAFTER LOAD EXECUTE\fR
.
.IP
Same behaviour as in the \fIAFTER LOAD DO\fR clause\. Allows you to read the SQL queries from a SQL file\. Implements support for PostgreSQL dollar\-quoting and the \fB\ei\fR and \fB\eir\fR include facilities as in \fBpsql\fR batch mode (where they are the same thing)\.
.
.IP "" 0
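.
.P
The common clauses described above can be put together as in the following sketch\. It reuses the source file and target table of the CSV usage example of this document; the table definition, the index name, the batch and parallelism values and the PostgreSQL settings are all assumptions made for illustration:
.
.IP "" 4
.
.nf
\-\- sketch only: column types, index name and settings are hypothetical
LOAD CSV
     FROM \'\./test/data/matching\-1\.csv\'
          HAVING FIELDS (id, field)
     INTO postgresql:///pgloader?tablename=matching
     WITH truncate,
          fields terminated by \',\',
          batch rows = 10000,
          workers = 4,
          concurrency = 1
      SET work_mem = \'64MB\',
          maintenance_work_mem = \'128MB\'
   BEFORE LOAD DO
          $$ create table if not exists matching (id integer, field text); $$
    AFTER LOAD DO
          $$ create index matching_field_idx on matching (field); $$
;
.
.fi
.
.IP "" 0
.
.P
Each SQL query is dollar\-quoted, the \fISET\fR entries are comma separated with single\-quoted values, and no extra punctuation follows the last dollar\-quoted query before the final semi\-colon\.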
.
.SS "Connection String"
The \fB<postgresql\-url>\fR parameter is expected to be given as a \fIConnection URI\fR as documented in the PostgreSQL documentation at http://www\.postgresql\.org/docs/9\.3/static/libpq\-connect\.html#LIBPQ\-CONNSTRING\.
.
.IP "" 4
.
.nf
postgresql://[user[:password]@][netloc][:port][/dbname][?option=value&\.\.\.]
.
.fi
.
.IP "" 0
.
.P
Where:
.
.IP "\(bu" 4
\fIuser\fR
.
.IP
Can contain any character, including colon (\fB:\fR) which must then be doubled (\fB::\fR) and at\-sign (\fB@\fR) which must then be doubled (\fB@@\fR)\.
.
.IP
When omitted, the \fIuser\fR name defaults to the value of the \fBPGUSER\fR environment variable, and if it is unset, the value of the \fBUSER\fR environment variable\.
.
.IP "\(bu" 4
\fIpassword\fR
.
.IP
Can contain any character, including the at sign (\fB@\fR) which must then be doubled (\fB@@\fR)\. To leave the password empty, when the \fIuser\fR name ends with an at sign, you then have to use the syntax user:@\.
.
.IP
When omitted, the \fIpassword\fR defaults to the value of the \fBPGPASSWORD\fR environment variable if it is set, otherwise the password is left unset\.
.
.IP "\(bu" 4
\fInetloc\fR
.
.IP
Can be either a hostname in dotted notation, an IPv4 address, or a Unix domain socket path\. Empty is the default network location, under a system providing \fIunix domain socket\fR that method is preferred, otherwise the \fInetloc\fR defaults to \fBlocalhost\fR\.
.
.IP
It\'s possible to force the \fIunix domain socket\fR path by using the syntax \fBunix:/path/to/where/the/socket/file/is\fR\. To force both a non\-default socket path and a non\-default port, you would have:
.
.IP "" 4
.
.nf
postgresql://unix:/tmp:54321/dbname
.
.fi
.
.IP "" 0
.
.IP
The \fInetloc\fR defaults to the value of the \fBPGHOST\fR environment variable, and if it is unset, to either the default \fBunix\fR socket path when running on a Unix system, or \fBlocalhost\fR otherwise\.
.
.IP "\(bu" 4
\fIdbname\fR
.
.IP
Should be a proper identifier (a letter followed by a mix of letters, digits and the punctuation signs comma (\fB,\fR), dash (\fB\-\fR) and underscore (\fB_\fR))\.
.
.IP
When omitted, the \fIdbname\fR defaults to the value of the environment variable \fBPGDATABASE\fR, and if that is unset, to the \fIuser\fR value as determined above\.
.
.IP "\(bu" 4
\fIoptions\fR
.
.IP
The optional parameters must be supplied in the form \fBname=value\fR, and you may use several parameters by separating them with an ampersand (\fB&\fR) character\.
.
.IP
Only some options are supported here: \fItablename\fR (which might be qualified with a schema name), \fIsslmode\fR, \fIhost\fR, \fIport\fR, \fIdbname\fR, \fIuser\fR and \fIpassword\fR\.
.
.IP
The \fIsslmode\fR parameter value can be one of \fBdisable\fR, \fBallow\fR, \fBprefer\fR or \fBrequire\fR\.
.
.IP
For backward compatibility reasons, it\'s possible to specify the \fItablename\fR option directly, without spelling out the \fBtablename=\fR part\.
.
.IP
The options override the main URI components when both are given\. Using percent\-encoded option parameters allows using passwords starting with a colon and bypassing other URI component parsing limitations\.
.
.IP "" 0
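.
.P
To illustrate the rules above, here are a few connection strings\. The user, password, host, database and table names are made up for the example:
.
.IP "" 4
.
.nf
postgresql:///dbname
postgresql://dim@localhost:5432/dbname?tablename=geolite\.blocks
postgresql://dim@@example\.com:secret@localhost/dbname?sslmode=require&tablename=geolite\.blocks
.
.fi
.
.IP "" 0
.
.P
The first form leaves the \fIuser\fR, \fIpassword\fR and \fInetloc\fR to their defaults\. The second one names a target table through the \fItablename\fR option\. Following the doubling rule, the third one should be read as the user \fBdim@example\.com\fR with the password \fBsecret\fR, and combines two options with an ampersand\.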
.
.SS "Regular Expressions"
Several clauses described below accept \fIregular expressions\fR with the following input rules:
.
.IP "\(bu" 4
A regular expression begins with a tilde sign (\fB~\fR),
.
.IP "\(bu" 4
is then followed with an opening sign,
.
.IP "\(bu" 4
then any character is allowed and considered part of the regular expression, except for the closing sign,
.
.IP "\(bu" 4
then a closing sign is expected\.
.
.IP "" 0
.
.P
The opening and closing signs must be used as a matching pair\. Here\'s the complete list of allowed delimiters:
.
.IP "" 4
.
.nf
~//
~[]
~{}
~()
~<>
~""
~\'\'
~||
~##
.
.fi
.
.IP "" 0
.
.P
Pick the set of delimiters that doesn\'t collide with the \fIregular expression\fR you\'re trying to input\. If none of the delimiter pairs allows you to enter your expression, the places where such expressions are allowed should allow for a list of expressions\.
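.
.P
For instance, the following hypothetical expressions each use a delimiter pair that does not appear in the expression itself\. The second one contains a slash, so \fB~//\fR would not work for it, and the third one contains square brackets, so \fB~[]\fR would not work either:
.
.IP "" 4
.
.nf
~/GeoLiteCity\-Blocks/
~<data/GeoLiteCity\-\.*>
~{^blocks_[0\-9]+}
.
.fi
.
.IP "" 0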
.
.SS "Comments"
Any command may contain comments, following these input rules:
.
.IP "\(bu" 4
the \fB\-\-\fR delimiter begins a comment that ends with the end of the current line,
.
.IP "\(bu" 4
the delimiters \fB/*\fR and \fB*/\fR respectively start and end a comment, which can be found in the middle of a command or span several lines\.
.
.IP "" 0
.
.P
Any place where you could enter a \fIwhitespace\fR will accept a comment too\.
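.
.P
A short sketch showing both comment styles in a command\. The file, database and table names are placeholders:
.
.IP "" 4
.
.nf
\-\- load the blocks file
LOAD CSV
     FROM \'blocks\.csv\'        /* placeholder file name */
     INTO postgresql:///dbname?blocks
     WITH truncate,            \-\- a comment may follow an option
          skip header = 1;
.
.fi
.
.IP "" 0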
.
.SS "Batch behaviour options"
All pgloader commands have support for a \fIWITH\fR clause that allows for specifying options\. Some options are generic and accepted by all commands, such as the \fIbatch behaviour options\fR, and some options are specific to a data source kind, such as the CSV \fIskip header\fR option\.
.
.P
The global batch behaviour options are:
.
.IP "\(bu" 4
\fIbatch rows\fR
.
.IP
Takes a numeric value as argument, used as the maximum number of rows allowed in a batch\. The default is \fB25 000\fR and can be changed to try for better performance characteristics or to control pgloader memory usage\.
.
.IP "\(bu" 4
\fIbatch size\fR
.
.IP
Takes a memory unit as argument, such as \fI20 MB\fR, its default value\. Accepted multipliers are \fIkB\fR, \fIMB\fR, \fIGB\fR, \fITB\fR and \fIPB\fR\. The case is important so as not to confuse bits with bytes: only bytes are meant here\.
.
.IP "\(bu" 4
\fIbatch concurrency\fR
.
.IP
Takes a numeric value as argument, defaults to \fB10\fR\. That\'s the number of batches that pgloader is allowed to build in memory in each reader thread\. See the \fIworkers\fR setting for how many reader threads are allowed to run at the same time: each of them is allowed as many as \fIbatch concurrency\fR batches\.
.
.IP "" 0
.
.P
Other options are specific to each input source\. Please refer to the relevant parts of the documentation for their listing and description\.
.
.P
A batch is then closed as soon as either the \fIbatch rows\fR or the \fIbatch size\fR threshold is crossed, whichever comes first\. In cases when a batch has to be closed because of the \fIbatch size\fR setting, a \fIdebug\fR level log message is printed with how many rows did fit in the \fIoversized\fR batch\.
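.
.P
As an illustration, a \fIWITH\fR clause tuning all three settings might look like the following\. The values are arbitrary, and the \fBname = value\fR spelling is an assumption made by analogy with \fBskip header = 2\fR in the \fBLOAD CSV\fR example:
.
.IP "" 4
.
.nf
WITH batch rows = 10000,
     batch size = 10 MB,
     batch concurrency = 2
.
.fi
.
.IP "" 0
.
.P
With these values, a batch would be closed after 10000 rows or 10 MB of data, whichever comes first, and each reader thread would be allowed to build up to 2 batches in memory\.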
.
.SH "LOAD CSV"
This command instructs pgloader to load data from a \fBCSV\fR file\. Here\'s an example:
.
.IP "" 4
.
.nf
LOAD CSV
FROM \'GeoLiteCity\-Blocks\.csv\' WITH ENCODING iso\-646\-us
HAVING FIELDS
(
startIpNum, endIpNum, locId
)
INTO postgresql://user@localhost:54393/dbname?geolite\.blocks
TARGET COLUMNS
(
iprange ip4r using (ip\-range startIpNum endIpNum),
locId
)
WITH truncate,
skip header = 2,
fields optionally enclosed by \'"\',
fields escaped by backslash\-quote,
fields terminated by \'\et\'
SET work_mem to \'32 MB\', maintenance_work_mem to \'64 MB\';
.
.fi
.
.IP "" 0
.
.P
The \fBcsv\fR format command accepts the following clauses and options:
.
.IP "\(bu" 4
\fIFROM\fR
.
.IP
Filename where to load the data from\. Accepts an \fIENCODING\fR option\. Use the \fB\-\-list\-encodings\fR option to know which encoding names are supported\.
.
.IP
The filename may be enclosed by single quotes, and could be one of the following special values:
.
.IP "\(bu" 4
\fIinline\fR
.
.IP
The data is found after the end of the parsed commands\. Any number of empty lines between the end of the commands and the beginning of the data is accepted\.
.
.IP "\(bu" 4
\fIstdin\fR
.
.IP
Reads the data from the standard input stream\.
.
.IP "\(bu" 4
\fIFILENAMES MATCHING\fR
.
.IP
The whole \fImatching\fR clause must follow the following rule:
.
.IP "" 4
.
.nf
[ ALL FILENAMES | [ FIRST ] FILENAME ]
MATCHING regexp
[ IN DIRECTORY \'\.\.\.\' ]
.
.fi
.
.IP "" 0
.
.IP
The \fImatching\fR clause applies the given \fIregular expression\fR (see above for exact syntax, several options can be used here) to filenames\. It\'s then possible to load data from only the first match or from all of them\.
.
.IP
The optional \fIIN DIRECTORY\fR clause allows specifying which directory to walk for finding the data files, and can be either relative to where the command file is read from, or absolute\. The given directory must exist\.
.
.IP "" 0
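.
.IP
As a sketch of the \fImatching\fR clause, the two fragments below select source files by name\. The directory name and the expressions are placeholders for the illustration:
.
.IP "" 4
.
.nf
FROM ALL FILENAMES MATCHING ~/GeoLiteCity\-\.*csv/
     IN DIRECTORY \'data\'

FROM FIRST FILENAME MATCHING ~/GeoLiteCity\-Blocks/
.
.fi
.
.IP "" 0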
.
.IP
The \fIFROM\fR option also supports an optional comma separated list of \fIfield\fR names describing what is expected in the \fBCSV\fR data file, optionally introduced by the clause \fBHAVING FIELDS\fR\.
.
.IP
Each field name can be either only one name or a name followed by specific reader options for that field, enclosed in square brackets and comma\-separated\. Supported per\-field reader options are:
.
.IP "\(bu" 4
\fIterminated by\fR
.
.IP
See the description of \fIfields terminated by\fR below\.
.
.IP
The processing of this option is not currently implemented\.
.
.IP "\(bu" 4
\fIdate format\fR
.
.IP
When the field is expected to be of the date type, this option allows you to specify the date format used in the file\.
.
.IP
Date format strings are template strings modeled on the PostgreSQL \fBto_char\fR template strings support, limited to the following patterns:
.
.IP "\(bu" 4
YYYY, YYY, YY for the year part
.
.IP "\(bu" 4
MM for the numeric month part
.
.IP "\(bu" 4
DD for the numeric day part
.
.IP "\(bu" 4
HH, HH12, HH24 for the hour part
.
.IP "\(bu" 4
am, AM, a\.m\., A\.M\.
.
.IP "\(bu" 4
pm, PM, p\.m\., P\.M\.
.
.IP "\(bu" 4
MI for the minutes part