# Changelog
## 1.7.12 (2023-01-09)
* [BUGFIX] Fixed some potential problems reported by `rchk`.
* [NOTE] [BACKWARD INCOMPATIBLE CHANGE IF ICU >= 72]
If building against ICU >= 72,
note a backward incompatible change: `@` is no longer a word break;
see <https://github.com/unicode-org/cldr/pull/2256> for more details.
## 1.7.8 (2022-07-11)
* [DOCUMENTATION] Paper on *stringi* has been published in
the *Journal of Statistical Software*;
see <https://doi.org/10.18637/jss.v103.i02>.
* [BUGFIX] #473, #397: Fixed buffer overflow in `stri_dup`.
`stri_dup`, `stri_paste`, ... now fail more gracefully on attempts to
generate strings of length >= 2^31 each.
* [BUILD TIME] #480: Using `Rf_isNull` instead of `isNull`.
* [DOCUMENTATION] #462: The manual now mentions that the `numeric=TRUE`
collator does not handle negative numbers correctly.
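As a rough illustration of the `numeric=TRUE` collator option mentioned in this release, the sketch below sorts strings that contain non-negative numbers. The sample vector is made up, and the `en_US` locale is an assumption chosen only to keep the result reproducible.

```r
library(stringi)

x <- c("a10", "a2", "a1")

# Plain collation compares digit by digit, so "a10" sorts before "a2".
plain <- stri_sort(x, locale = "en_US")

# With numeric=TRUE, runs of digits are compared by their numeric value.
# Per the entry above, this should not be relied on for negative numbers.
natural <- stri_sort(x, locale = "en_US", numeric = TRUE)

print(plain)    # "a1"  "a10" "a2"
print(natural)  # "a1"  "a2"  "a10"
```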
## 1.7.6 (2021-11-29)
* [BUILD TIME] #463: Added loongarch support in ICU's double conversion
(@liuxiang88).
* [BUGFIX] #467: The UCRT build on Windows was not marking strings as `latin1`.
## 1.7.5 (2021-10-04)
* [DOCUMENTATION] Paper on *stringi* has been accepted for
publication in the *Journal of Statistical Software*,
see <https://stringi.gagolewski.com/_static/vignette/stringi.pdf>
for a draft version.
* [DOCUMENTATION] The *stringi* website at <https://stringi.gagolewski.com>
now features a comprehensive tutorial based on the aforementioned paper.
* [DOCUMENTATION] The *ICU* Project site has been moved to
<https://icu.unicode.org/>.
* [BUILD TIME] #457: The `autoconf` macros `AC_LANG_CPLUSPLUS`
and `AC_TRY_COMPILE` were obsolete.
* [BUGFIX] #458: Passing ALTREP objects no longer yields
'embedded nul in string' errors.
## 1.7.4 (2021-08-12)
* [BUGFIX] #449: Fixed segfaults generated by `stri_sprintf`.
* [BUILD TIME] No longer defining `USE_RINTERNALS` and `R_NO_REMAP`.
## 1.7.3 (2021-07-15)
* [BUGFIX] Fixed the previous patch of ICU55 causing a build failure on,
amongst others, CRAN's Solaris-based target.
## 1.7.2 (2021-07-14)
* [BUGFIX] Workaround for a bug in `tools::checkFF` failing
when `NA_character_` is passed to `.Call`.
## 1.7.1 (2021-07-14)
* [BACKWARD INCOMPATIBILITY] `%s$%` and `%stri$%` now use the new `stri_sprintf`
(see below) function instead of `base::sprintf`.
* [BACKWARD INCOMPATIBILITY, NEW FEATURE] In `stri_sub<-` and `stri_sub_all<-`,
providing a negative `length` from now on does not result in the corresponding
input string being altered.
* [BACKWARD INCOMPATIBILITY, NEW FEATURE] In `stri_sub` and `stri_sub_all`,
negative `length` results in the corresponding output being `NA`
or not extracted at all, depending on the setting of the new argument
`ignore_negative_length`.
* [BACKWARD INCOMPATIBILITY, BUGFIX, NEW FEATURE] In `stri_subset*`
and their replacement versions, `pattern` and `value` cannot be longer
than `str` (but now they are recycled if necessary).
* [BACKWARD INCOMPATIBILITY, NEW FEATURE] `stri_sub*` now accept the
`from` argument being a matrix like `cbind(from, length=length)`.
Unnamed columns or any other names are still interpreted as `cbind(from, to)`.
Also, the new argument `use_matrix` can be used to disable
the special treatment of such matrices.
* [DOCUMENTATION] It has been clarified that the syntax of `*_charclass`
(e.g., used in `stri_trim*`) differs slightly from regex character
classes.
* [NEW FEATURE] #420: `stri_sprintf` (alias: `stri_string_format`)
is a Unicode-aware replacement for and enhancement of the base `sprintf`:
it adds a customised handling of `NA`s (on demand), computing field size
based on code point width, outputting substrings of at most given width,
variable width and precision (both at the same time), etc. Moreover,
`stri_printf` can be used to display formatted strings conveniently.
* [NEW FEATURE] #153: `stri_match_*_regex` now extract capture group names.
* [NEW FEATURE] #25: `stri_locate_*_regex` now have a new argument,
`capture_groups`, which allows for extracting positions of matches
to parenthesised subexpressions.
* [NEW FEATURE] `stri_locate_*` now have a new argument, `get_length`,
whose setting may result in generating *from-length* matrices
(instead of *from-to* ones).
* [NEW FEATURE] #438: `stri_trans_general` now supports rule-based
as well as reverse-direction transliteration.
* [NEW FEATURE] #434: `stri_datetime_format` and `stri_datetime_parse`
are now vectorised also with respect to the `format` argument.
* [NEW FEATURE] `stri_datetime_fstr` has a new argument, `ignore_special`,
which defaults to `TRUE` for backward compatibility.
* [NEW FEATURE] `stri_datetime_format`, `stri_datetime_add`, and
`stri_datetime_fields` now call `as.POSIXct` more eagerly.
* [NEW FEATURE] `stri_trim*` now have a new argument, `negate`.
* [NEW FEATURE] `stri_replace_rstr` converts `gsub`-style replacement strings
to `stri_replace`-style.
* [INTERNAL] `stri_prepare_arg*` have been refactored, buffer overruns
in the exception handling subsystem are now avoided.
* [BUGFIX] A few functions (`stri_length`, `stri_enc_toutf32`, etc.)
did not throw an exception on an invalid UTF-8
byte sequence (and merely issued a warning instead).
* [BUGFIX] `stri_datetime_fstr` did not honour `NA_character_`
and did not parse format strings such as `"%Y%m%d"` correctly.
It has now been completely rewritten (in C).
* [BUGFIX] `stri_wrap` did not recognise the width of certain Unicode sequences
correctly.
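A few of the 1.7.1 additions can be tried together. The following is a minimal sketch with made-up inputs; it assumes stringi >= 1.7.1 is installed and is not meant as a complete tour of the new arguments.

```r
library(stringi)

# A matrix whose second column is named "length" is read as from-length;
# unnamed columns are still read as from-to.
by_length <- stri_sub("abcdef", cbind(from = 2, length = 3))
by_to     <- stri_sub("abcdef", cbind(2, 4))

# get_length=TRUE makes stri_locate_* return a from-length matrix,
# which can be passed straight back to stri_sub().
loc    <- stri_locate_first_regex("abc123", "[0-9]+", get_length = TRUE)
digits <- stri_sub("abc123", loc)

# Named capture groups become column names of the result matrix.
m <- stri_match_first_regex("2021-07", "(?<year>[0-9]{4})-(?<month>[0-9]{2})")

# stri_sprintf() is vectorised over its arguments.
msg <- stri_sprintf("%s has %d items", c("cart", "bag"), c(3L, 12L))

print(by_length)    # "bcd"
print(by_to)        # "bcd"
print(digits)       # "123"
print(m[, "year"])  # "2021"
print(msg)          # "cart has 3 items" "bag has 12 items"
```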
## 1.6.2 (2021-05-14)
* [BACKWARD INCOMPATIBILITY] In `stri_enc_list()`,
`simplify` now defaults to `TRUE`.
* [NEW FEATURE] #425: The outputs of `stri_enc_list()`, `stri_locale_list()`,
`stri_timezone_list()`, and `stri_trans_list()` are now sorted.
* [NEW FEATURE] #428: In `stri_flatten`, `na_empty=NA` now omits missing values.
* [BUILD TIME] #431: Pre-4.9.0 GCC has `::max_align_t`,
but not `std::max_align_t`, added a (possible) workaround, see the `INSTALL`
file.
* [BUGFIX] #429: `stri_width()` misclassified the width of certain
code points (including grave accent, Eszett, etc.);
General category *Sk* (Symbol, modifier) is no longer of width 0,
`UCHAR_EAST_ASIAN_WIDTH` of `U_EA_AMBIGUOUS` is no longer of width 2.
* [BUGFIX] #354: `ALTREP` `CHARSXP`s were not copied, and thus could have been
garbage collected in the so-called meanwhile (with thanks to @jimhester).
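The three settings of `na_empty` in `stri_flatten` are easy to confuse, so here is a small sketch comparing them on a made-up vector.

```r
library(stringi)

x <- c("a", NA, "b")

# Default (na_empty=FALSE): a missing value makes the whole result missing.
default_result <- stri_flatten(x, collapse = ",")

# na_empty=TRUE: missing values are treated as empty strings.
as_empty <- stri_flatten(x, collapse = ",", na_empty = TRUE)

# na_empty=NA: missing values are omitted altogether.
omitted <- stri_flatten(x, collapse = ",", na_empty = NA)

print(default_result)  # NA
print(as_empty)        # "a,,b"
print(omitted)         # "a,b"
```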
## 1.6.1 (2021-05-05)
* [GENERAL] #401: stringi is now bundled with ICU4C 69.1 (upgraded from 61.1),
which is used on most Windows and OS X builds as well as on *nix systems
not equipped with system ICU. However, if the C++11 support is disabled,
stringi will be built against the battle-tested ICU4C 55.1.
The update to ICU brings Unicode 13.0 and CLDR 39 support.
* [DOCUMENTATION] A draft version of a paper on `stringi` is now available at
<https://stringi.gagolewski.com/_static/vignette/stringi.pdf>.
* [GENERAL] stringi now requires R >= 3.1 (`CXX_STD` of `CXX11` or `CXX1X`).
* [NEW FEATURE] #408: `stri_trans_casefold()` performs case folding;
this is different from case mapping, which is locale-dependent.
Folding makes two pieces of text that differ only in case identical.
This can come in handy when comparing strings.
* [NEW FEATURE] #421: `stri_rank()` ranks strings in a character vector
(e.g., for ordering data frames with regards to multiple criteria,
the ranks can be passed to `order()`, see #219).
* [NEW FEATURE] #266: `stri_width()` now supports emojis.
* [NEW FEATURE] `%s$%` and `%stri$%` are now vectorised with respect to
both arguments.
* [BUGFIX] `stri_sort_key()` now outputs `bytes`-encoded strings.
* [BUGFIX] #415: `locale=''` was not equivalent to `locale=NULL`
in `stri_opts_collator()`.
* [INTERNAL] #414: Use `LEVELS(x)` macro instead of accessing `(x)->sxpinfo.gp`
directly (@lukaszdaniel).
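To make the `stri_trans_casefold()` and `stri_rank()` entries more concrete, the sketch below compares two strings case-insensitively and then orders records by a string key and a numeric tie-breaker. The data and the `en_US` locale are illustrative assumptions.

```r
library(stringi)

# Case folding makes strings that differ only in case identical.
same <- stri_trans_casefold("Hello World") == stri_trans_casefold("HELLO world")

# stri_rank() yields ranks that can be combined with other keys in order().
name <- c("banana", "apple", "cherry", "apple")
qty  <- c(1, 5, 2, 3)
idx  <- order(stri_rank(name, locale = "en_US"), qty)

print(same)       # TRUE
print(name[idx])  # "apple" "apple" "banana" "cherry"
print(qty[idx])   # 3 5 1 2
```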
## 1.5.3 (2020-09-04)
* [DOCUMENTATION] stringi home page has moved to
<https://stringi.gagolewski.com> and now includes a comprehensive reference
manual.
* [NEW FEATURE] #400: `%s$%` and `%stri$%` are now binary operators
that call base R's `sprintf()`.
* [NEW FEATURE] #399: The `%s*%` and `%stri*%` operators can be used
in addition to `stri_dup()`, for the very same purpose.
* [NEW FEATURE] #355: `stri_opts_regex()` now accepts the `time_limit` and
`stack_limit` options so as to prevent malformed or malicious regexes
from running for too long.
* [NEW FEATURE] #345: `stri_startswith()` and `stri_endswith()` are now equipped
with the `negate` parameter.
* [NEW FEATURE] #382: Incorrect regexes are now reported to ease debugging.
* [DEPRECATION WARNING] #347: Any unknown option passed to `stri_opts_fixed()`,
`stri_opts_regex()`, `stri_opts_coll()`, and `stri_opts_brkiter()` now
generates a warning. In the future, the `...` parameter will be removed,
so that will be an error.
* [DEPRECATION WARNING] `stri_duplicated()`'s `fromLast` argument
has been renamed `from_last`. `fromLast` is now its alias scheduled
for removal in a future version of the package.
* [DEPRECATION WARNING] `stri_enc_detect2()`
is scheduled for removal in a future version of the package.
Use `stri_enc_detect()` or the more targeted `stri_enc_isutf8()`,
`stri_enc_isascii()`, etc., instead.
* [DEPRECATION WARNING] `stri_read_lines()`, `stri_write_lines()`,
`stri_read_raw()`: use `con` argument instead of `fname` now.
The argument `fallback_encoding` is scheduled for removal and is no longer
used. `stri_read_lines()` does not support `encoding="auto"` anymore.
* [DEPRECATION WARNING] `nparagraphs` in `stri_rand_lipsum()` has been renamed
`n_paragraphs`.
* [NEW FEATURE] #398: Alternative, British spelling of function parameters
has been introduced, e.g., `stri_opts_coll()` now supports both
`normalization` and `normalisation`.
* [NEW FEATURE] #393: `stri_read_bin()`, `stri_read_lines()`, and
`stri_write_lines()` are no longer marked as draft API.
* [NEW FEATURE] #187: `stri_read_bin()`, `stri_read_lines()`, and
`stri_write_lines()` now support connection objects as well.
* [NEW FEATURE] #386: New function `stri_sort_key()` for generating
locale-dependent sort keys which can be ordered at the byte level and
return an equivalent ordering to the original string (@DavisVaughan).
* [BUGFIX] #138: `stri_encode()` and `stri_rand_strings()`
now can generate strings of much larger lengths.
* [BUGFIX] `stri_wrap()` did not honour `indent` correctly when
`use_width` was `TRUE`.
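A short sketch of three of the 1.5.3 additions: the `%s*%` operator, the `negate` parameter of `stri_startswith_*()`, and the renamed `from_last` argument. The inputs are made up for illustration.

```r
library(stringi)

# %s*% duplicates a string, just like stri_dup("ab", 3).
repeated <- "ab" %s*% 3

# negate=TRUE flips the result of the prefix test.
not_a <- stri_startswith_fixed(c("apple", "banana"), "a", negate = TRUE)

# from_last=TRUE marks duplicates while scanning from the end.
dups <- stri_duplicated(c("a", "b", "a"), from_last = TRUE)

print(repeated)  # "ababab"
print(not_a)     # FALSE TRUE
print(dups)      # TRUE FALSE FALSE
```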
## 1.4.6 (2020-02-17)
* [BACKWARD INCOMPATIBILITY] #369: `stri_c()` now returns an empty string
when input is empty and `collapse` is set.
* [BUGFIX] #370: fixed an issue in `stri_prepare_arg_POSIXct()`
reported by rchk.
* [DOCUMENTATION] #372: documented arguments not in `\usage` in
documentation object `stri_datetime_format`: `...`
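The changed behaviour of `stri_c()` on empty input can be checked with a minimal sketch like the one below (assuming stringi >= 1.4.6).

```r
library(stringi)

# With `collapse` set, empty input now gives a single empty string.
collapsed <- stri_c(character(0), collapse = "")

# Without `collapse`, empty input still gives an empty vector.
uncollapsed <- stri_c(character(0))

print(collapsed)            # ""
print(length(uncollapsed))  # 0
```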
## 1.4.5 (2020-01-11)
* [BUGFIX] #366: The fix for #363 required ICU >= 55.
## 1.4.4 (2020-01-06)
* [BUGFIX] #348: Avoid copying 0 bytes to a nil-buffer in `stri_sub_all()`.
* [BUGFIX] #362: Removed `configure` variable `CXXCPP` as it is now deprecated.
* [BUGFIX] #318: Objects are now PROTECTed from garbage collection, as reported by `rchk`.
* [BUGFIX] #344, #364: Removed compiler warnings in icu61/common/cstring.h.
* [BUGFIX] #363: Status of `RegexMatcher` is now checked after its use.
## 1.4.3 (2019-03-12)
* [NEW FEATURE] #30: New function `stri_sub_all()` - a version of
`stri_sub()` accepting list `from`/`to`/`length` arguments for extracting
multiple substrings from each string in a character vector.
* [NEW FEATURE] #30: New function `stri_sub_all<-()` (and its `%<%`-friendly
version, `stri_sub_replace_all()`) - for replacing multiple substrings
with corresponding replacement strings.
* [NEW FEATURE] In `stri_sub_replace()`, `value` parameter
has a new alias, `replacement`.
* [NEW FEATURE] New convenience functions based on `stri_remove_empty()`:
`stri_omit_empty_na()`, `stri_remove_empty_na()`, `stri_omit_empty()`,
and also `stri_remove_na()`, `stri_omit_na()`.
* [BUGFIX] #343: `stri_trans_char()` did not yield correct results
for overlapping pattern and replacement strings.
* [WARNFIX] #205: `configure.ac` is now included in the source bundle.
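As an illustration of `stri_sub_all()` and the new convenience filters, here is a small sketch on made-up data; the positions are arbitrary.

```r
library(stringi)

# Two substrings from one string: characters 1-2 and 4-6.
parts <- stri_sub_all("abcdef", from = list(c(1, 4)), to = list(c(2, 6)))

# Convenience filters built on stri_remove_empty().
x <- c("a", "", NA, "b")
no_empty    <- stri_omit_empty(x)     # drops "" only
no_na       <- stri_omit_na(x)        # drops NA only
no_empty_na <- stri_omit_empty_na(x)  # drops both

print(parts[[1]])   # "ab"  "def"
print(no_empty)     # "a" NA  "b"
print(no_na)        # "a" ""  "b"
print(no_empty_na)  # "a" "b"
```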
## 1.3.1 (2019-02-10)
* [BACKWARD INCOMPATIBILITY] #335: A fix to #314 prevented (by design) the use
of the system ICU if the library had been compiled with `U_CHARSET_IS_UTF8=1`.
However, this is the default setting in `libicu`>=61. From now on, in such
cases the system ICU is used more eagerly, but `stri_enc_set()` issues
a warning stating that the default (UTF-8) encoding cannot be changed.
* [NEW FEATURE] #232: All `stri_detect_*` functions now have the `max_count`
argument that allows for, e.g., stopping at the first pattern occurrence.
* [NEW FEATURE] #338: `stri_sub_replace()` is now an alias for `stri_sub<-()`
which makes it much more easily pipable (@yutannihilation, @BastienFR).
* [NEW FEATURE] #334: Added missing `icudt61b.dat` to support big-endian
platforms (thanks to Dimitri John Ledkov @xnox).
* [BUGFIX] #296: Out-of-the box build used to fail on CentOS 6, upgraded
`configure` to `--disable-cxx11` more eagerly at an early stage.
* [BUGFIX] #341: Fixed possible buffer overflows when calling `strncpy()`
from within ICU 61.
* [BUGFIX] #325: Made `configure` more portable so that it works
under `/bin/dash` now.
* [BUGFIX] #319: Fixed overflow in `stri_rand_shuffle()`.
* [BUGFIX] #337: Search functions (e.g., `stri_split_regex()` and
`stri_count_fixed()`) used to raise too many warnings
when given empty search patterns.
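The functional form `stri_sub_replace()` and the `max_count` argument could be used roughly as follows. This is a sketch on made-up input; no output is asserted for the `max_count` call, since how the remaining elements are reported is best checked in `?stri_detect`.

```r
library(stringi)

# Functional counterpart of `stri_sub<-`: replace characters 2 to 3.
# Being an ordinary function, it can be used in a pipeline.
piped <- stri_sub_replace("abcdef", 2, 3, value = "XX")

# Stop looking once one matching element has been found.
first_hit <- stri_detect_fixed(c("x", "a", "b", "a"), "a", max_count = 1)

print(piped)  # "aXXdef"
print(first_hit)
```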
## 1.2.4 (2018-07-20)
* [BUGFIX] #314: Testing `U_CHARSET_IS_UTF8` in `configure` when
using `pkg-build`.
* [BUILD TIME] #317: Included `icudt61l.zip` in the source bundle to solve
the frequent `icudt download failed` error (also on CRAN's `windows-release`
and `windows-oldrel`). This was reverted in version 1.3.1, as the `winbuilder`
errors were caused by a build chain bug.
## 1.2.3 (2018-05-16)
* [BUGFIX] #296: Fixed the behaviour of the `configure` script on CentOS 6.
* [BUGFIX] Fixed broken Windows build by updating the `icudt` mirror list.
## 1.2.2 (2018-05-01)
* [GENERAL] #193: stringi is now bundled with ICU4C 61.1,
which is used on most Windows and OS X builds as well as on *nix systems
not equipped with ICU. However, if the C++11 support is disabled,
stringi will be built against ICU4C 55.1. The update to ICU brings
Unicode 10.0 support, including new emoji characters.
* [BUGFIX] #288: `stri_match()` did not return the correct number of columns
when input was empty.
* [NEW FEATURE] #188: `stri_enc_detect()` now returns a list of data frames.
* [NEW FEATURE] #289: `stri_flatten()` now has `na_empty` and `omit_empty`
arguments.
* [NEW FEATURE] New functions: `stri_remove_empty()`, `stri_na2empty()`.
* [NEW FEATURE] #285: Coercion from a non-trivial list (one that consists
of atomic vectors, each of length 1) to an atomic vector now issues a warning.
* [WARN] Removed `-Wparentheses` warnings in `icu55/common/cstring.h:38:63`
and `icu55/i18n/windtfmt.cpp` in the ICU4C 55.1 bundle.
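The helpers for empty and missing strings introduced in 1.2.2 behave as in the following sketch (made-up input).

```r
library(stringi)

x <- c("a", "", NA, "b")

kept    <- stri_remove_empty(x)  # drops "" but keeps NA by default
blanked <- stri_na2empty(x)      # NA becomes ""

# omit_empty=TRUE skips empty strings when flattening.
flat <- stri_flatten(c("a", "", "b"), collapse = ",", omit_empty = TRUE)

print(kept)     # "a" NA  "b"
print(blanked)  # "a" ""  ""  "b"
print(flat)     # "a,b"
```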
## 1.1.7 (2018-03-06)
* [BUGFIX] Fixed ICU4C 55.1 generating some *significant warnings*
(`icu55/i18n/winnmfmt.cpp`) and *suppressing important diagnostics*
(`src/icu55/i18n/decNumber.c`).
## 1.1.6 (2017-11-10)
* [WINDOWS SPECIFIC] #270: Strings marked with `latin1` encoding
are now converted internally to UTF-8 using the WINDOWS-1252 codec.
This fixes problems with - among others - displaying the Euro sign.
* [NEW FEATURE] #263: Added support for custom rule-based break iteration,
see `?stri_opts_brkiter`.
* [NEW FEATURE] #267: `omit_na=TRUE` in `stri_sub<-()` now ignores missing
values in any of the arguments provided.
* [BUGFIX] Fixed unPROTECTed variable names and stack imbalances
as reported by `rchk`.
## 1.1.5 (2017-04-07)
* [GENERAL] stringi now requires ICU4C >= 52.
* [BUGFIX] Fixed errors pointed out by `clang-UBSAN` in `stri_brkiter.h`.
* [GENERAL] stringi now requires R >= 2.14.
* [BUILD TIME] #238, #220: Now trying *standard* ICU4C build flags if a call
to `pkg-config` fails.
* [BUILD TIME] #258: Use `CXX11` instead of `CXX1X` on R >= 3.4.
* [BUILD TIME, BUGFIX] #254: `dir.exists()` is only available in R >= 3.2.
## 1.1.3 (2017-03-21)
* [REMOVE DEPRECATED] `stri_install_check()` and `stri_install_icudt()`
marked as deprecated in stringi 0.5-5 are no longer being exported.
* [BUGFIX] #227: Incorrect behaviour of `stri_sub()` and `stri_sub<-()`
if the empty string was the result.
* [BUILD TIME] #231: The `configure` (Linux/Unix only) script now reads the
following environment variables: `STRINGI_CFLAGS`, `STRINGI_CPPFLAGS`,
`STRINGI_CXXFLAGS`, `STRINGI_LDFLAGS`, `STRINGI_LIBS`,
`STRINGI_DISABLE_CXX11`, `STRINGI_DISABLE_ICU_BUNDLE`,
`STRINGI_DISABLE_PKG_CONFIG`, `PKG_CONFIG`,
see `INSTALL` for more information.
* [BUILD TIME] #253: Call to `R_useDynamicSymbols()` added.
* [BUILD TIME] #230: `icudt` is now being downloaded by
`configure` (*NIX only) *before* building.
* [BUILD TIME] #242: `_COUNT/_LIMIT` enum constants have been deprecated
as of ICU 58.2, stringi code has been upgraded accordingly.
## 1.1.2 (2016-09-30)
* [BUGFIX] `round()` and `snprintf()` are not part of C++98.
## 1.1.1 (2016-05-25)
* [BUGFIX] #214: Allow a regex pattern like `.*` to match an empty string.
* [BUGFIX] #210: `stri_replace_all_fixed(c("1", "NULL"), "NULL", NA)`
now results in `c("1", NA)`.
* [NEW FEATURE] #199: `stri_sub<-()` now allows for ignoring `NA` locations
(a new `omit_na` argument added).
* [NEW FEATURE] #207: `stri_sub<-()` now allows for substring insertions
(via `length=0`).
* [NEW FUNCTION] #124: `stri_subset<-()` functions added.
* [NEW FEATURE] #216: `stri_detect()`, `stri_subset()`, `stri_subset<-()`
now all have the `negate` argument.
* [NEW FUNCTION] #175: `stri_join_list()` concatenates all strings
in a list of character vectors. Useful in conjunction with, e.g.,
`stri_extract_all_regex()`, `stri_extract_all_words()`, etc.
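Several of the 1.1.1 entries can be exercised with a short sketch. The strings are made up; the `stri_replace_all_fixed()` call is the one quoted in the entry above.

```r
library(stringi)

# length=0 inserts text before position `from` without removing anything.
x <- "abcdef"
stri_sub(x, 3, length = 0) <- "--"

# Concatenate the strings within each list element.
joined <- stri_join_list(list(c("a", "b"), c("c", "d", "e")), sep = "-")

# negate=TRUE keeps the elements that do NOT match.
filtered <- stri_subset_fixed(c("apple", "banana", "cherry"), "an", negate = TRUE)

replaced <- stri_replace_all_fixed(c("1", "NULL"), "NULL", NA)

print(x)         # "ab--cdef"
print(joined)    # "a-b"   "c-d-e"
print(filtered)  # "apple"  "cherry"
print(replaced)  # "1" NA
```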
## 1.0-1 (2015-10-22)
* [GENERAL] #88: C API is now available for use in, e.g., Rcpp packages, see
<https://github.com/gagolews/ExampleRcppStringi> for an example.
* [BUGFIX] #183: Floating point exception raised in `stri_sub()` and
`stri_sub<-()` when `to` or `length` was a zero-length numeric vector.
* [BUGFIX] #180: `stri_c()` warned incorrectly (recycling rule) when using more
than two elements.
## 0.5-5 (2015-06-28)
* [BACKWARD INCOMPATIBILITY] `stri_install_check()` and `stri_install_icudt()`
are now deprecated. From now on they are supposed to be used only
by the stringi installer.
* [BUGFIX] #176: A patch for `sys/feature_tests.h` is no longer included
(the original file was copyrighted by Sun Microsystems); fixed the *Compiler
or options invalid for pre-Unix 03 X/Open applications and pre-2001 POSIX
applications* error by forcing (conditionally) `_XPG6` conformance.
* [BUGFIX] #174: `stri_paste()` did not generate any warning when
the recycling rule was violated and `sep==""`.
* [BUGFIX] #170: `icu::setDataDirectory` is no longer called if our ICU
source bundle is not used (this used to cause build problems on openSUSE).
* [BUILD TIME] #169: `configure` now tries to switch to the *standard*
C++ compiler if a C++11 one is not configured correctly.
* [BUILD TIME] `configure.win` (`Biarch: TRUE`) now mimics `autoconf`'s
`AC_SUBST` and `AC_CONFIG_FILES` so that the build process is now
more similar across different platforms.
* [NEW FEATURE] `stri_info()` now also gives information about which version
of ICU4C is in use (system or bundle).
## 0.5-2 (2015-06-21)
* [BACKWARD INCOMPATIBILITY] The second argument to `stri_pad_*()` has
been renamed `width`.
* [GENERAL] #69: stringi is now bundled with ICU4C 55.1.
* [NEW FUNCTIONS] `stri_extract_*_boundaries()` extract text between text
boundaries.
* [NEW FUNCTION] #46: `stri_trans_char()` is a stringi-flavoured
`chartr()` equivalent.
* [NEW FUNCTION] #8: `stri_width()` approximates the *width* of a string
in a more Unicode-ish fashion than `nchar(..., "width")`.
* [NEW FEATURE] #149: `stri_pad()` and `stri_wrap()` are now (by default)
based on code point widths instead of the number of code points.
Moreover, the default behaviour of `stri_wrap()` is now such that it
does not get rid of non-breaking, zero width, etc., spaces.
* [NEW FEATURE] #133: `stri_wrap()` silently allows for `width <= 0`
(for compatibility with `strwrap()`).
* [NEW FEATURE] #139: `stri_wrap()` gained a new argument: `whitespace_only`.
* [NEW FUNCTIONS] #137: Date-time formatting/parsing:
* `stri_timezone_list()` - lists all known time zone identifiers;
* `stri_timezone_set()`, `stri_timezone_get()` - manage the current
default time zone;
* `stri_timezone_info()` - basic information on a given time zone;
* `stri_datetime_symbols()` - gives localizable date-time formatting data;
* `stri_datetime_fstr()` - converts a `strptime`-like format string
to an ICU date/time format string;
* `stri_datetime_format()` - converts date/time to string;
* `stri_datetime_parse()` - converts string to date/time object;
* `stri_datetime_create()` - constructs date-time objects
from numeric representations;
* `stri_datetime_now()` - returns current date-time;
* `stri_datetime_fields()` - returns date-time fields' values;
* `stri_datetime_add()` - adds specific number of date-time units
to a date-time object.
* [GENERAL] #144: Performance improvements in handling ASCII strings
(these affect `stri_sub()`, `stri_locate()` and other string index-based
operations).
* [GENERAL] #143: Searching for short fixed patterns (`stri_*_fixed()`) now
relies on the current `libC`'s implementation of `strchr()` and `strstr()`.
This is very fast, e.g., on `glibc` using the `SSE2/3/4` instruction set.
* [BUILD TIME] #141: A local copy of `icudt*.zip` may be used on package
install; see the `INSTALL` file for more information.
* [BUILD TIME] #165: The `configure` option `--disable-icu-bundle`
forces the use of system ICU when building the package.
* [BUGFIX] Locale specifiers are now normalized in a more intelligent way:
e.g., `@calendar=gregorian` expands to `DEFAULT_LOCALE@calendar=gregorian`.
* [BUGFIX] #134: `stri_extract_all_words()` did not accept `simplify=NA`.
* [BUGFIX] #132: Incorrect behaviour in `stri_locate_regex()` for matches
of zero lengths.
* [BUGFIX] stringr/#73: `stri_wrap()` returned `CHARSXP` instead of `STRSXP`
on empty string input with `simplify=FALSE` argument.
* [BUGFIX] #164: Using `libicu-dev` failed on Ubuntu
(`LIBS` shall be passed after `LDFLAGS` and the list of `.o` files).
* [BUGFIX] #168: Build now fails if `icudt` is not available.
* [BUGFIX] #135: C++11 is now used by default (see the `INSTALL` file,
however) to build stringi from sources. This is because ICU4C uses the
`long long` type which is not part of the C++98 standard.
* [BUGFIX] #154: Dates and other objects with a custom class attribute
were not coerced to the character type correctly.
* [BUGFIX] Force ICU `u_init()` call on the stringi dynlib load.
* [BUGFIX] #157: Many overfull `hbox`es in the package PDF manual have been
corrected.
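The new 0.5-2 functions could be combined along the following lines. This is a minimal sketch; the date is arbitrary, and UTC is passed explicitly (an assumption made here for reproducibility) so that the result does not depend on the machine's default time zone.

```r
library(stringi)

# chartr()-like character translation: e -> i, l -> p.
swapped <- stri_trans_char("hello", "el", "ip")

# Padding is based on `width`, the renamed second argument.
padded <- stri_pad_left("7", width = 3, pad = "0")

w <- stri_width("abc")

# Build a date-time object and format it with an ICU pattern.
t0    <- stri_datetime_create(2015, 6, 21, tz = "UTC")
stamp <- stri_datetime_format(t0, "yyyy-MM-dd", tz = "UTC")

print(swapped)  # "hippo"
print(padded)   # "007"
print(w)        # 3
print(stamp)    # "2015-06-21"
```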
## 0.4-1 (2014-12-11)
* [IMPORTANT CHANGE] `n_max` argument in `stri_split_*()` has been renamed `n`.
* [IMPORTANT CHANGE] `simplify=FALSE` in `stri_extract_all_*()` and
`stri_split_*()` now calls `stri_list2matrix()` with `fill=""`.
`fill=NA_character_` may be obtained by using `simplify=NA`.
* [IMPORTANT CHANGE, NEW FUNCTIONS] #120: `stri_extract_words()` has been
renamed `stri_extract_all_words()` and `stri_locate_boundaries()` -
`stri_locate_all_boundaries()` as well as `stri_locate_words()` -
`stri_locate_all_words()`. New functions are now available:
`stri_locate_first_boundaries()`, `stri_locate_last_boundaries()`,
`stri_locate_first_words()`, `stri_locate_last_words()`,
`stri_extract_first_words()`, `stri_extract_last_words()`.
* [IMPORTANT CHANGE] #111: `opts_regex`, `opts_collator`, `opts_fixed`, and
`opts_brkiter` can now be supplied individually via `...`.
In other words, you may now simply call, e.g.,
`stri_detect_regex(str, pattern, case_insensitive=TRUE)` instead of
`stri_detect_regex(str, pattern,
opts_regex=stri_opts_regex(case_insensitive=TRUE))`.
* [NEW FEATURE] #110: Fixed pattern search engine's settings can
now be supplied via `opts_fixed` argument in `stri_*_fixed()`,
see `stri_opts_fixed()`. A simple (not suitable for natural language
processing) yet very fast `case_insensitive` pattern matching can be
performed now. `stri_extract_*_fixed()` is again available.
* [NEW FEATURE] #23: `stri_extract_all_fixed()`, `stri_count()`, and
`stri_locate_all_fixed()` may now also look for overlapping pattern
matches, see `?stri_opts_fixed`.
* [NEW FEATURE] #129: `stri_match_*_regex()` gained a `cg_missing` argument.
* [NEW FEATURE] #117: `stri_extract_all_*()`, `stri_locate_all_*()`,
`stri_match_all_*()` gained a new argument: `omit_no_match`.
Setting it to `TRUE` makes these functions compatible with their
`stringr` equivalents.
* [NEW FEATURE] #118: `stri_wrap()` gained `indent`, `exdent`, `initial`,
and `prefix` arguments. Moreover, Knuth's dynamic word wrapping algorithm
now assumes that the cost of printing the last line is zero, see #128.
* [NEW FEATURE] #122: `stri_subset()` gained an `omit_na` argument.
* [NEW FEATURE] `stri_list2matrix()` gained an `n_min` argument.
* [NEW FEATURE] #126: `stri_split()` is now also able to act
just like `stringr::str_split_fixed()`.
* [NEW FEATURE] #119: `stri_split_boundaries()` now has
`n`, `tokens_only`, and `simplify` arguments. Additionally,
`stri_extract_all_words()` is now equipped with `simplify` arg.
* [NEW FEATURE] #116: `stri_paste()` gained a new argument:
`ignore_null`. Setting it to `TRUE` makes this function more compatible
with `paste()`.
* [OTHER] #123: `useDynLib` is used to speed up symbol look-up in
the compiled dynamic library.
* [BUGFIX] #114: `stri_paste()`: could return result in an incorrect order.
* [BUGFIX] #94: Run-time errors on Solaris caused by setting
`-DU_DISABLE_RENAMING=1` - memory allocation errors in, among others,
the ICU `UnicodeString`. This setting also caused some `ASAN` sanity check
failures within ICU code.
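The option-passing shortcut and a few of the other 0.4-1 arguments can be seen in the sketch below, which uses made-up inputs.

```r
library(stringi)

# Engine options can be passed directly via `...` or, as before,
# through an explicit opts_regex list; both calls are equivalent.
ci      <- stri_detect_regex("ABC", "abc", case_insensitive = TRUE)
ci_long <- stri_detect_regex("ABC", "abc",
                             opts_regex = stri_opts_regex(case_insensitive = TRUE))

# Overlapping matches are counted only on request.
n_plain <- stri_count_fixed("aaaa", "aa")
n_over  <- stri_count_fixed("aaaa", "aa", overlap = TRUE)

# `n` (formerly `n_max`) limits the number of pieces.
pieces <- stri_split_fixed("a,b,c", ",", n = 2)

# omit_no_match=TRUE gives an empty vector where nothing matched.
found <- stri_extract_all_regex(c("a1b2", "xyz"), "[0-9]", omit_no_match = TRUE)

print(c(ci, ci_long))      # TRUE TRUE
print(c(n_plain, n_over))  # 2 3
print(pieces[[1]])         # "a"   "b,c"
print(found[[1]])          # "1" "2"
print(length(found[[2]]))  # 0
```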
## 0.3-1 (2014-11-06)
* [IMPORTANT CHANGE] #87: `%>%` overlapped with the pipe operator from
the `magrittr` package; now each operator like `%>%` has been renamed `%s>%`.
* [IMPORTANT CHANGE] #108: Now the `BreakIterator` (for text boundary analysis)
may be more easily controlled via `stri_opts_brkiter()` (see options `type`
and `locale` which aim to replace now-removed `boundary` and `locale`
parameters to `stri_locate_boundaries()`, `stri_split_boundaries()`,
`stri_trans_totitle()`, `stri_extract_words()`, and `stri_locate_words()`).
* [NEW FUNCTIONS] #109: `stri_count_boundaries()` and `stri_count_words()`
count the number of text boundaries in a string.
* [NEW FUNCTIONS] #41: `stri_startswith_*()` and `stri_endswith_*()`
determine whether a string starts or ends with a given pattern.
* [NEW FEATURE] #102: `stri_replace_all_*()` now all have the `vectorize_all`
parameter, which defaults to `TRUE` for backward compatibility.
* [NEW FUNCTION] #91: Added `stri_subset_*()` - a convenient and more efficient
substitute for `str[stri_detect_*(str, ...)]`.
* [NEW FEATURE] #100: `stri_split_fixed()`, `stri_split_charclass()`,
`stri_split_regex()`, `stri_split_coll()` gained a `tokens_only` parameter,
which defaults to `FALSE` for backward compatibility.
* [NEW FUNCTION] #105: `stri_list2matrix()` converts lists of atomic vectors
to character matrices, useful in conjunction with `stri_split()`
and `stri_extract()`.
* [NEW FEATURE] #107: `stri_split_*()` now allow
setting an `omit_empty=NA` argument.
* [NEW FEATURE] #106: `stri_split()` and `stri_extract_all()`
gained a `simplify` argument
(if `TRUE`, then `stri_list2matrix(..., byrow=TRUE)`
is called on the resulting list).
* [NEW FUNCTION] #77: `stri_rand_lipsum()` generates
a (pseudo)random dummy *lorem ipsum* text.
* [NEW FEATURE] #98: `stri_trans_totitle()` gained an `opts_brkiter`
parameter; it indicates which ICU `BreakIterator` should be used when
case mapping.
* [NEW FEATURE] `stri_wrap()` gained a new parameter: `normalize`.
* [BUGFIX] #86: `stri_*_fixed()`, `stri_*_coll()`, and `stri_*_regex()` could
give incorrect results if one of the search strings was of length 0.
* [BUGFIX] #99: `stri_replace_all()` did not use the `replacement` arg.
* [BUGFIX] #112: Some of the objects were not PROTECTed from
garbage collection - this could have led to spontaneous SEGFAULTS.
* [BUGFIX] Some collator options were not passed correctly to ICU services.
* [BUGFIX] Memory leaks as detected by
`valgrind --tool=memcheck --leak-check=full` have been removed.
* [DOCUMENTATION] Significant extensions/clean ups in the stringi manual.
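A few of the additions listed above can be tried out as follows. This is only a minimal sketch: it assumes a reasonably recent stringi is installed, and the input vectors are made up for illustration.

```r
library(stringi)

x <- c("apple pie", "banana", "cherry tart")  # illustrative data only

# stri_subset_*() is a shorthand for str[stri_detect_*(str, ...)]
sub1 <- stri_subset_fixed(x, "an")
sub2 <- x[stri_detect_fixed(x, "an")]

# Prefix and suffix tests
starts <- stri_startswith_fixed(x, "ban")
ends   <- stri_endswith_fixed(x, "tart")

# Number of words in each string
n_words <- stri_count_words(x)

# vectorize_all=FALSE applies every pattern/replacement pair to each string in turn
repl <- stri_replace_all_fixed("The quick brown fox",
    c("quick", "brown", "fox"), c("slow", "black", "bear"),
    vectorize_all = FALSE)

# simplify=TRUE returns a character matrix instead of a list
m <- stri_split_fixed(c("a,b,c", "d,e"), ",", simplify = TRUE)
```

Here `sub1` and `sub2` hold the same elements, and `m` has one row per input string.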
## 0.2-5 (2014-05-16)
* Some examples are no longer run if `icudt` is not available
(this was reverted in a later version, though).
## 0.2-4 (2014-05-15)
* [BUGFIX] Fixed issues with loading of misaligned addresses
in `stri_*_fixed()`.
## 0.2-3 (2014-05-14)
* [IMPORTANT CHANGE] `stri_cmp*()` no longer allow passing
`opts_collator=NA`. From now on, `stri_cmp_eq()`, `stri_cmp_neq()`,
and the new operators `%===%`, `%!==%`, `%stri===%`, and `%stri!==%`
are locale-independent operations based on code point comparisons.
New functions `stri_cmp_equiv()` and `stri_cmp_nequiv()`
(and from now on also `%==%`, `%!=%`, `%stri==%`, and `%stri!=%`)
test for canonical equivalence.
* [IMPORTANT CHANGE] `stri_*_fixed()` search functions now perform
a locale-independent exact (byte-wise, of course after conversion to UTF-8)
pattern search. All the `Collator`-based, locale-dependent search routines
are now available via `stri_*_coll()`. The reason behind this is that
ICU's `USearch` currently has very poor performance. What is more,
in many search tasks exact pattern matching is sufficient anyway.
* [GENERAL] `stri_*_fixed()` now use a tweaked Knuth-Morris-Pratt search
algorithm, which improves the search performance drastically.
* [IMPORTANT CHANGE] `stri_enc_nf*()` and `stri_enc_isnf*()` function families
have been renamed `stri_trans_nf*()` and `stri_trans_isnf*()`,
respectively -- they deal with text transformation,
and not with character encoding. Note that all of these may
be performed by ICU's `Transliterator` too (see below).
* [NEW FUNCTION] `stri_trans_general()` and `stri_trans_list()` give access
to ICU's `Transliterator`: they may be used to perform some generic
text transforms, like Unicode normalisation, case folding, etc.
* [NEW FUNCTION] `stri_split_boundaries()` uses ICU's `BreakIterator`
to split strings at specific text boundaries. Moreover,
`stri_locate_boundaries()` indicates positions of these boundaries.
* [NEW FUNCTION] `stri_extract_words()` uses ICU's `BreakIterator` to
extract all words from a text. Additionally, `stri_locate_words()`
locates start and end positions of words in a text.
* [NEW FUNCTION] `stri_pad()`, `stri_pad_left()`, `stri_pad_right()`,
and `stri_pad_both()` pad a string with a specific code point.
* [NEW FUNCTION] `stri_wrap()` breaks paragraphs of text into lines.
Two algorithms (greedy and minimal raggedness) are available.
* [IMPORTANT CHANGE] `stri_*_charclass()` search functions now
rely solely on ICU's `UnicodeSet` patterns. All the previously accepted
charclass identifiers became invalid. However, new patterns
should now be more familiar to the users (they are regex-like).
Moreover, we observe a very nice performance gain.
* [IMPORTANT CHANGE] `stri_sort()` no longer includes `NA`s
in output vectors by default, for compatibility with `sort()`.
Moreover, currently none of the input vector's attributes are preserved.
* [NEW FUNCTION] `stri_unique()` extracts unique elements from
a character vector.
* [NEW FUNCTIONS] `stri_duplicated()` and `stri_duplicated_any()`
determine duplicate elements in a character vector.
* [NEW FUNCTION] `stri_replace_na()` replaces `NA`s in a character vector
with a given string, useful for emulating, e.g., R's `paste()` behaviour.
* [NEW FUNCTION] `stri_rand_shuffle()` generates a random permutation
of code points in a string.
* [NEW FUNCTION] `stri_rand_strings()` generates random strings.
* [NEW FUNCTIONS] New functions and binary operators for string comparison:
`stri_cmp_eq()`, `stri_cmp_neq()`, `stri_cmp_lt()`, `stri_cmp_le()`,
`stri_cmp_gt()`, `stri_cmp_ge()`, `%==%`, `%!=%`, `%<%`, `%<=%`,
`%>%`, `%>=%`.
* [NEW FUNCTION] `stri_enc_mark()` reads declared encodings of character
strings as seen by stringi.
* [NEW FUNCTION] `stri_enc_tonative(str)` is an alias to
`stri_encode(str, NULL, NULL)`.
* [NEW FEATURE] `stri_order()` and `stri_sort()` now have an additional
argument `na_last` (defaults to `TRUE` and `NA`, respectively).
* [NEW FEATURE] `stri_replace_all_charclass()`, `stri_extract_all_charclass()`,
and `stri_locate_all_charclass()` now have a new argument, `merge`
(defaults to `FALSE` for backward-compatibility). It may be used
to, e.g., replace sequences of white spaces with a single space.
* [NEW FEATURE] `stri_enc_toutf8()` now has a new `validate` argument
(which defaults to `FALSE` for backward-compatibility). It may be used
in a (rare) case where a user wants to fix an invalid UTF-8 byte sequence.
`stri_length()` (among others) now detects invalid UTF-8 byte sequences.
* [NEW FEATURE] All binary operators `%???%` now also have aliases `%stri???%`.
* [GENERAL] Performance improvements in `StriContainerUTF8`
and `StriContainerUTF16` (they affect most other functions).
* [GENERAL] Significant performance improvements in `stri_join()`,
`stri_flatten()`, `stri_cmp()`, `stri_trans_to*()`, and others.
* [GENERAL] Added 3rd mirror site for our `icudt` binary distribution.
* `U_MISSING_RESOURCE_ERROR` message in `StriException` now suggests
calling `stri_install_check()`.
* [BUGFIX] UTF-8 BOMs are now silently removed from input strings.
* [BUGFIX] No more attempts to re-encode UTF-8 encoded strings
if native encoding is UTF-8 in `StriContainerUTF8`.
* [BUGFIX] Possible memory leaks when throwing errors via `Rf_error()`.
* [BUGFIX] `stri_order()` and `stri_cmp()` could return incorrect results
for `opts_collator=NA`.
* [BUGFIX] `stri_sort()` did not guarantee to return strings in UTF-8.
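The difference between the code-point-based and the canonical-equivalence comparisons, along with a few of the other additions in this release, can be illustrated as follows. This is a minimal sketch that assumes a reasonably recent stringi; the sample strings are made up.

```r
library(stringi)

a1 <- "\u0105"   # LATIN SMALL LETTER A WITH OGONEK (precomposed)
a2 <- "a\u0328"  # "a" followed by COMBINING OGONEK

# Code point comparison: the two strings consist of different code points
same_points <- stri_cmp_eq(a1, a2)      # the same test as a1 %stri===% a2

# Canonical equivalence: both strings represent the same text
equivalent <- stri_cmp_equiv(a1, a2)    # the same test as a1 %stri==% a2

# merge=TRUE collapses each run of matching code points into one replacement
squeezed <- stri_replace_all_charclass("a  b \t c", "\\p{WSPACE}", " ",
    merge = TRUE)

# stri_sort() drops NAs by default; na_last=TRUE keeps them at the end
sorted_default <- stri_sort(c("b", NA, "a"))
sorted_na_last <- stri_sort(c("b", NA, "a"), na_last = TRUE)

# Replace missing values with a given string
filled <- stri_replace_na(c("a", NA), "missing")
```

Here `same_points` is `FALSE` while `equivalent` is `TRUE`, which is why `stri_cmp_eq()` suits exact matching and `stri_cmp_equiv()` suits comparing text that may be normalised differently.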
## 0.1-25 (2014-03-12)
* LICENSE tweaks.
* First CRAN release.
## 0.1-24 (2014-03-11)
* Fixed bugs detected with `ASAN` and `UBSAN`,
e.g., fixed `CharClass::gcmask` type (`enum` -> `uint32_t`)
(reported by `UBSAN`).
* Fixed array over-runs detected with `valgrind` in `string8.h`.
* Fixed uninitialised class fields in `StriContainerUTF8`
(reported by `valgrind`).
## 0.1-23 (2014-03-11)
* License changed to BSD-3-clause, COPYRIGHTS updated.
* `icudt` is not shipped with stringi anymore;
it is now downloaded in `install.libs.R` from one of our servers.
* New functions: `stri_install_check()`, `stri_install_icudt()`.
## 0.1-22 (2014-02-20)
* System ICU is used on systems that have one (version >= 50 needed).
ICU is auto-detected with `pkg-config` in `configure`.
Pass `'--disable-pkg-config'` to `configure` to force building
ICU from sources.
* `icudt52b` (custom subset) is now shipped with stringi
(for big-endian, ASCII systems).
## 0.1-21 (2014-02-19)
* Fixed some issues on Solaris while preparing stringi
for CRAN submission.
## 0.1-20 (2014-02-17)
* ICU4C 52.1 sources included (common, i18n, stubdata + `icu52dt.dat`
loaded dynamically). Compilation via Makevars.
* stringi does not depend on any external libraries anymore.
## 0.1-11 (2013-11-16)
* ICU4C is now statically linked on Windows.
* First OS X binary build.
* The package is being intensively tested by our students at Warsaw
University of Technology.
## 0.1-10 (2013-11-13)
* Using `pkg-config` via `configure` to look for ICU4C libs.
## 0.1-6 (2013-07-05)
* First Windows binary build.
* Compilation succeeds with the Oracle Sun Studio compiler collection.
* By now we have implemented most of the functionality
scheduled for milestone 0.1.
## 0.1-1 (2013-01-05)
* The stringi project has been started.