Tags · BYVoid/uchardet

v0.0.5

Version 0.0.5 released.

- Revert UTF-16 and UTF-32 label change:
  it was an error to specify endianness for texts with a BOM.
  The Unicode standard explicitly warns against it, and it even
  (partially) breaks conversions.
- Added support for:
    - French: Windows-1252
    - German: ISO-8859-1, Windows-1252
    - Esperanto: ISO-8859-3
    - Turkish: ISO-8859-3 and ISO-8859-9
    - Thai: ISO-8859-11 (and TIS-620 model rebuilt)
- Single Byte charset detection algorithm improved:
  detection of control characters lowers confidence.
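
The BOM issue in the first item can be reproduced outside uchardet. The following is a minimal Python sketch, not uchardet code: it assumes that Python's `utf-16` and `utf-16-le` codecs behave like the generic and endianness-specific labels a converter would receive. With the generic label the decoder consumes the BOM; with an explicit endianness the BOM is kept as a stray U+FEFF character in the converted text.

```python
import codecs

# "hi" encoded as little-endian UTF-16, preceded by a byte order mark.
data = codecs.BOM_UTF16_LE + "hi".encode("utf-16-le")

# Generic label: the decoder reads the BOM to pick the byte order,
# then drops it from the output.
generic = data.decode("utf-16")

# Endianness-specific label: the BOM is treated as ordinary text and
# survives as U+FEFF at the start of the result.
explicit = data.decode("utf-16-le")

print(repr(generic))
print(repr(explicit))
```

This is why a BOM-prefixed text is better labelled plain "UTF-16" or "UTF-32": the BOM already carries the byte order.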

Dec 5, 2015
886e03a

v0.0.4

Version 0.0.4 released.

- Add support for ISO-8859-1 and ISO-8859-15 for French.
- Re-enable Hungarian language models (ISO-8859-2 and Windows-1250)
  which used to conflict with other charsets (should be better now).
- Differentiate ASCII detection and detection failure.
- Improve single-byte charset detection confidence algorithm (fixes,
  for instance, detection of Windows-1251 Russian text).
- "UTF-16" is now reported with endianness information (UTF-16LE/BE).
- Add UTF-32 BOM detection.
- Discard single byte charsets upon illegal codepoint detection.
- Internal redesign of single-byte charmaps with more semantics, and
  variable sample size length (different languages have different sizes
  of grapheme lists).
- Many more test files (33 unit tests should pass with `make test`).
- Add Python scripts to generate language models from Wikipedia data
  in a single command.
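
The "discard on illegal codepoint" rule above can be sketched in a few lines of Python. This is not uchardet's implementation: the function name `surviving_candidates` is hypothetical, and Python's codec tables stand in for uchardet's charmaps, so the exact set of bytes treated as illegal may differ.

```python
def surviving_candidates(data, candidates):
    """Return the candidate charsets in which every byte of data is defined."""
    kept = []
    for name in candidates:
        try:
            data.decode(name)  # strict mode: raises on an undefined byte
        except UnicodeDecodeError:
            continue  # illegal byte for this charset, so rule it out
        kept.append(name)
    return kept


# 0x81 has no assignment in Windows-1252 but is a C1 control in ISO-8859-1.
sample = b"caf\xe9 \x81"
print(surviving_candidates(sample, ["cp1252", "iso8859_1"]))
```

A single undefined byte is enough to eliminate a candidate, which is cheaper and more decisive than lowering its confidence score.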

Dec 3, 2015
e4260f4

v0.0.3

Version 0.0.3 released.

A quick release after 0.0.2 mostly to fix a bad crash on the command
line tool when charset detection failed (or detected ASCII).

Additionally:

- The build now includes more test files for various language/encoding
  pairs and a `make test` target for unit testing (20 encoding detection
  tests should pass when running it).
- The build has a new BUILD_STATIC option, set to ON by default, which
  makes it possible to disable static library building if not needed.
- All encoding names are iconv-compatible, enabling developers to
  directly feed the result of uchardet_get_charset() into libiconv.
- Compilation warnings fixed.
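
The practical benefit of iconv-compatible names is that a detected label needs no translation table before conversion. Here is a sketch of that pattern in Python, under two stated assumptions: the detected name is a hard-coded stand-in (no uchardet call is made), and Python's codec registry stands in for libiconv. The helper `to_utf8` is hypothetical.

```python
def to_utf8(data, detected_charset):
    """Convert data to UTF-8 using the detector's label as-is."""
    # The label goes straight to the converter, with no lookup table between.
    return data.decode(detected_charset).encode("utf-8")


detected = "WINDOWS-1252"  # hard-coded stand-in for a detector result
print(to_utf8(b"caf\xe9", detected))
```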

Nov 19, 2015
ff5fd5e

v0.0.2

Version 0.0.2 released.

Nov 16, 2015
d0ccdd5
