Tags: BYVoid/uchardet

v0.0.5

Version 0.0.5 released.

- Revert the UTF-16 and UTF-32 label change:
  it was an error to specify endianness for texts with a BOM.
  The Unicode standard explicitly warns against it, and it
  even (partially) breaks conversions.
- Added support for:
    - French: Windows-1252.
    - German: ISO-8859-1 and Windows-1252.
    - Esperanto: ISO-8859-3.
    - Turkish: ISO-8859-3 and ISO-8859-9.
    - Thai: ISO-8859-11 (and TIS-620 model rebuilt).
- Single-byte charset detection algorithm improved:
  detection of control characters lowers confidence.
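
The conversion problem behind the label revert can be reproduced with any converter that treats the generic and the endian-specific names differently. The sketch below uses Python's built-in codecs rather than uchardet or iconv, so it only illustrates the general behavior; the sample bytes are made up for the example.

```python
import codecs

# "hi" encoded as little-endian UTF-16, preceded by a byte order mark (BOM).
data = codecs.BOM_UTF16_LE + "hi".encode("utf-16-le")

# Generic label: the decoder reads the BOM, picks the byte order, and drops it.
generic = data.decode("utf-16")

# Endian-specific label: the BOM is not treated as a signature, so it
# survives as a stray U+FEFF character at the start of the text.
explicit = data.decode("utf-16-le")

print(repr(generic))   # 'hi'
print(repr(explicit))  # '\ufeffhi'
```

With the endian-specific label, the converted text carries an extra leading character, which is one way a conversion can be partially broken.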

v0.0.4

Version 0.0.4 released.

- Add support for ISO-8859-1 and ISO-8859-15 for French.
- Re-enable Hungarian language models (ISO-8859-2 and Windows-1250)
  which used to conflict with other charsets (should be better now).
- Differentiate ASCII detection and detection failure.
- Improve single-byte charset detection confidence algorithm (fixes for
  instance Windows-1251 Russian text detection).
- "UTF-16" is now output with endianness information (UTF-16LE/BE).
- Add UTF-32 BOM detection.
- Discard single byte charsets upon illegal codepoint detection.
- Internal redesign of single-byte charmaps with more semantics, and
  variable sample size length (different languages have different sizes
  of grapheme lists).
- A lot more test files (33 unit tests should pass
  with `make test`).
- Add Python scripts to generate language models from Wikipedia data
  in a single command.
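
BOM detection for UTF-16 and UTF-32 has one ordering pitfall: the UTF-32LE BOM (FF FE 00 00) begins with the UTF-16LE BOM (FF FE). The following is a minimal sketch of a BOM sniffer, not uchardet's actual implementation; the function name `sniff_bom` and the table layout are hypothetical, and the labels carry the byte order as described in the notes above.

```python
import codecs

# Longer signatures come first so that a UTF-32LE BOM is not
# mistaken for a UTF-16LE BOM followed by two zero bytes.
BOMS = [
    (codecs.BOM_UTF32_LE, "UTF-32LE"),
    (codecs.BOM_UTF32_BE, "UTF-32BE"),
    (codecs.BOM_UTF8, "UTF-8"),
    (codecs.BOM_UTF16_LE, "UTF-16LE"),
    (codecs.BOM_UTF16_BE, "UTF-16BE"),
]

def sniff_bom(data: bytes):
    """Return a charset label if data starts with a known BOM, else None."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    return None

print(sniff_bom(codecs.BOM_UTF32_LE + "a".encode("utf-32-le")))
print(sniff_bom(codecs.BOM_UTF16_LE + "a".encode("utf-16-le")))
print(sniff_bom(b"plain ASCII"))
```

Returning `None` for input without a BOM leaves the decision to the statistical detectors.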

v0.0.3

Version 0.0.3 Released.

A quick release after 0.0.2 mostly to fix a bad crash on the command
line tool when charset detection failed (or detected ASCII).

Additionally:

- The build now includes more test files for various languages and
  encodings, and a `make test` target for unit testing (20 encoding
  detection tests should pass when running it).
- The build has a new BUILD_STATIC option, set to ON by default,
  which allows disabling the static library build if it is not needed.
- All encoding names are iconv-compatible, enabling developers to
  directly feed the result of uchardet_get_charset() into libiconv.
- Compilation warnings fixed.
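
The pattern of feeding a detected name straight into a converter, while guarding against the failure case that caused the crash, can be sketched as follows. This is an illustration only: it uses Python's codecs instead of libiconv, the helper `to_utf8` is hypothetical, the `detected` argument stands in for the string returned by uchardet_get_charset(), and it assumes a failed detection is reported as an empty name.

```python
def to_utf8(data: bytes, detected: str) -> bytes:
    """Re-encode data as UTF-8 using a charset name supplied by a detector."""
    if not detected:
        # Assumption: an empty name means detection failed, so there is
        # no source charset to hand to the converter.
        raise ValueError("charset detection failed")
    return data.decode(detected).encode("utf-8")

# "café" in ISO-8859-1; the label is hard-coded here in place of a
# real detection result.
latin1_bytes = "caf\u00e9".encode("iso-8859-1")
print(to_utf8(latin1_bytes, "ISO-8859-1"))
```

Checking the name before converting keeps a failed detection from reaching the converter with an invalid charset.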

v0.0.2

Version 0.0.2 released.