utf8 metadata not displayed properly #19

Open
pitrou opened this issue Oct 14, 2023 · 8 comments

@pitrou

pitrou commented Oct 14, 2023

I'm trying out the xIFr extension, so I chose a news article at random, since they often come with illustrations.

On https://www.letelegramme.fr/ille-et-vilaine/rennes-35000/accuse-davoir-empeche-lexpulsion-de-lattaquant-darras-manuel-valls-repond-6448696.php , the metadata of the main article photo displays like this:

[image: xifr]

The unexpected characters in the description are a symptom of utf8-encoded text being displayed as if it were in another character set (such as latin-1 / iso-8859-1).

For example, "DÃ©fense" is exactly what you get when you take the word "Défense", encode it as utf8, and then decode it as latin1. See this Python snippet:

>>> "Défense".encode('utf8')
b'D\xc3\xa9fense'
>>> "Défense".encode('utf8').decode("latin1")
'DÃ©fense'
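
The same round trip can be run in reverse to repair text that has already been garbled this way. The following is a minimal sketch, not code from xIFr; `fix_mojibake` is a hypothetical helper name, and it assumes the wrong decoding really was latin-1.

```python
def fix_mojibake(text: str) -> str:
    """Undo 'UTF-8 bytes decoded as latin-1' by reversing the two steps."""
    try:
        return text.encode("latin1").decode("utf8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not this kind of mojibake (or already correct text): leave it alone.
        return text


garbled = "Défense".encode("utf8").decode("latin1")
print(garbled)                # DÃ©fense
print(fix_mojibake(garbled))  # Défense
```

Text that is already correct passes through unchanged, because re-decoding its latin-1 bytes as UTF-8 fails and the helper returns the input as-is.
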
@pitrou
Author

pitrou commented Oct 14, 2023

I know next to nothing about EXIF, but a quick search seems to hint that utf8 is often used to encode non-ASCII metadata.

  • Wikipedia:

The latest version, 3.0, was released in May 2023, and brings, among other things, support for UTF-8 to allow text data in non-ASCII encoding.
(from https://en.wikipedia.org/wiki/Exif#Version_history)

  • ExifTool FAQ:

Most textual information in EXIF is stored in ASCII format (called "string" in the EXIF Tags documentation). By default ExifTool does not convert these strings. However, it is not uncommon for applications to write UTF‑8 or other encodings where ASCII is expected.
(from https://exiftool.org/faq.html#Q10)

A quick look at the hex dump of that PNG file shows that the metadata is indeed utf8-encoded:

00001960  44 45 53 54 4f 43 20 2f  20 4c 45 20 54 45 4c 45  |DESTOC / LE TELE|
00001970  47 52 41 4d 4d 45 20 49  4c 45 20 44 45 20 47 52  |GRAMME ILE DE GR|
00001980  4f 49 58 20 28 35 36 29  20 3a 20 4d 61 6e 75 65  |OIX (56) : Manue|
00001990  6c 20 56 41 4c 4c 53 2c  20 50 72 65 6d 69 65 72  |l VALLS, Premier|
000019a0  20 6d 69 6e 69 73 74 72  65 2c 20 65 6e 20 70 72  | ministre, en pr|
000019b0  c3 a9 73 65 6e 63 65 20  64 65 20 4a 65 61 6e 2d  |..sence de Jean-|
000019c0  59 76 65 73 20 4c 45 20  44 52 49 41 4e 2c 20 6d  |Yves LE DRIAN, m|
000019d0  69 6e 69 73 74 72 65 20  64 65 20 6c 61 20 44 c3  |inistre de la D.|
000019e0  a9 66 65 6e 73 65 20 65  74 20 70 72 c3 a9 73 69  |.fense et pr..si|
000019f0  64 65 6e 74 20 64 65 20  6c 61 20 20 52 c3 a9 67  |dent de la  R..g|
00001a00  69 6f 6e 2c 20 6c 6f 72  73 20 64 65 20 6c 61 20  |ion, lors de la |
00001a10  20 70 72 c3 a9 73 65 6e  74 61 74 69 6f 6e 20 64  | pr..sentation d|
00001a20  75 20 70 72 6f 6a 65 74  20 c3 a9 6f 6c 69 65 6e  |u projet ..olien|
00001a30  20 65 6e 20 42 72 65 74  61 67 6e 65 20 53 75 64  | en Bretagne Sud|
00001a40  20 70 61 72 20 6c 65 20  70 6f 72 74 65 75 72 20  | par le porteur |
00001a50  64 65 20 70 72 6f 6a 65  74 20 45 4f 4c 46 49 20  |de projet EOLFI |
00001a60  65 74 20 70 61 72 20 6c  65 73 20 73 65 72 76 69  |et par les servi|
00001a70  63 65 73 20 64 65 20 6c  e2 80 99 45 74 61 74 20  |ces de l...Etat |
00001a80  65 74 20 6c 61 20 70 72  c3 a9 73 65 6e 74 61 74  |et la pr..sentat|
00001a90  69 6f 6e 20 65 74 20 73  69 67 6e 61 74 75 72 65  |ion et signature|
00001aa0  20 64 65 20 6c e2 80 99  61 76 65 6e 61 6e 74 20  | de l...avenant |
00001ab0  61 75 20 63 6f 6e 74 72  61 74 20 64 65 20 70 6c  |au contrat de pl|
00001ac0  61 6e 20 45 74 61 74 2f  52 c3 a9 67 69 6f 6e 1c  |an Etat/R..gion.|
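
To check the dump by hand, a few of its bytes can be decoded both ways. This is only an illustration in Python: the two byte strings are copied from rows 000019a0/000019b0 and 00001a70 of the dump above, and the variable names are made up for the example.

```python
# "en pr" + "c3 a9" + "sence", copied from rows 000019a0 and 000019b0.
sample = bytes.fromhex("65 6e 20 70 72 c3 a9 73 65 6e 63 65")
# "l" + "e2 80 99" + "Etat", copied from row 00001a70.
apostrophe = bytes.fromhex("6c e2 80 99 45 74 61 74")

# Decoded as UTF-8, the two-byte sequence c3 a9 becomes a single "é"
# and the three-byte sequence e2 80 99 becomes a typographic apostrophe.
print(sample.decode("utf-8"))      # en présence
print(apostrophe.decode("utf-8"))  # l’Etat

# Decoded as latin-1, every byte becomes its own character,
# which is the garbling reported in this issue.
print(sample.decode("latin-1"))    # en prÃ©sence
```
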

@StigNygaard
Owner

@pitrou, thanks for reporting!

To be honest, I don't know much about Exif parsing either ;-) Especially when it comes to parsing binary Exif data, which seems to be the only kind in "your" image. I inherited the Exif and metadata parsing code in xIFr from wxIF and have only made minor changes to it (xIFr is a fork of wxIF).

But as soon as I find some time, I will try to dig into this and see if I can find a way to make xIFr handle this without breaking anything already working.

And thanks for the work you have done for describing and analysing this issue, instead of just simply saying it doesn't work! :-)


@duanemoody

https://civitai.com/images/5272231
Right-click on the image of the glass strawberry, and try to read its metadata with the extension.
Open the image in its own window and try again: the userComment field is displayed as Chinese characters. Save the file to your computer and open it with exiftool; there the userComment is readable because it's utf-8, not ASCII. That is signalled by the same convention TIFF uses for this field, and it would probably be wise to follow suit here.

@duanemoody

Most textual information in EXIF is stored in ASCII format (called "string" in the EXIF Tags documentation). By default ExifTool does not convert these strings. However, it is not uncommon for applications to write UTF‑8 or other encodings where ASCII is expected.
(from https://exiftool.org/faq.html#Q10)

While exiftool's home page doesn't explicitly acknowledge the TIFF encoding method (an 8-byte header inside userComment), it recognizes and supports that method.
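
For readers unfamiliar with that header: the Exif specification reserves the first 8 bytes of UserComment for a character-code identifier ("ASCII", "JIS", "UNICODE", or eight NUL bytes for "undefined"). The sketch below shows one way to act on it in Python. It is not xIFr or ExifTool code; `decode_user_comment` and its `utf16_codec` parameter are hypothetical, the byte order for "UNICODE" has to be supplied by the caller, and the UTF-8 fallback is an assumption.

```python
# Character-code identifiers defined by the Exif specification for UserComment.
ASCII_ID = b"ASCII\x00\x00\x00"
UNICODE_ID = b"UNICODE\x00"
JIS_ID = b"JIS\x00\x00\x00\x00\x00"
UNDEFINED_ID = b"\x00" * 8


def decode_user_comment(raw: bytes, utf16_codec: str = "utf-16-be") -> str:
    """Decode a raw UserComment value according to its 8-byte header."""
    header, payload = raw[:8], raw[8:]
    if header == ASCII_ID:
        return payload.decode("ascii", errors="replace").rstrip("\x00 ")
    if header == UNICODE_ID:
        # The header does not say which byte order was used,
        # so the caller must pass the codec (assumption of this sketch).
        return payload.decode(utf16_codec, errors="replace").rstrip("\x00 ")
    if header == JIS_ID:
        raise NotImplementedError("JIS comments are not handled in this sketch")
    # "Undefined" header or no recognizable header at all: try UTF-8,
    # since writers often store UTF-8 where something else is expected.
    body = payload if header == UNDEFINED_ID else raw
    return body.decode("utf-8", errors="replace").rstrip("\x00 ")


raw = UNICODE_ID + "glass strawberry".encode("utf-16-le")
print(decode_user_comment(raw, "utf-16-le"))  # glass strawberry
```
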

@StigNygaard added and then removed the "help wanted" label on Feb 17, 2024
@StigNygaard
Owner

StigNygaard commented Feb 18, 2024

If xIFr could be fixed so it also can show those Stable Diffusion prompts, it would definitely be nice.
But as already mentioned, character sets and binary parsing are really not my field.
If it is going to be fixed in a short time-frame, I probably need someone to fix it for me.
But looking a bit further ahead, I plan to experiment with replacing the whole parsing-part with some 3rd party library. So maybe it is more interesting to look at if/which 3rd party libraries handle it.

@duanemoody

I've done further research, and this is partly the fault of the EXIF/TIFF specs: while they permit Unicode in UserComment, they neither imply an encoding nor require a BOM. Most algorithms still assume that content without a BOM is big-endian (BE) rather than little-endian (LE), even though all modern ISAs are LE at this point. Civitai, for whatever reason, uses UTF-16LE, which most software will not guess correctly, so it shows "Chinese" because the byte order is swapped.

The key point is that ExifTool can consistently return a readable value from UserComment only because it analyzes the stream to determine the encoding and even the endianness. No one should have to reinvent that wheel. I've put in a request with Civitai to give images a formal Title field, and then to output the prompt both to Description, where macOS will show it in Get Info, and to UserComment, where Windows will show it in Comments. If no Title is set, Windows will default to dumping Description into that field in Properties.
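
The "Chinese" effect and a very simple way to guess the byte order can be sketched in Python as follows. This is a rough heuristic for mostly-Latin text, not ExifTool's actual algorithm; `guess_utf16_codec` and `decode_utf16_guess` are hypothetical names, and falling back to big-endian on a tie is an assumption that mirrors the convention mentioned above.

```python
def guess_utf16_codec(payload: bytes) -> str:
    """Guess the byte order of UTF-16 data (heuristic only)."""
    if payload.startswith(b"\xff\xfe"):
        return "utf-16-le"
    if payload.startswith(b"\xfe\xff"):
        return "utf-16-be"
    # For mostly-Latin text the high byte of each 16-bit unit is zero.
    # In little-endian data those zeros sit at odd offsets,
    # in big-endian data at even offsets.
    zeros_even = payload[0::2].count(0)
    zeros_odd = payload[1::2].count(0)
    return "utf-16-le" if zeros_odd > zeros_even else "utf-16-be"


def decode_utf16_guess(payload: bytes) -> str:
    """Decode with the guessed byte order and drop a leading BOM, if any."""
    text = payload.decode(guess_utf16_codec(payload), errors="replace")
    return text.lstrip("\ufeff")


data = "Hi".encode("utf-16-le")      # bytes 48 00 69 00
wrong = data.decode("utf-16-be")     # read as U+4800 and U+6900, both CJK
print([hex(ord(c)) for c in wrong])  # ['0x4800', '0x6900']
print(decode_utf16_guess(data))      # Hi
```

A heuristic like this fails for text that is genuinely CJK, which is one reason a well-tested library is preferable to hand-rolled detection.
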

@StigNygaard
Owner

@duanemoody thanks for the further analysis and the Civitai request.
It would still be cool if xIFr could show the current UserComments as intended in those SD images.
As mentioned, I hope to some day replace the parsing-part of xIFr with some "3rd party library", but I don't know if any of these are as clever as ExifTool. And to be realistic, it ain't gonna be any time soon.

@StigNygaard
Owner

FYI: I didn't write these parts of the code in xIFr, but I see 4 different "bytes-to-string" functions used in it.

binIptc.js:

  // Decodes arrays carrying UTF-8 sequences into Unicode strings.
  // Filters out illegal bytes with values between 128 and 191,
  // but doesn't validate sequences.
  function utf8BytesToString(utf8data, offset, num)

binExif.js:

  // Decodes arrays carrying UTF-8 sequences into Unicode strings.
  // It also validates sequences and throws an error if it encounters
  // invalid encodings.
  function utf8BytesToString(utf8data, offset, num)

fxifUtils.js:

  /* charWidth should normally be 1 and this function thus reads
   * the bytes one by one. But reading Unicode needs reading
   * 16 Bit values.
   * Stops at the first null byte.
   */
  this.bytesToString = function (data, offset, num, swapbytes, charWidth)

  /* Doesn’t stop at null bytes. */
  this.bytesToStringWithNull = function (data, offset, num)

And also a TextDecoder is in use in xmp.js:

const utf8decoder = new TextDecoder('utf-8');

Do you think I have the "building-blocks" to get any of these issues fixed?
The fxifUtils.bytesToString() function, with its swapbytes and charWidth parameters, might be the most interesting?
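
To make the question concrete, here is a rough Python model (the extension itself is JavaScript) of the behaviour that the comment on bytesToString describes: read one or two bytes per character and stop at the first null. It is not the xIFr source. The name `bytes_to_string` is hypothetical, and the meaning given to `swapbytes` (True = low byte first, i.e. little-endian) is an assumption.

```python
def bytes_to_string(data, offset, num, swapbytes=False, char_width=1):
    """Read `num` bytes starting at `offset`, `char_width` bytes per
    character, stopping at the first null character."""
    chars = []
    for i in range(offset, offset + num - char_width + 1, char_width):
        if char_width == 1:
            code = data[i]
        else:
            first, second = data[i], data[i + 1]
            # Assumption: swapbytes=True means the low byte comes first.
            code = (second << 8) | first if swapbytes else (first << 8) | second
        if code == 0:
            break
        # Note: surrogate pairs are not combined in this simple model.
        chars.append(chr(code))
    return "".join(chars)


payload = "prompt".encode("utf-16-le") + b"\x00\x00"
print(bytes_to_string(payload, 0, len(payload), swapbytes=True, char_width=2))
# prompt
```

Under these assumptions, a function of this shape would be enough to read a UTF-16LE UserComment once the 8-byte header has been skipped and the byte order is known; deciding the byte order is the part it does not solve.
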
