nvarchar character encoding in nettezza db #126

andreaspano · 2017-10-25T13:14:18Z

Issue Description and Expected Result

When quering nettezza on nvarchar fields, returns strange characters instead of plain UTF-8.
The issue occurs only on nvarchar.
varchar, date_time and numeric work correctly

Database

IBM Netezza

Reproducible Example

library(odbc)
  con <- dbConnect(odbc::odbc(),
                   driver = "NetezzaSQL",
                   database = "NYCFLIGHTS",
                   uid = "user",
                   pwd = "user",
                   server = "192.168.69.xxx",
                   port = 5480,
                   encoding = 'UTF-8')

dbGetQuery(con, 'select TAILNUM, SEATS from planes limit 3')

  TAILNUM SEATS
1  ㅎ㔱ㄵ    55
2  ㅎㄲ㈴    55
3  ㅎㄲ㌶    55

devtools::session_info()
setting  value                       
 version  R version 3.4.1 (2017-06-30)
 system   x86_64, linux-gnu           
 ui       RStudio (1.1.383)           
 language en                          
 collate  it_IT.UTF-8                 
 tz       Europe/Rome                 
 date     2017-10-25

jimhester · 2017-10-25T14:20:29Z

The encoding parameter needs to be set to the database encoding. It looks like you can query it using nzsql -c "show nz_encoding".

andreaspano · 2017-10-25T14:45:17Z

Already done ... The encoding is UTF-8. I am not using chinese character but this is what I see as output

jimhester · 2017-10-25T15:54:06Z

What are the values returned for this query if you use nzsql directly?

This is likely due to a mismatch in the unicode width between the driver and the database. You may have to set the UnicodeTranslationOption (https://www.ibm.com/support/knowledgecenter/en/SSULQD_7.2.1/com.ibm.nz.datacon.doc/c_datacon_odbc_driver_config_unix_linux.html).

See also https://stackoverflow.com/questions/21536059/python-pyodbc-connections-to-ibm-netezza-erroring, which are some answers for a similar problem with pyodbc.

andreaspano · 2017-10-25T16:10:35Z

Thanks to Jim for his support

nzsql works fine

My odbc.ini is set up as described in the first link

I am aware of the cast work around and it works fine, But, when moving forward to testing the Netezza-dplyr interface this is not enough :-(

jimhester · 2017-10-25T16:58:28Z

Did you try using UnicodeTranslationOption = utf16?

bquinart · 2017-10-25T20:29:11Z

Few months back we encountered the same problem. I opened an RStudio support case for this (21266) but we didn't find a solution. For now we are also casting the values. Which does not scale to many users and many queries...
Back then I also noticed the 8 byte vs 16 byte problem but didn't manage to figure out where it went wrong.
Note that we did not have the problem with RStudio on Windows. Only with RStudio Server Pro on Linux.
I remember trying different values for UnicodeTranslationOption without success. Will give that another try.

bquinart · 2017-10-25T21:00:11Z

Changing the value for UnicodeTranslationOption does not seem to have any effect.
Note that the RODBCDBI package has the correct characters... (for same DSN and query).

bquinart · 2017-10-25T21:30:36Z

It looks like UnicodeTranslationOption only relates to 'normal' varchar fields and not nvarchar.
The only thing I can 'achieve' is that the same 'error in conversion' happens with the regular varchar fields when the driver has "UnicodeTranslationOption=utf16" and the R odbc connection encoding = 'UTF-16'.
For all other combination varchar fields are either correct or the decoding of unicode characters fails.
The nvarchar output remains constant.

andreaspano · 2017-10-29T20:15:14Z

In conclusion, it seems that nvarchar, with the netezza driver and the odbc package, are not shown correctly.
At the moment no one seems to have a 100% solution.
Thanks to all for support

bquinart · 2017-11-03T17:16:30Z

We just upgraded our server to RStudio 1.1. Not sure if it is related, but we have the same effect of special characters for the database, schema and table names in the IDE Connections pane for Netezza connections.
I guess the metadata queries that fill that screen also return nvarchar fields.

randombk · 2017-11-11T01:32:27Z

Another datapoint: Netzza connectivity works perfectly with RODBC:

library(DBI)
library(RODBCDBI)

con <- dbConnect(
  RODBCDBI::ODBC(),
  dsn = NA,
  connection = sprintf(
    "DRIVER={NetezzaSQL};SERVER=%s;DATABASE=%s;ReadOnly=true;UID=%s;PWD=%s",
    "SERVER",
    "DATABASE",
    "USER",
    "PASS"
  ),
  case = 'nochange'
)

bquinart · 2018-04-13T10:14:56Z

I see this issue got a label reprex.
Does that mean the reprex available is good? Or that it is not reproducible?
I suppose it is difficult to work on this if you do not have a Netezza database available. If that is the problem I can look into the dev environment that IBM offers. See here.

bquinart · 2018-04-13T12:04:36Z

Unfortunately I have no C++ or R package development experience and can not investigate this problem.
However, I think I can show with below examples that the problem is that at some point, text that is UTF-8 encoded gets decoded as UTF-16.
You can force the same error for the varchar fields. Those fields are decoded according the the parameter that gets passed in the connection string.

con <- dbConnect(odbc::odbc(), dsn= 'Netezza', encoding = 'UTF-8')
dbGetQuery(
  con,
  "SELECT CAST('Hello!' AS VARCHAR(30)) AS SIMPLEFIELD, CAST('Hello!' AS NVARCHAR(30)) AS UNICODEFIELD"
)
  SIMPLEFIELD UNICODEFIELD
1      Hello!        效汬Ⅿ

When we change the encoding parameter to UTF-16, the first field that was correctly decoded earlier now has the same error as the second field. Note that the 6 bytes from the string gets decoded as 3 characters (which makes sense for UTF-16).

con <- dbConnect(odbc::odbc(), dsn= 'Netezza', encoding = 'UTF-16')
dbGetQuery(
  con,
  "SELECT CAST('Hello!' AS VARCHAR(30)) AS SIMPLEFIELD, CAST('Hello!' AS NVARCHAR(30)) AS UNICODEFIELD"
)
  SIMPLEFIELD UNICODEFIELD
1       效汬Ⅿ        效汬Ⅿ

Note that this is with the IBM ODBC drivers. I believe there are also Simba Netezza ODBC drivers. I wonder if the same error happens there. Unfortunately I have no access to those drivers myself.

jimhester · 2018-04-13T14:17:32Z

In this case it means I need to be able to reproduce it myself, but I cannot because I do not currently have access to a netezza db

stewartthomasj · 2018-04-23T21:01:45Z

We had this same issue, wanted to share some insights on what we found. If one looks at the bytes, it is an endian problem (UTF-16LE vs UTF-16BE), i.e. the order of the bytes. We were able to duplicate the issue from R in Python. Python has an encode('utf-16le') function that can transpose the bytes and then it was readable data when printed. I'm not sure if it is the driver or odbc package that needs additional modification, but something needs to be able to take an option of UTF-16LE vs UTF-16BE, or something for endianness that would also work with UTF-32, etc.

jimhester · 2018-06-08T15:20:19Z

The ODBC Specification seems to be that all SQLWCHAR data should be little endian

SQLWCHAR data must be UTF-16LE (Little Endian).

So this seems like a driver issue to me, I am not sure if there is a Netezza driver configuration setting that can change this or not.

You should be able to convert the data after retrieval by using iconv(x, "UTF-16BE", "UTF-8") or similar.

bquinart · 2018-06-15T12:42:12Z

Thanks for the update. I have a support ticket open with IBM.
I wasn't able to get the expected string back with iconv however.
Somehow the RODBC package does not have this problem with the same driver.

jkarioki · 2018-08-24T19:18:49Z

Can you try setting the locale as described in the following URL to see if that help? https://www.ibm.com/support/knowledgecenter/en/SSULQD_7.2.1/com.ibm.nz.dbu.doc/c_dbuser_specify_encoding.html

bquinart · 2018-08-24T20:29:17Z

No effect. I believe that relates to the nzsql tool and not the odbc connection.
As noted above, this only happens with the linux driver - not on windows (against same Netezza appliance). It was noted above that it is likely related to the UTF-16 endianness.

bquinart · 2018-08-24T22:05:33Z

I had another look at this. I do not believe it is an endian problem.
I can retrieve the original text with iconv(field, 'UTF-8','UTF-16LE').
Refer to my comment on 13/04. The data seems to be supplied as UTF-8 and is interpreted at some point as UTF-16LE. I understand the last part would be expected for type SQL_WVARCHAR.
Not sure how I could confirm type NVARCHAR is indeed mapped to SQL_WVARCHAR.
I wonder if this is working in RODBC, because they somehow still allow to decode SQL_WVARCHAR in other encodings like UTF-8.

jimhester changed the title ~~Chinese character~~ nvarchar character encoding in nettezza db Oct 26, 2017

jimhester added bug an unexpected problem or unintended behavior reprex needs a minimal reproducible example labels Mar 21, 2018

krlmlr added the netezza label Jan 8, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

nvarchar character encoding in nettezza db #126

nvarchar character encoding in nettezza db #126

andreaspano commented Oct 25, 2017

jimhester commented Oct 25, 2017

andreaspano commented Oct 25, 2017

jimhester commented Oct 25, 2017 •

edited

Loading

andreaspano commented Oct 25, 2017

jimhester commented Oct 25, 2017

bquinart commented Oct 25, 2017

bquinart commented Oct 25, 2017

bquinart commented Oct 25, 2017

andreaspano commented Oct 29, 2017

bquinart commented Nov 3, 2017

randombk commented Nov 11, 2017

bquinart commented Apr 13, 2018

bquinart commented Apr 13, 2018

jimhester commented Apr 13, 2018

stewartthomasj commented Apr 23, 2018

jimhester commented Jun 8, 2018 •

edited

Loading

bquinart commented Jun 15, 2018

jkarioki commented Aug 24, 2018

bquinart commented Aug 24, 2018

bquinart commented Aug 24, 2018

nvarchar character encoding in nettezza db #126

nvarchar character encoding in nettezza db #126

Comments

andreaspano commented Oct 25, 2017

Issue Description and Expected Result

Database

Reproducible Example

jimhester commented Oct 25, 2017

andreaspano commented Oct 25, 2017

jimhester commented Oct 25, 2017 • edited Loading

andreaspano commented Oct 25, 2017

jimhester commented Oct 25, 2017

bquinart commented Oct 25, 2017

bquinart commented Oct 25, 2017

bquinart commented Oct 25, 2017

andreaspano commented Oct 29, 2017

bquinart commented Nov 3, 2017

randombk commented Nov 11, 2017

bquinart commented Apr 13, 2018

bquinart commented Apr 13, 2018

jimhester commented Apr 13, 2018

stewartthomasj commented Apr 23, 2018

jimhester commented Jun 8, 2018 • edited Loading

bquinart commented Jun 15, 2018

jkarioki commented Aug 24, 2018

bquinart commented Aug 24, 2018

bquinart commented Aug 24, 2018

jimhester commented Oct 25, 2017 •

edited

Loading

jimhester commented Jun 8, 2018 •

edited

Loading