Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

nvarchar character encoding in nettezza db #126

Open
andreaspano opened this issue Oct 25, 2017 · 20 comments
Open

nvarchar character encoding in nettezza db #126

andreaspano opened this issue Oct 25, 2017 · 20 comments
Labels
bug an unexpected problem or unintended behavior netezza reprex needs a minimal reproducible example

Comments

@andreaspano
Copy link

Issue Description and Expected Result

When quering nettezza on nvarchar fields, returns strange characters instead of plain UTF-8.
The issue occurs only on nvarchar.
varchar, date_time and numeric work correctly

Database

IBM Netezza

Reproducible Example

library(odbc)
  con <- dbConnect(odbc::odbc(),
                   driver = "NetezzaSQL",
                   database = "NYCFLIGHTS",
                   uid = "user",
                   pwd = "user",
                   server = "192.168.69.xxx",
                   port = 5480,
                   encoding = 'UTF-8')

dbGetQuery(con, 'select TAILNUM, SEATS from planes limit 3')

  TAILNUM SEATS
1  ㅎ㔱ㄵ    55
2  ㅎㄲ㈴    55
3  ㅎㄲ㌶    55
devtools::session_info()
setting  value                       
 version  R version 3.4.1 (2017-06-30)
 system   x86_64, linux-gnu           
 ui       RStudio (1.1.383)           
 language en                          
 collate  it_IT.UTF-8                 
 tz       Europe/Rome                 
 date     2017-10-25    
@jimhester
Copy link
Contributor

The encoding parameter needs to be set to the database encoding. It looks like you can query it using nzsql -c "show nz_encoding".

@andreaspano
Copy link
Author

Already done ... The encoding is UTF-8. I am not using chinese character but this is what I see as output

@jimhester
Copy link
Contributor

jimhester commented Oct 25, 2017

What are the values returned for this query if you use nzsql directly?

This is likely due to a mismatch in the unicode width between the driver and the database. You may have to set the UnicodeTranslationOption (https://www.ibm.com/support/knowledgecenter/en/SSULQD_7.2.1/com.ibm.nz.datacon.doc/c_datacon_odbc_driver_config_unix_linux.html).

See also https://stackoverflow.com/questions/21536059/python-pyodbc-connections-to-ibm-netezza-erroring, which are some answers for a similar problem with pyodbc.

@andreaspano
Copy link
Author

Thanks to Jim for his support

nzsql works fine

My odbc.ini is set up as described in the first link

I am aware of the cast work around and it works fine, But, when moving forward to testing the Netezza-dplyr interface this is not enough :-(

@jimhester
Copy link
Contributor

Did you try using UnicodeTranslationOption = utf16?

@bquinart
Copy link

Few months back we encountered the same problem. I opened an RStudio support case for this (21266) but we didn't find a solution. For now we are also casting the values. Which does not scale to many users and many queries...
Back then I also noticed the 8 byte vs 16 byte problem but didn't manage to figure out where it went wrong.
Note that we did not have the problem with RStudio on Windows. Only with RStudio Server Pro on Linux.
I remember trying different values for UnicodeTranslationOption without success. Will give that another try.

@bquinart
Copy link

Changing the value for UnicodeTranslationOption does not seem to have any effect.
Note that the RODBCDBI package has the correct characters... (for same DSN and query).

@bquinart
Copy link

It looks like UnicodeTranslationOption only relates to 'normal' varchar fields and not nvarchar.
The only thing I can 'achieve' is that the same 'error in conversion' happens with the regular varchar fields when the driver has "UnicodeTranslationOption=utf16" and the R odbc connection encoding = 'UTF-16'.
For all other combination varchar fields are either correct or the decoding of unicode characters fails.
The nvarchar output remains constant.

@jimhester jimhester changed the title Chinese character nvarchar character encoding in nettezza db Oct 26, 2017
@andreaspano
Copy link
Author

In conclusion, it seems that nvarchar, with the netezza driver and the odbc package, are not shown correctly.
At the moment no one seems to have a 100% solution.
Thanks to all for support

@bquinart
Copy link

bquinart commented Nov 3, 2017

We just upgraded our server to RStudio 1.1. Not sure if it is related, but we have the same effect of special characters for the database, schema and table names in the IDE Connections pane for Netezza connections.
I guess the metadata queries that fill that screen also return nvarchar fields.

@randombk
Copy link

Another datapoint: Netzza connectivity works perfectly with RODBC:

library(DBI)
library(RODBCDBI)

con <- dbConnect(
  RODBCDBI::ODBC(),
  dsn = NA,
  connection = sprintf(
    "DRIVER={NetezzaSQL};SERVER=%s;DATABASE=%s;ReadOnly=true;UID=%s;PWD=%s",
    "SERVER",
    "DATABASE",
    "USER",
    "PASS"
  ),
  case = 'nochange'
)

@jimhester jimhester added bug an unexpected problem or unintended behavior reprex needs a minimal reproducible example labels Mar 21, 2018
@bquinart
Copy link

I see this issue got a label reprex.
Does that mean the reprex available is good? Or that it is not reproducible?
I suppose it is difficult to work on this if you do not have a Netezza database available. If that is the problem I can look into the dev environment that IBM offers. See here.

@bquinart
Copy link

Unfortunately I have no C++ or R package development experience and can not investigate this problem.
However, I think I can show with below examples that the problem is that at some point, text that is UTF-8 encoded gets decoded as UTF-16.
You can force the same error for the varchar fields. Those fields are decoded according the the parameter that gets passed in the connection string.

con <- dbConnect(odbc::odbc(), dsn= 'Netezza', encoding = 'UTF-8')
dbGetQuery(
  con,
  "SELECT CAST('Hello!' AS VARCHAR(30)) AS SIMPLEFIELD, CAST('Hello!' AS NVARCHAR(30)) AS UNICODEFIELD"
)
  SIMPLEFIELD UNICODEFIELD
1      Hello!        效汬Ⅿ

When we change the encoding parameter to UTF-16, the first field that was correctly decoded earlier now has the same error as the second field. Note that the 6 bytes from the string gets decoded as 3 characters (which makes sense for UTF-16).

con <- dbConnect(odbc::odbc(), dsn= 'Netezza', encoding = 'UTF-16')
dbGetQuery(
  con,
  "SELECT CAST('Hello!' AS VARCHAR(30)) AS SIMPLEFIELD, CAST('Hello!' AS NVARCHAR(30)) AS UNICODEFIELD"
)
  SIMPLEFIELD UNICODEFIELD
1       效汬Ⅿ        效汬Ⅿ

Note that this is with the IBM ODBC drivers. I believe there are also Simba Netezza ODBC drivers. I wonder if the same error happens there. Unfortunately I have no access to those drivers myself.

@jimhester
Copy link
Contributor

In this case it means I need to be able to reproduce it myself, but I cannot because I do not currently have access to a netezza db

@stewartthomasj
Copy link

We had this same issue, wanted to share some insights on what we found. If one looks at the bytes, it is an endian problem (UTF-16LE vs UTF-16BE), i.e. the order of the bytes. We were able to duplicate the issue from R in Python. Python has an encode('utf-16le') function that can transpose the bytes and then it was readable data when printed. I'm not sure if it is the driver or odbc package that needs additional modification, but something needs to be able to take an option of UTF-16LE vs UTF-16BE, or something for endianness that would also work with UTF-32, etc.

@jimhester
Copy link
Contributor

jimhester commented Jun 8, 2018

The ODBC Specification seems to be that all SQLWCHAR data should be little endian

SQLWCHAR data must be UTF-16LE (Little Endian).

So this seems like a driver issue to me, I am not sure if there is a Netezza driver configuration setting that can change this or not.

You should be able to convert the data after retrieval by using iconv(x, "UTF-16BE", "UTF-8") or similar.

@bquinart
Copy link

Thanks for the update. I have a support ticket open with IBM.
I wasn't able to get the expected string back with iconv however.
Somehow the RODBC package does not have this problem with the same driver.

@jkarioki
Copy link

Can you try setting the locale as described in the following URL to see if that help? https://www.ibm.com/support/knowledgecenter/en/SSULQD_7.2.1/com.ibm.nz.dbu.doc/c_dbuser_specify_encoding.html

@bquinart
Copy link

No effect. I believe that relates to the nzsql tool and not the odbc connection.
As noted above, this only happens with the linux driver - not on windows (against same Netezza appliance). It was noted above that it is likely related to the UTF-16 endianness.

@bquinart
Copy link

I had another look at this. I do not believe it is an endian problem.
I can retrieve the original text with iconv(field, 'UTF-8','UTF-16LE').
Refer to my comment on 13/04. The data seems to be supplied as UTF-8 and is interpreted at some point as UTF-16LE. I understand the last part would be expected for type SQL_WVARCHAR.
Not sure how I could confirm type NVARCHAR is indeed mapped to SQL_WVARCHAR.
I wonder if this is working in RODBC, because they somehow still allow to decode SQL_WVARCHAR in other encodings like UTF-8.

@krlmlr krlmlr added the netezza label Jan 8, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug an unexpected problem or unintended behavior netezza reprex needs a minimal reproducible example
Projects
None yet
Development

No branches or pull requests

7 participants