Commit
Fix string encoding problems from the geoip library
Strings that come out of GeoIP are sometimes UTF-8, sometimes ISO-8859-1, sometimes falsely-labeled ASCII-8BIT. For example:

    [15] pry(#<LogStash::Event>)> (to_hash["geoip"].to_a + to_hash["pants"].to_a).collect { |k,v| p k => (v.encoding rescue v.class) }
    {"ip"=>#<Encoding:UTF-8>}
    {"country_code2"=>#<Encoding:UTF-8>}
    {"country_code3"=>#<Encoding:UTF-8>}
    {"country_name"=>#<Encoding:UTF-8>}
    {"continent_code"=>#<Encoding:UTF-8>}
    {"region_name"=>#<Encoding:ASCII-8BIT>}
    {"city_name"=>#<Encoding:ISO-8859-1>}
    {"timezone"=>#<Encoding:UTF-8>}
    {"real_region_name"=>#<Encoding:UTF-8>}
    {"number"=>#<Encoding:ASCII-8BIT>}
    {"asn"=>#<Encoding:ASCII-8BIT>}

In testing, I found that the strings with ASCII-8BIT encoding were actually mislabeled and were ISO-8859-1.

This patch converts any non-UTF-8 encoding, hopefully correctly, to UTF-8.

I also fixed a bug in the tests.

- Tests written in elastic#1054 now pass. Hurray!
- Fixes LOGSTASH-1354, LOGSTASH-1372, and LOGSTASH-1853
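For illustration only, the conversion described above might look roughly like the sketch below. This is not the code from the patch: the helper name `to_utf8` and the sample values are hypothetical, and it assumes (per the testing note above) that anything labeled ASCII-8BIT is really ISO-8859-1.

```ruby
# Hypothetical helper, not the actual patch: normalize one GeoIP value to UTF-8.
def to_utf8(value)
  # Non-strings (numbers, nil, ...) are passed through untouched.
  return value unless value.is_a?(String)
  return value if value.encoding == Encoding::UTF_8

  # Assumption: ASCII-8BIT strings are mislabeled ISO-8859-1 bytes.
  source = value.encoding == Encoding::ASCII_8BIT ? Encoding::ISO_8859_1 : value.encoding

  # dup first so the caller's string is not relabeled in place,
  # then transcode the bytes from the source encoding to UTF-8.
  value.dup.force_encoding(source).encode(Encoding::UTF_8)
end

# Sample values mimicking what the pry session shows (made-up data).
geoip = {
  "country_name" => "Brazil",                                               # already UTF-8
  "city_name"    => "S\xE3o Paulo".b.force_encoding(Encoding::ISO_8859_1),  # ISO-8859-1
  "region_name"  => "S\xE3o Paulo".b,                                       # ASCII-8BIT
  "number"       => 27
}

# Build a new hash with every value normalized.
normalized = geoip.each_with_object({}) { |(k, v), out| out[k] = to_utf8(v) }

normalized.each { |k, v| p k => (v.encoding rescue v.class) }
```

Every byte is a valid ISO-8859-1 character, so the ASCII-8BIT branch cannot raise; for other source encodings `String#encode` may raise on invalid or unmappable bytes, which a real implementation would need to decide how to handle.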