Fix string encoding problems from the geoip library
Strings that come out of GeoIP are sometimes UTF-8, sometimes
ISO-8859-1, sometimes falsely-labeled ASCII-8BIT. For example:

    [15] pry(#<LogStash::Event>)> (to_hash["geoip"].to_a + to_hash["pants"].to_a).collect { |k,v| p k => (v.encoding rescue v.class) }
    {"ip"=>#<Encoding:UTF-8>}
    {"country_code2"=>#<Encoding:UTF-8>}
    {"country_code3"=>#<Encoding:UTF-8>}
    {"country_name"=>#<Encoding:UTF-8>}
    {"continent_code"=>#<Encoding:UTF-8>}
    {"region_name"=>#<Encoding:ASCII-8BIT>}
    {"city_name"=>#<Encoding:ISO-8859-1>}
    {"timezone"=>#<Encoding:UTF-8>}
    {"real_region_name"=>#<Encoding:UTF-8>}
    {"number"=>#<Encoding:ASCII-8BIT>}
    {"asn"=>#<Encoding:ASCII-8BIT>}

In testing, I found that the strings labeled ASCII-8BIT were
actually mislabeled: their bytes were really ISO-8859-1.
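
A small sketch of what such a mislabeled string looks like, assuming a made-up sample value (the bytes for "São Paulo" in ISO-8859-1, tagged as binary). It is only an illustration of the symptom, not output captured from GeoIP:

```ruby
# Hypothetical sample: "São Paulo" as ISO-8859-1 bytes, labeled binary (ASCII-8BIT).
raw = "S\xE3o Paulo".b

# The bytes are not valid UTF-8, so simply relabeling them as UTF-8 does not help.
valid_as_utf8 = raw.dup.force_encoding("UTF-8").valid_encoding?

# Transcoding straight from ASCII-8BIT fails, because Ruby has no mapping
# for high bytes in a binary string.
error = nil
begin
  raw.encode("UTF-8")
rescue Encoding::UndefinedConversionError => e
  error = e
end

# Relabeling the bytes as ISO-8859-1 first makes the conversion well defined.
fixed = raw.dup.force_encoding("ISO-8859-1").encode("UTF-8")
```

The `dup` calls are there because `force_encoding` changes the receiver's label in place.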

This patch converts any non-UTF-8 encoding, hopefully correctly, to
UTF-8. I also fixed a bug in the tests.
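
The same conversion can be sketched as a standalone method. The name `geoip_to_utf8` is hypothetical and not part of the patch; the treatment of ASCII-8BIT as ISO-8859-1 is the assumption described above, and unlike the patch this version copies the string before relabeling it:

```ruby
# Hypothetical helper (not in the patch): normalize one GeoIP value to UTF-8.
def geoip_to_utf8(value)
  # Non-string values (numbers, nil) pass through untouched.
  return value unless value.is_a?(String)

  case value.encoding
  when Encoding::ASCII_8BIT
    # Assumption: binary-labeled strings from GeoIP really hold ISO-8859-1 bytes.
    value.dup.force_encoding(Encoding::ISO_8859_1).encode(Encoding::UTF_8)
  when Encoding::ISO_8859_1
    value.encode(Encoding::UTF_8)
  else
    # Already UTF-8 (or some other encoding this sketch does not handle).
    value
  end
end

binary_city = "S\xE3o Paulo".b                               # mislabeled as binary
latin1_city = "Bras\xEDlia".b.force_encoding("ISO-8859-1")   # correctly labeled Latin-1
```

Both sample values are invented for illustration; passing either one to `geoip_to_utf8` yields a UTF-8 string, while a string that is already UTF-8 is returned as-is.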

- Tests written in elastic#1054 now pass. Hurray!
- Fixes LOGSTASH-1354, LOGSTASH-1372, and LOGSTASH-1853
jordansissel committed Feb 13, 2014
1 parent 7c53593 commit 75eca3c
Showing 2 changed files with 18 additions and 2 deletions.
12 changes: 10 additions & 2 deletions lib/logstash/filters/geoip.rb
@@ -140,9 +140,17 @@ def filter(event)
geo_data_hash.each do |key, value|
  next if value.nil? || (value.is_a?(String) && value.empty?)
  if @fields.nil? || @fields.empty? || @fields.include?(key.to_s)
    # no fields requested, so add all geoip hash items to
    # the event's fields.
    # convert key to string (normally a Symbol)
    if value.is_a?(String)
      # Some strings from GeoIP don't have the correct encoding...
      value = case value.encoding
        # I have found strings coming from GeoIP that are ASCII-8BIT are actually
        # ISO-8859-1...
        when Encoding::ASCII_8BIT; value.force_encoding("ISO-8859-1").encode("UTF-8")
        when Encoding::ISO_8859_1; value.encode("UTF-8")
        else; value
      end
    end
    event[@target][key.to_s] = value
  end
end # geo_data_hash.each
8 changes: 8 additions & 0 deletions spec/filters/geoip.rb
@@ -73,14 +73,22 @@
dma_code area_code timezone)

sample("ip" => "1.1.1.1") do
  checked = 0
  expected_fields.each do |f|
    next unless subject["geoip"][f]
    checked += 1
    insist { subject["geoip"][f].encoding } == Encoding::UTF_8
  end
  insist { checked } > 0
end
sample("ip" => "189.2.0.0") do
  checked = 0
  expected_fields.each do |f|
    next unless subject["geoip"][f]
    checked += 1
    insist { subject["geoip"][f].encoding } == Encoding::UTF_8
  end
  insist { checked } > 0
end

end
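
The `checked` counter in the samples above appears to guard against a test that passes without examining anything. A minimal sketch of that pattern in plain Ruby, without the spec helpers (`sample`, `insist`, `subject`); the `geoip` hash and the shortened field list are hypothetical stand-ins for a real lookup result:

```ruby
# Hypothetical stand-in for the event's "geoip" hash after filtering.
geoip = { "country_name" => "Brazil", "timezone" => "America/Sao_Paulo" }
# Shortened, illustrative field list; the real spec checks more fields.
expected_fields = %w(country_name city_name timezone)

checked = 0
expected_fields.each do |f|
  next unless geoip[f]   # a lookup may legitimately omit some fields
  checked += 1
  raise "#{f} is not UTF-8" unless geoip[f].encoding == Encoding::UTF_8
end

# Without this guard, a lookup that returned no fields at all would skip
# every iteration and the check would pass vacuously.
raise "no fields were checked" unless checked > 0
```

Here `city_name` is absent, so it is skipped, and the final guard still confirms that at least one field was actually inspected.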