Make UTF-8 the default encoding for XML feeds
Consider the feed http://planet.haskell.org/atom.xml
- This is a UTF-8 encoded XML file
- No encoding declaration in the XML header
- No Unicode byte order mark
- Served with HTTP Content-Type "text/xml" (no charset parameter)

Miniflux lets charset.NewReader handle this. The charset package
implements the HTML5 character encoding algorithm, which, in this
situation, defaults to windows-1252 encoding if there are no non-ASCII
UTF-8 characters in the first 1024 bytes. So for this feed, we get the
wrong encoding.
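
To illustrate what "the wrong encoding" looks like in practice, here is a minimal sketch (not code from Miniflux) that decodes UTF-8 bytes one byte per character. The helper name decodeAsLatin1 is hypothetical, and ISO-8859-1 is used as a stand-in for windows-1252; the two agree for bytes 0xA0-0xFF, which covers the bytes in this example.

```go
package main

import "fmt"

// decodeAsLatin1 maps each byte to the code point with the same value.
// Assumption: ISO-8859-1 stands in for windows-1252 here; both encodings
// assign the same characters to bytes 0xA0-0xFF.
func decodeAsLatin1(b []byte) string {
	runes := make([]rune, len(b))
	for i, c := range b {
		runes[i] = rune(c)
	}
	return string(runes)
}

func main() {
	// In UTF-8, "é" is the two-byte sequence 0xC3 0xA9.
	utf8Bytes := []byte("café")

	// Decoding those bytes one at a time turns "é" into two characters.
	fmt.Println(decodeAsLatin1(utf8Bytes)) // prints "cafÃ©"
}
```

Any non-ASCII character that appears after the sampled start of the feed would be garbled this way once the fallback encoding is chosen.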

I inserted an explicit "utf8.Valid()" check, which fixes this problem.
pdewacht authored and fguillot committed Jan 3, 2019
1 parent 31e2669 commit 15505ee
Showing 1 changed file with 7 additions and 0 deletions.
7 changes: 7 additions & 0 deletions http/client/response.go
@@ -11,6 +11,7 @@ import (
 	"mime"
 	"regexp"
 	"strings"
+	"unicode/utf8"

 	"golang.org/x/net/html/charset"
 )
@@ -97,6 +98,12 @@ func (r *Response) EnsureUnicodeBody() (err error) {
 			if xmlEncodingRegex.Match(buffer[0:length]) {
 				return
 			}
+
+			// If no encoding is specified in the XML prolog and
+			// the document is valid UTF-8, nothing needs to be done.
+			if utf8.Valid(buffer) {
+				return
+			}
 		}
 	}
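
The effect of the added check can be sketched in isolation. This is an illustrative example, not part of the commit: skipTranscoding is a hypothetical helper that wraps the same utf8.Valid call, and the sample bodies are made up. It shows that validating the whole buffer catches a non-ASCII character that appears late in the body, that pure ASCII also counts as valid UTF-8, and that a lone windows-1252 byte does not.

```go
package main

import (
	"bytes"
	"fmt"
	"unicode/utf8"
)

// skipTranscoding is a hypothetical helper mirroring the new check:
// a body that is already valid UTF-8 needs no charset conversion.
func skipTranscoding(body []byte) bool {
	return utf8.Valid(body)
}

func main() {
	// 2000 ASCII bytes followed by one non-ASCII character, so a detector
	// that only samples the start of the body never sees the "é".
	late := append(bytes.Repeat([]byte("a"), 2000), []byte("é")...)
	fmt.Println(skipTranscoding(late)) // true

	// Pure ASCII is a subset of UTF-8, so it is also left untouched.
	fmt.Println(skipTranscoding([]byte("plain ASCII"))) // true

	// 0xE9 is "é" in windows-1252 but an incomplete sequence in UTF-8,
	// so this body would still fall through to charset detection.
	fmt.Println(skipTranscoding([]byte{'c', 'a', 'f', 0xE9})) // false
}
```

Because the check runs over the full buffer, its cost grows with the size of the feed; the diff above accepts that in exchange for a reliable answer.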
