[SPARK-29286][PYTHON][TESTS] Uses UTF-8 with 'replace' on errors at Python testing script

### What changes were proposed in this pull request?

This PR proposes to let Python 2 use UTF-8, instead of ASCII, in the Python testing script, permissively replacing bytes that are not valid UTF-8 with the Unicode replacement character.

### Why are the changes needed?

When Python 2 runs the Python testing script, test output is decoded with `decode(encoding='ascii')`, which fails whenever non-ASCII characters are printed.

### Does this PR introduce any user-facing change?

For developers, it enables printing non-ASCII characters.

### How was this patch tested?

Jenkins will test it against our existing test code. It was also manually tested with UTF-8 output.

Closes apache#26021 from HyukjinKwon/SPARK-29286.

Authored-by: HyukjinKwon <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
sluk3r · Oct 4, 2019 · 20ee2f5
1 parent eecef75
commit 20ee2f5
Showing 1 changed file with 2 additions and 2 deletions.
diff --git a/python/run-tests.py b/python/run-tests.py
@@ -117,7 +117,7 @@ def run_individual_python_test(target_dir, test_name, pyspark_python):
                     log_file.writelines(per_test_output)
                 per_test_output.seek(0)
                 for line in per_test_output:
-                    decoded_line = line.decode()
+                    decoded_line = line.decode("utf-8", "replace")
                     if not re.match('[0-9]+', decoded_line):
                         print(decoded_line, end='')
                 per_test_output.close()
@@ -134,7 +134,7 @@ def run_individual_python_test(target_dir, test_name, pyspark_python):
             per_test_output.seek(0)
             # Here expects skipped test output from unittest when verbosity level is
             # 2 (or --verbose option is enabled).
-            decoded_lines = map(lambda line: line.decode(), iter(per_test_output))
+            decoded_lines = map(lambda line: line.decode("utf-8", "replace"), iter(per_test_output))
             skipped_tests = list(filter(
                 lambda line: re.search(r'test_.* \(pyspark\..*\) ... (skip|SKIP)', line),
                 decoded_lines))
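
The effect of the two changed lines can be reproduced outside of `run-tests.py`. The following is a minimal sketch, not part of the patch: the sample bytes, the test names inside them, and the `raw_output` variable are made up for illustration, and an explicit `"ascii"` codec stands in for the Python 2 default of a bare `line.decode()`.

```python
import io
import re

# Hypothetical captured test output (bytes), standing in for per_test_output.
# It contains valid UTF-8 ("caf\xc3\xa9") and one byte that is not valid UTF-8 (\xff).
raw_output = io.BytesIO(
    b"test_ok (pyspark.tests.test_demo.DemoTests) ... ok\n"
    b"test_caf\xc3\xa9 (pyspark.tests.test_demo.DemoTests) ... skipped 'no caf\xc3\xa9'\n"
    b"bad byte: \xff\n"
)

# Old behaviour under Python 2: a bare decode() uses ASCII and raises
# on the first non-ASCII byte.
ascii_failed = False
try:
    for line in raw_output:
        line.decode("ascii")
except UnicodeDecodeError:
    ascii_failed = True

# New behaviour: decode as UTF-8 and replace undecodable bytes with U+FFFD
# instead of raising.
raw_output.seek(0)
decoded_lines = [line.decode("utf-8", "replace") for line in raw_output]

# Same filter as in the diff: keep only lines reporting skipped tests.
skipped_tests = list(filter(
    lambda line: re.search(r'test_.* \(pyspark\..*\) ... (skip|SKIP)', line),
    decoded_lines))

print("ASCII decoding failed:", ascii_failed)
print("Skipped tests found:", len(skipped_tests))
```

With `"replace"`, a stray invalid byte degrades to a single replacement character in the printed line rather than aborting the whole report, which seems to be the trade-off the patch accepts for a testing script.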