Spark: Better statistics estimation for Spark 2 Reader (apache#3134)

Follow-up to apache#3038. Estimate the scan size as (estimated row size) * (number of rows) instead of summing file sizes. When columns are pruned, the row size is estimated from the pruned schema.
Yanam · Sep 17, 2021 · ec2716e
1 parent f220f25
Showing 1 changed file with 1 addition and 2 deletions.
diff --git a/spark2/src/main/java/org/apache/iceberg/spark/source/Reader.java b/spark2/src/main/java/org/apache/iceberg/spark/source/Reader.java
@@ -306,16 +306,15 @@ public Statistics estimateStatistics() {
       return new Stats(SparkSchemaUtil.estimateSize(lazyType(), totalRecords), totalRecords);
     }
 
-    long sizeInBytes = 0L;
     long numRows = 0L;
 
     for (CombinedScanTask task : tasks()) {
       for (FileScanTask file : task.files()) {
-        sizeInBytes += file.length();
         numRows += file.file().recordCount();
       }
     }
 
+    long sizeInBytes = SparkSchemaUtil.estimateSize(lazyType(), numRows);
     return new Stats(sizeInBytes, numRows);
   }
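
To illustrate the estimate described in the commit message, here is a minimal standalone sketch of "row size * number of rows" with column pruning. It is not Iceberg's implementation: the class `SizeEstimateSketch`, the helpers `estimateRowSize` and `estimateSize`, the per-column byte widths, and the choice to saturate at `Long.MAX_VALUE` on overflow are all assumptions made for this example.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; none of these names come from Iceberg or Spark.
final class SizeEstimateSketch {
  private SizeEstimateSketch() {}

  // Sum the assumed byte width of each projected column to get a per-row size.
  static long estimateRowSize(Map<String, Integer> columnWidths, List<String> projected) {
    long rowSize = 0L;
    for (String column : projected) {
      Integer width = columnWidths.get(column);
      if (width == null) {
        throw new IllegalArgumentException("Unknown column: " + column);
      }
      rowSize += width;
    }
    return rowSize;
  }

  // Multiply row size by row count; saturate instead of overflowing (an assumption of this sketch).
  static long estimateSize(long rowSize, long numRows) {
    try {
      return Math.multiplyExact(rowSize, numRows);
    } catch (ArithmeticException e) {
      return Long.MAX_VALUE;
    }
  }

  public static void main(String[] args) {
    // Assumed per-column widths in bytes, chosen only for illustration.
    Map<String, Integer> widths = new LinkedHashMap<>();
    widths.put("id", 8);
    widths.put("name", 20);
    widths.put("payload", 100);

    long numRows = 1_000_000L;

    // Full schema: (8 + 20 + 100) bytes per row.
    long full = estimateSize(estimateRowSize(widths, Arrays.asList("id", "name", "payload")), numRows);
    // Pruned schema: only the "id" column is read.
    long pruned = estimateSize(estimateRowSize(widths, Arrays.asList("id")), numRows);

    System.out.println(full);   // prints 128000000
    System.out.println(pruned); // prints 8000000
  }
}
```

In this sketch the pruned estimate is 16 times smaller than the full-schema estimate for the same row count. A sum of file lengths would not change with the projection, because each file's length covers every column stored in it.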