[SPARK-16302][SQL] Set the right number of partitions for reading data from a local collection.

Follow-up to apache#13137. This PR sets the right number of partitions when reading data from a local collection. A query such as `val df = Seq((1, 2)).toDF("key", "value").count` always used defaultParallelism tasks, which produced many empty or small tasks.
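As a sketch of the arithmetic behind the fix (a hypothetical standalone snippet, not part of the patch), the new partition count clamps the number of rows to the range [1, defaultParallelism]:

```scala
// Hypothetical illustration of the clamping formula used by the patch.
object NumParallelismSketch {
  // Clamp the partition count to [1, defaultParallelism].
  def numParallelism(numRows: Int, defaultParallelism: Int): Int =
    math.min(math.max(numRows, 1), defaultParallelism)

  def main(args: Array[String]): Unit = {
    // A one-row local collection with defaultParallelism = 8:
    // before the patch, parallelize() used 8 partitions (7 of them empty);
    // after the patch, it uses min(max(1, 1), 8) = 1 partition.
    println(numParallelism(numRows = 1, defaultParallelism = 8))    // 1
    // A large local collection is still capped at defaultParallelism.
    println(numParallelism(numRows = 1000, defaultParallelism = 8)) // 8
    // An empty collection still gets a single (empty) partition.
    println(numParallelism(numRows = 0, defaultParallelism = 8))    // 1
  }
}
```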

Manually tested and checked.

Author: Lianhui Wang <[email protected]>

Closes apache#13979 from lianhuiwang/localTable-Parallel.
lianhuiwang authored and JoshRosen committed Sep 2, 2016
1 parent 5bea875 commit 06e3398
Showing 1 changed file with 4 additions and 1 deletion.
@@ -42,7 +42,10 @@ case class LocalTableScanExec(
     }
   }
 
-  private lazy val rdd = sqlContext.sparkContext.parallelize(unsafeRows)
+  private lazy val numParallelism: Int = math.min(math.max(unsafeRows.length, 1),
+    sqlContext.sparkContext.defaultParallelism)
+
+  private lazy val rdd = sqlContext.sparkContext.parallelize(unsafeRows, numParallelism)
 
   protected override def doExecute(): RDD[InternalRow] = {
     val numOutputRows = longMetric("numOutputRows")
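To observe the effect, a quick check in spark-shell could look like the following (a hypothetical verification sketch, not from the PR; it assumes a Spark 2.0-era shell where `spark` and `toDF` are in scope, and that the RDD returned by `df.rdd` reflects the scan's partitioning):

```scala
import spark.implicits._

val df = Seq((1, 2)).toDF("key", "value")

// Before this patch, the local-table scan's RDD had defaultParallelism
// partitions; with the patch, a one-row local collection yields a single
// partition, so count() no longer launches many empty tasks.
println(df.rdd.getNumPartitions)
df.count()
```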
