[SPARK-2871] [PySpark] add zipWithIndex() and zipWithUniqueId()
RDD.zipWithIndex()

        Zips this RDD with its element indices.

        The ordering is first based on the partition index and then the
        ordering of items within each partition. So the first item in
        the first partition gets index 0, and the last item in the last
        partition receives the largest index.

        This method triggers a Spark job when this RDD contains
        more than one partition.

        >>> sc.parallelize(range(4), 2).zipWithIndex().collect()
        [(0, 0), (1, 1), (2, 2), (3, 3)]
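
        A quick illustration of the ordering (a sketch assuming a live
        SparkContext bound to sc, as in the doctest; the contents and
        partitioning are made up for the example):

        >>> rdd = sc.parallelize(["a", "b", "c", "d", "e"], 3)
        >>> rdd.glom().collect()
        [['a'], ['b', 'c'], ['d', 'e']]
        >>> rdd.zipWithIndex().collect()
        [('a', 0), ('b', 1), ('c', 2), ('d', 3), ('e', 4)]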

RDD.zipWithUniqueId()

        Zips this RDD with generated unique Long ids.

        Items in the kth partition will get ids k, n+k, 2*n+k, ..., where
        n is the number of partitions. So there may be gaps in the ids,
        but this method won't trigger a Spark job, unlike
        L{zipWithIndex}.

        >>> sc.parallelize(range(4), 2).zipWithUniqueId().collect()
        [(0, 0), (1, 2), (2, 1), (3, 3)]
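
        Tracing the formula with n=3 partitions holding [a], [b, c] and
        [d, e] (same assumed sc; note that id 3 never appears because
        partition 0 holds a single item):

        >>> rdd = sc.parallelize(["a", "b", "c", "d", "e"], 3)
        >>> rdd.zipWithUniqueId().collect()
        [('a', 0), ('b', 1), ('c', 4), ('d', 2), ('e', 5)]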

Author: Davies Liu <[email protected]>

Closes apache#2092 from davies/zipWith and squashes the following commits:

cebe5bf [Davies Liu] improve test cases, reverse the order of index
0d2a128 [Davies Liu] add zipWithIndex() and zipWithUniqueId()
davies authored and JoshRosen committed Aug 25, 2014
1 parent b1b2030 commit fb0db77
Showing 1 changed file with 47 additions and 0 deletions.

python/pyspark/rdd.py
@@ -1741,6 +1741,53 @@ def batch_as(rdd, batchSize):
                                        other._jrdd_deserializer)
        return RDD(pairRDD, self.ctx, deserializer)

    def zipWithIndex(self):
        """
        Zips this RDD with its element indices.

        The ordering is first based on the partition index and then the
        ordering of items within each partition. So the first item in
        the first partition gets index 0, and the last item in the last
        partition receives the largest index.

        This method triggers a Spark job when this RDD contains
        more than one partition.

        >>> sc.parallelize(["a", "b", "c", "d"], 3).zipWithIndex().collect()
        [('a', 0), ('b', 1), ('c', 2), ('d', 3)]
        """
        starts = [0]
        if self.getNumPartitions() > 1:
            # Count the items in each partition (this is the Spark job),
            # then turn the counts into per-partition starting offsets
            # via a running prefix sum.
            nums = self.mapPartitions(lambda it: [sum(1 for i in it)]).collect()
            for i in range(len(nums) - 1):
                starts.append(starts[-1] + nums[i])

        def func(k, it):
            # Number partition k's items consecutively from its offset.
            for i, v in enumerate(it, starts[k]):
                yield v, i

        return self.mapPartitionsWithIndex(func)

    def zipWithUniqueId(self):
        """
        Zips this RDD with generated unique Long ids.

        Items in the kth partition will get ids k, n+k, 2*n+k, ..., where
        n is the number of partitions. So there may be gaps in the ids,
        but this method won't trigger a Spark job, unlike
        L{zipWithIndex}.

        >>> sc.parallelize(["a", "b", "c", "d", "e"], 3).zipWithUniqueId().collect()
        [('a', 0), ('b', 1), ('c', 4), ('d', 2), ('e', 5)]
        """
        n = self.getNumPartitions()

        def func(k, it):
            # The i-th item of partition k gets id i * n + k: unique
            # across partitions without counting anything up front.
            for i, v in enumerate(it):
                yield v, i * n + k

        return self.mapPartitionsWithIndex(func)

    def name(self):
        """
        Return the name of this RDD.
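
A usage sketch contrasting the two methods (assumes a live SparkContext
bound to sc; nothing below is part of the patch itself):

    # Dense indices 0..N-1: requires a job up front to count the items
    # in each partition before any index can be assigned.
    indexed = sc.parallelize(range(10), 4).zipWithIndex()

    # Ids derived from (position, partition) alone: no job runs until
    # an action is called, at the cost of gaps in the id space.
    tagged = sc.parallelize(range(10), 4).zipWithUniqueId()

    print(indexed.collect())
    print(tagged.collect())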
