C++ API. When the row I seek to is in the same row-group as the current row, why don't use the skip function directly, but instead seek to the row-group again and then skip? #2084

hrbeuyz24 · 2024-12-12T07:55:24Z

We use ORC as the storage format for our real-time data warehouse. Our online queries perform many random reads and frequent seeks, and we found that a lot of time is spent in SeekToRowGroup and Skip.
In many cases the target rows of consecutive seeks fall in the same row group, which leads to the question in the title.
For example, suppose an online query needs to read row 100 and row 130.
The current behavior is:

SeekToRowGroup
Skip(100)
Next(1)
SeekToRowGroup
Skip(130)
Next(1)

Why not

SeekToRowGroup
Skip(100)
Next(1)
Skip(29)
Next(1)
We made a simple modification to the code and found that, in our scenario, it improves read performance by at least 50%.
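To make the difference between the two call sequences concrete, here is a minimal toy model that only counts work. `ToyReader`, `readWithReseek` and `readWithForwardSkip` are hypothetical names invented for this sketch and are not part of the ORC C++ API. It assumes all requested rows lie in one row group that starts at row 0, and that they are requested in strictly ascending order.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Toy stand-in for a row reader (NOT the ORC API). It only counts work done.
struct ToyReader {
  uint64_t rowGroupSeeks = 0;  // times we repositioned to the row-group start
  uint64_t rowsSkipped = 0;    // rows decoded and discarded by skip()
  uint64_t position = 0;       // next row to return, relative to the row group

  void seekToRowGroup() { ++rowGroupSeeks; position = 0; }
  void skip(uint64_t n) { rowsSkipped += n; position += n; }
  void next(uint64_t n) { position += n; }
};

// Behavior described in the issue: every lookup re-seeks to the row-group
// start and then skips all the way to the target row.
ToyReader readWithReseek(const std::vector<uint64_t>& ascendingRows) {
  ToyReader r;
  for (uint64_t row : ascendingRows) {
    r.seekToRowGroup();
    r.skip(row);
    r.next(1);
  }
  return r;
}

// Proposed behavior: seek once, then skip only the gap to the next target.
ToyReader readWithForwardSkip(const std::vector<uint64_t>& ascendingRows) {
  ToyReader r;
  r.seekToRowGroup();
  for (uint64_t row : ascendingRows) {
    assert(row >= r.position);  // targets must be strictly ascending
    r.skip(row - r.position);
    r.next(1);
  }
  return r;
}
```

For rows 100 and 130, the first function performs two row-group seeks and skips 230 rows in total, while the second performs one seek and skips 129 rows (100, then 29). The model says nothing about real decoding cost, which depends on column encodings and compression.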

wgtmac · 2024-12-12T07:59:29Z

Could you provide more context? How did you tell the C++ reader to read only row 100 and 130 out of an ORC file?

hrbeuyz24 · 2024-12-12T08:05:19Z

> Could you provide more context? How did you tell the C++ reader to read only row 100 and 130 out of an ORC file?

Using an index we built ourselves, we know that a given query needs rows 100 and 130, so we call SeekToRow(100) and Next(1), then SeekToRow(130) and Next(1).

ffacs · 2024-12-12T08:14:06Z

An error will occur if a user attempts to read line 129 after reading line 130 without performing SeekToRowGroup again.

hrbeuyz24 · 2024-12-12T08:23:09Z

> An error will occur if a user attempts to read line 129 after reading line 130 without performing SeekToRowGroup again.

We will ensure that we seek and read in order and never read backwards. But I think how the reader is used externally should not matter to the internal implementation of ORC.
When the reader finds that the target row is in the same row group as the current row and the target row is larger than the current row, why not call the skip function directly? Of course, when the target row is smaller than the current row, the reader has to SeekToRowGroup again.

hrbeuyz24 · 2024-12-12T09:16:56Z

Something like this in the SeekToRow function:

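// Row offset of the target within its stripe.
auto targetRowInStripe = rowNumber - firstRowOfStripe_[seekToStripe];
// Fast path: same initialized stripe, target ahead of the current row, same row group.
if (currentStripe_ == seekToStripe &&
    isCurrentStripeInited() &&
    currentRowInStripe_ < targetRowInStripe &&
    currentRowInStripe_ / rowIndexStride == targetRowInStripe / rowIndexStride) {
  // Skip only the gap instead of re-seeking to the row-group start.
  reader_->skip(targetRowInStripe - currentRowInStripe_);
  currentRowInStripe_ = targetRowInStripe;
  return;
}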
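The condition in the snippet above can also be pulled out into a standalone predicate that is easy to unit test. This is only a sketch: `SeekState` and `canSkipForward` are hypothetical names, not ORC code, and the extra guard for a row index stride of 0 is an assumption added here to avoid dividing by zero when no row groups are defined.

```cpp
#include <cstdint>

// Hypothetical snapshot of the reader state that the check needs.
struct SeekState {
  bool stripeInitialized;       // the current stripe has been opened for reading
  uint64_t currentStripe;       // index of the stripe being read
  uint64_t currentRowInStripe;  // next row the reader would return
};

// Returns true when the target can be reached by skipping forward from the
// current position, i.e. without re-seeking to the start of the row group.
bool canSkipForward(const SeekState& s,
                    uint64_t targetStripe,
                    uint64_t targetRowInStripe,
                    uint64_t rowIndexStride) {
  if (rowIndexStride == 0) {
    return false;  // assumption: stride 0 means no row groups, so fall back
  }
  if (!s.stripeInitialized || s.currentStripe != targetStripe) {
    return false;  // different or unopened stripe: a full seek is required
  }
  if (targetRowInStripe <= s.currentRowInStripe) {
    return false;  // target is not strictly ahead: cannot be reached by skipping
  }
  // Both rows must fall into the same row group.
  return s.currentRowInStripe / rowIndexStride ==
         targetRowInStripe / rowIndexStride;
}
```

When the predicate returns true, the skip distance is `targetRowInStripe - s.currentRowInStripe`. After reading row 100 the reader sits at row 101, so seeking to row 130 needs a skip of 29 rows, which matches the sequence in the original report.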

hrbeuyz24 · 2024-12-12T09:23:22Z

This pattern may be rare in offline batch processing.
However, it can be common when ORC is used as the storage format of a real-time data warehouse that serves online queries: the warehouse uses its index to locate a row in the ORC file and then reads it.

dongjoon-hyun · 2024-12-12T18:38:55Z

Thank you for reporting. Could you provide a real sample file to speed up our discussion, @hrbeuyz24 ?

dongjoon-hyun · 2024-12-12T18:40:20Z

In addition, IIUC, it's an improvement idea which you want to propose, right? In that case, please make a PR to us. We are software engineers. :)
