Description
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
After the great work from @hhhizzz in #8733, we will (finally) have the ability to use a Bitmask filter representation when applying filters during Parquet decode.
At the moment, the code relies on a simple threshold strategy to pick between the two representations:
arrow-rs/parquet/src/arrow/arrow_reader/read_plan.rs
Lines 107 to 130 in 911331a
    RowSelectionPolicy::Auto { threshold, .. } => {
        let selection = match self.selection.as_ref() {
            Some(selection) => selection,
            None => return RowSelectionStrategy::Selectors,
        };
        let trimmed = selection.clone().trim();
        let selectors: Vec<RowSelector> = trimmed.into();
        if selectors.is_empty() {
            return RowSelectionStrategy::Mask;
        }
        let total_rows: usize = selectors.iter().map(|s| s.row_count).sum();
        let selector_count = selectors.len();
        if selector_count == 0 {
            return RowSelectionStrategy::Mask;
        }
        if total_rows < selector_count.saturating_mul(threshold) {
            RowSelectionStrategy::Mask
        } else {
            RowSelectionStrategy::Selectors
        }
    }
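To make the comparison in that excerpt concrete, here is a minimal standalone sketch of the same rule: the mask is chosen when the average run length (`total_rows / selector_count`) is below the threshold. `Run`, `Strategy`, and `choose_strategy` are hypothetical stand-ins written for this illustration, not types from the parquet crate, and the sketch leaves out the `trim()` step and the "no selection" case from the real code.

```rust
/// Hypothetical stand-in for the reader's `RowSelector`: a run of consecutive
/// rows that is either skipped or selected.
#[derive(Debug, Clone, Copy)]
pub struct Run {
    pub row_count: usize,
    pub skip: bool,
}

/// Hypothetical stand-in for `RowSelectionStrategy`.
#[derive(Debug, PartialEq, Eq)]
pub enum Strategy {
    Mask,
    Selectors,
}

/// Mirrors the comparison in the excerpt: pick the mask when the average run
/// length is strictly below `threshold`, otherwise keep the selectors.
pub fn choose_strategy(runs: &[Run], threshold: usize) -> Strategy {
    if runs.is_empty() {
        return Strategy::Mask;
    }
    let total_rows: usize = runs.iter().map(|r| r.row_count).sum();
    // Comparing against `len * threshold` instead of dividing avoids
    // integer-division rounding; `saturating_mul` guards against overflow.
    if total_rows < runs.len().saturating_mul(threshold) {
        Strategy::Mask
    } else {
        Strategy::Selectors
    }
}
```

With this rule, a selection made of many short runs (for example alternating 2-row select and skip runs) goes to the mask, while a selection made of a few very long runs stays with selectors. The rule looks only at run lengths, which is why it cannot account for column type, column count, or string length.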
However, as @hhhizzz mentions in #8733 (comment):
Yes, my charts indicate that there are many rules for setting the RowSelectionStrategy, like the column type, column count, string length, and their combinations... We can create tickets and collaborate on improving these over time.
Describe the solution you'd like
I would like better heuristics for selecting between the strategies.
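As a rough illustration of the shape such a heuristic could take (not a proposal from #8733 and not existing parquet crate API), the sketch below makes the threshold depend on the kinds of columns being decoded. Everything in it is an assumption: the `ColumnKind` split, the `Thresholds` struct, and the choice to combine columns by taking the smallest threshold. The threshold values are left as parameters so they can come from benchmark results rather than being hard-coded.

```rust
/// Hypothetical column classification; the real reader has many more types.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ColumnKind {
    FixedWidth,
    VariableLength,
}

/// Hypothetical stand-in for `RowSelectionStrategy`.
#[derive(Debug, PartialEq, Eq)]
pub enum Strategy {
    Mask,
    Selectors,
}

/// Hypothetical per-kind thresholds, supplied by the caller.
#[derive(Debug, Clone, Copy)]
pub struct Thresholds {
    pub fixed_width: usize,
    pub variable_length: usize,
}

/// Picks a strategy from the run statistics and the projected column kinds.
pub fn choose_strategy(
    total_rows: usize,
    run_count: usize,
    columns: &[ColumnKind],
    thresholds: Thresholds,
) -> Strategy {
    if run_count == 0 {
        return Strategy::Mask;
    }
    // Assumption: combine columns by taking the smallest applicable
    // threshold. This is one arbitrary policy among many possible ones.
    let threshold = columns
        .iter()
        .map(|kind| match kind {
            ColumnKind::FixedWidth => thresholds.fixed_width,
            ColumnKind::VariableLength => thresholds.variable_length,
        })
        .min();
    match threshold {
        Some(t) if total_rows < run_count.saturating_mul(t) => Strategy::Mask,
        _ => Strategy::Selectors,
    }
}
```

Under this sketch, the same selection can resolve to different strategies depending on which columns are projected, which is the kind of behavior the charts in #8733 suggest a single global threshold cannot express.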
Describe alternatives you've considered
@hhhizzz has some good suggestions, and the charts from #8733 (comment) offer some good ideas:
For how I arrived at the average length at which to use the mask, here are some statistics. You can check out https://github.com/hhhizzz/arrow-rs/tree/rowselectionempty-charts and run `python3 dev/row_selection_analysis.py` on your local machine. These are the results on my x86 PC:
One column
int32, different distribution types: (4 charts; images not preserved)
Different column types: (3 charts; images not preserved)
Different column counts: (5 charts; images not preserved)
Additional context