[Parquet] Better heuristics to pick between RowSelection and Mask filter representation #8846

@alamb

Is your feature request related to a problem or challenge? Please describe what you are trying to do.
After the great work from @hhhizzz in #8733, we will (finally) have the ability to use a Bitmask filter representation when applying filters during Parquet decode.
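For readers unfamiliar with the two representations, here is a minimal sketch of the same filter result expressed as run-length selectors and as a boolean mask. The `Run` type and both helper functions are hypothetical stand-ins written for this illustration. They are not the parquet crate's `RowSelector` or the mask type from #8733.

```rust
/// Hypothetical stand-in for a run-length selector:
/// either skip or select the next `row_count` rows.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct Run {
    row_count: usize,
    skip: bool,
}

/// Expand run-length selectors into one boolean per row (true = keep the row).
fn runs_to_mask(runs: &[Run]) -> Vec<bool> {
    let mut mask = Vec::new();
    for run in runs {
        mask.extend(std::iter::repeat(!run.skip).take(run.row_count));
    }
    mask
}

/// Collapse a boolean mask back into run-length selectors,
/// merging adjacent rows that share the same keep/skip state.
fn mask_to_runs(mask: &[bool]) -> Vec<Run> {
    let mut runs: Vec<Run> = Vec::new();
    for &keep in mask {
        match runs.last_mut() {
            Some(last) if last.skip == !keep => last.row_count += 1,
            _ => runs.push(Run { row_count: 1, skip: !keep }),
        }
    }
    runs
}

fn main() {
    // Select 2 rows, skip 3 rows, select 1 row.
    let runs = [
        Run { row_count: 2, skip: false },
        Run { row_count: 3, skip: true },
        Run { row_count: 1, skip: false },
    ];
    let mask = runs_to_mask(&runs);
    println!("{:?}", mask);
    println!("{:?}", mask_to_runs(&mask));
}
```

The run-length form stays compact when the selected rows are contiguous, while the mask has a fixed cost of one flag per row regardless of how fragmented the selection is. That trade-off is what the choice of strategy has to weigh.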

At the moment, the code relies on a simple threshold strategy to pick between the two representations:

    RowSelectionPolicy::Auto { threshold, .. } => {
        let selection = match self.selection.as_ref() {
            Some(selection) => selection,
            None => return RowSelectionStrategy::Selectors,
        };
        let trimmed = selection.clone().trim();
        let selectors: Vec<RowSelector> = trimmed.into();
        if selectors.is_empty() {
            return RowSelectionStrategy::Mask;
        }
        let total_rows: usize = selectors.iter().map(|s| s.row_count).sum();
        let selector_count = selectors.len();
        if selector_count == 0 {
            return RowSelectionStrategy::Mask;
        }
        if total_rows < selector_count.saturating_mul(threshold) {
            RowSelectionStrategy::Mask
        } else {
            RowSelectionStrategy::Selectors
        }
    }
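The comparison `total_rows < selector_count * threshold` amounts to asking whether the average selector length is below `threshold`. The following self-contained sketch reproduces only that comparison so the behavior can be tried on small inputs. The `Strategy` enum, the `pick_strategy` function, and the threshold value of 32 are assumptions made for illustration and are not taken from the parquet crate.

```rust
/// Hypothetical stand-in for the strategy enum used in the snippet above.
#[derive(Debug, PartialEq, Eq)]
enum Strategy {
    Mask,
    Selectors,
}

/// `run_lengths` holds the row count of every selector (skips and selects alike).
/// Picks `Mask` when the average run length is below `threshold`.
fn pick_strategy(run_lengths: &[usize], threshold: usize) -> Strategy {
    if run_lengths.is_empty() {
        return Strategy::Mask;
    }
    let total_rows: usize = run_lengths.iter().sum();
    // Same comparison as above, written without a division.
    if total_rows < run_lengths.len().saturating_mul(threshold) {
        Strategy::Mask
    } else {
        Strategy::Selectors
    }
}

fn main() {
    // Arbitrary threshold chosen for this example only.
    let threshold = 32;

    // Eight runs of 2 rows each: average length 2, so the mask is chosen.
    let fragmented = [2usize; 8];
    // Two runs of 1000 rows each: average length 1000, so selectors are chosen.
    let contiguous = [1000usize, 1000];

    println!("{:?}", pick_strategy(&fragmented, threshold));
    println!("{:?}", pick_strategy(&contiguous, threshold));
}
```

Note that the rule looks only at run lengths. It does not consider column type, column count, or value width, which is the gap this issue is about.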

However, as @hhhizzz mentions in #8733 (comment):

Yes, my charts indicate that there are many rules for setting the RowSelectionStrategy, like the column type, column count, string length, and their combinations... We can create tickets and collaborate on improving these over time.

Describe the solution you'd like
I would like better heuristics for selecting between the strategies.

Describe alternatives you've considered
@hhhizzz has some good suggestions, and the charts from #8733 (comment) offer some good ideas:

As for how I arrived at the average length at which to use the mask, here are some statistics. You can check out https://github.com/hhhizzz/arrow-rs/tree/rowselectionempty-charts and run python3 dev/row_selection_analysis.py on your local machine. These are the results on my x86 PC:

One int32 column, different distribution types:

[Charts: scenario-dense80-dense80, scenario-sparse20-sparse20, scenario-spread50-spread50, scenario-uniform50-uniform50]

Different column types:

[Charts: dtype-int32-uniform50, dtype-utf8view-uniform50, dtype-float64-uniform50]

Different column counts:

[Charts: columns-C02-uniform50, columns-C04-uniform50, columns-C08-uniform50, columns-C16-uniform50, columns-C32-uniform50]

Additional context

Labels: enhancement