[Parquet] Better heuristics to pick between RowSelection and Mask filter representation #8846

**Is your feature request related to a problem or challenge? Please describe what you are trying to do.**
After the great work from @hhhizzz in https://github.com/apache/arrow-rs/pull/8733, we will (finally) have the ability to use a Bitmask filter representation when applying filters *during* Parquet decode.

At the moment, the code relies on a simple threshold strategy to pick between the two representations:

https://github.com/apache/arrow-rs/blob/911331aafa13f5e230440cf5d02feb245985c64e/parquet/src/arrow/arrow_reader/read_plan.rs#L107-L130

However, as @hhhizzz mentions in https://github.com/apache/arrow-rs/pull/8733#discussion_r2506343981, a single threshold does not capture everything that matters:

> Yes, my charts indicate that there are many rules for setting the RowSelectionStrategy, like the column type, column count, string length, and their combinations... We can create tickets and collaborate on improving these over time.

**Describe the solution you'd like**
I would like better heuristics for selecting between the strategies.
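
One possible shape for this, purely as a sketch: keep the existing comparison but let the threshold depend on the properties mentioned above (column type, column count, string length). Everything here (`ColumnKind`, `SelectionStats`, `prefer_mask`) is a hypothetical name invented for illustration, not an existing parquet API, and no particular threshold values are implied; those would have to come from benchmarks.

```rust
/// Hypothetical column categories; a real implementation would derive
/// these from the Parquet schema.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ColumnKind {
    FixedWidth,
    VariableWidth,
}

/// Hypothetical inputs a richer heuristic might look at.
#[derive(Debug, Clone, Copy)]
pub struct SelectionStats {
    /// Total rows covered by the (trimmed) selection.
    pub total_rows: usize,
    /// Number of select/skip runs in the selection.
    pub selector_count: usize,
    /// Number of projected columns being decoded.
    pub column_count: usize,
    /// Broad category of the column(s) being decoded.
    pub kind: ColumnKind,
    /// Average value length in bytes, if known (e.g. for strings).
    pub avg_value_len: Option<usize>,
}

/// Returns true when a mask looks preferable. The threshold is computed
/// by a caller-supplied function, so different tuning rules can be
/// swapped in without changing the comparison itself.
pub fn prefer_mask(
    stats: &SelectionStats,
    threshold_for: impl Fn(&SelectionStats) -> usize,
) -> bool {
    if stats.selector_count == 0 {
        // Same fallback as the current code for an empty selection.
        return true;
    }
    stats.total_rows < stats.selector_count.saturating_mul(threshold_for(stats))
}
```

Passing `|_| threshold` as `threshold_for` reproduces the current single-threshold behavior, so a rule based on `kind`, `column_count`, or `avg_value_len` could be introduced incrementally.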

**Describe alternatives you've considered**
@hhhizzz has some good suggestions, and the charts from https://github.com/apache/arrow-rs/pull/8733#issuecomment-3468441165 offer some good ideas:

> For how I get the the average length to use the mask, here's some statistic, you can checkout to (https://github.com/hhhizzz/arrow-rs/tree/rowselectionempty-charts) and run `python3  dev/row_selection_analysis.py` on your local machine, this is the results on my x86 PC:
> # One column `int32`, different distribution type:
> <img width="768" height="576" alt="scenario-dense80-dense80" src="https://github.com/user-attachments/assets/30582512-ba85-43d7-9c20-02632020a08b" />
> <img width="768" height="576" alt="scenario-sparse20-sparse20" src="https://github.com/user-attachments/assets/dcbcf7ff-2a4a-4039-bcf4-3d821f8d39a1" />
> <img width="768" height="576" alt="scenario-spread50-spread50" src="https://github.com/user-attachments/assets/ff585f4e-959a-40cf-a580-911720c5fef8" />
> <img width="768" height="576" alt="scenario-uniform50-uniform50" src="https://github.com/user-attachments/assets/a581877a-15e4-445c-9b04-f4b9d25a4749" />
> 
> # Different column type:
> <img width="768" height="576" alt="dtype-int32-uniform50" src="https://github.com/user-attachments/assets/732befae-e6d2-4d01-9c88-f7d1e0a31269" />
> <img width="768" height="576" alt="dtype-utf8view-uniform50" src="https://github.com/user-attachments/assets/d097c6e2-3099-4001-8ec7-3e6b4ec9be79" />
> <img width="768" height="576" alt="dtype-float64-uniform50" src="https://github.com/user-attachments/assets/ca738a33-750a-472e-ae65-627a4e8387ef" />
>
> # Different column counts:
> <img width="768" height="576" alt="columns-C02-uniform50" src="https://github.com/user-attachments/assets/d0230ed3-a80e-470c-b9f8-72d012486430" />
> <img width="768" height="576" alt="columns-C04-uniform50" src="https://github.com/user-attachments/assets/7f5a252d-5ce3-44b8-a7af-c98bc9426dd5" />
> <img width="768" height="576" alt="columns-C08-uniform50" src="https://github.com/user-attachments/assets/ff658cec-6f6f-41ea-87fd-599096ab6d41" />
> <img width="768" height="576" alt="columns-C16-uniform50" src="https://github.com/user-attachments/assets/d903cf45-0336-4fdd-a1f0-c3b6bd331bdd" />
> <img width="768" height="576" alt="columns-C32-uniform50" src="https://github.com/user-attachments/assets/4bde5045-4391-446e-a678-675109b5e193" />



**Additional context**

    RowSelectionPolicy::Auto { threshold, .. } => {
        let selection = match self.selection.as_ref() {
            Some(selection) => selection,
            None => return RowSelectionStrategy::Selectors,
        };

        let trimmed = selection.clone().trim();
        let selectors: Vec<RowSelector> = trimmed.into();
        if selectors.is_empty() {
            return RowSelectionStrategy::Mask;
        }

        let total_rows: usize = selectors.iter().map(|s| s.row_count).sum();
        let selector_count = selectors.len();
        if selector_count == 0 {
            return RowSelectionStrategy::Mask;
        }

        if total_rows < selector_count.saturating_mul(threshold) {
            RowSelectionStrategy::Mask
        } else {
            RowSelectionStrategy::Selectors
        }
    }
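
For intuition, the final comparison in the snippet above is equivalent to asking whether the average run length (`total_rows / selector_count`) is below `threshold`. The following is a minimal standalone sketch of that check, operating on plain run lengths. `Strategy` and `pick_strategy` are hypothetical stand-ins for illustration, not the real parquet types, and the threshold values in the usage note are arbitrary.

```rust
/// Hypothetical stand-in for the reader's strategy enum.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Strategy {
    Mask,
    Selectors,
}

/// Chooses a strategy from the lengths of consecutive select/skip runs.
/// Uses the same comparison as the snippet: prefer a mask when
/// total_rows < run_count * threshold, i.e. when runs are short on average.
pub fn pick_strategy(run_lengths: &[usize], threshold: usize) -> Strategy {
    if run_lengths.is_empty() {
        // An empty selection falls back to the mask, as in the snippet.
        return Strategy::Mask;
    }
    let total_rows: usize = run_lengths.iter().sum();
    if total_rows < run_lengths.len().saturating_mul(threshold) {
        Strategy::Mask
    } else {
        Strategy::Selectors
    }
}
```

With a threshold of 32, four runs of 4 rows each (16 rows total, average 4) select `Strategy::Mask`, while two runs of 1000 rows each select `Strategy::Selectors`. Only the run count and total row count enter the decision, which is why column type, column count, and value size cannot currently influence it.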
