
Conversation

@bongjunj (Contributor) commented Dec 5, 2025

This resolves #12106.

The table below shows the data backing this patch: removing the canonicalizing De Morgan rules improves both execution and compilation performance for the keccak benchmark. The improvement comes primarily from the fact that these rules confuse our e-graph based optimizer, which selects the lowest-cost expression among the equivalent expressions collected by performing (bounded) equality saturation with the rules defined in ISLE. That selection happens during elaboration of each basic block, where the cost model evaluates each expression by accumulating the costs of its instructions; each instruction is assigned a cost based on its opcode. This can be confirmed in the following code:

fn compute_best_values(&mut self) {
    let best = &mut self.value_to_best_value;

    // We can't make random decisions inside the fixpoint loop below because
    // that could cause values to change on every iteration of the loop,
    // which would make the loop never terminate. So in chaos testing
    // mode we need a form of making suboptimal decisions that is fully
    // deterministic. We choose to simply make the worst decision we know
    // how to do instead of the best.
    let use_worst = self.ctrl_plane.get_decision();

    // Do a fixpoint loop to compute the best value for each eclass.
    //
    // The maximum number of iterations is the length of the longest chain
    // of `vNN -> vMM` edges in the dataflow graph where `NN < MM`, so this
    // is *technically* quadratic, but `cranelift-frontend` won't construct
    // any such edges. NaN canonicalization will introduce some of these
    // edges, but they are chains of only two or three edges. So in
    // practice, we *never* do more than a handful of iterations here unless
    // (a) we parsed the CLIF from text and the text was funkily numbered,
    // which we don't really care about, or (b) the CLIF producer did
    // something weird, in which case it is their responsibility to stop
    // doing that.
    trace!(
        "Entering fixpoint loop to compute the {} values for each eclass",
        if use_worst {
            "worst (chaos mode)"
        } else {
            "best"
        }
    );
    let mut keep_going = true;
    while keep_going {
        keep_going = false;
        trace!(
            "fixpoint iteration {}",
            self.stats.elaborate_best_cost_fixpoint_iters
        );
        self.stats.elaborate_best_cost_fixpoint_iters += 1;

        for (value, def) in self.func.dfg.values_and_defs() {
            trace!("computing best for value {:?} def {:?}", value, def);
            let orig_best_value = best[value];

            match def {
                ValueDef::Union(x, y) => {
                    // Pick the best of the two options based on
                    // min-cost. This works because each element of `best`
                    // is a `(cost, value)` tuple; `cost` comes first so
                    // the natural comparison works based on cost, and
                    // breaks ties based on value number.
                    best[value] = if use_worst {
                        if best[x].1.is_reserved_value() {
                            best[y]
                        } else if best[y].1.is_reserved_value() {
                            best[x]
                        } else {
                            std::cmp::max(best[x], best[y])
                        }
                    } else {
                        std::cmp::min(best[x], best[y])
                    };
                    trace!(
                        " -> best of union({:?}, {:?}) = {:?}",
                        best[x], best[y], best[value]
                    );
                }
                ValueDef::Param(_, _) => {
                    best[value] = BestEntry(Cost::zero(), value);
                }
                // If the Inst is inserted into the layout (which is,
                // at this point, only the side-effecting skeleton),
                // then it must be computed and thus we give it zero
                // cost.
                ValueDef::Result(inst, _) => {
                    if let Some(_) = self.func.layout.inst_block(inst) {
                        best[value] = BestEntry(Cost::zero(), value);
                    } else {
                        let inst_data = &self.func.dfg.insts[inst];
                        // N.B.: at this point we know that the opcode is
                        // pure, so `pure_op_cost`'s precondition is
                        // satisfied.
                        let cost = Cost::of_pure_op(
                            inst_data.opcode(),
                            self.func.dfg.inst_values(inst).map(|value| best[value].0),
                        );
                        best[value] = BestEntry(cost, value);
                        trace!(" -> cost of value {} = {:?}", value, cost);
                    }
                }
            };

            // Keep on iterating the fixpoint loop while we are finding new
            // best values.
            keep_going |= orig_best_value != best[value];
        }
    }

    if cfg!(any(feature = "trace-log", debug_assertions)) {
        trace!("finished fixpoint loop to compute best value for each eclass");
        for value in self.func.dfg.values() {
            trace!("-> best for eclass {:?}: {:?}", value, best[value]);
            debug_assert_ne!(best[value].1, Value::reserved_value());
            // You might additionally be expecting an assert that the best
            // cost is not infinity, however infinite cost *can* happen in
            // practice. First, note that our cost function doesn't know
            // about any shared structure in the dataflow graph, it only
            // sums operand costs. (And trying to avoid that by deduping a
            // single operation's operands is a losing game because you can
            // always just add one indirection and go from `add(x, x)` to
            // `add(foo(x), bar(x))` to hide the shared structure.) Given
            // that blindness to sharing, we can make cost grow
            // exponentially with a linear sequence of operations:
            //
            //   v0 = iconst.i32 1    ;; cost = 1
            //   v1 = iadd v0, v0     ;; cost = 3 + 1 + 1
            //   v2 = iadd v1, v1     ;; cost = 3 + 5 + 5
            //   v3 = iadd v2, v2     ;; cost = 3 + 13 + 13
            //   v4 = iadd v3, v3     ;; cost = 3 + 29 + 29
            //   v5 = iadd v4, v4     ;; cost = 3 + 61 + 61
            //   v6 = iadd v5, v5     ;; cost = 3 + 125 + 125
            //   ;; etc...
            //
            // Such a chain can cause cost to saturate to infinity. How do
            // we choose which e-node is best when there are multiple that
            // have saturated to infinity? It doesn't matter. As long as
            // invariant (2) for optimization rules is upheld by our rule
            // set (see `cranelift/codegen/src/opts/README.md`) it is safe
            // to choose *any* e-node in the e-class. At worst we will
            // produce suboptimal code, but never an incorrectness.
        }
    }
}
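
For intuition, here is a standalone sketch of the (cost, value) ordering used in the union case above; the struct and its derived ordering are illustrative only, not the actual `BestEntry`/`Cost`/`Value` definitions:

// A minimal standalone sketch (illustrative types, not Cranelift's actual
// `BestEntry`) of why storing the tuple as (cost, value-number) lets a plain
// `min` pick the cheapest option and break ties deterministically by value
// number.
#[derive(Clone, Copy, PartialEq, Eq, PartialOrd, Ord, Debug)]
struct SketchEntry(u32 /* cost */, u32 /* value number */);

fn main() {
    let a = SketchEntry(3, 17); // cheaper, higher value number
    let b = SketchEntry(5, 4); // more expensive, lower value number
    assert_eq!(std::cmp::min(a, b), a); // cost is compared first

    let c = SketchEntry(3, 9);
    assert_eq!(std::cmp::min(a, c), c); // equal cost: lower value number wins
}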

impl Cost {
    const DEPTH_BITS: u8 = 8;
    const DEPTH_MASK: u32 = (1 << Self::DEPTH_BITS) - 1;
    const OP_COST_MASK: u32 = !Self::DEPTH_MASK;
    const MAX_OP_COST: u32 = Self::OP_COST_MASK >> Self::DEPTH_BITS;

    pub(crate) fn infinity() -> Cost {
        // 2^32 - 1 is, uh, pretty close to infinite... (we use `Cost`
        // only for heuristics and always saturate so this suffices!)
        Cost(u32::MAX)
    }

    pub(crate) fn zero() -> Cost {
        Cost(0)
    }

    /// Construct a new `Cost` from the given parts.
    ///
    /// If the opcode cost is greater than or equal to the maximum representable
    /// opcode cost, then the resulting `Cost` saturates to infinity.
    fn new(opcode_cost: u32, depth: u8) -> Cost {
        if opcode_cost >= Self::MAX_OP_COST {
            Self::infinity()
        } else {
            Cost(opcode_cost << Self::DEPTH_BITS | u32::from(depth))
        }
    }

    fn depth(&self) -> u8 {
        let depth = self.0 & Self::DEPTH_MASK;
        u8::try_from(depth).unwrap()
    }

    fn op_cost(&self) -> u32 {
        (self.0 & Self::OP_COST_MASK) >> Self::DEPTH_BITS
    }

    /// Return the cost of an opcode.
    fn of_opcode(op: Opcode) -> Cost {
        match op {
            // Constants.
            Opcode::Iconst | Opcode::F32const | Opcode::F64const => Cost::new(1, 0),

            // Extends/reduces.
            Opcode::Uextend
            | Opcode::Sextend
            | Opcode::Ireduce
            | Opcode::Iconcat
            | Opcode::Isplit => Cost::new(1, 0),

            // "Simple" arithmetic.
            Opcode::Iadd
            | Opcode::Isub
            | Opcode::Band
            | Opcode::Bor
            | Opcode::Bxor
            | Opcode::Bnot
            | Opcode::Ishl
            | Opcode::Ushr
            | Opcode::Sshr => Cost::new(3, 0),

            // "Expensive" arithmetic.
            Opcode::Imul => Cost::new(10, 0),

            // Everything else.
            _ => {
                // By default, be slightly more expensive than "simple"
                // arithmetic.
                let mut c = Cost::new(4, 0);
                // And then get more expensive as the opcode does more side
                // effects.
                if op.can_trap() || op.other_side_effects() {
                    c = c + Cost::new(10, 0);
                }
                if op.can_load() {
                    c = c + Cost::new(20, 0);
                }
                if op.can_store() {
                    c = c + Cost::new(50, 0);
                }
                c
            }
        }
    }

    /// Compute the cost of the operation and its given operands.
    ///
    /// Caller is responsible for checking that the opcode came from an instruction
    /// that satisfies `inst_predicates::is_pure_for_egraph()`.
    pub(crate) fn of_pure_op(op: Opcode, operand_costs: impl IntoIterator<Item = Self>) -> Self {
        let c = Self::of_opcode(op) + operand_costs.into_iter().sum();
        Cost::new(c.op_cost(), c.depth().saturating_add(1))
    }

    /// Compute the cost of an operation in the side-effectful skeleton.
    pub(crate) fn of_skeleton_op(op: Opcode, arity: usize) -> Self {
        Cost::of_opcode(op) + Cost::new(u32::try_from(arity).unwrap(), (arity != 0) as _)
    }
}
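
As a quick sanity check of the bit layout above, here is a standalone sketch that re-derives the same constants outside of the `Cost` type (illustrative only, not part of the crate):

// Standalone sketch re-deriving the bit layout used by `Cost` above: the low
// 8 bits hold the depth and the remaining 24 bits hold the opcode cost, so any
// opcode cost at or above 2^24 - 1 saturates to infinity.
fn main() {
    const DEPTH_BITS: u8 = 8;
    const DEPTH_MASK: u32 = (1 << DEPTH_BITS) - 1; // 0x0000_00ff
    const OP_COST_MASK: u32 = !DEPTH_MASK; // 0xffff_ff00
    const MAX_OP_COST: u32 = OP_COST_MASK >> DEPTH_BITS; // 0x00ff_ffff

    let packed = (42u32 << DEPTH_BITS) | 3; // opcode cost 42 at depth 3
    assert_eq!(packed & DEPTH_MASK, 3);
    assert_eq!((packed & OP_COST_MASK) >> DEPTH_BITS, 42);
    assert_eq!(MAX_OP_COST, (1 << 24) - 1);
}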

However, in a large, straight-line function with only a single block, costs can saturate to infinity, at which point the cost model can no longer tell equivalent expressions apart and ends up selecting suboptimal ones.
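
To make the saturation concrete, here is a standalone sketch (it models only the op-cost part of `Cost`, not the real arithmetic) of how the `iadd` chain from the comment above drives the accumulated cost past the 24-bit limit within a few dozen instructions:

// Standalone sketch of the exponential blow-up described above: each
// `iadd vN, vN` re-counts both operand costs (the cost function is blind to
// sharing), so the op-cost part roughly doubles per instruction and hits the
// 24-bit saturation threshold after a few dozen steps.
fn main() {
    const MAX_OP_COST: u64 = (1 << 24) - 1; // op-cost field is 24 bits wide
    const IADD_COST: u64 = 3; // "simple" arithmetic cost from `of_opcode`

    let mut operand_cost: u64 = 1; // v0 = iconst.i32 1
    for n in 1..=64 {
        // vN = iadd v(N-1), v(N-1)
        operand_cost = IADD_COST + 2 * operand_cost;
        if operand_cost >= MAX_OP_COST {
            println!("cost saturates to infinity at v{n}");
            break;
        }
    }
}

Once several equivalent e-nodes have all saturated like this, the `min`-based selection above can no longer distinguish them, which is the situation described above.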

| Benchmark | HEAD Exec | No DeMorgan Exec | Exec Speedup | HEAD Comp | No DeMorgan Comp | Comp Speedup |
|---|---|---|---|---|---|---|
| shootout-keccak | 29,514,301 | 21,135,969 | 39.64% | 100383275.4 | 74489436.9 | 34.76% |
| shootout-nestedloop | 433 | 405 | 6.85% | 62986270.12 | 62895037.2 | 0.15% |
| shootout-seqhash | 8,641,777,537 | 8,579,060,454 | 0.73% | 100553182.4 | 100005923.8 | 0.55% |
| blake3-scalar | 317,195 | 315,877 | 0.42% | 268166443.5 | 268150662.4 | 0.01% |
| shootout-xblabla20 | 2,927,213 | 2,916,862 | 0.35% | 88557190.04 | 88280759.04 | 0.31% |
| shootout-ackermann | 7,781,877 | 7,762,843 | 0.25% | 92660624.26 | 92236280.38 | 0.46% |
| shootout-ed25519 | 9,421,698,075 | 9,401,086,401 | 0.22% | 371753577.1 | 361934260.4 | 2.71% |
| shootout-matrix | 539,420,132 | 538,256,278 | 0.22% | 89971884.74 | 86974292.06 | 3.45% |
| regex | 211,067,251 | 210,714,847 | 0.17% | 1535901765 | 1532204889 | 0.24% |
| shootout-base64 | 353,052,919 | 352,747,070 | 0.09% | 88998859.4 | 88132267.24 | 0.98% |
| shootout-sieve | 841,972,639 | 841,375,673 | 0.07% | 63777036.32 | 63575041.26 | 0.32% |
| shootout-random | 439,817,997 | 439,827,859 | 0.00% | 63900104.52 | 63538830.34 | 0.57% |
| shootout-heapsort | 2,375,624,074 | 2,375,698,263 | 0.00% | 28196732.02 | 28125294.96 | 0.25% |
| shootout-ratelimit | 39,821,276 | 39,826,019 | -0.01% | 86722577.82 | 85973142.94 | 0.87% |
| shootout-minicsv | 1,290,801,636 | 1,291,097,063 | -0.02% | 15083307.78 | 15097025.52 | -0.09% |
| shootout-switch | 153,666,266 | 153,712,564 | -0.03% | 131105772.9 | 130938195.3 | 0.13% |
| spidermonkey | 633,415,685 | 633,770,903 | -0.06% | 22407719269 | 22354707118 | 0.24% |
| shootout-fib2 | 3,009,743,387 | 3,011,674,216 | -0.06% | 64783294.42 | 64608033.96 | 0.27% |
| blake3-simd | 305,358 | 305,565 | -0.07% | 86174589.6 | 85577620.54 | 0.70% |
| shootout-gimli | 5,416,594 | 5,428,633 | -0.22% | 5729545.32 | 5750271.64 | -0.36% |
| bz2 | 86,355,971 | 86,624,547 | -0.31% | 285047295.8 | 282723762.6 | 0.82% |
| shootout-memmove | 36,113,795 | 36,259,007 | -0.40% | 90058395.36 | 89318279.06 | 0.83% |
| shootout-xchacha20 | 4,407,751 | 4,435,038 | -0.62% | 88598416.98 | 87785480.78 | 0.93% |
| pulldown-cmark | 6,600,830 | 6,654,919 | -0.81% | 644222877.5 | 642976024.9 | 0.19% |
| shootout-ctype | 796,935,215 | 813,571,858 | -2.04% | 85272502.8 | 84941865.04 | 0.39% |

@bongjunj requested a review from a team as a code owner December 5, 2025 09:02
@bongjunj requested review from alexcrichton and removed the request for a team December 5, 2025 09:02
@github-actions bot added the "cranelift" and "isle" labels Dec 5, 2025

@bongjunj (Contributor, Author) commented Dec 5, 2025

It seems the CI didn't run because of a failure of the npm registry server.

@cfallin (Member) left a comment

Thanks for this!

@cfallin added this pull request to the merge queue Dec 5, 2025
Merged via the queue into bytecodealliance:main with commit 39f8c4e Dec 5, 2025
88 of 90 checks passed

Successfully merging this pull request may close these issues.

Cranelift: ISLE mid-end performance regression (up to -66.08%)
