Document "hash derivation modulo"

Ericson2314 · roberth · Ericson2314 · commit 04def3282c35 · 2025-10-30T00:06:34.000-04:00
Progress on #13405, which asks for an explicit characterisation of the equivalence relation like the one given here. Mention #9259, a future work item. Co-authored-by: Robert Hensing <roberth@users.noreply.github.com>
diff --git a/doc/manual/book.toml.in b/doc/manual/book.toml.in
@@ -7,6 +7,7 @@ additional-css = ["custom.css"]
 additional-js = ["redirects.js"]
 edit-url-template = "https://github.com/NixOS/nix/tree/master/doc/manual/{path}"
 git-repository-url = "https://github.com/NixOS/nix"
+mathjax-support = true
 
 # Handles replacing @docroot@ with a path to ./source relative to that markdown file,
 # {{#include handlebars}}, and the @generated@ syntax used within these. it mostly
diff --git a/doc/manual/meson.build b/doc/manual/meson.build
@@ -92,6 +92,8 @@ manual = custom_target(
         (cd @2@; RUST_LOG=warn @1@ build -d @2@ 3>&2 2>&1 1>&3) | { grep -Fv "because fragment resolution isn't implemented" || :; } 3>&2 2>&1 1>&3
         rm -rf @2@/manual
         mv @2@/html @2@/manual
+        # Remove MathJax 2.7, because we will actually use MathJax 3.x
+        find @2@/manual | grep .html | xargs sed -i -e '/2.7.1.MathJax.js/d'
         find @2@/manual -iname meson.build -delete
     '''.format(
       python.full_path(),
diff --git a/doc/manual/source/SUMMARY.md.in b/doc/manual/source/SUMMARY.md.in
@@ -26,6 +26,7 @@
     - [Derivation Outputs and Types of Derivations](store/derivation/outputs/index.md)
       - [Content-addressing derivation outputs](store/derivation/outputs/content-address.md)
       - [Input-addressing derivation outputs](store/derivation/outputs/input-address.md)
+  - [Derivation Resolution](store/resolution.md)
   - [Building](store/building.md)
   - [Store Types](store/types/index.md)
 {{#include ./store/types/SUMMARY.md}}
diff --git a/doc/manual/source/protocols/derivation-aterm.md b/doc/manual/source/protocols/derivation-aterm.md
@@ -1,6 +1,8 @@
 # Derivation "ATerm" file format
 
-For historical reasons, [store derivations][store derivation] are stored on-disk in [ATerm](https://homepages.cwi.nl/~daybuild/daily-books/technology/aterm-guide/aterm-guide.html) format.
+For historical reasons, [store derivations][store derivation] are stored on-disk in "Annotated Term" (ATerm) format
+([guide](https://homepages.cwi.nl/~daybuild/daily-books/technology/aterm-guide/aterm-guide.html),
+[paper](https://doi.org/10.1002/(SICI)1097-024X(200003)30:3%3C259::AID-SPE298%3E3.0.CO;2-Y)).
 
 ## The ATerm format used
 
diff --git a/doc/manual/source/protocols/json/schema/derivation-v3.yaml b/doc/manual/source/protocols/json/schema/derivation-v3.yaml
@@ -39,9 +39,9 @@ properties:
       This is a guard that allows us to continue evolving this format.
       The choice of `3` is fairly arbitrary, but corresponds to this informal version:
 
-      - Version 0: A-Term format
+      - Version 0: ATerm format
 
-      - Version 1: Original JSON format, with ugly `"r:sha256"` inherited from A-Term format.
+      - Version 1: Original JSON format, with ugly `"r:sha256"` inherited from ATerm format.
 
       - Version 2: Separate `method` and `hashAlgo` fields in output specs
 
diff --git a/doc/manual/source/store/derivation/index.md b/doc/manual/source/store/derivation/index.md
@@ -245,7 +245,7 @@ If those other derivations *also* abide by this common case (and likewise for tr
   >                                                           note the ".drv"
   > ```
 
-## Extending the model to be higher-order
+## Extending the model to be higher-order {#dynamic}
 
 **Experimental feature**: [`dynamic-derivations`](@docroot@/development/experimental-features.md#xp-feature-dynamic-derivations)
 
diff --git a/doc/manual/source/store/derivation/outputs/content-address.md b/doc/manual/source/store/derivation/outputs/content-address.md
@@ -167,7 +167,7 @@ It is only in the potential for that check to fail that they are different.
 >
 > In a future world where floating content-addressing is also stable, we in principle no longer need separate [fixed](#fixed) content-addressing.
 > Instead, we could always use floating content-addressing, and separately assert the precise value content address of a given store object to be used as an input (of another derivation).
-> A stand-alone assertion object of this sort is not yet implemented, but its possible creation is tracked in [Issue #11955](https://github.com/NixOS/nix/issues/11955).
+> A stand-alone assertion object of this sort is not yet implemented, but its possible creation is tracked in [issue #11955](https://github.com/NixOS/nix/issues/11955).
 >
 > In the current version of Nix, fixed outputs which fail their hash check are still registered as valid store objects, just not registered as outputs of the derivation which produced them.
 > This is an optimization that means if the wrong output hash is specified in a derivation, and then the derivation is recreated with the right output hash, derivation does not need to be rebuilt --- avoiding downloading potentially large amounts of data twice.
diff --git a/doc/manual/source/store/derivation/outputs/input-address.md b/doc/manual/source/store/derivation/outputs/input-address.md
@@ -6,26 +6,203 @@
 That is to say, an input-addressed output's store path is a function not of the output itself, but of the derivation that produced it.
 Even if two store paths have the same contents, if they are produced in different ways, and one is input-addressed, then they will have different store paths, and thus guaranteed to not be the same store object.
 
-<!---
+## Modulo fixed-output derivations {#hash-modulo}
 
-### Modulo fixed-output derivations
+So how do we compute the hash part of the output paths of an input-addressed derivation?
+This is done by the function `hashDerivationModulo`, shown below.
 
-**TODO hash derivation modulo.**
+First, a word on inputs.
+`hashDerivationModulo` is only defined on derivations whose [inputs](@docroot@/store/derivation/index.md#inputs) take the first-order form:
+```typescript
+type ConstantPath = {
+  path: StorePath;
+};
 
-So how do we compute the hash part of the output path of a derivation?
-This is done by the function `hashDrv`, shown in Figure 5.10.
-It distinguishes between two cases.
-If the derivation is a fixed-output derivation, then it computes a hash over just the `outputHash` attributes.
+type FirstOrderOutputPath = {
+  drvPath: StorePath;
+  output: OutputName;
+};
 
-If the derivation is not a fixed-output derivation, we replace each element in the derivation’s inputDrvs with the result of a call to `hashDrv` for that element.
-(The derivation at each store path in `inputDrvs` is converted from its on-disk ATerm representation back to a `StoreDrv` by the function `parseDrv`.) In essence, `hashDrv` partitions store derivations into equivalence classes, and for hashing purpose it replaces each store path in a derivation graph with its equivalence class.
+type FirstOrderDerivingPath = ConstantPath | FirstOrderOutputPath;
 
-The recursion in Figure 5.10 is inefficient:
-it will call itself once for each path by which a subderivation can be reached, i.e., `O(V k)` times for a derivation graph with `V` derivations and with out-degree of at most `k`.
-In the actual implementation, memoisation is used to reduce this to `O(V + E)` complexity for a graph with E edges.
+type Inputs = Set<FirstOrderDerivingPath>;
+```
 
--->
+For the algorithm below, we adopt a representation of derivations in which the two kinds of (first-order) deriving paths are partitioned into two sets, as follows:
+```typescript
+type Derivation = {
+  // inputs: Set<FirstOrderDerivingPath>; // replaced
+  inputSrcs: Set<ConstantPath>; // new instead
+  inputDrvOutputs: Set<FirstOrderOutputPath>; // new instead
+  // ...other fields...
+};
+```
 
+In the [currently-experimental][xp-feature-dynamic-derivations] higher-order case where outputs of outputs are allowed as [deriving paths][deriving-path] and thus derivation inputs, derivations using that generalization are not valid arguments to this function.
+Those derivations must be (partially) [resolved](@docroot@/store/resolution.md) enough first, to the point where no such higher-order inputs remain.
+Then, and only then, can input addresses be assigned.
+
+```
+function hashDerivationModulo(drv) -> Hash:
+    assert(drv.outputs are input-addressed)
+    drv′ ← drv with {
+        inputDrvOutputs = ⋃(
+            assert(drvPath is store path)
+            case hashOutputsOrDerivationModulo(readDrv(drvPath)) of
+                drvHash : Hash →
+                    (drvHash.toBase16(), output)
+                outputHashes : Map[String, Hash] →
+                    (outputHashes[output].toBase16(), "out")
+            | (drvPath, output) ∈ drv.inputDrvOutputs
+        )
+    }
+    return hashSHA256(printDrv(drv′))
+
+function hashOutputsOrDerivationModulo(drv) -> Map[String, Hash] | Hash:
+    if drv.outputs are content-addressed:
+        return {
+            outputName ↦ hashSHA256(
+                "fixed:out:" + ca.printMethodAlgo() +
+                ":" + ca.hash.toBase16() +
+                ":" + ca.makeFixedOutputPath(drv.name, outputName))
+            | (outputName ↦ output) ∈ drv.outputs
+            , ca = output.contentAddress // or get from build trace if floating
+        }
+    else: // drv.outputs are input-addressed
+        return hashDerivationModulo(drv)
+```
+
+### `hashDerivationModulo`
+
+We replace each element in the derivation's `inputDrvOutputs` using data from a call to `hashOutputsOrDerivationModulo` on the `drvPath` of that element.
+When `hashOutputsOrDerivationModulo` returns a single drv hash (because the input derivation in question is input-addressing), we simply swap out the `drvPath` for that hash, and keep the same output name.
+When `hashOutputsOrDerivationModulo` returns a map of content addresses per output, we look up the output in question, and pair it with the output name `out`.
+
+The resulting pseudo-derivation (with hashes instead of store paths in `inputDrvOutputs`) is then printed (in the ["ATerm" format](@docroot@/protocols/derivation-aterm.md)) and hashed, and this becomes the "hash modulo" of the derivation.
+
+When calculating output hashes, `hashDerivationModulo` is called on an almost-complete input-addressing derivation, which is just missing its input-addressed output paths.
+The derivation hash is then used to calculate output paths for each output.
+<!-- TODO describe how this is done. -->
+Those output paths can then be substituted into the almost-complete input-addressed derivation to complete it.
+
+> **Note**
+>
+> There may be an unintentional deviation from specification currently implemented in the `(outputHashes[output].toBase16(), "out")` case.
+> This is not fatal because the deviation would only apply for content-addressing derivations with more than one output, and that only occurs in the floating case, which is [experimental][xp-feature-ca-derivations].
+> Once this bug is fixed, this note will be removed.
+
+### `hashOutputsOrDerivationModulo`
+
+How does `hashOutputsOrDerivationModulo` in turn work?
+It consists of two main cases, based on whether the outputs of the derivation are to be input-addressed or content-addressed.
+
+#### Input-addressed outputs case
+
+In the input-addressed case, it just calls `hashDerivationModulo`, and returns that derivation hash.
+This makes `hashDerivationModulo` and `hashOutputsOrDerivationModulo` mutually-recursive.
+
+> **Note**
+>
+> In this case, `hashDerivationModulo` is being called on a *complete* input-addressing derivation that already has its output paths calculated.
+> The `inputDrvOutputs` substitution takes place anyway.
+
+#### Content-addressed outputs case
+
+If the outputs are [content-addressed](./content-address.md), then it computes a hash for each output derived from the content-address of that output.
+
+> **Note**
+>
+> In the [fixed](./content-address.md#fixed) content-addressing case, the outputs' content addresses are statically specified in advance, so this always just works.
+> (The fixed case is what the pseudo-code shows.)
+>
+> In the [floating](./content-address.md#floating) case, the content addresses are not specified in advance.
+> This is what the "or get from build trace if floating" comment refers to.
+> In this case, the algorithm is *stuck* until the input in question is built, and we know what the actual contents of the output in question are.
+>
+> That is OK however, because there is no problem with delaying the assigning of input addresses (which, remember, is what `hashDerivationModulo` is ultimately for) until all inputs are known.
+
+### Performance
+
+The recursion in the algorithm is potentially inefficient:
+it could call itself once for each path by which a subderivation can be reached, i.e., `O(V^k)` times for a derivation graph with `V` derivations and with out-degree of at most `k`.
+In the actual implementation, memoisation is used to reduce this to `O(V + E)` complexity for a graph with `E` edges.
+
+### Semantic properties
+
+In essence, `hashDerivationModulo` partitions input-addressing derivations into equivalence classes: every derivation in that equivalence class is mapped to the same derivation hash.
+We can characterize this equivalence relation directly, by working bottom up.
+
+We start by defining an equivalence relation on first-order output deriving paths that refer to content-addressed derivation outputs. Two such paths are equivalent if they refer to the same store object:
+
+\\[
+\\begin{prooftree}
+\\AxiomC{$d\_1$ is content-addressing}
+\\AxiomC{$d\_2$ is content-addressing}
+\\AxiomC{${}^\*(d\_1, o\_1) = {}^\*(d\_2, o\_2)$}
+\\TrinaryInfC{$(d\_1, o\_1) \\,\\sim_{\\mathrm{CA}}\\, (d\_2, o\_2)$}
+\\end{prooftree}
+\\]
+
+where \\({}^*(d, o)\\) denotes the store object that the output deriving path refers to.
+
+We will also need the following construction to lift any equivalence relation on \\(X\\) to an equivalence relation on (finite) sets of \\(X\\) (in short, \\(\\mathcal{P}(X)\\)):
+
+\\[
+\\begin{prooftree}
+\\AxiomC{$\\forall a \\in A. \\exists b \\in B. a \\,\\sim\_X\\, b$}
+\\AxiomC{$\\forall b \\in B. \\exists a \\in A. b \\,\\sim\_X\\, a$}
+\\BinaryInfC{$A \\,\\sim_{\\mathcal{P}(X)}\\, B$}
+\\end{prooftree}
+\\]
+
+Now we can define the equivalence relation \\(\\sim_\\mathrm{IA}\\) on input-addressed derivation outputs. Two input-addressed outputs are equivalent if their derivations are equivalent (via the yet-to-be-defined \\(\\sim_{\\mathrm{IADrv}}\\) relation) and their output names are the same:
+
+\\[
+\\begin{prooftree}
+\\AxiomC{$d\_1$ is input-addressing}
+\\AxiomC{$d\_2$ is input-addressing}
+\\AxiomC{$d\_1 \\,\\sim_{\\mathrm{IADrv}}\\, d\_2$}
+\\AxiomC{$o\_1 = o\_2$}
+\\QuaternaryInfC{$(d\_1, o\_1) \\,\\sim_{\\mathrm{IA}}\\, (d\_2, o\_2)$}
+\\end{prooftree}
+\\]
+
+And now we can define \\(\\sim_{\\mathrm{IADrv}}\\).
+Two input-addressed derivations are equivalent if their content-addressed inputs are equivalent, their input-addressed inputs are also equivalent, and they are otherwise equal:
+
+\\[
+\\begin{prooftree}
+\\AxiomC{$
+  \\mathrm{caInputs}(d\_1)
+  \\,\\sim_{\\mathcal{P}(\\mathrm{CA})}\\,
+  \\mathrm{caInputs}(d\_2)
+$}
+\\AxiomC{$
+  \\mathrm{iaInputs}(d\_1)
+  \\,\\sim_{\\mathcal{P}(\\mathrm{IA})}\\,
+  \\mathrm{iaInputs}(d\_2)
+$}
+\\AxiomC{$
+  d\_1\left[\\mathrm{inputDrvOutputs} := \\{\\}\right]
+  \=
+  d\_2\left[\\mathrm{inputDrvOutputs} := \\{\\}\right]
+$}
+\\TrinaryInfC{$d\_1 \\,\\sim_{\\mathrm{IADrv}}\\, d\_2$}
+\\end{prooftree}
+\\]
+
+where \\(\\mathrm{caInputs}(d)\\) returns the content-addressed inputs of \\(d\\) and \\(\\mathrm{iaInputs}(d)\\) returns the input-addressed inputs.
+
+> **Note**
+>
+> An astute reader might notice that nowhere does `inputSrcs` enter into these definitions.
+> That means that replacing an input derivation with its outputs directly added to `inputSrcs` always results in a derivation in a different equivalence class, despite the resulting input closure (as would be mounted in the store at build time) being the same.
+> [Issue #9259](https://github.com/NixOS/nix/issues/9259) is about creating a coarser equivalence relation to address this.
+>
+> \\(\\sim_\mathrm{Drv}\\) from [derivation resolution](@docroot@/store/resolution.md) is such an equivalence relation.
+> It is coarser than this one: any two derivations which are "'hash modulo'-equivalent" (\\(\\sim_\mathrm{IADrv}\\)) are also "resolution-equivalent" (\\(\\sim_\mathrm{Drv}\\)).
+> It also relates derivations whose `inputDrvOutputs` have been rewritten into `inputSrcs`.
+
+[deriving-path]: @docroot@/store/derivation/index.md#deriving-path
+[xp-feature-dynamic-derivations]: @docroot@/development/experimental-features.md#xp-feature-dynamic-derivations
 [xp-feature-ca-derivations]: @docroot@/development/experimental-features.md#xp-feature-ca-derivations
-[xp-feature-git-hashing]: @docroot@/development/experimental-features.md#xp-feature-git-hashing
-[xp-feature-impure-derivations]: @docroot@/development/experimental-features.md#xp-feature-impure-derivations
diff --git a/doc/manual/source/store/resolution.md b/doc/manual/source/store/resolution.md
@@ -0,0 +1,58 @@
+# Derivation Resolution
+
+To *resolve* a derivation is to replace its [inputs] with the simplest inputs --- plain store paths --- that denote the same store objects.
+
+[Deriving paths][deriving-path] intentionally make it possible to refer to the same [store object] in multiple ways.
+This is a consequence of content-addressing, since different derivations can produce the same outputs, and the same data can also be manually added to the store.
+This is also a consequence even of input-addressing, as an output can be referred to by derivation and output name, or directly by its store path input address.
+Since dereferencing deriving paths is thus not injective, it induces an equivalence relation on deriving paths.
+
+Let's call this equivalence relation \\(\\sim_\\mathrm{DP}\\), where \\(p_1 \\,\\sim_\\mathrm{DP}\\, p_2\\) means that deriving paths \\(p_1\\) and \\(p_2\\) refer to the same store object.
+
+**Content Equivalence**: Two deriving paths are equivalent if they refer to the same store object:
+
+\\[
+\\begin{prooftree}
+\\AxiomC{${}^*p_1 = {}^*p_2$}
+\\UnaryInfC{$p_1 \\,\\sim_\\mathrm{DP}\\, p_2$}
+\\end{prooftree}
+\\]
+
+where \\({}^*p\\) denotes the store object that deriving path \\(p\\) refers to.
+
+This also induces an equivalence relation on sets of deriving paths:
+
+\\[
+\\begin{prooftree}
+\\AxiomC{$\\{ {}^*p | p \\in P_1 \\} = \\{ {}^*p | p \\in P_2 \\}$}
+\\UnaryInfC{$P_1 \\,\\sim_{\\mathcal{P}(\\mathrm{DP})}\\, P_2$}
+\\end{prooftree}
+\\]
+
+**Input Content Equivalence**: This, in turn, induces an equivalence relation on derivations: two derivations are equivalent if their inputs are equivalent, and they are otherwise equal:
+
+\\[
+\\begin{prooftree}
+\\AxiomC{$\\mathrm{inputs}(d_1) \\,\\sim_{\\mathcal{P}(\\mathrm{DP})}\\, \\mathrm{inputs}(d_2)$}
+\\AxiomC{$
+  d\_1\left[\\mathrm{inputs} := \\{\\}\right]
+  \=
+  d\_2\left[\\mathrm{inputs} := \\{\\}\right]
+$}
+\\BinaryInfC{$d_1 \\,\\sim_\\mathrm{Drv}\\, d_2$}
+\\end{prooftree}
+\\]
+
+Derivation resolution always maps derivations to input-content-equivalent derivations.
+
+Similar to evaluation, we can also speak of *partial* vs *total* derivation resolution.
+Total resolution is the function described above.
+For partial resolution, a derivation is related to equivalent derivations with the same or simpler inputs, but not all those inputs will be plain store paths.
+This is useful when the input refers to a floating content-addressed output we have not yet built --- we don't know what (content-address) store path will be used for that derivation, so we are "stuck" trying to resolve the deriving path in question.
+Partial resolution is not a function, but an (asymmetric) relation, created by directing the above equivalence relation so the right-side items are always equal or simpler.
+(This is the usual practice for evaluation relations.)
+Like well-behaved evaluation relations, partial resolution is [*confluent*](https://en.wikipedia.org/wiki/Confluence_(abstract_rewriting)).
+
+[store object]: @docroot@/store/store-object.md
+[inputs]: @docroot@/store/derivation/index.md#inputs
+[deriving-path]: @docroot@/store/derivation/index.md#deriving-path
diff --git a/doc/manual/theme/head.hbs b/doc/manual/theme/head.hbs
@@ -0,0 +1,15 @@
+<script>
+MathJax = {
+  loader: {load: ['[tex]/bussproofs']},
+  tex: {
+    packages: {'[+]': ['bussproofs']},
+    // Doesn't seem to work in mathjax 3
+    //formatError: function(jax, error) {
+    //  console.log(`TeX error in "${jax.latex}": ${error.message}`);
+    //  return jax.formatError(error);
+    //}
+  }
+};
+</script>
+<!-- Load a newer version of MathJax than mdbook does by default, and which in particular has working relative paths for the "bussproofs" extension. -->
+<script async src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/3.0.1/es5/tex-mml-chtml.js"></script>
diff --git a/src/libstore/include/nix/store/derivations.hh b/src/libstore/include/nix/store/derivations.hh
@@ -277,7 +277,7 @@ struct BasicDerivation
     Path builder;
     Strings args;
     /**
-     * Must not contain the key `__json`, at least in order to serialize to A-Term.
+     * Must not contain the key `__json`, at least in order to serialize to ATerm.
      */
     StringPairs env;
     std::optional<StructuredAttrs> structuredAttrs;
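
The mutual recursion and memoisation that the new `input-address.md` text describes can be sketched as a toy model. This is not Nix's implementation: `Drv`, `Store`, `toyHash`, `makeHasher` and `diamond` are hypothetical names, a small polynomial string hash stands in for SHA-256, JSON stands in for the ATerm printer, and the fixed-output case hashes only the content address (the manual's pseudo-code also includes the method, algorithm and output path).

```typescript
// Toy model for illustration; every name here is hypothetical, not Nix's API.
type Hash = string;
type OutputHashes = Record<string, Hash>;
type Drv = {
  name: string;
  builder: string;
  caOutputs?: Record<string, Hash>; // fixed content addresses; absent = input-addressed
  inputSrcs: string[];
  inputDrvOutputs: { drvPath: string; output: string }[];
};
type Store = Record<string, Drv>; // drvPath -> derivation, stands in for readDrv

// Stand-in for SHA-256: a small polynomial string hash, NOT cryptographic.
function toyHash(s: string): Hash {
  let h = 7;
  for (let i = 0; i < s.length; i++) {
    h = (h * 31 + s.charCodeAt(i)) % 4294967296;
  }
  return h.toString(16);
}

function makeHasher(store: Store) {
  const memo: Record<string, Hash | OutputHashes> = {};
  const stats = { computed: 0 }; // counts derivations hashed without a memo hit

  function hashDerivationModulo(drv: Drv): Hash {
    if (drv.caOutputs) throw new Error("outputs must be input-addressed");
    // Replace each (drvPath, output) pair with a hash-based pair.
    const inputs = drv.inputDrvOutputs.map((input) => {
      const r = hashOutputsOrDerivationModulo(input.drvPath);
      return typeof r === "string"
        ? r + "!" + input.output // input-addressed input: keep the output name
        : r[input.output] + "!out"; // content-addressed input: per-output hash
    });
    inputs.sort();
    // JSON stands in for printing the rewritten derivation as ATerm.
    const printed = JSON.stringify([drv.name, drv.builder, drv.inputSrcs.slice().sort(), inputs]);
    return toyHash(printed);
  }

  function hashOutputsOrDerivationModulo(drvPath: string): Hash | OutputHashes {
    if (drvPath in memo) return memo[drvPath]; // memoisation: each derivation is hashed once
    stats.computed++;
    const drv = store[drvPath];
    let result: Hash | OutputHashes;
    if (drv.caOutputs) {
      const hashes: OutputHashes = {};
      for (const name of Object.keys(drv.caOutputs)) {
        hashes[name] = toyHash("fixed:out:" + drv.caOutputs[name]);
      }
      result = hashes;
    } else {
      result = hashDerivationModulo(drv);
    }
    memo[drvPath] = result;
    return result;
  }

  return { hashDerivationModulo, hashOutputsOrDerivationModulo, stats };
}

// A diamond: top -> {left, right} -> base -> src, where src is fixed-output.
// The paths are arbitrary labels in this toy, not real store paths.
function diamond(fetcher: string, content: Hash): Store {
  const src = "src-via-" + fetcher + ".drv";
  const on = (drvPath: string) => ({ drvPath, output: "out" });
  const ia = (name: string, deps: string[]): Drv =>
    ({ name, builder: "sh", inputSrcs: [], inputDrvOutputs: deps.map(on) });
  const store: Store = {};
  store[src] = { name: "src", builder: fetcher, caOutputs: { out: content }, inputSrcs: [], inputDrvOutputs: [] };
  store["base.drv"] = ia("base", [src]);
  store["left.drv"] = ia("left", ["base.drv"]);
  store["right.drv"] = ia("right", ["base.drv"]);
  store["top.drv"] = ia("top", ["left.drv", "right.drv"]);
  return store;
}

const curl = makeHasher(diamond("curl", "abc123"));
const topViaCurl = curl.hashOutputsOrDerivationModulo("top.drv");
const topViaWget = makeHasher(diamond("wget", "abc123")).hashOutputsOrDerivationModulo("top.drv");
console.log(topViaCurl === topViaWget); // same content address, different fetcher
console.log(curl.stats.computed); // derivations actually hashed in the diamond
```

In this toy, swapping the fetcher of the fixed-output `src` derivation leaves the hash of `top.drv` unchanged, because only the content address of `src` enters the computation. That is the "modulo" in the name. The counter illustrates the memoisation point from the Performance section: `base.drv` is reachable along two paths but is hashed only once.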
