Document "hash derivation quotiented", resolution, and build trace

Ericson2314 · roberth · Ericson2314 · commit bd4ac4f8072c · 2025-10-31T13:35:50.000-04:00
Progress on #13405, which asks for an explicit characterisation of the equivalence relation like the one given here. Also progress on #11895, because we're using the term "build trace entry" instead of "realisation". Mention #9259, a future work item. Co-authored-by: Robert Hensing <roberth@users.noreply.github.com>
diff --git a/doc/manual/book.toml.in b/doc/manual/book.toml.in
@@ -7,6 +7,7 @@ additional-css = ["custom.css"]
 additional-js = ["redirects.js"]
 edit-url-template = "https://github.com/NixOS/nix/tree/master/doc/manual/{path}"
 git-repository-url = "https://github.com/NixOS/nix"
+mathjax-support = true
 
 # Handles replacing @docroot@ with a path to ./source relative to that markdown file,
 # {{#include handlebars}}, and the @generated@ syntax used within these. it mostly
diff --git a/doc/manual/meson.build b/doc/manual/meson.build
@@ -92,6 +92,8 @@ manual = custom_target(
         (cd @2@; RUST_LOG=warn @1@ build -d @2@ 3>&2 2>&1 1>&3) | { grep -Fv "because fragment resolution isn't implemented" || :; } 3>&2 2>&1 1>&3
         rm -rf @2@/manual
         mv @2@/html @2@/manual
+        # Remove Mathjax 2.7, because we will actually use MathJax 3.x
+        find @2@/manual | grep .html | xargs sed -i -e '/2.7.1.MathJax.js/d'
         find @2@/manual -iname meson.build -delete
     '''.format(
       python.full_path(),
diff --git a/doc/manual/source/SUMMARY.md.in b/doc/manual/source/SUMMARY.md.in
@@ -26,6 +26,8 @@
     - [Derivation Outputs and Types of Derivations](store/derivation/outputs/index.md)
       - [Content-addressing derivation outputs](store/derivation/outputs/content-address.md)
       - [Input-addressing derivation outputs](store/derivation/outputs/input-address.md)
+  - [Build Trace](store/build-trace.md)
+  - [Derivation Resolution](store/resolution.md)
   - [Building](store/building.md)
   - [Store Types](store/types/index.md)
 {{#include ./store/types/SUMMARY.md}}
diff --git a/doc/manual/source/protocols/derivation-aterm.md b/doc/manual/source/protocols/derivation-aterm.md
@@ -1,6 +1,8 @@
 # Derivation "ATerm" file format
 
-For historical reasons, [store derivations][store derivation] are stored on-disk in [ATerm](https://homepages.cwi.nl/~daybuild/daily-books/technology/aterm-guide/aterm-guide.html) format.
+For historical reasons, [store derivations][store derivation] are stored on-disk in "Annotated Term" (ATerm) format
+([guide](https://homepages.cwi.nl/~daybuild/daily-books/technology/aterm-guide/aterm-guide.html),
+[paper](https://doi.org/10.1002/(SICI)1097-024X(200003)30:3%3C259::AID-SPE298%3E3.0.CO;2-Y)).
 
 ## The ATerm format used
 
diff --git a/doc/manual/source/protocols/json/schema/derivation-v3.yaml b/doc/manual/source/protocols/json/schema/derivation-v3.yaml
@@ -39,9 +39,9 @@ properties:
       This is a guard that allows us to continue evolving this format.
       The choice of `3` is fairly arbitrary, but corresponds to this informal version:
 
-      - Version 0: A-Term format
+      - Version 0: ATerm format
 
-      - Version 1: Original JSON format, with ugly `"r:sha256"` inherited from A-Term format.
+      - Version 1: Original JSON format, with ugly `"r:sha256"` inherited from ATerm format.
 
       - Version 2: Separate `method` and `hashAlgo` fields in output specs
 
diff --git a/doc/manual/source/store/build-trace.md b/doc/manual/source/store/build-trace.md
@@ -0,0 +1,54 @@
+# Build Trace
+
+> **Warning**
+>
+> This entire concept is currently
+> [**experimental**](@docroot@/development/experimental-features.md#xp-feature-ca-derivations)
+> and subject to change.
+
+The *build trace* is a [memoization table](https://en.wikipedia.org/wiki/Memoization) for builds.
+It maps the the inputs of builds to the outputs of builds.
+Concretely, that means it maps [derivations][derivation] to maps of [output] names to [store objects].
+
+In general the derivations used as a key should be be [*resolved*](./resolution.md).
+A build trace with all-resolved-derivation keys is also called a *base build trace* for extra clarity.
+If all the resolved inputs of a derivation are content-addressed, that means the inputs will be fully determined, leaving no ambiguity for what build was performed.
+(Input-addressed inputs however are still ambiguous. They too should be locked down, but this is left as future work.)
+
+Accordingly, to look up an unresolved derivation, one must first resolve it to get a resolved derivation.
+Resolving itself involves looking up entries in the build trace, so this is a mutually recursive process that will end up inspecting possibly many entries.
+
+Except for the issue with input-addressed paths called out above, base build traces are trivially *coherent* -- incoherence is not possible.
+That means that the claims that each key-value base build try entry makes are independent, and no mapping invalidates another mapping.
+
+Whether the mappings are *true*, i.e. the faithful recording of actual builds performed, is another matter.
+Coherence is about the multiple claims of the build trace being mutually consistent, not about whether the claims are individually true or false.
+
+In general, there is no way to audit a build trace entry except for by performing the build again from scratch.
+And even in that case, a different result doesn't mean the original entry was a "lie", because the derivation being built may be non-deterministic.
+As such, the decision of whether to trust a counterparty's build trace is a fundamentally subject policy choice.
+Build trace entries are typically *signed* in order to enable arbitrary public-key-based trust polices.
+
+## Derived build traces
+
+Implementations that wish to memoize the above may also keep additional *derived* build trace entries that do map unresolved derivations.
+This is
+
+If they do so, they *must* also keep the underlying base entries with resolved derivation keys around.
+Firstly, this ensures that the derived entries are merely cache, which could be recomputed from scratch.
+Secondly, this ensures the coherence of the derived build trace.
+
+Unlike with base build traces, incoherence with derived build traces is possible.
+The key ingredient is that derivation resolution is only deterministic with respect to a fixed base build trace.
+Without fixing the base build trace, it inherits the subjectivity of base build traces themselves.
+
+Concretely, suppose there are three derivations \\(a\\), \\(b\\), and \((c\\).
+Let \\(a\\) be a resolved derivation, but let \\(b\\) and \((c\\) be unresolved and both take as an input an output of \\(a\\).
+Now suppose that derived entries are made for \\(b\\) and \((c\\) based on two different entries of \\(a\\).
+(This could happen if \\(a\\) is non-deterministic, \\(a\\) and \\(b\\) are built in one store, \\(a\\) and \\(c\\) are built in another store, and then a third store substitutes from both of the first two stores.)
+
+If trusting the derived build trace entries for \\(b\\) and \((c\\) requires that each's underlying entry for \\(a\\) be also trusted, the two different mappings for \\(a\\) will be caught.
+However, if \\(b\\) and \((c\\)'s entries can be combined in isolation, there will be nothing to catch the contradiction in their hidden assumptions about \\(a\\)'s output.
+
+[derivation]: ./derivation/index.md
+[output]: ./derivation/outputs/index.md
diff --git a/doc/manual/source/store/derivation/index.md b/doc/manual/source/store/derivation/index.md
@@ -245,7 +245,7 @@ If those other derivations *also* abide by this common case (and likewise for tr
   >                                                           note the ".drv"
   > ```
 
-## Extending the model to be higher-order
+## Extending the model to be higher-order {#dynamic}
 
 **Experimental feature**: [`dynamic-derivations`](@docroot@/development/experimental-features.md#xp-feature-dynamic-derivations)
 
diff --git a/doc/manual/source/store/derivation/outputs/content-address.md b/doc/manual/source/store/derivation/outputs/content-address.md
@@ -167,7 +167,7 @@ It is only in the potential for that check to fail that they are different.
 >
 > In a future world where floating content-addressing is also stable, we in principle no longer need separate [fixed](#fixed) content-addressing.
 > Instead, we could always use floating content-addressing, and separately assert the precise value content address of a given store object to be used as an input (of another derivation).
-> A stand-alone assertion object of this sort is not yet implemented, but its possible creation is tracked in [Issue #11955](https://github.com/NixOS/nix/issues/11955).
+> A stand-alone assertion object of this sort is not yet implemented, but its possible creation is tracked in [issue #11955](https://github.com/NixOS/nix/issues/11955).
 >
 > In the current version of Nix, fixed outputs which fail their hash check are still registered as valid store objects, just not registered as outputs of the derivation which produced them.
 > This is an optimization that means if the wrong output hash is specified in a derivation, and then the derivation is recreated with the right output hash, derivation does not need to be rebuilt --- avoiding downloading potentially large amounts of data twice.
diff --git a/doc/manual/source/store/derivation/outputs/input-address.md b/doc/manual/source/store/derivation/outputs/input-address.md
@@ -6,26 +6,207 @@
 That is to say, an input-addressed output's store path is a function not of the output itself, but of the derivation that produced it.
 Even if two store paths have the same contents, if they are produced in different ways, and one is input-addressed, then they will have different store paths, and thus guaranteed to not be the same store object.
 
-<!---
+## Modulo content addressed derivation outputs {#hash-quotient-drv}
 
-### Modulo fixed-output derivations
+So how do we compute the hash part of the output paths of an input-addressed derivation?
+This is done by the function `hashQuotientDerivation`, shown below.
 
-**TODO hash derivation modulo.**
+First, a word on inputs.
+`hashQuotientDerivation` is only defined on derivations whose [inputs](@docroot@/store/derivation/index.md#inputs) take the first-order form:
+```typescript
+type ConstantPath = {
+  path: StorePath;
+};
 
-So how do we compute the hash part of the output path of a derivation?
-This is done by the function `hashDrv`, shown in Figure 5.10.
-It distinguishes between two cases.
-If the derivation is a fixed-output derivation, then it computes a hash over just the `outputHash` attributes.
+type FirstOrderOutputPath = {
+  drvPath: StorePath;
+  output: OutputName;
+};
 
-If the derivation is not a fixed-output derivation, we replace each element in the derivation’s inputDrvs with the result of a call to `hashDrv` for that element.
-(The derivation at each store path in `inputDrvs` is converted from its on-disk ATerm representation back to a `StoreDrv` by the function `parseDrv`.) In essence, `hashDrv` partitions store derivations into equivalence classes, and for hashing purpose it replaces each store path in a derivation graph with its equivalence class.
+type FirstOrderDerivingPath = ConstantPath | FirstOrderOutputPath;
 
-The recursion in Figure 5.10 is inefficient:
-it will call itself once for each path by which a subderivation can be reached, i.e., `O(V k)` times for a derivation graph with `V` derivations and with out-degree of at most `k`.
-In the actual implementation, memoisation is used to reduce this to `O(V + E)` complexity for a graph with E edges.
+type Inputs = Set<FirstOrderDerivingPath>;
+```
 
--->
+For the the algorithm below, we adopt a derivation where the two types of (first order) derived paths are partitioned into two sets, as follows:
+```typescript
+type Derivation = {
+  // inputs: Set<FirstOrderDerivingPath>; // replaced
+  inputSrcs: Set<ConstantPath>; // new instead
+  inputDrvOutputs: Set<FirstOrderOutputPath>; // new instead
+  // ...other fields...
+};
+```
 
+In the [currently-experimental][xp-feature-dynamic-derivations] higher-order case where outputs of outputs are allowed as [deriving paths][deriving-path] and thus derivation inputs, derivations using that generalization are not valid arguments to this function.
+Those derivations must be (partially) [resolved](@docroot@/store/resolution.md) enough first, to the point where no such higher-order inputs remain.
+Then, and only then, can input addresses be assigned.
+
+```
+function hashQuotientDerivation(drv) -> Hash:
+    assert(drv.outputs are input-addressed)
+    drv′ ← drv with {
+        inputDrvOutputs = ⋃(
+            assert(drvPath is store path)
+            case hashOutputsOrQuotientDerivation(readDrv(drvPath)) of
+                drvHash : Hash →
+                    (drvHash.toBase16(), output)
+                outputHashes : Map[String, Hash] →
+                    (outputHashes[output].toBase16(), "out")
+            | (drvPath, output) ∈ drv.inputDrvOutputs
+        )
+    }
+    return hashSHA256(printDrv(drv′))
+
+function hashOutputsOrQuotientDerivation(drv) -> Map[String, Hash] | Hash:
+    if drv.outputs are content-addressed:
+        return {
+            outputName ↦ hashSHA256(
+                "fixed:out:" + ca.printMethodAlgo() +
+                ":" + ca.hash.toBase16() +
+                ":" + ca.makeFixedOutputPath(drv.name, outputName))
+            | (outputName ↦ output) ∈ drv.outputs
+            , ca = output.contentAddress // or get from build trace if floating
+        }
+    else: // drv.outputs are input-addressed
+        return hashQuotientDerivation(drv)
+```
+
+### `hashQuotientDerivation`
+
+We replace each element in the derivation's `inputDrvOutputs` using data from a call to `hashOutputsOrQuotientDerivation` on the `drvPath` of that element.
+When `hashOutputsOrQuotientDerivation` returns a single drv hash (because the input derivation in question is input-addressing), we simply swap out the `drvPath` for that hash, and keep the same output name.
+When `hashOutputsOrQuotientDerivation` returns a map of content addresses per per-output, we look up the output in question, and pair it with the output name `out`.
+
+The resulting pseudo-derivation (with hashes instead of store paths in `inputDrvs`) is then printed (in the ["ATerm" format](@docroot@/protocols/derivation-aterm.md)) and hashes, and this becomes the hash of the "quotient derivation".
+
+When calculating output hashes, `hashQuotientDerivation` is called on an almost-complete input-addressing derivation, which is just missing its input-addressed outputs paths.
+The derivation hash is then used to calculate output paths for each output.
+<!-- TODO describe how this is done. -->
+Those output paths can then be substituted into the almost-complete input-addressed derivation to complete it.
+
+> **Note**
+>
+> There may be an unintentional deviation from specification currently implemented in the `(outputHashes[output].toBase16(), "out")` case.
+> This is not fatal because the deviation would only apply for content-addressing derivations with more than one output, and that only occurs in the floating case, which is [experimental][xp-feature-ca-derivations].
+> Once this bug is fixed, this note will be removed.
+
+### `hashOutputsOrQuotientDerivation`
+
+How does `hashOutputsOrQuotientDerivation` in turn work?
+It consists of two main cases, based on whether the outputs of the derivation are to be input-addressed or content-addressed.
+
+#### Input-addressed outputs case
+
+In the input-addressed case, it just calls `hashQuotientDerivation`, and returns that derivation hash.
+This makes `hashQuotientDerivation` and `hashOutputsOrQuotientDerivation` mutually-recursive.
+
+> **Note**
+>
+> In this case, `hashQuotientDerivation` is being called on a *complete* input-addressing derivation that already has its output paths calculated.
+> The `inputDrvs` substitution takes place anyways.
+
+#### Content-addressed outputs case
+
+If the outputs are [content-addressed](./content-address.md), then it computes a hash for each output derived from the content-address of that output.
+
+> **Note**
+>
+> In the [fixed](./content-address.md#fixed) content-addressing case, the outputs' content addresses are statically specified in advanced, so this always just works.
+> (The fixed case is what the pseudo-code shows.)
+>
+> In the [floating](./content-address.md#floating) case, the content addresses are not specified in advanced.
+> This is what the "or get from [build trace](@docroot@/store/build-trace.md) if floating" comment refers to.
+> In this case, the algorithm is *stuck* until the input in question is built, and we know what the actual contents of the output in question is.
+>
+> That is OK however, because there is no problem with delaying the assigning of input addresses (which, remember, is what `hashQuotientDerivation` is ultimately for) until all inputs are known.
+
+### Performance
+
+The recursion in the algorithm is potentially inefficient:
+it could call itself once for each path by which a subderivation can be reached, i.e., `O(V^k)` times for a derivation graph with `V` derivations and with out-degree of at most `k`.
+In the actual implementation, memoisation is used to reduce this to `O(V + E)` complexity for a graph with `E` edges.
+
+### Semantic properties
+
+In essence, `hashQuotientDerivation` partitions input-addressing derivations into equivalence classes: every derivation in that equivalence class is mapped to the same derivation hash.
+We can characterize this equivalence relation directly, by working bottom up.
+
+We start by defining an equivalence relation on first-order output deriving paths that refer content-addressed derivation outputs. Two such paths are equivalent if they refer to the same store object:
+
+\\[
+\\begin{prooftree}
+\\AxiomC{$d\_1$ is content-addressing}
+\\AxiomC{$d\_2$ is content-addressing}
+\\AxiomC{${}^\*(d\_1, o\_1) = {}^\*(d\_2, o\_2)$}
+\\TrinaryInfC{$(d\_1, o\_1) \\,\\sim_{\\mathrm{CA}}\\, (d\_2, o\_2)$}
+\\end{prooftree}
+\\]
+
+where \\({}^*(d, o)\\) denotes the store object that the output deriving path refers to.
+
+We will also need the following construction to lift any equivalence relation on \\(X\\) to an equivalence relation on (finite) sets of \\(X\\) (in short, \\(\\mathcal{P}(X)\\)):
+
+\\[
+\\begin{prooftree}
+\\AxiomC{$\\forall a \\in A. \\exists b \\in B. a \\,\\sim\_X\\, b$}
+\\AxiomC{$\\forall b \\in B. \\exists a \\in A. b \\,\\sim\_X\\, a$}
+\\BinaryInfC{$A \\,\\sim_{\\mathcal{P}(X)}\\, B$}
+\\end{prooftree}
+\\]
+
+Now we can define the equivalence relation \\(\\sim_\\mathrm{IA}\\) on input-addressed derivation outputs. Two input-addressed outputs are equivalent if their derivations are equivalent (via the yet-to-be-defined \\(\\sim_{\\mathrm{IADrv}}\\) relation) and their output names are the same:
+
+\\[
+\\begin{prooftree}
+\\AxiomC{$d\_1$ is input-addressing}
+\\AxiomC{$d\_2$ is input-addressing}
+\\AxiomC{$d\_1 \\,\\sim_{\\mathrm{IADrv}}\\, d\_2$}
+\\AxiomC{$o\_1 = o\_2$}
+\\QuaternaryInfC{$(d\_1, o\_1) \\,\\sim_{\\mathrm{IA}}\\, (d\_2, o\_2)$}
+\\end{prooftree}
+\\]
+
+And now we can define \\(\\sim_{\\mathrm{IADrv}}\\).
+Two input-addressed derivations are equivalent if their content-addressed inputs are equivalent, their input-addressed inputs are also equivalent, and they are otherwise equal:
+
+<!-- cheating a bit with the semantics to get a good layout that fits on the page -->
+
+\\[
+\\begin{prooftree}
+\\alwaysNoLine
+\\AxiomC{$
+  \\mathrm{caInputs}(d\_1)
+  \\,\\sim_{\\mathcal{P}(\\mathrm{CA})}\\,
+  \\mathrm{caInputs}(d\_2)
+$}
+\\AxiomC{$
+  \\mathrm{iaInputs}(d\_1)
+  \\,\\sim_{\\mathcal{P}(\\mathrm{IA})}\\,
+  \\mathrm{iaInputs}(d\_2)
+$}
+\\BinaryInfC{$
+  d\_1\left[\\mathrm{inputDrvOutputs} := \\{\\}\right]
+  \=
+  d\_2\left[\\mathrm{inputDrvOutputs} := \\{\\}\right]
+$}
+\\alwaysSingleLine
+\\UnaryInfC{$d\_1 \\,\\sim_{\\mathrm{IADrv}}\\, d\_2$}
+\\end{prooftree}
+\\]
+
+where \\(\\mathrm{caInputs}(d)\\) returns the content-addressed inputs of \\(d\\) and \\(\\mathrm{iaInputs}(d)\\) returns the input-addressed inputs.
+
+> **Note**
+>
+> An astute reader might notice that that nowhere does `inputSrcs` enter into these definitions.
+> That means that replacing an input derivation with its outputs directly added to `inputSrcs` always results in a derivation in a different equivalence class, despite the resulting input closure (as would be mounted in the store at build time) being the same.
+> [Issue #9259](https://github.com/NixOS/nix/issues/9259) is about creating a coarser equivalence relation to address this.
+>
+> \\(\\sim_\mathrm{Drv}\\) from [derivation resolution](@docroot@/store/resolution.md) is such an equivalence relation.
+> It is coarser than this one: any two derivations which are "'hash quotient derivation'-equivalent" (\\(\\sim_\mathrm{IADrv}\\)) are also "resolution-equivalent" (\\(\\sim_\mathrm{Drv}\\)).
+> It also relates derivations whose `inputDrvOutputs` have been rewritten into `inputSrcs`.
+
+[deriving-path]: @docroot@/store/derivation/index.md#deriving-path
+[xp-feature-dynamic-derivations]: @docroot@/development/experimental-features.md#xp-feature-dynamic-derivations
 [xp-feature-ca-derivations]: @docroot@/development/experimental-features.md#xp-feature-ca-derivations
-[xp-feature-git-hashing]: @docroot@/development/experimental-features.md#xp-feature-git-hashing
-[xp-feature-impure-derivations]: @docroot@/development/experimental-features.md#xp-feature-impure-derivations
diff --git a/doc/manual/source/store/resolution.md b/doc/manual/source/store/resolution.md
diff --git a/doc/manual/theme/head.hbs b/doc/manual/theme/head.hbs
diff --git a/src/libstore/include/nix/store/derivations.hh b/src/libstore/include/nix/store/derivations.hh

Original file line number	Diff line number	Diff line change
`@@ -167,7 +167,7 @@ It is only in the potential for that check to fail that they are different.`
`167`	`167`	`>`
`168`	`168`	`> In a future world where floating content-addressing is also stable, we in principle no longer need separate [fixed](#fixed) content-addressing.`
`169`	`169`	`> Instead, we could always use floating content-addressing, and separately assert the precise value content address of a given store object to be used as an input (of another derivation).`
`170`		`-> A stand-alone assertion object of this sort is not yet implemented, but its possible creation is tracked in [Issue #11955](https://github.com/NixOS/nix/issues/11955).`
	`170`	`+> A stand-alone assertion object of this sort is not yet implemented, but its possible creation is tracked in [issue #11955](https://github.com/NixOS/nix/issues/11955).`
`171`	`171`	`>`
`172`	`172`	`> In the current version of Nix, fixed outputs which fail their hash check are still registered as valid store objects, just not registered as outputs of the derivation which produced them.`
`173`	`173`	`> This is an optimization that means if the wrong output hash is specified in a derivation, and then the derivation is recreated with the right output hash, derivation does not need to be rebuilt --- avoiding downloading potentially large amounts of data twice.`