Skip to content

Commit 04def32

Browse files
Ericson2314roberth
andcommitted
Document "hash derivation modulo"
Progress on #13405, which asks for an explicit characterisation of the equivalence relation like the one given here. Mention #9259, a future work item. Co-authored-by: Robert Hensing <[email protected]>
1 parent bffbdcf commit 04def32

File tree

11 files changed

+277
-21
lines changed

11 files changed

+277
-21
lines changed

doc/manual/book.toml.in

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -7,6 +7,7 @@ additional-css = ["custom.css"]
77
additional-js = ["redirects.js"]
88
edit-url-template = "https://github.com/NixOS/nix/tree/master/doc/manual/{path}"
99
git-repository-url = "https://github.com/NixOS/nix"
10+
mathjax-support = true
1011

1112
# Handles replacing @docroot@ with a path to ./source relative to that markdown file,
1213
# {{#include handlebars}}, and the @generated@ syntax used within these. it mostly

doc/manual/meson.build

Lines changed: 2 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -92,6 +92,8 @@ manual = custom_target(
9292
(cd @2@; RUST_LOG=warn @1@ build -d @2@ 3>&2 2>&1 1>&3) | { grep -Fv "because fragment resolution isn't implemented" || :; } 3>&2 2>&1 1>&3
9393
rm -rf @2@/manual
9494
mv @2@/html @2@/manual
95+
# Remove Mathjax 2.7, because we will actually use MathJax 3.x
96+
find @2@/manual | grep .html | xargs sed -i -e '/2.7.1.MathJax.js/d'
9597
find @2@/manual -iname meson.build -delete
9698
'''.format(
9799
python.full_path(),

doc/manual/source/SUMMARY.md.in

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -26,6 +26,7 @@
2626
- [Derivation Outputs and Types of Derivations](store/derivation/outputs/index.md)
2727
- [Content-addressing derivation outputs](store/derivation/outputs/content-address.md)
2828
- [Input-addressing derivation outputs](store/derivation/outputs/input-address.md)
29+
- [Derivation Resolution](store/resolution.md)
2930
- [Building](store/building.md)
3031
- [Store Types](store/types/index.md)
3132
{{#include ./store/types/SUMMARY.md}}

doc/manual/source/protocols/derivation-aterm.md

Lines changed: 3 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -1,6 +1,8 @@
11
# Derivation "ATerm" file format
22

3-
For historical reasons, [store derivations][store derivation] are stored on-disk in [ATerm](https://homepages.cwi.nl/~daybuild/daily-books/technology/aterm-guide/aterm-guide.html) format.
3+
For historical reasons, [store derivations][store derivation] are stored on-disk in "Annotated Term" (ATerm) format
4+
([guide](https://homepages.cwi.nl/~daybuild/daily-books/technology/aterm-guide/aterm-guide.html),
5+
[paper](https://doi.org/10.1002/(SICI)1097-024X(200003)30:3%3C259::AID-SPE298%3E3.0.CO;2-Y)).
46

57
## The ATerm format used
68

doc/manual/source/protocols/json/schema/derivation-v3.yaml

Lines changed: 2 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -39,9 +39,9 @@ properties:
3939
This is a guard that allows us to continue evolving this format.
4040
The choice of `3` is fairly arbitrary, but corresponds to this informal version:
4141
42-
- Version 0: A-Term format
42+
- Version 0: ATerm format
4343
44-
- Version 1: Original JSON format, with ugly `"r:sha256"` inherited from A-Term format.
44+
- Version 1: Original JSON format, with ugly `"r:sha256"` inherited from ATerm format.
4545
4646
- Version 2: Separate `method` and `hashAlgo` fields in output specs
4747

doc/manual/source/store/derivation/index.md

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -245,7 +245,7 @@ If those other derivations *also* abide by this common case (and likewise for tr
245245
> note the ".drv"
246246
> ```
247247
248-
## Extending the model to be higher-order
248+
## Extending the model to be higher-order {#dynamic}
249249
250250
**Experimental feature**: [`dynamic-derivations`](@docroot@/development/experimental-features.md#xp-feature-dynamic-derivations)
251251

doc/manual/source/store/derivation/outputs/content-address.md

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -167,7 +167,7 @@ It is only in the potential for that check to fail that they are different.
167167
>
168168
> In a future world where floating content-addressing is also stable, we in principle no longer need separate [fixed](#fixed) content-addressing.
169169
> Instead, we could always use floating content-addressing, and separately assert the precise value content address of a given store object to be used as an input (of another derivation).
170-
> A stand-alone assertion object of this sort is not yet implemented, but its possible creation is tracked in [Issue #11955](https://github.com/NixOS/nix/issues/11955).
170+
> A stand-alone assertion object of this sort is not yet implemented, but its possible creation is tracked in [issue #11955](https://github.com/NixOS/nix/issues/11955).
171171
>
172172
> In the current version of Nix, fixed outputs which fail their hash check are still registered as valid store objects, just not registered as outputs of the derivation which produced them.
173173
> This is an optimization that means if the wrong output hash is specified in a derivation, and then the derivation is recreated with the right output hash, derivation does not need to be rebuilt --- avoiding downloading potentially large amounts of data twice.

doc/manual/source/store/derivation/outputs/input-address.md

Lines changed: 192 additions & 15 deletions
Original file line numberDiff line numberDiff line change
@@ -6,26 +6,203 @@
66
That is to say, an input-addressed output's store path is a function not of the output itself, but of the derivation that produced it.
77
Even if two store paths have the same contents, if they are produced in different ways, and one is input-addressed, then they will have different store paths, and thus guaranteed to not be the same store object.
88

9-
<!---
9+
## Modulo fixed-output derivations {#hash-modulo}
1010

11-
### Modulo fixed-output derivations
11+
So how do we compute the hash part of the output paths of an input-addressed derivation?
12+
This is done by the function `hashDerivationModulo`, shown below.
1213

13-
**TODO hash derivation modulo.**
14+
First, a word on inputs.
15+
`hashDerivationModulo` is only defined on derivations whose [inputs](@docroot@/store/derivation/index.md#inputs) take the first-order form:
16+
```typescript
17+
type ConstantPath = {
18+
path: StorePath;
19+
};
1420

15-
So how do we compute the hash part of the output path of a derivation?
16-
This is done by the function `hashDrv`, shown in Figure 5.10.
17-
It distinguishes between two cases.
18-
If the derivation is a fixed-output derivation, then it computes a hash over just the `outputHash` attributes.
21+
type FirstOrderOutputPath = {
22+
drvPath: StorePath;
23+
output: OutputName;
24+
};
1925

20-
If the derivation is not a fixed-output derivation, we replace each element in the derivation’s inputDrvs with the result of a call to `hashDrv` for that element.
21-
(The derivation at each store path in `inputDrvs` is converted from its on-disk ATerm representation back to a `StoreDrv` by the function `parseDrv`.) In essence, `hashDrv` partitions store derivations into equivalence classes, and for hashing purpose it replaces each store path in a derivation graph with its equivalence class.
26+
type FirstOrderDerivingPath = ConstantPath | FirstOrderOutputPath;
2227

23-
The recursion in Figure 5.10 is inefficient:
24-
it will call itself once for each path by which a subderivation can be reached, i.e., `O(V k)` times for a derivation graph with `V` derivations and with out-degree of at most `k`.
25-
In the actual implementation, memoisation is used to reduce this to `O(V + E)` complexity for a graph with E edges.
28+
type Inputs = Set<FirstOrderDerivingPath>;
29+
```
2630

27-
-->
31+
For the the algorithm below, we adopt a derivation where the two types of (first order) derived paths are partitioned into two sets, as follows:
32+
```typescript
33+
type Derivation = {
34+
// inputs: Set<FirstOrderDerivingPath>; // replaced
35+
inputSrcs: Set<ConstantPath>; // new instead
36+
inputDrvOutputs: Set<FirstOrderOutputPath>; // new instead
37+
// ...other fields...
38+
};
39+
```
2840

41+
In the [currently-experimental][xp-feature-dynamic-derivations] higher-order case where outputs of outputs are allowed as [deriving paths][deriving-path] and thus derivation inputs, derivations using that generalization are not valid arguments to this function.
42+
Those derivations must be (partially) [resolved](@docroot@/store/resolution.md) enough first, to the point where no such higher-order inputs remain.
43+
Then, and only then, can input addresses be assigned.
44+
45+
```
46+
function hashDerivationModulo(drv) -> Hash:
47+
assert(drv.outputs are input-addressed)
48+
drv′ ← drv with {
49+
inputDrvOutputs = ⋃(
50+
assert(drvPath is store path)
51+
case hashOutputsOrDerivationModulo(readDrv(drvPath)) of
52+
drvHash : Hash →
53+
(drvHash.toBase16(), output)
54+
outputHashes : Map[String, Hash] →
55+
(outputHashes[output].toBase16(), "out")
56+
| (drvPath, output) ∈ drv.inputDrvOutputs
57+
)
58+
}
59+
return hashSHA256(printDrv(drv′))
60+
61+
function hashOutputsOrDerivationModulo(drv) -> Map[String, Hash] | Hash:
62+
if drv.outputs are content-addressed:
63+
return {
64+
outputName ↦ hashSHA256(
65+
"fixed:out:" + ca.printMethodAlgo() +
66+
":" + ca.hash.toBase16() +
67+
":" + ca.makeFixedOutputPath(drv.name, outputName))
68+
| (outputName ↦ output) ∈ drv.outputs
69+
, ca = output.contentAddress // or get from build trace if floating
70+
}
71+
else: // drv.outputs are input-addressed
72+
return hashDerivationModulo(drv)
73+
```
74+
75+
### `hashDerivationModulo`
76+
77+
We replace each element in the derivation's `inputDrvOutputs` using data from a call to `hashOutputsOrDerivationModulo` on the `drvPath` of that element.
78+
When `hashOutputsOrDerivationModulo` returns a single drv hash (because the input derivation in question is input-addressing), we simply swap out the `drvPath` for that hash, and keep the same output name.
79+
When `hashOutputsOrDerivationModulo` returns a map of content addresses per per-output, we look up the output in question, and pair it with the output name `out`.
80+
81+
The resulting pseudo-derivation (with hashes instead of store paths in `inputDrvs`) is then printed (in the ["ATerm" format](@docroot@/protocols/derivation-aterm.md)) and hashes, and this becomes the "hash modulo" of the derivation.
82+
83+
When calculating output hashes, `hashDerivationModulo` is called on an almost-complete input-addressing derivation, which is just missing its input-addressed outputs paths.
84+
The derivation hash is then used to calculate output paths for each output.
85+
<!-- TODO describe how this is done. -->
86+
Those output paths can then be substituted into the almost-complete input-addressed derivation to complete it.
87+
88+
> **Note**
89+
>
90+
> There may be an unintentional deviation from specification currently implemented in the `(outputHashes[output].toBase16(), "out")` case.
91+
> This is not fatal because the deviation would only apply for content-addressing derivations with more than one output, and that only occurs in the floating case, which is [experimental][xp-feature-ca-derivations].
92+
> Once this bug is fixed, this note will be removed.
93+
94+
### `hashOutputsOrDerivationModulo`
95+
96+
How does `hashOutputsOrDerivationModulo` in turn work?
97+
It consists of two main cases, based on whether the outputs of the derivation are to be input-addressed or content-addressed.
98+
99+
#### Input-addressed outputs case
100+
101+
In the input-addressed case, it just calls `hashDerivationModulo`, and returns that derivation hash.
102+
This makes `hashDerivationModulo` and `hashOutputsOrDerivationModulo` mutually-recursive.
103+
104+
> **Note**
105+
>
106+
> In this case, `hashDerivationModulo` is being called on a *complete* input-addressing derivation that already has its output paths calculated.
107+
> The `inputDrvs` substitution takes place anyways.
108+
109+
#### Content-addressed outputs case
110+
111+
If the outputs are [content-addressed](./content-address.md), then it computes a hash for each output derived from the content-address of that output.
112+
113+
> **Note**
114+
>
115+
> In the [fixed](./content-address.md#fixed) content-addressing case, the outputs' content addresses are statically specified in advanced, so this always just works.
116+
> (The fixed case is what the pseudo-code shows.)
117+
>
118+
> In the [floating](./content-address.md#floating) case, the content addresses are not specified in advanced.
119+
> This is what the "or get from build trace if floating" comment refers to.
120+
> In this case, the algorithm is *stuck* until the input in question is built, and we know what the actual contents of the output in question is.
121+
>
122+
> That is OK however, because there is no problem with delaying the assigning of input addresses (which, remember, is what `hashDerivationModulo` is ultimately for) until all inputs are known.
123+
124+
### Performance
125+
126+
The recursion in the algorithm is potentially inefficient:
127+
it could call itself once for each path by which a subderivation can be reached, i.e., `O(V^k)` times for a derivation graph with `V` derivations and with out-degree of at most `k`.
128+
In the actual implementation, memoisation is used to reduce this to `O(V + E)` complexity for a graph with `E` edges.
129+
130+
### Semantic properties
131+
132+
In essence, `hashDerivationModulo` partitions input-addressing derivations into equivalence classes: every derivation in that equivalence class is mapped to the same derivation hash.
133+
We can characterize this equivalence relation directly, by working bottom up.
134+
135+
We start by defining an equivalence relation on first-order output deriving paths that refer content-addressed derivation outputs. Two such paths are equivalent if they refer to the same store object:
136+
137+
\\[
138+
\\begin{prooftree}
139+
\\AxiomC{$d\_1$ is content-addressing}
140+
\\AxiomC{$d\_2$ is content-addressing}
141+
\\AxiomC{${}^\*(d\_1, o\_1) = {}^\*(d\_2, o\_2)$}
142+
\\TrinaryInfC{$(d\_1, o\_1) \\,\\sim_{\\mathrm{CA}}\\, (d\_2, o\_2)$}
143+
\\end{prooftree}
144+
\\]
145+
146+
where \\({}^*(d, o)\\) denotes the store object that the output deriving path refers to.
147+
148+
We will also need the following construction to lift any equivalence relation on \\(X\\) to an equivalence relation on (finite) sets of \\(X\\) (in short, \\(\\mathcal{P}(X)\\)):
149+
150+
\\[
151+
\\begin{prooftree}
152+
\\AxiomC{$\\forall a \\in A. \\exists b \\in B. a \\,\\sim\_X\\, b$}
153+
\\AxiomC{$\\forall b \\in B. \\exists a \\in A. b \\,\\sim\_X\\, a$}
154+
\\BinaryInfC{$A \\,\\sim_{\\mathcal{P}(X)}\\, B$}
155+
\\end{prooftree}
156+
\\]
157+
158+
Now we can define the equivalence relation \\(\\sim_\\mathrm{IA}\\) on input-addressed derivation outputs. Two input-addressed outputs are equivalent if their derivations are equivalent (via the yet-to-be-defined \\(\\sim_{\\mathrm{IADrv}}\\) relation) and their output names are the same:
159+
160+
\\[
161+
\\begin{prooftree}
162+
\\AxiomC{$d\_1$ is input-addressing}
163+
\\AxiomC{$d\_2$ is input-addressing}
164+
\\AxiomC{$d\_1 \\,\\sim_{\\mathrm{IADrv}}\\, d\_2$}
165+
\\AxiomC{$o\_1 = o\_2$}
166+
\\QuaternaryInfC{$(d\_1, o\_1) \\,\\sim_{\\mathrm{IA}}\\, (d\_2, o\_2)$}
167+
\\end{prooftree}
168+
\\]
169+
170+
And now we can define \\(\\sim_{\\mathrm{IADrv}}\\).
171+
Two input-addressed derivations are equivalent if their content-addressed inputs are equivalent, their input-addressed inputs are also equivalent, and they are otherwise equal:
172+
173+
\\[
174+
\\begin{prooftree}
175+
\\AxiomC{$
176+
\\mathrm{caInputs}(d\_1)
177+
\\,\\sim_{\\mathcal{P}(\\mathrm{CA})}\\,
178+
\\mathrm{caInputs}(d\_2)
179+
$}
180+
\\AxiomC{$
181+
\\mathrm{iaInputs}(d\_1)
182+
\\,\\sim_{\\mathcal{P}(\\mathrm{IA})}\\,
183+
\\mathrm{iaInputs}(d\_2)
184+
$}
185+
\\AxiomC{$
186+
d\_1\left[\\mathrm{inputDrvOutputs} := \\{\\}\right]
187+
\=
188+
d\_2\left[\\mathrm{inputDrvOutputs} := \\{\\}\right]
189+
$}
190+
\\TrinaryInfC{$d\_1 \\,\\sim_{\\mathrm{IADrv}}\\, d\_2$}
191+
\\end{prooftree}
192+
\\]
193+
194+
where \\(\\mathrm{caInputs}(d)\\) returns the content-addressed inputs of \\(d\\) and \\(\\mathrm{iaInputs}(d)\\) returns the input-addressed inputs.
195+
196+
> **Note**
197+
>
198+
> An astute reader might notice that that nowhere does `inputSrcs` enter into these definitions.
199+
> That means that replacing an input derivation with its outputs directly added to `inputSrcs` always results in a derivation in a different equivalence class, despite the resulting input closure (as would be mounted in the store at build time) being the same.
200+
> [Issue #9259](https://github.com/NixOS/nix/issues/9259) is about creating a coarser equivalence relation to address this.
201+
>
202+
> \\(\\sim_\mathrm{Drv}\\) from [derivation resolution](@docroot@/store/resolution.md) is such an equivalence relation.
203+
> It is coarser than this one: any two derivations which are "'hash modulo'-equivalent" (\\(\\sim_\mathrm{IADrv}\\)) are also "resolution-equivalent" (\\(\\sim_\mathrm{Drv}\\)).
204+
> It also relates derivations whose `inputDrvOutputs` have been rewritten into `inputSrcs`.
205+
206+
[deriving-path]: @docroot@/store/derivation/index.md#deriving-path
207+
[xp-feature-dynamic-derivations]: @docroot@/development/experimental-features.md#xp-feature-dynamic-derivations
29208
[xp-feature-ca-derivations]: @docroot@/development/experimental-features.md#xp-feature-ca-derivations
30-
[xp-feature-git-hashing]: @docroot@/development/experimental-features.md#xp-feature-git-hashing
31-
[xp-feature-impure-derivations]: @docroot@/development/experimental-features.md#xp-feature-impure-derivations
Lines changed: 58 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,58 @@
1+
# Derivation Resolution
2+
3+
To *resolve* a derivation is to replace its [inputs] with the simplest inputs --- plain store paths --- that denote the same store objects.
4+
5+
[Deriving paths][deriving-path] intentionally make it possible to refer to the same [store object] in multiple ways.
6+
This is a consequence of content-addressing, since different derivations can produce the same outputs, and the same date can also be manually added to the store.
7+
This is also a consequence even of input-addressing, as an output can be referred to by derivation and output name, or directly by its store path input address.
8+
Since dereferencing deriving paths is thus not injective, it induces an equivalence relation on deriving paths.
9+
10+
Let's call this equivalence relation \\(\\sim\\), where \\(p_1 \\sim p_2\\) means that deriving paths \\(p_1\\) and \\(p_2\\) refer to the same store object.
11+
12+
**Content Equivalence**: Two deriving paths are equivalent if they refer to the same store object:
13+
14+
\\[
15+
\\begin{prooftree}
16+
\\AxiomC{${}^*p_1 = {}^*p_2$}
17+
\\UnaryInfC{$p_1 \\,\\sim_\\mathrm{DP}\\, p_2$}
18+
\\end{prooftree}
19+
\\]
20+
21+
where \\({}^*p\\) denotes the store object that deriving path \\(p\\) refers to.
22+
23+
This also induces an equivalence relation on sets of deriving paths:
24+
25+
\\[
26+
\\begin{prooftree}
27+
\\AxiomC{$\\{ {}^*p | p \\in P_1 \\} = \\{ {}^*p | p \\in P_2 \\}$}
28+
\\UnaryInfC{$P_1 \\,\\sim_{\\mathcal{P}(\\mathrm{DP})}\\, P_2$}
29+
\\end{prooftree}
30+
\\]
31+
32+
**Input Content Equivalence**: This, in turn, induces an equivalence relation on derivations: two derivations are equivalent if their inputs are equivalent, and they are otherwise equal:
33+
34+
\\[
35+
\\begin{prooftree}
36+
\\AxiomC{$\\mathrm{inputs}(d_1) \\,\\sim_{\\mathcal{P}(\\mathrm{DP})}\\, \\mathrm{inputs}(d_2)$}
37+
\\AxiomC{$
38+
d\_1\left[\\mathrm{inputs} := \\{\\}\right]
39+
\=
40+
d\_2\left[\\mathrm{inputs} := \\{\\}\right]
41+
$}
42+
\\BinaryInfC{$d_1 \\,\\sim_\\mathrm{Drv}\\, d_2$}
43+
\\end{prooftree}
44+
\\]
45+
46+
Derivation resolution always maps derivations to input-content-equivalent derivations.
47+
48+
Similar to evaluation, we can also speak of *partial* vs *total* derivation resolution.
49+
Total resolution is the function described above.
50+
For partial resolution, a derivation is related to equivalent derivations with the same or simpler inputs, but not all those inputs will be plain store paths.
51+
This is useful when the input refers to a floating content addressed output we have not yet built --- we don't know what (content-address) store path will used for that derivation, so we are "stuck" trying to resolve derived path in question.
52+
Partial resolution is not a function, but an (assymetic) relation, created by directing the above equivalence relation so the right-side items are always equal or simpler.
53+
(This is the usual practice for evaluation relations.)
54+
Like well-behaved evaluation relations, partial resolution is [*confluent*](https://en.wikipedia.org/wiki/Confluence_(abstract_rewriting)).
55+
56+
[store object]: @docroot@/store/store-object.md
57+
[inputs]: @docroot@/store/derivation/index.md#inputs
58+
[deriving-path]: @docroot@/store/derivation/index.md#deriving-path

doc/manual/theme/head.hbs

Lines changed: 15 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,15 @@
1+
<script>
2+
MathJax = {
3+
loader: {load: ['[tex]/bussproofs']},
4+
tex: {
5+
packages: {'[+]': ['bussproofs']},
6+
// Doesn't seem to work in mathjax 3
7+
//formatError: function(jax, error) {
8+
// console.log(`TeX error in "${jax.latex}": ${error.message}`);
9+
// return jax.formatError(error);
10+
//}
11+
}
12+
};
13+
</script>
14+
<!-- Load a newer versino of MathJax than mdbook does by default, and which in particular has working relative paths for the "bussproofs" extension. -->
15+
<script async src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/3.0.1/es5/tex-mml-chtml.js"></script>

0 commit comments

Comments
 (0)