
Commit 6b7e30f

1 parent 1c70ef3 commit 6b7e30f


8 files changed

+225
-4
lines changed

Lines changed: 49 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,49 @@
1+
---
2+
title: MARKOV_TRAIN
3+
---
4+
5+
Extracts patterns from a dataset by training a Markov model.
6+
7+
## Syntax
8+
9+
```sql
10+
MARKOV_TRAIN(<string>)
11+
12+
MARKOV_TRAIN(<order>)(<string>)
13+
14+
MARKOV_TRAIN(<order>, <frequency_cutoff>, <num_buckets_cutoff>, <frequency_add>, <frequency_desaturate>) (<string>)
15+
```
16+
17+
## Arguments
18+
19+
| Arguments | Description |
20+
|------------------| ------------------ |
21+
| `string` | Input string |
22+
| `order` | Order of the Markov model used to generate strings |
23+
| `frequency_cutoff` | Frequency cutoff for the Markov model: removes all buckets with a count less than the specified value |
24+
| `num_buckets_cutoff` | Cutoff for the number of different possible continuations of a context: removes all histograms with fewer buckets than the specified value |
25+
| `frequency_add` | Adds a constant to every count to reduce the skew of the probability distribution |
26+
| `frequency_desaturate` | A value in the range 0..1 that moves every frequency towards the average to reduce the skew of the probability distribution |
27+
28+
## Return Type
29+
30+
Implementation-dependent. The result is only meant to be used as an argument for [MARKOV_GENERATE](../20-other-functions/markov_generate.md).
31+
32+
## Examples
33+
34+
```sql
35+
create table model as
36+
select markov_train(concat('bar', number::string)) as bar from numbers(100);
37+
38+
select markov_generate(bar,'{"order":5,"sliding_window_size":8}', 151, (number+100000)::string) as generate
39+
from numbers(5), model;
40+
+-----------+
41+
| generate |
42+
+-----------+
43+
│ bar95 │
44+
│ bar64 │
45+
│ bar85 │
46+
│ bar56 │
47+
│ bar95 │
48+
+-----------+
49+
```
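
The second form in the syntax section passes the model order as a parameter. A minimal, untested sketch that reuses the data from the example above; the order value `3` and the table name `model_order3` are arbitrary choices for illustration, not recommended settings:

```sql
-- Train a Markov model with an explicit order (3 is an illustrative value only).
create table model_order3 as
select markov_train(3)(concat('bar', number::string)) as bar from numbers(100);
```

The resulting `bar` column can then be passed to MARKOV_GENERATE in the same way as the model in the example above.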

docs/en/sql-reference/20-sql-functions/07-aggregate-functions/index.md

Lines changed: 7 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -88,4 +88,10 @@ This page provides a comprehensive overview of aggregate functions in Databend,
8888
| Function | Description | Example |
8989
|----------|-------------|---------|
9090
| [RETENTION](aggregate-retention.md) | Calculates retention rates | `RETENTION(action = 'signup', action = 'purchase')` → `[100, 40]` |
91-
| [WINDOWFUNNEL](aggregate-windowfunnel.md) | Searches for event sequences within time window | `WINDOWFUNNEL(1800)(timestamp, event='view', event='click', event='purchase')` → `2` |
91+
| [WINDOWFUNNEL](aggregate-windowfunnel.md) | Searches for event sequences within time window | `WINDOWFUNNEL(1800)(timestamp, event='view', event='click', event='purchase')` → `2` |
92+
93+
## Anonymization
94+
95+
| Function | Description | Example |
96+
|----------|-------------|---------|
97+
| [MARKOV_TRAIN](aggregate-markov-train.md) | Trains a Markov model | `MARKOV_TRAIN(address)` |

docs/en/sql-reference/20-sql-functions/17-table-functions/index.md

Lines changed: 6 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -52,3 +52,9 @@ This page provides reference information for the table functions in Databend. Ta
5252
|----------|-------------|--------|
5353
| [ICEBERG_MANIFEST](./iceberg-manifest.md) | Shows Iceberg table manifest information | `SELECT * FROM ICEBERG_MANIFEST('mytable')` |
5454
| [ICEBERG_SNAPSHOT](./iceberg-snapshot.md) | Shows Iceberg table snapshot information | `SELECT * FROM ICEBERG_SNAPSHOT('mytable')` |
55+
56+
## Anonymization
57+
58+
| Function | Description | Example |
59+
|----------|-------------|---------|
60+
| [OBFUSCATE](obfuscate.md) | Anonymizes a dataset | `SELECT * FROM OBFUSCATE(users)` |
Lines changed: 72 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,72 @@
1+
---
2+
title: OBFUSCATE
3+
---
4+
5+
Anonymizes a dataset. This is a quick tool; for more complex scenarios, use the underlying functions directly: [MARKOV_TRAIN](../07-aggregate-functions/aggregate-markov-train.md), [MARKOV_GENERATE](../20-other-functions/markov_generate.md), and [FEISTEL_OBFUSCATE](../20-other-functions/feistel_obfuscate.md).
6+
7+
## Syntax
8+
9+
```sql
10+
OBFUSCATE('<table>'[, seed => <seed>])
11+
```
12+
13+
## Examples
14+
15+
```sql
16+
create or replace table users as
17+
select * from (values
18+
(1, 'James Smith', '[email protected]', '123 Fake St, Anytown, CA 91234'),
19+
(2, 'Mary Johnson', '[email protected]', '456 Fictional Ave, Springfield, IL 62704'),
20+
(3, 'John Williams', '[email protected]', '789 Imaginary Ln, Pleasantville, NY 10570'),
21+
(4, 'Patricia Brown', '[email protected]', '101 Nonexistent Rd, Metropolis, KS 66666'),
22+
(5, 'Robert Jones', '[email protected]', '222 Make Believe Dr, Smallville, OH 44688'),
23+
(6, 'Jennifer Garcia', '[email protected]', '333 Phantom Ct, Gotham, NJ 07005'),
24+
(7, 'Michael Miller', '[email protected]', '444 Unreal Blvd, Wonderland, TX 75001'),
25+
(8, 'Linda Davis', '[email protected]', '555 Fabricated Way, Neverland, FL 32801'),
26+
(9, 'William Rodriguez', '[email protected]', '666 Bogus Pl, Oz, KS 67445'),
27+
(10, 'Elizabeth Martinez', '[email protected]', '777 Sham Ln, Camelot, CA 90210'),
28+
(11, 'James Johnson', '[email protected]', '888 Pretend Ave, Atlantis, GA 30303'),
29+
(12, 'Mary Williams', '[email protected]', '999 Simulated Rd, Utopia, MI 48009'),
30+
(13, 'John Brown', '[email protected]', '1010 Counterfeit St, El Dorado, AR 71730'),
31+
(14, 'Patricia Jones', '[email protected]', '10 Counterfeit St, El Dorado, AR 71730'),
32+
(15, 'Robert Garcia', '[email protected]', '1111 Phony Ln, Shangri-La, CO 80014'),
33+
(16, 'Jennifer Miller', '[email protected]', '1212 Artificial Dr, Rivendell, WA 98101'),
34+
(17, 'Michael Davis', '[email protected]', '1313 Spurious Ave, Narnia, TN 37201'),
35+
(18, 'Linda Rodriguez', '[email protected]', '1414 Pseudo Rd, Brigadoon, PA 19003'),
36+
(19, 'William Martinez', '[email protected]', '1515 Feigned St, Never Never Land, CA 90210'),
37+
(20, 'Elizabeth Smith', '[email protected]', '1616 Imitation Ln, Asgard, NY 10001'),
38+
(21, 'James Williams', '[email protected]', '1717 Simulated Ave, Middle Earth, OR 97006'),
39+
(22, 'Mary Brown', '[email protected]', '123 Fake St, Anytown, CA 91234'),
40+
(23, 'John Jones', '[email protected]', '456 Fictitious Ave, Springfield, IL 62704'),
41+
(24, 'Patricia Garcia', '[email protected]', '789 Illusion Ln, Pleasantville, NY 10570'),
42+
(25, 'Robert Miller', '[email protected]', '101 Imaginary Rd, Metropolis, KS 66666'),
43+
(26, 'Jennifer Davis', '[email protected]', '222 Make Believe Dr, Neverland, FL 33333'),
44+
(27, 'Michael Rodriguez', '[email protected]', '333 Pretend Ct, Wonderland, TX 77777'),
45+
(28, 'Linda Martinez', '[email protected]', '444 Fabricated Blvd, Utopia, WA 98101'),
46+
(29, 'William Smith', '[email protected]', '555 Sham Way, Mirage, AZ 85001'),
47+
(30, 'Elizabeth Johnson', '[email protected]', '666 Bogus Pl, Fantasyland, GA 30303'),
48+
(31, 'James Brown', '[email protected]', '777 Unreal Ave, Dreamville, CO 80202'),
49+
(32, 'Mary Jones', '[email protected]', '888 Counterfeit Ln, Wishville, OH 44114'),
50+
(33, 'John Garcia', '[email protected]', '999 Phony Rd, Delusion, MI 48075'),
51+
(34, 'Patricia Miller', '[email protected]', '1010 Simulated St, Echo, NV 89109'),
52+
(35, 'Robert Davis', '[email protected]', '1111 Spurious Ave, Replica, PA 19103'),
53+
(36, 'Jennifer Rodriguez', '[email protected]', '1212 Artificial Dr, Clone, NC 27601'),
54+
(37, 'Michael Martinez', '[email protected]', '1313 Synthetic Ct, Duplicate, TN 37201'),
55+
(38, 'Linda Smith', '[email protected]', '1414 Feigned Blvd, Imposter, IN 46204'),
56+
(39, 'William Johnson', '[email protected]', '1515 Pseudo Pl, Mimic, MN 55401'),
57+
(40, 'Elizabeth Williams', '[email protected]', '1616 Forged Way, Facsimile, AL 35203')
58+
) users(id, name, email, address);
59+
60+
61+
select * from obfuscate(users, seed=>10) limit 5 offset 20;
62+
╭────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
63+
│ id │ name │ email │ address │
64+
│ Nullable(UInt64) │ Nullable(String) │ Nullable(String) │ Nullable(String) │
65+
├──────────────────┼───────────────────┼───────────────────────────┼─────────────────────────────────────────┤
66+
│ 21 │ William Rodriguez │ michael.davis@example.com │ 1212 Artificial Dr, Rivendell, WA 98101 │
67+
│ 16 │ Jennifer Garcia │ patricia.brown@gmail │ 1313 Spurious Ave, NC 27601 │
68+
│ 25 │ John Brown │ michael.martinez@example │ 1111 Phony Ln, Asgard, NY 10570 │
69+
│ 30 │ Mary Brown │ jennifer.garcia@gmail.com │ 222 Make Believe Dr, Clone, NC 27601 │
70+
│ 24 │ James Smith │ elizabeth.johnson@example │ 444 Fabricated St, Anytown, CA 90210 │
71+
╰────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
72+
```
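
For the more complex scenarios mentioned at the top of this page, the underlying functions can be combined by hand. A rough sketch, assuming the `users` table from the example above; the table name `name_model`, the seed `10`, and the JSON parameters are illustrative, and the query has not been verified against a Databend instance:

```sql
-- Train one Markov model on the text column to anonymize.
create table name_model as
select markov_train(name) as model from users;

-- Obfuscate the numeric id with a Feistel transform and
-- regenerate each name from the trained model.
select feistel_obfuscate(id, 10) as id,
       markov_generate(model, '{"order":5,"sliding_window_size":8}', 10, name) as name
from users, name_model;
```

This is not necessarily what OBFUSCATE does internally. It only shows how the three functions fit together when per-column control is needed.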

docs/en/sql-reference/20-sql-functions/19-test-functions/sleep.md

Lines changed: 3 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -4,9 +4,9 @@ title: SLEEP
44

55
Sleeps `seconds` seconds on each data block.
66

7-
!!! warning
8-
Only used for testing where sleep is required.
9-
7+
:::caution
8+
Only used for testing where sleep is required.
9+
:::
1010

1111
## Syntax
1212

Lines changed: 43 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,43 @@
1+
---
2+
title: FEISTEL_OBFUSCATE
3+
---
4+
5+
Transforms numbers for anonymization.
6+
7+
## Syntax
8+
9+
```sql
10+
FEISTEL_OBFUSCATE( <number>, <seed> )
11+
```
12+
13+
## Arguments
14+
15+
| Arguments | Description |
16+
| ----------- | ----------- |
17+
| `number` | Input |
18+
| `seed` | Seed for the transformation. With the same seed, corresponding non-text columns in different tables are transformed in the same way, so the tables can still be JOINed after obfuscation |
19+
20+
## Return Type
21+
22+
Same as input
23+
24+
## Examples
25+
26+
```sql
27+
SELECT feistel_obfuscate(10000,1561819567875);
28+
+------------------------------------------+
29+
| feistel_obfuscate(10000, 1561819567875) |
30+
+------------------------------------------+
31+
| 15669 |
32+
+------------------------------------------+
33+
```
34+
35+
`feistel_obfuscate` preserves the number of bits of the original input. If the output needs to map to a larger range, add an offset to the input, e.g. `feistel_obfuscate(n+10000, 50)`.
36+
```sql
37+
SELECT feistel_obfuscate(10,1561819567875);
38+
+------------------------------------------+
39+
| feistel_obfuscate(10, 1561819567875) |
40+
+------------------------------------------+
41+
| 13 |
42+
+------------------------------------------+
43+
```
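
The `seed` description above implies that keys obfuscated with the same seed still match across tables. A sketch with two hypothetical tables, `users(id)` and `orders(user_id)`, and an arbitrary seed `42`; it assumes both key columns have the same numeric type:

```sql
-- Both key columns are transformed with the same seed (42, chosen arbitrarily),
-- so rows that joined before obfuscation should still join afterwards.
select u.id, o.user_id
from (select feistel_obfuscate(id, 42) as id from users) u
join (select feistel_obfuscate(user_id, 42) as user_id from orders) o
  on u.id = o.user_id;
```

Using a different seed on either side would be expected to break the join, which is why the seed has to be shared across every table that takes part in it.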

docs/en/sql-reference/20-sql-functions/20-other-functions/index.md

Lines changed: 2 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -14,3 +14,5 @@ This section collects assorted utilities that do not fit into the major function
1414
| [REMOVE_NULLABLE](remove-nullable.md) | Strip NULLability from a column value |
1515
| [TO_NULLABLE](to-nullable.md) | Convert a value to a nullable type |
1616
| [TYPEOF](typeof.md) | Return the name of a value’s data type |
17+
| [MARKOV_GENERATE](markov_generate.md) | Generate anonymized strings |
18+
| [FEISTEL_OBFUSCATE](feistel_obfuscate.md) | Transform numbers for anonymization |
Lines changed: 43 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,43 @@
1+
---
2+
title: MARKOV_GENERATE
3+
---
4+
5+
Uses the model trained by [MARKOV_TRAIN](../07-aggregate-functions/aggregate-markov-train.md) to anonymize a dataset.
6+
7+
## Syntax
8+
9+
```sql
10+
MARKOV_GENERATE( <model>, <params>, <seed>, <determinator> )
11+
```
12+
13+
## Arguments
14+
15+
| Arguments | Description |
16+
| ----------- | ----------- |
17+
| `model` | The model returned by MARKOV_TRAIN |
18+
| `params`| JSON string, e.g. `{"order": 5, "sliding_window_size": 8}` <br/> `order`: order of the Markov model used to generate strings <br/> `sliding_window_size`: size of a sliding window in the source string; its hash is used as a seed for the RNG in the Markov model |
19+
| `seed` | Seed value |
20+
| `determinator`| Source string |
21+
22+
## Return Type
23+
24+
String.
25+
26+
## Examples
27+
28+
```sql
29+
create table model as
30+
select markov_train(concat('bar', number::string)) as bar from numbers(100);
31+
32+
select markov_generate(bar,'{"order":5,"sliding_window_size":8}', 151, (number+100000)::string) as generate
33+
from numbers(5), model;
34+
+-----------+
35+
| generate |
36+
+-----------+
37+
│ bar95 │
38+
│ bar64 │
39+
│ bar85 │
40+
│ bar56 │
41+
│ bar95 │
42+
+-----------+
43+
```
