Commit 1c70ef3

obfuscate cn
1 parent a6a04fb commit 1c70ef3

10 files changed

+230
-6
lines changed

Lines changed: 47 additions & 0 deletions
@@ -0,0 +1,47 @@
1+
---
2+
title: MARKOV_TRAIN
3+
---
4+
5+
Uses a Markov model to extract patterns from a dataset.
6+
7+
## Syntax
8+
9+
```sql
10+
MARKOV_TRAIN(<string>)
11+
12+
MARKOV_TRAIN(<order>)(<string>)
13+
14+
MARKOV_TRAIN(<order>, <frequency_cutoff>, <num_buckets_cutoff>, <frequency_add>, <frequency_desaturate>) (<string>)
15+
```
16+
17+
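| Parameter | Description |
18+
|------------------| ------------------ |
19+
| `string` | The input. |
20+
| `order` | Context length of the model. |
21+
| `frequency_cutoff` | Frequency cutoff: removes all buckets whose count is below the threshold. |
22+
| `num_buckets_cutoff` | Cutoff on the number of distinct successors of a context: removes all histograms that have fewer buckets than the specified value. |
23+
| `frequency_add` | Adds a constant to every count to reduce the skew of the probability distribution. |
24+
| `frequency_desaturate` | A value in 0..1: moves every frequency toward the mean to reduce the skew of the probability distribution. |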
25+
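To make the four tuning parameters concrete, here is a minimal Python sketch of how such adjustments could act on the successor histograms of a character-level Markov model. It is not Databend's implementation: the helper name `adjust_histograms`, the order in which the steps run, and the exact desaturation formula are assumptions chosen for illustration.

```python
def adjust_histograms(table, frequency_cutoff=0, num_buckets_cutoff=0,
                      frequency_add=0, frequency_desaturate=0.0):
    """Adjust a table of {context: {next_char: count}} histograms.

    Hypothetical helper: the step order and formulas are assumptions.
    """
    adjusted = {}
    for context, histogram in table.items():
        # frequency_cutoff: drop buckets whose count is below the threshold.
        kept = {ch: n for ch, n in histogram.items() if n >= frequency_cutoff}
        # num_buckets_cutoff: drop histograms with too few distinct successors.
        if not kept or len(kept) < num_buckets_cutoff:
            continue
        # frequency_add: add a constant to every remaining count.
        kept = {ch: n + frequency_add for ch, n in kept.items()}
        # frequency_desaturate: move each count toward the histogram mean.
        mean = sum(kept.values()) / len(kept)
        adjusted[context] = {
            ch: n * (1.0 - frequency_desaturate) + mean * frequency_desaturate
            for ch, n in kept.items()
        }
    return adjusted


# After the context "ab", the training data saw "c" 10 times, "d" once, "e" 4 times.
table = {"ab": {"c": 10, "d": 1, "e": 4}}
smoothed = adjust_histograms(table, frequency_cutoff=2, frequency_add=1,
                             frequency_desaturate=0.5)
print(smoothed)  # {'ab': {'c': 9.5, 'e': 6.5}}
```

In this sketch the rare successor `d` is dropped by the cutoff, and the gap between the remaining counts shrinks from 6 to 3, which is the "less skewed distribution" effect the table describes.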
26+
## Return Type
27+
28+
Implementation-dependent; intended only as an argument to [MARKOV_GENERATE](../20-other-functions/markov_generate.md).
29+
30+
## Examples
31+
32+
```sql
33+
create table model as
34+
select markov_train(concat('bar', number::string)) as bar from numbers(100);
35+
36+
select markov_generate(bar,'{"order":5,"sliding_window_size":8}', 151, (number+100000)::string) as generate
37+
from numbers(5), model;
38+
+-----------+
39+
| generate |
40+
+-----------+
41+
│ bar95 │
42+
│ bar64 │
43+
│ bar85 │
44+
│ bar56 │
45+
│ bar95 │
46+
+-----------+
47+
```

docs/cn/sql-reference/20-sql-functions/07-aggregate-functions/aggregate-retention.md

Lines changed: 1 addition & 1 deletion
@@ -1,5 +1,5 @@
11
---
2-
title: Retention Analysis
2+
title: RETENTION
33
---
44

55
Aggregate function

docs/cn/sql-reference/20-sql-functions/07-aggregate-functions/index.md

Lines changed: 7 additions & 1 deletion
@@ -88,4 +88,10 @@ title: '聚合函数'
8888
| Function | Description | Example |
8989
|----------|-------------|---------|
9090
| [RETENTION](aggregate-retention.md) | Calculates retention | `RETENTION(action = 'signup', action = 'purchase')` → `[100, 40]` |
91-
| [WINDOWFUNNEL](aggregate-windowfunnel.md) | Searches for a sequence of events within a time window | `WINDOWFUNNEL(1800)(timestamp, event='view', event='click', event='purchase')` → `2` |
91+
| [WINDOWFUNNEL](aggregate-windowfunnel.md) | Searches for a sequence of events within a time window | `WINDOWFUNNEL(1800)(timestamp, event='view', event='click', event='purchase')` → `2` |
92+
93+
## Anonymization
94+
95+
| Function | Description | Example |
96+
|----------|-------------|---------|
97+
| [MARKOV_TRAIN](aggregate-markov-train.md) | Trains a Markov model | `MARKOV_TRAIN(address)` |

docs/cn/sql-reference/20-sql-functions/17-table-functions/index.md

Lines changed: 6 additions & 0 deletions
@@ -36,3 +36,9 @@ title: 表函数 (Table Functions)
3636
|------|------|------|
3737
| [ICEBERG_MANIFEST](iceberg-manifest) | Shows manifest information for an Iceberg table | `SELECT * FROM ICEBERG_MANIFEST('mytable')` |
3838
| [ICEBERG_SNAPSHOT](iceberg-snapshot) | Shows snapshot information for an Iceberg table | `SELECT * FROM ICEBERG_SNAPSHOT('mytable')` |
39+
40+
## Anonymization
41+
42+
| Function | Description | Example |
43+
|----------|-------------|---------|
44+
| [OBFUSCATE](obfuscate.md) | Generates anonymized data | `SELECT * FROM OBFUSCATE(users)` |
Lines changed: 72 additions & 0 deletions
@@ -0,0 +1,72 @@
1+
---
2+
title: OBFUSCATE
3+
---
4+
5+
Generates anonymized data. This is a quick utility; for more complex scenarios, it is better to use the underlying functions [MARKOV_TRAIN](../07-aggregate-functions/aggregate-markov-train.md), [MARKOV_GENERATE](../20-other-functions/markov_generate.md), and [FEISTEL_OBFUSCATE](../20-other-functions/feistel_obfuscate.md) directly.
6+
7+
## Syntax
8+
9+
```sql
10+
OBFUSCATE('<table>'[, seed => <seed>])
11+
```
12+
13+
## Examples
14+
15+
```sql
16+
create or replace table users as
17+
select * from (values
18+
(1, 'James Smith', '[email protected]', '123 Fake St, Anytown, CA 91234'),
19+
(2, 'Mary Johnson', '[email protected]', '456 Fictional Ave, Springfield, IL 62704'),
20+
(3, 'John Williams', '[email protected]', '789 Imaginary Ln, Pleasantville, NY 10570'),
21+
(4, 'Patricia Brown', '[email protected]', '101 Nonexistent Rd, Metropolis, KS 66666'),
22+
(5, 'Robert Jones', '[email protected]', '222 Make Believe Dr, Smallville, OH 44688'),
23+
(6, 'Jennifer Garcia', '[email protected]', '333 Phantom Ct, Gotham, NJ 07005'),
24+
(7, 'Michael Miller', '[email protected]', '444 Unreal Blvd, Wonderland, TX 75001'),
25+
(8, 'Linda Davis', '[email protected]', '555 Fabricated Way, Neverland, FL 32801'),
26+
(9, 'William Rodriguez', '[email protected]', '666 Bogus Pl, Oz, KS 67445'),
27+
(10, 'Elizabeth Martinez', '[email protected]', '777 Sham Ln, Camelot, CA 90210'),
28+
(11, 'James Johnson', '[email protected]', '888 Pretend Ave, Atlantis, GA 30303'),
29+
(12, 'Mary Williams', '[email protected]', '999 Simulated Rd, Utopia, MI 48009'),
30+
(13, 'John Brown', '[email protected]', '1010 Counterfeit St, El Dorado, AR 71730'),
31+
(14, 'Patricia Jones', '[email protected]', '10 Counterfeit St, El Dorado, AR 71730'),
32+
(15, 'Robert Garcia', '[email protected]', '1111 Phony Ln, Shangri-La, CO 80014'),
33+
(16, 'Jennifer Miller', '[email protected]', '1212 Artificial Dr, Rivendell, WA 98101'),
34+
(17, 'Michael Davis', '[email protected]', '1313 Spurious Ave, Narnia, TN 37201'),
35+
(18, 'Linda Rodriguez', '[email protected]', '1414 Pseudo Rd, Brigadoon, PA 19003'),
36+
(19, 'William Martinez', '[email protected]', '1515 Feigned St, Never Never Land, CA 90210'),
37+
(20, 'Elizabeth Smith', '[email protected]', '1616 Imitation Ln, Asgard, NY 10001'),
38+
(21, 'James Williams', '[email protected]', '1717 Simulated Ave, Middle Earth, OR 97006'),
39+
(22, 'Mary Brown', '[email protected]', '123 Fake St, Anytown, CA 91234'),
40+
(23, 'John Jones', '[email protected]', '456 Fictitious Ave, Springfield, IL 62704'),
41+
(24, 'Patricia Garcia', '[email protected]', '789 Illusion Ln, Pleasantville, NY 10570'),
42+
(25, 'Robert Miller', '[email protected]', '101 Imaginary Rd, Metropolis, KS 66666'),
43+
(26, 'Jennifer Davis', '[email protected]', '222 Make Believe Dr, Neverland, FL 33333'),
44+
(27, 'Michael Rodriguez', '[email protected]', '333 Pretend Ct, Wonderland, TX 77777'),
45+
(28, 'Linda Martinez', '[email protected]', '444 Fabricated Blvd, Utopia, WA 98101'),
46+
(29, 'William Smith', '[email protected]', '555 Sham Way, Mirage, AZ 85001'),
47+
(30, 'Elizabeth Johnson', '[email protected]', '666 Bogus Pl, Fantasyland, GA 30303'),
48+
(31, 'James Brown', '[email protected]', '777 Unreal Ave, Dreamville, CO 80202'),
49+
(32, 'Mary Jones', '[email protected]', '888 Counterfeit Ln, Wishville, OH 44114'),
50+
(33, 'John Garcia', '[email protected]', '999 Phony Rd, Delusion, MI 48075'),
51+
(34, 'Patricia Miller', '[email protected]', '1010 Simulated St, Echo, NV 89109'),
52+
(35, 'Robert Davis', '[email protected]', '1111 Spurious Ave, Replica, PA 19103'),
53+
(36, 'Jennifer Rodriguez', '[email protected]', '1212 Artificial Dr, Clone, NC 27601'),
54+
(37, 'Michael Martinez', '[email protected]', '1313 Synthetic Ct, Duplicate, TN 37201'),
55+
(38, 'Linda Smith', '[email protected]', '1414 Feigned Blvd, Imposter, IN 46204'),
56+
(39, 'William Johnson', '[email protected]', '1515 Pseudo Pl, Mimic, MN 55401'),
57+
(40, 'Elizabeth Williams', '[email protected]', '1616 Forged Way, Facsimile, AL 35203')
58+
) users(id, name, email, address);
59+
60+
61+
select * from obfuscate(users, seed=>10) limit 5 offset 20;
62+
╭────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
63+
│ id │ name │ email │ address │
64+
│ Nullable(UInt64) │ Nullable(String) │ Nullable(String) │ Nullable(String) │
65+
├──────────────────┼───────────────────┼───────────────────────────┼─────────────────────────────────────────┤
66+
│ 21 │ William Rodriguez │ michael.davis@example.com │ 1212 Artificial Dr, Rivendell, WA 98101 │
67+
│ 16 │ Jennifer Garcia │ patricia.brown@gmail │ 1313 Spurious Ave, NC 27601 │
68+
│ 25 │ John Brown │ michael.martinez@example │ 1111 Phony Ln, Asgard, NY 10570 │
69+
│ 30 │ Mary Brown │ jennifer.garcia@gmail.com │ 222 Make Believe Dr, Clone, NC 27601 │
70+
│ 24 │ James Smith │ elizabeth.johnson@example │ 444 Fabricated St, Anytown, CA 90210 │
71+
╰────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
72+
```

docs/cn/sql-reference/20-sql-functions/19-test-functions/sleep.md

Lines changed: 3 additions & 3 deletions
@@ -4,9 +4,9 @@ title: SLEEP
44

55
Sleeps for `seconds` seconds on each data block.
66

7-
!!! warning
8-
Only for test scenarios that require sleeping.
9-
7+
:::caution
8+
Only for test scenarios that require sleeping.
9+
:::
1010

1111
## Syntax
1212

Lines changed: 42 additions & 0 deletions
@@ -0,0 +1,42 @@
1+
---
2+
title: FEISTEL_OBFUSCATE
3+
---
4+
5+
The FEISTEL_OBFUSCATE function anonymizes numeric values.
6+
7+
## Syntax
8+
9+
```sql
10+
FEISTEL_OBFUSCATE( <number>, <seed> )
11+
```
12+
13+
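## Parameters
14+
15+
| Parameter | Description |
16+
| ----------- | ----------- |
17+
| `<number>` | The data to anonymize. |
18+
| `<seed>` | The encryption seed.<br /> The same seed always produces the same result, which is sometimes useful, but a leaked seed can also expose the original data. |
19+
20+
## Return Type
21+
22+
Same as the input.
23+
24+
## Examples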
25+
26+
```sql
27+
SELECT feistel_obfuscate(10000,1561819567875);
28+
+------------------------------------------+
29+
| feistel_obfuscate(10000, 1561819567875) |
30+
+------------------------------------------+
31+
| 15669 |
32+
+------------------------------------------+
33+
```
34+
feistel_obfuscate preserves the number of bits of the original input. To map values into a larger range, add an offset to the input, for example `feistel_obfuscate(n+10000, 50)`.
35+
```sql
36+
SELECT feistel_obfuscate(10,1561819567875);
37+
+------------------------------------------+
38+
| feistel_obfuscate(10, 1561819567875) |
39+
+------------------------------------------+
40+
| 13 |
41+
+------------------------------------------+
42+
```
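The bit-preserving behaviour described above can be illustrated with a small Feistel network. The following is a minimal Python sketch, not Databend's implementation: the names `feistel_sketch` and `round_value`, the BLAKE2b round function, the number of rounds, and the choice to leave the leading bits untouched are all assumptions made for illustration, so it will not reproduce the outputs shown in the SQL examples.

```python
import hashlib


def round_value(half_value: int, seed: int, round_no: int, half_bits: int) -> int:
    """Keyed round function (hypothetical): hash the inputs, keep `half_bits` bits."""
    data = f"{seed}:{round_no}:{half_value}".encode()
    digest = hashlib.blake2b(data, digest_size=8).digest()
    return int.from_bytes(digest, "big") & ((1 << half_bits) - 1)


def feistel_sketch(n: int, seed: int, rounds: int = 4) -> int:
    """Permute the low bits of a non-negative integer, keeping its bit length."""
    if n < 0:
        raise ValueError("this sketch only handles non-negative integers")
    # Leave the leading 1 bit alone so the bit length cannot change,
    # and split the largest even number of remaining bits into two halves.
    half_bits = max(n.bit_length() - 1, 0) // 2
    if half_bits == 0:
        return n  # too few bits to shuffle
    width = 2 * half_bits
    mask = (1 << half_bits) - 1
    fixed = n >> width                 # untouched high bits
    left = (n >> half_bits) & mask
    right = n & mask
    for r in range(rounds):
        # Classic Feistel step: swap halves and mix in the round function.
        left, right = right, left ^ round_value(right, seed, r, half_bits)
    return (fixed << width) | (left << half_bits) | right


obfuscated = feistel_sketch(10000, seed=50)
print(obfuscated, obfuscated.bit_length() == (10000).bit_length())
```

Because each Feistel step is invertible, the mapping is a permutation within every bit-length class: all 5-bit inputs (16 to 31) map onto the same 5-bit range without collisions. This also shows why adding an offset before obfuscating moves small inputs into a larger output range.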

docs/cn/sql-reference/20-sql-functions/20-other-functions/index.md

Lines changed: 3 additions & 1 deletion
@@ -13,4 +13,6 @@ title: 其他函数
1313
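| [HUMANIZE_SIZE](humanize-size.md) | Formats a byte count in human-readable units |
1414
| [REMOVE_NULLABLE](remove-nullable.md) | Removes nullability from a column value |
1515
| [TO_NULLABLE](to-nullable.md) | Converts a value to a nullable type |
16-
| [TYPEOF](typeof.md) | Returns the name of a value's data type |
16+
| [TYPEOF](typeof.md) | Returns the name of a value's data type |
17+
| [MARKOV_GENERATE](markov_generate.md) | Generates anonymized strings |
18+
| [FEISTEL_OBFUSCATE](feistel_obfuscate.md) | Anonymizes numeric values |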
Lines changed: 43 additions & 0 deletions
@@ -0,0 +1,43 @@
1+
---
2+
title: MARKOV_GENERATE
3+
---
4+
5+
The MARKOV_GENERATE function uses a model trained by [MARKOV_TRAIN](../07-aggregate-functions/aggregate-markov-train.md) to generate anonymized data.
6+
7+
## Syntax
8+
9+
```sql
10+
MARKOV_GENERATE( <model>, <params>, <seed>, <determinator> )
11+
```
12+
13+
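## Parameters
14+
15+
| Parameter | Description |
16+
| ----------- | ----------- |
17+
| `model` | The model produced by markov_train. |
18+
| `params`| Generation parameters as a JSON string, for example `{"order": 5, "sliding_window_size": 8}`. <br/> `order`: context length of the model. <br/> `sliding_window_size`: size of the sliding window over the source string; its hash is used as the seed of the RNG in the model. |
19+
| `seed` | The generation seed. |
20+
| `determinator`| The input. |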
21+
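To show how `order`, `sliding_window_size`, `seed`, and the determinator could interact, here is a minimal Python sketch of a character-level Markov chain with deterministic sampling. It is an illustration under stated assumptions, not Databend's implementation: the helper names `train` and `generate`, the SHA-256 seeding scheme, the way the window slides, and the `max_len` limit are all hypothetical.

```python
import hashlib
import random
from collections import Counter, defaultdict

END = "\0"  # end-of-string marker used only inside this sketch


def train(strings, order):
    """Count which character follows each context of up to `order` characters."""
    model = defaultdict(Counter)
    for text in strings:
        padded = text + END
        for i, ch in enumerate(padded):
            model[padded[max(0, i - order):i]][ch] += 1
    return model


def generate(model, order, sliding_window_size, seed, determinator, max_len=64):
    """Generate one string; identical inputs always give identical output."""
    out = ""
    while len(out) < max_len:
        histogram = model.get(out[-order:] if order else "")
        if not histogram:
            break
        # Take a window of the determinator (assumed sliding scheme) and hash it
        # together with the seed and position; the hash seeds the RNG.
        start = min(len(out), max(0, len(determinator) - sliding_window_size))
        window = determinator[start:start + sliding_window_size]
        digest = hashlib.sha256(f"{seed}:{len(out)}:{window}".encode()).digest()
        rng = random.Random(int.from_bytes(digest[:8], "big"))
        # Weighted pick of the next character from the context's histogram.
        chars = sorted(histogram)
        ch = rng.choices(chars, weights=[histogram[c] for c in chars])[0]
        if ch == END:
            break
        out += ch
    return out


model = train([f"bar{n}" for n in range(100)], order=5)
names = [generate(model, 5, 8, 151, str(n + 100000)) for n in range(5)]
print(names)
```

Since every random choice is derived from the seed and the determinator, the same source value always maps to the same generated value, which is what makes this kind of output usable as a stable replacement for the original data.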
22+
## Return Type
23+
24+
String
25+
26+
## Examples
27+
28+
```sql
29+
create table model as
30+
select markov_train(concat('bar', number::string)) as bar from numbers(100);
31+
32+
select markov_generate(bar,'{"order":5,"sliding_window_size":8}', 151, (number+100000)::string) as generate
33+
from numbers(5), model;
34+
+-----------+
35+
| generate |
36+
+-----------+
37+
│ bar95 │
38+
│ bar64 │
39+
│ bar85 │
40+
│ bar56 │
41+
│ bar95 │
42+
+-----------+
43+
```

package.json

Lines changed: 6 additions & 0 deletions
@@ -45,26 +45,32 @@
4545
"copy-to-clipboard": "^3.3.3",
4646
"copyforjs": "^1.0.6",
4747
"databend-logos": "^0.0.16",
48+
"dayjs": "^1.11.19",
4849
"docusaurus-plugin-devserver": "^1.0.6",
4950
"docusaurus-plugin-sass": "^0.2.5",
5051
"docusaurus-prince-pdf": "^1.2.1",
5152
"fs-extra": "^11.2.0",
5253
"js-cookie": "^3.0.5",
5354
"prism-react-renderer": "^2.3.0",
55+
"prop-types": "^15.8.1",
5456
"react": "^19.1.0",
5557
"react-dom": "^19.1.0",
5658
"react-icons": "^5.5.0",
5759
"react-markdown": "^9.0.1",
5860
"react-scroll-progress-bar": "^2.0.3",
5961
"react-slick": "^0.31.0",
62+
"remark-gfm": "^4.0.1",
6063
"sass": "^1.77.8",
6164
"sass-resources-loader": "^2.2.5",
6265
"turndown": "^7.2.0",
6366
"vanilla-cookieconsent": "^3.1.0",
6467
"xml2js": "^0.6.2"
6568
},
6669
"devDependencies": {
70+
"@ant-design/cssinjs": "^2.0.1",
6771
"@docusaurus/module-type-aliases": "^3.7.0",
72+
"@docusaurus/plugin-content-docs": "^3.9.2",
73+
"@docusaurus/theme-common": "^3.9.2",
6874
"@docusaurus/tsconfig": "^3.7.0",
6975
"@docusaurus/types": "^3.7.0",
7076
"typescript": "~5.2.2"
