
Commit 4b29495

shuiyisong and nicecui authored
chore: update pipeline docs using doc ver 2 by default (#1956)
Signed-off-by: shuiyisong <[email protected]>
Co-authored-by: Yiran <[email protected]>
1 parent df54ac5 commit 4b29495

File tree

12 files changed, +368 -298 lines changed

docs/user-guide/logs/manage-pipelines.md

Lines changed: 15 additions & 10 deletions
@@ -64,21 +64,22 @@ curl "http://localhost:4000/v1/events/pipelines/test?version=2025-04-01%2006%3A5
   "pipelines": [
     {
       "name": "test",
-      "version": "2025-04-01 06:58:31.335251882+0000",
-      "pipeline": "processors:\n - dissect:\n fields:\n - message\n patterns:\n - '%{ip_address} - - [%{timestamp}] \"%{http_method} %{request_line}\" %{status_code} %{response_size} \"-\" \"%{user_agent}\"'\n ignore_missing: true\n - date:\n fields:\n - timestamp\n formats:\n - \"%d/%b/%Y:%H:%M:%S %z\"\n\ntransform:\n - fields:\n - ip_address\n - http_method\n type: string\n index: tag\n - fields:\n - status_code\n type: int32\n index: tag\n - fields:\n - request_line\n - user_agent\n type: string\n index: fulltext\n - fields:\n - response_size\n type: int32\n - fields:\n - timestamp\n type: time\n index: timestamp\n"
+      "version": "2025-04-01 06:58:31.335251882",
+      "pipeline": "version: 2\nprocessors:\n - dissect:\n fields:\n - message\n patterns:\n - '%{ip_address} - - [%{timestamp}] \"%{http_method} %{request_line}\" %{status_code} %{response_size} \"-\" \"%{user_agent}\"'\n ignore_missing: true\n - date:\n fields:\n - timestamp\n formats:\n - \"%d/%b/%Y:%H:%M:%S %z\"\n - select:\n type: exclude\n fields:\n - message\n\ntransform:\n - fields:\n - ip_address\n type: string\n index: inverted\n tag: true\n - fields:\n - status_code\n type: int32\n index: inverted\n tag: true\n - fields:\n - request_line\n - user_agent\n type: string\n index: fulltext\n - fields:\n - response_size\n type: int32\n - fields:\n - timestamp\n type: time\n index: timestamp\n"
     }
   ],
-  "execution_time_ms": 92
+  "execution_time_ms": 7
 }
 ```
 
 In the output above, the `pipeline` field is a YAML-formatted string. Since the JSON format does not display YAML strings well, the `echo` command can be used to present it in a more human-readable way:
 
 ```shell
-echo "processors:\n - dissect:\n fields:\n - message\n patterns:\n - '%{ip_address} - - [%{timestamp}] \"%{http_method} %{request_line}\" %{status_code} %{response_size} \"-\" \"%{user_agent}\"'\n ignore_missing: true\n - date:\n fields:\n - timestamp\n formats:\n - \"%d/%b/%Y:%H:%M:%S %z\"\n\ntransform:\n - fields:\n - ip_address\n - http_method\n type: string\n index: tag\n - fields:\n - status_code\n type: int32\n index: tag\n - fields:\n - request_line\n - user_agent\n type: string\n index: fulltext\n - fields:\n - response_size\n type: int32\n - fields:\n - timestamp\n type: time\n index: timestamp\n"
+echo -e "version: 2\nprocessors:\n - dissect:\n fields:\n - message\n patterns:\n - '%{ip_address} - - [%{timestamp}] \"%{http_method} %{request_line}\" %{status_code} %{response_size} \"-\" \"%{user_agent}\"'\n ignore_missing: true\n - date:\n fields:\n - timestamp\n formats:\n - \"%d/%b/%Y:%H:%M:%S %z\"\n - select:\n type: exclude\n fields:\n - message\n\ntransform:\n - fields:\n - ip_address\n type: string\n index: inverted\n tag: true\n - fields:\n - status_code\n type: int32\n index: inverted\n tag: true\n - fields:\n - request_line\n - user_agent\n type: string\n index: fulltext\n - fields:\n - response_size\n type: int32\n - fields:\n - timestamp\n type: time\n index: timestamp\n"
 ```
 
 ```yml
+version: 2
 processors:
   - dissect:
       fields:
@@ -91,17 +92,22 @@ processors:
         - timestamp
       formats:
         - "%d/%b/%Y:%H:%M:%S %z"
+  - select:
+      type: exclude
+      fields:
+        - message
 
 transform:
   - fields:
       - ip_address
-      - http_method
     type: string
-    index: tag
+    index: inverted
+    tag: true
   - fields:
       - status_code
     type: int32
-    index: tag
+    index: inverted
+    tag: true
   - fields:
       - request_line
       - user_agent
@@ -114,7 +120,6 @@ transform:
       - timestamp
     type: time
     index: timestamp
-
 ```
 
 Or you can use SQL to query pipeline information.
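
For example, a query like the following lists the stored pipelines (a sketch, assuming pipelines are kept in the `greptime_private.pipelines` table; the column layout may differ by version):

```sql
-- List stored pipelines; the pipeline body itself is the YAML string.
SELECT name, created_at, pipeline
FROM greptime_private.pipelines;
```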
@@ -211,7 +216,7 @@ transform:
     type: string
   - field: time
     type: time
-    index: timestamp'
+    index: timestamp
 ```
 
 The pipeline configuration contains an error. The `gsub` Processor expects the `replacement` field to be a string, but the current configuration provides an array. As a result, the pipeline creation fails with the following error message:
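
The offending shape, schematically (a minimal sketch with illustrative values, not the full configuration):

```yaml
processors:
  - gsub:
      fields:
        - message
      pattern: 'old'
      # Invalid: `replacement` expects a single string here,
      # but an array is supplied.
      replacement:
        - 'new'
```

Supplying a plain string instead, e.g. `replacement: 'new'`, is what allows the corrected configuration below to be created.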
@@ -246,7 +251,7 @@ transform:
     type: string
   - field: time
     type: time
-    index: timestamp'
+    index: timestamp
 ```
 
 Now that the Pipeline has been created successfully, you can test the Pipeline using the `dryrun` interface.
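
A minimal sketch of such a test call, assuming a `dryrun` endpoint alongside the pipeline management API used above (the exact path and parameters may differ by version):

```shell
# Hypothetical dryrun call: run one sample log line through the `test`
# pipeline without writing any data, then inspect the processed output.
curl -X POST "http://localhost:4000/v1/events/pipelines/dryrun?pipeline_name=test" \
  -H "Content-Type: application/json" \
  -d '[{"message": "127.0.0.1 - - [25/May/2024:20:16:37 +0000] \"GET /index.html HTTP/1.1\" 200 612 \"-\" \"Mozilla/5.0\""}]'
```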

docs/user-guide/logs/pipeline-config.md

Lines changed: 54 additions & 46 deletions
@@ -59,6 +59,7 @@ Dispatcher forwards pipeline execution context onto different subsequent pipelin
 Transform decides the final datatype and table structure in the database.
 Table suffix allows storing the data into different tables.
 
+- Version is used to state the pipeline configuration format. Although it's optional, it is highly recommended to start with version 2. See [here](#transform-in-version-2) for more details.
 - Processors are used for preprocessing log data, such as parsing time fields and replacing fields.
 - Dispatcher(optional) is used for forwarding the context into another pipeline, so that the same batch of input data can be divided and processed by different pipeline based on certain fields.
 - Transform(optional) is used for converting data formats, such as converting string types to numeric types, and specifying indexes.
@@ -67,6 +68,7 @@ Table suffix allows storing the data into different tables.
 Here is an example of a simple configuration that includes Processors and Transform:
 
 ```yaml
+version: 2
 processors:
   - urlencoding:
       fields:
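
A fuller sketch of this shape, with illustrative processor options and field names that are not taken from the original file:

```yaml
version: 2
processors:
  - urlencoding:
      fields:
        - string_field_a
      method: decode        # decode percent-encoded input (assumed option)
      ignore_missing: true
  - date:
      fields:
        - ts_str
      formats:
        - "%Y-%m-%dT%H:%M:%S%.3fZ"

transform:
  - fields:
      - ts_str
    type: time
    index: timestamp
```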
@@ -98,7 +100,7 @@ table_suffix: _${string_field_a}
 
 Starting from `v0.15`, GreptimeDB introduces a version `2` format.
 The main change is the transform process.
-Refer to [the following documentation](#transform-in-doc-version-2) for detailed changes.
+Refer to [the following documentation](#transform-in-version-2) for detailed changes.
 
 ## Processor
 
@@ -865,27 +867,69 @@ The `filter` processor takes the following options:
 
 Transform is used to convert data formats and specify indexes upon columns. It is located under the `transform` section in the YAML file.
 
-Starting from `v0.15`, an auto-transform mode is added to simplify the configuration. See below for details.
+Starting from `v0.15`, GreptimeDB introduces the version 2 format and auto-transform to greatly simplify the configuration. See below for details.
 
 A Transform consists of one or more configurations, and each configuration contains the following fields:
 
 - `fields`: A list of field names to be transformed.
-- `type`: The transformation type.
-- `index`: The index type (optional).
-- `tag`: Specify the field to be a tag field (optional).
-- `on_failure`: Handling method for transformation failures (optional).
-- `default`: Default value (optional).
+- `type`: The target transformation type in the database.
+- `index` (optional): The index type.
+- `tag` (optional): Specify the field to be a tag field.
+- `on_failure` (optional): Handling method for transformation failures.
+- `default` (optional): Default value.
+
+### Transform in version 2
+
+Originally, you had to manually specify all the fields in the transform section for them to be persisted in the database.
+If a field was not specified in the transform, it was discarded.
+As the number of fields grows, this makes the configuration both tedious and error-prone.
+
+Starting from `v0.15`, GreptimeDB introduces a new transform mode which makes it easier to write pipeline configurations.
+You only set the necessary fields in the transform section, specifying a particular datatype and index for them; the rest of the fields from the pipeline context are set automatically by the pipeline engine.
+With the `select` processor, you can decide which fields are kept in the final table and which are not.
+
+However, this is a breaking change to existing pipeline configuration files.
+If you have already used pipelines with `dissect` or `regex` processors, after upgrading the database the original message string, which is still in the pipeline context, would be immediately inserted into the database, with no way to stop this behavior.
+
+Therefore, GreptimeDB introduces the concept of a version to decide which transform mode you want to use, just like the version in a Docker Compose file. Here is an example:
+```YAML
+version: 2
+processors:
+  - date:
+      field: input_str
+      formats:
+        - "%Y-%m-%dT%H:%M:%S%.3fZ"
+
+transform:
+  - field: input_str, ts
+    type: time, ms
+    index: timestamp
+```
+
+Simply add a `version: 2` line at the top of the config file, and the pipeline engine will run the transform in combined mode:
+1. Process all written transform rules sequentially.
+2. Write all fields of the pipeline context to the final table.
+
+Note:
+- The transform section **must contain one timestamp index field**.
+- The transform process in version 2 consumes the original field in the pipeline context, so you can't transform the same field twice.
 
 ### Auto-transform
-If no transform section is specified in the pipeline configuration, the pipeline engine will attempt to infer the data types of the fields from the context and preserve them in the database, much like the `identity_pipeline` does.
 
-To create a table in GreptimeDB, a time index column must be specified.
+The transform configuration in the version 2 format is already a large simplification over the original transform.
+However, there are times when you might want to combine the power of processors with the ease of using `greptime_identity`: writing no transform code and letting the pipeline engine automatically infer and persist the data.
+
+This is now possible in custom pipelines.
+If no transform section is specified, the pipeline engine will attempt to infer the datatypes of the fields from the pipeline context and persist them to the database, much like what the `identity_pipeline` does.
+
+To create a table in GreptimeDB, a timestamp index column must be specified.
 In this case, the pipeline engine will try to find a field of type `timestamp` in the context and set it as the time index column.
 A `timestamp` field is produced by a `date` or `epoch` processor, so at least one `date` or `epoch` processor must be defined in the processors section.
 Additionally, only one `timestamp` field is allowed, multiple `timestamp` fields would lead to an error due to ambiguity.
 
 For example, the following pipeline configuration is now valid.
 ```YAML
+version: 2
 processors:
   - dissect:
       fields:
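
For reference, a complete minimal pipeline of this kind could look like the following sketch (the pattern and field names are illustrative):

```yaml
version: 2
processors:
  - dissect:
      fields:
        - message
      patterns:
        - '%{ip_address} - - [%{timestamp}] "%{http_method} %{request_line}"'
      ignore_missing: true
  - date:
      fields:
        - timestamp
      formats:
        - "%d/%b/%Y:%H:%M:%S %z"

# No transform section: every field in the context is persisted with an
# inferred datatype, and the `timestamp` field becomes the time index.
```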
@@ -1037,42 +1081,6 @@ The result will be:
 }
 ```
 
-### Transform in doc version 2
-
-Before `v0.15`, the pipeline engine only supports a fully-set transform mode and an auto-transform mode:
-- Fully-set transform: only fields explicitly noted in the transform section will be persisted into the database
-- Auto-transform: no transform section is written, and the pipeline engine will try to set all the fields from the pipeline context. But in this case, there is no way to set other indexes other than the time index.
-
-Starting from `v0.15`, GreptimeDB introduces a new transform mode combining the advantages of the existing two, which make it easier to write pipeline configuration.
-You only set necessary fields in the transform section, specifying particular datatype and index for them; the rest of the fields from the pipeline context are set automatically by the pipeline engine.
-With the `select` processor, you can decide what field is wanted and what isn't in the final table.
-
-However, this is a breaking change to the existing pipeline configuration files.
-If you has already used pipeline with `dissect` or `regex` processors, after upgrading the database, the original message string, which is still in the pipeline context, gets immediately inserted into the database and there's no way to stop this behavior.
-
-Therefore, GreptimeDB introduces the concept of doc version to decide which transform mode you want to use, just like the version in a Docker Compose file. Here is an example:
-```YAML
-version: 2
-processors:
-  - date:
-      field: input_str
-      formats:
-        - "%Y-%m-%dT%H:%M:%S%.3fZ"
-
-transform:
-  - field: input_str, ts
-    type: time, ms
-    index: timestamp
-```
-
-Simply add a `version: 2` line at the top of the config file, and the pipeline engine will run the transform in combined mode:
-1. Process all written transform rules sequentially.
-2. Write all fields of the pipeline context to the final table.
-
-Note:
-- If the transform section is explicitly written, **it must contain a time index field**. Otherwise the time-index field will be inferred by the pipeline engine just like the auto-transform mode.
-- The transform process in the version 2 will consume the original field in the pipeline context, so you can't transform the same field twice.
-
 ## Dispatcher
 
 The pipeline dispatcher routes requests to other pipelines based on configured
