114 changes: 108 additions & 6 deletions _data-prepper/pipelines/configuration/processors/csv.md
@@ -16,12 +16,13 @@ The following table describes the options you can use to configure the `csv` pro

Option | Required | Type | Description
:--- | :--- | :--- | :---
source | No | String | The field in the event that will be parsed. Default value is `message`.
quote_character | No | String | The character used as a text qualifier for a single column of data. Default value is `"`.
delimiter | No | String | The character separating each column. Default value is `,`.
delete_header | No | Boolean | If specified, the event header (`column_names_source_key`) is deleted after the event is parsed. If there is no event header, no action is taken. Default value is true.
column_names_source_key | No | String | The field in the event that specifies the CSV column names, which will be automatically detected. If there need to be extra column names, the column names are automatically generated according to their index. If `column_names` is also defined, the header in `column_names_source_key` can also be used to generate the event fields. If too few columns are specified in this field, the remaining column names are automatically generated. If too many column names are specified in this field, the CSV processor omits the extra column names.
column_names | No | List | User-specified names for the CSV columns. Default value is `[column1, column2, ..., columnN]` if there are no columns of data in the CSV record and `column_names_source_key` is not defined. If `column_names_source_key` is defined, the header in `column_names_source_key` generates the event fields. If too few columns are specified in this field, the remaining column names are automatically generated. If too many column names are specified in this field, the CSV processor omits the extra column names.
`source` | No | String | The field in the event that will be parsed. Default value is `message`.
`quote_character` | No | String | The character used as a text qualifier for a single column of data. Default value is `"`.
`delimiter` | No | String | The character separating each column. Default value is `,`.
`delete_header` | No | Boolean | If specified, the event header (`column_names_source_key`) is deleted after the event is parsed. If there is no event header, no action is taken. Default value is true.
`column_names_source_key` | No | String | The field in the event that specifies the CSV column names, which will be automatically detected. If there need to be extra column names, the column names are automatically generated according to their index. If `column_names` is also defined, the header in `column_names_source_key` can also be used to generate the event fields. If too few columns are specified in this field, the remaining column names are automatically generated. If too many column names are specified in this field, the CSV processor omits the extra column names.
`column_names` | No | List | User-specified names for the CSV columns. Default value is `[column1, column2, ..., columnN]` if there are no columns of data in the CSV record and `column_names_source_key` is not defined. If `column_names_source_key` is defined, the header in `column_names_source_key` generates the event fields. If too few columns are specified in this field, the remaining column names are automatically generated. If too many column names are specified in this field, the CSV processor omits the extra column names.
`delete_source` | No | Boolean | If `true`, deletes the configured `source` field (default `message`) after CSV parsing. Default is `false`.
Review comment (Member): Maybe we should also elaborate on the value here. We added this option to help reduce memory pressure. If you know you are going to drop the source after this processor, you can get better memory usage, because processing happens in batches.
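
The column-naming rules described for `column_names` and `column_names_source_key` in the table can be illustrated with a short Python sketch. This is not Data Prepper code: the function name `resolve_column_names` is hypothetical, and the 1-based `columnN` naming is assumed from the `[column1, column2, ..., columnN]` default given in the table.

```python
def resolve_column_names(configured, value_count):
    """Return exactly value_count column names (illustrative only).

    Extra configured names are dropped; missing names are generated
    from the 1-based column index (column3, column4, ...).
    """
    # Keep at most value_count of the configured names.
    names = list(configured[:value_count])
    # Generate names for any remaining columns by their position.
    for index in range(len(names) + 1, value_count + 1):
        names.append(f"column{index}")
    return names


print(resolve_column_names(["col1", "col2"], 3))      # ['col1', 'col2', 'column3']
print(resolve_column_names(["a", "b", "c", "d"], 3))  # ['a', 'b', 'c']
print(resolve_column_names([], 2))                    # ['column1', 'column2']
```

The first call mirrors the case where too few names are configured, and the second mirrors the case where too many are configured.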


## Usage

@@ -97,6 +98,107 @@ Then, the processor parses the event into the following output. Because `delete_
{"message": "1,2,3", "a": "1", "b": "2", "c": "3"}
```

### Delete the source field after parsing

To remove the original `message` field after the columns are extracted, set `delete_source` to `true`, as shown in the following example:

```yaml
csv-pipeline-delete-source:
  source:
    file:
      path: "/full/path/to/ingest.csv"
      record_type: "event"
  processor:
    - csv:
        column_names: ["col1", "col2"]
        delete_source: true # default is false
  sink:
    - opensearch:
        hosts: ["https://opensearch:9200"]
        insecure: true
        username: admin
        password: admin_pass
        index_type: custom
        index: csv-demo-%{yyyy.MM.dd}
```
{% include copy.html %}

Review comment (Member), on the `delete_source: true # default is false` line: I don't know if we need the comment here.

The documents stored in OpenSearch contain the following information:

```json
{
  ...
  "hits" : {
    "total" : {
      "value" : 2,
      "relation" : "eq"
    },
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "csv-demo-2025.11.10",
        "_id" : "vTgDb5oBcoMYUXV6ocPH",
        "_score" : 1.0,
        "_source" : {
          "col1" : "1",
          "col2" : "2",
          "column3" : "3"
        }
      },
      {
        "_index" : "csv-demo-2025.11.10",
        "_id" : "vjgDb5oBcoMYUXV6ocPI",
        "_score" : 1.0,
        "_source" : {
          "col1" : "4",
          "col2" : "5",
          "column3" : "6"
        }
      }
    ]
  }
}
```

If `delete_source` is set to `false`, the documents also include the `message` field, as shown in the following example:

```json
{
  ...
  "hits" : {
    "total" : {
      "value" : 2,
      "relation" : "eq"
    },
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "csv-demo-2025.11.10",
        "_id" : "fpAKb5oB85vgu48rA-rD",
        "_score" : 1.0,
        "_source" : {
          "message" : "1,2,3",
          "col1" : "1",
          "col2" : "2",
          "column3" : "3"
        }
      },
      {
        "_index" : "csv-demo-2025.11.10",
        "_id" : "f5AKb5oB85vgu48rA-rD",
        "_score" : 1.0,
        "_source" : {
          "message" : "4,5,6",
          "col1" : "4",
          "col2" : "5",
          "column3" : "6"
        }
      }
    ]
  }
}
```
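
The effect of `delete_source` on a single event can be modeled roughly as follows. This is an illustrative Python sketch, not the processor's implementation: `parse_csv_event` and its parameters are hypothetical names, and the sketch ignores `quote_character`, `delimiter`, and header handling.

```python
import csv
import io


def parse_csv_event(event, column_names, source="message", delete_source=False):
    """Toy model of the csv processor applied to one event (a dict)."""
    # Parse the source field as a single CSV record.
    values = next(csv.reader(io.StringIO(event[source])))
    result = dict(event)
    for position, value in enumerate(values, start=1):
        # Use the configured name if one exists; otherwise generate columnN.
        if position <= len(column_names):
            key = column_names[position - 1]
        else:
            key = f"column{position}"
        result[key] = value
    # Drop the original field only when delete_source is enabled.
    if delete_source:
        del result[source]
    return result


event = {"message": "1,2,3"}
print(parse_csv_event(event, ["col1", "col2"], delete_source=True))
# {'col1': '1', 'col2': '2', 'column3': '3'}
print(parse_csv_event(event, ["col1", "col2"]))
# {'message': '1,2,3', 'col1': '1', 'col2': '2', 'column3': '3'}
```

The two calls correspond to the `_source` contents of the first document in each of the two search results above.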

## Metrics

The following table describes common [Abstract processor](https://github.com/opensearch-project/data-prepper/blob/main/data-prepper-api/src/main/java/org/opensearch/dataprepper/model/processor/AbstractProcessor.java) metrics.