[Ray Data] explode expression #58279

@codingl2k1

Description

I split the values of some columns, which produces a new column where each element is a list of all the parts from the split. However, there is no easy way to flatten that column into one row per part. Currently, I have to handle the data myself with map_batches or flat_map.

Use cases:

  • string split
  • audio split
  • video split
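For context, the flat_map workaround mentioned above can be sketched roughly as follows. This is only an illustration for the string split case: the function name `explode_words` is hypothetical, and the list comprehension at the end is a plain-Python stand-in for what flat_map does, so the snippet runs without a Ray cluster.

```python
from typing import Any, Dict, List


def explode_words(row: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Return one output row per word in row["sentence"].

    Hypothetical helper: every other column of the input row is
    copied unchanged into each output row.
    """
    return [{**row, "word": word} for word in row["sentence"].split(" ")]


rows = [
    {"id": 1, "sentence": "lorem ipsum"},
    {"id": 2, "sentence": "foo bar baz"},
    {"id": 3, "sentence": "hi"},
]

# Stand-in for flat_map: apply the function to every row and
# concatenate the returned lists into one flat list of rows.
exploded = [out for row in rows for out in explode_words(row)]

for out in exploded:
    print(out)
```

With Ray Data, the same function would be passed as `ds.flat_map(explode_words)`. It works, but the splitting and flattening logic has to be written by hand for every column, which is what an explode expression would avoid.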

What I want is an explode expression. Here is the equivalent in Daft:

import daft
from daft.functions import explode

df = daft.from_pydict({"id": [1, 2, 3], "sentence": ["lorem ipsum", "foo bar baz", "hi"]})

df.with_column("word", explode(df["sentence"].split(" "))).show()
╭───────┬─────────────┬────────╮
│ id    ┆ sentence    ┆ word   │
│ ---   ┆ ---         ┆ ---    │
│ Int64 ┆ String      ┆ String │
╞═══════╪═════════════╪════════╡
│ 1     ┆ lorem ipsum ┆ lorem  │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 1     ┆ lorem ipsum ┆ ipsum  │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 2     ┆ foo bar baz ┆ foo    │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 2     ┆ foo bar baz ┆ bar    │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 2     ┆ foo bar baz ┆ baz    │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 3     ┆ hi          ┆ hi     │
╰───────┴─────────────┴────────╯

(Showing first 6 of 6 rows)

Use case

import pandas as pd
import ray.data
from ray.data.expressions import explode, col

df = pd.DataFrame({"id": [1, 2, 3], "sentence": ["lorem ipsum", "foo bar baz", "hi"]})
df = ray.data.from_pandas(df)

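# Proposed API: explode is the expression requested in this issue.
# col is called as a function, col("sentence"), not subscripted.
df.with_column("word", explode(col("sentence").split(" "))).show()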
╭───────┬─────────────┬────────╮
│ id    ┆ sentence    ┆ word   │
│ ---   ┆ ---         ┆ ---    │
│ Int64 ┆ String      ┆ String │
╞═══════╪═════════════╪════════╡
│ 1     ┆ lorem ipsum ┆ lorem  │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 1     ┆ lorem ipsum ┆ ipsum  │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 2     ┆ foo bar baz ┆ foo    │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 2     ┆ foo bar baz ┆ bar    │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 2     ┆ foo bar baz ┆ baz    │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 3     ┆ hi          ┆ hi     │
╰───────┴─────────────┴────────╯

(Showing first 6 of 6 rows)

Metadata

    Labels

    P2 (Important issue, but not time-critical), community-backlog, data (Ray Data-related issues), enhancement (Request for new feature and/or capability), usability
