@wjxiz1992
Collaborator

@wjxiz1992 wjxiz1992 commented Oct 20, 2025

Closes #221.

Use the solution described in the issue. Verified locally with the original failing command line.

@wjxiz1992 wjxiz1992 requested a review from jihoonson October 20, 2025 02:17
@wjxiz1992 wjxiz1992 self-assigned this Oct 20, 2025
Signed-off-by: Allen Xu <[email protected]>
@jihoonson
Collaborator

Thanks @wjxiz1992 for having a look at this issue. Should we consider something similar to the HDFS solution for local mode as well, such as skipping generation of the delete data once the first file has been generated? I'm not sure it's a good idea to always assume the overwrite_output flag when it's an update run.

@wjxiz1992
Collaborator Author

wjxiz1992 commented Oct 22, 2025

> Thanks @wjxiz1992 for having a look at this issue. Should we consider something similar to the HDFS solution for local mode as well, such as skipping generation of the delete data once the first file has been generated? I'm not sure it's a good idea to always assume the overwrite_output flag when it's an update run.

I understand your concern. So far, the duplicated "delete_n" and "inventory_delete_n" .dat files are the only buggy part I have found in the native TPC-DS tool. When -child is used, the data files generated for different child numbers have different file names:

./dsdgen -scale 10 -dir $PWD/sf10-2 -parallel 10 -child 3 -verbose -update 20
...
s_inventory_3_10.dat

./dsdgen -scale 10 -dir $PWD/sf10-2 -parallel 10 -child 2 -verbose -update 20
...
s_inventory_2_10.dat

All other data files follow this pattern, so the overwrite flag won't cause problems for them.
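To make the naming argument concrete, here is a minimal sketch that checks which file names more than one child would write. The function name `find_colliding_names` and the example listing are hypothetical; the `s_inventory_*` names come from the commands above, while `delete_1.dat` and `inventory_delete_1.dat` are assumed instances of the "delete_n" and "inventory_delete_n" files that carry no child number.

```python
from collections import defaultdict


def find_colliding_names(files_by_child):
    """Map each file name written by more than one child to its sorted child numbers."""
    owners = defaultdict(set)
    for child, names in files_by_child.items():
        for name in names:
            owners[name].add(child)
    return {
        name: sorted(children)
        for name, children in owners.items()
        if len(children) > 1
    }


# Illustrative listing only (assumed, not captured from a real run).
example = {
    2: ["s_inventory_2_10.dat", "delete_1.dat", "inventory_delete_1.dat"],
    3: ["s_inventory_3_10.dat", "delete_1.dat", "inventory_delete_1.dat"],
}
collisions = find_colliding_names(example)
```

Under these assumptions only the two delete files collide, which is why overwriting would affect nothing else.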

The reason we do a move for the HDFS case is that its output file structure differs from the local case.

Ultimately, yes, we could do something similar to the HDFS path, but that would require (1) creating a folder for each child and (2) moving the data files from each child folder into the final folder, among other steps.
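The two steps could be sketched roughly as follows. This is not the code in this PR or the existing HDFS path; `merge_child_dirs` is a hypothetical helper, and it assumes that a same-named file from a later child is a duplicate of the one already moved, so the first copy is kept.

```python
import shutil
from pathlib import Path


def merge_child_dirs(child_dirs, final_dir):
    """Move files from per-child folders into final_dir.

    Returns the names that were skipped because an earlier child
    had already provided a file with the same name.
    """
    final = Path(final_dir)
    final.mkdir(parents=True, exist_ok=True)
    skipped = []
    for child_dir in child_dirs:
        # sorted() materializes the listing before any file is moved.
        for src in sorted(Path(child_dir).iterdir()):
            if not src.is_file():
                continue
            dest = final / src.name
            if dest.exists():
                # Assumption: same name means same content, so keep the first copy.
                skipped.append(src.name)
                continue
            shutil.move(str(src), str(dest))
    return skipped
```

With this layout each child writes into its own folder, so no overwrite flag is needed during generation; the cost is the extra move pass afterwards.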

Collaborator

@jihoonson jihoonson left a comment

@wjxiz1992 and I discussed offline. I don't think this is a good idea given my concern in #222 (comment).

Development

Successfully merging this pull request may close these issues.

[BUG] Data generation failed for data maintenance in local mode with parallelism