Description
Describe the bug
When `UseThriftClient` is 1, the driver hangs whenever expired links are encountered. The observed behaviour is:
1. Run `SELECT *` on a large table (e.g., 5 billion rows).
2. All links are fetched upfront in batches. In this case there were around 60K chunks, and fetching the links took more than 15 minutes, so by the time we start downloading chunks the links have already expired.
3. Once chunks are fetched, `ChunkLinkDownloadService` is initialized with `chunkCount` equal to the total number of chunks. `nextBatchStartIndex` is therefore initialized to the total number of chunks.
4. When `ChunkDownloadTask` is called, `chunk.isChunkLinkInvalid` returns true. This triggers a call to `ChunkLinkDownloadService#getLinkForChunk`.
5. `handleExpiredLinksAndReset` does nothing because `isChunkLinkExpiredForPendingDownload` is false: the `CompletableFuture<ExternalLink> chunkFuture` is not done, because nothing has triggered the link to be fetched (the chunk was already populated with its link, and nothing has yet run to complete the future).
6. `triggerNextBatchDownload` is called, but because of (3) there is never a scenario in which `currentDownloadTask` is reached: given that `nextBatchStartIndex` was set to the total number of chunks, `batchStartIndex >= totalChunks` is always true, so you see `No more chunks to download. Current index: {}, Total chunks: {}` in the logs.
7. Ultimately you end up in a "stuck" situation where `linkDownloadService.getLinkForChunk(chunk.getChunkIndex()).get()` never completes, because nothing is running in `ChunkLinkDownloadService` to update the pending future.
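The stuck state described above can be modelled in isolation. The following is a minimal sketch, not the driver's actual code: the class `StuckLinkFutureDemo` and its simplified methods are hypothetical stand-ins that only mirror the names mentioned in this report. It assumes every chunk has a pending future in the map and that a refresh is only started when `nextBatchStartIndex` is below the total chunk count.

```java
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical, simplified model of the state described in this issue.
public class StuckLinkFutureDemo {
    private final int totalChunks;
    private int nextBatchStartIndex;
    private final Map<Integer, CompletableFuture<String>> chunkIndexToLinkFuture =
        new ConcurrentHashMap<>();

    StuckLinkFutureDemo(int totalChunks, int nextBatchStartIndex) {
        this.totalChunks = totalChunks;
        this.nextBatchStartIndex = nextBatchStartIndex;
        // Every chunk gets a future that nobody has completed yet.
        for (int i = 0; i < totalChunks; i++) {
            chunkIndexToLinkFuture.put(i, new CompletableFuture<>());
        }
    }

    // Returns false without doing any work once the index has reached the end.
    boolean triggerNextBatchDownload() {
        if (nextBatchStartIndex >= totalChunks) {
            return false; // corresponds to the "No more chunks to download" log line
        }
        for (int i = nextBatchStartIndex; i < totalChunks; i++) {
            chunkIndexToLinkFuture.get(i).complete("link-" + i);
        }
        nextBatchStartIndex = totalChunks;
        return true;
    }

    CompletableFuture<String> getLinkForChunk(int chunkIndex) {
        triggerNextBatchDownload();
        return chunkIndexToLinkFuture.get(chunkIndex);
    }

    // True if waiting on chunk 0 times out, i.e. the caller would block forever
    // with an unbounded get().
    static boolean hangs(int totalChunks, int nextBatchStartIndex) {
        StuckLinkFutureDemo service =
            new StuckLinkFutureDemo(totalChunks, nextBatchStartIndex);
        try {
            service.getLinkForChunk(0).get(200, TimeUnit.MILLISECONDS);
            return false;
        } catch (TimeoutException e) {
            return true;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("start index == total chunks, hangs: " + hangs(5, 5));
        System.out.println("start index == 0, hangs: " + hangs(5, 0));
    }
}
```

In this model, starting with `nextBatchStartIndex` equal to the chunk count leaves the future pending, while starting at 0 completes it. The bounded `get` is only there so the sketch terminates.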
The attached stacktrace.txt illustrates the behaviour. You can see the first thread blocked as described in (7). None of the `databricks-jdbc-chunks-downloader` threads are doing anything, and there are no other references to `ChunkLinkDownloadService`, which shows that nothing is going to modify any of the futures in the `chunkIndexToLinkFuture` map.
One additional observation: each batch of chunk links fetched in `DatabricksThriftAccessor` contained only around 20 links, so it takes a considerable amount of time to obtain all of them. This is confusing, because the request appears to suggest the result should be limited to 100,000 rows or ~400MB, and a batch of 20 links is nowhere near that.
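To put the approximate figures from this report together (around 60K links, around 20 links per batch, a little over 15 minutes in total), here is a rough back-of-the-envelope calculation. The class and method names are hypothetical, and the inputs are the rounded numbers quoted above, not measurements.

```java
// Rough estimate only; inputs are the approximate figures quoted in this issue.
public class LinkFetchEstimate {
    // Number of link-fetch requests needed, rounding up for a partial last batch.
    static long roundTrips(long totalLinks, long linksPerBatch) {
        return (totalLinks + linksPerBatch - 1) / linksPerBatch;
    }

    // Average wall-clock time per request, in seconds.
    static double secondsPerRoundTrip(long totalSeconds, long roundTrips) {
        return (double) totalSeconds / roundTrips;
    }

    public static void main(String[] args) {
        long trips = roundTrips(60_000, 20);
        System.out.println(trips + " round trips");
        System.out.println(secondsPerRoundTrip(15 * 60, trips) + " s per round trip");
    }
}
```

With these inputs that is roughly 3,000 sequential requests at about 0.3 seconds each, so the small batch size accounts for most of the delay.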
Also, in the Thrift approach, it seems somewhat inefficient to obtain all links upfront before any chunks begin to be downloaded. This behaviour exacerbates the issue because of how long it takes to download the links: by the time you start reading, the majority, if not all, will have expired and you have to repeat the process. Even if this hang were fixed, in my case, where it takes longer than 15 minutes to download the links, you would just keep downloading links and never actually make progress downloading chunks.
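The difference between the two strategies can be illustrated with a small deterministic simulation. This is a sketch under stated assumptions, not the driver's implementation: time is a logical clock that advances one tick per link-fetch request, downloads are treated as instantaneous, and the link lifetime `ttlTicks` is a made-up parameter. All names are hypothetical.

```java
// Hypothetical simulation; one logical tick per link-fetch round trip.
public class LinkFetchStrategies {
    // Fetch every link first, then count how many have already expired
    // at the moment downloads begin.
    static int expiredAtDownloadStartUpfront(int totalChunks, int batchSize, long ttlTicks) {
        long[] issuedAt = new long[totalChunks];
        long clock = 0;
        for (int start = 0; start < totalChunks; start += batchSize) {
            clock++; // one round trip to fetch this batch of links
            int end = Math.min(start + batchSize, totalChunks);
            for (int i = start; i < end; i++) {
                issuedAt[i] = clock;
            }
        }
        int expired = 0;
        for (int i = 0; i < totalChunks; i++) {
            if (clock - issuedAt[i] >= ttlTicks) {
                expired++;
            }
        }
        return expired;
    }

    // Fetch one batch of links and use it immediately, before fetching the next.
    static int expiredWhenInterleaving(int totalChunks, int batchSize, long ttlTicks) {
        long clock = 0;
        int expired = 0;
        for (int start = 0; start < totalChunks; start += batchSize) {
            clock++; // one round trip to fetch this batch of links
            long issuedAt = clock;
            int end = Math.min(start + batchSize, totalChunks);
            for (int i = start; i < end; i++) {
                // The download starts right away (assumed instantaneous here).
                if (clock - issuedAt >= ttlTicks) {
                    expired++;
                }
            }
        }
        return expired;
    }

    public static void main(String[] args) {
        System.out.println(expiredAtDownloadStartUpfront(60_000, 20, 1_000));
        System.out.println(expiredWhenInterleaving(60_000, 20, 1_000));
    }
}
```

With 60,000 chunks, batches of 20 and an assumed lifetime of 1,000 ticks, the upfront model reports 40,000 links already expired when downloads start, while the interleaved model reports none. Real downloads take time, so an actual implementation would still need to refresh some links, but far fewer.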
To Reproduce
This can be reproduced with a table large enough that downloading the links takes a long time. In my test this was a table with 5 billion rows. Ultimately, I think any scenario in which a link expires would be sufficient. No configuration overrides are required, as `UseThriftClient` is 1 by default.
By default, `ChunkReadyTimeoutSeconds` is 0, so `AbstractArrowResultChunk#waitForChunkReady` will also block indefinitely. If it is configured, fetching new rows from the `ResultSet` fails loudly instead:
Failed to ready chunk
com.databricks.jdbc.exception.DatabricksSQLException: Failed to ready chunk
at com.databricks.jdbc.api.impl.arrow.AbstractRemoteChunkProvider.getChunk(AbstractRemoteChunkProvider.java:174)
at com.databricks.jdbc.api.impl.arrow.ArrowStreamResult.next(ArrowStreamResult.java:193)
at com.databricks.jdbc.api.impl.DatabricksResultSet.next(DatabricksResultSet.java:268)
Expected behavior
Driver should not hang.
Screenshots
N/A
Client side logs
N/A
Client Environment (please complete the following information):
- OS: MacOS/Linux
- Java version: 21
- Java vendor: Amazon Corretto
- Driver Version: 3.0.4
- BI Tool (if used) N/A
- BI Tool version (if applicable) N/A
Additional context
N/A