
[BUG] Expired links in Thrift client lead to hanging #1097

@tom-s-powell

Describe the bug
When UseThriftClient is 1, the driver hangs whenever expired links are encountered. The behaviour observed is:

  1. SELECT * on a large table (e.g., 5 billion rows).
  2. All links are fetched upfront in batches. In this case there were around 60K chunks, and it took more than 15 minutes to fetch the links, which means that by the time we start downloading chunks, the links have already expired.
  3. Once the links are fetched, ChunkLinkDownloadService is initialized with chunkCount equal to the total number of chunks. nextBatchStartIndex is therefore initialized to the total number of chunks.
  4. When ChunkDownloadTask is called, chunk.isChunkLinkInvalid will return true. This will trigger ChunkLinkDownloadService#getLinkForChunk to be called.
  5. handleExpiredLinksAndReset will do nothing because isChunkLinkExpiredForPendingDownload is false: the CompletableFuture<ExternalLink> chunkFuture is not done, because nothing has triggered the link to be fetched (the chunk was already populated with its link upfront, so nothing has yet run to complete the future).
  6. triggerNextBatchDownload will be called. However, because of (3), there is never a scenario in which currentDownloadTask will be reached: given nextBatchStartIndex was set to the total number of chunks, batchStartIndex >= totalChunks will always be true, so you see "No more chunks to download. Current index: {}, Total chunks: {}" in the logs.
  7. Ultimately you end up in a "stuck" situation where linkDownloadService.getLinkForChunk(chunk.getChunkIndex()).get() never completes because there is nothing running in ChunkLinkDownloadService to update the pending future.
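
The steps above can be modelled in a few lines. This is a minimal sketch of the described state, not the driver's code: the class StuckLinkFutureSketch, its fields and the String stand-in for ExternalLink are all hypothetical. It shows that when nextBatchStartIndex equals the total number of chunks, the guard in triggerNextBatchDownload returns early and nothing ever completes the pending future. A bounded wait is used here only so the sketch terminates.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical, simplified model of the state described in steps 3 to 7.
class StuckLinkFutureSketch {
    private final int totalChunks;
    private final int nextBatchStartIndex;
    // Pending future for one chunk's refreshed link (String stands in for ExternalLink).
    private final CompletableFuture<String> chunkFuture = new CompletableFuture<>();

    StuckLinkFutureSketch(int totalChunks, int nextBatchStartIndex) {
        this.totalChunks = totalChunks;
        this.nextBatchStartIndex = nextBatchStartIndex;
    }

    // Models the guard from step 6: returns true only if a batch download would start.
    boolean triggerNextBatchDownload() {
        if (nextBatchStartIndex >= totalChunks) {
            return false; // the "No more chunks to download" branch
        }
        chunkFuture.complete("refreshed-link");
        return true;
    }

    // Bounded wait on the pending future; returns null if nothing completes it in time.
    String getLinkForChunk(long timeoutMillis) {
        triggerNextBatchDownload();
        try {
            return chunkFuture.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            return null; // an unbounded get() would block forever here
        } catch (InterruptedException | ExecutionException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        // All links were fetched upfront, so the start index equals the chunk count.
        StuckLinkFutureSketch stuck = new StuckLinkFutureSketch(60_000, 60_000);
        System.out.println("link = " + stuck.getLinkForChunk(100));
    }
}
```

With the start index below the chunk count, the same call returns the refreshed link, which is the path that is never reached in this scenario.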

The attached stacktrace.txt illustrates the behaviour. You can see the first thread blocked as described in (7). None of the databricks-jdbc-chunks-downloader threads are doing anything, and there are no other references to ChunkLinkDownloadService, showing that nothing is going to modify any of the futures in the chunkIndexToLinkFuture map.

One additional observation is that the batch size of chunks fetched in DatabricksThriftAccessor was only around 20 links, so it takes a decent amount of time to obtain all links. This is confusing because the request appears to suggest the result should be limited to 100,000 rows or ~400MB, and a batch of around 20 links is not close to that.
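
To put rough numbers on this, here is a small worked calculation using the approximate figures above (around 60K chunks, around 20 links per batch, a little over 15 minutes in total). The class and method names are hypothetical, and the inputs are the rounded values from this report rather than measurements.

```java
// Back-of-the-envelope arithmetic for the link-fetch phase; inputs are approximate.
class LinkFetchArithmetic {
    // Number of fetch requests needed to obtain every link (ceiling division).
    static long roundTrips(long totalChunks, long linksPerBatch) {
        return (totalChunks + linksPerBatch - 1) / linksPerBatch;
    }

    // Average time spent per fetch request.
    static double secondsPerRoundTrip(double totalSeconds, long roundTrips) {
        return totalSeconds / roundTrips;
    }

    public static void main(String[] args) {
        long trips = roundTrips(60_000, 20);                  // about 3,000 requests
        double each = secondsPerRoundTrip(15 * 60.0, trips);  // about 0.3 s per request
        System.out.printf("%d round trips, about %.2f s each%n", trips, each);
    }
}
```

So roughly 3,000 sequential requests at about 0.3 seconds each are enough to account for the 15 minutes.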

Also, in the Thrift approach, it seems somewhat inefficient to obtain all links upfront before any chunks begin to be downloaded. This behaviour exacerbates the issue because of how long it takes to download the links (i.e., by the time you start reading, the majority, if not all, will have expired and you have to repeat the process). Even if this hanging were fixed, in my case where it takes longer than 15 minutes to download the links, you would just keep downloading links and never actually progress in downloading chunks.
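
The effect can be illustrated with a toy model. It assumes batches are fetched sequentially at a constant rate and that each link expires a fixed time after it was fetched. The link lifetime is not known to me, so ttlMillis is a hypothetical parameter, and the class name and the 10 minute value below are purely illustrative.

```java
// Toy model: how many links are already expired when the upfront fetch finishes.
class UpfrontFetchExpirySketch {
    static long expiredLinksAtStart(long batches, long linksPerBatch,
                                    long batchMillis, long ttlMillis) {
        long finishedAt = batches * batchMillis; // moment chunk downloads can begin
        long expired = 0;
        for (long i = 1; i <= batches; i++) {
            long fetchedAt = i * batchMillis;    // moment batch i's links were obtained
            if (finishedAt - fetchedAt >= ttlMillis) {
                expired += linksPerBatch;        // these links are stale before first use
            }
        }
        return expired;
    }

    public static void main(String[] args) {
        // 3,000 batches of 20 links at 300 ms each, with an assumed 10 minute lifetime.
        long expired = expiredLinksAtStart(3_000, 20, 300, 600_000);
        System.out.println(expired + " of 60000 links expired before the first download");
    }
}
```

Under these assumptions the earliest links, which are the ones needed first, are the ones that have expired, and refetching them upfront would hit the same problem again.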

To Reproduce
This can be reproduced with a somewhat large table in order to trigger a long duration to download links. In my test this was a table with 5 billion rows. Ultimately, I think it'd be sufficient to have a scenario where a link expires. No configuration overrides required as UseThriftClient is 1 by default.

By default ChunkReadyTimeoutSeconds is 0, so AbstractArrowResultChunk#waitForChunkReady will also block indefinitely. If it is configured, fetching new rows from the ResultSet will fail loudly:

Failed to ready chunk
com.databricks.jdbc.exception.DatabricksSQLException: Failed to ready chunk
	at com.databricks.jdbc.api.impl.arrow.AbstractRemoteChunkProvider.getChunk(AbstractRemoteChunkProvider.java:174)
	at com.databricks.jdbc.api.impl.arrow.ArrowStreamResult.next(ArrowStreamResult.java:193)
	at com.databricks.jdbc.api.impl.DatabricksResultSet.next(DatabricksResultSet.java:268)
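
For anyone wanting to turn the silent hang into the failure above, a sketch of setting the timeout follows. It only builds a standard java.util.Properties object; that the driver reads ChunkReadyTimeoutSeconds and UseThriftClient from connection properties (rather than only from the JDBC URL) is an assumption on my part, and the class name and the 60 second value are hypothetical.

```java
import java.util.Properties;

// Builds connection properties with a non-zero chunk-ready timeout.
class ChunkTimeoutProperties {
    static Properties withChunkReadyTimeout(int seconds) {
        Properties props = new Properties();
        props.setProperty("UseThriftClient", "1"); // the default, stated for clarity
        // Assumption: the driver accepts this setting as a connection property.
        props.setProperty("ChunkReadyTimeoutSeconds", Integer.toString(seconds));
        return props;
    }

    public static void main(String[] args) {
        Properties props = withChunkReadyTimeout(60);
        // These would be passed to DriverManager.getConnection(url, props) with a real URL.
        System.out.println(props.getProperty("ChunkReadyTimeoutSeconds"));
    }
}
```

This does not fix the underlying problem; it only bounds how long ResultSet.next() waits before failing.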

Expected behavior
Driver should not hang.

Screenshots
N/A

Client side logs
N/A

Client Environment (please complete the following information):

  • OS: MacOS/Linux
  • Java version: 21
  • Java vendor: Amazon Corretto
  • Driver Version: 3.0.4
  • BI Tool (if used) N/A
  • BI Tool version (if applicable) N/A

Additional context
N/A

Labels

bug: Something isn't working
