

@Daksh14 Daksh14 commented Oct 19, 2025

Motivation

We are slowly adopting io_uring everywhere. Here I made the simple change of supporting fs::read, though it might not be as simple as it looks. Let me know if the unsafe I used is correct or not.

We currently use the blocking std::fs::Metadata to obtain the file size for the buffer capacity, then extend the length of the vector according to the bytes read reported in the CQE. This implementation sounds good on paper to me.
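
For illustration, here is a rough standalone sketch of that flow written against the io-uring crate rather than Tokio's internals (a hypothetical example, not the PR's code): size the buffer from blocking metadata, submit one read SQE, and set the vector's length from the CQE result.

```rust
use std::fs::File;
use std::os::unix::io::AsRawFd;

use io_uring::{opcode, types, IoUring};

fn main() -> std::io::Result<()> {
    let file = File::open("/etc/hostname")?;
    // Blocking metadata call sizes the buffer up front, as described above.
    let size = file.metadata()?.len() as usize;
    let mut buf: Vec<u8> = Vec::with_capacity(size + 1);

    let mut ring = IoUring::new(8)?;
    let read_e = opcode::Read::new(
        types::Fd(file.as_raw_fd()),
        buf.as_mut_ptr(),
        buf.capacity() as u32,
    )
    .build()
    .user_data(0x42);

    // SAFETY: `buf` outlives the submission, and the length passed to the
    // kernel never exceeds its capacity.
    unsafe { ring.submission().push(&read_e).expect("submission queue full") };
    ring.submit_and_wait(1)?;

    let cqe = ring.completion().next().expect("completion queue empty");
    let n = cqe.result();
    assert!(n >= 0, "read failed with errno {}", -n);
    // SAFETY: the kernel initialized exactly `n` bytes at the buffer's start.
    unsafe { buf.set_len(n as usize) };

    println!("read {} bytes", buf.len());
    Ok(())
}
```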

Later we should implement an internal statx helper, in this PR or a separate one, to make our uring implementation less painful to use, as this PR failed #7616.

I would prefer to include the statx helper in the same PR, to avoid merging an inefficient read implementation given that io_uring is all about more efficient file IO.

Solution

Continue adopting io_uring.

strace of a tokio::fs::read call after this change:

io_uring_setup(256, {flags=0, sq_thread_cpu=0, sq_thread_idle=0, sq_entries=256, cq_entries=512, features=IORING_FEAT_SINGLE_MMAP|IORING_FEAT_NODROP|IORING_FEAT_SUBMIT_STABLE|IORING_FEAT_RW_CUR_POS|IORING_FEAT_CUR_PERSONALITY|IORING_FEAT_FAST_POLL|IORING_FEAT_POLL_32BITS|IORING_FEAT_SQPOLL_NONFIXED|IORING_FEAT_EXT_ARG|IORING_FEAT_NATIVE_WORKERS|IORING_FEAT_RSRC_TAGS|IORING_FEAT_CQE_SKIP|IORING_FEAT_LINKED_FILE|IORING_FEAT_REG_REG_RING|IORING_FEAT_RECVSEND_BUNDLE|IORING_FEAT_MIN_TIMEOUT|IORING_FEAT_RW_ATTR, sq_off={head=0, tail=4, ring_mask=16, ring_entries=24, flags=36, dropped=32, array=8256, user_addr=0}, cq_off={head=8, tail=12, ring_mask=20, ring_entries=28, overflow=44, cqes=64, flags=40, user_addr=0}}) = 9
mmap(NULL, 16384, PROT_READ|PROT_WRITE, MAP_SHARED|MAP_POPULATE, 9, 0x10000000) = 0xfaf0bf1e2000
mmap(NULL, 9280, PROT_READ|PROT_WRITE, MAP_SHARED|MAP_POPULATE, 9, 0) = 0xfaf0be71d000
epoll_ctl(5, EPOLL_CTL_ADD, 9, {events=EPOLLIN|EPOLLRDHUP|EPOLLET, data=0}) = 0
io_uring_enter(9, 1, 0, 0, NULL, 128)   = 1
futex(0xfaf0bf2557f0, FUTEX_WAIT_PRIVATE, 1, NULL) = 0
mmap(NULL, 2162688, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_STACK, -1, 0) = 0xfaf0be50d000
mprotect(0xfaf0be51d000, 2097152, PROT_READ|PROT_WRITE) = 0
rt_sigprocmask(SIG_BLOCK, ~[], [], 8)   = 0
clone3({flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, child_tid=0xfaf0be71c150, parent_tid=0xfaf0be71c150, exit_signal=0, stack=0xfaf0be50d000, stack_size=0x20e960, tls=0xfaf0be71c7a0} => {parent_tid=[746758]}, 88) = 746758
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
futex(0xfaf0be71c850, FUTEX_WAKE_PRIVATE, 1) = 1
io_uring_enter(9, 1, 0, 0, NULL, 128)   = 1
futex(0xfaf0bf2557f0, FUTEX_WAIT_PRIVATE, 1, NULL) = -1 EAGAIN (Resource temporarily unavailable)
close(10)                               = 0

@Darksonn Darksonn added A-tokio Area: The main tokio crate M-fs Module: tokio/fs labels Oct 19, 2025
@Darksonn Darksonn changed the title Fs read io uring fs: support io_uring with tokio::fs::read Oct 20, 2025
let mut offset = 0;

while size_read < size {
let left_to_read = (size - size_read) as u32;


Suggested change
let left_to_read = (size - size_read) as u32;
let left_to_read = u32::try_from(size - size_read).unwrap_or(u32::MAX);

To properly support files bigger than 4GB.

Contributor Author


The max read size at a time is u32::MAX; we read the rest in subsequent iterations.

Contributor Author


In the future, if we know we're reading more than u32::MAX bytes, we can batch two read requests to avoid extra syscalls.
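
A toy model of the clamped loop under discussion (not the PR's code) shows why files over 4 GiB simply take additional iterations:

```rust
// Each submission reads at most u32::MAX bytes; the remainder is picked up
// by the next iteration of the loop.
fn next_read_len(size: u64, size_read: u64) -> u32 {
    u32::try_from(size - size_read).unwrap_or(u32::MAX)
}

fn main() {
    let size = 5 * 1024 * 1024 * 1024u64; // pretend we have a 5 GiB file
    let (mut size_read, mut submissions) = (0u64, 0);
    while size_read < size {
        let len = next_read_len(size, size_read);
        size_read += u64::from(len); // pretend the kernel read `len` bytes
        submissions += 1;
    }
    assert_eq!(submissions, 2); // one u32::MAX-sized read, then the remainder
}
```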

@Daksh14 Daksh14 force-pushed the fs_read_io_uring branch 7 times, most recently from 6237a4c to 636cfb8 on October 27, 2025 13:36
// increase read size per Op submission
// 2. In case of small reads by the kernel, also gradually increase
// read size per Op submission to read files in lesser cycles
if size_read_usize <= buf.len() {
Member


buf.len() (r_buf) is updated at line 119, and it accumulates the old length + size_read_usize.
I think this if check will always pass.

Contributor Author


I see, I'll change the buf.len() to read_len to detect shorter reads. How else do you think I should detect shorter reads?

Member


You could add something like let requested = read_len; anywhere before read_len -= size_read; and use if size_read_usize < requested { here.
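
Restated as a standalone sketch (hypothetical, mirroring the loop's variable names), the check compares the CQE result against the bytes requested for that particular submission:

```rust
// A read is "short" when the CQE reports fewer bytes than were requested.
fn is_short_read(requested: usize, size_read_usize: usize) -> bool {
    size_read_usize < requested
}

fn main() {
    assert!(is_short_read(4096, 1024)); // kernel returned fewer bytes
    assert!(!is_short_read(4096, 4096)); // full read
}
```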

Member

@ADD-SP ADD-SP left a comment


This looks complicated; I will review it incrementally.

Comment on lines +33 to +30
// extra single capacity for the whole size to fit without any reallocation
let buf = Vec::with_capacity(size_hint.unwrap_or(0).saturating_add(1));
Member

@ADD-SP ADD-SP Oct 29, 2025


extra single capacity for the whole size to fit without any reallocation

Why doesn't the standard library do it?

// https://doc.rust-lang.org/1.90.0/src/std/fs.rs.html#308
pub fn read<P: AsRef<Path>>(path: P) -> io::Result<Vec<u8>> {
    fn inner(path: &Path) -> io::Result<Vec<u8>> {
        let mut file = File::open(path)?;
        let size = file.metadata().map(|m| m.len() as usize).ok();
        let mut bytes = Vec::try_with_capacity(size.unwrap_or(0))?;
        io::default_read_to_end(&mut file, &mut bytes, size)?;
        Ok(bytes)
    }
    inner(path.as_ref())
}

Contributor Author

@Daksh14 Daksh14 Oct 29, 2025


Because they react to the bytes read rather than knowing ahead of time how many bytes to read. In the case of a short read at EOF, when try_reserve is called it allocates more space than needed and doubles the capacity; with the +1 there's no need, and the capacity isn't necessarily grown.
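
A simplified model of that reasoning (hypothetical; not std's actual default_read_to_end, which reads into uninitialized spare capacity):

```rust
use std::fs::File;
use std::io::Read;

// EOF is only confirmed by a read returning 0, and that final probe needs
// at least one byte of room. With exactly `size` capacity the probe would
// force a reallocation; with `size + 1` the spare byte absorbs it.
fn read_all(path: &str) -> std::io::Result<Vec<u8>> {
    let mut file = File::open(path)?;
    let size = file.metadata()?.len() as usize;
    let mut buf = vec![0u8; size + 1];
    let mut filled = 0;
    loop {
        let n = file.read(&mut buf[filled..])?;
        if n == 0 {
            buf.truncate(filled); // drop the unused probe byte
            return Ok(buf);
        }
        filled += n;
        if filled == buf.len() {
            // no spare room left: a real implementation would have to grow
            buf.resize(buf.len() * 2, 0);
        }
    }
}
```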

Comment on lines 21 to 26
// TODO: use io uring in the future to obtain metadata
let size_hint: Option<usize> = file
.metadata()
.await
.map(|m| usize::try_from(m.len()).unwrap_or(usize::MAX))
.ok();
Member


The purpose of size_hint is to improve performance by avoiding memory allocation. However, it involves thread pool mechanisms, which may not be more efficient than multiple memory allocations due to the overhead of creating/terminating threads and inter-thread synchronization operations.

We need performance data to demonstrate that this approach is faster than the method without using size_hint.

I asked this because I want to eliminate unnecessary logic as much as possible.

Contributor Author


My question is: why would you think that? Do you think a statx helper would eliminate this problem, or would it still involve thread-pool mechanisms? I guess it will also be a future SQE.

I will include some benchmarks and upload them in a separate branch to test this @ADD-SP

Member

@ADD-SP ADD-SP Oct 29, 2025


Just two simple benchmarks are enough: one for a small file, another for a large file.
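
Something like this rough harness would cover both cases (hypothetical file paths; a real benchmark would warm up and average many iterations, e.g. with criterion):

```rust
use std::time::Instant;

#[tokio::main]
async fn main() -> std::io::Result<()> {
    // One small file and one large file, prepared ahead of time.
    for path in ["/tmp/bench_small_4k.bin", "/tmp/bench_large_1g.bin"] {
        let start = Instant::now();
        let bytes = tokio::fs::read(path).await?;
        println!("{path}: {} bytes in {:?}", bytes.len(), start.elapsed());
    }
    Ok(())
}
```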

Member

@mox692 mox692 left a comment


I haven't checked all of the details in read_uring.rs, but I've left some comments on what I've noticed so far.

Comment on lines +36 to +41
// SAFETY: The `len` of the amount to be read and the buffer that is passed
// should have capacity > len.
//
// If `len` read is higher than vector capacity then setting its length by
// the caller in terms of size_read can be unsound.
pub(crate) fn read(fd: OwnedFd, mut buf: Vec<u8>, len: u32, offset: u64) -> Self {
Member


This function should be marked as unsafe if the caller needs to follow this.

Contributor Author

@Daksh14 Daksh14 Nov 1, 2025


I would agree, but there's nothing inherently unsafe in calling this function, nor is it unsound.
The safety comment is there because this function writes into uninitialized capacity; the caller will then try to set_len, and if they do that on the basis of size_read without sufficient capacity/len, it will be UB.
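
A sketch of that invariant (hypothetical, not the PR's code):

```rust
// `set_len` is only sound when the new length stays within capacity and
// every byte below it has been initialized, here by the completed read.
fn apply_cqe_result(buf: &mut Vec<u8>, size_read: usize) {
    let new_len = buf.len() + size_read;
    assert!(
        new_len <= buf.capacity(),
        "CQE reported more bytes than the buffer has capacity for"
    );
    // SAFETY: the read Op handed the kernel `buf`'s spare capacity, and the
    // kernel initialized exactly `size_read` bytes past the current length.
    unsafe { buf.set_len(new_len) };
}
```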

@Daksh14 Daksh14 requested a review from ADD-SP November 3, 2025 06:52
@Daksh14 Daksh14 requested review from martin-g and mox692 November 3, 2025 06:52