IRC channel logs
2024-02-20.log
<Gooberpatrol66>is the discussion here in 2013-09-18 about pci arbitration obsolete since pci-arbiter has been added?
<damo22>would be good if Avadis Tevanian Jr was lurking around to help with the vm map locking :D
<damo22>vm_allocate returns with the vm_map lock in a read lock
<damo22>ah no it unlocks the map in Bailout
<damo22>Gooberpatrol66: we have solved most of the problems that would otherwise prevent virtio from working
<damo22>rump has virtio disk, we could use it
<damo22>the only reason its disabled is i had a compilation failure and had to remove it temporarily
<damo22>otherwise it would actually work
<damo22>although its not clear to me why its any different from having a working AHCI
<damo22>solid_black: im starting to think there is no bug with the vm locking
<damo22>i am seeing all threads in W state with no vm lock calls
<damo22> 65 ./test.inf(786) (f61001e8): 8 threads:
<damo22> 1 (f61bfe08) .W.O..(mach_msg_continue) 0
<damo22>oh but the backtraces of sleeping threads do show vm stuff
<solid_black>the second thread is just trying to lock the map (for vm_allocate, but that doesn't matter)
<solid_black>the VM regions are disjoint: f8892000,f8893000 vs f8894000,f8895000
<damo22>its wiring and unwiring the same thing?
<damo22>the first arg is the ipc_kernel_map
<solid_black>can you see what that entry's actual start/end VM addresses are?
<solid_black>not sure what we'll learn from that though; it would match either one region or the other
<damo22>isnt the kernel supposed to not do this?
<damo22>ie, wire/unwire the same object while the map is unlocked
<solid_black>it's supposed to not do that, that's what "well behaved" refers to
<solid_black>see, it really thinks it's wiring/unwiring two different regions of virtual memory, but somehow that ends up being the same entry / vm_page
<damo22>but the parameters it passes to the wire/unwire faults are the same
<solid_black>and yet the address pairs passed to vm_map_pageable are different, can you see that?
<damo22>somehow the entry is being reused
<damo22>yes the pageable pairs are different
<solid_black>I really really wish qemu supported rr with smp, we'd have this figured out a long time ago :|
<damo22>does anything check that the entries are unique?
<solid_black>there must be something wrong with how vm_map_pageable_scan args are displayed
<damo22>kdb guesses how to print them from the stack
<solid_black>yes, there's not much to it on i386 (though I wonder how it works on x86_64, and what we could do on aarch64)
<solid_black>but vm_map_pageable_scan is a static func, it might be partly inlined or something
<solid_black>which explains why the stack doesn't contain what kdb expects
<damo22>yesterday i was getting a proper backtrace in gdb
<damo22>it happened to stop when the program hung
<solid_black>could this (the entry reuse) be due to vm_map_clip_start?
<solid_black>it should be a no-op if it doesn't have to do any clipping -- but what if it does?
<solid_black>the implementation of clipping is ok, under the assumption that things are properly locked
<solid_black>but in our case, they're not, because the other thread has released the map lock while still referencing an entry from the map
<solid_black>vm/vm_map.c:1791 here in my tree, might be different for you
<solid_black>next to a comment that says "Pass 1. Clip entries, check for holes and protection mismatches"
<solid_black>it's not just clipping entries though... *lots* of things must be broken in the same way
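A toy illustration of the clipping hazard described above may help. This is not gnumach code; toy_entry and toy_clip_start are invented names. The point is only that clipping splits a coalesced entry by mutating the original in place, so a pointer saved before the clip (while the saver no longer holds the map lock) silently changes meaning:

    #include <stdio.h>
    #include <stdlib.h>

    /* One map entry covering a contiguous address range. */
    struct toy_entry {
        unsigned long start, end;
    };

    /* Clip `e` so it starts at `addr`; the range before `addr` moves into
     * a freshly allocated entry (which real code would link just before
     * `e` in the map).  The original entry is mutated in place. */
    static struct toy_entry *toy_clip_start(struct toy_entry *e, unsigned long addr)
    {
        struct toy_entry *front = malloc(sizeof *front);
        front->start = e->start;
        front->end = addr;
        e->start = addr;
        return front;
    }

    int main(void)
    {
        /* Two adjacent one-page regions coalesced into a single entry. */
        struct toy_entry e = { 0xf8892000UL, 0xf8894000UL };

        /* "Thread A" saved this pointer and then dropped the map lock... */
        struct toy_entry *held = &e;

        /* ...while "thread B", under the write lock, clips the same entry. */
        struct toy_entry *front = toy_clip_start(&e, 0xf8893000UL);

        /* Thread A's pointer now describes a different range than the one
         * it looked up, so both threads end up on the same entry/page. */
        printf("held entry now covers [%#lx, %#lx)\n", held->start, held->end);
        free(front);
        return 0;
    }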
<solid_black>all these things (rightfully) assume that the map is locked and no one else is looking at the entries, so we can change them
<solid_black>that wouldn't take care of clipping, but clipping should not happen if things are not coalesced in the first place
<solid_black>and maybe try reverting dc0c9160173d2a5b40e9d6e29c1800102b17b1ec, ee65849bec5da261be90f565bee096abb4117bdd
<solid_black>a read lock on the map would be enough to prevent this
<solid_black>why does it do that? is that for performance reasons, or needed for correctness / avoiding some other deadlock?
<damo22>wiring in the kernel map or submap
<solid_black>guess something somewhere (in vm_fault) could try to just allocate memory, and that would deadlock
<solid_black>so for one thing, vm_fault.c doesn't actually care about the passed-in entry being a proper entry in the map
<solid_black>it just uses the address, end address, is_sub_map, object, offset
<damo22>can we remove the unlock hack, but do a read lock further down?
<solid_black>so we could, for instance, make a copy of the entry, and pass that copy
<solid_black>but this of course gets more complicated when there are multiple entries on the region
<solid_black>vm_map_copy_overwrite_nested? what does that have to do with this?
<damo22>can we copy the entry and overwrite it, then discard it?
<solid_black>that's kind of what I'm suggesting, make a temp on-stack copy of the entry, unlock the map, then pass the copy to vm_fault_wire
<solid_black>that should work as long as other stuff is indeed well-behaved
<solid_black>but then again, this only works for a single entry; how do you deal with multiple ones? alloca?
<solid_black> /root/damo22s-test/test: Cannot get mach_port_names: (os/kern) resource shortage
<damo22>did you really fill up kernel memory?
<solid_black>but, like, the fact that it no longer panics (I added an assertion that there is only a single entry), and no longer hangs, and all the 8 CPU cores are busy, is reassuring
<solid_black>well s/no longer panics/doesn't panic/, it didn't panic before either
<solid_black>is there some other way I could test it, with some other test?
<solid_black>ah, let me reboot into the full system on non-smp then
<solid_black>oh great, my (non-SMP) VM locked up after "stress --vm 8"
<damo22>root@zamhurd:/part3/test-ipc# ./smp ./test 7
<damo22>./test: Cannot get mach_port_names: (os/kern) resource shortage
<solid_black>it runs a lot of successful iterations before that error, doesn't it?
<solid_black>yes, but you should be able to tell by e.g. cpu load
<damo22>do you think gsync_wait is still broken?
<damo22>try running the infinite loop test
<damo22>rumpdisk: mach_vm_object_pages: Success
<damo22>/hurd/startup: Crashing system; essential task rumpdisk died
<damo22>startup: notifying ext2fs /dev/wd0s3 of reboot...done
<damo22>stress --vm hangs because it chews too much ram i think
<damo22>you need to dial down the amount
<damo22>solid_black: i removed the HACK block altogether
<damo22>now it does similar behaviour to your patch but still breaks on gsync_wait
<solid_black>damo22: but that would deadlock if the vm_fault_wire handler wants to allocate some memory
<solid_black>but at least we now have a fairly good understanding of why the VM wire/unwire deadlock happens
<solid_black>the gsync deadlock, we should also take a closer look at, but clearly it's a different one
<damo22>does vm_fault_wire allocate anything ?
<solid_black>vm_fault certainly does e.g. allocate a new object for a copy, though vm_map_pageable_scan tries to handle that before calling vm_fault_wire
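A minimal user-space sketch of the snapshot idea floated here (drop the map lock, but hand the wiring code a private copy of the entry rather than a live pointer into the map). Everything below is invented scaffolding (toy_map, toy_entry, toy_wire); only the copy/unlock/use ordering is the point, and, as noted in the chat, it covers only the single-entry case:

    #include <pthread.h>
    #include <stdio.h>

    struct toy_entry { unsigned long start, end; };

    struct toy_map {
        pthread_rwlock_t lock;      /* stands in for the vm_map lock */
        struct toy_entry entry;     /* a single entry, for simplicity */
    };

    /* Stands in for vm_fault_wire: it only reads the fields of the entry
     * it is handed, and may itself need the map lock, so the caller must
     * not hold that lock here. */
    static void toy_wire(struct toy_map *map, const struct toy_entry *e)
    {
        (void)map;
        printf("wiring [%#lx, %#lx)\n", e->start, e->end);
    }

    static void wire_region(struct toy_map *map)
    {
        struct toy_entry snapshot;

        pthread_rwlock_wrlock(&map->lock);
        snapshot = map->entry;          /* copy while the map is locked */
        pthread_rwlock_unlock(&map->lock);

        /* Another thread clipping the live entry can no longer change
         * what this thread wires. */
        toy_wire(map, &snapshot);
    }

    int main(void)
    {
        struct toy_map map = { .entry = { 0xf8892000UL, 0xf8893000UL } };
        pthread_rwlock_init(&map.lock, NULL);
        wire_region(&map);
        pthread_rwlock_destroy(&map.lock);
        return 0;
    }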
<damo22>how do i repair a git tree? i get bad object
<solid_black>...and faulting in the page from the object can allocate memory too, depending on what the object is
<damo22>why does gsync_wait lock the task->map but then remove objects from the kernel_map?
<solid_black>that's not our case, is it? we should not hit the remote branch
<damo22>i showed Richard Braun my traces about the deadlock earlier and he has interesting feedback:
<damo22>(08:30:38 PM) braunr: hm it doesn't look like a deadlock
<damo22>(08:30:55 PM) braunr: the unwire call is blocked inside the critical section, isn't it ?
<damo22>(08:33:59 PM) braunr: yeah that's what it looks like to me
<damo22>(08:34:29 PM) braunr: vm_fault_page blocks for some unknown reason, and the rest is just serialized after that operation
<damo22>(08:34:34 PM) braunr: it's not a locking issue
<damo22>(08:35:36 PM) braunr: it could be a missed event around thread_block
<damo22>(08:36:06 PM) braunr: which is the main reason i redesigned my version as thread_sleep, directly with the interlock as an argument
<damo22>(08:36:16 PM) braunr: that's the critical difference that allows not missing events
<damo22>(08:36:49 PM) braunr: it's the same idea as with condition variables, and why the mutex associated with a condvar must be locked when signalling
<solid_black>and the reason vm_fault_page blocks is because it waits for the page to lose its busy bit
<solid_black>and the thread that owns the busy bit is waiting to lock the map
<damo22>solid_black: but could it unblock if there was a missed event ? maybe there is something more to it
<damo22>solid_black: its possible the third thread is waiting for a missed event to unbusy the page that the others are locked on
<damo22>theyre not all waiting on a lock
<solid_black>damo22: I really doubt it's anything else but the deadlock on map/page that I'm describing
<solid_black>you can't tell this from the trace alone, it indeed looks unclear what the unwiring thread is blocking on
<solid_black>which is why I added all the printfs yesterday, and connected gdb, and inspected it
<solid_black>it really is the case that the unwiring thread (which holds the lock) is waiting on a busy page, and the page is marked busy just prior to that by another thread, and that other thread is waiting in vm_map_verify() to lock the map
<solid_black>that's all we knew yesterday; the question was why then would two threads both access the same page, if they supposedly deal each with their own VM region
<solid_black>and yesterday I actually observed two threads wiring and unwiring *the same* VM address range, which is super weird, and makes me think there could be more going on
<solid_black>but today, thanks to your observation, we know that in your trace, the two threads were accessing adjacent address ranges, but somehow the same entry
<solid_black>them accessing the same entry of course explains why they ended up with the same page
<solid_black>so then, why would they access the same entry? -- well, because of clipping/coalescing of entries, of course
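braunr's condition-variable analogy, spelled out as a small pthread sketch (hypothetical names, not Mach code): if the flag is cleared or the signal sent without holding the lock the waiter uses, the wakeup can fire between the waiter's check and its sleep and be lost; holding the same lock on both sides is what makes the event impossible to miss.

    #include <pthread.h>
    #include <stdbool.h>

    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;
    static bool page_busy = true;   /* stands in for the page's busy bit */

    /* Waiter: the check and the sleep are atomic with respect to the
     * signaller, because both hold `lock`. */
    static void wait_for_unbusy(void)
    {
        pthread_mutex_lock(&lock);
        while (page_busy)
            pthread_cond_wait(&cond, &lock);
        pthread_mutex_unlock(&lock);
    }

    /* Signaller: clear the flag and signal under the same lock.  Clearing
     * the bit without a wakeup, or signalling outside the lock, is how
     * events get lost. */
    static void mark_unbusy(void)
    {
        pthread_mutex_lock(&lock);
        page_busy = false;
        pthread_cond_signal(&cond);
        pthread_mutex_unlock(&lock);
    }

    static void *waiter(void *arg) { (void)arg; wait_for_unbusy(); return NULL; }

    int main(void)
    {
        pthread_t t;
        pthread_create(&t, NULL, waiter, NULL);
        mark_unbusy();
        pthread_join(t, NULL);
        return 0;
    }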
<solid_black>if gsync still deadlocks, that must be caused by something else
<damo22>its easy to reproduce with 4 threads
<damo22>the one with openmp and lots of printing
<damo22>i read the gsync.c code, i think the linked list and hashing should be ok
<damo22>i tried to assume it was broken and stepped through it to see if it did what i expected
<damo22>but i dont understand the locking side
<solid_black>"writing core file" is of course a lie, the fs is read-only :|
<solid_black>second attempt (with no change), still running & spitting out port info
<solid_black>it's seemingly not consuming multiple CPU cores though, judging by CPU usage
<damo22>thats because printing takes too long
<damo22>mine hangs at around 8000 iterations
<damo22>if you get back to a shell, it completed without hanging after 10000 iterations
<solid_black>for one thing, it doesn't iterate in order, so the current printed number is rather meaningless
<damo22>did you change anything, or same as before that you posted
<solid_black>just in case, yes nproc is 8, and I ran it as /root/damo22s-test/smp /root/damo22s-test/test-gsync 8
<damo22>hmm its probably easier if you reproduce
<damo22>(10:12:30 PM) braunr: if there is enough memory pressure, the swapper could attempt to balance segments
<damo22>(10:12:36 PM) braunr: calling vm_page_seg_balance_page
<damo22>(10:13:45 PM) braunr: this function does set the busy bit to false, but doesn't wake up any thread beyond that apparently
<damo22>(10:14:46 PM) braunr: unlike say vm_object_copy_slowly
<damo22>(10:14:53 PM) braunr: see PAGE_WAKEUP_DONE
<damo22>(10:15:26 PM) braunr: i'd check whether the busy bit really still is true while the thread waiting for it is blocking
<damo22>(10:15:33 PM) braunr: if it's true, it's a logic problem
<damo22>(10:15:54 PM) braunr: otherwise, the thread has simply not been awoken
<solid_black>good point about vm_page_seg_balance_page not doing a wakeup (unlike PAGE_WAKEUP_DONE), does it not do that intentionally?
<solid_black>I wouldn't expect it to steal a busy page anyway, that sounds disastrous?
<solid_black>Intel i7-8650U (8) @ 4.200GHz (no idea what that means though)
<damo22>youpi: is there some kind of problem we could be facing due to threads running completely concurrently that we would never have seen before, eg with system calls and glibc?
<youpi>damo22: with system calls themselves, probably not, since they're on the stack only & such
<youpi>the non-hurdish part of glibc, of course not since it's used widely on linux
<youpi>the hurdish part of glibc, possibly
<youpi>that's why I recommended testing progressively
<damo22>i gtg to sleep, its my birthday tomorrow :)
<solid_black>and if you're already gone, I'm going to have to debug this all by myself :|
<solid_black>ok, same as your case, all the threads are waiting for a lock on the map, it's unclear what holds this lock
<solid_black>I don't know how to debug this further; none of the threads look like they could be holding the lock
<youpi>adding __FILE__ & __LINE__ ?
<solid_black>lock_t doesn't let new readers in if there are waiting writers
<youpi>not the way we were discussing ;)
<youpi>I *very* often hear people barf at something, just because they haven't actually understood how it works
<solid_black>it still blocks the current thread, instead of just delaying the response message
<youpi>note that it currently works with a gsync_waiter on the stack
<youpi>so you cannot just get rid of the stack
<solid_black>really, this is breaking rpctracing any non-trivial program
<youpi>(13:03:27) solid_black: lock_t doesn't let new readers in if there are waiting writers
<youpi>(13:03:34) solid_black: so you cannot lock it for reading recursively
<youpi>which writers are you thinking of?
<solid_black>it certainly is more expensive than putting things on the stack, but also not that expensive compared to making a syscall, looking up ports, etc
<solid_black>wdym? in general, anyone trying to lock_write() the same lock
<youpi>then it might be useful to do indeed
<youpi>solid_black: sure, but they'll release it at some point
<youpi>solid_black: remember that this is already long ago
<solid_black>who, the writers? they won't release it if they never manage to get it
<youpi>I have *WAY* too much on my plate to remember about everything that goes on
<youpi>solid_black: why shouldn't they ever manage to get it ?
<youpi>I mean, if a read holds the lock for long, sure
<youpi>but that's not supposed to happen
<solid_black>sure, I don't expect you to *remember*, but it's only logical: rpctrace is the primary user of _make remote calls for something that a task/thread normally does to itself_
<youpi>it's logical when you think about it
<solid_black>so anything that breaks that, immediately breaks rpctrace
<youpi>in my daily work I'm mostly working on linux
<youpi>where strace is not affected by futex problems
<youpi>so, no, I wouldn't magically think about rpctrace
<youpi>which has indeed different concerns
<solid_black>yes, normally the outstanding readers should release the lock eventually, but not if one of the readers is the same thread who now waits to take the same lock *again*, recursively, and is blocked on the writer!
<solid_black>it does vm_map_lock_read (task->map); first, and that succeeds
<youpi>"write" doesn't appear in the file
<youpi>remember that I'm currently at work, I cannot spend time on checking things and stuff
<solid_black>then it calls gsync_prepare_key() -> probe_address() -> vm_map_lookup(), while holding the read lock
<youpi>sure, if you have a read lock you cannot expect to be able to acquire a write lock
<solid_black>what if another thread calls lock_write() in the middle?
<youpi>ah so another read, not a write
<youpi>ah, so a read then write then read
<youpi>(write being another thread)
<solid_black>now the nested lock_read() inside vm_map_lookup() is waiting for the writer to go away, but the writer is waiting for the original read lock to go away
<youpi>that was completely not obvious from what you wrote above
<solid_black>maybe short real-time messages are not the best medium for complex topics :|
<solid_black>so, one way to fix / work around this would be to make gsync get a *write* lock, and then allow recursion
<solid_black>it's a shame, because it doesn't need a write lock, though
<youpi>I'd say I would really *not* be surprised that various code in gnumach has such double-read-locks
<youpi>and thus we might want to have some debugging tool to trace such errors
<youpi>I wonder if we really need to keep a read lock on the map
<youpi>and whether we could rather keep some lock on the object instead
<youpi>as the vm_map_lookup() way of working suggests, actually
<solid_black>sneek: later tell damo22: I reproduced and investigated your gsync deadlock, it's a completely unrelated (to the VM wire/unwire thing) issue, purely related to locking (readers vs writers) inside gsync
<youpi>vm_map_enter takes the object and not the map, after all
<solid_black>why does gsync_wake need to take the explicit read lock on the map at all?
<youpi>it was probably to make sure that the user doesn't release the memory in the meanwhile
<solid_black>it has to synchronize with gsync_wait, sure, but the locking inside vm_map_lookup would accomplish that
<youpi>but we can keep some lock on the object itself instead
<youpi>to be sure that even if the user releases its mapping, the underlying memory is still there and we can still synchronize on it
<youpi>it's the memory object that matters in the end anyway
<solid_black>I don't think it matters, gsync_wake only uses the memory address as a key to find waiters to wake, it doesn't care about the underlying memory cell at all, unlike gsync_wait
<youpi>I was looking at gsync_wait indeed
<youpi>which we have to fix too anyway
<youpi>wait will have the same issue
<youpi>wake can need to mutate the value
<youpi>so it does care about the underlying memory
<youpi>but there as well it's essentially the memory object that matters
<solid_black>well, if the user is doing "serious" operations on the VM region concurrently (such as unmapping this object and mapping another one in its place), I wouldn't expect futex/gsync to behave in any nice manner (other than not crashing) anyway
<youpi>it does make sense to see unmap concurrently with futex / gsync operations
<solid_black>but we need to convince ourselves that whatever new locking scheme we devise, it won't let a wake() slip in between wait() checking the number and going to sleep
<youpi>it poses a lot of problems in userland implementations of e.g. semaphores
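The deadlock solid_black describes, reduced to a toy model: a lock_t-style rwlock with writer preference refuses new readers once a writer is queued, so a thread that already holds a read lock and read-locks again behind a waiting writer blocks forever. The names (toy_rwlock, read_lock_would_succeed) are invented; the admission rule is the point.

    #include <stdio.h>
    #include <stdbool.h>

    /* Toy writer-preference rwlock: counters are enough to show the
     * admission rule, no actual blocking is implemented. */
    struct toy_rwlock {
        int readers;            /* readers currently holding the lock */
        int waiting_writers;    /* writers queued behind them */
    };

    /* A new read lock is granted only if no writer is waiting; this is
     * what keeps writers from starving, and also what forbids taking the
     * read lock recursively. */
    static bool read_lock_would_succeed(const struct toy_rwlock *l)
    {
        return l->waiting_writers == 0;
    }

    int main(void)
    {
        struct toy_rwlock map_lock = { 0, 0 };

        map_lock.readers++;          /* gsync_wait: vm_map_lock_read(task->map) */
        map_lock.waiting_writers++;  /* some other thread: waiting to write-lock */

        /* gsync_wait now re-enters the lock via vm_map_lookup() -> lock_read() */
        printf("nested read admitted? %s\n",
               read_lock_would_succeed(&map_lock) ? "yes" : "no");

        /* Prints "no": the nested read waits for the writer, the writer
         * waits for the outer read to be released -- by the very thread
         * that is now blocked.  That is the gsync deadlock. */
        return 0;
    }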
<youpi>if you do sem_post(&struct->sem); free(struct);
<youpi>yes, it looks bogus, it kinda is
<youpi>but there are various cases where userland wants to do such things
<youpi>another would be on sem_wait(&struct->sem); yes
<youpi>and not accessing anything else in it
<youpi>again, yes it looks bogus, I'm just taking the simple example to show my point
<solid_black>you need a happens-before between the other threads stopping to access the semaphore, and you releasing it
<youpi>you have the same kind of concerns with pthread_mutex_lock/unlock vs pthread_mutex_destroy
<youpi>it's *hard* to properly implement pthread_mutex_destroy, making sure that the last thread doing pthread_mutex_unlock really has finished working on the mutex
<youpi>and then you'll be able to free it
<youpi>solid_black: again, I was just using a small dumb example to explain the kind of thing I'm talking about
<youpi>actual cases are much more involved
<solid_black>I don't understand why it's supposedly not actually bogus and something that we should support
<youpi>if you don't want to take that, there is a point to make but it's long to explain
<youpi>I'm not sure I'll take the time to explain
<solid_black>but here, you know for sure that thread1 is completely done with it, because you've witnessed blah(&x->state) being true *after* you grabbed the mutex, meaning thread1 has already released the mutex
<youpi>you don't know whether pthread_mutex_unlock() is really finished with accessing the memory
<solid_black>so your gsync_wake() could still be in the process of completing when the vm region is already deallocated
<youpi>that kind of stuff happens, yes
<youpi>but if we keep a ref or lock on the object, we can still continue fine
<youpi>that's also important for cross-process semaphores and such
<youpi>sem_post(&shared_mem->p); munmap(shared_mem); , I believe posix probably says it's fine
<youpi>and again, that's just a simple example
<youpi>you might have whatnot odd examples
<solid_black>so in other words, [you believe] that pthread primitives do establish a happens-before between unlock/post *being done accessing the mutex/semaphore* in a thread that held the mutex/semaphore first, and unlock/post *returning* in a thread that held it next
<youpi>not unlock/post for the latter, lock/wait for the latter
<youpi>and I'm not actually talking about that
<youpi>I'm just talking about the fact that memory can disappear under your feet
<youpi>especially where you're talking about another task's memory
<youpi>that's the whole point of synchronization primitives
<youpi>sem_post() in one thread, sem_wait() in the other
<youpi>you have a happens-before between the two
<solid_black>I have a happens-before between thread1 *entering* post and thread2 *exiting* wait, yes
<youpi>ah, you talk about enter/exit
<solid_black>and you're kind of saying that I also have a happens-before between thread1 *exiting* post and something in thread2, which (the latter) should at least be exiting the *next* post
<youpi>I'm not saying that it is there, I'm saying that in various situations people need it :)
<youpi>IIRC glibc implements that in pthread_mutex/cond_destroy
<youpi>something like it makes sure that there is no unlock left
<solid_black>pthread_mutex_destroy in nptl really looks like it's doing nothing to me
<youpi>I don't remember the details
<youpi>but I do remember that it's tricky
<youpi>and glibc does some stuff about it
<solid_black>what we really care about is that a call to gsync_wake() can not happen in between gsync_wait() reading the value and going to sleep
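The kind of userland pattern youpi is pointing at, as a minimal POSIX sketch (struct and field names are hypothetical): the waiter frees the block as soon as sem_wait returns, so the poster must not touch the semaphore memory after its wakeup becomes observable. Whether this pattern is actually guaranteed is exactly what is being debated above; the cross-process variant replaces free() with munmap() of a shared mapping, which is where gsync's need to keep the underlying memory alive comes from.

    #include <pthread.h>
    #include <semaphore.h>
    #include <stdlib.h>

    /* Hypothetical shared state: the semaphore lives inside the heap
     * block that will be freed once the other side is done with it. */
    struct job {
        sem_t done;
        int   result;
    };

    /* Worker: fills in the result and posts.  After sem_post() it must
     * not touch *j again, because the waiter may free it the moment the
     * wakeup is observable. */
    static void *worker(void *arg)
    {
        struct job *j = arg;
        j->result = 42;
        sem_post(&j->done);
        return NULL;
    }

    int main(void)
    {
        struct job *j = malloc(sizeof *j);
        pthread_t t;

        sem_init(&j->done, 0, 0);
        pthread_create(&t, NULL, worker, j);
        pthread_detach(t);

        sem_wait(&j->done);     /* happens-after the worker's sem_post() */
        sem_destroy(&j->done);  /* the debated step: only sound if sem_post()
                                   is done with the memory once its wakeup
                                   is observable */
        free(j);
        return 0;
    }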
<youpi>sure, but that's already handled by the hashed mutex
<youpi>no need to have interferences with map & object
<youpi>ok, see __pthread_cond_destroy instead
<youpi>it's even more complex than the example I showed above
<youpi>but again, I was just showing the kind of problem
<youpi>not going to take the time to write a reproducer
<solid_black>ok, so we lock the kmutex (/me wonders how that differs from locks, but whatever), then check, if it matches we insert ourselves as a waiter
<solid_black>the locking on the map doesn't really serve any purpose then, does it?
<youpi>it's meant to keep the memory alive
<youpi>but I'd say we can probably replace that with keeping a ref or lock on the memory object instead
<youpi>since that's what we use to do the mem map
<youpi>ah but there is also the direct access
<youpi>so we indeed do not want the address to disappear
<youpi>(and keeping a ref/lock on the object will not be enough)
<solid_black>vm_map_lookup() mutates the map if it has to CoW or make a new VM object
<solid_black>so, vm_allocate a new page, call gsync on it with GSYNC_MUTATE, guaranteed deadlock
<solid_black>let me actually test this, sounds too ~good~ bad to be true
<youpi>I wouldn't be surprised at all that such a case exists
<youpi>simple_locks are no-ops indeed
<youpi>but read/write locks as used by vm are not no-ops
<solid_black>there only are a handful of users of vm_map_lookup()
<solid_black>we could try to add a boolean_t map_locked argument like some other map APIs have
<youpi>problem is: there are submaps
<youpi>see the loop inside vm_map_lookup
<youpi>possibly that can be handled somehow
<youpi>ah, userland memory doesn't have it?
<youpi>but we have to make sure of that
<solid_black>maybe what we really want is the (sub)map being *left* locked by vm_map_lookup
<youpi>I was wondering about that possibility indeed
<youpi>there could be a parameter for this
<solid_black>or -- do you think maybe we should use the existing vm_map_version_t/vm_map_verify/vm_map_verify_done mechanism?
<youpi>it looks more complex to me than just making the lookup leave the lock held
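For reference, the futex-style discipline described above ("lock the kmutex, check the word, insert ourselves as a waiter") in a small user-space sketch. All names are invented and this is not the gsync code; the point is the ordering: because both sides take the same (hashed) lock, a wake cannot slip in between the waiter's value check and its going to sleep.

    #include <pthread.h>
    #include <stdio.h>

    /* One hash bucket: in gsync the lock would be picked by hashing the
     * (task, address) key; one bucket is enough for the sketch. */
    static pthread_mutex_t bucket_lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  bucket_cond = PTHREAD_COND_INITIALIZER;

    static volatile int word = 0;

    /* Wait until *addr no longer holds `expected` (cf. gsync_wait).  The
     * value check and the sleep happen under the same lock that the wake
     * side takes, so no wakeup can be lost in between. */
    static void toy_wait(volatile int *addr, int expected)
    {
        pthread_mutex_lock(&bucket_lock);
        while (*addr == expected)
            pthread_cond_wait(&bucket_cond, &bucket_lock);
        pthread_mutex_unlock(&bucket_lock);
    }

    /* Wake everyone queued on the bucket (cf. gsync_wake). */
    static void toy_wake(void)
    {
        pthread_mutex_lock(&bucket_lock);
        pthread_cond_broadcast(&bucket_cond);
        pthread_mutex_unlock(&bucket_lock);
    }

    static void *waiter_thread(void *arg)
    {
        (void)arg;
        toy_wait(&word, 0);
        printf("woken, word = %d\n", word);
        return NULL;
    }

    int main(void)
    {
        pthread_t t;
        pthread_create(&t, NULL, waiter_thread, NULL);

        pthread_mutex_lock(&bucket_lock);
        word = 1;               /* change the value under the bucket lock... */
        pthread_mutex_unlock(&bucket_lock);
        toy_wake();             /* ...then wake whoever is queued on it */

        pthread_join(t, NULL);
        return 0;
    }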