IRC channel logs

2024-02-17.log


<damo22>what is the difference between an interruptible and non-interruptible thread_sleep?
<damo22>our sleep locks use non-interruptible ones
<youpi>I don't know which operation exactly, but basically non-interruptible cannot fail
<youpi>at worst it's stuck
<youpi>while interruptible can be interrupted by some operation (I don't know which)
<youpi>it's similar to pthread_mutex_lock vs sem_wait
<youpi>the former is uninterruptible while the latter can be interrupted by a signal
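
The pthread_mutex_lock vs sem_wait comparison can be sketched in user-space C. This is only an illustration of the POSIX behaviour youpi refers to, not of Mach's thread_sleep; the helper names lock_uninterruptible and wait_interruptible are made up for the example.

```c
#include <errno.h>
#include <pthread.h>
#include <semaphore.h>

/* Uninterruptible from the caller's point of view: pthread_mutex_lock()
   never returns EINTR. It either gets the mutex or keeps waiting. */
int lock_uninterruptible(pthread_mutex_t *m)
{
    return pthread_mutex_lock(m);
}

/* Interruptible: sem_wait() may fail with errno == EINTR when a signal
   handler runs while the caller is blocked. The caller then decides
   whether to retry or to give up (hypothetical helper, for illustration). */
int wait_interruptible(sem_t *s, int retry_on_signal)
{
    for (;;) {
        if (sem_wait(s) == 0)
            return 0;            /* got the semaphore */
        if (errno != EINTR)
            return -1;           /* a real error */
        if (!retry_on_signal)
            return -1;           /* interrupted; errno is still EINTR */
        /* otherwise loop and wait again */
    }
}
```

A caller that wants ^C to abort the wait would pass retry_on_signal = 0 and check errno for EINTR.
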
<damo22>if the thread is sleeping, in both cases it can be interrupted by another thread so its confusing terminology
<damo22>i guess cancelling the sleep is possible if its interruptible
<damo22>is there any reason why one may want to cancel a thread_sleep() in a sleep lock other than when the interlock is available?
<damo22>so it looks like solid_black's theory may be correct
<damo22>(10:27:36 PM) solid_black: two threads running on different CPU cores contend for the same lock, one thread gets it, the other one goes to sleep waiting for it, there is nothing else on the second core to run, so it enters low power mode (machine_idle), the first thread releases the lock once it's done, but never pokes the second core to wake up
<damo22>if you starve a pset from having threads, it deadlocks i think because threads dont resume
<damo22>or maybe its just a missing wakeup
<damo22>how does a lock know which processor to wake?
<damo22>or a thread
<youpi>damo22: there's "interruptible" and "interruptible"
<youpi>we can't invent dozens of words for the various cases
<youpi>there's a very wide range between "I'm getting a hardware interrupt", and "my sem_wait() call gets interrupted"
<youpi>one can want to cancel a thread_sleep when e.g. the user presses ^C (i.e. a signal)
<youpi>« never pokes the second core to wake up » that shouldn't be a problem, provided that there is a timer interrupt
<youpi>the scheduler triggers and checks for ast
<youpi>the lock doesn't know which processor to wake, it just wakes a thread, and it's the scheduler that sends an ipi
<youpi>the precise ipi that your "fix" was removing :)
<youpi>the lock knows the thread from the hashtable of the waiting threads
<youpi>as registered by assert_wait()
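
The idea of a hash table of waiting threads keyed by the event they sleep on can be sketched as a toy model. Everything below (the toy_ names, the bucket count, the hash function) is an assumption made for illustration; it is not gnumach's actual data structure, and it leaves out the locking and the scheduler, which is the part that would pick a processor and send the IPI.

```c
#include <stddef.h>
#include <stdint.h>

#define TOY_NBUCKETS 8

enum toy_state { TOY_RUNNING, TOY_WAITING, TOY_RUNNABLE };

/* Hypothetical thread record, for illustration only. */
struct toy_thread {
    enum toy_state state;
    const void *wait_event;      /* what the thread sleeps on, e.g. a lock */
    struct toy_thread *next;     /* chain within one hash bucket */
};

static struct toy_thread *toy_buckets[TOY_NBUCKETS];

static size_t toy_hash(const void *event)
{
    return ((uintptr_t)event >> 4) % TOY_NBUCKETS;
}

/* In the spirit of assert_wait(): record that t is about to sleep on event. */
void toy_assert_wait(struct toy_thread *t, const void *event)
{
    size_t b = toy_hash(event);
    t->wait_event = event;
    t->state = TOY_WAITING;
    t->next = toy_buckets[b];
    toy_buckets[b] = t;
}

/* In the spirit of thread_wakeup(): mark every thread waiting on event
   as runnable and return how many were woken. The waker only knows the
   event; it never needs to know which CPU the thread will run on. */
int toy_thread_wakeup(const void *event)
{
    struct toy_thread **pp = &toy_buckets[toy_hash(event)];
    int woken = 0;

    while (*pp != NULL) {
        struct toy_thread *t = *pp;
        if (t->wait_event == event) {
            *pp = t->next;       /* unlink from the bucket */
            t->next = NULL;
            t->wait_event = NULL;
            t->state = TOY_RUNNABLE;
            woken++;
        } else {
            pp = &t->next;
        }
    }
    return woken;
}
```
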
<youpi>reaaaaallly, reading e.g. "the linux kernel" from Cesati & Bovet would help understanding things
<youpi>the pdf is even available
<youpi>as well as e.g. the minix book from tanenbaum
<youpi>(also, note that when a thread is sleeping, it's not "interrupted by another thread", sleeping means it's not running, so it cannot get e.g. hardware interrupts or such, other threads just execute on the cpu without caring about the sleeping thread)
<youpi>(interrupting the sleep is really about something saying "I actually want to wake the thread even if what it was waiting for is not actually available", so e.g. a ^C)
<damo22>i am not using my old "fix" anymore
<damo22>im trying to understand why the deadlock occurs
<damo22>there are simultaneous lock_read and lock_write calls to the same lock, and no threads in the slave_pset runq
<damo22>so, for some reason the APs are not waking
<damo22>but also they have nothing to run
<youpi>but are the threads actually woken up ?
<youpi>if they're not, it's not a missing ipi or scheduler problem
<youpi>it's just a missing unlock or such
<damo22> TASK THREADS
<damo22> 68 ./test(789) (f673aea0): 5 threads:
<damo22> 0 (f6781398) .W..NF 0xf5f2c848
<damo22> 1 (f5f47248) .W.O..(mach_msg_continue) 0
<damo22> 2 (f5f47e48) .W..N. 0xf5f2c848
<damo22> 3 (f5eef500) .W..N. 0xf5f2c848
<damo22> 4 (f5eef800) .W..N. 0xf5f2c848
<youpi>so they're all waiting
<youpi>so it's not a missing ipi or scheduling issue
<youpi>they have just not been woken up at all
<youpi>(in thread terms, not in cpu terms)
<damo22>ok
<damo22> https://paste.debian.net/plain/1307585
<damo22>it seems fishy so many lock_writes
<youpi>they're just stuck on the same common problem
<youpi>somebody probably forgot to release the lock
<youpi>or is keeping a read lock
<youpi>since read_lock itself is stuck, it's probably somebody who forgot to release a write lock
<youpi>it's surprising however, since on UP (uniprocessor) we do have such acquires/releases as well
<youpi>but possibly not enough concurrency to see the issue
<damo22>i will look in vm_map_find_entry_anywhere or vm_map_remove
<youpi>you may want to enable MACH_LDEBUG
<youpi>so that you can read from l->writer what thread is monopolizing the lock
<damo22>in vm_map_delete there is a while loop that calls vm_map_lock() repeatedly
<damo22>with no unlock
<damo22> /*
<damo22> * Step through all entries in this region
<damo22> */
<damo22> while ((entry != vm_map_to_entry(map)) && (entry->vme_start < end)) {
<youpi>vm_map_entry_wait does the unlock
<damo22>in vm_map_deallocate() it says the map should not be locked on entering this function, but the first call inside vm_map_delete -> vm_map_entry_wait unlocks it
<damo22>shouldnt it first lock it?
<damo22>also, with LDEBUG i get very frequent assert(!in_interrupt[cpu_number()])
<damo22>assert failure
<damo22>does that mean some lock is being taken while in interrupts, without using the right simple_lock_irq()
<damo22>ddb/db_mp.c:294
<damo22>{cpu0} ../kern/sched_prim.c:1295: thread_setrun: Assertion `!in_interrupt[cpu_number()]' failed.Debugger invoked: assertion failure
<damo22>....
<damo22>Assert(c108d07a,c108f75a,50f,c1086184,f5bea018)+0x7c
<damo22>thread_setrun(f541c310,1,0,c104bb79)+0x5d0
<damo22>clear_wait(f541c310,1,0,f5bba188)+0x1f4
<damo22>thread_timeout(f541c310,f541c41c,0,c1044c05)+0x20
<damo22>softclock(2756cd00,0,0,43,e0)+0x92
<damo22>clock_interrupt(2710,0,1,c1021a92)+0x2a9
<damo22>hardclock(0,0,c10296a4,f61d9f6c,0)+0x5a
<damo22>ok so the locks are not the right ones in the scheduler
<damo22>but they are already called at splsched
<damo22>false positives?
<damo22>ok i fixed most of the false positives i could find
<damo22>now i get this:
<damo22>{cpu5} ../kern/lock.c:293: lock_write: Assertion `!in_interrupt[cpu_number()]' failed.Debugger invoked: assertion failure
<damo22>Kernel Breakpoint trap, eip 0xc1000a24, code 0, cr2 f5f68cf4
<damo22>Stopped at Debugger+0x13: int $3
<damo22>Debugger(c108cd63,c108cdba,f5f68d60,c1000a4c,f670d688)+0x13
<damo22>Assert(c108cdba,c108f282,125,c1085b00,c10cc0e0)+0x7c
<damo22>lock_write(c10cc0e0,1000,f8894000,f8893000)+0x17e
<damo22>vm_map_lock(c10cc0e0,c10cc0f0,f5f68de0,c1007321)+0x10
<damo22>vm_map_find_entry_anywhere(0,f5f68e00,f5f68e00,c10616be)+0x1bd
<damo22>vm_map_enter(c10cc0e0,f5f68eb0,1000,0,1,0,0,0,3,7,1,0)+0xe6
<damo22>vm_allocate(c10cc0e0,f5f68eb0,1000,1)+0x4b
<damo22>mach_port_names(f5bd10e0,fa0a803c,f5f68f10,fa0a804c,f5f68f14)+0x85
<damo22>_Xmach_port_names(f5418010,fa0a8010,c1033d9d,f5bd10e8,f663ec08)+0x43
<damo22>ipc_kobject_server(f5418000,1c,0,1000)+0x93
<damo22>mach_msg_trap(2804e8c,3,18,40,12)+0xab0
<damo22>how can vm_map_lock be called from an interrupt?
<damo22>do we need to disable interrupts during sleep lock logic?
<damo22>eg should interlock be an irq lock
<youpi>damo22: again, that's explained in the linux kernel book from cesati and bovet
<youpi>if an interrupt handler takes a simple lock, all users of that lock need to use the _irq version of the call
<youpi>that being said, the calltrace you get does not look like an interrupt handler
<youpi>so possibly in_interrupt is getting wrong?
<etno>gnu_srs2: this syscall seems to be part of some "standard" for network mgmt, but I don't know which one. Seen from my naive eye, syscalls are an implementation tainted kind of API. There is certainly a way to implement an adaptation layer for them on top of the Hurd's device API; but is it worth the investment?
<youpi>damo22: the vm_map_delete call from vm_map_deallocate indeed looks fishy
<solid_black>hi all
<youpi>I'm surprised that lock_done doesn't err out in that case, though
<youpi>since it checks various cases
<youpi>if one didn't take the lock before, lock_done is supposed to assert(0)
<youpi>that's worth checking
<youpi>gnu_srs2: maybe people can help me with this task
<youpi>I know there is frustration here and there
<youpi>but maybe people can also realize how much frustration I'm also taking
<youpi>I'm almost never adding features to the hurd
<youpi>rather just fixing stuff here and there
<youpi>(when I find the time to)
<solid_black>youpi damo22: vm_map_delete only unlocks the lock when it finds an entry with in_transition == TRUE, which should not be the case for a map that has no outstanding references?
<youpi>ah, perhaps that's why it's not happening usually
<youpi>it's still completely broken to go through the vm_map_to_entry etc. calls without the lock taken
<solid_black>no it's not, if there are no other references to the map
<youpi>there should be a check against that then
<youpi>solid_black: btw, instead of poking at random, adding the recording of __FILE__ and __LINE__ to LDEBUG could be very convenient
<solid_black>you call vm_map_delete either with the map locked for real, or when you have the *only* reference to the map
<youpi>solid_black: sure, but the function should check that
<youpi>better safe than sorry
<solid_black>the latter is the case for vm_map_deallocate when it finds that the remaining refcount is 0, and is about to deallocate the map altogether
<solid_black>(I did not understand what you meant about LDEBUG?)
<youpi>current LDEBUG only records the writer that took the lock
<youpi>which is convenient to know what thread is monopolizing the lock because it forgot to release it
<youpi>but better also know _what code_ it executed when it took the lock
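
The suggestion of recording __FILE__ and __LINE__ next to the writer could look roughly like this. It is a minimal sketch with a hypothetical struct dbg_lock and a hypothetical DBG_LOCK_WRITE macro; it does not reproduce gnumach's struct lock or its MACH_LDEBUG code, and the actual blocking logic is omitted.

```c
#include <stddef.h>

/* Hypothetical debug-enabled lock, for illustration only. */
struct dbg_lock {
    int held;               /* 1 while write-locked */
    const void *writer;     /* which thread took it (already recorded, per the discussion) */
    const char *file;       /* where it was taken: the proposed addition */
    int line;
};

void dbg_lock_write_at(struct dbg_lock *l, const void *self,
                       const char *file, int line)
{
    /* A real lock would sleep or spin here until the lock is free. */
    l->held = 1;
    l->writer = self;
    l->file = file;
    l->line = line;
}

void dbg_lock_done(struct dbg_lock *l)
{
    l->held = 0;
    l->writer = NULL;
    /* file and line are left in place so the last owner can be inspected */
}

/* The macro captures the call site, so callers do not change. */
#define DBG_LOCK_WRITE(l, self) \
    dbg_lock_write_at((l), (self), __FILE__, __LINE__)
```

With something like this, a stuck lock inspected from the debugger would show both the owning thread and the source location that took it.
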
<youpi>gnu_srs2: the task is not particularly complex, it boils down to going through the _IO* calls, look at those which take a struct as parameter, and check that the struct exists
<youpi>if not, the macro indeed cannot compile, and we can just comment it out safely
<solid_black>do we really want all the ugly ioctls instead of nice RPCs? :(
<solid_black>oh so Mach KDB can backtrace a sleeping thread in-kernel? that's nice
<youpi>solid_black: when they are well-established, that allows to get existing code to just work already without having to port them
<youpi>solid_black: obviously, that's documented both in the info doc of mach, and on the wiki
<youpi>please guys really read the doc, there is so much to learn from it
<solid_black>(ioctls) I understand that, yes, but also RPCs are so much nicer :| if these are really well-established and used by a lot of projects, I guess it makes sense
<etno>Could some ioctls be implemented as wrappers around "clean RPCs" in the caller's space/libs ?
<solid_black>damo22: hm, so all these threads are blocked on trying to lock the same map; but can you check if the map is indeed locked?
<solid_black>etno: indeed, and glibc already does that for some ioctls
<etno>I am interested in reading more about the networking strategy of the Hurd (bridging, routing, userland ethernet driver). What is a good starting point in the doc ?
<etno>(my favorite answer would be that the Hurd doesn't have such a strategy because this is left to the user :-D ) In that case, I would have the following question: Is there already an Ethernet bridging server / an IP routing server ?
<damo22>etno: isnt that pfinet + dhclient?
<damo22>solid_black: you have a lot of good answers but you dont turn them into code
<damo22>eg, how do i add a check for this vm_map thing
<damo22>i am compiling with MACH_LDEBUG = 1
<damo22>and ive silenced a bunch of false positive asserts from locks that already are wrapped in spl
<damo22>see zam/fix-smp
<damo22>i think youpi might be right, the in_interrupt[] array might be getting out of sync
<damo22>so the lock checking thinks we are in an interrupt handler when we are not
<damo22>very strange
<damo22>in_interrupt contains the value 1
<damo22>but im on cpu4 with a failed assert
<damo22>for !in_interrupt[cpu_number()]
<etno>damo22: [pfinet/lwIP] You are right that those probably cover the IP routing. So I will ask more precisely about the Ethernet plumbing: virtual bridge and Ethernet pairs? (Like: can we bridge a qemu instance with the physical Ethernet adapter by inserting a bridge below pfinet/lwIP ? )
<damo22>etno: just give qemu a bridge
<damo22>your emulator can have different ethernet cards configured however you want on the host
<damo22>then hurd sees /dev/ethX
<damo22>in_interrupt[] = 1, 0, 0, 0, 0, ... but im on cpu4 with a failed assert for !in_interrupt[cpu_number()]
<etno>> etno: just give qemu a bridge < hence my question: how is the bridge implemented in the Hurd ?
<biblio>damo22: hardware is not bug free. It might also have bugs or known issues. Before investing too much time, it would be better to test on hardware from different vendor(s).
<damo22>etno: you cant bridge hurd to host unless qemu does it
<damo22>hurd sees whatever /dev is present on its system
<damo22>if you have /dev/eth0 and /dev/eth1 in hurd, you can probably bridge those together inside hurd but i dont know how
<etno>damo22: oh, I realize that what I said is misleading. I would like to run qemu _inside_ a native Hurd, and bridge the guest with a physical dev
<damo22>etno: in that case, use qemu --help
<etno>damo22: ok, well at least I know that I did not miss an obvious way of doing it
<damo22>biblio: i am running this on qemu, qemu is not bug free but its easier to develop than rebooting a real system each change
<etno>If qemu had a net device backend making use of the Hurd Ethernet device interface (a bit like tuntap for Linux), then with a bridging server inserted between the physical device and pfinet/lwIP, this could be accomplished. I was just wondering if such a bridging server existed already among the Hurd servers.
<damo22>etno: maybe you just need to add a net device backend for qemu
<solid_black>etno: I don't think the Hurd currently has a compelling story for routing/bridging/tuntap
<solid_black>and this is something I've been thinking about from time to time
<etno>damo22: that would be a nice (and reachable) work for me
<solid_black>the good news is that on the Hurd, network devices are just like other devices (e.g. /dev/eth0), so no hack like tuntap would be needed, you could just use a socket
<etno>solid_black: nice to see that we have this in common
<etno>solid_black: "so no hack needed" that's what I love with the Hurd
<solid_black>on thing we have for now is eth-multiplexer, which is somewhat like bridging
<solid_black>that's what we use for subhurds, and it could be similarly useful for qemu
<solid_black>s/on/one/
<solid_black>but overall, we need to come up with a design for how this all should work coherently
<etno>"eth-multiplexer", that sounds interesting
<etno>ACTION dives in the doc
<damo22>solid_black: under what conditions do we not need to lock the vm map when deleting?
<solid_black>in all cases, unless we're sure we're the only user of the map (we hold the only reference)
<solid_black>the latter happens either when we've just created the map (so no one else could have seen it yet), or if we're about to delete it
<damo22>so in vm_map_deallocate() can we add some kind of check?
<damo22>or assert
<damo22>it surely cannot call vm_map_delete() without locking
<solid_black>in vm_map_deallocate, it does exactly that check
<solid_black>it does --refcount
<solid_black>and if no references remain, that means the map is to be deleted
<solid_black>in which case it's fine to call vm_map_delete one last time without locking
<solid_black>does that not make sense?
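
The rule described here (delete needs the map locked, unless the caller holds the last reference) together with youpi's earlier request that the function check it can be sketched with a toy map. struct toy_map and both functions are hypothetical stand-ins; the real vm_map_deallocate and vm_map_delete have different signatures and protect the reference count with a lock.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a VM map, for illustration only. */
struct toy_map {
    int ref_count;
    bool locked;
    int nentries;
    bool destroyed;
};

void toy_map_delete_all(struct toy_map *m)
{
    /* The suggested check: either the caller holds the lock, or nobody
       else can reach the map because no references remain. */
    assert(m->locked || m->ref_count == 0);
    m->nentries = 0;
}

void toy_map_deallocate(struct toy_map *m)
{
    /* Real code would hold a lock around the reference count update. */
    if (--m->ref_count > 0)
        return;                  /* other users remain, nothing to do */

    /* Last reference dropped: safe to delete without taking the lock. */
    toy_map_delete_all(m);
    m->destroyed = true;
}
```
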
<damo22>ok
<damo22>i'll add a comment to the code
<damo22>i find the vm code hard to follow
<damo22>so many locks all over the place and not symmetrical
<damo22>eg, some functions enter with no lock and leave with an object locked
<damo22>etc
<solid_black>that happens, yes
<solid_black>it should all be documented though
<solid_black>and it is hard to follow indeed
<damo22>it means you have to carry the state in your head when you trace through the pathway of the code
<solid_black>but at least it's only software, it doesn't deal with hardware, so I have a chance of understanding it if I try
<damo22>i think there might be a missing unlock somewhere
<damo22>since i am getting a deadlock in the vm code somewhere with smp
<damo22>we need lock debugging to work properly
<solid_black>did you follow youpi's suggestion, and make each lock store where it is locked from (which thread and where in the source code)?
<damo22>not yet
<damo22>if you want to try what ive done, you can, im going to sleep... but my branch is here: https://git.zammit.org/gnumach-sv.git/log/?h=fix-smp
<damo22>set MACH_LDEBUG = 1 in configfrag.ac before building
<youpi>etno: there is the eth-multiplexer and the eth-filter
<youpi>basically the idea will be to leave it to the user to set up translators indeed
<youpi>biblio: hardware is not so often bugged, particularly smp is tested, so I'd doubt a hardware issue
<youpi>an error in the maintenance of in_interrupt in the smp case, however, is very plausible
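
The invariant behind the failing assertion can be sketched with a per-CPU counter. This is a toy model with made-up toy_ names and an explicit cpu argument instead of cpu_number(); it only shows what correct bookkeeping looks like, and does not claim to show where gnumach's in_interrupt[] handling goes wrong.

```c
#include <assert.h>

#define TOY_NCPUS 8

/* One counter per CPU; nonzero while that CPU runs an interrupt handler. */
static int toy_in_interrupt[TOY_NCPUS];

/* Entry and exit must be paired on the same CPU index. If exit used a
   different index than entry, one slot would stay stuck at 1. */
void toy_interrupt_enter(int cpu)
{
    toy_in_interrupt[cpu]++;
}

void toy_interrupt_exit(int cpu)
{
    assert(toy_in_interrupt[cpu] > 0);
    toy_in_interrupt[cpu]--;
}

/* The kind of check a sleep lock makes before it may block:
   the current CPU must not be inside an interrupt handler. */
int toy_may_sleep(int cpu)
{
    return toy_in_interrupt[cpu] == 0;
}
```

In this model an interrupt on cpu 0 does not stop cpu 4 from taking a sleep lock, which is why an assertion firing on cpu4 while only slot 0 holds 1 points at the bookkeeping or at the index being read.
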
<youpi>etno: concerning qemu, it will use the tun/tap interface, in the tun case you can just add it to your pfinet
<youpi>so you'll be able to route
<youpi>perhaps eth-multiplexer can take the tap case, I don't remember, it's too long ago and I only had a glimpse
<youpi>damo22: vm code is *always* hard to follow
<youpi>the linux one is several orders of magnitude harder to dive in
<youpi>concerning functions that expect something to be locked, we can add asserts to not only document but also check it
<youpi>about ldebug: "which thread" is already implemented
<youpi>just need to add the __FILE__ and __LINE__ alongside
<youpi>gnu_srs2: checking closer, it seems there is an _IOT_ifreq_int defined, and printing SIOCGIFMETRIC does work fine
<youpi>which issue do you actually encounter exactly?
<youpi>(xy problem again)
<gnu_srs2>what about _IOT_ifreq_short and #define SIOCGIFFLAGS _IOWR('i',17, struct ifreq_short)/* get ifnet flags */
<gnu_srs2>And how to use it??
<youpi>that would completely break the ioctl
<youpi>again, *what* error do you *actually* get ?
<youpi>as long as you don't understand the xy problem, communication will be complicated
<gnu_srs2>error: invalid application of ‘sizeof’ to incomplete type ‘struct ifreq_short’
<gnu_srs2> 949 | if (ioctl(fd, SIOCGIFFLAGS, &ifr) == -1)
<youpi>is _IOT_ifreq_short defined when this is compiled?
<youpi>is _IOT_ifreq_int defined when this is compiled?
<youpi>both are defined in the same file, so I don't see why the short version would compile while the int version would not
<youpi>but that's probably where the problem lies, thus worth checking
<gnu_srs2>/usr/include/net/if.h:# define _IOT_ifreq_short _IOT(_IOTS(char),IFNAMSIZ,_IOTS(short),1,0,0)
<youpi>I'm not asking whether it's defined in some header
<youpi>I'm asking whether it's defined when this is compiled
<youpi>you can use e.g. #ifdef _IOT_ifreq_short #error ok #endif
<youpi>in the code you're compiling
<youpi>to make sure whether things are actually defined
<youpi>notably, these #define are under #ifdef __USE_MISC, so if the program is not compiled with some -D to enable that, it's normal that they don't get defined, see man feature_test_macros
<youpi>possibly the required -D is enabled for linux, but not others, and it's a matter of adding that
<youpi>e.g. -D_GNU_SOURCE
<youpi>which I see dhcpcd use for linux and kfreebsd only
<youpi>you very probably want that for the Hurd too
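
The "is it defined when this is compiled" check can also be written as a small probe that compiles either way. The function name have_iot_ifreq_short is made up for the example; _IOT_ifreq_short is the macro quoted above from the Hurd's net/if.h, so on other systems the probe is simply expected to report that it is absent.

```c
#define _GNU_SOURCE      /* same effect as -D_GNU_SOURCE; must precede the includes */
#include <sys/ioctl.h>
#include <net/if.h>

/* Reports whether _IOT_ifreq_short is visible in this translation unit,
   i.e. after the feature test macros have been applied. */
int have_iot_ifreq_short(void)
{
#ifdef _IOT_ifreq_short
    return 1;
#else
    return 0;
#endif
}
```

Removing the _GNU_SOURCE line and rebuilding shows whether the feature test macro is what makes the difference on a given system.
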
<youpi>damo22: the comment is useful, but the called function could check that it's called with the lock taken, unless the reference counter is indeed 0
<gnu_srs2>configure: hurd*)
<gnu_srs2> echo "CPPFLAGS+= -D_GNU_SOURCE" >>$CONFIG_MK
<flavioc> https://buildd.debian.org/status/fetch.php?pkg=coreutils&arch=hurd-i386&ver=9.4-3&stamp=1704209030&raw=0 I see that some packages are failing to build as they require 64bit time_t support in glibc. should we worry about this? one solution is add such support into glibc to avoid having to pass --disable-year2038 but I wonder what others think
<youpi>flavioc: adding the support is complex
<youpi>it'd need also kernel support etc.
<youpi>I don't think we really need to spend time on this
<youpi>if we really need newer versions of those packages, we can patch debian/rules to pass the option to ignore the issue
<flavioc>makes sense