IRC channel logs

2024-11-24.log


<damo22>i was able to mount and read a usb disk, unmount it and kill rumpusbdisk without crashing my system
<Gooberpatrol_66>yay
<damo22>$ ../configure --enable-kdb --enable-apic --enable-ncpus=8 --disable-linux-groups
<damo22>seems to boot with -smp 6
<damo22>$ sudo ~/bin/smp /usr/bin/stress -c 5
<damo22>hogs 5 cores
<damo22>what do i need to test?
<damo22>i tried compiling gnumach with -j6 but it only serialised the cc1 calls
<damo22>.NOTPARALLEL: check
<damo22>^ this seems to serialise all of gnumach compilation
<gfleury>damo22: congrats on your usb port
<gfleury>You're doing a great job
<damo22>gfleury: thank you, its all standing on the shoulders of giants as the saying goes
<damo22>we need to figure out why SMP wont fully boot, there are likely a few race conditions left
<damo22>when i revert gnumach aadb4339, it doesnt boot because of the races
<gfleury>Indeed
<damo22>but in master, it boots with the APs in the slave processor set
<damo22>and the extra cores are usable in the slave pset
<damo22>its surprising that the design allows for any process, including gnumach (the kernel) to be scheduled on any processor
<damo22>youpi: how do i make the startup of init execute just /bin/bash so i can step through rc manually?
<damo22>it seems service networking hangs on smp
<damo22>DHCPDISCOVER on /dev/eth0 to 255.255.255.255 port 67 interval 4
<damo22>i think the problem is in netdde
<damo22>#0 0x081af935 in _hurd_intr_rpc_mach_msg ()
<damo22>#1 0x08277580 in gsync_wait_intr ()
<damo22>#2 0x081a57b2 in __sem_timedwait_internal ()
<damo22>#3 0x081a58ce in sem_wait ()
<damo22>#4 0x0818939e in _sem_timedwait_internal (sem=0x4900b60, timeout=0) at ../../libddekit/thread.c:248
<damo22>#5 ddekit_sem_down (sem=0x4900b60) at ../../libddekit/thread.c:268
<damo22>its deadlocked on a timed wait
<damo22> https://paste.debian.net/hidden/de9eea4c/
<damo22>5 sem_waits!!
<damo22>something is borked
<damo22>maybe netdde needs to be a multithreaded machdev?
<azert>damo22: DHCPDISCOVER hangs on read only filesystems without any warnings. Such as when the system mounts file systems read only without error. I’d first try to just disable networking and see if everything else boots in multicpu
<azert>Best strategy, in fact, is to boot on the mono processor set, and then test things server by server by booting them individually on the APs
<damo22>azert: i have a booting system with no networking
<damo22>i had to remove a few things from sysv-rc-conf
<azert>Cool!
<damo22>i think something is fishy with ddekit_timer_thread()
<damo22>if theres no timer_list, __timer_sleep(-1) is called and it might not be good
<damo22>__timer_sleep(DDEKIT_TIMEOUT_NEVER)
<damo22>i dont think it can process any more timers after that
<damo22>also ddekit_condvar_wait_timed() is implemented as an infinite wait ddekit_condvar_wait (cvp, mp);
<damo22>maybe we can make it never wait forever, so the timer thread definitely wakes up occasionally
<damo22>thats not it
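A minimal sketch of the bounded-wait idea raised above, assuming the timeout arrives as a relative number of milliseconds (the actual units and signature of ddekit_condvar_wait_timed() are not shown in this log). POSIX pthread_cond_timedwait() takes an absolute deadline, so a timed wait first has to turn "now plus timeout" into a normalized struct timespec. The helper name deadline_after_ms is hypothetical and only illustrates that step.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <time.h>

#define NSEC_PER_SEC 1000000000L

/* Hypothetical helper, not part of ddekit: returns now + timeout_ms as an
 * absolute time with tv_nsec kept in the range [0, NSEC_PER_SEC). */
struct timespec deadline_after_ms(struct timespec now, long timeout_ms)
{
    assert(timeout_ms >= 0);

    struct timespec deadline = now;
    deadline.tv_sec += timeout_ms / 1000;
    deadline.tv_nsec += (timeout_ms % 1000) * 1000000L;

    /* At most one carry is needed: both addends are below one second. */
    if (deadline.tv_nsec >= NSEC_PER_SEC) {
        deadline.tv_sec += 1;
        deadline.tv_nsec -= NSEC_PER_SEC;
    }
    return deadline;
}
```

A caller would typically read the current time with clock_gettime() and pass the result to pthread_cond_timedwait(), treating ETIMEDOUT as a normal wake-up rather than an error.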
<damo22>youpi: should userspace drivers, eg netdde, be changing the processor interrupt flag?
<damo22>libdde_linux26 calls local_irq_save and local_irq_restore
<damo22>wont that cause havoc on smp?
<damo22>libdde-linux26/lib/src/arch/l4/softirq.c
<youpi>damo22: no, the gnumach build really is parallel, we get several cc1 process in parallel
<youpi>reverting aadb4339 is not a priority: the issue we face is compilation time, which is mostly bounded by cc1 / cc1plus duration
<damo22>if i remove ".NOTPARALLEL: check" from the tests/Makefile.am, then i get a parallel gnumach build
<youpi>«  its surprising that the design allows for any process, including gnumach (the kernel) to be scheduled on any processor »
<youpi>Why surprising? That's how operating systems work
<youpi>« how do i make the startup of init execute just /bin/bash so i can step through rc manually? »
<youpi>I don't remember if `init=/bin/bash` would work
<youpi>« maybe netdde needs to be a multithreaded machdev? »
<youpi>that would only introduce yet more parallelism, and thus yet more problems. netdde is fine on UP with the current machdev. The problem is probably rather missing synchronization, which happens to work on UP just by luck
<youpi>the whole dde framework probably wasn't fixed for actual parallelism, I'd say better not spend time on this since we'd rather move to rump which is maintained
<youpi>« should userspace drivers, eg netdde, be changing the processor interrupt flag? »
<youpi>they are *not* supposed to, and rather use the libirqhelp helpers
<youpi>« then i get a parallel gnumach build »
<youpi>how do you observe this?
<damo22>i watch the "top" process list and see only one cc1 when i dont remove the line
<damo22>(with smp enabled)
<damo22>i was able to fully build gnumach with smp and -j6
<youpi>I also mean: which command do you run?
<damo22>$ sudo ~/bin/smp /bin/bash
<damo22># su - demo
<damo22>$ cd /path/to/gnumach/build
<damo22>$ make -j6
<youpi>well, I don't know. It really does not make sense, since .NOTPARALLEL: check only has effect on the check rule, not the all rule (and we definitely need it for the check rule). Maybe at least check in ps whether there's really just one process that is started (and not just one process actually using cpu time). On linux, make -j6 really does work fine without removing such line
<damo22>okay
<azert>youpi: what would you think about making smp available as a kernel for i386 on Debian Hurd and ./smp part of the distribution as a temporary hack?
<youpi>the kernel is already available
<youpi>adding smp is a matter of somebody actually submitting a patch that does it
<youpi>("adding smp", I mean adding ./smp)
<azert>Yep understood
<damo22>i can provide the patch
<damo22>i can make it part of sutils
<azert>please do
<solid_black>does anyone have an account on gcc bugzilla / a more direct contact with them? we should really make sure the miscompilation janneke found is heard about & fixed, and the gcc-bugs mail he sent apparently got ignored
<solid_black>s/gcc-bugs/bug-gcc/
<janneke>solid_black: well, i am not subscribed, and the moderator let it go through, so that's something
<solid_black>now that I look, I cannot seem to find mentions or archives of bug-gcc online
<solid_black>but gcc-bugs exists
<solid_black>janneke: also, can I bribe you somehow to take a look at aarch64-gnu? ;)
<youpi>one can easily register an account on gcc's bugzilla
<youpi>ideally the bug would be reproduced with a non-cross-compiler too, to make it way simpler to reproduce
<youpi>start with minimizing the c code down, starting from the -E output and dropping almost everything
<solid_black>yes, that was my idea too
<solid_black>and since we're only interested in the compilation proper stage, x86_64-gnu vs x86_64-linux-gnu shouldn't matter
<solid_black>posted the reduced version
<janneke>solid_black: very nice, thank you
<youpi>it looks to me like gcc thinks the whole code is not supposed to happen
<youpi>and just emits the volatile asm pieces because volatile forces it to
<youpi>and the presence of nop tells it that yes the code should happen
<janneke>well, if you remove the __builtin_unreachable (); then it works
<youpi>so perhaps something like it's not aware that the retq asm bit really is supposed not to return
<youpi>yes, that's what I mean
<youpi>unreachable tells the compiler that it's not supposed to be reached
<youpi>and the retq piece is not said to be non-returning
<youpi>so gcc can infer that everything above is not supposed to happen
<solid_black>it does emit other things on the code path, i.e. the call to __mach_port_mod_refs (reduced away here)
<solid_black>it's only the assignment w/ __seg_fs that gets eliminated
<youpi>yes, it's not supposed to know what function calls do, so it wouldn't drop them
<solid_black>and how would an asm (nop) tell it that the code path is reachable?
<youpi>that's exactly what I have just asked the list
<solid_black>this certainly isn't the only place where we do some sort of custom jump/return w/ inline asm, followed by __builtin_unreachable
<youpi>not really that sure of this
<youpi>we don't have many jumps from inline asm
<youpi>and not that many unreachable()
<solid_black>RETURN_TO / _hurd_stack_setup is the other prominent example
<youpi>again, a function call is not supposed to know what the function does
<youpi>a function can very well be a noreturn without being declared so
<solid_black>RETURN_TO is a macro that expands to asm volatile
<youpi>yes, but doinit doesn't have more than the init() call and RETURN_TO
<solid_black>oh, so you think it emits the function call thinking that it could be noreturn, then the asm volatile just because it's volatile, while already thinking the code path is unreachable?
<youpi>and gcc wouldn't usually drop a function call unless it really knows it's const
<youpi>yes
<youpi>not the code path, just the end
<solid_black>the code path after the function call
<youpi>yes
<solid_black>from quick godbolt'ing, gcc does remove asm volatile after either a known noreturn call, or after __builtin_unreachable
<youpi>when it knows for sure it's not reachable, sure
<youpi>but if it doesn't know for sure, it can't just drop it
<solid_black>we could also disprove your theory by doing something that is not a function call and not an asm volatile statement in between, like setting a global
<youpi>don't try to make theories about compilers :)
<youpi>they are full of heuristics
<youpi>so it's very hard to actually model
<solid_black>hmmm, it doesn't emit storing to the global at all, with ADD_NOT or without it
<solid_black> https://godbolt.org/z/3T6MKvj85
<solid_black>so it's as if it does think that the store through __seg_fs makes the following code somewhat unreachable
<solid_black>-fno-delete-null-pointer-checks does not fix it, so it's not that it considers the pointer to be null
<youpi>it may as well be simpler to just make the code call an asm snippet outside the function and that can just drop the return address and do the jump
<youpi>like we have in sysdeps/mach/hurd/x86/trampoline.c
<youpi>(and we can even mark it noreturn)
<solid_black>or we could just drop the __builtin_unreachable, this is only gonna cost us an extra 'ret' or so
<solid_black>but the gcc codegen bug is still worth being investigated & fixed
<youpi>it is really possible that gcc is not actually buggy here
<solid_black>by the way, in the very same file, __sigreturn also returns in a similar way, by doing a bunch of inline asm, then __builtin_unreachable
<youpi>and we are just telling it wrong things
<solid_black>and before that, some reads/writes to/from memory, not a call
<youpi>again, don't make theories, heuristics
<solid_black>I don't think we're doing anything wrong? placing __builtin_unreachable after inline asm that doesn't return is their recommendation in the docs
<youpi>do the docs tell about inline asm explicitly?
<solid_black>in the docs on inline asm, yes
<youpi>ok, then it's odd indeed
<solid_black> https://gcc.gnu.org/onlinedocs/gcc/Extended-Asm.html#Goto-Labels here it tells you to use __builtin_unreachable
<solid_black>though the section is about asm goto
<Pellescours>I know that hurd:1afe753f75f1e64254c8e29c4c2030e25fa95392 should have fixed the problem of doing a ls of /dev/ when using rump but the real problem is still here. If gnumach is configured with debugger, then doing a `ls /dev` will enter gnumach in debugger
<Pellescours>with a message from rump "cd0: dos partition I/O error", then a message from kernel explaining the problem "kernel: Divide error (0), code=0" "Stopped at 0x81a053d: divl %ebx,%eax"
<solid_black>...I think I know what's going on
<solid_black>gcc just decides to reorder the store after the asm
<solid_black>if I do : "memory", it works without the nop
<solid_black>janneke: please try that ^
<solid_black>and unlike for extern functions, for inline asm it is on us to specify what it might read/write, so yeah, our fault
<janneke>solid_black: like this:
<janneke> "retq $128"
<janneke> : memory
<janneke> : "rm" (usp));
<janneke>eh "memory" probably
<solid_black>rather
<solid_black>"retq $128" :
<solid_black>: "rm" (usp)
<solid_black>: "memory");
<janneke>OK
<solid_black>although IDK, can you specify "memory" as an input rather than clobber?
<solid_black>looks like you can't, it's only valid in clobbers
<janneke>dunno, i just pattern-matched the `: "memory"' bit
<janneke>right
<janneke>solid_black: yes, that works for me too
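The fix discussed above can be sketched in isolation, assuming GCC or a compatible compiler. Without a "memory" clobber, the compiler may assume an asm statement neither reads nor writes memory, so a preceding store can be moved past it or dropped. With the clobber, pending stores have to be emitted before the asm. The names saved_state and store_then_asm are hypothetical, and the empty asm template only stands in for the real `retq $128` sequence, which cannot be run in a standalone example.

```c
#include <assert.h>

/* Hypothetical stand-in for the slot that the real code writes through
 * __seg_fs just before the inline asm. */
static unsigned long saved_state;

unsigned long store_then_asm(unsigned long usp)
{
    saved_state = usp + 1;

    /* Same operand layout as in the log: no outputs, one "rm" input,
     * and "memory" in the clobber list (the only place it is valid). */
    __asm__ __volatile__ (""
                          : /* no outputs */
                          : "rm" (usp)
                          : "memory");

    /* The clobber also forces this value to be reloaded after the asm. */
    return saved_state;
}

/* Usage example: the store is visible after the asm statement. */
void demo_store_then_asm(void)
{
    assert(store_then_asm(41) == 42);
}
```

This sketch cannot reproduce the original miscompilation, since the function returns normally and has no __builtin_unreachable(). It only shows where the clobber goes and what it promises to the compiler.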