IRC channel logs

<damo22>{cpu5} ../kern/lock.c:293: lock_write: Assertion `!in_interrupt[cpu_number()]' failed.Debugger invoked: assertion failure

<damo22>Kernel Breakpoint trap, eip 0xc1000a24, code 0, cr2 f6793da4

<damo22>Stopped at Debugger+0x13: int $3

<damo22>Debugger(c108cdcb,c108ce22,f6793e50,c1000a4c,c10cc0e0)+0x13

<damo22>Assert(c108ce22,c108f2ea,125,c1085b20,f5f408f8)+0x7c

<damo22>lock_write(f5f408f8,f8892000,f649aab0,f6716ab0)+0x17e

<damo22>vm_map_lock(f5f408f8,40,f6793ec0,1000)+0x10

<damo22>vm_map_find_entry_anywhere(0,f6793ed4,f8890000,0)+0x1bd

<damo22>....

<damo22>its failing on a sleep lock, but is it possible it was during an interrupt? probably not right

<youpi>what is the rest of the stack?

<damo22>youpi: a new one:

<damo22>{cpu4} ../kern/lock.c:392: lock_read: Assertion `!in_interrupt[cpu_number()]' fa

<damo22>iled.Debugger invoked: assertion failure

<damo22>Kernel Breakpoint trap, eip 0xc1000a24, code 0, cr2 f6144e3c

<damo22>Stopped at Debugger+0x13: int $3

<damo22>Debugger(c108cdcb,c108ce22,5321000,c1000a4c,f678d060)+0x13

<damo22>Assert(c108ce22,c108f2ea,188,c1085b08,f5bb0268)+0x7c

<damo22>lock_read(f5bb0268,0,36761000,5300000,5321000)+0x190

<damo22>vm_map_lookup(f6144f90,5300000,3,f6144f48,f6144f50,f6144f54,f6144f58,f6144f4c,c1

<damo22>0d0540,c10038ab,c10b21c0,fa09a000)+0x21

<damo22>vm_fault(f5bb0268,5300000,3,0,0)+0x20b

<damo22>user_trap(f6186578)+0x133

<damo22>>>>>> Page fault (14) for WU 5300010 at 0x11528c9 <<<<<

<damo22>>>>>> user space <<<<<

<damo22>in_interrupt array:

<damo22>c10c8000: 0x00000000 0x00000001 0x00000001 0x00000000

<damo22>c10c8010: 0x00000000 0x00000001 0x00000001 0x00000001

<damo22>IRR 48 251 on cpu4

<damo22>clock + ipi

<damo22>could it be that the sleep lock sleeps the thread, and wakes up on a different cpu, thus the cpu_number() is not the original cpu?

<damo22>hmm can the thread wake up during an interrupt?

<damo22>if we get an interrupt from int stack, the in_interrupt[] array is not updated

<damo22>would that matter?

<damo22>youpi: i have 3 threads sitting at switch_context()+0x171 and they are not being scheduled to run on APs, but the APs are getting clock interrupts

<damo22>they are all waiting to run

<damo22>> kdb says we are HERE but it really means it jumped into a userspace thread

<damo22>maybe that means userspace is blocking

<damo22>ugh it might be glibc?

<damo22>i cant attach gdb to my program

<damo22>heh this may be interesting:

<damo22>db{0}> trace/tu $task68.0

<damo22>Continuation thread_bootstrap_return

<damo22>>>>>> user space <<<<<

<damo22>0x5881(1034,8,1038e98,0,103a054)

<damo22>0x1c27e(1038e40,5d,1038e50,27811,2)

<damo22>0x23703(1038e40,1038f4c,28c65,1ddaa,0)

<damo22>0x1ddbc(1039000,1f520,1bb25,1f212,0)

<damo22>0x1f265(1039000,0,0,0,6d6f682f)

<damo22>0x1e2cb()

<damo22>stuck

<damo22>i might compile a statically linked binary to test this

<damo22>its in gsync_wait, then __mach_msg_trap

<damo22>it called gsync_wait inside putchar

<damo22>hmm GOMP_parallel executes this instruction: 804b39f: 65 8b 15 a0 ff ff ff mov %gs:0xffffffa0,%edx

<damo22>i thought %gs was for kernel?

<damo22>oh no

<damo22>it calls the gomp_team_start

<damo22>then cleans up

<damo22>its stuck here:

<damo22>080bd470 <__mach_msg_trap>:

<damo22> 80bd470: b8 e7 ff ff ff mov $0xffffffe7,%eax

<damo22> 80bd475: 9a 00 00 00 00 07 00 lcall $0x7,$0x0

<damo22>it never returns after that

<damo22>then this happens:

<damo22>db{0}> trace/tu $task67.0

<damo22>switch_context(f6461ab8,0,f5bae180,c1040cfe,c10990ac)+0x171

<damo22>thread_invoke(f6461ab8,0,f5bae180,c1041937)+0x119

<damo22>thread_block(0,0,1,f5bb3144)+0x64

<damo22>gsync_wait(f6102018,81779bc,2,0,0)+0x237

<damo22>_Xgsync_wait(f5400010,fa0ae010,f6461ab8,f5fb4798,f5f2e8a0)+0x65

<damo22>ipc_kobject_server(f5400000,38,0,1000)+0x93

<damo22>mach_msg_trap(1001c7c,3,40,20,2c)+0x987

<damo22>>>>>> user space <<<<<

<damo22>its stuck on switch_context ?

<damo22>db{0}> show task $task71

<damo22> TASK THREADS

<damo22> 71 /home/demo/bin/test(830) (f6102018): 5 threads:

<damo22> 0 (f6461ab8) .W...F 0

<damo22> 1 (f5ec71f0) .W..N. 0xf5f2e8a0

<damo22> 2 (f5bd2018) .W..N. 0xf5f2e8a0

<damo22> 3 (f5f811e8) .W..N. 0xf5f2e8a0

<damo22> 4 (f61f7350) .W..N. 0xf5f2e8a0

<damo22>the threads are all waiting

<damo22>im getting interrupts on APs

<damo22>why are they not running?

<damo22>it must be a scheduler bug that the threads are not being scheduled on the slave_pset?

<damo22>maybe it only schedules threads that are new or inherited from a task, but doesnt know how to reschedule existing threads back on the slave_pset

<damo22>so when they sleep, they never wake because they never get rescheduled there

<solid_black>damo22: "i thought %gs was for kernel?" yes, and no; the kernel has its own %gs (that it uses for percpu data, and you implemented that, no?), but userland can use its own %gs, just like it has its own values for the other registers (%eax etc). on i386-gnu, glibc uses %gs for thread-local storage in fact

<solid_black>on x86_64, the "swapgs" instruction is exactly intended for swapping between "kernel gs" and "user gs"

<solid_black>"its stuck on switch_context ?" that just means it was blocking on something (in this case, gsync_wait working as expected), and never scheduled since then (probably never woken up too)

<solid_black>what *does* your slave pset do? does it run idle_thread? something else?

<youpi>damo22: ("a new one"): so lock_read is called from interrupt context (user_trap), so it needs special care

<youpi>that's part of what was said earlier on the list: when we lock a map, we must be very careful not to trigger any fault, otherwise the fault handler may try to lock again

<youpi>now, when in smp, another cpu might be holding the lock, and lock_read would then try to thread_sleep

<solid_black>youpi: how's user_fault an interrupt context? it doesn't interrupt other kernel code, it only interrupts the user code

<youpi>apparently there is a l->can_sleep, which we can set to false, to make sure that it doesn't try to, in the vm_map lock case

<youpi>and the cpu will then just spin, waiting for the other cpu to release the lock, which is fine

<youpi>"could it be that the sleep lock sleeps the thread" : gnumach not being preemptible, it's not supposed to switch threads in an interrupt handler

<youpi>"can the thread wake up during an interrupt?" : take care not to confuse "wake up", and "actually wake up"

<youpi>a thread can be woken up, i.e. telling the scheduler to put it on a runq, in an interupt context

<youpi>but that doesn't mean it's getting run

<youpi>it's only when the scheduler is told to change context (because another thread is getting blocked) that the other thread really gets to wake up

<youpi>"if we get an interrupt from int stack, the in_interrupt[] array is not updated. would that matter?" I don't know if in_interrupt is correctly handled in the nested case, that's very probably worth checking yes

<youpi>but again, in the trace you showed, we *are* in an interrupt (page fault trap)

<youpi>" i have 3 threads sitting at switch_context()" that just means they are not running

<youpi>that's not meaning they are ready to run or whatever

<youpi>check their state in the show all thread output, whether they are waiting or not

<youpi>i.e.: "they are all waiting to run" how do you know that?

<youpi>"maybe that means userspace is blocking" : no, userland cannot "block", it's meaningless, it's only the kernel that can "block", by switching to another thread without putting the thread on a runq to be executed later

<youpi>"i thought %gs was for kernel?": no, userland uses gs to implement TLS

<youpi>" its stuck on switch_context ?" it's just waiting for a gsync_wake

<youpi>" the threads are all waiting": yes, so the scheduler knows they should *not* be run, even if there are interrupts on APs

<youpi>" why are they not running?": because they are waiting for somebody to wake them up

<youpi>" it must be a scheduler bug ": NO.

<youpi>NO

<youpi>the W state really means nobody woke them up

<youpi>so the scheduler is really *NOT* supposed to run them

<youpi>solid_black: "how's user_fault an interrupt context? ": well, technically speaking, it *is* an interrupt

<solid_black>it is, but "interrupt context" has a specific meaning

<youpi>it's true that it doesn't interrupt the kernel code, so we could change the code to distinguish whether we interrupted the userland (in which case we don't care about taking non-_irq versions) and kernelland (in which case we do need to use _irq versions)

<youpi>damo22: really, really, really, really, REALLY, reading books about OSes would help you a lot understanding what I mean

<youpi>you seem to be all confused by what "waking up" actually means

<youpi>and that's completely described in such books

<youpi>I can't take the time to explain in details like books do

<youpi>really, read the minix book and the linux kernel book

<youpi>all I'm explaining here, I have just learnt from these

<solid_black>(I know what waking up means)

<youpi>solid_black: I know that you know :)

<solid_black>re interrupt context: can the kernel actually fault on user pages while holding any locks? or does it always release locks before accessing user memory? -- I'd need to check that

<youpi>kernel is *not* supposed to keep a lock held before returning to userland

<youpi>and it's indeed not supposed to hold any lock when touching user memory

<solid_black>when returning to userland, certainly

<youpi>that last part could be detected by introducing user_get() macros

<solid_black>and if it doesn't hold any locks when touching user memory, that's great too

<youpi>but we can also check in user_trap whether we were in kernel context or not, and shout if we had a lock

<solid_black>that means vm_fault() doesn't risk deadlocking, since it *is* called form an interrupt context

<youpi>yes, that's what I meant above

<solid_black>wdym? user_trap exactly means we trapped from user mode?

<youpi>when there's no kernel stack frame in between, yes

<youpi>see there is nothing between page fault and user space, in the trace

<solid_black>which trace is that?

<youpi>it's easy to detect by looking at %sp in the interrupted context content

<youpi>the trace damo pasted at 03:27:11 UTC

<youpi>which shouts that we are taking a lock while in interrupt

<youpi>while it's just a userland page fault

<solid_black>again, in my understanding, userland faults / traps are not "interrupt context", even if they are technically implemented by the same hardware mechanism

<solid_black>it should not set in_interrupt = TRUE

<youpi>that's what I meant above

<youpi>we should probably fix in_interrupt for that case

<solid_black>but why does it get set? AFAICS in i386/i386/locore.S, in_interrupt gets set (incl'ed) in all_intrs(), which is calls into interrupt(), and is a different codepath from alltraps(), which is what calls into user_trap()

<solid_black>it does not look like just handling a user fault would set in_interrupt

<youpi>ah, then I don't know

<youpi>damo22: your commit f37055f86bb19b54efd1a1b4784d35c3a4e1f947 dropped a CPU_NUMBER call

<youpi>but this *need* to be kept

<youpi>ah, no, sorry, I misread the line numbers

<solid_black>speaking of cpu_number calls, CPU_NUMBER(%eax) in x86_64/cswitch.S:Switch_context is obviously broken, no?

<solid_black>guess nobody even tried to run x86_64 with smp

<youpi>damo22: while at it, I cherry-picked it into master

<youpi>solid_black: nobody did, sure

<youpi>solid_black: how is it broken? (I'm on my way out)

<solid_black>%rax holds the old_thread pointer at that point (which gets return from Switch_context to the new thread); but it gets overwritten with the CPU number

<solid_black>s/return/returned/

<youpi>indeed, we need to use another register, I don't remember why rdx wasn't used

<damo22>hi

<damo22>i missed all the fun

<etno>Hi damo22 o/

<solid_black>hi damo22 etno

<damo22>yes i dont know how wakeup works

<etno>solid_black: o/

<damo22>(07:32:28 PM) youpi: apparently there is a l->can_sleep, which we can set to false, to make sure that it doesn't try to, in the vm_map lock case

<solid_black>but again, that really only makes sense to do when you're actually in an interrupt context

<solid_black>not handling a user trap

<damo22>so do we need to make vm map locks actually not sleep locks, but spin locks?

<solid_black>I don't think that your issue, and I'm not sure if it would break something

<solid_black>for the in_interrupt panic, you should figure out why in_interrupt gets set, because you're not actually in an interrupt contet, you're handling a user trap at that point

<solid_black>for the deadlock, please see what is expected to wake those threads (does something do gsync_wake?), and why it hasn't

<damo22>i wrote a simple #pragma omp parallel for

<damo22>and iterated 10000 times over 7 threads

<solid_black>let me actually read gsync.c and see how it's supposed to work

<damo22>maybe gcc is broken?

<damo22>libgomp?

<solid_black>I have much better understanding of Mach kernel internals, both VM and the scheduler, than what I had lst time I looked at it

<solid_black>no, gcc/omp is fine

<damo22>i can share my program

<damo22>its very simple

<solid_black>can you actually share the complete steps to reproduce?

<damo22>yes

<solid_black>there you go, gsync_wait touches userland memory while holding a read lock on the map

<damo22>!!

<solid_black>hmm, but vm_fault really tries hard not to take a lock on the vm_map

<solid_black>a write lock I mean

<solid_black>vm_map_lookup() is still going to take a write lock if you fault on writing, but that's not our case

<damo22> https://zamaudio.com/mbox2/test-ipc.tgz

<solid_black>but also in the "Create an object if necessary." branch!

<solid_black>see vm_map.c:4757

<solid_black>so if the userland vm_allocates() a new page, does not touch it just yet, and calls gsync_wait with an address on that page, that would deadlock, it seems

<solid_black>no, it would not, because gsync_wait calls probe_address() first, which already does vm_map_lookup(), and so creates the VM object, ok

<damo22>solid_black: ^ that link has instructions for a reproducible deadlock

<solid_black>here's another thing I don't fully understand: kernel_map is the vm_map of the kernel task, right? why is it enough to enter mappings into the kernel_map inside *any* thread running in the kernel to access it?

<damo22>probably because memory is accessible on all cpus in the same addresses and the page tables are synced?

<solid_black>(this has nothing to do with SMP, it's more of a question about the Mach VM subsystem)

<solid_black>damo22: "smp kernel with apic from master" -- so not your branch, just upstream master?

<damo22>that will do, yes, or my fix-smp branch if you want to be exactly the same

<solid_black>does it just boot Debian, or how do I run this?

<damo22>yes it boots debian

<solid_black>or guess init=/bin/bash would work?

<solid_black>gsync_wait() blocks the thread with no continuation :(

<solid_black>which is not the end of the world, but using continuations is better

<solid_black>well, it should not be blocking the current thread anyway, so I'll have to rewrite this all

<damo22>where is the code for gsync_wait is that in glibc?

<solid_black>glibc:mach/lowlevellock.h is what calls gsync_wait()

<solid_black>also I don't think gsync_wait() implementation even does the equality check atomically?

<solid_black>i.e. it does if (!equal) { kmutex_unlock(...); return KERN_INVALID_ARGUMENT; } first, and only puts itself into the waiters hash map later

<solid_black>and the whole point of this value check is to do that the other way around

<solid_black>argh

<solid_black>this is all wrong

<solid_black>so IOW if you stress gsync_wait-based synchronization primitives on an SMP system, it *will* deadlock

<solid_black>that might just be what you're seeing actually

<damo22>:D

<damo22>that could probably explain random deadlocks during boot too

<damo22>unless you bind to 0

<solid_black>that may very well explain those, yes

<solid_black>it sounds like I really have to rewrite gsync, doesn't it? :D

<damo22>well you have a test program now you can use to stress it

<damo22>how difficult is it to fix?

<damo22>youpi: solid_black thinks gsync_wait is broken in glibc

<solid_black>not in glibc, in gnumach

<damo22>oh

<youpi>I very much doubt gsync_wait is broken

<youpi>we use it a *lot* in pthread etc.

<youpi>and your rpc testcase doesn't involved gsync_wait,

<youpi>only rpcs with the kernel, which requires synchronizing etc.

<youpi>better focus on that for now, it involves less code

<damo22>my code uses gsync_wait because it uses libgomp

<solid_black>youpi: it's only broken on SMP

<solid_black>it checks for the pointed-to value being equal to the passed-in value first, goes to sleep later

<damo22>#pragma omp parallel for

<solid_black>that will work as long as nothing can change the value in between, which is the case on single CPU

<solid_black>but is broken on SMP

<damo22>i think it syncronises the printfs

<damo22>so they dont trample each other

<solid_black>indeed, there's a lot of syncronization inside glibc even if you don't call e.g. pthread_mutex_lock explicitly

<solid_black>I'd expect OpenMP runtime to also use synchronization pervasively

<solid_black>and we in fact already *know*, from your Mach KDB backtraces, that your threads are stuck in gsync_wait

<solid_black>it doesn't necesserily mean they are stuck because of this issue I'm talking about, but it seems plausible

<damo22>putchar has gsync_wait in it

<solid_black>damo22: is "../configure --host=i686-gnu --disable-linux-groups --enable-apic --enable-ncpus=8" an appropriate way to reproduce this?

<solid_black>if not, what do I need to pass?

<damo22>you could include --enable-kdb

<damo22>otherwise yes

<solid_black>ok

<solid_black>it panics at startup, assertion(hpet_addr != 0) failed at apic.c:398 in hpet_init

<damo22>hmm

<damo22>what machine is it running on

<damo22>if this is qemu, you need -M q35

<solid_black>qemu-system-x86_64, with a long long cmdline generated by libvirt

<solid_black>it has -machine pc-i440fx-7.0,usb=off,dump-guest-core=off,memory-backend=pc.ram,hpet=off,acpi=on

<solid_black>that hpet=off might be relevant

<damo22>hpet=on

<solid_black>what is that, and why is it required?

<damo22>its a timer

<solid_black>high-precision something timer?

<damo22>every ACPI platform since 2005 has one

<damo22>i made it mandatory for smp

<solid_black>why does libvirt default to hpet=off? can other kernels boot without it?

<damo22>they can, they probably use TSC

<damo22>but its a nightmare to calibrate

<damo22>since it relies on the cpu frequency

<damo22>which is unknown until runtime

<solid_black>anyways, I better run qemu myself anyway

<damo22>i440fx is oldddd

<damo22>you can use q35

<youpi>solid_black: "but again, that really only makes sense to do when you're actually in an interrupt context"

<youpi>right, we can sleep in the user fault case

<solid_black>ok, does not panic w/ q35, but not it hangs in rumpdisk

<damo22>which commit are you on

<solid_black>ff3f259ceab38064953e04384d2b529f7b9aca34 of gnumach

<youpi>« there you go, gsync_wait touches userland memory while holding a read lock on the map», ah, that however is bogus indeed

<solid_black>youpi: no, that's probably ok

<youpi>ok

<solid_black>since it's only a read lock

<youpi>ok

<youpi>but do we have code that checks this?

<youpi>I mean

<youpi>if there is a possibility for code to be written wrongly, we'd rather have code that checks against it

<damo22>solid_black: are you using -smp 4

<youpi>so that we are not just lucky

<solid_black>actually no, I haven't even passed -smp to qemu

<solid_black>let me try with that

<youpi>about the kernel map, yes the kernel part of the address space is shared by all processes, including the page table pieces

<solid_black>do I pass -smp 8 on gnumach smdline too, or what?

<damo22>solid_black: gnumach cmdline doesnt know anything about it

<solid_black>youpi: but how does that actually work? I only see kernel_map being used for initializing kernel_task->map in task_create_kernel

<youpi>I'm afraid I don't have time to explain

<youpi>and not sure I even know what you need to know

<solid_black>damo22: so I pass nothing additional to gnumach, just -smp 8 to qemu, and gnumach auto-detects the additional cpus?

<damo22>solid_black: yes

<solid_black>ok

<damo22>make sure you use the right kernel image

<youpi>solid_black: make sure to read "futexes are tricky" to understand how gsync works

<solid_black>youpi: fwiw, I know how I did this on aarch64, the ttbr1 part of the mapping is shared between all tasks, and inserting any mapping into *any* pmap (not just kernel map) at an address > min_kernel_address (whatever that constant is called) will be visible to all tasks

<youpi>but possibly there is a missing part for smp in gsync indeed

<solid_black>is something similar done on i386 too?

<solid_black>youpi: yes, I've read "futexes are tricky", of course

<youpi>as I said above, the page table part is shared

<youpi>ok

<solid_black>when you have time, take a look at https://github.com/bugaevc/lets-write-sync-primitives :)

<solid_black>youpi: so on i386, is it also true that inserting a mapping ata high address into *any* map (and not just kernel_map) will be visible to all tasks?

<solid_black>iow, is kernel_map any special in that regard?

<youpi>yes

<youpi>since it's shared

<youpi>any write by any thread will be seen by all others, since it's the same page table pieces

<damo22>solid_black: it should boot with rumpdisk using just cpu0 (therefore no deadlocks by default) and you can get access to the other 7 cpus using smp.c

<solid_black>I understand that it's shared on the page table / pmap level, I don't understand how exactly that works with the Mach vm_map abstractions

<solid_black>but whatever, I'll figure it out

<youpi>I don't know

<youpi>(I mean, not more than: since there's just one kernel map, any thread adding stuff to it will be seen by others of course, and the same for the page table as said above, so from this simple point of view all is fine)

<youpi>solid_black: concerning gsync correctness

<youpi>remember that we keep a mutex on hbp->lock

<youpi>so any other thread that is running gsync_wait will be synchronized with that

<solid_black>damo22: I get this now https://imgur.com/a/EulJGJe

<solid_black>this is booting with -M q35 -smp 8

<damo22>that is unusual, let me compile master again and verify its not broken

<youpi>solid_black: the important part is that we call thread_will_wait before releasing the lock

<youpi>so we are sure not to miss a wakeup

<solid_black>youpi: there's one kernel_map object, but that's only used as the task->map of the kernel_task (kernel threads) -- that's not any threads running in the kernel, just the kernel threads

<youpi>that's fine

<solid_black>other threads running in the kernel all use their tasks' maps

<youpi>these threads only need to access the kernel map

<youpi>and if a kernel function needs some virtual mapping in the kernel, it'll access the kernel map, won't it ?

<youpi>we really need to write that research paper about the "I will sleep" principle, that is used throughout all OSes I have dived into

<youpi>though IIRC it's described in the linux kernel book from cesati & bovet

<youpi>(if you still have not read it, really, do)

<youpi>gotta go, later

<solid_black>youpi: how do a mutex on hbp->lock or calling thread_will_wait or whatever prevent another userland thread (running on a different CPU core) from just changing the value on the page, without ever calling into the kernel?

<youpi>that's completely fine

<youpi>because kernel will check again

<solid_black>I understand how the "I will sleep" principle works, yes

<youpi>ok

<solid_black>but it's not being used properly here, that's the issue

<solid_black>you say "I will sleep first", *then* you check

<solid_black>gsync_wait does it the other way

<youpi>really goota gpo

<solid_black>sure

<damo22>qemu-system-i386 -M q35,accel=kvm -smp 8 -m 4096 -net user,hostfwd=tcp::8888-:22 -net nic -display curses -hda /dev/sdd

<damo22>that boots master

<solid_black>ah, I might just need to give it more ram

<solid_black>ACTION runs wget https://downloadmoreram.com/

<damo22>heh

<solid_black>yep, doesn't hang anymore, cool

<damo22>\o/

<solid_black>(I mean, my boot doesn't, not your test)

<damo22>yeah

<solid_black>let me try your test

<damo22>do the test in a separate shell

<damo22>because it will hang the shell

<damo22>but you can still use the console

<damo22>nproc should be 8 even though mostly running on 0

<solid_black>nproc is 8 indeed

<solid_black>your ./smp segfaults

<damo22>ah you need to pass a param

<damo22>the name of a new program

<damo22>to execute with smp

<damo22>eg, ./smp /bin/bash

<solid_black>and it doesn't respect $PATH, you should use execvp

<solid_black>but yeah, a full path works

<damo22>yeah its a quick dirty program

<solid_black>now your ./test segfaults

<damo22>test needs a param, how many cpus to use

<damo22>give it 7

<youpi>solid_black: concerning the "userland can change the value unexpectedly" part, yes, sure, that's why the userland part of the wait side has to register somehow and publish it (as in a memory barrier) *before* calling gsync_wait

<youpi>to be sure that the thread that changes the value knows to call gsync_wake

<youpi>concerning the openmp test, really, stop looking for culprits

<youpi>that test involves libgomp, glibc, the gsync rpcs, rpcs

<youpi>so you'll find plenty of culprits to frown upon (even though that's well-tested, such as libgomp)

<youpi>stay with the very simple case that we had fro mthe very beginning

<youpi>getting the task port names

<youpi>that'll involve only the rpcs part

<youpi>and we already know that this has troubles

<youpi>so fix that *first*

<solid_black>youpi: it sounds like you're misunderstanding how futexes/gsync are supposed to work :|

<youpi>and then actually very probably all the other cases will just work

<youpi>solid_black: I don't think so

<youpi>but again, with that, gotta go

<damo22>so i could just fork processes

<damo22>instead of threading with openmp

<youpi>solid_black: see e.g. sysdeps/htl/sem-timedwait.c

<youpi>it puts 1 << SEM_NWAITERS_SHIFT before doing the loop

<youpi>see __lll_lock

<youpi>it puts a 2 value before calling lll_wait

<youpi>to record that it'll be waiting

<youpi>really, that' s how things work, I have been spending parts of my phd and following research on this kind of thing

<youpi>really gone

<damo22>thx

<damo22>so the test can be rewritten as a bash script

<damo22>just call ./test 1 & a few times

<damo22>inside a smp bash shell

<damo22>$ sudo ./smp /bin/bash

<damo22>$ for i in 1 2 3 4 5 6 7 ; do ./test 1 & ; done

<damo22>hmm

<damo22>that didnt work

<solid_black>I think what you're saying is -- while the value of the futex can be changed from userland concurrently indeed, we can't miss a gsync_wake(), because you'd change the value first and call gsync_wake() second, and since gsync_wait() holds the mutex between checking the value and calling thread_will_wait(), there's no way gsync_wake() can slip in between

<solid_black>and I guess that's true

<damo22>solid_black: if you want to move forward with this, feel free to remove the openmp part and compile as simple test, then execute it 4 times or so in an smp shell and you will still see a hang

<solid_black>I just got it to hang with OpenMP

<damo22>yeah but youpi is right, there are too many moving parts to this, if you remove openmp from the picture it still hangs with just IPC

<damo22>by running say 4 or 5 of the tests at once

<damo22>without openmp

<damo22>i gtg to sleep, i have work tomorrow

<damo22>thanks for looking into this with me

<solid_black>sure, goodnight :)

<youpi>solid_black: yes, the thing is that the kernel can't magically catch userland writing to the value (it could through setting the page r/o but that'd be terribly expensive, defeating the purpose of futexes)

<youpi>thus why it's userland that is responsible for the concurrency part

<solid_black>yes, I'm not suggesting that

<youpi>the kernel part only provides the atomicity between last-value-check and actually sleeping

<youpi>not only remove openmp, but also glibc and gsync, only keep the ipc part

<solid_black>I'm saying that the contract is that if the value is different from the RPC argument, you don't go to sleep

<youpi>the most effective way to fix bugs is to corner them by using simple test cases, not more complex ones :)

<solid_black>and the current implementation does not guarantee that

<solid_black>but indeed it should not matter much, because a gsync_wake() can not slip in

<youpi>it does, as much as what it can provide and the futex principle requires

<youpi>there's a window between the check and the sleep, sure

<youpi>but that's fine because we hold the mutex

<youpi>and the waiter has already published that it's about to sleep, atomically against another thread changing the value to make it free

<youpi>so the other thread can't miss the waiter having set the wait bit

<solid_black>but yeah, changing the value, by itself, will do nothing to wake the thread even if we put it on the queue (i.e. assert_wait) before checking

<solid_black>you _need_ to gsync_wake

<youpi>yes but you can't miss that

<solid_black>yes

<solid_black>ok, so it's something else then :|

<youpi>precisely because the setting of the wait bit on the wait side is properly synchronized with freeing the lock on the wake side

<youpi>of course it's very probable that it's something else

<youpi>since we *already* have a very simple test case that only involves rpcs

<solid_black>do we?

<youpi>so that case being broken means gsync can indeed fail

<youpi>well, yes

<solid_black>I was just following damo22's reproducer

<youpi>the port names ipc loop

<youpi>that damo22 reported to be hanging at times

<youpi>Subject: Test program for running task on slave_pset

<youpi>Date: Mon, 12 Feb 2024 05:53:12 +0000

<solid_black>ACTION looks

<youpi>that alone is a trivial reproducer that can explain everything

<youpi>so no need to look at any other reproducer

<youpi>ah, sorry, not that one

<youpi>I don't remember if he posted it

<solid_black>the one he posted today does include getting port names, but it does that concurrently under OpenMP

<youpi>but I remember him reporting that a mere mach_port_names() could hang

<youpi>ah

<youpi>but openmp is just used to start the threads, isn't it?

<youpi>so openmp is not involved in the rest of the execution

<youpi>were was it posteD?

<solid_black> https://zamaudio.com/mbox2/test-ipc.tgz

<youpi>ACTION ran tar xf https://...

<youpi>hopefully some day I'll have a desktop where that works :)

<youpi>ok so the parallel for loop does involve openmp from times to times

<youpi>but just replace that:

<youpi> #pragma omp parallel for

<youpi> for (i = 0; i < MAX_ITERATIONS; i++)

<youpi>with

<youpi>#pragma omp parallel

<youpi>while (1)

<youpi>so that omp will not be involved at all during the while(1) loop

<youpi>(and anyway in my memory of his traces, the threads were stuck inside mach_port_names(), so it's really the culprit, and not anything else)

<youpi>(but better do the replacement to really rule out openmp/glibc/gsync out of the picture

<solid_black>in his traces today, the threads were stuck in gsync_wait

<youpi>that's an ipc

<solid_black>?

<youpi>well, rpc to the kernel

<solid_black>sure, so, what's the point?

<youpi>so it can be affected by the very same kind of issue that mach_port_names() can get

<solid_black>it was stuck inside thread_block inside gsync_wait specifically

<youpi>which just means there's a missing wake

<solid_black>with W state on all threads, so they were never getting woken up

<solid_black>yes

<youpi>if there's another thread stuck on gsync_wake, it's the rpc mechanism culprit

<youpi>but again, mach_port_names looks a much more simple case to work on

<youpi>since it's supposed to involve only the thread, and not two (wake/wait) threads

<solid_black>if we can reproduce mach_port_names getting stuck w/o gsync, sure that'd be easier to debug

<solid_black>and I should learn how to build and use KDB, I guess :|

<solid_black>I know, I know, I should read the docs

<youpi>that's what his program does

<youpi>except it uses a parallel for loop so there's some openmp sync, but with parallel while it'll be out

<solid_black>printf is still synchronized, as he said

<solid_black>we could just... not print it, I guess

<solid_black>it's a real shame qemu's record-replay doesn't work with SMP

<solid_black>aha, it does block inside mach_port_names() -> vm_map_pageable() -> vm_map_pageable_scan() -> vm_fault_wire() -> vm_fault() -> vm_fault_page() -> thread_block()

<solid_black>the other thread does the same, yet another waits to lock_write the map (should be ipc_kernel_map) to vm_allocate in it

<youpi>ok so it's not the rpc mechanism at stake, but better fix it for a start :)

<solid_black>(what's there to fix? / what's broken about rpcs?)

<solid_black>how can it *possibly* block in lock_read if it already holds this very lock for reading? (and lock_set_recursive() has been called) that doesn't make sense

<solid_black>is it the locks implementation that's broken?

<solid_black>or is this the "don't let new readers in if there are pending writers" thing

<solid_black>lock_read() does do that, but it also explicitly lets the current thread get it recursively

<solid_black>and yet, how can that ("current thread") possibly work when there multiple readers?

<solid_black>ok, only the thread that had the map locked for writing calls vm_map_lock_set_recursive() before downgrading to read-only lock, so only that one thread can lock it for reading recursively

<solid_black>but that's definetely not what I'm seeing in the KDB traces

<solid_black>the l->want_write bit handling does look broken

<solid_black>for one thing, the lock implementation uses a volatile boolean, but not atomics; that will fail in precense of compiler optimizations (but doesn't, because we're not building with LTO), and will surely come to bite us on aarch64

<youpi>which exactly?

<youpi>one of them is just for up I guess

<solid_black>"struct slock" has "volatile natural_t lock_data", yes

<solid_black>that's enough to prevent other threads from running "at the same time", not enough to make them see the changes you make to memory while holding the lock

<youpi>the volatile is probably just for the spinning loop

<youpi>sure but that's fine for the spinning loop

<youpi>that just reads the value

<solid_black>yes, but you don't gran a lock to just spin, you grab a lock to see changes made by others, and to make your own changes that others should see; and that doesn't necesserily work (as in, it's technically UB) without atomics or barriers

<solid_black>but that's not our issue right now

<youpi>I'm not saying that volatile enough is enough

<youpi>s/enough/only

<youpi>I'm saying that for the spinnning part that's enough

<youpi>see i386/i386/lock.h's _simple_lock

<youpi>it read-spins with only volatile

<youpi>when it sees it free, it goes back to a proper lock_xchg

<youpi>that's wait less expensive to spin like this

<youpi>doing lock_xchg in a loop would just starve when you have a significant number of processors all doing the same

<solid_black>I actually don't think it's enough even for just spinning, because volatile (has poorly defined semantics, for one thing), but it only ensures the compiler will issue a real read/write to "memory", not that the write you have issued will be seen by other CPU cores (who may have a different value in their own caches)

<solid_black>but that would be an issue for when we get to SMP on aarch64, i.e. not very soon

<youpi>somebody who writes to the value ought to use a memory barrier to publish the value

<youpi>and it's *fine* to see that update late

<youpi>it's better to get this late while avoiding to bother other cpus

<youpi>rather than insisting on atomic operations on each and every loop

<solid_black>yes, but currently, lock.c does not (see kern/lock.c:simple_unlock)

<youpi>which one ?

<solid_black>and I'm not saying you should do lock_xchg each iteration, but at least load(relaxed), or better load(acquire)

<solid_black>kern/lock.c:98

<youpi>there is one #ifdef notdef, and another for UP

<solid_black>ah, right

<youpi>load(acquire) would be more than you actually need

<youpi>you want to see updates, yes, but a pause is fine enough for that

<solid_black>why? -- does pause discard caches?

<solid_black>in our actual reproducible mach_port_names issue, multiple threads look like they're holding the same lock (ipc_kernel_map's) for reading after having held it for writing and then downgraded it for reading

<solid_black>which of course doesn't make sense

<youpi>it doesn't discard caches, but caches eventually get the new value, that's all we need

<youpi>for the read

<youpi>then you want also the happens-before & such, so you want an acquire, yes, but only on the xchg

<solid_black>ok

<youpi>put it another way: the read can *not* be the culprit for anything

<youpi>since that's not what actually makes the lock safe

<youpi>at worse it makes it wait for longer than could be

<solid_black>if we're sure the read will see the new value eventually, yeah

<youpi>(but making it shorter can have very significant cost when the situation is contended)

<youpi>damo22: « so the test can be rewritten as a bash script » no need for this, just use #pragma parallel \n while(1) { }

<youpi>that'll get openmp away

<youpi>it will only be used to easily create the threads

<youpi>then leave it up to the while(1) loop

<solid_black>I don't see what else wrong with locks, but something must be :|

<solid_black>s/else/else is/ s/with/with the/

<youpi>while the low-level parts are proabably fine since quite simple in the end, I would be not surprised that the read/write upgrades/downgrades have bugs

<youpi>since they're quite more involved

<solid_black>I'm talking about the latter ones, yeah

<solid_black>specifically, { lock_write(); lock_set_recursive(); lock_write_to_read(); } -- somehow multiple of these succeed on the same lock at the same time

<youpi>solid_black: I had a quick glance at your github repo, e.g. your implementation of mutexes indeed registers the need_wake value before calling futex wait ;)

<solid_black>sure

<youpi>an important thing about conditions:

<youpi>your readme says that with your implementation the signal/broadcast needs to be outside the mutex section

<youpi>it's true that one can signal/broadcast outside

<youpi>but people need to be aware that, as you say, it's the mutex that provides the happens-before semantic

<youpi>so I'd say you should explicitly remind that the variable manipulation etc. before the signal/broadcast needs to be done with the mutex held, to get that

<youpi>i.e. push_data_on_list(); ready = 1; pthread_cond_signal(&c); is unsafe

<solid_black>"Neither condvar.notify_one() nor condvar.notify_all() must be called with the mutex held" perhaps this sounds like I'm saying that you *must not* hold the mutex, but what I'm saying is you don't *have to*

<youpi>ah

<solid_black>and the very next sentence is "although it's correct to call them while either holding or not holding the mutex, it's much faster to call them without holding the mutex."

<youpi>in my reading it says you must not hold it :)

<youpi>s/must/need to/

<solid_black>perhaps I should s/must be/need to be/

<youpi>to fix it so I'd understand correctly

<solid_black>yes

<solid_black>feel free to submit a patch :D

<youpi>heh :)

<youpi>won't bother though :)

<youpi>I'd still add in the text that of course, the changes that produce a condition to become true or false, needs to be in the mutex section

<youpi>at that point

<youpi>because even if it's explained a bit further below, the fact that one can call signal/broadcast outside a mutex makes people think they can get away with the mutex

<solid_black>not only that, the *checks* for condition have to be done while holding the mutex too

<youpi>"because ready = 1; is atomic, no?"

<youpi>yes, sure

<youpi>but that is sorta hinted strongly by wait needing it

<youpi>(but, yes, the test itself also needs to be in the same mutex section, that too is something that people sometimes get wrong, but at least wait needing the mutex leads them to the right path)

<youpi>gnu_srs2: (-D_GNU_SOURCE): ok, but is that indeed effective in the build log, does it get passed when compiling the C code where you are having the issue?

<youpi>put another way: you need to understand why the #define _IOT_ifreq_int is not getting in

<youpi>the missing -D_GNU_SOURCE is a very common culprit, possibly there is something else

<youpi>you can put e.g. #warning in the /usr/include/net/if.h file to see if that gets included, what part gets included, etc.

IRC channel logs

2024-02-18.log