IRC channel logs

2024-12-10.log

<damo22>ok now that all is merged, i really dont know why parallel smp init crashes
<damo22>unless INIT/STARTUP timings are wrong
<damo22>AHA
<damo22>GDT for apboot is in the same place for all cpus....
<damo22>it is patched by every AP with the percpu gs stuff
<damo22>so they all clobber each other
<damo22>we should remove gs early
<damo22>how did AP call spl1 ?
<damo22>or spl0
<damo22>im so confused
<damo22>if you flush the instruction cache with a jmp, does it set interrupt flag?
<damo22>interrupt flag is off but the AP is trying to run CPU_NUMBER() in spl1
<damo22>how??
<damo22>i mean how did it get there
<damo22>splvm() !!!!
<damo22>in kmsg_putchar
<damo22>we should not print during racy smp bringup
<damo22>mach's printf is not smp safe
<damo22>youpi: i have working parallel smp init
<damo22>\o/
<damo22>i also found and fixed a bug with the logical ids
<solid_black>hi
<damo22>hi
<damo22>i just mailed in parallel smp
<solid_black>I see :)
<solid_black>but I wish I understood what it is that you're even fixing
<solid_black>is it making gnumach's own startup on smp more parallel?
<damo22>it wakes all cpus at the same time
<damo22>with one IPI
<damo22>this is needed because you cannot address every cpu core individually on large cpus
<damo22>so previously, you could not run gnumach on xeon processor with 24 cores for example
<solid_black>ah, so this is about supporting specific processors that have a large number of cores?
<damo22>yes basically x86
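
A minimal sketch of what "one IPI for all cpus" can look like at the register level, assuming the standard xAPIC Interrupt Command Register layout with the "all excluding self" destination shorthand. The function names and the 0x10000 start address used in the usage note are illustrative assumptions, not taken from the gnumach patches.

```c
#include <assert.h>
#include <stdint.h>

/* Field encodings of the xAPIC Interrupt Command Register (low dword). */
#define ICR_DELIVERY_INIT     (5u << 8)   /* delivery mode 101b */
#define ICR_DELIVERY_STARTUP  (6u << 8)   /* delivery mode 110b */
#define ICR_LEVEL_ASSERT      (1u << 14)
#define ICR_DEST_ALL_BUT_SELF (3u << 18)  /* destination shorthand 11b */

/* ICR value for one INIT IPI aimed at every CPU except the sender. */
uint32_t icr_init_all_but_self(void)
{
    return ICR_DEST_ALL_BUT_SELF | ICR_LEVEL_ASSERT | ICR_DELIVERY_INIT;
}

/* ICR value for one STARTUP IPI; every AP begins at the same start_addr. */
uint32_t icr_startup_all_but_self(uint32_t start_addr)
{
    /* The vector field holds the 4 KiB page number of the real-mode entry,
       so the address must be page aligned and below 1 MiB. */
    assert((start_addr & 0xFFFu) == 0 && start_addr < 0x100000u);
    return ICR_DEST_ALL_BUT_SELF | ICR_LEVEL_ASSERT | ICR_DELIVERY_STARTUP
           | (start_addr >> 12);
}
```

With a hypothetical trampoline at 0x10000, `icr_startup_all_but_self(0x10000)` yields 0x000C4610. Because the shorthand replaces the destination field, no per-core APIC ID is needed, which is also why every AP necessarily enters at the same address.
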
<solid_black>what were the issues that prevent us from enabling SMP for everything by default
<solid_black>it hung during boot, I think?
<damo22>yes
<solid_black>and there were some recent deadlocks that Pellescours has fixed?
<damo22>indeed
<damo22>we need to test it more
<solid_black>let's maybe just debug the remaining hangs
<solid_black>how hard could it be (famous last words)
<damo22>it still doesnt boot on my AMD board
<damo22>but that is an independent problem
<damo22>i think its hardware timing
<damo22>with the INIT/STARTUP
<damo22>qemu is good but its not ideal
<damo22>it does not emulate all hardware timings
<solid_black>I see, but indeed that's a separate problem
<damo22>i think the cpu gives up trying to start the APs
<damo22>and sends an error
<solid_black>does Flavio's mini-distro boot fully with smp?
<damo22>no idea
<solid_black>have you looked at virtio?
<damo22>not yet
<damo22>if you like virtio you can fix it
<damo22>:P
<damo22>i prefer real disk
<damo22>i cant boot virtio on metal
<damo22>my version of virtio is a usb -> sata dongle
<damo22>-hda /dev/sdd
<damo22>since grub has usb i can even plug the same disk into my other machines and boot gnumach from usb to test the kernel
<damo22>even though i cant mount /
<solid_black>"Started cpu 1 (lapic 1 0001)" is the last line I get
<solid_black>this is with qemu-system-i386 -accel kvm -machine q35 -smp 8
<damo22>solid_black: which branch
<damo22>commit i mean
<solid_black>master (5d1a540211adc9f9f96b80f2c037369b85b9edbd), + a revert of aadb433981b086bfb4e082757fed1154582d5497, + some harmless patches to make it build
<damo22>theres a bug in master with logical ids but it still boots
<solid_black>first of all, is the machine / qemu command line correct?
<solid_black>i.e. is this supposed to work
<damo22>let me boot my qemu and check the git log
<damo22>if you want the cores to run you need patch 1/5 of my latest series
<solid_black>ok, so I should apply your today's series
<damo22>otherwise the ipis interrupt the wrong cores
<damo22>yeah wouldnt hurt
<solid_black>but smp certainly did work for me last winter when we were debugging vm-related hangs
<damo22>i didnt realise i made a mistake in the first patch series, so i fixed it in the second
<damo22>as patch 1
<solid_black>ah, I see
<damo22>but it should still boot to init
<damo22>ah maybe not if you reverted aadb433
<damo22>theres no need to revert that patch
<damo22>you can use my /sbin/smp tool to run a full smp shell
<solid_black>yes, but I wanted to see how the boot hangs if we enable smp for all
<solid_black>I assumed it happens later during unix boot
<damo22>ok then apply the latest series too
<damo22>you can fetch from git.zammit.org/gnumach-sv.git if you want
<solid_black>already git am'ed, but thanks
<solid_black>rebuilding now
<solid_black>hangs at "Waiting for AP 1" now
<solid_black>should I attach gdb and see what core 1 is doing?
<damo22>errr
<damo22>i fixed that
<solid_black>core 1 is at apboot_jmp_offset
<damo22>really?? how
<solid_black>so are all the other cores except for 0, which is looping in start_other_cpus
<solid_black>lmk what else to check
<damo22>in your log, what is the jmp offset
<damo22>0x10000?
<solid_black>I'm not seeing anything about a jmp offset on screen
<solid_black>let me try with console=stdio, perhaps it's further above
<damo22>well i think i deleted that line in the logging
<damo22>can you inspect the value of apboot_jmp_offset ?
<damo22>as in the memory contained in that address
<solid_black>gdb says it cannot read it
<damo22>hmm
<solid_black>(but it might be something about segments or paging that throws it off, if this is early boot)
<damo22>how many patches did you apply to master?
<damo22>make sure you have 5 of mine on top
<solid_black>yes, all 5 from your latest series, plus a revert of the processor set patch
<damo22>which version of qemu do you have
<solid_black>9.1.2
<damo22>QEMU emulator version 8.2.50 (v8.2.0-763-g09be347171-dirty)
<solid_black>so what else should i look at?
<damo22>qemu-system-i386 -M q35,accel=kvm -smp $1 -m 4096 ....
<solid_black>ah right, I haven't specified -m
<solid_black>could it be due to lack of ram?
<damo22>yes
<solid_black>could it detect that and complain?
<damo22>im not sure, i guess it could
<solid_black>oooh so much better now
<solid_black>ext2fs started up
<damo22>yeah
<solid_black>and discovered hd0 doesn't exist, which makes sense
<solid_black>init is running
<damo22>"--enable-kdb --enable-apic --enable-ncpus=8 --disable-linux-groups" is what i compile with
<solid_black>it tried to fsck /dev/hd0s1 (from fstab?) and discovered that doesn't exist
<solid_black>but seems to mostly work otherwise?
<damo22>if you have networking enabled with that smp grouping thing reverted, it will hang on network
<damo22>netdde is not smp safe
<damo22>so i suggest not reverting it for now
<damo22>then everything will boot
<solid_black>can we have just netdde pinned to a single core?
<damo22>we could yes
<solid_black>also how do you even write a userland process that is not smp safe?
<damo22>i dont know, it breaks
<damo22>probably threading and races
<solid_black>but threading and races also happen on a single core, due to context switching
<solid_black>though I can see how smp would make them more likely
<damo22>yes but you dont notice them with single core as often
<damo22>one thing i did notice, (but also happened before all my smp patches) -smp 2 hangs on rumpdisk startup
<damo22>solid_black: can you verify that?
<damo22>perhaps you can debug that one
<solid_black>8 cores booted
<damo22>yeah
<solid_black>something about networking is borken indeed
<damo22>netdde
<youpi>again, I'd say not to care about netdde, which we won't support long-term
<solid_black>what'd be an easy way to stress all the cores
<damo22>stress -c 8
<solid_black>right
<solid_black>thx
<damo22>you might need to install that
<solid_black>looks like I have it installed since last year
<solid_black>yep, hogging all of my host's cores
<damo22>\o/
<solid_black>and the load is gone the instant I interrupted stress(1), so it's not just a busy-loop somewhere
<solid_black>so how about instead of your smp tool, we pin netdde to a single core, and ship full smp otherwise?
<damo22>i dont want to invest time making that happen when im pretty close to having rumpnet working
<damo22>it only supports intel nics though and a handful of amd nics
<solid_black>ah, one reason networking is broken is I didn't mount /home, so my translator record that points to a local build of pfinet is just broken
<solid_black>file_set_translator returns ERANGE, what?
<solid_black>something about xattrs?
<solid_black>fsysopts / --no-xattr-translator-records helped
<solid_black>console crashed
<damo22>did you press delete?
<damo22>im not even joking
<solid_black>I think so
<solid_black>I was trying to edit a command
<solid_black>seems to hang with -smp 2 indeed
<damo22>yeah there was a bug that caused delete key to crash console, i fixed that upstream
<damo22>yeah we should fix the -smp 2 hang
<solid_black>cpu 0 loops inside intr_thread
<solid_black>on queue_iterate (&main_intr_queue
<damo22>interesting
<damo22>there was a commit relating to that recently
<damo22>maybe we exposed another bug
<solid_black>irqtab.tot_num_intr is -1
<damo22>8ef7e269755
<damo22>that is bad
<damo22>maybe we are 1 off?
<solid_black>is 'dev->tot_num_intr++' smp-safe?
<damo22>hmm
<damo22>atomic_inc?
<solid_black>yes, or locking
<damo22>unless we need to subtract up to e->interrupts
<damo22>depending if it would make it negative
<damo22>do we know for sure that its the right number to subtract?
<solid_black>insert_intr_entry doesn't look thread-safe either
<damo22>(09:07:29 PM) solid_black: irqtab.tot_num_intr is -1
<damo22>that is probably why its hung
<solid_black>yes
<solid_black>it's looping until there are interrupts
<solid_black>tot_num_intr is -1, but there apparently aren't any actual interrupts
<solid_black>so it never gets decreased
<damo22>can you try to fix this
<solid_black>so you need to at the very least synchronize tot_num_intr reads/modifications
<solid_black>but also the queue itself and the user_intr_t's
<solid_black>could you explain what user_intr_t's are
<damo22>interrupts from userspace
<damo22>handlers
<damo22>i think its the tracking for interrupts that occur and need to be handled by userspace handlers
<solid_black>sounds like this just needs a lock honestly
<solid_black>let me try that
<damo22>youpi: with -smp 2, solid_black found that irqdev.tot_num_intr == -1
<damo22>and it hung
<damo22>my understanding is that tot_num_intr should never be less than 0
<youpi>yes
<damo22>youpi: "i386/apic: Fix logical id numbering" patch 1 is essential, it fixes a bug with the first patch series, but the rest are also good
<youpi>solid_black: threading and races also happen on a single core due to context switching, but context switching interleaves the code paths *way* less
<solid_black>but you'd think we'd still run into it, given how much netdde is used
<youpi>really, the orders of magnitude can be 1000x
<youpi>device/intr.c indeed deserves a lock instead of just spl_high
<solid_black>does simple_lock_irq make sense?
<youpi>(a simple_lock_irq)
<solid_black>ok
<youpi>yes
<youpi>since you want to both disable interrupts and protect against other cpus
<solid_black>seems to have booted (to fsck failure) with s/spl/locking/g
<solid_black>yes, fully booted once I erased fstab
<solid_black>but I wonder whether this breaks Linux drivers support
<solid_black>posted the patches, PTAL
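
A small sketch of why an unsynchronized `tot_num_intr++` can go wrong on SMP. It replays, deterministically and without threads, one unlucky interleaving of the load/add/store steps on two CPUs. The struct, the helper names, and the assumption that delivery later subtracts the number of interrupts actually queued are all hypothetical; this is one plausible way a counter could reach -1, not a confirmed diagnosis of device/intr.c.

```c
#include <assert.h>

/* Hypothetical stand-in for the shared state discussed above. */
struct irq_state {
    int tot_num_intr;
};

/* The separate steps hidden inside `tot_num_intr++`. */
static int load_counter(const struct irq_state *s) { return s->tot_num_intr; }
static void store_counter(struct irq_state *s, int v) { s->tot_num_intr = v; }

/* Two CPUs each record one interrupt with no lock, then both get delivered. */
int replay_lost_update(void)
{
    struct irq_state s = { 0 };
    int cpu0 = load_counter(&s);    /* CPU 0 reads 0 */
    int cpu1 = load_counter(&s);    /* CPU 1 reads 0 before CPU 0 stores */
    store_counter(&s, cpu0 + 1);    /* CPU 0 writes 1 */
    store_counter(&s, cpu1 + 1);    /* CPU 1 also writes 1: one increment lost */
    assert(s.tot_num_intr == 1);    /* two interrupts queued, only one counted */
    store_counter(&s, load_counter(&s) - 2);  /* delivery subtracts both */
    return s.tot_num_intr;
}

/* Same events, but each increment completes before the next one starts,
   which is what holding a lock around the update would guarantee. */
int replay_serialized(void)
{
    struct irq_state s = { 0 };
    store_counter(&s, load_counter(&s) + 1);  /* CPU 0, under the lock */
    store_counter(&s, load_counter(&s) + 1);  /* CPU 1, under the lock */
    store_counter(&s, load_counter(&s) - 2);  /* delivery subtracts both */
    return s.tot_num_intr;
}
```

Here `replay_lost_update()` ends at -1 while `replay_serialized()` ends at 0. A thread that loops while the counter is nonzero would spin forever in the first case, since nothing is left to bring it back to zero.
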
<janneke>damo22: headsup: i tried your latest smp patch series (both), if that should also have addressed the loadavg issue then it isn't fixed for me
<janneke>fwiw, i've added a patch to guix to bump the overload-threshold for childhurds from 0.8 to 1.8 so that offloading may continue
<janneke>so there's no hurry and i very much appreciate your [smp] efforts!
<damo22>janneke: no, nothing to address the loadavg
<damo22>youpi: thanks for review, but i didnt understand how to fix the splhigh() call
<azert>damo22: I think that youpi wants you to fix the early gs access instead of removing it
<azert>that’s my interpretation
<youpi>yes
<youpi>one can use a gdt descriptor and gdt table per cpu
<youpi>otherwise, longterm we'll keep having to fight with gs: addressing popping up here and there
<damo22>ok
<damo22>but im not sure how to do that because with parallel smp init, the start vector has to be the same address for all cpus
<damo22>since there is only one IPI
<damo22>we need an array of gdts in the low memory?
<Pellescours>is the cpu able to get a different id based on get_cpuid() at this point in code?
<Pellescours>even if it’s between 1 and 8
<Pellescours>it can help you to discriminate them and not have them collide on the same address
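
A rough sketch of the "array of gdts indexed by cpu id" idea, written as ordinary C so the indexing can be checked in isolation. The table size, the slot used for the percpu segment, and all function names are assumptions for illustration. Real AP boot code would run from the low-memory trampoline and would first have to derive its index from something like the local APIC ID.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NCPUS        8   /* matches --enable-ncpus=8 from the log */
#define GDT_ENTRIES  4   /* hypothetical: null, code, data, percpu */
#define PERCPU_INDEX 3   /* hypothetical slot for the gs segment */

/* Standard 8-byte x86 segment descriptor layout. */
struct gdt_desc {
    uint16_t limit_low;
    uint16_t base_low;
    uint8_t  base_mid;
    uint8_t  access;
    uint8_t  limit_high_flags;
    uint8_t  base_high;
};

/* One private GDT per CPU, so patching one cannot clobber another. */
static struct gdt_desc ap_gdt[NCPUS][GDT_ENTRIES];

static void desc_set_base(struct gdt_desc *d, uint32_t base)
{
    assert(d != NULL);
    d->base_low  = (uint16_t)(base & 0xFFFFu);
    d->base_mid  = (uint8_t)((base >> 16) & 0xFFu);
    d->base_high = (uint8_t)((base >> 24) & 0xFFu);
}

static uint32_t desc_get_base(const struct gdt_desc *d)
{
    assert(d != NULL);
    return (uint32_t)d->base_low
         | ((uint32_t)d->base_mid << 16)
         | ((uint32_t)d->base_high << 24);
}

/* Patch only this CPU's percpu descriptor; returns -1 for a bad index. */
int patch_percpu_base(unsigned cpu, uint32_t base)
{
    if (cpu >= NCPUS)
        return -1;
    desc_set_base(&ap_gdt[cpu][PERCPU_INDEX], base);
    return 0;
}

/* Read back the percpu base stored in this CPU's own GDT. */
uint32_t percpu_base(unsigned cpu)
{
    assert(cpu < NCPUS);
    return desc_get_base(&ap_gdt[cpu][PERCPU_INDEX]);
}
```

Patching CPU 1 and CPU 2 with different bases leaves each value intact, unlike the single shared apboot GDT described at the start of the log. The cost is NCPUS copies of the table in memory the trampoline can reach.
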