IRC channel logs
2024-12-10.log
<damo22>ok now that all is merged, i really dont know why parallel smp init crashes
<damo22>unless INIT/STARTUP timings are wrong
<damo22>GDT for apboot is in the same place for all cpus....
<damo22>it is patched by every AP with the percpu gs stuff
<damo22>if you flush the instruction cache with a jmp, does it set interrupt flag?
<damo22>interrupt flag is off but the AP is trying to run CPU_NUMBER() in spl1
<damo22>we should not print during racy smp bringup
<damo22>youpi: i have working parallel smp init
<damo22>i also found and fixed a bug with the logical ids
<solid_black>but I wish I understood what it is that you're even fixing
<solid_black>is it making gnumach's own startup on smp more parallel?
<damo22>it wakes all cpus at the same time
<damo22>this is needed because you cannot address every cpu core individually on large cpus
<damo22>so previously, you could not run gnumach on xeon processor with 24 cores for example
<solid_black>ah, so this is about supporting specific processors that have a large number of cores?
<solid_black>what were the issues that prevent us from enabling SMP for everything by default
<solid_black>and there were some recent deadlocks that Pellescours has fixed?
<damo22>it still doesnt boot on my AMD board
<damo22>but that is an independent problem
<damo22>it does not emulate all hardware timings
<damo22>i think the cpu gives up trying to start the APs
<damo22>if you like virtio you can fix it
<damo22>my version of virtio is a usb -> sata dongle
<damo22>since grub has usb i can even plug the same disk into my other machines and boot gnumach from usb to test the kernel
<solid_black>"Started cpu 1 (lapic 1 0001)" is the last line I get
<solid_black>this is with qemu-system-i386 -accel kvm -machine q35 -smp 8
<solid_black>master (5d1a540211adc9f9f96b80f2c037369b85b9edbd), + a revert of aadb433981b086bfb4e082757fed1154582d5497, + some harmless patches to make it build
<damo22>theres a bug in master with logical ids but it still boots
<solid_black>first of all, is the machine / qemu command line correct?
<damo22>let me boot my qemu and check the git log
<damo22>if you want the cores to run you need patch 1/5 of my latest series
<damo22>otherwise the ipis interrupt the wrong cores
<solid_black>but smp certainly did work for me last winter when we were debugging vm-related hangs
<damo22>i didnt realise i made a mistake in the first patch series, so i fixed it in the second
<damo22>but it should still boot to init
<damo22>ah maybe not if you reverted aadb433
<damo22>theres no need to revert that patch
<damo22>you can use my /sbin/smp tool to run a full smp shell
<solid_black>yes, but I wanted to see how the boot hangs if we enable smp for all
<damo22>ok then apply the latest series too
<damo22>you can fetch from git.zammit.org/gnumach-sv.git if you want
<solid_black>so are all the other cores except for 0, which is looping in start_other_cpus
<damo22>in your log, what is the jmp offset
<solid_black>I'm not seeing anything about a jmp offset on screen
<solid_black>let me try with console=stdio, perhaps it's further above
<damo22>well i think i deleted that line in the logging
<damo22>can you inspect the value of apboot_jmp_offset ?
<damo22>as in the memory contained in that address
<solid_black>(but it might be something about segments or paging that throws it off, if this is early boot)
<damo22>how many patches did you apply to master?
<damo22>make sure you have 5 of mine on top
<solid_black>yes, all 5 from your latest series, plus a revert of the processor set patch
<damo22>which version of qemu do you have
<damo22>QEMU emulator version 8.2.50 (v8.2.0-763-g09be347171-dirty)
<damo22>qemu-system-i386 -M q35,accel=kvm -smp $1 -m 4096 ....
<damo22>"--enable-kdb --enable-apic --enable-ncpus=8 --disable-linux-groups" is what i compile with
<solid_black>it tried to fsck /dev/hd0s1 (from fstab?) and discovered that doesn't exist
<damo22>if you have networking enabled with that smp grouping thing reverted, it will hang on network
<damo22>so i suggest not reverting it for now
<solid_black>also how do you even write a userland process that is not smp safe?
<solid_black>but threading and races also happen on a single core, due to context switching
<solid_black>though I can see how smp would make them more likely
<damo22>yes but you dont notice them with single core as often
<damo22>one thing i did notice, (but also happened before all my smp patches) -smp 2 hangs on rumpdisk startup
<damo22>solid_black: can you verify that?
<youpi>again, I'd not to care about netdde which we won't support long-term-wise
<solid_black>and the load is gone the instant I interrupted stress(1), so it's not just a busy-loop somewhere
<solid_black>so how about instead of your smp tool, we pin netdde to a single core, and ship full smp otherwise?
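The point above about races existing on a single core but showing up far more often under SMP can be made concrete with a small sketch. It is single-threaded on purpose: it replays one hypothetical interleaving of two unsynchronized increments by hand, so the lost update is deterministic. All names here (`struct cpu_ctx`, `load_step`, `store_step`, `counter`) are invented for illustration and are not gnumach or netdde code.

```c
/* Each simulated "CPU" performs counter += delta as two separate steps,
   load then store, which is what an unlocked read-modify-write amounts to. */
struct cpu_ctx {
    long tmp;                 /* value this CPU read from the shared counter */
};

static long counter;          /* shared and deliberately unprotected */

static void load_step(struct cpu_ctx *c)
{
    c->tmp = counter;
}

static void store_step(struct cpu_ctx *c, long delta)
{
    counter = c->tmp + delta;
}

/* Interleaved: A loads, B loads, B stores, A stores.
   Both CPUs read 0, so one of the two increments is lost. */
long racy_two_increments(void)
{
    struct cpu_ctx a, b;

    counter = 0;
    load_step(&a);
    load_step(&b);
    store_step(&b, +1);
    store_step(&a, +1);
    return counter;
}

/* Serialized, as a lock would force: A finishes before B starts. */
long serialized_two_increments(void)
{
    struct cpu_ctx a, b;

    counter = 0;
    load_step(&a);
    store_step(&a, +1);
    load_step(&b);
    store_step(&b, +1);
    return counter;
}

/* Two matching decrements after the lost update push the counter below zero. */
long racy_then_two_decrements(void)
{
    racy_two_increments();
    counter -= 1;
    counter -= 1;
    return counter;
}
```

With this interleaving `racy_two_increments()` returns 1 instead of 2, and `racy_then_two_decrements()` ends at -1. On a single core the window between load and store is only hit when a context switch or interrupt lands inside it; with two cores running the same path at once it is hit far more easily.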
<damo22>i dont want to invest time making that happen when im pretty close to having rumpnet working
<damo22>it only supports intel nics though and a handful of amd nics
<solid_black>ah, one reason networking is broken is I didn't mount /home, so my translator record that points to a local build of pfinet is just broken
<damo22>yeah there was a bug that caused delete key to crash console, i fixed that upstream
<damo22>yeah we should fix the -smp 2 hang
<damo22>there was a commit relating to that recently
<damo22>unless we need to subtract up to e->interrupts
<damo22>depending if it would make it negative
<damo22>do we know for sure that its the right number to subtract?
<damo22>(09:07:29 PM) solid_black: irqtab.tot_num_intr is -1
<solid_black>total_num_intr is -1, but there apparently aren't any actual interrupts
<solid_black>so you need to at the very least synchronize total_num_intr reads/modifications
<damo22>i think its the tracking for interrupts that occur and need to be handled by userspace handlers
<damo22>youpi: with -smp 2, solid_black found that irqdev.tot_num_intr == -1
<damo22>my understanding is that tot_num_intr should never be less than 0
<damo22>youpi: "i386/apic: Fix logical id numbering" patch 1 is essential, it fixes a bug with the first patch series, but the rest are also good
<youpi>solid_black: threading and races also happen on a single core due to context switching, but context switching interlace *way* less the code paths
<solid_black>but you'd think we'd still run into it, given how much netdde is used
<youpi>really, the orders of magnitude can be 1000x
<youpi>device/intr.c indeed deserves a lock instead of just spl_high
<youpi>since you want to both disable interrupts and prevent against other cpus
<solid_black>seems to have booted (to fsck failure) with s/spl/locking/g
<solid_black>but I wonder whether this breaks Linux drivers support
<janneke>damo22: headsup: i tried your latest smp patch series (both), if that should also have addressed the loadavg issue then it isn't fixed for me
<janneke>fwiw, i've added a patch to guix to bump the overload-threshold for childhurds from 0.8 to 1.8 so that offloading may continue
<janneke>so there's no hurry and i very much appreciate your [smp] efforts!
<damo22>janneke: no, nothing to address the loadavg
<damo22>youpi: thanks for review, but i didnt understand how to fix the splhigh() call
<azert>damo22: I think that youpi wants you to fix the early gs access instead of removing it
<youpi>one can use a gdt descriptor and gdt table per cpu
<youpi>otherwise, longterm we'll keep having to fight with gs: addressing popping up here and there
<damo22>but im not sure how to do that because with parallel smp init, the start vector has to be the same address for all cpus
<damo22>we need an array of gdts in the low memory?
<Pellescours>do the cpu able to get a different id based on get_cpuid() at this point in code?
<Pellescours>it can help you to discriminate them and not have them collide the same address
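The idea floated at the end (an array of GDTs, with each AP picking its own table by CPU id so that concurrent patching does not collide) might look roughly like the sketch below. This is plain hosted C, not early-boot code, and it assumes the CPU id is already known, which is exactly the open question in the log. `NCPUS`, `GDT_ENTRIES`, `PERCPU_SLOT`, `ap_gdt`, `gdt_for_cpu` and `patch_percpu_descriptor` are hypothetical names, not gnumach symbols.

```c
#include <stddef.h>
#include <stdint.h>

#define NCPUS       8   /* assumption: mirrors --enable-ncpus=8 from the log */
#define GDT_ENTRIES 8   /* assumption: illustrative table size */
#define PERCPU_SLOT 1   /* assumption: index of the per-CPU (gs) descriptor */

/* An x86 segment descriptor is 8 bytes; it is kept opaque here. */
struct gdt_entry {
    uint64_t raw;
};

/* One GDT per CPU instead of a single table that every AP patches. */
static struct gdt_entry ap_gdt[NCPUS][GDT_ENTRIES];

/* Return the table owned by `cpu`, or NULL if the id is out of range. */
struct gdt_entry *gdt_for_cpu(unsigned int cpu)
{
    if (cpu >= NCPUS)
        return NULL;
    return ap_gdt[cpu];
}

/* Write the per-CPU descriptor into this CPU's own table only.
   Returns 0 on success, -1 for an invalid CPU id. */
int patch_percpu_descriptor(unsigned int cpu, uint64_t descriptor)
{
    struct gdt_entry *gdt = gdt_for_cpu(cpu);

    if (gdt == NULL)
        return -1;
    gdt[PERCPU_SLOT].raw = descriptor;
    return 0;
}
```

Because each CPU writes only to `ap_gdt[cpu]`, all APs can still enter through the same start vector and diverge only at the table lookup. Whether an AP can obtain a distinct id that early, and whether the array can live in low memory, is left open here just as it is in the discussion.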