IRC channel logs

<solid_black>master (5d1a540211adc9f9f96b80f2c037369b85b9edbd), + a revert of aadb433981b086bfb4e082757fed1154582d5497, + some harmless patches to make it build

<damo22>theres a bug in master with logical ids but it still boots

<solid_black>first of all, is the machine / qemu command line correct?

<solid_black>i.e. is this supposed to work

<damo22>let me boot my qemu and check the git log

<damo22>if you want the cores to run you need patch 1/5 of my latest series

<solid_black>ok, so I should apply the series you posted today

<damo22>otherwise the ipis interrupt the wrong cores

<damo22>yeah wouldnt hurt

<solid_black>but smp certainly did work for me last winter when we were debugging vm-related hangs

<damo22>i didnt realise i made a mistake in the first patch series, so i fixed it in the second

<damo22>as patch 1

<solid_black>ah, I see

<damo22>but it should still boot to init

<damo22>ah maybe not if you reverted aadb433

<damo22>theres no need to revert that patch

<damo22>you can use my /sbin/smp tool to run a full smp shell

<solid_black>yes, but I wanted to see how the boot hangs if we enable smp for all

<solid_black>I assumed it happens later during unix boot

<damo22>ok then apply the latest series too

<damo22>you can fetch from git.zammit.org/gnumach-sv.git if you want

<solid_black>already git am'ed, but thanks

<solid_black>rebuilding now

<solid_black>hangs at "Waiting for AP 1" now

<solid_black>should I attach gdb and see what core 1 is doing?

<damo22>errr

<damo22>i fixed that

<solid_black>core 1 is at apboot_jmp_offset

<damo22>really?? how

<solid_black>so are all the other cores except for 0, which is looping in start_other_cpus

<solid_black>lmk what else to check

<damo22>in your log, what is the jmp offset

<damo22>0x10000?

<solid_black>I'm not seeing anything about a jmp offset on screen

<solid_black>let me try with console=stdio, perhaps it's further above

<damo22>well i think i deleted that line in the logging

<damo22>can you inspect the value of apboot_jmp_offset ?

<damo22>as in the memory contained in that address

<solid_black>gdb says it cannot read it

<damo22>hmm

<solid_black>(but it might be something about segments or paging that throws it off, if this is early boot)

<damo22>how many patches did you apply to master?

<damo22>make sure you have 5 of mine on top

<solid_black>yes, all 5 from your latest series, plus a revert of the processor set patch

<damo22>which version of qemu do you have

<solid_black>9.1.2

<damo22>QEMU emulator version 8.2.50 (v8.2.0-763-g09be347171-dirty)

<solid_black>so what else should i look at?

<damo22>qemu-system-i386 -M q35,accel=kvm -smp $1 -m 4096 ....

<solid_black>ah right, I haven't specified -m

<solid_black>could it be due to lack of ram?

<damo22>yes

<solid_black>could it detect that and complain?

<damo22>im not sure, i guess it could

<solid_black>oooh so much better now

<solid_black>ext2fs started up

<damo22>yeah

<solid_black>and discovered hd0 doesn't exist, which makes sense

<solid_black>init is running

<damo22>"--enable-kdb --enable-apic --enable-ncpus=8 --disable-linux-groups" is what i compile with

<solid_black>it tried to fsck /dev/hd0s1 (from fstab?) and discovered that doesn't exist

<solid_black>but seems to mostly work otherwise?

<damo22>if you have networking enabled with that smp grouping thing reverted, it will hang on network

<damo22>netdde is not smp safe

<damo22>so i suggest not reverting it for now

<damo22>then everything will boot

<solid_black>can we have just netdde pinned to a single core?

<damo22>we could yes

<solid_black>also how do you even write a userland process that is not smp safe?

<damo22>i dont know, it breaks

<damo22>probably threading and races

<solid_black>but threading and races also happen on a single core, due to context switching

<solid_black>though I can see how smp would make them more likely

<damo22>yes but you dont notice them with single core as often

<damo22>one thing i did notice, (but also happened before all my smp patches) -smp 2 hangs on rumpdisk startup

<damo22>solid_black: can you verify that?

<damo22>perhaps you can debug that one

<solid_black>8 cores booted

<damo22>yeah

<solid_black>something about networking is borken indeed

<damo22>netdde

<youpi>again, I'd say not to care about netdde, which we won't support long-term

<solid_black>what'd be an easy way to stress all the cores

<damo22>stress -c 8

<solid_black>right

<solid_black>thx

<damo22>you might need to install that

<solid_black>looks like I have it installed since last year

<solid_black>yep, hogging all of my host's cores

<damo22>\o/

<solid_black>and the load is gone the instant I interrupted stress(1), so it's not just a busy-loop somewhere

<solid_black>so how about instead of your smp tool, we pin netdde to a single core, and ship full smp otherwise?

<damo22>i dont want to invest time making that happen when im pretty close to having rumpnet working

<damo22>it only supports intel nics though and a handful of amd nics

<solid_black>ah, one reason networking is broken is I didn't mount /home, so my translator record that points to a local build of pfinet is just broken

<solid_black>file_set_translator returns ERANGE, what?

<solid_black>something about xattrs?

<solid_black>fsysopts / --no-xattr-translator-records helped

<solid_black>console crashed

<damo22>did you press delete?

<damo22>im not even joking

<solid_black>I think so

<solid_black>I was trying to edit a command

<solid_black>seems to hang with -smp 2 indeed

<damo22>yeah there was a bug that caused delete key to crash console, i fixed that upstream

<damo22>yeah we should fix the -smp 2 hang

<solid_black>cpu 0 loops inside intr_thread

<solid_black>on queue_iterate (&main_intr_queue

<damo22>interesting

<damo22>there was a commit relating to that recently

<damo22>maybe we exposed another bug

<solid_black>irqtab.tot_num_intr is -1

<damo22>8ef7e269755

<damo22>that is bad

<damo22>maybe we are 1 off?

<solid_black>is 'dev->tot_num_intr++' smp-safe?

<damo22>hmm

<damo22>atomic_inc?

<solid_black>yes, or locking

<damo22>unless we need to subtract up to e->interrupts

<damo22>depending if it would make it negative

<damo22>do we know for sure that its the right number to subtract?

<solid_black>insert_intr_entry doesn't look thread-safe either

<damo22>(09:07:29 PM) solid_black: irqtab.tot_num_intr is -1

<damo22>that is probably why its hung

<solid_black>yes

<solid_black>it's looping until there are interrupts

<solid_black>tot_num_intr is -1, but there apparently aren't any actual interrupts

<solid_black>so it never gets decreased
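
(A minimal sketch of one way a shared counter like tot_num_intr could end up at -1. It is an illustration, not gnumach code: the struct and function names are hypothetical, the interleaving is written out by hand instead of using real threads, and the log does not establish that this is exactly what happened. Two CPUs each do a read-modify-write on the shared total; one update is lost, and the consumer then subtracts both per-entry counts.)

```c
#include <assert.h>

/* Hypothetical stand-ins; these are NOT the gnumach definitions. */
struct entry { int interrupts; };    /* pending count for one handler */
struct table { int tot_num_intr; };  /* shared total across handlers */

/* Two CPUs each run "e->interrupts++; t->tot_num_intr++", with the loads
 * and stores on the shared total interleaved. Each CPU owns its entry,
 * so only the shared total suffers a lost update. */
static void deliver_both_racy(struct table *t, struct entry *a, struct entry *b)
{
    a->interrupts++;              /* CPU 0, private to entry a */
    b->interrupts++;              /* CPU 1, private to entry b */
    int reg0 = t->tot_num_intr;   /* CPU 0 loads the total */
    int reg1 = t->tot_num_intr;   /* CPU 1 loads the same value */
    t->tot_num_intr = reg0 + 1;   /* CPU 0 stores */
    t->tot_num_intr = reg1 + 1;   /* CPU 1 stores, overwriting CPU 0 */
}

/* Same work, but each read-modify-write completes before the next one
 * starts, which is what a lock (or an atomic increment) would enforce. */
static void deliver_both_serialized(struct table *t, struct entry *a, struct entry *b)
{
    a->interrupts++;
    t->tot_num_intr++;
    b->interrupts++;
    t->tot_num_intr++;
}

/* Consumer side: subtract an entry's pending count from the total. */
static void drain(struct table *t, struct entry *e)
{
    assert(e->interrupts >= 0);
    t->tot_num_intr -= e->interrupts;
    e->interrupts = 0;
}

int total_after_racy_delivery(void)
{
    struct table t = { 0 };
    struct entry a = { 0 }, b = { 0 };
    deliver_both_racy(&t, &a, &b);
    drain(&t, &a);
    drain(&t, &b);
    return t.tot_num_intr;
}

int total_after_serialized_delivery(void)
{
    struct table t = { 0 };
    struct entry a = { 0 }, b = { 0 };
    deliver_both_serialized(&t, &a, &b);
    drain(&t, &a);
    drain(&t, &b);
    return t.tot_num_intr;
}
```

(In this model the racy path leaves the total at -1 and the serialized path leaves it at 0. A loop of the form "while the total is not 0" would keep spinning on -1 with nothing left to drain.)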

<damo22>can you try to fix this

<solid_black>so you need to at the very least synchronize tot_num_intr reads/modifications

<solid_black>but also the queue itself and the user_intr_t's

<solid_black>could you explain what user_intr_t's are

<damo22>interrupts from userspace

<damo22>handlers

<damo22>i think its the tracking for interrupts that occur and need to be handled by userspace handlers

<solid_black>sounds like this just needs a lock honestly

<solid_black>let me try that

<damo22>youpi: with -smp 2, solid_black found that irqtab.tot_num_intr == -1

<damo22>and it hung

<damo22>my understanding is that tot_num_intr should never be less than 0

<youpi>yes

<damo22>youpi: "i386/apic: Fix logical id numbering" patch 1 is essential, it fixes a bug with the first patch series, but the rest are also good

<youpi>solid_black: threading and races also happen on a single core due to context switching, but context switching interleaves the code paths *way* less

<solid_black>but you'd think we'd still run into it, given how much netdde is used

<youpi>really, the orders of magnitude can be 1000x

<youpi>device/intr.c indeed deserves a lock instead of just spl_high

<solid_black>does simple_lock_irq make sense?

<youpi>(a simple_lock_irq)

<solid_black>ok

<youpi>yes

<youpi>since you want to both disable interrupts and protect against other cpus
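
(A minimal userland model of the two things an interrupt-safe spinlock has to do, as described above: mask interrupts on the local CPU, and keep other CPUs out. All names here are hypothetical and the "interrupt flag" is just a boolean; this is a sketch of the idea, not the gnumach implementation of simple_lock_irq.)

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical model types; not gnumach's. */
struct model_cpu  { bool irqs_enabled; };  /* stands in for the CPU's interrupt flag */
struct model_lock { atomic_flag held; };   /* stands in for the spinlock word */

/* Mask interrupts on this CPU first, then spin until the lock is free.
 * Masking first matters: if an interrupt handler on the same CPU tried
 * to take a lock this CPU already holds, it would spin forever.
 * Returns the previous interrupt state so the caller can restore it. */
bool model_lock_irq(struct model_cpu *cpu, struct model_lock *l)
{
    bool saved = cpu->irqs_enabled;
    cpu->irqs_enabled = false;
    while (atomic_flag_test_and_set_explicit(&l->held, memory_order_acquire))
        ;  /* another CPU holds the lock: spin */
    return saved;
}

/* Release the lock, then restore the saved interrupt state. */
void model_unlock_irq(struct model_cpu *cpu, struct model_lock *l, bool saved)
{
    atomic_flag_clear_explicit(&l->held, memory_order_release);
    cpu->irqs_enabled = saved;
}

/* Non-blocking attempt, to show what a second CPU would observe.
 * Returns true (and takes the lock) only if it was free. */
bool model_trylock(struct model_lock *l)
{
    return !atomic_flag_test_and_set_explicit(&l->held, memory_order_acquire);
}
```

(Raising the interrupt level alone only covers the first half: it stops the local CPU from being interrupted, but does nothing about a second CPU running the same code at the same time.)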

<solid_black>seems to have booted (to fsck failure) with s/spl/locking/g

<solid_black>yes, fully booted once I erased fstab

<solid_black>but I wonder whether this breaks Linux drivers support

<solid_black>posted the patches, PTAL

<janneke>damo22: headsup: i tried your latest smp patch series (both), if that should also have addressed the loadavg issue then it isn't fixed for me

<janneke>fwiw, i've added a patch to guix to bump the overload-threshold for childhurds from 0.8 to 1.8 so that offloading may continue

<janneke>so there's no hurry and i very much appreciate your [smp] efforts!

<damo22>janneke: no, nothing to address the loadavg

<damo22>youpi: thanks for review, but i didnt understand how to fix the splhigh() call

<azert>damo22: I think that youpi wants you to fix the early gs access instead of removing it

<azert>that’s my interpretation

<youpi>yes

<youpi>one can use a gdt descriptor and gdt table per cpu

<youpi>otherwise, longterm we'll keep having to fight with gs: addressing popping up here and there

<damo22>ok

<damo22>but im not sure how to do that because with parallel smp init, the start vector has to be the same address for all cpus

<damo22>since there is only one IPI

<damo22>we need an array of gdts in the low memory?

<Pellescours>are the cpus able to get a different id based on get_cpuid() at this point in the code?

<Pellescours>even if it’s between 1 and 8

<Pellescours>it can help you to discriminate them and not have them collide on the same address
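
(A rough sketch of the "array of gdts indexed by cpu id" idea, assuming each CPU can already obtain a small distinct id at that point in the boot path, which is exactly the open question above. The names, the table size and the entry count are hypothetical; real early AP code would do this in assembly at a fixed low-memory address.)

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_NCPUS       8  /* hypothetical upper bound on CPUs */
#define MODEL_GDT_ENTRIES 8  /* hypothetical number of descriptors per table */

/* One private descriptor table per CPU (8-byte descriptors). */
struct model_gdt { uint64_t entries[MODEL_GDT_ENTRIES]; };

static struct model_gdt model_gdts[MODEL_NCPUS];

/* Every CPU starts at the same code address, but indexing by id gives
 * each one its own table, so no two CPUs touch the same memory.
 * Returns NULL if the id is out of range. */
struct model_gdt *model_gdt_for_cpu(unsigned cpu_id)
{
    if (cpu_id >= MODEL_NCPUS)
        return NULL;
    return &model_gdts[cpu_id];
}
```

(The start vector stays common to all CPUs; only the data each CPU loads differs, selected by its id.)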

2024-12-10.log