IRC channel logs

<damo22>Pellescours: im not sure what you mean by apic is not ready

<damo22>are you saying smp works with --disable-apic?

<damo22>no it doesn't

<Pellescours>i don't think it works without apic, but non-smp with apic doesn't either

<damo22>i was focusing on getting smp+apic working, then when its stable we can look why it fails on other modes

<Pellescours>because of the lost interrupts, i'm not able to get disk drivers working

<damo22>gnumach disk driver does not lose interrupts

<damo22>i havent tried with rump yet

<damo22>AHCI

<damo22>i havent tried IDE either

<Pellescours>even with linux i have irq timeout messages

<damo22>how are you invoking qemu?

<Pellescours>qemu-system-x86_64 -m 4096 -drive format=raw,cache=writeback,file=dev.img -nic user,hostfwd=tcp:127.0.0.1:2222-:22 -k bepo --enable-kvm -serial stdio -smp 6 -cpu host -usb

<damo22>isnt that 64 bit?

<damo22>qemu-system-i386 -M q35,accel=kvm -smp 1 -m 4096 -net user,hostfwd=tcp::8888-:22 -net nic -curses -hda /dev/sdd -chardev socket,id=net0,host=127.0.0.1,port=9999,ipv4=on,server=on,telnet=on -monitor chardev:net0 --no-reboot --no-shutdown

<Pellescours>but 64 bit is also 32 bit no?

<damo22>your machine has emulation for entire 64 bit machine, mine is strictly 32 bit

<Pellescours>same error message with qemu-system-i386

<damo22>why -cpu host

<Pellescours>for performance, even if I don’t think it will change anything because recent features are not really supported/used by the kernel

<Pellescours>which linux driver are you using in your hurd? IDE?

<damo22>ahci

<damo22>-M q35

<Pellescours>in your hurd it’s the drive hd0 I suppose

<damo22>i think it attaches to sd0

<damo22>AHCI SATA 00:1f.2 BAR 0xfebd5000 IRQ 10

<damo22>sd0: QEMU HARDDISK, 465GB w/256kB Cache

<Pellescours>for me it’s hd0, and when I add -M q35 it fails to find hd0

<damo22>that is ok

<damo22>it times out on cd rom

<damo22>you need to wait 30 seconds

<Pellescours>ah this explains the commit that reduced this timeout to 10 secs

<damo22>we need to fix that hd0 probe

<damo22>i dont know whats going wrong

<damo22>yes

<Pellescours>So in the end, if we don’t use -M q35, irqs are lost. But with q35 we can focus on smp before trying to fix the irq problem

<damo22>yes

<damo22>i dont know if the default machine even has an APIC

<damo22>i havent used it

<damo22>its so old, like ISA bus

<Pellescours>"info pic" in qemu consoles shows 24 entries under "ioapic"

<damo22>ok

<damo22>../configure CFLAGS="-O2 -g" --enable-kdb --enable-ncpus=8 --enable-apic

<Pellescours>in latest master the commit re-adding -O2 is here

<damo22>yes my branch is not rebased onto that

<Pellescours>I just tried to boot with -smp 6, 4 cpus were found (probably because I removed the -cpu host). And it’s sloooow. with smp 1 it’s normal.

<damo22>yes!

<damo22>its slow and the cpus are fighting

<damo22>but its not crashing

<Pellescours>yep

<damo22>i think they are not idling properly

<damo22>its even slow with smp 1

<damo22>it uses 100% cpu in idle with smp 1

<damo22>and barely runs

<damo22>sometimes drops to 25%

<Pellescours>and console timeouts make it unusable, only ssh is possible

<damo22>how do we fix this

<Pellescours>Idk, looking at htop, it says that procfs is taking a lot of cpu ~46%

<Pellescours>and /hurd/proc ~13%

<damo22>does the cpu usage add up to the load?

<Pellescours>I think so

<damo22>i was getting load 6 with about 100% usage using smp 4

<Pellescours>100% for one cpu == load of 1 If I understand correctly

<damo22>yes

<damo22>must be that the aps are spinning doing nothing

<Pellescours>updating their timers/clock?

<Pellescours>with normal kernel (no smp) proc and procfs usually take 0% of cpu

<Pellescours>around 0%

<damo22>what is machine_relax

<damo22>that seems to tell it to sit in tight loop

<damo22>maybe the cpus are sitting in machine_relax a fair bit

<damo22>every time i interrupt the cpu with kdb, its sitting in machine_idle

<damo22>hmmm it never seems to get out of the loop in kern/startup.c

<damo22>APs cant choose a thread

<Pellescours>are you sure they are running the slave_main function?

<damo22>yep

<Pellescours>When APs are set up, the kernel is still booting so other tasks are not yet created. So APs put themselves automatically in idle. But when more tasks are created, APs should start taking some. So imho the problem is in the scheduler

<damo22>not quite

<damo22>APs are stuck in the middle of cpu_launch_first_thread

<damo22>spinning in pause

<Pellescours>do you have the line where they are stuck?

<damo22>i modified my code, its in cpu_launch_first_thread

<damo22>where the loop is that APs spin inside

<damo22>it seems to hang when i put a printf

<Pellescours>and with gdb?

<damo22>Thread 2 (Thread 1.2 (CPU#1 [running])):

<damo22>#0 _kret_popl_ds () at ../i386/i386/locore.S:533

<damo22>#1 0xf000ff53 in ?? ()

<damo22>#2 0x00000000 in ?? ()

<Pellescours>It stays there indefinitely?

<damo22>sigquit

<damo22>machine total crash

<Pellescours>I just re-checked the paste you did yesterday: all APs were at thread_quantum_update (at ../kern/priority.c:152)

<damo22>ok

<Pellescours>And this line is a thread_lock call (macro that contains a while)

<damo22>i need to see why this is not printing that the APs are passing this point

<damo22> do {

<damo22> asm volatile ("pause" : : : "memory");

<damo22> } while (bspdone != 2);

<damo22>aha

<damo22>the two loops are happening simultaneously on BSP and AP

<damo22>so they are waiting for each other

<damo22>and i get a hang

<Pellescours>Oh

<damo22>how do i do an interprocessor lock?

<Pellescours>simple_lock?

<Pellescours>that’s what is set at multiple places, I think it works

<Pellescours>looking at the xnu code, they do a rendezvous (https://github.com/apple/darwin-xnu/blob/main/osfmk/i386/mp.c#L976) but in the end it’s a simple_lock

<damo22>i should make bspdone count up

<damo22>for all cpus, and then wait for the number to be ready
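
Note: the counting idea damo22 describes above can be sketched roughly as follows. This is only an illustration, not gnumach code. The names cpus_ready, cpu_checkin, all_cpus_ready, wait_for_all_cpus and cpu_relax are hypothetical, and C11 atomics are assumed to be available. Each CPU first announces itself and only then waits, so no CPU depends on another one leaving its own loop first, which is the mutual wait that caused the hang with the single bspdone flag.

```c
#include <stdatomic.h>

/* Hypothetical counter: number of CPUs that reached the rendezvous point.
 * This is an illustrative name, not a gnumach symbol. */
static atomic_int cpus_ready;

/* Spin-wait hint; falls back to a no-op on non-x86 builds. */
static inline void cpu_relax(void)
{
#if defined(__i386__) || defined(__x86_64__)
    __asm__ volatile ("pause" : : : "memory");
#endif
}

/* Called once by every CPU (BSP and each AP) BEFORE it starts waiting. */
static void cpu_checkin(void)
{
    atomic_fetch_add_explicit(&cpus_ready, 1, memory_order_release);
}

/* Non-blocking check: have at least ncpus_expected CPUs checked in? */
static int all_cpus_ready(int ncpus_expected)
{
    return atomic_load_explicit(&cpus_ready, memory_order_acquire)
           >= ncpus_expected;
}

/* Spin until every expected CPU has checked in. */
static void wait_for_all_cpus(int ncpus_expected)
{
    while (!all_cpus_ready(ncpus_expected))
        cpu_relax();
}
```

In this sketch each CPU would call cpu_checkin() and then wait_for_all_cpus(n), where n is the number of CPUs that were actually started. Whether this ordering fits the real boot path in kern/startup.c is an open question in the discussion.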

<damo22>i found a problem

<damo22>Pellescours: i just pushed to my branch, we need to figure out why APx DONE! is not being printed

<damo22>see last commit

<damo22>i dont understand why the code never reaches there

<damo22>im doing a clean build

<damo22>maybe lapic_enable_timer needs to run just before load_context()

<damo22>where it usually calls startrtclock()

<damo22>timer interrupts could be causing havoc too early?

<damo22>it crashes because ioapic_configure is running on APs

<damo22>wtf

<damo22>gdb output is not right

<damo22>i need APs to wait until BSP has chosen a thread?

<damo22>kernel_stack is zero and APs switch to a zero stack

<damo22>how do i initialise kernel_stack?

<damo22>they hit trap_from_kernel

<damo22>is there supposed to be only one kernel_stack?

<damo22>or one per cpu

<damo22>youpi1: when the TLB(cpu0 -> cpu1) shootdown happens, cpu1 ends up with ESP = 0

<damo22>this is before cpu1 has chosen a thread to run

<damo22>what order do the cpus need to launch threads?

<damo22>s/launch/choose

<damo22>im in a racy part of the code when all APs are eager to fire up

<damo22>BSP: Completed SMP init

<damo22>module 0: acpi ...

<damo22>...

<damo22>5 multiboot modules

<damo22>TLB(0>1)

<damo22>crash

<damo22>so now BSP is launching, and sending shootdowns to all APs in a row, but interrupts are off on the APs because they havent chosen a thread to run yet and it deadlocks

<damo22>and if i enable interrupts on APs at that point, kernel_stack = 0 and APs crash

<damo22>possibly also because there is no thread yet

<damo22>youpi1: ive got it to a point where it doesnt crash anymore, and all APs are enumerated and started, now everything is sitting in machine_idle

<damo22>something is fishy with TLB shootdowns, it seems to fire them correctly, but then doesnt continue

<damo22>i am able to enter kdb

<damo22>f72eb9c2 (HEAD -> feat-smp2-hangs

<damo22>i have no clue how to continue

<damo22>it works with -smp 6 even

<damo22>with smp 6 it sits idle just before starting acpi and has 32% usage in the host

<damo22>it seems one of the cores is sitting in HLT with interrupts disabled

<damo22>EIP=c1028491 EFL=00000046 [---Z-P-] CPL=0 II=0 A20=1 SMM=0 HLT=1

<damo22>does iret instruction enable interrupts?

<damo22>Pellescours: my branch no longer crashes, it just hangs at boot with low cpu usage and you can enter kdb with any number of cores running

<damo22>in that huge comment in pmap.c regarding the TLB shootdown code, it seems there is a requirement for 2 different spl levels

<damo22>maybe we need to set one of them to spl0

<damo22>i think they are both the same currently

<youpi>iret enables interrupts if the eflags on the stack has the interrupts enabled

<youpi>normally APs would stay in machine_idle with interrupts enabled

<youpi>so that the scheduler can check from time to time whether there's a thread to run

<youpi>or the BSP sends an IPI to trigger a schedule
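
Note: the QEMU register dump quoted earlier (EFL=00000046, HLT=1) can be checked against youpi's answer with a small helper. On x86 the interrupt enable flag (IF) is bit 9 of EFLAGS, and iret restores EFLAGS from the value saved on the stack. The helper name eflags_interrupts_enabled is made up for this illustration.

```c
#include <stdint.h>

/* x86 EFLAGS bit 9 is IF, the interrupt enable flag. */
#define EFLAGS_IF (1u << 9)

/* Hypothetical helper: returns nonzero if the given EFLAGS value has
 * maskable interrupts enabled. */
static int eflags_interrupts_enabled(uint32_t eflags)
{
    return (eflags & EFLAGS_IF) != 0;
}
```

For the value in the dump, 0x00000046 has only bit 1 (always set), PF and ZF set, which matches the [---Z-P-] annotation, so the helper reports interrupts as disabled. A core halted in that state would not be woken by an ordinary maskable interrupt such as the timer or an IPI.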

<softwar>is there any good web browser that's graphical for debian hurd besides firefox?

<damo22>youpi: it seems that the cpu that was targeted with the TLB shootdown ended up in machine_idle with interrupts off

<damo22>and the thread it was running died

<damo22>s/thread/task

<damo22>hence the boot process stopped

<damo22>can i call "sti" just before running pmap_update_interrupt?

<damo22>or just after

<youpi>splx is supposed to be doing that

<damo22>or perhaps call sti right before the machine_idle loop?

<youpi>again, there is no need for that

<youpi>see how it's done for BSP

<damo22>ok

<damo22>but its the same interrupt code, how can it be only the TLB interrupt causing problems?

<damo22>splx_cli not being called?

<youpi>does the timer interrupt actually work?

<damo22>yes i think so

<youpi>"thinking" is not enough when dealing with bugs

<damo22>i remember printing in the timer interrupt and it printed for all cores

<damo22>i havent tested it again on my latest branch

<damo22>will do

<damo22>yes timer works on APs

<damo22>there is lapic timer per cpu now

<damo22>in between calling the pmap_update_interrupt and the lapic_eoi, do i need to call splx_cli and cli?

2023-01-28.log