IRC channel logs

<almuhs>in the previous implementation, in this case gnumach only detected the cpus up to NCPUS, ignoring the others. Maybe there is an unnecessary assert()?

<almuhs>i'm going to sleep. With some -smp values i needed many attempts to boot, because rumpdisk freezes

<damo22>sneek: later tell almuhs i changed the init and startup sequence to send to all except self to wake up all cpus, because one by one could not address all lapics, so if you compile gnumach with enable-ncpus < cpus in your system, it will crash because some of the cpus you didnt enumerate will still wake up

<sneek>Got it.

<damo22>youpi: when usb fails and hogs irq11 i get this backtrace from outside:

<damo22>(gdb) bt

<damo22>#0 ioapic_write (id=0 '\000', reg=38 '&', value=81979) at ../i386/i386at/ioapic.c:166

<damo22>#1 ioapic_write_entry (apic=0, pin=11, e=...) at ../i386/i386at/ioapic.c:188

<damo22>#2 ioapic_irq_eoi (pin=11) at ../i386/i386at/ioapic.c:328

<damo22>#3 0xc104bc40 in interrupt () at ../i386/i386at/interrupt.S:102

<damo22>irq 11 is unmasked at this point and seems like there is irq 11 storm

<damo22> pin 11 0x000000000000c03b dest=0 vec=59 active-hi level fixed physical

<damo22>IRR 48 49 59(level)
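
[Editor's sketch, not part of the conversation] The numbers pasted above can be decoded by hand. The following is a minimal sketch, assuming the standard 82093AA I/O APIC layout for the low 32 bits of a redirection entry; the struct and function names are hypothetical and are not taken from gnumach's ioapic.c.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Low dword of an I/O APIC redirection entry (82093AA layout). */
struct rte_low {
    unsigned vector;     /* bits 7:0 */
    unsigned delivery;   /* bits 10:8, 0 = fixed */
    unsigned logical;    /* bit 11, 0 = physical destination */
    unsigned active_low; /* bit 13, 0 = active-high */
    unsigned remote_irr; /* bit 14, read-only, meaningful in level mode */
    unsigned level;      /* bit 15, 0 = edge, 1 = level */
    unsigned masked;     /* bit 16, 1 = masked */
};

static struct rte_low rte_decode(uint32_t lo)
{
    struct rte_low e;
    e.vector     = lo & 0xffu;
    e.delivery   = (lo >> 8) & 0x7u;
    e.logical    = (lo >> 11) & 1u;
    e.active_low = (lo >> 13) & 1u;
    e.remote_irr = (lo >> 14) & 1u;
    e.level      = (lo >> 15) & 1u;
    e.masked     = (lo >> 16) & 1u;
    return e;
}

/* Register index of the low dword for a given pin: 0x10 + 2 * pin. */
static unsigned rte_low_reg(unsigned pin)
{
    assert(pin < 24); /* 24 pins on the 82093AA; other parts may differ */
    return 0x10 + 2 * pin;
}

/* Print one entry in roughly the style of the dump above. */
static void rte_print(uint32_t lo)
{
    struct rte_low e = rte_decode(lo);
    printf("0x%08x: vec=%u %s %s %s %s%s%s\n",
           (unsigned)lo, e.vector,
           e.active_low ? "active-lo" : "active-hi",
           e.level ? "level" : "edge",
           e.logical ? "logical" : "physical",
           e.delivery == 0 ? "fixed" : "non-fixed",
           e.remote_irr ? " remote-irr" : "",
           e.masked ? " masked" : "");
}
```

Under that layout, 0xc03b decodes to vector 59, level-triggered, active-high, unmasked, with the remote IRR bit set, which matches the pin 11 dump. The value 81979 in frame #0 is 0x1403b: the same vector with the mask bit set and the trigger mode switched to edge, and reg=38 is 0x10 + 2 * 11, the low dword for pin 11. So the backtrace appears to have caught the "mask and set to edge" write inside ioapic_irq_eoi().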

<damo22>it seems the protection for the masking is not enough

<damo22>im on UP

<damo22>ok so on UP, inside ioapic_irq_eoi() after masking and setting to edge, as soon as the level entry is restored, if the interrupt line is still high it immediately triggers a new interrupt before lapic_eoi has a chance to execute

<damo22>so i think the lapic_eoi needs to be done first

<youpi>damo22: apic doesn't seem to be working when the linux groups are enabled?

<damo22>oh

<damo22>ive been bypassing linux drivers for ages

<damo22>i think the lapic_eoi indeed needs to be moved before the ioapic eoi

<damo22>because for a level triggered interrupt, after the trigger mode is altered and masked, as soon as the entry is restored it can retrigger before lapic_eoi has a chance to execute

<damo22>then the interrupt gets stuck on

<youpi>I don't know the details, but I would have understood the converse: if you lapic_eoi, you tell that you're ready to get another interrupt, and there you go you have it, so calling it earlier will just let ioapic raise it immediately?

<damo22>well maybe, but it should at least be protected with ioapic_lock?

<youpi>what for?

<damo22>actually no thats useless

<youpi>ioapic_lock is only about protecting the hardware access

<youpi>to avoid several cpus mangling it at the same time

<damo22>right

<damo22>i stepped through with gdb

<damo22>during the problem

<damo22>i was getting 3M interrupts on irq 11

<damo22>as soon as it unmasked, it raised again

<damo22>not unmasked, but restored the original level mode

<damo22>because the line was still high

<damo22>so as soon as the eflag was restored it raised

<damo22>ioapic_lock is also an eflag protector because its an irq lock

<youpi>but inside the interrupt handler we're supposed to have IF cleared, aren't we?

<youpi>(that just pushes the issue later on iret, though)

<damo22>im trying to reproduce

<damo22>i get two __disable_irq calls per ioapic_irq_eoi calls

<damo22>EFL=00000086

<damo22>then got an interrupt

<damo22>??

<youpi>if you are single-stepping in qemu+gdb, I wouldn't be surprised that they get interrupt masking wrong

<damo22>ok

<youpi>apic makes rumpusbdisk way faster, 500KB/s instead of 100KB/s

<youpi>(that's still quite slow, though)

<damo22>i get 12MB/s read with usb

<damo22>dd if=/dev/ud0 of=/dev/null bs=1M count=100 takes about 10 seconds

<youpi>on real hardware or qemu?

<damo22>qemu

<youpi>well, I get 500KB/s only :)

<damo22>xhci?

<youpi>-usb -device usb-storage,drive=stick

<damo22> -drive if=none,id=usbstick,format=raw,file=testzero.dd \

<damo22> -device qemu-xhci \

<damo22> -device usb-storage,drive=usbstick \

<youpi>uhci apparently

<damo22>oh mine is emulated from a file

<youpi>I'm from a file too

<damo22>i thought -usb passes the hc

<youpi>with xhci I'm getting 18MB/s indeed

<youpi>no, it just emulates a usb bus

<damo22>okay

<youpi>with linux I get 1MB/s with the same uhci setup

<youpi>so it's not that bad comparatively, actually

<youpi>true tests should be done on actual hardware :)

<damo22>yeah

<youpi>ah, xhci performance actually *lowers* with apic, here, from 25MB/s to 18MB/s

<damo22>i think apic is inferior to pic

<damo22>but nobody uses it anymore

<damo22>there is something still fishy with apic eoi

<damo22>also qemu apic is 0x11

<damo22>its a rare old one

<damo22>most apics are version 0x20

<damo22>so we are hitting the code path with !has_irq_specific_eoi on qemu

<damo22>we are calling ioapic_irq_eoi before calling the handler, and interrupts are not disabled

<youpi>don't we mask the interrupt on the ioapic before eoi?

<damo22>so i think its possible a level triggered line that stays high can keep triggering interrupts before any of them are handled

<damo22>i cant find cli

<youpi>it's done by the processor on interrupt entry

<youpi>otherwise you wouldn't be able to cope with a device overflowing with irqs

<damo22>ok

<damo22>if i run two drivers on the same device, which i probably shouldnt, the irq masking gets out of sync

<youpi>that's not supposed to happen: the irq code waits for ack from both drivers before unmasking

<damo22>yeah, somehow the counter skips past 1 on disable so never gets to mask it

<damo22>or it gets unmasked and never masked again

<damo22>can we make disable mask unconditionally?

<damo22>why do we need to be frugal with the masking

<damo22>like only doing it the first time

<youpi>because we can?

<youpi>I mean, duct-tape coding is no good

<youpi>it just hides bugs

<damo22>ok then its broken

<youpi>if there is something wrong, better see it wrong

<youpi>and fix what is wrong

<youpi>rather than duct-tape it

<damo22>i saw 3M interrupt nesting levels

<damo22>because it didnt mask

<damo22>you said that irq_lock is protecting ndisabled, but what protects the act of masking?

<youpi>the decision of masking is the same

<youpi>the act of masking is the ioapic lock
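
[Editor's sketch, not part of the conversation] The nested disable/enable scheme discussed above, where the line is masked only on the first disable and unmasked only once every driver has acknowledged, can be illustrated as follows. This is a minimal sketch under stated assumptions: the names ndisabled, hw_masked, irq_disable and irq_enable are hypothetical, locking is reduced to comments, and a boolean array stands in for the real I/O APIC mask bit. It is not gnumach's implementation.

```c
#include <assert.h>
#include <stdbool.h>

#define NPINS 24

static int  ndisabled[NPINS]; /* hypothetical per-pin nesting count */
static bool hw_masked[NPINS]; /* stand-in for the I/O APIC mask bit */

/* Each caller (e.g. each driver sharing the line) disables once. */
static void irq_disable(unsigned pin)
{
    assert(pin < NPINS);
    /* real code would take the irq lock here to protect the counter */
    ndisabled[pin]++;
    if (ndisabled[pin] == 1)
        hw_masked[pin] = true; /* mask only on the 0 -> 1 transition */
    /* ...and release the irq lock here */
}

/* Each caller acknowledges once; the last ack unmasks the line. */
static void irq_enable(unsigned pin)
{
    assert(pin < NPINS);
    assert(ndisabled[pin] > 0); /* an unbalanced enable is a bug */
    ndisabled[pin]--;
    if (ndisabled[pin] == 0)
        hw_masked[pin] = false; /* unmask only on the 1 -> 0 transition */
}
```

With two drivers on pin 11, two irq_disable(11) calls followed by one irq_enable(11) leave the line masked, and only the second irq_enable(11) unmasks it. If the counter and the mask state ever disagree, as described above, the assertion style here would surface the mismatch instead of hiding it behind an unconditional mask.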

<damo22>ok

<damo22>goodnight

<almuhs>damo22: i've just removed the -M q35 flag in my qemu script, and now i found a network problem when i boot with fewer than 8 cpus. Maybe it is the problem that you referred to

<sneek>Welcome back almuhs, you have 1 message!

<sneek>almuhs, damo22 says: i changed the init and startup sequence to send to all except self to wake up all cpus, because one by one could not address all lapics, so if you compile gnumach with enable-ncpus < cpus in your system, it will crash because some of the cpus you didnt enumerate will still wake up

<almuhs> https://pasteboard.co/5GA7p99v1Fyi.png

<almuhs>this is the error. But i'm not sure if the problem is with netdde or with rumpnet

<almuhs>this lock shows in -smp 2,3,4 and 6

<almuhs>Pending to test in a real machine

<almuhs>my compilation options are these: ../configure --host=i686-gnu CC='gcc -m32' LD='ld -melf_i386' --enable-apic --enable-kdb --enable-ncpus=$NUM_CPUS --disable-linux-groups

<almuhs>compiling with ncpus=16, and booting with -smp 10, the system boots mostly fine (after two attempts)


2025-07-13.log