IRC channel logs

2023-01-26.log

<damo22>Pellescours: feat-smp2-works is a branch that has AP with no periodic interrupts, so likely its sitting in a loop, thats why it mostly works
<damo22>im trying to fix feat-smp2-fault with apic and smp
<damo22>this one has a timer enabled on all cores
<damo22>-smp 1 or -smp 2 are the only things i am testing because more than 2 is more difficult to fix the AP bringup
<damo22>yes when i compile gnumach for smp i do --enable-kdb --enable-ncpus=8 --enable-apic
<damo22>the problem seems to be now not in the intstack but on a different stack, because its general protection faulting with the stack pointer not on an intstack
<Pellescours>with smp2faults apic and 2 cpus i have a protection exception when starting acpi task
<Pellescours>in all_intrs
<damo22>yes that is the fault
<Pellescours>it says it fails at movl ...,%ebp
<damo22>general protection faults are difficult to track down because there are many reasons why one can occur
<Pellescours>is this instruction the one in CPU_NUMBER? (movl lapic,%ebp)?
<damo22>not sure, i am going to try disabling interrupts during an interrupt
<damo22>hm that didnt fix it
<damo22>start acpi: acpi Kernel General protection trap, eip 0xc103160c
<damo22>kernel: General protection (13), code=0
<damo22>Stopped at all_intrs+0x8: movl 0xc11e070c,%ebx
<damo22>that is inside CPU_NUMBER i think yes
<youpi1>is that address mapped?
<damo22>its the "lapic" pointer
<youpi1>ok but is it mapped?
<damo22>how do i tell?
<youpi1>you can at least use x in kdb to read it
<youpi1>x 0xc11e070c
<damo22>db{0}> x 0xc11e070c
<damo22> f9693000
<damo22>not sure how that instruction failed, with no interrupts enabled
<youpi1>that being said, I wouldn't be surprised if the faulty instruction was the instruction just before that
<damo22>ok will objdump in a sec
<damo22>c103160b: 50 push %eax
<damo22>c103160c: 8b 1d 0c 07 1e c1 mov 0xc11e070c,%ebx
<damo22>the only way that can fail is if esp = 0
<damo22>esp 0xf590363c
<damo22>efl 0x10086
<damo22>ebp 0x800000
<damo22>db{0}> x 0x800000
<damo22>Kernel Page fault trap, eip 0xc1061dc5
<youpi1>is the memory location of esp writable?
<youpi1>you can use: w 0xf5903638 0
<damo22>db{0}> w 0xf5903638 0
<damo22>0xf5903638 0x10086 = 0
<damo22>db{0}> x 0xf5903638
<damo22> 0
<damo22>yes
<damo22>unless kdb moved the esp
<damo22>i can try with gdb
<damo22>no bootstrap code loaded with the kernel
<damo22>Thread 2 (Thread 1.2 (CPU#1 [halted ])):
<damo22>#0 0xc10283b1 in machine_idle (cpu=1) at ../i386/i386at/model_dep.c:236
<damo22>#1 0xc100c019 in idle_thread_continue () at ../kern/sched_prim.c:1657
<damo22>#2 0x00000000 in ?? ()
<damo22>Thread 1 (Thread 1.1 (CPU#0 [running])):
<damo22>#0 kdcnmaygetc () at ../i386/i386at/kd.c:2999
<damo22>is there anything useful i can probe at this point?
<damo22>it seems to only fault when you give it something to run
<Pellescours>like if it does not have the right to read the program
<Pellescours>?
<damo22>i mean if you run the kernel in gdb by itself, it does not fault
<damo22>it sets up AP and is fine, but complains it has nothing to run
<damo22>i should put infinite loop just before the kernel tries to run something and check that the timer interrupts are being serviced by all cores?
<damo22>should hardclock ignore clock ticks coming from APs?
<youpi1>yes
<youpi1>only one CPU should advance the time
<damo22>i forgot to do this
<youpi1>but stats for processes etc. need to be updated
<youpi1>(notably counting what process the tick accounts for)
<youpi1>clock_interrupt already advances the time only on the master
<youpi1>so you shouldn't need to special-case more than what is already there
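The rule youpi1 describes (every CPU accounts for its own ticks, but only the master advances the global time) can be sketched in a few lines of C. This is a minimal illustration, not gnumach's actual clock_interrupt or hardclock: the names `tick`, `global_ticks`, `per_cpu_ticks`, `NCPUS` and `MASTER_CPU` are all hypothetical.

```c
#include <assert.h>

#define NCPUS      8   /* assumed value, mirrors --enable-ncpus=8 */
#define MASTER_CPU 0   /* assumed: the boot processor is the master */

/* Hypothetical global time counter, advanced by one CPU only. */
static unsigned long global_ticks;

/* Hypothetical per-CPU counters used for usage accounting. */
static unsigned long per_cpu_ticks[NCPUS];

/* Called from the timer interrupt on the CPU identified by `cpu`. */
static void tick(int cpu)
{
    assert(cpu >= 0 && cpu < NCPUS);

    /* Every CPU records the tick against whatever it was running. */
    per_cpu_ticks[cpu]++;

    /* Only the master moves the wall clock forward. */
    if (cpu == MASTER_CPU)
        global_ticks++;
}
```

With a timer enabled on all cores, calling `tick` from each core's interrupt keeps the global time from running N times too fast while still charging each tick to the right CPU.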
<damo22>hmm
<damo22>it seems all timer interrupts are being serviced by cpu0
<damo22>oh i had it in a loop sorry, thats not right
<damo22>maybe linux_timer_intr() needs to ignore cpu_number != 0
<damo22>and not update jiffies
<damo22>that fixed the prot fault!
<damo22>Sending IPI(0) to call TLB shootdown...done
<damo22>hung there
<damo22>timers are working though
<damo22>weird, it still faults on -smp1
<damo22>youpi1: i noticed the 30 second timeout lasted only a split second, could it be the timer is too short?
<damo22>causing a general fault with stack overflow of too many timer interrupts
<damo22>task loaded: exec /hurd/exec
<damo22>Sending IPI(0) to call TLB shootdown...done
<damo22>start acpi:
<damo22>both cpus in HLT with interrupts enabled
<damo22>pmap_update_interrupt on cpu1Sending IPI(0) to call TLB shootdown...done
<damo22>hmm cpu1 sent an IPI to cpu0 but cpu1 serviced it
<damo22>start acpi: Sending IPI(0 -> 1) to call TLB shootdown...done
<damo22>pmap_update_interrupt on cpu1
<damo22>acpi Kernel General protection trap, eip 0xc10315db
<damo22>kernel: General protection (13), code=0
<damo22>Stopped at all_intrs+0x7: movl 0xc11e070c,%ebx
<damo22>no memory is assigned to address 00040004
<damo22>looks like interrupt 251 works
<damo22>but there is still a general fault
<damo22>looks like ebp = 0
<damo22>is that a problem?
<damo22>maybe its CPU_NUMBER
<damo22>i am trying to write a compact CPU_NUMBER that reads the kernel id
<damo22>its messy
<Pellescours>I’m triggering the general protection trap with smp enabled and 1 cpu
<Pellescours>damo22: How does "movl APIC_ID[%ebx], %eax" work (http://git.zammit.org/gnumach-sv.git/tree/i386/i386/cpu_number.h?h=feat-smp2-faults#n48)? I mean APIC_ID is defined as "offset ApicLocalUnit lu apic_id APIC_ID". Are you sure it takes the right address?
<Pellescours>damo22: I tried something: I replaced the CPU_NUMBER macro to globally set the ebx register to 0, and booted with 1 cpu. It works without a page fault. It really seems to be the "movl lapic, %ebx" that causes the protection fault
<Pellescours>it’s definitely this instruction that triggers the protection fault
<Pellescours>damo22: can it be because the 1st CPU_NUMBER is called before the switch to kernel segments, so some kernel variables are not accessible yet?
<Pellescours>I think that’s the cause, acd3fa8f8ba9c093c426f83488b338088035f117 introduced a CPU_NUMBER call before the stack switch
<damo22>Pellescours: genius
<damo22>so how do we make an ASM macro to read cpu number?
<damo22>maybe we can use the hardcoded address of the lapic?
<damo22>or maybe we can make an early stack switch function that only gets used when cpu number will fail
<damo22>like before the switch to kernel segments it can use a hardcoded cpu number
<damo22>where can i store a flag that can be read even when not on kernel segs?
<damo22>i only need one bit
<damo22>does eflags have a user space?
<damo22>i need to mark when the cpu bringup is done
<youpi>isn't it possible to just disable interrupts until the bringup is done?
<damo22>maybe
<damo22>but the problem is cpu_number cannot read lapic
<damo22>because the first CPU_NUMBER is called before switch to kernel segs
<damo22>Pellescours found it
<youpi>before?
<youpi>which one?
<youpi>the one in all_intrs is after
<youpi>ah, the additional before the int_from_stack check
<damo22>(20:00:23) Pellescours: I think that’s the cause, acd3fa8f8ba9c093c426f83488b338088035f117 introduced a CPU_NUMBER call before the stack switch
<youpi>possibly setting the registers could be moved before that
<youpi>+segment
<damo22>ok!
<youpi>Mmm, actually, can't one just use the cs segment ?
<youpi>it doesn't allow writing, but it should be fine for reading
<damo22>i dont understand that part
<youpi>which part?
<youpi>just use cs:
<youpi>that'll use the cs segment
<youpi>which is already set by the interrupt mechanism
<damo22>i mean, use it where?
<youpi>when reding
<youpi>+
<youpi>a
<damo22>cs:lapic ?
<youpi>yes
<damo22>hmm ok
<youpi>or whatever variable you want to read
<damo22>../i386/i386/cswitch.S:42: Error: junk `:lapic' after expression
<youpi>how did you write it?
<youpi>see e.g. i386/i386/debug_trace.S: movl %ss:EXT(debug_trace_pos),%eax
<damo22>cs:lapic
<damo22>oh
<damo22>ok
<youpi>you need %
<damo22>ok its compiling
<youpi>actually you could also just use ss: like below the cmpl instruction
<youpi>but cs: inside CPU_NUMBER should be safest
<youpi>that macro saves esi, edi, etc. that's useless
<youpi>it could just only use reg
<youpi>for all operations
<damo22>err, how do you iterate through a list of kernel ids and choose the one that matches apic id with just one reg?
<damo22>im writing that function now in asm
<youpi>? I don't see CPU_NUMBER doing that
<damo22>yes
<damo22>its WIP
<youpi>if something like that is needed, I'd say just prepare a table
<youpi>that already gives you the result directly
<youpi>spending 256 byte on that is cheap
<damo22>i agree that would be nice
<damo22>for now i will just assume the apic id == the kernel id
<damo22>to make this test
<damo22>all this work and its 0 or 1
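The table youpi suggests would replace the search through kernel ids with a single indexed load: 256 one-byte entries, indexed by the 8-bit APIC ID, each holding the kernel CPU number. The following C sketch shows the idea under stated assumptions. The names `apic_id_to_cpu`, `cpu_table_init`, `cpu_table_register`, `cpu_number_from_apic_id` and the `CPU_UNKNOWN` marker are hypothetical and do not come from gnumach.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define MAX_APIC_IDS 256   /* an 8-bit APIC ID has 256 possible values */
#define CPU_UNKNOWN  0xff  /* hypothetical marker: no CPU has this APIC ID */

/* Hypothetical table: index is the APIC ID, value is the kernel CPU number. */
static uint8_t apic_id_to_cpu[MAX_APIC_IDS];

/* Mark every APIC ID as unassigned before CPUs are enumerated. */
static void cpu_table_init(void)
{
    memset(apic_id_to_cpu, CPU_UNKNOWN, sizeof apic_id_to_cpu);
}

/* Record the kernel id chosen for a CPU during bringup. */
static void cpu_table_register(uint8_t apic_id, uint8_t kernel_id)
{
    assert(kernel_id != CPU_UNKNOWN);
    apic_id_to_cpu[apic_id] = kernel_id;
}

/* One indexed load; needs only a single register in assembly. */
static uint8_t cpu_number_from_apic_id(uint8_t apic_id)
{
    return apic_id_to_cpu[apic_id];
}
```

In an assembly CPU_NUMBER this would amount to reading the APIC ID into a register and using it as an index into the table, so no extra registers have to be saved.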
<damo22>Stopped at all_intrs+0xe: movl 0x20(%ebx),%eax
<damo22>do i need cs here too?: movl %cs:APIC_ID(%ebx), %eax
<youpi>sure
<youpi>it's the same problem
<damo22>i see
<damo22>...
<damo22>Sending IPI(1 -> 0) to call TLB shootd
<damo22>Sending IPI(1 -> 0)
<damo22>rumpdisk pmap_update_interrupt on cpu1
<damo22>pmap_update_interrupt on cpu1
<damo22>its stuck in idle_thread
<damo22>idle_thread_continue*
<damo22>no more general protection faults
<damo22>it seems like the IPIs are being delivered to the wrong cpu
<damo22>or there is some kind of cpu mapping problem
<damo22>its late, i need to sleep now
<damo22>-smp 1 boots with that %cs thing
<softwar>besides arch and debian are there any other hurd distros?
<youpi1>softwar: guix
<softwar>no
<youpi1>? why no ?
<youpi1>GNU Guix/Hurd is a thing
<softwar>I can't find it. Something with my searches not working. I am using debian
<youpi1> https://guix.gnu.org/en/download/latest/
<youpi1>there's a big hurd icon in the middle
<softwar>I get a 502 bad gateway
<softwar>nginx
<youpi1>well, I don't know, ask a guix channel :)
<softwar>yeah debian hurd is good enough. managed to get something installed, thanks anyway