IRC channel logs

<damo22>Pellescours: feat-smp2-works is a branch that has the AP with no periodic interrupts, so likely it's sitting in a loop, that's why it mostly works

<damo22>im trying to fix feat-smp2-fault with apic and smp

<damo22>this one has a timer enabled on all cores

<damo22>-smp 1 or -smp 2 are the only things i am testing, because with more than 2 cpus the AP bringup is more difficult to fix

<damo22>yes when i compile gnumach for smp i do --enable-kdb --enable-ncpus=8 --enable-apic

<damo22>the problem now seems to be not in the intstack but on a different stack, because it's general protection faulting with the stack pointer not on an intstack

<Pellescours>with smp2faults apic and 2 cpus i have a protection exception when starting acpi task

<Pellescours>in all_intrs

<damo22>yes that is the fault

<Pellescours>it says it fails at movl ...,%ebp

<damo22>general protection faults are difficult to track down because there are many reasons why one can occur

<Pellescours>is this instruction the one in CPU_NUMBER? (movl lapic,%ebp)

<damo22>not sure, i am going to try disabling interrupts during an interrupt

<damo22>hm that didnt fix it

<damo22>start acpi: acpi Kernel General protection trap, eip 0xc103160c

<damo22>kernel: General protection (13), code=0

<damo22>Stopped at all_intrs+0x8: movl 0xc11e070c,%ebx

<damo22>that is inside CPU_NUMBER i think yes

<youpi1>is that address mapped?

<damo22>its the "lapic" pointer

<youpi1>ok but is it mapped?

<damo22>how do i tell?

<youpi1>you can at least use x in kdb to read it

<youpi1>x 0xc11e070c

<damo22>db{0}> x 0xc11e070c

<damo22> f9693000

<damo22>not sure how that instruction failed, with no interrupts enabled

<youpi1>that being said, I wouldn't be surprised if the faulty instruction was the instruction just before that

<damo22>ok will objdump in a sec

<damo22>c103160b: 50 push %eax

<damo22>c103160c: 8b 1d 0c 07 1e c1 mov 0xc11e070c,%ebx

<damo22>the only way that can fail is if esp = 0

<damo22>esp 0xf590363c

<damo22>efl 0x10086

<damo22>ebp 0x800000

<damo22>db{0}> x 0x800000

<damo22>Kernel Page fault trap, eip 0xc1061dc5
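The register dump above already answers whether interrupts were enabled at the fault. As a rough sketch (the function name is hypothetical, not a gnumach helper), the efl value can be checked against the architectural EFLAGS bits: in 0x10086 the set bits are RF (bit 16), SF (bit 7), PF (bit 2) and the always-one reserved bit 1, while IF (bit 9) is clear.

```c
#include <stdint.h>

/* Architectural x86 EFLAGS bit masks. */
#define EFL_PF 0x00000004u /* parity flag */
#define EFL_SF 0x00000080u /* sign flag */
#define EFL_IF 0x00000200u /* interrupt enable flag */
#define EFL_RF 0x00010000u /* resume flag */

/* Hypothetical helper: returns 1 if maskable interrupts were
 * enabled in the saved EFLAGS value, 0 otherwise. */
static int efl_interrupts_enabled(uint32_t efl)
{
    return (efl & EFL_IF) != 0;
}
```

Calling `efl_interrupts_enabled(0x10086u)` returns 0, which matches the observation that the fault happened with interrupts disabled.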

<youpi1>is the memory location of esp writable?

<youpi1>you can use: w 0xf5903638 0

<damo22>db{0}> w 0xf5903638 0

<damo22>0xf5903638 0x10086 = 0

<damo22>db{0}> x 0xf5903638

<damo22> 0

<damo22>yes

<damo22>unless kdb moved the esp

<damo22>i can try with gdb

<damo22>no bootstrap code loaded with the kernel

<damo22>Thread 2 (Thread 1.2 (CPU#1 [halted ])):

<damo22>#0 0xc10283b1 in machine_idle (cpu=1) at ../i386/i386at/model_dep.c:236

<damo22>#1 0xc100c019 in idle_thread_continue () at ../kern/sched_prim.c:1657

<damo22>#2 0x00000000 in ?? ()

<damo22>Thread 1 (Thread 1.1 (CPU#0 [running])):

<damo22>#0 kdcnmaygetc () at ../i386/i386at/kd.c:2999

<damo22>is there anything useful i can probe at this point?

<damo22>it seems to only fault when you give it something to run

<Pellescours>like if it does not have the right to read the program

<Pellescours>?

<damo22>i mean if you run the kernel in gdb by itself, it does not fault

<damo22>it sets up AP and is fine, but complains it has nothing to run

<damo22>i should put infinite loop just before the kernel tries to run something and check that the timer interrupts are being serviced by all cores?

<damo22>should hardclock ignore clock ticks coming from APs?

<youpi1>yes

<youpi1>only one CPU should advance the time

<damo22>i forgot to do this

<youpi1>but stats for processes etc. need to be updated

<youpi1>(notably counting what process the tick accounts for)

<youpi1>clock_interrupt already advances the time only on the master

<youpi1>so you shouldn't need to special-case more than what is already there
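The rule youpi1 describes (every CPU accounts for its own tick, but only one CPU advances the global time) can be sketched as below. This is only an illustration, not the actual clock_interrupt code: `tick`, `MASTER_CPU`, `system_ticks` and `cpu_ticks` are made-up names, and NCPUS is taken from the --enable-ncpus=8 mentioned earlier.

```c
#include <assert.h>
#include <stdint.h>

#define NCPUS      8 /* as in --enable-ncpus=8 */
#define MASTER_CPU 0 /* assumption: cpu 0 is the master */

static uint64_t system_ticks;     /* global time, master only */
static uint64_t cpu_ticks[NCPUS]; /* per-CPU accounting */

/* Hypothetical per-tick handler, called on whichever CPU
 * took the timer interrupt. */
static void tick(int cpu)
{
    assert(cpu >= 0 && cpu < NCPUS);

    /* Every CPU charges the tick to whatever it was running. */
    cpu_ticks[cpu]++;

    /* Only the master advances the global time, otherwise time
     * would run N times too fast with N CPUs ticking. */
    if (cpu == MASTER_CPU)
        system_ticks++;
}
```

With one tick on cpu 0 and two on cpu 1, `system_ticks` ends at 1 while `cpu_ticks[1]` ends at 2.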

<damo22>hmm

<damo22>it seems all timer interrupts are being serviced by cpu0

<damo22>oh i had it in a loop sorry, thats not right

<damo22>maybe linux_timer_intr() needs to ignore cpu_number != 0

<damo22>and not update jiffies

<damo22>that fixed the prot fault!

<damo22>Sending IPI(0) to call TLB shootdown...done

<damo22>hung there

<damo22>timers are working though

<damo22>weird, it still faults on -smp 1

<damo22>youpi1: i noticed the 30 second timeout lasted only a split second, could it be the timer is too short?

<damo22>causing a general fault with stack overflow of too many timer interrupts

<damo22>task loaded: exec /hurd/exec

<damo22>Sending IPI(0) to call TLB shootdown...done

<damo22>start acpi:

<damo22>both cpus in HLT with interrupts enabled

<damo22>pmap_update_interrupt on cpu1Sending IPI(0) to call TLB shootdown...done

<damo22>hmm cpu1 sent an IPI to cpu0 but cpu1 serviced it

<damo22>start acpi: Sending IPI(0 -> 1) to call TLB shootdown...done

<damo22>pmap_update_interrupt on cpu1

<damo22>acpi Kernel General protection trap, eip 0xc10315db

<damo22>kernel: General protection (13), code=0

<damo22>Stopped at all_intrs+0x7: movl 0xc11e070c,%ebx

<damo22>no memory is assigned to address 00040004

<damo22>looks like interrupt 251 works

<damo22>but there is still a general fault

<damo22>looks like ebp = 0

<damo22>is that a problem?

<damo22>maybe its CPU_NUMBER

<damo22>i am trying to write a compact CPU_NUMBER that reads the kernel id

<damo22>its messy

<Pellescours>I’m triggering the general protection trap with smp enabled and 1 cpu

<Pellescours>damo22: How does "movl APIC_ID[%ebx], %eax" work (http://git.zammit.org/gnumach-sv.git/tree/i386/i386/cpu_number.h?h=feat-smp2-faults#n48)? I mean APIC_ID is defined as "offset ApicLocalUnit lu apic_id APIC_ID". Are you sure it takes the right address?

<Pellescours>damo22: I tried something: I replaced the CPU_NUMBER macro so that it always sets the ebx register to 0, and booted with 1 cpu. It works without a page fault. It really seems to be the "movl lapic, %ebx" that causes the protection fault

<Pellescours>it’s definitely this instruction that triggers the protection fault
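For context on the load being asked about: on an xAPIC, the local APIC ID register sits at byte offset 0x20 from the local APIC base, and the ID is in bits 31:24. A minimal C sketch of that read is below; the function and macro names are hypothetical and this is not the gnumach CPU_NUMBER macro. The base pointer is whatever the "lapic" variable points to, so two loads are involved: one for the pointer, one for the register.

```c
#include <stdint.h>

#define LAPIC_ID_OFFSET 0x20u /* xAPIC: local APIC ID register */
#define LAPIC_ID_SHIFT  24    /* ID lives in bits 31:24 */

/* Hypothetical helper: read the APIC ID from a mapped local
 * APIC register page. lapic_base must already be mapped and
 * reachable through the data segment in use. */
static uint32_t lapic_read_id(const volatile uint32_t *lapic_base)
{
    uint32_t reg = lapic_base[LAPIC_ID_OFFSET / sizeof(uint32_t)];
    return reg >> LAPIC_ID_SHIFT;
}
```

In a test this can be pointed at an ordinary array standing in for the register page; in the kernel the first load (fetching the pointer itself) is the one that needs usable segment registers.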

<Pellescours>damo22: can it be because the 1st CPU_NUMBER is called before the switch to kernel segments, so some kernel variables are not accessible yet?

<Pellescours>I think that’s the cause, acd3fa8f8ba9c093c426f83488b338088035f117 introduced a CPU_NUMBER call before the stack switch

<damo22>Pellescours: genius

<damo22>so how do we make an ASM macro to read cpu number?

<damo22>maybe we can use the hardcoded address of the lapic?

<damo22>or maybe we can make an early stack switch function that only gets used when cpu number will fail

<damo22>like before the switch to kernel segments it can use a hardcoded cpu number

<damo22>where can i store a flag that can be read even when not on kernel segs?

<damo22>i only need one bit

<damo22>does eflags have a user space?

<damo22>i need to mark when the cpu bringup is done

<youpi>isn't it possible to just disable interrupts until the bringup is done?

<damo22>maybe

<damo22>but the problem is cpu_number cannot read lapic

<damo22>because the first CPU_NUMBER is called before switch to kernel segs

<damo22>Pellescours found it

<youpi>before?

<youpi>which one?

<youpi>the one in all_intrs is after

<youpi>ah, the additional before the int_from_stack check

<damo22>(20:00:23) Pellescours: I think that’s the cause, acd3fa8f8ba9c093c426f83488b338088035f117 introduced a CPU_NUMBER call before the stack switch

<youpi>possibly setting the registers could be moved before that

<youpi>+segment

<damo22>ok!

<youpi>Mmm, actually, can't one just use the cs segment ?

<youpi>it doesn't allow writing, but it should be fine for reading

<damo22>i dont understand that part

<youpi>which part?

<youpi>just use cs:

<youpi>that'll use the cs segment

<youpi>which is already set by the interrupt mechanism

<damo22>i mean, use it where?

<youpi>when reading
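What a cs: override on a read looks like can be sketched in C with GCC-style inline assembly. This is only an illustration: `read_via_cs` is a made-up name, and in the kernel entry path it would be a plain assembly load such as a cs-prefixed "movl lapic,%ebx" rather than C. It relies on the code segment being readable, and on cs already being valid because the interrupt mechanism loaded it. On non-x86 targets the sketch falls back to an ordinary load.

```c
#include <stdint.h>

/* Hypothetical helper: load a 32-bit value through the cs
 * segment instead of the default ds. Reading through cs works
 * when the code segment is marked readable; writing through cs
 * is never allowed. */
static uint32_t read_via_cs(const uint32_t *p)
{
#if defined(__i386__) || defined(__x86_64__)
    uint32_t v;
    /* %%cs: adds a segment override prefix to the load. */
    __asm__ __volatile__("movl %%cs:%1, %0" : "=r"(v) : "m"(*p));
    return v;
#else
    return *p; /* no segment registers on this target */
#endif
}
```

The point of the override is that the load no longer depends on ds having been switched to the kernel data segment yet.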