IRC channel logs

2024-04-10.log


<azert>I'm debugging the function with asm snippets like the following
<azert>        movz x0, #0x8000
<azert>        movk x0, #0x01c2, lsl #16
<azert>        movk x0, #0xffff, lsl #48
<azert>        movz w1, #0x42
<azert>        dsb st
<azert>        str w1, [x0]
<azert>that outputs a single char on the serial line
<azert>let's see..
<almuhs>azert: do you refer to the function which causes my compilation error?
<azert>no I refer to the code solid_black wrote
<azert>that I'm trying to debug
<azert>he normally reads the logs and provides feedback
<dsmith>What is that, something like: *(char *) 0xffff01c28000 = 'A';
<azert>yes
<azert>It's exactly that but with a memory barrier in addition that I'm not really sure is needed
<azert>the dsb st instruction
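For readers following along: the three movz/movk instructions build the store address 16 bits at a time, so the full 64-bit value has zeros in bits 32-47. Below is a minimal C sketch of that composition. The names debug_port_addr and DEBUG_CHAR are hypothetical, chosen only for illustration, and the actual store is left as a comment because it only makes sense on the target board.

```c
#include <stdint.h>

/* Value placed in w1 by "movz w1, #0x42" in the snippet above. */
#define DEBUG_CHAR 0x42

/*
 * Hypothetical helper mirroring the movz/movk sequence.
 * movz sets one 16-bit field and zeroes the rest; movk replaces one
 * 16-bit field and keeps the rest. Because the fields written by movk
 * are still zero here, OR-ing gives the same result.
 */
static uint64_t debug_port_addr(void)
{
    uint64_t x0;

    x0 = (uint64_t)0x8000;          /* movz x0, #0x8000          */
    x0 |= (uint64_t)0x01c2 << 16;   /* movk x0, #0x01c2, lsl #16 */
    x0 |= (uint64_t)0xffff << 48;   /* movk x0, #0xffff, lsl #48 */
    return x0;
}

/*
 * On the target only, the "str w1, [x0]" would correspond to a
 * 32-bit store such as:
 *     *(volatile uint32_t *)debug_port_addr() = DEBUG_CHAR;
 */
```

Under these assumptions the address comes out as 0xffff000001c28000, and the byte written is 0x42, which is ASCII 'B'.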
<azert>solid_black: the code in thread_bootstrap_return gets stuck in one of the following two instructions:         msr TPIDR_EL0, x0
<azert>        msr SPSR_EL1, x1
<azert>WRONG
<azert>one of these:         msr SP_EL0, x0
<azert>        msr ELR_EL1, x1
<azert>solid_black: ditch that! I am almost sure that it is stalling at this instruction:         ldp x0, x1, [sp, #(ATS_SP)]
<azert>that would mean that there is an issue with the stack
<azert>weird that this doesn't raise an exception?
<azert>almuhs: maybe you should try reintroducing the patches one after the other?
<almuhs>maybe
<almuhs>but the patches were written for a previous version of the code
<almuhs>so maybe they are not compatible
<azert>then by doing it one by one you will see what needs to be updated
<solid_black_tmp>azert: OMG I think I know what's going on
<sneek>Welcome back solid_black_tmp, you have 1 message!
<sneek>solid_black_tmp, gnucode says: I should probably update the latest news with your alpine aports
<solid_black_tmp>Try adding a "long pad" to "struct pcb" in thread.h in between "afs" and "ats"
<solid_black_tmp>And regenerating aarch64asm.h (rm it from the build tree)
<dsmith>solid_black_tmp, alignment?
<solid_black_tmp>Yep
<solid_black_tmp>The one thing that QEMU doesn't check, citing performance, is SP alignment
<dsmith>I was going to ask earlier, what is the value of ATS_SP?
<solid_black_tmp>And there's a reason why faults on the PCB stack are not reported — I dropped even trying to handle them, since we're really just not supposed to crash there
<solid_black_tmp>It's much like faulting inside the fault handler: you're just not supposed to
<solid_black_tmp>Barring bugs like this
<dsmith>bad for large values of bad
<solid_black_tmp>ATS_SP is just 8*31
<solid_black_tmp>That's not the issue, unaligned loads are fine
<solid_black_tmp>What's not fine is an SP that isn't 16-byte aligned at any time a load or a store is made relative to it
<dsmith>Ah
<solid_black_tmp>And SP here points to pcb->ats of the active thread
<solid_black_tmp>I don't think anything currently enforces that the PCB is 16-aligned overall either, but at least that seems to be the case for azert
<solid_black_tmp>Also note that regular access to ats->pc worked, as we know from the "thread_exception_return() to 40205c" message
<solid_black_tmp>Meaning that the pointer & data is OK otherwise
<solid_black_tmp>But not when used as SP
<solid_black_tmp>See, it makes sense
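The alignment argument above can be checked with offsetof. The sketch below is only an illustration: the struct layouts are hypothetical stand-ins (the real contents of struct pcb, afs and ats are not shown in this log), with afs assumed to be 8 bytes so that ats lands on a misaligned offset, and the PCB itself assumed to start on a 16-byte boundary. A fixed 8-byte pad is used in place of the suggested "long pad" so the result does not depend on the host's size of long.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the saved float state; size is an assumption. */
struct fake_afs {
    uint64_t word;
};

/* Hypothetical stand-in for the saved thread state: x0..x30, then sp, pc. */
struct fake_ats {
    uint64_t x[31];
    uint64_t sp;      /* lands at offset 8*31, like ATS_SP in the log */
    uint64_t pc;
    uint64_t spsr;
};

/* Layout before the fix: ats starts at offset 8 in this sketch. */
struct pcb_unpadded {
    struct fake_afs afs;
    struct fake_ats ats;
};

/* Layout after the fix: one 8-byte pad between afs and ats. */
struct pcb_padded {
    struct fake_afs afs;
    uint64_t pad;
    struct fake_ats ats;
};

/* SP must have its low 4 bits clear when used as a load/store base. */
static int sp_is_16_aligned(uintptr_t sp)
{
    return (sp & 0xF) == 0;
}
```

With these made-up sizes, a pointer to ats in the unpadded layout fails sp_is_16_aligned even though ordinary loads through that pointer would work, which matches the observation that reading ats->pc was fine while using the same address as SP was not.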
<solid_black>hi
<solid_black>azert: pushed the fix
<Pellescours>hi
<azert>solid_black: I applied your patches, it doesn't solve the issue
<azert>I'll try to see if the last 4 bits of sp are zero next
<azert>to check if this is indeed the issue
<azert>I discovered something interesting: the exception is somehow taken and the processor is running in an infinite loop
<solid_black>azert: I'm 90% sure that SP alignment was the issue, so please double-check that you have applied my patch
<solid_black>other than that, let me see if I can enable tracing for exceptions taken from the PCB stack too
<solid_black>but yes, sure, if we take an exception with a bad SP, we'll get stuck
<solid_black>Linux does this interesting thing where they swap SP with a general-purpose register, then align SP, and proceed from there
<azert>I think you solved that issue
<azert>now there is another one!
<azert>I checked sp and it's aligned right now
<solid_black>so it still faults there, but for a different reason?
<azert>yes
<solid_black>fun
<azert>let me reenable tracing
<azert>I'm not sure it faults in the same spot
<solid_black>the thing about SP alignment is QEMU doesn't check for it, even though the ARM spec requires it to
<solid_black>which is why I haven't noticed this before
<azert>I checked SP with this
<azert>        movz x0, #0x8000
<azert>        movk x0, #0x01c2, lsl #16
<azert>        movk x0, #0xffff, lsl #48
<azert>        movz w2, #0x41
<azert>        dsb st
<azert>        str w2, [x0]
<azert>        movz x3, #0xF
<azert>        and x1, x1, x3
<azert>        add w2, w2, w1
<azert>        dsb st
<azert>        str w2, [x0]
<azert>it prints an infinite amount of AAAAAAAAAAAAAAAAAAA
<azert>not just two..
<azert>x1 contains sp
<azert>I insert this after         GET_PCB_STACK(x1)
<azert>        mov sp, x1
<solid_black>mov x1, sp you mean?
<solid_black>ah, that is existing code
<azert>yes
<azert>so it should print 'A' and then 'A' + the last 4 bits of sp
<azert>and it just prints 'A'
<azert>many times
<solid_black>yes, and it's aligned so you get two A's
<azert>I get hundreds
<solid_black>and then since thread_exception_return is invoked lots of times, you get a lot of output
<azert>pointing to another issue
<solid_black>that is, assuming it does successfully return to EL0
<azert>probably an issue outside this function
<azert>so let's reenable tracing and I'll get back to you
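The SP check above boils down to a mask and an add: the second character printed is 0x41 plus the low 4 bits of SP. A small C sketch of that arithmetic follows; the function name is hypothetical and the SP values used to exercise it are made up.

```c
#include <stdint.h>

/*
 * Hypothetical model of the second character emitted by the snippet:
 *   movz x3, #0xF ; and x1, x1, x3 ; add w2, w2, w1
 * where w2 starts at 0x41 ('A') and x1 holds a copy of sp.
 */
static char second_debug_char(uint64_t sp)
{
    return (char)(0x41 + (sp & 0xF));
}
```

A 16-byte aligned SP has low bits 0, so the second character is again 'A' and each pass through the code prints two of them. An SP ending in 8 would print 'I' instead.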
<solid_black>so can you reenable tracing and see what happens now, or will that wait until tonight?
<azert>git cherry-pick ee177f52680116538192b2c0c5d9a08e174c007f
<azert>It's stuck in this kind of loop:
<azert>thread_exception_return() to 40205c
<azert>Sync exc from EL0!
<azert>ESR: 0x8200000f, FAR: 0x40205c
<azert>upfc, kr = 0
<azert>thread_exception_return() to 40205c
<solid_black>so?
<solid_black>awesome!
<azert>:)
<azert>gtg
<solid_black>0x8200000f is ESR_EC_IABT_LOWER_EL (an instruction abort from EL0), i.e. it tried to run code from a page that either wasn't there or the page was not executable
<solid_black>0x0f is ESR_IABT_IFSC_PERM_L3, permission fault on level 3
<solid_black>so the physical page is there, but it's not executable
<solid_black>which makes sense, since read_exec() vm_allocate's the page (rw), then copies over it (so the physical page is entered into pmap as rw), and then vm_protect's it executable
<solid_black>what's supposed to happen, and you can see it in my trace, is it gets entered into pmap as executable now, and we return back, and this time it works and faults on another address
<solid_black>but you're saying that in your case it keeps looping on the same address?
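The ESR value from the trace can be decoded by hand using the architectural field positions: EC is bits [31:26], IL is bit 25, and for an instruction abort the fault status code (IFSC) is bits [5:0]. A minimal C sketch, with hypothetical helper names:

```c
#include <stdint.h>

/* Exception class: ESR bits [31:26]. */
static unsigned esr_ec(uint32_t esr)
{
    return (esr >> 26) & 0x3F;
}

/* Instruction length bit: ESR bit 25. */
static unsigned esr_il(uint32_t esr)
{
    return (esr >> 25) & 0x1;
}

/* Instruction fault status code: ESR bits [5:0] for instruction aborts. */
static unsigned esr_ifsc(uint32_t esr)
{
    return esr & 0x3F;
}
```

For 0x8200000f this gives EC = 0x20 (instruction abort from a lower exception level), IL = 1, and IFSC = 0x0f (permission fault, level 3), in line with the reading given in the log.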
<solid_black>azert: try uncommenting TLB_FLUSH/cache_flush at the end of pmap_enter() in aarch64/aarch64/pmap.c
<solid_black>my understanding is that they shouldn't be necessary, i.e. it cannot cache in / speculate a negative lookup
<solid_black>but I might be wrong
<solid_black>hmm, what if it cannot cache the page not being there at all, but can cache the page not having execute permission?
<solid_black>that would explain why kernel data faults worked for you, but EL0 instruction faults don't seem to
<solid_black>ACTION reads the Arm ARM chapter on TLB and break-before-make
<solid_black>truth is, I don't really understand exactly which barriers are required in which case
<solid_black>I just stuffed a bunch of them, copying from random examples
<solid_black>and that seemed to be enough for qemu
<solid_black>but I'm not surprised if that's not enough for real hardware
<Pellescours>youpi: I noticed that this patch (https://salsa.debian.org/hurd-team/gnumach/-/blob/fcbfed0c084bda1ed0b963b56fa0a7c847c5cfc3/debian/patches/79_dde-debian.patch) doesn’t apply anymore, it needs to be updated
<youpi>isn't it already updated in the tree?
<Pellescours>I don’t know, when I try to apply it to latest gnumach (not the debian one, which is not that old so should be relatively the same), it fails to apply.
<Pellescours>I checked that I took the latest version of this patch
<youpi>/usr/src/hurd/gnumach (git)-[master] €
<youpi>patch -p1 --dry-run < /usr/src/hurd-debian/gnumach/debian/patches/79_dde-debian.patch
<youpi>works fine here
<solid_black>any news on SMP kernels in Debian?
<youpi>I sent them the other day
<Pellescours>Maybe it’s me applying it wrongly then
<solid_black>ah, I see, thanks!
<Pellescours>Ah, found it: it was because my git tree was not clean. I reset its state, and now it applies correctly
<solid_black>azert: I pushed uncommenting the barriers
<solid_black>someday, somebody who understands the exact required barriers better than I do should go and ensure we always use the weakest barrier required
<solid_black>but for now it's fine if we do a little bit more