IRC channel logs

<ekaitz>old: it looks like the tests now fail 10/13 with that flag set

<ekaitz>i just ran them like 5 times

<ekaitz>now I need a core

<ekaitz>this is all I have in the terminal: assert run-fibers on (rpc-fib 24) terminates: Segmentation fault

<ekaitz>so... it doesn't really say much

<ekaitz>i got a core

<ekaitz>but i'm not sure if it's going to be decent

<ekaitz>very poor

<ekaitz>idk what to do now :)

<dsmith>gdb the-executable the-core-file

<dsmith>and then `bt`

<ekaitz>i don't see much there

<ekaitz>how do I add many separate symbol files together?

<dsmith>Not sure what you mean. The exec needs to be built with -ggdb (IIRC)

<ekaitz>dsmith: https://cdn.elenq.tech/guile.txt that's what I see in the backtrace

<dsmith>This is guix stuff?

<ekaitz>yes

<ekaitz>it produces the symbols in a separate directory, and I managed to import them

<ekaitz>but it's like there are missing things

<ekaitz>probably because of the jit?

<dsmith>There is a difference when compiling with -g and -ggdb . There is more info available. And it might be jit code too.

<dsmith>*Typically* on a segfault, there is some arg passed that is 0 that should not be.

<ekaitz>okay, that would mean recompiling guile

<dsmith>But those addresses seem to be in the same general area, so they are prob not JIT'ed code

<ekaitz>yeah that's true

<ekaitz>recompiling guile with -ggdb

<dsmith>Yeah, recompile guile with the -ggdb option.

<ekaitz>okay, this will take long

<ekaitz>so probably better continue tomorrow

<dsmith>This https://gitlab.com/wingo/lightening/-/work_items/14 was segfaulting the bot 6 years ago. The only time I messed with gdb and JITed code

<dsmith>It would take hours to build..

<ekaitz>i have had some fun with the jit but I don't remember very well what I did

<dsmith>Debugging was setting a breakpoint in C after the JIT code was generated, and then use that info to disassemble the generated machine code.

<ekaitz>yep

<ekaitz>there are also a few env vars that pause after jit and things like that

<ekaitz>what I never did is to debug multithreaded code

<dsmith>Rather tedious

<ekaitz>and in this case I'm not sure what I'm looking for

<dsmith>Sometimes, you just have to poke it with a stick and see which way it wiggles...

<dsmith>Add a print. Add some sleeps somewhere. Make a fake error.

<ekaitz>yeah, but i'm not very used to multithreaded code and i'm not sure how I should poke it

<ekaitz>will try tomorrow anyway

<ekaitz>that's the next challenge in ekaitz's programming knowledge journey

<old>what is interesting is to know what memory segfaults and how

<old>is it executable memory that segfaults?

<old>I think you can have access to siginfo

<old> https://sourceware.org/gdb/current/onlinedocs/gdb.html/Signals.html

<old>look at the end, there's an example with segfault

<old>if you can use that, it would certainly give hints

<ekaitz>oh! i didn't know about this

<ekaitz>great

<old>the important fields are: si_addr and si_code

<ekaitz>okay, i'll do that tomorrow

<old>thx

<rlb>printf debugging is primitive, but consistently reliable for threading issues, *unless* of course printing the messages itself is enough to disturb some important ordering issue...

<rlb>ekaitz: I imagine you know, but last time I had to deal with this sort of thing directly in C (a good while ago) gdb was pretty good wrt showing individual threads, switching for inspection, etc.

<ekaitz>rlb: sure, I use gdb a lot these days but I'm not very used to multithreaded code

<rlb>It's certainly often "harder".

<ekaitz>yeah

<ekaitz>most of GNU Mes' debugging I do in gdb

<rlb>And as you say, if you can come up with theories, that's often a promising place to start...

<ekaitz>yes, that's what I was lacking here

<ekaitz>but I'll let my pillow help me with that

<ekaitz>it's very late here

<dsmith>And there is a difference with posix threads and "green" threads. gdb knows posix

<rlb>With concurrency in particular, I think it's likely easier if you get to the point of understanding the code well enough to reason about it wrt making predictions, but if you can "catch it in the act", of course, that might do it too.

<rlb>Right, I assume fibers are green-ish?

<dsmith>(I'm assuming fibers is more "green" ?)

<rlb>(i.e. multiplexing on posix threads)

<rlb>(If it were segfaulting in "the same place", then maybe you could start setting watches, etc. and working backward.)

<ekaitz>ACTION goes to bed

<ekaitz>thanks all for the ideas

<rlb>night

<ekaitz>i'll bother you with questions during the following days

<ekaitz>ACTION runs away

<rlb>Wondering if Andy might also be able to help once there's more info.

<ekaitz>rlb, old, dsmith : got guile built with -ggdb

<ekaitz>same thing in the trace

<old>do you have the value in siginfo ?

<ekaitz>let me try

<ekaitz> https://cdn.elenq.tech/fibers-threads.gdb.log this is the stack trace of all threads

<dsmith>ekaitz, Fouey

<ekaitz>oh wait

<ekaitz>in gdb i get Thread 8 "guile" received signal SIGXCPU, CPU time limit exceeded.

<ekaitz>but if i run the process outside gdb i get just a segfault

<ekaitz>old: (gdb) p $_siginfo._sifields._sigfault.si_addr

<ekaitz>$2 = (void *) 0xffffffff8a743b00

<ekaitz>weird number

<dsmith>And just when it started to get fun, I have to leave...

<dsmith>ekaitz, Good luck.

<ekaitz>haha dsmith see ya

<ekaitz>aaargh i don't know what I'm looking for

<ekaitz>and the core doesn't give me any backtrace even if I load the symbol files

<ekaitz>it looks like some sign extension issue

<ekaitz>i'm looking at the registers

<ekaitz>0x3f8d691d80 or numbers like that are common

<ekaitz>ours is 0xffffffff8a743b00, which is in s1

<ekaitz>sorry, s2

<ekaitz>it looks like it got signextended?

<old>ekaitz: what about si_code

<old>that's the most important thing

<ekaitz>11

<old>yeah, $2 is weird, that looks like a kernel mapping and not userspace eh

<ekaitz>(talking from memory)

<old>11 decimal or binary

<ekaitz>let me try again

<ekaitz>oh i was rebuilding, it'll take some time

<old>I hope it's binary .. there is no 11 lol

<old>#define NSIGSEGV 10

<ekaitz>isn't sigsegv 11?

<ekaitz>yeah SIGSEGV is 11

<ekaitz>(decimal)

<old>sure but si_code

<old>is supposed to be the reason

<old>The following values can be placed in si_code for a SIGSEGV signal:

<old>SEGV_MAPERR Address not mapped to object.

<old>right so 0xffffffff8a743b00 is indeed a kernel space pointer

<old>either wrong pointer arithmetic or wrong sign extension

<old>can you do: (gdb) p (void*) 0x8a743b00

<old>see if there is a symbol there

<old>you can also try to disassemble at that location: (gdb) x/5i 0x8a743b00

<old>if you see some valid RISC-V instruction, this looks like a sign-extension bug to me

<ekaitz>oh let me see that in more detail

<ekaitz>once it builds

<ekaitz>i tried those before

<ekaitz>i think there wasn't any instruction but maybe I did the disassembly wrong

<ekaitz>ACTION is out for a coffee and will try later when guile is properly compiled

<dthompson>the next lisp game jam has been announced. 5/15-5/25 https://itch.io/jam/spring-lisp-game-jam-2026

<ieure>Wake me when the lisp game candied preserves opens

<ekaitz>old: si_code = 1

<ekaitz>dthompson: !!

<old>ekaitz: interesting

<old>1 not 11 right

<ekaitz>yeah, 1

<ekaitz>now i ran it; before i was giving you the signal id, not the si_code

<old>so that's indeed SEGV_MAPERR

<old>which means the mapping is not valid, which makes sense since the address you have is a kernel space address
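The reasoning above can be checked mechanically. What follows is only a minimal Python sketch: it assumes Linux's numbering of the SIGSEGV `si_code` values (1 is SEGV_MAPERR, 2 is SEGV_ACCERR) and assumes the machine uses RISC-V Sv39 paging, which the log never states. The helper names `describe_si_code` and `classify_address` are made up for illustration.

```python
# Linux si_code values for SIGSEGV (only the two most common ones).
SEGV_CODES = {1: "SEGV_MAPERR", 2: "SEGV_ACCERR"}

# Assumption: Sv39 paging, i.e. 39-bit virtual addresses where bits 63..39
# must all be copies of bit 38. The log does not say which mode is in use.
VA_BITS = 39
USER_TOP = 1 << (VA_BITS - 1)        # first address above the lower half
UPPER_BASE = (1 << 64) - USER_TOP    # first address of the upper half


def describe_si_code(si_code):
    """Return the symbolic name of a SIGSEGV si_code, if known."""
    return SEGV_CODES.get(si_code, "unknown (%d)" % si_code)


def classify_address(addr):
    """Say which half of the assumed Sv39 address space addr falls in."""
    if addr < USER_TOP:
        return "user half"
    if addr >= UPPER_BASE:
        return "kernel half"
    return "non-canonical"


if __name__ == "__main__":
    print(describe_si_code(1))
    print(classify_address(0xffffffff8a743b00))  # the faulting si_addr
    print(classify_address(0x3fb63bd928))        # the instruction pointer
```

Under these assumptions the faulting address lands in the upper (kernel) half while the instruction pointer is an ordinary userspace address, which fits the SEGV_MAPERR reading.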

<ekaitz>yeah

<ekaitz>now let's do the disassembly

<old>yup

<old>if you disassemble on the lower 32-bit portion of the address

<old>that would be nice to see

<ekaitz>hmmm

<old>basically, the value of si_addr

<ekaitz>i tried

<ekaitz>but it doesn't look like anything reasonable

<ekaitz>unimp a lot

<old>(gdb) x/1i (void*)((uintptr_t)$_siginfo._sifields._sigfault.si_addr & 0xffffffff)

<old>hmm okay

<old>what about the value of the instruction pointer

<old>is it the same as si_addr ?

<ekaitz>no but it's similar

<ekaitz>0x3fb63bd928

<ekaitz>look

<ekaitz>i tried to remove zeros from the top to make it look similar

<ekaitz>this is the failing address: 0xffffffffb5afdcc0

<ekaitz>i tried disassembling at 0x3fb5afdcc0

<ekaitz>which has a few unimp, and a few instructions that happen to be disassemblable

<ekaitz> https://paste.debian.net/hidden/129ca3fc

<ekaitz>it doesn't look like real assembly

<old>nah it's garbage

<old>so we have executable code, probably in the JIT here

<old>that makes an illegal access to a kernel address

<rlb>I assume it's not somewhere "stable" you could put a watch point on to try to figure out what writes the value.

<ekaitz>yeah but I don't have any code or anything, this is very hard to test

<ekaitz>the backtrace has a lot of ??

<old>is it RV64 ?

<ekaitz>and i don't have the sources here properly idk why

<ekaitz>old: RV64, yes

<old>the ?? are probably JIT code

<old>not sure

<ekaitz>that's what I thought

<ekaitz>what's this? ⚠️ warning: Unexpected size of section `.reg2/3503087' in core file.

<old>no idea :/

<ekaitz>f

<old>okay let's try something here

<old>could you show the instructions around your instruction pointer

<rlb>Another possibility might be to start trying to simplify an example, and/or use "printfs" to start narrowing down what code's involved. Sometimes that works... (and gives a smaller set of code to reason about when trying to come up with hypotheses)

<old>that ought to show the JITted code

<rlb>(*if* it's still repeatable enough)

<ekaitz>rlb: the only good way to printf debug this is to find a smaller reproducer that doesn't use fibers

<old>I think: (gdb) x/10i $pc - 32

<ekaitz>if we could take fibers out of the way, that would be great

<ekaitz>now I'm compiling things in two levels and it's a big pain in the balls actually

<ekaitz>compiling guile takes looooong time if I want to do it with guix

<old>what do you mean two levels?

<ekaitz>i have to build guile and then fibers

<ekaitz>and run a thing in fibers

<ekaitz>making sure it uses the correct guile

<ekaitz>and so on

<ekaitz>in any case, we should be able to think through this

<ekaitz>if most of the cases fail, it means it could be as simple as making a new thread and triggering the JIT

<rlb>I don't know about guix, but with say a --prefix install, I'd guess you probably don't need to rebuild fibers (if the reproducer is just running a fibers test or something).

<rlb>But I probably don't understand well.

<ekaitz>that should work, yeah

<ekaitz>but getting fibers out of the way should also help, because I don't know how it works internally so I don't know what to expect

<rlb>ACTION unfondly remembers a lot of printf("[%d] %s:%d: ...", (int) getpid(), __FILE__, __LINE__) when tracking down some concurrency-related problems in C...
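The same tagging idea can be sketched outside C. This is a hypothetical Python helper (the name `trace` and the output format are assumptions, not anything from Guile or fibers) that prefixes each message with the process id, the thread id, and the caller's file and line:

```python
import os
import sys
import threading


def trace(message):
    """Write one tagged debug line to stderr and return it."""
    caller = sys._getframe(1)  # frame of whoever called trace()
    line = "[%d/%d] %s:%d: %s" % (
        os.getpid(),
        threading.get_ident(),
        os.path.basename(caller.f_code.co_filename),
        caller.f_lineno,
        message,
    )
    # Emit the whole line with a single write call so that output from
    # different threads is less likely to interleave mid-line.
    sys.stderr.write(line + "\n")
    return line


if __name__ == "__main__":
    trace("worker started")
```

As rlb notes earlier in the log, the printing itself can perturb timing, so a race may hide once such lines are added.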

<rlb>ekaitz: another thing I wondered about is if you do get to the point where printing activity is interesting, and if fibers is using a thread pool that's configurable, then decreasing it might help lower the complexity of the information *if* the problem still appears.

<ekaitz>old: this might be interesting, there are fences introduced in the code

<old>I don't think the problem is synchronization

<rlb>(of course if it's a race, then decreasing the size of the pool might well also make the problem vanish)

<ekaitz> https://paste.debian.net/hidden/74fca5bc <-- and also! look at s2

<ekaitz>s2 is the register with the weird value

<ekaitz>andi sign extends

<ekaitz>but still, it's overwriting s2

<ekaitz>ld s2,32(a0)

<ekaitz>and then loads from s2

<ekaitz>it's like it's missing something after the second load

<old>what's at 32(a0) in gdb ?

<old>is it the same value that segfaults?

<old>you could also disassemble more before to have full context

<old>ekaitz: I hope you are keeping track of all that for the future :p

<ekaitz>a0 is the register that is used for args and return values

<ekaitz>32 (a0) means (contents_of(a0) + 32)

<old>Right

<ekaitz>(gdb) p/x $a0+32

<ekaitz>$4 = 0x3fb5d7ac78

<old>p/x *(void**) ($a0 + 32)

<ekaitz>hehe

<ekaitz>(gdb) p/x *(void**) ($a0 + 32)

<ekaitz>$6 = 0xffffffffb5afdcc0

<ekaitz>that was what I was looking for

<ekaitz>ACTION needs adult supervision

<ekaitz>so that's what we have there

<ekaitz>whatever that we have in a0+32

<ekaitz>but also, why a0 + 32?

<old>right, whatever a0 is I don't know

<ekaitz>that's weird, fp + something is very normal but a0 is weird

<old>who knows, we need to know the type of a0

<ekaitz> https://paste.debian.net/hidden/728d5ed4

<ekaitz>it's like it wrote s2 to a0+32

<old>oh my god so many fences what the hell

<ekaitz>that's what I told you hehe

<old>that can't be possible. This will be so slow you might as well just disable JIT lol

<ekaitz>also, in lightening i don't understand very well when the fences are introduced

<ekaitz>because the mfence function is never called

<old>atomic operations

<old>ldr_atomic for example

<ekaitz>oh true

<ekaitz>cas_atomic and such

<old>maybe the problem is there

<old>not the fence, but the implementation of the atomic

<ekaitz>surely

<ekaitz>we don't have tests for them in lightening

<ekaitz>and that store is very weird

<old>lw is odd

<ekaitz>yep

<ekaitz>ldr_i and those are using lw and such

<ekaitz>i think ldr_atomic and sdr_atomic might be wrong because of sign extension

<old>I'm not familiar with the RISC-V ISA but

<old> https://www.cs.cornell.edu/courses/cs3410/2025sp/notes/riscv-mem_ctrl.html

<old>The 64-bit RISC-V instruction set gives you several instructions for loading from and storing to memory. They are very similar; the only difference is the size of the load or store: the number of bits we’re reading or writing.

<ekaitz>yeah

<old>I think it should be ld/sd

<ekaitz>i found a minibug already in lightening lol

<old>not lw/sw

<ekaitz>old I agree
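The suspicion can be illustrated with a small model. On RV64, `ld` loads all 64 bits, while `lw` loads 32 bits and sign-extends them to 64. The Python sketch below models only that difference; the function names are invented, and 0x3fb5afdcc0 is merely the address ekaitz guessed at earlier, not a confirmed original pointer value.

```python
MASK64 = (1 << 64) - 1


def load_doubleword(cell):
    """Model of ld: return the full 64-bit contents of a memory cell."""
    return cell & MASK64


def load_word(cell):
    """Model of lw on RV64: take the low 32 bits and sign-extend them."""
    low = cell & 0xFFFFFFFF
    if low & 0x80000000:           # bit 31 set, so the upper 32 bits become ones
        low |= 0xFFFFFFFF00000000
    return low


if __name__ == "__main__":
    pointer = 0x3fb5afdcc0         # assumed intended pointer (ekaitz's guess)
    print(hex(load_doubleword(pointer)))
    print(hex(load_word(pointer)))
```

If a pointer like that goes through a 32-bit load, the result matches the failing address seen in gdb, which would fit the hypothesis that the atomic load and store paths use word-sized instructions. The log does not confirm this yet.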

<old>can you fix that locally and recompile yet again :-)

<ekaitz>it's because the ldr_atomic and sdr_atomic are not written properly

<ekaitz>i can of course

<old>great!

<old>remote debugging session is fine

<old>s/fine/fun

<ekaitz>yeah but also...