IRC channel logs

2022-09-17.log

back to list of logs

<Pellescours>because I was unable to have gdb working correctly I took another approch, commenting lines of code and try to find what makes gnumach having a kernel trap
<Pellescours>I commented almost everything, configured to use our intr and not the linux one. It works until I comment the linux_timer_intr() call in i386/i386/hardclock.c
<Pellescours>If someone have an idea of the why, I take
<youpi>( rather than commenting, you can just insert while(1) {} )
<youpi>you can probably look into linux_timer_intr to check what exactly there poses problem?
<Pellescours>the issue is that removing the call generate the trap, and leaving it don’t
<youpi>yes, I'm saying to go *into* it
<youpi>put while(1) {} in places inside that function
<Pellescours>ah o.
<Pellescours>ah ok
<Pellescours>keeping only this line "(*(unsigned long *) &jiffies)++;" make it works. I copied this line (and the var declaration) into hardclock and I’m able to run without trap (compiled with --disable-linux-groups)
<Pellescours>is this a bug in the context switching?
<youpi>why would it?
<youpi>the rest of the function is about mark_bh
<youpi>but you didn't tell about your context
<youpi>what are you testing?
<Pellescours>I’m testing booting gnumach without linux code, and when I compile with --disable-linux-groups, it don’t boot.
<youpi>it looks odd that linux_timer_intr would be called at all with the linux group disabled
<youpi>possibly a bug in the configfrag bits?
<Pellescours>It’s not called when it’s disabled
<youpi>? you said above it was called
<Pellescours>yes because to find the root cause, I kept linux-group enable but I commented portions of code until find the root cause
<youpi>I don't understand your approach
<youpi>if you have a bug with the linux group disabled, why are you debugging with linux group enabled?
<Pellescours>And the root cause is that without this "(*(unsigned long *) &jiffies)++;" line, I’m not able to boot
<Pellescours>Because gdb don’t work
<youpi>aah, now that makes sense
<youpi>so does it work with, as only modification, disabling the linux group and updating jiffies in hardclock.c ?
<Pellescours>yes
<youpi>btw, "I'm not able to boot" is far from enough as information
<youpi>does it hang
<youpi>does it reboot
<youpi>does it trap
<youpi>does it catch fire
<youpi>details *always* matter when debugging
<youpi>it'd be useful to grep to see which .o files actually use jiffies
<Pellescours>it reboot
<youpi>since it's a linuxish thing, mach isn't supposed to use it
<Pellescours>yes, as I said to be able to compile it, I also copied the variable declaration into the hardclock.c
<youpi>we don't understand each other it seems
<youpi>you are saying that *that line* alone allows you to boot with linux-group disable
<youpi>is that right?
<youpi>thus I'm saying look for what is using that variable
<youpi>since it's a linux thing, it's really surprising that mach can't boot without seeing it updated
<youpi>thus look for whatever is actually using it
<youpi>and possibly we'll see a surprise and understand what's wrong
<Pellescours>in linux code, it’s used as a counter for the timer
<youpi>yes
<youpi>that's what jiffies are yes
<youpi>but that's for linux, not for mach
<youpi>when disabling the linux group, in principle there's no code left that is supposed to be using it
<youpi>if that's not the case then there's something surprising to be found
<youpi>thus grep for it
<Pellescours>I don’t think it about the jiffies, it’s just that having a volatile variable being updated at that point that "make thing correctly happen?", I renamed my variable test and it works too. I’m now re-testing without the change and without linux groups to get the error
<youpi>I'm lost in what you are actually testing
<youpi>« And the root cause is that without this "(*(unsigned long *) &jiffies)++;" line, I’m not able to boot »
<youpi>what does that actually mean?
<Pellescours>My goal: boot without linux groups, so using rump disk drivers
<youpi>understood
<youpi>my concern is why you're looking at all inside linux_timer_intr, since normally there's no linux code in that case
<youpi>thus no linux_timer_intr function at all
<youpi>note that I wrote "normally"
<youpi>I'm not saying that there's no bug that errnoneously includes it, calls it, etc. and that's plain wrong
<Pellescours>I’m lookint into linux_timer_intr because wasn’t able to use gdb, I took the approch of altering the code to progressively move from "with linux groups" to "without linux groups" until finding the source of what piece of code is making "with linux group" works and not "without linux group"
<youpi>ok, but that's an extremely difficult approach
<youpi>I, for one, wouldn't dare it
<Guest9734>Pellescours : just an idea, did you try to boot without optimizations?
<youpi>because there is so much code happening there
<youpi>that there's no way I can be sure in which order I should be commenting code
<youpi>I'd rather recommend disable the linux group, and put while(1){} to see there things start going wrong
<Pellescours>yes I know, I my approach is not the best, I could not have worked
<Guest9734>« updating a volatile variable changes hid a bug » sounds like code reordering by the compiler to me
<youpi>Guest9734: I believe that's unrelated
<youpi>and it's just that the bh mechanism is half-disabled, and thus broken
<youpi>because it's so hard to know exactly how it should be progressively disabled
<youpi>thus my advice not to try that approach
<Pellescours>ok ^^'
<Guest9734>youpi : yes that makes sense
<Pellescours>but at least when I completely disable the linux group BUT I add this variable update, I’m able to boot completely.
<youpi>ah
<youpi>so that's back to something odd
<youpi>mach shouldn't need that variable at all
<youpi>so look for what is actually using it
<youpi>and we'll see the surprise
<Pellescours>nothing is using it, I called my variable "test" and that makes the thing working
<Guest9734>asm volatile("": : :"memory")
<Guest9734>Try this in place of that update
<Pellescours>Guest9734: your line works
<youpi>that's terribly odd
<youpi>there is no source line left after that line, right?
<Pellescours>exact
<youpi>so *only* adding that line as last line of the hardclock() function, makes the kernel boot?
<Pellescours>yes
<youpi>that is so strange
<youpi>it'd be worth comparing the generated assembly code
<youpi>i.e. objdump -d hardclock.o
<Pellescours>And without it reboot just after "vm_page: HIGHMEM: min:28158 low: 33790 high 56316"
<youpi>your variable, what is it?
<youpi>a local variable ? a global variable ? with static ?
<Pellescours>unsigned long volatile
<youpi>yes but what storage?
<youpi>(which is the whole local/global, static question)
<Pellescours>global
<Pellescours>variable is declared outside of a function
<youpi>could you try to make it local?
<youpi>(non-static)
<youpi>(comparing the generated assembly code would still be very insightful)
<Pellescours> https://pastebin.com/pgK3qTeg for the disass
<youpi>uh, how is it that the non-volatile version doesn't call clock_interrupt at all (no call instruction)
<youpi>ah, no tail recursion
<youpi>ok, now things start making some sense
<youpi>and it's indeed just the volatile thing that makes gcc not optimize and get to boot
<youpi>(so in the end disabling the linux bits progressively did point out the difference that mattered, congrats on getting it right, I wouldn't have dared doing it)
<youpi>ok, I guess the difference lies in the non-volatile version modifying its parameters
<youpi>while the caller expect the irq number parameter not to change on the stack
<youpi>since it uses it for the pic mask and whatever
***Andrew is now known as WaxCPU
<youpi>Pellescours: just to see: could you try to make the old_ipl and irq parameters const?
<youpi>(not able to reproduce that behavior with a small testcase yes)
<Pellescours>yes
<youpi>how do you configure gnumach exactly?
<Pellescours>../configure --disable-linux-groups
<youpi>no CFLAGS etc. ?
<Pellescours>nothing
<youpi>you at least tell it to use a 32bit compiler don't you?
<Pellescours>no, I’m compiling from my hurd VM and don’t have a 64bit compiler
<Pellescours>installed
<youpi>ok
<Pellescours>changing to const doesnt work
<youpi>ok, with native 32bit compiler, I do get the tail recursion issue as well indeed
<youpi>and const doesn't help with it indeed
<youpi>I'm looking at cleaning interrupt.S to properly separate local variables & parameters
<luckyluke>Pellescours: do you also have issues if you use --enable-device-drivers=none instead of --disable-linux-groups?
<Pellescours>luckyluke: without modification, using "../configure --enable-device-drivers=none", it reboot like --disable-linux-groupt
<Pellescours>I’m testing now with the "fix"
<youpi>no need to investigate any more :)
<youpi>you can use the "fix" as a workaround for now
<youpi>I'll push a real fix in interrupt.S
<youpi>which will alongside make 64bit and 32bit versions way more similar
<Pellescours>luckyluke: the workaround make --enable-device-drivers=none working
<Pellescours>luckyluke: with your x64 kernel, are you able to boot on it?
<luckyluke>Pellescours: thanks... as far as I understand, the issue only happens with a 32-bit compiler? I usually compile gnumach from linux, so maybe this is why I never had this issue
<youpi>yes, the 64bit interrupt code doesn't have the issue
<luckyluke>I also compiled 32-bit gnumach but no issues (both with and without linux drivers)
<youpi>you need to enable optimizations
<youpi>so that tail recursion hits the bug
<luckyluke>is the default -O2 enough?
<youpi>probably
<youpi>note however that I couldn't reproduce the issue with the 32bit compiler from a 64bit host
<youpi>I had to use a native 32bit host (chroot, actually)
<Pellescours>luckyluke: yes -O2 is enought to reproduce, that’s the level I have
<luckyluke>I see, so I'm using the lucky combination :)
<luckyluke>and I guess for debian, gnumach is built on linux as well?
<youpi>no
<youpi>but for debian linux drivers are enabled
<luckyluke>ah, right
<youpi>Pellescours: I don't remember your real name? for the comment in the commit log
<Pellescours>Etienne Brateau
<youpi>fix pushed :)
<Pellescours>otherwise I still have my problem of compiling rumpdisk with ahcisata and piixide, I really don’t know to have the struct atabus_cd being not here twice. I don’t know how to change the ioconfs to make this object not being in ahcisata+piixide but in the "common ata" lib