IRC channel logs

2024-02-05.log

<oriansj>yeah, sadly I missed the talk but it looks like it would have been another good one.
<stikonas>oriansj: there will be recording later
<stikonas>hmm, just found some interesting data point about posix-runner failure when running mes-m2 (probably not mes-m2 specifically but mes-m2 uses fair amount of memory)
<stikonas>after multiple invocations, eventually kaem was failing to clean up some memory after the mes-m2 process exited
<stikonas>e.g. it could be 30 successful runs of mes-m2 before failure
<stikonas>turns out that number depends on amount of RAM...
<stikonas>so I must be leaking some memory somehow...
<Googulator>Alright - with builder-hex0 fixed, I did reach Linux build in Fiwix
<Googulator>And /boot/fiwix exists
<stikonas>Googulator: nice!
<Googulator>fiwix_file_list.txt exists in this run, but wasn't used - as shown by the fact that /boot/fiwix is not on the list
<Googulator>and yet, it does exist in the real FS
<Googulator>but something is clearly still wrong - packages aren't getting installed properly
<rickmasters>With live-bootstrap cloned 45 minutes ago, kernel bootstrap mode fails for me on libtool-2.2.4.
<rickmasters>libtool-2.2.4: compiling source.
<rickmasters>make: *** No targets specified and no makefile found. Stop.
<rickmasters>Googulator: Were you able to get recent code working, or were you working on an older version?
<stikonas>rickmasters: hmm, at the very least bwrap mode works
<rickmasters>stikonas: It's just kernel bootstrap
<stikonas>yeah, I understand...
<stikonas>ideally we need to bisect it
<stikonas>but fossy didn't update fiwix-file-list
<stikonas>so it will still refer to m4-1.4.7 instead of 1.4.10...
<rickmasters>stikonas: that part was fixed, I saw
<stikonas>yeah, I fixed it, but just saying that bisect will need cherry-picking that
<stikonas>for each bisect test...
<rickmasters>I can bisect. I was just wondering whether Googulator got it working, or whether that separate work on removing file list maintenance started from a working repo
<stikonas>oh, that's good point
<Googulator>It wasn't started from a known working base, so you could well be running into the same issue without the fiwix initrd changes
<stikonas>if Googulator did reach Linux
<Googulator>I did not
<Googulator>well, I did reach Linux build
<Googulator>but I'm running with --interactive
<Googulator>which has a bug related to checksum validation
<Googulator>because set -E doesn't work in bash-2.05b
<Googulator>we use set -E instead of set -e with --interactive
<Googulator>alright, can confirm - libtool is missing
<rickmasters>Googulator: Sorry, not sure what you mean. You skipped it interactively?
<Googulator>It threw an error, but because --interactive has broken error handling in the early bash, it continued on without hitting the trap
<rickmasters>Ok. Interesting. I thought stikonas said bash-2.05b pass1 didn't support interactive at all so maybe that will be trouble.
<stikonas>rickmasters: Googulator worked around it
<stikonas>with some clever traps
<stikonas>but yes, you need a newer bash for full-featured error handling
<Googulator>I emulate interactivity by doing bash -c '$(cat)' in a loop, basically
<Googulator>a bash REPL if you wish
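[Note: a minimal sketch of the "read until EOF, then run it in a fresh bash" idea described above. It is not Googulator's actual code: the function name repl_step, the status line it prints, and the rule that empty input ends the loop are all assumptions made for illustration, and it assumes a bash new enough to be on PATH as "bash".]

```shell
# One REPL step: read a whole snippet from stdin up to EOF (Ctrl+D on a
# terminal), then run it in a fresh, non-interactive bash.
# repl_step is a hypothetical name, not taken from live-bootstrap.
repl_step() {
    snippet=$(cat)
    # Assumption: an empty snippet is treated as a request to leave the loop.
    [ -n "$snippet" ] || return 1
    bash -c "$snippet"
    # Report the snippet's exit status, but keep the loop alive either way.
    echo "[exit status: $?]"
    return 0
}

# The loop itself is left commented out so that sourcing this file does not
# block waiting on stdin:
# while repl_step; do :; done
```

[Each snippet runs in its own process, so variables and "cd" do not persist between steps; that is the price of emulating interactivity this way.]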
<Googulator>the problem is that "set -E" silently fails in this version of bash
<rickmasters>ok, thanks for the update. that's very interesting.
<Googulator>so the error trap that replaces the "just exit and let the kernel panic" behavior in interactive mode fails to propagate to shell functions
<rickmasters>As a side note, I was working on something similar. I was able to get the bash-2.05b pass2.sh to work on fiwix right after libtool-2.2.4.
<rickmasters>I also got it to handle a trap and go into interactive mode.
<Googulator>That helps with getting interactivity without hacks, but unfortunately not with trap propagation
<Googulator>we need at least bash 3.0 for that, possibly 3.2
<rickmasters>But, if we can get pass1.sh to do that then that would be better.
<Googulator>also, there are other issues with the pass1 bash
<Googulator>broken globbing on certain kernels
<Googulator>including the WSL one
<stikonas>Googulator: do you know if that's meslibc issue?
<stikonas>might be interesting to try to rebuild same bash with musl
<Googulator>hmm, it could be
<stikonas>and see whether that fixes it
<rickmasters>So, I triggered interactive mode by installing a trap in an `improve:` script and then put an error in the next package's configure script.
<rickmasters>Is that trap propagation?
<Googulator>no
<Googulator>normally, if you do "set -e", then any error in that shell process, including inside a shell function, will cause the shell to quit
<Googulator>you can trap that quitting with a "trap <cmd> EXIT", but then there's no way to cancel the exit from within the trap environment - by the time the trap is hit, the shell has committed to quitting
<Googulator>so I use "trap <cmd> ERR" instead, which is *meant* to trigger in any scenario where "set -e" would quit
<Googulator>unfortunately ksh had a bug where "trap <cmd> ERR" would fail to trigger if the failure occurred in a shell function, and Bash chose to emulate that by default, since some legacy shell scripts rely on the broken behavior
<rickmasters>Sorry I'm dense, but if you've spawned a new shell on a trap with `bash -i` are you in a new trap environment and if so why go back to the parent which is committed to quitting?
<Googulator>so you need to either use "set -E", or run bash with "-E" to allow ERR traps to behave the way you would expect based on "set -e"
<Googulator>because one of the use cases for the trap is fixing whatever failed, manually rerunning the relevant build, and then typing "exit" to let the bootstrap continue as if nothing happened
<rickmasters>ok
<rickmasters>thanks, that makes sense.
<Googulator>IIRC bash-2.05 was the version that introduced the ksh emulation behavior, not realizing it needs an opt-out
<Googulator>the opt-out was added in 3.0
<Googulator>but was apparently buggy until 3.2
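[Note: the ERR-trap behaviour described above can be observed with a small experiment. This is a sketch that assumes a recent bash in which "set -E" (errtrace) works; per the discussion, it would not behave this way on bash-2.05b. The names run_case and fails_inside are made up for the example.]

```shell
# Run one case in a fresh bash. $1 is "+E" (errtrace off, the default)
# or "-E" (errtrace on). run_case is a hypothetical helper name.
run_case() {
    bash -c '
        set "$1"
        trap "echo ERR-trap-fired" ERR
        # The function returns 0, so only the failure inside its body
        # can trigger the trap.
        fails_inside() { false; return 0; }
        fails_inside
        echo done
    ' bash "$1"
}

without_errtrace=$(run_case +E)
with_errtrace=$(run_case -E)

printf 'errtrace off: %s\n' "$without_errtrace"
printf 'errtrace on:  %s\n' "$with_errtrace"
```

[Without errtrace the failing command inside the function goes unnoticed and only "done" is printed; with errtrace the trap fires first. A shell where "set -E" silently does nothing would give the first result in both cases.]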
<rickmasters>Well, ultimately I was hoping to get bash 5.2 working right after libtool-2.2.4 but it might be a lot harder on Fiwix.
<rickmasters>I started with bash-2.05b to get my feet wet and was able to figure that out but it was hard.
<Googulator>I'm hoping to replace 2.05b with 3.2 entirely
<Googulator>or 3.0 with "set -E" fixes backported
<Googulator>& then we will have to do a rebuild of that version immediately after autoconf-2.52 is built
<Googulator>because the globbing bug hits in the next step, which is automake-1.6.3
<Googulator>luckily autoconf-2.52 is all that's needed to regenerate the configure script in bash 3.x
<rickmasters>OK, well not sure how far you got with bash 3 but bison was segfaulting for me and I had to partially implement mremap in Fiwix to get around it.
<oriansj>ok, I have unxz.c into a form that M2-Planet can compile
<Googulator>Was there any other error in the libtool build before the no makefile one?
<stikonas>oriansj: congratulations!
<Googulator>oriansj: congrats indeed, great news!
<oriansj>not yet, as it still is segfaulting when being run
<oriansj>so the pointer math needs tweaking
<oriansj>not sure on the fix yet.
<stikonas>yeah, that's the tricky thing with M2...
<rickmasters>Googulator: /bootstrap: line 91: xargs: command not found
<oriansj>but I'll add it to the repo as a prototype
<Googulator>hmm, xargs?
<rickmasters>./bootstrap: line 89: find: command not found
<stikonas>xargs is part of findutils
<oriansj>and give everyone who wants a chance to fix it; while I fuzz it to find where the segfaults are
<Googulator>wtf, findutils isn't meant to be built until well after libtool
<rickmasters>actually I see those errors on linux too so false alarm
<Googulator>per manifest
<oriansj>and it is pushed into mescc-tools-extra
<rickmasters>oriansj: nice work
<oriansj>well Googulator pointed me in the right direction; I just did some refactoring ^_^
<rickmasters>Googulator: so, no obvious errors in libtool prior to the "no makefile" error
<Googulator>can you post the configure output?
<stikonas>it could be that bootstrap indeed tries to run those find and xargs commands but failure was non-fatal
<rickmasters>Googulator: the configure output is short, it seems to have failed:
<rickmasters>libtool-2.2.4: configuring source.
<rickmasters>libtool configure 2.2.4
<rickmasters>generated by GNU Autoconf 2.61
<rickmasters>Then a copyright message and that's it
<rickmasters>Interesting, that on linux you get:
<Googulator>Before that, did you get the "Bootstrap detected" lines?
<rickmasters>libtool-2.2.4: configuring source.
<rickmasters>configure: loading cache /dev/null
<rickmasters>./configure: line 1: ./configure:: error 02
<rickmasters>then a lot of output
<stikonas>hmm, perhaps there is some issue with M2libc memory management... I could try to tweak posix-runner to use one preallocated memory block to save all suspended processes (like builder-hex0 does)
<rickmasters>Googulator: yes, there is a Bootstrap detected from the prepare function
<Googulator>there should be several
<rickmasters>yes, several
<Googulator>see https://pipelinesghubeus6.actions.githubusercontent.com/6p8Yv1e2Rv5QWiToIPN8PzwvSX24PCpXUyu7WfoxvNwVP41YOu/_apis/pipelines/1/runs/912/signedlogcontent/2?urlExpires=2024-02-05T00%3A12%3A02.8086153Z&urlSigningMethod=HMACV1&urlSignature=eFkLQTqrQFrcmcyVdyq7MBniOmoFCoq9ECAYQzXWYzI%3D for what it should look like in a successful build
<Googulator> https://paste.debian.net/1306313/ cut to just the libtool part
<rickmasters>Googulator: I didn't see any obvious differences until configure. I'm running some bisect builds.
<Googulator>rickmasters: srcfs must always start with "src 0 /", right?
<stikonas>argh, I had a stupid copy/paste typo in posix-runner... https://git.stikonas.eu/andrius/stage0-uefi/commit/ccf7808a130bae4f619da3847542d0da2efc5f18
<stikonas>which masked memory allocation error
<Googulator>if so, we can use that as a signature to find where the srcfs starts, instead of hardcoding specific sectors
<stikonas>it's not a proper fix yet, but at least we now have a useful error message rather than some mysterious failures
<rickmasters>Googulator: it does for live-bootstrap but technically that's not a requirement right now
<rickmasters>Googulator: handle_syscall_open verifies that the parent directory exists, except it hard codes an assumption that / exists
<rickmasters>Googulator: so you don't have to create /
<rickmasters>Googulator: But you have to create a dir that starts with / so I guess the answer is yes
<oriansj>stikonas: we probably need to add an empty stdint.h to M2libc; otherwise M2-Mesoplanet will fail to build (even though those types are already baked into M2-Planet)
<stikonas>oriansj: hmm, I guess it's fine
<stikonas>though what started causing this?
<stikonas>stdint.h was there since 2021...
<rickmasters>OK, so I extracted the Fiwix file system from the failed libtool build and /dev/null is a regular file with stuff in it.
<rickmasters>Perhaps we need /dev/null to be created properly on Fiwix
<stikonas>rickmasters: that might also be some other bug
<oriansj>stikonas: on the main branch?
<stikonas>rickmasters: I've seen how musl's configure script messes up with /dev/null and replaces it with a file (perhaps bug of bash-2.05/meslibc)
<stikonas>oriansj: I think so
<stikonas>at least git blame tells me that
<stikonas>oriansj: by the way, do you remember how our memory management works?
<stikonas>(in M2libc)
<stikonas>I wonder why repeatedly allocating/freeing memory seems to leak it
<stikonas>basically posix-runner allocates fairly large chunks of memory and saves the running state of a process (e.g. mes) while the child process is running; later it restores it and free()'s the memory.
<stikonas>yet there is some kind of leak
<stikonas>but perhaps it would be more practical for me to preallocate 1 GiB of memory for "saved processes"
<rickmasters>stikonas: I thought free was a no-op in early stages
<rickmasters>stikonas: after looking, it appears that's outdated; free now puts the block on a free list
<matrix_bridge><Andrius Štikonas> Indeed, but something is not working
<matrix_bridge><Andrius Štikonas> Either a bug in M2libc or in posix-runner
<rickmasters>stikonas: I imagine you rewrote malloc for UEFI due to lack of brk?
<stikonas>indeed
<stikonas>well, me and oriansj
<stikonas>perhaps we could have simply used what is done in builder-hex0
<stikonas>and what posix-runner now does...
<stikonas>i.e. allocate one very big block and use it for brk()
<rickmasters>is free_pool returning success ?
<stikonas>hmm, I would hope so... But this needs checking
<stikonas>I probably won't have time today to check
<stikonas>it's getting late
<rickmasters>hmm, the wrapper is void return so ...
<stikonas>hmm, yes, can't check...
<stikonas>but actually the same goes for free in POSIX
<stikonas>it's not supposed to fail
<stikonas>there is no error checking
<stikonas>libc should just free it when told
<stikonas>anyway, for now I'm happy that today I fixed a silent failure and made it print an error and will come back tomorrow evening...
<rickmasters>ah. well I hope you figure it out. have a good night
<stikonas>rickmasters: if not, I can simply manage memory inside posix-runner
<stikonas>just allocate big block for saved program memory...
<oriansj>stikonas: interesting git log -p is showing me that stdint.h was added on 684eade53e3eb9d2fdb6f7997b63dea0e961ef01
<matrix_bridge><Andrius Štikonas> https://github.com/oriansj/M2-Mesoplanet/commit/1ac5bb6eee99f59ee2cf3877e4bfc1466baf79b0#diff-b5864dbfdf74de06d9f19fe36f673c7d7063a85ddd62947692e8248ac380e244R22
<oriansj>well yes M2-Mesoplanet uses stdint.h; I was saying M2libc did not include a stdint.h file
<oriansj>M2-Planet doesn't use #include lines
<oriansj>only M2-Mesoplanet does
<oriansj>and M2-Mesoplanet fails when there is a #include <stdint.h> because the file stdint.h does not exist in M2libc
<rickmasters>So, upgrade of m4 appears to have resulted in /dev/null trouble on Fiwix. I'll continue digging into it.
<oriansj>and it looks like we put struct utsname in the wrong file
<oriansj>easy to fix
<rickmasters>update: So it looks like when Fiwix is running /dev/null is OK so I don't know how the m4 update caused the libtool failure yet.
<Googulator>just checked here as well, /dev/null is a character device (1, 3)
<rickmasters>The disk may not have been flushed when I extracted it from memory. I'm running again with a patch that flushes.
<rickmasters>Also, the build directory was missing so there wasn't much to go on. Hopefully my current build will provide more info.
<rickmasters>I have a good file system this time. The configure script seems small.
<rickmasters>I don't have a lot of autotools knowledge so figuring out what went wrong will probably take me a long time.
<Googulator>current live-bootstrap patch for fiwix filelist removal: https://gist.github.com/Googulator/300fa8845cb12731d3d02f905802f7c8
<Googulator>the builder-hex0 change is https://github.com/ironmeld/builder-hex0/pull/11
<Googulator>there's also an improvement for the early console; redirections, quoting and other more advanced shell features now work (although Ctrl+D is still required, and obviously there's no tab completion or history)
<rickmasters>Googulator: looks ok. There is an argument that can be made that reading next_file_num would be better to know what entries are valid. Of course,
<rickmasters>it's not in a stable location right now.
<rickmasters>But it could be moved to a stable location, like near the file data. It's not needed now but I thought I'd mention that.
<rickmasters>There are other variables that are scattered in the code that arguably should be moved to stable external memory. Maybe some day.
<Googulator>next_file_num alone wouldn't be enough, because of the duplicates
<Googulator>and eventually also because of unlink, once we implement it
<Googulator>the set of valid file entries isn't necessarily contiguous in memory
<rickmasters>Yes unlink would require zeroing the entries.
<rickmasters>Debugging the autotools is a real pain. I have no idea what the problem is with the new m4 version.
<Googulator>The configure script is clearly being generated wrong
<rickmasters>Yeah, I'm done for tonight. I hope figuring it out can be done with reasonable effort. Kernel bootstrap is nonfunctional in the meantime.
<Googulator>Original one in the tarball is 895933 bytes
<Googulator>The one generated by our autotools is just 11178 bytes
<Googulator>aclocal.m4 is likewise way too small
<Googulator>34924 bytes, vs 270526 in the tarball
<rickmasters>Yeah, configure is 450 lines on Fiwix, 28436 lines on chroot build.
<rickmasters>What's strange is the output is totally different, right near the start
<Googulator>calling autoconf-2.61 on a freshly extracted tarball is enough to break configure
<rickmasters>I'm assuming it's some m4 macro deep down somewhere dying in some weird way because it fails some kernel-specific check
<rickmasters>But I'm just groping around randomly right now
<fossy>stikonas: aargh, forgot about fiwix-file-list
<fossy>wtf happened to my gawk PR with the merges. i def removed the <<< lines
<fossy>oh well
<fossy>Googulator: 415 LGTM, merging
<matrix_bridge><Andrius Štikonas> fossy: should we revert m4 upgrade in the meantime?
<fossy>stikonas i think so, i'll just add new m4 to end of bootstrap instead i think
<fossy>it's not worth the hassle of debugging kernel bootstrap
<matrix_bridge><Andrius Štikonas> Yeah, at least not until we have a good reason...
<matrix_bridge><Andrius Štikonas> E.g. we want to build newer GCC and Linux in Fiwix
<Googulator>oriansj: is it just me, or is stdint.h (needed by unxz) missing from M2libc?
<Googulator>oh, it's a sub-submodule
<stikonas>fossy, Googulator: so diffutils build is not reproducible https://github.com/fosslinux/live-bootstrap/issues/431
<matrix_bridge><Lance R. Vick> I almost managed to build live-bootstrap without using -any- tools except from stage0 and OCI built-ins like ADD to download files.
<matrix_bridge>The last issue is when I extract live-bootstrap.tgz with debian tar, it works fine, but when I extract with stage0 ungz/tar it requires non-strict and creates a bunch of symlink warnings and then the build fails.
<matrix_bridge>Any known workarounds?
<matrix_bridge><Lance R. Vick> my best guess for a hack is manually copy files in all the places the symlinks would normally go
<matrix_bridge><Andrius Štikonas> Lance R. Vick: yes, that's right
<matrix_bridge><Andrius Štikonas> untar does not have any support for symlinks
<matrix_bridge><Andrius Štikonas> and symlinks would require extra syscalls...
<matrix_bridge><Andrius Štikonas> so copying them is probably the best you can do...
<matrix_bridge><Lance R. Vick> Fair enough.
<matrix_bridge><Lance R. Vick> a big ugly block of copies is still preferable to maintaining updates to debian pins for just a tar command
<matrix_bridge><Lance R. Vick> and the trust gap that creates
<matrix_bridge><Andrius Štikonas> hmm, I wonder how many symlinks we have in live-bootstrap...
<matrix_bridge><Andrius Štikonas> the thing is that bootstrap kernels have no notion of symlink
<matrix_bridge><Lance R. Vick> 11 atm by my count
<matrix_bridge><Lance R. Vick> all patchfiles
<matrix_bridge><Andrius Štikonas> e.g. builder-hex0 doesn't support them at all
<matrix_bridge><Andrius Štikonas> and for UEFI bootstrap, FAT also doesn't know about symlinks
<matrix_bridge><Andrius Štikonas> yeah, mostly musl and coreutils patches
<matrix_bridge><Andrius Štikonas> "find . -type l" finds 10
<matrix_bridge><Andrius Štikonas> some of them for patches, some of them build scripts and some of them for build directories (with multiple patches inside)
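[Note: a sketch of the "copy files to where the symlinks would go" workaround discussed above. It assumes a host with ordinary POSIX find and cp, so it is useful for preparing a symlink-free tree (or for working out which copies are needed), not for running inside the stage0-only environment. The function name is made up, and file names containing newlines are not handled.]

```shell
# Replace every symlink under a directory with a real copy of its target.
# replace_symlinks_with_copies is a hypothetical helper name.
replace_symlinks_with_copies() {
    root=$1
    find "$root" -type l | while IFS= read -r link; do
        # Skip dangling links: -e follows the link and fails if the
        # target is missing.
        [ -e "$link" ] || continue
        tmp="$link.copy.$$"
        # -L dereferences the link, so the copy is a real file or directory.
        cp -RL "$link" "$tmp"
        rm "$link"
        mv "$tmp" "$link"
    done
}
```

[Usage would be something like running replace_symlinks_with_copies on the checkout before creating the tarball; afterwards "find . -type l" should report nothing except dangling links.]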
<matrix_bridge><Lance R. Vick> Hmm. Copied them all but still explodes: https://dpaste.org/9UV26
<matrix_bridge><Lance R. Vick> much easier to read raw ^
<matrix_bridge><Andrius Štikonas> Hmm, maybe just diff two directories?