IRC channel logs
2024-01-25.log
<Googulator> USB is orders of magnitude slower than SATA here with BIOS too
<Googulator> srcfs load time from a SATA SSD here is 5 minutes on a bad day, probably even faster
<stikonas> yeah, but to get to hex1.efi you only need to read a few KiB
<Googulator> CHS vs LBA made no difference in performance (using SATA)
<stikonas> though I guess that also included time to rebuild itself and kaem
<Googulator> (with floppy emulation, it's the opposite, only CHS is supported - but we take steps to avoid floppy emulation)
<fossy> ok that is just taking far too long and idk if its hanging or what
<fossy> seems to be going a bit faster at least?
<fossy> still probably about an order of magnitude slower than stage0-posix
<stikonas> perhaps the way stage0-uefi does I/O is not optimized in your UEFI implementation...
<stikonas> basically impossible on real HW and somewhat annoying in qemu
<fossy> ok, running the usb on my laptop now, it is WAYYY faster
<fossy> at least 100x faster on the usb than it was on the other PC
<fossy> deffo implementation reasons
<fossy> yep finished in 45 seconds LOL and the other PC is still on cc_amd64 i think
<fossy> unless it hung but i cant tell cause it's too slow
<fossy> stikonas: actually, there is a problem; the building of M2-Planet.efi should have --little-endian in the hex2 invocation, that causes M2-Planet.efi to fail on my system
<fossy> no idea how it worked on any system tbh
<fossy> ENDIAN_FLAG appears to be empty
<fossy> yeah, something else did go wrong
<stikonas> sounds like environmental variables are not working...
<fossy> yeah, only ENDIAN_FLAG though, oddly
<fossy> yeah i'll have a play around
<fossy> not much you can do if you cant repro
<stikonas> well, you can at least add print stuff in kaem
<fossy> yeah, thankfully its not a hex2 stage thats problematic or something, i would give up then
<stikonas> yeah, if it's hex2 stage and you are not on qemu
<stikonas> basically the only way to debug that I found was to run it in UEFI shell and rely on exit status
<Googulator> rickmasters: sent some juicy stage1 optimization PRs
<Googulator> we're back to 192 bytes of actual code+data space used
<fossy> never seen that, seems to be an object ordering thing
<fossy> what --jobs are you running at
<lrvick> 20 on one machine and 30 something on another
<lrvick> the only linking flag that feels like it might impact this is separate-code vs --disable-separate-code
<lrvick> noticing this now, all the binutils builds without any issues have --disable-separate-code
<lrvick> Makes me think the codepath for separate-code linking layout is not deterministic.
<lrvick> also it saves like 2 bytes or something so who cares
<lrvick> ACTION tries building with --disable-separate-code
<lrvick> Yeah I got the same hash on two different systems, but then a different hash on a third and that diff
<janneke> that's on guix core-updates branch, fwiw we're not using --disable-separate-code
<lrvick> only 1/3 builds being different for me is making this heisenbug status. Will see if I can reproduce in another set of rounds with --disable-separate-code on the same set of machines.
<fossy> lrvick: hm, suspect it's a race (possibly when separate-code is enabled), i usually am only running with -j6 to -j10
<oriansj> fossy: the hex2 used to do the first build of M2-planet would be the one written in assembly and not the one in mescc-tools; so it would only support the host architecture.
<oriansj> and yes, caching of the reads/writes would have a significant performance difference in the early stage0 steps. (literally seconds vs hours depending on your firmware's ability to cache pages)
<oriansj> microseconds latency vs nanoseconds latency is kinda a big delta and if they do a read from the USB flash for every byte and a write for every byte and wait for completion, then yes that is the sort of difference compared to just copying a byte into a block of RAM until a flush or close.
<muurkha> I didn't realize it was doing byte reads from the USB flash
<muurkha> presumably that involves, as a subroutine, doing a block read from the USB flash
<oriansj> and then clearing the block after the read and having to redo the block read again at the next read call
<oriansj> which again just reads another single byte from that block
<oriansj> as the stage0 steps leading up to M2libc are all single byte read/write from beginning to end; (this is to reduce complexity in the pieces but it really kills performance on badly written firmware/kernels)
<muurkha> I feel like you could mostly close the nanosecond gap with if (requested_block == current_block) return;
<muurkha> well, I guess it's not quite that simple!
<muurkha> so it's like three or four machine instructions and a static variable
<matrix_bridge> <Andrius Štikonas> They start with tinycc, POSIX shell and coreutils
<matrix_bridge> <Andrius Štikonas> Generally by then bootstrap is fairly simple and resembles distro packaging
<matrix_bridge> <Andrius Štikonas> C library is often one of the trickier bits
<oriansj> muurkha: well no, because there are writes between the reads to a different block, so you would need to support at least 3 pages of disk being cached and it'll take a bit more logic to do even a last in last out cache.
<oriansj> 4 pages to cover the case of new read page and new write page but what is an extra 4KB of cache when you have a 100+MB firmware blob involved.
<muurkha> oriansj: in some cases yes. Forth only guaranteed 2
<muurkha> but in lots of cases you're reading bytes to put them somewhere other than immediately another disk block
<oriansj> true buffering (caching) is rather handy for good performance and one doesn't need much memory to get 90% of the benefit
<muurkha> but it's true that you do need more than three machine instructions. maybe something like if (num == keys[0]) return 0; if (num == keys[1]) return 1; last ^= 1; keys[last] = num; /* proceed to block loading logic */
<oriansj> muurkha: well doing 4 pages would be something like: and eax, 0xFFFFF000 ; mov ebx, 1; cmp eax, [$buf1]; je done; cmp eax, [$buf2]; je done; cmp eax, [$buf3]; je done; cmp eax, [$buf4]; je done; (boring page load logic and ); :done mov eax, [ebx+$buf1]; return;
<oriansj> not very complicated or slow but will be slightly less likely to hit a page fault than a 2 buffer cache
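[Editor's note] The small block cache that muurkha and oriansj outline above can be sketched in C roughly as follows. This is an illustration only, not code from stage0, stage0-uefi or any firmware: fake_disk, device_read_block, cached_read_byte, uncached_read_byte and the round-robin replacement policy are all hypothetical names and choices made up for the example, with an in-memory array standing in for the slow USB device and a counter standing in for its latency.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define BLOCK_SIZE  4096u   /* assumed page/block size, as in the 0xFFFFF000 mask */
#define CACHE_SLOTS 4u      /* the "4 pages" case discussed above */
#define DISK_BLOCKS 8u      /* size of the pretend device (hypothetical) */

/* Hypothetical stand-in for the slow device: an in-memory "disk". */
uint8_t fake_disk[DISK_BLOCKS * BLOCK_SIZE];
unsigned long device_reads; /* counts how often the "device" is touched */

/* Pretend block read; on real hardware this is the expensive part. */
void device_read_block(uint32_t block, uint8_t *dst)
{
    assert(block < DISK_BLOCKS);
    memcpy(dst, &fake_disk[block * BLOCK_SIZE], BLOCK_SIZE);
    device_reads++;
}

/* Cache state: one tag (block number) and one valid flag per slot. */
static uint8_t  cache_data[CACHE_SLOTS][BLOCK_SIZE];
static uint32_t cache_tag[CACHE_SLOTS];
static int      cache_valid[CACHE_SLOTS];
static unsigned next_victim; /* round-robin replacement (an assumption) */

/* Single-byte read that only touches the device on a cache miss. */
uint8_t cached_read_byte(uint32_t offset)
{
    uint32_t block = offset / BLOCK_SIZE;
    unsigned i;

    /* Hit path: a handful of compares, like the cmp/je chain above. */
    for (i = 0; i < CACHE_SLOTS; i++)
        if (cache_valid[i] && cache_tag[i] == block)
            return cache_data[i][offset % BLOCK_SIZE];

    /* Miss path: the "boring page load logic". */
    i = next_victim;
    next_victim = (next_victim + 1) % CACHE_SLOTS;
    device_read_block(block, cache_data[i]);
    cache_tag[i] = block;
    cache_valid[i] = 1;
    return cache_data[i][offset % BLOCK_SIZE];
}

/* The pathological behaviour described above: one block read per byte. */
uint8_t uncached_read_byte(uint32_t offset)
{
    static uint8_t scratch[BLOCK_SIZE];
    device_read_block(offset / BLOCK_SIZE, scratch);
    return scratch[offset % BLOCK_SIZE];
}
```

Reading two blocks' worth of bytes one at a time through cached_read_byte should touch the pretend device twice, while uncached_read_byte touches it once per byte. The sketch covers reads only; the write-back handling oriansj mentions (writes to a different block between reads) would need dirty tracking on top of this.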