IRC channel logs
2023-11-29.log
<oriansj> I'd donate a VPS to help on that effort but I am not exactly keen on administrating another server
<stikonas> well, it mostly works, so probably fine for now...
<oriansj> fair enough, not like the wiki.bootstrapping.world gets much network traffic anyway
<Googulator> stage1 now seems to work on all of my boards (+ qemu), with almost all settings
<Googulator> I also implemented printing out the boot sector itself in hex, and the 2nd stage in its textual form, as well as the "wait for stage2 to be available" idea for the trusted Flash drive plan
<Googulator> with the original boot sector, many boards would just lock up at a black screen due to Int 13h returning an unexpected number of sectors read
<Googulator> In stage2, both reading and writing are via LBA now (although writing is untested and somewhat incomplete)
<Googulator> There's also some debug code in the LBA sector read path (it prints an asterisk every time read_sectors_16 is called, a # for every additional int13 call within the same read_sector_16 in multi-sector reads, and some debug output when it goes wrong)
<Googulator> Unfortunately, there's one issue I can't figure out: on all of my motherboards, with certain BIOS settings, somewhere between sector 180 and 500 (exact number random), reads start timing out
<Googulator> This also happens if I switch back to the original stage2 with CHS access
<Googulator> All boards tested are LGA775 - one with P35, one with G31, one with NForce 650, and one with i915G
<Googulator> Several drives (including a known good SSD), several SATA cables, and several sets of RAM all reproduce the issue
<Googulator> Only one of my boards (Asus P5K Premium, with P35) can bypass this and complete the bootstrap, by setting the SATA controller into AHCI mode, or by using an IDE->SATA adapter connected to a JMicron IDE controller on the board
<Googulator> (the JMicron would previously blackscreen in the original stage1)
<Googulator> Also 5 different CPUs tested (a Northwood, a Cedar Mill, an Allendale, a Wolfdale and a Kentsfield)
<Googulator> On the other boards, I have no option for AHCI, and connecting via an IDE->SATA adapter doesn't change anything
<Googulator> Besides AHCI and that adapter, the only thing that has any effect is forcing PIO or SWDMA (as opposed to MWDMA or UDMA), which makes the very first read (issued by stage1 using CHS) already time out consistently
<Googulator> this is also the only instance I found where stage2 fails to even get called by stage1
<Googulator> It almost feels like the BIOS loads, in the background, a larger block than just the 1 sector requested, then reads from that cached block on subsequent reads, until it runs out of cached data, and dies trying to touch the actual drive again
<Googulator> PIO/SWDMA presumably forces single-sector access, causing only 1 sector worth of buffered data, hence sector 2 already times out
<Googulator> Enhanced IDE and Compatible IDE fail in the mode described, just like the other boards
<Googulator> Those other boards have no option for AHCI or RAID (except for the NForce, which has RAID, but no RAID boot support)
<Googulator> Using qemu, I found no way to reproduce this, even if I set qemu to emulate a Q35 chipset (close relative of the P35 & G31)
<fossy> Googulator: are 180 and 500 the lower and upper bounds respectively that you've seen for the randomness? and you can't repro that on qemu at all?
<oriansj> is the carry flag set? The carry flag will be set if extensions are not supported.
<Googulator> Another theory I have is maybe BIOS is leaking some resource on every int 13h call
<Googulator> oriansj: the carry flag is set, but AH has a different error code than when extensions aren't supported
<Googulator> also, extensions being supported up to some sector and then becoming unsupported, with the actual boundary varying from boot to boot, seems unlikely
<oriansj> Set AH = 0x41, BX = 0x55AA, DL = 0x80 and issue an INT 0x13
<Googulator> plus, function 0x2 (plain CHS read) also reproduces this
<oriansj> The carry flag would be clear afterwards if the LBA extended mode is supported
<Googulator> Carry flag on 0x02 is certainly not an extension issue
<Googulator> & if you get a carry flag, AH contains the status code
<Googulator> & it actually sits for something like 30 seconds in the interrupt (always the same time) before it reports 0x80
<Googulator> also, no extensions support on a P35 would be _very_ unlikely
<oriansj> and the transfer must fit in the usable part of low memory, right?
<oriansj> I am not seeing SI being set to the address of a disk address packet
<oriansj> seeing :addr_packet but not SI being set to it prior to the interrupt (still reading)
<Googulator> BE $addr_packet # mov si, $addr_packet ; disk address packet
<Googulator> Also, _this code works_ if I set SATA mode to AHCI or RAID
<Googulator> then it fails in the exact same way as the CHS-based code we're currently using
<oriansj> if :dest_segment and :dest_offset stayed constant does the issue still occur?
<Googulator> dest_segment and dest_offset are always 0xA000 and 0x0000, respectively, when the error occurs
<oriansj> sorry I mean :num_sectors_bios is always 1
<oriansj> what is the cache size on your disks?
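For readers following the disk address packet discussion above, here is a minimal Python sketch of the 16-byte packet that INT 13h AH=42h expects at DS:SI. The function name `build_dap` is hypothetical and not from the stage2 source; the example values (1 sector, destination 0xA000:0x0000, LBA 180) are taken from numbers mentioned in the conversation purely for illustration.

```python
import struct

def build_dap(lba, sector_count, dest_segment, dest_offset):
    """Pack an INT 13h AH=42h disk address packet (hypothetical helper).

    Layout, little-endian:
      byte  0     packet size (0x10)
      byte  1     reserved, must be 0
      word  2-3   number of sectors to transfer
      word  4-5   destination offset
      word  6-7   destination segment
      qword 8-15  starting LBA
    """
    if not 0 <= lba < 2**64:
        raise ValueError("LBA must fit in 64 bits")
    if not 0 <= sector_count <= 0xFFFF:
        raise ValueError("sector count must fit in 16 bits")
    return struct.pack("<BBHHHQ", 0x10, 0, sector_count,
                       dest_offset, dest_segment, lba)

# Example values borrowed from the log, for illustration only.
dap = build_dap(lba=180, sector_count=1, dest_segment=0xA000, dest_offset=0x0000)
print(dap.hex())
```

Dumping the packet bytes like this right before the interrupt is one way to confirm that what SI points at matches what the code intends to request.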
<Googulator> The other is a WD2500BEVT, it doesn't state the cache size on the label
<Googulator> 3rd drive is an SSD, I can't find any data on cache size on it
<oriansj> well a cache miss could produce 80h timeout (not ready) but a second read attempt should be a hit
<oriansj> and the LBA isn't near the 504 MiB LBA of the drive, right?
<oriansj> does an AH = 00h int 13h disk reset help?
<Googulator> Even weirder - physically unplugging and replugging a disk also doesn't seem to clear the fault, even though, if I do that before the problem occurs, it does continue
<Googulator> (I can even swap to a different physical drive - useful for booting from an SSD on a board that craps out on RPM being 0)
<oriansj> but that is only for external drives
<oriansj> well unbalanced stacks are easy to do in assembly but honestly I am out of ideas at the moment
<Googulator> hmm, engineered a scenario where a timeout occurs while still in stage0 (by giving it a huge garbage text instead of actual hex0 code), and it froze at sector 15271 (decimal)
<Googulator> & all real mode, no real<->protected switcheroo yet
<Googulator> as in, you have X seconds to finish booting before the BIOS kills int 13h for you
<Googulator> stage2 is replaced by Metasploit's non-repeating pattern (repeated to fill a much larger area, but since I know the error is on a sector boundary, it's not practically repeating yet)
<oriansj> Googulator: Well some hard drives have behavior left over from Windows 95 which is designed to prevent user programs from bypassing the kernel. But I don't know the technical details of implementation and a timeout seems like a cheap and easy way to do that.
<muurkha> you have to work around the workarounds for the workarounds
<Googulator> Ran the engineered failure case (where we freeze in stage0) a few more times, and it *very* consistently freezes at exactly the same sector, at the exact same time after it gets control.
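On the 504 MiB question raised above: that boundary comes from the classic combined BIOS/ATA CHS geometry limit of 1024 cylinders, 16 heads and 63 sectors per track. The short Python sketch below works out where it falls in sectors and compares it with the failing sector numbers quoted in the log. The helper name is hypothetical, and 512-byte sectors are assumed.

```python
SECTOR_SIZE = 512  # bytes per sector (assumed)

def chs_capacity_sectors(cylinders=1024, heads=16, sectors_per_track=63):
    """Number of sectors addressable under the classic combined CHS limit."""
    return cylinders * heads * sectors_per_track

limit_sectors = chs_capacity_sectors()
limit_mib = limit_sectors * SECTOR_SIZE // (1024 * 1024)

# Sector numbers mentioned in the conversation as failure points.
observed = [180, 500, 15271]
all_below = all(s < limit_sectors for s in observed)

print(limit_sectors, limit_mib, all_below)
```

Under these assumptions every failing sector quoted in the log sits far below the boundary, which is consistent with the limit not being the cause.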
<Googulator> And now went back to the original stage2 (which gets loaded, but freezes early in reading srcfs), and timed it - it freezes exactly at the same point in time too!
<Googulator> "33 C0" being used as a signature for a "modern" bootloader, anything else gets the Windows 9x treatment
<Googulator> unless AHCI or RAID is enabled, which are incompatible with Windows 9x, so no 9x-specific hack applied
<oriansj> Googulator: now to isolate if it is the sector or the amount of time
<Googulator> intentionally slowed down stage 1 by printing every character 5 times
<oriansj> sounds like a stack issue; time to start dropping DEADBEEF into the stack
<Googulator> yeah, it appears it's not just int 13h that's breaking
<Googulator> the code only ever prints "dark white" on black - and yet there's color
<Googulator> what do you mean dropping DEADBEEF into the stack
<oriansj> By pushing DEADBEEF onto the stack, it becomes easy to spot if you are doing more pushes than pops
<oriansj> or more pops than pushes on certain code paths
<Googulator> after reset, it stopped at a different location - but there's corrupted output on the screen again
<Googulator> did a different test: I print the value of sp before and after each int 13h call, to see if the stack pointer is moving
<Googulator> makes sense given that there's only one location where we call read_sector_16
<Googulator> this also confirms that in stage1, the freeze is not triggered by an int 13h call, but rather by an int 10h
<Googulator> because there's no SP printout right before the freeze
<oriansj> well you are using Teletype output AH=0Eh so BL will be the reason for colors
<Googulator> stage0 reads and compiles hex0 to increasing addresses starting @ 0x7E00
<Googulator> ...until it hits a point between 0x80000 and 0xA0000 that varies depending on the BIOS (the start of the EBDA)
<Googulator> from then on, it begins overwriting BIOS private data with compiled code
<Googulator> & if it still survives, after 0xA0000, you're writing into framebuffer & MMIO space :)
<Googulator> as an added bonus, because stage1 is still using CHS, and it's a really simplified CHS code, it never actually reads past the 1st cylinder
<Googulator> if I nop out the "stosb", we no longer freeze in stage1
<Googulator> (I don't know why I sometimes call stage1 "stage0")
<oriansj> Googulator: it is hard to keep every detail in our heads, that is why we add comments
<Googulator> Of course, all this means is that we still have no answer as to why stage2 is failing
<oriansj> well, puzzles to solve and fun to be had
<matrix_bridge> <Andrius Štikonas> Well, UEFI was also easy to crash if I accidentally wrote to its memory
<matrix_bridge> <Andrius Štikonas> Does BIOS provide any api to get memory map?
<Googulator> No stack creep issue in stage2 either, SP always @ 0xF07A when int 13 is called
<Googulator> tested the latest stage2 from ironmeld repo on the affected hardware, it locks up exactly the same
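To put numbers on the overrun described above, the Python sketch below computes how many bytes can be written upward from 0x7E00 before reaching a given limit. The two bounds 0x80000 and 0xA0000 come from the log. The 639 KiB figure is only an illustrative assumption: INT 12h reports the size of contiguous low memory in KiB in AX, and the EBDA commonly (not always) begins at that boundary; INT 15h with EAX=E820h returns a fuller memory map. The function names are hypothetical.

```python
LOAD_ADDR = 0x7E00  # where compiled output starts, per the log

def kib_to_limit(kib):
    """Convert a low-memory size in KiB (as INT 12h reports it) to a linear address."""
    return kib * 1024

def headroom(limit, start=LOAD_ADDR):
    """Bytes that can be written from `start` before reaching `limit`."""
    if limit < start:
        raise ValueError("limit lies below the load address")
    return limit - start

lowest = headroom(0x80000)    # earliest possible EBDA start mentioned in the log
highest = headroom(0xA0000)   # start of the framebuffer / MMIO region
example = headroom(kib_to_limit(639))  # assumed 639 KiB of low memory

print(lowest, highest, example)
```

A bounds check of this kind against the reported limit, done before each store, would turn a silent overwrite of BIOS data into a detectable error.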