IRC channel logs

<damo22>youpi: i think we are caching the reads twice, one in librump and one in the libstore (?)

<damo22>i need to try your suggestion of memset 0

<damo22>but still wrong data

***janneke_ is now known as janneke

***rowanthorpe1 is now known as rowanthorpe

<damo22>youpi: im very confused with this backtrace, the bn=0 in rumpdisk_device_read, but one frame in it knows which offset

<damo22>#10 0x080abb42 in rump___sysimpl_pread (fd=3, buf=0x18006000, nbyte=512, offset=149176725504)

<damo22> at ./buildrump.sh/src/lib/librump/../../sys/rump/librump/rumpkern/rump_syscalls.c:2037

<damo22>#11 0x0803d30d in rumpdisk_device_read (d=0x1000ff80, reply_port=153, reply_port_type=18, mode=0, bn=0,

<damo22> count=512, data=0x1b834f1c, bytes_read=0x1b832e18) at ../../rumpdisk/block-rump.c:330

<damo22>is there some magic im not following?

<damo22>it looks like gdb cant print the bn

<youpi>damo22: sometimes there are optimizations that gdb cannot follow, yes

<damo22>i asked for bn 291360792 and i got this:

<damo22>Thread 32 hit Breakpoint 2, rumpdisk_device_read (d=0x1000ff80, reply_port=173, reply_port_type=18,

<damo22> mode=0, bn=268500864, count=461582048, data=0x1b832f0c, bytes_read=0x1b832e18)

<damo22>count should be 512

<youpi>best is to check the value within the caller

<youpi>there there cannot have been any optimization that makes the value confusing for gdb

<damo22>the caller is #1 0x0104a268 in _Xdevice_read (InHeadP=0x1b834ef0, OutHeadP=0x1b832ee0) at deviceServer.c:552

<damo22>i dont know how to decode that

<damo22>its an RPC?

<youpi>yes, but look at the call line

<youpi>you'll see what parameters are passed there

<damo22>ok it seems to be fine, but the data is not matching the gnumach driver at the same offset :(

<damo22>Thread 34 hit Breakpoint 3, ds_device_read (device=0x1000ff80, reply_port=156, reply_port_type=18, mode=0,

<damo22> recnum=291360792, count=512, data=0x1b834f1c, bytes_read=0x1b832e18)

<damo22>its definitely using DMA

<damo22>maybe the lookup is broken and its mapping the wrong physical address?

<youpi>the data that you get, is it always the same?

<damo22>no its different each block i try

<youpi>just to make sure: did you check the disk offset that the driver everntually gives to the device?

<youpi>I mean, for the same block

<damo22>yes i get the same result for the same block

<youpi>and after reboot/restart/etc.

<damo22>i put a breakpoint on wdread

<youpi>if it's always the same, it's probably not a problem of wrong physical address, the data would be quite random if that was the case

<damo22>i can see the correct offset is being requested on the actual device

<youpi>ok...

<damo22>(gdb) p *uio

<damo22>$1 = {uio_iov = 0x1b832c18, uio_iovcnt = 1, uio_offset = 149176725504, uio_resid = 512, uio_rw = UIO_READ,

<damo22> uio_vmspace = 0x1000c000}

<damo22>that is for block 291360792

<damo22>in wdread

<youpi>that's not the latest layer

<damo22>the final layer is probably in pci-userspace where it does a dma transfer?

<youpi>no, the dma doesn't care about the disk offset

<youpi>it's the pio command that tells the device which offset

<damo22>atastart?

<youpi>_wdc_ata_bio_start

<youpi>and wdccommandext more precisely

<youpi>or perhaps it's satafis_rhd_construct_cmd since it's an ahci fis

<youpi>err, rather satafis_rhd_construct_bio

<youpi>since it's a bio

<youpi>I'd indeed expect that since wdstart1 calls wd->atabus->ata_bio, which I guess for ahci is ahci_ata_bio, which records ahci_bio_start for the start, which calls satafis_rhd_construct_bio

<youpi>you can as well put prints/breaks on all of them, to be sure

<damo22>ata_bio->flags & ATA_LBA48

<damo22>its not LBA48

<youpi>how so?

<youpi>>128Gb has to be > LBA28

<damo22>its landing on line 116

<damo22>(gdb) p ata_bio->flags

<damo22>$1 = 48

<youpi>(use p/x)

<youpi>that's very odd, don't the probe prints say it's lba48 ?

<youpi>that's supposed to be set in wdstart1

<youpi>possibly the computation with blkno / bcount / d_secsize / sc_capcacity28 is wrong?

<damo22>yeah maybe

<youpi>(the formula seems right, but the values are possibly badly detected)

<youpi>bcount is a number of bytes, right?

<damo22>yes

<youpi>and you said that blkno is 291360792, right?

<damo22>#define ATA_LBA48 0x0080

<youpi>(which is really more than 2^28)

<damo22>yes

<youpi>I gues it's sc_capacity28 which is wrong

<youpi>it seems to be computed in wdattach

<youpi>uh, it looks like a bug in qemu?

<damo22>wd0: drive supports 16-sector PIO transfers, LBA48 addressing

<youpi> http://dept-info.labri.fr/~thibault/tmp/blip.png

<youpi>it says LBA can address the 200GiB

<youpi>that can't be true

<youpi>I'd say try to cap that in wdattach, to 2^28

<youpi>it cannot be more than this at any rate

<damo22>wd0: 298 GB, 620181 cyl, 16 head, 63 sec, 512 bytes/sect x 625142448 sectors

<youpi>2^28-1 actually

<youpi>yes but see my screen capture

<youpi>that's the hdparm -I

<youpi>the LBA user adressaable sectors line is the atap_cacacity value used in wdattach

<youpi>that wdattach uses for sc_capacity28

<damo22>where did you get hdparm for hurd?

<youpi>I didn't

<youpi>I just ran it in linux in qemu

<damo22>cheater

<damo22>:P

<youpi>you have all the cheat rights when you're debugging

<damo22>indeed

<youpi>I tried on a real machine, that LBA value is 0xfffffff, as it should

<youpi>so I'd really say it's a bug in qemu

<youpi>odd that it was never found before

<damo22>wow

<youpi>since that's the bsd code in rump, nobody used bsd on qemu with large devices?

<damo22>potentially

<youpi>so try to cap that value in wdattach, check that it fixes it, and submit the fix to the bsd people

<youpi>we can also commit the fix to the hurd package

<damo22>why not fix qemu

<youpi>I(m on it

<damo22>its not a bug in bsd

<youpi>not dealing with hardware quirks is a bug, actually :/

<youpi>I would really not be surprised that low-cost disks could have the same issue

<youpi>just because for instance windows always uses lba48 when available

<youpi>and thus it was never actually exercised with bsd

<damo22>ok

<youpi>which would be the only to use lba28 on large devices

<damo22>but we are on 7.99.99

<damo22>oldddd

<damo22>maybe they fixed it alrady

<damo22>already*

<youpi>eh? rump doesn't update to newer bsd?

<youpi>what's the point if they don't keep up?

<damo22>we need to fix it upstream so we can get the new code

<damo22>thats why i suggested to merge buildrump.sh into build.sh

<damo22>then upstream netbsd can build rump

<youpi>I mean is our rump at 7.99.99 because we have an old snapshot of rump, or is it that rump itself is not updating?

<youpi>yes, sure we should merge everything upstream

<youpi>no doubt on this

<damo22>rump is part of netbsd but they turned it off in their builds

<damo22>because the makefile is borken

<youpi>ew

<youpi>well, then fix them :)

<damo22>yeah

<damo22>we have the latest rump that pooka supported, it used to be maintained out of tree, but it actually should build as part of netbsd source

<youpi>yes it really should

<youpi>otherwise it'll always be laggy etc.

<damo22>they have a make target for it, but its disabled

<youpi>and not receiving upstream fixes

<damo22>someone got it to work on 9.something

<youpi>anything that is not integrated in CI gets bitrot very quickly

<damo22>yeah

<damo22>i need to work on that too

<damo22>i need a netbsd vm

<damo22>i probably have one

<youpi>well, for a start let's finish fixing lba48 :)

<youpi>(on qemu, that is)

<damo22>ok

<damo22>ide_identify_size

<damo22>seems to populate the values correctly

<damo22>put_le16(p + 93, 1 | (1 << 14) | 0x2000);

<damo22>im guessing that is related

<youpi>I'm on it, I told you :)

<youpi>will Cc you

<damo22>thanks, this bug was killing me

<damo22>let me confirm fix

<damo22>fixewd

<youpi>\o/

<damo22>:D

<damo22>i still have to send in patches

<damo22>for 64 bit

<damo22>offsets

<damo22>rump is currently built with sizeof(off_t) = 4

<damo22>thats probably why you didnt experience any issues with small disks

<damo22>and my strange files are restored

<damo22>\o/

<damo22>it may not be a bug in netbsd anymore: (20:36:40) mlelstv: if a) drive supports LBA48 and ( b) transfer ends beyond LBA28 boundary or (c) transfer size > 128 sectors ) use LBA48, otherwise use LBA28.

<damo22>im not sure

<damo22>actually it still checks the drive param pretty sure

<youpi>but I guess he is still assuming that the reported LBA28 capacity is correct

<damo22>yes

<damo22>i raised that issue with him, he will talk to jakllsch about it

<damo22>so the disk is supposed to advertise a maximum sector count for LBA28 mode

<damo22>which can be <= 2^28-1

<youpi>yes

<damo22>but qemu used to just truncate the 48 offset and subtract one?

<youpi>yes

<damo22>good catch

<damo22>so the driver can do a hardcoded bounds check for > 2^28-1 and assume firmware bug