IRC channel logs

2019-11-19.log

***Server sets mode: +nt
***justan0theruser is now known as justanotheruser
<damo22>youpi: when i call "umount /mnt" i get one deallocate bug
<damo22>maybe the bug is in ext2fs?
<youpi>does that happen without rump?
<youpi>I have never seen that
<damo22>interesting question
<damo22>no
<damo22>but how does umount work
<damo22>if ext2fs is being passed an invalid port to play with then
<damo22>it actually unmounts the rumpdisk
<damo22>hmm it never calls device_close
<youpi>yes, I've already noticed that
<youpi>that happens with xen's block device too
<damo22>im using the devp setting from libmachdev/block.c
<damo22>225 *devp = ports_get_right (bd);
<damo22>(gdb) p *devp
<damo22>$5 = 108
<damo22>(gdb) n
<damo22>ds_device_open (open_port=127, reply_port=115, reply_port_type=18, mode=3,
<damo22> name=0x1fffd5c "/dev/wd0", devp=0x2001d64, devicePoly=0x1fffc7c)
<damo22>does that look right?
<damo22>0x2001d64: 0x0000006c = 108
<youpi>yes, what do you put in *devicePoly ?
<damo22>20
<damo22>i'll have to find the source im currently in gdb
<damo22>MACH_MSG_TYPE_MOVE_SEND
<damo22>no thats commented out
<damo22>the other one
<damo22>MACH_MSG_TYPE_MAKE_SEND
<damo22>224 // from net.c
<damo22>225 *devp = ports_get_right (bd);
<damo22>226 *devicePoly = MACH_MSG_TYPE_MAKE_SEND;
<damo22>ahh the reply port is sometimes the same as the normal port
<damo22>and its causing problems
<damo22>(i think)
<damo22>Thread 1 hit Breakpoint 1, device_open (reply_port=78, reply_port_type=18,
<damo22>Thread 1 hit Breakpoint 1, device_open (reply_port=115, reply_port_type=18,
<damo22>Thread 1 hit Breakpoint 1, device_open (reply_port=78, reply_port_type=18,
<damo22>its difficult to debug because the deallocation occurs zillion times
<damo22>maybe 1000
<damo22>it seems to happen after processing the mach msg from device_open
<damo22>ok i have something interesting:
<damo22>Thread 1 hit Breakpoint 1, ds_device_open (open_port=127, reply_port=110,
<damo22>Thread 1 hit Breakpoint 1, ds_device_open (open_port=108, reply_port=125,
<damo22>Thread 1 hit Breakpoint 1, ds_device_open (open_port=127, reply_port=108,
<damo22>and in between its a bogus port 110
<damo22>if ds_device_open is calling the emulation device_open, why does it already know the port?
<damo22>i am allocating a port for the device in device_open
<youpi>make_send should be fine
<youpi>that's not supposed to happen that the open_port (the device master port) would be the same as the port you are allocating
<youpi>oh perhaps it could
<youpi>see is_master_device
<youpi>do you know what the device master port is?
<damo22>no
<youpi>ok
<youpi>that's what openers use to show they are allowed to access the resource
<youpi>processes running with uid=0 have it
<youpi>when calling device_open(), they pass it along
<youpi>so device_open can check that it's the master port
<youpi>that's what is_master_device() tests
<youpi>by checking that it's a port known in the mach port_bucket
<youpi>and if so, it calls ports_port_deref()
<youpi>since it doesn't need the port any more
<youpi>after checking it
<youpi>so that will deallocate the port
<damo22>so where do i call is_master_device?
<youpi>and thus it's normal that allocating a port for the device itself will return the same port
<youpi>it is already called in ds_device_open()
<damo22>no its not
<damo22>i have cloned that code
<youpi>there however seems to be one issue: if for some other reason the function fails, one should not have released the port
<youpi>because on error mig does the cleanup
<damo22>i have a copy of ds_routines.c
<damo22>in libmachdevrump
<damo22>it does not have is_master_device()
<youpi>where is that directory?
<damo22>in incubator
<youpi>I'm really lost in what source code you are actually using
<damo22>i havent quite pushed what i am running now
<youpi>ok, then in the long run you will want to call is_master_device()
<damo22>but most of it is in incubator
<youpi>otherwise you'll let any process open it
<youpi>now, that means that it's not being deallocated there
<youpi>and thus it's surprising that you'd get the same port for the device port
<damo22>ok
<damo22>> if (!is_master_device (open_port))
<damo22>> return D_INVALID_OPERATION;
<damo22>these two lines are missing from my src
<damo22>i cant remember why i removed it
<damo22>i thought it was unnecessary because i am not a mach device
<damo22>as in not one in gnumach
<youpi>more precisely, the unix permissions should be fine enough
<youpi>but deallocating the open_port on success will be needed, otherwise that'll be a port leak
<damo22>ports_port_deref(bd) ?
<youpi>no, mach_port_deallocate(open_port)
<youpi>bd has nothing to do with bd
<youpi>err
<youpi>open_port has nothing to do with bd
<damo22>ahhh ok
<damo22>it seems to be complaining about the reply_port
<damo22>i added mach_port_deallocate()
<damo22>it seems that the reply_port from a previous call to device_open is not being deallocated and it tries to deallocate the wrong port the next time but the reply port has changed
<damo22>Thread 1 hit Breakpoint 1, ds_device_open (open_port=115, reply_port=114,
<damo22>Thread 1 hit Breakpoint 1, ds_device_open (open_port=112, reply_port=108,
<damo22>bogus port 114
<youpi>when is "bogus port" printed exactly?
<youpi>are you sure it happens after or before ds_device_open is getting called?
<damo22>let me repeat it and i will document
<youpi>really, tracing from the kernel directly provides the answer
<youpi>just enable kdb
<damo22>i did
<youpi>switch the bogus allocation variable
<damo22>yeah
<youpi>and trace/u will show you the backtrace
<damo22>the backtrace is useless because its in mach port loop
<youpi> /u shows you the user part
<damo22>its in mach_port_deallocate
<youpi>yes but I'm telling you that trace/u ALSO prints the user part
<damo22>0x81b8c1c
<youpi>that's a userland pointer
<youpi>inside your userland process
<youpi>so you can addr2line it
<damo22>0x81b8c1c <syscall_mach_port_deallocate+12>: 0x909066c3
<youpi>see, that's the userland part of the syscall
<youpi>and below you have the callers
<damo22>theres only 1 caller
<damo22>below that
<youpi>which is?
<damo22>i need to paste here so i dont lose it 0x814e755
<youpi>it could be useful to build your program with -fno-omit-frame-pointer, so it'll be easier for kdb to unroll the stack
<damo22>0x814e755 <ports_manage_port_operations_one_thread+117>
<damo22>its in the port loop
<youpi>ok, so it must be on cleanup after processing the message
<damo22>yep
<youpi>i.e. your message management did one deallocation that it shouldn't have done
<youpi>and when cleanup came it did it again
<damo22>ok
<youpi>you could use gdb to go step by step there, to see which port exactly is cleaned up a second time
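
The double release described above can be illustrated with a toy model. This is not Mach code: the `urefs` table and `toy_port_deallocate` are hypothetical stand-ins for a task's user references on port names, and the -1 result only loosely corresponds to the "bogus port" complaint in the log.

```c
#include <assert.h>

/* Toy stand-in for a task's port name space: urefs[name] is how many
   user references the task holds on that name.  Hypothetical, not Mach. */
enum { TOY_MAX_NAMES = 256 };
static unsigned urefs[TOY_MAX_NAMES];

/* Drop one reference.  Returns 0 on success, -1 if the name holds no
   reference any more (the "bogus port" case in this toy model). */
static int toy_port_deallocate(unsigned name)
{
    assert(name < TOY_MAX_NAMES);
    if (urefs[name] == 0)
        return -1;
    urefs[name]--;
    return 0;
}

/* Process one message carrying a single reference on `name`.
   If handler_releases is nonzero, the handler drops the reference itself;
   the message loop's cleanup then drops it again.
   Returns the result of the cleanup release. */
static int toy_handle_message(unsigned name, int handler_releases)
{
    assert(name < TOY_MAX_NAMES);
    urefs[name] = 1;                    /* reference carried by the message */
    if (handler_releases)
        toy_port_deallocate(name);      /* handler's own, extra release */
    return toy_port_deallocate(name);   /* cleanup after processing */
}
```

In this model `toy_handle_message(114, 0)` succeeds, while `toy_handle_message(114, 1)` fails in the cleanup step, which matches the backtrace ending in the port loop rather than in the handler.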
<damo22>could it be that im doing the rpc with too many elements in the message?
<damo22>because i overloaded the device struct
<damo22>i dont know where to set the size of the message
<youpi>what device struct?
<damo22>block_data
<youpi>that's not passed in messages, that's only allocated on your side
<damo22>ok
<damo22>i'll use gdb
<damo22>brb
<damo22>FIXED!!
<damo22>device_read was not replying correctly
<damo22>pushed working disk driver to incubator
<youpi>damo22: don't always deallocate open_port, only deallocate on success
<youpi>on error mig will do the cleanup
<youpi>so you mustn't do it yourself before that
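
A minimal sketch of the "release only on success" rule, assuming nothing about the real ds_device_open beyond what is said in the log. All names here (`demo_device_open`, `demo_port_deallocate`, the `DEMO_*` codes) are hypothetical; real code would call mach_port_deallocate on open_port and return the device error codes.

```c
#include <assert.h>

/* Hypothetical stand-ins; real code would use Mach/Hurd types and calls. */
enum { DEMO_SUCCESS = 0, DEMO_INVALID_OPERATION = 1 };

static int demo_releases;   /* counts releases performed by the handler */

static void demo_port_deallocate(unsigned port)
{
    (void) port;
    demo_releases++;
}

/* Open handler: consumes open_port only when it returns success.  On any
   error it leaves open_port alone, because the caller's cleanup path
   (MIG, in the real server) releases it. */
static int demo_device_open(unsigned open_port, int is_master, int open_ok)
{
    assert(open_port != 0);
    if (!is_master)
        return DEMO_INVALID_OPERATION;   /* no release here */
    if (!open_ok)
        return DEMO_INVALID_OPERATION;   /* still no release */
    demo_port_deallocate(open_port);     /* success: the handler owns it */
    return DEMO_SUCCESS;
}
```

Releasing at the very end, after every failure path has returned, is one simple way to keep the handler and the cleanup code from both dropping the same reference.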
<youpi>it was indeed bogus to both call ds_device_read_reply and return D_SUCCESS
<youpi>either you return MIG_NO_REPLY and call ds_device_read_reply later
<youpi>or you return D_SUCCESS
<youpi>currently your code is correct for the success case, but not for EIO
<damo22>oh ok
<youpi>don't call ds_device_read_reply in addition to returning EIO
<damo22>okay thnx
<youpi>but even in the success case, instead of calling ds_device_read_reply and return MIG_NO_REPLY, you can just return D_SUCCESS
<damo22>will fix
<youpi>since you'll have already set *bytes_read
<youpi>using ds_device_read_reply is only needed when you work asynchronously
<youpi>here you are synchronous so you can just reply immediately
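
The reply rule above can be modelled by counting replies. This is a sketch under assumptions: `demo_dispatch` is a hypothetical stand-in for the server loop, and the `DEMO_*` values are local placeholders, not the real D_SUCCESS, EIO or MIG_NO_REPLY constants.

```c
#include <assert.h>

/* Local placeholder codes, not the real Mach/MIG values. */
enum { DEMO_OK = 0, DEMO_EIO = 5, DEMO_NO_REPLY = -1 };

static int demo_replies_sent;

static void demo_send_reply(int code)
{
    (void) code;
    demo_replies_sent++;
}

typedef int (*demo_handler)(void);

/* Model of the server loop: it replies with the handler's return code
   unless the handler says it takes care of the reply itself.
   Returns how many replies went out for this one request. */
static int demo_dispatch(demo_handler h)
{
    assert(h != 0);
    demo_replies_sent = 0;
    int ret = h();
    if (ret != DEMO_NO_REPLY)
        demo_send_reply(ret);
    return demo_replies_sent;
}

/* Synchronous: just return the status, the loop replies. */
static int demo_read_sync(void)  { return DEMO_OK; }

/* Asynchronous style: reply explicitly, then tell the loop not to. */
static int demo_read_async(void) { demo_send_reply(DEMO_OK); return DEMO_NO_REPLY; }

/* The bug from the log: explicit reply AND a normal return code. */
static int demo_read_buggy(void) { demo_send_reply(DEMO_OK); return DEMO_OK; }

/* Error case: return the error only, no explicit reply. */
static int demo_read_eio(void)   { return DEMO_EIO; }
```

Only `demo_read_buggy` produces two replies for one request; the other three handlers produce exactly one.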
<damo22>no that doesnt work
<damo22>somehow your suggestions broke device_read
<damo22>after get_status i get EIEIO on the ext2fs mount
<damo22>i left the reply with D_SUCCESS and then return MIG_NO_REPLY and it works
<damo22>i cant mount a partition past the 128GB boundary
<damo22>even with rumpdisk
<damo22>i think its because storeio partition thing doesnt work
<damo22>it mounts /dev/wd0s2
<damo22>because it goes only 50GB past the start of the disk
<damo22>it has same behaviour on native hw
<damo22>pushed everything that ive done now
<youpi>if it doesn't work past 128GB perhaps the rump ide driver only supports lba28
<damo22>no it supports lba48
<youpi>well, somehow there's a point where it doesn't :)
<damo22>i think its libstore
<youpi>the device interface itself supports 32bit block numbers, so that's 2T
<youpi>libstore uses 64bit off_t and the device interface
<damo22>hmm
<youpi>so I don't see how a 128GB limit could be there
<youpi>128GB definitely is a 2^28 thing
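
The sizes quoted above follow from the block number width, assuming 512-byte sectors (the log does not state the sector size). `addressable_bytes` is a hypothetical helper written for this note.

```c
#include <assert.h>
#include <stdint.h>

/* Bytes addressable with `bits`-bit block numbers and the given sector
   size.  Hypothetical helper; assumes the product fits in 64 bits. */
static uint64_t addressable_bytes(unsigned bits, uint64_t sector_size)
{
    assert(bits < 64);
    return ((uint64_t) 1 << bits) * sector_size;
}
```

With 512-byte sectors, 28 bits gives 2^37 bytes (128 GiB) and 32 bits gives 2^41 bytes (2 TiB), which is why a failure right at 128GB points at a 28-bit block number somewhere along the path.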
<damo22>yeah but the log from rumpdisk shows LBA48
<youpi>please show me the change you've made to make the device_read return D_SUCCESS
<youpi>it's really supposed to work
<youpi>probably there's a detail you are doing wrongly
<youpi>showing LBA48 perhaps only means the device supports it, and not the driver
<damo22>i dont think netbsd only supports 128GB drives in 2019
<damo22>its the ahci driver
<youpi>possibly it's not netbsd itself, but some glue at some point
<damo22>its not really glue, it compiles the actual source
<youpi>yes, but there's glue around it to make it work with the rest
<damo22>i'll have to investigate further but not tonight
<youpi>I'd say try with dd bs=1M skip=150000 to see where the offset gets wrong
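
To see where that dd offset lands, here is a small sketch that converts the skip count into a sector number, assuming 512-byte sectors. The helpers are hypothetical, and `wrap_lba28` shows only one possible failure mode (silent truncation to 28 bits); the log does not establish that this is what actually happens.

```c
#include <assert.h>
#include <stdint.h>

/* First sector read by "dd bs=<bs> skip=<skip_blocks>". */
static uint64_t dd_first_sector(uint64_t skip_blocks, uint64_t bs,
                                uint64_t sector_size)
{
    assert(sector_size != 0 && bs % sector_size == 0);
    return skip_blocks * (bs / sector_size);
}

/* Nonzero if the sector number fits in a 28-bit block address. */
static int fits_lba28(uint64_t sector)
{
    return sector < ((uint64_t) 1 << 28);
}

/* Sector a request would hit if the address were truncated to 28 bits. */
static uint64_t wrap_lba28(uint64_t sector)
{
    return sector & (((uint64_t) 1 << 28) - 1);
}
```

With bs=1M and skip=150000 the first sector is 307200000, which is above 2^28 = 268435456, so the read starts past the 128GB boundary. Comparing what comes back with the data at the truncated sector would be one way to check for a wrap.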
<damo22> *bytes_read = err;
<damo22>- ds_device_read_reply (reply_port, reply_port_type, D_SUCCESS, buf, *bytes_read);
<damo22>- return MIG_NO_REPLY;
<damo22>+ return D_SUCCESS;
<damo22>that broke it
<damo22>doesn't it need the "buf" to send back through the message?
<youpi>you need to set *data = buf indeed
<damo22>lol
<youpi>(and drop the ds_device_read_reply for the EIO case, they are really not useful, just returning EIO will do it)
<youpi>ah, you did, ok :)
<youpi>(no need to set *bytes_read on errors, that won't be transmitted anyway)
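
A minimal sketch of the synchronous read path as discussed above: fill both out-parameters on success, touch neither on error. Everything here is hypothetical (`demo_device_read`, `demo_backend_read`, the `DEMO_*` codes); the real handler has a different signature and returns D_SUCCESS or an error such as EIO.

```c
#include <assert.h>
#include <stddef.h>

/* Local placeholder codes, not the real device error values. */
enum { DEMO_D_SUCCESS = 0, DEMO_D_IO_ERROR = 1 };

/* Hypothetical backend: returns bytes read, or a negative value on failure. */
static long demo_backend_read(char *buf, size_t len, int fail)
{
    if (fail)
        return -1;
    for (size_t i = 0; i < len; i++)
        buf[i] = (char) ('a' + i % 26);
    return (long) len;
}

/* Synchronous read handler: the caller sends the reply from the
   out-parameters, so both must be set before returning success. */
static int demo_device_read(char *buf, size_t len, int fail,
                            char **data, size_t *bytes_read)
{
    assert(buf && data && bytes_read);
    long n = demo_backend_read(buf, len, fail);
    if (n < 0)
        return DEMO_D_IO_ERROR;   /* outputs are not transmitted on error */
    *data = buf;                  /* the step that was missing in the diff */
    *bytes_read = (size_t) n;
    return DEMO_D_SUCCESS;
}
```

Setting the byte count without the data pointer leaves the reply with a length but no buffer, which is consistent with the breakage seen after the first attempt at the change.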
<damo22>if this works on native hw i will be very happy
<damo22>ive just compiled a static binary with the latest changes
<damo22>btw im using libparted
<damo22>to detect the partitioning
<damo22>via libstore
<damo22>could there be a lba28 restriction there?
<damo22>i get EIEIO reading some of the files in /mnt
<damo22>but not all
<youpi>I don't think it'd be in libparted
<youpi>really, just putting printf along the path will tell you where it gets wrong
<youpi>instead of trying to guess
<damo22>its very hard to debug when i need to repack initrd every time
<damo22>and i have no gdb
<damo22>because hardly anything fits
<youpi>I thought that irq sharing fix would allow to run it live
<youpi>at least in qemu
<youpi>where you can make qemu expose several ahci controllers
<damo22>it does, but i have no errors on qemu
<youpi>so you can let one be driven by gnumach, and the other by rump
<youpi>ok
<youpi>but did you try >128G access in qemu ?
<damo22>ah not yet
<damo22>i cant select which ahci controller rump can control
<damo22>it tries to control all
<youpi>even by hacking it?
<damo22>nope
<youpi>to e.g. exclude some pci devices by hand
<damo22>it has clever autodetection
<youpi>but I guess there is a probe function which can return an error?
<damo22>i guess i could hack it
<youpi>which you could do when the pci device is to be excluded
<youpi>you can also make qemu expose both an ide device and an ahci device
<damo22>yeah that is what i have been doing
<damo22>ide has lba28 though
<youpi>sure but that's enough for your /
<damo22>so ive had to shrink my /
<youpi>you have a / bigger than 128G ?
<damo22>yeah originally i did
<damo22>now 50GB
<damo22>i hate running out of space
<damo22>this is not the 90s ;)
<damo22>problem is i spend most of my time in e2fsck
<youpi>why running e2fsck ?
<damo22>because it sets the dirty bit when i try mounting real disk
<damo22>and if it fails it cant unset it
<damo22>so next boot ...
<youpi>you can use a small partition for your tests
<youpi>so that it fscks quickly
<damo22>indeed, i need a compromise
<damo22>less space, but enough for dev environment
<damo22>and some code
<youpi>I don't understand: only rump mount does bring missing umounts, no?
<youpi>so that your / always correctly umounts
<damo22>my / is a initrd
<damo22>it doesnt even need to be cleaned
<damo22>but i have real disk with another /
<damo22>i need to test mounting the real disk
<youpi>so here you are talking about tests on real hw
<damo22>yes
<youpi>I can understand that that case is hard to debug
<youpi>but most often you can debug with qemu
<youpi>I'm not saying "always"
<damo22>yeah ok
<youpi>(and my students wonder why we are making them work on a purely simulated system, not even qemu, to do their OS assignments...)
<damo22>but i seem to only encounter problems on real hw, not in qemu
<youpi>ok :/
<damo22>so i think its all good, then transfer disk to real system and boom it explodes
<damo22>i'll debug more tomorrow
<damo22>gniht
<damo22>goodnight
<damo22>midnight here
<damo22>good luck with your students
***Emulatorman_ is now known as Emulatorman