IRC channel logs

2020-07-23.log

<youpi>"soon" is actually today :)
<AlmuHS>what?
<youpi>glibc
<AlmuHS>the fix for... IRQ? I don't remember
<youpi>the fix for clock_nanosleep
<AlmuHS>true
<youpi>(buildds haven't picked it up yet though)
<AlmuHS>damo22: the glibc fix for the clock is ready
<youpi>one is busy with gcc-9, the other with qemu
<youpi>*stuck* with qemu actually, let me fix that
<AlmuHS>ok
<AlmuHS>then I'll keep my Qemu VM without upgrading for a while
<youpi>upgrading should be fine
<youpi>only rump is affected by the clock_nanosleep issue
<youpi>there was also an issue with eatmydata + sudo, but that's really a niche thing
<AlmuHS>in my SMP code, on real hardware, I noticed that the tty doesn't respond to the keyboard. But mach_ncpus is set to 1, so there is no concurrency yet, and I don't know the reason
<AlmuHS>in Qemu, the same SMP code works without problems, and the tty responds to the keyboard properly
<youpi>you mean with your patch but without actually setting mach_ncpus > 1 ?
<AlmuHS>yeah
<youpi>possibly try to disable pieces of code that you have introduced
<youpi>to determine which one causes the problem
<AlmuHS>the only strange piece can be a kalloc()
<youpi>that can't be it if you don't actually access the data
<youpi>really, dichotomy between a working state and a non-working state is a boring, but extremely effective methodology
<youpi>the thing is: the non-boring part comes after the boring dichotomy
<AlmuHS>my code is not enabling cpus yet. Currently, it only finds and enumerates them
<youpi>just stay patient during the boring dichotomy part
<youpi>ok but possibly it buffer-overruns or such
<youpi>but dichotomy can reveal exactly which piece causes the problem
<youpi>and then you can track what's actually going wrong
<AlmuHS>i have to prepare unit-tests for my code
<youpi>no need for unit-test
<youpi>just #if 0
<youpi>that's fine enough
<youpi>when you have a working state and a non-working state, you just need to disable/enable code
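A minimal sketch of the `#if 0` bisection described above. The step functions are hypothetical stand-ins for pieces of newly introduced code, not names from the actual patch; the idea is to disable one suspect at a time, rebuild, and retest on the failing machine.

```c
#include <assert.h>

/* Hypothetical stand-ins for three pieces of newly added code. */
int step_find_tables(void) { return 1; }
int step_enumerate(void)   { return 2; }
int step_resize_list(void) { return 4; }

/* Runs the enabled steps and returns a bitmask of the ones that ran. */
int smp_probe(void)
{
    int ran = 0;

    ran |= step_find_tables();
    ran |= step_enumerate();
#if 0   /* suspect under test: disabled for this build */
    ran |= step_resize_list();
#endif
    assert(ran != 0);   /* at least one step must stay enabled */
    return ran;
}
```

If the problem disappears with the block disabled, the suspect is inside it; if not, re-enable it and move the `#if 0` to the next piece.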
<AlmuHS>ok. But it will take a while, because this problem only exists on real hardware, not in Qemu
<youpi>just be prepared to raise an eyebrow at really innocent-looking code
<youpi>then work on making the real hardware testing loop really short
<youpi>that'll be useful anyway, longterm wise
<AlmuHS>eyebrow?
<youpi>bad eye
<youpi>frowned eye
<AlmuHS>what do you refer to exactly?
<youpi>I mean never assume any innocent-looking code is actually innocent
<AlmuHS>it's true
<youpi>- get_char(vc, (u_short *)&tmp_pos + 1, &temp) > SPACE) {
<youpi>+ get_char(vc, (u_short *)tmp_pos + 1, &temp) > SPACE) {
<youpi>this typo has been overlooked for DECADES
<AlmuHS>LOL
<youpi>and actually posed problems to people recently, bringing kernel crashes etc.
<youpi>while it should have for decades
<youpi>I have no idea why nobody really reported it before
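A small sketch of why the stray `&` in the diff above matters, assuming `tmp_pos` is an integer that holds the address of a 16-bit screen cell. The buffer and function names are hypothetical; `uintptr_t` stands in for whatever integer type the real code uses.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical buffer of 16-bit cells standing in for console memory. */
static unsigned short cells[4] = { 0x0041, 0x0042, 0x0043, 0x0044 };

/* Fixed form: cast the VALUE of tmp_pos to a pointer, then step one cell. */
unsigned short *next_cell_fixed(uintptr_t tmp_pos)
{
    assert(tmp_pos != 0);
    return (unsigned short *)tmp_pos + 1;
}

/* Buggy form: takes the ADDRESS of the local variable, so the result
 * points inside tmp_pos itself (on the stack), not into the buffer.
 * Returns 1 when that is the case. The pointer is never dereferenced. */
int buggy_points_into_local(uintptr_t tmp_pos)
{
    unsigned short *p = (unsigned short *)&tmp_pos + 1;

    return (char *)p == (char *)&tmp_pos + sizeof(unsigned short);
}
```

With the fix, `next_cell_fixed((uintptr_t)&cells[0])` is `&cells[1]`. The buggy form compiles without complaint and reads whatever bytes happen to sit in the local variable.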
<AlmuHS>this is a difficult line to debug
<youpi>not really
<youpi>when you have a good backtrace, it's obvious where things are wrong
<youpi>I didn't have anything more than a backtrace to track it down actually
<youpi>since I never could reproduce it myself
<AlmuHS>i like to use temporary variables, to avoid doing too many things in one line
<youpi>but after a few decades, once a user actually sent a backtrace pointing at this very line, I could review it and say "hey that's actually completely wrong"
<AlmuHS>maybe a unit-test could detect this typo, because the value is not the expected one
<youpi>no, it's very hard to settle
<youpi>because it happens in a rare condition that's actually produced by a real user typing on the keyboard
<youpi>if you wanted to track *that* case you would need thousands of unit-tests
<AlmuHS>real
<youpi>unit-tests are good when you know what you want to track down
<youpi>yes, for real
<youpi>but they're not enough when you have bugs
<youpi>because bugs are almost by definition the cases that you didn't expect
<youpi>and thus didn't think you'd need a unit test for
<AlmuHS>yes. In SMP, the biggest difficulty is the addressing
<youpi>there, glibc is at the top for buildds
<youpi>(well, almost, yet another haskell rebuild ahead of it)
<AlmuHS>i have many errors in addressing, because I try to call a physical address, or I try to access "out of range"...
<youpi>you need to be always sure of what you are managing, a pointer to the data or the data itself, yes
<youpi>and in the kernel case, physical vs virtual address, indeed :)
<youpi>and nothing like sigsegv to catch you
<AlmuHS>kernel panic, simply
<youpi>when I see students depressed by a sigsegv, I claim "but that's a blessing!! you have a whole backtrace!!"
<youpi>and not a mere panic completely out of the place where the bug actually is
<AlmuHS>in a kalloc(), I had a panic because I called free() after the new kalloc()
<AlmuHS>nope, the panic came from calling free() before assigning the new memory to the pointer
<AlmuHS> apic_data.cpu_lapic_list = new_list;
<AlmuHS> kfree(old_list);
<AlmuHS>if these lines are put in the reverse order, the kernel panics
<youpi>that's what I meant: the panic points you at something, but it's not really exactly what you should look at, and thus you need to be inventive in what could have been going wrong
<AlmuHS>yes, it's true
<youpi>and even worse, quite often you have to fix *several* things until the bad effect goes
<youpi>so you shouldn't say "nah, it's not that since it doesn't seem to be fixing the issue"
<youpi>yes, it is that, but not only that
<AlmuHS>in this function, I try to resize a dynamic array. I reserved 255 elements. But, after enumerating the processors, maybe I don't need 255. So I resize the array to fit the real number of processors
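A sketch of the shrink-to-fit step with the ordering shown above: publish the new buffer first, free the old one afterwards. This is a user-space illustration under assumptions: `malloc`/`free` stand in for the kernel allocator, and `struct lapic`, `struct apic_info` and `apic_shrink_list` are hypothetical names, not the patch's real definitions.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

struct lapic { unsigned char apic_id; };   /* hypothetical per-cpu record */

struct apic_info {
    struct lapic *cpu_lapic_list;
    int ncpus;      /* entries actually filled in */
    int capacity;   /* entries allocated */
};

/* Shrinks the list to exactly ncpus entries.
 * Returns 0 on success, -1 on failure (the old list is kept). */
int apic_shrink_list(struct apic_info *info)
{
    struct lapic *old_list, *new_list;

    assert(info != NULL);
    if (info->ncpus <= 0 || info->ncpus > info->capacity)
        return -1;

    old_list = info->cpu_lapic_list;
    new_list = malloc(info->ncpus * sizeof(*new_list));
    if (new_list == NULL)
        return -1;

    memcpy(new_list, old_list, info->ncpus * sizeof(*new_list));

    info->cpu_lapic_list = new_list;   /* publish the new buffer first */
    info->capacity = info->ncpus;
    free(old_list);                    /* only then release the old one */
    return 0;
}
```

Keeping the old pointer in a local until the very end means the shared structure never refers to freed memory, even for an instant.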
<youpi>heh, yet more haskell rebuilds came on the way
<youpi>I'll just bump glibc
<AlmuHS>haskell? the language?
***Server sets mode: +nt
<damo22>struct blob *next ?
<youpi>$ find . -name \*list\*
<youpi>./kern/list.h
<youpi>that could be it
<AlmuHS>yes, but managing the list manually is a source of errors
<youpi>looks so, at least
<youpi>always remember that grep and find are your best friends
<damo22>i like that "git grep" only searches in current subtree
<AlmuHS>then, maybe I can avoid the array resizing. I simply store the processors in a temporary linked-list and, after enumerating, I add them to an array with the correct size
<youpi>that's a way, yes
<youpi>or reallocate the array twice as long each time you need more
<youpi>so that the reallocation cost is actually constant
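A sketch of the doubling strategy, assuming a user-space `realloc` as a stand-in for the kernel allocator; `struct cpu_vec` and `cpu_vec_push` are hypothetical names. Because the capacity doubles each time it runs out, each element is copied only a bounded number of times on average, which is what makes the per-insert cost constant when amortized.

```c
#include <assert.h>
#include <stdlib.h>

struct cpu_vec {
    int *ids;     /* stored cpu identifiers */
    size_t len;   /* entries in use */
    size_t cap;   /* entries allocated */
};

/* Appends id, doubling the capacity when the array is full.
 * Returns 0 on success, -1 if the reallocation fails. */
int cpu_vec_push(struct cpu_vec *v, int id)
{
    assert(v != NULL);
    if (v->len == v->cap) {
        size_t new_cap = v->cap ? v->cap * 2 : 4;
        int *p = realloc(v->ids, new_cap * sizeof(*p));

        if (p == NULL)
            return -1;      /* v->ids is still valid and unchanged */
        v->ids = p;
        v->cap = new_cap;
    }
    v->ids[v->len++] = id;
    return 0;
}
```

Starting from an empty vector (`{ NULL, 0, 0 }`), the capacity goes 4, 8, 16, and so on, so enumerating a few dozen cpus triggers only a handful of reallocations.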
<AlmuHS>yes, but the resizing is a delicate process, I think. It's easy to produce errors
<damo22>is a fixed size of 255 too much for all pcs?
<youpi>managing a list is as well ;)
<AlmuHS>for this reason I was looking for a library ;)
<youpi>per-cpu data often grows quite quickly actually
<youpi>even using a linked-list library is delicate
<AlmuHS>xAPIC has a limit of 256 cpus
<AlmuHS>for this reason I set NCPUS to 255 in my patch
<damo22>if there are never more than 256 cpus could you not just preallocate an array of fixed size?
<youpi>that still looks like a waste
<youpi>we can live with it initially, but meh :)
<AlmuHS>at first, i reserved 255 as fixed size. But I want to optimize this
<damo22>ok
<damo22>but as a proof of concept, there's not much point optimising code until you have something that works
<youpi>sure
<youpi>1) make it work
<AlmuHS>i can disable the resizing, to test if this is the cause of the hang
<youpi>2) make it beautiful
<youpi>3) make it fast
<AlmuHS>in my case, I add 4) make it modular and manageable
<youpi>that's part of 2)
<youpi>beautiful = readable, maintainable, modular, elegant, nice, ...
<AlmuHS>richard told me that the SMP code must be modular, and independent of architecture. For this reason I added the SMP pseudoclass
<AlmuHS>the next step, once I get this working properly, will be implementing cpu_number() and CPU_NUMBER(). And the first might be architecture-independent
<AlmuHS>really, my current work is mainly to refactor the work of last year
<AlmuHS>Richard was angry when he saw my code
<youpi>the cleanup part is a boring and long, but necessary, step
<youpi>otherwise the long-term maintenance is a nightmare
<AlmuHS>because there are many globals, extern, arch-dependent code...
<youpi>we suffered quite a bit after Zheng Da's work
<AlmuHS>so I took note of his advice, and I'm refactoring accordingly
<AlmuHS>the most delicate step will be the cpu enabling. I will have to find a way to implement atomic operations
<youpi>you can use gcc's atomic operations
<youpi>no need to reimplement them
<youpi>c11's atomic operations, even
<AlmuHS>thanks for this info. I didn't know
<AlmuHS>I will have to be careful to prevent a cpu from executing the assembly routine at the same time as another
<AlmuHS>I have to fully serialize the cpu enabling and configuration
<AlmuHS>even adding cpus to the kernel might need to be serialized
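A sketch of how C11's `<stdatomic.h>` could serialize such a bring-up step, assuming a simple spinlock is acceptable. The names `cpu_init_lock`, `cpus_online` and `cpu_mark_online` are hypothetical, and a real kernel would also have to consider interrupts and memory ordering around the startup code itself.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical lock guarding the non-reentrant bring-up path. */
static atomic_flag cpu_init_lock = ATOMIC_FLAG_INIT;
/* The boot processor counts as already online. */
static atomic_int cpus_online = 1;

static void cpu_init_lock_acquire(void)
{
    /* Spin until the flag was previously clear, i.e. we now own it. */
    while (atomic_flag_test_and_set_explicit(&cpu_init_lock,
                                             memory_order_acquire))
        ;
}

static void cpu_init_lock_release(void)
{
    atomic_flag_clear_explicit(&cpu_init_lock, memory_order_release);
}

/* Registers one more cpu and returns the new online count. */
int cpu_mark_online(void)
{
    int n;

    cpu_init_lock_acquire();
    /* ... the per-cpu setup that must not run concurrently goes here ... */
    n = atomic_fetch_add(&cpus_online, 1) + 1;
    cpu_init_lock_release();

    assert(n > 1);   /* the boot processor was counted before us */
    return n;
}
```

`atomic_flag` is the one type the C11 standard guarantees to be lock-free, which makes it a reasonable building block for a lock of this kind.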
<damo22>perhaps have a look at coreboot's code that inits the cpus
<AlmuHS>but coreboot works at the BIOS level. It has fewer restrictions
<damo22>yes, so it has simpler code
<AlmuHS>cpu initialization in the BIOS uses other instructions
<damo22>yes but there is code in coreboot that sends IPIs and enumerates cpus
<AlmuHS>the enumeration and IPI sending are not the problem. The concurrency is
<damo22>you can take ideas from that
<AlmuHS>yes, this is true
<AlmuHS>I have to go to sleep. Tomorrow I have to wake up early
<AlmuHS>good night youpi. good morning damo22 ;)
<damo22>ok bye!
***_Posterdati_ is now known as Posterdati
***jma is now known as junlingm
<junlingm>youpi: when I use ddekit_large_malloc twice in a row, the second allocation fails with an invalid argument error. Here is a minimal test case: http://pastebin.com/embed_js/82unk12r
<junlingm>Did I forget to init something?
<youpi>I don't know, I never really played with libddekit
<youpi>but I doubt it'd be a usage error
<youpi>rather a bug inside libdde
<youpi>libddekit*
<junlingm>ok. thanks.
<youpi>(hint: printfs are extremely efficient ways to determine where the problem actually lies)
<junlingm>true. I will check out libddekit and try to debug it.
<youpi>(well, you can start with gdb, in case single-stepping does actually work)
<junlingm>it does not step into ddekit functions :(
<youpi>did you install the -dbgsym package?
<junlingm>I did
<junlingm>probably a version mismatch. I recompiled the libddekit from the git master branch, and the problem disappeared.