IRC channel logs


<damo22>cpu_number and CPU_NUMBER are the problem
<damo22>they are too expensive operations
<damo22>also, they are not cached since all these are macros:
<damo22>#define current_thread() (active_threads[cpu_number()])
<damo22>#define current_stack() (active_stacks[cpu_number()])
<damo22>#define current_task() (current_thread()->task)
<damo22>#define current_space() (current_task()->itk_space)
<damo22>#define current_map() (current_task()->map)
<damo22>is there a cheaper way to get current thread?
<damo22>instead of looking up cpu id
<damo22>every instance of these macros is looking up the cpu number from APIC
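Because the accessors quoted above are plain macros, each use expands to a fresh cpu_number() call. A minimal userland sketch of that effect, assuming simplified stand-in types and a counting cpu_number() (both hypothetical, not the real gnumach definitions):

```c
/* Hypothetical, simplified stand-ins for the kernel structures. */
struct task { int map; int itk_space; };
struct thread { struct task *task; };

#define NCPUS 8
static struct thread *active_threads[NCPUS];

/* Stand-in for the real lookup: counts how often it is invoked. */
static unsigned long cpu_number_calls;
static int cpu_number(void) { cpu_number_calls++; return 0; }

/* Same shape as the macros quoted in the log. */
#define current_thread() (active_threads[cpu_number()])
#define current_task()   (current_thread()->task)
#define current_space()  (current_task()->itk_space)
#define current_map()    (current_task()->map)

static void setup(void)
{
    static struct task t = { 1, 2 };
    static struct thread th = { &t };
    active_threads[0] = &th;
    cpu_number_calls = 0;
}

/* Two macro uses expand to two separate cpu_number() lookups. */
static unsigned long lookups_for_map_and_space(void)
{
    setup();
    int sum = current_map() + current_space();
    (void)sum;
    return cpu_number_calls;
}

/* Fetching the thread pointer once needs a single lookup. */
static unsigned long lookups_with_cached_thread(void)
{
    setup();
    struct thread *self = current_thread();
    int sum = self->task->map + self->task->itk_space;
    (void)sum;
    return cpu_number_calls;
}
```

In this sketch the first function performs two lookups and the second performs one, which is why the per-lookup cost matters when the macros are used throughout the kernel.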
<damo22>is there a per-cpu register we can hijack to put the thread pointer in?
<damo22>like when we do a context switch we have execution on the cpu and we know the thread address
<damo22>if we stash it somewhere where only that cpu can see the right value...
<damo22>instead of in an array indexed by cpu number
<youpi>damo22: how "expensive" are they ? ns ? µs ? ms ?
<damo22>i havent measured exactly but when i printf in there it is printing much much more often than with any other lock or cpu_pause
<damo22>it pages out over multiple screens before printing anything else
<youpi>well, it's expected that it's called *very* often
<youpi>but if it costs a ns, that's *not* a problem
<damo22>its the only thing i can find that really differs between NCPUS=8 running with -smp 1 and bare NCPUS=1
<damo22>but the speed difference is huge
<youpi>there's also the whole scheduler
<damo22>can i use %fs
<youpi>debugging by elimination is a very difficult thing, you never know what you have forgotten :)
<damo22>or %gs
<youpi>before trying to fix something, check whether it actually is a problem
<damo22>to point to percpu struct
<damo22>which reg can i take
<youpi>don't try to fix a problem if there's any possibility that it doesn't exist actually
<youpi>you'll just lose time
<youpi>just put a loop in the acpi init code, to check whether getting the apic id is actually expensive
<youpi>like a million iterations
<youpi>if that takes one second to run, then getting the apic id is indeed a µs and that's a concern
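The measurement suggested here can be sketched in userland as a timed loop. This is only an illustration: read_id_under_test() and fake_apic_id_reg are hypothetical stand-ins for the real register read, and in the kernel the timing source would differ from clock_gettime().

```c
#define _POSIX_C_SOURCE 199309L
#include <stdint.h>
#include <time.h>

/* Hypothetical stand-in for the register being read. It is volatile so
 * the compiler cannot hoist the load out of the loop. */
static volatile uint32_t fake_apic_id_reg = 3u << 24;

/* Hypothetical stand-in for the operation whose cost is in question. */
static unsigned int read_id_under_test(void)
{
    return (fake_apic_id_reg >> 24) & 0xff;
}

/* Run `iterations` calls and return the average cost per call in ns. */
static double ns_per_call(unsigned long iterations)
{
    struct timespec start, end;
    volatile unsigned int sink = 0;

    clock_gettime(CLOCK_MONOTONIC, &start);
    for (unsigned long i = 0; i < iterations; i++)
        sink += read_id_under_test();
    clock_gettime(CLOCK_MONOTONIC, &end);
    (void)sink;

    double elapsed_ns = (double)(end.tv_sec - start.tv_sec) * 1e9
                      + (double)(end.tv_nsec - start.tv_nsec);
    return elapsed_ns / (double)iterations;
}
```

With a million iterations, a total of one second corresponds to 1000 ns (one µs) per call, which is the threshold mentioned in the log.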
<youpi>there's also the cpuid way to get the apic id
<youpi>that's probably much faster
<youpi>it clobbers e[abcd]x however
<damo22>there doesnt seem to be much per cpu stuff needed in mach, maybe all i need to do is store the cpu id in %fs
<youpi>yes, with eax=1, cpuid gives the apic id in ecx >> 24
<youpi>you cannot put whatever you want in fs, it has to be a valid segment number
<damo22>or a pointer
<damo22>to a struct
<youpi>really, you can as well just push e[abcd]x ; mov $1,eax ; cpuid ; shr $24, %ecx; mov %ecx, register; pop e[abcd]x
<youpi>and you'll be done
<damo22>i will try it
<youpi>I measured it to be about 60ns
<youpi>on my laptop
<damo22>-smp 1 boots quicker now with NCPUS=8
<damo22>it hangs with more im investigating
<damo22>it seems using the stack inside CPU_NUMBER is bad
<damo22>i will try disabling intrs
<youpi>disabling intrs won't make the stack magically work
<damo22>AP=(1) reset page dir done
<youpi>you could have two versions, one that uses lapic (and thus doesn't use the stack), the other that uses cpuid
<youpi>I don't see many places where the stack can't be used
<youpi>syscall64 is one place indeed
<youpi>but syscall is expensive anyway, so using apic there should be fine enough for now
<youpi>yes, the rest are functions themselves, or are using push/pop/call around
<youpi>so that's basically the only place where you cannot optimise with cpuid
<damo22> Bad processor state 1 (Cpu 0)
<damo22>+ asm volatile("cpuid" : "+a" (*eax), "=b" (*ebx), "=c" (*ecx), "=d" (*edx)
<damo22>+ : : "memory");
<youpi>i386/i386/proc_reg.h:#define cpuid(eax, ebx, ecx, edx) \
<youpi>i386/i386/proc_reg.h: "cpuid\n\t" \
<youpi>it's already there of course
<youpi>don't reinvent the wheel, it's tricky to invent
<damo22>it panics in idle_thread_continue
<damo22>in PROCESSOR_RUNNING state
<damo22>ok so the processor makes it into idle thread in running state before it has a chance to complete the rest of the AP bringup i think
<youpi>possibly you just uncovered a race
<youpi>that was previously just hidden by the time that apic_id takes
<damo22>i think youre right
<damo22>i think ecx gets clobbered
<damo22>cpuid() macro does not work
<youpi>it's the same as in hwloc, which is used on all supercomputers
<damo22>- return (lapic->apic_id.r >> 24) & 0xff;
<damo22>+ unsigned int eax, ebx, ecx, edx;
<damo22>+ eax = 1;
<damo22>+ cpuid(eax, ebx, ecx, edx);
<damo22>+ return (ecx >> 24);
<youpi>check the assembly, to make sure your hypothesis is right
<youpi>sorry, it's in the higher part of ebx, not ecx
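With that correction, CPUID leaf 1 reports the initial APIC ID in bits 31..24 of EBX. A minimal userland sketch, assuming GCC or Clang and their `<cpuid.h>` helper `__get_cpuid` rather than the cpuid() macro from i386/i386/proc_reg.h; the function names are hypothetical:

```c
#include <stdint.h>
#if defined(__i386__) || defined(__x86_64__)
#include <cpuid.h>
#endif

/* Extract the initial APIC ID (bits 31..24) from the EBX value
 * returned by CPUID leaf 1. */
static unsigned int apic_id_from_ebx(uint32_t ebx)
{
    return (ebx >> 24) & 0xff;
}

/* Return the initial APIC ID of the calling CPU, or -1 when CPUID
 * leaf 1 is not available (for example on a non-x86 build). */
static int cpuid_apic_id(void)
{
#if defined(__i386__) || defined(__x86_64__)
    unsigned int eax, ebx, ecx, edx;

    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return -1;
    return (int)apic_id_from_ebx(ebx);
#else
    return -1;
#endif
}
```

Reading ECX instead of EBX returns feature flags rather than the ID, which would make every CPU compute a wrong cpu number.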
<damo22>its faster than before but still pretty slow
<damo22>im trying to get a shell with -smp 4
<youpi>it's not a hundred times faster? then there's another issue
<damo22>i mean it could be, i cant tell
<damo22>ive never got this far
<damo22>its up to INIT
<damo22>after 3:30 of cpu time
<damo22>its probably like 3x quicker or something
<youpi>so apic_id was part of the problem, but not the only part
<damo22>i can try using the stack again in CPU_NUMBER
<damo22>as the problem was i was using ecx
<youpi>for the syscall64 case, you probably can't use the stack, see the code
<youpi>(or at least, shouldn't because it's the user stack at that point)
<damo22>what about
<damo22>* XXX Dubious things here:
<damo22>- * I don't check the idle_count on the processor set
<damo22>in thread_handoff
<youpi>I don't know
<damo22>+ * Don't switch threads old -> new if more idle, ie
<damo22>+ * since old set has more idling cpus on average it is less busy
<damo22>+ * so better not switch to a busier cpu set, just let it run.
<damo22>i wrote this
<damo22>+ if ((100 * oldset->idle_count / (oldset->processor_count + 1))
<damo22>+ > (100 * newset->idle_count / (newset->processor_count + 1))) {
<damo22>+ counter(c_thread_handoff_misses++);
<damo22>+ return FALSE;
<damo22>+ }
<damo22>it had some noticeable effect on speed
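The comparison in the quoted patch can be tried in isolation with plain integers. This is a sketch only: idle_score() and handoff_refused() are hypothetical names, and the real code reads the counts from the processor set structures.

```c
#include <stdbool.h>

/* Idle score as computed in the quoted patch. The "+ 1" keeps the
 * denominator non-zero, and integer division truncates. */
static int idle_score(int idle_count, int processor_count)
{
    return 100 * idle_count / (processor_count + 1);
}

/* True when the patch would refuse the handoff, i.e. when the old set
 * scores strictly higher (more idle) than the new set. */
static bool handoff_refused(int old_idle, int old_count,
                            int new_idle, int new_count)
{
    return idle_score(old_idle, old_count) > idle_score(new_idle, new_count);
}
```

For example, 3 idle CPUs out of 4 scores 60 and 1 idle out of 4 scores 20, so a handoff from the former set to the latter is refused. Because of the "+ 1", a fully idle set of 4 scores 80, not 100.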
<youpi>that's probably just luck
<youpi>really, first understand where the problem is before trying to fix it
<youpi>otherwise you'll just be stabbing in the dark
<damo22>in smp 4 its sitting in INIT with 30% load on the host trying to boot slowly
<damo22>now its dropped to 12.5% but it hasnt progressed much
<damo22>i wish i could interrupt all cpus
<damo22>and see a backtrace
<youpi>you can do that from qemu?
<damo22>i know i can run gnumach alone
<youpi>iirc there's a page about it on the wiki
<youpi>but since you're seeing from the host the vm being mostly idle, you'll probably just see gnumach in the idle loop, which will not be very informative
<damo22>haha, yes
<damo22>i ran qemu with -s
<youpi>it's probably more the scheduler state that you want to check, to make sure that it does get threads to run when it wants to
<youpi>making ast actually send an IPI might be part of it
<damo22>all cpus are in idle
<damo22>[ 9.0700050] chacha: Portable C ChaCha
<youpi>because for instance, if cpu0 wakes a thread, and the scheduler thinks "cpu1 is idle, let's make it run there", but the ast doesn't actually make cpu1 break its idle loop, the thread will have to wait for the clock tick to happen before getting to run on cpu1
<damo22>so clock tick speed is thread choosing speed
<youpi>possibly when cpu0 gets idle, the scheduler would take that thread back on cpu0
<youpi>but that really depends on the scheduler behavior
<youpi>it can be but only by accident, that's not how it's supposed to be, IPI are supposed to make threads get to run
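The cost of relying on the clock tick instead of an IPI can be put in numbers with a small helper. This is a sketch under assumptions: the function name is hypothetical, and the 10000 µs period used in the example below is an assumed value, not one taken from the log.

```c
/* Time in microseconds until the next clock tick strictly after
 * `now_us`, for a periodic tick of `tick_period_us`. Without an IPI,
 * a thread queued for an idle CPU waits this long before that CPU
 * notices it. */
static unsigned long wait_for_next_tick_us(unsigned long now_us,
                                           unsigned long tick_period_us)
{
    return tick_period_us - (now_us % tick_period_us);
}
```

With an assumed 10000 µs tick, a thread woken 2500 µs after the last tick waits another 7500 µs before the idle CPU picks it up, whereas an IPI would break the idle loop right away.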
<damo22>ok i can implement cause_ast_check
<damo22>as an IPI mechanism
<youpi>as a reminder, all these principles are explained in the linux kernel book from bovet and cesati
<youpi>that said, weren't you saying that cause_ast_check wasn't that much called? is that still the case?
<damo22>yes its hardly called at all
<damo22>if any
<youpi>well, it will be needed to implement cause_ast_check anyway :)
<damo22>ok will look at doing this on the weekend