IRC channel logs

2024-12-08.log


<azert>damo22: if the code already does the same, then I think it would be more useful to add a comment than change it
<azert>if it’s already xapic, I don’t see why you get the apic id == 0? What is the bug?
<azert>Is it a problem with memory? Does it discard the value you are writing there in the per cpu region?
<damo22>azert: the bug is, cpu_number() calls percpu area to locate its cpu_id but seems something is broken with the gs segmentation
<damo22>i dont know yet
<damo22>it might just be mapped incorrectly so it reads zero
<youpi>damo22: since it's not on a critical path, I'm fine with making it more readable
<damo22>youpi: do we need to reload the gdt if we change values in the entries?
<youpi>yes
<damo22>im trying to figure out why im getting zero cpu_number
<youpi>you mean the entries in the segment descriptor, right?
<damo22>yes
<youpi>or do you mean the entries pointed by the segment?
<damo22>i mean if you patch the value in apboot_percpu_high for example
<youpi>then you have to reload gdt, yes
<damo22>i dont think we are doing this
<damo22>why do we need percpu so early?
<youpi>there will be more and more code using percpu data
<youpi>not having it early means having to make sure that all code leading to setting it up does not use percpu data at all
<damo22>ok
<damo22>well i have execution on AP but it still thinks its cpu number is 0
<damo22>CS =0008 40000000 ffffffff 00c09b00 DPL=0 CS32 [-RA]
<damo22>GS =0068 010a0260 ffffffff 00c09300 DPL=0 DS [-WA]
<damo22>something tells me that should be 410a0260
<damo22>or even 400a0260
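The GS line in the dump above reads as selector, base, limit, flags. As a minimal sketch of how those fields map onto an 8-byte descriptor (the struct and helper names are hypothetical, not gnumach's), the base is scattered over three places in the entry, which is why patching only part of it in memory can give a surprising value:

```c
#include <assert.h>
#include <stdint.h>

/* One 8-byte x86 segment descriptor, split into its two 32-bit words.
   Hypothetical type, for illustration only. */
struct seg_desc {
    uint32_t lo;
    uint32_t hi;
};

/* Pack a 32-bit base, 20-bit limit, access byte and 4-bit flags nibble. */
static struct seg_desc seg_desc_pack(uint32_t base, uint32_t limit20,
                                     uint8_t access, uint8_t flags4)
{
    struct seg_desc d;

    assert(limit20 <= 0xfffff && flags4 <= 0xf);
    d.lo = ((base & 0xffff) << 16) | (limit20 & 0xffff);
    d.hi = (base & 0xff000000u)        /* base bits 24-31 */
         | ((uint32_t)flags4 << 20)    /* G, D/B, L, AVL */
         | (limit20 & 0xf0000)         /* limit bits 16-19 */
         | ((uint32_t)access << 8)     /* P, DPL, S, type */
         | ((base >> 16) & 0xff);      /* base bits 16-23 */
    return d;
}

/* Reassemble the base from its three pieces. */
static uint32_t seg_desc_base(struct seg_desc d)
{
    return (d.hi & 0xff000000u) | ((d.hi & 0xff) << 16) | (d.lo >> 16);
}
```

Packing base 0x010a0260 with limit 0xfffff, access 0x93 and flags 0xc gives a high word whose attribute bits (mask 0x00f0ff00) are 0x00c09300, matching the dump. Note that the CPU keeps using its cached copy of a descriptor until the segment register is loaded again, so editing the entry in memory is not enough on its own.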
<damo22>youpi: i thought that before the final gdt is set up, the AP is in a weird segmentation, so how is it supposed to look up the cpu number in C code?
<damo22>i think we should assume nothing will need percpu area until after cpu_setup() is called on an AP, theres nothing it needs that early
<damo22>it only needs its own cpu number
<AlmuHS>damo22: in AP startup there are two GDTs: in apboot.S we load a temporary GDT to jump to protected mode, and after this, once in C code, we load the final GDT in cpu_setup()
<AlmuHS>also, if i remember well, there is an earlier GDT in apboot, used only to jump to 32-bit
<damo22>AlmuHS: yes, i realise, but youpi has added complexity to the boot asm for APs such that it configures the GS segment for early percpu access. I think this is unnecessary
<damo22>also, i am getting cpu number == 0 on APs currently with smp on AMD cpu
<damo22>so something is definitely broken
<AlmuHS>the first gdt which is loaded before jump to 32-bit is gdt_tmp, the second (temporary but already in 32-bit) is apboot_gdt, and the final is in cpu_setup()
<damo22>i tried removing the early percpu configuration but now it hangs at paging setup
<AlmuHS>then, maybe cpu_number() is being executed between these jumps
<AlmuHS>we had a strange issue with paging flags
<damo22>yes?
<AlmuHS>maybe it's necessary to change some paging flags?
<AlmuHS>you found this problem. Check old commits
<damo22>but why would it be any different to master
<damo22>it already works
<AlmuHS>I'm not sure
<damo22>my preference is to move out complexity from asm
<damo22>its too error prone
<AlmuHS>another reminder: the final gdt, idt... etc is different in BSP and AP. You had to create specific ap_gdt_init() and similar
<damo22>if it can be done in C code instead, we should do it there
<AlmuHS>we only can execute C code after jump to protected mode
<damo22>we already solved these problems almu
<AlmuHS>then try it
<damo22>the bug arrived when asm was heavily modified
<AlmuHS>another thing that could simplify AP booting would be to remove gdt_tmp and jump to protected mode directly using apboot_gdt
<damo22>yeah, not really possible unless you can hotpatch the code segment
<AlmuHS>it's difficult. youpi explained to me many years ago how i could jump without gdt_tmp, but I don't remember well
<damo22>since you dont know the address to jump to until runtime
<damo22>so we patch the realmode jump offsets
<damo22>again, we already solved this part
<damo22>we are seeing a new bug because the asm is now more complex
<AlmuHS>i think that we have to map the apboot_gdt. But the AP boots in 16-bit with limited segmentation, so it's difficult
<AlmuHS>ok
<AlmuHS>but i don't know about percpu yet
<damo22>percpu is an array that has a gdt entry exclusively for it
<AlmuHS>but it could be a synchronizing problem
<damo22>no the array is flat in memory but only one cpu writes to each per cpu element
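A minimal sketch of the layout damo22 describes, assuming a fixed NCPUS and a tiny struct (both hypothetical here; the real structure has more fields and is reached through the GS segment rather than by index):

```c
#include <assert.h>

#define NCPUS 8  /* assumed build-time maximum, hypothetical value */

/* Hypothetical per-cpu record: one element per processor. */
struct percpu {
    int cpu_id;
    int apic_id;
};

/* Flat in memory; each cpu only ever writes its own element. */
static struct percpu percpu_array[NCPUS];

static void percpu_slot_init(int cpu, int apic_id)
{
    assert(cpu >= 0 && cpu < NCPUS);
    percpu_array[cpu].cpu_id = cpu;
    percpu_array[cpu].apic_id = apic_id;
}

static int percpu_slot_cpu_id(int cpu)
{
    assert(cpu >= 0 && cpu < NCPUS);
    return percpu_array[cpu].cpu_id;
}
```

Because the array starts out zeroed, a processor that reads a slot nobody has initialised yet (or whose segment base points at the wrong place) sees cpu_id 0, which is indistinguishable from being the boot processor.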
<AlmuHS>maybe the percpu is not ready when you call to cpu_number() in this step?
<damo22>yes, exactly, there was asm code added to make this possible
<damo22>but i think something is wrong with it
<damo22>but its hard to debug
<AlmuHS>you can try to force the other cpu_number(), which does not use percpu, to be sure that the problem comes from percpu
<AlmuHS>you wrote an alternative version of this function which does not use percpu. Try calling that instead of the normal cpu_number()
<damo22>ok
<AlmuHS>if the code works with the alternative cpu_number(), then we can be sure that the problem comes from percpu. If the alternative fails too, then the problem is somewhere else
<AlmuHS>i go to sleep. Good luck
<damo22>night
<damo22>ok i figured it out, you cant send a PHYSICAL destination IPI to an APIC id > 0xf
<damo22>so it was restarting cpu 0
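A minimal sketch of that failure, assuming (as damo22 describes) that the hardware decodes only the low 4 bits of the physical destination; the helper names are hypothetical. In xAPIC the destination sits in bits 24-31 of the high ICR dword:

```c
#include <assert.h>
#include <stdint.h>

/* xAPIC ICR high dword: destination field in bits 24-31. */
#define ICR_DEST_SHIFT 24

/* Value to write to the high ICR dword for a given target APIC id. */
static uint32_t icr_high_for(uint8_t apic_id)
{
    return (uint32_t)apic_id << ICR_DEST_SHIFT;
}

/* Hypothetical model: the APIC id a physical-mode IPI really reaches
   when only dest_bits bits of the destination are decoded. */
static uint8_t effective_phys_dest(uint8_t apic_id, unsigned dest_bits)
{
    assert(dest_bits >= 1 && dest_bits <= 8);
    return apic_id & (uint8_t)((1u << dest_bits) - 1);
}
```

With 4 decoded bits, an IPI aimed at APIC id 0x10 lands on id 0, the boot processor. With 8 decoded bits it reaches 0x10 as intended.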
<damo22>hmm looks like you can only support up to 8 cpus to send unique IPIs with logical destination
<damo22>i dont know how to set up the lapic on APs before i get execution on them
<damo22>seems like a chicken egg problem
<damo22>Upon receiving an IPI message that was sent using logical destination mode, a local APIC compares the MDA in the message with the values in its LDR and DFR to determine if it should accept and handle the IPI
<damo22>but how do you set the LDR and DFR on APs before they start?
<damo22>so you can interrupt them and wake them up
<damo22>i think we need to change the code to send a broadcast IPI
<damo22>and have them all start in parallel
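A minimal sketch of the low ICR dword for such a broadcast, using the xAPIC bit layout; the macro and function names are hypothetical, and the INIT/SIPI delays and the write to the actual register are left out:

```c
#include <assert.h>
#include <stdint.h>

/* xAPIC ICR low dword fields. */
#define ICR_DM_INIT          (5u << 8)   /* delivery mode: INIT */
#define ICR_DM_STARTUP       (6u << 8)   /* delivery mode: Start-up */
#define ICR_LEVEL_ASSERT     (1u << 14)
#define ICR_SH_ALL_EXCL_SELF (3u << 18)  /* shorthand: all excluding self */

/* INIT to every processor except the sender; no destination needed. */
static uint32_t icr_low_broadcast_init(void)
{
    return ICR_DM_INIT | ICR_LEVEL_ASSERT | ICR_SH_ALL_EXCL_SELF;
}

/* Start-up IPI: the vector is the 4 KiB page number of the real-mode
   entry point, so it must be page aligned and below 1 MiB. */
static uint32_t icr_low_broadcast_sipi(uint32_t start_phys)
{
    assert((start_phys & 0xfff) == 0 && start_phys < 0x100000);
    return ICR_DM_STARTUP | ICR_LEVEL_ASSERT | ICR_SH_ALL_EXCL_SELF
         | (start_phys >> 12);
}
```

With the shorthand set, the destination field is ignored, so the width of the APIC ids no longer matters. The cost is that every AP runs the trampoline at once.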
<damo22>theres only 8 bits in the mask that you can use for identifying a cpu
<damo22>so you cant send unique IPIs to more than 8 cpus if they have APIC ids > 0xf
<damo22>youpi: it is actually impossible to uniquely address more than 8 processors with IPIs, (with the exception of some x86 hardware that allows 16 processors)
<damo22>with APIC
<damo22>i think the only way to start up a cpu with more than 8 cores is to do them in parallel
<damo22>so if we are going to fix smp, we may as well invest time in making it work on > 8 processors
<azert>damo22: out of curiosity, how many cores does your cpu have?
<librehawk>x64 WAS WHAT HURD USERS WANTED, You Delivered! Mob Love From Kenya, Lodwar.
<librehawk>The Next #Challenge for The Hurd Community Is Guides & Tooling for Buildings Hurd Native Drivers and Applications...
<librehawk>I Love You Guys
<damo22>azert: My cpu only has 8 cores, but i dont want to write code that only works on 8
<damo22>i have a patch for parallel smp init, but the synchronisation is broken, and it hangs sometimes and boots other times
<damo22>I have a branch of gnumach that almost has parallel smp init working for any number of cores https://git.zammit.org/gnumach-sv.git/log/?h=fix-smp-amd
<damo22>i need to review all functions in cpu_setup() to see if any are trampling on each other
<AlmuHS>xAPIC allows a max of 256 cpus. The APIC ID even has 8 bits (2^8 = 256)
<AlmuHS>so using more than 8 cpus should not be a problem
<AlmuHS>in Qemu I got the SMP kernel to boot with 16 cpus
<AlmuHS>with the scheduler patch, it booted and the pthread test showed that all cpus were working
<AlmuHS>i have a program which creates 16 threads using pthread, and each of these runs an infinite loop showing its APIC ID
<AlmuHS>I tested it in Qemu using 16 cpus, and it worked, showing alternating numbers in range 0-16
<AlmuHS>what is the problem? the IPI routine's address?
<janneke>damo22: oh very nice!
<gnu_srs1>Hello. Which program/script is active after: start ext2fs: Hurd server bootstrap: ext2fs[device:hd0s1] exec startup proc auth.
<gnu_srs1>I'm trying to upgrade an old image with dpkg-deb -x and boot hangs after the above :(
<gnu_srs1>Where to find info about the boot sequence of Hurd?
<janneke>gnu_srs1: starting a debian hurd, i see
<janneke>Hurd server bootstrap: ext2fs[part:1:device:wd0] exec startup proc auth.
<janneke>INIT: version 3.08 booting
<janneke>looking at the guix boot sequence, there's also: daemons/runsystem.sh
<janneke>ACTION had a writeup of this somewhere...
<janneke>hurd/startup.c is also a very early candidate
<damo22>AlmuHS: Only some x86 cpus support destination register with 8 bits of physical apic id, mostly only support 4 bits... but the apic id can be wider than 4 bits and so you cant address the lapic
<damo22>qemu doesnt have this limitation
<damo22>the way to work around it, is to use logical destination mode, but you still only have 8 unique mask bits to address 8 groups of cpus
<damo22>so i solved this in my branch using (cpu_number % 8)
<damo22>and ALL_EXCLUDING_SELF parallel startup
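A minimal sketch of the (cpu_number % 8) scheme in flat logical mode, where the logical id is a one-bit-per-group mask held in bits 24-31 of the LDR. The helper names are hypothetical and the code in damo22's branch may differ:

```c
#include <assert.h>
#include <stdint.h>

#define APIC_DFR_FLAT 0xffffffffu  /* DFR value selecting the flat model */
#define LDR_SHIFT     24           /* logical id lives in LDR bits 24-31 */

/* One of 8 mask bits, shared by every cpu with the same cpu % 8. */
static uint8_t logical_id_for(int cpu)
{
    assert(cpu >= 0);
    return (uint8_t)(1u << (cpu % 8));
}

/* Value each AP would program into its own LDR once it is running. */
static uint32_t ldr_value_for(int cpu)
{
    return (uint32_t)logical_id_for(cpu) << LDR_SHIFT;
}

/* Flat model: a logical IPI is accepted when its destination mask (MDA)
   shares at least one bit with the receiver's logical id. */
static int accepts_logical_ipi(int cpu, uint8_t mda)
{
    return (logical_id_for(cpu) & mda) != 0;
}
```

Cpus 0 and 8 end up with the same mask bit, so a logical IPI aimed at one also reaches the other. Eight bits address eight groups, not eight-plus individual processors.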
<gnu_srs1>janneke: tks for your hints. No luck so far :(
<gnu_srs1>repeating myself:
<gnu_srs1>(18:05:33) gnu_srs1: Hello. Which program/script is active after: start ext2fs: Hurd server bootstrap: ext2fs[device:hd0s1] exec startup proc auth.
<gnu_srs1>(18:07:25) gnu_srs1: I'm trying to upgrade an old image with dpkg-deb -x and boot hangs after the above
<gnu_srs1>(20:02:56) gnu_srs1: Where to find info about the boot sequence of Hurd?
<damo22>gnu_srs1: probably "init"
<gnu_srs1>damo22: /usr/sbin/init from sysvinit-core??
<gnu_srs1>I have linked that file to /sbin too. No progress :(
<damo22>most likely /etc/init.d/rc*
<damo22>use sysv-rc-conf to configure it
<gnu_srs1>seems like there is no debug or verbose option for that file.
<gnu_srs1>sysv-rc-conf does not exist.
<damo22>might need to install it
<damo22>youpi: do you know of any reason why parallel smp init may break if all the APs are running cpu_setup() at the same time?
<damo22>my branch sometimes boots, sometimes hangs
<youpi>iirc there were some initialization things that were using global variables
<youpi>probably with per_cpu data we can fix that
<damo22>ah right nice
<youpi>also, possibly it's difficult to control the APs with ACPI etc. if they all start at the same time
<damo22>we dont have an option
<gnu_srs1>tks. Not installed either on an updated box or the failing box. I think it is something else, missing packages or too old/buggy grub.cfg.
<damo22>since there is no way to address them individually
<damo22>it works in qemu because qemu doesnt have 4 bit limitation on destination field
<damo22>so really its emulating a xeon