IRC channel logs

2023-12-07.log

<muurkha>there are various ways you can screw that up, and as I understand it, Niagara did actually deliver on that expected performance
<oriansj>muurkha: thank you for clarifying but I was speaking of single thread performance
<oriansj>and it is definitely true in multi-threaded applications (which were integer heavy) saw significant performance relative to dual core OoO processors
<muurkha>significant performance relative?
<oriansj>UltraSPARC T1 server ran 13.5 times faster than on an AMD Opteron server for MySQL benchmarks
<muurkha>significant performance gains relative?
<oriansj>yes
<muurkha>aha
<muurkha>anyway in 02006 single-threaded performance was significantly more critical than it is now
<oriansj>which reminds me, the Verilog source code of the UltraSPARC T1 design is available under the GPL
<muurkha>yes
<muurkha>that is not very useful
<oriansj>so certainly something we could use if we had fabs
<oriansj>although register windows were in retrospect a bad idea.
<muurkha>especially with a barrel processor!
<oriansj>nothing like having 1024*${pipeline stages} registers
<muurkha>yeah, exactly
<muurkha>I wonder if a 6502-like design might make more sense in that context
<oriansj>well assuming a relatively short pipeline (14 or less stages) and less than 24 registers; you can get register sets of equal size to modern OoO processors.
<oriansj>but 14*9KB of L1 isn't ideal
<muurkha>the Tera MTA didn't have any L1 or indeed any cache at all because it was a 128-thread barrel processor
<muurkha>suppose you're running at 4 GHz with 64 banks of SDRAM each of which can send you an arbitrary memory word in a latency of 80ns, plus another 80ns for the crossbar mesh that connects the processor to the RAM
<muurkha>if you have 1024 threads active, each thread only runs one instruction every 256 nanoseconds. That means it can issue a memory read in one instruction and always get the result in the very next instruction.
<muurkha>in that situation, what's the advantage of reading from a register instead of reading from memory?
<muurkha>just that it reduces the bandwidth required from the memory subsystem and the number of instruction bits. It doesn't make your program run any faster.
<muurkha>the bandwidth is important, though. If you're running four instructions per nanosecond, and each instruction fetches an instruction word and two operands, you're fetching 12 memory words per nanosecond, 12000 per microsecond. that ends up being a word from each memory bank every 5.3 ns in the 64-bank scenario I set up here, which I think is implausible
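The figures in muurkha's scenario can be checked with a few lines of arithmetic. This is only a sketch of the numbers quoted above (4 GHz issue rate, 1024 threads, 64 banks, 80 ns of SDRAM latency plus 80 ns for the crossbar, three memory words per instruction); the variable names are made up for illustration.

```python
# Barrel-processor arithmetic using the numbers from the discussion above.
CLOCK_GHZ = 4                  # instructions issued per nanosecond
THREADS = 1024                 # active hardware threads
BANKS = 64                     # independent SDRAM banks
MEMORY_LATENCY_NS = 80 + 80    # SDRAM access plus crossbar traversal
WORDS_PER_INSTRUCTION = 3      # one instruction word plus two operands

# Each thread gets one issue slot per full rotation through all threads.
ns_between_instructions = THREADS / CLOCK_GHZ

# A read issued in one slot is back before the same thread's next slot
# whenever the round-trip latency fits inside one rotation.
latency_hidden = MEMORY_LATENCY_NS <= ns_between_instructions

# Aggregate demand on the memory system, and the per-bank interval it implies.
words_per_ns = CLOCK_GHZ * WORDS_PER_INSTRUCTION
ns_per_word_per_bank = BANKS / words_per_ns

print(ns_between_instructions, latency_hidden, words_per_ns,
      round(ns_per_word_per_bank, 1))
```

With these inputs the latency is hidden (160 ns against a 256 ns rotation), but each bank would have to deliver a word roughly every 5.3 ns, which is the part muurkha calls implausible.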
<oriansj>muurkha: well if you feel that way; Texas Instruments TMS9900 only had 2 registers per core but used 16 words of address space pointed at by the Work Space pointer of RAM; which effectively makes for infinite register windows and no spill logic needed
<muurkha>right, I was thinking of the 9900 a bit. of course it was slow as hell because its RAM wasn't multibanked or crossbarred and its processor wasn't multithreaded
<muurkha>the 6502 got a lot better performance than the 9900 with half the transistors with five registers, totaling 48 bits, or 56 if you count the flags, which is barely more than the 9900's 32 bits of on-chip registers
<muurkha>actually the 9900 had an 11-bit status register with flags and an interrupt mask too, so 43 bits in all
<oriansj>well the 6502 was also pipelined which is why a 1MHz 6502 could hold its own against a 4MHz Z80
<oriansj>but a barrel 9900 definitely would have been one hell of a 16bit processor but lack of an MMU would have resulted in a very buggy system (beyond the process problems that plagued that chip its entire life)
<Googulator>Successful bootstrap on bare metal from USB flash drive using fossy's simplify branch confirmed.
<muurkha>congratulations!
<oriansj>Googulator: great work
<muurkha>the 6502 was not pipelined; it just generated a faster internal clock and had better-designed internal control logic. also a 1MHz 6502 is closer to a 2MHz Z80 than a 4MHz Z80
<oriansj>the chip does some limited overlapping of fetching and execution; so you are correct that it is not fully pipelined
<muurkha>I don't think it even overlaps fetching one instruction with executing the previous one; details on the state machine are at, for example, https://www.nesdev.org/wiki/Visual6502wiki/6502_Timing_States
<muurkha>hmm, I guess it does, a little bit: "The opcode remains undisturbed inside the IR all the way to the end of the next [T1] clock state ([T1] phase 2). This allows an instruction to do its last signal origination even during the fetching of its successor instruction by other circuits on the chip. These propagated last signals can perform the final operations of an instruction even one cycle later
<muurkha>(T2 again) when the next instruction is in the IR."
<muurkha>that seems like in fact a crucial aspect of being able to run some instructions in only two cycles instead of three or more. thank you! I was wrong about that
<oriansj>thank you for the link, it is helpful
<muurkha>the Z80 did have instructions that ran in four cycles, though, which I think is equivalent to the 6502's two, given the difference in how their clocks were generated
<muurkha>perhaps fewer of them
<oriansj>well; the z80 certainly did better on object assembly than the 6502 which worked better on array assembly
<muurkha>hmm?
<muurkha>what are those?
<oriansj>if you wanted to represent a set of objects in assembly; in z80 you would be doing standard objects but in 6502 you would create a set of arrays.
<muurkha>oh, you mean like structs?
<oriansj>which makes direct assembly performance hard to compare as you would be writing quite different code if you wanted good performance on the chips
<oriansj>bingo
<muurkha>why would it matter?
<muurkha>oh, because on the Z80 you had the HL register, which worked as a 16-bit pointer?
<oriansj>and you wouldn't need to try to reload your pointer to your object list when getting values out of the object
<muurkha>I've never actually written any Z80 or 6502 code
<muurkha>although a Z80 was the first computer I programmed, I programmed it only in BASIC
<oriansj>fair enough, you can get the feel for it on x86 assembly if you limit yourself to only 2-3 registers
<oriansj>and the 6502 stack was only 256bytes in size; so you needed to play games to work around that.
<muurkha>yeah
<muurkha>I've read about it, I just haven't done it ;)
<oriansj>there are still a great many tricks for shaving bytes off binaries that I need to learn, but at least I feel I have mastered clean object assembly
<muurkha>the ARM assembler had some facilities for making struct-based assembly (I refuse to call it "object assembly") easier to read and write
<muurkha>you could define symbols for offsets, and in particular you could define a block of them so the assembler would assign the offsets
<muurkha>when you defined them you had the option of also specifying a base register, so you could load field Foo of whatever register r3 was currently pointing at just by saying ldr r2, Foo
<muurkha>which would get translated to something like ldr r2, [r3, #12]
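As a rough illustration of what such an assembler facility does, the Python sketch below assigns consecutive offsets to a block of named fields and then renders a field load relative to a base register. It is not the ARM assembler's actual syntax or implementation; the field names, sizes, and helper functions are hypothetical, chosen only so that Foo lands at offset 12 as in the example above.

```python
def assign_offsets(fields):
    """Assign each (name, size_in_bytes) field the next free offset."""
    offsets = {}
    offset = 0
    for name, size in fields:
        offsets[name] = offset
        offset += size
    return offsets


def load_field(dest, field, offsets, base):
    """Render a load of `field` relative to the base register `base`."""
    return f"ldr {dest}, [{base}, #{offsets[field]}]"


# Hypothetical struct layout: four 4-byte fields, so Foo sits at offset 12.
fields = [("Next", 4), ("Prev", 4), ("Flags", 4), ("Foo", 4)]
offsets = assign_offsets(fields)

print(load_field("r2", "Foo", offsets, base="r3"))
```

The point of letting the tool assign the offsets is that inserting or resizing a field only changes the layout table, not every instruction that touches the struct.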
<oriansj>fair enough; https://sourceware.org/binutils/docs/as/Struct.html
<muurkha>yeah, the old ARM assembler was a little more convenient and less error-prone than that
<oriansj>even M0 has something like that using creative DEFINEs
<muurkha>it turns out gas macros are powerful enough to implement the ARM assembler facility though: http://canonical.org/~kragen/sw/dev3/mapfield.S
<muurkha>except for the bit about including the register in the definition, which apparently you can't do
<oriansj>that surprises me
<oriansj>as gas macros are pretty powerful stuff
<muurkha>maybe you could do it with the explicit CPS transformation
<muurkha>basically the problem is that you can define two kinds of things in gas: labels and macros
<muurkha>labels can only be defined as numbers, addresses, or other labels (which eventually bottom out in numbers or addresses)
<muurkha>if you write, as in the example at the top there:
<muurkha>LDR r0, Lab
<muurkha>the only thing you can define Lab as is as a label. but you want the instruction to expand out to LDR r0, [r9, #4], not LDR r0, somenumber, because that isn't valid assembly
<muurkha>you could define a macro called something like LDF, for "load field"
<muurkha>and invoke it as
<muurkha>LDF r0, Lab
<muurkha>or even WF, "with field"
<muurkha>WF Lab LDR R0
<muurkha>that's what I mean about the explicit CPS transformation
<oriansj>indeed
<muurkha>but there's no way to get the more understandable LDR r0, Lab to work, as far as I can tell
<oriansj>well LDR isn't by default understandable despite being a standard
<muurkha>I mean, if you already know what LDR means, it's helpful to be able to see that the instruction is an LDR
<oriansj>well once you invoke gas macros; you can't be sure LDR is an LDR or a macro named LDR
<muurkha>I'm not sure how to make that work
<muurkha>because ultimately it does have to emit the instruction named LDR, right?
<Googulator>currently testing another patch to builder-hex0 for USB boot on a stubborn motherboard
<Googulator>on this board, int 13h needs pushad/popad around it, and only LBA is supported
<Googulator>oriansj: do we need to support any board with a CHS-only BIOS (one with no int 13h extensions)?
<Googulator>AFAIK that would be pre-1997
<oriansj>muurkha: nope, you can get it to mean any instruction you want; the problem ultimately means subversive assembly
<oriansj>Googulator: need => no; nice to have => yes; but right now the bootstrap uses a good bit more resources than older computers can provide.
<oriansj>nothing more horrid than realizing mov eax, ebx in your assembly file does not actually result in a mov instruction nor touch the eax or ebx registers
<Googulator>Looks like my last USB-boot-on-stubborn-board test came to a conclusion: don't bootstrap on random Kingston USB drives you got for free :(
<Googulator>at least not with swap enabled
<Googulator>looks like it killed the flash
<Googulator>tbh, I was a bit worried about this, since this drive has always had terrible write performance, and especially low write IOPS
<Googulator>suggesting high write amplification or just general crappy flash
<Googulator>moral of the story: check write IOPS on any Flash-based drive you plan to bootstrap on
<muurkha>yeah, swap is hard on flash
<muurkha>sounds like you had an annoying day
<Googulator>Well, the test did get far enough to show that my next round of builder-hex0 fixes does work
<Googulator>I had to switch to LBA also in stage1, and wrap the int 13h call into pushad/popad in both stages
<Googulator>(Award BIOS really doesn't like CHS access on USB drives)
<Googulator>"To use LBA addressing with INT 0x13, you need to use a command in the "INT13h Extensions". *Every BIOS since the mid-90's supports the extensions,* but you may want to verify that they are supported anyway."
<Googulator>IMO that's about as far as it's worth going back in time to support.
<Googulator>Anything pre-"mid-90s" will be a) too slow to reasonably bootstrap on, b) not able to address enough memory, and c) too easy to maliciously emulate using modern hardware
<muurkha>not sure if bluepilling is easier or harder for older hardware; we aren't doing much that would stress a VM in particular
<muurkha>it makes sense that you'd want to use LBA
<muurkha>I mean https://en.wikipedia.org/wiki/Blue_Pill_(software)
<Googulator>My concern isn't bluepilling
<muurkha>I'd forgotten it was by Joanna Rutkowska
<muurkha>what kind of malicious emulation do you mean?
<Googulator>Some modern SoC or microcontroller design (not sure if e.g. the RP2040 would be powerful enough, but RK3568 surely is) faithfully emulating a Pentium MMX CPU until it hits some code it wants to backdoor, all packaged up to look convincingly like an actual Pentium MMX, and pin-compatible with it
<Googulator>This is why I don't want to go back _too much_ in time
<Googulator>(Pentium MMX appears to be the last platform where there's a risk of no LBA support in the BIOS)
<Googulator>As I understand it, the "knight" ISA is based on Tom Knight's LISP machine design
<Googulator>Which is an example of what I'd consider old enough to be vulnerable to such an "evil clone" attack
<Irvise_>Hi all, long time no see :)
<Irvise_>I would love to bring good news about the Ada bootstrap compiler, but pretty much nothing has happened...
<Irvise_>Though I do come with a question that may easily be answered by someone here.
<Irvise_>Has Erlang been bootstrapped? Afaik, it has not. The VM is in C, but the compiler is in Erlang and has been in Erlang even before V1.
<Inline>why are the bootstraps only 32bit ?
<Irvise_>Inline: afaik, FiwixOS is 32-bit only. I do not know how that works for other arches that are not i386.
<Inline>i just got the git sources and followed the instructions, and it built me a gcc now and it is 32bit only, and i started it with --qemu option
<Inline>but i don't see anything mounted, dunno where the env is
<Inline>i mean it has /sys /proc/ etc. mounted but i don't see where the root of that shall be
<stikonas_>Inline: a few reasons for 32-bits but none are really fundamental, just need work
<stikonas>in particular, mes was originally written only for x86 and since that just runs fine on current x86_64 machines, nobody ported it
<Inline>ok
<stikonas>we now have riscv64 port of mes that goes to bootstrappable tcc, so quite a few 64 bit bugs are solved
<stikonas>but probably not all
<Inline>i found the tmp/sysa/sysa.img
<Inline>that one is used for booting it seems, tho it only contains the DOS/MBR
<stikonas>I've tried running 64 bit amd64 bootstrap and it goes all the way to tcc-mes but that binary crashes
<stikonas>so probably once that step is resolved, things can progress much further on x86_64
<Inline>ok thank you
<stikonas>Inline: what do you mean only DOS/MBR?
<stikonas>no other data (like sources)?
<Inline>no no it has loads of sectors 33543719
<stikonas>yeah, that's good
<Inline>it's just what file sysa.img shows
<Inline>that's something around 4G not ?
<Inline>right
<stikonas>hmm, for me it shows 16 GB
<stikonas>and file indeed only says sysa.img: DOS/MBR boot sector
<Inline>right 16G my bad
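The 16 GB figure follows from the sector count quoted above, assuming the usual 512-byte sectors (the sector size is an assumption; it is not stated in the log). A quick check:

```python
SECTOR_BYTES = 512      # assumption: conventional 512-byte sectors
sectors = 33543719      # sector count reported for sysa.img above

size_bytes = sectors * SECTOR_BYTES
size_gib = size_bytes / 2**30

print(size_bytes, round(size_gib, 2))
```

Under that assumption the image is a little under 16 GiB, which matches what stikonas sees rather than the initial 4G guess.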
<Inline>ok
<stikonas>well, it's mostly zeroes...
<Inline>df -h used: 2.3G avail 13G
<Inline>right
<Inline>it's gcc 13.1 version
<Inline>is there a reason why it stops there ?
<stikonas>well, that's where we got to
<stikonas>you can extend it...
<Inline>ok
<stikonas>e.g. FreedesktopSDK builds on top of that
<stikonas>but gcc 13 is not a bad place to get to
<Inline>where is FreedesktopSDK to be found ?
<stikonas> https://gitlab.com/freedesktop-sdk/freedesktop-sdk/-/merge_requests/11557
<Inline>thank you
<stikonas>that's what is used for most flatpaks
<stikonas>though they have some binary stuff, e.g. rust on top of it
<stikonas>hmm, I now think that my stage0-uefi issues are not due to header... something is probably wrong in the code...
<stikonas>Inline: by the way, did you have anything else in mind after GCC?
<stikonas>in principle it might be nice to build some distros
<stikonas>probably source based distros like Gentoo would be the easiest target
<Inline>no idea how to proceed on from here stikonas
<stikonas>depends on what your goal is
<Inline>so how would i for example install a distro like Gentoo ?
<Inline>or say Guix
<stikonas>well, Guix is much harder...
<stikonas>but Gentoo probably has some prefix setup script
<stikonas>I think sam_ knows this better
<Inline>hmmm
<stikonas>possibly https://gitweb.gentoo.org/repo/proj/prefix.git/plain/scripts/bootstrap-prefix.sh
<stikonas>the problem with Guix is that even if you get upstream guix source, you'll still have to hack around it to try to avoid using its bootstrap seeds
<Inline>ok thank you
<Inline>so is that script above needed to be invoked from within the image ? or just outside of it ?
<stikonas>well, within the image
<stikonas>or in the after.sh hook
<stikonas>but I haven't tried it myself yet
<Inline>ok thank you
<fossy>Inline: the work with building distros on top of live-bootstrap is not yet really done
<fossy>core live-bootstrap has a goal of a modern toolchain, which is achieved, at some point you'll be able to plug things in on top of that, but .. not quite yet, as those things don't really exist :D
<Inline>thank you