IRC channel logs

2022-06-16.log


***saksophony_ is now known as saksophony
***lh_ is now known as lh
***alethkit_ is now known as alethkit
***achaninja_ is now known as achaninja
***sm2n_ is now known as sm2n
***unmatched-paren_ is now known as unmatched-paren
***pi1 is now known as johnjaye
<oriansj>now the 4/8-bit processor seems to be limited to 8 words of DATA RAM (4/8 bytes) and a 7-bit address space doesn't leave much room to do much, but 127 instructions might be enough to do hex0
<johnjaye>is there a good academic paper on bootstrappability?
<oriansj>johnjaye: in general or you mean specific cases (like say bootstrapping the ocaml language)
<johnjaye>i suppose in general. like what are the costs and benefits of bootstrapping a compiler vs compiling the compiler with the compiler.
<oriansj>johnjaye: ummm we use compilers to bootstrap compilers; we just write the base compiler in assembly
<johnjaye>oh ok.
<johnjaye>i thought bootstrapping had some kind of theoretical properties like that.
<oriansj>for example here is a C compiler written in RISC-V assembly: https://github.com/oriansj/stage0-posix-riscv32/blob/master/cc_riscv32.M1
<oriansj>and yes there are a good deal of theoretical properties for bootstrapping but in practical terms, ultimately you either bootstrap a language in a high level language or in assembly
<oriansj>So if you have a high level compiler you can trust, it is usually much easier just to use that
<oriansj>but if you have no high level compiler you trust, there really isn't any other option but to build a compiler written in assembly
<oriansj>(or high level interpreter)
<oriansj>as one could write a compiler in a high level interpreted language: (such as scheme) https://github.com/oriansj/mes-m2
<oriansj>but that interpreter has to be written in either a high level language or in assembly
<oriansj>(schemes tend to be written in C and FORTHs tend to be written in Assembly)
<stikonas>johnjaye: sadly there isn't that much theory in bootstrapping. It's actually fairly simple
<johnjaye>i see. well fair enough.
<stikonas>you can maybe see some connections to self-referential paradoxes in some places
<stikonas>e.g. let's say you build sha256sum. On its own it can't be used to prove that it's not a malicious binary
<johnjaye>can you bootstrap with a compiler written in pcode or IR?
<stikonas>I guess you can but the compilers we have are non-optimizing
<stikonas>so there isn't really any IR or pcode
<oriansj>johnjaye: and what exactly do you think will be running that pcode or IR?
<stikonas>and yes, that too
<stikonas>M2-Planet (the simplest subset of C compiler that is still written in C) and slightly simpler cc_x86 (even smaller subset of C compiler written in assembly) are very simple compilers, they just go over the source (single-pass) and basically directly spit assembly out
<stikonas>(they do keep a bit of state to deal with local variables, etc but not much)
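
To make the single-pass idea concrete, here is a minimal sketch of that style of compiler in C. It is a hypothetical illustration, not actual M2-Planet or cc_x86 code: a toy compiler for sums of integer literals that prints x86-style assembly the moment each token is consumed, with no IR in between.

    /* Toy single-pass compiler: parse and emit in the same breath.
     * Hypothetical sketch, far simpler than M2-Planet or cc_x86. */
    #include <stdio.h>
    #include <ctype.h>

    static const char *src;            /* cursor into the source text */

    static void skip_spaces(void) { while (isspace((unsigned char)*src)) src++; }

    static void compile_primary(void)  /* integer literal -> eax */
    {
        int n = 0;
        skip_spaces();
        while (isdigit((unsigned char)*src)) n = n * 10 + (*src++ - '0');
        printf("mov eax, %d\n", n);
    }

    /* expr := primary { '+' primary } ; assembly is emitted directly,
     * no syntax tree or intermediate representation is ever built */
    static void compile_expr(void)
    {
        compile_primary();
        skip_spaces();
        while (*src == '+') {
            src++;
            printf("push eax\n");      /* save the left operand */
            compile_primary();
            printf("pop ebx\n");
            printf("add eax, ebx\n");  /* eax = left + right */
            skip_spaces();
        }
    }

    int main(void)
    {
        src = "1 + 2 + 3";
        compile_expr();                /* prints the assembly for 1+2+3 */
        return 0;
    }
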
<johnjaye>meaning there's not much to be gained from optimizing it? i'm not sure i get it
<stikonas>meaning that optimizations are not important for bootstrapping
<stikonas>you use your compiler once to build something more complicated
<stikonas>compiler optimizations only matter if you use compiler program a lot
<oriansj>johnjaye: does 5 minutes seem like a huge problem that would benefit by shaving off seconds?
<stikonas>and most of the slowdown actually comes from interpreter (mes)
<stikonas>M2-Planet builds really fast
<stikonas>oriansj: was it something like 250k lines per second?
<oriansj>stikonas: yes
<johnjaye>hmm i see
<johnjaye>wait is lines per second a real metric?
<oriansj>that was the worst case performance
<oriansj>johnjaye: yes when self-hosted; it is 2Mloc/second when built by GCC
<stikonas>basically bootstrap goes something like hex0->hex1->hex2 (linker)->M0 (assembler)->cc_x86 (C compiler)-> M2-Planet (C compiler) -> mes (scheme interpreter) -> mescc (C compiler) -> tcc (C compiler)
<stikonas>if you run this chain everything is quite fast (compilation takes maybe a second) except for interpreted code
<stikonas>building tcc takes maybe 8 minutes
<johnjaye>why is that a chain of multiple assemblers and multiple compilers
<johnjaye>shouldn't it be compiler makes a compiler?
<stikonas>how do you get 1st compiler?
<stikonas>you write it in assembly
<johnjaye>right
<oriansj>johnjaye: short version: mescc got to building TCC before M2-Mesoplanet could
<stikonas>oriansj: I think johnjaye is asking why this chain has both assemblers and compilers
<oriansj>Elapsed (wall clock) time (h:mm:ss or m:ss): 1:35.29 from zero to mescc
<stikonas>(and that's including extra tools)
<stikonas>basically we want to build compiler from the smallest possible binary seed
<stikonas>you can't build it without any seed at all
<oriansj>johnjaye: to the question of why the complexity, it has to do with minimizing development and review efforts
<stikonas>and a large binary seed (e.g. 1 MiB) can't be easily inspected by a human
<oriansj> but a few hundred bytes can be
<stikonas>yes, hex code is hard to inspect and write
<stikonas>once you have hex1, it can do some of the calculations for you and it becomes easier
<oriansj>so we do it as little as possible
<stikonas>still somewhat tedious
<stikonas>hex2 is a slightly improved version of hex1 that removes some of the limitations
<oriansj>but each is easier to work in than the previous step
<oriansj>and more powerful
<oriansj>tools are made with each step
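
For a sense of how little the hex0 step has to do, the whole job can be sketched in a few lines of C (the real seed is a few hundred bytes of machine code, not C): skip '#' and ';' comments, turn each pair of hex digits into one output byte, ignore everything else.

    /* Sketch of a hex0 "assembler" in C for readability.
     * '#' and ';' begin comments; hex digit pairs become bytes. */
    #include <stdio.h>

    static int hex(int c)
    {
        if (c >= '0' && c <= '9') return c - '0';
        if (c >= 'a' && c <= 'f') return c - 'a' + 10;
        if (c >= 'A' && c <= 'F') return c - 'A' + 10;
        return -1;
    }

    int main(void)
    {
        int c, high = -1;
        while ((c = getchar()) != EOF) {
            if (c == '#' || c == ';') {      /* comment: skip to end of line */
                while ((c = getchar()) != EOF && c != '\n') ;
                continue;
            }
            int v = hex(c);
            if (v < 0) continue;             /* non-hex characters are ignored */
            if (high < 0) high = v;          /* first nibble of the byte */
            else { putchar(high << 4 | v); high = -1; }
        }
        return 0;
    }
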
<stikonas>well, once we have M2-Planet we compile loads of other tools
<stikonas>since it's already C compiler
<stikonas>after M0 programming is basically assembly programming but we only need to write 1 program in it (cc_*)
<johnjaye>i see
<stikonas>cc_* builds its cross-platform variant in C (M2-Planet)
<stikonas>and after that it's just normal programming
<stikonas>although in somewhat restricted subset of C
<stikonas>but perfectly readable for any C programmer
<oriansj>with standard C libraries
<stikonas>yes, e.g. untar https://github.com/oriansj/mescc-tools-extra/blob/master/untar.c
<stikonas>and everything can run automated and build GCC
<johnjaye>is this all about guix or is it a separate project?
<stikonas>by starting with 2 small binaries (357 byte hex0 and 757 byte kaem-minimal shell)
<stikonas>it's not just guix although guix does that too
<oriansj>johnjaye: well it is bigger than Guix but yes many guix people are also here and it did start on #guix
<stikonas>but we have scripts that are completely independent of guix
<oriansj>and the work is integrated in Guix
<johnjaye>oh ok
<stikonas>and some other people also tried writing their own bootstrap chains
<oriansj>and we encourage people to create any bootstrap chain they trust
<stikonas>but it's mostly 2 implementations
<johnjaye>i heard about it there yes. it seems to me the main reason you would want bootstrapping is for that whole build everything declaratively thing?
<stikonas>Guix (which starts with hex0 and static guile binary)
<stikonas>and live-bootstrap (which starts with hex0 and kaem-minimal)
<stikonas>well, declarative building is mostly guix thing
<stikonas>live-bootstrap focusses more on not allowing any pre-generated files (e.g. configure scripts are banned, or e.g. pre-built bison parsers)
<stikonas>and also does reproducibility a bit better than guix
<stikonas>though guix tries to be quite reproducible too
<oriansj>and declarative building requires a declarative language (like Guile)
<stikonas>well, in Guix you have packages written in Guile
<stikonas>live-bootstrap is basically just some collection of shell scripts
<stikonas>mostly bash
<oriansj>so bootstrapping is very much what you feel comfortable with
<stikonas>but it's self-contained and does not need a big interpreter like guile
<vagrantc>does it actually use features of "bash" specifically, as opposed to "sh" ?
<stikonas>vagrantc: yes, plenty
<stikonas>well, once bash is built
<oriansj>Guix is comfortable with the static guile binary and live-bootstrap sticks to the bootstrap-seeds
<stikonas>vagrantc: before bash is built it's just a list of commands (kaem scripts)
<vagrantc>stikonas: i assumed you waited till it was built to use it :)
<vagrantc>i'm not sure "guix is comfortable" so much as that is as far as it has gotten so far
<oriansj>vagrantc: bashisms are a far smaller issue if the only shells you have are bash and kaem
<stikonas>kaem can't run POSIX compatible sh files anyway
<stikonas>it's even more restricted
<vagrantc>guix is a project that started out with a large (but at least known) set of binary seeds and has been chipping away at them...
<stikonas>though it now has aliases and if else fi statements
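
A rough sketch in C of what a kaem-style shell amounts to (a hypothetical simplification, not the actual kaem-minimal source): one command per line, split on whitespace, run it, and abort the whole bootstrap if it fails.

    /* Minimal kaem-like command runner.  The real kaem-minimal is
     * more restricted still (no PATH search, explicit environment). */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/wait.h>

    int main(void)
    {
        char line[4096];
        while (fgets(line, sizeof line, stdin)) {
            line[strcspn(line, "\n")] = '\0';
            if (line[0] == '#' || line[0] == '\0') continue;  /* comment/blank */

            char *argv[64];
            int argc = 0;
            for (char *tok = strtok(line, " \t"); tok && argc < 63;
                 tok = strtok(NULL, " \t"))
                argv[argc++] = tok;
            argv[argc] = NULL;

            pid_t pid = fork();
            if (pid == 0) {
                execvp(argv[0], argv);       /* run the command */
                perror(argv[0]);
                _exit(127);
            }
            int status;
            waitpid(pid, &status, 0);
            if (!WIFEXITED(status) || WEXITSTATUS(status) != 0)
                return 1;                    /* abort on any failure */
        }
        return 0;
    }
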
<stikonas>yeah, and live-bootstrap started vice versa
<oriansj>vagrantc: well if guix wanted to get away from that Guile binary, they would have to move their bootstrap out of guix itself
<stikonas>it started with a small seed and was building on top of that
<vagrantc>stikonas: exactly
<vagrantc>oriansj: or just port guix to mes's scheme interpreter :)
<stikonas>vagrantc: that's probably easier said than done
<vagrantc>indeed :)
<stikonas>I think even gash is very experimental on mes
<oriansj>vagrantc: at that point mes.c would be functionality the same as guile; which would be freaking cool
<muurkha>oriansj: I'm curious whether it's more broadly applicable than the 4004; have you read the paper? I haven't yet
<muurkha>hmm, 8 words of data RAM and a 7-bit address space sounds too small
<muurkha>I haven't found a way to build a usable CPU in under about 2500 transistors
<oriansj>muurkha: the 4004 supported a 12 bit address space
<oriansj>so aside from being cheap as heck, it is worse than the 4004 in every technical way
<oriansj>muurkha: define usable CPU
<johnjaye>which paper?
<muurkha>johnjaye: yeah, you could write a P-code interpreter in assembly, then write a compiler in P-code. arguably that's what you're doing when you program in Forth :)
<johnjaye>haha ok
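
The P-code idea in miniature, sketched in C with invented opcodes: the dispatch loop is small enough to hand-write in assembly, and the bytecode it interprets could in principle be a compiler.

    /* Tiny stack-machine interpreter; opcodes are hypothetical. */
    #include <stdio.h>

    enum { OP_PUSH, OP_ADD, OP_PRINT, OP_HALT };

    int main(void)
    {
        /* "program": push 2, push 3, add, print, halt */
        int code[] = { OP_PUSH, 2, OP_PUSH, 3, OP_ADD, OP_PRINT, OP_HALT };
        int stack[256], sp = 0, pc = 0;

        for (;;) {
            switch (code[pc++]) {
            case OP_PUSH:  stack[sp++] = code[pc++]; break;
            case OP_ADD:   sp--; stack[sp - 1] += stack[sp]; break;
            case OP_PRINT: printf("%d\n", stack[sp - 1]); break;   /* 5 */
            case OP_HALT:  return 0;
            }
        }
    }
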
<johnjaye>most obvious ideas have probably been thought of and done. it's just hard to figure out what the project is named
<muurkha>22:45 < stikonas> compiler optimizations only matter if you use compiler program a lot
<oriansj>johnjaye: http://passat.crhc.illinois.edu/isca22_02.pdf
<muurkha>I think this should say "compiler optimizations only matter if you use the compiled program a lot"
<muurkha>and I'd add "much"
<muurkha>that is, they only matter much
<stikonas>muurkha: yes, though I was thinking compiled program is the next compiler
<oriansj>muurkha: well the RISC-II had only 40,760 transistors
<oriansj>of which 26,496 were used for the 138 registers
<johnjaye>anyway i was just curious about the organizational relationships. all these terms are really new and strange sounding. mes. m2. ccx86.
<oriansj>(assuming 32bits at 6 transistors per register bit)
<oriansj>so 40,760 - 26,496 = 14,264 transistors for everything else
<muurkha>oriansj: I think my working definition of "usable CPU" is one that can run a high-level-language interpreter
<muurkha>oriansj: yeah, I said 2500, not 25000
<oriansj>johnjaye: can you be more precise in what you are looking for?
<stikonas>johnjaye: it's just a name of some programs
<stikonas>johnjaye: just like gcc is just a name
<oriansj>muurkha: oooh that is much harder
<muurkha>stikonas: yes, agreed
<stikonas>sometimes there is some acronym behind but in the end any name is just a name
<johnjaye>kaem is the world's worst build manager. i see
<muurkha>oriansj: the bit-serial version of Calculus Vaporis should weigh in around 2500 transistors
<stikonas>kaem is anagram for make
<johnjaye>well gcc is just a name sure. but it's also a huge part of the software ecosystem and millions of lines of code (guessing)
<muurkha>there's a graffiti artist near here who makes these big elaborate mural-sized tags all over town that say "KAEM"
<muurkha>I should take some photos to share
<oriansj>muurkha: yes please
<muurkha>those 2500 transistors don't include RAM, just registers and logic
<oriansj>even the PDP-8/s is 519 logic gates
<stikonas>M2 is kind of a sequence. We have M0 (assembler written per arch), then there is M1 (cross-platform assembler written in C) and M2-Planet (C compiler in C)
<muurkha>yeah, Calculus Vaporis is about 600 IIRC
<stikonas>though maybe oriansj has better explanation for M2
<muurkha>like the PDP-8, it also has a 12-bit address space (of 12-bit words) and so plausibly could run something like a BASIC interpreter, though I haven't written one for it and might be disappointed if its instruction set turns out to be too bad at density
<stikonas>johnjaye: cc_x86 is even simpler (cc is c compiler)
<stikonas>johnjaye: mes is probably the most curious naming: Maxwell's Equations of Software
<oriansj>muurkha: actually the PDP-8 instruction set could be simplified without being hard to program
<muurkha>johnjaye: yeah, GCC is millions of lines of code
<stikonas>johnjaye: that's because mes interpreter can run mescc compiler that can build mes
<stikonas>so somewhat similar to how the electromagnetic field behaves
<muurkha>a non-bit-serial version of Calculus Vaporis is more like 4500 transistors
<oriansj>one could remove the rotate instructions from the PDP-8 and byte swap; not to mention the multiply and divide instructions
<oriansj>the subtract/skip instruction would be simpler as a jump if zero instruction
<oriansj>drop from 12bits to 8bits should simplify a few things as well
<oriansj>just increase the PC from 12bits to 16bits
<oriansj>an expanded instruction encoding could also be used to simplify decoding
<muurkha>I don't think the PDP-8/S had multiply and divide, did it?
<muurkha>dropping the instruction word, the memory word size, and/or the address bus width from 12 to 8 bits would indeed reduce the number of transistors needed, but it would probably also drop below being a usable CPU
<oriansj> https://en.wikipedia.org/wiki/PDP-8#Group_3
<muurkha>you could switch from being word-oriented to being byte-oriented though
<oriansj>8bit accumulator and 16bit address space is enough for a C compiler
<muurkha>yeah, for sure
<muurkha>lots of examples of that, I have one on the desk in front of me right now in fact
<oriansj>add, sub, xor, nand, jump if zero, LOAD, LOADI8, STORE would be sufficient for easy programming
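
That eight-instruction machine is small enough to sketch as a C emulator; the encoding below (an opcode byte followed by a little-endian 16-bit address, one immediate byte for LOADI8, and a halt marker) is invented for illustration.

    /* Emulator for the hypothetical 8-instruction, 8-bit-accumulator
     * machine: 16-bit PC, 64 KiB of byte-addressed RAM. */
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    enum { ADD, SUB, XOR, NAND, JZ, LOAD, LOADI8, STORE };

    static uint8_t ram[1 << 16];

    int main(void)
    {
        uint8_t acc = 0;
        uint16_t pc = 0;

        ram[0x0100] = 40;                        /* data the program reads */
        const uint8_t prog[] = {
            LOADI8, 2,                           /* acc = 2            */
            ADD,    0x00, 0x01,                  /* acc += ram[0x0100] */
            STORE,  0x01, 0x01,                  /* ram[0x0101] = acc  */
            0xFF,                                /* invented halt marker */
        };
        memcpy(ram, prog, sizeof prog);

        for (;;) {
            uint8_t op = ram[pc++];
            uint16_t addr;
            switch (op) {
            case LOADI8:
                acc = ram[pc++];
                break;
            case ADD: case SUB: case XOR: case NAND:
            case JZ: case LOAD: case STORE:
                addr = (uint16_t)(ram[pc] | ram[pc + 1] << 8);
                pc += 2;
                if (op == ADD)   acc += ram[addr];
                if (op == SUB)   acc -= ram[addr];
                if (op == XOR)   acc ^= ram[addr];
                if (op == NAND)  acc = (uint8_t)~(acc & ram[addr]);
                if (op == JZ && acc == 0) pc = addr;
                if (op == LOAD)  acc = ram[addr];
                if (op == STORE) ram[addr] = acc;
                break;
            default:                             /* halt */
                printf("ram[0x0101] = %d\n", ram[0x0101]);   /* 42 */
                return 0;
            }
        }
    }
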
<muurkha>I know about the group-3 instructions, I just dont think the PDP-8/S had the MQ register and thus the relevant instructions, did it?
<oriansj>muurkha: I don't know
<muurkha>did it have an EAE?
<muurkha>WP says, "The EAE was an option on the original PDP-8,[17] the 8/I,[18] and the 8/E, but it is an integral part of the Intersil 6100 microprocessor."
<muurkha>you don't really need XOR as an instruction
<muurkha>you probably do, however, need some way to compare integers for ordering (≤, ≥)
<muurkha>I guess that for an 8-bit subtraction result you could NAND with 0x80 to preserve only the sign bit, then compare the result to 0
<muurkha>and that would give you a way to compare integers for ordering
<muurkha>wait, not zero, 0xFF
<muurkha>so you'd have to NAND with 0xFF or 0x7F to map either 0x7F or 0xFF to zero, and then you could do it
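
Worked out in C, assuming a - b does not overflow: the first NAND isolates the sign bit (yielding 0x7F when negative, 0xFF when not), and a second NAND with 0xFF maps the non-negative case to zero, which a jump-if-zero can then test.

    /* Signed comparison from NAND alone; nand8() stands in for the
     * hypothetical CPU instruction.  Only valid when a - b fits in
     * a signed 8-bit result (no overflow). */
    #include <stdint.h>
    #include <stdio.h>

    static uint8_t nand8(uint8_t x, uint8_t y) { return (uint8_t)~(x & y); }

    static uint8_t is_negative(uint8_t x)
    {
        uint8_t t = nand8(x, 0x80);   /* 0x7F if sign set, 0xFF if clear */
        return nand8(t, 0xFF);        /* 0x80 if sign set, 0x00 if clear */
    }

    int main(void)
    {
        uint8_t a = 3, b = 5;
        uint8_t diff = (uint8_t)(a - b);              /* 0xFE, i.e. -2 */
        printf("a < b: %s\n", is_negative(diff) ? "yes" : "no");
        return 0;
    }
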
<stikonas>well, you almost never need any specific instruction, as is shown by one-instruction ISAs
<oriansj>muurkha: well cmp with greater and less bits being set is an optimization. Only subtraction is needed
<muurkha>oriansj: right
<oriansj>jump if negative is a cheap optimization one can do as well
<muurkha>add is also an optimization from that POV, which is why Calculus Vaporis doesn't have it
<muurkha>but it's an important one
<oriansj>jump if positive exists also in knight
<muurkha>stikonas: yeah, I mean that in practice the dynamic number of XORs is so low that evaluating XORs with a NAND instruction will have minimal performance impact
<muurkha>or code size impact
<muurkha>(because the static number of XORs is also low)
<stikonas>yes, for riscv64 in stage0-posix we have 6 uses in hex1 and 6 in hex2
<oriansj>jmp.nz (not zero), jmp.z (zero), jmp.n (negative), jmp.p (positive) would be quite easy to work with
<muurkha>not true, I think, of sign comparisons, addition, or right shift
<stikonas>(probably in same places)
<stikonas>and for other non-word arches we might have even fewer xors
<muurkha>if you remove those I think you get a significant slowdown
<oriansj>a single bit shift would probably eat a good few transistors
<muurkha>not transistors, no, just wires
<oriansj>muurkha: nice
<muurkha>hey man, I didn't do it
<muurkha>indexed store is also a pretty useful optimization, and was omitted from the 8080
<muurkha>(but not RV32I or even RV32E)
<oriansj>that would just be store r0 [r1]
<oriansj>but that would cost additional registers
<muurkha>no, by indexed store I mean store r0 [r1+8]
<oriansj>oh, no need for that at all
<oriansj>it'll only save a single instruction
<muurkha>no, you can do without it, but it's expensive
<muurkha>it's also missing on the PDP-8
<muurkha>yeah, but it saves a single instruction almost every single time you access memory :)
<oriansj>well if you look at my knight code: you'll see it isn't that big of a benefit
<muurkha>it's certainly possible to write code without it
<muurkha>I mean you can store most variables at fixed memory locations
<muurkha>but that gets painful anytime there's more than one of something
<oriansj>well I do load r0 [r0] a good bit (because ->next is always at offset zero there)
<muurkha>yeah
<oriansj>4 registers would really simplify things
<muurkha>also I think call and return instructions are pretty useful
<oriansj>they do simplify a great many things
<muurkha>if you don't have a return instruction you really need a way to jmp to a register value, otherwise you end up having to use self-modifying code for switch, return, and ad-hoc polymorphism (function pointers)
<oriansj>store-pc, load-pc would work
<muurkha>your list (add, sub, xor, nand, jump if zero, LOAD, LOADI8, STORE) also omits immediates and 8-bit stores. which, yeah, you can do without, but again it's awkward
<oriansj>LOADI8 is loading of an 8bit immediate
<oriansj>and we are only loading 8bits and storing 8bits
<oriansj>if I were to do a 16bit ALU then yes you would want LOAD, LOAD8, STORE, STORE8 and LOADI16
<oriansj>LOAD8U (possibly if you wanted to not sign extend)
<muurkha>oh, I misunderstood
<muurkha>agreed then
<muurkha>the 6502 had an 8-bit accumulator and two 8-bit pointers
<muurkha>it would have benefited a lot from an 8-bit zero-page pointer (which was added on the 65816) and a PC-page addressing mode
<oriansj>if one wanted to be extra easy to work with: add a 16bit stack pointer (as an 8bit stack is bleh) and push, pop, call and return be the only things that impacted it with the default value being 0xFFFFFFFF
<oriansj>muurkha: yeah the 6809 did that idea to optimize for smaller code size
<muurkha>you can do without 8-bit loads on a machine with a more-than-8-bit accumulator if you support unaligned access (as you normally would if you had an 8-bit data bus)
<oriansj>and we can cheat very hard too
<oriansj>put the registers in RAM itself
<muurkha>well, the 6502 kind of does
<muurkha>the zero page is the registers
<muurkha>the 6502 does okay with an 8-bit stack but it isn't really usable as a C stack
<oriansj>ACCUMULATOR in 0x00000000, PC in 0x00000002 and Stack in 0xFFFFFFFD
<oriansj>yeah you can't do a C compiler using the 6502 hardware stack and instead have to manually manage your own which is a huge pain
<muurkha>yeah, the PDP-10, PDP-X, and I think the PDP-8 mapped their architectural registers into the memory space
<johnjaye>i wonder if the pdp8/pdp11 were the earliest processors that could actually have a language like C on them
<muurkha>no, the PDP-8 didn't come out until 01965
<muurkha>there were lots of processors from the 01950s that would have been adequate for C and a few that would have been fine
<oriansj>well 8bit accumulator, 16bit PC and 16bit stack would only cost 240 transistors (assuming 6 transistors per bit)
<muurkha>that doesn't count the transistors to increment and decrement them, or the muxes to put those values onto different buses
<oriansj>true but that is what I used to estimate the transistor counts used in the RISC-II register file
<oriansj>so if I am under-estimating here, the savings also would be larger there
<muurkha>yeah
<muurkha>johnjaye: things that make C hard to implement include tagged memory (Burroughs 5000), non-byte-addressable memory (most scientific machines of the 01950s up to the 01990s and also the Nova and Tandem, though C implementations on TI DSPs solved this problem by simply making their chars 32 bits wide), a lack of bitwise operations (all BCD computers including most "business" machines from the
<muurkha>01950s), and not enough memory space (the HP 9100A, the GreenArrays F18A)
<oriansj>but if the 6502 is any indicator: 3,510 transistors, the majority of which are in the decode logic
<muurkha>that doesn't count the depletion-mode transistors
<muurkha>which bring the total to 4528
<oriansj> https://monster6502.com/
<muurkha>the 6502 or 8080 or PDP-8 isn't hard to implement C on, it's just that C code on them runs slowly
<oriansj>Total active transistors: 4237
<muurkha>yeah, the MonSter 6502 uses resistors for some of the depletion loads, I think
<johnjaye>oh ok
<muurkha>in theory a sufficiently smart C compiler could get C code to run fast on them, though I don't think such a thing has been written
<muurkha>but it's a different level of performance problem than not having bitwise operations in your CPU because your bytes range from 0 to 99
<muurkha>or not being able to byte-address half of your memory space because it's 128 KiB and your registers are only 16 bits (I think that was Tandem)
<oriansj> https://www.pagetable.com/?p=39
<oriansj>well we could extend the PC and stack to 24bits which would be more than enough to run an operating system and C compiler
<muurkha> http://www.righto.com/2013/01/a-small-part-of-6502-chip-explained.html is one citation for the 4528 number
<oriansj>adding overflow (for add) and borrow (for subtract) bits; jump.o (jump if overflow) and jump.b (jump if borrow) would enable efficient multibyte arithmetic
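
The multibyte arithmetic that bit buys, sketched in C: a 16-bit add performed as two 8-bit adds, with the unsigned comparison standing in for the hypothetical jump.o test of the carry bit.

    /* 0x01F0 + 0x0220 on an 8-bit accumulator: add the low bytes,
     * detect the carry, fold it into the high-byte add. */
    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        uint8_t a_lo = 0xF0, a_hi = 0x01;     /* a = 0x01F0 */
        uint8_t b_lo = 0x20, b_hi = 0x02;     /* b = 0x0220 */

        uint8_t lo = (uint8_t)(a_lo + b_lo);
        uint8_t carry = lo < a_lo;            /* the bit jump.o would test */
        uint8_t hi = (uint8_t)(a_hi + b_hi + carry);

        printf("sum = 0x%02X%02X\n", hi, lo); /* prints sum = 0x0410 */
        return 0;
    }
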
<muurkha>if you were doing it in CMOS instead of NMOS you'd probably need 7000 transistors or so
<muurkha>oriansj: C requires you to be able to take the addresses of local variables, so if your stack pointer is 24 bits then pointer variables in C also need to be 24 bits
<muurkha>which likely means that you need at least one other 24-bit register that isn't the PC or SP. although maybe if you don't have interrupts, or they run on a separate stack, you could get away with saving the SP in RAM at a fixed location, loading your pointer into the SP, and using push and pop to dereference the pointer :)
<oriansj>muurkha: nope but it would make things prettier
<muurkha>no, I mean you can't implement the semantics of C if you can't take the addresses of local variables, or if local variables aren't allocated dynamically to permit recursion
<muurkha>it's not just a matter of making things prettier
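
The requirement in two C functions: each recursive activation needs its own copy of a local whose address can be taken, so locals cannot all sit at fixed memory addresses.

    /* Why C locals need dynamically allocated frames. */
    #include <stdio.h>

    static void count_down(int n)
    {
        int local = n;
        int *p = &local;            /* taking a local's address must work */
        if (n > 0) count_down(n - 1);
        printf("%d\n", *p);         /* each frame kept its own copy */
    }

    int main(void)
    {
        count_down(3);              /* prints 0, 1, 2, 3: one per frame */
        return 0;
    }
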
<oriansj>muurkha: hence load-pc would be reading a memory address
<johnjaye>i had no idea people still new about the burroughs 5000 or those other machines
<muurkha>load-pc is just indirect jump, that's a different issue
<muurkha>johnjaye: I don't know much about it, I've just read a book is all
<oriansj>muurkha: we could simulate 32/16bit support in functions
<oriansj>so we wouldn't needtfgbvhrdjeugjeklerrhfkrdirigidbjbvnkifdlcgfg
<muurkha>hello tiny orians offspring
<oriansj>yep
<muurkha>welcome to the world
<oriansj>so yeah, only an 8bit accumulator would be needed
<muurkha>yeah, you don't need a bigger accumulator to have bigger pointers
<oriansj>and we could just do the sweet-16 register approach for the C code
<muurkha>yes, but only if you have a CPU instruction that is capable of reading from a 24-bit address
<muurkha>so you need some 24-bit pointer register
<oriansj>nope
<muurkha>(or writing to a 24-bit address)
<muurkha>that is, if your local variables on the stack can be at arbitrary 24-bit addresses
<oriansj>you modify the LOAD instruction in memory to point to the local variable
<oriansj>as you can either do load instructions that cover the whole address space or do pointer register
<muurkha>that does work if the LOAD instruction has a 24-bit address field in it! but that just means that your 24+-bit register in the CPU is the current-instruction register
<oriansj>or if we wanted to get clever: 24bit stack and PC, 16bit accumulator and 8bit page register
<muurkha>yeah, that would work
<muurkha>the current-instruction register is not an architectural register, in the sense that you can't access its contents with instructions, but it still costs transistors per bit
<muurkha>hmm, we should probably do some experiments rather than just talking
<oriansj>well PC relative operations are not essential for C but knowing the stack pointer might be
<muurkha>yeah, PC-relative operations are useful for position-independent code
<muurkha>which is nice for shared libraries
<oriansj>so push page register and pop page register and copy stack pointer to accumulator and page register might be required
<muurkha>even the PDP-8's tightwad version of PC-relative memory access is enough for relocating shared libraries
<oriansj>well if we wanted things to be as simple as possible for assembly programmers we would have 4 registers R0 (accumulator), R1 (index), R2 (stack) and R3 (PC)
<oriansj>and so all operations such as pushing onto the stack and loading relative to the PC are the same as loading from the accumulator
<oriansj>the RCA 1802 was 5,000 transistors
<oriansj>had a set of sixteen registers of 16 bits each
<oriansj>and that was C2L CMOS technology
<oriansj>ran C and Basic
<oriansj>so assuming 6 transistors per register bit: 16 registers * 16 bits * 6 = 1536
<oriansj>1536 transistors for the registers
<oriansj>and we could get by with 1/4th of that
<oriansj>we absolutely don't need the I/O logic either
<muurkha>yeah, 5000 is doable, that's bigger than the 6502
<muurkha>or the Intersil 6100 PDP-8-on-a-chip