IRC channel logs

2022-06-16.log


***saksophony_ is now known as saksophony
***lh_ is now known as lh
***alethkit_ is now known as alethkit
***achaninja_ is now known as achaninja
***sm2n_ is now known as sm2n
***unmatched-paren_ is now known as unmatched-paren
***pi1 is now known as johnjaye
<oriansj>now the 4/8-bit processor seems to be limited to 8 words of DATA RAM (4/8 bytes) and a 7-bit address space doesn't leave much room to do much, but 127 instructions might be enough to do hex0
<johnjaye>is there a good academic paper on bootstrappability?
<oriansj>johnjaye: in general or you mean specific cases (like say bootstrapping the ocaml language)
<johnjaye>i suppose in general. like what are the costs and benefits of bootstrapping a compiler vs compiling the compiler with the compiler.
<oriansj>johnjaye: ummm we use compilers to bootstrap compilers; we just write the base compiler in assembly
<johnjaye>oh ok.
<johnjaye>i thought bootstrapping had some kind of theoretical properties like that.
<oriansj>for example here is a C compiler written in RISC-V assembly: https://github.com/oriansj/stage0-posix-riscv32/blob/master/cc_riscv32.M1
<oriansj>and yes there are a good deal of theoretical properties for bootstrapping but in practical terms, ultimately you either bootstrap a language in a high level language or in assembly
<oriansj>So if you have a high level compiler you can trust, it is usually much easier just to use that
<oriansj>but if you have no high level compiler you trust, there really isn't any other option but to build a compiler written in assembly
<oriansj>(or high level interpreter)
<oriansj>as one could write a compiler in a high level interpreted language: (such as scheme) https://github.com/oriansj/mes-m2
<oriansj>but that interpreter has to be written in either a high level language or in assembly
<oriansj>(schemes tend to be written in C and FORTHs tend to be written in Assembly)
<stikonas>johnjaye: sadly there isn't that much theory in bootstrapping. It's actually fairly simple
<johnjaye>i see. well fair enough.
<stikonas>you can maybe see some connections to self-referential paradoxes in some places
<stikonas>e.g. let's say you build sha256sum. On its own it can't be used to prove that it's not a malicious binary
<johnjaye>can you bootstrap with a compiler written in pcode or IR?
<stikonas>I guess you can but the compilers we have are non-optimizing
<stikonas>so there isn't really any IR or pcode
<oriansj>johnjaye: and what exactly do you think will be running that pcode or IR?
<stikonas>and yes, that too
<stikonas>M2-Planet (the simplest subset of C compiler that is still written in C) and slightly simpler cc_x86 (even smaller subset of C compiler written in assembly) are very simple compilers, they just go over the source (single-pass) and basically directly spit assembly out
<stikonas>(they do keep a bit of state to deal with local variables, etc but not much)
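
To make the single-pass idea concrete, here is a minimal sketch of that style of compiler in C. It is a hypothetical illustration, not actual M2-Planet or cc_x86 code: a toy compiler for sums of integer literals that prints x86-style assembly the moment each token is consumed, with no IR in between.

    /* Toy single-pass compiler: parse and emit in the same breath.
     * Hypothetical sketch, far simpler than M2-Planet or cc_x86. */
    #include <stdio.h>
    #include <ctype.h>

    static const char *src;            /* cursor into the source text */

    static void skip_spaces(void) { while (isspace((unsigned char)*src)) src++; }

    static void compile_primary(void)  /* integer literal -> eax */
    {
        int n = 0;
        skip_spaces();
        while (isdigit((unsigned char)*src)) n = n * 10 + (*src++ - '0');
        printf("mov eax, %d\n", n);
    }

    /* expr := primary { '+' primary } ; assembly is emitted directly,
     * no syntax tree or intermediate representation is ever built */
    static void compile_expr(void)
    {
        compile_primary();
        skip_spaces();
        while (*src == '+') {
            src++;
            printf("push eax\n");      /* save the left operand */
            compile_primary();
            printf("pop ebx\n");
            printf("add eax, ebx\n");  /* eax = left + right */
            skip_spaces();
        }
    }

    int main(void)
    {
        src = "1 + 2 + 3";
        compile_expr();                /* prints the assembly for 1+2+3 */
        return 0;
    }
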
<johnjaye>meaning there's not much to be gained from optimizing it? i'm not sure i get it
<stikonas>meaning that optimizations are not important for bootstrapping
<stikonas>you use your compiler once to build something more complicated
<stikonas>compiler optimizations only matter if you use compiler program a lot
<oriansj>johnjaye: does 5 minutes seem like a huge problem that would benefit by shaving off seconds?
<stikonas>and most of the slowdown actually comes from interpreter (mes)
<stikonas>M2-Planet builds really fast
<stikonas>oriansj: was it something like 250k lines per second?
<oriansj>stikonas: yes
<johnjaye>hmm i see
<johnjaye>wait is lines per second a real metric?
<oriansj>that was the worst case performance
<oriansj>johnjaye: yes when self-hosted; it is 2Mloc/second when built by GCC
<stikonas>basically bootstrap goes something like hex0->hex1->hex2 (linker)->M0 (assembler)->cc_x86 (C compiler)-> M2-Planet (C compiler) -> mes (scheme interpreter) -> mescc (C compiler) -> tcc (C compiler)
<stikonas>if you run this chain everything is quite fast (compilation takes maybe a second) except for interpreted code
<stikonas>building tcc takes maybe 8 minutes
<johnjaye>why is that a chain of multiple assemblers and multiple compilers
<johnjaye>shouldn't it be compiler makes a compiler?
<stikonas>how do you get 1st compiler?
<stikonas>you write it in assembly
<johnjaye>right
<oriansj>johnjaye: short version: mescc got to building TCC before M2-Mesoplanet could
<stikonas>oriansj: I think johnjaye is asking why this chain has both assemblers and compilers
<oriansj>Elapsed (wall clock) time (h:mm:ss or m:ss): 1:35.29 from zero to mescc
<stikonas>(and that's including extra tools)
<stikonas>basically we want to build compiler from the smallest possible binary seed
<stikonas>you can't build it without any seed at all
<oriansj>johnjaye: to the question of why the complexity, it has to do with minimizing development and review efforts
<stikonas>and a large binary seed (e.g. 1 MiB) can't be easily inspected by a human
<oriansj> but a few hundred bytes can be
<stikonas>yes, hex code is hard to inspect and write
<stikonas>once you have hex1, it can do some of the calculations for you and it becomes easier
<oriansj>so we do it as little as possible
<stikonas>still somewhat tedious
<stikonas>hex2 is a slightly improved version of hex1 that removes some of the limitations
<oriansj>but each is easier to work in than the previous step
<oriansj>and more powerful
<oriansj>tools are made with each step
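
For a sense of how little the hex0 step has to do, the whole job can be sketched in a few lines of C (the real seed is a few hundred bytes of machine code, not C): skip '#' and ';' comments, turn each pair of hex digits into one output byte, ignore everything else.

    /* Sketch of a hex0 "assembler" in C for readability.
     * '#' and ';' begin comments; hex digit pairs become bytes. */
    #include <stdio.h>

    static int hex(int c)
    {
        if (c >= '0' && c <= '9') return c - '0';
        if (c >= 'a' && c <= 'f') return c - 'a' + 10;
        if (c >= 'A' && c <= 'F') return c - 'A' + 10;
        return -1;
    }

    int main(void)
    {
        int c, high = -1;
        while ((c = getchar()) != EOF) {
            if (c == '#' || c == ';') {      /* comment: skip to end of line */
                while ((c = getchar()) != EOF && c != '\n') ;
                continue;
            }
            int v = hex(c);
            if (v < 0) continue;             /* non-hex characters are ignored */
            if (high < 0) high = v;          /* first nibble of the byte */
            else { putchar(high << 4 | v); high = -1; }
        }
        return 0;
    }
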
<stikonas>well, once we have M2-Planet we compile loads of other tools
<stikonas>since it's already C compiler
<stikonas>after M0 programming is basically assembly programming but we only need to write 1 program in it (cc_*)
<johnjaye>i see
<stikonas>cc_* builds its cross-platform variant in C (M2-Planet)
<stikonas>and after that it's just normal programming
<stikonas>although in somewhat restricted subset of C
<stikonas>but perfectly readable for any C programmer
<oriansj>with standard C libraries
<stikonas>yes, e.g. untar https://github.com/oriansj/mescc-tools-extra/blob/master/untar.c
<stikonas>and everything can run automated and build GCC
<johnjaye>is this all about guix or is it a separate project?
<stikonas>by starting with 2 small binaries (357 byte hex0 and 757 byte kaem-minimal shell)
<stikonas>it's not just guix although guix does that too
<oriansj>johnjaye: well it is bigger than Guix but yes many guix people are also here and it did start on #guix
<stikonas>but we have scripts that are completely independent of guix
<oriansj>and the work is integrated in Guix
<johnjaye>oh ok
<stikonas>and some other people also tried writing their own bootstrap chains
<oriansj>and we encourage people to create any bootstrap chain they trust
<stikonas>but it's mostly 2 implementations
<johnjaye>i heard about it there yes. it seems to me the main reason you would want bootstrapping is for that whole build everything declaratively thing?
<stikonas>Guix (which starts with hex0 and static guile binary)
<stikonas>and live-bootstrap (which starts with hex0 and kaem-minimal)
<stikonas>well, declarative building is mostly guix thing
<stikonas>live-bootstrap focusses more on not allowing any pre-generated files (e.g. configure scripts are banned, or e.g. pre-built bison parsers)
<stikonas>and also does reproducibility a bit better than guix
<stikonas>though guix tries to be quite reproducible too
<oriansj>and declarative building requires a declarative language (like Guile)
<stikonas>well, in Guix you have packages written in Guile
<stikonas>live-bootstrap is basically just some collection of shell scripts
<stikonas>mostly bash
<oriansj>so bootstrapping is very much what you feel comfortable with
<stikonas>but it's self-contained and does not need a big interpreter like guile
<vagrantc>does it actually use features of "bash" specifically, as opposed to "sh" ?
<stikonas>vagrantc: yes, plenty
<stikonas>well, once bash is built
<oriansj>Guix is comfortable with the static guile binary and live-bootstrap sticks to the bootstrap-seeds
<stikonas>vagrantc: before bash is built it's just a list of commands (kaem scripts)
<vagrantc>stikonas: i assumed you waited till it was built to use it :)
<vagrantc>i'm not sure "guix is comfortable" so much as that is as far as it has gotten so far
<oriansj>vagrantc: bashisms are a far smaller issue if the only shells you have are bash and kaem
<stikonas>kaem can't run POSIX compatible sh files anyway
<stikonas>it's even more restricted
<vagrantc>guix is a project that started out with a large (but at least known) set of binary seeds and has been chipping away at them...
<stikonas>though it now has aliases and if else fi statements
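
A rough sketch in C of what a kaem-style shell amounts to (a hypothetical simplification, not the actual kaem-minimal source): one command per line, split on whitespace, run it, and abort the whole bootstrap if it fails.

    /* Minimal kaem-like command runner.  The real kaem-minimal is
     * more restricted still (no PATH search, explicit environment). */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/wait.h>

    int main(void)
    {
        char line[4096];
        while (fgets(line, sizeof line, stdin)) {
            line[strcspn(line, "\n")] = '\0';
            if (line[0] == '#' || line[0] == '\0') continue;  /* comment/blank */

            char *argv[64];
            int argc = 0;
            for (char *tok = strtok(line, " \t"); tok && argc < 63;
                 tok = strtok(NULL, " \t"))
                argv[argc++] = tok;
            argv[argc] = NULL;

            pid_t pid = fork();
            if (pid == 0) {
                execvp(argv[0], argv);       /* run the command */
                perror(argv[0]);
                _exit(127);
            }
            int status;
            waitpid(pid, &status, 0);
            if (!WIFEXITED(status) || WEXITSTATUS(status) != 0)
                return 1;                    /* abort on any failure */
        }
        return 0;
    }
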
<stikonas>yeah, and live-bootstrap started vice versa
<oriansj>vagrantc: well if guix wanted to get away from that Guile binary, they would have to move their bootstrap out of guix itself
<stikonas>it started with a small seed and was building on top of that
<vagrantc>stikonas: exactly
<vagrantc>oriansj: or just port guix to mes's scheme interpreter :)
<stikonas>vagrantc: that's probably easier said than done
<vagrantc>indeed :)
<stikonas>I think even gash is very experimental on mes
<oriansj>vagrantc: at that point mes.c would be functionality the same as guile; which would be freaking cool
<muurkha>oriansj: I'm curious whether it's more broadly applicable than the 4004; have you read the paper? I haven't yet
<muurkha>hmm, 8 words of data RAM and a 7-bit address space sounds too small
<muurkha>I haven't found a way to build a usable CPU in under about 2500 transistors
<oriansj>muurkha: the 4004 supported a 12 bit address space
<oriansj>so aside from being cheap as heck, it is worse than the 4004 in every technical way
<oriansj>muurkha: define usable CPU
<johnjaye>which paper?
<muurkha>johnjaye: yeah, you could write a P-code interpreter in assembly, then write a compiler in P-code. arguably that's what you're doing when you program in Forth :)
<johnjaye>haha ok
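
The P-code idea in miniature, sketched in C with invented opcodes: the dispatch loop is small enough to hand-write in assembly, and the bytecode it interprets could in principle be a compiler.

    /* Tiny stack-machine interpreter; opcodes are hypothetical. */
    #include <stdio.h>

    enum { OP_PUSH, OP_ADD, OP_PRINT, OP_HALT };

    int main(void)
    {
        /* "program": push 2, push 3, add, print, halt */
        int code[] = { OP_PUSH, 2, OP_PUSH, 3, OP_ADD, OP_PRINT, OP_HALT };
        int stack[256], sp = 0, pc = 0;

        for (;;) {
            switch (code[pc++]) {
            case OP_PUSH:  stack[sp++] = code[pc++]; break;
            case OP_ADD:   sp--; stack[sp - 1] += stack[sp]; break;
            case OP_PRINT: printf("%d\n", stack[sp - 1]); break;   /* 5 */
            case OP_HALT:  return 0;
            }
        }
    }
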
<johnjaye>most obvious ideas have probably been thought of and done. it's just hard to figure out what the project is named
<muurkha>22:45 < stikonas> compiler optimizations only matter if you use compiler program a lot
<oriansj>johnjaye: http://passat.crhc.illinois.edu/isca22_02.pdf
<muurkha>I think this should say "compiler optimizations only matter if you use the compiled program a lot"
<muurkha>and I'd add "much"
<muurkha>that is, they only matter much
<stikonas>muurkha: yes, though I was thinking compiled program is the next compiler
<oriansj>muurkha: well the RISC-II had only 40,760 transistors
<oriansj>of which 26,496 were used for the 138 registers
<johnjaye>anyway i was just curious about the organizational relationships. all these terms are really new and strange sounding. mes. m2. ccx86.
<oriansj>(assuming 32bits at 6 transistors per register bit)
<oriansj>so 40,760 - 26,496 = 14,264 transistors for everything else
<muurkha>oriansj: I think my working definition of "usable CPU" is one that can run a high-level-language interpreter
<muurkha>oriansj: yeah, I said 2500, not 25000
<oriansj>johnjaye: can you be more precise in what you are looking for?
<stikonas>johnjaye: it's just a name of some programs
<stikonas>johnjaye: just like gcc is just a name
<oriansj>muurkha: oooh that is much harder
<muurkha>stikonas: yes, agreed
<stikonas>sometimes there is some acronym behind but in the end any name is just a name
<johnjaye>kaem is the world's worst build manager. i see
<muurkha>oriansj: the bit-serial version of Calculus Vaporis should weigh in around 2500 transistors
<stikonas>kaem is anagram for make
<johnjaye>well gcc is just a name sure. but it's also a huge part of the software ecosystem and millions of lines of code (guessing)
<muurkha>there's a graffiti artist near here who makes these big elaborate mural-sized tags all over town that say "KAEM"
<muurkha>I should take some photos to share
<oriansj>muurkha: yes please
<muurkha>those 2500 transistors don't include RAM, just registers and logic
<oriansj>even the PDP-8/s is 519 logic gates
<stikonas>M2 is kind of a sequence. We have M0 (assembler written per arch), then there is M1 (cross-platform assembler written in C) and M2-Planet (C compiler in C)
<muurkha>yeah, Calculus Vaporis is about 600 IIRC
<stikonas>though maybe oriansj has better explanation for M2
<muurkha>like the PDP-8, it also has a 12-bit address space (of 12-bit words) and so plausibly could run something like a BASIC interpreter, though I haven't written one for it and might be disappointed if its instruction set turns out to be too bad at density
<stikonas>johnjaye: cc_x86 is even simpler (cc is c compiler)
<stikonas>johnjaye: mes is probably the most curious naming: Maxwell's Equations of Software
<oriansj>muurkha: actually the PDP-8 instruction set could be simplified without being hard to program
<muurkha>johnjaye: yeah, GCC is millions of lines of code
<stikonas>johnjaye: that's because mes interpreter can run mescc compiler that can build mes
<stikonas>so somewhat similar to how the electromagnetic field behaves
<muurkha>a non-bit-serial version of Calculus Vaporis is more like 4500 transistors
<oriansj>one could remove the rotate instructions from the PDP-8 and byte swap; not to mention the multiply and divide instructions
<oriansj>the subtract/skip instruction would be simpler as a jump if zero instruction
<oriansj>drop from 12bits to 8bits should simplify a few things as well
<oriansj>just increase the PC from 12bits to 16bits
<oriansj>an expanded instruction encoding could also be used to simplify decoding
<muurkha>I don't think the PDP-8/S had multiply and divide, did it?
<muurkha>dropping the instruction word, the memory word size, and/or the address bus width from 12 to 8 bits would indeed reduce the number of transistors needed, but it would probably also drop below being a usable CPU
<oriansj> https://en.wikipedia.org/wiki/PDP-8#Group_3
<muurkha>you could switch from being word-oriented to being byte-oriented though
<oriansj>8bit accumulator and 16bit address space is enough for a C compiler
<muurkha>yeah, for sure
<muurkha>lots of examples of that, I have one on the desk in front of me right now in fact
<oriansj>add, sub, xor, nand, jump if zero, LOAD, LOADI8, STORE would be sufficient for easy programming
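
That eight-instruction machine is small enough to sketch as a C emulator; the encoding below (an opcode byte followed by a little-endian 16-bit address, one immediate byte for LOADI8, and a halt marker) is invented for illustration.

    /* Emulator for the hypothetical 8-instruction, 8-bit-accumulator
     * machine: 16-bit PC, 64 KiB of byte-addressed RAM. */
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    enum { ADD, SUB, XOR, NAND, JZ, LOAD, LOADI8, STORE };

    static uint8_t ram[1 << 16];

    int main(void)
    {
        uint8_t acc = 0;
        uint16_t pc = 0;

        ram[0x0100] = 40;                        /* data the program reads */
        const uint8_t prog[] = {
            LOADI8, 2,                           /* acc = 2            */
            ADD,    0x00, 0x01,                  /* acc += ram[0x0100] */
            STORE,  0x01, 0x01,                  /* ram[0x0101] = acc  */
            0xFF,                                /* invented halt marker */
        };
        memcpy(ram, prog, sizeof prog);

        for (;;) {
            uint8_t op = ram[pc++];
            uint16_t addr;
            switch (op) {
            case LOADI8:
                acc = ram[pc++];
                break;
            case ADD: case SUB: case XOR: case NAND:
            case JZ: case LOAD: case STORE:
                addr = (uint16_t)(ram[pc] | ram[pc + 1] << 8);
                pc += 2;
                if (op == ADD)   acc += ram[addr];
                if (op == SUB)   acc -= ram[addr];
                if (op == XOR)   acc ^= ram[addr];
                if (op == NAND)  acc = (uint8_t)~(acc & ram[addr]);
                if (op == JZ && acc == 0) pc = addr;
                if (op == LOAD)  acc = ram[addr];
                if (op == STORE) ram[addr] = acc;
                break;
            default:                             /* halt */
                printf("ram[0x0101] = %d\n", ram[0x0101]);   /* 42 */
                return 0;
            }
        }
    }
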
<muurkha>I know about the group-3 instructions, I just dont think the PDP-8/S had the MQ register and thus the relevant instructions, did it?
<oriansj>muurkha: I don't know
<muurkha>did it have an EAE?
<muurkha>WP says, "The EAE was an option on the original PDP-8,[17] the 8/I,[18] and the 8/E, but it is an integral part of the Intersil 6100 microprocessor."
<muurkha>you don't really need XOR as an instruction
<muurkha>you probably do, however, need some way to compare integers for ordering (≤, ≥)
<muurkha>I guess that for an 8-bit subtraction result you could NAND with 0x80 to preserve only the sign bit, then compare the result to 0
<muurkha>and that would give you a way to compare integers for ordering
<muurkha>wait, not zero, 0xFF
<muurkha>so you'd have to NAND with 0xFF or 0x7F to map either 0x7F or 0xFF to zero, and then you could do it
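
Worked out in C, assuming a - b does not overflow: the first NAND isolates the sign bit (yielding 0x7F when negative, 0xFF when not), and a second NAND with 0xFF maps the non-negative case to zero, which a jump-if-zero can then test.

    /* Signed comparison from NAND alone; nand8() stands in for the
     * hypothetical CPU instruction.  Only valid when a - b fits in
     * a signed 8-bit result (no overflow). */
    #include <stdint.h>
    #include <stdio.h>

    static uint8_t nand8(uint8_t x, uint8_t y) { return (uint8_t)~(x & y); }

    static uint8_t is_negative(uint8_t x)
    {
        uint8_t t = nand8(x, 0x80);   /* 0x7F if sign set, 0xFF if clear */
        return nand8(t, 0xFF);        /* 0x80 if sign set, 0x00 if clear */
    }

    int main(void)
    {
        uint8_t a = 3, b = 5;
        uint8_t diff = (uint8_t)(a - b);              /* 0xFE, i.e. -2 */
        printf("a < b: %s\n", is_negative(diff) ? "yes" : "no");
        return 0;
    }
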
<stikonas>well, you almost never need any specific instruction, as is shown by one-instruction ISAs
<oriansj>muurkha: well cmp with greater and less bits being set is an optimization. Only subtraction is needed
<muurkha>oriansj: right
<oriansj>jump if negative is a cheap optimization one can do as well
<muurkha>add is also an optimization from that POV, which is why Calculus Vaporis doesn't have it
<muurkha>but it's an important one
<oriansj>jump if positive exists also in knight
<muurkha>stikonas: yeah, I mean that in practice the dynamic number of XORs is so low that evaluating XORs with a NAND instruction will have minimal performance impact
<muurkha>or code size impact
<muurkha>(because the static number of XORs is also low)
<stikonas>yes, for riscv64 in stage0-posix we have 6 uses in hex1 and 6 in hex2
<oriansj>jmp.nz (not zero), jmp.z (zero), jmp.n (negative), jmp.p (positive) would be quite easy to work with
<muurkha>not true, I think, of sign comparisons, addition, or right shift
<stikonas>(probably in same places)
<stikonas>and for other non-word arches we might have even fewer xors
<muurkha>if you remove those I think you get a significant slowdown
<oriansj>a single bit shift would probably eat a good few transistors
<muurkha>not transistors, no, just wires
<oriansj>muurkha: nice
<muurkha>hey man, I didn't do it
<muurkha>indexed store is also a pretty useful optimization, and was omitted from the 8080
<muurkha>(but not RV32I or even RV32E)
<oriansj>that would just be store r0 [r1]
<oriansj>but that would cost additional registers
<muurkha>no, by indexed store I mean store r0 [r1+8]
<oriansj>oh, no need for that at all
<oriansj>it'll only save a single instruction
<muurkha>no, you can do without it, but it's expensive
<muurkha>it's also missing on the PDP-8
<muurkha>yeah, but it saves a single instruction almost every single time you access memory :)
<oriansj>well if you look at my knight code: you'll see it isn't that big of a benefit
<muurkha>it's certainly possible to write code without it
<muurkha>I mean you can store most variables at fixed memory locations
<muurkha>but that gets painful anytime there's more than one of something
<oriansj>well I do load r0 [r0] a good bit (because ->next is always at offset zero there)
<muurkha>yeah
<oriansj>4 registers would really simplify things
<muurkha>also I think call and return instructions are pretty useful
<oriansj>they do simplify a great many things
<muurkha>if you don't have a return instruction you really need a way to jmp to a register value, otherwise you end up having to use self-modifying code for switch, return, and ad-hoc polymorphism (function pointers)
<oriansj>store-pc, load-pc would work
<muurkha>your list (add, sub, xor, nand, jump if zero, LOAD, LOADI8, STORE) also omits immediates and 8-bit stores. which, yeah, you can do without, but again it's awkward
<oriansj>LOADI8 is loading of an 8bit immediate
<oriansj>and we are only loading 8bits and storing 8bits
<oriansj>if I were to do a 16bit ALU then yes you would want LOAD, LOAD8, STORE, STORE8 and LOADI16
<oriansj>LOAD8U (possibly if you wanted to not sign extend)
<muurkha>oh, I misunderstood
<muurkha>agreed then
<muurkha>the 6502 had an 8-bit accumulator and two 8-bit pointers
<muurkha>it would have benefited a lot from an 8-bit zero-page pointer (which was added on the 65816) and a PC-page addressing mode
<oriansj>if one wanted to be extra easy to work with: add a 16bit stack pointer (as an 8bit stack is bleh) and push, pop, call and return be the only things that impacted it with the default value being 0xFFFFFFFF
<oriansj>muurkha: yeah the 6809 did that idea to optimize for smaller code size
<muurkha>you can do without 8-bit loads on a machine with a more-than-8-bit accumulator if you support unaligned access (as you normally would if you had an 8-bit data bus)
<oriansj>and we can cheat very hard too
<oriansj>put the registers in RAM itself
<muurkha>well, the 6502 kind of does
<muurkha>the zero page is the registers
<muurkha>the 6502 does okay with an 8-bit stack but it isn't really usable as a C stack
<oriansj>ACCUMULATOR in 0x00000000, PC in 0x00000002 and Stack in 0xFFFFFFFD
<oriansj>yeah you can't do a C compiler using the 6502 hardware stack and instead have to manually manage your own which is a huge pain
<muurkha>yeah, the PDP-10, PDP-X, and I think the PDP-8 mapped their architectural registers into the memory space
<johnjaye>i wonder if the pdp8/pdp11 were the earliest processors that could actually have a language like C on them
<muurkha>no, the PDP-8 didn't come out until 01965
<muurkha>there were lots of processors from the 01950s that would have been adequate for C and a few that would have been fine
<oriansj>well 8bit accumulator, 16bit PC and 16bit stack would only cost 240 transistors (assuming 6 transistors per bit)
<muurkha>that doesn't count the transistors to increment and decrement them, or the muxes to put those values onto different buses
<oriansj>true but that is what I used to estimate the transistor counts used in the RISC-II register file
<oriansj>so if I am under-estimating here, the savings also would be larger there
<muurkha>yeah
<muurkha>johnjaye: things that make C hard to implement include tagged memory (Burroughs 5000), non-byte-addressable memory (most scientific machines of the 01950s up to the 01990s and also the Nova and Tandem, though C implementations on TI DSPs solved this problem by simply making their chars 32 bits wide), a lack of bitwise operations (all BCD computers including most "business" machines from the
<muurkha>01950s), and not enough memory space (the HP 9100A, the GreenArrays F18A)
<oriansj>but if the 6502 is any indicator: 3,510 transistors, the majority of which are in the decode logic
<muurkha>that doesn't count the depletion-mode transistors
<muurkha>which bring the total to 4528
<oriansj> https://monster6502.com/
<muurkha>the 6502 or 8080 or PDP-8 isn't hard to implement C on, it's just that C code on them runs slowly
<oriansj>Total active transistors: 4237
<muurkha>yeah, the MonSter 6502 uses resistors for some of the depletion loads, I think
<johnjaye>oh ok
<muurkha>in theory a sufficiently smart C compiler could get C code to run fast on them, though I don't think such a thing has been written
<muurkha>but it's a different level of performance problem than not having bitwise operations in your CPU because your bytes range from 0 to 99
<muurkha>or not being able to byte-address half of your memory space because it's 128 KiB and your registers are only 16 bits (I think that was Tandem)
<oriansj> https://www.pagetable.com/?p=39
<oriansj>well we could extend the PC and stack to 24bits which would be more than enough to run an operating system and C compiler
<muurkha> http://www.righto.com/2013/01/a-small-part-of-6502-chip-explained.html is one citation for the 4528 number
<oriansj>adding overflow (for add) and borrow (for subtract) bits; jump.o (jump if overflow) and jump.b (jump if borrow) would enable efficient multibyte arithmetic
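
The multibyte arithmetic that bit buys, sketched in C: a 16-bit add performed as two 8-bit adds, with the unsigned comparison standing in for the hypothetical jump.o test of the carry bit.

    /* 0x01F0 + 0x0220 on an 8-bit accumulator: add the low bytes,
     * detect the carry, fold it into the high-byte add. */
    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        uint8_t a_lo = 0xF0, a_hi = 0x01;     /* a = 0x01F0 */
        uint8_t b_lo = 0x20, b_hi = 0x02;     /* b = 0x0220 */

        uint8_t lo = (uint8_t)(a_lo + b_lo);
        uint8_t carry = lo < a_lo;            /* the bit jump.o would test */
        uint8_t hi = (uint8_t)(a_hi + b_hi + carry);

        printf("sum = 0x%02X%02X\n", hi, lo); /* prints sum = 0x0410 */
        return 0;
    }
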
<muurkha>if you were doing it in CMOS instead of NMOS you'd probably need 7000 transistors or so
<muurkha>oriansj: C requires you to be able to take the addresses of local variables, so if your stack pointer is 24 bits then pointer variables in C also need to be 24 bits
<muurkha>which likely means that you need at least one other 24-bit register that isn't the PC or SP. although maybe if you don't have interrupts, or they run on a separate stack, you could get away with saving the SP in RAM at a fixed location, loading your pointer into the SP, and using push and pop to dereference the pointer :)
<oriansj>muurkha: nope but it would make things prettier
<muurkha>no, I mean you can't implement the semantics of C if you can't take the addresses of local variables, or if local variables aren't allocated dynamically to permit recursion
<muurkha>it's not just a matter of making things prettier
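
The requirement in two C functions: each recursive activation needs its own copy of a local whose address can be taken, so locals cannot all sit at fixed memory addresses.

    /* Why C locals need dynamically allocated frames. */
    #include <stdio.h>

    static void count_down(int n)
    {
        int local = n;
        int *p = &local;            /* taking a local's address must work */
        if (n > 0) count_down(n - 1);
        printf("%d\n", *p);         /* each frame kept its own copy */
    }

    int main(void)
    {
        count_down(3);              /* prints 0, 1, 2, 3: one per frame */
        return 0;
    }
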
<oriansj>muurkha: hence load-pc would be reading a memory address
<johnjaye>i had no idea people still new about the burroughs 5000 or those other machines
<muurkha>load-pc is just indirect jump, that's a different issue
<muurkha>johnjaye: I don't know much about it, I've just read a book is all
<oriansj>muurkha: we could simulate 32/16bit support in functions
<oriansj>so we wouldn't needtfgbvhrdjeugjeklerrhfkrdirigidbjbvnkifdlcgfg
<muurkha>hello tiny orians offspring
<oriansj>yep
<muurkha>welcome to the world
<oriansj>so yeah, only an 8bit accumulator would be needed
<muurkha>yeah, you don't need a bigger accumulator to have bigger pointers
<oriansj>and we could just do the sweet-16 register approach for the C code
<muurkha>yes, but only if you have a CPU instruction that is capable of reading from a 24-bit address
<muurkha>so you need some 24-bit pointer register
<oriansj>nope
<muurkha>(or writing to a 24-bit address)
<muurkha>that is, if your local variables on the stack can be at arbitrary 24-bit addresses
<oriansj>you modify the LOAD instruction in memory to point to the local variable
<oriansj>as you can either do load instructions that cover the whole address space or do pointer register
<muurkha>that does work if the LOAD instruction has a 24-bit address field in it! but that just means that your 24+-bit register in the CPU is the current-instruction register
<oriansj>or if we wanted to get clever: 24bit stack and PC, 16bit accumulator and 8bit page register
<muurkha>yeah, that would work
<muurkha>the current-instruction register is not an architectural register, in the sense that you can't access its contents with instructions, but it still costs transistors per bit
<muurkha>hmm, we should probably do some experiments rather than just talking
<oriansj>well PC relative operations are not essential for C but knowing the stack pointer might be
<muurkha>yeah, PC-relative operations are useful for position-independent code
<muurkha>which is nice for shared libraries
<oriansj>so push page register and pop page register and copy stack pointer to accumulator and page register might be required
<muurkha>even the PDP-8's tightwad version of PC-relative memory access is enough for relocating shared libraries
<oriansj>well if we wanted things to be as simple as possible for assembly programmers we would have 4 registers R0 (accumulator), R1 (index), R2 (stack) and R3 (PC)
<oriansj>and so all operations such as pushing onto the stack and loading relative to the PC are the same as loading from the accumulator
<oriansj>the RCA 1802 was 5,000 transistors
<oriansj>had a set of sixteen registers of 16 bits each
<oriansj>and that was C2L CMOS technology
<oriansj>ran C and Basic
<oriansj>so assuming 6 transistors per register bit: 16 registers * 16 bits * 6 = 1536
<oriansj>1536 transistors for the registers
<oriansj>and we could get by with 1/4th of that
<oriansj>we absolutely don't need the I/O logic either
<muurkha>yeah, 5000 is doable, that's bigger than the 6502
<muurkha>or the Intersil 6100 PDP-8-on-a-chip