***pi1 is now known as johnjaye
<oriansj>now the 4/8bit processor seems to be limited to 8 words of DATA RAM (4/8 bytes) and a 7-bit address space doesn't leave much room to do much but 127 instructions might be enough to do hex0
<johnjaye>is there a good academic paper on bootstrappability?
<oriansj>johnjaye: in general or do you mean specific cases (like say bootstrapping the OCaml language)?
<johnjaye>i suppose in general. like what are the costs and benefits of bootstrapping a compiler vs compiling the compiler with the compiler.
<oriansj>johnjaye: ummm we use compilers to bootstrap compilers; we just write the base compiler in assembly
<johnjaye>i thought bootstrapping had some kind of theoretical properties like that.
<oriansj>and yes there are a good deal of theoretical properties for bootstrapping but in practical terms, ultimately you either bootstrap a language in a high-level language or in assembly
<oriansj>So if you have a high-level compiler you can trust, it is usually much easier just to use that
<oriansj>but if you have no high-level compiler you trust, there really isn't any other option but to build a compiler written in assembly
<oriansj>but that interpreter has to be written in either a high-level language or in assembly
<oriansj>(Schemes tend to be written in C and FORTHs tend to be written in assembly)
<stikonas>johnjaye: sadly there isn't that much theory in bootstrapping. It's actually fairly simple
<stikonas>you can maybe see some connections to self-referential paradoxes in some places
<stikonas>e.g. let's say you build sha256sum. On its own it can't be used to prove that it's not a malicious binary
<johnjaye>can you bootstrap with a compiler written in p-code or IR?
<stikonas>I guess you can but the compilers we have are non-optimizing
<oriansj>johnjaye: and what exactly do you think will be running that p-code or IR?
<stikonas>M2-Planet (the simplest subset-of-C compiler that is still written in C) and the slightly simpler cc_x86 (an even smaller subset-of-C compiler written in assembly) are very simple compilers; they just go over the source (single-pass) and basically directly spit assembly out
<stikonas>(they do keep a bit of state to deal with local variables, etc. but not much)
<johnjaye>meaning there's not much to be gained from optimizing it? i'm not sure i get it
<stikonas>meaning that optimizations are not important for bootstrapping
<stikonas>you use your compiler once to build something more complicated
<stikonas>compiler optimizations only matter if you use compiler program a lot
<oriansj>johnjaye: does 5 minutes seem like a huge problem that would benefit by shaving off seconds?
<stikonas>and most of the slowdown actually comes from the interpreter (mes)
<stikonas>oriansj: was it something like 250k lines per second?
<oriansj>johnjaye: yes when self-hosted; it is 2Mloc/second when built by GCC
<stikonas>basically the bootstrap goes something like hex0 -> hex1 -> hex2 (linker) -> M0 (assembler) -> cc_x86 (C compiler) -> M2-Planet (C compiler) -> mes (scheme interpreter) -> mescc (C compiler) -> tcc (C compiler)
<stikonas>if you run this chain everything is quite fast (compilation takes maybe a second) except for interpreted code
<johnjaye>why is that a chain of multiple assemblers and multiple compilers
<johnjaye>shouldn't it be compiler makes a compiler?
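A minimal sketch of the single-pass style stikonas describes: scan the source once and spit assembly text out directly, with almost no retained state. This is illustrative only, not M2-Planet's or cc_x86's actual code; the x86-flavoured mnemonics and the tiny `N(+|-)N...` input language are assumptions made for the example.

```c
#include <ctype.h>
#include <stdio.h>

/* Consume a decimal integer at *p, advancing the cursor. */
static int scan_int(const char **p)
{
    int n = 0;
    while (isdigit((unsigned char)**p)) {
        n = n * 10 + (**p - '0');
        (*p)++;
    }
    return n;
}

/* Single-pass "compiler" for expressions like "1+2-3": emit one
 * assembly line per token group as we go, never building an AST.
 * Returns the number of instructions emitted. */
static int emit_expr(const char *src, char *out, size_t outsz)
{
    size_t used = 0;
    int lines = 1;
    used += (size_t)snprintf(out + used, outsz - used,
                             "mov eax, %d\n", scan_int(&src));
    while (*src == '+' || *src == '-') {
        char op = *src++;
        used += (size_t)snprintf(out + used, outsz - used, "%s eax, %d\n",
                                 op == '+' ? "add" : "sub", scan_int(&src));
        lines++;
    }
    return lines;
}
```

The point of the sketch is the shape, not the language coverage: each construct is translated the moment it is seen, which is why such compilers can be tiny and why optimization never enters the picture.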
<oriansj>johnjaye: short version: mescc got to building TCC before M2-Mesoplanet could
<stikonas>oriansj: I think johnjaye is asking why this chain has both assemblers and compilers
<oriansj>Elapsed (wall clock) time (h:mm:ss or m:ss): 1:35.29 from zero to mescc
<stikonas>basically we want to build a compiler from the smallest possible binary seed
<stikonas>you can't build it without any seed at all
<oriansj>johnjaye: to the question of why the complexity, it has to do with minimizing development and review efforts
<stikonas>and a large binary seed (e.g. 1 MiB) can't be easily inspected by a human
<stikonas>yes, hex code is hard to inspect and write
<stikonas>once you have hex1, it can do some of the calculations for you and it becomes easier
<stikonas>hex2 is a slightly improved version of hex1, removes some of the limitations
<oriansj>but each is easier to work in than the previous step
<stikonas>well, once we have M2-Planet we compile loads of other tools
<stikonas>after M0, programming is basically assembly programming but we only need to write 1 program in it (cc_*)
<stikonas>cc_* builds its cross-platform variant in C (M2-Planet)
<stikonas>and after that it's just normal programming
<stikonas>although in a somewhat restricted subset of C
<stikonas>but perfectly readable for any C programmer
<stikonas>and everything can run automated and build GCC
<johnjaye>is this all about guix or is it a separate project?
<stikonas>by starting with 2 small binaries (357 byte hex0 and 757 byte kaem-minimal shell)
<stikonas>it's not just guix although guix does that too
<oriansj>johnjaye: well it is bigger than Guix but yes many guix people are also here and it did start on #guix
<stikonas>but we have scripts that are completely independent of guix
<stikonas>and some other people also tried writing their own bootstrap chains
<oriansj>and we encourage people to create any bootstrap chain they trust
<johnjaye>i heard about it there yes. it seems to me the main reason you would want bootstrapping is for that whole build-everything-declaratively thing?
<stikonas>Guix (which starts with hex0 and a static guile binary)
<stikonas>and live-bootstrap (which starts with hex0 and kaem-minimal)
<stikonas>well, declarative building is mostly a guix thing
<stikonas>live-bootstrap focusses more on not allowing any pre-generated files (e.g. configure scripts are banned, as are pre-built bison parsers)
<stikonas>and also does reproducibility a bit better than guix
<stikonas>though guix tries to be quite reproducible too
<oriansj>and declarative building requires a declarative language (like Guile)
<stikonas>well, in Guix you have packages written in Guile
<stikonas>live-bootstrap is basically just a collection of shell scripts
<oriansj>so bootstrapping is very much what you feel comfortable with
<stikonas>but it's self-contained and does not need a big interpreter like guile
<vagrantc>does it actually use features of "bash" specifically, as opposed to "sh"?
<oriansj>Guix is comfortable with the static guile binary and live-bootstrap sticks to the bootstrap-seeds
<stikonas>vagrantc: before bash is built it's just a list of commands (kaem scripts)
<vagrantc>stikonas: i assumed you waited till it was built to use it :)
<vagrantc>i'm not sure "guix is comfortable" so much as that is as far as it has gotten so far
<oriansj>vagrantc: bashisms are a far smaller issue if the only shells you have are bash and kaem
<stikonas>kaem can't run POSIX-compatible sh files anyway
<vagrantc>guix is a project that started out with a large (but at least known) set of binary seeds and has been chipping away at them...
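For a feel of why the 357-byte hex0 seed mentioned above is auditable at all: conceptually it does little more than turn hand-written hex pairs into raw bytes. The toy below mirrors that idea in C; the comment characters (`#` and `;`) and every other detail are illustrative assumptions, not the real stage0 tool.

```c
#include <stddef.h>

/* Map an ASCII hex digit to its value, or -1 for anything else. */
static int hexval(int c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

/* Toy hex0-style translator: pair up hex digits into bytes, skipping
 * whitespace and '#'/';' line comments.  Returns bytes written. */
static size_t hex0_translate(const char *src, unsigned char *out)
{
    size_t n = 0;
    int hi = -1;                          /* pending high nibble, if any */
    for (; *src; src++) {
        if (*src == '#' || *src == ';')   /* comment: skip to end of line */
            while (*src && *src != '\n')
                src++;
        if (!*src)
            break;
        int v = hexval(*src);
        if (v < 0)
            continue;                     /* whitespace and other noise */
        if (hi < 0)
            hi = v;
        else {
            out[n++] = (unsigned char)(hi << 4 | v);
            hi = -1;
        }
    }
    return n;
}
```

Because the whole job is this mechanical, a few hundred bytes of machine code suffice, and a reviewer can check the seed byte by byte.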
<stikonas>though it now has aliases and if/else/fi statements
<stikonas>yeah, and live-bootstrap started vice versa
<oriansj>vagrantc: well if guix wanted to get away from that Guile binary, they would have to move their bootstrap out of guix itself
<stikonas>it started with a small seed and was building on top of that
<vagrantc>oriansj: or just port guix to mes's scheme interpreter :)
<stikonas>vagrantc: that's probably easier said than done
<stikonas>I think even gash is very experimental on mes
<oriansj>vagrantc: at that point mes.c would be functionally the same as guile; which would be freaking cool
<muurkha>oriansj: I'm curious whether it's more broadly applicable than the 4004; have you read the paper? I haven't yet
<muurkha>hmm, 8 words of data RAM and a 7-bit address space sounds too small
<muurkha>I haven't found a way to build a usable CPU in under about 2500 transistors
<oriansj>muurkha: the 4004 supported a 12-bit address space
<oriansj>so aside from being cheap as heck, it is worse than the 4004 in every technical way
<muurkha>johnjaye: yeah, you could write a P-code interpreter in assembly, then write a compiler in P-code. arguably that's what you're doing when you program in Forth :)
<johnjaye>most obvious ideas have probably been thought of and done. it's just hard to figure out what the project is named
<muurkha>22:45 < stikonas> compiler optimizations only matter if you use compiler program a lot
<muurkha>I think this should say "compiler optimizations only matter if you use the compiled program a lot"
<stikonas>muurkha: yes, though I was thinking the compiled program is the next compiler
<oriansj>muurkha: well the RISC-II had only 40,760 transistors
<oriansj>of which 26,496 were used for the 138 registers
<johnjaye>anyway i was just curious about the organizational relationships. all these terms are really new and strange sounding. mes. m2. ccx86.
<oriansj>(assuming 32 bits at 6 transistors per register bit)
<muurkha>oriansj: I think my working definition of "usable CPU" is one that can run a high-level-language interpreter
<muurkha>oriansj: yeah, I said 2500, not 25000
<oriansj>johnjaye: can you be more precise in what you are looking for?
<stikonas>johnjaye: it's just a name of some programs
<stikonas>sometimes there is some acronym behind it but in the end any name is just a name
<johnjaye>kaem is the world's worst build manager. i see
<muurkha>oriansj: the bit-serial version of Calculus Vaporis should weigh in around 2500 transistors
<johnjaye>well gcc is just a name sure. but it's also a huge part of the software ecosystem and millions of lines of code (guessing)
<muurkha>there's a graffiti artist near here who makes these big elaborate mural-sized tags all over town that say "KAEM"
<muurkha>those 2500 transistors don't include RAM, just registers and logic
<stikonas>M2 is kind of a sequence. We have M0 (assembler written per arch), then there is M1 (cross-platform assembler written in C) and M2-Planet (C compiler in C)
<muurkha>yeah, Calculus Vaporis is about 600 IIRC
<stikonas>though maybe oriansj has a better explanation for M2
<muurkha>like the PDP-8, it also has a 12-bit address space (of 12-bit words) and so plausibly could run something like a BASIC interpreter, though I haven't written one for it and might be disappointed if its instruction set turns out to be too bad at density
<stikonas>johnjaye: cc_x86 is even simpler (cc is C compiler)
<stikonas>johnjaye: mes is probably the most curious naming: Maxwell's Equations of Software
<oriansj>muurkha: actually the PDP-8 instruction set could be simplified without being hard to program
<muurkha>johnjaye: yeah, GCC is millions of lines of code
<stikonas>johnjaye: that's because the mes interpreter can run the mescc compiler that can build mes
<stikonas>somewhat similar to how the electromagnetic field behaves
<muurkha>a non-bit-serial version of Calculus Vaporis is more like 4500 transistors
<oriansj>one could remove the rotate instructions from the PDP-8 and byte swap; not to mention the multiply and divide instructions
<oriansj>the subtract/skip instruction would be simpler as a jump-if-zero instruction
<oriansj>dropping from 12 bits to 8 bits should simplify a few things as well
<oriansj>just increase the PC from 12 bits to 16 bits
<oriansj>using an expanded instruction encoding could also simplify decoding
<muurkha>I don't think the PDP-8/S had multiply and divide, did it?
<muurkha>dropping the instruction word, the memory word size, and/or the address bus width from 12 to 8 bits would indeed reduce the number of transistors needed, but it would probably also drop below being a usable CPU
<muurkha>you could switch from being word-oriented to being byte-oriented though
<oriansj>an 8-bit accumulator and 16-bit address space is enough for a C compiler
<muurkha>lots of examples of that, I have one on the desk in front of me right now in fact
<oriansj>add, sub, xor, nand, jump if zero, LOAD, LOADI8, STORE would be sufficient for easy programming
<muurkha>I know about the group-3 instructions, I just don't think the PDP-8/S had the MQ register and thus the relevant instructions, did it?
<muurkha>WP says, "The EAE was an option on the original PDP-8,[17] the 8/I,[18] and the 8/E, but it is an integral part of the Intersil 6100 microprocessor."
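The instruction list oriansj proposes above (add, sub, xor, nand, jump if zero, LOAD, LOADI8, STORE) is small enough to sketch as a toy emulator. Everything about the encoding below is invented for illustration: three bytes per instruction (opcode, then a 16-bit address, with LOADI8's immediate carried in the low address byte), plus a HALT that the proposal doesn't name.

```c
#include <stdint.h>

/* Invented opcode numbering for the 8-instruction machine sketched
 * in the discussion; HALT is added so programs can stop. */
enum { ADD, SUB, XOR, NAND, JZ, LOAD, LOADI8, STORE, HALT };

/* 8-bit accumulator, 16-bit PC, 64 KiB of byte-addressed memory. */
typedef struct {
    uint8_t  acc;
    uint16_t pc;
    uint8_t  mem[1 << 16];
} cpu;

static void run(cpu *c)
{
    for (;;) {
        uint8_t  op = c->mem[c->pc];
        uint16_t a  = (uint16_t)(c->mem[c->pc + 1] << 8 | c->mem[c->pc + 2]);
        c->pc += 3;
        switch (op) {
        case ADD:    c->acc += c->mem[a];                       break;
        case SUB:    c->acc -= c->mem[a];                       break;
        case XOR:    c->acc ^= c->mem[a];                       break;
        case NAND:   c->acc  = (uint8_t)~(c->acc & c->mem[a]);  break;
        case JZ:     if (c->acc == 0) c->pc = a;                break;
        case LOAD:   c->acc  = c->mem[a];                       break;
        case LOADI8: c->acc  = (uint8_t)a;  /* immediate */     break;
        case STORE:  c->mem[a] = c->acc;                        break;
        default:     return;                /* HALT / undefined */
        }
    }
}
```

Even this much shows the trade-off in the conversation: every operation funnels through the single accumulator, so programs are longer, but the decode and datapath stay tiny.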
<muurkha>you don't really need XOR as an instruction
<muurkha>you probably do, however, need some way to compare integers for ordering (≤, ≥)
<muurkha>I guess that for an 8-bit subtraction result you could NAND with 0x80 to preserve only the sign bit, then compare the result to 0
<muurkha>and that would give you a way to compare integers for ordering
<muurkha>so you'd have to NAND with 0xFF or 0x7F to map either 0x7F or 0xFF to zero, and then you could do it
<stikonas>well, you almost never need any specific instruction, as is shown by one-instruction ISAs
<oriansj>muurkha: well cmp with greater and less bits being set is an optimization. Only subtraction is needed
<oriansj>jump if negative is a cheap optimization one can do as well
<muurkha>add is also an optimization from that POV, which is why Calculus Vaporis doesn't have it
<oriansj>jump if positive exists also in knight
<muurkha>stikonas: yeah, I mean that in practice the dynamic number of XORs is so low that evaluating XORs with a NAND instruction will have minimal performance impact
<muurkha>(because the static number of XORs is also low)
<stikonas>yes, for riscv64 in stage0-posix we have 6 uses in hex1 and 6 in hex2
<oriansj>jmp.nz (not zero), jmp.z (zero), jmp.n (negative), jmp.p (positive) would be quite easy to work with
<muurkha>not true, I think, of sign comparisons, addition, or right shift
<stikonas>and for other non-word arches we might have even fewer xors
<muurkha>if you remove those I think you get a significant slowdown
<oriansj>a single bit shift would probably eat a good few transistors
<muurkha>indexed store is also a pretty useful optimization, and was omitted from the 8080
<oriansj>but that would cost additional registers
<muurkha>no, by indexed store I mean store r0 [r1+8]
<oriansj>it'll only save a single instruction
<muurkha>no, you can do without it, but it's expensive
<muurkha>yeah, but it saves a single instruction almost every single time you access memory :)
<oriansj>well if you look at my knight code: you'll see it isn't that big of a benefit
<muurkha>it's certainly possible to write code without it
<muurkha>I mean you can store most variables at fixed memory locations
<muurkha>but that gets painful anytime there's more than one of something
<oriansj>well I do load r0 [r0] a good bit (because ->next is always at offset zero there)
<oriansj>4 registers would really simplify things
<muurkha>also I think call and return instructions are pretty useful
<oriansj>they do simplify a great many things
<muurkha>if you don't have a return instruction you really need a way to jmp to a register value, otherwise you end up having to use self-modifying code for switch, return, and ad-hoc polymorphism (function pointers)
<muurkha>your list of add, sub, xor, nand, jump if zero, LOAD, LOADI8, STORE also omits immediates and 8-bit stores. which, yeah, you can do without, but again it's awkward
<oriansj>LOADI8 is loading of an 8-bit immediate
<oriansj>and we are only loading 8 bits and storing 8 bits
<oriansj>if I were to do a 16-bit ALU then yes you would want LOAD, LOAD8, STORE, STORE8 and LOADI16
<oriansj>LOAD8U (possibly, if you wanted to not sign-extend)
<muurkha>the 6502 had an 8-bit accumulator and two 8-bit pointers
<muurkha>it would have benefited a lot from an 8-bit zero-page pointer (which was added on the 65816) and a PC-page addressing mode
<oriansj>if one wanted to be extra easy to work with: add a 16-bit stack pointer (as an 8-bit stack is bleh) and have push, pop, call and return be the only things that impact it, with the default value being 0xFFFFFFFF
<oriansj>muurkha: yeah, the 6809 did that idea to optimize for smaller code size
<muurkha>you can do without 8-bit loads on a machine with a more-than-8-bit accumulator if you support unaligned access (as you normally would if you had an 8-bit data bus)
<muurkha>the 6502 does okay with an 8-bit stack but it isn't really usable as a C stack
<oriansj>ACCUMULATOR in 0x00000000, PC in 0x00000002 and Stack in 0xFFFFFFFD
<oriansj>yeah, you can't do a C compiler using the 6502 hardware stack and instead have to manually manage your own, which is a huge pain
<muurkha>yeah, the PDP-10, PDP-X, and I think the PDP-8 mapped their architectural registers into the memory space
<johnjaye>i wonder if the pdp8/pdp11 were the earliest processors that could actually have a language like C on them
<muurkha>no, the PDP-8 didn't come out until 01965
<muurkha>there were lots of processors from the 01950s that would have been adequate for C and a few that would have been fine
<oriansj>well an 8-bit accumulator, 16-bit PC and 16-bit stack would only cost 240 transistors (assuming 6 transistors per bit)
<muurkha>that doesn't count the transistors to increment and decrement them, or the muxes to put those values onto different buses
<oriansj>true but that is what I used to estimate the transistor counts used in the RISC-II register file
<oriansj>so if I am under-estimating here, the savings also would be larger there
<muurkha>johnjaye: things that make C hard to implement include tagged memory (Burroughs 5000), non-byte-addressable memory (most scientific machines of the 01950s up to the 01990s and also the Nova and Tandem, though C implementations on TI DSPs solved this problem by simply making their chars 32 bits wide), a lack of bitwise operations (all BCD computers including most "business" machines from the 01950s), and not enough memory space (the HP 9100A, the GreenArrays F18A)
<oriansj>but if the 6502 is any indicator, with 3,510 transistors the majority of which are in the decode logic
<muurkha>that doesn't count the depletion-mode transistors
<muurkha>the 6502 or 8080 or PDP-8 isn't hard to implement C on, it's just that C code on them runs slowly
<muurkha>yeah, the MonSter 6502 uses resistors for some of the depletion loads, I think
<muurkha>in theory a sufficiently smart C compiler could get C code to run fast on them, though I don't think such a thing has been written
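muurkha's earlier point that XOR and ordering comparisons can be synthesized from NAND is easy to check directly. A minimal sketch, assuming 8-bit values and (for the comparison) a subtraction that doesn't overflow:

```c
typedef unsigned char u8;

static u8 nand8(u8 a, u8 b) { return (u8)~(a & b); }

/* The classic four-NAND construction of XOR. */
static u8 xor8(u8 a, u8 b)
{
    u8 t = nand8(a, b);
    return nand8(nand8(a, t), nand8(b, t));
}

/* a < b for signed 8-bit values, assuming a - b does not overflow:
 * NANDing the difference with 0x80 yields 0x7F when the sign bit is
 * set and 0xFF when it is clear, so "!= 0xFF" means "negative". */
static int less8(signed char a, signed char b)
{
    u8 diff = (u8)(a - b);
    return nand8(diff, 0x80) != 0xFF;
}
```

This is the software shape of the hardware argument above: a dedicated XOR saves a couple of dynamic instructions, which matters little when XOR is statically rare, whereas sign tests sit on nearly every comparison.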
<muurkha>but it's a different level of performance problem than not having bitwise operations in your CPU because your bytes range from 0 to 99
<muurkha>or not being able to byte-address half of your memory space because it's 128 KiB and your registers are only 16 bits (I think that was Tandem)
<oriansj>well we could extend the PC and stack to 24 bits which would be more than enough to run an operating system and C compiler
<oriansj>adding overflow (for add) and borrow (for subtract) bits; jump.o (jump if overflow) and jump.b (jump if borrow) would enable efficient multibyte arithmetic
<muurkha>if you were doing it in CMOS instead of NMOS you'd probably need 7000 transistors or so
<muurkha>oriansj: C requires you to be able to take the addresses of local variables, so if your stack pointer is 24 bits then pointer variables in C also need to be 24 bits
<muurkha>which likely means that you need at least one other 24-bit register that isn't the PC or SP. although maybe if you don't have interrupts, or they run on a separate stack, you could get away with saving the SP in RAM at a fixed location, loading your pointer into the SP, and using push and pop to dereference the pointer :)
<oriansj>muurkha: nope but it would make things prettier
<muurkha>no, I mean you can't implement the semantics of C if you can't take the addresses of local variables, or if local variables aren't allocated dynamically to permit recursion
<muurkha>it's not just a matter of making things prettier
<oriansj>muurkha: hence load-pc would be reading a memory address
<johnjaye>i had no idea people still knew about the burroughs 5000 or those other machines
<muurkha>load-pc is just an indirect jump, that's a different issue
<muurkha>johnjaye: I don't know much about it, I've just read a book is all
<oriansj>muurkha: we could simulate 32/16-bit support in functions
<oriansj>so we wouldn't needtfgbvhrdjeugjeklerrhfkrdirigidbjbvnkifdlcgfg
<oriansj>so yeah, only an 8-bit accumulator would be needed
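The payoff of the overflow/borrow bits oriansj mentions shows up when wider arithmetic is composed from 8-bit pieces: a carry out of the low byte is exactly what a jump.o-style flag would expose. A sketch in C, detecting carry the way flagless code must, by noticing that the 8-bit sum wrapped below one of its inputs:

```c
#include <stdint.h>

/* 16-bit addition built from two 8-bit additions.  With a real carry
 * flag and jump.o, the "lo < xl" test would be a single conditional
 * jump instead of a comparison. */
static uint16_t add16_via_8(uint16_t x, uint16_t y)
{
    uint8_t xl = (uint8_t)(x & 0xFF), xh = (uint8_t)(x >> 8);
    uint8_t yl = (uint8_t)(y & 0xFF), yh = (uint8_t)(y >> 8);

    uint8_t lo    = (uint8_t)(xl + yl);
    uint8_t carry = lo < xl;              /* wrapped => carry into high byte */
    uint8_t hi    = (uint8_t)(xh + yh + carry);

    return (uint16_t)(hi << 8 | lo);
}
```

The same chaining extends to 24-bit pointers on an 8-bit accumulator: each extra byte of width costs one more add-with-carry step.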
<muurkha>yeah, you don't need a bigger accumulator to have bigger pointers
<oriansj>and we could just do the SWEET16 register approach for the C code
<muurkha>yes, but only if you have a CPU instruction that is capable of reading from a 24-bit address
<muurkha>so you need some 24-bit pointer register
<muurkha>that is, if your local variables on the stack can be at arbitrary 24-bit addresses
<oriansj>you modify the LOAD instruction in memory to point to the local variable
<oriansj>as you can either do load instructions that cover the whole address space or do a pointer register
<muurkha>that does work if the LOAD instruction has a 24-bit address field in it! but that just means that your 24+-bit register in the CPU is the current-instruction register
<oriansj>or if we wanted to get clever: 24-bit stack and PC, 16-bit accumulator and 8-bit page register
<muurkha>the current-instruction register is not an architectural register, in the sense that you can't access its contents with instructions, but it still costs transistors per bit
<muurkha>hmm, we should probably do some experiments rather than just talking
<oriansj>well PC-relative operations are not essential for C but knowing the stack pointer might be
<muurkha>yeah, PC-relative operations are useful for position-independent code
<oriansj>so push page register and pop page register and copy stack pointer to accumulator and page register might be required
<muurkha>even the PDP-8's tightwad version of PC-relative memory access is enough for relocating shared libraries
<oriansj>well if we wanted things to be as simple as possible for assembly programmers we would have 4 registers: R0 (accumulator), R1 (index), R2 (stack) and R3 (PC)
<oriansj>and so all operations such as pushing onto the stack and loading relative to the PC are the same as loading from the accumulator
<oriansj>SWEET16 had a set of sixteen registers of 16 bits each
<oriansj>so assuming 6 transistors per register bit: 1,536 transistors
<oriansj>we absolutely don't need the I/O logic either
<muurkha>yeah, 5000 is doable, that's bigger than the 6502
<muurkha>or the Intersil 6100 PDP-8-on-a-chip