IRC channel logs

2023-09-19.log


<muurkha>but if I'm not mistaken it wasn't until the ARM7 (ARMv3) that they had halfword instructions: https://3dodev.com/_media/documentation/hardware/arm60_datasheet_-_gec_plessey_semiconductors.pdf
<muurkha>the ARM6 didn't
<oriansj>muurkha: you are probably right; 16-bit values tend to be more legacy than absolutely needed for anything interesting.
<muurkha>it didn't have signed byte loads either
<oriansj>well if you treat only chars as bytes, then you can get away with that
<muurkha>I don't understand
<muurkha>what do you mean by "chars"?
<oriansj>char in the C sense
<muurkha>it would be pretty inconvenient to implement a compliant C on the ARM where C char wasn't a byte
<muurkha>because C defines char to be the basic unit of sizeof and some other things
<muurkha> https://en.wikipedia.org/wiki/Transistor_count says the ARM700 shipped in 01994, so until that point no ARM had 16-bit instructions or signed byte loads
<muurkha>so the Newton, the Acorn RiscPCs, the original StrongARM, were all without 16-bit loads and stores, and without signed byte loads
<oriansj>well zero extension can be achieved with AND 0x0000FFFF and sign extension can be achieved with signed bit shifts
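A minimal C sketch of those fallbacks, assuming a little-endian machine with only aligned 32-bit loads; load_u16 and load_s16 are illustrative names, not anything from M2libc:

    #include <stdint.h>

    /* Emulating halfword loads on a core that only has 32-bit loads:
       zero extension with an AND mask, sign extension with shifts.
       Assumes little-endian byte order and an aligned word array. */
    static uint32_t load_u16(const uint32_t *base, uint32_t byte_offset)
    {
        uint32_t word = base[byte_offset >> 2];           /* aligned 32-bit load   */
        uint32_t half = word >> ((byte_offset & 2) * 8);  /* pick low or high half */
        return half & 0x0000FFFF;                         /* zero-extend via AND   */
    }

    static int32_t load_s16(const uint32_t *base, uint32_t byte_offset)
    {
        uint32_t half = load_u16(base, byte_offset);
        return (int32_t)(half << 16) >> 16;   /* sign-extend via shifts
                                                 (arithmetic right shift assumed) */
    }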
<muurkha>multiply, by contrast, was added in the very first ARM chip to ship to customers, ARM2, because you can do multiply a *lot* faster in hardware than in software
<oriansj>well yes a multiply loop takes a good bit more than 3-6 clock cycles
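For comparison, the software fallback is roughly this shift-and-add loop, on the order of 32 iterations in the worst case versus a handful of cycles in hardware:

    #include <stdint.h>

    /* Software multiply: classic shift-and-add, one iteration per bit
       of the multiplier that remains. */
    static uint32_t soft_mul(uint32_t a, uint32_t b)
    {
        uint32_t product = 0;
        while (b != 0) {
            if (b & 1)
                product += a;   /* add the shifted multiplicand for each set bit */
            a <<= 1;
            b >>= 1;
        }
        return product;
    }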
<muurkha>I don't think they ever shipped a divide instruction, did they?
<muurkha>for integers
<oriansj>not until armv8
<muurkha>so in no 32-bit ARM ever?
<oriansj>correct, only 64bit ARM ever got divide
<oriansj>it is why you see in M2libc/armv7l/libc-core.M1 :divide, :divides, :modulus and :moduluss
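A rough C equivalent of what an unsigned divide/modulus helper has to do on such a core; this is a restoring-division sketch for illustration, not the actual M1 code behind those labels:

    #include <stdint.h>

    /* Restoring division, one quotient bit per iteration; a software
       stand-in for the missing divide instruction.  d == 0 yields a
       garbage quotient, as a software routine has to pick something. */
    static uint32_t soft_divu(uint32_t n, uint32_t d, uint32_t *remainder)
    {
        uint32_t quotient = 0;
        uint32_t rem = 0;
        for (int i = 31; i >= 0; i--) {
            rem = (rem << 1) | ((n >> i) & 1);  /* bring down the next dividend bit */
            quotient <<= 1;
            if (rem >= d) {                     /* subtract the divisor when it fits */
                rem -= d;
                quotient |= 1;
            }
        }
        if (remainder)
            *remainder = rem;                   /* modulus falls out for free */
        return quotient;
    }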
<muurkha>in a lot of cases if you're doing 16-bit operations you don't need to AND 0xffff, you can just compute with 16 extra garbage bits
<muurkha>I was noodling on minimalistic bootstrapping instruction sets last night
<oriansj>true that is a valid optimization for those cases (especially if your compiler can predict that those bits are lost or cleared before they are looked at)
<oriansj>(as they may impact program states such as whether the register is zero)
<muurkha>I came up with a 3-address code with 16 instructions, 32 32-bit general-purpose registers, and one addressing mode, with pc, sp, and lr in GPRs
<muurkha>li16, addi, srli, load32, store32, load8, store8, bne, bgeu, jalr, add, sub, bic, mul, mulhu, and jal
<muurkha>32-bit instructions
<muurkha>thinking it might be worthwhile to split li16 into movt and movw
<muurkha>it feels like something you can implement in C in an afternoon, though I haven't tried yet
<muurkha>and get sort of reasonable performance
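Something like the following decode-and-dispatch skeleton is presumably what "an afternoon in C" looks like; the field layout here (4-bit opcode, 5-bit register fields, 16-bit immediate) is an assumed encoding for illustration, since only the opcode list and register count were given:

    #include <stdint.h>
    #include <string.h>

    /* Decode-and-dispatch skeleton for the proposed 16-opcode, 32-register ISA. */
    enum op { LI16, ADDI, SRLI, LOAD32, STORE32, LOAD8, STORE8, BNE,
              BGEU, JALR, ADD, SUB, BIC, MUL, MULHU, JAL };

    #define PC 31                              /* pc lives in a GPR */

    static uint32_t reg[32];
    static uint8_t  mem[1 << 20];

    static void step(void)
    {
        uint32_t insn;
        memcpy(&insn, &mem[reg[PC]], 4);       /* little-endian fetch assumed */
        uint32_t op  = insn >> 28;             /* bits 31..28: opcode         */
        uint32_t rd  = (insn >> 23) & 31;      /* bits 27..23                 */
        uint32_t rs1 = (insn >> 18) & 31;      /* bits 22..18                 */
        uint32_t rs2 = (insn >> 13) & 31;      /* bits 17..13                 */
        uint32_t imm = insn & 0xFFFF;          /* bits 15..0, immediate forms only */

        reg[PC] += 4;
        switch (op) {
        case LI16:  reg[rd] = imm;                                  break;
        case ADDI:  reg[rd] = reg[rs1] + imm;  /* zero-extended here */ break;
        case ADD:   reg[rd] = reg[rs1] + reg[rs2];                  break;
        case SUB:   reg[rd] = reg[rs1] - reg[rs2];                  break;
        case BIC:   reg[rd] = reg[rs1] & ~reg[rs2];                 break;
        case MUL:   reg[rd] = reg[rs1] * reg[rs2];                  break;
        case MULHU: reg[rd] = (uint32_t)(((uint64_t)reg[rs1] * reg[rs2]) >> 32); break;
        /* loads, stores, branches, srli, jal, jalr elided */
        }
    }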
<muurkha>if you want to compile it to native code you probably want an ifence instruction, and for most architectures the compiler will have to compile pc-destination instructions separately
<muurkha>occurs to me that it might be worthwhile to replace srli with rori
<muurkha>what do you think?
<oriansj>muurkha: I'd truncate to 16 registers (as it maps better to hex)
<muurkha>using an entire byte of the instruction encoding to specify each register operand
<oriansj>(so 256 registers like MMIX ?)
<oriansj>but yeah that instruction set has everything one would need for bootstrapping except a syscall instruction. and memory-mapped I/O would definitely add complexity relative to a simple lookup table for the syscalls
<muurkha>just 32 registers; I don't want to make context switches unreasonably slow
<oriansj>I only need 6 registers (if PC, stack and LR are in the general register set) [or just 3 if not counting those]
<muurkha>yeah, and 16 are enough for most things, but occasionally a bit cramped
<muurkha>yeah, for system calls it needs an ecall instruction, a kernel-mode bit, a way for kernel-mode code to set up trap handlers to invoke the desired kernel-mode code, a watchdog timer that traps to kernel mode when the timer expires, and some kind of memory protection
<oriansj>muurkha: in a physical machine yes but in a bytecode VM, nope
<muurkha>the absolute minimum reasonable memory protection would be base+bounds registers
<muurkha>sure, if the VM only has to support user code, you can eliminate all the stuff that's invisible in user mode
<oriansj>indeed
<muurkha>there are a couple of rationales for including the protection mechanisms though
<muurkha>1. you need them if you want to run it on an FPGA or something
<muurkha>2. if they're designed to be properly virtualizable (unlike i386 or ARM!) user-mode code can use them recursively to confine untrusted code of its own
<muurkha>at a very modest additional cost; it makes your kernel-mode code responsible for emulating the hardware protection facilities in software when the user-mode code tries to examine or twiddle kernel-mode state
<muurkha>(trap vectors, watchdog state, the kernel-mode bit, and the base+bounds registers)
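A sketch of how small that base+bounds mechanism is in a VM: one comparison and one addition per access, bypassed when the kernel-mode bit is set. The structure and field names are invented for illustration:

    #include <stdint.h>

    struct prot_state {
        uint32_t base;     /* physical start of the user region   */
        uint32_t bound;    /* length of the user region, in bytes */
        int      kernel;   /* the kernel-mode bit                 */
    };

    /* Returns 0 and fills *phys on success, -1 when the access should trap. */
    static int translate(const struct prot_state *p, uint32_t addr,
                         uint32_t size, uint32_t *phys)
    {
        if (p->kernel) {                       /* kernel mode: untranslated */
            *phys = addr;
            return 0;
        }
        if (addr + size > p->bound || addr + size < addr)
            return -1;                         /* out of bounds or wrapped  */
        *phys = p->base + addr;
        return 0;
    }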
<oriansj>true but if the goal is a sweet-16-like architecture for making bootstrapping easier on low-memory hardware, then no, it wouldn't provide much value
<muurkha>what do you mean by "low memory"?
<oriansj>say 16KB of RAM
<muurkha>that sounds like something where you want virtual memory, which I haven't proposed; base+bounds isn't a powerful enough memory protection mechanism for that
<muurkha>although if you have an ecall instruction you could certainly add system calls for something like Forth's BLOCK and UPDATE
<oriansj>I was planning on adding in-page logic to every load/store instruction so we can leverage the block loading idea you shared earlier
<muurkha>well, the point of the block loading idea is that you can avoid adding in-page logic to every load/store instruction
<oriansj>it'll be slow but then we wouldn't need a hardware MMU and it would be transparent to the running code
<muurkha>because they're "bank-switching" existing blocks of a larger virtual memory into "windows" or "buffers" in the small physical memory
<muurkha>well. "physical", heh.
<muurkha>I think that if you're doing something on a 16KiB machine you probably want to use a compact variable-length bytecode instead of a fixed-size 32-bit instruction set
<muurkha>maybe not for your inner loops but for most of your logic
<muurkha>I was thinking of this for machines that are 1-5 orders of magnitude bigger than that
<oriansj>even with 4-6 byte instructions one would only need 16,386 bytes of machine instructions to get to M2-Planet
<oriansj>but yes the Pineapple RISC-V EEPROM machine does have 512KB
<muurkha>yeah. more generally something like the ARM6 is about 10,000 gates
<muurkha>so I think it's useful to have a way to program such small machines
<muurkha>hooking up a 10,000-gate machine to 10,000 bits of SRAM should be a reasonable thing to do, but that's only 2.5 KiB
<oriansj>well not if the external storage has 4KB block size
<muurkha>an interesting question is what it would look like if our chosen switching technology happened to be much faster, or slower
<stikonas>2.5 KiB might be enough to run hex1->hex2 step
<oriansj>it is enough for M0-compact (1696 bytes)
<muurkha>you can access an external storage with 4KB blocks with less RAM than that
<oriansj>muurkha: well either the top 3KB of each block is wasted or your block device has configurable block load sizes.
<muurkha>you can reload the same block multiple times, ignoring different parts each time. like subscribing to a stock ticker or Ethernet
<theruran>the Pineapple-ONE looks really nice. might have to be geared more toward bootstrapping though
<oriansj>muurkha: well yes but that is the configurable block load sizes bit
<oriansj>as you would be giving the logical block address, the target memory address, the amount to load/store and the offset inside of the block.
<muurkha>oriansj: that mechanism is largely the Unix approach; it is both more complex to implement and more difficult to use than the FORTH approach
<muurkha>if you had 2.5 KiB of RAM you'd probably want to use a smaller block size than the traditional 1024 bytes. when Chuck Moore got old and started losing his vision, he switched to 256-byte blocks
<muurkha>16 lines by 16 columns
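A sketch of that reload-and-ignore trick: the device streams a full 4 KiB block on every read, and the host keeps only one 256-byte window per pass, discarding the rest, like picking one message out of a broadcast. The device helpers here are placeholders:

    #include <stdint.h>

    #define DEV_BLOCK 4096
    #define WINDOW     256

    extern void    start_block_read(uint32_t lba);
    extern uint8_t read_block_byte(void);     /* next byte of the streamed block */

    void read_window(uint32_t lba, uint32_t offset, uint8_t out[WINDOW])
    {
        start_block_read(lba);
        for (uint32_t i = 0; i < DEV_BLOCK; i++) {
            uint8_t b = read_block_byte();
            if (i >= offset && i < offset + WINDOW)
                out[i - offset] = b;          /* keep only the wanted window */
        }
    }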
<oriansj>muurkha: I was thinking more on the hardware level. in the SCSI protocol the TRANSFER LENGTH field in the READ(6) command only covers the number of blocks read (starting at the specified LBA and counting up), and the drive capacity data has a BLOCK LENGTH IN BYTES field, which on some drives is fixed at 4096 or 512 bytes. In the 4KB case, the drive will only read whole 4KB blocks into memory, and on a 2.5KB memory system, most
<oriansj>of that is going to be written into null space. But yes, on drives that allow setting that parameter, setting it to 256 or 512 bytes is definitely viable for paging data in and out with a basic read/write interface: transfer into this block of memory the contents of this logical block address on disk.
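For reference, the READ(6) command descriptor block itself shows why the transfer granularity is whole device blocks: the LBA is 21 bits and TRANSFER LENGTH counts blocks, not bytes:

    #include <stdint.h>

    /* Build a SCSI READ(6) CDB. */
    void build_read6(uint8_t cdb[6], uint32_t lba, uint8_t num_blocks)
    {
        cdb[0] = 0x08;                    /* READ(6) opcode               */
        cdb[1] = (lba >> 16) & 0x1F;      /* LBA bits 20..16              */
        cdb[2] = (lba >> 8) & 0xFF;       /* LBA bits 15..8               */
        cdb[3] = lba & 0xFF;              /* LBA bits 7..0                */
        cdb[4] = num_blocks;              /* TRANSFER LENGTH, 0 means 256 */
        cdb[5] = 0x00;                    /* CONTROL                      */
    }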
<muurkha>oriansj: oh, that makes sense. typically at the level of SDRAM or Flash you send a series of commands to the device, and some of those commands result in it asserting words on the data bus
<muurkha>in both DRAM and Flash, reads normally work by transferring a whole row of the memory matrix to an SRAM buffer, followed by muxing some of that buffer onto the data bus
<oriansj>usually done by an 8051 if I remember correctly (with ARM M0 cores becoming more popular)
<muurkha>I think it is usually done by hardwired logic
<oriansj> https://www.bunniestudios.com/blog/?p=918 perhaps depends if there is an internal controller
<oriansj>or external drive controller
<oriansj>this one goes a little deeper: https://www.bunniestudios.com/blog/?p=898
<oriansj>reading the K9GAG08B0M datasheet, it appears that it only reads 4KB pages and writes 4KB pages
<muurkha>I'm not talking about SD cards, which do indeed need an internal controller; I'm talking about Flash memory chips
<muurkha>yes, like the K9GAG08B0M
<muurkha>I haven't actually implemented this yet
<muurkha>but note that it has a timing diagram for "two-plane page read operation with two-plane random data out"
<muurkha>to me this means that once you've read the page into the internal SRAM buffer you can read random words from the SRAM buffer, but I'm not 100% confident in my understanding
<muurkha>I'm pretty sure you can also read the entire page one byte after another from the SRAM buffer without any intervening commands
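That matches the usual large-page NAND command sequence, sketched below: 00h/30h copies a page from the array into the chip's internal SRAM buffer, sequential reads then stream bytes out of that buffer, and 05h/E0h ("random data output") re-points the column within it. The bus helpers are placeholders and the address cycle counts should be checked against the K9GAG08B0M datasheet:

    #include <stdint.h>

    extern void    nand_cmd(uint8_t cmd);
    extern void    nand_addr(uint8_t a);
    extern void    nand_wait_ready(void);
    extern uint8_t nand_data(void);

    void nand_read_page(uint32_t row, uint16_t col, uint8_t *dst, uint32_t len)
    {
        nand_cmd(0x00);                   /* page read setup                   */
        nand_addr(col & 0xFF);            /* column address, two cycles        */
        nand_addr((col >> 8) & 0xFF);
        nand_addr(row & 0xFF);            /* row (page) address, three cycles  */
        nand_addr((row >> 8) & 0xFF);
        nand_addr((row >> 16) & 0xFF);
        nand_cmd(0x30);                   /* confirm: array -> SRAM buffer     */
        nand_wait_ready();
        while (len--)
            *dst++ = nand_data();         /* bytes stream out of the buffer    */
    }

    void nand_random_out(uint16_t col)    /* reposition within the loaded page */
    {
        nand_cmd(0x05);
        nand_addr(col & 0xFF);
        nand_addr((col >> 8) & 0xFF);
        nand_cmd(0xE0);
    }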