IRC channel logs

2023-03-09.log

<muurkha>this project might be of interest to some folks in here: https://hackaday.io/project/177716-the-libre-autarkic-laptop
<fossy>rickmasters: sounds good, yeah i just did that quick and dirty, haven't done checksums and the like
<avih>oriansj: in https://github.com/oriansj/bootstrap-seeds/blob/master/README.md you give an example hex0 using sed and xxd, but xxd is not posix. i thought it could be useful to have a version which runs on a posix system without using a compiler, e.g. using shell and printf, like so: https://0x0.st/HibE.txt
<avih>so this is verifiable by comparison, as it runs on all shells i have access to with identical output (at least for builder-hex0.hex0). this was tested using several versions of dash, busybox ash, free/net bsd sh, openbsd sh, bash, ksh93, yash, gwsh, and more.
<avih>(and the ever popular mksh)
<stikonas[m]>Interesting. (hex0 in pure shell)
<avih>well, not strictly pure. it uses printf, which is not necessarily a shell builtin, but is posix regardless
<stikonas[m]>Well yes, though I think most shells have printf
<avih>yup
<avih>the notable ones without builtin printf are yash, mksh, and openbsd sh (based on pdksh)
<avih>no, actually yash does have a builtin printf
<avih>compiling builder-hex0.hex0 takes 100-700 ms with the various shells with builtin printf, and about 6s on those without builtin printf
<avih>(invoking printf 3500 times can be slow with an external binary...)
<avih>that's a buffering version, for comparison https://0x0.st/HicN.txt
<stikonas[m]>Well, hex0 itself does no buffering...
<avih>it doesn't need to
<oriansj>muurkha: I do like the idea of a solar powered laptop; certainly the idea of an energy efficient e-ink screened laptop, having a computer that one can fully understand and empowering everyone in the world to take control of their computing. I just have doubts that this is just another overly complex idea not sufficiently explored before being shared with the world.
<oriansj>avih: thank you, I will certainly include that ^_^
<muurkha>oriansj: it'll be interesting to see how it works out
<avih>oriansj: do you want me to open a PR? if yes, let me know which project/files (e.g. hex0-posix might be more suitable, etc)
<oriansj>avih: I was thinking of creating a folder called alternatives in bootstrap-seeds for hex0 implementations in other languages
<avih>sounds good to me
<oriansj>so create a PR and include a line with how to use it
<oriansj>I'll merge it
<avih>so here https://github.com/oriansj/bootstrap-seeds another folder "alternatives" alongside NATIVE/POSIX/ETC ?
<avih>etc*
<avih>oriansj: i'm adding a readme with this title "This firectory contains various implementations of the `hex0` monitor" and was wondering whether "monitor" is the correct term to use? to me it's more like a compiler...
<stikonas[m]>avih: hex0 monitor is slightly different thing
<stikonas[m]>hex0 monitor would read characters from some input device (e.g.keyboard)
<stikonas[m]>And output bin file
<avih>"slightly different" than what?
<AwesomeAdam54321>stikonas[m]: Why not call it a hex0 interpreter instead of a monitor?
<avih>i think it's a compiler. it takes input in source form (text) and outputs a binary object.
<AwesomeAdam54321>but an interactive compiler
<avih>?
<avih>also, it's not an interpreter. an interpreter executes lines of source code one after another. hex0 does not execute anything. it only translates text to binary
<AwesomeAdam54321>avih: It's a compiler, but rather than taking the contents of a text file as input, the input is given through an interactive interface
<AwesomeAdam54321>like from a keyboard
<avih>where the input comes from (keyboard or a source file) does not affect what it does with that input. and what it does is translate the source input into binary output. which, to me, is a compiler.
<avih>i'm not arguing that "monitor" is bad terminology, only that it's not obvious to me why it's used.
<stikonas[m]>Yeah, it's not obvious
<stikonas[m]>It might be used for baremetal bootstrap
<stikonas[m]>For POSIX we don't need it
<avih>anyway, PR is open here https://github.com/oriansj/bootstrap-seeds/pull/37
<gforce_de1977>avih: at least interesting! - as far as I know: printf is a builtin and 'test' or '[' is an external command - but that's not important - and: export LC_ALL=C is needed why?
<gforce_de1977>avih: also better use: printf '%s' "$var"
<gforce_de1977>avih: once i had a similar idea, but yours is better: https://github.com/bittorf/GNU-mes-documentation-attempt/blob/main/step00/hex0-to-binary-debug.sh
<oriansj>avih: looks great, thank you . merged
<avih>gforce_de1977: LC_ALL is needed because otherwise [0-9a-fA-F] depends on the collating locale. printf %s "$var" will not work, because "$buf" is a sequence of octal values in the form of "\nnn", and we don't want to print that literally, but rather let printf interpret the "\nnn" as octal value
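To illustrate the "\nnn" point above: each hex pair is turned into a three-digit octal escape, and printf only emits the raw byte if that escape sits in its format string (with %s it would be printed literally). A minimal sketch in Python rather than shell; `octal_escape` is a hypothetical helper name, not something taken from hex0.sh.

```python
def octal_escape(hex_pair: str) -> str:
    """Return the printf-style octal escape for one two-digit hex value."""
    # int(..., 16) parses the pair; %03o renders it as three octal digits.
    return "\\%03o" % int(hex_pair, 16)


# The text a shell script would accumulate and later hand to printf as a format:
buf = "".join(octal_escape(pair) for pair in ["6A", "01", "FF"])
print(buf)
```

The printed text is the escape sequence itself, not the bytes; it is printf's interpretation of the format that produces the binary output.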
<gforce_de1977>avih: understand! - great
<avih>as for whether printf or test or [ are builtin, posix doesn't mandate that any of them be a builtin, but in most shells they are.
<gforce_de1977>avih: yes - thats wild somehow
<avih>oriansj: cheers :)
<oriansj>AwesomeAdam54321: well one can do hex0 /dev/stdin $output and it would work but the code wouldn't execute until you did ctrl-d./$output\n
<stikonas[m]>[ is also built-in
<stikonas[m]>Any shell that doesn't have [ would be fairly slow
<oriansj>avih: I made a small tweak to make debugging easier: https://paste.debian.net/1273504/ and I noticed a small bug if one does: ./hex0.sh < ../POSIX/x86/hex0_x86.hex0 >| foo
<avih>sec, checking
<avih>oriansj: i'm noticing few changes: 1. style (spaces to tabs, etc) - that's fine. 2. you report errors but go on anyway - i think that's a bug, but depends on the specification. is there a spec for hex0 source? 3. you use echo -e. that's a bug because -e is not portable. also, $line may contain chars which don't print well, or which modify the terminal state etc. this can be discussed. 4. the "exit 1" line is not unreachable, because everything which is
<avih> not hex pair now goes through the echo -e ... ; continue code, skipping the exit 1 line
<avih>not reachable*
<avih>not sure i understand the bug you refer to
<avih>(>| is a bash thing which overrides the noclobber option. not sure how that relates to the bug behavior, if at all
<avih>also, /dev/stderr is not portable. if you want to "echo" arbitrary string to stderr, the portable way is: printf %s\\n "$string" >&2
<oriansj>avih: sorry for not being clear
<avih>no worries, but do clarify :)
<oriansj>the change was just to highlight the lines which would result in exit 1 conditions in the original code
<avih>yes, i get that part
<oriansj>and I was pointing that those lines in POSIX/x86/hex0_x86.hex0 were causing an issue
<avih>wait, you mean that it does not process that file correctly?
<oriansj>indeed
<avih>huh, interesting. is that a bug in the file? or in the script?
<oriansj>well if one assumes one *must* have spaces between bytes in hex0 then it would be a bug in the file but as that isn't in the hex0 spec, it would be a minor defect in the script for not covering that edge case
<avih>oh, i see. i didn't understand the spec i think. it has 4-digits hex values
<avih>i thought the spec was only two digits
<oriansj>yes 2-digits hex per byte
<avih>but then the output from 4 digits depends on the endianness of the platform
<oriansj>not exactly
<avih>well, then the source is an invalid hex0 source, e.g. this line https://github.com/oriansj/bootstrap-seeds/blob/master/POSIX/x86/hex0_x86.hex0#L68
<oriansj>as 01 23 and 0123 would both mean the same thing in hex0
<avih>where is that specified?
<oriansj>in the hex0 not being whitespace sensitive
<avih>but also, in that case does that mean that in 0123, it always writes the 01 byte before the 23 byte, regardless of the system endianness? I'd think that's confusing...
<avih>because it looks like the hex value 0x0123, and the representation on stream/memory of such value does depend on endianness
<oriansj>well yes as the spec is reading 2 hex digits and outputting a byte in left to right order and not depending on any host endianness so that all architectures will always produce the exact same output with the same hex0
<avih>right
<avih>i also see "Hex0 is trivial to implement [1] It just needs to read 2 hex nybbles and output a byte, you can ignore all non-hex characters"
<oriansj>so 0123456789 is the same for both big and little endian systems as it would be the bytes: 01 23 45 67 89 if one wanted to white space separate them
<avih>does that mean that 0y1z2r3q should also be interpreted as if it was 01 23 ?
<oriansj>and 0 1 2 3 4 5 6 7 8 9 would also produce the same value
<oriansj>indeed
<oriansj>let me check in the equivalent C code for reference
<avih>ok, so i completely misunderstood the spec. off topic, i think it should be as strict as possible, to be able to catch source errors, but as i said, off topic.
<avih>anyway, i'll post a fixup PR soon, ok?
<oriansj>avih: I agree that strict implementations should exist
<avih>well, no, the implementation should adhere to the spec as closely as possible, but this spec is very loose...
<oriansj>and printing warnings for 0y1z2r3q as bad form is entirely valid (and probably a very good idea)
<avih>or rather, it's well defined, but allows inputs which would be considered errors in other languages
<oriansj>well hex0 does a more complex stripping than you would think
<avih>?
<avih>what do you mean?
<oriansj>it first removes the comments and all non-hex values entirely before processing
<avih>right
<oriansj>so 0 1 2\n3 5 7 just becomes 012357 and that is read 2 digits at a time
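The loose behaviour described so far (drop comments, drop every non-hex character, then read the remaining digits two at a time, left to right) can be sketched roughly as below. This is an illustrative Python sketch, not the reference implementation; the name `hex0_loose` is made up here, and the '#' and ';' comment markers and the dropping of a trailing lone digit are taken from statements elsewhere in this log.

```python
HEX_DIGITS = b"0123456789abcdefABCDEF"


def hex0_loose(source: bytes) -> bytes:
    """Translate hex0 source to bytes under the loose rules (a sketch)."""
    out = bytearray()
    pending = None       # first nibble of a pair, if one is waiting
    in_comment = False
    for ch in source:
        if in_comment:
            # A comment runs to the end of the line.
            if ch == ord("\n"):
                in_comment = False
            continue
        if ch in b"#;":
            in_comment = True
            continue
        if ch not in HEX_DIGITS:
            continue     # whitespace and any other non-hex byte is ignored
        nibble = int(chr(ch), 16)
        if pending is None:
            pending = nibble
        else:
            # Two nibbles make one output byte, high nibble first.
            out.append(pending << 4 | nibble)
            pending = None
    # A trailing lone digit is never paired, so it is silently dropped.
    return bytes(out)
```

Under this sketch `01 23`, `0123`, `0 1 2 3` and even `0y1z2r3q` all yield the same two bytes, which is exactly the looseness being discussed.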
<avih>fwiw, i don't think it should warn, because it's perfectly valid, because the spec says so
<oriansj>well one could warn mixing non-hex characters outside of comments
<avih>if you want to add a new spec, hex0-strict, for instance, which is more what I thought the spec was, that's fine, but with the current spec there's nothing to warn about.
<oriansj>as those despite being technically valid are a code smell of confusing or misleading code
<oriansj>avih: would you help me better formalize a good hex0-strict spec?
<avih>sure. the spec i "followed" is also described in a comment at the script file. it's three lines.
<avih>apparently i imagined it, but still, it's a good spec i think :)
<avih>i do think it's important to separate the bytes though, due to the endianness confusion
<oriansj>completely fair and I can fix up the existing hex0 files to include that property
<avih>well, we don't know how many hex0 files are out there, so i don't know if it should be removed from the spec
<avih>but in source files, sure, it would make them clearer IMHO
<oriansj>well that is why a hex0-strict implementation is a good thing
<avih>it needs a strict spec first :)
<avih>that's the spec i followed https://github.com/oriansj/bootstrap-seeds/blob/master/hex0-alternatives/hex0.sh#L7-L11
<oriansj>exactly ^_^
<avih>i.e. it's either comment, or it's uninterrupted pairs of hex digits, separated by any number of space/tabs/newline
<avih>(but not vertical tab, etc)
<avih>i.e. the standard shell IFS
<oriansj>sounds good for the strict spec
<avih>oriansj: if i understand correctly, according to the current spec, there can be no errors at the source file, except if there's an odd number of hex digits outside of comments? or is that valid too?
<oriansj>and lone hex digits probably should be wrong as well
<avih>e.g. a file which only contains a single char: 1
<avih>is it valid? if yes, is it 0x01? or 0x10?
<oriansj>I'd say hard no
<oriansj>current implementations will drop it entirely
<avih>right, so other than a final odd hex digit outside of comment, any random stream of bytes is a valid hex0 source according to the current spec, right?
<oriansj>well yes with the detail that all non-hex characters will be dropped prior to processing and only ASCII hex characters will be valid for the generation of bytes
<avih>sure
<avih>and in the current reference implementation, even an off hex digit is valid, as it's ignored together with non-hex bytes, yes?
<avih>odd*
<oriansj>so utf-8 3byte F characters would be entirely dropped and would be misleading to those reading the code
<avih>indeed
<oriansj>correct
<avih>fwiw, personally i find it both weird and amusing that this works: head -c 1024 </dev/urandom >valid-src.hex0
<avih>it shouldn't be :)
<oriansj>well hex0 is very dumb to enable small implementations
<avih>it's been years since i wrote assembly, so i don't know how harder it would be to be stricter. in C i don't think it's a burden, and neither in sh
<oriansj>well doing a strict version in assembly would balloon its size from 255 bytes to a couple KB
<avih>i have to take your word :)
<avih>(i believe you though)
<oriansj>but strict versions in higher level languages could provide better warnings, errors and recommendations while producing the exact same output
<avih>i don't believe so much in warnings TBH. if it's specified then it's valid. warning is something which is specified but is murky for some reason. IMHO that's a bug at the spec.
<avih>but that's philosophical. a spec could also say "in such case, an implementation may produce a warning", to make it more valid :)
<oriansj>entirely valid point
<avih>oriansj: so, how's this for a battle plan: 1. i revise the code to handle the current spec. 2. you write a stricter hex0 spec, possibly starting from my definition, and i can go over it if you want. 3. I'll update the script to support both strict and non-strict.
<oriansj>well I am not much of a language lawyer but I'll take a crack at it
<avih>heh
<oriansj>and I'll be updating the *.hex0 programs in bootstrap-seeds to conform to the more rigorous hex0-strict spec
<avih>i could have a go at it instead if you prefer
<avih>(the stricter spec)
<avih>but i wouldn't know where to put it. as far as i can tell it's only specified here? https://bootstrapping.miraheze.org/wiki/Stage0
<avih>so it would need two things: 1. make it clearer what the current spec is. specifically that any random stream of bytes is valid, and explain why. 2. add the stricter and recommended spec similar to what i mistakenly described and followed
<oriansj>good question; I don't think we have an official formal specification repo yet (anyone is free to correct me if they started on that)
<avih>i'm not aware of one, but i also never looked for one :)
<avih>i think we can start with the wiki page then?
<avih>(once we have revised texts, that is)
<oriansj>of course and I probably should update the wiki clones too
<avih>what are "the wiki clones"?
<avih>or do you have an implementation which you consider a reference one, in c?
<oriansj> https://git.sr.ht/~oriansj/bootstrappable-wiki
<avih>s/or/oriansj/
<oriansj>the contents of the wiki and the mailing list have been put into a git cloneable wiki
<avih>oh, i see, that's the "wiki clones". i thought that link was a reference c implementation, except i couldn't find any c file :)
<oriansj>that way if the wiki goes down, everyone should have the ability to create a perfect clone where ever they want
<oriansj>avih: git pull on the bootstrap-seeds to get a reference C implementation
<oriansj>(of hex0)
<avih>i don't think i see any c files there. i already have it cloned, that's where i added the hex0-alternatives dir
<avih>oh lol, you just added it :)
<oriansj>I just added commit 31543031833baeebb9d4e99b7fce81a23eb3f84b to the master ^_^
<avih>inside hex0-alternatives. good
<avih>oriansj: you don't need FILE *f = fopen("/dev/stdin", "r"). that's not portable, and also not needed, because you can use "stdin" in any place where "f" is used (same goes for "stdout" and "stderr", e.g. fprintf(stderr, whatever))
<avih>or int c = fgetc(stdin)
<stikonas>avih: if you update .hex0 programs, keep in mind that bootstrap-seeds is just a copy, main version of hex0 files are in other repos (stage0-posix/stage0-uefi/stage0)
<oriansj>indeed
<oriansj>and the changes should not produce any checksum or other differences from the existing files
<avih>stikonas: well, i don't know where the "master" is...
<stikonas>well, that's why I mentioned it here
<avih>obviously i'd prefer it over downstream clones
<stikonas>.hex0 files in bootstrap-seeds are included just for context
<avih>i'm not touching .hex0 files
<oriansj>and allowing people to skip creating their own hex0 implementation that they trust
<stikonas>e.g. for x86 is here https://github.com/oriansj/stage0-posix-x86/blob/master/hex0_x86.hex0
<avih>seems like oriansj wants to, to make it clearer, possibly after having a stricter spec too
<stikonas>so if you want to edit any of the hex0 files, edit them there and later sync them back to bootstrap-seeds
<avih>i don't, but thanks for the heads up anyway
<stikonas>anyway, bootstrap-seeds will have to be not strict
<stikonas>only alternatives can be strict
<avih>i don't think i understand the difference
<avih>also, i don't think any file _needs_ to be strict. i see "strict" as a way for the hex0 compiler to be able to point out places where the source could be clearer.
<avih>(by making it fail on such cases)
<oriansj>avih: I think stikonas is indicating that the hex0-seeds can't be strict but the hex0-alternates can be and the .hex0 code can comply with a stricter set of rules than it implements
<stikonas>that's right, thanks for clarifying
<avih>you mean that the source of the self-hosting hex0 seed can be loose, even if it implements a strict spec? (or maybe the other way around?)
<stikonas>no, vice versa
<stikonas>source can be stricter
<stikonas>but it will not error out on non strict source files
<avih>but implementation should remain loose?
<avih>right
<avih>anyway, so i'll write some text: 1. a more accurate description of the current spec. 2. a stricter spec, and a recommendation that source files follow it. after it's reviewed, someone uploads it to the wiki and whatever?
<oriansj>sounds good
<avih>good. i'll probably have a draft later today or tomorrow.
<oriansj>I'll handle the uploading to the wiki
<avih>k
<stikonas>yes, implementation will remain loose
<stikonas>as oriansj said, you can't have strict implementation that is so small
<oriansj>and the strict and the loose implementations should produce the exact same output if the input is actually good form hex0
<oriansj>^hex0^hex0-strict^
<avih>yup
<oriansj>and hex0-strict is to reduce confusion and make code auditing easier
<avih>yup
<mihi>avih, oriansj: FWIW, I don't really care about hex0 spec, but in hex2 I have already used the fact that you can write DE AD BE EF as DEA DBE EF if it makes the instruction sequence clearer. And I've also put comments between adjacent nybbles of a byte. So in case the hex0 spec is made stricter, probably hex2 spec should explicitly allow this. Also about requiring whitespace between words, UEFI hex0
<mihi>image currently does not always have them and it does not hurt its legibility in my opinion..
<mihi>s/hex0 image/hex0 seed/
<oriansj>mihi: well sounds like a good discussion would be needed before we make a strict hex2 spec
<mihi>allowing CHUCK NORRIS as replacement for CC is something I would not endorse
<mihi>just like #chuck is only a valid HTML color (in some legacy tag attributes), but not a valid CSS color
<avih>mihi: yeah, i did have second thoughts about allowing ABCD to denote the sequence of bytes 0xab followed by 0xcd. i don't write hex0 files so i don't know how useful that is, but i do _think_ it's a potential footgun, because if an op-code is 0xabcd, then the sequence of bytes depends on the endianness of the system, so on a big endian system the order 0xab and then 0xcd is correct, but on a little endian system, the opcode 0xabcd should be written as
<avih>CDAB at the hex0 file, and i think it would be better to not introduce such doubt
<avih>so others would have to chip in on whether that's desirable to support or not
<mihi>yeah, having 12-bit fields in instructions as 3 hex digits also only really works when the architecture is big endian
<avih>(in the strict mode)
<stikonas>mihi: is that 3 byte sequence for aarch64?
<mihi>nope :)
<oriansj>well in hex0 it would be written as CD AB if the byte order was supposed to be 0xCD 0xAB; only in hex1 and above does endianness show up
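The concern and the answer above can be made concrete with a small Python sketch. It uses `bytes.fromhex` only as a stand-in for pair-aligned hex0 text (an assumption for illustration; hex0 itself has no notion of integer values): the text ABCD always yields AB then CD, while the *value* 0xABCD has a different in-memory layout on a little-endian machine.

```python
value = 0xABCD

# How the 16-bit value is laid out in memory on each kind of machine.
big = value.to_bytes(2, "big")        # AB CD
little = value.to_bytes(2, "little")  # CD AB

# hex0 text is read two digits at a time, left to right, on every host.
from_text = bytes.fromhex("ABCD")

print(from_text == big)                       # True: text order matches big-endian layout
print(bytes.fromhex("CD AB") == little)       # True: write CD AB to get the little-endian layout
```

So the hex0 file fixes the byte order explicitly; any byte swapping has to be done by whoever writes the digits.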
<stikonas>I don't remember seeing any 3 byte sequences in x86 or amd64 hex2 files
<oriansj>~labels
<stikonas>or maybe I just remember poorly
<stikonas>oh we do have a lot
<stikonas>it's mostly spaced according to opcodes
<oriansj>or registers or what ever else makes sense
<oriansj>for example DEFINE R15 F sort of bits
<stikonas>hmm, actually it's still pairs in amd64, just can be more than 1 pair
<stikonas>but yes, we are mostly putting spaces around opcode boundaries, not around bytes
<stikonas>which I think makes most sense
<stikonas>because even in hex code, you don't program directly in hex
<stikonas>but you think of an opcode, and then insert hex sequence for that opcode
<stikonas>and endianness doesn't really matter
<avih>fwiw, if possible, i do think that the strict mode should be practical, and that people don't revert to loose mode because they can make the code clearer that way
<avih>but i don't know how to translate that statement into a specification because i don't have practice in writing hex0 files
<stikonas>hmm, so for example right now we have E8 21000000 for calling a function 0x21 bytes ahead
<stikonas>which is clearer than E8 21 00 00 00
<stikonas>(this is call %read_byte in M1 file)
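As a rough illustration of where `E8 21000000` comes from: on x86, E8 is the near-call opcode followed by a 32-bit relative displacement stored little-endian, so a displacement of 0x21 becomes the bytes 21 00 00 00. The helper name `call_rel32` below is hypothetical, chosen only for this sketch.

```python
import struct


def call_rel32(displacement: int) -> bytes:
    """Encode an x86 near call: opcode E8 plus a little-endian signed 32-bit offset."""
    return b"\xE8" + struct.pack("<i", displacement)


encoded = call_rel32(0x21)
print(encoded.hex().upper())  # the five bytes, as they would be written in a hex0 file
```

Grouping the four displacement bytes as one token keeps the opcode boundary visible, which is the spacing convention described above.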
<mihi>avih, when somebody talks about EBFB and CD20 (16-bit x86 assembly), I would interpret them as EB FB and CD 20 anyway (but perhaps because I know what these opcodes do)
<stikonas>jump 0xFB :D
<mihi>or maybe it is because OllyDbg which I used for *years* displays it as such (https://sectools.org/images/screenshots/OllyDbg2.gif)
<muurkha>yeah, I'd certainly read it as CD 20 (exit)
<avih>mihi: sure, but on a little endian system, the opcode ABCD would have to be reversed at the hex0 file. that's my main gripe with it, that a sequence of ABCD might look like the value 0xabcd, but that's only true on big endian
<muurkha>it's a little worrisome to think that, depending on context, it might actually mean ?C D2 0?
<stikonas>avih: but you don't need to think about it as ABCD
<stikonas>if it's reversed in hex0, everybody starts thinking of it as CD AB
<mihi>avih, if you write 0xABCD, I would agree, but ABCD does not resemble 0xABCD for me when thinking about assembly dumps
<stikonas>I think that's because most arches are either little endian or big endian
<stikonas>so you never really have to think what is your endianness
<muurkha>my butt is big endian
<avih>i mean, let's say a malicious actor presents a hex0 code for review, and it includes the line "ABCD 00 # opcode ABCD sets $whatever to 0"
<muurkha>yeah
<avih>and it's intended for a little endian system, and the comment is correct, but doesn't actually describe the code, because the opcode is actually 0xcdab in this case
<stikonas>but is there anybody thinking of it as 0xcdab?
<avih>that's what i'm asking :)
<stikonas>most tools show it as AB CD
<stikonas>at least those tools that are aware that this is little endian code
<mihi>quick poll: on x86 (little endian), what is the opcode for REP MOVSW?
<avih>anyway, i think the problem is clear. what isn't clear is what the strict spec should allow. if you guys think that spaces should not be necessary between bytes, and can happen between nibbles, then that's fine.
<muurkha>it's two opcodes. or are you asking to see if we know without looking it up?
<avih>you just need to decide
<mihi>I'd say F3A5, while avih would probably say (0x)A5F3
<mihi>muurkha, you are free to pipe stuff into nasm, it is not closed book question :D
<stikonas>I would also say F3A5
<muurkha>yeah, F3 A5 or F3A5
<avih>what about spaces between nibbles? i think someone mentioned it earlier?
<mihi>avih, out of curiosity, what (little-endian) platform would treat opcode numbers larger than 1 byte the way you think?
<muurkha>I am somewhat disquieted by the idea of spaces between nibbles
<muurkha>RISC-V's instruction format is pervasively little-endian but it's also damned hard to read in hex
<avih>mihi: i don't know. if you guys think that's not gonna be an issue, i'm cool with that. but i had to raise the concern :)
<stikonas>risc-v immediate encoding is terrible
<stikonas>it took me so much time to get those early bits sorted
<stikonas>much harder to write hex0 code than on x86
<mihi>avih, raising the concern is obviously fine :)
<stikonas>though later bits (once you have M0) on risc-v are much nicer
<stikonas>avih: various disassemblers also think it's F3A5
<stikonas>e.g. try https://defuse.ca/online-x86-assembler.htm
<stikonas>so I think everybody encodes little endianness already into opcodes
<stikonas>when they think about them
<stikonas>which is simpler for most people
<stikonas>because it saves you from having to swap bytes again
<stikonas>(you still need to do it for immediate constants)
<muurkha>no, see https://sectools.org/images/screenshots/OllyDbg2.gif
<stikonas>but swapping endianness of immediate constants is trivial compared to risc-v...
<muurkha>the immediate constants are written out in little-endian in the hex dump window
<muurkha>not swapped back to conventional human order
<stikonas>yes, immediate constants are also written in this order in hex0
<avih>stikonas: so basically, literature of LE architectures describe the multi-byte sequences at the order they appear in memory, rather than by their value (where the memory representation would be different than the order they're written, left to right)?
<stikonas>i.e. 21 is 21000000
<muurkha>right
<stikonas>it does seem to be so, though I haven't read much literature
<stikonas>x86 is a bit messed up anyway
<stikonas>we use hex
<avih>ok, so for now spacing will not be mandatory between bytes.
<stikonas>whereas it actually maps much better to octal
<muurkha>the RISC-V literature generally describes instruction words not in the order they appear in memory
<avih>what about allowing spaces between nibbles? that was mentioned earlier
<stikonas>e.g. mov rsi,rsp is 4889E6
<muurkha>it displays 32-bit or sometimes 16-bit instruction words
<stikonas>but in hex you can't really tell what is move what is rsi and what is rsp
<stikonas>avih: generally I think people prefer spaces between opcodes
<stikonas>not nibbles
<stikonas>but hex0 implementation is not aware of opcodes
<stikonas>so it can't really check
<muurkha>I wrote a thing a few years ago about how the intel instruction sets are more readable in octal: https://dercuano.github.io/notes/8080-opcode-map.html
<stikonas>but people do think of opcodes
<avih>right, the way i grasped it, hex0 is representing a sequence of bytes, which is why initially i thought mandatory spacing between bytes would make it more concise.
<avih>what about "inline" comments? that was mentioned earlier as well
<stikonas>well, it mostly comes from the way we write hex0 code
<stikonas>we start with assembly prototype (i.e. what as can compile)
<stikonas>then we convert it to simplified syntax that M1 or M0 can build
<stikonas>so for example it would be push !1 # prepare to set rdx to 1
<stikonas>and then we keep both as comment in hex0 file
<stikonas>6A 01 ; push !1 # prepare to set rdx to 1
<avih>nono, comment till end of line are fine, and really necessary
<stikonas>oh, those random hex characters
<stikonas>non-hex
<stikonas>they are not used
<avih>someone mentioned earlier comments between bytes
<stikonas>that's just because all implementations ignore everything else
<avih>sure, but should strict mode allow it?
<stikonas>cause that would mean you need to add extra handling of other characters
<stikonas>I would say no
<stikonas>6AMNY01 is just unreadable
<avih>i certainly agree
<stikonas>but in the minimal implementation, you don't want another conditional
<stikonas>to handle other characters
<avih>but again, the goal is also to be practical. we don't want people reverting to loose mode only because they can make it more readable this way
<stikonas>well, nobody would be using those other non-hex characters in the middle
<stikonas>just like nobody puts random URLs in C code
<stikonas>C spec allows to put 1 URL per function
<avih>so to summarize: for each line, after we remove the longest suffix which begins with '#' or ';' and then split the line into "tokens" separated by spaces/tabs, then each token has to be an even number of hex digits, which are interpreted as a sequence of two-digits hex values, left to right?
<stikonas>actually, 1 http URL and 1 https URL per function is allowed
<avih><stikonas> C spec allows to put 1 URL per function <-- what?
<stikonas>i.e. something like this is a valid C code https://paste.debian.net/1273527/
<avih>no?!
<stikonas>yes, gcc will compile this
<avih>wtf
<stikonas>try
<FireFly>oh
<FireFly>stikonas: amazing :D
<avih>stikonas: i don't think that's in the spec
<avih>can you point me to it in the c99 spec?
<stikonas>sure
<FireFly>it is spec-compliant syntax
<avih>i did read the c99 spec start to finish at least once. i just don't recall such thing
<stikonas> https://www.dii.uchile.cl/~daespino/files/Iso_C_1999_definition.pdf section 6.8.1
<mihi>works in Java, too. While Java has no goto, it has labels and // comments
<avih>i don't see how 6.8.1 allows it...
<stikonas>it's https: label
<FireFly>avih: single-line comment
<stikonas>followed by a comment
<FireFly>well, label + comment
<stikonas>another interesting hack is while (x --> 0)
<avih>the https: part is fine, but what follows shouldn't, i think
<stikonas>what follows is a commend
<avih>oh ffs
<stikonas>comment
<avih>right
<avih>man
<FireFly>I had to paste it into vim to get it, touché :p
<avih>:)
<mihi>x = a /*/*/*/*/*/ b;
<avih>ok, so how does that relate to hex0 strict spec? :)
<stikonas>just saying that random strange stuff can be formally allowed by the spec but people would not use it
<avih>right.
<avih>so anyway, this? <avih> so to summarize: for each line, after we remove the longest suffix which begins with '#' or ';' and then split the line into "tokens" separated by spaces/tabs, then each token has to be an even number of hex digits, which are interpreted as a sequence of two-digits hex values, left to right?
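The summary above could be sketched as a checker like the one below. This is only an illustration of that proposed wording, not an agreed hex0-strict spec; `hex0_strict` is a made-up name, and treating only space and tab as separators follows the summary literally (so a stray carriage return would be rejected).

```python
import re

# One or more complete two-digit hex values with nothing in between.
_TOKEN = re.compile(r"(?:[0-9a-fA-F]{2})+")


def hex0_strict(source: str) -> bytes:
    """Translate hex0 source, rejecting anything the proposed summary disallows."""
    out = bytearray()
    for lineno, line in enumerate(source.split("\n"), start=1):
        # Drop the longest suffix that begins with '#' or ';'.
        code = re.split(r"[#;]", line, maxsplit=1)[0]
        for token in re.split(r"[ \t]+", code):
            if token == "":
                continue  # leading/trailing separators produce empty strings
            if not _TOKEN.fullmatch(token):
                raise ValueError(f"line {lineno}: invalid token {token!r}")
            # Pairs are emitted left to right, independent of host endianness.
            out += bytes.fromhex(token)
    return bytes(out)
```

With this sketch `6A 01 ; push !1` and `E8 21000000` are accepted, while `6AMNY01`, `ABC` and `0 1` raise an error instead of being silently reinterpreted.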
<stikonas>well, at least for some arches that makes sense
<stikonas>i.e. in x86 we always use them in pairs
<avih>what's missing for the other arches?
<stikonas>but e.g. for knight oriansj used single hex tokens
<stikonas>which could represent register
<avih>so i presume that writing individual nibbles would make it clearer to read in such cases?
<stikonas>e.g. see https://github.com/oriansj/M2libc/blob/main/knight/knight_defs.M1
<avih>IOW, if the spec requires even sequence of digits, then whoever finds it clearer to use individual nibbles would revert to loose mode?
<avih>that's not great...
<stikonas>which means we shouldn't ask it in spec
<stikonas>because it really depends on the specific details of how architecture encodes stuff
<avih>stikonas: that link is not hex0
<stikonas>no, that link is opcode to hex map
<stikonas>like I said, people think in opcodes
<avih>(that i can tell, because technically any random sequence of bytes is a valid hex0 source...)
<stikonas>e.g. move register1 register2
<stikonas>and whether "move" or a register is a pair of bytes or single byte or something worse is arch specific
<stikonas>and some arches are nasty
<avih>so you think single digits should be allowed, but they always end up as part of a pair at the same line?
<stikonas>in practice I think they will be on the same line...
<avih>i.e. if, after stripping the trailing comment, a line has an odd number of hex digits, then it's invalid?
<stikonas>but again, I don't think having them on the same line should be a spec
<avih>otherwise it's very easy to accidentally leave an odd digit at one line, which will shift the rest of the file by 4 bits...
<stikonas>but maybe it will just shift 1 line
<stikonas>and you are back to normal
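The nibble shift discussed above can be sketched in Python. This is a simplified model of a loose hex0 reader, not any of the real implementations: it assumes comments run from '#' or ';' to the end of the line, every other non-hex character is ignored, digits are paired in file order, and a dangling final digit is dropped. The name `loose_hex0` and the sample inputs are made up for illustration.

```python
HEX_DIGITS = "0123456789abcdefABCDEF"

def loose_hex0(source):
    """Simplified model of a loose hex0 reader (an assumption, not the real tool)."""
    digits = []
    for line in source.split("\n"):
        for ch in line:
            if ch in "#;":
                break  # the rest of the line is a comment
            if ch in HEX_DIGITS:
                digits.append(ch)
    # Pair digits in file order; a dangling last digit is dropped.
    usable = len(digits) - len(digits) % 2
    return bytes.fromhex("".join(digits[:usable]))

intended = "01 02\n03 04\n"
one_typo = "01 2\n03 04\n"            # a digit went missing on line 1
two_typos = "01 2\n3 04\n05 06\n"     # a second odd line realigns the stream

print(loose_hex0(intended).hex())   # 01020304
print(loose_hex0(one_typo).hex())   # 012030
print(loose_hex0(two_typos).hex())  # 0123040506
```

With one odd line, every later byte is shifted by four bits. With a second odd line, the bytes after it line up again, which is the "back to normal" case stikonas describes. Neither case is reported as an error by a loose reader.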
<avih>stikonas: why not? remember, we want the strictest practical spec we can produce.
<stikonas>just you wanted to have some end of line comment
<avih>and we want to be able to identify errors
<avih>the problem with the current spec is that it doesn't have any errors at all. any typo at the source will be valid to the compiler
<stikonas>avih: so this https://paste.debian.net/1273528/ would be invalid then?
<stikonas>(this is of course fake hex, not x86 or anything else, but just to illustrate the point)
<oriansj>well if we think of hex0 as having a mapping of 1 instruction per line
<avih>stikonas: that's what i'm asking, yes. i thought it should be invalid. what do you think the produced sequence of bytes should be for this source?
<oriansj>and all architectures we plan on supporting have instructions that are multiples of 8bits
<stikonas>well, it's the same as 01FE or 0xFE01
<avih>oh, sorry, i missed the last F at the first line
<oriansj>so the strict spec should be able to require that, after removal of all comments and whitespace, what remains is an even number of hex characters
<oriansj>and in the strict spec we can explicitly forbid all non-hex characters that are not inside of comments
<oriansj>(excluding whitespace of course)
<avih>right
<stikonas>I'm definitely for removing non-hex characters outside comments
<mihi>hmm, as a compromise, how about allowing some special character (like *) to be added for partial bytes? Or would it make the parser/checker too complicated?
<stikonas>that's a clear improvement
<mihi> https://paste.debian.net/1273530/
<oriansj>and we don't even need to support all whitespace characters either; just space, \t, \n and \r should be sufficient
<avih>oriansj: so you think that it's practical to write readable hex0 files without using individual hex digits? i.e. that the digits are always in an even sequence?
<mihi>existing lax implementations will ignore it
<avih>mihi: no objection here if you think it would be useful
<oriansj>well in some architectures doing add R0 R1 R15 as AF0 0 1 F is most readable but when you strip out the whitespace it'll be an even number of hex digits only
<mihi>but probably we are in bikeshedding or https://xkcd.com/1172/ land ...
<oriansj>mihi: I don't think partial bytes make sense in hex0
<mihi>okay. I used them in hex2, but I do not really care about hex0 in that respect
<stikonas>but if we care too much about whitespace rules, then we'll end up with another python
<avih>right, so there are few options: 1. allowing any number of consecutive digits, including odd number, including one. 2. same as 1, but per line it should be even number of digits. 1.1: same as 1, but individual digits should be marked somehow, e.g. with *. 2.2: same as 2 but individual digits are marked
<avih>i think 2 should generally be both practical and allow the compiler to detect issues?
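The difference between options 1 and 2 can be sketched in Python. This reads option 1 as "digits may pair up across line boundaries, only the file total must be even" and option 2 as "every line must hold an even number of digits". The function names and the sample sources are hypothetical; the split-byte sample is made up and is not the content of the paste linked in the discussion.

```python
HEX_DIGITS = "0123456789abcdefABCDEF"

def digits_per_line(source):
    """Count the hex digits on each line, ignoring '#' and ';' comments."""
    counts = []
    for line in source.split("\n"):
        count = 0
        for ch in line:
            if ch in "#;":
                break  # the rest of the line is a comment
            if ch in HEX_DIGITS:
                count += 1
        counts.append(count)
    return counts

def valid_option_1(source):
    # Option 1 (as read here): only the total digit count must be even.
    return sum(digits_per_line(source)) % 2 == 0

def valid_option_2(source):
    # Option 2: every single line must contain an even number of digits.
    return all(count % 2 == 0 for count in digits_per_line(source))

same_line = "AF0 0 1 F # single digits, but whole bytes per line\n"
split_byte = "0 1 F\nE\n"  # made-up source where one byte spans two lines

print(valid_option_1(same_line), valid_option_2(same_line))    # True True
print(valid_option_1(split_byte), valid_option_2(split_byte))  # True False
```

Under option 2 a checker can point at the exact line with the stray digit, while option 1 can only notice that the file as a whole is off.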
<oriansj>mihi: in x86/M0_x86.hex2: I see the * in the comment not in the body
<avih>fwiw, i don't think marking is necessary. if you guys deem that individual digits can make it more readable, then it's fine. no need to mark it IMHO
<stikonas>no, marking them extra would be confusing
<avih>the question remains of whether an individual byte can be split over lines
<avih>like here https://paste.debian.net/1273528/
<stikonas>individual hex
<stikonas>bytes can definitely be split
<avih>i don't understand.
<stikonas>oh ignore that
<avih>so do we want to allow splitting bytes between lines at the strict spec, like at that link?
<oriansj>avih: Well I can't imagine a good reason to split a byte between lines
<avih>i.e. the same byte starts in one line and continues at the next one
<stikonas>oriansj: what do you think?
<stikonas>I know in hex2 code we definitely allow it
<stikonas>because M0 can produce one define per line
<stikonas>but that's hex0...
<oriansj>splitting inside of a single line is common and aids in comprehension
<avih>stikonas: the question is not whether it's allowed or not, but whether it can practically make the code more readable.
<avih>if yes, then disallowing that would harm the readability. but if perfectly readable source can get away with full bytes per line, then there's no issue IMHO
<oriansj>I know M1 generated output might split a line but I can't remember ever manually splitting a byte between lines
<stikonas>no, I think we never did that
<stikonas>but this example https://paste.debian.net/1273528/ shows a plausible scenario
<stikonas>(though never used in practice)
<stikonas>but then again, maybe this is spacebar heating
<oriansj>yeah, I don't accept that as valid even with RISC-V levels of crazy encoding
<stikonas>ok, let's say we need full byte in a line then
<avih>stricter spec should help users, by being able to detect some accidental errors at the code. you guys write hex0 code, so you should decide to what extent the strictness goes, to balance your needs
<oriansj>doing multiple line comments above a messy instruction is common and easily understood
<stikonas>hopefully we don't need to write much hex0 code...
<stikonas>it's tedious...
<oriansj>stikonas: only for a handful of steps when setting up a new architecture
<stikonas>actually in practice, if you want to add a new arch (say to stage0-posix). You'll spend more time writing M0 or even C code than hex0 code
<stikonas>which is true for every single arch I guess
<oriansj>yeah, the biggest problem usually is figuring out how the spec lies
<oriansj>armv7l lied about bit order; powerpc lies about the e_entry and so forth
<avih>so you guys agreed on this for each line: 1. strip comments (the longest suffix which begins with '#' or ';'). 2. what remains must be any sequence of hex digits, spaces, and tabs. 3. the overall number of hex digits in "2" must be even.
<mihi>ACTION agrees :)
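The three per-line rules avih lists above can be sketched as a small Python checker. `parse_strict_line` is a hypothetical name, and accepting both upper- and lower-case hex digits is an assumption; this is an illustration of the rules as stated, not an official strict hex0 implementation.

```python
HEX_DIGITS = "0123456789abcdefABCDEF"

def parse_strict_line(line):
    """Return the bytes encoded by one line, or raise ValueError if it is invalid."""
    # Rule 1: strip the comment, i.e. the longest suffix starting with '#' or ';'.
    cut = len(line)
    for marker in "#;":
        pos = line.find(marker)
        if pos != -1:
            cut = min(cut, pos)
    body = line[:cut]

    # Rule 2: what remains may only contain hex digits, spaces and tabs.
    digits = []
    for ch in body:
        if ch in HEX_DIGITS:
            digits.append(ch)
        elif ch not in " \t":
            raise ValueError(f"invalid character {ch!r} outside a comment")

    # Rule 3: the number of hex digits on the line must be even.
    if len(digits) % 2 != 0:
        raise ValueError("odd number of hex digits on this line")

    # Digits are read as two-digit hex values, left to right.
    return bytes.fromhex("".join(digits))

print(parse_strict_line("AF0 0 1 F  # add R0 R1 R15").hex())  # af001f
```

A line such as `AF0 0 1` or `12 3G` raises `ValueError` here, whereas a loose reader would accept both silently.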
<oriansj>avih: you can also safely strip out the spaces and tabs too
<avih>what do you mean?
<mihi>the spaces and tabs have no semantic meaning
<avih>i'm talking logically, not implementation details
<avih>sure, but they are allowed
<stikonas>oh we need to allow a bit more
<stikonas>\r is also allowed whitespace
<stikonas>(carriage return)
<oriansj>0 1 2 3 4 5 6 7 8 9 and 0123456789 are the same thing
<oriansj>(in hex0)
<avih>stikonas: i actually considered it and concluded against it. this could be seriously misleading in some tools at the terminal
<stikonas>e.g. if somebody has windows line ending
<avih>then they can choose unix line ending. even notepad can handle it
<mihi>stikonas, I would define a line to be separated by either \r\n or \n, that should cover the \r case
<stikonas>e.g. say you are in UEFI shell
<stikonas>yeah, only in new line endings
<mihi>TBH I am not sure how hex0 implementations treat 12 # comment \r 34 #another comment \n
<mihi>so I would disallow it
<stikonas>\r is ignored
<stikonas>just like \n is ignored
<oriansj>well UEFI shell uses \n\r but bios level programs would only get \r when the enter key is struck
<mihi>\n is significant for terminating comments, and I'm not sure whether \r also does it or only in some implementations
<mihi>but we have oriansj's C implementation now :)
<avih>oriansj: is/can "UEFI shell" be used to write hex0 files, and the concern is that it will end lines with \r\n, which then might be considered invalid with strict hex0 compiler?
<stikonas>yes, I just checked UEFI shell adds 0D 0A even if you set ASCII (i.e. non Unicode) file
<stikonas>yes UEFI shell can definitely be used to write hex0 file
<mihi> https://github.com/oriansj/bootstrap-seeds/blob/master/hex0-alternatives/hex0.c#L47 terminates comments at \r
<avih>right, so i think strict spec should allow \r\n line endings, but not \r before a comment, yes?
<mihi>fine for me
<stikonas>well, I guess either \r\n or \n\r
<mihi>ACTION never seen \n\r :)
<stikonas>actually \r is also legal on some Macs
<avih>and \r is allowed in a comment, even if it can lead to tricks such as abcd # \r 1234
<stikonas>\r is also a new line
<avih>stikonas: i think \r is the "old" mac line ending
<stikonas>I think so
<avih>these days \n alone is used on mac too afaik
<stikonas>before they switched to POSIX
<stikonas>ok, maybe we can ignore it
<avih>ignore \r? or ignore the fact that some systems use it as EOL marker? :)
<stikonas>ignore the fact that some systems use it
<stikonas>here is the full list
<stikonas> https://en.wikipedia.org/wiki/Newline
<oriansj>Acorn BBC and RISC-OS did \n\r
<stikonas>we definitely don't support all of them even in loose implementation
<stikonas>ok, in some sense we do support it
<stikonas>they are just ignored
<avih>"the classic Mac OS, MIT Lisp Machine and OS-9" <- the most modern systems which use \r EOL natively. yeah, i think we can ignore it.
<avih>we could simply treat \r as EOL too
<oriansj>well we can solve this pretty easily
<avih>?
<oriansj>we allow loose hex0 to support system default EOL but strict hex0 to require \n
<avih>ACTION likes :)
<oriansj>and every step after hex0 can just use SET and that'll set \n at end of line
<stikonas>well, loose hex0 supports everything like it does today
<avih>also, i would think that in the unlikely event that someone tries to write a hex0 file using one of these machines, it should have editors/tools to convert EOL, like dos2unix does (i think it can handle old mac too)
<avih>so basically, for the strict mode, \r which is not followed by \n is invalid?
<stikonas>I guess that's fine
<stikonas>that covers everything that is not a museum piece
<avih>that at least covers the UEFI thing and people using notepad :)
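The end-of-line rule settled on above (lines end with \n or \r\n, and a \r that is not followed by \n is invalid in strict mode) can be sketched in Python. `split_strict_lines` is a hypothetical helper working on raw bytes, written only to illustrate the rule as discussed.

```python
def split_strict_lines(data):
    """Split raw bytes into lines, accepting \\n and \\r\\n but rejecting a bare \\r."""
    segments = data.split(b"\n")
    lines = []
    for index, segment in enumerate(segments):
        # Every segment except the last one was terminated by \n.
        followed_by_newline = index < len(segments) - 1
        if followed_by_newline and segment.endswith(b"\r"):
            segment = segment[:-1]  # drop the \r of a \r\n line ending
        if b"\r" in segment:
            # Any \r left over is not directly followed by \n.
            raise ValueError(f"line {index + 1}: \\r is not followed by \\n")
        lines.append(segment)
    return lines

print(split_strict_lines(b"12 # unix\n34 # uefi or notepad\r\n"))
# [b'12 # unix', b'34 # uefi or notepad', b'']
```

An input like `b"abcd # \r 1234\n"` is rejected, which rules out the trick of hiding digits behind a carriage return inside a comment.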
<mihi>and anything that is not \r \n \t or 20-7E is also invalid I think. Or do we need Arabic comments?
<avih>?
<stikonas>in comments we allow more stuff I guess
<avih>what's 20-7E?
<stikonas>ascii codes?
<avih>yeah, in comment everything goes
<avih>yeah, got it. i think it's some code for something
<oriansj>mihi: ascii only as there are way too many utf-8 and other source attacks
<stikonas>we are not ascii only already
<mihi>do we want to allow UTF-8 in comments? Also Right-to-left-overrides?
<stikonas>at least due to my copyright statements
<avih>well, we already said that outside of comments only hex digits, spaces and tabs are allowed
<avih>so that's inside ASCII7 by definition
<avih>IMO in a comment anything goes, except \r which is not followed by \n