IRC channel logs

2022-05-22.log

back to list of logs

<stikonas>I usually just execute individual build steps after chrooting
<doras>But it's really up to you, I'm fine with it either way.
<doras>I like resuming the bootstrap, so if I actually fixed the issue it would continue to the end.
<doras>It involves commenting out build steps that succeeded.
<doras>stikonas: regarding the tmpfs thing, I mostly need to come up a nice way for the Python code not to assume that mounting and unmounting needs to happen.
<doras>It probably involves moving this logic out of the sysx constructors.
<stikonas>can we not just hide it in the if block?
<Hagfish>"commenting out build steps that succeeded" => i'm wondering if there could be a runner which keeps track of which steps have succeeded to allow resuming from the last known good step
<Hagfish>i guess adding that extra layer of abstraction wouldn't be helpful once all the steps are working though
<stikonas>Hagfish: I was thinking of just checking build results before build
<stikonas>if they exists, skip the build
<stikonas>it's fairly easy
<stikonas>I was thinking of that already but there were other things that neede doing
<oriansj>catm
<oriansj>catm kaem.run step_07.success standard_init.run
<oriansj>catm kaem.run step_08.success standard_init.run
<oriansj>then just use match and if blocks to skip steps
<oriansj>but I would tweak match to give you a more efficient solution
<oriansj>say match --greater ${last_stage} 4
<stikonas>well, for now I'll just do it in bash
<stikonas>just run src_checksum at the beginning of the build process, if it succeeds, return, else continue with the build
<oriansj>that'll even work in kaem
<oriansj>so much better idea, good job stikonas
<stikonas>well, in kaem we don't have those helper function...
<stikonas>anyway, in kaem the only steps worth skipping are mes and initial build of tcc
<stikonas>everything else is very quick
<oriansj>to be honest, part of me misses when M2-Planet could compile 1+Mloc/second
<oriansj>but it is now *MUCH* more robust
<oriansj>but I guess 300+Kloc/second is still respectable
<muurkha>robustness is super important. we don't want it to have an exploitable compiler bug!
<oriansj>muurkha: well we do have an exploitable bug in untar
<muurkha>hopefully our source tarfiles aren't yet exploiting it to introduce backdoors
<oriansj>well we can sha256sum those files prior to untaring them
<oriansj>So we can limit that attack surface to a degree but it is a bug that just requires input validation checks to be added to untar
<oriansj>same could be said about ungz
<oriansj>although one could argue adding --fuzz-test to mes.c would probably be the best way to make our bootstrap robust
<stikonas>M2-Planet now has more features too, so it has to make more processing
<stikonas>preprocessor basically makes it two-pass run
<oriansj>stikonas: yeah the preprocessor was the majority of the speed loss.
<oriansj>stikonas, fossy: would it be an issue if I changed ungz and untar to allow safer fuzzing but it would change the invocation commands slightly?
<stikonas>should be fine
<stikonas>we can adjust commnads in live-bootstrap when we pull in newer stage0-posix
<oriansj>thank you
<oriansj>and you'll now be able to specify the output file name for ungz
<oriansj>and untar and ungz now have the new flags, I'll begin fuzzing and adding input validation
<oriansj>took only 10 seconds of fuzzing to find the first ungz segfault
<midgardian[m]><oriansj> "but I guess 300+Kloc/second is..." <- Can you explain why M2-Planet compile performance decreased? Also how do you benchmark how many LOC/s it compiles?
<midgardian[m]><oriansj> "muurkha: well we do have an..." <- [oriansj](https://matrix.to/#/%40oriansj%3Alibera.chat) What bug in untar is this? is it in this file: sysa/stage0-posix/src/mescc-tools-extra/untar.c ??
<midgardian[m]><oriansj> "stikonas: yeah the preprocessor..." <- Oh sorry I read this, the M2-Planet performance decreased due to preprocessing?
<stikonas[m]>midgardian: M2-Planet is also doing a bit more features and safety checks. That is a further drop in performance...
<oriansj>midgardian[m]: this one is the one I was speaking of: https://www.openwall.com/lists/oss-security/2021/10/03/1
<oriansj>and yes it is the untar in mescc-tools-extra
<oriansj>the method for testing compile speed was done using a 100Mloc test file and timing how long it took to compile.
<oriansj>things that reduce performance: looking at data more than once, manipulating data more than once, inefficient algorithm choices, inefficient syscalls (this was previously the major performance problem with M2-Planet) and busy code
<oriansj>doing a separate preprocessing stage required us to look at the data more than once and manipulate the data more than once, which did cut performance in half. adding all those require statements to catch garbage input (which provide useful error messages rather than just segfaults) requires us to look at the data more than once and caused the remaining performance drop.
<oriansj>The algorithm currently used is not optimal in terms of absolute performance (I selected simplicity of mental model and minimal build requirements) but as it is just a simple state machine walking a list so there is only a possible 4x performance improvement if I went to the extreme. (Which would be a major rewrite effort taking possibly years [aka not worth it])
<oriansj>the previous inefficient syscalls in M2-Planet relates to how fgetc and fputc used to work (and how cc_* still works); in particular one only reads a single byte or writes a single byte per syscall.
<oriansj>but fortunately the M2libc work showed that major performance improvements could be obtained by buffering input/output. So now M2-Planet only does a single read from an input file and minimizes the number of writes. Which resulted in a 10x performance improvement in M2-Planet
<oriansj>and there are no busy wait loops or deadlocks or other busy code blocks in M2-Planet so isn't an issue
<oriansj>in an unrelated note; The freedom to study the source code and make changes. What could we do to improve the situation at the hex0 level?
<oriansj>because right now it is just doing art with individual grains of colored sand sort of labor of love.
<muurkha>I feel like a formal proof that the code does one thing and not another, using a formal model of the semantics of the CPU instructions it uses, would be a useful step
<muurkha>it maybe wouldn't help people make changes but it could help them study it
<muurkha>or maybe it would help them make changes because it would tell them what they broke
<muurkha>unfortunately I don't know how to do formal proofs of programs beyond the most basic Hoare logic level
<oriansj>hmmm but would that make writing unreadable code easier so I am not sure how much that will gain us.
<muurkha>I don't think it would, but I hardly have more experience with the thing than you do
<oriansj>imagine a code golfer with a tool which can say yes or no if a program would do the thing specified. Then apply some brute force search and boom, a program that complies with the spec but no one knows how it works.
<oriansj>for a real world example: https://www.damninteresting.com/on-the-origin-of-circuits/
<muurkha>I don't think you can actually write code that way even with a proof assistant
<muurkha>your article is not a real world example of someone using formal proofs; it's rather the opposite extreme
<muurkha>I mean, yes, superoptimizers can generate code sequences successfully
<muurkha>but they top out around ten instructions, and generally it's not that hard to figure out how they work
<oriansj>perhaps I am just thinking of it in terms of a program that would take the hex0 and the specification and generate a formal proof of correctness. And what one could build on top of it.
<muurkha>yeah, I don't think that is a thing that anybody knows how to do
<oriansj>because a one off proof doesn't really help one change code
<muurkha>hmm, I don't think that's true from my experience with programming in statically typed languages
<oriansj>hex0 is a typeless programming language
<muurkha>irrelevant
<muurkha>I find that static types (which are simple theorems checked by the compiler) are helpful for changing code because they help me rule out large classes of interactions
<oriansj>perhaps we are thinking of two different things when talking about a proof for hex0
<muurkha>also, though, because the process of structuring the code such that it's *possible* for the types to check requires me to write the code in a clearer way
<muurkha>which makes it easier to change
<oriansj>granted that editors with proper syntax and type checking make code easier to maintain but I don't think of that is a proof of correctness.
<muurkha>it's a proof of a particular property of the code
<muurkha>generally a non-dependent type system isn't sufficiently expressive to express all the correctness properties we'd like to express, just some of them
<muurkha>but I've never used more expressive proof systems like PVS, ACL2, or Coq, so maybe my experience doesn't generalize
<oriansj>indeed it is but that property in no way indicates that the code comforms to the specification.
<muurkha>I wouldn't say "in no way" — it shows that the code conforms to the type specification. such and such a function returns a list of integers, or whatever
<muurkha>dependent type systems like Idris allow you to specify tighter types, such as a list of integers that is a permutation of the first argument such that for every i x[i] ≀ x[i+1]
<muurkha>at which point you've proven your sort function correct
<oriansj>specification says: return "true" if prime else return "false"; code does return 42;
<muurkha>yeah, see, even C's type system can demonstrate that code is incorrect
<oriansj>return "42"; fixed
<muurkha>right, in that case C's type system isn't sufficiently expressive to find the bug but, say, TypeScript's is
<unmatched-paren>well, if the only possible returns are "true" and "false", then you could just use <stdbool.h>'s `bool`
<unmatched-paren>not sure why you'd use a string in that case?
<unmatched-paren>and i'm pretty sure if you used an enum then most compilers would check that it's a valid member of the enum...
*unmatched-paren checks with gcc
<muurkha>C isn't very good at enum checks
<oriansj>TypeScript would be checking that it is a string type, so it wouldn't catch that "42" is not the string "true" or the string "false"
<muurkha>but Java, Kotlin, Scala, OCaml, F#, or Pascal is fine
<muurkha>in TypeScript "true" | "false" is a valid type, and the compiler can see that "42" does not conform to it
<unmatched-paren>oh, that's kind of neat that it uses strings as enums...
<muurkha>yeah, TS is designed to be able to check the kind of goofy stuff we routinely do in JS
<unmatched-paren>in a cursed JavaScripty meaning of neat i guess :P
<oriansj>unmatched-paren: the reason for the string type is to demonstrate the point that types by themselves don't solve the problem but we can create structures that do
<muurkha>yes, and the reason for TypeScript's more complex type system is to demonstrate that types by themselves do solve *that* problem
<muurkha>expressing prime() in your type system requires a substantially more elaborate type system though
<oriansj>muurkha: custom created types that are written by people to help themselves, not out of the box types
<muurkha>Coq or Idris or Agda can do it, TS can't
<muurkha>yes, of course
<muurkha>you can't get very far with primitive types in any language
<unmatched-paren>what... GCC does not, apparently, check that an `enum foo` value is a valid member of that enum... i'd expect better from it.
<oriansj>it isn't a native property of type checking but humans leveraging type checking in clever ways
<unmatched-paren>pretty sure the C standard allows you to do that check
<oriansj>muurkha: I managed to get pretty far with just char* and int
<unmatched-paren>it was mentioned in ANSI K&R that compilers were allowed to do that... maybe it's changed since then
<muurkha>oriansj: you can get pretty far writing code without any types at all, as in assembly! but you can't get very far expressing correctness properties of your program
<muurkha>unmatched-paren: did you try -Wall -pedantic -Wextra?
<unmatched-paren>muurkha: yes
<muurkha>I'm surprised it didn't even warn
<muurkha>oriansj: https://thatgeoguy.ca/blog/2021/03/07/review-the-little-typer/ showed up on the orange website today
<oriansj>hex0 could support annotation
<oriansj>which a tool could perform checks on
<oriansj>is this bag of bytes a valid instruction and does it correspond to the # assembly ; block
<muurkha>yes, and there's no limit to how far you can take the expressiveness of that sort of annotation
<muurkha>which, ideally, would be in a separate file from the hex bytes themselves
<oriansj>well hence my question about how we can make hex0 into a better programming language
<oriansj>for example #:label ; [0xaddress] could be checked if it would actually be that address
<oriansj>(just thinking in terms of developer tooling)
<oriansj>hex # jump label ; comment could be checked to see if it is a jump instruction and would it jump to the address at that label
<oriansj>if one does mov ax, [memory] and doesn't do a movzx or movsx to fill up the register that could be a bug and an annotation could be put there to ignore that warning or the programmer could just fix the code sort of details
<stikonas>But label checking tool is basically "hex2"
<stikonas>one can just build it with hex0 and hex2 and compare output
<muurkha>podcast on dependent types and proving programs correct: https://corecursive.com/023-little-typer-and-pie-language/
<muurkha> https://stackoverflow.com/questions/8597426/enum-type-check-in-c-gcc says the enum constants are of type int
<muurkha>"Yes, that means that the constant apple, even though it was declared as part of the definition of enum fruit, isn't actually of type enum fruit. The reasons for this are historical. And yes, it would probably have made more sense for the enumerators to be constants of the type."
<muurkha>it says clang has -Wenum-conversion
<muurkha>which was added to GCC last year
<muurkha>in GCC 10
<stikonas>in C++ there was already strongly typed enum for some time
<muurkha>yeah, C++ enum is different
***mid-kid1 is now known as mid-kid
<stikonas>oriansj: would you mind if I also add --verbose option to untar?
<stikonas>(to print list of extracted files, else untar will not print the list of files but will still print "The extraction of $name.tar was successful"
<stikonas> https://github.com/oriansj/mescc-tools-extra/pull/10
<oriansj>stikonas: looks great
<stikonas>ok, I'll push it then
<oriansj>thanks stikonas
<siraben> https://justine.lol/lambda/
<oriansj>siraben: the 255 character limitation seems pretty harsh
<siraben>the binary nature makes it even harder to audit
<siraben>I'm a bigger fan of Ben Lynn's readable (with enough effort) implementation
<oriansj>and the "Output consists of 0 and 1 bytes" presents a serious practical concern
<stikonas>yeah, 255 characters is not enough to even self-host it
<oriansj>does however appear to be enough to do a brainf#ck