IRC channel logs

<lechner>Hi, is there a way to unconditionally re-export everything from another module?

<rlb>sneek: later tell lechner https://codeberg.org/lokke/lokke/src/branch/main/mod/lokke/ns.scm#L74-L77

<adhoc>rlb: for those of us that have never done that, why would you want to?

<rlb>iirc in lokke's case, it was because a given ns (module) was implemented partially in another guile module, i.e. its exports should be the union of the current module, and a couple of others.

<rlb>in that particular case, the module itself is clojure, but the majority of the exports come from two other files, one clojure, and the biggest, scheme.

<rlb>e.g. https://codeberg.org/lokke/lokke/src/branch/main/mod/lokke/ns/clojure/core.clj#L12-L13

<rlb>adhoc: ^

<adhoc>um

<adhoc>so you can import clojure code into guile?!?

<rlb>lokke is an experimental clojure dialect for guile (e.g. like guile's existing js and elisp support).

<rlb>via https://www.gnu.org/software/guile/manual/html_node/Compiler-Tower.html

<rlb>Think there's also a python dialect out there somewhere, oh, and now wisp is included too.

<rlb> https://codeberg.org/lokke/lokke#lokke-clojure-for-guile fwiw

<adhoc>wow

<adhoc>clojure is one dialect that i have wanted to tinker with, but was always put off by the JVM requirement

<adhoc>too many bad experiences with the JVM

<rlb>no idea whether you'd be happy, my experiences with clojure on the jvm have been good --- and it has very nice ways to work with all the existing (sometimes very useful) java libraries without all the ceremony; fwiw, ymmv.

<adhoc>it was the stop the universe and garbage collect for 30 minutes after 30 minutes run time...

<rlb>And its built in concurrency tools can be very handy (definitely recommend the "train book" if you head that way).

<adhoc>concurrency ... in the JVM

<rlb>I've never seen that myself, but maybe I've been lucky, or perhaps "things are better" depending on when you were having trouble.

<adhoc>when I spent a lot of time there it was using green threads.

<rlb>It can even return heap to the OS now I think.

<rlb>Oh, yeah, things are *much* more sophisticated now.

<adhoc>web services, JSPs, JNI and other fun

<rlb>It's not all roses, but particularly for serverish work, I've had very good luck.

<adhoc>in hindsight, this was up to 2010

<rlb>It also helps that more or less the latest version is very well supported in debian all the time now ;)

<rlb>(for me)

<adhoc>but Oracle ?

<rlb>ACTION sticks to openjdk

<adhoc>ok then, sounds like the improvements would be worth looking into

<adhoc>ah

<adhoc>so, lokke

<adhoc>"The Clojure dialect is currently available via ./lokke which can run code, serve as a !# interprete"

<rlb>They've even finally conceded that symlinks and unix domain sockets exist, but it's still not suitable (easily) if you need to deal with arbitrary binary paths/envt vars/argv, etc. But then neither is guile.

<adhoc>that is cool

<rlb>I somewhat mirrored upstream clojure which now has clj and clojure, the former for interactive use.

<adhoc>it has been far too long since i coded for work in scheme or lisp of any kind

<rlb>(lokke's probably not ready for "serious" use though)

<adhoc>to be honest, i have struggled most with learning guile

<adhoc>ah ok, will keep an eye on that though

<rlb>Oh, and jvm clojure is only suitable for tools where some "startup time" is acceptable. i.e. probably not for cp, ls, git, etc. It still takes about a half a second or more to "do nothing" here.

<rlb>But it can be reasonably fast after that (it's because clojure itself has a lot of fixed work still to do at startup and they've never afaik prioritized fixing it because it's most often used for server work).

<adhoc>better than the python tools we have at work then ...

<rlb>python has a very fast startup time I think, it's just very slow for many things otherwise...

<rlb>well, not *very* fast, but faster.

<rlb>(e.g. less than a tenth of a second here.)

<adhoc>i suppose it depends which modules you are loading

<adhoc>point taken.

<adhoc>on a machine with a sub 1GHz clock, python 3.9+ takes 30 seconds to get to a prompt/repl

<adhoc>servers with spinning rust, yeah ... all these things suffer

<rlb>I'm really just talking about the "time python3 -c ''" time, same wrt "time clojure -e ''", etc.

<adhoc>so, on learning scheme/guile,

<rlb>i.e. what's the fastest a "foo --version" command could run (without cheating)
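
The "do nothing" startup comparison rlb describes can be reproduced roughly as follows. This is a minimal sketch, not a rigorous benchmark: the helper name `startup_seconds` is made up for illustration, and it only times the Python interpreter that is running the script. Timing `clojure -e ''` the same way assumes a `clojure` launcher is installed, so it is left as a comment.

```python
import subprocess
import sys
import time


def startup_seconds(argv, runs=5):
    """Return the best wall-clock time (seconds) for running argv to completion."""
    best = float("inf")
    for _ in range(runs):
        start = time.perf_counter()
        subprocess.run(
            argv,
            check=True,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        best = min(best, time.perf_counter() - start)
    return best


# Equivalent in spirit to `time python3 -c ''`.
python_noop = startup_seconds([sys.executable, "-c", ""])
print(f"python no-op startup: {python_noop:.3f}s")

# Assumption: a `clojure` command exists on PATH; uncomment to compare.
# print(startup_seconds(["clojure", "-e", ""]))
```

Taking the best of several runs reduces noise from a cold disk cache, which matters for the slow-storage cases adhoc mentions.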

<adhoc>is The Little Schemer a good book to learn from ?

<rlb>perhaps --- I've heard it mentioned often, though never read it myself.

<rlb>I'd guess these may be good: https://docs.scheme.org/

<rlb>I *know* the first one is good, but it may or may not be "to taste" for everyone.

<rlb>(was for me)

<adhoc>I dusted off my copy of SICP over the new years break, under TLS and The Reasoned Schemer..

<adhoc>ok

<adhoc>ACTION makes notes

<rlb>(and fwiw the jvm book I mentioned is "Java Concurrency in Practice" --- definitely recommend, though it might not cover the new, fully integrated virtual threads, which seem very likely to be interesting, but also shouldn't be hard to pick up afterward)

<rlb>(the train book)

<adhoc>ok

<adhoc>thanks rlb

<johnwcowan>rlb: Okay, i understand. The answer is that fdd7 isn't valid as part of a nonchar pair, only fdd8-fddf. I probably didn't spell that out.

<johnwcowan>ArneBab: the escape for actual nonchars is fffe.

<ArneBab>johnwcowan: what happens if escaped nonchars are in a file? Do they stay escaped non-chars after loading and saving?

<ArneBab>Or is an escaped non-char an error-condition?

<johnwcowan>Yes, they roundtrip. If there is an fffe ffff in a file, it becomes fffe fffe fffe ffff internally.

<ArneBab>nice -- thank you for clarifying!

<johnwcowan>Note that nonchar is an error recovery mode rather than an encoding. If a byte is read from a file that violates its encoding, it turns into a nonchar-pair internally.

<johnwcowan>But unlike other error modes, it affects output as well as input.

<johnwcowan>UTF-16 is a special case: an unpaired surrogate turns into two nonchar-pairs when read.

<rlb>johnwcowan: sure, I think I understood that. The situation I *think* was being proposed was one where somehow you have some sneaky/risky bytes represented as noncharacter sequences that are then unnoticed when they would have been noticed (or couldn't exist at all) without noncharacters. And then those bytes end up reconstituted on output. Though presumably this wouldn't happen on input for say risky ascii sequences, for an ascii compatible encoding (e.g. when decoding utf-8 or latin-1 or...).

<rlb>I presume it could happen if you were allowed to add noncharacters when building your own strings, whether directly, or via contrived conversions from bytevectors --- though perhaps all that falls under "don't do that then"? (And don't trust other code you shouldn't trust.)

<rlb>Of course the whole scenario is, I think, proposing that the data was already untrusted, and dutifully being checked, but now there's a new potential attack the validation code doesn't know about.

<rlb>johnwcowan: looks like you just disconnected, so don't know whether you saw the rest of that, if not, it'll be in the log, or I can resend if it's interesting.

<johnwcowan>if you inject nonchar pairs into strings, you can get stray bytes, but not ascii ones, because fdd0-fdd7 are never part of a nonchar pair. It's true that your code could inject fddc fde6 fdda fde3 into a string and write it to a utf-8 file and file, which would be interpreted by the reader as æ (which might be mad, bad, and dangerous to know). But yes, don't do that.

<johnwcowan>This idiot client disconnects me without warning, but it looks like I got everything.

<johnwcowan>s/file and file/file

<rlb>So is the implementation supposed to also prevent things like (string-set! str n NONCHARACTER)? I hadn't actually thought about the broader "rules" yet. I guess so?

<johnwcowan>no

<johnwcowan>it doesn't prevent you from writing corrupt files

<rlb>OK, so code could add the hidden bobby tables sequences to strings, and we're just back to don't do that then.

<johnwcowan>yes

<rlb>But that does mean that existing "validators" will now become incomplete, whenever noncharacters-decoded strings are involved?

<johnwcowan>after all, we don't prevent you from putting sql injections into strings

<rlb>right, but right now, you could have code that successfully prevents that for a given output, and now it might not?

<rlb>i.e. you were "completely" safe, and now there's a vector.

<rlb>i.e. if (display (remove-any-bobby-tables s) (port)) now won't work for all s, where it did previously.
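
The concern rlb raises here can be illustrated with Python's surrogateescape handler, which is an analogous mechanism and not the noncharacter scheme under discussion. In this sketch `remove_ae` is a hypothetical stand-in for `remove-any-bobby-tables`, and the "dangerous" text is just the letter æ, echoing johnwcowan's example. The validator inspects the string, finds nothing to remove, and the escaped bytes are then reconstituted on output.

```python
AE = "\u00e6"  # the letter ae, standing in for a "dangerous" sequence


def remove_ae(s: str) -> str:
    """Hypothetical validator: strip every occurrence of AE from the text."""
    return s.replace(AE, "")


# These two bytes are the UTF-8 encoding of AE.
raw = b"\xc3\xa6"

# Decoding as ASCII with surrogateescape hides each byte in a lone surrogate,
# so the resulting string does not contain AE as a character.
smuggled = raw.decode("ascii", errors="surrogateescape")

# The validator sees nothing to remove.
cleaned = remove_ae(smuggled)

# On output the escaped bytes are restored, and a UTF-8 reader sees AE again.
written = cleaned.encode("utf-8", errors="surrogateescape")
reread = written.decode("utf-8")

print(cleaned == smuggled)  # the validator changed nothing
print(reread == AE)         # the "removed" character is back
```

A validator that was complete for ordinary strings is therefore incomplete once escaped bytes can appear, unless it also rejects or inspects the escape characters themselves.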

<johnwcowan>well, you could have a mode in which nonchar-pair processing is not done on output

<rlb>The other thing we wondered about was a mode which raises an error for noncharacters that are actually not invalid in the target encoding --- would work for the (most common) "round trip" case I think?

<johnwcowan>indeed, it's unclear to me atm why you would want output processing done

<rlb>It's the main case I care about? i.e. round trips for getenv/setenv or readdir/open, or...?

<rlb>There all you care about is a way to carry the undecodable bytes transparently.

<rlb>e.g. I assume that's by far the main use, for most people/programs, of python's default surrogateescape arrangement.

<rlb>Though there they forbid "smuggling" ascii, I think, and I've wondered if it's to mitigate this risk by default, for common (ascii) situations...

<rlb>i.e. escaped ascii bytes are disallowed (I think)
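
For reference, the Python behaviour rlb is recalling can be checked directly. The sketch below uses only built-in `bytes.decode` and `str.encode` with the `surrogateescape` error handler; the file name is an invented example. It shows an undecodable byte surviving a round trip, and that a surrogate which would stand for an ASCII byte is refused on encoding.

```python
# A Latin-1 encoded name; the byte 0xE9 is not valid UTF-8 in this position.
raw = b"caf\xe9.txt"

# The undecodable byte is carried as the lone surrogate U+DCE9.
name = raw.decode("utf-8", errors="surrogateescape")
round_tripped = name.encode("utf-8", errors="surrogateescape")

print(round_tripped == raw)  # the original bytes come back unchanged

# Only U+DC80..U+DCFF (bytes 0x80..0xFF) are accepted by the handler.
# U+DC41 would correspond to ASCII "A", and encoding it is rejected.
try:
    "\udc41".encode("utf-8", errors="surrogateescape")
    ascii_smuggled = True
except UnicodeEncodeError:
    ascii_smuggled = False

print(ascii_smuggled)  # False: escaped ASCII bytes cannot be written out
```

This is consistent with the guess above that ASCII cannot be smuggled through the escape mechanism; whether that restriction was chosen specifically to mitigate the risk is not something the sketch can show.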

<rlb>In truth, that's the *only* case I care about right now.

<johnwcowan>yes, readdir/open is indeed a compelling case

<rlb>For files, I can just have my program open them in binary mode, etc.

<rlb>(and probably should)

<rlb>(where the encoding might be weird)

<rlb>But for argv, env vars, paths, users, groups, etc. I just can't write the program sanely at all right now.

<johnwcowan>unless you give up on strings altogether and allow only bytevectors

<rlb>atm there's no way at all to get the argv bytes without corruption (well other than running the whole process in latin-1)

<rlb>Right, that's what my whole interest in this discussion has been about --- what should guile do about "system data".

<rlb>And providing parallel bytevector support for all relevant operations is one path, another seems like it might be noncharacters, so I was trying to see what that would mean, and how it'd work.

<johnwcowan>well, i think nonchars are the safest option so far

<rlb>You'd need parallel support for a good bit, though I think.

<rlb>i.e. similar to python -- support for regex, "string" formatting, split, join, and many other "string" operations.

<rlb>(for bytevectors)

<rlb>the bytevectors route would of course completely side-step this question about the potential of introducing a risk to existing safe code.

<rlb>since it'd be universally opt-in

<rlb>But it has the distinct disadvantage of making it so you have to go out of your way to write a program that "just works normally" with all paths/envt-vars, etc. Whereas with python's approach (and noncharacters) naive programs "just work".

<rlb>and most people never have to think about any of this

<johnwcowan>you can make open polymorphic, but you have to write readdir-using code to expect either strings or bytevectors

<johnwcowan>which is very ugly

<rlb>You also have some functions where you have to do something else because there's no reasonable argument to use to "detect" the mode, e.g. getenv.

<rlb>The bytevectors route involves a *lot* more (any) api design/discussion. *And* you still have the situation that almost no one's going to write their programs to use (or even know to use) the bytevector api instead.

<johnwcowan>how is getenv different from readdir?

<rlb>Where any program that wants to work with all linux paths, actually must.

<rlb>Oh, right, it's not, but I'd assumed that functions would operate in modes, so that you don't have to have two codepaths everywhere, i.e. add support for (getenv var 'bytes) (readdir ... 'bytes) or getenv-bytes or... (e.g. api design/discussion).
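
Python, mentioned earlier as the comparison point, already exposes both shapes of "mode" API, which may help make the design question concrete. This is only an illustration of Python's standard `os` module, not a proposal for Guile's interface; the temporary directory and file name are invented for the example.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as d:
    # Create one ordinary file so there is something to list.
    open(os.path.join(d, "example.txt"), "w").close()

    # The argument type selects the mode: str in, str out ...
    as_text = os.listdir(d)
    # ... bytes in, bytes out, with no decoding attempted.
    as_bytes = os.listdir(os.fsencode(d))

print(as_text)
print(as_bytes)

# getenv has no path-like argument to dispatch on, so Python provides a
# separately named bytes variant, available only on some platforms.
if os.supports_bytes_environ:
    path_bytes = os.getenvb(b"PATH")
    print(type(path_bytes))

# The two views are linked by fsencode/fsdecode (surrogateescape on POSIX).
print(os.fsdecode(as_bytes[0]) == as_text[0])
```

The `listdir` case corresponds to the polymorphic style johnwcowan describes, and `getenvb` to the separately named `getenv-bytes` style.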

<johnwcowan>in that case, what does readdir-chars even do with an undecodable filename, not to mention that in Posix there's no way to even determine the encoding

<johnwcowan>You'd have to deprecate or remove readdir-chars

<rlb>For guile, I assume it would follow the current conversion-strategy, which at the moment defaults to 'substitute, but I think it ought to default to 'error (so corruption/loss isn't the default).

<rlb>And yes, that's a notable "issue" with the bytevectors approach --- any program that wants to support platforms where system data isn't a string (e.g. linux) should never use anything but the bytevector apis, but I suspect that'd be unlikely to happen, even more so if these apis are all guile-specific.

<rlb>Anyway, I was just trying to understand the situation/options reasonably well, but my current conclusion is that I'm not in a position to really be able to "do anything" substantial about all this with respect to guile. Before I could, "we" would need to have the spare time and attention to pick a path, and then settle all the details, and settlement of the noncharacters path would, I now assume, have to include these risk related questions.

<rlb>And appreciate you helping me understand what a noncharacters approach might involve a bit better.

<johnwcowan>Jeebus, all that explication one finger at a time, and now you're gonna bail out on me? :-)

<rlb>Oh, I'm happy to keep discussing it --- I was just mentioning where I think this topic may be for guile for now.

<johnwcowan>anyway, i was glad to explain

<johnwcowan>as usual it helped me understand better too

<rlb>The potential security-related questions just made it clear that there are even more notable "decisions to be made" than I'd thought.

<rlb>So unlike the utf-8 conversion, I probably can't (or shouldn't) just try to figure out a "reasonable" solution and implement it.

<johnwcowan>akchully i think you should implement as we've discussed here in an experimental branch and ask for review

<johnwcowan>but of course if you don't want to invest that much I understand

<rlb>Right --- depending on how much work that turns out to be, I might much rather have a discussion/decision first, also because that would mean that we're sure we're ready for any particular solution here.

<rlb>(With the utf-8 changes, I had the clear impression that's where we wanted to go now, though I don't know when there'll be enough time available for review, etc.)

<rlb>Oh, and also overall, I do still suspect that if you're only going to have one approach, noncharacters is likely preferable, assuming you can satisfactorily address any relevant risks, because of the fact that it allows existing/typical code to "just work".

<rlb>Even if you had to opt-in via envt var or something (when you "know" it's safe), perhaps still better than what we have now.

<johnwcowan>i think so. Nobody else has expressed interest in implementing it as yet.

<old>rlb: review is still on my todo .. I was sick all holidays and now I am focused on making BLUE ready for FOSDEM/Guix days :p

<rlb>old: oh, right -- good luck.


2026-01-21.log