IRC channel logs

<rlb>...wondering if we have a bug wrt %default-port-encoding --- make-custom-port clearly expects it to be a string since it calls string->symbol on it, and scm_i_set_default_port_encoding sets it to #f or a string, but scm_i_default_port_encoding returns a symbol, and I see other places where we compare scm_is_eq (pt->encoding, sym_UTF_8), etc.

<rlb>Hmm, or perhaps I'm just misunderstanding what's being used where...

<rlb>Think that was it --- %default-port-encoding is a string, but a port's actual encoding is a symbol.

<rlb>...

<dsmith>rlb, Sounds like you've already got it covered, but back-in-the-day I think Guile was very forgiving about strings and symbols. Or it might have been SCWM...

<rlb>I was mostly confused by the symmetric naming, i.e. set-foo get-foo with different types for the value.

<rlb>Not sure if I will, nor whether we'd want me to pursue it, but I have one flavor of preliminary noncharacters support working now on main --- only for getenv, and only for decoding. I'll likely add encoding next for the round-trip, and it'd be easy to extend to pwd, grp, etc. Haven't looked at how hard/scattered paths might be.

<rlb>At the moment, it adds a new category of data, sysdata, for paths, argv, users, groups, etc. and by default it matches current behavior by defaulting to %default-port-conversion-strategy and the current locale for the encoding, but you can change that via %sysdata-encoding and %sysdata-conversion-strategy. And you can also set GUILE_SYSDATA_CONVERSION_STRATEGY=noncharacters if you want to establish the default early (so it can affect

<rlb>argv, etc.).

<rlb>My current inclination is to think that ports should remain independent, and that in some future X/Y release, we might change the default port strategy to error, and the default sysdata strategy to noncharacter.

<rlb>But really just toying with it.

<rlb>(And not sure it's the approach we'd want.)

<rlb>dpk: ^

<dsmith>rlb, Sounds promising

<rlb>I think it'd probably "work fine" mechanically, but there are notable questions about semantics, and we'd need to think about the potential effects of exposing code that might not be expecting them to strings with "noncharacters".

<rlb> And as suggested earlier (by old?), we'd likely also want to think about the potential for "smuggling" --- could be that (at least) we'd want to forbid encoding the ascii range (no reason you should need to).

<rlb>(well, perhaps not "no reason")

<rlb>we'd need to think about it

<rlb>e.g. if I'm thinking straight, I suppose unless all your target encodings are ascii compatible, then you might well need to?

<lloda>re https://issues.guix.gnu.org/72347 i think we should deprecate list-index

<lloda>rename it to something internal sounding if we can't altogether hide it

<lloda>we should collect a few of these breakages and get them ready for 3.2 or 4.0 or whatever may be

<old>hello. Tired of not knowing where some bad-syntax happened while compiling?

<old>Well, I've got something that fixes that, which you can put in your .guile. Now when you compile a buffer in Geiser (evaluating does not work), you will get a proper line report for bad lambda syntax, for example

<old> https://paste.sr.ht/~old/dd6f1df1adad855150a0f411253f184d9c414895

<old>that ought to fix it and not pollute the namespace of guile-module

<old>there are probably other core syntax transformers that could be fixed that way, I'll have to check

<old>If someone has an idea to fix bad form like: ()

<old>would be neat

<old>does somebody know what the last argument of an unbound-variable is supposed to be? It's always #f

<johnwcowan>rlb: You are aware that noncharacters are in fact characters? They just don't have glyphs, not even zero-width glyphs, and aren't meant to be used in interchange.

<rlb>johnwcowan: with respect to which thing(s) I said?

<rlb>(And yes, I *think* I understand that.)

<johnwcowan>Code should be able to handle all Unicode characters, or at least not puke on them even if it cannot assign meanings to them. In any case you would only represent loose surrogates as nonchar-pairs, so never ASCII.

<rlb>I believe one question was whether this raises any new security concerns, i.e. if you had existing code looking for "rm -rf ..." could this allow those bytes to bypass the filter and end up somewhere "bad". More or less a traditional escaping/validation question.

<rlb>And if so, would a guile (scheme) adopting noncharacters for system data need additional mechanics/constraints/semantics.

<rlb>(And perhaps of course less so, if it worked that way "from the start".)

<johnwcowan>No, because a UTF+nonchar encoding can't ever produce ASCII.

<johnwcowan>(what's Guile's internal representation?)

<rlb>I think I need to understand that --- i.e. (if I got this right) if I create a string with \xFDD7;\xFDE2 (i.e. 0x72 -> 'r') will that not produce an ascii r in the output when encoded back to the "outside"?

<rlb>(Also I did that off the top of my head, may have remembered the encoding wrong.)
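
A minimal Python sketch of the mapping rlb's example implies, assuming each raw byte becomes a pair of noncharacters, with the high nibble offset from U+FDD0 and the low nibble offset from U+FDE0. That layout is inferred only from the single \xFDD7;\xFDE2 example (which rlb says may be misremembered). The function names are hypothetical, and this is not Guile's implementation.

```python
# Assumed layout, inferred from the \xFDD7;\xFDE2 -> 0x72 example in the log.
HIGH_BASE = 0xFDD0  # assumption: U+FDD0..U+FDDF carry the high nibble
LOW_BASE = 0xFDE0   # assumption: U+FDE0..U+FDEF carry the low nibble


def byte_to_nonchars(byte):
    """Represent one raw byte as a pair of noncharacters (hypothetical helper)."""
    if not 0 <= byte <= 0xFF:
        raise ValueError("not a byte: %r" % (byte,))
    return chr(HIGH_BASE + (byte >> 4)) + chr(LOW_BASE + (byte & 0x0F))


def nonchars_to_byte(pair):
    """Recover the raw byte from a noncharacter pair (hypothetical helper)."""
    if len(pair) != 2:
        raise ValueError("expected exactly two code points")
    high = ord(pair[0]) - HIGH_BASE
    low = ord(pair[1]) - LOW_BASE
    if not (0 <= high <= 0xF and 0 <= low <= 0xF):
        raise ValueError("not a noncharacter pair in the assumed layout")
    return (high << 4) | low
```

Under this assumed layout the pair from the example decodes to 0x72, so an encoder that writes the recovered byte straight out would emit an ASCII r, which is the case rlb is asking about.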

<rlb>And if you're asking what guile uses for strings internally, currently either latin-1 or utf-32, but it might eventually switch to utf-8.

<ArneBab>rlb: do you mean, what if I write non-characters into a file, read the file, and write it out again?

<rlb>Sure, or a command line arg, or env var or... btw, I'm not at all sure how much, or if, this is a real concern yet.

<rlb>I suppose one case to imagine might be some command that takes a string as an argument, parses and "vets it" somehow and then if it's ok, runs it in a shell.

<rlb>Let's also assume that it's specifically looking for "rm -rf". If noncharacters could allow the "rm -rf" to end up as "\xFF..." noncharacters, then the existing code wouldn't find that string, and would pass it on. If that then got "reconstituted" (reencoded) as "rm -rf" on the way out to the shell...

<rlb>Hmm, I suppose if you were to always throw an error if you hit noncharacters that actually represented a valid sequence in the output encoding when you were encoding "on the way out", you would prevent this particular kind of smuggling.

<ArneBab>yes, that’s the option I’d see. Otherwise actual non-characters would have to be escaped. They should be *extremely* rare, so an expensive escape might be OK, but that could impose a runtime cost on everything.

<ArneBab>I guess that’s a case where we’d be better served with throwing an error and if people really want to get that data, read it as a raw bytevector.
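
The "throw an error on the way out" idea from rlb and ArneBab can be sketched in Python, using Python's surrogateescape error handler as a stand-in for a noncharacters strategy (an assumption; Guile's prototype may work differently). The check here is a round trip: if re-decoding the encoded bytes does not give back the same string, some escaped bytes combined into a valid sequence in the output encoding, so the encode is refused. The function names are hypothetical.

```python
def decode_sysdata(data, encoding="utf-8"):
    """Decode bytes, escaping undecodable bytes instead of failing.

    surrogateescape stands in for a noncharacters strategy (assumption).
    """
    return data.decode(encoding, errors="surrogateescape")


def encode_sysdata_strict(text, encoding="utf-8"):
    """Encode text, refusing escapes that would reconstitute valid text."""
    data = text.encode(encoding, errors="surrogateescape")
    # If the escapes spell out a valid sequence, decoding the result
    # yields ordinary characters rather than the original escapes.
    if data.decode(encoding, errors="surrogateescape") != text:
        raise ValueError("escaped bytes form a valid sequence in the output encoding")
    return data
```

With this sketch a genuinely undecodable byte such as 0xFF survives the round trip, while a string whose escapes spell out the UTF-8 bytes for é (0xC3 0xA9) is rejected when encoding.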

<ArneBab>If you’d want to serialize the current state of Guile as a string and read that later …

<rlb>(iirc python just special cases ASCII here, i.e. doesn't allow the ascii range to be smuggled.)
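
For reference, a short Python sketch of the behavior rlb recalls, using the standard surrogateescape error handler (PEP 383). It only escapes bytes 0x80..0xFF when decoding, and when encoding it refuses an escape that would map to a byte in the ASCII range. The helper name is hypothetical.

```python
raw = b"name-\xff"

# The undecodable byte 0xFF becomes the lone surrogate U+DCFF.
text = raw.decode("utf-8", errors="surrogateescape")

# Encoding with the same handler restores the original bytes.
back = text.encode("utf-8", errors="surrogateescape")


def escape_round_trips(byte_value):
    """Return True if the escape for byte_value encodes back to that byte."""
    escaped = chr(0xDC00 + byte_value)
    try:
        encoded = escaped.encode("utf-8", errors="surrogateescape")
    except UnicodeEncodeError:
        # The handler only accepts U+DC80..U+DCFF, so ASCII cannot be smuggled.
        return False
    return encoded == bytes([byte_value])
```

Here escape_round_trips(0xFF) is true, while escape_round_trips(0x72) is false because the handler raises instead of emitting an ASCII r.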

<rlb>bytevectors would be more conservative, and leave the responsibility completely with the "user" --- but as discussed, then you'd still need a full set of compatible functions (e.g. regex, etc.).

<rlb>...for my purposes, bytevectors would be fine too, assuming sufficient (new) support along those lines (i.e. splitting, joining, regex, etc.).

<ArneBab>I’d just say noncharacters get the functions, bytevectors are only a special tool to handle non-characters, so they don’t need the convenience.

<ArneBab>What could help with that: a function that just identifies the indices of non-characters in bytevectors, so you can split them by non-characters and parse the rest as strings.
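
A rough Python sketch of the helper ArneBab describes, reading "non-characters in bytevectors" as byte ranges that do not decode in the chosen encoding (an assumption). Both function names are hypothetical. The sketch re-decodes the remaining tail after every bad span, so it favors clarity over speed.

```python
def undecodable_spans(data, encoding="utf-8"):
    """Return (start, end) byte index pairs for spans that fail to decode."""
    spans = []
    pos = 0
    while pos < len(data):
        try:
            data[pos:].decode(encoding)
            break  # the rest decodes cleanly
        except UnicodeDecodeError as err:
            # err.start and err.end are relative to the slice being decoded.
            spans.append((pos + err.start, pos + err.end))
            pos += err.end
    return spans


def split_decodable(data, encoding="utf-8"):
    """Split data at the undecodable spans and decode the pieces in between."""
    pieces = []
    prev = 0
    for start, end in undecodable_spans(data, encoding):
        pieces.append(data[prev:start].decode(encoding))
        prev = end
    pieces.append(data[prev:].decode(encoding))
    return pieces
```

For example, split_decodable(b"ab\xffcd") drops the 0xFF byte and returns the two decoded pieces around it, leaving the caller to decide what to do with the reported span.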

2026-01-20.log