IRC channel logs

<old>rlb: `make -j' yes

<old>oh no, I think srfi-64 is more broken than I thought

<old>skipped tests used to count toward the overall number of tests run

<old>now it does not ..

<old>so this used to be okay: (test-begin "foo" 1) (test-skip "bar") (test-assert "bar" #t) (test-end "foo")

<old>but not anymore ..

<rlb>old: I wondered if a trivial "guile-tools.1: guild.1" in doc/Makefile.am might fix the problem, but I haven't been able to test/reproduce that yet --- and I might still just change guile-tools.1 from the git symlink to instead be generated by a "$(LN) ..." rule.

<sneek>mwette: Greetings :)

<mwette>sneek: botsnack

<sneek>:)

<mwette>rlb, old: Could the issue be the link guile-tools.1 is in the src tree but not build tree?

<old>the symlink is indeed only in the source tree

<old>instead of adding guile-tools.1 in source, it could be generated in build and it would work fine I guess

<old>this ought to work: https://paste.sr.ht/~old/c2904ff072e521463509e1b27cc382572d1dd8af

<old>rlb: you can use this directly if you want to make the fix instead of going through the PR

<graywolf>Hi :) What is the story with bugs and patches in debbugs? I know Guile moved to Codeberg, but dunno whether there was bankruptcy declared on debbugs or whether I should keep the patches / bug reports open.

<ArneBab>I hope that we’ll keep the debbugs bugs and patches. And resolve them step by step (or retire them when they are fixed).

<sneek>mwette: wb

<mwette>sneek: botsnack

<sneek>:)

<rlb>old: right, that's what I meant earlier wrt "$(LN) ...".

<rlb>mwette: agreed, but then I wondered why it was happening here in a normal source tree (well happened, but now I can't reproduce it).

<rlb>In any case, I'll probably just switch to the explicit rule and forget about it :)

<rlb>old: I remembered one notable, related concern with respect to noncharacters. I *think* that in order to be acceptable, there would need to be some way to always get the encoding used to decode a given string. Not a problem if it's determined at startup and can never change, but otherwise, you need to know that in order to correctly/safely return the string to the original bytes.

<rlb>And so, if the encoding can change, say via a fluid, then interfaces like (program-arguments) won't work "as is" because someone might have changed the fluid before you were called.

<old>Right. but thinking about it again

<old>Do we really need to support this kind of string in general?

<old>Even bash seems to not support that

<old>If we really want, we might want to check how modern systems languages did it. For example, Rust has the OsString type

<rlb>Also, security-related, if you know your vetting needs to detect/escape/transform/elide certain ASCII characters (which is probably by far the norm), then you can't safely do that unless you know that the encoding used to decode was an ascii superset -- *if* it's ascii bytes in the outside world that are the relevant bit.
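
A minimal Python sketch of the ASCII-superset point above, not taken from the log. It assumes the vetting is a byte-level search for the path separator; the helper name `contains_slash_byte` is hypothetical. The same character yields no `/` byte in UTF-8 but two of them in UTF-16-LE, so a byte-level check is only meaningful when the encoding is known to be an ASCII superset.

```python
# Hypothetical helper: a byte-level check for the ASCII path separator.
def contains_slash_byte(raw: bytes) -> bool:
    return b"/" in raw

# U+2F2F (a Kangxi radical) contains no "/" character as text.
text = "\u2f2f"

# UTF-8 is an ASCII superset: byte 0x2F appears only where "/" was encoded.
utf8_bytes = text.encode("utf-8")

# UTF-16-LE is not an ASCII superset: this character encodes as 0x2F 0x2F.
utf16_bytes = text.encode("utf-16-le")

print("/" in text)                       # is there a slash in the text itself?
print(contains_slash_byte(utf8_bytes))   # byte-level check on UTF-8 output
print(contains_slash_byte(utf16_bytes))  # byte-level check on UTF-16-LE output
```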

<rlb>old: what do you mean --- bash supports "anything", except null or / for paths, for example.

<Arsen>assuming one runtime encoding at startup is a recipe for breakage

<old>rlb: do $ pwd

<rlb>Not sure I follow.

<old>hold on

<rlb>You can refer to any byte sequence in bash via "touch $'\xff\xee\xaa'" etc.

<rlb>ACTION has some very strange paths laying around here in dirs via bup testing...

<old> https://paste.sr.ht/~old/66af4406652e52e75199dea9e09ce418fd7cd782

<old>I meant that the builtin command `pwd` of bash will just replace the unknown byte with ?

<Arsen>/tmp/foo$ pwd | xxd

<Arsen>00000000: 2f74 6d70 2f66 6f6f b50a /tmp/foo..

<old>but sure you can always reference the whole directory if needed

<Arsen>can't replicate

<rlb>Sure, but then I'd say that just means that either bash isn't a general purpose programming language, or pwd isn't suitable :)

<rlb>I'm assuming we have the higher bar for guile, but if not, then none of this matters.

<old>but it's still a system tool written in a system language. My point is that supporting non-UTF-8 OS strings is only really a matter for system programming in general

<old>But it's also fine to be pedantic about it and just say, use utf-8 on your system, we are not the Linux kernel

<old>To me the whole surrogate escape looks more like an ad hoc hack that will bite us in the future

<rlb>I guess it depends on what you mean by "system programming". It's an issue for any program that wants to be able to open and read "any file"...

<rlb>(without crashing or otherwise failing)

<rlb>I feel like in that sense, it affects nearly every program.

<old>it certainly affects most programs outside of Guile also, but is this in general a real problem?

<rlb>The OsString path (or something like it) is also feasible, but then, as mentioned, we're going to have a *lot* more work, I think.

<rlb>(maybe not, though, just my current impression)

<old>Not sure. Maybe we can get around quite easily actually

<rlb>Dunno, in the python world, I think it was one of the more significant problems with the 2/3 transition, and at first they did appear to try to say "it's not an important issue, just fix your data", but eventually, and with difficulty, relented, adding hybrid bytes/string interfaces in a lot of places, and then finally adding surrogateescape to handle "everything".
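
For reference, a short sketch of what Python's `surrogateescape` error handler does. It uses the byte sequence from the `xxd` output earlier in the log as input; the variable names are illustrative only. A lossy decode (similar in spirit to printing `?`) cannot be reversed, while the escaped form can.

```python
# Bytes of a directory name that is not valid UTF-8 (0xB5 is a stray byte).
raw = b"/tmp/foo\xb5"

# Lossy decoding: the bad byte becomes U+FFFD and the original value is gone.
lossy = raw.decode("utf-8", errors="replace")

# surrogateescape maps each undecodable byte 0xNN to the lone surrogate U+DCNN.
escaped = raw.decode("utf-8", errors="surrogateescape")

# Encoding with the same codec and handler restores the original bytes.
restored = escaped.encode("utf-8", errors="surrogateescape")

print(restored == raw)               # the escaped form round-trips
print(lossy.encode("utf-8") == raw)  # the lossy form does not
```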

<rlb>Also, this mostly isn't a problem until it is, and I'd guess that when it happens, it's most likely with non-western data (most often legacy, but I assume there's still a lot of it), e.g. various Japanese, Korean, and other Asian encodings for paths, but Latin-1 and other "codepages" will do it too.

<rlb>(It's more "common" for me because I work on an archival tool.)

<rlb>old: if we were to go the bytevectors/os-string route, the lot of work I meant was of course wrt all the support you'd need to add for "equivalent" operations (globbing, regex, split/join, perhaps even parsing?).

<rlb>old: suppose in the end "someone"s needs to decide what guile's requirements are --- e.g. should most programs "just work" with any path for the common cases, or is it sufficient for it to just be *possible* to write a program that works, perhaps with nontrivial extra effort, bespoke algorithms, etc., depending on what you need to do with the paths.

<rlb>"someone"(s)

<old>I can't believe UTF-8 simply does not have a standard way of escaping byte

<old>in a way, the surrogate escape is literally using a restriction in the standard to overcome a limitation in it

<rlb>utf-8 doesn't, I don't think, and if it did, we'd have all the same questions wrt security, etc. I think.

<old>Sure. and what is the implication in performance?

<old>for example, an internal utf-8 string in Guile needs to be converted back to its byte representation

<rlb>And to some extent the security questions apply regardless --- if you're just given "raw bytes", you can't vet those without a lot more context, i.e. what's the input encoding, and the output encoding, and what's the context of the eventual use, etc.

<old>I guess a flag can be kept to mark the string as "dirty", meaning it needs to be converted back instead of just taking the underlying bytevector?

<rlb>You still also need to know what encoding was used to decode the bytes on "ingestion".

<rlb>Otherwise you can't properly reverse the process.

<rlb>That's what I was mentioning at first today.
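
A minimal Python sketch of why the ingestion encoding has to travel with the string. `DecodedText` is a hypothetical type invented for illustration; nothing like it is proposed in the log. It assumes Latin-1 input and shows that re-encoding with a different, assumed encoding silently produces different bytes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DecodedText:
    """Hypothetical pairing of decoded text with the encoding used to decode it."""
    text: str
    encoding: str

    @classmethod
    def ingest(cls, raw: bytes, encoding: str) -> "DecodedText":
        # Remember which encoding produced this text.
        return cls(raw.decode(encoding, errors="surrogateescape"), encoding)

    def to_bytes(self) -> bytes:
        # Reverse the ingestion with the remembered encoding.
        return self.text.encode(self.encoding, errors="surrogateescape")


raw = b"caf\xe9"  # "café" encoded as Latin-1
tagged = DecodedText.ingest(raw, "latin-1")

# Using the remembered encoding restores the original bytes.
same = tagged.to_bytes()

# Forgetting it and assuming UTF-8 yields different bytes, with no error raised.
wrong = tagged.text.encode("utf-8")

print(same == raw, wrong == raw)
```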

<old>yeah right

<rlb>But the point is that even if we went with a "bytevectors only" approach for system data, you still have to know (in most cases) the answers to questions isomorphic with encoding/decoding.

<old>so for example, a string that is coming from a port with encoding X needs to be encoded back using X with the surrogate-escape ?

<rlb>e.g. for linux, for paths, you have to know that it's only ascii byte '/' and '\0' that matter, no matter what else is going on.

<rlb>(at least on "output")
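
The "only `/` and `\0` matter" rule can be sketched directly on bytes. This is an illustrative Python example; `split_path_bytes` is a hypothetical helper, and the sample path reuses byte sequences mentioned earlier in the log. It treats every other byte as opaque, so no encoding needs to be known.

```python
def split_path_bytes(raw: bytes) -> list[bytes]:
    """Split a Linux path into components without decoding it."""
    if b"\0" in raw:
        raise ValueError("NUL byte is not allowed in a path")
    # Every byte other than 0x2F is opaque; empty parts come from repeated slashes.
    return [part for part in raw.split(b"/") if part]


components = split_path_bytes(b"/tmp/\xff\xee\xaa/foo\xb5")
print(components)
```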

<old>really wish we could just reset everything to utf-8 and only this encoding exist

<old>would make things much simpler

<rlb>ports are harder, potentially, i.e. if it's a network socket, speaking http, then that has a whole mess of its own rules.

<old>http accepts encodings other than utf-8?

<rlb>worst example i can think of offhand is "email" ;)

<rlb>(it allows mixed messes wrt encoding iirc, e.g. header values, etc.)

<rlb>But wrt ports, if it's an arbitrary file (say a file in the fs), then without further information, the only thing you can do might be to "safely round trip" the bytes by default. Of course, if that's all you want, then you could, and perhaps should, just use a binary port, unless you "know more", e.g. that it's at least an ascii superset, or "something".

<rlb>In that latter case, something like noncharacters might make it a lot easier to do what you need to do, as compared to handling all the data as random bytevectors (until/unless bytevectors are more functionally equivalent to strings wrt available apis, etc.).

<rlb>...dunno, but as mentioned, unless someone becomes interested in, and is in a position to help deliberate and decide this for guile (ideally, understanding it better than I do), I should probably just let it go again. I've (obviously) been a bit stubborn because I'd really like to have guile as an option.

<rlb>old: doc hopefully fixed in main.

<rlb>thx for the help

<rlb>(...be nice if codeberg eventually allowed you to reply via email to discussions like gh does.)

<mwette>I'm writing an effects analysis for the SXML output of one of my parsers. Inspired by guile's cps I'm translating to a vectorized version VXML and will iterate on that. The routine sxml->vxml translates (foo "abc" (bar (baz "def") "fgh")) to #(#(foo "abc" 1) #(bar 2 "fgh") #(baz "def")). Hoping it works out ...
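
A rough Python sketch of the flattening described above, inferred only from the single example in the message. mwette's actual `sxml->vxml` is a Guile routine and may differ. It assumes elements are nested lists, that nodes are numbered in pre-order, and that each nested element is replaced by its index in the flat table; the name `sxml_to_vxml` is hypothetical.

```python
def sxml_to_vxml(tree: list) -> list:
    """Flatten a nested element tree into a table of rows indexed in pre-order."""
    nodes = []

    def visit(node: list) -> int:
        index = len(nodes)
        nodes.append(None)  # reserve the slot so children get later indices
        row = [node[0]]     # the element tag
        for child in node[1:]:
            if isinstance(child, list):
                row.append(visit(child))  # nested element: store its index
            else:
                row.append(child)         # text: keep as-is
        nodes[index] = row
        return index

    visit(tree)
    return nodes


# (foo "abc" (bar (baz "def") "fgh")) written as nested Python lists.
vxml = sxml_to_vxml(["foo", "abc", ["bar", ["baz", "def"], "fgh"]])
print(vxml)
```

In this sketch an integer text child would be indistinguishable from a node index, so a real implementation would need some way to tell the two apart.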

2025-12-15.log