IRC channel logs

2025-12-16.log

<mwette>In effects analysis, why is meet a union? In lattices on sets, meet is the intersection (inf) and join is the union (sup). ... I think the intent is the sup of the property, right?
<cow_2001>i am confused. https://codeberg.org/guile/guile/issues/72
<ArneBab>rlb: I think for Guile the requirement is "do not break existing programs that currently work correctly" and the goal is "be as good and easy to use as possible". But this is just one person’s opinion.
<ArneBab>rlb: I would prefer doing the correct thing by default and having options to go for a less convenient fast-path where really needed.
<ArneBab>rlb: would noncharacters enable opening any binary file as text (though with hefty overhead) and write it back correctly?
<ArneBab>rlb: and aside: I like it that you’re stubborn in this! there’s stuff that should really be simpler. Will you be at FOSDEM? (I likely won’t be, doesn’t work with family these years, but I would expect that you can meet wingo and Ludovic there and lead these discussions with enough bandwidth to actually get to decisions)
<rlb>ArneBab: yes wrt a binary file as text, and I suspect that the cost would be fairly small cpu-wise, but up to double the memory footprint for the resulting data because each undecodable byte would become two characters, and those characters are (I think?) about 3 bytes in utf-8. Though I assume/hope that opening a real binary file as text would be rare.
<rlb>If you're actually opening the file to read it as text, then you probably want 'error and some (even simple) error handling. e.g. "This file does not appear to be readable with ENCODING, perhaps it's something else?"
<rlb>...and/or there are libraries to detect/guess encodings.
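
A rough Python analogue of the 'error-plus-handling idea above (the helper name and the wording of the message are placeholders, not anything from Guile):

    # Read a file strictly as UTF-8 and report a decoding failure,
    # rather than silently accepting undecodable bytes.
    def read_text_or_complain(path, encoding="utf-8"):
        try:
            with open(path, encoding=encoding, errors="strict") as f:
                return f.read()
        except UnicodeDecodeError as e:
            raise SystemExit(
                f"{path} does not appear to be readable as {encoding} "
                f"(bad byte at offset {e.start}); perhaps it's something else?")
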
<old>I'm starting to think that this encoding scheme is maybe not what we want. I need to think more about it and see where it needs to be used with OS interactions
<old>but it seems to me that an application that needs to support non-utf-8 environment variables or filenames should work with bytevectors instead
<rlb>...and I suppose my position is "that's every Linux application", in general (at least any that needs to accept any path as a command line argument)..
<rlb>But I completely agree that "it's complicated", and will require careful thought, work, and a consensus.
<rlb>I think I'm likely at this point to let it go for now again --- i.e. let it percolate. And something might always change, e.g. rnrs might finish something reasonable, and then we'd have an incentive to head that direction.
<ArneBab>rlb: if opening a real binary file as text just works, that could be the default. Then you wouldn’t have to do anything extra to work with textual data, and for binary data you could switch to actual bytevectors as an optimization.
<dsmith>My $.02 is I think the Rust folk did a pretty good job here. All internal String and &str are utf8, and then there are OsString, OsStr, and Path, which are os-specific.
<rlb>Sure, though it would of course depend on what you were doing, i.e. if it's OK that the file you expected to be utf-8 was in fact a png, or had one embedded right in the middle (corruption). Whether you'd want that to possibly go unnoticed --- which I think (in general) relates to old's security concerns.
<dsmith>Things sure were simpler when everything was just ascii
<ArneBab>dsmith: sounds like what Python3 does: bytes for filenames and paths, utf8 elsewhere. But its migration story from Python2 was awful (we should not copy that).
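
For reference, the bytes flavour Python ended up with for filesystem APIs looks like this (the directory name is arbitrary):

    import os

    os.listdir(".")    # list of str; undecodable bytes are smuggled in via surrogateescape
    os.listdir(b".")   # list of bytes; the raw names, no decoding at all
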
<rlb>dsmith: yes, that could be another reasonable option --- it has a lot in common with the "just use bytevectors" approach, including that it may require a lot more work (though that's based on a guess about what a minimal noncharacters-ish approach might take).
<ArneBab>rlb: I think I missed your answer (if you wrote it): do you plan to be at FOSDEM?
<rlb>ArneBab: python now does both, though I think the only pervasive implementation is surrogateescape (i.e. noncharacters-ish) --- they never finished extending bytes to every api (I think, once they came up with and settled on surrogateescape).
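
A minimal sketch of what surrogateescape does (the byte string is made up; os.fsdecode/os.fsencode wrap the same mechanism on POSIX under a UTF-8 locale):

    raw = b"caf\xe9"                               # not valid UTF-8
    text = raw.decode("utf-8", "surrogateescape")  # 'caf\udce9': the bad byte becomes a lone surrogate
    assert text.encode("utf-8", "surrogateescape") == raw  # lossless round-trip
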
<ArneBab>arg …
<rlb>Not sure wrt FOSDEM, atm, suspect not.
<rlb>dsmith: for the bytevectors or os-string approaches, we'll either need the string algorithms to be able to handle an additional flavor (utf-8, ascii, system-data), or we'll need a full set of whatever functions we want (which presumably includes much of what we have for strings) for that new type.
<ArneBab>rlb: then a telecon may be a good idea
<rlb>My guess is that in that world we might even end up wanting to see if we can unify bytevectors and strings at some level, as python does.
<rlb>i.e. ':'.join(strings) and b':'.join(bytess) both work.
<rlb>But that also raises all kinds of questions about the apis, i.e. what's a binary "char" for those apis for us.
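
Python's answer to that particular question, for what it's worth: indexing bytes yields integers, and you slice to get a one-byte object:

    "abc"[0]     # 'a'  -- a length-1 str
    b"abc"[0]    # 97   -- an int, not b'a'
    b"abc"[0:1]  # b'a' -- slicing gives a length-1 bytes object
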
<ArneBab>I’ve worked with that in Python for Mercurial plugins, and it’s not really nice -- it doesn’t help you create clean boundaries, it rather pushes you toward local fixes …
<ArneBab>(that’s a point where explicit, static typing could actually help: "tell me where I violate that, so I can push the boundary to where it should be")
<rlb>ArneBab: sure, though another reason I think to set this aside for now, even though I've stirred it back up (when I learned about the noncharacters proposal and was intrigued), is that unless we end up being able to do something "easy", deciding this (and/or reviewing whatever we do) is I suspect very likely to require notable time from Ludovic and/or Andy, and I'm not sure that's feasible atm.
<ArneBab>sounds good, yes.
<ArneBab>Thank you again for thinking about this!
<rlb>ArneBab: right -- the happy path in python right now is to just rely on surrogateescape unless you have more specific needs, which also means that naive applications should "just work" much of the time.
<rlb>i.e. you can write an (overly naive) "cp" in python that works for all paths, without having to know any of this.
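
One way such an overly naive "cp" could look (a sketch; it copies file contents and nothing else):

    #!/usr/bin/env python3
    # Works even for paths that aren't valid UTF-8: sys.argv is decoded with
    # surrogateescape, and the same bytes are re-encoded when the paths hit
    # the filesystem.
    import shutil, sys

    if len(sys.argv) != 3:
        raise SystemExit("usage: cp.py SRC DST")
    shutil.copyfile(sys.argv[1], sys.argv[2])
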
<rlb>(metadata is another question entirely ;) )
<rlb>(attrs, xattrs, acls, high resolution timestamps, ...)
<dsmith>Ugh. Timestamps..
<rlb>(It's better now, but we had to have custom C code for the timestamp handling in bup -- not sure if we still need it.)
<rlb>across linux/*bsd/darwin, etc.
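
For the high-resolution part, Python now exposes the nanosecond timestamp fields directly (the paths here are placeholders):

    import os

    st = os.stat("src")
    # Copy atime/mtime at nanosecond resolution; the ns= keyword avoids the
    # float rounding you would get from st_atime/st_mtime.
    os.utime("dst", ns=(st.st_atime_ns, st.st_mtime_ns))
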
<old>rlb: How would the surrogate work on Windows?
<rlb>I do think that having complete support for bytevectors could be "nice to have" even if we went with something like noncharacters, but I suspect it might not be worth the extra work/complexity given our resources.
<old>AFAIK, Windows uses UTF-16 for its strings. Would internal strings in Guile also be UTF-16 on that platform?
<old>wrt resources, we can work our way one API at a time
<old>start with `getenv` and `putenv`
<old>then `scandir`, etc.
<old>no need to merge all at once with full support of bytevectors
<rlb>Sure, but I suspect it's a lot of extra code, and *should* also require a lot of extra tests.
<old>Is it though? Sure, for the tests
<old>I think the first step is to get utf-8 merged anyway
<rlb>Again, I think it might be nice, but given how rarely in the common cases you're likely to pay for anything with noncharacters, I'm not sure it's worth it --- but really just a guess atm.
<old>and then we can settle on that and start really thinking how we want to tackle that issue
<rlb>I have no idea what we plan for windows wrt utf-8. I guess I've assumed that we're switching to utf-8 everywhere (internally).
<rlb>yes, utf-8 seems much more certain, and we have at least one somewhat functional attempt :)
<old>Then all utf-16 strings will get converted to valid utf-8; that's okay
<old>it will just be slower on Windows due to native string conversion to utf-8
<old>but hey, crappy OS, that's what you get I suppose
<rlb>Another fun topic is that even if your paths (say) were required to be proper unicode strings, you still have to deal with "normalization", i.e. more than one byte sequence in utf-8 encodes föo (because the middle character could be a single unicode code point, or could be a sequence of "o" followed by a "dots over previous char" combining character).
<rlb>And only one of those encodings will actually work wrt what you actually have for that file in your filesystem in a given directory if you try to open() it.
<rlb>i.e. "never normalize system data"
<rlb>cf. https://en.wikipedia.org/wiki/Unicode_equivalence
<rlb>But of course you may *need* to normalize for say string=, depending on what you're trying to ask.
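
Concretely, with the föo example (NFC is the precomposed form, NFD the decomposed one):

    import unicodedata

    nfc = "f\u00f6o"    # 'föo' with a precomposed ö (U+00F6)
    nfd = "fo\u0308o"   # 'föo' as 'o' + combining diaeresis (U+0308)

    nfc == nfd                                # False: different code points, different bytes
    unicodedata.normalize("NFC", nfd) == nfc  # True: equal once normalized
    # Only one of these byte sequences matches what a directory entry actually
    # stores, hence "never normalize system data".
    nfc.encode("utf-8"), nfd.encode("utf-8")  # (b'f\xc3\xb6o', b'fo\xcc\x88o')
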
<ArneBab>rlb: I think for windows the first step is to actually work there again. And then make sure not to break lilypond. That should already limit the option space sufficiently to take decisions.
<rlb>For at least utf-8, that should be transparent with respect to windows-specific issues. i.e. no windows-specific backward compat questions.
<rlb>(afaik)
<ArneBab>rlb: did you do the tests on Debian boxes with PR 22 (windows cross-compilation) ⇒ do you know whether there’s any ABI change introduced?p
<ArneBab>(sorry for the p, on my keyboard p is next to return …)
<rlb>I don't recall exactly what I did --- I think maybe I did test that branch "somewhere", but we'd need to do it again to be sure if/when we're close(r).
<rlb>This might become interesting, assuming there's some way I could do it while still making it absolutely clear that the results are not "for use", i.e. if I could use it to run things through the buildds more easily/automatically: https://www.freexian.com/blog/debusine-repositories-beta/
<ArneBab>rlb: did you try the abidiff?
<rlb>I did for the release, but not sure what I checked wrt that pr.
<ArneBab>(and I think PR 22 is done -- we’re scope-creeping it)
<dsmith>And there is still time before 3.0.12 to tune things up if needed.
<sneek>Welcome back dsmith :)
<dsmith>sneek, botsnack
<sneek>:)
<dsmith>!uptime
<sneek>I've been faithfully serving for 2 months
<sneek>This system has been up 9 weeks, 4 days, 23 hours, 8 minutes