IRC channel logs

<graywolf>Hi, when defining a module, how do #:export and #:export-syntax differ? The latter does not seem to be documented.

<civodul>graywolf: hi! #:export-syntax is a remnant of the past, don’t use it :-)

<graywolf>Ah ok :) I saw it in ice-9/and-let-star.scm and wondered what that is about

<old>why does it take two Ctrl-C in the REPL to get the interrupt printed?

<old>(when waiting at prompt)

<dthompson>rlb: this article about chicken 6 has something to say regarding strings from the OS that may not be valid utf-8. thought you might like reading it. https://www.more-magic.net/posts/chicken-6.html

<Arsen>i fear it is faulty to assume a kernel will return UTF-8 :(

<dpk> https://codeberg.org/scheme/r7rs/wiki/Noncharacter-error-handling describes the solution we’ve come up with for R7RS Large

<dthompson>dpk: thanks

<dsmith>I like how rust handles string. Foreign and Domestic

<dsmith>Os strings are... not the same.

<dpk>yes, if we did Scheme again we’d probably make file paths/file names bytevectors or some other type instead of strings

<dpk>alas, that is not the world we inherited from R5RS (nor did R6RS’s new I/O layer do anything to fix it)

<makx>we're talking about software, where on balance, everything is terrible.

<dpk>indeeed

<rlb>dthompson: thanks much

<rlb>dpk: haven't read it carefully yet, but is that similar to python's "surrogate-escape" -- I'd finally come around to thinking that if we had the resources, what you might want was something like that for the "default path", and solid support for bytevectors if you want to put in the effort (and have enough data that it matters).

<dpk>it is similar to surrogate escaping, but maintains the properties that a Scheme character is a Unicode scalar value, and a Scheme string is an array of Scheme characters

<rlb>And if you can only pick one, maybe "byte smuggling" (i.e. the pep, or maybe what you posted), since there are a lot of string functions you may want to use on paths, etc. where you're fine as long as the underlying encoding is "ascii transparent", i.e. it's ascii "/" that matters for posix.

<dpk>i.e. doesn’t require unpaired surrogates to be accepted as valid characters, nor to be representable within strings (which would create ambiguities/restrictions depending on an implementation’s internal string representation)

<rlb>And you're out of luck if it's not ascii transparent, e.g. ebcdic?

<rlb>(I guess it depends on the details.)
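
A minimal sketch of the "ascii transparent" point, written in Python because that is the approach rlb compares against. The sample path bytes are made up for illustration, and the sketch assumes the undecodable bytes come from an ASCII-compatible encoding (here Latin-1), so that byte 0x2F always means "/". The only API used is the standard `surrogateescape` error handler from PEP 383.

```python
# Hypothetical path bytes as a kernel might return them:
# Latin-1 encoded, so not valid UTF-8.
raw = b"/srv/caf\xe9/r\xe9sum\xe9.txt"

# Decode as UTF-8; each undecodable byte 0xNN is smuggled
# into the string as the lone surrogate U+DCNN.
text = raw.decode("utf-8", errors="surrogateescape")

# Ordinary string operations still work, because "/" is plain ASCII
# and is unaffected by the smuggled bytes around it.
parts = text.split("/")
basename = parts[-1]

# Re-encoding with the same handler restores the original bytes exactly.
rebuilt = "/".join(parts).encode("utf-8", errors="surrogateescape")
```

If the underlying encoding were not ASCII-transparent, the split on "/" could no longer be trusted, which is the case rlb raises about EBCDIC.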

<rlb>ACTION will read it

<rlb>I might be interested in helping with "something" on the guile front later, after we decide about utf-8.

<rlb>dpk: first-blush, I'm a little surprised that i/o-encoding doesn't leave you just before the bad data (or provide it, if it doesn't), though I guess in general, that might require unbounded buffering...

<dpk>that’s existing R6RS behaviour

<dpk>it would maybe be nice if you could get the invalid data, yeah …

<rlb>...and seems like this eventually needs to be in any rnrs flavor that might talk to linux/*bsd, etc.

<dpk>i actually have an Internet-Draft written up (but not submitted) describing noncharacter error handling for UTF-8 and UTF-16, but i want to get John Cowan’s feedback on it before possibly submitting it, since it’s originally his idea

<old>How can I override a GOOPs accessor and call its next-method ?

<old>I have the following:

<old> https://paste.sr.ht/~old/507a31dfd5a8acde3583837624a709d4fa6c525d

<rlb>OK, so looks like python's approach, as you say, has the disadvantage of introducing random lone surrogates, in exchange it's more compact (it's also fully reversible for "all" encodings, when requested, iiuc).

<rlb>I haven't figured the latter out for sure yet from the text.

<rlb>(wrt the r7rs proposal)

<dpk>i think in theory John did design it to be reversible for all encodings, but i personally only really understand UTF-8 and UTF-16, so i can’t speak to that specifically

<dpk>in practice, for the purpose it’s designed for, UTF-8 and UTF-16 are all that is needed

<rlb>I suppose that's true if you're not writing a system utility, or if it's still always safe if you require the program to run in one of those two modes, regardless of the system locale...

<rlb>i.e. my canonical question is "how would it look to write 'cp -a'" or tar, or...

<rlb>where every path *could* be in a different encoding.

<dpk>right, so it’s a platform-specific thing

<rlb>(and every user name, group, etc.)

<rlb>python now supports the "transparently" fairly well (finally)

<rlb>I think.

<rlb>"supports that"

<rlb>And iiuc rust *can* via the OsString, etc. dsmith mentioned.

<dpk>on Linux, paths are conventionally UTF-8 but not validated, so you always decode/encode paths in UTF-8 with noncharacter error handling

<rlb>paths in other countries weren't, and I suspect on a lot of older filesystems they still aren't.

<dpk>on Windows, paths are conventionally UTF-16 but not validated, so mutatis mutandis

<rlb>shift-jis, Korean encodings, etc.

<dpk>well, there’s not a lot we can do about that

<rlb>Sure, just do something like what python did -- that does work.

<rlb>for writing tar or cp -a, etc. (I think)

<dpk>what does Python do to support that?

<rlb>Double-check me, but if you read a path, it uses the filesystem encoding (typically these days utf-8) to decode, with surrogate escape, so whatever doesn't fit is still carried along, and then you can just reverse that in your program, using fsencode with surrogate escape to get the original, unmodified bytes, and you're good to go, if you're tar or cp, or whatever.

<rlb>i.e. you just reverse the string back to the correct bytes.

<rlb>If you don't care, then if you call open(path), it'll just reverse the process for you automatically to dtrt with the kernel.

<dpk>hmm https://docs.python.org/3/library/sys.html#sys.getfilesystemencoding

<rlb>and open(path) will accept either bytes or a string, possibly with bytes smuggled.

<rlb>(for the latter)
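
The round trip described above can be sketched roughly as follows. The helper names `from_os` and `to_os` and the sample file name are hypothetical; they spell out with explicit codec calls what `os.fsdecode` and `os.fsencode` do on a POSIX system whose filesystem encoding is UTF-8.

```python
def from_os(name_bytes: bytes) -> str:
    """Bytes from the OS -> str, smuggling undecodable bytes as lone surrogates."""
    return name_bytes.decode("utf-8", errors="surrogateescape")


def to_os(name: str) -> bytes:
    """str -> the exact bytes the OS originally handed over."""
    return name.encode("utf-8", errors="surrogateescape")


# Hypothetical file name containing a byte that can never appear in UTF-8.
raw_name = b"report-\xff.txt"
name = from_os(raw_name)

# The smuggled byte makes the string unencodable under strict rules,
# which is the "random lone surrogates" cost mentioned in the discussion.
try:
    name.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# A tool like tar or cp -a only needs the reverse mapping to be exact.
restored = to_os(name)
```

As long as the same encoding is used in both directions within one process run, the bytes written back out match the bytes read in.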

<rlb>You can of course also just run your whole process in latin-1 (for some languages), but that's a pretty ugly sledgehammer.

<dpk>okay, here: https://docs.python.org/3/c-api/init_config.html#c.PyConfig.filesystem_encoding

<rlb> https://peps.python.org/pep-0383/

<rlb>It took them far to long to get there (I have enough associated pain to attest...)

<rlb>"too long"

<rlb>(There was a long period where it looked like they just didn't want to accept that all paths/users/groups weren't always encodable.)

<rlb>And it *is* tedious, I'll agree -- having to accommodate that.

<dpk>okay, thanks, i’ve noted that https://codeberg.org/scheme/r7rs/issues/51#issuecomment-2460851

<rlb>I also wasn't a fan of this approach initially, but I've begun to think it might be a fairly practical solution for the common case.

<rlb>Even though I'd still really like to have full bytes/bytevector interfaces when you want them, perhaps in an alternate, or lower-level module.

<dpk>i think how an implementation decides what encoding to try for file paths will be implementation-specific, though; even just saying ‘on Linux, always do UTF-8; on Windows, always do UTF-16’ would be over-specification even if it were entirely correct

<rlb>Sure - for the python approach all that matters (for a utility) is that the same encoding is used during that process run for both input and output (paths, users, groups).

<rlb>i.e. that the round-trip is "clean"

<ekaitz>dthompson: the other day you mentioned about UTF-8 in hoot, https://www.more-magic.net/posts/chicken-6.html <-- chicken 6 does the same thing now

<ekaitz>I wonder how many libraries are affected by the performance degradation of the `string-ref`

<rlb>I'd guess it's heavily library/algorithm dependent, but sometimes "significant" -- though some things will improve too, i.e. locality/cache-density/memory-footprint, at least for some languages, etc.

<old>it seems like my problem is that it is not possible to override class accessors

<rlb>ekaitz: and at least for the proposed guile utf-8 work, string-ref will still be *real* O(1) for ascii.

<rlb>(i.e. direct access)
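
A rough sketch of why `string-ref` stays cheap for ASCII in a UTF-8 representation but not in general. This is not Guile's or Hoot's implementation; the function `utf8_ref` and its `ascii_only` flag are hypothetical, and the sketch assumes the buffer holds valid UTF-8.

```python
def utf8_ref(buf: bytes, k: int, ascii_only: bool) -> str:
    """Return the k-th character of a valid UTF-8 buffer."""
    if ascii_only:
        # One byte per character: direct O(1) access.
        return chr(buf[k])
    # General case: walk the buffer, counting lead bytes
    # (any byte that is not a 10xxxxxx continuation byte).
    index = -1
    for pos, byte in enumerate(buf):
        if byte & 0xC0 != 0x80:
            index += 1
            if index == k:
                end = pos + 1
                while end < len(buf) and buf[end] & 0xC0 == 0x80:
                    end += 1
                return buf[pos:end].decode("utf-8")
    raise IndexError(k)


ascii_char = utf8_ref(b"hello", 4, ascii_only=True)
mixed_char = utf8_ref("a€b".encode("utf-8"), 2, ascii_only=False)
```

A real implementation could avoid the linear scan with an index or cached offsets; the sketch only shows where the cost comes from.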

<rlb>It's string-set! that's the bigger issue I suspect.

<ekaitz>yeah...

<dthompson>ekaitz: yeah, they do the same indirection we do. our string type is a wrapper around the underlying bytes.

<ekaitz>does the plan for guile include an easy way to obtain the underlying bytevector?

<dthompson>string->utf8 would be the thing, though it will not be the same chunk of memory

<ekaitz>why not?

<dpk>mutation

<dpk>(i’m guessing)

<dthompson>if guile says strings are utf8 but allows users to modify the underlying bytes then guile will have a hard time making that guarantee

<dthompson>and also mutable strings are a big no-no imo

<dthompson>and also it's an abstraction leak that limits how guile can represent strings

<ekaitz>string-set! would also return a fresh string i guess, then

<dpk>mmm, it’s not too bad. consider the symbol->string case

<dpk>until version 10, you could corrupt the symbol table in Chez Scheme by doing a string-set! on the result from symbol->string

<dpk>(and this is allowed by R6RS)

<dthompson>that's what happens in hoot rn. the string object is mutated but we allocate a fresh array.

<dthompson>ekaitz: ^

<dpk>you have to be more careful with allowing mutation of the string->utf8 result because it might break an internal expectation that all strings point to valid UTF-8

<dpk>ACTION checks what R7RS small and R6RS say about mutation and the result of string->utf8 …

<dthompson>right. that to me is the biggest reason to not do that.
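
To illustrate the invariant at stake, here is a small sketch with a hypothetical `Utf8String` wrapper. It is not how Guile or Hoot represent strings; it only contrasts handing out a copy of the bytes with handing out the internal buffer itself.

```python
class Utf8String:
    """Hypothetical string type that promises its buffer is valid UTF-8."""

    def __init__(self, text: str) -> None:
        self._buf = bytearray(text.encode("utf-8"))

    def to_utf8_copy(self) -> bytearray:
        # Safe: the caller gets fresh storage (the "just a memcpy" approach).
        return bytearray(self._buf)

    def to_utf8_shared(self) -> bytearray:
        # Unsafe: the caller can now write arbitrary bytes into the string.
        return self._buf

    def is_valid(self) -> bool:
        try:
            self._buf.decode("utf-8")
            return True
        except UnicodeDecodeError:
            return False


s = Utf8String("é")          # stored as the two bytes C3 A9

copy = s.to_utf8_copy()
copy[0] = 0xFF               # mutating the copy leaves the string intact
valid_after_copy = s.is_valid()

shared = s.to_utf8_shared()
shared[0] = 0xFF             # FF A9 is not valid UTF-8
valid_after_shared = s.is_valid()
```

Returning a copy costs an allocation, but it keeps the "all strings are valid UTF-8" guarantee out of the user's hands.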

<dpk>oh, fucking brilliant. R7RS small does not guarantee that string->utf8 returns a newly-allocated bytevector, but also doesn’t say it’s an error (i.e. undefined behaviour, i.e. nasal demons) to mutate the resulting bytevector

<dthompson>🙃

<dpk>so who knows

<dthompson>the scheme standards are but suggestions ;)

<ekaitz>it would be great not to have anything mutable in scheme... and make everything like clojure does... hmmmm

<dthompson>well I like me some mutability :) but in specific situations and for specific data types

<dthompson>I'm a little bytevector gremlin

<dpk>R6RS does require the result of string->utf8 to be a newly-allocated bytevector, unless the string is empty. (same with string->utf16 and string->utf32)

<dthompson>just wait until I release my library that is like chez scheme's ftypes

<dthompson>so much bytevector gremlining going on there

<dthompson>I looked at guile's code for string->utf8 recently and iirc it allocates a new bytevector

<dthompson>ACTION goes afk

<rlb>ekaitz: in the utf8 branch, you get a new bytevector copy because it's a different type, but it's "just a memcpy", so it's likely not very expensive (relatively speaking).

<rlb>(unless you have a whole lot of large strings)

<ekaitz>oh, good

<rlb>I want really solid immutable support, but I also want to be able to "drop down" when I need to, because hardware doesn't care about my elegance :P

<rlb>...on the jvm strings are immutable, but you can go pretty fast with an intermediate stringbuilder, for (not entirely satisfying) example.

<rlb>well, maybe I should say *java* strings are immutable -- not sure whether the jvm embeds that concept...

<dpk>hmm, i wonder how Kawa handles that

<rlb>fwiw clojure just uses java (jvm?) strings directly, and a very common function is (str & args) which internally uses a stringbuilder (I think) for you.

<rlb>i.e. (str 1 2 "three" :foo ...)

<rlb>though often just used on strings, not heterogeneous objects

2024-11-19.log