IRC channel logs

<mwette>IIRC string-rindex

<mwette>string-rindex searches for char's. Do you mean regexp?

<mwette>if non-overlapping regexp's then maybe fold-matches to return the last match

<rlb>...not sure what possessed me, but I wrote manpages for the remaining tools (guild, guile-config, guile-snarf, etc. --- already have one for guile proper), and I adjusted them to pick up the version in the footer. I'll probably commit that soonish.

<dsmith>rlb, Nice!

<mwette>ditto!

<rlb>I did only document the guild subcommands listed by "guild help", not the full "guild help --all" list (didn't even remember that was there).

<rlb>But if we were to head that route, I suspect we should break things up into multiple guild-SUBCOMMAND(1) pages like git, etc.

<rlb>And I'm not signing up for that right now :)

<mwette>You mean like `man guild-compile'?

<mwette>Do you know of `help2man' ? I only came upon it recently. I have not used it.

<rlb>right, since there are a lot of subcommands in help --all

<rlb>And I think at least on linux man is somewhat flexible, i.e. "man guild foo" will also work.

<rlb>(does for git)

<mwette>Crazy. I didn't know that.

<listentolist>mwette: I mean searching for a string of several chars in a string. So basically string-contains but backwards as the string is long and it would be too slow to use fold-matches.

<listentolist> https://www.gnu.org/software/mit-scheme/documentation/stable/mit-scheme-ref/Searching-and-Matching-Strings.html
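
A minimal sketch of the kind of backward substring search being asked for, written in Python for illustration rather than Guile. It is not the contents of any paste linked in this log; the name `search_backward` and its `end` parameter are made up here, and in everyday Python `str.rfind` already does this job.

```python
from typing import Optional


def search_backward(text: str, pattern: str, end: Optional[int] = None) -> Optional[int]:
    """Return the start index of the last match of pattern that ends at or
    before `end`, or None if there is no match. (Hypothetical helper.)"""
    if end is None:
        end = len(text)
    m = len(pattern)
    # Try candidate start positions from the right-most one down to 0,
    # so the first hit is the last occurrence.
    for start in range(end - m, -1, -1):
        # Compare one character reference at a time; the cost of each
        # reference depends on how the string is stored.
        if all(text[start + j] == pattern[j] for j in range(m)):
            return start
    return None
```

Because the scan starts at the end, a match near the tail of a long string is found without touching the rest of it.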

<ArneBab>rlb: you’re right for already broken programs: if it already replaced characters with a fixed placeholder, moving to non-characters should be a clear improvement.

<sneek>Welcome back mwette :D

<mwette>sneek: botsnack

<sneek>:)

<mwette>listentolist: https://paste.debian.net/1413511/

<mwette>rlb: Would 1413511^ paste work fast on your utf-8 implementation?

<dsmith>mwette, That looks like it could be expensive, with the string->list

<mwette>dsmith: That's for the pattern, which is typically small, I guess.

<dsmith>mwette, Indeed it is! Must read more carefully

<rlb>mwette: the cost should be in the string-ref, which is nominally constant cost given the sparse index, and is constant cost if the string happens to be ascii. Though the "constant" for non-ascii will vary, depending on the string. i.e. how far you have to "seek" after you "skip" for each ref.

<dthompson>does every string need to incur the cost of indexing or is done lazily?

<dthompson>is it*

<rlb>index is persistent, and built with the string.

<rlb>part of the immutable string

<rlb>The string is all one contiguous allocation, including the index.

<dthompson>my experience has been that I almost never want to string-ref

<rlb>(for cache friendliness)

<rlb>Right, that's why the index is at the end too -- so you don't pay for it unless you access it :)

<dthompson>but you pay for its construction

<rlb>ACTION thanks russ for that observation/suggestion

<rlb>Sure, but only for non-ascii, and it's (configurably, even) small relative to the content itself.

<rlb>I'd guess via C, for most fancier cpu's it's probably well worth it -- it's also used iirc a good bit internally via many functions, i.e. for getting initial offsets, etc.

<rlb>Oh, and as mentioned, you don't have one at all for ascii-only strings.
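
A rough Python sketch of the sparse-index idea as described so far: UTF-8 bytes plus one byte offset recorded every `STRIDE` characters, and no index at all for ASCII-only strings. The class name, the stride value of 4, and keeping the index in a separate list are assumptions for illustration. The branch under discussion is written in C and keeps the index at the end of the same allocation.

```python
STRIDE = 4  # hypothetical value; the log only says the stride is configurable


class IndexedUtf8:
    """Immutable UTF-8 string with a sparse character-to-byte-offset index."""

    def __init__(self, text: str):
        self.data = text.encode("utf-8")
        self.length = len(text)
        self.index = None  # ASCII-only strings carry no index
        if len(self.data) != self.length:  # at least one multi-byte character
            self.index = []
            char_no = 0
            for offset, byte in enumerate(self.data):
                if byte & 0xC0 != 0x80:  # not a continuation byte: a char starts here
                    if char_no % STRIDE == 0:
                        self.index.append(offset)
                    char_no += 1

    def _next(self, offset: int) -> int:
        """Byte offset of the character following the one at `offset`."""
        offset += 1
        while offset < len(self.data) and self.data[offset] & 0xC0 == 0x80:
            offset += 1
        return offset

    def ref(self, k: int) -> str:
        if not 0 <= k < self.length:
            raise IndexError(k)
        if self.index is None:
            return chr(self.data[k])  # ASCII: byte k is character k
        offset = self.index[k // STRIDE]  # "skip" to the nearest indexed char
        for _ in range(k % STRIDE):  # then "seek" at most STRIDE - 1 chars
            offset = self._next(offset)
        return self.data[offset:self._next(offset)].decode("utf-8")
```

Each `ref` does one index lookup and then walks at most `STRIDE - 1` characters, which is the bounded but content-dependent "constant" mentioned earlier.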

<rlb>fwiw

<dthompson>I guess over time I've come to accept that strings are not O(1) access

<rlb>Heh, me too...

<dthompson>string cursor interfaces feel more like the right thing than string-ref

<rlb>If nothing else, it's a compromise I suspect we need to make for all the existing code that certainly does expect O(1) ref, etc.

<rlb>Though there's nothing we can do about string-set!...

<dthompson>yeah

<rlb>mwette: a reverse iterator like clj's (reverse ...) could be a lot more efficient.

<dthompson>personally I'm rather okay with letting old string-ref loops go quadratic
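
For a sense of what "go quadratic" means here, a small Python sketch that counts the work done when each reference has to scan UTF-8 bytes from the start of the string because there is no index. The helper names are hypothetical and the input is assumed to be valid UTF-8.

```python
def scan_ref(data: bytes, k: int):
    """Return (char, steps): character k of UTF-8 `data`, found by walking
    from the start, and the number of characters stepped over."""
    offset = 0
    steps = 0
    while steps < k:
        offset += 1
        while data[offset] & 0xC0 == 0x80:  # skip continuation bytes
            offset += 1
        steps += 1
    end = offset + 1
    while end < len(data) and data[end] & 0xC0 == 0x80:
        end += 1
    return data[offset:end].decode("utf-8"), steps


def total_steps(text: str) -> int:
    """Total characters stepped over by a 0..n-1 loop of scan_ref calls."""
    data = text.encode("utf-8")
    return sum(scan_ref(data, k)[1] for k in range(len(text)))
```

Doubling the string length roughly quadruples the total: the loop costs n(n-1)/2 steps for n characters.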

<rlb>ACTION is less sanguine

<rlb>I can't recall for sure, but I think I may have maintained the index pretty efficiently/incrementally much of the time, and I wonder how often it might be roughly "free" allocation-wise, given fragmentation, but you're right, it's dead weight if you never use it.

<dthompson>I guess benchmarking will indicate how much overhead it creates

<rlb>I do think that without it, we might well end up needing something similar in places, either ad-hoc, or as an additional api for cases where you do need some/many offsets.

<dthompson>have you seen https://srfi.schemers.org/srfi-130/srfi-130.html

<rlb>Right -- iirc, so far the few tests that have been run on "larger" things showed it to be a bit faster than what we have now.

<rlb>And I ran the ecraven benchmarks, which were varied, but I think overall, either comparable, or faster (again iirc, and fwiw).

<rlb>Hmm, that rings a bell -- think I did, and thought it seemed interesting, though haven't used it at all.

<dthompson>the string cursor srfi is nice because you can easily save cursors for cases where your algorithm needs to backtrack and it's functional and specialized for strings so less overhead than a string port

<dthompson>I tried to see how much mileage I could get out of string ports and found it too costly.

<dthompson>my benchmark for string iteration performance is guile-syntax-highlight. I wrote it back when I didn't appreciate the complexity of strings and it has its own cursor interface that does string-ref under the hood.

<rlb>makes sense -- basically in the "additional api" category I was suggesting, perhaps.

<dthompson>yeah

<rlb>yeah, the current utf8 branch (which should build just fine now if anyone wants to play with it -- I keep it rebased and passing the tests) goes to some lengths in the code to try to be efficient, though as I recall, there's still room for even further optimization if we liked.

<dthompson>I rewrote the cursor implementation to use a string port which used seek to backtrack and it performed *much worse* than the existing implementation, so I'm big on a dedicated string cursor interface now.

<rlb>Interesting.

<dthompson>yeah it was like 5x slower or something. really bad but not entirely unexpected. ports are so general-purpose and they have all this buffering machinery that isn't really necessary for strings.

<dthompson>and the fact that they're stateful makes backtracking expensive

<rlb>The existing utf8 work has the advantage (advantage?...) of C, which at least for the bulk byte manipulation is fast, if tedious (of course) to try to get right.

<rlb>Right, ports are often convenient, if nothing else because they hide all the "allocation" decisions. That was one of the things that's tedious / potentially-fragile on the C side.

<rlb>For cases where it was "relevant" I tended to follow the alloca up to a point, and then switch to malloc approach (though typically / always(?) gc_malloc_pointerless). Cases where we *have* to deal with malloc I generally favored a wind for cleanup, etc.

<rlb>So up to SCM_MAX_ALLOCA can be handled on the stack.

<rlb>But I still think we should, in general, lean against putting things in C (hence the srfi-1 conversion). Strings, though, particularly given our starting point, seem like a plausible exception.

<rlb>Though I did want to move some more of the string functions to scheme and couldn't (without further overhaul of other things).

<rlb>"Which wasn't today's problem." :)

<dthompson>yeah, there's a balance. is it a part of the runtime or is it implemented on top?

<rlb>All that said, I'm sure there are still bugs (and certainly some C related) --- as mentioned, I *did* try to be even more careful about overflow than our C code has typically been, i.e. via trying to use intprops.h add/mul checks "everywhere it matters".

<dthompson>have you considered a side table to hold the indexes? that way they could be created lazily upon first string-ref

<rlb>I think the issue was that right now we can't implement some libguile/foo.c (it was one or more of strings.c, srfi-13.c, srfi-14.c) functions in scheme, i.e. some in scheme and some in C.

<rlb>I did, but I felt like if they're say < 5-10% the size of the content, then if they're used very much, that might be a large perf loss compared to being inline wrt the rest of the data, regarding modern caches, prefetching (that might nearly always fetch the following data anyway, index or not), etc.

<rlb>And as mentioned, many allocators are going to round up the allocation, and so that space might just be wasted if we don't put the index there.

<rlb>Another option would be a bit to say whether an inline index had been initialized yet -- plausible, but we'd need to be careful about atomicity.

<dthompson>I guess I see that choice as prioritizing legacy string access over current best practice

<rlb>Also, I should check, but it may be that the index was more or less always computed as a side effect while building the string and so may not cost an extra pass, etc.

<dthompson>perhaps the cost is so trivial that it's fine to do this for all strings, dunno

<ArneBab>I do use string-ref, and the cursor-interface used in guild hall was one reason why I couldn’t make heads or tails of the code: too much state to carry around in my head to understand the code.

<rlb>Sure, but if you can't really notice the cost, or if it's tiny, then perhaps worth it, and past a certain point, if the cost is low enough, why *wouldn't* you want string-ref to be vastly faster --- can be convenient, if you're in a hurry, even if you know better.

<dthompson>I think it's an okay goal to keep existing programs built for O(1) string-ref working with the expected time complexity, but personally I'm okay with that involving some constant overhead for that case

<rlb>I suppose my current feeling is that "it's probably worth it, and preferable", but as you say, it'd be better to have more concrete info. Though if course nothing prevents us from having it for now and removing it later (e.g. as a transitional mechanism) if we discover it's an issue.

<rlb>"Though of course"

<dthompson>some benchmarking of real world programs to get an idea of the impact would be cool

<rlb>It *should* be completely invisible, other than the perf impact.

<rlb>i.e. "easy" to remove, though not exactly easy C code wise :)

<dthompson>it's a different thing, but I know that applying this same strategy to hoot would be very costly

<rlb>Yeah, I'm only thinking about "for guile", and with the implementation in C.

<rlb>i.e. my inclination is very situational.

<rlb>also given my impression that many/most cpus are ridiculously fast now wrt "inline data" if it's "not big".

<dthompson>my hypothesis is that the additional overhead at string construction will become noticeable at scale

<rlb>Of course some of that won't apply at all for a *giant* string i.e. if the cache is 200MB away at the end :)

<rlb>Well, at least not for the L1 cache ;)

<rlb>Yeah, I suspect relative to all the other costs, and "any" of our scheme code, it probably won't, but I might well be wrong.

<ArneBab>Doesn’t wasm evolve at the moment to provide js strings to wasm?

<rlb>The main concern I'd have right now, offhand, is locality-related, but then you're probably bouncing around a much larger string, and so the bounces to the compact index at the end perhaps don't compare.

<rlb>wrt construction, I may misremember, but I think in nearly every case, we're already having to effectively compute every offset regardless, i.e. for validation of the utf-8, so the only "cost" is writing those to the index bytes, and they're sparse, so you only write say one entry for every N chars, where N is configurable (the stride) at compile time, though I feel sure we'll pick one and never change it.

<rlb>for validation, or another per-char traversal (unless we're actually using the existing index) that we need as part of the algorithm in question.

<dthompson>ArneBab: sort of but then you have to accept UTF-16 as your string encoding

<dthompson> https://github.com/WebAssembly/js-string-builtins/blob/main/proposals/js-string-builtins/Overview.md

<rlb>Hmm, though I suppose maybe I'm wrong e.g. for string concatenation without offsets (or even with, as long as the offset is near the start), you wouldn't have all the offsets for the combined index, etc. Would need to think about it were we to start leaning against the indexes...

<rlb>If we could make the atomicity safe/fast enough, having an inline, lazily constructed index might be fine too, but that's something we could add later, presumably after some testing justifies the additional complexity.

<rlb>I also have the feeling that given how things have been until now, there's probably a *lot* of existing code that relies on O(1) string-ref and string-set! and it's likely to be worth favoring some cost to help out with string-ref, since we can --- can't sanely do much about set!.

<ArneBab>dthompson: isn’t UTF-16 the native encoding on Windows?

<ArneBab>… well, also of js in general

<ArneBab>⇒ do you see an alternative for cheap transfer through the js/wasm barrier?

<dthompson>I don't, no

<dthompson>this is *the* way that's available since the stringref proposal didn't get adopted.

<dthompson>it's just going to be an annoying rewrite to use this interface

<dthompson>but the benefit of not copying strings between wasm and js will be great

<old>Would be nice if the utf-8 support is not introducing a regression on the order of O(n^2) to something that was O(n)

<old>that would be a very bad thing imo

<ArneBab>could UTF-16 in Guile (hoot) and UTF-8 in Guile (C) be a problem for server-client interaction?

<dthompson>old: that's what rlb is working on

<ArneBab>(to avoid that cost)

<rlb>old: we're definitely going to for string-set!, but not string-ref (as it currently stands).

<old>rlb: mutation is fine I think

<dthompson>string-set! should be painful

<dthompson>no mercy for those that mutate strings

<rlb>we "compromise"...

<old>I don't recall a single place where I've used it

<old>maybe if you make a text editor then it could make sense I suppose

<dpk>i should really write up the rationale for why we decided to SHOULD O(1) on string-ref, despite the common and basically correct argument that random access to Unicode scalar values is useless

<old>but there's probably way better functional strings anyway that would support undo operations

<dthompson>old: for text editing you'd have some data structure composed of many little strings

<old>ya ropes

<rlb>well, some places that might use it for a clever/fast algorithm, are likely depending on "ascii", and I think we *could* keep that fast (one of the questions I had during the work), but perhaps that code could also just use a bytevector...

<dpk>(tl;dr: hysterical raisins, plus it’s still a pure functional interface unlike e.g. Hoot’s favoured string port based solution, plus backtracking parsers have an access pattern which sort of approximates random access)

<old>rlb: would it be possible to tag the string internally? For example, user create a utf-8 string, but it is actually ascii

<dthompson>but yeah O(1) string-ref is something we gotta live with. I just don't want to pay a penalty for it when I don't use it.

<old>so the stride for index is always 1 in that case

<rlb>old: done

<dpk>string-set! will be deprecated and have a performance notice allowing truly atrocious performance, in order to enable certain patterns for using string-ref

<rlb>ascii strings have no index, and some much faster paths.

<dthompson>dpk: we don't favor string ports specifically. a dedicated string iterator like srfi-130's cursors would be best.

<old>rlb: well then I think you've solved 90% of the problem

<dthompson>but string ports are what we have, currently

<dpk>dthompson: yeah, i keep meaning to look at adding SRFI 130. last time i looked i was annoyed that jcowan designed it in a way that forces you to have string cursors disjoint from exact nonnegative integers, so a string cursor which is just a fixnum byte offset can’t be used

<rlb>I picked a compromise on how often the ascii strings have a separate path wrt complexity / convenience -- might well not be the right compromise(s) yet.

<dpk>because all the procedures are polymorphic between indexes and cursors. ad hoc polymorphism: not even once

<dthompson>dpk: ugh I haven't looked that closely but I do not like that

<dthompson>so let's just say some as-of-yet-unwritten string cursor interface ;)

<dpk>the latest idea i had was that a string cursor is converted from a byte offset with (- -1 byte-offset); then if it’s negative you know it’s a string cursor, and if nonnegative an index
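
dpk's encoding can be sketched in a few lines of Python. The function names are made up for illustration; only the `(- -1 byte-offset)` mapping and the sign test come from the message above.

```python
def cursor_from_byte_offset(byte_offset: int) -> int:
    """Encode a byte offset as a cursor: 0 -> -1, 1 -> -2, and so on."""
    if byte_offset < 0:
        raise ValueError("byte offset must be nonnegative")
    return -1 - byte_offset


def is_cursor(x: int) -> bool:
    """Negative values are cursors; nonnegative values are plain indexes."""
    return x < 0


def byte_offset_from_cursor(cursor: int) -> int:
    """Invert the encoding; the same formula maps back."""
    if not is_cursor(cursor):
        raise ValueError("not a cursor")
    return -1 - cursor
```

Shifting by one is what keeps byte offset 0 distinguishable from index 0, so a procedure that accepts both can still dispatch on the sign alone.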

<rlb>dpk: btw, I'm now thinking that "ports" encoding should be "unified" with the "systemdata" encoding, and (eventually) they should both default to 'error, if not 'noncharacters.

<dpk>rlb: i don’t know what the ports and systemdata encodings are :D

<dthompson>ArneBab: getting back to your question, all the UTF-16 stuff would be hidden in the runtime so there'd be no interop issues. I think this would mean that string->utf8 would become more expensive, but that's okay I think.

<rlb>Oh, sorry, I posted to the list the other day about the whole noncharacters question.

<rlb>(in an older thread)

<rlb>Right now ports default to a fluid that defaults to a 'substitute (lossy) conversion strategy, and I think that should at least be 'error (eventually), and if we add 'noncharacters, maybe noncharacters along with the encoding/decoding for "system data" (paths, users, envt vars, etc.).

<rlb>My main question was whether you'd ever have a real reason to want separate fluids for "system data" and ports.

<rlb>i.e. need to vary that per-thread differently for each category -- I now think no.

<rlb>bbl

<mwette>For cursors, it might be useful to have (char-fwd string cursor) -> (values char next-cursor). And char-rev also. Instead of calling string-cursor-next and string-ref/cursor separately.
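
A Python sketch of what mwette's combined operations might look like over UTF-8 bytes, returning the character and the adjacent cursor in one call. Here a cursor is assumed to be a plain byte offset on a character boundary and the input is assumed to be valid UTF-8; neither assumption comes from SRFI 130 or from Guile.

```python
def char_fwd(data: bytes, cursor: int):
    """Return (char, next_cursor) for the character starting at `cursor`."""
    if not 0 <= cursor < len(data):
        raise IndexError(cursor)
    end = cursor + 1
    while end < len(data) and data[end] & 0xC0 == 0x80:  # continuation bytes
        end += 1
    return data[cursor:end].decode("utf-8"), end


def char_rev(data: bytes, cursor: int):
    """Return (char, prev_cursor) for the character ending just before `cursor`."""
    if not 0 < cursor <= len(data):
        raise IndexError(cursor)
    start = cursor - 1
    while start > 0 and data[start] & 0xC0 == 0x80:  # back up to the start byte
        start -= 1
    return data[start:cursor].decode("utf-8"), start
```

A backward loop then reads `ch, cur = char_rev(data, cur)` until `cur` reaches 0, with no separate "move" and "read" calls.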

<old>rlb: would there be a way to specify raw encoding?

<old>like just binary

<ArneBab>dthompson: would there be a path around that? I’m thinking about using hoot for experiments with more efficient serialization between client and server (think protobuf) -- would that impact that? (or can I just get the bytes raw?)

<old>For example rn, I have a custom port that I use to filter colors dynamically before sending strings to the real underlying port. But this involves many conversions even though the filter is done on bytes

<mwette>sneek: later tell listentolist, simple string-search-backwards: https://paste.debian.net/1413511/

<sneek>Got it.

<duncan>Spotted this let form in the wild: https://framagit.org/tyreunom/guile-rdf/-/blob/master/rdf/rdf.scm?ref_type=heads#L241

<duncan>What does it mean?

<duncan>That's not the let form I'm familiar with

<duncan>Is it some kind of guard clause to memoise?

<dthompson>ArneBab: no you can't just copy a blob over and say "this is a compound object conforming to this schema"

<identity>duncan: that does not load for me, could you paste that somewhere else?

<ieure>duncan, Loads for me, doesn't eval in a Guile repl. I've also never seen a let like that, has to be some kind of macro or DSL or something.

<old>identity: line 10: https://paste.sr.ht/~old/a3424df0113f63bd28dbd5b4e173c2c554e01146

<duncan>It definitely evals in a guile repl for me

<ieure>duncan, Also noting line 220, (let loop ((g g) (m 0)))

<ieure>Which doesn't conform to the Guile manual's documentation for let.

<ieure>Named let? letrec? Something like that.

<dthompson>that's srfi-71's extended let syntax

<old>renamings is a variable not a syntax-transformer

<duncan>oh I see, it's multiple values

<ieure>Ah, there you go, there's srfi-71 in the modules.

<duncan>thanks!

<old>oh god

<old>I much prefer let-values instead

<identity>nobody would have been confused by a ‘let*-values’

<old>it has the merit of being clear

<ieure>Really don't care for the way Schemes default to binding the union of module symbols into the current one.

<ieure>Makes it very hard to see where stuff is coming from, which this is an example of.

<duncan>I *guess* the values bit in it should have been a giveaway

<duncan>but I tend not to notice such things til it's too late

<listentolist>mwette: Thank you!

<sneek>listentolist, you have 1 message!

<sneek>listentolist, mwette says: simple string-search-backwards: https://paste.debian.net/1413511/

<listentolist>sneek: Thank you!

<dthompson>rlb: I take it you've read https://hsivonen.fi/string-length/ ?

<dthompson>wingo's article about strings is also very good https://wingolog.org/archives/2023/10/19/requiem-for-a-stringref

<dthompson>"Users would prefer extended grapheme clusters"

<dthompson>I feel this way whenever I deal with emoji

<dthompson>"array-of-codepoints is just a bad place in the middle"

<dthompson>the note about Swift's string views is a good one

<dthompson>"Swift maintainers designed new APIs to allow users to request a view on a string: treat this string as UTF-8, or UTF-16, or a sequence of codepoints, or even a sequence of extended grapheme clusters."

<rlb>ieure: these days, I nearly always #:select --- habit I now prefer, from clojure.

<rlb>It's more work "once" for the writer, but I find much nicer to those who come after.

<rlb>(the readers)

<rlb>dthompson: someone told me, though I haven't investigated carefully, that swift's effectively doing what I did.

<rlb>wrt sparse index

<rlb>And I've seen the first article, though not sure I've read it carefully, but not sure about wingo's.

<rlb>old: depending on what you mean, my current inclination is to switch to utf8 and then, assuming we really think either it's going to be adopted, or is "sufficiently good" regardless, to start with a 'noncharacters approach: https://codeberg.org/scheme/r7rs/wiki/Noncharacter-error-handling --- whether or not we eventually also want much more comprehensive bytevectors handling / support for "system data" seems like it might could be a

<rlb>separate, additional, question.

<rlb>I also consider everything with our currently, I think, fairly limited resources in mind...

<rlb>Using strings, and something like noncharacters means that most programs "just work", without having to know about the whole mess that is "system data vs unicode", and allows you to use the full array of string operators on system data (split, join, ... for paths, etc.).

<rlb>"same story, second verse, ..." https://lists.gnu.org/archive/html/guile-devel/2025-12/msg00012.html

<dthompson>is the idea to use strict utf-8 for strings or wtf-8?

<rlb>I believe it's strict atm.

<dthompson>because wtf-8 would allow strings to contain anything which can then be interpreted

<rlb>I just relied on our existing libunistring and iconv libraries, plus some hand-written optimizations/simplifications in a few cases.

<ieure>wtf-8 is when you have no idea what the encoding is

<rlb>Though I think Andy may have wanted to consider moving away from libunistring? And I think I did too due to some issues I forget...

<rlb>dthompson: I assume the relevant cases would just end up encoded via noncharacters, if we go that route?

<dthompson>this is where my string knowledge ends. :) idk

<dthompson>utf-8 + breadcrumbs is looking more and more like the best fit for keeping string-ref working as it should even if it's effectively useless in the modern age

<rlb>basically with noncharacters, any undecodable byte 0xxy (where x and y are each 4 bits) becomes unicode \ufddx\ufddy.

<old>oh it's possible to crash guile by setting an invalid or unsupported encoding and writing to the port

<rlb>and then when you encode on the way out with noncharacters, you notice that and reverse the process back into the xy byte.
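
The round trip rlb describes can be sketched in Python using a custom codec error handler. This is an illustration of the mapping only (byte 0xXY becomes U+FDDX U+FDDY and back), not Guile's implementation; the handler name `noncharacters-sketch` and both function names are invented for the example.

```python
import codecs

NONCHAR_BASE = 0xFDD0  # U+FDD0..U+FDDF, one noncharacter per 4-bit nibble


def _escape_undecodable(err):
    """Decode-error handler: turn each bad byte into two noncharacters."""
    if not isinstance(err, UnicodeDecodeError):
        raise err
    bad = err.object[err.start:err.end]
    escaped = "".join(
        chr(NONCHAR_BASE + (b >> 4)) + chr(NONCHAR_BASE + (b & 0x0F)) for b in bad
    )
    return escaped, err.end  # resume decoding after the bad bytes


codecs.register_error("noncharacters-sketch", _escape_undecodable)


def decode_noncharacters(data: bytes) -> str:
    return data.decode("utf-8", errors="noncharacters-sketch")


def encode_noncharacters(text: str) -> bytes:
    out = bytearray()
    i = 0
    while i < len(text):
        hi = ord(text[i]) - NONCHAR_BASE
        lo = ord(text[i + 1]) - NONCHAR_BASE if i + 1 < len(text) else -1
        if 0 <= hi < 16 and 0 <= lo < 16:
            out.append((hi << 4) | lo)  # a noncharacter pair: restore the byte
            i += 2
        else:
            out += text[i].encode("utf-8")
            i += 1
    return bytes(out)
```

Under this mapping the byte 0xb5 decodes to "\ufddb\ufdd5" and encodes back to 0xb5. What to do with input that already contains U+FDDx characters as valid UTF-8 is one of the open details and is not handled here.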

<old>I would have guessed an exception to be raised

<dthompson>rlb: neat!

<old>rlb: this is what we want I think

<rlb>It's very similar to python's (finally working well) surrogateescape approach.
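
For comparison, the Python behaviour rlb refers to is the standard `surrogateescape` error handler, which maps each undecodable byte to a lone surrogate in U+DC80..U+DCFF instead of a noncharacter pair. The helper name below is made up; the codec calls are standard library.

```python
def roundtrip_surrogateescape(raw: bytes):
    """Decode possibly-invalid UTF-8 and encode it back, returning both steps."""
    text = raw.decode("utf-8", errors="surrogateescape")
    back = text.encode("utf-8", errors="surrogateescape")
    return text, back


text, back = roundtrip_surrogateescape(b"caf\xb5")
# The stray byte 0xb5 is carried as the lone surrogate U+DCB5 ...
assert text == "caf\udcb5"
# ... and comes back out as the original byte.
assert back == b"caf\xb5"
```

Encoding such a string without the handler raises `UnicodeEncodeError`, so the escaped bytes cannot leak into strict UTF-8 output by accident.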

<rlb>So a naive round-trip works fine for arbitrary paths, command line arguments, env vars, etc.

<rlb>As compared to now, when the data's just quietly corrupted.

<dthompson>seems promising!

<rlb>"you can try it at home"

<rlb>guile -c '(write (program-arguments)) (newline)' -- $'\xb5'

<rlb>If you (as is likely) have a UTF-8 locale.

<rlb>With noncharacters, you'd see "\ufddb\ufdd5" and then if you tried to write that string back to a noncharacters port, you'd see the 0xb5 byte again.

<rlb>i.e. (write str) shows the encoding, but (write str port) would write the b5

<rlb>But...there are details to decide about.

<rlb>if we even decide we think it's the right way to go...

<old>is there a discussion of all this on codeberg?

<old>I know there is one on the ML

<rlb>No, nothing yet.

<old>okay

<old>perhaps it would be a good starting point for having the discussion ?

<ArneBab>dthompson: ok, thank you for the info. I’ll need to understand more it seems.

<ArneBab>rlb, ieure: I almost always use (only (srfi :26) cut) (and similar). I like being able to see in the file itself where a symbol comes from.

<rlb>...explicit selection is the norm in the clojure world, and there I saw just how useful it was on a regular basis -- of course, ymmv. It's noisy at the top of the file, but pgdn fixes that :)

<ArneBab>rlb: thank you for that "try at home"!

<ArneBab>that drives it home that it’s a clear improvement.

<rlb>ACTION again thanks posix for adding the $'' syntax :)

<ArneBab>:-)

<rlb>handy

<rlb>Now do printf %q...

<rlb>(i.e. wish it were posix and not just bash)

<ArneBab>rlb: In "Naming and Logic" where I restrict what I show to the essentials, I include only, because it’s so useful for understanding. https://www.draketo.de/software/programming-scheme#import ⇒ page 30 in https://www.draketo.de/software/programming-scheme.pdf

<rlb>I don't know import well yet, but assuming "only" is #:select-ish, sounds good to me.

<ArneBab>yes -- (only (srfi :26) cut) is like (use-modules ((srfi srfi-26) #:select (cut)))

<ArneBab>(import (only (srfi :26) cut))

<ieure>ArneBab, Yes, for things where I need 1-2 symbols, I'll explicitly select them; and for cases where I need more, will use #:prefix. But my criticism isn't about this being impossible, but about the most confusing behavior being the default.

<ieure>ex. if using (foo bar) automatically put those into a bar: prefix.

<rlb>Also agree that it'd be better to have to ask for "all".

2025-12-11.log