IRC channel logs

<sneek>Yey! dsmith is back :)

<dsmith>sneek, botsnack

<rlb>civodul old: I think our scm_strndup is either wrong or misleading since it always allocates n. Wondered if we could/should do anything about that.

<rlb>Looks like the one internal use in main does expect the unusual behavior, allowing scm_to_latin1_stringn (s, &len) to handle internal nulls if desired. So perhaps it's more a naming/doc issue...

<rlb>i.e. perhaps scm_memdup () or something would be clearer...

<rlb>For now I guess I'll just add some comments (there are no info docs to worry about...).

<probie>rlb: wait, why wouldn't it always allocate `n`?

<rlb>I feel like since it's named like strndup(2) it'd be natural to expect it to stop at the first null.

<rlb>strndup(3)

<rlb>i.e. I feel like that name has fairly well established expectations.

<rlb>(definitely what I thought it did, until I happened to look)

<probie>I haven't looked up the guile function; does it not stop copying on a `null`?

<rlb>no

<rlb> https://codeberg.org/guile/guile/src/commit/eff5613ac8aa51f0cd3f200b8ba72675a19c35be/libguile/gc-malloc.c#L167

<rlb>it's really "memdup"

<rlb>plus one

<probie>Just to clarify; naively, I expect it to always allocate `n`, but stop at the first null

<rlb>Oh, wait, was my reading/expectation of strdup just wrong?

<rlb>Guess now I need to figure that out first. Apologies for the noise if it was.

<rlb>s/strdup/strndup/

<probie>rlb: You're probably not wrong, my use of the word "naively" is likely working overtime. The obvious way to ensure sufficient memory in any of the `strn` calls is to just allocate the maximal amount of memory required.

<rlb>Ahh, interesting, posix 2018 at least leaves it up to the implementation. But then I'd argue "why waste the memory?" unless you want to guarantee that extension so code can rely on it...

<rlb>(and of course we can do anything we want in our implementation, and the one use case does rely on that behavior...)

<rlb>OK, well I'll probably still at least add a comment to the function.

<rlb>Haha --- if anyone recalls when I asked/complained about srfi-42's (eratosthenes 100000) test taking far too long the other day, I started actually looking in to it today, and immediately discovered that it isn't noticable at all on main, only in utf8; then I see string-set! in the inner loop.

<rlb> https://codeberg.org/guile/guile/src/branch/main/test-suite/tests/srfi-42.test#L528-L534

<rlb>old ^

<rlb>s/string-/vector-/ immediately fixes the problem :)

<mwette>Why is string-set slower in utf8?

<rlb>it's now O(n) (and to be avoided when feasible)

<mwette>nevermind. I read string-ref.

<rlb>ahh, right :)

<mwette>string-ref is fast?

<rlb>O(1)

<mwette>ty

<rlb>though the "1" is (compile-time configurably if we like) larger now for larger strings.

<rlb>i.e. we use a sparse index into the string with a "stride" so you can jump near the char you want and then iterate forward to find it.

<rlb>unless the string is all ascii, in which case, not needed.

<rlb>(we also optimize a bunch of cases where we see up front that the substring we're working on has a char count that's identical to the byte count, i.e an all ascii *region*)

<rlb>I just realized though that rdelim ought to be restructured, but I guess for backward compat we may want to preserve the existing %read-delimited behavior for external code (since we treated it as public...)

<rlb>e.g. right now if you use concat, read-delimited (the parent fn) has to do a full reduplication of the string just for that, after going to "lengths" in the utf8 %read-delimited to only have one final allocation/copy.

<rlb>ACTION may fix it

<rlb>jcowan: do you know of anything like read-delimited(!) in one of the standards? (Just realized as I started to fix those related functions up, we'd be better of to optimize a standard version, if we could, and implement ours in terms of that.)

<rlb>i.e. our (ice-9 rdelim) read-delimited(!)

<lilyp>r6rs seems to only have get-char and lookahead-char (as well as undelimited get-string)

<lilyp>I don't thin r7rs has much ports going on

<rlb>right, and there's also get-string-n(!) in various places...

<lilyp>there could be an srfi, but none of the ones documented in guile has read-delimited

<rlb>ok, thanks for poking around.

<rlb>It's unfortunate that our scm_read_delimited_x is not read-delimited! (but actually %read-delimited!).

<lilyp>searching for delimited on srfi.schemers.org does not look like it

<lilyp>well, it does make sense that it calls the internal variant – or rather, that the user-visible one is a wrapper around the C thingy

<rlb>Sure, but typically scm_FOO_BAR is identical to foo-bar.

<rlb>not a variant that behaves slightly differently

<rlb>i.e. I think it should have been scm_?_read_delimited_x or something, particularly because I think I'm about to provide a proper scm_read_delimited_x and can't use the obvious name.

<lilyp>i don't know the convention for replacing percent, actually; I do think I've seen it more often than that (and even used it myself)

<lilyp>rewriting the binding should not be that difficult tho – just point it to scm__read_delimited_x (note the double underline)

<rlb>sure, but I don't want to break backward compat

<rlb>since it's been this way and publically documented "forever".

<rlb>(Ideally, unless people really have needed it, I'd rather we hadn't made %read-delimited! public in the first place.)

<lilyp>I mean, I can see from the documentation why there'd be a reason to export the low-level detail

<rlb>It's about to be notably more expensive than read-delimited! if you use concat.

<lilyp>but as long as the interface remains compatible, it should not be an issue? is the c function exported the same way?

<rlb>i.e. we don't want %read-delimited's semantics with utf8.

<rlb>At least not for read-delimited or read-delimited!'s foundation.

<rlb>But we'll just keep %read-delimited! for backward compat.

<lilyp>i'm out of the loop, what's broken with the % variant that's not broken for read-delimited itself in terms of semantics?

<rlb>(dont want it because string-set! is O(n), and also don't want to have an additional "copy everything" string-append.

<rlb>)

<rlb>concat requires an extra full string copy just to add the delim

<rlb>we want to include the delim inside read-delimited(!) while we're building the one final results tring string in C

<rlb>s/s tring//

<lilyp>that's the same as gobble=False, no?

<rlb>look at how read-delimited(!) is implemented in terms of that in rdelim.scm.

<rlb>gobble only affects ungetc

<rlb>%read-delimited still returns the string, and the delimited separately

<rlb>"delimiter"

<lilyp>ahh

<lilyp>is gobble a boolean or an arbitrary scheme object?

<lilyp>if you can pass an arbitrary scheme object, you could decide on a special symbol (e.g. "slurp") that means gobble + append

<rlb>yep

<lilyp>then you can keep the current interface while also avoiding the o(n) string-set!

<lilyp>another thing you could do is to replace gobble with handle-delim directly if you want to do the "new interface" approach; passing #t or #f would then raise a deprecation warning but still reproduce the old behaviour

<rlb>Hmm, not sure that'd be feasible since previous code might well have returned any random true value, and still expect it to work.

<rlb>e.g. via (memq ...) or whatever

<rlb>"and would"

<lilyp>well, first we'd need to track usage to make a statement; it could well be that an api breakage goes unnoticed

<lilyp>and IIRC 3.0 → 3.2 breakages would be fair game anyhow, at least within limits

<rlb>sure, though I tend to favor a fairly high bar there, and if the alternative is to just add a new function (for example), I'd very likely do that.

<rlb>In any case, for now, I'm just making read-delimited(!) and scm_read_delimited efficient and we can decide if we want to do something about scm_read_delimited_x later.

<rlb>(and %read-delimited!)

<rlb>I'd be tempted to consider deprecating them and, if some part of their functionality is still needed, adding a preferred replacement.

<dsmith>rlb, strncpy (famously?) always writes n bytes, and does not necessarily terminal with NUL. Is scm_strndup following that?

<rlb>strncpy will still stop at the first null in the source, if there is one, though, right?

<rlb>This was just about "unnecessary" allocation --- i.e. it'd be unnecessary for strndup (but allowed) but it's currently *necessary* extra behavior for our scm_n_strdup because scm_to_latin1_stringn depends on it.

<rlb>(scm_strndup I meant)

<dsmith>rlb, Yes, stops at the first NUL and fills the rest with NUL.

<rlb>interesting, didn't know it filled.

<rlb>(or didn't remember)

<dsmith>Perfect when writing directory entres in a 7th (or whatever it was) edition file system..

<rlb>:)

IRC channel logs

2026-05-06.log