IRC channel logs

2023-07-30.log

<rlb>Is there already something functionally similar to read-string!/partial for bytevectors?
<rlb>error handling in locale-string->integer is incomplete...
<wingo>good day
<rlb>good evening
<rlb>wingo: the utf-8 changes rebased cleanly on to yesterday's main (so that's good) but I've been noticing functions that have no test coverage, so I'm a little more worried about reliability, fwiw. Assuming I don't run out of steam, I'll probably make a pass to add more tests, though I might push the branch before that so people can start seeing what they think.
<wingo>woo
<wingo>how do i review? do you have a branch somewhere? i wish we had something like gerrit or something
<rlb>(Also wondering about rw.c -- I've converted it, but it doesn't really make sense with utf-8...)
<rlb>I don't have anything public yet -- I'm planning to push something after I finish what I think of as the major remaining changes, and I'm happy to try to accommodate any reviewers' preferences, including presumably yours :) By default, I was likely to push a branch to either sourcehut or codeberg or both, unless something else would be preferable.
<rlb>I have managed to get the indexes working (as mentioned), have removed all internal use of string_set_x in our algorithms, have removed all internal use of start/finish_writing (in favor of bulk operations), and am now reviewing remaining uses of scm_string_chars (since some expect latin-1 "raw bytes", not an ascii subset).
<wingo>i am happiest if i can somehow comment on the branch on some web site. email is least good option for me
<rlb>Cool, have a site preference? (And I can easily handle codeberg, sourcehut or gh, if that helps.)
<rlb>...I have a feeling you're going to have an initial raft of "larger scale" requests, where I got the style or semantics "wrong".
<rlb>I had to make a *lot* of choices :)
<rlb>And even after all this (if it works out) there still may be a good bit of work for me or others to do things like convert relevant remaining uses of scm_string_ref to an approach that operates more efficiently on the bulk utf-8...
<wingo>rlb: something structured as a pull request / merge request with commenting facilities; i think that rules out sourcehut but codeberg or github appear to be ok
<wingo>am happy to deal with rebases, also fine if you incrementally commit on your branch to resolve comments and rebase those commits in place before merging
<wingo>preserving bisectability is a priority but i would imagine that some of these changes need to be quite large
<rlb>I'm pretty sure the entire series is more or less bisectable, and it's a *lot* of patches atm. I tried to do it in understandable pieces, though it's a pretty long "story" at this point.
<rlb>But if we start adding additional tests (with the relevant changes), I suspect that may either require some reworking, or we'll lose some bisectability. e.g. rw.c has *no* tests, and neither does xsubstring or string-xcopy!, and the locale integer conversions don't appear to test non-ascii, and...
<rlb>i.e. the new tests may expose interim problems that weren't apparent before.
<rlb>But I think maybe we agree on the overall approach. I'd been expecting to keep rebasing until we're happy since I feel like particularly for something this big it's a good idea to have as "nice" a history as possible (for poor future me if nothing else).
<rlb>And at least right now, I have most of it in my head, so it's much easier to do larger scale changes across the series (until I run out of energy :) ).
<wingo>generally speaking let's not add gnarly c code. some is necessary of course. but e.g. xsubstring, who cares, nobody uses that, that should be in scheme somehow
<wingo>so in that case i would have an early patch that moves xsubstring to scheme, using list->string eventually or something
<rlb>OK, so then I suspect one of your first requests may end up being "less" :) I was trying to preserve a good bit of performance, even for functions people should be moving away from in the hopes of reducing the severity of the initial conversion.
<wingo>hehe yeah
<wingo>or, xsubstring would use string ports ideally...
<rlb>On the other hand, some of the operations are fairly similar, using a bulk api, and I did favor local stack alloc with overflow to gc alloc in places, which makes the mess of getting free right go away.
<wingo>xsubstring could even use bytevector ports i guess to avoid conversion overhead while building; that still requires a utf-8 parse at the end but that's probably in the noise
<rlb>For what it's worth, I'm likely happy with some pretty radical reworks once I know what we prefer, time permitting. This initial pass is likely to be on the more performance-oriented end of the spectrum, though hopefully not *too* crazy. Though I'm contemplating a pass before the first push to remove all the "mutable ascii" optimization paths. I did that initially, but then stopped later in the series.
<rlb>I currently handle xsubstring in two passes, one to compute the u8 byte length, and then one to encode in-place in the stringbuf.
<rlb>fwiw
<wingo>nice
<rlb>same for string-xcopy!
<rlb>They just differ in the final step.
<rlb>(i.e. swap the buffer pointer or create a new string)
<rlb>(new string wrapping the new buffer)
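The two-pass approach described above (first compute the UTF-8 byte length, then encode into a buffer of exactly that size) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code from the branch: the names `utf8_len`, `utf8_put`, `wrap_index` and `xsubstring_utf8` are hypothetical, the source string is assumed to be an array of valid Unicode scalar values rather than a stringbuf, and `malloc` stands in for whatever allocation the real code uses.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Bytes needed to encode one valid Unicode scalar value as UTF-8. */
static size_t utf8_len(uint32_t c)
{
  if (c < 0x80) return 1;
  if (c < 0x800) return 2;
  if (c < 0x10000) return 3;
  return 4;
}

/* Encode one scalar value at OUT; returns the number of bytes written. */
static size_t utf8_put(uint32_t c, unsigned char *out)
{
  if (c < 0x80) {
    out[0] = (unsigned char) c;
    return 1;
  }
  if (c < 0x800) {
    out[0] = (unsigned char) (0xC0 | (c >> 6));
    out[1] = (unsigned char) (0x80 | (c & 0x3F));
    return 2;
  }
  if (c < 0x10000) {
    out[0] = (unsigned char) (0xE0 | (c >> 12));
    out[1] = (unsigned char) (0x80 | ((c >> 6) & 0x3F));
    out[2] = (unsigned char) (0x80 | (c & 0x3F));
    return 3;
  }
  out[0] = (unsigned char) (0xF0 | (c >> 18));
  out[1] = (unsigned char) (0x80 | ((c >> 12) & 0x3F));
  out[2] = (unsigned char) (0x80 | ((c >> 6) & 0x3F));
  out[3] = (unsigned char) (0x80 | (c & 0x3F));
  return 4;
}

/* Floor modulo, so negative indexes wrap the way SRFI-13 xsubstring expects. */
static size_t wrap_index(long i, size_t len)
{
  long m = i % (long) len;
  return (size_t) (m < 0 ? m + (long) len : m);
}

/* Hypothetical helper: characters FROM..TO of the infinite replication of
   CHARS, returned as a freshly allocated UTF-8 buffer (caller frees). */
unsigned char *xsubstring_utf8(const uint32_t *chars, size_t len,
                               long from, long to, size_t *out_bytes)
{
  assert(len > 0 && from <= to);

  /* Pass 1: compute the exact byte length of the result. */
  size_t total = 0;
  for (long i = from; i < to; i++)
    total += utf8_len(chars[wrap_index(i, len)]);

  unsigned char *buf = malloc(total ? total : 1);
  if (buf == NULL)
    return NULL;

  /* Pass 2: encode in place into the exactly sized buffer. */
  size_t pos = 0;
  for (long i = from; i < to; i++)
    pos += utf8_put(chars[wrap_index(i, len)], buf + pos);

  assert(pos == total);
  *out_bytes = total;
  return buf;
}
```

In this sketch the final step is just returning the buffer; the difference mentioned above between xsubstring and string-xcopy! (wrapping the buffer in a new string versus swapping it into an existing one) would happen after this point.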
<rlb>Hopefully, though it is a chunk of C, and as you suggest, we'll need to decide where to draw the line. If you're curious (if not, no worries), here's what I came up with there yesterday: https://paste.debian.net/hidden/70d9ce26/
<rlb>The code at the end of string-xcopy! uses the current "edit" api that allows you to change a region of characters within a stringbuf to get a new stringbuf. It does that in two steps so you can compute the new byte count for the region "in the middle".
<rlb>fwiw
<rlb>It's not sufficient for all cases, but covers a number of useful ones in srfi-13, etc.
<rlb>dunno - lots of other ways to approach that, but you have to keep track of the possible stringbuf prefix and suffix (before and after the edit) for the final buffer if/when the buffer is shared.
<rlb>I tried to keep a lot of the details when possible away from code outside strings.c.
<rlb>I've also fixed a reasonable number of bugs in the current code as I've been going. Though if this doesn't work out, I should have kept a list... e.g. some error handling in i18n, etc.
<rlb>wingo: I've also wondered about adding some generative tests for various functions, i.e. that test on random valid unicode strings, and with a generator that always includes some of the more likely important cases, i.e. empty string, all null string, non-null string with final null, string with one or more internal nulls, all ascii, all non-ascii, etc.
<rlb>(Though it could be that we only need a handful of fixed cases like that, i.e. if the more randomized approach would really just be double-checking our utf-8 libs.)
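The generative-test idea above, with a generator that always includes the important fixed cases before the random ones, might look something like the following sketch. Everything here is an assumption for illustration: the helper names are hypothetical, the property checked (encoding N characters and then counting non-continuation bytes gives N back) is just one simple example, and a small deterministic LCG replaces a real random source so that failures are reproducible from the seed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_CHARS 64

/* Encode one valid Unicode scalar value as UTF-8; returns the byte count. */
static size_t utf8_put(uint32_t c, unsigned char *out)
{
  if (c < 0x80) {
    out[0] = (unsigned char) c;
    return 1;
  }
  if (c < 0x800) {
    out[0] = (unsigned char) (0xC0 | (c >> 6));
    out[1] = (unsigned char) (0x80 | (c & 0x3F));
    return 2;
  }
  if (c < 0x10000) {
    out[0] = (unsigned char) (0xE0 | (c >> 12));
    out[1] = (unsigned char) (0x80 | ((c >> 6) & 0x3F));
    out[2] = (unsigned char) (0x80 | (c & 0x3F));
    return 3;
  }
  out[0] = (unsigned char) (0xF0 | (c >> 18));
  out[1] = (unsigned char) (0x80 | ((c >> 12) & 0x3F));
  out[2] = (unsigned char) (0x80 | ((c >> 6) & 0x3F));
  out[3] = (unsigned char) (0x80 | (c & 0x3F));
  return 4;
}

/* Count characters by counting bytes that are not continuation bytes. */
static size_t utf8_char_count(const unsigned char *buf, size_t n)
{
  size_t count = 0;
  for (size_t i = 0; i < n; i++)
    if ((buf[i] & 0xC0) != 0x80)
      count++;
  return count;
}

/* Deterministic LCG so a failing seed can be replayed. */
static uint32_t next_random(uint32_t *state)
{
  *state = *state * 1664525u + 1013904223u;
  return *state >> 8;
}

/* Random valid scalar value: 0..0x10FFFF, skipping the surrogate range. */
static uint32_t random_scalar(uint32_t *state)
{
  uint32_t c = next_random(state) % (0x110000u - 0x800u);
  return c < 0xD800u ? c : c + 0x800u;
}

/* The property under test: encode N characters, count them, get N back. */
static int count_round_trips(const uint32_t *chars, size_t n)
{
  unsigned char buf[4 * MAX_CHARS];
  size_t bytes = 0;
  assert(n <= MAX_CHARS);
  for (size_t i = 0; i < n; i++)
    bytes += utf8_put(chars[i], buf + bytes);
  return utf8_char_count(buf, bytes) == n;
}

/* Runs the fixed edge cases first, then ROUNDS random strings.
   Returns 1 if every case passes, 0 otherwise. */
int run_generated_cases(uint32_t seed, int rounds)
{
  static const uint32_t all_nul[] = { 0, 0, 0 };
  static const uint32_t final_nul[] = { 'a', 0xE9, 0 };
  static const uint32_t inner_nul[] = { 'a', 0, 0x20AC, 0, 'b' };
  static const uint32_t all_ascii[] = { 'h', 'e', 'l', 'l', 'o' };
  static const uint32_t non_ascii[] = { 0xE9, 0x20AC, 0x1F600 };

  if (!count_round_trips(all_nul, 0)) return 0;   /* empty string */
  if (!count_round_trips(all_nul, 3)) return 0;
  if (!count_round_trips(final_nul, 3)) return 0;
  if (!count_round_trips(inner_nul, 5)) return 0;
  if (!count_round_trips(all_ascii, 5)) return 0;
  if (!count_round_trips(non_ascii, 3)) return 0;

  uint32_t state = seed;
  for (int r = 0; r < rounds; r++) {
    uint32_t chars[MAX_CHARS];
    size_t n = next_random(&state) % (MAX_CHARS + 1);
    for (size_t i = 0; i < n; i++)
      chars[i] = random_scalar(&state);
    if (!count_round_trips(chars, n))
      return 0;
  }
  return 1;
}
```

As noted above, a property this simple mostly double-checks the UTF-8 helpers themselves; the same fixed-cases-then-random structure would be more useful with a property about the string function actually under test.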