IRC channel logs
2026-04-06.log
<rlb>Anyone already have an idea offhand what our current, rough cost is to cross over to C (e.g. for a call to something in libguile)?
<rlb>I'm wondering in part because it might well be nice to implement more of strings, srfi-13, srfi-14 in scheme, though right now we can't (easily) anyway, until we make it feasible to have a core module that's only partially in C --- last time I looked that didn't work, but I don't recall the details.
<rlb>(at the time, I think I was toying with making either srfi-13.c or strings.c "hybrid")
<rlb>Not a big deal right now, though --- the heavy lifting on that front's already done for the utf8 branch.
<rlb>(Just converted a few more functions to work directly with the utf-8, avoiding string_ref (string-set_x use is already gone), e.g. string-prefix? string-index string-skip...)
<jcowan>What's the plan going to be for string-set! in a UTF-8 world? There are a number of alternatives.
<rlb>jcowan: current plan, and implementation, is "avoid", since it's likely O(n). Though I think the current branch may also allow ascii-only stringbufs (and hence strings that refer to them) to mutate in place, but there may be advantages to dropping that and making all string (contents) immutable.
<rlb>I suspect having read-only strings completely "in line" might have notable advantages wrt locality, cache prefetching, etc.
<rlb>Now, in the C code (internally) we generally mess with the content "in bulk", and having no mutation of non-homogeneous (non-ascii-only) stringbufs allows us to keep things "safe" wrt concurrency without the lock we used to have (I think/hope).
<rlb>(where guile defines safe as "you may not get what you expected, but it won't crash the process")
<rlb>i.e. we can just swap stringbuf pointers
<rlb>jcowan: and note that string-*ref* is still O(1), though the constant factor is larger than it was, via in-line sparse string indexes with dynamic granularity.
<rlb>(i.e. for strings less than 256 chars, the index entries are a byte, etc.)
<rlb>and the index stride also scales with the string length
<jcowan>Do you also plan adjustable-length strings?
<rlb>Depending on what you mean, no? i.e. if you change the content of a non-ascii string, or the length of a non-ascii string (or content if we decide to drop in-place mutability there), a new stringbuf is created and the string's stringbuf pointer is swapped.
<rlb>And if it's a "shared" string, there's *another* level of indirection.
<rlb>But the creation of the new stringbuf is more or less all memcpy, etc. for any carryover prefix/suffix of the operation, so it should be pretty cpu/ram friendly.
<rlb>at least for the functions that have been "fully converted"
<rlb>Of course, being in C, some of it's not entirely pretty, and not *all* of that's my fault, though some of it certainly is.
<rlb>Oh, and of course, the recommendation is to shift to bulk construction, etc. string-join, string-append, string-tabulate, string-map, string-unfold ...
<jcowan>Certainly. But as long as you have indirection, variable-length strings become straightforward.
<jcowan>SRFI 118 gives string-append! and string-replace!, which are pretty nice to have.
<jcowan>Of course they involve copying the whole string, unless you go all the way to a rope implementation
<jcowan>"Unfortunately, the usefulness of a fixed-size mutable string is extremely minimal: The main use is for a buffer to which one incrementally adds more data. So why not fold that functionality into the string API? I.e. why can't you add just data to the end of a string?"
<rlb>Hmm, I don't know that I've thought about that in particular much yet (still just coming back up to speed after mostly setting the branch aside for a year or two / waiting).
<rlb>I'd probably have to think about how it'd fit in.
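For readers following the index discussion above, here is a rough C sketch of how a sparse index can bound the cost of finding character i in a UTF-8 buffer. It is not the utf8 branch's code: the names `STRIDE`, `utf8_seq_len`, `build_index`, and `char_offset` are made up for illustration, the stride is fixed rather than scaled with string length, and index entries are plain `size_t` rather than the byte-sized entries rlb mentions for short strings. It also assumes the input is valid UTF-8.

```c
#include <stddef.h>

/* Hypothetical fixed stride: one index entry per STRIDE characters. */
enum { STRIDE = 4 };

/* Number of bytes in the UTF-8 sequence whose lead byte is b.
   Assumes valid UTF-8 and that b really is a lead byte. */
size_t
utf8_seq_len (unsigned char b)
{
  if (b < 0x80) return 1;
  if (b < 0xE0) return 2;
  if (b < 0xF0) return 3;
  return 4;
}

/* Fill index[k] with the byte offset of character k * STRIDE and
   return the number of characters in buf[0..n_bytes).  The caller
   must provide at least (n_bytes + STRIDE - 1) / STRIDE entries. */
size_t
build_index (const unsigned char *buf, size_t n_bytes, size_t *index)
{
  size_t n_chars = 0;
  for (size_t off = 0; off < n_bytes; off += utf8_seq_len (buf[off]))
    {
      if (n_chars % STRIDE == 0)
        index[n_chars / STRIDE] = off;
      n_chars++;
    }
  return n_chars;
}

/* Byte offset of character i: jump to the nearest indexed character
   at or before i, then step over at most STRIDE - 1 sequences. */
size_t
char_offset (const unsigned char *buf, const size_t *index, size_t i)
{
  size_t off = index[i / STRIDE];
  for (size_t k = i % STRIDE; k > 0; k--)
    off += utf8_seq_len (buf[off]);
  return off;
}
```

The lookup never walks more than `STRIDE - 1` sequences, which is where the "larger constant factor" comes from; ascii-only buffers need no index at all because character i is simply byte i.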
<rlb>For some cases, without a fancier string representation, I assume you'd still want to do something else, like build a list of parts and then string-append them.
<rlb>i.e. the typical "so you don't keep reallocating/copying" issue
<rlb>(e.g. don't use str += in python)
<jcowan>Well, don't use it in a loop, no.
<jcowan>A rope implementation gives you a tree of stringbufs, so there is minimal to no copying.
<jcowan>I proposed string->vector and vector->string into R7RS to make it easier to manipulate strings at the codepoint level
<rlb>Right --- I've exclusively been focused on ye-olde contiguous buffers since my understanding was that one of the intended uses for all this was so guile could call typical C apis directly with the string bytes, though I suppose only completely cheaply if the C api has a length param, or you know the string is "minimal".
<rlb>(i.e. currently the stringbufs are still null terminated, but strings (which slice into them) are clearly not)
<rlb>There was a long-standing comment, iirc, that seemed to suggest wanting to remove the stringbuf null terminator, but I'm not sure about that --- i.e. seems like really cheap protection against "much worse" if something goes wrong.
<rlb>But no strong feelings atm --- for now, we've kept it.
<rlb>It's probably often free anyway since we (void *) align the inline string index right after the content bytes.
<rlb>(which likely requires some padding)
<jcowan>What do you do about strings that contain #\u0;? Chibi checks for them before passing a string to the FFI and throws an exception.
<rlb>I guess it's more expensive for ascii-only strings which have no index.
<jcowan>You may get better performance by having three types of stringbufs with 1, 2, and 4 byte elements, changing over when one of the higher-numbered characters is set! into the string.
<rlb>stringbufs are all utf8 now
<rlb>(or ascii-only) so byte
<jcowan>That's what Java does, although only for 1 and 2 bytes, because the API exposes UTF-16
<rlb>And wrt null that's a good question --- I don't recall anything specific offhand there, but now you have me wondering if I may have some fixing up to do. i.e. I mostly just carried over what the existing code was doing, but I wonder if I made mistakes wrt use of functions like strncmp...
<rlb>i.e. probably need memcmp?
<rlb>Think I just made that mistake today in one place, so glad we've chatted, and thanks :)
<rlb>This now rings a vague bell --- suspect I had thought about this back last time when this was all top of mind.
<jcowan>Sure. I've been dancing the Unicode Jig around Scheme (and other languages) for years.
<rlb>jcowan: just forced at least one null into the utf8 series' commonly used (test-suite data) short-random-string. That should help keep us honest.
<rlb>(That's 5-15 chars taken fresh for each test process from the unicode "designated-chars", now plus null.)
<rlb>The randomized data has helped uncover any number of bugs. I'm actually a touch surprised that the now mandatory null doesn't appear to break anything.
<rlb>(perhaps previous me *was* paying attention...or our coverage needs expanding)
<jcowan>Probably when it's passed to C routines everything after the NUL is ignored.
<jcowan>That updates the charset definitions past Java 1.0 and makes them Unicode-compliant
<rlb>I also need to double-check what we currently do in main. I think I recall seeing str??? (e.g. strlen) scattered around, and now I wonder if main itself is always getting that right (wrt null).
<probie>jcowan: [You may get better performance by having three types of stringbufs...] isn't that what guile already has, just with only 1 and 4 byte elements?
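The strncmp-versus-memcmp worry above can be shown in a few lines of C. This is only an illustration of the standard library behaviour, not code from libguile; `bytes_equal` and `bytes_equal_strncmp` are hypothetical helper names. `strncmp` stops comparing at the first NUL, so two length-delimited buffers that differ only after an embedded NUL look equal to it, while `memcmp` compares every byte.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: compare two length-delimited byte strings.
   memcmp examines every byte, so bytes after an embedded NUL count. */
bool
bytes_equal (const char *a, size_t a_len, const char *b, size_t b_len)
{
  return a_len == b_len && memcmp (a, b, a_len) == 0;
}

/* The same comparison written with strncmp, which stops at the first
   NUL in its inputs and therefore ignores everything after it. */
bool
bytes_equal_strncmp (const char *a, size_t a_len,
                     const char *b, size_t b_len)
{
  return a_len == b_len && strncmp (a, b, a_len) == 0;
}
```

With the five-byte inputs `ab\0cd` and `ab\0xy`, the `strncmp` version reports equality and the `memcmp` version does not, which is the kind of bug a test string with a forced null is meant to catch.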
<jcowan>yes; it makes string-ref cheaper in time, but costs more storage and may limit I/O performance
<jcowan>In an implementation I'm working on, i use 4-byte representation exclusively
<jcowan>so I/O and calling C are more expensive
<jcowan>but the language (PL/I) requires overlaid strings to work correctly: thus a string of length 6 may be united with three strings of length 2, which means we can't fool with variable widths or nulls. The alternative is to extend the language with externally visible 2-byte and 4-byte string types
<jcowan>Given the increasing dominance of 64-bit systems, 4-byte representations seem reasonable.
<rlb>dunno --- cache locality/density for mostly ascii-ish, and the surprising branch prediction we typically have now may be good for utf8.
<rlb>and guile has been latin-1 or utf-32 to date
<rlb>I think the jvm may have changed too wrt internal representation(s), but don't recall where that is now.
<rlb>Looks like we may not be checking everywhere (in main) wrt embedded nulls.
<rlb>I wonder if we want scm_from_*_stringn to treat embedded nulls as an error if you pass them -1 size (which provokes a strlen() with the current api...).
<rlb>I'm not entirely fond of the -1 approach...
<rlb>from a quick scan, "I see things", e.g. scm_strndup is "wrong" --- copies n, even if the input string has a null in the middle.
<rlb>we allow a port encoding with an embedded null, but then truncate it when assigning it to the port structure on the C side.
<ekaitz_>rlb: is the BoF going to be recorded?
<rlb>hmm, not that I know of.
<ieure>How can I exclude or rename a Guile builtin in a module? I want to define a procedure named `when'.
<ekaitz_>ieure: just defining it doesn't work?
<ieure>ekaitz_, It emits a warning. I may also want to use Guile's `when' in the package, so my preference is to rename it.
<ieure>Maybe I just knuckle-drag and (define guile-when when) (define-public when ...) ?
<rlb>You may want to do a rename on import. I'll get an example in a minute.
<ekaitz_>you also have #:replace in (define-module ...)
<rlb>And on export you can do renaming with a dotted pair, fwiw.
<ieure>I see, #:pure avoids using (guile), which lets me use it with some options.
<ieure>mooseball, No, I'm designing this to be used with a #:prefix by users, so they'd do ex (abc/when ...)
<ieure>Seems like #:replace is for when you use the default behavior of dumping all the module symbols into the using module's namespace.
<mooseball>i'm currently trying to build haunt, because i wanted to use a patched version of it and i don't know how to use guix. but haunt won't build. ./bootstrap works, but configure fails like so: ./configure: line 3038: syntax error near unexpected token `3.0'
<mooseball>./configure: line 3038: `GUILE_PKG(3.0 2.2 2.0)'
<mooseball>@ieure, or there's #:prefix plus #:renamer... but i haven't tried it
<mooseball>anyone able to offer a pointer on getting haunt to build?
<rlb>mooseball: I suspect relevant people will be around eventually, if not now.
<dsmith>old, I'd also like to express my thanks in putting the bof together.
<old>dsmith: thanks for coming by!
<old>For those who had to leave, I was thinking of making another BoF session in about 3 months. Somewhere around July/August
<dsmith>I had some $DAYJOB meetings at the same time so I missed a bit,
<old>Right. I was hoping some of us would be off work with easter
<old>here most people have no work friday and monday or one of the two
<old>thus the awkward monday date
<dsmith>I'm work-from-home Mon and Fri, so those are good days
<dsmith>(the office has some very rigid firewalls, so can't irc and other things. Unless port 443)
<jcowan>dsmith: IRCCloud is your friend; US$5 a month or $50 a year
<dsmith>jcowan, Thanks, will look into it.
<dsmith>Yes, I have used the libera web interface. Was a bit of friction with being able to post in #guile
<old>I'll try to find sometime this week to start reviewing all of that
<old>but I have to review things at work first :-)
<old>wonder if we would be interested to actually have what I am working on in Guile at some point
<old>a "fractal trie" a Judy array implementation in URCU
<old>it's crazy fast and minimal memory overhead
<rlb>No huge rush --- it was good you waited another week, once I started ramping back up. Found/fixed a few more bugs over the past "days".
<old>We are only exposing a trivial open addressing hash table for now
<rlb>interesting --- didn't remember judy arrays.
<old>it's very niche and not well documented data structure
<old>FYI, we are making this fractal trie for BIND9 of ISC for DNS lookup
<rlb>I'm in general, in favor of some additional good data structures --- I'd still love for us to have some plausible set of basic persistent data structures, and related bits.
<rlb>e.g. persistent set, map, vector at least.
<rlb>(Andy has fector/fash (for which I think he'd like better names), though the hash table needs a few bug fixes and doesn't support "dissoc" (deletions) yet.)
<rlb>I may see if I'm smart enough to add deletions at some point, since I need that for lokke (right now I do something terrible instead).
<rlb>Did you mean that you thought the judy arrays might replace something we're currently using, or just be available for general (new) use?
<jcowan>Note that alists are de facto a persistent map, although of course they require O(n) time
<old>rlb: no I just think that it might be very good for certain class of operations
<old>in the case of BIND9, name resolution with prefix
<jcowan>Here's a curious fact: ring buffers and Dijkstra arrays are duals
<old>or IP address with prefix
<jcowan>(a Dijkstra array is one which you can push new elements at either end)
<jklowden>jcowan: paging John Cowan. Please report to the PL/1 desk for consultation.
<dthompson>for which civodul said I should send upstream to guile but I haven't done it yet
<rlb>dthompson: ahh, right, that's one of the other possibilities I was trying to remember :)
<dthompson>our persistent hashmap uses some of the optimization tricks in fash.scm and supports deletes
<dthompson>what it doesn't have is the transient stuff, which I feel is overly complicated
<rlb>My persistent vector implementation is small (also currently in C).
<rlb>It's easy to do that because clj's vectors are exactly 32-bit.
<rlb>(index-count, not of course, content size)
<dthompson>yeah seems good. hashmaps work on the same principle
<rlb>And of course for adding to guile, C is probably "fine" --- not as much for an external lib.
<rlb>But it could also be ported to scheme without too much work I suspect.
<rlb>Also could consider starting with C, since it exists, and migrate it later, but don't even know if we want it in the first place.
<dthompson>yeah the fact that the code exists is a good argument in its favor
<rlb>That would also allow us to vet the performance of the port to scheme against an existing baseline.
<dthompson>as one potential customer of many, I just wouldn't be able to use it because I need something that I can compile to wasm
<rlb>oh, right --- I keep forgetting about that case (and I really shouldn't)
<rlb>(I shouldn't forget, since that was part of the argument for me porting srfi-1 to scheme ;) )
<dthompson>I have even been able to load and interpret the entire thing from a hoot repl
<rlb>If it turned out that the C one *was* a lot faster at the moment --- it's small enough, and I suspect will be notably smaller in scheme, that we could even consider having "both".
<rlb>i.e. using the C one when we can, and the scheme one as a fallback --- were that justifiable.
<rlb>If I get bored one day, maybe I'll poke at porting it...
<rlb>Though aren't there also some srfis wrt persistent data structures...
<dthompson>a pure scheme persistent vector is on my todo list as well
<dthompson>I haven't seen a srfi with an api that I liked... I don't recall the numbers I've looked at though
<rlb>I'd unimaginatively pondered pset-* pvec-* pmap-* for names, but I suspect "persistent" may already not be all that well known a name for the broader domain.
<rlb>(were we to add said things to guile)
<dthompson>I don't have a good name for persistent vectors, but I think "hashmap" is a good name for persistent hash tables
<dthompson>I don't think "persistent" needs to be in the name, necessarily
<rlb>I guess I'm thinking about it as compared to hash-table --- wondered if we wanted to make the distinction clear.
<dthompson>I have found the two names distinct enough in goblins, which uses both in different places.
<jcowan>The last thing you want is for hashtable and hash-table to mean two different things
<rlb>but clj does use hash-map, though that's just one concrete constructor for a "persistent map", i.e. they also have sorted-map
<rlb>(But they also don't have the question of coming up with a prefix to use for every related function name.)
<dthompson>I should clarify that the implementation I've been referring to is unordered
<rlb>same for (hash-set ...), (sorted-set ...), and (sorted-set-by ...)
<rlb>All produce different flavors of sorted sets.
<jcowan>Eh, that's why Emacs (and other IDEs) have autocomplete.
<rlb>Suppose we could also put these under some (guile|ice-9 immutable map|set|vector) sub-domain, but we'd still need sensible function name prefixes.
<dthompson>they should absolutely be in modules and not the default (guile) module
<dthompson>the (guile) module is my arch nemesis at this point
<rlb>The "immutable" bit would at least make it obvious what they're offering.
<rlb>dthompson: we did briefly discuss generally avoiding further additions to the default namespace at the bof.
<rlb>(of having a high bar there)
<rlb>there'll be another in a "months"
<rlb>I'd intended we should mention it here again closer to time, but didn't think of it until this morning just before.
<rlb>ACTION thanks old again for deciding to have them
<rlb>ACTION is (briefly) surveying vc-git-grep -E -e 'strn?(len|cmp|dup|cpy|cat)' -- '*.[ch]' for cases where we might be overlooking truncation via embedded nulls...
<rlb>old: hmm, now I'm wondering whether the scm_to_(utf8|latin1|...)_stringn functions actually should reject scheme strings with embedded nulls. Since they have the "size_t *len" output parameter, there's no reason they can't handle them, and they're the only way to get those scheme strings from C. i.e. maybe the scm_to_*_string functions should reject, and the scm_to_*_stringn functions should not.
<rlb>Though ideally, you might want to have a choice, in case the conversion can do the detection for free (i.e. so you don't have to make an extra memchr() pass after the call to look for embedded nulls when they're not valid for your case).
<rlb>e.g. scm_to_utf8_stringnn (SCM str, size_t *lenp, bool *has_null) or whatever.
<rlb>Oh, I forgot that stringn already specifies that it throws for embedded null if lenp is NULL.
<rlb>(scm_to_stringn) so you have a way to ask
<old>rlb: Yeah I was supposed to make a reminder before .. but it was easter and busy with family and all
<rlb>old: ignore my has_null stuff above --- I was just confused.
<old>and I woke up at 8 this morning instead of the usual 6:40, thanks to my parents-in-law that took care of my son for the night
<rlb>forgot we have that handled api-wise, if I implement it right.
<old>so I was later even for making a last minute announcement reminder
<jj>lovely bof this morning
<rlb>I should have thought to do it yesterday (or fri) assuming you'd have wanted that.
<old>yeah. Maybe sneek could have this functionality?
<rlb>wire it into "at" :)
<old>sneek: reminder 6th of August; BoF! https::// ...
<rlb>sneek: at now + 10 min ...
<jj>if someone does know how ,optimize (or x y #t z) short-circuits to (or x y #t) without first needing to expand to (if x (if y (if #t (if z)))), i'd be very curious to know
<old>jj: thanks! happy that you've come
<old>jj: now that I don't have the stress of live sharing the screen to 20 people, I could probably dig around more :p
<jj>old: i'll drag my friend (mra) along next time! perhaps we can discuss guix and zfs some
<old>so the relevant optimizer for tree-il is in language/tree-il/optimize.scm (make-optimizer)
<rlb>old: btw, for percolation, and related to our discussion the other day about system data, *if* we decide we think we'd like the bar to be low for "normal" programs to handle all valid filesystem paths (for example), then the noncharacters encoding seems a likely approach, *but* were we to go that route, perhaps we might only apply it selectively by default, i.e. if it's only applied by default to some things like paths, user/group
<rlb>names, filesystem attributes, etc., but not ports, then perhaps the risk tradeoffs are more reasonable...
<old>if you put some `pk` statement inside the `run-pass!` syntax, you ought to see the different passes being applied and which one did the short-circuit
<jj>btw, completely tangential -- is guile's psyntax related to implicit/explicit phasing?
<old>rlb: well for sure we would need to handle environment variables first I guess
<old>right now we assume locale encoding
<old>jj: I have no idea what that meant (phasing)
<jj>old: umm, i mostly hear about it in arguments between the r7rs-large folks and the racket folks
<rlb>I mostly just meant that I was wondering if maybe avoiding file/network content (and whatever else is (much?) more likely to be the source of a "risk") might be feasible.
<jj>phasing as a general thing is having multiple syntax levels, i.e. racket has an expand-once form to complement expand and also a for-syntax form to go up a syntax level
<rlb>i.e. much more plausible that you'd have bobby tables via content from a file or over a network socket than via a path?
<jj>the terminology i may be messing up, i have mostly learned this thru osmosis (i.e. not in a well-structured way)
<mra>jj: I'd feel bad subjecting more people to my ZFS woes lol
<mra>Besides, I think that Hako is making more progress on it these days than I am
<rlb>old: right locale encoding *and* "corrupt the data by default"
<rlb>as opposed to an error
<old>hmm I think the default is an error, no?
<rlb>no -- replacement with ?
<old>you can try with main guile, I merged something related to that recently
<old>try to put some weird character on the command line
<rlb>Try this if you have a typical locale:
<old>the arguments are decoded as locale and it should throw an error I think
<rlb>guile -c '(display (program-arguments))' $'foo-\xb5'
<old>jj: so it is indeed peval that's doing the short-circuit
<rlb>I think it should be an error by default
<old>ahh right maybe the default is replace then
<rlb>we default to the REPLACE handler
<rlb>silent mangling isn't ideal imo :)
<old>idk, feels like a program would crash even if the arguments are not used by it
<old>same for the environment?
<jj>old: hmm, i guess my question is then, how does guile recover the (or) form from (inlinables #<tree-il ...>)
<old>I guess that one is lazily decoded by access so it's fine
<jj>(s/heuristics/manual reconstruction of forms/)
<rlb>well, either the error handler or a noncharacters handler
<old>Well it's odd that the or form is not fully expanded. Since or is defined in boot-9 as a syntax-rule, I would have assumed that it expands fully to if statements
<rlb>old: in any case mostly I'd just wondered if restricting the noncharacters handler's domain might allow us to cover the main important cases, without introducing unwarranted risk.
<old>then peval gets the rest done
<rlb>e.g. you still have to say what you mean, if you don't want locale-with-error-handler for ports.
<rlb>but paths are noncharacters
<old>it's odd. When I call the or syntax-transformer directly, I do get the full expansion you are expecting
<old>rlb: I think so yes. But then again, we should go by step
<old>we get utf-8 merged, then we can decide on the policy of handling things before release
<rlb>oh sure, still just mulling things over.
<old>because I suspect we don't want the noncharacter handler to be the default everywhere
<rlb>(while fixing more (many of my own) bugs)
<rlb>And I *have* tended to think we don't want it for ports (wasn't even thinking about ports originally --- too obsessed by paths and argv :) )
<old>well for ports, we already have textual and binary ports
<old>one is with locale and the other is with bytevector
<old>so I guess, it's up to the user to change their handler for textual ports
<old>and the default is error and ought to stay that way I guess
<old>well I think it is error ?
<rlb>yeah, as you say (I think) --- just need to think through the cases, and then one step at a time...
<old>that would be _bad_ if not
<rlb>...there's no (portable) uint32_t equivalent for memchr is there? Of course there's wmemchr, but wchar_t's size *could* vary.
<rlb>(I suspect, posix-wise)
<rlb>(Just adding utf32 fn null termination check(s).)
<rlb>hah, right --- thanks, I'd forgotten about it.
<rlb>hmm, are *we* _GNU_SOURCE?
<rlb>That'd be worth it if they do any fancy optimizations...
<rlb>though I also suspect utf32 use will become much rarer if we switch to utf8...
<old>rlb: write a for loop if not on linux ?
<old>otherwise, if gnu-source, use wmemchr
<dsmith>I suspect it doesn't search on uint32_t boundaries
<old>might as well just write a wrapper thing for that. I think we have this pattern for lots of things
<rlb>looks like we probably define GNU_SOURCE
<rlb>i.e. we definitely do in some files
<rlb>but I'm wondering if it's also now in config.h or similar as a blanket def...
<rlb>I'll just use it for now, and we can sort it out in review.
<rlb>oh, wait, memmem is substring search --- that's not quite what I need. I'll just stick with the loop.
<dsmith>So are strings of 4byte wide chars on 4byte wide boundaries?
<dsmith>So actually strings of uint32_t's ?
<old>dsmith: for utf-8 not necessarily I think
<rlb>right, if you ask for scm_to_utf32_string, you'll get an array of utf32_t.
<rlb>(with 4 null bytes at the end, iirc)
<rlb>but you'll actually get a char* type-wise :)
<rlb>oh, wait, no right, you get a proper array
<rlb>though for us it's scm_t_wchar*
<rlb>I hope we assert somewhere that sizeof(uint32_t) == sizeof(scm_t_wchar)...
<rlb>ACTION makes a note to double check that and what we do...
<dsmith>This stuff was so much simpler when everything was just ascii and a byte==char
<rlb>old: in case you (or anyone else) knows --- I'm now wondering about scm_to_stringn. When lenp is null, it's supposed to return a null terminated char*, and throw if there are any internal nulls. But I see that we just add a single byte for the null termination, which is wrong for utf-16/32?
<rlb>(This is in current main, not the utf8 branch.)
<rlb>Also, where we just hand over the encoding to mem_iconveh, how can we even know what the right size null is?
<rlb>sorry, u32_conv_to_encoding
<rlb>Oh, of course, we just need to encode a null too :)
<rlb>ACTION will fix it, assuming for now, he's right about the issue
<rlb>should I try to fix it in main first...
<old>it will have to wait this evening, I'm leaving for physio-therapy for my dog :-)
<rlb>No rush, it's been this way for a *long* time I suspect :)
<rlb>I'll fold something similar into the utf8 branch.
<rlb>old: and that may still not be quite right, i.e. we're supposed to throw on an internal null character, and for that path, we haven't been.
<rlb>I'll have to consider how to do that.
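A small C sketch of the terminator-width point above: if, as rlb suspects, a single zero byte is appended regardless of encoding, then UTF-16 or UTF-32 output is not properly terminated, because a consumer of those encodings looks for a zero code unit of 2 or 4 bytes. The function `dup_with_terminator` and its `unit_size` parameter are hypothetical, and this is not how libguile does it; rlb's own idea is to encode a null character through the same converter.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: copy n_bytes of already-encoded text and append
   a terminator as wide as one code unit of the target encoding
   (1 for UTF-8, 2 for UTF-16, 4 for UTF-32).  Returns NULL if the
   allocation fails; the caller frees the result. */
char *
dup_with_terminator (const char *src, size_t n_bytes, size_t unit_size)
{
  assert (unit_size == 1 || unit_size == 2 || unit_size == 4);
  char *dst = malloc (n_bytes + unit_size);
  if (dst == NULL)
    return NULL;
  memcpy (dst, src, n_bytes);
  memset (dst + n_bytes, 0, unit_size);  /* full-width zero unit */
  return dst;
}
```

Passing the unit size explicitly only works when the caller knows the encoding's code unit width; for an arbitrary iconv encoding name that is not generally known up front, which is why encoding the null along with the text is the more general answer.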
<rlb>It won't be ideal if we have to traverse the string a second time, just for that --- I suppose the alternative might be to do the conversion character by character which would be more complex (and might or might not be faster, depending on what the bulk operations do).
<rlb>Oh, wait, we handle that bit earlier on the scheme side --- nice, though I ought to convert that to a bulk operation.
<rlb>old: ok, maybe fixed wrt both terminator cases. I'll double-check it later.
<jcowan>If you are worried about it, examine every nth byte where n = 1 or 4 as the case may be.
<jcowan>or no, you need to examine every byte or long as the case may be
<rlb>yeah, and I'm guessing that the compiler's going to love the trivial loop anyway...
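The "trivial loop" mentioned above might look roughly like the following C sketch. `find_u32` is a hypothetical name, not a libguile or libc function. It compares whole 32-bit units, which matters for the null check: a byte-wise `memchr` for 0 over UTF-32 data would report a hit on almost any string, since a code point such as U+0041 is stored as one non-zero byte and three zero bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical uint32_t analogue of memchr: return a pointer to the
   first element of buf[0..len) equal to c, or NULL if there is none. */
const uint32_t *
find_u32 (const uint32_t *buf, size_t len, uint32_t c)
{
  assert (buf != NULL || len == 0);
  for (size_t i = 0; i < len; i++)
    if (buf[i] == c)
      return &buf[i];
  return NULL;
}
```

An embedded-null check over a UTF-32 buffer of `len` units is then just `find_u32 (buf, len, 0) != NULL`.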