IRC channel logs

2024-07-02.log


<rlb>wingo: offhand I think it's because the scheme side hash code needs to stick to 32-bit, but it's using u64, maybe via that logand change, and so hits the range check.
<rlb>(Though not sure exactly which range check atm.)
<rlb>Bit of stacktrace we have is in the backlog above, and in the buildd failure logs for i386, etc.
<dthompson>I was in hash.c recently... it could be improved a lot.
<sneek>wb dsmith :)
<dsmith>sneek, botsnack!!
<sneek>:)
<rlb>dthompson: I thought I might at least include an all-ascii shortcut in the utf8 hasher now that I've had to look at it carefully, and since we already get both the byte and char lengths. Thinking "should be trivial, and can't hurt, might help"...
<rlb>i.e. in the 3.0.10 fixup patches I'm currently accumulating
<rlb>(There's of course a lot more of those shortcuts in the utf8 branch.)
<dthompson>rlb: here's a thing that bugs me about the current hash implementation:
<dthompson>(hash (cons 1 2) 53) ;; => 7
<dthompson>(hash (cons 2 1) 53) ;; => 7
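One way a collision like the one above can arise is when a pair's hash combines the car and cdr hashes with a symmetric operation. The C sketch below is not taken from Guile's hash.c; `combine_symmetric`, `combine_ordered`, and `bucket` are hypothetical names, and the multiply-by-31 step is just one common order-sensitive choice. It only illustrates why an order-sensitive combine can separate (1 . 2) from (2 . 1).

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical: a symmetric combine, so (a . b) and (b . a) always collide. */
static uint32_t combine_symmetric(uint32_t car_hash, uint32_t cdr_hash)
{
    return car_hash ^ cdr_hash;
}

/* Hypothetical: an order-sensitive combine (multiply, then add). */
static uint32_t combine_ordered(uint32_t car_hash, uint32_t cdr_hash)
{
    return car_hash * 31u + cdr_hash;
}

/* Reduce a hash to a table index, in the spirit of the size argument to `hash`. */
static uint32_t bucket(uint32_t h, uint32_t size)
{
    assert(size > 0);
    return h % size;
}
```

With element hashes 1 and 2 and a table size of 53, the symmetric version lands both orderings in the same bucket, while the ordered version does not.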
<rlb>Oh, right -- I noticed some other bits (some were bugs we fixed) while working on lokke's hash and equality (likely still not quite right).
<rlb>lokke I mean -- guile's is better
<rlb>(clj wants/needs to cache the hashes in many cases though...)
<dthompson>I tried to do better for hoot: https://gitlab.com/spritely/guile-hoot/-/blob/main/lib/hoot/hashtables.scm?ref_type=heads#L102
<rlb>I just tried adding the ascii case to the current utf8 hasher, and it was *trivial* because that's identical to the narrow hasher, and so we can just call the macro and return...
<rlb>(and the wide hasher, i.e. just another fixed-width case)
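The all-ASCII shortcut rests on a simple property: in valid UTF-8, the byte length equals the character length exactly when every character is ASCII. The C sketch below assumes valid UTF-8 input; `utf8_char_count` and `utf8_is_all_ascii` are hypothetical names and are not the functions in the utf8 branch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Count code points in valid UTF-8: every byte that is not a
   continuation byte (10xxxxxx) starts a new character. */
static size_t utf8_char_count(const unsigned char *s, size_t byte_len)
{
    size_t n = 0;
    assert(s != NULL || byte_len == 0);
    for (size_t i = 0; i < byte_len; i++)
        if ((s[i] & 0xC0) != 0x80)
            n++;
    return n;
}

/* For valid UTF-8, equal byte and character lengths mean the string
   is all ASCII, so a fixed-width (narrow) hasher could be used. */
static bool utf8_is_all_ascii(size_t byte_len, size_t char_len)
{
    return byte_len == char_len;
}
```

When both lengths are already known, as rlb describes, the check is a single comparison and no scan of the bytes is needed.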
<dthompson>in hoot we're just using the Java string hashing algorithm for now
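For reference, Java's `String.hashCode` uses the recurrence h = 31*h + c over the string's UTF-16 code units, with 32-bit wraparound. The C sketch below shows that recurrence only; `java_string_hash` is a hypothetical name and this is not hoot's code. It returns the raw 32-bit pattern as unsigned, where Java would interpret the same bits as a signed int.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* h = 31*h + unit over UTF-16 code units, wrapping at 32 bits. */
static uint32_t java_string_hash(const uint16_t *units, size_t len)
{
    uint32_t h = 0;
    assert(units != NULL || len == 0);
    for (size_t i = 0; i < len; i++)
        h = 31u * h + units[i];
    return h;
}
```

For the code units of "abc" this gives 97*961 + 98*31 + 99 = 96354, and the empty string hashes to 0.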
<rlb>ACTION wishes u8_check() returned the char length on success somehow, instead of just throwing it away (presumably it has to know)
<dthompson> https://gitlab.com/spritely/guile-hoot/-/blob/main/module/hoot/stdlib.scm#L1450
<rlb>...iirc there are "concerns" with libunistring, but it's been a year or more, so I've forgotten the context.
<dthompson>hmmmm
<rlb>istr there was some interesting-looking opinionated up and coming utf8 lib, but can't recall the name -- wonder if it's still promising.
<dthompson>I wonder if guile can just... have its own code for this?
<rlb>Interesting, wrt the hoot code.
<rlb>It does at least a bit now in the utf8 branch :)
<rlb>Some of the more targeted stuff we wanted was just easier to write as a macro or inline (admittedly C).
<dthompson>andy wrote some crazy utf-8 code for v8 that he then ported to wasm for hoot https://gitlab.com/spritely/guile-hoot/-/blob/main/module/wasm/lower-stringrefs.scm#L114
<rlb>As I mentioned, there were some bits that I did want to write in scheme, but that can't happen afaict until we sort out some other issues wrt the way the two srfis are embedded right now.
<rlb>Could revisit though.
<dthompson>it makes sense to have the utf8 stuff implemented in C as part of the runtime
<dthompson>it's very low-level
<rlb>Mostly right now, I'm just waiting for some review/guidance before I spend too much more time on it -- it was... a lot.
<dthompson>I can imagine! unicode scares me :)
<rlb>Happy to pick it back up in whatever direction is preferred once we know.
<dthompson>thanks for working on the spooky stuff :)
<rlb>Hah, well you get used to it, at least at the level I understand it, which is not *deep* wrt real unicode dark corners.
<rlb>i.e. utf-8's fairly straightforward (the coding, once you're used to it).
<rlb>Fun fact, the number of leading 1 bits (if there are any) in the first byte tells you how many bytes it's going to be :)
<rlb>and if it's zero, it's ascii, of course.
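The leading-bits rule can be sketched in C as follows. `utf8_sequence_length` is a hypothetical name. It looks only at the bit pattern of the first byte: a lone leading 1 bit marks a continuation byte rather than a start byte, and it does not reject lead bytes that are invalid for other reasons (overlong or out-of-range encodings).

```c
/* Returns the sequence length (1..4) implied by the first byte,
   or 0 if the byte cannot start a UTF-8 sequence. */
static int utf8_sequence_length(unsigned char first)
{
    int ones = 0;

    /* Count the leading 1 bits, from the most significant bit down. */
    while (ones < 8 && (first & (0x80u >> ones)))
        ones++;

    if (ones == 0)
        return 1;               /* 0xxxxxxx: ASCII */
    if (ones >= 2 && ones <= 4)
        return ones;            /* 110xxxxx, 1110xxxx, 11110xxx */
    return 0;                   /* 10xxxxxx (continuation) or 11111xxx */
}
```

So 'A' (0x41) maps to 1, 0xC3 to 2, 0xE2 to 3, and 0xF0 to 4, while a continuation byte such as 0x80 maps to 0.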
<dthompson>oh neat!
<dthompson>did not know!
<rlb>The hard part is/was, switching from our current arrangement to utf-8 in stages in the patch series without ever breaking anything -- i.e. I think all the patches will (would) pass make check.
<rlb>That and trying to make it "easier" to review.
<dthompson>that's the best way to do it. more work, though.
<rlb> https://en.wikipedia.org/wiki/UTF-8#Encoding
<rlb>Makes it obvious.
<dthompson>ah that's a good visualization
<rlb>Right now I want to get 3.0.10 into debian, but I think I'm going to need some help there. I'll probably file a bug now that I've narrowed it down to (likely) some logand cps changes. Don't know that well enough to pursue it myself. Hopefully Andy will have time for it in a bit.
<rlb>"well enough *yet*" -- interested in learning it better.
<rlb>wrt C - yeah, I tried to lean heavily on keeping the string data compact and completely (all of it, type, length, content) inline for the cache and use memcpy all over the place, etc.
<rlb>The most important thing is that I snuck an xkcd ref into our test data :P https://codeberg.org/rlb/guile/src/commit/ad48d90d12a747489d74a0d7ee7aa53d12639a64/test-suite/test-suite/data.scm#L101
<rlb>(iirc, constraint was that I needed all the utf8-lengths and wanted some more data)
<dthompson>lol I get the reference
<dthompson>very nice :)
<dthompson>this reminds me that I'm sitting on a patch to update guile in guix...
<rlb>If y'all do 32-bit I suspect (and would like to hear) you'll crash there too.
<rlb>But you'll know before you ever reach the tests -- crashes during initial compilation in that logand.
<rlb>Also, if 32-bit is important, might want to test first and/or hold off.
<rlb>bbiaw
<dthompson>I'm gonna send a patch and CI will tell me
<rlb>excellent
<rlb>Also a problem with one of the SEEK_HOLE related tests on two debian archs. I suspect that's due to unexpected fs characteristics for either the test or the new (I think) code. Haven't had a chance to try to track that one down yet.
<rlb>i.e. I imagine it's those buildds using a tmpfs or some other less common fs arrangement for the sbuild chroots.
<dthompson>SEEK_HOLE is new, right? for sparse files or something
<rlb>I believe so.
<rlb>Not sure if it's relevant to the work you're doing, but if so, spare a thought for the *nix "utility writers", i.e. have some well specified (thread safe, etc.) approach for handling system data that may not be compatible with unicode, like paths, user names, group names, xattrs...
<rlb>For a while it was very hard to use python3 for that (and now it's mostly just inefficient or sometimes awkward), and guile proper doesn't have a really nice way to handle that either.
<rlb>Currently, I think you just have to swap in/out latin-1 for every relevant call.
<rlb>etc.
<rlb>(Basically, *nix paths, groups, etc. are fundamentally just arbitrary, non-null bytes, and if you want to write tar/cp/find/whatever, you have to handle that.)
<dthompson>could be relevant for hoot on "native" wasm runtimes, not really for the web though.
<rlb>If I vaguely understand correctly, you might be more in the clj position, i.e. "up to the platform".
<rlb>If so, makes sense.
<rlb>(Issue with the jvm too.)
<dthompson>all strings are stored in wtf8 format
<rlb>rust seems to make a much more principled effort to dtrt.
<dthompson>they have tons of string types, aiui
<rlb>I'm not sure I've seen anywhere else (aside of course from C) that does.
<rlb>The main thing is that they have "string" and "os string".
<dthompson>in theory guile/hoot could have this, too, I suppose
<rlb>The latter is (on the relevant platforms) compatible with "just bytes" aiui.
<rlb>Yes, but it's a *lot* of work. I'd be interested in spending a good bit of time on it, but first we'd need a solid consensus.
<dthompson>in general I don't take inspiration from rust, though ;)
<dthompson>string encoding is certainly not my area of expertise
<rlb>I don't know rust much, but that bit did stand out, since I've spent a lot of time fighting with related issues.
<rlb>Basically, you can't represent os data on linux and some other unix-like platforms via unicode strings, at least not in any straightforward way.
<dthompson>are bytevectors good enough? :P
<rlb>And you risk corrupting the data if your runtime does any automatic unicode "normalization" or "canonicalization", so that when you go to open the file, you're using the wrong byte sequence.
<rlb>Yes, they're "perfect", if we had a set of string-srfi like operations for them too.
<rlb>Because you still want to be able to easily split/join paths on "/", etc.
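As a rough illustration of a string-like operation on raw bytes, the C sketch below walks a path one component at a time by splitting on the '/' byte, without decoding or normalizing anything. `struct byte_span` and `next_component` are hypothetical names; nothing here is an existing Guile or SRFI API.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical view into a byte path: no copy, no decoding. */
struct byte_span {
    const unsigned char *start;
    size_t len;
};

/* Returns the component beginning at *pos and advances *pos past the
   following '/' (or to len if there is none). Bytes are never
   interpreted as characters, so non-UTF-8 names pass through intact. */
static struct byte_span next_component(const unsigned char *path,
                                       size_t len, size_t *pos)
{
    assert(path != NULL && pos != NULL && *pos <= len);
    const unsigned char *begin = path + *pos;
    const unsigned char *slash = memchr(begin, '/', len - *pos);
    size_t n = slash ? (size_t) (slash - begin) : len - *pos;

    *pos += n + (slash ? 1 : 0);
    return (struct byte_span){ begin, n };
}
```

Called repeatedly on the six bytes `a / 0xFF 0xFE / c`, it yields components of length 1, 2, and 1, and the 0xFF 0xFE name comes back byte for byte.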
<dthompson>I know of multiple projects that bytevector concatenation procedures, etc.
<dthompson>that have*
<rlb>But for now for guile, you need to use either latin-1 strings or bytevectors, and none of the built in functions will give you a bytevector (afaik).
<rlb>But most systems these days are utf-8, so it's even a lower profile issue than it was -- still critical for fs tools that *have* to get it right, though (as mentioned like tar, rsync, cp, find...).
<dthompson>yeah
<rlb>But my impression is that there's still a lot of (possibly mixed) filesystems with say older Japanese FHIFT-JIS, and some Cyrillic and/or Korean encodings.
<rlb>And tools in languages that don't handle it right will just crash.
<rlb>(python did for a while)
<rlb>I don't know if we do -- haven't run the similar generative tests, but ideally, we'd have some.
<rlb>i.e. test random byte paths/users/groups
<rlb>Sorry, long tangent -- pet issue. Glad Ludovic's not here :) Already heard it.
<dthompson>;)
<dthompson>glad someone is thinking about all this stuff :)
<dthompson>I should go to bed soon... but I'm working on a really cool prototype...
<rlb>Nice
<rlb>"SHIFT-JIS"...
<apteryx>continuing on the R7RS papercut fixes theme; would anyone be able to look into bug#67255? there's a patch waiting to be applied
<apteryx>it adds support for the R7RS define-library 'rename' clause.
<rlb>wingo: filed the 32-bit compilation issue https://debbugs.gnu.org/71891
<stanrifkin>Is there a filename completion with (load "filename")?
<lloda>stanrifkin: the functions are there, but they aren't plugged in. For example you can do (import (ice-9 readline)) (with-readline-completion-function filename-completion-function (lambda () (read))) and you'll get filename completion on the (read) call. But the default function on the repl is apropos-completion-function and it doesn't change depending on context, like after '(load "'.
<lloda>a patch to do this would be awesome
<stanrifkin>OK, thanks. So it doesn't work at the moment.
<mwette>rlb: I generated/applied a utf8 patch for 3.0.10 distribution; I ran configure with --disable-fast-install; it crashes trying to build eval.go; any idea how to get it going?
<mwette>I'll try to swipe ice-9/{eval,boot-9,psyntax-pp}.go from utf8 branch?
<chrislck>how to test guile-3.0.10 in guix?
<rlb>mwette: I'd guess you might need a "make clean" unless you started from a clean tree, and to rm -rf the relevant subtree in ~/.cache/guile. The branch changes the byte code layout for strings, and so won't "automatically" be OK until we include it in a release (if we do) that changes the Y version.
<rlb>Though I also haven't tried it on top of 3.0.10 -- the current utf8 branch is a bit further back, but not much.
<rlb>Will rebase it soon.
<mwette>rlb: thanks. I tried those, but didn't work. Also, there was an odd error in srfi-13.c regarding some bad arg in expression to ASSERT in filter_string().
<rlb>Does the current utf8 branch work "as is", or were you specifically interested in it on top of v3.0.10?
<mwette>I did not try in the branch. I was seeing if I could make a patch for 3.0.10. I'll keep plugging. Back to the day job ...
<Harzilein>hi
<rlb>mwette: OK, well I'll fix 3.0.10 soonish (if I can), fwiw. Possibly today or tomorrow, now that I'm waiting on the 32-bit issues.
<rlb>And thanks for the report.
<rlb>mwette: btw, what was the failure? If it's either a "documented?" related test failure or a version test failure, then those are just issues with current main. The former can be fixed by re-running, and the latter via ./autogen.sh.
<mwette> SNARF regex-posix.doc
<mwette> GEN guile-procedures.texi
<mwette>../../../guile-3.0.10-utf8/libguile/srfi-13.c:3645: wrong position for argument char_pred: 1 (should be 2)
<mwette>../../../guile-3.0.10-utf8/libguile/srfi-13.c:3679: wrong position for argument char_pred: 1 (should be 2)
<mwette>SCM_ASSERT (scm_is_true (scm_procedure_p (char_pred)), char_pred, SCM_ARG1, fname);
<mwette>^ very odd, looks fine to me
<dsmith>?? Did SCM_ARG1 get (re)defined to 2 perhaps?