IRC channel logs
2025-08-17.log
<ttz>Guile's implementation of the R6RS specialized floating-point operations seems to be sub-optimal, since I obtain a speed-up when I replace them with the generic ones.
<ttz>Any idea why, and whether this is easily solvable?
<ttz>It's quite amazing how non-portable performance is: Guile and Chez best optimize different versions of an inner-product procedure (using different combinations of vector vs. bytevector-ieee-double-native and * vs. fl*). I am really impressed by Guile's JIT, which is able to beat Chez's binaries on some combinations, even though the best version for Chez (and the worst for Guile) is just blazingly fast, almost matching Rust's performance (within
<ttz>Under these conditions, writing portable yet performant implementations seems to be impossible...
<identity>also, check your Guile version; the latest from master (and maybe even from wip-whippet, the branch for the new work-in-progress GC stuff, though it might be broken) may be faster
<ttz>I know of this, and I already achieved a 135x speed-up using specialized vectors.
<ttz>Only, I want my code to be portable, and fl* seems the best way to encode portable optimizations. But this is really not well supported by Guile.
<ttz>Using it instead of * gives a 10x slow-down...
<identity>okay, so i tried benchmarking the difference ‘truncate/’ makes versus ‘truncate-quotient’ and ‘truncate-remainder’, and… why does a single call to ‘truncate/’ take more time than two calls to ‘truncate-q/r’? ‘truncate/’ seems to cause a lot of allocations (over an order of magnitude more GC time)
<identity>“‘truncate/’ returns both Q and R, and is more efficient than computing each separately.” ―The Manual. But is it really?
<identity>indeed, implementing ‘truncate/’ as (values (truncate-quotient x y) (truncate-remainder x y)) seems to be faster than whatever guile does
<mwette>ttz: I think wingo wrote something once that stated the rnrs flonums are slower. If you look at rnrs/arithmetic/flonums.scm you'll see fl* etc. is implemented using (apply * args).
<mwette>plus there is a check of all the args
<ttz>yes, I can see they are checked with flonum? using trace, but I am surprised this cannot be optimized out, since I do (fl* (bytevector-ieee-single-native-ref vec1 pos) (bytevector-ieee-single-native-ref vec2 pos)), so fl*'s arguments are guaranteed to be flonums.
<ttz>I am going to have a look at the implementation, you are right
<ttz>Indeed, the implementation is really not trivial: calls to apply, for-all, branches everywhere...
<mwette>I'm sure it could be optimized out. Probably no takers yet.
<ttz>I think using case-lambda would already help for the cases where few arguments are given. Avoiding the two calls to apply, the allocation of the rest arguments, dispatching at compile time where possible, and maybe moving the flonum? check closer to the arguments could unlock compiling them out, though I am really not knowledgeable about the optimization passes.
<ttz>It seems within my reach; maybe I could try to submit a patch.
<ttz>v3.0.9 is the one used by geiser/guix, even though I installed Guile 3.0.10 via guix; I don't know how to change that
<mwette>My take on the module is that the intent is to provide efficient numerics. It does not do that. I think that's your point. It does implement the API. The Guile optimizer (in the CPS layer) will look at *, +, etc. (see language/cps/specialize-primcalls.scm).
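On the ‘truncate/’ side thread: the reformulation identity describes, written out as a definition. The name my-truncate/ is hypothetical, chosen here so as not to shadow the builtin:

    ;; identity's two-primitive version, reported faster than the builtin
    (define (my-truncate/ x y)
      (values (truncate-quotient x y)
              (truncate-remainder x y)))

    ;; usage: (call-with-values (lambda () (my-truncate/ 7 2)) list) => (3 1)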
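For context, the loop being benchmarked presumably looks something like the sketch below; ttz's actual code is not in the log, so the procedure name, the single-float element size, and the accumulation style are assumptions:

    (use-modules (rnrs bytevectors)
                 ((rnrs arithmetic flonums) #:select (fl* fl+)))

    ;; vec1 and vec2 hold packed single floats; len is the element
    ;; count, so byte offsets advance by 4
    (define (inner-product vec1 vec2 len)
      (let loop ((i 0) (acc 0.0))
        (if (= i len)
            acc
            (let ((pos (* 4 i)))
              (loop (+ i 1)
                    (fl+ acc
                         (fl* (bytevector-ieee-single-native-ref vec1 pos)
                              (bytevector-ieee-single-native-ref vec2 pos))))))))

Swapping fl+/fl* for the generic +/* in such a loop is the substitution behind the 10x gap ttz reports.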
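A minimal sketch of the case-lambda transformation ttz proposes: keep the flonum? check, but give the common arities their own clauses so those calls involve no apply and no rest-list allocation. my-fl* and check-fl are invented names; the real patch would touch rnrs/arithmetic/flonums.scm itself:

    (use-modules (((rnrs arithmetic flonums)) #:select (flonum?)))

    (define (check-fl x)
      ;; the same type check the R6RS library performs, kept next to
      ;; each argument so the compiler has a chance to elide it
      (unless (flonum? x) (error "fl*: not a flonum:" x)))

    (define my-fl*
      (case-lambda
        (() 1.0)                        ; (fl*) => 1.0 per R6RS
        ((x) (check-fl x) x)
        ((x y) (check-fl x) (check-fl y) (* x y))
        ((x y . rest)                   ; slow path, rare in practice
         (let loop ((acc (my-fl* x y)) (rest rest))
           (if (null? rest)
               acc
               (begin
                 (check-fl (car rest))
                 (loop (* acc (car rest)) (cdr rest))))))))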
<identity>ttz: iirc 3.0.10 is better at monomorphising numerical code, try changing ‘geiser-guile-binary’ to "/gnu/store/…-guile-3.0.10/bin/guile", wherever it is
<ttz>though I tried running my benches via the terminal (so 3.0.10) and I couldn't see much difference
<identity>the guix practice of putting executable paths into elisp is kind of annoying, really
<identity>i think the built-in mpc.el is still broken because of an erroneous substitution, should write a patch
<ttz>I seem to be able to achieve at least a 2.7x speed-up simply by using a case-lambda.
<ttz>That's not perfect, but it is a simple code transformation that could be merged fast.
<ttz>The compiler is still not able to use the specialized * operation but uses (call-scm<-scm-scm 9 8 9 4)
<ttz>And in a real-world program computing lots of inner products I get a 1.5x speed-up (from 50s to 37s).
<ttz>And another 1.7x speed-up by transforming the + (from 37s down to 22s).
<ttz>(I mixed up my numbers, but the total speed-up is from 50s to 22s)
<ttz>For reference, the same program using plain + and * runs in 1.3s.
<ttz>Yes, and using a case-lambda brings it back to only ~10x slower.
<ttz>Inlining the flonum? test shaves off some more: only ~4x slower.
<mwette>If you want it portable and efficient you could add a cond-expand to redefine fl* to * for guile.
<ttz>but then you wouldn't check for flonum?; would it be standard-compliant?
<ttz>Actually, I think there is something with much more impact going on: the compiler is unable to optimize calls to fl*: evaluating ,optimize (fl* 1. 2.) in the REPL gives (fl* 1.0 2.0), while it gives 2.0 for *.
<mwette>(define l '(1.0 2.0)) ,optimize (apply + l) => (apply + l)
<ttz>well, here I am not using apply
<mwette>I think the optimizer wants to see binary calls
<mwette>Oh, by fl* do you mean your case-lambda version?
<ttz>I think there is something privileged about * that makes it optimizable. Maybe it is inlinable and fl* isn't.
<ttz>I seem to remember Guile cannot do cross-module inlining, am I correct?
<mwette>Guile now does cross-module inlining, for #:declarative? #t
<ttz>What about libraries? Are they declarative?
<identity>i would guess that it is conservative about inlining stuff, though it is strange that it does not optimize (fl* 2 2) to 4…
<identity>ttz: it seems ‘library’ expands to #:declarative? #false
<ArneBab>ttz: I expect that you’re out of luck with fully optimized code, unless you create a different prelude for each Scheme that defines the most efficient implementation depending on the implementation. Maybe as define-inlinable, if Chez supports that.
<ArneBab>(maybe that exists already and I just don’t know it yet)
<ArneBab>ttz: aside: are you sharing your benchmark results for Chez vs. Rust somewhere?
<mwette>(cond-expand (guile (define fl* *) ...) (else (import (rnrs ...))))
<ArneBab>mwette: does that address what ttz needs?
<mwette>See, for example, system/base/lalr.upstream.scm.
<ttz>ArneBab: sure, can do, but it is rather trivial: it consists in implementing the inner product the naive way in each language, except for Chez, where I manually unrolled the loop 8 times.
<ttz>For dimension 384 (my use case elsewhere) I get Rust: 625ns, Chez: 700ns, Guile: 2800ns (non-unrolled works best for some reason)
<ttz>I guess cond-expand could be nice, but I was also trying to figure out how to make Guile itself faster, and for that cond-expand won't do, since fl* needs the flonum? check.
<ttz>mwette: this is really interesting, do you know why libraries aren't declarative? I think they contain immutable bindings for Chez.
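mwette's cond-expand fragment above, filled in as a runnable sketch of the portable-prelude idea; as ttz notes, on Guile this trades away the flonum? check, and where exactly the import may appear varies by implementation:

    (cond-expand
      (guile
       ;; alias the generic operations: fast on Guile, but no type check
       (define fl* *)
       (define fl+ +))
      (else
       (import (rnrs arithmetic flonums))))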
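For reference, a declarative Guile module looks like this (the module and procedure are invented); per mwette, #:declarative? #t is what makes exported bindings candidates for cross-module inlining, and per identity, R6RS ‘library’ forms expand with #:declarative? #false:

    (define-module (example flmath)
      #:declarative? #t        ; the Guile 3 default, shown explicitly
      #:export (dot-step))

    ;; a small exported procedure, the kind cross-module inlining targets
    (define (dot-step acc x y)
      (+ acc (* x y)))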
<ttz>If we could just change that flag, it might change everything for the compiler.
<ttz>Also, what I really need here are vector instructions. You can't get state-of-the-art speed without them nowadays. Would it make sense to standardize vector operations? That way a bytevector-ieee-single-native-map could be implemented with low-level optimizations like SIMD.
<identity>what do you mean by ‘standardize vector operations’?
<ttz>well, create an SRFI with operations like bytevector-ieee-single-native-map, stating that they may use SIMD.
<ttz>Maybe something like that already exists and I don't know.
<identity>ttz: srfi-4 and srfi-160 are pretty much it
<ttz>Mmh, I see. But they say nothing about SIMD.
<ttz>I didn't realize that Guile's f32vector was coming from SRFI-4.
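The same kind of loop in SRFI-4 terms, for comparison (f32-dot is an invented name): SRFI-4 pins down the element types, but, as ttz observes, neither it nor SRFI-160 says anything that licenses a SIMD implementation:

    (use-modules (srfi srfi-4))

    (define (f32-dot v1 v2)
      (let ((n (f32vector-length v1)))
        (let loop ((i 0) (acc 0.0))
          (if (= i n)
              acc
              (loop (+ i 1)
                    (+ acc (* (f32vector-ref v1 i)
                              (f32vector-ref v2 i))))))))

    ;; (f32-dot (f32vector 1 2 3) (f32vector 4 5 6)) => 32.0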