IRC channel logs

2021-02-03.log


<Sheilong>Hello. How can I convert an integer to a bit vector
<chrislck>example?
<RhodiumToad>that actually looks surprisingly hard
<Sheilong>transform an integer in the range 0-255 into a bitvector of 8 bits. E.g. (convert 15) => #*00001111
<justin_smith>yeah, I was surprised to see that isn't built in
<justin_smith>you might find that bit-extract helps, it gives the value of some range of bits inside an int
<Sheilong>I wrote a procedure to calculate the Hamming distance between two strings, so I had to write a procedure to count the set bits of a given integer, but since I saw that there is a bitvector type with a built-in bit-count I've been trying to figure out how to convert an int to that.
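A minimal sketch of the conversion Sheilong is after, assuming Guile 3.0's bitvector-set-bit! (older Guiles spell it (bitvector-set! bv i #t)):

```scheme
;; Build an 8-bit bitvector from an integer in [0,255],
;; most-significant bit first, so (int->bitvector 15) => #*00001111.
(define (int->bitvector n)
  (let ((bv (make-bitvector 8 #f)))
    (do ((i 0 (+ i 1)))
        ((= i 8) bv)
      ;; bit (7 - i) of n becomes element i of the bitvector
      (when (logbit? (- 7 i) n)
        (bitvector-set-bit! bv i)))))
```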
<RhodiumToad>there's logcount for integers, no?
<spk121>if you just want the count of the 1 bits, there's a 'bit-count' in srfi-60
<RhodiumToad>that's the same as logcount in core
<Sheilong>spk121: Yes. But bit-count expects a bitvector as input
<RhodiumToad>bit-count in srfi-60 is not bit-count in core
<RhodiumToad>speaking of bit operations, I noticed when doing those hex routines a while back that using srfi-60 came with a substantial performance hit
<RhodiumToad>it looked like the builtins were being compiled to bytecode primitives while the exact same functions used from srfi-60 were not
<Sheilong>How do I compare equality among characters?
<Sheilong>(= #\a #\b) <unnamed port>:2798:0: In procedure =: Wrong type argument in position 1: #\a
<Sheilong>Just found on documentation
<RhodiumToad>eqv? or char=?
<Sheilong>char=?
<Sheilong>I did some bitwise operations on 8-bit integers but I am getting results that are bigger than 8 bits
<RhodiumToad>chars are not bytes
<RhodiumToad>what did you do?
<Sheilong>I convert a base 64 string to a list of chars, then convert the chars to integers, and save them in a bytevector. After that, I try to decode it back to the original data.
<Sheilong>But I think I did something wrong in the bitwise operations
<Sheilong> https://paste.ofcode.org/Q3AUgzVRckzBgfddwRUQ9Y
<RhodiumToad>you know you can convert a string straight to a bytevector?
<Sheilong>RhodiumToad: I didn't
<RhodiumToad>for something like base64 (or hex) where you know all the valid data is plain ascii, use string->utf8
<Sheilong>I confess that I looked for it in the documentation but did not find it before.
<RhodiumToad>that gives you a bytevector with the utf8 representation of the string, which will be all-ASCII if the original string was
<Sheilong>Ahhh, I used that before, but I got an error when I needed to do a xor operation with a key against a set of strings.
<RhodiumToad>what has that got to do with it?
<Sheilong>RhodiumToad: Is there a way to work with integers as 8-bit ints? When I extract an int from the bytevector it loses the 8-bit representation.
<RhodiumToad>when you extract it, it becomes just an integer
<RhodiumToad>scheme integers don't have a size as such
<Sheilong>So the only way would be using a MASK
<RhodiumToad>you want to do things like left shifts discarding all but the low 8 bits?
<RhodiumToad>then yes, use a mask
<Sheilong>RhodiumToad: That's it
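A sketch of the masking RhodiumToad describes, keeping shift results within 8 bits (shl8 is a hypothetical helper name):

```scheme
;; Emulate an 8-bit left shift: shift, then mask off everything
;; above the low 8 bits.
(define (shl8 x n)
  (logand (ash x n) #xff))

;; (shl8 #b10000001 1) => 2 : the high bit is shifted out and discarded.
```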
<Sheilong>RhodiumToad: I am too stupid! When I encoded a string to base 64, the results of the bitwise operations were indexes into the base 64 character table. Now, doing the reverse operation, I was applying the bitwise operations to the decimal character representation instead of the index...
***wxie1 is now known as wxie
<theruran>can Guile 3.0 handle the nanopass framework Scheme code?
<Sheilong>Is there a built in procedure that counts occurrences of character in a string or in a list?
<RhodiumToad>yes
<RhodiumToad>string-count for strings
<RhodiumToad>for lists you could use a fold if there isn't any specific function for it
<RhodiumToad>ah, srfi-1 has a (count)
<Sheilong>RhodiumToad: Thanks so much. String-count suffices for now.
<RhodiumToad>so (string-count str #\a) or (count (cut eqv? <> #\a) lst)
<Sheilong>I am detecting how much padding is there
<Sheilong>Now that I know how much padding I need to remove, it would be nice if there were a built-in function to remove the last n characters from a string
<daviid>rnrs base has length
<Sheilong>I am wrong. I want to remove the n last elements of a bytevector not from a string
<daviid>ah occurrences of ... forget length, i misunderstood the question
<RhodiumToad>bytevectors can't have their length changed, you'd need to copy it to a new one
<RhodiumToad>you might prefer just to keep track of the effective length
<Sheilong>bytevector-copy! is what I need, thanks guys
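Since bytevectors are fixed-length, "removing" the last n bytes means copying into a fresh, shorter bytevector, as RhodiumToad notes. A sketch (bytevector-drop-right is a hypothetical helper name):

```scheme
(use-modules (rnrs bytevectors))

;; Return a new bytevector holding all but the last n bytes of bv.
(define (bytevector-drop-right bv n)
  (let* ((len (- (bytevector-length bv) n))
         (out (make-bytevector len)))
    (bytevector-copy! bv 0 out 0 len)
    out))
```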
<daviid>Sheilong: fwiw, ,use (ice-9 format), then
<daviid>(format #t "~8,'0b~%" 15) -> 00001111
<daviid>have to run, bbl
<Sheilong>It is ugly but seems to work lol https://paste.ofcode.org/QTyULvB3rm4Skeh6si9KFf
<rlb>Including the recent compiler changes, any general guidelines wrt choosing character classes vs case for parsing, e.g. if you're parsing something s-expression-ish?
<wingo>rlb: i would use case (or match, or whatever)
<wingo>match is nice because you can use character classes if you want to
<rlb>So for things like "digits" or a fixed set of delimiter chars, should character sets and case perform close enough to the same to not likely matter?
<rlb>(if you have any impression offhand)
<wingo>yeah i think they are similar
<wingo>the tradeoff as i see it is: "case" (or other chains of comparisons) can be more optimal. charsets allow to you be more abstract when specifying the set of characters, and potentially more extensible as unicode evolves
<wingo>i started rewriting guile's reader in scheme yesterday
<wingo>hoping to be able to write an ad-hoc tree-il -> C compiler for use in bootstrapping
<wingo>so i have been thinking about these questions as well
<wingo>one nice thing about case is that effectively the dispatch to the clauses happens in parallel -- i.e. it's not test literals for this clause, otherwise the next, otherwise the next, etc
<wingo>it's o(1) dispatch to the right clause
<wingo>only matters sometimes tho
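The tradeoff wingo describes, sketched as two ways to test for an ASCII digit: a literal case chain (which the compiler can turn into a table switch) versus a SRFI-14 char-set (more abstract and Unicode-aware):

```scheme
(use-modules (srfi srfi-14))

;; Literal dispatch: amenable to O(1) table-switch compilation.
(define (digit-case? ch)
  (case ch
    ((#\0 #\1 #\2 #\3 #\4 #\5 #\6 #\7 #\8 #\9) #t)
    (else #f)))

;; Char-set dispatch: matches any Unicode decimal digit, not just ASCII.
(define (digit-charset? ch)
  (char-set-contains? char-set:digit ch))
```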
*rlb was toying with a scheme edn reader
<rlb>Thanks for the summary.
<rlb>(which might also if it performed well enough eventually grow to replace the current C clj reader -- which itself was stolen wholesale from guile's reader and adapted)
<rlb>(And if I finish it and it seems good enough, I also thought I might submit the pure scheme edn reader for consideration for guile -- if there's interest...)
<rlb>I'll be curious to see what you come up with -- what I'm currently doing is not very optimized.
<wingo>i am aiming to make a more-or-less faithful translation of the C reader to Scheme
<wingo>so as to preserve the same behavior. am trying not to optimize too much in the beginning
<wingo>i know the reader is important but maybe optimization payoffs are different in scheme vs c
<rlb>Indeed.
<rlb>Do we have a way to ask for a character set for a given unicode general category like "Zs" "Zl" and "Zp"?
<rlb>(wrt not optimizing (or maybe not optimizing), I'm using string output ports to accumulate pending text, i.e. unfinished symbols/strings/etc...)
<rlb>(Though I planned to go look and see how they currently work, i.e. are the good enough wrt cost right now.)
*rlb will head off soon
<wingo>rlb: wrt Zs etc, oddly i think not
<wingo>interesting choice, string output ports; hadn't thought of that. not sure how well it would perform; good expandable buffer, but they do round-trip to UTF-8
<wingo>perhaps if we ever switched to utf-8 strings internally in guile, that would be optimal tho
<xelxebar>Okay, I'm stumbling myself through some guile and would like some general advice on approach.
<xelxebar>My goal is to essentially grep for a string in a file.
<xelxebar>The brute-force way I'm working on is to loop through the lines in the file and run a string-match on each.
<xelxebar>Another approach could be to slurp up the entire file into a string, and just do a single string-match, but this seems mildly dangerous in general. I don't want to put a 4GB string in memory by accident.
<xelxebar>Is there a more idiomatic approach here? Am I missing something obvious?
<chrislck>xelxebar: this is a hard(TM) problem. best solved via tools like ripgrep.
<chrislck> https://blog.burntsushi.net/ripgrep/
<xelxebar>Haha! Fair enough. This is largely a pedagogical exercise in guile for me. I was wondering if there's a Guile-y way to do the equivalent of string-match on a file.
<chrislck>the guile way is exactly the same as the C/C++ way - consider buffer size, pointers, memory allocations, etc
<chrislck>you can limit your buffer size easily in guile just as in C
<rekado>xelxebar: this will become as complicated as you’ll allow it
<chrislck>in guile the national sport is minimising m in O(N^m)
<rekado>if you want to look up a fixed string (instead of a regexp) this will become somewhat simpler.
<rekado>but it still involves reading chunks of the file and being smart about matching
<rekado>here’s an entry point to the rabbit hole: https://en.wikipedia.org/wiki/String-searching_algorithm
<civodul>uh, srfi-64 (test-equal label value exp) behaves as if EXP returned #f when it actually threw
<xelxebar>Hrm. The level of abstraction that I'm thinking at here is "shell scripts". I'm trying to write the equivalent of some script I have in guile. If I'm understanding you correctly, it makes sense to just call out to grep/ripgrep/whatever in this case. I.e. there's not some short, idiomatic pure Guile replacement.
<mwette>wingo: While you are looking into strings, maybe look at how Python3 handles them. There are two literals, for example: b'hello' and u'hello'. The first is a bytearray and the second is a unicode string.
<wingo>using utf-8 as a string backing store wouldn't change api fwiw on the scheme side
<wingo>i thought most people thought of the python string / bytestring thing as a disaster
<rekado>xelxebar: there is no single search-string-in-file procedure in Guile. Which implementation is best depends very much on your use case.
<wingo>i would probably loop over calling "read-line" then calling "string-contains" but that's just me
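A minimal sketch of wingo's suggestion, assuming a fixed substring rather than a regexp (grep-file is a hypothetical helper name):

```scheme
(use-modules (ice-9 rdelim))

;; Print every line of a file that contains the given substring.
;; Reads one line at a time, so memory use stays bounded by the
;; longest line rather than the file size.
(define (grep-file needle filename)
  (call-with-input-file filename
    (lambda (port)
      (let loop ((line (read-line port)))
        (unless (eof-object? line)
          (when (string-contains line needle)
            (display line)
            (newline))
          (loop (read-line port)))))))
```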
<mwette>I was getting at clarifying the relationship between strings and {bytearray, encoding}. But going back and reading the Guile manual it is consistent. My problem is that I needed to read "character" as code-point and not byte.
***chrislck_ is now known as chrislck
<mwette> /quit
<rlb>wingo: hmm, I'll have to make sure I understand you better later -- If I have a string port and a char and then (put-char port c), I'd have assumed it just appends c to the string in some "reasonable" way. (And if not, we could fix it.)
<rlb>But I hadn't stopped to think about it hard yet or investigate.
<rlb>And wrt string vs bytes -- let's *not* do what python did, but most importantly by that I mean the backward incompatible break. But I'd say that having functions that can handle either bytes or strings might be (part of) a plausible solution. Though of course issues there too, i.e. (readlink x) could decide to return bytes or a string based on the type of x, but that doesn't help for something like (gethostname).
<rlb>In any case, I'd love to eventually have a plan there, and be happy to help implement it if I can.
<rlb>(Also be happy to help with utf-8 related work -- that was a stumbling block wrt lokke's use of pcre2 for #"foo" regexps since pcre2 does not support utf-32 (only 8 and 16).)
<mwette>Guile has (ice-9 regex) which operates on Guile strings. I assume it's based on posix regex. So conversion currently uses locale but shouldn't the regex routines accept an optional encoding?
<mwette>(ice-9 regex) another target for implementation in scheme?
<leoprikler>iiuc all guile strings are UTF-8. If you need anything else, use bytevectors :)
<RhodiumToad>the problem with that is with reading strings from, say, filenames
<mwette>I believe, strings are arrays of code points, not encoded byte arrays.
<mwette>Files are arrays of bytes. So you need an associated encoding to turn those bytes into code points (aka characters).
<RhodiumToad>the more serious problem is file _names_
<RhodiumToad>everyone expects a file name to be a string, but the OS is treating it as bytes
<mwette>Ah. I remember the issues. I'll bet they are, especially because LC_LANG(?) is volatile.
***amiloradovsky1 is now known as amiloradovsky
<RhodiumToad>well, LC_* might say that the locale is UTF8, but the filename might not be valid UTF8
<mwette>exactly
<mwette>no encoding designation in DIR type?
<RhodiumToad>nope
<rlb>I believe internally strings are either "latin-1" (more or less) or UTF-32, neither of which pcre2 supports.
<rlb>And wrt (ice-9 regex), it's not guaranteed to be available, and when available "does whatever the platform does", and I wanted lokke to have well defined regexps (and they have to be available all the time, so couldn't rely on guile's support anyway)
<rlb>(and of the options pcre seemed like a reasonable choice)
<rlb>And wrt bytes vs strings, yeah, on many platforms (i.e. linux), paths/hostnames/etc. are *bytes*, not strings -- and there are additional potential complexities, since depending on how the unicode is handled, normalization and/or canonicalization can change the string in a way that makes it impossible to find the original file, i.e. if you were to readlink something, then normalize/canonicalize the string, and then try to
<rlb>open it, it might fail.
<rlb>Or I suppose user and group names are perhaps a better example than hostnames...
<rlb>It can be quite fraught to pretend they're not.
<rlb>(or at least to *only* support pretending they're not)
<terpri>utf-8b, as i think it's called, is a fairly clever solution to the random bytes -> almost-utf-8 problem without losing information: https://web.archive.org/web/20090830064219/http://mail.nl.linux.org/linux-utf8/2000-07/msg00040.html
<terpri>iirc it's what python uses for things like posix pathnames (or a very similar trick)
<wingo>moo
<terpri>🐮💬μῡ
<terpri>ah, this refers to python's usage, introduced as a "surrogateescape" error handler in pep 383: https://en.wikipedia.org/wiki/UTF-8#Derivatives
<rlb>Can't actually rely on it though, at least if you're using glibc because there's an issue/bug in glibc that it doesn't look like they're going to address, and python hasn't appeared willing to work around it either: https://bugs.python.org/issue35883
<rlb> https://sourceware.org/bugzilla/show_bug.cgi?id=26034
<rlb>So you can't use that to handle arbitrary paths -- have to stick to bytes.
<rlb>or you may crash your program on the wrong path
<rlb>(I've (unfortunately) had an extraordinary amount of time consumed by the python upstream choices in this general area over the past few years...)
<Sheilong>Do you know how I read all lines from a file but without keeping the \n character?
<ATuin>Sheilong: see `(ice-9 rdelim)`
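read-line from (ice-9 rdelim) strips the trailing delimiter by default (the 'trim handling; pass 'concat to keep it), so collecting a file's lines without \n is just a loop. A sketch (read-lines is a hypothetical helper name):

```scheme
(use-modules (ice-9 rdelim))

;; Return the lines of a port as a list of strings, newlines stripped.
(define (read-lines port)
  (let loop ((acc '()))
    (let ((line (read-line port)))  ; delimiter trimmed by default
      (if (eof-object? line)
          (reverse acc)
          (loop (cons line acc))))))
```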
<terpri>rlb, interesting, thanks for the links
<terpri>tricky to determine whether it's a genuine "bug" or not without diving into several standards (posix, iso c, unicode/iso 10646, ...)
<ecraven>hm.. why is guile on arch linux still guile 2?
<wingo>rlb: neat trick: (define (f ch) (define numeric '(#\0 #\1 #\2 #\3 #\4 #\5 #\6 #\7 #\8 #\9)) (define (numeric? ch) (memv ch numeric)) (cond ((eof-object? ch) #f) ((numeric? ch) (- (char->integer ch) (char->integer #\0))) (else 42)))
<wingo>the numeric? check optimizes to a table switch
<wingo>unfortunately guile is very stupid about eof-object? right now; need to fix that
<ecraven>that ignores almost all of unicode :P
<wingo>depends on what your definition of numeric is
<wingo>some languages define numeric in that way
<wingo>civodul: you ok with excise-ltdl?
<wingo>currently missing the mingw dlopen shim, but can fix up in master; i know mike gran has some work in this area
<wingo>apparently guile is broke on mingw anyway right now
<wingo>mike suggested simply pulling in the dlopen shim from cygwin into posix-w32.[ch], which sounds about right to me
<civodul>wingo: sounds good to me; that's for master, not 3.0, right?
<wingo>master is currently 3.0 fwiw
<wingo>no stable-3.0 branch yet
<civodul>right, that's why i'm asking :-)
<wingo>i was thinking we could do it in 3.0 actually
<civodul>ah
<wingo>:)
<civodul>heh :-)
<civodul>i'm just wondering about possible breakage
<civodul>i don't have any clear scenario in mind tho
<wingo>yeah. i thought about it and i don't know of anything that's not just a minor bug that can be fixed up in 3.0
<wingo>but i know that's not quite watertight :)
<wingo>i would rather not branch off 3.2 at this point
<civodul>yeah
<wingo>if necessary, sure, but we don't have that kind of energy atm
<wingo>nor any kind of 3.2 value proposition, currently anyway
<civodul>i guess you can go ahead and then we'll test on a few packages
<wingo>cool :)
<wingo>ok done
<wingo>with an optimization to eof-object? :)
<civodul>so it can be inlined?
<wingo>yeah, inlines to (eq? x the-eof-object)
<wingo>aaah the optimization of that "f" above is really good now
<wingo> 3 (immediate-tag=? 0 255 12) ;; char?
<wingo> 5 (jne 21) ;; -> L2
<wingo> 6 (untag-char 1 0)
<wingo> 7 (usub/immediate 1 1 48)
<wingo> 8 (mov 0 1)
<wingo> 9 (jtable 0 #(13 13 13 13 13 13 13 13 13 13 26))
<wingo>L1:
<wingo> 22 (tag-fixnum 1 1) at (unknown file):3:159
<wingo> 23 (reset-frame 1) ;; 1 slot
<wingo> 24 (handle-interrupts)
<wingo> 25 (return-values)
<wingo>if it's a char, it untags it, subtracts off #\0 already, then does a table jump. if untagged value is <= 9, it tags as fixnum and returns directly
<wingo>in this case we could compile to (< untagged-biased-value 10) instead of the table jump, which would be even better
<wingo>probably should get that together. but still, yay
<civodul>neat!