IRC channel logs

2021-02-03.log


<Sheilong>Hello. How can I convert an integer to a bit vector
<chrislck>example?
<RhodiumToad>that actually looks surprisingly hard
<Sheilong>transform an integer in the range 0-255 into a bitvector of 8 bits. E.g. (convert 15) => #*00001111
<justin_smith>yeah, I was surprised to see that isn't built in
<justin_smith>you might find that bit-extract helps, it gives the value of some range of bits inside an int
<Sheilong>I wrote a procedure to calculate the Hamming distance between two strings, so I had to write a procedure to count the set bits of a given integer, but since I saw that there is a bitvector type with a built-in bit-count I've been trying to figure out how to convert an int to that.
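A minimal sketch of the conversion Sheilong is after, assuming Guile 3.0's bitvector-set-bit! (older Guiles spell it (bitvector-set! bv i #t)):

```scheme
;; Build an 8-bit bitvector from an integer in [0,255],
;; most-significant bit first, so (int->bitvector 15) => #*00001111.
(define (int->bitvector n)
  (let ((bv (make-bitvector 8 #f)))
    (do ((i 0 (+ i 1)))
        ((= i 8) bv)
      ;; bit (7 - i) of n becomes element i of the bitvector
      (when (logbit? (- 7 i) n)
        (bitvector-set-bit! bv i)))))
```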
<RhodiumToad>there's logcount for integers, no?
<spk121>if you just want the count of the 1 bits, there's a 'bit-count' in srfi-60
<RhodiumToad>that's the same as logcount in core
<Sheilong>spk121: Yes. But bit-count expects a bitvector as input
<RhodiumToad>bit-count in srfi-60 is not bit-count in core
<RhodiumToad>speaking of bit operations, I noticed when doing those hex routines a while back that using srfi-60 came with a substantial performance hit
<RhodiumToad>it looked like the builtins were being compiled to bytecode primitives while the exact same functions used from srfi-60 were not
<Sheilong>How do I compare equality among characters?
<Sheilong>(= #\a #\b) <unnamed port>:2798:0: In procedure =: Wrong type argument in position 1: #\a
<Sheilong>Just found on documentation
<RhodiumToad>eqv? or char=?
<Sheilong>char=?
<Sheilong>I did some bitwise operations on 8-bit integers but I am getting results that are bigger than 8 bits
<RhodiumToad>chars are not bytes
<RhodiumToad>what did you do?
<Sheilong>I convert a base 64 string to a list of chars, then convert the chars to integers, and save them in a bytevector. After that, I try to decode it back to the original data.
<Sheilong>But I think I did something wrong in the bitwise operations
<Sheilong> https://paste.ofcode.org/Q3AUgzVRckzBgfddwRUQ9Y
<RhodiumToad>you know you can convert a string straight to a bytevector?
<Sheilong>RhodiumToad: I didn't
<RhodiumToad>for something like base64 (or hex) where you know all the valid data is plain ascii, use string->utf8
<Sheilong>I confess that I looked for it in the documentation but did not find it before.
<RhodiumToad>that gives you a bytevector with the utf8 representation of the string, which will be all-ASCII if the original string was
<Sheilong>Ahhh, I used that before, but I got an error when I needed to do a xor operation with a key against a set of strings.
<RhodiumToad>what has that got to do with it?
<Sheilong>RhodiumToad: Is there a way to work with integers as 8-bit ints? When I extract an int from the bytevector it loses the 8-bit representation.
<RhodiumToad>when you extract it, it becomes just an integer
<RhodiumToad>scheme integers don't have a size as such
<Sheilong>So the only way would be using a MASK
<RhodiumToad>you want to do things like left shifts discarding all but the low 8 bits?
<RhodiumToad>then yes, use a mask
<Sheilong>RhodiumToad: That's it
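A sketch of the masking RhodiumToad describes, keeping shift results within 8 bits (shl8 is a hypothetical helper name):

```scheme
;; Emulate an 8-bit left shift: shift, then mask off everything
;; above the low 8 bits.
(define (shl8 x n)
  (logand (ash x n) #xff))

;; (shl8 #b10000001 1) => 2 : the high bit is shifted out and discarded.
```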
<Sheilong>RhodiumToad: I am too stupid! When I encoded a string to base 64, the results of the bitwise operations were indexes into the base 64 character table. Now, doing the reverse operation, I was applying the bitwise operations to the decimal character representation instead of the index...
***wxie1 is now known as wxie
<theruran>can Guile 3.0 handle the nanopass framework Scheme code?
<Sheilong>Is there a built in procedure that counts occurrences of character in a string or in a list?
<RhodiumToad>yes
<RhodiumToad>string-count for strings
<RhodiumToad>for lists you could use a fold if there isn't any specific function for it
<RhodiumToad>ah, srfi-1 has a (count)
<Sheilong>RhodiumToad: Thanks so much. String-count suffices for now.
<RhodiumToad>so (string-count str #\a) or (count (cut eqv? <> #\a) lst)
<Sheilong>I am detecting how much padding is there
<Sheilong>Now that I know how much padding I need to remove, it would be nice if there were a built-in function to remove the last n characters from a string
<daviid>rnrs base has length
<Sheilong>I am wrong. I want to remove the n last elements of a bytevector not from a string
<daviid>ah occurrences of ... forget length, i misunderstood the question
<RhodiumToad>bytevectors can't have their length changed, you'd need to copy it to a new one
<RhodiumToad>you might prefer just to keep track of the effective length
<Sheilong>bytevector-copy! is what I need, thanks guys
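Since bytevectors are fixed-length, "removing" the last n bytes means copying into a fresh, shorter bytevector, as RhodiumToad notes. A sketch (bytevector-drop-right is a hypothetical helper name):

```scheme
(use-modules (rnrs bytevectors))

;; Return a new bytevector holding all but the last n bytes of bv.
(define (bytevector-drop-right bv n)
  (let* ((len (- (bytevector-length bv) n))
         (out (make-bytevector len)))
    (bytevector-copy! bv 0 out 0 len)
    out))
```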
<daviid>Sheilong: fwiw, ,use (ice-9 format), then
<daviid>(format #t "~8,'0b~%" 15) -> 00001111
<daviid>have to run, bbl
<Sheilong>It is ugly but seems to work lol https://paste.ofcode.org/QTyULvB3rm4Skeh6si9KFf
<rlb>Including the recent compiler changes, any general guidelines wrt choosing character classes vs case for parsing, e.g. if you're parsing something s-expression-ish?
<wingo>rlb: i would use case (or match, or whatever)
<wingo>match is nice because you can use character classes if you want to
<rlb>So for things like "digits" or a fixed set of delimiter chars, should character sets and case perform close enough to the same to not likely matter?
<rlb>(if you have any impression offhand)
<wingo>yeah i think they are similar
<wingo>the tradeoff as i see it is: "case" (or other chains of comparisons) can be more optimal. charsets allow to you be more abstract when specifying the set of characters, and potentially more extensible as unicode evolves
<wingo>i started rewriting guile's reader in scheme yesterday
<wingo>hoping to be able to write an ad-hoc tree-il -> C compiler for use in bootstrapping
<wingo>so i have been thinking about these questions as well
<wingo>one nice thing about case is that effectively the dispatch to the clauses happens in parallel -- i.e. it's not test literals for this clause, otherwise the next, otherwise the next, etc
<wingo>it's o(1) dispatch to the right clause
<wingo>only matters sometimes tho
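The tradeoff wingo describes, sketched as two ways to test for an ASCII digit: a literal case chain (which the compiler can turn into a table switch) versus a SRFI-14 char-set (more abstract and Unicode-aware):

```scheme
(use-modules (srfi srfi-14))

;; Literal dispatch: amenable to O(1) table-switch compilation.
(define (digit-case? ch)
  (case ch
    ((#\0 #\1 #\2 #\3 #\4 #\5 #\6 #\7 #\8 #\9) #t)
    (else #f)))

;; Char-set dispatch: matches any Unicode decimal digit, not just ASCII.
(define (digit-charset? ch)
  (char-set-contains? char-set:digit ch))
```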
*rlb was toying with a scheme edn reader
<rlb>Thanks for the summary.
<rlb>(which might also if it performed well enough eventually grow to replace the current C clj reader -- which itself was stolen wholesale from guile's reader and adapted)
<rlb>(And if I finish it and it seems good enough, I also thought I might submit the pure scheme edn reader for consideration for guile -- if there's interest...)
<rlb>I'll be curious to see what you come up with -- what I'm currently doing is not very optimized.
<wingo>i am aiming to make a more-or-less faithful translation of the C reader to Scheme
<wingo>so as to preserve the same behavior. am trying not to optimize too much in the beginning
<wingo>i know the reader is important but maybe optimization payoffs are different in scheme vs c
<rlb>Indeed.
<rlb>Do we have a way to ask for a character set for a given unicode general category like "Zs" "Zl" and "Zp"?
<rlb>(wrt not optimizing (or maybe not optimizing), I'm using string output ports to accumulate pending text, i.e. unfinished symbols/strings/etc...)
<rlb>(Though I planned to go look and see how they currently work, i.e. are the good enough wrt cost right now.)
*rlb will head off soon
<wingo>rlb: wrt Zs etc, oddly i think not
<wingo>interesting choice, string output ports; hadn't thought of that. not sure how well it would perform; good expandable buffer, but they do round-trip to UTF-8
<wingo>perhaps if we ever switched to utf-8 strings internally in guile, that would be optimal tho
<xelxebar>Okay, I'm stumbling myself through some guile and would like some general advice on approach.
<xelxebar>My goal is to essentially grep for a string in a file.
<xelxebar>The brute-force way I'm working on is to loop through the lines in the file and run a string-match on each.
<xelxebar>Another approach could be to slurp up the entire file into a string, and just do a single string-match, but this seems mildly dangerous in general. I don't want to put a 4GB string in memory by accident.
<xelxebar>Is there a more idiomatic approach here? Am I missing something obvious?
<chrislck>xelxebar: this is a hard(TM) problem. best solved via tools like ripgrep.
<chrislck> https://blog.burntsushi.net/ripgrep/
<xelxebar>Haha! Fair enough. This is largely a pedagogical exercise in guile for me. I was wondering if there's a Guile-y way to do the equivalent of string-match on a file.
<chrislck>the guile way is exactly the same as the C/C++ way - consider buffer size, pointers, memory allocations, etc
<chrislck>you can limit your buffer size easily in guile just as in C
<rekado>xelxebar: this will become as complicated as you’ll allow it
<chrislck>in guile the national sport is minimising m in O(N^m)
<rekado>if you want to look up a fixed string (instead of a regexp) this will become somewhat simpler.
<rekado>but it still involves reading chunks of the file and being smart about matching
<rekado>here’s an entry point to the rabbit hole: https://en.wikipedia.org/wiki/String-searching_algorithm
<civodul>uh, srfi-64 (test-equal label value exp) behaves as if EXP returned #f when it actually threw
<xelxebar>Hrm. The level of abstraction that I'm thinking at here is "shell scripts". I'm trying to write the equivalent of some script I have in guile. If I'm understanding you correctly, it makes sense to just call out to grep/ripgrep/whatever in this case. I.e. there's not some short, idiomatic pure Guile replacement.
<mwette>wingo: While you are looking into strings, maybe look at how Python3 handles them. There are two literals, for example: b'hello' and u'hello'. The first is a bytearray and the second is a unicode string.
<wingo>using utf-8 as a string backing store wouldn't change api fwiw on the scheme side
<wingo>i thought most people thought of the python string / bytestring thing as a disaster
<rekado>xelxebar: there is no single search-string-in-file procedure in Guile. Which implementation is best depends very much on your use case.
<wingo>i would probably loop over calling "read-line" then calling "string-contains" but that's just me
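A minimal sketch of wingo's suggestion, assuming a fixed substring rather than a regexp (grep-file is a hypothetical helper name):

```scheme
(use-modules (ice-9 rdelim))

;; Print every line of a file that contains the given substring.
;; Reads one line at a time, so memory use stays bounded by the
;; longest line rather than the file size.
(define (grep-file needle filename)
  (call-with-input-file filename
    (lambda (port)
      (let loop ((line (read-line port)))
        (unless (eof-object? line)
          (when (string-contains line needle)
            (display line)
            (newline))
          (loop (read-line port)))))))
```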
<mwette>I was getting at clarifying the relationship between strings and {bytearray, encoding}. But going back and reading the Guile manual it is consistent. My problem is that I needed to read "character" as code-point and not byte.
***chrislck_ is now known as chrislck
<mwette> /quit
<rlb>wingo: hmm, I'll have to make sure I understand you better later -- If I have a string port and a char and then (put-char port c), I'd have assumed it just appends c to the string in some "reasonable" way. (And if not, we could fix it.)
<rlb>But I hadn't stopped to think about it hard yet or investigate.
<rlb>And wrt string vs bytes -- let's *not* do what python did, but most importantly by that I mean the backward incompatible break. But I'd say that having functions that can handle either bytes or strings might be (part of) a plausible solution. Though of course issues there too, i.e. (readlink x) could decide to return bytes or a string based on the type of x, but that doesn't help for something like (gethostname).
<rlb>In any case, I'd love to eventually have a plan there, and be happy to help implement it if I can.
<rlb>(Also be happy to help with utf-8 related work -- that was a stumbling block wrt lokke's use of pcre2 for #"foo" regexps since pcre2 does not support utf-32 (only 8 and 16).)
<mwette>Guile has (ice-9 regex) which operates on Guile strings. I assume it's based on posix regex. So conversion currently uses locale but shouldn't the regex routines accept an optional encoding?
<mwette>(ice-9 regex) another target for implementation in scheme?
<leoprikler>iiuc all guile strings are UTF-8. If you need anything else, use bytevectors :)
<RhodiumToad>the problem with that is with reading strings from, say, filenames
<mwette>I believe, strings are arrays of code points, not encoded byte arrays.
<mwette>Files are arrays of bytes. So you need an associated encoding to turn those bytes into code points (aka characters).
<RhodiumToad>the more serious problem is file _names_
<RhodiumToad>everyone expects a file name to be a string, but the OS is treating it as bytes
<mwette>Ah. I remember the issues. I'll bet they are, especially because LC_LANG(?) is volatile.
***amiloradovsky1 is now known as amiloradovsky
<RhodiumToad>well, LC_* might say that the locale is UTF8, but the filename might not be valid UTF8
<mwette>exactly
<mwette>no encoding designation in DIR type?
<RhodiumToad>nope
<rlb>I believe internally strings are either "latin-1" (more or less) or UTF-32, neither of which pcre2 supports.
<rlb>And wrt (ice-9 regex), it's not guaranteed to be available, and when available "does whatever the platform does", and I wanted lokke to have well defined regexps (and they have to be available all the time, so couldn't rely on guile's support anyway)
<rlb>(and of the options pcre seemed like a reasonable choice)
<rlb>And wrt bytes vs strings, yeah, on many platforms (i.e. linux), paths/hostnames/etc. are *bytes*, not strings -- and there are additional potential complexities, since depending on how the unicode is handled, normalization and/or canonicalization can change the string in a way that makes it impossible to find the original file, i.e. if you were to readlink something, then normalize/canonicalize the string, and then try to
<rlb>open it, it might fail.
<rlb>Or I suppose user and group names are perhaps a better example than hostnames...
<rlb>It can be quite fraught to pretend they're not.
<rlb>(or at least to *only* support pretending they're not)
<terpri>utf-8b, as i think it's called, is a fairly clever solution to the random bytes -> almost-utf-8 problem without losing information: https://web.archive.org/web/20090830064219/http://mail.nl.linux.org/linux-utf8/2000-07/msg00040.html
<terpri>iirc it's what python uses for things like posix pathnames (or a very similar trick)
<wingo>moo
<terpri>🐮💬μῡ
<terpri>ah, this refers to python's usage, introduced as a "surrogateescape" error handler in pep 383: https://en.wikipedia.org/wiki/UTF-8#Derivatives
<rlb>Can't actually rely on it though, at least if you're using glibc because there's an issue/bug in glibc that it doesn't look like they're going to address, and python hasn't appeared willing to work around it either: https://bugs.python.org/issue35883
<rlb> https://sourceware.org/bugzilla/show_bug.cgi?id=26034
<rlb>So you can't use that to handle arbitrary paths -- have to stick to bytes.
<rlb>or you may crash your program on the wrong path
<rlb>(I've (unfortunately) had an extraordinary amount of time consumed by the python upstream choices in this general area over the past few years...)
<Sheilong>Do you know how I read all lines from a file but without keeping the \n character?
<ATuin>Sheilong: see `(ice-9 rdelim)`
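read-line from (ice-9 rdelim) strips the trailing delimiter by default (the 'trim handling; pass 'concat to keep it), so collecting a file's lines without \n is just a loop. A sketch (read-lines is a hypothetical helper name):

```scheme
(use-modules (ice-9 rdelim))

;; Return the lines of a port as a list of strings, newlines stripped.
(define (read-lines port)
  (let loop ((acc '()))
    (let ((line (read-line port)))  ; delimiter trimmed by default
      (if (eof-object? line)
          (reverse acc)
          (loop (cons line acc))))))
```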
<terpri>rlb, interesting, thanks for the links
<terpri>tricky to determine whether it's a genuine "bug" or not without diving into several standards (posix, iso c, unicode/iso 10646, ...)
<ecraven>hm.. why is guile on arch linux still guile 2?
<wingo>rlb: neat trick: (define (f ch) (define numeric '(#\0 #\1 #\2 #\3 #\4 #\5 #\6 #\7 #\8 #\9)) (define (numeric? ch) (memv ch numeric)) (cond ((eof-object? ch) #f) ((numeric? ch) (- (char->integer ch) (char->integer #\0))) (else 42)))
<wingo>the numeric? check optimizes to a table switch
<wingo>unfortunately guile is very stupid about eof-object? right now; need to fix that
<ecraven>that ignores almost all of unicode :P
<wingo>depends on what your definition of numeric is
<wingo>some languages define numeric in that way
<wingo>civodul: you ok with excise-ltdl?
<wingo>currently missing the mingw dlopen shim, but can fix up in master; i know mike gran has some work in this area
<wingo>apparently guile is broke on mingw anyway right now
<wingo>mike suggested simply pulling in the dlopen shim from cygwin into posix-w32.[ch], which sounds about right to me
<civodul>wingo: sounds good to me; that's for master, not 3.0, right?
<wingo>master is currently 3.0 fwiw
<wingo>no stable-3.0 branch yet
<civodul>right, that's why i'm asking :-)
<wingo>i was thinking we could do it in 3.0 actually
<civodul>ah
<wingo>:)
<civodul>heh :-)
<civodul>i'm just wondering about possible breakage
<civodul>i don't have any clear scenario in mind tho
<wingo>yeah. i thought about it and i don't know of anything that's not just a minor bug that can be fixed up in 3.0
<wingo>but i know that's not quite watertight :)
<wingo>i would rather not branch off 3.2 at this point
<civodul>yeah
<wingo>if necessary, sure, but we don't have that kind of energy atm
<wingo>nor any kind of 3.2 value proposition, currently anyway
<civodul>i guess you can go ahead and then we'll test on a few packages
<wingo>cool :)
<wingo>ok done
<wingo>with an optimization to eof-object? :)
<civodul>so it can be inlined?
<wingo>yeah, inlines to (eq? x the-eof-object)
<wingo>aaah the optimization of that "f" above is really good now
<wingo> 3 (immediate-tag=? 0 255 12) ;; char?
<wingo> 5 (jne 21) ;; -> L2
<wingo> 6 (untag-char 1 0)
<wingo> 7 (usub/immediate 1 1 48)
<wingo> 8 (mov 0 1)
<wingo> 9 (jtable 0 #(13 13 13 13 13 13 13 13 13 13 26))
<wingo>L1:
<wingo> 22 (tag-fixnum 1 1) at (unknown file):3:159
<wingo> 23 (reset-frame 1) ;; 1 slot
<wingo> 24 (handle-interrupts)
<wingo> 25 (return-values)
<wingo>if it's a char, it untags it, subtracts off #\0 already, then does a table jump. if untagged value is <= 9, it tags as fixnum and returns directly
<wingo>in this case we could compile to (< untagged-biased-value 10) instead of the table jump, which would be even better
<wingo>probably should get that together. but still, yay
<civodul>neat!