<Sheilong>Hello. How can I convert an integer to a bit vector?
<Sheilong>I want to transform an integer in the range 0-255 into a bitvector of 8 bits. E.g. (convert 15) => #*00001111
<justin_smith>you might find that bit-extract helps, it gives the value of some range of bits inside an int
<Sheilong>I wrote a procedure to calculate the hamming distance between two strings, so I had to write a procedure to count the set bits in a given integer, but since I saw that there is a bitvector type with a builtin bit count I've been trying to figure out how to convert an int to that.
<spk121>if you just want the count of the 1 bits, there's a 'bit-count' in srfi-60
<Sheilong>spk121: Yes. But bit-count expects a bitvector as input
<RhodiumToad>speaking of bit operations, I noticed when doing those hex routines a while back that using srfi-60 came with a substantial performance hit
<RhodiumToad>it looked like the builtins were being compiled to bytecode primitives while the exact same functions used from srfi-60 were not
<Sheilong>How do I compare equality among characters?
<Sheilong>(= #\a #\b) <unnamed port>:2798:0: In procedure =: Wrong type argument in position 1: #\a
<Sheilong>I did some bitwise operations on 8-bit integers but I am getting results that are bigger than 8 bits
<Sheilong>I convert a base64 string to a list of chars, then convert the chars to integers, and save them in a bytevector. After that, I try to decode it back to the original base.
<Sheilong>But I think I did something wrong in the bitwise operations
<RhodiumToad>you know you can convert a string straight to a bytevector?
<RhodiumToad>for something like base64 (or hex) where you know all the valid data is plain ascii, use string->utf8
<Sheilong>I confess that I looked for it in the documentation but did not find it before.
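A sketch pulling the threads above together (editor's illustration, not from the log; `u8->bitvector` and `hamming-distance` are made-up names). Note that srfi-60's `bit-count` takes a plain integer, so no bitvector conversion is needed for counting bits, and that characters compare with `char=?` rather than the numeric `=`:

```scheme
(use-modules (srfi srfi-60))   ; bit-count on plain integers

;; 0-255 -> 8-bit bitvector, most significant bit first:
;; (u8->bitvector 15) => #*00001111
(define (u8->bitvector n)
  (list->bitvector
   (map (lambda (i) (logbit? i n))
        (iota 8 7 -1))))              ; bit indexes 7,6,...,0

;; Hamming distance between two equal-length strings: xor the
;; character codes pairwise and count the 1 bits.
(define (hamming-distance a b)
  (apply + (map (lambda (x y)
                  (bit-count (logxor (char->integer x)
                                     (char->integer y))))
                (string->list a)
                (string->list b))))

;; Characters compare with char=?, not the numeric =:
;; (char=? #\a #\a) => #t
```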
<RhodiumToad>that gives you a bytevector with the utf8 representation of the string, which will be all-ASCII if the original string was
<Sheilong>Ahhh, I used that before, but I got an error when I needed to do a xor operation with a key against a set of strings.
<Sheilong>RhodiumToad: Is there a way to work with integers as 8-bit ints? When I extract an int from the bytevector it loses the 8-bit representation.
<RhodiumToad>you want to do things like left shifts discarding all but the low 8 bits?
<Sheilong>RhodiumToad: I am too stupid! When I encoded a string to base64, the results of the bitwise operations were indexes into the base64 character table. Now, doing the reverse operation, I was doing the bitwise operation on the decimal character representation instead of the index...
***wxie1 is now known as wxie
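Guile's integers are arbitrary precision, so 8-bit behaviour has to be imposed by masking after each operation; a minimal sketch of what RhodiumToad describes (editor's illustration; the `u8` helper name is made up):

```scheme
;; Keep only the low 8 bits after any operation to emulate a u8.
(define (u8 x) (logand x #xff))

;; High bits shifted past position 7 are discarded:
(u8 (ash #b11000011 2))  ; => 12, i.e. #b00001100
```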
<theruran>can Guile 3.0 handle the nanopass framework Scheme code?
<Sheilong>Is there a built-in procedure that counts occurrences of a character in a string or in a list?
<RhodiumToad>for lists you could use a fold if there isn't any specific function for it
<Sheilong>RhodiumToad: Thanks so much. string-count suffices for now.
<RhodiumToad>so (string-count str #\a) or (count (cut eqv? <> #\a) lst)
<Sheilong>I am detecting how much padding there is
<Sheilong>Now that I know how much padding I need to remove, it would be nice if there were a built-in function to remove the last n characters from a string
<Sheilong>I am wrong. I want to remove the last n elements of a bytevector, not of a string
<daviid>ah, occurrences of ... forget length, I misunderstood the quiz
<RhodiumToad>bytevectors can't have their length changed, you'd need to copy it to a new one
<RhodiumToad>you might prefer just to keep track of the effective length
<Sheilong>bytevector-copy! is what I need, thanks guys
<daviid>Sheilong: fwiw, ,use (ice-9 format), then
<daviid>(format #t "~8,'0b~%" 15) -> 00001111
<rlb>Including the recent compiler changes, any general guidelines wrt choosing character classes vs case for parsing, e.g. if you're parsing something s-expression-ish?
<wingo>rlb: i would use case (or match, or whatever)
<wingo>match is nice because you can use character classes if you want to
<rlb>So for things like "digits" or a fixed set of delimiter chars, should character sets and case perform close enough to the same to not likely matter?
<rlb>(if you have any impression offhand)
<wingo>yeah i think they are similar
<wingo>the tradeoff as i see it is: "case" (or other chains of comparisons) can be more optimal. charsets allow you to be more abstract when specifying the set of characters, and potentially more extensible as unicode evolves
<wingo>i started rewriting guile's reader in scheme yesterday
<wingo>hoping to be able to write an ad-hoc tree-il -> C compiler for use in bootstrapping
<wingo>so i have been thinking about these questions as well
<wingo>one nice thing about case is that effectively the dispatch to the clauses happens in parallel -- i.e. it's not test literals for this clause, otherwise the next, otherwise the next, etc
<wingo>it's o(1) dispatch to the right clause
*rlb was toying with a scheme edn reader
<rlb>Thanks for the summary.
<rlb>(which might also, if it performed well enough, eventually grow to replace the current C clj reader -- which itself was stolen wholesale from guile's reader and adapted)
<rlb>(And if I finish it and it seems good enough, I also thought I might submit the pure scheme edn reader for consideration for guile -- if there's interest...)
<rlb>I'll be curious to see what you come up with -- what I'm currently doing is not very optimized.
<wingo>i am aiming to make a more-or-less faithful translation of the C reader to Scheme
<wingo>so as to preserve the same behavior. am trying not to optimize too much in the beginning
<wingo>i know the reader is important but maybe optimization payoffs are different in scheme vs c
<rlb>Do we have a way to ask for a character set for a given unicode general category like "Zs" "Zl" and "Zp"?
<rlb>(wrt not optimizing (or maybe not optimizing), I'm using string output ports to accumulate pending text, i.e. unfinished symbols/strings/etc...)
<rlb>(Though I planned to go look and see how they currently work, i.e. are they good enough wrt cost right now.)
<wingo>rlb: wrt Zs etc, oddly i think not
<wingo>interesting choice, string output ports; hadn't thought of that. not sure how well it would perform; good expandable buffer, but they do round-trip to UTF-8
<wingo>perhaps if we ever switched to utf-8 strings internally in guile, that would be optimal tho
<xelxebar>Okay, I'm stumbling myself through some guile and would like some general advice on approach.
<xelxebar>My goal is to essentially grep for a string in a file.
<xelxebar>The brute-force way I'm working on is to loop through the lines in the file and run a string-match on each.
<xelxebar>Another approach could be to slurp up the entire file into a string, and just do a single string-match, but this seems mildly dangerous in general. I don't want to put a 4GB string in memory by accident.
<xelxebar>Is there a more idiomatic approach here? Am I missing something obvious?
<chrislck>xelxebar: this is a hard(TM) problem. best solved via tools like ripgrep.
<xelxebar>Haha! Fair enough. This is largely a pedagogical exercise in guile for me. I was wondering if there's a Guile-y way to do the equivalent of string-match on a file.
<chrislck>the guile way is exactly the same as the C/C++ way - consider buffer size, pointers, memory allocations, etc
<chrislck>you can limit your buffer size easily in guile just as in C
<rekado>xelxebar: this will become as complicated as you’ll allow it
<chrislck>in guile the national sport is minimising m in O(N^m)
<rekado>if you want to look up a fixed string (instead of a regexp) this will become somewhat simpler.
<rekado>but it still involves reading chunks of the file and being smart about matching
<civodul>uh, srfi-64 (test-equal label value exp) behaves as if EXP returned #f when it actually threw
<xelxebar>Hrm. The level of abstraction that I'm thinking at here is "shell scripts". I'm trying to write the equivalent of some script I have in guile. If I'm understanding you correctly, it makes sense to just call out to grep/ripgrep/whatever in this case. I.e. there's not some short, idiomatic pure Guile replacement.
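The chunk-based fixed-string search rekado describes might look like this (editor's sketch, not from the log; `file-contains?` is a made-up name). The subtle part is keeping an overlap between chunks so a match straddling a chunk boundary is still found:

```scheme
(use-modules (ice-9 textual-ports)  ; get-string-n
             (srfi srfi-13))        ; string-contains, string-take-right

;; Does FILE contain the fixed string NEEDLE? Reads 4 KiB chunks so
;; memory stays bounded; carries the last (len - 1) chars forward so
;; matches across a chunk boundary aren't missed.
(define (file-contains? file needle)
  (call-with-input-file file
    (lambda (port)
      (let ((keep (max 0 (1- (string-length needle)))))
        (let loop ((carry ""))
          (let ((chunk (get-string-n port 4096)))
            (if (eof-object? chunk)
                #f
                (let ((buf (string-append carry chunk)))
                  (or (and (string-contains buf needle) #t)
                      (loop (string-take-right
                             buf (min keep (string-length buf)))))))))))))
```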
<mwette>wingo: While you are looking into strings, maybe look at how Python3 handles them. There are two literals, for example: b'hello' and u'hello'. The first is a bytearray and the second is a unicode string.
<wingo>using utf-8 as a string backing store wouldn't change the api on the scheme side fwiw
<wingo>i thought most people thought of the python string / bytestring thing as a disaster
<rekado>xelxebar: there is no single search-string-in-file procedure in Guile. Which implementation is best depends very much on your use case.
<wingo>i would probably loop over calling "read-line" then calling "string-contains" but that's just me
<mwette>I was getting at clarifying the relationship between strings and {bytearray, encoding}. But going back and reading the Guile manual, it is consistent. My problem is that I needed to read "character" as code point and not byte.
***chrislck_ is now known as chrislck
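wingo's line-based suggestion, spelled out (editor's sketch; `grep-first` is a made-up name, and `string-contains` comes from srfi-13, which Guile loads by default):

```scheme
(use-modules (ice-9 rdelim))  ; read-line

;; Return the first line of FILE containing NEEDLE, or #f.
;; One line in memory at a time, so large files are fine.
(define (grep-first file needle)
  (call-with-input-file file
    (lambda (port)
      (let loop ()
        (let ((line (read-line port)))   ; newline already stripped
          (cond ((eof-object? line) #f)
                ((string-contains line needle) line)
                (else (loop))))))))
```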
<rlb>wingo: hmm, I'll have to make sure I understand you better later -- if I have a string port and a char and then (put-char port c), I'd have assumed it just appends c to the string in some "reasonable" way. (And if not, we could fix it.)
<rlb>But I hadn't stopped to think about it hard yet or investigate.
<rlb>And wrt string vs bytes -- let's *not* do what python did, but most importantly by that I mean the backward-incompatible break. But I'd say that having functions that can handle either bytes or strings might be (part of) a plausible solution. Though of course there are issues there too, i.e. (readlink x) could decide to return bytes or a string based on the type of x, but that doesn't help for something like (gethostname).
<rlb>In any case, I'd love to eventually have a plan there, and be happy to help implement it if I can.
<rlb>(Also be happy to help with utf-8 related work -- that was a stumbling block wrt lokke's use of pcre2 for #"foo" regexps, since pcre2 does not support utf-32 (only 8 and 16).)
<mwette>Guile has (ice-9 regex) which operates on Guile strings. I assume it's based on posix regex. So conversion currently uses the locale, but shouldn't the regex routines accept an optional encoding?
<mwette>(ice-9 regex) another target for implementation in scheme?
<leoprikler>iiuc all guile strings are UTF-8. If you need anything else, use bytevectors :)
<RhodiumToad>the problem with that is with reading strings from, say, filenames
<mwette>I believe strings are arrays of code points, not encoded byte arrays.
<mwette>Files are arrays of bytes. So you need an associated encoding to turn those bytes into code points (aka characters).
<RhodiumToad>everyone expects a file name to be a string, but the OS is treating it as bytes
<mwette>Ah. I remember the issues. I'll bet they are, especially because LC_LANG(?) is volatile.
***amiloradovsky1 is now known as amiloradovsky
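For reference, the (ice-9 regex) interface mwette mentions wraps the host's POSIX regex functions and operates on Guile strings; a quick usage sketch (editor's illustration):

```scheme
(use-modules (ice-9 regex))

;; string-match runs a POSIX extended regexp against a string and
;; returns a match structure (or #f on no match).
(define m (string-match "[0-9]+" "guile-3.0"))
(match:substring m)  ; => "3"
```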
<RhodiumToad>well, LC_* might say that the locale is UTF8, but the filename might not be valid UTF8
<mwette>no encoding designation in the DIR type?
<rlb>I believe internally strings are either "latin-1" (more or less) or UTF-32, neither of which pcre2 supports.
<rlb>And wrt (ice-9 regex), it's not guaranteed to be available, and when available it "does whatever the platform does", and I wanted lokke to have well-defined regexps (and they have to be available all the time, so I couldn't rely on guile's support anyway)
<rlb>(and of the options pcre seemed like a reasonable choice)
<rlb>And wrt bytes vs strings, yeah, on many platforms (i.e. linux), paths/hostnames/etc. are *bytes*, not strings -- and there are additional potential complexities, since depending on how the unicode is handled, normalization and/or canonicalization can even mean that the string changes in a way that makes it impossible to find the original file, i.e. if you were to readlink something, then normalize/canonicalize the string, and then try to open it, it might fail.
<rlb>Or I suppose user and group names are perhaps a better example than hostnames...
<rlb>It can be quite fraught to pretend they're not.
<rlb>(or at least to *only* support pretending they're not)
<terpri>iirc it's what python uses for things like posix pathnames (or a very similar trick)
<rlb>Can't actually rely on it though, at least if you're using glibc, because there's an issue/bug in glibc that it doesn't look like they're going to address, and python hasn't appeared willing to work around it either: https://bugs.python.org/issue35883
<rlb>So you can't use that to handle arbitrary paths -- have to stick to bytes.
<rlb>or you may crash your program on the wrong path
<rlb>(I've (unfortunately) had an extraordinary amount of time consumed by the python upstream choices in this general area over the past few years...)
<Sheilong>Do you know how I read all lines from a file but without keeping the \n character?
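One way to do what Sheilong asks, using (ice-9 rdelim), whose read-line drops the trailing delimiter by default (editor's sketch; `read-lines` is a made-up name):

```scheme
(use-modules (ice-9 rdelim))

;; Read every line of FILE into a list, without the \n terminators
;; (pass 'concat to read-line if you ever want them kept).
(define (read-lines file)
  (call-with-input-file file
    (lambda (port)
      (let loop ((lines '()))
        (let ((line (read-line port)))
          (if (eof-object? line)
              (reverse lines)
              (loop (cons line lines))))))))
```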
<ATuin>Sheilong: see `(ice-9 rdelim)`
<terpri>rlb, interesting, thanks for the links
<terpri>tricky to determine whether it's a genuine "bug" or not without diving into several standards (posix, iso c, unicode/iso 10646, ...)
<ecraven>hm.. why is guile on arch linux still guile 2?
<wingo>rlb: neat trick: (define (f ch) (define numeric '(#\0 #\1 #\2 #\3 #\4 #\5 #\6 #\7 #\8 #\9)) (define (numeric? ch) (memv ch numeric)) (cond ((eof-object? ch) #f) ((numeric? ch) (- (char->integer ch) (char->integer #\0))) (else 42)))
<wingo>the numeric? check optimizes to a table switch
<wingo>unfortunately guile is very stupid about eof-object? right now; need to fix that
<ecraven>that ignores almost all of unicode :P
<wingo>depends on what your definition of numeric is
<wingo>some languages define numeric in that way
<wingo>civodul: you ok with excise-ltdl?
<wingo>currently missing the mingw dlopen shim, but can fix up in master; i know mike gran has some work in this area
<wingo>apparently guile is broken on mingw anyway right now
<wingo>mike suggested simply pulling in the dlopen shim from cygwin into posix-w32.[ch], which sounds about right to me
<civodul>wingo: sounds good to me; that's for master, not 3.0, right?
<wingo>master is currently 3.0 fwiw
<wingo>i was thinking we could do it in 3.0 actually
<civodul>i'm just wondering about possible breakage
<civodul>i don't have any clear scenario in mind tho
<wingo>yeah. i thought about it and i don't know of anything that's not just a minor bug that can be fixed up in 3.0
<wingo>but i know that's not quite watertight :)
<wingo>i would rather not branch off 3.2 at this point
<wingo>if necessary, sure, but we don't have that kind of energy atm
<wingo>nor any kind of 3.2 value proposition, currently anyway
<civodul>i guess you can go ahead and then we'll test on a few packages
<wingo>with an optimization to eof-object? :)
<wingo>yeah, inlines to (eq? x the-eof-object)
<wingo>aaah the optimization of that "f" above is really good now
<wingo>   3 (immediate-tag=? 0 255 12)   ;; char?
<wingo>   9 (jtable 0 #(13 13 13 13 13 13 13 13 13 13 26))
<wingo>  22 (tag-fixnum 1 1)             at (unknown file):3:159
<wingo>  23 (reset-frame 1)              ;; 1 slot
<wingo>if it's a char, it untags it, subtracts off #\0 first, then does a table jump. if the untagged value is <= 9, it tags as fixnum and returns directly
<wingo>in this case we could compile to (< untagged-biased-value 10) instead of the table jump, which would be even better
<wingo>probably should get that together. but still, yay