IRC channel logs

2024-03-22.log

back to list of logs

<rlb>Hmm, bitmap? For utf-8, index[i] is recording the offset of the start of the (* i stride) character's first byte, so we need an integer -- u8 for strings <= 255 chars, u16 for...
<rlb>sorry <= 255 bytes
<rlb>(early error in some of the code iirc :) )
<old>one bit per byte
<old>set the bit to 1 if this byte is the start of a character
<old>utf-8 is what, maximum 4 bytes?
<rlb>right, but we don't want to look at the bytes, we want to jump to "nearby" right off the bat.
<freakingpenguin>rlb: Interesting, is this code posted publicly?
<rlb>unless I misunderstand.
<rlb> https://codeberg.org/rlb/guile/src/branch/utf8 -- and should be up to date wrt main, unless main's changed in the past few days.
<freakingpenguin>Thanks!
<old>rlb: If the goal is to find the character K in string S, then finding the K bit set to one in the bitmap will give you the correct index in byte in S to find the character
<old>that's my understanding of the problem
<old>it can be made very efficiant et vey low overhead in memory
<rlb>old: i.e. if you want to go to the 35th char, you jump to index[0] byte and then use utf8_next a few times.
<rlb>(no more than 32 if the stride is 32)
<old>hm
<old>let me see somethin
<old>does supporting multi codepoint a thing?
<old>e.g. > 4 bytes character
<rlb>OK, you're saying we'd allocate a (* 4 char-count) bitmap?
<old>I would allocate (length / sizeof(unsigned long) + 1) * sizeof(unsigned long) bits
<old>where length is the length in bytes in total
<rlb>Oh, right, you'd only need as many bits as the utf-8 encoding has bytes?
<rlb>ish
<rlb>Would have to ponder -- definitely different cost tradeoffs, I think, at least.
<old>something liek this: https://paste.sr.ht/~old/561dff1c88f138da8e95a9a6a1f6b4e1f215099b
<old>not tested, but this is super super fast
<rlb>The current approach is just one indirect offset and a bit of iteration, and the cost is tunable via the stride. I'd also assumed if we wanted to, we could probably use vector ops (if the compiler isn't already) for the iteration...
<old>yes I understand. The current approach is very good also
<old>would need to measure the difference in term of time overhead and memory overhead
<rlb>Offhand, I wonder if it might not be too hard to compare if we do decide to head in the general direction of that branch, i.e. might not be too hard to "plug in" either approach. The index (everything) is in-line for cache friendliness, and the index is at the end. It's also immutable, so it might be a fairly isolated set of changes.
<rlb>Not sure.
<rlb>(offhand)
<rlb>Was also wondering if the bitmap approach always paid more for larger indexes, but might be tricks to play there too...
<rlb>Hmm, or not, I guess, since it's a count.
<rlb>dunno, but can't focus on it right this minute :)
<old>the biggest problem with the bitmap scanning is that as the string get larger, you pay more in term of cache misses for far access
<old>e.g., you want to access the 1024 characters, you gonna pay couple of cache misses there in the bitmap
<rlb>Well, and you have to traverse it, i.e. more data read, even ignoring cache eviction effects...
<rlb>as compared to the direct jump
<old>the sparse array limit the amount of cache misses no matter the wanted reference
<rlb>right
<old>so I'm all for the sparse actually lol
<rlb>Heh, well as mentioned, I *think* we might could do either without too much extra work -- for now I'm assuming the main question is "is this the direction we want to head in general". That branch is a quite a lot of change after all. i.e. have to see what others think when they have time to take a look.
<rlb>I've set it aside for now, other than rebasing.
<old>for a 4096 characters, each 4 bytes (16384 bytes) you pay 2048 bytes in memory (overhead of 12.5%). Best case scenario is reference 0, you get potentially 1 cache misses in the bitmap. Worst case, you have to iterate over all the bitmap for 256 cache misses
<old>might be worth only for small string then
<rlb>Also, I'd be inclined to fix the thread deadlock issue first, however we decide we want to do that. Though I still haven't gotten your version up for others to compare (thought I might, just haven't gotten to it).
<old>would be nice for the thread deadlock!
<rlb>ACTION has extra motivation there, wants the faster parallel tests :)
<old>My mistake, actually 32 cache misses in the previous example
<old>still more than 32 next_utf8 I suppose
<mwette>I have a printf implementation that should return length of string printed. Should a string-output version be a different function (like sprintf) or is using `#f' for the port argument on printf OK?
<shawnw>I like different functions, not Common Lisp format style.
<mwette>shawnw: thanks; two is where I was going, after originally one
<sneek>chrislck: Greetings!!
<chrislck>sneek: botsnack
<sneek>:)
<sneek>Welcome back chrislck :)
<dsmith>sneek, botsnack
<sneek>:)