IRC channel logs
2026-04-08.log
<rlb>wondering if there's a bug or surprising behavior in u8_conv_to_encoding --- I *think* it says that if the result buf is equal to the input buf, then no allocation was needed, and the conversion worked, but that's not what I observe when a BOM is involved. Instead, if the buffer can't fit the character and the BOM, it adds the BOM and whatever of the character will fit, and then returns the buffer as if things were "fine".
<rlb>It's not critical because I think I have a different way to do what we want (maybe the "right" way) using an overlarge buffer and its "offsets", but its behavior did surprise me.
<rlb>jcowan: do you happen to know if a BOM is required to be no larger than the largest encoded character? I assume so, but wondered if that was official somehow...
<jcowan>By no means. The BOM is U+FEFF; the largest possible character is U+1FFFFF.
<jcowan>The largest character currently assigned is U+10FFFD in the Private Zone, and the largest character publicly assigned is U+E01EF.
<jcowan>In addition, the last graphical character (a Chinese ideograph) is U+33479.
<rlb>jcowan: sorry, I meant that I was wondering if a BOM (is there only one flavor?) could ever be longer than the longest encoding of some other character in that encoding. I ask because I was trying to figure out if a 2 * MB_LEN_MAX buffer with libunicode would always be enough for the BOM and an encoded null (no matter the encoding).
<JohnCowan>yes, only one BOM character, though it has 3 representations as UTF-8, -16, -32
<JohnCowan>So in UTF-16 the BOM is only 2 bytes whereas characters above U+FFFF require 4 bytes
<rlb>Right --- motivation right now is that libunistring appears to insert a BOM when only asked to just encode a null, and so I need to have a large enough buffer for any encoded BOM plus null.
<rlb>(Unless I misunderstand, it also doesn't actually appear to tell you if the encoded bom plus null doesn't fit in the buffer you gave it...)
<rlb>But I can accommodate that I think.
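The sizes mentioned above can be tabulated in a small C sketch. The struct, table, and function names are hypothetical and exist only for illustration; the byte counts are those of the three standard UTF encoding forms, and other encodings are not covered.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical table for illustration: encoded sizes, in bytes, of the
   BOM (U+FEFF), of NUL (U+0000), and of the longest character. */
struct utf_sizes {
  const char *name;
  size_t bom_bytes;
  size_t nul_bytes;
  size_t max_char_bytes;
};

static const struct utf_sizes utf_table[] = {
  { "UTF-8",  3, 1, 4 },  /* BOM is EF BB BF */
  { "UTF-16", 2, 2, 4 },  /* characters above U+FFFF use a surrogate pair */
  { "UTF-32", 4, 4, 4 },  /* every character, BOM included, is 4 bytes */
};

/* Return the table entry for NAME, or NULL if it is not listed. */
static const struct utf_sizes *
lookup_utf_sizes (const char *name)
{
  assert (name != NULL);
  for (size_t i = 0; i < sizeof utf_table / sizeof utf_table[0]; i++)
    if (strcmp (utf_table[i].name, name) == 0)
      return &utf_table[i];
  return NULL;
}

/* Bytes needed to hold a BOM followed by an encoded NUL. */
static size_t
bom_plus_nul_bytes (const struct utf_sizes *s)
{
  assert (s != NULL);
  return s->bom_bytes + s->nul_bytes;
}
```

For these three forms the BOM is never longer than the longest character, and a BOM plus NUL needs at most 8 bytes (the UTF-32 case).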
<rlb>(And I feel like the function ought to be returning ENOMEM when the bom + null it's trying to write won't fit, but it doesn't appear to --- still, think it's not going to matter.)
<rlb>Am I just missing something here? u8_conv_to_encoding for "\0" encodes either a null, or a BOM followed by a null depending on the selected output encoding, but I don't see any way to tell whether or not it wrote a BOM, and so where the encoded null starts. It does allow you to ask for the encoded string's character positions via offsets[] but offsets[0] is the BOM when it writes one, not the null.
<rlb>It also looks like it doesn't add the bom length to the encoded string length it reports.
<rlb>i.e. for utf-32 it reports a length of 8, but starts the encoding with the bom (0xff 0xfe ...)
<rlb>My current impression is that either this is just a bug, or they (without documenting it) expect you to look for the bom at the start, and add the +2 yourself...
<rlb>I'll probably take a look at libunistring's source tomorrow.
<rlb>For utf-32 it writes (ff fe 0 0 0 0 0 0) and then random data in the two following bytes, so just seems broken.
<JohnCowan>If the length is reported as 8 bytes then you shouldn't be looking past that in C
<JohnCowan>the first 4 bytes are the BOM and the next 4 are a NUL.
<rlb>Upon further investigation, libunistring (via iconv) returns, for example, (BOM NULL) for a UTF-32 target, and (NULL) for UTF-32BE/LE. So I'm wondering how we can handle that to safely detect when there's a bom (across all possible encodings), and skip it.
<rlb>I'd hoped that the "offset[0]" you can ask for from u8_conv_to_encoding would point past the BOM, but it doesn't (because it's relying on iconv, and iconv doesn't tell you). Unless the two BOM flavors were guaranteed unique as a prefix across all possible encodings (not just the common ones), I'm not sure how we can tell.
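The (BOM NULL) versus (NULL) observation can be checked directly against iconv with a minimal sketch like the one below. It uses only the standard `iconv_open`, `iconv`, and `iconv_close` calls; the helper name `encode_nul` is hypothetical, and which encoding names are available, and whether a BOM is emitted, depends on the iconv implementation in use.

```c
#include <assert.h>
#include <iconv.h>
#include <stddef.h>

/* Encode the one-byte UTF-8 string "\0" into TOCODE and return the number
   of bytes iconv produced, or -1 if the conversion is unavailable or
   fails.  OUT must have room for OUT_SIZE bytes. */
static long
encode_nul (const char *tocode, unsigned char *out, size_t out_size)
{
  assert (tocode != NULL && out != NULL);

  iconv_t cd = iconv_open (tocode, "UTF-8");
  if (cd == (iconv_t) -1)
    return -1;

  char in[1] = { 0 };
  char *inp = in;
  size_t in_left = sizeof in;
  char *outp = (char *) out;
  size_t out_left = out_size;

  size_t rc = iconv (cd, &inp, &in_left, &outp, &out_left);
  iconv_close (cd);
  if (rc == (size_t) -1)
    return -1;

  /* Bytes written: a bare NUL, or a BOM followed by a NUL. */
  return (long) (out_size - out_left);
}
```

Calling `encode_nul ("UTF-32", buf, sizeof buf)` and `encode_nul ("UTF-32LE", buf, sizeof buf)` with a 16-byte buffer and comparing the returned lengths shows whether a given iconv prepends a BOM for the endian-neutral name.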
<jcowan>rlb: without knowing the encoding you cannot safely detect anything, although there are heuristics
<rlb>My impression is that this isn't an issue for *iconv* because it's really only designed for "whole string" conversions, where the BOM is just an expected part of the result.
<rlb>But u8_conv_to_encoding is trying to use it to work per-character, and here that falls down.
<jcowan>My favorite example of this is US-BSCII, an encoding I invented to illustrate the point. It's the same as US-ASCII, except that 'A' and 'B' are exchanged.
<rlb>i.e. they should probably be detecting the bom, and skipping it, but of course as you say, I assumed that'd require knowing details of the current encoder.
<jcowan>So you look at (say) an XML document, which (unless it is in UTF-8 or a UTF-16 variant) must declare its own encoding at the top.
<rlb>Right --- I was/am concerned that libunistring just doesn't work right here (even though it suggests it should via the offsets[]), and so we're "stuck", and iconv just isn't designed to handle our case either.
<jcowan>But when you read the "encoding=us-bscii" declaration and you interpret it as US-ASCII it will appear to say "encoding=us-ascii" and then everything after that is wrong.
<rlb>If we knew for sure which encodings have a bom, then we'd be OK, but that seems like something we'd need iconv() or similar, to tell us.
<rlb>(in the general case)
<jcowan>Similarly, when your bytes begin ff fe, you don't know if it's a BOM (which is right if the encoding is UTF-16), or the first half of a BOM (if the encoding is UTF-32)
<jcowan>or even a ZWNBSP character if the encoding is UTF-16LE or UTF-32LE
<rlb>For our current particular cases, if we knew for certain "the largest unicode null encoding that could ever exist", then we could just always use that max, and waste a bit of space --- since you said that unicode null must be encoded as 0 bytes.
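The ff fe ambiguity can be illustrated with a minimal C sketch. The function `bom_length` is hypothetical and handles only a handful of encoding names; it is meant to show that the same leading bytes give different answers depending on the declared encoding, not to serve as a general detector.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Return how many leading bytes of BUF are a BOM when BUF is known to be
   in ENCODING, or 0 if there is none.  Only a few encoding names are
   handled; this is an illustration, not a general detector. */
static size_t
bom_length (const char *encoding, const unsigned char *buf, size_t len)
{
  assert (encoding != NULL && buf != NULL);

  if (strcmp (encoding, "UTF-8") == 0)
    return (len >= 3 && memcmp (buf, "\xEF\xBB\xBF", 3) == 0) ? 3 : 0;

  if (strcmp (encoding, "UTF-16") == 0)
    return (len >= 2 && (memcmp (buf, "\xFF\xFE", 2) == 0
                         || memcmp (buf, "\xFE\xFF", 2) == 0)) ? 2 : 0;

  if (strcmp (encoding, "UTF-32") == 0)
    return (len >= 4 && (memcmp (buf, "\xFF\xFE\x00\x00", 4) == 0
                         || memcmp (buf, "\x00\x00\xFE\xFF", 4) == 0)) ? 4 : 0;

  /* UTF-16LE, UTF-32LE, and so on: a leading U+FEFF is an ordinary
     character (ZWNBSP), not a BOM, so there is nothing to skip. */
  return 0;
}
```

Given the bytes ff fe 00 00, this returns 2 for "UTF-16", 4 for "UTF-32", and 0 for "UTF-16LE", which is why the prefix alone cannot settle the question.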
<jcowan>The largest that can possibly exist is 4 bytes
<rlb>OK, well if we're certain of that, and the "is always encoded as zero bytes", then that may be a sufficient hack for now --- ideally we'd have a more flexible unicode library...
<jcowan>There are only a handful of encodings that use more than 4 bytes per character, and they don't have BOMs.
<rlb>Hmm, I need to check a couple more things...
<rlb>old: fwiw I'm leaning toward thinking that scm_to_stringn never should have offered the automagic addition of an encoding-appropriate null terminator (when lenp is null). i.e. when any given encoding might or might not require a BOM, and "terminator" (null character) encodings vary.
<rlb>Note that for UTF-{16,32} (not UTF-{16,32}{BE,LE}) every result should and will have a BOM prefix, so encoding "x" gives you the BOM, then x, so it'd be a bit weird (even if we could figure it out) to omit the BOM for an empty string. So fine, we can just start returning BOM followed by encoded null for "" there by just asking libunistring/iconv to encode "".
<rlb>But then we're still stuck having to figure out what an encoded null looks like for an arbitrary encoding (without the bom) for non-empty strings and lenp == null, and we don't currently have a way to do that, I don't think.
<old>I am not sure to have followed well what is the problem at hand here
<rlb>i.e. at least so far, I wish scm_to_stringn hadn't supported the lenp == NULL automagic --- you can always call scm_to_stringn with nulls in the input if you want that.
<old>was this not an error of libunistring?
<old>So what is the main problem here? converting the string "\0" does not work as intended because of BOM? Sorry I'm kind of paged out on the context
<rlb>OK, so it's all about scm_to_stringn() --- if you ask it to encode "x", depending on the encoding you might just get back an encoded x, or you might get back a BOM then an x. That's expected.
<rlb>We get that via libunistring->iconv, and iconv is designed to encode "whole strings" (i.e. it's not per-char).
<rlb>An encoded string might or might not start with a BOM depending on the encoding to tell you what its endianness is, unless you asked for an endian-specific flavor (for encodings where it's relevant) i.e. UTF-32 vs UTF-32{BE,LE}. The latter don't include a BOM in the result.
<rlb>So if we ask scm_to_stringn to encode "x" as UTF-32, we get BOM then x.
<rlb>But if we set lenp to null, it's supposed to tack on a "terminator".
<rlb>For UTF-32 that should be four 0 bytes. But how do *we* know that? iconv knows that, but it won't tell us directly.
<rlb>Also u8_conv_to_encoding/iconv return nothing for "", but should we return a BOM plus null, or just null from scm_to_stringn when lenp is null?
<rlb>(if we could even figure out what "null" was for the arbitrary requested encoding)
<rlb>Basically "automagic null termination" seems fine for encodings "you know", but not for arbitrary encodings. So it's fine for scm_to_latin1, scm_to_utf8, etc., but possibly not great here.
<rlb>As a safety measure, I suppose we could just start adding 4 0 bytes to every lenp == NULL result there. I'll have to think about it a bit more, but maybe that's a tolerable hack...
<rlb>i.e. maybe we can just do that and not worry about it --- it'll be wasteful for ascii/latin-1/utf-16, etc. (basically everything but utf-32), but safe (depending of course, on what the terminator is even for, for encodings other than utf-8 and maybe utf-16/utf-32).
<rlb>(Overall, I'm not sure that what scm_to_stringn "intended" with lenp == null really makes sense.)
<jcowan>Yes, adding 4 #x00 bytes to the end is always safe.
<rlb>ACTION is seeing how that looks...
<old>I mean, like john says, can't we just append a bunch of null bytes at the end?
<old>that ought to work with variable length characters as well?
<rlb>right, that's what I'm saying --- it's a hack, but maybe it's "fine".
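The "append four zero bytes" hack can be sketched in C as below. This is not the actual scm_to_stringn change; the constant `nul_padding_size` and the helper `copy_with_nul_padding` are hypothetical names, and the value 4 is the upper bound on an encoded NUL assumed in the discussion.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Assumed upper bound on the size of an encoded NUL, per the discussion:
   four zero bytes cover UTF-32 and are redundant but harmless for
   narrower encodings. */
static const size_t nul_padding_size = 4;

/* Return a newly allocated copy of the LEN encoded bytes in BUF followed
   by nul_padding_size zero bytes, or NULL on overflow or allocation
   failure.  The caller frees the result. */
static unsigned char *
copy_with_nul_padding (const unsigned char *buf, size_t len)
{
  assert (buf != NULL || len == 0);

  if (len > SIZE_MAX - nul_padding_size)
    return NULL;

  unsigned char *result = malloc (len + nul_padding_size);
  if (result == NULL)
    return NULL;

  if (len > 0)
    memcpy (result, buf, len);
  memset (result + len, 0, nul_padding_size);
  return result;
}
```

The padding wastes up to three bytes for single-byte encodings, which is the trade-off accepted above in exchange for not needing to know the terminator of an arbitrary encoding.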
<rlb>(It's also easier ;) )
<old>I mean we could also ..
<old>with the utf-8 change
<rlb>sure, figured I'd want to ponder that --- right now I'm just trying the hack as a "safety fix" for main.
<rlb>but could be the hack is "fine" overall.
<rlb>I'll also ponder whether any of what I think I've figured out warrants any comments in the scm_to_stringn docs.
<old>I think it's very much fine
<old>the old behavior was broken anyway
<old>adding 4 bytes is a very small waste
<old>but if you want to be optimal, you could detect certain common encodings and just emit the known null-byte I guess
<old>I don't think it's worth the hassle tho
<rlb>fair point --- for now, perhaps I'll just plan to think about that wrt utf8 (still not sure we'll care), and just do 4 in main.
<rlb>dthompson: is there already some way that we/you handle libguile specific vs scheme-only bits wrt hoot? Or rather, may not end up mattering, but is there already some way (in guile itself) to have a scheme implementation of something that can be used (by say hoot) when it's not possible to use a libguile version?
<rlb>Mostly just curious.
<rlb>(If say it turned out that a C persistent vector was enough faster to be worth it (might not be) when you can use it, but we still wanted hoot to have them too --- since I think both implementations might be pretty small.)
<old>rlb: I will read it more thoroughly later but
<old>I don't like this 4 magic constant
<old>I know the rationale here but
<old>maybe just declaring it as: const size_t biggest_null_size = 4; or something along this
<old>with the rationale comment above it
<jcowan>should really be biggest_NUL_size
<dthompson>rlb: hoot has its own separate implementations of things provided by libguile in the (hoot ...) namespace
<rlb>dthompson: I guess then it'd (hypothetically) be some cond-expand or other facility to allow detecting and picking the libguile flavor when available, otherwise the scheme one...
<rlb>(anyway, thanks, just wondered what might have already been done)
<dthompson>I don't think we should have a C and a Scheme implementation of the same thing, though
<rlb>OK --- I'd only wondered because it might be small/easy, and we'd already have the tests to keep us honest for the C version (in that world).
<rlb>But I guess, if hoot is the only likely consumer, perhaps I agree.
<dthompson>even for the guile vm I think it would be best for the code to be in scheme.
<rlb>Of course, if there's not some huge performance difference.
<dthompson>iterators won't be continuation safe unless they are in Scheme, etc.
<rlb>I'll have to think that through, but wondering if that applies for these persistent vectors, i.e. everything's immutable (as far as what's provided by the C side).
<rlb>old: switched the naming to nul, though mostly switched to #\nul since that seems even clearer in the relevant places.
<dsmith-work>Misuse of those names always annoyed me. "NULL" is a C pointer thing. "NUL" is an ascii code (like CR or LF).
<JohnCowan>in IBM PL/I, NULL is PL/I's idea of a null pointer, whereas C's null pointer is called SYSNULL. On s390 they are not the same bit pattern