<Rashack>any tips on handling files with utf-8 encoded file names? I've been using file-system-tree and file-system-fold, and they both fail.
<Rashack>so, it seems to be (stat ...) that has problems, but only when it gets handed the file from file-system-fold, file-system-tree or ftw
<Rashack>where the filename is la's, but the quote is utf-8 encoded
<Rashack>when i do (stat "la's") from the repl everything works fine
<mark_weaver>Rashack: you need to set a UTF-8 locale if you'll be working with UTF-8 encoded filenames
<mark_weaver>Rashack: usually the best thing is to run (setlocale LC_ALL "") near the beginning of your program, which sets the locale according to the usual environment variables
<Rashack>so the repl inherits this from my shell, but my script doesn't?
<mark_weaver>the REPL sets the locale automatically, but in normal programs (including scripts) you have to do it yourself
<mark_weaver>in 2.2 (not yet released), we'll set the locale automatically for guile scripts.
<Rashack>but i'm not sure i like it (maybe because i never really understood LC_ALL)
<mark_weaver>I agree that the 2.2 behavior of automatically setting the locale according to the environment variables is what most users expect.
<mark_weaver>(setlocale LC_ALL "") essentially does the equivalent of what most scripting languages and i18n'd programs do.
<Rashack>i guess i expected "it" to just handle filenames regardless of the filename encoding
<mark_weaver>Rashack: how would you have that work? strings in guile are sequences of unicode code points, whereas POSIX filenames are sequences of bytes. how would you do that conversion?
<Rashack>i think i don't like the fact that i don't fully understand why the locale would matter in this case
<Rashack>i don't see why there would have to be a conversion
<Rashack>some code in guile reads a file, passes it to some other code in guile, that doesn't understand it
<Rashack>aren't "unicode code points" bytes at some level as well?
Depending on the encoding
<mark_weaver>I don't see how to turn a sequence of bytes into a sequence of unicode code points without some kind of conversion.
<Rashack>maybe not in our heads, but when they're stored in a computer they are
<mark_weaver>if you want to treat filenames as unicode strings, then you need to know what the encoding of the bytes is.
<mark_weaver>they can be converted to+from sequences of bytes if and only if you know what encoding to use. that's a conversion process.
<mark_weaver>we (and I) have done a lot of research on this topic and put a lot of thought into it. if you think we did things wrong, then do some research and make a proposal :)
<Rashack>i'm trying to understand here (perhaps against better judgement)
<mark_weaver>unfortunately, the fundamental problem here is that POSIX filenames are just sequences of bytes, and there's no standard encoding for those bytes, so no way to know what bytes > 0x7f are.
<Rashack>i can understand the confusion when trying to display a filename, decoded using something else than it was encoded with
<Rashack>but why can't guile internally handle the filenames it's being fed
<mark_weaver>the problem is that Guile strings are sequences of characters, not sequences of bytes like in C.
<Rashack>as you say, if the names are just bytes there shouldn't have to be a problem, until i want to display them, unless guile has already tried to decode them
<mark_weaver>so you'd prefer for strings to be raw sequences of bytes, and do the conversion only when we display them?
<mark_weaver>or would you prefer for filenames to be bytevectors instead of strings?
<Rashack>hehe, i'm not sure (and i still don't have a clear picture of what's happening here)
<mark_weaver>I assure you, there is no magic bullet here. it's a messy problem.
<Rashack>but isn't it strange that a ftw can list a file, but not do stat on it?
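A minimal sketch of the fix mark_weaver describes above: call (setlocale LC_ALL "") at the top of the script, before walking the tree. It assumes the environment names a UTF-8 locale (for example through LANG); `list-sizes` is a hypothetical helper name, not something from the conversation.

```scheme
(use-modules (ice-9 ftw))

;; Adopt the locale named by the usual environment variables (LANG, LC_ALL,
;; ...). Assumption: that locale is a UTF-8 one, so file names are converted
;; between bytes and Guile strings as UTF-8.
(setlocale LC_ALL "")

;; Hypothetical helper: collect (file-name . size) pairs for every
;; non-directory file below DIR, using the stat object that
;; file-system-fold hands to each callback.
(define (list-sizes dir)
  (file-system-fold
   (lambda (path st result) #t)                 ; enter?: descend everywhere
   (lambda (path st result)                     ; leaf: record name and size
     (cons (cons path (stat:size st)) result))
   (lambda (path st result) result)             ; down: nothing to do
   (lambda (path st result) result)             ; up: nothing to do
   (lambda (path st result) result)             ; skip: nothing to do
   (lambda (path st errno result)               ; error: warn and carry on
     (format (current-error-port) "warning: ~a: ~a~%" path (strerror errno))
     result)
   '()
   dir))
```

Calling (list-sizes ".") returns a list of pairs. Note that setlocale can signal an error if the environment names a locale that is not installed, so a real script may want to catch that.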
<mark_weaver>if we could all agree to standardize on UTF-8, and that became a de-facto standard encoding for POSIX byte strings, then the problem would be solved.
<mark_weaver>but alas, the CJK countries don't like some aspects of unicode.
<adhoc>probably because they aren't getting heard on the committees that put together UTF8
<adhoc>but as china teaches the vast majority of its kids in pinyin, the roman character set will become more widespread
<adhoc>and making UTF8 the primary encoding method in apps by default will only help that
<adhoc>practically UTF8 will help the most people
<adhoc>options for UTF16 support for folks in non roman character set languages
<adhoc>mark_weaver: thinking in characters rather than bytes, but with the option to look at the stream in bytes as well is probably the best way to go
<mark_weaver>our standard hack for that is to use the ISO-8859-1 (latin-1) encoding, where every byte maps to a character.
<mark_weaver>if you really don't care how the bytes >= 0x80 are interpreted, and you just want to work with bytes as if they are characters, that's one way to do it.
<adhoc>the web has left that idea behind though
<mark_weaver>and if you want to do I/O with bytes, of course we support binary I/O and bytevectors.
<mark_weaver>in theory, we could allow bytevectors to be used as filenames, and make variants of the (relatively few) procedures that *return* filenames to return it as a bytevector.
<mark_weaver>adhoc: can you be more clear about what you're proposing?
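To make the bytes-versus-code-points point concrete, here is a small sketch using (ice-9 iconv) and (rnrs bytevectors). The byte sequence is an assumed example: "la", then U+2019 (a typographic apostrophe) encoded as UTF-8, then "s". The same six bytes decode to different strings depending on the encoding, and latin-1 round-trips any byte sequence unchanged.

```scheme
(use-modules (rnrs bytevectors)
             (ice-9 iconv))

;; Assumed example file name as raw bytes: l, a, U+2019 in UTF-8
;; (#xE2 #x80 #x99), s.
(define raw (u8-list->bytevector '(108 97 #xE2 #x80 #x99 115)))

;; Decoding is a conversion that needs an encoding to be chosen.
(define as-utf8   (bytevector->string raw "UTF-8"))
(define as-latin1 (bytevector->string raw "ISO-8859-1"))

(string-length as-utf8)    ; 4 characters: l, a, U+2019, s
(string-length as-latin1)  ; 6 characters: one per byte

;; The latin-1 hack: every byte maps to exactly one character, so encoding
;; the string again gives back the original bytes.
(define round-trip (string->bytevector as-latin1 "ISO-8859-1"))
(equal? round-trip raw)    ; #t
```

The trade-off is that the latin-1 string is only a carrier for bytes: displaying it shows three unrelated characters where the apostrophe should be.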
<adhoc>i've done a lot of work in perl where we treat everything as UTF8
<adhoc>binary files, web requests, the lot
<adhoc>we oft get windows char set encoded stuff that simply doesn't convert to other things
<adhoc>so we get junk that breaks RSS aggregators (not our code)
<adhoc>so we have to convert everything to UTF8
<adhoc>we have code that pokes through the headers and tries to figure out what's going on
<adhoc>if it isn't UTF8, like it finds other code points, it tries to re-read as the other encoding
<adhoc>sometimes this fails, usually with windows char set stuff in non roman charset languages =/
<mark_weaver>I suppose this is a cultural difference between the Scheme and Perl communities, but we are not so fond of making guesses.
<adhoc>our decisions are usually solving problems in a hurry to fix some gaping hole or daft bug
<adhoc>looking for the code points of other encodings is the right way to solve this though
<adhoc>many apps that generate files assume the windows char set
<adhoc>so don't bother to add the code point into the stream
<adhoc>and you get it a lot in forms submitted to your web app
<adhoc>usually stuff cut'n'pasted from word
<adhoc>its part of our string taint libraries
<zacts>adhoc: you are doing scheme coming from perl?
<zacts>well Perl was my first language that I liked
<zacts>I can help with the transition if you want
<zacts>adhoc: I find I like scheme and clojure a ton lately. Although I still like Perl for what it is, and especially simple UNIXy one liners and other stuff like that. and regex
<nalaginrut>I took a look at Gopher which is used to generate JS with some Go functions, better to have one in Artanis...
<jgrant>nalaginrut: Is that named after that Starcraft character?
<jgrant>Also your site's documentation link doesn't work.
<nalaginrut>jgrant: you're lucky to encounter the problem since I'm tweaking the manual link
<nalaginrut>jgrant: actually, it's named from the web framework "Sinatra"
<nalaginrut>but then I realized it's a character of Starcraft, well, I should use this one
<jgrant>In any case, very cool work. :^)
<jgrant>Might be a fun place for me to play around, if I edge a little more into webdev.
<nalaginrut>the manual is not available at moment, since someone told me use "manual" rather than "manual"
<jgrant>Right now, I'm just buggering with Skribilo for my personal Blog/Site.
<jgrant>So you are condensing the documentation into one file?
<nalaginrut>jgrant: I'm sorry you have to wait a moment, since I haven't released the tarball, I'm doing the stuff for preparing the 0.0.2 release
<nalaginrut>jgrant: but you may download it from git and 'make docs'
<jgrant>nalaginrut: Not a problem, just excited to see you are still working on this. :^)
<nalaginrut>jgrant: yeah, it's just born, so I'm working on it heavily
<jgrant>nalaginrut: Ty. I'm close to going to bed for tonight, but I'll certainly bookmark it. :^)
<adhoc>zacts: regex's seem to be reviled by lispy people in my experience. don't know why. i don't really have the experience in lisp/scheme to find alternatives.
*adhoc will get there yet =)
<adhoc>zacts: yeah, learning scheme after many years of living in the wilderness ;)
<adhoc>zacts: twenty years ago i wrote a lot of assembler in emacs and some elisp =)
*nalaginrut fixed manual link finally...
*nalaginrut hate CVS again...
<zacts>I find regex can be a really useful DSL
<zacts>although they can be somewhat limited, oh let me show you a link though
<zacts>so really you can get more power with the automata kind of way of doing things
<zacts>but at the same time you lose the concise usefulness that is the DSL of regex
<saul>zacts, have you looked at srfi-115?
<mark_weaver>zacts: I second saul's suggestion to look at SRFI-115, which I plan to add to Guile at some point.
also, there's 'irregex', which is quite close to SRFI-115.
*jgrant should be sleeping. \o/
<nalaginrut>I will learn how to write Guix script for packaging it ;-)
<nalaginrut>if we're not going to hold potluck this year, maybe today is (sounds unfair to others huh?)
<nalaginrut>civodul: hah, I'm kidding, of course it's very different ;-D
<zacts>nalaginrut: oh did you get your artanis web server released?
<zacts>oh nice, is it going to be an official gnu project?
<atheia>Oh fantastic! I had no idea you were going through the process too.
<atheia>Feels like there's a nice growth of Guile based GNU pkgs at the mo…
<zacts>^ it seems someone is still maintaining scsh a bit
<zacts>mark_weaver: davexunit: apparently SICM is releasing a new edition next month
<zacts>oops sorry for the long link
<zacts>I wonder if they are going to consider updating SICP in any way
***dsmith-w` is now known as dsmith-work
<zacts>mark_weaver: yeah I know about 2nd edition of SICP, but I wonder about a 3rd sometime in the near future. perhaps, or perhaps not
***dje is now known as xdje
<cluck>yay, new guix release! all hail the guixers \o/
<wingo>civodul: how are the last minute slides coming
<civodul>wingo: they're not really coming yet!