IRC channel logs

***sneek_ is now known as sneek

<mark_weaver>sneek: later tell taylanub type ",help debug" at the REPL.

<sneek>Got it.

<zacts>does anyone know of a good pastebin for make check logs?

<zacts>most of the pastebins disallow that much text.

<zacts>I guess I could use something like gitorious or gitlab

***sneek_ is now known as sneek

***ft_ is now known as ft

<mark_weaver>zacts: can you just grep the log for FAIL or ERROR and show us those lines?

***sneek_ is now known as sneek

<zacts>sure mark_weaver

<zacts>are you still awake?

<wingo>moin

***sneek_ is now known as sneek

<wingo>you know, we don't have a GC_oom_func

<wingo>that's probably a bad thing

<wingo>inline gc allocation in master, whee

<mark_weaver>woohoo!

<mark_weaver>if it's not too hard, I might backport it to 2.0 as well.

<mark_weaver>(notice I've been cherry-picking some of your commits from master to stable-2.0 :)

<wingo>i saw that, it's nice :)

*wingo steps out for a bit

<mark_weaver>Is Boehm GC documented anywhere other than its source code? My web searches for various interfaces (in this case GC_generic_malloc_many) never seem to find any docs.

<mark_weaver>well, it's okay. I can read the code. Nice comments in there anyway :)

<taylanub>What's GC_oom_func ?

<sneek>Welcome back taylanub, you have 1 message.

<sneek>taylanub, mark_weaver says: type ",help debug" at the REPL.

<mark_weaver>taylanub: search for GC_oom_func in gc.h, and you'll find your answer :)

<taylanub>Thanks, I guess Google doesn't crawl source code. :P

<mark_weaver>well, I'm sure they do, but apparently they inhibit those from searches by default.

<taylanub>oh wow, "full" backtraces are neat.

<mark_weaver>davexunit: I'm having second thoughts about MVars, for various reasons, not the least of which that while trying to write its tests I lost confidence that it meets the fairness requirements.

<mark_weaver>but also, I'm thinking it might be better to just implement a full STM.

<mark_weaver>I'm seeing now that MVars are fairly limited in application, and aren't even quite the right tool for the coop-repl-servers. So it might be better to convert coop-repl-servers to use mutexes and condition variables directly.

<mark_weaver>I'm willing to do that work, if you like.

<mark_weaver>the mvar that receives messages from the reader threads should really be a queue anyway, capable of holding more than one message, and mvars don't really allow that, or at least it's not convenient.

<wingo>mark_weaver: yeah i don't think there's much documentation outside of the source code; there is also a README.md but it's not technical

<wingo>the header files are the best but for the inline allocator the internals are also relevant

<mark_weaver>*nod* well, there are nice comments in gc.h and mallocx.c for the functions I've looked for, so I can't complain :)

<mark_weaver>while writing the MVar tests, I found some very strange behavior. If I create 10 threads at the REPL and have them compete to put values into the same MVar, they are served in perfect FIFO order, as I'd expect from looking at the mutex / condition variables implementations in threads.c.

<mark_weaver>however, if I wrap the exact same code within a 'let*', then the most recently created thread seems to be treated unfairly.

<mark_weaver>instead of the 0-9 0-9 0-9 0-9 pattern I get when the commands are typed at the REPL one by one, I end up with a pattern like: 0-9 0-8 0-8 0-8 0-8 0-8 0-8

<mark_weaver>I'm really at a loss.

<mark_weaver>so I'm thinking of just punting on MVars for 2.0.10. and maybe forever :)

<mark_weaver>even though the true problem is more likely in our condition variables implementation, or in posix threads.

<wingo>what do you mean by 0-9 or 0-8 ?

<mark_weaver>well, my test code has 10 threads, numbered 0-9, which repeatedly put their thread number into the MVar. and then, when taking from the MVar, I expect to see (0 1 2 3 4 5 6 7 8 9 0 1 2 3 ...)

<wingo>ah

<mark_weaver>but when I wrap the test in a let*, then instead I tend to get (0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 0 1 2 3 4 5 6 7 8 ...)

<wingo>humm

<wingo>that sounds bad indeed :)

<mark_weaver>the MVars implementation is about as simple and straightforward as it can possibly be. I don't think there's a problem there. I think that something's up with condition variables or pthreads.

<mark_weaver>but, we didn't make many (any?) promises about how our conditional variables should behave. whereas MVars traditionally come with several guarantees.

<mark_weaver>and now I'm not so sure that they act the way they should in all cases.

<mark_weaver>it occurs to me that I did these tests on my homebuilt Loongson system, with glibc from a couple of years ago. maybe I should try on intel, and/or with a newer libc.

<mark_weaver>anyway, an STM seems to me the best solution by far (if one insists on shared mutable memory concurrency).

<mark_weaver>wingo: fyi, the without-threads build of master on x86_64 hit a segfault while compiling to .go: http://hydra.nixos.org/build/8681879/log

<wingo>exciting

<wingo>i guess because of the freelist

<wingo>humm, the trap state? strange

<mark_weaver>not sure. it might have been before the freelist change was picked up.

<wingo>seems to be due to a nixpkgs change, not a guile change

<wingo>at least according to http://hydra.nixos.org/jobset/gnu/guile-master/evals

<mark_weaver>If I'm correctly reading http://hydra.nixos.org/jobset/gnu/guile-master , it looks like the latest commit at that point was "Return unused parts of the stack to the OS".

<mark_weaver>yeah, although I looked through the list of changes, and nothing seemed relevant.

<mark_weaver>it might be a transient failure.

<wingo>we'll see i guess

<wingo>so it seems a scheme version of `length' is about 5 times slower than the "native" one

<wingo>by native i mean scm_ilength

<wingo>both of them with full tortoise and hare stuff

<mark_weaver>that's not bad at all. better than I'd expect, actually :)

<mark_weaver>what do you think?

<wingo>it's ok, yeah

<wingo>scheme does 12ns per node, or so, whereas c does about 3ns per node

<wingo>tested with a 1e7-long chain

<wingo>#include <microbenchmark/caveats.h>

<mark_weaver>heh :)

<mark_weaver>speaking of microbenchmarks, did you do any on the lock-free freelists?

<wingo>nothing conclusive, let me give it another go

<mark_weaver>perhaps more relevantly, I wonder how much it speeds up a real-world program, like say our Scheme compiler.

<mark_weaver>while trying to speed up my SRFI-38 'write' implementation, I hacked in local free-lists, and it turned out to not help much, so I undid it all. (not saying that's the case here).

<mark_weaver>of course, in that case it really was an ugly hack, whereas your lock-free freelist code is perfectly nice.

<wingo>(define (alloc n x) (let lp ((x x) (n n)) (if (zero? n) x (lp (cons 1 1) (1- n)))))

<wingo>with that test case

<wingo>,time (alloc #e1e8 0) takes 3.35s before the change

<wingo>and 3.06 or so after

<wingo>so it's a minor speedup

<wingo>at least locally

<wingo>i could try it with vectors, humm

<mark_weaver>that's a reasonable speedup, without any real downside that i can see.

<wingo>the same thing but with (vector 1 2 3 4 5 6) shows a smaller speedup, as you would expect

<wingo>both because the freelist will contain fewer items and because the percentage of opcodes that do allocation are lower

<mark_weaver>and how long for the empty loop to run that many iterations?

<wingo>probably the former is more of an effect

<wingo>the empty loop is 25-40% of the time

<wingo>so as we optimize that down, the percentage improvement of the inline allocation will go up

<mark_weaver>yeah, the inline allocation will definitely become much more important when we get native compilation.

<wingo>indeed

<wingo>struct allocation doesn't use it yet, fwiw

<mark_weaver>numbers.c should probably use it too, which reminds me that I really need to add some internal interfaces there that accept the thread as an argument, for use by the VM.

<mark_weaver>well, it's a thought at least.

<mark_weaver>not going through the PLT would also be a win.

<wingo>yeah indeed

<wingo>lots of things to do :)

<mark_weaver>indeed. we'll get there, some day :)

<wingo>:)

<mark_weaver>well, I suppose it's like chasing a rainbow when you get down to it.

<wingo>one thing i was poking at was a recursive implementation of map; it's slower than our cons-and-reverse! thing currently

<wingo>(this is with a local lift to the stack size limit, which is indeed an issue we will need to deal with at some point)

<mark_weaver>well, think of the cache.

<mark_weaver>even if we can write recursive code that won't fail for deep recursions, it's still blows the cache to do so.

<wingo>not necessarily

<wingo>well, dunno

<wingo>one problem is that currently we have to have two copies of the list in memory

<mark_weaver>yeah, I'm not sure either. just speaking off the top of my head :)

<wingo>because we don't collect arguments to procedures

<wingo>so if you have (map f l), the original value of l will be protected

<wingo>which is usually cool

<wingo>but if you are making a recursive map implementation that's not what you want :)

<wingo>if the incoming arg isn't used elsewhere you should be able to map in place, pretty much

<wingo>not sure what the solution there is

<wingo>or if it even matters :)

<wingo>but i would like to think about inlining map and for-each, and am still trying to figure out what that would mean and how to do it

<mark_weaver>I don't quite understand what you mean. of course, 'map' can't modify its incoming list argument. that's part of its contract.

<wingo>mark_weaver: if the incoming argument isn't referenced elsewhere, it is garbage after it has been traversed

<mark_weaver>yeah, but it's not even enough to prove that 'l' is not referenced elsewhere. you have to prove it for all of the cdrs also.

<wingo>that's the sort of thing a garbage collector can prove cheaply :)

<wingo>i'm not referring to real map-in-place

<wingo>but rather let's say you are 1000 elements into a list, and you need to gc

<wingo>can you collect the first 1000 links and the original values, or not?

<wingo>if they are on the stack as procedure arguments, the gc won't be able to do so

<mark_weaver>oh, I see what you mean. when you say 'protected', you mean protected from being collected by GC.

<wingo>because the slot allocator keeps them as live values (currently), for better backtraces

<wingo>yes

<mark_weaver>I misunderstood.

<mark_weaver>yeah, that's definitely a lose.

<mark_weaver>not sure if it's important enough to worry about though.

<wingo>well in the case of mapping over a list of 1e7 elements or so, you care whether your heap has to grow by 1e8 or more MB or not

<mark_weaver>hmm, though I wonder if we could make the liveness analysis more precise, and improve not only this case but many others.

<wingo>the analysis is precise

<wingo>it's simply that we "throw away" the liveness data for arguments

<wingo>but it's pretty irritating to see <optimized out> in a backtrace :)

<mark_weaver>true

<mark_weaver>but we could have a debugging mode or something.

<wingo>dunno, modes are an antipattern in many ways...

<mark_weaver>if we could prove that the argument will never again be used.

<mark_weaver>space leaks are also annoying though.

<wingo>a user program will consist of user code interleaved with guile code

<wingo>and you won't want to re-compile guile in debug mode

<mark_weaver>sometimes they can make a program that could in theory work well, work very poorly.. and sometimes the code must be distorted to prevent the leak.

<wingo>yeah, the other thing is that if you capitalize on the live arguments data you can end up with other data re-using the procedure argument slots

<wingo>which is fine

<wingo>but with the current code you would get even more confusing backtraces

<wingo>there might be a nice solution to this but it's not yet clear to me

<mark_weaver>yeah, I don't know either :)

<wingo>of course we could have maps indicating which procedure args we can show as original and which are optimized out

<wingo>but gdb's <optimized out> is so vexing :)

<mark_weaver>yeah, gdb has become nearly unusable with -O2 in my experience.

<mark_weaver>at least in many cases.

<wingo>right

<wingo>in all the browser engines they compile with modes

<wingo>release, debugging (which includes debug-only asserts, something i miss sometimes in guile) and debugging without optimization

<wingo>i wonder how long we can put off doing something like that

<wingo>but blah

<mark_weaver>well, things are moving in the right direction, anyway. guile is getting more awesome every year :)

<mark_weaver>I have to go afk for a while. happy hacking!

<wingo>ciao :)

***linas_ is now known as linas

<mark_weaver>what's wrong with cons-and-reverse!, anyway?

<mark_weaver>other than wanting to avoid mutation for the sake of purity, I don't see a problem.

<mark_weaver>but it's an internal detail.

<mark_weaver>well, I guess if the 'proc' invokes its continuation more than once...

<wingo>yes, that is one reason

<wingo>the other reason is that cons-and-reverse! isn't good for directly inlining

<mark_weaver>how much slower is the recursive one?

<wingo>depends on a few things; between 30% and 80% slower

<mark_weaver>one thing: I think that 'map' should be able to work on any length of list, limited only by available memory.

<wingo>yes i agree

<wingo>more or less anyway

<wingo>there are always limits, and bumping up against memory limits is a messy thing always, especially with gc

<mark_weaver>if the stack is limited to less than the available memory, I think that's a problem for recursive 'map'.

<wingo>but those should be the rough limits

<wingo>yes i have a local change to have it be some gigabytes here

<wingo>there is a maximum practical gc heap size and it is a few gb

<wingo>and it's already quite painful at that size

<mark_weaver>I have a worry about one aspect of the direction we're going: I'm worried about assumptions about our current conservative GC spreading.

<mark_weaver>I think we should have an alternative runtime, for programs that don't need much C code, to use a moving GC.

<wingo>it's a good thought, but i don't think that the assumptions are spreading

<wingo>scheme code is getting more precise

<wingo>but writing precise gc maps at every allocation site would be prohibitively costly

<mark_weaver>well, for example our SMOB docs were recently changed to document the mark procedure as being optional.

<wingo>well, if you have a smob probably you don't have moeable heap-allocated data

<wingo>there can be a part of the heap that is moveable and a part that isn't

<wingo>*moveable

<wingo>probably also if you have smobs you're not in the camp that could use a moving gc

<wingo>because you have significant c code

<wingo>(actually, if you restrict it to allocation sites, perhaps precise gc maps isn't that bad; humm)

<mark_weaver>well, maybe that wasn't a good example, but I suspect that our C code will become less precise over time, when it's written/modified without ever being tested on a precise GC system.

<wingo>anyway the compiler knows all that it needs to know about scheme, i think, and on the C side i can't see a useful change coming unless we switch to C++ and have a smarter runtime, dunno

<wingo>without the equivalent of Rooted<T> / Handle<T> it's quite difficult to manage on the c level

<wingo>dunno, perhaps i'm just being contrarian ;)

<wingo>i just ran out of memory, cored, then ran out of disk; good times?

<wingo>> (define l (iota #e1e9))

<wingo>Too many heap sections: Increase MAXHINCR or MAX_HEAP_SECTS

<wingo>^C^C^\\^[[6~

<wingo>Aborted (core dumped)

<mark_weaver>heh :)

<mark_weaver>ah yes, I remember that Boehm GC has some limits that can be extended by some compile-time option.

<mark_weaver>you may be right about the GC thing. maybe the answer is that we should gradually get rid of almost all of our C code.

<wingo>maybe, yep

<wingo>so if i do an (iota #e1e8) i have a 1.6 GB heap, as expected

<wingo>a full gc takes 1.2 seconds, and that with the parallel markers

<wingo>(which does seem to work fine btw; not sure i've run into an issue with it yet)

<wingo>so i guess that's usable in a batch mode

<wingo>if you're working on a big problem

<wingo>but it's about the limit for guile

<wingo>it's also something of a pathological case as there are a bunch of small objects

<wingo>i think that's approaching the practical limit of jvm heaps without tuning or switching to a jvm with a good gc (like C4 or something)

<mark_weaver>I guess this means that the 2^48 heap size limit when using nan-boxing is not really an issue after all.

<wingo>most jvms seem to want users to specify the heap size in advance

<wingo>mark_weaver: heh, indeed :)

<wingo>jvms seem to raise exceptions on OOM

<mark_weaver>maybe we should consider nan-boxing for 2.2.

<wingo>perhaps uncatchable exceptions or something

<wingo>yeah maybe

<wingo>dunno, so many things :)

<mark_weaver>well, I have a patch for master that passed the test suite and seems to work, from a few months ago.

<davexunit>mark_weaver: hey, reading over the log. I've been out at the grocery store.

<mark_weaver>so dusting that patch off would be relatively easy.

<mark_weaver>hey davexunit!

<wingo>neat, i'd like to see that again

<mark_weaver>the memory size limit is the main reason I sort of gave up on the idea, but now I see that it's not an issue.

<mark_weaver>we'd need a way to hook into libgc to ensure that it's allocations were all within the allowed memory range, though. I guess that's probably doable, but I haven't looked.

<mark_weaver>s/it's/its/

<wingo>dunno, i think it gets its memory from mmap usually, and you're really relying on the kernel there

<wingo>i think it would be sufficient to look at what linux, solaris (b/c of their different heap shape), mac os, and windows do

<mark_weaver>well, you can request a specific address from mmap. so I guess it comes down to figuring out which addresses to ask for.

<davexunit>mark_weaver: sorry to hear about mvars not working out. I guess I can follow your advice and use condition variables and mutexes to do the job. and I guess use (ice-9 q) to make a queue of pending ops.

<wingo>or just look at what webkit or firefox does

<wingo>because both of those do nan-boxing

<wingo>yeah but you don't really want to be in the business of choosing mmap addresses; dunno

<wingo>you're competing with other users of mmap, the native conventions, etc...

<mark_weaver>well, you could allocate in large chunks, and in the worst case choose a random address chunk within the allowed range. shouldn't be too long before you find one that's available.

<mark_weaver>address space randomization is probably a good idea for security reasons anyway.

<wingo>i think we should be careful about getting too much into gc, because it's a black hole of distraction

<wingo>(of course it could still be a good idea to nan-box, dunno)

<mark_weaver>if our practical heap limit is around 2^32 anyway, then the 2^48 are will be fairly sparse.

<mark_weaver>s/are/area/

<mark_weaver>(of course there might be some big allocations that aren't in the GC heap, but still)

<wingo> http://stackoverflow.com/questions/214362/java-very-large-heap-sizes

<wingo>i think we should size our expectations out to about 100GB heap sizes

<wingo>which is 2^37 or so i guess

<mark_weaver>yeah, so still that means the 2^48 will be sparse enough that randomly choosing an address should work well in practice.

<mark_weaver>*2^48 area

<mark_weaver>I think we won't be able to avoid thinking about GC for too much longer. when we have native code gen, the inefficiency of our current GC will become increasingly important.

<wingo>sure

<wingo>though i think we should be wary of criticising the current gc

<wingo>the parallel marker and off-main-thread finalizer speeds up things quite a bit

<mark_weaver>oh, it's an excellent GC considering the requirements it's operating under.

<wingo>and being conservative does have some speed advantages, at least for whole-heap operations

<wingo>because it can treat most objects uniformly

<wingo>of course we don't get generational benefits, but we have to think about the worst-case things as well

<wingo>and in that regard things aren't so bad

<wingo>of course it could be much better though, as you say :)

<mark_weaver>davexunit: so, for the mvar that recieves messages from all the reader threads, no condition variable is needed. just a mutex-protected queue.

<davexunit>mark_weaver: okay.

<mark_weaver>(since it's polled anyway)

<davexunit>yeah, makes sense.

<mark_weaver>in the other direction, I guess you need a mutex-protected slot that holds a thunk (or #f), and a condition variable to ignal when the slot gets a new thunk.

<mark_weaver>s/ignal/signal/

<davexunit>sounds simple enough.

<davexunit>and now for something totally unrelated. I was just reading about streams in SICP, and one of the examples is a sieve of eratosthenes. I thought "wow, that's elegant" and then I realized that if you call (stream-ref primes x) with a sufficiently large x, it causes a stack overflow!

<davexunit>and now here I am wondering how to fix that.

*mark_weaver looks at the code

<davexunit>it seems to be nontrivial

<davexunit>mark_weaver: http://paste.lisp.org/display/141110

<mark_weaver>it might help to use SRFI-41 streams, but I'm not sure it would fix the problem.

<mark_weaver>you might try it.

<mark_weaver>since SICP was written, we've learned how to make better promises and better streams than those in SICP.

<mark_weaver>SRFI-45 and SRFI-41 implement the better ones.

<davexunit>mark_weaver: yeah, I used srfi-41 streams to write the primes stream.

<mark_weaver>also, as ijp has pointed out, technically that algorithm is not the sieve of eratosthenes.

<mark_weaver>the sieve does no divisibility checks at all.

<davexunit>oh. well, the more I know.

<davexunit>oh yeah, that's true. you just need addition to mark the multiples of a prime.

<mark_weaver>right.

<mark_weaver>you just have a bit-vector.

<mark_weaver>and you clear all the multiples of 2, then all the multiples of 3, and iterate like that.

<mark_weaver>(then the multiples of 5, 7, 11, etc)

<mark_weaver>of course, that means you have to decide in advance how high you're going to look.

<davexunit>yeah.

<davexunit>it turns out that ijp has written such a stream, actually.

<davexunit> https://raw2.github.com/ijp/ijputils/master/streams.sls

<davexunit>this code will take some studying to fully understand.

<mark_weaver>yeah, I wrote fast primes code that did something along those lines once.

<mark_weaver>if you really want speed, it's probably fastest to do something super-simple like the bit array, and when you get to the end of that bit array, make a new one from scratch that's some multiple in size of the old one.

<mark_weaver>that would be my guess anyway.

<davexunit>okay, thanks for humoring me with this. :)

<mark_weaver>:)

<mark_weaver>yeah, it's too bad that we must often choose between elegance and practicality.

<mark_weaver>the good news is that in many cases, we don't need scalability or high efficiency, and we can keep things elegant.

<davexunit>yes indeed. :)

<wingo>nice, 50ms pause time on a 5.5 GB heap of bytevectors of size 0 to 1e5 bytes

<wingo>mark_weaver: i suppose larger heaps are certainly possible if you include pointerless data

<wingo>probably the limit is somewhat closer than we think if GC_malloc_atomic is involved

<mark_weaver>we could accommodate pointers to blocks beyond 2^48, via an indirection.

<mark_weaver>as long as anything that a SCM needs to point to directly are within 2^48

<mark_weaver>(actually, we could do a bit better than 2^48 also)

<mark_weaver>actually, in the implementation I have for positive-only addresses (which I think might be a bit simpler/faster), there are 50 usable bits, which includes the 3 tag bits at the bottom.

<mark_weaver>basically I was using just one half of the negative signalling NaN space.

<mark_weaver>I could certainly get another bit or two, at the cost of some added complexity.

<mark_weaver>I have another implementation that supports negative addresses as well, but it probably makes tag checking a bit more expensive, and maybe not worth it.

<wingo>pointerless data doesn't have to fit into an SCM i guess

<wingo>hum

<wingo>scratch that ;)

<wingo>obviously it does, a flonum is such a datum

<mark_weaver>huh?

<wingo>i was thinking that a GC_malloc_atomic might not have to be representable as a SCM

<wingo>but that was incorrect

<mark_weaver>why is it incorrect? because of flonums? but the whole idea is that flonums would be immediates.

<mark_weaver>maybe I'm confused :)

<wingo>ah

<wingo>i was thinking of them as they are

<mark_weaver>also, it's okay to have some pointerless data that pointed to by an SCM.

<wingo>anyway, there are other cases

<wingo>yep, that's true too

<wingo>so a thinko on my part :)

<mark_weaver>it happens. thanks for thinking about it anywa :)

<mark_weaver>the thing is, for the things that davexunit and unknown_lamer are interested in, nan-boxing could mean the difference between their guile programs working acceptably or not.

<mark_weaver>hence, I'm highly motivated to find a solution for them.

<mark_weaver>of course, if we find a better solution later, we can always switch away from nan-boxing in a future version of guile.

<wingo>sure; i just heard that racket has done very well on flonums no via nan-boxing but via unboxing

<wingo>*not via

<mark_weaver>yeah, I view nan-boxing as potentially just a stop-gap measure in the short term.

<mark_weaver>but I suspect that unboxing locals within a procedure won't be enough for what davexunit and unknown_lamer are doing. I think they'd need unboxed values in data structures as well.

<wingo>still could be worth it

<wingo>if you have values in data structures though they can be in f32/f64vects

<mark_weaver>true

*wingo not intending to always be contrarian; genuinely perplexed about the thing

<mark_weaver>although that can be messy too, because then you have to allocate the vectors, which is the allocation we're trying to get away from.

<wingo>if they are in a data structure they are allocated already, no?

<mark_weaver>so that might involve messing up the code to keep reusing the same vectors.

<mark_weaver>possibly, dunno!

<wingo>f64vector-ref is a nice type hint too

<mark_weaver>it definitely involves avoiding functional code in this case, and using imperative code instead.

<wingo>i think that's the basis of racket's thing, is type inference, and uniform vectors help there

<mark_weaver>which seems a shame.

<wingo>true

<wingo>unless (f64vector r g b) of course :)

<mark_weaver>I don't mind you being contrarian. I do that quite a lot too :)

<wingo>it's the same as cons really :)

<wingo>hehe

<mark_weaver>yeah, I guess that's true :)

<mark_weaver>so yeah, maybe unboxing locals would be enough, dunno.

<mark_weaver>racket also has the advantage of a much faster GC, which makes the allocations they have to do less painful.

<wingo>faster for short-lived allocations, yes

<mark_weaver>I suspect that the combination of unboxing locals and a good generational GC would be good enough in practice. not sure that either one alone would be.

<wingo>i'm sure there are cases in which we win too -- there are well-known patterns that are pathological to generational gc

<wingo>ok no more apologies, it seems contrarian is the thing for today ;-)

<wingo>but yes, for flonums you are absolutely correct.

<unknown_lamer>mark_weaver: it's basically impossible to avoid mutating vertex data with opengl

<mark_weaver>fwiw, I agree with you that it can be bad to optimize a certain usage pattern at the expense of less-common usage patterns.

<unknown_lamer>there's too much of it to reallocate on every change

<mark_weaver>and generational GC has that property.

<unknown_lamer>one day, when cpus regain gc acceleration features, perhaps

<mark_weaver>unknown_lamer: yeah, it was thinko on my part. if you want to avoid allocation, we need to use long-lived uniform vectors and mutate them.

<unknown_lamer>if th compiler could optimize (define local (f{32,64}vector-ref ...)) to not heap allocate an intermediate, I think that relieves almost all of the gc pressure

<unknown_lamer>also, it's just a fact of life right now that hard realtime code has to avoid allocating whenever possible

<mark_weaver>it's quite possible that unboxing locals might be good enough.

<unknown_lamer>I wonder how hard it would be to hack boehm gc into supporting hard (wall) realtime with bounded allocation rates

<unknown_lamer>I messed with GC_collect_a_little ... but it can't operate for fewer than about 5-10ms

<mark_weaver>of course, there exist real-time GC algorithms, but they tend to involve things like write barriers that are quite expensive.

<unknown_lamer>when I need it to run for a few *microseconds*

<mark_weaver>real-time GC comes at a high cost, the last time I looked.

***vicenteH` is now known as vicenteH

<unknown_lamer>mark_weaver: yes, unless the cpu provides an interruptable object copying processor

<unknown_lamer>which is what the java machine people did

<mark_weaver>oh sure, if you can change the hardware, then lots of nice things are possible.

<unknown_lamer>one day... x86 is gaining a ton of features for microkernels lately (in the name of virtualization)

<unknown_lamer>and a few other features for dynamic languages (actually... basically lisp machine instructions, e.g. vector-ref with range checking done in parallel, discarding the result if it's out of range and raising an exception... right out of the symbolics op code book!)

<unknown_lamer>so maybe in another five years hardware will finally catch up with 1990's GC techniques ;)

<mark_weaver>I hope so!

<wingo>the hardware seems fine to me; it's just the os's that are subpar, seems to me

<wingo>at least as far as gc goes

<wingo>you want your gc to integrate tightly with page faulting, etc

<unknown_lamer>If only Hurd/L4 had taken off

<mark_weaver>it would be nice to have hardware support for tag bits, at least.

<unknown_lamer>actually, I think Linux might be gaining some support for page fault something or other to userspace

<unknown_lamer>I recall Android needing the ability to detect if a page were live or in swap

<unknown_lamer>or I might just be recalling one of those wacky Linux 2.0 era projects

<wingo>lolswap :)

<mark_weaver>I'm glad to see hardware support for STM on the way.

<mark_weaver>s/STM/TM/

<unknown_lamer>mark_weaver: I think the next intel cpu is also getting tag bit checking?

<unknown_lamer>along with the ranged vector-ref and whatnot

<unknown_lamer>but don't hold me to that, I might be imagining things

<mark_weaver>oh? I didn't know that.

<mark_weaver>well, that would be a welcome development.

<unknown_lamer>oh no, just the ranged vector-ref

<unknown_lamer> https://en.wikipedia.org/wiki/Intel_MPX

<unknown_lamer>to Broadwell+1

<unknown_lamer>ARM has something for checking Tagbits in Thumb2 though

<wingo>that's wacky

<unknown_lamer>128 MB L4 eDRAM cache

<unknown_lamer>more memory on the cpu die than my first few machines had...

<unknown_lamer>no need to worry about optimizing floats on the heap, we'll all have a gig of ram on die soon!

<mark_weaver>My first machine had 64 kbytes :)

<mark_weaver>the thing is, no matter how big and fast the machines get, the expectations will always rise to match them.

<mark_weaver>so I don't think it actually helps us.

<mark_weaver>we will always be compared with the best that other language implementations can do.f

<mark_weaver>if the CPU isn't doing more work each second than the entire world population of humans could do in 10,000 years, to constantly animate images on their screen or whatever, then people think the software is unacceptably primitive. it seems to be the way of things.

<unknown_lamer>yes, everyone will go back to wanting 120Hz refresh rates from graphics... 17ms per frame is hard enough

<unknown_lamer>with 128M of fast L4 though, I think that's fast enough to have really aggressive first generation collection, so it might then start to make more sense to allocate tons of data and use write barriers for mutation to allow fully parallel marking and sweeping

<unknown_lamer>basically: allocation is fast, mutation is really slow

<mark_weaver>yeah, that's a possibility, dunno! I'd be glad if purely-functional programming could be leveraged to make real-time GC fast enough.

<mark_weaver>I haven't researched GC in a few years; it's probably worth taking another look.

<mark_weaver>although I'm sympathetic to wingo's skepticism about generational GC. it does optimize a particular common usage pattern at the expense of less common usage patterns.

<mark_weaver>lots to think about though...

<unknown_lamer>an idea I've had floating around is killing the stack (thus making gen0 collection speed essential)

<unknown_lamer>and then effectively forbidding heap mutation

<unknown_lamer>of course, you'd need a new operating system (Haskell's House L4 implementation perhaps...)

<unknown_lamer>then, ram would be cache for an on-disk persistent graph

<mark_weaver>Andrew Appel has long argued in favor of killing stacks.

<unknown_lamer>The SML people made a good case for it about 20 years ago, SML/NJ is stackless

<mark_weaver>I don't think it's wise to eliminate mutation altogether, though.

<unknown_lamer>call/cc is O(1), a big benefit if you're capturing continuations a lot

<unknown_lamer>mutation is only needed because the gc isn't Sufficiently Smart ;)

<wingo>how do you debug a system without stacks? i don't really understand that

<mark_weaver>well, replace the stack with linked lists of frames, right?

<unknown_lamer>yeah

<unknown_lamer>so you can just trace the pointer to the previous activation frame

<wingo>so you still have a stack but it's on the heap?

<unknown_lamer>yes, and not linearized

<wingo>that's a lose in the normal case

<unknown_lamer>not really, if you eliminate the stack hardware on the cpu and reallocate those transistors for gc acceleration hardware

<unknown_lamer>the SML people only found a single digit percentage speed loss, on MIPS from the early 90s with 64kB or cache

<unknown_lamer>*of

<mark_weaver>I'm sympathize to Appel's position, for the same reason I'm sympathetic to wingo's thoughts on generational GC. both of these argue for a more general mechanism, against optimizing the (currently) common case.

<unknown_lamer>at least the math says GC could be made fast enough

<mark_weaver>it does make it very cheap to capture continuations and invoke continuations.

<unknown_lamer>I sometimes agree with the loper os guy about how much hardware really does suck

<unknown_lamer>half the time non-C languages are fighting the processor

<mark_weaver>the many "common case" optimizations we have, in both hardware and software at all levels, do have a powerful effect on the way we have to write programs.

<mark_weaver>collectively, they tend to force us to keep using the same patterns that have been used in the past.

<mark_weaver>virtually all higher-level programming languages suffer as a result.

<mark_weaver>because C code is the common case that both hardware and OSes optimize for.

<mark_weaver>(and compilers)

<wingo>dunno, i was convinced by ungar and chalmer's work on self that at least cpu-level support for higher-level languages (via register windows or tagged arithmetic) was actually a lose

<wingo>the compilers can and should be better

<mark_weaver>for example, tail calls are a wonderful idea, but they are hard to do efficiently because of all this legacy crap that's fossilized the way we do things.

<mark_weaver>there's no fundamental reason why tail calls have to be inefficient.

<mark_weaver>it's a much better way to do things.

<mark_weaver>but we're effectively forced into compiling code that mostly avoids them.

<wingo>seems to me that the cpu and the hardware are pretty good; the os and the abi could be more helpful but they can often be routed around

<mark_weaver>I disagree.

<mark_weaver>CPUs heavily penalize dynamically-typed languages, for example.

<mark_weaver>and they penalize tail calls.

<mark_weaver>and they aren't the least bit helpful for GC.

<mark_weaver>high-level languages have to be vastly more complex than they should need to be, in order to try to force a round peg into a square hole whereever possible.

<mark_weaver>*high-level language implementations

<mark_weaver>I'm not really blaming anyone. The Intel engineers do amazing work, without a doubt.

<mark_weaver>it's just the nature of things that C was designed to match the processors at the time, and processors ever since have been optimized to match C, and every other language implementation has had to try to conform to the resulting calcified architecture.

<wingo>how do cpus penalize tail calls?

<mark_weaver>well, branch prediction heavily optimizes returns.

<mark_weaver>I think you lose that if you use tail calls instead of a stack-based call/return pattern.

<mark_weaver>if I'm mistaken about that, I'd be very glad.

<mark_weaver>another thing: processors tend to favor truncate/ rather than floor/, simply because C '/' and '%' have truncate semantics, which is frankly a poor choice.

<mark_weaver>so programs tend to use truncate/ just because it's faster, and it's a self-reinforcing thing.

<wingo>afaik the relationship of branch predictions and return is the return branch buffer

<mark_weaver>there are lots of things like this. probably most of them I don't even see, because like the air, I've lived in this environment for so long that it becomes invisible to me.

<wingo>which is not the same thing used in general branches

<mark_weaver>right.

<wingo>and works very well for tail calls, because a tail call by definition does not return

<wingo>so the rbb is only used by the procedure that does the "ret"

<wingo>works fine with tail calls. (of course the abi is a different issue.)

<wingo>the problem is only arranging for multiple return points

<wingo>returning to somewhere that was not pushed on by a "call"

<wingo>but what are you going to do, it's an indirect branch predictor that does its job in the simplest way possible, and in a way that can work well with anything afaics

<wingo>regarding math i plead ignorance and am happy to trust you :)

<mark_weaver>hehe, weren't you a physicist at some point? :)

<wingo>that was my undergraduate education, yes, but most of it has evaporated away ;)

<wingo>ignorance is a condition that can be left but also rejoined :P

<mark_weaver>that's certainly true :)

*tadni wants to play with skribillo, but can't seem to get guile-reader to work. :^P

<mark_weaver>sorry to drop out of this conversation for a while, but I have a 3-year-old wanting my attention now. I'll have to continue this later...

<wingo>tadni: i think civodul is travelling, perhaps try again tomorrow or tuesday

<mark_weaver>wingo: you raise interesting points though. I'll think about how to respond :)

<wingo>mark_weaver: ciao! :)

<tadni>wingo: Kay. On that note, I might just wait till the Guix VM, to have another major release. :^)

<wingo>:)

<wingo>mark_weaver: the reference i was looking for was chapter 8 of hölzle's thesis on self, http://www.selflanguage.org/_static/published/urs-thesis.pdf, titled "hardware's impact on performance"

<wingo>old but very very good, and relevant here; i think we lispers tend to make too many myths about performance and hardware when the reality is mostly that our compilers aren't good enough

<bu^>the hardware basically are bunches of NAND doors; so what you will win with software facilities, you might loose it by a more complex hardware or then make a cpu architecture dedicated to one type of paradigm, but then you end up with the same issue as with x86 and asm/C but with a new language; for the default math behavior are you sure that truncation isn't faster than rounding ?

<bu^>what kind of instruction (or logical operation) would be missing from the cpu ?

<bu^>wingo, did you figure out the issue with libgc ? I think I still need to force GC_MARKERS=1 to compile guile

<wingo>bu^: i am now using libgc from git and it seems to work; https://github.com/ivmai/bdwgc/pull/30

<bu^>nice

<bu^>hope they release an update sometimes

<bu^>aren't there other performant GC available on the market ?

<wingo>none that are an easy-to-use library that interoperates well with c

<davexunit>I've seen a lot of GC talk in the ruby community lately

<davexunit>apparently they've done a lot to improve their GC in the latest releases.

<bu^>my issue with the GC is that you can't fix an upper time limit (maybe i missed something)

<bu^>but for a real time application (like a game) I couldn't use guile because I have 8 between frames

<bu^>but sometimes it took much longer and you could feel small freezes

<davexunit>I'm actually trying to write games using guile :)

<bu^>now I could allow for 5ms every 8ms for GC

<davexunit>I think the situation will improve as time goes on

<bu^>but I don't know how to force this kind of behaviour

<bu^>that the GC must be aware of a timeslice he has to perform its tasks, then just pause it

<bu^>davexunit, what kind of game is it ,

<bu^>?

<davexunit>bu^: nothing in particular. I'm writing a platform for 2d games.

<davexunit>mostly tile based stuff.

<bu^>I would love a 2d fighting game ^^

<bu^>does it support networking or is it for local games ?

<mark_weaver>davexunit: it seems that nan-boxing is back on the table as a possible option for 2.2.

<mark_weaver>bu^: all integer division is rounded in some way, it's just a matter of what direction to round in.

<mark_weaver>truncate/ rounds toward 0. floor/ rounds toward negative infinity.

<mark_weaver>processors tend to support only truncate/. if you want anything else, you have to hack around it by doing truncate/ and then fixing up the result.

<bu^>hum, are you sure of that ? must not be the case anymore

<mark_weaver>but for most purposes floor/ would be a better choice.

<mark_weaver>bu^: do you know of a popular processor architecture that has an integer division/modulo instruction that does floor/ ?

<mark_weaver>if so, which one?

<bu^>I will look

<bu^>what you want is integer divide ?

<mark_weaver>signed integer division and remainder.

<mark_weaver>I actually have spent a fair bit of time reading processor manuals, for intel, mips, arm, and sparc.

<davexunit>bu^: networking would be done with guile's built-in networking libs or any external library that may exist. my library won't concern itself with such things.

<davexunit>mark_weaver: oh, cool. 2.2 will be more efficient in general but this would be particularly nice.

<bu^>mark_weaver, so your point is that it would take the same amount of transistors to perform the other type of rounding rather than the (stupid) default one. Do you have other example where there could be (easy) improvement just by changing the default behaviour of some instructions ?

IRC channel logs

2014-02-02.log