Common Mistakes about Generational Garbage Collection
Wed 28 November 2018 by Moshe Zadka

(Thanks to Nelson Elhage and Saivickna Raveendran for their feedback on earlier drafts. All mistakes that remain are mine.)
When talking about garbage collection, the notion of "generational collection" comes up. The usual motivation given for generational garbage collection is that "most objects die young". Therefore, we put the objects that survive a collection cycle (and so have demonstrated some staying power) in a separate generation that we scan less often.
This is an optimization only if an object that has survived a collection cycle is less likely to be garbage by the time the next cycle comes around than a newly allocated object is.
In a foundational paper, "Infant Mortality" and Generational Garbage Collection, Dr. Baker laid out an argument that is deceptive in its simplicity.
Dr. Baker asks the question: "Can we model a process where most objects become garbage fast, but generational garbage collection would not improve things?" His answer: of course. This is exactly the probability distribution of radioactive decay.
If we have a "fast decaying element", say with a half-life of one second, then 50% of the element's atoms decay in one second. However, keeping the atoms that "survived a generation" apart from newly created atoms is unhelpful: each remaining atom still decays with probability 50% over the next second.
We can push the probability of "young garbage" as high as we want: a half-life of half a second, a quarter of a second, or a microsecond. However, that will not make generational garbage collection any better than a straightforward mark-and-sweep.
The exponential distribution, which models the lifetime of a radioactive atom, has the property that P(will die in one second) might be high, but P(will die in one second | survived an hour) is exactly the same: the past gives us no information about the future. This is called the "memoryless property" of the exponential distribution. (The Poisson distribution, often mentioned alongside radioactive decay, counts how many atoms decay in an interval; the lifetime of any single atom is exponential.)
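This is easy to check empirically. Below is a quick simulation sketch in Python (not from Baker's paper; the half-life, the five-second survival cutoff, and the sample count are all arbitrary choices for illustration):

```python
import math
import random

random.seed(0)

HALF_LIFE = 1.0                     # seconds
RATE = math.log(2) / HALF_LIFE      # decay rate corresponding to that half-life

# Draw a million exponentially distributed atom lifetimes.
lifetimes = [random.expovariate(RATE) for _ in range(1_000_000)]

# P(decays within one second), for fresh atoms.
fresh = sum(t <= 1.0 for t in lifetimes) / len(lifetimes)

# P(decays within one second | already survived five seconds):
# shift the survivors' clocks back to zero and ask the same question.
remaining = [t - 5.0 for t in lifetimes if t > 5.0]
old = sum(t <= 1.0 for t in remaining) / len(remaining)

print(f"fresh atoms:   {fresh:.3f}")   # ~0.5
print(f"old survivors: {old:.3f}")     # also ~0.5
```

Both estimates hover around 0.5: knowing an atom has already survived tells us nothing about its remaining lifetime, so segregating survivors buys us nothing.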
When talking about generational garbage collection, and especially if we are making theoretical arguments about its helpfulness, we need to make arguments about the distribution, not about the averages. In other words, we need to make an argument that some kinds of objects hang around for a long time, while others tend to die quickly.
One way to model this is "object lifetimes are bimodal": if we model lifetimes as a mix of two Gaussian distributions, one with a small mean and one with a large mean, then the motivation for generational collection is clear: if we tune the cycle length right, most objects that survive the first cycle belong to the long-lived distribution, and will survive for a few more cycles.
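Here is a minimal simulation sketch of that bimodal model in Python. The mixture weights, means, and cycle length below are made-up illustration parameters, not measurements of any real workload:

```python
import random

random.seed(0)

def lifetime():
    # 90% of objects are short-lived (~0.1s), 10% long-lived (~100s):
    # a mixture of two Gaussians, clamped to be non-negative.
    if random.random() < 0.9:
        return max(random.gauss(0.1, 0.02), 0.0)
    return max(random.gauss(100.0, 10.0), 0.0)

lifetimes = [lifetime() for _ in range(1_000_000)]

CYCLE = 1.0  # collection cycle length, in seconds (a tuning choice)

# Probability a fresh object is garbage by the first collection cycle.
die_young = sum(t <= CYCLE for t in lifetimes) / len(lifetimes)

# Probability a survivor of the first cycle is garbage by the second.
survivors = [t for t in lifetimes if t > CYCLE]
die_next = sum(t <= 2 * CYCLE for t in survivors) / len(survivors)

print(f"P(dead by first cycle)             ~ {die_young:.2f}")   # ~0.90
print(f"P(dead by second | survived first) ~ {die_next:.6f}")    # ~0.00
```

With these parameters, survivors of the first cycle essentially never die in the second, so rescanning them every cycle is wasted work. Run the same experiment with the exponential lifetimes above and the two printed probabilities come out equal.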
To summarize: please choose your words carefully. "Young objects are more likely to die" is an accurate motivation; "most objects die young" is not. This goes double if you do understand the subtlety: do not assume the people you are talking to share an accurate model of how garbage collection works.
As an aside, some languages have decided that generational collection is more trouble than it is worth, because the objects that "die young" go through a different allocation path entirely. For example, Go has garbage collection, but it tries to allocate objects on the stack when it can prove at compile time that they do not "escape". Because of that, the "first generation" is effectively collected when the stack frame is popped.
CPython has generational garbage collection, but it also has a "zeroth generation" of sorts: when a function returns, all of its local variables get a "decref", a decrease in reference count. Objects whose reference count drops to zero as a result, which is often quite a few of them, are collected immediately.
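Here is a small demonstration of that "zeroth generation" at work. The class and function names are made up for illustration, and the immediate finalization relies on CPython's reference counting, so other implementations (PyPy, for example) may print the message later:

```python
class Noisy:
    """Prints a message when its last reference goes away."""
    def __del__(self):
        print("collected")

def scope():
    local = Noisy()        # refcount is 1, held by the local variable
    print("returning")     # on return, 'local' is decref'ed to 0

scope()                    # CPython prints "returning", then "collected"...
print("after the call")    # ...before this line runs
```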