This first edition was written for Lua 5.0. While still largely relevant for later versions, there are some differences.
The fourth edition targets Lua 5.3 and is available at Amazon and other bookstores.
By buying the book, you also help to support the Lua project.
Programming in Lua | ||
Part II. Tables and Objects Chapter 11. Data Structures |
Suppose you are building a string piecemeal, for instance reading a file line by line. Your typical code would look like this:
-- WARNING: bad code ahead!! local buff = "" for line in io.lines() do buff = buff .. line .. "\n" endDespite its innocent look, that code in Lua can cause a huge performance penalty for large files: For instance, it takes almost a minute to read a 350 KB file. (That is why Lua provides the
io.read("*all")
option,
which reads the whole file in 0.02 seconds.)
Why is that? Lua uses a true garbage-collection algorithm; when it detects that the program is using too much memory, it goes through all its data structures and frees those structures that are not in use anymore (the garbage). Usually this algorithm has a good performance (it is not by chance that Lua is so fast), but the above loop takes the worst of the algorithm.
To understand what happens,
let us assume that we are in the middle of the read loop;
buff
is already a string with 50 KB
and each line has 20 bytes.
When Lua concatenates buff..line.."\n"
,
it creates a new string with 50,020 bytes
and copies 50 KB from buff
into this new string.
That is, for each new line,
Lua moves 50 KB of memory, and growing.
After reading 100 new lines (only 2 KB),
Lua has already moved more than 5 MB of memory.
To make things worse,
after the assignment
buff = buff .. line .. "\n"the old string is now garbage. After two loop cycles, there are two old strings making a total of more than 100 KB of garbage. So, Lua decides, quite correctly, that it is a good time to run its garbage collector and so it frees those 100 KB. The problem is that this will happen every two cycles and so Lua will run its garbage collector two thousand times before reading the whole file. Even with all this work, its memory usage will be approximately three times the file size.
This problem is not peculiar to Lua:
Other languages with true garbage collection,
and where strings are immutable objects,
present a similar behavior,
Java being the most famous example.
(Java offers the structure StringBuffer
to
ameliorate the problem.)
Before we continue, we should remark that,
despite all I said, that situation is not a common problem.
For small strings, the above loop is OK.
To read a whole file, we use the "*all"
option,
which reads it at once.
However, sometimes there are no simple solutions.
Then, the only solution is a more efficient algorithm.
Here we show one.
Our original loop took a linear approach to the problem, concatenating small strings one by one into the accumulator. This new algorithm avoids this, using a binary approach instead. It concatenates several small strings among them and, occasionally, it concatenates the resulting large strings into larger ones. The heart of the algorithm is a stack that keeps the large strings already created in its bottom, while small strings enter through the top. The main invariant of this stack is similar to that of the popular (among programmers, at least) Tower of Hanoi: A string in the stack can never sit over a shorter string. Whenever a new string is pushed over a shorter one, then (and only then) the algorithm concatenates both. This concatenation creates a larger string, which now may be larger than its neighbor in the previous floor. If that happens, they are joined too. Those concatenations go down the stack until the loop reaches a larger string or the stack bottom.
function newStack () return {""} -- starts with an empty string end function addString (stack, s) table.insert(stack, s) -- push 's' into the the stack for i=table.getn(stack)-1, 1, -1 do if string.len(stack[i]) > string.len(stack[i+1]) then break end stack[i] = stack[i] .. table.remove(stack) end endTo get the final contents of the buffer, we just need to concatenate all strings down to the bottom. The
table.concat
function does exactly that:
It concatenates all strings of a list.
Using this new data structure, we can rewrite our program as follows:
local s = newStack() for line in io.lines() do addString(s, line .. "\n") end s = toString(s)This new program reduces our original time to read a 350 KB file from 40 seconds to 0.5 seconds. The call
io.read("*all")
is still faster,
finishing the job in 0.02 seconds.
Actually, when we call io.read("*all")
,
io.read
uses exactly the data structure that we presented here,
but implemented in C.
Several other functions in the Lua
libraries also use this C implementation.
One of these functions is table.concat
.
With concat
, we can simply collect all strings in a table
and then concatenate all of them at once.
Because concat
uses the C implementation,
it is efficient even for huge strings.
The concat
function accepts an optional second argument,
which is a separator to be inserted between the strings.
Using this separator, we do not need to insert a newline after
each line:
local t = {} for line in io.lines() do table.insert(t, line) end s = table.concat(t, "\n") .. "\n"(The
io.lines
iterator returns each line without the newline.)
concat
inserts the separator between the strings,
but not after the last one,
so we have to add the last newline.
This last concatenation duplicates the resulting string,
which can be quite big.
There is no option to make concat
insert this extra separator,
but we can deceive it,
inserting an extra empty string in t
:
table.insert(t, "") s = table.concat(t, "\n")The extra newline that
concat
adds before this empty string
is at the end of the resulting string, as we wanted.
Copyright © 2003–2004 Roberto Ierusalimschy. All rights reserved. |