This first edition was written for Lua 5.0. While still largely relevant for later versions, there are some differences.
Programming in Lua
Part III. The Standard Libraries
Chapter 20. The String Library
The capture mechanism allows a pattern to yank parts of the subject string that match parts of the pattern, for further use. You specify a capture by writing the parts of the pattern that you want to capture between parentheses.
When you specify captures to string.find, it returns the captured values as extra results from the call. A typical use of this facility is to break a string into parts:

    pair = "name = Anna"
    _, _, key, value = string.find(pair, "(%a+)%s*=%s*(%a+)")
    print(key, value)  --> name  Anna

The pattern '%a+' specifies a non-empty sequence of letters; the pattern '%s*' specifies a possibly empty sequence of spaces. So, in the example above, the whole pattern specifies a sequence of letters, followed by a sequence of spaces, followed by `=´, again followed by spaces plus another sequence of letters. Both sequences of letters have their patterns enclosed by parentheses, so that they will be captured if a match occurs.
The find function always returns first the indices where the matching happened (which we store in the dummy variable _ in the previous example) and then the captures made during the pattern matching. Below is a similar example:

    date = "17/7/1990"
    _, _, d, m, y = string.find(date, "(%d+)/(%d+)/(%d+)")
    print(d, m, y)  --> 17  7  1990
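The order of the results can be checked with a small sketch that names the two indices instead of discarding them. The variable names (first, last, day, and so on) are illustrative only and do not come from the text.

```lua
-- string.find returns the match indices first, then one value per capture.
local date = "17/7/1990"
local first, last, d, m, y = string.find(date, "(%d+)/(%d+)/(%d+)")
print(first, last)   --> 1  9
print(d, m, y)       --> 17  7  1990

-- These captures are strings; convert them when numbers are needed.
local day, month, year = tonumber(d), tonumber(m), tonumber(y)
print(day + month)   --> 24

-- When nothing matches, find returns nil, so every variable ends up nil.
local start, _, missing = string.find("no date here", "(%d+)/(%d+)/(%d+)")
print(start, missing)  --> nil  nil
```

Checking the first result against nil before using the captures is a simple way to guard against input that does not match.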
We can also use captures in the pattern itself. In a pattern, an item like '%d', where d is a single digit, matches only a copy of the d-th capture. As a typical use, suppose you want to find, inside a string, a substring enclosed between single or double quotes. You could try a pattern such as '["'].-["']', that is, a quote followed by anything followed by another quote; but you would have problems with strings like "it's all right". To solve that problem, you can capture the first quote and use it to specify the second one:

    s = [[then he said: "it's all right"!]]
    a, b, c, quotedPart = string.find(s, "([\"'])(.-)%1")
    print(quotedPart)   --> it's all right
    print(c)            --> "

The first capture is the quote character itself and the second capture is the contents of the quote (the substring matching the '.-').
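As a rough sketch of how this back-reference behaves with both quote styles, the pattern can be wrapped in a small helper. The function name first_quoted is hypothetical and chosen only for this illustration.

```lua
-- Hypothetical helper: returns the text inside the first quoted substring
-- (single or double quotes) plus the quote character, or nil when none exists.
local function first_quoted(s)
  local _, _, quote, content = string.find(s, "([\"'])(.-)%1")
  return content, quote
end

print(first_quoted([[then he said: "it's all right"!]]))  --> it's all right  "
print(first_quoted([[say 'a "b" c' now]]))                --> a "b" c  '
print(first_quoted("no quotes"))                          --> nil  nil
```

Because '%1' must repeat whatever the first capture matched, an apostrophe inside double quotes (or double quotes inside single quotes) does not end the match early.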
The third use of captured values is in the replacement string of gsub. Like the pattern, the replacement string may contain items like '%d', which are changed to the respective captures when the substitution is made. (By the way, because of those changes, a `%´ in the replacement string must be escaped as "%%".) As an example, the following command duplicates every letter in a string, with a hyphen between the copies:

    print(string.gsub("hello Lua!", "(%a)", "%1-%1"))
      --> h-he-el-ll-lo-o L-Lu-ua-a!  8

This one interchanges adjacent characters:

    print(string.gsub("hello Lua", "(.)(.)", "%2%1"))
      --> ehll ouLa  4

(The trailing number in each output is the second result of gsub, the number of substitutions made.)
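Two further uses of captures in a replacement string, given here only as a sketch: reordering captures, and emitting a literal percent sign with "%%". The sample strings are made up for the illustration.

```lua
-- Swap the two sides of a "key = value" pair using both captures.
-- Assigning to a single variable keeps only gsub's first result.
local swapped = string.gsub("name = Anna", "(%a+)%s*=%s*(%a+)", "%2 = %1")
print(swapped)  --> Anna = name

-- A literal percent sign in the replacement must be written as "%%".
local percent, count = string.gsub("50 apples", "(%d+)", "%1%%")
print(percent, count)  --> 50% apples  1
```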
As a more useful example, let us write a primitive format converter, which gets a string with commands written in a LaTeX style, such as

    \command{some text}

and changes them to a format in XML style,

    <command>some text</command>

For this specification, the following line does the job:

    s = string.gsub(s, "\\(%a+){(.-)}", "<%1>%2</%1>")

For instance, if s is the string

    the \quote{task} is to \em{change} that.

that gsub call will change it to

    the <quote>task</quote> is to <em>change</em> that.

Another useful example is how to trim a string:

    function trim (s)
      return (string.gsub(s, "^%s*(.-)%s*$", "%1"))
    end

Note the judicious use of pattern formats. The two anchors (`^´ and `$´) ensure that we get the whole string. Because the '.-' tries to expand as little as possible, the two patterns '%s*' match all spaces at both extremities. Note also that, because gsub returns two values, we use extra parentheses to discard the extra result (the count).
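The two gsub calls above can be exercised with a short sketch. The wrapper name to_xml is hypothetical; trim is the function from the text, made local here. The last call shows a limitation that follows from the minimal match '.-': nested commands are not handled.

```lua
-- Hypothetical wrapper around the converter line from the text.
local function to_xml(s)
  return (string.gsub(s, "\\(%a+){(.-)}", "<%1>%2</%1>"))
end

local function trim(s)
  return (string.gsub(s, "^%s*(.-)%s*$", "%1"))
end

print(to_xml("the \\quote{task} is to \\em{change} that."))
  --> the <quote>task</quote> is to <em>change</em> that.

-- Inner spaces are kept; only the two ends are stripped.
print("[" .. trim("  hello  Lua \n") .. "]")  --> [hello  Lua]
print("[" .. trim("   ") .. "]")              --> []

-- '.-' stops at the first closing brace, so nesting breaks the conversion.
print(to_xml("\\a{x \\b{y}}"))  --> <a>x \b{y</a>}
```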
The last use of captured values is perhaps the most powerful. We can call string.gsub with a function as its third argument, instead of a replacement string. When invoked this way, string.gsub calls the given function every time it finds a match; the arguments to this function are the captures, while the value that the function returns is used as the replacement string. As a first example, the following function does variable expansion: it substitutes the value of the global variable varname for every occurrence of $varname in a string:

    function expand (s)
      s = string.gsub(s, "$(%w+)", function (n)
            return _G[n]
          end)
      return s
    end

    name = "Lua"; status = "great"
    print(expand("$name is $status, isn't it?"))
      --> Lua is great, isn't it?

If you are not sure whether the given variables have string values, you can apply tostring to their values:

    function expand (s)
      return (string.gsub(s, "$(%w+)", function (n)
                return tostring(_G[n])
              end))
    end

    print(expand("print = $print; a = $a"))
      --> print = function: 0x8050ce0; a = nil
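A variation on this idea, offered only as a sketch: look the names up in an explicit table rather than in the globals, and leave unknown names untouched. The function name expand_with and its vars parameter are assumptions made for this example, not part of the text. The pattern writes the dollar sign as '%$', which matches the same literal character as the unescaped form used above.

```lua
-- Hypothetical variant of expand: values come from the table `vars`.
-- The callback always returns a string, so the result does not depend on
-- how a given Lua version treats a nil return value.
local function expand_with(s, vars)
  return (string.gsub(s, "%$(%w+)", function (n)
    local value = vars[n]
    if value == nil then
      return "$" .. n        -- unknown name: put the original text back
    end
    return tostring(value)
  end))
end

local vars = {name = "Lua", status = "great", version = 5}
print(expand_with("$name is $status, isn't it?", vars))  --> Lua is great, isn't it?
print(expand_with("$name $version has no $foo", vars))   --> Lua 5 has no $foo
```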
A more powerful example uses loadstring to evaluate whole expressions that we write in the text enclosed by square brackets preceded by a dollar sign:

    s = "sin(3) = $[math.sin(3)]; 2^5 = $[2^5]"

    print((string.gsub(s, "$(%b[])", function (x)
             x = "return " .. string.sub(x, 2, -2)
             local f = loadstring(x)
             return f()
           end)))
      --> sin(3) = 0.1411200080598672; 2^5 = 32

The first match is the string "$[math.sin(3)]", whose corresponding capture is "[math.sin(3)]". The call to string.sub removes the brackets from the captured string, so the string loaded for execution will be "return math.sin(3)". The same happens for the match "$[2^5]".
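The example above calls f() without checking whether the chunk compiled. A more defensive sketch follows; the names compile and eval_template are hypothetical. It also assumes that later Lua versions (5.2 onward) no longer provide loadstring and accept a string in load instead, so it picks whichever is available.

```lua
-- Use loadstring where it exists (Lua 5.0/5.1), otherwise load.
local compile = loadstring or load

-- Hypothetical helper: evaluates each $[expr]; on a syntax or runtime
-- error the original text is left in place.
local function eval_template(s)
  return (string.gsub(s, "%$(%b[])", function (x)
    local f = compile("return " .. string.sub(x, 2, -2))
    if not f then return "$" .. x end          -- did not compile
    local ok, value = pcall(f)
    if not ok then return "$" .. x end         -- failed while running
    return tostring(value)
  end))
end

print(eval_template("1+2 = $[1+2]"))          --> 1+2 = 3
print(eval_template('$[("ab"):upper()]'))     --> AB
print(eval_template("broken: $[1+]"))         --> broken: $[1+]
```

Since the bracketed text is executed as Lua code, this technique is only appropriate for input that is trusted.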
Often we use string.gsub only to iterate over a string, without any interest in the resulting string. For instance, we could collect the words of a string into a table with the following code:

    words = {}
    string.gsub(s, "(%a+)", function (w)
      table.insert(words, w)
    end)

If s were the string "hello hi, again!", after that command the words table would be

    {"hello", "hi", "again"}

The string.gfind function offers a simpler way to write that code:

    words = {}
    for w in string.gfind(s, "(%a+)") do
      table.insert(words, w)
    end

The gfind function fits perfectly with the generic for loop. It returns a function that iterates on all occurrences of a pattern in a string.

We can simplify that code a little bit more. When we call gfind with a pattern without any explicit capture, the function will capture the whole pattern. Therefore, we can rewrite the previous example like this:

    words = {}
    for w in string.gfind(s, "%a+") do
      table.insert(words, w)
    end
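Lua 5.1 renamed string.gfind to string.gmatch. The sketch below picks whichever name the running interpreter provides, so it is a version-tolerant restatement of the loop above rather than the book's own code; the local name gmatch is an assumption for the example.

```lua
-- Pick the iterator that exists in the running Lua version.
local gmatch = string.gmatch or string.gfind

local words = {}
for w in gmatch("hello hi, again!", "%a+") do
  table.insert(words, w)
end
print(table.concat(words, " "))  --> hello hi again

-- With several captures, the iterator yields one value per capture.
local pairs_seen = {}
for k, v in gmatch("a=1, b=2", "(%w+)=(%w+)") do
  table.insert(pairs_seen, k .. ":" .. v)
end
print(table.concat(pairs_seen, " "))  --> a:1 b:2
```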
For our next example, we use URL encoding, which is the encoding used by HTTP to send parameters in a URL. This encoding represents special characters (such as `=´, `&´, and `+´) as "%XX", where XX is the hexadecimal representation of the character. Then, it changes spaces to `+´. For instance, it encodes the string "a+b = c" as "a%2Bb+%3D+c". Finally, it writes each parameter name and parameter value with an `=´ in between and joins all name=value pairs with an ampersand in between. For instance, the values

    name = "al"; query = "a+b = c"; q="yes or no"

are encoded as

    name=al&query=a%2Bb+%3D+c&q=yes+or+no

Now, suppose we want to decode this URL and store each value in a table, indexed by its corresponding name. The following function does the basic decoding:
    function unescape (s)
      s = string.gsub(s, "+", " ")
      s = string.gsub(s, "%%(%x%x)", function (h)
            return string.char(tonumber(h, 16))
          end)
      return s
    end

The first statement changes each `+´ in the string to a space. The second gsub matches all two-digit hexadecimal numerals preceded by `%´ and calls an anonymous function. That function converts the hexadecimal numeral into a number (tonumber, with base 16) and returns the corresponding character (string.char). For instance,

    print(unescape("a%2Bb+%3D+c"))  --> a+b = c
To decode the pairs name=value we use gfind. Because neither names nor values can contain `&´ or `=´, we can match them with the pattern '[^&=]+':

    cgi = {}
    function decode (s)
      for name, value in string.gfind(s, "([^&=]+)=([^&=]+)") do
        name = unescape(name)
        value = unescape(value)
        cgi[name] = value
      end
    end

That call to gfind matches all pairs in the form name=value and, for each pair, the iterator returns the corresponding captures (as marked by the parentheses in the pattern) as the values to name and value. The loop body simply calls unescape on both strings and stores the pair in the cgi table.
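The decoding steps can be put together in a self-contained sketch. It departs from the text in three labeled ways: decode returns a fresh table instead of filling the global cgi, the plus sign is written as '%+' in the pattern (which matches the same literal character), and the iterator is taken from string.gmatch when string.gfind is not available.

```lua
local gmatch = string.gmatch or string.gfind

local function unescape(s)
  s = string.gsub(s, "%+", " ")                 -- '+' encodes a space
  s = string.gsub(s, "%%(%x%x)", function (h)   -- %XX encodes one character
    return string.char(tonumber(h, 16))
  end)
  return s
end

-- Variant of decode that returns its result instead of using a global.
local function decode(s)
  local result = {}
  for name, value in gmatch(s, "([^&=]+)=([^&=]+)") do
    result[unescape(name)] = unescape(value)
  end
  return result
end

local t = decode("name=al&query=a%2Bb+%3D+c&q=yes+or+no")
print(t.name)   --> al
print(t.query)  --> a+b = c
print(t.q)      --> yes or no
```

Because both halves of the pattern use '+', a pair with an empty value (such as "k=") does not match and is skipped.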
The corresponding encoding is also easy to write. First, we write the escape function; this function encodes all special characters as a `%´ followed by the character's ASCII code in hexadecimal (the format option "%02X" makes a hexadecimal number with two digits, using 0 for padding), and then changes spaces to `+´:

    function escape (s)
      s = string.gsub(s, "([&=+%c])", function (c)
            return string.format("%%%02X", string.byte(c))
          end)
      s = string.gsub(s, " ", "+")
      return s
    end

The encode function traverses the table to be encoded, building the resulting string:

    function encode (t)
      local s = ""
      for k,v in pairs(t) do
        s = s .. "&" .. escape(k) .. "=" .. escape(v)
      end
      return string.sub(s, 2)     -- remove first `&'
    end

    t = {name = "al", query = "a+b = c", q="yes or no"}
    print(encode(t))  --> q=yes+or+no&query=a%2Bb+%3D+c&name=al
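The order in which pairs visits the keys is not specified, so the output of encode can list the parameters in a different order from one run or Lua version to another. The sketch below sorts the keys to get a reproducible result. The name encode_sorted is hypothetical, and this escape also encodes `%´ itself, which is an addition to the character set used in the text.

```lua
-- Same idea as escape in the text, with '%' added to the set (an assumption
-- of this sketch) so that a literal percent sign survives decoding.
local function escape(s)
  s = string.gsub(s, "([&=+%%%c])", function (c)
    return string.format("%%%02X", string.byte(c))
  end)
  s = string.gsub(s, " ", "+")
  return s
end

-- Hypothetical variant of encode: keys are sorted for a stable order.
local function encode_sorted(t)
  local keys = {}
  for k in pairs(t) do table.insert(keys, k) end
  table.sort(keys)
  local parts = {}
  for _, k in ipairs(keys) do
    table.insert(parts, escape(k) .. "=" .. escape(t[k]))
  end
  return table.concat(parts, "&")   -- no leading '&' to strip
end

local t = {name = "al", query = "a+b = c", q = "yes or no"}
print(encode_sorted(t))  --> name=al&q=yes+or+no&query=a%2Bb+%3D+c
print(escape("100%"))    --> 100%25
```

Collecting the pieces in a table and joining them with table.concat also avoids the repeated string concatenation inside the loop.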
Copyright © 2003–2004 Roberto Ierusalimschy. All rights reserved.