HOWTO: Speed up string match lookups
When you have a large number of patterns (dozens) and need to find out which one matches a given string, there are a few things you can do to speed up the job.
If the patterns are hard coded, there are of course any number of ways that you can be clever. But often you do not know what the patterns look like beforehand, which is the case when you're trying to match input strings against patterns in GlobalStrings.lua using a formatstring-to-regex utility like BabbleLib's Deformat() function.
The approach below works by building lists of patterns keyed by the words they contain, and then looking at the words in the input string to determine which list(s) to search for matches.
The setup is done in two passes. The first pass counts how often each word occurs across all patterns. The second pass files each pattern under its LEAST commonly used word, so that the lists stay as short as possible.
- Note: The example contains a very simplistic "MyDeformatterFunc()" for converting "%s" to "(.*)". It will not work for locales other than English. Do not use it in the real world, please.
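To give an idea of what a less simplistic deformatter has to deal with, here is a minimal sketch. It is not BabbleLib's Deformat(): the name SketchDeformat is hypothetical, and the sketch assumes that the format strings contain only plain "%s" and "%d" specifiers (no literal "%%", and no positional forms such as "%1$s"). Under those assumptions it escapes the characters that are magic in Lua patterns and anchors the result.

```lua
-- Hypothetical helper for illustration only, not part of any library.
-- Assumes the format string uses nothing but "%s" and "%d".
function SketchDeformat(fmt)
  -- Escape every character that is magic in a Lua pattern.
  -- "%" is escaped too, so "%s" becomes "%%s" and "%d" becomes "%%d".
  local pattern = string.gsub(fmt, "([%^%$%(%)%%%.%[%]%*%+%-%?])", "%%%1")
  -- Turn the (now escaped) format specifiers into captures.
  pattern = string.gsub(pattern, "%%%%s", "(.+)")
  pattern = string.gsub(pattern, "%%%%d", "(%%d+)")
  -- Anchor the pattern so that the whole input string has to match.
  return "^" .. pattern .. "$"
end

-- Usage: the trailing dot is now matched literally.
local _, _, who, target = string.find("Alice hits Bob.", SketchDeformat("%s hits %s."))
print(who, target)
```

The word lists in the example below are built from the format strings, not from the generated patterns, so a deformatter like this could be dropped in without changing the bucketing.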
-- Functions that we want called for different string matches
function RoughPokeFunc(v1,v2) print("RoughPokeFunc "..v1.." "..v2); end
function SoftPokeFunc(v1,v2) print("SoftPokeFunc "..v1.." "..v2); end
function SoftNudgeFunc(v1,v2) print("SoftNudgeFunc "..v1.." "..v2); end
function ChickenFunc(v1,v2) print("ChickenFunc "..v1.." "..v2); end
-- Strings to match mapped to functions that we want called
MatchStrings = {
  ["%s roughly pokes %s"] = RoughPokeFunc,
  ["%s softly pokes %s"] = SoftPokeFunc,
  ["%s softly nudges %s"] = SoftNudgeFunc,
  ["%s gets nudged by %s and runs away screaming"] = ChickenFunc,
}
-- VERY simplistic deformatter function.
-- You probably want a real deformatting library for this.
function MyDeformatterFunc(str)
  return (string.gsub(str, "%%s", "(.*)"));
end
-- First run: count how many occurrences there are of each word
-- (string.gfind is the Lua 5.0 name; Lua 5.1 renamed it to string.gmatch)
WordCounts = {}
for str,func in pairs(MatchStrings) do
  for word in string.gfind(str, "[^ ]+") do
    if(string.find(word, "^%%")) then
      -- ignore format strings
    else
      WordCounts[word] = (WordCounts[word] or 0) + 1;
    end
  end
end
-- Second run: for each string, pick the least common word and place string in that hash bucket
MatchStringsHash = {}
for str,func in pairs(MatchStrings) do
  local bestword, num;
  for word in string.gfind(str, "[^ ]+") do
    if(string.find(word, "^%%")) then
      -- ignore format strings
    else
      if(not num or WordCounts[word] < num) then
        num = WordCounts[word];
        bestword = word;
      end
    end
  end
  assert(bestword); -- every pattern needs at least one literal word
  if(not MatchStringsHash[bestword]) then MatchStringsHash[bestword] = {}; end
  MatchStringsHash[bestword][MyDeformatterFunc(str)] = func;
end
WordCounts = nil; -- now we don't need the counts anymore
-- Dump our MatchStringsHash on-screen so we can see what it looks like!
print "Examining hash buckets"
print "----------------------"
for word,strings in pairs(MatchStringsHash) do
  print(" "..word..":");
  for str,func in pairs(strings) do
    print(" \""..str.."\"");
  end
end
-- Function that scans for matches and calls the resulting function
function ScanForMatch(str)
  local bDone = false;
  local nCompares = 0;
  for word in string.gfind(str, "[^ ]+") do
    -- only search buckets keyed by a word that occurs in the input
    if(MatchStringsHash[word]) then
      for pattern,func in pairs(MatchStringsHash[word]) do
        nCompares = nCompares + 1;
        local success,_,v1,v2,v3,v4 = string.find(str, pattern);
        if(success) then
          func(v1,v2,v3,v4);
          bDone=true;
          break;
        end
      end
    end
    if(bDone) then break; end
  end
  print(" \""..str.."\": "..nCompares.." string.finds actually executed\n");
end
print("");
print("Executing!");
print("----------");
ScanForMatch("Alice roughly pokes Bob");
ScanForMatch("Bob softly pokes Charles");
ScanForMatch("Charles softly nudges Denise");
ScanForMatch("Denise gets nudged by Eve and runs away screaming");
ScanForMatch("This string does not exist");
Running the above produces the following output:
Examining hash buckets
----------------------
roughly:
"(.*) roughly pokes (.*)"
nudges:
"(.*) softly nudges (.*)"
gets:
"(.*) gets nudged by (.*) and runs away screaming"
softly:
"(.*) softly pokes (.*)"
Executing!
----------
RoughPokeFunc Alice Bob
"Alice roughly pokes Bob": 1 string.finds actually executed
SoftPokeFunc Bob Charles
"Bob softly pokes Charles": 1 string.finds actually executed
SoftNudgeFunc Charles Denise
"Charles softly nudges Denise": 2 string.finds actually executed
ChickenFunc Denise Eve
"Denise gets nudged by Eve and runs away screaming": 1 string.finds actually executed
"This string does not exist": 0 string.finds actually executed
Problems with this approach
There is no guarantee as to the order in which the string matches will be attempted, because the order in which Lua traverses the keys of a table is undefined.
For example, assume these two patterns:
1. "%s hits %s."
2. "%s hits %s hard."
Now, given the input string "Alice hits Bob.", only #1 will match, and all is good.
But with the input string "Alice hits Bob hard.", both patterns match, and there is NO guarantee which one will be tried first. You can get #1 with the arguments "Alice" and "Bob hard", or #2 with the arguments "Alice" and "Bob".
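The ambiguity is easy to reproduce. The sketch below first checks both patterns against the second input, and then shows one possible way to make the outcome predictable: keep the candidates in an array and try the longest pattern first. That ordering rule is an assumption made for this illustration, not something the code above does, and "longest" is only a rough stand-in for "most specific". OrderedPatterns and FirstMatch are hypothetical names.

```lua
-- The two patterns as MyDeformatterFunc would produce them.
local p1 = "(.*) hits (.*)."
local p2 = "(.*) hits (.*) hard."
local input = "Alice hits Bob hard."

-- Each pattern matches on its own, but with different captures.
local _, _, a1, b1 = string.find(input, p1)
local _, _, a2, b2 = string.find(input, p2)
print(a1, b1)  -- captures of pattern #1: "Alice" and "Bob hard"
print(a2, b2)  -- captures of pattern #2: "Alice" and "Bob"

-- Assumption: a longer pattern is the more specific one, so try it first.
local OrderedPatterns = { p1, p2 }
table.sort(OrderedPatterns, function(x, y)
  return string.len(x) > string.len(y)
end)

-- Returns the first pattern that matches, followed by its two captures.
-- Returns nothing if no pattern matches.
function FirstMatch(str)
  for i, pattern in ipairs(OrderedPatterns) do
    local found, _, v1, v2 = string.find(str, pattern)
    if found then
      return pattern, v1, v2
    end
  end
end

print(FirstMatch("Alice hits Bob hard."))  -- pattern #2 is tried first and wins
print(FirstMatch("Alice hits Bob."))       -- only pattern #1 matches
```

To use this idea with the hash buckets, each bucket would have to hold an array of pattern/function pairs instead of a table keyed by pattern, since only arrays are traversed in a defined order.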