Regular Expressions
1. Pattern matching: Find one of a specified set of strings in text.
2. Regular expression : a notation to specify a set of strings.
-- basic operations
-- additional operations: can be expressed by basic operations. eg. [A-E]+ is shorthand for (A|B|C|D|E)(A|B|C|D|E)*
3. Duality between REs and DFAs
-- RE: Concise way to describe a set of strings.
-- DFA: Machine to recognize whether a given string is in a given set.
-- Kleene's theorem.
- For any DFA, there exists a RE that describes the same set of strings.
- For any RE, there exists a DFA that recognizes the same set of strings.
-- Bad news : DFA may have exponential # of states.
4. Regular-expression-matching NFA.
-- RE enclosed in parentheses.
-- One state per RE character (start = 0, accept = M).
-- ε-transition: change state, but don't scan text.
-- match transition: change state and scan to next text char.
-- Accept if any sequence of transitions ends in accept state.(after scanning all text characters)
5. How to determine whether a string is matched by an automaton?
-- DFA: Deterministic ⇒ easy because exactly one applicable transition.
-- NFA: Nondeterministic ⇒ can be several applicable transitions; need to select the right one!
6. NFA representation
-- State names: Integers from 0 to M(number of symbols in RE).
-- Match-transitions: Keep regular expression in array re[]. Use (re[position] == input character) to verify the match transition.
-- ε-transitions. Store in a digraph G.
7. NFA simulation
-- Maintain set of all possible states that NFA could be in after reading in the first i text characters.
-- perform reachability in ε-transitions diagraph : Run DFS from each source (possible states after match transition), without unmarking vertices. (Runs in time proportional to E + V)
-- Implementation
public boolean recognizes(String txt) { Bag<Integer> ps = new Bag<Integer>(); // possible states after ε-transitions DirectedDFS dfs = new DirectedDFS(G, 0); // G is ε-transitions diagraph for (int v = 0; v < G.V(); v++) //states reachable from start by ε-transitions if (dfs.marked(v)) pc.add(v); for (int i = 0; i < txt.length(); i++) { Bag<Integer> match = new Bag<Integer>(); //possible states after match transition for (int v : pc) { if (v == M) continue; if ((re[v] == txt.charAt(i)) || re[v] == '.') // match transition match.add(v+1); } dfs = new DirectedDFS(G, match); // DFS search from set of vertices ps = new Bag<Integer>(); for (int v = 0; v < G.V(); v++) //follow ε-transitions if (dfs.marked(v)) pc.add(v); } for (int v : pc) if (v == M) return true; return false; }
-- Determining whether an N-character text is recognized by the NFA corresponding to an M-character pattern takes time proportional to MN in the worst case.
Pf. For each of the N text characters, we iterate through a set of states of size no more than M and run DFS on the graph of ε-transitions. [The NFA construction we will consider ensures the number of edges ≤ 3M.]
8. Building an NFA corresponding to an RE
-- States: Include a state for each symbol in the RE, plus an accept state.
-- Concatenation: Add match-transition edge from state corresponding to characters in the alphabet to next state.
-- Parentheses: Add ε-transition edge from parentheses to next state.
-- Closure: Add three ε-transition edges for each * operator.
-- Or: Add two ε-transition edges for each | operator.
-- Implementation
private Digraph buildEpsilonTransitionDigraph() { Digraph G = new Digraph(M+1); Stack<Integer> ops = new Stack<Integer>(); for (int i = 0; i < M; i++) { int lp = i; if (re[i] == '(' || re[i] == '|') ops.push(i); //left parentheses and |,push onto stack else if (re[i] == ')') { int or = ops.pop(); if (re[or] == '|') { lp = ops.pop(); G.addEdge(lp, or+1); G.addEdge(or, i); } else lp = or; } if (i < M-1 && re[i+1] == '*') { //closure,needs 1-character lookahead G.addEdge(lp, i+1); G.addEdge(i+1, lp); } if (re[i] == '(' || re[i] == '*' || re[i] == ')') G.addEdge(i, i+1); } return G; }
-- Building the NFA corresponding to an M-character RE takes time and space proportional to M.
Pf. For each of the M characters in the RE, we add at most three ε-transitions and execute at most two stack operations.
9. Generalized regular expression print
-- Take a RE as a command-line argument and print the lines from standard input having some substring that is matched by the RE.
-- Implementation:
public class GREP { public static void main(String[] args) { String re = "(.*" + args[0] + ".*)"; NFA nfa = new NFA(re); while (StdIn.hasNextLine()) { String line = StdIn.readLine(); if (nfa.recognizes(line)) StdOut.println(line); } } }
-- Worst-case for grep (proportional to M N ) is the same as for brute-force substring search.
10. Regular expressions in Java
-- String.matches(String regex)
//basic RE matching public class Validate { public static void main(String[] args) { String regexp = args[0]; String input = args[1]; StdOut.println(input.matches(re)); } }
-- java.util.regexp.Pattern and java.util.regexp.Matcher classes:
//Print all substrings of input that match a RE. public class Harvester { public static void main(String[] args) { String regexp = args[0]; In in = new In(args[1]); String input = in.readAll(); Pattern pattern = Pattern.compile(regexp);//compile() creates a Pattern (NFA) from RE Matcher matcher = pattern.matcher(input);//matcher() creates a Matcher (NFA simulator) from NFA and text while (matcher.find()) // find() looks for the next match { //group() returns the substring most recently found by find() StdOut.println(matcher.group()); } } }
上一篇: php获取文件扩展名的方法
推荐阅读
-
C# regular expression to validate email
-
AS报错:lambda expressions are not supported at this language level
-
Bootstrap中glyphicons-halflings-regular.woff字体报404错notfound的解决方法
-
Python使用正则表达式(Regular Expression)超详细
-
浅谈正则表达式(Regular Expression)
-
cf550D. Regular Bridge(构造)
-
C# regular expression to validate email
-
python re正则表达式模块(Regular Expression)
-
共享日常收集JS正则表达式(JavaScript regular expression)
-
正则表达式Regular Expression (RegExp)详解