Regular Expressions

程序员文章站 2022-05-06 11:26:47

...

1. Pattern matching: Find one of a specified set of strings in text.

2. Regular expression : a notation to specify a set of strings.

-- basic operations

Regular Expressions

博客分类： Algorithm II -- Princeton 学习笔记 Regular ExpressionDFANFAPattern

-- additional operations: can be expressed by basic operations. eg. [A-E]+ is shorthand for (A|B|C|D|E)(A|B|C|D|E)*

Regular Expressions

博客分类： Algorithm II -- Princeton 学习笔记 Regular ExpressionDFANFAPattern

3. Duality between REs and DFAs

-- RE: Concise way to describe a set of strings.

-- DFA: Machine to recognize whether a given string is in a given set.

-- Kleene's theorem.

- For any DFA, there exists a RE that describes the same set of strings.

- For any RE, there exists a DFA that recognizes the same set of strings.

Regular Expressions

博客分类： Algorithm II -- Princeton 学习笔记 Regular ExpressionDFANFAPattern
-- Bad news : DFA may have exponential # of states.

4. Regular-expression-matching NFA.

-- RE enclosed in parentheses.

-- One state per RE character (start = 0, accept = M).

-- ε-transition: change state, but don't scan text.

-- match transition: change state and scan to next text char.

-- Accept if any sequence of transitions ends in accept state.(after scanning all text characters)

Regular Expressions

博客分类： Algorithm II -- Princeton 学习笔记 Regular ExpressionDFANFAPattern

5. How to determine whether a string is matched by an automaton?

-- DFA: Deterministic ⇒ easy because exactly one applicable transition.

-- NFA: Nondeterministic ⇒ can be several applicable transitions; need to select the right one!

6. NFA representation

-- State names: Integers from 0 to M(number of symbols in RE).

-- Match-transitions: Keep regular expression in array re[]. Use (re[position] == input character) to verify the match transition.

-- ε-transitions. Store in a digraph G.

7. NFA simulation

-- Maintain set of all possible states that NFA could be in after reading in the first i text characters.

-- perform reachability in ε-transitions diagraph : Run DFS from each source (possible states after match transition), without unmarking vertices. (Runs in time proportional to E + V)

Regular Expressions

博客分类： Algorithm II -- Princeton 学习笔记 Regular ExpressionDFANFAPattern

-- Implementation

public boolean recognizes(String txt)
{
    Bag<Integer> ps = new Bag<Integer>(); // possible states after ε-transitions 
    DirectedDFS dfs = new DirectedDFS(G, 0); // G is ε-transitions diagraph
    
    for (int v = 0; v < G.V(); v++) //states reachable from start by ε-transitions
        if (dfs.marked(v)) pc.add(v);

    for (int i = 0; i < txt.length(); i++)
    {
        Bag<Integer> match = new Bag<Integer>(); //possible states after match transition
        for (int v : pc)
        {
            if (v == M) continue;
            if ((re[v] == txt.charAt(i)) || re[v] == '.') // match transition
                match.add(v+1);
        }
        dfs = new DirectedDFS(G, match); // DFS search from set of vertices
        ps = new Bag<Integer>();
        for (int v = 0; v < G.V(); v++) //follow ε-transitions
            if (dfs.marked(v)) pc.add(v);
    }
    for (int v : pc)
        if (v == M) return true;
    return false;
}

-- Determining whether an N-character text is recognized by the NFA corresponding to an M-character pattern takes time proportional to MN in the worst case.

Pf. For each of the N text characters, we iterate through a set of states of size no more than M and run DFS on the graph of ε-transitions. [The NFA construction we will consider ensures the number of edges ≤ 3M.]

8. Building an NFA corresponding to an RE

-- States: Include a state for each symbol in the RE, plus an accept state.

-- Concatenation: Add match-transition edge from state corresponding to characters in the alphabet to next state.

-- Parentheses: Add ε-transition edge from parentheses to next state.

-- Closure: Add three ε-transition edges for each * operator.

Regular Expressions

博客分类： Algorithm II -- Princeton 学习笔记 Regular ExpressionDFANFAPattern
-- Or: Add two ε-transition edges for each | operator.

-- Implementation

private Digraph buildEpsilonTransitionDigraph() {
    Digraph G = new Digraph(M+1);
    Stack<Integer> ops = new Stack<Integer>();
    for (int i = 0; i < M; i++) {
        int lp = i;
        if (re[i] == '(' || re[i] == '|') ops.push(i); //left parentheses and |,push onto stack
        else if (re[i] == ')') {
            int or = ops.pop();
            if (re[or] == '|') {
                lp = ops.pop();
                G.addEdge(lp, or+1);
                G.addEdge(or, i);
            }
            else lp = or;
        }
        if (i < M-1 && re[i+1] == '*') { //closure,needs 1-character lookahead
            G.addEdge(lp, i+1);
            G.addEdge(i+1, lp);
        }
        if (re[i] == '(' || re[i] == '*' || re[i] == ')')
            G.addEdge(i, i+1);
    }
    return G;
}

-- Building the NFA corresponding to an M-character RE takes time and space proportional to M.

Pf. For each of the M characters in the RE, we add at most three ε-transitions and execute at most two stack operations.

9. Generalized regular expression print

-- Take a RE as a command-line argument and print the lines from standard input having some substring that is matched by the RE.

-- Implementation:

public class GREP
{
    public static void main(String[] args)
    {
        String re = "(.*" + args[0] + ".*)";
        NFA nfa = new NFA(re);
        while (StdIn.hasNextLine())
        {
                String line = StdIn.readLine();
                if (nfa.recognizes(line))
                        StdOut.println(line);
                }
         }
}

-- Worst-case for grep (proportional to M N ) is the same as for brute-force substring search.

10. Regular expressions in Java

-- String.matches(String regex)

//basic RE matching
public class Validate
{
    public static void main(String[] args)
    {
        String regexp = args[0];
        String input = args[1];
        StdOut.println(input.matches(re));
    }
}

-- java.util.regexp.Pattern and java.util.regexp.Matcher classes:

//Print all substrings of input that match a RE.
public class Harvester
{
    public static void main(String[] args)
    {
        String regexp = args[0];
        In in = new In(args[1]);
        String input = in.readAll();
        Pattern pattern = Pattern.compile(regexp);//compile() creates a
Pattern (NFA) from RE
        Matcher matcher = pattern.matcher(input);//matcher() creates a Matcher (NFA simulator) from NFA and text
        while (matcher.find()) // find() looks for the next match
        {
            //group() returns the substring most recently found by find()
            StdOut.println(matcher.group());
        }
    }
}