欢迎您访问程序员文章站本站旨在为大家提供分享程序员计算机编程知识!
您现在的位置是: 首页

Regular Expressions

程序员文章站 2022-05-06 11:26:47
...

1.  Pattern matching: Find one of a specified set of strings in text.

 

2.  Regular expression :  a notation to specify a set of strings.

    --  basic operations

Regular Expressions
            
    
    博客分类: Algorithm II -- Princeton 学习笔记 Regular ExpressionDFANFAPattern 

    --  additional operations: can be expressed by basic operations. eg. [A-E]+ is shorthand for (A|B|C|D|E)(A|B|C|D|E)*

Regular Expressions
            
    
    博客分类: Algorithm II -- Princeton 学习笔记 Regular ExpressionDFANFAPattern 
 

3.  Duality between REs and DFAs

    --  RE: Concise way to describe a set of strings.

    --  DFA: Machine to recognize whether a given string is in a given set.

    --  Kleene's theorem.

        -  For any DFA, there exists a RE that describes the same set of strings.

        -  For any RE, there exists a DFA that recognizes the same set of strings.

Regular Expressions
            
    
    博客分类: Algorithm II -- Princeton 学习笔记 Regular ExpressionDFANFAPattern 
    --  Bad news : DFA may have exponential # of states.

 

4.  Regular-expression-matching NFA.

    --  RE enclosed in parentheses.

    --  One state per RE character (start = 0, accept = M).

    --  ε-transition: change state, but don't scan text.

    --  match transition: change state and scan to next text char.

    --  Accept if any sequence of transitions ends in accept state.(after scanning all text characters)

Regular Expressions
            
    
    博客分类: Algorithm II -- Princeton 学习笔记 Regular ExpressionDFANFAPattern 
 

5.  How to determine whether a string is matched by an automaton?

    --  DFA: Deterministic ⇒ easy because exactly one applicable transition.

    --  NFA: Nondeterministic ⇒ can be several applicable transitions; need to select the right one!

 

6.  NFA representation

    --  State names: Integers from 0 to M(number of symbols in RE).

    --  Match-transitions: Keep regular expression in array re[]. Use (re[position] == input character) to verify the match transition.

    --  ε-transitions. Store in a digraph G.

 

7.  NFA simulation

    --  Maintain set of all possible states that NFA could be in after reading in the first i text characters.

    --  perform reachability in ε-transitions diagraph : Run DFS from each source (possible states after match transition), without unmarking vertices. (Runs in time proportional to E + V)

Regular Expressions
            
    
    博客分类: Algorithm II -- Princeton 学习笔记 Regular ExpressionDFANFAPattern 

    --  Implementation

public boolean recognizes(String txt)
{
    Bag<Integer> ps = new Bag<Integer>(); // possible states after ε-transitions 
    DirectedDFS dfs = new DirectedDFS(G, 0); // G is ε-transitions diagraph
    
    for (int v = 0; v < G.V(); v++) //states reachable from start by ε-transitions
        if (dfs.marked(v)) pc.add(v);

    for (int i = 0; i < txt.length(); i++)
    {
        Bag<Integer> match = new Bag<Integer>(); //possible states after match transition
        for (int v : pc)
        {
            if (v == M) continue;
            if ((re[v] == txt.charAt(i)) || re[v] == '.') // match transition
                match.add(v+1);
        }
        dfs = new DirectedDFS(G, match); // DFS search from set of vertices
        ps = new Bag<Integer>();
        for (int v = 0; v < G.V(); v++) //follow ε-transitions
            if (dfs.marked(v)) pc.add(v);
    }
    for (int v : pc)
        if (v == M) return true;
    return false;
}

    --  Determining whether an N-character text is recognized by the NFA corresponding to an M-character pattern takes time proportional to MN in the worst case.

        Pf. For each of the N text characters, we iterate through a set of states of size no more than M and run DFS on the graph of ε-transitions. [The NFA construction we will consider ensures the number of edges ≤ 3M.]

 

8.   Building an NFA corresponding to an RE

    --  States: Include a state for each symbol in the RE, plus an accept state.

    --  Concatenation: Add match-transition edge from state corresponding to characters in the alphabet to next state.

    --  Parentheses: Add ε-transition edge from parentheses to next state.

    --  Closure: Add three ε-transition edges for each * operator.

Regular Expressions
            
    
    博客分类: Algorithm II -- Princeton 学习笔记 Regular ExpressionDFANFAPattern 
     --  Or:  Add two ε-transition edges for each | operator.
Regular Expressions
            
    
    博客分类: Algorithm II -- Princeton 学习笔记 Regular ExpressionDFANFAPattern 
    --  Implementation

private Digraph buildEpsilonTransitionDigraph() {
    Digraph G = new Digraph(M+1);
    Stack<Integer> ops = new Stack<Integer>();
    for (int i = 0; i < M; i++) {
        int lp = i;
        if (re[i] == '(' || re[i] == '|') ops.push(i); //left parentheses and |,push onto stack
        else if (re[i] == ')') {
            int or = ops.pop();
            if (re[or] == '|') {
                lp = ops.pop();
                G.addEdge(lp, or+1);
                G.addEdge(or, i);
            }
            else lp = or;
        }
        if (i < M-1 && re[i+1] == '*') { //closure,needs 1-character lookahead
            G.addEdge(lp, i+1);
            G.addEdge(i+1, lp);
        }
        if (re[i] == '(' || re[i] == '*' || re[i] == ')')
            G.addEdge(i, i+1);
    }
    return G;
}

     --  Building the NFA corresponding to an M-character RE takes time and space proportional to M.

        Pf. For each of the M characters in the RE, we add at most three ε-transitions and execute at most two stack operations.

 

9.  Generalized regular expression print

    --  Take a RE as a command-line argument and print the lines from standard input having some substring that is matched by the RE.

    --  Implementation:

public class GREP
{
    public static void main(String[] args)
    {
        String re = "(.*" + args[0] + ".*)";
        NFA nfa = new NFA(re);
        while (StdIn.hasNextLine())
        {
                String line = StdIn.readLine();
                if (nfa.recognizes(line))
                        StdOut.println(line);
                }
         }
}

    --  Worst-case for grep (proportional to M N ) is the same as for brute-force substring search.

 

10.  Regular expressions in Java

    --  String.matches(String regex)

//basic RE matching
public class Validate
{
    public static void main(String[] args)
    {
        String regexp = args[0];
        String input = args[1];
        StdOut.println(input.matches(re));
    }
}

    --  java.util.regexp.Pattern and java.util.regexp.Matcher classes:

//Print all substrings of input that match a RE.
public class Harvester
{
    public static void main(String[] args)
    {
        String regexp = args[0];
        In in = new In(args[1]);
        String input = in.readAll();
        Pattern pattern = Pattern.compile(regexp);//compile() creates a
Pattern (NFA) from RE
        Matcher matcher = pattern.matcher(input);//matcher() creates a Matcher (NFA simulator) from NFA and text
        while (matcher.find()) // find() looks for the next match
        {
            //group() returns the substring most recently found by find()
            StdOut.println(matcher.group());
        }
    }
}

 

  • Regular Expressions
            
    
    博客分类: Algorithm II -- Princeton 学习笔记 Regular ExpressionDFANFAPattern 
  • 大小: 37.9 KB
  • Regular Expressions
            
    
    博客分类: Algorithm II -- Princeton 学习笔记 Regular ExpressionDFANFAPattern 
  • 大小: 33.8 KB
  • Regular Expressions
            
    
    博客分类: Algorithm II -- Princeton 学习笔记 Regular ExpressionDFANFAPattern 
  • 大小: 19 KB
  • Regular Expressions
            
    
    博客分类: Algorithm II -- Princeton 学习笔记 Regular ExpressionDFANFAPattern 
  • 大小: 19.2 KB
  • Regular Expressions
            
    
    博客分类: Algorithm II -- Princeton 学习笔记 Regular ExpressionDFANFAPattern 
  • 大小: 100.3 KB
  • Regular Expressions
            
    
    博客分类: Algorithm II -- Princeton 学习笔记 Regular ExpressionDFANFAPattern 
  • 大小: 11.4 KB
  • Regular Expressions
            
    
    博客分类: Algorithm II -- Princeton 学习笔记 Regular ExpressionDFANFAPattern 
  • 大小: 8.2 KB