Difference between revisions of "Pattern Matching"

From Suhrid.net Wiki
Jump to navigationJump to search
(Created page with " Category:OCPJP")
 
 
(35 intermediate revisions by the same user not shown)
Line 1: Line 1:
 +
== Intro ==
  
 +
* Classes in the java.util.regex package provide regular expressions support.
 +
* The Pattern class is used to store a regex expression - the regex has to be "compiled."
 +
* The Matcher class is used to start the regex engine to perform match operations.
 +
* Basic example
 +
 +
<syntaxhighlight lang="java5">
 +
 +
import java.util.regex.*;
 +
 +
public class RegexTest1 {
 +
 +
public static void main(String[] args) {
 +
 +
Pattern p = Pattern.compile("lazy"); //The pattern to search for
 +
Matcher m = p.matcher("The quick brown fox jumps over the lazy dog"); //The source against which to match the pattern
 +
boolean found = false;
 +
while(m.find()) {
 +
System.out.println("Match found at " + m.start() + "," + m.end()); //Will print : Match found at 35,39
 +
found = true;
 +
}
 +
 +
if(!found) {
 +
System.out.println("No match found");
 +
}
 +
}
 +
 +
}
 +
 +
</syntaxhighlight>
 +
 +
* Thumb rule: Regex matching runs from left to right and once a source character has been consumed, it cannot be reused.
 +
* In the below example, it will match the pattern "aba" starting at 0 and 4, but not at 2 since they are consumed during the match starting from 0.
 +
 +
<syntaxhighlight lang="java5">
 +
 +
import java.util.regex.*;
 +
 +
public class RegexTest2 {
 +
 +
public static void main(String[] args) {
 +
 +
Pattern p = Pattern.compile("aba");
 +
Matcher m = p.matcher("abababa");
 +
boolean found = false;
 +
while(m.find()) {
 +
System.out.println("Match found; starting at pos : " + m.start());
 +
found = true;
 +
}
 +
 +
if(!found) {
 +
System.out.println("No match found");
 +
}
 +
}
 +
 +
}
 +
 +
</syntaxhighlight>
 +
 +
== Metacharacters and Character Classes  ==
 +
 +
* Metacharacters are special chars that affect the way a pattern is matched.
 +
* The different metacharacters are : ([{\^-$|]})?*+.
 +
 +
* Dot - "." metacharacter matches any character
 +
 +
* Character classes
 +
* The [] notation is used to define a pattern that represents a set of characters. e.g:
 +
* The search will match any of the chars defined within [] that is the "OR" operator will be used.
 +
** [abc] - Only a's or b's or c's
 +
** [a-f] - Search for a,b,c,d,e,f chars
 +
** [a-fA-F] - a to f OR A to F
 +
** [a-d[m-p]] - a through d OR m through p
 +
** [a-z&&[def]] - d, e, or f (intersection)
 +
** [a-z&&[^bc]] - a through z, except for b and c: [ad-z] (subtraction)
 +
** [a-z&&[^m-p]] - a through z, and not m through p: [a-lq-z] (subtraction)
 +
** [^aeiouAEIOU] - no vowels
 +
 +
* Predefined character classes:
 +
 +
** \d - Matches a digit
 +
** \D - Matches a non-digit equivalent to [^\d]
 +
** \s - Matches a whitespace char ([ \t\n\x0B\f\r])
 +
** \S - Matches a non-whitespace char
 +
** \w - Matches a word char (letters/digits or _)
 +
** \W - Matches a non-word char
 +
 +
* Within [] - metacharacters are treated as ordinary characters, no need to escape them.
 +
 +
<syntaxhighlight lang="java5">
 +
 +
public class RegexTest3 {
 +
 +
public static void main(String[] args) {
 +
 +
Pattern p = Pattern.compile("\\d");
 +
Matcher m = p.matcher("The 15th of August");
 +
boolean found = false;
 +
while(m.find()) {
 +
System.out.println("Match found; starting at pos : " + m.start());
 +
found = true;
 +
}
 +
 +
                // Match found; starting at pos : 4
 +
                // Match found; starting at pos : 5
 +
 +
if(!found) {
 +
System.out.println("No match found");
 +
}
 +
}
 +
 +
}
 +
 +
</syntaxhighlight>
 +
 +
== Boundary Matchers ==
 +
 +
* ^Regex - will attempt to match the regex only at the beginning of the line.
 +
* Regex$ - will attempt to match the regex only at the end of the line.
 +
* Below example, it will match only 123 and not 221. If the ^ is removed, then both 123 and 221 will be matched.
 +
 +
<syntaxhighlight lang="java5">
 +
 +
import java.util.regex.*;
 +
 +
public class RegexTest5 {
 +
 +
public static void main(String[] args) {
 +
 +
Pattern p = Pattern.compile("^(\\d)+");
 +
Matcher m = p.matcher("123 sds sadwvf 221");
 +
 +
while(m.find()) {
 +
System.out.println("Match found; starting at pos : " + m.start() + " , matched content : " + m.group());
 +
//Will print: Match found; starting at pos : 0 , matched content : 123
 +
}
 +
 +
}
 +
 +
}
 +
 +
</syntaxhighlight>
 +
 +
== Logical Operators ==
 +
* R | U, Logical OR. e.g ^[a-z] | \d$  A lowercase letter at the beginning of the line or a digit at the end of the line.
 +
* RU, Logical AND. e.g. [jJ][aA][vV][aA] - any combination of Java in upper/lower case letters. Note within [] - the match is treated as OR.
 +
 +
== Quantifiers ==
 +
 +
* Used to specify the number of occurrences of a search pattern
 +
* The below are all greedy quantifiers.
 +
* <nowiki>*</nowiki> - Zero or more occurrences
 +
* ? - Zero or one occurrence.
 +
* + - One or more occurrence.
 +
 +
 +
* Consider the pattern a? - it means zero or one a's to be searched for.
 +
* In the string banana, the engine will match all the characters in the target, since it is searching for zero or one a's.
 +
* '''Now, where the char is not an a, the empty string is returned in the match.''' - the regex engine inserts empty strings in the input to match the pattern a?.
 +
 +
* Further examples: Match results are in format: startpos, endpos, matched-substring
 +
 +
Target: banana
 +
Pattern: a*
 +
Match:
 +
0,0:
 +
1,2: a
 +
2,2:
 +
3,4: a
 +
4,4:
 +
5,6: a
 +
6,6:
 +
 +
Target: banana
 +
Pattern: a?
 +
Match:
 +
0,0:
 +
1,2: a
 +
2,2:
 +
3,4: a
 +
4,4:
 +
5,6: a
 +
6,6:
 +
 +
Target: banana
 +
Pattern: a+
 +
Match:
 +
1,2: a
 +
3,4: a
 +
5,6: a
 +
 +
(banaana - with double a in the middle)
 +
Notice how the quantifiers are greedy:
 +
 +
Target: banaana
 +
Pattern: a*
 +
Match:
 +
0,0:
 +
1,2: a
 +
2,2:
 +
3,5: aa
 +
5,5:
 +
6,7: a
 +
7,7:
 +
 +
Target: banana
 +
Pattern: a?
 +
Match:
 +
0,0:
 +
1,2: a
 +
2,2:
 +
3,4: a
 +
4,5: a
 +
5,5:
 +
6,7: a
 +
7,7:
 +
 +
Target: banana
 +
Pattern: a+
 +
Match:
 +
1,2: a
 +
3,5: aa
 +
6,7: a
 +
 +
 +
* Example the pattern abc(\d)<nowiki>*</nowiki> will match -
 +
** abc0
 +
** abc13423
 +
** abc - since <nowiki>*</nowiki> means 0 or more digits
 +
** abcdef - for the similar reason as above
 +
 +
* It won't match -
 +
** ab211 (doesnt start with abc)
 +
** abcs (doesnt have a digit after abc)
 +
 +
 +
* Quantifiers by default attach to one character at a time only. (The one immediately preceding the quantifier)
 +
* \d\d+ means a digit followed by zero or one more digit.
 +
* abc+ means a followed by b, followed by c one or more times.
 +
* '''(abc)+ means the group "abc" one or more times.'''
 +
* '''[abc]+ would mean a OR b OR c one or more times.'''
 +
 +
=== Greedy Quantifiers ===
 +
 +
* Greedy quantifiers will try to look at the entire source data while trying to determine a match.
 +
 +
See example below:
 +
 +
<syntaxhighlight lang="java5">
 +
 +
public class Greedy {
 +
 +
public static void main(String[] args) {
 +
String greedyPattern = ".*xx";
 +
String reluctantPattern = ".*?xx";
 +
String source = "yyxxxxyxx";
 +
 +
Pattern gp = Pattern.compile(greedyPattern);
 +
Matcher gm = gp.matcher(source);
 +
 +
while (gm.find()) {
 +
System.out.println("Greedy Match found ! Starts at : " + gm.start()
 +
+ ", Matched portion : " + gm.group());
 +
 +
}
 +
                //Will print:
 +
                //Greedy Match found ! Starts at : 0, Matched portion : yyxxxxyxx
 +
 +
Pattern rp = Pattern.compile(reluctantPattern);
 +
Matcher rm = rp.matcher(source);
 +
 +
while (rm.find()) {
 +
System.out.println("Reluctant Match found ! Starts at : " + rm.start()
 +
+ ", Matched portion : " + rm.group());
 +
 +
}
 +
                //Will print:
 +
                //Reluctant Match found ! Starts at : 0, Matched portion : yyxx
 +
                //Reluctant Match found ! Starts at : 4, Matched portion : xx
 +
                //Reluctant Match found ! Starts at : 6, Matched portion : yxx
 +
}
 +
 +
}
 +
</syntaxhighlight>
 +
 +
== Tokenizing (Scanner) ==
 +
 +
* Tokenizing for small pieces of data can be done by the String.split() method.
 +
* For advanced Tokenizing, using the Scanner class is the best choice.
 +
* The scanner class can accept various forms of input such as files, streams or Strings.
 +
* Tokenizing is done within a loop, so that the process can be exited once any conditions are met.
 +
* Tokens can be converted to their primitive types automatically.
 +
* In example below a scanner tokenizes a string containing integers. The default delimiter of a scanner is a whitespace character.
 +
 +
<syntaxhighlight lang="java5">
 +
 +
public class ScannerTest1 {
 +
 +
private static String source = "M 78 P 85 C 92 E 66 B 88";
 +
 +
public static void main(String[] args) {
 +
 +
List<Integer> scores = new ArrayList<Integer>();
 +
 +
Scanner scanner = new Scanner(source);
 +
 +
while(scanner.hasNext()) {
 +
if(scanner.hasNextInt()) {
 +
int score = scanner.nextInt();
 +
scores.add(score);
 +
} else {
 +
scanner.next();
 +
}
 +
}
 +
 +
Collections.sort(scores);
 +
 +
System.out.println(scores);
 +
}
 +
}
 +
 +
</syntaxhighlight>
 +
 +
 +
* Scanner '''will throw an InputMismatchException''' when it can't convert the next token of the input into the desired type.
 +
* Example: when the scanner reaches the token abc, it will throw an exception.
 +
 +
<syntaxhighlight lang="java5">
 +
 +
Scanner s = new Scanner("123 abc");
 +
while(s.hasNext()) {
 +
int i = s.nextInt();
 +
}
 +
</syntaxhighlight>
 +
 +
* Another example, where a regex is being used as a delimiter to the scanner:
 +
* In the hasNext(Pattern) method attempts to match the '''complete''' token to the Pattern ! Works like Mather's matches() method.
 +
 +
<syntaxhighlight lang="java5">
 +
 +
import java.util.*;
 +
 +
public class ScannerTest2 {
 +
 +
private static String source = "ABC = 322, DEF = 343, GHI = 522, KLM = 747";
 +
 +
public static void main(String[] args) {
 +
 +
Scanner scanner = new Scanner(source);
 +
scanner.useDelimiter(",\\s*");
 +
 +
Map<String, Integer> nameValueMap = new HashMap<String, Integer>();
 +
 +
while(scanner.hasNext()) {
 +
String token = scanner.next();
 +
Scanner lineScanner = new Scanner(token);
 +
lineScanner.useDelimiter("\\s=\\s");
 +
String name = null;
 +
int value = 0;
 +
while(lineScanner.hasNext()) {
 +
if(lineScanner.hasNextInt()) {
 +
value = lineScanner.nextInt();
 +
}  else {
 +
name = lineScanner.next();
 +
}
 +
}
 +
nameValueMap.put(name, value);
 +
}
 +
 +
System.out.println(nameValueMap);
 +
 +
}
 +
}
 +
 +
</syntaxhighlight>
 +
 +
 +
* The scanner has to keep moving forward with a call to one of the next() methods. If next() is not called, the scanner will not move to the next token and will endlessly loop. See below/
 +
* With s.next() commented, the scanner will be stuck at the first token "abc" - not moving forward and will result in an infinite loop !
 +
 +
<syntaxhighlight lang="java5">
 +
 +
import java.util.Scanner;
 +
 +
public class ScanTest {
 +
 +
public static void main(String[] args) {
 +
Scanner s = new Scanner("abc 123 efg 1312");
 +
while(s.hasNext()) {
 +
if(s.hasNextInt()) {
 +
System.out.println("Int : " + s.nextInt());
 +
} else {
 +
System.out.println("No next int");
 +
                                //s.next() //UNCOMMENT for scanner to move ahead
 +
}
 +
}
 +
}
 +
 +
}
 +
 +
</syntaxhighlight>
  
  
 
[[Category:OCPJP]]
 
[[Category:OCPJP]]

Latest revision as of 01:12, 10 September 2011

Intro

  • Classes in the java.util.regex package provide regular expressions support.
  • The Pattern class is used to store a regex expression - the regex has to be "compiled."
  • The Matcher class is used to start the regex engine to perform match operations.
  • Basic example
import java.util.regex.*;

public class RegexTest1 {

	public static void main(String[] args) {
		
		Pattern p = Pattern.compile("lazy"); //The pattern to search for
		Matcher m = p.matcher("The quick brown fox jumps over the lazy dog"); //The source against which to match the pattern
		boolean found = false;
		while(m.find()) {
			System.out.println("Match found at " + m.start() + "," + m.end()); //Will print : Match found at 35,39
			found = true;
		}
		
		if(!found) {
			System.out.println("No match found");
		}
	}

}
  • Thumb rule: Regex matching runs from left to right and once a source character has been consumed, it cannot be reused.
  • In the below example, it will match the pattern "aba" starting at 0 and 4, but not at 2 since they are consumed during the match starting from 0.
import java.util.regex.*;

public class RegexTest2 {

	public static void main(String[] args) {
		
		Pattern p = Pattern.compile("aba");
		Matcher m = p.matcher("abababa");
		boolean found = false;
		while(m.find()) {
			System.out.println("Match found; starting at pos : " + m.start());
			found = true;
		}
		
		if(!found) {
			System.out.println("No match found");
		}
	}

}

Metacharacters and Character Classes

  • Metacharacters are special chars that affect the way a pattern is matched.
  • The different metacharacters are : ([{\^-$|]})?*+.
  • Dot - "." metacharacter matches any character
  • Character classes
  • The [] notation is used to define a pattern that represents a set of characters. e.g:
  • The search will match any of the chars defined within [] that is the "OR" operator will be used.
    • [abc] - Only a's or b's or c's
    • [a-f] - Search for a,b,c,d,e,f chars
    • [a-fA-F] - a to f OR A to F
    • [a-d[m-p]] - a through d OR m through p
    • [a-z&&[def]] - d, e, or f (intersection)
    • [a-z&&[^bc]] - a through z, except for b and c: [ad-z] (subtraction)
    • [a-z&&[^m-p]] - a through z, and not m through p: [a-lq-z] (subtraction)
    • [^aeiouAEIOU] - no vowels
  • Predefined character classes:
    • \d - Matches a digit
    • \D - Matches a non-digit equivalent to [^\d]
    • \s - Matches a whitespace char ([ \t\n\x0B\f\r])
    • \S - Matches a non-whitespace char
    • \w - Matches a word char (letters/digits or _)
    • \W - Matches a non-word char
  • Within [] - metacharacters are treated as ordinary characters, no need to escape them.
public class RegexTest3 {

	public static void main(String[] args) {
		
		Pattern p = Pattern.compile("\\d");
		Matcher m = p.matcher("The 15th of August");
		boolean found = false;
		while(m.find()) {
			System.out.println("Match found; starting at pos : " + m.start());
			found = true;
		}

                // Match found; starting at pos : 4
                // Match found; starting at pos : 5
		
		if(!found) {
			System.out.println("No match found");
		}
	}

}

Boundary Matchers

  • ^Regex - will attempt to match the regex only at the beginning of the line.
  • Regex$ - will attempt to match the regex only at the end of the line.
  • Below example, it will match only 123 and not 221. If the ^ is removed, then both 123 and 221 will be matched.
import java.util.regex.*;

public class RegexTest5 {

	public static void main(String[] args) {
		
		Pattern p = Pattern.compile("^(\\d)+");
		Matcher m = p.matcher("123 sds sadwvf 221");
		
		while(m.find()) {
			System.out.println("Match found; starting at pos : " + m.start() + " , matched content : " + m.group());
			//Will print: Match found; starting at pos : 0 , matched content : 123
		}
		
	}

}

Logical Operators

  • R | U, Logical OR. e.g ^[a-z] | \d$ A lowercase letter at the beginning of the line or a digit at the end of the line.
  • RU, Logical AND. e.g. [jJ][aA][vV][aA] - any combination of Java in upper/lower case letters. Note within [] - the match is treated as OR.

Quantifiers

  • Used to specify the number of occurrences of a search pattern
  • The below are all greedy quantifiers.
  • * - Zero or more occurrences
  •  ? - Zero or one occurrence.
  • + - One or more occurrence.


  • Consider the pattern a? - it means zero or one a's to be searched for.
  • In the string banana, the engine will match all the characters in the target, since it is searching for zero or one a's.
  • Now, where the char is not an a, the empty string is returned in the match. - the regex engine inserts empty strings in the input to match the pattern a?.
  • Further examples: Match results are in format: startpos, endpos, matched-substring

Target: banana Pattern: a* Match: 0,0: 1,2: a 2,2: 3,4: a 4,4: 5,6: a 6,6:

Target: banana Pattern: a? Match: 0,0: 1,2: a 2,2: 3,4: a 4,4: 5,6: a 6,6:

Target: banana Pattern: a+ Match: 1,2: a 3,4: a 5,6: a

(banaana - with double a in the middle) Notice how the quantifiers are greedy:

Target: banaana Pattern: a* Match: 0,0: 1,2: a 2,2: 3,5: aa 5,5: 6,7: a 7,7:

Target: banana Pattern: a? Match: 0,0: 1,2: a 2,2: 3,4: a 4,5: a 5,5: 6,7: a 7,7:

Target: banana Pattern: a+ Match: 1,2: a 3,5: aa 6,7: a


  • Example the pattern abc(\d)* will match -
    • abc0
    • abc13423
    • abc - since * means 0 or more digits
    • abcdef - for the similar reason as above
  • It won't match -
    • ab211 (doesnt start with abc)
    • abcs (doesnt have a digit after abc)


  • Quantifiers by default attach to one character at a time only. (The one immediately preceding the quantifier)
  • \d\d+ means a digit followed by zero or one more digit.
  • abc+ means a followed by b, followed by c one or more times.
  • (abc)+ means the group "abc" one or more times.
  • [abc]+ would mean a OR b OR c one or more times.

Greedy Quantifiers

  • Greedy quantifiers will try to look at the entire source data while trying to determine a match.

See example below:

public class Greedy {

	public static void main(String[] args) {
		String greedyPattern = ".*xx";
		String reluctantPattern = ".*?xx";
		String source = "yyxxxxyxx";
		
		Pattern gp = Pattern.compile(greedyPattern);
		Matcher gm = gp.matcher(source);

		while (gm.find()) {
			System.out.println("Greedy Match found ! Starts at : " + gm.start()
					+ ", Matched portion : " + gm.group());
			
		}
                //Will print:
                //Greedy Match found ! Starts at : 0, Matched portion : yyxxxxyxx

		Pattern rp = Pattern.compile(reluctantPattern);
		Matcher rm = rp.matcher(source);
		
		while (rm.find()) {
			System.out.println("Reluctant Match found ! Starts at : " + rm.start()
					+ ", Matched portion : " + rm.group());
			
		}
                //Will print:
                //Reluctant Match found ! Starts at : 0, Matched portion : yyxx
                //Reluctant Match found ! Starts at : 4, Matched portion : xx
                //Reluctant Match found ! Starts at : 6, Matched portion : yxx
	}

}

Tokenizing (Scanner)

  • Tokenizing for small pieces of data can be done by the String.split() method.
  • For advanced Tokenizing, using the Scanner class is the best choice.
  • The scanner class can accept various forms of input such as files, streams or Strings.
  • Tokenizing is done within a loop, so that the process can be exited once any conditions are met.
  • Tokens can be converted to their primitive types automatically.
  • In example below a scanner tokenizes a string containing integers. The default delimiter of a scanner is a whitespace character.
public class ScannerTest1 {
	
	private static String source = "M 78 P 85 C 92 E 66 B 88";

	public static void main(String[] args) {
		
		List<Integer> scores = new ArrayList<Integer>(); 
		
		Scanner scanner = new Scanner(source);
		
		while(scanner.hasNext()) {
			if(scanner.hasNextInt()) {
				int score = scanner.nextInt();
				scores.add(score);
			} else {
				scanner.next(); 
			}
		}
		
		Collections.sort(scores);
		
		System.out.println(scores);
	}
}


  • Scanner will throw an InputMismatchException when it can't convert the next token of the input into the desired type.
  • Example: when the scanner reaches the token abc, it will throw an exception.
Scanner s = new Scanner("123 abc");
		while(s.hasNext()) {
			int i = s.nextInt();
}
  • Another example, where a regex is being used as a delimiter to the scanner:
  • In the hasNext(Pattern) method attempts to match the complete token to the Pattern ! Works like Mather's matches() method.
import java.util.*;

public class ScannerTest2 {
	
	private static String source = "ABC = 322, DEF = 343, GHI = 522, KLM = 747"; 

	public static void main(String[] args) {
		
		Scanner scanner = new Scanner(source);
		scanner.useDelimiter(",\\s*");
		
		Map<String, Integer> nameValueMap = new HashMap<String, Integer>();
		
		while(scanner.hasNext()) {
			String token = scanner.next();
			Scanner lineScanner = new Scanner(token);
			lineScanner.useDelimiter("\\s=\\s");
			String name = null;
			int value = 0;
			while(lineScanner.hasNext()) {
				if(lineScanner.hasNextInt()) {
					value = lineScanner.nextInt();
				}  else {
					name = lineScanner.next();
				}
			}
			nameValueMap.put(name, value);
		}
		
		System.out.println(nameValueMap);
		
	}
}


  • The scanner has to keep moving forward with a call to one of the next() methods. If next() is not called, the scanner will not move to the next token and will endlessly loop. See below/
  • With s.next() commented, the scanner will be stuck at the first token "abc" - not moving forward and will result in an infinite loop !
import java.util.Scanner;

public class ScanTest {

	public static void main(String[] args) {
		Scanner s = new Scanner("abc 123 efg 1312");
		while(s.hasNext()) {
			if(s.hasNextInt()) {
				System.out.println("Int : " + s.nextInt());
			} else {
				System.out.println("No next int");
                                //s.next() //UNCOMMENT for scanner to move ahead
			}
		}
	}

}