Difference between revisions of "Pattern Matching"
From Suhrid.net Wiki
Jump to navigationJump to search(31 intermediate revisions by the same user not shown) | |||
Line 2: | Line 2: | ||
* Classes in the java.util.regex package provide regular expressions support. | * Classes in the java.util.regex package provide regular expressions support. | ||
+ | * The Pattern class is used to store a regex expression - the regex has to be "compiled." | ||
+ | * The Matcher class is used to start the regex engine to perform match operations. | ||
* Basic example | * Basic example | ||
Line 57: | Line 59: | ||
</syntaxhighlight> | </syntaxhighlight> | ||
− | == Metacharacters == | + | == Metacharacters and Character Classes == |
− | * | + | * Metacharacters are special chars that affect the way a pattern is matched. |
− | * \d - Matches a digit | + | * The different metacharacters are : ([{\^-$|]})?*+. |
− | * \s - Matches a whitespace char | + | |
− | * \w - Matches a word char (letters/digits or _) | + | * Dot - "." metacharacter matches any character |
+ | |||
+ | * Character classes | ||
+ | * The [] notation is used to define a pattern that represents a set of characters. e.g: | ||
+ | * The search will match any of the chars defined within [] that is the "OR" operator will be used. | ||
+ | ** [abc] - Only a's or b's or c's | ||
+ | ** [a-f] - Search for a,b,c,d,e,f chars | ||
+ | ** [a-fA-F] - a to f OR A to F | ||
+ | ** [a-d[m-p]] - a through d OR m through p | ||
+ | ** [a-z&&[def]] - d, e, or f (intersection) | ||
+ | ** [a-z&&[^bc]] - a through z, except for b and c: [ad-z] (subtraction) | ||
+ | ** [a-z&&[^m-p]] - a through z, and not m through p: [a-lq-z] (subtraction) | ||
+ | ** [^aeiouAEIOU] - no vowels | ||
+ | |||
+ | * Predefined character classes: | ||
+ | |||
+ | ** \d - Matches a digit | ||
+ | ** \D - Matches a non-digit equivalent to [^\d] | ||
+ | ** \s - Matches a whitespace char ([ \t\n\x0B\f\r]) | ||
+ | ** \S - Matches a non-whitespace char | ||
+ | ** \w - Matches a word char (letters/digits or _) | ||
+ | ** \W - Matches a non-word char | ||
+ | |||
+ | * Within [] - metacharacters are treated as ordinary characters, no need to escape them. | ||
<syntaxhighlight lang="java5"> | <syntaxhighlight lang="java5"> | ||
Line 89: | Line 114: | ||
</syntaxhighlight> | </syntaxhighlight> | ||
+ | |||
+ | == Boundary Matchers == | ||
+ | |||
+ | * ^Regex - will attempt to match the regex only at the beginning of the line. | ||
+ | * Regex$ - will attempt to match the regex only at the end of the line. | ||
+ | * Below example, it will match only 123 and not 221. If the ^ is removed, then both 123 and 221 will be matched. | ||
+ | |||
+ | <syntaxhighlight lang="java5"> | ||
+ | |||
+ | import java.util.regex.*; | ||
+ | |||
+ | public class RegexTest5 { | ||
+ | |||
+ | public static void main(String[] args) { | ||
+ | |||
+ | Pattern p = Pattern.compile("^(\\d)+"); | ||
+ | Matcher m = p.matcher("123 sds sadwvf 221"); | ||
+ | |||
+ | while(m.find()) { | ||
+ | System.out.println("Match found; starting at pos : " + m.start() + " , matched content : " + m.group()); | ||
+ | //Will print: Match found; starting at pos : 0 , matched content : 123 | ||
+ | } | ||
+ | |||
+ | } | ||
+ | |||
+ | } | ||
+ | |||
+ | </syntaxhighlight> | ||
+ | |||
+ | == Logical Operators == | ||
+ | * R | U, Logical OR. e.g ^[a-z] | \d$ A lowercase letter at the beginning of the line or a digit at the end of the line. | ||
+ | * RU, Logical AND. e.g. [jJ][aA][vV][aA] - any combination of Java in upper/lower case letters. Note within [] - the match is treated as OR. | ||
+ | |||
+ | == Quantifiers == | ||
+ | |||
+ | * Used to specify the number of occurrences of a search pattern | ||
+ | * The below are all greedy quantifiers. | ||
+ | * <nowiki>*</nowiki> - Zero or more occurrences | ||
+ | * ? - Zero or one occurrence. | ||
+ | * + - One or more occurrence. | ||
+ | |||
+ | |||
+ | * Consider the pattern a? - it means zero or one a's to be searched for. | ||
+ | * In the string banana, the engine will match all the characters in the target, since it is searching for zero or one a's. | ||
+ | * '''Now, where the char is not an a, the empty string is returned in the match.''' - the regex engine inserts empty strings in the input to match the pattern a?. | ||
+ | |||
+ | * Further examples: Match results are in format: startpos, endpos, matched-substring | ||
+ | |||
+ | Target: banana | ||
+ | Pattern: a* | ||
+ | Match: | ||
+ | 0,0: | ||
+ | 1,2: a | ||
+ | 2,2: | ||
+ | 3,4: a | ||
+ | 4,4: | ||
+ | 5,6: a | ||
+ | 6,6: | ||
+ | |||
+ | Target: banana | ||
+ | Pattern: a? | ||
+ | Match: | ||
+ | 0,0: | ||
+ | 1,2: a | ||
+ | 2,2: | ||
+ | 3,4: a | ||
+ | 4,4: | ||
+ | 5,6: a | ||
+ | 6,6: | ||
+ | |||
+ | Target: banana | ||
+ | Pattern: a+ | ||
+ | Match: | ||
+ | 1,2: a | ||
+ | 3,4: a | ||
+ | 5,6: a | ||
+ | |||
+ | (banaana - with double a in the middle) | ||
+ | Notice how the quantifiers are greedy: | ||
+ | |||
+ | Target: banaana | ||
+ | Pattern: a* | ||
+ | Match: | ||
+ | 0,0: | ||
+ | 1,2: a | ||
+ | 2,2: | ||
+ | 3,5: aa | ||
+ | 5,5: | ||
+ | 6,7: a | ||
+ | 7,7: | ||
+ | |||
+ | Target: banana | ||
+ | Pattern: a? | ||
+ | Match: | ||
+ | 0,0: | ||
+ | 1,2: a | ||
+ | 2,2: | ||
+ | 3,4: a | ||
+ | 4,5: a | ||
+ | 5,5: | ||
+ | 6,7: a | ||
+ | 7,7: | ||
+ | |||
+ | Target: banana | ||
+ | Pattern: a+ | ||
+ | Match: | ||
+ | 1,2: a | ||
+ | 3,5: aa | ||
+ | 6,7: a | ||
+ | |||
+ | |||
+ | * Example the pattern abc(\d)<nowiki>*</nowiki> will match - | ||
+ | ** abc0 | ||
+ | ** abc13423 | ||
+ | ** abc - since <nowiki>*</nowiki> means 0 or more digits | ||
+ | ** abcdef - for the similar reason as above | ||
+ | |||
+ | * It won't match - | ||
+ | ** ab211 (doesnt start with abc) | ||
+ | ** abcs (doesnt have a digit after abc) | ||
+ | |||
+ | |||
+ | * Quantifiers by default attach to one character at a time only. (The one immediately preceding the quantifier) | ||
+ | * \d\d+ means a digit followed by zero or one more digit. | ||
+ | * abc+ means a followed by b, followed by c one or more times. | ||
+ | * '''(abc)+ means the group "abc" one or more times.''' | ||
+ | * '''[abc]+ would mean a OR b OR c one or more times.''' | ||
+ | |||
+ | === Greedy Quantifiers === | ||
+ | |||
+ | * Greedy quantifiers will try to look at the entire source data while trying to determine a match. | ||
+ | |||
+ | See example below: | ||
+ | |||
+ | <syntaxhighlight lang="java5"> | ||
+ | |||
+ | public class Greedy { | ||
+ | |||
+ | public static void main(String[] args) { | ||
+ | String greedyPattern = ".*xx"; | ||
+ | String reluctantPattern = ".*?xx"; | ||
+ | String source = "yyxxxxyxx"; | ||
+ | |||
+ | Pattern gp = Pattern.compile(greedyPattern); | ||
+ | Matcher gm = gp.matcher(source); | ||
+ | |||
+ | while (gm.find()) { | ||
+ | System.out.println("Greedy Match found ! Starts at : " + gm.start() | ||
+ | + ", Matched portion : " + gm.group()); | ||
+ | |||
+ | } | ||
+ | //Will print: | ||
+ | //Greedy Match found ! Starts at : 0, Matched portion : yyxxxxyxx | ||
+ | |||
+ | Pattern rp = Pattern.compile(reluctantPattern); | ||
+ | Matcher rm = rp.matcher(source); | ||
+ | |||
+ | while (rm.find()) { | ||
+ | System.out.println("Reluctant Match found ! Starts at : " + rm.start() | ||
+ | + ", Matched portion : " + rm.group()); | ||
+ | |||
+ | } | ||
+ | //Will print: | ||
+ | //Reluctant Match found ! Starts at : 0, Matched portion : yyxx | ||
+ | //Reluctant Match found ! Starts at : 4, Matched portion : xx | ||
+ | //Reluctant Match found ! Starts at : 6, Matched portion : yxx | ||
+ | } | ||
+ | |||
+ | } | ||
+ | </syntaxhighlight> | ||
+ | |||
+ | == Tokenizing (Scanner) == | ||
+ | |||
+ | * Tokenizing for small pieces of data can be done by the String.split() method. | ||
+ | * For advanced Tokenizing, using the Scanner class is the best choice. | ||
+ | * The scanner class can accept various forms of input such as files, streams or Strings. | ||
+ | * Tokenizing is done within a loop, so that the process can be exited once any conditions are met. | ||
+ | * Tokens can be converted to their primitive types automatically. | ||
+ | * In example below a scanner tokenizes a string containing integers. The default delimiter of a scanner is a whitespace character. | ||
+ | |||
+ | <syntaxhighlight lang="java5"> | ||
+ | |||
+ | public class ScannerTest1 { | ||
+ | |||
+ | private static String source = "M 78 P 85 C 92 E 66 B 88"; | ||
+ | |||
+ | public static void main(String[] args) { | ||
+ | |||
+ | List<Integer> scores = new ArrayList<Integer>(); | ||
+ | |||
+ | Scanner scanner = new Scanner(source); | ||
+ | |||
+ | while(scanner.hasNext()) { | ||
+ | if(scanner.hasNextInt()) { | ||
+ | int score = scanner.nextInt(); | ||
+ | scores.add(score); | ||
+ | } else { | ||
+ | scanner.next(); | ||
+ | } | ||
+ | } | ||
+ | |||
+ | Collections.sort(scores); | ||
+ | |||
+ | System.out.println(scores); | ||
+ | } | ||
+ | } | ||
+ | |||
+ | </syntaxhighlight> | ||
+ | |||
+ | |||
+ | * Scanner '''will throw an InputMismatchException''' when it can't convert the next token of the input into the desired type. | ||
+ | * Example: when the scanner reaches the token abc, it will throw an exception. | ||
+ | |||
+ | <syntaxhighlight lang="java5"> | ||
+ | |||
+ | Scanner s = new Scanner("123 abc"); | ||
+ | while(s.hasNext()) { | ||
+ | int i = s.nextInt(); | ||
+ | } | ||
+ | </syntaxhighlight> | ||
+ | |||
+ | * Another example, where a regex is being used as a delimiter to the scanner: | ||
+ | * In the hasNext(Pattern) method attempts to match the '''complete''' token to the Pattern ! Works like Mather's matches() method. | ||
+ | |||
+ | <syntaxhighlight lang="java5"> | ||
+ | |||
+ | import java.util.*; | ||
+ | |||
+ | public class ScannerTest2 { | ||
+ | |||
+ | private static String source = "ABC = 322, DEF = 343, GHI = 522, KLM = 747"; | ||
+ | |||
+ | public static void main(String[] args) { | ||
+ | |||
+ | Scanner scanner = new Scanner(source); | ||
+ | scanner.useDelimiter(",\\s*"); | ||
+ | |||
+ | Map<String, Integer> nameValueMap = new HashMap<String, Integer>(); | ||
+ | |||
+ | while(scanner.hasNext()) { | ||
+ | String token = scanner.next(); | ||
+ | Scanner lineScanner = new Scanner(token); | ||
+ | lineScanner.useDelimiter("\\s=\\s"); | ||
+ | String name = null; | ||
+ | int value = 0; | ||
+ | while(lineScanner.hasNext()) { | ||
+ | if(lineScanner.hasNextInt()) { | ||
+ | value = lineScanner.nextInt(); | ||
+ | } else { | ||
+ | name = lineScanner.next(); | ||
+ | } | ||
+ | } | ||
+ | nameValueMap.put(name, value); | ||
+ | } | ||
+ | |||
+ | System.out.println(nameValueMap); | ||
+ | |||
+ | } | ||
+ | } | ||
+ | |||
+ | </syntaxhighlight> | ||
+ | |||
+ | |||
+ | * The scanner has to keep moving forward with a call to one of the next() methods. If next() is not called, the scanner will not move to the next token and will endlessly loop. See below/ | ||
+ | * With s.next() commented, the scanner will be stuck at the first token "abc" - not moving forward and will result in an infinite loop ! | ||
+ | |||
+ | <syntaxhighlight lang="java5"> | ||
+ | |||
+ | import java.util.Scanner; | ||
+ | |||
+ | public class ScanTest { | ||
+ | |||
+ | public static void main(String[] args) { | ||
+ | Scanner s = new Scanner("abc 123 efg 1312"); | ||
+ | while(s.hasNext()) { | ||
+ | if(s.hasNextInt()) { | ||
+ | System.out.println("Int : " + s.nextInt()); | ||
+ | } else { | ||
+ | System.out.println("No next int"); | ||
+ | //s.next() //UNCOMMENT for scanner to move ahead | ||
+ | } | ||
+ | } | ||
+ | } | ||
+ | |||
+ | } | ||
+ | |||
+ | </syntaxhighlight> | ||
+ | |||
[[Category:OCPJP]] | [[Category:OCPJP]] |
Latest revision as of 01:12, 10 September 2011
Contents
Intro
- Classes in the java.util.regex package provide regular expressions support.
- The Pattern class is used to store a regex expression - the regex has to be "compiled."
- The Matcher class is used to start the regex engine to perform match operations.
- Basic example
import java.util.regex.*;
public class RegexTest1 {
public static void main(String[] args) {
Pattern p = Pattern.compile("lazy"); //The pattern to search for
Matcher m = p.matcher("The quick brown fox jumps over the lazy dog"); //The source against which to match the pattern
boolean found = false;
while(m.find()) {
System.out.println("Match found at " + m.start() + "," + m.end()); //Will print : Match found at 35,39
found = true;
}
if(!found) {
System.out.println("No match found");
}
}
}
- Thumb rule: Regex matching runs from left to right and once a source character has been consumed, it cannot be reused.
- In the below example, it will match the pattern "aba" starting at 0 and 4, but not at 2 since they are consumed during the match starting from 0.
import java.util.regex.*;
public class RegexTest2 {
public static void main(String[] args) {
Pattern p = Pattern.compile("aba");
Matcher m = p.matcher("abababa");
boolean found = false;
while(m.find()) {
System.out.println("Match found; starting at pos : " + m.start());
found = true;
}
if(!found) {
System.out.println("No match found");
}
}
}
Metacharacters and Character Classes
- Metacharacters are special chars that affect the way a pattern is matched.
- The different metacharacters are : ([{\^-$|]})?*+.
- Dot - "." metacharacter matches any character
- Character classes
- The [] notation is used to define a pattern that represents a set of characters. e.g:
- The search will match any of the chars defined within [] that is the "OR" operator will be used.
- [abc] - Only a's or b's or c's
- [a-f] - Search for a,b,c,d,e,f chars
- [a-fA-F] - a to f OR A to F
- [a-d[m-p]] - a through d OR m through p
- [a-z&&[def]] - d, e, or f (intersection)
- [a-z&&[^bc]] - a through z, except for b and c: [ad-z] (subtraction)
- [a-z&&[^m-p]] - a through z, and not m through p: [a-lq-z] (subtraction)
- [^aeiouAEIOU] - no vowels
- Predefined character classes:
- \d - Matches a digit
- \D - Matches a non-digit equivalent to [^\d]
- \s - Matches a whitespace char ([ \t\n\x0B\f\r])
- \S - Matches a non-whitespace char
- \w - Matches a word char (letters/digits or _)
- \W - Matches a non-word char
- Within [] - metacharacters are treated as ordinary characters, no need to escape them.
public class RegexTest3 {
public static void main(String[] args) {
Pattern p = Pattern.compile("\\d");
Matcher m = p.matcher("The 15th of August");
boolean found = false;
while(m.find()) {
System.out.println("Match found; starting at pos : " + m.start());
found = true;
}
// Match found; starting at pos : 4
// Match found; starting at pos : 5
if(!found) {
System.out.println("No match found");
}
}
}
Boundary Matchers
- ^Regex - will attempt to match the regex only at the beginning of the line.
- Regex$ - will attempt to match the regex only at the end of the line.
- Below example, it will match only 123 and not 221. If the ^ is removed, then both 123 and 221 will be matched.
import java.util.regex.*;
public class RegexTest5 {
public static void main(String[] args) {
Pattern p = Pattern.compile("^(\\d)+");
Matcher m = p.matcher("123 sds sadwvf 221");
while(m.find()) {
System.out.println("Match found; starting at pos : " + m.start() + " , matched content : " + m.group());
//Will print: Match found; starting at pos : 0 , matched content : 123
}
}
}
Logical Operators
- R | U, Logical OR. e.g ^[a-z] | \d$ A lowercase letter at the beginning of the line or a digit at the end of the line.
- RU, Logical AND. e.g. [jJ][aA][vV][aA] - any combination of Java in upper/lower case letters. Note within [] - the match is treated as OR.
Quantifiers
- Used to specify the number of occurrences of a search pattern
- The below are all greedy quantifiers.
- * - Zero or more occurrences
- ? - Zero or one occurrence.
- + - One or more occurrence.
- Consider the pattern a? - it means zero or one a's to be searched for.
- In the string banana, the engine will match all the characters in the target, since it is searching for zero or one a's.
- Now, where the char is not an a, the empty string is returned in the match. - the regex engine inserts empty strings in the input to match the pattern a?.
- Further examples: Match results are in format: startpos, endpos, matched-substring
Target: banana Pattern: a* Match: 0,0: 1,2: a 2,2: 3,4: a 4,4: 5,6: a 6,6:
Target: banana Pattern: a? Match: 0,0: 1,2: a 2,2: 3,4: a 4,4: 5,6: a 6,6:
Target: banana Pattern: a+ Match: 1,2: a 3,4: a 5,6: a
(banaana - with double a in the middle) Notice how the quantifiers are greedy:
Target: banaana Pattern: a* Match: 0,0: 1,2: a 2,2: 3,5: aa 5,5: 6,7: a 7,7:
Target: banana Pattern: a? Match: 0,0: 1,2: a 2,2: 3,4: a 4,5: a 5,5: 6,7: a 7,7:
Target: banana Pattern: a+ Match: 1,2: a 3,5: aa 6,7: a
- Example the pattern abc(\d)* will match -
- abc0
- abc13423
- abc - since * means 0 or more digits
- abcdef - for the similar reason as above
- It won't match -
- ab211 (doesnt start with abc)
- abcs (doesnt have a digit after abc)
- Quantifiers by default attach to one character at a time only. (The one immediately preceding the quantifier)
- \d\d+ means a digit followed by zero or one more digit.
- abc+ means a followed by b, followed by c one or more times.
- (abc)+ means the group "abc" one or more times.
- [abc]+ would mean a OR b OR c one or more times.
Greedy Quantifiers
- Greedy quantifiers will try to look at the entire source data while trying to determine a match.
See example below:
public class Greedy {
public static void main(String[] args) {
String greedyPattern = ".*xx";
String reluctantPattern = ".*?xx";
String source = "yyxxxxyxx";
Pattern gp = Pattern.compile(greedyPattern);
Matcher gm = gp.matcher(source);
while (gm.find()) {
System.out.println("Greedy Match found ! Starts at : " + gm.start()
+ ", Matched portion : " + gm.group());
}
//Will print:
//Greedy Match found ! Starts at : 0, Matched portion : yyxxxxyxx
Pattern rp = Pattern.compile(reluctantPattern);
Matcher rm = rp.matcher(source);
while (rm.find()) {
System.out.println("Reluctant Match found ! Starts at : " + rm.start()
+ ", Matched portion : " + rm.group());
}
//Will print:
//Reluctant Match found ! Starts at : 0, Matched portion : yyxx
//Reluctant Match found ! Starts at : 4, Matched portion : xx
//Reluctant Match found ! Starts at : 6, Matched portion : yxx
}
}
Tokenizing (Scanner)
- Tokenizing for small pieces of data can be done by the String.split() method.
- For advanced Tokenizing, using the Scanner class is the best choice.
- The scanner class can accept various forms of input such as files, streams or Strings.
- Tokenizing is done within a loop, so that the process can be exited once any conditions are met.
- Tokens can be converted to their primitive types automatically.
- In example below a scanner tokenizes a string containing integers. The default delimiter of a scanner is a whitespace character.
public class ScannerTest1 {
private static String source = "M 78 P 85 C 92 E 66 B 88";
public static void main(String[] args) {
List<Integer> scores = new ArrayList<Integer>();
Scanner scanner = new Scanner(source);
while(scanner.hasNext()) {
if(scanner.hasNextInt()) {
int score = scanner.nextInt();
scores.add(score);
} else {
scanner.next();
}
}
Collections.sort(scores);
System.out.println(scores);
}
}
- Scanner will throw an InputMismatchException when it can't convert the next token of the input into the desired type.
- Example: when the scanner reaches the token abc, it will throw an exception.
Scanner s = new Scanner("123 abc");
while(s.hasNext()) {
int i = s.nextInt();
}
- Another example, where a regex is being used as a delimiter to the scanner:
- In the hasNext(Pattern) method attempts to match the complete token to the Pattern ! Works like Mather's matches() method.
import java.util.*;
public class ScannerTest2 {
private static String source = "ABC = 322, DEF = 343, GHI = 522, KLM = 747";
public static void main(String[] args) {
Scanner scanner = new Scanner(source);
scanner.useDelimiter(",\\s*");
Map<String, Integer> nameValueMap = new HashMap<String, Integer>();
while(scanner.hasNext()) {
String token = scanner.next();
Scanner lineScanner = new Scanner(token);
lineScanner.useDelimiter("\\s=\\s");
String name = null;
int value = 0;
while(lineScanner.hasNext()) {
if(lineScanner.hasNextInt()) {
value = lineScanner.nextInt();
} else {
name = lineScanner.next();
}
}
nameValueMap.put(name, value);
}
System.out.println(nameValueMap);
}
}
- The scanner has to keep moving forward with a call to one of the next() methods. If next() is not called, the scanner will not move to the next token and will endlessly loop. See below/
- With s.next() commented, the scanner will be stuck at the first token "abc" - not moving forward and will result in an infinite loop !
import java.util.Scanner;
public class ScanTest {
public static void main(String[] args) {
Scanner s = new Scanner("abc 123 efg 1312");
while(s.hasNext()) {
if(s.hasNextInt()) {
System.out.println("Int : " + s.nextInt());
} else {
System.out.println("No next int");
//s.next() //UNCOMMENT for scanner to move ahead
}
}
}
}