Difference between revisions of "Pattern Matching"
From Suhrid.net Wiki
Revision as of 03:02, 5 July 2011
Intro
- Classes in the java.util.regex package provide regular expression support.
- The Pattern class holds a compiled regular expression. A regex string has to be compiled with Pattern.compile() before it can be used.
- The Matcher class, obtained from Pattern.matcher(), runs the regex engine against a source string and performs the match operations.
- Basic example:
import java.util.regex.*;
public class RegexTest1 {
    public static void main(String[] args) {
        Pattern p = Pattern.compile("lazy"); // The pattern to search for
        Matcher m = p.matcher("The quick brown fox jumps over the lazy dog"); // The source against which to match the pattern
        boolean found = false;
        while (m.find()) {
            System.out.println("Match found at " + m.start() + "," + m.end()); // Will print: Match found at 35,39
            found = true;
        }
        if (!found) {
            System.out.println("No match found");
        }
    }
}
- Rule of thumb: regex matching runs from left to right, and once a source character has been consumed by a match, it cannot be reused in a later match.
- In the example below, the pattern "aba" matches at positions 0 and 4, but not at 2, because the characters at positions 0 to 2 were already consumed by the match starting at 0.
import java.util.regex.*;
public class RegexTest2 {
    public static void main(String[] args) {
        Pattern p = Pattern.compile("aba");
        Matcher m = p.matcher("abababa");
        boolean found = false;
        while (m.find()) {
            System.out.println("Match found; starting at pos : " + m.start());
            found = true;
        }
        // Match found; starting at pos : 0
        // Match found; starting at pos : 4
        if (!found) {
            System.out.println("No match found");
        }
    }
}
Metacharacters
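- Metacharacters are regex tokens that have a special meaning in a search.
- In a Java string literal the backslash itself has to be escaped, so \d is written as "\\d".
- \d - Matches a digit
- \D - Matches a non-digit; equivalent to [^\d]
- \s - Matches a whitespace char
- \S - Matches a non-whitespace char
- \w - Matches a word char (letters, digits or _)
- \W - Matches a non-word char
- Dot - the "." metacharacter matches any character (by default, any character except a line terminator)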
import java.util.regex.*;
public class RegexTest3 {
    public static void main(String[] args) {
        Pattern p = Pattern.compile("\\d");
        Matcher m = p.matcher("The 15th of August");
        boolean found = false;
        while (m.find()) {
            System.out.println("Match found; starting at pos : " + m.start());
            found = true;
        }
        // Match found; starting at pos : 4
        // Match found; starting at pos : 5
        if (!found) {
            System.out.println("No match found");
        }
    }
}
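The remaining metacharacters can be tried out the same way. The following is a minimal sketch, not part of the original page: the class name MetacharacterDemo, the helper findAll and the sample string are made up for illustration. It counts how many characters of a short source each metacharacter matches.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MetacharacterDemo {
    // Hypothetical helper: returns every substring of source matched by regex.
    static List<String> findAll(String regex, String source) {
        List<String> matches = new ArrayList<String>();
        Matcher m = Pattern.compile(regex).matcher(source);
        while (m.find()) {
            matches.add(m.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        // Five characters: a letter, a digit, a space, an underscore, a letter.
        String source = "a1 _b";
        System.out.println("\\d -> " + findAll("\\d", source).size()); // \d -> 1
        System.out.println("\\D -> " + findAll("\\D", source).size()); // \D -> 4
        System.out.println("\\s -> " + findAll("\\s", source).size()); // \s -> 1
        System.out.println("\\S -> " + findAll("\\S", source).size()); // \S -> 4
        System.out.println("\\w -> " + findAll("\\w", source).size()); // \w -> 4
        System.out.println("\\W -> " + findAll("\\W", source).size()); // \W -> 1
        System.out.println(".  -> " + findAll(".", source).size());    // .  -> 5
    }
}
```

Each lowercase metacharacter and its uppercase counterpart split the five characters between them, which is why their counts always add up to 5.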
- Character classes
- The [] notation defines a pattern that represents a set of characters.
- The search matches any one of the characters listed within [], that is, the listed characters are combined with an implicit "OR".
- [abc] - Matches a single a, b or c
- [a-f] - Matches any one of a, b, c, d, e, f
- [a-fA-F] - Matches a to f in lowercase or uppercase
- [^aeiouAEIOU] - Matches any character that is not a vowel
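To see what each character class picks out, here is a small sketch. The class name CharacterClassDemo, the helper matchedChars and the sample string are assumptions made for this illustration. The helper concatenates every character of the source that the class matches.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CharacterClassDemo {
    // Hypothetical helper: concatenates every character of source matched by charClass.
    static String matchedChars(String charClass, String source) {
        StringBuilder sb = new StringBuilder();
        Matcher m = Pattern.compile(charClass).matcher(source);
        while (m.find()) {
            sb.append(m.group());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String source = "Cafe Babe 2011";
        System.out.println(matchedChars("[abc]", source));         // aab
        System.out.println(matchedChars("[a-f]", source));         // afeabe
        System.out.println(matchedChars("[a-fA-F]", source));      // CafeBabe
        System.out.println(matchedChars("[^aeiouAEIOU]", source)); // Cf Bb 2011
    }
}
```

Note that a negated class such as [^aeiouAEIOU] matches spaces and digits too, since it accepts anything that is not in the list.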
Quantifiers
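- Quantifiers specify how many times the preceding element of a pattern may occur.
- * - Zero or more occurrences.
- ? - Zero or one occurrence.
- + - One or more occurrences.
- The above three are greedy quantifiers.
- Example: when searching with find(), the pattern abc(\d)* finds a match in -
- abc0
- abc13423
- abc - since * means zero or more
- abcdef - for the same reason: "abc" followed by zero digits matches at the start
- abcs - likewise, only the "abc" prefix is matched
- It won't find a match in -
- ab211 (does not contain "abc")
- Note that matches(), which requires the entire source to match the pattern, rejects abcdef and abcs because of the trailing non-digit characters.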
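Whether a string "matches" abc(\d)* depends on whether the search uses find() (a match anywhere in the source) or matches() (the whole source must match). The sketch below shows the difference. The class name QuantifierDemo and its two helpers are hypothetical names chosen for this illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuantifierDemo {
    // Hypothetical helper: first substring found by find(), or null if there is none.
    static String firstMatch(String regex, String source) {
        Matcher m = Pattern.compile(regex).matcher(source);
        return m.find() ? m.group() : null;
    }

    // Hypothetical helper: true only if the entire source matches the regex.
    static boolean matchesWhole(String regex, String source) {
        return Pattern.matches(regex, source);
    }

    public static void main(String[] args) {
        String star = "abc(\\d)*";
        String[] sources = {"abc0", "abc13423", "abc", "abcdef", "abcs", "ab211"};
        for (String s : sources) {
            System.out.println(s + " -> find: " + firstMatch(star, s)
                    + ", matches: " + matchesWhole(star, s));
        }
        // abc0 -> find: abc0, matches: true
        // abc13423 -> find: abc13423, matches: true
        // abc -> find: abc, matches: true
        // abcdef -> find: abc, matches: false
        // abcs -> find: abc, matches: false
        // ab211 -> find: null, matches: false

        // + needs at least one digit, ? allows at most one.
        System.out.println(matchesWhole("abc\\d+", "abc"));   // false
        System.out.println(matchesWhole("abc\\d+", "abc7"));  // true
        System.out.println(matchesWhole("abc\\d?", "abc7"));  // true
        System.out.println(matchesWhole("abc\\d?", "abc77")); // false
    }
}
```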
Greedy Quantifiers
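- A greedy quantifier first consumes as much of the source data as it can, then backs off one character at a time until the rest of the pattern matches.
- Appending ? to a quantifier (for example .*?) makes it reluctant: it consumes as little as possible and only expands when the rest of the pattern does not match yet.
See example below: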
import java.util.regex.*;
public class Greedy {
    public static void main(String[] args) {
        String greedyPattern = ".*xx";
        String reluctantPattern = ".*?xx";
        String source = "yyxxxxyxx";
        Pattern gp = Pattern.compile(greedyPattern);
        Matcher gm = gp.matcher(source);
        while (gm.find()) {
            System.out.println("Greedy Match found ! Starts at : " + gm.start()
                    + ", Matched portion : " + gm.group());
        }
        // Will print:
        // Greedy Match found ! Starts at : 0, Matched portion : yyxxxxyxx
        Pattern rp = Pattern.compile(reluctantPattern);
        Matcher rm = rp.matcher(source);
        while (rm.find()) {
            System.out.println("Reluctant Match found ! Starts at : " + rm.start()
                    + ", Matched portion : " + rm.group());
        }
        // Will print:
        // Reluctant Match found ! Starts at : 0, Matched portion : yyxx
        // Reluctant Match found ! Starts at : 4, Matched portion : xx
        // Reluctant Match found ! Starts at : 6, Matched portion : yxx
    }
}
Tokenizing
- Small pieces of data can be tokenized with the String.split() method.
- For advanced tokenizing, the Scanner class is usually the better choice.
- The Scanner class accepts various forms of input such as files, streams or Strings.
- Tokenizing is done within a loop, so the process can be exited as soon as a given condition is met.
- Tokens can be converted to their primitive types automatically, for example with hasNextInt() and nextInt().
- In the example below a scanner tokenizes a string containing integers. The default delimiter of a scanner is whitespace.
import java.util.*;
public class ScannerTest1 {
    private static String source = "M 78 P 85 C 92 E 66 B 88";
    public static void main(String[] args) {
        List<Integer> scores = new ArrayList<Integer>();
        Scanner scanner = new Scanner(source);
        while (scanner.hasNext()) {
            if (scanner.hasNextInt()) {
                int score = scanner.nextInt();
                scores.add(score);
            } else {
                scanner.next(); // Skip the non-integer token
            }
        }
        Collections.sort(scores);
        System.out.println(scores); // Will print: [66, 78, 85, 88, 92]
    }
}
- Another example, where a regex is used as the scanner's delimiter:
import java.util.*;
public class ScannerTest2 {
    private static String source = "ABC = 322, DEF = 343, GHI = 522, KLM = 747";
    public static void main(String[] args) {
        Scanner scanner = new Scanner(source);
        scanner.useDelimiter(",\\s*");
        Map<String, Integer> nameValueMap = new HashMap<String, Integer>();
        while (scanner.hasNext()) {
            String token = scanner.next();
            Scanner lineScanner = new Scanner(token);
            lineScanner.useDelimiter("\\s=\\s");
            String name = null;
            int value = 0;
            while (lineScanner.hasNext()) {
                if (lineScanner.hasNextInt()) {
                    value = lineScanner.nextInt();
                } else {
                    name = lineScanner.next();
                }
            }
            nameValueMap.put(name, value);
        }
        System.out.println(nameValueMap);
    }
}
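For comparison, a small source like this one can also be tokenized with String.split(), as mentioned at the start of this section. The following is only a sketch: the class name SplitDemo and the helper parse are made up for illustration, and a LinkedHashMap is used here (an assumption, not something the Scanner example requires) so that the printed order follows the order in the source.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class SplitDemo {
    // Hypothetical helper: parses "NAME = VALUE, NAME = VALUE, ..." into a map.
    static Map<String, Integer> parse(String source) {
        Map<String, Integer> result = new LinkedHashMap<String, Integer>();
        // First split on commas (plus any following whitespace) to get the pairs.
        for (String pair : source.split(",\\s*")) {
            // Then split each pair on "=" with optional surrounding whitespace.
            String[] parts = pair.split("\\s*=\\s*");
            result.put(parts[0], Integer.parseInt(parts[1]));
        }
        return result;
    }

    public static void main(String[] args) {
        String source = "ABC = 322, DEF = 343, GHI = 522, KLM = 747";
        System.out.println(parse(source)); // {ABC=322, DEF=343, GHI=522, KLM=747}
    }
}
```

split() builds the whole token array up front, which is fine for short strings. A Scanner reads tokens one at a time, so it suits larger inputs or cases where tokenizing should stop early.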