Difference between revisions of "Pattern Matching"

Revision as of 00:53, 10 September 2011

Intro

Classes in the java.util.regex package provide regular expressions support.
The Pattern class is used to store a regex expression - the regex has to be "compiled."
The Matcher class is used to start the regex engine to perform match operations.
Basic example

import java.util.regex.*;

public class RegexTest1 {

	public static void main(String[] args) {
		
		Pattern p = Pattern.compile("lazy"); //The pattern to search for
		Matcher m = p.matcher("The quick brown fox jumps over the lazy dog"); //The source against which to match the pattern
		boolean found = false;
		while(m.find()) {
			System.out.println("Match found at " + m.start() + "," + m.end()); //Will print : Match found at 35,39
			found = true;
		}
		
		if(!found) {
			System.out.println("No match found");
		}
	}

}

Thumb rule: Regex matching runs from left to right and once a source character has been consumed, it cannot be reused.
In the below example, it will match the pattern "aba" starting at 0 and 4, but not at 2 since they are consumed during the match starting from 0.

import java.util.regex.*;

public class RegexTest2 {

	public static void main(String[] args) {
		
		Pattern p = Pattern.compile("aba");
		Matcher m = p.matcher("abababa");
		boolean found = false;
		while(m.find()) {
			System.out.println("Match found; starting at pos : " + m.start());
			found = true;
		}
		
		if(!found) {
			System.out.println("No match found");
		}
	}

}

Metacharacters and Character Classes

Metacharacters are special chars that affect the way a pattern is matched.
The different metacharacters are : ([{\^-$|]})?*+.

Dot - "." metacharacter matches any character

Character classes
The [] notation is used to define a pattern that represents a set of characters. e.g:
The search will match any of the chars defined within [] that is the "OR" operator will be used.
- [abc] - Only a's or b's or c's
- [a-f] - Search for a,b,c,d,e,f chars
- [a-fA-F] - a to f OR A to F
- [a-d[m-p]] - a through d OR m through p
- [a-z&&[def]] - d, e, or f (intersection)
- [a-z&&[^bc]] - a through z, except for b and c: [ad-z] (subtraction)
- [a-z&&[^m-p]] - a through z, and not m through p: [a-lq-z] (subtraction)
- [^aeiouAEIOU] - no vowels

Predefined character classes:

- \d - Matches a digit
- \D - Matches a non-digit equivalent to [^\d]
- \s - Matches a whitespace char
- \S - Matches a non-whitespace char
- \w - Matches a word char (letters/digits or _)
- \W - Matches a non-word char

Within [] - metacharacters are treated as ordinary characters, no need to escape them.

public class RegexTest3 {

	public static void main(String[] args) {
		
		Pattern p = Pattern.compile("\\d");
		Matcher m = p.matcher("The 15th of August");
		boolean found = false;
		while(m.find()) {
			System.out.println("Match found; starting at pos : " + m.start());
			found = true;
		}

                // Match found; starting at pos : 4
                // Match found; starting at pos : 5
		
		if(!found) {
			System.out.println("No match found");
		}
	}

}

Boundary Matchers

^Regex - will attempt to match the regex only at the beginning of the line.
Regex$ - will attempt to match the regex only at the end of the line.
Below example, it will match only 123 and not 221. If the ^ is removed, then both 123 and 221 will be matched.

import java.util.regex.*;

public class RegexTest5 {

	public static void main(String[] args) {
		
		Pattern p = Pattern.compile("^(\\d)+");
		Matcher m = p.matcher("123 sds sadwvf 221");
		
		while(m.find()) {
			System.out.println("Match found; starting at pos : " + m.start() + " , matched content : " + m.group());
			//Will print: Match found; starting at pos : 0 , matched content : 123
		}
		
	}

}

Logical Operators

R | U, Logical OR. e.g ^[a-z] | \d$ A lowercase letter at the beginning of the line or a digit at the end of the line.
RU, Logical AND. e.g. [jJ][aA][vV][aA] - any combination of Java in upper/lower case letters. Note within [] - the match is treated as OR.

Quantifiers

Used to specify the number of occurrences of a search pattern
The below are all greedy quantifiers.
* - Zero or more occurrences
? - Zero or one occurrence.
+ - One or more occurrence.

Consider the pattern a? - it means zero or one a's to be searched for.
In the string banana, the engine will match all the characters in the target, since it is searching for zero or one a's.
Now, where the char is not an a, the empty string is returned in the match. - the regex engine inserts empty strings in the input to match the pattern a?.

Further examples: Match results are in format: startpos, endpos, matched-substring

Target: banana Pattern: a* Match: 0,0: 1,2: a 2,2: 3,4: a 4,4: 5,6: a 6,6:

Target: banana Pattern: a? Match: 0,0: 1,2: a 2,2: 3,4: a 4,4: 5,6: a 6,6:

Target: banana Pattern: a+ Match: 1,2: a 3,4: a 5,6: a

(banaana - with double a in the middle) Notice how the quantifiers are greedy:

Target: banaana Pattern: a* Match: 0,0: 1,2: a 2,2: 3,5: aa 5,5: 6,7: a 7,7:

Target: banana Pattern: a? Match: 0,0: 1,2: a 2,2: 3,4: a 4,5: a 5,5: 6,7: a 7,7:

Target: banana Pattern: a+ Match: 1,2: a 3,5: aa 6,7: a

Example the pattern abc(\d)* will match -
- abc0
- abc13423
- abc - since * means 0 or more digits
- abcdef - for the similar reason as above

It won't match -
- ab211 (doesnt start with abc)
- abcs (doesnt have a digit after abc)

Quantifiers by default attach to one character at a time only. (The one immediately preceding the quantifier)
\d\d+ means a digit followed by zero or one more digit.
abc+ means a followed by b, followed by c one or more times.
(abc)+ means the group "abc" one or more times.
[abc]+ would mean a OR b OR c one or more times.

Greedy Quantifiers

Greedy quantifiers will try to look at the entire source data while trying to determine a match.

See example below:

public class Greedy {

	public static void main(String[] args) {
		String greedyPattern = ".*xx";
		String reluctantPattern = ".*?xx";
		String source = "yyxxxxyxx";
		
		Pattern gp = Pattern.compile(greedyPattern);
		Matcher gm = gp.matcher(source);

		while (gm.find()) {
			System.out.println("Greedy Match found ! Starts at : " + gm.start()
					+ ", Matched portion : " + gm.group());
			
		}
                //Will print:
                //Greedy Match found ! Starts at : 0, Matched portion : yyxxxxyxx

		Pattern rp = Pattern.compile(reluctantPattern);
		Matcher rm = rp.matcher(source);
		
		while (rm.find()) {
			System.out.println("Reluctant Match found ! Starts at : " + rm.start()
					+ ", Matched portion : " + rm.group());
			
		}
                //Will print:
                //Reluctant Match found ! Starts at : 0, Matched portion : yyxx
                //Reluctant Match found ! Starts at : 4, Matched portion : xx
                //Reluctant Match found ! Starts at : 6, Matched portion : yxx
	}

}

Tokenizing (Scanner)

Tokenizing for small pieces of data can be done by the String.split() method.
For advanced Tokenizing, using the Scanner class is the best choice.
The scanner class can accept various forms of input such as files, streams or Strings.
Tokenizing is done within a loop, so that the process can be exited once any conditions are met.
Tokens can be converted to their primitive types automatically.
In example below a scanner tokenizes a string containing integers. The default delimiter of a scanner is a whitespace character.

public class ScannerTest1 {
	
	private static String source = "M 78 P 85 C 92 E 66 B 88";

	public static void main(String[] args) {
		
		List<Integer> scores = new ArrayList<Integer>(); 
		
		Scanner scanner = new Scanner(source);
		
		while(scanner.hasNext()) {
			if(scanner.hasNextInt()) {
				int score = scanner.nextInt();
				scores.add(score);
			} else {
				scanner.next(); 
			}
		}
		
		Collections.sort(scores);
		
		System.out.println(scores);
	}
}

Scanner will throw an InputMismatchException when it can't convert the next token of the input into the desired type.
Example: when the scanner reaches the token abc, it will throw an exception.

Scanner s = new Scanner("123 abc");
		while(s.hasNext()) {
			int i = s.nextInt();
}

Another example, where a regex is being used as a delimiter to the scanner:
In the hasNext(Pattern) method attempts to match the complete token to the Pattern ! Works like Mather's matches() method.

import java.util.*;

public class ScannerTest2 {
	
	private static String source = "ABC = 322, DEF = 343, GHI = 522, KLM = 747"; 

	public static void main(String[] args) {
		
		Scanner scanner = new Scanner(source);
		scanner.useDelimiter(",\\s*");
		
		Map<String, Integer> nameValueMap = new HashMap<String, Integer>();
		
		while(scanner.hasNext()) {
			String token = scanner.next();
			Scanner lineScanner = new Scanner(token);
			lineScanner.useDelimiter("\\s=\\s");
			String name = null;
			int value = 0;
			while(lineScanner.hasNext()) {
				if(lineScanner.hasNextInt()) {
					value = lineScanner.nextInt();
				}  else {
					name = lineScanner.next();
				}
			}
			nameValueMap.put(name, value);
		}
		
		System.out.println(nameValueMap);
		
	}
}

The scanner has to keep moving forward with a call to one of the next() methods. If next() is not called, the scanner will not move to the next token and will endlessly loop. See below/
With s.next() commented, the scanner will be stuck at the first token "abc" - not moving forward and will result in an infinite loop !

import java.util.Scanner;

public class ScanTest {

	public static void main(String[] args) {
		Scanner s = new Scanner("abc 123 efg 1312");
		while(s.hasNext()) {
			if(s.hasNextInt()) {
				System.out.println("Int : " + s.nextInt());
			} else {
				System.out.println("No next int");
                                //s.next() //UNCOMMENT for scanner to move ahead
			}
		}
	}

}

@@ Line 73: / Line 73: @@
 ** [a-fA-F] - a to f OR A to F
 ** [a-d[m-p]] - a through d OR m through p
-** [^aeiouAEIOUS] - no vowels
+** [a-z&&[def]] - d, e, or f (intersection)
+** [a-z&&[^bc]] - a through z, except for b and c: [ad-z] (subtraction)
+** [a-z&&[^m-p]] - a through z, and not m through p: [a-lq-z] (subtraction)
+** [^aeiouAEIOU] - no vowels
 * Predefined character classes:

Difference between revisions of "Pattern Matching"

Revision as of 00:53, 10 September 2011

Contents

Intro

Metacharacters and Character Classes

Boundary Matchers

Logical Operators

Quantifiers

Greedy Quantifiers

Tokenizing (Scanner)

Navigation menu

Page actions

Page actions

Personal tools

Navigation

Search

Tools