Difference between revisions of "Pattern Matching"

Revision as of 00:57, 6 July 2011

Intro

Classes in the java.util.regex package provide regular expressions support.
The Pattern class is used to store a regex expression - the regex has to be "compiled."
The Matcher class is used to start the regex engine to perform match operations.
Basic example

import java.util.regex.*;

public class RegexTest1 {

	public static void main(String[] args) {
		
		Pattern p = Pattern.compile("lazy"); //The pattern to search for
		Matcher m = p.matcher("The quick brown fox jumps over the lazy dog"); //The source against which to match the pattern
		boolean found = false;
		while(m.find()) {
			System.out.println("Match found at " + m.start() + "," + m.end()); //Will print : Match found at 35,39
			found = true;
		}
		
		if(!found) {
			System.out.println("No match found");
		}
	}

}

Thumb rule: Regex matching runs from left to right and once a source character has been consumed, it cannot be reused.
In the below example, it will match the pattern "aba" starting at 0 and 4, but not at 2 since they are consumed during the match starting from 0.

import java.util.regex.*;

public class RegexTest2 {

	public static void main(String[] args) {
		
		Pattern p = Pattern.compile("aba");
		Matcher m = p.matcher("abababa");
		boolean found = false;
		while(m.find()) {
			System.out.println("Match found; starting at pos : " + m.start());
			found = true;
		}
		
		if(!found) {
			System.out.println("No match found");
		}
	}

}

Metacharacters and Character Classes

Metacharacters are special chars that affect the way a pattern is matched.
The different metacharacters are : ([{\^-$|]})?*+.

Dot - "." metacharacter matches any character

Character classes
The [] notation is used to define a pattern that represents a set of characters. e.g:
The search will match any of the chars defined within [] that is the "OR" operator will be used.
- [abc] - Only a's or b's or c's
- [a-f] - Search for a,b,c,d,e,f chars
- [a-fA-F] - small and caps
- [^aeiouAEIOUS] - no vowels

Predefined character classes:

- \d - Matches a digit
- \D - Matches a non-digit equivalent to [^\d]
- \s - Matches a whitespace char
- \S - Matches a non-whitespace char
- \w - Matches a word char (letters/digits or _)
- \W - Matches a non-word char

public class RegexTest3 {

	public static void main(String[] args) {
		
		Pattern p = Pattern.compile("\\d");
		Matcher m = p.matcher("The 15th of August");
		boolean found = false;
		while(m.find()) {
			System.out.println("Match found; starting at pos : " + m.start());
			found = true;
		}

                // Match found; starting at pos : 4
                // Match found; starting at pos : 5
		
		if(!found) {
			System.out.println("No match found");
		}
	}

}

Boundary Matchers

^Regex - will attempt to match the regex only at the beginning of the line.
Regex$ - will attempt to match the regex only at the end of the line.
Below example, it will match only 123 and not 221. If the ^ is removed, then both 123 and 221 will be matched.

import java.util.regex.*;

public class RegexTest5 {

	public static void main(String[] args) {
		
		Pattern p = Pattern.compile("^(\\d)+");
		Matcher m = p.matcher("123 sds sadwvf 221");
		
		while(m.find()) {
			System.out.println("Match found; starting at pos : " + m.start() + " , matched content : " + m.group());
			//Will print: Match found; starting at pos : 0 , matched content : 123
		}
		
	}

}

Logical Operators

R | U, Logical OR. e.g ^[a-z] | \d$ A lowercase letter at the beginning of the line or a digit at the end of the line.
RU, Logical AND. e.g. [jJ][aA][vV][aA] - any combination of Java in upper/lower case letters. Note within [] - the match is treated as OR.

Quantifiers

Used to specify the number of occurrences of a search pattern
The below are all greedy quantifiers.
* - Zero or more occurrences
? - Zero or one occurrence.
+ - One or more occurrence.

Consider the pattern a? - it means zero or one a's to be searched for.
In the string banana, the engine will match all the characters in the target, since it is searching for zero or one a's.
Now, where the char is not an a, the empty string is returned in the match. - the regex engine inserts empty strings in the input to match the pattern a?.

Further examples: Match results are in format: startpos, endpos, matched-substring

Target: banana Pattern: a* Match: 0,0: 1,2: a 2,2: 3,4: a 4,4: 5,6: a 6,6:

Target: banana Pattern: a? Match: 0,0: 1,2: a 2,2: 3,4: a 4,4: 5,6: a 6,6:

Target: banana Pattern: a+ Match: 1,2: a 3,4: a 5,6: a

Target: banana Pattern: a* Match:

Example the pattern abc(\d)* will match -
- abc0
- abc13423
- abc - since * means 0 or more digits
- abcdef - for the similar reason as above

It won't match -
- ab211 (doesnt start with abc)
- abcs (doesnt have a digit after abc)

Quantifiers by default attach to one character at a time only. (The one immediately preceding the quantifier)
\d\d+ means a digit followed by zero or one more digit.
abc+ means a followed by b, followed by c one or more times.
(abc)+ means the group "abc" one or more times.
[abc]+ would mean a OR b OR c one or more times.

Greedy Quantifiers

Greedy quantifiers will try to look at the entire source data while trying to determine a match.

See example below:

public class Greedy {

	public static void main(String[] args) {
		String greedyPattern = ".*xx";
		String reluctantPattern = ".*?xx";
		String source = "yyxxxxyxx";
		
		Pattern gp = Pattern.compile(greedyPattern);
		Matcher gm = gp.matcher(source);

		while (gm.find()) {
			System.out.println("Greedy Match found ! Starts at : " + gm.start()
					+ ", Matched portion : " + gm.group());
			
		}
                //Will print:
                //Greedy Match found ! Starts at : 0, Matched portion : yyxxxxyxx

		Pattern rp = Pattern.compile(reluctantPattern);
		Matcher rm = rp.matcher(source);
		
		while (rm.find()) {
			System.out.println("Reluctant Match found ! Starts at : " + rm.start()
					+ ", Matched portion : " + rm.group());
			
		}
                //Will print:
                //Reluctant Match found ! Starts at : 0, Matched portion : yyxx
                //Reluctant Match found ! Starts at : 4, Matched portion : xx
                //Reluctant Match found ! Starts at : 6, Matched portion : yxx
	}

}

Tokenizing

Tokenizing for small pieces of data can be done by the String.split() method.
For advanced Tokenizing, using the Scanner class is the best choice.
The scanner class can accept various forms of input such as files, streams or Strings.
Tokenizing is done within a loop, so that the process can be exited once any conditions are met.
Tokens can be converted to their primitive types automatically.
In example below a scanner tokenizes a string containing integers. The default delimiter of a scanner is a whitespace character.

public class ScannerTest1 {
	
	private static String source = "M 78 P 85 C 92 E 66 B 88";

	public static void main(String[] args) {
		
		List<Integer> scores = new ArrayList<Integer>(); 
		
		Scanner scanner = new Scanner(source);
		
		while(scanner.hasNext()) {
			if(scanner.hasNextInt()) {
				int score = scanner.nextInt();
				scores.add(score);
			} else {
				scanner.next(); 
			}
		}
		
		Collections.sort(scores);
		
		System.out.println(scores);
	}
}

Another example, where a regex is being used as a delimiter to the scanner:

import java.util.*;

public class ScannerTest2 {
	
	private static String source = "ABC = 322, DEF = 343, GHI = 522, KLM = 747"; 

	public static void main(String[] args) {
		
		Scanner scanner = new Scanner(source);
		scanner.useDelimiter(",\\s*");
		
		Map<String, Integer> nameValueMap = new HashMap<String, Integer>();
		
		while(scanner.hasNext()) {
			String token = scanner.next();
			Scanner lineScanner = new Scanner(token);
			lineScanner.useDelimiter("\\s=\\s");
			String name = null;
			int value = 0;
			while(lineScanner.hasNext()) {
				if(lineScanner.hasNextInt()) {
					value = lineScanner.nextInt();
				}  else {
					name = lineScanner.next();
				}
			}
			nameValueMap.put(name, value);
		}
		
		System.out.println(nameValueMap);
		
	}
}

Difference between revisions of "Pattern Matching"

Revision as of 00:57, 6 July 2011

Contents

Intro

Metacharacters and Character Classes

Boundary Matchers

Logical Operators

Quantifiers

Greedy Quantifiers

Tokenizing

Navigation menu

Page actions

Page actions

Personal tools

Navigation

Search

Tools

@@ Line 148: / Line 148: @@
 * ? - Zero or one occurrence.
 * + - One or more occurrence.
 * Consider the pattern a? - it means zero or one a's to be searched for.
 * In the string banana, the engine will match all the characters in the target, since it is searching for zero or one a's.
 * '''Now, where the char is not an a, the empty string is returned in the match.''' - the regex engine inserts empty strings in the input to match the pattern a?.
+* Further examples: Match results are in format: startpos, endpos, matched-substring
+Target: banana
+Pattern: a*
+Match:
+,0:
+,2: a
+,2:
+,4: a
+,4:
+,6: a
+,6:
+Target: banana
+Pattern: a?
+Match:
+,0:
+,2: a
+,2:
+,4: a
+,4:
+,6: a
+,6:
+Target: banana
+Pattern: a+
+Match:
+,2: a
+,4: a
+,6: a
+Target: banana
+Pattern: a*
+Match:
+Target: banana
+Pattern: a*
+Match:
+Target: banana
+Pattern: a*
+Match: