Difference between revisions of "Pattern Matching"

Revision as of 03:30, 5 July 2011

Intro

Classes in the java.util.regex package provide regular expression support.
The Pattern class stores a regular expression in compiled form; a Pattern is created with Pattern.compile().
The Matcher class runs the regex engine to perform match operations against an input.
Basic example

import java.util.regex.*;

public class RegexTest1 {

	public static void main(String[] args) {
		
		Pattern p = Pattern.compile("lazy"); //The pattern to search for
		Matcher m = p.matcher("The quick brown fox jumps over the lazy dog"); //The source against which to match the pattern
		boolean found = false;
		while(m.find()) {
			System.out.println("Match found at " + m.start() + "," + m.end()); //Will print : Match found at 35,39
			found = true;
		}
		
		if(!found) {
			System.out.println("No match found");
		}
	}

}

Rule of thumb: Regex matching runs from left to right, and once a source character has been consumed by a match, it cannot be reused.
In the example below, the pattern "aba" is matched starting at positions 0 and 4, but not at 2, since the characters at positions 0 to 2 are consumed by the match starting at 0.

import java.util.regex.*;

public class RegexTest2 {

	public static void main(String[] args) {
		
		Pattern p = Pattern.compile("aba");
		Matcher m = p.matcher("abababa");
		boolean found = false;
		while(m.find()) {
			System.out.println("Match found; starting at pos : " + m.start());
			found = true;
		}
		
		if(!found) {
			System.out.println("No match found");
		}
	}

}

Metacharacters

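Metacharacters are regex tokens that have a special meaning in a search. In a Java string literal the backslash itself has to be escaped, so \d is written as "\\d".
\d - Matches a digit
\D - Matches a non-digit, equivalent to [^\d]
\s - Matches a whitespace char
\S - Matches a non-whitespace char
\w - Matches a word char (letters, digits or _)
\W - Matches a non-word char
Dot - the "." metacharacter matches any character (except line terminators, unless the Pattern.DOTALL flag is set)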

import java.util.regex.*;

public class RegexTest3 {

	public static void main(String[] args) {
		
		Pattern p = Pattern.compile("\\d");
		Matcher m = p.matcher("The 15th of August");
		boolean found = false;
		while(m.find()) {
			System.out.println("Match found; starting at pos : " + m.start());
			found = true;
		}

                // Match found; starting at pos : 4
                // Match found; starting at pos : 5
		
		if(!found) {
			System.out.println("No match found");
		}
	}

}
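
The remaining metacharacters can be tried out the same way. The following is a minimal sketch, not part of the original examples: the class name MetacharDemo and the helper count() are made up for illustration, and it simply counts how often each metacharacter matches in the same source string.

```java
import java.util.regex.*;

public class MetacharDemo {

	// Hypothetical helper: counts how many times the regex matches in the source.
	static int count(String regex, String source) {
		Matcher m = Pattern.compile(regex).matcher(source);
		int n = 0;
		while (m.find()) {
			n++;
		}
		return n;
	}

	public static void main(String[] args) {
		String source = "The 15th of August"; // 18 characters, 3 of them spaces
		String[] regexes = {"\\d", "\\D", "\\s", "\\S", "\\w", "\\W", "."};
		for (String regex : regexes) {
			System.out.println(regex + " matches " + count(regex, source) + " time(s)");
		}
	}
}
```

For this source, \d and \D together account for all 18 characters, as do \s and \S, and \w and \W, while the dot matches every character.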

Character classes
The [] notation defines a pattern that represents a set of characters. The search matches any one of the chars listed within [], that is, they are combined with an implicit "OR". e.g.:
- [abc] - a, b or c
- [a-f] - any of the chars a, b, c, d, e, f
- [a-fA-F] - any of the chars a to f, in lower or upper case
- [^aeiouAEIOU] - any char that is not a vowel (a leading ^ negates the set)
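
A short sketch of how such character classes behave in practice. The class name CharClassDemo, the helper collect() and the sample string are assumptions made for illustration; the negated class used here lists only the ten vowels.

```java
import java.util.regex.*;

public class CharClassDemo {

	// Hypothetical helper: concatenates every match of the regex, in order of appearance.
	static String collect(String regex, String source) {
		Matcher m = Pattern.compile(regex).matcher(source);
		StringBuilder sb = new StringBuilder();
		while (m.find()) {
			sb.append(m.group());
		}
		return sb.toString();
	}

	public static void main(String[] args) {
		String source = "Cafe 9b";
		System.out.println(collect("[abc]", source));         // Will print: ab
		System.out.println(collect("[a-f]", source));         // Will print: afeb
		System.out.println(collect("[a-fA-F]", source));      // Will print: Cafeb
		System.out.println(collect("[^aeiouAEIOU]", source)); // Will print: Cf 9b
	}
}
```

Note that the negated class also matches the space and the digit, since it accepts every char that is not listed.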

Boundary Matchers

^Regex - will attempt to match the regex only at the beginning of the line. By default this is the beginning of the whole input; with the Pattern.MULTILINE flag it is the beginning of every line.
Regex$ - will attempt to match the regex only at the end of the line.
In the example below, only 123 is matched and not 221. If the ^ is removed, both 123 and 221 are matched.

import java.util.regex.*;

public class RegexTest5 {

	public static void main(String[] args) {
		
		Pattern p = Pattern.compile("^(\\d)+");
		Matcher m = p.matcher("123 sds sadwvf 221");
		
		while(m.find()) {
			System.out.println("Match found; starting at pos : " + m.start() + " , matched content : " + m.group());
			//Will print: Match found; starting at pos : 0 , matched content : 123
		}
		
	}

}
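
The $ matcher can be sketched the same way. This is an illustrative counterpart to the example above, not part of the original page; the class name EndAnchorDemo and the helper trailingDigits() are made up.

```java
import java.util.regex.*;

public class EndAnchorDemo {

	// Hypothetical helper: returns the run of digits at the very end of the source,
	// or null if the source does not end with a digit.
	static String trailingDigits(String source) {
		Matcher m = Pattern.compile("(\\d)+$").matcher(source);
		return m.find() ? m.group() : null;
	}

	public static void main(String[] args) {
		System.out.println(trailingDigits("123 sds sadwvf 221")); // Will print: 221
		System.out.println(trailingDigits("123 sds sadwvf"));     // Will print: null
	}
}
```

Here the leading 123 is ignored, because the digits have to be followed directly by the end of the input.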

Logical Operators

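R|U, Logical OR. e.g. ^[a-z]|\d$ - a lowercase letter at the beginning of the line or a digit at the end of the line. The regex itself must not contain spaces around the |, since spaces are matched literally by default.
RU, Logical AND (concatenation: R must be followed immediately by U). e.g. [jJ][aA][vV][aA] - the word "java" in any combination of upper/lower case letters. Note that within [] the match is treated as OR.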
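
A minimal sketch of both operators, assuming the patterns are written without spaces around the | (spaces inside a Java regex are matched literally by default). The class name LogicalOpsDemo and the helper found() are made up for illustration.

```java
import java.util.regex.*;

public class LogicalOpsDemo {

	// Hypothetical helper: true if the regex matches anywhere in the source.
	static boolean found(String regex, String source) {
		return Pattern.compile(regex).matcher(source).find();
	}

	public static void main(String[] args) {
		String either = "^[a-z]|\\d$"; // lowercase letter at the start OR digit at the end
		System.out.println(found(either, "hello"));  // Will print: true (starts with a lowercase letter)
		System.out.println(found(either, "Room 7")); // Will print: true (ends with a digit)
		System.out.println(found(either, "Hello"));  // Will print: false

		String java = "[jJ][aA][vV][aA]"; // j AND then a AND then v AND then a, each in either case
		System.out.println(found(java, "I like jAvA")); // Will print: true
		System.out.println(found(java, "I like jav"));  // Will print: false
	}
}
```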

Quantifiers

Quantifiers specify the number of occurrences of a search pattern.
* - Zero or more occurrences.
? - Zero or one occurrence.
+ - One or more occurrences.

The above three are greedy quantifiers.

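For example, when searched for with find(), the pattern abc(\d)* will match -
- abc0
- abc13423
- abc - since * means 0 or more
- abcdef - only the leading "abc" is matched, for the same reason as above
- abcs - again only the leading "abc" is matched, since the digits are optional

It won't match -
- ab211 (doesn't contain abc)

Note that Matcher.matches(), which requires the whole input to match the pattern, rejects abcdef and abcs as well.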
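
Whether an input counts as a match depends on whether a part of it or all of it has to match the pattern. The sketch below contrasts Matcher.find() with Matcher.matches() for the pattern abc(\d)*; the class name QuantifierDemo and the two helper methods are made up for illustration.

```java
import java.util.regex.*;

public class QuantifierDemo {

	static final Pattern P = Pattern.compile("abc(\\d)*");

	// Hypothetical helper: the first portion of the input matched by find(), or null if there is none.
	static String firstMatch(String input) {
		Matcher m = P.matcher(input);
		return m.find() ? m.group() : null;
	}

	// Hypothetical helper: true only if the entire input matches the pattern.
	static boolean wholeMatch(String input) {
		return P.matcher(input).matches();
	}

	public static void main(String[] args) {
		String[] inputs = {"abc0", "abc13423", "abc", "abcdef", "abcs", "ab211"};
		for (String input : inputs) {
			System.out.println(input + " -> find: " + firstMatch(input)
					+ ", matches: " + wholeMatch(input));
		}
	}
}
```

With find(), both abcdef and abcs yield the match "abc", because the digits are optional; with matches(), both are rejected because of the trailing letters. ab211 fails in both cases.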

Greedy Quantifiers

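A greedy quantifier first consumes as much of the source data as it can, and then gives characters back only as far as needed for the rest of the pattern to match. Adding a ? after the quantifier (*?, +?, ??) makes it reluctant: it consumes as little as possible and expands only when needed.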

See example below:

import java.util.regex.*;

public class Greedy {

	public static void main(String[] args) {
		String greedyPattern = ".*xx";
		String reluctantPattern = ".*?xx";
		String source = "yyxxxxyxx";
		
		Pattern gp = Pattern.compile(greedyPattern);
		Matcher gm = gp.matcher(source);

		while (gm.find()) {
			System.out.println("Greedy Match found ! Starts at : " + gm.start()
					+ ", Matched portion : " + gm.group());
			
		}
                //Will print:
                //Greedy Match found ! Starts at : 0, Matched portion : yyxxxxyxx

		Pattern rp = Pattern.compile(reluctantPattern);
		Matcher rm = rp.matcher(source);
		
		while (rm.find()) {
			System.out.println("Reluctant Match found ! Starts at : " + rm.start()
					+ ", Matched portion : " + rm.group());
			
		}
                //Will print:
                //Reluctant Match found ! Starts at : 0, Matched portion : yyxx
                //Reluctant Match found ! Starts at : 4, Matched portion : xx
                //Reluctant Match found ! Starts at : 6, Matched portion : yxx
	}

}

Tokenizing

Small pieces of data can be tokenized with the String.split() method.
For more advanced tokenizing, the Scanner class is usually the better choice.
The Scanner class accepts various forms of input, such as files, streams or Strings.
Tokenizing is done within a loop, so the process can be stopped as soon as a given condition is met.
Tokens can be converted to their primitive types automatically.
In the example below, a Scanner tokenizes a string containing integers. The default delimiter of a Scanner is whitespace.

import java.util.*;

public class ScannerTest1 {
	
	private static String source = "M 78 P 85 C 92 E 66 B 88";

	public static void main(String[] args) {
		
		List<Integer> scores = new ArrayList<Integer>(); 
		
		Scanner scanner = new Scanner(source);
		
		while(scanner.hasNext()) {
			if(scanner.hasNextInt()) {
				int score = scanner.nextInt();
				scores.add(score);
			} else {
				scanner.next(); 
			}
		}
		
		Collections.sort(scores);
		
		System.out.println(scores); //Will print: [66, 78, 85, 88, 92]
	}
}

Another example, where a regex is being used as a delimiter to the scanner:

import java.util.*;

public class ScannerTest2 {
	
	private static String source = "ABC = 322, DEF = 343, GHI = 522, KLM = 747"; 

	public static void main(String[] args) {
		
		Scanner scanner = new Scanner(source);
		scanner.useDelimiter(",\\s*");
		
		Map<String, Integer> nameValueMap = new HashMap<String, Integer>();
		
		while(scanner.hasNext()) {
			String token = scanner.next();
			Scanner lineScanner = new Scanner(token);
			lineScanner.useDelimiter("\\s=\\s");
			String name = null;
			int value = 0;
			while(lineScanner.hasNext()) {
				if(lineScanner.hasNextInt()) {
					value = lineScanner.nextInt();
				}  else {
					name = lineScanner.next();
				}
			}
			nameValueMap.put(name, value);
		}
		
		System.out.println(nameValueMap);
		
	}
}
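
For small inputs, String.split() as mentioned at the start of this section is often enough. The following is a minimal sketch that reuses the same comma delimiter regex on a shorter source string; the class name SplitDemo and the helper tokens() are made up for illustration.

```java
public class SplitDemo {

	// Hypothetical helper: splits the source on a comma followed by optional whitespace.
	static String[] tokens(String source) {
		return source.split(",\\s*");
	}

	public static void main(String[] args) {
		for (String token : tokens("ABC = 322, DEF = 343, GHI = 522")) {
			System.out.println(token);
		}
		//Will print:
		//ABC = 322
		//DEF = 343
		//GHI = 522
	}
}
```

Unlike a Scanner, split() processes the whole string at once and returns every token as a String, so any conversion to a primitive type has to be done by the caller.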
