Pattern Matching

Intro

The classes in the java.util.regex package provide regular expression support.
The Pattern class stores a regular expression in compiled form; a regex must be compiled with Pattern.compile() before it can be used.
The Matcher class runs the regex engine to perform match operations against a source string.

Basic example

import java.util.regex.*;

public class RegexTest1 {

	public static void main(String[] args) {
		
		Pattern p = Pattern.compile("lazy"); //The pattern to search for
		Matcher m = p.matcher("The quick brown fox jumps over the lazy dog"); //The source against which to match the pattern
		boolean found = false;
		while(m.find()) {
			System.out.println("Match found at " + m.start() + "," + m.end()); //Will print : Match found at 35,39
			found = true;
		}
		
		if(!found) {
			System.out.println("No match found");
		}
	}

}

Rule of thumb: regex matching runs from left to right, and once a source character has been consumed by a match it cannot be reused by a later match.
In the example below, the pattern "aba" matches at positions 0 and 4, but not at 2, since the characters at positions 0 to 2 are consumed by the match starting at 0.

import java.util.regex.*;

public class RegexTest2 {

	public static void main(String[] args) {
		
		Pattern p = Pattern.compile("aba");
		Matcher m = p.matcher("abababa");
		boolean found = false;
		while(m.find()) {
			System.out.println("Match found; starting at pos : " + m.start());
			found = true;
		}
		
		if(!found) {
			System.out.println("No match found");
		}
	}

}
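
As a contrast to the rule above, the following sketch shows one possible way to report overlapping matches as well: restart the search one character after each match start with Matcher.find(int). The class name OverlapDemo and its helper methods are made up for this illustration; this is only one option, not the standard way to use find().

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class OverlapDemo {

	// Start positions found by the normal left-to-right scan (no overlaps).
	static List<Integer> plainStarts(String regex, String source) {
		List<Integer> starts = new ArrayList<>();
		Matcher m = Pattern.compile(regex).matcher(source);
		while (m.find()) {
			starts.add(m.start());
		}
		return starts;
	}

	// Start positions including overlaps: after each match, search again
	// from one character past the start of that match.
	static List<Integer> overlappingStarts(String regex, String source) {
		List<Integer> starts = new ArrayList<>();
		Matcher m = Pattern.compile(regex).matcher(source);
		int from = 0;
		while (from <= source.length() && m.find(from)) {
			starts.add(m.start());
			from = m.start() + 1;
		}
		return starts;
	}

	public static void main(String[] args) {
		System.out.println(plainStarts("aba", "abababa"));       // [0, 4]
		System.out.println(overlappingStarts("aba", "abababa")); // [0, 2, 4]
	}
}
```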

Metacharacters

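Metacharacters are regex tokens that have a special meaning in a search.
\d - Matches a digit
\s - Matches a whitespace char
\w - Matches a word char (letters/digits or _)
In a Java string literal the backslash itself has to be escaped, so \d is written as "\\d".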

import java.util.regex.*;

public class RegexTest3 {

	public static void main(String[] args) {
		
		Pattern p = Pattern.compile("\\d");
		Matcher m = p.matcher("The 15th of August");
		boolean found = false;
		while(m.find()) {
			System.out.println("Match found; starting at pos : " + m.start());
			found = true;
		}

		// Will print:
		// Match found; starting at pos : 4
		// Match found; starting at pos : 5
		
		if(!found) {
			System.out.println("No match found");
		}
	}

}

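A set of characters to search for is specified using []
- [abc] - Only a's or b's or c's
- [a-f] - Search for a,b,c,d,e,f chars
- [a-fA-F] - Search for a to f in either lowercase or uppercase

Dot - the "." metacharacter matches any single character (by default, any character except a line terminator)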
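
The character sets and the dot described above can be tried out with a small counting helper. This is a minimal sketch; the class name CharClassDemo, the helper countMatches and the sample string are invented for the illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CharClassDemo {

	// Counts how many times the regex matches in the source.
	static int countMatches(String regex, String source) {
		Matcher m = Pattern.compile(regex).matcher(source);
		int count = 0;
		while (m.find()) {
			count++;
		}
		return count;
	}

	public static void main(String[] args) {
		String source = "Cafe 9B";
		System.out.println(countMatches("[abc]", source));    // 1 (only the lowercase a)
		System.out.println(countMatches("[a-f]", source));    // 3 (a, f, e)
		System.out.println(countMatches("[a-fA-F]", source)); // 5 (C, a, f, e, B)
		System.out.println(countMatches(".", source));        // 7 (every character)
	}
}
```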

Quantifiers

Quantifiers specify the number of occurrences of a search pattern.
* - Zero or more occurrences
? - Zero or one occurrence
+ - One or more occurrences

The above three are greedy quantifiers.

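Example: when searched with find(), the pattern abc(\d)* will match -
- abc0
- abc13423
- abc - since * means 0 or more
- abcdef - find() matches the leading abc, for the same reason as above
- abcs - again, only the leading abc is matched

It won't match -
- ab211 (doesn't contain abc)

With Matcher.matches(), which requires the entire input to match the pattern, abcdef and abcs are rejected because of the letters that follow abc.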
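
The difference between searching for the pattern anywhere in the input and matching the whole input can be checked with a short program. This is a sketch only; the class name QuantifierDemo and its two helper methods are hypothetical names chosen for the illustration.

```java
import java.util.regex.Pattern;

public class QuantifierDemo {

	static final Pattern P = Pattern.compile("abc(\\d)*");

	// True if the pattern occurs anywhere in the input.
	static boolean foundIn(String input) {
		return P.matcher(input).find();
	}

	// True only if the entire input matches the pattern.
	static boolean matchesWhole(String input) {
		return P.matcher(input).matches();
	}

	public static void main(String[] args) {
		String[] inputs = {"abc0", "abc13423", "abc", "abcdef", "abcs", "ab211"};
		for (String input : inputs) {
			System.out.println(input + " find=" + foundIn(input)
					+ " matches=" + matchesWhole(input));
		}
		// abc0, abc13423, abc : find=true  matches=true
		// abcdef, abcs        : find=true  matches=false
		// ab211               : find=false matches=false
	}
}
```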

Greedy Quantifiers

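A greedy quantifier consumes as much of the source data as it can and then backs off one character at a time until the rest of the pattern matches. A reluctant quantifier (a quantifier followed by ?, as in .*?) does the opposite: it consumes as little as possible and expands only as needed.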

See example below:

import java.util.regex.*;

public class Greedy {

	public static void main(String[] args) {
		String greedyPattern = ".*xx";
		String reluctantPattern = ".*?xx";
		String source = "yyxxxxyxx";
		
		Pattern gp = Pattern.compile(greedyPattern);
		Matcher gm = gp.matcher(source);

		while (gm.find()) {
			System.out.println("Greedy Match found ! Starts at : " + gm.start()
					+ ", Matched portion : " + gm.group());
			
		}
                //Will print:
                //Greedy Match found ! Starts at : 0, Matched portion : yyxxxxyxx

		Pattern rp = Pattern.compile(reluctantPattern);
		Matcher rm = rp.matcher(source);
		
		while (rm.find()) {
			System.out.println("Reluctant Match found ! Starts at : " + rm.start()
					+ ", Matched portion : " + rm.group());
			
		}
                //Will print:
                //Reluctant Match found ! Starts at : 0, Matched portion : yyxx
                //Reluctant Match found ! Starts at : 4, Matched portion : xx
                //Reluctant Match found ! Starts at : 6, Matched portion : yxx
	}

}
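
A common place where the greedy and reluctant behaviour matters is extracting delimited text. The sketch below pulls quoted portions out of a sentence; the class name QuotedTextDemo, the helper allMatches and the sample sentence are assumptions made for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class QuotedTextDemo {

	// Collects every matched portion of the source for the given regex.
	static List<String> allMatches(String regex, String source) {
		List<String> result = new ArrayList<>();
		Matcher m = Pattern.compile(regex).matcher(source);
		while (m.find()) {
			result.add(m.group());
		}
		return result;
	}

	public static void main(String[] args) {
		String source = "say \"hi\" and \"bye\"";
		// Greedy: runs from the first quote to the last quote in the source.
		System.out.println(allMatches("\".*\"", source));  // ["hi" and "bye"]
		// Reluctant: stops at the nearest closing quote each time.
		System.out.println(allMatches("\".*?\"", source)); // ["hi", "bye"]
	}
}
```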

Tokenizing

Small pieces of data can be tokenized with the String.split() method.
For more advanced tokenizing, the Scanner class is usually a better choice.
The Scanner class can accept various forms of input such as files, streams or Strings.
Tokenizing is done within a loop, so that the process can be exited as soon as a given condition is met.
Tokens can be converted to their primitive types automatically, using methods such as nextInt().
In the example below, a scanner tokenizes a string containing integers. The default delimiter of a scanner is whitespace.

import java.util.*;

public class ScannerTest1 {
	
	private static String source = "M 78 P 85 C 92 E 66 B 88";

	public static void main(String[] args) {
		
		List<Integer> scores = new ArrayList<Integer>(); 
		
		Scanner scanner = new Scanner(source);
		
		while(scanner.hasNext()) {
			if(scanner.hasNextInt()) {
				int score = scanner.nextInt();
				scores.add(score);
			} else {
				scanner.next(); 
			}
		}
		
		Collections.sort(scores);
		
		System.out.println(scores);
	}
}
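
For comparison, the sketch below tokenizes a comma-separated string in two ways: once with String.split(), and once with a Scanner whose delimiter is changed with useDelimiter(). The class name TokenizeDemo, the helper methods and the sample input are made up for this illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Scanner;

public class TokenizeDemo {

	// Splits on a comma with optional whitespace around it.
	static List<String> splitTokens(String source) {
		return Arrays.asList(source.trim().split("\\s*,\\s*"));
	}

	// Reads the integer tokens with a Scanner that uses the same delimiter.
	static List<Integer> scanInts(String source) {
		List<Integer> values = new ArrayList<>();
		try (Scanner scanner = new Scanner(source)) {
			scanner.useDelimiter("\\s*,\\s*");
			while (scanner.hasNext()) {
				if (scanner.hasNextInt()) {
					values.add(scanner.nextInt());
				} else {
					scanner.next(); // skip tokens that are not integers
				}
			}
		}
		return values;
	}

	public static void main(String[] args) {
		String source = "78, 85,92 , x, 66";
		System.out.println(splitTokens(source)); // [78, 85, 92, x, 66]
		System.out.println(scanInts(source));    // [78, 85, 92, 66]
	}
}
```

split() returns every token as a String, while the Scanner version can test and convert each token as it goes.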
