Java’s character and assorted string classes offer low-level support for pattern matching, but that support typically leads to complex code. For simpler and more efficient coding, Java offers the Regex API. This two-part tutorial helps you get started with regular expressions and the Regex API. First we’ll unpack the three powerful classes residing in the java.util.regex
package, then we’ll explore the Pattern
class and its sophisticated pattern-matching constructs.
What are regular expressions?
A regular expression, also known as a regex or regexp, is a string whose pattern (template) describes a set of strings. The pattern determines which strings belong to the set. A pattern consists of literal characters and metacharacters, which are characters that have special meaning instead of a literal meaning.
Pattern matching is the process of searching text to identify matches, or strings that match a regex’s pattern. Java supports pattern matching via its Regex API. The API consists of three classes–Pattern
, Matcher
, and PatternSyntaxException
–all located in the java.util.regex
package:
Pattern
objects, also known as patterns, are compiled regexes.
Matcher
objects, or matchers, are engines that interpret patterns to locate matches in character sequences (objects whose classes implement the java.lang.CharSequence
interface and serve as text sources).
PatternSyntaxException
objects describe illegal regex patterns.
Java also provides support for pattern matching via various methods in its java.lang.String
class. For example, boolean matches(String regex)
returns true only if the invoking string exactly matches regex
’s regex.
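To make the "exactly matches" distinction concrete, here is a minimal sketch that contrasts String's matches() method with a search performed through Pattern and Matcher. The class name MatchesSketch and its two helper methods are my own illustrations and are not part of the Regex API.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MatchesSketch {
    // True only when the regex matches the entire input (String.matches() semantics).
    static boolean matchesWhole(String regex, String input) {
        return input.matches(regex);
    }

    // True when the regex matches any portion of the input (Matcher.find() semantics).
    static boolean matchesAnywhere(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find();
    }

    public static void main(String[] args) {
        System.out.println(matchesWhole("apple", "applet"));    // prints false
        System.out.println(matchesAnywhere("apple", "applet")); // prints true
        System.out.println(matchesWhole("apple", "apple"));     // prints true
    }
}
```

The first call fails because the trailing t in applet is not accounted for by the regex, whereas a search only needs some portion of the text to match.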
RegexDemo
I’ve created the RegexDemo
application to demonstrate Java’s regular expressions and the various methods located in the Pattern
, Matcher
, and PatternSyntaxException
classes. Here’s the source code for the demo:
Listing 1. Demonstrating regexes
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

public class RegexDemo
{
   public static void main(String[] args)
   {
      if (args.length != 2)
      {
         System.err.println("usage: java RegexDemo regex input");
         return;
      }
      // Convert new-line (\n) character sequences to new-line characters.
      args[1] = args[1].replaceAll("\\\\n", "\n");
      try
      {
         System.out.println("regex = " + args[0]);
         System.out.println("input = " + args[1]);
         Pattern p = Pattern.compile(args[0]);
         Matcher m = p.matcher(args[1]);
         while (m.find())
            System.out.println("Found [" + m.group() + "] starting at "
                               + m.start() + " and ending at " + (m.end() - 1));
      }
      catch (PatternSyntaxException pse)
      {
         System.err.println("Bad regex: " + pse.getMessage());
         System.err.println("Description: " + pse.getDescription());
         System.err.println("Index: " + pse.getIndex());
         System.err.println("Incorrect pattern: " + pse.getPattern());
      }
   }
}
The first thing RegexDemo
’s main()
method does is to validate its command line. This requires two arguments: the first argument is a regex, and the second argument is input text to be matched against the regex.
You might want to specify a new-line (\n
) character as part of the input text. The only way to accomplish this is to specify a \
character followed by an n
character. main()
converts this character sequence to Unicode value 10.
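The four backslashes in Listing 1's replaceAll() call can be puzzling, so here is a small sketch that isolates that one conversion. The class name NewlineArgSketch and the convert() helper are hypothetical names chosen for illustration; the replaceAll() call itself is the one used in Listing 1.

```java
class NewlineArgSketch {
    // Replaces each backslash-n pair, as typed on the command line, with a real new-line.
    static String convert(String arg) {
        // The Java literal "\\\\n" denotes the regex \\n, which matches a literal
        // backslash followed by n. The replacement "\n" is a single new-line character.
        return arg.replaceAll("\\\\n", "\n");
    }

    public static void main(String[] args) {
        String typed = "abc\\nabc";           // what the shell hands over: abc, \, n, abc
        String converted = convert(typed);
        System.out.println(typed.length());     // prints 8
        System.out.println(converted.length()); // prints 7
        System.out.println(converted.charAt(3) == 10); // prints true (Unicode value 10)
    }
}
```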
The bulk of RegexDemo
’s code is located in the try
–catch
construct. The try
block first outputs the specified regex and input text and then creates a Pattern
object that stores the compiled regex. (Regexes are compiled to improve performance during pattern matching.) A matcher is extracted from the Pattern
object and used to repeatedly search for matches until none remain. The catch
block invokes various PatternSyntaxException
methods to extract useful information about the exception. This information is subsequently output.
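If you'd like to see the catch block's role in isolation, the following sketch compiles a regex and reports whether compilation failed. The class name BadRegexSketch and the check() helper are assumptions of mine; only Pattern.compile() and PatternSyntaxException's getDescription() come from the Regex API.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

class BadRegexSketch {
    // Returns null when the regex compiles, or the exception's description when it doesn't.
    static String check(String regex) {
        try {
            Pattern.compile(regex);
            return null;
        } catch (PatternSyntaxException pse) {
            return pse.getDescription();
        }
    }

    public static void main(String[] args) {
        // An opening parenthesis without its closing partner is an illegal pattern.
        System.out.println("(abc -> " + check("(abc"));
        // A plain literal string compiles without complaint, so check() returns null.
        System.out.println("abc  -> " + check("abc"));
    }
}
```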
You don’t need to know more about the source code’s workings at this point; it will become clear when you explore the API in Part 2. You do need to compile Listing 1, however. Grab the code from Listing 1, then type the following into your command line to compile RegexDemo
:
javac RegexDemo.java
Pattern and its constructs
Pattern
, the first of three classes comprising the Regex API, is a compiled representation of a regular expression. Pattern
’s SDK documentation describes various regex constructs, but unless you’re already an avid regex user, you might be confused by parts of the documentation. What are quantifiers, and what’s the difference between greedy, reluctant, and possessive quantifiers? What are character classes, boundary matchers, back references, and embedded flag expressions? I’ll answer these questions and more in the next sections.
Literal strings
The simplest regex construct is the literal string. Some portion of the input text must match this construct’s pattern in order to have a successful pattern match. Consider the following example:
java RegexDemo apple applet
This example attempts to discover if there is a match for the apple
pattern in the applet
input text. The following output reveals the match:
regex = apple
input = applet
Found [apple] starting at 0 and ending at 4
The output shows us the regex and input text, then indicates a successful match of apple
within applet
. Additionally, it presents the starting and ending indexes of that match: 0
and 4
, respectively. The starting index identifies the first text location where a pattern match occurs; the ending index identifies the last text location for the match.
Now suppose we specify the following command line:
java RegexDemo apple crabapple
This time, we get the following match with different starting and ending indexes:
regex = apple
input = crabapple
Found [apple] starting at 4 and ending at 8
The reverse scenario, in which applet
is the regex and apple
is the input text, reveals no match. The entire regex must match, and in this case the input text does not contain a t
after apple
.
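The three literal-string scenarios above can also be reproduced programmatically. What follows is a rough sketch rather than a replacement for RegexDemo: the class name LiteralMatchSketch, the findAll() helper, and the text@start-end result format are all my own inventions. Like Listing 1, it reports an inclusive ending index by subtracting 1 from end().

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LiteralMatchSketch {
    // Collects every match as "text@start-end", where end is inclusive (end() - 1).
    static List<String> findAll(String regex, String input) {
        List<String> found = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            found.add(m.group() + "@" + m.start() + "-" + (m.end() - 1));
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(findAll("apple", "applet"));    // prints [apple@0-4]
        System.out.println(findAll("apple", "crabapple")); // prints [apple@4-8]
        System.out.println(findAll("applet", "apple"));    // prints []
    }
}
```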
Metacharacters
More powerful regex constructs combine literal characters with metacharacters. For example, in a.b
, the period metacharacter (.
) represents any character that appears between a
and b
. Consider the following example:
java RegexDemo .ox "The quick brown fox jumps over the lazy ox."
This example specifies .ox
as the regex and The quick brown fox jumps over the lazy ox.
as the input text. RegexDemo
searches the text for matches that begin with any character and end with ox
. It produces the following output:
regex = .ox
input = The quick brown fox jumps over the lazy ox.
Found [fox] starting at 16 and ending at 18
Found [ ox] starting at 39 and ending at 41
The output reveals two matches: fox
and ox
(with the leading space character). The .
metacharacter matches the f
in the first match and the space character in the second match.
What happens when we replace .ox
with the period metacharacter? That is, what output results from specifying the following command line:
java RegexDemo . "The quick brown fox jumps over the lazy ox."
Because the period metacharacter matches any character, RegexDemo
outputs a match for each character (including the terminating period character) in the input text:
regex = .
input = The quick brown fox jumps over the lazy ox.
Found [T] starting at 0 and ending at 0
Found [h] starting at 1 and ending at 1
Found [e] starting at 2 and ending at 2
Found [ ] starting at 3 and ending at 3
Found [q] starting at 4 and ending at 4
Found [u] starting at 5 and ending at 5
Found [i] starting at 6 and ending at 6
Found [c] starting at 7 and ending at 7
Found [k] starting at 8 and ending at 8
Found [ ] starting at 9 and ending at 9
Found [b] starting at 10 and ending at 10
Found [r] starting at 11 and ending at 11
Found [o] starting at 12 and ending at 12
Found [w] starting at 13 and ending at 13
Found [n] starting at 14 and ending at 14
Found [ ] starting at 15 and ending at 15
Found [f] starting at 16 and ending at 16
Found [o] starting at 17 and ending at 17
Found [x] starting at 18 and ending at 18
Found [ ] starting at 19 and ending at 19
Found [j] starting at 20 and ending at 20
Found [u] starting at 21 and ending at 21
Found [m] starting at 22 and ending at 22
Found [p] starting at 23 and ending at 23
Found [s] starting at 24 and ending at 24
Found [ ] starting at 25 and ending at 25
Found [o] starting at 26 and ending at 26
Found [v] starting at 27 and ending at 27
Found [e] starting at 28 and ending at 28
Found [r] starting at 29 and ending at 29
Found [ ] starting at 30 and ending at 30
Found [t] starting at 31 and ending at 31
Found [h] starting at 32 and ending at 32
Found [e] starting at 33 and ending at 33
Found [ ] starting at 34 and ending at 34
Found [l] starting at 35 and ending at 35
Found [a] starting at 36 and ending at 36
Found [z] starting at 37 and ending at 37
Found [y] starting at 38 and ending at 38
Found [ ] starting at 39 and ending at 39
Found [o] starting at 40 and ending at 40
Found [x] starting at 41 and ending at 41
Found [.] starting at 42 and ending at 42
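As a compact alternative to reading the long output above, this sketch gathers the matched text into a list so that the period metacharacter's behavior can be checked at a glance. The class name MetacharSketch, the findAll() helper, and the short input string acb a-b ab are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MetacharSketch {
    // Returns the text of every match found in the input, in order.
    static List<String> findAll(String regex, String input) {
        List<String> found = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        String text = "The quick brown fox jumps over the lazy ox.";
        // The period needs exactly one character between a and b, so ab doesn't match.
        System.out.println(findAll("a.b", "acb a-b ab")); // prints [acb, a-b]
        System.out.println(findAll(".ox", text));         // prints [fox,  ox]
        // A lone period matches each of the text's 43 characters.
        System.out.println(findAll(".", text).size());    // prints 43
    }
}
```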
Character classes
We sometimes need to limit characters that will produce matches to a specific character set. For example, we might search text for vowels a
, e
, i
, o
, and u
, where any occurrence of a vowel indicates a match. A character class identifies a set of characters between square-bracket metacharacters ([ ]
), helping us accomplish this task. Pattern
supports simple, negation, range, union, intersection, and subtraction character classes. We’ll look at all of these below.
Simple character class
The simple character class consists of characters placed side by side and matches only those characters. For example, [abc]
matches characters a
, b
, and c
.
Consider the following example:
java RegexDemo [csw] cave
This example matches only c
with its counterpart in cave
, as shown in the following output:
regex = [csw]
input = cave
Found [c] starting at 0 and ending at 0
Negation character class
The negation character class begins with the ^
metacharacter and matches only those characters not located in that class. For example, [^abc]
matches all characters except a
, b
, and c
.
Consider this example:
java RegexDemo "[^csw]" cave
Note that the double quotes are necessary on my Windows platform, whose shell treats the ^
character as an escape character.
This example matches a
, v
, and e
with their counterparts in cave
, as shown here:
regex = [^csw]
input = cave
Found [a] starting at 1 and ending at 1
Found [v] starting at 2 and ending at 2
Found [e] starting at 3 and ending at 3
Range character class
The range character class consists of two characters separated by a hyphen metacharacter (-
). All characters beginning with the character on the left of the hyphen and ending with the character on the right of the hyphen belong to the range. For example, [a-z]
matches all lowercase alphabetic characters. It’s equivalent to specifying [abcdefghijklmnopqrstuvwxyz]
.
Consider the following example:
java RegexDemo [a-c] clown
This example matches only c
with its counterpart in clown
, as shown:
regex = [a-c]
input = clown
Found [c] starting at 0 and ending at 0
Union character class
The union character class consists of multiple nested character classes and matches all characters that belong to the resulting union. For example, [a-d[m-p]]
matches characters a
through d
and m
through p
.
Consider the following example:
java RegexDemo [ab[c-e]] abcdef
This example matches a
, b
, c
, d
, and e
with their counterparts in abcdef
:
regex = [ab[c-e]]
input = abcdef
Found [a] starting at 0 and ending at 0
Found [b] starting at 1 and ending at 1
Found [c] starting at 2 and ending at 2
Found [d] starting at 3 and ending at 3
Found [e] starting at 4 and ending at 4
Intersection character class
The intersection character class consists of characters common to all nested classes and matches only common characters. For example, [a-z&&[d-f]]
matches characters d
, e
, and f
.
Consider the following example:
java RegexDemo "[aeiouy&&[y]]" party
Note that the double quotes are necessary on my Windows platform, whose shell treats the &
character as a command separator.
This example matches only y
with its counterpart in party
:
regex = [aeiouy&&[y]]
input = party
Found [y] starting at 4 and ending at 4
Subtraction character class
The subtraction character class consists of all characters except for those indicated in nested negation character classes and matches the remaining characters. For example, [a-z&&[^m-p]]
matches characters a
through l
and q
through z
.
Consider the following example:
java RegexDemo "[a-f&&[^a-c]&&[^e]]" abcdefg
This example matches d
and f
with their counterparts in abcdefg
:
regex = [a-f&&[^a-c]&&[^e]]
input = abcdefg
Found [d] starting at 3 and ending at 3
Found [f] starting at 5 and ending at 5
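The six kinds of character classes can be exercised side by side in code. The sketch below reuses the regexes and input text from the examples above; the class name CharClassSketch and its matched() helper, which simply concatenates every matched character, are hypothetical. Note that no shell quoting is needed here because the regexes are Java string literals.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CharClassSketch {
    // Concatenates the text of every match found in the input.
    static String matched(String regex, String input) {
        StringBuilder sb = new StringBuilder();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            sb.append(m.group());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(matched("[csw]", "cave"));                   // simple: c
        System.out.println(matched("[^csw]", "cave"));                  // negation: ave
        System.out.println(matched("[a-c]", "clown"));                  // range: c
        System.out.println(matched("[ab[c-e]]", "abcdef"));             // union: abcde
        System.out.println(matched("[aeiouy&&[y]]", "party"));          // intersection: y
        System.out.println(matched("[a-f&&[^a-c]&&[^e]]", "abcdefg"));  // subtraction: df
    }
}
```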
Predefined character classes
Some character classes occur often enough in regexes to warrant shortcuts. Pattern
provides predefined character classes as these shortcuts. Use them to simplify your regexes and minimize syntax errors.
Several categories of predefined character classes are provided: standard, POSIX, java.lang.Character
, and Unicode script/block/category/binary property. The following list describes only the standard category:
\d
: A digit. Equivalent to [0-9]
.
\D
: A nondigit. Equivalent to [^0-9]
.
\s
: A whitespace character. Equivalent to [ \t\n\x0B\f\r]
.
\S
: A nonwhitespace character. Equivalent to [^\s]
.
\w
: A word character. Equivalent to [a-zA-Z_0-9]
.
\W
: A nonword character. Equivalent to [^\w]
.
This example uses the \w
predefined character class to identify all word characters in the input text:
java RegexDemo \w "aZ.8 _"
You should observe the following output, which shows that the period and space characters are not considered word characters:
regex = \w
input = aZ.8 _
Found [a] starting at 0 and ending at 0
Found [Z] starting at 1 and ending at 1
Found [8] starting at 3 and ending at 3
Found [_] starting at 5 and ending at 5
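When a predefined character class appears in Java source code rather than on the command line, its backslash must be doubled inside the string literal. Here is a minimal sketch of that, assuming a hypothetical PredefinedClassSketch class whose keep() helper retains only the characters that a regex matches.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PredefinedClassSketch {
    // Keeps only those characters of the input that the regex matches.
    static String keep(String regex, String input) {
        StringBuilder sb = new StringBuilder();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            sb.append(m.group());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // "\\w" in Java source is the two-character regex \w.
        System.out.println(keep("\\w", "aZ.8 _"));           // prints aZ8_
        System.out.println(keep("\\d", "a1b22"));            // prints 122
        // \s matches the space and the tab, so two characters are kept.
        System.out.println(keep("\\s", "a b\tc").length());  // prints 2
    }
}
```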
Capturing groups
A capturing group saves a match’s characters for later recall during pattern matching; this construct is a character sequence surrounded by parentheses metacharacters ( ( )
). All characters within the capturing group are treated as a single unit during pattern matching. For example, the (Java)
capturing group combines letters J
, a
, v
, and a
into a single unit. This capturing group matches the Java
pattern against all occurrences of Java
in the input text. Each match replaces the previous match’s saved Java
characters with the next match’s Java
characters.
Capturing groups can be nested inside other capturing groups. For example, in the (Java( language))
regex, ( language)
nests inside the outer capturing group
. Each capturing group, nested or not, receives its own number. Numbering starts at 1 and proceeds from left to right, in the order of the groups’ opening parentheses. In the example, (Java( language))
belongs to capturing group number 1, and ( language)
belongs to capturing group number 2. In (a)(b)
, (a)
belongs to capturing group number 1, and (b)
belongs to capturing group number 2.
Each capturing group saves its match for later recall by a back reference. Specified as a backslash character followed by a digit character denoting a capturing group number, the back reference recalls a capturing group’s captured text characters. The presence of a back reference causes a matcher to use the back reference’s capturing group number to recall the capturing group’s saved match, and then use that match’s characters to attempt a further match operation. The following example demonstrates the usefulness of a back reference in searching text for a grammatical error:
java RegexDemo "(Java( language)\2)" "The Java language language"
The example uses the (Java( language)\2)
regex to search the input text “The Java language language
” for a grammatical error, where Java
immediately precedes two consecutive occurrences of language
. The regex specifies two capturing groups: number 1 is (Java( language)\2)
, which matches Java language language
, and number 2 is ( language)
, which matches a space character followed by language
. The \2
back reference recalls number 2’s saved match, which allows the matcher to search for a second occurrence of a space character followed by language
, which immediately follows the first occurrence of the space character and language
. The output below shows what RegexDemo
’s matcher finds:
regex = (Java( language)\2)
input = The Java language language
Found [Java language language] starting at 4 and ending at 25
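RegexDemo prints only the overall match, so it doesn't reveal what each capturing group saved. The following sketch retrieves the individual groups with Matcher's group(int) and groupCount() methods, which are real API methods that this part of the tutorial hasn't yet introduced. The class name BackReferenceSketch and the groups() helper are my own.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BackReferenceSketch {
    // Returns group 0 (the whole match) through the last capturing group for the
    // first match, or null when the regex matches nothing in the input.
    static String[] groups(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        if (!m.find()) {
            return null;
        }
        String[] result = new String[m.groupCount() + 1];
        for (int i = 0; i <= m.groupCount(); i++) {
            result[i] = m.group(i);
        }
        return result;
    }

    public static void main(String[] args) {
        String regex = "(Java( language)\\2)"; // \2 recalls group 2's saved match
        String[] g = groups(regex, "The Java language language");
        for (int i = 0; i < g.length; i++) {
            System.out.println("group " + i + " = [" + g[i] + "]");
        }
        // Without the repeated word, the back reference has nothing to match.
        System.out.println(groups(regex, "The Java language") == null); // prints true
    }
}
```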
Boundary matchers
We sometimes want to match patterns at the beginning of lines, at word boundaries, at the end of text, and so on. You can accomplish this task by using one of Pattern
’s boundary matchers, which are regex constructs that identify match locations:
^
: The beginning of a line
$
: The end of a line
\b
: A word boundary
\B
: A non-word boundary
\A
: The beginning of the text
\G
: The end of the previous match
\Z
: The end of the text, except for the final line terminator (if any)
\z
: The end of the text
The following example uses the ^
boundary matcher metacharacter to ensure that a line begins with The
followed by zero or more word characters:
java RegexDemo "^The\w*" Therefore
The ^
character indicates that the first three input text characters must match the pattern’s subsequent T
, h
, and e
characters. Any number of word characters may follow. Here is the output:
regex = ^The\w*
input = Therefore
Found [Therefore] starting at 0 and ending at 8
Suppose you change the command line to java RegexDemo "^The\w*" " Therefore"
. What happens? No match is found because a space character precedes Therefore
.
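Here is a brief sketch of two boundary matchers at work: ^ from the example above, and \b, which anchors a match to word boundaries. The class name BoundarySketch, its two helpers, and the This is input text are assumptions made for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BoundarySketch {
    // Returns the text of the first match, or null when there is no match.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    // Returns the starting index of the first match, or -1 when there is no match.
    static int firstStart(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.start() : -1;
    }

    public static void main(String[] args) {
        System.out.println(firstMatch("^The\\w*", "Therefore"));  // prints Therefore
        System.out.println(firstMatch("^The\\w*", " Therefore")); // prints null
        // \bis\b skips the "is" inside "This" and finds the standalone word at index 5.
        System.out.println(firstStart("\\bis\\b", "This is"));    // prints 5
    }
}
```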
Zero-length matches
You’ll occasionally encounter zero-length matches when working with boundary matchers. A zero-length match is a match with no characters. It occurs in empty input text, at the beginning of input text, after the last character of input text, or between any two characters of that text. Zero-length matches are easy to identify because they always start and end at the same index position.
Consider the following example:
java RegexDemo \b\b "Java is"
This example matches two consecutive word boundaries and generates the following output:
regex = \b\b
input = Java is
Found [] starting at 0 and ending at -1
Found [] starting at 4 and ending at 3
Found [] starting at 5 and ending at 4
Found [] starting at 7 and ending at 6
The output reveals several zero-length matches. The ending index is shown to be one less than the starting index because I specified end() - 1
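in RegexDemo
’s source code (Listing 1). Matcher’s end() method returns the index just past the last matched character, so a zero-length match has equal start() and end() values.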
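In your own code, one way to recognize a zero-length match is to compare start() with end() directly instead of subtracting 1. The following is a minimal sketch of that idea; the class name ZeroLengthSketch and the zeroLengthStarts() helper are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ZeroLengthSketch {
    // Returns the index of every match whose start() equals its end().
    static List<Integer> zeroLengthStarts(String regex, String input) {
        List<Integer> starts = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            if (m.start() == m.end()) {
                starts.add(m.start());
            }
        }
        return starts;
    }

    public static void main(String[] args) {
        // Word boundaries surround "Java" (0 and 4) and "is" (5 and 7).
        System.out.println(zeroLengthStarts("\\b\\b", "Java is")); // prints [0, 4, 5, 7]
    }
}
```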
Quantifiers
A quantifier is a regex construct that explicitly or implicitly binds a numeric value to a pattern. The numeric value determines how many times to match the pattern. Quantifiers are categorized as greedy, reluctant, or possessive:
- A greedy quantifier (
?
, *
, or +
) attempts to find the longest match. Specify X?
to find one or no occurrences of X
, X*
to find zero or more occurrences of X
, X+
to find one or more occurrences of X
, X{n}
to find n
occurrences of X
, X{n,}
to find at least n
(and possibly more) occurrences of X
, and X{n,m}
to find at least n
but no more than m
occurrences of X
.
- A reluctant quantifier (
??
, *?
, or +?
) attempts to find the shortest match. Specify X??
to find one or no occurrences of X
, X*?
to find zero or more occurrences of X
, X+?
to find one or more occurrences of X
, X{n}?
to find n
occurrences of X
, X{n,}?
to find at least n
(and possibly more) occurrences of X
, and X{n,m}?
to find at least n
but no more than m
occurrences of X
.
- A possessive quantifier (
?+
, *+
, or ++
) is similar to a greedy quantifier except that a possessive quantifier only makes one attempt to find the longest match, whereas a greedy quantifier can make multiple attempts. Specify X?+
to find one or no occurrences of X
, X*+
to find zero or more occurrences of X
, X++
to find one or more occurrences of X
, X{n}+
to find n
occurrences of X
, X{n,}+
to find at least n
(and possibly more) occurrences of X
, and X{n,m}+
to find at least n
but no more than m
occurrences of X
.
The following example demonstrates a greedy quantifier:
java RegexDemo .*ox "fox box pox"
Here’s the output:
regex = .*ox
input = fox box pox
Found [fox box pox] starting at 0 and ending at 10
The greedy quantifier (.*
) matches the longest sequence of characters that terminates in ox
. It starts by consuming all of the input text and then is forced to back off until it discovers that the input text terminates with these characters.
Now consider a reluctant quantifier:
java RegexDemo .*?ox "fox box pox"
Here’s its output:
regex = .*?ox
input = fox box pox
Found [fox] starting at 0 and ending at 2
Found [ box] starting at 3 and ending at 6
Found [ pox] starting at 7 and ending at 10
The reluctant quantifier (.*?
) matches the shortest sequence of characters that terminates in ox
. It begins by consuming nothing and then slowly consumes characters until it finds a match. It then continues until it exhausts the input text.
Finally, we have the possessive quantifier:
java RegexDemo .*+ox "fox box pox"
And here’s its output:
regex = .*+ox
input = fox box pox
The possessive quantifier (.*+
) doesn’t detect a match because it consumes the entire input text, leaving nothing left over to match ox
at the end of the regex. Unlike a greedy quantifier, a possessive quantifier doesn’t back off.
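The three quantifier categories are easiest to compare when they run against the same input in one program. The sketch below does that with the regexes from the preceding examples; the class name QuantifierSketch and its findAll() helper are my own choices.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuantifierSketch {
    // Returns the text of every match found in the input, in order.
    static List<String> findAll(String regex, String input) {
        List<String> found = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        String input = "fox box pox";
        System.out.println(findAll(".*ox", input));   // greedy: one match spanning all of the text
        System.out.println(findAll(".*?ox", input));  // reluctant: three short matches
        System.out.println(findAll(".*+ox", input));  // possessive: prints []
    }
}
```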
Zero-length matches
You’ll occasionally encounter zero-length matches when working with quantifiers. For example, the following greedy quantifier produces several zero-length matches:
java RegexDemo a? abaa
This example produces the following output:
regex = a?
input = abaa
Found [a] starting at 0 and ending at 0
Found [] starting at 1 and ending at 0
Found [a] starting at 2 and ending at 2
Found [a] starting at 3 and ending at 3
Found [] starting at 4 and ending at 3
The output reveals five matches. Although the first, third, and fourth matches come as no surprise (in that they reveal the positions of the three a
’s in abaa
), you might be surprised by the second and fifth matches. They seem to indicate that a
matches b
and also matches the text’s end, but that isn’t the case. Regex a?
doesn’t look for b
or the text’s end. Instead, it looks for either the presence or lack of a
. When a?
fails to find a
, it reports that fact as a zero-length match.
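To see the five matches without the end() - 1 adjustment, this sketch records each match as a start-end pair in which the end index is exclusive, exactly as Matcher's end() method returns it. The class name OptionalMatchSketch and the spans() helper are assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class OptionalMatchSketch {
    // Records each match as "start-end", where end is exclusive (the raw end() value).
    static List<String> spans(String regex, String input) {
        List<String> result = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            result.add(m.start() + "-" + m.end());
        }
        return result;
    }

    public static void main(String[] args) {
        // The pairs 1-1 and 4-4 are the zero-length matches: a? found no a at b or at the end.
        System.out.println(spans("a?", "abaa")); // prints [0-1, 1-1, 2-3, 3-4, 4-4]
    }
}
```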
Embedded flag expressions
Matchers assume certain defaults that can be overridden when compiling a regex into a pattern–something we’ll discuss more in Part 2. A regex can override any default by including an embedded flag expression. This regex construct is specified as parentheses metacharacters surrounding a question mark metacharacter (?
), which is followed by a specific lowercase letter. Pattern
recognizes the following embedded flag expressions:
(?i)
: enables case-insensitive pattern matching. For example, java RegexDemo (?i)tree Treehouse
matches tree
with Tree
. Case-sensitive pattern matching is the default.
(?x)
: permits whitespace and comments beginning with the #
metacharacter to appear in a pattern. A matcher ignores both. For example, java RegexDemo ".at(?x)#match hat, cat, and so on" matter
matches .at
with mat
. By default, whitespace and comments are not permitted; a matcher regards them as characters that contribute to a match.
(?s)
: enables dotall mode in which the period metacharacter matches line terminators in addition to any other character. For example, java RegexDemo (?s). \n
matches new-line. Non-dotall mode is the default: line-terminator characters don’t match. For example, java RegexDemo . \n
doesn’t match new-line.
(?m)
: enables multiline mode in which ^
matches the beginning of every line and $
matches the end of every line. For example, java RegexDemo "(?m)^abc$" abc\nabc
matches both abc
s in the input text. Non-multiline mode is the default: ^
matches the beginning of the entire input text and $
matches the end of the entire input text. For example, java RegexDemo "^abc$" abc\nabc
reports no matches.
(?u)
: enables Unicode-aware case folding. This flag works with (?i)
to perform case-insensitive matching in a manner consistent with the Unicode Standard. By default, case-insensitive matching assumes that only characters in the US-ASCII character set are being matched.
(?d)
: enables Unix lines mode in which a matcher recognizes only the \n
line terminator in the context of the .
, ^
, and $
metacharacters. Non-Unix lines mode is the default: a matcher recognizes all terminators in the context of the aforementioned metacharacters.
Embedded flag expressions resemble capturing groups because they surround their characters with parentheses metacharacters. Unlike a capturing group, an embedded flag expression doesn’t capture a match’s characters. Instead, an embedded flag expression is an example of a noncapturing group, which is a regex construct that doesn’t capture text characters. It’s specified as a character sequence surrounded by parentheses metacharacters, with a question mark metacharacter immediately following the opening parenthesis.
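The embedded flag expression examples can be checked from code as well as from the command line. The following sketch counts matches for several of the regexes shown earlier and confirms, via Matcher's groupCount() method, that a flag expression adds no capturing group. The class name FlagSketch and both helpers are hypothetical; a real new-line character stands in for the \n that RegexDemo converts.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FlagSketch {
    // Counts how many matches the regex produces in the input.
    static int count(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    // Reports the number of capturing groups in the regex.
    static int capturingGroups(String regex) {
        return Pattern.compile(regex).matcher("").groupCount();
    }

    public static void main(String[] args) {
        System.out.println(count("(?i)tree", "Treehouse"));   // prints 1
        System.out.println(count("tree", "Treehouse"));       // prints 0
        System.out.println(count("(?s).", "\n"));             // prints 1
        System.out.println(count(".", "\n"));                 // prints 0
        System.out.println(count("(?m)^abc$", "abc\nabc"));   // prints 2
        System.out.println(count("^abc$", "abc\nabc"));       // prints 0
        System.out.println(count(".at(?x)#match hat, cat, and so on", "matter")); // prints 1
        System.out.println(capturingGroups("(?i)tree"));      // prints 0
    }
}
```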