Core Java Interview Questions Part II

As I’ve discussed several times on this site, I’m not a fan of learn by wrote java questions. However, a lot of people continue to ask them during their interview process so it means you need to know how to answer them.

The name of the game is core java interview questions and I will be your quiz master. Each week I will publish 10 new quick fire questions and answers.

1. What is a static variable?

A static variable is a value that remains the same through all the instances and can be modified by all the instances. It is a shared value.

2. If I change the order of a methods modifiers will it still work? Is public static void main the same as public void static main? What about void static public main?

Modifiers can be moved around freely. The visibility modifier (if there is one)must be at the start.

3. Why would you pick an ArrayList or LinkedList?

ArrayList is great if you need fast access to objects but can cope with slower writes. Conversely, if you need fast writes (moving, removing or adding) but can cope with slower direct access then choose a LinkedList (read our in depth article on collections here).

4. What are the different access modifiers in Java and what do they mean?

public- accessible to everyone in the JVM

private- only accessible from inside the class.

protected- accessible from by any class in the same package or any subclass in any package

default- when no visibility modifier is present. accessible from any class in the same package only.

5. If you wanted to prevent your methods or classes from being overridden, how would you do this?

By declaring something as final you prevent it from being overridden. Nothing is perfect though, as crafty developers can always use Reflection to get around this, or alternatively just copy and paste your code into their own version of the class.  It is rarely a good idea to prevent your methods or classes being overridden, and you should code defensively to reflect this.

6. What is an immutable object?

The state of an immutable object cannot be changed after construction.  Immutable objects can play an important roll in threading. (read more on threading here).  A good example of an immutable objeect is String.  Any modification to an immutable object will result in  the creation of a new object.

7. What is the difference between overloading and overriding?

Method overloading is having multiple methods with the same name. They can be differentiated as they will have a different signature, e.g. take different method parameters.  Overriding is used when creating a subclass of a class and specifying your own functionality for the method by copying the method signature identically.

8. What does it mean when we say java does not support multiple inheritance? Is this a good thing?

Java cannot extend functionality from more than one concrete or abstract class.  If both parent classes had a jump() method, it would be unclear which functionality the caller would need to use.  We can implement multiple interfaces however as the the implementation occurs in our actual class so this problem does not occur.

9. What is an abstract class?

Similar to an interface, an abstract class cannot actually be instantiated. Unlike an interface, an abstract class can have method implementations.  Any method without implementation will have the modifier “abstract” to indicate to classes which extend it that they must provide the implementation.

10. What does “write once run anywhere” mean in relation to Java?

Java is a cross platform language; java compiles down to byte code which can be run on a Java Virtual Machine (JVM). JVMs are available on many platforms including the major operating systems. This means that any Java application can in theory run on any platform where a JVM is available, hence write once (for the JVM) and run anywhere (there is a JVM).

Continue reading Core Java Interview Questions Part II

JVM and Garbage Collection Interview Questions: The Beginners Guide

The Java Virtual Machine is the achilles heel of most developers and can cause even the most seasoned developers to be come unstuck.  The simple fact is that unless something is going wrong, we don’t normally care about it.  Maybe we tune it a little when the application goes live but after that it remains untouched until something goes wrong.  This makes it a very difficult subject to excel in during interviews.  Even worse, interviewers love to ask questions about it.  Everyone should have a basic knowledge of the JVM to be able to do their job but often people recruiting are looking for someone who knows how to fix a problem like a memory leak when it happens.

In this guide we take a ground up approach to the JVM and garbage collection so you can feel some level of confidence going into your big day.

Java Question: What is the JVM? Why is it a good thing? What is “write once, run anywhere”? Are there negatives?

JVM stands for Java Virtual Machine.  Java code is compiled down into an intermediary language called byte code. The Java Virtual Machine is then responsible for executing this byte code.  This is unlike languages such as C++ which are compiled directly to native code for a specific platform.

This is what gives Java its ‘Write once, run anywhere’ ability.  In a language which compiles directly to platform you would have to compile and test the application separately on every platform on which you wish it to run. There would likely me several issues with libraries, ensuring they are available on all of the platforms for example. Every new platform would require new compilation and new testing.  This is time consuming and expensive.

Continue reading JVM and Garbage Collection Interview Questions: The Beginners Guide

The Idiots Guide to Big O

I hate big O notation. For as long as I can remember it’s been my biggest achilles heel (of which I have many). It’s just something I’ve never managed to successfully motivate myself to learn about despite knowing it’s going to come up in every single interview. I’ll get asked to implement an algorithm which I’ll do via brute force to start with: “What is the Big O of that?”. “I don’t know Big O but I know it’s slow”.

This is the wrong attitude and the wrong approach, and I have finally decided it is time to face my demons. All the resources I’ve found online I’ve sent me quickly to sleep, so I am determined to create an accessible Big O mastery page.

What on earth is Big O?

Big O is the way of measuring the efficiency of an algorithm and how well it scales based on the size of the dataset.  Imagine you have a list of 10 objects, and you want to sort them in order.  There’s a whole bunch of algorithms you can use to make that happen, but not all algorithms are built equal.  Some are quicker than others but more importantly the speed of an algorithm can vary depending on how many items it’s dealing with. Big O is a way of measuring how an algorithm scales.  Big O references how complex an algorithm is.

Big O is represented using something like O(n).  The O simply denoted we’re talking about big O and you can ignore it (at least for the purpose of the interview).  n is the thing the complexity is in relation to; for programming interview questions this is almost always the size of a collection. The complexity will increase or decrease in accordance with the size of the data store.

Below is a list of the Big O complexities in order of how well they scale relative to the dataset.

O(1)/Constant Complexity: Constant.  This means irrelevant of the size of the data set the algorithm will always take a constant time. 1 item takes 1 second, 10 items takes 1 second, 100 items takes 1 second. It always takes the same amount of time.

O(log n)/Logarithmic Complexity: Not as good as constant, but still pretty good.  The time taken increases with the size of the data set, but not proportionately so. This means the algorithm takes longer per item on smaller datasets relative to larger ones.   1 item takes 1 second, 10 items takes 2 seconds, 100 items takes 3 seconds. If your dataset has 10 items, each item causes 0.2 seconds latency. If your dataset has 100, it only takes 0.03 seconds extra per item. This makes log n algorithms very scalable.

O(n)/Linear Complexity: The larger the data set, the time taken grows proportionately.  1 item takes 1 second, 10 items takes 10 seconds, 100 items takes 100 seconds.

O(n log n): A nice combination of the previous two.  Normally there’s 2 parts to the sort, the first loop is O(n), the second is O(log n), combining to form O(n log n) 1 item takes 2 seconds, 10 items takes 12 seconds, 100 items takes 103 seconds.

O(n^2)/Quadratic Complexity: Things are getting extra slow. 1 item takes 1 second, 10 items takes 100, 100 items takes 10000.  

O(2^n): Exponential Growth! The algorithm takes twice as long for every new element added.  1 item takes 1 second, 10 items takes 1024 seconds, 100 items takes 1267650600228229401496703205376 seconds.

It is important to notice that the above is not ordered by the best to worst complexity. There is no “best” algorithm, as it completely hinges on the size of the dataset and the task at hand.  It is also important to remember the code cost; a more complex algorithm may result in an incredibly quick sort, but if the code has become unmaintainble and difficult to debug is that the right thing to do?  It probably depends on your requirements.

There is also a variation in complexity within the above complexities.  Imagine an algorithm which loops through a list exactly two times.  This would be O(2n) complexity, as it’s going through your lists length (n) twice!

Why does this matter?

Simply put: an algorithm that works on a small dataset is not guaranteed to work well on a large dataset.  Just because something is lightening fast on your machine doesn’t mean that it’s going to work when you scale up to a serious dataset.  You need to understand exactly what your algorithm is doing, and what it’s big O complexity is, in order to choose the right solution.

There are three things we care about with algorithms: best case, worst case and expected case. In reality we only actually care about the latter two, as we’re a bunch of pessimists. If you ask an algorithm to sort a pre-sorted list it’s probably going to do it much faster than a completely jumbled list. Instead we want to know what the worst case (the absolutely maximum amount of steps the algorithm could take) and the expected case (the likely or average number of steps the algorithm could take).  Just to add to the fun, these can and often are different.

Examples

Hopefully you’re with me so far, but let’s dive into some example algorithms for sorting and searching.  The important thing is to be able to explain what complexity an algorithm is.  Interviewers love to get candidates to design algorithms and then ask what the complexity of it is.

O(1)

Irrelevant of the size, it will always return at constant speed. The javadoc for Queue states that it is  “constant time for the retrieval methods (peekelement, and size)”. It’s pretty clear why this is the case.  For peek, we are always returning the first element which we always have a reference to; it doesn’t matter how many elements follow it.  The size of the list is updated upon element addition/removal, and referencing this number is just one operation to access no matter what the size of the list is.

O(log n)

The classic example is a Binary search.  You’re a massive geek so you’ve obviously alphabetised your movie collection. To find your copy of “Back To The Future”, you first go to the middle of the list. You discover the middle film is “Meet The Fockers”, so you then head to the movie in between the start and this film.  You discover this is “Children of Men”. You repeat this again and you’ve found back to the future.   There’s a great interactive demo of binary search here.

Although adding more elements will increase the amount of time it takes to search, it doesn’t do so proportionally. Therefore it is O(log n).

O(n)

As discussed in my post on collections, LinkedLists are not so good (relatively speaking) when it comes to retrieval.  It actually has a Big O of O(n) as in the worst case, to find an element T which is the last element in the list, it is necessary to navigate the entire list of n elements to find T. As the number of elements increases so does the access time in proportion.

O(n log n)

The best example of O(n log n) is a merge sort. This is a divide and conquer algorithm. Imagine you have a list of integers. We divide the list in two again and again until we are left with with a number of lists with 1 item in: each of these lists is therefore sorted.We then merge each list with it’s neighbour (comparing the first elements of each every time). We repeat this with the new composite list until we have our sorted result.Merge-sort-example-300px(Image from wikipedia)

To explain why this is O(n log n) is a bit more complex.  In the above example of 8 numbers, we have 3 levels of sorting:

  • 4 list sorts when the list sizes are 2
  • 2 list sorts when the list sizes are 4
  • 1 list sort when the list size is 8

Now consider if I were to double the number of elements to 16: this would only require one more level of sorting. Hopefully your recognise this is a log n scale.

However, on each level of sorting a total of n operations takes place (look at the red boxes in the diagram above).  this results in (n * log n) operations, e.g. O(n log n).

O(n^2)

The Bubble Sort algorithm is everyone’s first algorithm in school, and interestingly it is quadratic complexity. If you need a reminder; we go through the list and compare each element with the one next to it, swapping the elements if they are out of order. At the end of the first iteration, we then start again from the beginning, with the caveat that we now know the last element is correct.

Bubble-sort-example-300px

Image from Wikipedia.

Imagine writing the code for this; it’s two loops of n iterations.

public int[] sort(int[] toSort){
        for (int i = 0; i < toSort.length -1; i++) {
            boolean swapped = false;
            for (int j = 0; j < toSort.length - 1 - i; j++) {
                if(toSort[j] > toSort[j+1]){
                    swapped = true;
                    int swap = toSort[j+1];
                    toSort[j + 1] = toSort[j];
                    toSort[j] = swap;
                }
            }
if(!swapped){
    break;
}
} return toSort; }

This is also a good example of best vs worst case. If the list to sort is already sorted, then it will only take n to sort; a single iteration will do.  However, in the worst case where we have to go through the list n times and each time looping another n – (first loop) which is slow.

You may notice that it’s technically less than n2 as the second loop decreases each time. This gets ignored because as the size of the data set increases this impact of this becomes more and more marginal and tends towards quadratic.

O(2^n)

Exponential growth! Any algorithm where adding another element dramatically increases the processing time. Take for example trying to find combinations; if I have  a list of 150 people and I would like to find every combination of groupings; everyone by themselves, all of the groups of 2 people, all of the groups of 3 people etc. Using a simple program which takes each person and loops through the combinations,  if I add one extra person then it’s going to increase the processing time exponentially. Every new element will double processing time.

In reality O(2^n) are in no way scalable.

How to figure out Big O in an interview

This is not an exhaustive list of Big O.  Much as you can O(n2), you can also have O(n^3) (imagine throwing an extra loop into your bubble sort for no apparent reason).  What the list on this page should allow you to do is have a stab in the dark at figuring out what the big O of an algorithm is.  If someone is asking you this during an interview they probably want to see how you try and figure it out. Break down the loops and processing.

  • Does it have to go through the entire list? There will be an n in there somewhere.
  •  Does the algorithms processing time increase at a slower rate than the size of the data set? Then there’s probably a log n in there.
  • Are there nested loops? You’re probably looking at n^2 or n^3.
  • Is access time constant irrelevant of the size of the dataset?? O(1)

Sample Questions

I have an array of the numbers 1 to 100 in a random number. One of the numbers is missing. Write an algorithm to figure out what the number is and what position is missing.

There are many variations of this question all of which are very popular.  To calculate the missing number we can simply add up all the numbers we do have, and subtract this from the expected answer of the sum of all numbers between 1 and 100.  To do this we have to iterate the list once. Whilst doing this we can also note which spot has the gap.

   public void findTheBlank(int[] theNumbers) {
        int sumOfAllNumbers = 0;
        int sumOfNumbersPresent = 0;
        int blankSpace = 0;

        for (int i = 0; i < theNumbers.length; i++) {
            sumOfAllNumbers += i + 1;
            sumOfNumbersPresent += theNumbers[i];
            if (theNumbers[i] == 0)
                blankSpace = i;
        }
        System.out.println("Missing number = " + (sumOfAllNumbers - sumOfNumbersPresent) + " at location " + blankSpace +" of the array");
    }

new BubbleSort().findTheBlank(new int[]{7,6,0,1,3,2,4});

//Missing number = 5 at location 2 of the array

Caveat: you can also calculate sumOfAllNumbers using (theNumbers.length+1) * (theNumbers.length) / 2.0). I would never remember that in an interview though.

What is the big O of your algo?

Our algorithm iterates through our list once, so it’s simply O(n).

Java Date and Time API

Get to know the java.time classes you’re most likely to use in Java 8

Introducing the Java Date and Time API

Date and Time is a new date, time, and calendar API for the Java SE platform. It was developed under JSR 310 and replaces Java’s existing date and time infrastructure based on java.util.Date and java.util.Calendar. You should use this API in any application that would otherwise use the legacy DateCalendar, or java.util.TimeZone classes; Calendar or TimeZone subclasses such as java.util.GregorianCalendar; or associated classes such as java.text.DateFormat.

The Date and Time API distinguishes between machine and human views of a timeline, which is an always increasing sequence of instants, or points along the timeline. The machine view reveals a sequence of integral values relative to the epoch (e.g., midnight, January 1, 1970), with positive values identifying instants after the epoch and negative values identifying instants before the epoch. The human view reveals a set of fields (e.g., year, month, day-of-month, hour, minute, and second).

Why do we need a new Date and Time API?

Java 1.0 introduced System.currentTimeMillis(), which could return a machine view of the timeline with millisecond precision. It also introduced the java.util.Date class, which returned a human view. It soon became evident that there were several flaws with this class:

  • Constructors that accept year arguments require offsets from 1900, which has been a source of bugs.
  • January is represented by 0 instead of 1, also a source of bugs.
  • Date doesn’t describe a date but describes a date-time combination.
  • Date‘s mutability makes it unsafe to use in multithreaded scenarios without external synchronization.
  • Date isn’t amenable to internationalization.

In an attempt to fix these flaws, Sun introduced java.util.Calendar and related classes with the Java 1.1 release. Unfortunately, Calendar was also riddled with flaws, including the following:

  • It isn’t possible to format a calendar.
  • January is represented by 0 instead of 1, a source of bugs.
  • Calendar isn’t type-safe; for example, you must pass an int-based constant to the get(int field) method. (In fairness, enums weren’t available when Calendar was released.)
  • Calendar‘s mutability makes it unsafe to use in multithreaded scenarios without external synchronization. (The companion java.util.TimeZoneand java.text.DateFormat classes share this problem.)
  • Calendar stores its state internally in two different ways — as a millisecond offset from the epoch and as a set of fields — resulting in many bugs and performance issues.

Continue reading Java Date and Time API

Regular expressions in Java

Use the Regex API to discover and describe patterns in your Java programs

Java’s character and assorted string classes offer low-level support for pattern matching, but that support typically leads to complex code. For simpler and more efficient coding, Java offers the Regex API. This two-part tutorial helps you get started with regular expressions and the Regex API. First we’ll unpack the three powerful classes residing in the java.util.regex package, then we’ll explore the Pattern class and its sophisticated pattern-matching constructs.

What are regular expressions?

regular expression, also known as a regex or regexp, is a string whose pattern(template) describes a set of strings. The pattern determines which strings belong to the set. A pattern consists of literal characters and metacharacters, which are characters that have special meaning instead of a literal meaning.

Pattern matching is the process of searching text to identify matches, or strings that match a regex’s pattern. Java supports pattern matching via its Regex API. The API consists of three classes–PatternMatcher, and PatternSyntaxException–all located in the java.util.regex package:

  • Pattern objects, also known as patterns, are compiled regexes.
  • Matcher objects, or matchers, are engines that interpret patterns to locate matches in character sequences (objects whose classes implement the java.lang.CharSequence interface and serve as text sources).
  • PatternSyntaxException objects describe illegal regex patterns.

Java also provides support for pattern matching via various methods in its java.lang.String class. For example, boolean matches(String regex)returns true only if the invoking string exactly matches regex‘s regex.

RegexDemo

I’ve created the RegexDemo application to demonstrate Java’s regular expressions and the various methods located in the PatternMatcher, and PatternSyntaxException classes. Here’s the source code for the demo:

Listing 1. Demonstrating regexes

import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;
public class RegexDemo
{
   public static void main(String[] args)
   {
      if (args.length != 2)
      {
         System.err.println("usage: java RegexDemo regex input");
         return;
      }
      // Convert new-line (\n) character sequences to new-line characters.
      args[1] = args[1].replaceAll("\\\\n", "\n");
      try
      {
         System.out.println("regex = " + args[0]);
         System.out.println("input = " + args[1]);
         Pattern p = Pattern.compile(args[0]);
         Matcher m = p.matcher(args[1]);
         while (m.find())
            System.out.println("Found [" + m.group() + "] starting at "
                               + m.start() + " and ending at " + (m.end() - 1));
      }
      catch (PatternSyntaxException pse)
      {
         System.err.println("Bad regex: " + pse.getMessage());
         System.err.println("Description: " + pse.getDescription());
         System.err.println("Index: " + pse.getIndex());
         System.err.println("Incorrect pattern: " + pse.getPattern());
      }
   }
}

The first thing RegexDemo‘s main() method does is to validate its command line. This requires two arguments: the first argument is a regex, and the second argument is input text to be matched against the regex.

You might want to specify a new-line (\n) character as part of the input text. The only way to accomplish this is to specify a \ character followed by an ncharacter. main() converts this character sequence to Unicode value 10.

The bulk of RegexDemo‘s code is located in the trycatch construct. The tryblock first outputs the specified regex and input text and then creates a Pattern object that stores the compiled regex. (Regexes are compiled to improve performance during pattern matching.) A matcher is extracted from the Pattern object and used to repeatedly search for matches until none remain. The catch block invokes various PatternSyntaxException methods to extract useful information about the exception. This information is subsequently output.

You don’t need to know more about the source code’s workings at this point; it will become clear when you explore the API in Part 2. You do need to compile Listing 1, however. Grab the code from Listing 1, then type the following into your command line to compile RegexDemo:

javac RegexDemo.java

Pattern and its constructs

Pattern, the first of three classes comprising the Regex API, is a compiled representation of a regular expression. Pattern‘s SDK documentation describes various regex constructs, but unless you’re already an avid regex user, you might be confused by parts of the documentation. What are quantifiers and what’s the difference between greedyreluctant, and possessive quantifiers? What are character classesboundary matchersback references, and embedded flag expressions? I’ll answer these questions and more in the next sections.

Literal strings

The simplest regex construct is the literal string. Some portion of the input text must match this construct’s pattern in order to have a successful pattern match. Consider the following example:

java RegexDemo apple applet

This example attempts to discover if there is a match for the apple pattern in the applet input text. The following output reveals the match:

regex = apple
input = applet
Found [apple] starting at 0 and ending at 4

The output shows us the regex and input text, then indicates a successful match of apple within applet. Additionally, it presents the starting and ending indexes of that match: 0 and 4, respectively. The starting index identifies the first text location where a pattern match occurs; the ending index identifies the last text location for the match.

Now suppose we specify the following command line:

java RegexDemo apple crabapple

This time, we get the following match with different starting and ending indexes:

regex = apple
input = crabapple
Found [apple] starting at 4 and ending at 8

The reverse scenario, in which applet is the regex and apple is the input text, reveals no match. The entire regex must match, and in this case the input text does not contain a t after apple.

Metacharacters

More powerful regex constructs combine literal characters with metacharacters. For example, in a.b, the period metacharacter (.) represents any character that appears between a and b. Consider the following example:

java RegexDemo .ox "The quick brown fox jumps over the lazy ox."

This example specifies .ox as the regex and The quick brown fox jumps over the lazy ox. as the input text. RegexDemo searches the text for matches that begin with any character and end with ox. It produces the following output:

regex = .ox
input = The quick brown fox jumps over the lazy ox.
Found [fox] starting at 16 and ending at 18
Found [ ox] starting at 39 and ending at 41

The output reveals two matches: fox and ox (with the leading space character). The . metacharacter matches the f in the first match and the space character in the second match.

What happens when we replace .ox with the period metacharacter? That is, what output results from specifying the following command line:

java RegexDemo . "The quick brown fox jumps over the lazy ox."

Because the period metacharacter matches any character, RegexDemo outputs a match for each character (including the terminating period character) in the input text:

regex = .
input = The quick brown fox jumps over the lazy ox.
Found [T] starting at 0 and ending at 0
Found [h] starting at 1 and ending at 1
Found [e] starting at 2 and ending at 2
Found [ ] starting at 3 and ending at 3
Found [q] starting at 4 and ending at 4
Found [u] starting at 5 and ending at 5
Found [i] starting at 6 and ending at 6
Found [c] starting at 7 and ending at 7
Found [k] starting at 8 and ending at 8
Found [ ] starting at 9 and ending at 9
Found [b] starting at 10 and ending at 10
Found [r] starting at 11 and ending at 11
Found [o] starting at 12 and ending at 12
Found [w] starting at 13 and ending at 13
Found [n] starting at 14 and ending at 14
Found [ ] starting at 15 and ending at 15
Found [f] starting at 16 and ending at 16
Found [o] starting at 17 and ending at 17
Found [x] starting at 18 and ending at 18
Found [ ] starting at 19 and ending at 19
Found [j] starting at 20 and ending at 20
Found [u] starting at 21 and ending at 21
Found [m] starting at 22 and ending at 22
Found [p] starting at 23 and ending at 23
Found [s] starting at 24 and ending at 24
Found [ ] starting at 25 and ending at 25
Found [o] starting at 26 and ending at 26
Found [v] starting at 27 and ending at 27
Found [e] starting at 28 and ending at 28
Found [r] starting at 29 and ending at 29
Found [ ] starting at 30 and ending at 30
Found [t] starting at 31 and ending at 31
Found [h] starting at 32 and ending at 32
Found [e] starting at 33 and ending at 33
Found [ ] starting at 34 and ending at 34
Found [l] starting at 35 and ending at 35
Found [a] starting at 36 and ending at 36
Found [z] starting at 37 and ending at 37
Found [y] starting at 38 and ending at 38
Found [ ] starting at 39 and ending at 39
Found [o] starting at 40 and ending at 40
Found [x] starting at 41 and ending at 41
Found [.] starting at 42 and ending at 42

Character classes

We sometimes need to limit characters that will produce matches to a specific character set. For example, we might search text for vowels aeio, and u, where any occurrence of a vowel indicates a match. A character class identifies a set of characters between square-bracket metacharacters ([ ]), helping us accomplish this task. Pattern supports simple, negation, range, union, intersection, and subtraction character classes. We’ll look at all of these below.

Simple character class

The simple character class consists of characters placed side by side and matches only those characters. For example, [abc] matches characters ab, and c.

Consider the following example:

java RegexDemo [csw] cave

This example matches only c with its counterpart in cave, as shown in the following output:

regex = [csw]
input = cave
Found [c] starting at 0 and ending at 0

Negation character class

The negation character class begins with the ^ metacharacter and matches only those characters not located in that class. For example, [^abc] matches all characters except ab, and c.

Consider this example:

java RegexDemo "[^csw]" cave

Note that the double quotes are necessary on my Windows platform, whose shell treats the ^ character as an escape character.

This example matches av, and e with their counterparts in cave, as shown here:

regex = [^csw]
input = cave
Found [a] starting at 1 and ending at 1
Found [v] starting at 2 and ending at 2
Found [e] starting at 3 and ending at 3

Range character class

The range character class consists of two characters separated by a hyphen metacharacter (-). All characters beginning with the character on the left of the hyphen and ending with the character on the right of the hyphen belong to the range. For example, [a-z] matches all lowercase alphabetic characters. It’s equivalent to specifying [abcdefghijklmnopqrstuvwxyz].

Consider the following example:

java RegexDemo [a-c] clown

This example matches only c with its counterpart in clown, as shown:

regex = [a-c]
input = clown
Found [c] starting at 0 and ending at 0

Union character class

The union character class consists of multiple nested character classes and matches all characters that belong to the resulting union. For example, [a-d[m-p]] matches characters a through d and m through p.

Consider the following example:

java RegexDemo [ab[c-e]] abcdef

This example matches abcd, and e with their counterparts in abcdef:

regex = [ab[c-e]]
input = abcdef
Found [a] starting at 0 and ending at 0
Found [b] starting at 1 and ending at 1
Found [c] starting at 2 and ending at 2
Found [d] starting at 3 and ending at 3
Found [e] starting at 4 and ending at 4

Intersection character class

The intersection character class consists of characters common to all nested classes and matches only common characters. For example, [a-z&&[d-f]]matches characters de, and f.

Consider the following example:

java RegexDemo "[aeiouy&&[y]]" party

Note that the double quotes are necessary on my Windows platform, whose shell treats the & character as a command separator.

This example matches only y with its counterpart in party:

regex = [aeiouy&&[y]]
input = party
Found [y] starting at 4 and ending at 4

Subtraction character class

The subtraction character class consists of all characters except for those indicated in nested negation character classes and matches the remaining characters. For example, [a-z&&[^m-p]] matches characters a through l and qthrough z:

java RegexDemo "[a-f&&[^a-c]&&[^e]]" abcdefg

This example matches d and f with their counterparts in abcdefg:

regex = [a-f&&[^a-c]&&[^e]]
input = abcdefg
Found [d] starting at 3 and ending at 3
Found [f] starting at 5 and ending at 5

Predefined character classes

Some character classes occur often enough in regexes to warrant shortcuts. Pattern provides predefined character classes as these shortcuts. Use them to simplify your regexes and minimize syntax errors.

Several categories of predefined character classes are provided: standard, POSIX, java.lang.Character, and Unicode script/block/category/binary property. The following list describes only the standard category:

  • \d: A digit. Equivalent to [0-9].
  • \D: A nondigit. Equivalent to [^0-9].
  • \s: A whitespace character. Equivalent to [ \t\n\x0B\f\r].
  • \S: A nonwhitespace character. Equivalent to [^\s].
  • \w: A word character. Equivalent to [a-zA-Z_0-9].
  • \W: A nonword character. Equivalent to [^\w].

This example uses the \w predefined character class to identify all word characters in the input text:

java RegexDemo \w "aZ.8 _"

You should observe the following output, which shows that the period and space characters are not considered word characters:

regex = \w
input = aZ.8 _
Found [a] starting at 0 and ending at 0
Found [Z] starting at 1 and ending at 1
Found [8] starting at 3 and ending at 3
Found [_] starting at 5 and ending at 5

Capturing groups

capturing group saves a match’s characters for later recall during pattern matching; this construct is a character sequence surrounded by parentheses metacharacters ( ( ) ). All characters within the capturing group are treated as a single unit during pattern matching. For example, the (Java) capturing group combines letters Jav, and a into a single unit. This capturing group matches the Java pattern against all occurrences of Java in the input text. Each match replaces the previous match’s saved Java characters with the next match’s Javacharacters.

Capturing groups can be nested inside other capturing groups. For example, in the (Java( language)) regex, ( language) nests inside (Java). Each nested or non-nested capturing group receives its own number, numbering starts at 1, and capturing groups are numbered from left to right. In the example, (Java( language)) belongs to capturing group number 1, and ( language) belongs to capturing group number 2. In (a)(b)(a) belongs to capturing group number 1, and (b) belongs to capturing group number 2.

Each capturing group saves its match for later recall by a back reference. Specified as a backslash character followed by a digit character denoting a capturing group number, the back reference recalls a capturing group’s captured text characters. The presence of a back reference causes a matcher to use the back reference’s capturing group number to recall the capturing group’s saved match, and then use that match’s characters to attempt a further match operation. The following example demonstrates the usefulness of a back reference in searching text for a grammatical error:

java RegexDemo "(Java( language)\2)" "The Java language language"

The example uses the (Java( language)\2) regex to search the input text “The Java language language” for a grammatical error, where Javaimmediately precedes two consecutive occurrences of language. The regex specifies two capturing groups: number 1 is (Java( language)\2), which matches Java language language, and number 2 is ( language), which matches a space character followed by language. The \2 back reference recalls number 2’s saved match, which allows the matcher to search for a second occurrence of a space character followed by language, which immediately follows the first occurrence of the space character and language. The output below shows what RegexDemo‘s matcher finds:

regex = (Java( language)\2)
input = The Java language language
Found [Java language language] starting at 4 and ending at 25

Boundary matchers

We sometimes want to match patterns at the beginning of lines, at word boundaries, at the end of text, and so on. You can accomplish this task by using one of Pattern‘s boundary matchers, which are regex constructs that identify match locations:

  • ^: The beginning of a line
  • $: The end of a line
  • \b: A word boundary
  • \B: A non-word boundary
  • \A: The beginning of the text
  • \G: The end of the previous match
  • \Z: The end of the text, except for the final line terminator (if any)
  • \z: The end of the text

The following example uses the ^ boundary matcher metacharacter to ensure that a line begins with The followed by zero or more word characters:

java RegexDemo "^The\w*" Therefore

The ^ character indicates that the first three input text characters must match the pattern’s subsequent Th, and e characters. Any number of word characters may follow. Here is the output:

regex = ^The\w*
input = Therefore
Found [Therefore] starting at 0 and ending at 8

Suppose you change the command line to java RegexDemo "^The\w*" " Therefore". What happens? No match is found because a space character precedes Therefore.

Zero-length matches

You’ll occasionally encounter zero-length matches when working with boundary matchers. A zero-length match is a match with no characters. It occurs in empty input text, at the beginning of input text, after the last character of input text, or between any two characters of that text. Zero-length matches are easy to identify because they always start and end at the same index position.

Consider the following example:

java RegExDemo \b\b "Java is"

This example matches two consecutive word boundaries and generates the following output:

regex = \b\b
input = Java is
Found [] starting at 0 and ending at -1
Found [] starting at 4 and ending at 3
Found [] starting at 5 and ending at 4
Found [] starting at 7 and ending at 6

The output reveals several zero-length matches. The ending index is shown to be one less than the starting index because I specified end() - 1 in Listing 1’s RegexDemo‘s source code.

Quantifiers

quantifier is a regex construct that explicitly or implicitly binds a numeric value to a pattern. The numeric value determines how many times to match the pattern. Quantifiers are categorized as greedy, reluctant, or possessive:

  • greedy quantifier (?*, or +) attempts to find the longest match. Specify X? to find one or no occurrences of XX* to find zero or more occurrences of XX+ to find one or more occurrences of XX{n} to find n occurrences of XX{n,} to find at least n (and possibly more) occurrences of X, and X{n,m} to find at least n but no more than m occurrences of X.
  • reluctant quantifier (??*?, or +?) attempts to find the shortest match. Specify X?? to find one or no occurrences of XX*? to find zero or more occurrences of XX+? to find one or more occurrences of XX{n}? to find noccurrences of XX{n,}? to find at least n (and possibly more) occurrences of X, and X{n,m}? to find at least n but no more than m occurrences of X.
  • possessive quantifier (?+*+, or ++) is similar to a greedy quantifier except that a possessive quantifier only makes one attempt to find the longest match, whereas a greedy quantifier can make multiple attempts. Specify X?+ to find one or no occurrences of XX*+ to find zero or more occurrences of XX++ to find one or more occurrences of XX{n}+ to find noccurrences of XX{n,}+ to find at least n (and possibly more) occurrences of X, and X{n,m}+ to find at least n but no more than m occurrences of X.

The following example demonstrates a greedy quantifier:

java RegexDemo .*ox "fox box pox"

Here’s the output:

regex = .*ox
input = fox box pox
Found [fox box pox] starting at 0 and ending at 10

The greedy quantifier (.*) matches the longest sequence of characters that terminates in ox. It starts by consuming all of the input text and then is forced to back off until it discovers that the input text terminates with these characters.

Now consider a reluctant quantifier:

java RegexDemo .*?ox "fox box pox"

Here’s its output:

regex = .*?ox
input = fox box pox
Found [fox] starting at 0 and ending at 2
Found [ box] starting at 3 and ending at 6
Found [ pox] starting at 7 and ending at 10

The reluctant quantifier (.*?) matches the shortest sequence of characters that terminates in ox. It begins by consuming nothing and then slowly consumes characters until it finds a match. It then continues until it exhausts the input text.

Finally, we have the possessive quantifier:

java RegexDemo .*+ox "fox box pox"

And here’s its output:

regex = .*+ox
input = fox box pox

The possessive quantifier (.*+) doesn’t detect a match because it consumes the entire input text, leaving nothing left over to match ox at the end of the regex. Unlike a greedy quantifier, a possessive quantifier doesn’t back off.

Zero-length matches

You’ll occasionally encounter zero-length matches when working with quantifiers. For example, the following greedy quantifier produces several zero-length matches:

java RegexDemo a? abaa

This example produces the following output:

regex = a?
input = abaa
Found [a] starting at 0 and ending at 0
Found [] starting at 1 and ending at 0
Found [a] starting at 2 and ending at 2
Found [a] starting at 3 and ending at 3
Found [] starting at 4 and ending at 3

The output reveals five matches. Although the first, third, and fourth matches come as no surprise (in that they reveal the positions of the three a‘s in abaa), you might be surprised by the second and fifth matches. They seem to indicate that a matches b and also matches the text’s end, but that isn’t the case. Regex a? doesn’t look for b or the text’s end. Instead, it looks for either the presence or lack of a. When a? fails to find a, it reports that fact as a zero-length match.

Embedded flag expressions

Matchers assume certain defaults that can be overridden when compiling a regex into a pattern–something we’ll discuss more in Part 2. A regex can override any default by including an embedded flag expression. This regex construct is specified as parentheses metacharacters surrounding a question mark metacharacter (?), which is followed by a specific lowercase letter. Pattern recognizes the following embedded flag expressions:

  • (?i): enables case-insensitive pattern matching. For example, java RegexDemo (?i)tree Treehouse matches tree with Tree. Case-sensitive pattern matching is the default.
  • (?x): permits whitespace and comments beginning with the #metacharacter to appear in a pattern. A matcher ignores both. For example, java RegexDemo ".at(?x)#match hat, cat, and so on" mattermatches .at with mat. By default, whitespace and comments are not permitted; a matcher regards them as characters that contribute to a match.
  • (?s): enables dotall mode in which the period metacharacter matches line terminators in addition to any other character. For example, java RegexDemo (?s). \n matches new-line. Non-dotall mode is the default: line-terminator characters don’t match. For example, Java RegexDemo . \n doesn’t match new-line.
  • (?m): enables multiline mode in which ^ matches the beginning of every line and $ matches the end of every line. For example, java RegexDemo "(?m)^abc$" abc\nabc matches both abcs in the input text. Non-multiline mode is the default: ^ matches the beginning of the entire input text and $matches the end of the entire input text. For example, java RegexDemo "^abc$" abc\nabc reports no matches.
  • (?u): enables Unicode-aware case folding. This flag works with (?i) to perform case-insensitive matching in a manner consistent with the Unicode Standard. The default setting is case-insensitive matching that assumes only characters in the US-ASCII character set match.
  • (?d): enables Unix lines mode in which a matcher recognizes only the \nline terminator in the context of the .^, and $ metacharacters. Non-Unix lines mode is the default: a matcher recognizes all terminators in the context of the aforementioned metacharacters.

Embedded flag expressions resemble capturing groups because they surround their characters with parentheses metacharacters. Unlike a capturing group, an embedded flag expression doesn’t capture a match’s characters. Instead, an embedded flag expression is an example of a noncapturing group, which is a regex construct that doesn’t capture text characters. It’s specified as a character sequence surrounded by parentheses metacharacters.