Capturing Groups


This chapter covers another useful feature of JavaScript regular expressions: capturing groups, which let you capture parts of a match and get them as separate items in the result array.

A part of a pattern can be enclosed in parentheses (...). This is called a capturing group. It has two primary effects:

  1. It allows getting a part of the match as a separate item in the result array.
  2. If a quantifier is put after the parentheses, it applies to the parentheses as a whole.

Examples of Using Parentheses

Now, let’s see how parentheses operate.

Take the string “dododo” as an example.

Without parentheses, the pattern do+ means the character d followed by o repeated one or more times, for example doooo or dooooooooo.

Parentheses group characters together, so (do)+ means do, dodo, dododo, and so on, like in the example below:

console.log('Dododo'.match(/(do)+/i)[0]); // "Dododo"
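To make the difference concrete, here is a minimal sketch that runs both patterns against the same text. The sample string is made up for illustration only.

```javascript
// Hypothetical sample string, chosen only for illustration.
const sample = 'dododo doooo';

// Without parentheses, + applies only to the letter "o".
const withoutGroup = sample.match(/do+/g);

// With parentheses, + applies to the whole group "do".
const withGroup = sample.match(/(do)+/g);

console.log(withoutGroup); // ["do", "do", "do", "doooo"]
console.log(withGroup);    // ["dododo", "do"]
```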

Domain

Now, let’s try to look for a website domain using a regular expression.

For instance:

email.com

users.email.com

roberts.users.email.com

So, a domain consists of repeated words, with a dot after each of them except the last one.

It is (\w+\.)+\w+ in regular expressions:

let regexp = /(\w+\.)+\w+/g;

console.log("email.com my.email.com".match(regexp)); // email.com,my.email.com

The search works, but the pattern cannot match a domain with a hyphen, such as my-site.com, because the hyphen doesn’t belong to the \w class.

It can be fixed by replacing \w with [\w-] in each word except for the last one: ([\w-]+\.)+\w+.
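The effect of the fix can be checked with a short sketch. The domain my-site.com is a hypothetical example, not taken from a real site.

```javascript
const plain = /(\w+\.)+\w+/g;
const withHyphen = /([\w-]+\.)+\w+/g;

// Hypothetical domain that contains a hyphen.
const text = 'my-site.com';

// \w does not include "-", so only the part after the hyphen matches.
const plainResult = text.match(plain);

// [\w-] accepts the hyphen, so the whole domain matches.
const hyphenResult = text.match(withHyphen);

console.log(plainResult);  // ["site.com"]
console.log(hyphenResult); // ["my-site.com"]
```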

Email

Let’s create a regular expression for emails, based on the previous example. The format of an email is name@domain. The name can be any word; hyphens and dots are also allowed. In regexp, it will look like this: [-.\w]+.

The pattern will be as follows:

let regexp = /[-.\w]+@([\w-]+\.)+[\w-]+/g;

console.log("my@email.com @ his@siteName.com.ru".match(regexp)); // my@email.com, his@siteName.com.ru

This regexp is not perfect, but it mostly works and helps catch accidental mistypes.
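One way to use this pattern is to wrap it in a small function. The sketch below assumes a hypothetical helper named findEmails; it is not part of any library. It also handles the case where match returns null because nothing was found.

```javascript
// Hypothetical helper: returns every substring that looks like an email.
function findEmails(text) {
  const regexp = /[-.\w]+@([\w-]+\.)+[\w-]+/g;
  // With the g flag, match returns null when there are no matches.
  return text.match(regexp) || [];
}

const found = findEmails('write to my@email.com or his@site-name.com.ru');
const nothing = findEmails('no address here');

console.log(found);   // ["my@email.com", "his@site-name.com.ru"]
console.log(nothing); // []
```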

Parentheses Contents in the Match

Parentheses are numbered from left to right. The engine remembers the content matched by each of them and makes it available in the result.

If the regexp has no g flag, the str.match(regexp) method searches for the first match and returns it as an array:

  1. At index 0: the full match.
  2. At index 1: the contents of the first parentheses.
  3. At index 2: the contents of the second parentheses.

As an example, let’s consider finding HTML tags and processing them.

As a first step, you should wrap the content into parentheses, as follows: <(.*?)>.

So, you will get both the tag <p> as a whole and the contents p in the resulting array, like this:

let str = '<p>Welcome to Web</p>';

let tag = str.match(/<(.*?)>/);

alert(tag[0]); // <p>

alert(tag[1]); // p
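The same result can be unpacked with array destructuring. This is only a sketch; the variable names fullMatch, tagName and noTag are chosen for illustration. It also shows that match returns null when nothing is found, so a guard is needed before reading indexes.

```javascript
const str = '<p>Welcome to Web</p>';

// Index 0 is the full match, index 1 is the first group.
const [fullMatch, tagName] = str.match(/<(.*?)>/);
console.log(fullMatch); // <p>
console.log(tagName);   // p

// Without the g flag, the result also carries the match position.
const position = str.match(/<(.*?)>/).index;
console.log(position); // 0

// match returns null when nothing is found, so check before indexing.
const noTag = 'no tags here'.match(/<(.*?)>/);
console.log(noTag === null); // true
```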

Nested Groups

Parentheses might be nested. In that case, the numbering goes from left to right, too.

When searching for a tag in <p class="myClass">, you may be interested in the whole tag content (p class="myClass"), the tag name (p), and the tag attributes (class="myClass").

Adding parentheses to them will look like this: <(([a-z]+)\s*([^>]*))>

The action will be as follows:

let str = '<p class="myClass">';

let regexp = /<(([a-z]+)\s*([^>]*))>/;

let res = str.match(regexp);

alert(res[0]); // <p class="myClass">

alert(res[1]); // p class="myClass"

alert(res[2]); // p

alert(res[3]); // class="myClass"

As always, index 0 of the result holds the full match.

The first group is returned as res[1]. It encloses the tag content as a whole.

Then res[2] holds the group from the second opening paren, ([a-z]+), which is the tag name, and res[3] holds the group ([^>]*), which is the tag attributes.
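The numbering rule (count opening parens from left to right) can be sketched with another nested pattern. The date pattern below is a made-up illustration, not part of the tag example.

```javascript
// Groups are numbered by the position of their opening paren:
// 1: ((\d{4})-(\d{2}))  2: (\d{4})  3: (\d{2})  4: the last (\d{2})
const nested = /((\d{4})-(\d{2}))-(\d{2})/;

const parts = '2020-04-20'.match(nested);

console.log(parts[0]); // 2020-04-20 (full match)
console.log(parts[1]); // 2020-04    (outer group)
console.log(parts[2]); // 2020       (first inner group)
console.log(parts[3]); // 04         (second inner group)
console.log(parts[4]); // 20         (group after the nested pair)
```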

Optional Groups

Even if a group is optional and doesn’t take part in the match, the corresponding item is still present in the result array and equals undefined.

For example, consider the regular expression a(z)?(c)?. If we run it on a string with the single letter a, the result looks like this:

let m = 'a'.match(/a(z)?(c)?/);

console.log(m.length); // 3

console.log(m[0]); // a (whole match)

console.log(m[1]); // undefined

console.log(m[2]); // undefined

The length of the array is 3, but all the groups are empty.
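When only some optional groups take part in the match, the others stay undefined. The sketch below runs the same pattern on the string ac and uses destructuring defaults as one possible way to substitute a fallback; the fallback text 'none' is an arbitrary choice.

```javascript
const m = 'ac'.match(/a(z)?(c)?/);

console.log(m.length); // 3
console.log(m[0]);     // ac (whole match)
console.log(m[1]);     // undefined, because (z)? matched nothing
console.log(m[2]);     // c

// Destructuring defaults apply only to undefined items.
const [, z = 'none', c = 'none'] = m;
console.log(z, c); // none c
```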

Searching for All Matches: matchAll

First of all, note that matchAll is a relatively new method (added in ES2020) and is not supported by old browsers, so a polyfill may be required.

While searching for all matches (g flag), the match method can’t return contents for all the groups.

In the example below, you can see an attempt to find all tags in a string:

let str = '<p> <span>';

let tags = str.match(/<(.*?)>/g);

alert(tags); // <p>,<span>

The result is an array of matches, but without the group details for each of them.

Usually, though, we need the contents of the capturing groups in the result.

To get them, use the str.matchAll(regexp) method, which was added to JavaScript long after the match method. It differs from match in several important ways. It returns an iterable object rather than an array. It requires the g flag (calling it with a regexp that lacks g throws a TypeError) and returns each match as an array with groups. If there are no matches, it returns an empty iterable object rather than null.

Here is an example:

let result = '<p> <span>'.matchAll(/<(.*?)>/gi);

// result isn't an array, but an iterable object

console.log(String(result)); // [object RegExp String Iterator]

console.log(result[0]); // undefined (*)

result = Array.from(result); // let's turn it into array

alert(result[0]); // <p>,p (1st tag)

alert(result[1]); // <span>,span (2nd tag)
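Since the result is iterable, it can also be consumed directly with for..of, without converting it to an array first. The following sketch collects the tag names and then checks the no-match case; the variable names are illustrative.

```javascript
const tagNames = [];

// Each item is a match array: [fullMatch, group1, ...].
for (const [whole, name] of '<p> <span>'.matchAll(/<(.*?)>/g)) {
  tagNames.push(name);
}

console.log(tagNames); // ["p", "span"]

// With no matches the iterable is simply empty (not null).
const none = Array.from('plain text'.matchAll(/<(.*?)>/g));
console.log(none.length); // 0
```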

Named Groups

It is not easy to remember groups by their numbers. That is manageable for simple patterns, but counting parentheses is inconvenient for complex ones. A better option is to give the parentheses names.

You can do it by putting ?<name> right after the opening paren.

Here is an example of searching for a date:

let dateRegexp = /(?<year>[0-9]{4})-(?<month>[0-9]{2})-(?<day>[0-9]{2})/;

let str = "2020-04-20";

let groups = str.match(dateRegexp).groups;

console.log(groups.year); // 2020

console.log(groups.month); // 04

console.log(groups.day); // 20

The groups reside in the .groups property of the match. To search for all dates, the g flag can be added.
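As a small sketch of the .groups property, the named values can be pulled out with object destructuring. Named groups keep their numeric indexes as well, so both ways of access give the same string.

```javascript
const dateMatch = '2020-04-20'.match(
  /(?<year>[0-9]{4})-(?<month>[0-9]{2})-(?<day>[0-9]{2})/
);

// Named groups live in the .groups object of the match.
const { year, month, day } = dateMatch.groups;
console.log(year, month, day); // 2020 04 20

// The same groups are still available by number.
console.log(dateMatch[1] === year); // true
console.log(dateMatch[3] === day);  // true
```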

With the g flag, matchAll is needed to obtain the full matches along with the groups, like this:

let dateRegexp = /(?<year>[0-9]{4})-(?<month>[0-9]{2})-(?<day>[0-9]{2})/g;

let str = "2020-04-30 2020-10-01";

let results = str.matchAll(dateRegexp);


for (let result of results) {

  let { year, month, day } = result.groups;

  console.log(`${day}.${month}.${year}`); // 30.04.2020, then 01.10.2020

}

Capturing Groups in the Replacement

The str.replace(regexp, replacement) method replaces matches of regexp in str (all of them if the regexp has the g flag, otherwise only the first one) and allows using parentheses contents in the replacement string. This is done with $n, where n is the group number.

For instance:

let str = "John Smith";

let regexp = /(\w+) (\w+)/;

console.log(str.replace(regexp, '$2, $1')); // Smith, John
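The g flag decides how many matches are replaced. The sketch below uses a made-up list of names to compare the two cases.

```javascript
// Hypothetical input with two "first last" pairs.
const names = 'John Smith, Ann Lee';

// Without g, only the first match is replaced.
const firstOnly = names.replace(/(\w+) (\w+)/, '$2 $1');

// With g, every match is replaced.
const everyPair = names.replace(/(\w+) (\w+)/g, '$2 $1');

console.log(firstOnly); // Smith John, Ann Lee
console.log(everyPair); // Smith John, Lee Ann
```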

The reference will be $<name> for the named parentheses.

Here is an example:

let regexp = /(?<year>[0-9]{4})-(?<month>[0-9]{2})-(?<day>[0-9]{2})/g;

let str = "2020-03-30, 2020-10-01";

console.log(str.replace(regexp, '$<day>.$<month>.$<year>')); // 30.03.2020, 01.10.2020

Summary

A part of a pattern may be enclosed in parentheses. It is known as a capturing group. Parentheses groups are numbered from left to right, and they can optionally be named with (?<name>...).

The str.match method returns capturing groups only when the regexp has no g flag. The str.matchAll method always returns capturing groups.

Also, parentheses contents can be used in the replacement strings in str.replace.
