|
| 1 | +# Character classes |
| 2 | + |
| 3 | +Consider a practical task -- we have a phone number like `"+7(903)-123-45-67"`, and we need to turn it into pure numbers: `79031234567`. |
| 4 | + |
| 5 | +To do so, we can find and remove anything that's not a number. Character classes can help with that. |
| 6 | + |
| 7 | +A *character class* is a special notation that matches any symbol from a certain set. |
| 8 | + |
| 9 | +To start, let's explore the "digit" class. It's written as `pattern:\d` and corresponds to "any single digit". |
| 10 | + |
| 11 | +For instance, let's find the first digit in the phone number: |
| 12 | + |
| 13 | +```js run |
| 14 | +let str = "+7(903)-123-45-67"; |
| 15 | + |
| 16 | +let regexp = /\d/; |
| 17 | + |
| 18 | +alert( str.match(regexp) ); // 7 |
| 19 | +``` |
| 20 | + |
| 21 | +Without the flag `pattern:g`, the regular expression only looks for the first match, that is the first digit `pattern:\d`. |
| 22 | + |
| 23 | +Let's add the `pattern:g` flag to find all digits: |
| 24 | + |
| 25 | +```js run |
| 26 | +let str = "+7(903)-123-45-67"; |
| 27 | + |
| 28 | +let regexp = /\d/g; |
| 29 | + |
| 30 | +alert( str.match(regexp) ); // array of matches: 7,9,0,3,1,2,3,4,5,6,7 |
| 31 | + |
| 32 | +// let's make the digits-only phone number of them: |
| 33 | +alert( str.match(regexp).join('') ); // 79031234567 |
| 34 | +``` |
| 35 | + |
| 36 | +That was a character class for digits. There are other character classes as well. |
| 37 | + |
| 38 | +The most used are: |
| 39 | + |
| 40 | +`pattern:\d` ("d" is from "digit") |
| 41 | +: A digit: a character from `0` to `9`. |
| 42 | + |
| 43 | +`pattern:\s` ("s" is from "space") |
| 44 | +: A space symbol: includes spaces, tabs `\t`, newlines `\n` and a few other rare characters, such as `\v`, `\f` and `\r`. |
| 45 | + |
| 46 | +`pattern:\w` ("w" is from "word") |
| 47 | +: A "wordly" character: either a letter of the Latin alphabet, a digit, or an underscore `_`. Non-Latin letters (like Cyrillic or Hindi) do not belong to `pattern:\w`. |
| 48 | + |
| 49 | +For instance, `pattern:\d\s\w` means a "digit" followed by a "space character" followed by a "wordly character", such as `match:1 a`. |
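The one-character-per-class rule can be checked with a minimal sketch like the one below. The sample strings are made up for illustration, and `console.log` is used instead of `alert` so that the snippet also runs outside a browser.

```js
// Each class in /\d\s\w/ consumes exactly one character.
const sample = "Room 1 a";
const found = sample.match(/\d\s\w/);
console.log(found[0]); // "1 a"

// \w covers only Latin letters, digits and the underscore.
console.log(/\w/.test("_")); // true
console.log(/\w/.test("Я")); // false (a Cyrillic letter)

// \s matches tabs and newlines too, not just the space character.
console.log(/\d\s\w/.test("1\ta")); // true
```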
| 50 | + |
| 51 | +**A regexp may contain both regular symbols and character classes.** |
| 52 | + |
| 53 | +For instance, `pattern:CSS\d` matches a string `match:CSS` with a digit after it: |
| 54 | + |
| 55 | +```js run |
| 56 | +let str = "Is there CSS4?"; |
| 57 | +let regexp = /CSS\d/; |
| 58 | + |
| 59 | +alert( str.match(regexp) ); // CSS4 |
| 60 | +``` |
| 61 | + |
| 62 | +We can also use several character classes: |
| 63 | + |
| 64 | +```js run |
| 65 | +alert( "I love HTML5!".match(/\s\w\w\w\w\d/) ); // ' HTML5' |
| 66 | +``` |
| 67 | + |
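| 68 | +In the match, each regexp character class corresponds to one result character: `pattern:\s` matches the space, the four `pattern:\w` match `match:HTML`, and `pattern:\d` matches `match:5`. |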
| 69 | + |
| 70 | + |
| 71 | + |
| 72 | +## Inverse classes |
| 73 | + |
| 74 | +For every character class there exists an "inverse class", denoted with the same letter, but uppercased. |
| 75 | + |
| 76 | +The "inverse" means that it matches all other characters, for instance: |
| 77 | + |
| 78 | +`pattern:\D` |
| 79 | +: Non-digit: any character except `pattern:\d`, for instance a letter. |
| 80 | + |
| 81 | +`pattern:\S` |
| 82 | +: Non-space: any character except `pattern:\s`, for instance a letter. |
| 83 | + |
| 84 | +`pattern:\W` |
| 85 | +: Non-wordly character: anything but `pattern:\w`, e.g. a non-Latin letter or a space. |
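As a quick illustration of the three inverse classes, the following sketch collects every character each one matches. The input strings and variable names are assumptions chosen for the example.

```js
// \W: everything that is not a Latin letter, digit or underscore.
const nonWord = "Hi, Bob_7!".match(/\W/g);
console.log(nonWord); // [",", " ", "!"]

// \S: everything that is not a space symbol (the space and tab are skipped).
const nonSpace = "a b\tc".match(/\S/g);
console.log(nonSpace); // ["a", "b", "c"]

// \D: everything that is not a digit.
const nonDigit = "4x4".match(/\D/g);
console.log(nonDigit); // ["x"]
```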
| 86 | + |
| 87 | +At the beginning of the chapter we saw how to make a number-only phone number from a string like `subject:+7(903)-123-45-67`: find all digits and join them. |
| 88 | + |
| 89 | +```js run |
| 90 | +let str = "+7(903)-123-45-67"; |
| 91 | + |
| 92 | +alert( str.match(/\d/g).join('') ); // 79031234567 |
| 93 | +``` |
| 94 | + |
| 95 | +An alternative, shorter way is to find non-digits `pattern:\D` and remove them from the string: |
| 96 | + |
| 97 | +```js run |
| 98 | +let str = "+7(903)-123-45-67"; |
| 99 | + |
| 100 | +alert( str.replace(/\D/g, "") ); // 79031234567 |
| 101 | +``` |
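The two approaches differ when the string contains no digits at all: `match` with the `pattern:g` flag returns `null` if nothing matches, so calling `.join('')` on the result would throw, while `replace` always returns a string. A minimal sketch, where `digitsOnly` is a hypothetical helper name introduced only for this example:

```js
// Hypothetical helper (not part of the original text):
// strips every non-digit character from a string.
function digitsOnly(str) {
  // replace() returns a string even when no digits remain
  return str.replace(/\D/g, "");
}

console.log(digitsOnly("+7(903)-123-45-67")); // "79031234567"
console.log(digitsOnly("no digits here"));    // ""

// By contrast, match() with the g flag returns null when nothing matches,
// so .join('') on this result would throw a TypeError.
console.log("no digits here".match(/\d/g)); // null
```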
| 102 | + |
| 103 | +## A dot is "any character" |
| 104 | + |
| 105 | +A dot `pattern:.` is a special character class that matches "any character except a newline". |
| 106 | + |
| 107 | +For instance: |
| 108 | + |
| 109 | +```js run |
| 110 | +alert( "Z".match(/./) ); // Z |
| 111 | +``` |
| 112 | + |
| 113 | +Or in the middle of a regexp: |
| 114 | + |
| 115 | +```js run |
| 116 | +let regexp = /CS.4/; |
| 117 | + |
| 118 | +alert( "CSS4".match(regexp) ); // CSS4 |
| 119 | +alert( "CS-4".match(regexp) ); // CS-4 |
| 120 | +alert( "CS 4".match(regexp) ); // CS 4 (space is also a character) |
| 121 | +``` |
| 122 | + |
| 123 | +Please note that a dot means "any character", but not the "absence of a character". There must be a character to match it: |
| 124 | + |
| 125 | +```js run |
| 126 | +alert( "CS4".match(/CS.4/) ); // null, no match because there's no character for the dot |
| 127 | +``` |
| 128 | + |
| 129 | +### Dot as literally any character with "s" flag |
| 130 | + |
| 131 | +By default, a dot doesn't match the newline character `\n`. |
| 132 | + |
| 133 | +For instance, the regexp `pattern:A.B` matches `match:A`, and then `match:B` with any character between them, except a newline `\n`: |
| 134 | + |
| 135 | +```js run |
| 136 | +alert( "A\nB".match(/A.B/) ); // null (no match) |
| 137 | +``` |
| 138 | + |
| 139 | +There are many situations when we'd like a dot to mean literally "any character", newline included. |
| 140 | + |
| 141 | +That's what the `pattern:s` flag does. If a regexp has it, then a dot `pattern:.` matches literally any character: |
| 142 | + |
| 143 | +```js run |
| 144 | +alert( "A\nB".match(/A.B/s) ); // A\nB (match!) |
| 145 | +``` |
| 146 | + |
| 147 | +````warn header="Not supported in Firefox, IE, Edge" |
| 148 | +Check <https://caniuse.com/#search=dotall> for the most recent state of support. At the time of writing it doesn't include Firefox, IE, Edge. |
| 149 | + |
| 150 | +Luckily, there's an alternative that works everywhere. We can use a regexp like `pattern:[\s\S]` to match "any character". |
| 151 | + |
| 152 | +```js run |
| 153 | +alert( "A\nB".match(/A[\s\S]B/) ); // A\nB (match!) |
| 154 | +``` |
| 155 | + |
| 156 | +The pattern `pattern:[\s\S]` literally says: "a space character OR not a space character". In other words, "anything". We could use another pair of complementary classes, such as `pattern:[\d\D]`; it doesn't matter which. |
| 157 | + |
| 158 | +This trick works everywhere. We can also use it when we don't want to set the `pattern:s` flag, for example when the same pattern also needs a regular "no-newline" dot. |
| 159 | +```` |
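To illustrate mixing the two in one pattern, here is a small sketch: the plain dot still refuses a newline, while `pattern:[\s\S]` accepts one. The variable name `mixed` and the test strings are assumptions made for this example.

```js
// First position: a regular dot (no newline allowed).
// Second position: [\s\S] (any character, newline included).
const mixed = /A.[\s\S]B/;

console.log(mixed.test("A1\nB"));  // true: "1" for the dot, "\n" for [\s\S]
console.log(mixed.test("A12B"));   // true: [\s\S] matches ordinary characters too
console.log(mixed.test("A\n\nB")); // false: the dot cannot match the first "\n"
```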
| 160 | + |
| 161 | +````warn header="Pay attention to spaces" |
| 162 | +Usually we pay little attention to spaces. To us, the strings `subject:1-5` and `subject:1 - 5` are nearly identical. |
| 163 | + |
| 164 | +But if a regexp doesn't take spaces into account, it may fail to work. |
| 165 | + |
| 166 | +Let's try to find digits separated by a hyphen: |
| 167 | + |
| 168 | +```js run |
| 169 | +alert( "1 - 5".match(/\d-\d/) ); // null, no match! |
| 170 | +``` |
| 171 | + |
| 172 | +Let's fix it by adding spaces to the regexp `pattern:\d - \d`: |
| 173 | + |
| 174 | +```js run |
| 175 | +alert( "1 - 5".match(/\d - \d/) ); // 1 - 5, now it works |
| 176 | +// or we can use \s class: |
| 177 | +alert( "1 - 5".match(/\d\s-\s\d/) ); // 1 - 5, also works |
| 178 | +``` |
| 179 | + |
| 180 | +**A space is a character, equal in importance to any other character.** |
| 181 | + |
| 182 | +We can't add or remove spaces from a regular expression and expect it to work the same. |
| 183 | + |
| 184 | +In other words, in a regular expression all characters matter, spaces too. |
| 185 | +```` |
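The literal-space and `pattern:\s` variants are not interchangeable either. A short sketch, assuming a tab-separated input made up for this example, shows where they diverge:

```js
// The digits here are separated by tabs, not by space characters.
const tabbed = "1\t-\t5";

// A literal space in the pattern matches only the space character.
console.log(tabbed.match(/\d - \d/)); // null

// \s matches any space symbol, so tabs are fine.
const viaClass = tabbed.match(/\d\s-\s\d/);
console.log(viaClass[0] === tabbed); // true

// Neither pattern matches when the separators are missing entirely:
// each \s still requires exactly one character.
console.log("1-5".match(/\d\s-\s\d/)); // null
```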
| 186 | + |
| 187 | +## Summary |
| 188 | + |
| 189 | +The following character classes exist: |
| 190 | + |
| 191 | +- `pattern:\d` -- digits. |
| 192 | +- `pattern:\D` -- non-digits. |
| 193 | +- `pattern:\s` -- space symbols, tabs, newlines. |
| 194 | +- `pattern:\S` -- all but `pattern:\s`. |
| 195 | +- `pattern:\w` -- Latin letters, digits, underscore `'_'`. |
| 196 | +- `pattern:\W` -- all but `pattern:\w`. |
| 197 | +- `pattern:.` -- any character with the regexp `'s'` flag, otherwise any character except a newline `\n`. |
| 198 | + |
| 199 | +...But that's not all! |
| 200 | + |
| 201 | +Unicode, the encoding used by JavaScript for strings, provides many properties for characters, such as which language a letter belongs to (if it's a letter), whether it is a punctuation sign, etc. |
| 202 | + |
| 203 | +We can search by these properties as well. That requires the `pattern:u` flag, covered in the next article. |