Unicode Regex
Regular Expressions: Unicode Regex
What is Unicode in JavaScript?
View Answer:
How does JavaScript handle Unicode in Regex?
View Answer:
let str = "hello js!";
let match = str.match(/\p{L}/gu);
// match will be ["h", "e", "l", "l", "o", "j", "s"]
console.log(match);
What are the implications of the "u" flag in JavaScript Regex?
View Answer:
let str = "😊";
console.log(str.length); // Outputs: 2 (UTF-16 code units; string length is unrelated to the "u" flag)
console.log([...str].length); // Outputs: 1 (the string iterator walks code points)
let regexWithoutU = /^.$/;
console.log(regexWithoutU.test(str)); // Outputs: false (without "u" flag)
let regexWithU = /^.$/u;
console.log(regexWithU.test(str)); // Outputs: true (with "u" flag)
In this example, the string contains a single emoji, which is stored as a UTF-16 surrogate pair. Without the "u" flag, the regex engine treats the two halves of the pair as separate characters, so the regex /^.$/ (which matches a string of exactly one character) fails to match the string. With the "u" flag, the surrogate pair is treated as a single character, so the regex /^.$/u matches the string.
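To make the difference between code units and code points concrete, the sketch below inspects the same emoji both ways. The helper name describe is hypothetical and exists only for this illustration.

```javascript
// Hypothetical helper: report how a string looks as UTF-16 code units
// versus Unicode code points.
function describe(str) {
  return {
    codeUnits: str.length,                           // UTF-16 code units
    codePoints: Array.from(str).length,              // the string iterator yields code points
    firstCodePoint: str.codePointAt(0).toString(16), // hex value of the first code point
  };
}

const info = describe("\u{1F60A}"); // 😊
console.log(info); // { codeUnits: 2, codePoints: 1, firstCodePoint: '1f60a' }
```

The "u" flag makes the regex engine count the way codePoints does, while str.length always counts code units.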
What does the \p{} notation in JavaScript Regex do?
View Answer:
let str = "hello 123! 你好 नमस्ते";
let match = str.match(/\p{Script=Latin}/gu);
console.log(match);
// Outputs: ["h", "e", "l", "l", "o"]
What is a surrogate pair in JavaScript Unicode handling?
View Answer:
let str = "\uD83D\uDE00"; // This is the surrogate pair for 😀 (U+1F600)
console.log(str); // Outputs: 😀
let regex = /\uD83D\uDE00/u;
console.log(regex.test(str)); // Outputs: true
In this example, the string str uses a surrogate pair to represent the grinning face emoji 😀 (U+1F600). The regular expression /\uD83D\uDE00/u uses the same surrogate pair to match this emoji. The u flag enables full Unicode matching, which treats the surrogate pair as a single character.
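The two halves of a surrogate pair follow from simple arithmetic on the code point. The sketch below shows that calculation; toSurrogatePair is a hypothetical helper name, not a built-in.

```javascript
// Split a code point above U+FFFF into its UTF-16 surrogate pair.
// toSurrogatePair is a hypothetical helper used for illustration.
function toSurrogatePair(codePoint) {
  if (codePoint < 0x10000 || codePoint > 0x10ffff) {
    throw new RangeError("code point is not in the astral range");
  }
  const offset = codePoint - 0x10000;
  const high = 0xd800 + (offset >> 10);  // top 10 bits of the offset
  const low = 0xdc00 + (offset & 0x3ff); // bottom 10 bits of the offset
  return [high, low];
}

const [high, low] = toSurrogatePair(0x1f600);
console.log(high.toString(16), low.toString(16));            // d83d de00
console.log(String.fromCharCode(high, low) === "\u{1F600}"); // true
```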
What is a Unicode property escape in JavaScript?
View Answer:
Can you use the Unicode range in JavaScript regex?
View Answer:
let str = "hello, 你好, नमस्ते!";
let match = str.match(/[\u4e00-\u9fff]+/gu);
console.log(match);
// Outputs: [ '你好' ]
In this example, the regex /[\u4e00-\u9fff]+/gu matches any run of characters in the range U+4E00 to U+9FFF (the CJK Unified Ideographs block), which includes most common Chinese characters. The g flag makes the regex match globally, and the u flag enables full Unicode matching.
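Four-digit \uXXXX escapes cannot express a range above U+FFFF. With the u flag, the \u{...} form accepts any code point, so a class can span astral characters. The sketch below uses the Emoticons block (U+1F600 to U+1F64F) purely as an illustrative range.

```javascript
// With the "u" flag, \u{...} escapes accept code points above U+FFFF,
// so a character class can cover an astral range.
const emoticons = /[\u{1F600}-\u{1F64F}]/gu;

const text = "ok \u{1F600} fine \u{1F64F} done";
const found = text.match(emoticons);
console.log(found.length); // 2
```

Without the u flag, the same pattern is not read as a code point range, so the flag is required for this form.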
How does the "u" flag change the behavior of \b in JavaScript?
View Answer:
let str = "café";
let regexWithoutU = /\bcafé\b/;
console.log(regexWithoutU.test(str)); // Outputs: false (é is not a \w character, so there is no boundary after it)
let regexWithU = /\bcafé\b/u;
console.log(regexWithU.test(str)); // Outputs: false (the "u" flag does not make \b or \w Unicode-aware)
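Because \b is defined in terms of \w, which covers only ASCII letters, digits, and the underscore even with the u flag, a Unicode-aware boundary has to be built by hand. One possible sketch uses lookarounds with property escapes. The helper wholeWord and the chosen definition of a word character (letter, mark, number, or underscore) are assumptions made for this example.

```javascript
// A Unicode-aware stand-in for \b, built from lookarounds.
// Assumption: a "word character" is a letter, mark, number, or underscore.
const WORD = "[\\p{L}\\p{M}\\p{N}_]";

// wholeWord is a hypothetical helper; `word` must not contain regex metacharacters.
function wholeWord(word) {
  return new RegExp(`(?<!${WORD})${word}(?!${WORD})`, "u");
}

const cafe = "caf\u00E9"; // "café"
console.log(/\bcaf\u00E9\b/u.test(cafe));               // false: \b stays ASCII-based
console.log(wholeWord(cafe).test("un caf\u00E9!"));     // true
console.log(wholeWord(cafe).test("caf\u00E9s"));        // false: followed by a letter
```

This relies on lookbehind support, which is part of ES2018.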
What is an astral symbol in relation to Regex?
View Answer:
Here's a simple JavaScript code example illustrating how the 'u' flag in regex allows astral symbols to be matched as single characters:
let str = "𝌆"; // This is an astral symbol
let regexWithoutU = /./g; // Regex without 'u' flag
let matchWithoutU = str.match(regexWithoutU);
console.log(matchWithoutU.length); // Outputs: 2, because each half of the surrogate pair is matched as a separate character
let regexWithU = /./gu; // Regex with 'u' flag
let matchWithU = str.match(regexWithU);
console.log(matchWithU.length); // Outputs: 1, because the astral symbol is matched as a single character
In this example, you can see how the 'u' flag enables the regex to treat the astral symbol as a single character instead of two separate characters.
How does JavaScript handle astral symbols in Regex?
View Answer:
let str = "I love 🍕!";
let regexWithoutU = /love [🍕]!/;
console.log(regexWithoutU.test(str)); // Outputs: false (without "u" flag)
let regexWithU = /love [🍕]!/u;
console.log(regexWithU.test(str)); // Outputs: true (with "u" flag)
A literal /🍕/ matches with or without the "u" flag, because the pattern and the string contain the same two code units. The difference appears inside a character class: without the "u" flag, [🍕] is a class containing the two surrogate halves and matches only one of them, so the regex fails to match. With the "u" flag, the astral symbol (pizza emoji) is correctly treated as a single character, and the regex successfully matches the string.
How can you match any Unicode letter in JavaScript Regex?
View Answer:
let str = "hello, 你好, नमस्ते!";
let match = str.match(/\p{L}/gu);
console.log(match);
// Outputs: ['h', 'e', 'l', 'l', 'o', '你', '好', 'न', 'म', 'स', 'त']
// (the virama ् and the vowel sign े are combining marks, \p{M}, not letters)
What does JavaScript's \p{Script=} do in Regex?
View Answer:
let str = "こんにちは (Hello in Japanese Hiragana)";
let match = str.match(/\p{Script=Hiragana}/gu);
console.log(match);
// Outputs: [ 'こ', 'ん', 'に', 'ち', 'は' ]
In this example, the regex /\p{Script=Hiragana}/gu matches any character from the Hiragana script. The g flag makes the regex match globally, and the u flag enables full Unicode matching. It matches all the Hiragana letters in the string.
Can JavaScript regex match emoji using Unicode?
View Answer:
let str = "I love 🍕!";
let regex = /\p{Emoji}/u;
console.log(regex.test(str)); // Outputs: true
// Alternatively, match one specific emoji literally:
let pizzaRegex = /🍕/u;
console.log(pizzaRegex.test(str)); // Outputs: true
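One caveat with \p{Emoji}: the Unicode Emoji property also covers ASCII digits, "#", and "*", because they can start keycap sequences. The sketch below compares it with \p{Extended_Pictographic}, which is one possible stricter choice when plain digits should not count as emoji.

```javascript
const emoji = /\p{Emoji}/u;
const pictographic = /\p{Extended_Pictographic}/u;

console.log(emoji.test("Room 42"));                  // true: digits carry the Emoji property
console.log(pictographic.test("Room 42"));           // false
console.log(pictographic.test("I love \u{1F355}!")); // true (U+1F355 is the pizza emoji)
```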
How can you match all whitespace characters, including Unicode spaces, in JavaScript regex?
View Answer:
let str = "Hello \t\n\u{2003}World!"; // Normal space, tab, newline, and em space characters
let match = str.match(/\p{White_Space}/gu);
console.log(match);
// Outputs: [' ', '\t', '\n', ' '] (the last entry is the em space, U+2003)
In this example, the regex /\p{White_Space}/gu matches any Unicode whitespace character in the string. The \p{White_Space} is a Unicode property escape that matches any kind of whitespace character as defined by Unicode, including regular spaces, tabs, newlines, and other types of spaces like the em space. The g flag makes the regex match globally, and the u flag enables full Unicode matching. It matches all the different types of spaces in the string.
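The built-in \s class already matches most Unicode spaces, including the em space, but its set is not identical to White_Space. The sketch below shows two code points where the classes disagree.

```javascript
const NEL = "\u0085"; // NEXT LINE: White_Space in Unicode, but not matched by \s
const BOM = "\uFEFF"; // ZERO WIDTH NO-BREAK SPACE: matched by \s, but not White_Space

console.log(/\s/u.test(NEL), /\p{White_Space}/u.test(NEL)); // false true
console.log(/\s/u.test(BOM), /\p{White_Space}/u.test(BOM)); // true false
console.log(/\s/u.test("\u2003"));                          // true: em space
```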
Can you perform Unicode case-insensitive matching in JavaScript regex?
View Answer:
let str = "Hello hElLo HELLO";
let regex = /hello/giu;
console.log(str.match(regex));
// Outputs: ['Hello', 'hElLo', 'HELLO']
In this example, the regular expression /hello/giu matches the word "hello" in any case. The i flag makes the regex case-insensitive, the g flag makes it match globally, and the u flag enables full Unicode matching. It matches all variations of "hello" in the string, regardless of their case.
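For ASCII input like the example above, the u flag does not change the result. It matters for characters whose case mapping goes through Unicode case folding. A small sketch using the Kelvin sign, which folds to a lowercase "k":

```javascript
const KELVIN = "\u212A"; // KELVIN SIGN, visually similar to "K"

console.log(/k/i.test(KELVIN));  // false: without "u", the two are not considered equal
console.log(/k/iu.test(KELVIN)); // true: with "u", Unicode case folding maps U+212A to "k"
```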
What's the impact of using the dot (.) in a JavaScript regex with the "u" flag?
View Answer:
let str = "😄"; // An astral symbol
let regexWithoutU = /^.$/;
console.log(regexWithoutU.test(str)); // Outputs: false (without "u" flag)
let regexWithU = /^.$/u;
console.log(regexWithU.test(str)); // Outputs: true (with "u" flag)
In this example, the emoji is a Unicode astral symbol represented by a surrogate pair in JavaScript. Without the "u" flag, JavaScript treats the surrogate pair as two separate characters, so the regex /^.$/ fails to match. However, with the "u" flag, JavaScript treats the surrogate pair as a single character, so the regex /^.$/u matches successfully.
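Even with the u flag, the dot matches one code point, not one user-perceived character, and it still skips line terminators. The sketch below illustrates both limits. The base-plus-marks pattern is only a rough approximation of a grapheme, offered as an assumption rather than a complete solution.

```javascript
const decomposed = "e\u0301"; // "e" + COMBINING ACUTE ACCENT, renders as é

console.log(/^.$/u.test(decomposed));           // false: two code points
// One non-mark code point followed by any number of combining marks.
console.log(/^\P{M}\p{M}*$/u.test(decomposed)); // true

console.log(/^.$/u.test("\n"));  // false: "." does not match line terminators
console.log(/^.$/su.test("\n")); // true with the dotAll "s" flag
```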
What's the significance of Unicode normalization in JavaScript?
View Answer:
let str1 = "caf\u00E9"; // Composed form (é is one code point, U+00E9)
let str2 = "cafe\u0301"; // Decomposed form (e followed by the combining acute accent U+0301)
console.log(str1 === str2); // Outputs: false (not normalized)
// Normalize to composed form (NFC)
console.log(str1.normalize("NFC") === str2.normalize("NFC")); // Outputs: true
In this example, str1 and str2 look identical but are represented differently at the Unicode level. Without normalization, JavaScript considers them different strings. However, by normalizing to the same form ("NFC" for composed form), they are recognized as the same string. This is particularly important for string comparisons and when working with international text.
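The same issue affects regex matching: a pattern written with a composed character does not match decomposed input. A minimal sketch, assuming the input can safely be normalized to NFC before testing:

```javascript
const pattern = /^caf\u00E9$/u;       // pattern written with the composed é
const decomposedInput = "cafe\u0301"; // input with e + combining acute accent

console.log(pattern.test(decomposedInput));                  // false
console.log(pattern.test(decomposedInput.normalize("NFC"))); // true
```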
How many bytes are Unicode characters?
View Answer:
// Both characters report a length of 2 rather than 1: they lie outside the
// Basic Multilingual Plane, so each takes two UTF-16 code units (4 bytes)
console.log('😄'.length); // 2
console.log('𝒳'.length); // 2
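The byte count depends on the encoding. JavaScript strings use UTF-16 internally, while files and network payloads usually use UTF-8. The sketch below counts UTF-8 bytes with the standard TextEncoder global; utf8Bytes is a hypothetical helper name.

```javascript
// utf8Bytes is a hypothetical helper; TextEncoder is a standard global
// in modern browsers and Node.js and always encodes to UTF-8.
function utf8Bytes(str) {
  return new TextEncoder().encode(str).length;
}

// "a", "é", "你", "😄"
for (const ch of ["a", "\u00E9", "\u4F60", "\u{1F604}"]) {
  console.log(ch, "code units:", ch.length, "UTF-8 bytes:", utf8Bytes(ch));
}
// a: 1 code unit, 1 byte
// é: 1 code unit, 2 bytes
// 你: 1 code unit, 3 bytes
// 😄: 2 code units, 4 bytes
```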
How are Unicode properties expressed in regular expressions?
View Answer:
let str = 'A ბ ㄱ';
console.log(str.match(/\p{L}/gu)); // output: A,ბ,ㄱ
console.log(str.match(/\p{L}/g)); // output: null
// null (no matches, \p does not work without the flag "u")
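Property escapes also accept subcategories, long names, and a negated form \P{...}. A short sketch of those variants, using a made-up sample string:

```javascript
const sample = "Hello, World 42";

const upper = sample.match(/\p{Lu}/gu);  // uppercase letters only
const digits = sample.match(/\p{Nd}/gu); // decimal digits only
const lettersOnly = sample.replace(/\P{L}/gu, ""); // \P{L} matches anything that is not a letter

console.log(upper);       // [ 'H', 'W' ]
console.log(digits);      // [ '4', '2' ]
console.log(lettersOnly); // HelloWorld

// The long form names the property and value explicitly ("\u0416" is Cyrillic Ж).
console.log(/\p{General_Category=Uppercase_Letter}/u.test("\u0416")); // true
```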
Is there a way to find or match a Hexadecimal number using Unicode properties?
View Answer:
let regexp = /x\p{Hex_Digit}\p{Hex_Digit}/u;
console.log('number: xAF'.match(regexp)); // ["xAF"]
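Note that Hex_Digit also includes fullwidth forms such as "Ａ" (U+FF21). When only ASCII hex digits are acceptable, for example when validating input that will be parsed with parseInt, ASCII_Hex_Digit may be the safer property. A brief sketch:

```javascript
const fullwidthA = "\uFF21"; // FULLWIDTH LATIN CAPITAL LETTER A

console.log(/\p{Hex_Digit}/u.test(fullwidthA));       // true
console.log(/\p{ASCII_Hex_Digit}/u.test(fullwidthA)); // false
console.log(/^\p{ASCII_Hex_Digit}+$/u.test("AF09"));  // true
```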
What approach should we use to handle script-based languages, like Chinese, in regular expressions?
View Answer:
let regexp = /\p{sc=Han}/gu; // matches Han (Chinese) characters
let str = `Hello Привет 你好 123_456`;
console.log(str.match(regexp)); // 你,好
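Script properties can be quantified and combined in a character class, which returns whole words instead of single characters. A small sketch on the same kind of mixed-script input:

```javascript
const mixed = "Hello Привет 你好 123_456";

// Runs of Han characters.
const hanRuns = mixed.match(/\p{sc=Han}+/gu);
console.log(hanRuns); // [ '你好' ]

// Runs of either Han or Cyrillic characters.
const hanOrCyrillic = mixed.match(/[\p{sc=Han}\p{sc=Cyrillic}]+/gu);
console.log(hanOrCyrillic); // [ 'Привет', '你好' ]
```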
What Unicode property should we use to match currency symbols in regular expressions?
View Answer:
let regexp = /\p{Sc}\d/gu;
let str = `Prices: $2, €1, ¥9`;
console.log(str.match(regexp)); // $2,€1,¥9
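The pattern above captures a single digit after the symbol. A possible extension for multi-digit and decimal amounts is sketched below. The price format (symbol first, optional decimal part) is an assumption for this example.

```javascript
// \p{Sc} matches any currency symbol; the amount allows an optional decimal part.
const price = /\p{Sc}\d+(?:\.\d+)?/gu;

// "\u20AC" is the euro sign and "\u00A5" is the yen sign.
const prices = "Prices: $2.50, \u20AC100, \u00A59".match(price);
console.log(prices); // [ '$2.50', '€100', '¥9' ]
```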