

id: 5441479c-12e5-48c1-ab17-46a9c3df8bc3
title: JavaScript RegExp Unicode Property Escapes

Description

Characters can be matched by naming one of their Unicode character properties inside \p{...}. The regular expression must carry the u flag for \p{...} to be recognized.

Proposal

RegExp Unicode Property Escapes [1]

Syntax

console.log(/^\p{White_Space}+$/u.test("\t \n\r")); // true
console.log(/^\p{Script=Greek}+$/u.test("μετά")); // true
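The two forms above (a binary property by name, and a Name=Value pair) each have a negated counterpart written with an uppercase \P. A minimal sketch, with variable names chosen only for illustration:

```javascript
// \P{...} (uppercase P) matches characters that do NOT have the property.
const noWhiteSpace = /^\P{White_Space}+$/u;
console.log(noWhiteSpace.test("abc")); // true
console.log(noWhiteSpace.test("a b")); // false

// Non-binary properties are written as \p{Name=Value}.
const greekOnly = /^\p{Script=Greek}+$/u;
console.log(greekOnly.test("αβγ")); // true
console.log(greekOnly.test("abc")); // false

// The negated form works for Name=Value pairs as well.
const noGreek = /^\P{Script=Greek}+$/u;
console.log(noGreek.test("abc")); // true
```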

Properties

Examples

Name

A unique name, composed of uppercase letters, digits, hyphens and spaces.

  • A: Name = LATIN CAPITAL LETTER A
  • 🙂: Name = SLIGHTLY SMILING FACE

General_Category

Classifies every character into one category, such as letter, number, punctuation or symbol.

  • x: General_Category = Lowercase_Letter
  • $: General_Category = Currency_Symbol
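In a pattern the property and its values are written with underscores, and General_Category is the one property whose name may be omitted. A short sketch of the equivalent spellings (the constant names are illustrative only):

```javascript
// Three equivalent ways to ask for General_Category = Lowercase_Letter.
const lowerLong = /^\p{General_Category=Lowercase_Letter}$/u;
const lowerShort = /^\p{Lowercase_Letter}$/u; // property name omitted
const lowerAlias = /^\p{Ll}$/u;               // two-letter alias

// Currency_Symbol also has a two-letter alias: \p{Sc}.
const currency = /^\p{Currency_Symbol}$/u;

console.log(lowerLong.test("x"), lowerShort.test("x"), lowerAlias.test("x")); // true true true
console.log(currency.test("$")); // true
console.log(currency.test("x")); // false
```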

White_Space

Marks invisible spacing characters, such as spaces, tabs and newlines.

  • \t (tab): White_Space = True
  • π: White_Space = False
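The two bullet points can be checked directly with the binary property White_Space. The helpers below are hypothetical names introduced for this sketch, not part of any API:

```javascript
// Hypothetical helper: true if `ch` is a single White_Space character.
const isWhiteSpace = (ch) => /^\p{White_Space}$/u.test(ch);

console.log(isWhiteSpace("\t")); // true
console.log(isWhiteSpace("π"));  // false

// Hypothetical helper: collapse every run of White_Space characters
// (including the no-break space U+00A0) into one ordinary space.
const collapseWhiteSpace = (text) =>
  text.replace(/\p{White_Space}+/gu, " ").trim();

console.log(collapseWhiteSpace("a\t\n b\u00A0c")); // "a b c"
```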

Age

The version of the Unicode Standard in which a character was introduced. For example, the euro sign € was added in version 2.1 of the Unicode Standard.

  • €: Age = 2.1

Block

A contiguous range of code points. Blocks don't overlap and their names are unique.

  • S: Block = Basic_Latin (range U+0000..U+007F)
  • 🙂: Block = Emoticons (range U+1F600..U+1F64F)
  • 🙂: Block = Emoticons (range U+1F600..U+1F64F)
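The ECMAScript flavour of \p{...} covers General_Category, Script, Script_Extensions and the binary properties, so a pattern such as \p{Block=Emoticons} is expected to be rejected. Because a block is just a contiguous range, it can be tested by code point instead. A minimal sketch, in which isSupported and inBlock are hypothetical helpers:

```javascript
// Hypothetical helper: does the engine accept this pattern with the u flag?
function isSupported(pattern) {
  try {
    new RegExp(pattern, "u");
    return true;
  } catch (err) {
    return false; // an unknown property name raises a SyntaxError
  }
}

console.log(isSupported("\\p{Script=Greek}"));    // true
console.log(isSupported("\\p{Block=Emoticons}")); // false

// Hypothetical helper: is the first code point of `ch` within [start, end]?
function inBlock(ch, start, end) {
  const codePoint = ch.codePointAt(0);
  return codePoint >= start && codePoint <= end;
}

const isBasicLatin = (ch) => inBlock(ch, 0x0000, 0x007f);
const isEmoticon = (ch) => inBlock(ch, 0x1f600, 0x1f64f);

console.log(isBasicLatin("S"));  // true
console.log(isEmoticon("🙂"));   // true
console.log(isEmoticon("S"));    // false

// The same range expressed as a character class with code point escapes.
const emoticonPattern = /^[\u{1F600}-\u{1F64F}]$/u;
console.log(emoticonPattern.test("🙂")); // true
```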

Script

A collection of characters used by one or more writing systems.

  • Some scripts support several writing systems. For example, the Latin script supports the writing systems English, French, German, Latin, etc.
  • Some languages can be written in multiple alternate writing systems that are supported by multiple scripts. For example, Turkish used the Arabic script before it transitioned to the Latin script in the early 20th century.
Examples:

  • α: Script = Greek
  • Д: Script = Cyrillic
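The Script examples translate to \p{Script=...}, which can also be abbreviated to \p{sc=...}. The sketch below adds a hypothetical scriptOf helper that probes a small, hand-picked candidate list; it illustrates the syntax and is not a complete script detector:

```javascript
const greek = /^\p{Script=Greek}$/u;
const cyrillic = /^\p{sc=Cyrillic}$/u; // sc is the short alias for Script

console.log(greek.test("α"));    // true
console.log(cyrillic.test("Д")); // true
console.log(greek.test("Д"));    // false

// Hypothetical helper: return the first candidate script that `ch`
// belongs to, or null if none of the candidates matches.
function scriptOf(ch, candidates = ["Latin", "Greek", "Cyrillic", "Arabic"]) {
  const match = candidates.find((name) =>
    new RegExp(`^\\p{Script=${name}}$`, "u").test(ch)
  );
  return match ?? null;
}

console.log(scriptOf("α")); // "Greek"
console.log(scriptOf("Д")); // "Cyrillic"
console.log(scriptOf("1")); // null (digits belong to the Common script)
```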

Footnotes