ES2018 - Unicode with Regex

OK so, this is something pretty cool that we can do with regular expressions, aka RegEx, and it kinda blew my mind. Is it worthwhile to learn, you ask? Not only is it worthwhile, it's even fun! So shall we get started? Yes we shall.

Are you familiar with the u flag in RegEx? Long story short, it switches the pattern to Unicode mode, so it works on whole code points (emoji included) rather than on UTF-16 code units. For example, this amazing little piece of code:

let a = /😀$/u.test('😀'); // true
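To make the effect of the u flag a bit more visible, here is a minimal sketch (the variable names are just for illustration). It relies on the fact that this emoji is stored as two UTF-16 code units, so the dot treats it differently with and without the flag:

```javascript
// U+1F600 is stored as a surrogate pair, so the string has length 2.
const face = '😀';

// Without the u flag, "." matches a single UTF-16 code unit,
// so the emoji does not count as one character.
const withoutU = /^.$/.test(face);

// With the u flag, "." matches a whole code point.
const withU = /^.$/u.test(face);

console.log(face.length, withoutU, withU); // 2 false true
```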

Funny? Sure. Amusing? Why not? Useful? Maybe for Grandmas. But what do you think about this code, which checks whether a text string is written in Hebrew?

let a = /^\p{Script=Hebrew}+$/u.test('עברית'); // true

Did I get your attention? This is what ES2018 introduced: Unicode property escapes. And there are a lot of properties to choose from! The format is really easy:

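\p{UnicodePropertyName=UnicodePropertyValue}
\p{LoneUnicodePropertyNameOrValue}

There's a \p and then braces with the Unicode property inside them. It can be the name of a property together with a value, or a lone name or value: a binary property such as Alphabetic, or a General_Category value such as Lowercase_Letter.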

let a = /\p{General_Category=Lowercase_Letter}$/u.test('a'); // true
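As a quick sketch of how flexible the format is, the same check can be spelled several ways. The short aliases gc and Ll come from the Unicode property alias tables; the patterns array is only an illustrative name:

```javascript
// Four spellings of "a lowercase letter":
// full name=value, short aliases, and lone General_Category values.
const patterns = [
  /^\p{General_Category=Lowercase_Letter}$/u,
  /^\p{gc=Ll}$/u,
  /^\p{Lowercase_Letter}$/u,
  /^\p{Ll}$/u,
];

const lowerResults = patterns.map((re) => re.test('a'));
const upperResults = patterns.map((re) => re.test('A'));

console.log(lowerResults); // [ true, true, true, true ]
console.log(upperResults); // [ false, false, false, false ]
```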

Unicode is a world of its own, so I’ll try to stay focused on the task at hand. This site has a full list of Unicode properties.

In general, there are a few categories of properties. The first is Script, which we've already seen:

let a = /^\p{Script=Greek}+$/u.test('μετά') // true

Here we're pretty much talking about writing systems. With a Unicode property we don't need any mumbo jumbo, such as hand-written character ranges, to detect one. The RegEx verifies that every character in the string belongs to the given script. That is to say, if the string contains anything from outside that script, such as Latin letters or even a space, it won't be identified as Hebrew.

let a = /^\p{Script=Hebrew}+$/u.test('שתי מילים') // false
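If spaces between words should be accepted, one possible approach (a sketch, not the only way) is to put the property escape inside a character class together with \s. The constant names are made up for this example:

```javascript
// Strict: every character must belong to the Hebrew script.
const hebrewOnly = /^\p{Script=Hebrew}+$/u;

// Relaxed: Hebrew characters or whitespace are allowed.
const hebrewWithSpaces = /^[\p{Script=Hebrew}\s]+$/u;

const text = 'שתי מילים';

const strictResult = hebrewOnly.test(text);                // false, because of the space
const relaxedResult = hebrewWithSpaces.test(text);         // true
const latinResult = hebrewWithSpaces.test('two words');    // false, Latin letters

console.log(strictResult, relaxedResult, latinResult);
```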

If we want to classify characters by type rather than by writing system, we can move on to our second category, General_Category. Here we can find all kinds of interesting things, like hyphens and dashes for example.

let a = /^\p{General_Category=Dash_Punctuation}+$/u.test('-־') // true

Or another example is currency symbols:

let a = /^\p{General_Category=Currency_Symbol}+$/u.test('$') // true

Let me remind you that each \p{...} matches exactly one character, so you can put it wherever you want in a pattern. For instance, if I want to check for a currency symbol followed by a number, e.g. $400 or ₪400, I can do something like this:

let a = /^\p{General_Category=Currency_Symbol}[0-9]+$/u.test('₪400') // true

In that example the Unicode property escape is just a small part of the full regular expression.
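Taking this one step further, capture groups can pull the symbol and the number apart. This is only a sketch: parsePrice is a hypothetical helper name, and it assumes the symbol comes first and the amount is a plain run of ASCII digits:

```javascript
// Hypothetical helper: splits a string like "₪400" into symbol and amount.
// Returns null when the text is not "<currency symbol><digits>".
function parsePrice(text) {
  const match = /^(\p{General_Category=Currency_Symbol})([0-9]+)$/u.exec(text);
  if (match === null) {
    return null;
  }
  return { symbol: match[1], amount: Number(match[2]) };
}

console.log(parsePrice('₪400')); // { symbol: '₪', amount: 400 }
console.log(parsePrice('$12'));  // { symbol: '$', amount: 12 }
console.log(parsePrice('400'));  // null
```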

We can use a capital P for negation—for instance any character that is not a currency symbol:

let a = /^\P{General_Category=Currency_Symbol}$/u.test('₪') // false

Since ₪ is a currency symbol, \P{General_Category=Currency_Symbol} refuses to match it, and the test returns false.
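Negation is also handy for cleaning up text. The sketch below (the sample string and variable names are made up) strips everything that is not a letter, and shows that \P{...} behaves like a negated character class around \p{...}:

```javascript
// Remove every run of characters that are NOT letters.
const messy = 'a1-b2 c3';
const lettersOnly = messy.replace(/\P{Letter}+/gu, '');
console.log(lettersOnly); // abc

// \P{X} gives the same answer as [^\p{X}].
const viaCapitalP = /^\P{General_Category=Currency_Symbol}$/u.test('₪');
const viaNegatedClass = /^[^\p{General_Category=Currency_Symbol}]$/u.test('₪');
console.log(viaCapitalP, viaNegatedClass); // false false
```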

This addition in ES2018 greatly enriches Unicode support in RegEx: we can now match characters by script or by category instead of spelling out ranges by hand, which makes our patterns both broader and more precise.

About the author: Ran Bar-Zik is an experienced web developer whose personal blog, Internet Israel, features articles and guides on Node.js, MongoDB, Git, SASS, jQuery, HTML 5, MySQL, and more. Translation of the original article by Aaron Raizen

By Ran Bar-Zik | 9/27/2018 | General