ES2018 - Unicode with Regex
OK, so this is something pretty cool that we can do with regular expressions (aka RegEx), and it kind of blew my mind. Is it worth learning, you ask? Not only is it worth learning, it's even fun! So shall we get started? Yes, we shall.
Are you familiar with the u flag in RegEx? Long story short, it provides support for Unicode: the pattern is matched against whole code points rather than UTF-16 code units, so an emoji counts as a single character. For example, this amazing little piece of code:
let a = /😀$/u.test('😀'); // true
Funny? Sure. Amusing? Why not? Useful? Maybe for grandmas. But what do you think about this code, which checks whether a text string is written in Hebrew?
let a = /^\p{Script=Hebrew}+$/u.test('עברית'); // true
Did I get your attention? This is what ES2018 introduced: Unicode property escapes. And there are a lot of properties to choose from! The format is really easy:
\p{UnicodePropertyName=UnicodePropertyValue}
\p{LoneUnicodePropertyNameOrValue}
There's a \p, followed by braces with the Unicode property inside: either a property name together with a value, or a lone name or value on its own.
let a = /\p{General_Category=Lowercase_Letter}$/u.test('a'); // true
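As a rough sketch of those two forms (the variable names are only for illustration): General_Category values can be written on their own, in either their long or short spelling, and a binary property such as Alphabetic takes no value at all. Script, by contrast, always needs its property name.

```javascript
// Full form: property name plus value.
const fullForm = /^\p{General_Category=Lowercase_Letter}$/u;
// Lone value: General_Category values may omit the property name.
const loneValue = /^\p{Lowercase_Letter}$/u;
// Short alias for the same value.
const shortAlias = /^\p{Ll}$/u;
// Lone name: a binary property has no value to specify.
const binaryProperty = /^\p{Alphabetic}$/u;

const results = [fullForm, loneValue, shortAlias, binaryProperty]
  .map((re) => re.test('a'));
console.log(results.every(Boolean)); // true
console.log(shortAlias.test('A')); // false, 'A' is an uppercase letter
```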
Unicode is a world of its own, so I’ll try to stay focused on the task at hand. This site has a full list of Unicode properties.
In general, there are a few categories of properties. The first is Script, which we've already seen:
let a = /^\p{Script=Greek}+$/u.test('μετά') // true
Here we're pretty much talking about writing systems. With a Unicode property we don't need any mumbo jumbo of hand-written character ranges to detect one. The RegEx verifies that every character belongs to the given writing system. That is to say, if we put a space between two Hebrew words, the string won't be identified as Hebrew, because the space character belongs to the Common script rather than to Hebrew:
let a = /^\p{Script=Hebrew}+$/u.test('שתי מילים') // false
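If the goal is to accept whole Hebrew phrases, one possible approach is to put the property escape inside a character class next to \s. This is a minimal sketch; isHebrewText is a hypothetical helper name, and it treats any whitespace as acceptable.

```javascript
// Hypothetical helper: true when the text contains only
// Hebrew-script characters and whitespace.
function isHebrewText(text) {
  // \p{...} also works inside a character class, so \s can sit next to it.
  return /^[\p{Script=Hebrew}\s]+$/u.test(text);
}

console.log(isHebrewText('שתי מילים')); // true
console.log(isHebrewText('two words')); // false
// The space itself belongs to the Common script, not to Hebrew.
console.log(/^\p{Script=Common}$/u.test(' ')); // true
```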
If we want to match by the kind of character rather than by writing system, we can move on to our second category, General_Category. Here we can find all kinds of interesting things, like hyphens and dashes for example.
let a = /^\p{General_Category=Dash_Punctuation}+$/u.test('-־') // true
Or another example is currency symbols:
let a = /^\p{General_Category=Currency_Symbol}+$/u.test('$') // true
Let me remind you that each \p{...} matches a single character, so you can put it wherever you want in a pattern. For instance, if I want to check for a currency symbol followed by a number, e.g. $400 or ₪400, I can do something like this:
let a = /^\p{General_Category=Currency_Symbol}[0-9]+$/u.test('₪400') // true
In that example the Unicode property escape is just a small part of the full regular expression.
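Building on that, here is a sketch of a parser that accepts the symbol on either side of the digits. parseAmount and AMOUNT are hypothetical names, and the pattern only handles whole numbers with no spaces or separators.

```javascript
// \p{Sc} is the short alias for General_Category=Currency_Symbol.
// First alternative: symbol then digits. Second: digits then symbol.
const AMOUNT = /^(?:(\p{Sc})(\d+)|(\d+)(\p{Sc}))$/u;

// Hypothetical helper: returns { symbol, amount } or null when nothing matches.
function parseAmount(text) {
  const match = AMOUNT.exec(text);
  if (match === null) return null;
  // Groups 1-2 are filled for "₪400", groups 3-4 for "400$".
  const symbol = match[1] !== undefined ? match[1] : match[4];
  const digits = match[2] !== undefined ? match[2] : match[3];
  return { symbol, amount: Number(digits) };
}

console.log(parseAmount('₪400')); // symbol '₪', amount 400
console.log(parseAmount('400$')); // symbol '$', amount 400
console.log(parseAmount('400'));  // null, no currency symbol
```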
We can use a capital P for negation—for instance any character that is not a currency symbol:
let a = /^\P{General_Category=Currency_Symbol}$/u.test('₪') // false
Like \p, the negated \P{...} form still matches exactly one character, so it behaves like the character class [^\p{...}].
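A small sketch of negation in practice (the sample string is made up): removing every character that is not a currency symbol leaves only the symbols behind.

```javascript
const text = 'Prices: $120, ₪400, €35';

// \P{Sc} matches any single character that is NOT a currency symbol,
// so deleting every such character keeps only the symbols.
const symbolsOnly = text.replace(/\P{Sc}/gu, '');
console.log(symbolsOnly); // $₪€

// A negated character class is an equivalent spelling.
const sameResult = text.replace(/[^\p{Sc}]/gu, '');
console.log(sameResult === symbolsOnly); // true
```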
This addition to JavaScript ES2018 greatly enriches the use of Unicode in RegEx and gives us quite a bit more power for a broader and more precise usage of RegEx.
About the author: Ran Bar-Zik is an experienced web developer whose personal blog, Internet Israel, features articles and guides on Node.js, MongoDB, Git, SASS, jQuery, HTML 5, MySQL, and more. Translation of the original article by Aaron Raizen.