String.trim

Problem
JavaScript provides the built-in method trim, and the newer trimStart / trimEnd methods [1]. These native methods make quick work of trimming whitespace and line terminators from the start and end of a string. How does this functionality work, and how would we extend it to trim anything we want?
Exploration
Let's first take a look at how JavaScript does trimming. The ECMAScript standard (2015) [2] describes the trim method as a function that takes a String input and returns a copy of the input "with both leading and trailing white space removed. The definition of white space is the union of WhiteSpace and LineTerminator." So... space bar and enter key, done, right? Not quite. Strings in JavaScript are interpreted as "UTF-16 encoded code points", so we must include every code point the standard classifies as whitespace or as a line terminator. Here are the definitions of whitespace and line terminators:
Table 32 — White Space Code Points [3]
Code Point | Name | Abbreviation |
---|---|---|
U+0009 | CHARACTER TABULATION | TAB |
U+000B | LINE TABULATION | VT |
U+000C | FORM FEED (FF) | FF |
U+0020 | SPACE | SP |
U+00A0 | NO-BREAK SPACE | NBSP |
U+FEFF | ZERO WIDTH NO-BREAK SPACE | ZWNBSP |
Other category “Zs” | Any other Unicode “Separator, space” code point | USP |
Table 33 — Line Terminator Code Points [4]
Code Point | Unicode Name | Abbreviation |
---|---|---|
U+000A | LINE FEED (LF) | LF |
U+000D | CARRIAGE RETURN (CR) | CR |
U+2028 | LINE SEPARATOR | LS |
U+2029 | PARAGRAPH SEPARATOR | PS |
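The two tables can be folded into a single predicate. The following is a minimal sketch, not V8's implementation: the name isWhiteSpaceOrLineTerminator is borrowed from the V8 method shown later in this post, while the constant names are made up for illustration. It assumes an engine that supports Unicode property escapes (\p{Zs}) in regular expressions.

```javascript
// Code points listed explicitly in Tables 32 and 33.
const EXPLICIT_CODE_POINTS = new Set([
  0x0009, 0x000b, 0x000c, 0x0020, 0x00a0, 0xfeff, // white space
  0x000a, 0x000d, 0x2028, 0x2029,                 // line terminators
]);

// Matches the "Zs" (Separator, space) Unicode category; requires the u flag.
const SPACE_SEPARATOR = /^\p{Zs}$/u;

// Illustrative helper: takes a UTF-16 code unit, as returned by charCodeAt,
// and reports whether trim would treat it as removable.
function isWhiteSpaceOrLineTerminator(code) {
  return EXPLICIT_CODE_POINTS.has(code) ||
    SPACE_SEPARATOR.test(String.fromCharCode(code));
}

isWhiteSpaceOrLineTerminator(' '.charCodeAt(0));      // true
isWhiteSpaceOrLineTerminator('\u3000'.charCodeAt(0)); // true (IDEOGRAPHIC SPACE, category Zs)
isWhiteSpaceOrLineTerminator('a'.charCodeAt(0));      // false
```

Working on single code units is enough here because every code point in the "Zs" category lives in the Basic Multilingual Plane.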
There are several implementations of the ES standard in use, but for the purposes of this post, we will use Node.js's V8 implementation [5].
Here's the code that runs when you call .trim() on a string in earlier versions of V8:
Handle<String> String::Trim(Handle<String> string, TrimMode mode) {
  Isolate* const isolate = string->GetIsolate();
  string = String::Flatten(string);
  int const length = string->length();

  // Perform left trimming if requested.
  int left = 0;
  UnicodeCache* unicode_cache = isolate->unicode_cache();
  if (mode == kTrim || mode == kTrimStart) {
    while (left < length &&
           unicode_cache->IsWhiteSpaceOrLineTerminator(string->Get(left))) {
      left++;
    }
  }

  // Perform right trimming if requested.
  int right = length;
  if (mode == kTrim || mode == kTrimEnd) {
    while (
        right > left &&
        unicode_cache->IsWhiteSpaceOrLineTerminator(string->Get(right - 1))) {
      right--;
    }
  }

  return isolate->factory()->NewSubString(string, left, right);
}
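which roughly translates to JavaScript as:
// Trim modes, named after V8's kTrim / kTrimStart / kTrimEnd.
// The values themselves are arbitrary.
const TRIM = 0, TRIMSTART = 1, TRIMEND = 2;

// isWhiteSpaceOrLineTerminator(code) is assumed to return true for the
// code points listed in Tables 32 and 33.
function faux_v8_trim(str, mode) {
  const length = str.length;

  // Trim from the left if requested.
  let left = 0;
  if (mode === TRIM || mode === TRIMSTART) {
    while (left < length && isWhiteSpaceOrLineTerminator(str.charCodeAt(left))) {
      left++;
    }
  }

  // Trim from the right if requested.
  let right = length;
  if (mode === TRIM || mode === TRIMEND) {
    while (right > left && isWhiteSpaceOrLineTerminator(str.charCodeAt(right - 1))) {
      right--;
    }
  }

  return str.substring(left, right);
}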
The more recent version (Node 16 LTS) can be found in the V8 source [6]. It still uses while loops, but adds the use of pointers :)
Indexes and pointers are powerful here because we know the string's full representation. Iterating only over necessary characters as opposed to searching the full string saves time and resources.
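The same two-index technique also answers the opening question of how to trim anything we want. Here is a minimal sketch under a few assumptions: the name trimChars, its parameters, and the mode strings are all made up for illustration, and the function compares UTF-16 code units, so characters outside the Basic Multilingual Plane are not handled.

```javascript
// Illustrative helper: trims any of the characters in `chars` from `str`.
// mode is 'both' (default), 'start', or 'end'.
function trimChars(str, chars, mode = 'both') {
  // split('') breaks the string into UTF-16 code units.
  const unwanted = new Set(chars.split(''));

  // Advance the left index past unwanted characters.
  let left = 0;
  if (mode === 'both' || mode === 'start') {
    while (left < str.length && unwanted.has(str[left])) {
      left++;
    }
  }

  // Pull the right index back past unwanted characters.
  let right = str.length;
  if (mode === 'both' || mode === 'end') {
    while (right > left && unwanted.has(str[right - 1])) {
      right--;
    }
  }

  return str.substring(left, right);
}

trimChars('--hello--', '-');        // 'hello'
trimChars('xxhixx', 'x', 'start');  // 'hixx'
trimChars('/path/to/', '/', 'end'); // '/path/to'
```

A Set lookup keeps the per-character check cheap even when the list of characters to trim is long.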
Now we know how trim works under the hood. If given the task of implementing trim, some programmers may reach for regular expressions. They are a viable option, although depending on the implementation, they can be slower and can be vulnerable to exploits such as denial of service through excessive backtracking. Let's try a few implementations and see the data.
Solutions
Regular Expressions
To many, regex seems like the obvious choice, especially since the metacharacter \s matches both whitespace and line terminators.
basic_re = /^[\s]+|[\s]+$/g
This will get flagged by some code linters due to regex operator precedence: "In cases where it is intended that the anchors only apply to one alternative each, adding (non-capturing) groups around the anchors and the parts that they apply to will make it explicit which parts are anchored and avoid readers misunderstanding the precedence or changing it because they mistakenly assume the precedence was not intended." [7]
So we can adjust it to:
noncap_group = /(?:^[\s]+)|(?:[\s]+$)/g
A single regex like this is the most concise, but not always the most efficient: with the g flag the engine keeps scanning after the first match, and it matches twice if there is whitespace at both ends of the string.
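The double match is easy to observe with String.prototype.match, which returns every match when the regex carries the g flag. A small sketch (the variable names are illustrative):

```javascript
const basic_re = /^[\s]+|[\s]+$/g;

const padded = '  hello  ';
const matches = padded.match(basic_re);       // ['  ', '  ']
const trimmed = padded.replace(basic_re, ''); // 'hello'

// With whitespace at one end only, there is a single match.
const endOnly = 'hello  '.match(basic_re);    // ['  ']
```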
We can also break this up into two operations:
double_regex = str.replace(/^[\s]+/, '').replace(/[\s]+$/, '')
This should perform better on longer strings.
There are other regex solutions that involve backtracking, but they are slow and can be prone to security holes.
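For illustration, one pattern of this kind captures the middle of the string with a lazy quantifier. This is only a sketch of the idea, and the name lazy_re_trim is made up here; the result is correct, but after every character the lazy group consumes, the engine has to re-check whether everything that remains is trailing whitespace.

```javascript
// Illustrative only: a backtracking-heavy way to trim.
function lazy_re_trim(str) {
  // ^\s*       leading whitespace
  // ([\s\S]*?) as little as possible of anything, including newlines
  // \s*$       trailing whitespace up to the end of the string
  return str.replace(/^\s*([\s\S]*?)\s*$/, '$1');
}

lazy_re_trim('  a b  '); // 'a b'
```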
Regex + Loop
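A solution proposed in "High Performance JavaScript" [8] is a hybrid that combines the strengths of a regular expression on the beginning of the string with an indexed loop on the end of the string, for a best-of-both-worlds approach (hybrid_trim in the results below). A related variant drops the regex entirely and uses an indexed loop on both ends (non_re_trim in the results below):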
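function non_re_trim(str) {
  var start = 0,
    end = str.length - 1,
    // Characters treated as whitespace. \u180e and \u200b are in this list
    // although current Unicode does not classify them as "Zs", so native
    // trim keeps them.
    ws = ' \n\r\t\f\x0b\xa0\u1680\u180e\u2000\u2001\u2002\u2003\u2004\u2005\u2006\u2007\u2008\u2009\u200a\u200b\u2028\u2029\u202f\u205f\u3000\ufeff';

  // The start <= end guard stops the loop on empty or all-whitespace input,
  // where charAt returns '' and indexOf('') would otherwise always match.
  while (start <= end && ws.indexOf(str.charAt(start)) > -1) {
    start++;
  }
  while (end > start && ws.indexOf(str.charAt(end)) > -1) {
    end--;
  }
  return str.slice(start, end + 1);
}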
Note the use of indexOf to search for whitespace and slice to render the final result.
The main weakness of this version is long whitespace at the end of the string.
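The listing above uses no regular expression at all. For comparison, a regex + loop hybrid along the lines described at the start of this section might look like the sketch below. This is a reconstruction for illustration, not necessarily the exact code behind the hybrid_trim rows in the benchmark.

```javascript
// Illustrative sketch of a regex + loop hybrid.
function hybrid_trim(str) {
  // An anchored regex removes leading whitespace in a single match.
  str = str.replace(/^\s+/, '');

  // Walk backwards from the end until a non-whitespace character is found.
  const ws = /\s/;
  let end = str.length - 1;
  while (end >= 0 && ws.test(str.charAt(end))) {
    end--;
  }

  return str.slice(0, end + 1);
}

hybrid_trim('  a b \n'); // 'a b'
```

The anchored regex fails fast when there is no leading whitespace, and the backwards loop avoids scanning the whole string to find the trailing run.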
Performance
All tests were run on Node 16 LTS; browser results will vary.
Native trim and its JS counterpart are by far the fastest, next is the non-regex solution, with the regexes coming in last.
test | method | ops/sec | pct error |
---|---|---|---|
space at both ends | v8_trim | 37,384,532 | ±2.40% |
end space only | v8_trim | 30,705,964 | ±1.60% |
space at beginning only | v8_trim | 29,038,258 | ±4.34% |
end space only | faux_v8 | 16,094,768 | ±1.27% |
space at beginning only | faux_v8 | 15,618,819 | ±1.69% |
space at both ends | faux_v8 | 14,026,258 | ±3.31% |
space at beginning only | non_re_trim | 7,448,794 | ±5.81% |
end space only | non_re_trim | 7,273,079 | ±1.23% |
space at both ends | non_re_trim | 7,073,627 | ±1.56% |
space at beginning only | hybrid_trim | 6,742,686 | ±0.94% |
space at beginning only | noncap_group | 5,207,593 | ±1.09% |
space at both ends | hybrid_trim | 5,144,763 | ±1.35% |
end space only | basic_re | 5,068,264 | ±1.62% |
end space only | noncap_group | 5,038,307 | ±1.16% |
space at beginning only | basic_re | 4,947,144 | ±2.76% |
end space only | hybrid_trim | 4,811,567 | ±3.62% |
space at both ends | noncap_group | 4,693,936 | ±2.23% |
space at both ends | basic_re | 4,658,159 | ±1.57% |
space at beginning only | double_regex | 4,636,759 | ±1.42% |
space at both ends | double_regex | 4,247,326 | ±1.63% |
end space only | double_regex | 4,050,811 | ±2.53% |
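The harness behind these numbers isn't shown here. As a rough sketch of how ops/sec figures like these can be collected in Node, the helper below times repeated calls with process.hrtime.bigint(). The name opsPerSecond and its parameters are made up for illustration, and a hand-rolled loop like this is much noisier than a dedicated benchmarking library, so treat its output as indicative only.

```javascript
// Illustrative micro-benchmark: calls fn(input) repeatedly for roughly
// durationMs milliseconds and returns the observed operations per second.
function opsPerSecond(fn, input, durationMs = 200) {
  const budgetNs = BigInt(durationMs) * 1000000n;
  const start = process.hrtime.bigint();
  let ops = 0;
  let elapsedNs = 0n;
  do {
    // Run in batches so the clock is not read on every call.
    for (let i = 0; i < 1000; i++) fn(input);
    ops += 1000;
    elapsedNs = process.hrtime.bigint() - start;
  } while (elapsedNs < budgetNs);
  return ops / (Number(elapsedNs) / 1e9);
}

const nativeTrim = (s) => s.trim();
const doubleRegex = (s) => s.replace(/^[\s]+/, '').replace(/[\s]+$/, '');

console.log('native trim :', Math.round(opsPerSecond(nativeTrim, '  hello  ')));
console.log('double_regex:', Math.round(opsPerSecond(doubleRegex, '  hello  ')));
```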
Conclusion
Sometimes the need arises for us to extend built-in functionality. Deep diving into the source is typically a good starting point. There may be faster ways of doing trimming. If you know of one, leave a comment below!
Sources
[1] String.prototype.trimStart / String.prototype.trimEnd https://github.com/tc39/proposal-string-left-right-trim
[2] Standard ECMA-262 6th Edition / June 2015 - String.prototype.trim https://262.ecma-international.org/6.0/#sec-string.prototype.trim
[3] Standard ECMA-262 6th Edition / June 2015 - whitespace https://262.ecma-international.org/6.0/#sec-white-space
[4] Standard ECMA-262 6th Edition / June 2015 - Line Terminator Code Points https://262.ecma-international.org/6.0/#sec-line-terminators
[5] Chromium v8 https://chromium.googlesource.com/v8/v8/
[6] v8 github
[7] sonarsource regex security hotspot rule - https://rules.sonarsource.com/java/tag/regex/RSPEC-5850
[8] High Performance JavaScript [Book] - O'Reilly - https://www.oreilly.com/library/view/high-performance-javascript/9781449382308/
side note: Am I the only person who thinks "start" should only go with "finish" and "begin" with "end"? "start...end" seems a little off...