lando, node.js javascript string trim v8
String.trim

Problem

JavaScript provides the built-in method trim and the newer trimStart / trimEnd methods [1]. These native methods make quick work of trimming whitespace and line terminators from the start and end of a string. How does this functionality work, and how would we extend it to trim anything we want?

Exploration

Let's first take a look at how JavaScript does trimming. The ECMAScript standard (2015) [2] describes the trim method as a function that takes a String input and returns a copy of the input "with both leading and trailing white space removed. The definition of white space is the union of WhiteSpace and LineTerminator." So... space bar and enter key, done, right? Not quite. Strings in JavaScript are interpreted as "UTF-16 encoded code points", so we must account for every code point that the standard classifies as white space or as a line terminator. Here are the definitions of whitespace and line terminators:

Table 32 — White Space Code Points [3]

Code Point | Name | Abbreviation
U+0009 | CHARACTER TABULATION | TAB
U+000B | LINE TABULATION | VT
U+000C | FORM FEED (FF) | FF
U+0020 | SPACE | SP
U+00A0 | NO-BREAK SPACE | NBSP
U+FEFF | ZERO WIDTH NO-BREAK SPACE | ZWNBSP
Other category “Zs” | Any other Unicode “Separator, space” code point | USP

Table 33 — Line Terminator Code Points [4]

Code Point | Unicode Name | Abbreviation
U+000A | LINE FEED (LF) | LF
U+000D | CARRIAGE RETURN (CR) | CR
U+2028 | LINE SEPARATOR | LS
U+2029 | PARAGRAPH SEPARATOR | PS
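
A quick way to see these tables in action is to feed native trim a string padded with some of the less obvious code points. The sample strings below are made up for illustration. U+200B (ZERO WIDTH SPACE) is included because it belongs to category Cf rather than Zs, so it should survive trimming.

```javascript
// Padding drawn from Tables 32 and 33: ZWNBSP, NBSP, TAB on the left; LS, CR, LF on the right.
const padded = '\uFEFF\u00A0\thello\u2028\r\n';
console.log(padded.trim() === 'hello'); // true
console.log(padded.trimStart() === 'hello\u2028\r\n'); // true
console.log(padded.trimEnd() === '\uFEFF\u00A0\thello'); // true

// ZERO WIDTH SPACE (U+200B) is in neither table, so trim leaves it alone.
const zeroWidth = '\u200Bhello\u200B';
console.log(zeroWidth.trim().length); // 7
```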

There are several implementations of the ES standard in use, but for the purposes of this post we will look at V8 [5], the engine used by Node.js.

Here's the code that runs when you call .trim() on a string in earlier versions of v8:

Handle<String> String::Trim(Handle<String> string, TrimMode mode) {
  Isolate* const isolate = string->GetIsolate();
  string = String::Flatten(string);
  int const length = string->length();
  // Perform left trimming if requested.
  int left = 0;
  UnicodeCache* unicode_cache = isolate->unicode_cache();
  if (mode == kTrim || mode == kTrimStart) {
    while (left < length &&
           unicode_cache->IsWhiteSpaceOrLineTerminator(string->Get(left))) {
      left++;
    }
  }
  // Perform right trimming if requested.
  int right = length;
  if (mode == kTrim || mode == kTrimEnd) {
    while (
        right > left &&
        unicode_cache->IsWhiteSpaceOrLineTerminator(string->Get(right - 1))) {
      right--;
    }
  }
  return isolate->factory()->NewSubString(string, left, right);
}

which roughly translates to JavaScript as:

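// TRIM, TRIMSTART, TRIMEND and isWhiteSpaceOrLineTerminator stand in for
// V8's TrimMode values and its unicode cache lookup; they are not defined here.
function faux_v8_trim(str, mode) {
  const length = str.length;
  // Advance the left index past leading whitespace if requested.
  let left = 0;
  if (mode === TRIM || mode === TRIMSTART) {
    while (left < length &&
           isWhiteSpaceOrLineTerminator(str.charCodeAt(left))) {
      left++;
    }
  }
  // Pull the right index back past trailing whitespace if requested.
  let right = length;
  if (mode === TRIM || mode === TRIMEND) {
    while (right > left &&
           isWhiteSpaceOrLineTerminator(str.charCodeAt(right - 1))) {
      right--;
    }
  }
  return str.substring(left, right);
}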
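
The translation above leans on three mode constants and an isWhiteSpaceOrLineTerminator helper that are not shown. One way they could be filled in is sketched below, using the code points from Tables 32 and 33. The constant values and the use of the \p{Zs} Unicode property escape are assumptions made for illustration, not how V8 does the lookup.

```javascript
// Hypothetical mode constants standing in for V8's TrimMode enum.
const TRIM = 0;
const TRIMSTART = 1;
const TRIMEND = 2;

// Code points listed explicitly in Tables 32 and 33.
const EXPLICIT_CODE_POINTS = new Set([
  0x0009, 0x000b, 0x000c, 0x0020, 0x00a0, 0xfeff, // white space
  0x000a, 0x000d, 0x2028, 0x2029, // line terminators
]);

// Any other "Separator, space" (Zs) code point. All current Zs code points
// fit in a single UTF-16 code unit, so testing one char code at a time works.
const SPACE_SEPARATOR = /\p{Zs}/u;

function isWhiteSpaceOrLineTerminator(code) {
  return (
    EXPLICIT_CODE_POINTS.has(code) ||
    SPACE_SEPARATOR.test(String.fromCharCode(code))
  );
}

console.log(isWhiteSpaceOrLineTerminator(0x0020)); // true (SPACE)
console.log(isWhiteSpaceOrLineTerminator(0x3000)); // true (IDEOGRAPHIC SPACE, category Zs)
console.log(isWhiteSpaceOrLineTerminator(0x0041)); // false ("A")
```

With these definitions in scope, faux_v8_trim('  hi  ', TRIM) should behave like '  hi  '.trim().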

The more recent version (Node 16 LTS) [6] can be seen here. It still uses while loops, but adds the use of pointers :)

Indexes and pointers are powerful here because we know the string's full representation. Iterating only over necessary characters as opposed to searching the full string saves time and resources.
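
The same two-index loop also answers the second half of the opening question. Here is a minimal sketch that trims an arbitrary set of characters instead of whitespace. trimChars and its chars parameter are hypothetical names introduced for this example, and it assumes every character to trim fits in a single UTF-16 code unit.

```javascript
// Trim any of the characters in `chars` from both ends of `str`.
function trimChars(str, chars) {
  // Each character of `chars` becomes a set entry (single code units assumed).
  const toTrim = new Set(chars);
  let left = 0;
  let right = str.length;
  // Move the left index forward while it points at a character to trim.
  while (left < right && toTrim.has(str[left])) {
    left++;
  }
  // Move the right index backward, never crossing the left index.
  while (right > left && toTrim.has(str[right - 1])) {
    right--;
  }
  return str.substring(left, right);
}

console.log(trimChars('--__hello__--', '-_')); // "hello"
console.log(trimChars('xxx', 'x')); // ""
console.log(trimChars('hello', '')); // "hello"
```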

Now we know how trim works under the hood. If given the task of implementing trim, some programmers may think to use regular expressions. They are a viable option, although depending on the implementation, they will be slower and can be vulnerable to exploits. Let's try a few implementations and see the data.

Solutions

Regular Expressions

To many, regex seems like the obvious choice, especially since the metacharacter \s will be very helpful.

basic_re = /^[\s]+|[\s]+$/g

This will get flagged by some code linters due to regex operator precedence: "In cases where it is intended that the anchors only apply to one alternative each, adding (non-capturing) groups around the anchors and the parts that they apply to will make it explicit which parts are anchored and avoid readers misunderstanding the precedence or changing it because they mistakenly assume the precedence was not intended." [7]

So we can adjust it to:

noncap_group = /(?:^[\s]+)|(?:[\s]+$)/g

This regex is the most concise, but not always the most efficient since it will match twice if there is whitespace at both ends of the string.

We can also break this up into two operations:

double_regex = str.replace(/^[\s]+/, '').replace(/[\s]+$/, '')

This should perform better on longer strings.
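
To compare the three regex variants side by side, each can be wrapped in a small function. This is only a sketch: the regexes are the ones given above, but the wrapper functions, their names, and the sample input are assumptions added for illustration.

```javascript
const basic_re = /^[\s]+|[\s]+$/g;
const noncap_group = /(?:^[\s]+)|(?:[\s]+$)/g;

// One replace call; the global flag lets it match at both ends.
function basicReTrim(str) {
  return str.replace(basic_re, '');
}

// Same behavior, with the alternation made explicit for linters and readers.
function noncapGroupTrim(str) {
  return str.replace(noncap_group, '');
}

// Two replace calls, each anchored to one end of the string.
function doubleRegexTrim(str) {
  return str.replace(/^[\s]+/, '').replace(/[\s]+$/, '');
}

const sample = '\t  hello world \n';
for (const fn of [basicReTrim, noncapGroupTrim, doubleRegexTrim]) {
  // Each variant should agree with native trim on this input.
  console.log(fn.name, fn(sample) === sample.trim());
}
```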

There are other regex solutions that involve backtracking, but they are slow and can be prone to security holes.

Regex + Loop

A solution proposed in "High Performance JavaScript" [8] is a hybrid that combines the strengths of regular expressions at the beginning of the string with an indexed loop at the end of the string, for a best-of-both-worlds approach:

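function non_re_trim(str) {
  var start = 0,
    end = str.length - 1,
    // Characters to strip. \u180e and \u200b are listed here even though
    // current native trim does not treat them as whitespace.
    ws =
      ' \n\r\t\f\x0b\xa0\u1680\u180e\u2000\u2001\u2002\u2003\u2004\u2005\u2006\u2007\u2008\u2009\u200a\u200b\u2028\u2029\u202f\u205f\u3000\ufeff'
  // The bounds check matters: charAt() past the end returns '' and
  // indexOf('') returns 0, so an empty or all-whitespace string would
  // otherwise loop forever.
  while (start <= end && ws.indexOf(str.charAt(start)) > -1) {
    start++
  }
  while (end > start && ws.indexOf(str.charAt(end)) > -1) {
    end--
  }
  return str.slice(start, end + 1)
}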

Note the use of indexOf to search for whitespace and slice to render the final result.

The main weakness of this version is long whitespace at the end of the string.
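
The benchmark results list a hybrid_trim entry. A minimal sketch of what such a hybrid could look like, following the description at the start of this section (a regex for the leading whitespace, an index loop for the trailing whitespace), is given here. It is an illustration of that description, not necessarily the exact code from the book or the code that was benchmarked.

```javascript
function hybrid_trim(str) {
  // A regex anchored at the start removes leading whitespace in one pass.
  const s = str.replace(/^\s+/, '');
  const ws = /\s/;
  let end = s.length - 1;
  // Walk backwards from the end; the bounds check stops the loop on empty input.
  while (end >= 0 && ws.test(s.charAt(end))) {
    end--;
  }
  return s.slice(0, end + 1);
}

console.log(JSON.stringify(hybrid_trim('  hello \n'))); // "hello"
console.log(JSON.stringify(hybrid_trim('   '))); // ""
```

Because the trailing loop only looks at characters it will remove, its cost grows with the amount of trailing whitespace rather than with the length of the whole string.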

Performance

All tests were run on Node 16 LTS; browser results will vary.

Native trim and its JS counterpart are by far the fastest. Next is the non-regex solution, with the regexes coming in last.

test | method | ops/sec | pct error
space at both ends | v8_trim | 37,384,532 | ±2.40%
end space only | v8_trim | 30,705,964 | ±1.60%
space at beginning only | v8_trim | 29,038,258 | ±4.34%
end space only | faux_v8 | 16,094,768 | ±1.27%
space at beginning only | faux_v8 | 15,618,819 | ±1.69%
space at both ends | faux_v8 | 14,026,258 | ±3.31%
space at beginning only | non_re_trim | 7,448,794 | ±5.81%
end space only | non_re_trim | 7,273,079 | ±1.23%
space at both ends | non_re_trim | 7,073,627 | ±1.56%
space at beginning only | hybrid_trim | 6,742,686 | ±0.94%
space at beginning only | noncap_group | 5,207,593 | ±1.09%
space at both ends | hybrid_trim | 5,144,763 | ±1.35%
end space only | basic_re | 5,068,264 | ±1.62%
end space only | noncap_group | 5,038,307 | ±1.16%
space at beginning only | basic_re | 4,947,144 | ±2.76%
end space only | hybrid_trim | 4,811,567 | ±3.62%
space at both ends | noncap_group | 4,693,936 | ±2.23%
space at both ends | basic_re | 4,658,159 | ±1.57%
space at beginning only | double_regex | 4,636,759 | ±1.42%
space at both ends | double_regex | 4,247,326 | ±1.63%
end space only | double_regex | 4,050,811 | ±2.53%
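
The benchmark harness is not shown in this post, so the following is only a rough sketch of how an ops/sec figure can be estimated. The opsPerSec helper, the sample input, and the 200 ms duration are assumptions. A dedicated benchmarking library would also report an error margin like the one in the table, which this sketch does not.

```javascript
// `performance` is available as a global in Node 16 and in browsers.

// Run `fn` repeatedly for roughly `durationMs` and estimate operations per second.
function opsPerSec(fn, durationMs = 200) {
  let ops = 0;
  let elapsed = 0;
  const start = performance.now();
  while (elapsed < durationMs) {
    // Batch the calls so that reading the clock does not dominate the measurement.
    for (let i = 0; i < 1000; i++) fn();
    ops += 1000;
    elapsed = performance.now() - start;
  }
  return Math.round((ops / elapsed) * 1000);
}

const benchSample = '   space at both ends   ';
console.log('native trim :', opsPerSec(() => benchSample.trim()));
console.log(
  'double regex:',
  opsPerSec(() => benchSample.replace(/^[\s]+/, '').replace(/[\s]+$/, ''))
);
```

Numbers from a loop like this depend heavily on the machine and on JIT warm-up, so they are best used for relative comparisons only.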

Conclusion

Sometimes the need arises for us to extend built-in functionality. Deep diving into the source is typically a good starting point. There may be faster ways of doing trimming. If you know of one, leave a comment below!


Sources

[1] String.prototype.trimStart / String.prototype.trimEnd https://github.com/tc39/proposal-string-left-right-trim

[2] Standard ECMA-262 6th Edition / June 2015 - String.prototype.trim https://262.ecma-international.org/6.0/#sec-string.prototype.trim

[3] Standard ECMA-262 6th Edition / June 2015 - whitespace https://262.ecma-international.org/6.0/#sec-white-space

[4] Standard ECMA-262 6th Edition / June 2015 - Line Terminator Code Points https://262.ecma-international.org/6.0/#sec-line-terminators

[5] Chromium v8 https://chromium.googlesource.com/v8/v8/

[6] v8 github

[7] sonarsource regex security hotspot rule - https://rules.sonarsource.com/java/tag/regex/RSPEC-5850

[8] High Performance JavaScript [Book] - O'Reilly - https://www.oreilly.com/library/view/high-performance-javascript/9781449382308/


side note: Am I the only person who thinks "start" should only go with "finish" and "begin" with "end"? "start...end" seems a little off...