Post

Strings

String

One of the fundamental data types in Computer Science, which is basically an array of integers under the hood, where each character in a given string is mapped to an integer via some character-encoding standard like ASCII. ASCII covers 128 characters and works well for basic English text, but modern applications rely on Unicode, which can represent over 140,000 characters across virtually every writing system. Common Unicode encodings include UTF-8 (variable-length, 1-4 bytes per character, backward-compatible with ASCII and dominant on the web) and UTF-16 (used internally by Java and JavaScript, 2 or 4 bytes per character).

We can divide Strings to mutable and immutable, in most languages they are immutable, meaning that they can’t be edited after creation. So as you can imagine, simple operations like appending a character to a string are more expensive than it might appear.

String Immutability

In Java and Python, strings are immutable objects. Once created, their contents cannot be changed. Every operation that appears to modify a string actually allocates a new string in memory. This design has benefits – immutable strings are thread-safe, can be freely shared, and enable string interning (the string pool). The JVM and Python interpreter both maintain a pool of string literals so that identical strings share the same memory reference, saving space and making equality comparisons faster.

The canonical example of an operation that’s deceptively expensive due to string immutability is the following:

1
2
3
4
5
String someString = "some string";
String newString = "";
for (char character : someString.toCharArray()) {
	  newString += character;
}

The operation above has a time complexity O(n2) where n is the length of ‘someString’, because each addition of a character to newString creates an entirely new string and is itself an O(n) operation. Since n such O(n) operations are performed (one for each character in the given string), the total cost is O(n2).

Efficient String Concatenation

The solution to the O(n2) problem above is to use a mutable buffer. In Java, StringBuilder (or its thread-safe counterpart StringBuffer) maintains an internal character array that grows as needed, so appending a character is amortized O(1):

1
2
3
4
5
6
String someString = "some string";
StringBuilder sb = new StringBuilder();
for (char character : someString.toCharArray()) {
    sb.append(character);
}
String newString = sb.toString();

This brings the total time complexity down to O(n). In Python, the idiomatic approach is to collect parts in a list and join them at the end:

1
2
3
some_string = "some string"
parts = [char for char in some_string]
new_string = "".join(parts)

Common String Operations and Their Complexities

Understanding the cost of everyday string operations helps write performant code:

OperationTime ComplexityNotes
Accessing a character by indexO(1)Direct array access
Searching for a substringO(n * m)Where n is string length and m is substring length (naive); O(n + m) with KMP or Rabin-Karp
Concatenation (immutable)O(n + m)Creates a new string of combined length
SubstringO(n) in Java (since Java 7u6)Creates a new copy of the underlying character data
Comparing two stringsO(n)Must check character by character in the worst case
StringBuilder appendO(1) amortizedResizes the internal buffer when full, similar to a dynamic array
This post is licensed under CC BY 4.0 by the author.