Post

Strings: Immutability and Traversal

Strings: Immutability and Traversal

String

One of the fundamental data types in Computer Science, which is basically an array of integers under the hood, where each character in a given string is mapped to an integer via some character-encoding standard like ASCII. ASCII covers 128 characters and works well for basic English text, but modern applications rely on Unicode, which can represent over 140,000 characters across virtually every writing system. Common Unicode encodings include UTF-8 (variable-length, 1-4 bytes per character, backward-compatible with ASCII and dominant on the web) and UTF-16 (used internally by Java and JavaScript, 2 or 4 bytes per character).

We can divide Strings to mutable and immutable, in most languages they are immutable, meaning that they can’t be edited after creation. So as you can imagine, simple operations like appending a character to a string are more expensive than it might appear.

String Immutability

In Java and Python, strings are immutable objects. Once created, their contents cannot be changed. Every operation that appears to modify a string actually allocates a new string in memory. This design has benefits – immutable strings are thread-safe, can be freely shared, and enable string interning (the string pool). The JVM and Python interpreter both maintain a pool of string literals so that identical strings share the same memory reference, saving space and making equality comparisons faster.

The canonical example of an operation that’s deceptively expensive due to string immutability is the following:

1
2
3
4
5
String someString = "some string";
String newString = "";
for (char character : someString.toCharArray()) {
	  newString += character;
}

The operation above has a time complexity O(n2) where n is the length of ‘someString’, because each addition of a character to newString creates an entirely new string and is itself an O(n) operation. Since n such O(n) operations are performed (one for each character in the given string), the total cost is O(n2).

Efficient String Concatenation

The solution to the O(n2) problem above is to use a mutable buffer. In Java, StringBuilder maintains an internal character array that grows as needed, so appending a character is amortized O(1):

1
2
3
4
5
6
String someString = "some string";
StringBuilder sb = new StringBuilder();
for (char character : someString.toCharArray()) {
    sb.append(character);
}
String newString = sb.toString();

This brings the total time complexity down to O(n). Java also offers StringBuffer, which is the thread-safe counterpart of StringBuilder. Both have the same API, but StringBuffer synchronizes every method call, making it slower. In practice, string building is almost always a local operation within a single method, so StringBuilder is the right choice in the vast majority of cases. StringBuffer is a holdover from Java 1.0 and you’ll rarely have a reason to reach for it.

In Python, the idiomatic approach is to collect parts in a list and join them at the end:

1
2
3
some_string = "some string"
parts = [char for char in some_string]
new_string = "".join(parts)

Common String Operations and Their Complexities

Understanding the cost of everyday string operations helps write performant code:

OperationTime ComplexityNotes
Accessing a character by indexO(1)Direct array access
Searching for a substringO(n * m)Where n is string length and m is substring length (naive); O(n + m) with KMP or Rabin-Karp
Concatenation (immutable)O(n + m)Creates a new string of combined length
SubstringO(n) in Java (since Java 7u6)Creates a new copy of the underlying character data
Comparing two stringsO(n)Must check character by character in the worst case
StringBuilder appendO(1) amortizedResizes the internal buffer when full, similar to a dynamic array

One subtlety worth noting: in Java, String.length() returns the number of UTF-16 code units, not the number of visible characters. Most characters fit in a single code unit, but emoji and many non-Latin scripts use surrogate pairs (two code units for one character). For example, "hello".length() returns 5 as expected, but a single emoji like the smiling face can report a length of 2. If you need the actual character count, use codePointCount(0, s.length()) instead.

String Comparison: More Than Meets the Eye

One of the most common mistakes Java developers make involves string comparison, so let’s clear it up once and for all. The distinction boils down to == vs .equals(). The == operator compares references — it asks “do both variables point to the exact same object in memory?” — while .equals() compares the actual character content. Here’s the classic pitfall in action:

1
2
3
4
String a = new String("hello");
String b = new String("hello");
System.out.println(a == b);      // false — different objects
System.out.println(a.equals(b)); // true — same content

Now here’s the interning twist: string literals (like String s = "hello") are automatically interned, meaning the JVM reuses the same object from the string pool. So "hello" == "hello" returns true, but this is a side effect of string pool optimization, not a language guarantee about how equality works. The Java Language Specification guarantees that compile-time constant strings are interned, but that is an implementation detail you should never rely on for correctness. The behavior does not extend to strings created with new, concatenated at runtime, or read from external sources. Relying on == for string equality is a bug waiting to happen unless you are absolutely sure both references are interned.

Beyond equality, you’ll often need lexicographic ordering. The compareTo() method returns a negative number, zero, or a positive number depending on whether the calling string is less than, equal to, or greater than the argument. It compares character by character using Unicode values, which makes it the backbone of sorting strings alphabetically.

For situations where case doesn’t matter, use case-insensitive comparison: equalsIgnoreCase() for equality checks and String.CASE_INSENSITIVE_ORDER as a comparator when sorting collections of strings.

Common String Patterns

Strings show up constantly in coding interviews and day-to-day programming. Here are four patterns worth having in your toolkit.

1. Palindrome check — The two-pointer technique is the cleanest approach: place one pointer at the start and one at the end, walk them inward, and compare characters at each step. If every pair matches, the string is a palindrome. This runs in O(n) time with O(1) space:

1
2
3
4
5
6
7
8
9
10
11
public static boolean isPalindrome(String s) {
    int left = 0, right = s.length() - 1;
    while (left < right) {
        if (s.charAt(left) != s.charAt(right)) {
            return false;
        }
        left++;
        right--;
    }
    return true;
}

2. Anagram check — Two strings are anagrams if they contain exactly the same characters in a different order. There are two common approaches.

The sorting approach is straightforward — sort both strings and compare. It’s O(n log n) due to the sort:

1
2
3
4
5
6
7
public static boolean isAnagramSort(String a, String b) {
    char[] aChars = a.toCharArray();
    char[] bChars = b.toCharArray();
    Arrays.sort(aChars);
    Arrays.sort(bChars);
    return Arrays.equals(aChars, bChars);
}

The frequency counting approach is faster. Build an int[26] array (assuming lowercase English letters), increment for each character in the first string, decrement for each character in the second, and check that every count is zero. This is O(n) time and O(1) space:

1
2
3
4
5
6
7
8
9
10
11
12
public static boolean isAnagramCount(String a, String b) {
    if (a.length() != b.length()) return false;
    int[] freq = new int[26];
    for (int i = 0; i < a.length(); i++) {
        freq[a.charAt(i) - 'a']++;
        freq[b.charAt(i) - 'a']--;
    }
    for (int count : freq) {
        if (count != 0) return false;
    }
    return true;
}

3. Substring search — For simple cases, Java’s indexOf() does the job. Internally it uses a naive O(n * m) approach, which is fine for short strings or one-off searches. When performance matters — say you’re scanning a massive log file or implementing a text editor — look into the KMP (Knuth-Morris-Pratt) algorithm or the Rabin-Karp algorithm. Both achieve O(n + m) time by avoiding redundant comparisons. KMP preprocesses the pattern into a failure function, while Rabin-Karp uses rolling hash values. You probably won’t implement these from scratch often, but knowing they exist helps you pick the right tool.

4. Regular expressions — Java provides the Pattern and Matcher classes for regex-based string processing. Here’s a quick example validating a simple email pattern:

1
2
String pattern = "^[\\w.+-]+@[\\w-]+\\.[a-zA-Z]{2,}$";
boolean isValid = Pattern.matches(pattern, email);

Regex is incredibly powerful for parsing, validation, and text extraction, but be careful with complex patterns. Catastrophic backtracking can cause a regex engine to hang on certain inputs, turning a simple validation into an accidental denial-of-service vulnerability. A classic example is the pattern (a+)+b matched against a string like "aaaaaaaaaaaaa!". Because the input has no b, the engine backtracks through every possible way to partition the sequence of a characters among the nested quantifiers, which grows exponentially with the input length. With just 25 a characters, this can take seconds. With 40, it might never finish.

1
2
3
4
// WARNING: this will hang for large inputs
Pattern p = Pattern.compile("(a+)+b");
Matcher m = p.matcher("aaaaaaaaaaaaaaaaaaaaaaaaa!"); // 25 a's, no b
m.matches(); // exponential backtracking

If you accept regex patterns from user input (for example, a search filter), you should either validate or sandbox them, or use a regex engine with backtracking protections.

String Interning Revisited

We touched on interning earlier, so let’s dig a little deeper. The string pool (also called the intern pool) is a special memory area where the JVM stores unique string literals. When you write String s = "hello", the JVM checks the pool first and reuses an existing instance if one already exists, rather than allocating a new object.

An important piece of history: before Java 7, the string pool lived in PermGen (Permanent Generation), a fixed-size memory region that was not garbage collected in the usual sense. This meant that heavy use of intern() could cause OutOfMemoryError: PermGen space failures. Starting with Java 7, the string pool was moved to the main heap, where it benefits from normal garbage collection. This made interning much safer and more practical, since interned strings that are no longer referenced can now be collected like any other object.

You can explicitly add a string to the pool using the .intern() method: String interned = someString.intern(). After interning, == comparisons become reliable for equality checks, since all interned copies of the same content point to the same reference.

When to use interning: if your application creates many duplicate strings — for example, parsing CSV files where the same column values repeat thousands of times — interning can save significant memory. When to avoid it: interning large numbers of unique strings pollutes the pool and can actually increase GC pressure, because the pool itself must be maintained and scanned during garbage collection.

Related posts: Complexity Analysis, Memory

This post is licensed under CC BY 4.0 by the author.