CHAPTER 07 · CORE LANGUAGE

Strings, Runes & Unicode

Every programming language has strings. Go's are simple on the surface and quietly clever underneath. Once you understand the byte-vs-rune split, a whole category of bugs evaporates: the kind where "café" seems to have 5 characters instead of 4.

Learning objectives

  • Explain what a Go string actually stores.
  • Define byte, rune, and code point, and explain how UTF-8 ties them together.
  • Choose between indexing (s[i]) and ranging correctly.
  • Count the characters in a string safely.
  • Use raw string literals for regex and JSON.
  • Build long strings efficiently with strings.Builder.

What a string actually is

A Go string is an immutable sequence of bytes. Two operations are guaranteed:

  1. len(s) returns the number of bytes (not characters!).
  2. s[i] returns the i-th byte as a byte (i.e. uint8).

That's it. Immutable, countable, byte-indexable. A string has no built-in notion of "characters", because what counts as a character depends on the text encoding. Go took a strong stance here: source code is UTF-8, so string literals hold UTF-8 text, although a string value may contain arbitrary bytes.

package main

import "fmt"

func main() {
    s := "hello"
    fmt.Println(len(s))          // 5  (bytes)
    fmt.Println(s[0])            // 104, the byte value of 'h'
    fmt.Printf("%c\n", s[0])     // h  (the same byte, formatted as a character)
}
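
Immutability is easiest to see by trying to change a string. The sketch below assumes a hypothetical helper, upperFirst, that is not part of the standard library: it copies the bytes into a []byte, edits the copy, and converts back, leaving the original untouched.

```go
package main

import "fmt"

// upperFirst returns a copy of s with its first byte upper-cased when that
// byte is an ASCII lowercase letter. (Hypothetical helper for illustration.)
func upperFirst(s string) string {
	if s == "" {
		return s
	}
	b := []byte(s) // the conversion copies the bytes; s itself is untouched
	if b[0] >= 'a' && b[0] <= 'z' {
		b[0] -= 'a' - 'A'
	}
	return string(b)
}

func main() {
	s := "hello"
	// s[0] = 'H' // would not compile: string bytes cannot be assigned to
	t := upperFirst(s)
	fmt.Println(s, t) // hello Hello
}
```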

UTF-8 in 90 seconds

Unicode is a catalog of "characters" (really: code points). Each has a number. 'A' is U+0041. '€' is U+20AC. '😀' is U+1F600. There are over 1.1 million possible code points.

UTF-8 is a way to encode those numbers into bytes. It uses a variable-length scheme:

Code-point range       Bytes   Example (hex bytes)
U+0000 – U+007F        1       'A' = 41
U+0080 – U+07FF        2       'é' = C3 A9
U+0800 – U+FFFF        3       '€' = E2 82 AC
U+10000 – U+10FFFF     4       '😀' = F0 9F 98 80

Two consequences that matter for Go strings:

  • ASCII is UTF-8. Old ASCII strings are valid UTF-8, no conversion needed.
  • One character ≠ one byte. "é" is two bytes. "😀" is four.
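
The table can be reproduced from Go itself. This is a small sketch using utf8.RuneLen from the standard library; encodedLen is a hypothetical wrapper added only to give the example a name.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// encodedLen reports how many bytes UTF-8 needs to encode r.
// (Hypothetical wrapper around utf8.RuneLen.)
func encodedLen(r rune) int {
	return utf8.RuneLen(r)
}

func main() {
	for _, r := range []rune{'A', 'é', '€', '😀'} {
		// "% x" prints the bytes of the string in hex, separated by spaces.
		fmt.Printf("%c U+%04X %d byte(s): % x\n", r, r, encodedLen(r), string(r))
	}
}
```

Each output line should match one row of the table: one byte for 'A', two for 'é', three for '€', and four for '😀'.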

Bytes vs runes

Go gives you two named types that will save your sanity:

  • byte (alias for uint8), one raw byte of data.
  • rune (alias for int32), one Unicode code point.

Visualize it for the string "café":

BYTES (len = 5):   0x63 'c'   0x61 'a'   0x66 'f'   0xC3 (é, byte 1)   0xA9 (é, byte 2)
RUNES (count = 4): 'c'   'a'   'f'   'é' (U+00E9)

"café": 5 bytes, 4 runes. That asymmetry is the thing to internalize.

package main

import (
    "fmt"
    "unicode/utf8"
)

func main() {
    s := "café"
    fmt.Println(len(s))                        // 5 (bytes)
    fmt.Println(utf8.RuneCountInString(s))     // 4 (runes)
}

Indexing & slicing

String indexing returns bytes, not runes. This is a foot-gun if your string isn't pure ASCII:

package main

import "fmt"

func main() {
    s := "café"
    fmt.Printf("%T %v\n", s[3], s[3])     // uint8 195 (0xC3, the first byte of 'é'), NOT 'é'
}

Slicing works on bytes too:

s := "café"
fmt.Println(s[0:3])     // "caf": OK, those bytes map 1:1 to runes
fmt.Println(s[0:4])     // "caf\xc3": cuts the 'é' in half! Invalid UTF-8.
Warning: Byte slicing isn't safe for arbitrary strings. Slicing is cheap (it makes no copy, see Chapter 13) but operates on bytes. Use it when you know your indexes land on rune boundaries (e.g. you just found them with strings.Index) or when the content is guaranteed ASCII.
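
As a sketch of the "indexes from strings.Index" case: the offset it returns marks where a match begins, so for valid UTF-8 inputs it sits on a rune boundary and is safe to slice at. The helper name before is hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// before returns the part of s that precedes the first occurrence of sep,
// or s unchanged when sep is absent. (Hypothetical helper.)
func before(s, sep string) string {
	i := strings.Index(s, sep) // byte offset where sep starts, or -1
	if i < 0 {
		return s
	}
	return s[:i] // i is a rune boundary when s and sep are valid UTF-8
}

func main() {
	fmt.Println(before("café au lait", " au")) // café
}
```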

Ranging correctly

The for … range loop over a string is rune-aware. It decodes UTF-8 on the fly:

package main

import "fmt"

func main() {
    s := "café"
    for i, r := range s {
        fmt.Printf("byte-index %d: %q (rune value %d)\n", i, r, r)
    }
}

Output:

byte-index 0: 'c' (rune value 99)
byte-index 1: 'a' (rune value 97)
byte-index 2: 'f' (rune value 102)
byte-index 3: 'é' (rune value 233)

Notice: the index i is a byte offset, not a rune count. After 'é' (which is 2 bytes), the next iteration would start at byte 5. This matters when you want to slice back into the original string.

Compare to a C-style index loop:

// WRONG for non-ASCII: iterates bytes, not runes
for i := 0; i < len(s); i++ {
    fmt.Printf("%c ", s[i])     // will print garbage for multi-byte runes
}
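
Because range yields byte offsets, those offsets can be used directly as slice bounds. The following sketch truncates a string to a given number of runes without ever cutting one in half; firstN is a hypothetical helper and assumes n is not negative.

```go
package main

import "fmt"

// firstN returns the prefix of s that holds at most n runes (n >= 0).
// (Hypothetical helper, written for this example.)
func firstN(s string, n int) string {
	count := 0
	for i := range s {
		if count == n {
			return s[:i] // i is the byte offset where the next rune begins
		}
		count++
	}
	return s // s has n runes or fewer
}

func main() {
	fmt.Println(firstN("café au lait", 4)) // café
	fmt.Println(firstN("日本語", 2))          // 日本
}
```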

Counting characters

len(s) gives you bytes. To count runes, use utf8.RuneCountInString:

package main

import (
    "fmt"
    "unicode/utf8"
)

func main() {
    s := "Hello, 世界! 👋"
    fmt.Println(len(s))                        // 19 (bytes)
    fmt.Println(utf8.RuneCountInString(s))     // 12 (runes)
}
Note: "Characters" is a surprisingly hard concept. Even rune counts can be misleading. Some "characters" (like a flag emoji, 🇺🇸) are built from multiple code points. User-perceived "characters" are called grapheme clusters and require more sophisticated handling: the third-party github.com/rivo/uniseg library segments them, and golang.org/x/text/unicode/norm handles the related problem of normalization. For most application code, rune counts are good enough.

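The gap between runes and user-perceived characters can be observed with the standard library alone. In this sketch the strings are written with escape sequences so the code points are unambiguous; sizes is a hypothetical helper.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// sizes reports the byte length and rune count of s.
// (Hypothetical helper, named for this example only.)
func sizes(s string) (nBytes, nRunes int) {
	return len(s), utf8.RuneCountInString(s)
}

func main() {
	flag := "\U0001F1FA\U0001F1F8" // regional indicators U + S: drawn as one flag
	accented := "e\u0301"          // 'e' + combining acute accent: drawn as one é

	fmt.Println(sizes(flag))     // 8 2
	fmt.Println(sizes(accented)) // 3 2
}
```

Both strings display as a single symbol, yet each counts as two runes.
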
String literals

Go has two flavors of string literal:

  1. Interpreted: double-quoted, supports escape sequences like \n, \t, \uXXXX.
  2. Raw: backtick-quoted, no escape sequences. Great for Windows paths, regex, and multi-line content.
normal := "line 1\nline 2\t(tabbed)"
raw := `line 1
line 2    but backslashes are literal: \n \t`
regex := `^\d{3}-\d{4}$`      // raw string: no need to escape the \
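
To see raw literals at work for regex and JSON, here is a minimal sketch. It compiles the pattern from the literal above with regexp.MustCompile and compares a raw JSON literal with its interpreted equivalent; isPhone and the sample JSON text are made up for illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
	"regexp"
)

// phone reuses the pattern shown above; the raw string keeps \d intact.
var phone = regexp.MustCompile(`^\d{3}-\d{4}$`)

// isPhone is a hypothetical helper for this example.
func isPhone(s string) bool { return phone.MatchString(s) }

func main() {
	fmt.Println(isPhone("555-1234")) // true
	fmt.Println(isPhone("5551234"))  // false

	// The same JSON text written both ways; the raw form needs no \" escapes.
	raw := `{"name": "Gopher", "tags": ["go", "utf8"]}`
	interpreted := "{\"name\": \"Gopher\", \"tags\": [\"go\", \"utf8\"]}"
	fmt.Println(raw == interpreted)      // true
	fmt.Println(json.Valid([]byte(raw))) // true
}
```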

Useful packages

  • strings: strings.Contains, strings.Split, strings.ToLower, strings.ReplaceAll, strings.Index, strings.Fields, strings.TrimSpace, and many more.
  • unicode: rune classification: unicode.IsLetter, unicode.IsDigit, unicode.IsSpace, unicode.ToUpper.
  • unicode/utf8: low-level UTF-8 operations: utf8.RuneCountInString, utf8.DecodeRuneInString, utf8.ValidString.
  • strconv: string ↔ number conversions (we met this in Chapter 6).
package main

import (
    "fmt"
    "strings"
)

func main() {
    words := strings.Fields("   the   quick  brown fox   ")
    fmt.Println(words)                           // [the quick brown fox]

    s := strings.ToUpper(strings.Join(words, "-"))
    fmt.Println(s)                               // THE-QUICK-BROWN-FOX
}
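
The unicode and unicode/utf8 packages combine naturally. As a sketch, the hypothetical helper capitalize below decodes only the first rune with utf8.DecodeRuneInString, upper-cases it with unicode.ToUpper, and reattaches the rest of the string by byte offset.

```go
package main

import (
	"fmt"
	"unicode"
	"unicode/utf8"
)

// capitalize upper-cases the first rune of s and leaves the rest untouched.
// (Hypothetical helper; empty or invalid UTF-8 input is returned unchanged.)
func capitalize(s string) string {
	if s == "" || !utf8.ValidString(s) {
		return s
	}
	r, size := utf8.DecodeRuneInString(s) // first rune and its width in bytes
	return string(unicode.ToUpper(r)) + s[size:]
}

func main() {
	fmt.Println(capitalize("élan"))          // Élan
	fmt.Println(capitalize("gopher"))        // Gopher
	fmt.Println(utf8.ValidString("caf\xc3")) // false: the 'é' is truncated
}
```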

Efficient string building

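Since strings are immutable, every + concatenation allocates a new string and copies both operands into it. That's fine for a few pieces, but in a loop the accumulated text is re-copied on every pass, so the total work grows quadratically. Use strings.Builder: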

package main

import (
    "fmt"
    "strings"
)

func main() {
    var b strings.Builder
    for i := 1; i <= 5; i++ {
        fmt.Fprintf(&b, "line %d\n", i)
    }
    fmt.Print(b.String())
}

strings.Builder minimizes copies by writing to an internal []byte and only converting to a string at the end. For building up large strings programmatically, it's much faster than +=.
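
When the final size is known or can be estimated, the Builder's Grow method reserves the buffer once. The sketch below uses WriteString rather than fmt.Fprintf; repeatWithSep is a hypothetical helper chosen only to make the size calculation obvious.

```go
package main

import (
	"fmt"
	"strings"
)

// repeatWithSep joins n copies of word with sep between them.
// (Hypothetical helper; strings.Repeat and strings.Join cover similar ground.)
func repeatWithSep(word, sep string, n int) string {
	if n <= 0 {
		return ""
	}
	var b strings.Builder
	b.Grow(n*len(word) + (n-1)*len(sep)) // reserve the final size up front
	for i := 0; i < n; i++ {
		if i > 0 {
			b.WriteString(sep)
		}
		b.WriteString(word)
	}
	return b.String()
}

func main() {
	fmt.Println(repeatWithSep("go", ", ", 3)) // go, go, go
}
```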

Also worth knowing: fmt.Sprintf is fine for small formatted strings. For serializing structured data, reach for encoding/json (Chapter 29) or text/template. Build loops with += are a classic Go anti-pattern; spotting them is a quick code-review win.

Practice exercises

EXERCISE 1

Reverse a string (correctly)

Write a function reverse(s string) string that returns s reversed. Make sure it handles non-ASCII input correctly: reverse("abc") returns "cba", and reverse("héllo 🌍") returns "🌍 olléh".

Hint: convert the string to []rune first, reverse the slice, then convert back.

Solution

package main

import "fmt"

func reverse(s string) string {
    runes := []rune(s)
    for i, j := 0, len(runes)-1; i < j; i, j = i+1, j-1 {
        runes[i], runes[j] = runes[j], runes[i]
    }
    return string(runes)
}

func main() {
    fmt.Println(reverse("abc"))        // cba
    fmt.Println(reverse("héllo 🌍"))   // 🌍 olléh
}

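The trick: byte-level reversal would split multi-byte runes, while rune-level reversal keeps every code point intact. (Text with combining marks or multi-code-point emoji would still need grapheme-aware handling.)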

EXERCISE 2

Count letters vs digits

Write a function that takes a string and returns two counts: how many letters and how many digits it contains. Use the unicode package.

Solution

package main

import (
    "fmt"
    "unicode"
)

func classify(s string) (letters, digits int) {
    for _, r := range s {
        switch {
        case unicode.IsLetter(r):
            letters++
        case unicode.IsDigit(r):
            digits++
        }
    }
    return
}

func main() {
    l, d := classify("Hello 123 Gophers!")
    fmt.Printf("letters=%d digits=%d\n", l, d)    // letters=12 digits=3
}

This uses named returns (Chapter 11), handy when the meaning of each return value is clear from its name.

You've learned the chapter everyone else skips.