Strings, Runes & Unicode
Every programming language has strings. Go's are simple on the surface
and quietly clever underneath. Once you understand the byte-vs-rune
split, a whole category of bugs evaporates, the kind where
"café" seems to have 5 characters instead of 4.
Learning objectives
- Explain what a Go string actually stores.
- Define byte, rune, and code point and how UTF-8 ties them together.
- Choose between indexing (s[i]) and ranging correctly.
- Count the characters of a string safely.
- Use raw string literals for regex and JSON.
- Build long strings efficiently with strings.Builder.
What a string actually is
A Go string is an immutable sequence of bytes.
Two operations are guaranteed:
- len(s) returns the number of bytes (not characters!).
- s[i] returns the i-th byte as a byte (i.e. uint8).
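That's it. Immutable, countable, byte-indexable. A string doesn't have a built-in notion of "characters", because the definition of character is tangled up in text encoding. Go took a strong stance here: source code is UTF-8, so string literals are UTF-8 encoded, and the standard library treats string contents as UTF-8 by default. A string value itself may still hold arbitrary bytes.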
package main
import "fmt"
func main() {
s := "hello"
fmt.Println(len(s)) // 5 (bytes)
fmt.Println(s[0]) // 104, the byte value of 'h'
fmt.Printf("%c\n", s[0]) // h, the same byte formatted as a character
}
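Immutability is worth seeing once in code. The sketch below assumes ASCII input and uses a hypothetical helper, replaceFirstByte (not part of the standard library), to show that "changing" a string really means copying its bytes, editing the copy, and building a new string.

```go
package main

import "fmt"

// replaceFirstByte returns a copy of s whose first byte is b.
// Hypothetical helper for illustration; only sensible for ASCII input.
func replaceFirstByte(s string, b byte) string {
	if s == "" {
		return s
	}
	buf := []byte(s) // copies the string's bytes into a mutable slice
	buf[0] = b
	return string(buf) // copies again into a new, immutable string
}

func main() {
	s := "hello"
	// s[0] = 'H' // does not compile: strings cannot be modified in place
	t := replaceFirstByte(s, 'H')
	fmt.Println(s) // hello (the original is untouched)
	fmt.Println(t) // Hello
}
```

The two copies are the price of immutability; in exchange, a string can be shared freely without anyone changing it behind your back.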
UTF-8 in 90 seconds
Unicode is a catalog of "characters" (really: code points). Each has a number. 'A' is U+0041. '€' is U+20AC. '😀' is U+1F600. There are over 1.1 million possible code points.
UTF-8 is a way to encode those numbers into bytes. It uses a variable-length scheme:
| Code-point range | Bytes | Example |
|---|---|---|
| U+0000 – U+007F | 1 | 'A' = 41 |
| U+0080 – U+07FF | 2 | 'é' = C3 A9 |
| U+0800 – U+FFFF | 3 | '€' = E2 82 AC |
| U+10000 – U+10FFFF | 4 | '😀' = F0 9F 98 80 |
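The byte counts in the table can be checked from Go itself. This is a minimal sketch: encode is a hypothetical helper that relies on the fact that converting a rune to a string produces its UTF-8 encoding.

```go
package main

import "fmt"

// encode returns the UTF-8 bytes of a single code point.
// Hypothetical helper: string(r) yields the UTF-8 encoding of r.
func encode(r rune) []byte {
	return []byte(string(r))
}

func main() {
	// The same four examples as the table, one per encoded length.
	for _, r := range []rune{'A', 'é', '€', '😀'} {
		b := encode(r)
		// %U prints the code point, % X prints the bytes in hex.
		fmt.Printf("%U %q: %d byte(s): % X\n", r, r, len(b), b)
	}
}
```

Each printed line should match the corresponding row's example column.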
Two consequences that matter for Go strings:
- ASCII is UTF-8. Old ASCII strings are valid UTF-8, no conversion needed.
- One character ≠ one byte. "é" is two bytes. "😀" is four.
Bytes vs runes
Go gives you two named types that will save your sanity:
- byte (alias for uint8): one raw byte of data.
- rune (alias for int32): one Unicode code point.
Visualize it for the string "café":
package main
import (
"fmt"
"unicode/utf8"
)
func main() {
s := "café"
fmt.Println(len(s)) // 5 bytes
fmt.Println(utf8.RuneCountInString(s)) // 4 runes
}
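Another way to see the split is to convert the same string into both views and print them. The views helper here is hypothetical and exists only to keep the example testable.

```go
package main

import "fmt"

// views returns the byte view and the rune view of s.
// Hypothetical helper; both conversions copy the data.
func views(s string) ([]byte, []rune) {
	return []byte(s), []rune(s)
}

func main() {
	bs, rs := views("café")
	fmt.Println(bs)               // [99 97 102 195 169]
	fmt.Println(rs)               // [99 97 102 233]
	fmt.Println(len(bs), len(rs)) // 5 4
}
```

The bytes 195 and 169 (C3 A9 in hex) are the two-byte encoding of the single rune 233 (U+00E9).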
Indexing & slicing
String indexing returns bytes, not runes. This is a foot-gun if your string isn't pure ASCII:
package main
import "fmt"
func main() {
s := "café"
fmt.Printf("%T %v\n", s[3], s[3]) // uint8 195: the first byte of 'é', NOT 'é' itself
}
Slicing works on bytes too:
s := "café"
fmt.Println(s[0:3]) // "caf": OK, those bytes map 1:1 to runes
fmt.Println(s[0:4]) // "caf\xc3": cuts the 'é' in half, leaving invalid UTF-8
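Slice only at byte offsets that fall on rune boundaries, such as offsets returned by the strings package (for example strings.Index), or when the content is guaranteed ASCII.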
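When the cut point has to be counted in runes rather than bytes, one option is to let range find the matching byte offset. A minimal sketch, with firstRunes as a hypothetical helper name:

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// firstRunes returns the prefix of s that holds at most n runes.
// Hypothetical helper: range yields byte offsets on rune boundaries,
// so slicing at one of them never splits a rune.
func firstRunes(s string, n int) string {
	if n <= 0 {
		return ""
	}
	count := 0
	for i := range s {
		if count == n {
			return s[:i]
		}
		count++
	}
	return s // s has n runes or fewer
}

func main() {
	s := "café"
	fmt.Println(firstRunes(s, 3))                   // caf
	fmt.Println(firstRunes(s, 4))                   // café
	fmt.Println(utf8.ValidString(s[0:4]))           // false: the byte slice splits 'é'
	fmt.Println(utf8.ValidString(firstRunes(s, 4))) // true
}
```

utf8.ValidString is a cheap way to confirm that a slice still holds well-formed UTF-8.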
Ranging correctly
The for … range loop over a string is rune-aware; it decodes UTF-8 on the fly:
package main
import "fmt"
func main() {
s := "café"
for i, r := range s {
fmt.Printf("byte-index %d: %q (rune value %d)\n", i, r, r)
}
}
Output:
byte-index 0: 'c' (rune value 99)
byte-index 1: 'a' (rune value 97)
byte-index 2: 'f' (rune value 102)
byte-index 3: 'é' (rune value 233)
Notice: the index i is a byte offset,
not a rune count. After 'é' (which is 2 bytes), the next iteration
would start at byte 5. This matters when you want to slice back into
the original string.
Compare to a C-style index loop:
// WRONG for non-ASCII, iterates bytes, not runes
for i := 0; i < len(s); i++ {
fmt.Printf("%c ", s[i]) // will print garbage for multi-byte runes
}
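If you do need an explicit index loop, it can advance by the width of each rune instead of by one byte. The sketch below approximates what range does for you using utf8.DecodeRuneInString; decodeAll is a hypothetical helper, and this is not a description of how the compiler implements range.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// decodeAll walks s one rune at a time and collects the decoded runes.
// Hypothetical helper for illustration.
func decodeAll(s string) []rune {
	var out []rune
	for i := 0; i < len(s); {
		// size is the number of bytes the rune occupies (1 to 4).
		r, size := utf8.DecodeRuneInString(s[i:])
		out = append(out, r)
		i += size
	}
	return out
}

func main() {
	for _, r := range decodeAll("café") {
		fmt.Printf("%c ", r) // c a f é
	}
	fmt.Println()
}
```

In everyday code, plain range is shorter and does the same job; the manual form is mostly useful when you need to stop and resume at a byte offset.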
Counting characters
len(s) gives you bytes. To count runes,
use utf8.RuneCountInString:
package main
import (
"fmt"
"unicode/utf8"
)
func main() {
s := "Hello, 世界! 👋"
fmt.Println(len(s)) // 19 (bytes)
fmt.Println(utf8.RuneCountInString(s)) // 12 (runes)
}
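Even a rune is not always what a reader perceives as one character: some visible characters are built from several code points, such as a letter followed by a combining accent. Handling those takes normalization or grapheme-cluster segmentation (the golang.org/x/text/unicode/norm and rivo/uniseg libraries do this). For most application code, rune counts are good enough.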
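To see where rune counts and visible characters part ways, compare two spellings of "é" using only the standard library. The sizes helper is hypothetical; the two string values are illustrative.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// sizes reports the byte length and rune count of s.
// Hypothetical helper for illustration.
func sizes(s string) (nBytes, nRunes int) {
	return len(s), utf8.RuneCountInString(s)
}

func main() {
	precomposed := "\u00e9" // é as a single code point
	combined := "e\u0301"   // e followed by U+0301 COMBINING ACUTE ACCENT

	fmt.Println(sizes(precomposed))      // 2 1
	fmt.Println(sizes(combined))         // 3 2
	fmt.Println(precomposed == combined) // false, although both render as é
}
```

Both strings look identical on screen, yet they differ in bytes, in runes, and under ==.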
String literals
Go has two flavors of string literal:
- Interpreted: double-quoted, supports escape sequences like \n, \t, \uXXXX.
- Raw: backtick-quoted, no escape sequences. Great for Windows paths, regex, and multi-line content.
normal := "line 1\nline 2\t(tabbed)"
raw := `line 1
line 2 but backslashes are literal: \n \t`
regex := `^\d{3}-\d{4}$` // raw string, don't escape the \
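Raw literals pay off most with regular expressions and JSON, where backslashes and double quotes are everywhere. In this sketch, isPhone, phonePattern, and sampleJSON are hypothetical names and the JSON content is made-up sample data.

```go
package main

import (
	"encoding/json"
	"fmt"
	"regexp"
)

// phonePattern is a raw literal, so \d reaches the regexp engine unchanged.
var phonePattern = regexp.MustCompile(`^\d{3}-\d{4}$`)

// sampleJSON is made-up data; the raw literal avoids escaping every quote.
const sampleJSON = `{"name": "Gopher", "langs": ["Go"]}`

// isPhone reports whether s looks like 555-1234.
func isPhone(s string) bool {
	return phonePattern.MatchString(s)
}

func main() {
	fmt.Println(isPhone("555-1234"))            // true
	fmt.Println(isPhone("5551234"))             // false
	fmt.Println(json.Valid([]byte(sampleJSON))) // true

	// The interpreted form of the same pattern needs doubled backslashes.
	fmt.Println("^\\d{3}-\\d{4}$" == `^\d{3}-\d{4}$`) // true
}
```

The one thing a raw literal cannot contain is a backtick, so a pattern that needs one has to fall back to an interpreted literal.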
Useful packages
- strings: strings.Contains, strings.Split, strings.ToLower, strings.ReplaceAll, strings.Index, strings.Fields, strings.TrimSpace, and many more.
- unicode: rune classification: unicode.IsLetter, unicode.IsDigit, unicode.IsSpace, unicode.ToUpper.
- unicode/utf8: low-level UTF-8 operations: utf8.RuneCountInString, utf8.DecodeRuneInString, utf8.ValidString.
- strconv: string ↔ number conversions (we met this in Chapter 6).
package main
import (
"fmt"
"strings"
)
func main() {
words := strings.Fields(" the quick brown fox ")
fmt.Println(words) // [the quick brown fox]
s := strings.ToUpper(strings.Join(words, "-"))
fmt.Println(s) // THE-QUICK-BROWN-FOX
}
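The unicode and unicode/utf8 packages combine nicely for rune-level edits. As a small sketch, the hypothetical helper capitalize upper-cases only the first rune of a string, whatever its byte width.

```go
package main

import (
	"fmt"
	"unicode"
	"unicode/utf8"
)

// capitalize returns s with its first rune upper-cased.
// Hypothetical helper: empty or invalid input is returned unchanged.
func capitalize(s string) string {
	r, size := utf8.DecodeRuneInString(s)
	if r == utf8.RuneError {
		return s
	}
	// s[size:] is safe because size is the byte width of the first rune.
	return string(unicode.ToUpper(r)) + s[size:]
}

func main() {
	fmt.Println(capitalize("go"))   // Go
	fmt.Println(capitalize("élan")) // Élan
}
```

Slicing at size rather than at 1 is what keeps the two-byte "é" intact.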
Efficient string building
Since strings are immutable, every + concatenation allocates
a new string and copies both operands into it. That's fine for a few
pieces, but in a loop the repeated copying adds up quickly. Use
strings.Builder:
package main
import (
"fmt"
"strings"
)
func main() {
var b strings.Builder
for i := 1; i <= 5; i++ {
fmt.Fprintf(&b, "line %d\n", i)
}
fmt.Print(b.String())
}
strings.Builder minimizes copies by writing to an
internal []byte and only converting to a string at the
end. For building up large strings programmatically, it's much faster
than +=.
fmt.Sprintf is fine for small formatted strings. For
serializing structured data, reach for encoding/json
(Chapter 29) or text/template. Build-loops with
+= are a classic Go anti-pattern, and spotting them is
a quick code-review win.
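When the final size is known or easy to estimate, Builder's Grow method can reserve the space up front. The sketch below uses a hypothetical helper, joinRepeated, purely for illustration; real code would normally use strings.Repeat or strings.Join for this particular job.

```go
package main

import (
	"fmt"
	"strings"
)

// joinRepeated returns n copies of word separated by sep.
// Hypothetical helper for illustration.
func joinRepeated(word, sep string, n int) string {
	if n <= 0 {
		return ""
	}
	var b strings.Builder
	// Reserve the exact final size so the buffer never has to regrow.
	b.Grow(n*len(word) + (n-1)*len(sep))
	for i := 0; i < n; i++ {
		if i > 0 {
			b.WriteString(sep)
		}
		b.WriteString(word)
	}
	return b.String()
}

func main() {
	fmt.Println(joinRepeated("go", "-", 3)) // go-go-go
}
```

Grow is optional: without it the Builder still works and simply enlarges its buffer as needed.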
Practice exercises
Reverse a string (correctly)
Write a function reverse(s string) string that returns
s reversed. Make sure it handles non-ASCII input
correctly: reverse("abc") → "cba",
reverse("héllo 🌍") → "🌍 olléh".
Hint: convert the string to []rune first, reverse the slice, then convert back.
A working solution:
package main
import "fmt"
func reverse(s string) string {
runes := []rune(s)
for i, j := 0, len(runes)-1; i < j; i, j = i+1, j-1 {
runes[i], runes[j] = runes[j], runes[i]
}
return string(runes)
}
func main() {
fmt.Println(reverse("abc")) // cba
fmt.Println(reverse("héllo 🌍")) // 🌍 olléh
}
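The trick: byte-level reversal would split multi-byte runes, while rune-level reversal keeps every code point intact. (Characters built from several code points, such as a letter followed by a combining accent, would still need grapheme-aware handling.)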
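For contrast, here is a sketch of the byte-level version that the exercise warns against. reverseBytes is a hypothetical name, and the point is only to show the damage on non-ASCII input.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// reverseBytes reverses s byte by byte.
// Hypothetical helper: correct for ASCII, wrong for multi-byte runes.
func reverseBytes(s string) string {
	b := []byte(s)
	for i, j := 0, len(b)-1; i < j; i, j = i+1, j-1 {
		b[i], b[j] = b[j], b[i]
	}
	return string(b)
}

func main() {
	fmt.Println(reverseBytes("abc"))                     // cba
	fmt.Println(utf8.ValidString(reverseBytes("héllo"))) // false
}
```

Reversing the two bytes of "é" puts the continuation byte first, which is no longer valid UTF-8.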
Count letters vs digits
Write a function that takes a string and returns two counts: how
many letters and how many digits it contains. Use the
unicode package.
A working solution:
package main
import (
"fmt"
"unicode"
)
func classify(s string) (letters, digits int) {
for _, r := range s {
switch {
case unicode.IsLetter(r):
letters++
case unicode.IsDigit(r):
digits++
}
}
return
}
func main() {
l, d := classify("Hello 123 Gophers!")
fmt.Printf("letters=%d digits=%d\n", l, d) // letters=12 digits=3
}
This uses named returns (Chapter 11), handy when the meaning of each return value is clear from its name.
Further reading
- The Go Blog, Strings, bytes, runes and characters: Rob Pike's essential essay.
- strings package docs
- unicode/utf8 package docs
- Unicode.org: the authoritative source on code points.
You've learned the chapter everyone else skips.