Mungomash LLC
Text Encoding Inspector

Code points · UTF-8 · UTF-16 · UTF-32 · grapheme clusters

Private: runs in your browser. Your text stays on this page — nothing is sent to Mungomash, and no third-party API is contacted for the analysis.


Type or paste a string above — or click a sample chip — to see it broken down character by character.

Why the same string has four different lengths

A string is not one thing. It’s at least four:

  • A sequence of grapheme clusters — what a reader counts as “visible characters.”
  • A sequence of Unicode code points — the abstract entries in the Unicode standard.
  • A sequence of UTF-8 bytes — what gets sent over the wire and stored on disk.
  • A sequence of UTF-16 code units — what JavaScript’s str.length and Java’s String.length() return.

The four counts are usually different. "café" — with a precomposed é — is 4 graphemes, 4 code points, 5 UTF-8 bytes, and 4 UTF-16 code units. "👨‍👩‍👧‍👦" (family emoji) is 1 grapheme, 7 code points, 25 UTF-8 bytes, and 11 UTF-16 code units. Most string bugs in production code are a mismatch between which of these four the developer thought they were counting and which one their language was actually counting.
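As a concrete sketch, here is how the four counts fall out in a modern JavaScript/TypeScript runtime (this assumes Intl.Segmenter, which ships in current browsers and recent Node; everything else is baseline):

    // Four lengths of the same string, counted four ways.
    const s = "café"; // precomposed é (U+00E9)

    const seg = new Intl.Segmenter(undefined, { granularity: "grapheme" });
    const graphemes  = [...seg.segment(s)].length;         // 4 grapheme clusters
    const codePoints = [...s].length;                      // 4: spread iterates code points
    const utf8Bytes  = new TextEncoder().encode(s).length; // 5: é is 2 bytes in UTF-8
    const utf16Units = s.length;                           // 4: .length counts UTF-16 units

Run the same four lines on "👨‍👩‍👧‍👦" and you get 1, 7, 25, and 11.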

ASCII, Latin-1, and the road to Unicode

ASCII (1963, standardized 1968) is a 7-bit encoding covering 128 characters: the basic Latin alphabet, digits, punctuation, and a handful of control characters. Every byte from 0x00 through 0x7F maps to exactly one character. ASCII handled English and not much else.

The 8th bit was free, so the world filled it — differently in every locale. Latin-1 (ISO 8859-1) added Western European accented letters; Latin-2 handled Central European; Windows-1252 was Microsoft’s own variant; Shift_JIS, GB2312, Big5, and EUC-KR handled CJK languages with multi-byte schemes. By the early 1990s a single document with English, French, Japanese, and Chinese text had no consistent encoding it could live in.

Unicode (1991, currently version 16.0) was the answer. Unicode separates two ideas that earlier encodings conflated: a code point (an abstract identifier for a character — U+0041 for “A”, U+1F600 for “😀”) and an encoding (how those code points become bytes — UTF-8, UTF-16, UTF-32, etc.). Today Unicode defines code points up to U+10FFFF — over a million possible values, of which ~150,000 are actually assigned.

UTF-8 — variable-width by design

UTF-8 (1992–1993, designed by Ken Thompson and Rob Pike) is a variable-length encoding: 1 byte for ASCII, 2 bytes for most Latin / Cyrillic / Greek / Arabic / Hebrew, 3 bytes for most CJK and Indic scripts, 4 bytes for emoji and rare historical scripts. The byte patterns are self-synchronizing — you can drop into any byte stream mid-sequence and find the next character boundary in at most three bytes.
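To make those byte patterns concrete, here is a minimal hand-rolled encoder for one code point. It is an illustration of the bit layout only: real code should use TextEncoder, and this sketch skips validation of surrogate code points and out-of-range input.

    // UTF-8 bit layout: 0xxxxxxx / 110xxxxx 10xxxxxx / 1110xxxx 10xxxxxx 10xxxxxx /
    // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    function utf8Encode(cp: number): number[] {
      if (cp <= 0x7f) return [cp];                                    // 1 byte: ASCII
      if (cp <= 0x7ff) return [0xc0 | (cp >> 6), 0x80 | (cp & 0x3f)]; // 2 bytes
      if (cp <= 0xffff)
        return [0xe0 | (cp >> 12),
                0x80 | ((cp >> 6) & 0x3f),
                0x80 | (cp & 0x3f)];                                  // 3 bytes
      return [0xf0 | (cp >> 18),
              0x80 | ((cp >> 12) & 0x3f),
              0x80 | ((cp >> 6) & 0x3f),
              0x80 | (cp & 0x3f)];                                    // 4 bytes
    }

    utf8Encode(0x41);    // [0x41]                   : "A"
    utf8Encode(0xe9);    // [0xc3, 0xa9]             : "é"
    utf8Encode(0x1f600); // [0xf0, 0x9f, 0x98, 0x80] : "😀"

The lead byte's high bits announce the sequence length, and every continuation byte starts with 10, which is exactly why the encoding is self-synchronizing.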

The variable width is the point. A document that’s 99% ASCII (most software source code, most config files, most English prose) costs nothing extra to encode in UTF-8. A document that’s 99% Chinese costs about 1.5× the bytes a Big5 document would — 3 bytes per character in UTF-8 versus 2 in Big5. UTF-8’s genius is that it’s the universal lowest-cost encoding for English and the universal “works everywhere” encoding for every other language — even if it’s not the cheapest option for any single non-Latin script.

UTF-8 is now overwhelmingly the dominant encoding on the web (estimated >98% of public pages), in modern programming languages (Go, Rust, Python 3, Swift), and in modern file formats (JSON, YAML, TOML, modern XML). The only place UTF-16 still dominates is inside the runtime memory of platforms designed in the early-to-mid 1990s — Windows, Java, JavaScript — where 16-bit code units were the obvious choice when every Unicode code point still fit in 16 bits, and changing later would have broken every API.

UTF-16 and the surrogate-pair trap

UTF-16 uses 16-bit code units. For code points in the Basic Multilingual Plane (BMP, U+0000–U+FFFF) one code unit holds the value directly. For code points above U+FFFF — everything in the supplementary planes, including most emoji — UTF-16 uses a surrogate pair: a high surrogate in U+D800–U+DBFF followed by a low surrogate in U+DC00–U+DFFF, two code units that combine to encode the code point.
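The arithmetic behind a surrogate pair is small enough to show inline: subtract 0x10000, then split the remaining 20 bits across the two units. A sketch (the helper name is ours, not a standard API):

    // Encode a supplementary-plane code point (cp > 0xFFFF) as a surrogate pair.
    function toSurrogatePair(cp: number): [number, number] {
      const v = cp - 0x10000;           // 20 bits remain
      return [0xd800 + (v >> 10),       // high surrogate: top 10 bits
              0xdc00 + (v & 0x3ff)];    // low surrogate: bottom 10 bits
    }

    const [hi, lo] = toSurrogatePair(0x1f600); // [0xD83D, 0xDE00]
    String.fromCharCode(hi, lo) === "😀";      // true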

That choice is the source of more bugs in JavaScript and Java than any other Unicode design decision. "😀".length returns 2 in JavaScript, not 1, because .length counts UTF-16 code units. "😀"[0] returns the high surrogate alone — not a valid character on its own — rather than the emoji. .substring(0, 1) on an emoji-bearing string can split a surrogate pair down the middle and produce a corrupted half-character. The fix is iterating with for...of (which respects code points), Array.from(str) (which yields one single-code-point string per element), or a grapheme-aware library, as in the sketch below.
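The whole trap fits in a console session:

    const face = "😀";        // U+1F600, outside the BMP

    face.length;              // 2: UTF-16 code units, not characters
    face[0];                  // "\uD83D", a lone high surrogate, not the emoji
    face.codePointAt(0);      // 0x1F600, the real code point
    [...face].length;         // 1: spread and for...of iterate by code point
    Array.from(face);         // ["😀"]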

Java made the same choice in 1995, when Unicode still fit in 16 bits, and is stuck with the same trap: String.length() counts UTF-16 code units, and String.charAt(int) can return half of a surrogate pair. The escape hatches are String.codePointCount and String.codePoints(). Python 3 made the modern choice of iterating by code point, and Swift goes further, iterating by grapheme cluster via its Character type — though Python 3 still has no built-in grapheme segmentation.

Grapheme clusters — what a reader sees

A grapheme cluster is what a human reads as “one character.” It can be a single code point (A, ¥) or a sequence of code points that the reader perceives as one glyph: a base letter plus combining marks (e + U+0301 = é), an emoji plus skin-tone modifier (👋 + U+1F3FD = 👋🏽), an emoji plus ZWJ + emoji forming a compound (👨 + ZWJ + 👩 + ZWJ + 👧 + ZWJ + 👦 = 👨‍👩‍👧‍👦), a regional-indicator pair forming a flag (🇺 + 🇸 = 🇺🇸).

The Unicode standard publishes the rules for finding grapheme cluster boundaries (UAX #29), and modern platforms implement them: Intl.Segmenter in JavaScript, Swift’s grapheme-cluster-based Character type, the unicode-segmentation crate in Rust, and the java.text.BreakIterator class in Java. If you need to count “visible characters” for any user-facing display (length limits in a UI, truncating to fit a column, drawing a cursor), grapheme clusters — not code points — are the right unit.
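Applied to the composites above (again assuming an engine with Intl.Segmenter), UAX #29 segmentation returns the counts a reader expects:

    const seg = new Intl.Segmenter(undefined, { granularity: "grapheme" });
    const graphemes = (s: string) => [...seg.segment(s)].length;

    graphemes("e\u0301");    // 1: e + combining acute = é
    graphemes("👋🏽");         // 1: wave + skin-tone modifier
    graphemes("👨‍👩‍👧‍👦");        // 1: four emoji joined by ZWJs
    graphemes("🇺🇸");         // 1: two regional indicators
    "👨‍👩‍👧‍👦".length;           // 11 UTF-16 code units, for contrast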

Mojibake — the “’” problem

Mojibake — from the Japanese for “character transformation” — is what happens when text encoded in one encoding is decoded as another. The most common pattern in 2020s software is UTF-8 bytes interpreted as Windows-1252 (often mislabeled Latin-1). The right curly apostrophe (U+2019) encodes in UTF-8 as the three bytes 0xE2 0x80 0x99; decoded as Windows-1252, those three bytes render as â€™ — the unmistakable mojibake signature you’ve seen in old emails and bad CSV imports.

The same byte sequence decoded as strict Latin-1 looks slightly different, because Latin-1 maps the C1 control range (0x80–0x9F) to invisible control characters: you would see only the â. Windows-1252 fills that range with printable symbols (€ at 0x80, ™ at 0x99), which is why the full â€™ appears; most real-world “Latin-1 mojibake” is actually Windows-1252 mojibake. The fix is to decode the bytes with the correct encoding the first time; the repair pattern is “encode as Windows-1252, decode as UTF-8” to reverse the bad round-trip. (A dedicated repair tool is on the Mungomash backlog at /tools/mojibake/.)
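You can reproduce the signature in a couple of lines; TextDecoder speaks windows-1252 even though TextEncoder is UTF-8-only:

    const bytes = new TextEncoder().encode("\u2019"); // [0xE2, 0x80, 0x99], UTF-8 for ’
    new TextDecoder("windows-1252").decode(bytes);    // "â€™"  (the mojibake)
    new TextDecoder("utf-8").decode(bytes);           // "’"    (the correct decode)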

BOMs — Byte Order Marks, U+FEFF — are a separate-but-related cause of breakage. UTF-16 needs a BOM to disambiguate big-endian from little-endian. UTF-8 doesn’t need a BOM because UTF-8 has no byte-order ambiguity, but Windows tools (Notepad through Windows 10, Excel CSV exports) write a UTF-8 BOM anyway. A UTF-8 BOM is the three bytes 0xEF 0xBB 0xBF; if a parser doesn’t expect it, those bytes show up as the literal characters ï»¿ at the start of the file. Most modern tooling now strips a leading UTF-8 BOM silently; some still don’t.
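A minimal defensive strip looks like this (the helper name is ours; note that TextDecoder already drops a leading UTF-8 BOM unless you construct it with ignoreBOM: true):

    // Remove a leading UTF-8 BOM (0xEF 0xBB 0xBF) from raw bytes before parsing.
    function stripUtf8Bom(bytes: Uint8Array): Uint8Array {
      const hasBom = bytes.length >= 3 &&
        bytes[0] === 0xef && bytes[1] === 0xbb && bytes[2] === 0xbf;
      return hasBom ? bytes.subarray(3) : bytes;
    }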

Source-code escapes

Different languages spell the same character different ways inside source code. U+1F600 (😀) is:

  • JavaScript: "\u{1F600}" (modern), or "\uD83D\uDE00" (surrogate-pair form, works in older runtimes).
  • Python: "\U0001F600" (uppercase U, 8-digit hex) or "\N{GRINNING FACE}" (named escape).
  • Rust: "\u{1F600}".
  • Go: "\U0001F600" (uppercase U) or just the literal character (Go source is UTF-8).
  • Java: "\uD83D\uDE00" (Java’s \u escapes are UTF-16 code units only, so supplementary code points need a surrogate pair). There is no code-point escape; Character.toString(0x1F600) (Java 11+) builds the string at run time.
  • C/C++: "\U0001F600" in a UTF-8 string literal (u8"..." in C++).
  • JSON: "\ud83d\ude00" — JSON’s only escape form is \uXXXX, so code points above U+FFFF must be escaped as a surrogate pair (a raw UTF-8 😀 in the string is also valid).

The per-character table above shows the JS and Python escapes for each code point so you can paste either form directly into source. For escape translation across more languages, see the planned sibling tool at /tools/escape-sequences/.
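In JavaScript the two spellings denote the same two code units, which is easy to confirm:

    "\u{1F600}" === "\uD83D\uDE00";           // true: same UTF-16 code units
    "\u{1F600}".length;                       // 2
    "\u{1F600}".codePointAt(0)!.toString(16); // "1f600"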