Chinese Word Segmentation and Counting
In English, words are separated by spaces, making word counting relatively straightforward. Chinese, however, has no natural word boundary markers. This makes "character count" versus "word count" a fascinating and technically challenging problem in natural language processing.
Characters vs Words in Chinese
Before discussing Chinese word counting, we need to clarify two fundamental concepts:
- Character (zi) — A single Chinese character, like "wo" (I), "shi" (is), "ren" (person). Each occupies one Unicode code point
- Word (ci) — A semantic unit composed of one or more characters, like "diannao" (computer), "rengong zhineng" (artificial intelligence)
When Chinese speakers say "this article has 500 zi," they typically mean 500 characters, not 500 words. This differs fundamentally from the English concept of "500 words."
Key Takeaway: Chinese "character count" (zi shu) refers to the number of individual characters, while English "word count" refers to the number of words. A 1,000-character Chinese article, after segmentation, typically contains about 600-700 words.
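The distinction shows up directly in code. In Python, `len()` on a string counts Unicode code points, which for ordinary Chinese text matches the everyday notion of "character count"; a word count only emerges after segmentation. A minimal sketch, with the segmentation written out by hand since in practice it would come from a tool such as Jieba:

```python
# len() counts code points, i.e. the everyday Chinese "character count" (zi shu).
sentence = "人工智能改变世界"  # "AI changes the world"
print(len(sentence))          # → 8 characters

# The word count exists only after segmentation. This segmentation is written
# by hand for illustration; a real pipeline would produce it with a segmenter.
segmented = ["人工智能", "改变", "世界"]
print(len(segmented))         # → 3 words
```

Eight characters, three words: roughly the 60-70% ratio mentioned above.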
Why Chinese Word Segmentation Is Difficult
Chinese Word Segmentation (CWS) is a fundamental task in Natural Language Processing (NLP), and one that is still not fully solved. The main challenges include:
1. Segmentation Ambiguity
The same string of characters can be segmented in more than one way. A classic example is the Chinese phrase meaning "studying life's origins": it can be parsed as "study / life / origins" or as "graduate student / destiny / origins", and both readings are valid depending on context.
2. Out-of-Vocabulary (OOV) Words
Language constantly evolves with new vocabulary. Names, places, and internet slang often don't appear in dictionaries. Terms like "ChatGPT" and "metaverse" weren't in any dictionary when they first appeared.
3. Fuzzy Word Boundaries
Whether certain linguistic units count as one word or multiple words is debated even among linguists. For example, "People's Republic of China" in Chinese — is it one word or a combination of multiple words?
Main Segmentation Methods
Dictionary-Based Methods
The most intuitive approach maintains a dictionary and matches text against entries. Common matching strategies include:
| Method | Description | Pros & Cons |
|---|---|---|
| Forward Maximum Matching (FMM) | Left to right, take longest match | Simple and fast, but hits ambiguity |
| Backward Maximum Matching (BMM) | Right to left, take longest match | Usually more accurate for Chinese |
| Bidirectional Maximum Matching | Run both FMM and BMM, pick better result | Higher accuracy, slower speed |
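As a concrete illustration, here is a minimal Python sketch of Forward Maximum Matching. The tiny dictionary is invented for the example, and it deliberately reproduces the "graduate student" misreading from the ambiguity example above, showing why greedy longest-match can go wrong:

```python
def fmm_segment(text, dictionary, max_word_len=4):
    """Forward Maximum Matching: scan left to right, at each position
    taking the longest dictionary entry that matches."""
    words = []
    i = 0
    while i < len(text):
        # Try the longest window first, shrinking down to a single character.
        for length in range(min(max_word_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            # A single character always "matches" so the scan can proceed.
            if length == 1 or candidate in dictionary:
                words.append(candidate)
                i += length
                break
    return words

# Toy dictionary: "research", "graduate student", "life", "origin".
toy_dict = {"研究", "研究生", "生命", "起源"}
print(fmm_segment("研究生命起源", toy_dict))
# → ['研究生', '命', '起源'] — the greedy "graduate student" misreading
```

A backward pass over the same string would recover the intended "研究 / 生命 / 起源" reading, which is exactly why bidirectional matching compares both results.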
Statistical Methods
These approaches use large annotated corpora to train statistical models. Hidden Markov Models (HMM) and Conditional Random Fields (CRF) are commonly used methods. They can learn patterns between characters and handle unknown words to some degree.
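These statistical models usually recast segmentation as per-character tagging (a framing commonly credited to Xue, 2003): each character is labeled B (begin), M (middle), or E (end) of a multi-character word, or S (single-character word), and the HMM or CRF predicts one tag per character. A sketch of the encoding step only, with the model itself omitted:

```python
def words_to_bmes(words):
    """Encode a segmented sentence as a BMES tag sequence,
    one tag per character."""
    tags = []
    for word in words:
        if len(word) == 1:
            tags.append("S")           # single-character word
        else:
            tags.append("B")           # first character of the word
            tags.extend("M" * (len(word) - 2))  # interior characters, if any
            tags.append("E")           # last character of the word
    return tags

# "research / life / (particle) / origin"
print(words_to_bmes(["研究", "生命", "的", "起源"]))
# → ['B', 'E', 'B', 'E', 'S', 'B', 'E']
```

Training then reduces to learning tag transitions and character emissions; decoding a new sentence's tag sequence (e.g. with Viterbi) yields the segmentation.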
Deep Learning Methods
In recent years, segmentation methods based on LSTM, BERT, and other deep learning models have achieved remarkable progress. These models capture more complex language features and reach F1 scores above 97% on standard test sets.
Popular Chinese Segmentation Tools
- Jieba — The most popular Python Chinese segmentation library, combining dictionary and HMM approaches
- CKIP Tagger — Developed by Academia Sinica, with particularly strong support for Traditional Chinese
- HanLP — Feature-rich Chinese NLP toolkit
- PaddleNLP — Developed by Baidu, high-accuracy deep learning-based segmentation
Unicode and Chinese Characters
Understanding the Unicode standard is essential when working with Chinese character counting in code. Chinese characters fall mainly within these Unicode blocks:
- CJK Unified Ideographs (U+4E00 - U+9FFF) — The most commonly used 20,992 characters
- CJK Unified Ideographs Extension A (U+3400 - U+4DBF) — 6,592 less common characters
- CJK Unified Ideographs Extensions B-F — Rare and archaic characters
- CJK Compatibility Ideographs (U+F900 - U+FAFF) — Compatibility characters
In JavaScript, you can use the regular expression /[\u4e00-\u9fff]/g to match the most common Chinese characters and count them.
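The same check can be written in Python by testing code points directly. This sketch covers only the basic block and Extension A, an assumption that suffices for most modern text; extend the ranges if rarer characters matter for your use case:

```python
def count_cjk(text):
    """Count characters in the CJK Unified Ideographs blocks
    (basic block U+4E00-U+9FFF plus Extension A U+3400-U+4DBF)."""
    return sum(
        1 for ch in text
        if 0x4E00 <= ord(ch) <= 0x9FFF or 0x3400 <= ord(ch) <= 0x4DBF
    )

print(count_cjk("Chinese 中文 123"))  # → 2 (only 中 and 文 are counted)
```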
Practical Advice for Character Counting
For general Chinese character counting needs, here are practical recommendations:
- Raw character count — Suitable for school essays, academic papers with specific character requirements
- Non-whitespace character count — Excludes spaces and line breaks for a more accurate content measurement
- Chinese character count only — Counts only CJK characters, excluding English and numbers
- Mixed counting — Chinese by character, English by word — closest to actual reading experience
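The mixed-counting approach can be sketched in a few lines of Python. The exact regex and the choice to count digit runs as words are illustrative assumptions, not a standard:

```python
import re

def mixed_count(text):
    """Mixed counting: each CJK character counts as one unit, while each
    run of Latin letters or digits counts as a single word."""
    cjk = len(re.findall(r"[\u4e00-\u9fff]", text))
    latin_words = len(re.findall(r"[A-Za-z0-9]+", text))
    return cjk + latin_words

# 7 CJK characters + 2 Latin/digit words ("Python", "3")
print(mixed_count("我用 Python 写了 3 个程序"))  # → 9
```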
Conclusion
Chinese word segmentation is a deceptively simple problem with surprising depth. For general users, understanding the difference between characters and words is sufficient. For developers and researchers, Chinese segmentation remains an active area of research with new methods and tools constantly emerging.
References
- Huang, Changning and Zhao, Hai. "Chinese Word Segmentation: A Decade Review." Journal of Chinese Information Processing, 21(3), 2007.
- Sun, Junyi (fxsjy). "Jieba Chinese Text Segmentation." GitHub Repository, 2023. https://github.com/fxsjy/jieba
- The Unicode Consortium. "CJK Unified Ideographs." The Unicode Standard, Version 15.0, 2022. https://www.unicode.org/charts/PDF/U4E00.pdf
- Xue, Nianwen. "Chinese Word Segmentation as Character Tagging." International Journal of Computational Linguistics and Chinese Language Processing, 8(1), 2003.