Chinese Word Segmentation and Counting
In English, words are separated by spaces, making word counting relatively straightforward. Chinese, however, has no natural word boundary markers. This makes "character count" versus "word count" a fascinating and technically challenging problem in natural language processing.
Characters vs Words in Chinese
Before discussing Chinese word counting, we need to clarify two fundamental concepts:
- Character (zi) — A single Chinese character, like "wo" (I), "shi" (is), "ren" (person). Each occupies one Unicode code point
- Word (ci) — A semantic unit composed of one or more characters, like "diannao" (computer), "rengong zhineng" (artificial intelligence)
When Chinese speakers say "this article has 500 zi," they typically mean 500 characters, not 500 words. This differs fundamentally from the English concept of "500 words."
Key Takeaway: Chinese "character count" (zi shu) refers to the number of individual characters, while English "word count" refers to the number of words. A 1,000-character Chinese article, after segmentation, typically contains about 600-700 words.
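The distinction shows up directly in code. In Python, `len()` on a string counts Unicode code points, which for ordinary Chinese text matches the everyday notion of "character count"; a word count only emerges after segmentation. A minimal sketch, with the segmentation written out by hand since in practice it would come from a tool such as Jieba:

```python
# len() counts code points, i.e. the everyday Chinese "character count" (zi shu).
sentence = "人工智能改变世界"  # "AI changes the world"
print(len(sentence))          # → 8 characters

# The word count exists only after segmentation. This segmentation is written
# by hand for illustration; a real pipeline would produce it with a segmenter.
segmented = ["人工智能", "改变", "世界"]
print(len(segmented))         # → 3 words
```

Eight characters, three words: roughly the 60-70% ratio mentioned above.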
Why Chinese Word Segmentation Is Difficult
Chinese Word Segmentation (CWS) is a fundamental task in Natural Language Processing (NLP), and one that is still not fully solved. The main challenges include:
1. Segmentation Ambiguity
The same string of characters can be segmented in more than one way. A classic example is the Chinese phrase meaning "studying life's origins": it can be parsed as "study / life / origins" or as "graduate student / destiny / origins", and both readings are valid depending on context.
2. Out-of-Vocabulary (OOV) Words
Language constantly evolves with new vocabulary. Names, places, and internet slang often don't appear in dictionaries. Terms like "ChatGPT" and "metaverse" weren't in any dictionary when they first appeared.
3. Fuzzy Word Boundaries
Whether certain linguistic units count as one word or multiple words is debated even among linguists. For example, "People's Republic of China" in Chinese — is it one word or a combination of multiple words?
Main Segmentation Methods
Dictionary-Based Methods
The most intuitive approach maintains a dictionary and matches text against entries. Common matching strategies include:
| Method | Description | Pros & Cons |
|---|---|---|
| Forward Maximum Matching (FMM) | Left to right, take longest match | Simple and fast, but hits ambiguity |
| Backward Maximum Matching (BMM) | Right to left, take longest match | Usually more accurate for Chinese |
| Bidirectional Maximum Matching | Run both FMM and BMM, pick better result | Higher accuracy, slower speed |
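As a concrete illustration, here is a minimal Python sketch of Forward Maximum Matching. The tiny dictionary is invented for the example, and it deliberately reproduces the "graduate student" misreading from the ambiguity example above, showing why greedy longest-match can go wrong:

```python
def fmm_segment(text, dictionary, max_word_len=4):
    """Forward Maximum Matching: scan left to right, at each position
    taking the longest dictionary entry that matches."""
    words = []
    i = 0
    while i < len(text):
        # Try the longest window first, shrinking down to a single character.
        for length in range(min(max_word_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            # A single character always "matches" so the scan can proceed.
            if length == 1 or candidate in dictionary:
                words.append(candidate)
                i += length
                break
    return words

# Toy dictionary: "research", "graduate student", "life", "origin".
toy_dict = {"研究", "研究生", "生命", "起源"}
print(fmm_segment("研究生命起源", toy_dict))
# → ['研究生', '命', '起源'] — the greedy "graduate student" misreading
```

A backward pass over the same string would recover the intended "研究 / 生命 / 起源" reading, which is exactly why bidirectional matching compares both results.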
Statistical Methods
These approaches use large annotated corpora to train statistical models. Hidden Markov Models (HMM) and Conditional Random Fields (CRF) are commonly used methods. They can learn patterns between characters and handle unknown words to some degree.
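These statistical models usually recast segmentation as per-character tagging (a framing commonly credited to Xue, 2003): each character is labeled B (begin), M (middle), or E (end) of a multi-character word, or S (single-character word), and the HMM or CRF predicts one tag per character. A sketch of the encoding step only, with the model itself omitted:

```python
def words_to_bmes(words):
    """Encode a segmented sentence as a BMES tag sequence,
    one tag per character."""
    tags = []
    for word in words:
        if len(word) == 1:
            tags.append("S")           # single-character word
        else:
            tags.append("B")           # first character of the word
            tags.extend("M" * (len(word) - 2))  # interior characters, if any
            tags.append("E")           # last character of the word
    return tags

# "research / life / (particle) / origin"
print(words_to_bmes(["研究", "生命", "的", "起源"]))
# → ['B', 'E', 'B', 'E', 'S', 'B', 'E']
```

Training then reduces to learning tag transitions and character emissions; decoding a new sentence's tag sequence (e.g. with Viterbi) yields the segmentation.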
Deep Learning Methods
In recent years, segmentation methods based on LSTM, BERT, and other deep learning models have achieved remarkable progress. These models capture more complex language features and reach F1 scores above 97% on standard test sets.
Popular Chinese Segmentation Tools
- Jieba — The most popular Python Chinese segmentation library, combining dictionary and HMM approaches
- CKIP Tagger — Developed by Academia Sinica, with particularly strong support for Traditional Chinese
- HanLP — Feature-rich Chinese NLP toolkit
- PaddleNLP — Developed by Baidu, high-accuracy deep learning-based segmentation
Unicode and Chinese Characters
Understanding the Unicode standard is essential when working with Chinese character counting in code. Chinese characters fall mainly within these Unicode blocks:
- CJK Unified Ideographs (U+4E00 - U+9FFF) — The most commonly used 20,992 characters
- CJK Unified Ideographs Extension A (U+3400 - U+4DBF) — 6,592 less common characters
- CJK Unified Ideographs Extensions B-F — Rare and archaic characters
- CJK Compatibility Ideographs (U+F900 - U+FAFF) — Compatibility characters
In JavaScript, you can use the regular expression /[\u4e00-\u9fff]/g to match the most common Chinese characters and count them.
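The same check can be written in Python by testing code points directly. This sketch covers only the basic block and Extension A, an assumption that suffices for most modern text; extend the ranges if rarer characters matter for your use case:

```python
def count_cjk(text):
    """Count characters in the CJK Unified Ideographs blocks
    (basic block U+4E00-U+9FFF plus Extension A U+3400-U+4DBF)."""
    return sum(
        1 for ch in text
        if 0x4E00 <= ord(ch) <= 0x9FFF or 0x3400 <= ord(ch) <= 0x4DBF
    )

print(count_cjk("Chinese 中文 123"))  # → 2 (only 中 and 文 are counted)
```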
Practical Advice for Character Counting
For general Chinese character counting needs, here are practical recommendations:
- Raw character count — Suitable for school essays, academic papers with specific character requirements
- Non-whitespace character count — Excludes spaces and line breaks for a more accurate content measurement
- Chinese character count only — Counts only CJK characters, excluding English and numbers
- Mixed counting — Chinese by character, English by word — closest to actual reading experience
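The mixed-counting approach can be sketched in a few lines of Python. The exact regex and the choice to count digit runs as words are illustrative assumptions, not a standard:

```python
import re

def mixed_count(text):
    """Mixed counting: each CJK character counts as one unit, while each
    run of Latin letters or digits counts as a single word."""
    cjk = len(re.findall(r"[\u4e00-\u9fff]", text))
    latin_words = len(re.findall(r"[A-Za-z0-9]+", text))
    return cjk + latin_words

# 7 CJK characters + 2 Latin/digit words ("Python", "3")
print(mixed_count("我用 Python 写了 3 个程序"))  # → 9
```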
Conclusion
Chinese word segmentation is a deceptively simple problem with surprising depth. For general users, understanding the difference between characters and words is sufficient. For developers and researchers, Chinese segmentation remains an active area of research with new methods and tools constantly emerging.
References
- Huang, Changning and Zhao, Hai. "Chinese Word Segmentation: A Decade Review." Journal of Chinese Information Processing, 21(3), 2007.
- Sun, Junyi (fxsjy). "Jieba Chinese Text Segmentation." GitHub Repository, 2023. https://github.com/fxsjy/jieba
- The Unicode Consortium. "CJK Unified Ideographs." The Unicode Standard, Version 15.0, 2022. https://www.unicode.org/charts/PDF/U4E00.pdf
- Xue, Nianwen. "Chinese Word Segmentation as Character Tagging." International Journal of Computational Linguistics and Chinese Language Processing, 8(1), 2003.