
Chinese Word Segmentation and Counting

March 2026 · 7 min read

In English, words are separated by spaces, making word counting relatively straightforward. Chinese, however, has no natural word boundary markers. Counting characters is therefore trivial, but counting words first requires deciding where each word begins and ends, which is a technically challenging problem in natural language processing.

Characters vs Words in Chinese

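Before discussing Chinese word counting, we need to distinguish two fundamental concepts: the character (zi, 字), a single written symbol, and the word (ci, 词), a unit of meaning made up of one or more characters.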

When Chinese speakers say "this article has 500 zi," they typically mean 500 characters, not 500 words. This differs fundamentally from the English concept of "500 words."

Key Takeaway: Chinese "character count" (zi shu) refers to the number of individual characters, while English "word count" refers to the number of words. A 1,000-character Chinese article, after segmentation, typically contains about 600-700 words.
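The difference between the two counts can be sketched in a few lines of JavaScript. This is a minimal illustration, assuming a runtime that provides `Intl.Segmenter` (recent browsers and Node.js releases); the function name `countCharsAndWords` is made up for this example, and the word count depends on the segmentation dictionary built into the runtime, so it should be treated as an estimate.

```javascript
// Count Chinese characters and (approximate) words in a string.
function countCharsAndWords(text) {
  // Array.from iterates by code point, so rare characters outside the
  // Basic Multilingual Plane are counted once, not twice.
  const characters = Array.from(text).filter((ch) =>
    /\p{Script=Han}/u.test(ch)
  ).length;

  // Word-level segmentation provided by the runtime.
  const segmenter = new Intl.Segmenter("zh", { granularity: "word" });
  let words = 0;
  for (const part of segmenter.segment(text)) {
    if (part.isWordLike) words++; // skip punctuation and whitespace
  }
  return { characters, words };
}

console.log(countCharsAndWords("我喜欢学习中文。"));
```

The character count is exact, while the word count can differ between runtimes and from what a dedicated segmentation library would report.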

Why Chinese Word Segmentation Is Difficult

Chinese Word Segmentation (CWS) is a fundamental task in Natural Language Processing (NLP) that is still not fully solved. The main challenges include:

1. Segmentation Ambiguity

The same string of characters can be segmented in different ways. Consider 研究生命起源 ("studying the origin of life"). It can be split as 研究 / 生命 / 起源 (study / life / origin) or as 研究生 / 命 / 起源 (graduate student / fate / origin). Every piece in both splits is a real dictionary word, so only context tells a segmenter that the first reading is the intended one.
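To make the ambiguity concrete, the sketch below enumerates every way to split the classic example 研究生命起源 ("studying the origin of life") into dictionary words. The five-entry dictionary and the function name `allSegmentations` are assumptions chosen for illustration; a real dictionary would produce many more candidates.

```javascript
// Toy dictionary, assumed for illustration only.
const TOY_DICT = new Set(["研究", "研究生", "生命", "命", "起源"]);

// Return every way to split `text` into words that all appear in `dict`.
function allSegmentations(text, dict = TOY_DICT) {
  const chars = Array.from(text);
  const results = [];

  function walk(start, path) {
    if (start === chars.length) {
      results.push(path); // consumed the whole string: one valid split
      return;
    }
    for (let end = start + 1; end <= chars.length; end++) {
      const word = chars.slice(start, end).join("");
      if (dict.has(word)) walk(end, [...path, word]);
    }
  }

  walk(0, []);
  return results;
}

for (const seg of allSegmentations("研究生命起源")) {
  console.log(seg.join(" / "));
}
// 研究 / 生命 / 起源
// 研究生 / 命 / 起源
```

Both outputs consist entirely of dictionary words, which is why a dictionary alone cannot resolve this kind of ambiguity.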

2. Out-of-Vocabulary (OOV) Words

Language constantly evolves with new vocabulary. Names, places, and internet slang often don't appear in dictionaries. Terms like "ChatGPT" and "metaverse" weren't in any dictionary when they first appeared.

3. Fuzzy Word Boundaries

Whether certain linguistic units count as one word or several is debated even among linguists. For example, is 中华人民共和国 ("People's Republic of China") one word, or a combination of several words such as 中华 / 人民 / 共和国?

Main Segmentation Methods

Dictionary-Based Methods

The most intuitive approach maintains a dictionary and matches text against entries. Common matching strategies include:

| Method | Description | Pros & Cons |
| --- | --- | --- |
| Forward Maximum Matching (FMM) | Left to right, take longest match | Simple and fast, but prone to ambiguity errors |
| Backward Maximum Matching (BMM) | Right to left, take longest match | Usually more accurate for Chinese |
| Bidirectional Maximum Matching | Run both FMM and BMM, pick better result | Higher accuracy, slower speed |
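The three matching strategies can be sketched as follows. This is a simplified illustration rather than a production implementation: the dictionary `MATCH_DICT`, the maximum word length, and the tie-breaking rule in `bidirectionalMaxMatch` (fewer words, then fewer single-character words, then the backward result) are assumptions chosen for the example. Any character that starts no dictionary word is emitted as a one-character word.

```javascript
// Toy dictionary, assumed for illustration only.
const MATCH_DICT = new Set(["研究", "研究生", "生命", "命", "起源"]);
const MAX_WORD_LEN = 3; // longest dictionary entry, in characters

// Forward Maximum Matching: scan left to right, taking the longest match.
function forwardMaxMatch(text, dict = MATCH_DICT, maxLen = MAX_WORD_LEN) {
  const chars = Array.from(text);
  const words = [];
  let i = 0;
  while (i < chars.length) {
    let len = Math.min(maxLen, chars.length - i);
    // Shrink the window until it is a dictionary word or a single character.
    while (len > 1 && !dict.has(chars.slice(i, i + len).join(""))) len--;
    words.push(chars.slice(i, i + len).join(""));
    i += len;
  }
  return words;
}

// Backward Maximum Matching: scan right to left, taking the longest match.
function backwardMaxMatch(text, dict = MATCH_DICT, maxLen = MAX_WORD_LEN) {
  const chars = Array.from(text);
  const words = [];
  let end = chars.length;
  while (end > 0) {
    let len = Math.min(maxLen, end);
    while (len > 1 && !dict.has(chars.slice(end - len, end).join(""))) len--;
    words.unshift(chars.slice(end - len, end).join(""));
    end -= len;
  }
  return words;
}

// Bidirectional: run both and keep the result that looks better.
function bidirectionalMaxMatch(text, dict = MATCH_DICT, maxLen = MAX_WORD_LEN) {
  const fwd = forwardMaxMatch(text, dict, maxLen);
  const bwd = backwardMaxMatch(text, dict, maxLen);
  if (fwd.length !== bwd.length) return fwd.length < bwd.length ? fwd : bwd;
  const singles = (ws) => ws.filter((w) => Array.from(w).length === 1).length;
  return singles(fwd) < singles(bwd) ? fwd : bwd;
}

console.log(forwardMaxMatch("研究生命起源").join(" / "));       // 研究生 / 命 / 起源
console.log(backwardMaxMatch("研究生命起源").join(" / "));      // 研究 / 生命 / 起源
console.log(bidirectionalMaxMatch("研究生命起源").join(" / ")); // 研究 / 生命 / 起源
```

On this input the forward pass greedily takes 研究生 and is left with the stray character 命, while the backward pass recovers the intended reading. The bidirectional version prefers the backward result because it contains fewer single-character words.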

Statistical Methods

These approaches use large annotated corpora to train statistical models, most commonly Hidden Markov Models (HMM) and Conditional Random Fields (CRF). Because they learn how characters tend to combine rather than looking up whole words, they can handle unknown words to some degree.
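A common way to frame segmentation for such models is character tagging: each character receives a label saying where it sits inside a word, and the word boundaries follow from the labels. The sketch below assumes the widely used four-tag scheme (B = begin, M = middle, E = end, S = single-character word) and a hypothetical helper `tagsToWords`; it shows only the final decoding step, not how a model predicts the tags.

```javascript
// Convert a per-character tag sequence into words.
// Tags: B = begin, M = middle, E = end, S = single-character word.
function tagsToWords(text, tags) {
  const chars = Array.from(text);
  if (chars.length !== tags.length) {
    throw new Error("text and tags must have the same length");
  }
  const words = [];
  let current = "";
  chars.forEach((ch, i) => {
    current += ch;
    // E and S both close the current word.
    if (tags[i] === "E" || tags[i] === "S") {
      words.push(current);
      current = "";
    }
  });
  if (current) words.push(current); // tolerate a sequence that ends mid-word
  return words;
}

console.log(tagsToWords("研究生命起源", "BEBEBE").join(" / ")); // 研究 / 生命 / 起源
console.log(tagsToWords("研究生命起源", "BMESBE").join(" / ")); // 研究生 / 命 / 起源
```

Because the model predicts tags for individual characters, it can mark a never-seen character sequence as a word, which is where the tolerance for unknown words comes from.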

Deep Learning Methods

In recent years, segmentation methods based on LSTM, BERT, and other deep learning models have achieved remarkable progress. These models capture more complex language features and reach F1 scores above 97% on standard test sets.
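For readers unfamiliar with the metric, the F1 score for segmentation is usually computed over words: a predicted word counts as correct only if both of its boundaries match the reference segmentation, and F1 is the harmonic mean of precision and recall. The following is a minimal sketch under that assumption; the function names are illustrative, and published benchmarks use their own official scoring scripts.

```javascript
// Turn a word list into a set of "start-end" character spans.
function toSpans(words) {
  const spans = new Set();
  let start = 0;
  for (const w of words) {
    const end = start + Array.from(w).length;
    spans.add(`${start}-${end}`);
    start = end;
  }
  return spans;
}

// Word-level precision, recall, and F1 of a predicted segmentation.
function segmentationF1(gold, predicted) {
  const goldSpans = toSpans(gold);
  const predSpans = toSpans(predicted);
  let correct = 0;
  for (const span of predSpans) {
    if (goldSpans.has(span)) correct++;
  }
  const precision = predSpans.size ? correct / predSpans.size : 0;
  const recall = goldSpans.size ? correct / goldSpans.size : 0;
  const f1 =
    precision + recall > 0
      ? (2 * precision * recall) / (precision + recall)
      : 0;
  return { precision, recall, f1 };
}

// Only 起源 has matching boundaries, so precision = recall = f1 = 1/3.
console.log(
  segmentationF1(["研究", "生命", "起源"], ["研究生", "命", "起源"])
);
```

A single misplaced boundary breaks two words at once, which is why even small boundary error rates pull F1 down noticeably.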

Popular Chinese Segmentation Tools

Unicode and Chinese Characters

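Understanding the Unicode standard is essential when working with Chinese character counting in code. The most commonly used Chinese characters fall within the CJK Unified Ideographs block (U+4E00 to U+9FFF); rarer characters live in the CJK extension blocks, such as Extension A (U+3400 to U+4DBF) and Extension B (U+20000 to U+2A6DF).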

In JavaScript, you can use the regular expression /[\u4e00-\u9fff]/g to match the most common Chinese characters and count them.
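A minimal sketch of that approach, together with a broader alternative, is shown below. The function names are made up for this example. The second function uses the Unicode property escape `\p{Script=Han}` with the `u` flag, which also matches Han characters outside the U+4E00 to U+9FFF range, so it is one possible way to include rarer characters.

```javascript
// Count characters in the basic CJK Unified Ideographs block only.
function countBasicHan(text) {
  const matches = text.match(/[\u4e00-\u9fff]/g);
  return matches ? matches.length : 0; // match() returns null when nothing matches
}

// Count every code point whose Unicode script is Han,
// including characters in the extension blocks.
function countAllHan(text) {
  const matches = text.match(/\p{Script=Han}/gu);
  return matches ? matches.length : 0;
}

console.log(countBasicHan("Hello 世界, 你好!")); // 4

// U+20000 is a rare character outside the basic block.
const rare = "\u{20000}世界";
console.log(countBasicHan(rare)); // 2
console.log(countAllHan(rare));   // 3
```

Which version is appropriate depends on the text: the basic range is enough for most modern prose, while names and classical texts are more likely to contain characters from the extension blocks.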

Practical Advice for Character Counting

For general Chinese character counting needs, here are practical recommendations:


Conclusion

Chinese word segmentation is a deceptively simple problem with surprising depth. For general users, understanding the difference between characters and words is sufficient. For developers and researchers, Chinese segmentation remains an active area of research with new methods and tools constantly emerging.
