NLP of Chinese Text Data (with code)

Machine Learning and Big Data—Melissa Dell and Matthew Harding

A brief introduction to NLP of Chinese text data.  


OCR Tables and Parse the Output 

Pros: OCR software frequently confuses visually similar characters, such as 4/A, 8/B, and 0/O. The program created by Prof. Dell can address these issues.

Cons: In my experience with the PDF version of the Japanese census, only half of the data could be extracted without manual double-checking.
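
To make the character confusions above concrete, here is a minimal sketch of one way to post-process OCR output for cells that are expected to be numeric. It is not Prof. Dell's program. The names LETTER_TO_DIGIT, fix_numeric_cell, and parse_numeric_cell are hypothetical, and the map covers only the three pairs mentioned above (4/A, 8/B, 0/O).

```python
import re

# Hypothetical confusion map: letters that OCR may emit in place of digits.
# Only the three pairs named in the text (4/A, 8/B, 0/O) are included.
LETTER_TO_DIGIT = {"A": "4", "B": "8", "O": "0"}


def fix_numeric_cell(cell: str) -> str:
    """Swap confusable letters for digits in a cell that should be numeric."""
    # Drop surrounding whitespace and thousands separators first.
    stripped = cell.strip().replace(",", "")
    return "".join(LETTER_TO_DIGIT.get(ch, ch) for ch in stripped)


def parse_numeric_cell(cell: str):
    """Return the cell as an int, or None if it still is not a clean number."""
    fixed = fix_numeric_cell(cell)
    if re.fullmatch(r"[0-9]+", fixed):
        return int(fixed)
    # Anything else is left for manual double-checking.
    return None
```

Cells that come back as None are the ones that would still need to be checked by hand, which is the cost described under Cons.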

Text Similarity

Text similarity using text2vec 
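
As a rough illustration of what a text similarity score measures, the sketch below compares two strings by the cosine similarity of their character bigram counts. This is a stand-in technique, not text2vec and not its API. Character n-grams are used here only because they work on Chinese text without a word segmenter. The function names char_ngrams and cosine_similarity are hypothetical.

```python
import math
from collections import Counter


def char_ngrams(text: str, n: int = 2) -> Counter:
    """Count overlapping character n-grams, ignoring whitespace."""
    chars = [ch for ch in text if not ch.isspace()]
    return Counter("".join(chars[i:i + n]) for i in range(len(chars) - n + 1))


def cosine_similarity(a: str, b: str, n: int = 2) -> float:
    """Cosine similarity between the n-gram count vectors of two strings."""
    va, vb = char_ngrams(a, n), char_ngrams(b, n)
    # A Counter returns 0 for missing keys, so unseen n-grams add nothing.
    dot = sum(count * vb[gram] for gram, count in va.items())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # at least one string is too short to form an n-gram
    return dot / (norm_a * norm_b)
```

For example, cosine_similarity("我喜欢自然语言处理", "我喜欢机器学习") gives a score strictly between 0 and 1, because the two sentences share only the bigrams 我喜 and 喜欢.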

Supervised Learning

A practical example of classifying Chinese text with a recurrent neural network (RNN).

Unsupervised Learning

Latent Dirichlet Allocation (LDA) topic model.
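
To show the mechanics behind an LDA topic model, here is a minimal sketch of a collapsed Gibbs sampler in plain Python. It assumes the documents have already been segmented into words, which Chinese text requires before topic modelling. The function name lda_gibbs and the default values for alpha, beta, and n_iter are illustrative assumptions, not settings taken from the course material.

```python
import random


def lda_gibbs(docs, n_topics, n_iter=200, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA by collapsed Gibbs sampling on pre-tokenized documents.

    docs is a list of token lists. Returns (vocab, doc_topic, topic_word),
    where the last two are count tables from the final sampling state.
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    n_words = len(vocab)

    doc_topic = [[0] * n_topics for _ in docs]             # tokens per (doc, topic)
    topic_word = [[0] * n_words for _ in range(n_topics)]  # tokens per (topic, word)
    topic_total = [0] * n_topics                           # tokens per topic
    assignments = []                                       # topic of every token

    # Start from a random topic assignment for each token.
    for d, doc in enumerate(docs):
        z_d = []
        for w in doc:
            k = rng.randrange(n_topics)
            z_d.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word_id[w]] += 1
            topic_total[k] += 1
        assignments.append(z_d)

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                v = word_id[w]
                k = assignments[d][i]
                # Remove the token's current assignment from the counts.
                doc_topic[d][k] -= 1
                topic_word[k][v] -= 1
                topic_total[k] -= 1
                # Unnormalized conditional probability of each topic.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][v] + beta)
                    / (topic_total[t] + n_words * beta)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                # Record the newly sampled topic.
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][v] += 1
                topic_total[k] += 1

    return vocab, doc_topic, topic_word
```

Each row of topic_word can be sorted by count to list the most frequent words of a topic, and each row of doc_topic shows how a document's tokens are spread across topics. A real analysis would more likely rely on an established library than on this sampler.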

Toolkit

Jiayan, the first NLP toolkit designed for Classical Chinese, supports lexicon construction, tokenization, POS tagging, sentence segmentation, and punctuation.