NLP of Chinese Text Data (with code)

Theory:

Natural Language Processing: Speech and Language Processing, the textbook by Dan Jurafsky and James H. Martin

Stanford CS224N: Natural Language Processing with Deep Learning


Empirical:

Machine Learning and Big Data—Melissa Dell and Matthew Harding: https://www.aeaweb.org/conference/cont-ed/2023-webcasts


A brief introduction to NLP of Chinese text data: https://carlos9310.github.io/assets/pdf/chinese-nlp.pdf


OCR: OCR Tables and Parse the Output 

Pros: OCR software frequently misidentifies 4/A, 8/B, 0/O, and other characters. The program created by Prof. Dell could address these issues.

Cons: In my experience with the PDF version of the Japanese census, the program could extract only half of the data without manual double-checking.
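
Example: a minimal sketch of the kind of post-OCR cleanup that targets the 4/A, 8/B, and 0/O confusions in table cells that should be numeric. This is not Prof. Dell's program; the mapping and the function names (fix_numeric_cell, needs_manual_check) are hypothetical, and the sketch assumes that a cell is numeric if it contains only digits and commas after substitution.

```python
import re

# Hypothetical confusion map: letters OCR often emits in place of digits.
LETTER_TO_DIGIT = str.maketrans({"A": "4", "B": "8", "O": "0"})

# Assumption: a numeric cell holds only ASCII digits, optionally with commas.
NUMERIC = re.compile(r"[0-9][0-9,]*")


def fix_numeric_cell(cell: str) -> str:
    """Return the repaired cell if it becomes numeric, else the cell unchanged."""
    candidate = cell.strip().translate(LETTER_TO_DIGIT)
    return candidate if NUMERIC.fullmatch(candidate) else cell


def needs_manual_check(cell: str) -> bool:
    """Flag cells that are still not numeric after the substitution."""
    return NUMERIC.fullmatch(fix_numeric_cell(cell)) is None


row = ["1O8", "4B,0A2", "12?5"]
cleaned = [fix_numeric_cell(c) for c in row]
flagged = [c for c in row if needs_manual_check(c)]
print(cleaned)
print(flagged)
```

Cells that stay flagged are the ones that would still need to be checked by hand.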


Text Similarity: Text similarity using text2vec 
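
Example: a minimal sketch of cosine similarity between two Chinese sentences. It does not call text2vec. Embedding-based tools compare dense sentence vectors, and this sketch swaps in character bigram counts so that it runs without downloading a model. The function names and the choice of n=2 are assumptions made for illustration.

```python
from collections import Counter
from math import sqrt


def char_ngrams(text: str, n: int = 2) -> Counter:
    """Count overlapping character n-grams, ignoring whitespace."""
    chars = [c for c in text if not c.isspace()]
    return Counter("".join(chars[i:i + n]) for i in range(len(chars) - n + 1))


def cosine_similarity(a: str, b: str, n: int = 2) -> float:
    """Cosine of the angle between the two n-gram count vectors."""
    va, vb = char_ngrams(a, n), char_ngrams(b, n)
    dot = sum(count * vb[gram] for gram, count in va.items())
    norm_a = sqrt(sum(c * c for c in va.values()))
    norm_b = sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a text shorter than n characters has no n-grams
    return dot / (norm_a * norm_b)


print(cosine_similarity("我喜欢自然语言处理", "我喜欢语言处理"))
print(cosine_similarity("我喜欢自然语言处理", "今天天气很好"))
```

Character n-grams avoid the need for word segmentation, which is why they are a common baseline for Chinese text. They capture surface overlap only, not meaning.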


Supervised Learning: https://www.youtube.com/watch?v=BAP6l2uGAHU  A practical example of Chinese text data classification using RNN.


Unsupervised Learning: https://github.com/huang027/LDA  Latent Dirichlet Allocation Topic Model


Toolkit: https://github.com/jiaeyan/Jiayan#2  Jiayan, the first NLP toolkit designed for Classical Chinese. It supports lexicon construction, tokenizing, POS tagging, sentence segmentation, and punctuation.
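
Example: a minimal sketch of lexicon-based tokenizing using forward maximum matching. This is not Jiayan's API or its algorithm. It is a generic dictionary baseline, shown only to illustrate why tokenizing depends on a lexicon. The function name, the toy lexicon, and max_len=4 are hypothetical.

```python
def forward_max_match(text: str, lexicon: set, max_len: int = 4) -> list:
    """Greedily take the longest lexicon entry at each position.

    Characters that start no lexicon entry are emitted as single-character tokens.
    """
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest window first, shrinking down to one character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in lexicon:
                tokens.append(piece)
                i += size
                break
    return tokens


toy_lexicon = {"自然语言", "自然", "语言", "处理"}  # hypothetical entries
print(forward_max_match("自然语言处理很有趣", toy_lexicon))
```

Because the matching is greedy, the result depends entirely on the lexicon's coverage, which is one reason lexicon construction appears alongside tokenizing in the toolkit's feature list.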