- Pre-processing for Text Analytics
- Linguistics 101
- Crash course for R
謝舒凱 Graduate Institute of Linguistics, NTU
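Using Linux commands
- Work in the directory ~/lads/data/week4.
- Download the text (http://www.gutenberg.org/cache/epub/730/pg730.txt) into this directory and name the file dickens.txt.
- Strip the header and footer, and save the result as dickens-clean.txt.
curl -s http://www.gutenberg.org/cache/epub/730/pg730.txt -o dickens.txt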
less -N dickens.txt
# use G and the Page Up / Page Down keys to find where the header ends and the footer begins
sed '1,150d' dickens.txt > dickens-noheader.txt
sed '18682,19052d' dickens-noheader.txt > dickens-clean.txt
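Note that the second `sed` runs on dickens-noheader.txt, so its line numbers count from the first line that survived the header cut, not from the original download. A minimal sketch on a made-up six-line file shows the offset (the file names `demo.txt`, `demo-noheader.txt`, and `demo-clean.txt` are hypothetical, chosen only for this illustration):

```shell
# Build a toy file: 2 header lines, 3 body lines, 1 footer line.
printf 'HEADER 1\nHEADER 2\nbody a\nbody b\nbody c\nFOOTER\n' > demo.txt

# Step 1: delete the header (lines 1-2 of the original file).
sed '1,2d' demo.txt > demo-noheader.txt

# Step 2: the footer was line 6 in demo.txt, but it is line 4 here,
# because the two header lines are already gone.
sed '4,$d' demo-noheader.txt > demo-clean.txt

cat demo-clean.txt
```

Only the three body lines should remain. Running `less -N` on the intermediate file before the second cut is a simple way to read off the shifted line numbers.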
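Using Linux commands
The word "bumble" catches our eye. Use grep to list where it occurs in dickens-clean.txt (with line numbers, and how about giving it some color?). (Go for the most concise solution $\rightarrow$ and get your name on the slide right away!)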
tr -d '[:punct:]' < dickens-clean.txt | tr -d '[:digit:]' |
tr '[:upper:]' '[:lower:]' | tr -d '\r' | tr ' ' '\n' | sort | uniq -c |
sort -r -g > dickens-wordfreq.txt
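To see what each stage of the word-frequency pipeline contributes, here is a small sketch on a made-up two-line input (the sample text and the file name `demo-wordfreq.txt` are illustrative, not course data). It quotes the character classes so the shell cannot expand them as filename patterns, and it uses `tr -s`, a variation on the original, so that runs of spaces do not produce empty tokens:

```shell
# Made-up sample with punctuation, digits, mixed case and a double space.
printf 'The cat.  The dog!\nthe CAT 42\n' |
  tr -d '[:punct:]' |             # drop punctuation
  tr -d '[:digit:]' |             # drop digits
  tr '[:upper:]' '[:lower:]' |    # fold to lower case
  tr -d '\r' |                    # drop carriage returns, if any
  tr -s ' ' '\n' |                # one token per line, squeezing empty lines
  sort | uniq -c |                # group identical tokens and count them
  sort -r -g > demo-wordfreq.txt  # highest count first

cat demo-wordfreq.txt
```

On this input "the" should come first with a count of 3, followed by "cat" (2) and "dog" (1). `uniq -c` only merges adjacent duplicates, which is why the first `sort` is needed.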
grep -E -n --color=auto "(B|b)umble" dickens-clean.txt
How long did it take you to do it this way? (Less than ONE second!!)
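One candidate for a shorter command, sketched on a made-up three-line file (`demo-grep.txt` is a hypothetical name): with `-i` the match ignores case, so the `(B|b)` alternation and `-E` are no longer needed. Note that `-i` is slightly broader than the original pattern, since it also matches spellings such as "BUMBLE".

```shell
# Toy input: two lines mention the word, one does not.
printf 'Mr. Bumble sat down.\nOliver asked for more.\nsaid bumble again\n' > demo-grep.txt

# -i: ignore case, -n: prefix each hit with its line number,
# --color=auto: highlight the match when writing to a terminal
grep -in --color=auto 'bumble' demo-grep.txt
```

If only the two spellings "Bumble" and "bumble" should match, the bracket expression `'[Bb]umble'` does the same job as the alternation without requiring `-E`.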

Until you are able to do the processing in R itself, you can rely on Linux commands or on functions provided by R packages for pre-processing.
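# jieba.R (thanks to simon)
library(jiebaR)
# read whitespace-separated chunks of text from standard input
txt = scan('stdin', what = 'char')
# segment the text with a default jieba worker
words_vector = worker() <= txt
# join the segmented words with single spaces and print them
words_char = paste(words_vector, collapse = ' ')
cat(words_char)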
curl -s http://www.gutenberg.org/files/27166/27166-0.txt -o luxun.txt
cat luxun.txt | Rscript jieba.R | tr ' ' '\n' | sort | uniq -c | sort -r -g > luxun-wordfreq.txt
ShowDictPath() ### Show dict path, find and edit the "user.dict.utf8"
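Pick a word in GQ.txt that you do not want to be split apart (for example 「卡娃伊」), add it to the jieba user dictionary (user.dict.utf8), rebuild the word-frequency table, and check whether 「卡娃伊」 now appears in it.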
cat GQ.txt | Rscript jieba.R | tr ' ' '\n' | sort | uniq -c | sort -r -g > GQ-wordfreq.txt
grep '卡娃伊' GQ.txt
grep '卡娃伊' GQ-wordfreq.txt
# after adding 卡娃伊 to the user dictionary, rerun the first line (the word-frequency pipeline), then check again
grep '卡娃伊' GQ-wordfreq.txt
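A plain `grep '卡娃伊'` also matches longer tokens that merely contain the string. A minimal sketch of an exact-token check follows; the `printf` line is a made-up stand-in for the space-separated output of `Rscript jieba.R` (it is not real segmenter output), and `demo-zh-wordfreq.txt` is a hypothetical file name:

```shell
# Stand-in for segmented text (assumption: tokens are separated by single spaces).
printf '她 很 卡娃伊 真的 很 卡娃伊 卡娃伊娜\n' |
  tr ' ' '\n' | sort | uniq -c | sort -r -g > demo-zh-wordfreq.txt

# Print only the row whose token (field 2) is exactly 卡娃伊;
# the longer token 卡娃伊娜 is not reported.
awk '$2 == "卡娃伊"' demo-zh-wordfreq.txt
```

If the word was kept whole by the segmenter, the command prints one row with its count; if it was split, nothing is printed.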
And there is more...
Stay tuned, and look out for the online book for this course.
Linguistics 101
In linguistics, morphology is the identification, analysis and description of the structure of a given language's morphemes and other linguistic units, such as root words, affixes, parts of speech, intonations and stresses, or implied context. (wiki)
Crash course for R
Check week4.R in CEIBA (revised from http://learnxinyminutes.com/docs/r/)
For example: 「馬的政府」 ("Ma's government"), 「老闆對我們是很 nice 的」 ("The boss is really nice to us"), 「好der」 (slang spelling of 好的, "okay")