- Pre-processing for Text Analytics
- Linguistics 101
- Crash course for R
謝舒凱 Graduate Institute of Linguistics, NTU
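Using Linux commands
In ~/lads/data/week4: put the file (http://www.gutenberg.org/cache/epub/730/pg730.txt) in this directory and rename it dickens.txt.
Then strip the header and footer lines and save the result as dickens-clean.txt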
curl -s http://www.gutenberg.org/cache/epub/730/pg730.txt -o dickens.txt
less -N dickens.txt
# use G and the Page Up/Page Down keys to move around
sed '1,150d' dickens.txt > dickens-noheader.txt
sed '18682,19052d' dickens-noheader.txt > dickens-clean.txt
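The line numbers in the two sed calls above were found by eye in less. As a hedged alternative, the boundaries can be looked up by marker text. This is a minimal sketch on a made-up sample file: the marker wording (START OF / END OF) and all file names are assumptions for illustration, so check what the real dickens.txt contains before relying on it. Unlike the two-pass version above, both line numbers here refer to the original file.

```shell
#!/bin/sh
# Sketch only: the sample text and the marker strings are invented for illustration.
tmp=$(mktemp -d)
printf '%s\n' \
  'license header' \
  '*** START OF EBOOK ***' \
  'Chapter I' \
  'Mr. Bumble walked in.' \
  '*** END OF EBOOK ***' \
  'license footer' > "$tmp/sample.txt"

# grep -n prefixes each matching line with its line number; cut keeps the number.
start=$(grep -n 'START OF' "$tmp/sample.txt" | head -n 1 | cut -d: -f1)
end=$(grep -n 'END OF' "$tmp/sample.txt" | head -n 1 | cut -d: -f1)

# Delete everything up to the start marker and from the end marker onward.
sed -e "1,${start}d" -e "${end},\$d" "$tmp/sample.txt" > "$tmp/sample-clean.txt"
cat "$tmp/sample-clean.txt"
```

Only the two body lines between the markers are left in sample-clean.txt.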
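Using Linux commands
The word bumble caught our eye. Use grep to list where it occurs in dickens-clean.txt (while you are at it, add line numbers and give it a bit of color). (Go for the most concise solution $\rightarrow$ and get your name on the slide right away!)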
tr -d '[:punct:]' < dickens-clean.txt | tr -d '[:digit:]' |
tr '[:upper:]' '[:lower:]' | tr -d '\r' | tr ' ' '\n' | sort | uniq -c |
sort -r -g > dickens-wordfreq.txt
grep -E -n --color=auto "(B|b)umble" dickens-clean.txt
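One possible shorter variant of the grep call above uses -i (ignore case) instead of the (B|b) alternation. The sketch below runs on an invented three-line sample, not on the novel. Note that -i is broader than (B|b)umble, since it also matches spellings such as BUMBLE.

```shell
#!/bin/sh
# Sketch only: the sample sentences are invented for illustration.
tmp=$(mktemp -d)
printf '%s\n' \
  'Mr. Bumble was a fat man.' \
  'Oliver cried.' \
  'Then bumble spoke.' > "$tmp/sample.txt"

# -i ignores case, -n prefixes each match with its line number.
# Add --color=auto when running this in a terminal to highlight the match.
matches=$(grep -i -n 'bumble' "$tmp/sample.txt")
printf '%s\n' "$matches"

# -c prints the number of matching lines instead of the lines themselves.
count=$(grep -i -c 'bumble' "$tmp/sample.txt")
printf 'matching lines: %s\n' "$count"
```

On the real file the same call would be grep -i -n --color=auto bumble dickens-clean.txt.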
How long did it take you to do it this way? (less than ONE second!!)
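To see what each stage of the word-frequency pipeline contributes, it can help to run it on a tiny sample first. The sketch below is a slight variation, not the slide's exact command: it merges the deletions into one tr call and uses tr -s on whitespace so that repeated spaces and line breaks do not produce empty tokens. The sample text and file names are invented for illustration.

```shell
#!/bin/sh
# Sketch only: sample text and file names are invented for illustration.
tmp=$(mktemp -d)
printf 'The cat saw the dog.\nThe dog ran!\n' > "$tmp/sample.txt"

# 1. delete punctuation, digits and carriage returns
# 2. lowercase everything
# 3. turn every run of whitespace into a single newline (one word per line)
# 4. sort, count identical lines, then sort by count, largest first
tr -d '[:punct:][:digit:]\r' < "$tmp/sample.txt" |
  tr '[:upper:]' '[:lower:]' |
  tr -s '[:space:]' '\n' |
  sort | uniq -c | sort -r -g > "$tmp/sample-wordfreq.txt"

cat "$tmp/sample-wordfreq.txt"
```

The most frequent word in the sample (the, three times) ends up on the first line, followed by dog with two occurrences.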
Until you know how to do this in R, you can use Linux commands or the functions that R packages provide for preprocessing.
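# thanks to simon
# save as jieba.R; reads text from standard input
library(jiebaR)
# read whitespace-separated chunks of text from stdin
txt = scan('stdin', what = 'char')
# segment the text with a default jiebaR worker
words_vector = worker() <= txt
# join the segmented words with single spaces and print them
words_char = paste(words_vector, collapse = ' ')
cat(words_char)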
curl -s http://www.gutenberg.org/files/27166/27166-0.txt -o luxun.txt
cat luxun.txt | Rscript jieba.R | tr ' ' '\n' | sort | uniq -c | sort -r -g > luxun-wordfreq.txt
ShowDictPath() ### Show dict path, find and edit the "user.dict.utf8"
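From GQ.txt, pick a word that you do not want split apart (e.g. 「卡娃伊」), add it to the jieba dictionary (user.dict.utf8), rerun the word-frequency table, and check whether 「卡娃伊」 is in it.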
cat GQ.txt | Rscript jieba.R | tr ' ' '\n' | sort | uniq -c | sort -r -g > GQ-wordfreq.txt
grep '卡娃伊' GQ.txt
grep '卡娃伊' GQ-wordfreq.txt
# after adding 卡娃伊 to the dictionary, rerun the first line ..................
grep '卡娃伊' GQ-wordfreq.txt
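Adding the word to the user dictionary can also be done from the shell. The sketch below assumes that the dictionary holds one entry per line and works on a temporary stand-in file. The path and the existing entry are hypothetical: use the location reported by ShowDictPath() for the real user.dict.utf8.

```shell
#!/bin/sh
# Sketch only: $dict is a temporary stand-in, not the real jieba dictionary path.
tmp=$(mktemp -d)
dict="$tmp/user.dict.utf8"
printf '%s\n' '雲端運算' > "$dict"   # pretend the dictionary already has one entry

word='卡娃伊'

# -q: no output, -x: match the whole line, -F: treat the word as a fixed string.
# Append the word only if it is not already listed.
grep -qxF "$word" "$dict" || printf '%s\n' "$word" >> "$dict"

# Running the same line again is a no-op, so the entry is never duplicated.
grep -qxF "$word" "$dict" || printf '%s\n' "$word" >> "$dict"

cat "$dict"
```

After this step, rerunning the word-frequency command and the final grep shows whether the word now survives segmentation as one token.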
And there is more...
Stay tuned, and look out for the online book for this course.
Linguistics 101
In linguistics, morphology is the identification, analysis and description of the structure of a given language's morphemes and other linguistic units, such as root words, affixes, parts of speech, intonations and stresses, or implied context. (wiki)
Crash course for R
Check week4.R in ceiba (revised from http://learnxinyminutes.com/docs/r/)
Some examples: 「馬的政府」, 「老闆對我們是很 nice 的」, 「好der」