- DS at the command line
- Text analytics in action
- Linguistics and Data Science
- Crash course for R
謝舒凱 Graduate Institute of Linguistics, NTU
DS at the command line
Lewis Carroll, Alice's Adventures in Wonderland
wget http://www.gutenberg.org/cache/epub/28885/pg28885.txt
## alternatively (-L follows redirects; save under the same name wget uses)
curl -sL http://www.gutenberg.org/cache/epub/28885/pg28885.txt -o pg28885.txt
file pg28885.txt
cp pg28885.txt alice.txt
head -n 20 alice.txt
less -N alice.txt
sed '1,216d' alice.txt > alice-noheader.txt
sed '3422,3830d' alice-noheader.txt > alice-trimmed.txt
wc alice-trimmed.txt
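The line ranges passed to sed above are specific to this edition of the file. As a hedged alternative, the sketch below trims by marker text instead of line numbers. It assumes the book is wrapped in lines beginning with `*** START OF` and `*** END OF`; check with `grep -n` first, since the exact wording varies between Project Gutenberg files and front matter may still follow the start marker. `strip_gutenberg` is a hypothetical helper name, and awk is swapped in for sed here.

```shell
# Hypothetical helper: print only the lines strictly between the
# "*** START OF" and "*** END OF" marker lines (marker wording is an assumption).
strip_gutenberg() {
    awk '/^\*\*\* ?END OF/   { inside = 0 }
         inside              { print }
         /^\*\*\* ?START OF/ { inside = 1 }'
}

# Small inline sample so the sketch runs without downloading anything.
printf '%s\n' \
    'license header' \
    '*** START OF THIS PROJECT GUTENBERG EBOOK SAMPLE ***' \
    'Down the Rabbit-Hole' \
    'Alice was beginning to get very tired.' \
    '*** END OF THIS PROJECT GUTENBERG EBOOK SAMPLE ***' \
    'license footer' | strip_gutenberg
```

On a real file this would be used as `strip_gutenberg < alice.txt > alice-trimmed.txt`, followed by a look at the first and last lines with head and tail.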
grep -n "Rabbit" alice-trimmed.txt
grep -E -n "(W|w)hite" alice-trimmed.txt
tr -d '[:punct:]' < alice-trimmed.txt > alice-nopunct.txt
tr '[:upper:]' '[:lower:]' < alice-nopunct.txt > alice-lowercase.txt
tr -d '\r' < alice-lowercase.txt > alice-lowercaself.txt
tr ' ' '\n' < alice-lowercaself.txt > alice-oneword.txt
sort alice-oneword.txt > alice-onewordsort.txt
uniq -c alice-onewordsort.txt > alice-wordfreq.txt
tr ' ' '\n' < alice-lowercaself.txt | sort | uniq -c > alice-wordfreq2.txt
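The frequency list produced above is in alphabetical order, which makes the most common words hard to spot. A minimal sketch of the same pipeline with a final ranking step follows. `top_words` is a hypothetical helper name, and the sample sentence is made up so the block runs without the downloaded file.

```shell
# Hypothetical helper: print the n most frequent words read from stdin.
top_words() {
    n="${1:-10}"
    tr -d '[:punct:]' |
        tr '[:upper:]' '[:lower:]' |
        tr -d '\r' |
        tr -s '[:space:]' '\n' |   # one word per line; -s squeezes repeated whitespace
        grep -v '^$' |             # drop the empty line left by leading whitespace
        sort | uniq -c |
        sort -k1,1nr -k2 |         # highest count first, ties in alphabetical order
        head -n "$n"
}

# Made-up sample text instead of alice-trimmed.txt.
printf 'The Rabbit ran. The rabbit hid!\nAlice saw the Rabbit.\n' | top_words 3
```

In the sample, "rabbit" and "the" each occur three times and head the list. On the real text the call would be `top_words 20 < alice-trimmed.txt`.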
tr '[:punct:]' ' ' < alice-trimmed.txt | tr '[:upper:]' '[:lower:]' | tr -s '[:blank:]' '\n' |
sort | uniq -c | sed 's/ \{1,\}/","/g' | sed 's/^",//g' | sed 's/$/"/g'
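The sed chain above rewrites the output of `uniq -c` into quoted CSV rows by substituting on whitespace. A hedged alternative is to let awk pick out the two fields directly, which does not depend on how many leading spaces `uniq -c` prints. `freq_to_csv` is a hypothetical helper name, and it assumes every word is a single whitespace-free token.

```shell
# Hypothetical helper: turn `uniq -c` output into "count","word" CSV rows.
freq_to_csv() {
    awk '{ printf "\"%s\",\"%s\"\n", $1, $2 }'
}

# Made-up one-word-per-line input.
printf 'b\na\nb\n' | sort | uniq -c | freq_to_csv
```

The result can be redirected to a .csv file and read into R or a spreadsheet.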
w3m -dump http://www.gnu.org/gnu/manifesto.html | wc
Text analytics in action
(For Chinese text processing, tmcn can be added as well, ...)
Barack Obama's State of the Union address 2014
library(tm)
path <- "~/Linguistic.Data/4Practice/the-state-of-the-union.txt"
text <- readLines(path, encoding = "UTF-8")
# text is now a character vector with one element per line of the address.
vs <- VectorSource(text)
txt <- Corpus(vs)
[Source]: Toomey, 2015.
# Converting text to lowercase
txtlc <- tm_map(txt, content_transformer(tolower))
# Removing punctuation
txtnp <- tm_map(txt, removePunctuation)
# Removing numbers
txtnn <- tm_map(txt, removeNumbers)
# Removing stop words
txtns <- tm_map(txt[1], removeWords, stopwords("english"))
# ......
dtm <- DocumentTermMatrix(txt)
[1] "and" "are" "but" "can" "every" "for" "from"
[8] "have" "it’s" "like" "make" "more" "new" "one"
[15] "our" "than" "that" "that’s" "the" "their" "they"
[22] "this" "who" "will" "with"
findAssocs(dtm, "work", 0.2)
diplomacy 0.27
“small    0.23
2010,     0.23
again     0.23
along     0.23
biden’s   0.23
cannot    0.23
didn’t    0.23
disagree  0.23
embargo.  0.23
endured;  0.23
europe,   0.23
exist     0.23
francis,  0.23
grit      0.23
hand.     0.23
holiness, 0.23
ideas,    0.23
imposing  0.23
iran,     0.23
it,       0.23
japan,    0.23
laid      0.23
least     0.23
now.      0.23
outlined  0.23
pope      0.23
remaking  0.23
said,     0.23
say       0.23
seek      0.23
separate  0.23
steps.”   0.23
today.    0.23
together  0.23
twenty    0.23
update    0.23
wage,     0.23
where,    0.23
you’ll    0.23
hard      0.22
plot(dtm, terms = findFreqTerms(dtm, lowfreq = 5)[1:10], corThreshold = 0.5)
(freq.terms <- findFreqTerms(tdm, lowfreq = 15))
## [1] "-" "a" "analysis" "and" "at" "data"
## [7] "for" "in" "mining" "of" "on" "package"
## [13] "r" "slides" "the" "to" "with"
term.freq <- rowSums(as.matrix(tdm))
term.freq <- subset(term.freq, term.freq >= 15)
df <- data.frame(term = names(term.freq), freq = term.freq)
library(ggplot2)
# The trailing + is required so that the labels and coord_flip() join the plot.
ggplot(df, aes(x = term, y = freq)) + geom_bar(stat = "identity") +
  xlab("Terms") + ylab("Count") + coord_flip()
library(wordcloud)
# Use the term.freq vector computed above; freq is not defined in this session.
wordcloud(names(term.freq), term.freq, min.freq = 10)
Word segmentation
# mixseg <= "../data/week3/MYJ1030101.txt"
# This is the generated output:
#[1] "../data/week3/MYJ1030101.segment1443585454.8642.txt"
# Manual cleanup is still needed.
data.path.c <- "~/Dropbox/Linguistic.Analysis.and.Data.Science/data/week3/MYJ1030101.clean.txt"
text.c <- readLines(data.path.c,encoding="UTF-8")
vs.c <- VectorSource(text.c)
txt.c <- Corpus(vs.c)
ma.corpus <- tm_map(txt.c, removeWords,stopwordsCN())
Linguistics and Data Science
What is the meaning of a word cloud?
Crash course for R
The DataCamp preview schedule has been posted on the course website: http://loperntu.github.io/lads/.
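- Use Linux commands to build a word frequency table for Ma Ying-jeou's New Year's Day address of ROC year 103 (MYJ1030101.clean.txt).
- Use tm, tmcn, and the jiebaR word segmentation package to build a word frequency table from the "Collected Speeches of Ma Ying-jeou" on ceiba.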