Linguistic Analysis and Data Science

Lecture 07

謝舒凱 Graduate Institute of Linguistics, NTU


Outline

  1. Corpus and NLP for Text Analytics
  2. Crash course in R: Regular Expressions

Natural Language Processing

  • Techniques that attempt to model how humans understand language and process knowledge.
  • In this field, other languages arguably offer richer toolkits than R (e.g., Python's nltk).
  • NLP-powered text analytics is already mainstream; see this example.

BUT it is possible to use libraries written in those lower-level, and hence faster, languages while writing your code in R, taking advantage of its functional programming style and its many other libraries for data analysis.
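As a minimal illustration of this point (a base-R-only sketch, not from the original slides): even everyday string functions like strsplit() dispatch to compiled C code, while our own code stays short and functional.

docs <- c("R wraps fast C code under the hood",
          "while your own code stays functional")
# tokenize each document; strsplit() is implemented in C
tokens <- strsplit(tolower(docs), "\\s+")
# fold all tokens into a single sorted frequency table
sort(table(unlist(tokens)), decreasing = TRUE)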

Don't forget that natural language processing is AI-complete

  • Even today, humans still do not fully understand humans.
  • The richness of language can be observed in everyday life: metaphor and simile, sarcasm and humor, emotion and deception, pragmatic context, individual and group differences, socio-cultural variation, even bargaining and courtship, and so on.

Natural Language Processing: Hands-on Practice

Stanford CoreNLP

  • Online demo

  • Installation (maintained by a front-line research team)

  • coreNLP package: wrappers around the Stanford CoreNLP tools (a Java library). Methods provided: tokenisation, part-of-speech tagging, lemmatisation, named entity recognition, coreference detection, and sentiment analysis

Try it out

# works from CRAN!
install.packages("coreNLP")
# fetch the Java models; the original slide's download.file() call was
# missing destfile (alternatively: wget http://nlp.stanford.edu/software/stanford-corenlp-full-2015-04-20.zip)
download.file("http://nlp.stanford.edu/software/stanford-corenlp-full-2015-04-20.zip",
              destfile = "stanford-corenlp-full-2015-04-20.zip")
unzip("stanford-corenlp-full-2015-04-20.zip")
library(coreNLP)
initCoreNLP("stanford-corenlp-full-2015-04-20/")
# a short news-style sentence to annotate
FB <- paste("Facebook is looking for new ways to get users to share more,",
            "rather than just consume content, in a push that seemingly puts",
            "it in more direct rivalry with Twitter.")
output <- annotateString(FB)
getToken(output)[, c(1:3, 6:7)]   # selected token-level columns
getParse(output)                  # constituency parse
getDependency(output)             # dependency relations
getSentiment(output)              # sentence-level sentiment
getCoreference(output)            # coreference chains
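Note that coreNLP wraps a Java library: you need a working Java installation and the rJava package, and initCoreNLP() may take a while on first call because it loads large models into the JVM.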

Linguistic observation drives NLP research to deeper levels of analysis

  • Research on interactive discourse is a good example of work that requires annotation.
  • It can also support qualitative interview research (coding in qualitative research is similar to annotation).
  • qdap (Quantitative Discourse Analysis Package): an R package designed to assist in quantitative discourse analysis; a quick sketch follows this list.
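A quick sketch of what such discourse coding looks like in qdap, assuming its bundled DATA sample dialogue and the freq_terms()/polarity() helpers:

library(qdap)
# DATA is a small sample dialogue data frame bundled with qdap;
# 'state' holds the utterances and 'person' the speaker
head(DATA)
freq_terms(DATA$state, top = 10)    # most frequent words overall
polarity(DATA$state, DATA$person)   # polarity (sentiment) scores by speaker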

Outline

  1. Corpus and NLP
  2. Crash course in R: Regular Expressions

Exercise: R Regular Expressions
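Before the exercise, a reminder of the main base R regex functions (a generic sketch with made-up example strings, not the exercise data):

x <- c("Chanel No. 5", "a strong, sweet scent", "SOFT and light")
grepl("strong", x)                      # logical match per element
grepl("soft", x, ignore.case = TRUE)    # case-insensitive matching
grep("s[a-z]+t", x, value = TRUE)       # return the matching elements
sub("sweet", "floral", x)               # replace the first match
gsub("[aeiou]", "", x)                  # replace all matches
regmatches(x, gregexpr("[A-Za-z]+", x)) # extract all matches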

Exercise: keyword distribution

# the file is tab-separated despite the .csv extension
comments <- read.table("perfumes_comments.csv", header = TRUE,
                       sep = "\t", dec = ".", quote = "\"")
summary(comments)
# inspect 10 random rows of the data set
x <- sample(nrow(comments), 10, replace = FALSE)
comments[x, ]

# proportion of comments mentioning "strong" (case-insensitive)
strong <- grepl("strong", comments$Comment, ignore.case = TRUE)
sum(strong) / nrow(comments)
# proportion mentioning "sweet" or "soft"
sweet <- grepl("sweet|soft", comments$Comment, ignore.case = TRUE)
sum(sweet) / nrow(comments)
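The same computation scales to a whole list of keyword patterns; a possible extension (a sketch: the extra patterns "fresh" and "floral" are illustrative, not part of the exercise):

keywords <- c("strong", "sweet|soft", "fresh", "floral")
# mean() of a logical vector gives the proportion of TRUEs directly
sapply(keywords, function(p)
  mean(grepl(p, comments$Comment, ignore.case = TRUE)))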