- More on text classification
- Topic modeling and Information Extraction
- Web data collection
謝舒凱 Graduate Institute of Linguistics, NTU
- create_matrix: Import your hand-coded data into R.
- create_corpus: Remove the "irrelevant" data and build the training dataset and the test dataset (as a dtm).
- train model(s): Choose machine learning algorithm(s) to train a model.
- build predictive model(s): Test on the (out-of-sample) test data and establish accuracy criteria to gauge performance.
- apply predictive model(s): Use the model to classify novel data.
- create analytics: Pull out the automatically misclassified documents; manually label the data that do not meet the accuracy criteria.
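The step labels above roughly mirror the RTextTools workflow. A minimal sketch under that assumption (the objects texts, a character vector of hand-coded documents, and labels, their classes, are hypothetical placeholders):
library(RTextTools)
# build a document-term matrix from the raw texts
dtm <- create_matrix(texts, language = "english",
                     removeStopwords = TRUE, stemWords = TRUE)
# split into training (1-800) and test (801-1000) portions
container <- create_container(dtm, labels, trainSize = 1:800,
                              testSize = 801:1000, virgin = FALSE)
# train one or more algorithms, classify the test set, and summarise accuracy
models    <- train_models(container, algorithms = c("SVM", "MAXENT"))
results   <- classify_models(container, models)
analytics <- create_analytics(container, results)
summary(analytics)
The walkthrough below builds the same pipeline step by step with tm and e1071, using the Cornell movie-review polarity data.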
URL = "http://www.cs.cornell.edu/people/pabo/movie-review-data/review_polarity.tar.gz"
download.file(URL, destfile = "./data/reviews.tar.gz")
untar("./data/reviews.tar.gz", exdir = "./data/reviews")  # extracts txt_sentoken/pos and txt_sentoken/neg
setwd("./data/reviews/txt_sentoken")
library(tm)
SourcePos <- DirSource(file.path(".", "pos"), pattern="cv")
SourceNeg <- DirSource(file.path(".", "neg"), pattern="cv")
pos <- Corpus(SourcePos)
neg <- Corpus(SourceNeg)
reviews <- c(pos, neg)
preprocess <- function(corpus, stopwrds = stopwords("english")) {
  library(SnowballC)  # provides the stemmer used by stemDocument()
  corpus <- tm_map(corpus, content_transformer(tolower))
  corpus <- tm_map(corpus, removePunctuation)
  corpus <- tm_map(corpus, content_transformer(removeNumbers))
  corpus <- tm_map(corpus, removeWords, stopwrds)
  corpus <- tm_map(corpus, stripWhitespace)
  corpus <- tm_map(corpus, stemDocument)
  corpus
}
processed <- preprocess(reviews)
term_documentFreq <- TermDocumentMatrix(processed)
# using the tdm, we can now get the ten most frequent terms
asMatrix <- t(as.matrix(term_documentFreq))
Frequencies <- colSums(asMatrix)
head(Frequencies[order(Frequencies, decreasing=T)], 10)
# or the terms that occur more than 2,000 times
Frequencies[Frequencies > 2000]
The tf-idf measure is more meaningful than the raw frequency: it down-weights terms that occur in many documents and gives more weight to terms that are characteristic of fewer documents, which makes the classification more reliable. Therefore, use it instead of raw frequencies in a new term-document matrix.
The removeSparseTerms() function then discards rare terms. Its sparse argument sets the maximal allowed degree of sparsity for a term (a sparsity of 0 means that all documents must contain the term, whereas a sparsity of 1 means that almost none do). We use a value of 0.8, which filters out most terms but still keeps enough of them to perform the analysis.
# tf-idf weighting instead of raw freq.
term_documentTfIdf <- TermDocumentMatrix(processed,
  control = list(weighting = function(x) weightTfIdf(x, normalize = TRUE)))
SparseRemoved <- as.matrix(t(removeSparseTerms(term_documentTfIdf, sparse = 0.8)))
# how many terms remain now?
ncol(SparseRemoved)
colnames(SparseRemoved)
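As a quick sanity check (a sketch reusing the asMatrix object built earlier, not part of the original walkthrough), we can confirm that each retained term appears in roughly a fifth or more of the 2,000 reviews:
# document frequency of the retained terms, as a proportion of all reviews
doc_freq <- colSums(asMatrix[, colnames(SparseRemoved)] > 0)
summary(doc_freq / nrow(asMatrix))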
Now, we can use these 202 terms to classify our documents based on whether the reviews are positive or negative. Remember that the rows 1 to 1,000 represent positive reviews, and rows 1,001 to 2,000 negative ones. Create a vector that reflects this.
The length of the reviews may be related to their positivity or negativity, so we also include an attribute reflecting review length, computed from the processed corpus (before the removal of sparse terms):
quality <- c(rep(1, 1000), rep(0, 1000))  # 1 = positive (rows 1-1000), 0 = negative (rows 1001-2000)
lengths <- colSums(as.matrix(TermDocumentMatrix(processed)))
DF <- as.data.frame(cbind(quality, lengths, SparseRemoved))
set.seed(123)
train = sample(1:2000,1000)
TrainDF = DF[train,]
TestDF = DF[-train,]
Naïve Bayes
library(e1071)
library(caret) # confusionMatrix is in the caret package
set.seed(345)
# features: all columns except quality; class labels: quality as a factor
model <- naiveBayes(TrainDF[-1], as.factor(TrainDF[[1]]))
# in-sample performance on the training set
classifNB <- predict(model, TrainDF[,-1])
confusionMatrix(as.factor(TrainDF$quality), classifNB)
# out-of-sample performance on the held-out test set
classifNB <- predict(model, TestDF[,-1])
confusionMatrix(as.factor(TestDF$quality), classifNB)
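If only the headline accuracy is needed, note that confusionMatrix() returns an object whose overall element contains an Accuracy entry (a small usage sketch):
cmNB <- confusionMatrix(as.factor(TestDF$quality), classifNB)
cmNB$overall["Accuracy"]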
Support Vector Machines: attempt to find a separating boundary between the two classes with as wide a margin as possible.
library(e1071)
# quality is numeric, so svm() fits a regression model; we threshold its
# predictions at 0.5 to turn them into class labels
modelSVM <- svm(quality ~ ., data = TrainDF)
probSVMtrain <- predict(modelSVM, TrainDF[,-1])
classifSVMtrain <- probSVMtrain
classifSVMtrain[classifSVMtrain > 0.5] <- 1
classifSVMtrain[classifSVMtrain <= 0.5] <- 0
confusionMatrix(as.factor(TrainDF$quality), as.factor(classifSVMtrain))
probSVMtest <- predict(modelSVM, TestDF[,-1])
classifSVMtest <- probSVMtest
classifSVMtest[classifSVMtest > 0.5] <- 1
classifSVMtest[classifSVMtest <= 0.5] <- 0
confusionMatrix(as.factor(TestDF$quality), as.factor(classifSVMtest))
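Alternatively (a sketch, not part of the original walkthrough), coercing quality to a factor makes svm() fit a classifier directly, so predict() returns class labels and no manual thresholding is needed:
modelSVM2 <- svm(as.factor(quality) ~ ., data = TrainDF)
classifSVM2 <- predict(modelSVM2, TestDF[,-1])
confusionMatrix(as.factor(TestDF$quality), classifSVM2)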
Check this CFP
- api: using an official API is what should really be taught in class XD
- rvest: a new package that makes it easy to scrape (or harvest) data from HTML web pages, inspired by libraries like Beautiful Soup. It is designed to work with magrittr so that you can express complex operations as elegant pipelines composed of simple, easily understood pieces.
- The basic recipe is rvest + CSS selectors. A point-and-click tool for beginners: SelectorGadget.
library(rvest)
machina <- read_html("http://www.imdb.com/title/tt0470752/")  # the Ex Machina IMDb page; read_html() replaces the older html()
# extract the rating
machina %>%
html_node("strong span") %>% html_text() %>%
as.numeric()
# extract the cast
machina %>% html_nodes("#titleCast .itemprop span") %>%
html_text()
# The titles and authors of recent message board postings are stored in the third
# table on the page. We can use html_nodes() and [[ to find it, then coerce it to a
# data frame with html_table():
machina %>% html_nodes("table") %>% .[[3]] %>%
html_table()
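rvest can also pull attributes rather than text; for instance (a small sketch, not tied to any particular element on this page), html_attr() lists the link targets of the page's anchor tags:
machina %>% html_nodes("a") %>% html_attr("href") %>% head()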
rtimes: an R client for the New York Times APIs, including the Congress, Article Search, Campaign Finance, and Geographic APIs. The focus is on those that deal with political data, with Article Search and Geographic thrown in for good measure. (Alternatively, you can also use tm's own plugin, tm.plugin.webmining.)
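A minimal sketch of the tm.plugin.webmining route (an assumption about its interface: WebCorpus() combined with one of the package's news sources, such as YahooNewsSource(); check the package documentation for the sources it currently supports):
library(tm.plugin.webmining)
# build a tm corpus directly from a web news search for a keyword
news <- WebCorpus(YahooNewsSource("bailout"))
news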
Put your API keys in your .Rprofile file for re-use:
options(nytimes_cg_key = "e63b6f8917f30c79521ad7ddba7b9255:11:66687269")
options(nytimes_as_key = "017ecf6cafb56e24947086cc1778ea30:1:66687269")
options(nytimes_cf_key = "YOURKEYHERE")
options(nytimes_geo_key = "YOURKEYHERE")
library(rtimes)
# Search for "bailout" between two dates, Oct 1 2008 and Dec 1 2008
out <- as_search(q = "bailout", begin_date = "20081001", end_date = "20081201")
out$data[1:2]
# Search for keyword money, within the Sports and Foreign news desks
res <- as_search(q = "money", fq = 'news_desk:("Sports" "Foreign")')
res$data[1:3]
As a class, pick a common topic/keyword/event; each group then chooses one API and tries it out [put the results in the same shared note/folder]. Candidate packages are listed below, with a twitteR sketch after the list.
- rtimes (Chinese?)
- Rfacebook
- Rweibo
- twitteR
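For example, a minimal twitteR sketch (the credential strings are hypothetical placeholders; searchTwitter() and twListToDF() are the package's functions):
library(twitteR)
# authenticate with credentials from your Twitter app settings (placeholders below)
setup_twitter_oauth("CONSUMER_KEY", "CONSUMER_SECRET", "ACCESS_TOKEN", "ACCESS_SECRET")
# search recent tweets for the chosen keyword and flatten the result to a data frame
tweets <- searchTwitter("bailout", n = 100)
tweetsDF <- twListToDF(tweets)
head(tweetsDF$text)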