- Statistics and Machine Learning 101
- Text classification using RTextTools
謝舒凱 Graduate Institute of Linguistics, NTU
RTextTools
Preparing / Preprocessing text and data. Text is unstructured or partially structured data that must be prepared for analysis. We extract features from text. We define measures. Quantitative data are often messy or missing. They may require transformation prior to analysis. Data preparation consumes much of a data scientist’s time.
Exploratory data analysis and infographics: data visualization for the purpose of discovery. We look for groups in data, find outliers, and identify common dimensions, patterns, and trends.
Prediction models (regression, classification, and clustering) and evaluation. Recommender systems, collaborative filtering, association rules, optimization methods based on linguistic heuristics, and a myriad of methods for regression, classification, and clustering all fall under the rubric of machine learning.
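The data-preparation step described above ("we extract features from text") can be illustrated with a document-term matrix, the usual numeric representation behind text classifiers. The following is a minimal sketch in plain Python, not the RTextTools API; the toy documents and the helper names `tokenize` and `document_term_matrix` are assumptions made for illustration.

```python
import re
from collections import Counter

# Hypothetical toy documents; not taken from the slides.
docs = [
    "The senate passed the budget bill.",
    "The team won the final game.",
    "Budget talks stalled in the senate.",
]

def tokenize(text):
    """Lowercase the text and keep alphabetic tokens only."""
    return re.findall(r"[a-z]+", text.lower())

def document_term_matrix(documents):
    """Return (vocabulary, rows); rows[i][j] counts term j in document i."""
    counts = [Counter(tokenize(d)) for d in documents]
    vocabulary = sorted(set().union(*counts))
    rows = [[c[term] for term in vocabulary] for c in counts]
    return vocabulary, rows

vocabulary, matrix = document_term_matrix(docs)
```

Each row of `matrix` is one document and each column one term from `vocabulary`. Real pipelines typically add steps such as stop-word removal or stemming, which this sketch omits.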
http://www.cnblogs.com/bluepoint2009/archive/2012/09/18/precision-recall-f_measures.html
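The link above covers precision, recall, and the F-measure, the standard measures for evaluating a classifier against hand-coded labels. As a rough sketch of how they are computed for one class, assuming hypothetical label lists `gold` and `pred` and a helper named `precision_recall_f1` (none of which come from the slides):

```python
def precision_recall_f1(gold, predicted, positive):
    """Compute precision, recall, and F1 for the class `positive`."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == positive and p == positive)
    fp = sum(1 for g, p in pairs if g != positive and p == positive)
    fn = sum(1 for g, p in pairs if g == positive and p != positive)
    # Guard against division by zero when a class is never predicted or never occurs.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1

# Hypothetical gold-standard labels and classifier output.
gold = ["politics", "politics", "sports", "sports", "politics"]
pred = ["politics", "sports", "sports", "politics", "politics"]
p, r, f = precision_recall_f1(gold, pred, "politics")
```

Precision is the share of items predicted as the class that really belong to it; recall is the share of true members of the class that the classifier found; F1 is their harmonic mean.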
code frame to categorize text.
Advantages
Disadvantages
The simplest approach is to do this in Excel:
gold standard
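Large projects need to consider issues such as sustainability, compatibility, and data interchange, so using an annotation system is recommended.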
GATE
CAT (Coding Analysis Toolkit)
The difference between labeling and annotation will be discussed later.
[create_matrix] Import your hand-coded data into R
[create_corpus] Remove the "irrelevant" data, then build the training dataset and the test dataset
[train model(s)] Choose machine learning algorithm(s) to train a model
[build classification model(s)] Test on the (out-of-sample) test data; establish accuracy criteria to gauge performance
[apply classification model(s)] Use model to classify novel data
[create analytics] Find the data the classifier got wrong: manually label data that do not meet accuracy criteria
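The workflow above can be sketched end to end in a few lines. This is a language-neutral illustration in plain Python, not RTextTools code: the toy data, the naive Bayes classifier, the 6/2 train/test split, and the confidence threshold standing in for the "accuracy criteria" are all assumptions chosen to keep the example small.

```python
import math
import re
from collections import Counter, defaultdict

# Hypothetical hand-coded data: (text, label). "irrelevant" marks items to drop.
coded = [
    ("senate passes budget bill", "politics"),
    ("parliament debates tax law", "politics"),
    ("president signs budget law", "politics"),
    ("team wins final game", "sports"),
    ("striker scores in final match", "sports"),
    ("coach praises team after game", "sports"),
    ("lorem ipsum placeholder", "irrelevant"),
    ("senate votes on tax bill", "politics"),
    ("team loses match", "sports"),
]

def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())

def train(examples):
    """Fit a multinomial naive Bayes model (counts only; smoothing is applied later)."""
    label_counts = Counter(label for _, label in examples)
    word_counts = defaultdict(Counter)
    for text, label in examples:
        word_counts[label].update(tokenize(text))
    vocabulary = {w for c in word_counts.values() for w in c}
    return label_counts, word_counts, vocabulary

def classify(model, text):
    """Return (best_label, estimated probability of best_label)."""
    label_counts, word_counts, vocabulary = model
    total = sum(label_counts.values())
    log_scores = {}
    for label, n in label_counts.items():
        denom = sum(word_counts[label].values()) + len(vocabulary)
        score = math.log(n / total)
        for w in tokenize(text):
            if w in vocabulary:  # ignore words never seen in training
                # Add-one (Laplace) smoothing.
                score += math.log((word_counts[label][w] + 1) / denom)
        log_scores[label] = score
    # Normalise the log scores into probabilities over the labels.
    top = max(log_scores.values())
    exp = {label: math.exp(s - top) for label, s in log_scores.items()}
    best = max(exp, key=exp.get)
    return best, exp[best] / sum(exp.values())

# Step 2: drop irrelevant items, then split into training and test data.
relevant = [(t, l) for t, l in coded if l != "irrelevant"]
train_set, test_set = relevant[:6], relevant[6:]

# Steps 3-4: train, then measure accuracy on the held-out test data.
model = train(train_set)
correct = sum(classify(model, t)[0] == l for t, l in test_set)
accuracy = correct / len(test_set)

# Steps 5-6: classify novel text; flag low-confidence items for manual labeling.
THRESHOLD = 0.7  # hypothetical confidence cut-off
novel = ["budget bill stalls in senate", "weather is nice today"]
needs_manual_label = [t for t in novel if classify(model, t)[1] < THRESHOLD]
```

Items in `needs_manual_label` would go back to a human coder and then into the training data, which closes the loop described in the last step. With so few examples the accuracy figure is not meaningful; the point is only the shape of the workflow.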