Linguistic Analysis and Data Science

Lecture 08

謝舒凱 Graduate Institute of Linguistics, NTU

Preparing and preprocessing text and data. Text is unstructured or partially structured data that must be prepared for analysis: we extract features from it and define measures. Quantitative data are often messy or incomplete and may need transformation before analysis. Data preparation consumes much of a data scientist’s time.
Exploratory data analysis and infographics. This is data visualization for the purpose of discovery: we look for groups in the data, find outliers, and identify common dimensions, patterns, and trends.
Prediction models (regression, classification, and clustering) and their evaluation. Recommender systems, collaborative filtering, association rules, optimization methods based on linguistic heuristics, and a myriad of methods for regression, classification, and clustering all fall under the rubric of machine learning.
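To make "extracting features from text" concrete, here is a minimal sketch in Python using only the standard library. It assumes English text and a simple regular-expression tokenizer (Chinese text would need a word segmenter instead). The function names `tokenize` and `term_frequencies` and the sample sentences are hypothetical, chosen for illustration only.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into word tokens (English only)."""
    return re.findall(r"[a-z']+", text.lower())

def term_frequencies(documents):
    """Return one Counter of token counts per document.

    Missing documents (None) are treated as empty text, which is one
    simple way to handle missing values before analysis.
    """
    return [Counter(tokenize(doc or "")) for doc in documents]

# Hypothetical toy data, including one missing document.
docs = ["The cat sat on the mat.", None, "The dog chased the cat!"]
features = term_frequencies(docs)

print(features[0]["the"])   # count of "the" in the first document
print(len(features[1]))     # the missing document has no features
```

Each `Counter` is one row of a document-term matrix in sparse form; stacking them over a shared vocabulary gives the matrix that a classifier would consume.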

Study of recorded human communication
Summary and quantitative analysis of communicated messages
The researcher looks for patterns and themes in the text and develops a code frame to categorize it.
Essentially, variables are extracted from text. The approach follows the scientific method and establishes objectivity through inter-coder reliability.
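Inter-coder reliability can be quantified in several ways; the lecture does not say which one is intended. As an illustration, the sketch below computes Cohen's kappa, a common chance-corrected agreement measure for two coders. The function name `cohen_kappa` and the labels are hypothetical.

```python
from collections import Counter

def cohen_kappa(coder_a, coder_b):
    """Cohen's kappa for two coders who labeled the same items."""
    if len(coder_a) != len(coder_b) or not coder_a:
        raise ValueError("Both coders must label the same non-empty set of items.")
    n = len(coder_a)
    # Observed agreement: share of items given the same label by both coders.
    observed = sum(a == b for a, b in zip(coder_a, coder_b)) / n
    # Expected agreement if each coder labeled at random with their own label rates.
    freq_a = Counter(coder_a)
    freq_b = Counter(coder_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both coders used a single identical label throughout
    return (observed - expected) / (1 - expected)

# Hypothetical codes assigned by two coders to six texts.
coder_a = ["pos", "pos", "neg", "neg", "pos", "neg"]
coder_b = ["pos", "neg", "neg", "neg", "pos", "pos"]
kappa = cohen_kappa(coder_a, coder_b)
print(round(kappa, 3))
```

Here the coders agree on 4 of 6 items (0.667), while chance agreement is 0.5, so kappa is about 0.333: noticeably weaker than the raw agreement rate suggests.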

Advantages

Disadvantages

The simplest approach is to use Excel:
- One (or more) column(s) for the text data; one column for the topic label (as the gold standard)
- Usually there are more than 3,000 labeled documents.
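A spreadsheet laid out this way can be exported to CSV and read programmatically. The sketch below assumes hypothetical column names `text` and `label` and uses an in-memory string as a stand-in for the exported file; it skips rows where either field is missing.

```python
import csv
import io
from collections import Counter

# Stand-in for a CSV file exported from Excel; column names are hypothetical.
SAMPLE_CSV = """text,label
"Parliament passed the budget bill.",politics
"The team won the final, 3-1.",sports
"Ministers debated the new tax law.",politics
"""

def load_labeled_rows(handle):
    """Read (text, label) pairs, dropping rows with a missing text or label."""
    rows = []
    for row in csv.DictReader(handle):
        text = (row.get("text") or "").strip()
        label = (row.get("label") or "").strip()
        if text and label:
            rows.append((text, label))
    return rows

rows = load_labeled_rows(io.StringIO(SAMPLE_CSV))
label_counts = Counter(label for _, label in rows)
print(len(rows), dict(label_counts))
```

For a real file, `io.StringIO(SAMPLE_CSV)` would be replaced by `open(path, newline="", encoding="utf-8")`. Checking the label counts first is a quick way to spot unbalanced or misspelled categories.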
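Large projects must take sustainability, compatibility, and data interchange into account, so an annotation system is recommended:
- Corpus and language-processing community: GATE
- Qualitative research community: CAT (Coding Analysis Toolkit)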
- lopetator
The difference between labeling and annotation will be discussed later.

[create_matrix] Import your hand-coded data into R
[create_corpus] Remove "irrelevant" data; create the training dataset and the test dataset
[train model(s)] Choose machine learning algorithm(s) to train a model
[build classification model(s)] Test on the (out-of-sample) test data; establish accuracy criteria to assess performance.
[apply classification model(s)] Use model to classify novel data
[create analytics] Identify the data that were misclassified automatically; manually label data that do not meet the accuracy criteria
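The workflow above can be sketched end to end on toy data. The steps in the lecture refer to R; the version below is only an illustrative stand-in in pure Python, and it swaps in a small multinomial Naive Bayes classifier with add-one smoothing as the learning algorithm. All function names and example sentences are hypothetical.

```python
import math
import re
from collections import Counter, defaultdict

def tokenize(text):
    """Lowercase the text and split it into word tokens (English only)."""
    return re.findall(r"[a-z']+", text.lower())

def train_naive_bayes(labeled_docs):
    """Count labels and per-label token frequencies from (text, label) pairs."""
    label_counts = Counter()
    token_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in labeled_docs:
        label_counts[label] += 1
        for token in tokenize(text):
            token_counts[label][token] += 1
            vocabulary.add(token)
    return label_counts, token_counts, vocabulary

def classify(model, text):
    """Return the label with the highest log-probability (add-one smoothing)."""
    label_counts, token_counts, vocabulary = model
    total_docs = sum(label_counts.values())
    # Tokens never seen in training carry no information and are ignored.
    tokens = [t for t in tokenize(text) if t in vocabulary]
    best_label, best_score = None, -math.inf
    for label in sorted(label_counts):
        score = math.log(label_counts[label] / total_docs)
        denom = sum(token_counts[label].values()) + len(vocabulary)
        for token in tokens:
            score += math.log((token_counts[label][token] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical hand-coded data, already split into training and test sets.
train = [
    ("the election results were announced by the government", "politics"),
    ("parliament passed the new tax law", "politics"),
    ("the minister resigned after the vote", "politics"),
    ("the team won the championship game", "sports"),
    ("the striker scored two goals in the match", "sports"),
    ("the coach praised the players after the game", "sports"),
]
test = [
    ("the government proposed a new law", "politics"),
    ("the players won the match", "sports"),
]

model = train_naive_bayes(train)                               # train the model
predictions = [classify(model, text) for text, _ in test]      # out-of-sample test
accuracy = sum(p == gold for p, (_, gold) in zip(predictions, test)) / len(test)
# Documents the model got wrong are candidates for manual relabeling.
misclassified = [text for (text, gold), p in zip(test, predictions) if p != gold]
print(accuracy, misclassified)
```

With only six training documents the accuracy figure means little; the point is the shape of the loop: train on one portion, measure on held-out data, and route the errors back to human coders.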

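RTextTools can automate some labeling work and perform supervised automatic text classification. It is simple to use, but it has memory problems, and its support for Chinese is problematic.
"One-stop-shop for conducting supervised machine learning with textual data": try working through this article as you read it. See the sample code.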