- Background
- Text, Text, Text
- Linguistics and Data Science
- Hands-on
謝舒凱 Graduate Institute of Linguistics, NTU
Background
Three camps: (Text analytics | Text mining), (NLP | Linguistics), (Machine Learning | Statistics)
Text analytics (\(\simeq\) text mining) can be viewed as a set of (computational) linguistic (NLP) and (statistical) machine learning techniques that model and discover the information content of textual data for different purposes (e.g., business intelligence, research, or investigation).
Textual data, textual information, textual knowledge.
Data processing, information processing, knowledge processing.
Data <> Story
Automated Data Scientist
Trends and needs that I personally expect to emerge:
Making structured data from unstructured data (and vice versa).
Marrying structured and unstructured data
Source: (Hurwitz, J et al., 2013)
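Text analytics applies natural language processing, through software or other business processes, to find useful information in social media, websites, and business text. Natural language processing (NLP) is the key to human-computer interaction: put simply, it lets computers understand human language and then converts the information into a form that is easy for computers to process, so that it can be stored and used. Source: http://buzzorange.com/techorange/2015/06/08/text-analytics/
DataTaipei: R client for Data.Taipei
Features: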
Narrative network of US. 2012.
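Humanistic reflection: personal privacy in the data era
Text, Text, Text
Reviews, food, products, movies, books, courses, government policy, ...?
Two dimensions represented by natural language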
Taxonomies
Linguistics and Data Science
Language acquisition and development, structure and function, neural and psychological mechanisms, social variation and evolutionary processes, and so on.
Linguistic data are ubiquitous: knowledge to be discovered, tendencies to be predicted.
Data source: [Adapted from (Pinker, 1999)]
"Every time I fire a linguist, the performance of the speech recognizer goes up" (Frederick Jelinek, 1932-2010, IBM and Johns Hopkins).
Does deep machine learning only require shallow linguistic processing?
The language of lying (Noah Zandan)
Hands-on
Jane Andrews, The Stories Mother Nature Told Her Children
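# Download the book from the Internet Archive and check its file type
wget http://archive.org/download/thestoriesmother05792gut/stmtn10.txt
file stmtn10.txt
# Peek at the first 20 lines, then page through the file with line numbers
head -n 20 stmtn10.txt
less -N stmtn10.txt
# Delete the footer (lines 2206-2525), then the first 40 lines
sed '2206,2525d' stmtn10.txt > stmtn10-nofooter.txt
sed '1,40d' stmtn10-nofooter.txt > stmtn10-trimmed.txt
# Count the remaining lines and list every line that mentions "giant"
wc -l stmtn10-trimmed.txt
grep -n "giant" stmtn10-trimmed.txt
# Dump a web page as plain text and count its lines, words, and bytes
w3m -dump http://www.gnu.org/gnu/manifesto.html | wc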
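A natural next step after trimming and grepping is a word-frequency list. The following is only a minimal sketch: the helper name `word_freq` and the one-sentence sample are made up for illustration (they are not from the book), and the pipeline treats any run of non-letters as a word boundary, which is a crude assumption about tokenization.

```shell
#!/bin/sh
# word_freq: hypothetical helper, reads text on stdin and prints "count word" lines.
word_freq() {
  # 1. replace every run of non-letters with a single newline (one word per line)
  # 2. lowercase everything
  # 3. sort, count identical lines, then order by count (descending)
  tr -cs '[:alpha:]' '\n' |
    tr '[:upper:]' '[:lower:]' |
    sort | uniq -c | sort -k1,1nr -k2
}

# Invented sample sentence, used here instead of the downloaded file.
sample='The giant saw the sea. The sea saw the giant, and the giant laughed.'
printf '%s\n' "$sample" | word_freq | head -n 3
```

To run it on the trimmed book instead, something like `word_freq < stmtn10-trimmed.txt | head` should work, assuming the file from the steps above is present.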
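Exercise: grab a copy of Alice's Adventures in Wonderland by Lewis Carroll and try the same steps.
For beginners, let practical needs decide.
Imagine a scenario: your company has developed a smart XX. As a data scientist, the types of data you may have to deal with include: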
Data <> Story
wget http://www.sensorywithr.org/wp-content/uploads/2014/06/perfumes_comments.csv
iconv -f ISO-8859-15 -t UTF-8 perfumes_comments.csv > perfumes_comments_utf8.csv
csvlook perfumes_comments_utf8.csv | head
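Why the `iconv` step matters can be seen on a single accented character. This is a small sketch under assumptions: the helper name `to_utf8` and the sample string are invented, and it only assumes an `iconv` that knows ISO-8859-15. In that encoding "é" is the single byte 0xE9 (octal 351), which is not valid UTF-8 on its own; after conversion it becomes the two bytes 0xC3 0xA9.

```shell
#!/bin/sh
# to_utf8: hypothetical wrapper around the same iconv call used on the CSV file.
to_utf8() {
  iconv -f ISO-8859-15 -t UTF-8
}

# Invented sample: "café;fleur" encoded in ISO-8859-15 (\351 is the byte for é).
converted=$(printf 'caf\351;fleur\n' | to_utf8)
printf '%s\n' "$converted"
```

If the converted text still shows broken characters, the source encoding given to `-f` is probably not the right one for that file.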
R way
# Read the tab-separated comments file, then inspect the first rows and a summary
comments <- read.csv("../../../data/week2/perfumes_comments.csv", sep = "\t",
                     dec = ".", quote = "\"")
head(comments)
summary(comments)
Source: sensorywithr
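library(FactoMineR)
# Build a word-by-group contingency table from the free-text comments:
# text is in column 3, rows are grouped by column 1, words are separated by ";"
res.textual <- textual(comments, num.text = 3, contingence.by = 1, sep.word = ";")
names(res.textual)
# Show the first 10 word columns, then the total count of each of those words
res.textual$cont.table[,1:10]
apply(res.textual$cont.table[,1:10], MARGIN = 2, FUN = sum)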
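The idea behind such a contingency table (how often each word occurs within each group) can also be sketched on the command line. This is not how FactoMineR implements `textual`; it is a rough awk approximation under assumptions: the helper name `count_words_by_group`, the column names, and the three sample rows are all invented, the input is tab-separated with a header row, and the result is printed in long form (group, word, count) rather than as a wide table.

```shell
#!/bin/sh
# count_words_by_group: hypothetical helper. Reads tab-separated rows on stdin,
# groups by column 1, splits column 3 on ";" and counts lowercased words.
count_words_by_group() {
  awk -F '\t' '
    NR > 1 {                                  # skip the header row
      n = split(tolower($3), words, ";")      # one entry per ";"-separated word
      for (i = 1; i <= n; i++) {
        w = words[i]
        gsub(/^ +| +$/, "", w)                # trim surrounding spaces
        if (w != "") count[$1 "\t" w]++       # key: group <TAB> word
      }
    }
    END { for (k in count) print k "\t" count[k] }
  ' | sort
}

# Invented sample rows (header plus three comments), not the real perfume data.
result=$(printf '%s\t%s\t%s\n' \
  Product Panelist Comment \
  Angel 1 'sweet;strong' \
  Angel 2 'Sweet;vanilla' \
  Shalimar 1 'strong;spicy' | count_words_by_group)
printf '%s\n' "$result"
```

Lowercasing before counting means "sweet" and "Sweet" fall into the same cell, which is usually what one wants for a first look at comment vocabulary.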
twitteR
demo
In-class group work: where does the best solution come from?
I don't have a standard answer either.
Why? Because I'm a linguist XD