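- The relationship between linguistics and open government
- The meaning and implementation framework of open language data
- Proposals for the future of open language data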
謝舒凱 (Aber) and Pierre Magistry (A-Tsioh)
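Assistant Professor, Graduate Institute of Linguistics, National Taiwan University (2010-) / Visiting Postdoctoral Researcher, Graduate Institute of Linguistics, National Taiwan University (2013-2014)
The relationship between linguistics and open government
Language acquisition and development, structure and function, neural and psychological mechanisms, social variation and evolutionary processes, and so on. Empirical/computational linguistics [a.k.a. Natural Language Processing] uses computers to help us answer these questions and to build applications.
Language data (corpora), as part of big data, carry cultural and historical memory, social-psychological trends, leanings in political opinion, the distribution of emotional preferences, personality traits and decision-making behavior, early signs of disease, and more.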
Linguistic data are ubiquitous: knowledge to be discovered, tendencies to be predicted.
Natural Language Processing and Text Analytics are the keys.
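The language data we produce in order to communicate, express feelings (emotion), and convey meaning (thought).
In the context of computational linguistics (natural language processing), language data in the broad sense include all kinds of language resources
(dictionaries, lexicons, corpora, treebanks, ontologies, and so on).
Developing language-information applications; understanding and advancing the potential for social progress
Government >> civic hackers (civic hacking) >> academia >> industry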
open the data >> hacking the data >> exploring the data >> data social product
The meaning and implementation of open language data
Thanks to the MOE opening its 《臺灣閩南語常用詞辭典》 (Dictionary of Frequently-Used Taiwan Minnan), I could:
- turn a small experiment into an APP
- which was downloaded 10,000 times
How TaigIME-android came into existence
As a foreign student in linguistics, I noticed that many friends were using Taiwanese (Holo) in their online messages. It was either
- directly in ㄅㄆㄇㄈ
- or in 漢字 using Mandarin sounds in ㄅㄆㄇㄈ as IME...

As a computational linguist, the second option just drove me crazy. How inconvenient is that?! Can't you just use the transcription of Taiwanese sounds to input 漢字?!
I had the idea but was missing the data... and then I found MOE.cc! If the MOE had selected too restrictive a licence, the app would never have made its way to the Google Play Store, where it has now been downloaded more than 10,000 times! https://github.com/a-tsioh/TaigIME-android/
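The core idea of typing a Taiwanese transcription to get 漢字 can be sketched as a prefix lookup over (romanization, 漢字) pairs. This is only a minimal illustration, not how TaigIME-android is implemented: the three entries are toy examples written without tone marks, and the names `TOY_ENTRIES`, `build_index`, and `candidates` are hypothetical. A real input method would load its entries from an openly licensed dictionary dataset.

```python
from collections import defaultdict

# Toy entries for illustration only (assumption: tone-less romanization keys).
# A real IME would load these pairs from an open dictionary dataset.
TOY_ENTRIES = [
    ("tai-uan", "臺灣"),
    ("tai-gi", "臺語"),
    ("li-ho", "你好"),
]


def build_index(entries):
    """Map each romanized key to the list of characters it can produce."""
    index = defaultdict(list)
    for romanization, hanzi in entries:
        index[romanization].append(hanzi)
    return index


def candidates(index, typed):
    """Return candidates whose romanization starts with what was typed."""
    results = []
    for key in sorted(index):
        if key.startswith(typed):
            results.extend(index[key])
    return results


index = build_index(TOY_ENTRIES)
print(candidates(index, "tai"))
```

A linear scan over sorted keys is enough for a sketch; a production IME would more likely use a trie or a ranked, frequency-aware index.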
Module.Variable | Description |
---|---|
concept.sense | word sense number from Chinese Wordnet (CWN), please help |
concept.gloss | sense definitions from CWN |
concept.relations | lexical semantic relations |
emotion.polarity | polarity of descriptive emotional words |
emotion.location | location collocates of emotion |
emotion.cause | cause collocates |
emotion.result | resulting event collocates |
emotion.time | time collocates |
frequency.asbc | frequency in Sinica Corpus |
frequency.plurk | frequency in Plurk Corpus |
frequency.childes | frequency in CHILDES Corpus |
frequency.ptt | frequency in PTT |
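AND MANY MORE! Modules in progress: emotion, developmental trajectory, semantics, usage frequency, age, relations, gender, teaching difficulty, radical concepts, semantic classes, ontology, social-psychological personality ...
Our ideal long-term goal
"Quantitative data cannot tell good from bad and serve only as a reference; a larger number of draft amendments is not necessarily better. Please click on the legislator to read the content of their draft amendments before drawing conclusions." http://ly.g0v.tw/
The problem is that there are many types of data.
In-depth analysis of textual content can make the case more convincing.
Use linguistic expression to observe value choices, topics of concern, and stances.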
Keyness calculation algorithm made public.
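To make the idea of an open keyness calculation concrete, here is a minimal sketch using the log-likelihood statistic, one common keyness measure in corpus linguistics. The slides do not say which measure is intended, so this choice is an assumption, and the function names `log_likelihood` and `keyness` are hypothetical. The score compares a word's frequency in a target corpus (for example, one legislator's speeches) with its frequency in a reference corpus.

```python
import math
from collections import Counter


def log_likelihood(a, b, c, d):
    """Log-likelihood keyness score.

    a, b: frequency of the word in the target and reference corpora.
    c, d: total token counts of the target and reference corpora.
    """
    e1 = c * (a + b) / (c + d)  # expected frequency in the target corpus
    e2 = d * (a + b) / (c + d)  # expected frequency in the reference corpus
    g2 = 0.0
    if a > 0:
        g2 += a * math.log(a / e1)
    if b > 0:
        g2 += b * math.log(b / e2)
    return 2 * g2


def keyness(target_tokens, reference_tokens):
    """Rank the words of the target corpus by log-likelihood, highest first."""
    target, reference = Counter(target_tokens), Counter(reference_tokens)
    c, d = sum(target.values()), sum(reference.values())
    scores = {w: log_likelihood(target[w], reference[w], c, d) for w in target}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

The score itself does not say whether a word is overused or underused in the target corpus; comparing the relative frequencies `a / c` and `b / d` gives the direction. Tokenization (word segmentation for Chinese text) is assumed to happen before these functions are called.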
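Parliamentary lie detector: text, discourse, expression, and politics (no entry without sincerity)
(speech, text, multimodality, discourse pragmatics, psychology of language)
Proposals for the future of open language data
(Linguistics) labs should be situated in the context of social development.
Dictionary \(\bowtie\) lexicon \(\bowtie\) lexical network (lex.network) \(\bowtie\) lexical cloud (lex.cloud)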
Bridge the gap between the labs and the people!
Crowdfunding, crowdsourcing, crowd-anything is welcome: crowd(fund|sourc)ing language resources for Taiwan.