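With thanks to my teacher for taking me, unworthy as I was, into CKIP (詞庫) back then,
Shu-Kai Hsieh (謝舒凱), Graduate Institute of Linguistics, National Taiwan University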
CKIP30 workshop, 2015, Academia Sinica
Current Approaches | The psychological meaning of words
Coh-Metrix Lexicon [McNamara et al. 2014]: Lexical data and measures for text and discourse at different levels, including the MRC Psycholinguistic Database with various properties of words (e.g., age of acquisition, familiarity ratings, concreteness, imageability, Colorado meaningfulness, etc.), and CELEX word frequency.
LIWC Lexicon [Pennebaker et al. 2001; Tausczik et al. 2010]: provides the psychometrics of word usage in various social-psychological contexts (e.g., linguistic features of deceptive statements).
Current Approaches | The psychological meaning of words
English Lexicon Project: affords access to a large set of lexical behavioral characteristics from visual lexical decision and naming studies of 40,481 words and 40,481 nonwords. (Similar projects exist for Dutch, British English, and French.)
DysList: a lexicon of dyslexic errors annotated with linguistic, phonetic and visual features.
. . . . . . . . .
Current Approaches | Lexical semantics and concepts
Existing Chinese Lexical Resources
Traditional Chinese Psycholinguistic Database: provides a large-scale psycholinguistic norm of 3,314 Traditional Chinese characters along with their naming latencies collected from 140 Chinese speakers.
What's NEXT:
Old Wisdom in New World
DeepLEX: With its modularized open architecture, it aims to be a fine-grained yet large-scale multilingual lexical resource that empowers linguists to pursue a wide array of previously unanswerable research questions.
- Representing lexical behavior/knowledge from different perspectives
Operationalized lexical knowledge representation
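E.g., we care not (only) about which senses 打 (dǎ, 'to hit') has, but more about how many senses 打 is divided into under which criteria or contexts, and how the annotated information (categorical and/or numerical) relates to the behavior observed for that unit in other contexts (acquisition, emotion, development, language teaching, neural representation, psychological responses, etc.) (shallow linguistic feature engineering).
- Every effort should be recorded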
Reused, Reproduced, Reshaped and Reinforced.
It takes the functional position (usage-based view) in determining units and patterns (in Chinese), as well as the ontological grounding on the relation between linguistic objects and situations (bits of reality). (Langacker 1987, 1988, 1999; Croft 2002; Tomasello 2003; Bybee 2006, 2010)
Lexical data at different levels are modularized (only for practical reasons) into, for example, a syntax-semantics module, an emotion module, a discourse and pragmatics module, a diachronic module, etc. Researchers from different fields can initiate new collaborations based on these modules.
Hanzi | Semantics | Emotion | Lexical.Age | Acquisition | Social Network | ... |
---|---|---|---|---|---|---|
phonetics | sense | polarity | 1930.freq | 3y.freq | indegree | ... |
components | relations | classes | 1940.freq | 4y.freq | ... | ... |
At the moment there are 45k units (ranging from characters to lexical chunks) with more than 140 variables. The scope and size are still evolving; with concerted, long-term effort, we believe this resource will be valuable for deep natural language processing and intelligent applications.
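The modular layout described above can be pictured as one table per module, all sharing a key for the lexical unit. The following is a minimal sketch in R under that assumption; the data frames, the column names (`unit`, `polarity`, `freq_1930`, `freq_1940`), and all values are hypothetical placeholders, not actual DeepLEX content.

```r
# Hypothetical toy modules keyed by a shared `unit` column (not real DeepLEX data).
emotion <- data.frame(
  unit     = c("打", "開心", "難過"),
  polarity = c(0, 1, -1),
  stringsAsFactors = FALSE
)
diachronic <- data.frame(
  unit      = c("打", "開心", "電腦"),
  freq_1930 = c(120, 15, 0),
  freq_1940 = c(110, 22, 0),
  stringsAsFactors = FALSE
)

# Full outer join: keep every unit, filling NA where a module has no entry.
lexicon <- merge(emotion, diachronic, by = "unit", all = TRUE)

# Units that are covered by both modules.
shared <- lexicon[complete.cases(lexicon), ]

print(lexicon)
print(shared$unit)
```

Keeping each module as its own table and joining on demand means a new module can be added without touching the existing ones, which is one plausible reading of "modularized only for practical reasons".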
Morpho-Syn-Sem module
Affective module
Diachronic module
Stance module
Morphemes (Hanzi) module
Frequency module
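affect <- read.csv("~/LOPEN/BIGLEXICON/modules/emotion/data/affect.0.csv")  # load the affective module
View(affect)      # browse the table interactively (e.g., in RStudio)
str(affect)       # inspect column names and types
head(affect, 10)  # show the first 10 rows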
Deep and Big: in what sense?
You are now empowered to ask DIFFERENT questions.
Which variables is lexical difficulty related to? And lexical age (survivability)?
\[ \min_{\beta_0,\beta} \frac{1}{N}\sum_{i=1}^N w_i\, l(y_i,\beta_0+\beta^T x_i)+\lambda \left[(1-\alpha) \|\beta\|_2^2/2+\alpha\|\beta\|_1\right] \]
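The penalized objective above has the elastic-net form. As a minimal sketch in base R, assuming a squared-error loss l(y, η) = (y − η)²/2 and unit observation weights by default, it can be evaluated on simulated data. The function name `enet_objective` and the simulated predictors are hypothetical; the slides do not say which loss or software was used, and in practice a dedicated package such as glmnet would be used to minimize the objective.

```r
# Elastic-net objective with a Gaussian (squared-error) loss; the loss is an assumption.
enet_objective <- function(beta0, beta, x, y, lambda, alpha, w = rep(1, length(y))) {
  eta     <- beta0 + as.vector(x %*% beta)   # linear predictor beta0 + beta^T x_i
  loss    <- mean(w * (y - eta)^2 / 2)       # (1/N) * sum_i w_i * l(y_i, eta_i)
  penalty <- lambda * ((1 - alpha) * sum(beta^2) / 2 + alpha * sum(abs(beta)))
  loss + penalty
}

set.seed(1)
n <- 100
x <- matrix(rnorm(n * 3), n, 3)              # three simulated (hypothetical) lexical predictors
beta_true <- c(2, 0, -1)
y <- as.vector(1 + x %*% beta_true + rnorm(n, sd = 0.1))

obj_none  <- enet_objective(1, beta_true, x, y, lambda = 0,   alpha = 0.5)  # loss only
obj_ridge <- enet_objective(1, beta_true, x, y, lambda = 0.1, alpha = 0)    # pure L2 penalty
obj_lasso <- enet_objective(1, beta_true, x, y, lambda = 0.1, alpha = 1)    # pure L1 penalty
```

Setting `alpha = 0` leaves only the ridge term and `alpha = 1` only the lasso term, so the mixing parameter controls how strongly weak predictors of, say, lexical difficulty are driven to exactly zero.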
Crowd(-sourced) Annotation (眾標)
Hoping for clever solutions: giving without attachment (無相布施) as annotation
Make it open before making it right
Re-evaluate the role of linguistics/linguistic annotation in the era of Machine Learning.
Crossover is in: collaboration is needed, and public sharing is crucial. Time to discover and explore the Wisdom of Linguistic Crowds.
[McNamara et al. 2014] Automated evaluation of text and discourse with Coh-Metrix. Cambridge.
[Pennebaker et al. 2001] Linguistic inquiry and word count: LIWC 2001. Mahwah: Lawrence Erlbaum Associates, 71.
[Tausczik et al. 2010] The Psychological Meaning of Words: LIWC and Computerized Text Analysis Methods. Journal of Language and Social Psychology, 29(1) 24–54.
[Chang et al. 2015] A psycholinguistic database for traditional Chinese character naming. Behavior Research Methods.
[Sze et al. 2013] The Chinese Lexicon Project: A repository of lexical decision behavioral responses for 2,500 Chinese characters. Behavior Research Methods.
[Magistry et al. 2015] Sentiment Detection in Micro-blogs Using Unsupervised Chunks Extraction. CLSW 2015 (accepted).
王伯雅: diachronic module
呂珮瑜: emotion module
莊茹涵: pragmatics module
秦睿謙/林欣霓: Hanzi module
CWN2 Group: lexical semantics module
劉純睿, Pierre Magistry, 張瑜芸, 施孟賢, ... and all members of the LOPE lab!