DeepLEX

A Multi-dimensional Chinese Lexical Resource

謝舒凱台大語言學研究所
CKIP30 workshop, 2015, Academia Sinica

Motivation

感謝老師當年收留不才的我進入詞庫，

詞庫的故事

Current Approaches | 詞彙的心理意義
- Coh-Metrix Lexicon [Mcnamara et al., 2014]: Lexical data and measures for text and discourse at different levels, including MRC Psycholinguistic Database with various properties of words (e.g., age of acquisition, familiarity ratings, concreteness, imagability, Colorado Meaningness, etc), and CELEX word frequency.
- LIWC Lexicon [Pennebaker et. al. 2001; Tausczik et al. 2010]: provides the psychometrics of word usage in various social-psychological contexts. (e.g., linguistic features of deceptive statements).

詞庫的故事

Current Approaches 詞彙的心理意義
- English Lexicon Project: affords access to a large set of lexical behavioral characteristics from visual lexical decision and naming studies of 40,481 words and 40,481 nonwords. (Also for Dutch, British, French)
- DysList: a lexicon of dyslexic errors annotated with linguistic, phonetic and visual features.
- . . . . . . . . .

詞庫的故事

Current Approaches | 詞彙的語意與概念
- WordNet
- HowNet
- FrameNet
- WordSim353 MEN Test Collection: a collection for measuring word similarity or relatedness, contains two sets of English word pairs along with human-assigned similarity judgements.
- . . . . . . . . .

詞庫的故事

Current Approaches
Existing Chinese Lexical Resources

*eHowNet

*Chinese Wordnet

*Chinese Coh-Metrix

*Chinese LIWC

*Traditional Chinese Psycholinguistic Database : provides a large-scale psycholinguistic norm of 3,314 Traditional Chinese characters along with their naming latencies collected from 140 Chinese speakers.

What's NEXT: Old Wisdom in New World

Sustainable
Reproducible
Collaborative (w/o pain)
Open and (thus) linked, decentrailized

大與深 (BIG and DEEP) 可以兼得否？

BIG is easy, DEEP is hard

(Hsieh at al., 2015. to be submitted)

DeepLEX: With its modularized open architecture, it aims to be a fine-grained yet scaled multilingual lexical resource that empowers linguists to pursuit a wide array of previously unanswerable research questions.

提醒：詞庫和詞典的基本任務不同

線上漢語大詞典與康熙字典

DeepLEX: Rationale

- 不同視野下的詞彙行為/知識表徵化

Operationalized lexical knowledge representation

E.g., 我們不(只)關心「打」的詞義有哪些，更關心

在什麼標準|脈落下「打」的詞義分成幾個；
這樣的標記訊息(Categorical and/or Numerical) 跟該單位在其他脈絡下被觀察到的行為 (習得，情緒，發展，語言教學，神經表徵，心理反應等) 之間的關聯為何？
以這樣的第二層知識出發提供「語言學養分」給 NLP/Machine Learning (rather than shallow linguistic feature engineering)

- 任何的努力都應該被記錄

Reused, Reproduced, Reshaped and Reinforced.

DeepLEX: Assumptions

It takes the functional position (usage-based view) in determining units and patterns (in Chinese), as well as the ontological grounding on the relation between linguistic objects and situations (bits of reality). (Langacker 1987, 1988, 1999; Croft 2002; Tomasello 2003; Bybee 2006, 2010)
Lexical data at different levels are modularized (only for practical reasons), such as syntax-semantics module, emotion module, discourse and pragmatic module, diachronic module, etc. Researchers from different fields can initiate a new cooperation based upon.

Modules

Hanzi	Semantics	Emotion	Lexical.Age	Aquisition	Social Network	.........
phonetics	sense	polarity	1930.freq	3y.freq	indegree	----
components	relations	classes	1940.freq	4y.freq	----------------	----

At the moment there are 45k units (ranging from characters to lexical chunks) with over than 140 variables. The scope and size are still evolving, with its concerted and long-term efforts we believe this resource will be valuable for deep processing of natural language processing and intelligent applications.

模組化的詞彙知識

開放集合

Morpho-Syn-Sem module
Affective module
Diachronic module
Stance module
Morphemes (Hanzi) module
Frequency module

模組化的詞彙知識

開放集合

Morpho-Syn-Sem
Affective module
Diachronic module
Stance module
Morphemes (Hanzi)
Frequency module

Possible values

collocates, arguments, familiarity ratings, polarity, length, senses, ontological class, pos, estimated ages, discourse markers, ...........

Affective module as an Example

affect <- read.csv("~/LOPEN/BIGLEXICON/modules/emotion/data/affect.0.csv")
View(affect)
head(str(affect),10)

Summary: `Deep and Big` in what sense

Deep: unlimited dimensions (currently more than 140 indi. variables)
Big: continuous in units (currently more than 45k lexical units)
- non-word morpheme
- lexical chunks: frequent word strings of compositional nature (e.g. don't have to worry)
- formulaic sequences, idioms, and other multi-word expressions.

Current Development

[Tools and APIs] R 的套件 (LexicoR): provides programmable tools for aggregating, exploring, visualizing, and modeling the lexical data, as well as various quantitative measures for working with the data in reproducible manner.
[Annotation] 標記平台
[Market] 資料集散

DeepLEX-based Applications | 應用

Sentiment/Emotion Detection already on the way (Magistry et al. 2015).
You are now empowered to ask DIFFERENT questions.
- 詞彙難易與那些變項有關？
- 情緒詞彙的漢字部首/部件組合的關係？
- 那些變項|組合最大地決定了詞彙年齡 (存活能力) ?

Modeling the Life Stage of Diffusion, Stabilization, Inactivation of words

\[ \min_{\beta_0,\beta} \frac{1}{N}\sum_{i=1}^N w_il(y_i,\beta_0+\beta^Tx_i)+\lambda \left[(1-\alpha) ||\beta||_2^2/2+\alpha||\beta||_1\right] \]

plot of chunk block2

The Future is Now, The Sky is the Limit

Crowd (-sourced) Annotation 眾標
- 期待聰明的解法：無相布施 (annotation)
Make it open before make it right
- 萌典是個好例子。

Conclusion

我們需要知道為什麼。

Re-evaluate the role of linguistics/linguistic annotation in the era of Machine Learning.

跨界合作眾志成城的時代已然來臨。

Crossover is in: collaboration is needed, Public Sharing is crucial. Time to discover and explore the Wisdom of Linguistic Crowds

Developer mailing list

Reference

[Mcnamara et al,2014] Automated evaluation of text and discourse with Coh-Metrix. Cambridge.

[Pennebaker et al. 2001] Linguistic inquiry and word count: LIWC 2001. Mahway: Lawrence Erlbaum Associates, 71.

[Tausczik et al. 2010] The Psychological Meaning of Words: LIWC and Computerized Text Analysis Methods. Journal of Language and Social Psychology, 29(1) 24–54.

[Chang et al. 2015] A psycholinguistic database for traditional Chinese character naming. Behavior Research Methods.

[Sze et al. 2013] The Chinese Lexicon Project: A repository of lexical decision behavioral responses for 2,500 Chinese characters. Behavior Research Methods.

[Magistry et al. 2015] Sentiment Detection in Micro-blogs Using Unsupervised Chunks Extraction. CLSW 2015 (accepted).

Acknowledgement

王伯雅：歷時模組

呂珮瑜：情緒模組

莊茹涵：語用模組

秦睿謙/林欣霓：漢字模組

CWN2 Group：詞彙語意模組

劉純睿，Pierre Magistry，張瑜芸，施孟賢，. . . . 所有 LOPE lab 的成員！

DeepLEX

A Multi-dimensional Chinese Lexical Resource

Motivation

詞庫的故事

詞庫的故事

詞庫的故事

詞庫的故事

大與深 (BIG and DEEP) 可以兼得否？

BIG is easy, DEEP is hard

BIG is easy, DEEP is hard

(Hsieh at al., 2015. to be submitted)

提醒：詞庫和詞典的基本任務不同

DeepLEX: Rationale

- 不同視野下的詞彙行為/知識表徵化

- 任何的努力都應該被記錄

DeepLEX: Assumptions

Modules

模組化的詞彙知識

開放集合

模組化的詞彙知識

開放集合

Possible values

Affective module as an Example

Summary: Deep and Big in what sense

Current Development

DeepLEX-based Applications | 應用

Modeling the Life Stage of Diffusion, Stabilization, Inactivation of words

The Future is Now, The Sky is the Limit

Conclusion

Reference

Acknowledgement

Summary: `Deep and Big` in what sense