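Theme: Does Taylor Swift really make money from heartbreak?
Background:
Taylor Swift, "Fearless" (2008): nearly 8.7 million copies sold
Taylor Swift, "Speak Now" (2010): 5 million copies sold
Hypothesis: Taylor Swift's best-selling album "Fearless" sold nearly 8.7 million copies worldwide, yet the follow-up album "Speak Now" sold only 5 million. Rumor has it that her endless inspiration comes from romance, and from breakups in particular. We therefore want to examine how the lyrics relate to sales, and whether heartbreak and sales are positively correlated.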
Step 1: get fearless_lyrics_file
## Pre-clean the raw file once with Linux tools; the raw file was copied and pasted by hand from the web into a .txt file
#cat fearless_lyrics.txt | tr -d '[:punct:]' | tr '[A-Z]' '[a-z]' | tr -d '[:digit:]' | tr ' ' '\n' > fearless_forR.txt
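# Assumption: the Fearless word list is served next to the Speak Now one, under the
# file name written by the shell command above (fearless_forR.txt).
path <- "http://homepage.ntu.edu.tw/~b04106003/file/fearless_forR.txt"
# header = FALSE so the first word of the file is counted instead of becoming a column name
text <- read.csv(path, header = FALSE, encoding = "UTF-8")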
# The unique words form the vocabulary of the frequency table.
head <- as.vector(unique(text[,1]))
# One row per word: column 1 holds the word, column 2 its count (starting at 0).
result <- matrix(0,length(head),2)
result <- as.data.frame(result)
col_name <- c('word','frequent')
colnames(result) <- col_name
for(i in 1:length(head)){
  result[i,1] <- head[i]
}
# Count occurrences, skipping missing and empty entries.
for(i in 1:nrow(text)){
  if(!(is.na(text[i,1]) || text[i,1]=='')){
    for(ii in 1:nrow(result)){
      if(text[i,1]==result[ii,1]){
        result[ii,2] <- as.numeric(result[ii,2]) + 1
      }
    }
  }
}
# Reorder the rows by descending count.
order_index <- rev(order(result[,2]))
frequent <- as.data.frame(result)
for(i in 1:length(order_index)){
  row_index <- order_index[i]
  frequent[i,2] = result[row_index,2]
  frequent[i,1] = result[row_index,1]
}
write.csv(frequent,'fearless.csv')
f_freq <- frequent
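The nested loops above compare every word in the file against every vocabulary entry. As a minimal sketch of the same counting step, base R's `table()` and `sort()` can build the frequency table in one pass. The input vector and the name `freq_demo` are illustrative assumptions, not part of the original analysis.

```r
# Toy input standing in for the one-word-per-line lyrics file (hypothetical data).
words <- c("you", "i", "", "you", "love", NA, "i", "you")

# Drop missing and blank entries, mirroring the NA / empty-string check in the loop.
words <- words[!is.na(words) & words != ""]

# table() counts each distinct word; sort() puts the most frequent first.
counts <- sort(table(words), decreasing = TRUE)

# Same two-column layout as the `frequent` data frame built above.
freq_demo <- data.frame(
  word = names(counts),
  frequent = as.integer(counts),
  stringsAsFactors = FALSE
)
print(freq_demo)
```

On the real data, `words` would be `text[,1]`. Ties may be ordered differently than with `rev(order(...))`, so the two approaches are only guaranteed to agree on the counts themselves.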
Step 2: get speaknow_lyrics_file
## Pre-clean the raw file once with Linux tools; the raw file was copied and pasted by hand from the web into a .txt file
#cat speaknow_lyrics.txt | tr -d '[:punct:]' | tr '[A-Z]' '[a-z]' | tr -d '[:digit:]' | tr ' ' '\n' > speaknow_forR.txt
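path <- "http://homepage.ntu.edu.tw/~b04106003/file/speaknow_forR.txt"
# header = FALSE so the first word of the file is counted instead of becoming a column name
text <- read.csv(path, header = FALSE, encoding = "UTF-8")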
# The unique words form the vocabulary of the frequency table.
head <- as.vector(unique(text[,1]))
# One row per word: column 1 holds the word, column 2 its count (starting at 0).
result <- matrix(0,length(head),2)
result <- as.data.frame(result)
col_name <- c('word','frequent')
colnames(result) <- col_name
for(i in 1:length(head)){
  result[i,1] <- head[i]
}
# Count occurrences, skipping missing and empty entries.
for(i in 1:nrow(text)){
  if(!(is.na(text[i,1]) || text[i,1]=='')){
    for(ii in 1:nrow(result)){
      if(text[i,1]==result[ii,1]){
        result[ii,2] <- as.numeric(result[ii,2]) + 1
      }
    }
  }
}
# Reorder the rows by descending count.
order_index <- rev(order(result[,2]))
frequent <- as.data.frame(result)
for(i in 1:length(order_index)){
  row_index <- order_index[i]
  frequent[i,2] = result[row_index,2]
  frequent[i,1] = result[row_index,1]
}
write.csv(frequent,'speaknow.csv')
s_freq <- frequent
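Both steps rely on the shell pipeline in the comments to strip punctuation, lower-case the text, remove digits, and put one word on each line. A rough R equivalent is sketched below for readers without a Linux shell. The helper name `normalise_lyrics` and the sample lines are assumptions made for illustration. Unlike the shell pipeline, this sketch also drops the empty tokens left by repeated spaces.

```r
# Hypothetical helper: turn raw lyric lines into a vector of lower-case words.
normalise_lyrics <- function(lines) {
  txt <- gsub("[[:punct:]]", "", lines)               # like tr -d '[:punct:]'
  txt <- tolower(txt)                                 # like tr '[A-Z]' '[a-z]'
  txt <- gsub("[[:digit:]]", "", txt)                 # like tr -d '[:digit:]'
  words <- unlist(strsplit(txt, " ", fixed = TRUE))   # like tr ' ' '\n'
  words[words != ""]                                  # drop empty tokens
}

# Made-up sample lines, not taken from either album.
demo <- normalise_lyrics(c("You're the 1 for me,", "I'm  fearless!"))
print(demo)
```

Removing the apostrophe is what turns "you're" into "youre" and "I'm" into "im", which matters when pronoun forms are merged later.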
Step 3: list the 10 most frequent words in the "Fearless" word list, then look for a possible explanation
library(data.table)
head(f_freq,10)
##    word frequent
## 1   you      352
## 2     i      253
## 3   the      248
## 4   and      187
## 5    to      121
## 6    me      111
## 7     a       98
## 8  your       92
## 9    on       82
## 10   it       81
f_dt <- as.data.table(f_freq)
barplot(
  head(f_dt[,frequent],10),
  names.arg = head(f_dt[,word],10),
  xlab = "top 10 words",
  ylab = "frequency",
  main = "Top 10 words with the highest frequency in Fearless",
  las = 1
)
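The top-10 table lists "you", "i" and "me" as separate rows, so comparing "I" against "you" requires summing variant forms first. The following is a minimal sketch of that merge. The group definitions, the group names, and the toy counts are illustrative assumptions rather than the exact procedure used for the figures in this report.

```r
# Hypothetical grouping of variant forms; names and membership are illustrative.
groups <- list(
  first_person  = c("i", "me", "im"),
  second_person = c("you", "youre")
)

# Toy frequency table (made-up counts) in the same layout as f_freq.
toy <- data.frame(
  word = c("you", "i", "me", "youre", "im", "the"),
  frequent = c(5, 4, 3, 2, 1, 9),
  stringsAsFactors = FALSE
)

# Sum the counts of every word that belongs to each group.
merged <- sapply(groups, function(forms) sum(toy$frequent[toy$word %in% forms]))
print(merged)
```

Passing `f_freq` or `s_freq` in place of `toy` would give the merged counts for a whole album.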
Step 4: list the 10 most frequent words in the "Speak Now" word list, then look for a possible explanation
head(s_freq,10)
##    word frequent
## 1   you      352
## 2     i      253
## 3   the      248
## 4   and      187
## 5    to      121
## 6    me      111
## 7     a       98
## 8  your       92
## 9    on       82
## 10   it       81
s_dt <- as.data.table(s_freq)
barplot(
  head(s_dt[,frequent],10),
  names.arg = head(s_dt[,word],10),
  xlab = "top 10 words",
  ylab = "frequency",
  main = "Top 10 words with the highest frequency in Speak Now",
  las = 1
)
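The two albums differ in length, so raw counts are hard to compare directly. One possible way to put both word lists side by side is to convert counts to occurrences per 1,000 words and join the tables on the word. The sketch below uses made-up toy tables, and the names `per_thousand`, `f_demo`, `s_demo` and `both` are assumptions introduced for illustration.

```r
# Toy frequency tables (hypothetical counts, not the real album data).
f_demo <- data.frame(word = c("you", "i", "love"), frequent = c(6, 3, 1),
                     stringsAsFactors = FALSE)
s_demo <- data.frame(word = c("you", "i", "mean"), frequent = c(10, 8, 2),
                     stringsAsFactors = FALSE)

# Convert raw counts to occurrences per 1,000 words so album length cancels out.
per_thousand <- function(freq) {
  freq$rate <- 1000 * freq$frequent / sum(freq$frequent)
  freq[, c("word", "rate")]
}

# Full outer join keeps words that occur on only one album (rate is NA on the other).
both <- merge(per_thousand(f_demo), per_thousand(s_demo),
              by = "word", all = TRUE, suffixes = c("_fearless", "_speaknow"))
print(both)
```

With the real tables, `f_freq` and `s_freq` would replace the toy data frames.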
Result and Conclusion:
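Looking at the top ten entries of each album's word-frequency table, function words take up a large share, as expected, but the top two spots go to the second-person "you" and the first-person "I". Because "you" has the same form as subject and object, and "you are" is often contracted to "youre", we merged these three forms, and likewise merged "me" and "im" with "I". After merging, first-person references outnumber second-person ones on both albums, but the ratio on "Fearless" (I:you = 324:281, about 1.15) is clearly higher than on "Speak Now" (I:you = 410:403, about 1.02). So perhaps the more "self-centered" the lyrics, the more fans relate to them, and the better the sales?
The word "love", which we expected to appear very often, let us down. Among the 6,706 words of "Speak Now", "love" appears only 30 times ("love" = 24, "loved" = 6); among the 4,363 words of "Fearless", it appears only 26 times ("love" = 16, "loved" = 10). Comparing the proportions, however, the shorter "Fearless" shows more "love" ("Fearless" = 0.6%, "Speak Now" = 0.45%). So we can tell Taylor that writing fewer words is fine, as long as she uses "love" a few more times.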
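The ratios and percentages quoted in the conclusion can be rechecked with a few lines of arithmetic. The sketch below uses only the counts stated above. The data frame and column names are assumptions chosen for readability.

```r
# Merged pronoun counts quoted in the conclusion.
pronouns <- data.frame(
  album = c("Fearless", "Speak Now"),
  first_person = c(324, 410),
  second_person = c(281, 403)
)
# First-person to second-person ratio, rounded to two decimals.
pronouns$ratio <- round(pronouns$first_person / pronouns$second_person, 2)

# "love" + "loved" counts and total word counts quoted in the conclusion.
love <- data.frame(
  album = c("Fearless", "Speak Now"),
  love_count = c(26, 30),
  total_words = c(4363, 6706)
)
# Share of all words that are "love"/"loved", in percent.
love$share_pct <- round(100 * love$love_count / love$total_words, 2)

print(pronouns)
print(love)
```

The ratios come out at 1.15 and 1.02, and the "love" shares at 0.60% and 0.45%, matching the figures in the text.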