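Theme: Does Taylor Swift really make money from heartbreak!?

Background:
Taylor Swift, 2008 album “Fearless”: nearly 8.7 million copies sold
Taylor Swift, 2010 album “Speak Now”: 5 million copies sold

Hypothesis: Taylor Swift's best-selling album, “Fearless”, sold nearly 8.7 million copies worldwide, yet its follow-up, “Speak Now”, sold only 5 million. Rumor has it that her endless inspiration comes from romance, and from heartbreak in particular. We therefore want to examine how the lyrics relate to sales, and whether heartbreak is positively correlated with sales.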

Step 1: get fearless_lyrics_file

## First clean the raw file once with Linux tools; the raw file was copied and pasted by hand from the web into a .txt file
#cat fearless_lyrics.txt | tr -d '[:punct:]' | tr '[A-Z]' '[a-z]' | tr -d '[:digit:]' | tr ' ' '\n' > fearless_forR.txt

# Assumed location: the Fearless file is hosted next to speaknow_forR.txt
path <- "http://homepage.ntu.edu.tw/~b04106003/file/fearless_forR.txt"
# The cleaned file has one word per line and no header row
text <- read.csv(path, encoding = "UTF-8", header = FALSE, stringsAsFactors = FALSE)
words <- text[, 1]
# Drop empty and missing tokens before counting
words <- words[!is.na(words) & words != ""]
# Count each distinct word and sort from most to least frequent
counts <- sort(table(words), decreasing = TRUE)
frequent <- data.frame(
    word = names(counts),
    frequent = as.integer(counts),
    stringsAsFactors = FALSE
)
write.csv(frequent, 'fearless.csv')
f_freq <- frequent
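
The shell cleanup at the top of this step could also be done inside R. The following is only a minimal sketch, assuming the raw lyrics are already in a character vector; `clean_lyrics` is a hypothetical helper, and the sample line is made up for illustration rather than taken from the lyrics files.

```r
# Hypothetical helper mirroring the tr pipeline:
# strip punctuation, lowercase, strip digits, then split into words.
clean_lyrics <- function(lines) {
  lines <- gsub("[[:punct:]]", "", lines)              # tr -d '[:punct:]'
  lines <- tolower(lines)                              # tr '[A-Z]' '[a-z]'
  lines <- gsub("[[:digit:]]", "", lines)              # tr -d '[:digit:]'
  words <- unlist(strsplit(lines, " ", fixed = TRUE))  # tr ' ' '\n'
  words[words != ""]                                   # drop empty tokens
}

# Made-up sample line, not taken from the lyrics files
sample_words <- clean_lyrics("You're the 1, and I'm here!")
print(sample_words)
```

Because the apostrophe is removed along with the other punctuation, contractions such as "You're" end up as single tokens like "youre", which is why those forms show up in the frequency tables.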

Step 2: get speaknow_lyrics_file

## First clean the raw file once with Linux tools; the raw file was copied and pasted by hand from the web into a .txt file
#cat speaknow_lyrics.txt | tr -d '[:punct:]' | tr '[A-Z]' '[a-z]' | tr -d '[:digit:]' | tr ' ' '\n' > speaknow_forR.txt
path <- "http://homepage.ntu.edu.tw/~b04106003/file/speaknow_forR.txt"
# The cleaned file has one word per line and no header row
text <- read.csv(path, encoding = "UTF-8", header = FALSE, stringsAsFactors = FALSE)
words <- text[, 1]
# Drop empty and missing tokens before counting
words <- words[!is.na(words) & words != ""]
# Count each distinct word and sort from most to least frequent
counts <- sort(table(words), decreasing = TRUE)
frequent <- data.frame(
    word = names(counts),
    frequent = as.integer(counts),
    stringsAsFactors = FALSE
)
write.csv(frequent, 'speaknow.csv')
s_freq <- frequent

Step 3: list the 10 most frequent words in “Fearless” and look for a possible explanation

library(data.table)
head(f_freq,10)
##    word frequent
## 1   you      352
## 2     i      253
## 3   the      248
## 4   and      187
## 5    to      121
## 6    me      111
## 7     a       98
## 8  your       92
## 9    on       82
## 10   it       81
f_dt <- as.data.table(f_freq)
barplot(
  head(f_dt[,frequent],10),
  names.arg = head(f_dt[,word],10),
  xlab = "top 10 words",
  ylab = "frequency",
  main = "Top 10 words with the highest frequency in Fearless",
  las = 1
)

Step 4: list the 10 most frequent words in “Speak Now” and look for a possible explanation

head(s_freq,10)
##    word frequent
## 1   you      352
## 2     i      253
## 3   the      248
## 4   and      187
## 5    to      121
## 6    me      111
## 7     a       98
## 8  your       92
## 9    on       82
## 10   it       81
s_dt <- as.data.table(s_freq)
barplot(
  head(s_dt[,frequent],10),
  names.arg = head(s_dt[,word],10),
  xlab = "top 10 words",
  ylab = "frequency",
  main = "Top 10 words with the highest frequency in Speak Now",
  las = 1
)
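
Since the two albums differ in length, raw counts are hard to compare directly. The sketch below shows one possible way to put the two tables side by side as shares of each album's total word count. The helper `add_share` and the toy tables `toy_f` and `toy_s` are hypothetical, with invented counts; with the real data one would pass `f_freq` and `s_freq` instead.

```r
# Toy frequency tables in the same shape as f_freq and s_freq
# (columns: word, frequent). The counts are invented for illustration.
toy_f <- data.frame(word = c("you", "i", "love"), frequent = c(6, 3, 1),
                    stringsAsFactors = FALSE)
toy_s <- data.frame(word = c("you", "i", "the"), frequent = c(10, 8, 2),
                    stringsAsFactors = FALSE)

# Hypothetical helper: add each word's share of the album's total word count
add_share <- function(freq) {
  freq$share <- freq$frequent / sum(freq$frequent)
  freq
}

# Full outer join on word, so words missing from one album are kept
both <- merge(add_share(toy_f), add_share(toy_s), by = "word",
              all = TRUE, suffixes = c("_fearless", "_speaknow"))
both[is.na(both)] <- 0  # a word absent from an album gets count and share 0
print(both)
```

Sorting `both` by the difference between the two share columns would then surface the words that most distinguish one album from the other.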

Result and Conclusion:

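Looking at the top ten entries of the two albums' word-frequency tables, function words take up a sizable share, as expected, but the top two places go to the second-person “you” and the first-person “I”. Because “you” has the same form as subject and object, and “you are” is often contracted to “youre” (apostrophes were stripped during cleanup), we merged these forms, and likewise merged “me” and “im” with “I”. After merging, the first person outnumbers the second person on both albums, but the ratio on “Fearless” (I:you = 324:281) is clearly higher than on “Speak Now” (I:you = 410:403). So perhaps the more self-centered the lyrics, the more fans relate to them, and the better the album sells?

In addition, “love”, which we expected to appear very often, let us down. Among the 6,706 words of “Speak Now”, “love” appears only 30 times (“love” = 24, “loved” = 6), and among the 4,363 words of “Fearless” it appears only 26 times (“love” = 16, “loved” = 10). Comparing the proportions, however, the shorter “Fearless” shows more “love” (“Fearless” = 0.6%, “Speak Now” = 0.45%). So we can tell Taylor that it is fine to write fewer words, as long as a few more of them are “love”.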
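
The ratios and percentages quoted in the conclusion can be checked with a few lines of R. This is a sketch that only reuses the figures stated above. The helper `sum_forms` is a hypothetical illustration of how merged counts could be obtained from a frequency table, and the small `toy` table is invented.

```r
# Figures quoted in the conclusion
pronouns <- data.frame(
  album = c("Fearless", "Speak Now"),
  first_person = c(324, 410),   # "i" + "me" + "im"
  second_person = c(281, 403)   # "you" + "youre"
)
pronouns$ratio <- round(pronouns$first_person / pronouns$second_person, 2)

love <- data.frame(
  album = c("Fearless", "Speak Now"),
  love_count = c(26, 30),       # "love" + "loved"
  total_words = c(4363, 6706)
)
love$percent <- round(100 * love$love_count / love$total_words, 2)

print(pronouns)
print(love)

# Hypothetical helper: total count of several word forms in a frequency table
sum_forms <- function(freq, forms) sum(freq$frequent[freq$word %in% forms])

# Invented toy table in the same shape as f_freq / s_freq
toy <- data.frame(word = c("i", "me", "im", "you"), frequent = c(5, 2, 1, 4))
toy_first_person <- sum_forms(toy, c("i", "me", "im"))
```

The I:you ratio comes out at about 1.15 for “Fearless” and about 1.02 for “Speak Now”, and the share of “love” at about 0.60% and 0.45% respectively, which agrees with the rounded values in the text.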