Analysis 2: Sentiment Analysis of Harry Potter

I will perform a text analysis of the sentiment of the words in the Harry Potter series.

Code
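library(tidyverse)  # read_csv, dplyr verbs, ggplot2
library(tidytext)   # unnest_tokens, stop_words, sentiment lexicons
library(plotly)     # ggplotly
potter_tbl <- read_csv("data/harry-potter.csv")
potter_tokens <- unnest_tokens(tbl = potter_tbl, input = text, output = word)  # one word per row
potter_clean <- anti_join(potter_tokens, stop_words, by = "word")  # drop common stop words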

Dictionary Analysis

Sentiments of the Top Words (via bing lexicon)

Code
# Request the bing lexicon explicitly so the join matches the section title
potter_sentim = inner_join(potter_clean, get_sentiments("bing"), by= "word")
top_words <- potter_sentim |> count(word, sentiment, name = "count", sort = TRUE)|>
    slice_head(n=20) |>
    ggplot(aes(x=count, y= reorder(word, count), fill = sentiment, text = paste("Word:", word, "<br>Count:", count, "<br>Sentiment:", sentiment))) + geom_col() + labs(y= "words")
ggplotly(top_words, tooltip = "text")

This graphic displays the most frequent words with an associated sentiment across all the Harry Potter books. These words had a substantial effect on the overall sentiment of the series. Here we see words like dark and death contributing heavily to its negative sentiment. Notice that magic also appears, counted as positive. In the context of Harry Potter, however, the word can be part of a phrase like ‘dark magic’, which makes it negative (but we’ve already accounted for that via the count of ‘dark’), or it can stand on its own as a simple descriptor of a common action in the wizarding world, which makes it neutral. For that reason we will filter it out in the next graphic.
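The point about ‘dark magic’ can be checked rather than assumed. The following is a minimal sketch, assuming the tokens are still in reading order (which is how unnest_tokens leaves them); `toy_tokens` is a made-up stand-in for `potter_tokens`, not real data from the books.

```r
library(dplyr)

# Hypothetical stand-in for potter_tokens: one word per row, in reading order.
toy_tokens <- tibble(
  word = c("he", "feared", "dark", "magic", "but", "loved", "magic",
           "and", "studied", "dark", "magic")
)

# Pair each word with the word before it, then count what precedes "magic".
magic_context <- toy_tokens |>
  mutate(previous = lag(word)) |>
  filter(word == "magic") |>
  count(previous, name = "count", sort = TRUE)

magic_context
```

Running the same pipeline on `potter_tokens` (before stop words are removed, so neighbouring words stay adjacent) would show how much of the count for magic comes from ‘dark magic’.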

Sentiments of the Top Words filtered (via bing lexicon)

As mentioned in the analysis of the previous graphic, we filter out the word magic here: in the context of the wizarding world it is closer to neutral and tells us little about the sentiment of the text.

Code
# Request the bing lexicon explicitly so the join matches the section title
potter_sentim = inner_join(potter_clean, get_sentiments("bing"), by= "word")
sentim_plot <- potter_sentim |> count(word, sentiment, name = "count", sort = TRUE)|>
    filter(word != "magic") |>
    slice_head(n=20) |>
    ggplot(aes(x=count, y= reorder(word, count), fill = sentiment, text = paste("Word:", word, "<br>Count:", count, "<br>Sentiment:", sentiment))) + geom_col() + labs(y= "words")
ggplotly(sentim_plot, tooltip = "text")

Here we get a closer view of the sentiments of the most common words, without the word magic.

Sentiments of the Top Words (via afinn lexicon)

Code
afinn <- read_csv("data/afinn.csv")
potter_sentim = inner_join(potter_clean, afinn, by= "word")
afinn_plot <- potter_sentim |> count(word, value, name = "count", sort = TRUE) |>
    slice_head(n=20) |>
    ggplot(aes(x=count, y= reorder(word, count), fill = value, text = paste("Word:", word, "<br>Count:", count, "<br>Sentiment Value:", value))) + geom_col() + labs(y= "words")
ggplotly(afinn_plot, tooltip = "text")

Here we see that the afinn dictionary attaches sentiments to some different words, such as ‘yeah’. Additionally, afinn goes beyond bing’s binary split of positive versus negative: it gives each word a value from -5 to 5, with -5 being the most negative and 5 being the most positive. Thus we see an array of blues, where the darkest blue indicates the most negative value and the lightest blue indicates the most positive value.

Contribution Analysis

Since the bing lexicon only labels words categorically as negative or positive, we use the afinn lexicon for a numerical contribution analysis. (As noted above, afinn covers some different words and attaches a value to each word from -5 to 5, with -5 being the most negative and 5 being the most positive.) A word’s contribution is its afinn value multiplied by the number of times it appears.

Code
afinn <- read_csv("data/afinn.csv")
potter_sentim = inner_join(potter_clean, afinn, by= "word")
contrib_plot <- potter_sentim |> count(word, value, name = "count", sort = TRUE)|> 
    mutate(contribution = value * count) |>
    filter(word != "magic") |>
    slice_max(abs(contribution), n=20) |>
    ggplot(aes(x=contribution, y= reorder(word, contribution), fill = value, text = paste("Word:", word, "<br>Count:", count, "<br>Contribution:", contribution))) + geom_col() + labs(y= "words")
ggplotly(contrib_plot, tooltip = "text")

The above graphic showcases the 20 words with the largest absolute contribution to the sentiment of the book series. We see that more of the words with a large absolute contribution to the overall sentiment of the series are negative (as shown by the darker bars).
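To make the contribution calculation concrete, here is a small worked sketch. The words, values, and counts in `toy_counts` are made up for illustration; they are not taken from afinn or from the books.

```r
library(dplyr)

# Hypothetical counts and afinn-style values (illustrative only).
toy_counts <- tibble(
  word  = c("dark", "death", "good", "love"),
  value = c(-2, -3, 3, 3),
  count = c(100, 50, 40, 10)
)

# contribution = value * count, then keep the 3 largest in absolute size.
toy_contrib <- toy_counts |>
  mutate(contribution = value * count) |>
  slice_max(abs(contribution), n = 3)

toy_contrib
```

In this toy data a frequent, mildly negative word (-2 × 100 = -200) outweighs a rare, strongly positive one (3 × 10 = 30), which is why contribution can rank words differently from raw count or raw value.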

Sentiment Trajectory

Code
potter_clean = anti_join(potter_tokens, stop_words, by= "word")
# Count each word within each chapter of each book
potter_chap_counts = potter_clean |> count(word, book, chapter, name = "count", sort = TRUE)

# Weight each word's afinn value by how often it occurs in the chapter
potter_chap_sentim = inner_join(potter_chap_counts, afinn, by= "word") |> mutate(contribution = value * count)

# Sum the contributions to get one score per chapter
chapter_scores = potter_chap_sentim |>
  group_by(book, chapter) |>
  summarize(score = sum(contribution), .groups = "drop")

Book 1 Sentiment Trajectory

Code
book_trajectory <- chapter_scores |> filter(book == "Book 1: Philosopher's Stone") |>   mutate(sentiment = case_when(
    score >= 0 ~ "positive",
    score < 0 ~ "negative")) |> ggplot(aes(x = chapter, y = score, fill = sentiment, text = paste("Chapter:", chapter, "<br>Score:", score, "<br>Sentiment:", sentiment))) +
  geom_col() +
  labs(title = "Sentiment score for each chapter in 'Book 1: Philosopher's Stone'") +
  theme_bw() + theme(axis.text.x = element_text(angle = 45, hjust = 1))
ggplotly(book_trajectory, tooltip = "text")

Here you can see how the sentiment varies across the chapters of the first book, Philosopher’s Stone. In this graphic we see the overwhelmingly negative scores of chapter 15 and chapter 4, and also that the book is negative overall, since most bars are negative, as indicated by their color.
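The chapter score behind each bar is the sum of the word contributions in that chapter. The sketch below walks through that aggregation on made-up data; `toy_chapter_words` and its values are hypothetical, and the `mean_value` column is an added assumption rather than part of the analysis above.

```r
library(dplyr)

# Hypothetical per-chapter word counts with afinn-style values.
toy_chapter_words <- tibble(
  chapter = c(1, 1, 1, 2, 2),
  word    = c("dark", "good", "death", "love", "happy"),
  value   = c(-2, 3, -3, 3, 3),
  count   = c(4, 1, 2, 3, 1)
)

toy_scores <- toy_chapter_words |>
  mutate(contribution = value * count) |>
  group_by(chapter) |>
  summarize(
    score      = sum(contribution),              # what the bars show
    mean_value = sum(contribution) / sum(count), # score per sentiment word
    .groups    = "drop"
  ) |>
  mutate(sentiment = if_else(score >= 0, "positive", "negative"))

toy_scores
```

Because the score is a plain sum, a longer chapter can reach a more extreme score simply by containing more sentiment words. Dividing by the number of matched words, as `mean_value` does here, is one possible way to compare chapters of different lengths.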

Series Trajectory

Code
series_trajectory <- chapter_scores |>
  mutate(
    sentiment = case_when(
      score >= 0 ~ "positive",
      score < 0 ~ "negative"
    )
  ) |>
  ggplot(aes(
    x = chapter,
    y = score,
    fill = sentiment,
    text = paste(
      "Book:", book, "<br>Chapter:", chapter, "<br>Score:", score, "<br>Sentiment:", sentiment ))) +
  geom_col() +
  facet_wrap(~ book, scales = "free_x", ncol = 2) +
  labs(
    title = "Sentiment Trajectory Across the Harry Potter Series",
    x = "Chapter",
    y = "Sentiment Score"
  ) +
  theme_bw() + theme(panel.spacing.y = unit(2, "lines"))

ggplotly(series_trajectory, tooltip = "text")

Through the faceted plot we can see that across the books the chapter sentiments tend to be mainly negative. This makes sense, as the series covers a lot of heavy topics, including the abuse Harry initially faces when he lives with his relatives, the deaths of characters, and the antagonism he faces from characters like Draco Malfoy. Deathly Hallows appears to have the most negative chapter sentiments, as its bars come closer to -300. This also makes sense, since it covers much of the final fight against Voldemort, in which there were casualties: it includes the deaths of Snape, Mad-Eye Moody, Fred Weasley, Dobby, Voldemort, Bellatrix, and others.
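Reading which book is most negative off the bars is approximate, so a per-book summary can back up the visual impression. This is a minimal sketch on made-up data: `toy_chapter_scores` is a hypothetical stand-in with the same columns as `chapter_scores`, and the book names and scores are not real results.

```r
library(dplyr)

# Hypothetical stand-in for chapter_scores (book, chapter, score).
toy_chapter_scores <- tibble(
  book    = c("Book A", "Book A", "Book B", "Book B", "Book B"),
  chapter = c(1, 2, 1, 2, 3),
  score   = c(-50, 20, -300, -120, 10)
)

# Summarize each book, most negative on average first.
book_summary <- toy_chapter_scores |>
  group_by(book) |>
  summarize(
    mean_score     = mean(score),      # average chapter score
    min_score      = min(score),       # most negative chapter
    share_negative = mean(score < 0),  # fraction of negative chapters
    .groups        = "drop"
  ) |>
  arrange(mean_score)

book_summary
```

Applied to `chapter_scores`, the first row would name the book with the lowest average chapter score, which can then be compared with the reading of the faceted plot.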