Analysis 1: Frequency Analysis of Harry Potter

I will analyze word frequencies in the Harry Potter series. This helps show which characters and ideas the series emphasizes most.

Which books are in the series? The tibble below lists their titles:

Code
library(tidyverse)  # read_csv(), dplyr verbs, ggplot2, stringr
potter_tbl <- read_csv("data/harry-potter.csv")
potter_tbl |> distinct(book)
# A tibble: 7 × 1
  book                        
  <chr>                       
1 Book 1: Philosopher's Stone 
2 Book 2: Chamber of Secrets  
3 Book 3: Prisoner of Azkaban 
4 Book 4: Goblet of Fire      
5 Book 5: Order of the Phoenix
6 Book 6: Half Blood Prince   
7 Book 7: Deathly Hallows     
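
The plotting chunks in this analysis use `potter_counts` and `potter_clean`, whose construction is not shown. The sketch below is one way such tables might be built, assuming the `text` and `book` columns that the later n-gram code relies on. The `toy_*` names and the sample sentences are made up for illustration and are not the author's data or code.

```r
library(dplyr)
library(tibble)
library(tidytext)

# Toy stand-in for potter_tbl (assumed columns: book, text)
toy_tbl <- tibble(
  book = c("Book A", "Book A", "Book B"),
  text = c("Harry looked at Ron", "Ron looked at Harry", "Harry smiled")
)

# One row per word (unnest_tokens() lower-cases by default), stop words removed
toy_clean <- toy_tbl |>
  unnest_tokens(output = word, input = text) |>
  anti_join(stop_words, by = "word")

# Overall word counts, most frequent first
toy_counts <- toy_clean |>
  count(word, name = "count", sort = TRUE)

toy_counts
```

Replacing `toy_tbl` with `potter_tbl` would give tables with the same shape as the ones the plots expect, though the original preprocessing may differ.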

Top 10 most common words from all the books

Code
library(plotly)  # ggplotly()
common_plot <- potter_counts |>
  slice(1:10) |>
  ggplot(aes(x = count, y = reorder(word, count),
             text = paste("Word:", word, "<br>Count:", count))) +
  geom_col(fill = "purple") +
  labs(y = "word")
ggplotly(common_plot, tooltip = "text")

From this plot, we can see that Harry is emphasized far more than his two friends, Ron and Hermione. This makes sense: Harry is the main character of the books, whereas his friends act as supporting leads. The word "looked" also appears frequently, likely because the books are so visual in their descriptions.

Top 10 most common words from book 1

Code
book_1_counts <- potter_clean |>
  filter(book == "Book 1: Philosopher's Stone") |>
  count(word, name = "count", sort = TRUE)
book_1_plot <- book_1_counts |>
  slice(1:10) |>
  ggplot(aes(x = count, y = reorder(word, count), text = paste("Word:", word, "<br>Count:", count))) +
  geom_col(fill = "purple") + labs(x = "count", y = "word")

ggplotly(book_1_plot, tooltip = "text")

Here you see that book 1 places more emphasis on Hagrid: he is mentioned 336 times in book 1, more than Hermione, who is mentioned only 257 times despite being a main lead.
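
To read exact counts for specific characters without hovering over the plot, a small lookup can help. This is a minimal sketch: `word_counts_in_book()` is a hypothetical helper, and it assumes a token table with `book` and `word` columns like `potter_clean`. The toy rows are invented and do not reproduce the counts quoted above.

```r
library(dplyr)
library(tibble)

# Hypothetical helper: counts for chosen words within a single book.
# `tokens` is assumed to have `book` and `word` columns.
word_counts_in_book <- function(tokens, book_name, words) {
  tokens |>
    filter(book == book_name, word %in% words) |>
    count(word, name = "count", sort = TRUE)
}

# Toy tokens for illustration only
toy_tokens <- tibble(
  book = c("Book A", "Book A", "Book A", "Book B"),
  word = c("hagrid", "hagrid", "hermione", "hagrid")
)

res <- word_counts_in_book(toy_tokens, "Book A", c("hagrid", "hermione"))
res
```

With the real data the call would look like `word_counts_in_book(potter_clean, "Book 1: Philosopher's Stone", c("hagrid", "hermione"))`, assuming the words were lower-cased during tokenization.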

N-gram Analysis

Code
library(tidytext)  # unnest_tokens(), stop_words
potter_bigram <- unnest_tokens(tbl = potter_tbl, input = text, output = bigram, token = "ngrams", n = 2)
# remove stop words: split each bigram into its two words and drop it if either is a stop word
potter_bigram2 <- potter_bigram |>
  mutate(word1 = str_extract(bigram, "^\\w+"),
         word2 = str_extract(bigram, "\\w+$")) |>
  filter(!(word1 %in% stop_words$word), !(word2 %in% stop_words$word))
# count each remaining bigram, most frequent first
potter_bigram2 <- potter_bigram2 |> count(bigram, word1, word2, name = "count", sort = TRUE)
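
One caveat with the `^\\w+` and `\\w+$` patterns used above: `\\w` does not match an apostrophe, so a contraction such as "didn't" is cut down to "didn" before the stop-word lookup and may slip through the filter. The sketch below compares that approach with splitting on the space using `tidyr::separate()`. The short `stops` vector is a stand-in for `stop_words$word` to keep the example self-contained, and the toy bigrams are invented.

```r
library(dplyr)
library(tidyr)
library(stringr)
library(tibble)

stops <- c("didn't", "the", "of")  # illustrative stand-in for stop_words$word

toy <- tibble(bigram = c("didn't harry", "professor mcgonagall", "of the"))

# Regex split, as in the chunk above: "didn't" becomes "didn" and is kept
regex_split <- toy |>
  mutate(word1 = str_extract(bigram, "^\\w+"),
         word2 = str_extract(bigram, "\\w+$")) |>
  filter(!(word1 %in% stops), !(word2 %in% stops))

# Split on the space instead: "didn't" stays whole and is filtered out
space_split <- toy |>
  separate(bigram, into = c("word1", "word2"), sep = " ", remove = FALSE) |>
  filter(!(word1 %in% stops), !(word2 %in% stops))

regex_split$bigram
space_split$bigram
```

Whether this changes the top-10 ranking depends on how often contractions appear next to content words in the real text, so the two versions are worth comparing on the full data.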

Top 10 most common bigrams across all books

Code
ngram_plot <- potter_bigram2 |>
  slice(1:10) |>
  ggplot(aes(x = count, y = reorder(bigram, count),
             text = paste("Bigram:", bigram, "<br>Count:", count))) +
  geom_col(fill = "purple") +
  labs(y = "Bigram")
ggplotly(ngram_plot, tooltip = "text")

Here we see that the titled names professor mcgonagall and uncle vernon appear more frequently than the bigram harry potter. This makes sense because the books are told from Harry's perspective, so in conversation and in his own thoughts he typically refers to adults by their titles (professor, uncle, aunt, etc.).
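
The point about titles can be checked more directly by totalling the counts of bigrams whose first word is a title. This is a rough sketch: the `titles` vector is an illustrative choice, and `toy_bigrams` is an invented stand-in with the same columns as `potter_bigram2`.

```r
library(dplyr)
library(tibble)

# Toy stand-in with the columns bigram, word1, word2, count (values are made up)
toy_bigrams <- tibble(
  bigram = c("professor mcgonagall", "uncle vernon", "harry potter", "professor snape"),
  word1  = c("professor", "uncle", "harry", "professor"),
  word2  = c("mcgonagall", "vernon", "potter", "snape"),
  count  = c(5L, 4L, 3L, 2L)
)

titles <- c("professor", "uncle", "aunt", "madam")  # illustrative list of titles

# Total occurrences and number of distinct bigrams per title
title_totals <- toy_bigrams |>
  filter(word1 %in% titles) |>
  group_by(word1) |>
  summarise(total = sum(count), n_bigrams = n()) |>
  arrange(desc(total))

title_totals
```

Running the same pipeline on `potter_bigram2` would show how much of the bigram ranking is driven by titled names overall rather than by any single character.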

Wordcloud of 20 most common bigrams for book 1

Code
library(wordcloud)

# recompute the bigram counts, this time keeping the book column
potter_bigram <- unnest_tokens(tbl = potter_tbl, input = text, output = bigram, token = "ngrams", n = 2)
# remove stop words
potter_bigram2 <- potter_bigram |>
  mutate(word1 = str_extract(bigram, "^\\w+"),
         word2 = str_extract(bigram, "\\w+$")) |>
  filter(!(word1 %in% stop_words$word), !(word2 %in% stop_words$word)) |>
  count(bigram, book, word1, word2, name = "count", sort = TRUE)

book_1_bigrams <- potter_bigram2 |>
  filter(book == "Book 1: Philosopher's Stone") |>
  arrange(desc(count))

wordcloud(
  words = book_1_bigrams$bigram,
  freq = book_1_bigrams$count, 
  max.words = 20, 
  scale = c(1.5, 0.5),
  colors = c("#A053A1", "#DB778F", "#09A39A", "#5869C7"))

This word cloud is another way to visualize the 20 most frequent bigrams in book 1. However, some words can blend together, making it a bit difficult to read fully. The bigrams uncle vernon and professor mcgonagall stand out, as shown by their larger size and different color compared to most of the others. Uncle Vernon is prominent because Harry lives with him at this point, and Professor McGonagall is the head of Gryffindor house, which Harry joins in book 1, so he interacts with her frequently.
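
If overlapping or rotated words make the cloud hard to read, a few `wordcloud()` arguments may help. The sketch below fixes the random seed, places the most frequent bigrams first, and keeps all text horizontal. `toy_bigrams` is an invented stand-in for `book_1_bigrams`, and these settings are a suggestion rather than the ones used for the figure above.

```r
library(wordcloud)

# Toy stand-in for book_1_bigrams (columns: bigram, count); values are made up
toy_bigrams <- data.frame(
  bigram = c("uncle vernon", "professor mcgonagall", "aunt petunia", "harry potter"),
  count  = c(8, 7, 5, 4)
)

set.seed(1)  # the layout is partly random, so fix the seed for a repeatable plot
wordcloud(
  words = toy_bigrams$bigram,
  freq = toy_bigrams$count,
  min.freq = 1,          # keep the low-count toy entries
  max.words = 20,
  random.order = FALSE,  # place the most frequent bigrams first, near the centre
  rot.per = 0,           # draw every bigram horizontally
  scale = c(1.5, 0.5),
  colors = c("#A053A1", "#DB778F", "#09A39A", "#5869C7")
)
```

For exact comparisons, the bar charts used elsewhere in this analysis remain easier to read than any word cloud.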

Top 10 most common trigrams across all books

Code
potter_trigram = unnest_tokens(tbl = potter_tbl, input = text, output = trigram, token = "ngrams", n = 3)
# remove stop words
potter_trigram2 <- potter_trigram |>
  mutate(
    word1 = str_extract(trigram, "^\\w+"),
    word2 = str_extract(trigram, "(?<=\\s)\\w+(?=\\s)"),
    word3 = str_extract(trigram, "\\w+$")
  ) |>
  filter(
    !(word1 %in% stop_words$word),
    !(word2 %in% stop_words$word),
    !(word3 %in% stop_words$word)
  ) |>
  count(trigram, word1, word2, word3, name = "count", sort = TRUE)

Code
trigram_plot <- potter_trigram2 |>
  slice_max(count, n = 10) |>
  mutate(
    trigram = reorder(trigram, count),
    tooltip = paste0("Trigram: ", trigram, "<br>Count: ", count)
  )

p <- trigram_plot |>
  ggplot(aes(x = count, y = trigram, text = tooltip)) +
  geom_col(fill = "purple") +
  labs(y = "Trigram")

ggplotly(p, tooltip = "text")

The trigram professor grubbly plank refers to the witch who served as a substitute Care of Magical Creatures teacher at Hogwarts, and mad eye moody was another professor, who taught Defence Against the Dark Arts. Additionally, the trigram quidditch world cup appears across the series because it marks a major plot point: Quidditch is a popular sport in the wizarding world, and Harry attends the Quidditch World Cup in Goblet of Fire.
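
To see which books a given trigram comes from, the counts can be broken down by book. The sketch below does this for one phrase on invented sentences. `toy_tbl` stands in for `potter_tbl` (assumed columns `book` and `text`), and the output says nothing about the real distribution.

```r
library(dplyr)
library(tibble)
library(tidytext)

# Toy stand-in for potter_tbl; the sentences are made up
toy_tbl <- tibble(
  book = c("Book A", "Book A", "Book B"),
  text = c("the quidditch world cup was loud",
           "tickets for the quidditch world cup",
           "harry missed the quidditch world cup")
)

# Tokenize into trigrams, keep one phrase, and count its occurrences per book
phrase_by_book <- toy_tbl |>
  unnest_tokens(output = trigram, input = text, token = "ngrams", n = 3) |>
  filter(trigram == "quidditch world cup") |>
  count(book, name = "count", sort = TRUE)

phrase_by_book
```

Applied to `potter_tbl`, the same pipeline would show whether a phrase is spread across the series or concentrated in a single book.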