Scrape Specific Pokemon Data

On each Pokemon’s page (for example, Gengar’s page), there is consistent, structured data that we can scrape programmatically. Each page has a table showing the Pokemon’s “national number”, type(s), height, and weight; another table showing base stats for battle such as health points (HP), attack, and defense; and several other tables and pieces of information.

To scrape the data using R, I’ll rely on the rvest package to extract data from different parts of the site. Looking at the site’s HTML source helps in working out which elements to extract.

Load Packages

Let’s start by loading all of the packages we’ll need.

# Load packages
library(rvest)
library(dplyr)
library(tidyr)
library(tibble)

Specify Parameters

Next, we’ll specify the Pokemon we want to look up and the URL from which to scrape. Let’s stick with Gengar for now.

# Set the name of the Pokemon
name <- "Gengar"

# Convert name to lower case for the URL
name <- tolower(name)

# Build the URL to fetch Pokemon data
url <- paste0("https://pokemondb.net/pokedex/", name)

Get Tables From Page

Now that we’ve got the URL, we need to fetch the tables that store the Pokemon’s data.

# Read the body from the page
body <- url %>% read_html() %>% html_nodes("body")

# Get the tables with the vital information
vitals_tables <- html_nodes(body, "table.vitals-table")
main_table <- html_table(vitals_tables[1])[[1]]
stats_table <- html_table(vitals_tables[4])[[1]]

glimpse(main_table)
Rows: 7
Columns: 2
$ X1 <chr> "National №", "Type", "Species", "Height", "Weight", "Abilities", "…
$ X2 <chr> "0094", "Ghost Poison", "Shadow Pokémon", "1.5 m (4′11″)", "40.5 kg…

This gives a pretty messy table, but it’s definitely something we can use.
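
To see why the table comes back with generic X1 and X2 column names, here is a minimal sketch that runs the same rvest calls on a small inline HTML snippet instead of the live site. The snippet is a hypothetical stand-in: it assumes each row pairs a label cell with a value cell, as the output above suggests, so there is no header row for html_table to pick up.

```r
library(rvest)

# Hypothetical stand-in for one "vitals" table; the real page has more rows
html <- '<html><body>
  <table class="vitals-table"><tbody>
    <tr><th>Type</th><td>Ghost Poison</td></tr>
    <tr><th>Species</th><td>Shadow Pokemon</td></tr>
  </tbody></table>
</body></html>'

# Parse the snippet and select tables by CSS class, as in the code above
page <- read_html(html)
tables <- html_nodes(page, "table.vitals-table")

# html_table() returns a list of tables for a node set; take the first one
demo_table <- html_table(tables[1])[[1]]
print(demo_table)
```

Working against a small snippet like this is also a convenient way to try out selectors without sending repeated requests to the site.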

Getting Stats From the Table

Ultimately, we’re going to want to pull all of this data into one structured table, so let’s accumulate all of this data into one row. We’ll need to tweak the table to make it easier to extract data from, then we’ll put together the Pokemon information and base battle stats.

# Function to help us turn the first row of our tibble into the header
header_from_row <- function(df) {
  names(df) <- as.character(unlist(df[1, ]))
  df[-1, ]
}

# Get the types, species, height, and weight from the table
main_tbl <- main_table %>%
  t %>%
  as_tibble %>%
  header_from_row %>%
  select(c("Type", "Species", "Height", "Weight"))

# Get stats columns
stats_tbl <- stats_table %>%
  select(c("X1", "X2")) %>%
  t %>%
  as_tibble %>%
  header_from_row

# Merge the data
data_tbl <- cbind(main_tbl, stats_tbl)

glimpse(data_tbl)
Rows: 1
Columns: 11
$ Type      <chr> "Ghost Poison"
$ Species   <chr> "Shadow Pokémon"
$ Height    <chr> "1.5 m (4′11″)"
$ Weight    <chr> "40.5 kg (89.3 lbs)"
$ HP        <chr> " 60"
$ Attack    <chr> " 65"
$ Defense   <chr> " 60"
$ `Sp. Atk` <chr> "130"
$ `Sp. Def` <chr> " 75"
$ Speed     <chr> "110"
$ Total     <chr> "500"

This gives us a somewhat cleaner table, but there’s still some work to do.
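
Part of the remaining work is that every stat is still stored as text, some with a leading space. The following is a small sketch of one way to convert them to integers, using a stand-in data frame (stats_demo is a hypothetical name) rather than the scraped data_tbl.

```r
# Stand-in for a few scraped stats, stored as text as in the output above
stats_demo <- data.frame(
  HP = " 60",
  Attack = " 65",
  Speed = "110",
  stringsAsFactors = FALSE
)

# Trim stray whitespace, then convert every column to integer
stats_demo[] <- lapply(stats_demo, function(x) as.integer(trimws(x)))

str(stats_demo)
```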

Clean Up Scraped Data

One thing I’d like to fix at this stage is that Gengar’s two types, Ghost and Poison, are placed in a single “Type” column. We want to break this up into multiple columns so that “Type 1” is Ghost and “Type 2” is Poison. We can do this mostly with strsplit and separate, plus a little manipulation to make sure we always end up with three type columns. No Pokemon so far has more than three types, so this is good for now.
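
As a quick illustration of the splitting idea, here is a sketch on a stand-in tibble (types_demo and types_split are hypothetical names). It is a variant, not the approach used below: it asks separate for three columns up front and relies on its fill argument to pad missing types with NA.

```r
library(tidyr)
library(tibble)

# Stand-in rows: one dual-type and one single-type Pokemon
types_demo <- tibble(
  Name = c("Gengar", "Pikachu"),
  Type = c("Ghost Poison", "Electric")
)

# Split on the space; fill = "right" pads missing types with NA
types_split <- separate(
  types_demo,
  col = "Type",
  into = paste("Type", 1:3),
  sep = " ",
  fill = "right"
)

print(types_split)
```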

# Clean up the types by breaking them into different columns
num_types <- data_tbl$Type %>% strsplit(" ") %>% unlist %>% length
col_names <- paste("Type", c(1:num_types))
data_tbl <- data_tbl %>%
  separate(col = "Type", into = col_names)

# Ensure the type columns are fixed: a Pokemon sometimes has only one type,
# but we allow for up to three
cols <- c(
  `Type 1` = NA_character_,
  `Type 2` = NA_character_,
  `Type 3` = NA_character_
)
data_tbl <- add_column(data_tbl,
                       !!!cols[setdiff(names(cols), names(data_tbl))])

glimpse(data_tbl)
Rows: 1
Columns: 13
$ `Type 1`  <chr> "Ghost"
$ `Type 2`  <chr> "Poison"
$ Species   <chr> "Shadow Pokémon"
$ Height    <chr> "1.5 m (4′11″)"
$ Weight    <chr> "40.5 kg (89.3 lbs)"
$ HP        <chr> " 60"
$ Attack    <chr> " 65"
$ Defense   <chr> " 60"
$ `Sp. Atk` <chr> "130"
$ `Sp. Def` <chr> " 75"
$ Speed     <chr> "110"
$ Total     <chr> "500"
$ `Type 3`  <chr> NA

And now we have our cleaned up Types in different columns. There’s more we could do to clean the data, such as keeping only metric units in the Height and Weight columns, but I’ll hold off on that until we build the full data frame.
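
For reference, the metric-only clean-up of Height and Weight could look roughly like the sketch below. It assumes the metric value always comes first in the string, as it does in the Gengar output; leading_number is a hypothetical helper name.

```r
# Values as they appear in the scraped table; \u2032 and \u2033 are the
# prime and double-prime marks used for feet and inches
height <- "1.5 m (4\u203211\u2033)"
weight <- "40.5 kg (89.3 lbs)"

# Keep only the leading number, assumed to be the metric value
leading_number <- function(x) {
  as.numeric(sub("^\\s*([0-9]+\\.?[0-9]*).*$", "\\1", x))
}

height_m <- leading_number(height)
weight_kg <- leading_number(weight)

print(c(height_m = height_m, weight_kg = weight_kg))
```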

Add Evolution Data

Finally, let’s add some evolution data to our data row. In particular, let’s add the following:

  • Has Evolution: a boolean flag for if the Pokemon is part of an evolution chain or is a standalone Pokemon. Gengar is part of an evolution chain.
  • Evolution Place: an integer stating where in the evolution chain the Pokemon sits. Gengar is the third evolution in its evolution chain.
  • Maximum Evolution Count: an integer specifying the final step of the Pokemon’s evolution chain. Gengar’s evolution chain has three different Pokemon.
  • Evolution Index: a floating point value, greater than 0 and at most 1, specifying how far along the evolution chain the Pokemon sits, calculated as Evolution Place / Maximum Evolution Count. Gengar’s evolution index is 1.

# Look for evolution information
evo_node <- html_nodes(body, "div.infocard-list-evo")

# Check to see if there was any evolution information
has_evo <- length(evo_node) >= 1

# If there was evolution information, fill out the data
# Otherwise, assume there is not evolution of this Pokemon
if (has_evo) {

  # Get the list of evolutions
  evo_list <- evo_node %>% html_nodes("a.ent-name") %>% html_text

  # Get the maximum number of evolutions for this Pokemon's evolution chain
  max_evo <- length(unique(evo_list))

  # Find out where in the evolution chain this Pokemon sits
  evo_place <- which(tolower(evo_list) == name)[1]

  # Calculate an evolution index, how far to max evolution the Pokemon is
  evo_index <- round(as.double(evo_place) / as.double(max_evo), 2)

} else {

  # Set the evolution information to NA
  max_evo <- NA_integer_
  evo_place <- NA_integer_
  evo_index <- NA_real_  # the index is a double, so use the matching NA type

}

# Append evolution information to the data tibble
# Note: c() coerces all values to one type, so the logical flag is stored
# as a double (TRUE becomes 1)
evo_values <- c(
  `Has Evolution` = has_evo,
  `Evolution Place` = evo_place,
  `Maximum Evolution Count` = max_evo,
  `Evolution Index` = evo_index
)
evo_tbl <- evo_values %>% t %>% as_tibble
data_tbl <- cbind(data_tbl, evo_tbl)

glimpse(data_tbl)
Rows: 1
Columns: 17
$ `Type 1`                  <chr> "Ghost"
$ `Type 2`                  <chr> "Poison"
$ Species                   <chr> "Shadow Pokémon"
$ Height                    <chr> "1.5 m (4′11″)"
$ Weight                    <chr> "40.5 kg (89.3 lbs)"
$ HP                        <chr> " 60"
$ Attack                    <chr> " 65"
$ Defense                   <chr> " 60"
$ `Sp. Atk`                 <chr> "130"
$ `Sp. Def`                 <chr> " 75"
$ Speed                     <chr> "110"
$ Total                     <chr> "500"
$ `Type 3`                  <chr> NA
$ `Has Evolution`           <dbl> 1
$ `Evolution Place`         <dbl> 3
$ `Maximum Evolution Count` <dbl> 3
$ `Evolution Index`         <dbl> 1
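
The four evolution fields can also be wrapped in a small helper, which makes the arithmetic easy to check without touching the site. This is only a sketch: evolution_summary is a hypothetical function, it takes the chain as a plain character vector, and it differs slightly from the code above in that it looks up the position in the de-duplicated chain.

```r
# Hypothetical helper: summarise where `name` sits in its evolution chain
evolution_summary <- function(name, evo_names) {
  # De-duplicate the chain and compare names case-insensitively
  chain <- unique(tolower(evo_names))
  place <- match(tolower(name), chain)

  # No chain, or the name is not in it: treat as a standalone Pokemon
  if (length(chain) == 0 || is.na(place)) {
    return(list(has_evo = FALSE, evo_place = NA_integer_,
                max_evo = NA_integer_, evo_index = NA_real_))
  }

  list(
    has_evo = TRUE,
    evo_place = place,
    max_evo = length(chain),
    evo_index = round(place / length(chain), 2)
  )
}

# Gengar is the third of three Pokemon in its chain
gengar <- evolution_summary("Gengar", c("Gastly", "Haunter", "Gengar"))
haunter <- evolution_summary("Haunter", c("Gastly", "Haunter", "Gengar"))
str(gengar)
```

Returning a list keeps the logical flag and the numeric fields in their own types, which avoids the coercion that happens when they are combined with c().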

And that’s it! Now we can build on this to create one Pokedex data frame for all of the different Pokemon on the site.