User:Worf/Pokemon/Parser specs

Scraper for pocketmonsters.net

Format of the manually created input file for the episode mapping:

a<AID>.map

<anidb epno> || <pm.net epid>

1 || 1
2 || 2
...
S1 || 666

The script should read and parse each .map file to get the mapping between AniDB episode numbers and pm.net episode ids.
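A minimal sketch of the .map parsing step, assuming one mapping per line and that lines without a numeric pm.net id (the format header, the "..." placeholder) can be skipped. The function name parse_map_text is hypothetical and not part of the spec.

```python
def parse_map_text(text):
    """Map AniDB episode numbers (kept as strings, e.g. '1' or 'S1')
    to pm.net episode ids (ints)."""
    mapping = {}
    for raw in text.splitlines():
        line = raw.strip()
        if "||" not in line:
            continue  # blank lines, "..." and other non-mapping lines
        epno, _, epid = line.partition("||")
        epno, epid = epno.strip(), epid.strip()
        if not epno or not epid.isdigit():
            continue  # e.g. the "<anidb epno> || <pm.net epid>" header
        mapping[epno] = int(epid)
    return mapping


if __name__ == "__main__":
    sample = "<anidb epno> || <pm.net epid>\n\n1 || 1\n2 || 2\n...\nS1 || 666\n"
    print(parse_map_text(sample))
```

Episode numbers are kept as strings because specials such as S1 are not numeric.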

The script should then parse the pm.net episode pages one by one to extract information about the characters and pokemon appearing in each episode.

Example: http://pocketmonsters.net/episodes/276#Char

The links of interest have the following formats:

http://pocketmonsters.net/character/2

http://pocketmonsters.net/dex/16
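The link extraction could be sketched as follows. It assumes the ids of interest appear in the href attribute of ordinary <a> tags on the episode page; the actual pm.net markup is not described here, so this is an assumption. The names LinkCollector and extract_ids are hypothetical.

```python
import re
from html.parser import HTMLParser

# Match .../character/<id> and .../dex/<id>, optionally followed by /, ? or #.
CHARACTER_RE = re.compile(r"/character/(\d+)(?:[/?#]|$)")
DEX_RE = re.compile(r"/dex/(\d+)(?:[/?#]|$)")


class LinkCollector(HTMLParser):
    """Collect character and pokedex ids from the links of one page."""

    def __init__(self):
        super().__init__()
        self.characters = set()
        self.pokemon = set()

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href") or ""
        match = CHARACTER_RE.search(href)
        if match:
            self.characters.add(int(match.group(1)))
        match = DEX_RE.search(href)
        if match:
            self.pokemon.add(int(match.group(1)))


def extract_ids(html):
    """Return (sorted character ids, sorted pokedex ids) found in html."""
    collector = LinkCollector()
    collector.feed(html)
    return sorted(collector.characters), sorted(collector.pokemon)
```

Fetching the page itself (for example with urllib.request) is left out so that the parsing can be tested offline.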

The parsed information should be stored in an output file called a<AID>_[characters|pokemon].out with the following format:

a<AID>_characters.out:

---275---
Pokemon Character #0002

---276---
Pokemon Character #0002
Pokemon Character #...

a<AID>_pokemon.out:

---275---
Pokemon #001
Pokemon #016

---276---
Pokemon #016
Pokemon #...

Character ids should be zero-padded to four digits, pokemon ids to three digits. The number enclosed by --- is the AniDB episode number.
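The output format above can be produced with a small formatter. This is a sketch; the function name format_output and the dict-based input are assumptions made for illustration.

```python
def format_output(episodes, label, width):
    """Render {anidb_epno: iterable of ids} in the .out format.

    label is "Pokemon Character" or "Pokemon"; width is the number
    of digits the ids are zero-padded to (4 or 3 respectively).
    """
    blocks = []
    for epno, ids in episodes.items():
        lines = [f"---{epno}---"]
        lines += [f"{label} #{i:0{width}d}" for i in sorted(set(ids))]
        blocks.append("\n".join(lines))
    return "\n\n".join(blocks) + "\n"


if __name__ == "__main__":
    print(format_output({"275": [16, 1], "276": [16]}, "Pokemon", 3), end="")
```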


Additional features:

Compare output files a<AID>_[characters|pokemon].out with existing files a<AID>_[characters|pokemon].done to find changes between the data on AniDB and the data on pm.net.
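One way to sketch this comparison is to parse both files back into per-episode sets and report what each side has that the other lacks. It assumes the .done files use the same layout as the .out files; the names parse_out_text and diff_outputs are hypothetical.

```python
def parse_out_text(text):
    """Parse .out/.done content into {epno: set of entry lines}."""
    episodes = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if len(line) > 6 and line.startswith("---") and line.endswith("---"):
            current = line[3:-3]
            episodes.setdefault(current, set())
        elif current is not None:
            episodes[current].add(line)
    return episodes


def diff_outputs(out_text, done_text):
    """Return {epno: (only_in_out, only_in_done)} for episodes that differ."""
    out, done = parse_out_text(out_text), parse_out_text(done_text)
    changes = {}
    for epno in sorted(set(out) | set(done)):
        only_out = out.get(epno, set()) - done.get(epno, set())
        only_done = done.get(epno, set()) - out.get(epno, set())
        if only_out or only_done:
            changes[epno] = (sorted(only_out), sorted(only_done))
    return changes
```

Entries only in the .out file are candidates to add on AniDB; entries only in the .done file may have been removed on pm.net.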

Parse all output files a<AID>_[characters|pokemon].out to create a list of character and pokemon ids that are not present in any of them: take the highest id, create the list 1-<highest id>, remove all ids that appear, and write the remaining list to missing_[characters|pokemon].out. Incorporate a blacklist_[characters|pokemon].in file to remove further ids from the list that aren't anime characters (pokemon live show characters, for example).
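The missing-id computation described above can be sketched as a set difference. The function names are hypothetical, and the blacklist is passed in as an iterable of ids because the layout of the blacklist file is not specified here.

```python
import re

# An entry line ends in "#<digits>", e.g. "Pokemon #016".
ID_RE = re.compile(r"#(\d+)\s*$")


def seen_ids(out_texts):
    """Collect every id that appears in any of the given .out contents."""
    ids = set()
    for text in out_texts:
        for line in text.splitlines():
            match = ID_RE.search(line)
            if match:
                ids.add(int(match.group(1)))
    return ids


def missing_ids(out_texts, blacklist=()):
    """Ids in 1..<highest id> that never appear and are not blacklisted."""
    seen = seen_ids(out_texts)
    if not seen:
        return []
    return sorted(set(range(1, max(seen) + 1)) - seen - set(blacklist))
```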


v2 Character parser

Compare data already added to anidb with data from pm.net using http://anidb.net/perl-bin/animedb.pl?show=report&report.type=worf_pokemon_characterlist&report.csv=1&do.report=Generate+Report

Mappings file between pm.net and anidb: http://wiki.anidb.net/w/User:Worf/Pokemon/Parser_specs_mapping

charid -> anidb.net character id
identifier -> pm.net character id of the character; http://pocketmonsters.net/character/<identifier> (remove leading 0s)
pokedexid -> pokedex id the character is a guise of; http://pocketmonsters.net/dex/<pokedexid> (remove leading 0s)
pokemon -> comma-separated list of pm.net character ids; list of pokemon the character owns
eids -> list of anidb.net episode ids the character appears in; check against pm.net->anidb.net mappings file
names -> comma-separated list of names; each element consists of name, type and language separated by ||
other -> description of the character
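As an illustration of the names field, a minimal parsing sketch follows. It assumes that elements are separated by commas, that each element is name||type||language, and that names contain no commas; none of this is confirmed against the real report. The Name type, the parse_names function and the sample values are hypothetical.

```python
from typing import NamedTuple


class Name(NamedTuple):
    name: str
    type: str
    language: str


def parse_names(field):
    """Split the names field into Name records, skipping malformed elements."""
    names = []
    for element in field.split(","):
        parts = [part.strip() for part in element.split("||")]
        if len(parts) != 3:
            continue  # not of the form name||type||language
        names.append(Name(*parts))
    return names


if __name__ == "__main__":
    # Hypothetical sample values, not taken from the real report.
    print(parse_names("Satoshi||main||ja, Ash||official||en"))
```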


Data that should be compared:

- whether or not a name differs between the two systems and whether a name is present in one but missing on the other system

- whether the description of a character is the same on both systems

- if the character is a pokemon (has a guise relation on anidb / has a "Pokemon Species" set on pm.net) check whether the relation is set on both systems

- whether all trainer->pokemon relations are set

- whether the pokemon->trainer relation is correctly set

- whether the episode appearances are the same on both systems
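The checks above could be sketched as a field-by-field comparison of two records for the same character. The record layout (dicts with the keys used below) and the function name compare_character are assumptions for illustration; both records are expected to be normalised first, for example with episode appearances already translated to AniDB episode numbers via the mapping file.

```python
SET_FIELDS = ("names", "pokemon", "trainers", "episodes")
VALUE_FIELDS = ("description", "pokedexid")


def compare_character(anidb, pmnet):
    """Return {field: details} for every field that differs.

    Set-valued fields report what is present on only one system;
    single-valued fields report both values.
    """
    diffs = {}
    for field in SET_FIELDS:
        only_anidb = set(anidb[field]) - set(pmnet[field])
        only_pmnet = set(pmnet[field]) - set(anidb[field])
        if only_anidb or only_pmnet:
            diffs[field] = {
                "only_anidb": sorted(only_anidb),
                "only_pmnet": sorted(only_pmnet),
            }
    for field in VALUE_FIELDS:
        if anidb[field] != pmnet[field]:
            diffs[field] = {"anidb": anidb[field], "pmnet": pmnet[field]}
    return diffs
```

An empty result means the two systems agree on that character.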


Output:

- Differences as specified above in a human-readable format (pm.net_id, identifier, differing field(s))

- List of pm.net character IDs that don't appear in the report (ignore blacklisted IDs read from blacklist file)

v2 Episode parser

---todo---