User:Worf/Pokemon/Parser specs
Scraper for pocketmonsters.net
Format for input file for episode mapping (created manually):
a<AID>.map
<anidb epno> || <pm.net epid>
1 || 1
2 || 2
...
S1 || 666
Script should read and parse each .map file to get the mapping between anidb episode numbers and the pm.net episode ids.
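The mapping step could be sketched as follows. This is a minimal sketch, assuming one `<anidb epno> || <pm.net epid>` pair per line; the function names `parse_map_text` and `load_map_file` are hypothetical. AniDB episode numbers are kept as strings because specials look like `S1`.

```python
from pathlib import Path


def parse_map_text(text: str) -> dict[str, int]:
    """Parse the contents of one a<AID>.map file.

    Returns a dict mapping the AniDB episode number (a string, since
    specials look like "S1") to the pm.net episode id.
    """
    mapping: dict[str, int] = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue  # skip blank lines
        epno, sep, pm_id = line.partition("||")
        if not sep:
            raise ValueError(f"malformed mapping line: {raw_line!r}")
        mapping[epno.strip()] = int(pm_id.strip())
    return mapping


def load_map_file(path: Path) -> dict[str, int]:
    """Read a .map file from disk and parse it (UTF-8 is an assumption)."""
    return parse_map_text(path.read_text(encoding="utf-8"))
```

For example, `parse_map_text("1 || 1\nS1 || 666\n")` yields `{"1": 1, "S1": 666}`.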
Script should then parse the pm.net episode pages one by one to extract information about characters and pokemon.
Example: http://pocketmonsters.net/episodes/276#Char
The links of interest have the following formats:
http://pocketmonsters.net/character/2
http://pocketmonsters.net/dex/16
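One possible way to pull the ids out of an episode page is a pair of regular expressions over the raw HTML. This is only a sketch: it assumes the links appear in double-quoted `href` attributes, either absolute (as in the examples above) or site-relative such as `/dex/16`, and the name `extract_ids` is hypothetical. Fetching the page is left out.

```python
import re

# Assumed link shapes, based on the two example URLs; the optional host
# part also lets site-relative links like href="/character/2" match.
_HOST = r'(?:https?://(?:www\.)?pocketmonsters\.net)?'
CHARACTER_LINK = re.compile(r'href="' + _HOST + r'/character/(\d+)')
DEX_LINK = re.compile(r'href="' + _HOST + r'/dex/(\d+)')


def extract_ids(html: str) -> tuple[list[int], list[int]]:
    """Return (character ids, pokedex ids) found in one episode page.

    Each list is de-duplicated and sorted in ascending order.
    """
    characters = sorted({int(m) for m in CHARACTER_LINK.findall(html)})
    pokemon = sorted({int(m) for m in DEX_LINK.findall(html)})
    return characters, pokemon
```

A real HTML parser would be more robust against unusual markup; the regex version is shown only because the link formats are so regular.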
The parsed information should be stored in an output file called a<AID>_[characters|pokemon].out with the following format:
a<AID>_characters.out:
---275---
Pokemon Character #0002
---276---
Pokemon Character #0002
Pokemon Character #...
a<AID>_pokemon.out:
---275---
Pokemon #001
Pokemon #016
---276---
Pokemon #016
Pokemon #...
Character ids should be zero-padded to 4 digits; pokemon ids should be zero-padded to 3 digits. The number enclosed by --- is the episode number on AniDB.
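The output format could be produced along these lines. A minimal sketch, assuming one entry per line and a hypothetical helper name `format_output`; the label and padding width are passed in so the same function serves both files.

```python
def format_output(episodes: dict[str, list[int]], label: str, width: int) -> str:
    """Render one .out file.

    episodes maps the AniDB episode number to the ids found for it.
    label is "Pokemon Character" or "Pokemon"; width is 4 or 3.
    """
    lines: list[str] = []
    for epno, ids in episodes.items():
        lines.append(f"---{epno}---")
        for entry_id in ids:
            # Zero-pad the id to the requested number of digits.
            lines.append(f"{label} #{entry_id:0{width}d}")
    return "\n".join(lines) + "\n"


characters_text = format_output({"275": [2], "276": [2]}, "Pokemon Character", 4)
pokemon_text = format_output({"275": [1, 16], "276": [16]}, "Pokemon", 3)
```

Here `pokemon_text` starts with `---275---`, followed by `Pokemon #001` and `Pokemon #016`.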
Additional features:
Compare output files a<AID>_[characters|pokemon].out with existing files a<AID>_[characters|pokemon].done to find changes between the data on AniDB and the data on pm.net.
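A comparison of this kind might look like the sketch below. It assumes the .done files use the same layout as the .out files (one `---<epno>---` header followed by one entry per line), which the spec does not state explicitly; `parse_output` and `diff_outputs` are hypothetical names.

```python
def parse_output(text: str) -> dict[str, set[str]]:
    """Group the entry lines of an .out/.done file by episode number."""
    episodes: dict[str, set[str]] = {}
    current = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        if line.startswith("---") and line.endswith("---") and len(line) > 6:
            current = line.strip("-")  # "---275---" -> "275"
            episodes.setdefault(current, set())
        elif current is not None:
            episodes[current].add(line)
    return episodes


def diff_outputs(out_text: str, done_text: str) -> dict[str, tuple[list[str], list[str]]]:
    """Return {epno: (only on pm.net, only on AniDB)} for episodes that differ."""
    out, done = parse_output(out_text), parse_output(done_text)
    changes: dict[str, tuple[list[str], list[str]]] = {}
    for epno in sorted(out.keys() | done.keys()):
        added = out.get(epno, set()) - done.get(epno, set())
        removed = done.get(epno, set()) - out.get(epno, set())
        if added or removed:
            changes[epno] = (sorted(added), sorted(removed))
    return changes
```

Episodes with identical entries in both files are left out of the result, so an empty dict means nothing changed.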
Parse all output files a<AID>_[characters|pokemon].out to create a list of character and pokemon ids that do not appear in any of them: take the highest id, create the list 1-<highest id>, remove all ids that appear, and write the remaining list to missing_[characters|pokemon].out. Incorporate a blacklist_[characters|pokemon].in file to remove further ids from the list that aren't anime characters (pokemon live show characters, for example).
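The missing-id computation could be sketched as below. The layout of the blacklist file is not specified here, so the sketch simply takes the blacklist as a set of integers; `collect_ids` and `missing_ids` are hypothetical names.

```python
import re

# An entry line ends in "#<digits>", e.g. "Pokemon #016".
ID_PATTERN = re.compile(r"#(\d+)\s*$")


def collect_ids(out_texts: list[str]) -> set[int]:
    """Collect every id that appears in any of the given .out file contents."""
    seen: set[int] = set()
    for text in out_texts:
        for line in text.splitlines():
            match = ID_PATTERN.search(line)
            if match:
                seen.add(int(match.group(1)))
    return seen


def missing_ids(seen: set[int], blacklist: set[int]) -> list[int]:
    """Ids in 1..max(seen) that never appear and are not blacklisted."""
    if not seen:
        return []
    return [i for i in range(1, max(seen) + 1)
            if i not in seen and i not in blacklist]
```

For example, with seen ids `{1, 2, 5, 6}` and a blacklist of `{4}`, the result is `[3]`.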
v2
Compare the data already added to AniDB with the data from pm.net, using the report at http://anidb.net/perl-bin/animedb.pl?show=report&report.type=worf_pokemon_characterlist&report.csv=1&do.report=Generate+Report
charid -> anidb.net character id
identifier -> pm.net character id of the character; http://pocketmonsters.net/character/<identifier> (remove leading 0s)
pokedexid -> pokedex id the character is a guise of; http://pocketmonsters.net/dex/<pokedexid> (remove leading 0s)
pokemon -> comma-separated list of pm.net character ids; list of pokemon the character owns
eids -> list of anidb.net episode ids the character appears in; check against the pm.net->anidb.net mappings file
names -> comma-separated list of names; each element consists of name, type and language divided by ||
other -> description of the character
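Once a report row has been split into its fields, the individual values could be decoded roughly as follows. This is a sketch under assumptions: the column delimiter of the CSV report is not described here, so the helpers work on single field strings; names are assumed not to contain commas themselves; and all helper names, as well as the type and language values in the example, are hypothetical.

```python
def character_url(identifier: str) -> str:
    """Build the pm.net character URL; int() drops any leading zeros."""
    return f"http://pocketmonsters.net/character/{int(identifier)}"


def dex_url(pokedexid: str) -> str:
    """Build the pm.net pokedex URL; int() drops any leading zeros."""
    return f"http://pocketmonsters.net/dex/{int(pokedexid)}"


def parse_id_list(field: str) -> list[int]:
    """Decode a comma-separated id list such as the pokemon field."""
    return [int(part) for part in field.split(",") if part.strip()]


def parse_names(field: str) -> list[tuple[str, str, str]]:
    """Decode the names field into (name, type, language) tuples."""
    names: list[tuple[str, str, str]] = []
    for element in field.split(","):
        element = element.strip()
        if not element:
            continue
        # Raises ValueError if an element does not have exactly three parts.
        name, kind, language = (part.strip() for part in element.split("||"))
        names.append((name, kind, language))
    return names
```

For example, `character_url("0002")` gives `http://pocketmonsters.net/character/2`, matching the link format used on the episode pages.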