User:Worf/Pokemon/Parser specs
Scraper for pocketmonsters.net
Format for input file for episode mapping:
a<AID>.map
<anidb epno> || <pm.net epid>
1 || 1
2 || 2
...
S1 || 666
The script should read and parse each .map file to get the mapping between AniDB episode numbers and pm.net episode ids.
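A minimal sketch of the .map parsing step, in Python. It assumes one mapping per line with || as the separator, and it keeps the AniDB episode number as a string so that specials such as S1 survive. The function names parse_map_text and aid_from_filename are hypothetical and not part of the spec.

```python
import re
from pathlib import Path


def parse_map_text(text):
    """Return {AniDB episode number (str): pm.net episode id (int)}.

    Assumption: one "<epno> || <epid>" pair per line.
    """
    mapping = {}
    for raw in text.splitlines():
        line = raw.strip()
        if "||" not in line:
            continue  # skip blank lines and filler such as "..."
        epno, epid = (part.strip() for part in line.split("||", 1))
        if not epno or not epid.isdigit():
            continue  # skip lines whose right-hand side is not a numeric id
        mapping[epno] = int(epid)
    return mapping


def aid_from_filename(path):
    """Extract the AniDB anime id from a file name like a<AID>.map."""
    match = re.fullmatch(r"a(\d+)\.map", Path(path).name)
    return int(match.group(1)) if match else None
```

Reading a file then comes down to parse_map_text(Path(name).read_text()), with aid_from_filename(name) supplying the <AID> for the output file names.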
The script should then parse the pm.net episode pages one by one to extract information about characters and pokemon.
Example: http://pocketmonsters.net/episodes/276#Char
The links of interest have the following formats:
http://pocketmonsters.net/character/2
http://pocketmonsters.net/dex/16
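One possible way to pull the ids out of an episode page is a pair of regular expressions over the raw HTML. The sketch below assumes the links appear in href attributes, either absolute or site-relative, and it deduplicates and sorts the ids, which the spec does not require. It scans the whole page, so restricting it to the characters and pokemon sections may be needed if such links also appear elsewhere on the page. The name extract_ids is hypothetical.

```python
import re

# Assumption: links are written as href="..." and may be absolute or relative.
_PREFIX = r'href="(?:https?://(?:www\.)?pocketmonsters\.net)?'
CHARACTER_RE = re.compile(_PREFIX + r"/character/(\d+)")
DEX_RE = re.compile(_PREFIX + r"/dex/(\d+)")


def extract_ids(html):
    """Return (character_ids, pokemon_ids) found in the page, sorted and unique."""
    characters = sorted({int(m) for m in CHARACTER_RE.findall(html)})
    pokemon = sorted({int(m) for m in DEX_RE.findall(html)})
    return characters, pokemon
```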
The parsed information should be stored in two output files, a<AID>_characters.out and a<AID>_pokemon.out, with the following formats:
a<AID>_characters.out:
---275---
Pokemon Character #0002
---276---
Pokemon Character #0002
Pokemon Character #...
a<AID>_pokemon.out:
---275---
Pokemon #001
Pokemon #016
---276---
Pokemon #016
Pokemon #...
Character ids should be zero-padded to 4 digits; pokemon ids should be zero-padded to 3 digits. The number enclosed by --- is the episode number on AniDB.
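The padding rules can be sketched as a small formatter that produces one episode block. It assumes the episode header and each entry sit on their own line. The function name format_block and the kind argument are hypothetical.

```python
def format_block(epno, ids, kind):
    """Format one episode block for a<AID>_characters.out or a<AID>_pokemon.out."""
    if kind == "characters":
        label, width = "Pokemon Character", 4  # character ids: 4 digits
    elif kind == "pokemon":
        label, width = "Pokemon", 3  # pokemon ids: 3 digits
    else:
        raise ValueError(f"unknown kind: {kind!r}")
    lines = [f"---{epno}---"]  # epno is the AniDB episode number
    lines += [f"{label} #{i:0{width}d}" for i in ids]
    return "\n".join(lines) + "\n"
```

For example, format_block("275", [1, 16], "pokemon") yields the header ---275--- followed by Pokemon #001 and Pokemon #016.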
Additional features:
Compare output files a<AID>_[characters|pokemon].out with existing files a<AID>_[characters|pokemon].done to find changes between the data on AniDB and the data on pm.net.
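A rough sketch of the comparison, assuming both the .out and .done files use the block format described above with one entry per line. The spec does not say how changes should be reported, so the per-episode (added, removed) result here is only an illustration. The names parse_out_text and diff_episodes are hypothetical.

```python
import re

HEADER_RE = re.compile(r"^---(.+)---$")  # episode header, e.g. ---275---
ENTRY_RE = re.compile(r"#(\d+)$")        # entry ending in a zero-padded id


def parse_out_text(text):
    """Return {AniDB episode number (str): set of ids (int)}."""
    episodes = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        header = HEADER_RE.match(line)
        if header:
            current = episodes.setdefault(header.group(1), set())
            continue
        entry = ENTRY_RE.search(line)
        if entry and current is not None:
            current.add(int(entry.group(1)))
    return episodes


def diff_episodes(out, done):
    """Return {epno: (ids only in out, ids only in done)} for changed episodes."""
    changes = {}
    for epno in sorted(set(out) | set(done)):
        added = out.get(epno, set()) - done.get(epno, set())
        removed = done.get(epno, set()) - out.get(epno, set())
        if added or removed:
            changes[epno] = (sorted(added), sorted(removed))
    return changes
```

Here "added" means an id that pm.net lists for the episode but the .done file does not, and "removed" is the reverse.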
Parse all output files a<AID>_[characters|pokemon].out to create a list of character and pokemon ids that do not appear in any of them: take the highest id, create the list 1-<highest id>, remove all appearing ids from the list, and write the remainder to missing_[characters|pokemon].out. Incorporate a blacklist_[characters|pokemon].in file to remove further ids from the list that aren't anime characters (pokemon live show characters, for example).
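The missing-id computation comes down to set subtraction. The sketch below assumes the appearing ids and the blacklist have already been read into collections of integers. The layout of the blacklist file is not specified, so loading it is left out. The name missing_ids is hypothetical.

```python
def missing_ids(appearing, blacklist=()):
    """Return the sorted ids in 1..max(appearing) that never appear and are not blacklisted."""
    appearing = set(appearing)
    if not appearing:
        return []  # no ids seen, so there is no upper bound to count up to
    candidates = set(range(1, max(appearing) + 1))
    return sorted(candidates - appearing - set(blacklist))
```

With appearing ids {1, 2, 5, 7} and blacklist {4}, the candidates are 1 through 7 and the result is [3, 6].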