User:Worf/Pokemon/Parser specs

Revision as of 22:46, 11 December 2010 by Worf (talk | contribs)

Scraper for pocketmonsters.net

Format for input file for episode mapping (created manually):

a<AID>.map

<anidb epno> || <pm.net epid>

1 || 1
2 || 2
...
S1 || 666

Script should read and parse each .map file to get the mapping between anidb episode numbers and the pm.net episode ids.
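
The mapping step could be sketched roughly as follows. This is only a minimal sketch: the function name parse_map is made up for illustration, and it assumes that lines without a numeric pm.net id (blank lines, "...", a header line) can simply be skipped. AniDB episode numbers are kept as strings so that specials such as S1 survive.

```python
def parse_map(text):
    """Parse the contents of an a<AID>.map file into {anidb epno: pm.net epid}."""
    mapping = {}
    for line in text.splitlines():
        line = line.strip()
        # Assumption: lines without the "||" separator carry no mapping.
        if not line or "||" not in line:
            continue
        epno, epid = (part.strip() for part in line.split("||", 1))
        # Assumption: pm.net episode ids are plain integers; skip anything else.
        if not epid.isdigit():
            continue
        mapping[epno] = int(epid)
    return mapping
```

The caller would read each a<AID>.map file and pass its text to parse_map.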

Script should then parse the pm.net episode pages one by one to extract information about characters and pokemon.

Example: http://pocketmonsters.net/episodes/276#Char

The links of interest have the following formats:

http://pocketmonsters.net/character/2

http://pocketmonsters.net/dex/16
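
One possible way to pull the ids out of an episode page is a pair of regular expressions over the raw HTML. This is a sketch under assumptions: the name extract_ids is hypothetical, and the patterns also accept relative links such as /dex/16 because the actual markup of the pm.net pages is not described here.

```python
import re

# Assumption: links may be absolute or relative, so only the path is matched.
CHARACTER_LINK = re.compile(r"/character/(\d+)")
DEX_LINK = re.compile(r"/dex/(\d+)")

def extract_ids(html):
    """Return (character ids, pokemon ids) found in one episode page,
    each as a sorted list without duplicates."""
    characters = sorted({int(m) for m in CHARACTER_LINK.findall(html)})
    pokemon = sorted({int(m) for m in DEX_LINK.findall(html)})
    return characters, pokemon
```

A real HTML parser may be more robust if the same paths also show up outside the character and pokemon lists.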

The parsed information should be stored in an output file called a<AID>_[characters|pokemon].out with the following format:

a<AID>_characters.out:

---275---
Pokemon Character #0002

---276---
Pokemon Character #0002
Pokemon Character #...

a<AID>_pokemon.out:

---275---
Pokemon #001
Pokemon #016

---276---
Pokemon #016
Pokemon #...

Character ids should be padded with zeros to a 4 digit number, pokemon ids should be padded with zeros to a 3 digit number. The number enclosed by --- is the episode number on anidb.
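
The output format above could be produced with something like the following sketch. The function name format_section and the kind argument are assumptions made for illustration; the labels and padding widths are taken from the examples above.

```python
# Label and zero-padding width per output file type.
FORMATS = {
    "characters": ("Pokemon Character", 4),
    "pokemon": ("Pokemon", 3),
}

def format_section(epno, ids, kind):
    """Format one episode block for a<AID>_characters.out or a<AID>_pokemon.out."""
    label, width = FORMATS[kind]
    lines = [f"---{epno}---"]
    lines += [f"{label} #{i:0{width}d}" for i in ids]
    return "\n".join(lines) + "\n"
```

Joining the sections of all episodes with a blank line in between gives the layout shown in the examples.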


Additional features:

Compare output files a<AID>_[characters|pokemon].out with existing files a<AID>_[characters|pokemon].done to find changes between the data on AniDB and the data on pm.net.
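
A rough sketch of the comparison, assuming the .done files use the same layout as the .out files (this is not stated above). The names parse_out and diff are hypothetical.

```python
import re

HEADER = re.compile(r"^---(.+)---$")   # e.g. ---275---
ENTRY = re.compile(r"#(\d+)\s*$")      # e.g. Pokemon #016

def parse_out(text):
    """Parse .out/.done content into {anidb epno: set of ids}."""
    episodes = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        header = HEADER.match(line)
        if header:
            current = header.group(1)
            episodes.setdefault(current, set())
            continue
        entry = ENTRY.search(line)
        if entry and current is not None:
            episodes[current].add(int(entry.group(1)))
    return episodes

def diff(out, done):
    """Return {epno: (ids only in out, ids only in done)} for changed episodes."""
    changes = {}
    for epno in sorted(set(out) | set(done)):
        new, old = out.get(epno, set()), done.get(epno, set())
        if new != old:
            changes[epno] = (sorted(new - old), sorted(old - new))
    return changes
```

Ids only in the .out file are present on pm.net but not in the existing data, and ids only in the .done file are the reverse.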

Parse all output files a<AID>_[characters|pokemon].out to create a list of character and pokemon ids that do not appear in any of them: take the highest id, create the list 1-<highest id>, remove every id that appears, and write the remainder to missing_[characters|pokemon].out. Use a blacklist_[characters|pokemon].in file to remove further ids from the list that aren't anime characters (pokemon live show characters, for example).
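
The missing-id calculation could look roughly like this. It is a minimal sketch: the name missing_ids is made up, and it assumes the ids from all .out files and from the blacklist file have already been collected as integers (the blacklist file format is not specified here).

```python
def missing_ids(seen, blacklist=()):
    """Return ids in 1..max(seen) that never appear and are not blacklisted."""
    seen = set(seen)
    if not seen:
        return []
    candidates = set(range(1, max(seen) + 1))
    return sorted(candidates - seen - set(blacklist))
```

For example, with ids 1, 2, 5 and 7 seen and 4 blacklisted, the remaining gaps are 3 and 6.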


v2

Compare the data already added to anidb with the data from pm.net using http://anidb.net/perl-bin/animedb.pl?show=report&report.type=worf_pokemon_characterlist&report.csv=1&do.report=Generate+Report

charid -> anidb.net character id
identifier -> pm.net character id of the character; http://pocketmonsters.net/character/<identifier> (remove leading 0s)
pokedexid -> pokedex id the character is a guise of; http://pocketmonsters.net/dex/<pokedexid> (remove leading 0s)
pokemon -> comma-separated list of pm.net character ids; list of pokemon the character owns
eids -> list of anidb.net episode ids the character appears in; check against pm.net->anidb.net mappings file
names -> comma-separated list of names; each element consists of name, type and language separated by ||
other -> description of the character
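
Two of the fields above could be handled along these lines. This is only a sketch: the helper names are hypothetical, the exact CSV layout of the report is not known here, and the names parser assumes that a name never contains a comma or || itself.

```python
from collections import namedtuple

Name = namedtuple("Name", "name type language")

def parse_names(field):
    """Split the names field into Name(name, type, language) tuples."""
    names = []
    for element in field.split(","):
        parts = [p.strip() for p in element.split("||")]
        # Assumption: elements that do not have exactly three parts are skipped.
        if len(parts) == 3:
            names.append(Name(*parts))
    return names

def pm_character_url(identifier):
    """Build the pm.net character URL, dropping leading zeros from the id."""
    return f"http://pocketmonsters.net/character/{int(identifier)}"
```

The same int() conversion would apply to pokedexid when building the http://pocketmonsters.net/dex/<pokedexid> link.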