User:Worf/Pokemon/Parser specs

From AniDB
Revision as of 09:36, 9 November 2010 by Worf (talk | contribs)

Scraper for pocketmonsters.net

Format of the input file for the episode mapping (created manually):

a<AID>.map

<anidb epno> || <pm.net epid>

1 || 1
2 || 2
...
S1 || 666

The script should read and parse each .map file to obtain the mapping between AniDB episode numbers and pm.net episode ids.
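
A minimal sketch of the .map parsing step, assuming one "<anidb epno> || <pm.net epid>" pair per line. The function name parse_map is hypothetical, and skipping lines without a numeric pm.net id (such as "...") is an assumption, not part of the spec. AniDB episode numbers are kept as strings because specials such as S1 are not numeric.

```python
def parse_map(text):
    """Return a list of (anidb_epno, pmnet_epid) pairs from .map file content."""
    mapping = []
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blank lines and anything without the "||" separator.
        if "||" not in line:
            continue
        epno, epid = (part.strip() for part in line.split("||", 1))
        # Assumption: pm.net episode ids are always plain integers.
        if not epno or not epid.isdigit():
            continue
        mapping.append((epno, int(epid)))
    return mapping


sample = "1 || 1\n2 || 2\n...\nS1 || 666\n"
pairs = parse_map(sample)
```

Returning a list rather than a dict preserves the order of the lines in the file, which may be convenient when writing the output sections in the same order.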

The script should then parse the pm.net episode pages one by one to extract information about the characters and pokemon in each episode.

Example: http://pocketmonsters.net/episodes/276#Char

The links of interest have the following formats:

http://pocketmonsters.net/character/2

http://pocketmonsters.net/dex/16
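
A rough sketch of how the ids might be pulled out of an episode page, assuming the page HTML has already been downloaded and that the links appear in href attributes either as absolute URLs or as site-relative paths. The name extract_ids and the sample markup are hypothetical; the real page structure has not been checked here.

```python
import re

# Matches href="http://pocketmonsters.net/character/2" as well as href="/dex/16".
LINK_RE = re.compile(
    r'href="(?:https?://(?:www\.)?pocketmonsters\.net)?/(character|dex)/(\d+)'
)


def extract_ids(html):
    """Return (character_ids, pokemon_ids) as sorted lists without duplicates."""
    characters, pokemon = set(), set()
    for kind, num in LINK_RE.findall(html):
        target = characters if kind == "character" else pokemon
        target.add(int(num))
    return sorted(characters), sorted(pokemon)


# Made-up markup, only to illustrate the two link formats.
sample_html = (
    '<a href="http://pocketmonsters.net/character/2">A</a>'
    '<a href="/dex/16">B</a>'
    '<a href="/dex/1">C</a>'
    '<a href="/dex/16">B again</a>'
)
chars, mons = extract_ids(sample_html)
```

A regular expression is enough for a sketch; a real HTML parser would be more robust if the links need to be restricted to a particular part of the page (for example the section behind the #Char anchor).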

The parsed information should be stored in two output files, a<AID>_characters.out and a<AID>_pokemon.out, with the following formats:

a<AID>_characters.out:

---275---
Pokemon Character #0002

---276---
Pokemon Character #0002
Pokemon Character #...

a<AID>_pokemon.out:

---275---
Pokemon #001
Pokemon #016

---276---
Pokemon #016
Pokemon #...

Character ids should be zero-padded to 4 digits and pokemon ids to 3 digits. The number enclosed by --- is the AniDB episode number.
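
The output format above could be produced roughly as follows. The function name format_section and the "characters"/"pokemon" kind argument are hypothetical; the labels and padding widths are taken from the examples above.

```python
def format_section(anidb_epno, ids, kind):
    """Format one episode section for a<AID>_characters.out or a<AID>_pokemon.out."""
    if kind == "characters":
        label, width = "Pokemon Character", 4
    else:
        label, width = "Pokemon", 3
    lines = [f"---{anidb_epno}---"]
    # Zero-pad each id to the width required for this kind of file.
    lines += [f"{label} #{i:0{width}d}" for i in ids]
    return "\n".join(lines) + "\n"


char_section = format_section("276", [2], "characters")
mon_section = format_section("275", [1, 16], "pokemon")
# Sections are separated by a blank line, as in the examples above.
pokemon_file = "\n".join([mon_section, format_section("276", [16], "pokemon")])
```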


Additional features:

Compare the output files a<AID>_[characters|pokemon].out with the existing files a<AID>_[characters|pokemon].done to find differences between the data on AniDB and the data on pm.net.
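
One possible way to do this comparison, assuming the .done files use the same section layout as the .out files (the spec does not say so explicitly). The helper names parse_sections and diff_sections and the keys of the result are hypothetical.

```python
def parse_sections(text):
    """Map each ---<epno>--- header to the set of entry lines below it."""
    sections, current = {}, None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("---") and line.endswith("---") and len(line) > 6:
            current = line.strip("-")
            sections[current] = set()
        elif current is not None:
            sections[current].add(line)
    return sections


def diff_sections(out_text, done_text):
    """Return, per episode, the entries found in only one of the two files."""
    out, done = parse_sections(out_text), parse_sections(done_text)
    changes = {}
    for epno in sorted(set(out) | set(done)):
        only_out = sorted(out.get(epno, set()) - done.get(epno, set()))
        only_done = sorted(done.get(epno, set()) - out.get(epno, set()))
        if only_out or only_done:
            changes[epno] = {"only_in_out": only_out, "only_in_done": only_done}
    return changes


# Made-up file contents for illustration.
out_text = "---275---\nPokemon #001\nPokemon #016\n\n---276---\nPokemon #016\n"
done_text = "---275---\nPokemon #001\n\n---276---\nPokemon #016\nPokemon #025\n"
changes = diff_sections(out_text, done_text)
```

Comparing sets per episode rather than diffing the files line by line means that a different ordering of entries within an episode is not reported as a change.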

Parse all output files a<AID>_[characters|pokemon].out to create a list of character and pokemon ids that do not appear in any of them: take the highest id, create the list 1-<highest id>, remove every id that appears, and write the remaining ids to missing_[characters|pokemon].out. Incorporate a blacklist_[characters|pokemon].in file to remove further ids from the list that aren't anime characters (pokemon live show characters, for example).
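
The missing-id step might look like the sketch below. The function name missing_ids is hypothetical, and the blacklist is passed in as a collection of integers because the format of the blacklist file is not specified here.

```python
import re

# Entry lines end in "#<digits>"; headers and placeholder lines do not match.
ID_RE = re.compile(r"#(\d+)\s*$")


def missing_ids(out_texts, blacklist=()):
    """Return ids in 1..<highest id> that appear in none of the given .out texts."""
    seen = set()
    for text in out_texts:
        for line in text.splitlines():
            match = ID_RE.search(line)
            if match:
                seen.add(int(match.group(1)))
    if not seen:
        return []
    candidates = set(range(1, max(seen) + 1))
    # Drop ids that appear somewhere, then drop blacklisted ids.
    return sorted(candidates - seen - set(blacklist))


# Made-up file contents for illustration.
texts = [
    "---275---\nPokemon #001\nPokemon #016\n",
    "---276---\nPokemon #003\n",
]
missing = missing_ids(texts, blacklist=[2, 4])
```

The same function would be run once over all character files and once over all pokemon files, since the two id spaces are independent.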