A very labour intensive task is the generation of all the statistics and counters for the AniDB database entries. Optimization of this process is therefore on the todo list.
The data which is currently collected by the main stats update script.
- episodes added for anime
- files added for anime
- groups subbing the anime
- users collecting the anime
- total size of all files for this anime
- anime subbed
- files released
- users collecting releases by this group
- total size of all files by this group
- files added for this episode
- users collecting this episode
- users with this file in MyList (split according to MyList state: unknown/hdd/cd/deleted)
- anime in MyList
- episodes in MyList
- files in MyList
- size of files in MyList
- last anime added to MyList
- date of last MyList addition
- anime added to AniDB
- episodes added to AniDB
- files added to AniDB
- groups added to AniDB
- producers added to AniDB
- anime titles added to AniDB
- anime categories added to AniDB
- anime-producer relation added to AniDB
- anime-group comments added to AniDB
- review comments added to AniDB
- reviews added to AniDB
- votes added to AniDB
- lame files (no ed2k link)
- lame files percentage
- independence percentage
- leech percentage
- number of watched episodes
- watched percentage for MyList
- watched percentage for AniDB
- collected percentage for AniDB
3 times a week a script is run to update the counters. The script will read all relevant tables in chunks. I.e. in order to update the anime stats it will gather all episode, file and MyList information for the anime with aid 1-249, 250-499, 500-749, ... This is done in order to limit the memory usage during calculation. The data collected is stored inside of an in-memory Perl hash and any required updates are written back to the database at the end of each chunk in one transaction.
This leads to one big, monolithic cronjob which creates a lot of database, memory and CPU load.
Current Runtimes: (13.05.2007)
- Anime Stats
- 1525 seconds (25 minutes) for <5210 anime
- Group Stats
- 7 seconds for <4476 groups
- Anime/Ep/File counters
- 1633 seconds (27 minutes) for 324901 file and 73193 ep entries
- User stats
- 4437 seconds (74 minutes) for 221037 users
- Total: 7602 seconds (127 minutes)
The process is mostly limited by the database (the script uses about 900 seconds of CPU time).
The key issue here is that these numbers rise all the time. In the early days we've run that script multiple times a day, then once a day, now 3 times a week. If things continue as they are now, we'll reach a point where we can't run it at all any more the way it works right now.
Some general ideas. Some of them could also be combined.
On-The-Fly Updating / Triggers
We could reduce the interval between runs of the statistic update script greatly if all important values would be updated on the fly. As we'd probably not handle all possible cases we might not be able to remove the script all together, but we might be able to run it only once a week or once a month.
The question here is whether the additional load these on-the-fly updates impose on the database might be a problem.
One approach to realize this with acceptable work effort required, would be a number of database triggers and corresponding PL/pgSQL functions which transparently update all relevant counters and stats. This could however lead to some performance issues, i.e. think about a user who deletes an entire anime with 100 eps from his MyList. This would yield at least 100 trigger invocations for any listtb based trigger which updates user stats. So instead of decreasing the users file count by 100 in one action, it would effectively be decreased by 1, 100 times.
Read-Only database slave for stats work
Another approach would be to introduce a read-only database slave (i.e. with Slony) and to execute all read queries of the stats update scripts on this database. As the scripts only write to the database if any value has actually changed this would greatly reduce the load on the main database. The issue here is whether we'd get the hardware resources to do this and whether it scales.
The script could be run at shorter intervals and calculate only a part of the stats during each run. I.e. we could run it 3 times a day and process 250 anime in each run...
The current approach gathers data for all DB entries. It doesn't matter whether any of their stats values are likely to have changed. This is especially problematic for the user stats. With each stats update we're collecting the data for all users, even though only a small percentage of them has done any changes to AniDB. They might not even have logged in since the last stats update.
Possible approaches for this would be to:
- skip users who haven't logged in since last update
- add a "dirty" boolean flag to entries in the user table which is set whenever a user makes a change to AniDB which is potentially relevant for his stats.
- ... ?