Maintenance DEV
Revision as of 15:59, 13 May 2007

{{TOCright}}
== AniDB Stats ==
A very labour-intensive task is the generation of all the statistics and counters for the AniDB database entries. Optimization of this process is therefore on the todo list.
=== Data ===
The data currently collected by the main stats update script:
==== Anime ====
* eps added for anime
* files added for anime
* groups subbing the anime
* users collecting the anime
* total size of all files for this anime

==== Group ====
* animes subbed
* files released
* users collecting releases by this group
* total size of all files by this group

==== Episode ====
* files added for this ep
* users collecting this ep

==== File ====
* users with this file in mylist (split according to mylist state: unknown/hdd/cd/deleted)

==== User ====
* animes in mylist
* eps in mylist
* files in mylist
* size of files in mylist
* last anime added to mylist
* date of last mylist addition
* animes added to anidb
* eps added to anidb
* files added to anidb
* groups added to anidb
* producers added to anidb
* anime titles added to anidb
* anime categories added to anidb
* anime-producer relations added to anidb
* anime-group comments added to anidb
* review comments added to anidb
* reviews added to anidb
* votes added to anidb
* lame files (no ed2k link)
* lame files percentage
* independence percentage
* leech percentage
* number of watched eps
* watched percentage for mylist
* watched percentage for anidb
* collected percentage for anidb
=== Current Approach ===
Three times a week a script is run to update the counters. The script reads all relevant tables in chunks: to update the anime stats, for example, it gathers all episode, file and mylist information for the animes with aid 1-249, then 250-499, 500-749, and so on. This is done to limit memory usage during the calculation. The collected data is stored in an in-memory Perl hash, and any required updates are written back to the database at the end of each chunk in one transaction.

This leads to one big, monolithic cronjob which creates a lot of database, memory and CPU load.
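The chunked approach above can be sketched roughly as follows. This is a minimal Python illustration, not the actual Perl script: a plain list of rows stands in for the episode table, and the names <code>CHUNK_SIZE</code>, <code>chunked_ranges</code> and <code>update_anime_stats</code> are invented for the example.

```python
# Illustrative sketch of the chunked stats update, not AniDB's actual code.
# A list of (aid, ...) tuples stands in for the episode table.

CHUNK_SIZE = 250  # animes processed per chunk, roughly as in the real script

def chunked_ranges(max_aid, chunk_size=CHUNK_SIZE):
    """Yield (low, high) aid ranges: (1, 250), (251, 500), ..."""
    low = 1
    while low <= max_aid:
        yield low, min(low + chunk_size - 1, max_aid)
        low += chunk_size

def update_anime_stats(episode_rows, max_aid):
    """Aggregate per-anime episode counts one chunk at a time.

    Keeping only one chunk's counters in memory at once is what bounds
    the script's memory usage; each chunk's results would be written
    back to the database in a single transaction.
    """
    stats = {}
    for low, high in chunked_ranges(max_aid):
        counters = {}  # in-memory hash for this chunk only
        for aid, _ in (r for r in episode_rows if low <= r[0] <= high):
            counters[aid] = counters.get(aid, 0) + 1
        stats.update(counters)  # stands in for the per-chunk DB transaction
    return stats
```

The same pattern applies to the group, file and user stats; only the source tables and the aggregated counters differ.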
Current runtimes (13.05.2007):
* Anime stats: 1525 seconds (25 minutes) for <5210 animes
* Group stats: 7 seconds for <4476 groups
* Anime/ep/file counters: 1633 seconds (27 minutes) for 324901 file and 73193 ep entries
* User stats: 4437 seconds (74 minutes) for 221037 users
* Total: 7602 seconds (127 minutes)
The process is mostly limited by the database; the script itself uses only about 900 seconds of CPU time.

The key issue is that these numbers rise all the time. In the early days we ran the script multiple times a day, then once a day, and now three times a week. If things continue as they are, we will reach a point where the script cannot be run at all in its current form.
=== Possible Alternatives ===
Some general ideas; some of them could also be combined.
==== On-The-Fly Updating / Triggers ====
We could greatly reduce the interval between runs of the stats update script if all important values were updated on the fly. As we would probably not handle every possible case, we might not be able to remove the script altogether, but we might be able to run it only once a week or once a month.

The question here is whether the additional load these on-the-fly updates impose on the database would become a problem. One way to realize this with an acceptable amount of work would be a number of database triggers and corresponding PL/pgSQL functions which transparently update all relevant counters and stats.
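In a real deployment this would be a PL/pgSQL trigger; the logic it would implement can be sketched in Python like this (class and field names are invented for illustration):

```python
# Sketch of the on-the-fly idea: every write that can affect a counter
# also updates that counter immediately, the way an AFTER INSERT/DELETE
# trigger would inside the same transaction. All names are invented.

class StatsStore:
    def __init__(self):
        self.mylist = set()          # (uid, fid) pairs, stand-in for the mylist table
        self.files_in_mylist = {}    # uid -> counter kept current by "triggers"

    def add_to_mylist(self, uid, fid):
        if (uid, fid) in self.mylist:
            return
        self.mylist.add((uid, fid))
        # trigger-equivalent: bump the per-user counter in the same transaction
        self.files_in_mylist[uid] = self.files_in_mylist.get(uid, 0) + 1

    def remove_from_mylist(self, uid, fid):
        if (uid, fid) not in self.mylist:
            return
        self.mylist.remove((uid, fid))
        # trigger-equivalent for the delete path
        self.files_in_mylist[uid] -= 1
```

Because each counter is corrected at write time, the periodic script would only be needed as a consistency check for cases the triggers don't cover.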
==== Read-Only Database Slave for Stats Work ====
Another approach would be to introduce a read-only database slave (e.g. with Slony) and to execute all read queries of the stats update scripts against this slave. As the scripts only write to the database when a value has actually changed, this would greatly reduce the load on the main database. The open questions here are whether we would get the hardware resources to do this and whether it scales.
==== Small Updates ====
The script could be run at shorter intervals and calculate only a part of the stats during each run. For example, we could run it three times a day and process 250 animes in each run.
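Scheduling the slices could look roughly like this (a sketch; <code>BATCH</code> and the run counter are invented, and the real script would persist the counter between cron runs):

```python
# Sketch: each run processes the next slice of animes, cycling through
# the whole aid range over several runs. Names are invented for illustration.

BATCH = 250  # animes per run

def slice_for_run(run_number, max_aid, batch=BATCH):
    """Return the (low, high) aid range a given run should process."""
    slices = (max_aid + batch - 1) // batch   # slices in one full cycle
    idx = run_number % slices                 # wrap around after a full cycle
    low = idx * batch + 1
    return low, min(low + batch - 1, max_aid)
```

With three runs a day and 250 animes per run, a full pass over ~5200 animes would take about a week, so every counter would still be refreshed regularly while each individual run stays cheap.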
==== Dirty Flag ====
The current approach gathers data for all DB entries, regardless of whether any of their stats values are likely to have changed. This is especially problematic for the user stats: with each stats update we collect the data for all users, even though only a small percentage of them have made any changes to AniDB since the last run. Many might not even have logged in since the last stats update.

Possible approaches would be to:
* skip users who haven't logged in since the last update
* add a "dirty" boolean flag to entries in the user table which is set whenever a user makes a change to AniDB that is potentially relevant for his stats
* ... ?
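The dirty-flag variant can be sketched as follows (field and method names invented; in practice the flag would be a boolean column on the user table):

```python
# Sketch of the dirty-flag idea. All names are invented for illustration.

class UserStats:
    def __init__(self, uids):
        self.dirty = {uid: False for uid in uids}  # stand-in for the flag column
        self.recomputed = []                       # which users a run touched

    def mark_dirty(self, uid):
        """Called from any code path that changes stats-relevant data."""
        self.dirty[uid] = True

    def run_stats_update(self):
        """Recompute stats only for flagged users, then clear their flags."""
        for uid, is_dirty in self.dirty.items():
            if is_dirty:
                self.recomputed.append(uid)  # stands in for the real recompute
                self.dirty[uid] = False
```

A run then scales with the number of active users since the last update rather than with the total user count, which is exactly where the current script spends most of its 74 minutes.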
==== ? ====
What else?