Maintenance DEV

Latest revision as of 19:21, 15 May 2009

AniDB Stats

A very labour-intensive task is the generation of all the statistics and counters for the AniDB database entries. Optimizing this process is therefore on the todo list.

Data

The data currently collected by the main stats update script:

Anime

  • episodes added for anime
  • files added for anime
  • groups subbing the anime
  • users collecting the anime
  • total size of all files for this anime

Group

  • anime subbed
  • files released
  • users collecting releases by this group
  • total size of all files by this group

Episode

  • files added for this episode
  • users collecting this episode

File

  • users with this file in MyList (split according to MyList state: unknown/hdd/cd/deleted)

User

  • anime in MyList
  • episodes in MyList
  • files in MyList
  • size of files in MyList
  • last anime added to MyList
  • date of last MyList addition


  • anime added to AniDB
  • episodes added to AniDB
  • files added to AniDB
  • groups added to AniDB
  • producers added to AniDB
  • anime titles added to AniDB
  • anime categories added to AniDB
  • anime-producer relation added to AniDB
  • anime-group comments added to AniDB
  • review comments added to AniDB
  • reviews added to AniDB
  • votes added to AniDB
  • lame files (no ed2k link)
  • lame files percentage
  • independence percentage
  • leech percentage
  • number of watched episodes
  • watched percentage for MyList
  • watched percentage for AniDB
  • collected percentage for AniDB

Current Approach

3 times a week, a script is run to update the counters. The script reads all relevant tables in chunks: to update the anime stats, for example, it gathers all episode, file and MyList information for the anime with aid 1-249, then 250-499, 500-749, and so on. This is done to limit memory usage during calculation. The data collected is stored in an in-memory Perl hash, and any required updates are written back to the database at the end of each chunk in one transaction.
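As a rough sketch of what one chunk's read pass might look like (the table and column names below — filetb, eptb, size — are assumptions for illustration, not the real AniDB schema), the per-anime aggregation for a single 250-anime chunk could be a query like:

```sql
-- Hypothetical per-chunk aggregation: one query per 250-anime window.
-- The results land in the in-memory Perl hash; changed rows are then
-- written back in a single transaction at the end of the chunk.
SELECT f.aid,
       COUNT(DISTINCT e.eid) AS ep_count,
       COUNT(f.fid)          AS file_count,
       SUM(f.size)           AS total_size
FROM   filetb f
JOIN   eptb   e ON e.eid = f.eid
WHERE  f.aid BETWEEN 1 AND 249
GROUP  BY f.aid;
```

The next chunk repeats the same query with the window shifted to aid 250-499, and so on until the anime table is exhausted.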

This leads to one big, monolithic cronjob which creates a lot of database, memory and CPU load.

Current runtimes (13.05.2007):

  • Anime Stats
    • 1525 seconds (25 minutes) for <5210 anime
  • Group Stats
    • 7 seconds for <4476 groups
  • Anime/Ep/File counters
    • 1633 seconds (27 minutes) for 324901 file and 73193 ep entries
  • User stats
    • 4437 seconds (74 minutes) for 221037 users
  • Total: 7602 seconds (127 minutes)

The process is mostly limited by the database (the script uses about 900 seconds of CPU time).

The key issue is that these numbers rise all the time. In the early days we ran the script multiple times a day, then once a day; now it runs 3 times a week. If things continue as they are, we'll eventually reach a point where we can't run it at all the way it currently works.

Possible Alternatives

Some general ideas. Some of them could also be combined.

On-The-Fly Updating / Triggers

We could greatly reduce the interval between runs of the statistics update script if all important values were updated on the fly. As we'd probably not handle all possible cases, we might not be able to remove the script altogether, but we might be able to run it only once a week or once a month.

The question here is whether the additional load these on-the-fly updates impose on the database might be a problem.

One approach to realizing this with acceptable effort would be a number of database triggers and corresponding PL/pgSQL functions that transparently update all relevant counters and stats. This could, however, lead to performance issues: consider a user who deletes an entire anime with 100 eps from his MyList. That would yield at least 100 trigger invocations for any listtb-based trigger that updates user stats. So instead of decreasing the user's file count by 100 in one action, it would effectively be decreased by 1, 100 times.
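A minimal sketch of such a trigger, assuming hypothetical column names (uid, file_count) on usertb and listtb — the real schema may differ:

```sql
-- Hypothetical trigger keeping a user's MyList file count current.
CREATE FUNCTION mylist_count_trg() RETURNS trigger AS $$
BEGIN
  IF TG_OP = 'INSERT' THEN
    UPDATE usertb SET file_count = file_count + 1 WHERE uid = NEW.uid;
  ELSIF TG_OP = 'DELETE' THEN
    UPDATE usertb SET file_count = file_count - 1 WHERE uid = OLD.uid;
  END IF;
  RETURN NULL;  -- AFTER trigger, return value is ignored
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER mylist_count AFTER INSERT OR DELETE ON listtb
  FOR EACH ROW EXECUTE PROCEDURE mylist_count_trg();
```

Note that this sketch exhibits exactly the problem described above: a bulk delete of 100 MyList rows fires the trigger 100 times, issuing 100 single-row updates against usertb.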

Read-Only Database Slave for Stats Work

Another approach would be to introduce a read-only database slave (e.g. with Slony) and execute all read queries of the stats update script on this database. As the script only writes to the database when a value has actually changed, this would greatly reduce the load on the main database. The open questions here are whether we'd get the hardware resources to do this and whether it scales.

Small Updates

The script could be run at shorter intervals and calculate only a part of the stats during each run. For example, we could run it 3 times a day and process 250 anime in each run.
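One way to select each run's slice — a sketch only, with a hypothetical animetb table and a run counter (:run_no, advancing and wrapping each run) stored somewhere outside the query:

```sql
-- Hypothetical slice selection: run N processes the N-th block of 250
-- anime, so at 3 runs/day the full table is refreshed over several days.
SELECT aid
FROM   animetb
ORDER  BY aid
LIMIT  250 OFFSET :run_no * 250;
```

The trade-off is freshness: each individual anime's stats would then be updated less often than the script runs, so the slice size and run frequency would have to be balanced against how stale the counters are allowed to get.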

Dirty Flag

The current approach gathers data for all DB entries, regardless of whether any of their stats values are likely to have changed. This is especially problematic for the user stats: with each stats update we collect the data for all users, even though only a small percentage of them have made any changes to AniDB. They might not even have logged in since the last stats update.

Possible approaches for this would be to:

  • skip users who haven't logged in since last update
  • add a "dirty" boolean flag to entries in the user table, set whenever a user makes a change to AniDB that is potentially relevant to their stats.
  • ... ?

?

what else?