Maintenance DEV
Latest revision as of 19:21, 15 May 2009
AniDB Stats
Generating all the statistics and counters for the AniDB database entries is a very labour-intensive task. Optimization of this process is therefore on the todo list.
Data
The following data is currently collected by the main stats update script.
Anime
- episodes added for anime
- files added for anime
- groups subbing the anime
- users collecting the anime
- total size of all files for this anime
Group
- anime subbed
- files released
- users collecting releases by this group
- total size of all files by this group
Episode
- files added for this episode
- users collecting this episode
File
- users with this file in MyList (split according to MyList state: unknown/hdd/cd/deleted)
User
- anime in MyList
- episodes in MyList
- files in MyList
- size of files in MyList
- last anime added to MyList
- date of last MyList addition
- anime added to AniDB
- episodes added to AniDB
- files added to AniDB
- groups added to AniDB
- producers added to AniDB
- anime titles added to AniDB
- anime categories added to AniDB
- anime-producer relation added to AniDB
- anime-group comments added to AniDB
- review comments added to AniDB
- reviews added to AniDB
- votes added to AniDB
- lame files (no ed2k link)
- lame files percentage
- independence percentage
- leech percentage
- number of watched episodes
- watched percentage for MyList
- watched percentage for AniDB
- collected percentage for AniDB
Current Approach
3 times a week a script is run to update the counters. The script reads all relevant tables in chunks: for example, to update the anime stats it gathers all episode, file and MyList information for the anime with aid 1-249, then 250-499, 500-749, and so on. This is done to limit memory usage during calculation. The collected data is stored in an in-memory Perl hash, and any required updates are written back to the database at the end of each chunk in one transaction.
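The chunked read-compute-write cycle described above can be sketched as follows. This is a minimal illustration using sqlite3 in place of PostgreSQL; the table and column names (`filetb`, `animetb`, `file_count`) are assumptions for the sketch, not AniDB's actual schema:

```python
import sqlite3

CHUNK = 250  # anime ids processed per chunk, as in the 1-249, 250-499, ... scheme

def update_anime_stats(conn: sqlite3.Connection, max_aid: int) -> None:
    """Recompute per-anime file counts chunk by chunk.

    Reads one aid range at a time to bound memory use, accumulates the
    stats in an in-memory dict (standing in for the Perl hash), and
    writes the whole chunk back in a single transaction.
    """
    for lo in range(1, max_aid + 1, CHUNK):
        hi = lo + CHUNK - 1
        counts = {}  # aid -> number of files seen in this chunk
        for (aid,) in conn.execute(
            "SELECT aid FROM filetb WHERE aid BETWEEN ? AND ?", (lo, hi)
        ):
            counts[aid] = counts.get(aid, 0) + 1
        with conn:  # one transaction per chunk
            for aid, n in counts.items():
                conn.execute(
                    "UPDATE animetb SET file_count = ? WHERE aid = ?", (n, aid)
                )
```

The chunking keeps peak memory proportional to 250 anime rather than the whole table, at the cost of one full scan per stats category.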
This leads to one big, monolithic cronjob which creates a lot of database, memory and CPU load.
Current Runtimes: (13.05.2007)
- Anime Stats
- 1525 seconds (25 minutes) for <5210 anime
- Group Stats
- 7 seconds for <4476 groups
- Anime/Ep/File counters
- 1633 seconds (27 minutes) for 324901 file and 73193 ep entries
- User stats
- 4437 seconds (74 minutes) for 221037 users
- Total: 7602 seconds (127 minutes)
The process is mostly limited by the database (the script uses about 900 seconds of CPU time).
The key issue here is that these numbers rise all the time. In the early days we ran the script multiple times a day, then once a day, and now 3 times a week. If things continue as they are, we will reach a point where, the way it works right now, we cannot run it at all any more.
Possible Alternatives
Some general ideas. Some of them could also be combined.
On-The-Fly Updating / Triggers
We could greatly reduce how often the statistics update script needs to run if all important values were updated on the fly. As we would probably not handle all possible cases, we might not be able to remove the script altogether, but we might be able to run it only once a week or once a month.
The question here is whether the additional load these on-the-fly updates impose on the database might be a problem.
One approach to realize this with an acceptable amount of work would be a number of database triggers and corresponding PL/pgSQL functions which transparently update all relevant counters and stats. This could, however, lead to performance issues: consider a user who deletes an entire anime with 100 eps from his MyList. This would yield at least 100 trigger invocations for any listtb-based trigger which updates user stats. So instead of decreasing the user's file count by 100 in one action, it would effectively be decreased by 1, 100 times.
Read-Only database slave for stats work
Another approach would be to introduce a read-only database slave (e.g. with Slony) and to execute all read queries of the stats update scripts against it. As the scripts only write to the database when a value has actually changed, this would greatly reduce the load on the main database. The open questions here are whether we would get the hardware resources to do this and whether it scales.
Small Updates
The script could be run at shorter intervals and calculate only a part of the stats during each run. For example, we could run it 3 times a day and process 250 anime in each run.
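One simple way to split the work is to keep a persistent cursor and advance it by a fixed batch size each run, wrapping around at the end of the table. This is a sketch under that assumption; the batch size and wrap-around policy are illustrative, not a decided design:

```python
def next_batch(cursor: int, batch: int, max_aid: int) -> tuple[range, int]:
    """Return the aid range for this run and the cursor for the next run.

    Wraps around to aid 1 once the end of the table is reached, so every
    anime is revisited roughly every (max_aid / batch) runs.
    """
    lo = cursor
    hi = min(lo + batch - 1, max_aid)
    new_cursor = 1 if hi >= max_aid else hi + 1
    return range(lo, hi + 1), new_cursor
```

With ~5210 anime and a batch of 250 at 3 runs per day, each entry would be refreshed about once a week, which is comparable to the current full-run frequency but with much smaller per-run load.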
Dirty Flag
The current approach gathers data for all DB entries, regardless of whether any of their stats values are likely to have changed. This is especially problematic for the user stats: with each stats update we collect the data for all users, even though only a small percentage of them have made any changes to AniDB. They might not even have logged in since the last stats update.
Possible approaches for this would be to:
- skip users who haven't logged in since last update
- add a "dirty" boolean flag to entries in the user table which is set whenever a user makes a change to AniDB that is potentially relevant for their stats.
- ... ?
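The dirty-flag idea could look roughly like the sketch below (again sqlite3; the `dirty` column, the `mark_dirty` hook, and the stat being recomputed are all assumptions for illustration):

```python
import sqlite3

def mark_dirty(conn: sqlite3.Connection, uid: int) -> None:
    """Called from any code path that changes stats-relevant data for a user."""
    conn.execute("UPDATE usertb SET dirty = 1 WHERE uid = ?", (uid,))

def update_dirty_user_stats(conn: sqlite3.Connection) -> int:
    """Recompute stats only for flagged users, then clear their flags.

    Returns the number of users that actually needed work, i.e. the
    savings over a full pass across all users.
    """
    dirty = [uid for (uid,) in
             conn.execute("SELECT uid FROM usertb WHERE dirty = 1")]
    for uid in dirty:
        n = conn.execute(
            "SELECT COUNT(*) FROM listtb WHERE uid = ?", (uid,)
        ).fetchone()[0]
        conn.execute(
            "UPDATE usertb SET file_count = ?, dirty = 0 WHERE uid = ?", (n, uid)
        )
    return len(dirty)
```

If only a small percentage of users change anything between runs, the update touches only that percentage of rows instead of all 221037 users.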
?
what else?