OstDB DEV Foosic: Difference between revisions

comment on epoximators approach
(comment on epoximators approach)
Line 33: Line 33:


--[[User:Epoximator|Epoximator]] 08:39, 10 May 2007 (UTC)
--[[User:Epoximator|Epoximator]] 08:39, 10 May 2007 (UTC)
I see several problems with this approach:
* there is no natural relation between ofids and fingerprints, fingerprints are matched to songs, not files. However you imply such a relation by using ofids as ids for fingerprints on the slave servers. I.e. a case where this would become a problem:
** Lets say a fingerprint is known to anidb and is loosely (foosic fp match) shared between multiple files. Your approach would relate all these files to one another (or to one specific file from the set). If it now becomes apparent that one of these files, possibly the one whose ofid you used as fingerprint id, is in fact wrongly matched to the song in question and is moved to another song, we'd effectively be changing the primary key of the fingerprint, on multiple decentralized servers... bad idea
** And using the songid instead won't work either. As there won't be a ostfile<->song match at the time when the fingerprints are processed. I.e. we might end up with a group of files which all share similar fingerprints but which are all not yet matched to a song. The fact that tey're all sharing a similar fingerprint is highly relavant for the matching process though. So we can't delay the fingerprint analysis inorder to be able to use the song ids as keys. Which is why I'd suggest that we just create new unique ids for the fingerprints.
* load balancing:
** what are "matching slaves" ?
** what do you mean with "the server with the worst results (or none) for storage"? the error margin on found matches is hardly useful to determine the average load each server is under. Instead I'd suggest to return the total number of stored fingerprints on each server with the reply and store the fingerprint on the server with the smallest fingerprint count.
** "returns best matches", only if the match is good enough to be considered a real "match". So in many cases most of the slaves would probably reply that the fingerprint is unknown.
** I do agree that it is probably useful to remember which server a specific fingerprint id is stored on. The easiest/efficient approach for that would be to use a couple of the high bits of the ids to store that info. I.e. we could use the highest 4bits (skipping the sign bit) of a 32bit signed integer, leaving us with an effective 27bit room of ids. I guess we wouldn't reach the 130 million entry mark anytime soon. Neither are we likely to fork of to more than 16 servers. Both numbers could easily be extended later by switching to int8 values and using a larger room in the header for the server selection, i.e. 16bit. That would mean that each server would have it's own unique sequence for fingerprint ids (starting at 0) and we'd simply AND a server specific bitmask to it to obtain the external id, when transmitting the data. Which means that the DB ids on the servers are not unique and will not change if we ever decide to change the splitting between server bits and id bits.
* "anidb2 cronjob adds new ostfile relations based on the results", in cases where we already know other files with a matching fingerprint and those are already related to a specific song. If the other similar files are also "song-less", then the matching/group information would remain unused until someone opens the manual matching website. Where those files could be grouped together and would by default all be related to the song which the user selects.
: [[User:Exp|Exp]] 13:45, 11 May 2007 (UTC)


== Possible Extension ==
== Possible Extension ==
MediaWiki spam blocked by CleanTalk.
MediaWiki spam blocked by CleanTalk.