Talk:OstDB DEV Foosic

From AniDB
Latest revision as of 09:27, 14 May 2009

Protocol

I don't see any reason to use cleartext/ASCII here (?):

  • this is an internal protocol, so it doesn't have to be readable for the masses
  • it'll be plain and simple without much need for future expansions
  • all the datatypes used are well defined (constant length)

I assume we'll use encryption? At least between the redirector and the matching servers --Epoximator 13:29, 21 May 2007 (UTC)

This service will not be limited to internal use. The matching redirector is open to the public, and public protocols should be ASCII; it is simply much easier to debug and work with. And as the protocol needs are very similar, I don't think it's a good idea to use different protocols for the matching server<->matching redirector and matching redirector<->client/cron job links.
I don't know whether we'll need encryption. Though just slapping shared-secret AES encryption on top of the TCP connection between the matching servers and the matching redirector would probably be very easy to do.
Exp 21:23, 21 May 2007 (UTC)
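That shared-secret AES layer could look roughly like the sketch below, assuming AES-GCM from the third-party `cryptography` package; the framing and the names `seal_frame`/`open_frame` are invented for illustration, not part of any existing AniDB code:

```python
# Sketch: shared-secret AES-GCM framing for the redirector<->matching-server
# TCP link. Assumes the third-party `cryptography` package; seal_frame and
# open_frame are illustrative names, not real AniDB interfaces.
import os
import struct

from cryptography.hazmat.primitives.ciphers.aead import AESGCM

SHARED_KEY = os.urandom(32)  # in practice: a pre-shared 256-bit secret


def seal_frame(key: bytes, payload: bytes) -> bytes:
    """Encrypt one message: 4-byte length || 12-byte nonce || ciphertext+tag."""
    nonce = os.urandom(12)
    ct = AESGCM(key).encrypt(nonce, payload, None)
    body = nonce + ct
    return struct.pack(">I", len(body)) + body


def open_frame(key: bytes, frame: bytes) -> bytes:
    """Decrypt a frame produced by seal_frame (raises if tampered with)."""
    (length,) = struct.unpack(">I", frame[:4])
    body = frame[4:4 + length]
    nonce, ct = body[:12], body[12:]
    return AESGCM(key).decrypt(nonce, ct, None)


wire = seal_frame(SHARED_KEY, b"MATCH ...")
assert open_frame(SHARED_KEY, wire) == b"MATCH ..."
```

GCM gives integrity as well as confidentiality, so a tampered frame fails loudly instead of decrypting to garbage.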
i don't see why it should be public. we already have a public interface and that is the UDP API. (the MATCH command doesn't make any sense for external use ATM because of the ofid argument, BTW.) it shouldn't make much difference to performance either. and the clients would still have to communicate with the UDP API to get any useful info. adding a new public service will just lead to more confusion and expose a service whose main function is only for internal usage --Epoximator 06:49, 22 May 2007 (UTC)
if we don't make it available as a public service the UDP API would have to take its place. Some usage scenarios will require a way for clients to find out potential ofids for a file by fingerprint. E.g. a music player plug-in which retrieves file metadata (ID3 tags) from AniDB (like the MusicBrainz plugins do) would first try to identify a file by content hash via the UDP API. But if that fails it would try to get a potential match via the foosic fingerprint.
though you're probably right. It would be a lot simpler for client developers if they could simply send a single lookup query with size, content hash and fingerprint to the UDP API and get either a definitive or a potential match as result, instead of having to contact multiple APIs.
that would mean that the UDP API would need to do internal lookups via the matching redirector for lookup queries where the content hash is unknown, and delay its reply to the requesting client until it got the matching data from the redirector. seems feasible.
Exp 07:25, 22 May 2007 (UTC)
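The combined lookup described above (definitive match by content hash first, fingerprint fallback via the redirector) can be sketched with toy in-memory stand-ins; all data structures and names here are hypothetical, not real AniDB interfaces:

```python
# Toy sketch of the single-query lookup the UDP API could offer:
# definitive match by (size, content hash) first, otherwise a potential
# match via the foosic fingerprint through the matching redirector.

# known files: (size, content hash) -> ofid
FILES_BY_HASH = {(3_145_728, "a" * 32): 1001}
# redirector stand-in: fingerprint -> list of (ofid, match_value) candidates
FINGERPRINT_INDEX = {"fp-xyz": [(1002, 0.92), (1003, 0.40)]}


def lookup(size, content_hash, fingerprint, threshold=0.8):
    """Return ('definitive', ofid), ('potential', ofid) or ('unknown', None)."""
    ofid = FILES_BY_HASH.get((size, content_hash))
    if ofid is not None:
        return ("definitive", ofid)
    # fall back to the matching redirector (a delayed reply in the real system)
    candidates = FINGERPRINT_INDEX.get(fingerprint, [])
    good = [(o, m) for o, m in candidates if m >= threshold]
    if good:
        best = max(good, key=lambda c: c[1])
        return ("potential", best[0])
    return ("unknown", None)


print(lookup(3_145_728, "a" * 32, "fp-xyz"))  # ('definitive', 1001)
print(lookup(999, "b" * 32, "fp-xyz"))        # ('potential', 1002)
```

The threshold models "good enough to be considered a real match" from the discussion below; its actual value would have to be tuned.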

OLD Stuff

alternatively:

  • Avdump sends data to anidb2 udpapi
  • anidb2 udpapi stores all data
  • anidb2 cronjob sends fingerprint to matching slaves
  • matching slaves return best matches: ofid, match value
  • anidb2 cronjob sends fingerprint,ofid to the server with the worst results (or none) for storage.
  • this slave is now responsible for this fingerprint
  • storage would be ofid,length,avg_fit,avg_dom,fp
  • an identifier for the slave should be stored in ostfiletb for later administration of slaves/fingerprints
  • anidb2 cronjob adds new ostfile relations based on the results

--Epoximator 08:39, 10 May 2007 (UTC)
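The distribution step in the flow above ("the server with the worst results gets the fingerprint") could be sketched like this; the `Slave` class and the toy similarity measure are invented for illustration:

```python
# Toy sketch of the cronjob's distribution rule: every matching slave is
# queried, and the slave whose best match is worst (or that has no match
# at all) stores the new fingerprint, so similar fps are spread out.

class Slave:
    def __init__(self, name):
        self.name = name
        self.store = {}  # fpid -> fingerprint (a single float, as a toy)

    def best_match(self, fp):
        """Return (fpid, match_value) of the closest stored fp, or None."""
        if not self.store:
            return None
        fpid, stored = min(self.store.items(), key=lambda kv: abs(kv[1] - fp))
        return fpid, 1.0 / (1.0 + abs(stored - fp))  # toy similarity in (0, 1]

    def put(self, fpid, fp):
        self.store[fpid] = fp


def distribute(slaves, fpid, fp):
    """Store fp on the slave where it is most 'unique'; return its name."""
    def score(slave):
        m = slave.best_match(fp)
        return -1.0 if m is None else m[1]  # no match at all sorts first
    target = min(slaves, key=score)
    target.put(fpid, fp)
    return target.name


slaves = [Slave("s1"), Slave("s2")]
slaves[0].put(1, 100.0)              # s1 already holds something similar
print(distribute(slaves, 2, 101.0))  # 's2' (empty, i.e. "worst results")
```

A real fingerprint comparison is of course far more involved than a float distance; only the routing decision is the point here.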

I see several problems with this approach:

  • there is no natural relation between ofids and fingerprints; fingerprints are matched to songs, not files. However you imply such a relation by using ofids as ids for fingerprints on the slave servers. E.g. a case where this would become a problem:
    • Let's say a fingerprint is known to anidb and is loosely (foosic fp match) shared between multiple files. Your approach would relate all these files to one another (or to one specific file from the set). If it now becomes apparent that one of these files, possibly the one whose ofid you used as fingerprint id, is in fact wrongly matched to the song in question and is moved to another song, we'd effectively be changing the primary key of the fingerprint, on multiple decentralized servers... bad idea
    • And using the songid instead won't work either, as there won't be an ostfile<->song match at the time when the fingerprints are processed. I.e. we might end up with a group of files which all share similar fingerprints but which are not yet matched to any song. The fact that they're all sharing a similar fingerprint is highly relevant for the matching process though. So we can't delay the fingerprint analysis in order to be able to use the song ids as keys. Which is why I'd suggest that we just create new unique ids for the fingerprints.
i assumed, and still do, that fp is unique per encode (ofid) and not song. even if two fps are equal they would still be treated as two different ones throughout the system (they would just have a very good match value). so the relation would be ostfiletb >-< ostfiletb. then, on top of this, rows in songtb relate to one row each in ostfiletb (which the other relations can be derived from). if we don't want this redundancy we shouldn't store fps in ostfiletb at all, but rather in their own table where uniqueness is preserved (and with own ids). then you would have ostfiletb>-fptb-(<)songtb.
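The "own table" layout suggested in the reply above (ostfiletb >- fptb, plus a song relation per file) can be tried with in-memory SQLite; all table and column names below are made up for illustration:

```python
# Sketch of storing fingerprints in their own table with their own ids,
# so neither ofid nor songid acts as the fingerprint's primary key.
# Schema and values are illustrative, not the real AniDB schema.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
create table fptb (
    fpid  integer primary key,  -- own id, independent of ofid/songid
    fp    blob not null
);
create table songtb (
    songid integer primary key,
    title  text
);
create table ostfiletb (
    ofid   integer primary key,
    fpid   integer references fptb(fpid),      -- many encodes, one fp row
    songid integer references songtb(songid)   -- added later, may be null
);
""")
con.execute("insert into fptb values (1, x'deadbeef')")
con.execute("insert into songtb values (10, 'OP1')")
con.executemany("insert into ostfiletb values (?,?,?)",
                [(100, 1, 10), (101, 1, None)])  # two encodes, same fp

# derive a song candidate for the unmatched file via the shared fingerprint
rows = con.execute("""
    select distinct o2.songid
    from ostfiletb o1 join ostfiletb o2 on o1.fpid = o2.fpid
    where o1.ofid = 101 and o2.songid is not null
""").fetchall()
print(rows)  # [(10,)]
```

Moving the fp to its own keyed table means a file can be re-matched to a different song without touching any key stored on the slave servers.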


  • load balancing:
    • what are "matching slaves"?
a matching slave is a computer with a matching daemon running
    • what do you mean with "the server with the worst results (or none) for storage"? the error margin on found matches is hardly useful to determine the average load each server is under. Instead I'd suggest to return the total number of stored fingerprints on each server with the reply and store the fingerprint on the server with the smallest fingerprint count.
the work load is not defined by how many fps there are in total, but by how many are selected for matching. the idea is that if a slave comes up with no results, or only bad ones, then it should have the new fp because it is "unique" on that server. i.e. fps should be distributed so that similar fps are spread as much as possible. but the decision depends on how the selection is done:
select fp from fptb order by 2*abs(len-?)+abs(fit-?)+abs(dom-?) asc limit X (use results)
select fp from fptb where abs(len-?)<X and abs(fit-?)<Y and abs(dom-?)<Z (use count)
it will of course not balance well if the slaves have very different hardware (and/or additional tasks). we should consider processing time also, but per request
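The first selection strategy quoted above can be tried directly against an in-memory SQLite table; the schema and sample values are made up for the demo:

```python
# Demo of the distance-based candidate selection from the discussion:
# order by 2*abs(len-?)+abs(fit-?)+abs(dom-?) asc limit X.
# Table layout and values are illustrative only.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("create table fptb (fpid integer, len real, fit real, dom real)")
con.executemany(
    "insert into fptb values (?,?,?,?)",
    [(1, 180.0, 0.50, 0.3),   # close to the query
     (2, 182.0, 0.60, 0.3),   # a bit further
     (3, 240.0, 0.90, 0.8)],  # far away: excluded by the LIMIT
)

qlen, qfit, qdom, limit = 180.5, 0.55, 0.3, 2
rows = con.execute(
    "select fpid from fptb "
    "order by 2*abs(len-?)+abs(fit-?)+abs(dom-?) asc limit ?",
    (qlen, qfit, qdom, limit),
).fetchall()
print([r[0] for r in rows])  # [1, 2]
```

The doubled weight on the length term reflects the query as written; the actual weights and the X/Y/Z cut-offs of the second variant would need tuning against real data.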
    • "returns best matches", only if the match is good enough to be considered a real "match". So in many cases most of the slaves would probably reply that the fingerprint is unknown.
yes, that's true, at least in the beginning. it only depends on how many versions there are of each song and how many of them we have registered. (but one point is that we don't know what's considered a good match ATM. it has to be adjusted)
    • I do agree that it is probably useful to remember which server a specific fingerprint id is stored on. The easiest/most efficient approach for that would be to use a couple of the high bits of the ids to store that info. I.e. we could use the highest 4 bits (skipping the sign bit) of a 32-bit signed integer, leaving us with an effective 27-bit id space. I guess we wouldn't reach the 130 million entry mark any time soon. Neither are we likely to fork off to more than 16 servers. Both numbers could easily be extended later by switching to int8 values and using a larger part of the header for the server selection, i.e. 16 bits. That would mean that each server would have its own unique sequence for fingerprint ids (starting at 0) and we'd simply OR a server-specific bitmask onto it to obtain the external id when transmitting the data. Which means that the DB ids on the servers are not unique and will not change if we ever decide to change the splitting between server bits and id bits.
yes, it's useful for administration. if one slave dies it should be simple to redistribute the "lost" fps between the remaining slaves or to a new one. we should also redistribute fps when new slaves are added in general. i still think that the ids should be created and stored in the main DB, but it doesn't really matter. I think it would be safer and more clean, though.
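The bit-layout proposed above (4 server bits below the sign bit of a signed 32-bit id, 27 bits of per-server sequence) works out like this; the function names are invented for the sketch:

```python
# Sketch of the proposed external-id scheme: the highest 4 bits (below
# the sign bit) of a signed 32-bit integer carry the server number, the
# low 27 bits carry that server's local sequence id. Names are made up.
SERVER_BITS = 4
ID_BITS = 27  # 32 - 1 sign bit - 4 server bits

def to_external(server: int, local_id: int) -> int:
    assert 0 <= server < (1 << SERVER_BITS)        # up to 16 servers
    assert 0 <= local_id < (1 << ID_BITS)          # < ~134 million per server
    return (server << ID_BITS) | local_id          # OR the server mask on

def to_internal(external: int):
    return external >> ID_BITS, external & ((1 << ID_BITS) - 1)

ext = to_external(3, 42)
print(ext)               # 402653226
print(to_internal(ext))  # (3, 42)
```

2^27 is 134,217,728, matching the "130 million entry mark" above, and the local id survives unchanged if the server/id split is ever widened on the int8 variant.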
  • "anidb2 cronjob adds new ostfile relations based on the results", in cases where we already know other files with a matching fingerprint and those are already related to a specific song. If the other similar files are also "song-less", then the matching/group information would remain unused until someone opens the manual matching website. Where those files could be grouped together and would by default all be related to the song which the user selects.
yes. the matching is fully automated and is completely unrelated to the songs until someone adds those relations manually. it is of course possible to generate songs based on the metadata, though. (initially, at least)
--Epoximator 17:53, 11 May 2007 (UTC)
Exp 13:45, 11 May 2007 (UTC)