OstDB DEV Foosic: Difference between revisions

Jump to navigation Jump to search
m
no edit summary
mNo edit summary
 
(11 intermediate revisions by 4 users not shown)
Line 1: Line 1:
{{TOCright}}
{{TOCright}}
Protocol for foosic client<->server communication via UDP and TCP


VERSION 1
Protocol for foosic audio fingerprint matching related client<->server communication via UDP and TCP.


Server: dedicated java daemon (UDP and TCP), on anidb3 (sig) (UDP for single queries, TCP for batch runs)
VERSION 2


Client: cronjob/daemon on anidb2
Servers:
* dedicated Java matching server daemon TCP on anidb3 (sig) (does not accept client requests directly)
* dedicated Java redirector/load balancer daemon TCP and UDP on anidb2 (main/api) (UDP for single queries, TCP for batch runs)


Clients:
* cronjob on anidb2
* clients on user systems


== General Workflow ==
== General Workflow ==
=== Workflow Draft ===
=== Workflow Draft ===
==== Involved Parties ====
==== Involved Parties ====
* client (avdump), locally on user machine
* client (Avdump), locally on user machine
* main server (anidb2), keeps file meta data incl. fingerprints
* main server (anidb2), keeps file meta data includes fingerprints
* matching redirector (anidb2), a simple load balancer which redirects all requests to the corresponding matching server(s) (IDENT goes to all, STORE to one), TCP and UDP interface (UDP for external queries, TCP for cron job/batch queries)
* matching redirector (anidb2), a simple load balancer which redirects all requests to the corresponding matching server(s) (MATCH goes to all, STORE to one), TCP and UDP interface (UDP for external queries, TCP for cron job/batch queries)
* matching server(s) (anidb3/sig server), java standalone app does all the fingerprint<->fingerprint matching, TCP interface
* matching server(s) (anidb3/sig server), Java standalone app does all the fingerprint<->fingerprint matching, TCP interface
 


==== Workflow - Matching ====
==== Workflow - Matching ====
===== Synchronous Part =====
===== Synchronous Part =====
* user runs client on local audio files
* user runs client on his local audio files
* client sends file meta data, including foosic audio fingerprint, to main server via UDP API
* client sends file meta data, including foosic audio fingerprint, to main server via UDP API
** optional intermediate step to speedup client processing: calculate content hash first and generate and submit the fingerprint only if the content hash is unknown to AniDB or no fingerprint is listed on AniDB for that file
** optional intermediate step to speedup client processing: calculate content hash first and generate and submit the fingerprint only if the content hash is unknown to AniDB or no fingerprint is listed on AniDB for that file
* main server does not do any processing on the fingerprint and simply stores it in ostfiletb
* main server does not do any processing on the fingerprint and simply stores it in ostfiletb


===== Asynchronous Part =====
===== Asynchronous Part =====
* cronjob/daemon on main server regularly checks for newly added foosic fingerprints and sends them, together with the ostfile id to the matching redirector via TCP with store set to 1
* a cronjob/daemon on the main server regularly checks for newly added (yet unmatched) foosic fingerprints and sends them, together with the ostfile id to the matching redirector via TCP with store set to 1
** flooding is impossible due to the synchronous nature of the TCP matching redirector api
** flooding is impossible due to the synchronous nature of the TCP matching redirector API
* matching redirector forwards the IDENT query to all matching servers via TCP
** the cronjob could open multiple TCP connections to the matching redirector if one connection does not yield enough throughput.
** TCP connections to all matching servers should be kept alive inbetween queries (prevent TCP connection handshake overhead)
* the matching redirector forwards the MATCH query to all matching servers via TCP
* matching servers try to match the fingerprint to all potentially interesting fingerprints in their database
** (multiple) TCP connections to all matching servers should be kept alive in-between queries (to prevent TCP connection handshake overhead)
* the matching servers try to match the fingerprint to all potentially interesting fingerprints in their database
** pre-match filtering: via length, avg dom and avg fit
** pre-match filtering: via length, avg dom and avg fit
** post-match filtering: via hard cut-off-point for matching, this will definitely be >=0,50 anything lower would be useless. It will be possible to increase this value without problems at any point in time. However, reducing it would be a problem. We might therefore want to just start with 0,50 pr 0,60 and increase it if we feel that it is necessary to reduce the amount of matches returned.
** post-match filtering: via hard cut-off-point for matching confidence value, this will definitely be >=0,50 anything lower would be useless. It will be possible to increase this value without problems at any point in time. However, reducing it would be a problem. We might therefore want to just start with 0,50 or 0,60 and increase it if we feel that it is necessary to reduce the amount of matches returned.
** potentially further internal optimizations, i.e. by identifying group representatives for certain well matching groups of files and only matching taking the representatives into account when matching
** potentially further internal optimizations, i.e. by identifying group representatives for certain well matching groups of files and only matching taking the representatives into account when matching
* each matching server replies with a list of matching ostfile ids together with the magnitude of error per match and some general usage data (for load balancing).
* each matching server replies with a list of matching ostfile ids together with the matching confidence per match and some general usage data (for load balancing).
** neither fingerprints nor matching results are stored on the matching servers at this point
** neither fingerprints nor matching results are stored on the matching servers at this point
** matching servers keep some internal usage statistics for potential future optimization
** matching servers keep some internal usage statistics for potential future optimization
** if a matching server already stores this exact ofid, it will add a note about this fact to it's reply.
** if a matching server already stores this exact ofid, it will add a note about this fact to it's reply.
* the matching redirector collects the replies from all matching servers and collates them into one reply for _later_ transmittion to the main server.
** the matching redirector does not keep any persistent storage. Other than some usage data, nothing is kept here.
* the matching redirector collects the replies from all matching servers and collates them into one reply for _later_ transmission to the main server.
* the matching redirector then decides which matching server should store the new fingerprint based on the usage statistics returned by each matching server and sends that server a STORE command. That servers identifier is then included with the reply to the main server.
* the matching redirector then decides which matching server should store the new fingerprint based on the usage statistics returned by each matching server and sends that server a STORE command. That servers identifier is then included with the reply to the main server.
* the matching redirector replies to the main server (matching results + ident of matching server)
** if one of the earlier replies from the matching servers indicated, that a server already has the fingerprint, no new STORE command is issued. The identifier of that server is used instead.
* the matching redirector replies to the main server (matching results + ident of matching server storing that fingerprint)
** for the main server/cron daemon it is not visible which _match_ came from which matching server
** for the main server/cron daemon it is not visible which _match_ came from which matching server
* main server stores the matching data in the db
* the main server stores the matching data in the DB
** (ofid1 int4, ofid2 int4, matching float) as ostfilefoosictb
** (ofidlow int4, ofidhigh int4, match int2) as ostfilefoosictb
** (...,foosicfp bytea,foosicsrv int2,...) in ostfiletb
** (...,foosicfp bytea,foosicsrv int2,...) in ostfiletb
* main server uses matching data to support manual file<->song matching via the webinterface
* main server uses matching data to support manual file<->song matching via the web interface
** the user will be able select the cut-off-point for the matching value on-the-fly in order to reduce false-positives or increase recall
** the user will be able select the cut-off-point for the matching value on-the-fly in order to reduce false-positives or increase recall


==== Workflow - Deletion ====
==== Workflow - Deletion ====
Line 62: Line 63:
* the matching servers will delete all local data for the given ofid
* the matching servers will delete all local data for the given ofid
* the results are collated by the matching redirector and returned to the main server's cron job
* the results are collated by the matching redirector and returned to the main server's cron job
=== OLD ===
alternatively:
* avdump sends data to anidb2 udpapi
* anidb2 udpapi stores all data
* anidb2 cronjob sends <tt>fingerprint</tt> to matching slaves
* matching slaves returns best matches: <tt>ofid,match value</tt>
* anidb2 cronjob sends <tt>fingerprint,ofid</tt> to the server with the worst results (or none) for storage.
:* this slave is now responsible for this fingerprint
:* storage would be <tt>ofid,length,avg_fit,avg_dom,fp</tt>
:* an identifier for the slave should be stored in ostfiletb for later administration of slaves/fingerprints
* anidb2 cronjob adds new ostfile relations based on the results
--[[User:Epoximator|Epoximator]] 08:39, 10 May 2007 (UTC)
I see several problems with this approach:
* there is no natural relation between ofids and fingerprints, fingerprints are matched to songs, not files. However you imply such a relation by using ofids as ids for fingerprints on the slave servers. I.e. a case where this would become a problem:
** Lets say a fingerprint is known to anidb and is loosely (foosic fp match) shared between multiple files. Your approach would relate all these files to one another (or to one specific file from the set). If it now becomes apparent that one of these files, possibly the one whose ofid you used as fingerprint id, is in fact wrongly matched to the song in question and is moved to another song, we'd effectively be changing the primary key of the fingerprint, on multiple decentralized servers... bad idea
** And using the songid instead won't work either. As there won't be a ostfile<->song match at the time when the fingerprints are processed. I.e. we might end up with a group of files which all share similar fingerprints but which are all not yet matched to a song. The fact that tey're all sharing a similar fingerprint is highly relavant for the matching process though. So we can't delay the fingerprint analysis inorder to be able to use the song ids as keys. Which is why I'd suggest that we just create new unique ids for the fingerprints.
:: i assumed, and still do, that fp is unique per encode (ofid) and not song. even if two fps are equal they would still be treated as two different ones throughout the system (they would just have a very good match value). so the relation would be ostfiletb >-< ostfiletb. then, on top of this, rows in songtb relate to one row each in ostfiletb (which the other relations can be derived from). if we don't want this redundancy we shouldn't store fps in ostfiletb at all, but rather in their own table where uniqueness is preserved (and with own ids). then you would have ostfiletb>-fptb-(<)songtb.
* load balancing:
** what are "matching slaves" ?
:: a matching slave is a puter with a matching daemon running
** what do you mean with "the server with the worst results (or none) for storage"? the error margin on found matches is hardly useful to determine the average load each server is under. Instead I'd suggest to return the total number of stored fingerprints on each server with the reply and store the fingerprint on the server with the smallest fingerprint count.
:: the work load is not defined by how many fps there are in total, but how many that are selected for matching. the idea is that if a slave come up with no results, or only bad ones, then it should have the new fp because it is "unique" on that server. ie. fps should be distribued so that similar fps are spread as much as possible. but the decision depends on how the selection is done:
::<tt>select fp from fptb order by 2*abs(len-?)+abs(fit-?)+abs(dom-?) asc limit X</tt> (use results)
::<tt>select fp from fptb where abs(len-?)<X and abs(fit-?)<Y and abs(dom-?)<Z</tt> (use count)
::it will of course not balance good if the slaves have very different hw (and/or additional tasks). we should consider processing time also, but per request
** "returns best matches", only if the match is good enough to be considered a real "match". So in many cases most of the slaves would probably reply that the fingerprint is unknown.
::yes, that's true, at least in the beginning. it only depends on how many versions there are of each song and how many of them we have registered. (but one point is that we don't know what's considered a good match atm. it has to be adjusted)
** I do agree that it is probably useful to remember which server a specific fingerprint id is stored on. The easiest/efficient approach for that would be to use a couple of the high bits of the ids to store that info. I.e. we could use the highest 4bits (skipping the sign bit) of a 32bit signed integer, leaving us with an effective 27bit room of ids. I guess we wouldn't reach the 130 million entry mark anytime soon. Neither are we likely to fork of to more than 16 servers. Both numbers could easily be extended later by switching to int8 values and using a larger room in the header for the server selection, i.e. 16bit. That would mean that each server would have it's own unique sequence for fingerprint ids (starting at 0) and we'd simply AND a server specific bitmask to it to obtain the external id, when transmitting the data. Which means that the DB ids on the servers are not unique and will not change if we ever decide to change the splitting between server bits and id bits.
::yes, it's useful for administration. if one slave dies it should be simple to redistribute the "lost" fps between the remaining slaves or to a new one. we should also redistribute fps when new slaves are added in general. i still think that the ids should be created and stored in the main db, but it doesn't really matter. i think it would be safer and more clean, though.
* "anidb2 cronjob adds new ostfile relations based on the results", in cases where we already know other files with a matching fingerprint and those are already related to a specific song. If the other similar files are also "song-less", then the matching/group information would remain unused until someone opens the manual matching website. Where those files could be grouped together and would by default all be related to the song which the user selects.
::yes. the matching is fully automated and is completely unrelated to the songs until someone adds those relations manually. it is of course possible to generate songs to based on the metadata, though. (initially, at least)
::--[[User:Epoximator|Epoximator]] 17:53, 11 May 2007 (UTC)
: [[User:Exp|Exp]] 13:45, 11 May 2007 (UTC)


== Possible Extension ==
== Possible Extension ==
(for client features; UDP API)
(for client features; UDP API)
* command to fetch audio meta data by ostfile id, size+content hash or foosic id
* command to fetch audio meta data by ostfile id or size+content hash
* command to add audio file to mylist by ostfile id or size+content hash (foosic server; UDP)
* command to add audio file to MyList by ostfile id or size+content hash
* same commands as available via TCP (used by the anidb2 cronjob) also available via UDP for use by other clients. i.e. to allow lookups by foosic fingerprint. for that a client would first contact the UDP API (anidb2) with the content hash and if the content hash is unknown to anidb it would send the fingerprint to the foosic server (anidb3) to get one or more foosic id(s) and then use those to query song data from the UDP API. (this is not meant for avdump, but it might be interesting for direct integration into player software, i.e. a winamp/amarok plugin, would work somewhat like the already available musicbrainz plugins)
* same commands as available via TCP (used by the anidb2 cronjob) also available via UDP for use by other clients.
 
** i.e. to allow lookups by foosic fingerprint. For that a client would first contact the UDP API (anidb2) with the content hash and if the content hash is unknown to AniDB it would send the fingerprint to the matching redirector (anidb2), which would delegate to the matching servers (anidb3), to get one or more ostfile id(s) and then use those to query song data from the UDP API.
*** this is not meant for Avdump, but it might be interesting for direct integration into player software, i.e. a winamp/amarok plugin, would work somewhat like the already available musicbrainz plugins.


== General Considerations for Future Expansion ==
== General Considerations ==
* it is very important to effectively limit the number of fingerprints which need to be taken into account for each lookup. As such the file length and the average dom and fit should be stored in a way which allows easy and fast filtering via range queries on those 3 dimensions. so that'd probably mean it will be a: length int4, dom int4, fit int4, fingerprint bytea kind of table
* it is very important to effectively limit the number of fingerprints which need to be taken into account for each lookup. As such the file length and the average dom and fit should be stored in a way which allows easy and fast filtering via range queries on those 3 dimensions. so that'd probably mean it will be a: (length int4, dom int4, fit int4, fingerprint bytea) kind of table
* it may become necessary to purge rarely accessed fingerprints from the db every now and then to limit the db size. in order to do that we'll need to keep some counters and dates. i'd suggest: seencount int4, addeddate timestamp, lastseen timestamp as the foosic server would require no authentication, the same user sending a fingerprint multiple times would increase the counter everytime
** the number of fingerprints to consider for each matching lookup could be reduced further by only taking one representative fingerprint for each closely matching group of fingerprints into account.
*** i.e. if there are 20 files for one song and the fingerprints of 18 of those files match with a confidence of NN% (the actual confidence number to use will be hard to decide on, might be something like 98%) then the median of that group (the file which has the closest cumulated match with all other files of that group) could be picked as a representative and all other 17 fingerprints could be skipped during matching lookups. If we always want to return all matches, then the remaining 17 fingerprints could be matched in a second run once the matching with the representative yielded a value above the cut-off-point.
*** depending on the number of encodes per song and the closeness of their match such an optimization might well reduce the number of fingerprints to consider per lookup by a factor of 20 or more.
*** possible storage: additional grouprep int4 field which stores ofid of group representative if an entry is part of a group. Group representatives and fingerprints not belonging to any group would have a value of 0. The initial matching lookup could simply restrict the SELECT to WHERE grouprep=0 (in addition to the len, dom and fit constraints).
* as further optimization may become necessary someday, we should collect some usage statistics per fingerprint in order to identify hotspots and areas of very low interest. Data to collect could be:
** added - the date this fingerprint was added
** bestmatch - the best match confidence this fingerprint ever achieved for a query (ignoring a self compare)
** lastmatchdate - date of last match of this fingerprint for a query
** matchedcnt - the number of times this fingerprint was returned as a match for a query
** as the foosic server would require no authentication, the same user sending a fingerprint multiple times would increase the counter every time
* it may also become necessary to split the processing over multiple servers someday. this can be greatly simplified if the protocol is designed in a way which would allow the following setup.
* it may also become necessary to split the processing over multiple servers someday. this can be greatly simplified if the protocol is designed in a way which would allow the following setup.
** loadbalancing server listens for foosic fingerprint lookups each received lookup is send to _all_ foosic servers
** loadbalancing matching redirector listens for foosic fingerprint lookups each received lookup is send to _all_ matching servers
** each foosic server replies with ids or with "unknown"
** each matching server replies with ostfile ids and match confidence or with "unknown"
** loadbalancer merges all id replies together (sorted by error rate) and returns reply to client
** matching redirector merges all ostfile id replies together (sorted by match confidence) and returns reply to client.
** if all foosic servers replied unknown, loadbalancer tells least used server to store the fingerprint and returns the generated id to the client
** if none of the matching servers indicated that it has the exact fingerprint in his local storage, the matching redirector tells a matching server with free resources to store the fingerprint.
* this would mean that each query is processed in parallel by all available servers. the very nature of the search approach makes the entire approach very scalable.
*** the decision is made based on the observed fingerprint distribution of the matching servers. The MATCH reply from the matching servers lists the number of fingerprints which needed to be taken into account for the specific match. The server with the smallest number of fingerprints with similar length, avg. fit and avg. dom would be the best place to store the new fingerprint. Other factors could also be taken into account.
 
* this would mean that each query is processed in parallel by all available matching servers. The very nature of the search approach makes the entire approach very scalable.


== Protocol ==
== Protocol ==
=== Broken Clients ===
=== Broken Clients ===
* we probably want to require a client string and client version in every query (similar to udp api) to be able to ban badly broken clients, should the need arrise someday.
* we probably want to require a client string and client version in every query (similar to UDP API) to be able to ban badly broken clients, should the need arise someday.


=== Protocol Draft ===
=== Protocol Draft ===
 
Every query should contain as additional parameters:
Every query should contain as additional paramets:
client={str client name}&clientver={int client version}
* client={str client name}&clientver={int client version}


Access to the STORE command will be limited to the main server's cron job by this method.
Access to the STORE command will be limited to the main server's cron job by this method.


==== Querying a foosic fingerprint (don't add if unknown) ====
==== Querying a foosic fingerprint (don't add if unknown) ====
Used by:
Used by:
* external clients  
* external clients  
* main server's cron job
* main server's cron job
* matching redirector (forwarded)
* matching redirector (forwarded)


Client:
Client:
* MATCH ofid={int4 ostfile id}&foosic={str ascii hex representation of fingerprint}[&store=1]
MATCH ofid={int4 ostfile id}&foosic={str ascii hex representation of fingerprint}[&store=1]
** the store=1 parameter is filtered out and interpreted by the matching redirector, only the main server's cron job is allowed to set store=1
: The <tt>store=1</tt> parameter is filtered out and interpreted by the matching redirector, only the main server's cron job is allowed to set <tt>store=1</tt>.
 


Server Reply:
Server Reply:
Line 152: Line 116:
: 200 MATCHED
: 200 MATCHED
: {int2 ident of matching server this fingerprint is stored on}
: {int2 ident of matching server this fingerprint is stored on}
** this line is inserted by the matching redirector, the data is only interesting for the main server's cron job. Done only if store=1.
:* this line is inserted by the matching redirector, the data is only interesting for the main server's cron job. Done only if store=1.
: {int result count}|{int compare count}|{int time taken in ms}
: {int result count}|{int compare count}|{int time taken in ms}
** this line will be suppressed by the matching redirector which processes it to decide where to store a new fingerprint (load balancing)
:* this line will be suppressed by the matching redirector which processes it to decide where to store a new fingerprint (load balancing)
: ({int error}|{int ofid}\n)*
: ({int error}|{int ofid}\n)*


The reply starts with
: 201 MATCHED LOCAL
if the ofid being queried is stored on that specific matching server
:* the matching redirector will filter 201 replies, a normal client will never see them.


* No matchings found
* No matchings found
: 300 UNMATCHED
: 300 UNMATCHED
: {int2 ident of matching server this fingerprint is stored on}
: {int2 ident of matching server this fingerprint is stored on}
** this line is inserted by the matching redirector, the data is only interesting for the main server's cron job. Done only if store=1.
:* this line is inserted by the matching redirector, the data is only interesting for the main server's cron job. Done only if store=1.
: {int result count}|{int compare count}|{int time taken in ms}
: {int result count}|{int compare count}|{int time taken in ms}
** this line will be suppressed by the matching redirector which processes it to decide where to store a new fingerprint (load balancing)
:* this line will be suppressed by the matching redirector which processes it to decide where to store a new fingerprint (load balancing)


The reply starts with
: 301 UNMATCHED LOCAL
if the ofid being queried is stored on that specific matching server
:* the matching redirector will filter 201 replies, a normal client will never see them.


==== Submitting a new foosic fingerprint ====
==== Submitting a new foosic fingerprint ====
Used by:
Used by:
* matching redirector, access is restricted
* matching redirector, access is restricted


Client:
Client:
* STORE ofid={int4 ostfile id}&foosic={str ascii hex representation of fingerprint}
STORE ofid={int4 ostfile id}&foosic={str ascii hex representation of fingerprint}
 


Server Reply:
Server Reply:
Line 182: Line 151:
* Fingerprint was already in DB
* Fingerprint was already in DB
: 310 ALREADY STORED
: 310 ALREADY STORED


==== Submitting a new foosic fingerprint ====
==== Submitting a new foosic fingerprint ====
Used by:
Used by:
* main server's cron job, access is restricted
* main server's cron job, access is restricted
* matching redirector (forwarded), access is restricted
* matching redirector (forwarded), access is restricted


Client:
Client:
* DELETE ofid={int4 ostfile id}[&{int2 ident of matching server this fingerprint is stored on}]
DELETE ofid={int4 ostfile id}[&{int2 ident of matching server this fingerprint is stored on}]
 


Server Reply:
Server Reply:
Line 201: Line 166:
* Fingerprint was not in DB
* Fingerprint was not in DB
: 320 NOT FOUND
: 320 NOT FOUND


==== Query the current server load/utilization ====
==== Query the current server load/utilization ====
Used by:
Used by:
* matching redirector, access is restricted
* matching redirector, access is restricted


Client:
Client:
* LOADSTAT
LOADSTAT
 


Server Reply:
Server Reply:
: 299 LOAD STAT
: 299 LOAD STAT
: {int2 load factor}|{int4 number of fingerprints in db}|{int2 system load}
: {int2 load factor}|{int4 number of fingerprints in db}|{int2 system load}
** load factor: a simply multiplicative constant which is used to distinguish between fast and slow server hardware. This can i.e. be used to store twice as many fingerprints on one server compared to others. The fingerprint count is converted according to the following formula prior to comparison/selection of least used server.
:* load factor: a simply multiplicative constant which is used to distinguish between fast and slow server hardware. This can i.e. be used to store twice as many fingerprints on one server compared to others. The fingerprint count is converted according to the following formula prior to comparison/selection of least used server.
*** relative fingerprint number/load = number of fingerprints in db * (load factor / 100)
::* relative fingerprint number/load = number of fingerprints in db * (load factor / 100)
 
[[Category:Development]]
1,633

edits

Navigation menu

MediaWiki spam blocked by CleanTalk.
MediaWiki spam blocked by CleanTalk.