OstDB DEV Foosic: Difference between revisions

m
no edit summary
mNo edit summary
Line 6: Line 6:


Servers:
Servers:
* dedicated java matching server daemon TCP on anidb3 (sig) (does not accept client requests directly)
* dedicated Java matching server daemon TCP on anidb3 (sig) (does not accept client requests directly)
* dedicated java redirector/load balancer daemon TCP and UDP on anidb2 (main/api) (UDP for single queries, TCP for batch runs)
* dedicated Java redirector/load balancer daemon TCP and UDP on anidb2 (main/api) (UDP for single queries, TCP for batch runs)




Line 23: Line 23:
* main server (anidb2), keeps file meta data incl. fingerprints
* main server (anidb2), keeps file meta data incl. fingerprints
* matching redirector (anidb2), a simple load balancer which redirects all requests to the corresponding matching server(s) (MATCH goes to all, STORE to one), TCP and UDP interface (UDP for external queries, TCP for cron job/batch queries)
* matching redirector (anidb2), a simple load balancer which redirects all requests to the corresponding matching server(s) (MATCH goes to all, STORE to one), TCP and UDP interface (UDP for external queries, TCP for cron job/batch queries)
* matching server(s) (anidb3/sig server), java standalone app does all the fingerprint<->fingerprint matching, TCP interface
* matching server(s) (anidb3/sig server), Java standalone app does all the fingerprint<->fingerprint matching, TCP interface




Line 37: Line 37:
===== Asynchronous Part =====
===== Asynchronous Part =====
* a cronjob/daemon on the main server regularly checks for newly added (yet unmatched) foosic fingerprints and sends them, together with the ostfile id to the matching redirector via TCP with store set to 1
* a cronjob/daemon on the main server regularly checks for newly added (yet unmatched) foosic fingerprints and sends them, together with the ostfile id to the matching redirector via TCP with store set to 1
** flooding is impossible due to the synchronous nature of the TCP matching redirector api
** flooding is impossible due to the synchronous nature of the TCP matching redirector API
** the cronjob could open multiple TCP connections to the matching redirector if one connection does not yield enough throughput.
** the cronjob could open multiple TCP connections to the matching redirector if one connection does not yield enough throughput.
* the matching redirector forwards the MATCH query to all matching servers via TCP
* the matching redirector forwards the MATCH query to all matching servers via TCP
** (multiple) TCP connections to all matching servers should be kept alive inbetween queries (to prevent TCP connection handshake overhead)
** (multiple) TCP connections to all matching servers should be kept alive in-between queries (to prevent TCP connection handshake overhead)
* the matching servers try to match the fingerprint to all potentially interesting fingerprints in their database
* the matching servers try to match the fingerprint to all potentially interesting fingerprints in their database
** pre-match filtering: via length, avg dom and avg fit
** pre-match filtering: via length, avg dom and avg fit
Line 50: Line 50:
** if a matching server already stores this exact ofid, it will add a note about this fact to it's reply.
** if a matching server already stores this exact ofid, it will add a note about this fact to it's reply.
** the matching redirector does not keep any persistent storage. Other than some usage data, nothing is kept here.
** the matching redirector does not keep any persistent storage. Other than some usage data, nothing is kept here.
* the matching redirector collects the replies from all matching servers and collates them into one reply for _later_ transmittion to the main server.
* the matching redirector collects the replies from all matching servers and collates them into one reply for _later_ transmission to the main server.
* the matching redirector then decides which matching server should store the new fingerprint based on the usage statistics returned by each matching server and sends that server a STORE command. That servers identifier is then included with the reply to the main server.
* the matching redirector then decides which matching server should store the new fingerprint based on the usage statistics returned by each matching server and sends that server a STORE command. That servers identifier is then included with the reply to the main server.
** if one of the earlier replies from the matching servers indicated, that a server already has the fingerprint, no new STORE command is issued. The identifier of that server is used instead.
** if one of the earlier replies from the matching servers indicated, that a server already has the fingerprint, no new STORE command is issued. The identifier of that server is used instead.
* the matching redirector replies to the main server (matching results + ident of matching server storing that fingerprint)
* the matching redirector replies to the main server (matching results + ident of matching server storing that fingerprint)
** for the main server/cron daemon it is not visible which _match_ came from which matching server
** for the main server/cron daemon it is not visible which _match_ came from which matching server
* the main server stores the matching data in the db
* the main server stores the matching data in the DB
** (ofidlow int4, ofidhigh int4, match int2) as ostfilefoosictb
** (ofidlow int4, ofidhigh int4, match int2) as ostfilefoosictb
** (...,foosicfp bytea,foosicsrv int2,...) in ostfiletb
** (...,foosicfp bytea,foosicsrv int2,...) in ostfiletb
* main server uses matching data to support manual file<->song matching via the webinterface
* main server uses matching data to support manual file<->song matching via the web interface
** the user will be able select the cut-off-point for the matching value on-the-fly in order to reduce false-positives or increase recall
** the user will be able select the cut-off-point for the matching value on-the-fly in order to reduce false-positives or increase recall


Line 78: Line 78:
* command to add audio file to mylist by ostfile id or size+content hash
* command to add audio file to mylist by ostfile id or size+content hash
* same commands as available via TCP (used by the anidb2 cronjob) also available via UDP for use by other clients.
* same commands as available via TCP (used by the anidb2 cronjob) also available via UDP for use by other clients.
** i.e. to allow lookups by foosic fingerprint. For that a client would first contact the UDP API (anidb2) with the content hash and if the content hash is unknown to anidb it would send the fingerprint to the matching redirector (anidb2), which would delegate to the matching servers (anidb3), to get one or more ostfile id(s) and then use those to query song data from the UDP API.
** i.e. to allow lookups by foosic fingerprint. For that a client would first contact the UDP API (anidb2) with the content hash and if the content hash is unknown to AniDB it would send the fingerprint to the matching redirector (anidb2), which would delegate to the matching servers (anidb3), to get one or more ostfile id(s) and then use those to query song data from the UDP API.
*** this is not meant for avdump, but it might be interesting for direct integration into player software, i.e. a winamp/amarok plugin, would work somewhat like the already available musicbrainz plugins
*** this is not meant for avdump, but it might be interesting for direct integration into player software, i.e. a winamp/amarok plugin, would work somewhat like the already available musicbrainz plugins


Line 84: Line 84:
* it is very important to effectively limit the number of fingerprints which need to be taken into account for each lookup. As such the file length and the average dom and fit should be stored in a way which allows easy and fast filtering via range queries on those 3 dimensions. so that'd probably mean it will be a: (length int4, dom int4, fit int4, fingerprint bytea) kind of table
* it is very important to effectively limit the number of fingerprints which need to be taken into account for each lookup. As such the file length and the average dom and fit should be stored in a way which allows easy and fast filtering via range queries on those 3 dimensions. so that'd probably mean it will be a: (length int4, dom int4, fit int4, fingerprint bytea) kind of table
** the number of fingerprints to consider for each matching lookup could be reduced further by only taking one representative fingerprint for each closely matching group of fingerprints into account.
** the number of fingerprints to consider for each matching lookup could be reduced further by only taking one representative fingerprint for each closely matching group of fingerprints into account.
*** i.e. if there are 20 files for one song and the fingerprints of 18 of those files match with a confidence of NN% (the actual confidence number to use will be hard to decide on, might be something like 98%) then the median of that group (the file which has the closest commulated match with all other files of that group) could be picked as a representative and all other 17 fingerprints could be skipped during matching lookups. If we always want to return all matches, then the remaining 17 fingerprints could be matched in a second run once the matching with the representative yielded a value above the cut-off-point.
*** i.e. if there are 20 files for one song and the fingerprints of 18 of those files match with a confidence of NN% (the actual confidence number to use will be hard to decide on, might be something like 98%) then the median of that group (the file which has the closest cumulated match with all other files of that group) could be picked as a representative and all other 17 fingerprints could be skipped during matching lookups. If we always want to return all matches, then the remaining 17 fingerprints could be matched in a second run once the matching with the representative yielded a value above the cut-off-point.
*** depending on the number of encodes per song and the closeness of their match such an optimization might well reduce the number of fingerprints to consider per lookup by a factor of 20 or more
*** depending on the number of encodes per song and the closeness of their match such an optimization might well reduce the number of fingerprints to consider per lookup by a factor of 20 or more
*** possible storage: additional grouprep int4 field which stores ofid of group representative if an entry is part of a group. Group representatives and fingerprints not belonging to any group would have a value of 0. The initial matching lookup could simply restrict the SELECT to WHERE grouprep=0 (in addition to the len, dom and fit constraints).
*** possible storage: additional grouprep int4 field which stores ofid of group representative if an entry is part of a group. Group representatives and fingerprints not belonging to any group would have a value of 0. The initial matching lookup could simply restrict the SELECT to WHERE grouprep=0 (in addition to the len, dom and fit constraints).
Line 92: Line 92:
** lastmatchdate - date of last match of this fingerprint for a query
** lastmatchdate - date of last match of this fingerprint for a query
** matchedcnt - the number of times this fingerprint was returned as a match for a query
** matchedcnt - the number of times this fingerprint was returned as a match for a query
** as the foosic server would require no authentication, the same user sending a fingerprint multiple times would increase the counter everytime
** as the foosic server would require no authentication, the same user sending a fingerprint multiple times would increase the counter every time
* it may also become necessary to split the processing over multiple servers someday. this can be greatly simplified if the protocol is designed in a way which would allow the following setup.
* it may also become necessary to split the processing over multiple servers someday. this can be greatly simplified if the protocol is designed in a way which would allow the following setup.
** loadbalancing matching redirector listens for foosic fingerprint lookups each received lookup is send to _all_ matching servers
** loadbalancing matching redirector listens for foosic fingerprint lookups each received lookup is send to _all_ matching servers
Line 98: Line 98:
** matching redirector merges all ostfile id replies together (sorted by match confidence) and returns reply to client
** matching redirector merges all ostfile id replies together (sorted by match confidence) and returns reply to client
** if none of the matching servers indicated that it has the exact fingerprint in his local storage, the matching redirector tells a matching server with free resources to store the fingerprint.
** if none of the matching servers indicated that it has the exact fingerprint in his local storage, the matching redirector tells a matching server with free resources to store the fingerprint.
*** the decision is made based on the observed finerprint distribution of the matching servers. The MATCH reply from the matching servers lists the number of fingerprints which needed to be taken into account for the specific match. The server with the smallest number of fingerprints with similar length, avg. fit and avg. dom would be the best place to store the new fingerprint. Other factors could also be taken into account.
*** the decision is made based on the observed fingerprint distribution of the matching servers. The MATCH reply from the matching servers lists the number of fingerprints which needed to be taken into account for the specific match. The server with the smallest number of fingerprints with similar length, avg. fit and avg. dom would be the best place to store the new fingerprint. Other factors could also be taken into account.
* this would mean that each query is processed in parallel by all available matching servers. The very nature of the search approach makes the entire approach very scalable.
* this would mean that each query is processed in parallel by all available matching servers. The very nature of the search approach makes the entire approach very scalable.


Line 104: Line 104:


=== Broken Clients ===
=== Broken Clients ===
* we probably want to require a client string and client version in every query (similar to udp api) to be able to ban badly broken clients, should the need arrise someday.
* we probably want to require a client string and client version in every query (similar to udp API) to be able to ban badly broken clients, should the need arise someday.


=== Protocol Draft ===
=== Protocol Draft ===
1,633

edits

MediaWiki spam blocked by CleanTalk.
MediaWiki spam blocked by CleanTalk.