OstDB DEV Foosic

{{TOCright}}
Protocol for foosic audio fingerprint matching related client<->server communication via UDP and TCP.


==== Involved Parties ====
* client (avdump), runs locally on the user's machine
* main server (anidb2), keeps the file metadata, including fingerprints
* matching redirector (anidb2), a simple load balancer which redirects all requests to the corresponding matching server(s) (MATCH goes to all, STORE to one); TCP and UDP interface (UDP for external queries, TCP for cron job/batch queries)
* matching server(s) (anidb3/sig server), a standalone Java app which does all the fingerprint<->fingerprint matching; TCP interface
(for client features; UDP API)
* command to fetch audio metadata by ostfile id or size+content hash
* command to add an audio file to MyList by ostfile id or size+content hash
* the same commands as available via TCP (used by the anidb2 cron job) are also available via UDP for use by other clients.
** i.e. to allow lookups by foosic fingerprint. For that, a client would first contact the UDP API (anidb2) with the content hash; if the content hash is unknown to AniDB, it would send the fingerprint to the matching redirector (anidb2), which delegates to the matching servers (anidb3), to get one or more ostfile id(s), and then use those to query song data from the UDP API.
*** this is not meant for avdump, but it might be interesting for direct integration into player software, e.g. a winamp/amarok plugin; it would work somewhat like the already available MusicBrainz plugins.
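The two-step lookup above (content hash first, fingerprint as fallback) can be sketched as follows. This is a minimal illustration only; the helper callables stand in for the real UDP API and redirector commands, which are not named here.

```python
# Sketch of the two-step lookup: try the content hash against the UDP
# API first, fall back to a fingerprint MATCH via the redirector.
# Both helpers are hypothetical stand-ins, injected as callables.

def lookup_song(content_hash, fingerprint, udp_api_lookup, redirector_match):
    """udp_api_lookup(content_hash)  -> ostfile id or None (UDP API, anidb2)
    redirector_match(fingerprint)    -> list of (ostfile id, confidence)
                                        (matching redirector, anidb2)
    Returns a list of candidate ostfile ids."""
    ofid = udp_api_lookup(content_hash)
    if ofid is not None:
        return [ofid]                  # content hash known: single exact hit
    # unknown content hash: fall back to foosic fingerprint matching
    matches = redirector_match(fingerprint)
    return [ofid for ofid, confidence in matches]
```

The returned ids would then be fed back into the UDP API to fetch the actual song data.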


== General Considerations ==
** the number of fingerprints to consider for each matching lookup could be reduced further by only taking one representative fingerprint from each closely matching group of fingerprints into account.
*** i.e. if there are 20 files for one song and the fingerprints of 18 of those files match with a confidence of NN% (the actual confidence value to use will be hard to decide on; it might be something like 98%), then the median of that group (the file which has the closest cumulated match with all other files of that group) could be picked as the representative, and the other 17 fingerprints could be skipped during matching lookups. If we always want to return all matches, the remaining 17 fingerprints could be matched in a second run once the match with the representative yields a value above the cut-off point.
*** depending on the number of encodes per song and the closeness of their match, such an optimization might well reduce the number of fingerprints to consider per lookup by a factor of 20 or more.
*** possible storage: an additional grouprep int4 field which stores the ofid of the group representative if an entry is part of a group. Group representatives and fingerprints not belonging to any group would have a value of 0. The initial matching lookup could then simply restrict the SELECT with WHERE grouprep=0 (in addition to the len, dom and fit constraints).
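The "median of the group" picked above can be sketched as below: the member whose cumulated match confidence against all other group members is highest. The `pairwise_confidence` structure is an assumption for illustration, keyed by unordered ofid pairs.

```python
# Sketch: pick the group representative as the member with the closest
# cumulated match to all other group members ("median" in the sense
# used above). pairwise_confidence is a hypothetical symmetric mapping
# keyed by frozenset({ofid_a, ofid_b}).

def pick_representative(group, pairwise_confidence):
    def cumulated(ofid):
        # sum of match confidences against every other group member
        return sum(pairwise_confidence[frozenset((ofid, other))]
                   for other in group if other != ofid)
    return max(group, key=cumulated)
```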
* as further optimization may become necessary someday, we should collect some usage statistics per fingerprint in order to identify hotspots and areas of very low interest. Data to collect could be:
** the load-balancing matching redirector listens for foosic fingerprint lookups; each received lookup is sent to _all_ matching servers
** each matching server replies with ostfile ids and match confidence, or with "unknown"
** the matching redirector merges all ostfile id replies together (sorted by match confidence) and returns the reply to the client.
** if none of the matching servers indicated that it has the exact fingerprint in its local storage, the matching redirector tells a matching server with free resources to store the fingerprint.
*** the decision is made based on the observed fingerprint distribution across the matching servers. The MATCH reply from each matching server lists the number of fingerprints which had to be taken into account for the specific match. The server with the smallest number of fingerprints of similar length, avg. fit and avg. dom would be the best place to store the new fingerprint. Other factors could also be taken into account.
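The redirector's merge-and-store decision described above can be sketched as follows. The reply field names (`exact`, `candidates`, etc.) are assumptions for illustration; the draft does not fix a reply format beyond what the MATCH section below specifies.

```python
# Sketch of the redirector logic: merge all matching-server replies by
# confidence; if no server holds the exact fingerprint, pick the server
# with the fewest similar fingerprints as the store target. Field names
# are hypothetical.

def merge_replies(replies):
    """replies: list of dicts like
    {"server": id, "matches": [(ofid, confidence), ...],
     "exact": bool, "candidates": int}
    Returns (matches sorted by confidence, store target server or None)."""
    merged = sorted((m for r in replies for m in r["matches"]),
                    key=lambda m: m[1], reverse=True)
    if any(r["exact"] for r in replies):
        store_target = None           # exact fingerprint already stored
    else:
        # fewest fingerprints of similar length/fit/dom considered:
        # the cheapest place to add one more
        store_target = min(replies, key=lambda r: r["candidates"])["server"]
    return merged, store_target
```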
=== Protocol Draft ===


Every query should contain these additional parameters:
 client={str client name}&clientver={int client version}


Access to the STORE command will be limited to the main server's cron job by this method.
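Appending the mandatory client identification to a query could look like this minimal sketch; the command word and parameter names follow the draft above, while the function itself is hypothetical.

```python
# Sketch: build a query line with the mandatory client/clientver
# parameters appended, as required for every query above.
from urllib.parse import urlencode

def build_query(command, params, client, clientver):
    params = dict(params, client=client, clientver=clientver)
    return "%s %s" % (command, urlencode(params))
```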


Client:
 MATCH ofid={int4 ostfile id}&foosic={str ascii hex representation of fingerprint}[&store=1]
: The <tt>store=1</tt> parameter is filtered out and interpreted by the matching redirector; only the main server's cron job is allowed to set <tt>store=1</tt>.
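The redirector's handling of an incoming MATCH line, including filtering out <tt>store=1</tt>, can be sketched as below. The request format follows the draft; the function and the <tt>from_cron_job</tt> flag are assumptions for illustration.

```python
# Sketch: parse a MATCH request and strip the store=1 flag before
# fanning out to the matching servers. Only the main server's cron
# job may actually set store=1 (modelled here as from_cron_job).
from urllib.parse import parse_qs

def parse_match(line, from_cron_job):
    command, _, query = line.partition(" ")
    if command != "MATCH":
        raise ValueError("not a MATCH request")
    params = {k: v[0] for k, v in parse_qs(query).items()}
    # store=1 is interpreted here and never forwarded
    store = params.pop("store", "0") == "1" and from_cron_job
    return int(params["ofid"]), params["foosic"], store
```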






Client:
 STORE ofid={int4 ostfile id}&foosic={str ascii hex representation of fingerprint}
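The "ascii hex representation of fingerprint" used by MATCH and STORE can be sketched as plain hex encoding of the raw fingerprint bytes. This is an assumption; the draft does not fix the exact encoding or byte order.

```python
# Sketch of the assumed hex transport encoding for fingerprints:
# raw bytes <-> lowercase ASCII hex string.

def encode_fingerprint(raw: bytes) -> str:
    return raw.hex()

def decode_fingerprint(text: str) -> bytes:
    return bytes.fromhex(text)
```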






Client:
 DELETE ofid={int4 ostfile id}[&{int2 ident of matching server this fingerprint is stored on}]






Client:
 LOADSTAT



