OstDB DEV Foosic: Difference between revisions

m (not finished)
Line 10: Line 10:


== General Workflow ==
* user runs avdump on local audio files
* avdump sends file meta data, including foosic audio fingerprint, to anidb2 via UDP api
* cronjob/daemon on anidb2 regularly checks for newly added foosic fingerprints and sends them to anidb3 (while making sure not to flood)
* anidb3 tries to match the fingerprint.
* if a match is found a list of matching ids is returned together with the magnitude of error per match.
* if no match is found, anidb3 creates a new unique id for the fingerprint and returns the unique id.
* anidb2 stores unique id(s) in db
* anidb2 uses unique id(s) to support manual file<->song matching via the webinterface


Involved Parties:
* client (avdump), locally on user machine
* main server (anidb2), keeps file meta data incl. fingerprints
* matching redirector (anidb2), a simple load balancer which redirects all requests to the corresponding matching server(s) (IDENT goes to all, STORE to one), TCP and UDP interface (UDP for external queries, TCP for cron job/batch queries)
* matching server(s) (anidb3/sig server), a standalone Java app that does all the fingerprint<->fingerprint matching, TCP interface
Workflow - Synchronous:
* user runs client on local audio files
* client sends file meta data, including foosic audio fingerprint, to main server via UDP API
** optional intermediate step to speed up client processing: calculate the content hash first, then generate and submit the fingerprint only if the content hash is unknown to AniDB or no fingerprint is listed on AniDB for that file
* main server does not do any processing on the fingerprint and simply stores it in ostfiletb
Workflow - Asynchronous:
* cronjob/daemon on main server regularly checks for newly added foosic fingerprints and sends them, together with the ostfile id to the matching redirector via TCP
** flooding is impossible due to the synchronous nature of the TCP matching redirector api
* matching redirector forwards the IDENT query to all matching servers via TCP
** TCP connections to all matching servers should be kept alive between queries (to avoid TCP connection handshake overhead)
* matching servers try to match the fingerprint to all potentially interesting fingerprints in their database
** pre-match filtering: via length, avg dom and avg fit
** post-match filtering: via a hard cut-off point for the matching value. This will definitely be >= 0.50, since anything lower would be useless. The value can be increased at any time without problems, but reducing it later would be a problem. We might therefore want to start with 0.50 or 0.60 and increase it if we need to reduce the number of matches returned.
** potentially further internal optimizations, e.g. identifying group representatives for groups of well-matching files and taking only the representatives into account when matching
* each matching server replies with a list of matching ostfile ids together with the magnitude of error per match and some general usage data (for load balancing).
** neither fingerprints nor matching results are stored on the matching servers
** matching servers keep some internal usage statistics for potential future optimization
* matching redirector collects the replies from all matching servers and collates them into one reply which is then returned to the main server cron daemon.
** for the main server/cron daemon it is not visible which match came from which matching server
* main server stores the matching data in the db
** (ofid1 int4, ofid2 int4, matching float)
* main server uses matching data to support manual file<->song matching via the webinterface
** the user will be able to select the cut-off point for the matching value on the fly in order to reduce false positives or increase recall
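
The collation step performed by the matching redirector, together with the cut-off filtering, can be sketched roughly as follows. This is only an illustration: the function name `collate_matches`, the in-memory reply format (lists of `(ofid, matching)` pairs) and the reading of the matching value as "higher is better" are assumptions, not part of the design above.

```python
def collate_matches(query_ofid, replies, cutoff=0.50):
    """Merge the match lists of all matching servers into rows for the main server.

    replies: one list per matching server, each holding (ofid, matching) pairs.
    Returns rows shaped like the (ofid1, ofid2, matching) table described above.
    The origin server of each match is deliberately not kept.
    """
    rows = []
    for reply in replies:
        for ofid, matching in reply:
            # Hypothetical hard cut-off: drop weak matches and self-matches.
            if matching >= cutoff and ofid != query_ofid:
                rows.append((query_ofid, ofid, matching))
    # Best matches first; ties are ordered by ostfile id for stable output.
    rows.sort(key=lambda row: (-row[2], row[1]))
    return rows
```

Raising `cutoff` on the same data only removes rows, which is why increasing the cut-off later is harmless while lowering it would require matches that were never stored.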
Workflow - New
OLD:


alternatively:
Line 85: Line 113:
Every query should contain the following additional parameters:
* client={str client name}&clientver={int client version}
Access to the STORE command will be limited to the main server's cron job by this method.
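
A minimal sketch of how a client might assemble such a query line, assuming the parameters are joined with `&` as shown above. The helper name `build_query` and the use of URL-style percent-encoding for values are assumptions; the spec does not say how special characters are escaped.

```python
from urllib.parse import urlencode


def build_query(command, params, client, clientver):
    """Build one query line: the command followed by key=value pairs joined with '&'."""
    fields = dict(params)
    # Every query carries the client name and version as additional parameters.
    fields["client"] = client
    fields["clientver"] = clientver
    # Assumption: values are percent-encoded like URL query strings.
    return f"{command} {urlencode(fields)}"
```

For example, `build_query("MATCH", {"ofid": 42, "foosic": "0a1b2c"}, "avdump", 3)` yields `MATCH ofid=42&foosic=0a1b2c&client=avdump&clientver=3`.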




==== Submitting a foosic fingerprint (add to db if unknown) ====
==== Querying a foosic fingerprint (don't add if unknown) ====
(used by external clients and the main server's cron job)


Client:
* SUBMIT foosic={ascii representation}
* MATCH ofid={int4 ostfile id}&foosic={str ascii hex representation of fingerprint}


Server Reply:
* Known fingerprint
* Matchings found
** 200 KNOWN\n{int error}|{int id}\n({int error}|{int id}\n)*
: 200 MATCHED
: {int result count}|{int compare count}|{int time taken in ms}
** this line will be suppressed by the matching redirector which processes it to decide where to store a new fingerprint (load balancing)
: ({int error}|{int ofid}\n)*
 
* No matchings found
: 300 UNMATCHED
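
The MATCHED/UNMATCHED reply format can be parsed along the following lines. This is a sketch under assumptions: the function name `parse_match_reply`, the returned dictionary layout and the `has_stats_line` flag (which models the statistics line being suppressed by the matching redirector) are illustrative and not part of the protocol.

```python
def parse_match_reply(text, has_stats_line=True):
    """Parse a '200 MATCHED' or '300 UNMATCHED' reply into a dictionary."""
    lines = [line for line in text.split("\n") if line]
    if not lines:
        raise ValueError("empty reply")
    code = lines[0].split(" ", 1)[0]
    if code == "300":
        return {"code": 300, "stats": None, "matches": []}
    if code != "200":
        raise ValueError(f"unexpected reply: {lines[0]!r}")
    body = lines[1:]
    stats = None
    if has_stats_line:
        if not body:
            raise ValueError("missing statistics line")
        # {int result count}|{int compare count}|{int time taken in ms}
        result_count, compare_count, time_ms = (int(x) for x in body[0].split("|"))
        stats = {"result_count": result_count,
                 "compare_count": compare_count,
                 "time_ms": time_ms}
        body = body[1:]
    matches = []
    for line in body:
        # {int error}|{int ofid}
        error, ofid = line.split("|")
        matches.append((int(error), int(ofid)))
    return {"code": 200, "stats": stats, "matches": matches}
```

The redirector would call this with `has_stats_line=True` on replies from the matching servers, while the main server's cron job would use `has_stats_line=False` on the collated reply.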


* Unknown fingerprint
** 210 STORED\n{int id}\n


==== Querying a foosic fingerprint (don't add if unknown) ====
==== Submitting a new foosic fingerprint ====
(this would be used by a loadbalancer)
(this is only used by the main server's cron job, access is restricted)


Client:
* IDENT foosic={ascii representation}
* STORE ofid={int4 ostfile id}&foosic={str ascii hex representation of fingerprint}


Server Reply:
* Known fingerprint
* Fingerprint was not yet in DB
** 200 KNOWN\n{int error}|{int id}\n({int error}|{int id}\n)*
: 210 STORED
: {int2 ident of matching server this fingerprint is stored on}
** this line is inserted by the matching redirector, the data is only interesting for the main server's cron job


* Unknown fingerprint
* Fingerprint was already in DB
** 320 UNKNOWN\n
: 310 ALREADY STORED
: {int2 ident of matching server this fingerprint is stored on}
** this line is inserted by the matching redirector, the data is only interesting for the main server's cron job


==== Submit a foosic fingerprint, forcing storage ====
 
(used if there is an incorrect match with another fingerprint (false positive) or by a loadbalancer for fingerprints which are unknown to all servers)
==== Deleting a foosic fingerprint ====
(this is only used by the main server's cron job, access is restricted)


Client:
* STORE foosic={ascii representation}
* DELETE ofid={int4 ostfile id}


Server Reply:
* 210 STORED\n{int id}\n
* Fingerprint was in DB
: 220 DELETED
 
* Fingerprint was not in DB
: 320 NOT FOUND
 


==== Query the current server load/utilization ====
(this is only used by the matching redirector, access is restricted)


Client:
Line 127: Line 173:


Server Reply:
* 299 LOAD STAT\n{int2 load factor}\n{int4 number of fingerprints in db}\n
: 299 LOAD STAT
: {int2 load factor}|{int4 number of fingerprints in db}|{int2 system load}
** load factor: a simple multiplicative constant used to distinguish between fast and slow server hardware. It can, e.g., be used to store twice as many fingerprints on one server as on the others. The fingerprint count is converted according to the following formula prior to comparison/selection of the least used server.
*** relative fingerprint number/load = number of fingerprints in db * (load factor / 100)
(might be extended with additional data someday, e.g. real average foosic load)
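
As a rough illustration of how the matching redirector could use this reply to pick the server for a STORE, the sketch below parses the pipe-separated variant of the reply and applies the formula above. The names `parse_load_stat`, `relative_load` and `least_used`, as well as the dictionary keyed by server ident, are hypothetical.

```python
def parse_load_stat(text):
    """Parse '299 LOAD STAT' followed by 'load factor|fingerprint count|system load'."""
    lines = [line for line in text.split("\n") if line]
    if len(lines) < 2 or not lines[0].startswith("299 "):
        raise ValueError("not a LOAD STAT reply")
    load_factor, fingerprints, system_load = (int(x) for x in lines[1].split("|"))
    return load_factor, fingerprints, system_load


def relative_load(fingerprints, load_factor):
    """relative fingerprint number/load = fingerprints in db * (load factor / 100)."""
    return fingerprints * (load_factor / 100)


def least_used(servers):
    """Pick the server ident with the smallest relative load.

    servers: dict mapping server ident -> (load_factor, fingerprints, system_load).
    Ties are broken by the lower ident (an arbitrary choice for this sketch).
    """
    return min(
        servers,
        key=lambda ident: (relative_load(servers[ident][1], servers[ident][0]), ident),
    )
```

With a load factor of 50, a server holding 200000 fingerprints counts as 100000, so it is preferred over a server with load factor 100 that holds 120000 fingerprints.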