OstDB DEV Foosic: Difference between revisions

possible future optimization - group reps
(possible future optimization - group reps)
Line 83: Line 83:
== General Considerations ==
== General Considerations ==
* it is very important to effectively limit the number of fingerprints which need to be taken into account for each lookup. As such the file length and the average dom and fit should be stored in a way which allows easy and fast filtering via range queries on those 3 dimensions. so that'd probably mean it will be a: (length int4, dom int4, fit int4, fingerprint bytea) kind of table
* it is very important to effectively limit the number of fingerprints which need to be taken into account for each lookup. As such the file length and the average dom and fit should be stored in a way which allows easy and fast filtering via range queries on those 3 dimensions. so that'd probably mean it will be a: (length int4, dom int4, fit int4, fingerprint bytea) kind of table
* as furtger optimization may become necessary someday, we should collect some usage statistics per fingerprint in order to identify hotspots and areas of very low interest. Data to collect could be:
** the number of fingerprints to consider for each matching lookup could be reduced further by only taking one representative fingerprint for each closely matching group of fingerprints into account.
*** i.e. if there are 20 files for one song and the fingerprints of 18 of those files match with a confidence of NN% (the actual confidence number to use will be hard to decide on, might be something like 98%) then the median of that group (the file which has the closest commulated match with all other files of that group) could be picked as a representative and all other 17 fingerprints could be skipped during matching lookups. If we always want to return all matches, then the remaining 17 fingerprints could be matched in a second run once the matching with the representative yielded a value above the cut-off-point.
*** depending on the number of encodes per song and the closeness of their match such an optimization might well reduce the number of fingerprints to consider per lookup by a factor of 20 or more
*** possible storage: additional grouprep int4 field which stores ofid of group representative if an entry is part of a group. Group representatives and fingerprints not belonging to any group would have a value of 0. The initial matching lookup could simply restrict the SELECT to WHERE grouprep=0 (in addition to the len, dom and fit constraints).
* as further optimization may become necessary someday, we should collect some usage statistics per fingerprint in order to identify hotspots and areas of very low interest. Data to collect could be:
** added - the date this fingerprint was added
** added - the date this fingerprint was added
** bestmatch - the best match confidence this fingerprint ever achieved for a query (ignoring a self compare)
** bestmatch - the best match confidence this fingerprint ever achieved for a query (ignoring a self compare)
Line 96: Line 100:
*** the decision is made based on the observed finerprint distribution of the matching servers. The MATCH reply from the matching servers lists the number of fingerprints which needed to be taken into account for the specific match. The server with the smallest number of fingerprints with similar length, avg. fit and avg. dom would be the best place to store the new fingerprint. Other factors could also be taken into account.
*** the decision is made based on the observed finerprint distribution of the matching servers. The MATCH reply from the matching servers lists the number of fingerprints which needed to be taken into account for the specific match. The server with the smallest number of fingerprints with similar length, avg. fit and avg. dom would be the best place to store the new fingerprint. Other factors could also be taken into account.
* this would mean that each query is processed in parallel by all available matching servers. The very nature of the search approach makes the entire approach very scalable.
* this would mean that each query is processed in parallel by all available matching servers. The very nature of the search approach makes the entire approach very scalable.


== Protocol ==
== Protocol ==
MediaWiki spam blocked by CleanTalk.
MediaWiki spam blocked by CleanTalk.