OstDB DEV Foosic

From AniDB

Protocol for foosic client<->server communication via UDP and TCP

VERSION 1

Server: dedicated java daemon (UDP and TCP), on anidb3 (sig) (UDP for single queries, TCP for batch runs)

Client: cronjob/daemon on anidb2


General Workflow

  • user runs avdump on local audio files
  • avdump sends file meta data, including foosic audio fingerprint, to anidb2 via UDP api
  • cronjob/daemon on anidb2 regularly checks for newly added foosic fingerprints and sends them to anidb3 (while making sure not to flood)
  • anidb3 tries to match the fingerprint.
  • if a match is found a list of matching ids is returned together with the magnitude of error per match.
  • if no match is found, anidb3 creates a new unique id for the fingerprint and returns the unique id.
  • anidb2 stores unique id(s) in db
  • anidb2 uses unique id(s) to support manual file<->song matching via the webinterface


alternatively:

  • avdump sends data to anidb2 udpapi
  • anidb2 udpapi stores all data
  • anidb2 cronjob sends fingerprint to matching slaves
  • matching slaves return their best matches: ofid, match value
  • anidb2 cronjob sends fingerprint,ofid to the server with the worst results (or none) for storage.
  • this slave is now responsible for this fingerprint
  • storage would be ofid,length,avg_fit,avg_dom,fp
  • an identifier for the slave should be stored in ostfiletb for later administration of slaves/fingerprints
  • anidb2 cronjob adds new ostfile relations based on the results

--Epoximator 08:39, 10 May 2007 (UTC)

I see several problems with this approach:

  • there is no natural relation between ofids and fingerprints, fingerprints are matched to songs, not files. However you imply such a relation by using ofids as ids for fingerprints on the slave servers. I.e. a case where this would become a problem:
    • Let's say a fingerprint is known to anidb and is loosely (foosic fp match) shared between multiple files. Your approach would relate all these files to one another (or to one specific file from the set). If it now becomes apparent that one of these files, possibly the one whose ofid you used as fingerprint id, is in fact wrongly matched to the song in question and is moved to another song, we'd effectively be changing the primary key of the fingerprint, on multiple decentralized servers... bad idea
    • And using the songid instead won't work either, as there won't be an ostfile<->song match at the time when the fingerprints are processed. I.e. we might end up with a group of files which all share similar fingerprints but none of which is matched to a song yet. The fact that they're all sharing a similar fingerprint is highly relevant for the matching process though. So we can't delay the fingerprint analysis in order to be able to use the song ids as keys. Which is why I'd suggest that we just create new unique ids for the fingerprints.
i assumed, and still do, that fp is unique per encode (ofid) and not song. even if two fps are equal they would still be treated as two different ones throughout the system (they would just have a very good match value). so the relation would be ostfiletb >-< ostfiletb. then, on top of this, rows in songtb relate to one row each in ostfiletb (which the other relations can be derived from). if we don't want this redundancy we shouldn't store fps in ostfiletb at all, but rather in their own table where uniqueness is preserved (and with own ids). then you would have ostfiletb>-fptb-(<)songtb.


  • load balancing:
    • what are "matching slaves" ?
a matching slave is a puter with a matching daemon running
    • what do you mean with "the server with the worst results (or none) for storage"? the error margin on found matches is hardly useful to determine the average load each server is under. Instead I'd suggest to return the total number of stored fingerprints on each server with the reply and store the fingerprint on the server with the smallest fingerprint count.
the work load is not defined by how many fps there are in total, but by how many are selected for matching. the idea is that if a slave comes up with no results, or only bad ones, then it should have the new fp because it is "unique" on that server. ie. fps should be distributed so that similar fps are spread as much as possible. but the decision depends on how the selection is done:
select fp from fptb order by 2*abs(len-?)+abs(fit-?)+abs(dom-?) asc limit X (use results)
select fp from fptb where abs(len-?)<X and abs(fit-?)<Y and abs(dom-?)<Z (use count)
it will of course not balance well if the slaves have very different hw (and/or additional tasks). we should also consider processing time, but per request
    • "returns best matches", only if the match is good enough to be considered a real "match". So in many cases most of the slaves would probably reply that the fingerprint is unknown.
yes, that's true, at least in the beginning. it only depends on how many versions there are of each song and how many of them we have registered. (but one point is that we don't know what's considered a good match atm. it has to be adjusted)
    • I do agree that it is probably useful to remember which server a specific fingerprint id is stored on. The easiest/most efficient approach for that would be to use a couple of the high bits of the ids to store that info. I.e. we could use the highest 4 bits (skipping the sign bit) of a 32-bit signed integer, leaving us with an effective 27-bit range of ids. I guess we wouldn't reach the roughly 134 million (2^27) entry mark anytime soon. Neither are we likely to fork off to more than 16 servers. Both numbers could easily be extended later by switching to int8 values and using a larger room in the header for the server selection, e.g. 16 bits. That would mean that each server would have its own unique sequence for fingerprint ids (starting at 0) and we'd simply OR a server-specific bitmask onto it to obtain the external id when transmitting the data. Which means that the DB ids on the servers are not globally unique and will not change if we ever decide to change the splitting between server bits and id bits.
yes, it's useful for administration. if one slave dies it should be simple to redistribute the "lost" fps between the remaining slaves or to a new one. we should also redistribute fps when new slaves are added in general. i still think that the ids should be created and stored in the main db, but it doesn't really matter. i think it would be safer and more clean, though.
  • "anidb2 cronjob adds new ostfile relations based on the results", in cases where we already know other files with a matching fingerprint and those are already related to a specific song. If the other similar files are also "song-less", then the matching/group information would remain unused until someone opens the manual matching website. Where those files could be grouped together and would by default all be related to the song which the user selects.
yes. the matching is fully automated and is completely unrelated to the songs until someone adds those relations manually. it is of course possible to generate songs based on the metadata, though. (initially, at least)
--Epoximator 17:53, 11 May 2007 (UTC)
Exp 13:45, 11 May 2007 (UTC)
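
The id layout discussed above (4 server bits on top of a 27-bit per-server id, sign bit unused) can be sketched as follows. This is only an illustration of the bit arithmetic; the names SERVER_BITS, ID_BITS, to_external_id and from_external_id are made up for this sketch and are not part of any existing code. The server number is combined with the local id using a shift and a bitwise OR.

```python
from typing import Tuple

# Assumed layout from the discussion: 1 sign bit (unused), 4 server bits, 27 id bits.
SERVER_BITS = 4
ID_BITS = 27
ID_MASK = (1 << ID_BITS) - 1  # low 27 bits set


def to_external_id(server: int, local_id: int) -> int:
    """Combine a server number and a per-server id into one external id."""
    if not 0 <= server < (1 << SERVER_BITS):
        raise ValueError("server number does not fit into the server bits")
    if not 0 <= local_id <= ID_MASK:
        raise ValueError("local id does not fit into the id bits")
    return (server << ID_BITS) | local_id


def from_external_id(external_id: int) -> Tuple[int, int]:
    """Split an external id back into (server number, per-server id)."""
    return external_id >> ID_BITS, external_id & ID_MASK
```

Because only the external id carries the server bits, the ids stored on each server stay plain sequence numbers, and the split between server bits and id bits can be changed later by adjusting the two constants.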

Possible Extension

(for client features; UDP API)

  • command to fetch audio meta data by ostfile id, size+content hash or foosic id
  • command to add audio file to mylist by ostfile id or size+content hash (foosic server; UDP)
  • same commands as available via TCP (used by the anidb2 cronjob) also available via UDP for use by other clients. i.e. to allow lookups by foosic fingerprint. for that a client would first contact the UDP API (anidb2) with the content hash and if the content hash is unknown to anidb it would send the fingerprint to the foosic server (anidb3) to get one or more foosic id(s) and then use those to query song data from the UDP API. (this is not meant for avdump, but it might be interesting for direct integration into player software, i.e. a winamp/amarok plugin, would work somewhat like the already available musicbrainz plugins)


General Considerations for Future Expansion

  • it is very important to effectively limit the number of fingerprints which need to be taken into account for each lookup. As such the file length and the average dom and fit should be stored in a way which allows easy and fast filtering via range queries on those 3 dimensions. so that'd probably mean it will be a: length int4, dom int4, fit int4, fingerprint bytea kind of table
  • it may become necessary to purge rarely accessed fingerprints from the db every now and then to limit the db size. in order to do that we'll need to keep some counters and dates. i'd suggest: seencount int4, addeddate timestamp, lastseen timestamp. as the foosic server would require no authentication, the same user sending a fingerprint multiple times would increase the counter every time
  • it may also become necessary to split the processing over multiple servers someday. this can be greatly simplified if the protocol is designed in a way which would allow the following setup.
    • loadbalancing server listens for foosic fingerprint lookups; each received lookup is sent to _all_ foosic servers
    • each foosic server replies with ids or with "unknown"
    • loadbalancer merges all id replies together (sorted by error rate) and returns reply to client
    • if all foosic servers replied unknown, loadbalancer tells least used server to store the fingerprint and returns the generated id to the client
  • this would mean that each query is processed in parallel by all available servers. as each server only searches its own share of the fingerprints, the approach should scale well by adding servers.
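
The loadbalancer steps listed above could look roughly like this. It is a minimal sketch: the network part is left out, and the function names and the dictionaries keyed by server name are assumptions made for illustration. Each server's reply is represented as a list of (error, id) pairs, with an empty list standing for "unknown".

```python
from typing import Dict, List, Tuple

Match = Tuple[int, int]  # (error, fingerprint id)


def merge_replies(replies: Dict[str, List[Match]]) -> List[Match]:
    """Merge the matches of all servers, lowest error first."""
    merged = [match for matches in replies.values() for match in matches]
    return sorted(merged)


def pick_storage_server(fingerprint_counts: Dict[str, int]) -> str:
    """Pick the server that currently stores the fewest fingerprints."""
    return min(fingerprint_counts, key=fingerprint_counts.get)


def handle_lookup(replies: Dict[str, List[Match]],
                  fingerprint_counts: Dict[str, int]):
    """Return the merged matches, or the server that should store the fingerprint."""
    merged = merge_replies(replies)
    if merged:
        return ("KNOWN", merged)
    return ("STORE_ON", pick_storage_server(fingerprint_counts))
```

"Least used" is taken here to mean the smallest fingerprint count; a weighting by server hardware would be a refinement on top of this.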


Protocol

Broken Clients

  • we probably want to require a client string and client version in every query (similar to udp api) to be able to ban badly broken clients, should the need arise someday.

Protocol Draft

Every query should contain the following additional parameters:

  • client={str client name}&clientver={int client version}
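
A query line with these parameters might be assembled like this. The draft does not say how the command's own parameters and the client parameters are joined, so joining everything with "&" after the command name is an assumption, as is the helper name build_query. No escaping is applied, since the draft does not define any.

```python
def build_query(command: str, client: str, clientver: int, **params: str) -> str:
    """Build one query line: command, its parameters, then client and clientver."""
    # Assumption: all parameters share one "&"-separated list after the command.
    fields = dict(params, client=client, clientver=str(clientver))
    body = "&".join(f"{key}={value}" for key, value in fields.items())
    return f"{command} {body}"
```

For example, build_query("IDENT", "myclient", 1, foosic="abc") gives "IDENT foosic=abc&client=myclient&clientver=1".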


Submitting a foosic fingerprint (add to db if unknown)

Client:

  • SUBMIT foosic={ascii representation}

Server Reply:

  • Known fingerprint
    • 200 KNOWN\n{int error}|{int id}\n({int error}|{int id}\n)*
  • Unknown fingerprint
    • 210 STORED\n{int id}\n

Querying a foosic fingerprint (don't add if unknown)

(this would be used by a loadbalancer)

Client:

  • IDENT foosic={ascii representation}

Server Reply:

  • Known fingerprint
    • 200 KNOWN\n{int error}|{int id}\n({int error}|{int id}\n)*
  • Unknown fingerprint
    • 320 UNKNOWN\n
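
A client could parse the three reply shapes drafted so far (200 KNOWN, 210 STORED, 320 UNKNOWN) along these lines. The function name parse_reply and the dictionary it returns are assumptions for this sketch; only the reply formats themselves come from the draft.

```python
def parse_reply(reply: str) -> dict:
    """Parse a SUBMIT/IDENT/STORE reply into a small dictionary."""
    lines = reply.split("\n")
    code = int(lines[0].split(" ", 1)[0])
    data = [line for line in lines[1:] if line]  # drop the trailing empty line
    if code == 200:
        # one "{error}|{id}" pair per line
        matches = []
        for line in data:
            error, fp_id = line.split("|")
            matches.append((int(error), int(fp_id)))
        return {"code": code, "matches": matches}
    if code == 210:
        return {"code": code, "id": int(data[0])}
    if code == 320:
        return {"code": code}
    raise ValueError(f"unexpected reply code {code}")
```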

Submit a foosic fingerprint, forcing storage

(used if there is an incorrect match with another fingerprint (false positive) or by a loadbalancer for fingerprints which are unknown to all servers)

Client:

  • STORE foosic={ascii representation}

Server Reply:

  • 210 STORED\n{int id}\n
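
The difference between SUBMIT, IDENT and STORE can be shown with a toy in-memory server. This is not how the real daemon would match: here two fingerprints "match" only if their ascii representations are identical (error 0), which is a stand-in for the actual fuzzy foosic comparison. The class name ToyFoosicStore is made up for this sketch.

```python
class ToyFoosicStore:
    """In-memory stand-in for a foosic server (exact-match only, assumption)."""

    def __init__(self):
        self._fingerprints = {}  # id -> ascii representation
        self._next_id = 1

    def _matches(self, fingerprint):
        # Placeholder for the real fuzzy matching: identical strings, error 0.
        return [(0, fp_id) for fp_id, stored in self._fingerprints.items()
                if stored == fingerprint]

    def store(self, fingerprint):
        """STORE: always add the fingerprint, even if a match exists."""
        new_id = self._next_id
        self._next_id += 1
        self._fingerprints[new_id] = fingerprint
        return f"210 STORED\n{new_id}\n"

    def ident(self, fingerprint):
        """IDENT: look up only, never add."""
        matches = self._matches(fingerprint)
        if not matches:
            return "320 UNKNOWN\n"
        return "200 KNOWN\n" + "".join(f"{e}|{i}\n" for e, i in sorted(matches))

    def submit(self, fingerprint):
        """SUBMIT: look up, and add the fingerprint only if it is unknown."""
        reply = self.ident(fingerprint)
        return self.store(fingerprint) if reply.startswith("320") else reply
```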

Query the current server load/utilization

Client:

  • LOADSTAT

Server Reply:

  • 299 LOAD STAT\n{int2 load factor}\n{int4 number of fingerprints in db}\n
    • load factor: a simple multiplicative constant which is used to distinguish between fast and slow server hardware. This can e.g. be used to store twice as many fingerprints on one server compared to others. The fingerprint count is converted according to the following formula prior to comparison/selection of the least used server.
      • relative fingerprint number/load = number of fingerprints in db * (load factor / 100)

(might be extended with additional data someday, i.e. real average foosic load)
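
The load formula and the selection of the least used server could be computed as below. The function names and the dictionary of raw LOADSTAT replies keyed by server name are assumptions for this sketch; the reply layout and the formula are taken from the draft above.

```python
def relative_load(fingerprints: int, load_factor: int) -> float:
    """relative load = number of fingerprints in db * (load factor / 100)"""
    return fingerprints * (load_factor / 100)


def parse_loadstat(reply: str):
    """Parse '299 LOAD STAT\\n{load factor}\\n{fingerprint count}\\n'."""
    lines = reply.split("\n")
    if not lines[0].startswith("299"):
        raise ValueError("not a LOADSTAT reply")
    return int(lines[1]), int(lines[2])  # (load factor, fingerprint count)


def least_used(replies: dict) -> str:
    """Return the name of the server with the smallest relative load."""
    def load(name):
        load_factor, count = parse_loadstat(replies[name])
        return relative_load(count, load_factor)
    return min(replies, key=load)
```

With a load factor of 50, a server holding 1500 fingerprints counts as 750 and is therefore preferred over a server with load factor 100 holding 1000.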