OstDB DEV
General
This is the place to contribute ideas on a possible future addition of anime OST data to AniDB.
For other areas of active development on AniDB, check: Development
Directly related: Generic_PersonCompany_DEV
Vision
The general idea is that AniDB clients would be extended with audio file support and would automatically provide AniDB with lots of raw data on the audio files collected by its userbase. For audio files that are not yet known, interested users (aka work monkeys) would use either a client or the webinterface to specify the song (or add it, if it is not yet listed on AniDB). Known audio files could automatically be added to the user's my(ost)list, could be renamed, or could have their ID3/Comment data updated.
Data
What are the things we should be able to store/provide?
... list all entities and their attributes here ...
Artist
...
Artist Group (Band)
Data stored:
- founded on
- disbanded on
- description
...
Collection (Album)
...
Song
Representation of a specific song. Each song may be included in an arbitrary number of collections (at least one). A song can also be directly added as OP/ED for an anime (also stored: the first ep the song was used in). A song is related to files: each file can contain one or more songs (e.g. a full CD image), and each song can have many files.
Each song automatically has one special "generic file", which can be used to add the song to my(ost)list without specifying any concrete file. The files known to AniDB for each song are not shown on the webinterface. They can only be added to mylist by using an AniDB client.
Data stored:
- arbitrary number of titles (with language information)
- length (in seconds)
- genre
- ...
maybe:
- flags?
- release date
- ...
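The Song entity described above could be sketched roughly like this. This is only an illustration: the class and field names (`Song`, `SongFile`, `is_generic`, ...) are assumptions and not part of any existing AniDB schema. The point is that the "generic file" is created automatically with the song, so a song can be added to my(ost)list without any concrete file.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class SongFile:
    """A file entry linked to a song (hypothetical name)."""
    song_id: int
    is_generic: bool = False
    content_hash: Optional[str] = None  # a generic file has no hash


@dataclass
class Song:
    song_id: int
    length_seconds: int
    genre: str
    # Arbitrary number of titles, each stored as (language, title).
    titles: List[Tuple[str, str]] = field(default_factory=list)
    files: List[SongFile] = field(default_factory=list)

    def __post_init__(self) -> None:
        # Every song automatically gets exactly one generic file.
        if not any(f.is_generic for f in self.files):
            self.files.insert(0, SongFile(song_id=self.song_id, is_generic=True))

    def generic_file(self) -> SongFile:
        return next(f for f in self.files if f.is_generic)


song = Song(song_id=1, length_seconds=90, genre="example genre",
            titles=[("en", "Example Song")])
```

A user without a client could then be given `song.generic_file()` as the mylist entry.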
Audio File
Representation of an actual physical file which was encountered by an AniDB client at least once (it is not possible to add files manually). Files are initially not linked to any songs, albums or artists. Once known to AniDB, a file can be manually linked to a song via the webinterface or an AniDB client. Linking will be supported by the file meta data known to AniDB. For example, if AniDB has collected ID3 tag data for song title, artist, track number, album, ..., the available data will be used to suggest some likely matches for each file. It will be up to the user to verify the correctness of the suggested matches.
Users can add files to their my(ost)lists. Songs can also be added to mylist directly by using generic files.
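The "suggest some likely matches" step could look something like the following. This is a minimal sketch, assuming simple fuzzy string comparison on title and artist is good enough. The function names, the 0.7/0.3 weights and the 0.6 threshold are made up for illustration, and the result is only a suggestion that a user still has to verify.

```python
from difflib import SequenceMatcher
from typing import Dict, List, Tuple


def similarity(a: str, b: str) -> float:
    """Case-insensitive fuzzy similarity in [0, 1]; empty values never match."""
    a, b = a.lower().strip(), b.lower().strip()
    if not a or not b:
        return 0.0
    return SequenceMatcher(None, a, b).ratio()


def suggest_songs(file_meta: Dict[str, str],
                  known_songs: List[Dict],
                  limit: int = 3,
                  threshold: float = 0.6) -> List[Tuple[int, float]]:
    """Return up to `limit` (song id, score) pairs, best match first."""
    scored = []
    for song in known_songs:
        # Assumed weighting: the title counts more than the artist.
        score = (0.7 * similarity(file_meta.get("title", ""), song["title"])
                 + 0.3 * similarity(file_meta.get("artist", ""), song["artist"]))
        if score >= threshold:
            scored.append((score, song["id"]))
    scored.sort(reverse=True)
    return [(song_id, round(score, 2)) for score, song_id in scored[:limit]]


known = [
    {"id": 1, "title": "Example Opening Theme", "artist": "Some Band"},
    {"id": 2, "title": "Unrelated Ending", "artist": "Other Artist"},
]
suggestions = suggest_songs(
    {"title": "example opening theme ", "artist": "some band"}, known)
```

Track number and album could be added as further signals in the same way.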
Data stored:
- id (ofid)
- size (without header)
- content hash (SHA1 without header)
- trmid (MusicBrainz TRM ID, unused atm)
- audio codec
- bitrate (in kBit/s)
- audio channels
- length (in seconds)
- flags (unused)
Special cached values:
- number of users who have this file in their ost list (unique)
- number of users who have submitted data for this file (unique)
- number of meta data entries submitted for this file
Constraints:
- unique (size, content hash)
...
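As a rough illustration of the attributes, cached values and the unique (size, content hash) constraint listed above, here is an in-memory sketch. The names `AudioFile` and `AudioFileRegistry` and their methods are hypothetical. A real implementation would enforce the constraint in the database instead.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Set, Tuple


@dataclass
class AudioFile:
    ofid: int
    size: int              # bytes, without header
    content_hash: str      # SHA1 hex digest, without header
    audio_codec: str
    bitrate_kbit: int
    channels: int
    length_seconds: int
    trmid: Optional[str] = None   # unused atm
    flags: int = 0                # unused
    # Sources for the cached values (unique users / raw entry count).
    list_users: Set[int] = field(default_factory=set)
    submitters: Set[int] = field(default_factory=set)
    meta_entries: int = 0

    def record_submission(self, user_id: int) -> None:
        self.submitters.add(user_id)   # counted once per user
        self.meta_entries += 1         # counted per submitted entry


class AudioFileRegistry:
    """Keeps (size, content hash) unique across all known files."""

    def __init__(self) -> None:
        self._by_key: Dict[Tuple[int, str], AudioFile] = {}
        self._next_ofid = 1

    def register(self, size: int, content_hash: str, **attrs) -> AudioFile:
        key = (size, content_hash)
        if key in self._by_key:
            return self._by_key[key]   # file already known, reuse its ofid
        new_file = AudioFile(ofid=self._next_ofid, size=size,
                             content_hash=content_hash, **attrs)
        self._next_ofid += 1
        self._by_key[key] = new_file
        return new_file
```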
Implementation
General
One key factor in allowing a certain degree of automation is the automatic identification of audio files. There are some services out there, like MusicBrainz, which do this but tend to list only the very well known OSTs. Reimplementing something like this for AniDB would clearly be infeasible. One possible approach would be to generate normal SHA1 hashes over the raw audio data (still in compressed form but without any ID3 tags, comments, ...; basically this would mostly mean skipping the header for hash generation). This could be extended by storing additional TRM IDs from MusicBrainz, where available. Content hashes would differ for the same song from encode to encode. However, matching of audio files to songs could probably be automated to a certain degree by using the ID3/Comment values found on the files in question.
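The header-skipping hash could be sketched as follows. This is only a minimal illustration for MP3-style files, assuming the tags to skip are a leading ID3v2 block and a trailing 128-byte ID3v1 block. Other containers (Vorbis comments, APE tags, ...) would need their own handling. The function names are made up.

```python
import hashlib
from typing import Tuple


def _syncsafe_to_int(b: bytes) -> int:
    # ID3v2 sizes use 7 bits per byte (the top bit is always 0).
    return ((b[0] & 0x7F) << 21) | ((b[1] & 0x7F) << 14) \
        | ((b[2] & 0x7F) << 7) | (b[3] & 0x7F)


def strip_id3(data: bytes) -> bytes:
    """Return the audio payload without ID3v2 (front) and ID3v1 (end) tags."""
    start, end = 0, len(data)
    # ID3v2 header: "ID3", 2 version bytes, 1 flags byte, 4-byte syncsafe size.
    if len(data) >= 10 and data[:3] == b"ID3":
        tag_size = _syncsafe_to_int(data[6:10])
        footer = 10 if data[5] & 0x10 else 0   # optional footer flag
        start = min(10 + tag_size + footer, end)
    # ID3v1: fixed 128 bytes at the end of the file, starting with "TAG".
    if end - start >= 128 and data[end - 128:end - 125] == b"TAG":
        end -= 128
    return data[start:end]


def content_hash(data: bytes) -> Tuple[int, str]:
    """Return (size without header, SHA1 hex digest without header)."""
    payload = strip_id3(data)
    return len(payload), hashlib.sha1(payload).hexdigest()


# Fake "audio" bytes with and without tags hash to the same value.
audio = b"\xff\xfb" + bytes(range(100))
id3v2 = b"ID3\x03\x00\x00" + bytes([0, 0, 0, 20]) + b"\x00" * 20
id3v1 = b"TAG" + b"\x00" * 125
tagged = id3v2 + audio + id3v1
```

With this, retagging a file would not change its (size, content hash) pair, while a different encode of the same song would still produce a different hash.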
Maybe a free acoustic fingerprinting algorithm could be used? http://www.foosic.org/libfooid.php
- (EXP) that's pretty similar to MusicBrainz, and being free definitely has some advantages. Unfortunately, it seems they're not offering any indexing server which would assign unique IDs (like MusicBrainz's TRM IDs) to songs based on acoustic fingerprints. That means we'd have to do that ourselves. Storing lots of ~500 byte fingerprints (which are not 100% equal for different files of the same song) and doing the loose matching would be quite demanding for the server. That could of course be handled by a separate server, but I wonder if it is really a good idea to duplicate existing services in such a way. On the other hand, one of the main problems with MusicBrainz is that they regularly purge old/rarely referenced songs from their database, which could give us real trouble for anime OSTs. If you try it, you'll notice that their coverage of anime OSTs is very poor.
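To give an idea of why the loose matching is demanding, here is a toy sketch. It is NOT libfooid's actual comparison method. It simply assumes fixed-length fingerprints compared by the fraction of equal bits, with a made-up threshold of 0.9. Since near-equal fingerprints cannot be looked up by exact key, every query has to be compared against every stored fingerprint.

```python
from typing import Dict, Optional


def bit_similarity(a: bytes, b: bytes) -> float:
    """Fraction of identical bits between two equal-length fingerprints."""
    if len(a) != len(b) or not a:
        return 0.0
    differing = sum(bin(x ^ y).count("1") for x, y in zip(a, b))
    return 1.0 - differing / (8 * len(a))


def find_match(query: bytes, known: Dict[int, bytes],
               threshold: float = 0.9) -> Optional[int]:
    """Linear scan over all stored fingerprints: O(number of songs) per query."""
    best_id, best_score = None, threshold
    for song_id, fingerprint in known.items():
        score = bit_similarity(query, fingerprint)
        if score >= best_score:
            best_id, best_score = song_id, score
    return best_id


# Toy data: 500-byte fingerprints (hypothetical content).
fp_a = bytes(i % 256 for i in range(500))
fp_b = bytes(b ^ 0xFF for b in fp_a)      # every bit differs from fp_a
near_a = bytearray(fp_a)
near_a[0] ^= 0xFF                         # flip a few bits, as another
near_a[1] ^= 0x0F                         # encode of the same song might
known = {1: fp_a, 2: fp_b}
```

Here `find_match(bytes(near_a), known)` finds song 1 even though the fingerprints are not byte-identical, which is exactly what a plain hash lookup cannot do.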
Database
Approach 1
Here is one possible way of realizing the database structure. It is not exactly 100% correct UML, but you should get the idea. Classes are supposed to represent database entities. Lots of attributes are still missing, but I'd like some feedback on whether this general structure would be viable.
To Fix:
- if an ArtistOrBand entry is a Band it needs a name Ace (EXP: true)
- maybe link the MetaData and FileSys.Data to the submitting user? Ace
- (EXP) might come in handy in some cases, but on the other hand it'll increase the data size. But it's probably a good idea. It would make it easier to ensure that we're not counting the submissions of a user multiple times, and we could also display the user's personal metatags/filenames on his pages.
Needs Feedback:
- maybe add a CollectionFile table for zip, rar, ... packed releases linked to AudioFile, Collection and Group? Ace
- (EXP) do we really need to support packed releases? Most users would extract archives anyway. But even if we want to support them, we won't need another table for that. The song<->file relation is M:N, meaning that one file can already contain multiple songs. I wouldn't add archives though. If we really have to, the client software could extract archives on the fly and hash the audio files inside. The one-file:multiple-songs case was rather meant for those lossless audio files which may contain an entire CD.
Don't Fix?:
- we can't store very complex cases, but maybe we don't need to
- more than one MetaData field for an AudioFile? Ace
- (EXP) the diagram says that already, doesn't it? (NOTE: multiplicities are used according to UML class diagram conventions, not ER diagram conventions)
- right sorry Ace 11:13, 27 February 2007 (UTC)
Fixed?:
- store type with song<->artist relations (i.e. composer, lyrics, singer, ...)
- rename some tables to make them more generic (ID3-Tags -> MetaData?, Album -> Collection?)
- album<->song relation needs int trackno attribute
- maybe a song<->song relation "cover version of" ?
- list bands/groups and members? -> artist<->artist relation "member of" and a type flag for artists: band/person?
- make song<->audio file a many-to-many relation (for audio files which contain more than one song)
- maybe try to unify typical data about a person in a new person table which can then be referred to by seiyu, artist and producer tables. (would also remove the need for a special artist<->seiyu relation)
- multiplicities of released-by relation between audio file and audio group are wrong (switched) in diagram
- anime<->song relation with attributes (type: OP/ED, first-ep: eid)
changes to diagram
- rev3:
- split ArtistOrBand into ArtistGroup and Artist
- artist<->person relation now * to 1 (was * to 0..1)
- song<->song "cover of" relation now 0..1 to * (was * to *)
- merged FileSys.Data and MetaData, MetaData now only stores specific meta data which we're interested in, not all tag data stored in a file.
- added relation User<->MetaData to keep track of who submitted what
- dropped AudioGroup
- dropped attributes from title entities
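For readers without the diagram, the relations discussed above (after rev3) could be sketched as SQLite tables roughly like this. It is only an illustration under assumptions: all table and column names are hypothetical, many attributes are left out, and the real diagram may differ in details.

```python
import sqlite3

# All table and column names are hypothetical placeholders.
SCHEMA = """
CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE artist (
    id INTEGER PRIMARY KEY,
    person_id INTEGER NOT NULL REFERENCES person(id)  -- artist<->person: * to 1
);
CREATE TABLE artist_group (
    id INTEGER PRIMARY KEY, name TEXT NOT NULL,
    founded_on TEXT, disbanded_on TEXT, description TEXT
);
CREATE TABLE song (
    id INTEGER PRIMARY KEY,
    length_seconds INTEGER,
    cover_of INTEGER REFERENCES song(id)              -- "cover of": 0..1 to *
);
CREATE TABLE song_artist (
    song_id INTEGER NOT NULL REFERENCES song(id),
    artist_id INTEGER NOT NULL REFERENCES artist(id),
    role TEXT NOT NULL,                               -- composer, lyrics, singer
    PRIMARY KEY (song_id, artist_id, role)
);
CREATE TABLE collection (id INTEGER PRIMARY KEY);
CREATE TABLE collection_song (
    collection_id INTEGER NOT NULL REFERENCES collection(id),
    song_id INTEGER NOT NULL REFERENCES song(id),
    trackno INTEGER NOT NULL,
    PRIMARY KEY (collection_id, trackno)
);
CREATE TABLE audio_file (
    ofid INTEGER PRIMARY KEY,
    size INTEGER NOT NULL,
    content_hash TEXT NOT NULL,
    UNIQUE (size, content_hash)
);
CREATE TABLE song_file (                              -- song<->file is M:N
    song_id INTEGER NOT NULL REFERENCES song(id),
    ofid INTEGER NOT NULL REFERENCES audio_file(ofid),
    PRIMARY KEY (song_id, ofid)
);
CREATE TABLE anime_song (
    aid INTEGER NOT NULL,
    song_id INTEGER NOT NULL REFERENCES song(id),
    type TEXT NOT NULL CHECK (type IN ('OP', 'ED')),
    first_eid INTEGER,                                -- first ep the song was used in
    PRIMARY KEY (aid, song_id, type)
);
CREATE TABLE meta_data (                              -- who submitted what
    id INTEGER PRIMARY KEY,
    ofid INTEGER NOT NULL REFERENCES audio_file(ofid),
    user_id INTEGER NOT NULL,
    title TEXT, artist TEXT, album TEXT, trackno INTEGER
);
"""


def create_db(path: str = ":memory:") -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite ignores FKs by default
    conn.executescript(SCHEMA)
    return conn
```

One file containing a whole CD would then simply have several `song_file` rows, so no extra table for archives or CD images is needed.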