api.md

Modules

bucket

Look up available data dumps on the S3 bucket

dataManager

Manage data dump files that have already been downloaded

fetcher

Download data dumps and show download progress

processing/dumpFormatter

Helpers to transform the dumps parsed by XMLParser into plain objects that are easier to work with.

processing/processor

Parse the data dump XML into plain JS objects and process them with a given function. See readme.md for an example

util/parseUtils

Small helpers for parsing Discogs data

bucket

Look up available data dumps on the S3 bucket

bucket~getDumpURL(version, collection) ⇒ string

Get the URL for a specific data dump

Kind: inner method of bucket

Param Type Description
version string The exact version name, eg '20180101'
collection string The type of data. Can be either "artists", "labels", "masters" or "releases"

bucket~getChecksumURL(version) ⇒ string

Get the URL for a checksum file of the specified version

Kind: inner method of bucket

Param Type Description
version string The exact version name, eg '20180101'

bucket~fetchYearListings() ⇒ Promise.<Array.<{path:string, year:number}>>

Fetch a set of years available on the Discogs data S3 bucket with their paths on the bucket.

Kind: inner method of bucket

bucket~fetchFileListing(yearPrefix) ⇒ Promise.<Array.<string>>

Fetch the list of files available on the S3 bucket for a certain year

Kind: inner method of bucket
Returns: Promise.<Array.<string>> - An array of paths

Param Type Description
yearPrefix string The year prefix of the file. For example: "data/2016/"

bucket~parseFileNames(filenames) ⇒ Object

Parse a list of file paths (as returned by fetchFileListing). Groups them by year

Kind: inner method of bucket
Returns: Object - An object with keys for each year and an array of parsed path objects as values.

Param Type
filenames Array.<string>
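
As a rough illustration of the grouping described above, the sketch below parses paths of the form data/2016/discogs_20160101_artists.xml.gz and groups them by year. The name groupByYearSketch, the regular expression, and the shape of the parsed objects ({ path, version, collection }) are assumptions made for this example; the objects the library returns may look different.

```javascript
// Hypothetical sketch: group dump file paths by year.
// The path pattern and the parsed object shape are assumptions, not the library's API.
const DUMP_PATH = /^data\/(\d{4})\/discogs_(\d{8})_(artists|labels|masters|releases)\.xml\.gz$/;

function groupByYearSketch(filenames) {
  const byYear = {};
  for (const path of filenames) {
    const match = DUMP_PATH.exec(path);
    if (!match) continue; // skip anything that is not a dump file (e.g. checksum files)
    const [, year, version, collection] = match;
    if (!byYear[year]) byYear[year] = [];
    byYear[year].push({ path, version, collection });
  }
  return byYear;
}
```

Paths that do not match the assumed pattern are skipped rather than reported, which keeps the sketch short; a real implementation may want to surface them.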

bucket~getLatestVersion() ⇒ Promise.<string>

Gets the name of the latest version available in the S3 bucket

Kind: inner method of bucket
Returns: Promise.<string> - A promise that resolves with the version name

dataManager

Manage data dump files that have already been downloaded

dataManager~getXMLPath(version, collection, [gz], [dataDir]) ⇒ string

Get the path where a data XML is saved

Kind: inner method of dataManager

Param Type Default Description
version string The exact version name, eg '20180101'
collection string The type of data. Can be either "artists", "labels", "masters" or "releases"
[gz] boolean false Whether this is the compressed file (.xml.gz) rather than the uncompressed one (.xml)
[dataDir] string "./data" Root directory where discogs-data-tools stores data files. Defaults to ./data relative to working directory

dataManager~getChecksumPath(version, [dataDir]) ⇒ string

Get the path to where the checksum file for a specified version is stored

Kind: inner method of dataManager

Param Type Default Description
version string The exact version name, eg '20180101'
[dataDir] string "./data" Root directory where discogs-data-tools stores data files. Defaults to ./data relative to working directory

dataManager~findXML(version, collection, [gz], [dataDir]) ⇒ Object | null

Looks up an existing data XML file on disk

Kind: inner method of dataManager
Returns: Object | null - An object of the form { path: string, gz: boolean } if the file was found, null otherwise

Param Type Default Description
version string The exact version name, eg '20180101'
collection string The type of data. Can be either "artists", "labels", "masters" or "releases"
[gz] boolean false Whether this is the compressed file (.xml.gz) rather than the uncompressed one (.xml)
[dataDir] string "./data" Root directory where discogs-data-tools stores data files. Defaults to ./data relative to working directory

dataManager~findData(version, collections, [dataDir]) ⇒ Array.<(Object|null)>

Looks up the XML files on disk for a given version

Kind: inner method of dataManager
Returns: Array.<(Object|null)> - An array of results for each type: An object of the form { path: string, gz: boolean } if the file was found, null otherwise

Param Type Default Description
version string The exact version name, eg '20180101'
collections Array.<string> An array of types to get. Possible options: "artists", "labels", "masters" or "releases". Defaults to all types
[dataDir] string "./data" Root directory where discogs-data-tools stores data files. Defaults to ./data relative to working directory
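
Because findData returns one entry per requested type, with null for files that were not found, the result can be checked for gaps before processing. The helper below is only a sketch: missingCollectionsSketch is a hypothetical name, and it assumes the results array is in the same order as the collections array that was passed in.

```javascript
// Hypothetical helper: list the collections for which no file was found.
// Assumes results[i] corresponds to collections[i] (an assumption, not documented).
function missingCollectionsSketch(collections, results) {
  return collections.filter((_, index) => results[index] === null);
}

// Example input shaped like the documented return value of findData.
const exampleResults = [
  { path: './data/example_artists.xml.gz', gz: true }, // placeholder path
  null,
];
const stillMissing = missingCollectionsSketch(['artists', 'labels'], exampleResults);
```

Here stillMissing contains only 'labels', since that is the one entry that is null.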

dataManager~globDumps([dataDir]) ⇒ Object

List all data downloaded to the data directory

Kind: inner method of dataManager
Returns: Object - A map containing all downloaded files

Param Type Default Description
[dataDir] string "./data" Root directory where discogs-data-tools stores data files. Defaults to ./data relative to working directory

fetcher

Download data dumps and show download progress

fetcher~ensureDump(version, collection, [showProgress], [dataDir]) ⇒ Promise.<void>

Ensures a data dump file is downloaded inside the data directory (./data by default). Does nothing if the file already exists. Does not verify the file.

Kind: inner method of fetcher
Returns: Promise.<void> - A Promise that completes when all data is downloaded

Param Type Default Description
version string The exact version name, eg '20180101'
collection string The type of data. Can be either "artists", "labels", "masters" or "releases"
[showProgress] boolean false Show a progress indicator. For usage in an interactive CLI. On a server you probably want this set to false
[dataDir] string Set to override the default data directory where dumps are stored (./data)

fetcher~ensureDumps(version, [collections], [showProgress], [dataDir]) ⇒ Promise.<void>

Ensures all the specified collections of a specific data dump version are downloaded to the given data directory

Kind: inner method of fetcher
Returns: Promise.<void> - A Promise that completes when all data is downloaded

Param Type Default Description
version string The exact version name, eg '20180101'
[collections] Array.<string> An array of types to get. Possible options: "artists", "labels", "masters" or "releases". Defaults to all types
[showProgress] boolean false Show a progress indicator. For usage in an interactive CLI. On a server you probably want this set to false
[dataDir] string Set to override the default data directory where dumps are stored (./data)

fetcher~ensureChecksum(version, [dataDir]) ⇒ Promise.<void>

Ensures that the CHECKSUM file for a given version is downloaded

Kind: inner method of fetcher

Param Type Description
version string The exact version name, eg '20180101'
[dataDir] string Set to override the default data directory where dumps are stored (./data)

processing/dumpFormatter

Helpers to transform the dumps parsed by XMLParser into plain objects that are easier to work with.

processing/dumpFormatter~formatLabel(label, [includeImageObjects]) ⇒ object

Format a label tag. See readme.md for information on how the data is transformed

Kind: inner method of processing/dumpFormatter

Param Type Default Description
label Object A label tag parsed by XMLParser which conforms to the schema/label-xml.json schema
[includeImageObjects] boolean false If true, include the image objects (even though they do not contain a URI)

processing/dumpFormatter~formatArtist(artist, [includeImageObjects]) ⇒ object

Format an artist tag. See readme.md for information on how the data is transformed

Kind: inner method of processing/dumpFormatter

Param Type Default Description
artist Object An artist tag parsed by XMLParser which conforms to the schema/artist-xml.json schema
[includeImageObjects] boolean false If true, include the image objects (even though they do not contain a URI)

processing/dumpFormatter~formatMaster(master, [includeImageObjects]) ⇒ object

Format a master tag. See readme.md for information on how the data is transformed

Kind: inner method of processing/dumpFormatter

Param Type Default Description
master Object A master tag parsed by XMLParser which conforms to the schema/master-xml.json schema
[includeImageObjects] boolean false If true, include the image objects (even though they do not contain a URI)

processing/dumpFormatter~formatRelease(release, [includeImageObjects]) ⇒ object

Format a release tag. See readme.md for information on how the data is transformed

Kind: inner method of processing/dumpFormatter

Param Type Default Description
release Object A release tag parsed by XMLParser which conforms to the schema/release-xml.json schema
[includeImageObjects] boolean false If true, include the image objects (even though they do not contain a URI)

processing/processor

Parse the data dump XML into plain JS objects and process them with a given function. See readme.md for an example

processing/processor~processDumpFile(path, collection, fn, [gz], [chunkSize], [restart]) ⇒ Promise

Processes an XML dump file using node-expat into plain objects. Every chunkSize rows the parser will pause and pass the result to the fn function. Once the fn function completes, parsing continues until the entire file is parsed.

Kind: inner method of processing/processor
Returns: Promise - A Promise that resolves when processing is complete

Param Type Default Description
path string The full path to the file to process
collection string The type of data. Can be either "artists", "labels", "masters" or "releases"
fn processChunkFn The function to call on each chunk of data.
[gz] boolean true A boolean indicating if the dump is compressed in gzip format
[chunkSize] number 1000 The number of XML rows that are parsed by node-expat until fn is called. A bigger number may be more efficient, but costs more memory
[restart] boolean false By default, the processing progress is stored in a .processing file alongside the data dumps. If the processing is stopped, it will continue from that row once you call processDumpFile again. Set this to true to always start from the beginning.

Example

processDumpFile(
  './discogs_20190101_artists.xml.gz',
  'artists',
  chunk => {
     // process the results here. For this example, we just console.log them
     chunk.forEach(row => console.log(row));

     return Promise.resolve();
  }
);
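
The pause-and-resume behaviour described for processDumpFile can be pictured with a much simpler, non-streaming sketch. The function below is not the library's implementation (which parses XML with node-expat); processInChunksSketch is a hypothetical name, and it works on an in-memory array purely to show how rows are buffered up to chunkSize and how the handler is awaited before more rows are read.

```javascript
// Minimal sketch of chunked processing: rows are buffered until chunkSize is
// reached, then fn is awaited before any further rows are consumed.
// Illustration only; the real parser streams XML instead of iterating an array.
async function processInChunksSketch(rows, fn, chunkSize = 1000) {
  let chunk = [];
  for (const row of rows) {
    chunk.push(row);
    if (chunk.length === chunkSize) {
      await fn(chunk); // "pause" until the handler has finished
      chunk = [];
    }
  }
  if (chunk.length > 0) {
    await fn(chunk); // flush the final, possibly smaller, chunk
  }
}
```

A larger chunkSize means fewer handler calls but more rows held in memory at once, which is the trade-off the parameter description mentions.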

processing/processor~processDumps(version, fn, [collections], [chunkSize], [restart], [dataDir]) ⇒ Promise.<void>

Looks up the downloaded data dumps of a given version. Then calls processDumpFile on each of them.

Kind: inner method of processing/processor
See: processDumpFile

Param Type Default
version string
fn function
[collections] Array.<string>
[chunkSize] number 1000
[restart] boolean false
[dataDir] string "'/data'"
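
A typical flow is to make sure the dumps are on disk and then process them. The sketch below only relies on the call signatures documented in this file. How the fetcher and processor modules are imported is not shown here, so they are passed in as arguments; downloadAndProcessSketch and the chosen collections are assumptions for illustration.

```javascript
// Hypothetical helper: download the dumps for a version, then process them.
// `fetcher` and `processor` are injected because the import path is not
// documented in this file; only the documented signatures are used.
async function downloadAndProcessSketch({ fetcher, processor }, version, fn) {
  const collections = ['artists', 'labels'];
  const dataDir = './data';

  // ensureDumps(version, [collections], [showProgress], [dataDir])
  await fetcher.ensureDumps(version, collections, false, dataDir);

  // processDumps(version, fn, [collections], [chunkSize], [restart], [dataDir])
  await processor.processDumps(version, fn, collections, 1000, false, dataDir);
}
```

Passing the same dataDir to both calls keeps the download location and the lookup location in sync.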

processing/processor~processChunkFn ⇒ Promise

The signature of the fn function passed to processDumpFile

Kind: inner typedef of processing/processor
Returns: Promise - A promise that resolves when processing is complete

Param Type Description
chunk Array.<Object> An array of plain objects as parsed by node-expat from XML
collection string The type of collection ("artists", "labels", "masters" or "releases")
path string The path to the dump file that is being processed
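
For illustration, a handler matching this signature might look like the sketch below, which counts rows per collection. makeRowCounterSketch is a hypothetical name and the counting logic is only an example of what a handler could do.

```javascript
// Hypothetical handler factory: returns a function with the documented
// processChunkFn signature (chunk, collection, path) => Promise.
function makeRowCounterSketch() {
  const counts = {};
  const fn = (chunk, collection, path) => {
    // `path` is available for logging but is not needed for counting.
    counts[collection] = (counts[collection] || 0) + chunk.length;
    return Promise.resolve();
  };
  return { fn, counts };
}
```

The returned fn can be passed as the fn argument, and counts can be inspected once the processing promise has resolved.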

util/parseUtils

Small helpers for parsing Discogs data

util/parseUtils~parseIntSafe(str) ⇒ number

Runs parseInt and throws an error when the result is NaN

Kind: inner method of util/parseUtils

Param Type Description
str string The string to parse
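
The behaviour described above can be approximated in a few lines. This is a sketch under assumptions, not the library's code: parseIntSafeSketch is a hypothetical name, and the radix of 10 and the error message are choices made for the example.

```javascript
// Sketch of a "safe" parseInt: throws instead of silently returning NaN.
// The radix (10) and the error message are assumptions.
function parseIntSafeSketch(str) {
  const value = parseInt(str, 10);
  if (Number.isNaN(value)) {
    throw new Error(`Unable to parse "${str}" as an integer`);
  }
  return value;
}
```

Note that parseInt itself accepts trailing non-digits (for instance "12abc" parses as 12), so a check on NaN alone does not reject every malformed input.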

util/parseUtils~parseDiscogsName(name, target) ⇒ object

Parses a name from Discogs that potentially has a "(n)" numeric postfix. Stores the result on the specified target object. Will set the following properties:

name: the name with the "(n)" postfix removed
originalName: the name without modifications
nameIndex: the number n inside the postfix. 1 if there isn't any

Kind: inner method of util/parseUtils
Returns: object - A reference to target

Param Type Description
name string The name to parse
target object An object to store the results on
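
The three properties listed above can be derived with a single regular expression. The following is a sketch rather than the library's implementation: parseDiscogsNameSketch is a hypothetical name and the exact pattern is an assumption.

```javascript
// Sketch: split a Discogs name such as "Nirvana (2)" into its parts.
// The regular expression is an assumption about what counts as a "(n)" postfix.
function parseDiscogsNameSketch(name, target) {
  const match = /^(.*\S)\s*\((\d+)\)$/.exec(name);
  target.originalName = name;
  if (match) {
    target.name = match[1];
    target.nameIndex = parseInt(match[2], 10);
  } else {
    target.name = name;
    target.nameIndex = 1; // no postfix present
  }
  return target;
}
```

With this sketch, "Nirvana (2)" yields the name "Nirvana" with nameIndex 2, while a plain "Nirvana" keeps its name and gets nameIndex 1.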

util/parseUtils~parseDuration(duration, target) ⇒ object

Parses the duration string from a Discogs XML file and stores the result on the target object. Will store the string as-is on the 'originalDuration' property. If the duration is formatted somewhat correctly, will calculate the duration in number of seconds and store it on the 'duration' property.

Kind: inner method of util/parseUtils
Returns: object - target for chaining

Param Type Description
duration string A duration formatted as string
target object The target object to store results on
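
What counts as "formatted somewhat correctly" is not spelled out here. The sketch below assumes colon-separated numeric parts in the form m:ss or h:mm:ss; parseDurationSketch is a hypothetical name and the accepted formats are an assumption, so the library may be more or less lenient.

```javascript
// Sketch: convert "m:ss" or "h:mm:ss" into a number of seconds.
// Which formats are accepted is an assumption made for this example.
function parseDurationSketch(duration, target) {
  target.originalDuration = duration; // always keep the raw string
  if (/^\d+(:\d{1,2}){1,2}$/.test(duration)) {
    target.duration = duration
      .split(':')
      .reduce((total, part) => total * 60 + parseInt(part, 10), 0);
  }
  return target;
}
```

For inputs that do not match the assumed pattern, only originalDuration is set and duration is left undefined.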

util/parseUtils~parseReleaseDate(date, target)

Parses the given release date and formats it according to the Discogs Database Guidelines. The result is stored on the "released" property of the target object. The date is formatted as either YYYY or YYYY-MM-DD. If only the year and month are given, the day is set to 00. If dashes are missing, they are added. All other formats are discarded.

Kind: inner method of util/parseUtils

Param Type Description
date string The date string to parse
target object The target object to store the result on
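
The formatting rules above can be sketched with one regular expression. parseReleaseDateSketch is a hypothetical name, the pattern is an assumption about which inputs are recognised, and returning the target is a choice made for this example only.

```javascript
// Sketch: normalise a release date to "YYYY" or "YYYY-MM-DD".
// Accepts YYYY, YYYY-MM, YYYY-MM-DD and the same forms without dashes.
function parseReleaseDateSketch(date, target) {
  const match = /^(\d{4})(?:-?(\d{2})(?:-?(\d{2}))?)?$/.exec(date);
  if (!match) return target; // unrecognised formats are discarded
  const [, year, month, day] = match;
  // A missing day becomes "00"; a bare year stays as-is.
  target.released = month ? `${year}-${month}-${day || '00'}` : year;
  return target;
}
```

The sketch does not check that the month and day are within a valid range; it only normalises the layout.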