- bucket
Look up available data dumps on the S3 bucket
- dataManager
Manage data dump files that have already been downloaded
- fetcher
Download data dumps and show download progress
- processing/dumpFormatter
Helpers to transform the dumps parsed by XMLParser into plain objects that are easier to work with.
- processing/processor
Parse the data dump XML into plain JS objects and process them with a given function. See readme.md for an example
- util/parseUtils
Small helpers for parsing Discogs data
Look up available data dumps on the S3 bucket
- bucket
  - ~getDumpURL(version, collection) ⇒ string
  - ~getChecksumURL(version) ⇒ string
  - ~fetchYearListings() ⇒ Promise.<Array.<{path:string, year:number}>>
  - ~fetchFileListing(yearPrefix) ⇒ Promise.<Array.<string>>
  - ~parseFileNames(filenames) ⇒ Object
  - ~getLatestVersion() ⇒ Promise.<string>
bucket~getDumpURL(version, collection) ⇒ string
Get the URL for a specific data dump
Kind: inner method of bucket
Param | Type | Description |
---|---|---|
version | string | The exact version name, e.g. '20180101' |
collection | string | The type of data. Can be either "artists", "labels", "masters" or "releases" |
bucket~getChecksumURL(version) ⇒ string
Get the URL for a checksum file of the specified version
Kind: inner method of bucket
Param | Type | Description |
---|---|---|
version | string | The exact version name, e.g. '20180101' |
bucket~fetchYearListings() ⇒ Promise.<Array.<{path:string, year:number}>>
Fetch the set of years available on the Discogs data S3 bucket, together with their paths on the bucket.
Kind: inner method of bucket
bucket~fetchFileListing(yearPrefix) ⇒ Promise.<Array.<string>>
Fetch the list of files available on the S3 bucket for a certain year
Kind: inner method of bucket
Returns: Promise.<Array.<string>> - An array of paths
Param | Type | Description |
---|---|---|
yearPrefix | string | The year prefix of the file. For example: "data/2016/" |
bucket~parseFileNames(filenames) ⇒ Object
Parse a list of file paths (as returned by fetchFileListing) and group them by year
Kind: inner method of bucket
Returns: Object - An object with a key for each year and an array of parsed path objects as values.
Param | Type |
---|---|
filenames | Array.<string> |
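
The grouping step can be pictured with a small stand-alone sketch. It is not the library's implementation: the file-name pattern is inferred from names such as `discogs_20190101_artists.xml.gz`, and the function name `groupDumpPathsByYear` and the fields of each parsed entry (`version`, `collection`, `path`) are assumptions made for illustration.

```javascript
// Sketch only: groups dump file paths by year. The file-name pattern and the
// shape of each parsed entry are assumptions, not the library's actual output.
const DUMP_NAME = /discogs_(\d{4})(\d{4})_(artists|labels|masters|releases)\.xml\.gz$/;

function groupDumpPathsByYear(paths) {
  const byYear = {};
  for (const filePath of paths) {
    const match = DUMP_NAME.exec(filePath);
    if (!match) continue; // skip checksum files and anything unrecognised
    const [, year, monthDay, collection] = match;
    if (!byYear[year]) byYear[year] = [];
    byYear[year].push({ version: year + monthDay, collection, path: filePath });
  }
  return byYear;
}

const grouped = groupDumpPathsByYear([
  'data/2016/discogs_20160101_artists.xml.gz',
  'data/2016/discogs_20160101_CHECKSUM.txt',
  'data/2017/discogs_20170201_labels.xml.gz',
]);
console.log(Object.keys(grouped)); // one key per year found in the input
```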
bucket~getLatestVersion() ⇒ Promise.<string>
Gets the name of the latest version available in the S3 bucket
Kind: inner method of bucket
Returns: Promise.<string> - A promise that resolves with the version name
Manage data dump files that have already been downloaded
- dataManager
  - ~getXMLPath(version, collection, [gz], [dataDir]) ⇒ string
  - ~getChecksumPath(version, [dataDir]) ⇒ string
  - ~findXML(version, collection, [gz], [dataDir]) ⇒ Object | null
  - ~findData(version, collections, [dataDir]) ⇒ Array.<(Object|null)>
  - ~globDumps([dataDir]) ⇒ Object
dataManager~getXMLPath(version, collection, [gz], [dataDir]) ⇒ string
Get the path where a data XML is saved
Kind: inner method of dataManager
Param | Type | Default | Description |
---|---|---|---|
version | string | | The exact version name, e.g. '20180101' |
collection | string | | The type of data. Can be either "artists", "labels", "masters" or "releases" |
[gz] | boolean | false | Whether this is the compressed file (.xml.gz) or the non-compressed file (.xml) |
[dataDir] | string | "./data" | Root directory where discogs-data-tools stores data files. Defaults to ./data relative to the working directory |
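
As a rough illustration of how these four parameters could combine into a path, here is a minimal sketch. The on-disk layout `<dataDir>/<version>/discogs_<version>_<collection>.xml[.gz]` is an assumption based on file names seen elsewhere in this document, and `sketchXMLPath` is a hypothetical name, not the library function.

```javascript
const path = require('path');

// Sketch only: the directory layout below is an assumption, not the
// library's documented on-disk layout.
const COLLECTIONS = ['artists', 'labels', 'masters', 'releases'];

function sketchXMLPath(version, collection, gz = false, dataDir = './data') {
  if (!COLLECTIONS.includes(collection)) {
    throw new Error(`Unknown collection: ${collection}`);
  }
  const fileName = `discogs_${version}_${collection}.xml${gz ? '.gz' : ''}`;
  // path.posix keeps the separator stable across platforms for this demo.
  return path.posix.join(dataDir, version, fileName);
}

console.log(sketchXMLPath('20180101', 'artists'));
console.log(sketchXMLPath('20180101', 'labels', true, '/tmp/discogs'));
```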
dataManager~getChecksumPath(version, [dataDir]) ⇒ string
Get the path where the checksum file for a specified version is stored
Kind: inner method of dataManager
Param | Type | Default | Description |
---|---|---|---|
version | string | | The exact version name, e.g. '20180101' |
[dataDir] | string | "./data" | Root directory where discogs-data-tools stores data files. Defaults to ./data relative to the working directory |
dataManager~findXML(version, collection, [gz], [dataDir]) ⇒ Object | null
Looks up an existing data XML on disk
Kind: inner method of dataManager
Returns: Object | null - An object of the form { path: string, gz: boolean } if the file was found, null otherwise
Param | Type | Default | Description |
---|---|---|---|
version | string | | The exact version name, e.g. '20180101' |
collection | string | | The type of data. Can be either "artists", "labels", "masters" or "releases" |
[gz] | boolean | false | Whether this is the compressed file (.xml.gz) or the non-compressed file (.xml) |
[dataDir] | string | "./data" | Root directory where discogs-data-tools stores data files. Defaults to ./data relative to the working directory |
dataManager~findData(version, collections, [dataDir]) ⇒ Array.<(Object|null)>
Looks up the XML files on disk for a given version
Kind: inner method of dataManager
Returns: Array.<(Object|null)> - An array of results, one for each type: an object of the form { path: string, gz: boolean } if the file was found, null otherwise
Param | Type | Default | Description |
---|---|---|---|
version | string | | The exact version name, e.g. '20180101' |
collections | Array.<string> | | An array of types to get. Possible options: "artists", "labels", "masters" or "releases". Defaults to all types |
[dataDir] | string | "./data" | Root directory where discogs-data-tools stores data files. Defaults to ./data relative to the working directory |
dataManager~globDumps([dataDir]) ⇒ Object
List all data downloaded to the data directory
Kind: inner method of dataManager
Returns: Object - A map containing all downloaded files
Param | Type | Default | Description |
---|---|---|---|
[dataDir] | string | "./data" | Root directory where discogs-data-tools stores data files. Defaults to ./data relative to the working directory |
Download data dumps and show download progress
- fetcher
  - ~ensureDump(version, collection, [showProgress], [dataDir]) ⇒ Promise.<void>
  - ~ensureDumps(version, [collections], [showProgress], [dataDir]) ⇒ Promise.<void>
  - ~ensureChecksum(version, [dataDir]) ⇒ Promise.<void>
fetcher~ensureDump(version, collection, [showProgress], [dataDir]) ⇒ Promise.<void>
Ensures a data dump file is downloaded to the data directory (./data by default). Does nothing if the file already exists. Does not verify the file.
Kind: inner method of fetcher
Returns: Promise.<void> - A Promise that completes when all data is downloaded
Param | Type | Default | Description |
---|---|---|---|
version | string | | The exact version name, e.g. '20180101' |
collection | string | | The type of data. Can be either "artists", "labels", "masters" or "releases" |
[showProgress] | boolean | false | Show a progress indicator. Intended for use in an interactive CLI; on a server you probably want this set to false |
[dataDir] | string | | Set to override the default data directory where dumps are stored (./data) |
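
The "download only if missing, without verifying" behaviour can be sketched independently of the network. In the sketch below, `ensureFile`, `exists` and `download` are hypothetical names introduced for illustration; the real fetcher's internals are not shown in this document.

```javascript
// Sketch only: `exists` and `download` are hypothetical injected helpers, so the
// skip-if-present logic can be shown without touching the network or the disk.
async function ensureFile(targetPath, { exists, download }) {
  if (await exists(targetPath)) {
    return false; // already present: nothing to do, and no verification either
  }
  await download(targetPath);
  return true;
}

// Usage with in-memory stand-ins for the file system and the downloader.
const onDisk = new Set(['data/a.xml.gz']);
const helpers = {
  exists: async (p) => onDisk.has(p),
  download: async (p) => { onDisk.add(p); },
};

const ensureDemo = (async () => {
  const first = await ensureFile('data/b.xml.gz', helpers);  // triggers a download
  const second = await ensureFile('data/b.xml.gz', helpers); // skipped
  return { first, second };
})();
ensureDemo.then((result) => console.log(result));
```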
fetcher~ensureDumps(version, [collections], [showProgress], [dataDir]) ⇒ Promise.<void>
Ensures all the specified collections of a specific data dump version are downloaded to the given data directory
Kind: inner method of fetcher
Returns: Promise.<void> - A Promise that completes when all data is downloaded
Param | Type | Default | Description |
---|---|---|---|
version | string | | The exact version name, e.g. '20180101' |
[collections] | Array.<string> | | An array of types to get. Possible options: "artists", "labels", "masters" or "releases". Defaults to all types |
[showProgress] | boolean | false | Show a progress indicator. Intended for use in an interactive CLI; on a server you probably want this set to false |
[dataDir] | string | | Set to override the default data directory where dumps are stored (./data) |
fetcher~ensureChecksum(version, [dataDir]) ⇒ Promise.<void>
Ensures that the CHECKSUM file for a given version is downloaded
Kind: inner method of fetcher
Param | Type | Description |
---|---|---|
version | string | The exact version name, e.g. '20180101' |
[dataDir] | string | Set to override the default data directory where dumps are stored (./data) |
Helpers to transform the dumps parsed by XMLParser into plain objects that are easier to work with.
Format a label tag. See readme.md for information on how the data is transformed
Kind: inner method of processing/dumpFormatter
Param | Type | Default | Description |
---|---|---|---|
label | Object | | A label tag as parsed by XMLParser, which conforms to the schema/label-xml.json schema |
[includeImageObjects] | boolean | false | If true, include the image objects (even though they do not contain a URI) |
Format an artist tag. See readme.md for information on how the data is transformed
Kind: inner method of processing/dumpFormatter
Param | Type | Default | Description |
---|---|---|---|
artist | Object | | An artist tag as parsed by XMLParser, which conforms to the schema/artist-xml.json schema |
[includeImageObjects] | boolean | false | If true, include the image objects (even though they do not contain a URI) |
Format a master tag. See readme.md for information on how the data is transformed
Kind: inner method of processing/dumpFormatter
Param | Type | Default | Description |
---|---|---|---|
master | Object | | A master tag as parsed by XMLParser, which conforms to the schema/master-xml.json schema |
[includeImageObjects] | boolean | false | If true, include the image objects (even though they do not contain a URI) |
Format a release tag. See readme.md for information on how the data is transformed
Kind: inner method of processing/dumpFormatter
Param | Type | Default | Description |
---|---|---|---|
release | Object | | A release tag as parsed by XMLParser, which conforms to the schema/master-xml.json schema |
[includeImageObjects] | boolean | false | If true, include the image objects (even though they do not contain a URI) |
Parse the data dump XML into plain JS objects and process them with a given function. See readme.md for an example
processing/processor~processDumpFile(path, collection, fn, [gz], [chunkSize], [restart]) ⇒ Promise
Processes an XML dump file using node-expat into plain objects. Every chunkSize rows, the parser pauses and passes the result to the fn function. Once the fn function completes, parsing continues until the entire file is parsed.
Kind: inner method of processing/processor
Returns: Promise - A Promise that resolves when processing is complete
Param | Type | Default | Description |
---|---|---|---|
path | string | | The full path to the file to process |
collection | string | | The type of data. Can be either "artists", "labels", "masters" or "releases" |
fn | processChunkFn | | The function to call on each chunk of data |
[gz] | boolean | true | A boolean indicating if the dump is compressed in gzip format |
[chunkSize] | number | 1000 | The number of XML rows that node-expat parses before fn is called. A larger number may be more efficient, but uses more memory |
[restart] | boolean | false | By default, the processing progress is stored in a .processing file alongside the data dumps. If processing is stopped, it continues from that row the next time you call processDumpFile. Set this to true to always start from the beginning |
Example

    processDumpFile(
      './discogs_20190101_artists.xml.gz',
      'artists',
      chunk => {
        // Process the results here. For this example, we just console.log them.
        chunk.forEach(row => console.log(row));
        return Promise.resolve();
      }
    );
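
The pause-and-continue behaviour of processDumpFile can be illustrated without any XML. The following is a minimal sketch over an in-memory array, not the library's implementation: `processInChunks` is a hypothetical name, and the real processor streams rows through node-expat instead of iterating an array.

```javascript
// Sketch only: shows the "collect chunkSize rows, pause, await fn, continue"
// flow on an in-memory array instead of a streamed XML file.
async function processInChunks(rows, fn, chunkSize = 1000) {
  let chunk = [];
  for (const row of rows) {
    chunk.push(row);
    if (chunk.length === chunkSize) {
      await fn(chunk); // no further rows are read until fn's promise resolves
      chunk = [];
    }
  }
  if (chunk.length > 0) {
    await fn(chunk); // flush the final, possibly smaller, chunk
  }
}

const seenSizes = [];
const chunkDemo = processInChunks(
  [1, 2, 3, 4, 5],
  async (chunk) => { seenSizes.push(chunk.length); },
  2
).then(() => seenSizes);
chunkDemo.then((sizes) => console.log(sizes));
```

Awaiting fn before reading more rows is what bounds memory use: at most one chunk is held at a time, which is why a larger chunkSize trades memory for fewer pauses.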
processing/processor~processDumps(version, fn, [collections], [chunkSize], [restart], [dataDir]) ⇒ Promise.<void>
Looks up the downloaded data dumps of a given version, then calls processDumpFile on each of them.
Kind: inner method of processing/processor
See: processDumpFile
Param | Type | Default |
---|---|---|
version | string | |
fn | function | |
[collections] | Array.<string> | |
[chunkSize] | number | 1000 |
[restart] | boolean | false |
[dataDir] | string | "/data" |
processing/processor~processChunkFn ⇒ Promise
The signature of the fn function passed to processDumpFile
Kind: inner typedef of processing/processor
Returns: Promise - A promise that resolves when processing is complete
Param | Type | Description |
---|---|---|
chunk | Array.<Object> | An array of plain objects as parsed by node-expat from XML |
collection | string | The type of collection ("artists", "labels", "masters" or "releases") |
path | string | The path to the dump file that is being processed |
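
A function matching this signature might look like the sketch below. The counting logic, the `countRows` name and the `totals` object are illustrative assumptions; only the three parameters and the Promise return value come from the typedef.

```javascript
// Sketch of a function matching the (chunk, collection, path) signature.
// The counting logic and the `totals` object are illustrative only.
const totals = {};

function countRows(chunk, collection, path) {
  totals[collection] = (totals[collection] || 0) + chunk.length;
  console.log(`${path}: ${totals[collection]} ${collection} seen so far`);
  return Promise.resolve(); // signals that this chunk has been handled
}

countRows([{ id: 1 }, { id: 2 }], 'artists', './discogs_20190101_artists.xml.gz');
countRows([{ id: 3 }], 'artists', './discogs_20190101_artists.xml.gz');
```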
Small helpers for parsing Discogs data
- util/parseUtils
  - ~parseIntSafe(str) ⇒ number
  - ~parseDiscogsName(name, target) ⇒ object
  - ~parseDuration(duration, target) ⇒ object
  - ~parseReleaseDate(date, target)
util/parseUtils~parseIntSafe(str) ⇒ number
Runs parseInt and throws an error when the result is NaN
Kind: inner method of util/parseUtils
Param | Type | Description |
---|---|---|
str | string | The string to parse |
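
The documented behaviour is small enough to sketch in a few lines. This is an illustration under assumptions (a radix of 10 and the error message are not taken from the library), and `parseIntStrict` is a hypothetical name.

```javascript
// Sketch only: parseInt, then throw on NaN. Radix 10 and the error message
// are assumptions.
function parseIntStrict(str) {
  const value = parseInt(str, 10);
  if (Number.isNaN(value)) {
    throw new Error(`Cannot parse "${str}" as an integer`);
  }
  return value;
}

console.log(parseIntStrict('42'));
try {
  parseIntStrict('abc');
} catch (err) {
  console.log(err.message);
}
```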
util/parseUtils~parseDiscogsName(name, target) ⇒ object
Parses a name from Discogs that potentially has a "(n)" numeric postfix. Stores the result on the specified target object. Will set the following properties:
- name: the name with the "(n)" postfix removed
- originalName: the name without modifications
- nameIndex: the number n inside the postfix, or 1 if there is no postfix
Kind: inner method of util/parseUtils
Returns: object - A reference to target
Param | Type | Description |
---|---|---|
name | string | The name to parse |
target | object | An object to store the results on |
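
A minimal sketch of this parsing, assuming the postfix always sits at the very end of the name and is preceded by optional whitespace. The regular expression and the name `parseNameSketch` are assumptions; the three property names follow the description.

```javascript
// Sketch only: the regular expression is an assumption about what counts as a
// "(n)" postfix; the property names follow the documented behaviour.
function parseNameSketch(name, target) {
  const match = /^(.*?)\s*\((\d+)\)$/.exec(name);
  target.originalName = name;
  target.name = match ? match[1] : name;
  target.nameIndex = match ? parseInt(match[2], 10) : 1;
  return target;
}

console.log(parseNameSketch('Example Artist (2)', {}));
console.log(parseNameSketch('Example Artist', {}));
```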
util/parseUtils~parseDuration(duration, target) ⇒ object
Parses the duration string from a Discogs XML file and stores the result on the target object. Will store the string as-is on the 'originalDuration' property. If the duration is formatted somewhat correctly, will calculate the duration in number of seconds and store it on the 'duration' property.
Kind: inner method of util/parseUtils
Returns: object - target, for chaining
Param | Type | Description |
---|---|---|
duration | string | A duration formatted as string |
target | object | The target object to store results on |
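
What counts as "somewhat correctly" formatted is not specified here, so the sketch below makes its own assumption: it accepts only "m:ss" and "h:mm:ss" strings made of digits. The name `parseDurationSketch` is hypothetical.

```javascript
// Sketch only: accepts "m:ss" and "h:mm:ss". The library's actual tolerance for
// malformed durations is not documented here, so this rule is an assumption.
function parseDurationSketch(duration, target) {
  target.originalDuration = duration;
  const parts = duration.trim().split(':');
  const valid =
    parts.length >= 2 && parts.length <= 3 && parts.every((p) => /^\d+$/.test(p));
  if (valid) {
    // Fold left: each step shifts the running total up by a factor of 60.
    target.duration = parts.reduce((total, part) => total * 60 + parseInt(part, 10), 0);
  }
  return target;
}

console.log(parseDurationSketch('3:45', {}));
console.log(parseDurationSketch('1:02:03', {}));
console.log(parseDurationSketch('unknown', {}));
```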
util/parseUtils~parseReleaseDate(date, target)
Will parse the given release date and format it according to the Discogs Database Guidelines. The result is stored on the "released" property of the target object. The date will be formatted as either YYYY or YYYY-MM-DD. If only the year and month are given, the day will be set to 00. If dashes are missing, they will be added. All other formats are discarded.
Kind: inner method of util/parseUtils
Param | Type | Description |
---|---|---|
date | string | The date string to parse |
target | object | The target object to store the result on |
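
The formatting rules can be sketched with a single regular expression. This is an illustration only: the exact set of accepted inputs, the name `parseReleaseDateSketch` and the choice to return the target are assumptions.

```javascript
// Sketch only: accepts YYYY, YYYY-MM, YYYYMM, YYYY-MM-DD and YYYYMMDD.
// Returning the target is an assumption made for convenience.
function parseReleaseDateSketch(date, target) {
  const match = /^(\d{4})(?:-?(\d{2})(?:-?(\d{2}))?)?$/.exec(date.trim());
  if (match) {
    const [, year, month, day] = match;
    // A missing day becomes "00"; a bare year is stored as YYYY.
    target.released = month ? `${year}-${month}-${day || '00'}` : year;
  }
  return target; // unrecognised formats leave `released` unset
}

console.log(parseReleaseDateSketch('1999', {}));
console.log(parseReleaseDateSketch('1999-05', {}));
console.log(parseReleaseDateSketch('19990512', {}));
console.log(parseReleaseDateSketch('May 1999', {}));
```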