tldextract accurately separates the gTLD or ccTLD (generic or country code top-level domain) from the registered domain and subdomains of a URL. For example, say you want just the 'google' part of 'http://www.google.com'.
Everybody gets this wrong. Splitting on the '.' and taking the last 2 elements only works for simple domains such as .com. Consider parsing http://forums.bbc.co.uk: the naive splitting method gives you 'co' as the domain and 'uk' as the TLD, instead of 'bbc' and 'co.uk' respectively.
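To make the failure concrete, here is a minimal sketch of the naive approach. The function name naive_extract is made up for illustration and is not part of tldextract.

```python
from urllib.parse import urlsplit


def naive_extract(url):
    """Naive approach: treat the last two dot-separated labels as domain and TLD."""
    host = urlsplit(url).hostname or ""
    labels = host.split(".")
    if len(labels) < 2:
        # Nothing sensible to split, e.g. 'localhost'.
        return "", host, ""
    return ".".join(labels[:-2]), labels[-2], labels[-1]


print(naive_extract("http://www.google.com"))    # ('www', 'google', 'com')
print(naive_extract("http://forums.bbc.co.uk"))  # ('forums.bbc', 'co', 'uk')
```

The second result is the wrong answer described above: 'co' comes back as the domain and 'uk' as the TLD.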
tldextract, on the other hand, knows what all gTLDs and ccTLDs look like by looking up the currently living ones according to the Public Suffix List. So, given a URL, it knows its subdomain from its domain, and its domain from its country code.
>>> import tldextract
>>> tldextract.extract('http://forums.news.cnn.com/')
ExtractResult(subdomain='forums.news', domain='cnn', tld='com')
>>> tldextract.extract('http://forums.bbc.co.uk/') # United Kingdom
ExtractResult(subdomain='forums', domain='bbc', tld='co.uk')
>>> tldextract.extract('http://www.worldbank.org.kg/') # Kyrgyzstan
ExtractResult(subdomain='www', domain='worldbank', tld='org.kg')
ExtractResult is a namedtuple, so it's simple to access the parts you want.
>>> ext = tldextract.extract('http://forums.bbc.co.uk')
>>> ext.domain
'bbc'
>>> '.'.join(ext[:2]) # rejoin subdomain and domain
'forums.bbc'
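A few more things a namedtuple lets you do, sketched with a stand-in type so the snippet runs without the library installed. The stand-in only copies the three field names shown above; it is not imported from tldextract.

```python
from collections import namedtuple

# Stand-in with the same field names as the result shown above (an assumption
# for illustration, not the library's own class).
ExtractResult = namedtuple("ExtractResult", ["subdomain", "domain", "tld"])

ext = ExtractResult(subdomain="forums", domain="bbc", tld="co.uk")

# Tuple unpacking works because a namedtuple is still a tuple.
subdomain, domain, tld = ext

# Rejoin domain and TLD to get the registered domain.
registered_domain = ".".join(ext[1:])
print(registered_domain)  # bbc.co.uk

# Convert to a mapping, e.g. for JSON output.
as_dict = ext._asdict()
print(as_dict["tld"])  # co.uk
```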
This module started by implementing the chosen answer from a StackOverflow question on getting the "domain name" from a URL. However, the proposed regex solution doesn't address many country-code suffixes like com.au, or the exceptions to country codes like the registered domain parliament.uk. The Public Suffix List does, and so does this module.
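For a feel of why a suffix list handles these cases where a regex struggles, here is a rough sketch of suffix matching with plain, wildcard ("*.") and exception ("!") rules. It is not this module's actual implementation. TOY_RULES is a hand-picked toy set written in the list's rule syntax, chosen to mirror the parliament.uk exception mentioned above; the real list is far larger and its entries may differ. The name split_host is hypothetical.

```python
# Toy rules in Public Suffix List syntax. NOT the real list.
TOY_RULES = {"com", "kg", "org.kg", "*.uk", "!parliament.uk"}


def split_host(host, rules=TOY_RULES):
    """Return (subdomain, domain, suffix) for a bare hostname."""
    labels = host.lower().strip(".").split(".")
    n = len(labels)
    suffix_len = 1  # default when no rule matches: the last label is the suffix

    # Exception rules win over everything else. The suffix is then the
    # exception rule with its leftmost label removed.
    exception = next(
        (i for i in range(n) if "!" + ".".join(labels[i:]) in rules), None
    )
    if exception is not None:
        suffix_len = n - exception - 1
    else:
        # Otherwise the longest matching plain or wildcard rule wins,
        # so scan from the longest candidate to the shortest.
        for i in range(n):
            exact = ".".join(labels[i:])
            wildcard = ".".join(["*"] + labels[i + 1:])
            if exact in rules or wildcard in rules:
                suffix_len = n - i
                break

    suffix = ".".join(labels[n - suffix_len:])
    domain = labels[n - suffix_len - 1] if n > suffix_len else ""
    subdomain = ".".join(labels[:max(n - suffix_len - 1, 0)])
    return subdomain, domain, suffix


print(split_host("forums.bbc.co.uk"))      # ('forums', 'bbc', 'co.uk')
print(split_host("www.parliament.uk"))     # ('www', 'parliament', 'uk')
print(split_host("www.worldbank.org.kg"))  # ('www', 'worldbank', 'org.kg')
```

With these toy rules, the wildcard makes co.uk a suffix while the exception keeps parliament.uk a registered domain, which is the distinction a single regex has trouble expressing.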
Latest release on PyPI:
$ pip install tldextract
Or the latest dev version:
$ pip install -e git+https://github.com/john-kurkowski/tldextract.git#egg=tldextract
Command-line usage, which prints the URL components separated by spaces:
$ tldextract http://forums.bbc.co.uk
forums bbc co.uk
Run tests:
$ python -m tldextract.tests.all
Beware: when first run, the module updates its TLD list with a live HTTP request. This updated TLD set is cached indefinitely in /path/to/tldextract/.tld_set.
(Arguably, runtime bootstrapping like that shouldn't be the default behavior, particularly on production systems. But I want you to have the latest TLDs, especially when I haven't kept this code up to date.)
To avoid this fetch or control the cache's location, use your own extract callable:
import tldextract

# extract callable that falls back to the included TLD snapshot, no live HTTP fetching
no_fetch_extract = tldextract.TLDExtract(fetch=False)
no_fetch_extract('http://www.google.com')
# extract callable that reads/writes the updated TLD set to a different path
custom_cache_extract = tldextract.TLDExtract(cache_file='/path/to/your/cache/file')
custom_cache_extract('http://www.google.com')
If you want to stay fresh with the TLD definitions--though they don't change often--delete the cache file occasionally.
It is also recommended to delete the file after upgrading this lib.
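If you'd rather not delete the file by hand, something like the following could do it on a schedule. This is only a sketch: remove_stale_cache and the 30-day default are made up for illustration and are not part of tldextract, and you need to pass in whatever cache path you actually use.

```python
import os
import time


def remove_stale_cache(cache_file, max_age_days=30):
    """Delete cache_file if it is older than max_age_days.

    Returns True if the file was deleted, False otherwise.
    """
    try:
        age_seconds = time.time() - os.path.getmtime(cache_file)
    except OSError:
        return False  # no cache file yet, nothing to do
    if age_seconds <= max_age_days * 86400:
        return False  # still fresh enough
    os.remove(cache_file)
    return True
```

Call it before creating your extract callable with the same cache_file path, so the next extraction has no cache to read and fetches a fresh list.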
I know it's just one method, but I've needed this functionality in a few projects and programming languages, so I've uploaded tldextract to App Engine. It's there on GAE's free pricing plan until Google cuts it off. Just hit it with your favorite HTTP client with the URL you want parsed like so:
$ curl "http://tldextract.appspot.com/api/extract?url=http://www.bbc.co.uk/foo/bar/baz.html"
{"domain": "bbc", "subdomain": "www", "tld": "co.uk"}
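The same request from Python might look roughly like this, using only the standard library. The endpoint and the JSON keys are taken from the curl example above; the function names are made up for illustration, and the hosted service may no longer be running, so treat the network call as untested.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Endpoint copied from the curl example above.
API_ENDPOINT = "http://tldextract.appspot.com/api/extract"


def build_request_url(target_url):
    """Percent-encode the target URL so its own '/', ':' and '?' survive the query string."""
    return API_ENDPOINT + "?" + urlencode({"url": target_url})


def parse_response(body):
    """Turn the JSON body into a (subdomain, domain, tld) tuple."""
    data = json.loads(body)
    return data["subdomain"], data["domain"], data["tld"]


def remote_extract(target_url, timeout=10):
    """Call the hosted API. Requires network access and a live service."""
    with urlopen(build_request_url(target_url), timeout=timeout) as resp:
        return parse_response(resp.read().decode("utf-8"))
```

Encoding the url parameter matters when the URL you want parsed carries its own query string; otherwise its '&' and '=' characters would be read as parameters of the API request.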