tldextract

Accurately separate the TLD from the registered domain and subdomains of a URL, using the Public Suffix List.

Python Module

tldextract accurately separates the gTLD or ccTLD (generic or country code top-level domain) from the registered domain and subdomains of a URL. For example, say you want just the 'google' part of 'http://www.google.com'.

Everybody gets this wrong. Splitting on the '.' and taking the last 2 elements goes a long way only if you're thinking of simple TLDs like .com. Consider parsing http://forums.bbc.co.uk, for example: the naive splitting method above will give you 'co' as the domain and 'uk' as the TLD, instead of 'bbc' and 'co.uk' respectively.
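For comparison, here is that naive approach applied by hand to the host above (just an illustration; it is not how this library works):

>>> # Naive split: take the last two dot-separated labels of the host.
>>> 'forums.bbc.co.uk'.split('.')[-2:]
['co', 'uk']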

tldextract, on the other hand, knows what all gTLDs and ccTLDs look like by looking up the currently living ones according to the Public Suffix List. So, given a URL, it knows its subdomain from its domain, and its domain from its country code.

>>> import tldextract
>>> tldextract.extract('http://forums.news.cnn.com/')
ExtractResult(subdomain='forums.news', domain='cnn', tld='com')
>>> tldextract.extract('http://forums.bbc.co.uk/') # United Kingdom
ExtractResult(subdomain='forums', domain='bbc', tld='co.uk')
>>> tldextract.extract('http://www.worldbank.org.kg/') # Kyrgyzstan
ExtractResult(subdomain='www', domain='worldbank', tld='org.kg')

ExtractResult is a namedtuple, so it's simple to access the parts you want.

>>> ext = tldextract.extract('http://forums.bbc.co.uk')
>>> ext.domain
'bbc'
>>> '.'.join(ext[:2]) # rejoin subdomain and domain
'forums.bbc'

This module started by implementing the chosen answer from this StackOverflow question on getting the "domain name" from a URL. However, the proposed regex solution doesn't address many country-code suffixes like com.au, or the exceptions to them like the registered domain parliament.uk. The Public Suffix List does.
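For example, here is how tldextract handles those two cases (a quick sketch; the exact output assumes the Public Suffix List rules for com.au and parliament.uk are present in the TLD set in use):

>>> tldextract.extract('http://www.example.com.au')
ExtractResult(subdomain='www', domain='example', tld='com.au')
>>> tldextract.extract('http://www.parliament.uk')
ExtractResult(subdomain='www', domain='parliament', tld='uk')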

Installation

Latest release on PyPI:

$ pip install tldextract 

Or the latest dev version:

$ pip install -e git://github.com/john-kurkowski/tldextract.git#egg=tldextract

Command-line usage splits the URL's components by spaces:

$ python -m tldextract.tldextract http://forums.bbc.co.uk
forums bbc co.uk

Run tests:

$ python -m tldextract.tests.all

Version History

  • 0.3
    • Added support for a huge class of missing TLDs (Issue #1). No more need for IANA.
    • If you pass fetch=False to tldextract.extract, or the connection to the Public Suffix List fails, the module will fall back on the included snapshot.
    • Internally, to support more TLDs, switched from a very long regex to set-based lookup. Cursory timeit runs suggest performance is the same as v0.2, even with the 1000s of new TLDs. (Note however that module init time has gone up into the tens of milliseconds as it must unpickle the set. This could add up if you're calling the script externally.)

Note About Caching

In order to not slam TLD sources for every single extraction and app startup, the TLD set is cached indefinitely in /path/to/tldextract/.tld_set. This location can be overridden by specifying cache_file in the call to tldextract.extract. If you want to stay fresh with the TLD definitions--though they don't change often--delete this file occasionally.

It is also recommended to delete this file after upgrading this lib.
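A minimal sketch combining the two knobs described above, cache_file and fetch=False (the cache path here is just a placeholder):

>>> import tldextract
>>> # Cache the TLD set somewhere other than the default location, and fall
>>> # back on the bundled snapshot instead of fetching the Public Suffix List.
>>> tldextract.extract('http://forums.bbc.co.uk/', cache_file='/tmp/.tld_set', fetch=False)
ExtractResult(subdomain='forums', domain='bbc', tld='co.uk')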

Public API

I know it's just one method, but I've needed this functionality in a few projects and programming languages, so I've uploaded tldextract to App Engine. Just hit it with your favorite HTTP client with the URL you want parsed like so:

$ curl "http://tldextract.appspot.com/api/extract?url=http://www.bbc.co.uk/foo/bar/baz.html"
{"domain": "bbc", "subdomain": "www", "tld": "co.uk"}
