tldextract
accurately separates the gTLD or ccTLD (generic or country code
top-level domain) from the registered domain and subdomains of a URL. For
example, say you want just the 'google' part of 'http://www.google.com'.
Everybody gets this wrong. Splitting on the '.' and taking the last two elements goes a long way only if you're thinking of simple domains, e.g. .com. Consider parsing http://forums.bbc.co.uk: the naive splitting method will give you 'co' as the domain and 'uk' as the TLD, instead of 'bbc' and 'co.uk' respectively.
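For instance, here's that naive approach in action (plain Python, shown only to illustrate the failure):

>>> url = 'http://forums.bbc.co.uk'
>>> host = url.split('//')[-1].split('/')[0]  # crude host extraction
>>> host.split('.')[-2:]  # last two elements: wrong "domain" and "TLD"
['co', 'uk']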
tldextract
on the other hand knows what all gTLDs and ccTLDs look like by
looking up the currently living ones according to
the Public Suffix List. So,
given a URL, it knows its subdomain from its domain, and its domain from its
country code.
>>> import tldextract
>>> tldextract.extract('http://forums.news.cnn.com/')
ExtractResult(subdomain='forums.news', domain='cnn', tld='com')
>>> tldextract.extract('http://forums.bbc.co.uk/') # United Kingdom
ExtractResult(subdomain='forums', domain='bbc', tld='co.uk')
>>> tldextract.extract('http://www.worldbank.org.kg/') # Kyrgyzstan
ExtractResult(subdomain='www', domain='worldbank', tld='org.kg')
ExtractResult
is a namedtuple, so it's simple to access the parts you want.
>>> ext = tldextract.extract('http://forums.bbc.co.uk')
>>> ext.domain
'bbc'
>>> '.'.join(ext[:2]) # rejoin subdomain and domain
'forums.bbc'
This module started by implementing the chosen answer from this StackOverflow question on getting the "domain name" from a URL. However, the proposed regex solution doesn't address many country codes like com.au, or the exceptions to country codes like the registered domain parliament.uk. The Public Suffix List does.
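For instance (exact results depend on the Public Suffix List rules in effect; these assume the List's *.uk wildcard and its !parliament.uk exception):

>>> tldextract.extract('http://www.example.com.au')
ExtractResult(subdomain='www', domain='example', tld='com.au')
>>> tldextract.extract('http://www.parliament.uk')
ExtractResult(subdomain='www', domain='parliament', tld='uk')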
Latest release on PyPI:
$ pip install tldextract
Or the latest dev version:
$ pip install -e git://github.com/john-kurkowski/tldextract.git#egg=tldextract
Command-line usage splits the URL's components by space:
$ python -m tldextract.tldextract http://forums.bbc.co.uk
forums bbc co.uk
Run tests:
$ python -m tldextract.tests.all
- 0.3
- Added support for a huge class of missing TLDs (Issue #1). No more need for IANA.
- If you pass fetch=False to tldextract.extract, or the connection to the Public Suffix List fails, the module will fall back on the included snapshot.
- Internally, to support more TLDs, switched from a very long regex to set-based lookup (a simplified sketch of this kind of lookup appears below). Cursory timeit runs suggest performance is the same as v0.2, even with the thousands of new TLDs. (Note however that module init time has gone up into the tens of milliseconds, as it must unpickle the set. This could add up if you're calling the script externally.)
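For the curious, a set-based lookup can work by walking progressively shorter dot-separated suffixes of the host and taking the longest one found in the set. This is a simplified sketch of the idea, not the module's actual code (which also handles the List's wildcard and exception rules):

def split_on_longest_suffix(host, tlds):
    """Return (rest, tld), where tld is the longest known suffix of host."""
    labels = host.split('.')
    for i in range(len(labels)):
        candidate = '.'.join(labels[i:])  # longest suffixes tried first
        if candidate in tlds:
            return '.'.join(labels[:i]), candidate
    return host, ''  # no known suffix

>>> split_on_longest_suffix('forums.bbc.co.uk', {'com', 'uk', 'co.uk'})
('forums.bbc', 'co.uk')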
In order to not slam TLD sources for every single extraction and app startup, the
TLD set is cached indefinitely in /path/to/tldextract/.tld_set. This location
can be overridden by specifying cache_file in the call to tldextract.extract.
If you want to stay fresh with the TLD definitions (though they don't change
often), delete this file occasionally. It is also recommended to delete this
file after upgrading this lib.
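For example (the cache path here is just an illustration):

>>> import tldextract
>>> tldextract.extract('http://forums.bbc.co.uk', cache_file='/tmp/my_tld_set')
ExtractResult(subdomain='forums', domain='bbc', tld='co.uk')
>>> tldextract.extract('http://forums.bbc.co.uk', fetch=False)  # never hit the network
ExtractResult(subdomain='forums', domain='bbc', tld='co.uk')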
I know it's just one method, but I've needed this functionality in a few
projects and programming languages, so I've uploaded tldextract to App Engine.
Just hit it from your favorite HTTP client with the URL you want parsed, like so:
$ curl "http://tldextract.appspot.com/api/extract?url=http://www.bbc.co.uk/foo/bar/baz.html"
{"domain": "bbc", "subdomain": "www", "tld": "co.uk"}