In the previous session, we used `pathlib` and the local package `gurlpath` (derived from `urlpath`) to open object streams from URLs and files. In this session, we will extend this to reading and writing text and binary data, for both files and URLs.
You will need some understanding of the following:
- 001 Using Notebooks
- 002 Unix, with good familiarity with the UNIX commands we have covered
- 003 Getting help
- 010 Variables, comments and `print()`
- 011 Data types
- 012 String formatting
- 013 Python string methods
- 020 Python files
You will need to recall details from 020 Python files on using these two packages. You should run a NASA account test if you have not already done so.
As before, we note that we can conveniently use `pathlib` to deal with file input and output. The main methods we have seen are:

| command | purpose |
|---|---|
| `Path.open()` | open a file and return a file descriptor |
| `Path.read_text()` | read text |
| `Path.write_text()` | write text |
| `Path.read_bytes()` | read byte data |
| `Path.write_bytes()` | write byte data |
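For example, `Path.open()` can be used with a context manager, in the same way as the built-in `open()`. This is a minimal sketch, assuming the `work` directory exists:

```python
from pathlib import Path

# open a file descriptor for writing, write a line,
# then read the file back with Path.read_text()
with Path('work/hello.txt').open('w') as f:
    f.write('hello world\n')

print(Path('work/hello.txt').read_text())
```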
For `gurlpath` we have the following equivalent functions:

| command | purpose |
|---|---|
| `URL.open()` | open a file descriptor with data from a URL |
| `URL.read_text()` | read text from URL |
| `URL.write_text()` | write text to file |
| `URL.read_bytes()` | read byte data from URL |
| `URL.write_bytes()` | write byte data to file |
Recall that the write functions (and `open` when used for writing) write to local files, not to the URL. They have a keyword argument `local_file` to set the location to write the file to. If this is not given, the directory structure of the URL is used (relative to the current directory). Alternatively, you can set the keyword `local_dir`, or set `URL.local_file` or `URL.local_dir` as appropriate.
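As a minimal sketch of these options (assuming the keyword and attribute forms behave exactly as described above):

```python
from geog0111.gurlpath import URL

u = 'https://www.json.org/json-en.html'

# set the local directory when creating the URL object ...
url = URL(u, local_dir='work')
# ... or set the attribute directly
url.local_dir = 'work'

# or name the local file explicitly when writing
# (assumes write_text() accepts local_file, as described above)
url.write_text('some text', local_file='work/example.txt')
```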
Note that `URL` is tolerant of being called with a `Path`: if we call `URL` with a local file, most operations will continue and apply the appropriate `Path` function.
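For example, this is a minimal sketch, assuming the file `data/json-en.html` exists (as it does in the course repository):

```python
from geog0111.gurlpath import URL

# URL called with a local file: read_text() should simply
# apply the equivalent Path.read_text() here
data = URL('data/json-en.html').read_text()
print(f'read {len(data)} characters')
```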
We can read text from a file with `Path.read_text()` or from a URL with `URL.read_text()`, and then use either `Path.write_text()` or `URL.write_text()` to write text to a file:
```python
from pathlib import Path

# from https://www.json.org
some_text = '''
It is easy for humans to read and write.
It is easy for machines to parse and generate.
'''

# set up the filename
outfile = Path('work/easy.txt')
# write the text
nbytes = outfile.write_text(some_text)
# print what we did
print(f'wrote {nbytes} bytes to {outfile}')
```

```
wrote 90 bytes to work/easy.txt
```
- Using `Path.read_text()`, read the text from the file `work/easy.txt` and print the text returned.
- Split the text into lines of text using `str.split()` at each newline, and print out the resulting list.
You learned how to split strings in 013 Python string methods.
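As a reminder, this is a minimal sketch of splitting a string at newlines:

```python
text = 'It is easy for humans to read and write.\nIt is easy for machines to parse and generate.'
# split at each newline character to get a list of lines
lines = text.split('\n')
print(lines)
```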
We can show that we get the same result reading the same file locally from `data/json-en.html` or from the web from https://www.json.org/json-en.html:
```python
from geog0111.gurlpath import URL
from pathlib import Path

# first read the data from the URL with no cache
# (noclobber=False) and directory work
u = 'https://www.json.org/json-en.html'
url = URL(u, local_dir='work', verbose=True, noclobber=False)
data_url = url.read_text()

# then from the file in directory data
data_file = Path('data/json-en.html').read_text()

try:
    assert data_url == data_file
    print('files are the same')
except AssertionError:
    print('files are not the same')
    print(data_file)
```
```
--> WARNING: error reading data from /Users/plewis/Documents/GitHub/geog0111/notebooks/work/database.db
--> trying https://www.json.org/json-en.html
files are the same
```
We can read binary data from a file with `Path.read_bytes()` or from a URL with `URL.read_bytes()`, and then use either `Path.write_bytes()` or `URL.write_bytes()` to write the binary data to a file. Other than that, and the fact that we cannot directly visualise the contents of binary files without some code to interpret them, there is no real difference in how we treat text and binary data.
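As a minimal sketch of the binary equivalents (assuming the `work` directory exists):

```python
from pathlib import Path

# write some bytes to a file, then read them back
data = bytes(range(256))
nbytes = Path('work/bytes.dat').write_bytes(data)
print(f'wrote {nbytes} bytes')

# confirm the round trip preserved the data
assert Path('work/bytes.dat').read_bytes() == data
```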
One of the richest sources of geospatial information over the last two decades has been the NASA MODIS products. We will make use of various MODIS datasets in this course.
As a start on this, let's first access a MODIS file from the web, as we did in 020 Python files. Here, the `kwargs` are passed on to `URL`:
```python
from geog0111.modis import Modis

kwargs = {
    'verbose'   : True,
    'product'   : 'MCD15A3H',
    'db_dir'    : 'work',
    'local_dir' : 'work',
}
modis = Modis(**kwargs)
url = modis.get_url(year="2020", month="01", day="01")[0]
```
```
--> WARNING: error reading data from /Users/plewis/Documents/GitHub/geog0111/notebooks/work/database.db
--> product MCD15A3H -> code MOTA
--> WARNING: error reading data from /Users/plewis/Documents/GitHub/geog0111/notebooks/work/database.db
--> trying https://e4ftl01.cr.usgs.gov/
--> parsing URLs from html file 1 items
--> discovered 1 files with pattern MOTA in https://e4ftl01.cr.usgs.gov/
--> trying https://e4ftl01.cr.usgs.gov/MOTA
--> parsing URLs from html file 1 items
--> discovered 1 files with pattern MCD15A3H.006 in https://e4ftl01.cr.usgs.gov/MOTA
--> trying https://e4ftl01.cr.usgs.gov/MOTA/MCD15A3H.006
--> parsing URLs from html file 1 items
--> discovered 1 files with pattern 2020.01.01 in https://e4ftl01.cr.usgs.gov/MOTA/MCD15A3H.006
--> trying https://e4ftl01.cr.usgs.gov/MOTA/MCD15A3H.006/2020.01.01
--> parsing URLs from html file 1 items
--> discovered 1 files with pattern MCD15A3H*.h08v06*.hdf in https://e4ftl01.cr.usgs.gov/MOTA/MCD15A3H.006/2020.01.01
```
We can access the binary data with `url.read_bytes()`, although we would normally want to use some package such as `gdal` to interpret the data. Cached data will be used where available, unless we set `noclobber=False`.
```python
b = url.read_bytes()
print(f'data for {url} cached in {url.local()}')
print(f'dataset is {len(b)} bytes')
```

```
--> getting login
--> logging in to https://e4ftl01.cr.usgs.gov/
--> data read from https://e4ftl01.cr.usgs.gov/
--> code 200
data for https://e4ftl01.cr.usgs.gov/MOTA/MCD15A3H.006/2020.01.01/MCD15A3H.A2020001.h08v06.006.2020006032951.hdf cached in /Users/plewis/Documents/GitHub/geog0111/notebooks/work/e4ftl01.cr.usgs.gov/MOTA/MCD15A3H.006/2020.01.01/MCD15A3H.A2020001.h08v06.006.2020006032951.hdf.store
dataset is 9067184 bytes
--> updated cache database in /Users/plewis/Documents/GitHub/geog0111/notebooks/work/database.db
--> retrieving data https://e4ftl01.cr.usgs.gov/MOTA/MCD15A3H.006/2020.01.01/MCD15A3H.A2020001.h08v06.006.2020006032951.hdf from database
--> local file /Users/plewis/Documents/GitHub/geog0111/notebooks/work/e4ftl01.cr.usgs.gov/MOTA/MCD15A3H.006/2020.01.01/MCD15A3H.A2020001.h08v06.006.2020006032951.hdf.store exists
```
We could explicitly write the data to a file but, since we are using a cache, there is no real point: we can simply use the URL to access the dataset. If we do need to specify the filename explicitly for some other code, we can use `url.local()`.
Using the code:

```python
from geog0111.modis import Modis

kwargs = {
    'product'   : 'MCD15A3H',
    'db_dir'    : 'work',
    'local_dir' : 'work',
}
modis = Modis(**kwargs)
# get URLs
hdf_urls = modis.get_url(year="2020", month="01", day="01")
```
- Write a function called `get_locals` that loops over each entry in the list `hdf_urls` and returns the local filename.
- Write code to test the function and print results, using data from `modis.get_url("2020","01","*")`.
The MODIS files are in `hdf` format and, as we have noted, we do not generally want direct access to the raw (byte) information. Instead, we must use some package to interpret the data.

We can use the package `gdal` to access information from these and other geospatial files. We will explore the contents of MODIS files in a later session, but for now, we can note that each MODIS file contains a set of sub-datasets.

Basic use of `gdal` in this context is:
```python
g = gdal.Open(str(url.local()))
```

This converts the cached URL filename to a string and opens the file in `gdal`. If `gdal.Open()` returns `None`, there has been a problem opening the file, so we should check for that.
Then:

```python
g.GetSubDatasets()
```

returns a list of sub-dataset information. Each item in the list is a tuple of two strings: the first is the full name of the sub-dataset, and the second a text descriptor of it. We call these `filename` and `name` below.
We read the dataset with:

```python
gsub = gdal.Open(filename)
data = gsub.ReadAsArray()
```
In the illustration below, we will examine only the first sub-dataset, `g.GetSubDatasets()[0]`.
```python
import gdal
from geog0111.modis import Modis

# as before
kwargs = {
    'product'   : 'MCD15A3H',
    'db_dir'    : 'work',
    'local_dir' : 'work',
}
modis = Modis(**kwargs)
url = modis.get_url(year="2020", month="01", day="01")[0]

# set True to force download of the local file
filename = url.local(True).as_posix()

# open the local file associated with the dataset
g = gdal.Open(filename)
if g:
    # get the first SDS only, for illustration
    filename, name = g.GetSubDatasets()[0]
    print(f'dataset info is: {name}')
    # read the dataset
    gsub = gdal.Open(filename)
    if gsub:
        data = gsub.ReadAsArray()
        print(f'dataset read is shape {data.shape} and type {type(data)}')
```
```
dataset info is: [2400x2400] Fpar_500m MOD_Grid_MCD15A3H (8-bit unsigned integer)
dataset read is shape (2400, 2400) and type <class 'numpy.ndarray'>
```
```python
name = '[2400x2400] Fpar_500m MOD_Grid_MCD15A3H (8-bit unsigned integer)'
```
- Take the string variable `name` above, split it to obtain the second field (`Fpar_500m` here), and store this in a variable `sds_name`.
- Write a function called `get_data` that reads an HDF (MODIS) filename and returns a dictionary of all of the sub-datasets in the file, using `ReadAsArray()`. The dictionary keys should correspond to the items in `sds_name` above.
- Test the code by showing the keys in the dictionary returned and the shape of their datasets.
You will need to recall how to split a string, which was covered in 013 Python string methods. You will also need to recall how to loop over a dictionary. We saw above how to find the shape of the dataset returned (`.shape`).
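As a reminder, this is a minimal sketch of looping over a dictionary (the arrays here are hypothetical, made with `numpy` just so they have a `shape`):

```python
import numpy as np

# a hypothetical dictionary of named arrays
d = {
    'a' : np.zeros((2, 3)),
    'b' : np.ones((4, 4)),
}

# loop over the (key, value) pairs
for key, value in d.items():
    print(f'{key}: shape {value.shape}')
```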
In this section, we have used the `Path` and `URL` classes to read and write text and binary files. We have combined these ideas with earlier work to access MODIS data files and other text and binary datasets. For data we access through a URL, we can do file operations on a cached version of the file. We have refreshed our memory of some of the earlier material, especially string formatting.

We have learned how to use `gdal` to look at the sub-datasets in an HDF file, and also how to read them.

You should now have some confidence in these matters, so that if you were set a task of downloading and saving datasets, as well as other tasks such as finding their size or whether they exist, you could do this.
Remember:

Modis library:

```python
from geog0111.modis import Modis
modis = Modis(**kwargs)
```

```
get_url(**kwargs) method of geog0111.modis.Modis instance
    Get URL object list for NASA MODIS products
    for the specified product, tile, year, month, day

    Keyword Arguments:

    verbose : bool
    product : str e.g. 'MCD15A3H'
    tile    : str e.g. 'h08v06'
    year    : str valid 2000-present
    month   : str 01-12
    day     : str 01-(28,29,30,31)
```
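For example, a minimal sketch of calling `get_url()` with these keywords (the `tile` value is just the example from the docstring above):

```python
from geog0111.modis import Modis

# set up a Modis object as in the examples above
modis = Modis(product='MCD15A3H', db_dir='work', local_dir='work')
# get the URL list for one tile on one date
urls = modis.get_url(year="2020", month="01", day="01", tile="h08v06")
print(urls)
```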
`gdal`:

| Command | Comment |
|---|---|
| `g = gdal.Open(filename)` | Open geospatial file `filename` and return `gdal` object `g` (`None` if file not opened correctly) |
| `g.GetSubDatasets()` | Get list of sub-datasets from `gdal` object `g` |
| `g.ReadAsArray()` | Read dataset from `gdal` object `g` into array |