A set of scripts that crawl the online Java documentation to scrape information about the methods and constructors of each class, organised by package name. The result of the scrape is stored as XML files, with each file corresponding to one Java package.
The structure of the XML file for one package is as follows (the bracketed terms indicate attributes),
package
├── name
├── description
├── class (id)
│   ├── name
│   └── description
└── method (id)
    ├── name
    ├── description
    ├── parameter
    │   ├── name
    │   └── type
    ├── return
    └── class
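For illustration, a scraped package might be serialised along the following lines. The element and attribute names are taken from the tree above, but the package, class and method shown here are invented purely as an example, and the class element under method is assumed to name the owning class; the actual files may differ in detail,

<package>
    <name>java.example</name>
    <description>A hypothetical package used only for illustration.</description>
    <class id="1">
        <name>Widget</name>
        <description>A class representing a widget.</description>
    </class>
    <method id="1">
        <name>resize</name>
        <description>Resizes the widget by the given factor.</description>
        <parameter>
            <name>factor</name>
            <type>double</type>
        </parameter>
        <return>void</return>
        <class>Widget</class>
    </method>
</package>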
There are three modules and one script present in this package,

- class_scraper.py --- Contains scrape_class(soup, cls_name), which takes a BeautifulSoup object and the class name, and scrapes the class's web page to retrieve the class description and the name, description, parameters and return type of each method belonging to the class.
- package_scraper.py --- Contains scrape_package(package_name, package_url), which takes the name and URL of the package to be scraped, generates the list of classes in the package, retrieves the web page of each class and generates the BeautifulSoup object for it, calls scrape_class for each class, and then calls write_xml on the data returned from scrape_class.
- doc_scrape.py --- Executable script that reads the package list from pkg_list.json and calls scrape_package on each package. Also responsible for printing the results of the operation.
- misc.py --- Contains two functions (sketched just after this list),
  - get_absolute_url(current_url, relative_url) --- Generates an absolute URL from the current URL and a URL specified relative to it.
  - write_xml(package_info) --- Generates the XML tree from the package info and writes it to a file at docs/<package name>.xml.
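As a rough sketch of the two helpers in misc.py, based only on the descriptions above; the actual source, and in particular the layout of package_info, may differ,

from urllib.parse import urljoin
from lxml import etree


def get_absolute_url(current_url, relative_url):
    # Resolve a link given relative to the current page into an absolute URL.
    return urljoin(current_url, relative_url)


def write_xml(package_info):
    # package_info is assumed here to be a dict holding the package name,
    # description and a list of scraped classes; the real structure may differ.
    root = etree.Element('package')
    etree.SubElement(root, 'name').text = package_info['name']
    etree.SubElement(root, 'description').text = package_info['description']
    for idx, cls in enumerate(package_info['classes']):
        cls_elem = etree.SubElement(root, 'class', id=str(idx))
        etree.SubElement(cls_elem, 'name').text = cls['name']
        etree.SubElement(cls_elem, 'description').text = cls['description']
    # Methods would be serialised similarly under <method id="..."> elements.
    etree.ElementTree(root).write('docs/{}.xml'.format(package_info['name']),
                                  pretty_print=True, xml_declaration=True,
                                  encoding='utf-8')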
The function scrape_package executes scrape_class for each class on a separate thread, up to a maximum of 32 threads, using concurrent.futures.ThreadPoolExecutor. Further, the doc_scrape.py script executes scrape_package for each package on a separate process, with a maximum of 8 concurrent processes.
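In outline, this follows the standard concurrent.futures pattern. The sketch below is a simplified, self-contained illustration of that pattern with placeholder scraping logic; it is not the project's actual code,

from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor


def scrape_one_class(cls_url):
    # Placeholder for fetching a class page and running scrape_class on it.
    return cls_url


def scrape_package(package_name, package_url):
    # Placeholder class list; the real script derives it from the package page.
    class_urls = [package_url + '/ClassA.html', package_url + '/ClassB.html']
    # Within a package, classes are scraped on up to 32 threads.
    with ThreadPoolExecutor(max_workers=32) as pool:
        results = list(pool.map(scrape_one_class, class_urls))
    return package_name, results


def main(packages):
    # Packages themselves are scraped on up to 8 separate processes.
    with ProcessPoolExecutor(max_workers=8) as pool:
        futures = [pool.submit(scrape_package, name, url)
                   for name, url in packages.items()]
        return [future.result() for future in futures]


if __name__ == '__main__':
    print(main({'java.example': 'https://example.invalid/java/example'}))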
The scripts have been written for Python 3.6.0 and require the following external packages, installable via pip,
requests 2.13.0
lxml 3.7.3
beautifulsoup4 4.5.3
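Assuming pip is available for the Python 3.6 interpreter, the pinned versions can, for instance, be installed with,

$ pip install requests==2.13.0 lxml==3.7.3 beautifulsoup4==4.5.3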
Further, execution of the script has not been verified on any operating system other than Arch Linux.
The file pkg_list.json contains all packages and their associated URLs in JSON format. This file can be backed up and edited to reduce the number of packages to scrape. Running the script doc_scrape.py starts the scraper.
$ ./doc_scrape.py
The output of ./doc_scrape.py is as follows,
- There are 5 columns --- status, package, done, total, errors.
- Each row corresponds to the result of the scrape for one package, indicated by the package column.
- The first column, status, indicates the overall result.
  - SUCCESS: All classes successfully parsed.
  - PARTIAL: A few classes could not be parsed.
  - FAILURE: No classes could be parsed.
- The third and fourth columns, done and total, indicate the number of classes successfully parsed and the total number of classes in the package respectively. These values are identical on a SUCCESS status, differ on PARTIAL, and are marked with a - on FAILURE.
- The last column prints the error during a FAILURE.
- Finally, after the scraping is complete, 4 values are presented,
  - Total packages: Total packages scraped.
  - Complete: Number of packages with SUCCESS status.
  - Incomplete: Number of packages with PARTIAL status.
  - Failed: Number of packages with FAILURE status.
  - Empties: Number of packages with no classes. These packages do not show up in the output table.
Apart from the standard output, log files in the logs folder indicate the errors associated with parsing a class when the package has reported a PARTIAL status. These files are named <package name>.log.
doc_scrape.py may also be called as,
$ ./doc_scrape.py --retry
The retry flag reads a file pkg_retry, if it exists, and only parses the packages specified there. Every time doc_scrape.py is executed, all packages that fail or are only partially scraped are written to pkg_retry. This allows the scraping to resume with only the packages that failed the previous time.
WARNING: Currently, the logs, docs and pkg_retry files are overwritten on each execution of doc_scrape.py.
Finally, there is one convenience script included, clean.py, that removes docs, logs, __pycache__, and pkg_retry after a prompt for each. Calling clean.py with the option --force skips the prompt.
$ ./clean.py
or,
$ ./clean.py --force
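A minimal sketch of what clean.py does, based only on the behaviour described above (the actual script may differ in detail),

#!/usr/bin/env python3
import argparse
import os
import shutil

# Artifacts produced by doc_scrape.py, as described in this README.
TARGETS = ['docs', 'logs', '__pycache__', 'pkg_retry']


def remove(path, force):
    if not os.path.exists(path):
        return
    if not force:
        answer = input('Remove {}? [y/N] '.format(path))
        if answer.lower() != 'y':
            return
    if os.path.isdir(path):
        shutil.rmtree(path)
    else:
        os.remove(path)


if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--force', action='store_true',
                        help='remove targets without prompting')
    args = parser.parse_args()
    for target in TARGETS:
        remove(target, args.force)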
This software is licensed under the MIT License.
For further information, view LICENSE.
Copyright (C) 2017 Abhijit J. Theophilus, [email protected]