This is a collection of PHP 7 classes useful for extracting, transforming, and loading data between sources.
Hierarchical XML and JSON can be automatically converted to relational SQL. Support includes extracting data documents from a file system or REST API, and then loading the data into a DBMS like Microsoft SQL Server.
Values are surveyed for datatypes, numeric cardinality, and unique natural key candidates. Then this information is used to create a normalized multi-table database structure suited to insert the data.
Branch | Tests | Code Coverage |
---|---|---|
master | ||
dev |
PHP >= 7.0
This library is registered on packagist and can be easily installed into your project using composer.
composer require jpuck/etl
There are 3 basic groups of interrelated classes: Sources provide Data which have Schemata.
-
Sources
Sources extend the abstract
Source
class and transportDatum
objects. In particular, the abstractDB
class has concrete class implementations such asMicrosoftSQLServer
. -
Data
Data classes extend
Datum
and must implement a valid parser, satisfied byParseValidator
. It uses theSchematizer
to construct the object from raw data, which can be overridden by passing an existingSchema
. -
Schemata
A
Schema
is a concrete class with aValidator
to enforce structure. TheMerger
class is for combining Schemas to create super-set Schemas. TheDDL
trait is used by theDB
class to generate SQL Data Definition Language which contains abstract methods to be implemented by a specific database management system.
The Schematizer
class is for surveying the structure of the data.
It includes node names, the count of distinct element groupings,
numeric cardinality for relationships between subnodes, and descriptive
statistics about the values including uniqueness. Categorically, it recognizes
datetime, integer, and decimal datatypes. Decimals will include scale and
precision measurements suitable for SQL.
Schematizer::getPrecision
returns the scale and precision of
numeric values suitable for the SQL DECIMAL(scale,precision)
datatype.
This function has notable behavior in that trailing zeros are discarded
when passed as raw PHP float types. However, when passed as a string, then the
trailing zeros are preserved in the precision.
See SchematizerUtilitiesTest::precisionDataProvider
for examples.
Note that in the XML
class, parsed values are represented as strings
in PHP, so trailing zeros should be represented in the precision values.
node name
├── count
│ ├── max
│ │ ├── measure
│ │ └── value
│ └── min
│ ├── measure
│ └── value
├── unique (all values)
├── primaryKey
├── varchar ────┐
│ ├── max │
│ │ ├── measure │
│ │ └── value │
│ └── min │
│ ├── measure │
│ └── value │
├── datetime ├ datatypes
│ ├── max │
│ │ └── value │
│ └── min │
│ └── value │
├── int/decimal │
│ ├── max │
│ │ └── value │
│ └── min │
│ └── value ────┘
├── scale ────┐
│ ├── max │
│ │ ├── measure │
│ │ └── value │
│ └── min │
│ ├── measure │
│ └── value │
├── precision ├ if decimal
│ ├── max │
│ │ ├── measure │
│ │ └── value │
│ └── min │
│ ├── measure │
│ └── value ────┘
├── children
│ ├── distinct (count of children)
│ └── count
│ ├── max
│ │ └── measure
│ └── min
│ └── measure
├── attributes
│ └── ... (excluding count, which must be 1)
└── elements
└── ...
The DB
class requires an instance of PDO
in the constructor to connect,
but it is possible to pass a null
value if only utilizing the class for DDL.
When one-to-many XML nodes are used to represent one-to-one relationships, then
the Schematizer
recognizes this and a DDL
class flattens them as columns
on a table. If a node has more than one of its name or grandchildren, then the
one-to-many relationship is preserved in a separate normalized table. Surrogate
keys are created to maintain the Primary/Foreign Key referential integrity.
If the Schema
has a primaryKey
set, then that field will be used for DDL
generation instead of the surrogate. However, this Schema
must also be passed
to the Datum
constructor prior to being used with DB::insert
, otherwise the
surrogate keys will be used by default and will result in a failed insertion if
the surrogate columns don't exist.
Generating Schemas can take a long time and may require customization, such
as adding primaryKey
flags or removing unwanted fields to be ignored. Here are
some examples for exporting and importing:
$xml = file_get_contents("sample.xml");
$xml = new XML($xml);
$schema = $xml->schema();
// normal JSON_PRETTY_PRINT
echo $schema;
By simply echoing the object, output can be redirected to a file from console:
php script.php > myschema.json
Use file_put_contents
to write to disk.
Schema::toJSON
accepts all the json_encode
options.
$string = $schema->toJSON(JSON_UNESCAPED_UNICODE);
file_put_contents('myschema.json', $string);
// native php array
$array = var_export($schema->toArray(), true);
$array = "<?php return $array;";
file_put_contents('myschema.php', $array);
Import any of those formats the same way by passing the filename, an array, or a JSON string to the constructor.
$schemas []= new Schema('myschema.php');
$schemas []= new Schema('myschema.json');
$schemas []= new Schema($schema->toArray());
$schemas []= new Schema($schema->toJSON());
$schemas []= new Schema($schema->toJSON(JSON_PRETTY_PRINT));
foreach($schemas as $s){
var_dump($schema == $s);
}
Override the internal Schematizer
by passing a Schema
to the Datum
constructor.
$schema = new Schema('myschema.json');
$xml = file_get_contents("sample.xml");
$xml = new XML($xml, $schema);
You can also pass the Schema
override through Source::fetch
$credentials = [
'url' => 'https://api.example.com',
'username' => 'user',
'password' => 'P@55w0rd',
];
$source = new REST($credentials);
$xml = $source->fetch('endpoint', XML::class, new Schema('myschema.json'));
The development dependencies can be installed by running composer with or
without the --dev
option (enabled by default).
composer install --dev
Tests are written for PHPUnit which is included as a composer dev-dependency. To run the whole test suite, then execute this command from the shell console:
php vendor/bin/phpunit
You might also be interested in an easy to read checklist output:
php vendor/bin/phpunit --testdox
When stepping through breakpoints in an IDE, like Netbeans, it's helpful to see the current test name output by setting the run configuration to debug:
php vendor/bin/phpunit --debug
A code coverage report is available if you have the xdebug
extension
installed. In addition to the console text summary report, a full HTML GUI is
generated to explore in the coverage
folder. The easiest way to view this is
to boot up a dev server:
php -S localhost:8080 -t coverage/
You must create the file (or symbolic link) tests/data/pdos/pdo.php
in order
to run the database tests. This should simply return a PDO
instance,
for example:
<?php
return (function(){
$hostname = 'sql.example.com';
$database = 'mydb';
$username = 'user';
$password = 'P@55w0rd';
// https://www.microsoft.com/en-us/download/details.aspx?id=50419
$driver = 'ODBC Driver 13 for SQL Server';
$pdo = new PDO("odbc:Driver=$driver;
Server=$hostname;
Database=$database",
$username,
$password
);
$pdo->setAttribute( PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION );
return $pdo;
})();