This directory contains an Apache Hadoop MapReduce InputFormat/OutputFormat implementation for TensorFlow's TFRecords format. This can also be used with Apache Spark.
Prerequisites:

- protoc 3.3.0 installed.
- Download and compile the Example protos from the TensorFlow source tree (see the build steps below).
- Tested with Hadoop 2.6.0. Patches are welcome if there are incompatibilities with your Hadoop version.
To build and install:

- Compile the TensorFlow Example protos:
# Suppose $TF_SRC_ROOT is the source code root of the TensorFlow project
protoc --proto_path=$TF_SRC_ROOT --java_out=src/main/java/ $TF_SRC_ROOT/tensorflow/core/example/{example,feature}.proto
- Compile the code:
mvn clean package
Alternatively, if you have an earlier version of protoc installed, e.g., protoc 3.1.0, you can compile the code using:
mvn clean package -Dprotobuf.version=3.1.0
- Optionally install (or deploy) the jars:
mvn install
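To deploy to a remote repository instead (this assumes a distributionManagement section is configured in your POM), the standard Maven counterpart is:
mvn deploy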
After installation (or deployment), the package can be used with the following dependency:
<dependency>
  <groupId>org.tensorflow</groupId>
  <artifactId>tensorflow-hadoop</artifactId>
  <version>1.0-SNAPSHOT</version>
</dependency>
Alternatively, use the shaded version of the package in case of incompatibilities with the version of the protobuf library in your application. For example, Apache Spark uses an older version of the protobuf library which can cause conflicts. The shaded package can be used as follows:
<dependency>
  <groupId>org.tensorflow</groupId>
  <artifactId>tensorflow-hadoop</artifactId>
  <version>1.0-SNAPSHOT</version>
  <classifier>shaded-protobuf</classifier>
</dependency>
The Hadoop MapReduce example can be found here.
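If the example is not at hand, the sketch below shows the general shape of such a job. It is a minimal illustration, not the bundled example: the driver object name TFRecordPassThrough is made up, and the job assumes this package's TFRecordFileInputFormat and TFRecordFileOutputFormat classes, relying on a map-only pass-through to copy TFRecord files from one path to another.

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{BytesWritable, NullWritable}
import org.apache.hadoop.mapreduce.Job
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat
import org.tensorflow.hadoop.io.{TFRecordFileInputFormat, TFRecordFileOutputFormat}

// Hypothetical driver: copies TFRecord files from args(0) to args(1).
object TFRecordPassThrough {
  def main(args: Array[String]): Unit = {
    val job = Job.getInstance(new Configuration(), "TFRecord pass-through")
    job.setJarByClass(TFRecordPassThrough.getClass)
    // The record reader yields one (BytesWritable, NullWritable) pair per
    // TFRecord; the key holds the serialized Example bytes.
    job.setInputFormatClass(classOf[TFRecordFileInputFormat])
    job.setOutputFormatClass(classOf[TFRecordFileOutputFormat])
    job.setOutputKeyClass(classOf[BytesWritable])
    job.setOutputValueClass(classOf[NullWritable])
    // Map-only job: the default identity Mapper passes records straight
    // through, and skipping the reduce phase avoids a needless shuffle.
    job.setNumReduceTasks(0)
    FileInputFormat.addInputPath(job, new Path(args(0)))
    FileOutputFormat.setOutputPath(job, new Path(args(1)))
    System.exit(if (job.waitForCompletion(true)) 0 else 1)
  }
}

Because both formats exchange (BytesWritable, NullWritable) pairs, a real job would only need a custom Mapper if it transforms the records.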
Spark supports reading/writing files with Hadoop InputFormat/OutputFormat. Use the shaded version of the package to avoid conflicts with the protobuf version included in Spark.
The following command demonstrates how to use the package with spark-shell:
$ spark-shell --master local --jars target/tensorflow-hadoop-1.0-SNAPSHOT-shaded-protobuf.jar
The following code snippet demonstrates the usage by writing each line of a text file as a serialized Example record:
import org.tensorflow.hadoop.shaded.protobuf.ByteString
import org.apache.hadoop.io.{NullWritable, BytesWritable}
import org.apache.spark.{SparkConf, SparkContext}
import org.tensorflow.example.{BytesList, Int64List, Feature, Features, Example}
import org.tensorflow.hadoop.io.TFRecordFileOutputFormat
val inputPath = "path/to/input.txt"
val outputPath = "path/to/output.tfr"
val sparkConf = new SparkConf().setAppName("TFRecord Demo")
val sc = new SparkContext(sparkConf)
val features = sc.textFile(inputPath).map(line => {
  // Wrap the raw line bytes in a BytesList feature named "text".
  val text = BytesList.newBuilder().addValue(ByteString.copyFrom(line.getBytes)).build()
  val features = Features.newBuilder()
    .putFeature("text", Feature.newBuilder().setBytesList(text).build())
    .build()
  val example = Example.newBuilder()
    .setFeatures(features)
    .build()
  // TFRecordFileOutputFormat expects (BytesWritable, NullWritable) pairs.
  (new BytesWritable(example.toByteArray), NullWritable.get())
})
features.saveAsNewAPIHadoopFile[TFRecordFileOutputFormat](outputPath)
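Reading the records back follows the same pattern in reverse. The sketch below assumes this package's TFRecordFileInputFormat, which presents each record's serialized Example bytes as the key of a (BytesWritable, NullWritable) pair:

import org.tensorflow.hadoop.io.TFRecordFileInputFormat

val records = sc.newAPIHadoopFile(outputPath, classOf[TFRecordFileInputFormat],
  classOf[BytesWritable], classOf[NullWritable])
// copyBytes() trims BytesWritable's padded backing array before parsing.
val examples = records.map { case (bytes, _) => Example.parseFrom(bytes.copyBytes()) }
examples.take(3).foreach(println)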