BigDL is a distributed deep learning library for Apache Spark*. Using BigDL, you can write deep learning applications as Scala or Python* programs and take advantage of the power of scalable Spark clusters. This article introduces BigDL, shows you how to build the library on a variety of platforms, and provides examples of BigDL in action.
What Is Deep Learning?
Deep learning is a branch of machine learning that uses algorithms to model high-level abstractions in data. These methods are based on artificial neural network topologies and can scale with larger data sets.
Figure 1 explores the fundamental difference between traditional and deep learning approaches. Where traditional approaches tend to focus on feature extraction as a single phase for training a machine learning model, deep learning introduces depth by creating a pipeline of feature extractors in the model. These multiple phases in a deep learning pipeline increase the overall prediction accuracy of the resulting model through hierarchical feature extraction.
Figure 1. Deep learning as a pipeline of feature extractors
What Is BigDL?
BigDL is a distributed deep learning library for Spark that can run directly on top of existing Spark or Apache Hadoop* clusters. You can write deep learning applications as Scala or Python programs.
What Is Apache Spark*?
Spark is a lightning-fast distributed data processing framework developed at the AMPLab at the University of California, Berkeley. Spark can run in stand-alone mode, or it can run in cluster mode on YARN on top of Hadoop or under the Apache Mesos* cluster manager (Figure 2). Spark can process data from a variety of sources, including HDFS, Apache Cassandra*, and Apache Hive*. Its high performance comes from its ability to do in-memory processing via memory-persistent RDDs or DataFrames instead of writing data back to disk between stages, as the traditional Hadoop MapReduce architecture does.
Figure 2. BigDL in the Apache Spark* stack
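To illustrate the in-memory model, here is a minimal, hypothetical Spark program (not taken from the BigDL sources): the RDD is cached after the first action, so subsequent operations read it from memory instead of recomputing it from the original source.

import org.apache.spark.{SparkConf, SparkContext}

object CacheExample {
  def main(args: Array[String]): Unit = {
    // Local two-thread Spark context, purely for demonstration.
    val sc = new SparkContext(new SparkConf().setAppName("CacheExample").setMaster("local[2]"))

    // cache() asks Spark to keep the partitions in memory after they are first computed.
    val numbers = sc.parallelize(1 to 1000000).cache()

    println(numbers.sum())                        // first action computes and caches the RDD
    println(numbers.filter(_ % 2 == 0).count())   // second action reuses the in-memory copy

    sc.stop()
  }
}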
Why Use BigDL?
You may want to use BigDL to write your deep learning programs if:
You want to analyze a large amount of data on the same big data (Hadoop/Spark) cluster where the data are stored (for example, in HDFS, Apache HBase*, or Hive).
You want to add deep learning functionality (training or prediction) to your existing Spark programs or workflows.
You want to run your deep learning applications on existing Hadoop/Spark clusters and share those clusters dynamically with other workloads (for example, ETL, data warehousing, or feature engineering).
Install the Toolchain
Note: This article provides instructions for a fresh installation and assumes that you have no Linux* tools installed. If you are an experienced Linux user, feel free to skip rudimentary steps such as installing Git.
Install the Toolchain on Oracle* VM VirtualBox
To install the toolchain on Oracle* VM VirtualBox, perform the following steps:
1. Enable virtualization in your computer’s BIOS settings (Figure 3).
The BIOS settings are typically under the SECURITY menu. You will need to reboot your machine and go into the BIOS settings menu.
Figure 3. Enable virtualization in your computer’s BIOS.
2. Determine whether you have a 64-bit or a 32-bit machine.
Most modern computers are 64 bit, but just to be sure, look in the Control Panel under System and Security (Figure 4).
Figure 4. Determine whether your computer is 32 bits or 64 bits.
Note: You should have more than 4 GB of RAM. Your virtual machine (VM) will run with 2 GB, but most of the BigDL examples (except for LeNet) won’t work well (or at all). This article considers 8 GB of RAM the minimum.
3. Install VirtualBox with GuestAdditions or another VM client from the VirtualBox website.
The VM configurations in Figure 5 are known to work with BigDL.
Figure 5. VM configurations known to work with BigDL
Important:
4. Install Ubuntu Desktop from the Ubuntu download page.
5. When the download is complete, point your VM to the downloaded file so that you can boot from it. (For instructions, go to the Ubuntu FAQ.) After Ubuntu is installed, update the system:
sudo apt-get update
sudo apt-get upgrade
From this point forward, all installations take place within Ubuntu or your version of Linux; the instructions are identical for the VM and for native Linux except where explicitly stated otherwise.
If you have trouble accessing the Internet, you may need to set up proxies. (Virtual private networks in general and Cisco AnyConnect in particular are notorious for modifying the VM’s proxy settings.) The proxy settings in Figure 6 are known to work for VirtualBox version 5.1.12 and 64-bit Ubuntu.
Figure 6. Proxy settings for Oracle* VM VirtualBox version 5.1.12 and 64‑bit Ubuntu* systems
6. Install Java* with the following commands:
sudo add-apt-repository ppa:webupd8team/java
sudo apt-get update
sudo apt-get install oracle-java8-installer
7. Verify the version of Java:
java -version
8. Install Scala version 2.11 from the Scala download page.
When downloading Scala, use the Debian* file format; the file downloads into the Downloads folder by default. Open the downloaded .deb file in the file browser and click it to install Scala, and then update the package index:
sudo apt-get update
9. Install Spark from the Spark download page.
The instructions that follow are for Spark 1.6. If you install a different version, replace 1.6.x with your flavor of choice:
a. Extract the archive (replace the file name with the version you downloaded):
$ cd Downloads
$ tar -xzf spark-1.6.1-bin-hadoop2.6.tgz
b. Move the files to their proper location:
$ sudo mkdir /usr/local/spark
$ sudo mv spark-1.6.1-bin-hadoop2.6 /usr/local/spark
c. Test the installation:
$ cd /usr/local/spark
$ cd spark-1.6.1-bin-hadoop2.6
$ ./bin/spark-shell
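Once the Spark shell starts, you can run a quick sanity check at the scala> prompt. This is a minimal sketch (not part of the official instructions), using the SparkContext sc that spark-shell creates for you:

scala> val nums = sc.parallelize(1 to 100)   // distribute the numbers 1 to 100
scala> println(nums.sum())                   // should print 5050.0

If the sum appears without errors, Spark is working; exit the shell with :quit.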
d. Install Git:
$ sudo apt-get install git
Download BigDL
BigDL is available from GitHub* using the Git source control tool. Clone the BigDL Git repository as shown:
$ git clone https://github.com/intel-analytics/BigDL.git
When you’re done, you’ll see a new subdirectory called BigDL, which contains the BigDL repository. You must also install Apache Maven, as shown below, to compile the Scala code:
$ sudo apt-get install maven
Build BigDL
This section shows you how to download and build BigDL on your Linux distribution. The prerequisites for building BigDL are the Java JDK (installed earlier) and Apache Maven 3:
Apache Maven Environment
You need Maven 3 to build BigDL. You can download it from the Maven website.
After installing Maven 3, set the environment variable MAVEN_OPTS as follows:
$ export MAVEN_OPTS="-Xmx2g -XX:ReservedCodeCacheSize=512m"
When compiling with Java 7, you must add the option -XX:MaxPermSize=1G.
Building
It is highly recommended that you build BigDL by using the make-dist.sh script.
When the download is complete, build BigDL with the following commands:
$ bash make-dist.sh
After that, you can find a dist folder that contains all the files you need to run a BigDL program. The files in dist include:
1. dist/bin/bigdl.sh. Use this script to set up proper environment variables and launch the BigDL program.
2. dist/lib/bigdl-0.1.0-SNAPSHOT-jar-with-dependencies.jar. This Java archive (JAR) package contains all dependencies except Spark.
3. dist/lib/bigdl-0.1.0-SNAPSHOT-jar-with-dependencies-and-spark.jar. This JAR package contains all dependencies, including Spark.
Alternative Approaches
If you're building for Spark 2.0, which uses Scala 2.11 by default, pass -P spark_2.0 to the make-dist.sh script:
$ bash make-dist.sh -P spark_2.0
It is strongly recommended that you use Java 8 when running with Spark 2.0. Otherwise, you may observe poor performance.
By default, make-dist.sh uses Scala 2.10 for Spark version 1.5.x or 1.6.x and Scala 2.11 for Spark 2.0. To override the default behaviors, pass -P scala_2.10 or -P scala_2.11 to make-dist.sh, as appropriate.
Getting Started
Prior to running a BigDL program, you must do a bit of setup first. This section identifies the steps for getting started.
Set Environment Variables
To achieve high performance, BigDL uses Intel MKL and multithreaded programming. Therefore, you must first set the environment variables by sourcing the provided script PATH_To_BigDL/scripts/bigdl.sh as follows:
$ source PATH_To_BigDL/scripts/bigdl.sh
Alternatively, you can use PATH_To_BigDL/scripts/bigdl.sh to launch your BigDL program.
Working with the Interactive Scala Shell
You can experiment with BigDL code by using the interactive Scala shell. To do so, run:
$ scala -cp bigdl-0.2.0-SNAPSHOT-jar-with-dependencies-and-spark.jar
Then, you can see something like:
Welcome to Scala 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_91).
Type in expressions for evaluation. Or try :help.
scala>
For instance, to experiment with the Tensor application programming interfaces (APIs) in BigDL, you can then try:
scala> import com.intel.analytics.bigdl.tensor.Tensor
import com.intel.analytics.bigdl.tensor.Tensor
scala> Tensor[Double](2,2).fill(1.0)
res9: com.intel.analytics.bigdl.tensor.Tensor[Double] =
1.0 1.0
1.0 1.0
[com.intel.analytics.bigdl.tensor.DenseTensor of size 2×2]
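From the same prompt you can also try basic tensor arithmetic. This is a small, hedged sketch; it assumes the element-wise operators behave as described in the BigDL Tensor documentation:

scala> val a = Tensor[Double](2, 2).fill(1.0)   // 2x2 tensor of ones
scala> val b = Tensor[Double](2, 2).fill(2.0)   // 2x2 tensor of twos
scala> a + b                                    // element-wise addition; expect a 2x2 tensor of 3.0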
For more details about the BigDL APIs, refer to the BigDL Programming Guide.
Running a Local Java* Program (Single Mode)
You can run a BigDL program (for example, the VGG model training) as a local Java program that runs on a single node (machine). To do so, perform the following steps:
1. Download the CIFAR-10 data from the CIFAR-10 Dataset page.
Remember to choose the binary version.
2. Use the bigdl.sh script to launch the example as a local Java program:
./dist/bin/bigdl.sh -- \
java -cp dist/lib/bigdl-0.2.0-SNAPSHOT-jar-with-dependencies-and-spark.jar \
com.intel.analytics.bigdl.models.vgg.Train \
--core core_number \
--node 1 \
--env local \
-f cifar10_folder \
-b batch_size
This command uses the following parameters: --core, the number of physical cores to use on the machine; --node 1, which runs the example on a single node; --env local, which runs it as a local Java program; -f, the folder containing the CIFAR-10 data; and -b, the mini-batch size.
Running a Spark Program
You can run a BigDL program (for example, the VGG model training) as a standard Spark program (running in either local mode or cluster mode). To do so, perform the following steps:
1. Download the CIFAR-10 data from the CIFAR-10 Dataset page.
Remember to choose the binary version.
2. Use the bigdl.sh script to launch the example as a Spark program:
./dist/bin/bigdl.sh -- \
spark-submit --class com.intel.analytics.bigdl.models.vgg.Train \
dist/lib/bigdl-0.2.0-SNAPSHOT-jar-with-dependencies.jar \
--core core_number_per_node \
--node node_number \
--env spark \
-f cifar10_folder/ \
-b batch_size
This command uses the following parameters: --core, the number of physical cores used on each node; --node, the number of nodes (executors); --env spark, which runs the example as a Spark program; -f, the folder containing the CIFAR-10 data; and -b, the mini-batch size.
Debugging Common Issues
Currently, BigDL uses synchronous mini-batch SGD for model training, and it launches a single Spark task (running in multithreaded mode) on each executor (or container) when processing a mini-batch. Keep the following points in mind:
Note: You may observe poor performance when running BigDL for Spark 2.0 with Java 7, so use Java 8 when you build and run BigDL for Spark 2.0.
In Spark 2.0, use the default Java serializer instead of Kryo because of Kryo issue 341, “Stackoverflow error due to parentScope of Generics being pushed as the same object” (see the Kryo Issues page for more information). The issue is fixed in Kryo 4.0, but Spark 2.0 uses Kryo 3.0.3. Spark versions 1.5 and 1.6 do not have this problem. (A configuration sketch follows this list.)
On CentOS* versions 6 and 7, increase the max user processes to a larger value, such as 514585. Otherwise, you may see errors like “unable to create new native thread.”
Currently, BigDL loads all the training and validation data into memory during training. You may encounter errors if BigDL runs out of memory.
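As a hedged illustration of the serializer point above (not taken from the BigDL sources), the following Scala fragment pins a Spark 2.0 application to Spark's default Java serializer via the standard spark.serializer property, rather than enabling Kryo:

import org.apache.spark.SparkConf

// Explicitly keep Spark's default Java serializer instead of switching to Kryo.
val conf = new SparkConf()
  .setAppName("BigDL on Spark 2.0")
  .set("spark.serializer", "org.apache.spark.serializer.JavaSerializer")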
BigDL Examples
Note: If you're using a VM, your performance won't be as good as it would be in a native operating system installation, mostly because of the overhead of the VM and the limited resources available to it. This is expected and is in no way reflective of genuine Spark performance.
Training LeNet on MNIST
The “hello world” example for deep learning is training LeNet (a convolutional neural network) on the MNIST database.
Figure 7 shows the cloned LeNet example.
Figure 7. The clone LeNet example
A BigDL program starts with an import of com.intel.analytics.bigdl._. Then, it initializes the engine, including the number of executor nodes, the number of physical cores on each executor, and whether it runs on Spark or as a local Java program:
val sc = Engine.init(param.nodeNumber, param.coreNumber, param.env == "spark").map(conf => {
conf.setAppName("Train Lenet on MNIST")
.set("spark.akka.frameSize", 64.toString)
.set("spark.task.maxFailures", "1")
new SparkContext(conf)
})
If the program runs on Spark, Engine.init() returns a SparkConf with proper configurations populated, which you can then use to create the SparkContext. Otherwise, the program runs as a local Java program, and Engine.init() returns None.
After the initialization, begin creating the LeNet model by calling LeNet5(), which creates the LeNet-5 convolutional network model as follows:
val model = Sequential()
model.add(Reshape(Array(1, 28, 28)))
.add(SpatialConvolution(1, 6, 5, 5))
.add(Tanh())
.add(SpatialMaxPooling(2, 2, 2, 2))
.add(Tanh())
.add(SpatialConvolution(6, 12, 5, 5))
.add(SpatialMaxPooling(2, 2, 2, 2))
.add(Reshape(Array(12 * 4 * 4)))
.add(Linear(12 * 4 * 4, 100))
.add(Tanh())
.add(Linear(100, classNum))
.add(LogSoftMax())
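To get a feel for the network, you can push a single random image through it and check the output shape. This is a minimal sketch, not part of the BigDL example; it assumes LeNet5 is built for Float (as in the listing above) and that classNum is 10 for the MNIST digits:

import com.intel.analytics.bigdl.models.lenet.LeNet5
import com.intel.analytics.bigdl.tensor.Tensor

val model = LeNet5(classNum = 10)           // the network defined above
val input = Tensor[Float](28, 28).rand()    // one random grey-scale "image"
val output = model.forward(input).toTensor[Float]
println(output.size().mkString("x"))        // expect "10": one log-probability per digit class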
Next, load the data by creating the DataSet (either a distributed or a local one, depending on whether the program runs on Spark) and then applying a series of Transformers (for example, SampleToGreyImg, GreyImgNormalizer, and GreyImgToBatch):
val trainSet = (if (sc.isDefined) {
DataSet.array(load(trainData, trainLabel), sc.get, param.nodeNumber)
} else {
DataSet.array(load(trainData, trainLabel))
})
-> SampleToGreyImg(28, 28)
-> GreyImgNormalizer(trainMean, trainStd)
-> GreyImgToBatch(param.batchSize)
After that, you create the Optimizer (either a distributed or a local one, depending on whether the program runs on Spark) by specifying the data set, the model, and the criterion, which, given input and target, computes the gradient per the given loss function:
val optimizer = Optimizer(
model = model,
dataset = trainSet,
criterion = ClassNLLCriterion[Float]())
Finally, after optionally specifying the validation data and methods for the Optimizer, you train the model by calling Optimizer.optimize():
optimizer
.setValidation(
trigger = Trigger.everyEpoch,
dataset = validationSet,
vMethods = Array(new Top1Accuracy))
.setState(state)
.setEndWhen(Trigger.maxEpoch(param.maxEpoch))
.optimize()
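As a hedged follow-on (not shown in the original example), optimize() returns the trained module, which you can persist for later prediction; the save and load calls below are assumed to follow the BigDL 0.1/0.2 module API:

val trainedModel = optimizer.optimize()
trainedModel.save("/tmp/lenet.model", true)   // overwrite an existing file if present

// Later, for example in a prediction program (Module is com.intel.analytics.bigdl.nn.Module):
// val restored = Module.load[Float]("/tmp/lenet.model")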
The following command executes the LeNet example from the command line:
$ dist/bin/bigdl.sh -- \
java -cp dist/lib/bigdl-0.2.0-SNAPSHOT-jar-with-dependencies-and-spark.jar \
com.intel.analytics.bigdl.models.lenet.Train \
-f ~/data/mnist/ \
--core 1 \
--node 1 \
--env local \
--checkpoint
The following is an example command for running LeNet in Spark local mode on a VM:
Note: Before you run Spark local mode, export SPARK_HOME=your_spark_install_dir and PATH=$PATH:$SPARK_HOME/bin:
$ ./dist/bin/bigdl.sh \
-- spark-submit \
--master local[1] \
--driver-class-path dist/lib/bigdl-0.2.0-SNAPSHOT-jar-with-dependencies.jar \
--class com.intel.analytics.bigdl.models.lenet.Train \
dist/lib/bigdl-0.2.0-SNAPSHOT-jar-with-dependencies.jar \
-f ~/data/mnist/ \
--core 1 \
--node 1 \
--env spark \
--checkpoint ~/model/model_lenet_spark
Image Classification
Note: For an example of running image inference on a pre-trained model, see the BigDL README.md.
Download the validation images (ILSVRC2012_img_val.tar), the image .tar file for the pre-trained model. Put it in ~/data/imagenet, for example, and untar it. Next, download the model resnet-18.t7 and put it in ~/model/resnet-18.t7, for example.
Run the following command from a command line (this example is for a VM, so memory sizes are set to 1g):
$ ./dist/bin/bigdl.sh -- spark-submit --master local[1] \
--driver-memory 1g \
--executor-memory 1g \
--driver-class-path dist/lib/bigdl-0.2.0-SNAPSHOT-jar-with-dependencies.jar \
--class com.intel.analytics.bigdl.example.imageclassification.ImagePredictor \
dist/lib/bigdl-0.2.0-SNAPSHOT-jar-with-dependencies-and-spark.jar \
--modelPath ~/model/resnet-18.t7/resnet-18.t7 \
--folder ~/data/imagenet/predict_100 \
--modelType torch -c 1 -n 1 \
--batchSize 4 \
--isHdfs false
The output of the program is a table with two columns: the *.JPEG file name and a numeric label. To check the accuracy of prediction, refer to the imagenet1000_clsid.txt label file. (Note that the file labels are 0-based, while the output of the example is 1-based.)
The first few lines of the label file are as follows:
{0: 'tench, Tinca tinca',
1: 'goldfish, Carassius auratus',
2: 'great white shark, white shark, man-eater, man-eating shark, Carcharodon carcharias',
3: 'tiger shark, Galeocerdo cuvieri',
4: 'hammerhead, hammerhead shark',
…
The following are the first few lines of the example output:
[ILSVRC2012_val_00025162.JPEG,360]
[ILSVRC2012_val_00025102.JPEG,958]
[ILSVRC2012_val_00025113.JPEG,853]
[ILSVRC2012_val_00025153.JPEG,867]
[ILSVRC2012_val_00025132.JPEG,229]
[ILSVRC2012_val_00025133.JPEG,5]
The image ILSVRC2012_val_00025133.JPEG (Figure 8) has been correctly identified as label 5, which corresponds to the entry “4: ‘hammerhead, hammerhead shark’“ in the imagenet1000_clsid.txt file.
Figure 8. The image file ILSVRC2012_val_00025133.JPEG
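To automate that lookup, the short, hypothetical Scala sketch below parses imagenet1000_clsid.txt and maps the example's 1-based prediction back to a class name (the file name and line format are taken from the excerpt above):

import scala.io.Source

// Parse lines such as: 4: 'hammerhead, hammerhead shark',
val labels: Map[Int, String] = Source.fromFile("imagenet1000_clsid.txt").getLines()
  .flatMap { line =>
    line.stripPrefix("{").trim.split(":", 2) match {
      case Array(id, name) if id.trim.forall(_.isDigit) =>
        Some(id.trim.toInt -> name.trim.stripSuffix("}").stripSuffix(",").stripPrefix("'").stripSuffix("'"))
      case _ => None
    }
  }.toMap

val predicted = 5                  // 1-based label from the example output
println(labels(predicted - 1))     // 0-based lookup: hammerhead, hammerhead shark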
To run the classification example from within the IntelliJ IDE, you must set the arguments slightly differently when running in a VM, as Figure 9 shows.
Figure 9. The classification example in the IntelliJ integrated development environment
Because the IntelliJ IDE in Figure 9 does not display the full lines, here are the VM options:
-Dspark.master=local[1]
-Dspark.executor.cores=1
-Dspark.total.executor.cores=1
-Dspark.executor.memory=1g
-Dspark.driver.memory=1g
-Xmx1024m -Xms1024m
And here are the ImageClassifier program arguments:
--modelPath ~/model/resnet-18.t7/resnet-18.t7
--folder ~/data/imagenet/predict_100
--modelType torch
-c 1
-n 1
--batchSize 1
--isHdfs false
Breadth and Depth of Neural Networks Support
BigDL currently supports the following preconfigured neural network models:
For more information, see the BigDL Models page.
In addition, BigDL supports more than 100 standard neural network “building blocks,” allowing you to configure your own topology (Figure 10).
Figure 10. BigDL standard neural network building blocks
For more information, see the BigDL Neural Networks page.
In addition, BigDL comes with out-of-the-box examples of end-to-end implementations for LeNet, ImageNet, and TextClassification (Figure 11).
Figure 11. BigDL examples
For more information, see the BigDL Examples page.
Abbreviations
HDFS
Hadoop Distributed File System
IDE
Integrated Development Environment
MKL
Math Kernel Library
RDD
Resilient Distributed Dataset
RNN
Recurrent Neural Networks
SGD
Stochastic Gradient Descent
VM
Virtual Machine
YARN
Yet Another Resource Negotiator
Source: https://software.intel.com/en-us/articles/bigdl-distributed-deep-learning-on-apache-spark