[SPARK-19520][STREAMING] Do not encrypt data written to the WAL.

Spark's I/O encryption uses an ephemeral key for each driver instance, so driver B cannot decrypt data written by driver A because it does not have the correct key. The write ahead log is used for recovery and therefore must be readable by a different driver, so it cannot be encrypted by Spark's I/O encryption code.

The BlockManager APIs used by the WAL code to write the data encrypt it automatically, so changes are needed to let callers opt out of encryption.

Aside from that, the "putBytes" API in the BlockManager does not do encryption, so a separate situation arose where the WAL would write unencrypted data to the BM and, when those blocks were read, decryption would fail. So the WAL code needs to ask the BM to encrypt that data when encryption is enabled. This code is not optimal, since it results in a (temporary) second copy of the data block in memory, but it should be OK for now until a more performant solution is added. The non-encryption case should not be affected.

Tested with new unit tests, and by running streaming apps that do recovery using the WAL data with I/O encryption turned on.

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #16862 from vanzin/SPARK-19520.
author: Marcelo Vanzin <vanzin@cloudera.com> 2017-02-13 14:19:41 -0800
committer: Marcelo Vanzin <vanzin@cloudera.com> 2017-02-13 14:19:41 -0800
commit: 0169360ef58891ca10a8d64d1c8637c7b873cbdd (patch)
tree: 8a0e7b1652c7d32bda363ee7cbf4696a1daed608 /docs
parent: 9af8f743b00001f9fdf8813481464c3837331ad9 (diff)
download: spark-0169360ef58891ca10a8d64d1c8637c7b873cbdd.tar.gz
spark-0169360ef58891ca10a8d64d1c8637c7b873cbdd.tar.bz2
spark-0169360ef58891ca10a8d64d1c8637c7b873cbdd.zip
1 files changed, 3 insertions, 0 deletions
diff --git a/docs/streaming-programming-guide.md b/docs/streaming-programming-guide.md
index 38b4f78177..a878971608 100644
--- a/docs/streaming-programming-guide.md
+++ b/docs/streaming-programming-guide.md
@@ -2017,6 +2017,9 @@ To run a Spark Streaming applications, you need to have the following.
   `spark.streaming.driver.writeAheadLog.closeFileAfterWrite` and
   `spark.streaming.receiver.writeAheadLog.closeFileAfterWrite`. See
   [Spark Streaming Configuration](configuration.html#spark-streaming) for more details.
+  Note that Spark will not encrypt data written to the write ahead log when I/O encryption is
+  enabled. If encryption of the write ahead log data is desired, it should be stored in a file
+  system that supports encryption natively.
 
 - *Setting the max receiving rate* - If the cluster resources is not large enough for the streaming
   application to process data as fast as it is being received, the receivers can be rate limited
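
For readers applying the documentation note added in this patch, the following is a minimal configuration sketch, not code from the commit. It assumes the write ahead log lives under the streaming checkpoint directory and that this directory sits on a file system that encrypts data natively. The path `hdfs:///encrypted-zone/streaming/checkpoint`, the application name, and the 10 second batch interval are hypothetical, and setting up the encrypted storage itself is outside Spark and not shown.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object WalOnEncryptedFs {
  // Hypothetical location, assumed to be inside a natively encrypted
  // area of the file system (configured by an administrator).
  val checkpointDir = "hdfs:///encrypted-zone/streaming/checkpoint"

  def createContext(): StreamingContext = {
    val conf = new SparkConf()
      .setAppName("wal-on-encrypted-fs") // hypothetical name
      // Spark's I/O encryption; per this patch it does not cover WAL data.
      .set("spark.io.encryption.enabled", "true")
      // Enable the receiver write ahead log.
      .set("spark.streaming.receiver.writeAheadLog.enable", "true")

    val ssc = new StreamingContext(conf, Seconds(10))
    // Point checkpointing (and with it the WAL) at the encrypted location.
    ssc.checkpoint(checkpointDir)
    // Input streams and output operations would be defined here.
    ssc
  }

  def main(args: Array[String]): Unit = {
    // A restarted driver rebuilds its state from the checkpoint directory.
    val ssc = StreamingContext.getOrCreate(checkpointDir, () => createContext())
    ssc.start()
    ssc.awaitTermination()
  }
}
```

With this layout, a replacement driver can read the log during recovery, while protection of the log at rest is delegated to the storage layer.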