Release Notes - Spark - Version 2.3.2

Sub-task

  • [SPARK-24535] - Fix java version parsing in SparkR on Windows
  • [SPARK-24976] - Allow None for Decimal type conversion (specific to PyArrow 0.9.0)

Bug

  • [SPARK-22809] - pyspark is sensitive to imports with dots
  • [SPARK-23243] - Shuffle+Repartition on an RDD could lead to incorrect answers
  • [SPARK-23618] - docker-image-tool.sh Fails While Building Image
  • [SPARK-23731] - FileSourceScanExec throws NullPointerException in subexpression elimination
  • [SPARK-23732] - Broken link to scala source code in Spark Scala api Scaladoc
  • [SPARK-24018] - Spark-without-hadoop package fails to create or read parquet files with snappy compression
  • [SPARK-24216] - Spark TypedAggregateExpression uses getSimpleName that is not safe in scala
  • [SPARK-24369] - A bug when having multiple distinct aggregations
  • [SPARK-24385] - Trivially-true EqualNullSafe should be handled like EqualTo in Dataset.join
  • [SPARK-24415] - Stage page aggregated executor metrics wrong when failures
  • [SPARK-24452] - long = int*int or long = int+int may cause overflow.
  • [SPARK-24468] - DecimalType `adjustPrecisionScale` might fail when scale is negative
  • [SPARK-24495] - SortMergeJoin with duplicate keys wrong results
  • [SPARK-24506] - Spark.ui.filters not applied to /sqlserver/ url
  • [SPARK-24530] - Sphinx doesn't render autodoc_docstring_signature correctly (with Python 2?) and pyspark.ml docs are broken
  • [SPARK-24531] - HiveExternalCatalogVersionsSuite failing due to missing 2.2.0 version
  • [SPARK-24536] - Query with nonsensical LIMIT hits AssertionError
  • [SPARK-24552] - Task attempt numbers are reused when stages are retried
  • [SPARK-24578] - Reading remote cache block behavior changes and causes timeout issue
  • [SPARK-24583] - Wrong schema type in InsertIntoDataSourceCommand
  • [SPARK-24588] - StreamingSymmetricHashJoinExec should require HashClusteredPartitioning from children
  • [SPARK-24589] - OutputCommitCoordinator may allow duplicate commits
  • [SPARK-24603] - Typo in comments
  • [SPARK-24613] - Cache with UDF could not be matched with subsequent dependent caches
  • [SPARK-24704] - The order of stages in the DAG graph is incorrect
  • [SPARK-24739] - PySpark does not work with Python 3.7.0
  • [SPARK-24781] - Using a reference from Dataset in Filter/Sort might not work.
  • [SPARK-24809] - Serializing LongHashedRelation in executor may result in data error
  • [SPARK-24813] - HiveExternalCatalogVersionsSuite still flaky; fall back to Apache archive
  • [SPARK-24867] - Add AnalysisBarrier to DataFrameWriter
  • [SPARK-24879] - NPE in Hive partition filter pushdown for `partCol IN (NULL, ....)`
  • [SPARK-24889] - dataset.unpersist() doesn't update storage memory stats
  • [SPARK-24891] - Fix HandleNullInputsForUDF rule
  • [SPARK-24908] - [R] remove spaces to make lintr happy
  • [SPARK-24909] - Spark scheduler can hang when fetch failures, executor lost, task running on lost executor, and multiple stage attempts
  • [SPARK-24927] - The hadoop-provided profile doesn't play well with Snappy-compressed Parquet files
  • [SPARK-24934] - Complex type and binary type in in-memory partition pruning does not work due to missing upper/lower bounds cases
  • [SPARK-24948] - SHS wrongly filters some applications due to permission check
  • [SPARK-24950] - scala DateTimeUtilsSuite daysToMillis and millisToDays fails w/java 8 181-b13
  • [SPARK-24957] - Decimal arithmetic can lead to wrong values using codegen
  • [SPARK-24987] - Kafka Cached Consumer Leaking File Descriptors
  • [SPARK-25028] - AnalyzePartitionCommand failed with NPE if value is null
  • [SPARK-25051] - where clause on dataset gives AnalysisException
  • [SPARK-25076] - SQLConf should not be retrieved from a stopped SparkSession
  • [SPARK-25084] - "distribute by" on multiple columns may lead to codegen issue
  • [SPARK-25114] - RecordBinaryComparator may return wrong result when subtraction between two words is divisible by Integer.MAX_VALUE
  • [SPARK-25124] - VectorSizeHint.size is buggy, breaking streaming pipeline
  • [SPARK-25144] - distinct on Dataset leads to exception due to Managed memory leak detected
  • [SPARK-25164] - Parquet reader builds entire list of columns once for each column
  • [SPARK-25205] - typo in spark.network.crypto.keyFactoryIteration
  • [SPARK-25231] - Running a Large Job with Speculation On Causes Executor Heartbeats to Time Out on Driver
  • [SPARK-25313] - Fix regression in FileFormatWriter output schema
  • [SPARK-25330] - Permission issue after upgrade hadoop version to 2.7.7
  • [SPARK-25357] - Add metadata to SparkPlanInfo to dump more information like file path to event log
  • [SPARK-25368] - Incorrect constraint inference returns wrong result
  • [SPARK-25371] - Vector Assembler with no input columns leads to opaque error
  • [SPARK-25402] - Null handling in BooleanSimplification
  • [SPARK-26802] - CVE-2018-11760: Apache Spark local privilege escalation vulnerability

Story

  • [SPARK-25234] - SparkR:::parallelize doesn't handle integer overflow properly

New Feature

  • [SPARK-24542] - Hive UDF series UDFXPathXXXX allow users to pass carefully crafted XML to access arbitrary files

Improvement

  • [SPARK-24455] - fix typo in TaskSchedulerImpl's comments
  • [SPARK-24696] - ColumnPruning rule fails to remove extra Project
  • [SPARK-25400] - Increase timeouts in schedulerIntegrationSuite

Test

  • [SPARK-24502] - flaky test: UnsafeRowSerializerSuite
  • [SPARK-24521] - Fix ineffective test in CachedTableSuite
  • [SPARK-24564] - Add test suite for RecordBinaryComparator

Documentation

  • [SPARK-24507] - Description in "Level of Parallelism in Data Receiving" section of Spark Streaming Programming Guide is not relevant for the recent Kafka direct approach
  • [SPARK-25273] - How to install testthat v1.0.2
