Release Notes - ASF JIRA

Release Notes - Spark - Version 2.3.2 - HTML format


Sub-task

[SPARK-24535] - Fix java version parsing in SparkR on Windows
[SPARK-24976] - Allow None for Decimal type conversion (specific to PyArrow 0.9.0)

Bug

[SPARK-22809] - pyspark is sensitive to imports with dots
[SPARK-23243] - Shuffle+Repartition on an RDD could lead to incorrect answers
[SPARK-23618] - docker-image-tool.sh Fails While Building Image
[SPARK-23731] - FileSourceScanExec throws NullPointerException in subexpression elimination
[SPARK-23732] - Broken link to scala source code in Spark Scala api Scaladoc
[SPARK-24018] - Spark-without-hadoop package fails to create or read parquet files with snappy compression
[SPARK-24216] - Spark TypedAggregateExpression uses getSimpleName that is not safe in scala
[SPARK-24369] - A bug when having multiple distinct aggregations
[SPARK-24385] - Trivially-true EqualNullSafe should be handled like EqualTo in Dataset.join
[SPARK-24415] - Stage page aggregated executor metrics wrong when failures
[SPARK-24452] - long = int*int or long = int+int may cause overflow.
[SPARK-24468] - DecimalType `adjustPrecisionScale` might fail when scale is negative
[SPARK-24495] - SortMergeJoin with duplicate keys wrong results
[SPARK-24506] - Spark.ui.filters not applied to /sqlserver/ url
[SPARK-24530] - Sphinx doesn't render autodoc_docstring_signature correctly (with Python 2?) and pyspark.ml docs are broken
[SPARK-24531] - HiveExternalCatalogVersionsSuite failing due to missing 2.2.0 version
[SPARK-24536] - Query with nonsensical LIMIT hits AssertionError
[SPARK-24552] - Task attempt numbers are reused when stages are retried
[SPARK-24578] - Reading remote cache block behavior changes and causes timeout issue
[SPARK-24583] - Wrong schema type in InsertIntoDataSourceCommand
[SPARK-24588] - StreamingSymmetricHashJoinExec should require HashClusteredPartitioning from children
[SPARK-24589] - OutputCommitCoordinator may allow duplicate commits
[SPARK-24603] - Typo in comments
[SPARK-24613] - Cache with UDF could not be matched with subsequent dependent caches
[SPARK-24704] - The order of stages in the DAG graph is incorrect
[SPARK-24739] - PySpark does not work with Python 3.7.0
[SPARK-24781] - Using a reference from Dataset in Filter/Sort might not work.
[SPARK-24809] - Serializing LongHashedRelation in executor may result in data error
[SPARK-24813] - HiveExternalCatalogVersionsSuite still flaky; fall back to Apache archive
[SPARK-24867] - Add AnalysisBarrier to DataFrameWriter
[SPARK-24879] - NPE in Hive partition filter pushdown for `partCol IN (NULL, ....)`
[SPARK-24889] - dataset.unpersist() doesn't update storage memory stats
[SPARK-24891] - Fix HandleNullInputsForUDF rule
[SPARK-24908] - [R] remove spaces to make lintr happy
[SPARK-24909] - Spark scheduler can hang when fetch failures, executor lost, task running on lost executor, and multiple stage attempts
[SPARK-24927] - The hadoop-provided profile doesn't play well with Snappy-compressed Parquet files
[SPARK-24934] - Complex type and binary type in in-memory partition pruning does not work due to missing upper/lower bounds cases
[SPARK-24948] - SHS wrongly filters some applications due to permission check
[SPARK-24950] - scala DateTimeUtilsSuite daysToMillis and millisToDays fails w/java 8 181-b13
[SPARK-24957] - Decimal arithmetic can lead to wrong values using codegen
[SPARK-24987] - Kafka Cached Consumer Leaking File Descriptors
[SPARK-25028] - AnalyzePartitionCommand failed with NPE if value is null
[SPARK-25051] - where clause on dataset gives AnalysisException
[SPARK-25076] - SQLConf should not be retrieved from a stopped SparkSession
[SPARK-25084] - "distribute by" on multiple columns may lead to codegen issue
[SPARK-25114] - RecordBinaryComparator may return wrong result when subtraction between two words is divisible by Integer.MAX_VALUE
[SPARK-25124] - VectorSizeHint.size is buggy, breaking streaming pipeline
[SPARK-25144] - distinct on Dataset leads to exception due to Managed memory leak detected
[SPARK-25164] - Parquet reader builds entire list of columns once for each column
[SPARK-25205] - typo in spark.network.crypto.keyFactoryIteration
[SPARK-25231] - Running a Large Job with Speculation On Causes Executor Heartbeats to Time Out on Driver
[SPARK-25313] - Fix regression in FileFormatWriter output schema
[SPARK-25330] - Permission issue after upgrade hadoop version to 2.7.7
[SPARK-25357] - Add metadata to SparkPlanInfo to dump more information like file path to event log
[SPARK-25368] - Incorrect constraint inference returns wrong result
[SPARK-25371] - Vector Assembler with no input columns leads to opaque error
[SPARK-25402] - Null handling in BooleanSimplification
[SPARK-26802] - CVE-2018-11760: Apache Spark local privilege escalation vulnerability

Story

[SPARK-25234] - SparkR:::parallelize doesn't handle integer overflow properly

New Feature

[SPARK-24542] - Hive UDF series UDFXPathXXXX allow users to pass carefully crafted XML to access arbitrary files

Improvement

[SPARK-24455] - fix typo in TaskSchedulerImpl's comments
[SPARK-24696] - ColumnPruning rule fails to remove extra Project
[SPARK-25400] - Increase timeouts in schedulerIntegrationSuite

Test

[SPARK-24502] - flaky test: UnsafeRowSerializerSuite
[SPARK-24521] - Fix ineffective test in CachedTableSuite
[SPARK-24564] - Add test suite for RecordBinaryComparator

Documentation

[SPARK-24507] - Description in "Level of Parallelism in Data Receiving" section of Spark Streaming Programming Guide is not relevant for the recent Kafka direct approach
[SPARK-25273] - How to install testthat v1.0.2
