Spark 2.0.0+:
UserDefinedType
has been made private in Spark 2.0.0 and as for now it has no Dataset
friendly replacement.
See: SPARK-14155 (Hide UserDefinedType in Spark 2.0)
Most of the time statically typed Dataset
can serve as replacement
There is a pending Jira SPARK-7768 to make UDT API public again with target version 2.4.
See also How to store custom objects in Dataset?
Spark < 2.0.0
Is there a possibility to add or define a schema for certain types (here type Some)?
I guess the answer depends on how badly you need this. It looks like it is possible to create an UserDefinedType
but it requires access to DeveloperApi
and is not exactly straightforward or well documented.
import org.apache.spark.sql.types._
@SQLUserDefinedType(udt = classOf[SomeUDT])
sealed trait Some
case object AType extends Some
case object BType extends Some
class SomeUDT extends UserDefinedType[Some] {
override def sqlType: DataType = IntegerType
override def serialize(obj: Any) = {
obj match {
case AType => 0
case BType => 1
}
}
override def deserialize(datum: Any): Some = {
datum match {
case 0 => AType
case 1 => BType
}
}
override def userClass: Class[Some] = classOf[Some]
}
You should probably override hashCode
and equals
as well.
Its PySpark counterpart can look like this:
from enum import Enum, unique
from pyspark.sql.types import UserDefinedType, IntegerType
class SomeUDT(UserDefinedType):
@classmethod
def sqlType(self):
return IntegerType()
@classmethod
def module(cls):
return cls.__module__
@classmethod
def scalaUDT(cls): # Required in Spark < 1.5
return 'net.zero323.enum.SomeUDT'
def serialize(self, obj):
return obj.value
def deserialize(self, datum):
return {x.value: x for x in Some}[datum]
@unique
class Some(Enum):
__UDT__ = SomeUDT()
AType = 0
BType = 1
In Spark < 1.5 Python UDT requires a paired Scala UDT, but it look like it is no longer the case in 1.5.
For a simple UDT like you can use simple types (for example IntegerType
instead of whole Struct
).
与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…