Packages

  • package spark

    Core Spark functionality. org.apache.spark.SparkContext serves as the main entry point to Spark, while org.apache.spark.rdd.RDD is the data type representing a distributed collection, and provides most parallel operations.

    In addition, org.apache.spark.rdd.PairRDDFunctions contains operations available only on RDDs of key-value pairs, such as groupByKey and join; org.apache.spark.rdd.DoubleRDDFunctions contains operations available only on RDDs of Doubles; and org.apache.spark.rdd.SequenceFileRDDFunctions contains operations available on RDDs that can be saved as SequenceFiles. These operations are automatically available on any RDD of the right type (e.g. RDD[(Int, Int)]) through implicit conversions; a brief sketch follows the package list below.

    Java programmers should reference the org.apache.spark.api.java package for Spark programming APIs in Java.

    Classes and methods marked with Experimental are user-facing features which have not been officially adopted by the Spark project. These are subject to change or removal in minor releases.

    Classes and methods marked with Developer API are intended for advanced users who want to extend Spark through lower level interfaces. These are subject to change or removal in minor releases.

    Definition Classes
    apache
  • package mllib

    RDD-based machine learning APIs (in maintenance mode).

    The spark.mllib package is in maintenance mode as of the Spark 2.0.0 release to encourage migration to the DataFrame-based APIs under the org.apache.spark.ml package. While in maintenance mode,

    • no new features in the RDD-based spark.mllib package will be accepted, unless they block implementing new features in the DataFrame-based spark.ml package;
    • bug fixes in the RDD-based APIs will still be accepted.

    The developers will continue adding more features to the DataFrame-based APIs in the 2.x series to reach feature parity with the RDD-based APIs. Once feature parity is reached, this package will be deprecated.

    Definition Classes
    spark
    See also

    SPARK-4591 to track the progress of feature parity

  • package tree

    This package contains the default implementation of the decision tree algorithm, which supports:

    • binary classification,
    • regression,
    • information loss calculation with entropy and Gini for classification and variance for regression,
    • both continuous and categorical features.
    Definition Classes
    mllib
  • package configuration
    Definition Classes
    tree
  • Algo
  • BoostingStrategy
  • FeatureType
  • QuantileStrategy
  • Strategy
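
The implicit conversions described for the spark package above can be seen in a minimal, self-contained sketch. The local-mode setup and sample data are illustrative only and not part of this API page:

    import org.apache.spark.{SparkConf, SparkContext}

    // Illustrative local-mode setup; the application name and master are placeholders.
    val conf = new SparkConf().setAppName("PairOpsSketch").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // An RDD[(Int, Int)]: pair-specific operations such as groupByKey and join
    // become available through the implicit conversion to PairRDDFunctions.
    val pairs = sc.parallelize(Seq((1, 2), (1, 3), (2, 4)))
    val grouped = pairs.groupByKey()   // RDD[(Int, Iterable[Int])]
    val joined = pairs.join(grouped)   // RDD[(Int, (Int, Iterable[Int]))]

    grouped.collect().foreach(println)
    sc.stop()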

class Strategy extends Serializable

Stores all the configuration options for tree construction.

Annotations
@Since( "1.0.0" )
Linear Supertypes
Serializable, Serializable, AnyRef, Any

Instance Constructors

  1. new Strategy(algo: Algo.Algo, impurity: Impurity, maxDepth: Int, numClasses: Int, maxBins: Int, categoricalFeaturesInfo: Map[Integer, Integer])

    Java-friendly constructor for org.apache.spark.mllib.tree.configuration.Strategy

    Annotations
    @Since( "1.1.0" )
  2. new Strategy(algo: Algo.Algo, impurity: Impurity, maxDepth: Int, numClasses: Int = 2, maxBins: Int = 32, quantileCalculationStrategy: QuantileStrategy.QuantileStrategy = Sort, categoricalFeaturesInfo: Map[Int, Int] = Map[Int, Int](), minInstancesPerNode: Int = 1, minInfoGain: Double = 0.0, maxMemoryInMB: Int = 256, subsamplingRate: Double = 1, useNodeIdCache: Boolean = false, checkpointInterval: Int = 10)

    algo

    Learning goal. Supported: org.apache.spark.mllib.tree.configuration.Algo.Classification, org.apache.spark.mllib.tree.configuration.Algo.Regression

    impurity

    Criterion used for information gain calculation. Supported for Classification: org.apache.spark.mllib.tree.impurity.Gini, org.apache.spark.mllib.tree.impurity.Entropy. Supported for Regression: org.apache.spark.mllib.tree.impurity.Variance.

    maxDepth

    Maximum depth of the tree (e.g. depth 0 means 1 leaf node, depth 1 means 1 internal node + 2 leaf nodes).

    numClasses

    Number of classes for classification. (Ignored for regression.) Default value is 2 (binary classification).

    maxBins

    Maximum number of bins used for discretizing continuous features and for choosing how to split on features at each node. More bins give higher granularity.

    quantileCalculationStrategy

    Algorithm for calculating quantiles. Supported: org.apache.spark.mllib.tree.configuration.QuantileStrategy.Sort

    categoricalFeaturesInfo

    A map storing information about the categorical variables and the number of discrete values they take. An entry (n to k) indicates that feature n is categorical with k categories indexed from 0: {0, 1, ..., k-1}.

    minInstancesPerNode

    Minimum number of instances each child must have after a split. Default value is 1. If a split causes the left or right child to have fewer than minInstancesPerNode instances, the split will not be considered valid.

    minInfoGain

    Minimum information gain a split must achieve to be considered. Default value is 0.0. If a split has less information gain than minInfoGain, it will not be considered a valid split.

    maxMemoryInMB

    Maximum memory in MB allocated to histogram aggregation. Default value is 256 MB. If too small, then 1 node will be split per iteration, and its aggregates may exceed this size.

    subsamplingRate

    Fraction of the training data used for learning the decision tree.

    useNodeIdCache

    If true, the algorithm will maintain a separate RDD of node IDs for each row instead of passing trees to executors.

    checkpointInterval

    How often to checkpoint when the node ID cache gets updated. E.g. 10 means the cache will be checkpointed every 10 updates. If the checkpoint directory is not set in org.apache.spark.SparkContext, this setting is ignored.

    Annotations
    @Since( "1.3.0" )

Value Members

  1. final def !=(arg0: Any): Boolean
    Definition Classes
    AnyRef → Any
  2. final def ##(): Int
    Definition Classes
    AnyRef → Any
  3. final def ==(arg0: Any): Boolean
    Definition Classes
    AnyRef → Any
  4. var algo: Algo.Algo
    Annotations
    @Since( "1.0.0" )
  5. final def asInstanceOf[T0]: T0
    Definition Classes
    Any
  6. var categoricalFeaturesInfo: Map[Int, Int]
    Annotations
    @Since( "1.0.0" )
  7. var checkpointInterval: Int
    Annotations
    @Since( "1.2.0" )
  8. def clone(): AnyRef
    Attributes
    protected[java.lang]
    Definition Classes
    AnyRef
    Annotations
    @native() @throws( ... )
  9. def copy: Strategy

    Returns a shallow copy of this instance.

    Annotations
    @Since( "1.2.0" )
  10. final def eq(arg0: AnyRef): Boolean
    Definition Classes
    AnyRef
  11. def equals(arg0: Any): Boolean
    Definition Classes
    AnyRef → Any
  12. def finalize(): Unit
    Attributes
    protected[java.lang]
    Definition Classes
    AnyRef
    Annotations
    @throws( classOf[java.lang.Throwable] )
  13. def getAlgo(): Algo.Algo
    Annotations
    @Since( "1.0.0" )
  14. def getCategoricalFeaturesInfo(): Map[Int, Int]
    Annotations
    @Since( "1.0.0" )
  15. def getCheckpointInterval(): Int
    Annotations
    @Since( "1.2.0" )
  16. final def getClass(): Class[_]
    Definition Classes
    AnyRef → Any
    Annotations
    @native()
  17. def getImpurity(): Impurity
    Annotations
    @Since( "1.0.0" )
  18. def getMaxBins(): Int
    Annotations
    @Since( "1.0.0" )
  19. def getMaxDepth(): Int
    Annotations
    @Since( "1.0.0" )
  20. def getMaxMemoryInMB(): Int
    Annotations
    @Since( "1.0.0" )
  21. def getMinInfoGain(): Double
    Annotations
    @Since( "1.2.0" )
  22. def getMinInstancesPerNode(): Int
    Annotations
    @Since( "1.2.0" )
  23. def getNumClasses(): Int
    Annotations
    @Since( "1.2.0" )
  24. def getQuantileCalculationStrategy(): QuantileStrategy.QuantileStrategy
    Annotations
    @Since( "1.0.0" )
  25. def getSubsamplingRate(): Double
    Annotations
    @Since( "1.2.0" )
  26. def getUseNodeIdCache(): Boolean
    Annotations
    @Since( "1.2.0" )
  27. def hashCode(): Int
    Definition Classes
    AnyRef → Any
    Annotations
    @native()
  28. var impurity: Impurity
    Annotations
    @Since( "1.0.0" )
  29. final def isInstanceOf[T0]: Boolean
    Definition Classes
    Any
  30. def isMulticlassClassification: Boolean

    Annotations
    @Since( "1.2.0" )
  31. def isMulticlassWithCategoricalFeatures: Boolean

    Annotations
    @Since( "1.2.0" )
  32. var maxBins: Int
    Annotations
    @Since( "1.0.0" )
  33. var maxDepth: Int
    Annotations
    @Since( "1.0.0" )
  34. var maxMemoryInMB: Int
    Annotations
    @Since( "1.0.0" )
  35. var minInfoGain: Double
    Annotations
    @Since( "1.2.0" )
  36. var minInstancesPerNode: Int
    Annotations
    @Since( "1.2.0" )
  37. final def ne(arg0: AnyRef): Boolean
    Definition Classes
    AnyRef
  38. final def notify(): Unit
    Definition Classes
    AnyRef
    Annotations
    @native()
  39. final def notifyAll(): Unit
    Definition Classes
    AnyRef
    Annotations
    @native()
  40. var numClasses: Int
    Annotations
    @Since( "1.2.0" )
  41. var quantileCalculationStrategy: QuantileStrategy.QuantileStrategy
    Annotations
    @Since( "1.0.0" )
  42. def setAlgo(algo: String): Unit

    Sets Algorithm using a String; a usage sketch follows this member list.

    Annotations
    @Since( "1.2.0" )
  43. def setAlgo(arg0: Algo.Algo): Unit
    Annotations
    @Since( "1.0.0" )
  44. def setCategoricalFeaturesInfo(categoricalFeaturesInfo: Map[Integer, Integer]): Unit

    Sets categoricalFeaturesInfo using a Java Map.

    Annotations
    @Since( "1.2.0" )
  45. def setCategoricalFeaturesInfo(arg0: Map[Int, Int]): Unit
    Annotations
    @Since( "1.0.0" )
  46. def setCheckpointInterval(arg0: Int): Unit
    Annotations
    @Since( "1.2.0" )
  47. def setImpurity(arg0: Impurity): Unit
    Annotations
    @Since( "1.0.0" )
  48. def setMaxBins(arg0: Int): Unit
    Annotations
    @Since( "1.0.0" )
  49. def setMaxDepth(arg0: Int): Unit
    Annotations
    @Since( "1.0.0" )
  50. def setMaxMemoryInMB(arg0: Int): Unit
    Annotations
    @Since( "1.0.0" )
  51. def setMinInfoGain(arg0: Double): Unit
    Annotations
    @Since( "1.2.0" )
  52. def setMinInstancesPerNode(arg0: Int): Unit
    Annotations
    @Since( "1.2.0" )
  53. def setNumClasses(arg0: Int): Unit
    Annotations
    @Since( "1.2.0" )
  54. def setQuantileCalculationStrategy(arg0: QuantileStrategy.QuantileStrategy): Unit
    Annotations
    @Since( "1.0.0" )
  55. def setSubsamplingRate(arg0: Double): Unit
    Annotations
    @Since( "1.2.0" )
  56. def setUseNodeIdCache(arg0: Boolean): Unit
    Annotations
    @Since( "1.2.0" )
  57. var subsamplingRate: Double
    Annotations
    @Since( "1.2.0" )
  58. final def synchronized[T0](arg0: ⇒ T0): T0
    Definition Classes
    AnyRef
  59. def toString(): String
    Definition Classes
    AnyRef → Any
  60. var useNodeIdCache: Boolean
    Annotations
    @Since( "1.2.0" )
  61. final def wait(): Unit
    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  62. final def wait(arg0: Long, arg1: Int): Unit
    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  63. final def wait(arg0: Long): Unit
    Definition Classes
    AnyRef
    Annotations
    @native() @throws( ... )
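
A short sketch of the setter-based configuration referenced in setAlgo above (the concrete parameter values are illustrative):

    import org.apache.spark.mllib.tree.configuration.{Algo, Strategy}
    import org.apache.spark.mllib.tree.impurity.{Entropy, Gini}

    val strategy = new Strategy(Algo.Classification, Gini, maxDepth = 4)

    // Fields can also be adjusted after construction through the setters listed above;
    // setAlgo(String) accepts the algorithm name as a String ("Classification" here).
    strategy.setAlgo("Classification")
    strategy.setImpurity(Entropy)
    strategy.setNumClasses(3)
    strategy.setMaxBins(64)
    strategy.setCategoricalFeaturesInfo(Map(0 -> 3))

    // copy returns a shallow copy, so later edits to strategy do not affect frozen.
    val frozen = strategy.copy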
