weka.classifiers.functions

Class SGDText

  • All Implemented Interfaces:
    Serializable, Cloneable, Classifier, UpdateableClassifier, Aggregateable<SGDText>, CapabilitiesHandler, OptionHandler, Randomizable, RevisionHandler, WeightedInstancesHandler


    public class SGDText extends RandomizableClassifier implements UpdateableClassifier, WeightedInstancesHandler, Aggregateable<SGDText>
    Implements stochastic gradient descent for learning a linear binary class SVM or binary class logistic regression on text data. Operates directly (and only) on String attributes. Other types of input attributes are accepted but ignored during training and classification.

    Valid options are:

     -F  Set the loss function to minimize. 0 = hinge loss (SVM), 1 = log loss (logistic regression)  (default = 0) 
     -outputProbs  Output probabilities for SVMs (fits a logistic model to the output of the SVM) 
     -L  The learning rate (default = 0.01). 
     -R <double>  The lambda regularization constant (default = 0.0001) 
     -E <integer>  The number of epochs to perform (batch learning only, default = 500) 
     -W  Use word frequencies instead of binary bag of words. 
     -P <# instances>  How often to prune the dictionary of low frequency words (default = 0, i.e. don't prune) 
     -M <double>  Minimum word frequency. Words with less than this frequency are ignored.  If periodic pruning is turned on, this is also used to determine which words to remove from the dictionary (default = 3). 
     -normalize  Normalize document length (use in conjunction with -norm and -lnorm) 
     -norm <num>  Specify the norm that each instance must have (default 1.0) 
     -lnorm <num>  Specify L-norm to use (default 2.0) 
     -lowercase  Convert all tokens to lowercase before adding to the dictionary. 
     -stoplist  Ignore words that are in the stoplist. 
     -stopwords <file>  A file containing stopwords to override the default ones.  Using this option automatically sets the flag ('-stoplist') to use the  stoplist if the file exists.  Format: one stopword per line, lines starting with '#'  are interpreted as comments and ignored. 
     -tokenizer <spec>  The tokenizing algorithm (classname plus parameters) to use.  (default: weka.core.tokenizers.WordTokenizer) 
     -stemmer <spec>  The stemming algorithm (classname plus parameters) to use. 
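To make the -F, -L and -R options more concrete, here is a minimal sketch of a single stochastic gradient step for a linear model over a sparse bag of words. It is an illustration under stated assumptions, not the SGDText source: the class `SgdStepSketch`, its method names, the {-1, +1} label encoding and the way the regularization shrinkage is applied are all hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of one SGD step on a sparse document; not the Weka code. */
class SgdStepSketch {
    static final int HINGE = 0;    // mirrors -F 0 (SVM)
    static final int LOGLOSS = 1;  // mirrors -F 1 (logistic regression)

    /** Bias plus the dot product of the sparse document with the weights. */
    static double margin(Map<String, Double> weights, double bias, Map<String, Double> doc) {
        double sum = bias;
        for (Map.Entry<String, Double> e : doc.entrySet()) {
            sum += weights.getOrDefault(e.getKey(), 0.0) * e.getValue();
        }
        return sum;
    }

    /** Negative derivative of the loss with respect to z = y * margin. */
    static double dloss(int loss, double z) {
        if (loss == HINGE) {
            return z < 1.0 ? 1.0 : 0.0;       // max(0, 1 - z)
        }
        return 1.0 / (1.0 + Math.exp(z));     // log(1 + exp(-z))
    }

    /** Updates the weights in place for a label y in {-1, +1}; returns the new bias. */
    static double step(Map<String, Double> weights, double bias, Map<String, Double> doc,
                       double y, int loss, double learningRate, double lambda) {
        double z = y * margin(weights, bias, doc);
        // Assumed form of L2 regularization: shrink every weight a little.
        double shrink = 1.0 - learningRate * lambda;
        weights.replaceAll((word, value) -> value * shrink);
        // Move the weights of the words in this document along the gradient.
        double factor = learningRate * y * dloss(loss, z);
        for (Map.Entry<String, Double> e : doc.entrySet()) {
            weights.merge(e.getKey(), factor * e.getValue(), Double::sum);
        }
        return bias + factor;
    }

    public static void main(String[] args) {
        Map<String, Double> weights = new HashMap<>();
        Map<String, Double> doc = new HashMap<>();
        doc.put("good", 1.0);
        double bias = step(weights, 0.0, doc, 1.0, HINGE, 0.01, 1.0e-4);
        System.out.println("weights=" + weights + " bias=" + bias);
    }
}
```

The defaults quoted in the option list (learning rate 0.01, lambda 0.0001) are used in `main` only to keep the example close to the documented settings.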
    Author:
    Mark Hall (mhall{[at]}pentaho{[dot]}com), Eibe Frank (eibe{[at]}cs{[dot]}waikato{[dot]}ac{[dot]}nz)
    See Also:
    Serialized Form
    • Field Detail

      • TAGS_SELECTION

        public static final Tag[] TAGS_SELECTION
        Loss functions to choose from
    • Constructor Detail

      • SGDText

        public SGDText()
    • Method Detail

      • setStemmer

        public void setStemmer(Stemmer value)
        Sets the stemming algorithm to use; null means no stemming at all (i.e., the NullStemmer is used).
        Parameters:
        value - the configured stemming algorithm, or null
        See Also:
        NullStemmer
      • getStemmer

        public Stemmer getStemmer()
        Returns the current stemming algorithm, null if none is used.
        Returns:
        the current stemming algorithm, null if none set
      • stemmerTipText

        public String stemmerTipText()
        Returns the tip text for this property.
        Returns:
        tip text for this property suitable for displaying in the explorer/experimenter gui
      • setTokenizer

        public void setTokenizer(Tokenizer value)
        Sets the tokenizer algorithm to use.
        Parameters:
        value - the configured tokenizing algorithm
      • getTokenizer

        public Tokenizer getTokenizer()
        Returns the current tokenizer algorithm.
        Returns:
        the current tokenizer algorithm
      • tokenizerTipText

        public String tokenizerTipText()
        Returns the tip text for this property.
        Returns:
        tip text for this property suitable for displaying in the explorer/experimenter gui
      • useWordFrequenciesTipText

        public String useWordFrequenciesTipText()
        Returns the tip text for this property
        Returns:
        tip text for this property suitable for displaying in the explorer/experimenter gui
      • setUseWordFrequencies

        public void setUseWordFrequencies(boolean u)
        Set whether to use word frequencies rather than binary bag of words representation.
        Parameters:
        u - true if word frequencies are to be used.
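The difference between the two representations can be sketched as follows. This is an illustrative, assumption-laden example: `BagOfWordsSketch` and `vectorize` are hypothetical names, and the tokens are taken as already produced by a tokenizer.

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch contrasting binary bag of words with word frequencies. */
class BagOfWordsSketch {
    /** Maps each distinct token to 1.0 (binary) or to its count (frequencies). */
    static Map<String, Double> vectorize(String[] tokens, boolean useWordFrequencies) {
        Map<String, Double> vector = new TreeMap<>();
        for (String token : tokens) {
            if (useWordFrequencies) {
                vector.merge(token, 1.0, Double::sum);  // count every occurrence
            } else {
                vector.put(token, 1.0);                 // presence only
            }
        }
        return vector;
    }

    public static void main(String[] args) {
        String[] tokens = {"spam", "spam", "ham"};
        System.out.println("binary:      " + vectorize(tokens, false));
        System.out.println("frequencies: " + vectorize(tokens, true));
    }
}
```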
      • getUseWordFrequencies

        public boolean getUseWordFrequencies()
        Get whether to use word frequencies rather than binary bag of words representation.
        Returns:
        true if word frequencies are to be used.
      • lowercaseTokensTipText

        public String lowercaseTokensTipText()
        Returns the tip text for this property
        Returns:
        tip text for this property suitable for displaying in the explorer/experimenter gui
      • setLowercaseTokens

        public void setLowercaseTokens(boolean l)
        Set whether to convert all tokens to lowercase
        Parameters:
        l - true if all tokens are to be converted to lowercase
      • getLowercaseTokens

        public boolean getLowercaseTokens()
        Get whether to convert all tokens to lowercase
        Returns:
        true if all tokens are to be converted to lowercase
      • useStopListTipText

        public String useStopListTipText()
        Returns the tip text for this property
        Returns:
        tip text for this property suitable for displaying in the explorer/experimenter gui
      • setUseStopList

        public void setUseStopList(boolean u)
        Set whether to ignore all words that are on the stoplist.
        Parameters:
        u - true to ignore all words on the stoplist.
      • getUseStopList

        public boolean getUseStopList()
        Get whether to ignore all words that are on the stoplist.
        Returns:
        true to ignore all words on the stoplist.
      • setStopwords

        public void setStopwords(File value)
        Sets the file containing the stopwords. Passing null or a directory unsets the stopwords. If the file exists, the flag to use the stoplist is turned on automatically.
        Parameters:
        value - the file containing the stopwords
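The stopword file format given for the -stopwords option (one stopword per line, lines starting with '#' treated as comments) can be read with a few lines of code. The sketch below is hypothetical and not taken from Weka: `StopwordFileSketch` and `parse` are invented names, and trimming whitespace and skipping blank lines are assumptions that the option text does not state.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical parser for the documented stopword file format. */
class StopwordFileSketch {
    /** Collects one stopword per line, ignoring '#' comment lines. */
    static Set<String> parse(List<String> lines) {
        Set<String> stopwords = new LinkedHashSet<>();
        for (String line : lines) {
            String word = line.trim();                  // assumption: surrounding whitespace is ignored
            if (word.isEmpty() || word.startsWith("#")) {
                continue;                               // blank line (assumed) or comment
            }
            stopwords.add(word);
        }
        return stopwords;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("# common English words", "the", "and", "", "of");
        System.out.println(parse(lines));
    }
}
```

For a real file, the lines could be obtained with `java.nio.file.Files.readAllLines` before calling `parse`.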
      • getStopwords

        public File getStopwords()
        Returns the file used for obtaining the stopwords. If the file represents a directory, the default stopwords are used.
        Returns:
        the file containing the stopwords
      • stopwordsTipText

        public String stopwordsTipText()
        Returns the tip text for this property.
        Returns:
        tip text for this property suitable for displaying in the explorer/experimenter gui
      • periodicPruningTipText

        public String periodicPruningTipText()
        Returns the tip text for this property
        Returns:
        tip text for this property suitable for displaying in the explorer/experimenter gui
      • setPeriodicPruning

        public void setPeriodicPruning(int p)
        Set how often to prune the dictionary
        Parameters:
        p - how often to prune
      • getPeriodicPruning

        public int getPeriodicPruning()
        Get how often to prune the dictionary
        Returns:
        how often to prune the dictionary
      • minWordFrequencyTipText

        public String minWordFrequencyTipText()
        Returns the tip text for this property
        Returns:
        tip text for this property suitable for displaying in the explorer/experimenter gui
      • setMinWordFrequency

        public void setMinWordFrequency(double minFreq)
        Set the minimum word frequency. Words that don't occur at least min freq times are ignored when updating weights. If periodic pruning is turned on, then min frequency is used when removing words from the dictionary.
        Parameters:
        minFreq - the minimum word frequency to use
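How the minimum word frequency and periodic pruning interact can be sketched roughly as below. This is an assumption-based illustration and not the SGDText implementation: `DictionaryPruningSketch` and all of its members are hypothetical, and counting one per token occurrence is a simplification.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical dictionary with a minimum frequency and periodic pruning. */
class DictionaryPruningSketch {
    private final Map<String, Double> counts = new HashMap<>();
    private final int pruneEvery;   // mirrors -P; 0 means never prune
    private final double minFreq;   // mirrors -M
    private int seen = 0;

    DictionaryPruningSketch(int pruneEvery, double minFreq) {
        this.pruneEvery = pruneEvery;
        this.minFreq = minFreq;
    }

    /** Counts the tokens of one document and prunes every pruneEvery documents. */
    void addDocument(String[] tokens) {
        for (String token : tokens) {
            counts.merge(token, 1.0, Double::sum);
        }
        seen++;
        if (pruneEvery > 0 && seen % pruneEvery == 0) {
            counts.values().removeIf(count -> count < minFreq);
        }
    }

    /** Words below the threshold stay in the dictionary but are ignored. */
    boolean isUsable(String word) {
        return counts.getOrDefault(word, 0.0) >= minFreq;
    }

    /** Number of entries currently stored, including rare words not yet pruned. */
    int rawSize() {
        return counts.size();
    }

    /** Number of entries at or above the threshold. */
    int usableSize() {
        int n = 0;
        for (double count : counts.values()) {
            if (count >= minFreq) {
                n++;
            }
        }
        return n;
    }

    public static void main(String[] args) {
        DictionaryPruningSketch dict = new DictionaryPruningSketch(2, 2.0);
        dict.addDocument(new String[] {"a", "b"});
        dict.addDocument(new String[] {"a", "c"});
        System.out.println("entries after pruning: " + dict.rawSize());
    }
}
```

With pruning disabled, rare words remain stored and are merely skipped, which is why the sketch distinguishes `rawSize` from `usableSize`.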
      • getMinWordFrequency

        public double getMinWordFrequency()
        Get the minimum word frequency. Words that don't occur at least min freq times are ignored when updating weights. If periodic pruning is turned on, then min frequency is used when removing words from the dictionary.
        Returns:
        the minimum word frequency to use
      • normalizeDocLengthTipText

        public String normalizeDocLengthTipText()
        Returns the tip text for this property
        Returns:
        tip text for this property suitable for displaying in the explorer/experimenter gui
      • setNormalizeDocLength

        public void setNormalizeDocLength(boolean norm)
        Set whether to normalize the length of each document
        Parameters:
        norm - true if document lengths are to be normalized
      • getNormalizeDocLength

        public boolean getNormalizeDocLength()
        Get whether to normalize the length of each document
        Returns:
        true if document lengths are to be normalized
      • normTipText

        public String normTipText()
        Returns the tip text for this property
        Returns:
        tip text for this property suitable for displaying in the explorer/experimenter gui
      • getNorm

        public double getNorm()
        Get the instance's Norm.
        Returns:
        the Norm
      • setNorm

        public void setNorm(double newNorm)
        Set the norm of the instances
        Parameters:
        newNorm - the norm to which the instances must be set
      • LNormTipText

        public String LNormTipText()
        Returns the tip text for this property
        Returns:
        tip text for this property suitable for displaying in the explorer/experimenter gui
      • getLNorm

        public double getLNorm()
        Get the L Norm used.
        Returns:
        the L-norm used
      • setLNorm

        public void setLNorm(double newLNorm)
        Set the L-norm to use
        Parameters:
        newLNorm - the L-norm
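Document length normalization with a target norm and an L-norm can be illustrated as follows. The sketch assumes the usual definition of the L-p norm, (sum of |x_i|^p)^(1/p), and rescales a document vector so that this norm equals the target. `DocLengthNormSketch` and its methods are hypothetical names, and how SGDText applies the scaling internally is not confirmed by this page.

```java
import java.util.Arrays;

/** Hypothetical sketch of scaling a document vector to a target L-p norm. */
class DocLengthNormSketch {
    /** Computes (sum |x_i|^p)^(1/p). */
    static double lNorm(double[] vector, double p) {
        double sum = 0.0;
        for (double x : vector) {
            sum += Math.pow(Math.abs(x), p);
        }
        return Math.pow(sum, 1.0 / p);
    }

    /** Returns a scaled copy whose L-p norm equals targetNorm. */
    static double[] normalize(double[] vector, double targetNorm, double p) {
        double[] out = vector.clone();
        double current = lNorm(vector, p);
        if (current == 0.0) {
            return out;                    // an empty document cannot be rescaled
        }
        double scale = targetNorm / current;
        for (int i = 0; i < out.length; i++) {
            out[i] *= scale;
        }
        return out;
    }

    public static void main(String[] args) {
        // Defaults from the option list: norm 1.0, L-norm 2.0.
        double[] scaled = normalize(new double[] {3.0, 4.0}, 1.0, 2.0);
        System.out.println(Arrays.toString(scaled));
    }
}
```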
      • lambdaTipText

        public String lambdaTipText()
        Returns the tip text for this property
        Returns:
        tip text for this property suitable for displaying in the explorer/experimenter gui
      • setLambda

        public void setLambda(double lambda)
        Set the value of lambda to use
        Parameters:
        lambda - the value of lambda to use
      • getLambda

        public double getLambda()
        Get the current value of lambda
        Returns:
        the current value of lambda
      • setLearningRate

        public void setLearningRate(double lr)
        Set the learning rate.
        Parameters:
        lr - the learning rate to use.
      • getLearningRate

        public double getLearningRate()
        Get the learning rate.
        Returns:
        the learning rate
      • learningRateTipText

        public String learningRateTipText()
        Returns the tip text for this property
        Returns:
        tip text for this property suitable for displaying in the explorer/experimenter gui
      • epochsTipText

        public String epochsTipText()
        Returns the tip text for this property
        Returns:
        tip text for this property suitable for displaying in the explorer/experimenter gui
      • setEpochs

        public void setEpochs(int e)
        Set the number of epochs to use
        Parameters:
        e - the number of epochs to use
      • getEpochs

        public int getEpochs()
        Get current number of epochs
        Returns:
        the current number of epochs
      • setLossFunction

        public void setLossFunction(SelectedTag function)
        Set the loss function to use.
        Parameters:
        function - the loss function to use.
      • getLossFunction

        public SelectedTag getLossFunction()
        Get the current loss function.
        Returns:
        the current loss function.
      • lossFunctionTipText

        public String lossFunctionTipText()
        Returns the tip text for this property
        Returns:
        tip text for this property suitable for displaying in the explorer/experimenter gui
      • setOutputProbsForSVM

        public void setOutputProbsForSVM(boolean o)
        Set whether to fit a logistic regression (itself trained using SGD) to the outputs of the SVM (if an SVM is being learned).
        Parameters:
        o - true if a logistic regression is to be fit to the output of the SVM to produce probability estimates.
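The idea of fitting a logistic model to SVM outputs, itself trained with SGD, can be sketched with a one-dimensional logistic regression. Everything in this example is assumed for illustration: the class `SvmProbabilitySketch`, the parameters `a` and `b`, the learning rate, the number of epochs and the toy data are hypothetical and do not come from SGDText.

```java
/** Hypothetical sketch: map raw SVM outputs to probabilities via a 1-D logistic model. */
class SvmProbabilitySketch {
    private double a = 0.0;  // slope applied to the SVM output
    private double b = 0.0;  // intercept

    static double sigmoid(double z) {
        return 1.0 / (1.0 + Math.exp(-z));
    }

    /** Fits a and b by SGD on log loss; labels are 0 or 1. */
    void fit(double[] svmOutputs, int[] labels, double learningRate, int epochs) {
        for (int epoch = 0; epoch < epochs; epoch++) {
            for (int i = 0; i < svmOutputs.length; i++) {
                double error = labels[i] - sigmoid(a * svmOutputs[i] + b);
                a += learningRate * error * svmOutputs[i];
                b += learningRate * error;
            }
        }
    }

    /** Estimated probability of the positive class for a raw SVM output. */
    double probability(double svmOutput) {
        return sigmoid(a * svmOutput + b);
    }

    public static void main(String[] args) {
        SvmProbabilitySketch model = new SvmProbabilitySketch();
        model.fit(new double[] {-2.0, -1.0, 1.0, 2.0}, new int[] {0, 0, 1, 1}, 0.1, 100);
        System.out.println("P(positive | output = 2.0)  = " + model.probability(2.0));
        System.out.println("P(positive | output = -2.0) = " + model.probability(-2.0));
    }
}
```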
      • getOutputProbsForSVM

        public boolean getOutputProbsForSVM()
        Get whether to fit a logistic regression (itself trained using SGD) to the outputs of the SVM (if an SVM is being learned).
        Returns:
        true if a logistic regression is to be fit to the output of the SVM to produce probability estimates.
      • outputProbsForSVMTipText

        public String outputProbsForSVMTipText()
        Returns the tip text for this property
        Returns:
        tip text for this property suitable for displaying in the explorer/experimenter gui
      • setOptions

        public void setOptions(String[] options) throws Exception
        Parses a given list of options.

        Valid options are:

         -F  Set the loss function to minimize. 0 = hinge loss (SVM), 1 = log loss (logistic regression)  (default = 0) 
         -outputProbs  Output probabilities for SVMs (fits a logistic model to the output of the SVM) 
         -L  The learning rate (default = 0.01). 
         -R <double>  The lambda regularization constant (default = 0.0001) 
         -E <integer>  The number of epochs to perform (batch learning only, default = 500) 
         -W  Use word frequencies instead of binary bag of words. 
         -P <# instances>  How often to prune the dictionary of low frequency words (default = 0, i.e. don't prune) 
         -M <double>  Minimum word frequency. Words with less than this frequency are ignored.  If periodic pruning is turned on, this is also used to determine which words to remove from the dictionary (default = 3). 
         -normalize  Normalize document length (use in conjunction with -norm and -lnorm) 
         -norm <num>  Specify the norm that each instance must have (default 1.0) 
         -lnorm <num>  Specify L-norm to use (default 2.0) 
         -lowercase  Convert all tokens to lowercase before adding to the dictionary. 
         -stoplist  Ignore words that are in the stoplist. 
         -stopwords <file>  A file containing stopwords to override the default ones.  Using this option automatically sets the flag ('-stoplist') to use the  stoplist if the file exists.  Format: one stopword per line, lines starting with '#'  are interpreted as comments and ignored. 
         -tokenizer <spec>  The tokenizing algorithm (classname plus parameters) to use.  (default: weka.core.tokenizers.WordTokenizer) 
         -stemmer <spec>  The stemming algorithm (classname plus parameters) to use. 
        Specified by:
        setOptions in interface OptionHandler
        Overrides:
        setOptions in class RandomizableClassifier
        Parameters:
        options - the list of options as an array of strings
        Throws:
        Exception - if an option is not supported
      • globalInfo

        public String globalInfo()
        Returns a string describing the classifier.
        Returns:
        a description suitable for displaying in the explorer/experimenter gui
      • reset

        public void reset()
        Reset the classifier.
      • buildClassifier

        public void buildClassifier(Instances data) throws Exception
        Method for building the classifier.
        Specified by:
        buildClassifier in interface Classifier
        Parameters:
        data - the set of training instances.
        Throws:
        Exception - if the classifier can't be built successfully.
      • updateClassifier

        public void updateClassifier(Instance instance) throws Exception
        Updates the classifier with the given instance.
        Specified by:
        updateClassifier in interface UpdateableClassifier
        Parameters:
        instance - the new training instance to include in the model
        Throws:
        Exception - if the instance could not be incorporated in the model.
      • distributionForInstance

        public double[] distributionForInstance(Instance inst) throws Exception
        Description copied from class: AbstractClassifier
        Predicts the class memberships for a given instance. If an instance is unclassified, the returned array elements must be all zero. If the class is numeric, the array must consist of only one element, which contains the predicted value. Note that a classifier MUST implement either this or classifyInstance().
        Specified by:
        distributionForInstance in interface Classifier
        Overrides:
        distributionForInstance in class AbstractClassifier
        Parameters:
        inst - the instance to be classified
        Returns:
        an array containing the estimated membership probabilities of the test instance in each class or the numeric prediction
        Throws:
        Exception - if distribution could not be computed successfully
      • getDictionarySize

        public int getDictionarySize()
        Return the size of the dictionary (minus any low frequency terms that are below the threshold but haven't been pruned yet).
        Returns:
        the size of the dictionary.
      • bias

        public double bias()
      • setBias

        public void setBias(double bias)
      • aggregate

        public SGDText aggregate(SGDText toAggregate) throws Exception
        Aggregate an object with this one
        Specified by:
        aggregate in interface Aggregateable<SGDText>
        Parameters:
        toAggregate - the object to aggregate
        Returns:
        the result of aggregation
        Throws:
        Exception - if the supplied object can't be aggregated for some reason
      • finalizeAggregation

        public void finalizeAggregation() throws Exception
        Call to complete the aggregation process. Allows implementers to do any final processing based on how many objects were aggregated.
        Specified by:
        finalizeAggregation in interface Aggregateable<SGDText>
        Throws:
        Exception - if the aggregation can't be finalized for some reason
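The two-step aggregate/finalize protocol can be sketched as accumulate first, then post-process using the number of aggregated objects. The example below assumes that the final processing is a plain average of weights and bias, which this page does not confirm for SGDText. `AggregationSketch` and its fields are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of aggregating linear models by averaging; an assumption, not Weka code. */
class AggregationSketch {
    final Map<String, Double> weights = new HashMap<>();
    double bias = 0.0;
    private int numModels = 1;  // this model counts as the first one

    /** Adds the other model's weights and bias to this one and counts it. */
    AggregationSketch aggregate(AggregationSketch other) {
        other.weights.forEach((word, value) -> weights.merge(word, value, Double::sum));
        bias += other.bias;
        numModels++;
        return this;
    }

    /** Turns the accumulated sums into averages over all aggregated models. */
    void finalizeAggregation() {
        if (numModels == 1) {
            return;                         // nothing was aggregated
        }
        weights.replaceAll((word, value) -> value / numModels);
        bias /= numModels;
        numModels = 1;
    }

    public static void main(String[] args) {
        AggregationSketch first = new AggregationSketch();
        first.weights.put("x", 1.0);
        first.bias = 1.0;
        AggregationSketch second = new AggregationSketch();
        second.weights.put("x", 3.0);
        second.weights.put("y", 2.0);
        second.bias = 3.0;
        first.aggregate(second).finalizeAggregation();
        System.out.println("weights=" + first.weights + " bias=" + first.bias);
    }
}
```

Separating the two calls lets any number of models be combined in a chain before the division happens once at the end.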
      • main

        public static void main(String[] args)
        Main method for testing this class.

Copyright © 2013 University of Waikato, Hamilton, NZ. All Rights Reserved.


