Information Entropy and Choice: Thoughts on the Monty Hall Problem

Today I happened to read an article about the Monty Hall problem (the "three doors problem") on a WeChat public account. It looks simple, but the more I thought about it the more interesting it became, so I am writing down my own thoughts. The Monty Hall problem comes from an American game show of the 1970s, and the basic setup is this: you are a contestant facing three doors; behind one of them is a sports car, and behind the other two there is nothing. After you pick a door (without opening it), the host opens one of the other doors, which is guaranteed to be empty. Now you have two options: stick with the door you originally picked, or switch to the other unopened door. Which choice maximizes your probability of winning the car?
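A quick Monte Carlo sketch (plain Python, no dependencies; the door numbering and trial count are arbitrary) makes the answer easy to check empirically: switching wins about twice as often as staying.

```python
import random

def play(switch: bool) -> bool:
    """Simulate one round of the Monty Hall game; return True if the car is won."""
    doors = [0, 1, 2]
    car = random.choice(doors)    # the door hiding the sports car
    pick = random.choice(doors)   # the contestant's initial pick
    # The host opens a door that is neither the pick nor the car (always empty).
    opened = random.choice([d for d in doors if d != pick and d != car])
    if switch:
        # Switch to the one remaining unopened door.
        pick = next(d for d in doors if d != pick and d != opened)
    return pick == car

trials = 100_000
stay_wins = sum(play(switch=False) for _ in range(trials))
switch_wins = sum(play(switch=True) for _ in range(trials))
print(f"stay:   {stay_wins / trials:.3f}")    # ~0.333
print(f"switch: {switch_wins / trials:.3f}")  # ~0.667
```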


Sequence Processing with Recurrent Networks Note (SLP Ch09)

Problems with the sliding-window approach in feedforward neural networks:

  • Like Markov models, it limits the context from which information can be extracted to the window itself
  • The window makes it difficult to learn systematic patterns arising from phenomena like constituency

RNNs are a class of networks designed to address these problems by processing sequences explicitly as sequences, allowing variable-length inputs to be handled without arbitrary fixed-size windows.
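As a concrete illustration (not from the chapter), here is a minimal Elman-style RNN forward pass in numpy; the dimensions and random weights are made up, and the point is only that the hidden state carries context across the whole sequence rather than a fixed window:

```python
import numpy as np

# Hypothetical dimensions: input vectors of size 4, hidden state of size 3.
d_in, d_h = 4, 3
rng = np.random.default_rng(0)
W = rng.normal(size=(d_h, d_in))   # input -> hidden weights
U = rng.normal(size=(d_h, d_h))    # previous hidden -> hidden weights
b = np.zeros(d_h)

def rnn_forward(xs):
    """Process a variable-length sequence of input vectors, returning all hidden states."""
    h = np.zeros(d_h)              # initial hidden state
    states = []
    for x in xs:                   # one step per token, no fixed-size window
        h = np.tanh(W @ x + U @ h + b)
        states.append(h)
    return states

sequence = [rng.normal(size=d_in) for _ in range(5)]  # length can vary freely
print(rnn_forward(sequence)[-1])
```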


Neural Networks and Neural Language Models Note (SLP Ch07)

Units

$$z = w \cdot x + b$$

$$y = \sigma(z) = \frac{1}{1+e^{-z}}$$

In practice, the sigmoid is not commonly used as an activation function. A better one is the tanh function, which ranges from -1 to 1: $$y = \frac{e^z - e^{-z}}{e^z + e^{-z}}$$

The most commonly used is the rectified linear unit, also called ReLU: $$y = \max(z, 0)$$

In the sigmoid or tanh functions, very high values of z result in values of y that are saturated, extremely close to 1, which causes problems for learning.

  • Rectifiers don’t have this problem, since for high values of z the output is just z itself, growing linearly rather than saturating.
  • By contrast, the tanh function has the nice properties of being smoothly differentiable and mapping outlier values toward the mean.
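To make the three activations concrete, here is a small numpy sketch of a single unit (the weight, bias, and input values are arbitrary illustrations):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    return (np.exp(z) - np.exp(-z)) / (np.exp(z) + np.exp(-z))

def relu(z):
    return np.maximum(z, 0.0)

# A single unit: weighted sum of the inputs plus a bias, then an activation.
w = np.array([0.2, 0.3, 0.9])
x = np.array([0.5, 0.6, 0.1])
b = 0.5
z = w @ x + b

print(sigmoid(z), tanh(z), relu(z))
# For large z, sigmoid and tanh saturate near 1 while relu(z) stays equal to z.
print(sigmoid(10.0), tanh(10.0), relu(10.0))
```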


Vector Semantics Note (SLP Ch06)

Words that occur in similar contexts tend to have similar meanings. This link between similarity in how words are distributed and similarity in what they mean is called the distributional hypothesis.

  • words which are synonyms tended to occur in the same environment
  • with the amount of meaning difference between two words “corresponding roughly to the amount of difference in their environments”

Vector semantics instantiates this linguistic hypothesis by learning representations of the meaning of words directly from their distributions in texts.
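A toy sketch of that idea: count word co-occurrences from a tiny made-up corpus and compare words by cosine similarity (a ±1-word window and raw counts are gross simplifications of the weighted models the chapter actually covers):

```python
import numpy as np
from collections import Counter, defaultdict

corpus = [
    "i like deep learning".split(),
    "i like nlp".split(),
    "i enjoy flying".split(),
]

# Count co-occurrences within a +/-1 word window.
cooc = defaultdict(Counter)
for sent in corpus:
    for i, w in enumerate(sent):
        for j in (i - 1, i + 1):
            if 0 <= j < len(sent):
                cooc[w][sent[j]] += 1

vocab = sorted({w for sent in corpus for w in sent})
M = np.array([[cooc[w][c] for c in vocab] for w in vocab], dtype=float)

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

wi, wj = vocab.index("deep"), vocab.index("nlp")
print(cosine(M[wi], M[wj]))  # > 0, since "deep" and "nlp" share the context "like"
```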


Logistic Regression Note (SLP Ch05)

In NLP, logistic regression is the baseline supervised machine learning algorithm for classification.

  • discriminative classifier: like logistic regression
    • only trying to learn to distinguish the classes.
    • directly compute $$P(c|d)$$
  • generative classifier: like naive Bayes
    • have the goal of understanding what each class looks like.
    • makes use of likelihood term $$P(d|c)P(c)$$

A machine learning system for classification has four components:

  • A feature representation of the input
  • A classification function that computes $$\hat y$$, the estimated class, via $$p(y|x)$$. Like sigmoid and softmax.
  • An objective function for learning, usually involving minimizing error on training examples. Like cross-entropy loss function.
  • An algorithm for optimizing the objective function. Like stochastic gradient descent.
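A compact numpy sketch mapping those four components onto binary logistic regression (the toy features, labels, and learning rate are all invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# 1. Feature representation: toy 2-d feature vectors, label 1 when the features sum to > 0.
X = rng.normal(size=(200, 2))
y = (X.sum(axis=1) > 0).astype(float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w, b = np.zeros(2), 0.0
lr = 0.1

for epoch in range(100):
    for x_i, y_i in zip(X, y):
        # 2. Classification function: P(y=1|x) = sigmoid(w . x + b)
        p = sigmoid(w @ x_i + b)
        # 3. Cross-entropy loss -[y log p + (1-y) log(1-p)] has gradient (p - y) x w.r.t. w.
        grad_w = (p - y_i) * x_i
        grad_b = p - y_i
        # 4. Optimization: one stochastic gradient descent step.
        w -= lr * grad_w
        b -= lr * grad_b

preds = (sigmoid(X @ w + b) > 0.5).astype(float)
print("training accuracy:", (preds == y).mean())
```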


Naive Bayes and Sentiment Classification Note (SLP Ch04)

Text categorization is the task of assigning a label or category to an entire text or document, e.g.:

  • sentiment analysis
  • spam detection
  • subject category or topic label

A probabilistic classifier additionally tells us the probability that the observation belongs to the class.

Generative classifiers like naive Bayes build a model of how a class could generate some input data.

Discriminative classifiers like logistic regression instead learn what features from the input are most useful to discriminate between the different possible classes.
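A minimal generative-classifier sketch in that spirit: bag-of-words naive Bayes with add-one smoothing over a few made-up sentiment snippets (words never seen in training are simply ignored at test time):

```python
import math
from collections import Counter, defaultdict

train = [
    ("just plain boring", "neg"),
    ("entirely predictable and lacks energy", "neg"),
    ("very powerful", "pos"),
    ("the most fun film of the summer", "pos"),
]

# Class priors and per-class word counts (a bag-of-words model of each class).
class_docs = Counter()
word_counts = defaultdict(Counter)
vocab = set()
for doc, c in train:
    class_docs[c] += 1
    for w in doc.split():
        word_counts[c][w] += 1
        vocab.add(w)

def predict(doc):
    """Return argmax_c of log P(c) + sum_w log P(w|c), with add-one smoothing."""
    scores = {}
    for c in class_docs:
        score = math.log(class_docs[c] / sum(class_docs.values()))
        total = sum(word_counts[c].values())
        for w in doc.split():
            if w in vocab:  # drop words never seen in training
                score += math.log((word_counts[c][w] + 1) / (total + len(vocab)))
        scores[c] = score
    return max(scores, key=scores.get)

print(predict("predictable with no fun"))  # -> "neg"
```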
