Problems with the sliding-window approach in standard feedforward NNs:
- Like Markov models, it limits the context from which information can be extracted (to the window area)
- The window makes it difficult to learn systematic patterns arising from phenomena like constituency
RNNs are a class of networks designed to address these problems by processing sequences explicitly as sequences, allowing us to handle variable-length inputs without arbitrary fixed-size windows.
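A minimal sketch of that idea (toy dimensions and random weights, an Elman-style RNN cell in numpy; not any particular library's API): the same weights are reused at every time step, so the input can be any length and the hidden state carries context beyond a fixed window.

    import numpy as np

    d_in, d_h = 4, 3                     # toy input and hidden sizes (made up)
    rng = np.random.default_rng(0)
    W = rng.normal(size=(d_h, d_in))     # input -> hidden weights
    U = rng.normal(size=(d_h, d_h))      # hidden -> hidden (recurrent) weights
    b = np.zeros(d_h)

    def rnn(xs):
        """Run an Elman RNN over a variable-length sequence of vectors."""
        h = np.zeros(d_h)                # initial hidden state
        for x in xs:                     # one step per token: no fixed window
            h = np.tanh(W @ x + U @ h + b)
        return h                         # final state summarizes the whole sequence

    seq = [rng.normal(size=d_in) for _ in range(7)]  # any length works
    print(rnn(seq))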
Parts-of-speech (also known as POS, word classes, or syntactic categories) are useful because they reveal a lot about a word and its neighbors. Useful for:
- labeling named entities
- speech recognition or synthesis
I spent three hours reading this book during a trip to Wuyishan. Although the book is mostly about ideas, and the boundaries between some of its viewpoints are blurry or even conflicting, it suited my taste well. I have always wanted to work the way the book's viewpoints describe, so I organized these notes for regular reference.
In practice, the sigmoid is not commonly used as an activation function. A better one is the tanh function, which ranges from -1 to 1: y = (e^z - e^-z) / (e^z + e^-z)
The most commonly used is the rectified linear unit, also called ReLU: y = max(z, 0) (all three are sketched in code below)
In the sigmoid or tanh functions, very high values of z result in values of y that are saturated, i.e., extremely close to 1, with derivatives near zero, which causes problems for learning.
- Rectifiers don't have this problem: for positive z the output is just z, so large values don't saturate and the gradient stays at 1.
- By contrast, the tanh function has the nice properties of being smoothly differentiable and mapping outlier values toward the mean.
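A quick numpy sketch of all three activations and their saturation behavior (toy input values, nothing textbook-specific):

    import numpy as np

    def sigmoid(z): return 1 / (1 + np.exp(-z))
    def tanh(z):    return np.tanh(z)
    def relu(z):    return np.maximum(z, 0)

    z = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
    print(sigmoid(z))  # saturates: ~0 at z=-10, ~1 at z=+10
    print(tanh(z))     # saturates at -1 and +1
    print(relu(z))     # stays linear for positive z: no saturation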
Words that occur in similar contexts tend to have similar meanings. This link between similarity in how words are distributed and similarity in what they mean is called the distributional hypothesis.
- words which are synonyms tended to occur in the same environment
- with the amount of meaning difference between two words “corresponding roughly to the amount of difference in their environments”
Vector semantics instantiates this linguistic hypothesis by learning representations of word meaning directly from their distributions in texts (a toy illustration follows).
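A toy illustration of the hypothesis (hypothetical two-sentence corpus, counts within a +/-1-word window; a sketch of the counting idea, not a full vector-semantics model): words that appear with similar neighbors end up with similar count vectors.

    from collections import Counter, defaultdict

    corpus = [
        "i drink strong coffee every morning".split(),
        "i drink strong tea every morning".split(),
    ]

    window = 1  # look one word to each side
    cooc = defaultdict(Counter)
    for sent in corpus:
        for i, w in enumerate(sent):
            for j in range(max(0, i - window), min(len(sent), i + window + 1)):
                if j != i:
                    cooc[w][sent[j]] += 1

    # "coffee" and "tea" share the contexts {strong, every}, so their
    # count vectors are similar -- the distributional hypothesis in miniature.
    print(cooc["coffee"], cooc["tea"])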
In NLP, logistic regression is the baseline supervised machine learning algorithm for classification.
- discriminative classifier: like logistic regression
  - only tries to learn to distinguish the classes
  - directly computes P(c|d)
- generative classifier: like naive Bayes
  - has the goal of understanding what each class looks like
  - makes use of the likelihood term P(d|c), together with a prior P(c) (see the decomposition below)
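In symbols, with class c and document d: the discriminative route models P(c|d) directly, while the generative route applies Bayes' rule and drops the denominator P(d), which is constant across classes:

ĉ = argmax_{c ∈ C} P(c|d) = argmax_{c ∈ C} P(d|c) P(c)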
A machine learning system for classification has four components (all four appear in the sketch after this list):
- A feature representation of the input
- A classification function that computes ŷ, the estimated class, via p(y|x). Like sigmoid and softmax.
- An objective function for learning, usually involving minimizing error on training examples. Like cross-entropy loss function.
- An algorithm for optimizing the objective function. Like stochastic gradient descent.
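A minimal sketch tying the four components together (made-up toy data and learning rate; binary logistic regression in numpy, using batch gradient descent rather than SGD for brevity):

    import numpy as np

    # 1. Feature representation: each row is a feature vector x, with label y.
    X = np.array([[1.0, 0.0], [2.0, 1.0], [0.0, 2.0], [0.5, 3.0]])
    y = np.array([1, 1, 0, 0])

    w, b = np.zeros(2), 0.0

    def sigmoid(z):
        return 1 / (1 + np.exp(-z))

    lr = 0.5
    for _ in range(100):                       # 4. optimizer (batch gradient descent;
        p = sigmoid(X @ w + b)                 #    SGD would update per example)
        # 2. classification function: p above is p(y=1|x)
        # 3. objective: cross-entropy loss
        loss = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
        grad_w = X.T @ (p - y) / len(y)        # gradient of the loss w.r.t. w
        grad_b = np.mean(p - y)
        w -= lr * grad_w
        b -= lr * grad_b

    print(w, b, loss)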
Text categorization is the task of assigning a label or category to an entire text or document:
- sentiment analysis
- spam detection
- subject category or topic label
A probabilistic classifier will additionally tell us the probability of the observation being in the class.
Generative classifiers like naive Bayes build a model of how a class could generate some input data.
Discriminative classifiers like logistic regression instead learn what features from the input are most useful to discriminate between the different possible classes.
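A toy sketch of the generative approach (hypothetical two-document training set; multinomial naive Bayes with add-one smoothing):

    from collections import Counter
    from math import log

    train = [("buy cheap pills now".split(), "spam"),
             ("meeting agenda for monday".split(), "ham")]

    # "Generative" part: model what each class's documents look like.
    doc_counts = Counter(c for _, c in train)
    word_counts = {c: Counter() for c in doc_counts}
    for words, c in train:
        word_counts[c].update(words)
    vocab = {w for words, _ in train for w in words}

    def score(words, c):
        """log P(c) + sum of log P(w|c), with add-one (Laplace) smoothing."""
        logp = log(doc_counts[c] / len(train))
        total = sum(word_counts[c].values())
        for w in words:
            logp += log((word_counts[c][w] + 1) / (total + len(vocab)))
        return logp

    doc = "cheap pills".split()
    print(max(doc_counts, key=lambda c: score(doc, c)))  # -> 'spam'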
Normalizing text means converting it to a more convenient, standard form.
- tokenization: separating out or tokenizing words from running text
- lemmatization: determining that words have the same root despite different surface forms. Stemming is a simpler version of lemmatization in which we just strip suffixes from the end of the word. (a toy sketch of all three steps follows this list)
- sentence segmentation
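A toy sketch of all three steps (regex tokenization, naive suffix stripping, naive period-based sentence splitting; real systems use more careful rules or a library such as NLTK):

    import re

    text = "The cats sat quietly. The dogs were running."

    # sentence segmentation: naive split after sentence-final punctuation
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())

    # tokenization: lowercase and pull out word tokens
    tokens = [re.findall(r"\w+", s.lower()) for s in sentences]

    # stemming: crude suffix stripping (real stemmers like Porter use fuller rules)
    def stem(w):
        for suf in ("ing", "ly", "ed", "s"):
            if w.endswith(suf) and len(w) > len(suf) + 2:
                return w[: -len(suf)]
        return w

    print([[stem(t) for t in sent] for sent in tokens])
    # [['the', 'cat', 'sat', 'quiet'], ['the', 'dog', 'were', 'runn']]
    # note "runn": crude stripping over- and under-stems, which is why
    # stemming is only an approximation of lemmatization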
Sometimes we need to bring the databases from several data sources together in one place. The most naive approach is to dump the data from the other databases and import it all into a single database, but that is cumbersome, and time-consuming when the databases are large. This is where FDW comes in very handy. FDW stands for Foreign Data Wrapper; there is a basic introduction here: Foreign data wrappers - PostgreSQL wiki. FDW is very simple and works well. Below is a step-by-step walkthrough (based on postgres) of the basic operations and caveats.