Vector Semantics Note (SLP Ch06)

Words that occur in similar contexts tend to have similar meanings. This link between similarity in how words are distributed and similarity in what they mean is called the distributional hypothesis.

Words which are synonyms tend to occur in the same environment, with the amount of meaning difference between two words “corresponding roughly to the amount of difference in their environments”.

Vector semantics instantiates this linguistic hypothesis by learning representations of the meaning of words directly from their distributions in texts.
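As a minimal sketch of the idea, the snippet below builds count-based co-occurrence vectors and compares them with cosine similarity. The toy corpus, the window size of 2, and the function names are assumptions made for illustration; the note itself does not prescribe a particular representation or similarity measure.

```python
from collections import Counter, defaultdict
import math


def cooccurrence_vectors(sentences, window=2):
    """Map each word to a Counter of the words seen within `window` positions of it."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse count vectors (0.0 if either is empty)."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Toy corpus (invented): two near-synonyms share contexts, a third word does not.
corpus = [
    "the oculist examined my eyes",
    "the eye-doctor examined my eyes",
    "the chef cooked my dinner",
]
vecs = cooccurrence_vectors(corpus)
sim_synonyms = cosine(vecs["oculist"], vecs["eye-doctor"])
sim_unrelated = cosine(vecs["oculist"], vecs["chef"])
```

Here `sim_synonyms` comes out higher than `sim_unrelated`, because the two synonyms occur in identical contexts while "chef" shares only part of them.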

Logistic Regression Note (SLP Ch05)

In NLP, logistic regression is the baseline supervised machine learning algorithm for classification.

discriminative classifier: like logistic regression
- only tries to learn to distinguish the classes.
- directly computes $$P(c|d)$$
generative classifier: like naive Bayes
- has the goal of understanding what each class looks like.
- makes use of the likelihood term and the prior: $$P(d|c)P(c)$$
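
For reference, the two views are connected by Bayes' rule. Writing $$\hat c$$ for the predicted class (a symbol introduced here, not used in the note above), a generative classifier picks:

$$\hat c = \arg\max_{c} P(c|d) = \arg\max_{c} \frac{P(d|c)P(c)}{P(d)} = \arg\max_{c} P(d|c)P(c)$$

The denominator $$P(d)$$ is the same for every class, so it can be dropped. A discriminative classifier skips this decomposition and models $$P(c|d)$$ directly.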

A machine learning system for classification has four components:

A feature representation of the input
A classification function that computes $$\hat y$$, the estimated class, via $$p(y|x)$$, such as the sigmoid or softmax function.
An objective function for learning, usually involving minimizing error on training examples, such as the cross-entropy loss.
An algorithm for optimizing the objective function, such as stochastic gradient descent.
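
The four components above can be sketched for binary logistic regression as follows. This is a minimal illustration, not a reference implementation: the two count features, the toy training data, the learning rate, and the epoch count are all assumptions chosen for the example.

```python
import math
import random


def sigmoid(z):
    """Classification function: squashes a real-valued score into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(w, b, x):
    """Estimate p(y=1|x) from a feature vector x, weights w, and bias b."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def cross_entropy(p, y):
    """Objective function: cross-entropy loss for one example (eps avoids log(0))."""
    eps = 1e-12
    return -(y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps))


def train_sgd(data, lr=0.1, epochs=500, seed=0):
    """Optimization algorithm: stochastic gradient descent, one example per update."""
    rng = random.Random(seed)
    w, b = [0.0] * len(data[0][0]), 0.0
    data = list(data)
    for _ in range(epochs):
        rng.shuffle(data)
        for x, y in data:
            # The gradient of the cross-entropy loss w.r.t. w_i is (p - y) * x_i.
            err = predict_proba(w, b, x) - y
            w = [wi - lr * err * xi for wi, xi in zip(w, x)]
            b -= lr * err
    return w, b


# Feature representation (assumed): [count of positive words, count of negative words].
train = [([3, 0], 1), ([2, 1], 1), ([0, 2], 0), ([1, 3], 0)]
w, b = train_sgd(train)
p_positive = predict_proba(w, b, [3, 0])
p_negative = predict_proba(w, b, [0, 2])
```

After training, `p_positive` is above 0.5 and `p_negative` is below it. Replacing `sigmoid` with softmax would extend the same structure to more than two classes.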

Naive Bayes and Sentiment Classification Note (SLP Ch04)

Text categorization is the task of assigning a label or category to an entire text or document. Examples include:

sentiment analysis
spam detection
subject category or topic label

A probabilistic classifier additionally tells us the probability of the observation being in the class.

Generative classifiers like naive Bayes build a model of how a class could generate some input data.

Discriminative classifiers like logistic regression instead learn what features from the input are most useful to discriminate between the different possible classes.
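
To make the generative view concrete, here is a minimal sketch of a multinomial naive Bayes classifier with add-one smoothing. The toy movie-review data, the `pos`/`neg` labels, and the function names are assumptions for illustration; words never seen in training are simply ignored at test time.

```python
from collections import Counter, defaultdict
import math


def train_naive_bayes(docs):
    """docs: list of (tokens, label). Returns log priors, log likelihoods, vocabulary."""
    label_counts = Counter(label for _, label in docs)
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in docs:
        word_counts[label].update(tokens)
        vocab.update(tokens)
    log_prior = {c: math.log(n / len(docs)) for c, n in label_counts.items()}
    log_likelihood = {}
    for c in label_counts:
        # Add-one (Laplace) smoothing: every vocabulary word gets one extra count.
        total = sum(word_counts[c].values()) + len(vocab)
        log_likelihood[c] = {
            w: math.log((word_counts[c][w] + 1) / total) for w in vocab
        }
    return log_prior, log_likelihood, vocab


def classify(tokens, log_prior, log_likelihood, vocab):
    """Pick the class maximizing log P(c) + sum of log P(w|c) over known words."""
    scores = {
        c: log_prior[c] + sum(log_likelihood[c][w] for w in tokens if w in vocab)
        for c in log_prior
    }
    return max(scores, key=scores.get)


# Toy training set (assumed for illustration).
training = [
    ("just plain boring".split(), "neg"),
    ("entirely predictable and lacks energy".split(), "neg"),
    ("no surprises and very few laughs".split(), "neg"),
    ("very powerful".split(), "pos"),
    ("the most fun film of the summer".split(), "pos"),
]
model = train_naive_bayes(training)
label_a = classify("predictable with no fun".split(), *model)
label_b = classify("very powerful".split(), *model)
```

With this data, `label_a` is `"neg"` and `label_b` is `"pos"`. Working in log space avoids underflow when many small probabilities are multiplied.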

Regular Expressions, Text Normalization, and Edit Distance Note (SLP Ch02)

Normalizing text means converting it to a more convenient, standard form.

tokenization: separating out (tokenizing) words from running text
lemmatization: determining that words have the same root despite different surface forms. Stemming is a simpler version of lemmatization in which we just strip suffixes from the end of the word.
sentence segmentation
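
The three steps above can be sketched with regular expressions. This is a deliberately naive illustration: the patterns, the suffix list, and the function names are assumptions, the sentence splitter breaks on abbreviations such as "Dr.", and `crude_stem` is not the Porter stemmer.

```python
import re


def sentence_segment(text):
    """Split after '.', '!' or '?' when followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def tokenize(sentence):
    """Return words (allowing internal apostrophes or hyphens) and single punctuation marks."""
    return re.findall(r"\w+(?:['-]\w+)*|[^\w\s]", sentence)


def crude_stem(word):
    """Strip one common English suffix if at least three characters remain."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


text = "Cats are running. Dogs barked!"
sentences = sentence_segment(text)
tokens = [tokenize(s) for s in sentences]
stems = [crude_stem(t.lower()) for t in tokens[0]]
```

Note that `crude_stem("running")` yields `"runn"` rather than `"run"`, which shows why suffix stripping is only a rough approximation of lemmatization.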

DataBase Foreign Data Wrapper

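Sometimes we need to bring databases from several data sources together in one place. The most naive approach is to dump the data from the other databases and import all of it into a single database, but this is tedious, and it is also slow when the databases are large. This is where FDW comes in handy. FDW stands for Foreign Data Wrapper; a basic introduction is available here: Foreign data wrappers - PostgreSQL wiki. FDW is very simple and works well. Below is a step-by-step introduction (based on postgres) to the basic operations and things to watch out for.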

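Note: this tutorial is intended only for readers who have never programmed before but want to learn a bit of programming. They mainly have two goals: first, to eliminate some of the repetitive work in their daily jobs, or to build a fun little project of their own; second, to try a new way of thinking, or a new way of “facing the world”. This tutorial is organized around these two goals.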

Information Extraction Note (SLP Ch18)

Recently I wanted to build an information extraction system, so I searched on Google. However, there were few Chinese articles, and their quality was not very good. Fortunately, I found several good English ones, which I summarize here. The overall structure follows my favorite NLP book, Speech and Language Processing (SLP below), along with some other materials listed in the references.

Information extraction (IE) turns the unstructured information embedded in texts into structured data, for example for populating a relational database to enable further processing. Here is a figure of a simple pipeline architecture for an information extraction system.

From: https://www.nltk.org/book/ch07.html

By the way, this book is hands-on, focusing on concrete, actionable steps.

[Figure: architecture]
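
As a rough sketch of turning unstructured text into structured rows, the snippet below pulls (subject, predicate, object) tuples out of text with one hand-written pattern and stores them in an in-memory SQLite table. The pattern, the `located_in` predicate, the table layout, and the example sentences are all hypothetical; a real system would use the tagging and entity-detection stages of a full pipeline rather than a single regular expression.

```python
import re
import sqlite3

# Hypothetical pattern: "<Capitalized Name> is located|based in <Capitalized Name>".
PATTERN = re.compile(
    r"([A-Z][\w&]*(?: [A-Z][\w&]*)*) is (?:located|based) in ([A-Z]\w*(?: [A-Z]\w*)*)"
)


def extract_relations(text):
    """Return (subject, predicate, object) tuples for every pattern match."""
    return [(org, "located_in", place) for org, place in PATTERN.findall(text)]


# Invented example text.
text = "Acme Corp is based in New York. Some people say Globex is located in Springfield."
rows = extract_relations(text)

# Populate a relational table with the extracted tuples.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE relation (subject TEXT, predicate TEXT, object TEXT)")
conn.executemany("INSERT INTO relation VALUES (?, ?, ?)", rows)
stored = conn.execute(
    "SELECT subject, object FROM relation ORDER BY subject"
).fetchall()
```

Once the tuples are in a table, ordinary SQL queries can be used for the further processing mentioned above.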

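Chapter 18: Rationalism and Empiricism in Natural Language Processing

Rationalism: methods based on generative linguistics
Empiricism: methods based on the analysis of large-scale corpora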

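Chapter 17: Evaluation of Natural Language Processing Systems

General principles and methods of evaluation

Two different evaluation methods:

Black-box evaluation (extrinsic evaluation): ignores the internal mechanisms and structure of the NLP system and judges it mainly by its inputs and outputs; useful for understanding overall external performance.
White-box evaluation (intrinsic evaluation): analyzes the internal mechanisms of the NLP system separately and measures the performance of each component; useful for understanding the performance of the internal components.

Black-box evaluation is mainly used, following the principle of “lenient entry, strict exit”.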

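Chapter 16: Formal Models in Statistical Machine Translation

Corpus-based machine translation methods fall into two kinds: statistical machine translation and example-based machine translation. In the former, knowledge is represented as statistical data; in the latter, the corpus itself is a form of translation knowledge.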

Yam

Feeling, Coding, Thinking