Generative Learning Algorithms
Learn $P(x|y)$ and $P(y)$, then use Bayes' rule to obtain $P(y|x)$.
When making a prediction:
$\underset{y}{\mathrm{argmax}}\;P(y|x) = \underset{y}{\mathrm{argmax}}\;\frac{P(x|y)P(y)}{P(x)} = \underset{y}{\mathrm{argmax}}\;{P(x|y)P(y)}$
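A minimal sketch of this decision rule in Python (the helper name `predict_generative` and its inputs are my own, not from the notes): since $P(x)$ is the same for every class, prediction only needs the class-conditional likelihoods and the prior.

```python
import numpy as np

def predict_generative(log_px_given_y, log_py):
    """Return argmax_y P(x|y)P(y), working in log space for numerical stability.

    log_px_given_y: log P(x | y = k) for each class k
    log_py:         log P(y = k)     for each class k
    """
    scores = np.asarray(log_px_given_y) + np.asarray(log_py)  # log P(x|y) + log P(y)
    return int(np.argmax(scores))  # P(x) is a shared constant, so it drops out

# Example: x is slightly more likely under class 1, priors are equal
print(predict_generative([-2.3, -1.9], np.log([0.5, 0.5])))  # -> 1
```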
Gaussian Discriminant Analysis
Assumptions:
- $x \in \mathbb{R}^{n}$ is continuous-valued
- $x \mid y$ is Gaussian, with a shared covariance matrix $\Sigma_0 = \Sigma_1 = \Sigma$
- $y \in \{0, 1\}$, $y \sim \mathrm{Bernoulli}(\phi)$
$P(y) = \phi^y(1-\phi)^{1-y}$
$P(x|y = 0) = \frac{1}{(2\pi)^{\frac{n}{2}}|\Sigma|^{\frac{1}{2}}}\exp(-\frac{1}{2}(x-\mu_0)^T\Sigma^{-1}(x-\mu_0))$
$P(x|y = 1) = \frac{1}{(2\pi)^{\frac{n}{2}}|\Sigma|^{\frac{1}{2}}}\exp(-\frac{1}{2}(x-\mu_1)^T\Sigma^{-1}(x-\mu_1))$
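As a quick sketch of how these class-conditional densities are evaluated (the function `gaussian_log_density` below is a hypothetical helper, not from the notes):

```python
import numpy as np

def gaussian_log_density(x, mu, Sigma):
    """log N(x; mu, Sigma) for a multivariate Gaussian."""
    n = x.shape[0]
    diff = x - mu
    _, logdet = np.linalg.slogdet(Sigma)        # log |Sigma|, numerically stable
    quad = diff @ np.linalg.solve(Sigma, diff)  # (x - mu)^T Sigma^{-1} (x - mu)
    return -0.5 * (n * np.log(2 * np.pi) + logdet + quad)

# Shared covariance, two class means
Sigma = np.array([[1.0, 0.2], [0.2, 1.0]])
x = np.array([0.5, -0.5])
print(gaussian_log_density(x, np.zeros(2), Sigma))  # log P(x | y = 0)
print(gaussian_log_density(x, np.ones(2), Sigma))   # log P(x | y = 1)
```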
maximize joint likelihood:
$l(\theta) = \log\prod_{i=1}^mP(x^{(i)},y^{(i)})=\log\prod_{i=1}^mP(y^{(i)})P(x^{(i)} \mid y^{(i)})$
Setting the partial derivatives to zero gives:
$\phi = \frac{\sum_{i=1}^m \;1\{y^{(i)} = 1\}}{m}$
$\mu_0 = \frac{\sum_{i=1}^m\;1\{y^{(i)}=0\}x^{(i)}}{\sum_{i=1}^m\;1\{y^{(i)}=0\}}$
$\mu_1 = \frac{\sum_{i=1}^m\;1\{y^{(i)}=1\}x^{(i)}}{\sum_{i=1}^m\;1\{y^{(i)}=1\}}$
$\Sigma = \frac{1}{m} \sum_{i=1}^m (x^{(i)}-\mu_{y^{(i)}})(x^{(i)}-\mu_{y^{(i)}})^T$
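A sketch of these closed-form estimates, assuming `X` is an $m \times n$ NumPy array and `y` a 0/1 label vector (the function name `fit_gda` is mine):

```python
import numpy as np

def fit_gda(X, y):
    """Closed-form maximum-likelihood estimates for GDA with a shared covariance."""
    m = X.shape[0]
    phi = np.mean(y == 1)                        # fraction of y = 1 examples
    mu0 = X[y == 0].mean(axis=0)                 # mean of class-0 examples
    mu1 = X[y == 1].mean(axis=0)                 # mean of class-1 examples
    mus = np.where((y == 1)[:, None], mu1, mu0)  # mu_{y^(i)} for every example
    diff = X - mus
    Sigma = diff.T @ diff / m                    # pooled covariance estimate
    return phi, mu0, mu1, Sigma

# Tiny synthetic check: two Gaussian blobs
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(2, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
print(fit_gda(X, y))
```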
We find that
$P(y=1|x) = \frac{P(x|y=1)P(y=1)}{P(x|y=0)P(y=0)+P(x|y=1)P(y=1)} = \frac{1}{1+\frac{1-\phi}{\phi}e^{A-B}} = \frac{1}{1+e^{-\theta^{T}x}}$
where
$A = -\frac{1}{2}(x-\mu_0)^T\Sigma^{-1}(x-\mu_0)$
$B = -\frac{1}{2}(x-\mu_1)^T\Sigma^{-1}(x-\mu_1)$
$\theta = f(\phi, \Sigma, \mu_0, \mu_1)$, because $A - B = (\mu_0-\mu_1)^T\Sigma^{-1}x + \frac{1}{2}(\mu_1^T\Sigma^{-1}\mu_1-\mu_0^T\Sigma^{-1}\mu_0)$ is linear in $x$, so the whole factor $\frac{1-\phi}{\phi}e^{A-B}$ can be written as $e^{-\theta^{T}x}$ once $x$ is augmented with an intercept term $x_0 = 1$.
In other words,
$x|y$ being Gaussian is a sufficient but not necessary condition for $P(y|x)$ being a logistic function.
This means GDA makes stronger assumptions about the data than logistic regression. When the data really are (approximately) Gaussian, GDA can reach a good model with less training data; logistic regression makes weaker assumptions and is more robust to incorrect modeling assumptions, so on large datasets it often performs better than GDA. For this reason, logistic regression is the more commonly used choice.
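To make the equivalence concrete, here is a small numerical check with arbitrary (made-up) parameter values: the posterior from Bayes' rule matches $\sigma(\theta^Tx + \theta_0)$ with $\theta = \Sigma^{-1}(\mu_1-\mu_0)$ and $\theta_0 = \frac{1}{2}(\mu_0^T\Sigma^{-1}\mu_0-\mu_1^T\Sigma^{-1}\mu_1) + \log\frac{\phi}{1-\phi}$.

```python
import numpy as np

# Arbitrary GDA parameters, chosen only for this check
phi = 0.3
mu0, mu1 = np.array([0.0, 0.0]), np.array([1.0, 2.0])
Sigma = np.array([[1.0, 0.3], [0.3, 2.0]])
Sinv = np.linalg.inv(Sigma)

def log_gaussian(x, mu):
    d = x - mu
    return -0.5 * (len(x) * np.log(2 * np.pi) + np.log(np.linalg.det(Sigma)) + d @ Sinv @ d)

x = np.array([0.7, -0.4])

# Posterior computed directly from Bayes' rule
num = np.exp(log_gaussian(x, mu1)) * phi
den = num + np.exp(log_gaussian(x, mu0)) * (1 - phi)
p_bayes = num / den

# The same posterior as a logistic function of x
theta = Sinv @ (mu1 - mu0)
theta0 = 0.5 * (mu0 @ Sinv @ mu0 - mu1 @ Sinv @ mu1) + np.log(phi / (1 - phi))
p_logistic = 1.0 / (1.0 + np.exp(-(theta @ x + theta0)))

print(p_bayes, p_logistic)  # the two numbers agree
```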
Naive Bayes classifier
Assumptions:
- $x \in \{0, 1\}^{n}$, and $x_j \mid y \sim \mathrm{Bernoulli}(\phi_{j \mid y})$
- the $x_j$'s are conditionally independent given $y$
- $y \in \{0, 1\}$, $y \sim \mathrm{Bernoulli}(\phi)$
maximize the joint likelihood of the data:
$l(\theta) = \log \prod_{i=1}^mP(x^{(i)},y^{(i)})$
Setting the partial derivatives to zero gives:
$\phi_{j|y=1} = \frac{\sum_{i=1}^m\;1\{x_j^{(i)} = 1, y^{(i)} = 1\}}{\sum_{i=1}^m\;1\{y^{(i)} = 1\}}$
$\phi_{j|y=0} = \frac{\sum_{i=1}^m\;1\{x_j^{(i)} = 1, y^{(i)} = 0\}}{\sum_{i=1}^m\;1\{y^{(i)} = 0\}}$
$\phi = \frac{\sum_{i=1}^m\;1\{y^{(i)} = 1\}}{m}$
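A sketch of these estimates for the Bernoulli naive Bayes model, assuming a binary feature matrix `X` and 0/1 labels `y` (the helper `fit_naive_bayes` is hypothetical): the estimates are just conditional frequencies.

```python
import numpy as np

def fit_naive_bayes(X, y):
    """Maximum-likelihood estimates for Bernoulli naive Bayes (no smoothing)."""
    phi = np.mean(y == 1)              # P(y = 1)
    phi_j_y1 = X[y == 1].mean(axis=0)  # P(x_j = 1 | y = 1) for each feature j
    phi_j_y0 = X[y == 0].mean(axis=0)  # P(x_j = 1 | y = 0) for each feature j
    return phi_j_y1, phi_j_y0, phi

X = np.array([[1, 0, 1],
              [1, 1, 0],
              [0, 0, 1],
              [0, 1, 0]])
y = np.array([1, 1, 0, 0])
print(fit_naive_bayes(X, y))
```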
When the variables take more than two values (e.g., $y \in \{1, \cdots, k\}$, or word-valued features in text classification), the Bernoulli distributions above are replaced by multinomials; for text this gives the multinomial event model.
Discriminative Learning Algorithms
Directly learn $P(y|x)$, or directly learn a hypothesis $h_{\theta}(x) \in \{0, 1\}$.
Laplace Smoothing
Suppose $y \in \{1, \cdots, k\}$. After Laplace smoothing the estimate becomes $\phi_j = P(y = j) = \frac{\sum_{i=1}^m 1\{y^{(i)} = j\} + 1}{m + k}$, so a class that never appears in the training set still receives a small nonzero probability instead of $0$.
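A small sketch of this estimate (the helper name is mine): even a value of $y$ that never occurs in the training data gets a nonzero probability.

```python
import numpy as np

def laplace_smoothed_phi(y, k):
    """phi_j = (#{y^(i) = j} + 1) / (m + k) for j = 1..k."""
    y = np.asarray(y)
    m = len(y)
    counts = np.array([np.sum(y == j) for j in range(1, k + 1)])
    return (counts + 1) / (m + k)

# k = 3 but class 3 never appears: its estimate is still > 0
print(laplace_smoothed_phi([1, 1, 2], k=3))  # [0.5, 0.333..., 0.166...]
```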