I recently started learning convex optimization. As a math major I once took a course on numerical optimization, but at the time I learned it "in a muddle" (in boss's words), and back then I had no appreciation of its applications. Now that I have become a junior programmer working in ML, I have come to realize how practical optimization is (not least because of the countless papers every year), so I am relearning convex optimization from scratch. I am currently reading Yurii Nesterov's textbook Introductory Lectures on Convex Optimization: A Basic Course, and I am taking notes so that I don't forget everything as soon as I finish.
What’s in this book
- General optimization problems (Unfortunately “unsolvable”)
- Smooth convex optimization
- Nonsmooth convex optimization
- Convex minimization problems with explicit structure
PS: Goals of this course (from Introduction section):
- Show that, despite their attraction, the proposed "solutions" of general optimization problems very often cannot satisfy the expectations of a naive user. The main fact, which should be known to any person dealing with optimization models, is that in general, optimization problems are unsolvable. (The reality is always cruel.)
- Discuss methods for solvable nonlinear models, namely convex optimization problems.
Chapter 1 – Nonlinear Optimization
General formulation of the problem
Let $x \in \mathbb{R}^n, S \subset \mathbb{R}^n$ and $f_0(x), \ldots, f_m(x)$ be real-valued functions of $x$. We deal with different variants of the following general minimization problem:
$$
\min f_0(x), ~ \text{s.t. } f_j(x) \ \& \ 0, j = 1, \ldots, m; ~ x \in S.
$$
where the $\&$ can be $\leq, \geq$ or $=$.
“In general, optimization problems are unsolvable”
Performance of numerical methods
Usually, numerical methods are developed to solve many different problems with similar characteristics. Thus, the performance of a method $\mathcal{M}$ on a whole class $\mathcal{F}$ of problems is a natural measure of its efficiency.
Some concepts:
- $\Sigma$: the part of problem $\mathcal{P}$ known to the numerical scheme is called the model of the problem
- $\mathcal{O}$: the process of collecting specific data about the problem is called the oracle; it answers the successive questions of the method
- Zero-order oracle: returns the function value $f(x)$
- First-order oracle: returns $f(x)$ and the gradient $f'(x)$
- Second-order oracle: returns $f(x)$, $f'(x)$, and the Hessian $f''(x)$
- $\mathcal{F}$: problem class $\mathcal{F} \equiv (\Sigma, \mathcal{O}, \mathcal{T}_\epsilon)$, where $\mathcal{T}_\epsilon$ is the stopping criterion. A method $\mathcal{M}$'s performance on the problem class is defined as its performance on the worst problem $\mathcal{P}_w$ from $\mathcal{F}$.
- Measure of the complexity of problem $\mathcal{P}$ for method $\mathcal{M}$ (since finding an exact solution is generally impossible in numerical optimization, to solve a problem means to find an approximate solution with some accuracy $\epsilon > 0$):
- Analytical complexity: the number of calls of the oracle required to solve problem $\mathcal{P}$ up to accuracy $\epsilon$
- Arithmetical complexity: the total number of arithmetic operations (including the work of the oracle and the work of the method) required to solve problem $\mathcal{P}$ up to accuracy $\epsilon$
General Iterative Scheme
Local black box
- The only information available to the numerical scheme is the answers of the oracle
- The oracle is local: a small variation of the problem far enough from the test point $x$ does not change the answer at $x$
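The book formalizes these assumptions in a general iterative scheme. Below is a minimal Python sketch of that loop; the names `oracle`, `next_point`, and `stopped` are placeholders of mine, not from the book:

```python
def general_iterative_scheme(oracle, next_point, stopped, x0, max_calls=10**6):
    """Generic black-box method: the only information about the problem it
    ever sees is the sequence of oracle answers at the test points."""
    x = x0
    info = []                       # accumulated informational set I_k
    for k in range(max_calls):
        answer = oracle(x)          # e.g. f(x) for a zero-order oracle
        info.append((x, answer))
        if stopped(info):           # stopping criterion T_eps
            break
        x = next_point(info)        # method's rule for the next test point
    return x, info
```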
Complexity bounds for global optimization
Consider the class $\mathcal{C}$ of problems $\min_{x \in B_n} f(x)$, where $f$ is Lipschitz continuous on $B_n$ with constant $L$, $\mathbb{R}^n$ is equipped with the $l_\infty$-norm, and $B_n = \{ x \in \mathbb{R}^n \mid 0 \leq x^{(i)} \leq 1, i = 1, \ldots, n \}$ is the unit box.
It is proved that the uniform grid method is optimal for $\mathcal{C}$, and that the class has a lower bound of $\left( \left\lfloor \frac{L}{2\epsilon} \right\rfloor \right)^n$ oracle calls. For $L = 2$, $n = 10$, $\epsilon = 0.01$ this means $10^{20}$ calls of the (zero-order) oracle; since each call costs at least $n = 10$ arithmetic operations, a processor performing $10^6$ operations per second would need $10^{15}$ seconds, i.e., at least 31,250,000 years.
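As a concrete illustration, here is a minimal Python sketch of the uniform grid method as I understand it from the book (the function name and signature are mine; `f` is assumed to take a tuple of $n$ coordinates):

```python
import itertools
import math

def uniform_grid_method(f, n, L, eps):
    """Zero-order uniform grid method on the unit box [0, 1]^n for f
    Lipschitz continuous with constant L in the l_inf-norm.
    Makes p**n oracle calls with p = floor(L / (2 * eps)) + 2."""
    p = math.floor(L / (2 * eps)) + 2
    # Test points are the centers of the p**n equal sub-boxes of side 1/p.
    grid_1d = tuple((2 * i - 1) / (2 * p) for i in range(1, p + 1))
    nodes = itertools.product(grid_1d, repeat=n)
    return min(nodes, key=f)  # return the best grid point found
```

With $n = 10$, $L = 2$, $\epsilon = 0.01$, this loop would make $102^{10} \approx 1.2 \times 10^{20}$ calls of the zero-order oracle, which is consistent with the discouraging numbers above.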
The lower bounds
- The minimal number of oracle calls a method needs, on its worst problem (which is specific to the method), to achieve the stopping criterion
- Based on the black box concept
- Valid for all reasonable iterative schemes, and thus provides a lower estimate of the analytical complexity on the problem class
- Very often employs the idea of a resisting oracle: a resisting oracle tries to create the worst possible problem for each particular method (see the sketch after this list)
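To make the resisting-oracle idea concrete, here is a toy sketch in the spirit of the book's lower-bound proof for the Lipschitz class above (the class name and the `certificate` helper are my own): the oracle answers $f(x) = 0$ at every query, and as long as fewer than $\left( \left\lfloor \frac{L}{2\epsilon} \right\rfloor \right)^n$ points have been queried, some grid cell contains no test point, so there still exists an $L$-Lipschitz function that is zero at every queried point but has minimum $-\epsilon$.

```python
class ResistingOracle:
    """Toy resisting oracle for minimizing an L-Lipschitz (l_inf-norm)
    function over [0, 1]^n: it answers 0 at every test point."""

    def __init__(self, n, L, eps):
        self.n, self.L, self.eps = n, L, eps
        self.queries = []

    def __call__(self, x):
        self.queries.append(tuple(x))
        return 0.0                   # always claim "flat at zero"

    def certificate(self, center):
        """Given the center of a grid cell (of side 1/floor(L/(2*eps)))
        containing no queried point, build a function consistent with all
        answers so far whose minimum is -eps. Its existence proves the
        method has not yet reached accuracy eps."""
        L, eps = self.L, self.eps
        def f(x):
            dist = max(abs(xi - ci) for xi, ci in zip(x, center))
            return min(0.0, L * dist - eps)  # 0 at all queries, min -eps
        return f
```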
The upper complexity bounds
- The number of oracle calls that suffices to guarantee that the stopping criterion is achieved for any problem of the problem class
- For example, for Lipschitz continuous functions on $B_n$, the uniform grid method has the upper bound $\left( \left\lfloor \frac{L}{2\epsilon} \right\rfloor + 2 \right)^n$.
The optimal method
- A method is optimal for a class when its upper bound coincides with the class's lower bound up to a constant multiplicative factor
- If $\epsilon = O(\frac{L}{n})$, the uniform grid method is an optimal method for $\mathcal{C}$ (see the computation below)
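To see where this condition comes from (a back-of-envelope computation of mine, combining the two bounds above): writing $m = \left\lfloor \frac{L}{2\epsilon} \right\rfloor$, the ratio of the upper bound to the lower bound is
$$
\left( \frac{m+2}{m} \right)^n \leq \exp\left( \frac{2n}{m} \right),
$$
which stays bounded by a constant exactly when $m$ grows at least proportionally to $n$, i.e., when $\epsilon = O(\frac{L}{n})$.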
Misc: Classes of differentiable functions
Let $Q \subset \mathbb{R}^n$. We write $f \in C_L^{k, p}(Q)$ if
- $f$ is $k$ times continuously differentiable on $Q$
- its $p$th derivative is Lipschitz continuous on $Q$ with constant $L$, i.e., $\| f^{(p)}(x) - f^{(p)}(y) \| \leq L \| x - y \|$ for all $x, y \in Q$
Intuitively, Lipschitz continuity limits how fast the function (or its derivatives) can change.
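For example, the class used most heavily later in the book is $C_L^{1,1}(\mathbb{R}^n)$, the functions with Lipschitz continuous gradient; for such $f$ the definition yields the quadratic bound
$$
\left| f(y) - f(x) - \langle f'(x), y - x \rangle \right| \leq \frac{L}{2} \| y - x \|^2 \quad \forall x, y \in \mathbb{R}^n,
$$
which is the basic inequality behind the convergence analysis of gradient methods.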