Notes on Convex Optimization

I recently started studying convex optimization. Although I took a numerical optimization course as a math major, at the time I learned it "in a muddle" (in my boss's words), and back then I had no appreciation of its applications. Having since become a junior programmer working in ML, I have come to see how practical optimization is (not least because of the countless papers published on it every year), so I am learning convex optimization again from scratch. I am currently reading Yurii Nesterov's textbook Introductory Lectures on Convex Optimization: A Basic Course, and I am taking these notes so that I don't forget everything as soon as I finish.

What’s in this book

  • General optimization problems (Unfortunately “unsolvable”)
  • Smooth convex optimization
  • Nonsmooth convex optimization
  • Convex minimization problems with explicit structure

PS: Goals of this course (from Introduction section):

  • Show that, despite their attraction, the proposed “solutions” of general optimization problems very often cannot satisfy the expectations of a naive user. The main fact, which should be known to any person dealing with optimization models, is that in general optimization problems are unsolvable. (The reality is always cruel)
  • Discuss methods for solvable nonlinear models, namely convex optimization problems.

Chapter 1 – Nonlinear Optimization

General formulation of the problem

Let $x \in \mathbb{R}^n, S \subset \mathbb{R}^n$ and $f_0(x), \ldots, f_m(x)$ be real-valued functions of $x$. We deal with different variants of the following general minimization problem:
$$
\min f_0(x), ~ \text{s.t. } f_j(x) \ \& \ 0, j = 1, \ldots, m; ~ x \in S.
$$
where each $\&$ stands for one of $\leq, \geq$ or $=$.
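As a concrete (illustrative, not from the book) instance: minimize $f_0(x) = x_1^2 + x_2^2$ subject to $f_1(x) = 1 - x_1 - x_2 \leq 0$, whose solution is $x^* = (0.5, 0.5)$. A minimal sketch that solves it with a quadratic penalty and plain gradient descent (the function and parameter names are my own):

```python
# Hypothetical example: min x1^2 + x2^2  s.t.  1 - x1 - x2 <= 0.
# Quadratic-penalty + gradient-descent sketch; converges near (0.5, 0.5).
def solve(mu=100.0, step=0.002, iters=20000):
    x = [2.0, 2.0]
    for _ in range(iters):
        v = max(0.0, 1.0 - x[0] - x[1])          # constraint violation
        # gradient of x1^2 + x2^2 + mu * v^2
        g = [2 * x[0] - 2 * mu * v, 2 * x[1] - 2 * mu * v]
        x = [x[0] - step * g[0], x[1] - step * g[1]]
    return x

x = solve()  # close to (0.5, 0.5); the penalty bias shrinks as mu grows
```

With a finite penalty weight `mu` the fixed point is $\mu/(1+2\mu)$ per coordinate, slightly inside the constraint; this bias is the usual price of a pure penalty method.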

“In general, optimization problems are unsolvable”

Performance of numerical methods

Usually, numerical methods are developed for solving many different problems with similar characteristics. Thus, the performance of a method $\mathcal{M}$ on a whole class $\mathcal{F}$ of problems is a natural characteristic of its efficiency.
Some concepts:

  • $\Sigma$: the part of problem $\mathcal{P}$ that is known to the numerical scheme is called the model of the problem
  • $\mathcal{O}$: the process of collecting specific information about the problem is called the oracle
  • Zero-order oracle: returns the function value $f(x)$
  • First-order oracle: returns $f(x)$ and the gradient $f'(x)$
  • Second-order oracle: returns $f(x)$, $f'(x)$ and the Hessian $f''(x)$
  • $\mathcal{F}$: problem class $\mathcal{F} \equiv (\Sigma, \mathcal{O}, \mathcal{T}_\epsilon)$, where $\mathcal{T}_\epsilon$ is the stopping criterion. Method $\mathcal{M}$'s performance on the problem class is its performance on the worst problem $\mathcal{P}_w$ from $\mathcal{F}$.
  • Measures of complexity of problem $\mathcal{P}$ for method $\mathcal{M}$ (since finding an exact solution is in general impossible in numerical optimization, to solve a problem means to find an approximate solution with accuracy $\epsilon > 0$):
  • Analytical complexity: the number of oracle calls required to solve problem $\mathcal{P}$ up to accuracy $\epsilon$
  • Arithmetical complexity: the total number of arithmetic operations (including the work of the oracle and the work of the method) required to solve problem $\mathcal{P}$ up to accuracy $\epsilon$
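The three oracle orders can be sketched for a concrete quadratic $f(x) = \frac{1}{2}x^\top A x - b^\top x$ (the matrix, vector, and function names below are illustrative, not from the book):

```python
# Zero-, first-, and second-order oracles for a quadratic test function.
import numpy as np

A = np.array([[2.0, 0.0], [0.0, 4.0]])
b = np.array([1.0, 1.0])

def oracle0(x):
    # zero-order: function value only
    return 0.5 * x @ A @ x - b @ x

def oracle1(x):
    # first-order: value and gradient f'(x) = A x - b
    return oracle0(x), A @ x - b

def oracle2(x):
    # second-order: value, gradient, and (constant) Hessian f''(x) = A
    f, g = oracle1(x)
    return f, g, A
```

At the minimizer $x^* = A^{-1}b = (0.5, 0.25)$ the first-order oracle returns a zero gradient, which is how a method detects (approximate) optimality from local information alone.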

General Iterative Scheme

Local black box

  • The only information available to the numerical scheme is the answer of the oracle
  • The oracle is local: a small variation of the problem far enough from the test point $x$ does not change the answer at $x$

Complexity bounds for global optimization

Consider the class $\mathcal{C}$ of problems $\min_{x \in B_n} f(x)$, where $f$ is Lipschitz continuous on $B_n$ with constant $L$, $\mathbb{R}^n$ is equipped with the $l_\infty$-norm, and $B_n = \{ x \in \mathbb{R}^n \mid 0 \leq x^{(i)} \leq 1, i = 1, \ldots, n \}$.

It can then be proved that the uniform grid method is optimal for $\mathcal{C}$, which has a lower bound of $\left(\left\lfloor \frac{L}{2\epsilon} \right\rfloor\right)^n$ oracle calls. For $L = 2$, $n = 10$, $\epsilon = 0.01$ this means $10^{20}$ calls of the zero-order oracle, at least 31,250,000 years of computation.
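The arithmetic behind that scary number is easy to reproduce; a two-line sketch (the function name is mine):

```python
# Lower bound on oracle calls for the class C: floor(L / (2*eps)) ** n.
import math

def grid_lower_bound(L, n, eps):
    return math.floor(L / (2 * eps)) ** n

calls = grid_lower_bound(L=2, n=10, eps=0.01)  # 100**10 = 10**20 calls
```

The exponential dependence on the dimension $n$ is the whole point: halving $\epsilon$ multiplies the bound by $2^n$, which is why general global optimization is hopeless in even moderate dimensions.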

The lower bounds

  • To achieve the stopping criterion on the worst problem (specific to the method), how many oracle calls does the method need?
  • Based on the black box concept
  • Valid for all reasonable iterative schemes, and thus provide a lower estimate for the analytical complexity on the problem class
  • Very often employ the idea of a resisting oracle: a resisting oracle tries to create the worst possible problem for each particular method.

The upper complexity bounds

  • How many oracle calls are enough to guarantee that the stopping criterion is achieved, for any problem of the problem class?
  • For example, for the Lipschitz continuous functions on $B_n$, the uniform grid method has upper bound $\left(\left\lfloor \frac{L}{2\epsilon} \right\rfloor + 2\right)^n$.
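In one dimension the uniform grid method is just "evaluate $f$ at the centers of $p$ equal subintervals and keep the best point"; a minimal sketch (the parameter `p` and the test function are my own choices):

```python
# Uniform grid method on B_1 = [0, 1] (1-D case).
# For f Lipschitz with constant L, the best of p grid points is
# within L / (2 * p) of the optimal value.
def grid_method(f, p):
    # centers of p equal subintervals of [0, 1]
    pts = [(2 * i + 1) / (2 * p) for i in range(p)]
    return min(pts, key=f)

f = lambda x: abs(x - 0.3)      # Lipschitz with L = 1, minimum at 0.3
x_best = grid_method(f, p=50)   # some grid point within 0.01 of 0.3
```

The guarantee comes purely from Lipschitz continuity: the nearest grid center is within $\frac{1}{2p}$ of the true minimizer, so its value is within $\frac{L}{2p}$ of the optimum. In $n$ dimensions the same idea needs $p^n$ evaluations, which is exactly the exponential cost discussed above.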

The optimal method

  • The lower and upper bounds coincide up to a constant multiplicative factor
  • If $\epsilon = O(\frac{L}{n})$, the uniform grid method is an optimal method for $\mathcal{C}$

Misc: Classes of differentiable functions

Let $Q \subseteq \mathbb{R}^n$. We write $f \in C_L^{k, p}(Q)$ if

  • $f$ is $k$ times continuously differentiable on $Q$
  • its $p$th derivative is Lipschitz continuous on $Q$ with constant $L$
    Intuitively, the Lipschitz continuity limits how fast the function (or the relevant derivative) can change
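As a quick numerical sanity check of such a membership claim: $f(x) = \sin x$ belongs to $C_1^{1,1}(\mathbb{R})$, since its derivative $\cos x$ is Lipschitz with constant $L = 1$ (because $|f''| = |\sin| \leq 1$). A sketch that samples the difference quotients of the derivative (the helper name is mine):

```python
# Empirically bound the Lipschitz constant of g by sampling
# |g(x) - g(y)| / |x - y| at random point pairs.
import math
import random

def lipschitz_ratio(g, trials=1000):
    worst = 0.0
    for _ in range(trials):
        x, y = random.uniform(-5, 5), random.uniform(-5, 5)
        if x != y:
            worst = max(worst, abs(g(x) - g(y)) / abs(x - y))
    return worst

ratio = lipschitz_ratio(math.cos)  # stays <= 1, matching L = 1 for sin'
```

Sampling can only ever certify a lower estimate of the true Lipschitz constant, but it is a cheap way to catch a wrong claimed $L$.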