A Parallel Implementation of Bayesian Optimization
The concept of ‘optimization’ is central to data science. We minimize loss by optimizing weights in a neural network. We optimize hyper-parameters in our gradient boosted trees to find the best bias-variance trade-off. We use A-B testing to optimize behavior on our websites. Whether our function is a neural network, consumer behavior, or something more sinister, we all have something we want to optimize.
Sometimes the functions we are trying to optimize are expensive, and we wish to get to our destination in as few steps as possible. Sometimes we want to be confident that we find the best possible solution, and sometimes our functions don’t have a tractable gradient, so there is no nice arrow to point us in the right direction. Often, our functions have random elements to them, so we are really trying to optimize f(x) = y + e, where e is some random error element. Bayesian optimization is a function optimizer (maximizer) which thrives in these conditions.
Table of Contents
- What Is Bayesian Optimization
- Implementing From Scratch
- Implementing In Parallel
- Final Words
What Is Bayesian Optimization
Let’s say we have a function f, and we want to find the x which maximizes (or minimizes) f(x). We have many, many options. However, if our function fits the description right above the table of contents, we will definitely want to consider Bayesian optimization.
There are several different methods for performing Bayesian optimization. All of them involve creating an assumption about how certain things are distributed, making a decision based on that assumption, and then updating the assumption.
The method in this article uses Gaussian processes to create an assumption about how f(x) is distributed. These processes can be thought of as a distribution of functions — where drawing a random sample from a Gaussian distribution results in a number, drawing a random sample from a Gaussian process results in a function. If you are not familiar with Gaussian processes, this is a little hard to picture. I recommend this video, which is what made the concept click for me.
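To make the “distribution of functions” idea more concrete, here is a minimal sketch (not part of the original walkthrough) that draws three random “functions” from a Gaussian process over a grid of inputs. The squared-exponential kernel and its length-scale are illustrative assumptions, not anything fit to our data:
# Minimal sketch: drawing "functions" from a Gaussian process.
# The kernel and length-scale below are illustrative assumptions.
library(MASS)

sqexpKernel <- function(x1, x2, lengthscale = 5) {
  outer(x1, x2, function(a, b) exp(-(a - b)^2 / (2 * lengthscale^2)))
}

grid <- seq(0, 50, length.out = 200)
K <- sqexpKernel(grid, grid) + diag(1e-8, length(grid))  # jitter for numerical stability

# Each row of 'draws' is one function sampled from the process
draws <- mvrnorm(n = 3, mu = rep(0, length(grid)), Sigma = K)
matplot(grid, t(draws), type = "l", lty = 1, xlab = "input", ylab = "sampled output")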
The algorithm itself can be summarized as follows:
- Sample the function at a handful of initial points.
- Fit a Gaussian process to the sampled points.
- Use an acquisition function, built from the Gaussian process, to decide which input looks most promising.
- Sample the function at that input and re-fit the Gaussian process.
- Repeat until the sampling budget is exhausted.
Implementing From Scratch
Here we walk through a single iteration of Bayesian optimization without using a package. The process is pretty straightforward. First, we define a toy function func we want to maximize, and then we sample it 4 times:
# Function to optimize
func <- function(input) {
  dnorm(input,15,5) + dnorm(input,30,4) + dnorm(input,40,5)
}

# Sample the function 4 times
func_results <- data.frame(input = c(5,18,25,44))
func_results$output <- func(func_results$input)

# Plot
library(ggplot2)
p <- ggplot(data = data.frame(input=c(0,50)),aes(input)) +
  stat_function(fun=func,size=1,alpha=0.25) +
  geom_point(data=func_results,aes(x=input,y=output),size=2) +
  ylab("output") +
  ylim(c(-0.05,0.2))
p + ggtitle("Our Function and Attempts")
We are pretending we don’t know the true function, so all we see in practice are the 4 points we sampled. For the sake of keeping this walk-through interesting, we did a pretty miserable job of selecting our initial points. Let’s fit a Gaussian process to the 4 points to define our assumption about how output is distributed for each input.
library(DiceKriging)
set.seed(1991)

gp <- km(
    design = data.frame(input=func_results$input)
  , response = func_results$output
  , scaling = TRUE
)
Let’s take a look at our Gaussian process next to the points we have sampled and the true function value:
predGP <- function(x,grab) {
  predict(gp,data.frame(input=x),type = "UK")[[grab]]
}

a = 1
cl = "purple"

plotGP <- function(grab,cl,a) {
  stat_function(
    fun=predGP,args=list(grab=grab),color=cl,alpha=a,n=1000
  )
}

p + ggtitle("Gaussian Process Results") +
  plotGP("mean",cl,a) +
  plotGP("lower95",cl,a) +
  plotGP("upper95",cl,a)
The Gaussian process allows us to define a normal distribution of the output for each input. In the picture above, the purple lines show the Gaussian process. The middle line is the mean, and the upper/lower lines bound the 95% interval of the normal distribution at that input. So, for example, if we wanted to know how we assume the output is distributed at input = 30, we could do:
predict(gp,data.frame(input=30),type="UK")[c("mean","sd")]

$mean
[1] 0.05580301

$sd
[1] 0.007755026
This tells us that we are assuming, at input = 30, our output follows a normal distribution with mean = 0.0558, and sd = 0.0078.
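As a quick sanity check (not in the original article), the lower95/upper95 curves plotted earlier should line up with the usual mean ± 1.96 · sd interval of this predictive normal distribution:
# Sketch: recover the plotted 95% bounds from the predictive mean and sd
pred30 <- predict(gp, data.frame(input = 30), type = "UK")
c(lower = pred30$mean - 1.96 * pred30$sd,
  upper = pred30$mean + 1.96 * pred30$sd)
# Compare with pred30$lower95 and pred30$upper95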
Now that we have defined our assumption about the distribution of the output, we need to determine where to sample the function next. To do this, we need to define how ‘promising’ an input is. We do this by defining an acquisition function. There are several to choose from; common choices include the probability of improvement, the expected improvement, and the upper confidence bound.
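For comparison, here is a sketch of one common alternative, expected improvement, written against the predGP helper defined earlier. It is shown only for illustration and is not used in the rest of this walkthrough:
# Sketch of expected improvement (illustration only; not used below)
expectedImprovement <- function(x) {
  gpMean <- predGP(x, grab = "mean")
  gpSD   <- predGP(x, grab = "sd")
  best   <- max(func_results$output)   # best output sampled so far
  z      <- (gpMean - best) / gpSD
  (gpMean - best) * pnorm(z) + gpSD * dnorm(z)
}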
Of these, the upper confidence bound is the easiest to implement, so let’s define the function and plot it on our chart:
ucb <- function(x,kappa=3) {
  gpMean <- predGP(x,grab="mean")
  gpSD <- predGP(x,grab="sd")
  return(gpMean + kappa * gpSD)
}

a = 0.25

p + ggtitle("Upper Confidence Bound") +
  plotGP("mean",cl,a) +
  plotGP("lower95",cl,a) +
  plotGP("upper95",cl,a) +
  stat_function(fun=ucb,color="blue")
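As a brief aside, the kappa argument controls how strongly the bound rewards uncertainty. A quick illustrative comparison (colors chosen arbitrarily, not from the original article) shows that a larger kappa pushes the search toward less-explored regions:
# Illustration: a greedier bound (kappa = 1) vs. a more exploratory one (kappa = 5)
p + ggtitle("Effect of kappa on the Upper Confidence Bound") +
  stat_function(fun = ucb, args = list(kappa = 1), color = "blue") +
  stat_function(fun = ucb, args = list(kappa = 5), color = "red")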
We can see that our upper confidence bound is maximized (green diamond) somewhere between 10 and 15, so let’s find the specific spot, sample it, and update our GP:
# Find exact input that maximizes ucb
acqMax <- optim(
    par = 12
  , fn = ucb
  , method = "L-BFGS-B"
  , control = list(fnscale = -1)
  , lower = 10
  , upper = 20
)$par

# Run our function at this spot
func_results <- rbind(
    func_results
  , data.frame(input = acqMax,output = func(acqMax))
)
We have just completed one iteration of Bayesian optimization! If we continued to run more, we would see our chart evolve:
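Here is a rough sketch of what continuing to run more could look like: re-fit the Gaussian process, maximize the acquisition function, sample, and repeat. The multi-start for optim and the number of extra iterations are assumptions added for illustration; they are not part of the original code:
# Rough sketch of a few more iterations (assumed multi-start and iteration count)
for (i in 1:5) {
  # Re-fit the Gaussian process to everything sampled so far
  gp <- km(
      design = data.frame(input = func_results$input)
    , response = func_results$output
    , scaling = TRUE
  )

  # Maximize the upper confidence bound from several starting points
  starts <- c(5, 15, 25, 35, 45)
  candidates <- sapply(starts, function(s) {
    optim(
        par = s
      , fn = ucb
      , method = "L-BFGS-B"
      , control = list(fnscale = -1)
      , lower = 0
      , upper = 50
    )$par
  })
  nextInput <- candidates[which.max(ucb(candidates))]

  # Sample the function at the most promising input
  func_results <- rbind(
      func_results
    , data.frame(input = nextInput, output = func(nextInput))
  )
}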
Implementing In Parallel
We won’t implement this part from scratch. Instead, we will use the ParBayesianOptimization R package to do the heavy lifting. This package allows us to sample multiple promising points at once. If there is only 1 promising point, it samples the surrounding area multiple times. So, in our first example, we would sample all 5 of the local maximums of the acquisition function:
Let’s get it up and running and see what comes out. We initialize the process with the 4 same points as above, and then run 1 optimization step with 5 points:
library(ParBayesianOptimization)
library(doParallel)

# Setup parallel cluster
cl <- makeCluster(5)
registerDoParallel(cl)
clusterExport(cl,c('func'))

# bayesOpt requires the function to return a list with Score
# as the metric to maximize. You can return other fields, too.
scoringFunc <- function(input) return(list(Score = func(input)))

# Initialize and run 1 optimization step at 5 points
optObj <- bayesOpt(
    FUN = scoringFunc
  , bounds = list(input=c(0,50))
  , initGrid = list(input=c(5,18,25,44))
  , iters.n = 5
  , iters.k = 5
  , acqThresh = 0
  , parallel = TRUE
)
stopCluster(cl)
registerDoSEQ()

# Print the input and score of the first Epoch
optObj$scoreSummary[,c("Epoch","input","Score","gpUtility")]

   Epoch    input        Score gpUtility
       0  5.00000 0.0107981936        NA
       0 18.00000 0.0677578712        NA
       0 25.00000 0.0573468343        NA
       0 44.00000 0.0581564852        NA
       1 35.59468 0.0916401614 0.6558418
       1 50.00000 0.0107985650 0.6326077
       1 13.74720 0.0773487879 0.5417429
       1 21.13259 0.0462167925 0.4734561
       1  0.00000 0.0008863697 0.1961284
Our score summary shows us that bayesOpt ran our 4 initial points (Epoch = 0) and then ran 1 optimization step (Epoch = 1) in which it sampled all 5 of the local optimums of the acquisition function. If we ran for more iterations, we would continue to sample 5 points at a time. If our function to maximize was actually expensive, this would allow us to find the global optimum much faster.
The acqThresh parameter in bayesOpt is crucial to the sampling process. This parameter represents the minimum percentage of the global optimum of the acquisition function that a local optimum must reach for it to be sampled. For example, if acqThresh=0.5, then each local optimum (upper confidence bound, in our case) must be at least 50% of the global optimum, or it will be ignored. We set acqThresh=0, so all local optimums would be sampled.
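For example, the same call as above with acqThresh = 0.5 (an illustrative value) would only follow up on local optima whose upper confidence bound reaches at least half of the global optimum:
# Illustrative value only: ignore local optima below 50% of the global optimum
# (register a parallel cluster first, as above, if parallel = TRUE)
optObj <- bayesOpt(
    FUN = scoringFunc
  , bounds = list(input = c(0,50))
  , initGrid = list(input = c(5,18,25,44))
  , iters.n = 5
  , iters.k = 5
  , acqThresh = 0.5
  , parallel = TRUE
)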
Draw your attention to the gpUtility field above. This is the scaled value of the acquisition function (the upper confidence bound, in our case) at each of the points sampled. If you notice this value converging to 0 over Epochs, then the Gaussian process’ opinion is that there are not many promising points left to explore. A more thorough, package-specific explanation can be found here.
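If you want to track this, the score summary printed above can be summarized by epoch. A small sketch, assuming (as the printout suggests) that scoreSummary is a data.table with Epoch and gpUtility columns:
# Sketch: average gpUtility per epoch, to watch whether it shrinks over time
library(data.table)
optObj$scoreSummary[, .(meanUtility = mean(gpUtility, na.rm = TRUE)), by = Epoch]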
Final Words
Bayesian optimization is an amazing tool for niche scenarios. In modern data science, it is commonly used to optimize hyper-parameters for black box models. However, being a general function optimizer, it has found uses in many different places. I personally tend to use this method to tune my hyper-parameters in both R and Python.
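To give a sense of how that looks in R, here is a hypothetical sketch of tuning two hyper-parameters with bayesOpt. The parameter names, bounds, and the cvScore() cross-validation helper are all placeholders you would swap for your own model:
# Hypothetical sketch: cvScore() is a placeholder for your own cross-validation routine
scoreModel <- function(max_depth, eta) {
  list(Score = cvScore(max_depth = max_depth, eta = eta))
}

optObj <- bayesOpt(
    FUN = scoreModel
  , bounds = list(max_depth = c(2L, 10L), eta = c(0.01, 0.3))
  , initPoints = 6
  , iters.n = 10
)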
Translated from: https://towardsdatascience.com/a-parallel-implementation-of-bayesian-optimization-2ffcdb2733a2