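Reposted

[Translated] Google's 5 Principles for Cloud-Native Architecture

Translator: Mandy | Proofreader: James | Editor: Hayley

Editor's note: This article focuses on Google's "cloud-native architecture". It explains what cloud-native means in accessible terms and walks through five principles for building such systems. We hope readers who are new to the topic will come away with a working understanding of cloud-native architecture.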

At Google Cloud, we often throw around the term ‘cloud-native architecture’ as the desired end goal for applications that you migrate or build on Google Cloud Platform (GCP). But what exactly do we mean by cloud-native? More to the point, how do you go about designing such a system?

At a high level, cloud-native architecture means adapting to the many new possibilities (and the very different set of architectural constraints) offered by the cloud compared to traditional on-premises infrastructure. Consider the high-level elements that we as software architects are trained to consider:

- The functional requirements of a system (what it should do, e.g., 'process orders in this format...')

- The non-functional requirements (how it should perform, e.g., 'process at least 200 orders a minute')

- Constraints (what is out of scope to change, e.g., 'orders must be updated on our existing mainframe system')

While the functional aspects don't change too much, the cloud offers, and sometimes requires, very different ways to meet non-functional requirements, and imposes very different architectural constraints. If architects fail to adapt their approach to these different constraints, the systems they architect are often fragile, expensive, and hard to maintain. A well-architected cloud-native system, on the other hand, should be largely self-healing, cost-efficient, and easily updated and maintained through Continuous Integration/Continuous Delivery (CI/CD).

The good news is that cloud is made of the same fabric of servers, disks and networks that makes up traditional infrastructure. This means that almost all of the principles of good architectural design still apply for cloud-native architecture. However, some of the fundamental assumptions about how that fabric performs change when you’re in the cloud. For instance, provisioning a replacement server can take weeks in traditional environments, whereas in the cloud, it takes seconds—your application architecture needs to take that into account.

In this post we set out five principles of cloud-native architecture that will help to ensure your designs take full advantage of the cloud while avoiding the pitfalls of shoe-horning old approaches into a new platform.

Principles for cloud-native architecture

The principle of architecting for the cloud, a.k.a. cloud-native architecture, focuses on how to optimize system architectures for the unique capabilities of the cloud. Traditional architecture tends to optimize for a fixed, high-cost infrastructure, which requires considerable manual effort to modify. Traditional architecture therefore focuses on the resilience and performance of a relatively small fixed number of components. In the cloud, however, such a fixed infrastructure makes much less sense because cloud is charged based on usage (so you save money when you can reduce your footprint) and it’s also much easier to automate (so automatically scaling up and down is much easier). Therefore, cloud-native architecture focuses on achieving resilience and scale through horizontal scaling, distributed processing, and automating the replacement of failed components. Let’s take a look.

Principle 1: Design for automation

Automation has always been a best practice for software systems, but cloud makes it easier than ever to automate the infrastructure as well as the components that sit above it. Although the upfront investment is often higher, favouring an automated solution will almost always pay off in the medium term, not only in terms of effort but also in terms of the resilience and performance of your system. Automated processes can repair, scale, and deploy your system far faster than people can. As we discuss later on, architecture in the cloud is not a one-shot deal, and automation is no exception: as you find new ways that your system needs to take action, you will find new things to automate.

Some common areas for automating cloud-native systems are:

Infrastructure: Automate the creation of the infrastructure, together with updates to it, using tools like Google Cloud Deployment Manager or Terraform.

Continuous Integration/Continuous Delivery: Automate the build, testing, and deployment of the packages that make up the system by using tools like Google Cloud Build, Jenkins and Spinnaker. Not only should you automate the deployment, you should strive to automate processes like canary testing and rollback.

Scale up and scale down: Unless your system load almost never changes, you should automate the scale up of the system in response to increases in load, and scale down in response to sustained drops in load. By scaling up, you ensure your service remains available, and by scaling down you reduce costs. This makes clear sense for high-scale applications, like public websites, but also for smaller applications with irregular load, for instance internal applications that are very busy at certain periods, but barely used at others. For applications that sometimes receive almost no traffic, and for which you can tolerate some initial latency, you should even consider scaling to zero (removing all running instances, and restarting the application when it's next needed).

Monitoring and automated recovery: You should bake monitoring and logging into your cloud-native systems from inception. Logging and monitoring data streams can naturally be used for monitoring the health of the system, but can have many uses beyond this. For instance, they can give valuable insights into system usage and user behaviour (how many people are using the system, what parts they’re using, what their average latency is, etc). Secondly, they can be used in aggregate to give a measure of overall system health (e.g., a disk is nearly full again, but is it filling faster than usual? What is the relationship between disk usage and service uptake? etc). Lastly, they are an ideal point for attaching automation. Now when that disk fills up, instead of just logging an error, you can also automatically resize the disk to allow the system to keep functioning.
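
The scaling and automated-recovery items above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how any particular autoscaler or monitoring product works: the function names, the `DiskSample` shape, the 90% usage threshold, and the 1.5x growth factor are all hypothetical, and a real system would also wait for a drop in load to be sustained before scaling down.

```python
import math
from dataclasses import dataclass
from typing import Optional


@dataclass
class DiskSample:
    """One monitoring sample for a disk (hypothetical shape)."""
    used_gb: float
    size_gb: float


def desired_replicas(requests_per_sec: float, target_per_replica: float,
                     max_replicas: int, allow_zero: bool = False) -> int:
    """Return how many replicas the current load calls for."""
    if requests_per_sec <= 0:
        # Scale to zero only if the caller tolerates start-up latency.
        return 0 if allow_zero else 1
    needed = math.ceil(requests_per_sec / target_per_replica)
    # Never drop below one replica under load, never exceed the cap.
    return max(1, min(needed, max_replicas))


def disk_resize_target(sample: DiskSample, threshold: float = 0.9,
                       growth_factor: float = 1.5) -> Optional[float]:
    """Return a new disk size in GB if usage crossed the threshold, else None."""
    if sample.used_gb / sample.size_gb >= threshold:
        return sample.size_gb * growth_factor
    return None


# Example: 450 req/s with a target of 100 req/s per replica needs 5 replicas.
replicas = desired_replicas(450, target_per_replica=100, max_replicas=10)
# Example: a 100 GB disk that is 95% full would be grown to 150 GB.
new_size = disk_resize_target(DiskSample(used_gb=95, size_gb=100))
```

The point of the sketch is that the same monitoring data that feeds a dashboard can feed a decision function, so the system acts on the signal instead of only logging it.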

Principle 2: Be smart with state

Storing of 'state', be that user data (e.g., the items in the user's shopping cart, or their employee number) or system state (e.g., how many instances of a job are running, what version of code is running in production), is the hardest aspect of architecting a distributed, cloud-native architecture. You should therefore architect your system to be intentional about when, and how, you store state, and design components to be stateless wherever you can.

Stateless components are easy to:

Scale: To scale up, just add more copies. To scale down, instruct instances to terminate once they have completed their current task.

Repair: To 'repair' a failed instance of a component, simply terminate it as gracefully as possible and spin up a replacement.

Roll-back: If you have a bad deployment, stateless components are much easier to roll back, since you can terminate them and launch instances of the old version instead.

Load-balance across: When components are stateless, load balancing is much simpler since any instance can handle any request. Load balancing across stateful components is much harder, since the state of the user's session typically resides on the instance, forcing that instance to handle all requests from a given user.
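
To make the properties above concrete, here is a minimal Python sketch of a stateless component. It assumes the user's cart lives in a shared external store; `SessionStore` and `CartHandler` are hypothetical names, and the in-memory dictionary merely stands in for whatever managed database or cache a real system would use.

```python
from itertools import cycle


class SessionStore:
    """Stand-in for an external state store shared by all instances.

    Hypothetical: a real system would use a managed database or cache.
    """

    def __init__(self) -> None:
        self._carts = {}

    def add_item(self, user_id: str, item: str) -> list:
        cart = self._carts.setdefault(user_id, [])
        cart.append(item)
        return list(cart)


class CartHandler:
    """A stateless component: it keeps no per-user data on the instance."""

    def __init__(self, name: str, store: SessionStore) -> None:
        self.name = name
        self.store = store

    def handle(self, user_id: str, item: str) -> list:
        return self.store.add_item(user_id, item)


store = SessionStore()
instances = [CartHandler(f"instance-{i}", store) for i in range(3)]

# Round-robin load balancing: any instance can serve any request.
balancer = cycle(instances)
next(balancer).handle("alice", "book")
cart = next(balancer).handle("alice", "pen")  # served by a different instance

# "Repair" by replacement: discard an instance and start a fresh one.
# No user state is lost because none of it lived on the instance.
instances[0] = CartHandler("instance-replacement", store)
cart_after_repair = instances[0].handle("alice", "lamp")
```

Because the two requests for the same user are served by different instances and still see one cart, scaling, repair, and roll-back reduce to starting and stopping interchangeable copies.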

Principle 3: Favor managed services

Cloud is more than just infrastructure. Most cloud providers offer a rich set of managed services, providing all sorts of functionality that relieve you of the headache of managing the backend software or infrastructure. However, many organizations are cautious about taking advantage of these services because they are concerned about being 'locked in' to a given provider. This is a valid concern, but managed services can often save the organization hugely in time and operational overhead.

Broadly speaking, the decision of whether or not to adopt managed services comes down to portability vs. operational overhead, in terms of both money and skills. Crudely, the managed services that you might consider today fall into three broad categories:

Managed open source or open source-compatible services: Services that are managed open source (for instance Cloud SQL) or offer an open-source compatible interface (for instance Cloud Bigtable). This should be an easy choice since there are a lot of benefits in using the managed service, and little risk.

Managed services with high operational savings: Some services are not immediately compatible with open source, or have no immediate open source alternative, but are so much easier to consume than the alternatives that they are worth the risk. For instance, BigQuery is often adopted by organizations because it is so easy to operate.

Everything else: Then there are the hard cases, where there is no easy migration path off of the service, and it presents a less obvious operational benefit. You’ll need to examine these on a case-by-case basis, considering things like the strategic significance of the service, the operational overhead of running it yourself, and the effort required to migrate away.

However, practical experience has shown that most cloud-native architectures favor managed services; the potential risk of having to migrate off of them rarely outweighs the huge savings in time, effort, and operational risk of having the cloud provider manage the service, at scale, on your behalf.

Principle 4: Practice defense in depth

Traditional architectures place a lot of faith in perimeter security, crudely a hardened network perimeter with 'trusted things' inside and 'untrusted things' outside. Unfortunately, this approach has always been vulnerable to insider attacks, as well as external threats such as spear phishing. Moreover, the increasing pressure to provide flexible and mobile working has further undermined the network perimeter.

Cloud-native architectures have their origins in internet-facing services, and so have always needed to deal with external attacks. Therefore they adopt an approach of defense-in-depth by applying authentication between each component, and by minimizing the trust between those components (even if they are 'internal'). As a result, there is no 'inside' and 'outside'.

Cloud-native architectures should extend this idea beyond authentication to include defenses such as rate limiting and protection against script injection. Each component in a design should seek to protect itself from the other components. This not only makes the architecture very resilient, it also makes the resulting services easier to deploy in a cloud environment, where there may not be a trusted network between the service and its users.
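
As a rough illustration of a component protecting itself from its callers, the Python sketch below authenticates every request with an HMAC signature and applies a per-caller rate limit, even though the callers are 'internal'. Everything here is an assumption made for the example: the `OrderService` name, the shared-key scheme, and the token-bucket parameters are hypothetical, and a production system would more likely rely on its platform's identity and quota features.

```python
import hashlib
import hmac
import time


class TokenBucket:
    """Simple token-bucket rate limiter for one caller."""

    def __init__(self, capacity: int, refill_per_sec: float) -> None:
        self.capacity = capacity
        self.refill_per_sec = refill_per_sec
        self.tokens = float(capacity)
        self.updated = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill in proportion to elapsed time, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.updated) * self.refill_per_sec)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


def sign(key: bytes, body: bytes) -> str:
    """Compute an HMAC-SHA256 signature for a request body."""
    return hmac.new(key, body, hashlib.sha256).hexdigest()


class OrderService:
    """Hypothetical component that trusts no caller by default."""

    def __init__(self, caller_keys: dict, capacity: int, refill_per_sec: float) -> None:
        self.caller_keys = caller_keys
        self.buckets = {name: TokenBucket(capacity, refill_per_sec)
                        for name in caller_keys}

    def handle(self, caller: str, body: bytes, signature: str) -> str:
        key = self.caller_keys.get(caller)
        # Constant-time comparison avoids leaking how much of the signature matched.
        if key is None or not hmac.compare_digest(
                sign(key, body).encode(), signature.encode()):
            return "rejected: unauthenticated"
        if not self.buckets[caller].allow():
            return "rejected: rate limited"
        return "accepted"


service = OrderService({"billing": b"shared-secret"}, capacity=2, refill_per_sec=0.0)
body = b'{"order": 1}'
result = service.handle("billing", body, sign(b"shared-secret", body))
```

In this sketch authentication is checked before the rate limit, so unauthenticated requests do not consume a legitimate caller's quota; that ordering is a design choice, not a requirement.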

Principle 5: Always be architecting

One of the core characteristics of a cloud-native system is that it’s always evolving, and that's equally true of the architecture. As a cloud-native architect, you should always seek to refine, simplify and improve the architecture of the system, as the needs of the organization change, the landscape of your IT systems changes, and the capabilities of your cloud provider itself change. While this undoubtedly requires constant investment, the lessons of the past are clear: to evolve, grow, and respond, IT systems need to live and breathe and change. Dead, ossifying IT systems rapidly bring the organization to a standstill, unable to respond to new threats and opportunities.

The only constant is change

In the animal kingdom, survival favors those individuals who adapt to their environment. This is not a linear journey from 'bad' to 'best' or from 'primitive' to 'evolved'; rather, everything is in constant flux. As the environment changes, pressure is applied to species to evolve and adapt. Similarly, cloud-native architectures do not replace traditional architectures, but they are better adapted to the very different environment of cloud. Cloud is increasingly the environment in which most of us find ourselves working, and failure to evolve and adapt, as many species can attest, is not a long-term option.

The principles described above are not a magic formula for creating a cloud-native architecture, but hopefully provide strong guidelines on how to get the most out of the cloud. As an added benefit, moving and adapting architectures for cloud gives you the opportunity to improve and adapt them in other ways, and make them better able to adapt to the next environmental shift. Change can be hard, but as evolution has shown for billions of years, you don't have to be the best to survive—you just need to be able to adapt.

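About us

Google Developer Groups (GDG) is a global program initiated by Google's developer team: a non-profit developer community for people interested in Google and open-source technologies. GDG Shanghai was founded in 2009 and is one of the most active and well-known technical communities among GDG chapters worldwide. It holds 30 to 50 technology events of various sizes each year, reaching well over a hundred thousand developers and technology professionals annually in Shanghai and the surrounding Yangtze River Delta.

The community's organizers are all internet professionals from various industries who have day jobs of their own, and we need more new blood! If you are interested in Google technologies, have spare time you can arrange flexibly, share the community's values, and are willing to contribute, you are welcome to join us as a community volunteer!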