The recent workshop hosted by EIT Digital in Berlin on 7 and 8 November 2016 presented the latest approaches to supporting the standardisation of reliability and resilience for cloud services.

As research in cloud computing progresses, cloud services are getting closer to becoming a utility like water, electricity, and gas. Although we do not expect outages from utilities, even the largest cloud computing services can experience almost two days of downtime a year, which can cost millions of dollars to resolve. The European Commission is committed to continued standardisation in cloud computing, especially in terms of reliability and resilience, to avoid such outages.
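To put that downtime figure in perspective, it can be converted into an availability percentage. The sketch below is only an illustration: it assumes a 365-day year and reads "almost two days" as a full 48 hours, and the function names are made up for this example.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years (assumption)


def availability_percent(downtime_hours: float) -> float:
    """Share of the year a service was up, as a percentage."""
    return 100.0 * (1.0 - downtime_hours / HOURS_PER_YEAR)


def allowed_downtime_hours(availability: float) -> float:
    """Yearly downtime budget implied by an availability target (in percent)."""
    return HOURS_PER_YEAR * (1.0 - availability / 100.0)


if __name__ == "__main__":
    # Two days of downtime a year works out to roughly 99.45% availability.
    print(f"{availability_percent(48):.2f}%")
    # A "three nines" (99.9%) target leaves under nine hours a year.
    print(f"{allowed_downtime_hours(99.9):.2f} hours")
```

In other words, two days of downtime a year is well short of the "three nines" (99.9%) that would allow fewer than nine hours of outage a year.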

Providers and operators of IT services all face the same challenge: providing a service that is reliable, efficient, cost-effective, and much more besides, all at the same time. These challenges have remained unchanged through decades of technological evolution, from mainframe-based systems such as VAX/VMS and programming languages such as PL/1 and COBOL, still used in the finance sector, to modern cloud infrastructures in a landscape of unprecedented demand and expectations. Cloud computing is often marketed and sold as always available, hyper-flexible, elastic, adaptive, and much more. But how can such cloud services be provided whilst meeting all these challenges?

Any cloud service provider faces this same challenge, irrespective of the size of its operations and the type of its offerings, i.e. Infrastructure-as-a-Service (IaaS), Platform-as-a-Service (PaaS), or Software-as-a-Service (SaaS). Suitable best practices and solutions to these challenges are continually changing, and evolved IT components are coming to providers' rescue.

Excellence of speakers and participants

The mix of speakers at the workshop was strikingly good, ranging from European small- and medium-sized enterprises (SMEs) to large international corporations, and from research and academia to the individual agitator waving the banner of standardisation and the influencing role of public policy: me.

Right from the beginning, when preparing the workshop (and my presentation), it was clear to me that many of the participants would converge on the technical aspects of cloud service reliability and resilience. Make no mistake, these aspects are of the utmost importance when it comes to service operation and control, and to service delivery to the customer.

And what an array of experts and presentations we enjoyed: from the "classic" hyperscale cloud providers such as Google and Microsoft, via newcomers such as T-Systems and our co-host Huawei, to smaller entities and service providers from the network sector (Brocade), and many, many more.

Technical reliability and resilience are in good hands

If you are interested in industry best practices of cloud reliability in terms of skills and business organisation, check out the available slide decks from Google, Huawei, LinkedIn, FlexiOPS, and Brocade/StackStorm, among others. The presentation highlights include "Site Reliability Engineer(ing)" as a business organisational model, and "Fault Injection" as an automated way of testing infrastructure and giving educational feedback to developers (Netflix's Chaos Monkey and other simian helpers are a good example). Last but not least, Microsoft showed how they introduced a form of "Capture The Flag" into site reliability engineering. That is to say, using a ring-fenced part of the infrastructure, one team tries to destabilise or break the system while an opposing team tries its best to keep the system running within defined parameters. Counter-Strike for business! :-)
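For readers who have not met fault injection before, the idea can be sketched as a toy simulation. This is not how Netflix's tooling or any speaker's system works; the `Service` class, the replica names, and the "at least two healthy replicas" rule are all assumptions invented for this illustration.

```python
from __future__ import annotations

import random
from dataclasses import dataclass, field


@dataclass
class Service:
    """Toy service made of named replicas (hypothetical model)."""
    replicas: dict[str, bool] = field(default_factory=dict)  # name -> healthy?
    min_healthy: int = 2  # assumed "defined parameter" the service must meet

    def healthy_count(self) -> int:
        return sum(self.replicas.values())

    def within_parameters(self) -> bool:
        return self.healthy_count() >= self.min_healthy


def inject_fault(service: Service, rng: random.Random) -> str | None:
    """Mark one randomly chosen healthy replica as failed; return its name."""
    healthy = [name for name, ok in service.replicas.items() if ok]
    if not healthy:
        return None
    victim = rng.choice(healthy)
    service.replicas[victim] = False
    return victim


def run_experiment(service: Service, rounds: int, seed: int = 0) -> list[tuple[str, bool]]:
    """Inject one fault per round and record whether the service still copes."""
    rng = random.Random(seed)  # seeded so a run can be repeated
    log = []
    for _ in range(rounds):
        victim = inject_fault(service, rng)
        if victim is None:
            break
        log.append((victim, service.within_parameters()))
    return log


if __name__ == "__main__":
    svc = Service(replicas={"web-1": True, "web-2": True, "web-3": True})
    for victim, ok in run_experiment(svc, rounds=3):
        print(f"killed {victim}: {'still OK' if ok else 'below minimum'}")
```

In a real setting the "opposing team" would be automation or operators restoring replicas between rounds; the sketch leaves that out to keep the mechanism visible.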

If you are interested in that, read the presentations on SlideShare now. The recorded talks should be published on YouTube soon and listed on EIT Digital's website, so you can follow the speakers and questions. You will not be disappointed and will no doubt find what you need.

Resilience and reliability are more than technology

However, just as I expected, there was one aspect missing from the workshop, and I am happy EIT Digital invited me to fill that gap with my presentation on the role of standards, business strategy, and public policy in cloud reliability and resilience.

Broadly speaking, there are three aspects of resilience and reliability that must be considered:

 (1) Technical reliability, which many speakers and participants have covered in detail and with great expertise;

 (2) Business strategy and market mechanics, which frequently lead to best practices and to convergence from a wide variety of options to far fewer alternatives; this happens when a market consolidates and matures towards a commodity or utility;

 (3) Public policy, representing situations in which dysfunctional markets must be regulated (i.e. forced standardisation) to protect the interest of the public, for example against vendor lock-in.

To be clear, I am not advocating gagging vendors or stifling the market. When businesses collaborate, it is their freedom, and their risk, to make decisions in their own interest, including wrong decisions that put the organisation in grave peril.

Clawing back the term standardisation

Standardisation is an overused term, which I wish to claim back for what it should be used for exclusively: a publicly written and agreed specification of interaction (at the technical or policy level), supported and committed to by multiple stakeholders, where ownership and IPR are with the public. "Open standards" typically come with FRAND (fair, reasonable and non-discriminatory) licences for anyone interested in implementing them. Often developed by community-level Standards Development Organisations (SDOs), they may be adopted by Standards Setting Organisations (SSOs) into "de jure" standards that may become public policy. Everything else should be called what it is, entirely without bias or judgement: "industry best practices".

Resistance is futile.  You will be standardised!

Last but not least, I would like to draw attention to an aspect of reliability and resilience that often goes entirely unnoticed or is disregarded: to prevent dysfunctional markets and maintain a healthy overall economy that is resilient against systemic market failures, public policy needs to foster and nurture a strong SME-driven market. If there is one lesson to be learned from the financial crisis of 2008, it is the systemic threat posed by large, too-big-to-fail organisations on the verge of bankruptcy. This is the true aim and thrust of the Digital Single Market (DSM): to nurture a strong SME-driven digitisation of European industry. After all, statistics show that SMEs represent around 99% of all enterprises and account for around two-thirds of total employment in all EU countries and in Norway.

Even though the failure of an individual SME is disastrous for its own employees, overall such a market is incredibly resilient and reliable.

Building a data economy in Europe is impossible without open standards supported by the right public policy. In fact, the European Commission has four initiatives that clearly define actions concerning cloud standardisation:

  1. Communication on Priorities for ICT Standardisation

  2. Communication on a European Cloud Initiative

  3. Rolling Plan for ICT Standardisation

  4. EU Catalogue of ICT Standards

More detail is also available on the Digital Single Market webpage for standardisation.