Behind the scenes of a major infrastructure company

Online has a very specific and unsung job.
Behind the incredible ease and speed you feel when requesting a physical or virtual server in a few clicks, nothing is virtual.

For about three months now, some of our customers have been disappointed when trying to order new BareMetal Dedibox servers, especially in our mid-range to high-end server families, PRO & WOPR: none of them were available, as we were unfortunately out of stock on these offers.

Our capacity planning has been impacted by a series of unexpected events and errors that totally disrupted our production. We failed in our mission of delivering on-demand infrastructure to our customers and are fully aware of our responsibility in this failure. It's the second time we have faced this situation in twelve years.

In this blog post we will try to be transparent and explain the challenges we faced in sustaining our growth and the reasons behind this very embarrassing situation.

0 - A little bit of history

Twelve years ago, we made it a firm principle to master 100% of the technology behind our job. Our goal was to control our whole infrastructure and to avoid any compromise on your precious data.

In 2006, when we started our industrial hosting activity, we decided to control the entire production pipeline. We took this decision for two main reasons: to provide high-quality services controlled from start to finish, and to offer our customers the best possible responsiveness.

Since then, our teams have designed, built and operated our own data centers. We use our own European optical-fibre network to interconnect them. A large part of our products is designed internally at the Online Labs and manufactured in our factory in Laval. The rest comes from well-known hardware providers: Dell, Quanta Computing, HP or Supermicro.

We operate large-scale industrial infrastructures with critical quality constraints. Our daily units are hectares, megawatts, exabytes, kilonewtons, Tb/s. We are one of the largest computer assemblers in Europe and are in the top 10[1] of the biggest infrastructure providers.

At Online, 150 people are working for you every day. 19 different jobs are represented, from refrigeration engineers and electronics engineers to support specialists and low-level developers, all doing incredible things every day to let you build & manage your infrastructure in seconds.

1 - Dedibox servers are a smashing success

The surprising growth

For several years now, we have been seeing a major shift in what the big hosting providers offer.
Long-standing players focus more and more on high-margin cloud products and are abandoning the BareMetal market, which yields lower margins and requires larger financial investments.

The manufacturers' new products are less attractive than before, CPU prices are rising and the DRAM & NAND markets are facing an unprecedented crisis, which accelerates this transformation of the market.

Today, we are the only provider in the world offering recent server configurations in high volumes for less than €20 per month. Our selling prices have always been extremely competitive and offer the best price-quality-performance ratio on the market.

Consequently, we have experienced incredible growth, four times higher than our forecasts, across all our product families and especially in our mid-range to high-end servers.

2 - Data centers scaling

More sales also mean more data center space

DC2, DC3, DC4, DC5

Amsterdam

In June 2016, we announced our first facility outside of France, in Amsterdam. The demand for AMS1 was amazing and we were out of stock within a week. Since then, we have offered as many servers as we could, but not fast enough to satisfy all the demand. We've now reached full network and power capacity.

To increase the capacity in our Amsterdam facility, we are performing the following operations:

  • We purchased five new 100Gbit/s links between our core network in Paris DC3 and Amsterdam AMS1 to cover our growth for the coming months. In the meantime, however, our link provider upgraded all its network equipment and was hit by multiple issues on its European optical fibres. The latest issue occurred on 3 July 2017, when five optical fibres were destroyed in Belgium. These incidents have delayed our deployment of additional capacity.

  • We will upgrade our backbone in AMS1 with two Cisco ASR9910 to offer a high availability zone and increase the density of 100Gbit/s uplinks. However, some paperwork issues are adding extra delays to the deployment.

Amsterdam backbone Aug 2016

Today, the situation in Amsterdam is starting to stabilise. We will add 350kW of power capacity and 500Gbit/s of additional network capacity during the summer to handle the demand in this region.

Paris

At the same time in Paris, we reached 100% of the capacity of all our data centers:

  • DC4 has been delayed multiple times due to missing administrative authorizations. We hope all the authorizations will be approved by the end of 2017. The nuclear fallout shelter has opened, has been in production since 1 July and is currently dedicated to our cold storage platform, C14.

DC4 nuclear fallout shelter before / after

  • At DC3, we deployed the capacity extension in September 2016 as expected. This extension increased the power, cooling and space capacity by a factor of two and was supposed to support two years of growth. But five weeks after the opening, everything had been filled due to the huge demand, mainly on our high-end services.

DC3 before / after

  • DC2 has been full since 2015, but we are optimizing the facility by retrofitting the first-generation data halls and increasing their density.

  • To meet the constantly increasing demand, we acquired a new building, DC5. This hyperscale facility has extraordinary characteristics and is one of the biggest investments we have ever made. It offers three times the capacity of DC3 and is a decisive project to sustain our growth for the next seven years. We finished the design in December 2016 and construction is ongoing. DC5 will be one of the biggest data centers in Europe: it will deliver up to 20.8MW of net IT power in January 2018 with a target PUE of 1.1. Last week, we opened a first room in DC5 with a capacity of 250kW.

DC5 first room

  • We will provision 500kW capacity in a partner facility in Paris until the DC5 launch.
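To put the DC5 figures in perspective, PUE (power usage effectiveness) is the ratio of total facility power to IT power. The short Python sketch below works out what a PUE of 1.1 would imply for 20.8MW of IT load. It assumes the target PUE is actually reached at full load, and the function and constant names are illustrative only.

```python
def facility_power_mw(it_power_mw: float, pue: float) -> float:
    """Total facility power implied by an IT load and a PUE (PUE = total / IT)."""
    if pue < 1.0:
        raise ValueError("PUE cannot be below 1.0")
    return it_power_mw * pue


# Figures quoted for DC5 in this post.
DC5_IT_POWER_MW = 20.8
DC5_TARGET_PUE = 1.1

# Assumption: the target PUE holds when the site runs at full IT load.
total = facility_power_mw(DC5_IT_POWER_MW, DC5_TARGET_PUE)
overhead = total - DC5_IT_POWER_MW  # cooling, power distribution, lighting...

print(f"Total facility power: {total:.2f} MW")
print(f"Non-IT overhead:      {overhead:.2f} MW")
```

Under that assumption, roughly 2MW out of about 22.9MW would go to everything other than the servers themselves.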

The biggest challenge in the data center industry is lead time. It takes between twelve and eighteen months to design and build a data center. We have always rejected the idea of delivering cheap data centers to sustain our growth: reducing costs at the expense of our customers has never been an option. This decision was recognized by a certification delivered by the Uptime Institute in 2014.

All our forecasts have been surpassed: in only a few years we've filled more than 8500 racks and 42k square meters of data center space. With DC5, we plan to stay one step ahead and sustain our growth over the medium term.

3 - The hardware failure

Back in February 2017, we were alerted by one of our suppliers to an erratum concerning a component used in some of our entry-level servers (Dedibox SC 2016, Dedibox XC 2016, Dedibox XC 2015, Scaleway C2S, C2M, C2L, VC1S, VC1M, VC1L). This erratum causes the component to degrade at an accelerated rate, reducing its lifetime. So far, we have not seen any occurrence of this erratum in our data centers. Since then, our supplier has completely stopped production of this component. We are now waiting for deliveries of the upgraded and fixed component to continue our production. This is currently causing stock issues on the servers listed above.

At the same time, the Online Labs team worked to bring forward the release of our 2018 products. We were able to accelerate the design of these new servers and they are now ready for production.
Our factory in France is currently starting the production of the electronic boards for our next generation of servers. We plan to deploy these new servers in our data centers in October.

4 - DRAM and NAND market crisis impacts delivery

More sales also mean more RAM and SSDs in a volatile market under pressure.

In October 2016, we started facing a major issue with one of our component suppliers. RAM and SSD prices rose week after week and delivery times were no longer guaranteed, which started to disrupt our production pipeline. Currently, the situation is worse than we expected. More than the surge in prices, the problem is that we don't even know when, or if, we will receive all the SSDs and RAM we order.

To minimize costs and improve our time to market, we use a lean manufacturing strategy. This strategy allows us to move fast and upgrade our hardware frequently. It offers many advantages, except when this kind of scenario happens. Today, the situation is still difficult, but we are adapting our supply chain to it, even though we have no visibility on our suppliers' delivery lead times. We continue to receive what we order, but delivery times are really unstable and orders can be delayed by more than three months. By way of illustration, we are only now partially receiving orders we placed six months ago.

Since May, we have been trying to secure our SSD, HDD and RAM stocks to meet demand in the coming months, and we now work with three different manufacturers to have a backup solution in case of delivery issues. All our efforts are still not enough to satisfy market demand. Our supply chain team is doing magic every day to improve the situation, which is still critical today.

5 - IPv4 shortage

More sales also mean more IPv4

Nothing is faster than the speed of light... except maybe the rate at which we are using up our IPv4 reserve. As you probably know, there is a shortage of IPv4 addresses, and acquiring IPv4 ranges is increasingly complex and expensive. As with data centers, IPv4 is a very hard point to scale, unless you buy addresses on the black market. The majority of available IPv4 ranges are owned by governments and administrations, and dealing with them is a very long legal process that can take up to 11 months to succeed. Brexit froze 12 months of negotiations that we were finalizing. We currently own 311 238 IPv4 addresses, 94% of which are in use. We are acquiring three new /16 IPv4 ranges (196 602 IPs) to keep a healthy reserve for the coming months for both Online & Scaleway. We hope to conclude this acquisition before the end of summer.
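The address arithmetic above can be checked with a short Python sketch using the standard `ipaddress` module. The 10.0.0.0/16 prefix is only a placeholder for "some /16", not one of the ranges being acquired, and the post-acquisition utilization assumes usage stays at today's level. Note that the 196 602 figure matches three /16 ranges when the network and broadcast addresses of each range are excluded.

```python
import ipaddress

# Figures quoted in this post.
OWNED = 311_238        # IPv4 addresses currently owned
UTILIZATION = 0.94     # share of that space in use
NEW_BLOCKS = 3         # number of /16 ranges being acquired

# Placeholder /16 (private range), used only to count addresses.
block = ipaddress.ip_network("10.0.0.0/16")
total_per_block = block.num_addresses      # 2**16 addresses
usable_per_block = total_per_block - 2     # minus network and broadcast

used = round(OWNED * UTILIZATION)
free = OWNED - used
added_total = NEW_BLOCKS * total_per_block
added_usable = NEW_BLOCKS * usable_per_block

# Assumption: usage stays constant while the new ranges are added.
utilization_after = used / (OWNED + added_usable)

print(f"In use today:       {used}")
print(f"Free today:         {free}")
print(f"Added (total):      {added_total}")
print(f"Added (usable):     {added_usable}")
print(f"Utilization after:  {utilization_after:.1%}")
```

With fewer than 19 000 addresses left free today, the three new ranges would bring utilization down to a little under 60% under that assumption.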

6 - Supply chain transformation

The phoenix must burn to emerge. -Janet Fitch

Back in July, we decided to revamp and industrialize our supply chain to increase our production speed from 2500 to 6000 servers per month. We moved from a per-site logistic platform to a single, centralized logistic platform for all sites in Paris and Amsterdam.

The logistic platform

Our new supply chain center will be totally up and running by November and will massively increase our daily production.

We underestimated the new setup, which had a direct impact on the delivery of our servers on top of the other issues. We had to completely rethink our processes and methods to scale out the manufacturing process, which is now running at nearly full production capacity. The last adjustments are being finalized.

  • Our logistic platform was delivered in February 2017, three months late.

  • We changed our information system in favor of Odoo. The setup of this new solution required many changes in our organization and a long running-in period before things worked well.

  • We centralized all our stock, previously split between four facilities, at the logistic platform. During this period, more than 200 tons of hardware were transferred and inventoried.

  • The dispatch of spare stocks in the data centers took more time than expected and is still not fully operational. This delay impacts our system and customer success team in their day to day operations.

  • The test benches that qualify hardware after assembly are not yet fully operational, which creates a bottleneck in our server delivery time.

  • Our supply chain team was under capacity, and we did not correctly anticipate the team size needed to succeed in this challenge of rapid industrialization.

Conclusion

In 2006, when we announced Dedibox, it was a real earthquake. We deployed and sold tens of thousands of units in 10 months. We reached our first company milestone: completely filling a data center, DC1, and winning a huge market share. Everything was easy to scale: we had only 1 product, 1 huge and empty data center, 1 simple network and a team of 6 people.

The first issue we encountered was after reaching this insane milestone: we had nothing to sell for the next 22 months, the time needed to build our first self-owned datacenter, DC2.

DC2 before / after

This first sold-out period caused irreversible damage to Online. During this long period, the high demand for dedicated servers that we had created on the market was picked up by our competitors, and we never managed to catch up, even 9 years later.

After this period, we decided to learn from our mistakes and changed everything needed to never end up in such a situation again.
We reworked our brand image and our technical assistance, invested massively in the construction of facilities and deployed our own independent network. We focused on quality and refused to provide wobbly infrastructures and products. This major refactoring led to the success we know today. We currently grow by a factor of two every two years and, for seven years, we delivered all the capacity our growth required without any major difficulty.
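To give a feel for what doubling every two years means for capacity planning, here is a rough Python sketch. It assumes growth is smooth and exponential, which is a simplification, and combines the doubling rate with the twelve to eighteen month data center lead time mentioned earlier in this post. The function name is illustrative.

```python
DOUBLING_PERIOD_YEARS = 2.0  # "a factor of two every two years"


def growth_factor(years: float, doubling_period: float = DOUBLING_PERIOD_YEARS) -> float:
    """Multiplier on demand after `years`, assuming smooth exponential growth."""
    return 2.0 ** (years / doubling_period)


# Equivalent yearly growth rate.
annual = growth_factor(1.0)
print(f"Yearly growth: x{annual:.2f} (about {annual - 1:.0%} per year)")

# How much demand grows while a data center is being designed and built.
for months in (12, 18):
    factor = growth_factor(months / 12)
    print(f"Growth during a {months}-month build: x{factor:.2f}")

# Cumulative growth over seven years at the same rate.
print(f"Growth over seven years: x{growth_factor(7.0):.1f}")
```

Under these assumptions, a facility sized for today's demand at design time would already be too small by roughly 40% to 70% on the day it opens.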

The next step is large-scale industrialization. We anticipated this at the right time, but we faced too many cascading issues to succeed. We were a bit too enthusiastic and underestimated some parts of the work while restructuring the way we work every day. This mistake had a direct impact on our stocks and sales over the last few months.

Our teams work every day to do incredible things and ensure a smooth experience with Online services. The good news is that everything will be back to normal in September and we do not expect to face similar issues in the coming years.

We are more than 150 people and counting working 24h/24 to run a part of the Internet, and, let's face it, it’s not so easy to scale.

We hope you will appreciate our transparency, and we want to thank each one of you who helps us make great and amazing things through your feedback and suggestions.

If you have any questions about this story, please leave us a comment here; we'll be happy to answer.

To sustain our growth, we'll massively open new positions in Paris in the coming months! You can already send us your resume at jobs at online dot net

[1] - Netcraft Hosting Provider by Computer - Jul 2017

Edouard Bonlieu

Strategy and marketing at Online.net & Scaleway