60% cost optimization on autoscaling and 70% savings on renting spot instances
Client’s result
Industry | Retail Fashion
Company size | 1K-5K employees
Service | Migration to AWS, autoscaling configuration, AWS infrastructure support
Location | Ukraine
About INTERTOP
Main challenges
Challenge 1
- The company hosted its system on on-premises servers. This was enough at first, but as the business developed and the website load grew, issues and downtime became more frequent, reaching several incidents a month during marketing activities.
- Hardware limitations, maintenance overhead, complex and time-consuming scaling mechanisms, and significant ongoing costs made it impossible to stick to the current solution.
Challenge 2
- Hosting the system on on-premises servers also caused overspending on infrastructure costs.
- It was quite challenging for the company to control resource allocation and feasibility, scale resources efficiently, cover unforeseen maintenance needs, and reduce energy consumption.
- INTERTOP wanted to spend money on infrastructure efficiently, track costs, and make informed decisions about infrastructure budgeting.
Challenge 3
- Efficient communication and cooperation between DevOps and software development teams is one of the keys to a more productive software development process.
- INTERTOP required DevOps assistance that could strengthen the developers, facilitate work on new features and updates, and shorten time-to-market.
- They didn’t want to hire in-house experts because the company had no relevant internal expertise to supervise them, and there was not enough work for a full-time employee.
What we did
Solution 1
Migration from on-premises to the cloud
As on-premises servers were unable to cope with the growing load on the system, we offered cloud migration.
Our advice was to migrate to AWS right away. However, due to budget constraints and some hesitation, the client first chose a small virtual hosting provider.
Its competitive price was a great advantage, but it had limited server capacity, its service ecosystem and integration opportunities weren’t extensive, and its reliability and uptime lagged behind the standards of the leading global providers.
With this small hosting provider, it was impossible to resize resources automatically. The team had to wait a few hours for additional servers to be provisioned and pay the full rate for idle servers. Still, there were crashes several times a month, sometimes even with data loss, and we had to restore the data from backups. All this was inconvenient and inefficient, so migrating to AWS came as a rescue.
We chose the lift-and-shift migration model to move the system to AWS as is. It is the least time-consuming and least costly model, and it leaves room for further infrastructure improvement and optimization. The migration took us 2 weeks, compared with a potential 2-3 months for rearchitecting.
We moved the system to EC2 instances, configured scalability and load balancing, and set up CI/CD pipelines.
The migration to AWS was completed before Black Friday 2019. This helped the website cope with the upcoming load and with the steady growth of its peak loads.
Since then, the website load has grown yearly, and INTERTOP’s AWS system has been stable, durable, and scalable.
Thanks to the AWS-based solution, INTERTOP successfully went through Black Friday 2019 with the following results:
- 150% traffic growth compared to 2018
- 0.2 s response time
- 150 servers during the peak load
- 0.05% of online profit spent on AWS
Solution 2
System rightsizing and adaptation to the growing website load
There were three main tasks defined within this solution: adapt resources to the growing traffic load, prepare the system, and carry out load testing.
Adapting resources to the load increase consisted of the following steps:
- Tuning autoscaling
- Adding database nodes and increasing their capacity
- Creating separate groups of web servers for the admin panel and API
- Increasing the performance of the web cluster
- Optimizing software code and continuously profiling it to find bottlenecks
Autoscaling tuning
According to official AWS recommendations, scaling should be based on processor load. In practice, however, this turns out to be insufficient. Modern applications are complex enough to require a combined approach, which is what we applied in the solution for INTERTOP.
We considered not only processor load, but also the number of occupied and available PHP workers and the number of connections to a single server.
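The combined approach can be illustrated with a minimal sketch. It is not the configuration used for INTERTOP: the field names, the 60% target, and the fleet limits are assumptions chosen for illustration. The idea is to let the most saturated of the three signals decide how many servers are needed.

```python
import math
from dataclasses import dataclass


@dataclass
class FleetMetrics:
    """Averaged per-server metrics for the web tier (all names are illustrative)."""
    servers: int             # servers currently in service
    cpu_utilization: float   # 0.0-1.0
    busy_php_workers: int    # occupied PHP workers per server
    total_php_workers: int   # configured PHP workers per server
    connections: int         # open connections per server
    max_connections: int     # assumed comfortable connection limit per server


def desired_servers(m: FleetMetrics, target: float = 0.6,
                    min_servers: int = 1, max_servers: int = 150) -> int:
    """Size the fleet so that the most saturated signal returns to `target`."""
    pressures = (
        m.cpu_utilization,
        m.busy_php_workers / m.total_php_workers,
        m.connections / m.max_connections,
    )
    # The busiest resource decides: a server can run out of PHP workers
    # long before its CPU looks loaded.
    worst = max(pressures)
    wanted = math.ceil(m.servers * worst / target)
    return max(min_servers, min(max_servers, wanted))


if __name__ == "__main__":
    metrics = FleetMetrics(servers=10, cpu_utilization=0.45,
                           busy_php_workers=24, total_php_workers=30,
                           connections=200, max_connections=500)
    print(desired_servers(metrics))  # CPU alone would not trigger scaling here
```

In this example CPU sits at 45%, below the assumed target, yet 80% of PHP workers are busy, so the sketch asks for more servers. A CPU-only rule would have left the fleet unchanged.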
Adding and upgrading database nodes
This became a must before big and important events such as sales. We observed that the load on the database grew abruptly during sales, so we also considered it reasonable to keep some capacity in reserve for peak loads.
Creating separate web server groups for the admin panel and API
The admin panel influenced the website greatly. It contained a lot of heavy parts, requests, and pieces of code, and when operators used it intensely, it slowed down the website. When we isolated the admin panel, the problem ended.
It was a very successful decision, as it reduced the total number of servers.
For comparison, these were the numbers before we divided the servers into logical groups:
Night – 4 servers
Day – up to 35 servers
Peak load – up to 70 servers
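The isolation of the admin panel and API described above amounts to routing requests to different server groups. The sketch below shows the idea only. The path prefixes and group names are hypothetical and are not taken from the INTERTOP setup, where such rules would normally live in the load balancer configuration.

```python
# Hypothetical path prefixes and group names, for illustration only.
ROUTES = [
    ("/admin", "admin-servers"),
    ("/api", "api-servers"),
]
DEFAULT_GROUP = "web-servers"


def pick_server_group(path: str) -> str:
    """Return the server group that should handle a request path."""
    for prefix, group in ROUTES:
        # Match the prefix itself or anything below it, but not
        # look-alikes such as "/administrator".
        if path == prefix or path.startswith(prefix + "/"):
            return group
    return DEFAULT_GROUP


if __name__ == "__main__":
    for path in ("/", "/catalog/shoes", "/admin/orders", "/api/v1/cart"):
        print(path, "->", pick_server_group(path))
```

With the groups separated like this, heavy admin requests consume capacity only in their own group, and each group can be scaled on its own load.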
Increasing the performance of the web cluster
The software developers noted that a piece of PHP code ran slowly on the production cluster, while the same code ran 30% faster in the test environment. When we compared the environments, we noticed that part of the production cluster ran on T3A instances with AMD processors. These processors reduced performance in production. When we removed them from the cluster, the issue was fixed.
Later we tested Intel and AMD processors once more, and AMD processors again worked 30% slower, so we had to return to Intel.
Optimizing software code and continuously profiling it to find bottlenecks
This work never stopped. We used a profiler to find bottlenecks, and the developers analyzed and eliminated them, with our assistance if necessary.
System preparation consisted of the following parts:
- Adding database nodes
- Switching Redis and Memcached from regular servers to ElastiCache
- Upgrading storage servers
We conducted load testing after every change in the system and checked everything until we received the expected result.
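A load test of this kind can be sketched as follows. This is an illustrative harness, not the tooling used on the project: `fake_request` is a stand-in for a real HTTP call to the site under test, and the request counts and the 0.2 s threshold (borrowed from the response time reported for Black Friday 2019) are assumptions.

```python
import statistics
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def run_load_test(request: Callable[[], None], total: int, concurrency: int) -> dict:
    """Call `request` `total` times from `concurrency` threads and summarise latency."""
    def timed_call(_: int) -> float:
        start = time.perf_counter()
        request()
        return time.perf_counter() - start

    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = sorted(pool.map(timed_call, range(total)))

    return {
        "requests": len(latencies),
        "median_s": statistics.median(latencies),
        # Nearest-rank 95th percentile of the sorted latencies.
        "p95_s": latencies[max(0, int(len(latencies) * 0.95) - 1)],
        "max_s": latencies[-1],
    }


def fake_request() -> None:
    """Stand-in for an HTTP request to the system under test (assumption)."""
    time.sleep(0.01)


if __name__ == "__main__":
    report = run_load_test(fake_request, total=200, concurrency=20)
    verdict = "OK" if report["p95_s"] <= 0.2 else "TOO SLOW"
    print(report, verdict)
```

Running the same harness after every infrastructure change gives comparable numbers, which is what makes it possible to tell whether a change helped.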
All the changes and improvements made it possible to normalize system provisioning, which also led to significant cost optimization as INTERTOP requested.
Here are the benefits the client received:
- 100% autoscaling
- Ability to add and remove resources dynamically (up to 60 seconds for server deployment)
- Automatic Failover (in case of a failure, AWS needs only 30-120 seconds to restore a backup copy of a service without human intervention)
- Opportunity to use native AWS services in the future
- 60% cost savings on autoscaling
- Up to 70% cost savings by renting Spot Instances
- About 30% savings by purchasing Reserved Instances
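To see how the Spot and Reserved discounts combine, here is a small worked calculation. The discounts come from the list above. The hourly rate, the number of server-hours, and the split between purchase options are hypothetical, and the sketch does not model the separate 60% saving from autoscaling.

```python
def blended_cost(on_demand_rate: float, hours: float,
                 spot_share: float, reserved_share: float,
                 spot_discount: float = 0.70,
                 reserved_discount: float = 0.30) -> float:
    """Cost of `hours` server-hours split across Spot, Reserved and On-Demand."""
    if min(spot_share, reserved_share) < 0 or spot_share + reserved_share > 1:
        raise ValueError("shares must be non-negative and sum to at most 1")
    on_demand_share = 1 - spot_share - reserved_share
    effective_rate = on_demand_rate * (
        spot_share * (1 - spot_discount)
        + reserved_share * (1 - reserved_discount)
        + on_demand_share
    )
    return effective_rate * hours


if __name__ == "__main__":
    # Hypothetical workload: 1,000 server-hours at $0.10 per hour On-Demand.
    baseline = blended_cost(0.10, 1000, spot_share=0.0, reserved_share=0.0)
    mixed = blended_cost(0.10, 1000, spot_share=0.5, reserved_share=0.3)
    print(f"baseline ${baseline:.2f}, mixed ${mixed:.2f}, "
          f"saving {100 * (1 - mixed / baseline):.0f}%")
```

With half of the hours on Spot and 30% on Reserved Instances, the blended bill in this example is about 44% lower than paying On-Demand rates for everything. The actual saving depends on how much of the workload tolerates Spot interruptions.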
Solution 3
Infrastructure support on an hourly basis
We offered hourly paid infrastructure support services that would cover all client’s needs and requests with no fixed monthly budget commitment.
Within this agreement, we worked on regular and emergency tasks, infrastructure optimization, security improvement, etc.
Examples of what we did within our infrastructure support services:
- Optimized architecture and database performance by implementing Aurora Serverless and later Aurora Serverless v2 for finer-grained vertical scaling
- Connected Elasticsearch and SQS
- Dockerized the mobile app and put Docker containers in ECS to unify development and testing and simplify app deployment
- Implemented the microservices architecture for the mobile app
We still provide infrastructure support services to INTERTOP, but they now have an in-house DevOps engineer. Their full-time specialist usually handles regular tasks, while our team provides more complex services along with overall architecture guidance and consulting.
Key Results and Business Value:
1. 100% system scalability
2. 60% cost savings on autoscaling
3. Up to 70% cost savings by renting Spot Instances
4. About 30% savings by purchasing Reserved Instances
5. 0.05% of online profit spent on AWS
6. 0.2 s website response time
7. Increased system reliability (only 2 critical issues happened within 4 years)
Features Delivered:
1. AWS-based infrastructure
2. Improved system performance and reliability
3. System autoscaling
Technologies we used
Amazon EC2
for scalable computing capacity in the AWS cloud.
Amazon RDS
for simplified setup, operation, and scaling of relational databases in the cloud.
Amazon Aurora
a MySQL- and PostgreSQL-compatible relational database, built for the cloud with high performance and availability.
Amazon Aurora Serverless
for automatic database capacity scaling up or down based on application needs.
Amazon ECS
for the deployment and management orchestration of Docker containers.
Amazon Elasticsearch Service
for real-time search, analytics, and logging capabilities.
Amazon SQS
for a reliable, highly scalable hosted queue for storing messages as they travel between computers.
Docker
for application deployment automation inside lightweight, portable containers.
Amazon ElastiCache
for the deployment and operation of in-memory data stores in the cloud.
Redis
for an in-memory key-value store known for high performance and flexibility.
Memcached
for a high-performance, distributed memory object caching system.
Client’s feedback