Author: Louis Tian, Chief Technology Officer (CTO) of KAYTUS
IT infrastructure provider KAYTUS explains how to improve computing power and increase the stability of entire AI clusters through systematic design
Market researchers at MarketsandMarkets predict that the AI market will reach an impressive USD 407 billion in global sales by 2027, significant growth from an estimated USD 86.9 billion in 2022. According to BITKOM, 68% of German companies see great potential in AI. However, 43% of the companies surveyed see themselves as laggards in the use of AI in their day-to-day work, and 38% even believe they have lost touch completely. There is a lot of catching up to do, and fast. But how can companies achieve this and create the necessary IT infrastructure?
Data centers need to meet the future demands of AI applications, including GenAI, autonomous driving, intelligent diagnostics, algorithmic trading, and intelligent customer service. Increasing data generation and the growing demand for compute and faster data transmission are placing an enormous strain on older IT infrastructures. Existing IT architectures are often not suited to rapidly growing data volumes and AI, because AI model development and deployment present several challenges to data centers.
Development and implementation of AI applications
In large-scale computing, the performance gains from tuning individual nodes are limited. System interconnectivity and the joint optimization of algorithms and interconnects therefore become increasingly important. A system-centric approach to building the IT infrastructure is best suited to overcoming the obstacles to AI adoption. When deploying AI, the focus should be on the overall system, with coordination across algorithms, computing power, and data. By integrating computing resources, data resources, R&D deployment environments, and process support, the efficiency and stability of AI development and implementation can be improved from cluster management through training and development to inference applications, expanding innovation paths through full-stack optimization.
This system-centered approach is necessary because different groups of people, including infrastructure management staff, data scientists, and business staff, work together on AI development and applications. IT infrastructure experts attach great importance to the stability of clusters and the optimal use of computing resources. Data scientists focus on the efficiency and stability of model training. Business professionals are concerned with inference and want easy deployment of services and flexible computing resources. Throughout the entire AI process, companies can improve the efficiency and stability of the entire cluster through systematic design, so that they can consistently derive business insights, generate revenue, and maintain their competitiveness.
The biggest challenges that companies should pay attention to when developing and implementing powerful and stable AI applications are:
● GPU utilization
Model training and inference require large amounts of computing power, but the performance of computing platforms often does not grow linearly with added compute and may even degrade. Most LLM training runs achieve a model compute utilization of less than 50%. Companies therefore need a way to allocate resources and workloads through intelligent GPU scheduling: a platform that optimizes computing resource scheduling based on the hardware characteristics of the cluster and the characteristics of the computing load, improving overall GPU utilization and training efficiency.
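As a simplified illustration of what such load-aware scheduling can look like, the following Python sketch places jobs on GPU nodes using a best-fit policy. All node names, capacities, and job sizes are hypothetical, and a production scheduler would also weigh interconnect topology, job priorities, and preemption.

```python
# Minimal sketch of load-aware GPU scheduling (illustrative only):
# each job is placed on the node with the least remaining headroom
# that still fits it (best fit), packing work densely and keeping
# whole GPUs free for large training jobs.

from dataclasses import dataclass, field

@dataclass
class GpuNode:
    name: str
    total_mem_gb: float
    free_mem_gb: float = field(init=False)

    def __post_init__(self):
        self.free_mem_gb = self.total_mem_gb

def schedule(job_mem_gb: float, nodes: list[GpuNode]) -> GpuNode | None:
    # Candidates that can hold the job at all.
    fitting = [n for n in nodes if n.free_mem_gb >= job_mem_gb]
    if not fitting:
        return None  # queue the job until capacity frees up
    # Best fit: smallest remaining headroom after placement.
    best = min(fitting, key=lambda n: n.free_mem_gb - job_mem_gb)
    best.free_mem_gb -= job_mem_gb
    return best

cluster = [GpuNode("gpu-0", 80), GpuNode("gpu-1", 80)]
for mem in (60, 15, 70):  # hypothetical job memory demands in GB
    node = schedule(mem, cluster)
    print(f"{mem} GB -> {node.name if node else 'queued'}")
```

In this toy run, the 15 GB job packs onto the already-loaded node instead of fragmenting the second one, which is what allows the later 70 GB job to still be placed.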
● Task Orchestration
The scheduling performance of large-scale POD tasks is another major challenge. Faced with highly varied and dynamically changing demands for computing resources, users need support for GPU resource allocation, task construction, and task scheduling, as well as methods for dynamically adjusting GPU resource allocation. One approach is a solution that assures rapid startup and environment readiness for hundreds of PODs. Compared to a classic scheduler, throughput can be increased fivefold and latency reduced fivefold, ensuring efficient scheduling and utilization of computing resources for large-scale training.
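One core mechanism behind schedulers of this kind is gang scheduling: a distributed job is only admitted once all of its workers can be placed at the same time, which avoids deadlocks where two jobs each hold half the GPUs and neither can start. The sketch below illustrates the idea; the node names and GPU counts are hypothetical, not an actual scheduler implementation.

```python
# Illustrative gang-scheduling sketch (all names hypothetical): a
# distributed training job is admitted only when every one of its
# workers can be placed at once; otherwise nothing is allocated.

def gang_admit(job_workers: int, gpus_per_worker: int,
               free_gpus_per_node: dict[str, int]) -> list[str] | None:
    placement = []
    remaining = dict(free_gpus_per_node)  # tentative copy
    for _ in range(job_workers):
        # Pick any node with enough free GPUs for one worker.
        node = next((n for n, free in remaining.items()
                     if free >= gpus_per_worker), None)
        if node is None:
            return None  # whole gang does not fit: admit nothing
        remaining[node] -= gpus_per_worker
        placement.append(node)
    # Commit the allocation only after the full gang fits.
    free_gpus_per_node.update(remaining)
    return placement

cluster = {"node-a": 8, "node-b": 8}
print(gang_admit(3, 4, cluster))  # ['node-a', 'node-a', 'node-b']
print(gang_admit(2, 4, cluster))  # None: only 4 GPUs remain, none leaked
```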
● Data transfer speed and efficiency
Another factor that slows down AI development is the speed and efficiency of data transfers. Massive data volumes pose great challenges to data transmission, and efficient data reading maximizes the performance of GPUs and CPUs and improves the overall iteration efficiency of AI models. Features such as local loading and computing of remote data, which eliminate delays caused by network I/O during computation, can accelerate data transfers tremendously. Strategies such as "zero-copy" data transfer, multi-threaded retrieval, incremental data updates, and affinity scheduling significantly reduce data caching cycles. These enhancements greatly improve AI development and training efficiency, resulting in a 2-3x boost in model efficiency during data training.
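The sketch below shows the principle behind one of the techniques mentioned above, multi-threaded retrieval with prefetching: batches are fetched in background threads so that I/O overlaps with computation and the accelerator is not left idle. The fetch and training functions are stand-ins, and the timings are hypothetical.

```python
# Minimal sketch of overlapping data loading with computation
# (hypothetical workload): a thread pool keeps several reads in
# flight while the current batch is being processed.

from concurrent.futures import ThreadPoolExecutor
import time

def fetch_batch(i: int) -> bytes:
    time.sleep(0.1)           # stand-in for remote/disk read latency
    return f"batch-{i}".encode()

def train_step(batch: bytes) -> None:
    time.sleep(0.1)           # stand-in for GPU compute

def run(num_batches: int, prefetch: int = 4) -> None:
    with ThreadPoolExecutor(max_workers=prefetch) as pool:
        # Submit all fetches up front; the pool keeps `prefetch`
        # reads in flight while training consumes results in order.
        futures = [pool.submit(fetch_batch, i) for i in range(num_batches)]
        for fut in futures:
            train_step(fut.result())  # I/O already overlapped

start = time.perf_counter()
run(8)
print(f"{time.perf_counter() - start:.2f}s")  # ~0.9s vs ~1.6s serial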
● Uninterrupted model training
If the training of a large language model (LLM) is interrupted, intervening in the training process and reorganizing the training run is time-consuming and labor-intensive. Frequent cluster anomalies or failures can severely impact the progress of model development. For instance, during the training of Meta's Llama 3.1, the training cluster of 16,000 GPUs experienced a failure every three hours on average. A cluster failback mechanism can reduce downtime in LLM training by quickly rebuilding clusters, restoring component availability, and bringing online services back to their latest state, avoiding the loss of human effort and time in the model training process.
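A minimal sketch of the core idea behind such recovery, assuming a simple checkpoint-and-resume scheme: training state is saved periodically, and after a failure the loop resumes from the last checkpoint instead of starting over. The file layout and names are hypothetical; real systems shard checkpoints across fast storage and restore optimizer and data-loader state as well.

```python
# Hedged sketch of checkpoint-based failure recovery (hypothetical
# names): save state periodically, resume from the latest checkpoint
# on restart rather than from step 0.

import json, os

CKPT = "checkpoint.json"  # real systems shard this across fast storage

def save_checkpoint(step: int, state: dict) -> None:
    tmp = CKPT + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"step": step, "state": state}, f)
    os.replace(tmp, CKPT)   # atomic rename: never a half-written file

def load_checkpoint() -> tuple[int, dict]:
    if not os.path.exists(CKPT):
        return 0, {"loss": None}
    with open(CKPT) as f:
        ckpt = json.load(f)
    return ckpt["step"], ckpt["state"]

def train(total_steps: int, ckpt_every: int = 100) -> None:
    step, state = load_checkpoint()   # resume after a crash or restart
    print(f"resuming at step {step}")
    while step < total_steps:
        state["loss"] = 1.0 / (step + 1)  # stand-in for a real step
        step += 1
        if step % ckpt_every == 0:
            save_checkpoint(step, state)

train(1000)
```

The atomic rename matters: a failure during the write leaves the previous checkpoint intact rather than a corrupt file.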
● Easy deployment
Deployment places high demands on operational staff and is time-consuming and labor-intensive, and a lack of expertise and experience in deploying LLMs makes implementation even more challenging. Platforms and tools for AI development, which form the main production environment of AI technology, carry the mission of lowering the AI deployment threshold. Capabilities such as low-code model fine-tuning, low-code deployment, and low-code application building need to be added to platforms to improve users' overall development efficiency. A complete deployment process template can support the rapid construction and orchestration of service flows around business scenarios, and full-process model and application deployment greatly accelerates the rollout of inference services.
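As a hedged illustration of what such a deployment template can look like, the sketch below expands a few user-facing parameters into a complete, ordered service specification. The field names and steps are hypothetical and not tied to any specific platform.

```python
# Illustrative deployment-template sketch (all names hypothetical):
# the template captures the common deployment steps and is
# instantiated per business scenario with only a few parameters.

from dataclasses import dataclass

@dataclass
class DeploymentTemplate:
    model_name: str
    replicas: int = 2
    gpus_per_replica: int = 1
    max_concurrency: int = 1000

    def render(self) -> dict:
        # Expands the few user-facing knobs into a full service spec.
        return {
            "service": f"{self.model_name}-inference",
            "replicas": self.replicas,
            "resources": {"gpu": self.gpus_per_replica},
            "autoscaling": {"max_concurrent_requests": self.max_concurrency},
            "steps": ["pull-model", "provision-gpus", "start-server",
                      "health-check", "route-traffic"],
        }

spec = DeploymentTemplate("chat-assistant", replicas=4).render()
print(spec["service"], spec["steps"])
```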
“We have seen many of our customers struggle with AI initiatives that are nonetheless essential to their business success,” explains Louis Tian, CTO of KAYTUS. “With customer-centered dedication, we are able to support customers from bottom-layer hardware to upper-layer applications through our full-stack AI solutions. We have also developed a comprehensive AI development platform, MotusAI. The platform is designed for AI development and inference, combining GPU and data resources with AI development environments to optimize computing resource allocation, task management, and centralized oversight. It speeds up training data processing and manages AI model development workflows effortlessly, increasing cluster computing power utilization to over 70%. MotusAI also supports LLM inference scenarios with million-level concurrency.”
Conclusion
The overall challenge in AI adoption, from cluster development to deployment, is how to systematically design and optimize computing clusters for improved computational efficiency and stability. For data center users, a viable approach consists of several steps:
1. Start with hardware optimized for AI, including servers, storage, and networking, as the foundation.
2. Design and deploy a cluster solution spanning computing, networking, and storage based on the computational needs of the AI applications.
3. Use a platform for intelligent operation and efficient management of the cluster.
4. Improve the application through various optimization processes, including developing, testing, and tuning algorithms, code, parallel computing, and more.
To professionalize this approach, users can also choose a reliable partner for the easy operation and deployment of AI applications.
For more information on IT infrastructure for data centers, please visit: https://www.kaytus.com/