Amazon Elastic Inference (EI) – a great option to optimize cost.
Amazon SageMaker is excellent at hosting your models and running inferences against them. But the associated costs can climb quickly if you are not careful about resource usage. SageMaker does a great job of managing the right configuration to run your models, but some choices are still left to you. For example, if your machine learning project needs a GPU, you may provision a full GPU instance even though a fraction of that GPU's power is all you actually needed.
Role of GPU in Inference versus Training
Inference is when you make predictions using an already trained model. In a typical machine learning project:
- During training, a training job batch-processes sample data across hundreds of parallel threads, so using one or more full standalone GPU instances is warranted.
- At inference time, however, you typically run prediction on a single set of data at a time. Except in rare scenarios, you need only a fraction of the power of a full standalone GPU.
This is where SageMaker Elastic Inference helps optimize cost. By choosing Elastic Inference, you leverage only a fraction of a GPU and save on cost accordingly.
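As a concrete illustration, here is a minimal sketch (using the SageMaker Python SDK) of attaching an Elastic Inference accelerator when deploying a model to a hosted endpoint. The S3 path, IAM role, and framework version are placeholders; accelerator types such as ml.eia2.medium are described in the next section.

```python
from sagemaker.tensorflow import TensorFlowModel

# Placeholder model artifact and IAM role -- substitute your own values.
model = TensorFlowModel(
    model_data="s3://my-bucket/model/model.tar.gz",
    role="arn:aws:iam::123456789012:role/MySageMakerRole",
    framework_version="1.15",
)

# accelerator_type attaches a fractional GPU (an EI accelerator) to a
# modest CPU instance, instead of provisioning a full GPU instance.
predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.m5.large",
    accelerator_type="ml.eia2.medium",
)
```

The endpoint then runs on a CPU instance, with the network-attached accelerator handling the GPU-bound portion of inference.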
Composition of Elastic Inference (EI)
Elastic Inference works through what are called Elastic Inference accelerators, which are network-attached devices. These accelerators connect to your SageMaker instance or endpoint to provide fractional GPU capacity. There are multiple Elastic Inference accelerator types to choose from, and it may take some experimentation to figure out which one best fits your specific need. You will need to factor in the compute power your CPU already provides, the memory required (at least the file size of the trained model), and your latency and throughput requirements.
The following table shows the throughput (in teraflops, TFLOPS) and memory associated with each Elastic Inference accelerator type:
| Accelerator Type | FP32 Throughput (TFLOPS) | FP16 Throughput (TFLOPS) | Memory (GB) |
|---|---|---|---|
| ml.eia2.medium | 1 | 8 | 2 |
| ml.eia2.large | 2 | 16 | 4 |
| ml.eia2.xlarge | 4 | 32 | 8 |
| ml.eia1.medium | 1 | 8 | 1 |
| ml.eia1.large | 2 | 16 | 2 |
| ml.eia1.xlarge | 4 | 32 | 4 |
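One of the sizing rules above (accelerator memory should be at least the size of the trained model artifact) can be turned into a quick sanity check. The sketch below is illustrative only: the memory figures come from the table above, and pick_accelerator is a hypothetical helper, not part of any AWS SDK.

```python
import os

# Accelerator memory in GB, taken from the table above (eia2 family).
EIA2_MEMORY_GB = {
    "ml.eia2.medium": 2,
    "ml.eia2.large": 4,
    "ml.eia2.xlarge": 8,
}

def pick_accelerator(model_path: str) -> str:
    """Return the smallest eia2 accelerator whose memory can hold the model file.

    This is only a starting point -- latency and throughput requirements
    may still push you to a larger accelerator.
    """
    model_gb = os.path.getsize(model_path) / (1024 ** 3)
    for accel, mem_gb in sorted(EIA2_MEMORY_GB.items(), key=lambda kv: kv[1]):
        if mem_gb >= model_gb:
            return accel
    raise ValueError("Model is too large for any eia2 accelerator")

# Example: a 1.3 GB model.tar.gz would map to ml.eia2.medium.
```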
Benefits and Features of Amazon Elastic Inference
- Reduces the cost of the inference step
- Can be used with a SageMaker notebook instance or on a SageMaker hosted endpoint
- Integrates with Amazon SageMaker, Amazon EC2, and Amazon ECS
- Supports the TensorFlow, Apache MXNet, and PyTorch frameworks
- Supports Open Neural Network Exchange (ONNX)
- ONNX lets you train a model in one deep learning framework and then transfer it to another framework for inference (see the sketch after this list)
- Available in multiple throughput sizes (TFLOPS per accelerator)
- Available in multiple memory sizes
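To make the ONNX point concrete, here is a minimal sketch of exporting a trained PyTorch model to ONNX so it can be imported by another ONNX-compatible framework for inference. The model and input shape are placeholders standing in for whatever you actually trained.

```python
import torch
import torch.nn as nn

# Placeholder model standing in for your actual trained model.
model = nn.Sequential(nn.Linear(784, 128), nn.ReLU(), nn.Linear(128, 10))
model.eval()

# A dummy input with the shape the model expects; used to trace the graph.
dummy_input = torch.randn(1, 784)

# Export to ONNX; the resulting model.onnx can then be imported by other
# ONNX-compatible frameworks for inference.
torch.onnx.export(model, dummy_input, "model.onnx")
```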
Pricing
You are charged for each hour an Elastic Inference accelerator is in use.
- Accelerators are billed per hour
- The per-hour price varies by accelerator configuration
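As a rough, purely illustrative calculation (the $0.12 hourly rate below is a hypothetical placeholder, not a published price; see the pricing page for actual rates):

```python
# Hypothetical hourly rate for illustration only -- check the Amazon
# Elastic Inference pricing page for actual prices by type and region.
hourly_rate_usd = 0.12
hours_per_month = 24 * 30          # 720 hours for an always-on endpoint

monthly_cost = hourly_rate_usd * hours_per_month   # 0.12 * 720 = 86.40
print(f"Estimated monthly accelerator cost: ${monthly_cost:.2f}")
```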
External Resources
- Amazon Elastic Inference site
- Amazon Elastic Inference Developer Guide
- Amazon Elastic Inference Pricing