S3 Cost Savings

Amazon S3 

Amazon S3 is a fantastically versatile and scalable storage solution, but keeping costs under control is crucial. This blog post dives into key strategies to lower your AWS S3 costs.

Optimizing Storage With Storage Classes

S3 offers a variety of storage classes, each with different pricing based on access frequency. Here's how to leverage them:

Automated Cost Management Using Terraform Configuration

HCL

# Configure AWS Provider
provider "aws" {
  region = "us-east-1" # Replace with your desired region
}

# Define your S3 Bucket
resource "aws_s3_bucket" "my_bucket" {
  bucket = "my-bucket-name"
  acl    = "private"
}

# Define Lifecycle Policy with transition rules
resource "aws_s3_bucket_lifecycle_configuration" "lifecycle_rules" {
  bucket = aws_s3_bucket.my_bucket.id

  rule {
    id     = "transition-to-ia-and-glacier"
    status = "Enabled"

    # Apply this rule to every object in the bucket
    filter {}

    # Transition to S3 Standard-Infrequent Access 90 days after creation
    transition {
      days          = 90
      storage_class = "STANDARD_IA"
    }

    # Transition to Glacier 180 days after creation (90 days after Standard-IA)
    transition {
      days          = 180
      storage_class = "GLACIER"
    }
  }
}


Explanation

  1. Provider configuration and S3 bucket definition: These set up the AWS provider and the bucket the lifecycle policy applies to.
  2. S3 lifecycle policy: The aws_s3_bucket_lifecycle_configuration resource attaches the lifecycle rules to the bucket.
  3. Lifecycle rule: A single rule block defines two transition actions.
    1. First transition: Objects move to the S3 Standard-Infrequent Access (STANDARD_IA) storage class 90 days after creation (adjustable based on your needs). This is a cost-effective option for data accessed less often than daily but that still needs faster retrieval than Glacier offers.
    2. Second transition: 180 days after creation (i.e., 90 days after moving to Standard-IA; again, customize this period), objects move to the Glacier storage class for the most cost-effective long-term archival.

Benefits of this Approach

By tiering objects automatically as they age, you pay Standard prices only while data is fresh and frequently accessed, and older data moves to progressively cheaper storage classes without any manual intervention.

Additional Considerations

Remember, retrieving data from Glacier incurs retrieval fees, so factor in access needs when defining transition times.

Explore using S3 Intelligent-Tiering if you have unpredictable data access patterns, as it automatically manages object movement between storage classes.
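The archive tiers of Intelligent-Tiering can also be configured per bucket in Terraform. Below is a minimal sketch reusing the bucket from the earlier example; the configuration name and day thresholds are illustrative, and it only affects objects stored in the INTELLIGENT_TIERING storage class.

HCL

# Sketch: opt the whole bucket into Intelligent-Tiering archive tiers.
# Applies only to objects stored in the INTELLIGENT_TIERING storage class;
# the configuration name and day thresholds are placeholders to adjust.
resource "aws_s3_bucket_intelligent_tiering_configuration" "entire_bucket" {
  bucket = aws_s3_bucket.my_bucket.id
  name   = "EntireBucket"

  # Objects not accessed for 90 days move to the Archive Access tier
  tiering {
    access_tier = "ARCHIVE_ACCESS"
    days        = 90
  }

  # Objects not accessed for 180 days move to the Deep Archive Access tier
  tiering {
    access_tier = "DEEP_ARCHIVE_ACCESS"
    days        = 180
  }
}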

Additional Cost-Saving Strategies

Incomplete Multipart Uploads

A multipart upload is a method for efficiently uploading large files to S3: the file is broken into smaller chunks, each chunk is uploaded individually, and S3 reassembles them on the backend.

How Multipart Uploads Work

Imagine uploading a large video file to S3. With multipart uploads, S3 allows you to split the video into smaller, more manageable parts (e.g., 10 MB each). Each part is uploaded as a separate request to S3. After all parts are uploaded successfully, you send a final "complete multipart upload" instruction to S3. S3 then reassembles the individual parts into the original video file.

Incomplete Multipart Uploads

Problems can arise during the upload process. An internet outage, application crash, or other issue might interrupt the upload before all parts are sent. In such scenarios, S3 won't have all the pieces to create the complete file. These interrupted uploads are called "Incomplete Multipart Uploads."

Why They Matter

Although the complete file isn't created, S3 still stores the uploaded parts. These leftover parts occupy storage space and you get billed for them. If you have many incomplete multipart uploads, it can significantly inflate your S3 storage costs. Additionally, they clutter your S3 bucket, making it harder to manage your data.

Key Points To Remember About Incomplete Multipart Uploads

  1. They occur when an upload is interrupted before completion.
  2. They consume storage space and incur charges until they are removed.
  3. They can be identified and cleaned up manually or through automated lifecycle rules.

By managing incomplete multipart uploads, you can optimize your S3 storage usage and costs.

Code 

HCL

resource "aws_s3_bucket_lifecycle_configuration" "default" {
  bucket = aws_s3_bucket.your_bucket_name.id

  rule {
    id     = "incomplete-multipart-uploads"
    status = "Enabled"

    # Apply this rule to every object in the bucket
    filter {}

    # Abort multipart uploads that are still incomplete 7 days after they started,
    # which also frees the storage used by the already-uploaded parts
    abort_incomplete_multipart_upload {
      days_after_initiation = 7 # Change this value as needed (minimum 1)
    }
  }

  rule {
    id     = "noncurrent-versions-to-glacier"
    status = "Enabled"

    filter {}

    # Optional: move noncurrent object versions to a cheaper storage class
    # (requires versioning; independent of the abort rule above)
    noncurrent_version_transition {
      noncurrent_days = 30 # Days after a version becomes noncurrent; adjust as needed
      storage_class   = "GLACIER"
    }
  }
}

resource "aws_s3_bucket" "your_bucket_name" {
  # ... your bucket definition ...
}


Important Considerations

When a lifecycle rule aborts an incomplete multipart upload, S3 frees the storage used by the parts that were already uploaded, but parts still in flight at that moment may land afterwards, so additional cleanup (or another abort cycle) can occasionally be needed. Before removing incomplete multipart uploads, ensure they are truly abandoned and not actively being completed. By implementing this approach, you can make incomplete multipart upload removal part of your normal bucket management.

S3 Versioning 

S3 Versioning is a powerful feature offered by Amazon Simple Storage Service (S3) that allows you to keep track of all the different versions of objects you store in your S3 buckets. 
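Versioning is switched on per bucket. A minimal Terraform sketch, reusing the bucket from the earlier examples, looks like this:

HCL

# Minimal sketch: enable versioning so noncurrent versions are retained
# (and can later be cleaned up by lifecycle rules).
resource "aws_s3_bucket_versioning" "my_bucket" {
  bucket = aws_s3_bucket.my_bucket.id

  versioning_configuration {
    status = "Enabled"
  }
}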

Understanding Versioning Costs

S3 charges storage for all object versions, including the latest and all previous ones. This cost can accumulate if you retain numerous versions for extended periods.

Cost-Saving Strategies

1. Version Lifecycle Management

2. Versioning for Critical Objects Only

3. Regular Version Cleanup

4. Analyze Versioning Needs

Additional Considerations

Code 

HCL

# Configure AWS Provider
provider "aws" {
  region = "us-east-1" # Replace with your desired region
}

# Define your S3 Bucket
resource "aws_s3_bucket" "my_bucket" {
  bucket = "my-bucket-name"
  acl    = "private"
}

# Define Lifecycle Policy with versioning cleanup
resource "aws_s3_bucket_lifecycle_configuration" "versioning_cleanup" {
  bucket = aws_s3_bucket.my_bucket.id

  rule {
    id     = "delete-old-versions"
    status = "Enabled"

    # Applies to all objects. To protect specific data (e.g., "critical-data/"),
    # scope the rule with filter { prefix = "..." } instead -- lifecycle rules
    # match prefixes; they cannot exclude them.
    filter {}

    # Permanently delete versions 30 days after they become noncurrent
    noncurrent_version_expiration {
      noncurrent_days = 30
    }
  }
}


Consideration

Important Note

Once a version is deleted through a lifecycle rule, it's gone permanently and cannot be recovered. Ensure your retention period is sufficient for your rollback and audit requirements.

Request Minimization

Analyze your S3 request patterns (PUT, GET, DELETE, etc.) and identify areas for reduction. Because S3 charges per request, cutting unnecessary calls can lead to significant cost savings.
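One way to get that visibility is to enable CloudWatch request metrics on the bucket. A small sketch, assuming the bucket from the earlier examples; note that request metrics are an opt-in, separately billed CloudWatch feature, and the metric configuration name below is a placeholder:

HCL

# Sketch: publish CloudWatch request metrics (GET, PUT, DELETE, ...) for the bucket
# so high-volume request patterns can be identified and reduced.
resource "aws_s3_bucket_metric" "entire_bucket_requests" {
  bucket = aws_s3_bucket.my_bucket.id
  name   = "EntireBucket"
}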

Utilizing Caching Mechanisms

Optimizing Data Transfers

Designing for Efficient Access

Caching Mechanism for Big Data

Steps

1. Configure a Cache Bucket in S3

2. Enable Spark Result Fragment Caching

3. Optional Configuration (Advanced)

How It Works

When Spark Result Fragment Caching is enabled, Spark analyzes your Spark SQL queries. It identifies subqueries or parts of the query plan that produce frequently used results (fragments). These result fragments are then cached in the designated S3 bucket after the first execution. Subsequent queries that utilize the same cached fragments can reuse them directly from S3, significantly improving performance compared to recomputing the results.

Benefits

Things To Consider

This caching mechanism is most effective for queries that have significant data reduction after certain stages (e.g., filtering, aggregation). The effectiveness also depends on the access patterns of your queries. If subsequent queries rarely reuse the cached fragments, the benefit might be minimal. Consider enabling S3 Lifecycle Management for the cache bucket to automatically manage older versions of cached fragments.
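As a sketch of that last point, the cache bucket can get its own expiration rule so stale fragments are removed automatically; the bucket name and 30-day window below are assumptions:

HCL

# Sketch: a dedicated cache bucket whose objects expire automatically, so stale
# Spark result fragments do not keep accruing storage costs.
# The bucket name and expiry window are placeholders.
resource "aws_s3_bucket" "spark_fragment_cache" {
  bucket = "my-spark-fragment-cache"
}

resource "aws_s3_bucket_lifecycle_configuration" "spark_fragment_cache" {
  bucket = aws_s3_bucket.spark_fragment_cache.id

  rule {
    id     = "expire-cached-fragments"
    status = "Enabled"

    filter {}

    # Delete cached fragments 30 days after they were written
    expiration {
      days = 30
    }
  }
}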

Transfer Acceleration

While S3 Transfer Acceleration can boost data transfer speeds, it comes with additional costs. Evaluate if the speed improvement justifies the expense.
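Acceleration is a per-bucket toggle, so when the measured speed gain does not justify the per-GB surcharge it can simply be suspended. A minimal sketch, assuming the bucket from the earlier examples:

HCL

# Minimal sketch: control Transfer Acceleration per bucket. Set status to
# "Suspended" when the speed benefit no longer justifies the extra transfer cost.
resource "aws_s3_bucket_accelerate_configuration" "my_bucket" {
  bucket = aws_s3_bucket.my_bucket.id
  status = "Suspended" # or "Enabled"
}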

1. Evaluate Transfer Acceleration Usage

2. Leverage S3 Intelligent-Tiering

3. Optimize Data Transfer Size

4. Utilize S3 Glacier for Long-Term Archival

5. Monitor and Analyze Costs

By implementing these strategies, you can significantly reduce your S3 storage costs. Regularly monitor your usage and adapt your approach as your data needs evolve. Remember, the key lies in aligning storage classes with access requirements. There's no one-size-fits-all solution, so find the optimal mix for your specific use case.

 

 

 

 
