AWS Storage Blog

How Audible uses Amazon S3 Object Lambda to improve streaming playback performance

Audible is a leading creator and provider of premium audio storytelling that offers its customers a new way to enhance and enrich their lives daily. Audible content includes over 790,000 audiobooks, podcasts, and Audible Originals. Audible has millions of members worldwide, who subscribe to one of 10 localized services designed for customers in Australia, Canada, France, Germany, India, Italy, Japan, Spain, the UK, and the US.

Since the launch of Audible Plus in 2020, Audible has used Amazon CloudFront and Lambda@Edge as its media streaming solution. In March 2023, Audible introduced spatial audio with Dolby Atmos, providing customers with an immersive, cinematic experience that truly makes the story come alive.

Because content encoded with Dolby codecs is larger, Audible needed to support a larger bitrate ladder to better balance audio quality and rebuffering across different network types. Because Audible dynamically generates manifest files, adding more bitrates increases initial playback latency, since larger files take longer to generate. As initial playback latency is an important KPI, Audible needed to migrate from the existing verbose MPEG DASH manifest format to one that scales better with the number of bitrates.

Creating this new manifest type meant that Audible needed to manipulate audio files before returning them to users, effectively adding a proxy in front of Amazon Simple Storage Service (Amazon S3). Audible explored Amazon S3 Object Lambda, which lets customers modify the data returned by Amazon S3 GET, HEAD, and LIST requests. With CloudFront’s origin access control (OAC), customers can use S3 Object Lambda as a CloudFront distribution origin to tailor content for end users.

In this post, we present how Audible uses S3 Object Lambda and its integration with CloudFront to dynamically generate SegmentBase-style MPEG DASH streams for our stereo and spatial quality tiers, improving initial playback latency by up to 10% in markets around the world.

A primer on media streaming with MPEG DASH

MPEG DASH is an ISO-standard media streaming protocol. Like all media streaming protocols, the player first downloads a special text file (referred to as a "manifest") that describes the media making up the stream. Then, the player downloads the media in small chunks (referred to as "segments") into memory and starts playing them. Each segment typically holds 2 to 30 seconds of content, varying with the content and the media provider.

MPEG DASH uses a single manifest file to describe all content that makes up the stream, reducing the number of round trips needed to start a stream. Manifests define the media timeline, what content should be played when, and all media segments that make up that content. The timeline is broken up into one or more consecutive, non-overlapping periods, each describing the media to be played in that slot. This includes all available bitrates in all available languages.

An MPEG DASH manifest defines a collection of consecutive non-overlapping periods

The manifest describes the media using representations. Representations describe core media properties and segment locations. Segments are referenced via one of three addressing schemes: SegmentList and SegmentBase for on-demand content, and SegmentTimeline for live or event-based content. Because Audible users stream audio content on demand, SegmentTimeline is out of scope.

SegmentList

SegmentList is the simplest scheme, as it lists each individual segment in one large list. Each segment is described with a SegmentURL element, which models the URL (media attribute) and/or byte range (mediaRange attribute) properties. If the URL is constant, then manifests can use the BaseURL tag to list it once, leaving only the mediaRange to vary per segment.

The following is an example of a SegmentList based DASH manifest file with two bitrates and a single period and rendition. Each segment is approximately 19.5 seconds long (430080 duration / 22050 timescale).

<?xml version='1.0' encoding='utf-8'?>
<MPD minBufferTime='PT20S' type='static' mediaPresentationDuration='PT8H18M42.347S' profiles='urn:mpeg:dash:profile:isoff-main:2011'
     xmlns='urn:mpeg:dash:schema:mpd:2011'
     xmlns:xsi='http://www.w3.org/2001/XMLSchema-instance' xsi:schemaLocation='urn:mpeg:DASH:schema:MPD:2011 http://standards.iso.org/ittf/PubliclyAvailableStandards/MPEG-DASH_schema_files/DASH-MPD.xsd'>
  <Period id='0' duration='PT8H18M42.347S'>
    <AdaptationSet id='0' contentType='audio' lang='und' segmentAlignment='true' bitstreamSwitching='true'>

      <!-- 32kbps bitrate -->
      <Representation id='0' mimeType='audio/mp4' codecs='mp4a.40.2' bandwidth='32768' audioSamplingRate='22050'>
        <AudioChannelConfiguration schemeIdUri='urn:mpeg:dash:23003:3:audio_channel_configuration:2011' value='1'/>
        <BaseURL>bk_potr_000001_22_32.mp4</BaseURL>
        <SegmentList timescale='22050' duration='430080'>
          <Initialization range='0-1038'/>
          <SegmentURL mediaRange='1039-87882'/>
          <SegmentURL mediaRange='87883-174578'/>
          <!-- ...skipping over 1530 lines for brevity... -->
          <SegmentURL mediaRange='132875440-132884512'/>
        </SegmentList>
      </Representation>

      <!-- 64kbps bitrate -->
      <Representation id='1' mimeType='audio/mp4' codecs='mp4a.40.2' bandwidth='65536' audioSamplingRate='22050'>
        <AudioChannelConfiguration schemeIdUri='urn:mpeg:dash:23003:3:audio_channel_configuration:2011' value='2'/>
        <BaseURL>bk_potr_000001_22_64.mp4</BaseURL>
        <SegmentList timescale='22050' duration='430080'>
          <Initialization range='0-1038'/>
          <SegmentURL mediaRange='1039-166252'/>
          <SegmentURL mediaRange='166253-330998'/>
          <!-- ...skipping over 1530 lines for brevity... -->
          <SegmentURL mediaRange='251404117-251568663'/>
        </SegmentList>
      </Representation>

    </AdaptationSet>
  </Period>
</MPD>

Naturally, as content duration increases, the number of segments increases because the segment duration is fixed. As we add more bitrates, we must also add a complete copy of the segment list, increasing the number of segments again.

Therefore, SegmentLists scale linearly with the content duration and number of bitrates. The longer the content is and the more bitrates there are, the larger the manifest becomes.

SegmentBase

SegmentBase optimizes on-demand manifest sizes by moving all segment metadata into a binary MP4 atom called a Segment Index (sidx) inside each media file. The manifest file then uses a special attribute of the SegmentBase tag (indexRange) to refer to the sidx’s byte range in the media. This drastically reduces manifest sizes, to the point where they scale only with the number of bitrates and renditions. The following is an example of a SegmentBase based DASH manifest file with two bitrates and a single period and rendition:

<MPD minBufferTime='PT20S' type='static' mediaPresentationDuration='PT2H39M9.391S' profiles='urn:mpeg:dash:profile:isoff-on-demand:2011'
     xmlns='urn:mpeg:dash:schema:mpd:2011'
     xmlns:xsi='http://www.w3.org/2001/XMLSchema-instance' xsi:schemaLocation='urn:mpeg:DASH:schema:MPD:2011 http://standards.iso.org/ittf/PubliclyAvailableStandards/MPEG-DASH_schema_files/DASH-MPD.xsd'>
  <Period id='0'>
    <AdaptationSet id='1' contentType='audio' segmentAlignment='true' lang='und'>
      <Representation id='0' mimeType='audio/mp4' codecs='mp4a.40.2' bandwidth='30794' audioSamplingRate='22050'>
        <AudioChannelConfiguration schemeIdUri='urn:mpeg:dash:23003:3:audio_channel_configuration:2011' value='1'/>
        <BaseURL>bk_lili_000032_22_32.mp4</BaseURL>
        <SegmentBase timescale='44100' indexRange='747-1329'>
          <Initialization range='0-747'/>
        </SegmentBase>
      </Representation>
      <Representation id='1' mimeType='audio/mp4' codecs='mp4a.40.2' bandwidth='65536' audioSamplingRate='22050'>
        <AudioChannelConfiguration schemeIdUri='urn:mpeg:dash:23003:3:audio_channel_configuration:2011' value='2'/>
        <BaseURL>bk_lili_000032_22_64.mp4</BaseURL>
        <SegmentBase timescale='44100' indexRange='747-1329'>
          <Initialization range='0-747'/>
        </SegmentBase>
      </Representation>
    </AdaptationSet>
  </Period>
</MPD>

The segment index itself contains a list of segment lengths and durations, and is documented as part of the MPEG DASH standard.
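The sidx layout comes from ISO/IEC 14496-12, the ISO BMFF specification that MPEG DASH builds on. As a rough illustration of what a player (or our tooling) reads out of it, the following is a minimal Python sketch, not production code, that pulls the per-segment sizes and durations out of a sidx box; it parses only the fields relevant to this discussion.

import struct

def parse_sidx(buf: bytes):
    """Parse a Segment Index ('sidx') box per ISO/IEC 14496-12.

    Returns (timescale, [(referenced_size, subsegment_duration), ...]).
    Assumes buf starts at the sidx box header.
    """
    size, box_type = struct.unpack_from('>I4s', buf, 0)
    assert box_type == b'sidx', 'expected a sidx box'
    version = buf[8]
    offset = 12                                    # skip size/type/version/flags
    reference_id, timescale = struct.unpack_from('>II', buf, offset)
    offset += 8
    # earliest_presentation_time and first_offset are 32-bit in version 0, 64-bit otherwise
    fmt = '>II' if version == 0 else '>QQ'
    earliest_pts, first_offset = struct.unpack_from(fmt, buf, offset)
    offset += 8 if version == 0 else 16
    _reserved, reference_count = struct.unpack_from('>HH', buf, offset)
    offset += 4
    references = []
    for _ in range(reference_count):
        word, duration, _sap = struct.unpack_from('>III', buf, offset)
        offset += 12
        references.append((word & 0x7FFFFFFF, duration))   # low 31 bits = referenced_size
    return timescale, references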

What makes Audible unique?

Audible’s use cases go beyond traditional on-demand playback for music or video largely due to the nature of our content:

  • Audiobooks and podcasts are audio only, like music.
  • Although the average audiobook is about 10 hours long, audiobook length varies from title to title. Some of the longest titles can be over 100 hours long.
  • Audiobooks are distributed as a single large book that is broken up into chapters to facilitate navigation, similar to an e-book.
  • Audiobooks are delivered as on-demand streams. Live streaming is not a use case.

There are a few different ways users consume Audible content. The primary use case is full title download and streaming, the most common consumption style for on-demand playback. Second, users can create clips of their favorite audiobook or podcast passages and listen back to them before committing to navigate away from their current position. Lastly, on devices like Alexa, Audible typically streams content chapter by chapter so that manifest files fit into memory.

We began our exploration by scoping AWS Elemental MediaPackage. It works great, but it requires that assets have at least one video track, which doesn’t fit Audible’s audio-only use case. Therefore, we had to own manifest generation ourselves. As with AWS Elemental MediaPackage, Audible must generate manifests dynamically because users determine the positions and lengths of clips, and statically generating those manifests up front doesn’t scale. Dynamic manifest generation saves on cost and complexity, while still leveraging the CloudFront cache to reduce unnecessary computation.

Prior architecture

The following diagram shows our prior content delivery system at a high level, with the content ingestion workflow (steps 1-3) starting on the left and the customer request workflow (steps 1-7) starting on the right:

Our prior architecture uses Lambda@Edge to generate streaming manifest files, while audio is served statically from Amazon S3

On the left, we have the content ingestion workflow that manages Audible’s catalog contents, as well as the accompanying transcoding pipeline. The key highlight here is that the pipeline generates a binary manifest in the protobuf format and publishes it, along with the fragmented MP4, to Amazon S3. The protobuf contains the metadata that the delivery infrastructure uses to deliver the stream.

The content delivery infrastructure is largely industry standard in terms of rights, licensing, and URL vending functions. The interesting thing is what happens in Step 5 (Fetch Manifest w/clipping params), where we use Lambda@Edge to dynamically generate a manifest file using the protobuf. The AWS Lambda function generates the required MPEG DASH manifest files by first downloading the protobuf from Amazon S3, iterating over its array of fragment metadata, and combining them to form larger segments.
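As a simplified sketch of that loop, assuming a hypothetical fragment metadata shape (the real protobuf schema is Audible-internal): each fragment contributes a byte size and a duration in timescale ticks, and consecutive fragments are coalesced into the SegmentURL byte ranges seen in the SegmentList example earlier.

def build_segment_ranges(fragments, init_size, target_ticks):
    """Coalesce short fragments into larger segments for a SegmentList manifest.

    fragments:    iterable of (byte_size, duration_ticks) pairs, one per fragment
    init_size:    byte length of the initialization segment at the head of the file
    target_ticks: desired segment duration in timescale ticks (e.g. 430080 at 22050 Hz)
    Returns 'start-end' byte-range strings, one per SegmentURL element.
    """
    ranges = []
    start = init_size               # first media byte after the init segment
    acc_bytes = acc_ticks = 0
    for size, ticks in fragments:
        acc_bytes += size
        acc_ticks += ticks
        if acc_ticks >= target_ticks:
            ranges.append(f'{start}-{start + acc_bytes - 1}')
            start += acc_bytes
            acc_bytes = acc_ticks = 0
    if acc_bytes:                   # trailing partial segment
        ranges.append(f'{start}-{start + acc_bytes - 1}')
    return ranges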

Because we must fragment at 1s intervals to power clipping, the array can be large for long content. As the array gets longer, the Lambda runs more loop iterations, impacting generation performance. Furthermore, as each bitrate has its own array, this time is multiplied by the number of bitrates. The following table shows the scale of fragments and segments as content duration and bitrates increase:

| Duration | Num fragments | Loop iterations (1 bitrate) | Loop iterations (2 bitrates) | Generation time (2 bitrates) | Manifest size |
|----------|---------------|-----------------------------|------------------------------|------------------------------|---------------|
| 11m      | 677           | 33                          | 66                           | 10ms                         | 5 KB          |
| 10h 4m   | 37161         | 1812                        | 3624                         | 30ms                         | 172 KB        |
| 36h 2m   | 133014        | 6486                        | 12972                        | 100ms                        | 620 KB        |
| 58h 4m   | 214348        | 10452                       | 20904                        | 160ms                        | 1.1 MB        |
| 91h 4m   | 336164        | 16392                       | 32784                        | 190ms                        | 1.6 MB        |
| 153h 59m | 568415        | 27716                       | 55432                        | 200ms                        | 2.8 MB        |

Since Audible adapted between two stereo audio bitrates, compressed manifest sizes were manageable and SegmentList was a satisfactory solution at the time.

Challenge

Adding more bitrates to the existing setup meant we’d be looping up to 27,000 iterations per bitrate, increasing initial playback latency. At this point, it was clear that we had to move away from SegmentList based manifests.

SegmentBase was particularly attractive because of its two main qualities:

  1. Compact, consistent manifest sizes: Manifest sizes don’t scale with content duration, and they’re always a fixed size across the catalog for the same number of bitrates and DASH periods. This makes them faster to generate, with more consistent performance across the catalog. The following table shows the stark differences in manifest size between SegmentList and SegmentBase:

| Duration | SegmentList manifest | SegmentBase manifest |
|----------|----------------------|----------------------|
| 11m      | 5 KB                 | 2 KB                 |
| 10h 4m   | 172 KB               | 2 KB                 |
| 36h 2m   | 620 KB               | 2 KB                 |
| 58h 4m   | 1.1 MB               | 2 KB                 |
| 91h 4m   | 1.6 MB               | 2 KB                 |
| 153h 59m | 2.8 MB               | 2 KB                 |

  2. Defers computation: After parsing a manifest, audio players select a bitrate and start downloading its segments. Spending computation up front on bitrates the player may never play is time we can reclaim to unblock the player faster. Instead, we can generate the segment index for an alternative bitrate only when the player switches qualities.

After making the decision to switch to SegmentBase, we realized it wouldn’t be trivial to generate. The main challenge is making the segment index (sidx). According to the MPEG DASH standard, segment indices can be hosted externally in a separate file or within the media itself right after the initialization segment. Unfortunately, hosting segment indices externally isn’t a supported configuration with fragmented MP4s.

Therefore, we must manipulate the audio by generating and injecting the segment index into the audio at the right place to conform to the standard and make playback work. More specifically, we must:

  1. Identify the correct byte position to inject the sidx.
  2. Generate and inject the sidx.
  3. Attach all remaining audio after it, effectively shifting the byte positions of every subsequent fragment.

This would naturally make a file larger than the one present in Amazon S3.

Dynamically creating SegmentBase-enabled audio assets entails injecting a sidx atom directly between the moov atom and the first audio fragment
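The following is a minimal sketch of building that sidx atom, assuming a version-0 box and following the layout in the spec; it is illustrative, not Audible’s production code.

import struct

def build_sidx(timescale, segments, first_offset=0):
    """Assemble a version-0 'sidx' box per ISO/IEC 14496-12.

    segments: list of (referenced_size_bytes, duration_ticks) pairs.
    A version-0 sidx is 32 bytes of fixed fields plus 12 bytes per reference,
    so its size is known before a single byte is generated.
    """
    count = len(segments)
    box = struct.pack(
        '>I4sBBBBIIIIHH',
        32 + 12 * count, b'sidx',
        0, 0, 0, 0,            # version=0 and 24 bits of flags
        1,                     # reference_ID
        timescale,
        0,                     # earliest_presentation_time
        first_offset,          # bytes from end of sidx to the first subsegment
        0,                     # reserved
        count,
    )
    for ref_size, duration in segments:
        box += struct.pack(
            '>III',
            ref_size & 0x7FFFFFFF,   # reference_type=0 (media) + 31-bit size
            duration,
            0x90000000,              # starts_with_SAP=1, SAP_type=1, delta=0
        )
    return box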

Because Lambda@Edge isn’t designed to manipulate large audio files, we looked at alternatives for this use case, while keeping Lambda@Edge for dynamic manifest generation.

Ideally, Audible wanted a proxy in front of Amazon S3 that modifies the response as it’s returned to the client. Once generated, the modified content is cacheable, so we wanted to leverage CloudFront to reduce unnecessary recomputation.

This is where we turned to S3 Object Lambda and its new integration with CloudFront, allowing customers to use an S3 Object Lambda Access Point alias as an origin in CloudFront. S3 Object Lambda uses Lambda functions to automatically process the output of Amazon S3 GET, HEAD, and LIST requests.

New architecture

The following diagram shows our new architecture with S3 Object Lambda integrated directly with CloudFront via OAC. To do this, we simply configure the S3 Object Lambda Access Point alias as an origin in CloudFront. Next, we set up permissions to make sure that only CloudFront can invoke the S3 Object Lambda Access Point and Lambda function. CloudFront will then invoke the Lambda function through the S3 Object Lambda Access Point, on an origin request, as it does with any other custom origin.

Our new architecture adds an Amazon CloudFront origin that dynamically generates sidx-enabled audio assets
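The origin domain we configure is the S3 Object Lambda Access Point alias, which resolves like a bucket name (roughly <alias>.s3.<region>.amazonaws.com), and the access point’s resource policy admits only our distribution. The policy shape is sketched below under those assumptions; the Region, account ID, access point name, and distribution ID are placeholders.

# Assumed resource policy on the S3 Object Lambda Access Point for CloudFront OAC;
# all ARN components below are placeholders.
object_lambda_access_point_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": "cloudfront.amazonaws.com"},
        "Action": "s3-object-lambda:Get*",
        "Resource": "arn:aws:s3-object-lambda:us-east-1:111122223333:accesspoint/my-olap",
        # Restrict invocation to one specific CloudFront distribution
        "Condition": {
            "StringEquals": {"AWS:SourceArn": "arn:aws:cloudfront::111122223333:distribution/EDFDVBDEXAMPLE"}
        },
    }],
}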

We updated our manifest generation Lambda function to generate SegmentBase-style manifests for full title playback only, not for the clipping or chapter-by-chapter playback use cases. The main change is correctly referring to the indexRange where the sidx atom is located. To do this, we must correctly compute the size of the sidx atom that is generated in S3 Object Lambda, which we calculate from the sidx layout defined in the MPEG DASH standard.
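Concretely, because a version-0 sidx is 32 bytes of fixed fields plus 12 bytes per segment reference (see the build_sidx sketch above), the indexRange can be computed from metadata alone, without touching the audio. A sketch, with the init segment size and segment count assumed to come from our protobuf metadata:

def manifest_index_range(init_size: int, num_segments: int) -> str:
    """Byte range of the injected version-0 sidx, for the manifest's indexRange.

    The sidx is injected immediately after the initialization segment (moov),
    so its range is fully determined by the init size and the segment count.
    """
    sidx_size = 32 + 12 * num_segments
    return f'{init_size}-{init_size + sidx_size - 1}'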

As CloudFront “walks” through the file, asking for subsequent byte ranges, we map each requested range to the correct range in Amazon S3, accounting for the position shift caused by the injected sidx bytes. All of this logic lives in the Lambda function attached to S3 Object Lambda: we enabled support for range requests and fulfill CloudFront’s requested byte ranges using this mapping. Furthermore, Audible didn’t have to make any changes to its clients because we adhered to the published MPEG DASH industry standards.
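A condensed sketch of that mapping, wrapped in an S3 Object Lambda handler. The event fields (getObjectContext, userRequest) and the write_get_object_response call follow the documented S3 Object Lambda interface; INJECT_AT and SIDX are illustrative per-asset values standing in for the precomputed injection offset and generated sidx bytes, and a production handler would also set StatusCode and ContentRange when returning partial content.

import re
import boto3
import requests

s3 = boto3.client('s3')

# Illustrative per-asset values, precomputed from the protobuf metadata:
INJECT_AT = 1039                               # first byte after the moov atom
SIDX = build_sidx(22050, [(86844, 430080)])    # from the build_sidx sketch earlier

def map_range(req_start, req_end, inject_at, sidx_len):
    """Translate an inclusive byte range on the transformed object into parts:
    bytes before the sidx come straight from S3, bytes inside it come from the
    generated atom, and bytes after it map back to S3 shifted left by sidx_len.
    """
    parts = []
    if req_start < inject_at:
        parts.append(('s3', req_start, min(req_end, inject_at - 1)))
    lo, hi = max(req_start, inject_at), min(req_end, inject_at + sidx_len - 1)
    if lo <= hi:
        parts.append(('sidx', lo - inject_at, hi - inject_at))
    if req_end >= inject_at + sidx_len:
        parts.append(('s3', max(req_start, inject_at + sidx_len) - sidx_len, req_end - sidx_len))
    return parts

def handler(event, context):
    ctx = event['getObjectContext']
    # With range support enabled, the client's Range header is surfaced on the
    # user request; this sketch assumes a single bounded range like 'bytes=0-1038'.
    rng = event['userRequest']['headers']['Range']
    start, end = (int(n) for n in re.findall(r'\d+', rng))
    body = b''
    for kind, lo, hi in map_range(start, end, INJECT_AT, len(SIDX)):
        if kind == 'sidx':
            body += SIDX[lo:hi + 1]
        else:   # fetch the underlying bytes through the presigned input URL
            body += requests.get(ctx['inputS3Url'], headers={'Range': f'bytes={lo}-{hi}'}).content
    s3.write_get_object_response(
        Body=body, RequestRoute=ctx['outputRoute'], RequestToken=ctx['outputToken'])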

Results

Audible launched its stereo content worldwide in 2023 and immediately realized latency improvements. Customers now see streaming playback start 3.8% faster at P90 and 3.14% faster at P95, our best streaming performance ever. Over strong Wi-Fi connections, initial playback delay is reduced by up to 5% for content around 10 hours long, and by up to 50% for our longest titles (154 hours).

The following table highlights the latency reductions in the top 5 countries:

Latency reductions: Top five most improved locations

| Country (P90) | Reduction | Country (P95) | Reduction | Country (P99)  | Reduction |
|---------------|-----------|---------------|-----------|----------------|-----------|
| Japan         | 9.33%     | Spain         | 8.52%     | Spain          | 10.25%    |
| Australia     | 9.16%     | Japan         | 5.86%     | Japan          | 6.96%     |
| Spain         | 4.07%     | Australia     | 5.71%     | Australia      | 6.72%     |
| India         | 3.90%     | India         | 4.62%     | Canada         | 2.48%     |
| France        | 3.88%     | France        | 3.06%     | United States  | 0.61%     |

In addition to these performance improvements, this launch helps reduce technical complexity on resource-constrained devices since segment indices are significantly smaller than their SegmentList counterparts.

Conclusion

In this post, we walked through the motivations, journey, and results of Audible’s launch of SegmentBase-style MPEG DASH, which achieved significant latency improvements for customers worldwide.

Dolby Atmos on Audible

In March 2023, we launched Audible’s spatial audio offering worldwide on top of this solution. The first spatial audio offering, Dolby Atmos on Audible, is a collection of immersive, cinematic listening experiences in pioneering spatial sound. The offering celebrates and expands the possibilities of audio storytelling by highlighting the extraordinary talents of a variety of actors, writers, directors, sound designers, along with other creators across multiple genres. This includes feature-length multi-cast productions, soundscapes, live performances, and podcasts. The launch collection includes over 40 of Audible’s most popular Audible Originals, available for the first time in spatial audio with Dolby Atmos. The Dolby Atmos titles are available to all Audible members globally to stream and download through the Audible app on compatible iOS and Android Dolby Atmos-enabled mobile devices.

To learn more about media solutions available in AWS, visit the CloudFront overview page, Media Services overview page, and the S3 Object Lambda product detail page.

Ronak Patel

Ronak is a Software Dev Engineer III at Audible. He is the architect for Audible's Content Delivery and Playback systems. Ronak has deep expertise in media technology and has worked on a variety of distributed, web, desktop, and mobile applications over his 18-year career.

Andrew Kutsy

Andrew is a Product Manager with Amazon S3. He joined Amazon in 2016 and loves talking to users to learn about the innovative ways they use AWS. He obsesses over coffee, enjoys traveling, and is currently on a search for the best croissant in the world.