Prelude
Hi guys, after a series of back-to-the-future-I-didn't-have-time-to-write-new-things… I'm back. What happened in the last months? Ok, covid19 put the whole world in trouble, I bought an apartment, I opened a company and I resigned from my contract. Really. Nothing. Special. But TODAY I wanna talk about a project I've had in mind for a while and finally worked on during a boring Sunday afternoon: I hacked my blog to let Polly read it for you! 😎 😎 😎
What is Polly?
For those of you who have never heard of Polly before, here it is:
AWS Polly is a service that turns text into lifelike speech, allowing you to create applications that talk and build entirely new categories of speech-enabled products.
I already made some experiments in the past (look at A serverless OCR with Polly and Rekognition unveils the power of stack inheritance in CDK) and I've always been fascinated by this:
AWS uses a lot of its own services to build cool things for its own purposes. One cool thing they do is provide a spoken version of their blog posts, and this is done through AWS Polly.
Step by step
The first thing was to actually extract the text from the blog. I decided to do it the hard (or better, the stupid) way - that is… using my old friend BeautifulSoup, the Python parsing library. The reason? Parsing Markdown was boring and more difficult than parsing the rich HTML output (I use Hugo to build my website). So I decided to remove the huge amount of dust from my old Python script and use it to parse things and create my mp3 Polly files.
Step 1 of 6: __get_blog_post_urls()
The first thing to do to let AWS Polly read your articles - and do it using HTML - is to get your post URLs. You can certainly get your content from a local build, but I did it by parsing my live blog.
My blog is hosted behind a CloudFront CDN and served under the https://madeddu.xyz domain name. Moreover, my articles are served under the /posts path, as in many other blogs built using Hugo (I guess). A simple script like this one can get all my article URLs:
import requests
from bs4 import BeautifulSoup

# blog base url
BASE_URL = "https://madeddu.xyz"

def __get_blog_post_urls():
    # create request
    r = requests.get(BASE_URL)
    index = 2
    # accumulate blog post urls
    urls = []
    # go ahead with pages until there are no more
    while True:
        # parse page
        soup = BeautifulSoup(r.text, features="html.parser")
        # find all href
        for a in soup.findAll('a', href=True):
            # get only posts, excluding the landing and paginator links
            if a['href'] != f"{BASE_URL}/posts/" and f"{BASE_URL}/posts/" in a['href'] and "/page/" not in a['href']:
                # append urls
                urls.append(a['href'])
        # request the next paginator page, stopping at the first non-200
        try:
            r = requests.get(f"{BASE_URL}/posts/page/{index}")
            r.raise_for_status()
        except requests.RequestException:
            return urls
        index += 1
I just loop through the pages until I stop getting a 200 response (starting from page 1, which has no number, and then going ahead with 2, 3, etc.), and I collect all the links to my posts, excluding the landing links and the paginator links. The result is pretty simple to imagine: it's a long list like the following one.
https://madeddu.xyz/posts/aws/cdk/serverless-ocr/
https://madeddu.xyz/posts/aws/cdk/producer-consumer/
https://madeddu.xyz/posts/aws/cdk/uploader-stack/
https://madeddu.xyz/posts/aws/cdk/contact-form/
https://madeddu.xyz/posts/life/life-as-a-software-engineer/
https://madeddu.xyz/posts/aws/cdk/cloudformation-to-cdk/
...
and so on.
I could have used my filenames directly since I use permalinks, but I also leverage the folder structure to define my categories, so I decided to parse my blog entirely. It could be done in a better and smarter way, but my intention was to follow GTD: if you are already bored to death, I shared the actionable tasks I wrote in a GitHub Gist, entirely in a single Python script.
Step 2 of 6: __get_page_content()
With the first step, we collected the links to the articles. The second step is to get the content of them. This is not as simple as you might imagine, because… even in my super light Hugo template there's A LOT of unused outer text - and I regret not having chosen a pure-text theme for my blog. So after a bit of Hugo theme analysis, I wrote the simplest script I could think of:
def __get_page_content(url, number_of_words=NUMBER_OF_WORDS, final_sentence=FINAL_SENTENCE):
    # create request
    r = requests.get(url)
    # parse page
    soup = BeautifulSoup(r.text, features="html.parser")
    # get all paragraphs
    paragraphs = soup.find("div", {"id": "main"}).findAll("p")
    # accumulate outer text
    page_with_no_code = ""
    # for each found paragraph
    for paragraph in paragraphs:
        # exclude portions of code
        if paragraph.findAll("div", {"class": "highlight"}):
            continue
        # get the text out of the paragraph
        text = paragraph.text.strip()
        # exclude a common header
        if "Subscribe to my newsletter to be informed about my new blog posts, talks and activities." in text:
            continue
        # accumulate page text
        page_with_no_code += text + " "
    # return result, cut to size and closed by the final sentence
    return f"{page_with_no_code[:number_of_words]}{final_sentence}"
The method __get_page_content() just looks for every p tag inside my main div and then, in order, excludes:
- the highlight div that contains my code snippets;
- the Subscribe to my newsletter... paragraph - the simplest thing was to actually look for the text; it's not the best solution, but it worked pretty well :)
The method then cuts the content to a fixed number of characters: I didn't want to produce too long an audio for a single post, just the first portion of it, like a few paragraphs. This part can be enhanced for sure, because I just cut a long string without considering words or sentences, but it works for my purpose - that is, giving you the idea and some snippets of code you can use to implement your own Polly version of the blog.
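If you want something slightly smarter, here is a minimal sketch of the enhancement I mention above - cutting at the last sentence boundary within the budget instead of mid-word. The helper name is hypothetical (it's not in my script); NUMBER_OF_WORDS is the same constant used above.

# hypothetical helper: cut text at the last sentence boundary within a budget
def __cut_at_sentence(text, limit=NUMBER_OF_WORDS):
    # nothing to do if the text already fits
    if len(text) <= limit:
        return text
    # cut to the budget, then back off to the last full stop
    cut = text[:limit]
    last_stop = cut.rfind(". ")
    # fall back to the raw cut if no sentence boundary is found
    return cut[:last_stop + 1] if last_stop != -1 else cut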
Step 3 of 6: __get_content_read_by_polly()
At this point, we have all the text we would like to read: the third step is to actually get it read and persisted on S3. It's as simple as doing the following:
import boto3

# clients for Polly and S3 (CONTENT_BUCKET is the destination bucket, defined in the gist)
polly_client = boto3.client("polly")
s3_client = boto3.client("s3")

def __get_content_read_by_polly(article_path, content):
    # read content
    response = polly_client.synthesize_speech(
        VoiceId='Matthew',
        OutputFormat='mp3',
        Text=content)
    # save mp3
    with open('speech.mp3', 'wb') as f:
        f.write(response['AudioStream'].read())
    # upload mp3 under the same folder structure of the article
    with open('speech.mp3', 'rb') as f:
        s3_client.upload_fileobj(f, CONTENT_BUCKET, f'mp3/{article_path[:-1]}.mp3')
    return f'mp3/{article_path[:-1]}.mp3'
Pretty cool, right? But I had this feeling I was missing something, and actually, I was…
First: Neural TTS Voices
AWS Polly has a Neural TTS system that can produce even higher quality voices than its standard voices. The NTTS system produces the most natural and human-like text-to-speech voices possible. Standard TTS voices use concatenative synthesis. This method strings together (concatenates) the phonemes of recorded speech, producing very natural-sounding synthesized speech. However, the inevitable variations in speech and the techniques used to segment the waveforms limit the quality of speech. The Amazon Polly Neural TTS system doesn’t use standard concatenative synthesis to produce speech. It has two parts:
- A neural network that converts a sequence of phonemes—the most basic units of language—into a sequence of spectrograms, which are snapshots of the energy levels in different frequency bands
- A vocoder, which converts the spectrograms into a continuous audio signal.
The first component of the neural TTS system is a sequence-to-sequence model. This model doesn’t create its results solely from the corresponding input but also considers how the sequence of the elements of the input works together. The model chooses the spectrograms that it outputs so that their frequency bands emphasize acoustic features that the human brain uses when processing speech. The output of this model then passes to a neural vocoder. This converts the spectrograms into speech waveforms. When trained on the large data sets used to build general-purpose concatenative-synthesis systems, this sequence-to-sequence approach will yield higher-quality, more natural-sounding voices.
Really. Cool, Amazon. For real. So the question now is… how can we leverage these voices?
SSML - Speech Synthesis Markup Language
Our friends at AWS have created an entire guide - Generating Speech from SSML Documents - that you can use with Amazon Polly to generate speech. Using SSML-enhanced text gives you additional control over how Amazon Polly generates speech from the text you provide. For example, you can include a long pause within your text, or change the speech rate or pitch. Other options include:
- emphasizing specific words or phrases
- using the phonetic pronunciation
- including breathing sounds
- whispering
- using the Newscaster or Conversational speaking style.
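Just to give you an idea, a minimal SSML document combining some of these options could look like the hypothetical snippet below - keep in mind that some tags, like the whispered effect, only work with the standard voices, not the neural ones:

<speak>
    I hacked my blog <break time="500ms"/>
    to let <emphasis level="strong">Polly</emphasis> read it
    <amazon:effect name="whispered">for you.</amazon:effect>
</speak>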
Everything you can do is pretty much collected in this documentation page, so… I'm going to take one step backward.
Step 2 (again) of 6: __get_page_content_for_nts()
To let me create SSML-tagged speech without dealing too much with regex, Markdown, and Python (feel free to do it) - but, at the same time, without dealing too much with HTML and the parsing of my outer text as well - I put in place a small script to just gather the paragraph text. My first intention was to leverage three aspects:
- emphasizing specific words or phrases, for instance the ones enclosed in em or strong tags, replacing them with the SSML notation <emphasis level="moderate|strong|reduced">;
- putting pauses and breaths inside my speech;
- letting the reading sound more natural, let's say.
So, I decided to replace my old __get_page_content with the new method __get_page_content_for_nts: it just provides the paragraphs without losing the meta needed for my next step.
def __get_page_content_for_nts(url, number_of_words=NUMBER_OF_WORDS, final_sentence=FINAL_SENTENCE):
    # create request
    r = requests.get(url)
    # parse page
    soup = BeautifulSoup(r.text, features="html.parser")
    # get all paragraphs
    paragraphs = soup.find("div", {"id": "main"}).findAll("p")
    # accumulate paragraphs with no code
    page_with_no_code = []
    # for each found paragraph
    for paragraph in paragraphs:
        # exclude portions of code
        if paragraph.findAll("div", {"class": "highlight"}):
            continue
        # get the text out of the paragraph
        text = paragraph.text.strip()
        # exclude a common header
        if "Subscribe to my newsletter to be informed about my new blog posts, talks and activities." in text:
            continue
        # accumulate the whole paragraph element, keeping its inner tags
        page_with_no_code.append(paragraph)
    # return result
    return page_with_no_code
Step 2.5 (new) of 6: __add_SSML_Enhanced_tags()
Ok, we now have paragraphs including tags and many other things we would like to ignore in our parsing - but you can use them to experiment with Polly. The important thing to notice is that not all the features - and not all the voices - are available if you want to use Neural Voices. Have a look at the right columns of the documentation page to discover more about the features' current limitations. So… ok, I will shut up and provide my script:
def __add_SSML_Enhanced_tags(paragraphs):
    # tag to start a speech
    text = "<speak>"
    # add informal style and dynamic range compression
    text = f'{text}<amazon:domain name="conversational"><amazon:effect name="drc">'
    # # add breathing to sound more natural
    # text = f'{text}<amazon:auto-breaths>'
    # for each paragraph
    for paragraph in paragraphs[:NUMBER_OF_PARAGRAPHS]:
        # prepare the paragraph with dot and comma breaks
        paragraph_text = paragraph.text.strip()
        # paragraph_text = paragraph_text.replace("...", "<break time=\"500ms\"/>")
        # paragraph_text = paragraph_text.replace(". ", "<break time=\"800ms\"/>")
        # paragraph_text = paragraph_text.replace(",", "<break time=\"300ms\"/>")
        # prepare the paragraph with slang expressions
        paragraph_text = paragraph_text.replace("btw", "<sub alias=\"by the way\">by the way</sub>")
        paragraph_text = paragraph_text.replace("PoC", "<say-as interpret-as=\"spell-out\">PoC</say-as>")
        # emphasize em words
        # ems = paragraph.findAll("em")
        # for em in ems:
        #     paragraph_text = paragraph_text.replace(f"{em.text}", f'<emphasis level="moderate">{em.text}</emphasis>')
        # # pronounce strong words loudly
        # strongs = paragraph.findAll("strong")
        # for strong in strongs:
        #     paragraph_text = paragraph_text.replace(f"{strong.text}", f'<emphasis level="moderate">{strong.text}</emphasis>')
        # concat the parsed paragraph to the text, respecting the char limit
        if len(f"{text} {paragraph_text}") > 1490 - len(f" {FINAL_SENTENCE}"):
            break
        else:
            text = f"{text} {paragraph_text}"
    # close the text
    text = f"{text} {FINAL_SENTENCE}</amazon:effect></amazon:domain></speak>"
    return text
As you can see, at the beginning I created some logic to provide emphasis at a deeper level, even over strong words, but then I switched to the conversational domain, which by itself produces the natural kind of speech I wanted to reach - plus some more things, like the dynamic range compression (more here) and the reading of some aliases. The important thing to notice is that I now provide a maximum number of paragraphs (not chars anymore) and, thus, I still have to cap my text at 1500 chars to avoid a TextLengthExceededException. You can find more about the exceptions you can get here. Also, if you want an SSML validator, I found this one and the AWS Polly Console pretty useful.
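By the way, instead of waiting for Polly to reject the text, you could fail fast with a local pre-flight check. Here is a hypothetical sketch (the helper and constant names are mine, not part of the script):

# hypothetical pre-flight check against my 1500-char cap
MAX_SSML_CHARS = 1500

def __check_length(ssml_text):
    # fail fast locally instead of catching TextLengthExceededException
    if len(ssml_text) > MAX_SSML_CHARS:
        raise ValueError(f"SSML too long: {len(ssml_text)} > {MAX_SSML_CHARS}")
    return ssml_text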
The result is… we have the text ready to be read, but we need to update __get_content_read_by_polly just a bit.
Step 3 (renewed) of 6: __get_content_read_by_polly()
I decided to use the Matthew neural voice and to store my mp3 files using the same folder structure I have on my blog, under the mp3 prefix 😎 Here's the code.
def __get_content_read_by_polly(article_path, content):
    # read content with the neural engine, interpreting the text as SSML
    response = polly_client.synthesize_speech(
        Engine='neural',
        LanguageCode='en-US',
        OutputFormat='mp3',
        Text=content,
        TextType='ssml',
        VoiceId='Matthew'
    )
    # save mp3
    with open('speech.mp3', 'wb') as f:
        f.write(response['AudioStream'].read())
    # upload mp3
    with open('speech.mp3', 'rb') as f:
        s3_client.upload_fileobj(f, CONTENT_BUCKET, f'mp3/{article_path[:-1]}.mp3')
    return f'mp3/{article_path[:-1]}.mp3'
At this point, we have all the articles read and stored in S3, and we just need to change all our markdown files.
Step 4 of 6: __get_markdown_list()
Getting my markdown list recursively is just as simple as doing this:
from pathlib import Path

def __get_markdown_list(base_path=BASE_PATH):
    # get the list of all markdown files, recursively
    list_of_files = list(Path(base_path).rglob("*.md"))
    # return it
    return list_of_files
Before going ahead, I want to preview one more thing: to let the mp3 audio be played correctly, I had to change my Hugo theme template to support one more metadata field in my front matter. To know more about Hugo and front matter, just have a look at Front Matter in the official Hugo documentation. For the laziest: Hugo allows you to add front matter in yaml, toml, or json to your content files. Front matter allows you to keep metadata attached to an instance of a content type, i.e., embedded inside a content file.
Exactly what I want to do!
Let’s go ahead, and I will show my reasoning at the last step… keep going, we are close to the end (for real XD).
Step 5 of 6: __match_audio_and_post()
The match is pretty simple: if the end of the prefix path of the S3 audio object produced by __get_content_read_by_polly is equal to the name of the markdown file, just pair them to modify the front matter of that post. I have to be this level of "sure" (exactly equal) due to possible name collisions, so my suggestion is… just put some print statements if you have a setup like mine, to be sure (see the sketch right after the snippet) ;)
def __match_audio_and_post(file_list, audio_list):
    # match dict
    matches = {audio_path: '' for audio_path in audio_list}
    # find match by name
    for audio_path in audio_list:
        for file_name in file_list:
            if audio_path.split("/")[-1].replace(".mp3", "") == str(file_name).split("/")[-1].replace(".md", "").lower():
                matches[audio_path] = str(file_name)
                # stop scanning once the markdown twin is found
                break
    # return matches
    return matches
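And here is the kind of hypothetical sanity check I mean - just print any audio file that did not find a markdown twin before touching anything:

# hypothetical sanity check: list the audio files left unmatched
for audio_path, file_name in matches.items():
    if not file_name:
        print(f"no markdown match for {audio_path}")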
And finally… modify all the articles in one shot! What could possibly go wrong…
Step 6 of 6: __insert_new_audio_reference()
The matches are ready; we just need to transform each front matter like this:
---
layout: post
title: "I hacked my blog to let AWS Polly create podcast over it"
date: 2020-11-26
categories:
...
into a front matter like this:
---
layout: post
title: "I hacked my blog to let AWS Polly create podcast over it"
date: 2020-11-26
polly: https://madeddu.xyz/mp3/aws/hacked-my-blog-polly.mp3
categories:
...
And here we are:
def __insert_new_audio_reference(matches):
    # for each match
    for audio_name, file_name in matches.items():
        # read the content
        with open(file_name, "r") as f:
            lines = f.readlines()
        # add the polly line right after the first four front-matter lines
        lines = lines[0:4] + [f'polly: {BASE_URL}/{audio_name}\n'] + lines[4:]
        # write the new content
        with open(file_name, "w") as f:
            for line in lines:
                f.write(line)
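Putting it all together, the whole pipeline is more or less the chain of the six steps above. Here is a minimal, hypothetical driver (error handling omitted, and the path handling simplified - check the gist for the real thing):

# a minimal sketch chaining the six steps above
def run():
    # step 1: collect the article urls
    urls = __get_blog_post_urls()
    audio_list = []
    for url in urls:
        # steps 2 and 2.5: gather paragraphs and build the SSML text
        paragraphs = __get_page_content_for_nts(url)
        content = __add_SSML_Enhanced_tags(paragraphs)
        # step 3: let Polly read it and persist the mp3 on s3
        article_path = url.replace(f"{BASE_URL}/", "")
        audio_list.append(__get_content_read_by_polly(article_path, content))
    # steps 4, 5 and 6: pair mp3s with markdown files and patch the front matter
    file_list = __get_markdown_list()
    matches = __match_audio_and_post(file_list, audio_list)
    __insert_new_audio_reference(matches)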
One more thing
How can we leverage the new meta inside a Hugo template? Of course, by using the .Params field inside the theme layout used to produce the blog page. In my case:
{{ define "main" }}
<span>
<h1>{{ .Title }}</h1>
<p>Subscribe to <a href="https://tinyletter.com/made2591">my newsletter</a> to be informed about my new blog posts, talks and activities.</p>
{{ partial "post-meta" . }}
</span>
{{ if .Params.polly }}<audio controls style="width: 100%"><source src="{{ .Params.polly }}" type="audio/mpeg"></audio><br />{{ end }}
<p>
{{ .Content }}
</p>
<p>Subscribe to <a href="https://tinyletter.com/made2591">my newsletter</a> to be informed about my new blog posts, talks and activities.</p>
{{ partial "disqus.html" . }}
{{ end }}
In this way, you're gonna put an HTML5 audio player pointing to the respective content for your article, and you are ready to run your blog as AWS does!
Conclusion
The next step is to put everything behind an API and use it in my build pipeline, to produce my mp3 files at build time. Here you can find a gist with the whole script I used - it's already written like a lambda… Oooops 😜😜😜 Finally, in this article I wrote really BAD code as usual, but I hope I provided you with some insights about how you can leverage AWS Polly to do fancy things and become the best friend of your boss. Just kidding, you can't and you know it. XOXO
Have fun and stay safe! 🖖