Abstractions for Readability
I have never considered myself to be a particularly high-level engineer, and I doubt I ever will.
To compensate, I know that I have to write code with the right abstractions in place. They allow me to convey my thoughts to the reader. A lot of the time, that reader is myself at some point in the future!
Thinking about our readers
I try to think about the cognitive load that I am imposing on readers, and about the scenarios in which they might find themselves reading this piece of code. Is this simply at code review stage? Are they trying to debug their way through a bug in this area? Are they having a bad day? What can I do now to make this easier for them?
A major contributor to the solution is to make use of abstractions. Think of an abstraction as a way to hide unnecessary details from sight. Abstractions are one of the most important concepts in software engineering.

When the aforementioned reader is trying to debug a particular part of my code, they are probably immediately concerned with one area of it. But without clear lines of abstraction in place, they would have to skim through every line of code and decide for themselves where the bug is located. I’m not talented enough to do that, so I don’t expect anyone else to do so.
Minimal abstractions
Okay, enough with the blabber! Show me some code:
import collections
import urllib.request
import bs4
from typing import Dict

def count_words_from_target_url(target_url: str) -> Dict[str, int]:
    html = urllib.request.urlopen(target_url).read()
    beautiful_soup_parser = bs4.BeautifulSoup(markup=html, features="html.parser")
    scraped_result = beautiful_soup_parser.find_all(string=True)
    words = [word for text in scraped_result for word in text.split()]
    return collections.Counter(words)
There is quite a lot to unpack here, but if you are familiar with urllib, Beautiful Soup and the collections library, then you probably understand what is happening. And therein lies the problem! This function makes assumptions about the reader. It assumes the reader is aware of what each library call represents, and of how these implementation details relate to the final outcome.
Given a target_url, this function will first open that target_url and read the data. That data is parsed, and all the text is scraped from it. Finally, collections.Counter is used to create a dictionary containing each word in the scraped text along with the frequency of occurrence of that word.
Adding logical separations
Software development is primarily about communication.
One of the main objectives for the code we write is to clearly convey our thoughts and intentions to our readers.
Another way to approach this would be to ask yourself the following question: how would you describe this function verbally to someone else?
With this in mind, you could logically split this function into 3 parts.
- The request being made to the target_url
- The scraping of the response
- Finally, the counting of the word occurrences
Now let us re-write the function:
def count_words_from_target_url(target_url: str) -> Dict[str, int]:
    html = open_url(target_url)
    all_text_in_html = find_all_text_in_html(html)
    return count_word_frequencies(all_text_in_html)
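One possible implementation of those three helpers, keeping the article's library choices (urllib, Beautiful Soup, collections). The word-splitting detail in count_word_frequencies() is my assumption about the intended behaviour, not something spelled out in the original:

```python
import collections
import urllib.request
from typing import Dict, List

import bs4


def open_url(target_url: str) -> bytes:
    # The only function that needs to know about urllib.
    return urllib.request.urlopen(target_url).read()


def find_all_text_in_html(html: bytes) -> List[str]:
    # The only function that needs to know about Beautiful Soup.
    soup = bs4.BeautifulSoup(markup=html, features="html.parser")
    return soup.find_all(string=True)


def count_word_frequencies(all_text: List[str]) -> Dict[str, int]:
    # Split each text node into words and tally the occurrences.
    words = [word for text in all_text for word in text.split()]
    return collections.Counter(words)
```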
Now we have encapsulated the details of how to do those three things within abstractions. The count_words_from_target_url() function does not have to care about the implementation details of how to make the request, scrape the data or count the result. Those responsibilities have been delegated elsewhere. count_words_from_target_url() only has to care about calling out to open_url() to fetch the data it needs. So we can now say it is the responsibility of open_url() to know how to use the urllib library.
With these abstractions in place, we now have clearer separation between our pieces of logic. If tomorrow we decided to change how the request is made, we would only need to make the change at the lower level of abstraction. In this case, that change would be made only to the open_url() function.
Leaving signposts
Think back to our poor reader who is having a bad day trying to find the source of a bug. Our abstractions can now act as signposts for that reader: turn right if you want to investigate how we open URLs, turn left if you want to see how we scrape text.
We’ve also gained better testability for free. Writing with testability in mind is often a sure-fire way to ensure we are writing good-quality code. With our abstractions in place, we can now test each piece of logic in isolation. Our tests can be smaller and focused on one thing at a time.
Previously, our tests for the first iteration of count_words_from_target_url() would have been quite bloated. Not to mention that unit testing it would have been quite difficult, given the exposure of all of those implementation details.
Like with most things in life, too much of a good thing can be bad. It’s important that we strike the right balance. And having too many abstractions can send us in the opposite direction. But structuring your abstractions with readability in mind can be incredibly helpful in striking this balance.
In our case, the biggest win from a testing perspective came from adding an abstraction between our code and the network call being made by the line:
html = open_url(target_url)
But we would not want to test, for example, that urllib.request.urlopen(target_url).read() was being called under the hood, since that is an implementation detail. Doing so would make our tests brittle, as they would be subject to the underlying details of open_url(). The key thing is to ensure that we test behaviour, not implementation details.
Having the open_url() abstraction makes sense because it means we can stub out the network call being made to the target_url. We can avoid the flakiness that network calls bring with them and focus on the behaviour of count_words_from_target_url().
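To make that concrete, here is a minimal, self-contained sketch of stubbing out open_url() and asserting on behaviour only. The helper bodies are simplified stand-ins rather than the real implementations, and in a real test suite you would patch the name with unittest.mock rather than reassigning it:

```python
import collections
from typing import Dict, List


def open_url(target_url: str) -> str:
    raise AssertionError("no network access in this sketch")


def find_all_text_in_html(html: str) -> List[str]:
    return [html]  # simplified stand-in for the Beautiful Soup parsing


def count_word_frequencies(all_text: List[str]) -> Dict[str, int]:
    return collections.Counter(w for text in all_text for w in text.split())


def count_words_from_target_url(target_url: str) -> Dict[str, int]:
    html = open_url(target_url)
    return count_word_frequencies(find_all_text_in_html(html))


def fake_open_url(target_url: str) -> str:
    return "hello world hello"


# Stub out the network call and assert on the result, not on urllib internals.
open_url = fake_open_url
assert count_words_from_target_url("https://example.com") == {"hello": 2, "world": 1}
```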
Injecting the I/O bound dependency
We still have a pretty major flaw here. open_url() is currently making an I/O-bound network call out to the target_url. This is not great, as the call goes out of our system to an external component that we have no control over, while we should aim to have full control over everything we own.
Say we wanted to test how count_words_from_target_url() computes the counting of words. How would we isolate the counting of the words from the HTML fetched from a URL? Would we create a dummy site and point our test there? Possibly, but I think you’ll agree that seems like a lot of work, plus it incurs significant maintenance overhead! And it would still not remove the flakiness that comes with external network calls.
Instead, we could inject the dependency of opening the URL. That way, we would have finer control over count_words_from_target_url().
def count_words_from_target_url(target_url: str, url_manager) -> Dict[str, int]:
    html = url_manager.open_url(target_url)
    all_text_in_html = find_all_text_in_html(html)
    return count_word_frequencies(all_text_in_html)
Now, our url_manager parameter can be any object which implements an open_url() method. The caller of our function would have to provide the url_manager; alternatively, we could provide a default. But the most important thing here is that we now have a level of control over count_words_from_target_url() that we did not have before. We gain the ability to truly unit test count_words_from_target_url().
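Here is a sketch of what such a unit test could now look like. FakeUrlManager and the simplified helper bodies are illustrative assumptions of mine, not part of a real codebase:

```python
import collections
from typing import Dict, List


def find_all_text_in_html(html: str) -> List[str]:
    return [html]  # simplified stand-in for the Beautiful Soup parsing


def count_word_frequencies(all_text: List[str]) -> Dict[str, int]:
    return collections.Counter(w for text in all_text for w in text.split())


def count_words_from_target_url(target_url: str, url_manager) -> Dict[str, int]:
    html = url_manager.open_url(target_url)
    return count_word_frequencies(find_all_text_in_html(html))


class FakeUrlManager:
    """Test double: returns canned HTML, with no network call involved."""

    def open_url(self, target_url: str) -> str:
        return "spam spam eggs"


# The test controls exactly what HTML the function sees.
result = count_words_from_target_url("https://example.com", FakeUrlManager())
assert result == {"spam": 2, "eggs": 1}
```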
And we have separated the concern of fetching the HTML of the target site from the counting of the words, whilst also improving the readability of our code.
Summary
Now we have also defined clear boundaries between the various moving parts, and we are depending on abstractions instead of implementations. This satisfies one of the SOLID principles, the Dependency Inversion Principle: high-level modules should depend upon abstractions rather than on details. That is a fancy way of saying we should provide separation between the various components of our code through the use of abstractions, and we should let ourselves be guided by building those abstractions with readability in mind.
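One way to make that dependency on an abstraction explicit in Python is with typing.Protocol. This sketch is my addition, not something from the discussion above:

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class UrlOpener(Protocol):
    """The abstraction: anything with an open_url() method satisfies it."""

    def open_url(self, target_url: str) -> str: ...


class FakeUrlManager:
    def open_url(self, target_url: str) -> str:
        return "<p>canned html</p>"


# FakeUrlManager never mentions UrlOpener, yet it satisfies the abstraction,
# so count_words_from_target_url can depend on UrlOpener alone.
assert isinstance(FakeUrlManager(), UrlOpener)
```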
The most important part of programming
Our industry tends towards the latest fad, the newest framework or language. I have heard countless debates about one language or framework being better than the other. However, these are all just tools, which can shine in certain circumstances.
But all of these conversations miss a key point.
In my opinion, the single most important thing is knowing how to build systems with the appropriate abstractions. This leads us to building systems which are modular and can be easily modified.
This helps us build our systems with confidence. Confidence that we can easily make changes whilst minimising our blast-radius. Confidence that our systems can grow over time without becoming a dreaded tangled mess that is a horror to work with.
The reality is that most modern systems are too complex for any of us to understand every minor detail of.
Abstractions allow us to compartmentalize our systems and free up our minds to focus on specific areas. They also allow us to make changes easily and without fear.
This is perhaps the most under-appreciated and least-discussed aspect of software engineering. Ensuring that we can easily change our systems is our responsibility. The systems we build are not static. Unless you are a genius who also happens to carry a crystal ball at all times, you will have to make changes to your system over time.
If we cannot do this, then arguing for Java over Python or Go will not make the difference you might have been led to believe.