Type Safety - How to write fewer tests
Bad decisions
Sometimes I write bad code. In fact, it's so bad that when refactoring a function in Python, I end up forgetting to adjust the caller to the function's new signature. This results in very frustrating crashes and a long stack trace. First I need to parse through it (which, after some years of practice, is actually quite fast), then I need to find the offending line and understand how this error came to be (my change of the function signature, duh!).
Sure, I could make backward-compatible changes, keeping the old signature and adding some new **kwargs so existing callers keep working, but oftentimes this is not what you want. You do not want to just throw more code on top. Instead, you want to make sure a function does exactly what it is intended for and receives exactly the arguments it needs. Just like in life, where some people will not like something you do, in programming changes are not always backward compatible: they will break your function calls, and the callers will not like it. If you try to please them all, you live in agony forever.
The problem
But here is the problem: how do you even know which caller uses your function, and when and where? Thankfully, we have LSPs that make our lives easier, and most modern IDEs provide lookup automation for function calls. But then again, in Python you can do something magical like this:
from importlib import import_module
module_name = "math"
function_name = "sqrt"
module = import_module(module_name)
func = getattr(module, function_name)
result = func(16)
print(result) # 4.0
This one, to my knowledge, will not be caught by your LSP. This way your function can be used and you do not even notice. Not only do you have to remember to manually check all function callers using your tooling, you also have to check for dynamic calls like the one above. In fact, it is very tedious to find all the places where your function is called. Since we do not like it when our production systems suddenly break and raise errors while we are out walking our dog, taking a nap, or just enjoying our lives, or possibly even working, we start to write tests.
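To make the failure mode concrete, here is a minimal sketch (the function name and its "old signature" are made up for illustration): after a refactor, a dynamic call written against the old signature looks fine to static tooling and only blows up at runtime.

```python
# Hypothetical example: a function whose signature was refactored
# (old signature: send_report(recipient, body)).
def send_report(recipient, subject, body):
    return f"To {recipient}: {subject}\n{body}"

# A dynamic caller, written against the old two-argument signature,
# that no LSP will flag:
func = globals()["send_report"]
try:
    func("alice@example.com", "quarterly numbers")  # missing an argument
except TypeError as e:
    print(f"Only at runtime do we learn: {e}")
```

Nothing in the editor warns you here; the mistake surfaces only when this exact code path is executed.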
Tests
Tests are a major pillar of software development. Isn't it amazing that you can use the tool (software) itself to make sure the tool works as intended? Sometimes, writing tests for a feature even takes up more time than writing the feature itself. In fact, in the past, I would have argued that writing tests is more important than writing the actual code. What about that test-driven development (TDD) thing where people preach writing tests first like it was one of the Ten Commandments? They are not wrong, at least not entirely.
When working on a large Python codebase, I feel that deep urge to test every single line of code that I write. Why? Because part of the magic of Python is that it is so flexible and dynamic that you can do almost anything with it. It's like the Wild West of programming languages. It's amazing, but it is also a double-edged sword. Python will not tell you that your code is wrong until you run it. It may look great, it may feel right, and then it will beautifully crash in production under circumstances you may have never imagined. And because you have felt the pain of these things going wrong enough times, you start to excessively test your code. Unit tests, integration tests, end-to-end tests, you name it. You think of all possible scenarios, edge cases, and things that could go wrong. But not to just blame dynamic typing, statically typed languages also have a big playground where things can go wrong, so you will need tests either way.
But in Python, you will need more tests. Because maybe you did what I did, changed the function signature, and forgot to change the caller. And this one time you forgot to write a test for it. Maybe it's connected to a feature flag that is only set in production. Maybe your DEV environment and CI are too far away from reality for just that case. Circumstances come together to create the perfect storm, as they like to, to make our lives difficult. This also happens in the frontend: UIs break, users get upset. So what did developers do? They invented TypeScript, a statically typed language that transpiles to JavaScript. The promise is that it can discover more errors before code is deployed to production. And if you look at the rise of TypeScript, it is quite clear that this promise is being fulfilled. I cannot recall the last time somebody mentioned starting a new frontend project with plain JavaScript.
The lies
And in Python, we have type hints. The concept is the same, but the implementation is a bit different. Type hints are optional. Unless you set up your project with the right tooling for them, they will never be enforced. At first glance, I found type hints to be a great feature of the language, but on second thought I found them to be a false friend. They provide you assurance when reading a function signature, but in fact, unless you type check using mypy or ty or something similar, you can just write something like this:
from typing import Dict

def concat(a: str, b: str) -> Dict[str, int]:
    return a + b

result = concat("hello", "world")
The return type in this function signature is an obvious lie (as obvious as those of some politicians these days). But what if the function grows larger and we cannot discover the discrepancy at first glance? We might fall into the trap of trusting the type hints, even though they are blatantly lying. The type hint feature feels a lot, as we say in German, "dran-getackert", which means "stuck on" without concern for the wider implications. It feels like a fifth wheel on a car that is just there, but not really integrated into the design of the language. And if you think about Python, what is it that makes it such a beautifully designed language? I would argue one of the main reasons is the consideration of whitespace as a feature, the simplicity of defining variables without the need for type annotations, and easily readable code because you are not overloaded with syntax. Simplicity in general is what makes it so pleasant to write Python. But its downside is the lack of type checking. And if you opt to use a type checker, you will find it quite challenging when working with big frameworks like Django, where there is no full support for type checking. So for now, in Python, I will keep writing excessive tests so I can sleep well at night.
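A quick way to convince yourself that hints are metadata rather than enforcement: the interpreter stores the annotations but never checks them. A minimal sketch, reusing the lying concat function from above:

```python
from typing import Dict

def concat(a: str, b: str) -> Dict[str, int]:  # the return annotation is a lie
    return a + b

result = concat("hello", "world")
print(type(result).__name__)  # str -- the interpreter never checks the hint
```

A type checker such as mypy would flag the mismatch, but plain Python runs this without complaint.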
Type Safety
If you have made it this far, you might be wondering: how can types actually help us write fewer tests? And isn't writing fewer tests actually something bad?
It depends. Tests are only useful if you actually run them. And the more tests you have, the more time it takes to run them.
So how can types help? Let's look at an example. I have a string that I know should never be empty. I want to make sure that under no circumstances this string is empty. In Python, I would write something like this:
def validate_string(s: str) -> str:
    if s == "":
        raise ValueError("String cannot be empty")
    # do some other validation...
    # ...
    # if all validation passes, return the string
    return s
You can then go and use the string as needed. But once your codebase is huge, many developers work in parallel, and you have a lot of moving parts, how do you actually know that the string went through the validation function? You can find out by checking tests and looking at the actual code to see if the validation function is called. But as we discussed before, finding function calls in Python can be quite tedious.
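The underlying issue is that a validated string and a raw string are the same type, so no signature can record that validation happened. A minimal sketch, reusing the validate_string function from above:

```python
def validate_string(s: str) -> str:
    if s == "":
        raise ValueError("String cannot be empty")
    return s

checked = validate_string("hello")
unchecked = ""  # never went through validation

# Both are plain str -- a function signature cannot tell them apart:
print(type(checked) is type(unchecked))  # True
```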
In statically typed languages, there is another way. For example, in Rust (yes, he writes about Rust again... but the idea also applies to other statically typed languages), you can define a new type for this string by wrapping the traditional string type.
struct NonEmptyString(String);

impl NonEmptyString {
    pub fn new(s: String) -> Result<Self, String> {
        if s.is_empty() {
            Err("String cannot be empty".to_string())
        } else {
            Ok(NonEmptyString(s))
        }
    }
}
Since the String is wrapped in a new type and the field is private, code outside the defining module cannot instantiate this type without going through the validation in the new function.
struct NonEmptyString(String); // <- the inner String is a private field
So you cannot do:
let s = NonEmptyString("".to_string());
because the compiler will give you an error before the code even runs once. In fact, the only entry point to create a NonEmptyString is:
let s = NonEmptyString::new("".to_string());
which returns a Result, so the empty-string case surfaces as an Err value the caller is forced to handle. And this gives you an amazing guarantee:
Wherever you find this type in the codebase, you know that the string is not empty.
You know this just by trusting the type. You do not need to check the tests, and you do not need to follow the flow of the string through the codebase. You do not need to gather all instances where validate_string is called. Because the type system guarantees that this type is in fact this type (and not a lie like in the Python example), you are comfortable trusting the type system.
So whenever a function receives or returns a NonEmptyString, you are confident that the string is in fact not empty.
fn some_function(s: String) -> NonEmptyString {
    // ...
}
And this is where you start writing fewer tests. In fact, writing a parameterized unit test for the new function is enough to test the validation logic. Then, wherever this type appears in the codebase, you do not need to worry about it being empty.
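You can approximate this newtype pattern in Python too, though without a compiler enforcing privacy it remains a convention rather than a guarantee. A hedged sketch (the class name simply mirrors the Rust example):

```python
class NonEmptyString:
    """Wrapper whose constructor is the only (conventional) entry point."""
    __slots__ = ("_value",)

    def __init__(self, s: str) -> None:
        if s == "":
            raise ValueError("String cannot be empty")
        self._value = s

    def __str__(self) -> str:
        return self._value

s = NonEmptyString("hello")
print(str(s))  # hello
# NonEmptyString("") raises ValueError -- but unlike Rust, nothing
# truly stops code elsewhere from reaching into s._value.
```

Combined with a type checker that verifies functions receive NonEmptyString rather than str, this gets you part of the way, but only the Rust compiler makes the guarantee airtight.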
AI Agents
In 2025, probably the most interesting technical development for me was the rise of AI agents (this topic deserves its own blog post). They work by chaining together multiple calls to LLMs to create a feedback loop. The agent becomes mighty because it can use tools and scripts to do things, literally impersonating a working human. It can develop a feature, then run the test suite, find the failing tests, and then use the output of the test run to adjust the code again. This feedback loop makes agents very, very powerful.
And this is where types come into play. While some argue that agents work better for well-established programming languages like Python or JavaScript because they have larger datasets to learn from, in my experience the flexibility of these languages allows for more silent errors that are not fed back to the agent via the feedback loop. When an agent writes bad code and a function is called badly, like I would do from time to time, the agent will not be aware of the error immediately. The higher strictness of a language like Rust, on the other hand, will immediately give the agent feedback to reconsider its code and make adjustments to satisfy the function signature.
And this is really where it becomes interesting: when agents become reliable enough to write code on their own and you just review the features they develop, is the flexibility of a language still an advantage? Is the ease of writing code still more important than the safety of the code when it is the AI that actually does the writing? I will leave this question open for now as food for thought.