Failure is your Domain
Go’s paradox is that error handling is core to the language yet the language doesn’t prescribe how to handle errors. Community efforts have been made to improve and standardize error handling but many miss the centrality of errors within our application’s domain. That is, your errors are as important as your Customer and Order types.
An error also must serve the different goals for each of its consumer roles—the application, the end user, and the operator. This post explores the purpose of errors for each of these consumers within our application and how we can implement a simple but effective strategy that satisfies each role’s needs.
This post expands on many ideas about application domain & project design from Standard Package Layout so it is helpful to read that first.
Why We Err
Errors, at their core, are simply a way of explaining why things didn’t go how you wanted them to. Go splits errors into two groups: panic and error. A panic occurs when you don’t expect something to go wrong such as accessing invalid memory. Typically, a panic is unrecoverable so our application fails catastrophically and we simply notify an operator to fix the bug.
An error, on the other hand, is when we expect something could go wrong. That’s the focus of this post.
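The split can be seen in a few lines. The sketch below is illustrative only: parsePort, firstByte, and didPanic are hypothetical names that do not appear elsewhere in this post. Bad input is something we expect, so parsePort returns an error, while indexing an empty slice is a bug, so firstByte panics.

```go
package main

import (
	"fmt"
	"strconv"
)

// parsePort converts user input to a port number. Bad input is something we
// expect, so it is reported as an error value.
func parsePort(s string) (int, error) {
	port, err := strconv.Atoi(s)
	if err != nil {
		return 0, fmt.Errorf("parse port %q: %v", s, err)
	}
	return port, nil
}

// firstByte assumes b is non-empty. Calling it with an empty slice is a bug,
// and the runtime panics with an index out of range error.
func firstByte(b []byte) byte {
	return b[0]
}

// didPanic runs fn and reports whether it panicked. It exists only so this
// demo can keep running; real code would normally let the panic crash.
func didPanic(fn func()) (panicked bool) {
	defer func() {
		if r := recover(); r != nil {
			panicked = true
		}
	}()
	fn()
	return false
}

func main() {
	if _, err := parsePort("eighty"); err != nil {
		fmt.Println("error:", err)
	}
	fmt.Println("panicked:", didPanic(func() { firstByte(nil) }))
}
```

The caller of parsePort can decide what to do with the error, whereas the only sensible response to the panic is to fix the calling code.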
Types of errors
We can divide error into two categories: well-defined errors & undefined errors.
A well-defined error is one that is specified by the API, such as the error returned from os.Open() for which os.IsNotExist() reports true. These allow us to manage our application flow because we know what to expect and can work with them on a case-by-case basis.
An undefined error is one that is undocumented by the API and therefore we are unable to thoughtfully handle it. This can occur from poor documentation but it can also occur when APIs we depend on add additional error conditions after we’ve integrated our code with them.
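As a rough illustration of the difference, the sketch below handles the one condition that os.Open() documents and that we care about (a missing file) and passes everything else up untouched. The readConfig function, the file path, and the fallback value are hypothetical.

```go
package main

import (
	"fmt"
	"io"
	"os"
)

// readConfig returns the contents of the file at path. A missing file is a
// well-defined condition, so we handle it by returning the fallback value.
// Any other error is one we did not plan for, so it is returned to the caller.
func readConfig(path, fallback string) (string, error) {
	f, err := os.Open(path)
	if os.IsNotExist(err) {
		return fallback, nil
	} else if err != nil {
		return "", err
	}
	defer f.Close()

	data, err := io.ReadAll(f)
	if err != nil {
		return "", err
	}
	return string(data), nil
}

func main() {
	// The path is made up for this example and is assumed not to exist.
	cfg, err := readConfig("/hypothetical/path/app.conf", "port=8080")
	if err != nil {
		fmt.Println("unexpected error:", err)
		return
	}
	fmt.Println(cfg)
}
```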
Who Consumes Our Errors
The tricky part about errors is that they need to be different things to different consumers of them. In any given system, we have at least 3 consumer roles—the application, the end user, & the operator.
Application role
Your first line of defense in error handling is your application itself. Your application code can recover from error states quickly and without paging anyone in the middle of the night. However, application error handling is the least flexible and it can only handle well-defined error states.
An example of this is your web browser receiving a 301 redirect code and navigating you to a new location. It’s a seamless process that most users are oblivious to. It’s able to do this because the HTTP specification has well-defined status codes.
End user role
If your application is unable to handle the error condition then hopefully your end user can resolve the issue. Your end user can see an error state such as "Your debit card is declined" and is flexible enough to resolve it (e.g. deposit money in their bank account).
Unlike the application role, the end user needs a human-readable message that can provide context to help them resolve the error.
These users are still limited to well-defined errors since revealing undefined errors could compromise the security of your system. For example, a Postgres error may detail query or schema information that could be used by an attacker. When confronted with an undefined error, it may be appropriate to simply tell the user to contact technical support.
Operator role
Finally, the last line of defense is the system operator which may be a developer or an operations person. These people understand the details of the system and can work with any kind of error.
In this role, you typically want to see as much information as possible. In addition to the error code and human-readable message, a logical stack trace can help the operator understand the program flow.
Our Baseline Application Error Type
Given our understanding that we need error codes, human-readable messages, and a logical stack trace, we can construct a simple error type to handle most of our application’s errors.
package myapp

// Error defines a standard application error.
type Error struct {
	// Machine-readable error code.
	Code string

	// Human-readable message.
	Message string

	// Logical operation and nested error.
	Op  string
	Err error
}
Note: This approach extends the implementation described in Error handling in Upspin but with some important differences that we’ll discuss later.
This is our baseline error type for our application. The Code and Message fields provide communication to our application and end user roles, respectively. The Op and Err fields allow us to chain errors together so that we can build the logical stack trace for our operator.
One important detail that is easy to miss is that our error is defined in our root package as myapp.Error. This is important because it becomes part of our domain language. It also avoids the stutter problem that’s caused when defining an errors subpackage (e.g. errors.Error).
Every application is unique though so this can be extended further based on your application or business use case. You may also need to define additional specific error types but this works surprisingly well for most error scenarios.
Error Management by Role
Our Error type is a good start but we need to add some simple functionality to make it usable. Let’s walk through various scenarios from the perspective of each of our consumer roles. In the following examples we’ll assume an application, myapp, that has a UserService interface defined in its domain:
package myapp

import "context"

// UserService represents a service for managing users.
type UserService interface {
	// Returns a user by ID.
	FindUserByID(ctx context.Context, id int) (*User, error)

	// Creates a new user.
	CreateUser(ctx context.Context, user *User) error
}
Let’s look and see how each of these roles can manage its error state.
Application role
Our application role is typically concerned with simple error codes. For example, if our program attempts to fetch a User by ID and it receives a "not found" error, it could reattempt by searching by an e-mail address.
Defining our error codes
While it’s tempting to build fine-grained error codes, it’s much easier to manage more generic codes. There are several systems that define generic codes from which we can draw inspiration but two of my favorites are HTTP & gRPC.
While these specifications contain numerous error codes, I try to start from a more humble set of codes and expand as needed. I find these to be a good, simple starting point:
// Application error codes.
const (
	ECONFLICT = "conflict"  // action cannot be performed
	EINTERNAL = "internal"  // internal error
	EINVALID  = "invalid"   // validation failed
	ENOTFOUND = "not_found" // entity does not exist
)
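Because these codes borrow from HTTP, they translate to status codes with very little effort. The following is a minimal sketch of one possible mapping; the HTTPStatus function and the choice of status for each code are my assumptions for illustration, not something the application codes dictate.

```go
package main

import (
	"fmt"
	"net/http"
)

// Application error codes, repeated here so the example is self-contained.
const (
	ECONFLICT = "conflict"
	EINTERNAL = "internal"
	EINVALID  = "invalid"
	ENOTFOUND = "not_found"
)

// statusByCode is a hypothetical mapping from application codes to HTTP
// status codes.
var statusByCode = map[string]int{
	ECONFLICT: http.StatusConflict,
	EINTERNAL: http.StatusInternalServerError,
	EINVALID:  http.StatusBadRequest,
	ENOTFOUND: http.StatusNotFound,
}

// HTTPStatus returns the HTTP status for an application error code.
// Codes that are not in the map are treated as internal errors.
func HTTPStatus(code string) int {
	if status, ok := statusByCode[code]; ok {
		return status
	}
	return http.StatusInternalServerError
}

func main() {
	for _, code := range []string{ECONFLICT, EINTERNAL, EINVALID, ENOTFOUND} {
		fmt.Println(code, HTTPStatus(code))
	}
}
```

Keeping the set of codes small means a table like this stays short enough to review at a glance.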
Translating codes to our domain
These error codes are specific to our application so when we are interacting with external libraries we must translate those errors to our domain’s error codes. For example, if our application implements our UserService in Postgres, we will need to translate a sql.ErrNoRows error to an ENOTFOUND code. Our domain model has no knowledge of sql.ErrNoRows and it would break down if we also implement UserService with a non-SQL database.
package postgres

// FindUserByID returns a user by ID. Returns ENOTFOUND if user does not exist.
func (s *UserService) FindUserByID(ctx context.Context, id int) (*myapp.User, error) {
	var user myapp.User
	if err := s.QueryRowContext(ctx, `
		SELECT id, username
		FROM users
		WHERE id = $1
	`,
		id,
	).Scan(
		&user.ID,
		&user.Username,
	); err == sql.ErrNoRows {
		// Translate the driver-specific error to our domain's code.
		return nil, &myapp.Error{Code: myapp.ENOTFOUND}
	} else if err != nil {
		return nil, err
	}
	return &user, nil
}
This allows us to return an ENOTFOUND error for our application to operate on independent of the implementation of UserService.
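To show that the code really is independent of the storage layer, here is a sketch of the same lookup backed by a map instead of SQL. MemUserService is a hypothetical type, the myapp types are flattened into package main so the example runs on its own, and the Error() method is a simplified stand-in for the one developed later in this post.

```go
package main

import (
	"context"
	"fmt"
)

const ENOTFOUND = "not_found"

// Error mirrors the application error type from this post.
type Error struct {
	Code    string
	Message string
	Op      string
	Err     error
}

// Error is a simplified stand-in so that *Error satisfies the error interface.
func (e *Error) Error() string { return "<" + e.Code + "> " + e.Message }

// User is a minimal version of the domain type.
type User struct {
	ID       int
	Username string
}

// MemUserService is a hypothetical in-memory implementation.
type MemUserService struct {
	users map[int]*User
}

// FindUserByID returns a user by ID. Returns ENOTFOUND if user does not exist.
// There is no sql.ErrNoRows here, yet callers see the same code.
func (s *MemUserService) FindUserByID(ctx context.Context, id int) (*User, error) {
	user, ok := s.users[id]
	if !ok {
		return nil, &Error{Code: ENOTFOUND}
	}
	return user, nil
}

func main() {
	s := &MemUserService{users: map[int]*User{1: {ID: 1, Username: "susy"}}}

	_, err := s.FindUserByID(context.Background(), 2)
	if e, ok := err.(*Error); ok {
		fmt.Println(e.Code)
	}
}
```

Calling code that checks for ENOTFOUND works unchanged against either implementation.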
Working with error codes effectively
At this point, however, we have two issues. First, our function returns error instead of *myapp.Error so we’ll need to type assert whenever we want to access Error.Code. That’s annoying. Second, our Error.Err field allows us to nest errors so our top-level error may not contain the error code and we’ll need to recursively search for it.
We can solve both issues with a simple function that accomplishes the following:
- Returns no error code for nil errors.
- Searches the chain of Error.Err until a defined Code is found.
- If no code is defined then return an internal error code (EINTERNAL).
Here is the implementation:
// ErrorCode returns the code of the root error, if available. Otherwise returns EINTERNAL.
func ErrorCode(err error) string {
	if err == nil {
		return ""
	} else if e, ok := err.(*Error); ok && e.Code != "" {
		return e.Code
	} else if ok && e.Err != nil {
		return ErrorCode(e.Err)
	}
	return EINTERNAL
}
We can now apply this in our calling code:
user, err := userService.FindUserByID(ctx, 100)
if myapp.ErrorCode(err) == myapp.ENOTFOUND {
	// retry another method of finding our user
} else if err != nil {
	return err
}
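For a fuller picture, here is a runnable sketch of the retry described earlier, where a failed lookup by ID falls back to a lookup by e-mail. The store type, FindUserByEmail, and findUser are hypothetical, the context argument is dropped for brevity, and Error() is a shortened version of the method shown later in this post.

```go
package main

import "fmt"

const (
	EINTERNAL = "internal"
	ENOTFOUND = "not_found"
)

// Error mirrors the application error type from this post.
type Error struct {
	Code    string
	Message string
	Op      string
	Err     error
}

// Error is a shortened version of the operator-facing string.
func (e *Error) Error() string {
	if e.Err != nil {
		return e.Op + ": " + e.Err.Error()
	}
	return "<" + e.Code + "> " + e.Message
}

// ErrorCode is the function from this post, repeated so the example runs.
func ErrorCode(err error) string {
	if err == nil {
		return ""
	} else if e, ok := err.(*Error); ok && e.Code != "" {
		return e.Code
	} else if ok && e.Err != nil {
		return ErrorCode(e.Err)
	}
	return EINTERNAL
}

type User struct {
	ID    int
	Email string
}

// store is a hypothetical in-memory lookup used only for this example.
type store struct {
	byID    map[int]*User
	byEmail map[string]*User
}

// FindUserByID nests its ENOTFOUND one level deep to show that ErrorCode
// searches the chain.
func (s *store) FindUserByID(id int) (*User, error) {
	if u, ok := s.byID[id]; ok {
		return u, nil
	}
	return nil, &Error{Op: "store.FindUserByID", Err: &Error{Code: ENOTFOUND}}
}

func (s *store) FindUserByEmail(email string) (*User, error) {
	if u, ok := s.byEmail[email]; ok {
		return u, nil
	}
	return nil, &Error{Op: "store.FindUserByEmail", Err: &Error{Code: ENOTFOUND}}
}

// findUser looks up by ID and falls back to e-mail only on ENOTFOUND.
// Any other error is returned as-is.
func findUser(s *store, id int, email string) (*User, error) {
	user, err := s.FindUserByID(id)
	if ErrorCode(err) == ENOTFOUND {
		return s.FindUserByEmail(email)
	} else if err != nil {
		return nil, err
	}
	return user, nil
}

func main() {
	u := &User{ID: 7, Email: "susy@example.com"}
	s := &store{byID: map[int]*User{}, byEmail: map[string]*User{u.Email: u}}

	found, err := findUser(s, 100, "susy@example.com")
	fmt.Println(found.ID, err)
}
```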
End user role
Our end users expect actionable, human-readable messages. These can have additional constraints such as branding tone or internationalization but we’ll just focus on the basics.
Example usage
A perfect example of end user messaging is for field validation. Here we check to ensure that new users in our UserService have a username and it is unique.
package postgres

// CreateUser creates a new user in the system.
// Returns EINVALID if the username is blank.
// Returns ECONFLICT if the username is already in use.
func (s *UserService) CreateUser(ctx context.Context, user *myapp.User) error {
	// Validate username is non-blank.
	if user.Username == "" {
		return &myapp.Error{Code: myapp.EINVALID, Message: "Username is required."}
	}

	// Verify user does not already exist.
	if s.usernameInUse(user.Username) {
		return &myapp.Error{
			Code:    myapp.ECONFLICT,
			Message: "Username is already in use. Please choose a different username.",
		}
	}
	...
}
Working with error messages effectively
Our error messages pose similar issues to our error codes above. We need a utility function to extract messages from error values. We can implement a function similar to ErrorCode() except with the following rules:
- Returns no error message for nil errors.
- Searches the chain of Error.Err until a defined Message is found.
- If no message is defined then return a generic error message.
Here is the implementation:
// ErrorMessage returns the human-readable message of the error, if available.
// Otherwise returns a generic error message.
func ErrorMessage(err error) string {
	if err == nil {
		return ""
	} else if e, ok := err.(*Error); ok && e.Message != "" {
		return e.Message
	} else if ok && e.Err != nil {
		return ErrorMessage(e.Err)
	}
	return "An internal error has occurred. Please contact technical support."
}
Now we can show this to our users if there is an error:
if msg := ErrorMessage(err); msg != "" {
	fmt.Printf("ERROR: %s\n", msg)
}
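In a web application the same two helpers can drive the whole error response. The sketch below is one way this might look, not a prescribed design: errorBody, errorResponse, writeError, and the JSON shape are all hypothetical, and errDriver stands in for an undefined error from a database driver. Note that the driver's text never reaches the end user.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"net/http"
)

const (
	ECONFLICT = "conflict"
	EINTERNAL = "internal"
	EINVALID  = "invalid"
	ENOTFOUND = "not_found"
)

// Error mirrors the application error type from this post.
type Error struct {
	Code    string
	Message string
	Op      string
	Err     error
}

// Error is a shortened version of the operator-facing string.
func (e *Error) Error() string {
	if e.Err != nil {
		return e.Op + ": " + e.Err.Error()
	}
	return "<" + e.Code + "> " + e.Message
}

// ErrorCode and ErrorMessage are the functions from this post.
func ErrorCode(err error) string {
	if err == nil {
		return ""
	} else if e, ok := err.(*Error); ok && e.Code != "" {
		return e.Code
	} else if ok && e.Err != nil {
		return ErrorCode(e.Err)
	}
	return EINTERNAL
}

func ErrorMessage(err error) string {
	if err == nil {
		return ""
	} else if e, ok := err.(*Error); ok && e.Message != "" {
		return e.Message
	} else if ok && e.Err != nil {
		return ErrorMessage(e.Err)
	}
	return "An internal error has occurred. Please contact technical support."
}

// errorBody is a hypothetical JSON shape for error responses.
type errorBody struct {
	Code    string `json:"code"`
	Message string `json:"message"`
}

// errorResponse picks an HTTP status and an end-user-safe body for err.
func errorResponse(err error) (int, errorBody) {
	status := http.StatusInternalServerError
	switch ErrorCode(err) {
	case EINVALID:
		status = http.StatusBadRequest
	case ECONFLICT:
		status = http.StatusConflict
	case ENOTFOUND:
		status = http.StatusNotFound
	}
	return status, errorBody{Code: ErrorCode(err), Message: ErrorMessage(err)}
}

// writeError sends the end-user view of err as JSON.
func writeError(w http.ResponseWriter, err error) {
	status, body := errorResponse(err)
	w.Header().Set("Content-Type", "application/json")
	w.WriteHeader(status)
	_ = json.NewEncoder(w).Encode(body)
}

// errDriver stands in for an undefined error from a database driver.
var errDriver = errors.New(`syntax error at or near "INSERT"`)

func main() {
	for _, err := range []error{
		&Error{Code: EINVALID, Message: "Username is required."},
		&Error{Op: "UserService.CreateUser", Err: errDriver},
	} {
		status, body := errorResponse(err)
		fmt.Println(status, body.Code, body.Message)
	}
}
```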
Operator role
Finally, we need to be able to provide all this information plus a logical stack trace to our operator so they can debug issues. Go already provides a simple method, error.Error(), to print error information so we can utilize that.
Logical stack traces
Many operators are familiar with stack traces. They dump a list of every function in the call stack from where an error occurred. You can see this at work when you call panic().
However, many times stack traces can be overwhelming and we only need a small subset of those lines to understand the context of our error. A logical stack trace contains only the layers that we as developers find to be important in describing the program flow. We will accomplish this by using the Op and Err fields to wrap errors to provide context.
Implementing Error()
Our myapp.Error.Error() function should return an error string suitable for operators. There’s no definitive standard for how to format this message but I format mine with these goals in mind:
- Show the logical stack trace first. It provides context for the rest of the message. It also allows us to sort error lines to group them together.
- Show Code & Message at the end.
- Print on a single line so it’s easy to grep.
// Error returns the string representation of the error message.
func (e *Error) Error() string {
	var buf bytes.Buffer

	// Print the current operation in our stack, if any.
	if e.Op != "" {
		fmt.Fprintf(&buf, "%s: ", e.Op)
	}

	// If wrapping an error, print its Error() message.
	// Otherwise print the error code & message.
	if e.Err != nil {
		buf.WriteString(e.Err.Error())
	} else {
		if e.Code != "" {
			fmt.Fprintf(&buf, "<%s> ", e.Code)
		}
		buf.WriteString(e.Message)
	}
	return buf.String()
}
This implementation assumes that Err cannot coexist with Code or Message on any given error. I find this to work well in practice.
Let’s look at how we use this with an example below.
Example usage
Returning to our CreateUser() function example, suppose we need to create additional roles for our new users. We can utilize the Op and Err fields in our application Error to wrap this nested functionality.
// CreateUser creates a new user in the system with a default role.
func (s *UserService) CreateUser(ctx context.Context, user *myapp.User) error {
	const op = "UserService.CreateUser"

	// Perform validation...

	// Insert user record.
	if err := s.insertUser(ctx, user); err != nil {
		return &myapp.Error{Op: op, Err: err}
	}

	// Insert default role.
	if err := s.attachRole(ctx, user.ID, "default"); err != nil {
		return &myapp.Error{Op: op, Err: err}
	}
	return nil
}

// insertUser inserts the user into the database.
func (s *UserService) insertUser(ctx context.Context, user *myapp.User) error {
	const op = "insertUser"
	// Pass ctx through so the query respects cancellation and deadlines.
	if _, err := s.db.ExecContext(ctx, `INSERT INTO users...`); err != nil {
		return &myapp.Error{Op: op, Err: err}
	}
	return nil
}

// attachRole inserts a role record for a user in the database.
func (s *UserService) attachRole(ctx context.Context, id int, role string) error {
	const op = "attachRole"
	if _, err := s.db.ExecContext(ctx, `INSERT roles...`); err != nil {
		return &myapp.Error{Op: op, Err: err}
	}
	return nil
}
Let’s assume we receive a Postgres syntax error inside our attachRole() function and Postgres returns the message:
syntax error at or near "INSERT"
Without context, we don’t know if this occurred in our insertUser() function or our attachRole() function. This is very tedious to debug when your API executes 20+ SQL queries.
However, because we are wrapping our errors, Error() will return:
UserService.CreateUser: attachRole: syntax error at or near "INSERT"
This lets us narrow down the errant query and make the fix.
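The wrapping can be reproduced without a database. In this sketch errors.New() stands in for the message the Postgres driver would return, and createUser and attachRole are simplified, hypothetical versions of the methods above with the SQL removed.

```go
package main

import (
	"bytes"
	"errors"
	"fmt"
)

// Error mirrors the application error type from this post.
type Error struct {
	Code    string
	Message string
	Op      string
	Err     error
}

// Error returns the string representation of the error message.
func (e *Error) Error() string {
	var buf bytes.Buffer
	if e.Op != "" {
		fmt.Fprintf(&buf, "%s: ", e.Op)
	}
	if e.Err != nil {
		buf.WriteString(e.Err.Error())
	} else {
		if e.Code != "" {
			fmt.Fprintf(&buf, "<%s> ", e.Code)
		}
		buf.WriteString(e.Message)
	}
	return buf.String()
}

// attachRole simulates a failed query by returning a stand-in driver error.
func attachRole() error {
	const op = "attachRole"
	err := errors.New(`syntax error at or near "INSERT"`)
	return &Error{Op: op, Err: err}
}

// createUser wraps the nested failure with its own operation name.
func createUser() error {
	const op = "UserService.CreateUser"
	if err := attachRole(); err != nil {
		return &Error{Op: op, Err: err}
	}
	return nil
}

func main() {
	// An undefined error wrapped twice.
	fmt.Println(createUser())

	// A well-defined error wrapped once: code & message come last.
	fmt.Println(&Error{
		Op:  "UserService.CreateUser",
		Err: &Error{Code: "conflict", Message: "Username is already in use."},
	})
}
```

The first line printed matches the output above. The second shows how a well-defined error renders, with the code in angle brackets followed by the message: `UserService.CreateUser: <conflict> Username is already in use.`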
Comparison with other approaches
This post describes how I work with errors in my applications and it’s an approach that has largely been inspired by the work of others. However, it does not adopt any one of them wholesale but rather combines the best of each. I think it is helpful to discuss where these approaches diverge and why.
Alternative: Upspin error handling
The Upspin team released a blog post called Error handling in Upspin that inspired much of my approach. They use a subpackage called errors with a single concrete Error type with an error code (Kind) and error wrapping (Op & Err). They also include a builder function called errors.E() for constructing their errors by using type information in the arguments.
I found the error codes & wrapping to be a great, simple implementation but I found issues with the following in practice:
- Moving the Error type to the errors package removes it from the domain and causes a stuttering name (errors.Error).
- The additional types such as Op & Kind seemed more complicated than they needed to be and seemed to exist as syntactic sugar for the errors.E() builder function. Strings would suffice if instantiating Error types directly.
- There was no separation between end user error messaging and operator error messaging. This makes sense because Upspin’s end users are likely operators but this model doesn’t fit most applications.
Alternative: pkg/errors
The pkg/errors project is an effort to provide error wrapping as a library. This approach handles error wrapping well and provides much-needed context to errors. However, I found the following issues in usage:
- By importing your error handling from a third party, it is external to your application’s domain.
- The error wrapping allows developers to obtain root cause error information but still requires verbose type assertions to extract this information. The ErrorCode() function from this post simplifies this error information checking to a single line in the caller code.
- It solves the error wrapping problem but does not help with other error handling concerns.
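Since Go 1.13 the standard library also provides wrapping helpers (fmt.Errorf with %w, errors.Is, and errors.As), and the domain error type can cooperate with them. The sketch below is a variation of my own rather than part of the approach described in this post: it adds an Unwrap method, which the original Error type does not have, and rewrites ErrorCode() with errors.As so that it still finds a code when some other layer has wrapped our error.

```go
package main

import (
	"errors"
	"fmt"
)

const (
	EINTERNAL = "internal"
	ENOTFOUND = "not_found"
)

// Error mirrors the application error type from this post.
type Error struct {
	Code    string
	Message string
	Op      string
	Err     error
}

// Error is a shortened version of the operator-facing string.
func (e *Error) Error() string {
	if e.Err != nil {
		return e.Op + ": " + e.Err.Error()
	}
	return "<" + e.Code + "> " + e.Message
}

// Unwrap is an addition that is not in the original type. It exposes the
// nested error to errors.Is and errors.As.
func (e *Error) Unwrap() error { return e.Err }

// ErrorCode follows the same three rules as the version in this post but
// also sees *Error values that have been wrapped by other error types.
func ErrorCode(err error) string {
	if err == nil {
		return ""
	}
	for err != nil {
		var e *Error
		if !errors.As(err, &e) {
			break // no *Error left in the chain
		}
		if e.Code != "" {
			return e.Code
		}
		err = e.Err // keep searching below this *Error
	}
	return EINTERNAL
}

// errWrapped wraps an *Error using fmt.Errorf and %w. A plain type assertion
// on the outer value would not see through this wrapper.
var errWrapped = fmt.Errorf("load profile: %w", &Error{
	Op:  "UserService.FindUserByID",
	Err: &Error{Code: ENOTFOUND},
})

func main() {
	fmt.Println(ErrorCode(errWrapped))
	fmt.Println(ErrorCode(errors.New("boom")))
}
```

Whether this is worth adopting depends on how much third-party wrapping your errors pass through; if every layer returns *Error directly, the simpler type-assertion version behaves the same.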
Feedback
These critiques are not meant to be slams against these other projects. No approach works for every application so I’ve added these comparisons to help developers make educated decisions about their application design. Please choose the approach that is right for you.
Conclusion
Error handling is a critical piece of your application design and is complicated by the variety of different consumer roles that require error information. By considering error codes, error messages, and logical stack traces in our design we can fulfill the needs of each consumer. By integrating our Error type into our domain, we give all parts of our application a common language to communicate about when unexpected things happen.