Two-Phase Commit Protocol in Rust and Go

My friends and I were talking about what happens when you buy something online and your payment fails halfway through. Like, does your money just disappear? How do systems make sure that doesn’t happen? We ended up going down a rabbit hole and decided to build our own two-phase commit protocol. I used Rust for the coordinator and Go for the microservices.

The Basic Idea

Two-phase commit (2PC) is basically a voting system for distributed transactions. Either everyone agrees to do something, or nobody does it. Think of it like picking a restaurant with friends - if anyone says no, you have to start over.

What We Built

We split it into three parts: a coordinator written in Rust (the boss that tells everyone what to do), a wallet service in Go (handles user money), and an order service also in Go (manages product inventory).

The Coordinator

The coordinator is where all the decision making happens. Here’s the core logic in Rust:

use std::io::Write;
use std::net::TcpStream;

struct Coordinator {
    wallet_conn: TcpStream,
    order_conn: TcpStream,
}

impl Coordinator {
    fn prepare_phase(&mut self, transaction: Transaction) -> Result<bool, Error> {
        // Phase 1: send the transaction to every participant and ask for a vote.
        self.wallet_conn.write_all(&transaction.serialize())?;
        self.order_conn.write_all(&transaction.serialize())?;

        // Collect the votes. Any I/O error becomes an early return via `?`,
        // which we treat the same as a "no" vote.
        let wallet_vote = self.wallet_conn.read_response()?;
        let order_vote = self.order_conn.read_response()?;

        Ok(wallet_vote == READY && order_vote == READY)
    }

    fn commit_phase(&mut self) -> Result<(), Error> {
        // Phase 2: everyone voted READY, so tell them all to go ahead.
        self.wallet_conn.write_all(COMMIT_MSG)?;
        self.order_conn.write_all(COMMIT_MSG)?;
        Ok(())
    }
}

So how does it work? First phase: the coordinator asks everyone “can you do this transaction?” If anyone says no or doesn’t respond, we abort. Second phase: if everyone said yes, the coordinator tells them “okay, do it now.” Otherwise it’s like “never mind, forget about it.”
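The decision rule itself is tiny. Here's that rule as a standalone Go sketch with the network stripped out (the `Vote` type and names are made up for illustration, not lifted from our actual code):

```go
package main

import "fmt"

// Vote is what each participant answers during the prepare phase.
type Vote int

const (
	VoteReady Vote = iota // participant can commit
	VoteAbort             // participant refuses (or timed out)
)

// decide implements the 2PC rule: commit only if every single
// participant voted READY; any other answer aborts the whole thing.
func decide(votes []Vote) string {
	for _, v := range votes {
		if v != VoteReady {
			return "ABORT"
		}
	}
	return "COMMIT"
}

func main() {
	fmt.Println(decide([]Vote{VoteReady, VoteReady})) // prints: COMMIT
	fmt.Println(decide([]Vote{VoteReady, VoteAbort})) // prints: ABORT
}
```

The asymmetry is the whole point: committing requires unanimity, aborting requires only one dissenter.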

The Microservices

The microservices do the actual work. Here’s part of our wallet service:

import (
    "database/sql"
    "errors"
)

type WalletService struct {
    db *sql.DB
}

// handlePrepare runs inside an open *sql.Tx: it checks the balance and
// deducts the amount, but the change only becomes permanent when the
// coordinator later tells us to commit.
func (ws *WalletService) handlePrepare(tx *sql.Tx, userId int, amount float64) error {
    var balance float64
    err := tx.QueryRow("SELECT balance FROM wallets WHERE user_id = ?", userId).Scan(&balance)
    if err != nil {
        return err
    }

    if balance < amount {
        return errors.New("insufficient funds")
    }

    _, err = tx.Exec("UPDATE wallets SET balance = balance - ? WHERE user_id = ?", amount, userId)
    return err
}
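Prepare is only half the story: the wallet service also has to hold that open transaction until the coordinator's verdict arrives, then commit or roll back. Here's a sketch of those two handlers; the `txn` struct and pending map are simplifications for illustration (the real service calls `Commit`/`Rollback` on the open `*sql.Tx` directly):

```go
package main

import "fmt"

// txn stands in for the open *sql.Tx from handlePrepare. Using plain
// function fields keeps the sketch free of a real database.
type txn struct {
	Commit   func() error
	Rollback func() error
}

// WalletService keeps every prepared-but-undecided transaction open,
// keyed by transaction id, until the coordinator's verdict arrives.
type WalletService struct {
	pending map[string]txn
}

// handleCommit finishes the held transaction, making the balance
// deduction from the prepare phase permanent.
func (ws *WalletService) handleCommit(txID string) error {
	tx, ok := ws.pending[txID]
	if !ok {
		return fmt.Errorf("unknown transaction %q", txID)
	}
	delete(ws.pending, txID)
	return tx.Commit()
}

// handleAbort rolls the held transaction back, undoing the deduction.
func (ws *WalletService) handleAbort(txID string) error {
	tx, ok := ws.pending[txID]
	if !ok {
		return fmt.Errorf("unknown transaction %q", txID)
	}
	delete(ws.pending, txID)
	return tx.Rollback()
}

func main() {
	ws := &WalletService{pending: map[string]txn{
		"tx42": {
			Commit:   func() error { fmt.Println("wallet: committed tx42"); return nil },
			Rollback: func() error { fmt.Println("wallet: rolled back tx42"); return nil },
		},
	}}
	ws.handleCommit("tx42") // prints: wallet: committed tx42
}
```

Note that each transaction is deleted from the map before the verdict is applied, so a duplicate COMMIT or ABORT message gets an "unknown transaction" error instead of running twice.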

When Things Go Wrong

The interesting part is when stuff breaks, which happens all the time in distributed systems:

We tested what happens when services crash mid-transaction, when network connections drop, and when services are super slow to respond. Turns out distributed systems fail in really creative ways.

The Downsides

Two-phase commit solves the consistency problem, but it’s not free. Everyone has to wait for the coordinator’s decision (blocking), there are a lot of messages going back and forth (network overhead), and if the coordinator dies after the prepare phase, participants that already voted READY are stuck holding their locks until it comes back. That single point of failure is pretty brutal.

Deploying on the Cloud

We put this on Google Cloud Platform with separate VMs for each service. That’s when we learned that network latency is real and partial failures are everywhere.

Testing This Was Tricky

Testing distributed systems is way trickier than regular programs. Everything happens at once and things fail in weird ways:

#[test]
fn test_node_failure_during_prepare() {
    let mut coordinator = Coordinator::new();
    let transaction = Transaction::new(1, 100.0); // user_id, amount

    // Simulate node failure by closing the order service's connection
    coordinator.order_conn.shutdown(Shutdown::Both).unwrap();

    assert!(matches!(
        coordinator.prepare_phase(transaction),
        Err(Error::Timeout)
    ));
}

Some Takeaways

Rust’s ownership model turned out to be really helpful for managing complex distributed state. Go’s goroutines made handling multiple transactions at once pretty straightforward.

The biggest thing though: what works perfectly on localhost breaks in weird and unexpected ways once you put it on actual infrastructure. Networks are slow, networks fail, and we had to think really carefully about timeouts, message ordering, and all the ways things could go wrong when testing.

Building this from scratch really helped us understand what’s happening under the hood in production systems. Distributed systems are hard, but at least now I get why payment systems are so complicated.

The code is on GitHub if you want to check it out. The README is in Norwegian though, since we wrote it for a class project.