December 15, 2017

Day 15 - A DevOps Christmas Carol

By: Emily Freeman (@editingemily)
Edited By: Corey Quinn (@quinnypig)

The DevOps Christmas Carol is a bastardized, satirical version of Charles Dickens’ iconic work, A Christmas Carol.

It’s Christmas Eve and we find Scrooge, a caricature of a San Francisco-based, VC-backed tech startup CEO haunted by Peter Drucker’s ghost — who warns him of the visits of three ghosts: the Ghost of DevOps Past, the Ghost of DevOps Present and the Ghost of DevOps Yet to Come. (Victorians were seriously wordy.)

Scrooge’s company has adopted DevOps, but their Tiny Tim app is still in danger of falling over. And Scrooge is still complaining about that AWS bill.

I want you to laugh at the absurdity of our industry, remember the failures of yesterday, learn the lessons of today and embrace the challenges of tomorrow.

Above all else, Merry Christmas, Chag Sameach and Happy New Year. May everyone’s 2018 be better than the dumpster fire that was this year.


Old Peter Drucker was as dead as a doornail. (I don’t exactly know what’s dead about a doornail, but we’ll go with it.)

Drucker had been dead for many years. Every DevOpsDays deck included a Drucker quote. And many shots had been consumed playing the Drunker Drucker drinking game.

Scrooge was your average SF CEO. His grandfather was business partners with Drucker and Scrooge continues to worship the Drucker deity.

It was Christmas Eve and Scrooge sat in his glass-enclosed office drinking artisanal, small-batch coffee. His cool disposition was warmed thinking about the yacht he would buy when his startup IPO’d.

Sure, they had unlimited vacation. But no one ever took it — even on Christmas Eve. His employees loved working that much.

He watched the developers and operations folks gather for standup in the “Innovate” conference room. It was great to see the teams working together. After all, they had spent $180,000 on a consultant to “do the DevOps.”

“A merry Christmas, uncle!” His sister had forced Scrooge to hire his cheery nephew as an intern.

“Bah!” said Scrooge, “Humbug!”

“Christmas a humbug, uncle!” said Scrooge’s nephew. “You don’t mean that, I am sure?”

“I do,” said Scrooge. “Merry Christmas! What reason have you to be merry? I pay you $19 an hour and you live in a closet with 4 other men in Oakland.”

The receptionist quietly tapped the glass door. “Two men from Homeless Helpers of San Francisco are here to see you.” Scrooge waved them in.

“At this festive season of the year, Mr. Scrooge,” said the gentleman, “it is more than usually desirable that we should make some slight provision for the poor and destitute, who suffer greatly at the present time.”

“I thought we were just moving them to San Jose,” replied Scrooge.

Seeing clearly it would be useless to pursue their point, the two men withdrew, mumbling about hate in the Trump era.

The day drew to a close. Scrooge dismounted from his Aeron chair and put on his sweatshirt and flat-brimmed hat.

“You’ll want all day tomorrow, I suppose?” said Scrooge to his employees, whose standing desks were packed as tightly as an Amazon box filled with Cyber Monday regret. “I suppose you must have the whole day. But if the site goes down, I expect all of you to jump on Slack and observe helplessly as Samantha restarts the servers.”

Scrooge took his melancholy dinner at his usual farm-to-table tavern. Walking home on the busy sidewalk, Scrooge approached his doorman only to see Drucker, staring at him. His demeanor vacant, his body translucent. Scrooge was shook. He brushed past the ghostly figure and hurried toward the elevator.

Satisfied he had too many glasses of wine with dinner, Scrooge shut the door and settled in for the night. Suddenly, Siri, Alexa, Cortana and Google joined together to make a horrific AI cacophony.

This was followed by a clanking noise, as if someone were dragging a heavy chain over the hardwood floors. Scrooge remembered to have heard that ghosts in haunted houses were described as dragging chains.

The bedroom door flew open with a booming sound and then he heard the noise much louder, coming straight towards his door.

His color changed when he saw the same face. The very same. Drucker, in his suit, drew a chain clasped about his middle. It was made of servers, CPUs, and endless dongles.

Scrooge fell upon his knees, and clasped his hands before his face. “Mercy!” he said. “Dreadful apparition, why do you trouble me?”

“I wear the chain I forged in life. Oh! Captive, bound, and double-ironed,” cried the phantom.

“But you were always a good man of business and operations, Peter,” faltered Scrooge, who now began to apply this to himself.

“Business!” cried the Ghost, wringing his hands again. “Mankind was my business. Empathy, compassion and shared documentation were all my business. Operations was but a drop of water in the ocean of my business! You will be haunted,” resumed the Ghost, “by three spirits.”

“I—I think I’d rather not,” said Scrooge.

“Without their visits,” said the Ghost, “you cannot hope to shun the path of waterfall development, silos and your startup’s slow descent into obscurity. Expect the first tomorrow.”

With that, Drucker’s ghost vanished.


It was dark when Scrooge awoke from his disturbed slumber. So dark he could barely distinguish transparent window from opaque wall. The chimes of a neighboring church struck the hour and a flash lit up the room in an instant. Scrooge stared face-to-face with an unearthly visitor.

It was a strange figure—small like a child, but old. Its hair was white with age but its face had not a wrinkle.

“Who, and what are you?” Scrooge demanded.

“I am the Ghost of DevOps Past.”

“Long past?” inquired Scrooge, observant of its dwarfish nature.

“DevOps is only like 10 years old. Do you even read Hacker News?” He paused. “Rise! And walk with me!”

The Ghost took his hand and together they passed through the wall, and stood upon an open convention floor.

“Good heaven! I went to this conference!” said Scrooge.

“The open space track is not quite deserted,” said the Ghost. “A solitary man, neglected by his friends, is left there still.”

In a corner of the conference, they found a long, bare, melancholy room. In a chair, a lonely man was reading near the feeble light of his phone. Scrooge wept to see poor Andrew Clay Shafer in a room alone.

“Poor man!” Scrooge cried. “I wish,” Scrooge muttered, putting his hand in his pocket, “but it’s too late now.”

“What is the matter?” asked the Spirit.

“Nothing,” said Scrooge. “Nothing. There was a developer who asked about ops yesterday. I should like to have given him something. That’s all.”

The Ghost smiled thoughtfully, and waved its hand: saying as it did so, “Let us see another conference!”

The room became a little darker. The panels shrunk and the windows cracked. They were now in the busy thoroughfares of a city, where shadowy passengers passed along the narrow streets and Medieval architecture.

The Ghost stopped at a certain door and ushered Scrooge in. “Why it’s old Patrick Dubois! Bless his heart.”

“Another round!” Dubois announced.

“This is the first DevOpsDays afterparty in Ghent,” explained the Ghost.

Away they all went, twenty couple at once toward the bar. Round and round in various stages of awkward grouping.

“Belgians sure know how to party,” said Scrooge.

“Sure. But it’s a tech conference so it’s still pretty awkward,” remarked the Ghost. “A small matter,” it continued, “to make these silly folks so full of gratitude. It only takes a t-shirt and a beer.”

“Small!” echoed Scrooge.

“Why! Is it not? He has spent but a few dollars of your mortal money. Is that so much he deserves this praise?”

“It isn’t that, Spirit. He has the power to render us happy or unhappy. To make our service light or burdensome. Say that his power lies in words and looks, in things so slight and insignificant that it is impossible to add and count ’em up. The happiness he gives, is quite as great as if it cost a fortune.”

He felt the Spirit’s glance, and stopped.

“What is the matter?” asked the Ghost.

“Nothing particular,” said Scrooge.

“Something, I think?” the Ghost insisted.

“No,” said Scrooge, “No, I should like to be able to say a word or two to my employees just now. That’s all.”

“My time grows short,” observed the Spirit. “Quick!”

It produced an immediate effect. Scrooge saw himself. He was not alone but with two former employees. The tension in the room was overwhelming.

“The code didn’t change,” explained the developer. “There’s no way we caused this out—”

“That’s bullshit!” interjected the SRE. “There was a deploy 10 minutes before the site went down. Of course that’s the issue here. These developers push out crappy code.”

“At least we can code,” replied the developer, cruelly.

“No more,” cried Scrooge. “No more. I don’t wish to see it.”

But the relentless Ghost pinioned him in both his arms and forced him to observe.

“Spirit!” said Scrooge in a broken voice, “remove me from this place. Remove me. I cannot bear it!”

He turned upon the Ghost and saw that it looked upon him with a face in which, in some strange way, there were fragments of all the faces it had shown him. “Leave me! Take me back. Haunt me no longer!” Scrooge wrestled with the spirit with all his force. Light flooded the ground and he found himself exhausted. Overcome with drowsiness, Scrooge fell into a deep sleep.


Awaking in the middle of a prodigiously tough snore and sitting up, Scrooge was surprised to be alone. Now, being prepared for almost anything, he shuffled in his slippers to the bedroom door. The moment Scrooge’s hand was on the lock, a strange voice called him by his name and bade him enter. He obeyed.

It was his own room, there was no doubt. But it had undergone a surprising transformation. The walls and ceiling were hung with living green and bright gleaming berries glistened.

“Come in!” exclaimed the Ghost. “Come in and know me better, man!”

Scrooge entered timidly and hung his head. The Spirit’s eyes were clear and kind but he didn’t like to meet them.

“I am the Ghost of DevOps Present,” said the Spirit. “Look upon me!”

Scrooge reverently did so. It was clothed in one simple green robe, bordered with white fur. This garment hung so loosely on the figure that its capacious breast was bare.

“You have never seen the like of me before!” exclaimed the Ghost.

“Never,” Scrooge answered. “But you may want to cover up a bit. This is breaking some kind of code of conduct and men are getting in trouble for this kind of thing these days.”

“Touch my robe!”

“OK, this is getting awkward. And inappropriate. Seriously, you can’t expose yourself like this. House of Cards was canceled because of people like you. It’s not cool, man.”

“Touch my robe!” the Ghost bellowed.

Scrooge did as he was told, and held it fast.

The room vanished instantly. They found themselves in the conference room at Scrooge’s company. The Ghost sprinkled incense from his torch on the heads of the employees seated around the table. It was a very uncommon kind of torch, for once or twice when there were angry words between some men, he shed a few drops of water on them from it, and their good humour was restored directly.

“Is there a peculiar flavor in what you sprinkle from your torch?” asked Scrooge.

“There is. My own.”

“OK, buddy, we gotta work on the subtle sexual harassment vibe you’re working with.”

Scrooge’s employees at the table began to argue about CI/CD, pipelines and testing for the new Tiny Tim app.

“We need to be deploying every ten minutes. At a minimum. That’s what Netflix does,” said Steve, confidently.

“We use Travis CI. What if we had commits immediately deploy to production?” asked Tony.

“Um, that’s a terrible idea. You want the developers’ tests to be the only line of defense against site outages? I don’t want to be on call during that disaster.”

“There’s no reason to get an attitude, Steve.”

“Well, you’re suggesting that developers should be trusted to deploy their own code.”

“That’s exactly what I’m saying.”

“That’ll never work. QA and security need to review everything before it’s pushed out.”

“Yea, but there’s this whole concept of moving things to the left. Where ops, security, QA are all involved in feature planning and the developer architects the code with their concerns in mind. That way, we don’t have Amy spending a full week on a feature only to have security kick it back.”

“I think that’s exactly how it should work. Developers need to code better.”

“‘Coding better’ is not actionable or kind. And that kind of gatekeeping process creates silos and animosity. People will start to work around each other.”

“We’ll just add more process to prevent that.”

“Never underestimate someone in tech’s ability to use passive-aggressiveness in the workplace. Listen. Scrooge wants Tiny Tim to be reliable, agile, testable, maintainable and secure.”

“Well, that’s impossible.”

“Pfff,” remarked Scrooge. “You’re fired.”

The Ghost sped on. It was a great surprise to Scrooge, while listening to the moaning of the wind, and thinking what a solemn thing it was to move on through the lonely darkness over an unknown abyss, whose depths were secrets as profound as Death. This was the ever-present existential crisis of life. It was a great surprise to Scrooge, while thus engaged, to hear a hearty laugh.

“Ha, ha! Ha, ha, ha, ha!” laughed Scrooge’s nephew.

“He said that Christmas was a humbug, as I live!” cried Scrooge’s nephew. “He believed it too! I am sorry for him. I couldn’t be angry with him if I tried. Who suffers by his ill whims? Himself, always. He takes it into his head to dislike us. No one at the company likes him. The product owner for Tiny Tim is about to quit and Scrooge has no idea!”

Scrooge was taken aback. He built a great team. They loved him. Adored him, even. Or so he thought. It’s true he hadn’t taken time to talk to any of them in several months, but everything was going so well. They were only two months from the launch of Tiny Tim.

His nephew continued, “We’re all underpaid and overworked. Scrooge is constantly moving the goalpost. He expects us to be perfect.”

The Ghost grew older, clearly older.

“Are spirits’ lives so short?” asked Scrooge.

“My life upon this globe is very brief,” replied the Ghost. “It ends tonight. Hark! The time is drawing near.”

The Ghost parted the folds of its robe. “Look here. Look, look, down here!”

“You. Have. To. Stop. With. This.” sighed Scrooge.

From its robe it brought two children. Scrooge started back, appalled at the creatures. “Spirit! Are they yours?”

“They are Op’s,” said the Spirit, looking down upon them. “And they cling to me. This boy is Serverless. This girl is Lambda.”

The bell struck twelve. Scrooge looked about him for the Ghost and saw it not. Lifting up his eyes, he beheld a solemn Phantom, draped and hooded, coming, like a mist along the ground, towards him.


The Phantom slowly, gravely, silently approached. When it came near him, Scrooge bent down upon his knee; for in the very air through which this Spirit moved it seemed to scatter gloom and mystery.

It was shrouded in a deep black garment, which concealed its head, its face, its form, and left nothing of it visible save one outstretched hand.

“I am in the presence of the Ghost of DevOps Yet To Come?” said Scrooge.

The Spirit answered not, but pointed onward with its hand.

“Lead on!” said Scrooge. “The night is waning fast, and it is precious time to me, I know. Lead on, Spirit!”

The Phantom moved away as it had come towards him. Scrooge followed in the shadow of its dress, which bore him up, he thought, and carried him along.

They scarcely seemed to enter the city. For the city rather seemed to spring up about them and encompass them of its own act. The Spirit stopped beside one little knot of business men. Observing that the hand was pointed to them, Scrooge advanced to listen to their talk.

“No,” said a great fat man with a monstrous chin, “I don’t know much about it, either way. I only know it’s dead.”

“When did it happen?” inquired another.

“Yesterday, I believe.”

“How much of a down round was it? Just a stop gap?”

“No. Investors have lost faith. Scrooge sold this Tiny Tim app hard. Bet the whole company on disrupting the DevSecDataTestOps space. He could only raise half of what they did in Series A.”

“It’s over,” remarked another.

“Oh yea, he’s done.”

This was received with a general laugh.

The Phantom glided on and stopped once again in the office of Scrooge’s company. Its finger pointed to three employees fighting over who could take home the Yama cold brew tower, one nearly toppling it in the process. There were movers carrying standing desks out and employees haggling over their desks and chairs.

Scrooge watched as his office — his company — was systematically dismantled, broken down and taken away piece-by-piece. His own office was nearly empty save for a single accounting box — a solemn reminder of Scrooge’s priorities in his work.

“Spectre,” said Scrooge, “something informs me that our parting moment is at hand. I know it, but I know not how.”

The Ghost of DevOps Yet To Come conveyed him, as before. The Spirit did not stay for anything, but went straight on, as to the end just now desired, until besought by Scrooge to tarry for a moment.

The Spirit stopped; the hand was pointed elsewhere.

A pile of old computers lay before him. The Spirit stood among the aluminum graves, and pointed down to one. He advanced toward it trembling. The Phantom was exactly as it had been, but he dreaded that he saw new meaning in its solemn shape.

Scrooge crept towards it, and following the finger, read upon the screen, HUMBUG.LY — THIS WEBPAGE PARKED FREE, COURTESY OF GODADDY.COM.

“No, Spirit! Oh no, no!”

The finger was still there.

“Spirit!” he cried, tight clutching at its robe, “hear me! I am not the man I was. I will not be the man I have been but for this. I will honor DevOps in my heart, and try to keep it all the year. I will live in the Past, the Present and the Future. The Spirits of all three shall strive within me. I will not shut out the lessons they teach.”

Holding up his hands in a last prayer to have his fate reversed, he saw an alteration in the Phantom’s hood and dress. It shrunk, collapsed and dwindled down into a bedpost.


Yes! The bedpost was his own. The bed was his own, the room was his own.

“Oh Peter Drucker! Heaven, and DevOps be praised for this! I don’t know what to do!” cried Scrooge, laughing and crying in the same breath. Really, for a man who had been out of practice for so many years, it was a splendid laugh, a most illustrious laugh.

Scrooge hopped on Amazon with haste. He bought The Phoenix Project and The DevOps Handbook and had them delivered by drone within the hour.

“I’ll get a copy for every one of my employees!” exclaimed Scrooge.

He hopped into his Tesla and drove to his nephew’s apartment. Greeted by one of the 5 roommates, he asked to see his nephew.

“Fred!” said Scrooge.

“Why bless my soul!” cried Fred, “who’s that?”

“It’s I. Your uncle Scrooge. I have come to dinner. Will you let me in, Fred?”

The odd group had a twenty-something Christmas dinner and played Cards Against Humanity. Wonderful party, wonderful games, wonderful happiness!

But he was early at the office the next morning. If he could only be there first, and catch his employees coming in late!

The last employee stumbled in. “Hello!” growled Scrooge, in his accustomed voice, as near as he could feign it. “What do you mean by coming here at this time of day?”

“I am very sorry, sir. I am behind my time.”

“You are?” repeated Scrooge. “Yes. I think you are. Step this way, sir, if you please.”

“It’s only once a year, sir,” pleaded Bob. “It should not be repeated. Besides, we’re supposed to have flexible work hours. We have a ping-pong table for God’s sake!”

“Now, I’ll tell you what, my friend,” said Scrooge, “I am not going to stand this sort of thing any longer. And therefore,” he continued, leaping from his Aeron, “I am about to raise your salary! All of your salaries!”

Bob trembled. “Does… does this mean we’re doing that open salary thing?”

“No, don’t push it, Bob. That’s for hippies,” replied Scrooge. “A merry Christmas!”

Scrooge was better than his word. He did it all and infinitely more. The Tiny Tim app was finished on time, the company adopted a DevOps culture, developers stopped being assholes and ops folks got more sleep.

Scrooge had no further experience with Spirits and it was always said of him that he knew how to keep DevOps well, if any man alive possessed the knowledge. May that truly be said of all of us! And so, Ops bless us, everyone!

December 14, 2017

Day 14 - Pets vs. Cattle Prods: The Silence of the Lambdas

By: Corey Quinn (@quinnypig)
Edited By: Scott Murphy (@ovsage)

“Mary had a little Lambda
S3 its source of truth
And every time that Lambda ran
Her bill went through the roof.”

Lambda is Amazon’s implementation of a concept more broadly known as “Functions as a Service,” or occasionally “Serverless.” The premise behind these technologies is to abstract away all of the infrastructure-like bits around your code, leaving the code itself the only thing you have to worry about. You provide code, Amazon handles the rest. If you’re a sysadmin, you might well see this as the thin end of a wedge that’s coming for your job. Fortunately, we have time; Lambda’s a glimpse into the future of computing in some ways, but it’s still fairly limited.

Today, the constraints around Lambda are somewhat severe.

  • You’re restricted to writing code in a relatively small selection of languages– there’s official support for Python, Node, .Net, Java, and (very soon) Go. However, you can shoehorn in shell scripts, PHP, Ruby, and others. More on this in a bit.
  • Amazon has solved the Halting Problem handily– after a certain number of seconds (hard capped at 300) your function will terminate.
  • Concurrency is tricky: it’s as easy to have one Lambda running at a time as it is one thousand. If they each connect to a database, it’s about to have a very bad day. (Lambda just introduced per-function concurrency, which smooths this somewhat.)
  • Workflows around building and deploying Lambdas are left as an exercise for the reader. This is how Amazon tells developers to go screw themselves without seeming rude about it.
  • At scale, the economics of Lambda are roughly 5x the cost of equivalent compute in EC2. That said, for jobs that only run intermittently, or are highly burstable, the economics are terrific. Lambdas are billed in Gigabyte-Seconds (of RAM).
  • Compute and IO scale linearly with the amount of RAM allocated to a function. Exactly what level maps to what is unpublished, and may change without notice.
  • Lambda functions run in containers. Those containers may be reused (“warm starts”) and be able to reuse things like database connections, or have to be spun up from scratch (“cold starts”). It’s a grand mystery, one your code will have to take into account.
  • There is a finite list of things that can trigger Lambda functions. Fortunately, cron-style schedules are now one of them.
  • The Lambda runs within an unprivileged user account inside of a container. The only place inside of this container where you can write data is /tmp, and it’s limited to 500MB.
  • Your function must fit into a zip file that’s 50MB or smaller; decompressed, it must fit within 250MB– including dependencies.
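To make the "Gigabyte-Seconds" billing unit from the list above concrete, here is a minimal sketch of the arithmetic. The function names are made up for illustration, and the price is deliberately a parameter rather than a constant: look it up on the current pricing page instead of trusting a number in a blog post.

```python
def gb_seconds(memory_mb, duration_seconds, invocations):
    """Billable GB-seconds: allocated RAM (in GB) x run time (s) x number of runs."""
    return (memory_mb / 1024) * duration_seconds * invocations


def compute_cost(total_gb_seconds, price_per_gb_second):
    """Compute charge only; per-request fees and any free tier are ignored here."""
    return total_gb_seconds * price_per_gb_second


# A 512MB function that runs for 2 seconds, once a minute, for 30 days.
runs_per_month = 60 * 24 * 30
usage = gb_seconds(memory_mb=512, duration_seconds=2, invocations=runs_per_month)
print(usage)  # 43200.0
```

Note that the charge follows the memory you allocate, not the memory you use, which is part of why constantly busy workloads tend to come out cheaper on EC2.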

Let’s focus on one particular Lambda use case: replacing the bane of sysadmin existence, cron jobs. Specifically, cron jobs that affect your environment beyond “the server they run on.” You still have to worry about server log rotation; sorry.

Picture being able to take your existing cron jobs, and no longer having to care about the system they run on. Think about jobs like “send out daily emails,” “perform maintenance on the databases,” “trigger a planned outage so you can look like a hero to your company,” etc.

If your cron job is written in one of the supported Lambda languages, great– you’re almost there. For the rest of us, we probably have a mashup of bash scripts. Rejoice, for hope is not lost! Simply wrap your terrible shell script (I’m making assumptions here– all of my shell scripts are objectively terrible) inside of a Python or JavaScript caller that shells out to invoke your script. Bundle the calling function and the shell script together, and you’re there. As a bonus, if you’re used to running this inside of a cron job, you likely have already solved for the myriad shell environment variable issues that bash scripts can run into when they’re called by a non-interactive environment.
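A rough sketch of such a caller in Python might look like the following. The script name `nightly_job.sh`, the `run_script` helper, and the 240-second timeout are all placeholders picked for illustration; the only assumption about Lambda itself is that the `LAMBDA_TASK_ROOT` environment variable points at the unpacked deployment package.

```python
import os
import subprocess

# Hypothetical name: the shell script bundled in the same zip as this file.
SCRIPT_NAME = "nightly_job.sh"


def run_script(script_path, interpreter="bash", timeout=240):
    """Run a shell script and return its stdout; raise if it fails or hangs."""
    result = subprocess.run(
        [interpreter, script_path],   # no reliance on the file's executable bit
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,
        timeout=timeout,              # keep this below the function's own time limit
        check=True,                   # a non-zero exit status raises CalledProcessError
    )
    return result.stdout


def handler(event, context):
    # Falls back to the current directory when run outside of Lambda.
    task_root = os.environ.get("LAMBDA_TASK_ROOT", ".")
    output = run_script(os.path.join(task_root, SCRIPT_NAME))
    print(output)                     # stdout ends up in the function's logs
    return {"status": "ok"}
```

Letting the exception propagate is deliberate: a failed script then shows up as a failed invocation, rather than as a cron job that quietly stopped working three months ago.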

Set your Lambda trigger to be a “CloudWatch Event - Scheduled” event, and you’re there. It accepts the same cron syntax we all used to hate but have come to love in a technical form of Stockholm Syndrome.
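One caveat worth hedging: as far as I can tell from the CloudWatch Events documentation, the scheduled-event flavor of cron is not quite the crontab you know. It takes six fields (minutes, hours, day-of-month, month, day-of-week, year), wants a `?` in one of the two day fields, and is evaluated in UTC. The helper below is a hypothetical convenience for building such strings, not part of any AWS SDK; verify the format against the docs before relying on it.

```python
def schedule_expression(minute, hour, day_of_month="*", month="*",
                        day_of_week="?", year="*"):
    """Build a six-field cron(...) schedule expression string."""
    # Exactly one of the two day fields should be the '?' placeholder.
    if (day_of_month == "?") == (day_of_week == "?"):
        raise ValueError("exactly one of day_of_month/day_of_week must be '?'")
    fields = (minute, hour, day_of_month, month, day_of_week, year)
    return "cron({})".format(" ".join(str(f) for f in fields))


print(schedule_expression(0, 12))  # cron(0 12 * * ? *)
print(schedule_expression(30, 3, day_of_month="?", day_of_week="MON-FRI"))
# cron(30 3 ? * MON-FRI *)
```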

This is of course a quick-and-dirty primer for getting up and running with Lambda in the shortest time possible– but it gives you a taste of what the system is capable of. More importantly, it gives you the chance to put “AWS Lambda” on your resume– and your resume should always be your most important project.

If you have previous experience with AWS Lambda and you’re anything like me, your first innocent foray into the console for AWS Lambda was filled with sadness, regret, confusion, and disbelief. It’s hard to wrap your head around what it is, how it works, and why you should care. It’s worth taking a look at if you’ve not used it– this type of offering and the design patterns that go along with it are likely to be with us for a while. Even if you’ve already taken a dive into Lambda, it’s worth taking a fresh look at– the interface was recently replaced, and the capabilities of this platform continue to grow.

December 13, 2017

Day 13 - Half-Dead TCP Connections and Why Heartbeats Matter

By: Alejandro Brito Monedero (@ae_bm)

Edited By: J. Paul Reed (@jpaulreed)

We are living in interesting times in the tech world, full of trendy technologies, like cloud computing, containers, schedulers, and serverless.

They’re all rainbows and unicorns when they work. But we can start to forget about the systems that support our abstractions until they break, and we have to give our best to fix them.

Some time ago in one of our multiple pub-sub systems, there were some processes publishing messages to a message broker. Those messages were consumed by a process running in a container. So far this isn’t too exotic; it’s easy to diagram out:

[Diagram: Pub Sub]

For the most part, this system worked as expected. However, at one point, we received some alerts from the monitoring system. Those alerts reported that the broker had a lot of queued messages and no consumers connected. Our first reaction was to restart the consumer container and call it a day. But the error kept happening, always at the worst possible time.

While taking a closer look at the problem, we confirmed that the broker had no consumers to deliver the messages to. The surprise came when we inspected the consumer container. It was still running and seemingly all was well, except we did notice it was blocked on the socket used to communicate with the broker. When we inspected the socket statistics inside the container’s network namespace, they showed a connection to the broker:

# ss -tpno
ESTAB      0      0               <container ip>:<some port>         <broker ip>:<broker port>

Upon seeing this, our reaction could pretty much be summed up as:

The problem seemed to be that the connection state between the broker and consumer was not synchronized. In the host network namespace (where the broker is running), the status showed there wasn’t any TCP connection from the container. In the container network namespace, however, there was an established connection to the broker. Our dear RFC 793 describes this situation in its section on Half-Open Connections and Other Anomalies (emphases mine):

An established connection is said to be “half-open” if one of the TCPs has closed or aborted the connection at its end without the knowledge of the other, or if the two ends of the connection have become desynchronized owing to a crash that resulted in loss of memory. Such connections will automatically become reset if an attempt is made to send data in either direction. However, half-open connections are expected to be unusual, and the recovery procedure is mildly involved.

If at site A the connection no longer exists, then an attempt by the user at site B to send any data on it will result in the site B TCP receiving a reset control message. Such a message indicates to the site B TCP that something is wrong, and it is expected to abort the connection.

After that nice enlightenment, we started to think of possible causes for that “desynchronization.” Options that came to mind included:

  • the broker restarting or crashing
  • man-in-the-middle attack (MITM)
  • A grumpy kernel (iptables, ebtables, bridges, etc)
  • A grumpy container engine
  • Some Lovecraftian horror show

To determine which it was, we first checked if the broker had crashed or been restarted. Upon inspection, we found it’d been running for a long time and other systems using it were working normally. So it wasn’t a problem with the broker.

The kernel iptables and bridge didn’t show anything weird. A MITM attack seemed a bit exotic. The other options were hard to prove, and we thought it wouldn’t be very professional of us to blame the container system without any evidence. ;-)

While trying to think of other possible causes, we kept tcpdump running on one of the consumer containers. tcpdump captured an RST message sent from the container IP to the broker in response to the broker sending a data message to the consumer container after a long period of inactivity. The weird thing is that the broker’s traffic never reached the container, nor did the RST originate from the container. Maybe the MITM attack wasn’t such an exotic possibility after all?!

Meanwhile, we tried to re-create the problem’s end state and work toward making our systems resilient to this situation: we used iptables to drop or reset traffic from the broker to the container after the container connected to the broker. Both methods allowed us to observe the same end state we were getting in production, confirming the container never learns that the broker connection is lost. Figuring out how to learn that your peer is down even if the TCP connection state is established proved difficult. But after some Internet searching, we found RFC 1122’s section on TCP Keep-Alives (again emphases mine):

Implementors MAY include “keep-alives” in their TCP implementations, although this practice is not universally accepted. If keep-alives are included, the application MUST be able to turn them on or off for each TCP connection, and they MUST default to off.

Keep-alive packets MUST only be sent when no data or acknowledgement packets have been received for the connection within an interval. This interval MUST be configurable and MUST default to no less than two hours.


A “keep-alive” mechanism periodically probes the other end of a connection when the connection is otherwise idle, even when there is no data to be sent. The TCP specification does not include a keep-alive mechanism because it could:
(1) cause perfectly good connections to break during transient Internet failures; (2) consume unnecessary bandwidth (“if no one is using the connection, who cares if it is still good?”); and (3) cost money for an Internet path that charges for packets.

A TCP keep-alive mechanism should only be invoked in server applications that might otherwise hang indefinitely and consume resources unnecessarily if a client crashes or aborts a connection during a network failure.

Translation: distributed systems are fun… and determining whether a connection is still valid is often the cherry on top.
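For reference, switching keep-alives on for a single socket looks roughly like the sketch below. `SO_KEEPALIVE` is the portable part; the three tunables (`TCP_KEEPIDLE`, `TCP_KEEPINTVL`, `TCP_KEEPCNT`) are Linux socket options, so the code skips any that the platform does not expose. The helper name and the default values are arbitrary choices for illustration, not settings taken from the system described here.

```python
import socket


def enable_keepalive(sock, idle=60, interval=10, probes=5):
    """Ask the kernel to probe an idle connection instead of trusting it forever."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # idle: seconds of silence before the first probe
    # interval: seconds between probes
    # probes: unanswered probes before the connection is dropped
    for name, value in (("TCP_KEEPIDLE", idle),
                        ("TCP_KEEPINTVL", interval),
                        ("TCP_KEEPCNT", probes)):
        option = getattr(socket, name, None)
        if option is not None:
            sock.setsockopt(socket.IPPROTO_TCP, option, value)
    return sock


sock = enable_keepalive(socket.socket(socket.AF_INET, socket.SOCK_STREAM))
sock.close()
```

With these example values a silent peer would be noticed after roughly idle + interval × probes = 110 seconds, a good deal sooner than the two-hour default the RFC asks for.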

But before trying to poke at the TCP stack, we kept investigating. We found out that AMQP supports heartbeats, and they are used to check if a connection is still valid. The library we were using had this option disabled by default, which explains why the container was blocked and waiting instead of trying to reconnect to the broker. To make things worse, because the container is a consumer, it never sends data to the broker. If the container had sent data using the same socket, it could have detected on its own whether the connection was still valid.

To fix this, we evaluated two solutions:

  • The TCP keep-alive fix was the fastest to implement, but we didn’t like it because it delegated detection of the broken connection to the kernel TCP implementation. Also, we didn’t really want to mess with kernel socket options.
  • For an alternative, we ran some tests with other applications and they handled it at the application level (thank you Bandwagon effect). Through this testing, we found we could change the library to activate the AMQP heartbeats. It took more time, but it felt like a better solution to use the mechanisms provided by the AMQP protocol.
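To make the application-level option concrete, here is a minimal sketch of the bookkeeping behind a heartbeat. It is not the AMQP library's API: the `HeartbeatMonitor` class, its method names, and the choice of two missed intervals are all assumptions made for illustration.

```python
import time


class HeartbeatMonitor:
    """Tracks when we last heard from a peer and decides if it looks dead.

    Hypothetical helper for illustration; not part of any AMQP client.
    """

    def __init__(self, interval_s, max_missed=2, clock=time.monotonic):
        self.interval_s = interval_s    # how often the peer should send something
        self.max_missed = max_missed    # silent intervals tolerated (assumption)
        self._clock = clock             # injectable so the logic can be tested
        self._last_seen = clock()

    def record_traffic(self):
        """Call whenever any frame (data or heartbeat) arrives from the peer."""
        self._last_seen = self._clock()

    def peer_looks_dead(self):
        """True once more than max_missed intervals have passed in silence."""
        silence = self._clock() - self._last_seen
        return silence > self.interval_s * self.max_missed
```

A consumer loop would call `record_traffic()` on every frame it reads and reconnect as soon as `peer_looks_dead()` returns true, instead of blocking forever on a socket that will never deliver anything.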

But what about the MITM attack we thought we were seeing?

First some context: we run periodic, short-lived helper containers. We have observed with tcpdump that when a container starts, it announces some ICMPv6 memberships. Also by default, the container network namespace is attached to a network bridge. The network bridge uses a cache to associate addresses with their respective ports, like a network switch. It populates this table when it sees network traffic; as you might imagine, if there isn’t any traffic for some time, the data in the table becomes stale.

The MITM “attack” happens like this: after a period without traffic between the broker and the container, the bridge cache is stale. If a short-lived container is launched at that point, there is a chance it gets the same IP address as the consumer container. This new container updates the bridge cache, so when the broker next sends a message to the consumer, the bridge delivers it to the new container, which answers with a TCP RST because it doesn’t have a TCP connection with the broker. Finally, the broker receives the TCP RST and aborts its connection with the consumer. The magic of giving the same IP address to different containers.
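The sequence can be replayed with a toy model. This is only a sketch of the story as told here: it glosses over how a real bridge and the kernel handle MAC addresses, IP addresses and neighbor caches, and the class name, port names and IP address are made up for the example.

```python
class ToyBridge:
    """Toy learning bridge: remembers which port last announced an address."""

    def __init__(self):
        self.table = {}  # address -> port that most recently used it

    def learn(self, address, port):
        self.table[address] = port

    def lookup(self, address):
        return self.table[address]


def simulate(helper_reuses_ip):
    bridge = ToyBridge()
    # Peers each port has an established TCP connection with.
    established = {"consumer-port": {"broker"}, "helper-port": set()}

    # The consumer connects to the broker; the bridge learns its address.
    bridge.learn("172.17.0.5", "consumer-port")

    # Quiet period, then a short-lived helper starts and announces itself.
    if helper_reuses_ip:
        bridge.learn("172.17.0.5", "helper-port")

    # The broker finally sends a message to the consumer's address.
    port = bridge.lookup("172.17.0.5")
    if "broker" in established[port]:
        return "consumer receives the message"
    # The receiver has no such connection, so it answers with a RST and the
    # broker aborts its side. The consumer never hears about any of this.
    return "helper answers with TCP RST; broker aborts; consumer keeps waiting"
```

Calling `simulate(False)` models the healthy path, while `simulate(True)` models the helper container taking over the consumer's address.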

A picture is worth a thousand words.

Without the MITM

With the MITM

Ultimately, the problem turned out to be one of the most exotic possibilities we had come up with, making us feel pretty:


Our programs must be prepared to handle network disruptions even when the network traffic doesn’t leave a single host and you are using containers. Remember: the network is not reliable! If we forget this, we will have a lot of “fun” with distributed systems bugs.

Perhaps more importantly: always remember that if your program never sends traffic on its TCP socket, you can’t be sure whether the connection is valid or if you will end up waiting for a message that will never arrive.

There are two solutions to avoid this situation: the first is to use TCP keepalives and delegate detection of stale connections to the OS; the other is to implement or use a heartbeat mechanism at a higher layer in the protocol.
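For the first option, a minimal sketch of turning on keepalives for a single socket might look like the following. The helper name and the timing values are assumptions chosen for illustration, and the three tuning options are platform-specific (they exist on Linux), so the code only sets the ones the local Python build exposes.

```python
import socket


def enable_keepalive(sock, idle_s=60, interval_s=10, probes=5):
    """Turn on TCP keepalives for one socket and tune them where possible."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    for name, value in (
        ("TCP_KEEPIDLE", idle_s),       # idle seconds before the first probe
        ("TCP_KEEPINTVL", interval_s),  # seconds between unanswered probes
        ("TCP_KEEPCNT", probes),        # failed probes before giving up
    ):
        option = getattr(socket, name, None)
        if option is not None:  # skip knobs this platform does not offer
            sock.setsockopt(socket.IPPROTO_TCP, option, value)


sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
enable_keepalive(sock)
```

With these example values, a silent peer would be noticed after roughly `idle_s + interval_s * probes` seconds, far sooner than the two-hour default the RFC mandates.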

Both alternatives have their pros and cons, so you’ll need to find which one is best for your team and the distributed system you run. But now that you’ve seen a story where TCP half-open connections, “anomalies,” and keep-alives all worked together, you’ll know that a MITM “attack” might not be such an exotic cause of the problem, even if it’s not an attacker trying to get in, but rather your own kernel.


While preparing this post, I found this article, which discusses half open connections; it would have been handy when explaining this problem to my coworkers.

December 12, 2017

Day 12 - Monitoring Postgres Replication Lag

By: Kathryn Exline (@kathryn_ex)
Edited By: Baron Schwartz (@xaprb)

Have you created replicas of your PostgreSQL databases? I am going to assume you are a good database steward and answered that question with a resounding “YES INDEEDLY DO!” Ned Flanders style. If not, I recommend taking the time to be kind to your future self and do so as soon as possible. We won’t talk about how to do that here, but you can find details on how to configure replication in the PostgreSQL documentation, the PostgreSQL wiki, and around the internet.

With your trusty replicas in place, make sure you take the time to properly monitor your clusters. One of the most important metrics of replication health, albeit seductively easy to overvalue, is “replication lag”. Before I show you a few simple queries to collect this value on your PostgreSQL clusters, let us briefly talk about replication in PostgreSQL. If you are already familiar with the concept of replication and how it is implemented in PostgreSQL, feel free to skip ahead to the “Why Care About Lag?” section.

The Basics: Replication and WAL

Replication is a mechanism where data from one database (a “primary”) is copied to another secondary database (a “replica” or “standby”), keeping it in sync. Most databases have built-in mechanisms to support this feature. After you configure your primary PostgreSQL database for your service, you should create one or more replica databases in case you lose your primary database or decide you want to offload some operations from the primary. Generally, you initialize a replica with a snapshot of the primary and then it stays up to date by fetching and replaying the primary’s transactions.

PostgreSQL implements replication via the Write-Ahead-Log or the “WAL” (pronounced “wall”, like the big icy thing in Game of Thrones). The notion of the WAL is not unique to PostgreSQL, and is similar to that of journaling in file systems. It ensures transactions are logged durably before they are committed, so updates can be recovered and replayed in the case of a crash. Aside from crash recovery, PostgreSQL leverages the WAL for internal performance gains and built-in replication support.

The WAL is a collection of 16MB binary files located in the pg_xlog directory of your data directory (renamed to pg_wal in PostgreSQL 10). Each time the database gets a transaction that requires changing any data, it appends a record of the transaction to the most recently created WAL segment file and assigns the record a Log Sequence Number (LSN) to note its position in the WAL. I explicitly say position and not time because, as the term Log Sequence Number suggests, the WAL files and their individual records are based on a sequence-based timeline. Why? Because if you are processing a high volume of transactions, timestamps may not be unique or granular enough to validate that your transactions are executed in the correct serial order. Not to mention time is full of nasty tricksies. This will be important later when we look at the queries you can use to find your primary’s and replicas’ position in the WAL.

PostgreSQL uses the WAL to make replicas of the primary in one of two ways. The latest and greatest is via streaming replication, where each WAL record is sent to the replica as quickly as possible to be replayed. By default, this is done asynchronously so the replica can process the record without delaying the commit on the primary; however, PostgreSQL also supports synchronous replication where a transaction on the primary must wait until the WAL record is committed on both the primary and the replica before considering the transaction successful.

The second and older option is via log-shipping where it ships one full WAL segment file (16MB) at a time from the primary to a replica. This generally results in higher replication lag since the replica will not receive the WAL records until the file is completely filled. Streaming replication is best for most use cases, but I recommend reading the PostgreSQL documentation around log-shipping standby servers for in-depth explanations of these two options.

Why Care About Lag?

Replication lag is the replica’s distance behind the primary in the sequential timeline. The time it takes to copy data from the primary to a replica, and apply the changes, can vary based on a number of factors including network time, replication configuration, and activity on both the primary and replicas. Unsurprisingly, I have seen replication lag spike on several occasions due to network issues. In another case, I saw replication lag spike on a replica that was not able to find and recover a WAL file from an archiving node, and it quietly fell out of date. The potential causes are widespread and I have found that replication lag is often an indicator that something is subtly failing or behaving unexpectedly.

Ultimately, it is safe to assume that there will be some amount of lag on any replica. But why do you need to know the replication lag in your clusters?

Disaster Recovery

In most scenarios where a primary database is lost, users want to promote the most up to date replica to ensure minimal data loss. You and your tooling can measure the lag to select the optimal replacement for the primary.

Service Strategies and Optimizations

If you connect all of your clients to the primary, you will eventually overload your database. When this happens, a common technique is to direct some read-only queries to replicas; however, if you don’t build your service to be aware of, and tolerate, replication lag, then your users will experience inconsistent behavior from your service. Knowing the typical replication lag of your replicas will help you strategize which services can still function in spite of potential lag.

Debugging and Observation

Just as measuring latency in an HTTP request can indicate an underlying issue, unusually high replication lag can indicate an issue with your databases. Unfortunately replication lag in isolation rarely informs users of the specific underlying problem, but it is a broad indicator of several issues and is another data point in your observability toolbelt.

How To Monitor Lag: Get Your “See Lags”

Now that we have a handle on the importance of monitoring your replication lag, let’s dive into two ways to measure replication lag.

By WAL Location

The most accurate way to determine the lag is to compare the current WAL location on the primary with the last WAL location received by the standby. To find the LSN value of the current WAL location in Postgres versions older than 10.x, run the following on the primary:

=# select pg_current_xlog_location();

In Postgres 10.x, you’ll need to use a newer function:

=# select pg_current_wal_lsn();

You should get an LSN value which looks like the following:

(1 row)

To find the LSN value of the last WAL location received and synced to disk by the standby, run the following on the replica. Once again, there’s a pre-10 syntax and a newer version for Postgres 10.x:

=# -- in Postgres 9.x
=# select pg_last_xlog_receive_location();

=# -- in Postgres 10.x
=# select pg_last_wal_receive_lsn();

You should get a similar record as the previous function:

(1 row)

Note that these functions denote what WAL position the replica has received from the primary, but not what it has applied to bring the replica’s copy into sync with the primary. There could be a difference between these two values. To find out what the replica has replayed, use the following functions on the replica:

=# -- in Postgres 9.x
=# select pg_last_xlog_replay_location();

=# -- in Postgres 10.x
=# select pg_last_wal_replay_lsn();

You can determine whether the replica is at the same point in the WAL as the primary by comparing the values of what’s been committed on the primary and what’s been received or replayed on the replica. The disadvantage of using WAL position is that, despite being an accurate representation of lag, it is difficult for humans to understand what an LSN difference really means. I have seen clever scripts that convert LSNs to the byte position in the WAL and take the difference of these values, but there is an easier option that leverages another built-in function to approximate time lag.
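For the curious, the conversion such scripts perform can be sketched in a few lines of Python. This assumes the two hexadecimal halves of an LSN are the high and low 32 bits of a 64-bit byte position, which is how current PostgreSQL versions print them; the function names and the LSN values in the usage note are made up for illustration. PostgreSQL can also do the subtraction in SQL with pg_xlog_location_diff() (9.2 and later) or pg_wal_lsn_diff() (10).

```python
def lsn_to_bytes(lsn):
    """Convert an LSN string such as '0/3000148' to an absolute byte position."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) + int(low, 16)


def lag_bytes(primary_lsn, replica_lsn):
    """Bytes of WAL the replica still has to receive or replay."""
    return lsn_to_bytes(primary_lsn) - lsn_to_bytes(replica_lsn)


def most_caught_up(replica_lsns):
    """Name of the replica with the highest LSN, e.g. when picking a new primary."""
    return max(replica_lsns, key=lambda name: lsn_to_bytes(replica_lsns[name]))
```

For example, `lag_bytes("0/3000148", "0/3000000")` reports that the replica is 328 bytes behind. `most_caught_up` compares positions numerically, which matters because comparing the raw strings would rank "0/FF" ahead of "0/100".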

By Time Difference

I told you earlier that time was tricky and the WAL is based on a sequence, but timestamps are more readable to humans and ingestible by external tools than the WAL location values. PostgreSQL can report the timestamp of the last transaction replayed from the WAL, allowing you to compare that timestamp with the current time using the following query on your replica:

=# -- Same in both Postgres 9.x and 10.x 
=# select now() - pg_last_xact_replay_timestamp();

This value needs to be read with additional context and taken with a grain of salt. It is meant to be an approximation of lag and should be treated as such. I find this query most useful to inject into my time-series observability metrics or to toss in a terminal pane when running operations that might affect lag. If you are selecting a replica to replace a failed primary, you should use the LSN instead of the approximate timestamp of the LSN.
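One piece of that context: when the primary is idle, nothing new is replayed, so the time difference keeps growing even though the replica is fully caught up. A common workaround is to report zero whenever the replica has replayed everything it has received. The sketch below shows that logic in Python, assuming the three values were already fetched from the replica with the functions above. The function name is hypothetical, and the heuristic has a blind spot of its own: a replica that has lost its connection to the primary will also look caught up.

```python
from datetime import datetime, timedelta, timezone


def approx_time_lag(receive_lsn, replay_lsn, last_replay_ts, now):
    """Approximate replication lag as a timedelta, or None if unknown.

    receive_lsn / replay_lsn: LSN strings read from the replica.
    last_replay_ts: value of pg_last_xact_replay_timestamp(), or None when
    it returned NULL (no transaction has been replayed yet).
    """
    if last_replay_ts is None:
        return None
    if receive_lsn == replay_lsn:
        # Everything received has been applied: treat an idle primary as zero lag.
        return timedelta(0)
    return now - last_replay_ts
```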

Other Helpful Queries and Tools

You can run the following to determine whether you are interacting with the primary or a replica. A replica will return ‘t’ and the primary will return ‘f’:

=# select pg_is_in_recovery();

You can also translate the LSN value returned by the functions mentioned above to the name of the WAL file within your pg_xlog directory using:

=# -- In Postgres 9.x
=# select pg_xlogfile_name(pg_last_xlog_receive_location());

=# -- In Postgres 10.x
=# select pg_walfile_name(pg_last_wal_receive_lsn());
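To demystify those file names, here is a sketch of the arithmetic behind them, assuming the default 16MB segment size and a timeline ID you already know (the built-in functions use the server's current timeline). The Python function name is hypothetical. Like the built-in functions, it maps an LSN that sits exactly on a segment boundary to the preceding file.

```python
WAL_SEGMENT_SIZE = 16 * 1024 * 1024                  # default segment size
SEGMENTS_PER_LOG = 0x100000000 // WAL_SEGMENT_SIZE   # 256 segments per 4GB "log"


def wal_file_name(lsn, timeline=1):
    """Name of the WAL segment file that holds the given LSN string."""
    high, low = lsn.split("/")
    position = (int(high, 16) << 32) + int(low, 16)
    if position == 0:
        raise ValueError("LSN 0/0 does not belong to any segment")
    # Subtract one so a boundary LSN resolves to the preceding segment.
    segment = (position - 1) // WAL_SEGMENT_SIZE
    return "%08X%08X%08X" % (
        timeline,
        segment // SEGMENTS_PER_LOG,
        segment % SEGMENTS_PER_LOG,
    )
```

The 24-character name is simply three 8-digit hexadecimal numbers: the timeline, the 4GB "log" number, and the segment within that log.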

If you are curious about what a WAL file actually looks like, PostgreSQL introduced the pg_xlogdump tool in version 9.3 to convert the contents of the binary WAL file into human readable form. Note this tool was renamed to pg_waldump in version 10.0 and is intended for educational purposes only.

Beyond the Queries

If your databases run on a cloud platform, your provider may already provide these metrics for you. For example, AWS CloudWatch provides the ReplicaLag metric and GCP provides the replication metric. Finally, whether you use external tooling to monitor your replication lag or write your own monitoring plugins, you need to consider how you actually use these metrics.

As we discussed earlier, replication lag is a helpful metric and provides additional data points when making decisions about your services, but think long and carefully before alerting or paging around replication lag. You probably don’t want to. Replication lag is susceptible to a variety of factors, some of which are not actionable or inherently wrong, and it varies enough that you could find yourself bogged down in tuning alerting thresholds or developing complex anomaly detection. If you do choose to page on this value, make sure you embed plenty of headroom in your thresholds, provide context around potential lag causes in your alerting tools, and give your on-call rotation a few extra high-fives.


December 11, 2017

Day 11 - Scaling your on-duty team

By: Damien Pacaud (@damienpacaud)

Edited By: (@bmarsteau)

Our tech team at teads is mostly based in France where labour law and legislation provide quite a strict set of rules and boundaries for working out of office hours.

For this reason we’ve had to adapt and give some thought to our on-duty team organization as we grew from a start-up to a scale-up.


Scaling your on-duty team is crucial for most fast-growing startups that operate at a global level. The internet never sleeps, and even with the best design for resilience, one day, your system will go down. At teads, we deliver outstream video advertising for the biggest content publishers in the world. Any downtime has important repercussions on our revenue but also on the publishers’ revenue. We decided to carefully think about scaling our on-duty team in order to minimize the downtime when a system goes down. That story is below.

Our problem

In a few years, we’ve scaled from a growing startup operating with a few pizza teams into a company where more than 100 developers in 3 different locations deliver new features on a daily basis. We’ve been able to do so by implementing our own version of the “Spotify model” and it has given us the ability to stay agile while growing the tech team. Applying the same recipe to the on-duty team was a challenge, to say the least. Initially, the on-duty team was composed of a few developers who had been with teads since the very beginning and who were very knowledgeable about every part of the platform. We relied on their knowledge, their availability and the fact that they had helped build most of the system. As we grew, the system became larger and more complex. The handful of developers keeping the revenue safe overnight were now unable to keep up with the knowledge needed to solve a problem.

First step: Growing the on-duty team

We started looking for people to add to the on-duty team and ideally have someone from each of our feature teams be part of the rotation. This was our way of implementing “you built it, you run it” in a country with strict labour laws. It meant growing that team to 12 people and that’s when we hit the first wall. We tried growing the team while having a few visible production incidents (S3 Service Disruption in us-east-1, anyone?) and of course, no one was voluntarily applying to be on duty.


Besides, being on duty once every twelve weeks seems counterproductive as it is spread over too long a timespan. By the time you are back on duty a lot of systems have changed and it is difficult to remember good practices.

Lost battle: trying to be ready

One of the main reasons nobody was applying to the on-duty rotation was the lack of documentation for how to react when an incident arises. We tried to tackle this problem, and for a few months we set up meetings, put knowledgeable people in a room, and asked them to kindly document the steps to take when incidents happen.


This was too large of a mission, even for a highly motivated team. Soon, meetings were skipped, and documentation was not improving.

At this point, we started thinking about the problem in a different way.

Enter on-duty pairing

The first decision we took was to have two people on duty at the same time for a week-long shift. We try to choose pairs wisely, for non-overlapping skill sets and experience. For example, we will pair a back-end developer with a data-oriented developer. This allows us to cover most systems on the critical chain.

The benefits that we see with the on-duty pairing are:

  • It’s much easier to bounce ideas off someone when a problem is impacting production and you (or your pair) do not know how to fix it.
  • Sometimes while on duty, the incident runs so deep that a critical business decision must be taken. It’s much easier to share the responsibility of such a decision in the middle of the night. We accept that this may slow down the decision process as there will be back-and-forth between the pair.
  • In the rare event of someone not waking up to the PagerDuty calls, there is a backup.

Interestingly enough, we had never experienced someone not waking up until we started pairing. This raised the question of whether pairing may lower each individual’s sense of alertness because there is a backup, but in the end we feel it has more benefits than downsides.


We implemented this change in a few weeks and so far we are quite happy with it. The team has scaled to 12 developers, coming from all feature teams, and the rotation goes smoothly.

Escalation?

The traditional way of dealing with increasing complexity is to have an escalation policy. We chose not to implement this and have PagerDuty automatically wake up both pairing developers at the same time. This automates the decision of waking up another human being and makes PagerDuty responsible for it. We don’t want to be responsible for this hard decision so we let the robot do it.


Escalation usually also solves the “I need an expert on [insert any well known distributed system here] and I need her right now” problem. Putting them on escalation policies is great if you have a big enough pool of experts on each of the systems that you use. For us this meant that a few people would be on call every other week. We thought this was not acceptable and decided that we could solve this by:

  • Telling the on-duty team members we know they will do their best to recover from the issue
  • Giving them the confidence that, as engineers, they will find a solution
  • Automating as many routine maintenance operations as we can (taking a bad Cassandra node out of the ring, decommissioning and replacing a Kafka broker…)

Post-incident & Playbook

Soon after the incident, we gather everyone from the on-duty team in a room for a blameless, fact-oriented post-mortem. We aim to leave the room after one hour having filled in our very simple post-incident template:

  • Summary of the issue
  • How to reply to such an issue (should it arise again)
  • Action plan

This process allows us to document our interventions and ensure, should the same incident happen, we have a solution to mitigate its effect in a timely manner.


After a few months, we are quite happy with this new on-duty rotation. It has proven useful many times and we now have more documentation than ever on how to react to our alerts. The post-incident ritual also acts as a team bonding meeting and we are thinking of creating more rituals specifically for the on-duty team (on top of each individual’s feature team rituals).

The biggest complexity that we encountered since launching was organizing the Christmas rotation period with pairs. It’s always a challenge to find one person available during those holidays, so trying to find two is double the fun.

December 10, 2017

Day 10 - From Product Eng to Systems Eng

By: Will Gallego (@wcgallego)

Edited By: Dave Mangot (@davemangot)

Engineering is an open field, not a paved road

In engineering, straight lines are few and far between. There is no certain path, no strict guide, no singular right way for every task. There are lots of wrong ways, for sure (being dismissive of others’ hard work, excluding underrepresented or underprivileged folks, and stealing ideas and claiming them as your own, to name a few). More often than not it’s hard to know you’re “doing things right” until you look back. It took me a while to understand how my career led me to joining a Systems Engineering team in particular, because so much of this doubt clouded my career path early on. If you’re thinking of making a similar transition, I’d love to see you take a chance in exploring a new facet in the tech world.

Everyone has their own flavor of falling into engineering. Sometimes it’s sitting next to just the right person to peek over their shoulder as they work. Maybe you have a mom who was really into hardware hacking, passing that same love for electronics over to you as you grew up. A lot of the time it could be “part of the job” - you needed to pay the bills or they trained you to write scripts so you could automate some of the more mundane tasks along the way. All are valid and don’t let anyone tell you otherwise.

I grew up with a hobby-centric mindset as my approach to software development. HTML in AOL Pages and writing a blackjack game in BASIC were my gateway drugs, back in the 90’s when it was just starting to become widely accessible to geeks in non-geek families. I had enough knowledge to be dangerous but not enough confidence to push past the fear of failure, even through college. I had a similar reluctance through the first few years of my professional career, a hesitation about parts of the stack that “weren’t for me”. Unless someone asked, I didn’t cross the line.

Taking a bigger step

Often as engineers, we tend to wait for someone to tell us what we can and can’t do. We hesitate to apply for senior roles because we haven’t been called “senior” in a title yet. We hold back on ideas in meetings or questions during architecture reviews worrying that they’re obvious or even stupid (spoiler: they’re not). We don’t question previous design decisions because we assume when it was first built it must have been right and must continue to be right. All of these are intertwined with the fear of being less, doing less, or appearing less in the eyes of those assumed to be smarter or simply more talented than us.

Nothing could be further from the truth.

I don’t believe in an intrinsic trait that breeds systems engineers, or engineers at all for that matter. People are not born and bred for this. Sure, there are folks who are naturally drawn to it and find their mark early in admin work or building distributed systems. That said, no one has database administration in their DNA. There isn’t a gene marker for admins. Noble blood lines delineating who can and can’t be an SRE don’t exist. The field is open to anyone interested, and if you hear someone say differently, they’re only displaying to the world their own insecurities.

Hand in hand with this falsehood is the belief that systems engineering is “harder” than other parts of the stack. Many engineers, myself included, started out building products on the front end typically because the feedback loop is shorter and perhaps arguably more tangible. We convince ourselves that backend work is beyond us because it departs from our comfort zone. Letting go of this self doubt in what you’re capable of opens up a ton of opportunities.

Why join Systems Engineering?

Many of the reasons for which you might have found yourself joining Product Engineering have parallels in Systems Engineering, avenues you might not realize exist. We’re problem solvers and builders, investigators and collaborators. We want to create and to improve, expanding our knowledge of our craft both for ourselves and for others directly and indirectly. If you feel this itch but worry “well, I’m just going to be setting up machines and never coding, right?”, let me allay those fears now. There are a ton of reasons to try your hand here.

Because you care

First and foremost, you care. You’re in this industry because you want to build products that will be impactful, tools that improve people’s daily lives or entertain audiences, maybe even save some lives in the process. Systems Engineers deeply believe in that too, just with a small shift in the focus of said audience. People have problems that need solving, ones that you want to put your energy into helping them overcome.

Empathy should be at the core of everything we do in engineering, regardless of role or position in the company. Typically when building a frontend app, your audience consists of folks external to your org. You might meet a few people who say “Oh, you work at X? I love that and use it every day!” which is a great ego boost knowing you’re making people’s lives better. How great is it to get that same feedback from your friends and coworkers? Combining your knowledge of the needs of your frontend with the functional knowledge of what can be done via the backend can make you an important asset to any technical organization. You get to do so at a very personal level, one in which you can directly ask your consumers “how can I help?”.

You have a passion for understanding

You’re not content in making assumptions about your stack and you’re voracious in your consumption of material to learn. One of my favorite interview questions is simple in its ask and incredibly deep in its answers: “What happens when you type in your browser and hit enter?”. If you’ve ever walked through that, you’ll realize just how far the branching pathways extend in various directions.

  • Well, something has to interpret a domain name into an IP address, but I’ve always handwaved that (something something DNS). But how does DNS actually work?
  • Is it just one host machine somewhere though? Most likely not. How would a site that takes tens, hundreds, thousands of requests a second scale?
  • Hrm, what if I’m logged in - there have to be datastore requests to fetch information relevant to my account. How do we reliably read and write to that, and how does it change for read or write heavy applications?
  • How are all of those concurrent requests being spread out? Is there a caching layer in front of them to reduce some of that load?
  • What happens when one machine, multiple machines, all of them, fail in a localized zone?
  • How do I know which static asset versions I should be serving?
  • What’s our deployment strategy for making updates to the site?
  • Is there any kind of security for accessing this site - a cert to confirm I’m visiting the correct site and not being hit by a man in the middle attack? Why should I use TLS over SSL?

And this is all just scratching the barest surface for these and many more questions. If you’re passionate about the learning aspect of engineering, you can see how extending yourself further down into the stack gives you near limitless opportunities to grow as an engineer. When you see companies asking for full stack engineers, it’s folks who are asking these questions and more when presented with a new or unknown architecture.

Sometimes it’s fun just to be the smart engineer who knows a ton of stuff, too. This goes a bit beyond the scope of this post, but there’s a distinct push and pull between doing great work you’re proud of while maintaining the humility that you can’t know everything. You certainly don’t want to be the know-it-all jerk who spouts pedantic trivia (think “well, actually…”), but it’s exciting to be the conduit between teams who can answer a ton of questions. Helping people out can create a strong positive feedback loop for feeling valuable in what you do.

In short, systems engineering can be an opportunity to bust through silos and open up some black boxes. See what assumptions you hold that you can break. There’s nothing magical inside, and yet your colleagues will think you’re a wizard with how much you know!

Pushing back against Defensive Attribution Hypothesis

Defensive Attribution Hypothesis is a cognitive bias that involves the disassociation in the thinking, understanding, and skills of others with those of your own in the face of failure. That blameful feeling you get when something goes awry and you want to point fingers at “those engineers over there”? That comes part and parcel with this. We tend to see failure external to ourselves and mentally push it away further, creating a gulf between perceived success and failure. If failure is way over there, then I must be successful over here. Of course in doing so, you’re also pushing understanding of the situation - and people - away.

The classic example of this is the devs vs. ops mentality. A deploy goes out to production. Shortly after, the site experiences an outage. You can imagine exactly what happens next. The developers cry foul, saying “the code was working in our dev environment, so production must not be set up correctly!”. The ops team says “of course it’s set up correctly, the site was working fine up until that deploy so it must be your change that broke everything. Why didn’t you test it more?”. There’s no insight here, no learning from what happened.

Now, imagine you have experience in both product engineering and systems engineering. You understand what time pressures devs are under to accomplish goals and the purpose of the app. They don’t want to break production, but they’re trying to hit their targets for the latest sprint. Likewise, you know what load the backend can handle and what’s required to scale it further. You know metrics like the requests per second your systems are currently under and how the architecture would need to be scaled up should those numbers change. You can see both sides of the situation.

As an example, let’s say your product team is launching a feature that adds a new query to the database. They build it in dev and it’s reading/writing as expected, but under much reduced load. They followed protocol, putting in the change request to the DBAs and setting up an index for this query. They’re being proactive! Likewise, your DBAs gain confidence in the deploy because of this request. The devs are testing it and they wouldn’t have asked for the index addition if they weren’t being careful, so the DBAs give the thumbs up. The deploy hits prod and things go south, with both sides believing they had done their due diligence and the other is to blame. A lack of perspective promotes conflict. Now let’s add you to this equation, with your prowess in multiple parts of the stack. You see a code review for this pop into your inbox and think:

  • Perhaps this query could be cached for more availability, since it’s fairly read heavy and can hold up to a bit of staleness without adversely affecting what the app is trying to accomplish
  • Maybe you can reduce the number of columns fetched and use a covering query, because you know what is and isn’t needed for the app
  • To give some confidence to future deploys in general, you could set up a canary cluster for the devs to rely on so that they could see the performance of their code changes before it affects end users
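The first idea in that list, caching a read-heavy query that can tolerate a bit of staleness, might look something like the sketch below. The names (`TTLCache`, `fetch_top_projects`) and the 30-second TTL are invented for illustration, and a real app would more likely lean on something like memcached or Redis than an in-process dictionary.

```python
import time
from typing import Any, Callable, Dict, Tuple


class TTLCache:
    """Tiny in-process cache: serve a stored value until it is ttl seconds old."""

    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock  # injectable so the expiry logic can be tested
        self._store: Dict[str, Tuple[float, Any]] = {}

    def get_or_load(self, key: str, loader: Callable[[], Any]) -> Any:
        now = self.clock()
        hit = self._store.get(key)
        if hit is not None and now - hit[0] < self.ttl:
            return hit[1]  # fresh enough: skip the database entirely
        value = loader()  # miss or stale entry: run the real query
        self._store[key] = (now, value)
        return value


if __name__ == "__main__":
    calls = []

    def fetch_top_projects():
        # Hypothetical stand-in for the new database query.
        calls.append(1)
        return ["project-a", "project-b"]

    cache = TTLCache(ttl_seconds=30)
    cache.get_or_load("top_projects", fetch_top_projects)
    cache.get_or_load("top_projects", fetch_top_projects)
    print("database hits:", len(calls))
```

The TTL is the knob that trades staleness for database load, which is exactly the conversation you would want the devs and DBAs to have together.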

Your knowledge of multiple domains lets you assume the best of intentions from everyone involved, because you can understand where both parties are coming from and empathize with the requirements they face.

What’s next?

So you’re ready to try your hand at systems engineering, be it general ops, database admin, networking, etc., but it’s super intimidating to jump in. You have experience with dev work in some form, but the divide looks too wide to cross in one stride. You need to start fresh in a number of areas, which in a lot of ways might feel like starting over. Fortunately, there are lots of ways you can ease into the waters without too much disruption. How do you get from here to there?

Temp rotations with your local ops/backend team

Ask your manager if you can do a short term rotation with your company’s ops team. That tool that everyone wants built but no one has time to get to? This is a great opportunity for both you and your teammates to make it happen. With some coordination, you can be everyone’s favorite engineer putting it together while gaining some real world experience. Likewise, if there are tickets in your Jira, Trello, or other respective task board, see if you can snag one or two that look like low hanging fruit. Your admin friends will be grateful for your help in chopping away at the queue and hopefully can extend some of their domain knowledge to level you up in the process. Everyone wins!

Attend new-to-you meetups, conferences, and meetings

Rotations can be a tall order for some companies, though, and not every org’s roadmap allows for a multi-week digression. If your company will sponsor you to go to conferences you might not normally attend, that’s an excellent resource for a longer term investment. Setting up smaller sessions within your org to spread domain knowledge (demos and “lunch and learns” are two great vehicles for this!) can be immensely useful as well. These sessions inform a large swath of folks and, at the same time, help free engineers from being single points of failure for the subsystems they maintain. Local meetups in your area? Jump on those too! If you’re feeling a bit timid, grab a friend to go with, as partnering up can also tighten the feedback loop for answering questions and stimulating ideas. Sharing information and learning as a team can quickly make those intimidating questions trivial.

Daily tips and tricks

Impostor syndrome can really hit home when you’re making a large shift like this. Trying to shift gears when you’ve focused so heavily on one vertical in tech is really daunting. DNS? Filesystems? Database integrity? There are so many paths to choose and each goes so deep. With so many people who have deep concentrations of knowledge, it can be intimidating to try to “catch up”.

There’s no need to try to boil the ocean learning all of this in one day. You’re in no rush and you’ve got lots of time in your career no matter where you are looking to pick up skills. You’d probably be surprised at how much you’ve learned in a short time - unless you’re writing out your achievements as they come (and you should! It’s another great way to fight impostor syndrome). There’s a quote attributed to Bill Gates: “Most people overestimate what they can do in one year and underestimate what they can do in ten years.” We try to do so much in the immediate, but set up a long term plan and you can move mountains. Try working on smaller, bite-sized projects, courses, and reading that’ll move you along your path.

Here are a few ways to get inspired if you’re looking to attack this problem on multiple fronts:

  • Subscribe to Twitter feeds that offer daily tips or regularly contribute to learning. Some of my favorites: Julia Evans (@b0rk), Command Line Magic (@climagic), and of course SysAdvent in December (@SysAdvent)
  • Some of your favorite sites have great tech blogs - Kickstarter, Yelp, Netflix, and Etsy
  • Subscribe to mailing lists and newsletters with articles that arrive in your mailbox - DevOpsWeekly, SysAdmin Casts, Monitoring Weekly, and SRE Weekly
  • Add some tech podcasts or screencasts to your commute - Arrested DevOps, CodeNewbie Podcast, SysAdmincasts, and Devops Cafe
  • If your company is doing demo days, sit in on some for other teams you don’t typically collaborate with.
  • Likewise, hack weeks hosted by your company can bubble up great ideas and inspire you to venture into new parts of the stack with guidance from their owners.
  • Start a tech book club at your company reading over a chapter every week or two. Learning can be even more effective when you’re sharing ideas with other like minded folks.

Looking just over the horizon

Finally, you can look to parts of the stack adjacent to your comfort zone as a direction into systems engineering. If your strengths lie in mobile app building, ask yourself what the architecture of the API your app talks to might look like. If you need to set up a datastore call, investigate how you might profile and optimize that query to utilize indexes or set up caching around it. If you’re writing views and controllers in your favorite language (say, PHP), take a look behind the curtain to see how a dependency management tool, like Composer, might be installing packages in various environments.
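If you want to practice the “profile and optimize that query” step without touching a real database, here is a small sketch using the SQLite module that ships with Python. The `orders` table, its columns, and the index name are all made up for illustration, and your production database will have its own `EXPLAIN` syntax and output, but the habit of asking the database for its query plan carries over.

```python
import sqlite3


def query_plan(conn: sqlite3.Connection, sql: str, params=()) -> str:
    """Return SQLite's plan for a statement as one readable string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    # The last column of each row is the human-readable detail text.
    return "; ".join(row[-1] for row in rows)


def demo():
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE orders ("
        "id INTEGER PRIMARY KEY, user_id INTEGER, total REAL, notes TEXT)"
    )
    sql = "SELECT total FROM orders WHERE user_id = ?"

    before = query_plan(conn, sql, (42,))
    # Index both the filter column and the fetched column, so the query
    # can be answered from the index alone (a covering index).
    conn.execute("CREATE INDEX idx_orders_user_total ON orders (user_id, total)")
    after = query_plan(conn, sql, (42,))
    conn.close()
    return before, after


if __name__ == "__main__":
    before, after = demo()
    print("before:", before)
    print("after: ", after)
```

Before the index exists, the plan reports a scan of the whole table; afterwards it reports a search using the covering index. The exact wording varies between SQLite versions.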

Picking up new skills doesn’t have to feel like being air dropped into the middle of nowhere. There’s something novel for sure about learning about tools that may be wholly different from where your current strengths lie, but easing into it with tech bordering what you’re comfortable with can smooth that transition. Checking out “over the horizon” tech to see what’s near to what you know but still new can help broaden your skillsets while leaving you with a starting point to build from in your mental model.

Bundling this up

For those of you thinking about a transition to systems engineering work, I can attest from personal experience how rewarding it can be. By opening yourself up, you can be a powerful force for good in your company, one with high adaptability and a wide breadth of scope for promoting positive change. Understanding this can afford you exciting work to be a part of and new challenges to spur your career moving forward. It’s a big step into uncharted territory, but one that can be deeply satisfying.

If you’re restricting yourself to familiar comfort zones or if you have a tendency towards vertical over horizontal learning, make this an opportunity to surprise yourself. There’s no rush - you’re not falling behind by exploring other parts of the stack. The best engineers in the industry have made it because they understand that failure is a necessary risk to achieve personal growth. Yes, you’re going to trip and fall. You did the same getting into engineering in the first place! Trust that you’ll eventually land on your feet and be all the better for it in the end.

December 9, 2017

Day 9 - Using Kubernetes for multi-provider, multi-region batch jobs

By: Eric Sigler (@esigler)
Edited By: Michelle Carroll (@miiiiiche)


At some point you may find yourself wanting to run work on multiple infrastructure providers — for reliability against certain kinds of failures, to take advantage of lower costs in capacity between providers during certain times, or for any other reason specific to your infrastructure. This used to be a very frustrating problem, as you’d be restricted to a “lowest common denominator” set of tools, or have to build up your own infrastructure primitives across multiple providers. With Kubernetes, we have a new, more sophisticated set of tools to apply to this problem.

Today we’re going to walk through how to set up multiple Kubernetes clusters on different infrastructure providers (specifically Google Cloud Platform and Amazon Web Services), and then connect them together using federation. Then we’ll go over how you can submit a batch job task to this infrastructure, and have it run wherever there’s available capacity. Finally, we’ll wrap up with how to clean up from this tutorial.


Unfortunately, there isn’t a one-step “make me a bunch of federated Kubernetes clusters” button. Instead, we’ve got several parts we’ll need to take care of:

  1. Have all of the prerequisites in place.
  2. Create a work cluster in AWS.
  3. Create a work cluster in GCE.
  4. Create a host cluster for the federation control plane in AWS.
  5. Join the work clusters to the federation control plane.
  6. Configure all clusters to correctly process batch jobs.
  7. Submit an example batch job to test everything.


  1. Kubecon is the first week of December, and Kubernetes 1.9.0 is likely to be released the second week of December, which means this tutorial may go stale quickly. I’ll try to call out what is likely to change, but if you’re reading this and it’s any time after December 2017, caveat emptor.
  2. This is not the only way to set up Kubernetes (and federation). One of the two work clusters could be used for the federation control plane, and having a Kubernetes cluster with only one node is bad for reliability. A final example is that kops is a fantastic tool for managing Kubernetes cluster state, but production infrastructure state management often has additional complexity.
  3. All of the various CLI tools involved (gcloud, aws, kube*, and kops) have really useful environment variables and configuration files that can decrease the verbosity needed to execute commands. I’m going to avoid many of those in favor of being more explicit in this tutorial, and initialize the rest at the beginning of the setup.
  4. This tutorial is based off information from the Kubernetes federation documentation and kops Getting Started documentation for AWS and GCE wherever possible. When in doubt, there’s always the source code on GitHub.
  5. The free tiers of each platform won’t cover all the costs of going through this tutorial, and there are instructions at the end for how to clean up so that you shouldn’t incur unplanned expense — but always double check your accounts to be sure!

Setting up federated Kubernetes clusters on AWS and GCE

Part 1: Take care of the prerequisites

  1. Sign up for accounts on AWS and GCE.
  2. Install the AWS Command Line Interface - brew install awscli.
  3. Install the Google Cloud SDK.
  4. Install the Kubernetes command line tools - brew install kubernetes-cli kops (the kubernetes-cli formula provides kubectl).
  5. Install the kubefed binary from the appropriate tarball for your system.
  6. Make sure you have an SSH key, or generate a new one.
  7. Use credentials that have sufficient access to create resources in both AWS and GCE. You can use something like IAM accounts.
  8. Have appropriate domain names registered, and a DNS zone configured, for each provider you’re using (Route53 for AWS, Cloud DNS for GCP). I will use “” below — note that you’ll need to keep track of the appropriate records.
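Before going further, it can save some head-scratching to confirm the command line tools are actually on your PATH. The sketch below is one way to do that; the tool list is simply my reading of the commands used in this tutorial (gsutil ships with the Google Cloud SDK), so adjust it to taste.

```python
import shutil
from typing import Callable, Iterable, List, Optional

# CLI tools invoked somewhere in this tutorial.
REQUIRED_TOOLS = ("aws", "gcloud", "gsutil", "kubectl", "kops", "kubefed")


def missing_tools(
    tools: Iterable[str] = REQUIRED_TOOLS,
    which: Callable[[str], Optional[str]] = shutil.which,
) -> List[str]:
    """Return the tools that cannot be found on PATH."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("All command line tools found.")
```

This only checks that each binary exists; it says nothing about versions or whether your credentials are configured.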

Finally, you’ll need to pick a few unique names in order to run the below steps. Here are the environment variables that you will need to set beforehand:

export S3_BUCKET_NAME="put-your-unique-bucket-name-here"
export GS_BUCKET_NAME="put-your-unique-bucket-name-here"
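Since every later command depends on these two variables, a quick sanity check is cheap insurance. This is a minimal sketch that flags a variable if it is unset, empty, or still holds the placeholder text from above; the helper name `unset_vars` is my own.

```python
import os
from typing import Iterable, List, Mapping

REQUIRED_VARS = ("S3_BUCKET_NAME", "GS_BUCKET_NAME")
PLACEHOLDER = "put-your-unique-bucket-name-here"


def unset_vars(env: Mapping[str, str],
               names: Iterable[str] = REQUIRED_VARS) -> List[str]:
    """Return variables that are missing, empty, or still the placeholder."""
    return [n for n in names if env.get(n, "").strip() in ("", PLACEHOLDER)]


if __name__ == "__main__":
    problems = unset_vars(os.environ)
    if problems:
        print("Please set:", ", ".join(problems))
    else:
        print("Bucket name variables look set.")
```

It does not check that the names are actually unique or available; the bucket creation commands will tell you that soon enough.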

Part 2: Set up the work cluster in AWS

To begin, you’ll need to set up the persistent storage that kops will use for the AWS work cluster:

aws s3api create-bucket --bucket $S3_BUCKET_NAME

Then, it’s time to create the configuration for the cluster:

kops create cluster \
 --name="" \
 --dns-zone="" \
 --zones="us-east-1a" \
 --master-size="t2.medium" \
 --node-size="t2.medium" \
 --node-count="1" \
 --state="s3://$S3_BUCKET_NAME" \
 --kubernetes-version="1.8.0"

If you want to review the configuration, use kops edit cluster --state="s3://$S3_BUCKET_NAME". When you’re ready to proceed, provision the AWS work cluster by running:

kops update cluster --yes --state="s3://$S3_BUCKET_NAME"

Wait until kubectl get nodes --show-labels shows the NODE role as Ready (it should take 3–5 minutes). Congratulations, you have your first (of three) Kubernetes clusters ready!
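If you would rather not re-run that command by hand, the waiting can be scripted. The sketch below polls `kubectl get nodes --no-headers` against whatever context is current and reads the STATUS column; the timeout and polling interval are arbitrary choices of mine, and the parsing assumes the default column layout (NAME first, STATUS second).

```python
import subprocess
import time


def all_nodes_ready(kubectl_output: str) -> bool:
    """True if every row of `kubectl get nodes --no-headers` reports Ready."""
    rows = [line.split() for line in kubectl_output.splitlines() if line.strip()]
    if not rows:
        return False  # no nodes registered yet
    # STATUS is the second column; it can look like "Ready,SchedulingDisabled".
    return all(len(r) > 1 and r[1].split(",")[0] == "Ready" for r in rows)


def wait_for_nodes(timeout_s: int = 600, interval_s: int = 15) -> bool:
    """Poll the current kubectl context until all nodes are Ready or time runs out."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        result = subprocess.run(
            ["kubectl", "get", "nodes", "--no-headers"],
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            universal_newlines=True,
        )
        if result.returncode == 0 and all_nodes_ready(result.stdout):
            return True
        time.sleep(interval_s)
    return False


if __name__ == "__main__":
    print("ready" if wait_for_nodes() else "timed out waiting for nodes")
```

The same script works for the other two clusters in this tutorial, as long as the right kubectl context is selected first.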

Part 3: Set up the work cluster in GCE

OK, now we’re going to do a very similar set of steps for our second work cluster, this one on GCE. First though, we need to have a few extra environment variables set:

export PROJECT="$(gcloud config get-value project)"

As the documentation points out, using kops with GCE is still considered alpha. To keep each cluster using vendor-specific tools, let’s set up state storage for the GCE work cluster using Google Storage:

gsutil mb gs://$GS_BUCKET_NAME/

Now it’s time to generate the configuration for the GCE work cluster:

kops create cluster \
 --name="" \
 --dns-zone="" \
 --zones="us-east1-b" \
 --state="gs://$GS_BUCKET_NAME/" \
 --project="$PROJECT" \
 --kubernetes-version="1.8.0"

As before, use kops edit cluster --state="gs://$GS_BUCKET_NAME/" to peruse the configuration. When ready, provision the GCE work cluster by running:

kops update cluster --yes --state="gs://$GS_BUCKET_NAME/"

And once kubectl get nodes --show-labels shows the NODE role as Ready, your second work cluster is complete!

Part 4: Set up the host cluster

It’s useful to have a separate cluster that hosts the federation control plane. In production, it’s better to have this isolation to be able to reason about failure modes for different components. In the context of this tutorial, it’s easier to reason about which cluster is doing what work.

In this case, we can use the existing S3 bucket we’ve previously created to hold the configuration for our second AWS cluster — no additional S3 bucket needed! Let’s generate the configuration for the host cluster, which will run the federation control plane:

kops create cluster \
 --name="" \
 --dns-zone="" \
 --zones="us-east-1b" \
 --master-size="t2.medium" \
 --node-size="t2.medium" \
 --node-count="1" \
 --state="s3://$S3_BUCKET_NAME" \
 --kubernetes-version="1.8.0"

Once you’re ready, run this command to provision the cluster:

kops update cluster --yes --state="s3://$S3_BUCKET_NAME"

And one last time, wait until kubectl get nodes --show-labels shows the NODE role as Ready.

Part 5: Set up the federation control plane

Now that we have all of the pieces we need to do work across multiple providers, let’s connect them together using federation. First, add aliases for each of the clusters:

kubectl config set-context aws
kubectl config set-context gcp
kubectl config set-context host

Next up, we use the kubefed command to initialize the control plane, and add itself as a member:

kubectl config use-context host
kubefed init fed --host-cluster-context=host --dns-provider=aws-route53 --dns-zone-name=""

If the message “Waiting for federation control plane to come up” takes an unreasonably long amount of time to appear, you can check the underlying pods for any issues by running:

kubectl get all --namespace=federation-system
kubectl describe po/fed-controller-manager-EXAMPLE-ID --namespace=federation-system

Once you see “Federation API server is running,” we can join the work clusters to the federation control plane:

kubectl config use-context fed
kubefed join aws --host-cluster-context=host --cluster-context=aws
kubefed join gcp --host-cluster-context=host --cluster-context=gcp
kubectl --context=fed create namespace default

To confirm everything’s working, you should see the aws and gcp clusters when you run:

kubectl --context=fed get clusters

Part 6: Set up the batch job API

(Note: This is likely to change as Kubernetes evolves — this was tested on 1.8.0.) We’ll need to edit the federation API server in the control plane, and enable the batch job API. First, let’s edit the deployment for the fed-apiserver:

kubectl --context=host --namespace=federation-system edit deploy/fed-apiserver

And within the configuration in the federation-apiserver section, add a --runtime-config=batch/v1 line, like so:

  - command:
    - /hyperkube
    - federation-apiserver
    - --admission-control=NamespaceLifecycle
    - --bind-address=
    - --client-ca-file=/etc/federation/apiserver/ca.crt
    - --etcd-servers=http://localhost:2379
    - --secure-port=8443
    - --tls-cert-file=/etc/federation/apiserver/server.crt
    - --tls-private-key-file=/etc/federation/apiserver/server.key
    # ... then add the line:
    - --runtime-config=batch/v1

Then restart the Federation API Server and Cluster Manager pods by rebooting the node running them. Watch kubectl get all --context=host --namespace=federation-system if you want to see the various components change state. You can verify the change applied by running the following Python code:

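# Sample code from Kubernetes Python client
from kubernetes import client, config


def main():
    # Load credentials from the default kubeconfig; this talks to whichever
    # context is current (the fed context, if you have been following along).
    config.load_kube_config()

    print("Supported APIs (* is preferred version):")
    print("%-20s %s" %
          ("core", ",".join(client.CoreApi().get_api_versions().versions)))
    for api in client.ApisApi().get_api_versions().groups:
        versions = []
        for v in api.versions:
            name = ""
            # Mark the preferred version when a group offers more than one.
            if v.version == api.preferred_version.version and len(
                    api.versions) > 1:
                name += "*"
            name += v.version
            versions.append(name)
        print("%-40s %s" % (api.name, ",".join(versions)))


if __name__ == '__main__':
    main()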

You should see output from that Python script that looks something like:

> python
Supported APIs (* is preferred version):
core                 v1
federation           v1beta1
extensions           v1beta1
batch                v1

Part 7: Submitting an example job

Following along from the Kubernetes batch job documentation, create a file, pi.yaml with the following contents:

apiVersion: batch/v1
kind: Job
metadata:
  generateName: pi-
spec:
  template:
    metadata:
      name: pi
    spec:
      containers:
      - name: pi
        image: perl
        command: ["perl",  "-Mbignum=bpi", "-wle", "print bpi(2000)"]
      restartPolicy: Never
  backoffLimit: 4

This job spec:

  • Runs a single container to generate the first 2,000 digits of Pi.
  • Uses a generateName, so you can submit it multiple times (each time it will have a different name).
  • Sets restartPolicy: Never, but OnFailure is also allowed for batch jobs.
  • Sets backoffLimit. This generates a parse violation in 1.8, so we have to disable validation.
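If you end up submitting variations of this job (more digits, a different retry limit), it may be easier to generate the manifest than to hand-edit it. Here is a minimal sketch that builds the same job spec as a Python dictionary and prints it as JSON, which kubectl accepts in place of YAML. The function name and its parameters are my own invention.

```python
import json


def build_pi_job(digits: int = 2000, backoff_limit: int = 4) -> dict:
    """Build the pi job as a plain dict mirroring pi.yaml."""
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        # generateName lets the API server pick a unique name per submission.
        "metadata": {"generateName": "pi-"},
        "spec": {
            "template": {
                "metadata": {"name": "pi"},
                "spec": {
                    "containers": [{
                        "name": "pi",
                        "image": "perl",
                        "command": ["perl", "-Mbignum=bpi", "-wle",
                                    "print bpi(%d)" % digits],
                    }],
                    "restartPolicy": "Never",
                },
            },
            "backoffLimit": backoff_limit,
        },
    }


if __name__ == "__main__":
    print(json.dumps(build_pi_job(), indent=2))
```

Redirect the output to a file such as pi.json and pass that file to kubectl create -f in place of pi.yaml.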

Now you can submit the job, and follow it across your federated set of Kubernetes clusters. First, at the federated control plane level, submit and see which work cluster it lands on:

kubectl --validate=false --context=fed create -f ./pi.yaml 
kubectl --context=fed get jobs
kubectl --context=fed describe job/<JOB IDENTIFIER>

Then (assuming it’s the AWS cluster — if not, switch the context below), dive in deeper to see how the job finished:

kubectl --context=aws get jobs
kubectl --context=aws describe job/<JOB IDENTIFIER>
kubectl --context=aws get pods
kubectl --context=aws describe pod/<POD IDENTIFIER>
kubectl --context=aws logs <POD IDENTIFIER>
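The same check can be scripted if you want to find out which work cluster finished the job without eyeballing each one. This sketch asks kubectl for the jobs in a given context as JSON and compares each job's succeeded count against the completions it needs (treated as 1 when unset). The context names passed in at the bottom are the aliases used in this tutorial; the helper names are mine.

```python
import json
import subprocess


def job_succeeded(job: dict) -> bool:
    """True once the Job reports at least as many successes as it needs."""
    wanted = job.get("spec", {}).get("completions") or 1
    done = job.get("status", {}).get("succeeded") or 0
    return done >= wanted


def finished_jobs(context: str) -> list:
    """Names of jobs in the given kubectl context that have succeeded."""
    out = subprocess.run(
        ["kubectl", "--context=" + context, "get", "jobs", "-o", "json"],
        stdout=subprocess.PIPE,
        check=True,
        universal_newlines=True,
    ).stdout
    items = json.loads(out).get("items", [])
    return [j["metadata"]["name"] for j in items if job_succeeded(j)]


if __name__ == "__main__":
    for ctx in ("aws", "gcp"):
        print(ctx, finished_jobs(ctx))
```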

If all went well, you should see the output from the job. Congratulations!

Cleaning up

Once you’re done trying out this demonstration cluster, you can clean up all of the resources you created by running:

kops delete cluster --yes --state="gs://$GS_BUCKET_NAME/"
kops delete cluster --yes --state="s3://$S3_BUCKET_NAME"
kops delete cluster --yes --state="s3://$S3_BUCKET_NAME"

Don’t forget to verify in the AWS and GCE console that everything was removed, to avoid any unexpected expenses.


Kubernetes provides a tremendous amount of infrastructure flexibility to everyone involved in developing and operating software, and there are many different applications for federated Kubernetes clusters.

Good luck to you in whatever your Kubernetes design patterns may be, and happy SysAdvent!