A comment on Second Systems

I recently left this comment on a Pragmatic Engineer review of Fred Brooks’s The Mythical Man-Month in “What Changed in 50 Years of Computing: Part 2”. This was what I reacted to:

Software design and “the second-system effect”

Brooks covers an interesting phenomenon in Chapter 5: “The Second-System Effect.” He states that architects tend to design their first system well, but they over-engineer the second one, and carry this over-engineering habit on to future systems.

“This second system is the most dangerous system a [person] ever designs. When [they] do this and [their] third and later ones, [their] prior experiences will confirm each other as to the general characteristics of such systems, and their differences will identify those parts of [their] experience that are particular and not generalizable.”

“The general tendency is to over-design the second system, using all the ideas and frills that were sidetracked on the first one.”

I can see this observation making sense at a time when:

  • Designing a system took nearly a year
  • System designs were not circulated, pre-internet
  • Architects were separated from “implementers”

Today, all these assumptions are wrong:

  • Designing systems takes weeks, not years
  • System designs are commonly written down and critiqued by others. We cover more in the article, Engineering Planning with RFCs, Design Documents and ADRs
  • “Hands-off architects” are increasingly rare at startups, scaleups, and Big Tech

As a result, engineers design more than one or two systems in a year and get more feedback, so this “second-system effect” is likely nonexistent.

And this was my comment/reaction:

I think the Second-System Effect is still very present.

I would say it most frequently manifests as a result of not recognizing Gall’s Law: “all complex systems that work evolved from simpler systems that worked.”

What usually trips people up is that they start from a place of “X feature is hard to achieve in the current system” and then design/architect for that feature without recognizing all of the other table-stakes necessities and Chesterton’s Fences of the current system. Those only get recognized and bolted on late in the implementation, when it is more difficult and complicated.

The phrase “10 years of experience, or 1 year of experience 10 times” comes to mind when thinking of people who only have the experience of implementing a new system once and trivially, and do not have the experience of growing and supporting and maintaining a system they designed over a meaningful lifecycle.

Which also reminds me of a recent callback to a Will Larson review of Kent Beck’s Tidy First about growing software:

I really enjoyed this book, and reading it I flipped between three predominant thoughts:

  • I’ve never seen a team that works this way–do any teams work this way?
  • Most of the ways I’ve seen teams work fall into the “never tidy” camp, which is sort of an implicit belief that software can’t get much better except by replacing it entirely. Which is a bit depressing, if you really think about it
  • Wouldn’t it be inspiring to live in a world where your team believes that software can actually improve without replacing it entirely?

A Ruby Meetup and 3 Podcasts

Me standing on a small stage in front of a slide with 2 adoptable cats and the GitHub logo

Last week I spoke at the SF Bay Area Ruby Meetup, which was hosted at GitHub HQ, which made for an easy commute for me. Here’s the video and the slides. My talk was entitled “An OK compromise: Faster development by designing for the Rails autoloader”

Also, I haven’t shared here the 3 podcasts I did over the past few years. Here they are:


Rails Active Record: Will it bind?

Active Record, Ruby on Rails’s ORM, has support for Prepared Statements that work if you structure your query for it. Because of my work on GoodJob, which can make a lot of nearly identical database queries every second to pop its job queue, I’ve invested a lot of time trying to make those queries as efficient as possible.

Prepared Statements are a database feature that allows the database to reuse query parsing and planning when queries are structurally the same. Prepared statements, at least in Postgres, are linked to the database connection/session and stored in memory in the database. This implies some things:

  • There can be a performance benefit to making queries “preparable” for Prepared Statements which Active Record will decide for you based on how a query is structured.
  • There can be a performance harm (or at least non-useful database processing and memory growth) if your application produces a lot of preparable queries that are never reused again.

By default, Rails will have the database store 1,000 queries (statement_limit: 1000). Many huge Rails monoliths (like GitHub, where I work) disable prepared statements (prepared_statements: false) because there is too much query heterogeneity to get a performance benefit and the database spends extra and unnecessary cycles storing and evicting unused prepared statements. But that’s maybe not your application!

Structurally similar queries can still have variable values inside of them: these are called bind parameters. For example, in GoodJob, I want to pop jobs that are scheduled to run right now. In SQL that might look like:

SELECT * FROM good_jobs WHERE scheduled_at < '2024-03-31 16:44:11.047499'

That query has the downside of the timestamp changing multiple times a second as new queries are emitted. What I want to do is extract the timestamp into a bind parameter (a ? or $1 depending on the database adapter) instead of embedding it in the query:

SELECT * FROM good_jobs WHERE scheduled_at < ?

That’s the good stuff! That’s ideal and preparable 👍
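To make the reuse argument concrete, here’s a toy, framework-free Ruby model of a per-connection statement cache. This is not how Active Record or Postgres implement it; ToyStatementCache and its methods are names I made up for illustration. It only shows that SQL text with an inlined timestamp never repeats, while the parameterized text gets prepared once and reused:

```ruby
# Toy model of a per-connection prepared statement cache.
# Hypothetical class for illustration; not Active Record's implementation.
class ToyStatementCache
  attr_reader :prepare_count

  def initialize(limit)
    @limit = limit
    @statements = {} # Hash preserves insertion order; the oldest entry comes first
    @prepare_count = 0
  end

  # Returns the cached statement name for this SQL text, preparing it if needed.
  def execute(sql, _binds = [])
    if @statements.key?(sql)
      # Reuse: re-insert so this entry becomes the most recently used
      @statements[sql] = @statements.delete(sql)
    else
      @prepare_count += 1
      @statements[sql] = "toy_stmt_#{@prepare_count}"
      # Evict the least recently used statement once over the limit
      @statements.delete(@statements.keys.first) if @statements.size > @limit
    end
    @statements[sql]
  end

  def size
    @statements.size
  end
end

inlined = ToyStatementCache.new(1000)
bound = ToyStatementCache.new(1000)

3.times do |i|
  timestamp = "2024-03-31 16:44:11.04749#{i}"
  inlined.execute("SELECT * FROM good_jobs WHERE scheduled_at < '#{timestamp}'")
  bound.execute("SELECT * FROM good_jobs WHERE scheduled_at < $1", [timestamp])
end

puts inlined.prepare_count # 3: every distinct SQL string is prepared separately
puts bound.prepare_count   # 1: same SQL text, only the bind value changes
```

With a small limit, the same toy also shows the eviction churn that a very heterogeneous application would cause.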

But that’s raw SQL. How do you do that in Active Record? In the following exploration, I’m using the private to_sql_and_binds method in Active Record; I’m also using the private Arel API. This is all private and subject to change, so be sure to write some tests around this behavior if you do choose to use it. Here are some quick experiments I’ve done:

job = Job.create(scheduled_at: 10.minutes.ago)

# Experiment 1: string with ?
relation = Job.where("scheduled_at < ?", Time.current)
# =>  Job Load (0.1ms)  SELECT "good_jobs".* FROM "good_jobs" WHERE (scheduled_at < '2024-03-31 16:34:11.064614')
expect(relation.to_a).to eq([job])
_query, binds, prepared = Job.connection.send(:to_sql_and_binds, relation.arel)
expect(binds.size).to eq 0 # <-- Not a bind parameter
expect(prepared).to eq false # <-- Not preparable

# Experiment 2: Arel query with value
relation = Job.where(Job.arel_table['scheduled_at'].lt(Time.current))
# =>  Job Load (0.1ms)  SELECT "good_jobs".* FROM "good_jobs" WHERE scheduled_at < '2024-03-31 16:34:11.064614'
expect(relation.to_a).to eq([job])
_query, binds, prepared = Job.connection.send(:to_sql_and_binds, relation.arel)
expect(binds.size).to eq 0 # <-- Not a bind parameter
expect(prepared).to eq true # <-- Yikes 🥵

# Experiment 3: Arel query with QueryAttribute
relation = Job.where(Job.arel_table['scheduled_at'].lt(ActiveRecord::Relation::QueryAttribute.new('scheduled_at', Time.current, ActiveRecord::Type::DateTime.new)))
# =>  Job Load (0.1ms)  SELECT "good_jobs".* FROM "good_jobs" WHERE scheduled_at < $1  [["scheduled_at", "2024-03-31 16:34:11.064614"]]
expect(relation.to_a).to eq([job])
_query, binds, prepared = Job.connection.send(:to_sql_and_binds, relation.arel)
expect(binds.size).to eq 1 # <-- Looking good! 🙌
expect(prepared).to eq true # <-- Yes! 👏

That very last option is the good one because it has the bind parameter ($1) and then the Active Record logger will show the values in the nested array next to the query. The successful combination uses:

  • Arel comparable syntax
  • Wrapping the value in an ActiveRecord::Relation::QueryAttribute

Note that many Active Record queries will automatically do this for you, but not all of them. In this particular case, the wrapping is necessary because I use the “less than” operator, whereas an equality condition is made preparable automatically. You’ll have to inspect each query yourself. For example, it’s also necessary with Arel#matches/ILIKE. It’s also possible to temporarily disable prepared statements within a block using the undocumented (!) Model.connection.unprepared_statement { your_query }.

The above code is true as of Rails 7.1. Jean Boussier has improved Active Record in newer (currently unreleased) Rails to properly bind the Job.where("scheduled_at < ?", Time.current) query syntax too 🙇

Update: I realized I didn’t try beginless/endless range values. Good news: they create bind parameters 🎉

# Experiment 4: beginless range
relation = Job.where(scheduled_at: ...Time.current)
# =>  Job Load (0.1ms)  SELECT "good_jobs".* FROM "good_jobs" WHERE scheduled_at < $1  [["scheduled_at", "2024-03-31 16:34:11.064614"]]
expect(relation.to_a).to eq([job])
_query, binds, prepared = Job.connection.send(:to_sql_and_binds, relation.arel)
expect(binds.size).to eq 1 # <-- Looking good! 🙌
expect(prepared).to eq true # <-- Yes! 👏

Low-effort prototyping

From Bunnie Huang’s blog about exploring and designing an infrared chip imaging rig. I thought this was an interesting distinction between “low-effort” and “rapid” prototypes. I think the analogy in software would be a “Walking Skeleton” that is production-like in architecture and deployment but does very little, versus building a demo using lightweight scripting and static site generators. (bolded text mine)

Sidebar: Iterate Through Low-Effort Prototypes (and not Rapid Prototypes)

With a rough idea of the problem I’m trying to solve, the next step is build some low-effort prototypes and learn why my ideas are flawed.

I purposely call this “low-effort” instead of “rapid” prototypes. “Rapid prototyping” sets the expectation that we should invest in tooling so that we can think of an idea in the morning and have it on the lab bench by the afternoon, under the theory that faster iterations means faster progress.

The problem with rapid prototyping is that it differs significantly from production processes. When you iterate using a tool that doesn’t mimic your production process, what you get is a solution that works in the lab, but is not suitable for production. This conclusion shouldn’t be too surprising – evolutionary processes respond to all selective pressures in the environment, not just the abstract goals of a project. For example, parts optimized for 3D printing consider factors like scaffolding, but have no concern for undercuts and cavities that are impossible to produce with CNC processes. Meanwhile CNC parts will gravitate toward base dimensions that match bar stock, while minimizing the number of reference changes necessary during processing.

So, I try to prototype using production processes – but with low-effort. “Low-effort” means reducing the designer’s total cognitive load, even if it comes at the cost of a longer processing time. Low effort prototyping may require more patience, but also requires less attention. It turns out that prototyping-in-production is feasible, and is actually the standard practice in vibrant hardware ecosystems like Shenzhen. The main trade-off is that instead of having an idea that morning and a prototype on your desk by the afternoon, it might take a few days. And yes – of course there are ways to shave those few days down (already anticipating the comments informing me of this cool trick to speed things up) – but the whole point is to not be distracted by the obsession of shortening cycle times, and spend more attention on the design. Increasing the time between generations by an order of magnitude might seem fatally slow for a convergent process, but the direction of convergence matters as much as the speed of convergence.

More importantly, if I were driving a PCB printer, CNC, or pick-and-place machine by myself, I’d be spending all morning getting that prototype on my desk. By ordering my prototypes from third party service providers, I can spend my time on something else. It also forces me to generate better documentation at each iteration, making it easier to retrace my footsteps when I mess up. Generally, I shoot for an iteration to take 2-4 weeks – an eternity, I suppose, by Silicon Valley metrics – but the two-week mark is nice because I can achieve it with almost no cognitive burden, and no expedite fees.

I then spend at least several days to weeks characterizing the results of each iteration. It usually takes about 3-4 iterations for me to converge on a workable solution – about a few months in total. I know, people are often shocked when I admit to them that I think it will take me some years to finish this project.

A manager charged with optimizing innovation would point out that if I could cut the weeks out where I’m waiting to get the prototype back, I could improve the time constant on an exponential and therefore I’d be so much more productive: the compounding gains are so compelling that we should drop everything and invest heavily in rapid prototyping.

However, this calculus misses the point that I should be spending a good chunk of time evaluating and improving each iteration. If I’m able to think of the next improvement within a few minutes of receiving the prototype, then I wasn’t imaginative enough in designing that iteration.

That’s the other failure of rapid prototyping: when there’s near zero cost to iterate, it doesn’t pay to put anything more than near zero effort into coming up with the next iteration. Rapid-prototyping iterations are faster, but in much smaller steps. In contrast, with low-effort prototyping, I feel less pressure to rush. My deliberative process is no longer the limiting factor for progress; I can ponder without stress, and take the time to document. This means I can make more progress every step, and so I need to take fewer steps.


Farewell Brompt

I’m planning to shut down Brompt, which I previously wrote about in 2008, 2011, and 2022. I archived the code on GitHub.

Let’s do a final confetti drop 🎉 together.

Farewell 👋 Brompt is shutting down

I have some sad news to share: I’m planning to shut down this service, Brompt, at the end of the month (February, 2024).

Shutting down Brompt means that you’ll no longer receive these automated reminders for your blog or writing. I’ve been running Brompt since 2008 and unfortunately, I haven’t been able to make it sustainable. You’re one of about 80 people who are still using the service, though I’m never sure if you or anyone ever opens the emails regularly.

Regardless of Brompt shutting down, I hope you’re doing well and I’d love to stay in touch. Send me an email at [email protected] or read my own blog at https://island94.org

All the best,

Ben (the person who made Brompt)

Screenshots from Brompt


Replacing Devise with Rails has_secure_password and friends

I love the Devise user-authentication gem. I’ve used it for years, and I recently moved off of it in one of my personal apps and replaced it with Rails’s built-in has_secure_password and generates_token_for and a whole bunch of custom controllers and helpers and code that I now maintain myself. I do not recommend this! User authentication is hard! Security is hard!

And… maybe you need to walk the same path too. So I want to share what I learned through the process.

Ok, so to back up, why did I do this?

  • Greater compatibility with Rails main. My day job runs Rails main, and I’m more frequently contributing to Rails development; I’d like to run my personal projects on Rails main too. When I looked back on upgrade-blocking gems, Devise (and its dependencies, like Responders) topped my list.
  • More creative onboarding flows. I’ve twisted Devise quite a bit (it’s great!) to handle the different ways I want users to be able to register (elaborate onboarding flows, email-only subscriptions, optional passwords, magic logins). I’ve already customized or overloaded nearly every Devise controller and many model methods, so it didn’t seem like such a big change anyway.
  • Hubris. I’ve built enterprise auth systems from scratch, managed the Bug Bounty program, and worked with security researchers. I have seen and caused and fixed some shit. (Fun fact: I have been paid for reporting auth vulnerabilities on the bug bounty platforms themselves.) I know that even if it’s not a bad idea for me, it’s not a great idea either. Go read all of the Devise-related CVEs; seriously, it’s a responsibility.

That last bit is why this blog post will not be like, “Here’s everything you need to know and do to ditch Devise.” Don’t do it! Instead, here’s some stuff I learned that I want to remember for the next app I work on.

A test regime

I went back through all of my system tests for auth, and here is a cleaned-up, though not exhaustive, list of my scenarios and assertions. It seems like a lot. It is! There are also unit tests for models and controllers and mailers and separate API tests for the iOS and Android apps. Don’t take this lightly! (Remember, many of these are specific to my custom onboarding flows.)

  • When a new user signs up for an account
    • Their email is valid, present and stored; password is nil.
    • They are not confirmed
    • They receive a confirmation email
    • If not confirmed, registering again with the same email resends the confirmation email but does not leak account presence
    • If the email associated account already exists and is confirmed, sends a “you already have an account” email and does not leak account presence.
    • Following the link in the confirmation email confirms the new account and redirects to the account setup page.
  • When a user sets up their account
    • They can assign a username and password
    • A password cannot be assigned if a password already exists
    • A username cannot be assigned if a username already exists
    • If a username and password already exist, the setup page redirects to the account update page
    • The account update page redirects to the setup page if a username or password does not yet exist
    • Signing in with an unsetup account redirects to setup page
    • Resetting password with an unsetup account redirects to setup page
    • Adding a password invalidates reset-password links.
  • When a user updates their account
    • The current password is required to update email, username, or password.
    • When the email address is changed, a new confirmation email is sent out to that email address.
    • An email change confirmation can be confirmed with or without an active session.
    • If the email address is already confirmed by a different account, send the “you already have an account” email and do not leak account presence.
    • Multiple accounts can have the same unconfirmed email address.
  • When a user performs a password reset
    • Can’t be accessed with an active session
    • Link is invalidated after 20 minutes, or when email, or password changes.
    • Can be performed on an unsetup account
    • Confirms an email but not an email change
    • Signs in the user
    • Does not leak account presence
    • Is throttled to only send once a minute.
  • When a user performs or resends an email confirmation
    • Can be accessed with an active session.
    • Cannot resend confirmation of an email change without an active session.
    • Link is invalidated after 20 minutes, or when email, unconfirmed email, confirmed at, or password changes.
    • Signs in the user
    • Does not leak account presence
    • Is throttled to only send once a minute.
    • When user is already confirmed, send them an email with a link to reset their password
  • When a user signs into a session
    • Requires a valid email or username, and password
    • Cannot sign in with a nil, blank, or absent password param (unsetup account)
    • Session is invalidated when email or password changes.
    • Does not leak account presence with missing or invalid credentials
    • Redirects to the session[:return_to] path if present, otherwise the root path.

Using has_secure_password

This was a fairly simple change. I had to explicitly add bcrypt to the Gemfile, and then add to my User model:

# models/user.rb
alias_attribute :password_digest, :encrypted_password
has_secure_password :password

I’ll eventually rename the database column, but this was a zero-migration change.

Also, you might need to use validations: false on has_secure_password and implement your own validations if you have custom onboarding flows like me. Read the docs and the Rails code.

When authenticating on sign in, you’ll want to use User.authenticate_by(email:, password:), which is intended to avoid timing attacks.
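The timing-attack point is that an unknown email should cost roughly as much as a known one, so authenticate_by digests the submitted password whether or not a record is found. Here’s a minimal sketch of that idea using only Ruby’s standard library. Everything in it is an assumption for illustration: PBKDF2 stands in for bcrypt, and TOY_USERS, toy_digest, and toy_authenticate_by are made-up names, not Rails APIs.

```ruby
require "openssl"

# Stand-in for bcrypt so the sketch runs on the standard library alone (assumption).
def toy_digest(password, salt)
  OpenSSL::KDF.pbkdf2_hmac(password, salt: salt, iterations: 1_000, length: 32, hash: "sha256")
end

ToyUser = Struct.new(:email, :salt, :digest)

# Hypothetical in-memory "users table" keyed by email.
TOY_USERS = {
  "user@example.com" => ToyUser.new(
    "user@example.com",
    "per-user-salt",
    toy_digest("correct horse", "per-user-salt")
  )
}.freeze

def toy_authenticate_by(email:, password:)
  user = TOY_USERS[email]
  if user
    candidate = toy_digest(password, user.salt)
    # Constant-time comparison of two equal-length digests
    OpenSSL.fixed_length_secure_compare(candidate, user.digest) ? user : nil
  else
    # Unknown email: still pay for one digest so the response time does not
    # reveal whether the account exists, then fail.
    toy_digest(password, "decoy-salt")
    nil
  end
end

p toy_authenticate_by(email: "user@example.com", password: "correct horse")&.email
p toy_authenticate_by(email: "nobody@example.com", password: "correct horse")
```

The naive version (find the user, return early if missing) is what leaks account presence through response time.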

Using generates_token_for

The generates_token_for methods are new in Rails 7.1 and really nice. They create a signed token containing the user id and additional matching attributes and it doesn’t need to be stored in the database:

# models/user.rb
generates_token_for :email_confirmation, expires_in: 30.minutes do
  [confirmed_at, email, unconfirmed_email, password_salt]
end

generates_token_for :password_reset, expires_in: 30.minutes do
  [email, password_salt]
end

I’ll explain that password_salt in a bit.

To verify this, you want to use something like this: User.find_by_token_for(:email_confirmation, value_from_the_link).
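If it helps to see why nothing needs to be stored in the database, here’s a toy model of a signed, self-expiring token that stops validating when the tracked attributes change. It is a simplified sketch, not Rails’s actual token format: the hex encoding, the fingerprint digest, TOY_SECRET, and every toy_ method name are my own assumptions.

```ruby
require "openssl"
require "digest"
require "json"

TOY_SECRET = "not-a-real-secret" # hypothetical signing key

# Digest of the attributes that should invalidate the token when they change.
def toy_fingerprint(user)
  Digest::SHA256.hexdigest(JSON.generate([user[:email], user[:password_salt]]))
end

def toy_generate_token(user, expires_at:)
  payload = JSON.generate("id" => user[:id], "exp" => expires_at.to_i, "fp" => toy_fingerprint(user))
  data = payload.unpack1("H*") # hex keeps the token URL-safe
  signature = OpenSSL::HMAC.hexdigest("SHA256", TOY_SECRET, data)
  "#{data}--#{signature}"
end

def toy_find_by_token(users, token, now: Time.now)
  data, signature = token.to_s.split("--", 2)
  return nil unless data && signature

  # Reject anything that was not signed with our secret
  expected = OpenSSL::HMAC.hexdigest("SHA256", TOY_SECRET, data)
  return nil unless OpenSSL.secure_compare(expected, signature)

  payload = JSON.parse([data].pack("H*").force_encoding("UTF-8"))
  return nil if now.to_i >= payload["exp"] # expired

  # Re-compute the fingerprint from the current record; a changed email or
  # password salt no longer matches what was embedded in the token.
  user = users[payload["id"]]
  user if user && toy_fingerprint(user) == payload["fp"]
end

users = { 1 => { id: 1, email: "user@example.com", password_salt: "salt-one" } }
token = toy_generate_token(users[1], expires_at: Time.now + 30 * 60)

p toy_find_by_token(users, token) == users[1] # true
users[1][:password_salt] = "salt-two"         # simulate a password change
p toy_find_by_token(users, token)             # nil
```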

btw security: when you put a link in an email message, you can only use a GET, because emails can’t reliably submit web forms (some clients can, but it’s weird and unreliable). So your link is going to look like https://example.com/account/reset_password?token=blahblahblahblahblah. If there are any links to 3rd party resources like script tags or off-domain images, you will leak the token through the referrer when the page is loaded with the ?token= in the URL. Devise never fixed it (😱). What you should do is take the value out of the query param, put it in the session, redirect back to the same page without the query parameter, and use the session value instead. (Fun fact: this is a bug bounty that got me paid.)
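Here’s a framework-free sketch of that token-to-session shuffle. The params and session arguments are plain hashes standing in for a controller’s params and session, and RESET_PATH, the session key, and the method name are hypothetical; a real controller action would do the same thing with redirect_to and render.

```ruby
RESET_PATH = "/account/reset_password" # hypothetical route

def toy_reset_password_page(params, session)
  if params.key?("token")
    # First hit, straight from the email link: stash the token in the session
    # and bounce to the same path with a clean URL.
    session["reset_password_token"] = params["token"]
    { status: 302, location: RESET_PATH }
  elsif session.key?("reset_password_token")
    # Second hit: the URL no longer carries the token, so nothing leaks
    # through the Referer header when third-party assets load.
    { status: 200, token: session["reset_password_token"] }
  else
    # No token anywhere: nothing to do on this page
    { status: 302, location: "/" }
  end
end

session = {}
p toy_reset_password_page({ "token" => "blahblahblah" }, session)
# => {status: 302, location: "/account/reset_password"} (hash formatting varies by Ruby version)
p toy_reset_password_page({}, session)[:token] # "blahblahblah"
```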

Authenticatable salts

Here’s where I explain that password_salt value.

There’s several places I’ve mentioned where tokens and sessions should be invalidated when the account password changes. When bcrypt stores the password digest in the database, it also generates and includes a random “salt” value that changes every time the password changes. Comparing that salt is a proxy for “did the password change?” and it’s safer to embed that random salt in cookies and tokens instead of the user’s hashed password.

Devise uses the first 29 characters of the encrypted password (which is technically the algorithm, cost and salt):

# models/user.rb
def authenticatable_salt
  encrypted_password[0, 29] if encrypted_password
end

But it’s also possible to simply get the salt. I dunno if the difference matters (tell me!):

# models/user.rb
def password_salt
  BCrypt::Password.new(password_digest).salt[-10..] if password_digest
end
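To see what those two slices actually pick out, here’s a small worked example on a string in bcrypt’s layout: a $version$cost$ prefix followed by 22 characters of salt and 31 characters of checksum. TOY_DIGEST is just an illustrative 60-character value and bcrypt_parts is a made-up helper; neither comes from the bcrypt gem.

```ruby
# An illustrative 60-character string in bcrypt's "$version$cost$salt+checksum" layout.
TOY_DIGEST = "$2a$12$R9h/cIPz0gi.URNNX3kh2OPST9/PgBkqquzi.Ss7KIUgO2t0jWMUW"

# Hypothetical helper that splits the string into its documented parts.
def bcrypt_parts(digest)
  _, version, cost, rest = digest.split("$")
  { version: version, cost: cost.to_i, salt: rest[0, 22], checksum: rest[22..] }
end

parts = bcrypt_parts(TOY_DIGEST)
puts parts[:version]       # 2a
puts parts[:cost]          # 12
puts parts[:salt]          # R9h/cIPz0gi.URNNX3kh2O

# Devise-style: the first 29 characters, i.e. "$2a$12$" (7 chars) + 22 chars of salt
devise_style = TOY_DIGEST[0, 29]
puts devise_style          # $2a$12$R9h/cIPz0gi.URNNX3kh2O

# The shorter variant: only the last 10 characters of that same prefix
puts devise_style[-10..]   # URNNX3kh2O
```

Either way, none of the checksum (the part derived from the password) ends up in the cookie or token.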

A nice session

There’s a lot to write about creating sessions and remember-me cookies that I won’t be writing here. The main thing to note is that I’m storing and verifying both the user id and their password salt in the session; that means all of their sessions are invalidated when they change their password:

# app/controllers/application_controller.rb
UNASSIGNED = Module.new
USER_SESSION_KEY = "_yoursite_user".freeze

def initialize
  super
  @_current_user = UNASSIGNED
end

def sign_in(user)
  session[USER_SESSION_KEY] = [user.id, user.password_salt]
end

def current_user
  return @_current_user unless @_current_user == UNASSIGNED

  # Check if the user was already loaded by route helpers
  @_current_user = if request.env.key?("current_user")
                     request.env["current_user"]
                   else
                     user_id, password_salt = session[USER_SESSION_KEY]
                     User.find_by_id_and_password_salt(user_id, password_salt) if user_id && password_salt
                   end
end

In doing this project I learned that Rails’s cookies will magically serialize/deserialize arrays and hashes. I’ve been manually and laboriously converting them into JSON strings for years 🥵

btw, if that UNASSIGNED stuff is new to you, go read my Writing Object Shape friendly code in Ruby.

Rotating active sessions

This is a little extra but I wanted the switchover to be transparent to users. To do so, I read from Devise’s active sessions and then create a session cookie using the new format. It looks something like this:

# controllers/application_controller.rb
before_action :upgrade_devise_session

def upgrade_devise_session
  # Devise session structure: [[USER_ID],"AUTHENTICATABLE_SALT"]
  if session["warden.user.user.key"].present?
    user_id = session["warden.user.user.key"].dig(0, 0)
    user_salt = session["warden.user.user.key"].dig(1)
  elsif cookies.signed["remember_user_token"].present?
    user_id = cookies.signed["remember_user_token"].dig(0, 0)
    user_salt = cookies.signed["remember_user_token"].dig(1)
  end
  return unless user_id.present? && user_salt.present?

  # Depending on your deploy/rollout strategy,
  # you may need to retain and dual-write both
  # Devise and new user session values instead of this.
  session.delete("warden.user.user.key")
  cookies.delete("remember_user_token")

  user = User.find_by(id: user_id)
  sign_in(user) if user && user.devise_authenticatable_salt == user_salt
end

Route helpers

Devise mixes some nice helper methods into Rails’s routing DSL like authenticated; they’re even necessary if you need to authenticate Rails Engines that can’t easily access the app’s ApplicationController methods. Here’s how to recreate them using Route Constraints and monkeypatching ActionDispatch::Routing::Mapper (that’s how Devise does it)

# app/constraints/current_user_constraint.rb
class CurrentUserConstraint
  def self.matches?(request)
    new.matches?(request)
  end

  def initialize(&block)
    @block = block
  end

  def matches?(request)
    current_user = if request.env.key?("current_user")
                     request.env["current_user"]
                   else
                     # USER_SESSION_KEY is defined on ApplicationController (see above)
                     user_id, password_salt = request.session[ApplicationController::USER_SESSION_KEY]
                     request.env["current_user"] = User.find_by_id_and_password_salt(user_id, password_salt) if user_id && password_salt
                   end

    if @block
      @block.call(current_user, request)
    else
      current_user.present?
    end
  end
end


# config/routes.rb
module ActionDispatch
  module Routing
    class Mapper
      def authenticated(&)
        scope(constraints: CurrentUserConstraint, &)
      end

      def unauthenticated(&)
        scope(constraints: CurrentUserConstraint.new { |user| user.blank? }, &)
      end

      def admin_only(&)
        scope(constraints: CurrentUserConstraint.new { |user| user&.admin? }, &)
      end
    end
  end
end

Rails.application.routes.draw do
  # ...
  authenticated do
    resources :special_somethings
  end
end

Because routing happens before a controller is initialized, the current user is put into request.env so that the controller won’t have to query it a second time from the database. This could also be done in a custom Rack Middleware.
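Here’s a tiny standalone sketch of that hand-off, with a counter standing in for database queries. FakeRequest, ToyUserLookup, and toy_current_user are hypothetical stand-ins for the Rack request, the User model, and the lookup shared by the constraint and the controller. Unlike my real code above, it also caches a nil result, which is a simplification.

```ruby
# Minimal stand-in for a Rack request: a per-request env hash plus a session hash.
FakeRequest = Struct.new(:env, :session)

# Counts lookups so we can see how many "queries" happen per request.
class ToyUserLookup
  attr_reader :queries

  def initialize(users)
    @users = users
    @queries = 0
  end

  def find(user_id)
    @queries += 1
    @users[user_id]
  end
end

# Shared by the routing layer and the controller layer in this sketch.
def toy_current_user(request, lookup)
  # Whoever ran first already stored the answer on the request
  return request.env["current_user"] if request.env.key?("current_user")

  user_id = request.session["user_id"]
  request.env["current_user"] = user_id ? lookup.find(user_id) : nil
end

lookup = ToyUserLookup.new(42 => "ben")
request = FakeRequest.new({}, { "user_id" => 42 })

toy_current_user(request, lookup) # routing constraint runs first
toy_current_user(request, lookup) # controller asks again later in the same request
puts lookup.queries               # 1
```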

If you want to put stuff into not-the-session cookies, those cookies can be accessed via request.cookie_jar, e.g., request.cookie_jar.permanent.encrypted["_my_cookie"].

Closing thoughts

That was all the interesting bits for me. I also learned quite a bit poking around Dave Kimura’s ActionAuth (thank you!), and am thankful for the many years of service I’ve gotten from Devise.


Two stories about technical debt, I guess

One activity I don’t enjoy very much: griping about “technical debt”; that label just never seems descriptive enough. And the things people gripe about seem to mostly fall into:

  • Deferred maintenance: we haven’t updated to the latest version of X, or it’s written in A language/framework but now everyone is much more familiar with B, or we know M raises an exception at N when they O while P
  • Just not the quality we expect, because reasons. It’s ok, we’re all winging it.

…and those categories crowd out the real sweaty palms stuff, the “we did do a good job but we know more now” that I think is the real deal. I can talk about that.

I’ve never found the particular post/video/talk? again (I’ve looked!), but it described technical debt as like: the distance between the current understanding of the business domain and the technical implementation’s modeling of the business domain. It had a chart that has stuck in my mind; it looked something like this:

A chart that shows the coming together and divergence of business and technical domain knowledge

That definition of “technical debt” clicked for me. For myself and the high performing teams I’ve worked with, we’re most productive, pushing value, when we’re asking ourselves: does this make sense? Given what we know now, and what we know about where we’re going, do the objects and their interactions conceptually map to how the business is being talked about by the non-technical people? Are we adapting at the same rate we’re learning? If yes, we’re golden; if sorta with some tweaks that’s good; when no… that’s bad, disordered, schismatic: carrying a burden of always translating between the language and models in the technical system and the language and concepts in the business domain. That sucks!

Aside: There’s a funny/terrifying/random thing this makes me think of: “We’ll Never Make That Kind of Movie Again”: An oral history of The Emperor’s New Groove, a raucous Disney animated film that almost never happened. One of the screenwriters describes the process of making an animated film:

In a normal four-year process, you’ve got meetings, you’ve got development people going, “What if the girl was a boy? What if the bird was a flower?” And then you have to run all those ideas.

The software projects I’ve worked on are a conveyor belt of “What if the bird was a flower?” decision making and idea running. Extend that object, repurpose this method, bolt on that, change this, swap that, rename this, better leave a comment on that. When it’s going well, it doesn’t matter that it was a bird yesterday and a flower today… as long as it’s not retaining so much birdliness that it compromises its flowerability. It’s when you’re having to remember and call back “hey, well, um, it was once a bird, fyi, that’s why it’s so much trouble to do this flower stuff” that you’re sunk.

It’s a bad sign when forward development requires telling stories about the past.

Here’s two stories…

The isolated component

When I was working at a website hosting startup, we had one component that I could just never get the support to integrate into the larger application. It was the form that would create a new website project for the user. When the startup began, the original technical architecture was a bunch of interlinked but separated components: User Registration, User Account Management, Website Creation, Website Management, Organization Management, Agency/Reseller Management, etc. It made sense early as the business was figuring out what worked as a business, and then during my tenure we brought those components together into a unified application. Well, almost all of them.

There was a lot of give and take in the product integration; sometimes me and the other engineers would just do it and other times we’d defer it until there was a particular feature that necessitated it, and then we’d include that extra work in our estimates. It frequently took a couple passes of code cleanup and bringing it onto the design system, and that was ok. That’s the job!

That last, last, last component of Website Creation eluded us though, and it was outside our control. At that point, development was transitioning from “engineering led” to “product management and design led” and I had been instructed that engineering couldn’t make any product changes unless they were connected to an active PRD (Product Requirements Document) controlled by the PMs.

There was plenty of demand to make changes to Website Creation: smooth out the first-time account registration flow into creating a website; allow the user to do some activities in parallel while the website was spinning up like inviting team members or perusing billing levels; decouple the concept of “a website is a marketing strategy” from “upload your code and have some containers and a load balancer provisioned” so that non-developers could still plan a website without invoking all the technical bits.

But not enough appetite to get it done.

Of “technical debt”: no one outside our little engineering team maintaining the frontend thought there was anything special about Website Creation. It wasn’t obvious unless you carefully watched for the hard browser refresh, or noticed the navigation bar change slightly. Conceptually it was a unified product (heck, I even remember a product launch called “One”), but the work hadn’t yet been done on the application side and we engineers carried the burden.

It was funny because every time a product change that touched Website Creation was discussed, the same thing happened:

PM: Your effort estimate seems really high. What’s going on?
Me: Well, this involves Website Creation and it’s still its own component and can’t access any of those other systems that are necessary to build the feature. We’d need to bring it into the rest of the application. It’s definitely possible! There’s a few open questions with product and design implications, so we’d need to work together on that.
PM: Oh, well, huh, I didn’t expect that. Hmm, we don’t have the bandwidth for all that. Let’s pass on it for now.

This happened multiple times! It was weird too because the particular project being planned would be spiked, and then the engineering team would have to wait around while a new project was spun up and that likely took just as long as it would have taken to do the work on the Website Creation component. If I hadn’t been explicitly told to sit on my hands otherwise, I would have probably just done this as off-the-books, shucks-I-didn’t-think-it-would-be-a-big-deal, shadow-work.

It never got done during my tenure; I think they later decided the problem was that the whole thing wasn’t written in React 🤷

The campaign message

When I was a developer on GetCalFresh, the functionality with perpetually unexpected estimations was “Campaign Messages”.

GetCalFresh would help people apply for their initial food stamp application, at which point it would be sent to the applicant’s county for processing. Over the next 14 to 28 days the county would schedule an in-person or telephone interview, and request various documents like pay stubs and rental leases, and the applicant would have to navigate all of that. (The administrative state suuuuucks!) To help, GetCalFresh would send a sequence of email and/or SMS messages to the applicant over this time period explaining what was needed and what to expect next. A messaging campaign, y’know, of “campaign messages”.

When GetCalFresh was first laid down in code, there were two “types” of campaign messages: General and Expedited. Under a couple of different circumstances, if an applicant is homeless or at risk of being homeless, or has no cash-on-hand to purchase food, their application is eligible for expedited processing and thus we’d send them a message sequence over ~14 days; everyone else would receive a message sequence over ~28 days. We were sending the same messages, just on a different schedule.

So when we engineers were then asked to customize a message, like “if the applicant is a student, add this description about their special eligibility”… we just if/elsed it on in there. Oh, now this county is piloting a special process, let’s make a little carve out for them too and swap this message. Still, same sequence, just tweaks, right? Well, all those small tweaks and carve-outs build up, and all of a sudden we’re having to ask “ok, so you want us to rewrite this one itty bitty message, well we also need you to specify what it should be for students, who do and don’t qualify for County X’s special process too”. It got twistier and twistier. And when requests like “don’t send that message in this special circumstance” or “add a totally new message but just as this one-off” came in, we’d be like “totally possible! and that’s gonna take more work than you think!”

GetCalFresh had the best Product Managers I have ever worked with in my life, and we still got locked into a similar loop as the last story: we’d do our estimation with the PMs, it exposed the fruit hung more high than low, and the change would be deprioritized. I think the PMs got it, but the challenge was that the other folks, client support and the folks coordinating with the counties and the datascience team, would be like “we heard that Engineering doesn’t want to build it.” So weird! Not engineering’s call! (Aside: I spent so much time coaching non-technical stakeholders on how to work in a PM-led system, but always more coaching to do.)

I remember making a Google Doc to analyze why and explain how the system we initially designed for (same sequence of messages with different schedules) didn’t match our understanding of the problem today. The doc listed out all of the different reasons we knew of why we might customize the message. It was at least 10 bullet points. And there were a lot of other learnings too: initially we designed around customizing for just 3 major county systems (State Automated Welfare Systems - SAWS), but later found ourselves doing county-by-county customizations (California has 58 counties). I advocated for configuring each county in its own object despite the scale brainworms demanding a singular and infinitely abstracted model (I call these things “spreadsheet problems” when you can simply list the entire domain in a spreadsheet).
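To make the “each county in its own object” idea concrete, here is a minimal sketch. Every name in it (CountyCampaign, DEFAULT_MESSAGES, the county names, the message text) is invented for illustration and is not GetCalFresh’s actual code; it only shows the shape of listing the domain explicitly instead of threading if/else branches through one shared sequence.

```ruby
# Hypothetical sketch: all names and message text are invented for illustration.
DEFAULT_MESSAGES = {
  welcome: "We sent your application to your county.",
  interview: "Your county will contact you to schedule an interview."
}.freeze

# One plain object per county: what it overrides and what it skips.
CountyCampaign = Struct.new(:name, :overrides, :skipped, keyword_init: true) do
  # Returns the message text for a step, or nil if this county skips it.
  def message_for(step)
    return nil if skipped.include?(step)

    overrides.fetch(step) { DEFAULT_MESSAGES.fetch(step) }
  end
end

# The whole domain is listed out, spreadsheet-style, one row per county.
COUNTIES = {
  "County X" => CountyCampaign.new(
    name: "County X",
    overrides: { interview: "County X is piloting phone-only interviews." },
    skipped: []
  ),
  "County Y" => CountyCampaign.new(name: "County Y", overrides: {}, skipped: [:welcome])
}.freeze

puts COUNTIES.fetch("County X").message_for(:interview)
```

The trade-off is repetition in exchange for being able to read one county’s behavior in one place.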

Of “technical debt”, I still can replay in my brain the deliberate mental spatial shift of imagining the campaign model as a 2-column array (General and Expedited) with 10+ rows of logic shifts and then flopping it onto its side to make a change. All that mental context has a huge carrying cost that all of us had to budget for when making a change.

During my tenure, we never did the significant reworking to how campaign messages were implemented, though some bold colleagues did their best to make changes as safe and simple as possible with DSLs and test macros. Thank you! 🙏

That’s it

Sorry, no great lessons here. Just stories to share (“ideally you’d try to see it as a funny story you can tell friends rather than a hurtful snub that needs to bother you”). I mentioned coaching folks on working with PMs, and I think the frequent advice I gave non-technical folks probably holds true for engineers too when asking:

Always have your top 3 ranked work items ready to go when talking to decision makers (the PM?). Don’t bring new stuff unless it changes those top 3.

(I mean sure, share context and adapt, but allow yourself no doubt that you’ve clearly and consistently communicated what your top priorities are before they get dumped in with everyone else’s.)

(But also, if you’re an engineer and you can and no one is breathing down your neck, simply get it done and celebrate. The PM doesn’t have to lead everything. You can do it! 👍)


The answer is in your heap: debugging a big memory increase in Ruby on Rails

I recently participated in an interesting series of debugging sessions tracking down the source of a large increase in memory when upgrading a Rails application. We ultimately tracked down the cause using John Hawthorn’s Sheap heap analyzer and successfully submitted a patch to Rails. I thought it was interesting enough to write up because maybe the general approach to debugging memory issues would be helpful (and this is the kind of stuff that I very quickly forget unless I write it down).

How it started: Reddit

Lots of people ask for help on r/rails, and it can be difficult to debug at a distance. This time it was a little different. I recognized the username’s owner, Jonathan Rochkind, because he’s been a familiar and helpful face in GoodJob’s discussions and I’ve sponsored his aggregator Rubyland News. The observed problem was that after upgrading from Rails 7.0 to Rails 7.1, their application’s memory footprint increased by about 25%. Weird!

Working the problem

We worked through a bunch of questions:

  • Was the memory increase at startup or over time? Not at boot, but memory increased very quickly.
  • Did anything change with Puma configuration? Nope.
  • Get set up with derailed_benchmarks, and create a bin/profile Rails binstub to make it easy to boot into a production-like configuration for profiling. Here’s what my very polished one looks like:

    #!/usr/bin/env ruby
    
    # This file is a wrapper around the rails and derailed executables
    # to make it easier to boot in PRODUCTION mode.
    #
    # Usage: bin/profile [rails|derailed] [command]
    
    ENV["RAILS_ENV"] = ENV.fetch("RAILS_ENV", "production")
    ENV["RACK_ENV"] = "production"
    ENV["RAILS_LOG_TO_STDOUT"] = "true"
    ENV["RAILS_SERVE_STATIC_FILES"] = "true"
    ENV["FORCE_SSL"] = "false"
    ## ^^ Put ENV to boot in production mode here ^^
    
    executable = ARGV.shift
    if executable == "rails"
      load File.join(File.dirname(__FILE__), "rails")
    elsif executable == "derailed"
      require 'bundler/setup'
      load Gem.bin_path('derailed_benchmarks', 'derailed')
    else
      puts "ERROR: '#{executable}' is not a valid command."
      puts "Usage: bin/profile [rails|derailed]"
      exit 1
    end
    

We flailed around with Derailed Benchmarks, as well as John Hawthorn’s Vernier profiler’s memory mode (aside: John Hawthorn is doing amazing stuff with Ruby).

At this point, we had a general understanding of the application memory footprint, which involved a large number of model instances (Work), many of which contained a big blob of json. For some reason they were sticking around longer than a single web request, but we weren’t able to find any smoking guns of like, memoized class instance variables that were holding onto references. So we kept digging.

You can read along to all of this here: https://github.com/sciencehistory/scihist_digicoll/issues/2449

Analyzing memory with Sheap

I used Derailed Benchmarks’ perf:heap to generate heap dumps (also possible using rbtrace --heapdump), and then plugged those into Sheap. Sheap is a relatively new tool, and where it shines is being interactive. Instead of outputting a static report, Sheap allows for exploring a heap dump (or a diff, to identify retained objects) and asking questions of the dump. In our case: what objects are referencing this object and why is it being retained?

# $ irb
require './lib/sheap.rb'

diff = Sheap::Diff.new("/Users/bensheldon/Repositories/sciencehistory/scihist_digicoll/tmp/2023-12-07T13:24:15-08:00-heap-1.ndjson", "/Users/bensheldon/Repositories/sciencehistory/scihist_digicoll/tmp/2023-12-07T13:24:15-08:00-heap-2.ndjson")

# Find one of the Work records that's been retained
model = diff.after.class_named("Work").first.instances[200]
=> <OBJECT 0x117cf5c98 Work (4 refs)>

# Find the path to the (default) root
diff.after.find_path(model)
=>
[<ROOT vm (2984 refs)>,
 <IMEMO 0x126c9ab68 callcache (1 refs)>,
 <IMEMO 0x126c9acf8 ment (4 refs)>,
 <CLASS 0x12197c080 (anonymous) (15 refs)>,
 <OBJECT 0x122ddba08 (0x12197c080) (3 refs)>,
 <OBJECT 0x117cfc458 WorksController (13 refs)>,
 <OBJECT 0x117cf7318 WorkImageShowComponent (15 refs)>,
 <OBJECT 0x117cf5c98 Work (4 refs)>]

# What is that initial callcache being referenced by the ROOT?
diff.after.at("0x126c9ab68").data
=>
{"address"=>"0x126c9ab68",
 "type"=>"IMEMO",
 "shape_id"=>0,
 "slot_size"=>40,
 "imemo_type"=>"callcache",
 "references"=>["0x126c9acf8"],
 "file"=>"/Users/bensheldon/.rbenv/versions/3.2.2/lib/ruby/gems/3.2.0/gems/actionpack-7.1.2/lib/action_dispatch/routing/routes_proxy.rb",
 "line"=>48,
 "method"=>"public_send",
 "generation"=>288,
 "memsize"=>40,
 "flags"=>{"wb_protected"=>true, "old"=>true, "uncollectible"=>true, "marked"=>true}}

# And then a method entry
diff.after.at("0x126c9acf8").data
=>
{"address"=>"0x126c9acf8",
 "type"=>"IMEMO",
 "shape_id"=>0,
 "slot_size"=>40,
 "imemo_type"=>"ment",
 "references"=>["0x12197c080", "0x12197c080", "0x126c9ade8", "0x126c9b4a0"],
 "file"=>
  "/Users/bensheldon/.rbenv/versions/3.2.2/lib/ruby/gems/3.2.0/gems/actionpack-7.1.2/lib/action_dispatch/routing/routes_proxy.rb",
 "line"=>33,
 "method"=>"method_missing",
 "generation"=>288,
 "memsize"=>48,
 "flags"=>{"wb_protected"=>true, "old"=>true, "uncollectible"=>true, "marked"=>true}}

# Aha, and a singleton for RoutesProxy!
diff.after.at("0x12197c080").data
=>
{"address"=>"0x12197c080",
 "type"=>"CLASS",
 "shape_id"=>14,
 "slot_size"=>160,
 "class"=>"0x12211c308",
 "variation_count"=>0,
 "superclass"=>"0x12211c3a8",
 "real_class_name"=>"ActionDispatch::Routing::RoutesProxy",
 "singleton"=>true,
 "references"=> [...],
 "file"=>
  "/Users/bensheldon/.rbenv/versions/3.2.2/lib/ruby/gems/3.2.0/gems/actionpack-7.1.2/lib/action_dispatch/routing/routes_proxy.rb",
 "line"=>33,
 "method"=>"method_missing",
 "generation"=>288,
 "memsize"=>656,
 "flags"=>{"wb_protected"=>true, "old"=>true, "uncollectible"=>true, "marked"=>true}}

# I expect the next object to be the RoutesProxy instance
diff.after.at("0x122ddba08").klass.data["real_class_name"]
=> "ActionDispatch::Routing::RoutesProxy"

Sheap is pretty great! In the above example, we were able to find a Work model instance in the heap, and then using find_path identify what was referencing it all the way back to the heap’s root, which is what causes the object to be “retained”; if there was no path to the root, the object would be garbage collected.
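The idea of “a path back to the root” can be sketched without Sheap at all. The following is a toy breadth-first search over a made-up heap dump in the newline-delimited JSON shape that Ruby’s ObjectSpace.dump_all emits. It is not Sheap’s implementation; the addresses and the find_path_to helper are assumptions for illustration.

```ruby
require "json"

# A tiny, invented heap dump: one JSON object per line, each listing the
# addresses it references. ROOT entries have no address of their own.
HEAP_NDJSON = <<~NDJSON
  {"type":"ROOT","root":"vm","references":["0x1"]}
  {"address":"0x1","type":"IMEMO","references":["0x2"]}
  {"address":"0x2","type":"OBJECT","references":["0x3","0x4"]}
  {"address":"0x3","type":"OBJECT","references":[]}
  {"address":"0x4","type":"OBJECT","references":["0x3"]}
  {"address":"0x5","type":"OBJECT","references":[]}
NDJSON

# Breadth-first search from every ROOT entry to the target address.
# Returns the shortest chain of addresses, or nil if nothing reaches it.
def find_path_to(target, ndjson)
  references = {}
  roots = []
  ndjson.each_line do |line|
    obj = JSON.parse(line)
    key = obj["address"] || "ROOT:#{obj["root"]}"
    references[key] = obj.fetch("references", [])
    roots << key if obj["type"] == "ROOT"
  end

  seen = roots.to_h { |root| [root, true] }
  queue = roots.map { |root| [root] }
  until queue.empty?
    path = queue.shift
    return path if path.last == target

    references.fetch(path.last, []).each do |address|
      next if seen[address]

      seen[address] = true
      queue << (path + [address])
    end
  end
  nil
end

p find_path_to("0x3", HEAP_NDJSON) # => ["ROOT:vm", "0x1", "0x2", "0x3"]
p find_path_to("0x5", HEAP_NDJSON) # => nil
```

An object with a path (0x3) is retained; an object with none (0x5) is eligible for garbage collection.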

(I have the huge benefit of having John as a colleague at GitHub and he helped me out a lot with this. Thank you, John!)

What we’re looking at is something in Rails’ RoutesProxy holding onto a reference to that Work object, via a callcache, a method entry (ment), a singleton class, a RoutesProxy instance, and then a Controller. What the heck?!

The explanation

Using Rails’ git history, we were able to find that a change had been made to the RoutesProxy’s behavior of dynamically creating a new method: a class_eval had been changed to an instance_eval.

Calling instance_eval "def method...." is what introduced a new singleton class, because that new method is only defined on that one object instance. Singleton classes can be cached by the Ruby VM (they’ll be purged when the cache fills up), and that’s what, through that chain of objects, was causing the model instances to stick around longer than expected and bloat up the memory! It’s not that instance_evaling new methods is itself inherently problematic, but when those singleton methods are defined on an object that references an instance of an Action Controller, which has many instance variables that contained big Active Record objects…. that’s a problem.

(Big props, again, to John Hawthorn who connected these dots.)
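The difference between the two evals is easy to see in isolation. This is a minimal sketch with a stand-in Proxy class (not Rails’ RoutesProxy, and not the actual Rails diff): class_eval puts the method on the class, while instance_eval with a def puts it on that one object’s singleton class.

```ruby
# Stand-in class for illustration only; not ActionDispatch's RoutesProxy.
class Proxy; end

a = Proxy.new
b = Proxy.new

# class_eval defines the method on the class itself, so every instance
# gets it and no per-object singleton method is created.
Proxy.class_eval "def from_class; :class_level; end"

# instance_eval with a `def` defines the method on this one object's
# singleton class, so only `a` responds to it.
a.instance_eval "def from_instance; :instance_level; end"

p a.singleton_methods                       # => [:from_instance]
p b.singleton_methods                       # => []
p b.respond_to?(:from_class)                # => true
p b.respond_to?(:from_instance)             # => false
p a.singleton_class.instance_methods(false) # => [:from_instance]
```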

Having tracked down the problem, we submitted a patch to Rails to change the behavior and remove the instance_eval-defined methods. It’s been accepted and has been released as part of Rails 7.1.3; in the meantime, the project temporarily monkey-patched in that change.

I realize that’s all a big technical mouthful, but the takeaway should be: Sheap is a really great tool, and exploring your Ruby heap can be very satisfying.

Update: Jean Boussier pointed me to the fix in Ruby 3.3 for “Call Cache for singleton methods can lead to ‘memory leaks’” 🎉 and suggested looking at harb too as a Sheap-like.


Trigger GitHub Actions workflows with inputs from Apple Shortcuts

I’ve been using Apple Shortcuts to invoke GitHub Actions workflows to create webpage bookmarks. It’s been great! (disclosure: I do work at GitHub)

My use case: I’ve been wanting to quit Pinboard.in, so I needed an alternative way to create and host my web bookmarks, some of which date back to ~2005 del.icio.us vintage. It’s been easy enough for me to export all of my bookmarks (settings -> backup -> JSON) and convert them to YAML files to be served by Jekyll and GitHub Pages. But I also needed an easy way to create new bookmarks that would work on all my Apple devices. I ended up with:

  1. Bookmarks are organized as individual yaml files, in this blog’s repository.
  2. A Ruby script to take some simple inputs (url, title, notes), generate a new yaml file, and commit it to the repo using Octokit.
  3. A GitHub Actions workflow that accepts those same inputs and can be manually triggered, that runs the script. One thing to note is that I echo the inputs to $GITHUB_STEP_SUMMARY early in the workflow in case a later step errors, so I won’t lose the bookmark details and can go back later and manually fix it up.
  4. An Apple Shortcut that asks for those inputs (either implicitly via the Share Sheet or via text inputs) and then manually triggers the GitHub Actions workflow via the GitHub API.
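As a rough sketch of step 2, here is what generating the YAML file from those inputs might look like. The build_bookmark helper, the _bookmarks directory, and the field names are assumptions for illustration, not the actual script; the Octokit call is left as a comment because it needs the octokit gem and a real token.

```ruby
require "yaml"
require "time"

# Hypothetical helper: builds the repository path and YAML body for one
# bookmark. The directory name and field layout are invented.
def build_bookmark(url:, title:, notes: "", now: Time.now.utc)
  slug = title.downcase.gsub(/[^a-z0-9]+/, "-").gsub(/\A-|-\z/, "")
  path = "_bookmarks/#{now.strftime("%Y-%m-%d")}-#{slug}.yml"
  body = { "url" => url, "title" => title, "notes" => notes, "date" => now.iso8601 }.to_yaml
  [path, body]
end

path, body = build_bookmark(
  url: "https://example.com",
  title: "An Example!",
  notes: "demo",
  now: Time.utc(2024, 1, 2, 3, 4, 5)
)
puts path
puts body

# Committing the file could then go through Octokit, roughly:
#   client = Octokit::Client.new(access_token: ENV.fetch("GITHUB_TOKEN"))
#   client.create_contents("USER/REPOSITORY", path, "Add bookmark", body, branch: "main")
```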

The only difficult part for me was getting Apple Shortcuts to work nicely with the GitHub REST API. Here’s what worked for me:

Use Get Contents of URL Action:

  • URL: https://api.github.com/repos/USER/REPOSITORY/actions/workflows/WORKFLOW.yml/dispatches
  • Method: POST
  • Headers:
    • Accept: application/vnd.github.v3+json
    • Authorization: Bearer GITHUB_ACCESS_TOKEN
  • Request Body: JSON
    • ref: main (or whatever branch you’re using)
    • inputs (Dictionary):
      • INPUT: VALUE
      • … and your other GitHub Actions workflow inputs
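For comparison outside of Shortcuts, the same request can be sketched in Ruby with the standard library. The build_dispatch_request helper is a made-up name, the USER/REPOSITORY, WORKFLOW.yml and token values are the same placeholders as above, and the example only builds the request; sending it is left commented out.

```ruby
require "net/http"
require "json"
require "uri"

# Hypothetical helper that mirrors the Shortcut configuration above:
# a POST with Accept and Authorization headers and a JSON body
# containing `ref` and `inputs`.
def build_dispatch_request(repo:, workflow:, token:, ref: "main", inputs: {})
  uri = URI("https://api.github.com/repos/#{repo}/actions/workflows/#{workflow}/dispatches")
  request = Net::HTTP::Post.new(uri)
  request["Accept"] = "application/vnd.github.v3+json"
  request["Authorization"] = "Bearer #{token}"
  request["Content-Type"] = "application/json"
  request.body = JSON.generate(ref: ref, inputs: inputs)
  [uri, request]
end

uri, request = build_dispatch_request(
  repo: "USER/REPOSITORY",
  workflow: "WORKFLOW.yml",
  token: "GITHUB_ACCESS_TOKEN",
  inputs: { "url" => "https://example.com", "title" => "Example" }
)
puts request.method
puts uri
puts request.body

# To actually send it:
#   Net::HTTP.start(uri.host, uri.port, use_ssl: true) { |http| http.request(request) }
```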

Here’s what it looks like all together (btw, Dictionary-type inputs were broken in iOS 16 / Mac 13 😨) :

Screenshot of Apple Shortcut with the previous configuration


Recently, January 3, 2023

  • I’ve now watched the Taylor Swift Eras movie twice, once at home, and a second time over the holidays with niece (completely) and nephew (partly). My most burning question is whether Taylor menaces the same dancer every show’s “Tolerate it”, or if they share rhetorical pain. My Apple Music Replay also ranked highly with Taylor Swift, though also apparently Andrew McMahon; unexpected.
  • I started playing Talos Principle 1 after beating 2, though it’s a lot more intense with guns and exploding things and so many timing-based puzzles. I’ve almost beaten it… but also took a break to play Super Mario Wonder which is much more fun (especially, again, with niece and nephew).
  • I finished reading “Babel”, started then stopped reading “Wolf Hall”, and am picking my way through “Less”. There’s several nonfiction books in there that I’ve read about a chapter of, but nothing so memorable to note here. Prior to that I read “Translation State”, and realizing I didn’t remember half of what was happening, re-read the previous 4 Imperial Radch books, which only now makes me think I should re-read Translation State again for the clean sweep. And I guess I finished the latest Bruno book, “A Chateau Under Siege”, and it was fine (there’s only so many times one can describe Kir Royale, and I think we’re past it).
  • I believe the Crumpler Soupansalad Son-o is the perfect jacket-and-book-and-water-bottle-and-also-half-gallon-of-milk-and-greek-yogurt bag. It also hasn’t been made in like 15 years so I bought another one on eBay (I attempted to do this via Poshmark 6 months ago but it failed to deliver). Now I own two: my longtime bag in brown, the new one in black.
  • After 7 years of service, I’ve stepped down from my church’s Property Committee. The buildings are being sold. As Bonhoeffer wrote, albeit about Germany’s embrace of fascism and not a congregational vote on the role of real-estate within a small church’s modest investment portfolio: “If you board the wrong train, it is no use running along the corridor in the other direction.” I’m currently serving on zero committees anywhere; a special moment.
  • Reflecting on my past year at GitHub, I’ll just directly quote Rachel Andrew’s blog; search and replace as necessary:

    The layoffs at Google at the beginning of 2023, didn’t impact me or my writing team directly, however they cast a shadow over the year. I try to look at difficult situations through a lens of what I can actually do to change things or improve the situation. At my level of management I’m not privy to layoff decisions, but I can be there to support my team and make space to talk about their concerns. I can strive to make sure our work and the impact of it is visible, and I can make sensible business decisions to make the most of resources in a more constrained environment. And so, after the initial shock of it all, that’s how I’ve approached this year.

  • I’m slowly trending towards GoodJob v4.0. It’s looking like it will be a noop: a chance for me to clean up all of the migrations and deprecation warnings and probably stop support for Ruby < 3.0, but nothing noticeable otherwise. I want to have everything ready for anyone to opt into FOR UPDATE SKIP LOCKED from Advisory Locks, but as a feature that won’t be part of four-point-zero. We’ve managed to get to 3.22 (that’s 22 minor releases!) without a breaking change.
  • I’m not big on setting yearly goals, but I liked this evergreen advice from Ask a Manager, so I’m gonna continue aspiring to that:

    ideally you’d try to see it as a funny story you can tell friends rather than a hurtful snub that needs to bother you