• 0 Posts
  • 22 Comments
Joined 1 year ago
Cake day: June 11th, 2023

  • Most voters don’t have a business and never will.

    The value of a net new business is that it creates more jobs and economic activity.
    Most people benefit from more jobs to either work at or drive up labor demand.
    Per that school of economic thought, incentivizing a new business adds more activity to the market and more opportunity for people to find ways to innovate, provide value and become profitable.
    Giving money to an existing struggling business is subsidizing a business that's already demonstrated it isn't working.

    However, we're both reading too much into it. The point is to be able to say "$50k for small businesses," because people like a business-friendly atmosphere.
    Trump gets credit for giving tax cuts to businesses that went to stock buybacks, which only help investors. The goal is to court people who want pro-business policies without literal handouts to corporations.


  • Yup. :/

    I looked it up and it’s not unusual for sentencing in New York to take several months, but I would have been much happier if the political realities had pushed things to move faster.

    Having read the prosecution's response to the request for delay, which basically said "everything the defense said justifying a delay was wrong; here's why a delay would actually be a good idea," it's hard to blame the judge too much for granting it.
    Even though none of the reasons seem to be based on sound legal principles, and are at best practical considerations.


  • Only for the sake of specificity: CrowdStrike forced the update, not the OS. :) And yeah, that's generally unheard of. So unheard of that it's the kind of occurrence that reverses professional recommendations, based purely on the fact that they would release a product that bypassed user expectations this aggressively, without any documentation that it was happening.
    I work in the security sector with computers, and before all this I would have said "yeah, CrowdStrike is a widely deployed product, and if it fits your requirements it's reasonable to use." Now I would strongly recommend against it, not because of this incident, but because of the engineering, product, and safety culture that thought it was okay to design a product this way, without user controls or even documentation around any part of it. Their post-incident report is horrifying in the testing it reveals they weren't doing.

    I wouldn’t advise someone to use windows for a server, but that’s a preference thing, not a “hazard” thing. If they had a working windows setup I wouldn’t even comment on it.

    What sounds like happened to Delta is that they were set up roughly like other companies, maybe a little loose about different setups at different airports. That's a forgivable level of slop. Where they differed was in having a piece of software that couldn't handle being shut off entirely and then immediately loaded to 100% with no ease-in.
    Scheduling is the kind of computing problem that gets dramatically harder as the number of things being scheduled grows. Worse than exponentially harder, in fact.
    I know nothing about their system, but I'd guess it worked fine while it was running because it only needed to make a small number of scheduling decisions at a time, and could treat the existing state of things as decided "fact." Start the system fresh, and suddenly it has to compare hundreds of airports, more hundreds of planes and crews, and thousands of possible routes against each other, sorting through literally billions of possible schedules to pick the best ones.
    Other airlines appear to have scheduling systems that were either developed with more modern techniques that can find "good enough" very efficiently, or written to fail less easily, or given better hardware so they could work faster.
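    To put a toy number on that "worse than exponential" claim: the raw search space for pairing n crews with n flights grows factorially. A quick Python sketch (all numbers invented for illustration; this is not Delta's actual system):

    ```python
    import math

    # Toy model: assigning n crews to n flights, one crew per flight,
    # gives n! possible assignments before any pruning.
    for n in (5, 10, 15, 20):
        print(f"n={n:>2}: {math.factorial(n):.3e} assignments, "
              f"{math.factorial(n) / 2**n:.3e}x a plain 2^n exponential")
    ```

    Real schedulers prune almost all of that, but a warm system deciding one change at a time never faces the full space. A cold restart does.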

    For whatever reason, Delta was the only one whose key bit of software failed to come back up.

    Delta has higher costs than the other airlines because there are regulations protecting travelers and ensuring they get appropriate refunds and accommodations if their flights are cancelled. Other airlines were able to shift people around and get going again before they had to pay out too much in ticket refunds, food, or hotels.
    Delta is arguing that CrowdStrike is responsible for the total cost of the incident, including all the refunds and hotels, since they caused it.
    CrowdStrike recently responded that they think their liability is no greater than $10 million. They seem to be taking the position that they're only responsible for the immediate effects: things like diverting aircraft, manually poking systems back to life, and all that.

    “Yeah I t-boned you when I ran a red light, so I owe you for the damage to your car, but your car was a dangerous piece of crap so I’m not responsible for your broken legs, hospital bills or lost wages”.
    I think the judge will find that running the red light makes them responsible for the extended consequences of their actions, even if those vastly exceed what anyone would have predicted up front, but also that the car was dangerous enough that a crash was only a matter of time, so it's not all on them.

    If there’s one thing I’ve learned from reading about court cases, it’s that a civil suit like this will get really complicated with how they assess damages and responsibilities.

    And yeah, there’s no perfect answer for computer system stability. You can never get perfect stability, and each 9 you add to your 99.9% uptime costs more than the last one. Eventually you have teams of people whose full time job is keeping the system up for an additional second per year. And even with that, sometimes Google still goes down because it’s all a numbers game.
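    The "each 9 costs more" point is easy to make concrete with back-of-the-envelope arithmetic. A minimal sketch, not tied to any real SLA:

    ```python
    # Downtime budget per year at each availability level.
    # Each added nine shrinks the budget 10x, while the engineering
    # cost of hitting it grows.
    MINUTES_PER_YEAR = 365 * 24 * 60

    for nines in range(3, 7):  # 99.9% through 99.9999%
        downtime = MINUTES_PER_YEAR * 10**-nines
        print(f"{1 - 10**-nines:.4%} uptime -> {downtime:9.3f} minutes down per year")
    ```

    At three nines you get almost nine hours a year; at six nines you get about half a minute, which is where those "one more second of uptime" teams live.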

    I didn’t mean to ramble so long, but I have opinions and I get type-y before bed. :)


  • You are correct that Delta was an outlier, but not with regard to the scale of the outage: their scheduling software was down far longer, and they handled the customer side of things significantly less well.

    Generally, your protection against operating system issues is the aforementioned restriction on changes and how they go out.
    If something is stable, you can expect it to remain stable unless something changes or random chance breaks something.
    The operational cost of running multiple operating systems in production like you describe would be high. Typically software is only written to work on one platform, and while it can be modified to work on others, it’s usually a cost with no benefit outside of a consumer environment.
    Different operating systems have different performance characteristics you need to factor in for load scaling, different security models, and different maintenance requirements.
    Often, but not always, server administrators will focus on one OS, so adding more to the mix can mean people are rusty with whichever is your backup, which can be worse than just focusing on fixing the issue with the primary.
    OS bugs are rare, and they usually manifest early or randomly. That's why production deployments tend to stick with an OS for as long as it's supported: change means learning new issues, and you've probably already encountered all the bullshit in what you're currently using. It's why Linux distros tend to have long-term-support versions, and Windows Server tends to just get supported for a long time with terrible documentation.

    I'm a Linux guy, so defending Windows feels weird, and I want to note that I don't think anyone should use it, particularly for a server, but the professional in me acknowledges that it's a perfectly functional hammer.

    As we've learned more, I've become more critical of Delta's choice not to keep the scheduling system modernized in a way that could recover faster, and of their underinvestment in making systems homogeneous across different airports. I still think these issues are largely independent of their actual disaster recovery or resiliency plans.
    Inevitably, the lawsuits will determine that blame for the damage is split between the two of them. My bet is 70/30 CrowdStrike/Delta, since it's easy to demonstrate that the issue was fundamentally caused by CrowdStrike and negatively impacted other airlines and businesses in general. Some of it was clearly Delta's fault for failing to keep a system modernized enough to handle a shift this massive; they would have been similarly disrupted by any outage involving mass flight cancellations.


  • The current geological era will have measurable levels of radioactive isotopes that differ from expectations. Just as we can tell from the fossil record and rock chemistry when plants started making oxygen, we'll be able to tell when humans started having some physics fun time in the atmosphere.

    Other fun fact: we've added a decent set of new markers for future archaeologists to date things with.
    I think we've caused some of the carbon-dating techniques to need a little asterisk in the future, since we've shifted the baseline level around quite a bit.
    We also added some new radioactive isotopes to the mix, like strontium, which shows up in your teeth. Not new-new, but at measurably increased levels.
    We can actually use the levels in your teeth to estimate your age within a year or two.

    The discovery of this is part of what motivated the Partial Test Ban Treaty, which had both the US and the Soviet Union stop testing in the atmosphere.


  • Ugh, that's shitty. Companies keep acting like they're confused about why they can't find anyone, when everyone knows the problem: people just want better than disgusting benefits, mistreatment, shit pay, and legal loopholes that somehow make all of that the worker's fault.

    There’s a place for contractors in the employment landscape. A bakery doesn’t need a staff plumber. A clothing store might only need a web designer for a few months to rebuild the website.
    But a delivery company saying that the people who do its deliveries, the core of its business, every day, indefinitely, are contractors? That's so obviously bullshit. We need oppressively stiff penalties for shit like that, because as long as it's cheaper to do it wrong, they have no reason to do it right.


  • In this case they’re employees of a “delivery service partner”.

    It's roughly the same thing, except instead of driving a semi truck, you're hired as a contractor to hire and manage the delivery drivers, do everything Amazon tells you, and make sure your drivers do everything Amazon tells them as well.
    That way Amazon can pressure you into abusing the drivers and claim it wasn't them, just a terrible contractor they happened to hire, and can refuse to negotiate with the drivers because they "don't work for Amazon."

    Which, quite clearly, isn't a thing you're allowed to do: even if your employees get their checks from someone else, they're still your employees when all the work they do is for you.


  • Well, in your example you should be mad at yourself for not having a backup house. 😛

    There’s a lot of assumptions underpinning the statements around their backup systems. Namely, that they didn’t have any.
    Most outage backups focus on datacenter availability, network availability, and server availability.
    If your service needs one server to function, having six servers spread across two data centers, each with at least two ISPs, is cautious but prudent. Particularly if you're set up to do rolling updates, so only one server is ever "different" at a time, leaving you a redundant copy at each location no matter what.
    This goes wrong if someone magically breaks every redundant server at the same time. The underlying assumption of resiliency planning is that failure is probabilistic in nature, so by quantifying your failure points and their failure probabilities you can tune your likelihood of an outage to be arbitrarily low (but never zero).
    If your failure isn’t random, like a vendor bypassing your update and deployment controls, then that model fails.
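    That model is just multiplying independent probabilities. A toy sketch with invented failure rates shows both why redundancy works and why a correlated failure defeats it:

    ```python
    # Toy numbers, not from any real deployment.
    p_down = 0.01  # assume any single server is down 1% of the time

    for n in (1, 2, 3, 6):
        # Independent failures: a total outage needs all n copies down at once.
        print(f"{n} redundant server(s) -> P(total outage) = {p_down**n:.0e}")

    # A vendor update that bricks every box is one event, not n independent
    # ones: P(total outage) = P(bad update), and redundancy buys you nothing.
    ```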

    A second point: an airline uses computers that aren't servers, and it needs them to operate. The ticketing agents, the gate crew that manages seating and boarding, the ground crew that files routine inspection reports, the baggage handlers that put bags on the right cart for the right plane, and the office workers who make sure fuel is paid for and crews are ready when their plane shows up: all the stuff that goes into being an airline that isn't actually flying planes.
    All these people need computers, and you don’t typically issue someone a redundant laptop or desktop computer. You rely on hardware failures being random, and hire enough IT staff to manage repairs and replacement at that expected cadence, with enough staff and backup hardware to keep things running as things break.

    Finally, if what you know is "computers are turning off and not coming back online," your IT staff is swamped, systems are variously down or degraded, and staff in a bunch of different places are reporting that they can't do their jobs, then your system is in an uncertain and unstable state. That is not where you want a system with strict safety requirements to be, so the only responsible action is to halt operations, even if things start to recover, until you know what's happening, why, and that it won't happen again.

    As more details have come out, it appears Delta's problem was less about system resiliency, although needing to manually fix a bunch of servers was a problem, and more that the scale of flight and crew availability changes overloaded the aforementioned scheduling system, making it difficult to get people and planes in the right place at the right time.
    While the application should have handled extremely high load more gracefully, that's a much smaller failure of planning than not having a disaster recovery or redundancy plan at all.

    So it’s more like I built a house with a sprinkler system, and then you blew it up with explosives. As the fire department and I piece it back together, my mailbox fills with mail and tips over into a creek, so I miss paying my taxes and need to pay a penalty.
    I shouldn’t have had a crap mailbox, but it wouldn’t have been a problem if you hadn’t destroyed my house.


  • Yes, that book. Because the software indicated to end users that they had disabled, or otherwise asserted appropriate controls over, the system updating itself and its update process.

    That’s sorta the point of why so many people are so shocked and angry about what went wrong, and why I said “could have done everything by the book”.

    As far as the software communicated to anyone managing it, it should not have been doing updates. CrowdStrike didn't advertise that it updated certain definition files outside of the exposed settings, nor did they communicate that those changes were happening.

    Pretend you've got a nice little fleet of servers. Let's say they're running some vaguely responsible Linux distro, like CentOS or Ubuntu.
    Pretend that nothing updates without your permission, so everything is properly by the book; you host local repositories that all your servers pull from so you can verify every package change (a sketch of that lockdown is below).
    Now pretend that, unbeknownst to you, Canonical or Red Hat had added a little mechanism to dnf or apt that installs "really important" updates really fast, paying no attention to any of your configuration files, not even the setting that says "do not, under any circumstances, install anything without my express direction."
    Now pretend they use it to push out a kernel update that patches your kernel into a bowl of lukewarm oatmeal and reboots your entire fleet into the abyss.
    Is it fair to call the admin of this fleet a total fuckup for using a vendor that, up until that moment, was widely used, generally well regarded, and presented no real reason for doubt? Even though they used software that connected to the Internet, and maybe even paid for it?
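    For a concrete picture of "properly by the book" in that hypothetical: on a stock Ubuntu box, disabling automatic updates looks roughly like this (a sketch; file names and keys are from the standard unattended-upgrades setup and can vary by version):

    ```
    # /etc/apt/apt.conf.d/20auto-upgrades
    # "0" on both lines disables automatic package-list refreshes
    # and unattended upgrades entirely.
    APT::Periodic::Update-Package-Lists "0";
    APT::Periodic::Unattended-Upgrade "0";
    ```

    The point of the hypothetical is that an update channel which ignores this file entirely turns the setting into a polite fiction.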

    People use tools that other people build. When a tool does something totally insane that they specifically configured it not to do, it's weird to keep blaming them for not doing everything in-house. Because what sort of asshole airline doesn't write its own antivirus?