Your application developers work with your company's most valuable intellectual property and have access to your most sensitive systems. What's the expectation as to how their laptops are configured? How can you ensure that DevOps machines are securely configured before developers gain access to critical resources? In this session, Uma Unni explains how Stripe uses osquery to validate secure machine configurations in real time before granting developers access to sensitive resources.
Uma Unni takes us from the foundation of modern, zero-trust architecture to how the security team at Stripe has implemented dynamic verification controls into its infrastructure. She shows how zero-trust verification using endpoint device health is used to grant or deny access, and how you can provide a break-glass option for users having time-sensitive access requests.
Transcript
Introduction
My name is Uma Unni. I'm a software engineer at Stripe, where I work on the detection engineering team. Today I'll be talking about using osquery to implement zero-trust controls and validate endpoint security health. Or to put it another way, "Don't touch me with that insecure laptop."
First I’ll give some background on what zero-trust is. I'll address why we care about verifying the security health of endpoints in a zero-trust system, then I'll provide an example flow for what a device integrity verification process might look like in a network. Wrapping up, you’ll learn how osquery can help with collecting security health data.
What Is Zero-Trust?
Zero-trust is a relatively new approach to enforcing access control for a company's network and internal resources. Traditionally, network security follows a perimeter-based model—like a castle and moat. There's a boundary around all internal resources you want to protect. Anyone who wants to access network resources has to be validated at that boundary. But once inside, they basically have free access to whatever they want and can hop around to any other resource within the network. They're essentially given implicit trust since they're already inside.
This works well enough, but it entirely relies on your ability to tightly secure that initial border. If anyone gains network access through any given access point when they shouldn't, they'll be able to potentially move laterally and access many other network resources—well beyond the one they initially compromised.
This is where the concept of zero-trust comes in. You can describe it as “never trust, always verify.” Instead of validating system users at the boundary only once, they get validated every time they make a request to any asset or resource. If they want access to your user database, validate them. If ten minutes later they want access to some other database or resource, validate them again. They're not given implicit trust just because they recently accessed something else in your network.
This zero-trust approach can limit attackers' ability to move laterally, even if they've initially gained network access. They're not given implicit trust to access any other resources. This limits the damage they're able to do.
But an equally important part of zero-trust involves verifying users’ device integrity. Are the devices they're using secure? Do we think any device is likely to be compromisable?
Zero-Trust Controls Content Overview
If a person’s laptop is vulnerable, a bad actor can compromise it. They'll then be able to access your internal systems and exfiltrate company data.
So what is a healthy laptop? What does it mean for it to be secure? That depends, but useful factors include, ‘Is the OS up to date? Does the endpoint have file encryption enabled? Have you disabled or somehow limited remote login abilities?’ Checks like these help ensure that the laptop is less likely to be compromised or used in a way you don't want. A lot of this endpoint information can be collected using osquery.
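As a rough illustration, the three example factors above could be expressed as a small set of checks. This is only a sketch: the `DeviceHealth` fields, the check names, and the `MIN_OS_VERSION` threshold are assumptions made for the example, not Stripe's actual policy.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Assumed minimum OS version, chosen only for illustration.
MIN_OS_VERSION: Tuple[int, int] = (12, 4)


@dataclass
class DeviceHealth:
    """Hypothetical snapshot of the health data collected for one laptop."""
    os_version: Tuple[int, int]
    disk_encrypted: bool
    remote_login_enabled: bool


def failed_checks(health: DeviceHealth) -> List[str]:
    """Return the names of every check this device fails."""
    failures = []
    if health.os_version < MIN_OS_VERSION:
        failures.append("os_out_of_date")
    if not health.disk_encrypted:
        failures.append("disk_not_encrypted")
    if health.remote_login_enabled:
        failures.append("remote_login_enabled")
    return failures


def is_healthy(health: DeviceHealth) -> bool:
    """A device counts as healthy only if it fails no checks."""
    return not failed_checks(health)
```

Returning the list of failed checks, rather than a single yes or no, makes it possible to tell the user exactly what to fix.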
How Does Zero-Trust Verification Work?
Here I've got a simplified flow diagram. With a zero-trust setup, typically you'll have some sort of frontend handling all requests coming into the network to access your internal resources. In this system, that frontend would be part of what enforces much of the zero-trust posture. This would also be where you enforce verifying that the user identity is that of someone you want to grant access to your stuff and that they have permission to access the specific resource they're requesting.
How can a network frontend be used to do device integrity verification? Say a request comes in from a user; they want to access your database. You first need to gather all the relevant security health data for this endpoint to see if it’s trustworthy. Or is it likely to be compromised?
At this point you can take whatever device identifier you've received from the request and look it up wherever you're storing your endpoint security health data, perhaps a data store you've set up. The lookup asks, ‘This endpoint has requested access. What do we know about it?’ You would retrieve data stating, ‘Their OS is at this version, they've got file encryption enabled, and their firewalls are turned on.’ You can then sort through the received data and decide, ‘Based on what we know, this looks like a secure laptop. It doesn't look like it's going to be compromised.’
You can now route their request to wherever they want it to go. But if you decide, ‘I don't know about this one. It looks compromisable,’ you can deny that request.
You also need to let the user know what happened so they can fix whatever is wrong with their laptop. In some cases, this might be as simple as, ‘Please update your operating system.’ Other times it might be more complicated, such as, ‘You have this file monitoring software and need to enable it. Here's documentation on how to do that.’ So getting back into compliance, or full security health, can be more or less complicated.
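The flow described above (look up the device, decide, then route or deny with a remediation message) might be sketched as follows. The in-memory `HEALTH_STORE`, the device identifiers, the check names, and the message wording are all hypothetical stand-ins; a real frontend would query an actual data store and also verify user identity and permissions.

```python
from typing import Dict

# Hypothetical in-memory stand-in for the endpoint health data store.
HEALTH_STORE: Dict[str, Dict[str, bool]] = {
    "SERIAL-001": {"os_up_to_date": True, "disk_encrypted": True, "firewall_on": True},
    "SERIAL-002": {"os_up_to_date": False, "disk_encrypted": True, "firewall_on": True},
}

# Assumed remediation text shown to the user for each failed check.
REMEDIATION: Dict[str, str] = {
    "os_up_to_date": "Please update your operating system.",
    "disk_encrypted": "Please enable disk encryption.",
    "firewall_on": "Please turn on your firewall.",
}


def handle_request(device_id: str, resource: str) -> dict:
    """Decide whether a request from device_id may be routed to resource."""
    record = HEALTH_STORE.get(device_id)
    if record is None:
        # Nothing is known about this device, so it is not trusted.
        return {"allowed": False, "reasons": ["No health data for this device."]}
    # Collect a remediation message for every check that did not pass.
    reasons = [msg for check, msg in REMEDIATION.items() if not record.get(check, False)]
    if reasons:
        return {"allowed": False, "reasons": reasons}
    return {"allowed": True, "route_to": resource}
```

For example, `handle_request("SERIAL-002", "user-db")` is denied with the single reason "Please update your operating system."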
The "Break Glass" Option
Another usability consideration is to have a “break glass” option. Say your user or developer didn't update their operating system but they've suddenly been paged. There's some big incident and they need to get in right now and start fixing things. They don't have time to update their OS.
Or maybe you have a marketing person with a big sales call coming up. They also don't have time to be fixing things right now. So, in these cases it can be useful to have a “break glass” option that communicates, ‘I understand my laptop is not totally secure right now, but I need immediate access. Please grant it to me temporarily.’ That way people can still get in, and you're not completely blocking them over what might not be a huge deal (or even if it is).
In such situations, you also want to have some sort of monitoring to make sure this feature isn't abused by users who repeatedly claim ‘Yes, this is an emergency’ every time it comes up. That said, this option can definitely help with reducing friction with users.
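One way to picture a break-glass option with monitoring is a temporary grant that expires and leaves an audit record. The sketch below is illustrative only: the `BreakGlass` class, the one-hour `GRANT_SECONDS`, and the `REVIEW_THRESHOLD` are assumptions, not a description of any real implementation.

```python
import time
from typing import Dict, List, Optional, Tuple

GRANT_SECONDS = 3600   # assumed duration of a temporary grant
REVIEW_THRESHOLD = 3   # assumed number of uses that triggers a review


class BreakGlass:
    """Hypothetical tracker for temporary break-glass grants."""

    def __init__(self) -> None:
        self.expiry: Dict[str, float] = {}
        self.audit_log: Dict[str, List[Tuple[float, str]]] = {}

    def request(self, user: str, reason: str, now: Optional[float] = None) -> None:
        """Grant temporary access and record who asked, when, and why."""
        now = time.time() if now is None else now
        if not reason.strip():
            raise ValueError("a reason is required")
        self.expiry[user] = now + GRANT_SECONDS
        self.audit_log.setdefault(user, []).append((now, reason))

    def is_active(self, user: str, now: Optional[float] = None) -> bool:
        """True while the user's most recent grant has not expired."""
        now = time.time() if now is None else now
        return self.expiry.get(user, 0.0) > now

    def needs_review(self, user: str) -> bool:
        """Flag users who have broken glass repeatedly."""
        return len(self.audit_log.get(user, [])) >= REVIEW_THRESHOLD
```

The grant is immediate, so the user is not slowed down during an emergency, while the audit log gives reviewers something to follow up on afterwards.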
Using Osquery to Gather Security Health Data
Osquery has a couple of features that are useful here. The first one is query packs. When you're trying to gather your security health data, it'll usually look like a bunch of individual queries. In this example you’ve selected their OS version, and you can select encrypted from disk_encryption if you want to know whether they have file encryption turned on. You can query whatever you've determined are useful items to be checking for endpoint health.
You can make this more convenient by wrapping a set of queries into a query pack. A pack essentially says, ‘This is my suite of security checks I need to run across everyone’s laptops.’ You can then easily configure it to run on whichever sets of laptops you want. This can be quite useful.
And with scheduled queries, you can make sure you’re getting up-to-date, fresh data for all endpoints. If you've got your query pack set up, you can schedule something like, ‘I want to run this query across all laptops in my fleet every five minutes or ten minutes’—or whatever you decide is a good cadence.
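To make this concrete, here is a minimal sketch that builds a query pack as JSON, with each query carrying its own interval in seconds. The `os_version` and `disk_encryption` tables are standard osquery tables; the query names, descriptions, and the five-minute interval are assumptions picked for the example.

```python
import json


def build_health_pack(interval_seconds: int) -> dict:
    """Build a small osquery query pack; every query runs on the same cadence."""
    return {
        "queries": {
            "os_version": {
                "query": "SELECT name, version FROM os_version;",
                "interval": interval_seconds,
                "description": "Operating system name and version",
            },
            "disk_encryption": {
                "query": "SELECT name, encrypted FROM disk_encryption;",
                "interval": interval_seconds,
                "description": "Whether each disk is encrypted",
            },
        }
    }


# Serialize the pack so it can be saved as a .conf file and referenced
# from the "packs" section of the osquery configuration.
pack_json = json.dumps(build_health_pack(300), indent=2)
```

A shorter interval gives fresher data at the cost of more work on each endpoint, so expensive queries may need a slower cadence than cheap ones.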
Having fresh data across the fleet is important in a couple of ways. Basically there are two possibilities. Let's say one user was fully security compliant 30 minutes ago, but in the last ten minutes they've enabled remote login or something. Now they're trying to access your internal resources but aren’t presently compliant, so you want to make sure you're not letting them in anymore. This is where having fresh data is important.
In another scenario, 30 minutes ago they weren’t compliant; perhaps their OS was out of date but now they've fixed it. You want to know ASAP that they fixed it; otherwise, you'll keep them blocked for longer than necessary. This creates friction and makes people not like your system very much.
So in both cases it's very useful to make sure you're getting fresh data and have an updated sense regarding the status of your fleet. These are a couple of really useful osquery features that can help you effectively implement device integrity verification in a zero-trust system.
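One way to act on data freshness is to attach a collection timestamp to each health record and decline to rely on records older than some limit. The sketch below assumes a hypothetical record shape and a ten-minute `MAX_AGE_SECONDS`; treating stale data as a denial is a design choice made here for illustration, not something stated in the talk.

```python
from typing import Optional

MAX_AGE_SECONDS = 600  # assumed limit: do not rely on data older than ten minutes


def decide(record: Optional[dict], now: float) -> str:
    """Return an access decision based on the most recent health record.

    record is a hypothetical dict with 'collected_at' (epoch seconds)
    and 'compliant' (bool), or None if nothing has been collected.
    """
    if record is None or now - record["collected_at"] > MAX_AGE_SECONDS:
        # The last known state may no longer reflect the laptop.
        return "deny_stale"
    return "allow" if record["compliant"] else "deny_noncompliant"
```

With a short collection cadence, both scenarios above resolve quickly: a laptop that drifts out of compliance is blocked at the next report, and one that has been fixed is unblocked at the next report.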
Conclusion
Zero-trust is a “never trust, always verify” approach to network security. It requires checking device integrity to verify you don't have compromised devices accessing the network, so monitoring and enforcing device verification checks is essential. Osquery can really help here: with query packs and scheduled queries, you have a current picture of your entire fleet at all times.
Questions
Question: You talk about using an identifier from an endpoint that comes from your web frontend to look up results relevant to that endpoint. What identifier do you use or what identifiers have you found to be useful for correlating the endpoint you're seeing on the request with data you have in the store?
Uma: It depends on what data your frontend is already capturing, but ideally, I think a hardware serial number is the most foolproof option. But another possible option is if they have a hostname, like the user's name or something.
Question: We use a similar model for some different metadata. You mentioned sending your data to a store—I'm assuming either S3 or a CM. Can you describe more of the process, such as the backend? You've got osquery collecting all of this great telemetry enabling your solution. But once a laptop comes on the network, how is the backend architected?
Uma: We do some additional processing on the data we’ve collected using Uptycs—sort of wrangling it into a format that's more easily queryable by our other languages. This intermediate step pulls in the Uptycs data, does all the transformations, then writes it to our data store.
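As a rough idea of what such an intermediate step could look like, the sketch below folds raw query results into one record per device. It assumes input shaped like osquery's own result log lines (JSON with `name`, `hostIdentifier`, `unixTime`, `action`, and `columns`); the format Stripe actually receives from Uptycs is not described in the talk, and the function name is made up.

```python
import json
from typing import Dict, Iterable


def latest_state(log_lines: Iterable[str]) -> Dict[str, Dict[str, dict]]:
    """Fold result log lines into {host: {query_name: columns}}.

    Assumes one JSON object per line. Only 'added' rows are kept,
    which is a simplification of osquery's differential results.
    """
    rows = [json.loads(line) for line in log_lines]
    # Process oldest first so that the newest row for each query wins.
    rows.sort(key=lambda row: row["unixTime"])
    devices: Dict[str, Dict[str, dict]] = {}
    for row in rows:
        if row.get("action") != "added":
            continue
        devices.setdefault(row["hostIdentifier"], {})[row["name"]] = row["columns"]
    return devices
```

The resulting per-device dictionary is the kind of shape a policy engine can query quickly by device identifier.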
Question: This might be more of a software or DevOps question. Stripe has quite a large number of employees. I can imagine the scale of every time someone tries to access some resource, it's asking, ‘Hey, is this OK?’ So I imagine there might be ways you scale down or reduce the amount of pings. You mentioned the refetching time, but also there might be other ways, like you don't require it on every single ping, right? Maybe it's once every couple of minutes or something. I'm curious about the scaling...what kinds of knobs you've turned to make that work.
Uma: Scaling is kind of an issue. We have around 8,000 laptops to manage with this system. A couple of things help make sure we’re not adding a ton of latency to requests. One is we use a very fast policy engine that takes the data and evaluates whether the endpoint meets our security health guidelines. We also have a form of caching, where our policy engine doesn’t pull from that endpoint data store every time a request comes in. For us, that data is only updated on a certain cadence.
Every time a new bundle of data is available, it gets pulled and then used; that cuts down on the network overload. Other optimizations are possible, and we're still exploring caching results. But we have to be careful because the concept of zero-trust is to verify every single time and not assume, ‘Oh, you were fine five minutes ago. You're probably OK this time, too.’ That’s a trade-off, for sure. It’s still something we continue to work on.
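The caching idea described here, where the policy engine works from a bundle that is refreshed on a cadence instead of hitting the data store on every request, could be sketched like this. The `SnapshotCache` class and its `fetch_bundle` callback are hypothetical names, and the refresh interval is an assumption.

```python
from typing import Callable, Dict, Optional


class SnapshotCache:
    """Serve device lookups from a bundle that is refreshed on a cadence."""

    def __init__(self, fetch_bundle: Callable[[], Dict[str, dict]], refresh_seconds: float) -> None:
        self._fetch = fetch_bundle       # hypothetical call to the data store
        self._refresh = refresh_seconds
        self._snapshot: Dict[str, dict] = {}
        self._loaded_at: Optional[float] = None

    def lookup(self, device_id: str, now: float) -> Optional[dict]:
        """Return the cached record, reloading the bundle if it is due."""
        if self._loaded_at is None or now - self._loaded_at >= self._refresh:
            self._snapshot = self._fetch()
            self._loaded_at = now
        return self._snapshot.get(device_id)
```

Note that this caches the health data, not the access decision: every request is still evaluated, which keeps closer to the "verify every single time" principle than caching the results would.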
Question: Have you implemented the “break glass” feature? If so, can you tell us more about how that works, and are there additional user requirements such as two-factor authentication? What things make the feature less abusable and keep you from feeling bad about it existing?
Uma: It's something we're still in the process of implementing. The general idea is having some sort of LDAP group to which people can request membership for a temporary duration. They need to submit a reason for their temporary membership. We're still deciding... ‘If it's really an emergency, should we wait for there to be a verification process, like someone approving it?’ Maybe not, but at least we’ll have a paper trail. Maybe their manager can review and say, ‘Hey, user, I see you requested this. Can you tell me more about that?’
If we see someone routinely uses it, we’ll have a record. It's definitely something where you also have to have some level of trust or you require an external authorization step from maybe a manager or someone. But there comes a trade-off with the speed that is intended to be there with the “break glass” option.
Question: When a user is presented with the access denied screen, presumably you're giving them instructions as to what they need to do to remediate the issue. What does the process look like from the user’s standpoint? Do you give them instructions germane to the issue they ran into? Do you have any sort of target or an SLA for them to regain access after they've fixed their issue? Is there a way for them to initiate a manual recheck? What's the user feedback been like?
Uma: The SLA we're aiming for is about 15 minutes from when they remediate their laptop issues to when they regain access. When they're first given a denied screen, they're told specifically which security check they failed. So, if it's their OS version, we'll convey, ‘Your OS version is 12.1. We need it to be at least 12.4,’ or something like that. We'll provide a documentation link if it's something more complicated to update than just the operating system.
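A denial message like the one quoted above depends on comparing version numbers correctly. The sketch below compares versions as tuples of integers, since a plain string comparison would wrongly rank "12.10" below "12.4". The function names and the optional `doc_url` parameter are assumptions for illustration.

```python
from typing import Optional, Tuple


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn a dotted version such as '12.4' into (12, 4) for numeric comparison."""
    return tuple(int(part) for part in text.split("."))


def os_version_message(current: str, minimum: str, doc_url: Optional[str] = None) -> Optional[str]:
    """Return a denial message if current is below minimum, else None."""
    if parse_version(current) >= parse_version(minimum):
        return None
    message = f"Your OS version is {current}. We need it to be at least {minimum}."
    if doc_url:
        # Link to documentation when the fix is more involved.
        message += f" See {doc_url} for instructions."
    return message
```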
We're in the early stages of rolling this out; we're doing it in a monitoring mode where people aren’t actually blocked, but we keep tabs on what happens. I think rolling it out without making everyone really hate it will definitely be its own challenge. Being lenient with allowing “break glass” exceptions, especially early on, is something that can help. But you also have to communicate to users that there's a reason we're doing this, and that it's not just to make their lives difficult.
Question: I imagine you have certain query packs that comprise an array of queries associated with your solution. You have destinations, you have to get that data to your telemetry point, to whatever storage solution you use. But how are you using Uptycs in an efficient manner to correlate to not just those query packs, but also the subqueries and all those QPs?
Uma: There's been a lot of tuning, and some of the queries will be long-running that you can't reasonably run on a quick cadence. So a lot comes down to fine-tuning. For example, I might think, ‘OK, here’s one that’s just not feasible to run on a quick cadence. Using it would mean we'd have to alter something else.’ You have to prune your QPs to the best set that provides good security coverage. It's a process.
Question: Is your trust an all-or-nothing, or do you disable groups based on what information you’ve got?
Uma: We use an all-or-nothing approach to start out. But for future development, if we aren’t using specific security policies for specific resources, then at least we should have buckets for low-security, medium-security, and high-security resources, each with potentially different standards imposed.
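The bucketed approach mentioned as future work could look something like the following, where each tier requires a different set of passed checks. The tier names follow the answer above; the specific checks assigned to each tier are invented for the example.

```python
from typing import Dict, Iterable, Set

# Assumed mapping from resource tier to the checks a device must pass.
REQUIRED_CHECKS: Dict[str, Set[str]] = {
    "low": {"os_up_to_date"},
    "medium": {"os_up_to_date", "disk_encrypted"},
    "high": {"os_up_to_date", "disk_encrypted", "remote_login_disabled"},
}


def allowed(tier: str, passed_checks: Iterable[str]) -> bool:
    """Allow access only if every check required for the tier has passed."""
    return REQUIRED_CHECKS[tier] <= set(passed_checks)
```

An all-or-nothing policy is the special case where every resource sits in the strictest tier.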
Question: You basically told us that the zero-trust principle is like saying a given laptop was in one state five minutes ago and in another five minutes later. I think that actually has a relatively negligible effect on the likelihood of compromise. But all of this device integrity stuff—it has to be paired with actual access control, right?
If all of your resources behind this system allow everyone to access them and the only thing that we're adding is a device integrity check, then the likelihood of compromise and the risk we've reduced is actually kind of pointless. We could all be on a network and it would be essentially the same exposure, maybe barring a small amount of risk that we reduce.
So in terms of how Stripe looks at zero-trust overall, do we actually have to make sure that everybody can't access everything at the identity layer, and not on the device-tier layer? And only by doing these two things together do we actually get risk-benefit. How has that gone? And are people thinking that way, or are you in your silos where it's like, ‘It's not my problem to figure out who has access to what.’ It's your problem to figure out which laptop is icky. At a higher level in the security strategy, how does Stripe look at how to solve both problems at the same time? Is it going well? Not going well? How do you look at the pairing of these two things that need to happen together?
Uma: It's definitely important to have both aspects—identity management as well as device verification. I can’t speak too much to the identity management efforts, although that’s definitely active work. I think the zero-trust term should apply to both, and I think that’s how we're approaching it. There are efforts in progress to pare down access control based on who a person is and not just let everyone access anything.
Question: Is your frontend an Okta, Auth0? Are you making authorization decisions within some third-party product, or is it a service you or your team built?
Uma: Ours is a service we built in-house.
Question: Do you treat users and machines differently? If I'm a machine and have a JWT bearer token or whatever, are you gonna disable it if I scan differently, or is this just user-focused?
Let me rephrase that. If I'm a web app and I scan badly, is the monitoring different for a web app versus a user?
Uma: We're mostly just focused on the user side. All of this, like the frontend I mentioned earlier in our diagram, is just for human-generated requests.
Question: Can you identify if a connection is coming from home devices or company-issued devices and if your platform detects between Windows OS or mobile devices or Linux?
Uma: At this point, we're limiting zero-trust to company-issued devices. All requests can initially only come through if you're connected to the company VPN, which is restricted to company laptops. I don't believe mobile phones are even allowed. It's a very specific scope.
Question: What kind of applications are behind the zero-trust authentication layer? Is it the VPN server or a database?
Uma: The VPN, I believe, is not behind it because you have to connect to it first to connect to a lot of internal services. Most of our internal services, such as third-party resources like vendor tools, lie behind that layer. There is work underway to also introduce zero-trust to emails.