Uptycs has submitted two pull requests to add HTTP(S) proxy and TLS persistent transport support to osquery. Both have now been merged: support for Beast (more on that later) and persistent transport support.
These contributions will hopefully improve the performance of osquery. We'll review the benefits of persistent transport support for TLS (which we’ll abbreviate as PTS), why we decided to implement it, and how this may eventually benefit your osquery installation.
TLS transport is the component of the TLS plugin used to communicate with a TLS server over HTTPS. Before implementing PTS, every HTTPS request to a TLS server required building a new TLS connection, including exchanging SSL certificate information. With TLS persistent transport support, once a secured connection is established it can be used to send multiple HTTPS requests to the TLS server over the same connection, removing much of the overhead normally associated with building up and tearing down TLS sessions.
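To illustrate the idea (this is a minimal sketch, not osquery's actual implementation), the Boost.Beast snippet below performs the TLS handshake once and then reuses the same stream for several HTTPS POSTs. The host name and request target are placeholders.

#include <boost/asio/ip/tcp.hpp>
#include <boost/asio/ssl.hpp>
#include <boost/beast/core.hpp>
#include <boost/beast/http.hpp>
#include <boost/beast/ssl.hpp>
#include <string>

namespace beast = boost::beast;
namespace http  = beast::http;
namespace net   = boost::asio;
namespace ssl   = net::ssl;
using tcp = net::ip::tcp;

int main() {
  const std::string host = "tls.example.com";  // placeholder TLS endpoint
  const std::string port = "443";

  net::io_context ioc;
  ssl::context ctx(ssl::context::tls_client);
  ctx.set_default_verify_paths();
  // (SNI and certificate verification setup omitted for brevity.)

  // Resolve and connect once; the TLS handshake happens a single time.
  tcp::resolver resolver(ioc);
  beast::ssl_stream<beast::tcp_stream> stream(ioc, ctx);
  beast::get_lowest_layer(stream).connect(resolver.resolve(host, port));
  stream.handshake(ssl::stream_base::client);

  // Reuse the established connection for multiple requests.
  for (int i = 0; i < 3; ++i) {
    http::request<http::string_body> req{http::verb::post, "/logger", 11};
    req.set(http::field::host, host);
    req.set(http::field::content_type, "application/json");
    req.body() = "{\"example\": true}";
    req.prepare_payload();
    http::write(stream, req);

    beast::flat_buffer buffer;
    http::response<http::string_body> res;
    http::read(stream, buffer, res);
  }

  // Shut down the TLS session once, when we are done.
  beast::error_code ec;
  stream.shutdown(ec);
  return 0;
}

Without PTS, everything from the resolve through the handshake would be repeated for every request in that loop.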
One of osquery's logging mechanisms is to output to a TLS endpoint. Uptycs uses nginx to terminate TLS connections into our service. Before implementing PTS, we observed resource utilization issues on our servers supporting large osquery deployments. Stock osquery worked fine with hundreds of endpoints, but when we started having thousands of endpoints or more talking to the same TLS server, we observed a lot of resource overhead.
At Uptycs, we push osquery config updates as often as once a minute, and we receive up to 10 to 15 log updates per minute from some of our higher-volume endpoints. Without PTS, each of those requests uses a fresh socket, and TCP keeps every closed socket in the TIME_WAIT state for 60 seconds, so we were seeing an average of 12 to 16 sockets in use on each endpoint per minute. That's a lot of resources spent setting up and tearing down sockets. Knowing that we had demand for larger and larger deployments in the near future meant we needed a different solution. Our architecture allows us to scale horizontally, but there had to be a better answer than just throwing (expensive) resources at the problem.
osquery originally handled management of socket connections through cpp-netlib. We considered trying to work with cpp-netlib to add the additional features, but decided against it: cpp-netlib doesn't have HTTP proxy support (another feature we needed to support our growing customer base), does not provide low-level programming interfaces for customizing how HTTP connections are handled, and is no longer actively maintained.
After some additional research, we decided to use Beast. In addition to supporting the latest HTTP specifications, Beast is actively maintained and is part of the Boost library, which is already in use by osquery, so adopting it does not add another dependency. Beast is also customizable and provides appropriate low-level interfaces for building out HTTP proxy support, and it supports WebSockets, which could allow additional augmentations to osquery in the future.
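As one example of what those lower-level interfaces make possible, the sketch below tunnels through an HTTP proxy with a CONNECT request before performing the TLS handshake. This is a rough, generic illustration of the pattern, not osquery's implementation; the proxy host, port, and destination host are placeholders, and error handling is minimal.

#include <boost/asio/ip/tcp.hpp>
#include <boost/asio/ssl.hpp>
#include <boost/beast/core.hpp>
#include <boost/beast/http.hpp>
#include <boost/beast/ssl.hpp>
#include <stdexcept>
#include <string>

namespace beast = boost::beast;
namespace http  = beast::http;
namespace net   = boost::asio;
namespace ssl   = net::ssl;
using tcp = net::ip::tcp;

// Establish a TLS stream to `host` by tunnelling through an HTTP proxy.
beast::ssl_stream<beast::tcp_stream> connect_via_proxy(
    net::io_context& ioc, ssl::context& ctx,
    const std::string& proxy_host, const std::string& proxy_port,
    const std::string& host) {
  tcp::resolver resolver(ioc);
  beast::ssl_stream<beast::tcp_stream> stream(ioc, ctx);

  // 1. Plain TCP connection to the proxy.
  beast::get_lowest_layer(stream).connect(resolver.resolve(proxy_host, proxy_port));

  // 2. Ask the proxy to open a tunnel to the real TLS server.
  http::request<http::empty_body> connect_req{http::verb::connect, host + ":443", 11};
  connect_req.set(http::field::host, host + ":443");
  http::write(beast::get_lowest_layer(stream), connect_req);

  // 3. Read the proxy's response header; a CONNECT response has no body.
  beast::flat_buffer buffer;
  http::response_parser<http::empty_body> parser;
  parser.skip(true);
  http::read_header(beast::get_lowest_layer(stream), buffer, parser);
  if (parser.get().result() != http::status::ok) {
    throw std::runtime_error("proxy refused CONNECT");
  }

  // 4. The TLS handshake then runs end-to-end through the tunnel.
  stream.handshake(ssl::stream_base::client);
  return stream;
}

This kind of control over the connection lifecycle is exactly what cpp-netlib did not expose.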
The amount of total data transmitted and received by osquery over the TLS connection depends on the scheduled query configuration. We created a simple test to show the improvement in bandwidth usage by running a deployment once we had built osquery with PTS support included.
For this test, we used one of the basic osquery configuration profiles we deploy to endpoints on our production network, and ran it for a while collecting bandwidth stats. The absolute bandwidth consumption is not really important here, as it will change based on the configuration, but the relative difference should show the efficiency gained by adding PTS to the mix. We are not attempting to conflate data and bandwidth even though we talk about both; in our testing it was simply easier to measure changes in average bandwidth than to count every byte sent and received.
The results speak for themselves:
However, the numbers here raised an interesting additional question: why is the volume of bytes received by the osquery endpoint from the nginx server so high? We were genuinely surprised to see more bandwidth going to the endpoints than coming from them.
There are several things that could contribute to this. One is that we send about 11 KB of (uncompressed) configuration to each endpoint every minute, since there is currently no way to incrementally update the osquery configuration; you have to push the whole configuration at once.
The other is that osquery only sends diffs of changes back up, so a large majority of the upstream bandwidth is actually TLS overhead rather than the data itself. If you are hosting in a scenario where you pay for bandwidth in and out of your server enclave (which most people are!), implementing PTS can save a lot of money when you run osquery at scale.
With PTS in use, not only do we use less bandwidth, but information is also transferred in a smaller window of time once you remove the overhead of constantly building up and tearing down unnecessary connections. We did some additional testing to show how long it takes to hit some of our API endpoints before and after implementing PTS.
We tested first with an osquery endpoint in the same cloud as the TLS server terminating the sockets from osquery, and then with an osquery endpoint running on a laptop in our office talking to the same TLS server (but over the internet).
The times are all longer with the endpoint talking across the internet versus the endpoint local to the TLS server, but proportionately, things are again faster with PTS in place. It may seem trivial to worry about microseconds, but when you are talking about thousands or tens of thousands of hosts times many customers, it adds up!
Beast is now integrated into osquery for future development efforts. If PTS support is adopted, users of osquery will be able to configure both whether PTS is used and the timeout after which the persistent socket is recreated. This would be done with flags such as the following (n.b. this pull request is still under discussion, so the exact flags may change):
--tls_persist_transport: Turns on the persistent transport capability
--tls_persist_transport_timeout=3600: Close and recreate the persistent socket after this many seconds
We realize that keeping a socket open for extremely long periods of time can cause issues, so we don't recommend setting this value to a really large number. A value of 3600 corresponds to one hour.
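Conceptually, the timeout just bounds how long a single connection is reused before it is rebuilt. A hypothetical sketch of that bookkeeping (illustrative only; the class and method names are not osquery's) might look like:

#include <chrono>

// Illustrative sketch of the recreate-after-timeout idea; not the actual osquery code.
class PersistentTransport {
 public:
  explicit PersistentTransport(std::chrono::seconds timeout) : timeout_(timeout) {}

  // Called before each request: reuse the socket if it is still fresh,
  // otherwise tear it down and build a new TLS connection.
  void ensureConnection() {
    auto now = std::chrono::steady_clock::now();
    if (!connected_ || now - established_ > timeout_) {
      reconnect();
      established_ = now;
      connected_ = true;
    }
  }

 private:
  void reconnect() {
    // Close any existing socket and perform a fresh TLS handshake here.
  }

  std::chrono::seconds timeout_;
  std::chrono::steady_clock::time_point established_{};
  bool connected_{false};
};

int main() {
  PersistentTransport transport(std::chrono::seconds(3600));  // one-hour reuse window
  transport.ensureConnection();
  return 0;
}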
After adding TLS persistent transport support using Beast, osquery now consumes fewer resources and uses less bandwidth than it did before to transfer the same amount of data. This makes an osquery deployment that logs to TLS more efficient and cheaper to maintain (at least from a bandwidth and resource utilization standpoint). osquery can also now communicate with a TLS endpoint via an HTTP(S) proxy.
There are still a lot of areas where osquery can benefit from performance testing and optimization. Right now, osquery requires a full config to be pushed to it whenever you want to make updates. Now that Beast is in place, it may be possible to do some sort of incremental updates (both to config and to logging) via WebSockets, which could lead to even faster updates of osquery and even less bandwidth consumption.
Nishant Pamnani implemented this feature and provided significant input into this blog post.
Related osquery resources: