2018 marks the first full year in which Uptycs, the company created to bring Facebook’s open source osquery agent to widespread commercial adoption, has had its turnkey security analytics platform in the market. As can be expected of any startup that launches a new ground-breaking product, it has been an exciting year, full of anticipation, unprecedented interest, and challenging work as we tweaked and tuned the product to optimize it for what our customers needed it to do.
The incredible breadth of data that osquery is capable of collecting opens it up to a nearly unbounded set of applications. Using osquery, Uptycs collects data ranging from the static configuration of systems, to certain network traffic, to the nearly full runtime state, to high-level abstractions like containers. At the dawn of this year, we were truly excited to see what our customers would want to do, or want us to do, with all that data. Perhaps they would want massive scale, with queries returning data instantaneously across an infrastructure spanning half a million machines. Or perhaps the application of machine learning techniques to find anomalies in the environment. Or perhaps this data would open up a new vista that none of us had even considered possible before.
What transpired during the year was simultaneously anticlimactic and humbling. What we learnt from our customers was that even before we tackled machine learning or uncharted vistas, we needed to solve much simpler, utterly unglamorous, but notoriously difficult problems first, and then prove they could be solved at massive scale.
Take, for example, the current state of the art of incident response. Even a basic security framework will do an adequate job of alerting on suspicious network traffic. Yet, answering the simple question – which system had the IP address referenced in the alert at the time of the incident – is notoriously difficult. IP addresses are dynamically allocated and reallocated using DHCP; with the explosion of devices connecting to networks, IP address leases are of short duration, so there is little, if any, temporal affinity of IP address to device.
Current state-of-the-art security frameworks have a SIEM at their heart, collecting event logs from all systems. Unfortunately, to make sense of events, one often needs access to some relatively static piece of state information – for example, the DHCP allocations of IP addresses to devices. So, despite all the sophisticated technologies incident responders have at their disposal, they get stymied by the most basic of questions – which device had this IP address at the time of the incident?
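To make the question concrete, here is a minimal sketch of a point-in-time lookup over a history of IP address allocations. The `Lease` record, its fields, and the sample hostnames are hypothetical illustrations rather than any real DHCP or Uptycs format, and the sketch assumes leases for a given address never overlap.

```python
from bisect import bisect_right
from typing import List, NamedTuple, Optional


class Lease(NamedTuple):
    # Hypothetical record: one allocation of an IP address to a device.
    start: int   # Unix time the lease began (inclusive)
    end: int     # Unix time the lease ended (exclusive)
    ip: str
    device: str  # illustrative device identifier, e.g. a hostname


def device_at(leases: List[Lease], ip: str, when: int) -> Optional[str]:
    """Return the device that held `ip` at time `when`, or None if nobody did."""
    # Keep only this address's history, ordered by lease start time.
    history = sorted(lease for lease in leases if lease.ip == ip)
    starts = [lease.start for lease in history]
    # Find the last lease that started at or before `when`.
    i = bisect_right(starts, when) - 1
    if i >= 0 and when < history[i].end:
        return history[i].device
    return None


leases = [
    Lease(1000, 2000, "10.0.0.5", "laptop-a"),
    Lease(2000, 2600, "10.0.0.5", "phone-b"),
    Lease(3000, 4000, "10.0.0.5", "laptop-c"),
]

print(device_at(leases, "10.0.0.5", 2100))  # phone-b
print(device_at(leases, "10.0.0.5", 2800))  # None (address was unallocated)
```

The lookup itself is trivial. The hard part is having the allocation history recorded at all, which an event-only pipeline typically does not.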
A similar situation occurs when capturing events from hosts and endpoints. Audit logs contain accurate and critical data about events on the system, and there are plenty of technologies that will process these event logs and fire legitimate alerts. Yet, they often cannot answer the most basic question – which user had this user-id (UID) at the time of this event? This is because the contents of user tables on a system are not events; much like the IP-address-to-device mapping table at any given point in time, they are an important piece of state information about the infrastructure.
Yet another intuitively simple, yet notoriously difficult, problem was presented by a customer that was trying to capture file events related to files being copied/modified/deleted from USB drives. USB drive insertions are events that can be captured using several existing solutions. File monitoring is also a common security control, and a number of solutions are available. So, what’s the problem? The challenge is that when USB drives are inserted or removed, the events captured have details about the logical device being added/removed from the system.
File monitoring software, on the other hand, has details about the file-system paths and names of the files being monitored. Missing from the picture is the file-system-to-hardware-device mapping table. Which filesystem path refers to the USB drive that was just inserted or removed? This mapping is available in the mounts table or drive letter allocation table, but if you are only collecting events – device insertion/removal events and file modification events – this critical mapping table will be missing, and the correlation of the two events will prove to be an unsolvable problem.
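As a rough sketch of why the mapping table matters, the snippet below attributes file events to a USB drive by looking up each file path in a snapshot of the mounts table. The data shapes, the device names, and the assumption that an insertion event names the same logical device that appears in the mounts table are all simplifications for illustration.

```python
from typing import Dict, Optional


def device_for_path(path: str, mounts: Dict[str, str]) -> Optional[str]:
    """Return the device whose mount point is the longest prefix of `path`."""
    best = None
    for mount_point in mounts:
        # Compare on a path-component boundary so /media/usb0 does not
        # match /media/usb01/...
        prefix = mount_point.rstrip("/") + "/"
        if path == mount_point or path.startswith(prefix):
            if best is None or len(mount_point) > len(best):
                best = mount_point
    return mounts[best] if best is not None else None


# Snapshot of mount point -> logical device (hypothetical sample data).
mounts = {"/": "/dev/sda1", "/media/usb0": "/dev/sdb1"}

# Logical devices reported by USB insertion events (hypothetical).
usb_devices = {"/dev/sdb1"}

file_events = [
    {"path": "/media/usb0/report.docx", "action": "CREATED"},
    {"path": "/home/al/notes.txt", "action": "UPDATED"},
]

on_usb = [e for e in file_events if device_for_path(e["path"], mounts) in usb_devices]
print(on_usb)  # only the event under /media/usb0
```

Without the `mounts` snapshot, the two event streams share no common key: one speaks in devices, the other in paths.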
Osquery resolves this conundrum by ending the separation of event data from context data. It is equally good at handling both types of data and presenting all of it as a set of standardized SQL tables. When combined with a powerful storage and analytics platform like Uptycs, incident responders finally have all the pieces in one place. IP addresses are available via the interface_addresses table, with further interface information in interface_details. Process events are available in the process_events table, and the list of user IDs is available in the, well, users table. Device information is available in the usb_devices table, and mount point information is available in the mounts table. And so on.
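Once events and state share the same SQL surface, the "which user had this UID?" question becomes an ordinary join. The sketch below uses Python's built-in `sqlite3` module with two stand-in tables that carry only a few columns modeled on the `process_events` and `users` tables; the rows are made up, and the real osquery schema has many more columns.

```python
import sqlite3

# In-memory stand-ins for two osquery-style tables (reduced column sets).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (uid INTEGER, username TEXT);
CREATE TABLE process_events (pid INTEGER, path TEXT, uid INTEGER, time INTEGER);
INSERT INTO users VALUES (0, 'root'), (1001, 'alice');
INSERT INTO process_events VALUES
    (4242, '/usr/bin/curl', 1001, 1545000000),
    (4243, '/bin/sh', 0, 1545000005);
""")

# Enrich each event (what happened) with state (who that UID was).
rows = conn.execute("""
    SELECT pe.time, pe.path, u.username
    FROM process_events AS pe
    JOIN users AS u ON u.uid = pe.uid
    ORDER BY pe.time
""").fetchall()

for row in rows:
    print(row)
```

Note that this joins against the current contents of the users table. Answering the question for a past event requires the users table as it stood at the time of that event.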
The main remaining challenge is to have an analytics platform that continuously stores the contents of all these tables from all the hosts and endpoints in an infrastructure, and can then quickly and reliably recreate the full state of those tables at any given point in recorded history.
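One way to picture this is as a replay of row-level changes. The sketch below rebuilds a table's contents at a chosen time from a log of "added" and "removed" rows, loosely modeled on the differential results osquery emits for scheduled queries. The tuple layout and helper names are hypothetical, and a platform operating at scale would presumably rely on snapshots and indexing rather than a full replay; this is not a description of how Uptycs implements it.

```python
from typing import Iterable, Set, Tuple

Row = Tuple[Tuple[str, str], ...]  # a table row as sorted (column, value) pairs
Change = Tuple[int, str, Row]       # (unix_time, "added" | "removed", row)


def make_row(**cols: str) -> Row:
    """Build a hashable row from column=value keyword arguments."""
    return tuple(sorted(cols.items()))


def state_at(changes: Iterable[Change], when: int) -> Set[Row]:
    """Replay changes up to and including `when` to rebuild the table."""
    state: Set[Row] = set()
    # sorted() is stable, so changes with equal timestamps keep log order.
    for ts, action, r in sorted(changes, key=lambda c: c[0]):
        if ts > when:
            break
        if action == "added":
            state.add(r)
        elif action == "removed":
            state.discard(r)
    return state


# Hypothetical change log for a users table: UID 1001 is reassigned at t=300.
changes = [
    (100, "added", make_row(uid="1001", username="alice")),
    (200, "added", make_row(uid="1002", username="bob")),
    (300, "removed", make_row(uid="1001", username="alice")),
    (300, "added", make_row(uid="1001", username="carol")),
]

print(state_at(changes, 250))  # alice and bob
print(state_at(changes, 300))  # bob and carol
```

An event at time 250 carrying UID 1001 resolves to alice, while the same UID at time 300 resolves to carol, which is exactly the distinction an event-only log cannot make.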
Which, of course, is precisely what we ended up working on through the year at Uptycs: reliably storing the entire system state, continuously and from across a very large infrastructure, and then being able to retrieve any subset of that state quickly. (Along with addressing the massive scale component, too. Looks like uncharted vistas will have to be on the 2019 roadmap...)
And so, from all of us here at Uptycs, we offer our Christmas gift – the ability to retrieve any osquery table, from any point in time, quickly – including the IP-address-to-hostname table, the device-type-to-mounts table, and of course, the You-sers_table.
Merry Christmas!
Related osquery resources: