I never tried it, but I expect the performance would not be good enough; it already takes several minutes with code written specifically to perform the calculations I am interested in. And because I don't know exactly what I am looking for, I need more or less interactive speed so that I can try out many different ways of looking at the data. But maybe I could use [materialized] views to convey enough information to the query planner about how to carry out the calculations efficiently, or maybe I am underestimating how good query planners are. I just have a gut feeling that performing a lot of aggregation will make a database do a lot of unnecessary work. But maybe I should, and will, try loading the data into SQL Server and see what happens.
The other thing is that SQL does not seem like the best fit to me. Say you just want to know how many events occurred in each hour of the last three months: that is straightforward grouping and counting at first, but already rounding the timestamps down to the hour is not as obvious as it should be. And if there was no event in a specific hour, your result will simply have no row for that hour instead of a row saying there were zero events. This in turn causes more trouble if you want to build a histogram showing in how many hours there were, say, 0 to 9, 10 to 19, 20 to 29, and so on, events. Certainly still doable in SQL, but we are already entering territory where writing a single query will take most people several hours to get the desired result.
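To make the missing-rows problem concrete, here is a minimal sketch using Python's sqlite3 with a tiny made-up event table (the table name, timestamps, and date range are all invented for illustration; a SQL Server version would differ in its date functions). The naive GROUP BY silently drops the empty hour, while generating the hours with a recursive CTE and LEFT JOINing the events onto them yields an explicit zero row:

```python
import sqlite3

# Hypothetical in-memory event table standing in for the real log.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE events (ts TEXT)")
con.executemany(
    "INSERT INTO events VALUES (?)",
    [("2024-01-01 10:15:00",), ("2024-01-01 10:45:00",), ("2024-01-01 12:05:00",)],
)

# Naive approach: truncate each timestamp to the hour and count.
# The 11:00 hour had no events, so it is simply missing from the result.
naive = con.execute(
    "SELECT strftime('%Y-%m-%d %H:00:00', ts) AS hour, COUNT(*) "
    "FROM events GROUP BY hour ORDER BY hour"
).fetchall()

# Gap-filled approach: generate every hour in the range with a
# recursive CTE and LEFT JOIN the events onto it, so empty hours
# show up with a count of zero.
filled = con.execute("""
    WITH RECURSIVE hours(hour) AS (
        SELECT '2024-01-01 10:00:00'
        UNION ALL
        SELECT strftime('%Y-%m-%d %H:00:00', datetime(hour, '+1 hour'))
        FROM hours WHERE hour < '2024-01-01 12:00:00'
    )
    SELECT h.hour, COUNT(e.ts)
    FROM hours h
    LEFT JOIN events e ON strftime('%Y-%m-%d %H:00:00', e.ts) = h.hour
    GROUP BY h.hour ORDER BY h.hour
""").fetchall()
```

The gap-filled result is also the natural input for the histogram of "how many hours had 0 to 9, 10 to 19, ... events", since the zero-count hours now exist as rows.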
I also couldn't easily tell how to express calculating the 99th percentile of the event size for every day of the week and hour of the day. I am pretty sure it is possible, but I guess it would also be pretty unreadable unless you put quite a bit of effort into creating utility functions instead of hacking together one huge SQL statement. Then again, I don't really know much about the more recent SQL features for partitioning and aggregating; maybe I should have a closer look at those first.
Right now it is just an effort to develop a tool to diagnose, and hopefully thereafter fix, random performance problems we are experiencing with one of our applications in production. Despite having a small team dedicated to investigating the problems, monitoring every click and function call with Dynatrace, having had a Microsoft SQL Server expert look into it, and getting the system audited by one of the big consulting companies, the problem has persisted for years and nobody really has any clue what is going wrong.
The performance is never really great. It is [one of] the central applications of the company and depends on interaction with a sizable chunk of a system landscape developed over decades, and is therefore prone to being affected by incidents in a lot of systems, but most of the time it is good enough. But once every couple of weeks or months something goes badly wrong and requests (it is a web application) start taking several seconds or even minutes to complete. Minutes later everything is back to normal.
But I digress. If I managed to come up with a reusable and somewhat general tool to analyze data similar to what I am looking at, I would consider releasing it. It could either be a somewhat general data analysis and visualization tool, think R, or it could be tailored more specifically towards looking for anomalies in data sets like the one I am investigating. But as of now I am struggling to come up with a general framework to express the analyses I am performing, and therefore all I have is a rather ad hoc collection of transformations that extract and visualize aspects of the data that could lead to new insights into what is going on.
But right now it is really driven by our specific issue: I notice something in one view of the data and then come up with a new transformation to look at it in more detail or from a different angle. It is nothing that could easily be reused by anyone else, so for the moment it seems most likely that this will never become public, or maybe only in the form of a blog article explaining what kind of information might be useful to look at and how to derive it from logs that look rather uninteresting at first glance.