Show HN: Log collector that runs on a $4 VPS (github.com/nevin1901)
118 points by Nevin1901 on Feb 11, 2023 | hide | past | favorite | 62 comments
Hey guys, I'm building erlog to try and solve problems with logging. While trying to add logs to my application, I couldn't find any lightweight log platform which was easy to set up without adding tons of dependencies to my code, or configuring 10,000 files.

ErLog is just a simple Go web server which batch-inserts JSON logs into an SQLite database. Through tuning SQLite and batching inserts, I find I can get around 8k log insertions/sec, which is fast enough for small projects.
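
For anyone curious, the batching approach described above (one transaction per batch plus write-friendly pragmas) can be sketched roughly like this; the table shape and pragma choices are my assumptions for illustration, not erlog's actual code:

```python
import json
import sqlite3

def open_log_db(path=":memory:"):
    """Open SQLite tuned for write throughput (WAL journal, relaxed fsync)."""
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA journal_mode=WAL")
    conn.execute("PRAGMA synchronous=NORMAL")
    conn.execute("CREATE TABLE IF NOT EXISTS logs (ts TEXT, body TEXT)")
    return conn

def insert_batch(conn, logs):
    """Insert a batch of JSON logs in a single transaction.

    The win comes from committing once per batch rather than once per row:
    each COMMIT is an fsync, so per-row inserts are fsync-bound.
    """
    with conn:  # one transaction per batch
        conn.executemany(
            "INSERT INTO logs (ts, body) VALUES (?, ?)",
            [(log.get("ts"), json.dumps(log)) for log in logs],
        )

conn = open_log_db()
insert_batch(conn, [{"ts": "2023-02-11T00:00:00Z", "level": "info", "msg": "hello"}])
print(conn.execute("SELECT COUNT(*) FROM logs").fetchone()[0])
```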

This is just an MVP, and I plan to add more features once I talk to users. If anyone has any problems with logging, feel free to leave a comment and I'd love to help you out.



I’ve found the hard part is not so much the collection of logs (especially at this scale), but the eventual querying. If you’ve got an unknown set of fields being logged, queries very quickly devolve into lots of slow table scans or needing materialised views that start hampering your ingest rate.

I settled on a happy/ok midpoint recently whereby I dump logs into a Redis queue using Filebeat, as it’s very simple. Then a really simple queue consumer dumps the logs into ClickHouse using a schema Uber detailed (split keys and values), so queries can be pretty quick even over arbitrary fields. 30,000 logs an hour, and I can normally search for anything in under a second.
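
The "split keys and values" part is worth spelling out: each structured log gets flattened into two parallel arrays, so the table only needs fixed columns regardless of which fields show up. A minimal sketch (the function name and stringification are illustrative, not the actual Uber schema):

```python
def split_kv(log: dict):
    """Flatten a structured log into parallel key/value arrays --
    the shape used for schema-flexible columnar storage."""
    keys, values = [], []
    for k in sorted(log):          # stable order makes rows comparable
        keys.append(k)
        values.append(str(log[k]))  # everything stored as strings
    return keys, values

keys, values = split_kv({"level": "error", "user_id": "42", "msg": "boom"})
print(keys)    # sorted field names
print(values)  # stringified field values, same order
```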


I've a similar pipeline to yours (for the storage part). I use vector.dev for collecting and aggregating logs, enriching them with metadata (cluster, env, AWS tags), and finally storing them in a ClickHouse table.

Do you use any particular UI/Frontend tool for querying these logs?


I've got that same logging backend, and I use Metabase for querying. It's far cleaner and easier to use/learn than Kibana or Graylog.

I've also considered Grafana, but it's not good for viewing raw logs.


Interesting. In my experience, Metabase gets clunkier when you have to run ad-hoc log queries. However, it's a great fit when it's just a template and all the end user has to do is enter some variables to see the logs.

I think LogQL + the Grafana UI is a pretty rich/interactive experience for viewing logs vs. Metabase.


I'd take a regular tabular view for logs, like this,

https://www.metabase.com/glossary/images/table/example-table...

over this brain-crushing string-spaghetti view any day.

https://grafana.com/static/img/docs/v64/logs-panel.png

I can easily colorize lines depending on log severity, filter by dates with a drop-down date selector UI, filter by each column value, etc., all in an eye-friendly manner.

https://pasteboard.co/PQDjUwtFgZTW.png

https://pasteboard.co/tfHHA1mgETvL.png


We already have a back office tool, so there’s just an extra screen that has query builder and outputs. Every few months I’ll tweak and add bits but nice to be able to see a user or entity ID and click on it to view the full resource.


Got it, thanks.


What are the hardware requirements / resources dedicated to pull this off?


Runs off a 2GB digital ocean box, which I think is $10 now?

It’s probably incredibly boring to describe, but I think that’s why it just tends to work. The whole thing took an afternoon to write (in PHP of all things too).


This genuinely sounds the opposite of boring to me. I'd love to read a full, detailed description of this, hacky PHP scripts included!



Can you share more information about the schema you're mentioning? Thank you!


Not OP but they might be referring to Uber moving from ES to ClickHouse to store their schema-flexible, structured logs, mostly to improve ingestion performance: https://archive.is/bFsTF / https://www.uber.com/blog/logging/

The gist of it is:

- Structured logs (JSON) are stored as kv pairs in parallel arrays, alongside metadata (host, timestamp, id, geo, namespace, etc).

- Log fields (i.e. kv pairs) are materialized (indexed) depending on query patterns, and vacuumed if unused.

- Authoring queries and Kibana dashboard support is not trivial but handled with a query translation layer.


What do you mean by parallel arrays here?

Do you mean something like two arrays [k1, ..., kN] and [v1, ..., vN] in two different columns?

Is there a way in Clickhouse to filter such a pair of arrays such that you can do a search akin to vals[indexOfKey("foo")] == "bar"?


Yep. If you read the blog post I linked to, it talks a tonne about ClickHouse and what it can do (like indexOf, for example).
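
To the question above: ClickHouse's `indexOf` returns a 1-based position and 0 when the element is missing, so `values[indexOf(keys, 'foo')] = 'bar'` does work as a filter over the parallel arrays. The lookup semantics can be mimicked in a few lines (Python here purely to illustrate the idea, not ClickHouse itself):

```python
def index_of(arr, x):
    """ClickHouse-style indexOf: 1-based position, 0 if absent."""
    try:
        return arr.index(x) + 1
    except ValueError:
        return 0

def kv_match(keys, values, key, value):
    """Roughly values[indexOf(keys, key)] = value; position 0 never matches."""
    i = index_of(keys, key)
    return i != 0 and values[i - 1] == value

keys = ["level", "foo", "msg"]
values = ["error", "bar", "boom"]
print(kv_match(keys, values, "foo", "bar"))    # matching key and value
print(kv_match(keys, values, "foo", "baz"))    # key present, value differs
print(kv_match(keys, values, "missing", "x"))  # key absent
```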


Ah, of course. Thanks!


Sorry, but I don't see the selling point yet. Rsyslog has an omlibdbi module that sends your data to SQLite. It can consume pretty much any standard protocol on input, is already available, and is battle-proven.


Or just keep it in the log file? I'm not sure what the advantage of putting it in SQLite is, if all you're going to do is run unindexed `json_extract()` queries on it.


Or syslog-ng? And syslog is crazy easy to integrate into nearly any code.


I see your idea but you could drop the JSON and use rsyslogd + logrotate + grep? You can grep 10 gig files on a $5 VPS easily and quickly! I can't speak for a $4 one ;)


If you use grep you'll be doing the same expensive operation every time, following files naively will fail after rotation, etc. And if you use something like Loki, it's easier to integrate with other tools to react to the logs.


It’s potentially a premature optimisation to avoid doing that expensive operation every time. Loki and its brethren have a significant infrastructure cost and cognitive load to consider. I speak from experience and know where the ROI appears, and it’s far from the use case specified here.


> following files naively will fail after rotation,

...so what you're saying is they have to write "tail -F" instead of "tail".

> If you use grep you'll be doing the same expensive operation every time

If you have ingest that low, it barely matters. Modern grep replacements are pretty fast.


Why do people like to stick to inefficient, ancient methods like grep for log viewing?

Try tools like Metabase and see how it makes your log reading far better.


You could have just used Filebeat? It's also in Go and it's pretty easy to use.

https://www.elastic.co/guide/en/beats/filebeat/current/fileb...


I think Vector really shines with its VRL language to parse and enrich data. It's well thought out with buffering for network errors and throwing errors on parsing instead of silently discarding.

https://vector.dev/docs/reference/vrl/


This is exactly why I built log-store. It can easily handle 60k logs/sec, but more important, I think, is the query interface: commands to help you extract value from your logs, including custom commands written in Python.

Free through '23 is my motto... Just a solo founder looking for feedback.


I came across this a few months ago and have been following it pretty closely. Using it locally in a Docker container has been painless. The UI is definitely iterating quickly, but the time-to-first-log was impressive! Happy to keep using it.


Disclaimer: I am friends with the founder of log-store.

I have been beta testing it for a while for small-scale (~50 million non-nested JSON objects) log aggregation, and it's working beautifully for this case.

It's a no-nonsense solution that is seamless to integrate and operate. On the ops side, it's painless to set up, maintain, and push logs to. On the user side, it's extremely fast and straightforward. End users are not fumbling their way through a monster UI like Kibana; access to the information they need is straightforward and uncluttered.

I can't speak to its suitability in a 1TB-of-logs/day situation, but as a small-scale, straightforward log aggregation tool I can't recommend it enough.


log-store [1] is pretty neat. Thanks for making it. It's super powerful and easy to use. There's a learning curve with the query language, but it's super cool once you figure it out.

[1] https://log-store.com/


May be more widely applicable for personal servers: lnav, an advanced log file viewer for the terminal: https://lnav.org/

It uses SQLite internally but can parse log files in many formats on the fly. C++, BSD license, discussed 1 month ago: https://news.ycombinator.com/item?id=34243520


If anyone's looking for similar services: I'm using vector.dev to move logs around, and it works great and has a ton of sources/destinations pre-configured.


I feel like if you're going to use "$4 VPS" as a quantifier, you could at least specify which $4 VPS is being used.


Look at this:

https://www.hetzner.com/cloud

More like $5 but still, 1 vCPU, 2GB RAM, 20GB NVMe storage. Closer to $4 USD if you let go of IPv4 in favor of IPv6 only.

Edit: Looks like that's also a shared vCPU.


DO's 512mb basic VPS starts at $4, so I am guessing it is that.


I don't think it is. That one is shared vCPU and I've been hearing about a single vCPU one.


Neat! Have you considered using query params instead of bodies, then just piping the access logs to a spool (no program actually on the server; just return an empty file)? Then your program can just read from the spool and dump the logs into SQLite.

That should tremendously improve throughput, at the expense of some latency.
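
The parsing half of that idea is simple: pull the query string back out of each access-log line. A sketch, assuming a combined-log-style format (the line layout here is an assumption for illustration, not anything erlog-specific):

```python
from urllib.parse import urlsplit, parse_qs

def parse_access_line(line: str):
    """Recover log fields from the query string of an access-log request line.

    Assumes a combined-log-style line, e.g.:
    1.2.3.4 - - [11/Feb/2023:10:00:00 +0000] "GET /log?level=info&msg=hi HTTP/1.1" 200 0
    """
    request = line.split('"')[1]      # 'GET /log?level=info&msg=hi HTTP/1.1'
    path = request.split()[1]         # '/log?level=info&msg=hi'
    qs = parse_qs(urlsplit(path).query)
    return {k: v[0] for k, v in qs.items()}

line = '1.2.3.4 - - [11/Feb/2023:10:00:00 +0000] "GET /log?level=info&msg=hi HTTP/1.1" 200 0'
print(parse_access_line(line))  # {'level': 'info', 'msg': 'hi'}
```

A reader process would tail the spooled access log, parse each line like this, and batch-insert the resulting dicts into SQLite.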


That's a really good idea, thanks for suggesting it. I'll try implementing it. I'm hoping the main bottleneck is with inserting the logs into SQLite, so using a spool might help.


I'm doing something similar with a $5 VPS, but with fastcgi/c++/sqlite3. I then have a cronjob that aggregates error logs, generates a summary, and posts it to a Slack channel. Personally I wish I didn't have to write it, but it works.


One of my eventual goals with erlog is actually doing observability (eg: it'll send you reports if logs/metrics deviate from the norm), so it's really interesting to see you had this problem.


Imagine what we could do with modern hardware if programs were as efficient as your typical C++/SQLite combo!


I have been using https://datalust.co/ to handle this. It scales down and up really well with how much you want to spend. It comes with existing integrations for a lot of libraries and formats, and a CLI to push data from file-based logs to their service.

They have just added a new parser and query engine written in Rust to get the best performance out of your instance. https://news.ycombinator.com/item?id=34758674


Second this; Seq is incredibly handy and easy to query. Performance could be better, though.


I have been using vector.dev for a long time now. It is also easy to set up. And it looks similar to your idea.


...uh, just rsyslog and files? I think it can even write to SQLite.


Woah, cool. I did the same thing: I made a poor man's small-scale Splunk replacement with SQLite, JSON, and Go. I used the built-in JSON and full-text search extensions.


I run a self-hosted version of Sentry.io on a NUC at home and a relay on a VPS, then use Tailscale to connect the two.

If you have an old computer at home, using a VPS as the gateway is always a good option.

Edit: you can then use the VPS as an exit node for internet traffic.


Ah cool! Somewhat related I built a json log query tool recently using rust and SQLite. Didn’t build the server part of it

https://github.com/hamin/jlq


That's really cool. I might rewrite some of your code in go and use it in erlog for searching (and give you credit of course).

How did you come up with the idea for jlq? It seems like it solved a pretty cool use case.


I'm working on a project where I'm handling simultaneous connections to a bunch of peers. What's the best way to log messages to trace the flow of requests through my system when multiple code paths are running asynchronously (NodeJS, so I can't simply get a thread ID)?


With tracing libraries, e.g. https://opentracing.io/


That and some backend... most are SaaS though. The only self hosted one I know of is Grafana Traces/Tempo.

I mean, you can just log the trace/span/parent IDs for each request, but that's a bit painful to deal with later.
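
For what it's worth, "just log the IDs" mostly comes down to keeping a context-local request/trace ID that follows each async task; in Node that's what AsyncLocalStorage is for. The pattern, sketched here with Python's contextvars (an illustration of the idea, not Node code):

```python
import asyncio
import contextvars
import uuid

# A context-local request ID that follows each async task -- the same role
# AsyncLocalStorage plays in Node.
request_id = contextvars.ContextVar("request_id", default="-")

def tag(msg):
    """Prefix a message with the current task's request ID."""
    return f"[{request_id.get()}] {msg}"

out = []

async def handle_peer(name):
    request_id.set(uuid.uuid4().hex[:8])  # new ID per logical request
    out.append(tag(f"connected to {name}"))
    await asyncio.sleep(0)                # yield so peers interleave
    out.append(tag(f"done with {name}"))

async def main():
    # gather() runs each coroutine in its own task with its own copy of the
    # context, so the IDs don't leak between peers.
    await asyncio.gather(handle_peer("alice"), handle_peer("bob"))

asyncio.run(main())
for line in out:
    print(line)
```

Even though the two handlers interleave, each peer's two log lines carry the same ID, and the two peers' IDs differ, which is what makes the flow traceable later.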


jaeger + opentracing/opentelemetry libs are easy enough. When testing, Jaeger can just use an in-memory database, or you can point it at some other storage like Elasticsearch.


I strongly urge people to try something like Application Insights. It's not dirt cheap, but not that expensive, and lets you collect anything you'd want and query your telemetry/logs retroactively extremely flexibly. It's just great.


You could also not write your own server. Just configure OpenResty and write some simple Lua to push to the Redis queue. Then consume the queue via your language of choice to write to your store (ClickHouse).


Logs must be stored in S3; it's a no-brainer. Disk storage is too expensive. A logging system should be designed for S3 from the ground up, IMO.


How is S3 the cheapest option? Backblaze is $0.005 per GB and Hetzner sells storage boxes for less than €0.0038 per GB.


AWS is almost never the answer unless you work at making it cheap and work for you. It's a black hole of insanely convoluted billing and filled with snake oil salesmen (dev ops priests) that'll complicate it so much that you don't even know what you're paying for or why you even need it. And S3 is just a gateway drug into this whole mess, stay far away from it kids.


I'm talking about the S3 API. Every cloud I'm aware of provides S3-compatible object storage. And this storage is much cheaper than VM-attached storage.


I'm talking about storage volumes which are attached to the VM versus object storage which provides S3-compatible API.

You can use S3 API to access Backblaze.

I'm not experienced with the Hetzner Storage Box, but I don't think you can just attach it to your VM as fast storage. You can mount it as a Samba share, but I think that's a recipe for disaster.


Replace “S3” with “S3-compatible object storage”.


I think he was being sarcastic



