Show HN: Log collector that runs on a $4 VPS (github.com/nevin1901)
118 points by Nevin1901 on Feb 11, 2023 | hide | past | favorite | 62 comments
Hey guys, I'm building erlog to try and solve problems with logging. While trying to add logs to my application, I couldn't find any lightweight log platform which was easy to set up without adding tons of dependencies to my code, or configuring 10,000 files.

ErLog is just a simple Go web server which batch-inserts JSON logs into an SQLite database. Through tuning SQLite and batching inserts, I find I can get around 8k log insertions/sec, which is fast enough for small projects.
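
For anyone curious, the batching approach described above (one transaction per batch plus write-friendly pragmas) can be sketched roughly like this; the table shape and pragma choices are my assumptions for illustration, not erlog's actual code:

```python
import json
import sqlite3

def open_log_db(path=":memory:"):
    """Open SQLite tuned for write throughput (WAL journal, relaxed fsync)."""
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA journal_mode=WAL")
    conn.execute("PRAGMA synchronous=NORMAL")
    conn.execute("CREATE TABLE IF NOT EXISTS logs (ts TEXT, body TEXT)")
    return conn

def insert_batch(conn, logs):
    """Insert a batch of JSON logs in a single transaction.

    The win comes from committing once per batch rather than once per row:
    each COMMIT is an fsync, so per-row inserts are fsync-bound.
    """
    with conn:  # one transaction per batch
        conn.executemany(
            "INSERT INTO logs (ts, body) VALUES (?, ?)",
            [(log.get("ts"), json.dumps(log)) for log in logs],
        )

conn = open_log_db()
insert_batch(conn, [{"ts": "2023-02-11T00:00:00Z", "level": "info", "msg": "hello"}])
print(conn.execute("SELECT COUNT(*) FROM logs").fetchone()[0])
```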

This is just an MVP, and I plan to add more features once I talk to users. If anyone has any problems with logging, feel free to leave a comment and I'd love to help you out.



I’ve found the hard part is not so much the collection of logs (especially at this scale), but the eventual querying. If you’ve got an unknown set of fields being logged, queries very quickly devolve into lots of slow table scans or needing materialised views that start hampering your ingest rate.

I settled on a happy/ok midpoint recently whereby I dump logs into a Redis queue using Filebeat, as it’s very simple. Then a really simple queue consumer dumps the logs into ClickHouse using a schema Uber detailed (split keys and values), so queries can be pretty quick even over arbitrary fields. 30,000 logs an hour, and I can normally search for anything in under a second.
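
The "split keys and values" part is worth spelling out: each structured log gets flattened into two parallel arrays, so the table only needs fixed columns regardless of which fields show up. A minimal sketch (the function name and stringification are illustrative, not the actual Uber schema):

```python
def split_kv(log: dict):
    """Flatten a structured log into parallel key/value arrays --
    the shape used for schema-flexible columnar storage."""
    keys, values = [], []
    for k in sorted(log):          # stable order makes rows comparable
        keys.append(k)
        values.append(str(log[k]))  # everything stored as strings
    return keys, values

keys, values = split_kv({"level": "error", "user_id": "42", "msg": "boom"})
print(keys)    # sorted field names
print(values)  # stringified field values, same order
```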


I've a similar pipeline to yours (for the storage part). I use vector.dev for collecting and aggregating logs, enriching them with metadata (cluster, env, AWS tags), and finally storing them in a ClickHouse table.

Do you use any particular UI/Frontend tool for querying these logs?


I've got that same logging backend, and I use Metabase for querying. It's far cleaner and easier to use/learn than Kibana or Graylog.

I've also considered Grafana, but it's not good for viewing raw logs.


Interesting. In my experience, Metabase gets clunkier when you have to run ad-hoc log queries. However, it's a great fit when it's just a template and all the end user has to do is enter some variables to see the logs.

I think LogQL + the Grafana UI is a pretty rich/interactive experience for viewing logs vs. Metabase.


I'd take a regular tabular view for logs, like this,

https://www.metabase.com/glossary/images/table/example-table...

over this brain-crushing string-spaghetti view any day.

https://grafana.com/static/img/docs/v64/logs-panel.png

I can easily colorize lines depending on log severity, filter by dates with a drop-down date selector UI, filter by each column value, etc., all in an eye-friendly manner.

https://pasteboard.co/PQDjUwtFgZTW.png

https://pasteboard.co/tfHHA1mgETvL.png


We already have a back office tool, so there’s just an extra screen that has query builder and outputs. Every few months I’ll tweak and add bits but nice to be able to see a user or entity ID and click on it to view the full resource.


Got it, thanks.


What are the hardware requirements / resources dedicated to pull this off?


Runs off a 2GB digital ocean box, which I think is $10 now?

It’s probably incredibly boring to describe, but I think that’s why it just tends to work. The whole thing took an afternoon to write (in PHP of all things too).


This genuinely sounds the opposite of boring to me. I'd love to read a full, detailed description of this, hacky PHP scripts included!



Can you share more information about the schema you're mentioning? Thank you!


Not OP but they might be referring to Uber moving from ES to ClickHouse to store their schema-flexible, structured logs, mostly to improve ingestion performance: https://archive.is/bFsTF / https://www.uber.com/blog/logging/

The gist of it is:

- Structured logs (JSON) are stored as kv pairs in parallel arrays, alongside metadata (host, timestamp, id, geo, namespace, etc).

- Log fields (i.e. kv pairs) are materialized (indexed) depending on query patterns, and vacuumed if unused.

- Authoring queries and Kibana dashboard support is not trivial but handled with a query translation layer.


What do you mean by parallel arrays here?

Do you mean something like two arrays [k1, ..., kN] and [v1, ..., vN] in two different columns?

Is there a way in Clickhouse to filter such a pair of arrays such that you can do a search akin to vals[indexOfKey("foo")] == "bar"?


Yep. If you read the blog post I linked to, it talks a tonne about ClickHouse and what it can do (like indexOf, for example).
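
To the question above: ClickHouse's `indexOf` returns a 1-based position and 0 when the element is missing, so `values[indexOf(keys, 'foo')] = 'bar'` does work as a filter over the parallel arrays. The lookup semantics can be mimicked in a few lines (Python here purely to illustrate the idea, not ClickHouse itself):

```python
def index_of(arr, x):
    """ClickHouse-style indexOf: 1-based position, 0 if absent."""
    try:
        return arr.index(x) + 1
    except ValueError:
        return 0

def kv_match(keys, values, key, value):
    """Roughly values[indexOf(keys, key)] = value; position 0 never matches."""
    i = index_of(keys, key)
    return i != 0 and values[i - 1] == value

keys = ["level", "foo", "msg"]
values = ["error", "bar", "boom"]
print(kv_match(keys, values, "foo", "bar"))    # matching key and value
print(kv_match(keys, values, "foo", "baz"))    # key present, value differs
print(kv_match(keys, values, "missing", "x"))  # key absent
```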


Ah, of course. Thanks!


Sorry, but I don't see the selling point yet. Rsyslog has an omlibdbi module that sends your data to SQLite. It can consume pretty much any standard protocol on input, is already available, and is battle-proven.


Or just keep it in the log file? I'm not sure what the advantage of putting it in SQLite is, if all you're going to do is run unindexed `json_extract()` queries on it.


Or syslog-ng? And syslog is crazy easy to integrate into nearly any code.


I see your idea but you could drop the JSON and use rsyslogd + logrotate + grep? You can grep 10 gig files on a $5 VPS easily and quickly! I can't speak for a $4 one ;)


If you use grep you'll be doing the same expensive operation every time, following files naively will fail after rotation, etc. And if you use something like Loki, it's easier to integrate with other tools to react to the logs.


It’s potentially a premature optimisation to avoid doing that expensive operation every time. Loki and its brethren have a significant infrastructure cost and cognitive load to consider. I speak from experience and know where the ROI appears, and it’s far from the use case specified here.


> following files naively will fail after rotation,

...so what you're saying is they have to write "tail -F" instead of "tail".

> If you use grep you'll be doing the same expensive operation every time

If you have ingest that low, it barely matters. Modern grep replacements are pretty fast.


Why do people like to stick to inefficient, ancient methods like grep for log viewing?

Try tools like Metabase and see how it makes your log reading far better.


You could have just used Filebeat? It's also in Go and it's pretty easy to use.

https://www.elastic.co/guide/en/beats/filebeat/current/fileb...


I think Vector really shines with its VRL language to parse and enrich data. It's well thought out with buffering for network errors and throwing errors on parsing instead of silently discarding.

https://vector.dev/docs/reference/vrl/


This is exactly why I built log-store. It can easily handle 60k logs/sec, but more important, I think, is the query interface: commands to help you extract value from your logs, including custom commands written in Python.

Free through '23 is my motto... Just a solo founder looking for feedback.


I came across this a few months ago and have been following it pretty closely. Using it locally in a Docker container has been painless. The UI is definitely iterating quickly, but the time-to-first-log was impressive! Happy to keep using it.


Disclaimer: I am friends with the founder of log-store.

I have been beta testing it for a while for small-scale (~50 million non-nested JSON objects) log aggregation, and it's working beautifully for this case.

It's a no-nonsense solution that is seamless to integrate and operate. On the ops side, it's painless to set up, maintain, and push logs to. On the user side, it's extremely fast and straightforward. End users are not fumbling their way through a monster UI like Kibana; access to the information they need is straightforward and uncluttered.

I can't speak to its suitability in a 1TB-of-logs/day situation, but as a small-scale, straightforward log aggregation tool I can't recommend it enough.


log-store [1] is pretty neat. Thanks for making it. It's super powerful and easy to use. There's a learning curve with the query language, but it's super cool once you figure it out.

[1] https://log-store.com/


May be more widely applicable for personal servers: lnav, an advanced log file viewer for the terminal: https://lnav.org/

It uses SQLite internally but can parse log files in many formats on the fly. C++, BSD license, discussed 1 month ago: https://news.ycombinator.com/item?id=34243520


If anyone's looking for similar services: I'm using vector.dev to move logs around, and it works great and has a ton of sources/destinations pre-configured.


I feel like if you're going to use "$4 VPS" as a quantifier, you could at least specify which $4 VPS is being used.


Look at this:

https://www.hetzner.com/cloud

More like $5 but still, 1 vCPU, 2GB RAM, 20GB NVMe storage. Closer to $4 USD if you let go of IPv4 in favor of IPv6 only.

Edit: Looks like that's also a shared vCPU.


DO's 512mb basic VPS starts at $4, so I am guessing it is that.


I don't think it is. That one is shared vCPU and I've been hearing about a single vCPU one.


Neat! Have you considered using query params instead of bodies, then just piping the access logs to a spool (no program actually on the server; just return an empty file)? Then your program can just read from the spool and dump the logs into SQLite.

That should tremendously improve throughput, at the expense of some latency.
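
The parsing half of that idea is simple: pull the query string back out of each access-log line. A sketch, assuming a combined-log-style format (the line layout here is an assumption for illustration, not anything erlog-specific):

```python
from urllib.parse import urlsplit, parse_qs

def parse_access_line(line: str):
    """Recover log fields from the query string of an access-log request line.

    Assumes a combined-log-style line, e.g.:
    1.2.3.4 - - [11/Feb/2023:10:00:00 +0000] "GET /log?level=info&msg=hi HTTP/1.1" 200 0
    """
    request = line.split('"')[1]      # 'GET /log?level=info&msg=hi HTTP/1.1'
    path = request.split()[1]         # '/log?level=info&msg=hi'
    qs = parse_qs(urlsplit(path).query)
    return {k: v[0] for k, v in qs.items()}

line = '1.2.3.4 - - [11/Feb/2023:10:00:00 +0000] "GET /log?level=info&msg=hi HTTP/1.1" 200 0'
print(parse_access_line(line))  # {'level': 'info', 'msg': 'hi'}
```

A reader process would tail the spooled access log, parse each line like this, and batch-insert the resulting dicts into SQLite.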


That's a really good idea, thanks for suggesting it. I'll try implementing it. I'm hoping the main bottleneck is with inserting the logs into SQLite, so using a spool might help.


I'm doing something similar with a $5 VPS, but with fastcgi/c++/sqlite3. I then have a cronjob that aggregates error logs, generates a summary, and posts it to a Slack channel. Personally I wish I didn't have to write it, but it works.


One of my eventual goals with erlog is actually doing observability (eg: it'll send you reports if logs/metrics deviate from the norm), so it's really interesting to see you had this problem.


Imagine what we could do with modern hardware if programs were as efficient as your typical C++/SQLite combo!


I have been using https://datalust.co/ to handle this. It scales down and up really well with how much you want to spend. It comes with existing integrations for a lot of libraries and formats, and a CLI to push data from file-based logs to their service.

They have just added a new parser and query engine written in Rust to get the best performance out of your instance. https://news.ycombinator.com/item?id=34758674


Second this; Seq is incredibly handy and easy to query. Performance could be better, though.


I have been using vector.dev for a long time now. It is also easy to set up. And it looks similar to your idea.


...uh, just rsyslog and files? I think it can even write to SQLite.


Woah, cool. I did the same thing: I made a poor man's small-scale Splunk replacement with SQLite, JSON, and Go. I used the built-in JSON and full-text search extensions.


I run a self-hosted version of Sentry.io on a NUC at home and a relay on a VPS, then use Tailscale to connect the two.

If you have an old computer at home, using a VPS as the gateway is always a good option.

Edit: you can then use the VPS as an exit node for internet traffic.


Ah cool! Somewhat related I built a json log query tool recently using rust and SQLite. Didn’t build the server part of it

https://github.com/hamin/jlq


That's really cool. I might rewrite some of your code in go and use it in erlog for searching (and give you credit of course).

How did you come up with the idea for jlq? It seems like it solved a pretty cool use case.


I'm working on a project where I'm handling simultaneous connections to a bunch of peers. What's the best way to log messages to trace the flow of requests through my system when multiple code paths are running asynchronously (NodeJS, so I can't simply get a thread ID)?


With tracing libraries, e.g. https://opentracing.io/


That and some backend... most are SaaS though. The only self hosted one I know of is Grafana Traces/Tempo.

I mean, you can just log the trace/span/parent IDs for each request, but that's a bit painful to deal with later.
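
For what it's worth, "just log the IDs" mostly comes down to keeping a context-local request/trace ID that follows each async task; in Node that's what AsyncLocalStorage is for. The pattern, sketched here with Python's contextvars (an illustration of the idea, not Node code):

```python
import asyncio
import contextvars
import uuid

# A context-local request ID that follows each async task -- the same role
# AsyncLocalStorage plays in Node.
request_id = contextvars.ContextVar("request_id", default="-")

def tag(msg):
    """Prefix a message with the current task's request ID."""
    return f"[{request_id.get()}] {msg}"

out = []

async def handle_peer(name):
    request_id.set(uuid.uuid4().hex[:8])  # new ID per logical request
    out.append(tag(f"connected to {name}"))
    await asyncio.sleep(0)                # yield so peers interleave
    out.append(tag(f"done with {name}"))

async def main():
    # gather() runs each coroutine in its own task with its own copy of the
    # context, so the IDs don't leak between peers.
    await asyncio.gather(handle_peer("alice"), handle_peer("bob"))

asyncio.run(main())
for line in out:
    print(line)
```

Even though the two handlers interleave, each peer's two log lines carry the same ID, and the two peers' IDs differ, which is what makes the flow traceable later.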


jaeger + opentracing/opentelemetry libs are easy enough. When testing, Jaeger can just use an in-memory database, or you can point it at some other storage like Elasticsearch.


I strongly urge people to try something like Application Insights. It's not dirt cheap, but not that expensive, and lets you collect anything you'd want and query your telemetry/logs retroactively extremely flexibly. It's just great.


You could also not write your own server. Just configure OpenResty and write some simple Lua to push to the Redis queue. Then consume the queue via your language of choice to write to your store (ClickHouse).


Logs must be stored in S3; it's a no-brainer. Disk storage is too expensive. A logging system should be designed for S3 from the ground up, IMO.


How is S3 the cheapest option? Backblaze is $0.005 per GB and Hetzner sells storage boxes for less than €0.0038 per GB.


AWS is almost never the answer unless you work at making it cheap and work for you. It's a black hole of insanely convoluted billing and filled with snake oil salesmen (dev ops priests) that'll complicate it so much that you don't even know what you're paying for or why you even need it. And S3 is just a gateway drug into this whole mess, stay far away from it kids.


I'm talking about the S3 API. Every cloud I'm aware of provides S3-compatible object storage. And this storage is much cheaper than VM-attached storage.


I'm talking about storage volumes which are attached to the VM versus object storage which provides S3-compatible API.

You can use S3 API to access Backblaze.

I'm not experienced with the Hetzner Storage Box, but I don't think you can just attach it to your VM as fast storage. You can mount it as a Samba share, but I think that's a recipe for disaster.


Replace “S3” with “S3-compatible object storage”.


I think he was being sarcastic



