Hacker News | data-ottawa's comments

Good for France.

We've seen the US sanction the ICC, they have the Cloud Act and the Patriot Act. The US has shown both a willingness and a capability to weaponize your tech against you.

It should profoundly worry the world that three companies — whose heads all have frequent dinners at the White House — control virtually every phone, tablet, and computer in the world. If you expand that to data centres and clouds, email addresses, services and software it's far worse.

It should be considered a matter of national defence for basically all nations to ensure digital sovereignty.

(It doesn't matter who is in the White House, my point is it's a massive security nightmare to give this much control to one group)


It’s fantastic to be able to prototype small-to-medium-complexity projects, figure out which architectures work and which don’t, then build on a stable foundation.

That’s what I’ve been doing lately, and it really helps get a clean architecture at the end.


I’ve done this in pure Python for a long time. Single file prototype that can mostly function from the command line. The process helps me understand all the sub problems and how they relate to each other. Best example is when you realize behaviors X, Y, and Z have so much in common that it makes sense to have a single component that takes a parameter to specify which behavior to perform. It’s possible that already practicing this is why I feel slightly “meh” compared to others regarding GenAI.
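A hypothetical sketch of that consolidation (the names and behaviors are mine, purely illustrative): once you notice X, Y, and Z share most of their logic, they collapse into one component that takes a parameter selecting the behavior.

```python
# Hypothetical sketch: three near-identical behaviors collapse into one
# component parameterized by mode, instead of three separate helpers.

def summarize(values: list[float], mode: str = "mean") -> float:
    """One component replacing separate mean/min/max helpers."""
    ops = {
        "mean": lambda v: sum(v) / len(v),
        "min": min,
        "max": max,
    }
    return ops[mode](values)

if __name__ == "__main__":
    print(summarize([1.0, 2.0, 3.0], mode="mean"))  # 2.0
```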

Yes, I’ve been working on this and you need a clear semantic layer.

If there are multiple paths, or perceived paths, to an answer, you’ll get two answers. Plus, LLMs like to create pointless “xyz_index” metrics that are not standard, clear, or useful. Yet I see users just go “that sounds right” and run with it.


Absolutely. We make it obvious to the user when a query/chart is using a non-standard metric, and we have a fast SLA on finding/building the right metric.

It only works because all of the data looks the same between customers (we manage ad platform, email, funnel data).

So if we make an “email open rate” metric, it amortizes across other customers.
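A minimal sketch of what a vetted, shared metric definition could look like (the registry shape, SQL, and names here are illustrative assumptions, not a real product schema): defining “email open rate” once means every customer’s query layer resolves it identically.

```python
# Hypothetical metric registry: one canonical definition per metric,
# so the LLM/query layer cannot invent its own "xyz_index" variant.
METRICS = {
    "email_open_rate": {
        "sql": "SUM(opens) / NULLIF(SUM(sends), 0)",
        "description": "Unique opens divided by delivered sends.",
    },
}

def resolve_metric(name: str) -> str:
    """Return the canonical SQL expression for a vetted metric."""
    if name not in METRICS:
        raise KeyError(f"{name!r} is not a vetted metric; flag it for review")
    return METRICS[name]["sql"]
```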


There are some days where it performs staggeringly badly, well beyond its baseline.

But it’s impossible to actually determine if it’s model variance, polluted context (if I scold it, is it now closer in latent space to a bad worker, and performs worse?), system prompt and tool changes, fine tunes and AB tests, variances in top P selection…

There’s too many variables and no hard evidence shared by Anthropic.


You can reference other justfiles as modules too, so in a mono repo you can do `just foo-app test`.

If you combine that with relative working folders it’s very easy to manage large projects.

And you can get shell completion, which is extra nice.
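A minimal sketch of the monorepo setup, assuming a layout like `foo-app/justfile` and `bar-app/justfile` (the app names are illustrative); `just`'s `mod` keyword pulls each one in as a module:

```just
# Root justfile -- sketch only; paths/names are assumptions.

mod foo-app   # loads foo-app/justfile
mod bar-app   # loads bar-app/justfile
```

With that in place, `just foo-app test` (or `just foo-app::test`) runs the `test` recipe from `foo-app/justfile`, and module recipes run relative to the module's own directory, which is what makes the relative-working-folder setup easy.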


Gemma 3 is probably the best-supported fine-tunable model.

As a long time DS I sadly feel we filled the field with people who don’t do any actual data science or engineering. A lot of it is glorified BI users who at most pull some averages and run half baked AB tests.

I don’t think the field will go away with AI, frankly with LLMs I’ve automated that bottom 80% of queries I used to have to do for other users and now I just focus on actual hard problems.

That “build a self-serve dashboard” or number-fetching work is now an agentic tool I built.

But the real meat of “my business specializes in X, we need models to do this well” has not yet been replaceable. I think most hard DS work is internal so isn’t in training sets (yet).


DuckDB does support pipe operators via an extension, which for me is a welcome addition to SQL engines.

But I do agree with you.


A dataframe API lets you write code in Python, in one analysis file, with native syntax highlighting and LSP completion. Inlined SQL is not as nice and has weird ergonomics.

UDFs in most dataframe libraries tend to feel better than writing udfs for a sql engine as well.

Polars specifically has a lazy mode that enables a query optimizer, so you get predicate pushdown and all the goodies of SQL, with extra control/primitives (sane pivoting, group_by_dynamic, etc.).

I do use ibis on top of duckdb sometimes, but the UDF situation persists and the way they organize their docs is very difficult to use.


Map is one operation pandas does nicely that most other “wrap a fast language” dataframe tools do poorly.

When it feels like you’re writing some external udf thats executed in another environment, it does not feel as nice as throwing in a lambda, even if the lambda is not ideal.
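The pandas ergonomics being described look like this (data is illustrative): an inline lambda via `.map`, running in the same process, with no external UDF machinery.

```python
import pandas as pd

# Throwing a lambda straight into .map -- not the fastest path,
# but there is no separate UDF registration or environment.
s = pd.Series(["alice@example.com", "bob@example.com"])
domains = s.map(lambda email: email.split("@")[1])
```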


You have map_elements in Polars, which does exactly this.

https://docs.pola.rs/api/python/dev/reference/expressions/ap...

You can also iter_rows into a lambda if you really want to.

https://docs.pola.rs/api/python/stable/reference/dataframe/a...

Personally I find it extremely rare that I need to do this given Polars expressions are so comprehensive, including when.then.otherwise when all else fails.


That one has a bit more friction than pandas because of the return schema requirement -- pandas lets you get away with skipping it, which is bad practice.

It also does batches when you declare scalar outputs, but you can't control the batch size; that usually isn't an issue, but I've run into situations where it is.

