
I've been using Rust to process a large text corpus, and as Armin suggests, it's a really interesting experience. Here are a few things I've noticed so far:

1. Writing Rust code definitely takes more time than Python or Ruby, but it's not bad in practice. I can't measure the productivity difference yet, partly because I'm still learning Rust. I do spend more time thinking about how to write zero-allocation and zero-copy algorithms, and trying to get my code to compile (but see 3, below).

2. It's possible to write very fast code without compromising memory safety. It's great to see key parsing operations happening in ~100 nanoseconds, with no allocations.

3. Rust is one of those languages where I need to work to make the compiler happy, but once I manage that, the code generally works on the first try.

4. The tooling is very good: library management, unit testing, benchmarking, documentation, error messages etc., all work very nicely.

Overall, it lacks the instant gratification of a really good scripting language. But I keep finding myself back in Emacs, writing more Rust code for the fun of it. And I really enjoy the combination of low-level code, memory safety, and a useful selection of functional programming features. I'm glad to see a new language in this niche.



> Rust is one of those languages where I need to work to make the compiler happy, but once I manage that, the code generally works on the first try.

I can confirm this. I have a CSV parser[1] that is maybe twice as fast as Python's CSV parser (which is written in C)+. There's nothing magical going on: with Rust, I can expose a safe iterator over fields in a record without allocating.

[1] - https://github.com/BurntSushi/rust-csv

The docs explain the different access patterns (start with convenience and move toward performance): http://burntsushi.net/rustdoc/csv/#iteratoring-over-records

+ - Still working on gathering evidence...


I should've taken a deeper look at your CSV handling code earlier. That's quite a nice demonstration of an API using Decoder/Decodable.


I don't understand. What more do you need out of CSV parsing that any standard regex library doesn't provide?


It's kinda funny, because burntsushi wrote Rust's standard regex library.


There's plenty of value in a purpose-built CSV parser. Several important features, depending on the CSVs you'll be working with, are:

1. Sometimes you want to actually write files, in which case you'll want help with escaping and quoting.

2. Shortcuts for using header information to provide more convenient access into a particular column, rather than always doing it by index.

3. Conveniently slurp in only parts of a file, or only particular columns in a file, without loading everything into memory.


All things a proper understanding of regex and a minimal understanding of streaming file IO can cover. The whole "if you think regex is the solution to your problem, now you have two problems" thing has gotten out of hand. Regex is not that hard.


It's simpler to handle quotes and backslash-escaped commas with a custom parser. And then there's the domain knowledge baked into the lib. Does your regex solution produce Excel-compatible CSV files when you have leading zeros? That's important to some people.
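To make the quoting point concrete, here's a small sketch (using Python's stdlib csv module purely as an illustration) of round-tripping fields that contain embedded commas and quotes, which a naive comma-split would mangle:

```python
import csv
import io

# A record whose fields contain embedded quotes and commas.
row = ['id', 'He said, "hi"', 'a,b']

buf = io.StringIO()
csv.writer(buf).writerow(row)  # quoting/escaping handled per the dialect

# Reading it back recovers the original fields exactly.
parsed = next(csv.reader(io.StringIO(buf.getvalue())))
assert parsed == row
```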


So, what is the regex to parse csv?


I think people who are downvoting me don't understand how ludicrously sloppy most people who output CSV are. CSV is not RFC4180. It's whatever bullshit text file your client has handed you and convinced your project manager is your problem to parse, not their problem to generate even remotely correctly. There is no CSV library capable of handling "CSV". Every time someone asks you for it, you better kick and scream or expect to do a custom job.


I think people are being a bit unfair downvoting you right now (I bumped you up) - but I also disagree with you.

When I'm working with a file that purports to be CSV/TSV, with Python, I reach for the csv module, specify the dialect that created it - and instantly get the power to identify and refer to all the fields and rows without otherwise having to worry about parsing them.

Is it 100% bulletproof - definitely not - but, then again, I'm not writing a life safety system. And I've also never had the Python CSV parser break on any reasonable file I've sent it.

I'm truly thankful for robust CSV/TSV parsers. Throwaway code like this just works - in particular handles parsing the column headers to automatically build the dict for me.

    import csv

    sitesFN = ['gateway.tsv', 'relay.tsv']
    dsites = {}
    for fn in sitesFN:
        with open(fn, 'r') as f:
            reader = csv.DictReader(f, dialect='excel-tab')
            for row in reader:
                dsites[row['NIC_Serial_No']] = row['Device_Name']
Perhaps what you are trying to say, and what people are failing to hear, is that you can't rely on a CSV parser to handle, a priori, all possible files that purport to be "TSV/CSV" - on that point I agree with you: you will always need to examine the file and determine if the built-in parser will handle it.

But - what if it turns out the standard library CSV parser handles the "CSV" file just fine? In that case it seems to make a lot of sense to use it, rather than taking the time to write your own (along with the bugs that come from rewriting anything).

And, speaking just for myself - again, I've never seen a CSV/TSV file that the python didn't handle just fine - not to say they aren't out there - you just have to go out of your way to create them.


> And, speaking just for myself - again, I've never seen a CSV/TSV file that the python didn't handle just fine - not to say they aren't out there - you just have to go out of your way to create them.

Indeed. Python's CSV module supports a "strict" mode that will yell at you more often, but it is disabled by default. When disabled, the parser will greatly prefer producing some parse over producing a strictly correct one. I took the same route with my CSV parser in Rust (with the intention of adding a strict mode later), because that's by far the most useful behavior. There's nothing more annoying than trying to slurp in a CSV file from somewhere and having your CSV library choke on it.
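To illustrate the lax default versus strict mode with Python's csv module (a sketch, using a deliberately malformed quoted field):

```python
import csv

# Malformed input: junk after the closing quote of a quoted field.
bad = ['"a"b,c']

# The lax default shrugs and produces *a* parse.
assert next(csv.reader(bad)) == ['ab', 'c']

# strict mode raises instead of guessing.
try:
    next(csv.reader(bad, strict=True))
except csv.Error:
    pass
else:
    raise AssertionError('expected csv.Error in strict mode')
```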


The csv library in Python has handled every CSV I've ever thrown at it; CSV is "standardized" enough for that. Just set two things, the delimiter and the quoting/escaping method, and be done with it. The output is a list of dictionaries with the column headers as keys, which is very elegant. The best part is that the same settings you used to read a file can be used to save or modify it, so you can be sure it will look the same when your client re-opens it. A regexp would take twice the time to write, wouldn't give you half of those features, and would probably fail at escaping sooner or later, for the same reason XML can't be parsed with regexps.
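A sketch of that round-trip property (the data and headers here are made up):

```python
import csv
import io

# Pretend this came from a client as a tab-separated export.
src = "Name\tSerial\nalpha\t007\n"

reader = csv.DictReader(io.StringIO(src), dialect='excel-tab')
rows = list(reader)       # list of dicts keyed by the column headers
rows[0]['Name'] = 'beta'  # modify a field

# Write it back with the same dialect so it looks the same to the client.
# Note the leading zeros in '007' survive, since fields stay strings.
out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=reader.fieldnames, dialect='excel-tab')
writer.writeheader()
writer.writerows(rows)

assert out.getvalue() == 'Name\tSerial\r\nbeta\t007\r\n'
```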


While I agree with you, I'm having a lot of trouble handling Unicode and Windows' Latin-1 encoding...


This post sums up why.

http://programmers.stackexchange.com/a/215171

CSV is not just simple fields separated by commas with some quotes thrown in.


Well, if you look at, eg, Python's CSV parsing library, it's been more than enough to cover my needs so far, and handles different CSV flavours. It is much nicer and less error-prone to use than using regexps.


You are getting downvoted to hell but I can see what you're meaning. Generally if someone hands you a CSV file there is no guarantee that something mental isn't happening as there is no "CSV" standard. So you're saying that when your task is "process the client's CSV file" you might not necessarily be able to rely on a library handling it correctly, and that you should prepare to get your hands dirty (perhaps hacking together something with a regex or two).


He actually has a point there. There are so many different versions of "CSV" floating around that I'm not at all sure I'd want to deal with a parser that could handle most of them. Ever generated a CSV file from a spreadsheet or DB interface program? Did it have a big list of options on how the CSV would be formatted, so you could easily read the generated file into whatever downstream you were using?

Yeah.


> I'm not at all sure I'd want to deal with a parser that could handle most of them.

Python's CSV parser will handle almost anything you throw at it and it is widely used to great success.

> Ever generated a CSV file from a spreadsheet or DB interface program? Did it have a big list of options on how the CSV would be formatted, so you could easily read the generated file into whatever downstream you were using?

Just about every single CSV file that I've ever had to read was generated by someone other than me. Frequently (but not always), they come from a non-technical person.

Sometimes those CSV files even have NUL bytes in them. Yeah. Really. I swear. It's awful and Python's CSV parser fell over when trying to read them. (You can bet that my parser won't.)
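One hedged workaround sketch in Python: strip the NULs before the csv module ever sees them (the exact failure mode without this varies by Python version):

```python
import csv
import io

# Simulated input containing stray NUL bytes.
raw = "a,b\x00,c\n1,2,3\n"

# Filter NULs line by line before handing the stream to csv.reader.
cleaned = (line.replace('\x00', '') for line in io.StringIO(raw))
rows = list(csv.reader(cleaned))
assert rows == [['a', 'b', 'c'], ['1', '2', '3']]
```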

> He actually has a point there.

His point is to use regexes instead of a proper CSV parser. I'm hard pressed to think of a reason to ever do such a thing:

1. A regex is much harder to get correct than using a standard CSV parser.

2. A regex will probably be slower than a fast CSV parser.
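Point 1 is easy to demonstrate in Python: a naive comma regex falls over as soon as a field contains a quoted comma.

```python
import csv
import io
import re

line = 'a,"b,c",d'

# The naive regex splits inside the quoted field...
assert re.split(',', line) == ['a', '"b', 'c"', 'd']

# ...while a CSV parser gets it right.
assert next(csv.reader(io.StringIO(line))) == ['a', 'b,c', 'd']
```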


"Python's CSV parser will handle almost anything you throw at it and it is widely used to great success."

I like the word "almost". It's kept me in cheesy-puffs for years, now. :-)

I was actually only speaking to the post I replied to. CSV is a mess.


A good CSV parser (like Python's) lets you specify the dialect of the CSV file - it doesn't make assumptions as to the format of the file.


I think you just answered your own question. Parsing CSV is not a problem solved by "any standard regex library".


A regexp would work, but it's usually the wrong level of abstraction to operate at. One wants to say "for the next 2,000 rows, retrieve columns 2 and 4, and the column labeled 'foo'", not write a regexp.
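In Python terms (with hypothetical column names), that level of abstraction looks like:

```python
import csv
import io
from itertools import islice

data = io.StringIO(
    "foo,x,y\n"
    "f0,x0,y0\n"
    "f1,x1,y1\n"
    "f2,x2,y2\n"
)

reader = csv.DictReader(data)
# For the next 2 rows, retrieve the columns labeled 'x' and 'foo'.
picked = [(row['x'], row['foo']) for row in islice(reader, 2)]
assert picked == [('x0', 'f0'), ('x1', 'f1')]
```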


I don't understand; why would you try and parse CSV with a regex?

Note: CSV is actually much more complicated in practice than just splitting on commas.


regex is much more capable than splitting on commas


This is great feedback, thanks. And if there's anything that you don't like about Rust, please let us know now while we still have a chance to possibly fix it! Only a few short months left until all of our mistakes are forever entombed in Rust 1.0. :)


I've been using Rust for toy projects for a while, and I gotta say Rust is one of the nicest languages out there. Kudos to the Rust team.

As a relative newbie to Rust, one of the biggest hurdles I faced was poor documentation. A lot of "rust xxx" searches would link to outdated articles or to stale links in the Rust official docs. I understand that Rust is a young language, and documentation is probably the last thing on the core devs' to-do list (and rightly so), but I think it would greatly help drive adoption rates if we could get proper docs in place.


> A lot of "rust xxx" searches would link to outdated articles or to stale links in the Rust official docs.

Or to an old mirror of the official doc, http://web.mit.edu/rust-lang_v0.9 is the bane of my rustperience.


I specifically contacted MIT about this issue, and unfortunately, we're just gonna have to out-SEO them.


Did they give any reason why they couldn't e.g. use google's webmaster tools to disable indexing of these pages?


It doesn't help that there's a game called Rust out there!


I feared that, but overall combined with programmatic context (e.g. iter, array or whatever) I've had the game come up very rarely.


It gets a bit annoying that a lot of Rust libraries are named something rust-themed. I was looking at which web framework I should use (between Nickel and Iron) and so searching "rust nickel vs iron" returned absolutely nothing of relevance.

I did learn something about metals, though, so there's that.


Hah yes that is definitely an annoying property of "cutesy" library names, they're cute and fairly easy to remember but finding them afterwards is a pain in the ass compared to unfun dreary names like "rust-web".


I think that mutability should also be recorded in types (i.e. even a &mut would not allow you to modify an immutable type). Then, programmers could control whether variants/enums are "fat" (use the maximum amount of memory of any variant in the same type) or "thin" (use the minimum amount of memory for the given variant) by changing the mutability of fields they are stored in.

But I totally understand that you probably have other, well thought-out reasons not to do that.


Off the top of my head this seems like it would allow the amount of memory taken up by an enum to vary dynamically, which would prohibit stack allocation and require heap allocation instead. It would also be impossible to realize this optimization for any array of enums, since they would all need to be the same size for efficient indexing, and further it would wreak havoc with structs containing enums by making it impossible to statically compute field offsets.

However, we have considered elevating the position of mutability in the type system for other reasons--specifically, to make it possible to write a function that is generic with respect to mutability. But there's never been a huge amount of enthusiasm for that, so I wouldn't expect any changes of that sort.

But thanks for the feedback anyway! :) It's definitely valuable to know that people do care about the size of enums. We have various optimizations required by the language spec to shrink them down wherever we can do so while preserving semantics, and we welcome people to concoct more.


Well, it would definitely need to be special-cased, i.e. only use it when (1) the field containing the enum is immutable, and (2) it's the last field in the structure (it would then make the structure a Dynamically Sized Type).

I only care about the size of enums inasmuch as I find it wasteful to allocate a 10 words long Box to hold a 2 words long enum. Of course, the optimization would only make sense if the Box was immutable (even with a unique reference). I mostly mentioned the enum size issue because I read somewhere not too long ago that Servo people were having problems with memory usage because of that (and that they also want objects/records with inheritance of fields - that would bring all the issues you speak about above).


It's true, enum design in Rust can be a fine art. Specifically, knowing when to make the tradeoff between big inlined enum variants and pointer-sized variants that require an indirection to access. This plagued the Rust compiler for a long time as well, as the AST was formerly represented by a ludicrously large enum with something like 120 words per node.


Two things that come to my mind: 1) General concurrency/parallelism ease of use. It would be nice to have Async/Await a la C#, as well as some easier to use channels and synchronization primitives/functions along the lines of Go or Core.Async. And 2) use of <> for type annotations instead of [], which is much easier to parse when nested.


Servo is definitely demanding a wide variety of approaches to both concurrency and parallelism, so I expect the facilities for such in Rust to mature very quickly (but faster with more help and feedback!). It's a very important issue that is seeing work from full-time developers as we speak.

As for the second point, we've gone beyond the bikeshed (<> and [] are both exactly as ambiguous to the grammar, so it's just an argument of preference by this point) and addressed the root issue with `where` clauses, which prevent you from needing to nest <> at all for a majority (or perhaps plurality, I haven't measured) of function signatures where you previously needed to.


The ability to evaluate functions at compile time would be really nice. I was sort of looking at writing a macro in Rust to mirror Nimrod/Nim's 'when' statements but gave sort of gave up when I saw that. But of course that's probably an entirely unreasonable thing to request in the remaining time you have and it might be incompatible with having a borrow checker anyways.

The borrow checker looks really valuable though, and something that might have a big influence on language design going forward.


Lightweight compile-time evaluation is definitely a weakness of ours. A full-fledged syntax extension can get you pretty far, but those are a terror to write and ugly to import. While I don't know of any proposals in the air at the moment, I expect the developers will treat this with great importance post-1.0.


Interesting. From what I understand, this was what BurntSushi's regex macro was intended to accomplish: an efficiently compiled regex that optimized away all unneeded functionality for regex literals. With CTFE, would this get rid of the need for a macro entirely?


The regex macro has two primary benefits:

1. You cannot compile your program with an invalid regex.

2. The regex runs in a specialized VM corresponding to the specific bytecodes for a regex. This results in an across-the-board performance improvement, mostly because there are some places that can now use stack allocation instead of heap allocation. (But it's still a full NFA simulation, so it isn't going to, say, beat RE2/C++ quite yet.)

You can take a look at the syntax extension here: https://github.com/rust-lang/rust/blob/master/src/libregex_m...

Basically, if CTFE allows you to write functions that produce AST transforms, then the `regex!` macro should be convertible to that form. Most of the syntax extension is a quasiquoted copy of the general VM: https://github.com/rust-lang/rust/blob/master/src/libregex/v... --- The actual syntax extension bits are pretty minimal.

(There are quite a few tricks employed to get native and dynamic regexes to have the same API. It's the reason why the main `Regex` type is actually an `enum`! https://github.com/rust-lang/rust/blob/master/src/libregex/r...)

Full disclaimer: the `regex!` macro does have a couple downsides. #1, it's a syntax extension, so it has a dynamic dependency on `libsyntax`. #2, it ends up producing a lot of code and bloating your binary if you use it a lot (you might start noticing at after the first few dozen, and you might start cursing it after the first few hundred).
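The "specialized VM over regex bytecodes" idea can be sketched in miniature with a toy backtracking VM in Python; this is only an illustration of the general technique, not the actual `regex!` codegen:

```python
# Toy bytecode program for the regex a+b, compiled by hand into
# char/split/match instructions (the split tries the loop first).
PROG = [
    ('char', 'a'),    # 0: consume an 'a'
    ('split', 0, 2),  # 1: either loop back for more 'a's, or move on
    ('char', 'b'),    # 2: consume a 'b'
    ('match',),       # 3: succeed if the whole input was consumed
]

def run(prog, s, pc=0, i=0):
    op = prog[pc]
    if op[0] == 'char':
        return i < len(s) and s[i] == op[1] and run(prog, s, pc + 1, i + 1)
    if op[0] == 'split':
        return run(prog, s, op[1], i) or run(prog, s, op[2], i)
    return i == len(s)  # 'match'

assert run(PROG, 'aaab')
assert not run(PROG, 'b')
assert not run(PROG, 'aaa')
```

A real implementation compiles the bytecodes ahead of time (or, for `regex!`, generates specialized code for them), which is where the stack-allocation wins come from.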


Sadly I'm not enough of an expert to say for sure. It would certainly depend on the specific form that it took, and there are a multitude of options in this space. Of the current contributors who are willing and able to push the boundaries here, I believe that quasiquotation is the preferred approach (motivated by use cases for compile-time codegen in Servo).


> I believe that quasiquotation is the preferred approach

Oh yes yes yes. Quasiquotation makes writing syntax extensions almost easy.

It can't cover everything though. For those cases, dropping down into AstBuilder (or the raw AST) isn't completely terrible, but it's a big step down from quasiquoting.


Hi! I use F# for most tasks (from websites/JS generation, to packet capture and indexing, call routing and billing processing), and some C where required. Rust looks fantastic, and would give me the memory control I need when I need extra performance. A LOT of it comes down to simply being able to stack-allocate things; in F# I'm essentially forced to use the GC heap for even the most trivial things.

Rust looks fantastic, and I'm very excited about using it. When I found out about it and started reading how it worked, it was almost exactly what I had been wanting, on almost every count. I really wish it had existed a few years ago.

My comments are from someone that's just been playing around with the getting started guide of Rust.

-- Lack of custom operators limits expressiveness. For instance, look at Parsec or FParsec. Why shouldn't they be allowed to exist in Rust? But if custom operators are totally out of the question, then what about user-defined infix functions? (And then, why not functions with nearly arbitrary codepoints as identifiers?)

-- It seems that currying, partial application, and function composition are sorta cumbersome in Rust. Is this a conscious decision, that idiomatic Rust shouldn't be doing such things? Like " add >> inc >> print " being equivalent to "|a,b| -> print(inc(add(a,b)))" ? In F# I use this kind of stuff all the time, especially when processing lists.

-- It seems there's a difference between function types. Like if I do "fn foo(a:int, f:proc(int)->int)", I can call it with foo(1i, inc) if inc is a fn. But if I first do "let f = inc; foo(i1, f)", that's an error. Offhand, I'd assume this is due to lifetime management, but it feels a bit strange. When writing HoFs, do I need to implement a version for each kind of function? Or am I totally misunderstanding things?

-- Sorta related, does Rust allow something like:

  let inc =
    let x = ~0
    || { let res = *x; *x += 1; res }
The idea is to expose a function that contains some private state. I remember hearing that Rust changed the ownership stuff around a few times, but the basic idea is to create a globally-accessible closure. Is this impossible, requiring us to use statics to store the global state?

-- Why doesn't Rust warn when discarding the result of a function if not unit? Is it idiomatic Rust to return values that are often ignored?

-- Even without higher kinded types, monadic syntax is useful. Is Rust planning any sort of syntax that'd let users implement Option or Async? How does Rust avoid callback hell? Or is this quite possible today with macros and syntax extensions?

-- Has Rust considered Active Patterns (ala F#)? With that, I can define arbitrary patterns and use them with matching syntax. E.g. define a Regex pattern, then "match s { Regex("\d") => ... , Regex("\D") => ... , _ => ... }"

-- Consider allowing trailing commas in declarations; why special-case the last item?

-- And last but not least: Please, please, please, reconsider type inference. It's baffling why "fn" requires type annotations, but a closure does not. What's more, why is there even different function syntax in the first place? I've heard the reason that "top level items should have types documented", but that's a very personal decision to make. It certainly isn't something the Rust compiler should force me to do in all my code. Why do ya gotta limit my expressiveness? (Same argument I'd use for custom operators.) Statics/consts should also have type inference. And note that these items aren't necessarily public or exposed - but a private fn on a module still requires type annotations.
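For reference, the closure-with-private-state idea a few points up, sketched in Python (where shared mutable capture is allowed, which is exactly what Rust's ownership rules push back on):

```python
def make_inc():
    x = [0]  # boxed mutable state private to the closure
    def inc():
        res = x[0]
        x[0] += 1
        return res
    return inc

inc = make_inc()
assert inc() == 0
assert inc() == 1
```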


Rust 1.0 is supposed to be somewhat minimal, so things like custom operators and monadic syntax are out scope since they can be bolted on in a backwards-incompatible way in future versions. Active patterns would probably also fall under this.

Likewise, having functions carry their own private global state is not on the table, and it actually seems unsafe to me, since the function could be called at the same time in two different threads, each overwriting the other's updates to that state.

As for warning when the non-unit result of a function is discarded, a lint could probably help there.

Type inference isn't changing. The rule is that items need to be explicitly declared and expressions are inferred. Closures are expressions while most (all?) functions are items. It's a purposeful limitation on inference based on experience with prior languages.


> -- Consider allowing trailing commas in declarations; why special-case the last item?

They are allowed in most circumstances.


And if they're not, I'd say that might be a bug.


  > Lack of custom operators limits expressiveness.
I understand that this is going to be hard for some to swallow, but it is explicitly a non-goal of Rust to be maximally expressive. :) New features are motivated almost solely by solutions to concrete pain points in Rust code (especially Servo). This may sound particularly Blubby, but the devs are well-versed in Haskell (and Lisp, and Scala, and ML, and...). With respect to custom operators and infix operators, they're taking the cautious approach of leaving the option open for future versions of Rust, since they can be added completely backwards-compatibly if there's significant demand for them post-1.0.

  > It seems that currying, partial application, and 
  > function composition are sorta cumbersome in Rust.
There are two different camps in competition here. One camp wants Rust to have default function arguments as per C++. Another camp wants Rust to have automatic currying. These camps are in opposition because they both want to control what `foo(bar)` does for a function `foo` that takes more than one argument. So far the devs have resisted entreaties from both camps. Experience post-1.0 may change this.

  > It seems there's a difference between function 
  > types.
Closures are really terrible right now and are in the midst of getting a complete overhaul to be more useful and less hacky. :) Procs won't even be a thing afterward. Please excuse our mess!

  > Why doesn't Rust warn when discarding the result of 
  > a function if not unit?
It does warn, for any type that has the `#[must_use]` attribute. This includes the stdlib's `Result` type, which is basically Haskell's `Either` except explicitly intended to be used for error handling.

  > Is Rust planning any sort of syntax that'd let 
  > users implement Option or Async?
I'm not sure what this means, as users can already implement Option. It's not special-cased by the language in any way.

(As for the lack of HKT, that's a hotly-desired feature for post-1.0. It's still a bit pie-in-the-sky, but the devs have acknowledged that it would be very useful to improve our error handling story.)

  > Please, please, please, reconsider type inference.
Not going to happen. :) Requiring type signatures on top-level functions is so useful that even the languages that allow it to be inferred tend to enforce their presence via social pressure. In addition to providing powerful self-documentation, this vastly improves error messages. Finally, I suspect that Rust's trailing-semicolon rule (especially combined with the willingness to ignore the return value of most functions) would interact poorly with function signature inference.

  > Statics/consts should also have type inference.
IIRC this isn't possible, but I've forgotten the reason for now. It certainly isn't motivated by any sort of philosophy.


While I strongly disagree with limiting expressiveness, operators, etc., and the thing about top-level functions is misleading (because a nested, private module isn't really "top level"), this is a fantastic response and helps me understand Rust a lot better. Thank you very much.

I hope things will change (esp. type inference, which is really annoying while playing around, even if I eventually end up wanting to annotate; a REPL could fix a lot of the pain). But there's nothing out there that competes with Rust, and the C-friendliness means I can fairly easily interop with languages with more expressiveness ;).

As far as async/option, I meant something like Haskell's do notation or F#'s workflows. This allows implementation of async code without callback hell or the huge limitations of promises or whatnot. (But without HKTs, you can't mix multiple monad types within one block.)


While this is far from a promise, any future implementation of HKT would likely be accompanied by a do-style notation. The `do` keyword is unused yet reserved for precisely this reason (https://github.com/rust-lang/rust/blob/master/src/libsyntax/...).


Actually, after asking around, I must have misremembered about inference on statics being impossible, it's merely difficult. :) In lieu of full inference, there are proposals to allow statics to have the same sort of very simple inference scheme that C++ and Go have: https://github.com/rust-lang/rfcs/issues/296


Over the past few days I have been trying Julia. While I don't have much of an idea what I'm doing yet, and I'm probably doing a lot of unproductive premature optimization for the sake of exploring, I see a similar type of difference compared to R and Python where I have more experience.

What I am curious about is why you chose Rust rather than Julia. As I've understood the chatter so far, Rust is great for low-level programming, and can work well as a replacement for other system-level programming languages. Julia, on the other hand, is supposedly tailored for the type of task you are talking about, and like Rust, the chatter says Julia can be similar to C and Fortran in terms of speed.

My understanding of Rust was as a low-level general-purpose language and Julia as a technical or scientific programming language that is similarly fast.

Anyway, I'm just curious about the trade offs.


As far as I am aware, Julia isn't that fast for string operations yet. The language has been optimized for numerical calculations but they haven't done much work on making working with strings as fast as they could. I remember reading this on a github issue at some point but I can't find the issue at the moment.



