Who Says C is Simple? (cs.berkeley.edu)
175 points by ColinWright on July 17, 2013 | hide | past | favorite | 75 comments


Simple Language != Simple to use.

C is a small language. C is gussied-up assembly. It is the ballet of programming languages.

People who say C is simple really mean that it doesn't do anything behind my back.


C fits the marketing slogan for Othello (Reversi):

    A minute to learn, a lifetime to master.
The language is fairly simple in terms of features and aspects you have to learn to use it, but concepts like pointers and direct memory management are difficult to master. Programming in C is like building a building out of bricks: bricks are relatively simple objects, but building a building out of them is not a simple task.


Agree with both parents. "Simple" to me means that the code does exactly what I ask: no more, no less. For better or for worse. No behind-the-scenes magic (garbage collection, etc). I know exactly how much memory I am using, where it's allocated, and how it's passed around. Again, for better or for worse. With great power comes great responsibility!


Do you also know all the undefined and unspecified use cases of the standard?

http://de.slideshare.net/olvemaudal/deep-c


>No behind-the-scenes magic

I assume you never compile with optimisations turned on then?


Has very little to do with the language, apart from what undefined and platform dependent behavior allow the compiler to trick around. Of course, if the compiler goes nuts or is buggy, anything can happen "behind the scenes", but as far as I'm aware, this has nothing to do with the language as specified in the standard.


The problem is when what you ask for is incompatible with how it's done. Hence me today observing a price field in a record containing $9.500000000000000000002, knowing that making the obvious problem there "easy" will result in painful data manipulations elsewhere, and remembering that under certain conditions a Boolean can be neither true nor false. Asking a binary computer to handle non binary data just plain gets weird sometimes.


I think "pointers are hard" is a self-fulfilling prophecy. We studied them in the 9th grade. Because nobody bothered to tell us they were hard, almost my whole class (around 80%) had grokked pointers and linked lists after the first two lessons.

The same went for pointer arithmetic and function pointers.
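A minimal sketch of the kind of classroom exercise described above (the names are made up for illustration): a linked list built with pointers, walked via a function pointer.

```c
#include <stdlib.h>

/* A singly linked list: the classic pointer exercise. */
struct node {
    int value;
    struct node *next;
};

/* Push a value onto the front of the list. */
static struct node *push(struct node *head, int value) {
    struct node *n = malloc(sizeof *n);
    n->value = value;
    n->next = head;
    return n;
}

/* Apply fn to every node's value through a function pointer. */
static void for_each(struct node *head, void (*fn)(int *)) {
    for (struct node *p = head; p != NULL; p = p->next)
        fn(&p->value);
}

static void double_it(int *v) { *v *= 2; }
```

Nothing exotic here, which is rather the point of the comment above: once someone shows you that `n->next = head` just stores an address, the mystique evaporates.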

Right now I am struggling mightily with monads, mostly because the voice in my head keeps repeating what all of the internet is singing: they are hard, don't bother.


Pointers are easy when everything is working right.

It's when there are bugs in code which uses pointers that the weird can kick in and really give you a headache.


Monads are easy. Do bother.

=D. Just need to be the discord to the Internet's harmony.


Is it a small language? The standard takes a dense 179 pages to describe the language. I guess it's debatable if that is small or not. I certainly don't agree that it's a simple language though. There are myriad rules and exceptions to rules to remember. When and where one integer type will be converted to another. When overflow is defined and when it's undefined. Just those things alone mean that much real world C code is full of undefined behaviour[1] just related to arithmetic on numeric types, because the rules are hard to remember and reason about.
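One concrete instance of those integer conversion rules, well defined but easy to misremember: when a signed int meets an unsigned int, the signed operand is converted to unsigned, so -1 becomes UINT_MAX and compares greater than 1U. A tiny sketch:

```c
/* The "usual arithmetic conversions" (C99 6.3.1.8): when int meets
   unsigned int, the int operand is converted to unsigned. So -1
   converts to UINT_MAX and the comparison below is false. This is
   fully defined behavior; it is just surprising. */
static int minus_one_less_than_one_unsigned(void) {
    return -1 < 1U;  /* -1 becomes UINT_MAX, so this is 0 */
}
```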

John Regehr asked people to submit str2long() implementations that didn't execute undefined behaviour. Despite being well warned about avoiding overflows and other undefined behaviour only 35 of the 78 submissions passed the test suite[2].

Even a trivially small 2 line C function can result in different results on the common compilers depending on which compiler and which optimisation level you use[3]. Compiler engineers struggle to agree on and correctly implement the "simple" rules of C. To me that's an indication that they aren't so simple.

I would say that C doesn't do anything behind your back as long as you don't ever use an optimising compiler or you pay very very close attention to the standard. If you forget then suddenly your check for pointer arithmetic overflow has been "helpfully optimised away" behind your back for reasons that are not immediately apparent at all[4].

I do agree that apparently simple rules can lead to complexity in use, and C is full of this too. For example you would think in a lower level language like C viewing a chunk of memory as a different type would be trivial, but the strict aliasing rule means you have to be a language lawyer to understand what is allowed and what isn't[5] as it makes the intuitive solution into undefined behaviour (which is extra pernicious since most of the time it will work as intended, until somebody compiles the code with a smarter compiler at a high optimisation setting).

[1] http://blog.regehr.org/archives/963

[2] http://blog.regehr.org/archives/914

[3] http://blog.regehr.org/archives/482

[4] http://pdos.csail.mit.edu/~xi/papers/stack-sosp13.pdf example 1

[5] http://blog.regehr.org/archives/959
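On the strict aliasing point above: the sanctioned way to view a chunk of memory as a different type is memcpy, not a pointer cast, and compilers typically lower the memcpy to the same single move the cast would have produced. A small sketch (the asserted bit pattern assumes IEEE 754 single-precision floats, which virtually every current platform uses):

```c
#include <stdint.h>
#include <string.h>

/* Viewing a float's bytes as a 32-bit integer. The tempting
   *(uint32_t *)&f violates the strict aliasing rule; memcpy has
   defined behavior and compiles to the same code. */
static uint32_t float_bits(float f) {
    uint32_t u;
    memcpy(&u, &f, sizeof u);  /* defined, unlike the pointer cast */
    return u;
}
```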


Compared to other languages, 179 pages for the full specification is pretty small. The Java and C# specifications are each more than three times that size. C++ is roughly five times as large. If you're a fan of functional languages, Haskell is almost twice as large. Even Scheme, an extremely simple language, is only about 30 pages shorter.


The book "The Definition of Standard ML" is ~150 pages for a language that is simple (IMO) and rich. Bonus, the definition is not written in prose, but with typing rules and operational semantics.


> Bonus, the definition is not written in prose, but with typing rules and operational semantics.

While that is certainly nice, I suspect it makes it hard to really compare based on page count alone. I doubt prose and a formal definition like that are equally dense.


Agreed. From what I've read [1], the syntax rules and semantic rules take up about 40 pages with the rest of the book being "introduction, exposition, core material, appendices and index".

[1] http://mythryl.org/my-Mythryl_is_not_just_a_bag_of_features_... (go to the middle of the page or search for "pages")


The Python Language Reference is only 100 or so pages. I'm not sure how different the PLR is from a "real" spec, however.


I wouldn't say that any of that means that C itself is complex. It means your computer's native instruction-set architecture is complex (and full of undefined behavior), and C, being simple, just gives you relatively transparent access to it rather than trying to abstract it and standardize it and generally clean it up.


Wrong; one example: your CPU's instruction set has well defined behavior for signed integer overflow. C doesn't.


That does not contradict what was said. C says "undefined" so that compilers are free to simply let each CPU do whatever is easiest for that CPU. On any particular CPU with a particular compiler you get consistent behavior. But when you switch CPUs, watch out.

Thus YOUR SPECIFIC CPU's instruction set may be well-defined. But if you say "your CPU" to a group of people with different CPUs, there may be no simple statement that is well-defined and generalizable across all of them.


> C says "undefined" so that compilers are free to simply let each CPU do whatever is easiest for that CPU. On any particular CPU with a particular compiler you get consistent behavior. But when you switch CPUs, watch out.

That was how it worked in the 1990s. Nowadays, a C programmer needs to figure out which undefined behaviors are justifiable and which should be avoided at all costs because they will be used by the compiler to justify optimizations. Signed arithmetic overflow used to be in the first category; now it is in the second. So is the use of uninitialized variables.

Not to appeal to authority, but I worry about these things for a living:

http://blog.frama-c.com/index.php?post/2013/07/11/Arithmetic...

http://blog.frama-c.com/index.php?post/2013/05/20/Attack-by-...

If you cannot be bothered to read that much, then please simply compile int f(int x) { return x + 1 > x; } at different optimization levels with the compiler you already have, and observe the values for f(INT_MAX) in each case.
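The suggested experiment, as a standalone sketch. Only defined inputs are asserted below, since the result of f(INT_MAX) is undefined and genuinely varies with the compiler and optimization level:

```c
/* The function from the comment above. x + 1 > x looks
   tautologically true, so an optimizer that assumes signed overflow
   cannot happen may fold the body to 1; without optimization,
   f(INT_MAX) typically wraps to INT_MIN and returns 0. Both results
   are permitted: the overflow is undefined behavior. */
int f(int x) { return x + 1 > x; }
```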


Thank you for the update to my knowledge. I'm not a C programmer, and have been relying on older explanations of why C does what it does.


You're thinking of "implementation defined" which are required to have well-defined behaviors. "undefined" is far more pernicious.


By small, I meant that you can hold every little part of the syntax and normal behavior in your head pretty easily. Yeah, there are edge cases (important ones even) and you can build god-awful complicated expressions if you want. But look at the K&R book -- it is slim. The language is just not that big.

By not-doing-stuff-behind-your-back, I mean that you can map the code you write into machine instructions in many cases. An optimizer will move stuff around on you but it is typically local(-ish) manipulations. That's much less black magic than garbage collection. You have to manage memory yourself.

By being-ballet, I mean that it is hard. I am a swing dancer. The best dancers I know all took ballet classes as kids. None of them dance ballet today. It is probably coincidence/age but the best programmers I know all spent a decade or more working in C. I think working in C gives you a level of understanding about what a computer does that a higher level language doesn't. That said -- if you want to get stuff done, use a higher level language. I'm really good at C coding but I'm 2-10x faster in C#.


>By small, I meant that you can hold every little part of the syntax and normal behavior in your head pretty easily

Well I think that is where we disagree. I don't find it easy to hold all the rules of C in my head. It's plenty large enough that there are things I hardly ever use. Even the common parts of the language like arithmetic on integer types can get complicated very quickly if you want to be sure your code contains no undefined behaviour or works correctly for INT_MAX etc. Often you have to understand not just what the standard says but also what your compiler/target architecture does for the many implementation defined things.

>Yeah, there are edge cases (important ones even) and you can build god-awful complicated expressions if you want.

You don't have to write long or complicated expressions for things to get tricky. That was the point of this example: http://blog.regehr.org/archives/482 It's a simple function yet mainstream compilers got it wrong for years.

I don't consider these things as edge cases because they come up all the time and have caused countless serious bugs in real world C code.

>By not-doing-stuff-behind-your-back, I mean that you can map the code you write into machine instructions in many cases.

That is becoming less and less true with modern compilers. Vectorizers will kick in at different optimisation levels and depending on various heuristics that I'm not sure even the compiler authors would be confident in predicting for more complex code. They can perform a lot of complicated transforms. Undefined behaviour means lots of code can be modified in fairly unintuitive ways.

>An optimizer will move stuff around on you but it is typically local(-ish) manipulations

clang includes a link time optimizer: http://www.llvm.org/docs/LinkTimeOptimization.html#example-o...


The size of the spec in part reflects the intensity and vastness of real world usage, having to get into painstaking detail regarding otherwise ignorable issues. It agonizes over minutiae because enough people are going into those dark corners that rules and expectations must be laid out for a profusion of obscurities.

If you have a relatively small number of users who understand the gist of the language, it can be expressed in a few pages.

If you want the language to be useful beyond a trivial description, you'll have to add some complexity, which leads to weaknesses.

If your language becomes world-scale popular, you're going to have to spend specification space dealing with things like explaining that under certain conditions yes in fact a Boolean variable can have a value other than "true" or "false".


1. Your first example ends with "C may be a small language, but it’s not a simple one."

By your argument, it feels like you'd say the language is large because you have to understand that rule if you really cared about consistent results everywhere. I agree with the author -- small but not simple.

Platforms differ and it leaks into the language precisely because it is such a simple language. It is simple like HTML 1.0 is simple. It is simple like the Bill of Rights is simple. There is a limited set of rules, but there's a lot of undefined behavior as a result.

I guess you believe HTML is complex because you have to understand CSS these days to do anything.

2. I find it interesting that you use what is clearly an edge case and then argue that because they are common it is not an edge case.

The example of "int foo(char x) { char y = x; return ++x > y; }" is almost the textbook example of an edge case. Seriously, don't trust me. Ask around and see if you can find 10 people who know C well that would consider this mainstream (excluding embedded developers).

There are countless serious bugs in real world C code because (a) there is so much damn real world C code and (b) it doesn't exactly protect from shooting yourself in the foot.

My experience working with a big program that ran on Windows, 2-5 flavors of UNIX, and VMS (on both VAX and Alpha) is that the bulk of the real world errors in C code do not have anything to do with undefined behavior across platforms. They have to do with memory management (null pointers, buffer overruns, etc) and poorly written macros.

3. I agree with you that adding optimizers introduces a whole set of things you have to hold in your head that push you into 'large' territory. Just like programming on a GPU makes you rethink everything about how you organize code, and writing embedded code has its own set of rules.

But how does that make the language large?

My point is that a language that says "we manage memory on your behalf inside of a VM" is doing a lot more for you.

Boy, are we beating this thing to death or what...

...

I like C quite a bit but I wouldn't want to make a living programming in it today. I drop back into C when I have a compute kernel that needs it but 99% of the code remains in C#. C is small but too simple for the problems that I'm solving today.


But the article contains many examples of the compiler doing things behind your back. For example:

> The answer depends on whether the optimizations are turned on. If they are then the answer is 3 (the first definition is inlined at all occurrences until the second definition). If the optimizations are off, then the first definition is ignored (treated like a prototype) and the answer is 4.


In that case, it's your fault for defining two versions of a function (inline and non-inline) that do different things.


Even if you don't do something like that a modern C compiler will do all sorts of things behind your back. Like removing null checks[1] or pointer overflow checks[2]

[1] http://www.cvedetails.com/cve/CVE-2009-1897

[2] http://www.cvedetails.com/cve/CVE-2008-1685


Well, to be fair, the NULL check was in a wrong place. The pointer was already dereferenced before the check.

  struct tun_struct *tun = __tun_get(tfile);
  struct sock *sk = tun->sk;  /* tun is dereferenced here... */
  unsigned int mask = 0;

  if (!tun)                   /* ...before this NULL check runs */
    return POLLERR;


It "doesn't do anything behind my back" is exactly the phrase I've been searching for.


"Terse" is the best adjective I can think of for C.

Merriam-Webster definition of "Terse" :

1: smoothly elegant : polished

2: using few words : devoid of superfluity


Also the standard library is really small. So after a short amount of time you have used almost all of it a few times.


However it's going to take some time to learn all the gotchas in string functions... Failing to zero-terminate strings (strncpy), potential buffer overflows (many), global internal buffer and overwriting an argument (strtok), etc etc etc etc
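The strncpy gotcha in a nutshell, along with the defensive pattern usually recommended (copy_str is a made-up name for illustration):

```c
#include <string.h>

/* strncpy's classic trap: if the source doesn't fit, the
   destination is NOT NUL-terminated. The usual defense is to copy
   one byte less than the buffer and terminate by hand. */
static void copy_str(char *dst, size_t dstsize, const char *src) {
    strncpy(dst, src, dstsize - 1);
    dst[dstsize - 1] = '\0';  /* terminate no matter what */
}
```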


Simple != easy, as per this great talk by Rich Hickey:

http://www.infoq.com/presentations/Simple-Made-Easy


To get an even better explanation on what the word "simple" means and what it derives from, I would recommend "Simplicity Ain't Easy" by Stuart Halloway[1]. While it is very similar to "Simple made Easy", it focuses a lot more on the etymology of the words "simple" and "complex" and how people misuse the word.

[1]: http://www.youtube.com/watch?v=cidchWg74Y4


Someone else at Cal summarized C succinctly when he wrote "C is the machete of programming languages. It's great for clearing a quick path through the jungle of programming problems, but be alert as there are no safety mechanisms to protect the wielder of this wonderful weapon." -David Patterson http://www.informit.com/promotions/promotion.aspx?promo=1389...


Warnings as errors, turn on all of them, pedantic mode and static analysers running alongside unit tests on every build.

Even then, there might be dragons waiting for you.


These little teasers are starting to annoy me. Congratulations, you can twist C to be as unintuitive as you'd like it to be. But why would you write code like that to begin with? I don't care what value __ return -3 >> (8 * sizeof(int)); __ returns because no sane program anyone contributes to will have constructs like that.


You are missing the context, which is that of unambiguously parsing any valid C code. The examples given are teasers for the parser, not the programmer.


"The following examples were actually encountered either in real programs or are taken from the ISO C99 standard or from the GCC’s testcases."


> Who Says C is Simple?

I do.

C is like chess. There are few simple rules (compared, say, to the C++ spec). But knowing the rules doesn't mean you'll end up beating Kasparov. It still takes skill and practice to be a good C programmer.

So the language itself is pretty simple compared to other popular programming languages, in that it has a simple syntax. I can teach someone the rules of chess pretty quickly; it doesn't mean I'll create a chess grandmaster in a day or two.


>> Who Says C is Simple?

> I do.

I disagree. C comes with a lot of edge cases and subtleties that can surprise even people like myself who 'know' C.

Some examples: Exact semantics of restrict, C99 inline semantics (eg I wasn't aware that it's possible to make a non-extern/non-static inline definition into an extern one with a single redeclaration), effective typing rules and whether it's possible to circumvent them, whether or not it's undefined behaviour to cross boundaries of the sub-arrays of a multi-dimensional array if it doesn't happen in a single expression, ...


The difference is between gimmicky ways one can use ambiguity to produce convoluted edge cases vs. knowing enough to do actual work. Sure, C being low level and having a relatively short specification is ripe for compiler-specific and hardware-specific hacks. But I still think it has the basic core defined, and that core is pretty small.

With C++ and its typical libraries it is a bit different. It has so many features (templates, classes, streams, friends, shared pointers, unique pointers, destructors, constructors, polymorphism rules, and combinations of those) that code gets complicated without using obscure features; just sticking to the standard ones gets hairy and needs someone who knows the whole spec.


The difference is that I don't want my programming language to take me a lifetime to become reasonably skilled with it. There are many languages that are simpler than C, and easier to use as well.


I bet most or all of the languages that you're thinking of run on a VM written in C.


C is not a simple language. It is confoundingly difficult for a machine to parse. Add in the fact that it pretty much requires a preprocessor to be useful and it gets even more difficult. Clearly the article is not about whether the C paradigm is complex. It obviously isn't, because it doesn't come with a whole lot of pre-baked abstractions. But for parsing reliably, it's an absolute nightmare.


tl;dr: C is simple. Bad code is bad. Shitty compilers are shitty.

This needs to be qualified. Simple compared to what? C is simple when compared to C++, Java, and many other languages. Obviously, C is simple because it lacks syntactic sugar, classes, polymorphism, templating (generics), memory management, etc. So, uh, yeah. It's simple.

The fact that bad code can be written in a language doesn't really make the language non-simple. I could write bad code in any language (often times I do!) - so I'm not sure how any judgment about a language can be made with these kinds of examples. As far as the GCC/VC examples are concerned, they are a non-issue. Shitty compiler keywords are shitty[1]. We know. This is one of the many pains of writing cross-platform code in compiled languages. These examples are contrived and I highly doubt most come from production code.

[1] http://stackoverflow.com/questions/3437404/min-and-max-in-c/...


Signed/unsigned conversions and conversion to/from pointers of functions and arrays cause real, production-code confusion. In terms of how much mental overhead the language takes up - how many edge cases you have to keep in mind to read code in the language (because if you're programming right then you spend more time reading than writing) - no, it's not simpler than Java, because classes, generics and memory management are more straightforward and less confusing than C's typing rules.

Languages have a complexity budget, and C blew its budget on a squillion different kinds of integer and a bunch of arbitrary-seeming rules to minimize the amount of typing needed to convert between them. That was a good tradeoff in its day, when hand-optimization was practical and program source needed to be small. It's not an appropriate language to use now outside of very specific circumstances.


You say this as if Java doesn't have its fair share of oddities. It seems you're saying that the highly granular typing found in C doesn't have an equally-annoying analogue in Java; it does[1].

[1] http://www.javalobby.org/java/forums/m92125315.html


That's not a corner case, it's a particular instance of a general Java misfeature. In Java, "==" is an abomination that you must never use and will give essentially random results; "equals" is what you use to compare things for equality, and this is consistent throughout your codebase. Is it annoying? Yes. But it's easy to remember because it applies everywhere; any case of "==" in a java source file sticks out like a sore thumb. (Also any use of an array, anywhere, for anything. Sometimes I think I should write "Java: the good parts").


C is simple like the game Go is simple. Both are defined by (comparatively) simple rules, yet the possible situations that both present are far more complex than languages or games defined by far more complex rules.


In that vein, assembly languages are even simpler. Which is true. The languages themselves are simple. Their use often isn't.


No, assembly languages are not simpler, except for a few mostly didactic processors.

Programming in assembly generally requires understanding registers, how the processor works, lots of branching variations, a few other ways to loop, memory organization on the processor, and at least tens or hundreds of different opcodes, some with surprisingly subtle differences between them.

So, no, ASM is not simple the way C is.


C is not simple because its grammar is not simple and is hard to parse. You can have a simple (feature-wise) C-like language with a much cleaner syntax, e.g. Pascal, Modula, Oberon. You can simplify it a lot just by dropping += :)



> Why does the following code return 0 and not -1?

This question appears twice in the article.

Note that shifting a 32-bit integer (no matter the sign) by 32 is undefined behavior (§6.5.7, 3). I'm using Clang, and its `int` has 32 bits even on a 64-bit system.

C is deceptively simple.
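A sketch of what "deceptively simple" costs in practice: since shifting by the full width of the type (or more) is undefined per §6.5.7, a portable shift helper has to special-case it. shift_right below is a hypothetical helper, not from the article:

```c
#include <limits.h>

/* Shifting an N-bit type by >= N bits is undefined behavior
   (C99 6.5.7p3), so a shift that must work for any count has to
   check the count and define the out-of-range case itself. */
static unsigned shift_right(unsigned x, unsigned n) {
    if (n >= sizeof x * CHAR_BIT)
        return 0;   /* would be UB; we pick a defined result */
    return x >> n;
}
```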


I would not have thought that

    return ({goto L; 0;}) && ({L: 5;});
was valid C. GCC 4.4.7 chokes on it, using -std=gnu99. Is this a C11 thing?


To the compiler, it shouldn't matter where a label exists. If you're telling it to go to label L:, execution will jump there. GCC appears to look out for the programmer and always prints an error (and I don't see an option to shut it off). Oracle Studio compiles it with no issue and the above statement returns 1.


No, it's a GCC extension. That entire section is about non-standard GCC features.


I just slapped it into Xcode and compiled using LLVM 4.2 (-std=gnu99), and it works fine. Using Xcode's LLVM GCC 4.2 setting (-std=gnu99 by default), it chokes.

Conclusion: I kind of lost interest at this point. :-) I don't quickly know how to get it to compile using GCC, though.


All this proves is that you can write ugly C and GCC has some crocks in it.

These problems are all optional.


C is a tool. A powerful and flexible tool.

If a programmer wants to write obfuscated code (as in the examples), C allows that.

If a programmer wants to write clean, maintainable and documented code, C allows that too.


Nobody?

What these examples demonstrate is not that C is "hard", but that it's powerful.


There isn't a single feature in C that isn't present in safer languages like Modula-2 or Turbo Pascal. They are as powerful as C, or even more so.

Just because C won the battle with those languages, it does not mean we need to live with its design issues ad eternum.


How are you going to go about fixing C? And what design issues? It's pretty bare bones on top of the assembly. What should change?


- Eliminate sources of undefined behavior

- Remove implicit type casting

- Have a way to check the hardware overflow flag from the language

- Saner syntax for declaring variables (i.e. function pointers)

- Get rid of null-terminated strings

- A module system instead of the preprocessor
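On the function-pointer-syntax item above: a sketch of the raw declarator next to the typedef most C programmers reach for instead. The names (pick, binop) are invented for illustration:

```c
static int add(int a, int b) { return a + b; }
static int sub(int a, int b) { return a - b; }

/* Raw declarator syntax: pick is a function taking int and
   returning a pointer to a function taking (int, int) and
   returning int. Read it inside-out. */
static int (*pick(int which))(int, int) {
    return which ? add : sub;
}

/* The usual workaround, and why the complaint exists:
     typedef int binop(int, int);
     static binop *pick(int which) { ... }
   reads left to right like any other declaration. */
```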


Additionally:

- Proper arrays with bound checking, which can locally be turned off, if required for performance reasons

- Explicit operation for converting arrays into pointers


-fstack-protector && -D_FORTIFY_SOURCE=2


Where is that defined in the C standard?

Because you see, if it is compiler specific, it is not part of the language.


Also:

- remove crufty alternate syntaxes such as trigraphs and K&R-style definitions

- a multiple-pass compiler which removes the need for explicit prototypes/header files


• Modules are on the way: http://clang.llvm.org/docs/Modules.html

• Check for overflow would be awesome indeed.

• Undefined behavior avoids massive performance penalties on hardware that wouldn't match the defined behavior, so it's a feature and unlikely to go away (compilers may warn you though).


> Modules are on the way: http://clang.llvm.org/docs/Modules.html

Yeah, with luck they will be part of C++17, you just need to wait 4 years for them to be defined and then around 5 more for all major compilers, across all OS to support them.

They are not even being discussed for the next C standard, and thus similarly to C blocks, it will remain a clang language extension.


1. Undefined behavior exists because not all hardware is the same.

2. Pedantic, but it's NUL-terminated; NULL is something either the same or completely different depending on the implementation. Also, what would you recommend for non-NUL-terminated strings? Passing around a struct of char *s and size_t len? _That_ is a horrid idea.


Go Berkeley, awesome prof.



