The Prime That Wasn’t

bediger · on April 5, 2011

That's not a true Regular Expression: it uses back-tracking, and to the author's credit, he notes that in the article.

True regular expressions can't recognize primes. It's important to keep these kind of distinctions in mind for two reasons.

First, without clear, precisely defined categories of formal languages, you'll fail to recognize unsolvable (or undecidable) problems and waste a lot of time. Personally, I put "malware dectection" in this category, as determining whether a given file constitutes a virus seems equivalent to solving the halting problem.

Second, sloppy usage like using "Regular Expression" for something else is one of those common usage issues that allow people to argue based on the name ("regular expression") to lure the unsuspecting in, but then fall back on extreme technicalities to trick people into agreeing with the argument when they might not otherwise have agreed. A lot like I've done in this comment.

lmkg · on April 5, 2011

I agree with you to a certain extent, but I think it's also important to realize that, just as languages evolve and dialects diverge, technical jargon evolves and diverges into dialects as well. To a large class of programmers, the non-DFA perl-type regex is more common and more important than the formally defined regular expression. The overlap is inconvenient, for exactly the reasons you state, and I wish it didn't exist, but it's unrealistic to expect one camp or the other to abandon their ingrained term for a common concept.

For my part, I use "regular expression" for the DFA-expressible grammar and "regex(en)" for the string-processing facility. The most important thing is clarity, and as long as you're being clear about which type of regular expression you're talking about, the double-meaning may be twitch-inducing but it doesn't impede communication.

lloeki · on April 5, 2011

Indeed I do the same kind of distinction with the exact same naming convention.

Now this does not detract to the fact that this regex is quite elegant in its simplicity and we often assume our hardware+software stack to be actual Turing machines.

Call me paranoid but finding such a simple expression and seeing it validate a number being a prime in such record times should have raised some alarms. Indeed, removing the "optimization" makes computing times go through the roof.

lukeschlather · on April 5, 2011

I think you've completely missed the point of the exercise. This is not about optimization. It's about having a deep knowledge of how Perl-style regular expressions work.

lloeki · on April 5, 2011

Sorry, this was not my intention to mean so: I used "optimization" for lack of finding the correct word regarding the role of the PCRE setting. My point was that the quickness of the initial results was fishy enough to doubt of the correctness of those results (and hence diving into regex engine probing), even before finding actual false positives.

eru · on April 5, 2011

That's a reasonable convention. (Another less reasonable one, is to use the term `Frankenxpression'.)

eru · on April 5, 2011

To see some genuine regular expression trickery, see https://github.com/matthiasgoergens/Div7/blob/master/regex7 for a regular expression that recognizes decimal numbers that are divisible by 7.

If you memorize that expression, you have a handy way to tell whether a number is divisible by 7.

shasta · on April 5, 2011

It's much easier to remember and apply the corresponding finite state machine, where you just keep track of the remainder mod 7 that you've seen thus far. Example: 48193. What's the remainder mod 7?

48 mod 7 = 6 -- First two digits

61 mod 7 = 5 -- Remainder thus far, appending next digit

59 mod 7 = 3 -- Remainder thus far, appending next digit

33 mod 7 = 5 -- Remainder thus far, appending next digit

No more digits, so the answer is 5.

(Edit: I did figure that you were almost certainly being sarcastic. I just wanted to point out that in its corresponding FSM form, we've memorized the transitions as children.)

eru · on April 5, 2011

Yes, that's much easier. My comment wasn't meant to be serious. Check out how long that regular expression is. (Though it can be simplified further.)

I actually generated the regular expression linked by compiling the finite state machine.

pronoiac · on April 6, 2011

You're doing all the work of dividing by 7. There are shortcuts to checking divisibility [1] - 2, 3, 5, & 9 have the easiest shortcuts - but 7 is a complicated case, which I'd probably just divide out anyway.

[1] http://en.wikipedia.org/wiki/Divisibility_rule#Divisibility_...

eru · on April 8, 2011

No, he's not doing the work of dividing by seven. That would need memory of around the same size as your number of digits of your input number. His state machine only reads one digit at a time and only needs to store a less-than-one-digit state.

My regular expression is a joke on the usual divisibility rules.

By the way, you can find your own divisibility rules. They are actually easy to find and prove, if you know a bit about cyclic groups / calculating with modulo. E.g. the rule for divisibility by F in hexadecimal is the same as the one for 9 in decimal.

Divisibility by 11 or 101 or 1001, or 9, 99, 999 is also really simple.

jarin · on April 5, 2011

Boy, and I've been doing it the hard way all this time (num % 7 == 0).

robinhouston · on April 5, 2011

A little while ago I wrote some Javascript code to compute a regular expression for deciding divisibility by any number, in any base: http://s3.boskent.com/divisibility-regex/divisibility-regex....

eru · on April 5, 2011

You can find my Haskell version under https://github.com/matthiasgoergens/Div7/

Wilfred · on April 5, 2011

> determining whether a given file constitutes a virus seems equivalent to solving the halting problem.

In fact (for a sufficiently rigorous definition of virus), it is provably the case that we cannot determine whether a file is a virus, using an argument very similar to a halting problem proof.

You might be interested in 'Computer viruses: theory and experiments' by Fred Cohen for the full working-out of the proof.

derefr · on April 6, 2011

Viruses are really a matter of computer action given user intent—and therefore are able to be solved not just by detection of incongruous computer action, but also by detection of incongruous user intent.

For example, imagine that instead of WIMP, all interactions with a computer occurred through a hierarchical, task-based UI (similar to, e.g., emacs org-mode) where you had to specify a high-level goal you wanted to accomplish before you could set about running programs to do it. If the computer knows that what you want to get done doesn't involve deleting files, then any program you run that does delete files has either been executed in error, or is acting not according to the user's command.

dexen · on April 6, 2011

> If the computer knows that what you want to get done doesn't involve deleting files, then any program you run that does delete files has either been executed in error, or is acting not according to the user's command.

Tell that the authors of certain popular PDF reader software, currently version X. Which, on the surface at least, has only one goal: display content of a PDF file.

But scratch the surface and you note it may involve creation of temporary file of sorts. And perhaps loading additional libraries every now and then. And perhaps executing code contained in the document (if it's a form with validation code, for example).

There are countless ways of achieving a goal, and which is better depends on situation. Pinning everything down to some one-true-way won't do. And expressing the possibilities is, well, designing & programming. With the usual assortment of bugs, up to and including vulnerabilities.

So no, you can't have a secure system / virus detection based on user stating general high-level goal.

derefr · on April 6, 2011

> But scratch the surface and you note it may involve creation of temporary file of sorts.

Alright, s/file/document/g in my statement above—a "document" being some chunk of information the user values, and which the program doesn't have any "ownership" of unless the user explicitly delegates said ownership to it.

Think of SELinux: a program would have to request a right to delete the user's files, although it would be perfectly allowed to delete its own files. Except, in this case, instead of the program presenting a security elevation dialog to get permission to delete a document (or do any of an infinite number of other fine-grained things), you'd specify its permissions ahead of time, in the form of a natural-language phrase that is fuzzily-keyed in a HTM memory-base to sub-goals and sibling-goals, each of which eventually reach leaf nodes representing granted permissions.

Also note that, if I specify my goal as viewing a PDF document, I am explicitly telling the computer I that running a script is not something I want to be doing. If I wanted that, I would say I want to interact with a PDF document (or, in more natural-language terms, "fill out a form", of which "interact with a PDF file" would be one of the child nodes.)

dexen · on April 6, 2011

> Also note that, if I specify my goal as viewing a PDF document, I am explicitly telling the computer I that running a script is not something I want to be doing.

LOL, no, you don't.

At least in general, for interacting with rich documents.

Imagine for a moment disabling JavaScript in your browser and browsing HN. Or any other important website. Now go ahead and actually try that -- you'll find that you've vastly underestimated the impact on usability. That you've missed a lot of functionality that is seamless, unobtrusive -- and you only notice its importance once its gone.

Thank you, I rest my case [0] [1].

> (...) you'd specify its permissions ahead of time, in the form of a natural-language phrase that is fuzzily-keyed in a HTM memory-base (...)

It strikes me as something very open to attacking special cases. Like weak encryption -- works most of the time, but breakable by somebody with incentives. But that's just a hitch, no hard facts.

The UX reeks of the UAC -- I can't see any bottom-line minding company repeating that mistake.

----

[0] yes, I know some people use the `noscript' plugin or similar. Still, the second basic functionality of such plugin is whitelisting websites for enabling JS on 'em.

[1] In any case, a bug in interpreter of any no-scripted (`plain data') document can give way to memory corruption and code injection. Happened way too many times with browsers, movie players etc. Exploit hidden in PNG? Heck, why not?

bediger · on April 5, 2011

Thanks for the tip on the Fred Cohen paper. I have seen arguments to the effect that since a virus doens't run with an infinite memory, you can indeed determine whether a disk file represents a virus or not.

As Alfred North Whitehead once said: "Necessity is the mother of invention" is a silly proverb. "Necessity is the mother of futile dodges" is much nearer the truth."

xyzzyz · on April 6, 2011

You can state it even more generally -- given any task, there is no algorithm to check whether a given program does it.

http://en.wikipedia.org/wiki/Rice_theorem

bediger · on April 6, 2011

Good reference, thank you. But...

Tell that to any software development manager, and you'll get a long lasting blank look. Too many people and corporations have lied to the managerial class for too long. I've had managers tell me that I can estimate development time exactly, while giving me a short, finite time span in which to generate the estimate.

pufuwozu · on April 6, 2011

I read your comment and decided to do read up on some theory behind it. I came across this:

http://cstheory.stackexchange.com/q/448/4052

From the highest rated answer:

Larry Wall proposed that we use "regular expression" for the formalism Kleene proposed, and "regex" for expressions for the widely used extensions. It's a fairly widely followed convention. If you want to make it clear that you are talking about regular expressions in the formal languages sense, it is usually not difficult to translate into talk of regular languages.

vog · on April 5, 2011

From the article:

> The regex relies on backtracking and thus would not work with a DFA-based engine

So the author already stated that strictly speaking, those are not true Regular Expressions. However, it would have been nice if he had stated this more clearly.

ars · on April 5, 2011

Dup: http://news.ycombinator.com/item?id=1572031

Did you add a question mark at the end to avoid the dup checker?

ugh · on April 6, 2011

That Submission is more than a year old. As far as I know, the dup checker doesn't check back that far.

Herald_MJ · on April 6, 2011

It does. I came across this article yesterday morning and attempted to submit it to HN, found that it was rejected for being a dupe.

Ultimately, I submitted it to reddit.com/r/compsci instead. I assume fogus picked up on it from there and hacked the HN dup detector by appending a '?' to the URL. Kind of annoying, really.

jarin · on April 5, 2011

It kind of reminds me of this blog post from a while back about Apple's CFArray implementation and how it switches data structure implementations at about 300,000 elements: http://ridiculousfish.com/blog/archives/2005/12/23/array/

Just a reminder that the simpler things seem, the more complex they can often get under the surface.

seles · on April 5, 2011

The regex is not complex, just the php implementation is.

I feel like any regex expression should give an error or at least a warning if is going to return a potentially incorrect answer.

bobfunk · on April 5, 2011

Would absolutely expect an exception if my regex engine decided it couldn't handle an expression!

Just giving back a wrong result unless you tweak the php.ini file seems insane. Stuff like this is really what makes PHP deserve its bad rep...

makmanalp · on April 5, 2011

> Needless to say, don’t use this prime determination algorithm for any sort of real-life stuff. Instead, just appreciate it for its elegance.

Yet another testament to the subjectivity of elegance in code. One man's elegant code is another's ugly hack. (Not to discredit the cleverness of the author.)

drivebyacct2 · on April 5, 2011

I don't find solutions that are limited in success and are ill-adapted to the problem set to be "elegant", but maybe I'm nitpicking.