FrontierMath's difficult questions remain unpublished so that AI companies can't train against it.
On Friday, research organization Epoch AI released FrontierMath, a new mathematics benchmark that has been turning heads in the AI world because it contains hundreds of expert-level problems that leading AI models solve less than 2 percent of the time, according to Epoch AI. The benchmark tests AI language models (such as GPT-4o, which powers ChatGPT) against original mathematics problems that typically require hours or days for specialist mathematicians to complete.
FrontierMath's performance results, revealed in a preprint research paper, paint a stark picture of current AI model limitations. Even with access to Python environments for testing and verification, top models like Claude 3.5 Sonnet, GPT-4o, o1-preview, and Gemini 1.5 Pro scored extremely poorly. This contrasts with their high performance on simpler math benchmarks—many models now score above 90 percent on tests like GSM8K and MATH.
The design of FrontierMath differs from many existing AI benchmarks because the problem set remains private and unpublished to prevent data contamination. Many existing AI models have been trained on the problem sets of other benchmarks, letting them solve those problems easily and appear more generally capable than they actually are. Many experts cite this as evidence that current large language models (LLMs) are poor generalist learners.
68 °F above average is a lot. For a tropical country it is not credible for temperatures to be that much warmer than average because the average is too high to give enough headroom. So what gives?
Reading the article I found this:
parts of Malawi saw a maximum temperature of 43C (109F), compared with an average of nearly 25C (77F)
As I expected, the actual temperature increase was 32 °F, not 68 °F. So what’s up with that headline? Here’s a hint: this is what the headline might say if you set your location to somewhere other than the United States:
Now “nearly 20C” is an odd way of saying “18 °C”, but I guess they really like round numbers, and that’s not the problem. The problem is that somebody – the localization team? an algorithm? – decided that 20 °C was equivalent to 68 °F. And they’re not wrong. And yet they are.
When converting from a temperature in Celsius to one in Fahrenheit you have to multiply by 1.8 (because each degree Celsius covers a range 1.8 times as large as a degree Fahrenheit) and you have to add 32 °F (because the freezing point in Fahrenheit is 32, compared to 0 in Celsius). However if you are converting a temperature difference you just multiply by 1.8.
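The two conversions can be sketched in a few lines of Python (the function names are mine, not from the article):

```python
def c_to_f(temp_c):
    """Convert an absolute Celsius temperature to Fahrenheit."""
    return temp_c * 1.8 + 32.0

def c_to_f_delta(delta_c):
    """Convert a Celsius temperature *difference*: scale only, no offset."""
    return delta_c * 1.8

# The figures from the Malawi article:
assert round(c_to_f(43)) == 109            # a 43 °C maximum is 109 °F
assert round(c_to_f_delta(43 - 25)) == 32  # an 18 °C anomaly is about 32 °F

# The headline's error: feeding the ~20 °C *difference* through the
# absolute conversion produces the bogus 68 °F figure.
assert c_to_f(20) == 68.0
```

Applying the 32-degree offset to a difference double-counts the freezing point, which is exactly the mistake in the headline.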
This is just another version of the fallacy involved when somebody says that it is “twice as hot” when the temperature goes from 5 °C to 10 °C – note that this is equivalent to going from 278 K to 283 K, or 41 °F to 50 °F, so clearly not “twice as hot” in any meaningful way.
Intel’s manuals for their x86/x64 processor clearly state that the fsin instruction (calculating the trigonometric sine) has a maximum error, in round-to-nearest mode, of one unit in the last place. This is not true. It’s not even close.
The worst-case error for the fsin instruction for small inputs is actually about 1.37 quintillion units in the last place, leaving fewer than four bits correct. For huge inputs it can be much worse, but I’m going to ignore that.
I was shocked when I discovered this. Both the fsin instruction and Intel’s documentation are hugely inaccurate, and the inaccurate documentation has led to poor decisions being made.
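As a rough illustration of what “units in the last place” means, here is a sketch of my own (not from the article) that measures the ULP error of a sine implementation against an exact rational Taylor series. It checks the C library’s sin via Python’s math module, since the x87 fsin instruction isn’t reachable from portable Python; a good library sine should stay within a couple of ulps even near pi, where fsin’s short internal value of pi causes its huge errors.

```python
import math
from fractions import Fraction

def sin_ref(x, terms=30):
    """Taylor series for sin(x) in exact rational arithmetic.
    For |x| up to a few units, the truncation error is far below
    one double ulp of the result."""
    fx = Fraction(x)
    term = fx
    total = term
    for n in range(1, terms):
        term *= -fx * fx / ((2 * n) * (2 * n + 1))
        total += term
    return total

def ulp_error(x):
    """Error of math.sin(x), in units in the last place of its result."""
    approx = math.sin(x)
    return float(abs(Fraction(approx) - sin_ref(x)) / Fraction(math.ulp(approx)))

# Near pi the true result is tiny (about 1.22e-16 for the double closest
# to pi), so sloppy argument reduction shows up as an enormous ULP error.
print(ulp_error(math.pi))
```

By this measure, an error of “1.37 quintillion ulps” means almost every bit of the result is wrong, not just the last one.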
brucedawson on October 9, 2014 at 10:38 pm
This will affect programmers, who then have to work around the issue so that everyday computer users are not affected. The developers of VC++ and glibc had to write alternate versions, so that’s one thing. The inaccuracies could add up over repeated calls to sin and lead to errors in flight-control software, CAD software, games, various things. It’s hard to predict where the undocumented inaccuracy could cause problems.
It likely won’t now because most C runtimes don’t use fsin anymore and because the documentation will now be fixed.
The DM32 is our enhanced classic all-rounder based on the HP 32SII. 171 functions, of which 75 are directly accessible from the keypad. Programmable. Conversions, statistics, fractions, equations, solver and more. The perfect choice for almost everybody. BETA firmware installed; updates will be required.
Anonymous Coward
Don't put it in your pocket
Are we now going to discover that Hezbollah bought a batch of calculators from Brazil some months ago?
Ian Johnston (Silver badge)
Re: Don't put it in your pocket
If they did, it's a bad move which might easily blow up in their faces.
Yet Another Anonymous coward (Silver badge)
Re: Scientific Calculator
Scientific calculators use a body of tested and published algorithms to determine the answer.
Non-scientific calculators believe what they read in the Daily Mail and what someone's sister's best friend's hairdresser's partner saw on Facebook
Andy Non (Silver badge)
Re: Scientific Calculator
Scientific calculator:
1+2x3=7
Daily Mail calculator:
1+2x3=9
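The joke hinges on operator precedence. A minimal Python sketch of the two behaviours (the left-to-right evaluator is a toy of my own):

```python
# A scientific calculator applies precedence: multiplication binds tighter.
assert 1 + 2 * 3 == 7

def left_to_right(tokens):
    """Evaluate a flat [value, op, value, ...] expression strictly
    left to right, ignoring operator precedence."""
    result = tokens[0]
    for op, val in zip(tokens[1::2], tokens[2::2]):
        result = result + val if op == "+" else result * val
    return result

# The "Daily Mail calculator" computes (1 + 2) * 3 instead.
assert left_to_right([1, "+", 2, "*", 3]) == 9
```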
Here’s how I used ChatGPT and a little bit of JavaScript to figure out I could win this video contest... and then proceed to win it.
All told, there were 538 entries competing for prizes in the contest. n/(538 + n) doesn’t sound like great odds, does it?
Let’s dig deeper to see why winning isn’t as improbable as it sounds on the surface. To do this, we’ll review the existing submissions.
Now we can get to figuring out the probabilities and odds. For this, we’ll use a function to calculate the binomial coefficient.
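The post’s own code is JavaScript; here is a Python equivalent using the standard library’s math.comb as the binomial coefficient. The model below — k winners drawn uniformly from all entries — is an assumption of mine; the actual contest rules may differ.

```python
from math import comb

def p_win(n, k, others=538):
    """Probability that at least one of your n entries is among k winners
    drawn uniformly at random from `others` + n total entries.
    (Assumed model, not necessarily the contest's actual rules.)"""
    total = others + n
    # P(none of your entries win) = C(others, k) / C(total, k)
    return 1 - comb(others, k) / comb(total, k)

# With one entry and one prize, the chance is 1/539 -- the n/(538 + n)
# figure from above with n = 1.
assert abs(p_win(1, 1) - 1 / 539) < 1e-12
```

Submitting more entries, or competing for more prizes, raises the probability in the obvious monotonic way.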
For more than 2,300 years, Euclid’s Elements has been the foundation for countless students to learn how to reason with precision and pursue knowledge in all fields of learning.
The brilliance of his work has made it the second most published book in history because it provides profound tools to distinguish truth from error and discover fundamental principles about the world.
Modeled after our core mathematics course, “Mathematics and Logic” examines the vital importance of good reasoning to the liberal arts.
With this course, you’ll study the transformation of mathematics by the ancient Greeks, the fundamentals of logic and deductive reasoning, the central proofs of Euclid, the birth of modern geometry, and much more.
And now, you can own a DVD box set of “Mathematics and Logic” for a gift of $100 or more to Hillsdale College.
What price common sense? • June 11, 2024 7:30 AM
@Levi B.
“Those who are not familiar with the term “bit-squatting” should look that up”
Are you sure you want to go down that rabbit hole?
It’s an instance of a general class of problems that are never going to go away.
And why, in
“Web servers would usually have error-correcting (ECC) memory, in which case they’re unlikely to create such links themselves.”
the key word is “unlikely” – or, more formally, “low probability”.
Because it’s down to the fundamentals of the universe and the failings of logic and reason as we formally use them. Which in turn is why, from at least as early as the ancient Greeks through to the 20th century, some of those thinking about it in its various guises have gone mad, and some have committed suicide.
To understand why, you need to understand why things like “Error Correcting Codes” (ECC) will never be 100% effective, and why deterministic encryption systems, especially stream ciphers, will always be vulnerable.
No matter what you do, all error checking systems have both false positive and false negative results. All you can do is tailor the system to the more probable errors.
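A concrete toy example of a check with false negatives (my own illustration, not from the comment): a single parity bit detects any odd number of flipped bits but is blind to an even number.

```python
def parity(bits):
    """Even-parity check bit for a list of 0/1 values."""
    return sum(bits) % 2

word = [1, 0, 1, 1, 0, 0, 1, 0]
check = parity(word)

one_flip = word.copy()
one_flip[3] ^= 1
assert parity(one_flip) != check   # single-bit error: detected

two_flips = word.copy()
two_flips[2] ^= 1
two_flips[5] ^= 1
assert parity(two_flips) == check  # double-bit error: a false negative
```

Stronger codes such as the SECDED schemes used in ECC memory push the undetected patterns further out, but some error pattern always slips through.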
But there are other underlying issues: bit flips happen in memory by deterministic processes that apparently happen by chance. Back in the early 1970s, when putting computers into space became a reality, it was known that computers were affected by radiation. Initially it was assumed the radiation had to be energetic enough to be ‘ionizing’, but it later turned out that with low-energy CMOS chips any EM radiation, such as that from the antenna of a hand-held two-way radio, would do.
This was due to metastability. In practice the logic gates we use are very high gain analog amplifiers that are designed to “crash into the rails”. Some logic such as ECL was actually kept linear to get speed advantages but these days it’s all a bit murky.
The point is as the level at a simple logic gate input changes it goes through a transition region where the relationship between the gate input and output is indeterminate. Thus an inverter in effect might or might not invert or even oscillate with the input in the transition zone.
I won’t go into the reasons behind it, but it’s down to two basic issues: firstly, the universe is full of noise; secondly, it’s full of quantum effects. The two can be difficult to differentiate even in very long-term measurements, and engineers tend to lump it all under a first approximation of a Gaussian distribution as “Additive White Gaussian Noise” (AWGN), which has nice properties such as averaging predictably to zero over time and a well-defined root mean square. However, the universe tends not to play that way when you get up close, so instead “phase noise in a measurement window” is often used, with Allan deviation.
There are things we cannot know because they are unpredictable or beyond our ability to measure.
But they are also beyond a deterministic system's ability to calculate.
Computers only know “natural numbers”, or “unsigned integers”, within a finite range. Everything else is approximated, or as others would say “faked”. Between the natural numbers there are other numbers: some can be found as ratios of natural numbers, and others can not. What drove philosophers and mathematicians mad was the realisation that the likes of “root two” and pi exist, and that there is an infinity of such numbers we can never know. Another issue was the gaps created by integer multiplication: the smaller the integers, the smaller the gaps between their multiples. Eventually it was realised that there was an advantage to this, in that it scaled. The result in computers is floating-point numbers. They work well for many things, but not for addition and subtraction of small values with large values.
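A quick Python illustration of that last point about mixing small and large values:

```python
import math

# At 1e16 the gap between adjacent doubles is 2.0, so adding 1.0 changes
# nothing: the small value is absorbed.
big = 1.0e16
assert big + 1.0 == big

vals = [1.0e16, 1.0, -1.0e16]
assert sum(vals) == 0.0        # naive left-to-right sum loses the 1.0
assert math.fsum(vals) == 1.0  # compensated summation recovers it
```

math.fsum tracks the rounding error of each partial sum, which is one standard way around the problem when the magnitudes in play differ wildly.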
As has been mentioned, LLMs are in reality no different from “Digital Signal Processing” (DSP) systems in their fundamental algorithms. One of these is “Multiply and ADd” (MAD) using integers. These have issues in that values disappear or can not be calculated. With continuous signals the errors can be integrated in with little distortion. In LLMs they can cause errors that are part of what has been called “hallucinations”. That is where something with meaning to a human, such as the name of the Pokemon trading-card character “Solidgoldmagikarp”, gets mapped to an entirely unrelated word, “distribute” – thus mayhem resulted on GPT-3.5, and much hilarity once it became widely known.
Purdue University mathematics professor Clarence Waldo was only at the Indiana Statehouse to lobby for the school during budget talks in February of 1897. That’s when he happened to witness House Bill 246 – to legally change the value of the number pi to 3.2 – pass its third and final reading in the General Assembly’s lower house.
Waldo resolved to make sure the Senate didn’t make the same embarrassing mistake, privately coaching several senators on how to speak against the bill. At the same time, newspapers outside the state were picking up the story, correctly making fun of Indiana legislators for being so easily hoodwinked.
Sen. Orrin Hubbel of Elkhart County took the lead in trying to kill the bill when it reached the floor of the Senate, calling it “utter folly” and stating he and his colleagues “might as well try to legislate water to run up hill as to establish mathematical truth by law,” according to a report in the Indianapolis Journal.
Thankfully, the bill died before coming to a vote, but that was due more to Waldo’s lobbying and the negative publicity than any principled opposition based on basic mathematical knowledge.