
Ask HN: How do you run long computations without interruption?

6 points by annowiki, almost 4 years ago
Recently I started messing around with numpy trying to write a government simulator to analyze the influence of government institution structure on the frequency of laws being passed (i.e., how liberal or conservative a government's outcomes are across similar populations) and I found the simulations take a really long time to run (probably because the code is badly optimized).

When I spent ten minutes waiting for completion only to see a completely faulty result I started wondering how most scientists and data programmers run long simulations without interruptions or disappointing results. I assume one simple thing I should have done was run a shorter simulation to see if the code was working, but I also started thinking about how I could run a really long simulation without having to worry about it being scrapped after a lot of computation, either because of a runtime error or just a power outage or something.

I know there are cloud services to run expensive computations, but really my question is: are there industry tools or techniques for pause/resume execution of code and simulations? I can't imagine anyone running a simulation for more than ten minutes in a Jupyter cell.

8 comments

al2o3cr, almost 4 years ago
Checkpointing - saving the state of the computation at intermediate points - can help a lot with this.

BUT, beware that checkpointing can also be unreliable if the system being simulated exhibits chaotic dynamics; truncating off "unimportant" decimal places during checkpointing can produce results that diverge rapidly between the "original" and "restarted from checkpoint" versions.
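(Not part of the comment: a minimal sketch of the divergence warning, using the logistic map as a stand-in for a real simulation. Rounding the saved state to a few decimal places is enough to make the restarted run drift away from the original.)

    # Logistic map in a chaotic regime: tiny state differences grow exponentially.
    R = 3.9

    def advance(x, steps):
        for _ in range(steps):
            x = R * x * (1.0 - x)
        return x

    x0 = 0.123456789012345
    mid = advance(x0, 1000)          # full-precision float64 state at the "checkpoint"
    mid_truncated = round(mid, 6)    # checkpoint written with only 6 decimal places

    print(advance(mid, 200))            # trajectory continued from the full state
    print(advance(mid_truncated, 200))  # restarted from the truncated checkpoint: diverges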
JoeyBananas, almost 4 years ago
1. Spend some time working on the verbose mode of your program so that you can use the output for debugging.

2. Consider setting up another computer on your local network to run your simulation. I often use workflows where, to run my program, I execute a makefile that rsyncs the project directory and runs the program on a remote machine over SSH.
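(Illustration of point 1, not from the comment: a sketch of a verbose mode for a hypothetical command-line simulation script; the flag name and logging setup are assumptions, not anything the comment prescribes.)

    import argparse
    import logging

    # Hypothetical driver script: --verbose switches on detailed progress output
    # that can later be used to debug a faulty long run.
    parser = argparse.ArgumentParser()
    parser.add_argument("--verbose", action="store_true")
    args = parser.parse_args()

    logging.basicConfig(
        level=logging.DEBUG if args.verbose else logging.INFO,
        format="%(asctime)s %(levelname)s %(message)s",
    )
    log = logging.getLogger("sim")

    state = 0.0
    for step in range(1_000_000):
        state += 0.001  # placeholder for the real simulation update
        if step % 100_000 == 0:
            log.debug("step %d, state=%.6f", step, state)
    log.info("finished, final state=%.6f", state)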
ThrowawayR2, almost 4 years ago
Have the simulation do periodic dumps of its state to disk and have the ability to resume the simulation from those dumps if interrupted.
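(Not part of the comment: a minimal sketch of that dump-and-resume pattern with numpy, assuming a hypothetical single-array state and file name; the write-then-rename step is an extra precaution, not something the comment specifies.)

    import os
    import numpy as np

    CHECKPOINT = "sim_state.npz"   # hypothetical dump file
    TOTAL_STEPS = 1_000_000
    DUMP_EVERY = 10_000

    # Resume from the latest dump if one exists, otherwise start fresh.
    if os.path.exists(CHECKPOINT):
        dump = np.load(CHECKPOINT)
        state, start = dump["state"], int(dump["step"])
    else:
        state, start = np.zeros(100), 0

    for step in range(start, TOTAL_STEPS):
        state = state + 0.001      # placeholder for the real update rule
        if (step + 1) % DUMP_EVERY == 0:
            # Write to a temporary file and rename, so an interruption mid-write
            # never corrupts the last good dump.
            np.savez(CHECKPOINT + ".tmp.npz", state=state, step=step + 1)
            os.replace(CHECKPOINT + ".tmp.npz", CHECKPOINT)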
softwaredoug, almost 4 years ago
Get extensive code out of notebooks into testable libraries. Confirm it works before running a big computation.

We try to keep our notebooks slim and factor anything complex into a Python library. We can experiment with the library's implementation and write unit tests on smaller data sets to confirm it works before doing a bigger experiment. We can also deploy the library to prod and ensure users get the exact same thing we experimented with.

I would recommend the talk "I don't like notebooks": https://m.youtube.com/watch?v=7jiPeIFXb6U

(There's also the alternative nbdev, which creates a full software engineering and experimentation setup in 100% notebooks. Not something I particularly like, but it might be another option.)
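(Illustration, not from the comment: a hypothetical slice of the OP's simulator factored into a plain module plus a pytest check on an input small enough to verify by hand. The function name, vote model, and file names are made up for the example.)

    # sim_lib.py -- hypothetical library module factored out of the notebook
    import numpy as np

    def passage_rate(votes: np.ndarray, threshold: float = 0.5) -> float:
        """Fraction of proposed laws that pass, given a (laws x legislators) 0/1 vote matrix."""
        return float((votes.mean(axis=1) > threshold).mean())


    # test_sim_lib.py -- unit test on a data set small enough to check by hand
    def test_passage_rate_small_case():
        votes = np.array([[1, 1, 0],   # 2/3 in favour -> passes
                          [0, 0, 1]])  # 1/3 in favour -> fails
        assert passage_rate(votes) == 0.5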
h2odragon, almost 4 years ago
At some point your "data set" becomes a "data base" and you might look at what actual DBMS things can do for you. The idea of transactions can be very helpful if your simulation runs in steps; you might get rollback for free.
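(Not part of the comment: a minimal sketch of the transaction idea using the sqlite3 module from the Python standard library; the table layout and per-step quantity are hypothetical.)

    import sqlite3

    # Hypothetical: commit each simulation step as a transaction, so a crash or
    # exception mid-step rolls back and the last committed step remains valid.
    con = sqlite3.connect("sim.db")
    con.execute(
        "CREATE TABLE IF NOT EXISTS steps (step INTEGER PRIMARY KEY, laws_passed INTEGER)"
    )

    for step in range(100):
        try:
            with con:  # opens a transaction; commits on success, rolls back on error
                laws_passed = step % 7  # placeholder for the real per-step result
                con.execute("INSERT INTO steps VALUES (?, ?)", (step, laws_passed))
        except sqlite3.Error:
            break  # the database still holds the last successfully committed step

    con.close()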
alimw, almost 4 years ago
I have run weeks-long computations in a Jupyter cell! The only issue was with OSX updates -- I disabled those.
cpach, almost 4 years ago
What is long in this context? 1 day? 1 week? 3 months?
rapjr9, almost 4 years ago
I put computing clusters on big UPS's to help keep them running during temporary power glitches. Make sure to put all the related equipment on UPS's also (switches, routers, hubs, KVM, monitor).

Access control and scheduling of computing time are important also; you don't want multiple people running computations at the same time, because you might run out of RAM, for example, and in many experiments knowing the run time on a fully utilized CPU is important.

Batch scripts can help: if you are running a series of computations that are not dependent on each other, run them from a script, then if some fail it is easy to comment out the others and rerun the ones that failed. There are cluster management software programs that can automate much of this for you. Security measures are important; you don't want ransomware to hit during a run.

Write good code for your computation that has sensible error handling, so the computation can recover from errors, or at least give you data on what caused them. It can be difficult to debug long-running software; it is similar to debugging server software. Sometimes you need the ability to only start generating logs after X hours of running or when the computation reaches a particular stage, or to capture state continuously but only dump recent state history when an error occurs.

Reliable air conditioning and power are essential. Physical access control to the hardware keeps anyone from bumping the hardware, accidentally unplugging it, or otherwise messing with it. Sometimes it may make sense to have some spare machines that a task can migrate to if there is a hardware failure. Critical management functions you might run on more than one machine, so that there is always at least one machine alive to control the systems. There are ways to lock circuit breakers so that service people cannot turn them off (or on) by mistake without a key.

Write your code to remove dependencies on an internet connection (e.g., use IP addresses instead of DNS, keep data local, have a local DNS server, send error reports using a local mail server that will keep retrying). The computer location should keep insects, rodents, animals, mold, and water at bay. Vibration from nearby mechanical systems can be an issue (e.g., construction going on in a nearby room; make sure facilities people have to get permission from you). Remote access can let you check that things are running smoothly; a dashboard showing a variety of measures can be useful, though these are also a security risk. Keep automatic upgrades/updates turned off, and only upgrade/update manually. Stop unnecessary services from running.

Once you start thinking about it you'll find lots of things that could potentially go wrong; the world is extremely perverse and sneaky in this respect. If you find yourself running long-term tasks over many years, your list of mitigation measures will always be growing. Maybe you didn't consider the RF environment, but a new local cell tower or radio tower starts messing with your systems. A pile driver half a mile away at a construction site can create strong vibrations and impulses in your computers. A solar storm can increase RF noise and reduce the quality of the AC power. Earthquakes happen. Med labs can have radiation sources. Fire is destructive, and fire suppression systems can destroy your computers even if there is no fire. Someone may fire a gun through the walls if the public can get close to your computer room. Wireless networking is a security risk.
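(On the batch-script point above, not part of the comment: a sketch of a driver that runs independent simulation configurations, records the ones that finished, and reruns only the failures on the next invocation. run_sim.py, the config names, and the marker-file scheme are all assumptions made up for the example.)

    import subprocess
    from pathlib import Path

    # Hypothetical batch driver for independent runs: completed configs leave a
    # marker file, so rerunning the script only retries the ones that failed.
    configs = ["baseline.json", "unicameral.json", "bicameral.json"]
    done = Path("done")
    done.mkdir(exist_ok=True)

    for cfg in configs:
        marker = done / (cfg + ".ok")
        if marker.exists():
            continue  # finished in an earlier invocation
        result = subprocess.run(["python", "run_sim.py", cfg])  # hypothetical entry point
        if result.returncode == 0:
            marker.touch()
        else:
            print(f"{cfg} failed with exit code {result.returncode}; fix and rerun")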