In the OP, the passage<p>> The "bug" here was not a software bug, or even a bad configuration: it was the unexpected interaction between two very different (and independently-maintained) software systems leading to a new mode of resource exhaustion.<p>can be viewed as pointing to the solution:<p>Broadly in computing, we have 'resource limits',
from the largest integer that can be stored
in 2's complement in 32 bits to disk partition
size, maximum single file size, number of IP
ports, etc. Mostly we don't pay really careful
attention to all of these resource limits.<p>Well, for 'systems' such as are being considered
in the OP, do consider the resource limits. So,
for each such limit, have some software track
its usage and then report the usage and especially
when usage of that resource is close to being exhausted.
And when the resource is exhausted, then report that
the problem was that the resource was exhausted and
who the users of that resource were.
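<p>A minimal sketch of that idea in Python; the resource names, limits, warning threshold, and the current_usage() probe are all made up for illustration:

    # Sketch: track usage of each limited resource, report when usage is close
    # to the limit, and report exhaustion plus its users when the limit is hit.
    LIMITS = {
        "open_file_descriptors": 1_048_576,
        "ephemeral_ports": 28_232,
        "disk_bytes": 500 * 10**9,
    }
    WARN_FRACTION = 0.9  # report at 90% of a limit

    def current_usage(resource):
        # Hypothetical probe; in practice read /proc, statvfs(), netstat, etc.
        raise NotImplementedError

    def check_limits():
        for resource, limit in LIMITS.items():
            used = current_usage(resource)
            if used >= limit:
                print(f"EXHAUSTED: {resource} ({used}/{limit}); report its users")
            elif used >= WARN_FRACTION * limit:
                print(f"WARNING: {resource} at {used / limit:.0%} of its limit")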
<p>For a little more, we can make some progress by
doing simulations of 'networks of queues' with
whatever stochastic processes we want to assume
for 'work arrival times' and
'work service times'.
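<p>As a toy version of that kind of simulation -- one queue, Poisson arrivals, exponential service times, with the rates here being arbitrary assumptions:

    # Sketch: simulate a single queue with random arrival and service times and
    # measure the average wait, i.e., how the backlog behaves near saturation.
    import random

    def simulate_queue(arrival_rate=0.9, service_rate=1.0, n_jobs=100_000):
        clock = 0.0            # time of the current arrival
        server_free_at = 0.0   # when the server finishes its current job
        total_wait = 0.0
        for _ in range(n_jobs):
            clock += random.expovariate(arrival_rate)   # 'work arrival time'
            start = max(clock, server_free_at)          # wait if the server is busy
            server_free_at = start + random.expovariate(service_rate)  # 'work service time'
            total_wait += start - clock
        return total_wait / n_jobs

    print("mean wait in queue:", simulate_queue())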
<p>For more, do some real-time system monitoring.
For problems that are well understood and/or
have been seen before, e.g., running out of
a critical resource, just monitor for the
well understood aspects of the problem.<p>Then
there are problems never seen before and
not yet well understood. Here monitoring is
essentially forced to be a continually applied
'statistical hypothesis test' for the 'health
and wellness' of the system complete with
rates of 'false positives' (false alarms) and
rates of 'false negatives' (missed detections
of real problems).<p>For the operational definition of 'healthy and
well', collect some operational data, let it
'age' until other evidence indicates that
the system was 'healthy and well' during
that time, and then use that data as the
definition. Then decree and declare that any
data 'too far' from the healthy and well data
indicates 'sick'.<p>Then, as usual in statistical hypothesis testing,
we want to know the false alarm rate in advance
and want to be able to adjust it. And if we
get a 'detection', that is, reject the 'null
hypothesis' that the system is healthy and
well, then we want to know the lowest false alarm
rate at which the particular data observed
in real time would still be a 'detection'
and use that false alarm rate as a measure
of the 'seriousness' of the detection.
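<p>A sketch of that logic for a single monitored score, using only ranks against the archived 'healthy and well' data; how the raw data gets turned into a score is left open:

    # Sketch: the false alarm rate is set in advance via ranks, with no
    # distribution assumed; the smallest rate at which the observation would
    # still be a detection (an empirical p-value) serves as its 'seriousness'.
    def empirical_p_value(healthy_scores, new_score):
        n = len(healthy_scores)
        at_least_as_extreme = sum(1 for s in healthy_scores if s >= new_score)
        return (1 + at_least_as_extreme) / (n + 1)

    def detect(healthy_scores, new_score, false_alarm_rate=0.01):
        p = empirical_p_value(healthy_scores, new_score)
        return p <= false_alarm_rate, p   # (alarm?, seriousness: smaller is worse)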
<p>If we make some relatively weak
probabilistic assumptions about the data,
then, in calculating the false alarm rate,
we can get some probabilistically 'exchangeable'
data. Then we get to apply some group theory
(from abstract algebra) and borrow a little
from classic ergodic theory to calculate the
false alarm rate. And we can use a classic
result in measure theory by S. Ulam,
that the French probabilist Le Cam
nicely called
'tightness', to
show that the hypothesis test is not
'trivial'. See Billingsley, 'Convergence
of Probability Measures'.
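<p>Exchangeability is what lets a plain permutation argument deliver the false alarm rate; a sketch, with the window statistic an arbitrary choice:

    # Sketch: under the null, the recent window is exchangeable with the archived
    # healthy data, so the rank of its statistic among random re-splits of the
    # pooled data gives a false alarm rate with no distribution assumed.
    import random

    def permutation_p_value(healthy, recent, n_resamples=999):
        stat = lambda xs: sum(xs) / len(xs)   # window mean; any statistic works
        observed = stat(recent)
        pooled = list(healthy) + list(recent)
        k = len(recent)
        hits = 0
        for _ in range(n_resamples):
            random.shuffle(pooled)
            if stat(pooled[:k]) >= observed:
                hits += 1
        return (1 + hits) / (1 + n_resamples)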
<p>Of course, the 'most powerful' test for
each given false alarm rate is from the
Neyman-Pearson lemma (the elementary
proofs are not much fun, but there is
a nice proof starting with the
Hahn decomposition, a fast result
from the Radon-Nikodym theorem),
but for problems
never seen before we do not have data
enough to use that result.
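<p>For reference, the lemma's test has the familiar likelihood-ratio form, written here in LaTeX; the catch is that it needs the 'sick' density f_1, which is just what we lack for problems never seen before:

    % Neyman-Pearson: with density f_0 for 'healthy' and f_1 for 'sick', the
    % most powerful level-alpha test rejects exactly when the likelihood ratio is large:
    \[
      \text{reject } H_0 \quad\text{when}\quad \frac{f_1(x)}{f_0(x)} > c,
      \qquad\text{where } c \text{ satisfies }
      P_{H_0}\!\left( \frac{f_1(x)}{f_0(x)} > c \right) = \alpha .
    \]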
<p>For the statistical hypothesis test,
we need two special properties: (1)
Especially for monitoring systems,
in particular complex or distributed systems,
we should have statistical hypothesis
tests that are multi-dimensional (multi-variate);
that is, if we are monitoring once a second,
then each second we should get data on each
of n variables for some positive
integer n. So, we are working with n variables.
(2) As U. Grenander at Brown U. observed
long ago, operational data from computer
systems is probabilistically, certainly
in distribution, quite different from data from most of
the history of statistics. He was correct.
Especially since we are interested in data
on several variables, we have no real hope
of knowing the distribution of the data even
when the system is 'healthy and well' and
still less hope when it is sick with a problem
we have never seen before. So, we need
hypothesis tests that are 'distribution-free',
that is, make no assumptions about probability
distributions. So, we are faced with calculating
false alarm rates for multi-dimensional data
where we know nothing about the probability
distribution.<p>There is a long history of hypothesis tests for
data from a
single variable, including many tests that are
distribution-free. See old books by
S. Siegel,
'Nonparametric Statistics for
the Behavioral Sciences' or
E. Lehmann, 'Nonparametrics'.<p>For multi-dimensional
hypothesis tests, there's not much
and still less when also distribution-free.<p>Why multi-dimensional? Because for the
two interacting systems in the example in
the OP, we guess that to detect the problem
we would want data from both systems. More
generally in a large server farm or network,
we are awash in variables on which we can
collect data at rates from thousands of
points per second down to a point every
few seconds. So, for n dimensions, n can
easily be dozens, hundreds, ....<p>For such tests, look at "anomaly
detector" in
'Information Sciences' in, as I
recall, 1999.<p>If we want to implement such a test, we might
want to read about k-D trees in, say,
Sedgewick. Then think about
backtracking in the tree, depth-first,
and accumulating some cutting planes.
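<p>A bare-bones version of the tree part, just to fix ideas; the cutting-plane bookkeeping the test itself needs is only hinted at in the comments:

    # Sketch: build a k-D tree over the n-dimensional 'healthy' observations and
    # walk it depth-first; a fuller version would accumulate the splitting
    # (cutting) planes along each path while backtracking.
    def build_kdtree(points, depth=0):
        if not points:
            return None
        axis = depth % len(points[0])          # cycle through the n coordinates
        points = sorted(points, key=lambda p: p[axis])
        mid = len(points) // 2
        return {
            "point": points[mid],
            "axis": axis,
            "left": build_kdtree(points[:mid], depth + 1),
            "right": build_kdtree(points[mid + 1:], depth + 1),
        }

    def depth_first(node, visit):
        if node is None:
            return
        visit(node["point"], node["axis"])     # e.g., record the cutting plane here
        depth_first(node["left"], visit)
        depth_first(node["right"], visit)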
<p>For monitoring, there was for some time work
by Profs. Patterson and Fox at Stanford
and Berkeley, in their RAD Lab, funded
by Google, Microsoft, and Sun. The
paper in 'Information Sciences' seems
to be ahead.<p>More can be done. E.g., consider the
Rocky Mountains and assume that they
are porous to water. Let the mountains
be the probability density of
two-variable data when the system is
healthy and well. Now pour in water
until, say, the lowest 1% of the probability
mass has been covered. Now the test is: if we
observe a point in the water, raise
an alarm. Now the false alarm rate is
1%. And the region of false alarms has the
largest area we can give it at a 1% rate
(a rough surrogate for the best detection rate --
with a fast use of Fubini's theorem in
measure theory, we can say more here).
As we know, lakes can have fractal boundaries,
and the techniques in the paper will,
indeed, approximate those. Also the techniques
do not require that all the lakes be
connected. And for the dry land, it need
not all be connected either and might
be in islands. So, it is quite general.
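<p>The water-level picture is easy to mock up with a coarse two-variable histogram standing in for the mountains; the bin count and the 1% rate below are illustrative:

    # Sketch of the 'water level': estimate the two-variable 'healthy' density
    # with a histogram, flood the lowest-count bins until about 1% of the
    # probability mass is under water, and alarm on any point that lands in a
    # flooded bin or in a bin never seen in the healthy data.
    from collections import Counter

    def water_level_detector(points, bins=20, false_alarm_rate=0.01):
        lo = [min(c) for c in zip(*points)]
        hi = [max(c) for c in zip(*points)]
        def cell(p):
            return tuple(
                min(bins - 1, int((v - l) / (h - l + 1e-12) * bins))
                for v, l, h in zip(p, lo, hi)
            )
        counts = Counter(cell(p) for p in points)
        flooded, covered = set(), 0
        for c, n in sorted(counts.items(), key=lambda kv: kv[1]):
            if covered + n > false_alarm_rate * len(points):
                break
            flooded.add(c)
            covered += n
        seen = set(counts)
        return lambda p: cell(p) in flooded or cell(p) not in seen  # True = alarm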
<p>But we may want to assume that the region of
healthy and well performance is a convex
set and try again. If this assumption
is correct, then for a given false alarm
rate we will get a higher detection rate,
that is, a more 'powerful' test.
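<p>A sketch of that convex variant in two dimensions, with the convex hull of the healthy data standing in for the healthy region; a fuller version would trim the hull to hit a chosen false alarm rate:

    # Sketch: take the convex hull of the 'healthy' points (Andrew's monotone
    # chain) and alarm on any observation that falls outside it.
    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    def convex_hull(points):
        pts = sorted(set(points))
        if len(pts) <= 2:
            return pts
        lower, upper = [], []
        for p in pts:
            while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
                lower.pop()
            lower.append(p)
        for p in reversed(pts):
            while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
                upper.pop()
            upper.append(p)
        return lower[:-1] + upper[:-1]   # hull vertices, counter-clockwise

    def outside_hull(hull, p):
        # Outside if strictly to the right of any counter-clockwise edge.
        return any(
            cross(hull[i], hull[(i + 1) % len(hull)], p) < 0
            for i in range(len(hull))
        )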
<p>Still more can be done. But, really,
the work is essentially some applied
math with, at times, some moderately
advanced prerequisites from pure and
applied math. Theme: The crucial
content of the most powerful future
of computing is in math, not computer
science. Sorry 'bout that!