Much of the other advice here is spot on, and it's definitely the first place you should look.

That said, if you really are constrained by pure-Python speed for ordinary tasks (and your first resorts of native code/multiple processes/parallelized IO aren't available), there is a large array of (often horrifying) dirty tricks you can use to eke out a few tens of percent of speed improvement.

Here are some random examples that come to mind, roughly sorted from "somewhat advanced but useful things to know or do" to "disgusting; why are you even using Python?":

- Be familiar with BytesIO, memoryview, and the buffer protocol (there's a small sketch after this list). Using these can dramatically improve memory efficiency (and even bring back a *little* bit of cache-locality benefit in Python's internal pointer hell) and reduce copies. If you're coming from C++, abandon all hope of ever getting to *zero* copies, but careful use of BytesIO can bring the number way down, and unlike other hacks on this list it doesn't damage the intelligibility of your code much.

- Be deeply suspicious of others' broad statements about the GIL. These are often wrong in both directions: many things you'd assume are not GIL-bottlenecked (independent calls into some native libraries) end up running in sequence because of the GIL; on the other hand, many things *can* be truly parallelized using native Python threads--even some non-I/O tasks (some numpy operations, some cryptography/compression libraries). Benchmark early and often.

- Use tuples instead of lists wherever possible (but if you find yourself casting back and forth, just use lists). This only occasionally brings performance benefits (e.g. via small-tuple reuse), but it's good practice anyway: don't add unnecessary mutability.

- Not all functions are created equal. Functions with a small number of positional arguments and no kwargs are marginally faster to call than functions with kwargs/variadics/more complicated signatures.

- When passing key functions or callbacks (e.g. to sort or map), the functions in the operator module are much faster than lambdas; use them if you can (sketch below).

- Keep the cost of function calls in mind when writing or using decorator-heavy code. Each decorator usually adds a function call, and often the more expensive (varargs/complex-signature) kind to boot.

- functools.partial can be slightly faster than wrapper functions if your arguments are uniform (sketch below).

- Relatedly, if you are using decorators for non-intercepting purposes (like registering functions/classes by decorating them), make sure your decorators return the passed-in function directly rather than a wrapper (sketch below). That reduces their runtime cost to zero.

- This isn't really algorithmic, but: if you're suffering from the startup time or CPU hit of lots of invocations of small/fast standalone scripts, turn off bytecode caching (sketch below). While the act of compiling bytecode is nearly free speed-wise, the I/O hit of writing the bytecode back to the filesystem can be surprisingly high. Bytecode caching was such a mistake.

- When using multiprocessing, share data via fork(2) wherever possible (sketch below). This makes it zero-cost to access largely read-only data in your parallel processes. I talked at length about this here and in adjacent comments: https://news.ycombinator.com/item?id=36941892

- Don't be afraid to drop back to bytes for hot-loop string manipulation (unless, of course, you need non-ASCII characters).
Some operations can be very slightly faster on bytes, but don't assume strings are always slow. Also, just like tuples/lists, lots of code implicitly converts supplied bytes to strings internally anyway, so if you're passing them to a library make sure you know what it's doing.

- Cache dot lookups for things (even stdlib module accesses/methods) in variables next to your hot loops (sketch below). This makes code pretty ugly and is at the top of my list of things I hope interpreter optimizations/JIT can more reliably help with over the long term. There's already a bit of optimization done in this area, so it may not turn out to help as much as you think it will.

- You can live-patch classes to amortize the overhead of __getattr[ibute]__ and property descriptors by binding new methods/fields at runtime and saving a bunch of dictionary hits. This isn't a panacea, since it requires you to trade away slotted speedups in some cases, and MRO cache invalidation can cause it to hurt more than it helps. As always, benchmark.

- Relatedly, the presence of __getattr__/__setattr__ *anywhere* in the MRO of a class is a bit of an optimization fence for speeding up method calls. The situations where this hurts performance have changed a lot between interpreter versions, but if you're using OO code in hot loops, removing those dunder methods from your class hierarchy is a good next step to try after caching away self-dot lookups.

- Don't access global variables in your hot loop; function-local variable lookups are a tiny bit faster (though this is an area where future optimizations may moot the advice). Remember that instance variables ("self.foo") are slower than both because of the dictionary lookup behind the dot.

- If you're using multiple Python threads (even if most of them are backgrounded/waiting on IO, e.g. Sentry or database drivers), you can override the interpreter's switch/check interval in your hot loops (sketch below). I've seen this work more than once, but very rarely.

- If for some strange reason you have lots of small, fast IOs in your hot loop, you can locally change interpreter buffering behavior (or drop to lower-level os.read/os.write calls and manage your own buffering) for a marginal speedup.

- In some very *very* rare cases, typing.Generic can actually add runtime overhead; benchmark with and without it.

- An easy win for small-script startup times is to remove locations from the module search path. If you strace(2) your program's compile pass (replace main with a sleep and strace up to that point), you'll often see it stat a handful of (missing) locations per import before it finds the module. This only saves a little bit of time, since filesystems tend to be good at metadata caching.

- Seriously, function calls are *expensive*. If you can't inline them, the awful generator hack can save you a few percent of function-call overhead: turn your function body into the inner loop of an infinite generator, create the generator outside of your hot loop, and cache gen.send/gen.__next__ in variables to "call" the function by sending values into the generator (fun fact: a cached gen.__next__ is faster than calling next(gen)). There's a sketch at the very end, but seriously, if you find yourself in a situation where this makes a difference, go for a smoke and rethink your life choices.
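Rough sketches of a few of the tricks above follow; all names and data in them are made up for illustration, so benchmark in your own context. First, BytesIO plus memoryview for cutting down copies: getbuffer() hands you a view over BytesIO's internal buffer, and slicing a view produces another view rather than a new bytes object.

    import io

    # Assemble a payload from many small pieces, then hand out slices of it
    # without copying the underlying bytes.
    buf = io.BytesIO()
    for chunk in (b"header,", b"payload,", b"trailer"):
        buf.write(chunk)

    view = buf.getbuffer()   # memoryview over BytesIO's internal buffer, no copy
    header = view[:7]        # slicing a memoryview yields another view, not a copy
    print(bytes(header))     # b'header,' -- bytes are only copied when you ask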
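The operator-module version of a key function versus the equivalent lambda (the rows here are just example data):

    import operator

    rows = [("b", 2), ("a", 3), ("c", 1)]

    # A lambda key pays a Python-level call per element...
    by_count = sorted(rows, key=lambda r: r[1])

    # ...while operator.itemgetter is implemented in C and is cheaper to call.
    by_count = sorted(rows, key=operator.itemgetter(1))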
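functools.partial versus a hand-written wrapper, assuming the bound arguments really are uniform (scale/double are placeholder names):

    import functools

    def scale(factor, value):
        return factor * value

    # A plain wrapper adds a full extra Python-level call per invocation...
    def double_slow(value):
        return scale(2, value)

    # ...while partial binds the argument in C and can be slightly cheaper.
    double = functools.partial(scale, 2)

    print(double_slow(21), double(21))   # 42 42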
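A registration-only decorator that returns the original function, so decorated functions cost nothing extra at call time (REGISTRY and handler are placeholder names):

    REGISTRY = {}

    def register(fn):
        # Record the function, then hand back the *original* object:
        # no wrapper, so calling it later pays zero added overhead.
        REGISTRY[fn.__name__] = fn
        return fn

    @register
    def handler(event):
        return event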
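Turning off bytecode cache writes for short-lived scripts; the -B flag and the PYTHONDONTWRITEBYTECODE environment variable do the same thing from the outside:

    # python -B myscript.py, or, at the very top of the entry point
    # (it only affects imports that happen after it is set):
    import sys
    sys.dont_write_bytecode = True   # suppress writing __pycache__/*.pyc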
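Sharing a big read-only structure with workers via fork(2): under the "fork" start method the children inherit the parent's memory copy-on-write, so nothing gets pickled or copied up front (BIG_TABLE is a stand-in for your real data):

    import multiprocessing as mp

    BIG_TABLE = {i: i * i for i in range(1_000_000)}   # built once in the parent

    def worker(key):
        # The child sees BIG_TABLE through copy-on-write pages inherited
        # from the parent; reading it costs nothing extra.
        return BIG_TABLE[key]

    if __name__ == "__main__":
        ctx = mp.get_context("fork")   # not available on Windows
        with ctx.Pool(4) as pool:
            print(pool.map(worker, [1, 2, 3]))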
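Caching dot lookups next to a hot loop; ugly, but each lookup hoisted out of the loop is one less attribute/dictionary hit per iteration:

    import math

    def rooted(values):
        sqrt = math.sqrt      # cache the module-attribute lookup
        out = []
        append = out.append   # cache the bound-method lookup too
        for v in values:
            append(sqrt(v))
        return out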
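Overriding the interpreter switch interval around a hot loop when background threads are mostly idle (0.05 is an arbitrary value; the default is 0.005 seconds):

    import sys

    def hot_loop(n):
        total = 0
        for i in range(n):
            total += i * i
        return total

    old = sys.getswitchinterval()
    sys.setswitchinterval(0.05)   # offer to switch threads less often
    try:
        hot_loop(1_000_000)
    finally:
        sys.setswitchinterval(old)   # restore so waiting threads aren't starved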
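And the generator hack, in all its glory; the "function" here just adds two numbers, which is about the level of work where this stops being worth it anyway:

    def _adder():
        # The function body lives inside an infinite generator loop.
        result = None
        while True:
            x, y = yield result
            result = x + y

    gen = _adder()
    send = gen.send          # cache the bound method outside the hot loop
    next(gen)                # prime the generator up to its first yield
    print([send(pair) for pair in [(1, 2), (3, 4)]])   # [3, 7]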