



Pitfalls of Object Oriented Programming

Sony Computer Entertainment Europe, Research & Development Division
Tony Albrecht – Technical Consultant, Developer Services

What I will be covering
• A quick look at Object Oriented (OO) programming
• A common example
• Optimisation of that example
• Summary

Object Oriented (OO) Programming
• What is OO programming?
  – "A programming paradigm that uses 'objects' – data structures consisting of data fields and methods together with their interactions – to design applications and computer programs." (Wikipedia)
• Includes features such as
  – Data abstraction
  – Encapsulation
  – Polymorphism
  – Inheritance

What's OOP for?
• OO programming allows you to think about problems in terms of objects and their interactions.
• Each object is (ideally) self-contained
  – Contains its own code and data.
  – Defines an interface to its code and data.
• Each object can be treated as a 'black box'.

Objects
• If objects are self-contained then they can be
  – Reused.
  – Maintained without side effects.
  – Used without understanding their internal implementation/representation.
• This is good, yes?

Are Objects Good?
• Well, yes.
• And no.
• First, some history.

A Brief History of C++
• 1979 – development of C++ starts
• 1983 – named "C++"
• 1985 – first commercial release
• 1989 – release of v2.0, adding multiple inheritance, abstract classes, static member functions, const member functions and protected members
• 1998 – standardised
• 2003 – updated (C++03)
• 2009 – C++0x in progress

So what has changed since 1979?
• Many more features have been added to C++.
• CPUs have become much faster.
• Transition to multiple cores.
• Memory has become faster.
(Image credit: http://www.vintagecomputing.com)

CPU performance
(Chart from "Computer Architecture: A Quantitative Approach" by John L. Hennessy, David A. Patterson and Andrea C. Arpaci-Dusseau)

CPU/Memory performance
(Chart from the same source)

What has changed since 1979?
• One of the biggest changes is that memory access speeds are far slower (relatively)
  – 1980: RAM latency ~1 cycle
  – 2009: RAM latency ~400+ cycles
• What can you do in 400 cycles?

What has this to do with OO?
• OO classes encapsulate code and data.
• So an instantiated object will generally contain all the data associated with it.

My Claim
• With modern hardware (particularly consoles), excessive encapsulation is BAD.
• Data flow should be fundamental to your design (Data Oriented Design).

Consider a simple OO Scene Tree
• Base Object class – contains general data
• Node – container class
• Modifier – updates transforms
• Drawable/Cube – renders objects

Object
• Each object
  – Maintains a bounding sphere for culling
  – Has a transform (local and world)
  – Has a dirty flag (an optimisation)
  – Has a pointer to its parent
(Memory layout – each square is 4 bytes – shown alongside the class definition)

Nodes
• Each Node is an Object, plus
  – A container of other objects
  – A visibility flag.
(Memory layout shown alongside the class definition)

Consider the following code…
• Update the world transform and world-space bounding sphere for each object.
• Leaf nodes (objects) return transformed bounding spheres.
• What's wrong with this code?
• If mDirty is false then we get a branch misprediction, which costs 23 or 24 cycles.
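To make the per-object footprint concrete, here is a sketch of the kind of Object/Node class the slides describe. The names, member types and exact layout are illustrative guesses, not the original SCEE code:

```cpp
#include <cstddef>
#include <vector>

struct Matrix4 { float m[16]; };            // 4x4 transform, 64 bytes
struct Sphere  { float x, y, z, radius; };  // 16 bytes

class Object {
public:
    virtual ~Object() {}                    // implies a vtable pointer per instance
protected:
    Object* mParent;                        // pointer to parent
    Matrix4 mTransform;                     // local transform
    Matrix4 mWorldTransform;                // world transform
    Sphere  mBoundingSphere;                // local-space sphere, for culling
    Sphere  mWorldBoundingSphere;
    bool    mDirty;                         // the dirty-flag "optimisation"
};

class Node : public Object {
protected:
    std::vector<Object*> mObjects;          // container of other objects
    bool                 mVisible;          // visibility flag
};
```

Even this stripped-down version drags two 64-byte matrices, two spheres, a parent pointer and a vtable pointer through memory for every update, so with 128-byte cache lines a single instance straddles at least two lines — which is why the slides count cache misses per object.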
• Calculation of the world bounding sphere takes only ~12 cycles.
• So using a dirty flag here is actually slower than not using one (in the case where the flag is false).

Let's illustrate cache usage
(Main memory vs the L2 cache; each cache line is 128 bytes. Assume mTransform starts on a 128-byte boundary, i.e. at the start of a cache line.)
• parentTransform is already in the cache (somewhere).
• Load mTransform into the cache.
• mWorldTransform is stored via the cache (write-back).
• Next, mObjects is loaded.
• Then a pointer is pulled from somewhere else (memory managed by std::vector).
• The vtbl pointer is loaded into the cache.
• The virtual function is looked up.
• Then we branch to that code (loading in instructions).
• The new code checks the dirty flag, then sets the world bounding sphere.
• The Node's world bounding sphere is then expanded.
• Then the next Object is processed.
• The first object costs at least 7 cache misses; subsequent objects cost at least 2 cache misses each.

The Test
• 11,111 nodes/objects in a tree 5 levels deep
• Every node being transformed
• Hierarchical culling of the tree
• Render method is empty

Performance
• 22ms – and this is the time taken just to traverse the tree. Why is it so slow?

Look at GetWorldBoundingSphere()
• Samples can be a little misleading at the source code level.
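The arithmetic above — a ~23–24-cycle mispredict guarding a ~12-cycle computation — is the whole argument against the dirty flag here. A minimal sketch of the two variants, with hypothetical names and a deliberately simplified sphere transform (translation only; a real version would rotate and scale too):

```cpp
struct Matrix4 { float m[16]; };  // column-major, translation in m[12..14]
struct Sphere  { float x, y, z, r; };

static Sphere TransformSphere(const Matrix4& world, const Sphere& s) {
    // Simplified: moves the centre by the matrix's translation only.
    return Sphere{ s.x + world.m[12], s.y + world.m[13],
                   s.z + world.m[14], s.r };
}

// Guarded version: cheap only when the branch predicts well. When mDirty
// is usually false, the occasional mispredict (~23-24 cycles) outweighs
// the ~12-cycle transform it was meant to skip.
Sphere GetWorldBS_Guarded(bool& dirty, Sphere& cached,
                          const Matrix4& world, const Sphere& local) {
    if (dirty) {
        cached = TransformSphere(world, local);
        dirty  = false;
    }
    return cached;
}

// Unguarded version: always pays for the transform, never for a branch.
Sphere GetWorldBS(const Matrix4& world, const Sphere& local) {
    return TransformSphere(world, local);
}
```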
• The if(mDirty) comparison stalls due to the load 2 instructions earlier.
• Similarly with the matrix multiply.

Some rough calculations
• Branch mispredictions: 50,421 × 23 cycles each = 0.36ms
• L2 cache misses: 36,345 × 400 cycles each = 4.54ms
• From Tuner, 3 L2 cache misses per object
  – These cache misses are mostly sequential (more than 1 fetch from main memory can happen at once)
  – Code / cache miss / code / cache miss / code…

Slow memory is the problem here
• How can we fix it?
• And still keep the same functionality and interface?

The first step
• Use homogeneous, sequential sets of data.

Generating Contiguous Data
• Use custom allocators
  – Minimal impact on existing code
• Allocate contiguously:
  – Nodes
  – Matrices
  – Bounding spheres

Performance
• 19.6ms → 12.9ms
• 35% faster just by moving things around in memory.

What next?
• Process data in order.
• Use implicit structure for the hierarchy
  – Minimise to-and-fro between nodes.
• Group logic to optimally use what is already in cache.
• Remove regularly called virtuals.
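The "custom allocators" step above might be sketched as a trivial pool: reserve one contiguous block up front and hand out elements in order, so nodes (or matrices, or bounding spheres) allocated together end up adjacent in memory. The names here are hypothetical, not the deck's actual allocator:

```cpp
#include <cstddef>
#include <vector>

template <typename T>
class Pool {
public:
    explicit Pool(std::size_t capacity) { mStorage.reserve(capacity); }

    // Hands out the next slot in the contiguous backing store. Pointers
    // stay valid because reserve() guarantees no reallocation while the
    // count stays within capacity (exceeding it is a bug in this sketch).
    T* Allocate() {
        mStorage.push_back(T{});
        return &mStorage.back();
    }

    const T*    Data()  const { return mStorage.data(); }
    std::size_t Count() const { return mStorage.size(); }

private:
    std::vector<T> mStorage; // one contiguous block
};
```

Call sites barely change — they still ask for "a new node" — which is the "minimal impact on existing code" the slides promise, but iteration order now matches memory order.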
Hierarchy
• We start with a parent Node, which has child nodes; each child knows its parent.
• Those children have children of their own, and they all have parents.
• A lot of this information can be inferred.
• Instead, use a set of arrays, one per hierarchy level.
• Each node stores only a child count (the parent has 2 children, ":2"; each of those has 4 children, ":4").
• Ensure nodes and their data are contiguous in memory.

• Make the processing global rather than local
  – Pull the updates out of the objects.
• No more virtuals
  – Easier to understand too – all the code is in one place.
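One way to realise the per-level arrays and ":2"/":4" child counts above (a hypothetical layout, not the slides' exact code): the children of each node occupy a contiguous run of the next level's array, so a run's start position is simply the running sum of the preceding siblings' child counts — the structure is implicit, with no child pointers at all.

```cpp
#include <cstddef>
#include <vector>

struct FlatNode {
    int childCount;   // the ":2" / ":4" annotations from the slides
    // transforms and bounding spheres would sit in parallel arrays
};

// levels[0] is the root level, levels[1] its children, and so on.
using Hierarchy = std::vector<std::vector<FlatNode>>;

// Visit every (parent, child) pair at one level, top down, walking the
// two arrays strictly in order -- no pointers chased, fully prefetchable.
template <typename Fn>
void ForEachChild(const Hierarchy& h, std::size_t level, Fn visit) {
    std::size_t firstChild = 0;
    for (const FlatNode& parent : h[level]) {
        for (int c = 0; c < parent.childCount; ++c)
            visit(parent, h[level + 1][firstChild + c]);
        firstChild += parent.childCount;  // next parent's run starts here
    }
}
```

Running this loop over each level from the top gives the transform pass; running the levels in reverse order gives the bottom-up bounding-sphere expansion pass.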
Need to change some things…
• In the OO version, we update transforms top-down and expand the world bounding sphere (WBS) bottom-up.
• Walking the tree: update the transform and world bounding sphere on the way down, and add each child's bounding sphere to its parent on the way back up.
• Hierarchical bounding spheres pass information up; transforms cascade down.
• Data use and code are 'striped' – the processing alternates between the two.

Conversion to linear
• To do this with a 'flat' hierarchy, break it into 2 passes:
  – Update the transforms and bounding spheres (top down)
  – Expand the bounding spheres (bottom up)

Transform and BS updates
• For each node at each level (top down):
  – multiply the world transform by the parent's
  – transform the wbs by the world transform

Update bounding sphere hierarchies
• For each node at each level (bottom up):
  – add the wbs to the parent's
  – cull the wbs against the frustum

Update Transform and Bounding Sphere
• How many child nodes to process?
• For each child, update the transform and bounding sphere.
• Note the contiguous arrays.

So, what's happening in the cache?
(Parent data and children's data in the unified L2 cache; the children's node data is not needed.)
• Load the parent and its transform.
• Load the child transform and set the world transform.
• Load the child bounding sphere and set the world bounding sphere.
• The next child is calculated with no extra cache misses.
• The next 2 children incur 2 cache misses in total.
• Because all the data is linear, we can predict what memory will be needed in 400 cycles and prefetch it.
• Tuner scans show about 1.7 cache misses per node.
• But these misses are much more frequent
  – Code / cache miss / cache miss / code
  – Less stalling

Performance
• 19.6ms → 12.9ms → 4.8ms

Prefetching
• Data accesses are now predictable.
• Can use prefetch (dcbt) to warm the cache
  – Data streams can be tricky
  – Many reasons for stream termination
  – Easier to just use dcbt blindly (look ahead x iterations)

Prefetching example
• Prefetch a predetermined number of iterations ahead.
• Ignore incorrect prefetches.

Performance
• 19.6ms → 12.9ms → 4.8ms → 3.3ms

A Warning on Prefetching
• This example makes very heavy use of the cache.
• This can affect other threads' use of the cache
  – Multiple threads with heavy cache use may thrash the cache.

The scans
• The old scan: 22ms. The new scan, up close: 16.6ms.
• Looking at the code (samples).

Performance counters
• Branch mispredictions: 2,867 (cf. 47,000)
• L2 cache misses: 16,064 (cf. 36,000)

In Summary
• Just reorganising data locations was a win.
• Data + code reorganisation = dramatic improvement.
• + prefetching = even more WIN.
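As an aside on the prefetching step above: dcbt is a PowerPC instruction, but the "prefetch a predetermined number of iterations ahead, ignore incorrect prefetches" pattern can be sketched portably with GCC/Clang's __builtin_prefetch. The look-ahead distance here (16) and the struct layout are assumptions to be tuned per platform; the hint never faults, which is why the slides can issue it "blindly":

```cpp
#include <cstddef>

struct NodeData { float transform[16]; float sphere[4]; };

// Process a contiguous array, warming the cache a fixed number of
// iterations ahead of the work.
void UpdateAll(NodeData* nodes, std::size_t count) {
    const std::size_t PREFETCH_AHEAD = 16;  // tunable look-ahead distance
    for (std::size_t i = 0; i < count; ++i) {
#if defined(__GNUC__) || defined(__clang__)
        if (i + PREFETCH_AHEAD < count)
            __builtin_prefetch(&nodes[i + PREFETCH_AHEAD]); // hint only
#endif
        // ...the real transform / bounding-sphere work would go here...
        nodes[i].transform[0] += 1.0f;  // stand-in for the update
    }
}
```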
OO is not necessarily EVIL
• Be careful not to design yourself into a corner.
• Consider data in your design
  – Can you decouple data from objects?
  – …code from objects?
• Be aware of what the compiler and the hardware are doing.

It's all about the memory
• Optimise for data first, then code.
  – Memory access is probably going to be your biggest bottleneck.
• Simplify systems
  – KISS
  – Easier to optimise, easier to parallelise.

Homogeneity
• Keep code and data homogeneous
  – Avoid introducing variations.
  – Don't test for exceptions – sort by them.
• Not everything needs to be an object
  – If you must have a pattern, then consider using Managers.

Remember
• You are writing a GAME
  – You have control over the input data.
  – Don't be afraid to preformat it – drastically if need be.
• Design for specifics, not generics (generally).

Data Oriented Design Delivers
• Better performance
• Better realisation of code optimisations
• Often simpler code
• More parallelisable code

The END
• Questions?