PS3: hats off to anyone who can make sense of it!

**spike spiegel** · 22-07-2003, 16:05:12

I'm referring to this article which, from what I've read, was written by Eddie Edwards, a former Naughty Dog programmer, published on his site and then mysteriously removed shortly afterwards. That is what I've read, so take all of it with a grain of salt.

The Technology of PS3
Eddie Edwards, April 2003
Foreword

Recent news articles have explained that the patent application for the technology on which PS3 will assumedly be based is now available online. I've spent some time examining the patent and I have formed some theories and educated guesses as to what it all means in practice. This document describes the patent and outlines my ideas. Some of these guesses are informed by my knowledge of PS2 (I was one of the VU coders on Naughty Dog's Jak & Daxter although I do not work for Sony now). You may wish to refer to Paul Zimmons' PowerPoint presentation which has diagrams that might make some of this stuff clearer. Also, until I get told to take it down, I have made the patent itself available in a more easily downloadable form (a 2MB ZIP containing 61 TIF files).

The technology of PS3 is based on what IBM call the "Cell Architecture". This architecture is being developed by a team of 300 engineers from Sony, IBM and Toshiba. PS2 was developed by Sony and Toshiba. Sony appear to have designed the basic architecture, while Toshiba have figured out how to implement it in silicon. The new consortium includes IBM, who for PS3 will use their advanced fabrication technologies to build the chips faster and smaller than would otherwise have been possible. In addition, the effort is supposedly a holistic approach whereby tools and applications are being developed alongside the hardware. IBM have particular expertise in building applications and operating systems for massively parallel systems - I expect IBM to have significant input into the software for this system.

There is a lot of PS2 in the Cell Architecture. It is the PS2 flavour that is most apparent to me when I read the patent. However, IBM must be bringing a significant amount of stuff to the table too. The patent for instance refers to a VLIW processor with 4 FPUs, rather than a dual-issue processor with a single SIMD vector FPU. Does this imply that the chips are based on an IBM-style VLIW ALU set? Or does it just mean that it's a fast VU with a "very long instruction word" of only 2 instructions? Furthermore, note that IBM have been making and selling massively parallel supercomputers for several decades now. IBM experts' input on the programming paradigms and tool set is going to be invaluable. And the host processor finally drops the MIPS ISA in favour of IBM's own PowerPC instruction set. But we may not get to program the PPCs inside the PS3 anyway.

I have had to make assumptions. Forgive them. If anyone with insight or knowledge wishes to enlighten me, please do.
Contents

* Foreword
* Cells
* APUs
o Instruction Width
* Winnie the PU
* PEs
* The Broadband Engine
* Visualizers
* Will the Real PS3 Please Stand Up?
* Memory : Sandboxes
* Memory : Producer / Consumer Synchronization
* Memory : Random Access, Caches, etc.
* Forward and Sideways Compatibility
* Graphics
o Modelling
* Programming PS3
* Jazzing with Blue Gene
* Stream Processing
* Readers' Comments
* Links and References

Cells

(There is some confusion as to what a "cell" is in this patent. The media is generally using the term "cell" for what the patent calls a "processing element" or "PE". In the patent, the term "cell" refers to a unit of software and data, while the term "PE" refers to a processing element that contains multiple processing units. I will use that nomenclature here.)

Cells are central to the PS3's network architecture. A cell can contain program code and/or data. Thus, a cell could be a packet in an MPEG data stream (if you were streaming a movie online) or it could be a part of an application (e.g. part of the rendering engine for a PS3 game). The format of a cell is loosely defined in the patent. All software is made up of cells (and here I use software in its most general sense to include programs and data). Furthermore, a cell can run anywhere on the network - on a server, on a client, on a PDA, etc.

Say, for instance, that a website wanted to stream a TV signal to you in their new improved format DivY. They could send you a cell that contained the program instructions for decoding the DivY stream into a regular TV picture. Then they send you the DivY-encoded picture stream. This would work if you had a PS3 or if you had a digital TV, or even if you had a powerful enough PDA - assuming their design followed the new standard.

Depending on how "open" Sony make this it might be easy or impossible to program your own PS3 just by sending it data packets you want it to run. (Note that Sony's history in this respect is interesting - their PSX Yaroze and PS2 Linux projects do show some willingness to open their machines up to hobbyists.)
APUs

Cells run on one or more "attached processing units" or APUs (I pronounce this after the character in the Simpsons!) An APU is architecturally very similar to the vector unit (VU) found in PS2, but bigger and more uniform:

* 128-bit processor
* 1024-bit external bus
* 128K (8192 x 128-bit words) of local RAM
* 128 x 128-bit registers
* 4-way floating point vector unit giving 32GFLOPS
* 4-way integer vector unit giving 32GIOPS

(Compare this to the VU's 128-bit external bus, 16K of code RAM, 16K of data RAM, 32 x 128-bit registers, single way 16-bit integer unit, and only 1.2GFLOPS.)

The APU is a very long instruction word (VLIW) processor. Each cycle it can issue one instruction to the floating point vector unit and one to the integer vector unit simultaneously. It is much more similar to a traditional DSP than to a CPU like the Pentium III - it does no dynamic analysis of the instruction stream, no reordering. The register set is imbued with enough ports that the FPU and the IPU can each read 3 registers and write one register on each cycle. Unlike the VU, the integer unit on the APU is vectorized, each vector element is a 32-bit int (VU was only 16-bit) and the register set is shared with the FPU (in VU there is a smaller dedicated integer register set). APU should therefore be somewhat easier to program and much more general-purpose than the VU.

Unlike the VU, which used a Harvard architecture (separate program and data memories), the APU seems to use a traditional (von Neumann) architecture where the 128K of local RAM is shared by code and data. The local RAM appears to be triple-ported so that a single load or store can occur in parallel with an instruction fetch, mitigating the von Neumannism (the other port is for DMA). The connection is 256 bits wide (2 x 128 bits), so only one load or store can occur per cycle - it seems reasonable to assume therefore that the load/store instructions only occur on the integer side of the VLIW instruction, as was the case on the VU. Since there is no distinction between integer and floating point registers this works out just fine. The third RAM port attaches the APU to other components in the system and allows data to be DMAed in or out of the chip 1024 bits at a time. These DMAs can be triggered by the APU itself, which differs from the PS2 where only the host processor could trigger a DMA.

Note that the APU is not a coprocessor but a processor in its own right. Once loaded with a program and data it can sit there for years running it independently of the rest of the system. Cells can be written to use one or more APUs, thus multiple APUs can cooperate to perform a single logical task. A telling example given in the patent is where three APUs convert 3D models into 2D representations, and one APU then converts this into pixels. The implication is that PS3 will perform pure software rendering.

The declared speed of these APUs is awesome - 32GFLOPS + 32GIOPS (32 billion floating-point instructions and 32 billion integer instructions per second). I expect Sony consider a 4-way vectorized multiply-accumulate instruction to be 8 FLOPs, so the clock speed of the APU is 4GHz, as has been reported elsewhere in the media. This is very much faster than the PS2's sedate 300MHz clock - by about 13 times. I presume that the FPUs are pipelined (i.e. you can issue one instruction per cycle but it takes, say, four cycles to come up with the answer). But if PS2 had a 4-stage pipeline for the multipliers at 300MHz, what's the pipeline depth going to be at 4GHz? 8 stages? 16 stages? The details of this will depend on the precise design of the APU and this is not covered by the patent, but it is worth noting that naked pipelines are hard to code for at a depth of 4; at a depth of greater than this it may simply be unfeasible to write optimal code for these parts.
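The clock-speed estimate above is plain arithmetic, sketched here in Python. Counting a 4-way multiply-accumulate as 8 FLOPs is the article's assumption rather than something the patent states, and the variable names are purely illustrative.

```python
# Back-of-envelope check of the APU clock estimate.
# Assumption (from the article): one 4-way vector MAC per cycle counts as 8 FLOPs.
APU_GFLOPS = 32            # peak figure quoted for one APU
FLOPS_PER_CYCLE = 4 * 2    # 4 lanes x (multiply + add)
PS2_CLOCK_GHZ = 0.3        # the PS2 runs at roughly 300MHz

apu_clock_ghz = APU_GFLOPS / FLOPS_PER_CYCLE
speedup_vs_ps2 = apu_clock_ghz / PS2_CLOCK_GHZ

print(f"APU clock: {apu_clock_ghz:.0f}GHz, about {speedup_vs_ps2:.0f}x the PS2 clock")
```

If Sony instead counted a MAC as a single operation per lane, the same 32GFLOPS would imply an 8GHz clock, so the result hinges on that counting convention.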

Note: the APUs may instead be using an IBM-style VLIW architecture where each ALU (4 floating point and 4 integer) is operable independently from different parts of the instruction word. However, the word size of the registers is 128, so each floating point unit must access part of the same register. This seriously limits the effectiveness of a VLIW architecture and makes it rather difficult to program for. I therefore assume that the ALUs are acting like typical 4-way vector SIMD units.

One interesting departure from PS2 is that all software cells run on APUs. On PS2 there were two VUs but also one general-purpose CPU (a MIPS chip). This chip was the only chip in the system capable of the 128-bit vector integer operations (necessary for fast construction of drawlists), and this functionality is now subsumed into the APU. There is a non-APU processor in the new system but it only runs OS code, not cells, so its precise architecture is irrelevant - it could be anything, and the same software cells would still run on the APUs just fine.
Instruction Width

Given 128 registers, it takes 7 bits to identify a register. Each instruction can have 3 inputs and 1 output which is 28 bits. I am presuming they are keeping the extremely useful vector element masks which would add 4 bits to the FPU side. Only in the case of MAC (multiply-accumulate) are 3 inputs actually needed, but say you specify a MAC on both the IPU and FPU - that's 60 bits for register specifications alone. I therefore doubt that the instruction length is 64 bits - I think the VLIW on the APU must be 128 bits wide, which is reasonable since that's the word length and since there is bandwidth to read 128 bits out of memory per cycle as well as do a load/store to/from memory at the same time. But this is probably going to mean code is not overly compact - only 8,192 instructions will fit into the whole of APU RAM, with no room for data in that case.
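The register-specifier tally in the paragraph above can be reproduced in a few lines of Python. This is only a sketch of the article's guesswork: the 4 vector element mask bits and the 3-input, 1-output instruction shape are the author's assumptions, and the names below are made up for illustration.

```python
# Bit budget for register specifiers in one VLIW instruction (article's assumptions).
REGISTERS = 128
OPERANDS_PER_SIDE = 4            # 3 inputs + 1 output
FPU_ELEMENT_MASK_BITS = 4        # assumed to be carried over from the VU
LOCAL_RAM_BYTES = 128 * 1024     # 128K of APU local RAM

bits_per_register = (REGISTERS - 1).bit_length()                  # 7
bits_per_side = OPERANDS_PER_SIDE * bits_per_register             # 28
register_spec_bits = 2 * bits_per_side + FPU_ELEMENT_MASK_BITS    # 60


def instructions_that_fit(instruction_bits):
    """How many instructions of the given width fill the whole local RAM."""
    return LOCAL_RAM_BYTES * 8 // instruction_bits


print(register_spec_bits, instructions_that_fit(128), instructions_that_fit(64))
```

With 128-bit instructions the local RAM holds 8,192 of them, which matches the figure in the text; a 64-bit encoding would double that but leaves only 4 bits for opcodes.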

On the other hand, 128 bits is a lot of bits for an instruction given that only 60 are used so far. Assuming 256 distinct instructions per side (which is very very generous) that's 8 bits per side making 76. My guess is they may have another 16 bits to mask integer operations, just as 4 bits mask the FPU operations. 16 bits enables you to isolate any given byte(s) in the register. That's 92.

Another cool feature they might employ is conditional execution like on the ARM - 4 bits would control each instruction's execution according to the standard condition codes. I was surprised not to see this on the VU in PS2 (perhaps ARM have a patent?) because it helps to avoid a lot of petty little branches. If the PPC is influencing the design, they may just throw a barrel shifter in after every instruction too (that would be quite ARM-like as well). So even without unaligned memory accesses you can isolate any field in a 128-bit word in a single mask-and-shift instruction. Another 7 bits there too (integer only) ... that's still only 99 bits - 29 bits are still available.

What seems to be common on classic VLIW chips is to have parts of the ALU directly controlled by the instruction code. On a classic CPU instructions are decoded to generate the control lines for the ALU (for instance, to select which part of the ALU should calculate the result). With a VLIW chip you can encode the control lines directly - known as horizontal encoding. This makes the chips simpler and the instructions more powerful - you can do unusual things with instructions that you couldn't do on a regular CPU. Regular instructions are present as special cases. It's more like you're directly controlling the ALU with bitfields than you're issuing "instructions". This can make the processor difficult for humans to write code for - but in some ways easier for machines (e.g. compilers). It is possible that the other 29 bits go towards this encoding.

However, the patent does not go into much detail about any of this, so you should treat the above few paragraphs with some suspicion until more information comes to light.
Winnie the PU

As mentioned above, there is a secondary type of processing unit which is called just "processing unit" or PU. The patent says almost nothing about the internals of this component - media reports suggest that current incarnations will be based on the PowerPC chip, but this is not particularly relevant. The PU never runs user code, only OS code. It is responsible for coordinating activity between a set of APUs - for instance deciding which APUs will run which cells. It is also responsible for memory management - deciding which areas of RAM can be used by which APUs. The PU must run trusted code because, as I explain later, it is the PU that sets up the "sandboxes" which protect the whole system from viruses and similar malicious code downloaded off the internet.
PEs

The patent then puts several APUs and a PU together to make a "processing element" or PE. A typical PE will contain:

* A master PU which coordinates the activities in the PE
* A direct memory access controller or DMAC which deals with memory accesses
* A number of APUs, typically 8

The PE is the smallest thing you would actually make into a chip (or you can put multiple PEs onto a chip - see later). It contains a 1024-bit bus for direct attachment to DRAM. It also contains an internal bus, the PE-bus. I am inferring that this bus is 1024 bits wide since it attaches directly to the memory interface of the APUs, which are 1024 bits wide. However, the patent says almost nothing in detail about the PE-bus.

The DMAC appears to provide a simultaneous DMA channel for each APU - that is, 8 simultaneous 1024-bit channels. The DRAM itself is split into 8 sectors and for simultaneity each APU must be accessing a different sector. Nominally the DRAM is 64MB and each sector is 8MB large. Sectors themselves consist of 1MB banks configured as just 8192 1024-bit words.
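The DRAM figures above fit together as follows. This Python sketch just checks the arithmetic and models the "different sector per APU" rule; mapping an address to a sector by simple division assumes sectors are contiguous address ranges, which the patent description quoted here does not confirm, and all names are hypothetical.

```python
# DRAM layout of one PE, using the figures quoted in the text.
MB = 1024 * 1024
WORD_BITS = 1024
WORDS_PER_BANK = 8192
SECTORS = 8
DRAM_BYTES = 64 * MB

bank_bytes = WORDS_PER_BANK * WORD_BITS // 8     # 1MB per bank
sector_bytes = DRAM_BYTES // SECTORS             # 8MB per sector
banks_per_sector = sector_bytes // bank_bytes    # 8 banks per sector


def sector_of(address):
    """Sector holding a byte address (assumes contiguous sectors)."""
    return address // sector_bytes


def channels_conflict(sectors_in_use):
    """True if two APUs target the same sector, so their DMAs cannot overlap."""
    return len(set(sectors_in_use)) < len(sectors_in_use)


print(bank_bytes // MB, banks_per_sector, channels_conflict([0, 1, 1]))
```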

A PE with 8 APUs is theoretically capable of 256GFLOPS or 1/4 TFLOPS. Surely that's enough power for a next gen console? Not according to Sony ...
The Broadband Engine

Now we put together four PEs in one chip and get what the patent calls a "Broadband Engine" or BE. Instead of each PE having its own DRAM the four PEs share it. The PE-busses of each PE are joined together in the BE-bus, and included on the same chip are optional I/O blocks. The interface to the DRAM is still indicated as being external, but still I assume the DRAM must be on the same die to accommodate the 8192-wire interface.

Each PE in a BE effectively gets 1/4 of the memory bandwidth of a standalone PE, since four PEs share the same DRAM. The sharing is done using a crosspoint switch whereby each of the 8 channels on each PE can be attached to any of the 8 sectors on the DRAM.

Additionally, each BE has 8 external DMA channels which can be attached to the DRAM through the same crosspoint mechanism. This allows BEs to be attached together and to directly access each other's DRAM (presumably with some delay). The patent discusses connecting BEs up in various topologies.

One thing the patent talks about is grafting an optical waveguide directly onto the BE chip so that BEs can be interconnected optically - literally, the chip packaging would include optical ports where optical fibres could be attached directly. Think about that! If the BE plus DRAM was a self-contained unit, there would be no need at all for high-frequency electrical interfaces in a system built of BEs, and therefore board design should become much much easier than it is today. Note that the patent makes it clear that the optical interface is an option - it may never actually appear - but it would be very useful in building clusters of these things, for instance in supercomputers.

A BE with 4 PEs is theoretically capable of 1 TFLOPS - about 400 times faster than a PS2.
Visualizers

Visualizers (VSs) are mentioned a few times through the document. A visualizer is like a PE where 4 of the APUs are removed and in their place is put some video memory (VRAM), a video controller (CRTC) and a "pixel engine". Almost no details are given but it is a fair assumption that one or more of these will form the graphical "backend" for the PS3. I would presume the pixel engine performs simple operations such as those performed by the backend of a regular graphics pipeline - check and update Z, check and update stencil, check and update alpha, write pixel. The existence of the VS is further evidence to suggest that PS3 is designed for software rendering only.

Diagrams in the patent suggest that visualizers can be used in groups - presumably each VS does a quarter of the output image (similar to the GS Cube).

In my section on graphics later, I describe the software rendering techniques I believe PS3 will use. These techniques use at least 16x oversampling of the image (i.e. a 2560 x 1920 image instead of a 640x480 image), and the obvious hardware implementation might be capable of drawing up to 16 pixels simultaneously - which is equivalent to 1 pixel of a 640x480 image per cycle. Since the NTSC output is 640x480 I call these "pixels" while the 2560 x 1920 image is composed of "superpixels", with 16 superpixels per pixel.
Will the Real PS3 Please Stand Up?

So what is PS3 to be, then? The patent mentions several different possible system architectures, from PDAs (a single VS) through what it terms "graphical workstations" which are one or two PEs and one or two VSs, to massive systems made up of 5 BEs connected optically to each other. Which one is the PS3?

The most revealing diagram to me is Figure 6 in the patent, which is described as two chips - one being 4 PEs (i.e. a BE) and one being 4 VSs with an I/O processor which is rather coincidentally named IOP - the same name as the I/O processor in PS2 (this component will still be required to talk to the disk drive, joypads, USB ports, etc.) The bus between the two main chips looks like it's meant to be electrical. Oddly, each major chip has the 64MB of DRAM attached (on chip?) and this only gives 128MB of total system RAM. That seems very very low. I would expect a more practical system to have maybe 64MB per PE or VS giving a total of 512MB of RAM - much more reasonable. So perhaps the 128MB is merely a type of "secondary cache" - fast on-chip RAM. Then lots of slower RAM could be attached to the system using a regular memory controller. This slow RAM would be much cheaper than the "secondary cache" RAM and would probably not have the 8,192-wire interface. In fact, looking at the PS2's GS design, there we have 4MB of VRAM which has a 1024-bit bus to the texture cache - so perhaps the 64MB per PE is an extension of this VRAM design? On the other hand, VRAM tends to be fast and low-latency, whereas the patent specifically calls the 64MB per PE "slow DRAM".

continued...

**spike spiegel** · 22-07-2003, 16:06:30

So how powerful is this machine? Well the 4 PEs give us 1 TFLOPS. The 4 VSs give another 1/2 TFLOPS. Add integer instructions in, and call what the pixel engine does "integer operations" too, and you pretty soon see a machine that really is capable of trillions of operations per second - a superbly ludicrous amount.

Assuming the pixel engine can handle a pixel (16 superpixels) per cycle, at 4GHz with 4 VSs that's a fillrate of 16GPPS - enough to draw a 640 x 480 x 60Hz screen with 800x overdraw. Lovely. (However, note that when drawing triangles smaller than a pixel, a certain amount of "overdraw" is required just to fill the screen - so the available depth complexity is "only" of the order of 100 or so).
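The fillrate claim is easy to check. The sketch below uses only the article's own assumptions (a 4GHz clock, 4 visualizers, one full pixel per VS per cycle); the variable names are illustrative.

```python
# Fillrate and overdraw under the article's assumptions.
CLOCK_HZ = 4_000_000_000
VISUALIZERS = 4
PIXELS_PER_VS_PER_CYCLE = 1          # one pixel = 16 superpixels

pixels_per_second = CLOCK_HZ * VISUALIZERS * PIXELS_PER_VS_PER_CYCLE   # 16 GPPS
screen_pixels_per_second = 640 * 480 * 60                              # one 60Hz screen
overdraw = pixels_per_second / screen_pixels_per_second

print(f"{pixels_per_second / 1e9:.0f} GPPS, overdraw of about {overdraw:.0f}x")
```

The exact quotient is a little under 870, which the article rounds down to "800x".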
Memory : Sandboxes

The DRAM used in this system is not actually 1024-bits wide but 1024 + N bits where N is extra control information. This extra information is used in 2 ways - to provide hardware "sandboxing" whereby regions of memory can be set up to allow access from only a certain subset of APUs, and to provide hardware producer-consumer synchronization, which I discuss later.

Sandboxes are implemented using the following logical test:
(REQID & REQIDMASK) == (MEMID & MEMIDMASK)

Here, REQID and REQIDMASK are an ID and mask associated with the APU making the request; MEMID and MEMIDMASK are an ID and mask associated with the memory location being read or written. If the results are equal the access goes ahead, otherwise it is blocked.

This system allows for APUs to have private memory, memory shared with a specific set of other APUs, and a quite open-ended set of other permutations. It's not clear how this facility interacts with the facility for BEs to directly read the memory of other BEs - one would imagine 32 APUs per BE would mean the IDs and masks were 32 bits wide with one bit per APU - but if a potentially unlimited set of APUs in other BEs can access the DRAM then how are the IDs set up, I wonder?
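The sandbox test is small enough to try out directly. Below is a minimal Python sketch of the comparison quoted above; the 5-bit IDs and the idea of using the top 3 bits as a "group" are invented for the example and are not taken from the patent.

```python
def access_allowed(req_id, req_id_mask, mem_id, mem_id_mask):
    """The sandbox test: compare the masked requester ID with the masked memory ID."""
    return (req_id & req_id_mask) == (mem_id & mem_id_mask)


FULL = 0b11111    # compare every bit: private memory
GROUP = 0b11100   # compare only the top 3 bits: memory shared by a group (hypothetical)
OPEN = 0b00000    # compare nothing: any APU may access

# Private: only the APU whose ID matches exactly gets through.
print(access_allowed(0b10110, FULL, 0b10110, FULL))    # True
print(access_allowed(0b10111, FULL, 0b10110, FULL))    # False
# Group: APUs 0b10110 and 0b10101 share the top bits 0b101.
print(access_allowed(0b10101, GROUP, 0b10100, GROUP))  # True
# Open: with both masks zero the test is 0 == 0.
print(access_allowed(0b00001, OPEN, 0b11111, OPEN))    # True
```

Because both sides carry their own mask, the PU can express many policies with the same comparator, which is presumably why the patent describes the permutations as open-ended.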
Memory : Producer / Consumer Synchronization

The DRAM also performs another special function, that is to allow automatic synchronization between an APU that is producing information and an APU that is consuming that information. The synchronization works per 1024-bit word. Essentially, the system is set up so that the producer issues a DMA "sync write" to memory and the consumer issues a DMA "sync read" from memory. If the sync write occurs first, all is well. If the sync read occurs first, the consumer is stalled until the write occurs.

What does that actually mean? Well, there is a bit in each memory location internal to the APU. (We're talking here about 1024-bit locations not 128-bit locations.) This bit is set when a "sync read" is pending to that memory location. The patent description fluffs the explanation of this a little, but I am inferring that the APU issues a sync read and then carries on until code in the APU attempts to access the data that has been read. If the data has not yet arrived from RAM, the APU stops working until the data is available. (The patent seems to imply that the APU stalls immediately but that doesn't make a whole lot of sense since the extra bit internal to the RAM would not then be necessary.)

This mechanism is very important - it means that you can prefetch data into memory and as long as you can keep working on other stuff the data will arrive when it is ready and the APU need not stall. So the synchronization can be free in cycles (seamless) and free in terms of PU overhead. The overhead in DRAM for each memory location is about 19 bits - 1 full/empty bit, an APU ID (5 bits) and 1 destination address (13 bits). But note what I said above about there being more than 32 APUs accessing the same amount of DRAM - perhaps more than 5 bits is necessary for the APU ID?

Parity DRAM will already provide an extra 128 bits for every 1024 bit word - this is more than enough to provide the 40 bits required by the sandboxing and the synching, with 88 bits left for ECC (ECC is not mentioned in the patent, but it is reasonable to assume it could be a feature - ECC places an error correcting code into the spare bits of DRAM so that in the unlikely event that a cosmic ray changes a bit pattern in your RAM the system can detect and correct the error. Honestly! I'm not making this up!)

The patent makes a fuss about how this synching will make it trivial to read data from an I/O device. But it seems to me that the synching's main function is to make it trivial to stream data around the APUs with no intervention from the PU. You can set up arbitrary producer-consumer graphs and have them work as if by magic. It's a great feature for things like video processing where several APUs might be doing MPEG compression of images that are read from a digitizer and processed by other APUs (e.g. the addition of menus or even special effects). Each stage must wait for data from the previous stage, and this synchronization allows this to be done with minimum hassle. As I discuss in the programming section, streams of data are going to be a key concept in PS3 programming.
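The sync read / sync write behaviour described in this section can be modelled as a toy in Python. This is a sketch under assumptions, not the patent's mechanism: the class and method names are hypothetical, local RAM is a plain dict, and the choice to mark the word empty again after it has been consumed is a guess.

```python
class SyncWord:
    """Toy model of one 1024-bit DRAM word with its synchronization state."""

    def __init__(self):
        self.full = False            # the full/empty bit
        self.data = None
        self.waiting_reader = None   # (apu_id, local_address) of a pending sync read

    def sync_write(self, data, local_rams):
        """Producer side: deliver to a waiting consumer, else store the data."""
        if self.waiting_reader is not None:
            apu_id, local_address = self.waiting_reader
            local_rams[apu_id][local_address] = data   # forwarded straight to the APU
            self.waiting_reader = None
        else:
            self.data = data
            self.full = True

    def sync_read(self, apu_id, local_address, local_rams):
        """Consumer side: returns True if data arrived, False if the read is pending."""
        if self.full:
            local_rams[apu_id][local_address] = self.data
            self.full = False        # assumption: a consumed word becomes empty
            return True
        self.waiting_reader = (apu_id, local_address)
        return False


local_rams = {3: {}}                                  # APU 3's local RAM
word = SyncWord()
arrived = word.sync_read(3, 0x10, local_rams)         # read first: stays pending
word.sync_write("frame 0", local_rams)                # write completes the read
print(arrived, local_rams[3][0x10])
```

The stored APU ID and destination address are what let the memory finish a pending read by itself, with no PU involvement, once the producer's write lands.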
Memory : Random Access, Caches, etc.

Now I've said before that the PU doesn't run user code. Except, maybe it does. A lot of a game can be shoehorned into the APU model - certainly the whole graphics engine, probably much AI code, sound code. But it's a case of writing a piece of code that handles a 128K working set (minus code size). What about cases where you really, really, really need random access to memory? What about just porting some monolithic C onto the platform? How do we do that? Surely we need a traditional processor for "regular" tasks?

Well it's still not clear that you do. Random access to memory is going to screw you up. It would probably be faster to make a list of memory locations you need to access first (in local RAM), sort that, and do the memory accesses sequentially! A cache won't help in this case.
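The "sort first, then access sequentially" idea can be sketched as follows in Python. It assumes DRAM is fetched one 1024-bit (128-byte) word at a time, as guessed elsewhere in the article; the function name and the shape of the returned plan are made up for illustration.

```python
WORD_BYTES = 128   # one 1024-bit DRAM word


def plan_sequential_reads(addresses):
    """Group byte addresses by DRAM word, visiting words in ascending order.

    Returns a dict mapping word index -> list of byte offsets within that word,
    so each word is fetched once and the fetches sweep forward through memory.
    """
    plan = {}
    for address in sorted(addresses):
        plan.setdefault(address // WORD_BYTES, []).append(address % WORD_BYTES)
    return plan


print(plan_sequential_reads([300, 5, 130, 260]))
```

Here four scattered accesses collapse into three word fetches in ascending order, because addresses 260 and 300 live in the same 1024-bit word.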

What about porting monolithic C? I think the answer to this runs deep. I think the answer is: you can't. I think that writing code for this beast is fundamentally different. You have to break your code into processes that can work with only 128K of data at one time. The PS3 will require a new approach to programming. I have my theories about what this new approach is, which I describe later, but ultimately the new approach should actually be more uniform than our current approach. It will force modularity upon us, which might not be a totally bad thing.

What the system may offer for trusted programs (e.g. off an encoded DVD-ROM rather than off the internet) is the ability to modify the PU operating system to e.g. change the algorithm it uses to distribute cells among APUs. Or the ability to add drivers directly to the I/O processor (IOP) on the PS3. It seems likely Sony will offer some level of control to games programmers - control that is simply not required by simple purveyors of streaming media. On the other hand, this would destroy their ability to swap the PU over to a different processor architecture.

Something which is not addressed by the patent is the question, what granularity do APUs see in their local memory? Can they do byte stores? 32-bit loads? My guess is that they can only transfer 1024-bit quantities from main DRAM to local RAM, and can only transfer 128-bit quantities from local RAM to registers, but that in a single instruction they can isolate any bitfield from a pair of 128-bit registers. But it is only a guess.
Forward and Sideways Compatibility

By forward compatibility I mean the programs can run on future revs of the hardware without error. By sideways compatibility I mean the programs can run on hardware that implements the same instruction set but that is made by different manufacturers to different designs. In both cases, we're talking about running programs on chips that have different timing characteristics to the chips it was written on.

The patent discusses a timer that is provided on each PU. You tell it how long you think an APU program ought to take, and if it takes less time (say on a faster APU) then it waits until the specified time has passed - so the program will never be faster than it should be.
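The "never faster than specified" behaviour amounts to padding a task out to a time budget. A minimal Python sketch follows, assuming the budget covers wall-clock time for the whole task; the function name is hypothetical and this only illustrates the idea, not how the PU timer works in hardware.

```python
import time


def run_with_time_budget(task, budget_seconds):
    """Run task, then wait out whatever remains of the budget.

    The task never appears to finish early; if it overruns the budget,
    nothing is done about it ("no earlier than" completion only).
    """
    start = time.monotonic()
    result = task()
    remaining = budget_seconds - (time.monotonic() - start)
    if remaining > 0:
        time.sleep(remaining)
    return result


began = time.monotonic()
answer = run_with_time_budget(lambda: sum(range(1000)), 0.05)
print(answer, f"took at least {time.monotonic() - began:.2f}s")
```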

I don't get this part, really. At first I thought perhaps the timer was for synching processes, but it only guarantees "no earlier than" completion, so synching would be impossible since some processes may not have "arrived" yet. Even if this is the intention, it would only work if the time budget referred only to APU processing and not to memory accesses - since the DRAM is shared with other PEs each APU only has 1/4 of a DMAC channel available to it ... so stalls may occur based on what the other APUs in the system are doing. You may easily blow your timeframe this way, through no fault of your own, and suddenly you're not synched up any more.

So what is this for, then? Perhaps it's to define timebases for your game program overall - you use the timer to specify that the game runs at 60Hz no faster even on a fast CPU. That seems unlikely, though, because standard game programming repertoire includes ways to make games run at real-time speed and higher frame rate (if the display can support that) when the processing ability of the machine improves. So maybe the timer is used to define the 60Hz NTSC output frequency - and maybe the subfrequencies off that. Remember this is a 4GHz part. It might use an APU to generate the entire NTSC picture. But it doesn't seem to though since there is a CRTC specified in each VS.

Ideas anyone?
Graphics

Software rendering appears to be the order of the day, although it may be premature to make this observation since they could always add some other GPU if the performance of the software rendering failed to impress. It might seem like a waste to do scan conversion in software when Sony already have hardware that can scan convert 75 million triangles per second in PS2.

But PS3 isn't going to do 75 million triangles per second. Oh no. It's going to do a lot more than that. I'm going to stick my neck out and say that PS3 will, at peak, be capable of 1 billion triangles per second. But before I justify that figure, let us just assume that it will do a lot of triangles. So many triangles that the average size of one is less than a pixel. So what's to scan convert? It's a dot, right? Well, no. You could draw a dot and you'd get an image but the image would look pretty odd - more like a very high-resolution mosaic than a properly anti-aliased CG image. The system will need to do subpixel rendering and average out to get a nice image. 4x4 supersampling of a 640x480 image gives 2560x1920 superpixels - 1280x960 on each of the 4 VS units.

Now, if we ever want to draw a triangle larger than a pixel in width we subdivide it. This is all done in code on the APU. Once the triangle is less than 4x4 superpixels in size there are algorithms you can use to very rapidly determine which subpixels it covers. You keep bitmasks for every possible (4x4x4x4 = 256) edge and you mask them together to give the triangle coverage. Since the triangle is less than a pixel in size there is no point texturing it - you just fill it with a single colour. So we're rendering flatshaded polys. We can do a lot of these in software.

We end up with a nicely antialiased image which has the appearance of texture mapping only because we referred to a texture map when we decided each triangle's colour. We write APU programs to determine this colour - these programs are called shaders. Anyone familiar with RenderMan should be starting to understand what's going on. In a sense the rendering capabilities of PS3 are very much akin to a real-time Reyes RenderMan engine.

So how on Earth do we get 1 billion triangles per second? Well, the hardware to calculate the triangle fragments is small and fast. Then all you need is the basic pixel operations we have already on GPUs - z-test, stencil-test, alpha-test, alpha-blend, etc. Assuming 4x4 supersampling, each triangle covers up to 16 superpixels (actually never more than 10) and 1 superpixel = 1/16 triangle per VS per cycle gives 1 billion triangles per second. (These 16 operations can even be parallelized if the VRAM is split into 16 banks so we may even get a theoretical 16 billion triangles per second, but 4x 4GHz APUs could never drive this many triangles out so it seems a little pointless).

It is possible to drive this from a simple 16-bit input mask, so software needs to determine coverage and pixel position for each small triangle; however this will take several cycles, while the hardware to "scan convert" such polygons is small and fast (a 256x16 triple-ported ROM plus some basic coordinate shifting logic). It's possible therefore that the pixel engines are actually capable of "scan-converting" these subpixel polygons, easing the software burden.

But a big caveat with all these triangle counts is the complexity of the shader itself. The simplest shader might be able to kick out a polygon every 16 cycles from each APU, pushing the pixel engines to the max, but anything more complex (e.g. with texturing or shading or fogging) will require more cycles. So as with all graphic systems, the practical triangle count will likely be less than 20% of the theoretical maximum.
Modelling

Procedural graphics has got to be the way to go with these things. No matter how fast the DRAM is, it's not going to be comparable to the 32GFLOPS available on each APU. Memory will be very slow, processing will be very fast. Pal Engstad at Naughty Dog pointed out to me that VU programming on the PS2 is akin to old-fashioned tape-based mainframe programming - you can read a small number of records from the tape into local RAM and you can only read them sequentially or efficiency suffers badly. There are algorithms for sorting records held on tapes and these algorithms can be applied equally well to the problem of sorting large arrays in memory using just the 128K available to an APU. But at the end of the day, you're going to have oodles of cycles just being wasted while you wait for memory. Procedural models can be instanced from minimum data in RAM using oodles of cycles. There's a natural match here.
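The tape analogy maps onto a classic external merge sort. The sketch below is an illustration only: `tape_sort` and `local_capacity` are made-up names, ordinary Python lists stand in for main RAM, and the limit on how many records fit in an APU's 128K is reduced to a simple record count.

```python
import heapq


def tape_sort(records, local_capacity):
    """Sort a large sequence while only ever sorting `local_capacity`
    records at a time, the way tape-based sorts did."""
    # Pass 1: read the input sequentially in chunks that fit in local RAM,
    # sort each chunk locally and write it back out as a sorted run.
    runs = []
    for start in range(0, len(records), local_capacity):
        runs.append(sorted(records[start:start + local_capacity]))

    # Pass 2: merge the runs. heapq.merge consumes every run strictly
    # front to back, so each run is read sequentially like a tape.
    return list(heapq.merge(*runs))


result = tape_sort([5, 3, 9, 1, 8, 2, 7], local_capacity=3)
```

Both passes touch main memory only in long sequential reads and writes, which is the access pattern the paragraph above says the APUs will want.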
Programming PS3

continued...

**spike spiegel** · 22-07-2003, 16:07:34

It is truly unknown how PS3 will be programmed. There are many possible models, because the architecture is so flexible. In particular, hybrid models are most likely - i.e. not all APUs will be programmed using the same model. For instance, fast stream "pipeline" code, for example audio code, rendering code, decompression code, might be written in native APU assembler. These assembler programs will run on dedicated APUs and be controlled by other parts of the program ... so the interactions are much simplified. In effect each piece of code like this is functioning a lot like a piece of hardware, and particularly in a simple "slave" mode. The massively parallel nature of PS3 should not trouble the authors of this code too much.

On the other hand, some code, such as AI code, is heavily object-oriented and relies on heavy intercommunication between objects. This code will have to be written in an object-oriented environment (that is, language plus run-time components). If a project can run all AI code on a single APU, things will be simple. But this defeats the point of having such a system in the first place, and it defeats the scalability of the larger server systems (think of the servers running an online game - also based on PS3 technology). So the overall programming environment will need to handle the parallel nature of the system for the programmer.

Collected here are some ideas about programming PS3.
Objects can be Locked using Memory-based Synch

The producer/consumer synching can be used to provide locking on objects in RAM:

1. When an object is created it is written using Sync Write. This puts the memory containing the object into full state.
2. If an APU wants to access the object, it issues a Sync Read. Assuming the object is initialized the object memory is now in empty state and the APU's local RAM contains a copy of the object.
3. When the APU is finished, it writes the local copy back to main RAM using a Sync Write.
4. If another APU attempts to access the object while the other APU is working, it is stalled until the Sync Write occurs.
5. A second APU attempting to access the object while this APU is stalled will cause an error on the second APU (and hopefully the PU's OS will handle this error and cause it to retry).
6. Deleting the object involves a Sync Read without a following Sync Write.
7. The granularity of this technique is only 1024-bit DRAM words ... 128 byte blocks of RAM. Multiple objects could share a RAM block but they could only be locked as a unit.
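The locking steps above can be modelled with a small state machine. This is a single-threaded toy, not the patent's mechanism: the class names, the three state labels and the choice to treat a Sync Write to an already full block as an error are all assumptions made for the sketch.

```python
class SyncError(Exception):
    """Raised when an access cannot be queued (see steps 4 and 5)."""


class Apu:
    def __init__(self, name):
        self.name = name
        self.local_copy = None   # stands in for the APU's local RAM
        self.stalled = False


class SyncBlock:
    """One lockable block of main RAM with a full/empty state."""

    def __init__(self):
        self.state = "empty"     # "empty", "full" or "blocking"
        self.data = None
        self.waiter = None       # at most one stalled APU

    def sync_write(self, data):
        if self.state == "blocking":
            # Hand the object straight to the stalled reader; the block
            # stays empty, i.e. locked by that reader.
            waiter, self.waiter = self.waiter, None
            waiter.local_copy, waiter.stalled = data, False
            self.state = "empty"
        elif self.state == "empty":
            self.data, self.state = data, "full"
        else:
            # Assumption for this sketch only: writing to a full block fails.
            raise SyncError("block is already full")

    def sync_read(self, apu):
        if self.state == "full":
            apu.local_copy, self.data = self.data, None
            self.state = "empty"             # object is now locked by apu
        elif self.state == "empty":
            self.waiter, apu.stalled = apu, True
            self.state = "blocking"          # step 4: stall until a write
        else:
            raise SyncError(f"{apu.name}: another APU is already stalled")  # step 5
```

A typical sequence is: create the object with `sync_write`, lock it with `sync_read` from one APU, let a second APU stall on its own `sync_read`, then release with another `sync_write`, which wakes the stalled APU with the updated copy.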

Jazzing with Blue Gene

This will make you laugh. In 1999, IBM began the 5-year Blue Gene project (the sequel to the chess-playing Deep Blue). Read about it. The idea was to make a PetaFLOPS computer by taking processors at 1GFLOPS and placing 32 on a chip to make a 32GFLOPS chip. 64 of these chips on a board would give 2 TFLOPS. A tower of 8 boards would give 16 TFLOPS and a room of 64 towers would make a PFLOPS. Enough to do protein folding at interactive rates.

By 2005, PS3 will have APUs as powerful as Blue Gene's chips and chips half as powerful as Blue Gene's boards. But Blue Gene is due at about the same time as PS3. Is it possible that Blue Gene will simply be a really really big PS3? Using Broadband Engine chips on the Blue Gene boards would provide 64 TFLOPS per board, 512 TFLOPS per tower, so a room of 64 towers would provide 32 PFLOPS. You could fit a mere PFLOPS in a closet! The optical interface could be used to link towers (note that at 4GHz light only travels 3 inches per clock cycle - so there is quite a latency even at lightspeed!). 32 PFLOPS is more than 2^54 operations per second.
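The scaling arithmetic in these two paragraphs is easy to check. The sketch below only multiplies out the figures quoted in the text (64 chips per board, 8 boards per tower, 64 towers per room, 1 TFLOPS per Broadband Engine); the function and variable names are invented.

```python
SPEED_OF_LIGHT_M_S = 299_792_458
CLOCK_HZ = 4_000_000_000


def room_flops(chip_flops, chips_per_board=64, boards_per_tower=8, towers=64):
    """Peak FLOPS of a room of towers built from identical chips."""
    return chip_flops * chips_per_board * boards_per_tower * towers


blue_gene_room = room_flops(32 * 10**9)   # 32 GFLOPS chips: about 1 PFLOPS
be_room = room_flops(10**12)              # 1 TFLOPS Broadband Engines: about 32 PFLOPS

# How many 1 TFLOPS consoles add up to the BE-based room.
ps3s_to_match = be_room // 10**12

# Distance light covers in one 4GHz clock cycle, in inches.
inches_per_cycle = SPEED_OF_LIGHT_M_S / CLOCK_HZ / 0.0254
```

With these inputs the BE-based room lands between 2^54 and 2^55 operations per second, and light covers just under 3 inches per cycle.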

But if you're not even using your PS3 all that power could be used for something like Seti@Home or Folding@Home - you run software that connects your PS3 to other PS3s to form what's in effect a gigantic supercomputer. Only 32,000 PS3s need be connected to match Blue Gene. If people are motivated, millions of PS3s may be available at any one time.

One major application of this incredible computer power is to do protein folding experiments - experiments which help to find the causes of and hopefully cures to certain illnesses including Cystic Fibrosis, Alzheimer's Disease and some cancers. But another is to simulate nuclear bombs or hydrogen bombs. In the future these supercomputers are also likely to be useful in genetic engineering design, or in the design of fusion reactors (cheap clean power). It is up to you whether or not you wish to support any of these causes, and it should be up to you whether or not the machine you paid for is used towards them. Remember that software cells can be transmitted to your machine over the network, and it's up to Sony's OS when this is allowed to happen. It would not be impossible for Sony to code the OS so that any spare power is automatically given to any cause Sony choose as long as you're online. It is important that consumers are aware of the issue, and can donate their spare cycles to whoever they choose.
Stream Processing

While researching this article I came across the Imagine stream processor developed by William J. Dally's team at Stanford University. This is a chip which is about 10x faster again (relative to clock speed) than the PS3 chips, and uses a somewhat similar parallel design. Another team at Stanford is doing related research into Streaming Supercomputers which are just large batches of these chips connected directly together. (It's not clear at this stage whether or not Stanford patents for these designs might form "prior use" against the Sony patents).

The Stanford team have come up with tools and methods for dealing with programming on the stream architecture - they write "kernels" (effectively APU code) in a language called KernelC which is compiled and loop-unrolled (a la VU code) to target the VLIW architecture of the Imagine processor. They then chain these kernels together with streams using StreamC, which makes a regular program that runs on the host processor (a PowerPC or ARM chip in this case). Note that the Imagine system has mainly been used to accelerate specific tasks - e.g. rendering - and not to run entire games which include rendering, audio generation, AI and physics all at once.
Readers' Comments

Please let me know if you have any comments or questions about this page by emailing me at [email protected]. I will reproduce and answer the most interesting comments and questions here. Let me know if you wish to remain anonymous (I will not print email addresses, only names).

In particular I would dearly love to hear from anyone on the Cell Architecture team! I have two major questions:

1. What is the instruction encoding?
2. Can the PUs run user code or just system code?

Links and References
The Patent

The patent application (USPTO)
The patent application (single ZIP)
Paul Zimmons' PowerPoint presentation
Reyes / Renderman

Computer Graphics: Principles and Practice (2e)
Comparing Reyes and OpenGL on a Stream Architecture
The Reyes Image Rendering Architecture
Stanford Research

The Imagine stream processor homepage
Streaming Supercomputers
Protein Folding

The Science of Folding@Home
Unravelling the Mystery of Protein Folding
Programming Protein Folding Simulations

What can I say except, WHO KNOWS! I've only translated the first few paragraphs, and unfortunately the translation, darn it, is rather confusing. But of one thing I'm sure: if this article is accurate, then from the little I understood Sony is thinking really big for PS3. We'll see.

PS: to anyone brave enough to read the whole thing, happy reading.

**Matrice** · 22-07-2003, 16:07:35

!!!! translating it .........

**gustavo** · 22-07-2003, 16:14:05

I don't understand any of it

**SCONOSCIUTO** · 22-07-2003, 16:21:32

Posted by gustavo
I don't understand any of it

Strange, these are more or less familiar terms. You just need to know a bit of English..

**First Children** · 22-07-2003, 16:25:32

To tell the truth, the PS3 patent that article talks about has been available on www.uspto.gov for at least a year!!! I have it all saved on my computer (40 pages, all in English, plus logic diagrams) with all the related designs... I've read all of it and understood almost all of it... but you need to be in the electronics field to make any sense of it!!! I don't have the link, since I grabbed it months ago, but if you really want I can send it to you as an attachment (you won't understand any of it anyway, you need to be engineers)

And besides, there isn't much data on the machine's exact power, but you can get an idea from the new technologies it contains

if this article is accurate, then from the little I understood Sony is thinking really big for PS3. We'll see.

Don't worry, I've just read your article too and it's identical to the one I have!! Same terms (cell, APU, PU....)
Bear in mind that patents have to be filed long before the machine in question comes out!! That way they grab all the rights to the new technologies... the Sony people had already finished the theoretical part a year ago, and as soon as they were done they filed the patent with all the explanations (and so all the necessary info)

**crash88** · 22-07-2003, 22:23:21

Posted by SCONOSCIUTO
Strange, these are more or less familiar terms. You just need to know a bit of English..

and have a couple of hours to read it all

**Dante 89** · 22-07-2003, 22:32:55

Otherwise just put it through Google's translator if you really want to see it in some sort of Italian ...
Anyway I don't like PES so I have no problem XD

**Snake22** · 22-07-2003, 22:44:23

GUYS, HERE'S THE TRANSLATION!

HAVE FUN, IT TOOK ME A WHILE TO TRANSLATE IT!

The Technology of PS3
Eddie Edwards, April 2003
Foreword

Recent news articles have explained that the patent application for the technology on which PS3 will reportedly be based is now available online. I've spent some time examining the patent and I have formed some theories and educated guesses as to what it all means in practice. This document describes the patent and outlines my ideas. Some of these guesses are informed by my knowledge of PS2 (I was one of the VU coders on Naughty Dog's Jak & Daxter, although I do not work for Sony now). You may wish to refer to Paul Zimmons' PowerPoint presentation, which has diagrams that might make some of this stuff clearer. Also, until I get told to take it down, I have made the patent itself available in a more easily downloadable form (a 2MB ZIP containing 61 TIF files).

The technology of PS3 is based on what IBM call the "Cell Architecture". This architecture is being developed by a team of 300 engineers from Sony, IBM and Toshiba. PS2 was developed by Sony and Toshiba. Sony appear to have designed the basic architecture, while Toshiba figured out how to implement it in silicon. The new consortium includes IBM, who for PS3 will use their advanced fabrication technologies to build the chips faster and smaller than would otherwise have been possible. In addition, the effort is supposedly a holistic approach whereby tools and applications are being developed alongside the hardware. IBM have particular expertise in building applications and operating systems for massively parallel systems - I expect IBM to have significant input into the software for this system.

There is a lot of PS2 in the Cell Architecture. It is the PS2 flavour that is most apparent to me when I read the patent. However, IBM must be bringing a significant amount of stuff to the table too. The patent for instance refers to a VLIW processor with 4 FPUs, rather than a dual-issue processor with a single SIMD vector FPU. Does this imply that the chips are based on an IBM-style VLIW ALU set? Or does it just mean that it's a fast VU with a "very long instruction word" of only 2 instructions? Furthermore, note that IBM have been making and selling massively parallel supercomputers for several decades now. IBM experts' input on the programming paradigms and tool set is going to be invaluable. And the host processor finally drops the MIPS ISA in favour of IBM's own PowerPC instruction set. But we may not get to program the PPCs inside the PS3 anyway.

I have had to make assumptions. Forgive them. If anyone with understanding or knowledge wishes to clarify, please do.
Contents

* Foreword
* Cells
* APUs
  * Instruction Width
* Winnie the PU
* PEs
* The Broadband Engine
* Visualizers
* Will the Real PS3 Please Stand Up?
* Memory: Sandboxes
* Memory: Producer/Consumer Synchronization
* Memory: Random Access, Caches, etc.
* Forward and Sideways Compatibility
* Graphics
  * Modelling
* Programming PS3
* Jazzing with Blue Gene
* Stream Processing
* Readers' Comments
* Links and References

Cells

(There is some confusion as to what a "Cell" is in this patent. The media is generally using the term "Cell" for what the patent calls a "Processing Element" or "PE". In the patent, the term "Cell" refers to a unit of software and data, while the term "PE" refers to a processing element that contains multiple processing units. I will use that nomenclature here.)

Cells are central to PS3's network architecture. A cell can contain program code and/or data. Thus, a cell could be a packet in an MPEG data stream (if you were streaming a movie online) or it could be a part of an application (e.g. part of the rendering engine for a PS3 game). The format of a cell is loosely defined in the patent. All software is made up of cells (and here I use software in its most general sense, to include programs and data). Furthermore, a cell can run anywhere on the network - on a server, on a client, on a PDA, etc.

Say, for example, that a web site wanted to stream a TV signal to you in their new improved DivY format.

They could send you a cell that contained the program instructions for decoding the DivY stream into a regular TV image. Then they send you the DivY-encoded image stream. This would work whether you had a PS3 or a digital TV, or even a sufficiently powerful PDA - assuming their design followed the new standard.

Depending on how "open" Sony make this, it could be easy or impossible to program your own PS3 just by sending it the data packets you want it to run. (Note that Sony's history in this respect is interesting - their PSX Yaroze and PS2 Linux projects show a certain willingness to open their machines up to hobbyists.)
APUs

Cells run on one or more "attached processing units" or APUs (I pronounce this after the character in The Simpsons!). An APU is architecturally very similar to the Vector Unit (VU) found in PS2, but bigger and more uniform:

* 128-bit processor
* 1024-bit external bus
* 128K (8192 x 128-bit words) of local RAM
* 128 x 128-bit registers
* 4-way floating-point vector unit giving 32GFLOPS
* 4-way integer vector unit giving 32GIOPS

(Compare this to the VU's 128-bit external bus, 16K of code RAM, 16K of data RAM, 32 x 128-bit registers, single-way 16-bit integer unit and only 1.2GFLOPS.)

The APU is a very long instruction word (VLIW) processor. Each cycle it can simultaneously issue one instruction to the floating-point vector unit and one to the integer vector unit. It is much more like a traditional DSP than a CPU like the Pentium III - it does no dynamic analysis of the instruction stream and no reordering. The register file is imbued with enough ports that the FPU and the IPU can each read 3 registers and write one register on every cycle. Unlike the VU, the integer unit on the APU is vectorized, each vector element is a 32-bit int (the VU's was only 16-bit) and the register file is shared with the FPU (on the VU there is a smaller dedicated integer register set). The APU should therefore be somewhat easier to program and much more general-purpose than the VU.

Unlike the VU, which used a Harvard architecture (separate program and data memories), the APU seems to use a traditional (von Neumann) architecture where the 128K of local RAM is shared by code and data. The local RAM appears to be triple-ported so that a single load or store can occur in parallel with an instruction fetch, mitigating the von Neumannism (the other port is for DMA). The connection is 256 bits wide (2 x 128 bits), so only one load or store can occur per cycle - it seems reasonable to assume therefore that load/store instructions occur only on the integer side of the VLIW instruction, as was the case on the VU. Since there is no distinction between integer and floating-point registers this works out just fine. The third port of the RAM attaches the APU to other components in the system and allows data to be DMAed into or out of the chip 1024 bits at a time. These DMAs can be triggered by the APU itself, which differs from the PS2 where only the host processor could trigger a DMA.

Note that the APU is not a coprocessor but a processor in its own right. Once loaded with a program and data it can sit there for years running them independently of the rest of the system. Cells can be written to use one or more APUs, so multiple APUs can cooperate to perform a single logical operation. A striking example given in the patent is where three APUs convert 3D models into 2D representations and one APU then converts this into pixels. The implication is that PS3 will perform pure software rendering.

The quoted speed of these APUs is impressive - 32GFLOPS + 32GIOPS (32 billion floating-point operations and 32 billion integer operations per second). I expect that Sony count a 4-way vectorized multiply-accumulate instruction as 8 FLOPs, so the APU clock speed is 4GHz, as has been reported elsewhere in the media. This is much faster than PS2's sedate 300MHz clock - by about 13 times. I assume that the FPUs are pipelined (i.e. you can issue one instruction per cycle but it takes, say, four cycles to deliver the answer). But if PS2 had a 4-stage pipeline for the multipliers at 300MHz, what is the pipeline depth going to be at 4GHz? 8 stages? 16 stages? The details of this will depend on the precise design of the APU, which is not covered by the patent, but it is worth noting that bare pipelines are hard to code for at a depth of 4; at a depth greater than this it may simply be too difficult to write optimal code for these parts.
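The clock-speed estimate follows directly from how the FLOPs are counted. A minimal sketch of that reasoning, assuming (as the article does) that one 4-way multiply-accumulate per cycle counts as 8 floating-point operations:

```python
APU_GFLOPS = 32          # figure quoted from the patent
VECTOR_LANES = 4         # 4-way vector unit
OPS_PER_LANE = 2         # a multiply-accumulate counts as multiply + add

flops_per_cycle = VECTOR_LANES * OPS_PER_LANE      # 8
apu_clock_ghz = APU_GFLOPS / flops_per_cycle       # implied clock speed

PS2_CLOCK_MHZ = 300
clock_ratio = apu_clock_ghz * 1000 / PS2_CLOCK_MHZ  # how much faster than PS2
```

The implied clock is 4GHz and the ratio to PS2's 300MHz is a little over 13. Counting a MAC as a single operation would instead imply a 8GHz clock, which is why the 8-FLOP assumption matters.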

Note: the APUs may instead use an IBM-style VLIW architecture where each ALU (4 floating-point and 4 integer) is independently operable from different parts of the instruction word. However, the word size of the registers is 128 bits, so each floating-point unit must access part of the same register. This seriously limits the effectiveness of a VLIW architecture and makes it rather difficult to program for. I therefore assume that the ALUs behave as typical 4-way SIMD vector units.

An interesting departure from PS2 is that all software cells run on the APUs. On PS2 there were two VUs but also a general-purpose CPU (a MIPS chip). This chip was the only chip in the system capable of 128-bit vector integer operations (necessary for fast construction of drawlists) and this functionality is now included in the APU. There is a non-APU processor in the new system but it only runs OS code, not cells, so its precise architecture is irrelevant - it could be anything and the same software cells would still run on the APUs just fine.
Instruction Width

Given 128 registers, it takes 7 bits to identify a register. Each instruction can have 3 inputs and 1 output, which is 28 bits. I am assuming they are keeping the extremely useful vector element masks, which would add 4 bits to the FPU side. Only in the case of a MAC (multiply-accumulate) are 3 inputs really needed, but say you specify a MAC on both the IPU and the FPU: that is 60 bits for register specifications alone. Hence I doubt that the instruction length is 64 bits - I think the VLIW on the APU must be 128 bits wide, which is reasonable since that is the word length and since there is bandwidth to read 128 bits from memory per cycle as well as doing a load/store to/from memory at the same time. But this probably means that code is not overly compact - only 8,192 instructions will fit in the whole of the APU's RAM, with no room for data in that case.

On the other hand, 128 bits is a lot of bits for an instruction, since only 60 are used so far. Allowing 256 distinct instructions per side (which is very, very generous), that is 8 bits per side, making 76. My guess is that it may have another 16 bits for masking integer operations, just as 4 bits mask FPU operations. 16 bits lets you isolate any given byte(s) in the register. That's 92.

Another cool feature they could employ is conditional execution as on the ARM - 4 bits would control the execution of each instruction according to standard condition codes. I was surprised not to see this on the VU in PS2 (maybe ARM has a patent?) because it helps to avoid a lot of tiny little branches. If the PPC is influencing the design, they may just throw a barrel shifter in after each instruction too (which would be quite ARM-like as well). So even without unaligned memory accesses you can isolate any field in a 128-bit word in a single mask-and-shift instruction. Another 7 bits there too (integer only) ... that is still only 99 bits - 29 bits are still available.
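The bit budget above is easier to follow as a running total. The field names below are labels invented for this sketch; the widths are the ones guessed in the text. Note that the running total of 99 appears not to count the optional 4 condition-code bits, and the sketch follows that total.

```python
from itertools import accumulate

REGISTER_COUNT = 128
REG_BITS = (REGISTER_COUNT - 1).bit_length()   # 7 bits address 128 registers
OPERANDS_PER_SIDE = 4                          # 3 inputs + 1 output
SIDES = 2                                      # FPU side and IPU side

fields = [
    ("register operands, both sides", REG_BITS * OPERANDS_PER_SIDE * SIDES),
    ("FPU vector element mask", 4),
    ("opcodes, 8 bits per side", 8 * SIDES),
    ("integer byte mask", 16),
    ("barrel shifter amount (integer only)", 7),
]

running_total = list(accumulate(bits for _, bits in fields))

INSTRUCTION_BITS = 128
spare_bits = INSTRUCTION_BITS - running_total[-1]

# How many 128-bit instructions fit in the 128K of local RAM.
instructions_in_local_ram = 128 * 1024 // (INSTRUCTION_BITS // 8)
```

The totals step through 56, 60, 76, 92 and 99 bits, leaving 29 spare, and a full local RAM holds 8,192 instructions.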

What seems to be common on classic VLIW chips is to have parts of the ALU directly controlled by the instruction code. On a classic CPU the instructions are decoded to generate the control lines for the ALU (for example, to select which part of the ALU should compute the result). With a VLIW chip you can put the control lines directly in the code - known as horizontal encoding. This makes the chips simpler and the instructions more powerful - you can do unusual things with the instructions that you could not do on a normal CPU. Normal instructions are present as special cases. It is more like you are directly controlling the ALU with bitfields than issuing "instructions". This can make the processor difficult for humans to write code for - but in some ways easier for machines (e.g. compilers). It is possible that the other 29 bits go towards this encoding.

However, the patent does not go into much detail about any of this, so you should treat the above few paragraphs with some suspicion until more information emerges.
Winnie the PU

As mentioned previously, there is a second type of processing unit which is just called a "processing unit" or PU. The patent says almost nothing about the internals of this component - media reports suggest that current incarnations will be based on the PowerPC chip, but this is not particularly relevant.

The PU never runs user code, only OS code. It is responsible for coordinating activity between a set of APUs - for example deciding which APUs will run which cells. It is also responsible for memory management - deciding which areas of RAM can be used by which APUs. The PU must run trusted code because, as I explain later, it is the PU that sets up the "sandboxes" which protect the whole system from viruses and similar malicious code downloaded off the Internet.
PEs

The patent then combines several APUs and a PU to make a "processing element" or PE. A typical PE will contain:

* A master PU which coordinates the activities in the PE
* A direct memory access controller or DMAC which deals with memory accesses
* A number of APUs, typically 8

The PE is the smallest thing you would actually make into a chip (or you can put multiple PEs on a chip - see later). It contains a 1024-bit bus for direct connection to DRAM. It also contains an internal bus, the PE-bus. I am inferring that this bus is 1024 bits wide since it attaches directly to the memory interface of the APUs, which is 1024 bits wide. However, the patent says almost nothing in detail about the PE-bus.

The DMAC seems to provide a simultaneous DMA channel for each APU - that is, 8 simultaneous 1024-bit channels. The DRAM itself is split into 8 sectors, and for simultaneity each APU must access a different sector. Nominally the DRAM is 64MB and each sector is 8MB. The sectors themselves consist of 1MB banks configured as just 8192 1024-bit words.
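The memory layout figures are mutually consistent, as a quick calculation shows. The constant names are invented for this sketch; the sizes are the nominal ones quoted in the text.

```python
WORD_BITS = 1024               # one DRAM word, also one DMA transfer
WORDS_PER_BANK = 8192
DRAM_BYTES = 64 * 2**20        # nominal 64MB per PE
SECTORS = 8                    # one sector per simultaneous DMA channel

word_bytes = WORD_BITS // 8                       # bytes moved per transfer
bank_bytes = WORDS_PER_BANK * word_bytes          # size of one bank
sector_bytes = DRAM_BYTES // SECTORS              # size of one sector
banks_per_sector = sector_bytes // bank_bytes
```

Each word is 128 bytes, each bank is exactly 1MB, each sector 8MB, so a sector holds 8 banks and the whole DRAM holds 64.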

A PE with 8 APUs is theoretically capable of 256GFLOPS, or 1/4 of a TFLOPS. Surely that is enough power for a next-gen console? Not according to Sony...
The Broadband Engine

Now we put four PEs together in one chip and we get what the patent calls a "broadband engine" or BE. Instead of each PE having its own DRAM, the four PEs share it. The PE-bus of each PE is joined together into the BE-bus, and included on the same chip are optional I/O blocks. The interface to the DRAM is still shown as being external, but I still assume that the DRAM must be on the same die to accommodate the 8192-wire interface.

Within a BE each PE gets 1/4 of the memory bandwidth of a standalone PE, since the four PEs share the same DRAM. So they must share. This is done by means of a crossbar switch whereby each of the 8 channels on each PE can be attached to any of the 8 sectors of the DRAM.

Additionally, each BE has 8 external DMA channels which can be attached to the DRAM through the same crossbar mechanism. This allows BEs to be attached together and to directly access each other's DRAM (presumably with some delay). The patent discusses connecting BEs up in various topologies.

One thing the patent talks about is grafting an optical waveguide directly onto the BE chip so that BEs can be connected optically - literally, the chip packaging would include optical ports into which optical fibres could be attached directly. Think about that! If the BE plus DRAM were a self-contained unit, there would be no need at all for high-frequency electrical interfaces in a system built of BEs, and therefore board design should become very much easier than it is today. Note that the patent clearly states that the optical interface is an option - it may never actually appear - but it would be very useful in building clusters of these things, for example in supercomputers.

**Snake22** · 22-07-2003, 22:46:30

A BE with 4 PEs is theoretically capable of 1 TFLOPS - about 400 times faster than a PS2.
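The peak figures stack up by simple multiplication. This sketch just reproduces that chain; the constant names are invented, and the PS2 figure is not a measured number but what the "400 times" claim would imply.

```python
APU_GFLOPS = 32
APUS_PER_PE = 8
PES_PER_BE = 4

pe_gflops = APU_GFLOPS * APUS_PER_PE      # one Processing Element
be_gflops = pe_gflops * PES_PER_BE        # one Broadband Engine

# PS2 throughput implied by "about 400 times faster than a PS2".
implied_ps2_gflops = be_gflops / 400
```

A PE comes to 256GFLOPS and a BE to 1024GFLOPS, which puts the implied PS2 baseline at roughly 2.5GFLOPS.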
Visualizzatori I visualizzatori (VSs) sono accennati alcune volte attraverso il documento. Un visualizzatore � come un PE in cui 4 dei APUs sono rimossi e nel loro posto � messo una certa video memoria (VRAM), un video regolatore (CRTC) e "un motore del pixel". Quasi nessun particolare � fornito ma � un presupposto giusto che uno o pi� di questi former� il grafico "centralizza" per lo PS3. Presumerei che il motore del pixel realizza i funzionamenti semplici come quelli realizzati dall'posteriore di una conduttura normale dei grafici - controlli ed aggiorni la Z, controllo e lo stampino dell'aggiornamento, il controllo e l'alfa dell'aggiornamento, scrivono il pixel. L'esistenza del CONTRO � ulteriore prova per suggerire che PS3 � progettato per software che rende soltanto.

Gli schemi nel brevetto suggeriscono che i visualizzatori possono essere utilizzati nei gruppi - presumibilmente ciascuno CONTRO fa un quarto dell'immagine dell'uscita (simile al cubo di GS).

In my section on graphics later, I describe the software rendering techniques I believe PS3 will use. These techniques use at least 16x oversampling of the image (i.e. a 2560x1920 image instead of a 640x480 image), and the obvious hardware implementation might be capable of processing 16 superpixels simultaneously - which is equivalent to 1 pixel of a 640x480 image per cycle. Since NTSC output is 640x480 I call these "pixels", while the 2560x1920 image is made up of "superpixels", with 16 superpixels per pixel.

Will the Real PS3 Please Stand Up?
So what is PS3 going to be, then? The patent mentions various possible system architectures, from PDAs (a single VS), through what it calls "graphics workstations" which are one or two PEs and one or two VSs, to massive systems made of 5 BEs connected to each other optically. Which one is the PS3? The revealing diagram to me is Figure 6 in the patent, which is described as two chips - one that is 4 PEs (i.e. a BE) and one that is 4 VSs with an I/O processor which, rather coincidentally, is called IOP - the same name as the I/O processor in PS2 (this component will still be required to talk to the disc drive, joypads, USB ports, etc.). The bus between the two main chips looks like it is meant to be electrical.

Strangely, each main chip has 64MB of DRAM attached (on-chip?), and this gives only 128MB of total system RAM. That seems very, very low. I would expect a more practical system to have perhaps 64MB per PE or VS, giving a total of 512MB of RAM - much more reasonable. So perhaps the 128MB is just a kind of fast on-chip "secondary cache" RAM. Then lots of slower RAM could be attached to the system using a normal memory controller. This slow RAM would be much cheaper than the "secondary cache" RAM and probably would not have the 8,192-wire interface. In fact, looking at the design of PS2's GS, there we have 4MB of VRAM that has a 1024-bit bus to the frame cache - so perhaps the 64MB per PE is an extension of this VRAM design? On the other hand, VRAM tends to be fast and low-latency, while the patent specifically calls the 64MB per PE "slow DRAM".

So how powerful is this machine? Well, the 4 PEs give us 1 TFLOPS. The 4 VSs give another 1/2 TFLOPS. Add the integer instructions in, and call what the pixel engine does "integer operations" too, and pretty soon you will see a machine that really is capable of trillions of operations per second - a superbly ludicrous amount.
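The tally above can be reproduced with a few lines of Python. This is only a back-of-envelope sketch: the 32 GFLOPS per APU figure is the one quoted later in the article, while the count of 8 APUs per PE is an assumption taken from the patent's description of a PE, not something this section states.

```python
# Back-of-envelope peak throughput, using figures quoted in the article.
GFLOPS_PER_APU = 32             # per-APU figure quoted in the Modelling section
APUS_PER_PE = 8                 # ASSUMPTION: from the patent, not stated in this section
APUS_PER_VS = APUS_PER_PE - 4   # a VS is described as a PE with 4 APUs removed


def peak_gflops(units: int, apus_per_unit: int) -> int:
    """Peak GFLOPS for a group of identical units (PEs or VSs)."""
    return units * apus_per_unit * GFLOPS_PER_APU


be_gflops = peak_gflops(4, APUS_PER_PE)   # 4 PEs, i.e. one BE
vs_gflops = peak_gflops(4, APUS_PER_VS)   # 4 VSs
print(be_gflops, vs_gflops, be_gflops + vs_gflops)
```

Under those assumptions the 4 PEs come to 1024 GFLOPS and the 4 VSs to 512 GFLOPS, which matches the "1 TFLOPS plus another 1/2 TFLOPS" in the text.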

Assuming the pixel engine can handle one pixel (16 superpixels) per cycle, at 4GHz with 4 VSs that is a fillrate of 16 GPPS - enough to draw a 640x480 screen at 60Hz with 800x overdraw. Nice. (However, note that when drawing triangles smaller than a pixel, a certain amount of "overdraw" is required just to fill the screen - so the available depth complexity is "only" of the order of 100 or so.)

Memory: Sandboxes
The DRAM used in this system is not really 1024 bits wide but 1024 + N bits, where N is extra control information. This extra information is used in 2 ways - to provide hardware "sandboxing", whereby regions of memory can be set up to allow access only from a given subset of the APUs, and to provide hardware producer-consumer synchronization, which I discuss later.
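As a quick check of the fillrate figure, here is the arithmetic as a small Python sketch. The clock speed, VS count and screen mode are the ones quoted in the text; the one-pixel-per-cycle rate is the article's own assumption.

```python
# Fillrate estimate: one output pixel (16 superpixels) per VS per cycle.
CLOCK_HZ = 4_000_000_000   # 4 GHz, as quoted in the text
VS_COUNT = 4

fill_pixels_per_s = CLOCK_HZ * VS_COUNT    # 16 billion pixels per second
screen_pixels_per_s = 640 * 480 * 60       # one full NTSC-sized screen at 60 Hz
overdraw = fill_pixels_per_s / screen_pixels_per_s

print(f"{fill_pixels_per_s / 1e9:.0f} GPPS, overdraw about {overdraw:.0f}x")
```

The ratio comes out a little above 860, which is where the rounded "800x overdraw" comes from.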

Sandboxes are implemented using the following logical test: (REQID & REQIDMASK) == (MEMID & MEMIDMASK). Here, REQID and REQIDMASK are an ID and a mask associated with the APU making the request; MEMID and MEMIDMASK are an ID and a mask associated with the memory location being read or written. If the results are equal the access goes ahead, otherwise it is blocked.
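The test is simple enough to sketch directly. In the Python below, only the comparison itself comes from the text; the 8-bit ID layout (high nibble as a group, low nibble as an APU index) is a made-up encoding used purely for illustration.

```python
def access_allowed(req_id: int, req_mask: int, mem_id: int, mem_mask: int) -> bool:
    """The sandbox test quoted in the text: compare the masked IDs."""
    return (req_id & req_mask) == (mem_id & mem_mask)


# HYPOTHETICAL encoding: high nibble = group number, low nibble = APU index.
# Memory private to APU 0x13: both sides compare the full ID.
print(access_allowed(0x13, 0xFF, 0x13, 0xFF))  # owner
print(access_allowed(0x14, 0xFF, 0x13, 0xFF))  # a different APU

# Memory shared by group 1: both sides compare only the group nibble.
print(access_allowed(0x13, 0xF0, 0x10, 0xF0))  # member of group 1
print(access_allowed(0x23, 0xF0, 0x10, 0xF0))  # member of group 2
```

With this encoding the first and third requests go ahead and the other two are blocked. How the real IDs and masks would be laid out is not described, so treat the constants as placeholders.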

This system allows APUs to have private memory, memory shared with a specific set of other APUs, and a fairly expandable set of other permutations. It is not clear how this feature interacts with the feature that lets BEs directly read the memory of other BEs - one would imagine 32 APUs per BE would mean that the IDs and masks were 32 bits wide with one bit per APU - but if a potentially unlimited set of APUs in other BEs can access the DRAM, then how are the IDs set up, I wonder?

Memory: Producer/Consumer Synchronization
The DRAM also performs another special function, which is to allow automatic synchronization between an APU that is producing data and an APU that is consuming that data. The synchronization works per 1024-bit word. Essentially, the system is set up so that the producer issues a DMA "sync write" to memory and the consumer issues a DMA "sync read" from memory. If the sync write happens first, all is well. If the sync read happens first, the consumer stalls until the write happens.
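The two orderings can be illustrated with a toy model of a single DRAM word. This is a sketch of the behaviour as described above, not of the patent's actual logic; the class name, the `delivered` dictionary and the APU names are all invented for the example, and "stalling" is modelled simply by returning `None`.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class SyncWord:
    """Toy model of one DRAM word with a full/empty flag."""
    data: Optional[bytes] = None
    full: bool = False
    waiting_reader: Optional[str] = None           # APU stalled on a sync read
    delivered: dict = field(default_factory=dict)  # data handed to stalled readers

    def sync_write(self, payload: bytes) -> None:
        if self.waiting_reader is not None:
            # The read arrived first: hand the data straight to the stalled APU.
            self.delivered[self.waiting_reader] = payload
            self.waiting_reader = None
        else:
            self.data, self.full = payload, True

    def sync_read(self, apu: str) -> Optional[bytes]:
        if self.full:
            # The write arrived first: take the data and mark the word empty.
            payload, self.data, self.full = self.data, None, False
            return payload
        self.waiting_reader = apu   # nothing there yet, so this APU stalls
        return None


# Write first, then read: all is well.
a = SyncWord()
a.sync_write(b"frame 0")
print(a.sync_read("apu1"))

# Read first: the consumer stalls until the write happens.
b = SyncWord()
print(b.sync_read("apu2"))   # None, i.e. stalled
b.sync_write(b"frame 1")
print(b.delivered["apu2"])
```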

What does that really mean? Well, there is a bit on every memory location inside the APU. (We are talking here about 128-bit locations, not 1024-bit locations.) This bit is set when a "sync read" is pending on that memory location. The patent description fluffs the explanation of this a little, but I am guessing that the APU issues a sync read and then carries on until code in the APU tries to access the data that was read. If the data has still not arrived from RAM, the APU stalls until the data is available. (The patent seems to imply that the APU stalls immediately, but that does not make a whole lot of sense, since the extra bit in the RAM would then not be needed.)

This mechanism is very important - it means you can prefetch data into memory, and as long as you can keep working on other stuff the data will arrive when it is ready and the APU need not stall. So synchronization can be free in cycles (no stalls) and free in terms of PU overhead. The overhead in the DRAM for each memory location is about 18 bits - 1 free/empty bit, an APU ID (5 bits) and a destination address (13 bits). But note what I said above about there being more than 32 APUs accessing the same bank of DRAM - perhaps more than 5 bits are needed for the APU ID?
Parity DRAM will already provide 128 extra bits for each 1024-bit word - this is more than enough to provide the 40 bits required by sandboxing and synching, with 88 bits left for ECC. (ECC is not mentioned in the patent, but it is reasonable to assume it could be a feature - ECC puts an error-correcting code in the spare bits of the DRAM so that, in the unlikely event that a cosmic ray changes a bit pattern in your RAM, the system can detect and correct the error. Honestly! I'm not making this up!)

The patent makes a fuss about how this synching will make it trivial to obtain data from an I/O device. But it seems to me that the main function of the synching is to make it trivial to stream data around the APUs without intervention from the PU. You can set up arbitrary producer-consumer graphs and have them run as if by magic. It is a great feature for things like video processing, where several APUs could be doing MPEG compression of images that are read from an analogue-to-digital converter and processed by other APUs (e.g. adding menus or even special effects). Each stage must wait for data from the previous stage, and this synchronization allows that to be done with minimal hassle. As I discuss in the programming section, data streams are going to be a key concept in PS3 programming.

Memory: Random Access, Caches, etc.

Now, I said earlier that the PU does not run user code. Unless, perhaps, it does. A lot of a game can be shoehorned into the APU model - certainly the whole graphics engine, probably a lot of AI code, sound code. But it is a case of writing each piece of code to handle a 128K working set (minus code size). What about the cases where you really, really, really need random access to memory? What about just porting some monolithic C onto the platform? How do we do that? Surely we need a traditional processor for the "normal" tasks? Well, it is not yet clear that we do. Random access to memory is going to screw things up. It would probably be faster to first make a list of the memory locations you need to access (in local RAM), sort it, and then do the memory accesses in sequence! A cache will not help in this case.
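The "make a list, sort it, then access in sequence" idea can be sketched as follows. This is a plain Python illustration of the access pattern only; a Python list stands in for main DRAM, and the function name is invented for the example.

```python
def gather_sorted(memory, indices):
    """Fetch memory[i] for each requested index, touching memory in ascending
    address order, and return the values in the original request order."""
    # Sort the request positions by the address each one wants.
    order = sorted(range(len(indices)), key=indices.__getitem__)
    results = [None] * len(indices)
    for pos in order:                      # sequential sweep through "DRAM"
        results[pos] = memory[indices[pos]]
    return results


main_memory = list(range(100, 200))        # stand-in for main DRAM
print(gather_sorted(main_memory, [7, 3, 50, 3]))
```

The caller still sees the values in the order it asked for them; only the order in which memory is touched changes.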

What about porting monolithic C? I think the answer to this runs deep. I think the answer is: you can't. I think writing code for this beast is fundamentally different. You have to break your code into processes that can operate with only 128K of data at a time. The PS3 will require a new approach to programming. I have my own theories about what this new approach is, which I describe later, but ultimately the new approach should actually be more uniform than our current one. It will force modularity on us, which might not be an entirely bad thing.

What the system can hopefully offer to programs (e.g. those loaded from a DVD-ROM rather than from the Internet) is the ability to modify the PU's operating system, e.g. to change the algorithm it uses to distribute cells among the APUs. Or the ability to add drivers directly to the I/O processor (IOP) on the PS3. It seems likely that Sony will offer some level of control to game programmers - control that is simply not required by mere purveyors of streaming media. On the other hand, this would destroy their ability to switch the PU over to a different processor architecture.

Something that is not addressed by the patent is the question: what granularity do APUs see in their local memory? Can they do byte stores? 32-bit loads? My guess is that they can only transfer 1024-bit quantities from main DRAM to local RAM, and can only transfer 128-bit quantities from local RAM to registers, but that in a single instruction they can isolate any bitfield from a pair of 128-bit registers. But that is only a guess.

Forward and Sideways Compatibility
By forward compatibility I mean that programs can run on future revisions of the hardware without error. By sideways compatibility I mean that programs can run on hardware that implements the same instruction set but is made by different vendors to different designs. In both cases, we are talking about programs running on chips that have different timing characteristics from the chips they were written on.

The patent discusses a timer that is provided on each PU. You tell it how long you think an APU program should take, and if it takes less time (say, on a faster APU) then it waits until the specified time has elapsed - so the program will never be faster than it should be.
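A minimal sketch of that behaviour, counting in abstract cycles rather than real time. The function name and the cycle counts are invented for illustration; the only rule taken from the text is that a program that finishes early is held until its budget has elapsed.

```python
def run_with_budget(actual_cycles: int, budget_cycles: int) -> tuple:
    """Return (finish_cycle, idle_cycles) for a program run under a time budget.

    The timer only pads programs that finish early. It cannot pull in a
    program that runs long, so the budget is a floor, not a ceiling.
    """
    idle = max(0, budget_cycles - actual_cycles)
    return actual_cycles + idle, idle


print(run_with_budget(actual_cycles=600, budget_cycles=1000))    # faster APU: padded
print(run_with_budget(actual_cycles=1300, budget_cycles=1000))   # overrun: no padding
```

The first call finishes exactly on the budget after idling for 400 cycles, while the second simply finishes late.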

I don't get this part, really. Initially I thought maybe the timer was for synching processes, but it only guarantees "no sooner than" completion, so synching would be impossible, since some processes may not have "arrived" yet. Even if this is the intention, it would only work if the time budget referred only to APU processing and not to memory accesses - since the DRAM is shared with the other PEs, each APU has only 1/4 of a DMAC slot available to it.

So stalls can occur based on what the other APUs in the system are doing. You can easily overrun your timeframe this way, through no fault of your own, and suddenly you are no longer synched up with anything else.

So what is this for, then? Maybe it is there to define the timebase for your overall game program - you use the timer to specify that the game runs at 60Hz and no faster, even on a fast CPU. That seems unlikely, though, because the standard game programming repertoire includes ways to make games run at real-time speed, and at a higher frame rate (if the display can support it), as the machine's processing ability improves. So perhaps the timer is used to define the frequency of the 60Hz NTSC output - and perhaps subfrequencies off that. Remember that this is an 8GHz part. It could use an APU to generate the entire NTSC image. But it does not seem to, since there are CRTCs specified in each VS anyway.

**crash88** · 22-07-2003, 22:46:38

GUYS, HERE'S THE TRANSLATION!

you didn't translate it all by yourself, did you? anyway, no offence, but I'm not going to read it all the same (it's too long)

**Snake22** · 22-07-2003, 22:47:16

Ideas, anyone?

Graphics
Software rendering seems to be the order of the day, although it may be premature to make this observation, since they could always add some other GPU if the software rendering performance failed to impress. It might seem like a waste to do scan conversion in software when Sony already has hardware that can scan convert 75 million triangles per second in PS2.

But PS3 is not going to do 75 million triangles per second. Oh no. It is going to do a lot more than that. I am going to stick my neck out and say that the PS3 will, at peak, be capable of 1 billion triangles per second. But before I justify that figure, let's just suppose that it will do a lot of triangles. So many triangles that the average size of one is less than a pixel. So what is there to scan convert? It's a dot, right? Well, no. You could draw a dot and you would get an image, but the image would look pretty odd - more like a very high-resolution mosaic than a properly anti-aliased CG image. The system will have to do subpixel rendering and average it out to get a nice image.

4x4 supersampling of a 640x480 image gives 2560x2048 superpixels - 1280x1024 on each of the 4 VS units. Now, if we ever want to draw a triangle larger than a pixel in width, we subdivide it. This is done entirely in code on the APU. Once the triangle is smaller than 4x4 superpixels in size, there are algorithms you can use to determine very quickly which subpixels it covers. You keep bitmasks for every possible (4x4x4x4 = 256) edge and mask them together to give the triangle's coverage. Since the triangle is smaller than a pixel in size there is no point texturing it - you just fill it with a single colour. So we are rendering flat-shaded polys. We can do a lot of these in software.

We end up with a nicely antialiased image that has the appearance of texture mapping only because we referred to a texture map when we decided the colour of each triangle. We write APU programs to determine this colour - these programs are called shaders. Anyone experienced with RenderMan should begin to understand what is going on. In a sense the rendering capabilities of PS3 are very much analogous to a real-time Reyes RenderMan engine.
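To make the edge-mask idea concrete, here is a small Python sketch. It swaps in a simpler technique than the one described: instead of looking up precomputed masks in a 256-entry table, it computes each edge's 16-bit mask on the fly with a half-plane test at the sample centres, then ANDs the three masks together. The sample positions, the bit ordering and the counter-clockwise convention are all assumptions made for the example.

```python
def edge_mask(x0, y0, x1, y1):
    """16-bit mask of the 4x4 sample centres lying on or to the left of the
    directed edge (x0, y0) -> (x1, y1). Bit (sy * 4 + sx) is sample (sx, sy)."""
    mask = 0
    for sy in range(4):
        for sx in range(4):
            px, py = sx + 0.5, sy + 0.5   # ASSUMPTION: samples at cell centres
            # Sign of the 2D cross product says which side of the edge we are on.
            cross = (x1 - x0) * (py - y0) - (y1 - y0) * (px - x0)
            if cross >= 0:
                mask |= 1 << (sy * 4 + sx)
    return mask


def triangle_coverage(a, b, c):
    """Coverage of a counter-clockwise triangle: AND of its three edge masks."""
    return edge_mask(*a, *b) & edge_mask(*b, *c) & edge_mask(*c, *a)


# A triangle filling half of the 4x4 superpixel block, and a tiny one.
half = triangle_coverage((0, 0), (4, 0), (0, 4))
tiny = triangle_coverage((0, 0), (1, 0), (0, 1))
print(f"{half:016b}", bin(half).count("1"))
print(f"{tiny:016b}", bin(tiny).count("1"))
```

In a table-driven version, `edge_mask` would become a lookup indexed by the quantized edge, which is what makes the approach cheap enough to run per triangle.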

So how on earth do we get 1 billion triangles per second? Well, the hardware to compute triangle fragments is small and fast. Then all you need are the basic pixel operations we already have on GPUs - z-test, stencil-test, alpha-test, alpha-blend, etc. Assuming 4x4 supersampling, each triangle covers up to 16 superpixels (really never more than 10), and 1 superpixel per cycle = 1/16 triangle per VS per cycle, which gives 1 billion triangles per second. (These 16 operations can even be parallelized if the VRAM is sliced into 16 banks, so we could even get a theoretical 16 billion triangles per second, but 4x 4GHz APUs could never drive that many triangles out, so it seems a little superfluous.)

It is possible to drive this from a simple 16-bit input mask, so that the software must determine the coverage and pixel position for each small triangle; however this will take several cycles, while the hardware to "scan convert" such polygons is small and fast (a triple-ported 256x16 ROM plus some basic coordinate-shifting logic). It is possible, therefore, that the pixel engines are actually capable of "scan-converting" these subpixel polygons, easing the burden on the software.

But a big caveat with all these triangle counts is the complexity of the shader itself. The simplest shader might be able to kick out a polygon every 16 cycles from each APU, pushing the pixel engines to the max, but anything more complex (e.g. with texturing or shading or fogging) will require more cycles. As with all graphics systems, the practical triangle count will probably be less than 20% of the theoretical maximum.
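The triangle-rate figures in this section follow from a few multiplications, sketched here in Python. All inputs are the numbers quoted in the text; nothing here is measured.

```python
# Peak triangle rate: each VS retires one superpixel per cycle, and a small
# triangle is budgeted at 16 superpixels.
CLOCK_HZ = 4_000_000_000
VS_COUNT = 4
SUPERPIXELS_PER_TRIANGLE = 16

peak_tris_per_s = VS_COUNT * CLOCK_HZ // SUPERPIXELS_PER_TRIANGLE
banked_tris_per_s = peak_tris_per_s * 16        # VRAM sliced into 16 banks
practical_ceiling = peak_tris_per_s * 20 // 100 # "less than 20%" of the peak

print(peak_tris_per_s, banked_tris_per_s, practical_ceiling)
```

That gives the 1 billion peak, the 16 billion theoretical figure with banked VRAM, and an upper bound of about 200 million triangles per second for the "less than 20%" practical estimate.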
Modelling
Procedural graphics must be the way to go with these things. No matter how fast the DRAM is, it is not going to be comparable to the 32 GFLOPS available on each APU. Memory will be very slow; processing will be very fast. Pal Engstad at Naughty Dog pointed out to me that VU programming on the PS2 is analogous to old-fashioned record-based mainframe programming - you can read a small number of records from tape into local RAM, and you can only read them in sequence or efficiency suffers badly. There are algorithms for collating records held on tapes, and these algorithms can be applied equally well to the problem of collating large arrays in memory using just the 128K available to an APU. But at the end of the day, you are going to have oodles of cycles that are wasted just while you wait for memory. Procedural models can be instantiated from minimal data in RAM using those oodles of cycles. There is a natural match here.

Programming the PS3
It is really unknown how PS3 will be programmed. There are many possible models, because the architecture is so flexible. In particular, hybrid models are most likely - i.e. not all APUs will be programmed using the same model. For example, fast stream "pipeline" code, such as audio code, rendering code and decompression code, could be written in native APU assembler. These assembler programs will run on dedicated APUs and will be controlled by other parts of the program... so the interactions are very much simplified. In effect each piece of code like this is operating much like a piece of hardware, and in an especially simple "slave" mode. The massively parallel nature of PS3 should not bother the authors of this code too much.

On the other hand, some code, such as AI code, is heavily object-oriented and relies on heavy intercommunication between objects. This code will have to be written in an object-oriented environment (i.e. language plus run-time components). If a project can run all its AI code on a single APU, things will be simple. But this defeats the point of having such a system in the first place, and defeats the scalability of larger server systems (think of servers running an online game - also based on PS3 technology). So the general programming environment will have to handle the parallel nature of the system for the programmer.

Collected here are some ideas about programming the PS3.

Objects Can Be Locked Using Memory-Resident Synch
The producer/consumer synching can be used to provide locking on objects in RAM:
1. When an object is created it is written using a sync write. This puts the memory containing the object into the full state.
2. If an APU wishes to access the object, it issues a sync read. Assuming the object is initialized, the object's memory is now in the empty state and the APU's local RAM contains a copy of the object.
3. When the APU is finished, it writes the local copy back to main RAM using a sync write.
4. If another APU tries to access the object while the first APU is working on it, it stalls until the sync write happens.
5. A second APU that tries to access the object while this APU is stalled will cause an error on the second APU (and possibly the PU's OS will handle this error and make it retry).
6. Deleting the object involves a sync read without a following sync write.
7. The granularity of this technique is only 1024-bit DRAM words... 128-byte blocks of RAM. Multiple objects could share a block of RAM but could only be locked as a unit.
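The steps above can be sketched as a tiny state machine. This is a toy model of the protocol as listed, not of real hardware: the class and method names are invented, a stalled APU is represented by a `None` return, and the error in step 5 is modelled as a Python exception.

```python
class LockedObject:
    """One object living in a full/empty DRAM block, locked via sync read/write."""

    def __init__(self, value):
        self.value = value
        self.full = True          # step 1: creation is a sync write
        self.stalled_apu = None   # at most one APU may wait on the block

    def sync_read(self, apu):
        """Steps 2, 4 and 5: take the object, stall, or fail."""
        if self.full:
            self.full = False
            return self.value                  # local copy; memory is now empty
        if self.stalled_apu is None:
            self.stalled_apu = apu
            return None                        # this APU stalls
        raise RuntimeError(f"{apu}: block already has a stalled reader")

    def sync_write(self, value):
        """Step 3: write back. Returns the APU that is released, if any."""
        self.value = value
        if self.stalled_apu is not None:
            released, self.stalled_apu = self.stalled_apu, None
            return released                    # its read completes; memory stays empty
        self.full = True
        return None


obj = LockedObject({"hp": 10})
copy = obj.sync_read("apu0")          # apu0 now holds the lock
print(obj.sync_read("apu1"))          # None: apu1 stalls
try:
    obj.sync_read("apu2")             # second waiter is an error
except RuntimeError as err:
    print(err)
print(obj.sync_write({"hp": 9}))      # releases apu1, which now holds the lock
```

Deleting the object (step 6) is just a `sync_read` that is never followed by a `sync_write`, which leaves the block empty.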

Jazzing with Blue Gene
This will make you laugh. In 1999, IBM started the five-year Blue Gene project (the follow-up to the game-playing Deep Blue). Read up on it. The idea was to make a PetaFLOPS computer by taking 1 GFLOPS processors and putting 32 on a chip to make a 32 GFLOPS chip. 64 of these chips on a board would give 2 TFLOPS. A tower of 8 boards gives 16 TFLOPS, and a room of 64 towers would make one PFLOPS. Enough to do protein folding at interactive rates.

By 2005, PS3 will have APUs as powerful as Blue Gene's chips, and chips half as powerful as Blue Gene's boards. But Blue Gene is due at almost the same time as PS3. Is it possible that Blue Gene will simply be a really, really big PS3? Using Broadband Engine chips on the Blue Gene boards would provide 64 TFLOPS per board and 512 TFLOPS per tower, so a room of 64 towers would provide 32 PFLOPS. You could fit a whole PFLOPS in a cabinet! The optical interface could be used to connect the towers (note that at 4GHz light travels only 3 inches per clock cycle - so there is quite some latency even at lightspeed!). 32 PFLOPS is more than 2^64 instructions per second.
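The scaling figures in this section can be reproduced with a short Python sketch. The chip, board, tower and room counts are the ones quoted above; treating "T" and "P" as binary multiples (1024x) is an assumption, chosen because it is what makes the article's round numbers come out exactly.

```python
GFLOPS = 1
TFLOPS = 1024 * GFLOPS   # ASSUMPTION: binary multiples, matching the round figures
PFLOPS = 1024 * TFLOPS


def scale_up(chip_flops, chips_per_board=64, boards_per_tower=8, towers_per_room=64):
    """Return (board, tower, room) throughput for a given per-chip throughput."""
    board = chip_flops * chips_per_board
    tower = board * boards_per_tower
    room = tower * towers_per_room
    return board, tower, room


blue_gene = scale_up(32 * GFLOPS)   # 32 processors of 1 GFLOPS per chip
be_based = scale_up(1 * TFLOPS)     # one 1 TFLOPS Broadband Engine per chip slot

# Distance light covers in one 4 GHz clock cycle, in inches.
SPEED_OF_LIGHT_M_S = 299_792_458
inches_per_cycle = SPEED_OF_LIGHT_M_S / 4e9 / 0.0254

print([x / TFLOPS for x in blue_gene])
print([x / TFLOPS for x in be_based])
print(round(inches_per_cycle, 2))
```

The first line gives 2, 16 and 1024 TFLOPS for the original Blue Gene plan, the second gives 64, 512 and 32768 TFLOPS for the Broadband Engine variant, and light covers a shade under 3 inches per cycle.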

But if you are not even using your PS3, all that power could be used for something like Seti@Home or Folding@Home - you run software that connects your PS3 to other PS3s to form what is in effect a giant supercomputer. Only 32,000 PS3s need to be connected to match Blue Gene. If people are motivated, millions of PS3s could be available at any time.

One major application of this incredible computing power is to do protein folding experiments - experiments which help to find the causes of, and possibly cures for, certain diseases including cystic fibrosis, Alzheimer's disease and some cancers. But another is to simulate nuclear bombs or hydrogen bombs. In future these supercomputers are also likely to be useful in genetic engineering design, or in the design of fusion reactors (cheap clean power). It is up to you whether or not you wish to support any of these causes, and it should be up to you whether or not the machine you paid for is used towards them. Remember that software cells can be transmitted to your machine over the network, and it is up to Sony's OS when this is allowed to happen. It would not be impossible for Sony to code the OS so as to automatically give all spare power to whatever cause Sony chose while you are online. It is important that consumers are informed of the issue and can donate their spare cycles to whomever they choose.

Stream Processing
While researching this article I came across the Imagine stream processor developed by William J. Dally's team at Stanford University. This is a chip that is still about 10x faster (at relative clock speed) than the PS3 chips and uses a somewhat similar parallel design. Another team at Stanford is doing related research on streaming supercomputers, which are just large arrays of these chips connected directly together. (It is not clear at this stage whether or not Stanford's patents for these designs could form "prior art" against Sony's patents.)

The Stanford team has provided tools and methods to deal with programming on the stream architecture - they write "kernels" (effectively APU code) in a language called KernelC, which is compiled and loop-unrolled (a la VU code) to target the VLIW architecture of the Imagine processor. Then they chain these kernels together with streams using StreamC, which makes a normal program that runs on the host processor (an ARM or PowerPC chip in this case). Note that the Imagine system has mainly been used to accelerate specific tasks - e.g. rendering - and not to run whole games that include rendering, audio generation, AI and physics all at once.
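The kernel-plus-stream split can be illustrated with Python generators. To be clear, this is not KernelC or StreamC syntax, which the article does not show; it is only an analogy, with generator functions standing in for kernels and an ordinary function standing in for the host program that chains them. All names are invented.

```python
def scale_kernel(stream, factor):
    """A 'kernel': consumes one stream and produces another, element by element."""
    for sample in stream:
        yield sample * factor


def clamp_kernel(stream, low, high):
    """A second kernel that can be chained after any other."""
    for sample in stream:
        yield max(low, min(high, sample))


def host_program(samples):
    """The 'host' side: wires kernels together and drains the final stream."""
    scaled = scale_kernel(samples, 3)
    clamped = clamp_kernel(scaled, 0, 10)
    return list(clamped)


print(host_program(range(5)))
```

Each kernel only ever sees one element at a time, which is the property that lets this style of code fit in a small local memory.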
Readers' Comments
Please let me know if you have comments or questions about this page by emailing [email protected]. I will reproduce and respond to the most interesting comments and questions here. Let me know if you wish to remain anonymous (I will not print email addresses, only names).

In particular I would dearly love to hear from anyone on the Cell architecture team! I have two important questions:
1. What is the instruction set encoding?
2. Can the PUs run user code or just system code?

Links and References
The patent: patent application (USPTO); Paul Zimmons' PowerPoint presentation; patent application (choose the ZIP).
Reyes/RenderMan: Computer Graphics: Principles and Practice (2e); Comparing Reyes and OpenGL on a Stream Architecture; The Reyes Image Rendering Architecture.
Stanford research: streaming supercomputers; Imagine stream processor homepage.
Protein folding: Folding@Home science; unravelling the mystery of protein folding; programming protein folding simulations.

IT'S NOT THE MOST PERFECT TRANSLATION, BUT MAKE DO WITH IT!

**Snake22** · 22-07-2003, 22:48:04

Posted by crash88
you didn't translate it all by yourself, did you? anyway, no offence, but I'm not going to read it all the same (it's too long)

8( oh well, let's hope someone is interested!

no, I got help from a website!

**crash88** · 22-07-2003, 22:52:32

Posted by Snake22
no, I got help from a website!

thank goodness, I was worried about your mental health
