Monday, March 24, 2025

Thinking Different, Thinking Slowly: LLMs on a PowerPC Mac

There is something incredibly satisfying about breathing new life into old hardware. Vintage computing is one of my favorite hobbies, and the challenge of coaxing modern software onto systems designed decades ago is a puzzle I cannot resist. Lately I have been diving into the world of large language models (LLMs), and a question began to gnaw at me: could I bring the cutting edge of AI to the nostalgic glow of my trusty 2005 PowerBook G4? Armed with a 1.5GHz processor, a full gigabyte of RAM, and a cramped 32-bit address space, I embarked on an experiment that actually yielded results. I managed to get LLM inference running on this classic piece of Apple history, proving that even yesteryear's hardware can have a taste of tomorrow's AI.

PowerBook G4 running TinyStories 110M Llama2 LLM inference
I started by reviewing the llama2.c project from Andrej Karpathy. This brilliant project implements Llama2 LLM inference in a single file of vanilla C. No accelerators here: performance is traded for simplicity, which makes it easy to understand how inference is carried out.

I forked the core implementation into a project that I have titled ullm. The core algorithm remains the same, but I spent time improving a few aspects of the code so that it would stand up to abuse a little better.


Code Improvements

I started with a few basic improvements, and added wrappers for system functions like file I/O and memory allocation so that the program is easier to instrument.
  • Introduce a status return, remove all calls to exit (sketched below)
  • Abstract file access to simplify status handling
  • Abstract malloc/free for some simple debug/analysis
  • Replace the 512-byte static LUT for single-character strings
  • Fix a few warnings when compiling with -Wall
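
The exact enum values and macro body in ullm are not shown in this post, so take this as a minimal sketch of the status-return pattern. ULLM_STATUS_OK, ULLM_STATUS_INVALID_ARGUMENT and ULLM_STATUS_OOM appear in later snippets; everything else here is an assumption.
// Status codes returned by every fallible ullm function.
typedef enum {
  ULLM_STATUS_OK = 0,
  ULLM_STATUS_INVALID_ARGUMENT,
  ULLM_STATUS_OOM,
  ULLM_STATUS_IO_ERROR  // assumed; the real enum likely has more entries
} UllmStatus;

// Evaluate an expression that returns UllmStatus and propagate any failure
// to the caller, instead of calling exit deep inside the library.
#define ULLM_RETURN_IF_ERROR(expr)    \
  do {                                \
    UllmStatus _status = (expr);      \
    if (_status != ULLM_STATUS_OK) {  \
      return _status;                 \
    }                                 \
  } while (0)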

Start at the Library

I made larger-scale changes as I organized the code into a library with a public API exposed by a header. This enables unit testing to ensure that further refactoring does not break inference functionality.
// The runtime config for the inference operation.
typedef struct {
  // The prompt to generate a response to.
  const char* prompt;

  // The path to the checkpoint file.
  const char* checkpoint_path;

  // The path to the tokenizer file.
  const char* tokenizer_path;

  // Model configuration.
  float temperature;
  float topp;
  unsigned int steps;

  // The source of entropy.
  uint64_t rng_seed;

  // The callback and context for generated output.
  void (*output_callback)(const char* token, void* cookie);
  void* cookie;
} UllmLlama2RunConfig;

// The runtime state for the inference engine.
typedef struct {
  UllmFileHandle checkpoint_file;
  UllmLlama2Transformer transformer;
  UllmLlama2Tokenizer tokenizer;
  UllmLlama2Sampler sampler;
} UllmLlama2State;
The library expects two inputs: a const config, which supplies details such as model paths, the RNG seed and the token output callback, and a state object, which keeps track of loaded weights, temporary buffers, tokenizer state and more.

The resulting API is exceedingly simple to test and easy to build command-line interface tools around.
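
As a rough illustration of what such a tool looks like (the header name, hard-coded paths and error handling here are my assumptions, not the actual ullm CLI), a minimal driver could be:
#include <stdio.h>

#include "ullm/llama2.h"  // assumed header name

// Print each generated token to stdout as the inference engine produces it.
static void PrintToken(const char* token, void* cookie) {
  (void)cookie;
  fputs(token, stdout);
  fflush(stdout);
}

int main(void) {
  UllmLlama2RunConfig run_config;
  UllmLlama2RunConfigInit(&run_config);
  run_config.checkpoint_path = "tinystories110M.bin";
  run_config.tokenizer_path = "tokenizer.bin";
  run_config.prompt = "The quick brown fox jumped. Where did he go?";
  run_config.output_callback = PrintToken;

  UllmLlama2State state;
  UllmStatus status = UllmLlama2Init(&run_config, &state);
  if (status != ULLM_STATUS_OK) {
    return 1;
  }

  status = UllmLlama2Generate(&run_config, &state);
  UllmLlama2Deinit(&state);
  return status == ULLM_STATUS_OK ? 0 : 1;
}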

Callback-Based Output & Testing

The migration to a public API lends itself well to replacing printf-based output with callbacks as tokens are produced by the inference engine. This was ultimately the final change necessary to enable integration testing of the end-to-end inference pipeline.
void OutputHandler(const char* token, void* cookie) {
  std::string* test_output = static_cast<std::string*>(cookie);
  test_output->append(token);
}

TEST(UllmLlama2, Stories15M) {
  const std::string expected_test_output = R"(The birds chirp. Where do they go?
The birds flew around the sky, looking for something to do.
The birds saw a big tree and flew over to it.
The birds saw a big, red apple on the ground. It looked delicious.
The birds flew down and picked up the apple.
The birds flew back up to the tree and started to eat the apple.
The apples were so delicious!
The birds ate until they were full.
The birds flew away, happy and full.
)";

  std::string test_output;
  UllmLlama2RunConfig run_config;
  UllmLlama2RunConfigInit(&run_config);
  run_config.checkpoint_path = "ullm/tinystories15M.bin";
  run_config.tokenizer_path = "ullm/tokenizer.bin";
  run_config.prompt = "The birds chirp. Where do they go?";
  run_config.output_callback = OutputHandler;
  run_config.cookie = &test_output;

  UllmLlama2State state;
  UllmStatus status = UllmLlama2Init(&run_config, &state);
  ASSERT_EQ(status, ULLM_STATUS_OK);
  status = UllmLlama2Generate(&run_config, &state);
  EXPECT_EQ(status, ULLM_STATUS_OK);
  UllmLlama2Deinit(&state);

  EXPECT_EQ(test_output, expected_test_output);
}

Internals

Beyond the public API, I also reorganized the internals. After these improvements the code is rather elegant: every call to exit is gone, replaced by status propagation whenever any aspect of initialization or inference fails.
UllmStatus UllmLlama2Init(const UllmLlama2RunConfig* config,
    UllmLlama2State* state) {
  memset(state, 0, sizeof(UllmLlama2State));
  ULLM_RETURN_IF_ERROR(UllmLlama2ValidateConfig(config));
  ULLM_RETURN_IF_ERROR(UllmLlama2BuildTransformer(config, state));
  ULLM_RETURN_IF_ERROR(UllmLlama2BuildTokenizer(config, state));
  ULLM_RETURN_IF_ERROR(UllmLlama2BuildSampler(config, state));
  return ULLM_STATUS_OK;
}

UllmStatus UllmLlama2Generate(const UllmLlama2RunConfig* config,
    UllmLlama2State* state) {
  // +3 for '\0', ?BOS, ?EOS
  size_t prompt_tokens_size = (strlen(config->prompt) + 3) * sizeof(int);
  int* prompt_tokens = UllmMemoryAlloc(prompt_tokens_size);
  if (prompt_tokens == NULL) {
    ULOGE("Failed to allocate prompt tokens");
    return ULLM_STATUS_OOM;
  }
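  // ... the remainder of the function (prompt encoding and the token
  // generation loop) is omitted here.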

Porting to PPC

A few adjustments were needed to accommodate this code on a PowerPC Mac with a gcc 4.x (yes, you read that correctly!) toolchain. The most glaring problem is that the PowerPC 7447B processor is big-endian, while the model checkpoint and tokenizer files are released with the assumption that they will be read on little-endian machines. This became blatantly obvious when the first attempted malloc was 2GB in size: a clear sign that something was wrong, and all fingers pointed to a little-endian value being read straight from a file.
  • Endianness: convert model checkpoints/tokenizer from little- to big- endian
  • Alignment, 16-byte: copy weights into memory rather than memory mapping (see the sketch after this list)
  • Performance: AltiVec? Whoa.
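
For the alignment item, here is a minimal sketch of the copy-instead-of-mmap idea; the helper name and error handling are mine, not the actual ullm code. The point is that each tensor gets its own naturally aligned heap buffer instead of aliasing an offset inside a memory-mapped file whose small integer header leaves the float data misaligned.
#include <stdio.h>
#include <stdlib.h>

// Allocate a fresh buffer for a weight tensor and read it from the
// checkpoint. malloc on Mac OS X hands back 16-byte-aligned blocks, which
// is what vec_ld needs; on other platforms the buffer may need to be
// over-allocated and rounded up to a 16-byte boundary.
static float* ReadWeights(FILE* file, size_t count) {
  float* weights = malloc(count * sizeof(float));
  if (weights != NULL &&
      fread(weights, sizeof(float), count, file) != count) {
    free(weights);
    weights = NULL;
  }
  return weights;
}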
After addressing a few warnings and making the above structural changes, I achieved output parity with my x86 Dell PowerEdge T440 workhorse.

Here is some of the crude code written to endian-swap the model checkpoint. The converted files are written out once and reused (the _be.bin files in the runs below).
    if (status != ULLM_STATUS_OK) {
      ULOGE("Failed to map checkpoint: %s", UllmStatusToString(status));
      goto cleanup;
    }

    if ((file_size % 4) != 0) {
      ULOGE("Checkpoint size must be a multiple of 4");
      status = ULLM_STATUS_INVALID_ARGUMENT;
      goto cleanup;
    }

    for (uint64_t i = 0; i < file_size; i += 4) {
      const uint32_t le_value = *(const uint32_t*)&file_ptr[i];
      const uint32_t be_value = __builtin_bswap32(le_value);
      fwrite(&be_value, sizeof(uint32_t), 1, stdout);
    }

Models

The llama2.c project recommends the TinyStories models, and for good reason: these small models have a hope of producing some form of output without any kind of specialized hardware acceleration.

I did most of my testing with the 15M variant of the model and then switched to the highest fidelity 110M model available. Anything beyond this would either be too large for the available 32-bit address space or too slow for the modest CPU and available memory bandwidth.

The good news is that these models are great at producing whimsical children's stories which are fun to see printed to the shell. They keep things light when debugging frustrating architecture-specific issues.

Performance

To make an apples-to-apples (apples-to-dells, really) comparison of performance, I opted for single-thread inference with compiler optimizations enabled (-O2).

The baseline measurement of inference on a single Intel Xeon Silver 4216 @ 3.2GHz core yields the following:
aarossig@lithium:~/projects/ullm$ bazel build tools:ullm -c opt
aarossig@lithium:~/projects/ullm$ time bazel-bin/tools/ullm -c tinystories110M.bin -t tokenizer.bin -p "The quick brown fox jumped. Where did he go?"
The quick brown fox jumped. Where did he go? He had been running around the forest all day. He was looking for something special.
Suddenly, he saw a big tree. He stopped and looked up. He saw a big, brown bird in the tree. The bird was singing a beautiful song.
The fox was so happy. He wanted to get closer to the bird. He jumped up and down, trying to reach the bird. But the bird was too high up.
The fox was sad. He wanted to be friends with the bird. He looked around and saw a big, brown rock. He jumped on the rock and looked up at the bird.
The bird saw the fox and flew down to him. The fox was so happy. He and the bird became friends. They played together in the forest every day.
I ullm.llama2: Complete: 6.91 token/s

real    0m26.511s
user    0m26.019s
sys     0m0.472s
The target to beat is 26.5 seconds, or 6.91 tokens per second (spoiler alert: not even close). The inference seed is not randomized, so the generated output is constant from run to run, and the measurement is repeatable over many runs.

So what happens if we take this identical code and run it on a PowerBook G4 with a 1.5GHz PowerPC 7447B processor?
aarossig@titanium:~/Projects/ullm$ make && time ./out/ullm.elf -c ../models/tinystories110M_be.bin -t ../models/tokenizer_be.bin -p "The quick brown fox jumped. Where did he go?"
The quick brown fox jumped. Where did he go? He had been running around the forest all day. He was looking for something special.
Suddenly, he saw a big tree. He stopped and looked up. He saw a big, brown bird in the tree. The bird was singing a beautiful song.
The fox was so happy. He wanted to get closer to the bird. He jumped up and down, trying to reach the bird. But the bird was too high up.
The fox was sad. He wanted to be friends with the bird. He looked around and saw a big, brown rock. He jumped on the rock and looked up at the bird.
The bird saw the fox and flew down to him. The fox was so happy. He and the bird became friends. They played together in the forest every day.
I ullm.llama2: Complete: 0.77 token/s

real    4m0.099s
user    3m51.387s
sys     0m5.533s
It works, albeit much more slowly: roughly a 9x slowdown (6.91 / 0.77 ≈ 9) compared to the same code running on a modern CPU with much higher memory bandwidth. Honestly, it is fairly impressive that a computer roughly 15 years the Xeon's senior can do this at all.

Let's take a look at the function where most of the time is spent: matmul. It is nested loops with a multiply and accumulate at their core.
void matmul(float* xout, const float* x, const float* w, int n, int d) {
  // W (d,n) @ x (n,) -> xout (d,)
  const float* wptr = w;
  const float* wptr_end = &wptr[n * d];
  const float* xptr_end = &x[n];
  while (wptr < wptr_end) {
    float sum = 0.0f;
    const float* xptr = x;
    while (xptr < xptr_end) {
      sum += *wptr++ * *xptr++;
    }

    *xout++ = sum;
  }
}
Can we do better? I remembered that these PowerPC processors have vector extensions called AltiVec. This early version of the PowerPC vector ISA is fairly limited, but it does contain one critical operation that could speed things up: fused multiply-add. With it, the processor performs four floating-point multiply-adds in parallel, relying on SIMD (single instruction, multiple data) semantics to accelerate the operation.

Here is the same function rewritten to use AltiVec SIMD extensions. Please note that this is the first (and probably last) time that I have written AltiVec code. After fixing 16-byte alignment issues, this code produces identical output to the vanilla C version.
void matmul(float* xout, const float* x, const float* w, int n, int d) {
  // W (d,n) @ x (n,) -> xout (d,)
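  // Assumes n is a multiple of 4 and that w and x are 16-byte aligned,
  // since vec_ld ignores the low four bits of the effective address.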
  const long addr_step = 4 * sizeof(float);
  const long w_addr_end = n * d * sizeof(float);
  const long x_addr_end = n * sizeof(float);

  long w_addr = 0;
  while (w_addr < w_addr_end) {
    __attribute__ ((aligned(16))) float psum[4] = {};
    vector float sum_vec = vec_ld(0, &psum[0]);
    for (long x_addr = 0; x_addr < x_addr_end; x_addr += addr_step) {
      vector float w_vec = vec_ld(w_addr, w);
      vector float x_vec = vec_ld(x_addr, x);
      sum_vec = vec_madd(w_vec, x_vec, sum_vec);
      w_addr += addr_step;
    }

    vec_st(sum_vec, 0, psum);
    *xout++ = psum[0] + psum[1] + psum[2] + psum[3];
  }
}
Note that there are really 4 parallel sums here, which I then sum manually at the end of the core multiply-sum loop.

This antique extension set is poorly documented on the web, but Gemini did an excellent job of helping provide code examples when none could be found. The official PowerPC documentation was also decent, but not an exact match for the older version of the ISA (read: some functionality missing).
aarossig@titanium:~/Projects/ullm$ make && time ./out/ullm.elf -c ../models/tinystories110M_be.bin -t ../models/tokenizer_be.bin -p "The quick brown fox jumped. Where did he go?"
The quick brown fox jumped. Where did he go? He had been running around the forest all day. He was looking for something special.
Suddenly, he saw a big tree. He stopped and looked up. He saw a big, brown bird in the tree. The bird was singing a beautiful song.
The fox was so happy. He wanted to get closer to the bird. He jumped up and down, trying to reach the bird. But the bird was too high up.
The fox was sad. He wanted to be friends with the bird. He looked around and saw a big, brown rock. He jumped on the rock and looked up at the bird.
The bird saw the fox and flew down to him. The fox was so happy. He and the bird became friends. They played together in the forest every day.
I ullm.llama2: Complete: 0.88 token/s

real    3m32.098s
user    3m23.535s
sys     0m5.368s
Hooray! AltiVec shaves roughly 30 seconds off of the inference operation, bringing the implementation from roughly a 9x slowdown to an 8x slowdown (6.91 / 0.88 ≈ 7.9) and adding another tenth of a token per second. That is great fun, and a bit clever if you ask me.

What next?

The llama2.c project discusses running multi-billion-parameter models with the same code. This sounds fun, but alas, the G4 PowerPC system is 32-bit and constrained to a maximum of 4GB of addressable memory. Even running this small 110M TinyStories model was a somewhat tight fit. Quantization could help, but a multi-billion-parameter checkpoint is still ~7GB quantized (~26GB unquantized), which exceeds the available address space.
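
The back-of-the-envelope math is unforgiving: a 7B-parameter model at 4 bytes per weight is on the order of 26-28GB, and still roughly 7GB even at one byte per weight, while a 32-bit process tops out at 4GB of address space (and less than that in practice). The 110M model, by contrast, is only about 440MB of fp32 weights.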

I think this project stops here. It has been a great way to get my toes wet with LLMs and how they operate. The matrix multiplies really are the bottleneck standing between this machine and multiple tokens per second.

Now I can listen to Basshunter - DotA while running inference, just like it was back in 2006.
I think I will use this hardware as it was originally intended: organizing and listening to the tunes on my iPod. Thanks for reading!
