Install and run llama.cpp with ROCm 5.7 on Ubuntu 22.04

Ok, so this is the rundown on how to install and run llama.cpp on Ubuntu 22.04. (This works for my officially unsupported RX 6750 XT GPU, running on my AMD Ryzen 5 system.)

First off you need to run the usual:

sudo apt-get update
sudo apt-get upgrade

Then you need to install the ROCm driver and libraries that llama.cpp will use.

Start by adding the official Radeon sources to apt-get, as described in the ROCm Quick Start (Linux) guide on rocm.docs.amd.com:

sudo mkdir --parents --mode=0755 /etc/apt/keyrings

wget https://repo.radeon.com/rocm/rocm.gpg.key -O - | \
    gpg --dearmor | sudo tee /etc/apt/keyrings/rocm.gpg > /dev/null
   
# Kernel driver repository for jammy
sudo tee /etc/apt/sources.list.d/amdgpu.list <<'EOF'
deb [arch=amd64 signed-by=/etc/apt/keyrings/rocm.gpg] https://repo.radeon.com/amdgpu/5.7.1/ubuntu jammy main
EOF
# ROCm repository for jammy
sudo tee /etc/apt/sources.list.d/rocm.list <<'EOF'
deb [arch=amd64 signed-by=/etc/apt/keyrings/rocm.gpg] https://repo.radeon.com/rocm/apt/debian jammy main
EOF
# Prefer packages from the rocm repository over system packages
echo -e 'Package: *\nPin: release o=repo.radeon.com\nPin-Priority: 600' | sudo tee /etc/apt/preferences.d/rocm-pin-600

Ok, all that mess just sets up the Radeon repos for Jammy Jellyfish on your system.

sudo apt-get update

And update so your system knows where it all is.
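Before installing anything, it can help to confirm that both list files ended up where apt expects them. This is only a small sketch: the helper name list_radeon_sources is my own, and the directory argument exists purely so the helper can be pointed at a scratch directory.

```shell
# List every apt source line that points at repo.radeon.com.
# The directory defaults to the standard apt location; the optional
# argument is only there so the helper can be tried elsewhere.
list_radeon_sources() {
    dir="${1:-/etc/apt/sources.list.d}"
    # -h: no file name prefixes, -s: stay quiet about missing files
    grep -hs '^deb .*repo\.radeon\.com' "$dir"/*.list
}

# After the steps above, expect two lines: one for amdgpu, one for rocm.
list_radeon_sources || echo "no repo.radeon.com entries found"
```

If the fallback message shows up, the tee commands above did not write the files, and the package installs that follow will not find the ROCm packages.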

Install AMD's purpose-made driver for all your ROCm business:

sudo apt-get install amdgpu-dkms

Put the libraries on there too:

sudo apt-get install rocm-hip-libraries

Have a little rest and reboot the system

sudo reboot

Now install the remaining development packages needed to compile llama.cpp:

sudo apt-get install rocm-dev
sudo apt-get install rocm-hip-runtime-dev rocm-hip-sdk
sudo apt-get install rocm-libs
sudo apt-get install rocminfo

Run rocminfo and you should see output similar to this:

ROCk module is loaded
=====================   
HSA System Attributes   
=====================   
Runtime Version:         1.1
System Timestamp Freq.:  1000.000000MHz
Sig. Max Wait Duration:  18446744073709551615 (0xFFFFFFFFFFFFFFFF) (timestamp count)
Machine Model:           LARGE                             
System Endianness:       LITTLE                             
Mwaitx:                  DISABLED
DMAbuf Support:          YES

==========               
HSA Agents               
==========               
*******                 
Agent 1                 
*******                 
  Name:                    AMD Ryzen 5 2600X Six-Core Processor
  Uuid:                    CPU-XX                             
  Marketing Name:          AMD Ryzen 5 2600X Six-Core Processor
  Vendor Name:             CPU                               
  Feature:                 None specified                     
  Profile:                 FULL_PROFILE                       
  Float Round Mode:        NEAR                               
  Max Queue Number:        0(0x0)                             
  Queue Min Size:          0(0x0)                             
  Queue Max Size:          0(0x0)                             
  Queue Type:              MULTI                             
  Node:                    0                                 
  Device Type:             CPU                               
  Cache Info:             
    L1:                      32768(0x8000) KB                   
  Chip ID:                 0(0x0)                             
  ASIC Revision:           0(0x0)                             
  Cacheline Size:          64(0x40)                           
  Max Clock Freq. (MHz):   3600                               
  BDFID:                   0                                 
  Internal Node ID:        0                                 
  Compute Unit:            12                                 
  SIMDs per CU:            0                                 
  Shader Engines:          0                                 
  Shader Arrs. per Eng.:   0                                 
  WatchPts on Addr. Ranges:1                                 
  Features:                None
  Pool Info:               
    Pool 1                   
      Segment:                 GLOBAL; FLAGS: FINE GRAINED       
      Size:                    32792028(0x1f45ddc) KB             
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                               
      Alloc Alignment:         4KB                               
      Accessible by all:       TRUE                               
    Pool 2                   
      Segment:                 GLOBAL; FLAGS: KERNARG, FINE GRAINED
      Size:                    32792028(0x1f45ddc) KB             
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                               
      Alloc Alignment:         4KB                               
      Accessible by all:       TRUE                               
    Pool 3                   
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED     
      Size:                    32792028(0x1f45ddc) KB             
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                               
      Alloc Alignment:         4KB                               
      Accessible by all:       TRUE                               
  ISA Info:               
*******                 
Agent 2                 
*******                 
  Name:                    gfx1031                           
  Uuid:                    GPU-XX                             
  Marketing Name:          AMD Radeon RX 6750 XT             
  Vendor Name:             AMD                               
  Feature:                 KERNEL_DISPATCH                   
  Profile:                 BASE_PROFILE                       
  Float Round Mode:        NEAR                               
  Max Queue Number:        128(0x80)                         
  Queue Min Size:          64(0x40)                           
  Queue Max Size:          131072(0x20000)                   
  Queue Type:              MULTI                             
  Node:                    1                                 
  Device Type:             GPU                               
  Cache Info:             
    L1:                      16(0x10) KB                       
    L2:                      3072(0xc00) KB                     
    L3:                      98304(0x18000) KB                 
  Chip ID:                 29663(0x73df)                     
  ASIC Revision:           0(0x0)                             
  Cacheline Size:          64(0x40)                           
  Max Clock Freq. (MHz):   2880                               
  BDFID:                   1536                               
  Internal Node ID:        1                                 
  Compute Unit:            40                                 
  SIMDs per CU:            2                                 
  Shader Engines:          2                                 
  Shader Arrs. per Eng.:   2                                 
  WatchPts on Addr. Ranges:4                                 
  Features:                KERNEL_DISPATCH
  Fast F16 Operation:      TRUE                               
  Wavefront Size:          32(0x20)                           
  Workgroup Max Size:      1024(0x400)                       
  Workgroup Max Size per Dimension:
    x                        1024(0x400)                       
    y                        1024(0x400)                       
    z                        1024(0x400)                       
  Max Waves Per CU:        32(0x20)                           
  Max Work-item Per CU:    1024(0x400)                       
  Grid Max Size:           4294967295(0xffffffff)             
  Grid Max Size per Dimension:
    x                        4294967295(0xffffffff)             
    y                        4294967295(0xffffffff)             
    z                        4294967295(0xffffffff)             
  Max fbarriers/Workgrp:   32                                 
  Packet Processor uCode:: 115                               
  SDMA engine uCode::      80                                 
  IOMMU Support::          None                               
  Pool Info:               
    Pool 1                   
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED     
      Size:                    12566528(0xbfc000) KB             
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                               
      Alloc Alignment:         4KB                               
      Accessible by all:       FALSE                             
    Pool 2                   
      Segment:                 GLOBAL; FLAGS:                     
      Size:                    12566528(0xbfc000) KB             
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                               
      Alloc Alignment:         4KB                               
      Accessible by all:       FALSE                             
    Pool 3                   
      Segment:                 GROUP                             
      Size:                    64(0x40) KB                       
      Allocatable:             FALSE                             
      Alloc Granule:           0KB                               
      Alloc Alignment:         0KB                               
      Accessible by all:       FALSE                             
  ISA Info:               
    ISA 1                   
      Name:                    amdgcn-amd-amdhsa--gfx1031         
      Machine Models:          HSA_MACHINE_MODEL_LARGE           
      Profiles:                HSA_PROFILE_BASE                   
      Default Rounding Mode:   NEAR                               
      Default Rounding Mode:   NEAR                               
      Fast f16:                TRUE                               
      Workgroup Max Size:      1024(0x400)                       
      Workgroup Max Size per Dimension:
        x                        1024(0x400)                       
        y                        1024(0x400)                       
        z                        1024(0x400)                       
      Grid Max Size:           4294967295(0xffffffff)             
      Grid Max Size per Dimension:
        x                        4294967295(0xffffffff)             
        y                        4294967295(0xffffffff)             
        z                        4294967295(0xffffffff)             
      FBarrier Max Size:       32                                 
*** Done ***

That's real juicy!

Make a note of the node number for your GPU device.

You can see mine is ‘1’.
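If you would rather not scroll through all of that, the node number and gfx name can be pulled out with a few lines of awk. This is a sketch that assumes the field layout shown in the output above (Name, Node and Device Type appear in that order inside each agent); gpu_agents is just a name I picked.

```shell
# Print "<node> <gfx name>" for every GPU agent in rocminfo output (stdin).
gpu_agents() {
    awk '
        $1 == "Name:" { name = $2 }   # agent name, e.g. gfx1031
        $1 == "Node:" { node = $2 }   # node number of the same agent
        $1 == "Device" && $2 == "Type:" && $3 == "GPU" { print node, name }
    '
}

# Only call rocminfo when it is actually installed.
if command -v rocminfo >/dev/null 2>&1; then
    rocminfo | gpu_agents
fi
```

For the output shown above this gives a single line with the node number and gfx1031. The gfx name is the part that matters later for the override.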

Now you should have all the necessary stuff for compiling (assuming you have already installed a compiler, for example with sudo apt-get install build-essential).

When using the GPU to do ROCm stuff, your user needs to be a member of the render group. Log out and back in afterwards so the new group membership takes effect:

sudo usermod -a -G render yourusername
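A group added with usermod only shows up in new login sessions, so it is worth checking before blaming ROCm for permission errors. A minimal sketch, with has_group as a helper name of my own:

```shell
# Succeed if the space-separated group list in $1 contains the group $2.
has_group() {
    case " $1 " in
        *" $2 "*) return 0 ;;
        *)        return 1 ;;
    esac
}

# id -nG prints the groups of the current login session.
if has_group "$(id -nG)" render; then
    echo "render group is active for this session"
else
    echo "render group not active yet: log out and back in, then check again"
fi
```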

Now clone llama.cpp with git as follows:

git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp

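Once you are in the llama.cpp directory, compile with hipBLAS support. The variable is called LLAMA_HIPBLAS, and it has to sit directly in front of make (with no && in between) so that make actually sees it. HIP_VISIBLE_DEVICES only matters at run time, so it is not needed for the build:

make clean && LLAMA_HIPBLAS=1 make -j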
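One way to check that the build really picked up ROCm, rather than quietly producing a CPU-only binary, is to look at which shared libraries main links against. This is a sketch: links_hip is my own helper name, and the library names it greps for (libamdhip64, libhipblas, librocblas) are the ROCm runtime libraries I would expect a hipBLAS build to pull in.

```shell
# Read ldd output on stdin and succeed if a HIP/ROCm library is listed.
links_hip() {
    grep -Eq 'libamdhip64|libhipblas|librocblas'
}

# ./main is the binary produced by the build; skip quietly if it is missing.
if [ -x ./main ]; then
    if ldd ./main | links_hip; then
        echo "main is linked against ROCm libraries"
    else
        echo "main looks like a CPU-only build"
    fi
fi
```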

Now that the compile is finished, you need to do a little bit of tinkering to get this to work with your unsupported card.

ROCm will kick up an error that says it cannot find your device, gfx1031.

So you need to override the GFX version as follows:

export HSA_OVERRIDE_GFX_VERSION=10.3.0
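If you script this, the override can be derived from the gfx name that rocminfo reports. The sketch below only encodes what this guide uses (gfx1031 mapped to 10.3.0); extending the same value to every other gfx103x card is my assumption, and anything outside that family is deliberately left alone. gfx_override is a made-up helper name.

```shell
# Map a gfx target name to the HSA_OVERRIDE_GFX_VERSION value from this guide.
# Assumption: all gfx103x cards take the same 10.3.0 override as gfx1031.
# Any other name prints nothing, so no override gets applied.
gfx_override() {
    case "$1" in
        gfx103[0-9]) echo "10.3.0" ;;
        *)           : ;;
    esac
}

override=$(gfx_override gfx1031)
if [ -n "$override" ]; then
    export HSA_OVERRIDE_GFX_VERSION="$override"
fi
echo "HSA_OVERRIDE_GFX_VERSION=${HSA_OVERRIDE_GFX_VERSION:-unset}"
```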

Ok, next you need to select a model.

Make sure you download a usable model. I used this one from Hugging Face:

https://huggingface.co/TheBloke/zephyr-7B-beta-GGUF

Store the model file in the models directory.
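A broken download (or a saved HTML page instead of the model) is an easy way to lose half an hour, so a quick sanity check on the file does no harm. GGUF files start with the four ASCII bytes GGUF; the sketch below checks just that. is_gguf is my own helper name, and the file name is the Q2_K file I used from that repo.

```shell
# A GGUF file starts with the four ASCII bytes "GGUF".
is_gguf() {
    [ "$(head -c 4 "$1" 2>/dev/null)" = "GGUF" ]
}

model=models/zephyr-7b-beta.Q2_K.gguf
if [ -f "$model" ]; then
    if is_gguf "$model"; then
        echo "$model looks like a GGUF file"
    else
        echo "$model is not a GGUF file (partial download or an HTML page?)"
    fi
fi
```

This only looks at the header, so it will not catch a file that was cut off halfway through.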

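Now all you need to do is pass a prompt to the llama.cpp executable you built. Run it as your normal user rather than with sudo, because sudo resets the environment by default and the exported overrides would never reach main. Note also that HIP_VISIBLE_DEVICES counts GPUs only, starting from 0, so a single GPU that rocminfo lists as node 1 is device 0 here:

export HSA_OVERRIDE_GFX_VERSION=10.3.0 && export HIP_VISIBLE_DEVICES=0 && ./main -ngl 50 -m models/zephyr-7b-beta.Q2_K.gguf -p "How far does your knowledge of hyperplastic engineering go?"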

llama.cpp is compiled to use the GPU you specified earlier. Have fun, guys.
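A quick way to tell whether a run is really using a GPU build is the system_info line that llama.cpp prints at startup. As far as I can tell, a build with a BLAS or GPU backend compiled in reports BLAS = 1 there, and a plain CPU build reports BLAS = 0; treat that as my assumption about this version of llama.cpp. The helper name blas_enabled and the scratch file log.txt are made up for this sketch.

```shell
# Succeed if the llama.cpp startup log on stdin reports a BLAS backend.
blas_enabled() {
    grep -q 'BLAS = 1'
}

# Capture a run first, for example:
#   ./main -ngl 50 -m models/zephyr-7b-beta.Q2_K.gguf -p "hi" 2>&1 | tee log.txt
if [ -f log.txt ]; then
    if blas_enabled < log.txt; then
        echo "BLAS backend compiled in"
    else
        echo "BLAS = 0: no BLAS/GPU backend reported, check the build step"
    fi
fi
```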

llama output:

system_info: n_threads = 6 / 12 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 |
sampling:
   repeat_last_n = 64, repeat_penalty = 1.100, frequency_penalty = 0.000, presence_penalty = 0.000
   top_k = 40, tfs_z = 1.000, top_p = 0.950, typical_p = 1.000, temp = 0.800
   mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
generate: n_ctx = 512, n_batch = 512, n_predict = -1, n_keep = 0


How far does your knowledge of hyperplastic engineering go? Do you know how the properties of materials change with plastic deformation? How many of you have encountered the problem of material anisotropy when it comes to working with metal or polymeric components in their production technology?

Using the Zephyr model from above, llama replied to my question with two more questions!! Now go to Part 2 to use Facebook's official Llama 2 model.

Here is how I finally got it working for my unsupported card:

Ok, so basically I ended up installing two versions of ROCm. First I installed 5.2, following this guide:

https://askubuntu.com/questions/1429376/how-can-i-install-amd-rocm-5-on-ubuntu-22-04

This worked fine for Stable Diffusion (SD), but not for llama.cpp, which gave this error:

“hipErrorNoBinaryForGpu: Unable to find code object for all current devices!”

After this I installed ROCm 5.7 from the debian repo, using these commands:

sudo tee /etc/apt/sources.list.d/rocm.list <<'EOF'
deb [arch=amd64 signed-by=/etc/apt/keyrings/rocm.gpg] https://repo.radeon.com/rocm/apt/debian jammy main
EOF

sudo amdgpu-install --rocmrelease=5.7.0 --usecase=rocm,hip --no-dkms

I think --no-dkms makes a big difference. To be fair, if you just installed 5.7 from the start, it would work on all fronts.

Oh yes, and one more thing: I made a copy of the Tensile library file for gfx1030 and named the copy for gfx1031, my GPU. This is the file I copied:

/opt/rocm/lib/rocblas/library/TensileLibrary_Type_HH_Contraction_l_Ailk_Bjlk_Cijk_Dijk_gfx1030.co

The copy has gfx1031 in its name instead of gfx1030.
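The copy step can be scripted along these lines. I only copied the one file named above; copying every gfx1030 file in the directory, as this sketch does, is an assumption on my part and not something I have tested. The helper name copy_gfx1030_as_gfx1031 is made up, and existing gfx1031 files are never overwritten.

```shell
# Copy every file in a directory whose name contains gfx1030 to the same
# name with gfx1031 instead, leaving any existing gfx1031 file alone.
copy_gfx1030_as_gfx1031() {
    dir="$1"
    for src in "$dir"/*gfx1030*; do
        [ -e "$src" ] || continue          # the glob matched nothing
        base=$(basename "$src")
        dst="$dir/$(printf '%s' "$base" | sed 's/gfx1030/gfx1031/')"
        [ -e "$dst" ] || cp -- "$src" "$dst"
    done
}

# The real directory is root-owned, so run this from a root shell:
# copy_gfx1030_as_gfx1031 /opt/rocm/lib/rocblas/library
```

Trying it on a copy of the directory first is a cheap way to see exactly which files it would create.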