
How to Fine-Tune LLaMA 3.2-1B-Instruct for Korean Instruction Tasks with LoRA and Hugging Face

LLaMA 3.2-1B-Instruct is a lightweight instruction-tuned language model released by Meta, designed to handle a wide range of instruction-following tasks with relatively low computational resources. Although the model has multilingual capabilities, its performance on languages that are not officially supported, such as Korean, is limited. This tutorial demonstrates how to fine-tune this open-source model on Korean instruction datasets using Hugging Face Transformers and PEFT (specifically LoRA), so that it responds better to Korean instructions.

1. Prerequisites

Before running the example code below, ensure you have the following libraries installed:

pip install torch transformers datasets peft accelerate mlflow huggingface_hub

To download the gated LLaMA model, you'll need a Hugging Face access token. You may also want to guard against CUDA out-of-memory (OOM) errors before training starts. The following code handles both:


import torch
from huggingface_hub import login

login("your_huggingface_token")

torch.cuda.empty_cache()  # Free cached GPU memory to reduce the risk of CUDA OOM errors

2. Loading and Preprocessing Datasets

High-quality datasets are crucial for fine-tuning large language models (LLMs). This guide uses two well-structured open-source Korean instruction datasets, KoAlpaca-RealQA and KULLM v2, to ensure adequate training data:


from datasets import load_dataset, concatenate_datasets

koalpaca = load_dataset("beomi/KoAlpaca-RealQA", split="train")
kullm = load_dataset("nlpai-lab/kullm-v2", split="train")

# Standardize formats
def format_koalpaca(example):
    return {
        "instruction": example["question"],
        "input": "",
        "output": example["answer"]
    }

def format_kullm(example):
    return {
        "instruction": example["instruction"],
        "input": example.get("input", ""),
        "output": example["output"]
    }

# Keep only the standardized columns so both datasets share the same schema
koalpaca = koalpaca.map(format_koalpaca, remove_columns=koalpaca.column_names)
kullm = kullm.map(format_kullm, remove_columns=kullm.column_names)

dataset = concatenate_datasets([koalpaca, kullm])
dataset = dataset.shuffle(seed=42)
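
As a quick sanity check, you can print the size of the combined dataset and inspect one record before tokenization (an optional step; the field names follow the standardized format above):

print(f"Total examples: {len(dataset)}")
print(dataset[0])  # Should contain only 'instruction', 'input', and 'output'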

3. Defining Prompt Format

The dataset prompts are structured in a simple conversational style. We use the tokenizer from LLaMA 3.2-1B-Instruct and extend it with three custom special tokens: <|user|>, <|assistant|>, and <|system|>. Because these tokens are new to the vocabulary, the model's embedding layer will need to be resized later to match. The padding token is also set to the tokenizer's EOS token.


from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B-Instruct")
tokenizer.add_special_tokens({'additional_special_tokens': ['<|user|>', '<|assistant|>', '<|system|>']})
tokenizer.pad_token = tokenizer.eos_token

def formatting_prompts_func(example):
    if example["input"]:
        prompt = f"<|system|>\nYou are a helpful assistant.\n<|user|>\n{example['instruction']}\nInput: {example['input']}\n<|assistant|>\n{example['output']}"
    else:
        prompt = f"<|system|>\nYou are a helpful assistant.\n<|user|>\n{example['instruction']}\n<|assistant|>\n{example['output']}"
    
    tokenized = tokenizer(prompt, padding="max_length", truncation=True, max_length=1024)
    tokenized["labels"] = tokenized["input_ids"].copy()
    return tokenized

dataset = dataset.map(formatting_prompts_func, remove_columns=dataset.column_names)
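
To confirm that the prompt template is applied as intended, you can decode one tokenized example back into text (an optional check; the output will include padding tokens at the end):

sample = dataset[0]
print(tokenizer.decode(sample["input_ids"], skip_special_tokens=False)[:500])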

4. Loading the Model and Applying LoRA

There are two main ways to fine-tune an LLM: (1) full fine-tuning, which updates all parameters, and (2) parameter-efficient fine-tuning with LoRA, which trains only small adapter layers. Full fine-tuning offers more flexibility but requires large datasets and heavy compute. LoRA-based fine-tuning is far more resource-efficient and helps mitigate catastrophic forgetting of pretrained knowledge. This tutorial focuses on the latter.

The following code loads the LLaMA 3.2-1B-Instruct model with Hugging Face's AutoModelForCausalLM and attaches LoRA adapters to the attention and MLP projection layers. It also resizes the embedding layer to match the extended tokenizer.


import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from accelerate import PartialState

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-1B-Instruct",
    torch_dtype=torch.bfloat16,
    device_map={"": PartialState().process_index}, 
)
model.resize_token_embeddings(len(tokenizer))  # Account for the newly added special tokens
model = prepare_model_for_kbit_training(model)  # Freezes base weights and enables gradient checkpointing

lora_config = LoraConfig(
    r=16,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "gate_proj", "up_proj", "down_proj"],
    lora_dropout=0.01,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
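
At this point it is worth confirming that only the adapter weights are trainable. PEFT models expose print_trainable_parameters() for exactly this:

model.print_trainable_parameters()  # Reports trainable vs. total parameter counts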

5. Training Setup and Execution

The training process uses Hugging Face’s Trainer API with hyperparameters configured via TrainingArguments. This example runs for 10 epochs with a learning rate of 2e-4 on a constant schedule (no learning-rate decay), a small weight decay, and DeepSpeed ZeRO stage 2 for multi-GPU training.


from transformers import Trainer, TrainingArguments, DataCollatorForLanguageModeling

training_args = TrainingArguments(
    output_dir="./llama3.2-1B-korean",  # Directory for checkpoints and logs
    num_train_epochs=10,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=2,
    learning_rate=2e-4,
    weight_decay=0.001,
    logging_steps=25,
    save_steps=1000,
    save_strategy="epoch",
    save_total_limit=2,
    run_name="llama3.2-1B-korean-a100-4gpu",
    deepspeed="deepspeed_config.json",
    dataloader_num_workers=4,
    ddp_find_unused_parameters=False,
    fp16=True,
    bf16=False,
    max_grad_norm=0.3,
    max_steps=-1,
    warmup_ratio=0.03,
    group_by_length=False,
    lr_scheduler_type="constant",
)

The deepspeed parameter points to deepspeed_config.json, a configuration file for the DeepSpeed training engine. Create this file before starting training, with the following content:


{
  "fp16": {
    "enabled": true
  },
  "zero_optimization": {
    "stage": 2,
    "offload_optimizer": {
      "device": "none"
    },
    "allgather_partitions": true,
    "allgather_bucket_size": 5e8,
    "overlap_comm": true,
    "reduce_scatter": true,
    "reduce_bucket_size": 2e8,
    "contiguous_gradients": true
  },
  "train_micro_batch_size_per_gpu": 8,
  "gradient_accumulation_steps": 2,
  "gradient_clipping": "auto",
  "zero_allow_untested_optimizer": true
}
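
Note that DeepSpeed is started through a distributed launcher rather than a plain python call. Assuming the training code above is saved as train.py (a hypothetical filename), a 4-GPU launch could look like:

deepspeed --num_gpus=4 train.py

Alternatively, torchrun --nproc_per_node=4 train.py also works, since the deepspeed argument is already set in TrainingArguments.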

6. Training the Model

Once all components are in place, use the Hugging Face Trainer class to start training. A DataCollatorForLanguageModeling with mlm=False handles batching for causal language modeling.


from transformers import Trainer, DataCollatorForLanguageModeling

trainer = Trainer(
    model=model,
    train_dataset=dataset,
    args=training_args,
    tokenizer=tokenizer,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)

trainer.train()

Training logs will be saved to the directory specified by output_dir. LoRA adapter weights can be saved separately to reduce storage requirements.
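
If a run is interrupted, training can be resumed from the most recent checkpoint under output_dir (a brief sketch using the standard Trainer API):

trainer.train(resume_from_checkpoint=True)  # Picks up from the latest saved checkpoint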

7. Saving and Using the Fine-Tuned Model

After training is complete, you can save just the LoRA adapter weights or merge them into the base model before saving. Merging allows for simpler inference and deployment.


# Save LoRA adapters only
model.save_pretrained("adapter-output")

# Merge adapters and save full model
merged_model = model.merge_and_unload()
merged_model.save_pretrained("llama3.2-korean-finetuned")
tokenizer.save_pretrained("llama3.2-korean-finetuned")
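
If you saved the merged model, it can later be reloaded like any standard Hugging Face checkpoint, with no PEFT dependency (a minimal sketch using the directory names from the save calls above):

from transformers import AutoTokenizer, AutoModelForCausalLM

merged_tokenizer = AutoTokenizer.from_pretrained("llama3.2-korean-finetuned")
merged_model = AutoModelForCausalLM.from_pretrained("llama3.2-korean-finetuned")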

8. Inference: Korean Instruction Generation

To test the model’s performance on Korean inputs, you can run the following code. It reloads the base model, applies the saved LoRA adapters, and generates a response:


from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B-Instruct")
tokenizer.add_special_tokens({'additional_special_tokens': ['<|user|>', '<|assistant|>', '<|system|>']})
tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-1B-Instruct")
model.resize_token_embeddings(len(tokenizer))

from peft import PeftModel
model = PeftModel.from_pretrained(model, "adapter-output")  # Path used when saving the LoRA adapters above
model.eval()

def format_prompt(instruction, input_text=""):
    if input_text:
        return f"<|system|>\nYou are a helpful assistant.\n<|user|>\n{instruction}\nInput: {input_text}\n<|assistant|>\n"
    return f"<|system|>\nYou are a helpful assistant.\n<|user|>\n{instruction}\n<|assistant|>\n"

def generate_response(instruction, input_text=""):
    input_ids = tokenizer(format_prompt(instruction, input_text), return_tensors="pt").input_ids.to(model.device)
    outputs = model.generate(input_ids=input_ids, max_new_tokens=512, temperature=0.7, do_sample=True, top_p=0.9)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

print(generate_response("서울의 날씨는 어때?"))  # "How is the weather in Seoul?"

With this setup, the fine-tuned model should produce noticeably more fluent Korean responses than the base model.

9. Conclusion

This tutorial showed how to fine-tune the LLaMA 3.2-1B-Instruct model on Korean instruction datasets using Hugging Face and PEFT with LoRA. By combining efficient parameter tuning with quality datasets, you can create a compact model that performs well in Korean-language tasks without needing extensive resources.
