LLaMA 3.2-1B-Instruct is a lightweight instruction-tuned language model released by Meta. It is designed to handle a wide range of instruction-based tasks with relatively low computational resources. Although the model was trained with multilingual capabilities, its performance on languages it does not officially support, such as Korean, is limited. This tutorial demonstrates how to fine-tune this open-source model on Korean datasets using Hugging Face Transformers and PEFT (specifically LoRA), enabling it to better respond to Korean instructions.
1. Prerequisites
Before running the example code below, ensure you have the following libraries installed:
pip install torch transformers datasets peft accelerate mlflow huggingface_hub
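If you want a quick sanity check of the environment, the optional snippet below prints the installed versions and confirms that a GPU is visible:
import torch
import transformers
import peft
print("torch:", torch.__version__)
print("transformers:", transformers.__version__)
print("peft:", peft.__version__)
print("CUDA available:", torch.cuda.is_available())  # True if a GPU is visible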
To download the LLaMA model weights (a gated repository), you'll need a Hugging Face token. You may also need to guard against CUDA Out-Of-Memory (OOM) errors. The following code takes care of both:
import torch
from huggingface_hub import login

login("your_huggingface_token")  # replace with your own token
torch.cuda.empty_cache()  # free cached GPU memory to help avoid CUDA OOM errors
2. Loading and Preprocessing Datasets
High-quality datasets are crucial for fine-tuning large language models (LLMs). This guide uses two well-structured open-source Korean instruction datasets to ensure adequate training data:
- KoAlpaca-RealQA – Real-world question and answer format
- KULLM-v2 – Various instruction and response formats
from datasets import load_dataset, concatenate_datasets
koalpaca = load_dataset("beomi/KoAlpaca-RealQA", split="train")
kullm = load_dataset("nlpai-lab/kullm-v2", split="train")
# Standardize formats
def format_koalpaca(example):
return {
"instruction": example["question"],
"input": "",
"output": example["answer"]
}
def format_kullm(example):
return {
"instruction": example["instruction"],
"input": example.get("input", ""),
"output": example["output"]
}
koalpaca = koalpaca.map(format_koalpaca, remove_columns=koalpaca.column_names)
kullm = kullm.map(format_kullm, remove_columns=kullm.column_names)
dataset = concatenate_datasets([koalpaca, kullm])
dataset = dataset.shuffle(seed=42)
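As a quick check, you can confirm that both sources now share the same instruction/input/output schema before moving on:
print(dataset)     # row count and column names after concatenation
print(dataset[0])  # one example with instruction, input, and output fields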
3. Defining Prompt Format
To give the model a consistent chat-style structure, the dataset prompts are formatted in a conversational style. We use the tokenizer from LLaMA 3.2-1B-Instruct and extend it with the special tokens <|user|>, <|assistant|>, and <|system|>. The padding token is also set to the tokenizer's EOS token.
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B-Instruct")
tokenizer.add_special_tokens({'additional_special_tokens': ['<|user|>', '<|assistant|>', '<|system|>']})
tokenizer.pad_token = tokenizer.eos_token
def formatting_prompts_func(example):
if example["input"]:
prompt = f"<|system|>\nYou are a helpful assistant.\n<|user|>\n{example['instruction']}\nInput: {example['input']}\n<|assistant|>\n{example['output']}"
else:
prompt = f"<|system|>\nYou are a helpful assistant.\n<|user|>\n{example['instruction']}\n<|assistant|>\n{example['output']}"
tokenized = tokenizer(prompt, padding="max_length", truncation=True, max_length=1024)
tokenized["labels"] = tokenized["input_ids"].copy()
return tokenized
dataset = dataset.map(formatting_prompts_func, remove_columns=dataset.column_names)
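To verify the template, it can help to decode one tokenized example back to text; the trailing EOS tokens come from padding to max_length:
sample_text = tokenizer.decode(dataset[0]["input_ids"], skip_special_tokens=False)
print(sample_text[:500])  # system/user/assistant markers followed by the answer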
4. Loading the Model and Applying LoRA
There are two main ways to fine-tune an LLM: (1) full fine-tuning, which updates all parameters, and (2) parameter-efficient fine-tuning with LoRA, which trains only small adapter layers. Full fine-tuning offers more flexibility but requires large datasets and heavy compute. LoRA-based fine-tuning is far more resource-efficient and helps mitigate catastrophic forgetting of pretrained knowledge. This tutorial focuses on the latter.
The following code loads the LLaMA-3.2-1B-Instruct model with Hugging Face's AutoModelForCausalLM and sets up LoRA adapters on the attention and MLP projection modules. It also resizes the embedding layer to match the extended tokenizer.
import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from accelerate import PartialState
model = AutoModelForCausalLM.from_pretrained(
"meta-llama/Llama-3.2-1B-Instruct",
torch_dtype=torch.bfloat16,
device_map={"": PartialState().process_index},
)
model.resize_token_embeddings(len(tokenizer))  # account for the newly added special tokens
model = prepare_model_for_kbit_training(model)  # freezes base weights and enables gradient checkpointing
lora_config = LoraConfig(
r=16,
lora_alpha=16,
target_modules=["q_proj", "k_proj", "v_proj", "gate_proj", "up_proj", "down_proj"],
lora_dropout=0.01,
task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
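Calling print_trainable_parameters() on the wrapped model confirms that only the LoRA adapters, a small fraction of the total parameters, will be updated during training:
model.print_trainable_parameters()  # reports trainable vs. total parameter counts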
5. Training Setup and Execution
The training process uses Hugging Face's Trainer API with hyperparameters configured via TrainingArguments. This example runs for 10 epochs with a learning rate of 2e-4 and a constant learning-rate schedule.
from transformers import Trainer, TrainingArguments, DataCollatorForLanguageModeling
training_args = TrainingArguments(
    output_dir="./llama3.2-1b-korean",  # any local path for checkpoints and logs
num_train_epochs=10,
per_device_train_batch_size=8,
gradient_accumulation_steps=2,
learning_rate=2e-4,
weight_decay=0.001,
logging_steps=25,
save_steps=1000,
save_strategy="epoch",
save_total_limit=2,
run_name="llama3.2-1B-korean-a100-4gpu",
deepspeed="deepspeed_config.json",
dataloader_num_workers=4,
ddp_find_unused_parameters=False,
fp16=True,
bf16=False,
max_grad_norm=0.3,
max_steps=-1,
warmup_ratio=0.03,
group_by_length=False,
lr_scheduler_type="constant",
)
The deepspeed parameter points to deepspeed_config.json, a JSON configuration file for the DeepSpeed training framework. Create this file before starting training with the following content:
{
"fp16": {
"enabled": true
},
"zero_optimization": {
"stage": 2,
"offload_optimizer": {
"device": "none"
},
"allgather_partitions": true,
"allgather_bucket_size": 5e8,
"overlap_comm": true,
"reduce_scatter": true,
"reduce_bucket_size": 2e8,
"contiguous_gradients": true
},
"train_micro_batch_size_per_gpu": 8,
"gradient_accumulation_steps": 2,
"gradient_clipping": "auto",
"zero_allow_untested_optimizer": true
}
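The Trainer handles the DeepSpeed integration internally, but the script itself must be started through a distributed launcher. Assuming the training code above is saved as train.py (a placeholder filename), a typical launch on a 4-GPU node looks like:
deepspeed --num_gpus=4 train.py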
6. Training the Model
Once all components are set, use the Hugging Face Trainer class to start training. A DataCollatorForLanguageModeling with mlm=False (i.e., causal language modeling) is used for batch collation.
from transformers import Trainer, DataCollatorForLanguageModeling
trainer = Trainer(
model=model,
train_dataset=dataset,
args=training_args,
tokenizer=tokenizer,
data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
Training logs will be saved to the directory specified by output_dir. LoRA adapter weights can be saved separately to reduce storage requirements.
7. Saving and Using the Fine-Tuned Model
After training is complete, you can save just the LoRA adapter weights or merge them into the base model before saving. Merging allows for simpler inference and deployment.
# Save LoRA adapters only
model.save_pretrained("adapter-output")
# Merge adapters and save full model
merged_model = model.merge_and_unload()
merged_model.save_pretrained("llama3.2-korean-finetuned")
tokenizer.save_pretrained("llama3.2-korean-finetuned")
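If you saved the merged model, it can later be loaded like any regular Hugging Face checkpoint, with no PEFT step required; a minimal sketch, assuming the directory name used above:
from transformers import AutoTokenizer, AutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained("llama3.2-korean-finetuned")
model = AutoModelForCausalLM.from_pretrained("llama3.2-korean-finetuned")
The next section instead shows the alternative of loading the base model and attaching the saved LoRA adapters.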
8. Inference: Korean Instruction Generation
To test the model's performance on Korean inputs, you can run the following code:
from transformers import AutoTokenizer, AutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B-Instruct")
tokenizer.add_special_tokens({'additional_special_tokens': ['<|user|>', '<|assistant|>', '<|system|>']})
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-1B-Instruct")
model.resize_token_embeddings(len(tokenizer))
from peft import PeftModel
model = PeftModel.from_pretrained(model, "adapter-output")  # directory where the LoRA adapters were saved
model.eval()
def format_prompt(instruction, input_text=""):
if input_text:
return f"<|system|>\nYou are a helpful assistant.\n<|user|>\n{instruction}\nInput: {input_text}\n<|assistant|>\n"
return f"<|system|>\nYou are a helpful assistant.\n<|user|>\n{instruction}\n<|assistant|>\n"
def generate_response(instruction, input_text=""):
input_ids = tokenizer(format_prompt(instruction, input_text), return_tensors="pt").input_ids.to(model.device)
outputs = model.generate(input_ids=input_ids, max_new_tokens=512, temperature=0.7, do_sample=True, top_p=0.9)
return tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generate_response("서울의 날씨는 어때?"))  # "How is the weather in Seoul?"
This setup will generate fluent Korean responses using the fine-tuned model.
9. Conclusion
This tutorial showed how to fine-tune the LLaMA 3.2-1B-Instruct model on Korean instruction datasets using Hugging Face and PEFT with LoRA. By combining efficient parameter tuning with quality datasets, you can create a compact model that performs well in Korean-language tasks without needing extensive resources.