LLM-Aligned Geographic Item Tokenization for Local-Life Recommendation
TL;DR Summary
The LGSID framework enhances local-life recommendation by integrating RL-based geographic alignment and hierarchical item tokenization to capture spatial relationships, outperforming existing models in empirical studies.
Abstract
Recent advances in Large Language Models (LLMs) have enhanced text-based recommendation by enriching traditional ID-based methods with semantic generalization capabilities. Text-based methods typically encode item textual information via prompt design and generate discrete semantic IDs through item tokenization. However, in domain-specific tasks such as local-life services, simply injecting location information into prompts fails to capture fine-grained spatial characteristics and real-world distance awareness among items. To address this, we propose LGSID, an LLM-Aligned Geographic Item Tokenization Framework for Local-life Recommendation. This framework consists of two key components: (1) RL-based Geographic LLM Alignment, and (2) Hierarchical Geographic Item Tokenization. In the RL-based alignment module, we initially train a list-wise reward model to capture real-world spatial relationships among items. We then introduce a novel G-DPO algorithm that uses pre-trained reward model to inject generalized spatial knowledge and collaborative signals into LLMs while preserving their semantic understanding. Furthermore, we propose a hierarchical geographic item tokenization strategy, where primary tokens are derived from discrete spatial and content attributes, and residual tokens are refined using the aligned LLM's geographic representation vectors. Extensive experiments on real-world Kuaishou industry datasets show that LGSID consistently outperforms state-of-the-art discriminative and generative recommendation models. Ablation studies, visualizations, and case studies further validate its effectiveness.
Mind Map
In-depth Reading
English Analysis
1. Bibliographic Information
1.1. Title
The central topic of the paper is LLM-Aligned Geographic Item Tokenization for Local-Life Recommendation. It focuses on enhancing recommendation systems for local-life services by integrating geographic awareness into item representation using large language models.
1.2. Authors
The authors are:
- Hao Jiang (Kuaishou Technology, Beijing, China)
- Guoquan Wang (Kuaishou Technology, Beijing, China)
- Donglin Zhou (Kuaishou Technology, Beijing, China)
- Sheng Yu (Kuaishou Technology, Beijing, China)
- Yang Zeng (Kuaishou Technology, Beijing, China)
- Wencong Zeng (Kuaishou Technology, Beijing, China)
- Kun Gai (Independent Researcher)
- Guorui Zhou (Kuaishou Technology, Beijing, China)
Most authors are affiliated with Kuaishou Technology, a major Chinese internet company known for its short-video platform, indicating a strong industry research background, particularly in recommendation systems and large-scale data applications. One author is an independent researcher.
1.3. Journal/Conference
The paper is published as an arXiv preprint. While arXiv is a reputable platform for disseminating research quickly, it is not a peer-reviewed journal or conference; papers posted there may or may not later undergo formal peer review for publication in a conference or journal. The listed submission timestamp is 2025-11-18T07:54:32.000Z (UTC).
1.4. Publication Year
The publication year, based on the provided UTC timestamp, is 2025.
1.5. Abstract
This paper introduces LGSID, an LLM-Aligned Geographic Item Tokenization Framework designed for local-life recommendation. It addresses the limitation of current text-based recommendation methods that fail to capture fine-grained spatial characteristics and distance awareness, especially when simply injecting location information into prompts for Large Language Models (LLMs). LGSID comprises two main components: RL-based Geographic LLM Alignment and Hierarchical Geographic Item Tokenization. The RL-based alignment module first trains a list-wise reward model to internalize real-world spatial relationships among items. Then, a novel G-DPO algorithm leverages this reward model to infuse generalized spatial knowledge and collaborative signals into LLMs while preserving their semantic understanding. The Hierarchical Geographic Item Tokenization strategy generates primary tokens from discrete spatial and content attributes and refines residual tokens using the aligned LLM's geographic representation vectors. Extensive experiments on real-world Kuaishou industry datasets demonstrate that LGSID consistently outperforms state-of-the-art discriminative and generative recommendation models. Ablation studies, visualizations, and case studies further confirm its effectiveness.
1.6. Original Source Link
The official source link is: https://arxiv.org/abs/2511.14221
The PDF link is: https://arxiv.org/pdf/2511.14221v1.pdf
The paper is currently a preprint on arXiv.
2. Executive Summary
2.1. Background & Motivation
The paper addresses a critical challenge in local-life recommendation services, such as those offered by Kuaishou and Meituan. Traditional ID-based recommendation methods, which use unique identifiers for items and heavily rely on collaborative filtering (CF) signals, struggle to capture real-world spatial characteristics and distance awareness. This leads to issues like limited user interaction opportunities for many items (the cold-start problem for spatially constrained items) and unfair exposure as the candidate item corpus grows.
Recent advancements in Large Language Models (LLMs) have shown promise in text-based recommendation by leveraging their semantic understanding capabilities. The typical approach involves designing prompts to encode item textual information and then quantizing these into discrete semantic IDs (SIDs) via item tokenization. However, a major limitation arises in local-life services: simply injecting location information into prompts is insufficient. Pre-trained LLMs often prioritize content relevance over geographic proximity, leading to recommendations that are semantically relevant but geographically inaccessible (e.g., recommending a Shanghai restaurant to a Beijing user, as illustrated in Figure 1). This highlights a gap where existing LLM-driven methods lack fine-grained spatial characteristics and real-world distance awareness.
The core problem, therefore, is to effectively inject domain-specific geographic knowledge and collaborative signals into LLMs to generate geographically-aware item representations that are suitable for local-life recommendation tasks, while simultaneously preserving the LLM's semantic understanding.
2.2. Main Contributions / Findings
The paper's primary contributions are:
- Identification of Limitations: It identifies the key limitations of existing LLM-driven item tokenization methods in spatially constrained scenarios, emphasizing the necessity of aligning LLMs with domain-specific knowledge (i.e., geographic awareness).
- Proposed Framework LGSID: It introduces LGSID (LLM-Aligned Geographic Item Tokenization Framework), a novel two-stage item tokenization framework specifically tailored for local-life recommendation.
  - RL-based Geographic LLM Alignment: This module aligns LLMs with real-world spatial knowledge. It involves:
    - Training a list-wise reward model with density-aware negative sampling to capture relative spatial distances.
    - Introducing a novel G-DPO algorithm that uses this reward model to inject geographic and collaborative signals into the LLM, balancing semantic accuracy with geographic awareness through similarity regularization.
  - Hierarchical Geographic Item Tokenization: This module generates a sequence of spatial-aware discrete tokens. It involves:
    - Creating primary tokens from discrete spatial and content attributes using MiniBatch K-Means.
    - Refining residual tokens using the aligned LLM's geographic representation vectors through learnable cluster centers.
- Extensive Experimental Validation: Comprehensive experiments on real-world Kuaishou industry datasets demonstrate that LGSID significantly improves performance across both discriminative and generative recommendation models.
  - Discriminative Models: LGSID showed substantial AUC improvements over DIN, DIEN, SIM, TWIN, and ETA when augmented with its tokenization scheme.
  - Generative Models: LGSID achieved significant improvements in Hit@K and NDCG@K metrics over TIGER and OneRec compared to other quantization methods.
  - Ablation Studies and Visualizations: These studies validate the effectiveness of the individual components, showing how G-DPO enhances geographic awareness while maintaining semantic understanding and how hierarchical tokenization improves category cohesion.

The key conclusion is that by explicitly aligning LLMs with geographic awareness and integrating it into a hierarchical item tokenization strategy, LGSID effectively addresses the spatial constraints of local-life recommendation, leading to more accurate and contextually relevant recommendations.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
To understand this paper, a novice reader should be familiar with the following fundamental concepts:
- Recommendation Systems (RS):
  - Definition: Algorithms designed to suggest relevant items (e.g., movies, products, restaurants) to users. They aim to predict user preferences.
  - Types:
    - ID-based methods: Represent users and items using unique identifiers (IDs). These IDs are often embedded into vectors (e.g., user embeddings, item embeddings) that capture their characteristics and relationships.
    - Collaborative Filtering (CF): A common ID-based technique that recommends items based on the preferences of similar users or the characteristics of similar items. For example, if user A and user B like similar movies, and user A liked a movie that user B hasn't seen, it might be recommended to user B.
    - Content-based methods: Recommend items similar to those a user has liked in the past, based on item attributes (e.g., genre of a movie, description of a product).
    - Hybrid methods: Combine CF and content-based approaches.
  - Local-life Recommendation: A specific type of recommendation system where geographic location and distance are crucial factors. Items are typically local businesses or services (e.g., restaurants, shops), and user interactions are constrained by physical proximity.
- Large Language Models (LLMs):
  - Definition: Advanced AI models (e.g., GPT, LLaMA) trained on vast amounts of text data to understand, generate, and process human language. They are typically based on the Transformer architecture.
  - Semantic Understanding: LLMs excel at capturing the meaning and context of text, allowing them to generate semantic representations (dense vector embeddings) that reflect the underlying meaning of words, sentences, or entire documents.
  - Pre-training: LLMs are initially trained on a large corpus of text data using self-supervised learning tasks (e.g., predicting the next word). This process allows them to learn general language patterns and semantic knowledge.
  - Fine-tuning/Alignment: After pre-training, LLMs can be adapted to specific downstream tasks or domains through further training on smaller, task-specific datasets, often using techniques such as reinforcement learning from human feedback (RLHF).
- Item Tokenization / Semantic IDs (SIDs):
  - Definition: The process of converting item representations (e.g., LLM embeddings) into discrete tokens or IDs. Instead of continuous vectors, items are assigned one or more numerical codes.
  - Purpose:
    - Efficiency: Discrete IDs can be more memory-efficient and faster to retrieve or process than high-dimensional continuous vectors, especially in large-scale recommendation systems.
    - Interpretability: In some contexts, discrete tokens can offer better interpretability than raw embeddings.
    - Integration with traditional systems: SIDs can bridge the gap between LLM semantic representations and traditional ID-based recommendation frameworks.
  - Quantization: The general process of mapping continuous values (like LLM embeddings) to discrete values (like tokens). This often involves clustering algorithms (e.g., K-Means) or more advanced techniques such as Vector Quantized Variational AutoEncoders (VQ-VAE) or Residual Quantization (RQ).
- Reinforcement Learning (RL):
  - Definition: A type of machine learning where an agent learns to make decisions by interacting with an environment. The agent receives rewards or penalties for its actions and aims to maximize its cumulative reward over time.
  - Components: Agent (the learning algorithm), Environment (the system the agent interacts with), State (the current situation of the environment), Action (the agent's choice in a given state), Reward (a scalar feedback signal indicating the desirability of an action), and Policy (the agent's strategy for choosing actions in different states).
  - Reward Model (RM): In RLHF, a separate model trained to predict human preferences (or a proxy for preferences, like geographic relevance in this paper) for different outputs generated by an LLM. This RM then provides the reward signals for training the LLM (the policy model).
  - Direct Preference Optimization (DPO): An RLHF algorithm that directly optimizes the LLM policy with a simple loss function derived from preference pairs, avoiding the complexity of explicitly training a reward model and running PPO (Proximal Policy Optimization).
- Distance Metrics:
  - Euclidean Distance: The straight-line distance between two points in Euclidean space. For two vectors $\mathbf{a}$ and $\mathbf{b}$, it is calculated as: $ d(\mathbf{a}, \mathbf{b}) = \sqrt{\sum_{i=1}^n (a_i - b_i)^2} $
  - Cosine Similarity: Measures the cosine of the angle between two non-zero vectors, indicating how similar their directions are. For two vectors $\mathbf{a}$ and $\mathbf{b}$: $ \text{similarity}(\mathbf{a}, \mathbf{b}) = \frac{\mathbf{a} \cdot \mathbf{b}}{\|\mathbf{a}\| \|\mathbf{b}\|} = \frac{\sum_{i=1}^n a_i b_i}{\sqrt{\sum_{i=1}^n a_i^2} \sqrt{\sum_{i=1}^n b_i^2}} $
  - Haversine Distance: A formula for calculating the distance between two points on a sphere (e.g., Earth) given their latitudes and longitudes. It is crucial for geographic applications (see the code sketch after this list). The formula is:
  $
  d = 2r \arcsin\left(\sqrt{\sin^2\left(\frac{\phi_2 - \phi_1}{2}\right) + \cos(\phi_1)\cos(\phi_2)\sin^2\left(\frac{\lambda_2 - \lambda_1}{2}\right)}\right)
  $
  Where:
    - $r$ is the radius of the sphere (Earth's mean radius, approx. 6371 km).
    - $\phi_1, \phi_2$ are the latitudes of point 1 and point 2 (in radians).
    - $\lambda_1, \lambda_2$ are the longitudes of point 1 and point 2 (in radians).
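To make the geographic notions above concrete, here is a minimal, self-contained Python sketch of the Haversine distance; the example coordinates are illustrative.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance in km between two (latitude, longitude) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(a))

# Beijing (39.90, 116.40) to Shanghai (31.23, 121.47) is roughly 1,070 km,
# which is why a Shanghai restaurant is a poor local-life recommendation for a Beijing user.
print(round(haversine_km(39.90, 116.40, 31.23, 121.47)))
```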
3.2. Previous Works
The paper extensively discusses item tokenization and LLM alignment in recommendation.
3.2.1. Item Tokenization
- Early efforts (RQ-VAE, VQ-VAE): These foundational techniques map high-dimensional continuous representations (like LLM embeddings) to discrete codes.
  - VQ-VAE (Vector Quantized Variational AutoEncoder): A neural network that learns a discrete latent representation of data. It quantizes continuous latent vectors from an encoder into discrete codes using a codebook and reconstructs the data from them.
  - RQ-VAE (Residual Quantization Variational AutoEncoder): An extension of VQ-VAE that applies quantization in multiple stages. It quantizes the residual error from the previous quantization step, allowing for finer-grained representation.
  - TIGER (Rajput et al. 2023) is an early adopter, using RQ-VAE to map LLM representations to semantic IDs (SIDs).
- Two-stage methods: These methods first generate item representations (e.g., from LLMs) and then tokenize them, often with a focus on aligning these tokens with recommendation signals.
  - LC-Rec (Zheng et al. 2024): Integrates recommendation signals through semantic alignment.
  - LETTER (Wang et al. 2024a): Jointly aligns semantic and collaborative spaces.
  - QARM (Luo et al. 2024): Uses Res-Kmeans (an efficient variant of K-Means for large datasets) guided by user interaction distributions to learn semantic IDs. K-Means is a clustering algorithm that partitions observations into clusters, where each observation belongs to the cluster with the nearest mean (centroid).
  - EAGER (Wang et al. 2024c): Employs two-stream codebooks to model both semantic and collaborative signals.
  - UTGRec (Zheng et al. 2025a): Integrates multimodal semantics with co-occurrence patterns into universal codes.
  - SC (Li et al. 2025): Regularizes semantic encoders using ID embeddings from pre-trained CF models.
- End-to-end methods: These methods jointly optimize tokenization and recommendation.
  - UnifiedSID (Lin et al. 2025): Uses RQ-VAE trained with cosine and Euclidean distances to integrate semantic and ID tokens.
  - ETEGRec (Liu et al. 2025a): Introduces sequence-item and preference-semantic alignment objectives trained jointly with the generative model.

Crucial gap identified by LGSID: Existing item tokenization methods largely overlook domain-specific constraints, particularly geographic ones in local-life recommendation. Without geographic awareness, SIDs may recommend otherwise suitable items that are too far away, negatively impacting system efficiency.
3.2.2. LLM Alignment for Recommendation
This area focuses on adapting LLMs for specific recommendation tasks.
- Designing LLM tasks:
  - LGHRec (Luo et al. 2025): Uses chain-of-thought (CoT) reasoning to distill item descriptions into semantic IDs, which are then fused with vanilla IDs for Graph Neural Networks (GNNs). CoT involves prompting LLMs to show their reasoning steps, which can improve their performance on complex tasks.
  - GNPR-SID (Wang et al. 2025): Incorporates domain attributes like location information into prompts to capture downstream signals.
  - SIIT (Chen et al. 2024): Iteratively refines tokenization through self-improvement.
- Fine-tuning LLMs:
  - NoteLLM (Zhang et al. 2024a) and NoteLLM-2 (Zhang et al. 2024b): Compress items via prompts and integrate collaborative signals through supervised fine-tuning.
  - AlignRec (Liu et al. 2024): Introduces alignment objectives for multimodal and user-item consistency.
  - LLMEmb (Liu et al. 2025b): Uses supervised contrastive fine-tuning to align LLM embeddings with collaborative data. Contrastive learning trains models to learn representations where similar data points are pulled closer together and dissimilar ones are pushed apart.
  - LARM (Liu et al. 2025c): Distills knowledge from open-source LLMs into smaller models.
  - Lu et al. (Lu et al. 2024): Strengthen LLMs' alignment with recommendation instructions.

Crucial gap identified by LGSID: While these methods emphasize semantic alignment, LLMs need to balance domain-specific constraints (like geography) with accurately capturing user preferences based on item content. Simply adding geographic information to prompts is insufficient; explicit alignment is needed.
3.3. Technological Evolution
The field of recommendation systems has evolved from traditional collaborative filtering and matrix factorization (ID-based methods) to leveraging deep learning for more sophisticated feature learning and interaction modeling. With the rise of Large Language Models, there's a shift towards text-based recommendation, where LLMs' semantic understanding offers generalization capabilities beyond mere ID embeddings.
This paper fits into this timeline by addressing a critical challenge in the LLM-driven recommendation paradigm: extending semantic understanding to include domain-specific geographic awareness. It moves beyond treating LLMs solely as text encoders or relying on simple prompt engineering. Instead, it proposes a more profound integration of domain knowledge through RL-based alignment, ensuring that the item tokenization process is not only semantically rich but also geographically intelligent.
3.4. Differentiation Analysis
Compared to existing item tokenization and LLM alignment methods, LGSID offers several core differences and innovations:
- Explicit Geographic Alignment for LLMs:
  - Prior methods: Primarily treat LLMs as generic text encoders (e.g., TIGER, QARM) or inject domain signals via prompt design (GNPR-SID). They assume LLMs can naturally balance content and location, yet LLMs often prioritize semantic relevance over geographic proximity (e.g., NoteLLM focuses on collaborative signals through supervised fine-tuning).
  - LGSID's innovation: LGSID explicitly post-trains the LLM using an RL-based alignment strategy. It does not just put location in the prompt; it teaches the LLM to understand and prioritize real-world spatial relationships through a reward model and the G-DPO algorithm. This ensures fine-grained spatial characteristics and distance awareness are deeply integrated into the LLM's representations.
- Novel G-DPO Algorithm for Domain-specific Knowledge Injection:
  - Prior DPO: Direct Preference Optimization (DPO) is a general RLHF method.
  - LGSID's innovation: G-DPO is specifically designed for local-life recommendation. It incorporates a geography-aware reward model and domain-mixed sampling (combining domain collaborative pairs and geographically constrained pairs) to inject generalized spatial knowledge and collaborative signals. Crucially, it uses similarity regularization to dynamically balance semantic accuracy and geographic awareness during fine-tuning, preventing the LLM from losing its core semantic understanding while gaining spatial intelligence.
- Hierarchical Geographic Item Tokenization:
  - Prior tokenization methods: Many use RQ-VAE, Res-Kmeans, or N-gram SIDs (UnifiedSID, QARM, RQ-VAE-ngram) to generate discrete tokens, but often lack a structured way to integrate multiple levels of geographic and content attributes explicitly.
  - LGSID's innovation: The hierarchical tokenization strategy directly addresses the multi-granularity nature of geographic information. It first generates primary tokens based on a weighted combination of discrete spatial (latitude, longitude, administrative codes) and content attributes. Then, residual tokens are refined using the aligned LLM's geographic representation vectors. This multi-layer approach allows for efficient compression while preserving rich information and ensures geographic cohesion across different levels of token granularity. The first layer's pre-computed geographic clusters provide a strong geographic prior, which is then semantically enriched by the aligned LLM.

In essence, LGSID moves beyond treating LLMs as black-box text encoders or relying on simple prompt engineering. It introduces a sophisticated post-training alignment and hierarchical tokenization framework that explicitly injects and balances geographic constraints and collaborative signals into LLMs, leading to more practical and performant local-life recommendation systems.
4. Methodology
The LGSID framework for local-life recommendation is designed to infuse real-world spatial awareness into semantic IDs generated by Large Language Models (LLMs). It operates through two main modules: RL-based Geographic LLM Alignment and Hierarchical Geographic Item Tokenization.
The overall architecture is depicted in Figure 2. The upper part illustrates the RL-based alignment module, which fine-tunes the LLM to integrate geographic knowledge while preserving semantic understanding. The lower part of Figure 2 details the Hierarchical Geographic Item Tokenization (HGIT) pipeline, which combines spatial and discrete content features with the LLM's semantically rich, geographically-aware representations to create balanced item tokens.

The figure is a schematic of the two stages of the LLM-aligned geographic item tokenization framework: RL-based geographic LLM alignment and hierarchical geographic item tokenization. The G-DPO algorithm is combined with RL and updated via reward scoring, emphasizing the alignment of content and geographic information; the user sequence illustrates the final generated item embeddings.
Figure 2: The upper part depicts the pipeline of RL-based Geographic LLM Alignment, showing how the LLM is updated through reward scoring by the G-DPO algorithm, emphasizing the alignment of content and geographic information. The lower part depicts the pipeline of Hierarchical Geographic Item Tokenization, showing the user sequence and the final generated item embeddings.
4.1. Preliminaries
4.1.1. Geography Prompt Design
For LLMs to derive item semantic representations, LGSID uses a structured prompt that combines textual descriptions and geographical attributes.
- Textual Descriptions ($t_i$): These include item content attributes such as name, brand, category, and price.
- Geographical Attributes ($g_i$): These include province, city, and town.

As illustrated in Figure 6 (in Appendix A), the prompt presents this information to the LLM in a structured manner. The LLM then processes this prompt to produce an item semantic representation, a dense vector embedding that captures the meaning and context of the item, including its geographic context. These continuous representations are subsequently converted into discrete semantic IDs through a quantization process.

The figure is a schematic of the content-understanding embedding generation process in the short-video recommendation system. Inputs include the item name, location, category hierarchy, brand, and product information, from which the LLM embedding is generated; the system produces personalized recommendations based on information such as location.
Figure 6: Illustration of our prompt design.
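A minimal sketch of how such a geography prompt might be assembled. The field names and template wording are illustrative assumptions, not the exact template from the paper's Figure 6.

```python
def build_geo_prompt(item: dict) -> str:
    """Combine textual attributes (t_i) and geographic attributes (g_i) into one prompt."""
    return (
        f"Item name: {item['name']}. Brand: {item['brand']}. "
        f"Category: {item['primary_category']} > {item['secondary_category']}. "
        f"Price: {item['price']}. "
        f"Location: {item['province']}, {item['city']}, {item['town']}."
    )

# Hypothetical local-life item for illustration.
prompt = build_geo_prompt({
    "name": "Haidilao Hot Pot", "brand": "Haidilao",
    "primary_category": "Food", "secondary_category": "Hot Pot",
    "price": "120 CNY", "province": "Beijing", "city": "Beijing", "town": "Chaoyang",
})
print(prompt)
```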
4.1.2. SID in Recommendation
LGSID's generated Semantic IDs (SIDs) are applicable to two mainstream recommendation paradigms:
- Discriminative Recommendation:
  - Objective: Predict a user's likelihood of interacting with an item (e.g., click-through rate).
  - Pipeline: Users and items are encoded into their respective ID embeddings. Each item $i$ is represented by its textual description $t_i$ and its semantic IDs $s_i$. A user encoder learns user embeddings $\mathbf{e}_u = f_u(u)$. The system then generates a Top-k list of recommended items by matching the user embedding with a combined representation of item ID, textual description, and semantic IDs. The SIDs from LGSID enrich the item representation, providing geographic awareness and semantic generalization to improve matching accuracy.
- Generative Recommendation:
  - Objective: Directly predict the next item (or its semantic ID) that a user will interact with, given their historical sequence of interactions.
  - Pipeline: Items are first mapped to semantic IDs via quantized tokenization. Given a user's history of SIDs $[s_1, s_2, \dots, s_t]$, the generative model predicts the next SID $s_{t+1}$, which corresponds to the most likely next item. LGSID's geographically-aware SIDs enable the generative model to produce recommendations that are not only semantically relevant but also spatially appropriate.
4.2. RL-based Geographic LLM Alignment
Existing methods often try to enhance LLM awareness by incorporating domain-specific information into prompts. However, LLMs primarily rely on semantic similarity and struggle with fine-grained spatial distinctions. To address this, LGSID proposes an RL-based post-training strategy to align LLMs with real-world geographic knowledge.
4.2.1. Geography-aware Reward Model Training
Training a reward model for reinforcement learning in recommendation systems with vast and dynamic item pools is challenging due to the impracticality of manual labeling. LGSID overcomes this by training a list-wise reward model that predicts the geographic relevance score between an item's content and its corresponding location, based on their LLM representations. This approach enables the model to internalize generalizable spatial knowledge.
- Pairwise Geodesic Distance and Negative Sampling:
  - The process begins by calculating the pairwise geodesic distance (distance on a sphere, using latitude and longitude) between items.
  - A density-aware hard negative sampling strategy is then applied. This strategy selects negative samples for each target item based on their spatial distance. Hard negative samples are typically those that are challenging for the model to distinguish from positive samples, pushing the model to learn more discriminative features.
- Prompt Mismatching Strategy:
  - For each target item, prompt sequences are constructed by fixing the item's content ($t_i$) but replacing its true location with that of a negative sample ($g_j^-$). This creates a prompt-mismatched pair.
  - The definition of these prompts is given by Equation 1:
  $
  p_i = (t_i, g_i), \qquad p_{i,j}^- = (t_i, g_j^-)
  $
  Where:
    - $p_i$: The prompt for the $i$-th target item with its true location.
    - $t_i$: The textual attributes of the item (e.g., name, brand, category, price).
    - $g_i$: The true geographic location (e.g., province, city, town) of item $i$.
    - $p_{i,j}^-$: The prompt for the $i$-th target item with a negative location sampled from another item.
    - $g_j^-$: A negative geographic location corresponding to a sampled item $j$.
  - This generates a prompt sequence for each target item $i$: $P_i = \{p_i, p_{i,1}^-, \dots, p_{i,M}^-\}$, where $p_{i,j}^-$ denotes the $j$-th negative sample and $M$ is the number of negatives.
- LLM Encoding and Reward Prediction:
  - The LLM encodes these prompts into embeddings $\mathbf{H}_i = \{\mathbf{h}_{i,0}, \mathbf{h}_{i,1}, \dots, \mathbf{h}_{i,M}\}$, where $\mathbf{h}_{i,j} \in \mathbb{R}^d$ and $d$ is the embedding dimension; $\mathbf{h}_{i,0}$ is the embedding of the true prompt $p_i$.
  - A list-wise architecture is employed to learn spatial proximity. For each prompt-mismatched sampled item, its LLM representation is fed into a multi-layer perceptron (MLP) to predict a reward score. This score quantifies the relationship between the target item's content $t_i$ and the (potentially mismatched) location $g_j$.
  - The reward score is defined by Equation 2: $ r_{i,j} = \mathrm{MLP}(\mathbf{h}_{i,j}) $
  Where:
    - $r_{i,j}$: The reward score between the $i$-th target item's content and the location taken from item $j$; the reward model learns to score how "correct" the location $g_j$ is for the content $t_i$.
- Distance-based Soft Labels:
  - Soft labels are assigned based on the Haversine distances between items. Given a candidate list of prompt sequences sorted by distance from near to far, Equation 3 assigns each pair a soft label $w_{i,j}$ that decreases with rank.
  Where:
    - $d_{i,j}$: The Haversine distance between items $i$ and $j$ in the list.
    - $\mathrm{rank}(j)$: The rank of item $j$ based on its distance to item $i$ (e.g., rank 1 for the closest, rank $M$ for the furthest).
  - This labeling strategy assigns higher soft labels (higher importance) to items that are geographically closer to the target item.
- Reward Model Training Loss:
  - The reward model is trained using a weighted binary cross-entropy loss, as defined by Equation 4:
  $
  \mathcal{L}_{RM} = -\frac{1}{B} \sum_{i=1}^{B} \sum_{j} \Big[ w_{i,j} \log \sigma(r_{i,j}) + (1 - w_{i,j}) \log\big(1 - \sigma(r_{i,j})\big) \Big]
  $
  Where:
    - $\mathcal{L}_{RM}$: The training loss for the reward model.
    - $B$: The batch size.
    - $w_{i,j}$: The soft label (weight) for the pair of item $i$ and item $j$, derived from their geographic distance.
    - $\sigma(\cdot)$: The sigmoid activation function, which squashes the reward score into a probability-like value between 0 and 1.
    - $r_{i,j}$: The predicted reward score from the MLP.
  - This loss encourages the reward model to predict higher scores for geographically relevant (closer) item-location pairs and lower scores for irrelevant (further) ones. A code sketch of this training setup follows this list.
4.2.2. G-DPO Algorithm with RM
Building on the pre-trained reward model, LGSID introduces G-DPO (Geographic Direct Preference Optimization), an enhanced algorithm inspired by Direct Preference Optimization (DPO) (Rafailov et al. 2023), for aligning the LLM.
- Domain-Mixed Sampling Strategy ($\mathcal{D}_{\text{mix}}$):
  - G-DPO uses a domain-mixed sampling strategy that combines two types of data:
    - Domain Collaborative Pairs: These pairs are derived from user historical behaviors. Items that frequently co-occur in user click histories tend to have both semantic and geographic similarity. The co-occurrence score between items $i$ and $j$ is defined by Equation 5:
    $
    \mathrm{co}(i, j) = \sum_{u} \mathbb{1}\left[i \in \mathcal{H}_u \wedge j \in \mathcal{H}_u\right]
    $
    Where:
      - $\mathbb{1}[\cdot]$: The indicator function, which is 1 if the condition is true and 0 otherwise.
      - $\mathcal{H}_u$: User $u$'s click history.
    Pairs with $\mathrm{co}(i, j) > \tau$ (a predefined threshold) are selected as domain collaborative sample pairs.
    - Geographically Constrained Pairs: These pairs are formed by randomly sampling items that are outside the target item's location. This ensures diversity and efficiency within large candidate pools.
- G-DPO Alignment Loss ($\mathcal{L}_{\text{align}}$):
  - G-DPO aligns the policy model (the LLM being fine-tuned) with a reference model (a frozen version of the policy model or a pre-trained LLM) using these domain-mixed sample pairs. Here, $x_w$ represents a preferred item (e.g., geographically closer or collaboratively relevant) and $x_l$ represents a non-preferred item.
  - Each item pair is scored by the pre-trained reward model $RM(\cdot)$, which takes embeddings from either the policy model ($\mathbf{e}^{\pi}$) or the reference model ($\mathbf{e}^{\text{ref}}$) and outputs a distance-based score.
  - The alignment loss in G-DPO is given by Equation 6:
  $
  \mathcal{L}_{\text{align}} = -\mathbb{E}_{(x_w, x_l) \sim \mathcal{D}_{\text{mix}}}\left[\log \sigma\Big(\beta \big(RM(\mathbf{e}^{\pi}_{x_w}) - RM(\mathbf{e}^{\pi}_{x_l})\big) - \beta \big(RM(\mathbf{e}^{\text{ref}}_{x_w}) - RM(\mathbf{e}^{\text{ref}}_{x_l})\big)\Big)\right]
  $
  Where:
    - $\mathcal{L}_{\text{align}}$: The alignment loss for G-DPO.
    - $\mathbb{E}_{(x_w, x_l) \sim \mathcal{D}_{\text{mix}}}$: Expectation over domain-mixed sample pairs.
    - $\sigma(\cdot)$: The sigmoid function.
    - $\beta$: A hyperparameter that controls the sharpness of the alignment.
    - $RM(\cdot)$: The pre-trained reward model.
    - $\mathbf{e}^{\pi}_{x}$: The embedding of item $x$ generated by the policy model (the LLM being optimized).
    - $\mathbf{e}^{\text{ref}}_{x}$: The embedding of item $x$ generated by the reference model (a fixed version of the LLM).
  - This loss encourages the policy model to assign higher reward scores to preferred items ($x_w$) and lower scores to non-preferred items ($x_l$), relative to the reference model. This effectively injects the geographic and collaborative preferences learned by the reward model into the LLM.
- Similarity Regularization ($\mathcal{L}_{\text{sim}}$):
  - To prevent the LLM from catastrophic forgetting or degrading its original semantic representations during G-DPO updates, a similarity regularizer is introduced. This is an in-batch contrastive loss.
  - The similarity regularization loss is defined by Equation 7:
  $
  \mathcal{L}_{\text{sim}} = \mathbb{E}_{i \in B}\left[\left\|\mathbf{e}^{\pi}_i - \mathbf{e}^{\text{ref}}_i\right\|_2^2 - \frac{1}{|B| - 1}\sum_{j \in B,\, j \neq i}\left\|\mathbf{e}^{\pi}_i - \mathbf{e}^{\text{ref}}_j\right\|_2^2\right]
  $
  Where:
    - $\mathcal{L}_{\text{sim}}$: The similarity regularization loss.
    - $\mathbb{E}_{i \in B}$: Expectation over instances in a batch $B$.
    - $\|\cdot\|_2^2$: The squared Euclidean distance.
    - $\mathbf{e}^{\pi}_i$: Embedding of item $i$ from the policy model.
    - $\mathbf{e}^{\text{ref}}_i$: Embedding of item $i$ from the reference model.
  - The first term pulls the policy model's embedding for item $i$ closer to the reference model's embedding for the same item.
  - The second term pushes the policy model's embedding for item $i$ away from the reference model's embeddings for other items in the same batch. This is a common form of contrastive loss that aims to preserve distinctiveness while maintaining semantic consistency.
- Overall G-DPO Objective:
  - The complete G-DPO training objective combines the alignment loss and the similarity regularization loss with a weight $\lambda$, as shown in Equation 8:
  $
  \mathcal{L}_{\text{G-DPO}} = \mathcal{L}_{\text{align}} + \lambda \, \mathcal{L}_{\text{sim}}
  $
  Where:
    - $\lambda$: A hyperparameter that controls the strength of the similarity regularization. A higher $\lambda$ means more emphasis on preserving the original semantic understanding.
  - A compact code sketch of this objective follows.
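A sketch of the combined objective (Equations 6-8 as reconstructed above). The exact functional forms, the stand-in reward head, and the value of `lam` are assumptions; `beta = 0.9` follows the fine-tune settings reported later.

```python
import torch
import torch.nn.functional as F

def g_dpo_loss(e_pi_w, e_pi_l, e_ref_w, e_ref_l, reward_model, beta=0.9, lam=0.1):
    # Eq. 6: DPO-style alignment on reward-model scores, policy vs. frozen reference.
    margin_pi = reward_model(e_pi_w) - reward_model(e_pi_l)
    margin_ref = reward_model(e_ref_w) - reward_model(e_ref_l)
    l_align = -F.logsigmoid(beta * (margin_pi - margin_ref)).mean()

    # Eq. 7 (using the preferred items as the batch for brevity): pull each
    # policy embedding toward the reference embedding of the same item, push it
    # away from the other items' reference embeddings in the batch.
    bsz = e_pi_w.shape[0]
    d_pos = ((e_pi_w - e_ref_w) ** 2).sum(-1)            # [B]
    d_neg = torch.cdist(e_pi_w, e_ref_w) ** 2            # [B, B]
    mask = ~torch.eye(bsz, dtype=torch.bool)
    l_sim = (d_pos - d_neg[mask].view(bsz, -1).mean(-1)).mean()

    return l_align + lam * l_sim                         # Eq. 8

linear = torch.nn.Linear(1024, 1)                        # stand-in for the pre-trained reward MLP
score = lambda e: linear(e).squeeze(-1)
embs = [torch.randn(8, 1024) for _ in range(4)]          # policy/ref embeddings of x_w, x_l
print(g_dpo_loss(*embs, reward_model=score))
```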
4.3. Hierarchical Geographic Item Tokenization
Upon obtaining aligned LLM semantic representations (which are now geographically aware), LGSID introduces a Hierarchical Geographic Item Tokenization module to convert these representations into discrete tokens efficiently.
4.3.1. First Layer Tokenization (Primary Tokens)
The first layer focuses on generating primary tokens that are highly geography-aware and incorporate other discrete content attributes.
- Multi-dimensional Feature Construction:
  - To initialize geography-aware tokens, a multi-dimensional feature vector is constructed by integrating various attribute types. Composite embeddings (fixed scaling factors rather than high-dimensional one-hot encodings) are used for the discrete categorical features to mitigate the inefficiency of one-hot representations.
    - Geography-aware codes are formed by latitude and longitude.
    - Administrative codes are fixed scaling factors based on province ID, city ID, and district ID.
    - Category codes are fixed scaling factors based on primary category and secondary category.
    - Brand codes are fixed scaling factors based on brand ID.
  - The final clustering feature vector is constructed by weighted concatenation of all these components, as shown in Equation 9:
  $
  \mathbf{x}_i = \left[\, \alpha_1 \mathbf{x}^{\text{geo}}_i \,\|\, \alpha_2 \mathbf{x}^{\text{adm}}_i \,\|\, \alpha_3 \mathbf{x}^{\text{cat}}_i \,\|\, \alpha_4 \mathbf{x}^{\text{brand}}_i \,\right]
  $
  Where:
    - $\alpha_1, \dots, \alpha_4$: Empirically chosen coefficients that reflect the relative importance of each feature type.
- Clustering for First-Layer Tokens:
  - Using the clustering feature vector, MiniBatch K-Means is applied. MiniBatch K-Means is a variant of the K-Means algorithm that uses small random subsets of the data (mini-batches) to update the centroids, making it much faster than standard K-Means on large datasets.
  - This clustering process generates a vocabulary of first-layer geographic tokens.
  - The cluster representations (centroids) for this first layer are then computed as the mean of the LLM embeddings within each token group, producing the first-layer cluster centers. This step is critical: it grounds the first layer of tokens in both the discrete geographic and content attributes and the rich, geographically-aligned LLM embeddings. A code sketch of this step follows.
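A sketch of the first-layer tokenization: weighted feature concatenation (Eq. 9) followed by MiniBatch K-Means, then per-token means of LLM embeddings as the first-layer centers. All weights, sizes, and the random data are placeholders.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

rng = np.random.default_rng(0)
n_items, n_tokens = 10_000, 256

# Feature groups; the alpha weights below are illustrative, not the paper's values.
geo = rng.random((n_items, 2))      # scaled latitude / longitude
admin = rng.random((n_items, 3))    # province / city / district codes
cat = rng.random((n_items, 2))      # primary / secondary category codes
brand = rng.random((n_items, 1))    # brand code
x = np.hstack([3.0 * geo, 2.0 * admin, 1.0 * cat, 0.5 * brand])   # Eq. 9

# First-layer vocabulary of geography-aware tokens.
kmeans = MiniBatchKMeans(n_clusters=n_tokens, batch_size=1024, random_state=0)
tokens_l1 = kmeans.fit_predict(x)

# First-layer cluster centers in LLM space: mean aligned-LLM embedding per token
# (zeros as a fallback for any empty cluster in this toy run).
llm_emb = rng.standard_normal((n_items, 1024)).astype(np.float32)
centers_l1 = np.stack([
    llm_emb[tokens_l1 == k].mean(axis=0) if np.any(tokens_l1 == k)
    else np.zeros(1024, np.float32)
    for k in range(n_tokens)
])
```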
4.3.2. Residual Layers Tokenization (Residual Tokens)
For subsequent residual layers ($l \geq 2$), LGSID employs learnable cluster centers with Euclidean distance-based assignment, similar to Residual Quantization (RQ) methods. This process refines the remaining information (residuals) not captured by the previous layers.
- Cluster Assignment for Residuals:
  - For each residual layer $l$, an input residual vector (the input embedding minus the quantized representations from all previous layers) is assigned to the closest cluster center.
  - The assignment for layer $l$ is determined by Equation 10:
  $
  c_l = \arg\min_{k} \left\|\mathbf{r}^{(l-1)} - \mathbf{z}^{(l)}_k\right\|_2^2
  $
  Where:
    - $c_l$: The index of the chosen cluster for layer $l$.
    - $\mathbf{r}^{(l-1)}$: The residual vector from the previous layer $(l-1)$. For the first layer, $\mathbf{r}^{(0)}$ is the original LLM embedding $\mathbf{e}$.
    - $\mathbf{z}^{(l)}_k$: The $k$-th learnable cluster center for layer $l$.
    - $\|\cdot\|_2^2$: The squared Euclidean distance.
- Quantized Representation and Residual Update:
  - Once a cluster center is chosen, the quantized representation for that layer is simply the selected cluster center, as given by Equation 11:
  $
  \mathbf{q}^{(l)} = \mathbf{z}^{(l)}_{c_l}
  $
  Where:
    - $\mathbf{q}^{(l)}$: The quantized representation for layer $l$, i.e., the cluster center corresponding to $c_l$.
  - The residual for the next layer is then calculated by subtracting this quantized representation from the current residual. This residual update is defined by Equation 12:
  $
  \mathbf{r}^{(l)} = \mathbf{r}^{(l-1)} - \mathbf{q}^{(l)}
  $
  Where:
    - $\mathbf{r}^{(l)}$: The new residual vector passed to the next layer $(l+1)$, representing the information not yet captured by the first $l$ layers of quantization.

A code sketch of one residual layer follows.
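One residual layer (Equations 10-12) in PyTorch. Sizes and the random "embeddings" are placeholders, and in LGSID the first layer's centers come from the geographic clustering above rather than from a learnable codebook.

```python
import torch

def assign_and_residual(residual, centers):
    """Eqs. 10-12: pick the nearest center by squared Euclidean distance,
    take it as the quantized representation, and subtract it."""
    d = torch.cdist(residual, centers) ** 2    # [N, K] squared distances
    idx = d.argmin(dim=1)                      # c_l: chosen cluster per item
    quantized = centers[idx]                   # q^(l)
    return idx, residual - quantized           # (tokens, r^(l))

N, D, K = 1000, 1024, 256
residual = torch.randn(N, D)                   # r after the first (geographic) layer
codebooks = [torch.randn(K, D) for _ in range(2)]   # layers l = 2, 3; learnable in training
codes = []
for centers in codebooks:
    idx, residual = assign_and_residual(residual, centers)
    codes.append(idx)                          # hierarchical token ids per layer
```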
4.3.3. Training Objective for HGIT
The primary goal of Hierarchical Geographic Item Tokenization (HGIT) is to minimize the reconstruction loss between the original embeddings and their quantized representations.
- Reconstruction Loss ($\mathcal{L}_{\text{recon}}$):
  - For input embeddings $\mathbf{e}$ (the LLM's geographic representations), the reconstruction loss is calculated as the squared Euclidean distance between the original embedding and the sum of quantized representations from all layers, as defined by Equation 13:
  $
  \mathcal{L}_{\text{recon}} = \left\|\mathbf{e} - \sum_{l=1}^{L} \mathbf{q}^{(l)}\right\|_2^2
  $
  Where:
    - $\mathcal{L}_{\text{recon}}$: The reconstruction loss.
    - $\mathbf{e}$: The original LLM embedding.
    - $\sum_{l=1}^{L} \mathbf{q}^{(l)}$: The sum of the quantized representations from all $L$ layers, which forms the reconstructed embedding.
- Entropy-based Regularization ($\mathcal{L}_{\text{reg}}$):
  - To ensure balanced utilization of the learned clusters and prevent cluster collapse (where some clusters are never used or capture very few data points), an entropy-based regularization term is introduced.
  - For each layer $l$, the cluster usage distribution is computed:
  $
  p^{(l)}_k = \frac{1}{N} \sum_{n=1}^{N} \mathbb{1}\left[c_l(n) = k\right]
  $
  Where:
    - $p^{(l)}_k$: The proportion of data points assigned to cluster $k$ in layer $l$.
    - $N$: The total number of data points.
    - $\mathbb{1}[\cdot]$: The indicator function, which is 1 if data point $n$ is assigned to cluster $k$, and 0 otherwise.
  - The regularization loss encourages uniform cluster usage through KL divergence:
  $
  \mathcal{L}^{(l)}_{\text{reg}} = \mathrm{KL}\left(\mathbf{p}^{(l)} \,\Big\|\, \mathcal{U}(K_l)\right)
  $
  Where:
    - $\mathcal{L}^{(l)}_{\text{reg}}$: The regularization loss for layer $l$.
    - $\mathrm{KL}(P \| Q)$: The Kullback-Leibler (KL) divergence between distribution $P$ and distribution $Q$; it measures how one probability distribution diverges from a second, expected probability distribution.
    - $\mathcal{U}(K_l)$: The uniform distribution over clusters, where $K_l$ is the number of clusters in layer $l$.
  - Minimizing this KL divergence pushes the cluster usage distribution towards a uniform distribution, ensuring all clusters are effectively utilized (see the sketch after this list).
- Complete HGIT Training Objective:
  - The complete training objective for HGIT combines the reconstruction loss with the cluster usage regularization:
  $
  \mathcal{L}_{\text{HGIT}} = \mathcal{L}_{\text{recon}} + \mu \sum_{l=2}^{L} \mathcal{L}^{(l)}_{\text{reg}}
  $
  Where:
    - $\mathcal{L}_{\text{HGIT}}$: The overall HGIT training loss.
    - $\mu$: A hyperparameter that controls the strength of the regularization.
  - Important Note: The regularization is only applied to learnable layers ($l \geq 2$) because the first layer uses pre-computed geographic clusters and thus does not require regularization for balanced usage in the same way learnable layers do.
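A sketch of the entropy-based usage regularizer: the KL divergence between the empirical cluster-usage distribution and the uniform distribution, applied per learnable layer. The toy assignments are placeholders.

```python
import torch

def usage_kl(idx, num_clusters):
    """KL(p || uniform) over cluster usage for one learnable layer;
    minimizing it pushes assignments toward balanced codebook utilization."""
    counts = torch.bincount(idx, minlength=num_clusters).float()
    p = counts / counts.sum()
    u = torch.full_like(p, 1.0 / num_clusters)
    nz = p > 0                                  # convention: 0 * log 0 = 0
    return (p[nz] * (p[nz] / u[nz]).log()).sum()

idx = torch.randint(0, 256, (10_000,))          # toy layer assignments c_l(n)
print(usage_kl(idx, 256))                       # near 0 for roughly uniform usage
```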
5. Experimental Setup
5.1. Datasets
The experiments are conducted on a real-world industry dataset from the feed of the Kuaishou App in the local-life scenario.
- Source: Kuaishou App (a major short-video platform in China).
- Domain: Local-life services, implying items like restaurants, shops, and services with geographic relevance.
- Characteristics:
  - Includes both item geographic information (latitude, longitude, administrative codes) and textual information (name, brand, category, price).
  - Items satisfy delivery-distance constraints, which is critical for local-life services.
- Scale: The dataset statistics are summarized in Table 4. The following are the results from Table 4 of the original paper:

| Kuaishou Industry Datasets | |
| # Samples | 50,000,000 |
| # Users | 19,080,888 |
| # Items | 2,325,266 |
| # Brands | 19,408 |
| # Categories | 818 |

- Choice Rationale: The authors chose this dataset due to the scarcity of public datasets that satisfy delivery-distance constraints and contain detailed item text and geographic location information. This dataset is highly effective for validating the method's performance in its target application domain.
- Future Plans: The authors plan to release their dataset in future work to support research on algorithms for LBS (Location-Based Services)-like recommendation scenarios.
5.2. Evaluation Metrics
The evaluation metrics differ based on the recommendation paradigm:
5.2.1. Discriminative Recommendation Metrics
For discriminative models, AUC is used.
- Area Under the Receiver Operating Characteristic Curve (AUC):
  - Conceptual Definition: AUC quantifies the overall performance of a binary classification model (e.g., predicting click or no-click). It represents the probability that the model will rank a randomly chosen positive instance higher than a randomly chosen negative instance. A higher AUC value indicates better discriminative power; an AUC of 0.5 suggests random performance, while 1.0 indicates perfect classification.
  - Mathematical Formula: The AUC is the area under the ROC curve, which plots the True Positive Rate (TPR) against the False Positive Rate (FPR) at various threshold settings:
    - $\mathrm{TPR} = \frac{TP}{TP + FN}$ (Sensitivity or Recall)
    - $\mathrm{FPR} = \frac{FP}{FP + TN}$ (1 - Specificity)
  Formally, $\mathrm{AUC} = \int_0^1 \mathrm{TPR} \; d\,\mathrm{FPR}$. In practice, for discrete predictions, AUC can be approximated by ranking pairs of positive and negative samples (see the sketch after this list):
  $
  \mathrm{AUC} = \frac{1}{|P| \cdot |N|} \sum_{p \in P} \sum_{n \in N} \mathbb{1}\left[s_p > s_n\right]
  $
  - Symbol Explanation:
    - TPR: True Positive Rate, the proportion of actual positives correctly identified.
    - FPR: False Positive Rate, the proportion of actual negatives incorrectly identified as positive.
    - TP / FN / FP / TN: True Positives, False Negatives, False Positives, and True Negatives.
    - $|P|$: Total number of positive samples; $|N|$: total number of negative samples.
    - $s_p$: Prediction score for a positive sample $p$; $s_n$: prediction score for a negative sample $n$.
    - $\mathbb{1}[\cdot]$: Indicator function, which is 1 if the condition is true, 0 otherwise.
5.2.2. Generative Recommendation Metrics
For generative models, NDCG@K and Hit@K with $K \in \{5, 10\}$ are used.
- Normalized Discounted Cumulative Gain at K (NDCG@K):
  - Conceptual Definition: NDCG@K measures ranking quality, taking into account the position of relevant items in the ranked list. It gives higher scores to relevant items that appear earlier in the list and discounts the relevance of items further down. It normalizes the DCG (Discounted Cumulative Gain) by the Ideal DCG (IDCG), the DCG of a perfect ranking, allowing comparison across different queries or lists.
  - Mathematical Formula:
    - First, Discounted Cumulative Gain (DCG@K) is calculated: $ \mathrm{DCG@K} = \sum_{i=1}^{K} \frac{rel_i}{\log_2(i + 1)} $
    - Then, Ideal DCG (IDCG@K) is calculated by sorting all relevant items by their relevance score in descending order and applying the DCG formula.
    - Finally, $ \mathrm{NDCG@K} = \frac{\mathrm{DCG@K}}{\mathrm{IDCG@K}} $
  - Symbol Explanation:
    - $K$: The number of top items in the ranked list being considered.
    - $rel_i$: The relevance score of the item at position $i$ in the generated ranked list; this can be binary (1 if relevant, 0 if not) or graded (e.g., 0-5).
    - $\log_2(i + 1)$: A logarithmic discount factor, giving less weight to items at lower ranks.
    - In IDCG@K, $rel_i$ refers to the relevance score of the item at position $i$ in the ideal ranked list (all relevant items ranked perfectly).
- Hit Rate at K (Hit@K):
  - Conceptual Definition: Hit@K measures whether a relevant item is present within the top K recommendations. It is binary per query: if at least one target item is found among the top K recommended items, it counts as a "hit" (1); otherwise a "miss" (0). The Hit Rate is the proportion of queries (users) for which a hit occurred.
  - Mathematical Formula: For a single user $u$, $ \mathrm{Hit@K}(u) = \mathbb{1}\left[\text{target item of } u \in \text{Top-}K(u)\right] $. The overall Hit@K is the average across all users (see the sketch after this list).
  - Symbol Explanation:
    - $K$: The number of top items in the ranked list being considered.
    - $\mathbb{1}[\cdot]$: Indicator function, which is 1 if the condition is true, 0 otherwise.
5.3. Baselines
To evaluate LGSID's effectiveness, the authors compare it against representative baselines in both discriminative and generative recommendation.
5.3.1. Discriminative Recommendation Baselines
These models focus on learning effective user-item interaction representations and are widely adopted in industrial applications.
- DIN (Deep Interest Network) (Zhou et al. 2018): A model that adaptively learns the representation of user interests from historical behaviors given a candidate item. It uses an attention mechanism to weigh different historical behaviors based on their relevance to the current candidate item.
- DIEN (Deep Interest Evolution Network) (Zhou et al. 2019): An extension of DIN that captures the temporal evolution of user interests. It uses a GRU (Gated Recurrent Unit)-based interest extractor layer and an interest evolving layer to model how user interests change over time.
- ETA (End-to-end User Behavior Retrieval in Click-through Rate Prediction Model) (Chen et al. 2021): A model designed for click-through rate (CTR) prediction that integrates user behavior retrieval directly into the end-to-end training process, efficiently finding relevant historical behaviors.
- SIM (Search-based User Interest Modeling with Lifelong Sequential Behavior Data for Click-through Rate Prediction) (Pi et al. 2020): Focuses on modeling long-term user interests from extensive sequential behavior data, often involving search-based retrieval of relevant historical interactions.
- TWIN (TWIN V2: Scaling Ultra-long User Behavior Sequence Modeling for Enhanced CTR Prediction at Kuaishou) (Si et al. 2024): A highly optimized model for CTR prediction that can handle ultra-long user behavior sequences, common in large-scale industrial settings like Kuaishou.
5.3.2. Generative Recommendation Baselines
These models aim to generate recommendations, often by predicting item sequences or semantic IDs.
- TIGER (Recommender Systems with Generative Retrieval) (Rajput et al. 2023): One of the first models to propose RQ-VAE for mapping LLM representations to semantic IDs in generative recommendation, enabling the model to retrieve items based on their semantic tokens.
- OneRec (Unifying Retrieve and Rank with Generative Recommender and Iterative Preference Alignment) (Deng et al. 2025): A recent generative recommendation model that aims to unify the retrieve and rank stages, often incorporating iterative preference alignment.
5.3.3. Quantization Methods for Comparison
The paper also compares LGSID's quantization with other methods when applied to TIGER and OneRec:
- Res-KMeans (Luo et al. 2024): A residual K-Means based quantization model, used in QARM, to address challenges in leveraging LLM representations.
- RQ-VAE (Rajput et al. 2023): The Residual Quantized Variational AutoEncoder model, a common method for item tokenization.
- Lin et al. (Lin et al. 2025): Refers to UnifiedSID, which employs RQ-VAE trained with cosine and Euclidean distances to integrate semantic and ID tokens.
- RQ-VAE-ngram (Zheng et al. 2025b): An extension of RQ-VAE that incorporates N-gram semantic IDs, aiming to address the long-tail problem.
5.4. Finetune Settings
- Backbone LLM: BGE (BAAI General Embedding) (Xiao et al. 2023) is used as the backbone. BGE is a state-of-the-art multilingual text understanding model, chosen for its suitability in industrial recommendation scenarios and its ability to work without complex instructions.
- Prompt Length: Set to 512 tokens.
- Embedding Size: Set to 1024. The last token's hidden state is used to represent the entire text (a common practice for sentence embeddings).
- G-DPO Algorithm:
  - Negative Samples for Reward Model: Each positive sample is paired with 15 negative samples for constructing list-wise inputs to the reward model.
  - Reward Model Architecture: A two-layer MLP with a sigmoid activation function.
  - LoRA (Low-Rank Adaptation) (Hu et al. 2022) is used for efficient LLM fine-tuning (see the configuration sketch after this list).
    - Rank (r): 8. LoRA injects learnable low-rank matrices into the Transformer architecture, significantly reducing the number of trainable parameters; a rank of 8 is the dimension of these low-rank matrices.
    - Dropout Rate: 0.05.
    - Fine-tuning Scope: Only the key and value layers of the Transformer are fine-tuned, a strategy to preserve the LLM's general semantic understanding while adapting it to the specific task.
  - DPO Loss Parameters:
    - $\beta$: 0.9.
    - Similarity Constraint: Euclidean distance with a weight $\lambda$. In-batch contrastive learning is used within the similarity constraint.
  - Domain-Mixed DPO Preference Pairs:
    - Co-occurrence Score Threshold: 1200 for filtering domain collaborative pairs.
    - Negative Item Selection: Randomly select one negative item per sample to construct training pairs.
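A sketch of the LoRA setup with the stated rank and dropout, using the Hugging Face peft library. The exact BGE checkpoint, the lora_alpha value, and the target module names are assumptions (BERT-style encoders such as BGE expose the key/value projections as "key" and "value").

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModel

base = AutoModel.from_pretrained("BAAI/bge-large-zh-v1.5")  # assumed checkpoint, 1024-dim
lora_cfg = LoraConfig(
    r=8,                              # low-rank dimension, as reported
    lora_alpha=16,                    # scaling factor (assumed; not reported)
    lora_dropout=0.05,                # as reported
    target_modules=["key", "value"],  # fine-tune only the key/value projections
)
policy_model = get_peft_model(base, lora_cfg)
policy_model.print_trainable_parameters()  # only the low-rank adapters train
```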
5.5. Parameters Settings
- Batch Size: 10,240.
- Optimal Settings: Selected based on model performance on validation data.
- Embedding Dimension per Feature: 8.
- Predict MLP Tower Dimensions: [32, 16, 1].
- Optimizer: AdamW.
- Learning Rate: 0.1.
- Scheduler: A StepLR scheduler that decays the learning rate by a factor of 0.9 every 500 steps (see the sketch after this list).
- Hardware: All experiments are conducted on two GPUs, each equipped with 48GB of memory.
- Training Workflow:
  - Train the reward model.
  - Perform LLM alignment using the G-DPO algorithm.
  - Finally, train the quantization model.
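A sketch of the stated optimizer and schedule (AdamW, learning rate 0.1, StepLR decaying by 0.9 every 500 steps); the model and the loss are placeholders.

```python
import torch

# Placeholder network shaped like the [32, 16, 1] prediction tower above.
model = torch.nn.Sequential(torch.nn.Linear(32, 16), torch.nn.ReLU(), torch.nn.Linear(16, 1))
optimizer = torch.optim.AdamW(model.parameters(), lr=0.1)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=500, gamma=0.9)

for step in range(1000):                       # training-loop skeleton
    optimizer.zero_grad()
    loss = model(torch.randn(128, 32)).mean()  # stand-in loss
    loss.backward()
    optimizer.step()
    scheduler.step()                           # lr *= 0.9 at steps 500, 1000, ...
```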
6. Results & Analysis
6.1. Overall Performance (RQ1)
6.1.1. Discriminative Recommendation Results
The following are the results from Table 1 of the original paper:
| Method | DIN | DIEN | SIM | TWIN | ETA |
| Origin | 0.5859 | 0.6255 | 0.5884 | 0.5898 | 0.5903 |
| + Res-KMeans (Luo et al. 2024) | 0.6100↑+0.0241 | 0.6369↑+0.0114 | 0.6063↑+0.0179 | 0.6087↑+0.0189 | 0.6077↑+0.0174 |
| + RQ-VAE (Rajput et al. 2023) | 0.6185↑+0.0326 | 0.6364↑+0.0109 | 0.6111↑+0.0227 | 0.6153↑+0.0255 | 0.6153↑+0.0250 |
| + Lin et al. (Lin et al. 2025) | 0.6161↑+0.0302 | 0.6368↑+0.0113 | 0.6107↑+0.0223 | 0.6148↑+0.0250 | 0.6148↑+0.0245 |
| + RQ-VAE-ngram (Zheng et al. 2025b) | 0.6163↑+0.0304 | 0.6354↑+0.0099 | 0.6116↑+0.0232 | 0.6129↑+0.0231 | 0.6145↑+0.0242 |
| + LGSID (Ours) | 0.6276↑+0.0417 | 0.6484↑+0.0229 | 0.6224↑+0.0340 | 0.6263↑+0.0365 | 0.6274↑+0.0371 |
Analysis:
Table 1 presents the offline AUC scores on the Kuaishou local-life dataset for various discriminative recommendation models augmented with different item tokenization schemes. The Origin row refers to the performance of the backbone models without any LLM-driven tokenization.
- Consistent Outperformance: LGSID consistently delivers the largest absolute gains across all evaluated backbone models (DIN, DIEN, SIM, TWIN, ETA). For instance, it improves DIN's AUC by +0.0417, DIEN by +0.0229, SIM by +0.0340, TWIN by +0.0365, and ETA by +0.0371. These are significant improvements in AUC, a metric where even small gains are considered impactful.
- Impact of Geographic Awareness: The authors attribute these improvements to LGSID's two main components. For attention-based models like DIN, DIEN, and SIM, traditional flat item IDs limit the influence of spatial proximity on attention scores. By injecting the aligned LLM's spatial knowledge through the G-DPO phase, each ID becomes a geography-aware embedding that encodes real-world spatial distance and neighborhood co-visit patterns. This richer representation allows attention mechanisms to better capture relevant local items.
- Efficiency for Latency-Constrained Models: For models like TWIN and ETA, which operate under strict latency constraints, compact yet informative codes are crucial. LGSID's hierarchical quantization first compresses geo-textual attributes into coarse primary tokens and then progressively refines residuals with geographic context. This approach yields richer item representations without expanding the embedding table, leading to notable uplifts (e.g., TWIN by 3.65% relative and ETA by 3.71% relative). This indicates that LGSID provides a good balance between expressiveness and efficiency.
- Comparison with Other Tokenization Methods: LGSID outperforms other LLM-driven tokenization methods such as Res-KMeans, RQ-VAE, Lin et al., and RQ-VAE-ngram. This suggests that LGSID's explicit RL-based geographic alignment and hierarchical tokenization strategy are more effective at capturing the specific domain constraints of local-life recommendation than methods that focus purely on semantic or general collaborative signals.
6.1.2. Generative Recommendation Results
The following are the results from Table 2 of the original paper:
| Method | TIGER | | | | OneRec | | | |
| | Hit@5 | Hit@10 | NDCG@5 | NDCG@10 | Hit@5 | Hit@10 | NDCG@5 | NDCG@10 |
| RQ-VAE (Rajput et al. 2023) | 0.3087 | 0.3880 | 0.2255 | 0.2512 | 0.3739 | 0.4534 | 0.2798 | 0.3056 |
| Lin et al. (Lin et al. 2025) | 0.1767 | 0.2067 | 0.1335 | 0.1432 | 0.2950 | 0.3346 | 0.2272 | 0.2401 |
| RQ-VAE-ngram (Zheng et al. 2025b) | 0.2991 | 0.3769 | 0.2158 | 0.2411 | 0.3626 | 0.4358 | 0.2720 | 0.2957 |
| LGSID (Ours) | 0.3921 | 0.5077 | 0.2817 | 0.3191 | 0.4435 | 0.5537 | 0.3304 | 0.3661 |
| IMP | 27.01% | 30.83% | 24.94% | 27.05% | 18.63% | 22.13% | 18.09% | 19.79% |
Analysis:
Table 2 presents the Hit@K and NDCG@K scores for generative recommendation models (TIGER and OneRec) when integrated with different quantization methods.
- LGSID's Superiority: LGSID significantly outperforms all other quantization methods across both TIGER and OneRec backbones and all metrics (Hit@5, Hit@10, NDCG@5, NDCG@10).
  - For TIGER, LGSID achieves an IMP (relative improvement over the best baseline) of 27.01% for Hit@5 and 30.83% for Hit@10.
  - For OneRec, the improvements are 18.63% for Hit@5 and 22.13% for Hit@10.
  - Similarly strong improvements are observed for the NDCG@K metrics.
- Weak Performance of Lin et al.: The method by Lin et al. (UnifiedSID) obtains the worst performance. The authors hypothesize this is "possibly because it uses different distance functions across various levels of codewords, leading to convergence difficulties and challenges in model optimization." This highlights the importance of consistent and well-designed quantization strategies.
- Limitations of Geographic-Agnostic Methods: RQ-VAE and RQ-VAE-ngram achieve comparable performance but do not explicitly consider geographical constraints. Their suboptimal performance in local-life recommendation scenarios underscores LGSID's core motivation and innovation. By introducing RL-based Alignment to generate geographically aware representations and then transferring these into semantic IDs through hierarchical geographic item tokenization, LGSID directly addresses the domain-specific challenges, leading to more relevant and accurate generative recommendations.
6.2. RL-based LLM Alignment Analysis (RQ2)
This experiment evaluates the effectiveness of the RL-based LLM Alignment module (specifically, the G-DPO algorithm) by measuring how fine-tuning impacts the LLM's geographic awareness while preserving semantic understanding. Two key metrics are used:
- Semantic Similarity: measures the semantic relevance of the retrieved Top-K items.
- Geographic Awareness: measured by the coverage of retrieved items sharing the same province, city, and town as the target item (P@K, C@K, and T@K, respectively).

The following are the results from Table 3 of the original paper (columns: semantic similarity Top@K, province coverage P@K, city coverage C@K, town coverage T@K):

| Method | Top@5 | Top@10 | Top@100 | P@5 | P@10 | P@100 | C@5 | C@10 | C@100 | T@5 | T@10 | T@100 |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| Origin | 0.9204 | 0.9133 | 0.8833 | 0.8716 | 0.8410 | 0.6681 | 0.7342 | 0.6827 | 0.4372 | 0.1601 | 0.1328 | 0.0552 |
| DPO-PR | 0.71 | 0.8679 | 0.8286 | 0.9001 | 0.8752 | 0.747 | 0.7478 | 0.713 | 0.5064 | 0.1452 | 0.1167 | 0.0445 |
| DPO-LR | 0.7595 | 0.7478 | 0.7088 | 0.8995 | 0.8648 | 0.6560 | 0.8681 | 0.8254 | 0.5783 | 0.5584 | 0.4966 | 0.2480 |
| DPO-LRD | 0.7411 | 0.7288 | 0.6876 | 0.8715 | 0.8302 | 0.6012 | 0.8277 | 0.7755 | 0.5043 | 0.6114 | 0.5435 | 0.2620 |
| DPO-LRDM | 0.8107 | 0.7954 | 0.7401 | 0.9047 | 0.8773 | 0.7261 | 0.7812 | 0.7329 | 0.5218 | 0.1816 | 0.1481 | 0.0625 |
| DPO-LRDMS | 0.8856 | 0.8754 | 0.8283 | 0.9960 | 0.9936 | 0.9662 | 0.9548 | 0.9352 | 0.8130 | 0.4030 | 0.3525 | 0.2260 |
| G-DPO (Ours) | 0.8977 | 0.8892 | 0.8504 | 0.9905 | 0.9852 | 0.9307 | 0.9173 | 0.8858 | 0.7065 | 0.294 | 0.2432 | 0.1290 |
| IMP | -2.47% | -2.64% | -3.72% | +13.64% | +17.15% | +39.31% | +24.94% | +29.75% | +61.60% | +83.64% | +83.13% | +133.70% |
Breakdown of G-DPO variants (a minimal sketch of the resulting objective appears after this list):
- Origin: the baseline LLM before any G-DPO alignment.
- DPO-PR: DPO with a pairwise reward.
- DPO-LR: DPO with a list-wise reward.
- DPO-LRD: DPO with a list-wise reward and density-aware sampling.
- DPO-LRDM: DPO with a list-wise reward, density-aware sampling, and domain-mixed preference pairs.
- DPO-LRDMS: DPO with a list-wise reward, density-aware sampling, domain-mixed preference pairs, and similarity regularization.
- G-DPO (Ours): the final proposed G-DPO algorithm. The IMP row reports the relative change of G-DPO over the Origin baseline (per-column arithmetic confirms this: e.g., T@5 0.294 vs. 0.1601 gives +83.64%, and Top@5 0.8977 vs. 0.9204 gives -2.47%).
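To ground the variant names, here is a minimal PyTorch sketch of a DPO-style preference loss with an added similarity-regularization term, the ingredient that distinguishes DPO-LRDMS and G-DPO from the earlier variants. The function name, the cosine form of the regularizer, and the β/λ defaults are assumptions for illustration; the paper's exact formulation may differ.

```python
import torch.nn.functional as F

def gdpo_loss(pol_logp_w, pol_logp_l, ref_logp_w, ref_logp_l,
              pol_emb, ref_emb, beta=0.1, lam=1.8):
    # DPO preference term over (winner, loser) pairs; under list-wise,
    # density-aware sampling, the pairs come from ranked geographic lists.
    margin = beta * ((pol_logp_w - ref_logp_w) - (pol_logp_l - ref_logp_l))
    pref = -F.logsigmoid(margin).mean()
    # Similarity regularization: anchor the policy's item representations
    # to the frozen reference LLM to preserve semantic understanding.
    sim_reg = 1.0 - F.cosine_similarity(pol_emb, ref_emb, dim=-1).mean()
    return pref + lam * sim_reg
```

Without the `sim_reg` term this sketch corresponds to the DPO-LR family; adding it yields the DPO-LRDMS/G-DPO behavior analyzed below, where semantic similarity recovers while geographic coverage stays high.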
Analysis:
- Pure Semantic Understanding vs. Geographic Awareness (Origin): The Origin model has high semantic similarity (Top@5: 0.9204) but relatively low town coverage (T@5: 0.1601; T@100: 0.0552). This confirms that a pre-trained LLM, relying solely on text, can find semantically similar items but struggles to capture real-world geographic distances. It may confuse "Suzhou, Anhui" with "Suzhou, Jiangsu" if only textual similarity is considered.
- Effectiveness of Reward Model Training (DPO-LR vs. DPO-PR vs. Origin):
  - Comparing DPO-LR to DPO-PR shows that list-wise modeling is more effective. DPO-LR significantly boosts T@5 from 0.1601 (Origin) to 0.5584 and C@5 from 0.7342 to 0.8681, demonstrating that the reward model effectively captures and transfers geographic knowledge into the LLM.
  - The DPO-PR model, surprisingly, shows a decrease in town coverage compared to Origin (T@5: 0.1452 vs. 0.1601), suggesting that simple pairwise rewards might be insufficient, or could even mislead the model, if not carefully designed for the geographic context.
- Impact of Density-Aware Sampling (DPO-LRD vs. DPO-LR):
  - Adding density-aware list-wise modeling (DPO-LRD) further boosts T@5 from 0.5584 (DPO-LR) to 0.6114. This gain comes from better distinguishing near-distance from far-distance negative samples, making the model more sensitive to geographic nuances.
  - However, this improvement in geographic awareness comes at the expense of semantic comprehension: DPO-LRD's semantic similarity scores are lower than Origin's.
- Mitigating the Trade-off with Domain-Mixed Samples (DPO-LRDM vs. DPO-LRD):
  - Applying a mixed-sample strategy (incorporating domain collaborative pairs) in DPO-LRDM improves semantic similarity (Top@5: 0.8107 vs. 0.7411 for DPO-LRD) and province/city coverage (e.g., P@5: 0.9047 vs. 0.8715 for DPO-LRD). This strategy improves sample discrimination and integrates collaborative signals, mitigating the semantic trade-off. However, DPO-LRDM surprisingly shows a drop in town coverage compared to DPO-LRD (T@5: 0.1816 vs. 0.6114), which could indicate that simply mixing collaborative signals without explicit regularization dilutes the fine-grained geographic focus.
- Balancing with Similarity Regularization (DPO-LRDMS and G-DPO):
  - The introduction of textual similarity regularization (DPO-LRDMS) is crucial. DPO-LRDMS achieves high semantic similarity (Top@5: 0.8856), recovering much of the original semantic understanding, while significantly boosting geographic coverage (P@5: 0.9960; C@5: 0.9548; T@5: 0.4030). This demonstrates that balancing domain-specific geographic knowledge with semantic preservation is essential for optimal results.
  - G-DPO (Ours) achieves slightly higher semantic similarity than DPO-LRDMS (Top@5: 0.8977 vs. 0.8856) while keeping strong province and city coverage (P@5: 0.9905; C@5: 0.9173). Its town coverage (T@5: 0.294) is better than Origin's and DPO-LRDM's but lower than DPO-LRD's and DPO-LRDMS's. The improvements in geographic coverage over Origin are substantial, especially for broader regions (province and city), highlighting the effectiveness of the G-DPO design in injecting geographical knowledge. The negative IMP values for similarity reflect a slight trade-off in raw semantic similarity relative to Origin, which is compensated by large gains in geographic awareness, the primary goal.

In conclusion, the ablation study clearly validates that pure semantic understanding is insufficient for geographic awareness. The reward model and list-wise modeling effectively transfer geographic knowledge; density-aware sampling further refines it; and domain-mixed sampling helps integrate collaborative signals. Crucially, similarity regularization in G-DPO maintains semantic understanding while achieving strong geographic awareness, leading to a balanced and effective alignment of the LLM for local-life recommendation.
6.3. Visualization Analysis (RQ3)
6.3.1. T-SNE Visualization of Cluster Centroids
The following is the result from Figure 3 of the original paper:

The figure is a T-SNE visualization showing how items around different cluster centroids are distributed under different tokenization settings. The top row (a-c) shows item distributions without alignment; the bottom row (d-f) shows distributions after alignment. The degree of spatial cohesion changes markedly between the two, and the NMI metric quantifies this change.
Figure 3: T-SNE visualization of items around cluster centroids across tokenization methods.
Analysis:
Figure 3 shows T-SNE visualizations of item embeddings around cluster centroids, comparing unaligned (top row: a, b, c) and RL-based Geographic LLM Aligned (bottom row: d, e, f) items at different geographic granularities (Province, City, District). T-SNE is a dimensionality reduction technique used to visualize high-dimensional data in a 2D or 3D space, preserving local structures.
- Improved Geographic Cohesion: After RL-based Geographic LLM Alignment, the cluster centers for Province, City, and District become significantly tighter and more distinct for geographically similar items, indicating that the aligned LLM embeddings effectively group items by their actual geographic proximity.
- Quantitative Measure (NMI): The authors report Normalized Mutual Information (NMI) to assess agreement between the model's clustering partition and the ground-truth geographic labels. NMI jumps from 0.0137-0.0845 (unaligned) to 0.6430-0.8644 (aligned).
  - NMI Definition: NMI is an external clustering-evaluation metric that compares a clustering to a ground-truth partition. A score closer to 1 indicates better agreement; 0 means no mutual information. (A minimal computation sketch follows this list.)
- Mechanism of Improvement: The key reason for this substantial NMI improvement is that G-DPO uses distance-aware list-wise rewards to inject real-world spatial relationships into the LLM. The LLM's token embeddings thus inherently carry geographic priors: embeddings of geographically close items lie closer in the embedding space, which makes the subsequent clustering far more effective at recovering true geographic groupings.
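As a concrete reference, here is a minimal scikit-learn sketch of computing NMI between tokenizer cluster assignments and ground-truth geographic labels, plus the 2-D t-SNE projection used for plots like Figure 3. The arrays below are random stand-ins; variable names are illustrative.

```python
import numpy as np
from sklearn.manifold import TSNE
from sklearn.metrics import normalized_mutual_info_score

rng = np.random.default_rng(0)
item_embs = rng.normal(size=(500, 64))        # stand-in for LLM item embeddings
city_labels = rng.integers(0, 10, size=500)   # ground-truth geographic labels
cluster_ids = rng.integers(0, 10, size=500)   # tokenizer / clustering assignments

nmi = normalized_mutual_info_score(city_labels, cluster_ids)
xy = TSNE(n_components=2, random_state=0).fit_transform(item_embs)
print(f"NMI={nmi:.4f}, t-SNE coords shape={xy.shape}")
```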
6.3.2. Token Quantile Percentiles (Coverage Performance)
The following is the result from Figure 4 of the original paper:

The figure is a radar chart comparing token quantile percentiles across hierarchical levels for the aligned and unaligned settings. It shows how methods such as RQ-VAE and Lin et al. perform at Level 1, Level 2, and Level 3, with particular emphasis on the effect of the LGSID method.
Figure 4: Token quantile percentiles across hierarchical levels for local-life items.
Analysis:
Figure 4 presents a radar chart showing token quantile percentiles across different hierarchical levels, comparing LGSID with other quantization methods (RQ-VAE, Lin et al., RQ-VAE-ngram) and LGSID w/o Alignment. Quantile percentiles indicate how well tokens represent instances across the distribution spectrum; a larger area in the radar chart implies better coverage performance (tokens can represent more instances).
- Level-1 Consistency: At Level 1, LGSID shows remarkable consistency between its aligned and non-aligned settings, with identical coverage patterns. Both LGSID and LGSID w/o Alignment maintain 11k coverage at the 90% quantile, while RQ-VAE decays to 8k. This suggests that the initial layer, which leverages pre-computed geographic clusters, establishes a robust foundation regardless of the LLM alignment status.
- Pronounced Advantages at Finer Granularities: As the analysis moves to finer granularities (Level 2 and Level 3), LGSID's advantages become markedly more pronounced: its radar-chart area is the largest of all methods, including LGSID w/o Alignment. This indicates that:
  - the RL-based Geographic LLM Alignment (G-DPO) plays a crucial role in enriching the LLM's representations with fine-grained geographic awareness; and
  - the hierarchical quantization strategy effectively uses these aligned representations to refine residual tokens, yielding tokens that accurately represent a wider and more diverse range of item instances at deeper levels of abstraction.
- Overall Effectiveness: The radar chart validates that LGSID not only provides good coverage at the coarse geographic level but also substantially improves the tokenization's expressive power and representational capacity at finer, more granular levels, thanks to its alignment and hierarchical structure. This translates into better capture of local-life item characteristics. A minimal sketch of the per-level coverage statistic appears below.
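The exact definition of the radar-chart statistic is not spelled out here, so the following is a hedged sketch, assuming "token quantile percentile" means the per-token item count at a given usage quantile for one SID level; the paper's actual computation may differ.

```python
import numpy as np

def token_coverage_quantiles(level_tokens, qs=(50, 70, 90)):
    """Count items assigned to each token at one SID level, then report
    the per-token item count at the given usage quantiles. This is an
    assumed reading of the radar-chart statistic, not the paper's code."""
    _, counts = np.unique(level_tokens, return_counts=True)
    return {q: float(np.percentile(counts, q)) for q in qs}

# Illustrative usage with random token assignments:
tokens = np.random.default_rng(0).integers(0, 256, size=100_000)
print(token_coverage_quantiles(tokens))
```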
6.4. Case Study (RQ4)
The case study, illustrated in Figure 5, visually demonstrates LGSID's superior geographic awareness compared to other SID methods by examining the allocation of three-layer discrete tokens of the LGSID hierarchical quantizer with and without RL-based G-DPO alignment.
The following is the result from Figure 5 of the original paper:

The figure shows the hierarchical category frequency distribution of LGSID under different SID prefixes, in six panels: (a)-(c) give LGSID's distributions at each level, while (d)-(f) give the corresponding distributions without alignment. Each panel compares the frequencies of cuisine categories (e.g., Japanese cuisine, seafood, fast food), enabling analysis of how alignment changes the category frequencies.
Figure 5: Hierarchical category frequency distribution of LGSID for different SID prefixes (Aligned vs Unaligned)
Analysis:
Figure 5 shows side-by-side comparisons of the hierarchical category frequency distribution for LGSID tokens, both aligned (a, b, c) and unaligned (d, e, f), across three layers.
- First Layer (Coarse Geographic Clustering): Since the first layer uses pre-computed geographic clusters (as described in the methodology), the overall category distributions in Figure 5(a) (aligned) and Figure 5(d) (unaligned) appear similar. This initial layer establishes a broad geographic grouping.
- Impact of RL-based Alignment on Cohesion:
  - Figure 5(b) shows an aligned Level-1 token. After RL-based alignment, this token cleanly groups the entire "BBQ & Grilled" branch under one coarse identifier: items in this category, even with some semantic variations, are consistently assigned to the same geographic-semantic token thanks to the LLM's improved distance-aware representation.
  - In stark contrast, Figure 5(e) depicts the same layer without alignment: the same "BBQ & Grilled" restaurants are scattered across multiple Level-1 tokens.
- Importance of Upstream LLM Quality:
  - This scattering in the unaligned case occurs because the LLM embeddings have not been post-trained with G-DPO to respect distance-aware rewards. Without alignment, the LLM's representations may prioritize subtle textual semantic differences over critical geographic proximity, leading to fragmented clustering.
  - As a result, the unaligned Level-1 tokens lose category cohesion. This is detrimental because the hierarchical quantizer in subsequent layers can no longer rely on a shared, coherent root to refine sub-categories: if the initial broad categories are fragmented, deeper layers struggle to build meaningful, consistent granular tokens.
- Validation of G-DPO's Role: This case study vividly illustrates the importance of upstream LLMs and the quality of their embeddings in the item tokenization process (a small sketch for reproducing such per-prefix category histograms follows). G-DPO alignment ensures that the LLM's representations are robustly geography-aware, leading to more coherent and meaningful hierarchical tokens that accurately reflect real-world spatial and categorical relationships. This translates directly into more effective local-life recommendations.
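For reference, here is a minimal pure-Python sketch of the kind of per-prefix category histogram plotted in Figure 5: group items by their SID prefix up to a given level and count category frequencies. The function name and the made-up SIDs/labels in the usage example are illustrative.

```python
from collections import Counter, defaultdict

def prefix_category_hist(sids, categories, level=1):
    """Group items by their SID prefix (first `level` tokens) and count
    category frequencies under each prefix, as visualized in Figure 5."""
    hist = defaultdict(Counter)
    for sid, cat in zip(sids, categories):
        hist[tuple(sid[:level])][cat] += 1
    return hist

# Illustrative usage with hypothetical SIDs and cuisine labels:
sids = [(3, 17, 201), (3, 17, 45), (3, 9, 12), (8, 2, 77)]
cats = ["BBQ & Grilled", "BBQ & Grilled", "Seafood", "Japanese Cuisine"]
for prefix, counts in prefix_category_hist(sids, cats, level=1).items():
    print(prefix, dict(counts))
```

A cohesive tokenizer shows one dominant category per prefix (as in the aligned panels); a fragmented one spreads the same category across many prefixes.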
6.4.1. Appendix C: Visualization Analysis for Other Quantizers
The appendix provides similar visualization analyses for other quantizers (Res-KMeans, RQ-VAE, Lin et al., RQ-VAE-ngram) when combined with G-DPO alignment. These visualizations reinforce the findings from the main LGSID case study.
The following are the results from Figure 7 of the original paper:

The figure shows the hierarchical category frequency distribution of Res-KMeans at each level (Level 1, Level 2, Level 3), aligned vs. unaligned, allowing the distribution and frequency changes of each cuisine category to be compared.
Figure 7: Hierarchical category frequency distribution of Res-KMeans for different SID prefixes (Aligned vs Unaligned).
Res-KMeans (Figure 7): With G-DPO alignment, Res-KMeans coarse tokens cleanly unify the "Local Cuisine" and "Specialty Cuisine" branches. Without alignment, these categories are scattered. This demonstrates that distance-aware rewards preserve category cohesion even with Res-KMeans.
The following are the results from Figure 8 of the original paper:

The figure shows the hierarchical category frequency distribution of RQ-VAE under different SID prefixes, for both aligned and unaligned settings, across Level 1, Level 2, and Level 3.
Figure 8: Hierarchical category frequency distribution of RQ-VAE for different SID prefixes (Aligned vs Unaligned)
RQ-VAE (Figure 8): For RQ-VAE, the G-DPO-aligned LLM embeddings yield dedicated coarse tokens that consistently capture "Local Cuisine" and "Specialty Cuisine". This cohesion is attributed to the RL-based reward encouraging distance-aware grouping, providing a homogeneous root for subsequent layers.
The following are the results from Figure 9 of the original paper:

The figure compares the token distributions of Lin et al. with and without alignment across three levels (Level 1, Level 2, Level 3); each panel's horizontal axis gives item counts and the vertical axis lists cuisine categories, illustrating the models' differences in local-life recommendation.
Figure 9: Hierarchical category frequency distribution of Lin et al. for different SID prefixes (Aligned vs Unaligned).
Lin et al. (Figure 9): Even for the Lin et al. quantizer, the aligned system assigns dedicated Level-1 tokens to "Local Cuisine" and "Japanese Cuisine", eliminating the scattering observed without alignment. This confirms that G-DPO steers the LLM to respect distance-based rewards, ensuring unambiguous parent tokens for downstream levels.
The following are the results from Figure 10 of the original paper:

The figure shows bar charts of the RQ-VAE-ngram distributions at each level, aligned vs. unaligned, covering categories such as local cuisine and Japanese cuisine together with their frequencies.
Figure 10: Hierarchical category frequency distribution of RQ-VAE-ngram for different SID prefixes (Aligned vs Unaligned).
RQ-VAE-ngram (Figure 10): For the RQ-VAE-ngram variant, G-DPO alignment results in "Local Cuisine" and "Specialty Cuisine" each being captured by a single coarse token, with other cuisines remaining compactly grouped. The absence of fragmentation demonstrates that ngram-aware embeddings, once aligned, still honor geographic cohesion, providing a stable foundation.
These visualizations consistently underscore the critical role of RL-based Geographic LLM Alignment in improving the quality and cohesion of item tokenization across various quantization methods.
6.5. Robustness Analysis
The following are the results from Table 5 of the original paper:
| Model | origin (AUC) | +LGSID mean±SD | +LGSID 95% CI | +LGSID median (IQR) |
| :--- | :--- | :--- | :--- | :--- |
| DIN | 0.5859 | 0.6259±.0013* | [0.6248, 0.6269] | 0.6261(0.0020) |
| DIEN | 0.6255 | 0.6482±.0019* | [0.6462, 0.6502] | 0.6483(0.0022) |
| SIM | 0.5884 | 0.6259±.0022* | [0.6236, 0.6282] | 0.6269(0.0035) |
| TWIN | 0.5898 | 0.6272±.0022* | [0.6250, 0.6295] | 0.6276(0.0030) |
| ETA | 0.5903 | 0.6272±.0010* | [0.6261, 0.6283] | 0.6274(0.0011) |
Analysis:
Table 5 presents a robustness analysis comparing the origin baseline models with their performance when augmented with LGSID across five different backbones (DIN, DIEN, SIM, TWIN, ETA). The analysis reports mean AUC with standard deviation (SD), the 95% confidence interval (CI), and the median with interquartile range (IQR) over 6 random seeds. The * denotes statistically significant improvements under the Wilcoxon signed-rank test.
- Universal Performance Elevation: LGSID consistently improves the mean AUC across all five discriminative backbones. For instance, DIN rises from 0.5859 to 0.6259, an absolute gain of 0.0400 AUC (a 6.8% relative uplift). Similarly substantial gains hold for DIEN (+0.0227), SIM (+0.0375), TWIN (+0.0374), and ETA (+0.0369).
- Exceptional Stability: Beyond raw performance, LGSID also demonstrates exceptional stability against random-seed variation.
  - The standard deviations (SDs) for +LGSID are consistently small (0.0010 to 0.0022), indicating low run-to-run variability.
  - The 95% confidence intervals (CIs) are narrow (e.g., DIN: [0.6248, 0.6269]), indicating high precision and reliability of the reported mean AUCs.
  - The interquartile ranges (IQRs) are also small (e.g., ETA: 0.0011), showing that the central 50% of results are tightly clustered and insensitive to extreme seeds.
  - Medians and means align almost perfectly, further indicating a symmetric and consistent performance distribution.
- Statistical Significance: The Wilcoxon signed-rank test uniformly confirms the statistical significance of LGSID's improvements over the origin baselines across all backbones, so the observed gains are not due to chance. (A minimal sketch of this test appears below.)

In summary, the robustness analysis strongly evidences that LGSID not only universally elevates performance in local-life recommendation but also endows every backbone with stability and reproducibility against the inherent randomness of experimental setups, a crucial property for deployment in real-world industrial applications.
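As a reference for the significance test, here is a minimal SciPy sketch. The per-seed AUC values are hypothetical stand-ins (the table reports only summary statistics), chosen to roughly match DIN's row.

```python
from scipy.stats import wilcoxon

# Hypothetical per-seed AUCs over 6 seeds; paired one-sided test asks
# whether +LGSID beats the origin model on the same seeds.
auc_origin = [0.5848, 0.5861, 0.5852, 0.5870, 0.5855, 0.5868]
auc_lgsid  = [0.6247, 0.6259, 0.6251, 0.6270, 0.6255, 0.6274]

stat, p = wilcoxon(auc_lgsid, auc_origin, alternative="greater")
print(f"W={stat}, p={p:.4f}")  # with n=6 and all positive diffs, p ≈ 0.0156
```

Note that with only 6 seeds the smallest attainable one-sided p-value is 1/64 ≈ 0.0156, so a uniform significant result means +LGSID won on every seed for every backbone.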
6.6. Parameters Sensitivity
The following are the results from Table 6 of the original paper:
| λ | Top@5 | Top@10 | Top@100 | P@5 | P@10 | P@100 | C@5 | C@10 | C@100 | T@5 | T@10 | T@100 |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| 1.0 | 0.8856 | 0.8754 | 0.8283 | 0.9960 | 0.9936 | 0.9662 | 0.9548 | 0.9352 | 0.8130 | 0.4030 | 0.3525 | 0.2260 |
| 1.5 | 0.8920 | 0.8828 | 0.8395 | 0.9943 | 0.9908 | 0.9556 | 0.9408 | 0.9152 | 0.7673 | 0.3431 | 0.2910 | 0.1688 |
| 1.8 | 0.8977 | 0.8892 | 0.8504 | 0.9905 | 0.9852 | 0.9347 | 0.9173 | 0.8858 | 0.7065 | 0.2924 | 0.2432 | 0.1290 |

(Columns: semantic similarity Top@K, province coverage P@K, city coverage C@K, town coverage T@K.)
Analysis:
Table 6 details the sensitivity of the similarity-loss weight λ in G-DPO across similarity and geographic coverage metrics at Top@K levels. This parameter controls the trade-off between preserving semantic understanding and enhancing geographic perception.
- Impact on Semantic Similarity: As λ increases from 1.0 to 1.8, the semantic similarity metrics (Top@5, Top@10, Top@100) also increase; for instance, Top@5 similarity rises from 0.8856 (λ = 1.0) to 0.8977 (λ = 1.8). This is expected: a higher λ places more weight on the similarity-regularization term, which explicitly keeps the policy model's embeddings close to the reference model's original semantic embeddings.
- Impact on Geographic Coverage: Conversely, as λ increases, the geographic coverage metrics (P@K, C@K, T@K) generally decrease; for example, town coverage T@5 drops from 0.4030 (λ = 1.0) to 0.2924 (λ = 1.8). This reveals the trade-off: prioritizing semantic preservation with a higher λ reduces the LLM's ability to focus on fine-grained geographic distinctions that deviate from its original semantic space.
- Balancing Act: The results verify that adjusting λ gives flexible control over the LLM's balance between content understanding and spatial perception.
  - A higher λ (e.g., 1.8) maintains stronger semantic consistency at the cost of some geographic focus.
  - A lower λ (e.g., 1.0) maximizes geographic coverage at a modest cost in semantic similarity.
- Generality and Task-Specific Adaptation: This sensitivity analysis confirms the generality of LGSID: users or developers can adapt the LLM to a specific task by tuning λ toward higher semantic precision or stronger geographic accuracy. For local-life recommendation, finding the optimal balance is key to providing recommendations that are both relevant and accessible. (A small sketch of how λ enters the objective follows.)
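Tying this back to the earlier `gdpo_loss` sketch: λ simply scales the similarity regularizer in the total alignment objective. A tiny illustration (the component loss values are made up):

```python
def total_alignment_loss(pref_loss: float, sim_reg: float, lam: float) -> float:
    """Preference term plus lambda-weighted similarity regularizer; lam is
    the knob swept in Table 6 (1.0, 1.5, 1.8)."""
    return pref_loss + lam * sim_reg

for lam in (1.0, 1.5, 1.8):
    # Illustrative component values; a higher lam anchors semantics harder.
    print(lam, total_alignment_loss(pref_loss=0.42, sim_reg=0.11, lam=lam))
```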
7. Conclusion & Reflections
7.1. Conclusion Summary
This paper introduces LGSID, an LLM-Aligned Geographic Item Tokenization Framework specifically designed to imbue semantic IDs with real-world spatial awareness for local-life recommendation. The framework addresses the critical limitation of existing LLM-driven recommendation methods that often fail to adequately capture fine-grained geographic characteristics and distance awareness.
The core contributions lie in its two key modules:
- RL-based Geographic LLM Alignment: LGSID first trains a list-wise reward model using density-aware negative sampling to capture relative spatial distances and real-world geographic relationships. This reward model then guides a novel G-DPO algorithm, which fine-tunes the LLM to inject generalized spatial knowledge and collaborative signals. Crucially, G-DPO incorporates similarity regularization so that the LLM's inherent semantic understanding is preserved while it gains profound geographic awareness.
- Hierarchical Geographic Item Tokenization: Following alignment, a hierarchical tokenization strategy generates primary tokens from discrete spatial and content attributes, establishing a strong geographic prior. Residual tokens are then refined using the aligned LLM's rich geographic representation vectors, allowing efficient compression and reconstruction of item information at multiple granularities.

Extensive experiments on real-world Kuaishou industry datasets consistently demonstrate that LGSID significantly outperforms state-of-the-art discriminative and generative recommendation models. Ablation studies, visualizations, and case studies further validate the effectiveness of its components, particularly how G-DPO alignment enhances geographic cohesion in item representations and how the hierarchical tokenization leverages this for improved recommendation accuracy.
7.2. Limitations & Future Work
The authors implicitly or explicitly acknowledge several aspects that can be considered limitations or avenues for future work:
- Dataset Availability: The reliance on a single, proprietary Kuaishou industry dataset is a limitation. The authors note that public datasets satisfying delivery-distance constraints and containing detailed item text and geographic location information are scarce, and they plan to release their dataset in future work, which would be a significant contribution to the research community. This highlights a general challenge in LBS recommendation research.
- Generalizability to Diverse Geographies: While tested on Kuaishou data, the framework's performance might vary in regions with different geographic densities, administrative structures, or cultural nuances in local-life services.
- Dynamic Geographic Changes: The current framework focuses on learning static geographic relationships. Real-world scenarios might involve dynamic changes (e.g., new stores opening, changes in administrative boundaries), which would require adaptive mechanisms.
- Computational Cost of LLM Alignment: Fine-tuning LLMs, even with LoRA, can still be computationally intensive. Exploring more efficient alignment or distillation techniques could be valuable.
- Cold-Start for New Items/Locations: While the paper addresses limited user interaction opportunities for many items, the specific handling of cold-start items or entirely new geographic regions might warrant further dedicated investigation; the current method relies on historical data for reward-model training and collaborative signals.
- Beyond Geographic Distance: Local-life recommendation involves more than just distance; factors such as travel time, traffic conditions, delivery logistics, and user mobility patterns could be integrated for even more sophisticated geographic awareness.
- Multi-modal Integration: The paper primarily focuses on textual and geographic attributes. Integrating multimodal data (e.g., images of restaurants, videos of places) could provide richer item representations.
7.3. Personal Insights & Critique
This paper presents a highly relevant and innovative solution to a critical problem in local-life recommendation. The explicit focus on geographic awareness through LLM alignment is a significant step beyond simply concatenating location information into prompts.
Key Strengths and Inspirations:
- Rigorous Integration of RL and LLMs: The G-DPO algorithm is a well-thought-out innovation. It cleverly uses a pre-trained reward model (trained on explicit geographic distance) to guide LLM fine-tuning, directly addressing the LLM's inherent bias toward semantic similarity over geographic proximity. The similarity regularization is crucial for preventing catastrophic forgetting of semantic knowledge, providing a robust mechanism for balancing different forms of intelligence.
- Hierarchical Tokenization for Multi-Granularity: The hierarchical geographic item tokenization is particularly inspiring. It acknowledges that geographic information exists at multiple granularities (province, city, town) and that a multi-layered tokenization scheme can capture this efficiently. Using pre-computed geographic clusters for the primary layer and refining residuals with aligned LLM vectors is an elegant way to combine discrete attributes with dense representations. This approach could transfer to other domains requiring multi-granular feature representation (e.g., product categorization in e-commerce, where broad categories refine into sub-categories).
- Addressing Real-World Industry Challenges: The paper's roots in an industry setting (Kuaishou) ensure its practical relevance. The focus on efficiency for latency-constrained models and the use of density-aware hard negative sampling reflect real-world considerations for large-scale recommendation systems.
- Clear Ablation and Visualization: The ablation studies and T-SNE/radar-chart visualizations are excellent. They provide strong empirical evidence for the effectiveness of each component and clearly illustrate how G-DPO enhances geographic cohesion and how hierarchical tokens improve coverage.
Potential Issues or Areas for Improvement:
- Complexity of Reward Model Design: While the list-wise reward model is innovative, its design and training (especially density-aware hard negative sampling and distance-based soft labels) may be complex to tune effectively. The choice of negative samples and the co-occurrence threshold could be highly dataset-dependent.
- Interpretability of Tokens: While SIDs generally improve efficiency, the hierarchical geographic tokens themselves, particularly the residual layers, may remain hard for humans to interpret. Future work could explore ways to make these tokens more human-readable or to use them to generate natural-language explanations for recommendations.
- Implicit User Location: The paper assumes the user's location is known or inferable for local-life recommendations. Where user location is ambiguous or privacy-sensitive, this could be a challenge; the framework could potentially be extended to infer location preferences from interaction history when explicit location data is unavailable.
- Cost-Benefit Analysis for LLM Fine-tuning: While LoRA reduces costs, fine-tuning an LLM remains resource-intensive. A detailed cost-benefit analysis (e.g., the engineering and compute cost of G-DPO versus the measured performance gains) would be valuable for practical deployment decisions, especially for smaller companies without Kuaishou's resources.
- Scalability to Global Contexts: The focus is on local life within one country. For global platforms, the administrative codes (province, city, town) might not scale universally; a more abstract, region-agnostic geographic hierarchy might be needed.

Overall, LGSID offers a robust and effective solution for integrating geographic awareness into LLM-driven recommendation systems. Its methodical approach to RL-based alignment and hierarchical tokenization provides valuable insights and a strong foundation for future research in domain-specific LLM applications and spatially aware recommendation.