LLM-Aligned Geographic Item Tokenization for Local-Life Recommendation

Published: 11/18/2025

This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

The LGSID framework enhances local-life recommendation by integrating RL-based geographic alignment and hierarchical item tokenization to capture spatial relationships, outperforming existing models in empirical studies.

Abstract

Recent advances in Large Language Models (LLMs) have enhanced text-based recommendation by enriching traditional ID-based methods with semantic generalization capabilities. Text-based methods typically encode item textual information via prompt design and generate discrete semantic IDs through item tokenization. However, in domain-specific tasks such as local-life services, simply injecting location information into prompts fails to capture fine-grained spatial characteristics and real-world distance awareness among items. To address this, we propose LGSID, an LLM-Aligned Geographic Item Tokenization Framework for Local-life Recommendation. This framework consists of two key components: (1) RL-based Geographic LLM Alignment, and (2) Hierarchical Geographic Item Tokenization. In the RL-based alignment module, we initially train a list-wise reward model to capture real-world spatial relationships among items. We then introduce a novel G-DPO algorithm that uses pre-trained reward model to inject generalized spatial knowledge and collaborative signals into LLMs while preserving their semantic understanding. Furthermore, we propose a hierarchical geographic item tokenization strategy, where primary tokens are derived from discrete spatial and content attributes, and residual tokens are refined using the aligned LLM's geographic representation vectors. Extensive experiments on real-world Kuaishou industry datasets show that LGSID consistently outperforms state-of-the-art discriminative and generative recommendation models. Ablation studies, visualizations, and case studies further validate its effectiveness.

1. Bibliographic Information

1.1. Title

The central topic of the paper is LLM-Aligned Geographic Item Tokenization for Local-Life Recommendation. It focuses on enhancing recommendation systems for local-life services by integrating geographic awareness into item representation using large language models.

1.2. Authors

The authors are:

Hao Jiang (Kuaishou Technology, Beijing, China)
Guoquan Wang (Kuaishou Technology, Beijing, China)
Donglin Zhou (Kuaishou Technology, Beijing, China)
Sheng Yu (Kuaishou Technology, Beijing, China)
Yang Zeng (Kuaishou Technology, Beijing, China)
Wencong Zeng (Kuaishou Technology, Beijing, China)
Kun Gai (Independent Researcher)
Guorui Zhou (Kuaishou Technology, Beijing, China)

Most authors are affiliated with Kuaishou Technology, a major Chinese internet company known for its short-video platform, indicating a strong industry research background, particularly in recommendation systems and large-scale data applications. One author is an independent researcher.

1.3. Journal/Conference

The paper is published as an arXiv preprint. arXiv is a widely used platform for disseminating research quickly, but it is not a peer-reviewed journal or conference, and preprints may or may not later undergo formal peer review. The listed publication timestamp is 2025-11-18T07:54:32.000Z.

1.4. Publication Year

The publication year, based on the provided UTC timestamp, is 2025.

1.5. Abstract

This paper introduces LGSID, an LLM-Aligned Geographic Item Tokenization Framework designed for local-life recommendation. It addresses the limitation of current text-based recommendation methods that fail to capture fine-grained spatial characteristics and distance awareness, especially when simply injecting location information into prompts for Large Language Models (LLMs). LGSID comprises two main components: RL-based Geographic LLM Alignment and Hierarchical Geographic Item Tokenization. The RL-based alignment module first trains a list-wise reward model to internalize real-world spatial relationships among items. Then, a novel G-DPO algorithm leverages this reward model to infuse generalized spatial knowledge and collaborative signals into LLMs while preserving their semantic understanding. The Hierarchical Geographic Item Tokenization strategy generates primary tokens from discrete spatial and content attributes and refines residual tokens using the aligned LLM's geographic representation vectors. Extensive experiments on real-world Kuaishou industry datasets demonstrate that LGSID consistently outperforms state-of-the-art discriminative and generative recommendation models. Ablation studies, visualizations, and case studies further confirm its effectiveness.

1.6. Original Source Link

The official source link is: https://arxiv.org/abs/2511.14221
The PDF link is: https://arxiv.org/pdf/2511.14221v1.pdf
The paper is currently a preprint on arXiv.

2. Executive Summary

2.1. Background & Motivation

The paper addresses a critical challenge in local-life recommendation services, such as those offered by Kuaishou and Meituan. Traditional ID-based recommendation methods, which use unique identifiers for items and heavily rely on collaborative filtering (CF) signals, struggle to capture real-world spatial characteristics and distance awareness. This leads to issues like limited user interaction opportunities for many items (the cold-start problem for spatially constrained items) and unfair exposure as the candidate item corpus grows.

Recent advancements in Large Language Models (LLMs) have shown promise in text-based recommendation by leveraging their semantic understanding capabilities. The typical approach involves designing prompts to encode item textual information and then quantizing these into discrete semantic IDs (SIDs) via item tokenization. However, a major limitation arises in local-life services: simply injecting location information into prompts is insufficient. Pre-trained LLMs often prioritize content relevance over geographic proximity, leading to recommendations that are semantically relevant but geographically inaccessible (e.g., recommending a Shanghai restaurant to a Beijing user, as illustrated in Figure 1). This highlights a gap where existing LLM-driven methods lack fine-grained spatial characteristics and real-world distance awareness.

The core problem, therefore, is to effectively inject domain-specific geographic knowledge and collaborative signals into LLMs to generate geographically-aware item representations that are suitable for local-life recommendation tasks, while simultaneously preserving the LLM's semantic understanding.

2.2. Main Contributions / Findings

The paper's primary contributions are:

Identification of Limitations: It identifies the key limitations of existing LLM-driven item tokenization methods in spatially constrained scenarios, emphasizing the necessity of aligning LLMs with domain-specific knowledge (i.e., geographic awareness).
Proposed Framework LGSID: It introduces LGSID (LLM-Aligned Geographic Item Tokenization Framework), a novel two-stage item tokenization framework specifically tailored for local-life recommendation.
- RL-based Geographic LLM Alignment: This module aligns LLMs with real-world spatial knowledge. It involves:
  - Training a list-wise reward model with density-aware negative sampling to capture relative spatial distances.
  - Introducing a novel G-DPO algorithm that uses this reward model to inject geographic and collaborative signals into the LLM, balancing semantic accuracy with geographic awareness through similarity regularization.
- Hierarchical Geographic Item Tokenization: This module generates a sequence of spatial-aware discrete tokens. It involves:
  - Creating primary tokens from discrete spatial and content attributes using MiniBatch K-Means.
  - Refining residual tokens using the aligned LLM's geographic representation vectors through learnable cluster centers.
Extensive Experimental Validation: Comprehensive experiments on real-world Kuaishou industry datasets demonstrate that LGSID significantly improves performance across both discriminative and generative recommendation models.
- Discriminative Models: LGSID showed substantial AUC improvements over DIN, DIEN, SIM, TWIN, and ETA when augmented with its tokenization scheme.
- Generative Models: LGSID achieved significant improvements in Hit@K and NDCG@K metrics over TIGER and OneRec compared to other quantization methods.
- Ablation Studies and Visualizations: These studies validate the effectiveness of the individual components, showing how G-DPO enhances geographic awareness while maintaining semantic understanding and how hierarchical tokenization improves category cohesion.
  
The key conclusion is that by explicitly aligning LLMs with geographic awareness and integrating it into a hierarchical item tokenization strategy, LGSID effectively addresses the spatial constraints of local-life recommendation, leading to more accurate and contextually relevant recommendations.

3.1. Foundational Concepts

To understand this paper, a novice reader should be familiar with the following fundamental concepts:

Recommendation Systems (RS):
- Definition: Algorithms designed to suggest relevant items (e.g., movies, products, restaurants) to users. They aim to predict user preferences.
- Types:
  - ID-based methods: Represent users and items using unique identifiers (IDs). These IDs are often embedded into vectors (e.g., user embeddings, item embeddings) that capture their characteristics and relationships.
  - Collaborative Filtering (CF): A common ID-based technique that recommends items based on the preferences of similar users or the characteristics of similar items. For example, if user A and user B like similar movies, and user A liked a movie that user B hasn't seen, it might be recommended to user B.
  - Content-based methods: Recommend items similar to those a user has liked in the past, based on item attributes (e.g., genre of a movie, description of a product).
  - Hybrid methods: Combine CF and content-based approaches.
- Local-life Recommendation: A specific type of recommendation system where geographic location and distance are crucial factors. Items are typically local businesses or services (e.g., restaurants, shops) and user interactions are constrained by physical proximity.
Large Language Models (LLMs):
- Definition: Advanced AI models (e.g., GPT, LLaMA) trained on vast amounts of text data to understand, generate, and process human language. They are typically based on the Transformer architecture.
- Semantic Understanding: LLMs excel at capturing the meaning and context of text, allowing them to generate semantic representations (dense vector embeddings) that reflect the underlying meaning of words, sentences, or entire documents.
- Pre-training: LLMs are initially trained on a large corpus of text data using self-supervised learning tasks (e.g., predicting the next word). This process allows them to learn general language patterns and semantic knowledge.
- Fine-tuning/Alignment: After pre-training, LLMs can be adapted to specific downstream tasks or domains through further training with smaller, task-specific datasets, often involving techniques like reinforcement learning from human feedback (RLHF).
Item Tokenization / Semantic IDs (SIDs):
- Definition: The process of converting item representations (e.g., LLM embeddings) into discrete tokens or IDs. Instead of continuous vectors, items are assigned one or more numerical codes.
- Purpose:
  - Efficiency: Discrete IDs can be more memory-efficient and faster to retrieve or process than high-dimensional continuous vectors, especially in large-scale recommendation systems.
  - Interpretability: In some contexts, discrete tokens can offer better interpretability than raw embeddings.
  - Integration with traditional systems: SIDs can bridge the gap between LLM semantic representations and traditional ID-based recommendation frameworks.
- Quantization: The general process of mapping continuous values (like LLM embeddings) to discrete values (like tokens). This often involves clustering algorithms (e.g., K-Means) or more advanced techniques like Vector Quantized Variational AutoEncoders (VQ-VAE) or Residual Quantization (RQ).
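As a rough illustration of the residual quantization idea mentioned above, the sketch below maps one continuous vector to a short sequence of discrete codes. It is not the paper's tokenizer: the codebooks are hand-picked hypothetical values rather than learned ones, and the names `nearest` and `residual_quantize` are introduced here only for illustration.

```python
import math

def nearest(codebook, vec):
    """Return the index of the codebook entry closest to vec (Euclidean distance)."""
    return min(range(len(codebook)), key=lambda k: math.dist(codebook[k], vec))

def residual_quantize(vec, codebooks):
    """Assign one discrete code per level; each level encodes the previous residual."""
    residual = list(vec)
    codes = []
    for codebook in codebooks:
        k = nearest(codebook, residual)
        codes.append(k)
        # Subtract the chosen centroid so the next level only sees what is left.
        residual = [r - c for r, c in zip(residual, codebook[k])]
    return codes, residual

# Hypothetical two-level codebooks for 2-D embeddings (assumed, not learned).
codebooks = [
    [[0.0, 0.0], [10.0, 10.0]],             # coarse level
    [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]],   # fine level
]
codes, residual = residual_quantize([11.0, 10.0], codebooks)
```

In this toy setup the first code captures the coarse position and the second code captures the small offset that the coarse centroid missed, which is the intuition behind multi-level semantic IDs.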
Reinforcement Learning (RL):
- Definition: A type of machine learning where an agent learns to make decisions by interacting with an environment. The agent receives rewards or penalties for its actions and aims to maximize its cumulative reward over time.
- Components:
  - Agent: The learning algorithm.
  - Environment: The system with which the agent interacts.
  - State: The current situation of the environment.
  - Action: The agent's choice in a given state.
  - Reward: A scalar feedback signal indicating the desirability of an action.
  - Policy: The agent's strategy for choosing actions in different states.
- Reward Model (RM): In RLHF, a separate model trained to predict human preferences (or a proxy for preferences, like geographic relevance in this paper) for different outputs generated by an LLM. This RM then provides the reward signals for training the LLM (the policy model).
- Direct Preference Optimization (DPO): An RLHF algorithm that directly optimizes the LLM policy using a simple loss function derived from preferences, avoiding the complexities of explicitly training a reward model and using PPO (Proximal Policy Optimization).
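To make the DPO objective more concrete, here is a minimal sketch of the standard DPO loss (Rafailov et al. 2023) for a single preference pair. This is vanilla DPO, not the paper's G-DPO variant; the function name `dpo_loss`, the default `beta=0.1`, and the scalar log-probability inputs are assumptions made for illustration.

```python
import math

def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Vanilla DPO loss for one (chosen, rejected) pair.

    Inputs are sequence log-probabilities under the policy being trained
    and under a frozen reference model.
    """
    # Implicit reward of each response: beta * log(pi / pi_ref).
    chosen_reward = beta * (logp_chosen - ref_logp_chosen)
    rejected_reward = beta * (logp_rejected - ref_logp_rejected)
    margin = chosen_reward - rejected_reward
    # -log(sigmoid(margin)) equals softplus(-margin).
    return softplus(-margin)

# Hypothetical log-probabilities: the policy has moved toward the chosen response.
loss_after = dpo_loss(-1.0, -3.0, -2.0, -2.0)
# Policy identical to the reference: the loss equals log(2).
loss_start = dpo_loss(-2.0, -2.0, -2.0, -2.0)
```

The loss shrinks as the policy raises the likelihood of the preferred response relative to the reference model, which is how preferences are optimized without a separate PPO loop.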
Distance Metrics:
- Euclidean Distance: The straight-line distance between two points in Euclidean space. For two vectors $\mathbf{a} = (a_1, \ldots, a_n)$ and $\mathbf{b} = (b_1, \ldots, b_n)$, it is calculated as: $ d(\mathbf{a}, \mathbf{b}) = \sqrt{\sum_{i=1}^n (a_i - b_i)^2} $
- Cosine Similarity: Measures the cosine of the angle between two non-zero vectors. It indicates how similar the directions of two vectors are. For two vectors $\mathbf{a}$ and $\mathbf{b}$: $ \text{similarity}(\mathbf{a}, \mathbf{b}) = \frac{\mathbf{a} \cdot \mathbf{b}}{\|\mathbf{a}\| \|\mathbf{b}\|} = \frac{\sum_{i=1}^n a_i b_i}{\sqrt{\sum_{i=1}^n a_i^2} \sqrt{\sum_{i=1}^n b_i^2}} $
- Haversine Distance: A formula for calculating the distance between two points on a sphere (e.g., Earth) given their latitudes and longitudes. It's crucial for geographic applications. The formula is: $ d = 2r \arcsin\left(\sqrt{\sin^2\left(\frac{\phi_2 - \phi_1}{2}\right) + \cos(\phi_1)\cos(\phi_2)\sin^2\left(\frac{\lambda_2 - \lambda_1}{2}\right)}\right) $ Where:
  - $r$ is the radius of the sphere (Earth's mean radius, approx. 6371 km).
  - $\phi_1, \phi_2$ are the latitudes of point 1 and point 2 (in radians).
  - $\lambda_1, \lambda_2$ are the longitudes of point 1 and point 2 (in radians).
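The Haversine formula above can be sketched in a few lines of Python. The function name `haversine_km` and the choice to accept degrees (converted to radians internally) are assumptions for illustration; the coordinates used below are approximate city-center values.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, as in the formula above

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    # Clamp to 1.0 to guard against tiny floating-point overshoot before asin.
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))

# Approximate coordinates for Beijing and Shanghai (illustrative values).
beijing_to_shanghai = haversine_km(39.9042, 116.4074, 31.2304, 121.4737)
```

A distance on the order of a thousand kilometers between these two cities is exactly the kind of gap that makes a semantically relevant Shanghai restaurant a poor recommendation for a Beijing user.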

3.2. Previous Works

The paper extensively discusses item tokenization and LLM alignment in recommendation.

3.2.1. Item Tokenization

Early efforts (RQ-VAE, VQ-VAE): These foundational techniques map high-dimensional continuous representations (like LLM embeddings) to discrete codes.
- VQ-VAE (Vector Quantized Variational AutoEncoder): A neural network that learns a discrete latent representation for data. It quantizes continuous latent vectors from an encoder into discrete codes using a codebook and reconstructs the data.
- RQ-VAE (Residual-Quantized Variational AutoEncoder): An extension of VQ-VAE that applies quantization in multiple stages. Each stage quantizes the residual error left by the previous stage, allowing for finer-grained representation. TIGER (Rajput et al. 2023) is an early adopter, using RQ-VAE to map LLM representations to semantic IDs (SIDs).
Two-stage methods: These methods first generate item representations (e.g., from LLMs) and then tokenize them, often with a focus on aligning these tokens with recommendation signals.
- LC-Rec (Zheng et al. 2024): Integrates recommendation signals through semantic alignment.
- LETTER (Wang et al. 2024a): Jointly aligns semantic and collaborative spaces.
- QARM (Luo et al. 2024): Uses Res-Kmeans (residual K-Means, which applies K-Means level by level to the residuals left by the previous level) guided by user interaction distributions to learn semantic IDs. K-Means is a clustering algorithm that partitions $n$ observations into $k$ clusters, where each observation belongs to the cluster with the nearest mean (centroid).
- EAGER (Wang et al. 2024c): Employs two-stream codebooks to model both semantic and collaborative signals.
- UTGRec (Zheng et al. 2025a): Integrates multimodal semantics with co-occurrence patterns into universal codes.
- SC (Li et al. 2025): Regularizes semantic encoders using ID embeddings from pre-trained CF models.
End-to-end methods: These methods jointly optimize tokenization and recommendation.
- UnifiedSID (Lin et al. 2025): Uses RQ-VAE trained with cosine and Euclidean distances to integrate semantic and ID tokens.
- ETEGRec (Liu et al. 2025a): Introduces sequence-item and preference-semantic alignment objectives that are optimized jointly with generative model training.
  
Crucial gap identified by LGSID: Existing item tokenization methods largely overlook domain-specific constraints, particularly geographic ones in local-life recommendations. Without geographic awareness, SIDs may lead to recommending otherwise suitable items that are too far away, negatively impacting system efficiency.

3.2.2. LLM Alignment for Recommendation

This area focuses on adapting LLMs for specific recommendation tasks.

Designing LLM tasks:
- LGHRec (Luo et al. 2025): Uses chain-of-thought (CoT) reasoning to distill item descriptions into semantic IDs, which are then fused with vanilla IDs for Graph Neural Networks (GNNs). CoT involves prompting LLMs to show their reasoning steps, which can improve their performance on complex tasks.
- GNPR-SID (Wang et al. 2025): Incorporates domain attributes like location information into prompts to capture downstream signals.
- SIIT (Chen et al. 2024): Iteratively refines tokenization through self-improvement.
Fine-tuning LLMs:
- NoteLLM (Zhang et al. 2024a) and NoteLLM-2 (Zhang et al. 2024b): Compress items via prompts and integrate collaborative signals through supervised fine-tuning.
- AlignRec (Liu et al. 2024): Introduces alignment objectives for multimodal and user-item consistency.
- LLMEmb (Liu et al. 2025b): Uses supervised contrastive fine-tuning to align LLM embeddings with collaborative data. Contrastive learning trains models to learn representations where similar data points are pulled closer together, and dissimilar ones are pushed apart.
- LARM (Liu et al. 2025c): Distills knowledge from open-source LLMs into smaller models.
- Lu et al. (2024): Strengthens LLMs' alignment with recommendation instructions.

Crucial gap identified by LGSID: While these methods emphasize semantic alignment, LLMs need to balance domain-specific constraints (like geography) with accurately capturing user preferences based on item content. Simply adding geographic information to prompts is insufficient; explicit alignment is needed.

3.3. Technological Evolution

The field of recommendation systems has evolved from traditional collaborative filtering and matrix factorization (ID-based methods) to leveraging deep learning for more sophisticated feature learning and interaction modeling. With the rise of Large Language Models, there's a shift towards text-based recommendation, where LLMs' semantic understanding offers generalization capabilities beyond mere ID embeddings.

This paper fits into this timeline by addressing a critical challenge in the LLM-driven recommendation paradigm: extending semantic understanding to include domain-specific geographic awareness. It moves beyond treating LLMs solely as text encoders or relying on simple prompt engineering. Instead, it proposes a more profound integration of domain knowledge through RL-based alignment, ensuring that the item tokenization process is not only semantically rich but also geographically intelligent.

3.4. Differentiation Analysis

Compared to existing item tokenization and LLM alignment methods, LGSID offers several core differences and innovations:

Explicit Geographic Alignment for LLMs:
- Prior methods: Primarily focus on LLMs as generic text encoders (e.g., TIGER, QARM) or inject domain signals via prompt design (GNPR-SID). They assume LLMs can naturally balance content and location, often prioritizing semantic relevance over geographic proximity (e.g., NoteLLM focuses on collaborative signals through supervised fine-tuning).
- LGSID's innovation: LGSID explicitly post-trains the LLM using an RL-based alignment strategy. It doesn't just put location in the prompt; it teaches the LLM to understand and prioritize real-world spatial relationships through a reward model and the G-DPO algorithm. This ensures fine-grained spatial characteristics and distance awareness are deeply integrated into the LLM's representations.
Novel G-DPO Algorithm for Domain-specific Knowledge Injection:
- Prior DPO: Direct Preference Optimization (DPO) is a general RLHF method.
- LGSID's innovation: G-DPO is specifically designed for local-life recommendation. It incorporates a geography-aware reward model and domain-mixed sampling (combining domain collaborative pairs and geographically constrained pairs) to inject generalized spatial knowledge and collaborative signals. Crucially, it uses similarity regularization to dynamically balance semantic accuracy and geographic awareness during fine-tuning, preventing the LLM from losing its core semantic understanding while gaining spatial intelligence.
Hierarchical Geographic Item Tokenization:
- Prior tokenization methods: Many use RQ-VAE, Res-Kmeans, or N-gram SIDs (UnifiedSID, QARM, RQ-VAE-ngram) to generate discrete tokens, but often lack a structured way to integrate multiple levels of geographic and content attributes explicitly.
- LGSID's innovation: The hierarchical tokenization strategy directly addresses the multi-granularity nature of geographic information. It first generates primary tokens based on a weighted combination of discrete spatial (latitude, longitude, administrative codes) and content attributes. Then, residual tokens are refined using the aligned LLM's geographic representation vectors. This multi-layer approach allows for efficient compression while preserving rich information and ensures geographic cohesion across different levels of token granularity. The first layer's pre-computed geographic clusters provide a strong geographic prior, which is then semantically enriched by the aligned LLM.
  
In essence, LGSID moves beyond treating LLMs as black-box text encoders or relying on simple prompt engineering. It introduces a post-training alignment and hierarchical tokenization framework that explicitly injects and balances geographic constraints and collaborative signals into LLMs, leading to more practical and performant local-life recommendation systems.

4. Methodology

The LGSID framework for local-life recommendation is designed to infuse real-world spatial awareness into semantic IDs generated by Large Language Models (LLMs). It operates through two main modules: RL-based Geographic LLM Alignment and Hierarchical Geographic Item Tokenization.

The overall architecture is depicted in Figure 2. The upper part illustrates the RL-based alignment module, which fine-tunes the LLM to integrate geographic knowledge while preserving semantic understanding. The lower part of Figure 2 details the Hierarchical Geographic Item Tokenization (HGIT) pipeline, which combines spatial and discrete content features with the LLM's semantically rich, geographically-aware representations to create balanced item tokens.

[Image: schematic of the two stages of the LLM-aligned geographic item tokenization framework, namely RL-based Geographic LLM Alignment and Hierarchical Geographic Item Tokenization.]

Figure 2: The upper part depicts the pipeline of RL-based Geographic LLM Alignment, showing how the LLM is updated through reward scoring by the G-DPO algorithm, emphasizing the alignment of content and geographic information. The lower part depicts the pipeline of Hierarchical Geographic Item Tokenization, showing the user sequence and the final generated item embeddings.

4.1. Preliminaries

4.1.1. Geography Prompt Design

For LLMs to derive item semantic representations, LGSID uses a structured prompt that combines textual descriptions and geographical attributes.

Textual Descriptions ( $\mathbf{T}_i$ ): These include item content attributes such as name, brand, category, and price.
Geographical Attributes: These include province, city, and town.

As illustrated in Figure 6 (in Appendix A), the prompt is designed to present this information to the LLM in a structured manner. The LLM then processes this prompt to produce an item semantic representation $\mathbf{E}_i$ , which is a dense vector embedding that captures the meaning and context of the item, including its geographic context. These continuous representations $\mathbf{E}_i$ are subsequently converted into discrete semantic IDs $S_i$ through a quantization process.

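[Image: schematic of how content-understanding embeddings are generated in a short-video recommendation system. The inputs include item name, location, category hierarchy, brand, and product information, and the output is an LLM embedding.]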

Figure 6: Illustration of our prompt design.

4.1.2. SID in Recommendation

LGSID's generated Semantic IDs (SIDs) are applicable to two mainstream recommendation paradigms:

Discriminative Recommendation:
- Objective: Predict a user's likelihood of interacting with an item (e.g., click-through rate).
- Pipeline: Users $\mathcal{U} = \{u_1, \ldots, u_M\}$ and items $\mathcal{I} = \{i_1, \ldots, i_N\}$ are encoded into their respective ID embeddings. Each item $i$ is represented by its textual description $\mathbf{T}_i$ and its semantic IDs $S_i$. A user encoder $f_u(\cdot)$ learns user embeddings $\mathbf{e}_u = f_u(u)$. The system then generates a Top-k list of recommended items $\{i_1, \ldots, i_k\}$ by matching the user embedding $\mathbf{e}_u$ with a combined representation of item ID ($ID_i$), textual description ($\mathbf{T}_i$), and semantic IDs ($S_i$). The SIDs from LGSID enrich the item representation, providing geographic awareness and semantic generalization to improve matching accuracy.
Generative Recommendation:
- Objective: Directly predict the next item (or its semantic ID) that a user will interact with, given their historical sequence of interactions.
- Pipeline: Items are first mapped to semantic IDs $S_i = (s_i^1, \ldots, s_i^L)$ via quantized tokenization. Given a user's history of SIDs $\{S_{i_1}, \ldots, S_{i_t}\}$ , the generative model predicts the next SID $S_{i_{t+1}}$ , which corresponds to the most likely next item. LGSID's geographically-aware SIDs enable the generative model to produce recommendations that are not only semantically relevant but also spatially appropriate.

4.2. RL-based Geographic LLM Alignment

Existing methods often try to enhance LLM awareness by incorporating domain-specific information into prompts. However, LLMs primarily rely on semantic similarity and struggle with fine-grained spatial distinctions. To address this, LGSID proposes an RL-based post-training strategy to align LLMs with real-world geographic knowledge.

4.2.1. Geography-aware Reward Model Training

Training a reward model for reinforcement learning in recommendation systems with vast and dynamic item pools is challenging due to the impracticality of manual labeling. LGSID overcomes this by training a list-wise reward model $\mathcal{R}(i)$ that predicts the geographic relevance score between an item's content and its corresponding location, based on their LLM representations. This approach enables the model to internalize generalizable spatial knowledge.

Pairwise Geodesic Distance and Negative Sampling:
- The process begins by calculating the pairwise geodesic distance (distance on a sphere, using latitude and longitude) between items.
- A density-aware hard negative sampling strategy is then applied. This strategy selects $K$ negative samples for each target item based on their spatial distance. Hard negative samples are typically those that are challenging for the model to distinguish from positive samples, pushing the model to learn more discriminative features.
Prompt Mismatching Strategy:
- For each target item, prompt sequences are constructed by fixing the item's content ( $P_{\mathrm{content}}$ ) but replacing its true location with that of a negative sample ( $P_{\mathrm{location}}^{j^-}$ ). This creates a prompt-mismatched pair.
- The definition of these prompts is given by Equation 1: $P_i = \left[ P_{\mathrm{content}}, P_{\mathrm{location}}^{i} \right], \quad P_i^{j^-} = \left[ P_{\mathrm{content}}, P_{\mathrm{location}}^{j^-} \right]$ Where:
  - $P_i$ : The prompt for the $i$ -th target item with its true location.
  - $P_{\mathrm{content}}$ : The textual attributes of the item (e.g., name, brand, category, price).
  - $P_{\mathrm{location}}^{i}$ : The true geographic location (e.g., province, city, town) of item $i$ .
  - $P_i^{j^-}$ : The prompt for the $i$ -th target item with a negative location $j^-$ sampled from other items.
  - $P_{\mathrm{location}}^{j^-}$ : A negative geographic location corresponding to a sampled item $j^-$ .
- This generates a prompt sequence for each target item $i$ : $[P_i, P_i^{1^-}, \dots, P_i^{k^-}]$ , where $k^-$ denotes the $k$ -th negative sample.
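The prompt mismatching step can be pictured with a small sketch: the content part stays fixed and only the location part is swapped. The prompt template, the helper names `build_prompt` and `mismatched_prompt_sequence`, and the sample item are all hypothetical; the paper's actual prompt format is the one shown in its Figure 6.

```python
def build_prompt(content, location):
    """Join content and location into one prompt string (illustrative template only)."""
    return f"Item: {content} | Location: {location}"

def mismatched_prompt_sequence(content, true_location, negative_locations):
    """Return [P_i, P_i^{1-}, ..., P_i^{k-}]: true prompt first, then mismatched ones."""
    prompts = [build_prompt(content, true_location)]
    # Keep the content fixed and replace only the location with each negative sample.
    prompts += [build_prompt(content, loc) for loc in negative_locations]
    return prompts

# Hypothetical item and negative locations taken from other sampled items.
prompts = mismatched_prompt_sequence(
    content="Hotpot restaurant, brand X, category: dining",
    true_location="Beijing, Chaoyang",
    negative_locations=["Beijing, Haidian", "Shanghai, Pudong"],
)
```

Because every prompt in the sequence shares the same content, any difference in the resulting embeddings can be attributed to the location field, which is what the reward model is meant to score.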
LLM Encoding and Reward Prediction:
- The LLM encodes these prompts into embeddings: $\mathbf{E} = [\bar{\mathbf{E}}^i, \mathbf{E}_{1^-}^i, \dots, \mathbf{E}_{k^-}^i]$ , where $\mathbf{E}_{j^-}^i \in \mathbb{R}^d$ and $d$ is the embedding dimension. $\bar{\mathbf{E}}^i$ would be the embedding for the true prompt $P_i$ .
- A list-wise architecture is employed to learn spatial proximity. For each prompt-mismatched sampled item, its LLM representation $\mathbf{E}_{j^-}^i$ is fed into a multi-layer perceptron (MLP) to predict a reward score. This score quantifies the relationship between the target item's content and the (potentially mismatched) location $j$ .
- The reward score $r_{i,j}$ is defined by Equation 2: $r_{i,j} = \mathrm{MLP}(\mathbf{E}_j^i)$ Where:
  - $r_{i,j}$: The reward score between the $i$-th target item and the location of sampled item $j$. The notation is slightly ambiguous: $\mathbf{E}_j^i$ appears to denote the embedding of item $i$'s content paired with item $j$'s location, so the reward model learns to score how plausible location $j$ is for content $i$.
Distance-based Soft Labels:
- Soft labels $p_{i,j}$ are assigned based on the Haversine distances between items. Given a candidate list of $K$ prompt sequences sorted by distance from near to far, the soft labels are defined by Equation 3: $p_{i,j} = K - \mathrm{Rank}(\mathrm{dis}_{i,j}) + 1$ Where:
  - $\mathrm{dis}_{i,j}$ : The Haversine distance between items $i$ and $j$ in the list.
  - $\mathrm{Rank}(\mathrm{dis}_{i,j})$ : The rank of item $j$ based on its distance to item $i$ (e.g., rank 1 for the closest, rank $K$ for the furthest).
  - This labeling strategy assigns higher soft labels (higher importance) to items that are geographically closer to the target item.
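Equation 3 amounts to a rank transform over distances, which the following sketch illustrates. The function name `soft_labels` is hypothetical, and two details are assumptions not specified above: ties are broken by list order, and the list passed in contains all $K$ candidates for one target item.

```python
def soft_labels(distances_km):
    """Return p_j = K - Rank(dis_j) + 1, where rank 1 is the closest candidate."""
    k = len(distances_km)
    # Indices sorted from nearest to farthest; sorted() is stable, so ties keep list order.
    order = sorted(range(k), key=lambda j: distances_km[j])
    labels = [0] * k
    for rank, j in enumerate(order, start=1):
        labels[j] = k - rank + 1
    return labels

# Hypothetical distances (km) from one target item to K = 4 candidates.
labels = soft_labels([5.0, 0.0, 120.0, 30.0])
```

Here the candidate at 0 km receives the largest label (4) and the one at 120 km receives the smallest (1), so only the ordering of distances matters, not their absolute values.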
Reward Model Training Loss:
- The reward model is trained using a weighted binary cross-entropy loss, as defined by Equation 4: $\mathcal{L}_{\mathrm{RM}} = - \frac{1}{N} \sum_{i=1}^{N} \sum_{j} p_{i,j} \log \sigma(r_{i,j})$ Where:
  - $\mathcal{L}_{\mathrm{RM}}$ : The training loss for the reward model.
  - $N$ : The batch size.
  - $p_{i,j}$ : The soft label (weight) for the pair of item $i$ and item $j$ , derived from their geographic distance.
  - $\sigma(\cdot)$ : The sigmoid activation function, which squashes the reward score $r_{i,j}$ into a probability-like value between 0 and 1.
  - $r_{i,j}$ : The predicted reward score from the MLP.
- Because closer pairs carry larger weights $p_{i,j}$ , this loss rewards high scores for geographically relevant (closer) item-location pairs more strongly than for distant ones, so the reward model learns to rank nearer locations above farther ones.
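As a rough sketch of Equation 4, the weighted loss can be written in plain Python as below. The function names and the nested-list input layout (one row of $K$ scores per target item) are assumptions made for illustration; a real implementation would operate on tensors.

```python
import math

def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def reward_model_loss(rewards, soft_labels):
    """Equation 4: -(1/N) * sum_i sum_j p_ij * log(sigmoid(r_ij)).

    rewards and soft_labels each hold N rows (one per target item) of K values.
    """
    n = len(rewards)
    total = 0.0
    for r_row, p_row in zip(rewards, soft_labels):
        total += sum(p * log_sigmoid(r) for r, p in zip(r_row, p_row))
    return -total / n

# Raising the score of the closest candidate (weight 2) lowers the loss more
# than raising the score of the farther one (weight 1).
print(reward_model_loss([[1.0, 0.0]], [[2, 1]]))
print(reward_model_loss([[0.0, 1.0]], [[2, 1]]))
```

The comparison in the usage lines illustrates why the soft labels act as importance weights: the same score increase is worth more when it is placed on a nearer candidate.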

4.2.2. G-DPO Algorithm with RM

Building on the pre-trained reward model, LGSID introduces G-DPO (Geographic Direct Preference Optimization), an enhanced algorithm inspired by Direct Preference Optimization (DPO) (Rafailov et al. 2023), for aligning the LLM.

Domain-Mixed Sampling Strategy ( $\mathcal{D}_{\mathrm{mix}}$ ):
- G-DPO uses a domain-mixed sampling strategy $\mathcal{D}_{\mathrm{mix}} = \mathcal{D}_{\mathrm{dc}} \cup \mathcal{D}_{\mathrm{gc}}$ . This strategy combines two types of data:
  - Domain Collaborative Pairs ( $\mathcal{D}_{\mathrm{dc}}$ ): These pairs $(i_a, i_b)$ are derived from user historical behaviors. Items that frequently co-occur in user click histories tend to have both semantic and geographic similarity.
  - The co-occurrence score $s_{i_a, i_b}$ between items $i_a$ and $i_b$ is defined by Equation 5: $s_{i_a, i_b} = \sum_{u=1}^{U} \mathbb{I} \big[ i_a \in H_u \land i_b \in H_u \big]$ Where:
    - $\mathbb{I}[\cdot]$ : The indicator function, which is 1 if the condition is true, and 0 otherwise.
    - $H_u$ : User $u$ 's click history.
    - Pairs $(i_a, i_b)$ with $s_{i_a, i_b} > s_{\mathrm{th}}$ (a predefined threshold) are selected as domain collaborative sample pairs.
  - Geographically Constrained Pairs ( $\mathcal{D}_{\mathrm{gc}}$ ): These pairs $(i_a, \bar{i}_r)$ are formed by randomly sampling items $\bar{i}_r$ that are outside the target item's location. This ensures diversity and efficiency within large candidate pools.
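The co-occurrence filter of Equation 5 can be illustrated as follows. This is a small sketch under stated assumptions: the function name, the toy histories, and the choice to store each pair in sorted order are introduced here and do not come from the paper.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_pairs(histories, threshold):
    """Count users whose click history contains both items (Equation 5)
    and keep the pairs whose count exceeds the threshold s_th."""
    counts = Counter()
    for history in histories:
        # Deduplicate so that each user contributes at most 1 to a pair,
        # matching the indicator function in Equation 5.
        for a, b in combinations(sorted(set(history)), 2):
            counts[(a, b)] += 1
    return {pair: s for pair, s in counts.items() if s > threshold}

# Hypothetical click histories for three users.
histories = [["cafe", "bakery", "gym"], ["cafe", "bakery"], ["bakery", "gym", "bakery"]]
print(cooccurrence_pairs(histories, threshold=1))
# {('bakery', 'cafe'): 2, ('bakery', 'gym'): 2}
```

Enumerating all pairs per user is quadratic in history length, so at industrial scale this counting would presumably be done with a distributed job rather than an in-memory loop.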
G-DPO Alignment Loss ( $\mathcal{L}_{\mathrm{align}}$ ):
- G-DPO aligns the policy model $\pi_{\theta}$ (the LLM being fine-tuned) with a reference model $\pi_{\mathrm{ref}}$ (a frozen version of the policy model or a pre-trained LLM) using these domain-mixed sample pairs $(i^+, i^-) \in \mathcal{D}_{\mathrm{mix}}$ . Here, $i^+$ represents a preferred item (e.g., geographically closer or collaboratively relevant) and $i^-$ represents a non-preferred item.
- Each item pair is scored by the pre-trained reward model $\mathcal{R}$ , which takes embeddings from either the policy model ( $\mathbf{E}_{\pi_{\theta}}(i)$ ) or the reference model ( $\mathbf{E}_{\pi_{\mathrm{ref}}}(i)$ ) and outputs a distance-based score.
- The alignment loss in G-DPO is given by Equation 6: $\mathcal{L}_{\mathrm{align}} = - \mathbb{E}_{(i^+, i^-)} \log \sigma \Big( \beta \Big( \mathcal{R} \big( \mathbf{E}_{\pi_{\theta}}(i^+) \big) - \mathcal{R} \big( \mathbf{E}_{\pi_{\theta}}(i^-) \big) - \mathcal{R} \big( \mathbf{E}_{\pi_{\mathrm{ref}}}(i^+) \big) + \mathcal{R} \big( \mathbf{E}_{\pi_{\mathrm{ref}}}(i^-) \big) \Big) \Big)$ Where:
  - $\mathcal{L}_{\mathrm{align}}$ : The alignment loss for G-DPO.
  - $\mathbb{E}_{(i^+, i^-)}$ : Expectation over domain-mixed sample pairs.
  - $\sigma(\cdot)$ : The sigmoid function.
  - $\beta$ : A hyperparameter that controls the sharpness of the alignment.
  - $\mathcal{R}(\cdot)$ : The pre-trained reward model.
  - $\mathbf{E}_{\pi_{\theta}}(i)$ : The embedding of item $i$ generated by the policy model (the LLM being optimized).
  - $\mathbf{E}_{\pi_{\mathrm{ref}}}(i)$ : The embedding of item $i$ generated by the reference model (a fixed version of the LLM).
- This loss encourages the policy model to assign higher reward scores to preferred items ( $i^+$ ) and lower scores to non-preferred items ( $i^-$ ), relative to the reference model. This effectively injects the geographic and collaborative preferences learned by the reward model into the LLM.
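A minimal numeric sketch of Equation 6 follows. It assumes the four reward-model scores for each pair have already been computed; the tuple layout and function names are illustrative choices, not the paper's interface.

```python
import math

def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def align_loss(scored_pairs, beta):
    """Equation 6 averaged over a batch of preference pairs.

    Each entry is (r_theta_pos, r_theta_neg, r_ref_pos, r_ref_neg): the reward
    model's scores for the policy and reference embeddings of i+ and i-.
    """
    total = 0.0
    for r_pos, r_neg, ref_pos, ref_neg in scored_pairs:
        margin = beta * ((r_pos - r_neg) - (ref_pos - ref_neg))
        total += -log_sigmoid(margin)
    return total / len(scored_pairs)

# When the policy matches the reference, the margin is 0 and the loss is log(2).
print(align_loss([(1.0, 0.0, 1.0, 0.0)], beta=0.9))
# Widening the preferred/non-preferred gap relative to the reference lowers it.
print(align_loss([(2.0, 0.0, 1.0, 0.0)], beta=0.9))
```

The loss depends only on how much the policy's reward gap exceeds the reference's gap, which is what ties the update to the frozen reference model.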
Similarity Regularization ( $\mathcal{L}_{\mathrm{sim}}$ ):
- To prevent the LLM from catastrophic forgetting or degrading its original semantic representations during G-DPO updates, a similarity regularizer is introduced. This is an in-batch contrastive loss.
- The similarity regularization loss is defined by Equation 7: $\begin{array}{r l} & \mathcal{L}_{\mathrm{sim}} = \mathbb{E}_{i \in \mathcal{B}} \Big[ \| E_{\pi_{\theta}}(i) - E_{\pi_{\mathrm{ref}}}(i) \|_2^2 \\ & \qquad - \displaystyle \frac{1}{|\mathcal{B}| - 1} \sum_{j \neq i} \| E_{\pi_{\theta}}(i) - E_{\pi_{\mathrm{ref}}}(j) \|_2^2 \Big] \end{array}$ Where:
  - $\mathcal{L}_{\mathrm{sim}}$ : The similarity regularization loss.
  - $\mathbb{E}_{i \in \mathcal{B}}$ : Expectation over instances $i$ in a batch $\mathcal{B}$ .
  - $\| \cdot \|_2^2$ : The squared Euclidean distance.
  - $E_{\pi_{\theta}}(i)$ : Embedding of item $i$ from the policy model.
  - $E_{\pi_{\mathrm{ref}}}(i)$ : Embedding of item $i$ from the reference model.
  - The first term $\| E_{\pi_{\theta}}(i) - E_{\pi_{\mathrm{ref}}}(i) \|_2^2$ pulls the policy model's embedding for item $i$ closer to the reference model's embedding for the same item.
  - The second term $- \displaystyle \frac{1}{|\mathcal{B}| - 1} \sum_{j \neq i} \| E_{\pi_{\theta}}(i) - E_{\pi_{\mathrm{ref}}}(j) \|_2^2$ pushes the policy model's embedding for item $i$ away from the reference model's embeddings for other items in the same batch. This is a common form of contrastive loss that aims to preserve distinctiveness while maintaining semantic consistency.
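The two terms of Equation 7 can be checked on a toy batch. The sketch below uses plain lists for embeddings and assumes a batch of at least two items; the names are hypothetical.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def similarity_loss(policy_emb, ref_emb):
    """Equation 7: pull each policy embedding toward its own reference embedding
    and push it away from the other reference embeddings in the batch."""
    n = len(policy_emb)  # assumption: n >= 2, otherwise the push term is undefined
    total = 0.0
    for i in range(n):
        pull = sq_dist(policy_emb[i], ref_emb[i])
        push = sum(sq_dist(policy_emb[i], ref_emb[j]) for j in range(n) if j != i) / (n - 1)
        total += pull - push
    return total / n

ref = [[0.0, 0.0], [2.0, 0.0]]
print(similarity_loss(ref, ref))                       # -4.0: no drift
print(similarity_loss([[1.0, 0.0], [2.0, 0.0]], ref))  # -2.0: item 0 drifted toward item 1
```

In this example, drifting item 0 halfway toward item 1's reference raises the loss from -4.0 to -2.0, which is the penalty that discourages the policy from collapsing distinct items together.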
Overall G-DPO Objective:
- The complete G-DPO training objective combines the alignment loss and the similarity regularization loss with a weight $\lambda$ , as shown in Equation 8: $\mathcal{L}_{\mathrm{G-DPO}} = \mathcal{L}_{\mathrm{align}} + \lambda \mathcal{L}_{\mathrm{sim}}$ Where:
  - $\lambda$ : A hyperparameter that controls the strength of the similarity regularization. A higher $\lambda$ means more emphasis on preserving the original semantic understanding.

4.3. Hierarchical Geographic Item Tokenization

Upon obtaining aligned LLM semantic representations (which are now geographically aware), LGSID introduces a Hierarchical Geographic Item Tokenization module to convert these representations into discrete tokens efficiently.

4.3.1. First Layer Tokenization (Primary Tokens)

The first layer focuses on generating primary tokens that are highly geography-aware and incorporate other discrete content attributes.

Multi-dimensional Feature Construction:
- To initialize geography-aware tokens, a multi-dimensional feature vector is constructed by integrating various attribute types.
- Composite embeddings are used for discrete categorical features to mitigate the inefficiency of high-dimensional one-hot encoding; the categorical codes below are built with fixed scaling factors.
  - Geography-aware codes $f_{\mathrm{geo}}$ are formed by latitude and longitude.
  - Administrative codes $f_{\mathrm{admin}}$ are fixed scaling factors based on province ID, city ID, and district ID.
  - Category codes $f_{\mathrm{cat}}$ are fixed scaling factors based on primary category and secondary category.
  - Brand codes $f_{\mathrm{brand}}$ are fixed scaling factors based on brand ID.
- The final clustering feature vector $\mathbf{F}$ is constructed by weighted concatenation of all these components, as shown in Equation 9: $\mathbf{F} = [ w_{\mathrm{admin}} \cdot f_{\mathrm{admin}}, ~ w_{\mathrm{geo}} \cdot f_{\mathrm{geo}}, ~ w_{\mathrm{cat}} \cdot f_{\mathrm{cat}}, ~ w_{\mathrm{brand}} \cdot f_{\mathrm{brand}} ]$ Where:
  - $w_*$ : Empirically chosen coefficients that reflect the relative importance of each feature type.
Clustering for First-Layer Tokens:
- Using the clustering feature vector $\mathbf{F}$ , MiniBatch K-Means is applied. MiniBatch K-Means is a variant of the K-Means algorithm that uses small random subsets of the data (mini-batches) to update the centroids. This makes it much faster for large datasets than standard K-Means.
- This clustering process generates a vocabulary of first-layer geographic tokens.
- The cluster representations (centroids) for this first layer are then computed as the mean of the LLM embeddings within each token group, producing the first-layer cluster centers $\mu^{(1)}$ . This step is critical as it grounds the first layer of tokens in both the discrete geographic and content attributes and the rich, geographically-aligned LLM embeddings.
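The feature construction of Equation 9 and the centroid step can be sketched as below. The clustering itself is omitted: `token_ids` is assumed to be the output of a K-Means-style clustering over the feature vectors, and all function names, weights, and toy values are illustrative assumptions.

```python
def clustering_feature(f_admin, f_geo, f_cat, f_brand, weights):
    """Equation 9: weighted concatenation [w_admin*f_admin, w_geo*f_geo, w_cat*f_cat, w_brand*f_brand]."""
    parts = (("admin", f_admin), ("geo", f_geo), ("cat", f_cat), ("brand", f_brand))
    return [weights[name] * v for name, vec in parts for v in vec]

def first_layer_centers(token_ids, llm_embeddings):
    """Mean of the aligned LLM embeddings within each first-layer token group."""
    sums, counts = {}, {}
    for tok, emb in zip(token_ids, llm_embeddings):
        acc = sums.setdefault(tok, [0.0] * len(emb))
        for d, v in enumerate(emb):
            acc[d] += v
        counts[tok] = counts.get(tok, 0) + 1
    return {tok: [v / counts[tok] for v in acc] for tok, acc in sums.items()}

# Hypothetical weights; the paper only says the coefficients are chosen empirically.
weights = {"admin": 2.0, "geo": 1.0, "cat": 0.5, "brand": 0.0}
print(clustering_feature([1.0, 2.0], [3.0], [4.0], [5.0], weights))
# [2.0, 4.0, 3.0, 2.0, 0.0]

# token_ids would come from clustering the feature vectors (assumed given here).
print(first_layer_centers([0, 0, 1], [[1.0, 1.0], [3.0, 3.0], [5.0, 0.0]]))
# {0: [2.0, 2.0], 1: [5.0, 0.0]}
```

Note the split of roles: items are grouped by their discrete attribute features, but each group's center lives in the LLM embedding space, so the residual layers can operate on embeddings directly.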

4.3.2. Residual Layers Tokenization (Residual Tokens)

For subsequent residual layers ( $l \geq 2$ ), LGSID employs learnable cluster centers with Euclidean distance-based assignment, similar to Residual Quantization (RQ) methods. This process refines the remaining information (residuals) not captured by the previous layers.

Cluster Assignment for Residuals:
- For each residual layer $l$ , an input residual vector $\mathbf{R}^{(l-1)}$ (which is the input embedding minus the quantized representations from all previous layers) is assigned to the closest cluster center.
- The assignment $\mathbf{z}^{(l)}$ for layer $l$ is determined by Equation 10: $\mathbf{z}^{(l)} = \arg \operatorname*{min}_{k} \| \mathbf{R}^{(l-1)} - {\pmb{\mu}}^{(l)}_k \|_2^2$ Where:
  - $\mathbf{z}^{(l)}$ : The index of the chosen cluster for layer $l$ .
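  - $\mathbf{R}^{(l-1)}$ : The residual vector from the previous layer $(l-1)$ . For the second layer $(l=2)$ , this is $\mathbf{R}^{(1)} = \mathbf{X} - \mathbf{Q}^{(1)}$ , the original LLM embedding $\mathbf{X}$ minus the first-layer cluster center, consistent with the residual update in Equation 12.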
  - ${\pmb{\mu}}^{(l)}_k$ : The $k$ -th learnable cluster center for layer $l$ .
  - $\| \cdot \|_2^2$ : The squared Euclidean distance.
Quantized Representation and Residual Update:
- Once a cluster center is chosen, the quantized representation for that layer $\mathbf{Q}^{(l)}$ is simply the selected cluster center.
- This is given by Equation 11: $\mathbf{Q}^{(l)} = {\pmb{\mu}}^{(l)}[ {\mathbf{z}}^{(l)} ]$ Where:
  - $\mathbf{Q}^{(l)}$ : The quantized representation for layer $l$ , which is the cluster center corresponding to $\mathbf{z}^{(l)}$ .
- The residual $\mathbf{R}^{(l)}$ for the next layer is then calculated by subtracting this quantized representation from the current residual.
- This residual update is defined by Equation 12: $\mathbf{R}^{(l)} = \mathbf{R}^{(l-1)} - \mathbf{Q}^{(l)}$ Where:
  - $\mathbf{R}^{(l)}$ : The new residual vector that will be passed to the next layer $(l+1)$ . This represents the information not yet captured by the first $l$ layers of quantization.
  - For the first layer $l=1$ , the input $\mathbf{R}^{(0)}$ is the original LLM embedding $\mathbf{X}$ .
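Equations 10 to 12 can be traced on a small example. In this sketch the first-layer center is assumed to be given by the item's first-layer token, and the codebooks for layers $l \geq 2$ are fixed toy values rather than learned parameters; all names are hypothetical.

```python
def nearest(vec, centers):
    """Equation 10: index of the center with the smallest squared Euclidean distance."""
    return min(range(len(centers)),
               key=lambda k: sum((v - c) ** 2 for v, c in zip(vec, centers[k])))

def residual_quantize(x, first_center, codebooks):
    """Quantize x with a given first-layer center, then residual layers l >= 2."""
    residual = [a - b for a, b in zip(x, first_center)]  # R^(1) = X - Q^(1)
    codes = []
    for centers in codebooks:
        z = nearest(residual, centers)                          # Equation 10
        q = centers[z]                                          # Equation 11
        residual = [r - c for r, c in zip(residual, q)]         # Equation 12
        codes.append(z)
    return codes, residual

x = [5.0, 3.0]
first_center = [4.0, 4.0]
codebooks = [
    [[0.0, 0.0], [1.0, -1.5]],  # layer 2 (toy values)
    [[0.0, 1.0], [0.0, 0.4]],   # layer 3 (toy values)
]
codes, residual = residual_quantize(x, first_center, codebooks)
print(codes)     # [1, 1]
print(residual)  # approximately [0.0, 0.1]
```

The squared norm of the final residual equals the reconstruction error of the summed quantized representations, so each added layer can only help if its codebook contains a center close to the remaining residual.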

4.3.3. Training Objective for HGIT

The primary goal of Hierarchical Geographic Item Tokenization (HGIT) is to minimize the reconstruction loss between the original embeddings and their quantized representations.

Reconstruction Loss ( $\mathcal{L}_{\mathrm{recon}}$ ):
- For input embeddings $\mathbf{X}$ (the LLM's geographic representations), the reconstruction loss is calculated as the squared Euclidean distance between the original embedding and the sum of quantized representations from all layers.
- This is defined by Equation 13: $\mathcal{L}_{\mathrm{recon}} = \| \mathbf{X} - \sum_{l=1}^{L} \mathbf{Q}^{(l)} \|_2^2$ Where:
  - $\mathcal{L}_{\mathrm{recon}}$ : The reconstruction loss.
  - $\mathbf{X}$ : The original LLM embedding.
  - $\sum_{l=1}^{L} \mathbf{Q}^{(l)}$ : The sum of the quantized representations from all $L$ layers. This sum forms the reconstructed embedding.
Entropy-based Regularization ( $\mathcal{L}_{\mathrm{reg}}$ ):
- To ensure balanced utilization of the learned clusters and prevent cluster collapse (where some clusters are never used or capture very few data points), an entropy-based regularization term is introduced.
- For each layer $l$ , the cluster usage distribution $p_k^{(l)}$ is computed: $p_k^{(l)} = \frac{1}{N} \sum_{i=1}^{N} \mathbb{I} [ \mathbf{z}_i^{(l)} = k ]$ Where:
  - $p_k^{(l)}$ : The proportion of data points assigned to cluster $k$ in layer $l$ .
  - $N$ : The total number of data points.
  - $\mathbb{I}[\cdot]$ : The indicator function, which is 1 if data point $i$ is assigned to cluster $k$ , and 0 otherwise.
- The regularization loss encourages a uniform cluster usage through KL divergence: $\mathcal{L}_{\mathrm{reg}}^{(l)} = \mathrm{KL}\left( p^{(l)} \| \mathbf{u} \right) = \sum_{k=1}^{K_l} p_k^{(l)} \log \frac{p_k^{(l)}}{1 / K_l}$ Where:
  - $\mathcal{L}_{\mathrm{reg}}^{(l)}$ : The regularization loss for layer $l$ .
  - $\mathrm{KL}(P \| Q)$ : The Kullback-Leibler (KL) divergence between distribution $P$ and distribution $Q$ . It measures how one probability distribution diverges from a second, expected probability distribution.
  - $\mathbf{u} = [1/K_l, \dots, 1/K_l]$ : Represents the uniform distribution over $K_l$ clusters, where $K_l$ is the number of clusters in layer $l$ .
- Minimizing this KL divergence pushes the cluster usage distribution $p^{(l)}$ towards a uniform distribution, ensuring all clusters are effectively utilized.
Complete HGIT Training Objective:
- The complete training objective for HGIT combines the reconstruction loss with the cluster usage regularization: $\mathcal{L}_{\mathrm{HGIT}} = \mathcal{L}_{\mathrm{recon}} + \lambda_{\mathrm{reg}} \sum_{l=2}^{L} \mathcal{L}_{\mathrm{reg}}^{(l)}$ Where:
  - $\mathcal{L}_{\mathrm{HGIT}}$ : The overall HGIT training loss.
  - $\lambda_{\mathrm{reg}}$ : A hyperparameter that controls the strength of the regularization.
- Important Note: The regularization is only applied to learnable layers ( $l \geq 2$ ) because the first layer uses pre-computed geographic clusters and thus doesn't require regularization for balanced usage in the same way learnable layers do.
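The usage distribution and its KL divergence to the uniform distribution can be computed as in the sketch below. It only evaluates the quantity; hard assignments are not differentiable, and how the authors propagate gradients through this term is not described here, so the code makes no claim about that. Function names are assumptions.

```python
import math
from collections import Counter

def usage_distribution(assignments, num_clusters):
    """p_k = fraction of data points assigned to cluster k."""
    counts = Counter(assignments)
    n = len(assignments)
    return [counts.get(k, 0) / n for k in range(num_clusters)]

def kl_to_uniform(p):
    """KL(p || u) with u = 1/K; terms with p_k = 0 contribute 0 by convention."""
    k = len(p)
    return sum(pk * math.log(pk * k) for pk in p if pk > 0)

p = usage_distribution([0, 0, 1, 3], num_clusters=4)
print(p)                                     # [0.5, 0.25, 0.0, 0.25]
print(kl_to_uniform(p))                      # positive: cluster 2 is unused
print(kl_to_uniform([0.25] * 4))             # 0.0: perfectly balanced
print(kl_to_uniform([1.0, 0.0, 0.0, 0.0]))   # log(4): full collapse
```

The value ranges from 0 for perfectly balanced usage to $\log K_l$ when every item falls into a single cluster, which gives a convenient scale for choosing $\lambda_{\mathrm{reg}}$.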

5. Experimental Setup

5.1. Datasets

The experiments are conducted on a real-world industry dataset from the feed of the Kuaishou App in the local-life scenario.

Source: Kuaishou App (a major short-video platform in China).
Domain: Local-life services, implying items like restaurants, shops, and services with geographic relevance.
Characteristics:
- Includes both item geographic information (latitude, longitude, administrative codes) and textual information (name, brand, category, price).
- Items satisfy delivery-distance constraints, which is critical for local-life services.
Scale: The dataset statistics are summarized in Table 4.

The following are the results from Table 4 of the original paper:

Kuaishou Industry Datasets
# Samples	50,000,000	# Users	19,080,888
# Items	2,325,266	# Brands	19,408
# Categories	818

Choice Rationale: The authors chose this dataset due to the scarcity of public datasets that satisfy delivery-distance constraints and contain detailed item text and geographic location information. This dataset is highly effective for validating the method's performance in its target application domain.
Future Plans: The authors plan to release their dataset in future work to support research on algorithms for LBS (Location-Based Services)-like recommendation scenarios.

5.2. Evaluation Metrics

The evaluation metrics differ based on the recommendation paradigm:

5.2.1. Discriminative Recommendation Metrics

For discriminative models, AUC is used.

Area Under the Receiver Operating Characteristic Curve (AUC):
1. Conceptual Definition: AUC quantifies the overall performance of a binary classification model (e.g., predicting click or no-click). It represents the probability that the model will rank a randomly chosen positive instance higher than a randomly chosen negative instance. A higher AUC value indicates better discriminative power, meaning the model is better at distinguishing between positive and negative classes. An AUC of 0.5 suggests random performance, while 1.0 indicates perfect classification.
2. Mathematical Formula: The AUC is typically calculated as the area under the ROC curve. The ROC curve plots the True Positive Rate (TPR) against the False Positive Rate (FPR) at various threshold settings.
  - $TPR = \frac{TP}{TP + FN}$ (Sensitivity or Recall)
  - $FPR = \frac{FP}{FP + TN}$ (1 - Specificity)
  - The AUC is the integral of the ROC curve: $\text{AUC} = \int_{0}^{1} \text{TPR}(\text{FPR}^{-1}(x)) dx$
  - For a finite set of scored samples, AUC is commonly computed by comparing every positive-negative pair (ties ignored): $\text{AUC} = \frac{\sum_{i=1}^{P} \sum_{j=1}^{N} \mathbb{I}(score_i > score_j)}{P \times N}$
3. Symbol Explanation:
  - TPR: True Positive Rate, proportion of actual positives correctly identified.
  - FPR: False Positive Rate, proportion of actual negatives incorrectly identified as positive.
  - TP: True Positives, correctly predicted positive cases.
  - FN: False Negatives, actual positive cases incorrectly predicted as negative.
  - FP: False Positives, actual negative cases incorrectly predicted as positive.
  - TN: True Negatives, correctly predicted negative cases.
  - $P$ : Total number of positive samples.
  - $N$ : Total number of negative samples.
  - $score_i$ : Prediction score for a positive sample $i$ .
  - $score_j$ : Prediction score for a negative sample $j$ .
  - $\mathbb{I}(\cdot)$ : Indicator function, which is 1 if the condition is true, 0 otherwise.
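The pairwise form of AUC can be verified directly on a handful of scores. In this sketch, counting a tie as half a win is a common convention added here as an assumption; the formula in the text does not specify tie handling.

```python
def pairwise_auc(pos_scores, neg_scores):
    """Fraction of (positive, negative) pairs in which the positive scores higher."""
    wins = 0.0
    for sp in pos_scores:
        for sn in neg_scores:
            if sp > sn:
                wins += 1.0
            elif sp == sn:
                wins += 0.5  # assumption: ties count as half a win
    return wins / (len(pos_scores) * len(neg_scores))

# Three of the four positive-negative pairs are ordered correctly.
print(pairwise_auc([0.9, 0.4], [0.5, 0.1]))  # 0.75
```

The double loop is quadratic and only suitable for illustration; production evaluations typically sort the scores once and derive the same quantity from ranks.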

5.2.2. Generative Recommendation Metrics

For generative models, NDCG@K and Hit@K with $K \in \{5, 10\}$ are used.

Normalized Discounted Cumulative Gain at K (NDCG@K):
1. Conceptual Definition: NDCG@K measures the ranking quality, taking into account the position of relevant items in the ranked list. It gives higher scores to relevant items that appear earlier in the list (higher ranks) and discounts the relevance of items further down. It normalizes the DCG (Discounted Cumulative Gain) by dividing it by the Ideal DCG (IDCG), which is the DCG of a perfect ranking. This allows for comparison across different queries or lists.
2. Mathematical Formula:
  - First, Discounted Cumulative Gain (DCG@K) is calculated: $\text{DCG@K} = \sum_{j=1}^{K} \frac{rel_j}{\log_2(j+1)}$
  - Then, Ideal DCG (IDCG@K) is calculated by sorting all relevant items by their relevance score in descending order and applying the DCG formula: $\text{IDCG@K} = \sum_{j=1}^{K} \frac{rel_{j, ideal}}{\log_2(j+1)}$
  - Finally, NDCG@K is: $\text{NDCG@K} = \frac{\text{DCG@K}}{\text{IDCG@K}}$
3. Symbol Explanation:
  - $K$ : The number of top items in the ranked list being considered.
  - $rel_j$ : The relevance score of the item at position $j$ in the generated ranked list. This can be binary (1 if relevant, 0 if not) or graded (e.g., 0-5).
  - $\log_2(j+1)$ : A logarithmic discount factor, giving less weight to items at lower ranks.
  - $rel_{j, ideal}$ : The relevance score of the item at position $j$ in the ideal ranked list (all relevant items ranked perfectly).
Hit Rate at K (Hit@K):
1. Conceptual Definition: Hit@K measures whether a relevant item is present within the top K recommendations. It's a binary metric: if at least one target item is found among the top K recommended items, it's considered a "hit" (1); otherwise, it's a "miss" (0). The Hit Rate is the proportion of queries (users) for which a hit occurred.
2. Mathematical Formula: $\text{Hit@K} = \frac{\text{Number of users with at least one target item in top K}}{\text{Total number of users}}$ Alternatively, for a single user $u$ : $\text{Hit@K}_u = \mathbb{I}(\text{target item for user } u \text{ is in Top K recommendations})$ The overall Hit@K is the average $\text{Hit@K}_u$ across all users.
3. Symbol Explanation:
  - $K$ : The number of top items in the ranked list being considered.
  - $\mathbb{I}(\cdot)$ : Indicator function, which is 1 if the condition is true, 0 otherwise.
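Both ranking metrics above can be computed per user with a few lines of Python. This sketch assumes the relevance list covers every relevant item (so the ideal ranking is just the same list sorted in descending order) and uses hypothetical item names.

```python
import math

def dcg_at_k(rels, k):
    """DCG@K = sum over positions j = 1..K of rel_j / log2(j + 1)."""
    return sum(rel / math.log2(j + 1) for j, rel in enumerate(rels[:k], start=1))

def ndcg_at_k(rels, k):
    """NDCG@K = DCG@K / IDCG@K, defined as 0 when there is no relevant item."""
    ideal = dcg_at_k(sorted(rels, reverse=True), k)
    return dcg_at_k(rels, k) / ideal if ideal > 0 else 0.0

def hit_at_k(ranked_items, target, k):
    """1.0 if the target item appears in the top K recommendations, else 0.0."""
    return 1.0 if target in ranked_items[:k] else 0.0

# A single relevant item at rank 3: DCG = 1 / log2(4) = 0.5, IDCG = 1.
print(ndcg_at_k([0, 0, 1, 0, 0], k=5))               # 0.5
print(hit_at_k(["shop_a", "shop_b", "shop_c"], "shop_c", k=2))  # 0.0
print(hit_at_k(["shop_a", "shop_b", "shop_c"], "shop_c", k=3))  # 1.0
```

The dataset-level values are the averages of these per-user scores. With one target item per user, Hit@K only records whether the target made the cut, while NDCG@K also rewards placing it nearer the top.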

5.3. Baselines

To evaluate LGSID's effectiveness, the authors compare it against representative baselines in both discriminative and generative recommendation.

5.3.1. Discriminative Recommendation Baselines

These models focus on learning effective user-item interaction representations and are widely adopted in industrial applications.

DIN (Deep Interest Network) (Zhou et al. 2018): A model that adaptively learns the representation of user interests from their historical behaviors given a candidate item. It uses an attention mechanism to weigh different historical behaviors based on their relevance to the current candidate item.
DIEN (Deep Interest Evolution Network) (Zhou et al. 2019): An extension of DIN that captures the temporal evolution of user interests. It uses a GRU (Gated Recurrent Unit)-based interest extractor layer and an interest evolving layer to model how user interests change over time.
ETA (End-to-end User Behavior Retrieval in Click-through Rate Prediction Model) (Chen et al. 2021): A model designed for click-through rate (CTR) prediction that integrates user behavior retrieval directly into the end-to-end training process. It efficiently finds relevant historical behaviors.
SIM (Search-based User Interest Modeling with Lifelong Sequential Behavior Data for Click-through Rate Prediction) (Pi et al. 2020): Focuses on modeling long-term user interests from extensive sequential behavior data. It often involves search-based retrieval of relevant historical interactions.
TWIN (Twin v2: Scaling Ultra-long User Behavior Sequence Modeling for Enhanced CTR Prediction at Kuaishou) (Si et al. 2024): A highly optimized model for CTR prediction that can handle ultra-long user behavior sequences, common in large-scale industrial settings like Kuaishou.

5.3.2. Generative Recommendation Baselines

These models aim to generate recommendations, often by predicting item sequences or semantic IDs.

TIGER (Recommender Systems with Generative Retrieval) (Rajput et al. 2023): One of the first models to propose RQ-VAE for mapping LLM representations to semantic IDs in generative recommendation, enabling the model to retrieve items based on their semantic tokens.
OneRec (Unifying Retrieve and Rank with Generative Recommender and Iterative Preference Alignment) (Deng et al. 2025): A recent generative recommendation model that aims to unify the retrieve and rank stages, often incorporating iterative preference alignment.

5.3.3. Quantization Methods for Comparison

The paper also compares LGSID's quantization with other methods when applied to TIGER and OneRec:

Res-KMeans (Luo et al. 2024): A residual K-Means based quantization model, used in QARM, to address challenges in leveraging LLM representations.
RQ-VAE (Rajput et al. 2023): The Residual Quantized Variational AutoEncoder model, a common method for item tokenization.
Lin et al. (Lin et al. 2025): Refers to UnifiedSID, which employs RQ-VAE trained with cosine and Euclidean distances to integrate semantic and ID tokens.
RQ-VAE-ngram (Zheng et al. 2025b): An extension of RQ-VAE that incorporates N-gram semantic IDs, aiming to address the long-tail problem.

5.4. Finetune Settings

Backbone LLM: BGE (BAAI General Embedding) (Xiao et al. 2023) is used as the backbone. BGE is a state-of-the-art multilingual text understanding model, chosen for its suitability in industrial recommendation scenarios and its ability to work without complex instructions.
Prompt Length: Set to 512 tokens.
Embedding Size: Set to 1024. The last token's hidden state is used to represent the entire text (a common practice for sentence embeddings).
G-DPO Algorithm:
- Negative Samples for Reward Model: Each positive sample is paired with 15 negative samples for constructing list-wise inputs to the reward model.
- Reward Model Architecture: A two-layer MLP with a sigmoid activation function.
- LoRA (Low-Rank Adaptation) (Hu et al. 2022) is used for efficient LLM fine-tuning.
  - Rank (r): 8. LoRA injects learnable low-rank matrices into the Transformer architecture, significantly reducing the number of trainable parameters. A rank of 8 indicates the dimension of these low-rank matrices.
  - Dropout Rate: 0.05.
  - Fine-tuning Scope: Only the key and value layers of the Transformer are fine-tuned, a strategy to preserve the LLM's general semantic understanding while adapting it to the specific task.
- DPO Loss Parameters:
  - $\beta$ : 0.9.
  - Similarity Constraint: Euclidean distance with a weight $\lambda = 1.5$ .
  - Contrastive Learning: In-batch contrastive learning is used within the similarity constraint.
- Domain-Mixed DPO Preference Pairs:
  - Co-occurrence Score Threshold: 1200 for filtering domain collaborative pairs.
  - Negative Item Selection: Randomly select one negative item per sample to construct training pairs.

5.5. Parameters Settings

Batch Size: 10,240.
Optimal Settings: Selected based on model performance on validation data.
Embedding Dimension per Feature: 8.
Predict MLP Tower Dimensions: [32, 16, 1].
Optimizer: AdamW.
Learning Rate: 0.1.
Scheduler: StepLR, which decays the learning rate by a factor of 0.9 every 500 steps.
Hardware: All experiments are conducted on two GPUs, each equipped with 48GB of memory.
Training Workflow:
1. Train the reward model.
2. Perform LLM alignment using the G-DPO algorithm.
3. Finally, train the quantization model.

6. Results & Analysis

6.1. Overall Performance (RQ1)

6.1.1. Discriminative Recommendation Results

The following are the results from Table 1 of the original paper:

Method	DIN	DIEN	SIM	TWIN	ETA
Origin	0.5859	0.6255	0.5884	0.5898	0.5903
+ Res-KMeans (Luo et al. 2024)	0.6100↑+0.0241	0.6369↑+0.0114	0.6063↑+0.0179	0.6087↑+0.0189	0.6077↑+0.0174
+ RQ-VAE (Rajput et al. 2023)	0.6185↑+0.0326	0.6364↑+0.0109	0.6111↑+0.0227	0.6153↑+0.0255	0.6153↑+0.0250
+ Lin et al. (Lin et al. 2025)	0.6161↑+0.0302	0.6368↑+0.0113	0.6107↑+0.0223	0.6148↑+0.0250	0.6148↑+0.0245
+ RQ-VAE-ngram (Zheng et al. 2025b)	0.6163↑+0.0304	0.6354↑+0.0099	0.6116↑+0.0232	0.6129↑+0.0231	0.6145↑+0.0242
+ LGSID (Ours)	0.6276↑+0.0417	0.6484↑+0.0229	0.6224↑+0.0340	0.6263↑+0.0365	0.6274↑+0.0371

Analysis: Table 1 presents the offline AUC scores on the Kuaishou local-life dataset for various discriminative recommendation models augmented with different item tokenization schemes. The Origin row refers to the performance of the backbone models without any LLM-driven tokenization.

Consistent Outperformance: LGSID consistently delivers the largest absolute gains across all evaluated backbone models (DIN, DIEN, SIM, TWIN, ETA). For instance, it improves DIN's AUC by +0.0417, DIEN by +0.0229, SIM by +0.0340, TWIN by +0.0365, and ETA by +0.0371. These are significant improvements in AUC, a metric where even small gains are considered impactful.
Impact of Geographic Awareness: The authors attribute these improvements to LGSID's two main components. For attention-based models like DIN, DIEN, and SIM, traditional flat item IDs limit the influence of spatial proximity on attention scores. By injecting aligned LLM's spatial knowledge through the G-DPO phase, each ID becomes a geography-aware embedding that encodes real-world spatial distance and neighborhood co-visit patterns. This richer representation allows attention mechanisms to better capture relevant local items.
Efficiency for Latency-Constrained Models: For models like TWIN and ETA, which operate under strict latency constraints, compact yet informative codes are crucial. LGSID's hierarchical quantization first compresses geo-textual attributes into coarse primary tokens and then progressively refines residuals with geographic context. This approach yields richer item representations without expanding the embedding table, leading to notable uplifts in Table 1: an absolute AUC gain of +0.0365 for TWIN and +0.0371 for ETA, roughly 6% relative to their Origin scores of 0.5898 and 0.5903. This indicates that LGSID provides a good balance between expressiveness and efficiency.
Comparison with Other Tokenization Methods: LGSID outperforms other LLM-driven tokenization methods such as Res-KMeans, RQ-VAE, Lin et al., and RQ-VAE-ngram. This suggests that LGSID's explicit RL-based geographic alignment and hierarchical tokenization strategy are more effective in capturing the specific domain constraints of local-life recommendation compared to methods that focus purely on semantic or general collaborative signals.

6.1.2. Generative Recommendation Results

The following are the results from Table 2 of the original paper:

Method	TIGER				OneRec
Method	Hit@5	Hit@10	NDCG@5	NDCG@10	Hit@5	Hit@10	NDCG@5	NDCG@10
RQ-VAE (Rajput et al. 2023)	0.3087	0.3880	0.2255	0.2512	0.3739	0.4534	0.2798	0.3056
Lin et al. (Lin et al. 2025)	0.1767	0.2067	0.1335	0.1432	0.2950	0.3346	0.2272	0.2401
RQ-VAE-ngram (Zheng et al. 2025b)	0.2991	0.3769	0.2158	0.2411	0.3626	0.4358	0.2720	0.2957
LGSID (Ours)	0.3921	0.5077	0.2817	0.3191	0.4435	0.5537	0.3304	0.3661
IMP	27.01%	30.83%	24.94%	27.05%	18.63%	22.13%	18.09%	19.79%

Analysis: Table 2 presents the Hit@K and NDCG@K scores for generative recommendation models (TIGER and OneRec) when integrated with different quantization methods.

LGSID's Superiority: LGSID significantly outperforms all other quantization methods across both TIGER and OneRec backbones and all metrics (Hit@5, Hit@10, NDCG@5, NDCG@10).
- For TIGER, LGSID achieves IMP (relative improvement over the best baseline) of 27.01% for Hit@5 and 30.83% for Hit@10.
- For OneRec, the improvements are 18.63% for Hit@5 and 22.13% for Hit@10.
- Similar strong improvements are observed for NDCG@K metrics.
Weak Performance of Lin et al.: The method by Lin et al. (UnifiedSID) obtains the worst performance. The authors hypothesize this is "possibly because it uses different distance functions across various levels of codewords, leading to convergence difficulties and challenges in model optimization." This highlights the importance of consistent and well-designed quantization strategies.
Limitations of Geographic-Agnostic Methods: RQ-VAE and RQ-VAE-ngram achieve comparable performance but do not explicitly consider geographical constraints. Their suboptimal performance in local-life recommendation scenarios underscores LGSID's core motivation and innovation. By introducing RL-based Alignment to generate geographically aware representations and then transferring these into semantic IDs through hierarchical geographic item tokenization, LGSID directly addresses the domain-specific challenges, leading to more relevant and accurate generative recommendations.

6.2. RL-based LLM Alignment Analysis (RQ2)

This experiment evaluates the effectiveness of the RL-based LLM Alignment module (specifically, the G-DPO algorithm) by measuring how fine-tuning impacts the LLM's geographic awareness while preserving semantic understanding. Two key metrics are used:

Semantic Similarity: Measures the semantic relevance of the retrieved Top-K items.

Geographic Awareness: Measured by the coverage of retrieved items sharing the same province, city, and town as the target item (P@K, C@K, T@K respectively).

The following are the results from Table 3 of the original paper:

Method	Similarity			Province Coverage (P@K)			City Coverage (C@K)			Town Coverage (T@K)
	Top@5	Top@10	Top@100	P@5	P@10	P@100	C@5	C@10	C@100	T@5	T@10	T@100
Origin	0.9204	0.9133	0.8833	0.8716	0.8410	0.6681	0.7342	0.6827	0.4372	0.1601	0.1328	0.0552
DPO-PR	0.71	0.8679	0.8286	0.9001	0.8752	0.747	0.7478	0.713	0.5064	0.1452	0.1167	0.0445
DPO-LR	0.7595	0.7478	0.7088	0.8995	0.8648	0.6560	0.8681	0.8254	0.5783	0.5584	0.4966	0.2480
DPO-LRD	0.7411	0.7288	0.6876	0.8715	0.8302	0.6012	0.8277	0.7755	0.5043	0.6114	0.5435	0.2620
DPO-LRDM	0.8107	0.7954	0.7401	0.9047	0.8773	0.7261	0.7812	0.7329	0.5218	0.1816	0.1481	0.0625
DPO-LRDMS	0.8856	0.8754	0.8283	0.9960	0.9936	0.9662	0.9548	0.9352	0.8130	0.4030	0.3525	0.2260
G-DPO (Ours)	0.8977	0.8892	0.8504	0.9905	0.9852	0.9307	0.9173	0.8858	0.7065	0.294	0.2432	0.1290
IMP	-2.47%	-2.64%	-3.72%	+13.64%	+17.15%	+39.31%	+24.94%	+29.75%	+61.60%	+83.64%	+83.13%	+133.70%

Breakdown of G-DPO variants:

Origin: The baseline LLM before any G-DPO alignment.
DPO-PR: DPO with pairwise reward.
DPO-LR: DPO with list-wise reward.
DPO-LRD: DPO with list-wise reward and density-aware sampling.
DPO-LRDM: DPO with list-wise reward, density-aware sampling, and domain-mixed preference pairs.
DPO-LRDMS: DPO with list-wise reward, density-aware sampling, domain-mixed preference pairs, and similarity regularization.
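G-DPO (Ours): The final proposed G-DPO algorithm.
IMP: The relative change of G-DPO with respect to Origin. The reported values are consistent with this reading: Top@5 similarity moves from 0.9204 to 0.8977 (-2.47%), and P@5 from 0.8716 to 0.9905 (+13.64%).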
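
As a quick arithmetic check, the IMP row can be recomputed from the table. The sketch below assumes that IMP is the relative change of the G-DPO row against the Origin row. The source does not give a formula, but this assumption matches the printed values for the @5 columns. The helper name `relative_change_pct` is hypothetical.

```python
# Values copied from the Origin and G-DPO rows of Table 3 (the @5 columns only).
origin = {"Top@5": 0.9204, "P@5": 0.8716, "C@5": 0.7342, "T@5": 0.1601}
g_dpo = {"Top@5": 0.8977, "P@5": 0.9905, "C@5": 0.9173, "T@5": 0.294}


def relative_change_pct(new: float, base: float) -> float:
    """Relative change of `new` with respect to `base`, in percent."""
    return (new - base) / base * 100.0


for metric in origin:
    change = relative_change_pct(g_dpo[metric], origin[metric])
    print(f"{metric}: {change:+.2f}%")
```

Run against these four columns, the loop gives -2.47%, +13.64%, +24.94%, and +83.64%, which agrees with the IMP row.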

Analysis:

Pure Semantic Understanding vs. Geographic Awareness (Origin): The Origin model has high semantic similarity (Top@5: 0.9204) but relatively low town coverage (T@5: 0.1601, T@100: 0.0552). This confirms that a pre-trained LLM, relying solely on text, can find semantically similar items but struggles to capture real-world geographic distances. It may confuse "Suzhou, Anhui" with "Suzhou, Jiangsu" if only textual similarity is considered.
Effectiveness of Reward Model Training (DPO-LR vs. DPO-PR vs. Origin):
- Comparing DPO-LR to DPO-PR shows that list-wise modeling is more effective. DPO-LR significantly boosts T@5 from 0.1601 (Origin) to 0.5584, and C@5 from 0.7342 to 0.8681, demonstrating that the reward model effectively captures and transfers geographic knowledge into the LLM.
- The DPO-PR model, surprisingly, shows a decrease in town coverage compared to Origin (T@5: 0.1452 vs 0.1601), suggesting that simple pairwise rewards might not be sufficient or could even mislead the model if not carefully designed for the geographic context.
Impact of Density-Aware Sampling (DPO-LRD vs. DPO-LR):
- Adding density-aware list-wise modeling (DPO-LRD) further boosts T@5 from 0.5584 (DPO-LR) to 0.6114. This enhancement comes from better distinguishing between near-distance and far-distance negative samples, making the model more sensitive to geographic nuances.
- However, this improvement in geographic awareness often comes at the expense of semantic comprehension, as DPO-LRD's semantic similarity scores are lower than Origin.
Mitigating Trade-off with Domain-Mixed Samples (DPO-LRDM vs. DPO-LRD):
- Applying a mixed-sample strategy (incorporating domain collaborative pairs) in DPO-LRDM improves semantic similarity (Top@5: 0.8107 vs. 0.7411 for DPO-LRD) and province coverage (P@5: 0.9047 vs. 0.8715 for DPO-LRD). City coverage is mixed: C@5 falls from 0.8277 to 0.7812, while C@100 rises slightly from 0.5043 to 0.5218. This strategy improves sample discrimination and integrates collaborative signals, mitigating the semantic trade-off. However, DPO-LRDM shows a sharp drop in town coverage compared to DPO-LRD (T@5: 0.1816 vs. 0.6114). This could indicate that simply mixing collaborative signals without explicit regularization dilutes the fine-grained geographic focus.
Balancing with Similarity Regularization (DPO-LRDMS and G-DPO):
- The introduction of textual similarity regularization (DPO-LRDMS) is crucial. DPO-LRDMS achieves high semantic similarity (Top@5: 0.8856), recovering much of the original semantic understanding, while significantly boosting geographic coverage (P@5: 0.9960, C@5: 0.9548, T@5: 0.4030). This demonstrates that balancing domain-specific geographic knowledge with semantic preservation is essential for optimal results.
- G-DPO (Ours) achieves the highest semantic similarity among the aligned variants (Top@5: 0.8977 vs. 0.8856 for DPO-LRDMS), staying within about 2.5% of Origin (0.9204), and provides very strong province and city coverage (P@5: 0.9905, C@5: 0.9173). Its town coverage (T@5: 0.294) is better than Origin, DPO-PR, and DPO-LRDM but lower than DPO-LR, DPO-LRD, and DPO-LRDMS. The overall improvements in geographic coverage are substantial, especially for broader regions (province and city), highlighting the effectiveness of the G-DPO design in injecting geographical knowledge. The negative IMP values for similarity reflect a small loss in raw semantic similarity relative to Origin, which is offset by large gains in geographic awareness, the primary goal.

In conclusion, the ablation study clearly validates that pure semantic understanding is insufficient for geographic awareness. The reward model and list-wise modeling effectively transfer geographic knowledge. Density-aware sampling further refines this, and domain-mixed sampling helps integrate collaborative signals. Crucially, similarity regularization in G-DPO maintains semantic understanding while achieving strong geographic awareness, leading to a balanced and effective alignment of the LLM for local-life recommendation.
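
To make the idea of a distance-aware list-wise signal concrete, here is a minimal sketch. It is not the paper's reward model: the haversine distance, the softmax temperature `tau_km`, the coordinates, and the cross-entropy form are all assumptions chosen for illustration. Candidates closer to the anchor receive larger soft labels, and a scoring that ranks them in the same order incurs a lower list-wise loss.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float]  # (latitude, longitude) in degrees


def haversine_km(a: Point, b: Point) -> float:
    """Great-circle distance in kilometres on a sphere of radius 6371 km."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))


def softmax(xs: Sequence[float]) -> List[float]:
    """Numerically stable softmax."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def distance_soft_labels(anchor: Point, candidates: Sequence[Point],
                         tau_km: float = 5.0) -> List[float]:
    """Soft target distribution: closer candidates get more probability mass.

    tau_km is a hypothetical temperature, not a value from the paper.
    """
    return softmax([-haversine_km(anchor, c) / tau_km for c in candidates])


def listwise_cross_entropy(scores: Sequence[float],
                           soft_labels: Sequence[float]) -> float:
    """Cross-entropy between the soft labels and softmax(scores)."""
    probs = softmax(scores)
    return -sum(t * math.log(p) for t, p in zip(soft_labels, probs))


# Hypothetical coordinates: one anchor and three candidates at growing distance.
anchor = (31.30, 120.58)
candidates = [(31.31, 120.60), (31.40, 120.70), (33.63, 116.98)]
labels = distance_soft_labels(anchor, candidates)
print(listwise_cross_entropy([2.0, 1.0, 0.0], labels))  # ranking agrees with distance
print(listwise_cross_entropy([0.0, 1.0, 2.0], labels))  # ranking reversed
```

A pairwise objective would only compare two candidates at a time, whereas this form scores the whole candidate list against one target distribution, which is the general intuition behind the list-wise variants in the table.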

6.3. Visualization Analysis (RQ3)

6.3.1. T-SNE Visualization of Cluster Centroids

The following is the result from Figure 3 of the original paper:

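The image is a T-SNE visualization showing how items around different cluster centroids are distributed under different tokenization methods. The top row (a-c) shows the unaligned item distributions and the bottom row (d-f) the distributions after alignment. The degree of spatial clustering and similarity changes noticeably between the two, and the NMI metric quantifies this change.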

Figure 3: T-SNE visualization of items around cluster centroids across tokenization methods.

Analysis: Figure 3 shows T-SNE visualizations of item embeddings around cluster centroids, comparing unaligned (top row: a, b, c) and RL-based Geographic LLM Aligned (bottom row: d, e, f) items at different geographic granularities (Province, City, District). T-SNE is a dimensionality reduction technique used to visualize high-dimensional data in a 2D or 3D space, preserving local structures.

Improved Geographic Cohesion: After RL-based Geographic LLM Alignment, items that share a Province, City, or District form noticeably tighter clusters that are better separated from one another. This indicates that the aligned LLM embeddings effectively group items based on their actual geographic proximity.
Quantitative Measure (NMI): The authors provide a quantitative measure, Normalized Mutual Information (NMI), to assess the agreement between the model's clustering partition and the ground-truth geographic labels. NMI jumps from 0.0137-0.0845 (unaligned) to 0.6430-0.8644 (aligned).
- NMI Definition: Normalized Mutual Information (NMI) is an external evaluation metric for clustering. It measures the quality of a clustering by comparing it to a ground truth partitioning. A higher NMI score (closer to 1) indicates better agreement between the clustering results and the true labels. An NMI of 0 means no mutual information, and 1 means perfect correlation.
Mechanism of Improvement: The key reason for this substantial improvement in NMI is that G-DPO uses distance-aware list-wise rewards to inject real-world spatial relationships into the LLM. This process allows the LLM's token embeddings to inherently carry geographic priors, meaning the embeddings for geographically close items will be closer in the embedding space. This makes the subsequent clustering much more effective at identifying true geographic groupings.
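
For readers who want to see how NMI behaves, the following self-contained sketch computes it from two label lists. It uses the arithmetic-mean normalization, NMI = 2·I(U;V) / (H(U) + H(V)). The source does not say which normalization the authors used, so that choice is an assumption, as are the toy labels.

```python
import math
from collections import Counter
from typing import Hashable, Sequence


def entropy(labels: Sequence[Hashable]) -> float:
    """Shannon entropy (natural log) of a labeling."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def mutual_information(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Mutual information between two labelings of the same items."""
    n = len(a)
    joint = Counter(zip(a, b))
    count_a, count_b = Counter(a), Counter(b)
    return sum(
        (c / n) * math.log((c * n) / (count_a[x] * count_b[y]))
        for (x, y), c in joint.items()
    )


def nmi(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """NMI with arithmetic-mean normalization; 1.0 if both labelings are trivial."""
    h_a, h_b = entropy(a), entropy(b)
    if h_a == 0.0 and h_b == 0.0:
        return 1.0
    return 2.0 * mutual_information(a, b) / (h_a + h_b)


# Toy labels: cluster ids vs. ground-truth regions for four items.
print(nmi([0, 0, 1, 1], ["north", "north", "south", "south"]))  # clusters match regions
print(nmi([0, 0, 1, 1], ["north", "south", "north", "south"]))  # clusters ignore regions
```

The first call yields an NMI of 1 (up to floating-point error) because the partition is identical up to renaming; the second yields 0 because cluster membership carries no information about region.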

6.3.2. Token Quantile Percentiles (Coverage Performance)

The following is the result from Figure 4 of the original paper:

The image is a radar chart comparing token quantile percentiles across hierarchical levels, including the unaligned and aligned variants of the models used. It shows how different methods (e.g., RQ-VAE, Lin et al.) perform at each level (Level 1, Level 2, Level 3), with particular emphasis on the LGSID method.

Figure 4: Token quantile percentiles across hierarchical levels for local-life items.

Analysis: Figure 4 presents a radar chart showing token quantile percentiles across different hierarchical levels, comparing LGSID with other quantization methods (RQ-VAE, Lin et al., RQ-VAE-ngram) and LGSID w/o Alignment. Quantile percentiles indicate how well tokens represent instances across the distribution spectrum; a larger area in the radar chart implies better coverage performance (tokens can represent more instances).

Level-1 Consistency: At Level-1, LGSID demonstrates remarkable consistency between its aligned and non-aligned settings, showing identical coverage patterns. Both LGSID and LGSID w/o Alignment maintain 11k coverage at the 90% quantile, while RQ-VAE decays to 8k. This suggests that the initial layer, which leverages pre-computed geographic clusters, establishes a robust foundation regardless of the LLM alignment status.
Pronounced Advantages at Finer Granularities: As the analysis progresses to finer granularities (Level-2 and Level-3), LGSID's advantages become significantly more pronounced. Its area in the radar chart is the largest compared to other methods (including LGSID w/o Alignment). This indicates that:
- The RL-based Geographic LLM Alignment (G-DPO) plays a crucial role in enriching the LLM's representations with fine-grained geographic awareness.
- The hierarchical quantization strategy effectively utilizes these aligned representations to refine residual tokens, leading to tokens that can represent a wider and more diverse range of item instances accurately at deeper levels of abstraction.
Overall Effectiveness: The radar chart validates that LGSID not only provides good coverage at the coarse geographic level but also substantially improves the tokenization's expressive power and representational capacity at finer, more granular levels due to its unique alignment and hierarchical structure. This translates to better capture of local-life item characteristics.
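
The two-stage tokenization described here (a primary token from a pre-computed geographic cluster, followed by residual tokens refined from the aligned embedding) can be sketched as follows. This is a simplified illustration under stated assumptions, not the LGSID implementation: the function names, the idea of subtracting a per-cluster centroid before residual quantization, and the tiny 2-D codebooks are all hypothetical.

```python
from typing import Dict, List, Sequence

Vector = Sequence[float]


def nearest(vec: Vector, codebook: Sequence[Vector]) -> int:
    """Index of the codebook entry closest to vec (squared Euclidean distance)."""
    def sq_dist(code: Vector) -> float:
        return sum((v - c) ** 2 for v, c in zip(vec, code))
    return min(range(len(codebook)), key=lambda i: sq_dist(codebook[i]))


def hierarchical_tokens(geo_cluster_id: int, embedding: Vector,
                        cluster_centroids: Dict[int, Vector],
                        residual_codebooks: Sequence[Sequence[Vector]]) -> List[int]:
    """Level 1 is the given geographic cluster; deeper levels quantize residuals."""
    tokens = [geo_cluster_id]
    centroid = cluster_centroids[geo_cluster_id]
    residual = [v - c for v, c in zip(embedding, centroid)]
    for codebook in residual_codebooks:
        idx = nearest(residual, codebook)
        tokens.append(idx)
        # Remove what this level explained; the next level refines the remainder.
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return tokens


# Hypothetical 2-D example with two geographic clusters and two residual levels.
cluster_centroids = {0: [1.0, 1.0], 1: [5.0, 5.0]}
residual_codebooks = [
    [[0.5, 0.0], [-0.5, 0.0]],  # level-2 codebook
    [[0.0, 0.1], [0.1, -0.1]],  # level-3 codebook
]
print(hierarchical_tokens(1, [5.6, 4.9], cluster_centroids, residual_codebooks))
```

Because Level 1 is fixed by the geographic cluster rather than learned from the embedding, it would be identical with or without alignment, which is in line with the Level-1 consistency noted above. Only the deeper levels depend on the embedding quality.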

6.4. Case Study (RQ4)

The case study, illustrated in Figure 5, visually demonstrates LGSID's superior geographic awareness compared to other SID methods by examining the allocation of three-layer discrete tokens of the LGSID hierarchical quantizer with and without RL-based G-DPO alignment.

The following is the result from Figure 5 of the original paper:

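The image is a chart showing the hierarchical category frequency distribution of LGSID under different SID prefixes. It has six panels: (a), (b), and (c) show the frequency distributions of LGSID at each level, while (d), (e), and (f) show the corresponding distributions for LGSID without alignment. Each panel compares the frequencies of several cuisines (e.g., Japanese cuisine, seafood, fast food), which helps analyze how category frequencies differ between the aligned and unaligned models.

Figure 5: Hierarchical category frequency distribution of LGSID for different SID prefixes (Aligned vs Unaligned)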

Analysis: Figure 5 shows side-by-side comparisons of the hierarchical category frequency distribution for LGSID tokens, both aligned (a, b, c) and unaligned (d, e, f), across three layers.

First Layer (Coarse Geographic Clustering):
- Since the first layer uses pre-computed geographic clusters (as described in the methodology), the overall distribution of categories in Figure 5(a) (aligned) and Figure 5(d) (unaligned) appears similar. This initial layer establishes a broad geographic grouping.
Impact of RL-based Alignment on Cohesion:
- Figure 5(b) shows the aligned Level-1 token $[350, 93, *]$ . After RL-based alignment, this token cleanly groups the entire BBQ & Grilled branch into one coarse identifier. This means that items belonging to the "BBQ & Grilled" category, even if they have some semantic variations, are consistently assigned to this specific geographic-semantic token due to the LLM's improved distance-aware representation.
- In stark contrast, Figure 5(e) depicts the same layer without alignment. The same "BBQ & Grilled" restaurants are now scattered across multiple Level-1 tokens like $[199, 20, *]$ , $[443, 20, *]$ , and $[350, 17, *]$ .
Importance of Upstream LLM Quality:
- This scattering in the unaligned case occurs because the LLM embeddings have not been post-trained with G-DPO to respect distance-aware rewards. Without this alignment, the LLM's representations might prioritize subtle textual semantic differences over critical geographic proximity, leading to fragmented clustering.
- As a result, the unaligned Level-1 tokens lose category cohesion. This is detrimental because the hierarchical quantizer in subsequent layers can no longer rely on a shared, coherent root to effectively refine sub-categories. If the initial broad categories are fragmented, subsequent layers will struggle to build meaningful, consistent granular tokens.
Validation of G-DPO's Role: This case study vividly illustrates the importance of upstream LLMs and the quality of their embeddings in the item tokenization process. The G-DPO alignment ensures that the LLM's representations are robustly geography-aware, leading to more coherent and meaningful hierarchical tokens that accurately reflect real-world spatial and categorical relationships. This directly translates to more effective local-life recommendations.
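
The kind of tally behind this case study can be reproduced with a short script: group items by SID prefix, count categories per prefix, and count how many prefixes each category is spread over. The sketch below is illustrative only. The SID prefixes are taken from the text, but the item lists, third-level ids, and the "Seafood" entry are invented.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Tuple

Item = Tuple[Tuple[int, ...], str]  # (semantic ID, category)


def category_counts_by_prefix(items: Iterable[Item], depth: int) -> Dict[tuple, Counter]:
    """Map each SID prefix of length `depth` to its category frequency table."""
    table: Dict[tuple, Counter] = defaultdict(Counter)
    for sid, category in items:
        table[tuple(sid[:depth])][category] += 1
    return table


def prefixes_per_category(items: Iterable[Item], depth: int) -> Dict[str, int]:
    """Number of distinct prefixes each category is scattered across."""
    spread = defaultdict(set)
    for sid, category in items:
        spread[category].add(tuple(sid[:depth]))
    return {category: len(prefixes) for category, prefixes in spread.items()}


# Hypothetical items; only the two-level prefixes come from the case study.
aligned = [
    ((350, 93, 7), "BBQ & Grilled"),
    ((350, 93, 12), "BBQ & Grilled"),
    ((350, 93, 40), "BBQ & Grilled"),
    ((350, 17, 3), "Seafood"),
]
unaligned = [
    ((199, 20, 7), "BBQ & Grilled"),
    ((443, 20, 12), "BBQ & Grilled"),
    ((350, 17, 40), "BBQ & Grilled"),
    ((350, 17, 3), "Seafood"),
]

print(prefixes_per_category(aligned, depth=2))
print(prefixes_per_category(unaligned, depth=2))
```

A category that maps to a single prefix is cohesive in the sense used above, while a category spread over several prefixes, as in the unaligned list, is fragmented.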

6.4.1. Appendix C: Visualization Analysis for Other Quantizers

The appendix provides similar visualization analyses for other quantizers (Res-KMeans, RQ-VAE, Lin et al., RQ-VAE-ngram) when combined with G-DPO alignment. These visualizations reinforce the findings from the main LGSID case study.

The following are the results from Figure 7 of the original paper:

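The image is a chart showing the results for various cuisine categories under the Res-KMeans algorithm at different levels. Comparing the levels (Level 1, Level 2, Level 3) with the unaligned version reveals how the distribution and frequency of each cuisine category change, providing an intuitive reference for understanding the system's behavior.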

Figure 7: Hierarchical category frequency distribution of Res-KMeans for different SID prefixes (Aligned vs Unaligned).

Res-KMeans (Figure 7): With G-DPO alignment, Res-KMeans tokens like $[269, *, *]$ and $[461, *, *]$ cleanly unify "Local Cuisine" and "Specialty Cuisine" branches, respectively. Without alignment, these categories would be scattered. This demonstrates that distance-aware rewards preserve category cohesion even with Res-KMeans.

The following are the results from Figure 8 of the original paper:

The image is a chart showing the hierarchical category frequency distribution of RQ-VAE under different SID prefixes, for both the aligned and unaligned cases. It presents the frequency distributions at three levels (Level 1, Level 2, and Level 3), giving the frequency of each cuisine category at each level.

Figure 8: Hierarchical category frequency distribution of RQ-VAE for different SID prefixes (Aligned vs Unaligned)

RQ-VAE (Figure 8): For RQ-VAE, the G-DPO-aligned LLM embeddings lead to tokens like $[162, *, *]$ for "Local Cuisine" and $[31, *, *]$ for "Specialty Cuisine" consistently capturing these categories. This cohesion is attributed to the RL-based reward that encourages distance-aware grouping, providing a homogeneous root for subsequent layers.

The following are the results from Figure 9 of the original paper:

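The image is a schematic showing the results of the two models (Lin et al. and Lin et al. w/o Aligned) at three levels (Level 1, Level 2, Level 3). In each panel the horizontal axis gives the count and the vertical axis lists the cuisine categories, with differently colored bars showing the count for each category. The per-level values and comparisons illustrate how the models differ in local-life service recommendation.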

Figure 9: Hierarchical category frequency distribution of Lin et al. for different SID prefixes (Aligned vs Unaligned).

Lin et al. (Figure 9): Even for the Lin et al. quantizer, the aligned system assigns $[449, *, *]$ to "Local Cuisine" and $[412, *, *]$ to "Japanese Cuisine", eliminating the scattering observed without alignment. This confirms that G-DPO steers the LLM to respect distance-based rewards, ensuring unambiguous parent tokens for downstream levels.

The following are the results from Figure 10 of the original paper:

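The image consists of several bar charts showing the results of the RQ-VAE-ngram model at different levels, comparing the aligned and unaligned cases. It covers various cuisine types, such as local cuisine and Japanese cuisine, with numeric labels indicating weight or frequency.

Figure 10: Hierarchical category frequency distribution of RQ-VAE-ngram for different SID prefixes (Aligned vs Unaligned).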

RQ-VAE-ngram (Figure 10): For the RQ-VAE-ngram variant, G-DPO alignment results in "Local Cuisine" being captured by $[162, *, *]$ and "Specialty Cuisine" by $[31, *, *]$ , with other cuisines remaining compactly grouped. The absence of fragmentation demonstrates that ngram-aware embeddings, once aligned, still honor geographic cohesion, providing a stable foundation.

These visualizations consistently underscore the critical role of RL-based Geographic LLM Alignment in improving the quality and cohesion of item tokenization across various quantization methods.

6.5. Robustness Analysis

The following are the results from Table 5 of the original paper:

Backbone	Origin (AUC)	+LGSID (mean±SD)	+LGSID (95% CI)	+LGSID (median (IQR))
DIN	0.5859	0.6259±.0013*	[0.6248, 0.6269]	0.6261(0.0020)
DIEN	0.6255	0.6482±.0019*	[0.6462, 0.6502]	0.6483(0.0022)
SIM	0.5884	0.6259±.0022*	[0.6236, 0.6282]	0.6269(0.0035)
TWIN	0.5898	0.6272±.0022*	[0.6250, 0.6295]	0.6276(0.0030)
ETA	0.5903	0.6272±.0010*	[0.6261, 0.6283]	0.6274(0.0011)

Analysis: Table 5 presents a robustness analysis comparing the origin baseline models with their performance when augmented with $+LGSID$ across five different backbones (DIN, DIEN, SIM, TWIN, ETA). The analysis includes mean AUC with standard deviation (SD), 95% confidence interval (CI), median, and interquartile range (IQR) over 6 random seeds. The * denotes statistically significant improvements (Wilcoxon signed-rank test, $p < 0.05$ ).

Universal Performance Elevation: LGSID consistently improves the mean AUC across all five discriminative backbones. For instance, DIN rises from 0.5859 to 0.6259 (an absolute gain of 0.0400, i.e. 4.00 AUC points, or a 6.8% relative uplift). Similar substantial gains are observed for DIEN (+2.27 points), SIM (+3.75 points), TWIN (+3.74 points), and ETA (+3.69 points).
Exceptional Stability: Beyond performance, LGSID also demonstrates exceptional stability against variations from random seeds.
- The standard deviations (SDs) for $+LGSID$ are consistently small (ranging from 0.0010 to 0.0022), indicating low variability in performance across different runs.
- The 95% Confidence Intervals (CIs) are narrow (e.g., DIN: [0.6248, 0.6269]), suggesting high precision and reliability of the reported mean AUCs.
- The Interquartile Ranges (IQRs) are also small (e.g., ETA: 0.0011), showing that the central 50% of the results are tightly clustered, implying insensitivity to extreme seeds.
- The medians and means align almost perfectly, further indicating a symmetrical and consistent performance distribution.
Statistical Significance: The Wilcoxon signed-rank test uniformly confirms the statistical significance of $+LGSID$ 's improvements over the origin baselines across all backbones ( $p < 0.05$ ). This means the observed performance gains are unlikely to be due to chance alone.
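
The summary statistics reported in Table 5 can be computed with Python's standard library. The sketch below is an illustration, not the authors' procedure: the per-seed AUC values are invented, and the use of a Student-t interval with 5 degrees of freedom and of the "inclusive" quantile method are assumptions, since the source does not say how the CI or IQR were obtained.

```python
import math
import statistics
from typing import Dict, Sequence, Tuple


def summarize(aucs: Sequence[float], t_crit: float = 2.571) -> Dict[str, object]:
    """Mean, sample SD, t-based 95% CI, median, and IQR for a handful of runs.

    t_crit defaults to the two-sided 95% Student-t critical value for
    5 degrees of freedom (6 runs); pass another value for other run counts.
    """
    n = len(aucs)
    mean = statistics.mean(aucs)
    sd = statistics.stdev(aucs)
    half_width = t_crit * sd / math.sqrt(n)
    q1, _, q3 = statistics.quantiles(aucs, n=4, method="inclusive")
    return {
        "mean": mean,
        "sd": sd,
        "ci95": (mean - half_width, mean + half_width),
        "median": statistics.median(aucs),
        "iqr": q3 - q1,
    }


def uplift(base: float, new: float) -> Tuple[float, float]:
    """Absolute gain in AUC points (x100) and relative gain in percent."""
    return (new - base) * 100.0, (new - base) / base * 100.0


# Hypothetical per-seed AUCs, invented for illustration only.
seed_aucs = [0.6245, 0.6252, 0.6259, 0.6263, 0.6266, 0.6271]
print(summarize(seed_aucs))
print(uplift(0.5859, 0.6259))  # DIN means from Table 5
```

With only six seeds, the width of the interval depends noticeably on whether a normal or a t critical value is used, so intervals computed this way need not match the table exactly.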

In summary, the robustness analysis strongly evidences that LGSID not only universally elevates performance in local-life recommendation but also endows every backbone with stability and reproducibility against the inherent randomness of experimental setups. This is a crucial finding for deploying such a system in real-world industrial applications.

6.6. Parameters Sensitivity

The following are the results from Table 6 of the original paper:

Similarity
λ	Top@5	Top@10	Top@100
1.0	0.8856	0.8754	0.8283
1.5	0.8920	0.8828	0.8395
1.8	0.8977	0.8892	0.8504
Province Coverage (P@K)
λ	P@5	P@10	P@100
1.0	0.9960	0.9936	0.9662
1.5	0.9943	0.9908	0.9556
1.8	0.9905	0.9852	0.9347
City Coverage (C@K)
λ	C@5	C@10	C@100
1.0	0.9548	0.9352	0.8130
1.5	0.9408	0.9152	0.7673
1.8	0.9173	0.8858	0.7065
Town Coverage (T@K)
λ	T@5	T@10	T@100
1.0	0.4030	0.3525	0.2260
1.5	0.3431	0.2910	0.1688
1.8	0.2924	0.2432	0.1290

Analysis: Table 6 details the parameter sensitivity of the similarity loss weight hyperparameter $\lambda$ in G-DPO across various similarity and geographic coverage metrics at Top@K levels. This parameter controls the trade-off between preserving semantic understanding and enhancing geographic perception.

Impact on Semantic Similarity:
- As $\lambda$ increases (from 1.0 to 1.8), the semantic similarity metrics (Top@5, Top@10, Top@100) also increase. For instance, Top@5 similarity rises from 0.8856 ( $\lambda=1.0$ ) to 0.8977 ( $\lambda=1.8$ ). This is expected because a higher $\lambda$ places more emphasis on the similarity regularization term ( $\mathcal{L}_{\mathrm{sim}}$ ), which explicitly aims to keep the policy model's embeddings close to the reference model's original semantic embeddings.
Impact on Geographic Coverage:
- Conversely, as $\lambda$ increases, all geographic coverage metrics (Province, City, and Town Coverage at P@K, C@K, T@K) decrease. The effect is strongest at the finest granularity: Town Coverage at T@5 drops from 0.4030 ( $\lambda=1.0$ ) to 0.2924 ( $\lambda=1.8$ ), a relative drop of about 27%, whereas P@5 only moves from 0.9960 to 0.9905. This indicates a trade-off: prioritizing semantic preservation through a higher $\lambda$ reduces the LLM's ability to focus on fine-grained geographic distinctions that might deviate from its original semantic space.
Balancing Act: The results verify that adjusting $\lambda$ allows for flexible control over the LLM's balance between content understanding and spatial perception.
- A lower $\lambda$ (e.g., 1.0) leads to stronger geographic sensitivity (higher P@K, C@K, T@K) at the cost of slightly lower semantic similarity.
- A higher $\lambda$ (e.g., 1.8) maintains stronger semantic consistency at the cost of some geographic focus.
Generality and Task-Specific Adaptation: This sensitivity analysis confirms the generality of LGSID. It allows users or developers to task-specifically adapt the LLM by tuning $\lambda$ to achieve the desired balance, depending on whether the application requires higher semantic precision or more critical geographic accuracy. For local-life recommendation, finding the optimal balance is key to providing recommendations that are both relevant and accessible.
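
The role of $\lambda$ can be made concrete with a toy objective of the form $\mathcal{L} = \mathcal{L}_{\mathrm{DPO}} + \lambda \mathcal{L}_{\mathrm{sim}}$. The sketch below is an assumption-laden illustration rather than the G-DPO loss itself: the preference term is the standard DPO loss for a single pair, the similarity term is taken to be one minus the cosine similarity between policy and reference embeddings, and the values of `beta`, the margins, and the embeddings are all hypothetical.

```python
import math
from typing import Sequence


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(policy_margin: float, reference_margin: float, beta: float = 0.1) -> float:
    """Standard DPO loss for one preference pair.

    Each margin is log p(chosen) - log p(rejected) under the respective model.
    """
    return -log_sigmoid(beta * (policy_margin - reference_margin))


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def similarity_loss(policy_emb: Sequence[float], reference_emb: Sequence[float]) -> float:
    """Assumed form of L_sim: penalize drift from the reference embedding."""
    return 1.0 - cosine(policy_emb, reference_emb)


def total_loss(policy_margin: float, reference_margin: float,
               policy_emb: Sequence[float], reference_emb: Sequence[float],
               lam: float, beta: float = 0.1) -> float:
    """Preference loss plus lambda-weighted similarity regularization."""
    return (dpo_loss(policy_margin, reference_margin, beta)
            + lam * similarity_loss(policy_emb, reference_emb))


# Hypothetical numbers: the same drifted embedding is penalized more at larger lambda.
for lam in (1.0, 1.5, 1.8):
    print(lam, total_loss(2.0, 0.5, [1.0, 0.2], [1.0, 0.0], lam=lam))
```

Under this form, a larger $\lambda$ makes any drift of the policy embedding away from the reference more costly, so the optimizer trades geographic specialization for semantic fidelity, which matches the direction of the trends in Table 6.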

7. Conclusion & Reflections

7.1. Conclusion Summary

This paper introduces LGSID, an LLM-Aligned Geographic Item Tokenization Framework specifically designed to imbue semantic IDs with real-world spatial awareness for local-life recommendation. The framework addresses the critical limitation of existing LLM-driven recommendation methods that often fail to adequately capture fine-grained geographic characteristics and distance awareness.

The core contributions lie in its two key modules:

RL-based Geographic LLM Alignment: LGSID first trains a list-wise reward model using density-aware negative sampling to effectively capture relative spatial distances and real-world geographic relationships. This reward model then guides a novel G-DPO algorithm, which fine-tunes the LLM to inject generalized spatial knowledge and collaborative signals. Crucially, G-DPO incorporates similarity regularization to ensure that the LLM's inherent semantic understanding capabilities are preserved while it gains profound geographic awareness.
Hierarchical Geographic Item Tokenization: Following alignment, a hierarchical tokenization strategy is employed. This strategy generates primary tokens from discrete spatial and content attributes, establishing a strong geographic prior. Subsequently, residual tokens are refined using the aligned LLM's rich geographic representation vectors, allowing for efficient compression and reconstruction of item information at various granularities.

Extensive experiments on real-world Kuaishou industry datasets consistently demonstrate that LGSID significantly outperforms state-of-the-art discriminative and generative recommendation models. Ablation studies, visualizations, and case studies further validate the effectiveness of its components, particularly how the G-DPO alignment enhances geographic cohesion in item representations and how the hierarchical tokenization leverages this for improved recommendation accuracy.

7.2. Limitations & Future Work

The authors implicitly or explicitly acknowledge several aspects that can be considered limitations or avenues for future work:

Dataset Availability: The reliance on a single, proprietary Kuaishou industry dataset is a limitation. The authors state that public datasets satisfying delivery-distance constraints and containing detailed item text and geographic location information are scarce. They plan to release their dataset in future work, which would be a significant contribution to the research community. This highlights a general challenge in LBS recommendation research.
Generalizability to Diverse Geographies: While tested on Kuaishou data, the framework's performance might vary in regions with different geographic densities, administrative structures, or cultural nuances in local-life services.
Dynamic Geographic Changes: The current framework focuses on learning static geographic relationships. Real-world scenarios might involve dynamic changes (e.g., new stores opening, changes in administrative boundaries), which would require adaptive mechanisms.
Computational Cost of LLM Alignment: Fine-tuning LLMs, even with LoRA, can still be computationally intensive. Exploring more efficient alignment or distillation techniques could be valuable.
Cold-Start for New Items/Locations: While the paper addresses limited user interaction opportunities for many items, the specific handling of cold-start items or entirely new geographic regions might warrant further dedicated investigation. The current method relies on historical data for reward model training and collaborative signals.
Beyond Geographic Distance: Local-life recommendations involve more than just distance; factors like travel time, traffic conditions, delivery logistics, and user mobility patterns could be integrated for even more sophisticated geographic awareness.
Multi-modal Integration: The paper primarily focuses on textual and geographic attributes. Integrating multimodal data (e.g., images of restaurants, videos of places) could provide richer item representations.

7.3. Personal Insights & Critique

This paper presents a highly relevant and innovative solution to a critical problem in local-life recommendation. The explicit focus on geographic awareness through LLM alignment is a significant step beyond simply concatenating location information into prompts.

Key Strengths and Inspirations:

Rigorous Integration of RL and LLMs: The G-DPO algorithm is a well-thought-out innovation. It cleverly uses a pre-trained reward model (trained on explicit geographic distance) to guide LLM fine-tuning, directly addressing the LLM's inherent bias towards semantic similarity over geographic proximity. The similarity regularization is crucial for preventing catastrophic forgetting of semantic knowledge, providing a robust mechanism for balancing different forms of intelligence.
Hierarchical Tokenization for Multi-Granularity: The hierarchical geographic item tokenization is particularly inspiring. It acknowledges that geographic information exists at multiple granularities (province, city, town) and that a multi-layered tokenization scheme can efficiently capture this. The idea of using pre-computed geographic clusters for the primary layer and refining residuals with aligned LLM vectors is an elegant solution for combining discrete attributes with dense representations. This approach could be transferred to other domains requiring multi-granular feature representation (e.g., product categorization in e-commerce, where broad categories can be refined into sub-categories).
Addressing Real-World Industry Challenges: The paper's roots in an industry setting (Kuaishou) ensure its practical relevance. The focus on efficiency for latency-constrained models and the use of density-aware hard negative sampling reflect real-world considerations for large-scale recommendation systems.
Clear Ablation and Visualization: The ablation studies and T-SNE/radar chart visualizations are excellent. They provide strong empirical evidence for the effectiveness of each component and clearly illustrate how G-DPO enhances geographic cohesion and how hierarchical tokens improve coverage.

Potential Issues or Areas for Improvement:

Complexity of Reward Model Design: While the list-wise reward model is innovative, its design and training (especially density-aware hard negative sampling and distance-based soft labels) can be complex to tune effectively. The choice of $K$ negative samples and the threshold $s_{\mathrm{th}}$ for co-occurrence pairs could be highly dataset-dependent.
Interpretability of Tokens: While SIDs generally improve efficiency, the interpretability of the hierarchical geographic tokens themselves might still be challenging for human understanding, particularly the residual layers. Further work could explore ways to make these tokens more human-readable or to use them to generate natural language explanations for recommendations.
Implicit User Location: The paper assumes the user's location is known or inferable for local-life recommendations. In cases where user location is ambiguous or privacy-sensitive, this could be a challenge. The framework could potentially be extended to infer optimal user locations or preferences based on their interaction history if explicit location data is unavailable.
Cost-Benefit Analysis for LLM Fine-tuning: While LoRA reduces costs, fine-tuning an LLM remains resource-intensive. A detailed cost-benefit analysis (e.g., comparing the engineering and compute cost of G-DPO against the gains in $AUC/Hit@K$ ) would be valuable for practical deployment decisions, especially for smaller companies without Kuaishou's resources.
Scalability to Global Contexts: The focus is on local-life. For global platforms, the administrative codes (province, city, town) might not scale universally. A more abstract, continent-agnostic geographic hierarchy might be needed.

Overall, LGSID offers a robust and effective solution for integrating geographic awareness into LLM-driven recommendation systems. Its methodical approach to RL-based alignment and hierarchical tokenization provides valuable insights and a strong foundation for future research in domain-specific LLM applications and spatially-aware recommendation.
