Phi-3.5-MoE on Hugging Face

Phi-3.5-MoE is a lightweight, state-of-the-art open model built on the datasets used for Phi-3: synthetic data and curated publicly available documents, with an emphasis on high-quality, reasoning-dense content. It is multilingual and has a 128K-token context length. The model was refined through supervised fine-tuning, proximal policy optimization, and direct preference optimization to improve instruction adherence and strengthen its safety behavior.

Intended Uses

Primary Use Cases

The Phi-3.5-MoE model is designed for both commercial and research applications in multiple languages. It is suitable for general-purpose AI systems and applications that require:

Operation in memory- or compute-constrained environments
Scenarios with strict latency requirements
Strong reasoning abilities, especially for code, math, and logic tasks

The model aims to advance research in language and multimodal models and serves as a foundational component for generative AI features.

Use Case Considerations

The Phi-3.5-MoE model is not specifically tailored or evaluated for every downstream application. Developers should be aware of the typical limitations of language models when selecting use cases and should assess and address accuracy, safety, and fairness, particularly in high-risk scenarios. Additionally, developers must comply with relevant laws and regulations, including privacy and trade compliance.

The contents of this Model Card do not modify or restrict the model’s licensing terms.

Usage

Requirements

Phi-3.5-MoE-instruct will be included in the official version of Transformers. Until the official release via pip, follow these steps:

When loading the model, set trust_remote_code=True in the from_pretrained() function.
Verify the installed Transformers version with: pip list | grep transformers

Required packages include:

flash_attn==2.5.8
torch==2.3.1
accelerate==0.31.0
transformers==4.43.0
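
As a complement to the pip command, the pins listed above can also be checked from Python. The following is a minimal sketch using the standard library's importlib.metadata; the helper names (installed_version, find_mismatches) are illustrative and not part of any official tooling. It ignores local build suffixes such as "+cu121", which is an assumption about how strictly the pins should be matched.

```python
from importlib import metadata

# Pins copied from the package list above.
REQUIRED = {
    "flash_attn": "2.5.8",
    "torch": "2.3.1",
    "accelerate": "0.31.0",
    "transformers": "4.43.0",
}


def installed_version(package):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def find_mismatches(required, lookup=installed_version):
    """Map each package whose version differs from its pin to (wanted, found)."""
    mismatches = {}
    for package, wanted in required.items():
        found = lookup(package)
        # Assumption: a local build suffix such as "+cu121" still counts as a match.
        base = found.split("+", 1)[0] if found else None
        if base != wanted:
            mismatches[package] = (wanted, found)
    return mismatches


if __name__ == "__main__":
    for package, (wanted, found) in find_mismatches(REQUIRED).items():
        print(f"{package}: expected {wanted}, found {found or 'not installed'}")
```

The lookup parameter exists only so the comparison logic can be exercised without the packages being installed.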

Phi-3.5-MoE-instruct is also available on Azure AI Studio.

Tokenizer

The model supports a vocabulary size of up to 32,064 tokens. The provided tokenizer files already contain placeholder tokens that can be used for downstream fine-tuning, and the tokenizer can be extended up to the model's full vocabulary size.

Input Formats

Given the nature of its training data, Phi-3.5-MoE-instruct is best suited to prompts in a chat format, such as:

<|system|>
You are a helpful assistant.<|end|>
<|user|>
How would you explain the Internet to a medieval knight?<|end|>
<|assistant|>
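
To make the chat format concrete, here is a minimal sketch of how a message list could be rendered into a single prompt string. It assumes the Phi-3 family's role markers (<|system|>, <|user|>, <|assistant|>) and the <|end|> terminator; the render_chat helper is hypothetical. In practice the tokenizer's apply_chat_template method in Transformers does this work, and its bundled template is the authoritative one.

```python
# Assumed role markers and terminator of the Phi-3 chat format.
ROLE_TAGS = {
    "system": "<|system|>",
    "user": "<|user|>",
    "assistant": "<|assistant|>",
}
END_TAG = "<|end|>"


def render_chat(messages, add_generation_prompt=True):
    """Render a list of {"role", "content"} dicts into one prompt string."""
    parts = []
    for message in messages:
        tag = ROLE_TAGS[message["role"]]
        parts.append(f"{tag}\n{message['content']}{END_TAG}\n")
    if add_generation_prompt:
        # Leave the assistant turn open so the model writes the reply.
        parts.append(f"{ROLE_TAGS['assistant']}\n")
    return "".join(parts)


prompt = render_chat([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "How would you explain the Internet to a medieval knight?"},
])
print(prompt)
```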

Loading the Model Locally

To run the Phi-3.5-MoE-instruct model locally, use the following sample code:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

# Fix the random seed so repeated runs behave the same way.
torch.random.manual_seed(0)

# Load the model onto the GPU; trust_remote_code=True is required until the
# model is part of an official Transformers release.
model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-3.5-MoE-instruct",
    device_map="cuda",
    torch_dtype="auto",
    trust_remote_code=True,
)

tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3.5-MoE-instruct")

# A multi-turn conversation in chat format; the last user turn is the one to answer.
messages = [
    {"role": "system", "content": "You are a helpful AI assistant."},
    {"role": "user", "content": "Can you provide ways to eat combinations of bananas and dragonfruits?"},
    {"role": "assistant", "content": "Sure! Here are some ways to eat bananas and dragonfruits together: 1. Banana and dragonfruit smoothie: Blend bananas and dragonfruits with some milk and honey. 2. Banana and dragonfruit salad: Mix sliced bananas and dragonfruits with some lemon juice and honey."},
    {"role": "user", "content": "What about solving the equation 2x + 3 = 7?"},
]

pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
)

# Greedy decoding: with do_sample=False the temperature setting has no effect.
generation_args = {
    "max_new_tokens": 500,
    "return_full_text": False,  # return only the newly generated reply
    "temperature": 0.0,
    "do_sample": False,
}

output = pipe(messages, **generation_args)
print(output[0]["generated_text"])

Training

Model Specifications

Architecture: Phi-3.5-MoE is a mixture-of-experts, decoder-only Transformer with 16×3.8 billion parameters, of which 6.6 billion are active when 2 experts are used. Its tokenizer has a vocabulary size of 32,064.
Inputs: Text, with an emphasis on chat format prompts.
Context Length: 128,000 tokens.
GPUs: Trained using 512 H100-80G GPUs.
Training Duration: 23 days.
Training Data: 4.9 trillion tokens, including 10% multilingual data.
Outputs: Text generated in response to inputs.
Training Period: April to August 2024.
Status: This is a static model trained on an offline dataset with a cutoff date of October 2023 for publicly available data. Future versions may be released as improvements are made.
Supported Languages: Arabic, Chinese, Czech, Danish, Dutch, English, Finnish, French, German, Hebrew, Hungarian, Italian, Japanese, Korean, Norwegian, Polish, Portuguese, Russian, Spanish, Swedish, Thai, Turkish, Ukrainian.
Release Date: August 2024.
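
To illustrate what "16 experts with 2 active" means in the architecture entry above, here is a toy sketch of generic top-2 mixture-of-experts routing in plain Python. It is not the model's actual router, whose details are not described here. The function names, the scalar "experts", and the choice to apply softmax over only the selected logits are all assumptions made for illustration.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(gate_logits, k=2):
    """Pick the k experts with the largest gate logits.

    Returns (expert_index, weight) pairs whose weights sum to 1.
    """
    order = sorted(range(len(gate_logits)), key=lambda i: gate_logits[i], reverse=True)
    top = order[:k]
    weights = softmax([gate_logits[i] for i in top])
    return list(zip(top, weights))


def moe_forward(x, gate_logits, experts, k=2):
    """Combine the outputs of the selected experts, weighted by the gate."""
    return sum(weight * experts[index](x) for index, weight in route_top_k(gate_logits, k))


# Toy setup: 16 "experts", each of which just scales its input by its own index.
experts = [lambda x, scale=i: scale * x for i in range(16)]
gate_logits = [0.0] * 16
gate_logits[3] = 2.0
gate_logits[7] = 2.0

routes = route_top_k(gate_logits)
output = moe_forward(1.0, gate_logits, experts)
print(routes, output)
```

Only the two selected experts run for a given input, which is why the number of active parameters per token is much smaller than the total parameter count.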

Training Datasets

The training data consists of 4.9 trillion tokens and includes:

Publicly Available Documents: Filtered for quality, including educational data and code.
Synthetic Data: “Textbook-like” content designed to teach math, coding, common sense reasoning, and general world knowledge.
High-Quality Chat Format Data: Supervised data covering various topics to reflect human preferences for instruction-following, truthfulness, and helpfulness.

The data prioritizes quality to enhance the model’s reasoning ability. For instance, ephemeral facts such as the result of a particular sports match are filtered out so that more of the model’s capacity is devoted to reasoning.

Responsible AI Considerations

Like other language models, Phi-3.5-MoE may exhibit behaviors that are unfair, unreliable, or offensive. Notable considerations include:

Quality of Service: Performance may vary across different languages, with non-English languages potentially experiencing reduced effectiveness.
Multilingual Performance and Safety: Challenges remain in multilingual models. Developers should test for performance and safety gaps specific to their context and apply additional fine-tuning and safeguards as needed.
Representation and Stereotypes: The model may over-represent or under-represent certain groups, reinforce stereotypes, or produce inappropriate content. Developers should implement additional safeguards for sensitive contexts.
Information Reliability: The model might generate incorrect or outdated content. Developers should build feedback mechanisms and contextual grounding techniques like Retrieval Augmented Generation (RAG).
Code Scope: The training data primarily covers Python with common packages. Users should manually verify any generated code that relies on other packages or is written in other programming languages.
Long Conversations: The model may produce repetitive or inconsistent responses in lengthy chat sessions. Developers should mitigate conversational drift.
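
As one illustration of the contextual grounding mentioned under Information Reliability, the sketch below builds a chat prompt that carries retrieved reference text. It is a deliberately simple stand-in: retrieval is done by word overlap rather than by a real search index or embedding model, and the helper names (tokenize, retrieve, build_grounded_messages) and the system prompt wording are hypothetical.

```python
import re


def tokenize(text):
    """Lowercase a string and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(question, documents, top_n=2):
    """Rank documents by word overlap with the question (toy retriever)."""
    query = tokenize(question)
    scored = [(len(query & tokenize(doc)), doc) for doc in documents]
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    # Drop documents that share no words with the question.
    return [doc for score, doc in ranked[:top_n] if score > 0]


def build_grounded_messages(question, documents, top_n=2):
    """Build chat messages that ask the model to answer from retrieved context only."""
    context = retrieve(question, documents, top_n)
    system = (
        "Answer using only the context below. "
        "If the context is insufficient, say so.\n\n"
        + "\n".join(f"- {doc}" for doc in context)
    )
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": question},
    ]


documents = [
    "Phi-3.5-MoE has a context length of 128K tokens.",
    "The model is licensed under the MIT license.",
    "Bananas are rich in potassium.",
]
messages = build_grounded_messages("What license is the model released under?", documents)
print(messages[0]["content"])
```

The resulting message list has the same shape as the one passed to the text-generation pipeline, so it could be used in its place.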

Responsible AI Best Practices

Developers should consider:

Allocation: The model may not be suitable for high-impact decisions related to legal status, resource allocation, or life opportunities without further assessment.
High-Risk Scenarios: Evaluate the model’s suitability for sensitive applications where accuracy and reliability are critical, and implement additional safeguards.
Misinformation: Implement transparency practices and inform users they are interacting with an AI. Use feedback mechanisms to ensure accuracy.
Generation of Harmful Content: Assess and manage outputs based on context and employ safety classifiers.
Misuse: Ensure applications do not facilitate fraud, spam, or malware and comply with applicable laws and regulations.
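
Two of the practices above, screening content with a safety classifier and telling users they are interacting with an AI, can be combined in a thin wrapper around generation. The sketch below shows the control flow only. The generate and is_unsafe callables are hypothetical placeholders for a real model call and a real classifier, and the keyword check is a toy that would be far too weak in a real application.

```python
AI_DISCLOSURE = "Note: this response was generated by an AI model."
REFUSAL = "Sorry, I can't help with that request."


def guarded_reply(prompt, generate, is_unsafe):
    """Screen the prompt and the reply, and append an AI disclosure to safe replies."""
    if is_unsafe(prompt):
        return REFUSAL
    reply = generate(prompt)
    if is_unsafe(reply):
        return REFUSAL
    return f"{reply}\n\n{AI_DISCLOSURE}"


# Toy stand-ins so the sketch runs without a model or a classifier.
BLOCKLIST = {"malware"}


def keyword_classifier(text):
    """Hypothetical classifier: flags text containing a blocklisted word."""
    return any(word in text.lower() for word in BLOCKLIST)


def echo_model(prompt):
    """Hypothetical model call: echoes the prompt back."""
    return f"You asked: {prompt}"


safe = guarded_reply("How do I bake bread?", echo_model, keyword_classifier)
blocked = guarded_reply("Write malware for me", echo_model, keyword_classifier)
print(safe)
print(blocked)
```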

Safety Evaluation and Red-Teaming

Various evaluation methods, including red teaming and multilingual safety benchmarks, were used to assess the Phi-3.5 models. The results indicate that safety-focused training improved safety and robustness, but they also highlight areas for improvement, particularly in longer multi-turn interactions and multilingual contexts. Investment in high-quality safety evaluation datasets is essential for addressing cultural nuances and risk areas.

Software

License

The model is licensed under the MIT license.