Code Along for New Project

Step 1: Set Up the Environment and Install Dependencies

In this step we create a virtual environment and install all the necessary packages that we’ll use throughout the project. This ensures that our dependencies are isolated and won’t conflict with any other projects.

─────────────────────────────

# Create a virtual environment (using venv)
python3 -m venv dreamweaver-env

# Activate the environment on macOS/Linux:
source dreamweaver-env/bin/activate

# Activate the environment on Windows:
dreamweaver-env\Scripts\activate

# Once activated, install the required libraries:
pip install torch transformers streamlit numpy pandas scikit-learn
─────────────────────────────

Explanation:
• We are using Python’s built-in venv module to create an isolated environment.
• Packages installed include PyTorch for model training, Hugging Face Transformers for language models, Streamlit for the web interface, and additional libraries like NumPy, Pandas, and scikit-learn for data processing.
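
To double-check the environment (optional, and not part of the original steps), a short script can confirm that the key packages import correctly and whether PyTorch can see a GPU:

─────────────────────────────
# Optional sanity check: confirm key packages import and report versions.
import torch
import transformers
import streamlit

print(f"PyTorch {torch.__version__} (CUDA available: {torch.cuda.is_available()})")
print(f"Transformers {transformers.__version__}, Streamlit {streamlit.__version__}")
─────────────────────────────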

─────────────────────────────

Step 2: Data Collection and Preprocessing

Here, we load our dataset of creative stories and clean the text. This prepares the raw stories for the fine-tuning process by removing unwanted characters and ensuring a consistent format.

─────────────────────────────
import pandas as pd
from sklearn.model_selection import train_test_split

# Load the dataset - assume a CSV file with a column called 'story'
data = pd.read_csv("stories_dataset.csv")
print(f"Total stories loaded: {len(data)}")

def preprocess_text(text):
    """
    Clean the text by replacing newline characters and removing extra spaces.
    Additional cleaning can be added as needed.
    """
    # Replace newlines with a space and strip surrounding whitespace
    text = text.replace("\n", " ").strip()
    # Further cleaning steps can be inserted here (e.g., remove special symbols)
    return text

# Apply preprocessing to the story column
data['story_clean'] = data['story'].apply(preprocess_text)

# Split data into training and validation sets (90% train, 10% validation)
train_data, val_data = train_test_split(data['story_clean'], test_size=0.1, random_state=42)

# Save preprocessed texts to separate files, each story on a new line
with open("train_texts.txt", "w", encoding="utf-8") as f_train:
    for text in train_data:
        f_train.write(text + "\n")

with open("val_texts.txt", "w", encoding="utf-8") as f_val:
    for text in val_data:
        f_val.write(text + "\n")
─────────────────────────────

Explanation:
• We load data using Pandas and split it into training and validation sets using scikit-learn’s train_test_split.
• Preprocessing uses a simple function to replace newline characters with spaces; you can enhance it further depending on your dataset.
• The cleaned stories are saved into text files that will be used for model training.
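
If your stories contain markup, control characters, or irregular whitespace, the cleaning function can be extended. Below is a sketch of one possible approach; the regex patterns are illustrative assumptions to adapt to your own data:

─────────────────────────────
import re

def preprocess_text_extended(text):
    """Hypothetical extension of preprocess_text with extra cleanup steps."""
    text = text.replace("\n", " ")
    # Collapse runs of whitespace into a single space
    text = re.sub(r"\s+", " ", text)
    # Drop characters outside the basic printable ASCII range (adjust as needed)
    text = re.sub(r"[^\x20-\x7E]", "", text)
    return text.strip()
─────────────────────────────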

─────────────────────────────

Step 3: Model Selection and Fine-Tuning

In this step, we fine-tune the pre-trained GPT-2 model on our custom story dataset using Hugging Face Transformers and PyTorch. We load the dataset from text files, prepare the data collator to format inputs correctly, and set up the Trainer for model fine-tuning.

─────────────────────────────
import torch
from transformers import (
    GPT2LMHeadModel,
    GPT2Tokenizer,
    TextDataset,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# Load the pre-trained GPT-2 model and tokenizer
model_name = "gpt2"
model = GPT2LMHeadModel.from_pretrained(model_name)
tokenizer = GPT2Tokenizer.from_pretrained(model_name)

# Set the pad token to the eos token to avoid padding issues
tokenizer.pad_token = tokenizer.eos_token

def load_dataset(file_path, tokenizer, block_size=128):
    """
    Load the dataset from a text file using Hugging Face's TextDataset utility.
    The file should have one training example per line.
    """
    return TextDataset(
        tokenizer=tokenizer,
        file_path=file_path,
        block_size=block_size,
        overwrite_cache=True,
    )

# Load training and validation datasets
train_dataset = load_dataset("train_texts.txt", tokenizer)
val_dataset = load_dataset("val_texts.txt", tokenizer)

# Use a data collator to format batches for language modeling
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=False,  # GPT-2 is a causal (autoregressive) model, so no masked language modeling
)

# Define training arguments for the Trainer
training_args = TrainingArguments(
    output_dir="./dreamweaver_model",
    overwrite_output_dir=True,
    num_train_epochs=3,
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    evaluation_strategy="steps",
    eval_steps=500,
    logging_steps=500,
    save_steps=500,
    warmup_steps=100,
    weight_decay=0.01,
    logging_dir="./logs",
    save_total_limit=2,
)

# Create a Trainer instance using our model, datasets, and training parameters
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    data_collator=data_collator,
)

# Begin training the model
print("Starting training...")
trainer.train()

# Save the fine-tuned model and tokenizer to disk for later use
model.save_pretrained("./dreamweaver_model")
tokenizer.save_pretrained("./dreamweaver_model")
print("Model fine-tuning complete and saved.")
─────────────────────────────

Explanation:
• We load GPT-2 and adjust the tokenizer to avoid issues with padding.
• The TextDataset helper creates a dataset object from our text file; block_size controls the maximum token length per example.
• DataCollatorForLanguageModeling prepares the input sequences and labels.
• TrainingArguments configures training parameters, including batch sizes and the evaluation and saving steps.
• Trainer simplifies the training loop; after training, the fine-tuned model is saved.
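
A practical note: because save_steps writes periodic checkpoints to output_dir, an interrupted run can be resumed with the Trainer's built-in checkpointing support. A minimal sketch:

─────────────────────────────
# Resume fine-tuning from the most recent checkpoint in output_dir, if one exists.
trainer.train(resume_from_checkpoint=True)
─────────────────────────────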

─────────────────────────────

Step 4: Implementing the Story Generation Pipeline

After fine-tuning, we want to generate creative stories from a user prompt. We use Hugging Face’s pipeline to simplify text generation. This code defines a function that takes a prompt and returns generated text.

─────────────────────────────
from transformers import pipeline

# Load the fine-tuned model and tokenizer for generation
model_path = "./dreamweaver_model"
generator = pipeline('text-generation', model=model_path, tokenizer=model_path)

def generate_story(prompt, max_length=200, num_return_sequences=1):
    """
    Generate a story based on the provided prompt.

    Parameters:
      prompt (str): The seed text to begin the story.
      max_length (int): Maximum total length of the generated sequence.
      num_return_sequences (int): Number of different stories to generate.

    Returns:
      List of generated story dictionaries.
    """
    stories = generator(prompt,
                        max_length=max_length,
                        num_return_sequences=num_return_sequences,
                        do_sample=True,
                        temperature=0.8)
    return stories

# Example usage
prompt_text = "In a forgotten village where magic and mystery intertwine,"
stories = generate_story(prompt_text)
print("Generated Story:")
print(stories[0]['generated_text'])
─────────────────────────────

Explanation:
• The pipeline API abstracts text generation and loads our fine-tuned model.
• The generate_story function accepts configuration parameters such as max_length and temperature to control randomness.
• A sample prompt is provided and the generated story is printed.
• The temperature parameter controls diversity (lower values yield more conservative completions).
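
Temperature is not the only sampling knob: the pipeline forwards standard generation parameters such as top_k and top_p to the underlying generate method. A sketch with illustrative (untuned) values:

─────────────────────────────
# Combine temperature with top-k and nucleus (top-p) sampling.
stories = generator(
    "In a forgotten village where magic and mystery intertwine,",
    max_length=200,
    do_sample=True,
    temperature=0.8,
    top_k=50,     # sample only from the 50 most likely next tokens
    top_p=0.95,   # further restrict to the smallest set covering 95% probability mass
    num_return_sequences=1,
)
─────────────────────────────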

─────────────────────────────

Step 5: Creating a Web User Interface with Streamlit

To allow users to interact with our story generator, we build a simple web application using Streamlit. The app provides an input field for the story prompt, a slider to adjust maximum story length, and displays the generated story.

─────────────────────────────

# Save this file as app.py
import streamlit as st
from transformers import pipeline

@st.cache(allow_output_mutation=True)
def load_generator():
    """
    Cache the generator to avoid reloading the model every time the app runs.
    """
    return pipeline('text-generation',
                    model="./dreamweaver_model",
                    tokenizer="./dreamweaver_model")

generator = load_generator()

# Set up the Streamlit application layout
st.title("DreamWeaver: Creative Story Generator")
st.write("Enter a creative prompt below to generate a unique short story.")

# Input field for story prompt with a default value
prompt = st.text_input("Story Prompt:", "In a forgotten village where magic and mystery intertwine,")

# Slider to control the maximum generated length
max_length = st.slider("Maximum Story Length", min_value=50, max_value=500, value=200)

if st.button("Generate Story"):
    with st.spinner("Crafting your story..."):
        generated_output = generator(prompt,
                                     max_length=max_length,
                                     num_return_sequences=1,
                                     do_sample=True,
                                     temperature=0.8)
        story = generated_output[0]['generated_text']
        st.subheader("Your Generated Story")
        st.write(story)
─────────────────────────────

To run the Streamlit app, execute the following command in your terminal:

─────────────────────────────
streamlit run app.py
─────────────────────────────

Explanation:
• The generator is cached with st.cache to improve app performance.
• The app provides a text input for a story prompt and a slider for maximum story length.
• When the button is pressed, the story is generated and displayed in the browser.
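
Two optional tweaks, offered as sketches rather than required changes: on recent Streamlit releases the st.cache decorator has been replaced by st.cache_resource, and exposing the sampling temperature in the UI is a natural extension.

─────────────────────────────
# On Streamlit >= 1.18, cache the model with st.cache_resource instead:
# @st.cache_resource
# def load_generator():
#     ...

# Optional: let users control sampling temperature from the sidebar.
temperature = st.sidebar.slider("Temperature", min_value=0.1, max_value=1.5, value=0.8, step=0.1)
─────────────────────────────

If you add the slider, pass its value into the generator call in place of the hard-coded 0.8.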

─────────────────────────────

Step 6: Testing and Evaluation

Evaluating the model’s performance can be done by calculating its perplexity on the validation set. Lower perplexity generally indicates that the model is generating more coherent text.

─────────────────────────────
import math
import torch
from transformers import TextDataset, DataCollatorForLanguageModeling

def evaluate_perplexity(model, tokenizer, file_path):
    """
    Evaluate the model's perplexity on a given dataset file.

    Parameters:
      model: The fine-tuned language model.
      tokenizer: The tokenizer associated with the model.
      file_path (str): Path to the text file containing evaluation data.

    Returns:
      Calculated perplexity (float).
    """
    # Load dataset for evaluation (similar to training but for inference)
    eval_dataset = TextDataset(
        tokenizer=tokenizer,
        file_path=file_path,
        block_size=128,
        overwrite_cache=True,
    )
    data_collator = DataCollatorForLanguageModeling(
        tokenizer=tokenizer,
        mlm=False,
    )

    total_loss = 0.0
    count = 0
    model.eval()  # Set model to evaluation mode

    # Iterate over the dataset in small batches
    # For simplicity, use manual batching with size 2
    batch_size = 2
    for i in range(0, len(eval_dataset), batch_size):
        # Prepare a batch of examples; the collator also builds the labels
        batch = [eval_dataset[j] for j in range(i, min(i + batch_size, len(eval_dataset)))]
        batch_data = data_collator(batch)
        # Move tensors to the model device (CPU or GPU)
        inputs = {key: tensor.to(model.device) for key, tensor in batch_data.items()}

        with torch.no_grad():
            # The collator already supplies "labels", so no extra labels argument is needed
            outputs = model(**inputs)
            loss = outputs.loss
            total_loss += loss.item()
            count += 1

    average_loss = total_loss / count if count > 0 else float('inf')
    perplexity = math.exp(average_loss)
    return perplexity

# Evaluate perplexity on the validation texts
perp = evaluate_perplexity(model, tokenizer, "val_texts.txt")
print(f"Validation Perplexity: {perp:.2f}")
─────────────────────────────

Explanation:
• The evaluate_perplexity function computes the loss across the validation dataset.
• The loss is averaged over the batches, then exponentiated to retrieve perplexity.
• Lower perplexity signifies that the model’s outputs are more predictable and coherent.
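
To make the loss-to-perplexity relationship concrete, here is a toy calculation (the numbers are made up, not real results):

─────────────────────────────
import math

# A hypothetical average cross-entropy loss of 3.0 on the validation set
# corresponds to a perplexity of e^3.0:
print(math.exp(3.0))  # ~20.09
─────────────────────────────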

─────────────────────────────

Step 7: Deployment and Future Improvements

Finally, we consider deploying our application so that others can experience DreamWeaver. One common approach is to use Docker to containerize the app, ensuring that it runs consistently across different environments.

─────────────────────────────

# Dockerfile for deploying the DreamWeaver Streamlit app
FROM python:3.8-slim

# Set the working directory in the container
WORKDIR /app

# Copy and install requirements
COPY requirements.txt /app/
RUN pip install --no-cache-dir -r requirements.txt

# Copy the entire project into the container
COPY . /app/

# Expose the port that Streamlit uses
EXPOSE 8501

# Run the Streamlit app on container start
CMD ["streamlit", "run", "app.py", "--server.port", "8501", "--server.address", "0.0.0.0"]
─────────────────────────────

Explanation:
• The Dockerfile uses a slim Python image to reduce container size.
• requirements.txt should list all dependencies (torch, transformers, streamlit, etc.).
• The application is exposed on port 8501, the default for Streamlit.
• When the container launches, it automatically starts the Streamlit app.
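
A minimal requirements.txt matching the packages installed in Step 1 might look like the following (versions are left unpinned here; pin them for reproducible builds):

─────────────────────────────
torch
transformers
streamlit
numpy
pandas
scikit-learn
─────────────────────────────

With the Dockerfile and requirements.txt in place, the standard Docker commands build the image and run the container:

─────────────────────────────
docker build -t dreamweaver .
docker run -p 8501:8501 dreamweaver
─────────────────────────────

Once running, the app is reachable at http://localhost:8501.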

─────────────────────────────

This completes the step-by-step implementation of DreamWeaver—a creative narrative generator combining data preprocessing, model fine-tuning, story generation, and a user-friendly web interface. Feel free to explore further optimizations, advanced sampling methods, and new deployment strategies as you refine your project. Happy coding and storytelling!