An AI-powered data processing pipeline designed for analyzing SaaS (Software as a Service) financial, usage, and support data. This pipeline helps businesses consolidate and analyze their data across different operational dimensions with intelligent processing and validation at each step.
## 🌟 Features
- **Multi-Source Data Processing**: Process and analyze data from multiple sources:
  - Financial/Billing data
  - Product usage metrics
  - Customer support interactions
- **Intelligent Data Pipeline**:
  - **Smart Data Gathering**
    - Multiple file format support
    - Automatic file categorization
    - Initial data validation
  - **AI-Powered Data Mapping** (see the sketch below)
    - Intelligent column mapping suggestions
    - Standard schema validation
    - Custom field mapping support
  - **Automated Data Cleaning**
    - Smart data type detection
    - Missing value handling strategies
    - Data standardization rules
  - **Flexible Data Aggregation**
    - Multi-level aggregation (Customer/Product)
    - AI-suggested aggregation methods
    - Custom aggregation rules
  - **Advanced Data Joining**
    - Two-phase joining process
    - Smart join key detection
    - Comprehensive join health validation
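
As a rough illustration of the mapping step, here is a minimal sketch of the kind of suggestion it produces: source columns matched onto a standard schema. The function, schema list, and matching heuristic are illustrative stand-ins for the LLM-backed logic, not the repository's actual API.

```python
from typing import Dict, List

# Illustrative only: the repository's actual mapping API may differ.
STANDARD_BILLING_FIELDS = ["CustomerID", "BillingDate", "Revenue", "ProductID"]

def suggest_column_mapping(source_columns: List[str]) -> Dict[str, str]:
    """Naive heuristic stand-in for the AI suggestion step:
    match source columns to standard fields by normalized name."""
    def normalize(name: str) -> str:
        return name.lower().replace("_", "").replace(" ", "")

    targets = {normalize(field): field for field in STANDARD_BILLING_FIELDS}
    return {
        col: targets[normalize(col)]
        for col in source_columns
        if normalize(col) in targets
    }

print(suggest_column_mapping(["customer_id", "billing date", "revenue", "region"]))
# -> {'customer_id': 'CustomerID', 'billing date': 'BillingDate', 'revenue': 'Revenue'}
```

Unmatched columns such as `region` here would be candidates for the custom field mapping step during review.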

## 🚀 Getting Started

### Prerequisites
- Python 3.8+
- OpenAI API key

### Installation
- Clone the repository:

  ```bash
  git clone git@github.com:stepfnAI/data_prep_agent.git
  cd data_prep_agent
  ```
- Create and activate a virtual environment using virtualenv:

  ```bash
  pip install virtualenv          # install virtualenv if not already installed
  virtualenv venv                 # create the virtual environment
  source venv/bin/activate        # Linux/Mac
  # OR
  .\venv\Scripts\activate         # Windows
  ```
- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```
- Set up your OpenAI API key:

  ```bash
  export OPENAI_API_KEY='your-api-key'   # Linux/Mac
  ```
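
  On Windows, set the equivalent in PowerShell instead:

  ```powershell
  $env:OPENAI_API_KEY = 'your-api-key'
  ```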

## 🔄 Pipeline Workflow
- Start the application:

  ```bash
  # Windows
  streamlit run .\orchestration\main_orchestration.py

  # Linux/Mac
  streamlit run ./orchestration/main_orchestration.py
  ```
- Follow the step-by-step process:
  1. Upload your data files
  2. Confirm automatic categorization
  3. Review and adjust column mappings
  4. Configure data cleaning rules
  5. Set up aggregation preferences
  6. Validate and execute data joins

## 📊 Data Requirements

### Billing Data
- Required fields:
  - CustomerID
  - BillingDate
  - Revenue
- Optional fields:
  - ProductID
  - InvoiceID
  - Subscription details

### Usage Data
- Required fields:
  - CustomerID
  - UsageDate
- Optional fields:
  - Feature usage metrics
  - User engagement data
  - Product-specific metrics

### Support Data
- Required fields:
  - CustomerID
  - TicketOpenDate
- Optional fields:
  - Ticket severity
  - Resolution time
  - Support metrics
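
As a quick illustration, this is the kind of required-field check the gathering step performs on an uploaded file. The function name and exact behavior are a sketch based on the field lists above, not the pipeline's actual validator.

```python
import pandas as pd

# Required columns per data category, as listed above.
REQUIRED_FIELDS = {
    "billing": ["CustomerID", "BillingDate", "Revenue"],
    "usage": ["CustomerID", "UsageDate"],
    "support": ["CustomerID", "TicketOpenDate"],
}

def missing_required_fields(df: pd.DataFrame, category: str) -> list:
    """Return the required columns absent from an uploaded file."""
    return [col for col in REQUIRED_FIELDS[category] if col not in df.columns]

# Example: a billing file missing its Revenue column fails validation.
df = pd.DataFrame({"CustomerID": ["C-1001"], "BillingDate": ["2024-01-31"]})
print(missing_required_fields(df, "billing"))  # ['Revenue']
```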

## 🛠️ Architecture

The pipeline consists of these key components; a sketch of how they chain together follows the list:
- **MainOrchestrator**: Controls the overall pipeline flow
- **DataGatherer**: Handles file uploads and categorization
- **DataMapper**: Manages schema mapping and validation
- **DataCleaner**: Processes and standardizes data
- **DataAggregator**: Handles data aggregation logic
- **DataJoiner**: Manages the joining process
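
A minimal sketch of that flow, using pass-through stubs: the class names mirror the components above, but the method names and signatures are illustrative, not the repository's actual API.

```python
# Illustrative flow only; actual signatures in the repo may differ.
class DataGatherer:
    def gather(self, files):       # upload, categorize, initial validation
        return files

class DataMapper:
    def map_columns(self, files):  # AI-suggested schema mapping
        return files

class DataCleaner:
    def clean(self, files):        # type detection, missing-value handling
        return files

class DataAggregator:
    def aggregate(self, files):    # customer/product-level rollups
        return files

class DataJoiner:
    def join(self, files):         # two-phase join with health checks
        return files

def run_pipeline(uploaded_files):
    """The MainOrchestrator's role: run each stage in order."""
    gathered = DataGatherer().gather(uploaded_files)
    mapped = DataMapper().map_columns(gathered)
    cleaned = DataCleaner().clean(mapped)
    aggregated = DataAggregator().aggregate(cleaned)
    return DataJoiner().join(aggregated)
```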

## 🔒 Security
- Secure data handling
- Input validation
- Environment variables for sensitive data
- Safe data processing operations
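
For example, sensitive values such as the OpenAI key are kept in environment variables rather than in code; a minimal sketch of reading one:

```python
import os

# Read the API key from the environment; never hard-code secrets.
api_key = os.environ.get("OPENAI_API_KEY")
if api_key is None:
    raise RuntimeError("OPENAI_API_KEY is not set")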

## 📝 License

MIT License

## 🤝 Contributing

Contributions are welcome! Please follow these steps:

1. Fork the repository
2. Create your feature branch (`git checkout -b feature/AmazingFeature`)
3. Commit your changes (`git commit -m 'Add some AmazingFeature'`)
4. Push to the branch (`git push origin feature/AmazingFeature`)
5. Open a Pull Request

## 📧 Contact

Email: [email protected]