1. Data Ingestion and Transformation - 34%
Perform data ingestion
Read data from streaming sources (for example, Amazon Kinesis, Amazon MSK, DynamoDB Streams, AWS DMS, AWS Glue, Amazon Redshift)
Read data from batch sources (for example, Amazon S3, AWS Glue, Amazon EMR, AWS DMS, Amazon Redshift, AWS Lambda, Amazon AppFlow)
Implement appropriate configuration options for batch ingestion
Consume data APIs
Set up schedulers using EventBridge, Apache Airflow, or time-based schedules
Set up event triggers (for example, S3 Event Notifications, EventBridge)
Call a Lambda function from Kinesis
Create allowlists for IP addresses
Implement throttling and overcome rate limits
Manage fan-in and fan-out for streaming
Describe replayability of ingestion pipelines
Define stateful and stateless data transactions
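The "call a Lambda function from Kinesis" item above can be illustrated with a minimal handler sketch. The event shape follows the documented Kinesis event source mapping payload (base64-encoded data under `Records[].kinesis.data`); the record contents are invented for the example:

```python
import base64
import json

def handler(event, context):
    """Decode and parse each record delivered by a Kinesis event source mapping.

    Kinesis delivers record payloads base64-encoded under
    event["Records"][i]["kinesis"]["data"].
    """
    results = []
    for record in event["Records"]:
        payload = base64.b64decode(record["kinesis"]["data"])
        results.append(json.loads(payload))
    return results

# Trimmed to the fields the handler actually reads:
sample_event = {
    "Records": [
        {"kinesis": {"data": base64.b64encode(json.dumps({"id": 1}).encode()).decode()}}
    ]
}
print(handler(sample_event, None))  # → [{'id': 1}]
```

In production the event source mapping's batch size and failure handling (bisect on error, on-failure destinations) matter as much as the decode step itself.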
Transform and process data
Optimize container usage (EKS, ECS)
Connect to data sources (JDBC, ODBC)
Integrate data from multiple sources
Optimize processing costs
Use transformation services (EMR, Glue, Lambda, Redshift)
Transform data formats (for example, CSV to Parquet)
Troubleshoot transformation failures
Create data APIs using AWS services
Define volume, velocity, and variety
Integrate LLMs for data processing
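The CSV-to-Parquet item above reduces to a row-to-columnar pivot; writing actual Parquet would go through a library such as pyarrow or an AWS Glue job, but the reshaping itself can be sketched with the standard library alone:

```python
import csv
import io

def rows_to_columns(csv_text):
    """Pivot row-oriented CSV into a column-oriented dict,
    the layout that columnar formats such as Parquet store on disk."""
    reader = csv.DictReader(io.StringIO(csv_text))
    columns = {name: [] for name in reader.fieldnames}
    for row in reader:
        for name, value in row.items():
            columns[name].append(value)
    return columns

data = "id,city\n1,Berlin\n2,Osaka\n"
print(rows_to_columns(data))  # → {'id': ['1', '2'], 'city': ['Berlin', 'Osaka']}
```

Columnar storage is what enables Parquet's per-column compression and predicate pushdown, which is why the conversion appears in the exam objectives.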
Orchestrate data pipelines
Use orchestration services (Lambda, EventBridge, MWAA, Step Functions, Glue workflows)
Build scalable, resilient pipelines
Implement serverless workflows
Use notification services (SNS, SQS)
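A minimal sketch of a Step Functions state machine tying the services above together, using the documented `glue:startJobRun.sync` and `sns:publish` service integrations; the job name, account ID, and topic ARN are placeholders:

```json
{
  "Comment": "Minimal ETL flow: run a Glue job, then notify on completion",
  "StartAt": "RunGlueJob",
  "States": {
    "RunGlueJob": {
      "Type": "Task",
      "Resource": "arn:aws:states:::glue:startJobRun.sync",
      "Parameters": { "JobName": "my-etl-job" },
      "Retry": [
        { "ErrorEquals": ["States.ALL"], "IntervalSeconds": 30, "MaxAttempts": 2, "BackoffRate": 2.0 }
      ],
      "Next": "Notify"
    },
    "Notify": {
      "Type": "Task",
      "Resource": "arn:aws:states:::sns:publish",
      "Parameters": {
        "TopicArn": "arn:aws:sns:us-east-1:123456789012:pipeline-events",
        "Message": "ETL complete"
      },
      "End": true
    }
  }
}
```

The `.sync` suffix makes the state wait for the Glue job to finish, and the `Retry` block is where the "scalable, resilient pipelines" item shows up in practice.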
Apply programming concepts
Optimize code for runtime efficiency
Configure Lambda for concurrency
Use programming languages (Python, SQL, Scala, R, Java, Bash, PowerShell)
Apply software engineering best practices
Use Infrastructure as Code (CloudFormation, CDK)
Use AWS SAM for deployment
Use storage volumes in Lambda
Describe CI/CD processes
Define distributed computing
Describe data structures and algorithms
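Infrastructure as Code, Lambda concurrency, and Lambda storage volumes can be shown together in one hedged AWS SAM fragment; the function name and code path are placeholders:

```yaml
AWSTemplateFormatVersion: '2010-09-09'
Transform: AWS::Serverless-2016-10-31
Resources:
  TransformFunction:
    Type: AWS::Serverless::Function
    Properties:
      Handler: app.handler
      Runtime: python3.12
      CodeUri: src/
      ReservedConcurrentExecutions: 10   # caps concurrency to protect downstream systems
      EphemeralStorage:
        Size: 2048                        # /tmp scratch volume in MB (512-10240)
```

`sam deploy` turns this into a CloudFormation stack, which covers the SAM and CI/CD items in one workflow.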
2. Data Store Management - 26%
Choose a data store
Implement storage services based on cost/performance (Redshift, EMR, RDS, DynamoDB, Kinesis)
Configure storage services for access patterns
Apply storage services to appropriate use cases (for example, vector search with HNSW indexes in Amazon MemoryDB)
Integrate migration tools (AWS Transfer Family)
Implement migration or remote access methods (Redshift Spectrum, federated queries)
Manage locks
Manage open table formats (Apache Iceberg)
Describe vector index types (HNSW, IVF)
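As one hedged illustration of configuring a store around its access pattern, a DynamoDB single-table key design; the entity names and key format are invented for the example:

```python
def order_keys(customer_id, order_ts):
    """Build single-table keys so one Query fetches a customer's orders in time order.

    PK groups all items for a customer; SK sorts orders by ISO-8601 timestamp,
    so a key condition like begins_with(SK, 'ORDER#') plus a range on SK answers
    "orders for customer X in a date window" without a table scan.
    """
    return {"PK": f"CUSTOMER#{customer_id}", "SK": f"ORDER#{order_ts}"}

print(order_keys("c42", "2024-05-01T12:00:00Z"))
# → {'PK': 'CUSTOMER#c42', 'SK': 'ORDER#2024-05-01T12:00:00Z'}
```

Designing keys from the query backwards, rather than from the entities forwards, is the core DynamoDB cost/performance lever the objective points at.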
Understand data cataloging systems
Use data catalogs to consume data
Build catalogs (Glue Data Catalog, Hive metastore)
Discover schemas using crawlers
Synchronize partitions
Create connections for cataloging
Manage business catalogs (SageMaker Catalog)
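Partition synchronization can be sketched without a crawler by deriving `ALTER TABLE ... ADD PARTITION` statements (Athena/Hive DDL) from known Hive-style S3 prefixes; the table and bucket names are placeholders:

```python
def add_partition_sql(table, s3_keys, bucket):
    """Derive ALTER TABLE ... ADD PARTITION statements from Hive-style S3 keys.

    An alternative to running a Glue crawler or MSCK REPAIR TABLE when the
    new prefixes are already known to the pipeline that wrote them.
    """
    statements = []
    for key in sorted({k.rsplit("/", 1)[0] for k in s3_keys}):
        parts = [p for p in key.split("/") if "=" in p]
        spec = ", ".join(f"{name}='{value}'" for name, value in (p.split("=", 1) for p in parts))
        statements.append(
            f"ALTER TABLE {table} ADD IF NOT EXISTS PARTITION ({spec}) "
            f"LOCATION 's3://{bucket}/{key}/';"
        )
    return statements

keys = ["sales/year=2024/month=05/part-0.parquet"]
print(add_partition_sql("sales", keys, "my-data-lake"))
```

Explicit `ADD PARTITION` statements scale better than `MSCK REPAIR TABLE`, which lists the whole table location on every run.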
Manage the lifecycle of data
Perform load/unload operations (S3 ↔ Redshift)
Manage S3 lifecycle policies
Expire data using lifecycle rules
Manage versioning and TTL
Delete data based on requirements
Ensure resiliency and availability
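A lifecycle rule covering tiering, expiration, and noncurrent-version cleanup, written as the dictionary shape boto3's `put_bucket_lifecycle_configuration` accepts; the prefix and day counts are illustrative:

```python
# Lifecycle configuration in the shape boto3's
# put_bucket_lifecycle_configuration expects; prefix and durations are examples.
lifecycle = {
    "Rules": [
        {
            "ID": "tier-then-expire-raw-data",
            "Filter": {"Prefix": "raw/"},
            "Status": "Enabled",
            "Transitions": [
                {"Days": 30, "StorageClass": "STANDARD_IA"},   # warm tier after 30 days
                {"Days": 90, "StorageClass": "GLACIER"},       # cold tier after 90 days
            ],
            "Expiration": {"Days": 365},                       # delete current versions
            "NoncurrentVersionExpiration": {"NoncurrentDays": 30},
        }
    ]
}
```

One rule like this covers the load/unload-adjacent retention items above: tiering for cost, expiration for deletion requirements, and noncurrent-version cleanup when versioning is on.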
Design data models and schema evolution
Design schemas (Redshift, DynamoDB, Lake Formation)
Handle schema changes
Perform schema conversion (AWS SCT, DMS)
Establish data lineage
Apply indexing, partitioning, compression best practices
Describe vectorization concepts
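Handling schema changes often reduces to an additive-compatibility rule; a minimal sketch, assuming schemas are represented as name-to-type mappings:

```python
def is_backward_compatible(old_schema, new_schema):
    """Additive-only check: every old column must survive with the same type;
    new columns are allowed. A common rule for evolving Parquet/Iceberg table
    schemas without breaking existing readers."""
    for name, dtype in old_schema.items():
        if new_schema.get(name) != dtype:
            return False
    return True

old = {"id": "bigint", "city": "string"}
new = {"id": "bigint", "city": "string", "country": "string"}
print(is_backward_compatible(old, new))             # → True (column added)
print(is_backward_compatible(old, {"id": "string"}))  # → False (type changed, column dropped)
```

Open table formats such as Iceberg enforce richer versions of this check natively, including safe type promotions, which is why they appear alongside schema evolution in the objectives.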
3. Data Operations and Support - 22%
Automate data processing
Orchestrate pipelines (MWAA, Step Functions)
Troubleshoot workflows
Call SDKs for AWS services
Process data using EMR, Redshift, Glue
Maintain data APIs
Prepare data (DataBrew, SageMaker)
Query data (Athena)
Use Lambda for automation
Manage events (EventBridge)
Analyze data
Visualize data (QuickSight, DataBrew)
Clean data (Lambda, Athena, notebooks)
Use SQL for querying
Use Athena notebooks with Spark
Describe provisioned vs serverless tradeoffs
Define aggregation, grouping, pivoting
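Aggregation, grouping, and pivoting can be demonstrated with standard SQL; this sketch uses an in-memory SQLite table so it runs anywhere, but the `GROUP BY` plus conditional-aggregation pattern carries over to Athena and Redshift:

```python
import sqlite3

# In-memory table to demonstrate aggregation, grouping, and a pivot via
# conditional aggregation.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (region TEXT, quarter TEXT, amount REAL)")
con.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("eu", "Q1", 100.0), ("eu", "Q2", 150.0), ("us", "Q1", 200.0)],
)

# Pivot quarters into columns with SUM(CASE ...) per group.
rows = con.execute("""
    SELECT region,
           SUM(CASE WHEN quarter = 'Q1' THEN amount ELSE 0.0 END) AS q1,
           SUM(CASE WHEN quarter = 'Q2' THEN amount ELSE 0.0 END) AS q2
    FROM sales
    GROUP BY region
    ORDER BY region
""").fetchall()
print(rows)  # → [('eu', 100.0, 150.0), ('us', 200.0, 0.0)]
```

`SUM(CASE ...)` is the portable pivot idiom; some engines also offer a native `PIVOT` clause, but the conditional form works everywhere SQL does.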
Maintain and monitor pipelines
Extract logs for audits
Deploy monitoring solutions
Send alerts using notifications
Troubleshoot performance issues
Track API calls using CloudTrail
Use CloudWatch Logs
Analyze logs (Athena, OpenSearch, EMR)
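Log analysis of the kind CloudWatch Logs Insights performs (for example, `stats count(*) by level`) can be sketched in plain Python; the log line format here is an assumption for the example:

```python
import re
from collections import Counter

# Assumed line format: "<timestamp> [LEVEL] message"
LOG_PATTERN = re.compile(r"\[(?P<level>ERROR|WARN|INFO)\]\s+(?P<message>.*)")

def count_levels(lines):
    """Tally log levels across lines; lines that do not match are skipped."""
    counts = Counter()
    for line in lines:
        match = LOG_PATTERN.search(line)
        if match:
            counts[match.group("level")] += 1
    return counts

logs = [
    "2024-05-01T12:00:00Z [INFO] job started",
    "2024-05-01T12:01:00Z [ERROR] step failed: timeout",
    "2024-05-01T12:01:05Z [ERROR] step failed: retry exhausted",
]
print(count_levels(logs))  # ERROR: 2, INFO: 1
```

The same tally, expressed as a Logs Insights query or an Athena query over exported logs, is what a monitoring dashboard alert would be built on.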
Ensure data quality
Run data quality checks
Define quality rules (DataBrew)
Investigate consistency
Describe sampling techniques
Implement data skew handling
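Completeness and uniqueness, two of the most common data quality rules, can be sketched as follows; the field names are invented, and a managed rule engine such as AWS Glue Data Quality or DataBrew would replace this in practice:

```python
def run_quality_checks(rows, required, unique_key):
    """Minimal completeness and uniqueness rules over a list of dict rows.

    Returns a list of human-readable failures; empty means the batch passed.
    """
    failures = []
    for i, row in enumerate(rows):
        for field in required:
            if row.get(field) in (None, ""):
                failures.append(f"row {i}: missing {field}")
    keys = [row.get(unique_key) for row in rows]
    if len(keys) != len(set(keys)):
        failures.append(f"duplicate values in {unique_key}")
    return failures

rows = [{"id": 1, "email": "a@x.io"}, {"id": 1, "email": ""}]
print(run_quality_checks(rows, required=["email"], unique_key="id"))
# → ['row 1: missing email', 'duplicate values in id']
```

Running checks like these before loading, and failing the pipeline on a non-empty result, is the usual pattern for keeping bad batches out of the warehouse.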
4. Data Security and Governance - 18%
Apply authentication mechanisms
Update VPC security groups
Create IAM roles, policies, endpoints
Manage credentials (Secrets Manager)
Set up IAM roles for services
Apply IAM policies
Differentiate managed vs unmanaged services
Use domains and projects in SageMaker
Apply authorization mechanisms
Create custom IAM policies
Store credentials securely
Manage database access
Use Lake Formation permissions
Apply role-based and attribute-based access
Follow least privilege principles
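Least privilege can be illustrated with a policy that grants read access to a single S3 prefix only; the bucket name and prefix are placeholders:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "ReadOnlyOnOnePrefix",
      "Effect": "Allow",
      "Action": ["s3:GetObject"],
      "Resource": "arn:aws:s3:::my-data-lake/curated/*"
    },
    {
      "Sid": "ListBucketScoped",
      "Effect": "Allow",
      "Action": ["s3:ListBucket"],
      "Resource": "arn:aws:s3:::my-data-lake",
      "Condition": { "StringLike": { "s3:prefix": ["curated/*"] } }
    }
  ]
}
```

Note the split: `s3:GetObject` applies to object ARNs, while `s3:ListBucket` applies to the bucket ARN and is scoped with the `s3:prefix` condition key; mixing the two resource types in one statement is a common policy-writing mistake.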
Ensure data encryption and masking
Apply masking and anonymization
Use AWS KMS for encryption
Configure cross-account encryption
Enable encryption in transit
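Masking versus encryption: a keyed hash pseudonymizes a PII column while keeping it joinable, whereas KMS encryption keeps the original recoverable. A minimal sketch, with the secret shown inline only for illustration:

```python
import hashlib
import hmac

SECRET = b"rotate-me"  # placeholder; fetch from Secrets Manager, never hard-code

def pseudonymize(value, secret=SECRET):
    """Deterministic keyed hash (HMAC-SHA256) so a masked PII column can still
    be joined on. This is one-way masking, not encryption: use AWS KMS when
    the original value must be recoverable."""
    return hmac.new(secret, value.encode(), hashlib.sha256).hexdigest()[:16]

masked = pseudonymize("alice@example.com")
print(len(masked))  # → 16
assert pseudonymize("alice@example.com") == masked  # deterministic per key
```

Keying the hash matters: an unkeyed hash of a low-entropy field like an email address is trivially reversible by dictionary attack, which is why the secret must be managed and rotated.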
Prepare logs for audit
Use CloudTrail for API tracking
Store logs in CloudWatch
Use CloudTrail Lake
Analyze logs (Athena, OpenSearch)
Integrate logging services
Understand data privacy and governance
Grant permissions for data sharing
Implement PII detection (Macie)
Apply data privacy strategies
Track configuration changes (AWS Config)
Maintain data sovereignty
Manage access via SageMaker Catalog
Describe governance frameworks