Best practices to optimize cost and performance for AWS Glue streaming ETL jobs

AWS Glue streaming extract, transform, and load (ETL) jobs let you process and enrich large amounts of incoming data from systems such as Amazon Kinesis Data Streams, Amazon Managed Streaming for Apache Kafka (Amazon MSK), or any other Apache Kafka cluster. It uses the Spark Structured Streaming framework to perform data processing in near-real time.

This post covers use cases where data needs to be efficiently processed, delivered, and possibly actioned in a limited amount of time. This can cover a wide variety of cases, such as log processing and alarming, continuous data ingestion and enrichment, data validation, Internet of Things, machine learning (ML), and more.

We discuss the following topics:

  • Development tools that help you code faster using our newly released AWS Glue Studio notebooks
  • How to monitor and tune your streaming jobs
  • Best practices for sizing and scaling your AWS Glue cluster, using our newly released features like auto scaling and the small worker type G 0.25X

Development tools

AWS Glue Studio notebooks can speed up the development of your streaming job by allowing data engineers to work in an interactive notebook and test code changes to get quick feedback, from business logic coding to testing configuration changes, as part of tuning.

Before you run any code in the notebook (which would start the session), you need to set some important configurations.

The magic %streaming creates the session cluster using the same runtime as AWS Glue streaming jobs. This way, you interactively develop and test your code using the same runtime that you use later in the production job.

In addition, configure Spark UI logs, which will be very useful for monitoring and tuning the job.

See the following configuration:

%streaming
%%configure
{
"--enable-spark-ui": "true",
"--spark-event-logs-path": "s3://your_bucket/sparkui/"
}

For more configuration options such as version or number of workers, refer to Configuring AWS Glue interactive sessions for Jupyter and AWS Glue Studio notebooks.

To visualize the Spark UI logs, you need a Spark history server. If you don't have one already, refer to Launching the Spark History Server for deployment instructions.

Structured Streaming is based on streaming DataFrames, which represent micro-batches of messages. The following code is an example of creating a stream DataFrame using Amazon Kinesis as the source:

kinesis_options = {
  "streamARN": "arn:aws:kinesis:us-east-2:777788889999:stream/fromOptionsStream",
  "startingPosition": "TRIM_HORIZON",
  "inferSchema": "true",
  "classification": "json"
}
kinesisDF = glueContext.create_data_frame_from_options(
   connection_type="kinesis",
   connection_options=kinesis_options
)

The AWS Glue API helps you create the DataFrame by doing schema detection and auto decompression, depending on the format. You can also build it yourself using the Spark API directly:

kinesisDF = spark.readStream.format("kinesis").options(**kinesis_options).load()

After you run any code cell, it triggers the startup of the session, and the application soon appears in the history server as an incomplete app (at the bottom of the page there is a link to display incomplete apps) named GlueReplApp, because it's a session cluster. For a regular job, it's listed with the job name given when it was created.

History server home page

From the notebook, you can take a sample of the streaming data. This can help development and gives an indication of the type and size of the streaming messages, which might affect performance.
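The following is a minimal sketch of one way to take such a sample using the plain Spark API: run a short-lived query that prints a few rows of each micro-batch, then stop it. The trigger interval and the throwaway checkpoint path are placeholder choices.

# Print a few rows of every micro-batch so you can inspect the shape and size of the messages
def show_sample(data_frame, batch_id):
    data_frame.show(5, truncate=False)

sample_query = (kinesisDF.writeStream
    .foreachBatch(show_sample)
    .option("checkpointLocation", "/tmp/sample-checkpoint")  # throwaway checkpoint for the sample
    .trigger(processingTime="10 seconds")
    .start())

# After inspecting the output in the notebook, stop the query
sample_query.stop()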

Monitor the cluster with Structured Streaming

The best way to monitor and tune your AWS Glue streaming job is using the Spark UI; it gives you the overall streaming job trends on the Structured Streaming tab and the details of each individual micro-batch processing job.

General view of the streaming job

On the Structured Streaming tab, you can see a summary of the streams running in the cluster, as in the following example.

Normally there is just one streaming query, representing a streaming ETL. If you start multiple queries in parallel, it's good to give each one a recognizable name, calling queryName() if you use the writeStream API directly on the DataFrame.
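For example, a minimal sketch of naming a query when using the writeStream API directly (the DataFrame, query name, sink, and paths are placeholders):

# A recognizable name makes the query easy to identify on the Structured Streaming tab
query = (enrichedDF.writeStream
    .queryName("orders-enrichment")  # hypothetical query name
    .format("parquet")
    .option("path", "s3://your_bucket/output/")
    .option("checkpointLocation", "s3://your_bucket/checkpoints/orders/")
    .trigger(processingTime="60 seconds")
    .start())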

After a good number of batches are complete (such as 10), enough for the averages to stabilize, you can use the Avg Input/sec column to monitor how many events or messages the job is processing. This can be confusing because the column to the right, Avg Process/sec, is similar but often has a higher number. The difference is that the process rate tells us how efficient our code is, whereas the average input tells us how many messages the cluster is reading and processing.

The important thing to note is that if the two values are similar, it means the job is running at maximum capacity. It's making the best use of the hardware, but it likely won't be able to cope with an increase in volume without causing delays.

In the last column is the latest batch number. Because batches are numbered incrementally from zero, this tells us how many batches the query has processed so far.

When you choose the link in the Run ID column of a streaming query, you can review the details with graphs and histograms, as in the following example.

The first two rows correspond to the data that is used to calculate the averages shown on the summary page.

For Input Rate, each data point is calculated by dividing the number of events read for the batch by the time passed between the current batch start and the previous batch start. In a healthy system that is able to keep up, this is equal to the configured trigger interval (in the GlueContext.forEachBatch() API, this is set using the option windowSize).

Because it uses the current batch rows with the previous batch latency, this graph is often unstable in the first batches until the Batch Duration (the last line graph) stabilizes.

In this example, when it stabilizes, it gets completely flat. This means that either the influx of messages is constant or the job is hitting the limit per batch set (we discuss how to do this later in the post).

Be careful if you set a limit per batch that is constantly hit: you could be silently building a backlog, even though everything might look good in the job metrics. To monitor this, have a latency metric measuring the difference between the message timestamp when it gets created and the time it is processed.
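The following is a minimal sketch of such a latency metric inside the batch function, assuming the messages carry an event_time timestamp column set by the producer (the column name and the way the metric is surfaced are placeholders):

from pyspark.sql import functions as F

def process_batch(data_frame, batch_id):
    # Compare the producer timestamp with the processing time; a growing average
    # means a backlog is silently building up even if the batch metrics look healthy
    data_frame.select(
        F.avg(F.unix_timestamp(F.current_timestamp()) - F.unix_timestamp("event_time"))
         .alias("avg_latency_seconds")
    ).show()
    # ...rest of the batch processing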

Process Rate is calculated by dividing the number of messages in a batch by the time it took to process that batch. For instance, if the batch contains 1,000 messages, and the trigger interval is 10 seconds but the batch only needed 5 seconds to process, the process rate would be 1000/5 = 200 msg/sec, whereas the input rate for that batch (assuming the previous batch also ran within the interval) is 1000/10 = 100 msg/sec.

This metric is useful to measure how efficiently our code processes the batch, and therefore it can get higher than the input rate (this doesn't mean it's processing more messages, just using less time). As mentioned earlier, if both metrics get close, it means the batch duration is close to the interval, and therefore additional traffic is likely to start causing batch trigger delays (because the previous batch is still running) and increase latency.

Later in this post, we show how auto scaling can help prevent this situation.

Input Rows shows the number of messages read for each batch, like input rate, but using volume instead of rate.

It's important to note that if the batch processes the data multiple times (for example, writing to multiple destinations), the messages are counted multiple times. If the rates are higher than expected, this could be the reason. In general, to avoid reading messages multiple times, you should cache the batch while processing it, which is the default when you use the GlueContext.forEachBatch() API.
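If you build the query with the Spark API directly, the following is a minimal sketch of caching the micro-batch explicitly while writing to two destinations (paths and column names are placeholders); GlueContext.forEachBatch() already does this for you by default:

def process_batch(data_frame, batch_id):
    # Cache once so the two writes below don't re-read the batch from the source
    data_frame.persist()
    data_frame.write.mode("append").parquet("s3://your_bucket/raw/")
    data_frame.groupBy("event_type").count() \
        .write.mode("append").parquet("s3://your_bucket/aggregated/")
    data_frame.unpersist()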

The last two rows tell us how long it takes to process each batch and how that time is spent. It's normal to see the first batches take much longer until the system warms up and stabilizes. The important thing to look for is that the durations are roughly stable and well under the configured trigger interval. If that's not the case, the next batch gets delayed and could start a compounding delay by building a backlog or growing the batch size (if the limit allows taking the extra pending messages).

In Operation Duration, the majority of time should be spent on addBatch (the mustard color), which is the actual work. The rest is fixed overhead, so the smaller the batch, the larger the percentage of time that overhead takes. This represents the trade-off between small batches with lower latency, or bigger batches that are more computationally efficient.

Also, it's normal for the first batch to spend significant time in latestOffset (the brown bar), locating the point at which it needs to start processing when there is no checkpoint.

The following query statistics show another example.

In this case, the input has some variation (meaning it's not hitting the batch limit). Also, the process rate is roughly the same as the input rate. This tells us the system is at max capacity and struggling to keep up. By comparing the input rows and the input rate, we can guess that the configured interval is just 3 seconds and the batch duration is barely able to meet that latency.

Finally, in Operation Duration, you can observe that because the batches are so frequent, a significant amount of time (proportionally speaking) is spent saving the checkpoint (the dark green bar).

With this information, we can probably improve the stability of the job by increasing the trigger interval to 5 seconds or more. This way, it checkpoints less often and has more time to process data, which might be enough to get the batch duration consistently under the interval. The trade-off is that the latency between when a message is published and when it's processed is longer.

Monitor individual batch processing

On the Jobs tab, you can see how long each batch is taking and dig into the different steps the processing involves to understand how the time is spent. You can also check if there are tasks that succeed after retry. If this happens repeatedly, it can silently hurt performance.

For instance, the following screenshot shows the batches on the Jobs tab of the Spark UI of our streaming job.

Each batch is considered a job by Spark (don't confuse the job ID with the batch number; they only match if there is no other action). The job group is the streaming query ID (this is important only when running multiple queries).

The streaming job in this example has a single stage with 100 partitions. Both batches processed them successfully, so the stage is marked as succeeded and all the tasks completed (100/100 in the progress bar).

However, there is a difference in the first batch: there were 20 task failures. You know all the failed tasks succeeded in the retries, otherwise the stage would have been marked as failed. For the stage to fail, the same task has to fail four times (or as configured by spark.task.maxFailures).

If the stage fails, the batch fails as well and possibly the whole job; if the job was started using GlueContext.forEachBatch(), it has a number of retries as per the batchMaxRetries parameter (three by default).

These failures are important because they have two effects:

  • They can silently cause delays in the batch processing, depending on how long it took to fail and retry.
  • They can cause data to be sent multiple times if the failure is in the last stage of the batch, depending on the type of output. If the output is files, normally it won't cause duplicates. However, if the destination is Amazon DynamoDB, JDBC, Amazon OpenSearch Service, or another output that uses batching, it's possible that some part of the output has already been sent. If you can't tolerate any duplicates, the destination system should handle this (for example, by being idempotent).

Choosing the description link takes you to the Stages tab for that job. Here you can dig into the failure: What's the exception? Is it always on the same executor? Does it succeed on the first retry or did it take multiple?

Ideally, you want to identify these failures and solve them. For example, maybe the destination system is throttling us because it doesn't have enough provisioned capacity, or a larger timeout is needed. Otherwise, you should at least monitor it and decide if it is systemic or sporadic.

Sizing and scaling

Defining how to split the data is a key element in any distributed system to run and scale efficiently. The design decisions on the messaging system will have a strong influence on how the streaming job will perform and scale, and thereby affect the job parallelism.

In the case of AWS Glue Streaming, this division of work is based on Apache Spark partitions, which define how to split the work so it can be processed in parallel. Each time the job reads a batch from the source, it divides the incoming data into Spark partitions.

For Apache Kafka, each topic partition becomes a Spark partition; similarly, for Kinesis, each stream shard becomes a Spark partition. To simplify, I'll refer to this parallelism level as number of partitions, meaning Spark partitions that will be determined by the input Kafka partitions or Kinesis shards on a one-to-one basis.

The goal is to have enough parallelism and capacity to process each batch of data in less time than the configured batch interval and therefore be able to keep up. For instance, with a batch interval of 60 seconds, the job lets 60 seconds of data build up and then processes that data. If that work takes more than 60 seconds, the next batch waits until the previous batch is complete before starting a new batch with the data that has built up since the previous batch started.

It's a good practice to limit the amount of data to process in a single batch, instead of just taking everything that has been added since the last one. This helps make the job more stable and predictable during peak times. It allows you to test that the job can handle that volume of data without issues (for example, memory or throttling).

To do so, specify a limit when defining the source stream DataFrame:

  • For Kinesis, specify the limit using kinesis.executor.maxFetchRecordsPerShard, and revise this number if the number of shards changes significantly. You might need to increase kinesis.executor.maxFetchTimeInMs as well, in order to allow more time to read the batch and make sure it's not truncated.
  • For Kafka, set maxOffsetsPerTrigger, which divides that allowance equally between the number of partitions.

The following is an example of setting this config for Kafka (for Kinesis, it's equivalent but using Kinesis properties; a Kinesis sketch follows the Kafka example):

kafka_properties= {
  "kafka.bootstrap.servers": "bootstrapserver1:9092",
  "subscribe": "mytopic",
  "startingOffsets": "newest",
  "maxOffsetsPerTrigger": "5000000"
}
# Pass the properties as options when creating the DataFrame
spark.readStream.format("kafka").options(**kafka_properties).load()
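For reference, a minimal sketch of the equivalent limits on the Kinesis side (the limit and fetch-time values are placeholders to adjust to your volume and shard count):

kinesis_options = {
    "streamARN": "arn:aws:kinesis:us-east-2:777788889999:stream/fromOptionsStream",
    "startingPosition": "TRIM_HORIZON",
    "inferSchema": "true",
    "classification": "json",
    # Per-shard limit; revisit it if the number of shards changes significantly
    "kinesis.executor.maxFetchRecordsPerShard": "100000",
    # Allow more time to fetch larger batches so they aren't truncated
    "kinesis.executor.maxFetchTimeInMs": "10000"
}
kinesisDF = glueContext.create_data_frame_from_options(
    connection_type="kinesis",
    connection_options=kinesis_options
)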

Initial benchmark

If the events can be processed individually (no interdependency such as grouping), you can get a rough estimation of how many messages a single Spark core can handle by running with a single-partition source (one Kafka partition or a one-shard Kinesis stream) with data preloaded into it, running batches with a limit and the minimum interval (1 second). This simulates a stress test with no downtime between batches.

For these repeated tests, clear the checkpoint directory, use a different one (for example, make it dynamic using the timestamp in the path), or just disable checkpointing (if using the Spark API directly), so you can reuse the same data. Let a few batches run (at least 10) to give time for the system and the metrics to stabilize.
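A minimal sketch of making the checkpoint path dynamic per benchmark run (the bucket, batch function, and window size are placeholders):

import time

# A fresh checkpoint path per run makes the job re-read the preloaded test data
# instead of resuming from the previous run's offsets
checkpoint_path = f"s3://your_bucket/checkpoints/benchmark-{int(time.time())}/"

glueContext.forEachBatch(
    frame=kinesisDF,
    batch_function=process_batch,
    options={"windowSize": "1 second", "checkpointLocation": checkpoint_path}
)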

Start with a small limit (using the limit configuration properties explained in the previous section) and do several reruns, increasing the value. Record the batch duration for each limit and the throughput input rate (because it's a stress test, the process rate should be similar).

In general, larger batches tend to be more efficient up to a point. This is because the fixed overhead taken for each batch to checkpoint, plan, and coordinate the nodes is more significant if the batches are smaller and therefore more frequent.

Then pick your reference initial settings based on the requirements:

  • If a goal SLA is required, use the largest batch size whose batch duration is less than half the latency SLA. This is because in the worst case, a message that is stored just after a batch is triggered has to wait at least the interval and then the processing time (which should be less than the interval). When the system is keeping up, the latency in this worst case would be close to twice the interval, so aim for the batch duration to be less than half the target latency.
  • In the case where throughput is the priority over latency, just pick the batch size that provides a higher average process rate and define an interval that allows some buffer over the observed batch duration.

Now you have an idea of the number of messages per core our ETL can handle and the latency. These numbers are idealistic because the system won't scale perfectly linearly when you add more partitions and nodes. You can divide the total number of messages per second to process by the messages per core obtained, and get the minimum number of Spark partitions needed (each core handles one partition in parallel).

With this number of estimated Spark cores, calculate the number of nodes needed depending on the type and version, as summarized in the following table.

AWS Glue Version | Worker Type | vCores | Spark Cores per Worker
2 | G 1X | 4 | 8
2 | G 2X | 8 | 16
3 | G 0.25X | 2 | 2
3 | G 1X | 4 | 4
3 | G 2X | 8 | 8

Using the newer version 3 is preferable because it includes more optimizations and features like auto scaling (which we discuss later). Regarding size, unless the job has some operation that is heavy on memory, it's preferable to use the smaller instances so there aren't so many cores competing for memory, disk, and network shared resources.

Spark cores are equivalent to threads; therefore, you can have more (or fewer) than the actual cores available in the instance. This doesn't mean that having more Spark cores is necessarily going to be faster if they're not backed by physical cores; it just means you have more parallelism competing for the same CPU.

Sizing the cluster when you control the input message system

This is the ideal case, because you can optimize the performance and the efficiency as needed.

With the benchmark information you just gathered, you can define your initial AWS Glue cluster size and configure Kafka or Kinesis with the number of partitions or topics estimated, plus some buffer. Test this baseline setup and adjust as needed until the job can comfortably meet the total volume and required latency.

For instance, if we have determined that we need 32 cores to be well within the latency requirement for the volume of data to process, then we can create an AWS Glue 3.0 cluster with 9 G.1X nodes (a driver and 8 workers with 4 cores each = 32) that reads from a Kinesis data stream with 32 shards.

Imagine that the volume of data in that stream doubles and we want to keep the latency requirements. To do so, we double the number of workers (16 + 1 driver = 17) and the number of shards on the stream (now 64). Bear in mind this is just a reference and needs to be validated; in practice you might need more or fewer nodes depending on the cluster size, whether the destination system can keep up, the complexity of transformations, or other parameters.

Sizing the cluster when you don't control the message system configuration

In this case, your options for tuning are much more limited.

Check whether a cluster with the same number of Spark cores as existing partitions (determined by the message system) is able to keep up with the expected volume of data and latency, plus some allowance for peak times.

If that's not the case, adding more nodes alone won't help. You need to repartition the incoming data inside AWS Glue. This operation adds an overhead to redistribute the data internally, but it's the only way the job can scale out in this scenario.

Let's illustrate with an example. Imagine we have a Kinesis data stream with one shard that we don't control, and there isn't enough volume to justify asking the owner to increase the shards. In the cluster, significant computing is required for each message; for each message, it runs heuristics and other ML techniques to take action depending on the calculations. After running some benchmarks, the calculations can be done promptly for the expected volume of messages using 8 cores working in parallel. By default, because there is only one shard, only one core will process all the messages sequentially.

To solve this scenario, we can provision an AWS Glue 3.0 cluster with 3 G 1X nodes to have 8 worker cores available. In the code, repartition the batch so the messages are distributed randomly (as evenly as possible) between them:

def batch_function(data_frame, batch_id):
    # Repartition so the udf is called in parallel for each partition
    data_frame.repartition(8).foreach(process_event_udf)

glueContext.forEachBatch(frame=streaming_df, batch_function=batch_function)

If the messaging system resizes the number of partitions or shards, the job picks up this change on the next batch. You might need to adjust the cluster capacity accordingly to the new data volume.

The streaming job is able to process more partitions than there are Spark cores available, but this might cause inefficiencies because the additional partitions will be queued and won't start being processed until others finish. This might result in many nodes being idle while the remaining partitions finish and the next batch can be triggered.

When the messages have processing interdependencies

If the messages to be processed depend on other messages (either in the same or previous batches), that's likely to be a limiting factor on scalability. In that case, it might help to analyze a batch (a job in the Spark UI) to see where the time is spent and whether there are imbalances, by checking the task duration percentiles on the Stages tab (you can also reach this page by choosing a stage on the Jobs tab).

Auto scaling

So far, you have seen sizing methods to handle a steady stream of data with the occasional peak. However, for variable incoming volumes of data, this isn't cost-effective, because you need to size for the worst-case scenario or accept higher latency at peak times.

This is where AWS Glue Streaming 3.0 auto scaling comes in. You can enable it for the job and define the maximum number of workers you want to allow (for example, using the number you have determined is needed for peak times).

The runtime monitors the trend of time spent on batch processing and compares it with the configured interval. Based on that, it decides to increase or decrease the number of workers as needed, being more aggressive as the batch times get near or go over the allowed interval time.
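For instance, a sketch of enabling it when updating the job with boto3, assuming the --enable-auto-scaling job argument turns the feature on and that NumberOfWorkers acts as the upper bound the job can scale to (the job name, script location, and role are hypothetical placeholders):

import boto3

glue = boto3.client("glue")
glue.update_job(
    JobName="my-streaming-job",
    JobUpdate={
        "Role": "GlueJobRole",  # hypothetical IAM role name
        "Command": {"Name": "gluestreaming",
                    "ScriptLocation": "s3://your_bucket/scripts/my_job.py"},
        "GlueVersion": "3.0",
        "WorkerType": "G.1X",
        "NumberOfWorkers": 10,  # maximum the job is allowed to scale up to
        "DefaultArguments": {"--enable-auto-scaling": "true"}
    }
)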

The following screenshot is an example of a streaming job with auto scaling enabled.

Splitting workloads

You have seen how to scale a single job by adding nodes and partitioning the data as needed, which is enough in most cases. As the cluster grows, there is still a single driver and the nodes have to wait for the others to complete the batch before they can take additional work. If it reaches a point where growing the cluster size is no longer effective, you might want to consider splitting the workload between separate jobs.

In the case of Kinesis, you need to divide the data into multiple streams, but for Apache Kafka, you can divide a topic between multiple jobs by assigning partitions to each one. To do so, instead of the usual subscribe or subscribePattern where the topics are listed, use the property assign to assign a subset of the topic partitions that the job will handle using JSON (for example, {"topic1": [0,1,2]}). At the time of this writing, it's not possible to specify a range, so you need to list all the partitions, for instance building that list dynamically in the code, as in the following sketch.
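A minimal sketch of building that partition list dynamically (the broker address, topic, and partition numbers are placeholders):

import json

topic = "topic1"
partitions_for_this_job = list(range(0, 3))  # this job handles partitions 0, 1, 2

kafka_properties = {
    "kafka.bootstrap.servers": "bootstrapserver1:9092",
    # 'assign' takes a JSON map of topic -> partition list; ranges aren't supported,
    # so the full list is built programmatically
    "assign": json.dumps({topic: partitions_for_this_job}),
    "startingOffsets": "latest"
}
kafkaDF = spark.readStream.format("kafka").options(**kafka_properties).load()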

Sizing down

For low volumes of traffic, AWS Glue Streaming has a special type of small node: G 0.25X, which provides two cores and 4 GB RAM for a quarter of the cost of a DPU, so it's very cost-effective. However, even with that frugal capacity, if you have many small streams, having a small cluster for each one is still not practical.

For such situations, there are currently a few options:

  • Configure the stream DataFrame to feed from multiple Kafka topics or Kinesis streams. Then in the DataFrame, use the columns topic and streamName, for Kafka and Kinesis sources respectively, to determine how to handle the data (for example, different transformations or destinations). Make sure the DataFrame is cached, so you don't read the data multiple times.
  • If you have a mix of Kafka and Kinesis sources, you can define a DataFrame for each, join them, and process as needed using the columns mentioned in the previous point.
  • The preceding two cases require all the sources to have the same batch interval and link their processing (for example, a busier stream can delay a slower one). To have independent stream processing within the same cluster, you can trigger the processing of each separate stream's DataFrame using separate threads. Each stream is monitored separately in the Spark UI, but you're responsible for starting and managing those threads and handling errors (see the sketch after this list).
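The following is a minimal sketch of that last option, assuming two already-defined stream DataFrames and batch functions, and that forEachBatch blocks while its query runs (names, checkpoint paths, and window sizes are placeholders):

import threading

def run_stream(stream_df, batch_fn, checkpoint_path):
    # forEachBatch blocks while the query runs, so each stream gets its own thread
    glueContext.forEachBatch(
        frame=stream_df,
        batch_function=batch_fn,
        options={"windowSize": "30 seconds", "checkpointLocation": checkpoint_path}
    )

threads = [
    threading.Thread(target=run_stream,
                     args=(kafkaDF, process_kafka_batch, "s3://your_bucket/checkpoints/kafka/")),
    threading.Thread(target=run_stream,
                     args=(kinesisDF, process_kinesis_batch, "s3://your_bucket/checkpoints/kinesis/")),
]
for t in threads:
    t.start()
for t in threads:
    t.join()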

Settings

In this post, we showed some config settings that affect performance. The following table summarizes the ones we discussed and other important config properties to use when creating the input stream DataFrame.

Property | Applies to | Remarks
maxOffsetsPerTrigger | Kafka | Limit of messages per batch. Divides the limit evenly among partitions.
kinesis.executor.maxFetchRecordsPerShard | Kinesis | Limit per shard, therefore it should be revised if the number of shards changes.
kinesis.executor.maxFetchTimeInMs | Kinesis | When increasing the batch size (either by increasing the batch interval or the previous property), the executor might need more time, allotted by this property.
startingOffsets | Kafka | Normally you want to read all the data available and therefore use earliest. However, if there is a big backlog, the system might take a long time to catch up; instead use latest to skip the history.
startingposition | Kinesis | Similar to startingOffsets; in this case the values to use are TRIM_HORIZON to backload and LATEST to start processing from now on.
includeHeaders | Kafka | Enable this flag if you need to merge and split multiple topics in the same job (see the previous section for details).
kinesis.executor.maxconnections | Kinesis | When writing to Kinesis, by default it uses a single connection. Increasing this might improve performance.
kinesis.client.avoidEmptyBatches | Kinesis | It's best to set it to true to avoid wasting resources (for example, generating empty files) when there is no data (as the Kafka connector does). GlueContext.forEachBatch prevents empty batches by default.

Further optimizations

In general, it's worth doing some compression on the messages to save on transfer time (at the expense of some CPU, depending on the compression type used).

If the producer compresses the messages individually, AWS Glue can detect it and decompress automatically in most cases, depending on the format and type. For more information, refer to Adding Streaming ETL Jobs in AWS Glue.

If using Kafka, you have the option to compress the topic. This way, the compression is more effective because it's done in batches, end-to-end, and it's transparent to the producer and consumer.

By default, the GlueContext.forEachBatch function caches the incoming data. This is helpful if the data needs to be sent to multiple sinks (for example, as Amazon S3 files and also to update a DynamoDB table) because otherwise the job would read the data multiple times from the source. But it can be detrimental to performance if the volume of data is big and there is only one output.

To disable this option, set persistDataFrame to false:

glueContext.forEachBatch(
    frame=myStreamDataFrame,
    batch_function=processBatch,
    options={
        "windowSize": "30 seconds",
        "checkpointLocation": myCheckpointPath,
        "persistDataFrame":  "false"
    }
)

In streaming jobs, it's common to have to join the streaming data with another DataFrame to do enrichment (for example, lookups). In that case, you want to avoid any shuffle if possible, because it splits stages and causes data to be moved between nodes.

When the DataFrame you're joining to is small enough to fit in memory, consider using a broadcast join. However, bear in mind it will be distributed to the nodes on every batch, so it might not be worth it if the batches are too small.
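A minimal sketch of an enrichment broadcast join inside the batch function (the lookup DataFrame, join key, and paths are placeholders):

from pyspark.sql.functions import broadcast

# Small lookup table loaded once outside the batch function (hypothetical path)
device_lookup_df = spark.read.parquet("s3://your_bucket/reference/devices/")

def enrich_batch(data_frame, batch_id):
    # Broadcasting the small side avoids shuffling the streaming micro-batch
    enriched = data_frame.join(broadcast(device_lookup_df), on="device_id", how="left")
    enriched.write.mode("append").parquet("s3://your_bucket/enriched/")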

If you need to shuffle, consider enabling the Kryo serializer (if using custom serializable classes, you need to register them first to use it).
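A sketch of one way to enable it when you create the contexts yourself in the job script (the class names to register are hypothetical placeholders):

from pyspark import SparkConf
from pyspark.context import SparkContext
from awsglue.context import GlueContext

conf = (SparkConf()
    .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    # Register custom serializable classes so Kryo doesn't store full class names
    .set("spark.kryo.classesToRegister", "com.example.MyEvent,com.example.MyKey"))

sc = SparkContext(conf=conf)
glueContext = GlueContext(sc)
spark = glueContext.spark_session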

As in any AWS Glue job, avoid using custom udf() if you can do the same with the provided API like Spark SQL. User-defined functions (UDFs) prevent the runtime engine from performing many optimizations (the UDF code is a black box for the engine), and in the case of Python, they force the movement of data between processes.

Avoid generating too many small files (especially columnar formats like Parquet or ORC, which have overhead per file). To do so, it can be a good idea to coalesce the micro-batch DataFrame before writing the output. If you're writing partitioned data to Amazon S3, repartitioning based on columns can significantly reduce the number of output files created.
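A minimal sketch of repartitioning on the output partition columns before writing (the columns and path are placeholders):

def write_batch(data_frame, batch_id):
    # Repartitioning on the same columns used by partitionBy means each S3 partition
    # is written by a single task, which keeps the number of output files low
    (data_frame
        .repartition("year", "month", "day")
        .write.mode("append")
        .partitionBy("year", "month", "day")
        .parquet("s3://your_bucket/output/"))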

Conclusion

In this post, you saw how to approach sizing and tuning an AWS Glue streaming job in different scenarios, including planning considerations, configuration, monitoring, tips, and pitfalls.

You can now use these techniques to monitor and improve your existing streaming jobs, or apply them when designing and building new ones.


About the author

Gonzalo Herreros is a Senior Big Data Architect on the AWS Glue team.
