Last active
April 17, 2018 03:28
WTF Kafka Connect?!
<section>
<h1>WTF Kafka Connect?!</h1>
<h2>Grant Sherrick</h2>
</section>
<section id="where-we-began">
<h2>We started by trying to get our data from Kafka to S3.</h2>
<h2 class="fragment">We ran into a few issues...</h2>
<ul>
<li class="fragment">The Dockerfile</li>
<li class="fragment">Data Flushing</li>
<li class="fragment">Persistence</li>
<li class="fragment">Partitioning</li>
<li class="fragment">Heap Size</li>
</ul>
</section>
<section id="dockerfile">
<h2>When debugging a new application, it's helpful to know how it's running.</h2>
<a href="https://hub.docker.com/r/confluentinc/cp-kafka-connect/">FROM: confluentinc/cp-kafka-connect</a><br>
<a class="fragment" href="https://docs.confluent.io/3.2.1/installation/docker/docs/development.html">How to put on your Kafka Dockers one leg at a time!</a>
</section>
<section id="flushing">
<h2>Connectors output <code class="hljs cpp" style="display: initial;">flush.size</code> records per file...</h2>
<br>
<h3 class="fragment">What happens when <code class="hljs cpp" style="display: initial;">flush.size - 1</code> records exist on a topic?</h3>
<br>
<h3 class="fragment">What happens when <code class="hljs cpp" style="display: initial;">flush.size - 1</code> records exist on each partition?</h3>
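<p>The two questions above can be worked through with a small simulation. This is only a sketch: it assumes a file is committed per topic partition each time that partition has buffered <code>flush.size</code> records, and that no time-based rotation is configured. The function name and the input shape are made up for illustration.</p>

```python
def simulate_flush(records_per_partition, flush_size):
    """Return (files_written, records_still_buffered), both keyed by partition.

    Assumption: a file is committed only when one partition has accumulated
    flush_size records; leftover records stay buffered in the connector.
    """
    files_written = {}
    still_buffered = {}
    for partition, count in records_per_partition.items():
        files, leftover = divmod(count, flush_size)
        files_written[partition] = files
        still_buffered[partition] = leftover
    return files_written, still_buffered


if __name__ == "__main__":
    # Three partitions, each holding flush.size - 1 records.
    files, buffered = simulate_flush({0: 999, 1: 999, 2: 999}, flush_size=1000)
    print(files)
    print(buffered)
```

<p>Under these assumptions, a topic with three partitions can hold 2,997 records without a single file reaching S3.</p>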
</section>
<section id="rotate-schedule">
<h2>Let's have this output to S3 more regularly...</h2>
<ul>
<li><a href="https://docs.confluent.io/3.2.1/connect/connect-storage-cloud/kafka-connect-s3/docs/configuration_options.html">To the docs!</a></li>
<li class="fragment"><a href="https://github.com/confluentinc/kafka-connect-storage-cloud/issues/27">WTF?</a></li>
<li class="fragment"><a href="https://docs.confluent.io/3.2.1/connect/connect-hdfs/docs/configuration_options.html#connector">To the other docs!</a></li>
</ul>
</section>
<section id="persistence">
<h2>We didn't quite understand connector persistence when we started working with Kafka Connect.</h2>
<ul class="fragment">
<li>Connectors persist across Kafka Connect restarts.</li>
<li>Connector offsets persist across Kafka Connect restarts.</li>
<li>Connectors can be deleted or updated; connector names are unique.</li>
<li>Connectors can be paused; the paused state also persists across restarts.</li>
</ul>
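<p>Updating, pausing, and deleting all go through the Kafka Connect REST API, addressed by connector name. The sketch below only builds the HTTP method and URL for each action and does not send anything. The base URL (<code>localhost:8083</code>), the connector name, and the helper name are assumptions for illustration.</p>

```python
from urllib.parse import quote

# Assumption: a Connect worker listening on localhost:8083.
CONNECT_URL = "http://localhost:8083"

# Kafka Connect REST endpoints, keyed by a hypothetical action label.
ENDPOINTS = {
    "status": ("GET", "/connectors/{name}/status"),
    "update": ("PUT", "/connectors/{name}/config"),
    "pause": ("PUT", "/connectors/{name}/pause"),
    "resume": ("PUT", "/connectors/{name}/resume"),
    "delete": ("DELETE", "/connectors/{name}"),
}


def connector_request(action, name, base_url=CONNECT_URL):
    """Return the (method, url) pair for an action on a named connector."""
    method, path = ENDPOINTS[action]
    return method, base_url + path.format(name=quote(name, safe=""))


if __name__ == "__main__":
    for action in ENDPOINTS:
        print(connector_request(action, "s3-sink"))
```

<p>Because the connector and its offsets survive restarts, restarting the worker is not a reset. Removing a connector takes an explicit <code>DELETE</code>.</p>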
</section>
<section id="partitioning1">
<h2>Hive-style partitioning is the default storage for the S3 Connector.</h2>
<pre><code>/topics/health-metrics/year=2018/month=02/day=27</code></pre>
<p>Partitions should be used to reduce the number of folders that need to be traversed by commonly used queries.</p>
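<p>To make the layout concrete, here is a minimal sketch that derives a path like the one above from a record timestamp. It only illustrates the naming scheme and is not the connector's own code. The function name, the UTC assumption, and the <code>topics</code> prefix default are illustrative choices.</p>

```python
from datetime import datetime, timezone


def hive_partition_path(topic, timestamp_ms, topics_dir="topics"):
    """Build a year=/month=/day= path for a record timestamp (epoch millis).

    Assumption: timestamps are interpreted in UTC.
    """
    ts = datetime.fromtimestamp(timestamp_ms / 1000, tz=timezone.utc)
    return f"/{topics_dir}/{topic}/year={ts:%Y}/month={ts:%m}/day={ts:%d}"


if __name__ == "__main__":
    millis = int(datetime(2018, 2, 27, tzinfo=timezone.utc).timestamp() * 1000)
    print(hive_partition_path("health-metrics", millis))
```

<p>A query that filters on a single day then only has to read one <code>day=</code> folder instead of the whole topic.</p>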
</section>
<section id="partitioning2">
<h2>Custom Partitioning can be performed on any field or Kafka-related data</h2>
<h4>This requires extending the DefaultPartitioner</h4>
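<p>In the connector itself this logic would live in a Java class extending <code>DefaultPartitioner</code>. The Python below is not that API. It only sketches the kind of mapping such a partitioner performs, from a record plus Kafka metadata to a directory name. The function name, the dict-shaped record, and the <code>region</code> field are hypothetical.</p>

```python
def encode_partition(record, kafka_partition, field_name):
    """Map one record to a Hive-style directory suffix.

    Combines a value taken from the record (field_name) with Kafka-related
    data (the partition number). Purely illustrative, not the connector's API.
    """
    value = record[field_name]
    return f"{field_name}={value}/partition={kafka_partition}"


if __name__ == "__main__":
    record = {"region": "us-east-1", "cpu": 0.4}
    print(encode_partition(record, kafka_partition=3, field_name="region"))
```

<p>As with date-based paths, a field makes a useful partition key when common queries filter on it.</p>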
</section>
<section id="heap-size">
<h2>We've had a few issues with overrunning the heap size.</h2>
<h4 class="fragment">This depends on a number of things:</h4>
<ul class="fragment">
<li>Number of Topic Partitions/Tasks</li>
<li>Message Size</li>
<li>Other Connectors</li>
<li>s3.part.size (records cannot be split!)</li>
</ul>
<pre class="fragment"><code class="hljs nohighlight">Heap max size >= s3.part.size * num active partitions + (at least) 100 MB</code></pre>
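<p>The rule of thumb above is easy to turn into a quick calculation. This sketch applies the formula as written. The 25 MiB part size and the 12 active partitions are example inputs only, and the helper name is made up.</p>

```python
MIB = 1024 * 1024


def min_heap_bytes(s3_part_size, active_partitions, overhead=100 * MIB):
    """Lower bound on heap: one s3.part.size buffer per active partition,
    plus the (at least) 100 MB of overhead from the rule of thumb."""
    return s3_part_size * active_partitions + overhead


if __name__ == "__main__":
    # Example inputs (assumptions): 25 MiB parts, 12 active partitions.
    needed = min_heap_bytes(25 * MIB, 12)
    print(f"{needed // MIB} MiB")
```

<p>Treat the result as a floor rather than a target: large messages and other connectors on the same worker add to it.</p>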
</section>
<section id="conclusion">
<h2>Thanks!</h2>
</section>
<section id="useful-links">
<h2>Some Useful Links:</h2>
<ul>
<li><a href="https://speakerdeck.com/rmoff/real-time-data-integration-at-scale-with-kafka-connect-dublin-apache-kafka-meetup-04-jul-2017">Some in depth Kafka Connect slides</a></li>
<li><a href="http://kafka.apache.org/documentation.html#connect_transforms"><b>Solid</b> docs on Kafka Connect Transforms</a></li>
<li><a href="https://www.confluent.io/blog/simplest-useful-kafka-connect-data-pipeline-world-thereabouts-part-3/">An Example of Using Transforms IRL</a></li>
<li><a href="https://stackoverflow.com/questions/44014975/kafka-consumer-api-vs-stream-api">Streams vs Consumers vs Producers</a></li>
<li><a href="https://www.confluent.io/wp-content/uploads/confluent-kafka-definitive-guide-complete.pdf">The definitive guide to Kafka (a whole chapter on Kafka Connect!)</a></li>
<li><a href="https://cwiki.apache.org/confluence/display/Hive/Design#Design-HiveDataModel">On Kafka Connect (a.k.a. Hive) Partitions</a></li>
</ul>
</section>