Tuning the Kafka Connect Cassandra Source (part 2)

In the first part of this series we looked at how to get Kafka Connect set up with the Cassandra Source connector from Landoop. We also took a look at some design considerations for the Cassandra tables. In this post we will examine some of the options we have for tuning the Cassandra Source connector.

[Figure: Kafka Connect Cassandra Source overview]

Configuring when to start looking for data

The Cassandra Source connector pulls data from a Cassandra table based on a date/time column. The first property we want to configure tells the connector when it should start looking for data in the Cassandra table. This is set with the connect.cassandra.initial.offset property. Any data that exists prior to this date/time will not be published to the Kafka topic.

If this property is not set then the connector will use the default value of Jan 1, 1900. This is not what you will want as it will cause long delays in publishing data to the Kafka topic. This delay is the result of the connector having to work its way through more than a century of time slices before reaching your data. How fast it does this will be determined by how some of the other properties are configured.

"connect.cassandra.initial.offset": "2018-01-22 00:00:00.0000000Z",

Once the connector has picked up data from the table and successfully published messages to Kafka, it will store the date/time of the last published row as an offset in the Kafka topic connect-offsets. Once a value has been published to that topic, the connector will always use it instead of the value provided by connect.cassandra.initial.offset.
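
If you want to verify which offset the connector will resume from, one way is to read the connect-offsets topic with the console consumer that ships with Kafka. This is a minimal sketch; the broker address is an assumption for a local setup, connect-offsets is the default offset topic name for the Connect worker, and the command may be kafka-console-consumer.sh depending on your distribution:

kafka-console-consumer --bootstrap-server localhost:9092 --topic connect-offsets \
    --from-beginning --property print.key=true

Each record key identifies a connector, and the value is a small JSON document containing the stored offset, which for this connector is the date/time of the last published row.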

Configuring how often to check for data in the table

The connector works by polling the table and looking for new rows that have been inserted since it last checked for data. How often to poll for data is managed by the connect.cassandra.import.poll.interval property. The configuration shown below will look for new data every ten seconds.

"connect.cassandra.import.poll.interval": 10000,

If the connector is still processing rows from the result set of the prior polling cycle, it will not query the table for more data. That polling cycle is skipped, at least with regard to querying the Cassandra cluster. The polling interval is still important, however, because it also determines how often data is published to the Kafka topic.
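
Putting the two properties together, a connector configuration submitted to the Kafka Connect REST API might look roughly like the sketch below. The connector name is made up for illustration, the connection, keyspace, and KCQL mapping settings are omitted, and the connector class should be checked against the stream-reactor release you are using:

{
  "name": "packs-cassandra-source",
  "config": {
    "connector.class": "com.datamountaineer.streamreactor.connect.cassandra.source.CassandraSourceConnector",
    "connect.cassandra.initial.offset": "2018-01-22 00:00:00.0000000Z",
    "connect.cassandra.import.poll.interval": "10000"
  }
}

With these settings the connector ignores anything written before January 22, 2018 and checks the table for new rows every ten seconds.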


Using the Kafka Connect Cassandra Source (part 1)

This post will look at how to set up and tune the Cassandra Source connector that is available from Landoop. The Cassandra Source connector is used to read data from a Cassandra table and write the contents into a Kafka topic using only a configuration file. This makes it easy to turn data that has already been saved in Cassandra into an event stream.

[Figure: Kafka Connect Cassandra Source overview]

In our example we will be capturing data representing a pack (i.e., a large box) of items being shipped. Each pack is pushed to consumers as a JSON message on a Kafka topic.
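
As a sketch, a single pack message on the topic might look something like this; the field names and values are purely illustrative and not taken from the actual schema:

{
  "packId": "PACK-0042",
  "shippedAt": "2018-01-22T14:30:00Z",
  "destination": "Nashville",
  "items": [
    { "sku": "ITEM-123", "quantity": 12 },
    { "sku": "ITEM-456", "quantity": 3 }
  ]
}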

The Cassandra data model and Cassandra Source connector

Modeling data in Cassandra must be done around the queries that are needed to access the data (see this article for details). Typically this means that there will be one table for each query and data (in our case about the pack) will be duplicated across numerous tables.

Regardless of the other tables used for the product, the Cassandra Source connector needs a table that will allow us to query for data using a time range. The connector is designed around its ability to generate a CQL query based on configuration. It uses this query to retrieve data from the table that is available within a configurable time range. Once all of this data has been published, Kafka Connect will mark the upper end of the time range as an offset. The connector will then query the table for more data using the next time range, starting with the date/time stored in the offset. We will look at how to configure this later. For now we want to focus on the constraints for the table. Since Cassandra doesn’t support joins, the table we are pulling data from must have all of the data that we want to put onto the Kafka topic. Data in other tables will not be available to Kafka Connect.

In its simplest form, a table used by the Cassandra Source connector might look something like the sketch below.
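
The table and column names here are illustrative; the key point is that there is a date/time-based column (event_ts, a timeuuid in this sketch) that the connector can use for its time-range queries, and that the row carries everything we want in the Kafka message:

CREATE TABLE pack_events (
    pack_id   text,
    event_ts  timeuuid,
    pack_json text,
    PRIMARY KEY (pack_id, event_ts)
) WITH CLUSTERING ORDER BY (event_ts ASC);

Conceptually, on each polling cycle the connector asks for rows where event_ts is greater than the stored offset and less than or equal to the end of the current time slice, publishes those rows, and records the upper bound as the new offset; the exact CQL it generates depends on how the connector is configured. In a real design the partition key usually needs more thought (for example, bucketing by day) so that these time-range queries stay efficient, which is part of the design considerations mentioned above.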

Some Quick Observations from #KafkaSummit NY

Last week several of my colleagues and I were able to head up to New York City to attend the Kafka Summit. It was a blast being in the Big Apple! I wanted to share some observations about the summit and some of the sessions.

  • The first observation is that there is a lot of buzz around event/stream processing and the Kafka streaming platform. The hotel in the center of Manhattan was top notch. As one entered the summit, the Kafka logo was projected on the walls and was even emblazoned on the cookies. Several companies gave presentations on what they are doing with the platform, including LinkedIn, ING, Airbnb, Uber, Yelp, Target, and the New York Times.


  • To get things kicked off, Jay Kreps gave the opening keynote (available here) and explored the question: what is a streaming platform? He noted the three capabilities that are required:
    • the ability to publish and subscribe to streams of data
    • the ability to store and replicate streams of data
    • the ability to process streams of data
  • The Kafka streaming platform, comprising Apache Kafka along with the Kafka Producer and Consumer APIs, Kafka Streams, and Kafka Connect, provides all of these capabilities.
  • Jay Kreps explored the three possible lenses that one is tempted to place Kafka into – Messaging, Big Data, and/or ETL. But really the Kafka streaming platform encompasses all of these.
  • One of the best quotes at the summit came during the keynote by Ferd Scheepers, an architect at ING. He compared the current state of streaming platforms to going through puberty.
