Spring Boot
Quickstart
Spring in a nutshell
It’s a very popular framework for building Java applications and provides a large number of helper classes and annotations.
The problem is that building a traditional Spring application is really hard, because it raises a few questions: which JAR dependencies do you need, what kind of configuration should you use (XML or Java), how do you install the server, etc.
Goals of Spring
Lightweight development with Java POJOs (Plain Old Java Objects), making it much
simpler to build compared to the heavyweight EJBs from the early versions of J2EE
Dependency injection to promote loose coupling, so instead of hard wiring your
objects together, you specify the wiring via configuration file or annotations.
Minimize boilerplate Java code
Spring FAQ
Does Spring Boot replace Spring MVC, Spring REST, etc.?
o No. Spring Boot actually uses those technologies in the background.
Core Container
It’s the main item in Spring. It contains things like the Beans, SpEL (Spring Expression
Language), the Context and the Core.
Infrastructure
Aspect Oriented Programming (AOP): Allows us to create application-wide services (like logging, security, transactions, instrumentation) and apply these services to our objects in a declarative fashion.
Instrumentation: It’s where you can use class loader implementations, such as Java agents, to remotely monitor the app with JMX (Java Management Extensions).
Data Access Layer
It contains all the necessary things to facilitate our communication with the database (such as
JDBC, an ORM, Transactions, etc.)
JDBC: Spring’s JDBC helper classes reduce the source code needed to talk to the database by roughly 50%.
ORM: Object-to-Relational Mapping; integration with Hibernate and JPA.
JMS: Java Message Service, for sending asynchronous messages to a message broker. Spring provides helper classes for JMS.
Transactions: Adds transaction support. Spring makes heavy use of AOP behind the scenes.
Web layer
It’s the layer where all web related classes go. It’s the home of the Spring MVC framework.
Spring Projects
Additional Spring modules built on top of the core Spring Framework. Think of them simply as add-ons:
Spring Cloud
Spring Data
Spring Batch
Spring Security
Spring Web Services
Spring LDAP.
Etc.
Maven
Maven is a project management tool; its most popular use is for build management and dependency management.
The dependency problem is solved by Maven: we only declare which JAR files we need, and Maven downloads them and adds them to the classpath when building and running the project.
Benefits
Most major IDEs have built-in support for Maven
o IDEs can easily read/import maven projects.
Maven projects are portable, developers can easily share maven projects between
IDEs.
Advantages of Maven
Dependency management
Building and running your project without buildpath/classpath issues
Standard directory structure
pom.xml
The POM file (Project Object Model) is the configuration file of our project.
Dependency coordinates
A dependency’s coordinates (Group ID, Artifact ID, Version) follow the same scheme as the project’s own coordinates:
e.g. the spring-boot-starter-web starter brings in:
o spring-web
o spring-webmvc
o hibernate-validator
o tomcat
o etc…
This saves the developer from maintaining a really big list of individual dependencies and from checking that the version of each one is compatible with the rest.
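As a reference, a minimal sketch of how this starter could be declared in the pom.xml (the version is omitted on purpose, since it is inherited from the Starter Parent described below):

<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-web</artifactId>
</dependency>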
If we are using the Spring Boot Starter Parent, we don’t need to specify the version of the starters, because they inherit it from the Starter Parent.
Also, the Starter Parent provides default configuration for the spring boot plugin.
Benefits
Default configuration: Java version, UTF encoding, etc…
Dependency management: Spring Boot starters inherit their versions from the Starter Parent.
Default configuration of spring boot plugin.
Spring Boot Dev Tools
To use Dev Tools (spring-boot-devtools), we only need to add the dependency to the POM file; it then automatically restarts the application when code is updated.
In the case of IntelliJ, we need some extra configuration, because it doesn’t support Dev Tools out of the box.
The configuration to do is Preferences > Build, Execution, Deployment > Compiler and check
the box “Build project automatically”.
Also, there’s another configuration to do, which is in Preferences > Advanced settings, where
we need to check the box “Allow auto-make to start”.
Spring Boot Actuator
By adding the spring-boot-starter-actuator dependency, you get endpoints to monitor and check the status of the application, with no need to write additional code.
Exposing endpoints
By default, only /health is exposed.
To expose other endpoints, we add them as a comma-separated list to the property management.endpoints.web.exposure.include (in application.properties or YAML), or use the wildcard (*) to expose all endpoints. Some endpoints also need to be enabled explicitly; for example, management.info.env.enabled=true makes the info endpoint pick up the info.* properties.
/health
Checks the status of your application, it’s normally used by monitoring apps to see if the
application is up or down.
/info
Can provide information about the application, which is customizable and, by default, it’s
empty.
The properties started with “info.”, will be used by the info endpoint.
e.g.
info.app.name=My cool APP
info.app.description=A really cool app, yohoo!
info.app.version=1.0.0
Other endpoints
Actuator offers more than 10 endpoints; among them are /beans, /mappings, /metrics and /threaddump.
Custom application properties
The properties and their values defined in application.properties can be read later in the application with the @Value annotation.
You can use any name for the properties and define as many of them as you want.
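A minimal sketch of how such a custom property could be injected (the property name coach.name, the controller name and the endpoint are assumptions for illustration):

// in application.properties (hypothetical custom property):
// coach.name=Mickey Mouse

import org.springframework.beans.factory.annotation.Value;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RestController;

@RestController
public class FunRestController {

    // inject the custom property defined in application.properties
    @Value("${coach.name}")
    private String coachName;

    @GetMapping("/teaminfo")
    public String getTeamInfo() {
        return "Coach: " + coachName;
    }
}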
Spring Boot properties
Spring Boot has more than 1000 properties available for configuration, grouped into the following categories: Core, Web, Security, Data, Actuator, Integration, DevTools and Testing.
Core: Logging properties
We can define the logging levels based on package names.
For example, logging.level.com.luv2code=INFO will log all messages at INFO level and above for the com.luv2code package and all of its sub-packages.
TRACE
DEBUG
INFO
WARN
ERROR
FATAL
OFF
Also, we can save the log into a file with the following configuration:
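A sketch of that configuration in application.properties (the file name is just a placeholder):

# write the log output to this file (name is only an example)
logging.file.name=mycoolapp.log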
Static content
By default, Spring Boot will load static resources (such as JS files, CSS files, images, etc.) from the src/main/resources/static directory.
Also, notice that if the packaging is JAR, the content of src/main/webapp is not packaged; that directory is used exclusively by WAR packaging.
Templates
Spring Boot includes auto-configuration for the following template engines:
FreeMarker
Thymeleaf
Mustache
Spring Core
Inversion of Control (IoC)
It’s the approach of outsourcing the construction and management of objects.
We’re going to say “give me a Coach object”, and Spring will determine which Coach object is needed based on the configuration and give us a reference to it.
Spring container
Primary functions: creating and managing objects (Inversion of Control) and injecting object dependencies (Dependency Injection).
Dependency injection
Makes use of the dependency inversion principle.
The client delegates to another object the responsibility of providing its dependencies.
e.g.
Returning to the spring container example, when getting a Coach, the coach object may have
additional dependencies
For example, in this case, we get the Head Coach, and the Head Coach has a staff of assistant
coaches, physical trainers, medical staff, etc.
So we can say “give me everything that I need to make use of the given Coach object”, and the
object factory will return the object ready to use.
Another example is a controller that wants to use a Coach object; in this case the Coach is a dependency of the controller.
Injection types
There are multiple types of injection with Spring.
Constructor Injection
Setter Injection
Field injection
It’s not recommended because this approach makes the code harder to unit test.
@Autowired annotation
For dependency injection, Spring can use autowiring (@Autowired annotation).
Spring will look for a class that matches by type (class or interface):
1. Spring will scan for @Components that implement the Coach interface.
2. If one is found, Spring will inject it, e.g. CricketCoach.
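A minimal sketch of constructor injection with @Autowired, using the Coach/CricketCoach example (the workout method, controller name and endpoint are assumptions for illustration):

import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.stereotype.Component;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RestController;

interface Coach {
    String getDailyWorkout();
}

@Component
class CricketCoach implements Coach {
    @Override
    public String getDailyWorkout() {
        return "Practice fast bowling for 15 minutes";
    }
}

@RestController
class DemoController {

    private final Coach myCoach;

    // Spring finds the @Component that implements Coach and injects it here
    @Autowired
    public DemoController(Coach theCoach) {
        this.myCoach = theCoach;
    }

    @GetMapping("/dailyworkout")
    public String getDailyWorkout() {
        return myCoach.getDailyWorkout();
    }
}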
@Component annotation
It’s an annotation that marks the class as a Spring Bean.
The @Component annotation also makes the bean available for dependency injection.
Component Scanning
Spring will scan your Java classes for special annotations, such as @Component, and then
automatically register the beans in the Spring container.
By default, Spring Boot starts component scanning from the same package as your main Spring Boot application class, and also scans its sub-packages recursively.
This implicitly defines a base search package and allows you to leverage default component
scanning. There’s no need to explicitly reference the base package name.
@SpringBootApplication annotation
This annotation enables auto configuration, component scanning and additional configuration.
Using this annotation we can also define which packages (besides the main one) we want to scan, as shown in the sketch below.
SpringApplication.run()
This method bootstraps our Spring Boot application; behind the scenes it creates the application context and registers all beans.
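For example, a sketch of the main application class (the package names passed to scanBasePackages are just examples):

import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;

// also scan packages outside of the main application package (names are examples)
@SpringBootApplication(scanBasePackages = {"com.luv2code.springcoredemo", "com.luv2code.util"})
public class SpringcoredemoApplication {

    public static void main(String[] args) {
        // bootstrap the app: create the application context and register all beans
        SpringApplication.run(SpringcoredemoApplication.class, args);
    }
}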
@Qualifier annotation
If there’s more than one implementation of the interface to inject, this annotation lets us specify which implementation we want.
The qualifier name is the bean id: the same as the class name, but with the first character in lower case.
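A sketch of how that looks with the CricketCoach implementation from the earlier example:

import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.beans.factory.annotation.Qualifier;
import org.springframework.web.bind.annotation.RestController;

interface Coach {
    String getDailyWorkout();
}

@RestController
class DemoController {

    private final Coach myCoach;

    // "cricketCoach" = bean id of the CricketCoach @Component
    @Autowired
    public DemoController(@Qualifier("cricketCoach") Coach theCoach) {
        this.myCoach = theCoach;
    }
}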
@Primary annotation
It’s an alternative solution for the @Qualifier.
If there’s multiple implementations of a class, then Spring will inject the one with the
@Primary annotation.
Note that only one implementation can be marked with @Primary: there will be an error if multiple implementations of the same interface carry the @Primary annotation.
If both are used, @Qualifier wins; it’s more specific and has higher priority, so it’s the recommended option.
Lazy Initialization
Instead of creating all beans up front, we can specify lazy initialization.
With the @Lazy annotation, a bean will only be initialized when it is needed for dependency injection or when it is explicitly requested.
To avoid putting a @Lazy in every bean (including controllers), we can make all classes Lazy by
adding a property in our application.properties, which is:
spring.main.lazy-initialization = true
Advantages:
Objects are created only as needed, which can speed up startup time if you have a large number of components.
Disadvantages:
If you have web-related components (like @RestController), they won’t be created until requested.
You may not discover configuration issues until too late.
You need to make sure you have enough memory for all beans once they are created.
Bean scopes
A bean’s scope refers to its lifecycle: how long the bean lives, how many instances are created and how the bean is shared.
Singleton
Spring Container creates only one instance of the bean by default.
It’s cached in memory and all dependency injections for the bean will reference the SAME
bean.
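Other scopes (e.g. prototype, request, session) can be set with the @Scope annotation. A sketch for the prototype case (the TennisCoach class is an assumption for illustration):

import org.springframework.beans.factory.config.ConfigurableBeanFactory;
import org.springframework.context.annotation.Scope;
import org.springframework.stereotype.Component;

// a new instance of this bean is created for every request/injection of it
@Component
@Scope(ConfigurableBeanFactory.SCOPE_PROTOTYPE)
class TennisCoach {
    public String getDailyWorkout() {
        return "Practice your backhand volley";
    }
}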
Bean lifecycle
At container startup, the lifecycle follows these steps:
1. Container is created.
2. Beans are instantiated.
3. Dependencies are injected.
4. Internal Spring Processing.
5. Our custom init method. At this point the bean is ready to use.
Also, you can add custom code during bean destruction, calling custom business logic methods
and cleaning up handles to resources (db, sockets, files, etc.).
It’s important to know that, for prototype-scoped beans (where Spring creates a new bean instance for every request for the bean), the destroy method is not called.
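A sketch of how custom init/destroy methods can be defined with @PostConstruct and @PreDestroy (the method names are arbitrary):

import jakarta.annotation.PostConstruct;
import jakarta.annotation.PreDestroy;
import org.springframework.stereotype.Component;

@Component
class CricketCoach {

    // called after the bean is constructed and its dependencies are injected
    @PostConstruct
    public void doMyStartupStuff() {
        System.out.println("In doMyStartupStuff()");
    }

    // called just before the bean is destroyed (not called for prototype-scoped beans)
    @PreDestroy
    public void doMyCleanupStuff() {
        System.out.println("In doMyCleanupStuff()");
    }
}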
Configuring beans with @Bean
The main use case for this kind of bean declaration is to make an existing third-party class available to the Spring framework.
In these scenarios you may not have access to the source code of the third-party class, but you would still like to use it as a Spring bean.
e.g.
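A sketch of what that could look like, assuming the AWS SDK v2 S3Client is on the classpath (region and method name are placeholders):

import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import software.amazon.awssdk.regions.Region;
import software.amazon.awssdk.services.s3.S3Client;

@Configuration
public class AppConfig {

    // expose the third-party S3Client as a Spring bean so it can be injected anywhere
    @Bean
    public S3Client remoteClient() {
        return S3Client.builder()
                .region(Region.US_EAST_1)
                .build();
    }
}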
With this, we have configured the S3 client as a Spring bean using @Bean; now we can inject our already-configured S3Client wherever we need it, without configuring it over and over again.
Spring benefits
Spring is more than just Inversion of Control and Dependency Injection, but for small basic
apps, it may be hard to see the benefits of Spring.
Benefits of Hibernate
Hibernate handles all of the low-level SQL.
Minimizes the amount of JDBC code you have to develop.
Hibernate provides the Object-to-Relational Mapping (ORM).
What is JPA?
Jakarta Persistence API (JPA), previously known as Java Persistence API, is the standard API for Object-to-Relational Mapping (ORM).
Benefits of JPA
By having a standard API, you aren’t locked into a vendor’s implementation.
Maintain portable, flexible code by coding to JPA spec (interfaces).
Can theoretically switch vendor implementations e.g. if Vendor ABC stops supporting
their product, we could switch to vendor XYZ without vendor lock in.
JPA Flow
1. To manage the database, we need a DAO object (Data Access Object)
2. Our DAO needs a JPA Entity Manager. JPA Entity Manager is the main component for
saving/retrieving entities.
3. Our JPA Entity Manager needs a Data Source. The Data Source, defines database
connection info.
Both the JPA Entity Manager and the Data Source are automatically created by Spring Boot, based on the application.properties file (JDBC URL, user id, password, etc.).
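A minimal sketch of such a DAO, assuming a Student entity (names follow the examples used elsewhere in these notes):

import jakarta.persistence.EntityManager;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.stereotype.Repository;
import org.springframework.transaction.annotation.Transactional;

// Student is the JPA entity shown later in these notes
@Repository
public class StudentDAOImpl {

    // the EntityManager is auto-created by Spring Boot from the data source config
    private final EntityManager entityManager;

    @Autowired
    public StudentDAOImpl(EntityManager entityManager) {
        this.entityManager = entityManager;
    }

    @Transactional
    public void save(Student theStudent) {
        entityManager.persist(theStudent);
    }

    public Student findById(Integer id) {
        return entityManager.find(Student.class, id);
    }
}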
JPQL (JPA Query Language)
It’s a query language for retrieving objects, similar in concept to SQL (where, like, order by, join, in, etc.).
Find by query
Named parameters
Also, we can use named parameters to avoid hard-coding values (see the combined sketch after this list).
Update by query
Delete by query
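A sketch covering these query types, assuming a Student entity with a lastName field:

import jakarta.persistence.EntityManager;
import jakarta.persistence.TypedQuery;
import org.springframework.stereotype.Repository;
import org.springframework.transaction.annotation.Transactional;
import java.util.List;

@Repository
public class StudentQueryDAO {

    private final EntityManager entityManager;

    public StudentQueryDAO(EntityManager entityManager) {
        this.entityManager = entityManager;
    }

    // find by query, using a named parameter instead of hard-coding the value
    public List<Student> findByLastName(String theLastName) {
        TypedQuery<Student> query = entityManager.createQuery(
                "FROM Student WHERE lastName = :theData", Student.class);
        query.setParameter("theData", theLastName);
        return query.getResultList();
    }

    // update by query
    @Transactional
    public int updateLastName(String oldName, String newName) {
        return entityManager.createQuery(
                        "UPDATE Student SET lastName = :newName WHERE lastName = :oldName")
                .setParameter("newName", newName)
                .setParameter("oldName", oldName)
                .executeUpdate();
    }

    // delete by query
    @Transactional
    public int deleteByLastName(String theLastName) {
        return entityManager.createQuery(
                        "DELETE FROM Student WHERE lastName = :theData")
                .setParameter("theData", theLastName)
                .executeUpdate();
    }
}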
Configuration
In the application.properties, we can set the property in charge of creating the database, which
is spring.jpa.hibernate.ddl-auto=create.
When the app is started, JPA/Hibernate will drop the existing tables and then re-create them.
@Table
It’s an annotation to specify the table-related information, such as the table name.
If it’s not provided, the table name is the same as the class name. Relying on that default is not recommended, because if the class name changes, it probably won’t match the existing database table name.
@Column
It’s an annotation to specify the column information.
Its use is optional; if not specified, the column name is the same as the Java field name. Relying on that default is not recommended, because if the field name changes, it probably won’t match the existing database column.
@Id
Marks the column as a primary key, which identifies the row as unique and cannot contain null
values.
@GeneratedValue
It makes the database generate the value automatically, for example with an auto-increment column.
It’s useful to annotate the classes with the proper annotations because, this way, they are better suited for processing by tools or for association with aspects.
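A sketch of an entity class putting these annotations together (table and column names are assumed):

import jakarta.persistence.Column;
import jakarta.persistence.Entity;
import jakarta.persistence.GeneratedValue;
import jakarta.persistence.GenerationType;
import jakarta.persistence.Id;
import jakarta.persistence.Table;

@Entity
@Table(name = "student")
public class Student {

    // primary key, generated by the database (auto increment)
    @Id
    @GeneratedValue(strategy = GenerationType.IDENTITY)
    @Column(name = "id")
    private int id;

    @Column(name = "first_name")
    private String firstName;

    @Column(name = "last_name")
    private String lastName;

    public Student() {}

    // getters and setters omitted for brevity
}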
How does Hibernate/JPA relate to JDBC?
Hibernate/JPA uses JDBC for all database communications.
The EntityManager, which comes from JPA, is the main component for creating queries and saving/retrieving entities.
Based on the configuration, Spring Boot will automatically create the beans (DataSource, EntityManager, etc.); you can then inject them into your app, for example into your DAO.
Spring Boot automatically configures your data source based on the entries in the Maven POM file, and the DB connection info is read from application.properties:
spring.datasource.url=jdbc:mysql://localhost:3306/student_tracker
spring.datasource.username=springstudent
spring.datasource.password=springstudent
Also, there’s no need to give JDBC driver class name. Spring Boot will automatically detect it
based on URL.
REST CRUD APIs
REST
It stands for REpresentational State Transfer, and it’s a lightweight approach for communicating
between applications.
REST is language independent, so the client and server application can use ANY programming
language.
REST applications can also use any data format, where XML and JSON are commonly used.
Also, the terms REST API, RESTful API, REST web services, RESTful web services, etc. all mean the same thing.
JSON
JSON Stands for JavaScript Object Notation, and it’s a lightweight data format for storing and
exchanging data.
The client sends an HTTP request message and the service returns an HTTP response message, which consists of:
Response line: server protocol and status code (200, 404, 500, etc.).
Header variables: response metadata.
Message body: contents of message.
@RequestMapping
Defines part of the URL used to call the API; it’s mostly used at class level to define the base URL.
@PathVariable
Marks a method parameter as a path variable, which means that the value of the parameter comes from the URL.
@GetMapping("/students/{studentId}")
private Student getStudentById(@PathVariable int studentId){
return students.get(studentId);
}
@ExceptionHandler
Marks the method as an exception handler.

@ExceptionHandler
public ResponseEntity<StudentErrorResponse> handleException(StudentNotFoundException e){
    StudentErrorResponse error = new StudentErrorResponse();
    error.setStatus(HttpStatus.NOT_FOUND.value());
    error.setMessage(e.getMessage());
    error.setTimeStamp(System.currentTimeMillis());
    return new ResponseEntity<>(error, HttpStatus.NOT_FOUND);
}
@ControllerAdvice
Placing the exception handler in a class annotated with @ControllerAdvice makes it a global exception handler for all controllers:

@ControllerAdvice
public class StudentRestExceptionHandler {

    @ExceptionHandler
    public ResponseEntity<StudentErrorResponse> handleException(StudentNotFoundException e){
        StudentErrorResponse error = new StudentErrorResponse();
        error.setStatus(HttpStatus.NOT_FOUND.value());
        error.setMessage(e.getMessage());
        error.setTimeStamp(System.currentTimeMillis());
        return new ResponseEntity<>(error, HttpStatus.NOT_FOUND);
    }
}
Jackson
Jackson handles the data binding between JSON and Java POJOs. It’s pretty much the same idea as mapping, serialization/deserialization or marshalling/unmarshalling: converting from one format to another.
By default, Jackson will call the appropriate getter/setter methods when converting between POJO and JSON and vice versa.
Bad practices
DO NOT include actions in the endpoint; instead, use the HTTP methods to express the actions.
Design patterns
Service Layer
It’s an intermediate layer for custom business logic. It integrates data from multiple sources
(DAO/repositories).
Best practices
Apply transactional boundaries at the service layer: it’s the responsibility of the service layer to manage transaction boundaries.
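A minimal sketch of a service that owns the transaction boundary and delegates to a repository (names follow the Employee example below):

import org.springframework.stereotype.Service;
import org.springframework.transaction.annotation.Transactional;
import java.util.List;

// EmployeeRepository is the Spring Data JPA repository shown in the next section
@Service
public class EmployeeServiceImpl {

    private final EmployeeRepository employeeRepository;

    public EmployeeServiceImpl(EmployeeRepository employeeRepository) {
        this.employeeRepository = employeeRepository;
    }

    public List<Employee> findAll() {
        return employeeRepository.findAll();
    }

    // the service layer owns the transactional boundary, not the DAO/repository
    @Transactional
    public Employee save(Employee theEmployee) {
        return employeeRepository.save(theEmployee);
    }
}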
Spring Data JPA
It helps us to minimize boilerplate DAO code by giving us, for example, JpaRepository, which provides CRUD implementations for free.
@Repository
public interface EmployeeRepository extends JpaRepository<Employee, Integer> {
}
Spring Data REST
By default, Spring Data REST will create REST endpoints based on the entity type.
For a collection, the response meta-data includes page size, total elements, total pages, etc.
Advanced features
Supports pagination, sorting and searching.
Extending and adding custom queries with JPQL.
Query Domain Specific Language (Query DSL).
By default, Spring Data REST returns the first 20 elements; we can navigate to the different pages of data using the ?page=N query parameter.
It also has configuration properties available, such as spring.data.rest.base-path, spring.data.rest.default-page-size and spring.data.rest.max-page-size.
Security Concepts
Authentication: Check user id and password with credentials stored in app/db.
Authorization: Check to see if user has an authorized role.
Declarative Security
Define the application’s security constraints in configuration: a class annotated with @Configuration, as used for all Java config.
Programmatic Security
Spring Security provides an API for custom application config.
Spring Security can read user accounts from different sources:
In-memory
JDBC
LDAP
Custom/Pluggable
Others…
Spring Security Password Storage
In Spring Security, passwords are stored using a specific format: {id}encodedPassword, where the id indicates the encoding algorithm, e.g. {noop} for plain text and {bcrypt} for BCrypt hashing.
For example:
@Configuration
public class DemoSecurityConfig {

    @Bean
    public InMemoryUserDetailsManager userDetailsManager(){
        UserDetails john = User.builder()
                .username("john")
                .password("{noop}test123")
                .roles("EMPLOYEE")
                .build();
        return new InMemoryUserDetailsManager(john);
    }
}
e.g.
@Bean
public SecurityFilterChain filterChain(HttpSecurity http) throws Exception {
    http.authorizeHttpRequests(configurer ->
            configurer
                    .requestMatchers(HttpMethod.GET, "/api/employees").hasRole("EMPLOYEE")
                    .requestMatchers(HttpMethod.GET, "/api/employees/**").hasRole("EMPLOYEE")
                    .requestMatchers(HttpMethod.POST, "/api/employees").hasRole("MANAGER")
                    .requestMatchers(HttpMethod.PUT, "/api/employees").hasRole("MANAGER")
                    .requestMatchers(HttpMethod.DELETE, "/api/employees/**").hasRole("ADMIN"));

    http.httpBasic(Customizer.withDefaults());

    return http.build();
}
CSRF (Cross-Site Request Forgery)
Basically, what Spring Security’s CSRF protection does in the background is embed an additional authentication token into all HTML forms; on subsequent requests, the web app verifies the token before processing.
The primary use case is traditional web applications (HTML forms, etc.).
If you are building a REST API for non-browser clients, you may want to disable CSRF protection.
In general, it’s not required for stateless REST APIs that use POST, PUT, DELETE and/or PATCH.
JDBC Authentication
Out of the box, Spring Security can read user accounts and roles from database tables with a predefined schema (users and authorities). You can also customize the table schemas, which is useful if you have custom tables specific to your project.
Internally, Spring Security uses the “ROLE_” prefix for roles.
With the following bean, we tell Spring Security to use JDBC authentication with our data source:
@Bean
public UserDetailsManager userDetailsManager(DataSource dataSource){
return new JdbcUserDetailsManager(dataSource);
}
BCrypt
It’s an algorithm that performs one-way password hashing; it’s recommended by the Spring Security team for protecting sensitive data (such as passwords) stored in the database.
It adds a random salt to the password for additional protection and includes support to defeat brute-force attacks.
The salt are the first 22 characters after the last dollar sign:
$2y$13$<this is the salt, 22 chars><this is the password hash>
The password from DB is NEVER decrypted, because bcrypt is a one-way encryption algorithm.
If our tables don’t follow the default schema, we need to provide a query to find a user by user name, and a query to find the authorities/roles by user name:

@Bean
public UserDetailsManager userDetailsManager(DataSource dataSource){
    JdbcUserDetailsManager jdbcUserDetailsManager = new JdbcUserDetailsManager(dataSource);
    jdbcUserDetailsManager.setUsersByUsernameQuery(
            "select user_id, pw, active from members where user_id = ?");
    jdbcUserDetailsManager.setAuthoritiesByUsernameQuery(
            "select user_id, roles from roles where user_id = ?");
    return jdbcUserDetailsManager;
}
Spring MVC
Thymeleaf
Thymeleaf is an open source Java templating engine, commonly used to generate the HTML views for web apps.
However, it’s a general-purpose templating engine, so we can also use Thymeleaf outside of web apps and without Spring. It’s a separate project, unrelated to spring.io, but there’s a lot of synergy between the two projects.
Development process
There are two steps:
1. Add the Thymeleaf dependency to the Maven POM; Spring Boot will then auto-configure itself to use Thymeleaf.
2. Create a Spring MVC controller and a Thymeleaf template named after the returned String (e.g. helloworld.html) in src/main/resources/templates/.
@Controller
public class DemoController {
@GetMapping("/hello")
public String sayHello(Model theModel){
theModel.addAttribute("theDate", new Date());
return "helloworld";
}
}
<html xmlns:th="http://www.thymeleaf.org">
<head>
    <title>Thymeleaf Demo</title>
</head>
<body>
    <p th:text="'Time on the server is ' + ${theDate}"/>
</body>
</html>
Additional Features
Looping and conditionals
CSS and JavaScript integration
Template layouts and fragments.
Also, Spring Boot will search following directories for static resources:
1. /META-INF/resources
2. /resources
3. /static
4. /public
Check boxes:
<input type="checkbox" th:field="*{favoriteSystems}"
th:value="Linux">Linux</input>
<input type="checkbox" th:field="*{favoriteSystems}"
th:value="macOS">macOS</input>
<input type="checkbox" th:field="*{favoriteSystems}"
th:value="'Microsoft Windows'">Microsoft Windows</input>
Dynamic list:
<ul>
    <li th:each="tempSystem : ${student.favoriteSystems}" th:text="${tempSystem}" />
</ul>
Spring MVC
Behind the Scenes
Components of a Spring MVC Application
A set of web pages to layout UI components.
A collection of Spring beans (controllers, services, etc...).
Spring configuration (XML, Annotations or Java).
The front controller, known as the DispatcherServlet, is part of the Spring Framework and is already developed by the Spring dev team.
Controller
The controller is code created by the developer and contains our business logic: it handles the request, stores/retrieves data and places data in the model before handing off to a view template.
Model
It contains our data and store/retrieves data via backend systems (database, web services,
spring beans, etc.).
The model is where the data is placed; it can be any Java object or collection (Strings, objects, info from a database, etc.).
When using a form with method GET, the form data is added to the end of the URL as name/value pairs, e.g. theUrl?field1=value1&field2=value2…
Data Binding
Spring MVC forms can make use of data binding to automatically set and retrieve data from a Java object/bean.
In our Spring Controller, before we show the form, we must add a model attribute. This is a
bean that will hold form data for the data binding.
And the HTML part for this should be
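A sketch, assuming the student model attribute has firstName and lastName fields (the action URL matches the controller code further below):

<form th:action="@{/processStudentForm}" th:object="${student}" method="POST">
    First name: <input type="text" th:field="*{firstName}" />
    <br><br>
    Last name: <input type="text" th:field="*{lastName}" />
    <br><br>
    <input type="submit" value="Submit" />
</form>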
The name used in th:object must be the same as the attribute name added to the model.
Also, the *{…} expressions in the inputs are a shortcut for ${student.firstName} and ${student.lastName}.
With this, when the form is loaded, Spring MVC will read student from the model, then call
student.getFirstName() and student.getLastName().
Now, to receive the submitted data, we add the object as a method parameter annotated with @ModelAttribute (see the Useful Code section below), and then we can simply read the modified values from it via its getters.
Drop-Down lists
To bind the selected value of a drop-down list to a property of an object, we use th:field on the <select> element and iterate over the list of options like this:
Country:
<select th:field="*{country}">
    <option th:each="tempCountry : ${countries}" th:value="${tempCountry}" th:text="${tempCountry}" />
</select>
Radio Buttons
To assign info to a field from a radio button, we can do this:
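A sketch, assuming a single-valued favoriteLanguage field on the form-backing bean (field name and values are illustrative):

<input type="radio" th:field="*{favoriteLanguage}" th:value="Go">Go
<input type="radio" th:field="*{favoriteLanguage}" th:value="Java">Java
<input type="radio" th:field="*{favoriteLanguage}" th:value="Python">Python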
Note that if a value has spaces, we have to put it in single quotes, just like the value 'Microsoft Windows' in the check box example above.
Validation
Spring Boot and Thymeleaf also support the Bean Validation API, which can check for:
Required.
Length.
Numbers.
Using regular expressions.
And custom validation.
Validation Annotations
Some of the standard validation annotations are @NotNull, @Min, @Max, @Size and @Pattern; you can also define custom ones.
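A sketch combining a few of these annotations with @Valid in a controller (the Customer class, field names and view names are assumptions):

import jakarta.validation.Valid;
import jakarta.validation.constraints.Min;
import jakarta.validation.constraints.NotNull;
import jakarta.validation.constraints.Size;
import org.springframework.stereotype.Controller;
import org.springframework.validation.BindingResult;
import org.springframework.web.bind.annotation.ModelAttribute;
import org.springframework.web.bind.annotation.PostMapping;

class Customer {

    @NotNull(message = "is required")
    @Size(min = 1, message = "is required")
    private String lastName;

    @Min(value = 0, message = "must be greater than or equal to zero")
    private Integer freePasses;

    // getters and setters omitted for brevity
}

@Controller
class CustomerController {

    @PostMapping("/processForm")
    public String processForm(@Valid @ModelAttribute("customer") Customer customer,
                              BindingResult bindingResult) {
        // if validation failed, send the user back to the form
        if (bindingResult.hasErrors()) {
            return "customer-form";
        }
        return "customer-confirmation";
    }
}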
Objects
HttpServletRequest
Holds HTML form data.
Model
Container for our data. When it comes in as a parameter of a request-mapping method, it arrives empty and we fill it.
Annotations
@RequestMapping
This Mapping handles ALL HTTP methods, e.g. GET, POST, etc.
@GetMapping
It’s the short way of using @RequestMapping(path="/processForm", method=RequestMethod.GET).
@PostMapping
It’s the short way of using @RequestMapping(path="/processForm", method=RequestMethod.POST).
Useful Code
Get HTML form data and send data to the model.
@RequestMapping("/processFormVersionTwo")
public String lestShoutDude(HttpServletRequest request, Model model){
String theName = request.getParameter("studentName");
theName = theName.toUpperCase();
model.addAttribute("message", result);
return "helloworld";
}
Get HTML form data and send data to the model. Without using HttpServletRequest
object.
@RequestMapping("/processFormVersionThree")
public String processFormVersionThree(@RequestParam("studentName")
String theName, Model model){
theName = theName.toUpperCase();
model.addAttribute("message", result);
return "helloworld";
}
Data binding on Java:
@GetMapping("/showStudentForm")
public String showForm(Model model){
Student student = new Student();
model.addAttribute("student", student);
return "student-form";
}
@PostMapping("/processStudentForm")
public String processStudentForm(@ModelAttribute("student") Student
student){
System.out.println("student: " + student.getFirstName() + " " +
student.getLastName());
return "student-confirmation";
}
JPA/Hibernate advanced mappings
Mappings
One to One Mapping
Used mostly to separate information into two tables, e.g. an Instructor entity and its associated InstructorDetail entity.
Cascade
Cascading means applying the same operation to related entities.
Cascade types
PERSIST: If entity is persisted/saved, related entity will also be persisted.
REMOVE: If entity is removed/deleted, related entity will also be deleted.
REFRESH: If entity is refreshed, related entity will also be refreshed.
DETACH: If entity is detached (not associated w/ session), then related entity will also
be detached.
MERGE: If entity is merged, then related entity will also be merged.
ALL: All of above cascade types.
Fetch types: Eager vs Lazy
Eager will retrieve the related entities up front; Lazy will retrieve them on request (when they are first accessed).
Mapping directions
Uni-directional
It’s a one-way relationship
Bi-directional
It’s when you can access the relationship object from both sides.
For this type of relationship, we use mappedBy on the non-owning side to tell JPA how to find the associated entity (e.g. the associated Instructor).
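A sketch of a bi-directional one-to-one mapping with cascading, using the Instructor / InstructorDetail example mentioned above (column names are assumed):

import jakarta.persistence.CascadeType;
import jakarta.persistence.Entity;
import jakarta.persistence.GeneratedValue;
import jakarta.persistence.GenerationType;
import jakarta.persistence.Id;
import jakarta.persistence.JoinColumn;
import jakarta.persistence.OneToOne;

@Entity
class Instructor {

    @Id
    @GeneratedValue(strategy = GenerationType.IDENTITY)
    private int id;

    // owning side: holds the foreign key column; cascade all operations to the detail
    @OneToOne(cascade = CascadeType.ALL)
    @JoinColumn(name = "instructor_detail_id")
    private InstructorDetail instructorDetail;
}

@Entity
class InstructorDetail {

    @Id
    @GeneratedValue(strategy = GenerationType.IDENTITY)
    private int id;

    // non-owning side: mappedBy refers to the "instructorDetail" field in Instructor
    @OneToOne(mappedBy = "instructorDetail", cascade = CascadeType.ALL)
    private Instructor instructor;
}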
Entity Lifecycle
Many operations can be performed on an entity: detach, merge, persist, remove and refresh.
Apache Kafka
Choosing Partition Count & Replication Factor
If the partition count increases during a topic’s lifecycle, you will break your keys’ ordering guarantees.
If the replication factor increases during a topic’s lifecycle, you put more pressure on your cluster, which can lead to an unexpected performance decrease.
Replication factor guidelines:
Set it to 3 to get started (you must have at least 3 brokers for that).
If replication performance is an issue, get a better broker instead of less RF.
Never set it to 1 in production.
Cluster guidelines
Total number of partitions in the cluster:
Kafka with ZooKeeper: max 200,000 partitions (Nov 2018) – ZooKeeper scaling limit.
o Still recommend a maximum of 4,000 partitions per broker (soft limit).
Kafka with KRaft: potentially millions of partitions.
If you need more than 200,000 partitions in your cluster, follow the Netflix model and create more Kafka clusters.
Overall, you don’t need a topic with 1000 partitions to achieve high throughput. Start at a reasonable number and test the performance.
Topic naming conventions
A common convention is to name topics as <message type>.<dataset name>.<data name>.<data format>:
The message type describes the kind of data carried by the topic.
The dataset name is analogous to a database name in traditional RDBMS systems. It’s used as a category to group topics together.
The data name field is analogous to a table name in traditional RDBMS systems, though it’s fine to include further dotted notation if developers wish to impose their own hierarchy within the dataset namespace.
The data format, for example .avro, .json, .protobuf, .csv, .log.
Use snake_case.
Advanced Topics Configuration
Changing a Topic Configuration
Why should I care about topic configuration?
Brokers have defaults for all topic configuration parameters, but some topics may need different values, which you can override per topic, e.g.:
Replication factor.
# of Partitions.
Message size.
Compression level.
Log Cleanup Policy.
Min Insync Replicas.
Other configurations.
Partitions and Segments
Topics are made of partitions, and each partition is made of segments (files). Only one segment is ACTIVE (the one data is being written to).
Each segment comes with two indexes (files):
An offset to position index: helps Kafka find where to read from to find a message.
A timestamp to offset index: helps Kafka find messages with a specific timestamp.
Segments: Why should I care?
A smaller log.segment.bytes (max segment size, default 1 GB) means more segments per partition, log compaction happens more often, and Kafka has to keep more files open.
Ask yourself: how fast will I have new segments, based on my throughput?
A smaller log.segment.ms (max time before a new segment is rolled, default 7 days) means you set a higher max frequency for log compaction (more frequent triggers); maybe you want daily compaction instead of weekly.
Log Cleanup Policies
Deleting data from Kafka allows you to control the size of the data on the disk and delete obsolete data. Overall, it limits the maintenance work on the Kafka cluster.
Log Compaction (cleanup.policy=compact)
Very useful if we just require a SNAPSHOT instead of the full history (such as for a data table in a database).
The idea is that we only keep the latest “update” for each key in our log.
Example
We want a topic of employee-salary and we want to keep the most recent salary for our
employees:
Log Compaction Guarantees
Any consumer that’s reading from the tail of the log (the most current data) will still see all the messages sent to the topic.
Ordering of messages is kept; log compaction only removes some messages, it does not re-order them.
The offset of a message is immutable (it never changes); offsets are simply skipped if a message is missing.
Log compaction doesn’t prevent you from reading duplicate data from Kafka (same points as for normal consumption).
You can’t trigger log compaction with an API call (for now…).
How it works
segment.ms (default 7 days): max amount of time to wait before closing the active segment.
segment.bytes (default 1 GB): max size of a segment.
min.compaction.lag.ms (default 0): how long to wait before a message can be compacted.
delete.retention.ms (default 24 hours): how long to wait before deleting data marked for compaction.
min.cleanable.dirty.ratio (default 0.5): higher => less frequent but more efficient cleaning; lower => more frequent but less efficient cleaning.
Unclean Leader Election
unclean.leader.election.enable
If all your in-sync replicas go offline (but you still have out-of-sync replicas up), you have the following options: wait for an in-sync replica to come back online (the default), or enable unclean.leader.election.enable=true and accept an out-of-sync replica as the new leader.
If you enable unclean.leader.election.enable=true, you improve availability, but you will lose data, because the messages on the former ISR will be discarded when they come back online and replicate data from the new leader.
Overall, this is a very dangerous setting, and its implications must be understood fully before
enabling it.
Use cases include metrics collection, log collection, and other cases where data loss is
somewhat acceptable, at the trade-off of availability.
Large Messages in Apache Kafka
Kafka has a default maximum of 1 MB per message in topics, as large messages are considered inefficient and an anti-pattern. There are two approaches to send large messages:
1. Using an external store: store messages in HDFS, Amazon S3, Google Cloud Storage, etc. and send a reference to that message to Apache Kafka.
2. Modifying Kafka parameters: you must change broker (or topic), producer and consumer settings, e.g.:
   Topic/broker side: message.max.bytes (broker-wide) or max.message.bytes (per topic) so the broker accepts larger messages.
   Consumer side: max.partition.fetch.bytes = 10485880
   Producer side: max.request.size = 10485880
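A sketch of the producer/consumer side of option 2 using the Java client (the bootstrap server address is a placeholder; values match the settings above, and the broker/topic-side limit must be raised as well):

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.producer.ProducerConfig;
import java.util.Properties;

public class LargeMessageConfig {

    public static Properties producerProps() {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        // allow producer requests (and therefore messages) up to ~10 MB
        props.put(ProducerConfig.MAX_REQUEST_SIZE_CONFIG, "10485880");
        return props;
    }

    public static Properties consumerProps() {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        // allow the consumer to fetch up to ~10 MB per partition
        props.put(ConsumerConfig.MAX_PARTITION_FETCH_BYTES_CONFIG, "10485880");
        return props;
    }
}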