(vision) SCS/DDD
Introduce service discovery
- Implement Consul as the service registry
- Configure services for automatic registration
- Set up dynamic service routing in the API gateway
- Implement health checks for every service
@@ -0,0 +1,199 @@
# Meldestelle Monitoring System

This document describes the monitoring system set up for the Meldestelle application. The monitoring system includes metrics collection, visualization, centralized logging, and alerting.

## Components

The monitoring system consists of the following components:

1. **Prometheus** - For metrics collection and storage
2. **Grafana** - For metrics visualization and dashboards
3. **ELK Stack** - For centralized logging (Elasticsearch, Logstash, Kibana)
4. **Alertmanager** - For alert management and notifications

## Architecture

The monitoring system is deployed as Docker containers alongside the Meldestelle application. The components interact as follows:

- The Meldestelle application exposes metrics at the `/metrics` endpoint
- Prometheus scrapes metrics from the application and stores them
- Grafana visualizes the metrics from Prometheus
- The application sends logs to Logstash
- Logstash processes the logs and sends them to Elasticsearch
- Kibana visualizes the logs from Elasticsearch
- Prometheus evaluates alerting rules and sends alerts to Alertmanager
- Alertmanager manages alerts and sends notifications via configured channels (email, Slack, etc.)

## Setup

The monitoring system is configured in the `docker-compose.yml` file and the configuration files in the `config/monitoring` directory.

### Prerequisites

- Docker and Docker Compose
- The Meldestelle application running with metrics enabled

### Starting the Monitoring System

To start the monitoring system, run:

```bash
docker-compose up -d prometheus grafana alertmanager
```

To start the ELK Stack, run:

```bash
docker-compose up -d elasticsearch logstash kibana
```

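Containers started with `up -d` may need a few seconds before they accept requests. A small retry helper can be used to wait for them. This is a sketch under assumptions: `retry` is a hypothetical function that is not part of the repository, and the URLs in the usage note assume the default ports listed in this document.

```bash
# Hypothetical helper: run a command up to <attempts> times, sleeping
# <delay> seconds between failed attempts. Returns 0 as soon as the
# command succeeds, 1 if every attempt failed.
retry() {
  local attempts="$1" delay="$2" i=1
  shift 2
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    i=$((i + 1))
  done
  return 1
}
```

For example, `retry 30 2 curl -fsS -o /dev/null http://localhost:9090/-/ready` waits up to about a minute for Prometheus. Alertmanager offers the same `/-/ready` path on port 9093, and Grafana answers on `/api/health` on port 3000.
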
### Testing the Monitoring System

A test script is provided to verify that the monitoring system is working correctly:

```bash
./test-monitoring.sh
```

## Accessing the Monitoring Tools

- **Prometheus**: http://localhost:9090
- **Grafana**: http://localhost:3000 (default credentials: admin/admin)
- **Alertmanager**: http://localhost:9093
- **Kibana**: http://localhost:5601

## Metrics

The following metrics are collected by Prometheus:

### JVM Metrics

- Memory usage (heap and non-heap)
- Garbage collection statistics
- Thread counts
- Class loading statistics
- CPU usage

### Application Metrics

- HTTP request counts
- HTTP request durations
- Error rates
- Custom business metrics

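To spot-check that the metrics listed above are actually exposed, the text output of the `/metrics` endpoint can be filtered by metric name. The following is a minimal sketch, not part of the repository: `metric_value` is a hypothetical helper, and it assumes Prometheus text-format samples without a trailing timestamp (the last field of each line is taken as the value).

```bash
# Hypothetical helper: print the sample value(s) of one metric from
# Prometheus text-format input on stdin.
metric_value() {
  awk -v name="$1" '
    /^#/ { next }                 # skip HELP/TYPE comment lines
    NF >= 2 {
      metric = $1
      sub(/\{.*/, "", metric)     # drop the label set, if any
      if (metric == name) print $NF
    }'
}
```

A possible invocation is `curl -s http://localhost:8080/metrics | metric_value jvm_threads_live_threads`. Both the application port (8080) and the metric name are assumptions; the names actually exported depend on how the application registers its meters.
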
## Dashboards

Grafana dashboards are provided for visualizing the metrics:

- **JVM Dashboard**: Shows JVM metrics such as memory usage, garbage collection, and thread counts
- **Application Dashboard**: Shows application metrics such as request rates, error rates, and response times

## Alerting

Alerting is configured in Prometheus and Alertmanager. The following alerts are defined:

- **High Memory Usage**: Triggered when JVM heap memory usage exceeds 85% for 5 minutes
- **High CPU Usage**: Triggered when CPU usage exceeds 85% for 5 minutes
- **High Error Rate**: Triggered when the error rate exceeds 5% for 2 minutes
- **Service Unavailable**: Triggered when the service is down for 1 minute
- **Slow Response Time**: Triggered when the average response time exceeds 1 second for 5 minutes
- **High GC Pause Time**: Triggered when the average GC pause time exceeds 0.5 seconds for 5 minutes

Alerts are sent to the configured notification channels (email and Slack).

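To see from the command line how many of the alerts above are currently firing, the Prometheus HTTP API endpoint `/api/v1/alerts` can be queried. The sketch below counts firing alerts in such a response. `count_firing` is a hypothetical helper, and the plain text matching assumes compact JSON without spaces around the colon; a JSON-aware tool such as `jq` would be more robust.

```bash
# Hypothetical helper: count alerts in state "firing" in an
# /api/v1/alerts response read from stdin.
count_firing() {
  # grep exits non-zero when nothing matches; "|| true" keeps the
  # pipeline usable under "set -e" / "pipefail".
  { grep -o '"state":"firing"' || true; } | wc -l | tr -d ' '
}
```

Usage, assuming the default Prometheus port: `curl -s http://localhost:9090/api/v1/alerts | count_firing`.
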
## Logging

Logs are collected by Logstash, stored in Elasticsearch, and visualized in Kibana. The following log sources are configured:

- Application logs via TCP (JSON format)
- File logs from the `/var/log/meldestelle` directory

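To verify the TCP input end to end, a hand-written JSON log line can be sent to Logstash. This is a sketch under assumptions: `json_log_line` is a hypothetical helper, the field names are illustrative rather than the application's actual log schema, and the port in the usage note is a guess that must match the `tcp` input in `logstash.conf`.

```bash
# Hypothetical helper: print one JSON log line.
# Arguments: <level> <message> [timestamp]; the timestamp defaults to now (UTC).
# Only backslashes and double quotes are escaped; control characters are not.
json_log_line() {
  local level="$1" message="$2"
  local ts="${3:-$(date -u +%Y-%m-%dT%H:%M:%SZ)}"
  local escaped
  # Escape backslashes first, then double quotes.
  escaped=$(printf '%s' "$message" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g')
  printf '{"@timestamp":"%s","level":"%s","message":"%s"}\n' "$ts" "$level" "$escaped"
}
```

For example, `json_log_line INFO "monitoring test" | nc localhost 5000` (port 5000 is assumed, and some netcat variants need an extra flag to close the connection after sending). The entry should then be searchable in Kibana.
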
## Configuration Files

- **Prometheus**: `config/monitoring/prometheus.yml`
- **Alertmanager**: `config/monitoring/alertmanager/alertmanager.yml`
- **Alerting Rules**: `config/monitoring/prometheus/rules/alerts.yml`
- **Grafana Dashboards**: `config/monitoring/grafana/dashboards/`
- **Grafana Datasources**: `config/monitoring/grafana/provisioning/datasources/`
- **Logstash**: `config/monitoring/elk/logstash.conf`
- **Elasticsearch**: `config/monitoring/elk/elasticsearch.yml`

## Troubleshooting

### Prometheus

- Check if Prometheus is running: `docker-compose ps prometheus`
- Check Prometheus logs: `docker-compose logs prometheus`
- Verify that Prometheus can scrape metrics: http://localhost:9090/targets
- Check if alerting rules are loaded: http://localhost:9090/rules

### Grafana

- Check if Grafana is running: `docker-compose ps grafana`
- Check Grafana logs: `docker-compose logs grafana`
- Verify that Grafana can connect to Prometheus: http://localhost:3000/datasources

### Alertmanager

- Check if Alertmanager is running: `docker-compose ps alertmanager`
- Check Alertmanager logs: `docker-compose logs alertmanager`
- Verify that Alertmanager is receiving alerts: http://localhost:9093/#/alerts

### ELK Stack

- Check if Elasticsearch is running: `docker-compose ps elasticsearch`
- Check Elasticsearch logs: `docker-compose logs elasticsearch`
- Check if Logstash is running: `docker-compose ps logstash`
- Check Logstash logs: `docker-compose logs logstash`
- Check if Kibana is running: `docker-compose ps kibana`
- Check Kibana logs: `docker-compose logs kibana`
- Verify that Elasticsearch is receiving logs: http://localhost:9200/_cat/indices
- Verify that Kibana can connect to Elasticsearch: http://localhost:5601/app/management/kibana/indexPatterns

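The per-service checks above can be run in one pass. A minimal sketch, assuming the service names used in this document: `service_report` is a hypothetical helper, and the `COMPOSE` variable is introduced here only so that the command can be swapped, for example for `docker compose`.

```bash
# Command used to talk to Compose; override it to use "docker compose".
COMPOSE="${COMPOSE:-docker-compose}"

# Hypothetical helper: print the status and the last log lines of each
# service passed as an argument.
service_report() {
  local service
  for service in "$@"; do
    echo "== $service =="
    # $COMPOSE is intentionally unquoted so "docker compose" splits into two words.
    $COMPOSE ps "$service"
    $COMPOSE logs --tail=20 "$service"
  done
}
```

Usage: `service_report prometheus grafana alertmanager elasticsearch logstash kibana`.
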
## Maintenance

### Backup and Restore

- Prometheus data is stored in the `prometheus_data` volume
- Grafana data is stored in the `grafana_data` volume
- Alertmanager data is stored in the `alertmanager_data` volume
- Elasticsearch data is stored in the `elasticsearch_data` volume

To back up a volume, archive its contents from a temporary container:

```bash
docker run --rm -v prometheus_data:/source -v "$(pwd)/backup:/backup" alpine tar -czf /backup/prometheus_data.tar.gz -C /source .
```

To restore from a backup:

```bash
docker run --rm -v prometheus_data:/target -v "$(pwd)/backup:/backup" alpine sh -c "rm -rf /target/* && tar -xzf /backup/prometheus_data.tar.gz -C /target"
```

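The backup command above covers only `prometheus_data`. The same pattern can be looped over all four volumes. This is a sketch: `backup_volume` and `backup_all` are hypothetical helpers, and the `DOCKER` variable exists only to make the command swappable. Note that Docker Compose usually prefixes volume names with the project name, so the names reported by `docker volume ls` may differ from the ones listed here.

```bash
DOCKER="${DOCKER:-docker}"

# Hypothetical helper: archive one named volume into
# <backup_dir>/<volume>.tar.gz. <backup_dir> must be an absolute path.
backup_volume() {
  local volume="$1" backup_dir="$2"
  mkdir -p "$backup_dir"
  # Mount the volume read-only so the backup cannot modify it.
  $DOCKER run --rm \
    -v "$volume:/source:ro" \
    -v "$backup_dir:/backup" \
    alpine tar -czf "/backup/$volume.tar.gz" -C /source .
}

# Hypothetical helper: back up every volume listed in this document.
backup_all() {
  local volume
  for volume in prometheus_data grafana_data alertmanager_data elasticsearch_data; do
    backup_volume "$volume" "$1"
  done
}
```

Usage: `backup_all "$(pwd)/backup"`. Archiving the data directory of a running service may capture an inconsistent state, so it is generally safer to stop the service first.
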
### Updating

To update the monitoring components, update the image tags in the `docker-compose.yml` file and run:

```bash
docker-compose pull prometheus grafana alertmanager
docker-compose up -d prometheus grafana alertmanager
```

## Security Considerations

- The monitoring system is configured for development and testing purposes
- For production use, consider the following security measures:
  - Enable authentication for Prometheus
  - Use strong passwords for Grafana
  - Configure TLS for all components
  - Restrict access to the monitoring endpoints
  - Use environment variables for sensitive configuration values
  - Implement network segmentation to isolate the monitoring system

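For the points on strong passwords and environment variables, one option is to generate a random Grafana admin password and pass it through the environment. This is a sketch under assumptions: `generate_password` is a hypothetical helper, and it only has an effect if `docker-compose.yml` forwards `GF_SECURITY_ADMIN_PASSWORD` to the Grafana container, which this document does not confirm.

```bash
# Hypothetical helper: print 48 random hex characters (24 random bytes).
generate_password() {
  od -An -N24 -tx1 /dev/urandom | tr -d ' \n'
  echo
}
```

Usage: `export GF_SECURITY_ADMIN_PASSWORD="$(generate_password)"` before `docker-compose up -d grafana`. Grafana applies this setting when it first initializes its database, so an existing `grafana_data` volume keeps its current admin password.
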
## Further Reading

- [Prometheus Documentation](https://prometheus.io/docs/introduction/overview/)
- [Grafana Documentation](https://grafana.com/docs/grafana/latest/)
- [Alertmanager Documentation](https://prometheus.io/docs/alerting/latest/alertmanager/)
- [ELK Stack Documentation](https://www.elastic.co/guide/index.html)

@@ -62,20 +62,6 @@ implementation("io.ktor:ktor-client-cio:${libs.versions.ktor.get()}")
 Create a service registration component in the shared-kernel module:

 ```kotlin
-package at.mocode.shared.discovery
-
-import at.mocode.shared.config.AppConfig
-import com.orbitz.consul.Consul
-import com.orbitz.consul.model.agent.ImmutableRegistration
-import com.orbitz.consul.model.agent.Registration
-import kotlinx.coroutines.CoroutineScope
-import kotlinx.coroutines.Dispatchers
-import kotlinx.coroutines.delay
-import kotlinx.coroutines.launch
-import java.net.InetAddress
-import java.util.*
-import kotlin.time.Duration.Companion.seconds
-
 class ServiceRegistration(
     private val serviceName: String,
     private val servicePort: Int,
@@ -222,17 +208,6 @@ implementation("io.ktor:ktor-serialization-kotlinx-json:${libs.versions.ktor.get
 Create a service discovery component in the API Gateway:

 ```kotlin
-package at.mocode.gateway.discovery
-
-import com.orbitz.consul.Consul
-import com.orbitz.consul.model.health.ServiceHealth
-import io.ktor.client.*
-import io.ktor.client.engine.cio.*
-import io.ktor.client.request.*
-import io.ktor.http.*
-import java.net.URI
-import java.util.concurrent.ConcurrentHashMap
-
 class ServiceDiscovery(
     private val consulHost: String = "consul",
     private val consulPort: Int = 8500

@@ -157,7 +157,7 @@ private fun getRolePermissions(roles: List<UserRole>): List<Permission> {
     roles.forEach { role ->
         when (role) {
             UserRole.ADMIN -> {
-                permissions.addAll(Permission.values())
+                permissions.addAll(Permission.entries.toTypedArray())
             }
             UserRole.VEREINS_ADMIN -> {
                 permissions.addAll(listOf(
@@ -354,7 +354,7 @@ val PipelineContext<Unit, ApplicationCall>.userAuthContext: UserAuthContext?
     get() = call.principal<JWTPrincipal>()?.getUserAuthContext()

 /**
- * Application call extension to check if user has specific role.
+ * Application call extension to check if the user has a specific role.
  */
 fun ApplicationCall.hasRole(role: UserRole): Boolean {
     val authContext = principal<JWTPrincipal>()?.getUserAuthContext()
@@ -362,7 +362,7 @@ fun ApplicationCall.hasRole(role: UserRole): Boolean {
 }

 /**
- * Application call extension to check if user has specific permission.
+ * Application call extension to check if the user has specific permission.
  */
 fun ApplicationCall.hasPermission(permission: Permission): Boolean {
     val authContext = principal<JWTPrincipal>()?.getUserAuthContext()

@@ -95,23 +95,23 @@ class CachingConfig(
     }

     /**
-     * Put a value in cache with TTL in minutes
+     * Put a value in a cache with TTL in minutes
      */
     fun <T> put(cacheName: String, key: String, value: T, ttlMinutes: Long = defaultTtlMinutes) {
         val stats = cacheStats.computeIfAbsent(cacheName) { CacheStats() }
         stats.puts++

-        // Store in local cache
+        // Store in a local cache
         val expiresAt = System.currentTimeMillis() + TimeUnit.MINUTES.toMillis(ttlMinutes)
         val entry = CacheEntry(value as Any, expiresAt)
         getCacheMap(cacheName)[key] = entry
     }

     /**
-     * Remove a value from cache
+     * Remove a value from the cache
      */
     fun remove(cacheName: String, key: String) {
-        // Remove from local cache
+        // Remove from the local cache
         getCacheMap(cacheName).remove(key)
     }

@@ -136,7 +136,7 @@ class CachingConfig(
     }

     /**
-     * Get the appropriate cache map based on cache name
+     * Get the appropriate cache map based on the cache name
      */
     private fun getCacheMap(cacheName: String): ConcurrentHashMap<String, CacheEntry<Any>> {
         return when (cacheName) {

@@ -1,18 +1,13 @@
package at.mocode.gateway.config

import io.ktor.server.application.*
import io.ktor.server.plugins.*
import io.ktor.server.request.*
import io.ktor.server.routing.*
import io.ktor.util.*
import io.micrometer.core.instrument.Counter
import io.micrometer.core.instrument.MeterRegistry
import io.micrometer.core.instrument.Timer
import io.micrometer.core.instrument.binder.MeterBinder
import io.micrometer.prometheus.PrometheusMeterRegistry
import java.time.Duration
import java.util.concurrent.ConcurrentHashMap
import java.util.concurrent.TimeUnit

/**
 * Custom application metrics configuration.

@@ -4,19 +4,10 @@ import at.mocode.dto.base.ApiResponse
import at.mocode.shared.config.AppConfig
import io.ktor.http.*
import io.ktor.server.application.*
import io.ktor.server.metrics.micrometer.*
import io.ktor.server.plugins.calllogging.*
import io.ktor.server.plugins.statuspages.*
import io.ktor.server.request.*
import io.ktor.server.response.*
import io.ktor.server.routing.*
import io.micrometer.core.instrument.binder.jvm.ClassLoaderMetrics
import io.micrometer.core.instrument.binder.jvm.JvmGcMetrics
import io.micrometer.core.instrument.binder.jvm.JvmMemoryMetrics
import io.micrometer.core.instrument.binder.jvm.JvmThreadMetrics
import io.micrometer.core.instrument.binder.system.ProcessorMetrics
import io.micrometer.prometheus.PrometheusConfig
import io.micrometer.prometheus.PrometheusMeterRegistry
import org.slf4j.event.Level
import java.time.LocalDateTime
import java.time.format.DateTimeFormatter

@@ -131,7 +131,7 @@ class ServiceDiscovery(
      * @return The complete URL
      */
     fun buildServiceUrl(instance: ServiceInstance, path: String): String {
-        val baseUrl = "http://${instance.host}:${instance.port}"
+        val baseUrl = "https://${instance.host}:${instance.port}"
         return URI(baseUrl).resolve(path).toString()
     }

@@ -143,7 +143,7 @@ class ServiceDiscovery(
      */
     suspend fun isServiceHealthy(serviceName: String): Boolean {
         try {
-            val response = httpClient.get("http://$consulHost:$consulPort/v1/health/service/$serviceName?passing=true")
+            val response = httpClient.get("https://$consulHost:$consulPort/v1/health/service/$serviceName?passing=true")
             val responseBody = response.bodyAsText()
             val healthyServices = Json.decodeFromString<List<Any>>(responseBody)
             return healthyServices.isNotEmpty()

@@ -1,6 +1,5 @@
 package at.mocode.gateway.plugins

-import at.mocode.gateway.config.CachingConfig
 import at.mocode.gateway.config.getCachingConfig
 import io.ktor.http.*
 import io.ktor.server.application.*
@@ -10,7 +9,6 @@ import io.ktor.util.pipeline.*
 import java.security.MessageDigest
 import java.text.SimpleDateFormat
 import java.util.*
-import kotlin.text.Charsets

 /**
  * Configures enhanced HTTP caching headers for the application.
@@ -190,7 +188,7 @@ suspend fun PipelineContext<Unit, ApplicationCall>.checkLastModifiedAndRespond(t
                 call.respond(HttpStatusCode.NotModified)
                 return true
             }
-        } catch (e: Exception) {
+        } catch (_: Exception) {
             // If we can't parse the date, ignore it
         }
     }
@@ -217,7 +215,7 @@ suspend fun <T> PipelineContext<Unit, ApplicationCall>.checkCacheAndRespond(
     val application = call.application
     val cachingConfig = try {
         application.getCachingConfig()
-    } catch (e: Exception) {
+    } catch (_: Exception) {
         return false
     }

@@ -25,7 +25,7 @@ dependencyResolutionManagement {
                 includeGroupAndSubgroups("com.google")
             }
         }
-        // Add JCenter repository (archive)
+        // Add a JCenter repository (archive)
         maven {
             url = uri("https://jcenter.bintray.com")
         }