HBase Basics: Concepts and Installation

HBase Basic Concepts

What HBase Can Do

  • Massive-scale data storage
  • Near-real-time queries

HBase in Real-World Business Scenarios

  • Transportation: GPS and camera data
  • Finance: transaction records
  • E-commerce: transactions, browsing history, logistics data

HBase Features

  • Large capacity: a single HBase table can hold tens of billions of rows and millions of columns; both dimensions of the data matrix scale elastically.
  • Column-oriented: HBase stores data and applies access control by column (family), and supports independent retrieval. With column-oriented storage, data is laid out on disk by column, so queries that touch only a few fields read far less data. Columns can also be added dynamically.
  • Multi-version: each column in HBase can keep multiple versions of its value.
  • Sparsity: empty columns take up no storage, so tables can be designed very sparsely.
  • Scalability: storage is backed by HDFS, so capacity grows by simply adding machines.
  • High reliability: the WAL (write-ahead log) ensures that writes are not lost when cluster nodes fail, and the underlying HDFS keeps replicas of the data.
  • High performance: the LSM-tree storage structure and rowkey-ordered layout give HBase very high write throughput. Region splitting, the rowkey index, and the caching layers give it solid random-read performance even at massive scale; point lookups by rowkey can complete in milliseconds.

HBase Data Model

Column families:

  • A table should not have more than 5 column families; more increase disk I/O and degrade performance.
  • There is no limit on the number of columns within a column family.
  • A column exists only after data has been inserted into it.
  • Columns within a column family are stored in sorted order (both this and the multi-version behavior are demonstrated in the sketch below).
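
These rules, together with the multi-version feature above, are easy to observe in the shell. A minimal sketch, assuming a fresh table named 'user' (table, rowkey, and values are illustrative):

    create 'user', {NAME => 'info', VERSIONS => 3}
    put 'user', '0001', 'info:city', 'beijing'
    put 'user', '0001', 'info:city', 'shanghai'
    # returns up to 3 versions of the cell, newest first
    get 'user', '0001', {COLUMN => 'info:city', VERSIONS => 3}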

Basic Operations

hbase(main):001:0> create 'test','info'
0 row(s) in 11.7610 seconds

=> Hbase::Table - test
hbase(main):002:0> put 'test','0001','info:username','euraxluo'
0 row(s) in 6.6130 seconds

hbase(main):003:0> scan 'test'
ROW                     COLUMN+CELL
 0001                   column=info:username, timestamp=1577270839695, value=euraxluo
1 row(s) in 0.6160 seconds

hbase(main):004:0> put 'test','0001','info:age','12'
0 row(s) in 0.3720 seconds

hbase(main):005:0> scan 'test'
ROW                     COLUMN+CELL
 0001                   column=info:age, timestamp=1577270880203, value=12
 0001                   column=info:username, timestamp=1577270839695, value=euraxluo
1 row(s) in 0.0240 seconds

hbase(main):006:0> describe 'test'
Table test is ENABLED
test
COLUMN FAMILIES DESCRIPTION
{NAME => 'info', BLOOMFILTER => 'ROW', VERSIONS => '1', IN_MEMORY => 'false', KEEP_DELETED_CELLS => 'FALSE', DATA_BLOCK_ENCODING => 'NONE', TTL => 'FOREVER', COMPRESSION => 'NONE', MIN_VERSIONS => '0', BLOCKCACHE => 'true', BLOCKSIZE => '65536', REPLICATION_SCOPE => '0'}
1 row(s) in 0.3110 seconds

hbase(main):007:0> get 'test','0001','info:age'
COLUMN                  CELL
 info:age               timestamp=1577283627980, value=12
 
hbase(main):008:0> truncate 'test'
Truncating 'test' table (it may take a while):
 - Disabling table...
 - Truncating table...
0 row(s) in 4.2500 seconds


hbase(main):009:0> disable 'test'
0 row(s) in 9.8040 seconds


hbase(main):010:0> is_enabled 'test'
false
0 row(s) in 0.1060 seconds


hbase(main):011:0> drop 'test'
0 row(s) in 5.0910 seconds
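
Cells and rows can also be deleted before a table is dropped; a short sketch in the same shell (table and rowkey as above, prompts omitted):

    delete 'test','0001','info:age'     # removes a single cell (writes a delete marker)
    deleteall 'test','0001'             # removes the entire row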

Installation (Hadoop, ZooKeeper, HBase, and Kafka: single-node pseudo-distributed setup and caveats)

Tarball download links

/etc/profile

export PATH USER LOGNAME MAIL HOSTNAME HISTSIZE HISTCONTROL
export JAVA_HOME=/home/software/app/jdk1.8.0_202
export CLASSPATH=.:${JAVA_HOME}/jre/lib/rt.jar:${JAVA_HOME}/lib/dt.jar:${JAVA_HOME}/lib/tools.jar
export PATH=$JAVA_HOME/bin:$PATH
export SCALA_HOME=/home/software/app/scala-2.12.8
export PATH=$SCALA_HOME/bin:$PATH
export MAVEN_HOME=/home/software/app/apache-maven-3.3.9
export PATH=$MAVEN_HOME/bin:$PATH
export HADOOP_HOME=/home/software/app/hadoop-2.6.0-cdh5.7.0
export CLASSPATH=$($HADOOP_HOME/bin/hadoop classpath):$CLASSPATH
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$PATH
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib"
export ZOOKEEPER_HOME=/home/software/app/zookeeper-3.4.10
export PATH=$ZOOKEEPER_HOME/bin:$PATH
export HBASE_HOME=/home/software/app/hbase-1.2.0-cdh5.7.0
export PATH=$HBASE_HOME/bin:$PATH
export KAFKA_HOME=/home/software/app/kafka #_2.12-0.10.2.0
export PATH=$KAFKA_HOME/bin:$PATH
export JAVA_OPTS="-server -Xms256m -Xmx512m  -XX:PermSize=128m -XX:MaxPermSize=256m"
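
After editing /etc/profile, reload it so the variables take effect in the current shell:

    source /etc/profile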

Install the JDK

  • Skipped here: just extract the archive and set the environment variables (see /etc/profile above)

Hadoop installation and configuration:

  • Use the app directory to hold all of these packages

  • Extract: tar -zxvf hadoop-2.6.0-cdh5.7.0.tar.gz -C app

  • Set the environment variables

  • Extract the native libraries, since the bundled lib/native directory is empty: tar -xvf hadoop-native-64-2.6.0.tar -C app/hadoop-2.6.0-cdh5.7.0/lib/native/ ; tar -xvf hadoop-native-64-2.6.0.tar -C app/hadoop-2.6.0-cdh5.7.0/lib

  • Write the configuration files:

    <!-- app/hadoop-2.6.0-cdh5.7.0/etc/hadoop/core-site.xml -->
    <configuration>
            <property>
                 <name>hadoop.tmp.dir</name>
                 <value>/home/software/hadoop_tmp</value>
            </property>
            <property>
                 <name>fs.defaultFS</name>
                 <value>hdfs://localhost:9001</value>
            </property>
    </configuration>
    
    <!--  app/hadoop-2.6.0-cdh5.7.0/etc/hadoop/hdfs-site.xml-->
    <configuration>
            <property>
                 <name>dfs.replication</name>
                 <value>1</value>
            </property>
             <property>
                   <name>dfs.namenode.name.dir</name>
                   <value>file:///home/software/hadoop_tmp/dfs/name</value>
             </property>
             <property>
                   <name>dfs.datanode.data.dir</name>
                   <value>file:///home/software/hadoop_tmp/dfs/data</value>
             </property>
    <!--
             <property>
                   <name>fs.checkpoint.data.dir</name>
                   <value>file:///home/software/hadoop_tmp/dfs/fcd</value>
             </property>
             <property>
                   <name>fs.checkpoint.edits.dir</name>
                   <value>file:///home/software/hadoop_tmp/dfs/fce</value>
             </property> -->
             <property>
                   <name>dfs.permissions.enabled</name>
                   <value>false</value>
             </property>
    </configuration>
    
    [software]# cat app/hadoop-2.6.0-cdh5.7.0/etc/hadoop/slaves
    localhost
    
    [software]# cat app/hadoop-2.6.0-cdh5.7.0/etc/hadoop/hadoop-env.sh |grep JAVA_HOME
    # The only required environment variable is JAVA_HOME.  All others are
    # set JAVA_HOME in this file, so that it is correctly defined on
    export JAVA_HOME=/home/software/app/jdk1.8.0_202
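
  • start-dfs.sh launches the daemons over SSH for every host listed in slaves (localhost here), so passwordless SSH to localhost is usually needed first. A minimal sketch, assuming no key pair exists yet:

    ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
    cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
    chmod 600 ~/.ssh/authorized_keys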
    
  • Disable the firewall

    firewall-cmd --state
    systemctl stop firewalld.service
    firewall-cmd --state
    
  • Test

    hdfs namenode -format   # check the output for errors
    start-dfs.sh            # check the logs for errors
    jps                     # expect roughly the following processes:
    # 21366 DataNode
    # 7958 Jps
    # 20888 NameNode
    # 21869 SecondaryNameNode
    # Finally, check the live-node count in the web UI
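    # For Hadoop 2.6 the NameNode web UI listens on port 50070 by default
    # (assuming the stock dfs.namenode.http-address), e.g. http://localhost:50070/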
    

Hadoop pseudo-distributed troubleshooting

 17/09/22 14:53:21 WARN hdfs.DFSClient: DataStreamer Exception
  org.apache.hadoop.ipc.RemoteException(java.io.IOException): File /input/data.txt._COPYING_ could only be replicated to 0 nodes instead of minReplication (=1). There are 0 datanode(s) running and no node(s) are excluded in this operation.


  Solution:
  The error message suggests that no DataNodes are running, yet all processes had been started and were visible in jps.
  Running hadoop dfsadmin -report to check disk usage revealed the following:

  Configured Capacity: 0 (0 B)
  Present Capacity: 0 (0 B)
  DFS Remaining: 0 (0 B)
  DFS Used: 0 (0 B)
  DFS Used%: NaN%
  Under replicated blocks: 0
  Blocks with corrupt replicas: 0
  Missing blocks: 0
  -------------------------------------------------
  Datanodes available: 0 (0 total, 0 dead)

  The storage capacity of every node is zero, so the problem must lie here.
  Some research showed the likely cause: running hadoop namenode -format more than once leaves the namespaceIDs inconsistent. The fix:
  1. Stop the cluster (from the sbin directory):
  stop-all.sh
  2. Delete all data under the data directory configured for HDFS (i.e. the directory that hadoop.tmp.dir in core-site.xml points to):
  rm -rf /root/training/hadoop-2.7.3/tmp
  3. Reformat the NameNode (from the bin directory under the Hadoop home):
  hdfs namenode -format
  4. Restart the Hadoop cluster (from the sbin directory under the Hadoop home):
  start-all.sh
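
  Note that reformatting wipes everything in HDFS. If the data matters, the mismatch can instead be repaired by hand: the clusterID recorded in the NameNode and DataNode VERSION files must agree. A sketch for this setup, using the paths from hdfs-site.xml above:

  cat /home/software/hadoop_tmp/dfs/name/current/VERSION   # NameNode clusterID
  cat /home/software/hadoop_tmp/dfs/data/current/VERSION   # DataNode clusterID
  # If they differ, copy the NameNode's clusterID into the DataNode's VERSION
  # file and restart the DataNode instead of reformatting.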

ZooKeeper installation and configuration

  • Extract: tar -zxvf zookeeper-3.4.10.tar.gz -C app

  • Write the configs zoo1.cfg, zoo2.cfg, zoo3.cfg (three instances share one machine, so each needs its own dataDir and clientPort)

    cat app/zookeeper-3.4.10/conf/zoo1.cfg

      dataDir=/home/software/zookeeper/zoo1
      clientPort=2191
      server.1=127.0.0.1:8801:7701
      server.2=127.0.0.1:8802:7702
      server.3=127.0.0.1:8803:7703
    

    cat app/zookeeper-3.4.10/conf/zoo2.cfg

      dataDir=/home/software/zookeeper/zoo2
      clientPort=2192
      server.1=127.0.0.1:8801:7701
      server.2=127.0.0.1:8802:7702
      server.3=127.0.0.1:8803:7703
    

    cat app/zookeeper-3.4.10/conf/zoo3.cfg

      dataDir=/home/software/zookeeper/zoo3
      clientPort=2193
      server.1=127.0.0.1:8801:7701
      server.2=127.0.0.1:8802:7702
      server.3=127.0.0.1:8803:7703  
    
  • Configure the myid files (each must contain the N from the matching server.N line)

    echo "1" > zoo1/myid
    echo "2" > zoo2/myid
    echo "3" > zoo3/myid
    
  • Start the ZooKeeper servers

    zkServer.sh start app/zookeeper-3.4.10/conf/zoo1.cfg
    zkServer.sh start app/zookeeper-3.4.10/conf/zoo2.cfg
    zkServer.sh start app/zookeeper-3.4.10/conf/zoo3.cfg
    
  • Verify that they started successfully

    10626 QuorumPeerMain
    10568 QuorumPeerMain
    13450 Jps
    10732 QuorumPeerMain
    21366 DataNode
    20888 NameNode
    21869 SecondaryNameNode
    
    zkCli.sh -server 127.0.0.1:2192   # connect to a clientPort, not the quorum ports
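
    Each instance's role can also be checked with zkServer.sh status (same config paths as above); one of the three should report Mode: leader and the rest Mode: follower:

    zkServer.sh status app/zookeeper-3.4.10/conf/zoo1.cfg
    zkServer.sh status app/zookeeper-3.4.10/conf/zoo2.cfg
    zkServer.sh status app/zookeeper-3.4.10/conf/zoo3.cfg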
    

HBase configuration

  • Extract: tar -zxvf hbase-1.2.0-cdh5.7.0.tar.gz -C app

  • Configure

    [software]# cat app/hbase-1.2.0-cdh5.7.0/conf/hbase-env.sh | grep ^export
    export JAVA_HOME=/home/software/app/jdk1.8.0_202
    export HBASE_OPTS="-XX:+UseConcMarkSweepGC"
    export HBASE_MASTER_OPTS="$HBASE_MASTER_OPTS -XX:PermSize=128m -XX:MaxPermSize=128m"
    export HBASE_REGIONSERVER_OPTS="$HBASE_REGIONSERVER_OPTS -XX:PermSize=128m -XX:MaxPermSize=128m"
    export HBASE_MANAGES_ZK=false
    
     <!--app/hbase-1.2.0-cdh5.7.0/conf/hbase-site.xml -->
      <configuration>
      <property>
          <name>hbase.rootdir</name>
          <value>hdfs://localhost:9001/hbase</value>
        </property>
        <property>
          <name>hbase.cluster.distributed</name>
          <value>true</value>
        </property>
        <property>
          <name>hbase.zookeeper.quorum</name>               
          <value>127.0.0.1:2191,127.0.0.1:2192,127.0.0.1:2193</value>
        </property>
        <property>
          <name>hbase.tmp.dir</name>
          <value>/home/software/hbase/data</value>
        </property>
      </configuration>
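
  • Note that hbase.rootdir reuses the fs.defaultFS address configured for Hadoop above (hdfs://localhost:9001); the two must match or HBase cannot reach HDFS. Once HBase is up, the root directory can be verified on HDFS:

    hdfs dfs -ls /hbase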
    
  • Start it with start-hbase.sh

  • Verify

    21361 HMaster
    10626 QuorumPeerMain
    21366 DataNode
    20888 NameNode
    10568 QuorumPeerMain
    21512 HRegionServer
    10732 QuorumPeerMain
    21869 SecondaryNameNode
    20638 Jps
    
    # Check the web UI:
    http://47.107.44.224:60010/master-status
    

Kafka installation and configuration

Reference link

  • Extract: tar -zxvf kafka_2.12-0.10.2.0.tgz -C app

  • Environment variables

    export KAFKA_HOME=/home/software/app/kafka #_2.11-2.3.1
    export PATH=$KAFKA_HOME/bin:$PATH
    #export JAVA_OPTS="-server -Xms256m -Xmx512m  -XX:PermSize=128m -XX:MaxPermSize=256m"
    
  • Edit the configs. Note that listeners is the address the broker binds on the local interface (the internal IP here), while advertised.listeners is the address returned to clients (the public IP here); they differ because the host sits behind NAT on a cloud server.

    app/kafka/config/server1.properties

    broker.id=1
    log.dirs=/home/software/kafka/logs1
    zookeeper.connect=localhost:2191,localhost:2192,localhost:2193
    advertised.listeners=PLAINTEXT://47.107.44.224:9011
    listeners=PLAINTEXT://172.17.50.121:9011
    

    app/kafka/config/server2.properties

    broker.id=2
    log.dirs=/home/software/kafka/logs2
    zookeeper.connect=localhost:2191,localhost:2192,localhost:2193
    advertised.listeners=PLAINTEXT://47.107.44.224:9012
    listeners=PLAINTEXT://172.17.50.121:9012
    

    app/kafka/config/server3.properties

    broker.id=3
    log.dirs=/home/software/kafka/logs3
    zookeeper.connect=localhost:2191,localhost:2192,localhost:2193
    advertised.listeners=PLAINTEXT://47.107.44.224:9013
    listeners=PLAINTEXT://172.17.50.121:9013
    
  • Start

    kafka-server-start.sh kafka/config/server1.properties &
    kafka-server-start.sh kafka/config/server2.properties &
    kafka-server-start.sh kafka/config/server3.properties &
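
    Starting with a trailing & keeps the broker attached to the terminal; kafka-server-start.sh also accepts a -daemon flag to background it cleanly (same config files assumed):

    kafka-server-start.sh -daemon kafka/config/server1.properties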
    
  • Verify

    # Create a topic
    kafka-topics.sh --create --zookeeper localhost:2191,localhost:2192,localhost:2193 --replication-factor 1 --partitions 1 --topic testing
    # List topics
    kafka-topics.sh --list --zookeeper localhost:2192
    # Run a console producer
    kafka-console-producer.sh --broker-list localhost:9011,localhost:9012,localhost:9013 --topic testing
    # Run a console consumer
    kafka-console-consumer.sh --bootstrap-server localhost:9011 --topic testing --from-beginning
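
    Messages typed into the producer console should appear in the consumer. The topic's partition assignment can also be inspected (same ZooKeeper ports assumed):

    kafka-topics.sh --describe --zookeeper localhost:2192 --topic testing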
    
